Our HDFS production environment runs Hadoop-0.21 on a cluster of 200 machines with roughly 70 million blocks. Every few months of operation, the NameNode starts doing frequent full GCs, and in the end we have no choice but to restart it. Suspecting a memory leak in the NameNode, we dumped object histograms of the NameNode process before and after a restart.
Before the restart on 07-10:
After the restart on 07-10:
The dumps show that before the restart, the objects occupying the most memory were [Ljava.lang.Object;, [C, org.apache.hadoop.hdfs.server.namenode.INodeFile, org.apache.hadoop.hdfs.server.namenode.BlockInfo, [B, and org.apache.hadoop.hdfs.server.namenode.BlockInfoUnderConstruction$ReplicaUnderConstruction. Their reference relationships are as follows:
According to the NameNode's internal logic, INodeFileUnderConstruction and BlockInfoUnderConstruction are both intermediate states: once a file's write stream is closed, the INodeFileUnderConstruction becomes an INodeFile and each BlockInfoUnderConstruction becomes a BlockInfo. Since the cluster's write load cannot possibly be on the order of a million files per second, the huge number of under-construction objects in the dump suggests a memory leak in the NameNode.
When a file is closed, the client calls the NameNode's complete method. At that point the BlocksMap entry should change from BlockInfoUnderConstruction -> BlockInfoUnderConstruction to BlockInfo -> BlockInfo (for brevity, we will say oldBlock->oldBlock is replaced by newBlock->newBlock). BlocksMap handles this state transition as follows:
BlockInfo replaceBlock(BlockInfo newBlock) {
  BlockInfo currentBlock = map.get(newBlock);
  assert currentBlock != null : "the block if not in blocksMap";
  // replace block in data-node lists
  for (int idx = currentBlock.numNodes() - 1; idx >= 0; idx--) {
    DatanodeDescriptor dn = currentBlock.getDatanode(idx);
    Log.info("Replace Block[" + newBlock + "] to Block[" + currentBlock + "] in DataNode[" + dn + "]");
    dn.replaceBlock(currentBlock, newBlock);
  }
  // replace block in the map itself
  map.put(newBlock, newBlock);
  return newBlock;
}
Block overrides hashCode and equals, so newBlock and oldBlock have the same hashCode and newBlock.equals(oldBlock) is true.
The intent of the code above is to replace the map entry (oldBlock, oldBlock) with (newBlock, newBlock). However, when HashMap.put finds an existing key matching the new one (matching here means newKey.hashCode() == oldKey.hashCode() && (oldKey == newKey || newKey.equals(oldKey))), it replaces only the value. So oldBlock->oldBlock becomes oldBlock->newBlock: the map still references oldBlock as the key, oldBlock is never released, and that is the memory leak.
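The retained-key behavior can be reproduced outside HDFS with a minimal sketch. The Key class below is hypothetical, standing in for Block's equals/hashCode contract (two distinct instances with the same id compare equal and hash equally):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyLeakDemo {
    // Hypothetical stand-in for Block: equality is based on id only.
    static final class Key {
        final long id;
        Key(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Returns true when the map still holds the ORIGINAL key object
    // after put() is called with an equal-but-distinct key.
    static boolean oldKeyRetained() {
        Map<Key, Key> map = new HashMap<>();
        Key oldKey = new Key(42);
        map.put(oldKey, oldKey);          // entry: oldKey -> oldKey

        Key newKey = new Key(42);         // equals(oldKey), same hashCode
        map.put(newKey, newKey);          // replaces only the VALUE

        Key storedKey = map.keySet().iterator().next();
        return storedKey == oldKey;       // identity check on the stored key
    }

    public static void main(String[] args) {
        System.out.println("old key retained: " + oldKeyRetained());
    }
}
```

Running main prints `old key retained: true`: the entry is now oldKey -> newKey, exactly the pattern that pins oldBlock in memory.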
See HashMap's implementation of put:
/**
 * Associates the specified value with the specified key in this map.
 * If the map previously contained a mapping for the key, the old
 * value is replaced.
 *
 * @param key key with which the specified value is to be associated
 * @param value value to be associated with the specified key
 * @return the previous value associated with <tt>key</tt>, or
 *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
 *         (A <tt>null</tt> return can also indicate that the map
 *         previously associated <tt>null</tt> with <tt>key</tt>.)
 */
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
We suggest fixing BlocksMap as follows. A patch has already been submitted, see: https://issues.apache.org/jira/browse/HDFS-7592
BlockInfo replaceBlock(BlockInfo newBlock) {
+ /**
+  * change to fix a NameNode memory leak, by huahua.xu
+  * 2013-08-17 15:20
+  */
  BlockInfo currentBlock = map.get(newBlock);
  assert currentBlock != null : "the block if not in blocksMap";
  // replace block in data-node lists
  for (int idx = currentBlock.numNodes() - 1; idx >= 0; idx--) {
    DatanodeDescriptor dn = currentBlock.getDatanode(idx);
    dn.replaceBlock(currentBlock, newBlock);
  }
  // replace block in the map itself:
  // remove the stale entry first so the old key object is released,
  // then insert newBlock as both key and value
+ map.remove(newBlock);
  map.put(newBlock, newBlock);
  return newBlock;
}
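The remove-then-put sequence can be exercised in isolation as a sanity check. This standalone sketch reuses the same hypothetical Key class mimicking Block's equality semantics, and confirms that after the fix the map holds the new key object rather than the old one:

```java
import java.util.HashMap;
import java.util.Map;

public class RemoveThenPutDemo {
    // Hypothetical stand-in for Block: equality is based on id only.
    static final class Key {
        final long id;
        Key(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Returns true when the map holds the NEW key object after the
    // remove-then-put sequence used in the proposed fix.
    static boolean newKeyStored() {
        Map<Key, Key> map = new HashMap<>();
        Key oldKey = new Key(42);
        map.put(oldKey, oldKey);          // entry: oldKey -> oldKey

        Key newKey = new Key(42);         // equals(oldKey), same hashCode
        map.remove(newKey);               // evicts the entry keyed by oldKey
        map.put(newKey, newKey);          // fresh entry: newKey as key AND value

        Key storedKey = map.keySet().iterator().next();
        return storedKey == newKey;       // oldKey is no longer referenced
    }

    public static void main(String[] args) {
        System.out.println("new key stored: " + newKeyStored());
    }
}
```

Running main prints `new key stored: true`: with no remaining reference from the map, oldKey (oldBlock in the real code) becomes garbage-collectable.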
As of now, the patch has been rolled out to our production cluster and the memory leak is resolved.
--------------------------------------------------------------------
This was a fairly serious bug we hit in our company's production environment. It has been submitted to the community (https://issues.apache.org/jira/browse/HDFS-7592), and we are sharing it here.