Our HDFS production environment runs Hadoop-0.21 on a cluster of 200 machines with roughly 70 million blocks. Every few months of operation, the NameNode starts doing frequent full GCs, and in the end we have no choice but to restart it. Suspecting a memory leak in the NameNode, we dumped object histograms of the NameNode process before and after a restart.
Before the restart on 07-10:
After the restart on 07-10:
The dumps show that before the restart, the objects occupying the most memory were [Ljava.lang.Object;, [C, org.apache.hadoop.hdfs.server.namenode.INodeFile, org.apache.hadoop.hdfs.server.namenode.BlockInfo, [B, and org.apache.hadoop.hdfs.server.namenode.BlockInfoUnderConstruction$ReplicaUnderConstruction. Their reference relationships are as follows:
According to the NameNode's internal logic, INodeFileUnderConstruction and BlockInfoUnderConstruction are both intermediate states: once a file's write stream is closed, the INodeFileUnderConstruction becomes an INodeFile and each BlockInfoUnderConstruction becomes a BlockInfo. Since the cluster's write load cannot possibly be on the order of a million files per second, the huge number of under-construction objects in the dump suggests a memory leak in the NameNode.
When a file is closed, the client calls the NameNode's complete method. At that point the BlocksMap entry should change from BlockInfoUnderConstruction -> BlockInfoUnderConstruction to BlockInfo -> BlockInfo (for brevity, we will say oldBlock->oldBlock is replaced by newBlock->newBlock). BlocksMap handles this state transition as follows:
BlockInfo replaceBlock(BlockInfo newBlock) {
  BlockInfo currentBlock = map.get(newBlock);
  assert currentBlock != null : "the block if not in blocksMap";
  // replace block in data-node lists
  for (int idx = currentBlock.numNodes() - 1; idx >= 0; idx--) {
    DatanodeDescriptor dn = currentBlock.getDatanode(idx);
    Log.info("Replace Block[" + newBlock + "] to Block[" + currentBlock + "] in DataNode[" + dn + "]");
    dn.replaceBlock(currentBlock, newBlock);
  }
  // replace block in the map itself
  map.put(newBlock, newBlock);
  return newBlock;
}
Block overrides hashCode and equals, so newBlock and oldBlock have the same hashCode and newBlock.equals(oldBlock) is true.
The intent of the code above is to replace the map entry (oldBlock, oldBlock) with (newBlock, newBlock). However, when HashMap.put finds an existing key matching the new one (matching here means newKey.hashCode() == oldKey.hashCode() && (oldKey == newKey || newKey.equals(oldKey))), it replaces only the value. So oldBlock->oldBlock becomes oldBlock->newBlock: the map still references oldBlock as the key, oldBlock is never released, and that is the memory leak.
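The retained-key behavior can be reproduced outside HDFS with a minimal sketch. The Key class below is hypothetical, standing in for Block's equals/hashCode contract (two distinct instances with the same id compare equal and hash equally):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyLeakDemo {
    // Hypothetical stand-in for Block: equality is based on id only.
    static final class Key {
        final long id;
        Key(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Returns true when the map still holds the ORIGINAL key object
    // after put() is called with an equal-but-distinct key.
    static boolean oldKeyRetained() {
        Map<Key, Key> map = new HashMap<>();
        Key oldKey = new Key(42);
        map.put(oldKey, oldKey);          // entry: oldKey -> oldKey

        Key newKey = new Key(42);         // equals(oldKey), same hashCode
        map.put(newKey, newKey);          // replaces only the VALUE

        Key storedKey = map.keySet().iterator().next();
        return storedKey == oldKey;       // identity check on the stored key
    }

    public static void main(String[] args) {
        System.out.println("old key retained: " + oldKeyRetained());
    }
}
```

Running main prints `old key retained: true`: the entry is now oldKey -> newKey, exactly the pattern that pins oldBlock in memory.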
See HashMap's implementation of put:
/**
 * Associates the specified value with the specified key in this map.
 * If the map previously contained a mapping for the key, the old
 * value is replaced.
 *
 * @param key key with which the specified value is to be associated
 * @param value value to be associated with the specified key
 * @return the previous value associated with <tt>key</tt>, or
 *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
 *         (A <tt>null</tt> return can also indicate that the map
 *         previously associated <tt>null</tt> with <tt>key</tt>.)
 */
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
We suggest fixing BlocksMap as follows. A patch has already been submitted, see: https://issues.apache.org/jira/browse/HDFS-7592
BlockInfo replaceBlock(BlockInfo newBlock) {
+ /**
+  * change to fix a NameNode memory leak, by huahua.xu
+  * 2013-08-17 15:20
+  */
  BlockInfo currentBlock = map.get(newBlock);
  assert currentBlock != null : "the block if not in blocksMap";
  // replace block in data-node lists
  for (int idx = currentBlock.numNodes() - 1; idx >= 0; idx--) {
    DatanodeDescriptor dn = currentBlock.getDatanode(idx);
    dn.replaceBlock(currentBlock, newBlock);
  }
  // replace block in the map itself:
  // remove the stale entry first so the old key object is released,
  // then insert newBlock as both key and value
+ map.remove(newBlock);
  map.put(newBlock, newBlock);
  return newBlock;
}
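The remove-then-put sequence can be exercised in isolation as a sanity check. This standalone sketch reuses the same hypothetical Key class mimicking Block's equality semantics, and confirms that after the fix the map holds the new key object rather than the old one:

```java
import java.util.HashMap;
import java.util.Map;

public class RemoveThenPutDemo {
    // Hypothetical stand-in for Block: equality is based on id only.
    static final class Key {
        final long id;
        Key(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Returns true when the map holds the NEW key object after the
    // remove-then-put sequence used in the proposed fix.
    static boolean newKeyStored() {
        Map<Key, Key> map = new HashMap<>();
        Key oldKey = new Key(42);
        map.put(oldKey, oldKey);          // entry: oldKey -> oldKey

        Key newKey = new Key(42);         // equals(oldKey), same hashCode
        map.remove(newKey);               // evicts the entry keyed by oldKey
        map.put(newKey, newKey);          // fresh entry: newKey as key AND value

        Key storedKey = map.keySet().iterator().next();
        return storedKey == newKey;       // oldKey is no longer referenced
    }

    public static void main(String[] args) {
        System.out.println("new key stored: " + newKeyStored());
    }
}
```

Running main prints `new key stored: true`: with no remaining reference from the map, oldKey (oldBlock in the real code) becomes garbage-collectable.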
As of now, the patch has been rolled out to our production cluster and the memory leak is resolved.
--------------------------------------------------------------------
This was a fairly serious bug we hit in our company's production environment. It has been submitted to the community (https://issues.apache.org/jira/browse/HDFS-7592), and we are sharing it here.