JVM: analyzing a production memory blowup caused by PhantomReference under apache-commons-pool

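A while ago, colleagues in a neighboring department had a project for the Thailand site whose system kept shutting down for no apparent reason, roughly two days after each release or restart. I'm interested in this kind of problem and have handled a few before, so here is how I tracked it down: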

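The left side is before the fix and the right side is after it. Before getting into the analysis, a word about tooling: I used MAT (Memory Analyzer Tool), which is the tool I prefer. Import the heap dump snapshot: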
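How the dump itself was captured is not covered here. One possible way to produce an .hprof file that MAT can open, assuming a HotSpot-based JVM, is the HotSpotDiagnosticMXBean from the JDK. The class name HeapDumpSketch, the temp directory and the file name are made up for illustration; this is a sketch, not the procedure that was actually used on the Thailand site.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the project described in this post.
class HeapDumpSketch {
    // Writes a heap dump of the current JVM into a fresh temp directory
    // and returns the path of the resulting .hprof file.
    static Path dumpToTempDir() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            Path out = dir.resolve("snapshot.hprof"); // the target file must not exist yet
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // true = dump only live (reachable) objects
            bean.dumpHeap(out.toString(), true);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling HeapDumpSketch.dumpToTempDir() from inside the running service gives a file that can then be opened in MAT.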

Usually the Leak Suspects report view is enough. See the view below:

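The view above shows that com.mysql.jdbc.NonRegisteringDriver accounts for 85.98% of memory, and most of that is the ConcurrentHashMap it holds. Next I switch views. I generally use the Histogram and Dominator Tree views, paste the class name above into them, and look at what turns up: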

Both views make it clear that there are far too many ConnectionPhantomReference objects. The driver code shows where they come from:

public class NonRegisteringDriver implements java.sql.Driver {
	private static final String ALLOWED_QUOTES = "\"'";

	private static final String REPLICATION_URL_PREFIX = "jdbc:mysql:replication://";

	private static final String URL_PREFIX = "jdbc:mysql://";

	private static final String MXJ_URL_PREFIX = "jdbc:mysql:mxj://";

	public static final String LOADBALANCE_URL_PREFIX = "jdbc:mysql:loadbalance://";

	protected static final ConcurrentHashMap<ConnectionPhantomReference, ConnectionPhantomReference> connectionPhantomRefs = new ConcurrentHashMap<ConnectionPhantomReference, ConnectionPhantomReference>();

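	// Called for each new connection: registers a phantom reference to it in the static map above.
	protected static void trackConnection(Connection newConn) {

		ConnectionPhantomReference phantomRef = new ConnectionPhantomReference((ConnectionImpl) newConn, refQueue);
		connectionPhantomRefs.put(phantomRef, phantomRef);
	}

	// ... (rest of the class, including refQueue, omitted from this excerpt)
}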

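Anyone familiar with apache commons pool will recognize that a commons-pool connection pool is in use. This method stores one phantom reference in the map for every connection that gets created. The phantom reference is a safety net: if a connection is closed from outside without its resources being released, the resources it holds are released when GC runs: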

public void run() {
		threadRef = this;
		while (running) {
			try {
				Reference<? extends ConnectionImpl> ref = NonRegisteringDriver.refQueue.remove(100);
				if (ref != null) {
					try {
						((ConnectionPhantomReference) ref).cleanup();
					} finally {
						NonRegisteringDriver.connectionPhantomRefs.remove(ref);
					}
				}

			} catch (Exception ex) {
				// no where to really log this if we're static
			}
		}
	}
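To make the growth pattern concrete, here is a minimal sketch of the same bookkeeping outside the driver. FakeConnection, PhantomTrackingSketch and all method names are hypothetical stand-ins, not MySQL Connector/J classes. Real garbage collection is not deterministic, so simulateGc() calls Reference.enqueue() by hand to play the part of the collector.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a JDBC connection.
class FakeConnection {
    final int id;
    FakeConnection(int id) { this.id = id; }
}

class PhantomTrackingSketch {
    static final ReferenceQueue<FakeConnection> refQueue = new ReferenceQueue<>();
    static final ConcurrentHashMap<Reference<FakeConnection>, Boolean> tracked =
            new ConcurrentHashMap<>();

    // Same idea as trackConnection(): one map entry per connection ever created.
    static Reference<FakeConnection> track(FakeConnection conn) {
        PhantomReference<FakeConnection> ref = new PhantomReference<>(conn, refQueue);
        tracked.put(ref, Boolean.TRUE);
        return ref;
    }

    // Creates `count` connections and drops them right away, like a pool that
    // keeps destroying and recreating its idle connections.
    static List<Reference<FakeConnection>> churn(int count) {
        List<Reference<FakeConnection>> refs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            refs.add(track(new FakeConnection(i)));
        }
        return refs;
    }

    // Assumption: stands in for the collector, which would enqueue these
    // references itself once the connections are found unreachable.
    static void simulateGc(List<Reference<FakeConnection>> refs) {
        for (Reference<FakeConnection> ref : refs) {
            ref.enqueue();
        }
    }

    // Same idea as the cleanup loop: an entry leaves the map only after its
    // reference has shown up in the queue.
    static int drain() {
        int removed = 0;
        Reference<? extends FakeConnection> ref;
        while ((ref = refQueue.poll()) != null) {
            tracked.remove(ref);
            removed++;
        }
        return removed;
    }
}
```

After churn(1000) the map holds 1000 entries even though none of the connections is in use any more. The entries go away only once the references have been enqueued and drain() has run, which is why the map keeps growing between collections.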

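When a full GC happens, these references are put on refQueue and the resources held by the connections are eventually released. Releasing them is extremely slow, though, so it is not surprising that memory usage ended up taking the docker container down.

But a database connection pool exists precisely to pool resources and reuse them, so how could such an obvious mistake occur? It basically should not happen. Still, everything has a cause, so I kept digging for what actually produced this behavior. Searching online mostly turned up cursory write-ups that described the symptoms without explaining the root cause, so I had to work it out myself. My first guess was that connections sat unused for longer than waittime, got reclaimed, and were then recreated, over and over. That theory only holds if minpool is 0, but minpool and maxpool were fine: both were set to 5.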

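That left me stuck again, since nothing fit. The only option was to read the source and look at how apache-commons-pool reclaims connections. The project uses commons-pool 1.x rather than 2.x, which turns out to matter a lot. Here is the code: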

private class Evictor extends TimerTask {
        /**
         * Run pool maintenance.  Evict objects qualifying for eviction and then
         * invoke {@link GenericObjectPool#ensureMinIdle()}.
         */
        public void run() {
            try {
                evict();
            } catch(Exception e) {
                // ignored
            } catch(OutOfMemoryError oome) {
                // Log problem but give evictor thread a chance to continue in
                // case error is recoverable
                oome.printStackTrace(System.err);
            }
            try {
                ensureMinIdle();
            } catch(Exception e) {
                // ignored
            }
        }
    }

public void evict() throws Exception {
        assertOpen();
        synchronized (this) {
            if(_pool.isEmpty()) {
                return;
            }
            if (null == _evictionCursor) {
                _evictionCursor = (_pool.cursor(_lifo ? _pool.size() : 0));
            }
        }

        for (int i=0,m=getNumTests();i<m;i++) {
            final ObjectTimestampPair pair;
            synchronized (this) {
                if ((_lifo && !_evictionCursor.hasPrevious()) ||
                        !_lifo && !_evictionCursor.hasNext()) {
                    _evictionCursor.close();
                    _evictionCursor = _pool.cursor(_lifo ? _pool.size() : 0);
                }

                pair = _lifo ?
                        (ObjectTimestampPair) _evictionCursor.previous() :
                        (ObjectTimestampPair) _evictionCursor.next();

                _evictionCursor.remove();
                _numInternalProcessing++;
            }

            boolean removeObject = false;
            final long idleTimeMilis = System.currentTimeMillis() - pair.tstamp;
            if ((getMinEvictableIdleTimeMillis() > 0) &&
                    (idleTimeMilis > getMinEvictableIdleTimeMillis())) {
                removeObject = true;
            } else if ((getSoftMinEvictableIdleTimeMillis() > 0) &&
                    (idleTimeMilis > getSoftMinEvictableIdleTimeMillis()) &&
                    ((getNumIdle() + 1)> getMinIdle())) { // +1 accounts for object we are processing
                removeObject = true;
            }
            if(getTestWhileIdle() && !removeObject) {
                boolean active = false;
                try {
                    _factory.activateObject(pair.value);
                    active = true;
                } catch(Exception e) {
                    removeObject=true;
                }
                if(active) {
                    if(!_factory.validateObject(pair.value)) {
                        removeObject=true;
                    } else {
                        try {
                            _factory.passivateObject(pair.value);
                        } catch(Exception e) {
                            removeObject=true;
                        }
                    }
                }
            }

            if (removeObject) {
                try {
                    _factory.destroyObject(pair.value);
                } catch(Exception e) {
                    // ignored
                }
            }
            synchronized (this) {
                if(!removeObject) {
                    _evictionCursor.add(pair);
                    if (_lifo) {
                        // Skip over the element we just added back
                        _evictionCursor.previous();
                    }
                }
                _numInternalProcessing--;
            }
        }
    }

private void ensureMinIdle() throws Exception {
        // this method isn't synchronized so the
        // calculateDeficit is done at the beginning
        // as a loop limit and a second time inside the loop
        // to stop when another thread already returned the
        // needed objects
        int objectDeficit = calculateDeficit(false);
        for ( int j = 0 ; j < objectDeficit && calculateDeficit(true) > 0 ; j++ ) {
            try {
                addObject();
            } finally {
                synchronized (this) {
                    _numInternalProcessing--;
                    allocate();
                }
            }
        }
    }

The problem is in the code above. Careful readers may already have spotted it:

if ((getMinEvictableIdleTimeMillis() > 0) &&
                    (idleTimeMilis > getMinEvictableIdleTimeMillis())) {
                removeObject = true;
            }

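This code checks nothing else: as soon as it sees that an object has been idle too long, it sets removeObject to true and destroys the object, and then ensureMinIdle creates new connections. That is the problem. The pooling logic here is questionable (perhaps the developers had a different design philosophy): it should first check whether the idle count exceeds minpool and only then look at the idle time. The fix was to raise the idle-time threshold for eviction, with the right value depending on the workload. As I read it, 2.x does not have this problem, so I won't paste its code. The result is the chart at the top of this post.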
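The churn can be estimated with a small simulation. It is a rough sketch, not the commons-pool implementation: it uses a fake clock, checks every idle object on each pass (the real evictor only tests a few per run), and assumes nobody borrows connections in between. EvictionChurnSketch and its parameters are hypothetical names. The honorMinIdle flag imitates the soft check in the quoted evict(), which only removes an object while the idle count is above minIdle.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

class EvictionChurnSketch {
    // Returns how many objects get created over `runs` evictor passes.
    static int simulate(int minIdle, long minEvictableIdleMillis,
                        long evictorPeriodMillis, int runs, boolean honorMinIdle) {
        Deque<Long> idleSince = new ArrayDeque<>(); // timestamp at which each idle object was created
        long now = 0;
        int created = 0;
        for (int i = 0; i < minIdle; i++) {   // initial fill of the pool
            idleSince.add(now);
            created++;
        }
        for (int run = 0; run < runs; run++) {
            now += evictorPeriodMillis;
            int idle = idleSince.size();
            Iterator<Long> it = idleSince.iterator();
            while (it.hasNext()) {            // evict() step
                long idleTime = now - it.next();
                boolean tooOld = idleTime > minEvictableIdleMillis;
                boolean aboveMin = idle > minIdle;
                if (tooOld && (!honorMinIdle || aboveMin)) {
                    it.remove();              // destroyObject()
                    idle--;
                }
            }
            while (idleSince.size() < minIdle) { // ensureMinIdle() step
                idleSince.add(now);
                created++;
            }
        }
        return created;
    }
}
```

With made-up settings of minIdle 5, a 30 minute idle limit and one evictor pass per minute, simulate(5, 1_800_000L, 60_000L, 1440, false) covers one day and returns 235 created connections, while the same call with honorMinIdle set to true returns 5. Every extra connection in the first case leaves another phantom reference in the driver's map until a collection clears it.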

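After the change the curve is fairly flat. There was one more contributing factor: we have a redis layer in front that absorbs most of the reads, and the write volume into mysql is small as well, so pooled connections sit idle long enough to trip the evictor. The domestic site hit the same thing long ago, but that team now runs 500 machines for the domestic site with high qps, so it is basically no longer an issue there. That is how I handled it. The real process was of course not this smooth, and I have left out some of the steps. That's all.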
