 Node 39 158 211\nTotal threads 366 341 282\nRUNNABLE 264 221 162\nWAITING 64 88 92\nTIMED_WAITING 28 32 28\nBLOCKED 10 0 0"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, break the counts down by thread pool:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"embedcomp","attrs":{"type":"table","data":{"content":" Node 39 158 211\nLucene Merge Thread 77 0 0\nhttp_server_worker 64 64 64\nsearch 49 49 49\ntransport_client_boss 28 64 30\nbulk 32 32 32\ngeneric 15 6 4\ntransport_server_worker 27 55 29\nrefresh 10 5 10\nmanagement 5 2 3\nwarmer 5 5 5\nflush 5 5 5\nothers 49 54 51"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One finding stands out: "},{"type":"text","marks":[{"type":"strong"}],"text":"node 39 has far more Lucene Merge Threads than the others, while the other two nodes have no Merge threads at all"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A closer look at node 39's Thread Dump files revealed the following anomalies:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"There are 77 Lucene Merge Threads. The call stack of one of them is shown below:"}]}]}]},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/11\/11\/11b930042d214dcebbe91ff17568b111.jpg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":2,"normalizeStart":2},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"8 threads are contending for the lock on ExpiringCache:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/62\/52\/624dd3580031d1a8935737b6263b6b52.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":3,"normalizeStart":3},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"8 threads are all computing HashMap#hash:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/e3\/22\/e398674395e94f381f68223584a7e722.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Observation 1 shows 77 threads merging at once, but it is impossible to tell whether these Merge tasks were triggered simultaneously or piled up gradually because the system was processing too slowly. "},{"type":"text","marks":[{"type":"strong"}],"text":"Either way, this looks like an important lead"},{"type":"text","text":". Since this is a newly launched application, investigating the environment and the usage pattern matters just as much:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#40A9FF","name":"blue"}}],"text":"The cluster has 3 nodes and currently holds 500+ indices. Each node has about 70 write-active shards."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#40A9FF","name":"blue"}}],"text":"Indices are created per tenant, with 3 new indices per tenant per day. This early after launch, write throughput is low. The Segment that each index flushes every minute ranges from a few KB to a few MB."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I began to suspect this unusual usage pattern: the cluster has many write-active indices, but each one writes very little per minute, from a few KB to a few MB. This means Flush is probably triggered periodically rather than by exceeding a preset threshold. Writing this way produces a large number of small files. I sampled the newly generated Segment files of several indices, and every one of them was indeed very small."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For observation 2, I read the source code of java.io.UnixFileSystem carefully:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#40A9FF","name":"blue"}}],"text":"UnixFileSystem has to canonicalize the path of a new file according to the operating system's rules."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#40A9FF","name":"blue"}}],"text":"The canonicalized result is stored in an ExpiringCache object."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multiple threads are racing to call ExpiringCache#put, which indirectly reflects a rapidly changing file list and indicates that Flush and Merge operations are occurring at high frequency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This deepened my suspicion about the usage pattern: \"drizzle-style\" writes trigger Flush passively, and if the intervals are identical the flushes happen at the same time, which raises the probability that multiple shards merge simultaneously."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So I started simulating this usage pattern in a test environment, creating a similar number of shards and controlling the write rate. The plan was to run the test program for at least a day and see whether the problem could be reproduced. While it ran, I kept investigating the Thread Dump logs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"In observation 3, the threads are merely computing a hash, yet they appear to be extremely slow. Taken together, the three observations show that every call site is doing CPU computation, so in theory the CPU should be very busy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the problem recurred on site, the customer helped collect CPU utilization and load figures, which showed that the CPU was nearly idle. A colleague had earlier checked IO, which was also nearly idle. This ruled out system resources as the cause. We also noticed that "},{"type":"text","marks":[{"type":"strong"}],"text":"the affected node is random each time and is not tied to a particular machine"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After a day, the problem had not reproduced in the local test environment. The usage pattern is inelegant, but it does not appear to be the crux of the problem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Strange STW Pauses"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Capturing a Thread Dump with the jstack command requires the JVM process to enter a Safepoint, which effectively suspends the whole process first. The resulting Thread Dump is therefore a snapshot of each thread at the moment the process was suspended."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All the busy threads happen to be doing CPU computation, yet the CPU is not busy. "},{"type":"text","marks":[{"type":"strong"}],"text":"This suggests that the GC logs need further investigation"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GC logging was not enabled on the production application. Since the problem had not yet recurred, checking GC counts and cumulative GC time with jstat would not tell us much, so we asked the on-site staff to add the following options to jvm.options manually to enable GC logging:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"8:-XX:+PrintGCDetails\n\n8:-XX:+PrintGCDateStamps\n\n8:-XX:+PrintTenuringDistribution\n\n8:-XX:+PrintGCApplicationStoppedTime\n\n8:-Xloggc:logs\/gc.log\n\n8:-XX:+UseGCLogFileRotation\n\n8:-XX:NumberOfGCLogFiles=32\n\n8:-XX:GCLogFileSize=32m"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The option "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#40A9FF","name":"blue"}}],"text":"PrintGCApplicationStoppedTime"},{"type":"text","text":" is added so that every STW (Stop-The-World) pause of the JVM process is recorded in the GC log. Such STW pauses are usually caused by GC, but they can also be related to biased locking."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Right after the restart, the on-site staff sent over the tail of the GC log to confirm that the configuration had taken effect. Strangely, the freshly restarted process was already printing STW log lines nonstop:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/64\/c4\/649216b0bcdb6488c4d5219154c0afc4.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A brief explanation of the STW log line (“…Total time for which application threads were stopped…”) is in order:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The JVM sometimes needs to perform global operations, typically GC or biased-lock revocation, during which it must pause all running threads. This relies on the JVM's Safepoint mechanism; a Safepoint is like a red light on a main road. "},{"type":"text","text":"Every time the JVM enters an STW (Stop-The-World) phase, it prints a line like this:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"2020-09-10T13:59:43.210+0800: 73032.559: Total time for which application threads were stopped: 0.0002853 seconds, Stopping threads took: 0.0000217 seconds"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This line says the STW phase lasted 0.0002853 seconds, of which bringing all threads to a stop (Stopping threads) took 0.0000217 seconds; the former includes the latter. Stopping threads normally accounts for a tiny share of the total. If it takes too long, the cause may lie in code implementation details, which we will not go into here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Back to the problem: a flood of STW log lines right from startup naturally suggests biased-lock revocation. When the problem recurred, we obtained the complete GC logs of all 3 nodes and found that both YGC and FGC were triggered very infrequently, which ruled out GC as the cause."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The sheer volume of STW log lines"},{"type":"text","text":" made me realize that this was highly abnormal. A colleague objected that each pause is very short, so I wrote a simple tool to count the STW pauses and sum the total pause time for each minute:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/64\/4e\/645d418567cde91d80bed4142c02ea4e.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The statistics show:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Under normal conditions, pauses add up to about 5 seconds per minute."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Between 11:29 and 11:30 the pause frequency rose sharply, which is exactly when the symptoms began to appear. "},{"type":"text","text":"Total pause time per minute even reached 20 to 30 seconds."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It is like a 1-kilometer stretch of road that normally has no traffic lights at all suddenly gaining dozens of them, which is maddening. "},{"type":"text","marks":[{"type":"strong"}],"text":"These pauses neatly explain why “all the threads are doing CPU computation, yet the CPU is idle”"},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Investigating the Safepoint"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are several types of Safepoint. To determine the specific type, we again asked the on-site colleagues to add the following options to jvm.options to turn on JVM logging:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"-XX:+PrintSafepointStatistics\n\n-XX:PrintSafepointStatisticsCount=10\n\n-XX:+UnlockDiagnosticVMOptions\n\n-XX:-DisplayVMOutput\n\n-XX:+LogVMOutput\n\n-XX:LogFile= |
Stop Downloading JDKs at Random: The First Major Pitfall for Elasticsearch on Domestic ARM Platforms
{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The Problem"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I recently ran into the following problem at work: a customer had just launched a new Elasticsearch application, but after running for a while it became extremely slow, to the point that queries timed out. A restart restored service, but the problem came back every 3 to 4 hours."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"My colleagues had already done a quick analysis and found a large number of Merge threads. What next? In my experience, analyzing Thread Dump logs usually points toward the root cause, after which one can work out why the problem keeps recurring. Following that approach, the analysis should not have been complicated. But the story took a few turns."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Puzzling Stack Traces"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because the network was isolated, we had to rely on the customer to collect the Thread Dump logs. We stressed how to collect them: take one dump on each node every few seconds and write each to a separate file. The cluster has three nodes, which we will call 39, 158, and 211. After the problem recurred, we received the first batch of Thread Dump files:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/6c\/12\/6c38b9fd6d2b52c3718890b4640cde12.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The file sizes alone suggest that node 39 is very likely the problem node, since its Thread Dump logs are noticeably larger. Slow or hung queries usually show up as a large number of busy worker threads, that is, a marked increase in active threads. And under the default query behavior of ES (Elasticsearch, hereafter ES), a single troubled node is enough to drag down the whole query."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So we first pick one Thread Dump file from each of the three nodes and compare the overall thread counts:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"embedcomp","attrs":{"type":"table","data":{"content":"