Preface
In the previous article (An analysis of an HDFS JournalNode transaction lag issue), I walked in detail through the diagnosis of an HDFS JN lag problem. At the end of that article I noted that the root cause was a sudden surge in traffic from the ResourceManager (RM) deployed on the same node as the JN. At the time, however, I had only identified the RM traffic surge as the likely factor; I had not yet dug into where that additional traffic actually came from. In the time since, we analyzed the composition of this RM traffic and the reasons behind it, and finally pinned down its true source. In this article I will briefly describe the follow-up to the JN lag issue: why the RM traffic surged.
Background
First, here is a trend chart of the RM traffic surge (also shown in the earlier JN lag article).
Our hypothesis at the time was that workload changes on the part of some users had caused this shift in traffic. To confirm this guess, we went on to analyze the RM traffic in more detail.
Analysis with iftop
Among Linux command-line tools, iftop can show near-real-time bandwidth usage on a node, broken down by connection. We therefore ran iftop on the RM machine, with the following results:
Command: sudo iftop -B -P
rm-node:8031 => nm-node1:42810 47.2KB 94.4KB 94.4KB
<= 996B 2.15KB 2.15KB
rm-node:8031 => nm-node2:37624 47.2KB 82.6KB 82.6KB
<= 1.33KB 1.95KB 1.95KB
rm-node:8031 => nm-node3:46594 141KB 82.5KB 82.5KB
<= 2.96KB 1.97KB 1.97KB
rm-node:8031 => nm-node4:59782 47.2KB 82.6KB 82.6KB
<= 1.12KB 1.90KB 1.90KB
rm-node:8031 => nm-node5:41700 47.2KB 82.6KB 82.6KB
<= 1.18KB 1.79KB 1.79KB
rm-node:8031 => nm-node6:50672 47.2KB 82.6KB 82.6KB
<= 966B 1.79KB 1.79KB
rm-node:8031 => nm-node7:55306 94.4KB 82.6KB 82.6KB
<= 1.88KB 1.71KB 1.71KB
rm-node:8031 => nm-node8:51870 47.2KB 82.6KB 82.6KB
<= 1.71KB 1.68KB 1.68KB
rm-node:8031 => nm-node9:49730 47.3KB 82.6KB 82.6KB
<= 907B 1.62KB 1.62KB
The data above is a snapshot of RM traffic at one sampled moment. It gives us several key pieces of information:
- RM traffic is concentrated on data transfer over port 8031.
- In the communication on RM port 8031, the volume sent from the RM to the NMs is far larger than the volume sent from the NMs to the RM.
Digging further into what RM port 8031 is for: it is the port over which the RM and the NodeManagers (NMs) exchange heartbeats. Mapped back onto the iftop output above, this means each NM sends the RM a small node heartbeat request, while the heartbeat response the RM returns to the NM is many times larger.
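To put a rough number on that asymmetry, take one sampled row from the iftop snapshot above (nm-node2): the 10-second averages are 82.6 KB/s outbound from the RM versus 1.95 KB/s inbound. A trivial check (the values are copied from the snapshot; the host names are placeholders for real machines):

```java
// Quantifies the send/receive asymmetry seen in the iftop snapshot.
public class TrafficRatio {
    // out and in are the 10-second average rates in KB/s for one peer
    static double ratio(double outKB, double inKB) {
        return outKB / inKB;
    }

    public static void main(String[] args) {
        // 10s averages for rm-node:8031 <-> nm-node2 from the snapshot
        System.out.printf("RM->NM is %.0fx larger than NM->RM%n",
                ratio(82.6, 1.95));
    }
}
```

So the heartbeat response direction carries roughly forty times more data than the request direction, which is what makes port 8031 dominate the RM's outbound bandwidth.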
Concurrent verification with tcpdump
At the same time, we picked one NM and used tcpdump to capture live traffic on that machine, running two captures with src and dst respectively set to the RM host.
The results were as follows:
NM->RM
Command: tcpdump dst rm-node
20:16:48.028768 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13088574:13092670, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327940], length 4096
20:16:48.028778 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13092670:13096766, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028791 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13096766:13100862, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028800 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13100862:13104958, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028809 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13104958:13109054, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028822 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13109054:13113150, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028832 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13113150:13117246, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028851 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13117246:13121342, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028862 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13121342:13125438, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028875 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13125438:13129534, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028886 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13129534:13130257, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 723
20:16:48.037348 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4234513, win 2889, options [nop,nop,TS val 561229068 ecr 498327949], length 0
20:16:48.037407 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4238857, win 2877, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037453 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4241257, win 2885, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037461 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4242705, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037499 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4247049, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037627 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4248497, win 2889, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037679 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4252841, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037729 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4258633, win 2877, options [nop,nop,TS val 561229069 ecr 498327949], length 0
RM->NM
Command: tcpdump src rm-node
20:49:59.617672 IP rm-node.8030 > nm-node.58940: Flags [P.], seq 2156:2854, ack 1746598, win 5955, options [nop,nop,TS val 500319529 ecr 2277606464], length 698
20:49:59.672596 IP rm-node.8031 > nm-node.58058: Flags [.], ack 3590, win 901, options [nop,nop,TS val 500319583 ecr 2277606545], length 0
20:49:59.674288 IP rm-node.8031 > nm-node.58058: Flags [.], seq 288621:290069, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 1448
20:49:59.674336 IP rm-node.8031 > nm-node.58058: Flags [P.], seq 290069:296813, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 6744
20:49:59.674337 IP rm-node.8031 > nm-node.58058: Flags [.], seq 296813:301157, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 4344
20:49:59.674380 IP rm-node.8031 > nm-node.58058: Flags [.], seq 301157:302605, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 1448
20:49:59.674588 IP rm-node.8031 > nm-node.58058: Flags [.], seq 302605:304053, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 1448
20:49:59.674641 IP rm-node.8031 > nm-node.58058: Flags [.], seq 304053:319981, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 15928
20:49:59.674689 IP rm-node.8031 > nm-node.58058: Flags [.], seq 319981:331565, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 11584
20:49:59.674822 IP rm-node.8031 > nm-node.58058: Flags [.], seq 331565:333013, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 1448
20:49:59.674881 IP rm-node.8031 > nm-node.58058: Flags [.], seq 333013:366317, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 33304
20:49:59.674923 IP rm-node.8031 > nm-node.58058: Flags [.], seq 366317:376453, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 10136
20:49:59.674974 IP rm-node.8031 > nm-node.58058: Flags [P.], seq 376453:384828, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 8375
20:50:00.015461 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1750751, win 5955, options [nop,nop,TS val 500319926 ecr 2277606868], length 0
20:50:00.015503 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1754847, win 5955, options [nop,nop,TS val 500319926 ecr 2277606888], length 0
20:50:00.015505 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1767135, win 5955, options [nop,nop,TS val 500319926 ecr 2277606888], length 0
20:50:00.015550 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1822079, win 5806, options [nop,nop,TS val 500319927 ecr 2277606888], length 0
From the output above we can see:
- The packets sent from the NM to the RM have fairly uniform, and small, lengths.
- The packets returned from the RM to the NM have varying, and much larger, lengths.
This matches the iftop analysis in the previous section.
A closer look at the RM NodeHeartbeat response
The command-line analysis above narrowed the source of the RM traffic down to the NodeHeartbeat response object, but we still did not know what inside it was making the response so large. Neither iftop nor tcpdump can capture that level of detail, so at this point the only option was to dig into the code.
Here is the interface definition of the heartbeat response:
public interface NodeHeartbeatResponse {
  int getResponseId();
  NodeAction getNodeAction();

  List<ContainerId> getContainersToCleanup();
  List<ContainerId> getContainersToBeRemovedFromNM();
  List<ApplicationId> getApplicationsToCleanup();

  void setResponseId(int responseId);
  void setNodeAction(NodeAction action);

  MasterKey getContainerTokenMasterKey();
  void setContainerTokenMasterKey(MasterKey secretKey);
  MasterKey getNMTokenMasterKey();
  void setNMTokenMasterKey(MasterKey secretKey);

  void addAllContainersToCleanup(List<ContainerId> containers);

  // This tells NM to remove finished containers from its context. Currently, NM
  // will remove finished containers from its context only after AM has actually
  // received the finished containers in a previous allocate response
  void addContainersToBeRemovedFromNM(List<ContainerId> containers);

  void addAllApplicationsToCleanup(List<ApplicationId> applications);

  long getNextHeartBeatInterval();
  void setNextHeartBeatInterval(long nextHeartBeatInterval);

  String getDiagnosticsMessage();
  void setDiagnosticsMessage(String diagnosticsMessage);

  // Credentials (i.e. hdfs tokens) needed by NodeManagers for application
  // localizations and logAggreations.
  Map<ApplicationId, ByteBuffer> getSystemCredentialsForApps();
  void setSystemCredentialsForApps(
      Map<ApplicationId, ByteBuffer> systemCredentials);    <====

  boolean getAreNodeLabelsAcceptedByRM();
  void setAreNodeLabelsAcceptedByRM(boolean areNodeLabelsAcceptedByRM);
}
A quick read through NodeHeartbeatResponse shows that one field, SystemCredentialsForApps, carries a more complex map structure than simple fields such as ContainerId and ApplicationId. According to the comment above, these credentials (i.e. HDFS tokens) are returned by the RM to the NMs for application localization and log aggregation. This implies a hidden proportional relationship: as the number of applications running on an NM grows, the volume of credentials it receives from the RM grows with it.
This seems to explain how the business-side workload changes mentioned earlier drove the RM traffic up. The causal chain is as follows:
1) A team adjusted its workload and submitted many new applications to the RM.
2) Since the localization and log-aggregation operations of these new applications need their corresponding tokens, the RM returned more credentials information to the NMs in its NodeHeartbeat responses.
3) The growing credentials made the NodeHeartbeat responses larger, which in turn drove up the traffic on the RM machine.
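The scaling in step 3 can be sketched with a back-of-envelope estimate. All numbers here are illustrative assumptions, not measured values: we assume each running application contributes one serialized token blob of roughly 1 KB to every heartbeat response, on top of a small fixed overhead.

```java
// Back-of-envelope sketch: response size grows linearly with the number of
// applications whose credentials the RM ships in each heartbeat response.
public class HeartbeatSizeEstimate {
    static long estimateResponseBytes(int runningApps,
                                      int bytesPerCredential,
                                      int baseBytes) {
        return baseBytes + (long) runningApps * bytesPerCredential;
    }

    public static void main(String[] args) {
        // assumed: ~1 KB per serialized token blob, ~2 KB fixed overhead
        long perHeartbeat = estimateResponseBytes(100, 1024, 2048);
        System.out.println(perHeartbeat + " bytes per heartbeat, per NM");
    }
}
```

Multiplied by the number of NMs in the cluster and the heartbeat frequency (1s by default), even a modest jump in application count translates into a visible outbound traffic increase on the RM host.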
Fixing the oversized RM NodeHeartbeat response
We then searched the community for a fix for oversized NodeHeartbeat responses caused by credentials, and found a JIRA improvement that addresses exactly this problem: YARN-6523 (Optimize system credentials sent in node heartbeat responses). The scenario described in that JIRA is very close to what we hit in production.
YARN-6523's strategy is to add a sequence number to the node heartbeat as a marker of token-update state: the RM sets credentials into an NM's response only when that NM needs a delta token update. After that, as long as the NM needs no additional tokens, the RM will not resend the same token information to it. We are actively trying out this fix and hope it delivers the expected improvement.
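The idea can be illustrated with a simplified sketch (this is not the actual RM implementation, just the mechanism in miniature): the RM bumps a monotonically increasing sequence number whenever the system-credentials map changes, each NM reports the last sequence number it has applied, and the RM ships credentials only to NMs that are behind.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the YARN-6523 delta mechanism. Class and method
// names are illustrative, not taken from the Hadoop code base.
public class CredentialsDeltaSketch {
    private long tokenSequenceNo = 0;
    private final Map<String, byte[]> systemCredentials = new HashMap<>();

    // Called when an application's HDFS tokens are added or renewed.
    public synchronized void updateCredentials(String appId, byte[] tokens) {
        systemCredentials.put(appId, tokens);
        tokenSequenceNo++; // any change invalidates what NMs have cached
    }

    // Called while building a heartbeat response; returns null when the NM
    // is already up to date, so the response stays small.
    public synchronized Map<String, byte[]> credentialsForResponse(
            long nmLastSeenSeqNo) {
        if (nmLastSeenSeqNo >= tokenSequenceNo) {
            return null; // NM up to date: send no credentials at all
        }
        return new HashMap<>(systemCredentials);
    }

    public synchronized long currentSequenceNo() {
        return tokenSequenceNo;
    }
}
```

With this scheme, the steady-state heartbeat response carries no credentials at all; the full map is sent only in the heartbeat immediately following a token change, so the RM's outbound traffic no longer scales with heartbeat frequency times application count.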