A follow-up to the HDFS JN lag investigation: where did the sudden RM traffic increase come from?

Preface


In the previous article (an analysis of an HDFS JournalNode transaction lag issue), I walked through the investigation of an HDFS JN lag problem in detail. At the end of that article I noted that the root cause was a sudden increase in traffic from the RM deployed on the same node as the JN, which in turn caused the JN lag. At the time, however, I had only identified the RM traffic surge as the likely factor and had not analyzed where this extra RM traffic actually came from. Since then we have looked more closely at what this RM traffic consists of and why it grew, and finally located the true source of the problem. In this article I will briefly describe this follow-up to the JN lag issue: the cause of the sudden increase in RM traffic.

Background


First, here is the trend chart of the RM traffic surge, shown below (it also appeared in the earlier JN lag article).
[Figure: RM node network traffic over time, showing the sudden increase]
Our suspicion at the time was that workload changes made by some of the platform's users had caused this shift in traffic. To confirm this hypothesis, we went on to analyze the RM traffic in more detail.

Analysis with the iftop command

Among the Linux command-line tools, iftop can show near real-time bandwidth usage on a node, broken down by connection. We therefore ran iftop on the RM machine (-B reports rates in bytes rather than bits, -P displays port numbers). The output was as follows:

Command: sudo iftop -B -P

rm-node:8031            => nm-node1:42810  47.2KB  94.4KB  94.4KB
                        <=                   996B  2.15KB  2.15KB
rm-node:8031            => nm-node2:37624  47.2KB  82.6KB  82.6KB
                        <=                 1.33KB  1.95KB  1.95KB
rm-node:8031            => nm-node3:46594   141KB  82.5KB  82.5KB
                        <=                 2.96KB  1.97KB  1.97KB
rm-node:8031            => nm-node4:59782  47.2KB  82.6KB  82.6KB
                        <=                 1.12KB  1.90KB  1.90KB
rm-node:8031            => nm-node5:41700  47.2KB  82.6KB  82.6KB
                        <=                 1.18KB  1.79KB  1.79KB
rm-node:8031            => nm-node6:50672  47.2KB  82.6KB  82.6KB
                        <=                   966B  1.79KB  1.79KB
rm-node:8031            => nm-node7:55306  94.4KB  82.6KB  82.6KB
                        <=                 1.88KB  1.71KB  1.71KB
rm-node:8031            => nm-node8:51870  47.2KB  82.6KB  82.6KB
                        <=                 1.71KB  1.68KB  1.68KB
rm-node:8031            => nm-node9:49730  47.3KB  82.6KB  82.6KB
                        <=                   907B  1.62KB  1.62KB

The output above is a snapshot of the RM traffic at a single point in time. It gives us several key pieces of information:

  • The RM's traffic is concentrated almost entirely on port 8031.
  • On port 8031, the amount of data sent from the RM to the NMs is far larger than the amount sent from the NMs back to the RM.

Looking further into what RM port 8031 is for: it is mainly used for the heartbeat communication between the RM and the NMs. Mapped back onto the iftop output above, this means that each NM sends the RM a fairly small node heartbeat, while the heartbeat response the RM returns is many times larger.
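For context, port 8031 is the RM's ResourceTracker service (yarn.resourcemanager.resource-tracker.address), the RPC endpoint that NMs register with and heartbeat against. The sketch below is a simplified version of that protocol interface; the exact method set and packages vary slightly between Hadoop versions, so treat it as illustrative rather than a verbatim copy of the source:

// Simplified sketch of the RPC protocol served on RM port 8031.
public interface ResourceTracker {

  // Called once when an NM starts up and joins the cluster.
  RegisterNodeManagerResponse registerNodeManager(
      RegisterNodeManagerRequest request) throws YarnException, IOException;

  // Called periodically by every NM (default interval 1s). The request is
  // small (node/container status), while the response carries whatever the
  // RM wants to push down to the NM -- which is where the asymmetry seen
  // in the iftop output comes from.
  NodeHeartbeatResponse nodeHeartbeat(
      NodeHeartbeatRequest request) throws YarnException, IOException;
}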

Cross-checking with tcpdump


At the same time, we picked one NM and used tcpdump on it to capture its live traffic, once with dst and once with src set to the RM machine, to verify the iftop findings.

The results were as follows:

NM->RM
Command: tcpdump dst rm-node

20:16:48.028768 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13088574:13092670, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327940], length 4096
20:16:48.028778 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13092670:13096766, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028791 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13096766:13100862, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028800 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13100862:13104958, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028809 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13104958:13109054, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028822 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13109054:13113150, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028832 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13113150:13117246, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028851 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13117246:13121342, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028862 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13121342:13125438, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028875 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13125438:13129534, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 4096
20:16:48.028886 IP nm-node.38708 > rm-node.8031: Flags [P.], seq 13129534:13130257, ack 4233065, win 2893, options [nop,nop,TS val 561229060 ecr 498327941], length 723
20:16:48.037348 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4234513, win 2889, options [nop,nop,TS val 561229068 ecr 498327949], length 0
20:16:48.037407 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4238857, win 2877, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037453 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4241257, win 2885, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037461 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4242705, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037499 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4247049, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037627 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4248497, win 2889, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037679 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4252841, win 2881, options [nop,nop,TS val 561229069 ecr 498327949], length 0
20:16:48.037729 IP nm-node.38708 > rm-node.8031: Flags [.], ack 4258633, win 2877, options [nop,nop,TS val 561229069 ecr 498327949], length 0

RM->NM
Command: tcpdump src rm-node

20:49:59.617672 IP rm-node.8030 > nm-node.58940: Flags [P.], seq 2156:2854, ack 1746598, win 5955, options [nop,nop,TS val 500319529 ecr 2277606464], length 698
20:49:59.672596 IP rm-node.8031 > nm-node.58058: Flags [.], ack 3590, win 901, options [nop,nop,TS val 500319583 ecr 2277606545], length 0
20:49:59.674288 IP rm-node.8031 > nm-node.58058: Flags [.], seq 288621:290069, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 1448
20:49:59.674336 IP rm-node.8031 > nm-node.58058: Flags [P.], seq 290069:296813, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 6744
20:49:59.674337 IP rm-node.8031 > nm-node.58058: Flags [.], seq 296813:301157, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 4344
20:49:59.674380 IP rm-node.8031 > nm-node.58058: Flags [.], seq 301157:302605, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606545], length 1448
20:49:59.674588 IP rm-node.8031 > nm-node.58058: Flags [.], seq 302605:304053, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 1448
20:49:59.674641 IP rm-node.8031 > nm-node.58058: Flags [.], seq 304053:319981, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 15928
20:49:59.674689 IP rm-node.8031 > nm-node.58058: Flags [.], seq 319981:331565, ack 3590, win 901, options [nop,nop,TS val 500319585 ecr 2277606547], length 11584
20:49:59.674822 IP rm-node.8031 > nm-node.58058: Flags [.], seq 331565:333013, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 1448
20:49:59.674881 IP rm-node.8031 > nm-node.58058: Flags [.], seq 333013:366317, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 33304
20:49:59.674923 IP rm-node.8031 > nm-node.58058: Flags [.], seq 366317:376453, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 10136
20:49:59.674974 IP rm-node.8031 > nm-node.58058: Flags [P.], seq 376453:384828, ack 3590, win 901, options [nop,nop,TS val 500319586 ecr 2277606547], length 8375
20:50:00.015461 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1750751, win 5955, options [nop,nop,TS val 500319926 ecr 2277606868], length 0
20:50:00.015503 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1754847, win 5955, options [nop,nop,TS val 500319926 ecr 2277606888], length 0
20:50:00.015505 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1767135, win 5955, options [nop,nop,TS val 500319926 ecr 2277606888], length 0
20:50:00.015550 IP rm-node.8030 > nm-node.58940: Flags [.], ack 1822079, win 5806, options [nop,nop,TS val 500319927 ecr 2277606888], length 0

From the captures above we can see:

  • The packets sent from the NM to the RM have fairly uniform, relatively small lengths.
  • The packets returned from the RM to the NM have irregular lengths, and those lengths are much larger.

This matches the conclusion drawn from the iftop analysis in the previous section.

A closer look at the RM NodeHeartbeat response


The command-line analysis above narrowed the source of the RM traffic down to the NodeHeartbeat response object, but we still did not know what inside it was making the response so large. Neither iftop nor tcpdump can capture that level of detail, so at this point the only option left was to dig into the code.

Here is the class definition of the heartbeat response:

public interface NodeHeartbeatResponse {

  int getResponseId();
  NodeAction getNodeAction();

  List<ContainerId> getContainersToCleanup();
  List<ContainerId> getContainersToBeRemovedFromNM();

  List<ApplicationId> getApplicationsToCleanup();

  void setResponseId(int responseId);
  void setNodeAction(NodeAction action);

  MasterKey getContainerTokenMasterKey();
  void setContainerTokenMasterKey(MasterKey secretKey);
  
  MasterKey getNMTokenMasterKey();
  void setNMTokenMasterKey(MasterKey secretKey);

  void addAllContainersToCleanup(List<ContainerId> containers);

  // This tells NM to remove finished containers from its context. Currently, NM
  // will remove finished containers from its context only after AM has actually
  // received the finished containers in a previous allocate response
  void addContainersToBeRemovedFromNM(List<ContainerId> containers);
  
  void addAllApplicationsToCleanup(List<ApplicationId> applications);

  long getNextHeartBeatInterval();
  void setNextHeartBeatInterval(long nextHeartBeatInterval);
  
  String getDiagnosticsMessage();

  void setDiagnosticsMessage(String diagnosticsMessage);

  // Credentials (i.e. hdfs tokens) needed by NodeManagers for application
  // localizations and logAggreations.
  Map<ApplicationId, ByteBuffer> getSystemCredentialsForApps();  

  void setSystemCredentialsForApps(
      Map<ApplicationId, ByteBuffer> systemCredentials);  <====
  
  boolean getAreNodeLabelsAcceptedByRM();
  void setAreNodeLabelsAcceptedByRM(boolean areNodeLabelsAcceptedByRM);
}

A quick pass over NodeHeartbeatResponse shows that one field, SystemCredentialsForApps, is stored as a map and is noticeably more complex than the other fields, such as the lists of ContainerId and ApplicationId. According to the comment in the code, these credentials (i.e. HDFS tokens) are returned by the RM to the NM for application localization and log aggregation. This implies a hidden relationship: as the number of applications running on an NM goes up, the amount of credentials it receives from the RM grows along with it.
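On the RM side this field is filled in on every heartbeat. The fragment below is a rough sketch, based on the Hadoop 2.x ResourceTrackerService.nodeHeartbeat code, of how the response picks up its credentials; the exact code differs between versions, so read it as an illustration rather than an exact quote:

// Rough sketch of the RM side (ResourceTrackerService.nodeHeartbeat).
ConcurrentMap<ApplicationId, ByteBuffer> systemCredentials =
    rmContext.getSystemCredentialsForApps();
if (!systemCredentials.isEmpty()) {
  // The whole map -- a serialized token buffer per running application --
  // is copied into the heartbeat response for every NM, on every heartbeat,
  // whether or not the NM has already received those tokens before.
  nodeHeartBeatResponse.setSystemCredentialsForApps(systemCredentials);
}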

This seemed to explain how the workload changes mentioned earlier drove up the RM traffic. The chain of cause and effect is as follows (a rough size estimate is sketched after the list):

1) A business team adjusted its workload and submitted many new applications to the RM.
2) Because the localization and log-aggregation operations of these new applications need the corresponding tokens, the RM returned more credentials to the NMs in its NodeHeartbeat responses.
3) The growing credentials payload made the NodeHeartbeat responses larger, which in turn pushed up the traffic on the RM machine.
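As a sanity check, a back-of-envelope calculation shows that this mechanism alone can produce traffic of the same order as what iftop reported. The application count and per-token size below are purely illustrative assumptions, not values measured on our cluster:

// Illustrative estimate only: runningApps and bytesPerAppCredential are
// assumed numbers, not measurements.
public class HeartbeatTrafficEstimate {
  public static void main(String[] args) {
    int runningApps = 50;             // assumed apps with HDFS tokens to ship
    int bytesPerAppCredential = 1024; // assumed serialized token size per app
    int heartbeatIntervalMs = 1000;   // default NM heartbeat interval

    long bytesPerResponse = (long) runningApps * bytesPerAppCredential;
    double kbPerSecondPerNm =
        bytesPerResponse * (1000.0 / heartbeatIntervalMs) / 1024;

    // With these assumptions the RM pushes roughly 50KB/s to each NM, the
    // same order of magnitude as the 47.2KB-94.4KB per NM connection seen
    // in the iftop output above.
    System.out.printf("RM -> each NM: ~%.1fKB/s%n", kbPerSecondPerNm);
  }
}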

Resolving the oversized RM NodeHeartbeat response


We then searched the community for a fix for NodeHeartbeat responses bloated by credentials, and found a matching improvement: YARN-6523 (Optimize system credentials sent in node heartbeat responses). The scenario described in that JIRA is very close to what we encountered in production.

The approach taken by YARN-6523 is to add a sequence number to the node heartbeat to track the token update state: the RM only puts the credentials into an NM's heartbeat response when that NM still needs a delta token update. Once an NM no longer needs any additional tokens, the RM stops resending the same token information to it. We are currently trying out this solution and hope it delivers the expected improvement.
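Conceptually, the RM-side check introduced by YARN-6523 works roughly like the sketch below. The method names here are simplified placeholders and do not match the patch line for line; see the actual YARN-6523 patch for the real implementation:

// Conceptual sketch of the YARN-6523 idea; names are simplified placeholders.
long rmTokenSequenceNo = rmContext.getTokenSequenceNo(); // bumped when app tokens change
long nmTokenSequenceNo = request.getTokenSequenceNo();   // last sequence number this NM saw

response.setTokenSequenceNo(rmTokenSequenceNo);
if (nmTokenSequenceNo != rmTokenSequenceNo) {
  // The NM is behind: ship the updated credentials to it once.
  response.setSystemCredentialsForApps(rmContext.getSystemCredentialsForApps());
}
// Otherwise the NM is already up to date and the potentially large
// credentials payload is omitted, keeping the heartbeat response small.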
