Production Pitfall Series: A Hive on Spark Connection Timeout

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"起因","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7/16凌晨,釘釘突然收到了一條告警,一個公司所有業務部門的組織架構表的ETL過程中,數據推送到DIM層的過程中出現異常,導致任務失敗。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲這個數據會影響到第二天所有大數據組對外的應用服務中組織架構基礎數據,當然,我們的Pla-nB也不是喫素的,一旦出現錯誤,後面的權限管理模塊與網關會自動配合切換前一天的最後一次成功處理到DIM中的組織架構數據,只會影響到在前一天做過組織架構變化的同事在系統上的操作,但是這個影響數量是可控的,並且我們會也有所有組織架構變化的審計數據,如果第二天這個推數的ETL修復不完的話,我們會手動按照審計數據對這些用戶先進行操作,保證線上的穩定性。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"技術架構","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣:CDH 256G/64C計算物理集羣 X 18臺","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"調度:dolphin","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據抽取:datax","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DIM層數據庫:Doris","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hive版本:2.1.1","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"告警","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"告警策略現在是有機器人去捕捉dolphin的告警郵件,發到釘釘羣裏,dolphin其實是可以獲取到異常的,需要進行一系列的開發,但是擔心複雜的調度過程會有任務監控的遺漏,導致告警丟失,這樣就是大問題,所以簡單粗暴,機器人代替人來讀取郵件併發送告警到釘釘,這樣只關注這個幸福來敲門的小可愛即可。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/21/21eaad0893443e3ba950429c5a66a3bf.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"集羣log","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"kotlin"},"content":[{"type":"text","text":"Log Type: stderr\n\nLog Upload Time: Fri Jul 16 01:27:46 +0800 2021\n\nLog Length: 10569\n\nSLF4J: Class path contains multiple SLF4J bindings.\nSLF4J: Found binding in [jar:file:/data7/yarn/nm/usercache/dolphinscheduler/filecache/8096/__spark_libs__6065796770539359217.zip/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\nSLF4J: Actual binding is of type 
### Cluster log

```
Log Type: stderr

Log Upload Time: Fri Jul 16 01:27:46 +0800 2021

Log Length: 10569

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data7/yarn/nm/usercache/dolphinscheduler/filecache/8096/__spark_libs__6065796770539359217.zip/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for TERM
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for HUP
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for INT
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls groups to: 
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls groups to: 
21/07/16 01:27:43 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dolphinscheduler); groups with view permissions: Set(); users with modify permissions: Set(yarn, dolphinscheduler); groups with modify permissions: Set()
21/07/16 01:27:43 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1625364172078_3093_000001
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
21/07/16 01:27:43 INFO client.RemoteDriver: Connecting to HiveServer2 address: hadoop-task-1.bigdata.xx.com:24173
21/07/16 01:27:44 INFO conf.HiveConf: Found configuration file file:/data8/yarn/nm/usercache/dolphinscheduler/filecache/8097/__spark_conf__.zip/__hadoop_conf__/hive-site.xml
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.XX.com/10.25.15.104:24173
 at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
 at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
 at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
 at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
 ... 10 more
21/07/16 01:27:44 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.dd.com/10.25.15.104:24173
 at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
 at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
 at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
 at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
 ... 10 more
)
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
 at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:447)
 at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:275)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:805)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:804)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
 at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:804)
 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
 at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
 at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
 at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
 at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
 at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
 ... 10 more
21/07/16 01:27:44 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://lbc/user/dolphinscheduler/.sparkStaging/application_1625364172078_3093
21/07/16 01:27:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/16 01:27:44 INFO util.ShutdownHookManager: Shutdown hook called
```

# Root-cause analysis

## The operations angle

### Resource load

The ops team's reflex is always to look at load first. This 18-node compute cluster had been running steadily for quite a while, task parallelism in that time window was actually low that night, and every YARN queue is isolated. We also pulled all planned and actual execution schedules for our Dolphin tasks straight out of Dolphin's MySQL database. So "the cluster was saturated" is not a convincing explanation, and nothing supported it.

### Network

For a `ConnectionTimeout` the usual move is to have the network team pull the switch logs and sift through them, but unless the network problem is severe this rarely turns anything up. Sure enough, the network team's verdict was that everything looked normal.

## The development angle

### A bug in our own code

This ETL had been running fine for several weeks, and we track it through Dolphin's email alerts, the DingTalk bot and the job's footprint monitoring, so a bug in our own program is unlikely. The reason I will not say it is definitely not a program bug is that an ETL is exposed to the source's data types, data sets and sudden changes, any of which can ripple into the downstream data movement; some corner case we had not considered might have chosen this moment to strike, in which case the program's robustness would also need hardening. A first pass over the code ruled out a problem in the program itself.

### The open-source stack

If the program is fine too, that is usually scary news. We then have to trace the open-source tools' execution flow from the logs, work through it step by step, and from the point of failure reason about every way the problem could have happened; worst of all, it may turn out to be unreproducible. At this point you normally need someone who knows the open-source architecture very well, or who is at least familiar with how the framework works internally; of course, anyone who has never traced this part of the source but is interested enough can also start from scratch. This time we had to analyze where the problem lies from the architecture of the open-source tools.

# Problem analysis

From the log, the failure happened while Hive on Spark was running: the HQL had already been turned into a Spark job, and the Driver thread had been initialized inside the AM. I also want to put together a series on how the Driver starts and how it relates to the Executors; I will publish it when I find the time.

## Locating the failure

The two most important log lines are:

```
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
```
These two lines come from the `runDriver` and `startUserApplication` methods of `org.apache.spark.deploy.yarn.ApplicationMaster.scala`:

- `runDriver`

```scala
  private def runDriver(): Unit = {
    addAmIpFilter(None)
    userClassThread = startUserApplication()

    // This a bit hacky, but we need to wait until the spark.driver.port property has
    // been set by the Thread executing the user class.
    logInfo("Waiting for spark context initialization...")
```

- `startUserApplication`

```scala
  /**
   * Start the user class, which contains the spark driver, in a separate Thread.
   * If the main routine exits cleanly or exits with System.exit(N) for any N
   * we assume it was successful, for all other cases we assume failure.
   *
   * Returns the user thread that was started.
   */
  private def startUserApplication(): Thread = {
    logInfo("Starting the user application in a separate Thread")
```

These two methods are how, in yarn-cluster mode, the jar the user submitted gets started. In an ordinary Spark job the submitted code is the data-processing logic itself: the various operators, with stages split at shuffle operators and the job submitted once an action operator is reached.

## Analyzing the possible causes

### Reading the error

Yet while the Driver thread was running, this error appeared:
```
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
21/07/16 01:27:43 INFO client.RemoteDriver: Connecting to HiveServer2 address: hadoop-task-1.bigdata.xx.com:24173
21/07/16 01:27:44 INFO conf.HiveConf: Found configuration file file:/data8/yarn/nm/usercache/dolphinscheduler/filecache/8097/__spark_conf__.zip/__hadoop_conf__/hive-site.xml
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.XX.com/10.25.15.104:24173
```

I always analyze a problem by narrowing down from large to small on the basis of evidence, eliminating the impossible options and pinning the final answer down from what remains.

**From the log above we can conclude:**

1. The Driver had already started.
2. Inside the Driver, a connection to HiveServer2 was attempted.
3. That connection hit a ConnectionTimeout error.

### The surface cause

What the log is telling us is this: after the Driver started, the Driver thread connected back to HiveServer2, i.e. Hive's server, and that connection timed out, so the task failed. At this point we know exactly where the problem occurred; the next question is why.

What is the processing model of Hive on Spark? Why does the Driver thread connect back to the HiveServer2 service? And what about this step could cause a timeout? With these questions in mind, the only way to dig deeper is to go into the source code.

# Digging deeper

## The Hive on Spark (HOS) mechanism

Hive on Spark means Hive SQL (HQL) is executed by the Spark engine instead of the default MapReduce, using Spark's speed and compute power to replace the heavyweight native MR implementation.
## The Hive on Spark architecture

A diagram is needed here (it comes from the internet, but judging from my own reading of the source code, the architecture it shows is correct).

Just form a rough impression of the big picture for now; it will be easier to digest when we come back to it while going through each part in detail.

![](https://static001.geekbang.org/infoq/00/00c3c40ea6df9d571737be6568f30ce9.png)

## Hive on Spark in detail

### Entry point: SparkTask

You can think of it this way: the HQL is submitted on the Hive client side and, after a series of transformations, becomes a Spark job. The starting point of that conversion is `SparkTask`. How exactly `SparkTask` ends up being invoked I have not studied closely yet; if it becomes necessary, or if anyone is interested, we can trace that part of the logic together later.

![](https://static001.geekbang.org/infoq/35/355f6e85a2743eae55c8761bec080ef7.png)

**Echoing the architecture diagram above, submitting a single HQL job (leaving aside the later job monitoring) boils down to two steps:**

- creating the session
- submitting through the session

**The rough call flow of these two steps:**

- **The call chain for obtaining a SparkSession**
  - `sparkSession = SparkUtilities.getSparkSession(conf, sparkSessionManager)`
  - → `sparkSession = sparkSessionManager.getSession(sparkSession, conf, true);`
  - → `existingSession.open(conf);`
  - → `hiveSparkClient = HiveSparkClientFactory.createHiveSparkClient(conf);`
  - → `return new RemoteHiveSparkClient(hiveconf, sparkConf);`
  - → `createRemoteClient();`
  - → `remoteClient = SparkClientFactory.createClient(conf, hiveConf);`
  - → `return new SparkClientImpl(server, sparkConf, hiveConf);`
  - → `this.driverThread = startDriver(rpcServer, clientId, secret);`
  - → **ship the Driver jar to the Spark cluster**, i.e. the `spark-submit --class xxxx.jar ...` step

- **The call chain for sparkSession.submit**
  - `SparkJobRef jobRef = sparkSession.submit(driverContext, sparkWork);`
  - → `return hiveSparkClient.execute(driverContext, sparkWork);`
  - → (in `RemoteHiveSparkClient`) `return submit(driverContext, sparkWork);`
  - → `JobHandle jobHandle = remoteClient.submit(job);`
  - → (in `SparkClientImpl`) `return protocol.submit(job);`
  - → (in `ClientProtocol`) `final io.netty.util.concurrent.Future rpc = driverRpc.call(new JobRequest(jobId, job));`

### Initialization: SparkSession

As everyone knows, an offline Spark job starts with the user writing a User Class, packaging it into a jar, and handing that jar to the Spark cluster; in production we generally submit with `--master yarn --deploy-mode cluster`.

I had always understood HOS as "each submitted HQL gets parsed into a Spark job and submitted to the cluster", but is that job packaged into a jar every single time, or is everything packaged into one jar up front? I had never looked at this closely, and on reflection, building a jar per query would be a rather silly design. After reading the SparkSession implementation you can see how HOS is actually designed:

- **First, recall the simple model of submitting a Spark job:**

![](https://static001.geekbang.org/infoq/c2/c2071dd39661371719d8fbae7e935cb1.png)

- **With HOS added, the picture becomes:**

![](https://static001.geekbang.org/infoq/9a/9a3c09269f88e4a6f81cbdbe90ca0bf4.png)

So the mere act of initializing a SparkSession completes the submission of a User Class to the Spark cluster, and it does so in a rather clever way.
### Establishing the connection between HiveServer2 (HS2) and the Spark cluster

Because the HQL is submitted to the HS2 server, it is HS2 that parses the HQL, turns it into a SparkTask and runs the rest of the pipeline. As the diagram above shows, inside HS2 `SparkClientFactory.initialize` first stands up a Netty server, then `SparkClientFactory.createClient` instantiates `SparkClientImpl`, and the `SparkClientImpl` constructor calls `startDriver`, which is where the `spark-submit` happens. A fragment of that code:

```java
if (sparkHome != null) {
  argv.add(new File(sparkHome, "bin/spark-submit").getAbsolutePath());
} else {
  LOG.info("No spark.home provided, calling SparkSubmit directly.");
  argv.add(new File(System.getProperty("java.home"), "bin/java").getAbsolutePath());

  if (master.startsWith("local") || master.startsWith("mesos") || master.endsWith("-client") || master.startsWith("spark")) {
    String mem = conf.get("spark.driver.memory");
    if (mem != null) {
      argv.add("-Xms" + mem);
      argv.add("-Xmx" + mem);
    }

    String cp = conf.get("spark.driver.extraClassPath");
    if (cp != null) {
      argv.add("-classpath");
      argv.add(cp);
    }
```

You can see it assembling a command line built around `bin/spark-submit`; a hedged sketch of the launch step that follows this assembly is shown below.
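To make the launch step concrete, here is a hedged sketch of what happens after that argv list is assembled: the client forks `bin/spark-submit` as a child process, with `org.apache.hive.spark.client.RemoteDriver` (the class that keeps appearing in the stack traces above) as the user class. The spark home, jar path and argument list are placeholders, not the exact ones Hive builds.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: the real command also carries the HS2 callback address and a number
 * of --conf entries that are omitted here.
 */
public class SparkSubmitLauncherSketch {
    public static void main(String[] args) throws Exception {
        String sparkHome = "/opt/spark"; // placeholder

        List<String> argv = new ArrayList<>();
        argv.add(new File(sparkHome, "bin/spark-submit").getAbsolutePath());
        argv.add("--master");
        argv.add("yarn");
        argv.add("--deploy-mode");
        argv.add("cluster");
        // The user class is RemoteDriver, the same class that shows up in the job log.
        argv.add("--class");
        argv.add("org.apache.hive.spark.client.RemoteDriver");
        argv.add("/path/to/hive-exec.jar"); // placeholder jar carrying RemoteDriver

        // Fork the child process and wait for it, streaming its output to our own stdout.
        Process child = new ProcessBuilder(argv).inheritIO().start();
        System.out.println("spark-submit exited with code " + child.waitFor());
    }
}
```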
**Recall what the code in a typical submitted User Class looks like:**

- **A typical User Class**
  - The User Class (jar file) we submit to a Spark cluster is normally just a body of code, for example the word-count fragment below, where
  - `reduceByKey` is a shuffle operator, so it splits the job into 2 stages, and
  - `saveAsTextFile` is an action operator, which submits the whole job.

```scala
object WordCount {

  def main(args: Array[String]): Unit = {
    // conf
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // read
    val line: RDD[String] = sc.textFile(args(0))

    // flatmap
    val words: RDD[String] = line.flatMap(_.split(" "))

    val wordAndOne = words.map((_, 1))

    val wordAndCount = wordAndOne.reduceByKey(_ + _)

    // save to hdfs
    wordAndCount.saveAsTextFile(args(1))

    // close
    sc.stop()
  }
}
```

**What is clever about this design**

- **The submitted User Class (`RemoteDriver.java`) is itself a Netty client.**
  - When RemoteDriver is shipped to the Spark cluster it starts a Netty client that connects back to the Netty server inside HS2, i.e. the server we saw being created earlier.
- **A submitted HQL then becomes traffic between that Netty server and the Netty client (the RemoteDriver already running in the Spark cluster).**
  - From the diagram you can see (or at least infer) that the submitted HQL is handed to the Spark cluster through the channel between these two Netty endpoints, which is how the computation ends up running inside the cluster. A toy sketch of this callback pattern follows the diagram below.

![](https://static001.geekbang.org/infoq/9a/9a3c09269f88e4a6f81cbdbe90ca0bf4.png)
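To make the direction of that connection tangible, here is a toy, self-contained illustration of the callback pattern. It is not Hive code: the "HS2 side" listens first, the "RemoteDriver side" is started afterwards and connects back, and the "job" travels over that channel. Plain java.net sockets are used instead of Netty purely to keep the sketch short.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class CallbackChannelDemo {
    public static void main(String[] args) throws Exception {
        // "HS2 side": listen on an ephemeral port before launching the driver.
        ServerSocket rpcServer = new ServerSocket(0);
        int port = rpcServer.getLocalPort();

        // "RemoteDriver side": in Hive this is a separate JVM started via spark-submit;
        // here a thread stands in for it. It connects back to the port it was given.
        Thread remoteDriver = new Thread(() -> {
            try (Socket toHs2 = new Socket("127.0.0.1", port);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(toHs2.getInputStream()))) {
                System.out.println("driver got job request: " + in.readLine());
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        remoteDriver.start();

        // Back on the "HS2 side": accept the callback connection and push a job over it.
        try (Socket fromDriver = rpcServer.accept();
             PrintWriter out = new PrintWriter(fromDriver.getOutputStream(), true)) {
            out.println("JobRequest{jobId=1, sparkWork=...}");
        }
        remoteDriver.join();
        rpcServer.close();
    }
}
```

The point of the sketch is only the direction of the handshake: the server side exists first, and the remotely launched process is the one that dials back in.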
### Hunting for the suspicious spot

From the error log above we can see that after RemoteDriver was started in the Driver thread, its callback connection to HS2 timed out; in other words, the timeout happened on the Netty RPC connection shown in the diagram above. So we need to look more closely at how that step is handled.

- **Inside RemoteDriver**
  - During initialization there is the following piece of code, which sets up the Netty client. The comment on it is striking: it is effectively warning us about the timeout, because `Rpc.createClient` returns a Promise, and the code immediately calls `get()` on it, blocking synchronously until `clientRpc` is available. If that step is not kept under control, it is exactly where a timeout can occur (a tiny standalone demo of this blocking pattern follows the snippet).

```java
// The RPC library takes care of timing out this.
this.clientRpc = Rpc.createClient(mapConf, egroup, serverAddress, serverPort,
    clientId, secret, protocol).get();
this.running = true;
```
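As a small aside, here is a standalone demo (again, not Hive code) of why the failure surfaces as a `java.util.concurrent.ExecutionException` in the job log above: a Netty `Promise` is failed by a scheduled timeout task, and the caller that blocks on `get()`, just like the `createClient(...).get()` in RemoteDriver's constructor, sees that failure wrapped in an ExecutionException.

```java
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.Promise;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PromiseTimeoutDemo {
    public static void main(String[] args) throws Exception {
        NioEventLoopGroup eloop = new NioEventLoopGroup(1);
        try {
            Promise<Void> promise = eloop.next().newPromise();
            // Stand-in for the "Set up a timeout to undo everything" task in createClient().
            eloop.schedule(
                () -> promise.tryFailure(
                        new TimeoutException("Timed out waiting for RPC server connection.")),
                1, TimeUnit.SECONDS);
            promise.get(); // blocks, like RemoteDriver's createClient(...).get()
        } catch (ExecutionException e) {
            // This is the same exception shape that appears at the top of the job log.
            System.out.println("get() surfaced: " + e.getCause());
        } finally {
            eloop.shutdownGracefully();
        }
    }
}
```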
- **The details of `Rpc.createClient` as called from RemoteDriver**
  - I have annotated the code directly. While building the client bootstrap it uses a timeout value that comes from `int connectTimeoutMs = (int) rpcConf.getConnectTimeoutMs();`.

```java
public static Promise<Rpc> createClient(
    Map<String, String> config,
    final NioEventLoopGroup eloop,
    String host,
    int port,
    final String clientId,
    final String secret,
    final RpcDispatcher dispatcher) throws Exception {
  final RpcConfiguration rpcConf = new RpcConfiguration(config);

  // How long the client may take to connect to the Netty server.
  int connectTimeoutMs = (int) rpcConf.getConnectTimeoutMs();

  final ChannelFuture cf = new Bootstrap()
      .group(eloop)
      .handler(new ChannelInboundHandlerAdapter() { })
      .channel(NioSocketChannel.class)
      .option(ChannelOption.SO_KEEPALIVE, true)
      // ... and this is where it is applied.
      .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeoutMs)
      .connect(host, port);

  final Promise<Rpc> promise = eloop.next().newPromise();
  final AtomicReference<Rpc> rpc = new AtomicReference<Rpc>();

  // Set up a timeout to undo everything.
  final Runnable timeoutTask = new Runnable() {
    @Override
    public void run() {
      promise.setFailure(new TimeoutException("Timed out waiting for RPC server connection."));
    }
  };
  final ScheduledFuture<?> timeoutFuture = eloop.schedule(timeoutTask,
      rpcConf.getServerConnectTimeoutMs(), TimeUnit.MILLISECONDS);

  // The channel listener instantiates the Rpc instance when the connection is established,
  // and initiates the SASL handshake.
  cf.addListener(new ChannelFutureListener() {
    @Override
    public void operationComplete(ChannelFuture cf) throws Exception {
      if (cf.isSuccess()) {
        SaslClientHandler saslHandler = new SaslClientHandler(rpcConf, clientId, promise,
            timeoutFuture, secret, dispatcher);
        Rpc rpc = createRpc(rpcConf, saslHandler, (SocketChannel) cf.channel(), eloop);
        saslHandler.rpc = rpc;
        saslHandler.sendHello(cf.channel());
      } else {
        promise.setFailure(cf.cause());
      }
    }
  });

  // Handle cancellation of the promise.
  promise.addListener(new GenericFutureListener<Promise<Rpc>>() {
    @Override
    public void operationComplete(Promise<Rpc> p) {
      if (p.isCancelled()) {
        cf.cancel(true);
      }
    }
  });

  return promise;
}
```

- **Tracking down this suspicious timeout value**
  - As the code below shows, the default for this timeout is 1000 ms: if the connection from the client back to the server takes longer than one second, it fails outright with the timeout error this post started with (a minimal runnable illustration follows the snippet below).

```java
long getConnectTimeoutMs() {
  String value = config.get(HiveConf.ConfVars.SPARK_RPC_CLIENT_CONNECT_TIMEOUT.varname);
  return value != null ? Integer.parseInt(value) : DEFAULT_CONF.getTimeVar(
      HiveConf.ConfVars.SPARK_RPC_CLIENT_CONNECT_TIMEOUT, TimeUnit.MILLISECONDS);
}

SPARK_RPC_CLIENT_CONNECT_TIMEOUT("hive.spark.client.connect.timeout",
    "1000ms", new TimeValidator(TimeUnit.MILLISECONDS),
    "Timeout for remote Spark driver in connecting back to Hive client."),
```
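To see the 1000 ms behaviour in isolation, here is a minimal, runnable sketch (not Hive code) that sets the same `CONNECT_TIMEOUT_MILLIS` option on a plain Netty `Bootstrap`. The target address is a placeholder black-hole IP chosen only to force the connect to hang, and the 1000 ms mirrors the `hive.spark.client.connect.timeout` default quoted above.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

import java.util.concurrent.TimeUnit;

public class ConnectTimeoutDemo {
    public static void main(String[] args) throws Exception {
        NioEventLoopGroup group = new NioEventLoopGroup(1);
        try {
            ChannelFuture cf = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    .handler(new ChannelInboundHandlerAdapter() { })
                    .option(ChannelOption.SO_KEEPALIVE, true)
                    // Same option Rpc.createClient sets, with the same 1000 ms budget.
                    .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 1000)
                    // Non-routable placeholder address, used only to make the connect hang.
                    .connect("10.255.255.1", 24173);

            cf.await(5, TimeUnit.SECONDS);
            if (!cf.isSuccess()) {
                // Typically an io.netty.channel.ConnectTimeoutException after about 1 s.
                System.out.println("connect failed: " + cf.cause());
            }
        } finally {
            group.shutdownGracefully();
        }
    }
}
```

With only one second to spare, a momentarily busy HS2 host or a brief network hiccup is enough to push the connect past the deadline, which is exactly why such a small default is fragile.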
### Closing the case

As the analysis above shows, when the Driver thread connects back to the HS2 server, the connect timeout is set to 1 s; anything longer produces the timeout error. One second is simply too short and makes this step fragile, so the fix is to raise this timeout (the `hive.spark.client.connect.timeout` property quoted above, default `1000ms`) to a more forgiving value, which solves the problem and improves stability.

# The official answer

In fact I had already found the official bug fix for this at the very start.

See:

https://issues.apache.org/jira/browse/HIVE-16794

https://issues.apache.org/jira/secure/attachment/12872466/HIVE-16794.patch

The reason we did not simply apply this issue's change or upgrade Hive right away is that I wanted to study the essence of the problem, and the fundamentals of HOS, in more detail first.

# What's next

This troubleshooting gave us a quick tour of the basics of HOS. In a follow-up I will explain in detail how RemoteDriver submits jobs to Spark, and how a job is generated on the HS2 Netty server side and delivered to RemoteDriver.

# The end