Hands-On Spark: The WordCount RDD Execution Flow
Preface
When a Spark program initializes, it constructs objects such as DAGScheduler, TaskSchedulerImpl, and MapOutputTrackerMaster. The DAGScheduler is mainly responsible for building the DAG, starting jobs, and submitting stages; the TaskSchedulerImpl is mainly responsible for adding and scheduling task sets; the MapOutputTrackerMaster is mainly responsible for tracking shuffle output. These are not covered further here.
Note the following concepts:
- Application // each Spark program has one Application, and therefore a single unique applicationId
- Job // calling an action operator triggers runJob; each runJob call produces one Job
- Stage // each wide dependency encountered produces a new Stage
- Task // the smallest unit of execution in a Spark program
Note: one Spark program has 1 Application, 1~N Jobs, 1~N Stages per Job, and 1~N Tasks per Stage:
1 Application = [1 ~ N] Jobs
1 Job = [1 ~ N] Stages
1 Stage = [1 ~ N] Tasks
Software Environment
- Spark: 2.3.0
Code
A simple WordCount program follows.
data.txt
hello world
hello java
hello scala
hello hadoop
hello spark
WCAnalyzer.scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object WCAnalyzer {
  def main(args: Array[String]): Unit = {
    // Set the log output level so the scheduler activity is easy to observe
    Logger.getLogger("org.apache").setLevel(Level.ALL)
    // Create the SparkContext
    val sc = new SparkContext(new SparkConf().setMaster("local[1]")
      .setAppName("WCAnalyzer"))
    // Read the input data from a file
    sc.textFile("data/data.txt", 1)
      // Split each line on spaces to get individual words
      .flatMap(_.split(" "))
      // Pair each word with a count of 1 as a tuple
      .map((_, 1))
      // Aggregate the counts of identical words
      .reduceByKey(_ + _)
      // Swap (key, value) to (value, key) so we can sort by count
      .map(v => (v._2, v._1))
      // Sort by word count in descending order
      .sortByKey(false)
      // Swap back to (word, count)
      .map(v => (v._2, v._1))
      // Collect the results to the driver
      .collect()
      // Print the results
      .foreach(println)
    // Stop the SparkContext
    sc.stop()
  }
}
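As a quick sanity check of what the pipeline computes, the same chain of transformations can be mirrored with plain Scala collections. This is a local sketch with no Spark involved: the lines of data.txt are inlined, and groupBy plus a local sum stands in for reduceByKey's shuffle.

```scala
// Stand-in for the contents of data.txt
val lines = Seq("hello world", "hello java", "hello scala", "hello hadoop", "hello spark")

val counted = lines
  .flatMap(_.split(" "))                           // split into words
  .map((_, 1))                                     // pair each word with 1
  .groupBy(_._1)                                   // local analogue of reduceByKey's grouping
  .map { case (w, ps) => (w, ps.map(_._2).sum) }   // sum the 1s per word
  .toSeq
  .sortBy(-_._2)                                   // descending by count, like sortByKey(false)

println(counted.head)  // most frequent word: (hello,5)
```

Since "hello" appears five times and every other word once, the first pair is (hello,5) and the remaining five words follow with a count of 1 each.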
Execution Analysis
This code produces exactly 1 Job, 3 Stages, and 8 RDDs.
- Stage division
Stages are divided from back to front: each wide dependency encountered starts a new Stage. This simple WordCount code therefore splits into 3 Stages: Stage 0, made up of the textFile, flatMap, and map operators; Stage 1, made up of the reduceByKey and map operators; and Stage 2, made up of the sortByKey and map operators.
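The splitting rule can be sketched as a toy model in plain Scala. The Op class and its isWide flags are this example's own invention, not Spark API; the point is only that a new stage begins at every shuffle boundary.

```scala
// Each operator is tagged with whether it introduces a wide (shuffle) dependency.
case class Op(name: String, isWide: Boolean)

val lineage = Seq(
  Op("textFile", false), Op("flatMap", false), Op("map", false),
  Op("reduceByKey", true), Op("map", false),
  Op("sortByKey", true), Op("map", false)
)

// A wide dependency starts a new stage; a narrow operator joins the current one.
val stages: Seq[Seq[String]] =
  lineage.foldLeft(Seq(Seq.empty[String])) { (acc, op) =>
    if (op.isWide) acc :+ Seq(op.name)
    else acc.init :+ (acc.last :+ op.name)
  }

println(stages.length)  // 3, matching Stage 0 / 1 / 2 above
```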
- RDD generation
textFile    (HadoopRDD [0], MapPartitionsRDD [1])   // [ ] holds the RDD's index
flatMap     (MapPartitionsRDD [2])
map         (MapPartitionsRDD [3])
reduceByKey (ShuffledRDD [4])
map         (MapPartitionsRDD [5])
sortByKey   (ShuffledRDD [6])
map         (MapPartitionsRDD [7])
- Log analysis
- org.apache.spark.SparkContext - Starting job: collect at WCAnalyzer.scala:34
  The collect operator triggers runJob, which starts a Job. The foreach(println) in the code is not an RDD operator here: collect has already returned a local Array, and foreach is the Scala collection method on that array, so it triggers no additional runJob and therefore produces no additional Job.
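A minimal local illustration of this point (the array literal below is a hypothetical stand-in for what collect() returns):

```scala
// collect() hands back a plain Scala Array on the driver; the foreach that
// follows it is scala.Array's foreach, running locally, not RDD.foreach.
val collected: Array[(String, Int)] = Array(("hello", 5), ("world", 1))
val rendered = collected.map { case (w, n) => s"$w=$n" }.mkString(",")
println(rendered)  // hello=5,world=1
```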
- org.apache.spark.scheduler.DAGScheduler - Got job 0 (collect at WCAnalyzer.scala:34) with 1 output partitions
  Job 0 is created, triggered by the collect operator at line 34 of the code, with one output partition.
- org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 2 (collect at WCAnalyzer.scala:34)
  org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List(ShuffleMapStage 1)
  org.apache.spark.scheduler.DAGScheduler - Missing parents: List(ShuffleMapStage 1)
  org.apache.spark.scheduler.DAGScheduler - submitStage(ResultStage 2)
  org.apache.spark.scheduler.DAGScheduler - missing: List(ShuffleMapStage 1)
  org.apache.spark.scheduler.DAGScheduler - submitStage(ShuffleMapStage 1)
  org.apache.spark.scheduler.DAGScheduler - missing: List(ShuffleMapStage 0)
  org.apache.spark.scheduler.DAGScheduler - submitStage(ShuffleMapStage 0)
  org.apache.spark.scheduler.DAGScheduler - missing: List()
  The Job's final stage is ResultStage 2; its parent is ShuffleMapStage 1, which is also its missing (not yet computed) parent.
- org.apache.spark.scheduler.DAGScheduler - submitStage(ResultStage 2)
  An attempt is made to submit ResultStage 2.
- org.apache.spark.scheduler.DAGScheduler - missing: List(ShuffleMapStage 1)
  ShuffleMapStage 1 is still missing.
- org.apache.spark.scheduler.DAGScheduler - submitStage(ShuffleMapStage 1)
  An attempt is made to submit ShuffleMapStage 1.
- org.apache.spark.scheduler.DAGScheduler - missing: List(ShuffleMapStage 0)
  ShuffleMapStage 0 is still missing.
- org.apache.spark.scheduler.DAGScheduler - submitStage(ShuffleMapStage 0)
  An attempt is made to submit ShuffleMapStage 0.
- org.apache.spark.scheduler.DAGScheduler - missing: List()
  No missing Stages remain.
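The recursive submitStage behaviour visible in these messages can be sketched as a toy in plain Scala. Stage here is a toy case class, not Spark's scheduler type; the sketch only shows why walking back from ResultStage 2 ends with Stage 0 being the first to actually run while the others wait.

```scala
// Toy model of DAGScheduler.submitStage: a stage runs only once all its
// parents have finished; otherwise it is parked and its parents are tried.
case class Stage(id: Int, parents: Seq[Stage])

val stage0 = Stage(0, Nil)
val stage1 = Stage(1, Seq(stage0))
val stage2 = Stage(2, Seq(stage1))

val submitted = scala.collection.mutable.ArrayBuffer.empty[Int]
val waiting   = scala.collection.mutable.ArrayBuffer.empty[Int]

def submitStage(s: Stage): Unit = {
  val missing = s.parents.filterNot(p => submitted.contains(p.id))
  if (missing.isEmpty) submitted += s.id  // "missing: List()" -> run its tasks
  else {                                  // park this stage, try parents first
    waiting += s.id
    missing.foreach(submitStage)
  }
}

submitStage(stage2)   // mirrors submitStage(ResultStage 2) in the log
println(submitted)    // only Stage 0 has actually been submitted so far
println(waiting)      // Stage 2 and Stage 1 are parked as waiting
```

This matches the log above: after the recursion, waiting holds ShuffleMapStage 1 and ResultStage 2, and only ShuffleMapStage 0 proceeds to submitMissingTasks.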
- org.apache.spark.scheduler.DAGScheduler - Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WCAnalyzer.scala:24), which has no missing parents
  ShuffleMapStage 0 is submitted. The last RDD of this Stage is MapPartitionsRDD[3], produced by the map operator at line 24 of the code.
- org.apache.spark.scheduler.DAGScheduler - submitMissingTasks(ShuffleMapStage 0)
  The Stage's Tasks are submitted; a Stage corresponds to one TaskSet.
- org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 0.0 with 1 tasks
  The TaskSchedulerImpl scheduler adds a TaskSet.
- org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7909 bytes)
  TaskSetManager starts task 0.0 of stage 0.0 (TID=0, host=localhost, executor=driver, partition=0, taskLocality=PROCESS_LOCAL, serializedTask=7909 bytes).
- org.apache.spark.executor.Executor - Running task 0.0 in stage 0.0 (TID 0)
  The Executor runs the task.
- org.apache.spark.executor.Executor - Finished task 0.0 in stage 0.0 (TID 0). 1159 bytes result sent to driver
  The Executor finishes the task and sends the serialized result (1159 bytes) back to the driver.
- org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 0.0 (TID 0) in 194 ms on localhost (executor driver) (1/1)
  TaskSetManager marks the task finished; (1/1) is finished tasks / total tasks.
- org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 0.0, whose tasks have all completed, from pool
  TaskSchedulerImpl removes the completed TaskSet.
- org.apache.spark.scheduler.DAGScheduler - ShuffleMapTask finished on driver
  The DAGScheduler records the ShuffleMapTask as finished.
- org.apache.spark.scheduler.DAGScheduler - ShuffleMapStage 0 (map at WCAnalyzer.scala:24) finished in 0.289 s
  ShuffleMapStage 0 is finished, taking 0.289 s in total.
- org.apache.spark.scheduler.DAGScheduler - looking for newly runnable stages
  org.apache.spark.scheduler.DAGScheduler - running: Set()
  org.apache.spark.scheduler.DAGScheduler - waiting: Set(ShuffleMapStage 1, ResultStage 2)
  org.apache.spark.scheduler.DAGScheduler - failed: Set()
  org.apache.spark.MapOutputTrackerMaster - Increasing epoch to 1
  org.apache.spark.scheduler.DAGScheduler - Checking if any dependencies of ShuffleMapStage 0 are now runnable
  org.apache.spark.scheduler.DAGScheduler - running: Set()
  org.apache.spark.scheduler.DAGScheduler - waiting: Set(ShuffleMapStage 1, ResultStage 2)
  org.apache.spark.scheduler.DAGScheduler - failed: Set()
  After a Stage finishes, the DAGScheduler checks whether any waiting Stages have become runnable, and submits the next ready Stage.
- ============================ ShuffleMapStage 1 submission and execution ==========================
  org.apache.spark.scheduler.DAGScheduler - submitStage(ShuffleMapStage 1)
  org.apache.spark.scheduler.DAGScheduler - missing: List()
  org.apache.spark.scheduler.DAGScheduler - Submitting ShuffleMapStage 1 (MapPartitionsRDD[5] at map at WCAnalyzer.scala:28), which has no missing parents
  org.apache.spark.scheduler.DAGScheduler - submitMissingTasks(ShuffleMapStage 1)
  org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 1.0 with 1 tasks
  org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 7638 bytes)
  org.apache.spark.executor.Executor - Running task 0.0 in stage 1.0 (TID 1)
  org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 1331 bytes result sent to driver
  org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 102 ms on localhost (executor driver) (1/1)
  org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
  org.apache.spark.scheduler.DAGScheduler - ShuffleMapTask finished on driver
  org.apache.spark.scheduler.DAGScheduler - ShuffleMapStage 1 (map at WCAnalyzer.scala:28) finished in 0.117 s
  ============================ ResultStage 2 submission and execution =============================
  org.apache.spark.scheduler.DAGScheduler - looking for newly runnable stages
  org.apache.spark.scheduler.DAGScheduler - running: Set()
  org.apache.spark.scheduler.DAGScheduler - waiting: Set(ResultStage 2)
  org.apache.spark.scheduler.DAGScheduler - failed: Set()
  org.apache.spark.MapOutputTrackerMaster - Increasing epoch to 2
  org.apache.spark.scheduler.DAGScheduler - Checking if any dependencies of ShuffleMapStage 1 are now runnable
  org.apache.spark.scheduler.DAGScheduler - running: Set()
  org.apache.spark.scheduler.DAGScheduler - waiting: Set(ResultStage 2)
  org.apache.spark.scheduler.DAGScheduler - failed: Set()
  org.apache.spark.scheduler.DAGScheduler - submitStage(ResultStage 2)
  org.apache.spark.scheduler.DAGScheduler - missing: List()
  org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 2 (MapPartitionsRDD[7] at map at WCAnalyzer.scala:32), which has no missing parents
  org.apache.spark.scheduler.DAGScheduler - submitMissingTasks(ResultStage 2)
  org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 2.0 with 1 tasks
  org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 2.0 (TID 2, localhost, executor driver, partition 0, ANY, 7649 bytes)
  org.apache.spark.executor.Executor - Running task 0.0 in stage 2.0 (TID 2)
  org.apache.spark.executor.Executor - Finished task 0.0 in stage 2.0 (TID 2). 1387 bytes result sent to driver
  org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 2.0 (TID 2) in 44 ms on localhost (executor driver) (1/1)
  org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 2.0, whose tasks have all completed, from pool
  org.apache.spark.scheduler.DAGScheduler - ResultStage 2 (collect at WCAnalyzer.scala:34) finished in 0.057 s
  The submission and execution of ShuffleMapStage 1 and ResultStage 2 above follow the same pattern as ShuffleMapStage 0, so they are not repeated here.
- org.apache.spark.scheduler.DAGScheduler - Job 0 finished: collect at WCAnalyzer.scala:34, took 0.770898 s
  Once all Stages have been submitted and computed, the DAGScheduler ends the Job.
Note:
This article is the author's original work; please credit the source when republishing.
If any quoted text, links, or images infringe on their original authors' rights, please contact me and they will be removed immediately.