Dissecting Spark by Hand: The WordCount RDD Execution Flow

Preface

When a Spark program initializes, it constructs objects such as DAGScheduler, TaskSchedulerImpl, and MapOutputTrackerMaster. DAGScheduler is mainly responsible for building the DAG, starting Jobs, and submitting Stages; TaskSchedulerImpl is mainly responsible for adding and scheduling Task Sets; and MapOutputTrackerMaster is mainly responsible for tracking the locations of shuffle map outputs. These internals are not elaborated further here.

A few concepts to keep in mind:

  • Application //one Spark program has exactly one Application, and therefore a single unique applicationId
  • Job //calling an Action operator triggers runJob; each runJob call produces one Job
  • Stage //each wide dependency encountered produces a new Stage
  • Task //the smallest unit of execution in a Spark program

Note: one Spark program has exactly 1 Application, and may have 1~N Jobs, 1~N Stages, and 1~N Tasks:

1 Application = [1 ~ N ] Job
1 Job = [ 1 ~ N ] Stage
1 Stage = [ 1 ~ N ] Task
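
To make the Job count concrete, here is a minimal sketch (a hypothetical program, not the example traced below): every action such as count() or collect() calls runJob, so this single Application produces two Jobs.

    import org.apache.spark.{SparkConf, SparkContext}

    object JobCountDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[1]").setAppName("JobCountDemo"))

        val words = sc.textFile("data/data.txt").flatMap(_.split(" "))

        println(words.count())           //Action #1: triggers runJob -> Job 0
        words.collect().foreach(println) //Action #2: triggers runJob -> Job 1

        sc.stop() //still one Application with a single applicationId
      }
    }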

Software Environment

  • Spark: 2.3.0

Code

Let's write a simple WordCount program.

data.txt

hello world
hello java
hello scala
hello hadoop
hello spark

WCAnalyzer.scala


    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.{SparkConf, SparkContext}

    object WCAnalyzer {

      def main(args: Array[String]): Unit = {
        //Set the log level so the scheduler logs are easy to observe
        Logger.getLogger("org.apache").setLevel(Level.ALL)

        //Create the SparkContext
        val sc = new SparkContext(new SparkConf().setMaster("local[1]")
          .setAppName("WCAnalyzer"))

        //Read the data from the file
        sc.textFile("data/data.txt", 1)
          //Split each line on spaces into individual words
          .flatMap(_.split(" "))
          //Pair each word with a 1 to form a Tuple
          .map((_, 1))
          //Aggregate the counts of identical words
          .reduceByKey(_ + _)
          //Swap the aggregated (key,value) pairs to (value,key) for sorting
          .map(v => (v._2, v._1))
          //Sort by word count in descending order
          .sortByKey(false)
          //Swap the sorted pairs back to (word, count)
          .map(v => (v._2, v._1))
          //Collect the results to the driver
          .collect()
          //Print the results
          .foreach(println)

        //Shut down the SparkContext
        sc.stop()
      }
    }
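
As an aside, the swap → sortByKey → swap pattern can also be expressed with sortBy, which sorts by an extracted key directly. This is an equivalent alternative, not the code traced below, and it would change the RDD lineage slightly (sortBy internally does keyBy → sortByKey → values):

    //Equivalent alternative: sort the (word, count) pairs by count directly,
    //without the two swapping map calls. sortBy still introduces a shuffle.
    sc.textFile("data/data.txt", 1)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .collect()
      .foreach(println)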

Process Analysis

This code produces exactly one Job, 3 Stages, and 8 RDDs.
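
Besides reading the logs, these counts can be confirmed empirically by registering a SparkListener before triggering the job. A minimal sketch (the counter variables are hypothetical helpers; listener events are delivered asynchronously, so read the counts only after the job has finished):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

    var jobs = 0
    var stages = 0
    //Register before calling any action so no events are missed
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = jobs += 1
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = stages += 1
    })
    //...run the WordCount job above, wait for it to finish, then:
    println(s"jobs=$jobs, stages=$stages") //expected: jobs=1, stages=3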

  • Stage division

    Stages are divided from back to front: each wide dependency encountered marks a new Stage boundary. This simple WordCount therefore splits into 3 Stages: Stage 0, formed by the textFile, flatMap, and map operators; Stage 1, formed by the reduceByKey and map operators; and Stage 2, formed by the sortByKey and map operators. The sketch below shows one way to inspect these boundaries.
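
    A quick way to verify the division is RDD.toDebugString, which prints the lineage with an extra indentation level at every shuffle boundary, i.e. at every stage boundary. A minimal sketch (the variable name wc is hypothetical):

        //Build the same lineage, then print it before triggering any job
        val wc = sc.textFile("data/data.txt", 1)
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .map(v => (v._2, v._1))
          .sortByKey(false)
          .map(v => (v._2, v._1))
        //Each indentation step in the output corresponds to a ShuffleDependency
        println(wc.toDebugString)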

  • RDD creation

    textFile(HadoopRDD [0] ,MapPartitionsRDD [1] ) //the number in [ ] is the RDD's id

    flatMap(MapPartitionsRDD [2] )

    map(MapPartitionsRDD [3] )

    reduceByKey(ShuffledRDD [4] )

    map(MapPartitionsRDD [5] )

    sortByKey(ShuffledRDD [6] )

    map(MapPartitionsRDD [7] )
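
    These ids can be confirmed programmatically: every RDD exposes an id field, assigned in creation order within the SparkContext. A minimal sketch (assuming a fresh SparkContext, so numbering starts at 0):

        val mapped = sc.textFile("data/data.txt", 1) //HadoopRDD[0] -> MapPartitionsRDD[1]
          .flatMap(_.split(" "))                     //MapPartitionsRDD[2]
          .map((_, 1))                               //MapPartitionsRDD[3]
        //id of the last RDD built so far; prints 3 on a fresh context
        println(mapped.id)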

  • Log analysis

    • org.apache.spark.SparkContext                     - Starting job: collect at WCAnalyzer.scala:34
      

      The collect operator triggers runJob and starts a Job. Note that the foreach in foreach(println) is called on the Array returned by collect; it is the Scala collection method, not the RDD action, so it triggers no runJob and produces no Job. The sketch below contrasts the two.
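
      For contrast, a hedged sketch: calling foreach directly on an RDD is an action and would start a Job of its own, while foreach on the collected Array runs locally on the driver:

          val counts = sc.textFile("data/data.txt").map((_, 1)).reduceByKey(_ + _)
          //RDD action: triggers runJob and produces a Job (printing happens on executors)
          counts.foreach(println)
          //Scala collection method on the driver: collect() is the only action here
          counts.collect().foreach(println)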

    • org.apache.spark.scheduler.DAGScheduler           - Got job 0 (collect at WCAnalyzer.scala:34) with 1 output partitions
      
      Job 0 is created. It was triggered by the collect operator at line 34 of the code and has one output partition.
      
    • org.apache.spark.scheduler.DAGScheduler          - Final stage: ResultStage 2 (collect at WCAnalyzer.scala:34)
      org.apache.spark.scheduler.DAGScheduler          - Parents of final stage: List(ShuffleMapStage 1)
      org.apache.spark.scheduler.DAGScheduler          - Missing parents: List(ShuffleMapStage 1)
      org.apache.spark.scheduler.DAGScheduler          - submitStage(ResultStage 2)
      org.apache.spark.scheduler.DAGScheduler          - missing: List(ShuffleMapStage 1)
      org.apache.spark.scheduler.DAGScheduler          - submitStage(ShuffleMapStage 1)
      org.apache.spark.scheduler.DAGScheduler          - missing: List(ShuffleMapStage 0)
      org.apache.spark.scheduler.DAGScheduler          - submitStage(ShuffleMapStage 0)
      org.apache.spark.scheduler.DAGScheduler          - missing: List()
      

      The Job's final stage is ResultStage 2; the parent of the final stage is ShuffleMapStage 1, and the missing (not yet computed) parent is also ShuffleMapStage 1.

    • org.apache.spark.scheduler.DAGScheduler         - submitStage(ResultStage 2)
      

      Attempt to submit ResultStage 2.

    • org.apache.spark.scheduler.DAGScheduler         - missing: List(ShuffleMapStage 1)
      

      ShuffleMapStage 1 is still missing.

    • org.apache.spark.scheduler.DAGScheduler         - submitStage(ShuffleMapStage 1)
      

      Attempt to submit ShuffleMapStage 1.

    • org.apache.spark.scheduler.DAGScheduler         - missing: List(ShuffleMapStage 0)
      

      ShuffleMapStage 0 is still missing.

    • org.apache.spark.scheduler.DAGScheduler         - submitStage(ShuffleMapStage 0)
      

      Attempt to submit ShuffleMapStage 0.

    • org.apache.spark.scheduler.DAGScheduler         - missing: List()
      

      No Stages are missing.

    • org.apache.spark.scheduler.DAGScheduler         - Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WCAnalyzer.scala:24), which has no missing parents
      

      ShuffleMapStage 0 is submitted. The last RDD of this Stage is MapPartitionsRDD[3], created by the map operator at line 24 of the code.

    • org.apache.spark.scheduler.DAGScheduler         - submitMissingTasks(ShuffleMapStage 0)
      

      Its tasks are submitted: each Stage corresponds to one Task Set, with one Task per partition.

    • org.apache.spark.scheduler.TaskSchedulerImpl    - Adding task set 0.0 with 1 tasks
      

      The TaskSchedulerImpl scheduler adds the Task Set.

    • org.apache.spark.scheduler.TaskSetManager       - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7909 bytes)
      

      TaskSetManager starts task 0.0 of stage 0.0 (TID=0, host=localhost, executor=driver, partition=0, taskLocality=PROCESS_LOCAL, serializedTask=7909 bytes).

    • org.apache.spark.executor.Executor              - Running task 0.0 in stage 0.0 (TID 0)
      

      The Executor runs the task.

    • org.apache.spark.executor.Executor              - Finished task 0.0 in stage 0.0 (TID 0). 1159 bytes result sent to driver
      

      The Executor finishes the task and sends the serialized result (1159 bytes) back to the driver.

    • org.apache.spark.scheduler.TaskSetManager       - Finished task 0.0 in stage 0.0 (TID 0) in 194 ms on localhost (executor driver) (1/1)
      

      TaskSetManager records the finished task; (1/1) is the number of completed tasks out of the total number of tasks.

    • org.apache.spark.scheduler.TaskSchedulerImpl     - Removed TaskSet 0.0, whose tasks have all completed, from pool 
      

      TaskSchedulerImpl removes the completed TaskSet from the pool.

    • org.apache.spark.scheduler.DAGScheduler          - ShuffleMapTask finished on driver
      

      DAGScheduler registers that the ShuffleMapTask has finished.

    • org.apache.spark.scheduler.DAGScheduler          - ShuffleMapStage 0 (map at WCAnalyzer.scala:24) finished in 0.289 s
      

      DAGScheduler marks ShuffleMapStage 0 as finished, taking 0.289 s in total.

    • org.apache.spark.scheduler.DAGScheduler          - looking for newly runnable stages
      org.apache.spark.scheduler.DAGScheduler          - running: Set()
      org.apache.spark.scheduler.DAGScheduler          - waiting: Set(ShuffleMapStage 1, ResultStage 2)
      org.apache.spark.scheduler.DAGScheduler          - failed: Set()
      org.apache.spark.MapOutputTrackerMaster          - Increasing epoch to 1
      org.apache.spark.scheduler.DAGScheduler          - Checking if any dependencies of ShuffleMapStage 0 are now runnable
      org.apache.spark.scheduler.DAGScheduler          - running: Set()
      org.apache.spark.scheduler.DAGScheduler          - waiting: Set(ShuffleMapStage 1, ResultStage 2)
      org.apache.spark.scheduler.DAGScheduler          - failed: Set()
      

      After a Stage completes, the DAGScheduler looks for newly runnable stages among those still waiting and submits any whose parents are now all computed.

    • ============================   Submission and computation of ShuffleMapStage 1  ==========================
      org.apache.spark.scheduler.DAGScheduler          - submitStage(ShuffleMapStage 1)
      org.apache.spark.scheduler.DAGScheduler          - missing: List()
      org.apache.spark.scheduler.DAGScheduler          - Submitting ShuffleMapStage 1 (MapPartitionsRDD[5] at map at WCAnalyzer.scala:28), which has no missing parents
      org.apache.spark.scheduler.DAGScheduler          - submitMissingTasks(ShuffleMapStage 1)
      org.apache.spark.scheduler.TaskSchedulerImpl     - Adding task set 1.0 with 1 tasks
      org.apache.spark.scheduler.TaskSetManager        - Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 7638 bytes)
      org.apache.spark.executor.Executor               - Running task 0.0 in stage 1.0 (TID 1)
      org.apache.spark.executor.Executor               - Finished task 0.0 in stage 1.0 (TID 1). 1331 bytes result sent to driver
      org.apache.spark.scheduler.TaskSetManager        - Finished task 0.0 in stage 1.0 (TID 1) in 102 ms on localhost (executor driver) (1/1)
      org.apache.spark.scheduler.TaskSchedulerImpl     - Removed TaskSet 1.0, whose tasks have all completed, from pool 
      org.apache.spark.scheduler.DAGScheduler          - ShuffleMapTask finished on driver
      org.apache.spark.scheduler.DAGScheduler          - ShuffleMapStage 1 (map at WCAnalyzer.scala:28) finished in 0.117 s
      ============================   Submission and computation of ResultStage 2  =============================
      org.apache.spark.scheduler.DAGScheduler          - looking for newly runnable stages
      org.apache.spark.scheduler.DAGScheduler          - running: Set()
      org.apache.spark.scheduler.DAGScheduler          - waiting: Set(ResultStage 2)
      org.apache.spark.scheduler.DAGScheduler          - failed: Set()
      org.apache.spark.MapOutputTrackerMaster          - Increasing epoch to 2
      org.apache.spark.scheduler.DAGScheduler          - Checking if any dependencies of ShuffleMapStage 1 are now runnable
      org.apache.spark.scheduler.DAGScheduler          - running: Set()
      org.apache.spark.scheduler.DAGScheduler          - waiting: Set(ResultStage 2)
      org.apache.spark.scheduler.DAGScheduler          - failed: Set()
      org.apache.spark.scheduler.DAGScheduler          - submitStage(ResultStage 2)
      org.apache.spark.scheduler.DAGScheduler          - missing: List()
      org.apache.spark.scheduler.DAGScheduler          - Submitting ResultStage 2 (MapPartitionsRDD[7] at map at WCAnalyzer.scala:32), which has no missing parents
      org.apache.spark.scheduler.DAGScheduler          - submitMissingTasks(ResultStage 2)
      org.apache.spark.scheduler.TaskSchedulerImpl     - Adding task set 2.0 with 1 tasks
      org.apache.spark.scheduler.TaskSetManager        - Starting task 0.0 in stage 2.0 (TID 2, localhost, executor driver, partition 0, ANY, 7649 bytes)
      org.apache.spark.executor.Executor               - Running task 0.0 in stage 2.0 (TID 2)
      org.apache.spark.executor.Executor               - Finished task 0.0 in stage 2.0 (TID 2). 1387 bytes result sent to driver
      org.apache.spark.scheduler.TaskSetManager        - Finished task 0.0 in stage 2.0 (TID 2) in 44 ms on localhost (executor driver) (1/1)
      org.apache.spark.scheduler.TaskSchedulerImpl     - Removed TaskSet 2.0, whose tasks have all completed, from pool 
      org.apache.spark.scheduler.DAGScheduler          - ResultStage 2 (collect at WCAnalyzer.scala:34) finished in 0.057 s
      

      The above shows the submission and computation of ShuffleMapStage 1 and ResultStage 2; the process mirrors that of ShuffleMapStage 0 and is not repeated here.

      org.apache.spark.scheduler.DAGScheduler         - Job 0 finished: collect at WCAnalyzer.scala:34, took 0.770898 s
      

      Once all Stages have been submitted and computed, the DAGScheduler marks the Job as finished.

Note:

This article is the author's original work; please credit the source when reposting.
If any quoted text, links, or images herein infringe on the rights of their original authors, please contact me and they will be removed immediately.
