1 The Job execution flow, seen from reduce
1.1 The reduce operation
Let's take the reduce operation as an example and walk through how a job is executed.
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
The concrete roles of the two helpers defined inside reduce were summarized earlier: reducePartition folds the elements of one partition with the cleaned user function (returning None for an empty partition), and mergeResult merges each partition's result into jobResult on the driver.
At the end, reduce calls sc.runJob(this, reducePartition, mergeResult), which runs reducePartition on every partition and hands each partial result to mergeResult.
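To make that cooperation concrete, here is a minimal hedged sketch in plain Scala (not Spark source; the partition contents are made up) showing how the per-partition fold and the driver-side merge combine:
// Each "partition" is folded locally by reducePartition; the per-partition
// Options are then merged sequentially on the driver, exactly like mergeResult.
val f: (Int, Int) => Int = _ + _
val partitions: Seq[Iterator[Int]] =
  Seq(Iterator(1, 2, 3), Iterator.empty, Iterator(4, 5)) // hypothetical partition contents

val reducePartition: Iterator[Int] => Option[Int] =
  iter => if (iter.hasNext) Some(iter.reduceLeft(f)) else None

var jobResult: Option[Int] = None
partitions.map(reducePartition).foreach { taskResult =>
  if (taskResult.isDefined) {
    jobResult = jobResult match {
      case Some(value) => Some(f(value, taskResult.get))
      case None        => taskResult
    }
  }
}
assert(jobResult == Some(15)) // the empty partition is simply skipped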
So what does runJob actually do?
2 runJob in SparkContext
Now step into runJob in SparkContext:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
runJob's parameters:
rdd: the target RDD;
func: the function to run on each partition of the RDD;
partitions: the partitions to compute; some jobs do not compute all partitions of the target RDD (take(), for example, may only need the first few);
resultHandler: a callback that receives the result of each partition.
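A small, hedged example of how these parameters show up from user code (it assumes a live SparkContext named sc; the data and partition ids are made up):
// Four partitions of made-up data.
val data = sc.parallelize(1 to 100, numSlices = 4)

// partitions: a job that only computes partitions 0 and 2 and returns their sizes.
val partialSizes: Array[Int] =
  sc.runJob(data, (iter: Iterator[Int]) => iter.size, Seq(0, 2))

// resultHandler: the overload that reduce uses pushes every partition's result
// into a callback on the driver instead of collecting an Array.
var total = 0
sc.runJob(data, (iter: Iterator[Int]) => iter.sum,
  (_: Int, partSum: Int) => total += partSum)
// total now holds 5050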
2.1 Management by the DAGScheduler
From here on the job is managed by the DAGScheduler: its runJob method submits the job, which in turn calls submitJob.
Before following that path, let's look at what the DAGScheduler actually holds.
2.1.1 Stage-related fields of the DAGScheduler
First, the counters and maps that hand out IDs:
private[scheduler] val nextJobId = new AtomicInteger(0)                   // job ids start at 0
private[scheduler] def numTotalJobs: Int = nextJobId.get()                // every job gets an id, so the counter equals the number of jobs
private val nextStageId = new AtomicInteger(0)                            // stage id counter
private[scheduler] val jobIdToStageIds = new HashMap[Int, HashSet[Int]]   // one job can map to several stages
private[scheduler] val stageIdToStage = new HashMap[Int, Stage]
Then the hash tables and sets that track stages:
private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]  // map stages created for shuffle dependencies
private[scheduler] val jobIdToActiveJob = new HashMap[Int, ActiveJob]           // currently active jobs
// stages whose parent stages have not finished yet
private[scheduler] val waitingStages = new HashSet[Stage]
// stages that are currently running
private[scheduler] val runningStages = new HashSet[Stage]
2.1.2 submitJob
DAGScheduler.runJob actually delegates to submitJob:
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
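What happens to the returned JobWaiter? Roughly the following, shown here as a hedged sketch of DAGScheduler.runJob (the exact code differs between Spark versions): the calling thread blocks until every partition result has been delivered to resultHandler, and then the job's success or failure is reported.
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
// Block until the JobWaiter is completed by the last taskSucceeded/jobFailed call.
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
waiter.completionFuture.value.get match {
  case scala.util.Success(_) =>
    logInfo("Job %d finished: %s".format(waiter.jobId, callSite.shortForm))
  case scala.util.Failure(exception) =>
    logInfo("Job %d failed: %s".format(waiter.jobId, callSite.shortForm))
    throw exception
}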
submitJob posts this action to the scheduler's event loop and returns a JobWaiter:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // (check that the requested partition ids are in range, omitted here)
  val jobId = nextJobId.getAndIncrement()  // allocate a new job id
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]  // erase the element type so the function can be attached to tasks
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
2.1.3 DAGSchedulerEventProcessLoop
The submitted event now enters the DAGSchedulerEventProcessLoop.
First, eventProcessLoop.post:
def post(event: E): Unit = {
  eventQueue.put(event)
}
eventQueue is a blocking queue of events, so post simply enqueues the event.
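For intuition, here is a minimal, hedged sketch of the post/onReceive pattern that the event loop follows (plain Scala, not Spark source; the names are illustrative): a producer thread enqueues events, and a dedicated daemon thread dequeues and dispatches them.
import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          onReceive(eventQueue.take())  // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // the loop was stopped
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)  // what the post() above does
  protected def onReceive(event: E): Unit
}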
Once an event is dequeued, doOnReceive dispatches on its type. Since a JobSubmitted object was posted, the match executes handleJobSubmitted:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  // ... the remaining cases are omitted
}
So handleJobSubmitted is where stages are actually split and submitted.
2.2 handleJobSubmitted
Here is the full handleJobSubmitted:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // build the final stage of the job
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)  // represents a job running in the DAG
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)  // submit the final stage
}
Inside this method, the final stage is obtained first:
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
Step into createResultStage:
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val parents = getOrCreateParentStages(rdd, jobId)  // get the parent stages
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
Here the flow reaches a key function:
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  getShuffleDependencies(rdd).map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}
getOrCreateParentStages takes the current RDD and job id and returns the list of parent stages, i.e. the ShuffleMapStages that this ResultStage directly depends on (the new stage's own id is allocated separately in createResultStage via nextStageId). It first collects the direct shuffle dependencies of the RDD with getShuffleDependencies:
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]  // the shuffle dependencies found so far
  val visited = new HashSet[RDD[_]]                      // RDDs already visited
  val waitingForVisit = new Stack[RDD[_]]                // RDDs still to visit
  waitingForVisit.push(rdd)  // start from the given RDD
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep
        case dependency =>
          waitingForVisit.push(dependency.rdd)  // narrow dependency: keep walking up the lineage
      }
    }
  }
  parents
}
For every RDD popped off the stack, its dependencies are inspected:
if a dependency is narrow, its parent RDD is pushed onto the stack so the walk continues up the lineage;
if it is a shuffle (wide) dependency, it is added to parents. The returned parents set therefore holds the shuffle dependencies closest to the given RDD.
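A hedged illustration of where these shuffle boundaries fall in a small lineage (it assumes a SparkContext named sc and an input file; both are placeholders):
val words   = sc.textFile("input.txt").flatMap(_.split(" "))  // narrow dependencies
val counts  = words.map((_, 1)).reduceByKey(_ + _)            // first ShuffleDependency
val lengths = counts.map { case (w, n) => (w.length, n) }     // narrow dependency
val byLen   = lengths.reduceByKey(_ + _)                      // second ShuffleDependency
// getShuffleDependencies(byLen) returns only the ShuffleDependency introduced by
// the second reduceByKey; the first one is discovered later, while building the
// parent ShuffleMapStage for that dependency.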
With those shuffle dependencies in hand, getOrCreateShuffleMapStage either reuses an existing ShuffleMapStage for the shuffle or creates it (together with any missing ancestor stages):
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    // if shuffleIdToMapStage already has a ShuffleMapStage for this shuffle, return it directly
    case Some(stage) =>
      stage
    // otherwise, first create stages for any ancestor shuffle dependencies that are still missing
    case None =>
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
A related helper, getMissingParentStages, walks a stage's lineage and collects the parent stages that are "missing", i.e. whose output is not yet available; an empty result means the stage either has no parents or all of its parents have already been computed:
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]     // parent stages that still need to be computed
  val visited = new HashSet[RDD[_]]    // RDDs already handled
  // RDDs not yet handled go onto this stack
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            // a shuffle dependency marks a stage boundary, so it maps to a (possibly new) ShuffleMapStage
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // a narrow dependency stays inside the current stage; keep walking up the lineage
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)  // start from the stage's own RDD and walk back through its lineage
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList  // return the missing parent stages as a list
}
createShuffleMapStage finally builds and registers the ShuffleMapStage:
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length  // one task per partition
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep)
  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // fetch the serialized MapOutputStatus objects for this shuffleId from the mapOutputTracker
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)  // deserialize them
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
With this, getOrCreateParentStages has produced the list of parent stages that the RDD depends on.
createResultStage returns the ResultStage bound to this jobId, which is assigned to finalStage.
Back in handleJobSubmitted, an ActiveJob is then created for the resulting stage graph:
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
Then the job's final stage is submitted:
submitStage(finalStage)
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)  // find the job id this stage belongs to
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // only proceed if the stage is not already waiting, running, or marked as failed
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)  // parent stages that still have to be computed
      logDebug("missing: " + missing)
      if (missing.isEmpty) {  // all parents are available
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)  // every parent stage is done, so this stage can be turned into tasks
      } else {
        for (parent <- missing) {
          submitStage(parent)  // otherwise recursively submit the missing parent stages first
        }
        waitingStages += stage  // park this stage in waitingStages until its parents finish
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
Finally, let's analyse submitMissingTasks.
Its main work is to:
1. determine which partitions of the stage need to be computed;
2. start the stage (via outputCommitCoordinator.stageStart);
3. build a map from partition id to preferred task locations;
4. mark a new stage attempt;
5. serialize the stage's RDD (plus shuffle dependency or result function) and broadcast it;
6. create one task per partition and collect them into a TaskSet;
7. submit the tasks via taskScheduler.submitTasks().
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // the ids of the partitions that still need to be computed
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // the properties of the job this stage belongs to
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage  // add the stage to the set of running stages
  // depending on the kind of stage, tell the output commit coordinator that the stage starts
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  // compute the preferred locations for every partition that will be computed
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  // If there are tasks to execute, record the submission time of the stage. Otherwise,
  // post the event without the submission time, which indicates that this stage was
  // skipped.
  if (partitionsToCompute.nonEmpty) {
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  }
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  // create one task for each partition of the stage
  val tasks: Seq[Task[_]] = try {
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      case stage: ShuffleMapStage =>
        stage.pendingPartitions.clear()
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          stage.pendingPartitions += id
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }
      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  if (tasks.size > 0) {
    // if there are tasks, submit them through taskScheduler.submitTasks(); otherwise mark the stage as finished
    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)
    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage: ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
    submitWaitingChildStages(stage)
  }
}
3 Summary of stage splitting
First, an action triggers computation and submits a job to the DAGScheduler.
On receipt, the DAGScheduler starts from the RDD at the end of the dependency chain and walks through the dependencies of every RDD it reaches.
Whenever a shuffle dependency appears among an RDD's dependencies, that RDD marks a stage boundary: its shuffle output becomes the input of the current stage, and a new parent stage is created for the shuffle.
The result is a set of stages; the stage that directly triggered the job becomes the final stage (a ResultStage), and an ActiveJob instance is created for it.
When a stage is submitted, the scheduler first checks whether its parent stages' results are available; if so, the stage is submitted, otherwise it is placed into waitingStages and its missing parents are submitted first.
When the tasks of an intermediate stage finish, the DAGScheduler checks whether all of that stage's tasks are complete and rescans waitingStages, submitting every stage whose parents are now all finished, until no waiting stages remain.
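As a closing, hedged illustration of this whole flow (it assumes a live SparkContext named sc; the data and partition count are made up), the following driver program produces one job with two stages:
val rdd = sc.parallelize(1 to 1000, numSlices = 4)  // 4 partitions -> 4 tasks per stage
  .map(i => (i % 10, i))                            // narrow dependency: stays in the same stage
  .reduceByKey(_ + _)                               // ShuffleDependency: stage boundary
  .map { case (_, sum) => sum }                     // narrow dependency: same stage as the result

val total = rdd.reduce(_ + _)  // the action: triggers runJob -> submitJob -> handleJobSubmitted
// Stage 0 (ShuffleMapStage): parallelize -> map, writes shuffle output.
// Stage 1 (ResultStage): reduceByKey -> map -> reduce, reads that shuffle output.
println(total)  // 500500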