Spark Source Code Analysis - Job Execution Process

Spark version: 2.0.0

1. Introduction

From the previous article on spark-submit we know that, in client mode, what ultimately runs is the main method of the class specified by --class, and this is also the entry point of job execution. Next, we will use a simple example to walk through how a job is executed.

2. Job Execution Process

val sparkConf = new SparkConf().setAppName("earthquake").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val lines = sc.textFile("home/haha/helloSpark.txt")
lines.flatMap(_.split(" ")).count()

This is a very common example. flatMap is only a transformation and does not trigger job execution by itself; it is the count action that actually runs the job, i.e. it ends up calling SparkContext.runJob.
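To make the lazy-evaluation point concrete, here is a small continuation of the example above (just a sketch; `words` is an intermediate RDD introduced for illustration):

    // A transformation only records a dependency in the lineage; nothing runs here.
    val words = lines.flatMap(_.split(" "))
    // Actions are what actually call SparkContext.runJob and submit a job.
    val totalWords = words.count()
    val firstTen   = words.take(10)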

RDD.scala
--------------

  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
  
SparkContext.scala
-----------------

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    // getCallSite parses the call stack to find the user code that called into Spark and the Spark method it invoked
    val callSite = getCallSite
    // clean() delegates to ClosureCleaner.clean(), which strips references that cannot be serialized out of the closure, so the function can be shipped over the network without serialization failures
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    // Hand the job over to the DAGScheduler
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    // Mark the console progress bar for this job as finished
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

When running a job, runJob also parses the call site, reports progress, triggers checkpointing, and so on, but the core of it is dagScheduler.runJob, which hands the job over to the DAGScheduler. Let's look at how that method is implemented:

DAGScheduler.scala
-----------------

 def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    // Submit the job and get back a JobWaiter to wait on
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    val awaitPermission = null.asInstanceOf[scala.concurrent.CanAwait]
    // Block until the job completes (successfully or not)
    waiter.completionFuture.ready(Duration.Inf)(awaitPermission)
    // Inspect the result of the job
    waiter.completionFuture.value.get match {
      case scala.util.Success(_) =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case scala.util.Failure(exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }

After submitting the job, runJob just waits for it to finish and handles the outcome, so job execution at this level is synchronous; a short aside on the asynchronous alternative follows, after which we turn to the method that matters here, submitJob.
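A short aside: runJob blocks the calling thread until the job completes. If that is not desired at the user level, the asynchronous actions available on RDD return a FutureAction instead of waiting internally on the JobWaiter. A hedged example, reusing `lines` from the earlier snippet:

    import scala.concurrent.Await
    import scala.concurrent.duration.Duration

    // countAsync returns a FutureAction, which extends scala.concurrent.Future,
    // so it can be awaited or composed with callbacks instead of blocking the caller.
    val futureCount = lines.flatMap(_.split(" ")).countAsync()
    val total = Await.result(futureCount, Duration.Inf)

Back to the synchronous path: the method we need to look at is submitJob.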

 def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    // Validate that all requested partition ids exist
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }
    // Generate a jobId (auto-incrementing)
    val jobId = nextJobId.getAndIncrement()

    // If there are no partitions to compute, return a JobWaiter for 0 tasks immediately
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    // The waiter is the callback object used to wait for the job
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    // Post the job to the event queue; it will be picked up and executed from there
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }

submitJob mainly performs validation and assigns a unique job id, and finally wraps the job into a JobSubmitted event that it posts to the event loop eventProcessLoop (a DAGSchedulerEventProcessLoop). DAGSchedulerEventProcessLoop itself does not define a post method, so we have to look at the post method of its parent class EventLoop.

EventLoop.scala
-----------------



  /**
   * Put an event into the event queue
   */
  def post(event: E): Unit = {
    eventQueue.put(event)
  }


  /**
   * Daemon thread that takes events off the queue and processes them
   */
  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) =>
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }

  }

The eventThread field above is a daemon thread: whenever an event is put into eventQueue, this background thread takes it and processes it by calling onReceive.

  // Receive and process an event
  override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
  }

  /**
    * The actual event-dispatching logic
    */
  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

For a JobSubmitted event the handler ends up in dagScheduler.handleJobSubmitted. This method determines the final stage, fires the job-start listener notification, and finally submits all the stages.

  /**
    * Handle a submitted job
    */
  private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    var finalStage: ResultStage = null
    try {
      // First create the ResultStage (the last stage of the job)
      finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
    // If the ResultStage was created successfully, the job can run; wrap it in an ActiveJob
    val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
    // Clear the cached RDD partition location data
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions".format(
      job.jobId, callSite.shortForm, partitions.length))
    logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))

    val jobSubmissionTime = clock.getTimeMillis()
    // jobId -> ActiveJob
    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.setActiveJob(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    // Post a job-start event to the listener bus (progress reporting, logging, etc.)
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
    // Submit the final stage (the core call that triggers task execution)
    submitStage(finalStage)

    submitWaitingStages()
  }

When submitting a stage, all of its parent stages must first be submitted successfully; only then are the stage's tasks submitted (one stage corresponds to multiple tasks).

  private def submitStage(stage: Stage) {
    // Find the jobId this stage belongs to
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      // Skip stages that are already waiting, running, or failed
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        // Find the parent stages that have not been computed yet
        val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        // If there are no missing parents, submit this stage's tasks directly
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")

          // Submit the set of tasks for this stage
          submitMissingTasks(stage, jobId.get)
        } else {
          // Recursively submit all missing parent stages first
          for (parent <- missing) {
            submitStage(parent)
          }
          // Park this stage in the waiting set until its parents finish
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }

submitMissingTasks mainly turns a ShuffleMapStage into a set of ShuffleMapTasks and a ResultStage into ResultTasks, and then hands them to the taskScheduler for execution.

DAGScheduler.scala
-----------------------
    
    .......
    /**
      * stage -> collection of tasks
      */
    val tasks: Seq[Task[_]] = try {
      stage match {
          // ShuffleMapStage
        case stage: ShuffleMapStage =>
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
          }
          // ResultStage
        case stage: ResultStage =>
          val job = stage.activeJob.get
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    if (tasks.size > 0) {
      logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
      stage.pendingPartitions ++= tasks.map(_.partitionId)
      logDebug("New pending partitions: " + stage.pendingPartitions)

      // Wrap the tasks into a TaskSet and hand it to the taskScheduler
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    } else {

      // Mark the stage as finished (there were no tasks to run)
      markStageAsFinished(stage, None)

      val debugString = stage match {
        case stage: ShuffleMapStage =>
          s"Stage ${stage} is actually done; " +
            s"(available: ${stage.isAvailable}," +
            s"available outputs: ${stage.numAvailableOutputs}," +
            s"partitions: ${stage.numPartitions})"
        case stage : ResultStage =>
          s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
      }
      logDebug(debugString)
    }

We can now give a brief summary: once a job is submitted it is handed to the DAGScheduler, which splits the job into stages, converts each stage into a TaskSet, and hands the TaskSets to the taskScheduler for execution.
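To tie this back to the example at the beginning: the flatMap/count job has no shuffle dependency, so the DAGScheduler collapses it into a single ResultStage whose tasks are ResultTasks; adding a shuffle puts a ShuffleMapStage in front of the final ResultStage. A small illustration, reusing `lines` from the earlier snippet (toDebugString prints the lineage the DAGScheduler walks when building stages):

    // No shuffle dependency: this job becomes a single ResultStage (ResultTasks only).
    val words = lines.flatMap(_.split(" "))
    println(words.toDebugString)
    words.count()

    // A shuffle (reduceByKey) adds a ShuffleMapStage in front of the final ResultStage;
    // its ShuffleMapTasks write map output that the next stage's tasks will fetch.
    val counts = words.map((_, 1)).reduceByKey(_ + _)
    println(counts.toDebugString)
    counts.count()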

Now let's look at the implementation of submitTasks. Note that the class actually being called is TaskSchedulerImpl.

TaskSchedulerImpl.scala
-----------------

 override def submitTasks(taskSet: TaskSet) {
    val tasks = taskSet.tasks
    logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
    this.synchronized {
      // maxTaskFailures: maximum number of task failures allowed
      val manager = createTaskSetManager(taskSet, maxTaskFailures)
      val stage = taskSet.stageId
      val stageTaskSets =
        taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
      stageTaskSets(taskSet.stageAttemptId) = manager
      // Check whether there is more than one active TaskSet for the same stage
      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
        ts.taskSet != taskSet && !ts.isZombie
      }
      if (conflictingTaskSet) {
        throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
          s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
      }
      // Add the TaskSetManager to the scheduling pool (FIFO or FAIR, depending on the scheduler mode)
      schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
      // In non-local mode, on the first submitted TaskSet start a starvation timer that keeps warning until some task has been launched
      if (!isLocal && !hasReceivedTask) {
        starvationTimer.scheduleAtFixedRate(new TimerTask() {
          override def run() {
            if (!hasLaunchedTask) {
              logWarning("Initial job has not accepted any resources; " +
                "check your cluster UI to ensure that workers are registered " +
                "and have sufficient resources")
            } else {
              this.cancel()
            }
          }
        }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
      }
      hasReceivedTask = true
    }
    // See org.apache.spark.SparkContext.createTaskScheduler for how the backend is created
    backend.reviveOffers()
  }

The last line, backend.reviveOffers(), is what triggers task submission and execution. The concrete backend object depends on the master URL passed in by the user (a simplified sketch of that dispatch follows below); here we take standalone mode as the example, where the backend is a StandaloneSchedulerBackend. Its reviveOffers implementation comes from the parent class CoarseGrainedSchedulerBackend:
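Before looking at that code, here is a simplified model of the dispatch (a sketch of the idea only, not the actual code of SparkContext.createTaskScheduler; StandaloneSchedulerBackend is the real class for standalone mode, the other branches are summarized loosely):

    // Sketch: the master URL decides which SchedulerBackend the TaskSchedulerImpl gets.
    def backendFor(master: String): String = {
      val LocalN     = """local\[([0-9]+|\*)\]""".r
      val Standalone = """spark://.+""".r
      master match {
        case "local" | LocalN(_) => "local-mode backend (tasks run inside the driver JVM)"
        case Standalone()        => "StandaloneSchedulerBackend, a subclass of CoarseGrainedSchedulerBackend"
        case other               => s"a cluster-manager specific backend for $other (YARN, Mesos, ...)"
      }
    }

    backendFor("local[2]")          // the example at the top of this article
    backendFor("spark://host:7077") // standalone mode, the case discussed below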

CoarseGrainedSchedulerBackend.scala
------------------------------
  
  // Ultimately sends a ReviveOffers message to the driver endpoint
 override def reviveOffers() {
    driverEndpoint.send(ReviveOffers)
  }

As we can see, the logic is delegated to the driverEndpoint, so let's first look at how the driver endpoint is defined. Inside CoarseGrainedSchedulerBackend there is a class DriverEndpoint that implements it. From the earlier article on the RPC mechanism we know that when the driver endpoint starts it runs its onStart method, and incoming messages are handled by receive, receiveAndReply, and so on.

CoarseGrainedSchedulerBackend.scala
-------------------

   override def onStart() {
      // Periodically revive offers to allow delay scheduling to work
      val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
      // Periodically send ReviveOffers to itself; if there is nothing to offer, the message is effectively a no-op
      reviveThread.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          Option(self).foreach(_.send(ReviveOffers))
        }
      }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
    }

We can see that the endpoint also keeps sending ReviveOffers to itself. Now let's look at how the ReviveOffers message is handled:

 override def receive: PartialFunction[Any, Unit] = {
 .......
       case ReviveOffers =>
        makeOffers()
        

It simply calls its own makeOffers method.

    /**
      * Build resource offers from all alive executors and launch tasks on them
      */
    private def makeOffers() {
      // Filter out executors under killing
      // All executors that are currently alive
      val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
      val workOffers = activeExecutors.map { case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
      }.toSeq
      // Let the task scheduler assign tasks to the offers, then launch them
      launchTasks(scheduler.resourceOffers(workOffers))
    }
 
 /**
  * launchTasks checks the size of each serialized task and then sends a LaunchTask
  * message to the corresponding executorEndpoint, i.e. asks that executor to run the task
  */
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
      for (task <- tasks.flatten) {
        // Serialize the task
        val serializedTask = ser.serialize(task)
        // Check whether the serialized task exceeds the max RPC message size
        if (serializedTask.limit >= maxRpcMessageSize) {
          scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
                "spark.rpc.message.maxSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
              taskSetMgr.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        }
        else {
          val executorData = executorDataMap(task.executorId)
          executorData.freeCores -= scheduler.CPUS_PER_TASK

          logInfo(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
            s"${executorData.executorHost}.")
          // Tell the executor to run the task; the endpoint is implemented by CoarseGrainedExecutorBackend
          executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
        }
      }
    }
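If a task ever trips the size check above, the error message itself points at the two remedies. A hedged illustration of both (the 256 here is arbitrary; spark.rpc.message.maxSize is specified in MB):

    import org.apache.spark.SparkConf

    // Option 1: raise the RPC message size limit (in MB) when building the SparkConf.
    val biggerRpcConf = new SparkConf().set("spark.rpc.message.maxSize", "256")

    // Option 2 (usually preferable): broadcast large read-only data instead of letting
    // it be captured in the task closure, e.g. val lookup = sc.broadcast(bigLookupMap)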

The executorEndpoint corresponds to CoarseGrainedExecutorBackend, and in that class we can see how the LaunchTask message type is handled.

CoarseGrainedExecutorBackend.scala
-----------
  
  override def receive: PartialFunction[Any, Unit] = {
      /**
       * This case only deserializes the TaskDescription; the real work is done by executor.launchTask
       */
    case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        // Deserialize the task description
        val taskDesc = ser.deserialize[TaskDescription](data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
          taskDesc.name, taskDesc.serializedTask)
      }
      
      
Executor.scala
-----------------
  /**
    * Wrap the task in a TaskRunner and hand it to the thread pool; the real work happens in TaskRunner.run
    */
def launchTask(
      context: ExecutorBackend,
      taskId: Long,
      attemptNumber: Int,
      taskName: String,
      serializedTask: ByteBuffer): Unit = {
    val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
      serializedTask)
    // Track the running task
    runningTasks.put(taskId, tr)
    // Execute it on the thread pool
    threadPool.execute(tr)
  } 
      

TaskRunner is where the task finally runs. The method is fairly long, so only the core parts are shown here:

  override def run(): Unit = {
      // Per-task memory manager
      val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
      .....
      // Tell the driver that the task is now RUNNING
      execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
      var taskStart: Long = 0
      startGCTime = computeTotalGcTime()

      try {
        // Deserialize the jars, files, and properties the task depends on
        val (taskFiles, taskJars, taskProps, taskBytes) =
          Task.deserializeWithDependencies(serializedTask)

        Executor.taskDeserializationProps.set(taskProps)
        // Download/refresh any missing file and jar dependencies
        updateDependencies(taskFiles, taskJars)
        task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
        task.localProperties = taskProps
        task.setTaskMemoryManager(taskMemoryManager)
        .....

        // Run the actual task and measure its runtime.
        taskStart = System.currentTimeMillis()
        var threwException = true
        val value = try {
        // Run the task. It is either a ShuffleMapTask or a ResultTask: a ShuffleMapTask writes its map
        // output (via the block manager / shuffle machinery) for the next stage, while a ResultTask computes the final result.
          val res = task.run(
            taskAttemptId = taskId,
            attemptNumber = attemptNumber,
            metricsSystem = env.metricsSystem)
          threwException = false
          res
        } finally {
            ........
        }
        val taskFinish = System.currentTimeMillis()

        .....
        // Serialize the task result
        val directResult = new DirectTaskResult(valueBytes, accumUpdates)
        val serializedDirectResult = ser.serialize(directResult)
        val resultSize = serializedDirectResult.limit

        /**
          * Decide how to send the serialized result back, based on its size
          */
        val serializedResult: ByteBuffer = {

          if (maxResultSize > 0 && resultSize > maxResultSize) {
            // Larger than maxResultSize (spark.driver.maxResultSize, default 1 GB): drop the result and only report its size
            logWarning(s"Finished $taskName (TID $taskId). Result is larger than maxResultSize " +
              s"(${Utils.bytesToString(resultSize)} > ${Utils.bytesToString(maxResultSize)}), " +
              s"dropping it.")
            ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))

          } else if (resultSize > maxDirectResultSize) {
            // Within maxResultSize but larger than maxDirectResultSize: store it in the block manager and send back only a reference
            val blockId = TaskResultBlockId(taskId)
            env.blockManager.putBytes(
              blockId,
              new ChunkedByteBuffer(serializedDirectResult.duplicate()),
              StorageLevel.MEMORY_AND_DISK_SER)
            logInfo(
              s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
            ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
          } else {
            // Otherwise (smaller than maxDirectResultSize): send the result directly to the driver
            logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
            serializedDirectResult
          }
        }
        // Tell the driver the task has FINISHED, together with the (direct or indirect) result
        execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

      } catch {
        .....
      } finally {
        // Remove the task from the running set
        runningTasks.remove(taskId)
      }
    }
  }
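To condense the three branches above into a compact model (an illustration only, not Spark code; in the real implementation maxResultSize comes from spark.driver.maxResultSize and maxDirectResultSize is an internal limit derived from the task/RPC size settings):

    sealed trait ResultPath
    case object DroppedTooLarge extends ResultPath // only the size is reported back to the driver
    case object ViaBlockManager extends ResultPath // IndirectTaskResult; the driver fetches the bytes later
    case object DirectToDriver  extends ResultPath // DirectTaskResult carried inside the status update

    def resultPath(resultSize: Long, maxResultSize: Long, maxDirectResultSize: Long): ResultPath =
      if (maxResultSize > 0 && resultSize > maxResultSize) DroppedTooLarge
      else if (resultSize > maxDirectResultSize) ViaBlockManager
      else DirectToDriver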