Spark Streaming Source Code Walkthrough: JobScheduler Internals and In-Depth Analysis

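Once every batchInterval, JobGenerator dynamically generates a JobSet and submits it to JobScheduler. After receiving the JobSet, JobScheduler handles it as follows (the source code is very clear):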

def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}

Here a new JobHandler is created for each job and handed to jobExecutor to run.

private val jobExecutor =
  ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

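jobExecutor is a thread pool whose size is set by configuration: numConcurrentJobs is read from spark.streaming.concurrentJobs, which defaults to 1. To run several jobs at the same time, for example when a single batchInterval contains multiple output operations, this setting has to be raised.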
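To make the effect of the pool size concrete, here is a minimal sketch that uses only the JDK and not Spark itself. PoolSizeDemo, daemonFixedPool, and maxConcurrency are hypothetical names introduced for illustration; daemonFixedPool only approximates the kind of pool that ThreadUtils.newDaemonFixedThreadPool returns. The sketch submits several tasks and records the highest number that were running at the same moment.

```scala
import java.util.concurrent.{CountDownLatch, ExecutorService, Executors, ThreadFactory, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object PoolSizeDemo {
  // Hypothetical stand-in for a fixed-size pool of daemon threads (illustration only).
  def daemonFixedPool(nThreads: Int, prefix: String): ExecutorService = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.getAndIncrement()}")
        t.setDaemon(true)
        t
      }
    }
    Executors.newFixedThreadPool(nThreads, factory)
  }

  // Runs `numJobs` tasks on a pool of `poolSize` threads and returns the
  // highest number of tasks that were running at the same moment.
  def maxConcurrency(poolSize: Int, numJobs: Int): Int = {
    require(numJobs >= poolSize, "need at least one task per pool thread")
    val pool = daemonFixedPool(poolSize, "demo-job-executor")
    val lock = new Object
    var running = 0
    var peak = 0
    // The first tasks block until `poolSize` of them have started,
    // which makes the observed peak deterministic.
    val allStarted = new CountDownLatch(poolSize)
    (1 to numJobs).foreach { _ =>
      pool.execute(new Runnable {
        override def run(): Unit = {
          lock.synchronized {
            running += 1
            if (running > peak) peak = running
          }
          allStarted.countDown()
          allStarted.await(5, TimeUnit.SECONDS)
          lock.synchronized { running -= 1 }
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    lock.synchronized(peak)
  }
}
```

With this sketch, PoolSizeDemo.maxConcurrency(1, 4) returns 1 and PoolSizeDemo.maxConcurrency(2, 4) returns 2: a pool of one thread serializes the jobs of a batch, while a larger pool lets output operations of the same batch overlap.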

The most important piece of logic here is job => jobExecutor.execute(new JobHandler(job)): every job is wrapped in a new JobHandler and run on the jobExecutor thread pool.

First, let's look at the main logic JobHandler applies to a Job:

private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    try {
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

      // We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to null.
      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run()
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
        }
      } else {
        // JobScheduler has been stopped.
      }
    } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
    }
  }
}

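In other words, apart from some state bookkeeping, the main thing JobHandler does is call job.run(). This ties back to the analysis in the earlier article on how DStream generates RDD instances: ForEachDStream.generateJob(time) only defines the job's logic, that is, Job.func. JobHandler is where Job.run() is actually called, and that call is what triggers the real execution of Job.func.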
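The split between defining a job and running it can be shown with a small self-contained sketch. MiniJob and MiniJobHandler are simplified, hypothetical models, not the Spark classes: MiniJob only stores a function, and MiniJobHandler records a start event, calls run(), then records a completion event. For simplicity the handler runs on the calling thread here rather than on a thread pool.

```scala
import scala.collection.mutable.ArrayBuffer

object DeferredJobDemo {
  // Hypothetical model of a streaming job: `func` is stored, not executed.
  final class MiniJob(val time: Long, func: () => Unit) {
    def run(): Unit = func()
  }

  // Hypothetical handler: records events around the call to job.run().
  final class MiniJobHandler(job: MiniJob, events: ArrayBuffer[String]) extends Runnable {
    override def run(): Unit = {
      events += s"JobStarted(${job.time})"
      job.run() // this is the point where func actually executes
      events += s"JobCompleted(${job.time})"
    }
  }

  def demo(): List[String] = {
    val events = ArrayBuffer.empty[String]
    // Defining the job only captures the closure; nothing runs yet.
    val job = new MiniJob(1000L, () => { events += "func executed"; () })
    events += "job defined"
    new MiniJobHandler(job, events).run()
    events.toList
  }
}
```

DeferredJobDemo.demo() returns List("job defined", "JobStarted(1000)", "func executed", "JobCompleted(1000)"), showing that the function body runs only when the handler calls run(), between the two events.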

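Combining this with the analysis in the previous articles:

JobScheduler is the hub of all job scheduling in Spark Streaming. It has two important members: JobGenerator, which generates jobs, and ReceiverTracker, which records information about the input data sources.
Starting JobScheduler starts both ReceiverTracker and JobGenerator. Starting ReceiverTracker launches the Receivers on the Executors, which begin receiving data; ReceiverTracker records metadata about the data each Receiver has received. Starting JobGenerator causes it, once every BatchDuration, to call DStreamGraph to generate the RDD graph and the corresponding jobs. The thread pool in JobScheduler then runs the jobs of the submitted JobSet, which bundles the batch time, the jobs, and the input source metadata. Each Job encapsulates the business logic; running it triggers the action on the final RDD, and DAGScheduler then schedules the job for execution on the Spark cluster.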
