Spark Streaming Source Code Walkthrough: Dynamic Job Generation and In-Depth Reflections

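JobGenerator and ReceiverTracker are both members of JobScheduler. A Spark Streaming application starts at val ssc = new StreamingContext(conf, batchDuration). Calling ssc.start() launches the Spark Streaming framework, which eventually calls JobScheduler's start() method. That method creates and starts the ReceiverTracker and then starts the JobGenerator: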

def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}

JobScheduler has two very important members:

· JobGenerator

· ReceiverTracker

JobScheduler delegates the actual generation of each batch's RDD DAG to JobGenerator, and delegates the bookkeeping of incoming source data to ReceiverTracker.

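JobGenerator in turn has two crucial members: a RecurringTimer and an EventLoop. The RecurringTimer controls when jobs are triggered: at every batchInterval it puts a message into the EventLoop's queue. The EventLoop keeps checking that queue and handles each message as soon as one arrives. As time passes, JobGenerator therefore keeps producing jobs at every BatchDuration, and it also drives checkpointing and the cleanup of old DStream data.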

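First, look at JobGenerator's start method. It initializes the checkpoint writer, instantiates and starts the EventLoop message loop, and starts the timer that periodically generates jobs: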

/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}

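The EventLoop class holds a LinkedBlockingDeque that stores messages, plus a background thread. The background thread takes messages from the queue and calls onReceive on each of them. Here onReceive is the method overridden in the anonymous inner class, which simply forwards to processEvent.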
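To make the queue-plus-background-thread structure concrete, here is a minimal sketch. MiniEventLoop, MiniEventLoopDemo and the two event case classes are hypothetical stand-ins written for this illustration; they are not Spark's EventLoop and leave out most of its error handling.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import scala.collection.mutable.ListBuffer

// Hypothetical, simplified stand-in for an event loop; not Spark's class.
abstract class MiniEventLoop[E](name: String) {
  private val queue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  // Background thread: take a message, dispatch it, repeat.
  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          val event = queue.take() // blocks until a message arrives
          try onReceive(event)
          catch { case e: Exception => onError(e) }
        }
      } catch {
        case _: InterruptedException => // stop() interrupted the wait; exit
      }
    }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt(); thread.join() }
  def post(event: E): Unit = queue.put(event)

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}

object MiniEventLoopDemo {
  sealed trait Event
  case class GenerateJobs(time: Long) extends Event
  case class DoCheckpoint(time: Long) extends Event

  // Posts two events and returns the handling order observed by the loop thread.
  def runDemo(): List[String] = {
    val handled = ListBuffer.empty[String] // written only by the loop thread
    val done = new CountDownLatch(2)
    val loop = new MiniEventLoop[Event]("demo") {
      override protected def onReceive(event: Event): Unit = {
        event match {
          case GenerateJobs(t) => handled += s"generate@$t"
          case DoCheckpoint(t) => handled += s"checkpoint@$t"
        }
        done.countDown()
      }
      override protected def onError(e: Throwable): Unit = ()
    }
    loop.start()
    loop.post(GenerateJobs(1000L))
    loop.post(DoCheckpoint(1000L))
    done.await(5, TimeUnit.SECONDS) // wait until both events were handled
    loop.stop()
    handled.toList
  }
}
```

Because a single thread drains the queue, events are handled strictly in the order they were posted, which is why the handlers themselves need no locking in this sketch.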

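The processEvent method pattern-matches on the message type and routes each message to the method that handles it. Message handling is generally handed off to another thread; the message loop itself does not run time-consuming business logic: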

/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}

After obtaining the data for the batch, generateJobs calls the generateJobs method of DStreamGraph to generate the jobs:

def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}

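streamIdToInputInfos holds the input information for the given batch time. Once it has been obtained, jobScheduler.submitJobSet is called with a JobSet built from it, and that JobSet is handed to JobScheduler, which schedules and runs the jobs.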

In the generateJobs method shown above, outputStreams holds the last DStreams of the whole DStream graph (the output streams). The call outputStream.generateJob(time) works backwards from there, much as an RDD lineage is traced from the last RDD back to the first.

In the generateJob method, jobFunc wraps the call context.sparkContext.runJob(rdd, emptyFunc):

/**
 * Generate a SparkStreaming job for the given time. This is an internal method that
 * should not be called directly. This default implementation creates a job
 * that materializes the corresponding RDD. Subclasses of DStream may override this
 * to generate their own jobs.
 */
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}

In the Job class, calling run invokes the func that was passed in:

private[streaming]
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None

  def run() {
    _result = Try(func())
  }
  // ... (remaining members omitted in this excerpt)
}

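The getOrCompute method first looks up the RDD for the given time in a HashMap. If it is not there, it calls compute to produce the RDD, persists the RDD if storageLevel requires it, marks the RDD for checkpointing if a checkpoint time has been reached, and finally puts the RDD into the HashMap: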

private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {

      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
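The lookup-then-compute pattern can be isolated in a small sketch. MiniStream is a hypothetical stand-in for a DStream: plain Long milliseconds replace Time and Duration, a String replaces the RDD, and "checkpointing" is only recorded in a list. It illustrates the caching and the multiple-of-interval checks, not Spark's actual behavior.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a DStream; not Spark's class.
class MiniStream(zeroTime: Long, slideMs: Long, checkpointMs: Long) {
  private val generated = mutable.HashMap.empty[Long, String]
  var computeCalls = 0                               // how often compute really ran
  val checkpointed = mutable.ListBuffer.empty[Long]  // batch times marked for checkpoint

  // A time is valid if it lies after zeroTime on a slide boundary.
  private def isTimeValid(time: Long): Boolean =
    time > zeroTime && (time - zeroTime) % slideMs == 0

  private def compute(time: Long): Option[String] = {
    computeCalls += 1
    Some(s"rdd@$time")
  }

  def getOrCompute(time: Long): Option[String] =
    generated.get(time).orElse {
      if (isTimeValid(time)) {
        val rddOption = compute(time)
        rddOption.foreach { rdd =>
          // Mark for checkpointing only on multiples of the checkpoint interval.
          if ((time - zeroTime) % checkpointMs == 0) checkpointed += time
          generated.put(time, rdd)
        }
        rddOption
      } else None
    }
}
```

With zeroTime = 0, slideMs = 1000 and checkpointMs = 5000, a second call for time 1000 is served from the map without recomputation, time 1500 yields None because it is not on a slide boundary, and time 5000 is the first batch marked for checkpointing.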

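Back in the JobGenerator class, look again at the start method. After the message loop has started, it checks whether a checkpoint exists. If one does, it calls restart to restart the JobGenerator from the state read from the checkpoint directory; if this is the first run, it calls startFirstTime.

The startFirstTime method in JobGenerator starts the timer that periodically generates jobs.

The timer object is a RecurringTimer. Its start method starts a thread, and that thread repeatedly calls triggerActionForNextInterval: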

private[streaming]
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {

  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }

  @volatile private var prevTime = -1L
  @volatile private var nextTime = -1L
  @volatile private var stopped = false

  /**
   * Get the time when this timer will fire if it is started right now.
   * The time will be a multiple of this timer's period and more than
   * current system time.
   */
  def getStartTime(): Long = {
    (math.floor(clock.getTimeMillis().toDouble / period) + 1).toLong * period
  }

  /**
   * Get the time when the timer will fire if it is restarted right now.
   * This time depends on when the timer was started the first time, and was stopped
   * for whatever reason. The time must be a multiple of this timer's period and
   * more than current time.
   */
  def getRestartTime(originalStartTime: Long): Long = {
    val gap = clock.getTimeMillis() - originalStartTime
    (math.floor(gap.toDouble / period).toLong + 1) * period + originalStartTime
  }

  /**
   * Start at the given start time.
   */
  def start(startTime: Long): Long = synchronized {
    nextTime = startTime
    thread.start()
    logInfo("Started timer for " + name + " at time " + nextTime)
    nextTime
  }

  /**
   * Start at the earliest time it can start based on the period.
   */
  def start(): Long = {
    start(getStartTime())
  }

  /**
   * Stop the timer, and return the last time the callback was made.
   *
   * @param interruptTimer True will interrupt the callback if it is in progress (not guaranteed to
   *                       give correct time in this case). False guarantees that there will be at
   *                       least one callback after `stop` has been called.
   */
  def stop(interruptTimer: Boolean): Long = synchronized {
    if (!stopped) {
      stopped = true
      if (interruptTimer) {
        thread.interrupt()
      }
      thread.join()
      logInfo("Stopped timer for " + name + " after time " + prevTime)
    }
    prevTime
  }

  private def triggerActionForNextInterval(): Unit = {
    clock.waitTillTime(nextTime)
    callback(nextTime)
    prevTime = nextTime
    nextTime += period
    logDebug("Callback for " + name + " called at time " + prevTime)
  }

  /**
   * Repeatedly call the callback every interval.
   */
  private def loop() {
    try {
      while (!stopped) {
        triggerActionForNextInterval()
      }
      triggerActionForNextInterval()
    } catch {
      case e: InterruptedException =>
    }
  }
}

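The triggerActionForNextInterval method waits until the next batch time (one BatchDuration later) and then invokes callback. The callback is the function passed in when the RecurringTimer was constructed, namely longTime => eventLoop.post(GenerateJobs(new Time(longTime))), so a GenerateJobs message is posted to the message loop at every interval.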
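The arithmetic in getStartTime and getRestartTime is easier to follow with concrete numbers. The sketch below copies the two formulas from the RecurringTimer listing into pure functions; the object name TimerMath and the parameter names are assumptions made for this illustration, and the clock reading is passed in explicitly instead of coming from a Clock.

```scala
// Hypothetical helper that isolates the timer arithmetic; not part of Spark.
object TimerMath {
  // Next multiple of periodMs that is strictly greater than nowMs.
  def startTime(nowMs: Long, periodMs: Long): Long =
    (math.floor(nowMs.toDouble / periodMs) + 1).toLong * periodMs

  // Next firing time after nowMs that stays aligned with the original start.
  def restartTime(nowMs: Long, originalStartMs: Long, periodMs: Long): Long = {
    val gap = nowMs - originalStartMs
    (math.floor(gap.toDouble / periodMs).toLong + 1) * periodMs + originalStartMs
  }
}
```

With a 1000 ms period, a clock reading of 12345 gives a start time of 13000, and a reading of exactly 12000 also gives 13000, since the result must be strictly later than the current time. For a restart at 12345 with an original start of 10500, the gap is 1845 ms, so the next aligned firing time is 12500.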

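Looking again at generateJobs, job generation proceeds in these steps:

Step 1: Get the data received during the current batch interval.

Step 2: Generate the jobs, including the dependencies between RDDs.

Step 3: Get the input information for the stream IDs that correspond to the generated jobs.

Step 4: Wrap everything in a JobSet and hand it to JobScheduler.

Step 5: Perform the checkpoint operation.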
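The five steps can be sketched in order as follows. Every name here (GenerateJobsSketch, MiniJob, MiniJobSet and the private helpers) is a hypothetical stand-in; the helpers only record that they were called, so the sketch mirrors the ordering described above and nothing more.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for illustration only; these are not Spark classes.
object GenerateJobsSketch {
  case class MiniJob(time: Long, name: String)
  case class MiniJobSet(time: Long, jobs: Seq[MiniJob], inputInfo: Map[Int, Long])

  val log = mutable.ListBuffer.empty[String]          // order in which steps ran
  val submitted = mutable.ListBuffer.empty[MiniJobSet]

  private def allocateBlocksToBatch(time: Long): Unit = log += s"allocate@$time"

  private def graphGenerateJobs(time: Long): Seq[MiniJob] = {
    log += s"generate@$time"
    Seq(MiniJob(time, "output-0"))
  }

  // Assumed shape: stream id -> number of records in this batch.
  private def inputInfoFor(time: Long): Map[Int, Long] = {
    log += s"inputInfo@$time"
    Map(0 -> 42L)
  }

  private def submitJobSet(jobSet: MiniJobSet): Unit = {
    log += s"submit@${jobSet.time}"
    submitted += jobSet
  }

  def generateJobs(time: Long): Unit = {
    Try {
      allocateBlocksToBatch(time)                    // step 1: data for this batch
      graphGenerateJobs(time)                        // step 2: build the jobs
    } match {
      case Success(jobs) =>
        val info = inputInfoFor(time)                // step 3: per-stream input info
        submitJobSet(MiniJobSet(time, jobs, info))   // step 4: hand over the JobSet
      case Failure(e) =>
        log += s"error: ${e.getMessage}"
    }
    log += s"checkpoint@$time"                       // step 5: checkpoint
  }
}
```

In this sketch the checkpoint step runs whether or not job generation succeeded; treat that as an assumption of the sketch rather than a statement about the real method.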

The submitJobSet method merely puts the JobSet into a ConcurrentHashMap, wraps each Job in a JobHandler, and submits it to the jobExecutor thread pool.
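A minimal sketch of that hand-off, assuming a fixed-size pool: the job set is registered in a ConcurrentHashMap keyed by batch time, and each job is wrapped in a Runnable and submitted to an executor. SubmitJobSetSketch, MiniJob, MiniJobHandler and the poolSize parameter are hypothetical names introduced for this illustration.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-ins for illustration only; these are not Spark classes.
object SubmitJobSetSketch {
  case class MiniJob(id: Int, func: () => Unit)

  // The Runnable wrapper: running the handler runs the job's function.
  class MiniJobHandler(job: MiniJob) extends Runnable {
    override def run(): Unit = job.func()
  }

  // Registers one batch of jobCount jobs, runs them on a pool of poolSize
  // threads, and returns how many jobs actually executed.
  def runBatch(time: Long, jobCount: Int, poolSize: Int): Int = {
    val executed = new AtomicInteger(0)
    val jobs = (1 to jobCount).map { i =>
      MiniJob(i, () => { executed.incrementAndGet(); () })
    }

    val jobSets = new ConcurrentHashMap[Long, Seq[MiniJob]]()
    val executor = Executors.newFixedThreadPool(poolSize)

    jobSets.put(time, jobs)                                        // bookkeeping only
    jobs.foreach(job => executor.execute(new MiniJobHandler(job))) // real work is on the pool

    executor.shutdown()                            // accept no new work
    executor.awaitTermination(5, TimeUnit.SECONDS) // wait for submitted jobs
    executed.get()
  }
}
```

The submitting thread only does bookkeeping and returns quickly; the job functions run on the pool's threads, which keeps the caller free to continue with the next message.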

JobHandler implements the Runnable interface. Its call to job.run() causes func to be invoked, which is the business logic defined on the DStreams:

private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    try {
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

      // We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to null.
      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run()
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
        }
      } else {
        // JobScheduler has been stopped.
      }
    } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
    }
  }
}
