Lesson 6: Spark Streaming Source Code Walkthrough: Dynamic Job Generation and Deeper Reflections

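I. Reflections on Job Generation in Spark Streaming
1. In big data work with Hadoop, Spark and the like, anything that is not stream processing usually runs as a scheduled task, for example triggered once every 10 minutes or once every hour. That already has the feel of stream processing. On this view, data that is never processed as a stream, or has nothing to do with stream processing, ends up having little value, and the batch processing we used to do was really stream processing in disguise.
2. JobGenerator takes one core constructor parameter, jobScheduler. JobScheduler is the heart of generating jobs and submitting them to the cluster, and JobGenerator generates Jobs from the DStreams. A Job here is comparable to the business logic wrapped in a Runnable that a Java thread executes. A Spark job is the unit of work that actually runs.
3. Besides generating Jobs from timed triggers, Spark Streaming can also generate them through various aggregation operations or state-based operations.
4. Every 5 seconds JobGenerator produces Jobs. At this point a Job is logical: the Job exists and describes exactly what should be done, but nothing has executed yet. Actual execution is triggered by the action of the underlying RDD, and at this point that action is also only logical. Put differently, the Spark Streaming Job is logical because it comes from the dependencies built between DStreams, and underneath it rests on the RDD dependency graph, which is itself only a plan until an action runs.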

val ssc = new StreamingContext(conf, Seconds(5))

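5. Spark Streaming's trigger is time-based, whereas Storm is triggered by events, that is, record by record. The time Spark Streaming works from is the batch duration.

When the logical level is translated to the physical level, the last operation must be an RDD action, yet we do not want the job to fire the moment the translation happens. What can be done?
6. An action triggers a job. Here the work is wrapped much like a Runnable: a method is defined, and inside that method are the RDDs generated from the DStream dependencies. Translation turns the DStream dependencies into RDD dependencies. Because the last DStream operation is an action, the last RDD operation after translation must be an action as well. If the translation executed it directly, a Job would start immediately and there would be no queue to speak of. So the translated result is placed inside a function or method, and the action inside cannot run until that function is explicitly invoked.
7. Spark Streaming keeps managing the jobs it generates as time advances. Each job contains an action, and that action is a logical operation on a DStream. When a Job is generated and put on the queue, it has necessarily been translated into RDD operations, the last of which is an action. If translation triggered that action directly, the Spark Streaming Job would escape management altogether. We therefore need both the translation and the management: the dependencies between DStreams become dependencies between RDDs, and the action on the last DStream becomes an action on the last RDD. The whole translated result is one unit placed in a function body. Because the function is only defined and not yet executed, the RDD action inside it does not run and no Job is triggered. When JobScheduler later schedules the Job, it takes a thread from the thread pool and runs the wrapped function.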
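
The idea in points 6 and 7 can be sketched without Spark at all. The following is a minimal illustration, assuming a hypothetical LogicalJob type and a plain Java thread pool in place of the real scheduler: the "action" is captured in a closure at translation time, sits in a queue untouched, and only runs once a pool thread invokes it.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object DeferredJobDemo {
  // Hypothetical type: a job is only a description, a batch time plus a function not yet run.
  final case class LogicalJob(batchTime: Long, body: () => Unit)

  // "Translating" a batch wraps the would-be action in a closure; nothing executes here.
  def translate(batchTime: Long, log: ArrayBuffer[String]): LogicalJob =
    LogicalJob(batchTime, () => log.synchronized { log += s"action ran for batch $batchTime"; () })

  def demo(): List[String] = {
    val log = ArrayBuffer.empty[String]
    val queue = new LinkedBlockingQueue[LogicalJob]()
    queue.put(translate(5000L, log))
    queue.put(translate(10000L, log))
    log.synchronized { log += "jobs queued, nothing executed yet" }

    // Later, the scheduler takes a thread from a pool and invokes the wrapped function.
    val pool = Executors.newFixedThreadPool(1)
    while (!queue.isEmpty) {
      val job = queue.take()
      pool.execute(new Runnable { def run(): Unit = job.body() })
    }
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
    log.synchronized { log.toList }
  }
}
```

Calling DeferredJobDemo.demo() returns the log in the order things happened, with the "queued" entry ahead of both actions.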

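II. Source Code Analysis of Job Generation in Spark Streaming
The three core components of dynamic job generation in Spark Streaming:
JobGenerator: generates Jobs.
JobScheduler: schedules Jobs.
ReceiverTracker: obtains the metadata of the received data.
1. When JobScheduler's start method is called, it calls JobGenerator's start method.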

/** Start generation of jobs */
def start(): Unit = synchronized {
  // eventLoop is the message loop; it is needed because Jobs are generated continuously.
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  // Anonymous inner class.
  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  // Start the event loop.
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}

EventLoop: when its start method is called, it first calls onStart and then starts the event thread.

/**
 * An event loop to receive events from the caller and process all events in the event thread. It
 * will start an exclusive event thread to process all events.
 *
 * Note: The event queue will grow indefinitely. So subclasses should make sure `onReceive` can
 * handle events in time to avoid the potential OOM.
 */
private[spark] abstract class EventLoop[E](name: String) extends Logging {

  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  // Background (daemon) thread.
  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        // Keep taking messages from the blocking queue.
        while (!stopped.get) {
          // Once the thread is started it loops over the queue; callers put messages into eventQueue.
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }

  }

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }

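onReceive: keeps taking messages from the queue and handles each one as soon as it arrives.
Do not perform blocking operations in onReceive, otherwise the event thread stalls and the messages behind it pile up.
An event loop normally does not run the business logic itself; once it sees a message it usually routes it to another thread for processing.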

/**
 * Invoked in the event thread when polling events from the event queue.
 *
 * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
 * and cannot process events in time. If you want to call some blocking actions, run them in
 * another thread.
 */
protected def onReceive(event: E): Unit
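
To make the "route, do not block" advice concrete, here is a small self-contained sketch. MiniEventLoop is a simplified stand-in written for this illustration, not Spark's EventLoop, and the worker pool, the 20 ms sleep and the string events are all assumptions. The event thread only hands each event to a worker pool, so slow work never holds up the queue.

```scala
import java.util.concurrent.{CountDownLatch, Executors, LinkedBlockingDeque, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable
import scala.util.control.NonFatal

object MiniEventLoopDemo {
  // Simplified stand-in for an event loop: one daemon thread drains a blocking queue.
  abstract class MiniEventLoop[E](name: String) {
    private val eventQueue = new LinkedBlockingDeque[E]()
    private val stopped = new AtomicBoolean(false)
    private val eventThread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit = {
        try {
          while (!stopped.get) {
            val event = eventQueue.take()
            try onReceive(event)
            catch { case NonFatal(e) => onError(e) }
          }
        } catch {
          case _: InterruptedException => // stop() interrupted take(); exit quietly
        }
      }
    }
    def start(): Unit = eventThread.start()
    def stop(): Unit = if (stopped.compareAndSet(false, true)) eventThread.interrupt()
    def post(event: E): Unit = eventQueue.put(event)
    protected def onReceive(event: E): Unit
    protected def onError(e: Throwable): Unit
  }

  def demo(): Set[String] = {
    val workers = Executors.newFixedThreadPool(2)
    val handled = mutable.Set.empty[String]
    val done = new CountDownLatch(3)
    val loop = new MiniEventLoop[String]("demo-loop") {
      // onReceive only routes the event; the slow work runs on a worker thread.
      override protected def onReceive(event: String): Unit =
        workers.execute(new Runnable {
          def run(): Unit = {
            Thread.sleep(20) // stands in for slow business logic
            handled.synchronized { handled += event }
            done.countDown()
          }
        })
      override protected def onError(e: Throwable): Unit = ()
    }
    loop.start()
    Seq("a", "b", "c").foreach(e => loop.post(e))
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    workers.shutdown()
    handled.synchronized { handled.toSet }
  }
}
```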

Once an event is received from the queue, it is handled as follows:

/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}

Jobs are generated for each batch duration, followed by a checkpoint.
Job generation takes five steps.

/** Generate jobs and perform checkpoint for the given `time`.  */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    // Step 1: get the data for the current interval by allocating received blocks to this batch time.
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // Step 2: generate the Jobs and obtain the RDD DAG; RDD instances are created from the DStreams here.
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      // Step 3: get streamIdToInputInfos, the data this batch duration has to process,
      // which goes together with the business logic to apply.
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // Step 4: hand the generated Jobs to jobScheduler.
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  // Step 5: perform the checkpoint.
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
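
The control flow of these five steps is worth isolating: steps 1 and 2 run inside Try, the match decides between submitting and reporting, and the checkpoint request is posted either way. A minimal sketch of that shape, where allocateBlocks, buildJobs and the returned strings are hypothetical stand-ins rather than Spark APIs:

```scala
import scala.util.{Failure, Success, Try}

object GenerateJobsFlowDemo {
  // Hypothetical stand-in for step 1; fails for a negative batch time.
  def allocateBlocks(time: Long): Unit =
    if (time < 0) throw new IllegalArgumentException(s"invalid batch time $time")

  // Hypothetical stand-in for step 2.
  def buildJobs(time: Long): Seq[String] = Seq(s"job-$time-0", s"job-$time-1")

  // Returns a description of what happened so the outcome is easy to inspect.
  def generate(time: Long): String = {
    val outcome = Try {
      allocateBlocks(time) // step 1: bind received data to this batch
      buildJobs(time)      // step 2: build the jobs for this batch
    } match {
      case Success(jobs) => s"submitted ${jobs.size} jobs for time $time" // steps 3 and 4
      case Failure(e)    => s"error generating jobs for time $time: ${e.getMessage}"
    }
    // step 5 happens regardless of success or failure
    outcome + "; checkpoint requested"
  }
}
```

A failure in step 1 or 2 is therefore reported rather than thrown, and the batch still reaches the checkpoint step.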

The outputStream here is the last DStream in the whole DStream graph, that is, a ForEachDStream.

def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      // Generate a Job for the given time from the last DStream.
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
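
Because generateJob returns an Option, the flatMap over outputStreams silently drops every output stream that has nothing to do for this batch. A small sketch of that behaviour, assuming a hypothetical OutputStream type whose job is produced only when the batch time is a multiple of its slide interval:

```scala
object OutputStreamsDemo {
  // Hypothetical output stream: yields a job only when the time lines up with its slide interval.
  final case class OutputStream(name: String, slideMs: Long) {
    def generateJob(time: Long): Option[String] =
      if (time % slideMs == 0) Some(s"$name@$time") else None
  }

  // Mirrors the flatMap shape above: None results simply disappear from the result.
  def generateJobs(streams: Seq[OutputStream], time: Long): Seq[String] =
    streams.flatMap(_.generateJob(time))
}
```

With one stream sliding every 5 seconds and another every 10 seconds, the batch at 5000 ms yields one job and the batch at 10000 ms yields two.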

The jobFunc here is the function-wrapped Job mentioned earlier.
generateJob produces a Spark Streaming Job for the given time. It materializes the operations on our DStream into RDDs, which shows that DStreams are logical while RDDs are physical.

/**
 * Generate a SparkStreaming job for the given time. This is an internal method that
 * should not be called directly. This default implementation creates a job
 * that materializes the corresponding RDD. Subclasses of DStream may override this
 * to generate their own jobs.
 */
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {
      // rdd carries the RDD dependency chain.
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      // jobFunc is only wrapped in a Job here; it is not executed yet.
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}

The Job class represents a piece of Spark business logic and may contain multiple Spark jobs.

/**
 * Class representing a Spark computation. It may contain multiple Spark jobs.
 */
private[streaming]
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None

  def run() {
    // Call func; this is the jobFunc created in generateJob above.
    _result = Try(func())
  }
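
Two properties of this class matter for scheduling: constructing a Job does not run anything, and run() captures an exception as a Failure instead of letting it escape. The sketch below shows both with a hypothetical SimpleJob; the Option wrapper around the result is an illustrative choice, not how the excerpt stores it.

```scala
import scala.util.Try

object JobWrapperDemo {
  // Simplified stand-in for a streaming job: the function is stored, not run, at construction.
  class SimpleJob(val time: Long, func: () => Any) {
    private var _result: Option[Try[Any]] = None

    // Running the job is what finally invokes the wrapped function.
    def run(): Unit = { _result = Some(Try(func())) }

    // None until run() has been called, then Success or Failure.
    def result: Option[Try[Any]] = _result
  }
}
```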

The RDD passed to put below is the last RDD in the chain. Although Job generation is triggered by time, it also depends on the DStream's action.

/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  // (the RDD is generated per batch time)
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {

      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      // Then persist and checkpoint the generated RDD.
      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        // Key is the batch time, value is the RDD; this RDD is the last one in the chain.
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
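
At its core getOrCompute is a per-batch-time memo: look in the map first, compute only when the time is valid, and store the result under the batch time. A stripped-down sketch of that pattern, where CachedStream, its string "RDDs" and its validity rule are assumptions made for the illustration:

```scala
import scala.collection.mutable

object GetOrComputeDemo {
  // Hypothetical stream that produces one "RDD" (here just a string) per valid batch time.
  class CachedStream(slideMs: Long) {
    private val generated = mutable.HashMap.empty[Long, String]
    var computeCalls = 0 // counts how often the expensive path actually runs

    private def isTimeValid(time: Long): Boolean = time > 0 && time % slideMs == 0

    private def compute(time: Long): Option[String] = {
      computeCalls += 1
      Some(s"rdd-for-$time")
    }

    def getOrCompute(time: Long): Option[String] =
      generated.get(time).orElse {
        if (isTimeValid(time)) {
          val result = compute(time)
          result.foreach(rdd => generated.put(time, rdd)) // cache under the batch time
          result
        } else None
      }
  }
}
```

Asking twice for the same batch time computes once, and an invalid time computes nothing.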

Back to the start method of JobGenerator.

  if (ssc.isCheckpointPresent) {
    // Not the first start: recover from the checkpoint.
    restart()
  } else {
    // Otherwise this is the first start.
    startFirstTime()
  }
}

The source of startFirstTime is as follows:

/** Starts the generator for the first time */
private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  // Tell DStreamGraph when the first batch starts.
  graph.start(startTime - graph.batchDuration)
  // Start the timer; from here on Jobs are generated continuously.
  timer.start(startTime.milliseconds)
  logInfo("Started JobGenerator at " + startTime)
}

The timer here is a RecurringTimer. Its start method starts the built-in thread.

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

The source of timer.start is as follows:

/**
 * Start at the given start time.
 */
def start(startTime: Long): Long = synchronized {
  nextTime = startTime // the time at which the next callback fires
  thread.start()
  logInfo("Started timer for " + name + " at time " + nextTime)
  nextTime
}

This starts thread, which runs as a background (daemon) thread.

private val thread = new Thread("RecurringTimer - " + name) {
  setDaemon(true)
  override def run() { loop }
}

The source of loop is as follows:

  /**
   * Repeatedly call the callback every interval.
   */
  private def loop() {
    try {
      while (!stopped) {
        triggerActionForNextInterval()
      }
      triggerActionForNextInterval()
    } catch {
      case e: InterruptedException =>
    }
  }
}

The source of triggerActionForNextInterval is as follows:

private def triggerActionForNextInterval(): Unit = {
  clock.waitTillTime(nextTime)
  callback(nextTime)
  prevTime = nextTime
  nextTime += period
  logDebug("Callback for " + name + " called at time " + prevTime)
}
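
The timer's behaviour boils down to "wait until nextTime, fire the callback with that time, advance by one period". The sketch below reproduces just that step with a hypothetical manual clock, so it runs instantly and deterministically; SimpleRecurringTimer and ManualClock are illustrative names, and the real class drives this step from its own thread.

```scala
import scala.collection.mutable.ArrayBuffer

object RecurringTimerDemo {
  // Hypothetical clock abstraction so the sketch can run without real waiting.
  trait Clock { def waitTillTime(target: Long): Long }

  // A manual clock that jumps straight to the requested time.
  class ManualClock(var now: Long = 0L) extends Clock {
    def waitTillTime(target: Long): Long = { if (target > now) now = target; now }
  }

  class SimpleRecurringTimer(clock: Clock, period: Long, callback: Long => Unit) {
    private var nextTime = 0L

    def start(startTime: Long): Unit = { nextTime = startTime }

    // One loop iteration: wait, fire the callback with the batch time, advance by one period.
    def triggerActionForNextInterval(): Unit = {
      clock.waitTillTime(nextTime)
      callback(nextTime)
      nextTime += period
    }
  }

  def demo(): List[Long] = {
    val fired = ArrayBuffer.empty[Long]
    val timer = new SimpleRecurringTimer(new ManualClock(), 5000L, t => { fired += t; () })
    timer.start(5000L)
    (1 to 3).foreach(_ => timer.triggerActionForNextInterval())
    fired.toList
  }
}
```

Each callback receives the scheduled batch time rather than the wall-clock time at which it happened to run, which is why every batch gets an exact multiple of the period.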

The callback here is passed into RecurringTimer. To find out who supplies it, look for where RecurringTimer is instantiated.

private[streaming]
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {

  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }

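In JobGenerator, the anonymous function is called again and again as time advances.

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  // Anonymous function assigned to callback.
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

The eventLoop here is the one created in JobGenerator's start method. It is a message loop: when it receives a GenerateJobs event it calls generateJobs, and the resulting Jobs are handed to the thread pool for execution.
This completes the flow of how jobs are generated based on time.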

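jobs: the jobs here carry the business logic, much like the dependencies between RDDs: the last one is kept and the rest are reached by tracing back through the dependencies.
streamIdToInputInfos: the input data for the batch duration. Together with the business logic to run on it, it makes up the JobSet.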

jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))

The JobSet now contains both the data and the business logic for processing it.

/** Class representing a set of Jobs
  * belong to the same batch.
  */
private[streaming]
case class JobSet(
    time: Time,
    jobs: Seq[Job],
    streamIdToInputInfo: Map[Int, StreamInputInfo] = Map.empty) {

  private val incompleteJobs = new HashSet[Job]()
  private val submissionTime = System.currentTimeMillis() // when this jobset was submitted
  private var processingStartTime = -1L // when the first job of this jobset started processing
  private var processingEndTime = -1L // when the last job of this jobset finished processing

  jobs.zipWithIndex.foreach { case (job, i) => job.setOutputOpId(i) }
  incompleteJobs ++= jobs

  def handleJobStart(job: Job) {
    if (processingStartTime < 0) processingStartTime = System.currentTimeMillis()
  }
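
The bookkeeping in this excerpt (a set of incomplete jobs plus start and end timestamps) suggests how a batch can be tracked to completion. The sketch below is one plausible shape, not the class's actual code: the excerpt only shows handleJobStart, so handleJobCompletion, hasCompleted and processingDelay here are assumptions for illustration, and times are passed in explicitly to keep the example deterministic.

```scala
import scala.collection.mutable

object JobSetTrackingDemo {
  // Simplified stand-in for a job set: tracks unfinished jobs and the batch's timing.
  class SimpleJobSet(val time: Long, jobs: Seq[String]) {
    private val incompleteJobs = mutable.HashSet.empty[String]
    private var processingStartTime = -1L
    private var processingEndTime = -1L

    incompleteJobs ++= jobs

    // Only the first job to start sets the batch's start time.
    def handleJobStart(job: String, now: Long): Unit =
      if (processingStartTime < 0) processingStartTime = now

    // Assumed counterpart: the batch ends when the last outstanding job finishes.
    def handleJobCompletion(job: String, now: Long): Unit = {
      incompleteJobs -= job
      if (hasCompleted) processingEndTime = now
    }

    def hasCompleted: Boolean = incompleteJobs.isEmpty

    def processingDelay: Long = processingEndTime - processingStartTime
  }
}
```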

submitJobSet:

def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
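    // Record the JobSet under its batch time.
    jobSets.put(jobSet.time, jobSet)
    // Wrap each job in a JobHandler and run it on the jobExecutor thread pool.
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))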
    logInfo("Added jobs for time " + jobSet.time)
  }
}
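
How the jobs of a batch run depends on the size of the executor pool. The sketch below uses a plain fixed thread pool with an illustrative poolSize parameter: with one thread the jobs finish strictly in submission order, while a larger pool may run them concurrently. In Spark the pool size is, as far as I know, governed by spark.streaming.concurrentJobs with a default of 1, but treat that as an assumption to verify against your version.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object JobExecutorDemo {
  // Runs each job of a batch on a fixed-size pool and returns the completion order.
  // poolSize is an illustrative parameter, not a Spark setting name.
  def runBatch(jobs: Seq[String], poolSize: Int): List[String] = {
    val finished = ArrayBuffer.empty[String]
    val executor = Executors.newFixedThreadPool(poolSize)
    jobs.foreach { job =>
      executor.execute(new Runnable {
        def run(): Unit = finished.synchronized { finished += job; () }
      })
    }
    executor.shutdown()
    executor.awaitTermination(5, TimeUnit.SECONDS)
    finished.synchronized { finished.toList }
  }
}
```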

JobHandler is a Runnable. The Job is our business logic, representing a chain of RDD dependencies, and calling job.run is what invokes the func function.

  private class JobHandler(job: Job) extends Runnable with Logging {
    import JobScheduler._

    def run() {
      try {
        val formattedTime = UIUtils.formatBatchTime(
          job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
        val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
        val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

        ssc.sc.setJobDescription(
          s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
        ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
        ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

        // We need to assign `eventLoop` to a temp variable. Otherwise, because
        // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
        // it's possible that when `post` is called, `eventLoop` happens to null.
        var _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            // This is where the wrapped func finally runs.
            job.run()
          }
          _eventLoop = eventLoop
          if (_eventLoop != null) {
            _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
          }
        } else {
          // JobScheduler has been stopped.
        }
      } finally {
        ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
        ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
      }
    }
  }
}

The func here is the business logic built on the DStream, in other words the business logic expressed as dependencies between RDDs.

def run() {
  _result = Try(func())
}

Overall architecture: [diagram not included]

Notes:

1. DT大數據夢工廠 (DT Big Data Dream Factory) WeChat public account: DT_Spark
2. IMF 8 p.m. big data hands-on YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains

Reposted from http://blog.csdn.net/snail_gesture/article/details/51417769
