Spark Streaming 2.4.0 job startup: a source code walkthrough

The official example

Let's start from the official getting-started example.


import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object SparkStreamingTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aaa").setMaster("local[*]")

    // Way 1: let the StreamingContext create its own SparkContext from the conf
    val ssc = new StreamingContext(conf, Seconds(1))

    // Way 2: reuse an existing SparkContext (e.g. from a SparkSession);
    // only one of the two ways can be used in a single application
    // val spark = SparkSession.builder().config(conf).getOrCreate()
    // val ssc2 = new StreamingContext(spark.sparkContext, Seconds(1))

    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.40.179", 9999)

    // words is a FlatMappedDStream (extends the DStream abstract class): it keeps a
    // dependency on its parent `lines`, inherits the parent's slideDuration, and overrides compute
    val words: DStream[String] = lines.flatMap(_.split(" "))

    // pairs is a MappedDStream: same pattern, its parent is `words`
    val pairs: DStream[(String, Int)] = words.map(word => (word, 1))

    // wordCounts is a ShuffledDStream: same pattern, its parent is `pairs`
    val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
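
As the comments suggest, each transformation wraps its parent in a new DStream subclass that records the dependency, inherits the parent's slideDuration, and overrides compute. For example, FlatMappedDStream looks roughly like this (simplified from the 2.4.x source):

package org.apache.spark.streaming.dstream

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, Time}

private[streaming]
class FlatMappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    flatMapFunc: T => TraversableOnce[U]
  ) extends DStream[U](parent.ssc) {

  // The parent DStream is recorded as this stream's only dependency
  override def dependencies: List[DStream[_]] = List(parent)

  // The batch interval is inherited from the parent
  override def slideDuration: Duration = parent.slideDuration

  // Each batch's RDD is derived from the parent's RDD for the same batch time
  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }
}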

StreamingContext

Creating the context

package org.apache.spark.streaming


class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {
  // ... intermediate members omitted

  /**
   * Way 1: have the StreamingContext create a new SparkContext from a SparkConf.
   * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }
  
  /**
   * Way 2: reuse an existing SparkContext.
   * @param sparkContext existing SparkContext
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(sparkContext: SparkContext, batchDuration: Duration) = {
    this(sparkContext, null, batchDuration)
  }

From these constructors we can see that StreamingContext is built around at least three main members; below we walk through them, and then look at the context's state and its start method.

  1. _sc: the SparkContext, i.e. Spark's driver context object.

     It can be supplied via either way 1 or way 2 above.

  2. _cp: the Checkpoint, which holds intermediate data and results.

     Since this article focuses on how Spark Streaming starts up, checkpoints are not covered in detail here. Roughly, to keep the stream running reliably and in order, intermediate RDD results are periodically persisted to disk so that later runs can recover from them.

  3. _batchDur: the interval at which received data is grouped into batches.

     To understand this, look at the graph member inside StreamingContext together with the scheduling section (the timer) later in this article. As an analogy, a StreamingContext is like an exhibition hall: its job is to show paintings to visitors, and it is made up of many parts — the lights, the switches that control their colours, the music, the opening hours, and so on. Let's first look at the hall's centerpiece, the graph; the timer, which we will meet later, is like the switch that keeps the lights cycling on schedule.

package org.apache.spark.streaming


class StreamingContext private[streaming]{
....
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      _cp.graph.setContext(this)
      _cp.graph.restoreCheckpointData()
      _cp.graph
    } else {
      require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(_batchDur)
      newGraph
    }
  }
  ....
  }
  4. state

A StreamingContext can be in one of three possible states:

  /**
   * :: DeveloperApi ::
   *
   * Return the current state of the context. The context can be in three possible states -
   *
   *  - StreamingContextState.INITIALIZED - The context has been created, but not started yet.
   *    Input DStreams, transformations and output operations can be created on the context
   *    (at this point they are merely defined; nothing is executed yet).
   *  - StreamingContextState.ACTIVE - The context has been started, and not stopped.
   *    Input DStreams, transformations and output operations cannot be created on the context
   *    (the previously defined operations now run batch after batch until the context is stopped).
   *  - StreamingContextState.STOPPED - The context has been stopped and cannot be used any more.
   */
  @DeveloperApi
  def getState(): StreamingContextState = synchronized {
    state
  }
  5. start

start() synchronizes on the current state of the application. It also records the creation call site (startSite), attaches the web UI tab, and performs a number of validation and configuration steps; those are not essential to the startup flow, so we will not dwell on them:

 /**
   * Start the execution of the streams.
   *
   * @throws IllegalStateException if the StreamingContext is already stopped.
   */
  def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        StreamingContext.ACTIVATION_LOCK.synchronized {
          StreamingContext.assertNoOtherContextIsActive()
          try {
            validate()

            // Start the streaming scheduler in a new thread, so that thread local properties
            // like call sites and job groups can be reset without affecting those of the
            // current thread.
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
              scheduler.start()
            }
            state = StreamingContextState.ACTIVE
            scheduler.listenerBus.post(
              StreamingListenerStreamingStarted(System.currentTimeMillis()))
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        logDebug("Adding shutdown hook") // force eager creation of logger
        shutdownHookRef = ShutdownHookManager.addShutdownHook(
          StreamingContext.SHUTDOWN_HOOK_PRIORITY)(() => stopOnShutdown())
        // Registering Streaming Metrics at the start of the StreamingContext
        assert(env.metricsSystem != null)
        env.metricsSystem.registerSource(streamingSource)
        uiTab.foreach(_.attach())
        logInfo("StreamingContext started")
      case ACTIVE =>
        logWarning("StreamingContext has already been started")
      case STOPPED =>
        throw new IllegalStateException("StreamingContext has already been stopped")
    }
  }

The core of it is the single call to scheduler.start(); note where it happens:

  ThreadUtils.runInNewThread("streaming-start") {
    sparkContext.setCallSite(startSite.get)
    sparkContext.clearJobGroup()
    sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
    savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
    scheduler.start()
  }

Inside StreamingContext, a child thread is spawned to start the scheduler, and a StreamingListenerStreamingStarted event is posted to the scheduler's listenerBus (the scheduler carries an internal listener bus for publishing such events).
The StreamingContext then records itself as the active context, which guarantees that an application holds at most one active StreamingContext:

StreamingContext.setActiveContext(this)

object StreamingContext extends Logging {
  private val activeContext = new AtomicReference[StreamingContext](null)

  private def setActiveContext(ssc: StreamingContext): Unit = {
    ACTIVATION_LOCK.synchronized {
      activeContext.set(ssc)
    }
  }
}
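
This active-context bookkeeping is what backs the StreamingContext.getActiveOrCreate helper on the companion object; a minimal usage sketch (the checkpoint-less overload):

// Returns the active StreamingContext if one exists in this JVM,
// otherwise builds a new one with the supplied function.
val ssc = StreamingContext.getActiveOrCreate(() => {
  val conf = new SparkConf().setAppName("aaa").setMaster("local[*]")
  new StreamingContext(conf, Seconds(1))
})
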
DStreamGraph

Coming back to batchDuration in StreamingContext, we need to look at the type of the graph member: DStreamGraph. So let's analyze the DStreamGraph source first.

Notice that this class sits at the top of the streaming machinery and is declared final, so it cannot be extended; it is a finished, self-contained piece. Let's see what members it holds and how it works.

package org.apache.spark.streaming
...

final private[streaming] class DStreamGraph extends Serializable with Logging {
  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  private val outputStreams = new ArrayBuffer[DStream[_]]()

  @volatile private var inputStreamNameAndID: Seq[(String, Int)] = Nil

  var rememberDuration: Duration = null
  var checkpointInProgress = false

  var zeroTime: Time = null
  var startTime: Time = null
  var batchDuration: Duration = null

  ...
}
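
Streams enter these two buffers through a pair of registration methods on DStreamGraph; a simplified sketch based on the 2.4.x source:

  def addInputStream(inputStream: InputDStream[_]): Unit = {
    this.synchronized {
      inputStream.setGraph(this)
      inputStreams += inputStream
    }
  }

  def addOutputStream(outputStream: DStream[_]): Unit = {
    this.synchronized {
      outputStream.setGraph(this)
      outputStreams += outputStream
    }
  }

We will meet addOutputStream again later, when print() registers its ForEachDStream.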
DStream

DStreamGraph composes and maintains a mutable array of input streams and a mutable array of output streams (scala.collection.mutable.ArrayBuffer). Both element types are abstract, and thanks to ClassTag a DStream can carry RDDs of any element type at runtime: as the code below shows, every DStream internally keeps a map whose keys are batch times and whose values are RDDs with the same element type as the DStream, which also confirms that Spark Streaming is ultimately implemented on top of RDD computations. You can picture the graph as a painting with a round reservoir at its center: the reservoir is fed by an upstream waterfall pouring down, and its water is pumped steadily to the village below for people to use — a scene of rural scenery combined with modern machinery.


package org.apache.spark.streaming.dstream
abstract class DStream[T: ClassTag] {
  ...
  /** Method that generates an RDD for the given time */
  def compute(validTime: Time): Option[RDD[T]]

  // =======================================================================
  // Methods and fields available on all DStreams
  // =======================================================================

  // RDDs generated, marked as private[streaming] so that testsuites can access it
  @transient
  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
  ....
}

InputDStream also extends DStream, so in addition to everything a DStream can do, it defines a few extra methods to better control its own lifecycle:

abstract class InputDStream[T: ClassTag](_ssc: StreamingContext)
  extends DStream[T](_ssc) {
  ...
  }

The official Spark Streaming documentation spends a lot of time on DStream right at the start.
Here we look at it from the source side: which members make up a DStream, and how does it relate to batchDuration?

abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
  ) extends Serializable with Logging {

From this we can see that a DStream depends strongly on ssc (from here on, ssc stands for the StreamingContext instance). Put another way, the upper and lower pipes in the reservoir picture (the graph) inside ssc are coloured by ssc itself: if the hall's lighting is bright and colourful, the water in the picture glows in many colours and a rainbow hangs over the waterfall (the input streams); if the lighting is dim, dark clouds gather above the waterfall...

Scheduling

Injecting and starting the input streams

A scheduler schedules jobs — so is ssc itself a job? From the DStream section we know that every DStream has a generateJob method for producing a job at a given batch time, and since DStreamGraph maintains the mutable outputStreams buffer, it naturally also exposes a generateJobs (plural) that triggers job generation for all output streams.
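
Roughly, generateJobs just walks the registered output streams and asks each one for a job for the given batch time; a simplified sketch of the DStreamGraph method (call-site bookkeeping omitted):

  def generateJobs(time: Time): Seq[Job] = {
    // Ask every registered output stream for a job covering this batch time;
    // streams that have nothing to do for this time simply return None.
    this.synchronized {
      outputStreams.flatMap(outputStream => outputStream.generateJob(time))
    }
  }

The scheduler that will eventually run these jobs is created by the StreamingContext itself: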

class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {
  ....
    private[streaming] val scheduler = new JobScheduler(this)
  ...
}

So we can reasonably guess that the graph's jobs are executed inside ssc's scheduler. Indeed, the scheduler's start method (as usual, the entry point that kicks everything off) starts its internal job generator, new JobGenerator(this):

JobScheduler.scala

package org.apache.spark.streaming.scheduler
...
/**
 * This class schedules jobs to be run on Spark. It uses the JobGenerator to generate
 * the jobs and runs them using a thread pool.
 */
private[streaming]
class JobScheduler(val ssc: StreamingContext) extends Logging {
...
  private val jobGenerator = new JobGenerator(this)
 
  ...
  def start(): Unit = synchronized {
    ...
    receiverTracker.start()
    jobGenerator.start()
    ...
  }
}

A scheduler owns exactly one job generator, and that generator is the object that ultimately does the work. Its start method first creates an event loop that handles all incoming events, and then follows one of two paths depending on whether the ssc has a checkpoint: restart(), which resumes from the last checkpoint, or startFirstTime(), which runs the jobs for the first time. One thing is already certain: the input and output DStreams of ssc are set in motion in exactly these two places. Let's look at both methods.
JobGenerator.scala

package org.apache.spark.streaming.scheduler
...
class JobGenerator(jobScheduler: JobScheduler) extends Logging {
...
/** Start generation of jobs */
  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }
  ...
}
  1. restart
    This is the path taken when the application is restarted from an existing checkpoint:
/** Restarts the generator based on the information in checkpoint */
  private def restart() {
    // If manual clock is being used for testing, then
    // either set the manual clock to the last checkpointed time,
    // or if the property is defined set it to that time
    if (clock.isInstanceOf[ManualClock]) {
      val lastTime = ssc.initialCheckpoint.checkpointTime.milliseconds
      val jumpTime = ssc.sc.conf.getLong("spark.streaming.manualClock.jump", 0)
      clock.asInstanceOf[ManualClock].setTime(lastTime + jumpTime)
    }

    val batchDuration = ssc.graph.batchDuration

    // Batches when the master was down, that is,
    // between the checkpoint and current restart time
    val checkpointTime = ssc.initialCheckpoint.checkpointTime
    val restartTime = new Time(timer.getRestartTime(graph.zeroTime.milliseconds))
    val downTimes = checkpointTime.until(restartTime, batchDuration)
    logInfo("Batches during down time (" + downTimes.size + " batches): "
      + downTimes.mkString(", "))

    // Batches that were unprocessed before failure
    val pendingTimes = ssc.initialCheckpoint.pendingTimes.sorted(Time.ordering)
    logInfo("Batches pending processing (" + pendingTimes.length + " batches): " +
      pendingTimes.mkString(", "))
    // Reschedule jobs for these times
    val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < restartTime }
      .distinct.sorted(Time.ordering)
    logInfo("Batches to reschedule (" + timesToReschedule.length + " batches): " +
      timesToReschedule.mkString(", "))
    timesToReschedule.foreach { time =>
      // Allocate the related blocks when recovering from failure, because some blocks that were
      // added but not allocated, are dangling in the queue after recovering, we have to allocate
      // those blocks to the next batch, which is the batch they were supposed to go.
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      jobScheduler.submitJobSet(JobSet(time, graph.generateJobs(time)))
    }

    // Restart the timer
    timer.start(restartTime.milliseconds)
    logInfo("Restarted JobGenerator at " + restartTime)
  }

jobScheduler.submitJobSet(JobSet(time, graph.generateJobs(time))) calls the graph's job-generation method and wraps the result in a JobSet, i.e. the set of jobs generated for a single batch; jobs are grouped into one JobSet per batch interval (look closely at how timesToReschedule is built). Each JobSet is handed back to the scheduler itself, which decides whether anything actually runs based on whether its job list is empty. A natural question: since the stream keeps producing batches, does it keep executing jobs forever? submitJobSet gives the answer — if the job set for a batch is empty, nothing is submitted; otherwise every job is handed to the executor pool.

  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }

So when is a job set empty? Going back to DStream.generateJob, shown below: if getOrCompute(time) yields an RDD, it returns Some(Job); if there is no RDD for that batch time, it returns None. (We are still on the restart path here; a simplified sketch of getOrCompute follows the snippet.)

 private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
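
getOrCompute itself first checks the generatedRDDs map for the batch time and only calls compute when no RDD has been generated yet; a simplified sketch (persistence and checkpoint handling omitted):

  private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    // Reuse the RDD generated earlier for this batch time, if any;
    // otherwise compute it, provided the time aligns with the slide duration.
    generatedRDDs.get(time).orElse {
      if (isTimeValid(time)) {
        val rddOption = compute(time)
        rddOption.foreach(newRDD => generatedRDDs.put(time, newRDD))
        rddOption
      } else {
        None
      }
    }
  }
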
  2. startFirstTime
    On a first start (no checkpoint), the following runs:
  /** Starts the generator for the first time */
  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }

startFirstTime simply starts the graph (offset back by one batchDuration) and kicks off the timer. Inside graph.start the input streams are started, while the output streams are only given their zero time and remember duration. So where does the real starting happen? Let's look at graph.start:

final private[streaming] class DStreamGraph extends Serializable with Logging {
...
    def start(time: Time) {
        this.synchronized {
          require(zeroTime == null, "DStream graph computation already started")
          zeroTime = time
          startTime = time
          outputStreams.foreach(_.initialize(zeroTime))
          outputStreams.foreach(_.remember(rememberDuration))
          outputStreams.foreach(_.validateAtStart())
          numReceivers = inputStreams.count(_.isInstanceOf[ReceiverInputDStream[_]])
          inputStreamNameAndID = inputStreams.map(is => (is.name, is.id))
          inputStreams.par.foreach(_.start()) // start the input streams
        }
    }
  ...
}

Notice that inputStreams.par.foreach calls start on every InputDStream, yet if we look at the source behind the socketTextStream input stream used in the example, its start method is effectively a no-op. In other words, actually starting the input DStreams in the graph is not the JobGenerator's job at all. So which component does it? The ReceiverTracker.
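
This is easy to verify: receiver-based streams such as the one behind socketTextStream extend ReceiverInputDStream, whose start and stop are deliberately empty (excerpt, slightly abridged):

abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {

  /** Gets the receiver object that will be sent to the worker nodes to receive data. */
  def getReceiver(): Receiver[T]

  // Nothing to start or stop as both are taken care of by the ReceiverTracker.
  def start(): Unit = {}
  def stop(): Unit = {}
}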

/**
 * This class manages the execution of the receivers of ReceiverInputDStreams. Instance of
 * this class must be created after all input streams have been added and StreamingContext.start()
 * has been called because it needs the final set of input streams at the time of instantiation.
 *
 * @param skipReceiverLaunch Do not launch the receiver. This is useful for testing. (test only; skip launching the receivers)
 */
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {
  private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
...
  def start(): Unit = synchronized {
    if (isTrackerStarted) {
      throw new SparkException("ReceiverTracker already started")
    }

    if (!receiverInputStreams.isEmpty) {
      endpoint = ssc.env.rpcEnv.setupEndpoint(
        "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
      if (!skipReceiverLaunch) launchReceivers()
      logInfo("ReceiverTracker started")
      trackerState = Started
    }
  }
  ...
    /**
   * Get the receivers from the ReceiverInputDStreams, distributes them to the
   * worker nodes as a parallel collection, and runs them.
   */
  private def launchReceivers(): Unit = {
    val receivers = receiverInputStreams.map { nis =>
      val rcvr = nis.getReceiver()
      rcvr.setReceiverId(nis.id)
      rcvr
    }

    runDummySparkJob()

    logInfo("Starting " + receivers.length + " receivers")
    endpoint.send(StartAllReceivers(receivers))
  }
  ...
  /** RpcEndpoint to receive messages from the receivers. */
  private class ReceiverTrackerEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {

    private val walBatchingThreadPool = ExecutionContext.fromExecutorService(
      ThreadUtils.newDaemonCachedThreadPool("wal-batching-thread-pool"))

    @volatile private var active: Boolean = true

    override def receive: PartialFunction[Any, Unit] = {
      // Local messages
      case StartAllReceivers(receivers) =>
        val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
        for (receiver <- receivers) {
          val executors = scheduledLocations(receiver.streamId)
          updateReceiverScheduledExecutors(receiver.streamId, executors)
          receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
          startReceiver(receiver, executors)
        }
        // ... (other messages omitted)
    }
    // ...
    }
}

The source spells this out: ReceiverTracker grabs the receiver input streams from the graph. In its start method it sets up an RPC endpoint (over Spark's Netty-based RPC) and launches the receivers through it: once a StartAllReceivers message is sent to that endpoint, ReceiverTrackerEndpoint.receive fires and starts all the receivers. The source also shows that a receiver-based stream is just one kind of InputDStream, and that the skipReceiverLaunch parameter exists purely so tests can avoid launching real receivers. Following the endpoint's startReceiver path, each receiver ends up being started by a dedicated ReceiverSupervisor, so we can expect that as long as a receiver implements onStart, its stream will be started. The following confirms it:

private[streaming] abstract class ReceiverSupervisor(
    receiver: Receiver[_],
    conf: SparkConf
  ) extends Logging {
  ...
/** Start receiver */
  def startReceiver(): Unit = synchronized {
    try {
      if (onReceiverStart()) {
        logInfo(s"Starting receiver $streamId")
        receiverState = Started
        receiver.onStart() // as long as the receiver implements onStart, it is started here
        logInfo(s"Called receiver $streamId onStart")
      } else {
        // The driver refused us
        stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
      }
    } catch {
      case NonFatal(t) =>
        stop("Error starting receiver " + streamId, Some(t))
    }
  }
...
}
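
To make onStart concrete, here is a minimal custom receiver modeled on the official custom-receiver guide (a sketch with deliberately thin error handling; the class name is ours):

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Called by ReceiverSupervisor.startReceiver(); must not block, so spawn a thread
  def onStart(): Unit = {
    new Thread("Custom Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // The receiving thread stops itself once isStopped() becomes true
  def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand the record to Spark, which buffers it into blocks
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

Such a receiver would be wired in with ssc.receiverStream(new CustomSocketReceiver("localhost", 9999)), which yields the same kind of ReceiverInputDStream as socketTextStream.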

Up to this point we have a reasonable picture of how the receiver-based input path is started. How, then, are output streams started, and what drives them?

Injecting the output streams
wordCounts.print()

In the example, this call is precisely where an output stream is injected:

 /**
   * Print the first ten elements of each RDD generated in this DStream. This is an output
   * operator, so this DStream will be registered as an output stream and there materialized.
   */
  def print(): Unit = ssc.withScope {
    print(10)
  }

  /**
   * Print the first num elements of each RDD generated in this DStream. This is an output
   * operator, so this DStream will be registered as an output stream and there materialized.
   */
  def print(num: Int): Unit = ssc.withScope {
    def foreachFunc: (RDD[T], Time) => Unit = {
      (rdd: RDD[T], time: Time) => {
        val firstNum = rdd.take(num + 1)
        // scalastyle:off println
        println("-------------------------------------------")
        println(s"Time: $time")
        println("-------------------------------------------")
        firstNum.take(num).foreach(println)
        if (firstNum.length > num) println("...")
        println()
        // scalastyle:on println
      }
    }
    foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
  }

As the source shows, print is essentially a call to DStream's private foreachRDD, shown below: it creates a ForEachDStream and registers it into the graph of the current ssc, which completes the creation of the output stream.

private def foreachRDD(
      foreachFunc: (RDD[T], Time) => Unit,
      displayInnerRDDOps: Boolean): Unit = {
    new ForEachDStream(this,
      context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
  }
  
DStream.scala
/**
   * Register this streaming as an output stream. This would ensure that RDDs of this
   * DStream will be generated.
   */
  private[streaming] def register(): DStream[T] = {
    ssc.graph.addOutputStream(this)
    this
  }
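
Any user-defined output operation goes through the same register path; for instance the public foreachRDD overload can be used directly (a usage sketch):

wordCounts.foreachRDD { (rdd, time) =>
  // Running any action here materializes this DStream as an output stream,
  // because foreachRDD registers a ForEachDStream in the graph.
  println(s"Batch $time contains ${rdd.count()} records")
}
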
Where is the flow?

But the two pipes inside the graph still don't seem to be flowing the way we imagined — for example, we never saw DStreams being dynamically added or torn down at runtime. What is going on?

In fact, both JobScheduler.start and JobGenerator.start contain a built-in eventLoop object — literally an "event loop". What is an event loop? Let's compare the two:

The event loop in JobScheduler.start:
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()
    
The event loop in JobGenerator.start:
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

The scheduler builds a never-ending event loop over JobSchedulerEvents: scheduling events are produced, queued into the loop's internal eventQueue, and handled one by one via onReceive. The job generator has the analogous loop over JobGeneratorEvents, handling its events the same way.

By now the picture should be clear: why does a fixed DStream graph keep processing our user-defined work on schedule? Because JobGenerator owns a timer that, once per batchDuration, posts a GenerateJobs event to the eventLoop. Each event triggers the generator's processEvent, which calls graph.generateJobs(time) to produce the jobs of all output streams; on Success(jobs), the generator calls submitJobSet on its enclosing scheduler, which hands the jobs to the jobExecutor thread pool:

private[streaming]
class JobGenerator(jobScheduler: JobScheduler) extends Logging {
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
  // ...
    /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }
  private def generateJobs(time: Time) {
    // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
    // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
    ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }
}

JobScheduler.scala
  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }
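
For intuition, the RecurringTimer used above can be pictured as a thread that fires its callback once per period; a toy sketch of the idea (the real class additionally supports a pluggable Clock and restarting from a given time):

// A toy recurring timer: invokes callback(time) every periodMs milliseconds.
class ToyRecurringTimer(periodMs: Long, callback: Long => Unit) {
  @volatile private var stopped = false

  private val thread = new Thread("toy-recurring-timer") {
    override def run(): Unit = {
      // Align the first tick to the next period boundary, like JobGenerator does
      var nextTime = (System.currentTimeMillis() / periodMs + 1) * periodMs
      while (!stopped) {
        val sleepMs = nextTime - System.currentTimeMillis()
        if (sleepMs > 0) Thread.sleep(sleepMs)
        callback(nextTime) // e.g. post GenerateJobs(new Time(nextTime)) to the event loop
        nextTime += periodMs
      }
    }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true }
}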

If anything above is inaccurate, corrections are welcome — thank you.
