The previous article discussed how the Receiver of a ReceiverInputDStream gets shipped to an Executor and run there; the key code is the start method of ReceiverSupervisorImpl.
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    // Get the active TaskContext. attemptNumber says how many times this task
    // has been attempted: the first attempt gets attemptNumber = 0, and
    // subsequent attempts get increasing attempt numbers.
    if (TaskContext.get().attemptNumber() == 0) {
      val receiver = iterator.next()
      // On the first attempt the iterator is exhausted after next() above;
      // otherwise this assertion fails
      assert(iterator.hasNext == false)
      // ReceiverSupervisorImpl supervises the Receiver and handles the writes:
      // each interval the receiver's data is written into a BlockGenerator, and
      // ReceiverInputDStream's compute method later pulls each interval's data
      // out of the BlockGenerator and into an RDD
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      // This calls the implementation in the parent class ReceiverSupervisor;
      // onStart is ultimately implemented by the subclass ReceiverSupervisorImpl
      /*def start() {
        onStart()
        startReceiver()
      }*/
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // When attemptNumber is greater than 0, the task was restarted by the
      // TaskScheduler; the Receiver must not be restarted here, so just exit.
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }
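To see the attemptNumber guard in isolation, here is a minimal, self-contained sketch (a hypothetical job, not Spark's actual receiver-launch code) that runs long-lived work only on a task's first attempt; retries scheduled by the TaskScheduler fall through so the work can be rescheduled cleanly:

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object AttemptGuardSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("attempt-guard"))
    sc.parallelize(Seq("receiver-0"), numSlices = 1).foreachPartition { iter =>
      if (TaskContext.get().attemptNumber() == 0) {
        // First attempt: this is where the long-running work (starting the
        // receiver and blocking in awaitTermination) would happen
        println(s"starting ${iter.next()} on first attempt")
      } else {
        // Retried by the TaskScheduler: exit so it can be rescheduled cleanly
      }
    }
    sc.stop()
  }
}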
1. supervisor.start() is the entry point that starts the Receiver receiving data on the Executor.
The start() method is implemented in ReceiverSupervisor, the parent class of ReceiverSupervisorImpl:
/** Start the supervisor */
def start() {
  onStart()
  // What really runs here is the onStart of the concrete Receiver,
  // e.g. the onStart of the Receiver used by SocketInputDStream
  startReceiver()
}
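The parent/child split here is a plain template-method pattern. A minimal sketch (hypothetical class names) of the same structure, where the parent fixes the call order and the subclass supplies the hooks:

// The parent fixes the order: hooks run before the receiver starts.
abstract class Supervisor {
  def start(): Unit = {
    onStart()        // subclass hook, guaranteed to run first
    startReceiver()  // then actually start receiving data
  }
  protected def onStart(): Unit
  protected def startReceiver(): Unit
}

class SupervisorImpl extends Supervisor {
  protected def onStart(): Unit = println("start block generators first")
  protected def startReceiver(): Unit = println("then receiver.onStart()")
}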
2. Step into onStart(). Its scaladoc is explicit: this method must run before receiver.onStart() is called, which guarantees the BlockGenerators are up and ready before the receiver starts sending data over.
/**
 * Called when supervisor is started.
 * Note that this must be called before the receiver.onStart() is called to ensure
 * things like [[BlockGenerator]]s are started before the receiver starts sending data.
 */
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
a. registeredBlockGenerators is an ArrayBuffer[BlockGenerator]. When the ReceiverSupervisorImpl is initialized, a BlockGenerator is created and added to this collection; the _.start() above is BlockGenerator.start.
/**
 * Concrete implementation of [[org.apache.spark.streaming.receiver.ReceiverSupervisor]]
 * which provides all the necessary functionality for handling the data received by
 * the receiver. Specifically, it creates a [[org.apache.spark.streaming.receiver.BlockGenerator]]
 * object that is used to divide the received data stream into blocks of data.
 *
 * a, this class implements ReceiverSupervisor and provides everything needed to
 *    handle the data the receiver brings in,
 * b, at initialization it calls createBlockGenerator to create a BlockGenerator,
 * c, and it uses that BlockGenerator to divide the receiver's data into blocks.
 *
 * ReceiverSupervisorImpl supervises the Receiver and handles the writes: each
 * interval the receiver's data is written into the BlockGenerator, and
 * ReceiverInputDStream's compute method later pulls each interval's data out of
 * the BlockGenerator and into an RDD.
 */
private[streaming] class ReceiverSupervisorImpl(
    receiver: Receiver[_],
    env: SparkEnv,
    hadoopConf: Configuration,
    checkpointDirOption: Option[String]
  ) extends ReceiverSupervisor(receiver, env.conf) with Logging {
  ...
  /** Unique block ids if one wants to add blocks directly */
  private val newBlockId = new AtomicLong(System.currentTimeMillis())

  // Populated by createBlockGenerator below, which appends each new
  // BlockGenerator to this ArrayBuffer[BlockGenerator]
  private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
    with mutable.SynchronizedBuffer[BlockGenerator]

  /** Divides received data records into data blocks for pushing in BlockManager.
   * A BlockGeneratorListener that helps divide the data and hand it to the BlockManager.
   */
  private val defaultBlockGeneratorListener = new BlockGeneratorListener {
    def onAddData(data: Any, metadata: Any): Unit = { }

    def onGenerateBlock(blockId: StreamBlockId): Unit = {
      print("onGenerateBlock no impl.....")
    }

    def onError(message: String, throwable: Throwable) {
      reportError(message, throwable)
    }

    def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
      // Put the ArrayBuffer that the receiver filled via store() into memory
      pushArrayBuffer(arrayBuffer, None, Some(blockId))
    }
  }

  // createBlockGenerator is called at initialization to create the BlockGenerator
  // used to divide the data the Receiver brings in. Each call creates one
  // BlockGenerator, which is both added to registeredBlockGenerators and returned.
  private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
b. Take a look at createBlockGenerator(), which creates the BlockGenerator and injects the BlockGeneratorListener into it:
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  // registeredBlockGenerators is an ArrayBuffer[BlockGenerator]; first remove
  // the generators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }
  // Create a new BlockGenerator instance, add it to registeredBlockGenerators,
  // and return it
  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}
3. Now look at the BlockGenerator.start() method.
/**
 * Generates batches of objects received by a
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
 * named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch of as a block,
 * the other to push the blocks into the block manager.
 *
 * In other words, this class batches the data a Receiver brings in and places it
 * into appropriately named blocks, using two threads:
 * a, one periodically starts a new batch and turns the previous batch into a
 *    block -- the updateCurrentBuffer method driven by the RecurringTimer below;
 * b, the other pushes the blocks into the BlockManager -- the blockPushingThread below.
 *
 * Note: Do not create BlockGenerator instances directly inside receivers. Use
 * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
 */
private[streaming] class BlockGenerator(
    listener: BlockGeneratorListener,
    receiverId: Int,
    conf: SparkConf,
    clock: Clock = new SystemClock()
  ) extends RateLimiter(conf) with Logging {

  // First parameter: StreamBlockId(receiverId, next-trigger-time - 200ms);
  // second parameter: the data the receiver store()d from the data source
  private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any])

  /**
   * The BlockGenerator can be in 5 possible states, in the order as follows.
   * - Initialized: Nothing has been started
   * - Active: start() has been called, and it is generating blocks on added data.
   * - StoppedAddingData: stop() has been called, the adding of data has been stopped,
   *   but blocks are still being generated and pushed.
   * - StoppedGeneratingBlocks: Generating of blocks has been stopped, but
   *   they are still being pushed.
   * - StoppedAll: Everything has stopped, and the BlockGenerator object can be GCed.
   */
  private object GeneratorState extends Enumeration {
    type GeneratorState = Value
    val Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll = Value
  }
  import GeneratorState._

  // The interval at which received data is divided into blocks before being
  // stored in Spark; going below 50ms is not recommended
  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  // Calls updateCurrentBuffer once every 200ms (by default)
  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")

  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var state = Initialized

  /** Start block generating and pushing threads. */
  def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      blockIntervalTimer.start()
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
    }
  }
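Because one block is cut every spark.streaming.blockInterval and each block later becomes one partition of the batch's RDD, the two intervals directly determine the parallelism of each batch. A standalone sketch of that arithmetic, using the defaults and recommendations discussed above:

object BlockIntervalMath {
  // One block per blockInterval, one partition per block
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long =
    batchIntervalMs / blockIntervalMs

  def main(args: Array[String]): Unit = {
    // Default 200ms blocks with a 1s batch => 5 blocks, i.e. 5 partitions
    println(blocksPerBatch(batchIntervalMs = 1000, blockIntervalMs = 200)) // 5
    // The recommended floor of 50ms with a 500ms batch => 10 partitions
    println(blocksPerBatch(batchIntervalMs = 500, blockIntervalMs = 50))   // 10
  }
}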
a. First look at the periodic blockIntervalTimer thread, which keeps generating Blocks:
// The interval at which received data is divided into blocks before being
// stored in Spark; going below 50ms is not recommended
private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

// Calls updateCurrentBuffer once every 200ms (by default)
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
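RecurringTimer itself is a small Spark-internal utility. A simplified sketch of the same idea (a hypothetical class built on a JDK scheduler; the real RecurringTimer also aligns firing times to multiples of the period, which this sketch omits):

import java.util.concurrent.{Executors, TimeUnit}

// Invokes `callback` with the current time once every `periodMs` milliseconds
// on a dedicated thread, like blockIntervalTimer driving updateCurrentBuffer.
class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    scheduler.scheduleAtFixedRate(
      () => callback(System.currentTimeMillis()),
      periodMs, periodMs, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdown()
}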
b. updateCurrentBuffer(): if the receiver has stored data, a Block is generated every 200ms (by default) and put into an ArrayBlockingQueue; whenever that queue has elements, the Blocks are continuously handed to the BlockManager for storage.
/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    // currentBuffer is the @volatile ArrayBuffer[Any] initialized above; it only
    // gets elements after a concrete Receiver calls store(), which in turn calls
    // this class's addData method
    synchronized {
      if (currentBuffer.nonEmpty) {
        // If currentBuffer has elements, hand it off to a local variable and
        // install a fresh ArrayBuffer[Any] in its place
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        // The streaming batch interval should not go below 500ms; the default
        // blockIntervalMs is 200ms, with 50ms as a sensible lower bound
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        // When the listener is the BlockGeneratorListener passed in by Streaming,
        // onGenerateBlock has no real implementation
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    if (newBlock != null) {
      // Put the Block into blocksForPushing, an ArrayBlockingQueue[Block]
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
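The heart of updateCurrentBuffer is a swap-under-lock (double-buffer) pattern: receiver threads append to the current buffer, and the timer thread atomically replaces it with a fresh one. A minimal sketch (hypothetical class) of just that pattern:

import scala.collection.mutable.ArrayBuffer

class DoubleBuffer[T] {
  private var current = new ArrayBuffer[T]

  // Called from receiver threads (the role of addData behind store())
  def add(item: T): Unit = synchronized { current += item }

  // Called from the timer thread: hands back the filled buffer, if any,
  // and installs an empty one in its place
  def swap(): Option[ArrayBuffer[T]] = synchronized {
    if (current.nonEmpty) {
      val full = current
      current = new ArrayBuffer[T]
      Some(full)
    } else None
  }
}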
c. Now the blockPushingThread thread: it keeps watching the ArrayBlockingQueue[Block] for incoming blocks and stores them into the BlockManager.
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")

  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }

  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    // As long as block generation has not been stopped, keep taking blocks out of
    // blocksForPushing (the ArrayBlockingQueue[Block]); pushBlock() only runs once
    // updateCurrentBuffer has moved elements from currentBuffer into this queue
    while (areBlocksBeingGenerated) {
      // Take the head of the queue, waiting up to 10ms; returns null on timeout
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        // pushBlock stores the data the Receiver collected via store() into the
        // BlockManager's diskStore or MemoryStore
        case Some(block) => pushBlock(block)
        case None =>
      }
    }

    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      // Hands the block to the defaultBlockGeneratorListener created at initialization
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
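The two-phase loop in keepPushingBlocks -- poll with a short timeout while blocks are still being generated, then drain whatever remains -- is a standard producer/consumer shutdown pattern over ArrayBlockingQueue. A runnable sketch (hypothetical names) of the same structure:

import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

object QueueDrainSketch {
  def main(args: Array[String]): Unit = {
    val queue = new ArrayBlockingQueue[String](10)
    @volatile var generating = true

    val consumer = new Thread(() => {
      // Phase 1: poll with a 10ms timeout while the producer is active
      while (generating) {
        Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(b => println(s"push $b"))
      }
      // Phase 2: generation stopped, drain the remaining elements
      while (!queue.isEmpty) println(s"push (drain) ${queue.take()}")
    })
    consumer.start()

    (1 to 5).foreach(i => queue.put(s"block-$i")) // put blocks when the queue is full
    generating = false
    consumer.join()
  }
}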
4. Next look at pushBlock(block): it hands each generated block to the defaultBlockGeneratorListener in ReceiverSupervisorImpl.
// pushBlock stores the data the Receiver collected via store() into the
// BlockManager's diskStore or MemoryStore
private def pushBlock(block: Block) {
  // block.id is StreamBlockId(receiverId, time - 200ms); block.buffer holds the
  // data that came in through receiver.store(); the listener here is the
  // defaultBlockGeneratorListener created when ReceiverSupervisorImpl was initialized
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
a. See how the onPushBlock() of this BlockGeneratorListener gets the Block data into the BlockManager:
/** Divides received data records into data blocks for pushing in BlockManager.
 * A BlockGeneratorListener that helps divide the data and hand it to the BlockManager.
 */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = {
    print("onGenerateBlock no impl.....")
  }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    // Put the ArrayBuffer that the receiver filled via store() into memory
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
b. Step into pushArrayBuffer to see the concrete implementation:
/** Store an ArrayBuffer of received data as a data block into Spark's memory.
 * Stores the ArrayBuffer that the receiver filled via store(), e.g.
 * pushArrayBuffer(arrayBuffer[Any], None, Some(StreamBlockId(receiverId, time - 200ms)))
 */
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // Report to the driver; ArrayBufferBlock is the ReceivedBlock case class
  // indicating that the data block is held in an ArrayBuffer
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}
==> See how pushAndReportBlock stores the block and reports it to the driver:
/** Store block and report it to driver.
 * e.g. pushAndReportBlock(ArrayBufferBlock(arrayBuffer[Any]), None,
 *      Some(StreamBlockId(receiverId, time - 200ms)))
 */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // If no StreamBlockId was supplied, generate a new one
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block into the BlockManager at the configured storageLevel; returns
  // BlockManagerBasedStoreResult(blockId, numRecords), where numRecords is the
  // number of records put in via store()
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Send the ReceivedBlockInfo to the receiveAndReply of ReceiverTrackerEndpoint
  // on the driver: the block's metadata travels inside the AddBlock message
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
===> The source shows that the received data is stored into the BlockManager by receivedBlockHandler.storeBlock().
===> First see how receivedBlockHandler is constructed; from the source, in the present case we get a BlockManagerBasedBlockHandler instance:
// ReceivedBlockHandler handles the blocks coming from the receiver
private val receivedBlockHandler: ReceivedBlockHandler = {
  // If spark.streaming.receiver.writeAheadLog.enable is not set in the conf, it
  // defaults to false, so no WriteAheadLogBasedBlockHandler is created
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    // Use the BlockManager of this Executor's SparkEnv to store the blocks; with
    // socketTextStream the receiver here is a SocketReceiver
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
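Which branch is taken is purely a matter of configuration. A minimal sketch of the settings that would flip receivedBlockHandler over to WriteAheadLogBasedBlockHandler (the checkpoint path is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("wal-config")
      // Default is false; true selects the write-ahead-log handler
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Required with the WAL enabled, otherwise the SparkException above is thrown
    ssc.checkpoint("/tmp/streaming-checkpoint")
    // ... define sources and outputs, then ssc.start() / ssc.awaitTermination()
    ssc.stop()
  }
}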
===> The source shows that storeBlock stores the receiver's data into the Spark cluster via BlockManager.putIterator at the given storageLevel, then returns the StreamBlockId and the number of stored records wrapped in a BlockManagerBasedStoreResult.
(The BlockManager.putIterator() code itself will be analyzed separately later.)
/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks into a block manager with the specified storage level.
 */
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  // ReceivedBlock is the data put in via store(), here ArrayBufferBlock(arrayBuffer[Any])
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
    var numRecords = None: Option[Long]
    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        // The number of elements that were put into arrayBuffer[Any]
        numRecords = Some(arrayBuffer.size.toLong)
        // Store the block into the BlockManager at the given storageLevel
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        ...
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }
}
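The match in storeBlock is just dispatch over the shapes a received block can take: only the materialized ArrayBuffer case knows its record count up front. A standalone sketch (a hypothetical sealed hierarchy mirroring ReceivedBlock's shape):

import scala.collection.mutable.ArrayBuffer

sealed trait Block
case class BufferBlock(buffer: ArrayBuffer[Any]) extends Block
case class IterBlock(iterator: Iterator[Any]) extends Block

object StoreDispatchSketch {
  // Returns the record count when it is knowable before storing
  def numRecords(block: Block): Option[Long] = block match {
    case BufferBlock(buf) => Some(buf.size.toLong) // materialized: count known
    case IterBlock(_)     => None                  // iterator: consumed by the store
  }
}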
===> Finally, the driver is notified through the ReceiverTrackerEndpoint:
// Send the ReceivedBlockInfo to the receiveAndReply of ReceiverTrackerEndpoint
// on the driver: the block's metadata travels inside the AddBlock message
trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
===> On the driver, ReceiverTrackerEndpoint's receiveAndReply receives the message and replies with the boolean returned by addBlock:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Remote messages
  case RegisterReceiver(streamId, typ, host, executorId, receiverEndpoint) =>
    val successful = registerReceiver(streamId, typ, host, executorId,
      receiverEndpoint, context.senderAddress)
    context.reply(successful)
  case AddBlock(receivedBlockInfo) =>
    if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
      walBatchingThreadPool.execute(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          if (active) {
            context.reply(addBlock(receivedBlockInfo))
          } else {
            throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
          }
        }
      })
    } else {
      context.reply(addBlock(receivedBlockInfo))
    }
  case DeregisterReceiver(streamId, message, error) =>
    ...
}
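When WAL batching is enabled, the reply is produced on a worker pool rather than on the endpoint's message loop, which keeps slow log writes from blocking other RPCs. A plain-Scala sketch of that reply pattern (a hypothetical helper, not Spark's RpcEndpoint API):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object AsyncReplySketch {
  private val pool: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  def handleAddBlock(batchingEnabled: Boolean,
                     addBlock: () => Boolean,
                     reply: Boolean => Unit): Unit = {
    if (batchingEnabled) {
      // Reply from a worker thread, off the message loop
      Future { reply(addBlock()) }(pool)
    } else {
      // Reply inline on the message loop
      reply(addBlock())
    }
  }
}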
===> addBlock puts the ReceivedBlockInfo metadata into a ReceivedBlockQueue, whose elements are exactly these ReceivedBlockInfo records. The method returns false if an exception occurs.
==> The method is implemented in ReceivedBlockTracker:
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    } else {
      logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
    }
    writeResult
  } catch {
    case NonFatal(e) =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
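The shape of addBlock -- persist the event first, mutate in-memory state only on a successful write, and report failure as false instead of throwing -- is a classic write-ahead pattern. A minimal sketch (hypothetical types) of the same contract:

import scala.collection.mutable
import scala.util.control.NonFatal

class MetadataTracker[E](writeToLog: E => Boolean) {
  private val queue = mutable.Queue.empty[E]

  def add(event: E): Boolean =
    try {
      val written = writeToLog(event)              // 1. write-ahead: persist first
      if (written) synchronized { queue += event } // 2. then update memory
      written                                      // 3. surface the write result
    } catch {
      case NonFatal(e) =>
        println(s"Error adding $event: $e")
        false                                      // never throw across the RPC boundary
    }

  def pending: Seq[E] = synchronized { queue.toSeq }
}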
At this point we have traced how ReceiverSupervisorImpl's onStart() gets the Receiver's data written into Spark's BlockManager.
Next, we analyze startReceiver() in ReceiverSupervisorImpl: how does the Receiver store its data so that it ends up in an RDD?