The previous article discussed how the Receiver of a ReceiverInputDStream gets shipped to an Executor and run there; the key code is the start method of ReceiverSupervisorImpl.
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    // Get the active TaskContext. attemptNumber says how many times this task
    // has been attempted: the first attempt gets attemptNumber = 0, and
    // subsequent attempts get increasing attempt numbers.
    if (TaskContext.get().attemptNumber() == 0) {
      val receiver = iterator.next()
      // On the first attempt the iterator is exhausted after next() above;
      // otherwise this assertion fails
      assert(iterator.hasNext == false)
      // ReceiverSupervisorImpl supervises the Receiver and handles the writes:
      // each interval the receiver's data is written into a BlockGenerator, and
      // ReceiverInputDStream's compute method later pulls each interval's data
      // out of the BlockGenerator and into an RDD
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      // This calls the implementation in the parent class ReceiverSupervisor;
      // onStart is ultimately implemented by the subclass ReceiverSupervisorImpl
      /*def start() {
        onStart()
        startReceiver()
      }*/
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // When attemptNumber is greater than 0, the task was restarted by the
      // TaskScheduler; the Receiver must not be restarted here, so just exit.
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }
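To see the attemptNumber guard in isolation, here is a minimal, self-contained sketch (a hypothetical job, not Spark's actual receiver-launch code) that runs long-lived work only on a task's first attempt; retries scheduled by the TaskScheduler fall through so the work can be rescheduled cleanly:

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object AttemptGuardSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("attempt-guard"))
    sc.parallelize(Seq("receiver-0"), numSlices = 1).foreachPartition { iter =>
      if (TaskContext.get().attemptNumber() == 0) {
        // First attempt: this is where the long-running work (starting the
        // receiver and blocking in awaitTermination) would happen
        println(s"starting ${iter.next()} on first attempt")
      } else {
        // Retried by the TaskScheduler: exit so it can be rescheduled cleanly
      }
    }
    sc.stop()
  }
}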
1. supervisor.start() is the entry point that starts the Receiver receiving data on the Executor.
The start() method is implemented in ReceiverSupervisor, the parent class of ReceiverSupervisorImpl:
/** Start the supervisor */
def start() {
  onStart()
  // What really runs here is the onStart of the concrete Receiver,
  // e.g. the onStart of the Receiver used by SocketInputDStream
  startReceiver()
}
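The parent/child split here is a plain template-method pattern. A minimal sketch (hypothetical class names) of the same structure, where the parent fixes the call order and the subclass supplies the hooks:

// The parent fixes the order: hooks run before the receiver starts.
abstract class Supervisor {
  def start(): Unit = {
    onStart()        // subclass hook, guaranteed to run first
    startReceiver()  // then actually start receiving data
  }
  protected def onStart(): Unit
  protected def startReceiver(): Unit
}

class SupervisorImpl extends Supervisor {
  protected def onStart(): Unit = println("start block generators first")
  protected def startReceiver(): Unit = println("then receiver.onStart()")
}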
2. Step into onStart(). Its scaladoc is explicit: this method must run before receiver.onStart() is called, which guarantees the BlockGenerators are up and ready before the receiver starts sending data over.
/**
 * Called when supervisor is started.
 * Note that this must be called before the receiver.onStart() is called to ensure
 * things like [[BlockGenerator]]s are started before the receiver starts sending data.
 */
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
a. registeredBlockGenerators is an ArrayBuffer[BlockGenerator]. When the ReceiverSupervisorImpl is initialized, a BlockGenerator is created and added to this collection; the _.start() above is BlockGenerator.start.
/**
 * Concrete implementation of [[org.apache.spark.streaming.receiver.ReceiverSupervisor]]
 * which provides all the necessary functionality for handling the data received by
 * the receiver. Specifically, it creates a [[org.apache.spark.streaming.receiver.BlockGenerator]]
 * object that is used to divide the received data stream into blocks of data.
 *
 * a, this class implements ReceiverSupervisor and provides everything needed to
 *    handle the data the receiver brings in,
 * b, at initialization it calls createBlockGenerator to create a BlockGenerator,
 * c, and it uses that BlockGenerator to divide the receiver's data into blocks.
 *
 * ReceiverSupervisorImpl supervises the Receiver and handles the writes: each
 * interval the receiver's data is written into the BlockGenerator, and
 * ReceiverInputDStream's compute method later pulls each interval's data out of
 * the BlockGenerator and into an RDD.
 */
private[streaming] class ReceiverSupervisorImpl(
    receiver: Receiver[_],
    env: SparkEnv,
    hadoopConf: Configuration,
    checkpointDirOption: Option[String]
  ) extends ReceiverSupervisor(receiver, env.conf) with Logging {
  ...
  /** Unique block ids if one wants to add blocks directly */
  private val newBlockId = new AtomicLong(System.currentTimeMillis())

  // Populated by createBlockGenerator below, which appends each new
  // BlockGenerator to this ArrayBuffer[BlockGenerator]
  private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
    with mutable.SynchronizedBuffer[BlockGenerator]

  /** Divides received data records into data blocks for pushing in BlockManager.
   * A BlockGeneratorListener that helps divide the data and hand it to the BlockManager.
   */
  private val defaultBlockGeneratorListener = new BlockGeneratorListener {
    def onAddData(data: Any, metadata: Any): Unit = { }

    def onGenerateBlock(blockId: StreamBlockId): Unit = {
      print("onGenerateBlock no impl.....")
    }

    def onError(message: String, throwable: Throwable) {
      reportError(message, throwable)
    }

    def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
      // Put the ArrayBuffer that the receiver filled via store() into memory
      pushArrayBuffer(arrayBuffer, None, Some(blockId))
    }
  }

  // createBlockGenerator is called at initialization to create the BlockGenerator
  // used to divide the data the Receiver brings in. Each call creates one
  // BlockGenerator, which is both added to registeredBlockGenerators and returned.
  private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
b. Take a look at createBlockGenerator(), which creates the BlockGenerator and injects the BlockGeneratorListener into it:
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  // registeredBlockGenerators is an ArrayBuffer[BlockGenerator]; first remove
  // the generators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }
  // Create a new BlockGenerator instance, add it to registeredBlockGenerators,
  // and return it
  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}
3. Now look at the BlockGenerator.start() method.
/**
 * Generates batches of objects received by a
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
 * named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch of as a block,
 * the other to push the blocks into the block manager.
 *
 * In other words, this class batches the data a Receiver brings in and places it
 * into appropriately named blocks, using two threads:
 * a, one periodically starts a new batch and turns the previous batch into a
 *    block -- the updateCurrentBuffer method driven by the RecurringTimer below;
 * b, the other pushes the blocks into the BlockManager -- the blockPushingThread below.
 *
 * Note: Do not create BlockGenerator instances directly inside receivers. Use
 * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
 */
private[streaming] class BlockGenerator(
    listener: BlockGeneratorListener,
    receiverId: Int,
    conf: SparkConf,
    clock: Clock = new SystemClock()
  ) extends RateLimiter(conf) with Logging {

  // First parameter: StreamBlockId(receiverId, next-trigger-time - 200ms);
  // second parameter: the data the receiver store()d from the data source
  private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any])

  /**
   * The BlockGenerator can be in 5 possible states, in the order as follows.
   * - Initialized: Nothing has been started
   * - Active: start() has been called, and it is generating blocks on added data.
   * - StoppedAddingData: stop() has been called, the adding of data has been stopped,
   *   but blocks are still being generated and pushed.
   * - StoppedGeneratingBlocks: Generating of blocks has been stopped, but
   *   they are still being pushed.
   * - StoppedAll: Everything has stopped, and the BlockGenerator object can be GCed.
   */
  private object GeneratorState extends Enumeration {
    type GeneratorState = Value
    val Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll = Value
  }
  import GeneratorState._

  // The interval at which received data is divided into blocks before being
  // stored in Spark; going below 50ms is not recommended
  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  // Calls updateCurrentBuffer once every 200ms (by default)
  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")

  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var state = Initialized

  /** Start block generating and pushing threads. */
  def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      blockIntervalTimer.start()
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
    }
  }
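Because one block is cut every spark.streaming.blockInterval and each block later becomes one partition of the batch's RDD, the two intervals directly determine the parallelism of each batch. A standalone sketch of that arithmetic, using the defaults and recommendations discussed above:

object BlockIntervalMath {
  // One block per blockInterval, one partition per block
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long =
    batchIntervalMs / blockIntervalMs

  def main(args: Array[String]): Unit = {
    // Default 200ms blocks with a 1s batch => 5 blocks, i.e. 5 partitions
    println(blocksPerBatch(batchIntervalMs = 1000, blockIntervalMs = 200)) // 5
    // The recommended floor of 50ms with a 500ms batch => 10 partitions
    println(blocksPerBatch(batchIntervalMs = 500, blockIntervalMs = 50))   // 10
  }
}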
a. First look at the periodic blockIntervalTimer thread, which keeps generating Blocks:
// The interval at which received data is divided into blocks before being
// stored in Spark; going below 50ms is not recommended
private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

// Calls updateCurrentBuffer once every 200ms (by default)
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
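RecurringTimer itself is a small Spark-internal utility. A simplified sketch of the same idea (a hypothetical class built on a JDK scheduler; the real RecurringTimer also aligns firing times to multiples of the period, which this sketch omits):

import java.util.concurrent.{Executors, TimeUnit}

// Invokes `callback` with the current time once every `periodMs` milliseconds
// on a dedicated thread, like blockIntervalTimer driving updateCurrentBuffer.
class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    scheduler.scheduleAtFixedRate(
      () => callback(System.currentTimeMillis()),
      periodMs, periodMs, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdown()
}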
b. updateCurrentBuffer(): if the receiver has stored data, a Block is generated every 200ms (by default) and put into an ArrayBlockingQueue; whenever that queue has elements, the Blocks are continuously handed to the BlockManager for storage.
/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    // currentBuffer is the @volatile ArrayBuffer[Any] initialized above; it only
    // gets elements after a concrete Receiver calls store(), which in turn calls
    // this class's addData method
    synchronized {
      if (currentBuffer.nonEmpty) {
        // If currentBuffer has elements, hand it off to a local variable and
        // install a fresh ArrayBuffer[Any] in its place
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        // The streaming batch interval should not go below 500ms; the default
        // blockIntervalMs is 200ms, with 50ms as a sensible lower bound
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        // When the listener is the BlockGeneratorListener passed in by Streaming,
        // onGenerateBlock has no real implementation
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    if (newBlock != null) {
      // Put the Block into blocksForPushing, an ArrayBlockingQueue[Block]
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
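The heart of updateCurrentBuffer is a swap-under-lock (double-buffer) pattern: receiver threads append to the current buffer, and the timer thread atomically replaces it with a fresh one. A minimal sketch (hypothetical class) of just that pattern:

import scala.collection.mutable.ArrayBuffer

class DoubleBuffer[T] {
  private var current = new ArrayBuffer[T]

  // Called from receiver threads (the role of addData behind store())
  def add(item: T): Unit = synchronized { current += item }

  // Called from the timer thread: hands back the filled buffer, if any,
  // and installs an empty one in its place
  def swap(): Option[ArrayBuffer[T]] = synchronized {
    if (current.nonEmpty) {
      val full = current
      current = new ArrayBuffer[T]
      Some(full)
    } else None
  }
}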
c. Now the blockPushingThread thread: it keeps watching the ArrayBlockingQueue[Block] for incoming blocks and stores them into the BlockManager.
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")

  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }

  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    // As long as block generation has not been stopped, keep taking blocks out of
    // blocksForPushing (the ArrayBlockingQueue[Block]); pushBlock() only runs once
    // updateCurrentBuffer has moved elements from currentBuffer into this queue
    while (areBlocksBeingGenerated) {
      // Take the head of the queue, waiting up to 10ms; returns null on timeout
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        // pushBlock stores the data the Receiver collected via store() into the
        // BlockManager's diskStore or MemoryStore
        case Some(block) => pushBlock(block)
        case None =>
      }
    }

    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      // Hands the block to the defaultBlockGeneratorListener created at initialization
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
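The two-phase loop in keepPushingBlocks -- poll with a short timeout while blocks are still being generated, then drain whatever remains -- is a standard producer/consumer shutdown pattern over ArrayBlockingQueue. A runnable sketch (hypothetical names) of the same structure:

import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

object QueueDrainSketch {
  def main(args: Array[String]): Unit = {
    val queue = new ArrayBlockingQueue[String](10)
    @volatile var generating = true

    val consumer = new Thread(() => {
      // Phase 1: poll with a 10ms timeout while the producer is active
      while (generating) {
        Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(b => println(s"push $b"))
      }
      // Phase 2: generation stopped, drain the remaining elements
      while (!queue.isEmpty) println(s"push (drain) ${queue.take()}")
    })
    consumer.start()

    (1 to 5).foreach(i => queue.put(s"block-$i")) // put blocks when the queue is full
    generating = false
    consumer.join()
  }
}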
4. Next look at pushBlock(block): it hands each generated block to the defaultBlockGeneratorListener in ReceiverSupervisorImpl.
// pushBlock stores the data the Receiver collected via store() into the
// BlockManager's diskStore or MemoryStore
private def pushBlock(block: Block) {
  // block.id is StreamBlockId(receiverId, time - 200ms); block.buffer holds the
  // data that came in through receiver.store(); the listener here is the
  // defaultBlockGeneratorListener created when ReceiverSupervisorImpl was initialized
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
a. See how the onPushBlock() of this BlockGeneratorListener gets the Block data into the BlockManager:
/** Divides received data records into data blocks for pushing in BlockManager.
 * A BlockGeneratorListener that helps divide the data and hand it to the BlockManager.
 */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = {
    print("onGenerateBlock no impl.....")
  }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    // Put the ArrayBuffer that the receiver filled via store() into memory
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
b. Step into pushArrayBuffer to see the concrete implementation:
/** Store an ArrayBuffer of received data as a data block into Spark's memory.
 * Stores the ArrayBuffer that the receiver filled via store(), e.g.
 * pushArrayBuffer(arrayBuffer[Any], None, Some(StreamBlockId(receiverId, time - 200ms)))
 */
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // Report to the driver; ArrayBufferBlock is the ReceivedBlock case class
  // indicating that the data block is held in an ArrayBuffer
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}
==> See how pushAndReportBlock stores the block and reports it to the driver:
/** Store block and report it to driver.
 * e.g. pushAndReportBlock(ArrayBufferBlock(arrayBuffer[Any]), None,
 *      Some(StreamBlockId(receiverId, time - 200ms)))
 */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // If no StreamBlockId was supplied, generate a new one
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block into the BlockManager at the configured storageLevel; returns
  // BlockManagerBasedStoreResult(blockId, numRecords), where numRecords is the
  // number of records put in via store()
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Send the ReceivedBlockInfo to the receiveAndReply of ReceiverTrackerEndpoint
  // on the driver: the block's metadata travels inside the AddBlock message
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
===> The source shows that the received data is stored into the BlockManager by receivedBlockHandler.storeBlock().
===> First see how receivedBlockHandler is constructed; from the source, in the present case we get a BlockManagerBasedBlockHandler instance:
// ReceivedBlockHandler handles the blocks coming from the receiver
private val receivedBlockHandler: ReceivedBlockHandler = {
  // If spark.streaming.receiver.writeAheadLog.enable is not set in the conf, it
  // defaults to false, so no WriteAheadLogBasedBlockHandler is created
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    // Use the BlockManager of this Executor's SparkEnv to store the blocks; with
    // socketTextStream the receiver here is a SocketReceiver
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
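Which branch is taken is purely a matter of configuration. A minimal sketch of the settings that would flip receivedBlockHandler over to WriteAheadLogBasedBlockHandler (the checkpoint path is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("wal-config")
      // Default is false; true selects the write-ahead-log handler
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Required with the WAL enabled, otherwise the SparkException above is thrown
    ssc.checkpoint("/tmp/streaming-checkpoint")
    // ... define sources and outputs, then ssc.start() / ssc.awaitTermination()
    ssc.stop()
  }
}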
===> The source shows that storeBlock stores the receiver's data into the Spark cluster via BlockManager.putIterator at the given storageLevel, then returns the StreamBlockId and the number of stored records wrapped in a BlockManagerBasedStoreResult.
(The BlockManager.putIterator() code itself will be analyzed separately later.)
/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks into a block manager with the specified storage level.
 */
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  // ReceivedBlock is the data put in via store(), here ArrayBufferBlock(arrayBuffer[Any])
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
    var numRecords = None: Option[Long]
    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        // The number of elements that were put into arrayBuffer[Any]
        numRecords = Some(arrayBuffer.size.toLong)
        // Store the block into the BlockManager at the given storageLevel
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        ...
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }
}
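The match in storeBlock is just dispatch over the shapes a received block can take: only the materialized ArrayBuffer case knows its record count up front. A standalone sketch (a hypothetical sealed hierarchy mirroring ReceivedBlock's shape):

import scala.collection.mutable.ArrayBuffer

sealed trait Block
case class BufferBlock(buffer: ArrayBuffer[Any]) extends Block
case class IterBlock(iterator: Iterator[Any]) extends Block

object StoreDispatchSketch {
  // Returns the record count when it is knowable before storing
  def numRecords(block: Block): Option[Long] = block match {
    case BufferBlock(buf) => Some(buf.size.toLong) // materialized: count known
    case IterBlock(_)     => None                  // iterator: consumed by the store
  }
}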
===> Finally, the driver is notified through the ReceiverTrackerEndpoint:
// Send the ReceivedBlockInfo to the receiveAndReply of ReceiverTrackerEndpoint
// on the driver: the block's metadata travels inside the AddBlock message
trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
===> On the driver, ReceiverTrackerEndpoint's receiveAndReply receives the message and replies with the boolean returned by addBlock:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Remote messages
  case RegisterReceiver(streamId, typ, host, executorId, receiverEndpoint) =>
    val successful = registerReceiver(streamId, typ, host, executorId,
      receiverEndpoint, context.senderAddress)
    context.reply(successful)
  case AddBlock(receivedBlockInfo) =>
    if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
      walBatchingThreadPool.execute(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          if (active) {
            context.reply(addBlock(receivedBlockInfo))
          } else {
            throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
          }
        }
      })
    } else {
      context.reply(addBlock(receivedBlockInfo))
    }
  case DeregisterReceiver(streamId, message, error) =>
    ...
}
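When WAL batching is enabled, the reply is produced on a worker pool rather than on the endpoint's message loop, which keeps slow log writes from blocking other RPCs. A plain-Scala sketch of that reply pattern (a hypothetical helper, not Spark's RpcEndpoint API):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object AsyncReplySketch {
  private val pool: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  def handleAddBlock(batchingEnabled: Boolean,
                     addBlock: () => Boolean,
                     reply: Boolean => Unit): Unit = {
    if (batchingEnabled) {
      // Reply from a worker thread, off the message loop
      Future { reply(addBlock()) }(pool)
    } else {
      // Reply inline on the message loop
      reply(addBlock())
    }
  }
}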
===> addBlock puts the ReceivedBlockInfo metadata into a ReceivedBlockQueue, whose elements are exactly these ReceivedBlockInfo records. The method returns false if an exception occurs.
==> The method is implemented in ReceivedBlockTracker:
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    } else {
      logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
    }
    writeResult
  } catch {
    case NonFatal(e) =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
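The shape of addBlock -- persist the event first, mutate in-memory state only on a successful write, and report failure as false instead of throwing -- is a classic write-ahead pattern. A minimal sketch (hypothetical types) of the same contract:

import scala.collection.mutable
import scala.util.control.NonFatal

class MetadataTracker[E](writeToLog: E => Boolean) {
  private val queue = mutable.Queue.empty[E]

  def add(event: E): Boolean =
    try {
      val written = writeToLog(event)              // 1. write-ahead: persist first
      if (written) synchronized { queue += event } // 2. then update memory
      written                                      // 3. surface the write result
    } catch {
      case NonFatal(e) =>
        println(s"Error adding $event: $e")
        false                                      // never throw across the RPC boundary
    }

  def pending: Seq[E] = synchronized { queue.toSeq }
}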
At this point we have traced how ReceiverSupervisorImpl's onStart() gets the Receiver's data written into Spark's BlockManager.
Next, we analyze startReceiver() in ReceiverSupervisorImpl: how does the Receiver store its data so that it ends up in an RDD?