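I. Spark Streaming Job Generation: An In-Depth Look
1. In big data work with Hadoop, Spark and similar systems, anything that is not stream processing usually runs as a scheduled task, triggered for example every 10 minutes or every hour. That already has the feel of stream processing. On this view, data that is never stream processed, or has nothing to do with stream processing, loses most of its value, and the batch processing done in the past was implicitly stream processing too.
2. JobGenerator is constructed with one core parameter, jobScheduler. JobScheduler is the core of generating jobs and submitting them to the cluster, and JobGenerator generates Jobs based on DStreams. A Job here is comparable to the business logic wrapped in a Runnable that a Java thread executes. A Spark job, by contrast, is a job that actually runs.
3. Besides generating Jobs on a timer, Spark Streaming can also generate them through various aggregation operations or state-based operations.
4. Every 5 seconds JobGenerator produces a Job. At this point the Job is logical: it exists and describes what should be done, but nothing has executed yet. Actual execution is triggered by the action on the underlying RDD, and at this point that action is also only logical. The Job is logical because Spark Streaming builds it from the dependencies among DStreams; the physical level underneath is based on RDDs.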
    val ssc = new StreamingContext(conf, Seconds(5))
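5. Spark Streaming is triggered by time, whereas Storm is triggered by events, that is, record by record. The time that drives Spark Streaming is the batch duration.
When the logical level is translated into the physical level, the last operation must be an RDD action, yet we do not want a job to be triggered the moment translation happens. What can be done?
6. An action triggers the job. The Job, playing the role of the Runnable wrapper, defines a method whose body is the RDDs generated from the DStream dependencies. Translation turns the DStream dependency graph into an RDD dependency graph. Because the last operation in the DStream graph is at action level, the last RDD operation after translation must also be an action. If it were executed directly during translation, a job would be launched immediately and there would be no queue to speak of. The translated content is therefore placed inside a function or method, and until something invokes that function the action inside it cannot run.
7. Spark Streaming keeps managing the jobs it generates according to time. Each job ends with an action-level operation, which is a logical operation on a DStream. When the Job is generated and put into the queue, it must be translated into RDD operations, the last of which is necessarily an action. If translation triggered that action directly, the Spark Streaming Job would escape management altogether. Both the translation and the management have to be guaranteed: the dependencies among DStreams are turned into dependencies among RDDs, and the action on the last DStream becomes an action on the last RDD. The whole translated result is a single block placed in a function body. Because the function is only defined and not yet executed, the RDD action inside it does not run and no job is triggered. When JobScheduler schedules the Job, it takes a thread from its thread pool and executes the wrapped function.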
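The deferral described in points 6 and 7 can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the Spark implementation: `LogicalJob`, `generateJob` and `executed` are hypothetical names, and a sum over a `Seq[Int]` stands in for the RDD action.

```scala
import scala.collection.mutable

object DeferredJobSketch {
  // Hypothetical stand-in for a streaming Job: a batch time plus a deferred body.
  final class LogicalJob(val time: Long, func: () => Int) {
    def run(): Int = func()
  }

  // Records which "actions" really ran, so the deferral is observable.
  val executed = mutable.ArrayBuffer.empty[Long]

  // "Translation" step: build the function body but do not call it.
  def generateJob(time: Long, data: Seq[Int]): LogicalJob =
    new LogicalJob(time, () => {
      executed += time        // stands in for the RDD action firing
      data.map(_ * 2).sum
    })

  def main(args: Array[String]): Unit = {
    val queue = mutable.Queue(
      generateJob(5000L, Seq(1, 2, 3)),
      generateJob(10000L, Seq(4, 5)))
    println(s"actions run after generation: ${executed.size}")
    // The "scheduler" decides when each queued job actually runs.
    while (queue.nonEmpty) println(queue.dequeue().run())
    println(s"actions run after scheduling: ${executed.size}")
  }
}
```

Generating a job only builds a closure; the side effect and the computation happen when the scheduler calls `run()`, which is what keeps the job under the scheduler's control.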
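II. Spark Streaming Job Generation: Source Code Walkthrough
The three core components of dynamic job generation in Spark Streaming:
JobGenerator: responsible for generating Jobs.
JobScheduler: responsible for scheduling Jobs.
ReceiverTracker: provides the metadata of the received data.
1. When JobScheduler's start method is called, it calls JobGenerator's start method.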
    /** Start generation of jobs */
    def start(): Unit = synchronized {
      // eventLoop is the message loop; it is needed because Jobs are generated continuously
      if (eventLoop != null) return // generator has already been started

      // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
      // See SPARK-10125
      checkpointWriter

      // Anonymous inner class
      eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
        override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = {
          jobScheduler.reportError("Error in job generator", e)
        }
      }
      // Start the event loop
      eventLoop.start()

      if (ssc.isCheckpointPresent) {
        restart()
      } else {
        startFirstTime()
      }
    }
EventLoop: when its start method is called, it first calls onStart and then starts the event thread.

    /**
     * An event loop to receive events from the caller and process all events in the event thread. It
     * will start an exclusive event thread to process all events.
     *
     * Note: The event queue will grow indefinitely. So subclasses should make sure `onReceive` can
     * handle events in time to avoid the potential OOM.
     */
    private[spark] abstract class EventLoop[E](name: String) extends Logging {

      private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

      private val stopped = new AtomicBoolean(false)

      // Background (daemon) thread
      private val eventThread = new Thread(name) {
        setDaemon(true)

        override def run(): Unit = {
          try {
            // Keep taking messages from the blocking queue
            while (!stopped.get) {
              // Once the thread is started it loops over the queue; events are put into eventQueue
              val event = eventQueue.take()
              try {
                // Hand the event to the subclass
                onReceive(event)
              } catch {
                case NonFatal(e) => {
                  try {
                    onError(e)
                  } catch {
                    case NonFatal(e) => logError("Unexpected error in " + name, e)
                  }
                }
              }
            }
          } catch {
            case ie: InterruptedException => // exit even if eventQueue is not empty
            case NonFatal(e) => logError("Unexpected error in " + name, e)
          }
        }
      }

      def start(): Unit = {
        if (stopped.get) {
          throw new IllegalStateException(name + " has already been stopped")
        }
        // Call onStart before starting the event thread to make sure it happens before onReceive
        onStart()
        eventThread.start()
      }
      // ...
    }
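onReceive: the event thread keeps taking messages from the queue and handles each one as soon as it arrives.
Do not perform blocking operations in onReceive; otherwise the event thread is blocked and the messages behind it cannot be processed in time.
A message loop usually does not handle concrete business logic itself. Once it finds a message, it generally routes the message to another thread for processing.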
    /**
     * Invoked in the event thread when polling events from the event queue.
     *
     * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
     * and cannot process events in time. If you want to call some blocking actions, run them in
     * another thread.
     */
    protected def onReceive(event: E): Unit
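The event-loop pattern above (a blocking queue drained by one daemon thread) can be tried in isolation. This is a simplified sketch under stated assumptions, not Spark's `EventLoop`: `MiniEventLoop` and `runOnce` are hypothetical names, logging is dropped, and `stop()` simply interrupts the thread.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object EventLoopSketch {
  // Simplified, hypothetical event loop: one daemon thread draining a blocking queue.
  abstract class MiniEventLoop[E](name: String) {
    private val queue = new LinkedBlockingDeque[E]()
    private val stopped = new AtomicBoolean(false)
    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit = {
        try {
          while (!stopped.get()) {
            val event = queue.take() // blocks until an event is posted
            try {
              onReceive(event)
            } catch {
              case NonFatal(e) => onError(e)
            }
          }
        } catch {
          case _: InterruptedException => () // stop() interrupts a blocked take()
        }
      }
    }

    def start(): Unit = thread.start()
    def stop(): Unit = { stopped.set(true); thread.interrupt() }
    def post(event: E): Unit = queue.put(event)

    protected def onReceive(event: E): Unit
    protected def onError(e: Throwable): Unit
  }

  // Posts the given events and returns them in the order the loop handled them.
  def runOnce(events: Seq[String]): List[String] = {
    val seen = ArrayBuffer.empty[String]
    val done = new CountDownLatch(events.size)
    val loop = new MiniEventLoop[String]("sketch-loop") {
      override protected def onReceive(event: String): Unit = {
        seen.synchronized { seen += event }
        done.countDown()
      }
      override protected def onError(e: Throwable): Unit = done.countDown()
    }
    loop.start()
    events.foreach(e => loop.post(e))
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    seen.synchronized { seen.toList }
  }

  def main(args: Array[String]): Unit =
    println(runOnce(Seq("GenerateJobs(5000)", "DoCheckpoint(5000)")))
}
```

Because a single thread consumes the queue, events are handled in posting order, and a slow `onReceive` delays every event behind it. That is the reason for the warning about blocking calls.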
Once an event has been taken from the message queue, it is handled as follows:

    /** Processes all events */
    private def processEvent(event: JobGeneratorEvent) {
      logDebug("Got event " + event)
      event match {
        case GenerateJobs(time) => generateJobs(time)
        case ClearMetadata(time) => clearMetadata(time)
        case DoCheckpoint(time, clearCheckpointDataLater) =>
          doCheckpoint(time, clearCheckpointDataLater)
        case ClearCheckpointData(time) => clearCheckpointData(time)
      }
    }
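The dispatch in processEvent is ordinary pattern matching over a closed set of event types. As a small illustration, the sketch below models the four events as a sealed trait; the case class shapes are inferred from the patterns above, `Long` replaces Spark's `Time`, and `dispatch` is a hypothetical function that only names the handler it would call.

```scala
object EventDispatchSketch {
  // Hypothetical event types mirroring the names matched in processEvent.
  sealed trait GeneratorEvent
  final case class GenerateJobs(time: Long) extends GeneratorEvent
  final case class ClearMetadata(time: Long) extends GeneratorEvent
  final case class DoCheckpoint(time: Long, clearCheckpointDataLater: Boolean)
    extends GeneratorEvent
  final case class ClearCheckpointData(time: Long) extends GeneratorEvent

  // Returns a description of the handler that would be invoked for the event.
  def dispatch(event: GeneratorEvent): String = event match {
    case GenerateJobs(t)        => s"generateJobs($t)"
    case ClearMetadata(t)       => s"clearMetadata($t)"
    case DoCheckpoint(t, later) => s"doCheckpoint($t, $later)"
    case ClearCheckpointData(t) => s"clearCheckpointData($t)"
  }

  def main(args: Array[String]): Unit =
    println(dispatch(GenerateJobs(5000L)))
}
```

Making the trait sealed lets the compiler warn when a match forgets one of the event types.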
Jobs are generated for each batch duration, and a checkpoint is then performed.
Job generation takes five steps.

    /** Generate jobs and perform checkpoint for the given `time`. */
    private def generateJobs(time: Time) {
      // Set the SparkEnv in this thread, so that job generation code can access the environment
      // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
      // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
      SparkEnv.set(ssc.env)
      Try {
        // Step 1: get the data for the current interval, i.e. allocate the received blocks
        // to the batch identified by this time.
        jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
        // Step 2: generate the Jobs and obtain the RDD DAG. The RDD instances are created
        // from the DStreams here.
        graph.generateJobs(time) // generate jobs using allocated block
      } match {
        case Success(jobs) =>
          // Step 3: get streamIdToInputInfos, the input data information for this batch duration.
          val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
          // Step 4: hand the generated Jobs to jobScheduler.
          jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
        case Failure(e) =>
          jobScheduler.reportError("Error generating jobs for time " + time, e)
      }
      // Step 5: perform the checkpoint.
      eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
    }
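The control flow of the five steps, in particular that the checkpoint event is posted whether or not job generation succeeded, can be seen in a Spark-free sketch. Everything here is hypothetical: `generateJobs` returns a log of step descriptions, and the `failGeneration` flag simulates an exception thrown while building the jobs.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.{Failure, Success, Try}

object GenerateJobsSketch {
  // Returns the steps taken, in order, so the control flow is visible.
  def generateJobs(time: Long, failGeneration: Boolean): List[String] = {
    val steps = ArrayBuffer.empty[String]
    Try {
      steps += s"allocate blocks to batch $time"            // step 1
      if (failGeneration) throw new IllegalStateException("graph failure")
      steps += "generate jobs from DStream graph"           // step 2
      Seq(s"job-$time-0")
    } match {
      case Success(jobs) =>
        steps += "collect input info"                       // step 3
        steps += s"submit JobSet with ${jobs.size} job(s)"  // step 4
      case Failure(e) =>
        steps += s"report error: ${e.getMessage}"
    }
    steps += "post DoCheckpoint"                            // step 5, runs either way
    steps.toList
  }

  def main(args: Array[String]): Unit = {
    generateJobs(5000L, failGeneration = false).foreach(println)
    generateJobs(10000L, failGeneration = true).foreach(println)
  }
}
```

Wrapping steps 1 and 2 in `Try` turns a failure into a value, so the error is reported through the scheduler instead of killing the event thread.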
The outputStream here is the last DStream in the DStream chain, that is, a ForEachDStream.

    def generateJobs(time: Time): Seq[Job] = {
      logDebug("Generating jobs for time " + time)
      val jobs = this.synchronized {
        outputStreams.flatMap { outputStream =>
          // Starting from the last DStream, generate the Job for this time.
          val jobOption = outputStream.generateJob(time)
          jobOption.foreach(_.setCallSite(outputStream.creationSite))
          jobOption
        }
      }
      logDebug("Generated " + jobs.length + " jobs for time " + time)
      jobs
    }
The jobFunc here is the function that wraps the Job, as described earlier.
generateJob generates a Spark Streaming Job for the given time. The method materializes the DStream operations into an RDD, which shows that a DStream is logical while an RDD is physical.

    /**
     * Generate a SparkStreaming job for the given time. This is an internal method that
     * should not be called directly. This default implementation creates a job
     * that materializes the corresponding RDD. Subclasses of DStream may override this
     * to generate their own jobs.
     */
    private[streaming] def generateJob(time: Time): Option[Job] = {
      getOrCompute(time) match {
        case Some(rdd) => {
          val jobFunc = () => {
            val emptyFunc = { (iterator: Iterator[T]) => {} }
            // rdd carries the whole RDD dependency chain
            context.sparkContext.runJob(rdd, emptyFunc)
          }
          // jobFunc is only wrapped in a Job here; nothing runs yet
          Some(new Job(time, jobFunc))
        }
        case None => None
      }
    }
The Job class represents a piece of Spark business logic and may contain multiple Spark jobs.

    /**
     * Class representing a Spark computation. It may contain multiple Spark jobs.
     */
    private[streaming]
    class Job(val time: Time, func: () => _) {
      private var _id: String = _
      private var _outputOpId: Int = _
      private var isSet = false
      private var _result: Try[_] = null
      private var _callSite: CallSite = null
      private var _startTime: Option[Long] = None
      private var _endTime: Option[Long] = None

      def run() {
        // Call func, which is the jobFunc built in generateJob above
        _result = Try(func())
      }
      // ...
    }
The RDD passed to put here is the last RDD in the chain. Job generation is triggered by time, but it also depends on the action (output operation) of the DStream.

    /**
     * Get the RDD corresponding to the given time; either retrieve it from cache
     * or compute-and-cache it.
     */
    private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
      // If RDD was already generated, then retrieve it from HashMap,
      // or else compute the RDD
      // (RDDs are generated per batch time)
      generatedRDDs.get(time).orElse {
        // Compute the RDD if time is valid (e.g. correct time in a sliding window)
        // of RDD generation, else generate nothing.
        if (isTimeValid(time)) {

          val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
            // Disable checks for existing output directories in jobs launched by the streaming
            // scheduler, since we may need to write output to an existing directory during checkpoint
            // recovery; see SPARK-4835 for more details. We need to have this call here because
            // compute() might cause Spark jobs to be launched.
            PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
              compute(time)
            }
          }

          // Then persist and checkpoint the generated RDD as configured
          rddOption.foreach { case newRDD =>
            // Register the generated RDD for caching and checkpointing
            if (storageLevel != StorageLevel.NONE) {
              newRDD.persist(storageLevel)
              logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
            }
            if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
              newRDD.checkpoint()
              logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
            }
            // Key is the time, value is the RDD; this RDD is the last one in the chain
            generatedRDDs.put(time, newRDD)
          }
          rddOption
        } else {
          None
        }
      }
    }
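The caching logic of getOrCompute reduces to "look up by time, otherwise compute and store if the time is valid". Here is a minimal sketch of that shape in plain Scala, assuming a `Seq[Int]` as a stand-in for an RDD and `Long` milliseconds for `Time`; `MiniStream` and `computeCalls` are hypothetical, and the validity rule is modelled as "later than zeroTime and a multiple of the slide interval".

```scala
import scala.collection.mutable

object GetOrComputeSketch {
  // Hypothetical stand-in for a DStream: caches one "RDD" (a Seq[Int]) per batch time.
  final class MiniStream(zeroTime: Long, slide: Long) {
    private val generated = mutable.HashMap.empty[Long, Seq[Int]]
    var computeCalls = 0

    // Assumed validity rule: after zeroTime and aligned to the slide interval.
    private def isTimeValid(time: Long): Boolean =
      time > zeroTime && (time - zeroTime) % slide == 0

    private def compute(time: Long): Option[Seq[Int]] = {
      computeCalls += 1
      Some(Seq.fill(3)((time / slide).toInt))
    }

    def getOrCompute(time: Long): Option[Seq[Int]] =
      generated.get(time).orElse {
        if (isTimeValid(time)) {
          val rddOption = compute(time)
          // Keyed by time, so a second request for the same batch is a cache hit.
          rddOption.foreach(rdd => generated.put(time, rdd))
          rddOption
        } else {
          None
        }
      }
  }

  def main(args: Array[String]): Unit = {
    val stream = new MiniStream(zeroTime = 0L, slide = 5000L)
    println(stream.getOrCompute(5000L))
    println(stream.getOrCompute(5000L))
    println(stream.getOrCompute(7000L))
    println(s"compute calls: ${stream.computeCalls}")
  }
}
```

The cache matters when several downstream DStreams share a parent: each asks the parent for the same batch time, and only the first request computes.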
Back to the start method of JobGenerator.

    if (ssc.isCheckpointPresent) {
      // Not the first start: recover from the checkpoint.
      restart()
    } else {
      // Otherwise this is the first start.
      startFirstTime()
    }
The source of startFirstTime is as follows:

    /** Starts the generator for the first time */
    private def startFirstTime() {
      val startTime = new Time(timer.getStartTime())
      // Tell DStreamGraph when the first batch starts.
      graph.start(startTime - graph.batchDuration)
      // Start the timer; from here on Jobs are generated continuously.
      timer.start(startTime.milliseconds)
      logInfo("Started JobGenerator at " + startTime)
    }
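The source of timer.getStartTime is not shown here. The sketch below assumes that it rounds the current clock time up to the next multiple of the batch interval, which would explain why graph.start receives `startTime - batchDuration`: the graph's zero time then sits exactly one interval before the first batch. Both function names are hypothetical.

```scala
object StartTimeSketch {
  // Assumption: the first batch time is the next multiple of the period after "now".
  def alignedStartTime(nowMs: Long, periodMs: Long): Long =
    (math.floor(nowMs.toDouble / periodMs) + 1).toLong * periodMs

  // The value handed to graph.start in startFirstTime: one batch before the first batch.
  def graphZeroTime(nowMs: Long, periodMs: Long): Long =
    alignedStartTime(nowMs, periodMs) - periodMs

  def main(args: Array[String]): Unit = {
    println(alignedStartTime(12345L, 5000L))
    println(graphZeroTime(12345L, 5000L))
  }
}
```

Under this assumption, batch times always fall on clean multiples of the batch interval regardless of when the application starts.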
The timer here is a RecurringTimer. Its start method starts its internal thread.

    private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
      longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
The source of RecurringTimer.start is as follows:

    /**
     * Start at the given start time.
     */
    def start(startTime: Long): Long = synchronized {
      // nextTime is the time of the next callback invocation
      nextTime = startTime
      thread.start()
      logInfo("Started timer for " + name + " at time " + nextTime)
      nextTime
    }
thread.start() starts a background (daemon) thread.

    private val thread = new Thread("RecurringTimer - " + name) {
      setDaemon(true)
      override def run() { loop }
    }
The source of loop is as follows:

    /**
     * Repeatedly call the callback every interval.
     */
    private def loop() {
      try {
        while (!stopped) {
          triggerActionForNextInterval()
        }
        triggerActionForNextInterval()
      } catch {
        case e: InterruptedException =>
      }
    }
The source of triggerActionForNextInterval is as follows:

    private def triggerActionForNextInterval(): Unit = {
      clock.waitTillTime(nextTime)
      callback(nextTime)
      prevTime = nextTime
      nextTime += period
      logDebug("Callback for " + name + " called at time " + prevTime)
    }
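The wait, call back, advance cycle can be simulated deterministically by replacing the real clock with one that jumps forward instead of sleeping. This is a sketch under that assumption: `ManualClock`, `MiniTimer` and `batchTimes` are hypothetical, there is no thread, and the callback merely records the time it was given where the real one posts `GenerateJobs`.

```scala
import scala.collection.mutable.ArrayBuffer

object RecurringTimerSketch {
  // Hypothetical manual clock: waitTillTime jumps forward, so nothing sleeps.
  final class ManualClock(var now: Long) {
    def waitTillTime(target: Long): Long = {
      if (target > now) now = target
      now
    }
  }

  final class MiniTimer(clock: ManualClock, period: Long, callback: Long => Unit) {
    private var prevTime = -1L
    private var nextTime = -1L

    def start(startTime: Long): Long = {
      nextTime = startTime
      nextTime
    }

    // Same order of operations as triggerActionForNextInterval: wait, call back, advance.
    def tick(): Unit = {
      clock.waitTillTime(nextTime)
      callback(nextTime)
      prevTime = nextTime
      nextTime += period
    }

    def lastFired: Long = prevTime
  }

  // Returns the batch times handed to the callback over the given number of ticks.
  def batchTimes(startTime: Long, period: Long, ticks: Int): List[Long] = {
    val posted = ArrayBuffer.empty[Long]
    val timer = new MiniTimer(new ManualClock(0L), period, t => posted += t)
    timer.start(startTime)
    (1 to ticks).foreach(_ => timer.tick())
    posted.toList
  }

  def main(args: Array[String]): Unit =
    println(batchTimes(startTime = 5000L, period = 5000L, ticks = 3))
}
```

The callback receives the scheduled time rather than the wall-clock time at which it happens to run, so batch times stay evenly spaced even if a tick fires late.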
The callback here is passed in through RecurringTimer's constructor. To find out who passes it in, look at where RecurringTimer is instantiated.

    private[streaming]
    class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
      extends Logging {

      private val thread = new Thread("RecurringTimer - " + name) {
        setDaemon(true)
        override def run() { loop }
      }
      // ...
    }
In JobGenerator, the anonymous function is called again and again as time passes.

    private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
      // Anonymous function, assigned to callback.
      longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
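The eventLoop here is the one created in JobGenerator's start method. It is a message loop: when it receives a GenerateJobs event it calls generateJobs, and the resulting Jobs are submitted to the JobScheduler, which runs them on its thread pool.
With that, the whole flow of time-based job generation has been traced end to end.
jobs: the jobs here carry the business logic. This is similar to the dependencies among RDDs: only the last one is kept, and the rest is reached by tracing back through the dependencies.
streamIdToInputInfos: the input data information for the batch duration. Together with the business logic to be run on it, it is used to build the JobSet.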
    jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
The JobSet therefore contains both the data and the business logic that processes the data.

    /** Class representing a set of Jobs
      * belong to the same batch.
      */
    private[streaming]
    case class JobSet(
        time: Time,
        jobs: Seq[Job],
        streamIdToInputInfo: Map[Int, StreamInputInfo] = Map.empty) {

      private val incompleteJobs = new HashSet[Job]()
      private val submissionTime = System.currentTimeMillis() // when this jobset was submitted
      private var processingStartTime = -1L // when the first job of this jobset started processing
      private var processingEndTime = -1L // when the last job of this jobset finished processing

      jobs.zipWithIndex.foreach { case (job, i) => job.setOutputOpId(i) }
      incompleteJobs ++= jobs

      def handleJobStart(job: Job) {
        if (processingStartTime < 0) processingStartTime = System.currentTimeMillis()
      }
      // ...
    }
submitJobSet:

    def submitJobSet(jobSet: JobSet) {
      if (jobSet.jobs.isEmpty) {
        logInfo("No jobs added for time " + jobSet.time)
      } else {
        listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
        jobSets.put(jobSet.time, jobSet)
        // Wrap each Job in a JobHandler and run it on the jobExecutor thread pool
        jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
        logInfo("Added jobs for time " + jobSet.time)
      }
    }
JobHandler implements the Runnable interface. The Job is our business logic and represents a chain of RDD dependencies; calling job.run() is what invokes the func function.

    private class JobHandler(job: Job) extends Runnable with Logging {
      import JobScheduler._

      def run() {
        try {
          val formattedTime = UIUtils.formatBatchTime(
            job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
          val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
          val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

          ssc.sc.setJobDescription(
            s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
          ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
          ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

          // We need to assign `eventLoop` to a temp variable. Otherwise, because
          // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
          // it's possible that when `post` is called, `eventLoop` happens to null.
          var _eventLoop = eventLoop
          if (_eventLoop != null) {
            _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
            // Disable checks for existing output directories in jobs launched by the streaming
            // scheduler, since we may need to write output to an existing directory during checkpoint
            // recovery; see SPARK-4835 for more details.
            PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
              // This is where the wrapped function finally executes
              job.run()
            }
            _eventLoop = eventLoop
            if (_eventLoop != null) {
              _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
            }
          } else {
            // JobScheduler has been stopped.
          }
        } finally {
          ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
          ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
        }
      }
    }
The func here is the business logic defined on the DStreams, in other words the logic expressed by the RDD dependencies.

    def run() {
      _result = Try(func())
    }
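The last hop, a Runnable handler that runs the wrapped function on a thread pool and records start and completion around it, can be sketched in plain Scala. This is an illustration only: `MiniJob`, `MiniJobHandler` and `submitAll` are hypothetical names, events are plain strings, and the pool size of 1 is an assumption chosen to keep the event order deterministic.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer
import scala.util.Try

object JobHandlerSketch {
  // Hypothetical Job: the body stays unevaluated until run() is called.
  final class MiniJob(val time: Long, func: () => Any) {
    @volatile var result: Try[Any] = null
    def run(): Unit = { result = Try(func()) } // a failure is captured, not thrown
  }

  // Hypothetical handler: brackets job.run() with started/completed events.
  final class MiniJobHandler(job: MiniJob, events: ArrayBuffer[String]) extends Runnable {
    override def run(): Unit = {
      events.synchronized { events += s"JobStarted(${job.time})" }
      job.run()
      events.synchronized { events += s"JobCompleted(${job.time})" }
    }
  }

  // Runs every job on a single-thread pool and returns the recorded events.
  def submitAll(jobs: Seq[MiniJob]): List[String] = {
    val events = ArrayBuffer.empty[String]
    val pool = Executors.newFixedThreadPool(1)
    jobs.foreach(job => pool.execute(new MiniJobHandler(job, events)))
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
    events.synchronized { events.toList }
  }

  def main(args: Array[String]): Unit = {
    val jobs = Seq(
      new MiniJob(5000L, () => 42),
      new MiniJob(10000L, () => throw new RuntimeException("boom")))
    submitAll(jobs).foreach(println)
    jobs.foreach(j => println(j.result))
  }
}
```

Because `run()` wraps the call in `Try`, a failing job still reaches the completed event, and the failure is available afterwards in `result` for the scheduler to inspect.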
The overall architecture is as follows:
[Figure: overall architecture diagram]
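Notes:
1. DT Big Data Dream Factory WeChat public account: DT_Spark
2. IMF 8 p.m. big data hands-on sessions, YY live channel number: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains
This article is reposted from http://blog.csdn.net/snail_gesture/article/details/51417769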