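I. JobScheduler source code analysis
1. JobScheduler is the core of all scheduling in Spark Streaming; its role is comparable to that of DAGScheduler in Spark Core.
2. Why does Spark Streaming need at least two threads?
The two threads specified through setMaster mean that the program needs at least two threads at run time. One thread is occupied by receiving data, which loops continuously; the remaining threads we specify are used to process jobs.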
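The point about thread counts can be illustrated without Spark. The following is a minimal sketch in plain Scala, assuming a fixed thread pool stands in for the threads given to the application; `TwoThreadsSketch` and `processingRan` are hypothetical names, not part of Spark. A long-running "receiver" task occupies one thread, so a "batch job" task only gets to run when a second thread exists.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical illustration, not Spark code: a fixed pool models the threads
// available to the application.
object TwoThreadsSketch {
  /** Returns true if the processing task got to run within `timeoutMs`. */
  def processingRan(numThreads: Int, timeoutMs: Long = 500): Boolean = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val stop = new AtomicBoolean(false)
    val processed = new CountDownLatch(1)
    try {
      // "Receiver": loops for as long as the application runs, holding one thread.
      pool.submit(new Runnable {
        def run(): Unit = while (!stop.get()) Thread.sleep(10)
      })
      // "Batch job": can only run if a thread is still free.
      pool.submit(new Runnable {
        def run(): Unit = processed.countDown()
      })
      processed.await(timeoutMs, TimeUnit.MILLISECONDS)
    } finally {
      stop.set(true)
      pool.shutdownNow()
    }
  }
}
```

With a single thread the batch job stays queued behind the receiver loop; with two threads it runs right away.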
3. JobScheduler is started when the start method of StreamingContext is called.
    def start(): Unit = synchronized {
      state match {
        case INITIALIZED =>
          startSite.set(DStream.getCreationSite())
          StreamingContext.ACTIVATION_LOCK.synchronized {
            StreamingContext.assertNoOtherContextIsActive()
            try {
              validate()

              // The new thread started here is for scheduling, so it is unrelated to the
              // thread count we configured.
              // Start the streaming scheduler in a new thread, so that thread local properties
              // like call sites and job groups can be reset without affecting those of the
              // current thread.
              ThreadUtils.runInNewThread("streaming-start") {
                sparkContext.setCallSite(startSite.get)
                sparkContext.clearJobGroup()
                sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
                scheduler.start()
              }
              // ...
4. JobScheduler is responsible for the logical-level Jobs and for running them physically on Spark.
    /**
     * This class schedules jobs to be run on Spark. It uses the JobGenerator to generate
     * the jobs and runs them using a thread pool.
     */
    private[streaming]
    class JobScheduler(val ssc: StreamingContext) extends Logging {
5. The source of JobScheduler's start method is as follows:
    def start(): Unit = synchronized {
      if (eventLoop != null) return // scheduler has already been started

      logDebug("Starting JobScheduler")
      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
      }
      eventLoop.start()

      // attach rate controllers of input streams to receive batch completion updates
      for {
        inputDStream <- ssc.graph.getInputStreams
        rateController <- inputDStream.rateController
      } ssc.addStreamingListener(rateController)

      listenerBus.start(ssc.sparkContext)
      receiverTracker = new ReceiverTracker(ssc)
      inputInfoTracker = new InputInfoTracker(ssc)
      receiverTracker.start()
      jobGenerator.start()
      logInfo("Started JobScheduler")
    }
6. The source of processEvent, which the event loop calls for every event, is as follows:
    private def processEvent(event: JobSchedulerEvent) {
      try {
        event match {
          case JobStarted(job, startTime) => handleJobStart(job, startTime)
          case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
          case ErrorReported(m, e) => handleError(m, e)
        }
      } catch {
        case e: Throwable =>
          reportError("Error in job scheduler", e)
      }
    }
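The event-loop pattern used by start and processEvent can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's EventLoop class: all names (`SketchEvent`, `SketchEventLoop`, `EventLoopSketch`) are hypothetical. Events are posted to a blocking queue, and a single daemon thread takes them off one by one and dispatches them with a pattern match.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Hypothetical event types, loosely mirroring JobStarted / JobCompleted / ErrorReported.
sealed trait SketchEvent
case class SketchJobStarted(jobId: Int) extends SketchEvent
case class SketchJobCompleted(jobId: Int) extends SketchEvent
case class SketchErrorReported(msg: String) extends SketchEvent

// A queue drained by one daemon thread that hands each event to `onReceive`.
class SketchEventLoop(onReceive: SketchEvent => Unit) {
  private val queue = new LinkedBlockingQueue[SketchEvent]()
  @volatile private var stopped = false
  private val thread = new Thread("sketch-event-loop") {
    override def run(): Unit =
      try {
        while (!stopped) onReceive(queue.take())
      } catch {
        case _: InterruptedException => // stop() interrupts a blocked take()
      }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def post(event: SketchEvent): Unit = queue.put(event)
  def stop(): Unit = { stopped = true; thread.interrupt(); thread.join() }
}

object EventLoopSketch {
  /** Posts the events and returns what the handler logged, in processing order. */
  def run(events: Seq[SketchEvent]): Seq[String] = {
    val log = ArrayBuffer.empty[String]
    val done = new CountDownLatch(events.size)
    val loop = new SketchEventLoop(event => {
      val line = event match {
        case SketchJobStarted(id)     => "started " + id
        case SketchJobCompleted(id)   => "completed " + id
        case SketchErrorReported(msg) => "error " + msg
      }
      log.synchronized { log += line }
      done.countDown()
    })
    loop.start()
    events.foreach(loop.post)
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    log.synchronized { log.toList }
  }
}
```

Because a single thread drains a FIFO queue, events are handled in the order they were posted, which is why the handlers themselves need no extra ordering logic.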
7. The source of handleJobStart is as follows:
    private def handleJobStart(job: Job, startTime: Long) {
      val jobSet = jobSets.get(job.time)
      val isFirstJobOfJobSet = !jobSet.hasStarted
      jobSet.handleJobStart(job)
      if (isFirstJobOfJobSet) {
        // "StreamingListenerBatchStarted" should be posted after calling "handleJobStart" to get the
        // correct "jobSet.processingStartTime".
        listenerBus.post(StreamingListenerBatchStarted(jobSet.toBatchInfo))
      }
      job.setStartTime(startTime)
      listenerBus.post(StreamingListenerOutputOperationStarted(job.toOutputOperationInfo))
      logInfo("Starting job " + job.id + " from job set of time " + jobSet.time)
    }
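8. What does JobScheduler do when it is initialized?
Why would we set the degree of parallelism here?
1) If a batch duration contains several output operations, raising the parallelism lets their jobs run at the same time, which can improve performance considerably.
2) Jobs from different batches can also run concurrently when the thread pool has several threads.
Turning a logical-level Job into a physically running job is done on the threads of the pool created by newDaemonFixedThreadPool.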
    // Use of ConcurrentHashMap.keySet later causes an odd runtime problem due to Java 7/8 diff
    // https://gist.github.com/AlainODea/1375759b8720a3f9f094
    private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
    // The degree of parallelism can be set manually; numConcurrentJobs defaults to 1
    private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
    private val jobExecutor =
      ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
    // Initialize the JobGenerator
    private val jobGenerator = new JobGenerator(this)
    val clock = jobGenerator.clock
    val listenerBus = new StreamingListenerBus()

    // These two are created only when scheduler starts.
    // eventLoop not being null means the scheduler has been started and not stopped
    var receiverTracker: ReceiverTracker = null
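To make the effect of the pool size concrete, here is a minimal sketch in plain Scala. It uses the JDK's `Executors.newFixedThreadPool` with a daemon thread factory as a stand-in for Spark's `ThreadUtils.newDaemonFixedThreadPool`; `ConcurrentJobsSketch` and `maxParallelism` are hypothetical names. It submits several sleeping "jobs" and records how many were ever running at the same time.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.IntBinaryOperator

// Hypothetical illustration, not Spark code.
object ConcurrentJobsSketch {
  /** Runs `numJobs` sleeping jobs on a pool of `numConcurrentJobs` daemon threads and
    * returns the highest number of jobs observed running at the same time. */
  def maxParallelism(numConcurrentJobs: Int, numJobs: Int): Int = {
    val daemonFactory = new ThreadFactory {
      private val counter = new AtomicInteger(0)
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "sketch-job-executor-" + counter.getAndIncrement())
        t.setDaemon(true) // daemon threads do not keep the JVM alive
        t
      }
    }
    val pool = Executors.newFixedThreadPool(numConcurrentJobs, daemonFactory)
    val running = new AtomicInteger(0)
    val peak = new AtomicInteger(0)
    val max = new IntBinaryOperator {
      def applyAsInt(a: Int, b: Int): Int = math.max(a, b)
    }
    (1 to numJobs).foreach { _ =>
      pool.execute(new Runnable {
        def run(): Unit = {
          val now = running.incrementAndGet()
          peak.accumulateAndGet(now, max) // remember the highest concurrency seen
          Thread.sleep(100)               // stand-in for the real work of a job
          running.decrementAndGet()
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    peak.get()
  }
}
```

With a pool of size 1 (the default value of spark.streaming.concurrentJobs shown above) the jobs run strictly one after another; a larger pool allows that many jobs to overlap.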
Source code of the print function:
1. The source of print in DStream is as follows:
    /**
     * Print the first ten elements of each RDD generated in this DStream. This is an output
     * operator, so this DStream will be registered as an output stream and there materialized.
     */
    def print(): Unit = ssc.withScope {
      print(10)
    }
2. The actual call still ends up operating on RDDs.
    /**
     * Print the first num elements of each RDD generated in this DStream. This is an output
     * operator, so this DStream will be registered as an output stream and there materialized.
     */
    def print(num: Int): Unit = ssc.withScope {
      def foreachFunc: (RDD[T], Time) => Unit = {
        (rdd: RDD[T], time: Time) => {
          val firstNum = rdd.take(num + 1)
          // scalastyle:off println
          println("-------------------------------------------")
          println("Time: " + time)
          println("-------------------------------------------")
          firstNum.take(num).foreach(println)
          if (firstNum.length > num) println("...")
          println()
          // scalastyle:on println
        }
      }
      foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
    }
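The reason print takes num + 1 elements is that the extra element tells it whether more data exists, so it knows when to print the "..." marker. A minimal sketch of just that logic on an ordinary Scala collection, with no Spark involved (`PrintSketch` and `render` are hypothetical names):

```scala
// Hypothetical illustration of the take(num + 1) trick used by print(num).
object PrintSketch {
  /** Returns the lines print(num) would show for these elements (without the header). */
  def render[T](elements: Seq[T], num: Int): Seq[String] = {
    val firstNum = elements.take(num + 1)          // one extra element as a "more data" probe
    val shown = firstNum.take(num).map(_.toString) // only num elements are actually shown
    if (firstNum.length > num) shown :+ "..." else shown
  }
}
```

For example, `PrintSketch.render(Seq("a", "b", "c"), 2)` yields the lines `a`, `b` and `...`, while a two-element input yields only `a` and `b`.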
3. foreachFunc wraps the operations on the RDD and is passed to foreachRDD:
    /**
     * Apply a function to each RDD in this DStream. This is an output operator, so
     * 'this' DStream will be registered as an output stream and therefore materialized.
     * @param foreachFunc foreachRDD function
     * @param displayInnerRDDOps Whether the detailed callsites and scopes of the RDDs generated
     *                           in the `foreachFunc` to be displayed in the UI. If `false`, then
     *                           only the scopes and callsites of `foreachRDD` will override those
     *                           of the RDDs on the display.
     */
    private def foreachRDD(
        foreachFunc: (RDD[T], Time) => Unit,
        displayInnerRDDOps: Boolean): Unit = {
      new ForEachDStream(this,
        context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
    }
4. For every batch duration, a job is generated through generateJob.
    /**
     * An internal DStream used to represent output operations like DStream.foreachRDD.
     * @param parent Parent DStream
     * @param foreachFunc Function to apply on each RDD generated by the parent DStream
     * @param displayInnerRDDOps Whether the detailed callsites and scopes of the RDDs generated
     *                           by `foreachFunc` will be displayed in the UI; only the scope and
     *                           callsite of `DStream.foreachRDD` will be displayed.
     */
    private[streaming]
    class ForEachDStream[T: ClassTag] (
        parent: DStream[T],
        foreachFunc: (RDD[T], Time) => Unit,
        displayInnerRDDOps: Boolean
      ) extends DStream[Unit](parent.ssc) {

      override def dependencies: List[DStream[_]] = List(parent)

      override def slideDuration: Duration = parent.slideDuration

      override def compute(validTime: Time): Option[RDD[Unit]] = None

      // For every batch duration, a Job is generated by generateJob
      override def generateJob(time: Time): Option[Job] = {
        parent.getOrCompute(time) match {
          case Some(rdd) =>
            val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
              // foreachFunc, applied to rdd and time, is wrapped into jobFunc, so it is
              // only invoked when job.run executes.
              // The rdd here is the RDD generated for this batch time; it is determined
              // by the last DStream in the DStreamGraph.
              foreachFunc(rdd, time)
            }
            Some(new Job(time, jobFunc))
          case None => None
        }
      }
    }
5. Where does this foreachFunc come from?
    // foreachFunc is passed in as a constructor parameter, so the next step is to
    // find where ForEachDStream is instantiated.
    private[streaming]
    class ForEachDStream[T: ClassTag] (
        parent: DStream[T],
        foreachFunc: (RDD[T], Time) => Unit,
        displayInnerRDDOps: Boolean
      ) extends DStream[Unit](parent.ssc) {
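6. From this we can see that the Job is really generated by ForEachDStream through generateJob. At this point the Job is still at the logical level; the call that actually triggers it is generateJobs, which is invoked from JobGenerator.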
    def generateJobs(time: Time): Seq[Job] = {
      logDebug("Generating jobs for time " + time)
      val jobs = this.synchronized {
        // Each outputStream here is a ForEachDStream
        outputStreams.flatMap { outputStream =>
          val jobOption = outputStream.generateJob(time)
          jobOption.foreach(_.setCallSite(outputStream.creationSite))
          jobOption
        }
      }
      logDebug("Generated " + jobs.length + " jobs for time " + time)
      jobs
    }
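The distinction between generating a logical Job and actually running it can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's classes: `SketchJob`, `SketchOutputStream`, `SketchForEach` and `GenerateJobsSketch` are hypothetical names, batch times are plain `Long` values, and a `Seq[T]` stands in for an RDD. generateJob only wraps the function in a job object; nothing executes until run() is called.

```scala
// A logical job: it just holds a function; nothing happens until run() is called.
final class SketchJob(val time: Long, func: () => Unit) {
  def run(): Unit = func()
}

trait SketchOutputStream {
  def generateJob(time: Long): Option[SketchJob]
}

// Stand-in for ForEachDStream: `parent` plays the role of parent.getOrCompute(time).
class SketchForEach[T](
    parent: Long => Option[Seq[T]],
    foreachFunc: (Seq[T], Long) => Unit) extends SketchOutputStream {
  def generateJob(time: Long): Option[SketchJob] =
    parent(time).map { rdd =>
      // foreachFunc is captured here and only invoked when the job runs
      new SketchJob(time, () => foreachFunc(rdd, time))
    }
}

object GenerateJobsSketch {
  /** One optional job per output stream; streams with no data for `time` contribute none. */
  def generateJobs(outputs: Seq[SketchOutputStream], time: Long): Seq[SketchJob] =
    outputs.flatMap(_.generateJob(time))
}
```

In this model, whoever receives the returned jobs decides when and on which thread `run()` is called, which mirrors the split between generating jobs and executing them on the job executor pool.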
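Notes:
1. DT Big Data Dream Factory WeChat public account: DT_Spark
2. IMF 8 p.m. big data hands-on YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains
This article is reposted from http://blog.csdn.net/snail_gesture/article/details/51417769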