1 The Job execution flow, seen from reduce
1.1 The reduce operation
Let's take the reduce operation as an example and walk through how a job is executed.
def reduce(f: (T, T) => T): T = withScope {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
The concrete roles of the two helpers defined inside reduce were summarized earlier: reducePartition folds the elements of one partition with the cleaned user function (returning None for an empty partition), and mergeResult merges each partition's result into jobResult on the driver.
At the end, reduce calls sc.runJob(this, reducePartition, mergeResult), which runs reducePartition on every partition and hands each partial result to mergeResult.
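To make that cooperation concrete, here is a minimal hedged sketch in plain Scala (not Spark source; the partition contents are made up) showing how the per-partition fold and the driver-side merge combine:
// Each "partition" is folded locally by reducePartition; the per-partition
// Options are then merged sequentially on the driver, exactly like mergeResult.
val f: (Int, Int) => Int = _ + _
val partitions: Seq[Iterator[Int]] =
  Seq(Iterator(1, 2, 3), Iterator.empty, Iterator(4, 5)) // hypothetical partition contents

val reducePartition: Iterator[Int] => Option[Int] =
  iter => if (iter.hasNext) Some(iter.reduceLeft(f)) else None

var jobResult: Option[Int] = None
partitions.map(reducePartition).foreach { taskResult =>
  if (taskResult.isDefined) {
    jobResult = jobResult match {
      case Some(value) => Some(f(value, taskResult.get))
      case None        => taskResult
    }
  }
}
assert(jobResult == Some(15)) // the empty partition is simply skipped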
So what does runJob actually do?
2 runJob in SparkContext
Now step into runJob in SparkContext:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
runJob's parameters:
rdd: the target RDD;
func: the function to run on each partition of the RDD;
partitions: the partitions to compute; some jobs do not compute all partitions of the target RDD (take(), for example, may only need the first few);
resultHandler: a callback that receives the result of each partition.
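A small, hedged example of how these parameters show up from user code (it assumes a live SparkContext named sc; the data and partition ids are made up):
// Four partitions of made-up data.
val data = sc.parallelize(1 to 100, numSlices = 4)

// partitions: a job that only computes partitions 0 and 2 and returns their sizes.
val partialSizes: Array[Int] =
  sc.runJob(data, (iter: Iterator[Int]) => iter.size, Seq(0, 2))

// resultHandler: the overload that reduce uses pushes every partition's result
// into a callback on the driver instead of collecting an Array.
var total = 0
sc.runJob(data, (iter: Iterator[Int]) => iter.sum,
  (_: Int, partSum: Int) => total += partSum)
// total now holds 5050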
2.1 Management by the DAGScheduler
From here on the job is managed by the DAGScheduler: its runJob method submits the job, which in turn calls submitJob.
Before following that path, let's look at what the DAGScheduler actually holds.
2.1.1 Stage-related fields of the DAGScheduler
First, the counters and maps that hand out IDs:
private[scheduler] val nextJobId = new AtomicInteger(0)                   // job ids start at 0
private[scheduler] def numTotalJobs: Int = nextJobId.get()                // every job gets an id, so the counter equals the number of jobs
private val nextStageId = new AtomicInteger(0)                            // stage id counter
private[scheduler] val jobIdToStageIds = new HashMap[Int, HashSet[Int]]   // one job can map to several stages
private[scheduler] val stageIdToStage = new HashMap[Int, Stage]
Then the hash tables and sets that track stages:
private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]  // map stages created for shuffle dependencies
private[scheduler] val jobIdToActiveJob = new HashMap[Int, ActiveJob]           // currently active jobs
// stages whose parent stages have not finished yet
private[scheduler] val waitingStages = new HashSet[Stage]
// stages that are currently running
private[scheduler] val runningStages = new HashSet[Stage]
2.1.2 submitJob
DAGScheduler.runJob actually delegates to submitJob:
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
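What happens to the returned JobWaiter? Roughly the following, shown here as a hedged sketch of DAGScheduler.runJob (the exact code differs between Spark versions): the calling thread blocks until every partition result has been delivered to resultHandler, and then the job's success or failure is reported.
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
// Block until the JobWaiter is completed by the last taskSucceeded/jobFailed call.
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
waiter.completionFuture.value.get match {
  case scala.util.Success(_) =>
    logInfo("Job %d finished: %s".format(waiter.jobId, callSite.shortForm))
  case scala.util.Failure(exception) =>
    logInfo("Job %d failed: %s".format(waiter.jobId, callSite.shortForm))
    throw exception
}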
submitJob posts this action to the scheduler's event loop and returns a JobWaiter:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // (check that the requested partition ids are in range, omitted here)
  val jobId = nextJobId.getAndIncrement()  // allocate a new job id
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]  // erase the element type so the function can be attached to tasks
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
2.1.3 DAGSchedulerEventProcessLoop
The submitted event now enters the DAGSchedulerEventProcessLoop.
First, eventProcessLoop.post:
def post(event: E): Unit = {
  eventQueue.put(event)
}
eventQueue is a blocking queue of events, so post simply enqueues the event.
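For intuition, here is a minimal, hedged sketch of the post/onReceive pattern that the event loop follows (plain Scala, not Spark source; the names are illustrative): a producer thread enqueues events, and a dedicated daemon thread dequeues and dispatches them.
import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          onReceive(eventQueue.take())  // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // the loop was stopped
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)  // what the post() above does
  protected def onReceive(event: E): Unit
}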
Once an event is dequeued, doOnReceive dispatches on its type. Since a JobSubmitted object was posted, the match executes handleJobSubmitted:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  // ... the remaining cases are omitted
}
So handleJobSubmitted is where stages are actually split and submitted.
2.2 handleJobSubmitted
Here is the full handleJobSubmitted:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // build the final stage of the job
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)  // represents a job running in the DAG
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)  // submit the final stage
}
Inside this method, the final stage is obtained first:
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
Step into createResultStage:
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val parents = getOrCreateParentStages(rdd, jobId)  // get the parent stages
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
Here the flow reaches a key function:
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  getShuffleDependencies(rdd).map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}
getOrCreateParentStages takes the current RDD and job id and returns the list of parent stages, i.e. the ShuffleMapStages that this ResultStage directly depends on (the new stage's own id is allocated separately in createResultStage via nextStageId). It first collects the direct shuffle dependencies of the RDD with getShuffleDependencies:
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]  // the shuffle dependencies found so far
  val visited = new HashSet[RDD[_]]                      // RDDs already visited
  val waitingForVisit = new Stack[RDD[_]]                // RDDs still to visit
  waitingForVisit.push(rdd)  // start from the given RDD
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep
        case dependency =>
          waitingForVisit.push(dependency.rdd)  // narrow dependency: keep walking up the lineage
      }
    }
  }
  parents
}
For every RDD popped off the stack, its dependencies are inspected:
if a dependency is narrow, its parent RDD is pushed onto the stack so the walk continues up the lineage;
if it is a shuffle (wide) dependency, it is added to parents. The returned parents set therefore holds the shuffle dependencies closest to the given RDD.
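A hedged illustration of where these shuffle boundaries fall in a small lineage (it assumes a SparkContext named sc and an input file; both are placeholders):
val words   = sc.textFile("input.txt").flatMap(_.split(" "))  // narrow dependencies
val counts  = words.map((_, 1)).reduceByKey(_ + _)            // first ShuffleDependency
val lengths = counts.map { case (w, n) => (w.length, n) }     // narrow dependency
val byLen   = lengths.reduceByKey(_ + _)                      // second ShuffleDependency
// getShuffleDependencies(byLen) returns only the ShuffleDependency introduced by
// the second reduceByKey; the first one is discovered later, while building the
// parent ShuffleMapStage for that dependency.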
With those shuffle dependencies in hand, getOrCreateShuffleMapStage either reuses an existing ShuffleMapStage for the shuffle or creates it (together with any missing ancestor stages):
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    // if shuffleIdToMapStage already has a ShuffleMapStage for this shuffle, return it directly
    case Some(stage) =>
      stage
    // otherwise, first create stages for any ancestor shuffle dependencies that are still missing
    case None =>
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
A related helper, getMissingParentStages, walks a stage's lineage and collects the parent stages that are "missing", i.e. whose output is not yet available; an empty result means the stage either has no parents or all of its parents have already been computed:
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]     // parent stages that still need to be computed
  val visited = new HashSet[RDD[_]]    // RDDs already handled
  // RDDs not yet handled go onto this stack
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            // a shuffle dependency marks a stage boundary, so it maps to a (possibly new) ShuffleMapStage
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // a narrow dependency stays inside the current stage; keep walking up the lineage
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)  // start from the stage's own RDD and walk back through its lineage
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList  // return the missing parent stages as a list
}
createShuffleMapStage finally builds and registers the ShuffleMapStage:
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length  // one task per partition
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep)
  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // fetch the serialized MapOutputStatus objects for this shuffleId from the mapOutputTracker
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)  // deserialize them
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
With this, getOrCreateParentStages has produced the list of parent stages that the RDD depends on.
createResultStage returns the ResultStage bound to this jobId, which is assigned to finalStage.
Back in handleJobSubmitted, an ActiveJob is then created for the resulting stage graph:
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
Then the job's final stage is submitted:
submitStage(finalStage)
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)  // find the job id this stage belongs to
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // only proceed if the stage is not already waiting, running, or marked as failed
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)  // parent stages that still have to be computed
      logDebug("missing: " + missing)
      if (missing.isEmpty) {  // all parents are available
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)  // every parent stage is done, so this stage can be turned into tasks
      } else {
        for (parent <- missing) {
          submitStage(parent)  // otherwise recursively submit the missing parent stages first
        }
        waitingStages += stage  // park this stage in waitingStages until its parents finish
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
Finally, let's analyse submitMissingTasks.
Its main work is to:
1. determine which partitions of the stage need to be computed;
2. start the stage (via outputCommitCoordinator.stageStart);
3. build a map from partition id to preferred task locations;
4. mark a new stage attempt;
5. serialize the stage's RDD (plus shuffle dependency or result function) and broadcast it;
6. create one task per partition and collect them into a TaskSet;
7. submit the tasks via taskScheduler.submitTasks().
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // the ids of the partitions that still need to be computed
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // the properties of the job this stage belongs to
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage  // add the stage to the set of running stages
  // depending on the kind of stage, tell the output commit coordinator that the stage starts
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  // compute the preferred locations for every partition that will be computed
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  // If there are tasks to execute, record the submission time of the stage. Otherwise,
  // post the event without the submission time, which indicates that this stage was
  // skipped.
  if (partitionsToCompute.nonEmpty) {
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  }
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  // create one task for each partition of the stage
  val tasks: Seq[Task[_]] = try {
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      case stage: ShuffleMapStage =>
        stage.pendingPartitions.clear()
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          stage.pendingPartitions += id
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }
      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  if (tasks.size > 0) {
    // if there are tasks, submit them through taskScheduler.submitTasks(); otherwise mark the stage as finished
    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)
    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage: ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
    submitWaitingChildStages(stage)
  }
}
3 Summary of stage splitting
First, an action triggers computation and submits a job to the DAGScheduler.
On receipt, the DAGScheduler starts from the RDD at the end of the dependency chain and walks through the dependencies of every RDD it reaches.
Whenever a shuffle dependency appears among an RDD's dependencies, that RDD marks a stage boundary: its shuffle output becomes the input of the current stage, and a new parent stage is created for the shuffle.
The result is a set of stages; the stage that directly triggered the job becomes the final stage (a ResultStage), and an ActiveJob instance is created for it.
When a stage is submitted, the scheduler first checks whether its parent stages' results are available; if so, the stage is submitted, otherwise it is placed into waitingStages and its missing parents are submitted first.
When the tasks of an intermediate stage finish, the DAGScheduler checks whether all of that stage's tasks are complete and rescans waitingStages, submitting every stage whose parents are now all finished, until no waiting stages remain.
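As a closing, hedged illustration of this whole flow (it assumes a live SparkContext named sc; the data and partition count are made up), the following driver program produces one job with two stages:
val rdd = sc.parallelize(1 to 1000, numSlices = 4)  // 4 partitions -> 4 tasks per stage
  .map(i => (i % 10, i))                            // narrow dependency: stays in the same stage
  .reduceByKey(_ + _)                               // ShuffleDependency: stage boundary
  .map { case (_, sum) => sum }                     // narrow dependency: same stage as the result

val total = rdd.reduce(_ + _)  // the action: triggers runJob -> submitJob -> handleJobSubmitted
// Stage 0 (ShuffleMapStage): parallelize -> map, writes shuffle output.
// Stage 1 (ResultStage): reduceByKey -> map -> reduce, reads that shuffle output.
println(total)  // 500500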