Introduction to RDDs
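An important concept in Spark is the RDD (resilient distributed dataset), a scalable distributed dataset that can span multiple nodes. Parallel computation over RDDs is the core of Spark's in-memory computing. An RDD can be understood as a resilient, distributed, immutable, partitioned collection of data, and Spark batch processing is in effect a series of collection operations on RDDs. RDDs have the following characteristics: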
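- Every RDD has a number of partitions, which determines the parallelism of the corresponding stage of the program.
- The distributed computation on an RDD is carried out partition by partition.
- Because RDDs are read-only, each transformation produces a new RDD that depends on its parent (wide or narrow dependency).
- For key-value RDDs a partitioning strategy can be specified (usually one provided by the system).
- For files stored on HDFS, the system can work out the preferred location for computing each split. (Background knowledge only.)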
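The five characteristics above can be pictured as a small interface. The following is a toy sketch in plain Scala, not Spark's actual classes: `MiniRDD`, `RangeRDD` and every member name are hypothetical and chosen only to mirror the list.

```scala
object RddShapeSketch {
  // Hypothetical miniature of the five properties in the list; not Spark's API.
  trait MiniRDD[A] {
    def partitions: Vector[Int]                 // partition ids; their count bounds the parallelism
    def compute(partition: Int): Iterator[A]    // work is done one partition at a time
    def dependencies: List[MiniRDD[_]]          // parent datasets this one was derived from
    def partitioner: Option[Any => Int] = None  // only meaningful for key-value data
    def preferredLocations(partition: Int): List[String] = Nil // e.g. hosts holding a file block
  }

  // A source dataset over the integers [start, end), spread round-robin over numPartitions.
  final class RangeRDD(start: Int, end: Int, numPartitions: Int) extends MiniRDD[Int] {
    val partitions: Vector[Int] = Vector.range(0, numPartitions)
    val dependencies: List[MiniRDD[_]] = Nil    // a source has no parents
    def compute(partition: Int): Iterator[Int] =
      (start until end).iterator.filter(n => Math.floorMod(n, numPartitions) == partition)
  }
}
```

With two partitions, `new RddShapeSketch.RangeRDD(0, 5, 2)` computes the even numbers in partition 0 and the odd numbers in partition 1, independently of each other.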
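Take a word-count job as an example. From such code it is easy to see that an entire Spark job revolves around three kinds of RDD operations: RDD creation, RDD transformation, and RDD action. By convention flatMap/map/reduceByKey are called transformation operators, while collect triggers job execution and is therefore called an action operator. All transformation operators in Spark are lazy: Spark only runs the job when it meets an action operator. In other words, only then does SparkContext have the job split into the stages of a DAG, is the TaskSet of each stage computed, and does the TaskScheduler submit the tasks to the Executors. The word-count computation flow is as follows: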
textFile("path", numPartitions)
-> flatMap
-> map
-> reduceByKey
-> sortBy
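Among these, flatMap/map, reduceByKey and sortBy are all transformation operators, and all transformation operators are executed lazily. When the program reaches collect (an action operator), the system triggers a job. Under the hood, Spark splits the whole computation into several stages according to the dependencies between RDDs. These dependencies are usually called the RDD's lineage, and a dependency in the lineage is either a wide dependency or a narrow dependency.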
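The lazy behaviour described above can be imitated without a cluster. The sketch below is a toy model in plain Scala, not Spark code: `MiniRDD`, the standalone `reduceByKey` helper and the `linesRead` counter are hypothetical names introduced for illustration. Each transformation only records a recipe, and `collect` is the one call that runs it.

```scala
object LazyPipelineSketch {
  // Hypothetical toy stand-in for an RDD: it stores a recipe (a thunk), not data.
  final class MiniRDD[A](val compute: () => Seq[A]) {
    def flatMap[B](f: A => Seq[B]): MiniRDD[B] = new MiniRDD(() => compute().flatMap(f))
    def map[B](f: A => B): MiniRDD[B] = new MiniRDD(() => compute().map(f))
    def sortBy[K](f: A => K)(implicit ord: Ordering[K]): MiniRDD[A] =
      new MiniRDD(() => compute().sortBy(f))
    // The only "action": evaluates the whole chain of recipes.
    def collect(): Seq[A] = compute()
  }

  // Transformation for key-value data: merges the values of each key with f.
  def reduceByKey[K, V](rdd: MiniRDD[(K, V)])(f: (V, V) => V): MiniRDD[(K, V)] =
    new MiniRDD(() =>
      rdd.compute().groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }.toSeq)

  // Counts how many times the source data has actually been read.
  var linesRead = 0

  // Builds the word-count recipe; nothing is executed here.
  def wordCount(lines: Seq[String]): MiniRDD[(String, Int)] = {
    val source = new MiniRDD(() => { linesRead += 1; lines })
    val pairs = source.flatMap(_.split(" ").toSeq).map(word => (word, 1))
    reduceByKey(pairs)(_ + _).sortBy(_._1)
  }
}
```

Calling `wordCount(lines)` leaves `linesRead` at 0; the source is read only once `collect()` is invoked on the result.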
RDD Fault Tolerance
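Before looking at how the DAGScheduler splits a job into stages, you need to know one technical term: lineage, the "bloodline" of an RDD. Before explaining what RDD lineage is, consider a picture of how a programmer evolves.
The figure traces the evolution of a programmer from his origins, and RDD transformations can be understood in much the same way. Spark computation is essentially a series of transformations on RDDs. Because an RDD is an immutable, read-only collection, each transformation takes the previous RDD as its input, so the lineage of an RDD describes the dependencies between RDDs. To keep RDD data robust, an RDD remembers through its lineage how it was derived from other RDDs. Spark classifies the relationships between RDDs as wide dependencies and narrow dependencies, and uses the dependencies recorded in the lineage to recover from failures. Spark currently provides three fault-tolerance strategies: recomputing from the RDD dependencies, caching the RDD, and checkpointing the RDD.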
RDD Wide and Narrow Dependencies
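In terms of lineage, RDD dependencies fall into two kinds, Narrow Dependencies and Wide Dependencies, a distinction that keeps fault recovery efficient. A Narrow Dependency means that each partition of the parent RDD is used by at most one partition of the child RDD: either one parent partition maps to one child partition, or several parent partitions map to one child partition. In other words, a single parent partition never maps to multiple child partitions. A Wide Dependency means that one partition of the parent RDD maps to multiple partitions of the child RDD.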
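With Wide Dependencies the input and output of the computation live on different nodes, so a cross-node shuffle is generally required. Recovering an RDD across a wide dependency therefore means recomputing on multiple nodes, which is costly. With Narrow Dependencies, by contrast, the computation between RDDs happens inside a single Task, within one thread, so a lost RDD partition is easy to recover.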
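The difference can be made concrete with a toy model of partitions in plain Scala. This is a sketch, not Spark code: `DependencySketch` and its functions are hypothetical, and hash partitioning by `hashCode` is assumed for the wide case. `fanOut` reports, for every parent partition, which child partitions it feeds.

```scala
object DependencySketch {
  type Partition[A] = Vector[A]

  // Child partition a key is routed to (hash partitioning; floorMod keeps the index non-negative).
  def target[K](key: K, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Narrow: child partition i is computed from parent partition i only.
  def mapValuesNarrow[K, V, W](parent: Vector[Partition[(K, V)]])(f: V => W): Vector[Partition[(K, W)]] =
    parent.map(_.map { case (k, v) => (k, f(v)) })

  // Wide: each child partition gathers its keys from every parent partition (a shuffle).
  def shuffle[K, V](parent: Vector[Partition[(K, V)]], numPartitions: Int): Vector[Partition[(K, V)]] =
    Vector.tabulate(numPartitions) { child =>
      parent.flatMap(_.filter { case (k, _) => target(k, numPartitions) == child })
    }

  // For each parent partition, the set of child partitions that read from it.
  def fanOut[K, V](parent: Vector[Partition[(K, V)]], numPartitions: Int): Vector[Set[Int]] =
    parent.map(_.map { case (k, _) => target(k, numPartitions) }.toSet)
}
```

If a child partition produced by `mapValuesNarrow` is lost, only the matching parent partition has to be reprocessed. A child partition produced by `shuffle` may need records from all parent partitions, which is why recovery across a wide dependency costs more.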
Stage Splitting (Key Topic)
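Spark splits a job into stages by walking the RDD lineage backwards. The job submission flow is roughly as shown in the figure below:
The stage-splitting logic in DAGScheduler can be analysed through the following code fragments: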
DAGScheduler.scala, line 719
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  //...
}
DAGScheduler, line 675
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // ... (jobId, func2 and waiter are defined in lines omitted here)
  // eventProcessLoop is backed by an event queue; the posted event is handled via
  // doOnReceive -> case JobSubmitted -> dagScheduler.handleJobSubmitted (line 951)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
DAGScheduler, line 951
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    //...
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    //...
  }
  submitStage(finalStage)
}
DAGScheduler, line 1060
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // find the parent stages of the current stage that are still missing
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // no missing parent stage: submit the tasks of the current stage
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          // recursively submit the parent stage, which in turn looks for its own parents
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
DAGScheduler, line 549 (find the missing parent stages of the current stage)
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new ArrayStack[RDD[_]] // stack of RDDs still to visit
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            // wide dependency (ShuffleDependency): get or create a ShuffleMapStage for it
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // narrow dependency (NarrowDependency): push the parent RDD onto the stack
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) { // drain the stack to collect the missing parent stages
    visit(waitingForVisit.pop())
  }
  missing.toList
}
DAGScheduler, line 1083 (submit the TaskSet of the current stage)
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  // If there are tasks to execute, record the submission time of the stage. Otherwise,
  // post the even without the submission time, which indicates that this stage was
  // skipped.
  if (partitionsToCompute.nonEmpty) {
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  }
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  var partitions: Array[Partition] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    var taskBinaryBytes: Array[Byte] = null
    // taskBinaryBytes and partitions are both effected by the checkpoint status. We need
    // this synchronization in case another concurrent job is checkpointing this RDD, so we get a
    // consistent view of both variables.
    RDDCheckpointData.synchronized {
      taskBinaryBytes = stage match {
        case stage: ShuffleMapStage =>
          JavaUtils.bufferToArray(
            closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
        case stage: ResultStage =>
          JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
      }
      partitions = stage.rdd.partitions
    }
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      // Abort execution
      return
    case e: Throwable =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      // Abort execution
      return
  }
  val tasks: Seq[Task[_]] = try {
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      case stage: ShuffleMapStage =>
        stage.pendingPartitions.clear()
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = partitions(id)
          stage.pendingPartitions += id
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
        }
      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptNumber,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
            stage.rdd.isBarrier())
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  if (tasks.size > 0) {
    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)
    stage match {
      case stage: ShuffleMapStage =>
        logDebug(s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})")
        markMapStageJobsAsFinished(stage)
      case stage : ResultStage =>
        logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
    }
    submitWaitingChildStages(stage)
  }
}
Summary
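The source analysis above shows that Spark's so-called wide and narrow dependencies are in fact ShuffleDependency and NarrowDependency. For a ShuffleDependency the system creates a ShuffleMapStage; a NarrowDependency creates no new stage, and the parent RDD is folded into the current stage. When the backward walk reaches a stage with no missing parent stage (in the simplest case, the stage containing the initial RDD), the system wraps the work of that stage into ShuffleMapTask instances (or ResultTask instances for a ResultStage). The number of tasks equals the number of partitions of the stage that still have to be computed. The tasks are then wrapped into a TaskSet and submitted to the cluster by calling taskScheduler.submitTasks.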
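The backward walk can be reduced to a few lines on a toy lineage graph. The sketch below is plain Scala and only imitates the idea of getMissingParentStages: `Node`, `Narrow`, `Shuffle`, `stageOf` and `allStages` are hypothetical names, caching and stage reuse are ignored, and the shuffle boundaries given to the word-count lineage are assumptions of this toy.

```scala
object StageSplitSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep   // stands in for NarrowDependency
  final case class Shuffle(parent: Node) extends Dep  // stands in for ShuffleDependency
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Walks back from the last RDD of a stage with an explicit stack.
  // Returns the RDD names pipelined into this stage and the last RDDs of its parent stages.
  def stageOf(last: Node): (Set[String], List[Node]) = {
    var members = Set.empty[String]
    var parents = List.empty[Node]
    var stack = List(last)
    while (stack.nonEmpty) {
      val node = stack.head
      stack = stack.tail
      if (!members(node.name)) {
        members += node.name
        node.deps.foreach {
          case Narrow(p)  => stack = p :: stack     // same stage: keep walking
          case Shuffle(p) => parents = p :: parents // stage boundary: p ends a parent stage
        }
      }
    }
    (members, parents)
  }

  // All stages in execution order: parents first, the stage of `last` at the end.
  def allStages(last: Node): List[Set[String]] = {
    val (members, parents) = stageOf(last)
    parents.flatMap(allStages) :+ members
  }

  // Toy lineage of the word-count pipeline.
  val wordCount: Node = {
    val textFile = Node("textFile")
    val flatMap = Node("flatMap", List(Narrow(textFile)))
    val map = Node("map", List(Narrow(flatMap)))
    val reduceByKey = Node("reduceByKey", List(Shuffle(map)))
    Node("sortBy", List(Shuffle(reduceByKey)))
  }
}
```

For this lineage `allStages(wordCount)` yields three stages: textFile, flatMap and map are pipelined into the first one, and each shuffle starts a new stage.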
RDD Caching
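Caching is one of the fault-tolerance mechanisms for RDD computation. When RDD data is lost, the program can quickly recover the current RDD from the cache instead of tracing back and recomputing all upstream RDDs. When an RDD is used several times, the user can therefore cache it to improve execution efficiency. A test: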
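import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf()
  .setAppName("word-count")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
val value: RDD[String] = sc.textFile("file:///D:/demo/words/")
  .cache()
value.count() // the first action computes the RDD and fills the cache
var begin = System.currentTimeMillis()
value.count() // served from the cache
var end = System.currentTimeMillis()
println("Elapsed with cache: " + (end - begin)) // Elapsed with cache: 253
// drop the cache
value.unpersist()
begin = System.currentTimeMillis()
value.count() // recomputed from the source files
end = System.currentTimeMillis()
println("Elapsed without cache: " + (end - begin)) // 2029
sc.stop()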
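The effect measured above can be reproduced deterministically by counting recomputations instead of timing them. The following is a toy model in plain Scala, not Spark code: `Dataset` and its `computations` counter are hypothetical, and a whole dataset is cached at once rather than partition by partition.

```scala
object CacheSketch {
  // Hypothetical toy dataset: `compute` stands for replaying the whole lineage.
  final class Dataset[A](compute: () => Vector[A]) {
    private var cachingEnabled = false
    private var cached: Option[Vector[A]] = None
    var computations = 0 // how many times the lineage has been replayed

    def cache(): this.type = { cachingEnabled = true; this } // lazy: nothing is computed yet
    def unpersist(): this.type = { cachingEnabled = false; cached = None; this }

    def collect(): Vector[A] = cached.getOrElse {
      computations += 1
      val data = compute()
      if (cachingEnabled) cached = Some(data) // filled by the first action after cache()
      data
    }

    def count(): Int = collect().size
  }
}
```

With caching enabled, two consecutive `count()` calls replay the lineage once; after `unpersist()` every `count()` replays it again.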
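Besides cache, Spark offers finer-grained RDD caching, so users can pick a suitable policy according to the memory available in the cluster. The storage level is specified with the persist method, and the following levels are available: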
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
xxRDD.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
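Here:
MEMORY_ONLY: the data is kept in memory without any serialization. This is fast but memory-hungry: partitions that do not fit in memory are not cached and have to be recomputed when they are needed.
MEMORY_ONLY_SER: the same as MEMORY_ONLY, except that the RDD data is serialized, trading CPU time for memory. It can still run short of memory.
The _2 suffix means that the cached data is replicated (two copies). If you are not sure which level to use, MEMORY_AND_DISK_SER_2 is generally recommended.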
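The boolean flags in the listing are easier to read once they are named. The sketch below is plain Scala and only mirrors the argument order used in the listing, read here as (useDisk, useMemory, useOffHeap, deserialized, replication); `Level` and `describe` are hypothetical names, not part of Spark.

```scala
object StorageLevelSketch {
  // Hypothetical mirror of the five constructor arguments used in the listing.
  final case class Level(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int = 1) {

    // Human-readable summary: where the data lives, in which form, and how many copies.
    def describe: String = {
      val where = List(
        if (useMemory) Some(if (useOffHeap) "off-heap memory" else "memory") else None,
        if (useDisk) Some("disk") else None).flatten
      val place = if (where.isEmpty) "nowhere" else where.mkString(" + ")
      val form = if (deserialized) "deserialized objects" else "serialized bytes"
      s"$place, $form, $replication replica(s)"
    }
  }

  // Same flag values as the corresponding lines of the listing.
  val MEMORY_ONLY = Level(false, true, false, true)
  val MEMORY_AND_DISK_SER_2 = Level(true, true, false, false, 2)
}
```

Read this way, MEMORY_AND_DISK_SER_2 keeps serialized bytes in memory, can also use disk, and holds two replicas.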
Checkpoint Mechanism
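Caching helps an RDD recover from failures, but if the cache is lost the system still has to recompute the RDD. For RDDs with a long lineage whose computation is expensive, users can store the result with the checkpoint mechanism instead. The biggest difference from caching is that the data of a checkpointed RDD is persisted directly to a file system (writing it to HDFS is generally recommended), and checkpoint data is not cleaned up automatically. Note that checkpoint first only marks the RDD. The marked RDD is actually checkpointed after the job has finished, which means the marked RDD and its dependencies are computed a second time. To avoid computing the marked RDD twice, the recommended approach is to cache the RDD before checkpointing it: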
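import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setMaster("yarn").setAppName("wordcount")
val sc = new SparkContext(conf)
// directory where checkpoint files are written
sc.setCheckpointDir("hdfs:///checkpoints")
val lineRDD: RDD[String] = sc.textFile("hdfs:///words/t_word.txt")
val cacheRdd = lineRDD.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map(tuple => (tuple._1, tuple._2.sum))
  .sortBy(tuple => tuple._2, false, 1)
  .cache() // cache first so that the checkpoint pass does not recompute the lineage
cacheRdd.checkpoint() // only marks the RDD; the data is written after the first action
cacheRdd.collect().foreach(tuple => println(tuple._1 + "->" + tuple._2))
cacheRdd.unpersist()
// stop the SparkContext
sc.stop()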
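Why caching before checkpointing saves a recomputation can be seen in a toy model. The sketch below is plain Scala, not Spark code: `Dataset`, `computations` and `isCheckpointed` are hypothetical, an in-memory field stands in for the checkpoint files, and the second pass after the action is a simplification of the behaviour described above.

```scala
object CheckpointSketch {
  final class Dataset(compute: () => Vector[Int]) {
    var computations = 0 // how many times the lineage has been replayed
    private var cachingEnabled = false
    private var cached: Option[Vector[Int]] = None
    private var marked = false
    private var checkpointed: Option[Vector[Int]] = None // stands in for files on HDFS

    def cache(): this.type = { cachingEnabled = true; this }
    def checkpoint(): Unit = { marked = true } // only marks; nothing is written yet
    def isCheckpointed: Boolean = checkpointed.isDefined

    // Prefer checkpointed data, then the cache, and replay the lineage only as a last resort.
    private def materialize(): Vector[Int] =
      checkpointed.orElse(cached).getOrElse {
        computations += 1
        val data = compute()
        if (cachingEnabled) cached = Some(data)
        data
      }

    // Action: run the job, then write the checkpoint in a second pass if the dataset is marked.
    def collect(): Vector[Int] = {
      val result = materialize()
      if (marked && checkpointed.isEmpty) checkpointed = Some(materialize())
      result
    }
  }
}
```

Without `cache()`, the first `collect()` on a marked dataset replays the lineage twice, once for the job and once for the checkpoint pass. With `cache()` the second pass reads the cached data, so the lineage is replayed only once.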