First, a quick analysis of DStream's subclasses:
A. From the diagram above, the InputDStream subclasses are all data-source DStreams. InputDStream splits into two kinds: ReceiverInputDStream subclasses, and those that do not need a receiver, such as FileInputDStream.
B. In the diagram above, ForEachDStream is the output DStream: every output operator ultimately ends up calling into this class.

I. How does the FileInputDStream in the example get added to the DStreamGraph?
1. As usual, start from the example and follow the thread:
/**
 * @author [email protected]
 */
object HdfsWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: HdfsWordCount <directory>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    // Create the context
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create the FileInputDStream on the directory and use the
    // stream to count words in new files created
    val lines = ssc.textFileStream(args(0))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
2. Tracing into textFileStream() shows that it ends up in the following code:
def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ] (directory: String): InputDStream[(K, V)] = {
  // ssc.textFileStream triggers the creation of a new FileInputDStream,
  // which extends InputDStream
  new FileInputDStream[K, V, F](this, directory)
}
3. So how does the FileInputDStream get added to the DStreamGraph?
Because FileInputDStream extends InputDStream, instantiating a FileInputDStream also runs the constructor body of the parent class InputDStream, where the statement ssc.graph.addInputStream(this) executes, so the new instance adds itself to the DStreamGraph:
abstract class InputDStream[T: ClassTag] (ssc_ : StreamingContext)
  extends DStream[T](ssc_) {

  private[streaming] var lastValidTime: Time = null

  ssc.graph.addInputStream(this)
Stepping into DStreamGraph, a quick look at addInputStream shows that the incoming FileInputDStream is stored in inputStreams, a mutable array (ArrayBuffer).

==> Conclusion: application code may call textFileStream() multiple times to monitor different directories.
final private[streaming] class DStreamGraph extends Serializable with Logging {

  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  private val outputStreams = new ArrayBuffer[DStream[_]]()

  ...

  def addInputStream(inputStream: InputDStream[_]) {
    this.synchronized {
      inputStream.setGraph(this)
      inputStreams += inputStream
    }
  }
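The registration pattern above can be modeled with a self-contained sketch (the Toy* names are illustrative, not the real Spark API): a parent-class constructor body runs at `new` time, so every subclass instance registers itself with the graph without any explicit call at the use site.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-ins for DStreamGraph / StreamingContext (names are made up).
class ToyGraph {
  private val inputStreams = new ArrayBuffer[ToyInputStream]()
  def addInputStream(s: ToyInputStream): Unit = this.synchronized {
    inputStreams += s
  }
  def inputCount: Int = inputStreams.size
}

class ToyContext { val graph = new ToyGraph }

// The superclass constructor body runs on instantiation, mirroring
// `ssc.graph.addInputStream(this)` in InputDStream.
abstract class ToyInputStream(ctx: ToyContext) {
  ctx.graph.addInputStream(this)
}

// A concrete subclass, mirroring FileInputDStream: no extra code needed,
// instantiation alone registers it with the graph.
class ToyFileInputStream(ctx: ToyContext, directory: String)
  extends ToyInputStream(ctx)

object RegistrationDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new ToyContext
    new ToyFileInputStream(ctx, "/data/dir1")
    new ToyFileInputStream(ctx, "/data/dir2") // second call, second entry
    println(ctx.graph.inputCount) // prints 2
  }
}
```

This also illustrates why calling textFileStream() more than once simply yields more entries in inputStreams.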
4. The DStream operators that follow only build up the DStream DAG view, much like the RDD DAG:
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
Stepping into the flatMap operator:
def flatMap[U: ClassTag](flatMapFunc: T => Traversable[U]): DStream[U] = ssc.withScope {
  new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
}
A quick look inside FlatMappedDStream:
private[streaming] class FlatMappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    flatMapFunc: T => Traversable[U]
  ) extends DStream[U](parent.ssc) {

  // Every DStream subclass implements def dependencies: List[DStream[_]],
  // which returns the list of parent DStreams it depends on.
  // An InputDStream has no parent, so its dependencies returns List().
  // FlatMappedDStream has exactly one dependency: the parent DStream
  // on which the flatMap operator was called.
  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  /** Method that generates a RDD for the given time */
  // Per the DStream docs, this method produces the RDD for each batch time.
  // How compute gets invoked and how the RDD is generated is analyzed later.
  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }
}
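The parent-first pull in compute can be sketched with a toy model (illustrative names, not Spark's API): each node asks its parent for the batch's data and applies its own function, so computing the last node walks the whole dependency chain back to the source.

```scala
// Toy DStream: mirrors how FlatMappedDStream.compute calls
// parent.getOrCompute(validTime) and then flat-maps over the result.
abstract class ToyDStream[T] {
  def dependencies: List[ToyDStream[_]]
  def compute(time: Long): Option[Seq[T]]
  final def getOrCompute(time: Long): Option[Seq[T]] = compute(time)

  // The operator just builds a new node on top of `this`.
  def flatMap[U](f: T => Iterable[U]): ToyDStream[U] =
    new ToyFlatMapped(this, f)
}

class ToySource[T](batches: Map[Long, Seq[T]]) extends ToyDStream[T] {
  override def dependencies: List[ToyDStream[_]] = List() // no parent
  override def compute(time: Long): Option[Seq[T]] = batches.get(time)
}

class ToyFlatMapped[T, U](parent: ToyDStream[T], f: T => Iterable[U])
  extends ToyDStream[U] {
  override def dependencies: List[ToyDStream[_]] = List(parent) // one parent
  override def compute(time: Long): Option[Seq[U]] =
    parent.getOrCompute(time).map(_.flatMap(f))
}

object ComputeDemo {
  def main(args: Array[String]): Unit = {
    val source = new ToySource(Map(1L -> Seq("a b", "c")))
    val words  = source.flatMap(_.split(" ").toSeq)
    println(words.compute(1L)) // Some(List(a, b, c)) — pulled via the parent
  }
}
```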
II. How does the OutputDStream in the example get added to the DStreamGraph?
1. In the example above, the print() operator ultimately ends up on ForEachDStream.
(In fact, every output operator goes through ForEachDStream; a few steps of source tracing confirm this.)
wordCounts.print()
2. So how does it get added to the DStreamGraph?
A. The print() operator prints the first 10 elements of each RDD by default (the no-argument print() delegates to print(10)):
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      // take(num + 1) fetches one extra element so we can tell
      // whether there is more data beyond what gets printed
      val firstNum = rdd.take(num + 1)
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
    }
  }
A quick note on the clean method: every Scala function is a JVM object. clean's most important job is, when the function object is serialized, to strip out unnecessary members captured from the enclosing scope and to pull in the members the function actually uses, keeping serialization efficient. (This method matters; it shows up all over the source.)

  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
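Why cleaning matters can be shown with a self-contained sketch using plain JVM serialization, no Spark: a closure that references a field of its enclosing object drags the whole object into the serialized graph, while copying the field to a local val first keeps the closure small and serializable. This is roughly the problem Spark's ClosureCleaner addresses automatically (the Holder class and names below are made up for illustration).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A holder that is NOT Serializable, standing in for a driver-side object.
class Holder {
  val tag = "batch-42"

  // Referencing the field captures `this`, i.e. the whole
  // non-serializable Holder.
  def badClosure: String => String = s => s + ":" + tag

  // Copying the field to a local val captures only the String,
  // which serializes fine.
  def goodClosure: String => String = {
    val localTag = tag
    s => s + ":" + localTag
  }
}

object CleanDemo {
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
      true
    } catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println(serializes(h.badClosure))  // false: the closure pins the Holder
    println(serializes(h.goodClosure)) // true: only the String is captured
  }
}
```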
B. Stepping into foreachRDD:
* DStream output operations include print, saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles, and foreachRDD.
* All output operations create a ForEachDStream instance and call register to add it to the DStreamGraph.outputStreams member.
*
* Unlike DStream transformations, which return a new DStream, output operations return nothing; they only create a ForEachDStream that terminates the dependency chain.
private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}
Following register(), we find the DStreamGraph object adding the current ForEachDStream to its members:
/**
 * Register this streaming as an output stream. This would ensure that RDDs of this
 * DStream will be generated.
 */
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
A quick look at DStreamGraph's addOutputStream method shows that outputStreams is also a Scala mutable array (ArrayBuffer):
final private[streaming] class DStreamGraph extends Serializable with Logging {

  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  private val outputStreams = new ArrayBuffer[DStream[_]]()

  def addOutputStream(outputStream: DStream[_]) {
    this.synchronized {
      // ForEachDStream attaches the DStreamGraph object to itself and to its
      // parent DStreams, forming the DStreamGraph's DAG view
      outputStream.setGraph(this)
      outputStreams += outputStream
    }
  }
3. Step into setGraph, defined in ForEachDStream's parent class DStream. It traverses all the parent DStreams that ForEachDStream depends on, setting the current DStreamGraph on every one of those DStream instances:
private[streaming] def setGraph(g: DStreamGraph) {
  if (graph != null && graph != g) {
    throw new SparkException("Graph is already set in " + this + ", cannot set it again")
  }
  graph = g
  // addInputStream also calls setGraph; since FileInputDStream has no parent
  // DStream, it only needs to set the graph on itself
  dependencies.foreach(_.setGraph(graph))
}
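The recursive propagation can be sketched with a toy graph (illustrative names, not Spark classes): setting the graph on the terminal output node walks dependencies and stamps every upstream node, so a single addOutputStream call wires the whole DAG.

```scala
// Toy node mirroring DStream.setGraph: assign, then recurse into parents.
class ToyNode(val dependencies: List[ToyNode]) {
  var graph: AnyRef = null
  def setGraph(g: AnyRef): Unit = {
    require(graph == null || (graph eq g), s"Graph is already set in $this")
    graph = g
    dependencies.foreach(_.setGraph(g)) // recursion ends at source nodes
  }
}

object SetGraphDemo {
  def main(args: Array[String]): Unit = {
    // Chain mirroring the example: source <- flatMap <- map <- output
    val source = new ToyNode(List())
    val flat   = new ToyNode(List(source))
    val mapped = new ToyNode(List(flat))
    val output = new ToyNode(List(mapped))

    val g = new Object
    output.setGraph(g) // one call at the terminal node...
    // ...reaches every upstream node:
    println(List(source, flat, mapped, output).forall(_.graph eq g)) // true
  }
}
```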
A. The graph member of every DStream instance gets assigned. Combined with each DStream's dependencies member, this leads to a conclusion: the DStream DAG follows the same design as the RDD DAG.
private[streaming] var graph: DStreamGraph = null
B. In the current example, the parent DStreams of ForEachDStream are, in order:
ShuffledDStream
MappedDStream
FlatMappedDStream
FileInputDStream
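The parent chain above falls out of the operator calls themselves: each operator constructs a new DStream passing `this` as the parent. A toy sketch (illustrative node classes, not the real Spark ones):

```scala
// Minimal nodes whose constructors record the parent, mirroring how
// flatMap / map / reduceByKey / print each build a DStream on top of `this`.
abstract class Node(val name: String, val parent: Option[Node]) {
  // Walk from this node back to the source.
  def lineage: List[String] = name :: parent.map(_.lineage).getOrElse(Nil)
}
class FileInput           extends Node("FileInputDStream", None)
class FlatMapped(p: Node) extends Node("FlatMappedDStream", Some(p))
class Mapped(p: Node)     extends Node("MappedDStream", Some(p))
class Shuffled(p: Node)   extends Node("ShuffledDStream", Some(p))
class ForEach(p: Node)    extends Node("ForEachDStream", Some(p))

object LineageDemo {
  def main(args: Array[String]): Unit = {
    // textFileStream -> flatMap -> map -> reduceByKey -> print, as in HdfsWordCount
    val chain =
      new ForEach(new Shuffled(new Mapped(new FlatMapped(new FileInput))))
    println(chain.lineage.mkString(" <- "))
    // ForEachDStream <- ShuffledDStream <- MappedDStream <- FlatMappedDStream <- FileInputDStream
  }
}
```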
Next up: a full walkthrough of HdfsWordCount from DStream to RDD.