Trying out stateful transformations in Spark Streaming: updateStateByKey and mapWithState

Streaming wordCount example

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object StreamWordCount {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lineStreams = ssc.socketTextStream("localhost", 9999)

    val wordStreams = lineStreams.flatMap(_.split(" "))
    val wordAndOneStreams = wordStreams.map((_, 1))
    val wordAndCountStreams = wordAndOneStreams.reduceByKey(_+_)

    wordAndCountStreams.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

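The approach above only counts the words in the current batch.

In practice, though, you usually need a running total that accumulates across batches.

In Spark Streaming, DStream operations are divided into stateless transformations and stateful transformations.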

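Stateless transformations

A stateless transformation applies a simple RDD transformation to each batch, that is, it transforms every RDD in the DStream independently.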
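To make "stateless" concrete, here is a minimal sketch in plain Scala with no Spark involved. It models a DStream as a sequence of batches; `batches` and `countBatch` are names invented for this illustration, not part of any Spark API. The point is only that each batch is counted on its own and nothing is carried over to the next one.

```scala
object StatelessSketch {
  // Hypothetical stand-in for a DStream: each inner Seq is one batch of lines.
  val batches: Seq[Seq[String]] = Seq(
    Seq("a b a"),
    Seq("b c")
  )

  // Word count for a single batch, using only that batch's data.
  def countBatch(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Stateless: the same function is applied to every batch independently.
  val perBatch: Seq[Map[String, Int]] = batches.map(countBatch)

  def main(args: Array[String]): Unit =
    perBatch.foreach(println)
}
```

The word "b" appears in both batches, yet each result reports it with a count of 1, because no batch sees the counts of the one before it.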

The operators are essentially the same as those in Spark Core; the common ones are listed below.

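Transformation	Meaning
map(func)	Passes each element of the source DStream through the function func and returns a new DStream.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)	Returns a new DStream containing only the items of the source DStream for which func returns true.
repartition(numPartitions)	Changes the level of parallelism of this DStream by creating more or fewer partitions.
union(otherStream)	Returns a new DStream containing all elements of the source DStream and otherStream.
count()	Returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)	Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream with the function func. The function should be associative and commutative so that it can be computed in parallel.
countByValue()	When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the value of each key is its number of occurrences in each RDD of the source DStream.
reduceByKey(func, [numTasks])	When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated with the given reduce function. Note: by default this operator uses Spark's default number of parallel tasks for the grouping. You can pass numTasks to set a different number of tasks.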

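Stateful transformations

The updateStateByKey operation

Below is a translation of the description in the official documentation:

The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps.

Define the state - the state can be of any data type.
Define the state update function - a function that specifies how to update the state using the previous state and the new values from the input stream.

In every batch, Spark applies the state update function to all existing keys, regardless of whether the batch contains new data for them. If the update function returns None, the key-value pair is removed.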
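The two rules in the last paragraph (the function runs for every known key, and returning None drops the key) can be imitated in plain Scala. This is only a sketch of the semantics and not Spark's implementation; `step`, `runningCount` and `dropIdle` are hypothetical names, and `dropIdle` is a made-up policy chosen purely to show a key being removed.

```scala
object UpdateStateSketch {
  // Same shape as the function passed to updateStateByKey:
  // (new values for the key in this batch, previous state) => new state.
  type Update = (Seq[Int], Option[Int]) => Option[Int]

  // One batch step: call `update` for every key that has old state or new values.
  def step(state: Map[String, Int],
           batch: Seq[(String, Int)],
           update: Update): Map[String, Int] = {
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val keys = state.keySet ++ newValues.keySet
    keys.toSeq.flatMap { key =>
      // A None result yields no entry, so the key disappears from the state.
      update(newValues.getOrElse(key, Seq.empty), state.get(key)).map(key -> _)
    }.toMap
  }

  // Running total, as in a cumulative word count.
  val runningCount: Update =
    (values, old) => Some(values.sum + old.getOrElse(0))

  // Hypothetical policy: drop a key as soon as a batch brings no new values for it.
  val dropIdle: Update =
    (values, old) => if (values.isEmpty) None else Some(values.sum + old.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val afterBatch1 = step(Map.empty, Seq("a" -> 1, "b" -> 1), runningCount)
    println(step(afterBatch1, Seq("a" -> 1), runningCount))
    println(step(afterBatch1, Seq("a" -> 1), dropIdle))
  }
}
```

With `runningCount`, the key "b" keeps its old value even though the second batch never mentions it; with `dropIdle`, the same call removes it.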

Rewriting wordCount with updateStateByKey

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // State update function: `values` holds this batch's counts for a word, `state` holds the count from earlier batches
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://hadoop:9000/chk")

    val lines = ssc.socketTextStream("hadoop", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    val stateDstream = pairs.updateStateByKey[Int](updateFunc)
    stateDstream.print()

    //val wordCounts = pairs.reduceByKey(_ + _)
    //wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }

}

The mapWithState operation

Conceptually it is much like updateStateByKey.
Compared with a plain map operation, it additionally maintains historical state.


import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object WordCountWithState {

  // Custom mapping function: adds up the occurrences of a word and updates the state
  val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
    val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
    // The historical state must be updated first; the accumulated result is then returned
    state.update(sum)
    (word, sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc = new StreamingContext(sparkconf,Seconds(5))
    ssc.checkpoint("data/chk")

    val socketDStream = ssc.socketTextStream("localhost", 8888)
    val wordPair = socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    wordPair.mapWithState(StateSpec.function(mappingFunc))
      .foreachRDD { rdd =>
        rdd.foreach(println(_))
      }


    ssc.start()
    ssc.awaitTermination()
  }
}

Window Operations

Window operators can also be used for stateful computation.

A common one is reduceByKeyAndWindow().

No Spark example is written out for it here.
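As a rough stand-in for a Spark example, the windowing idea can be sketched in plain Scala. In Spark itself the call would look roughly like `pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))`, where both durations must be multiples of the batch interval. The code below does not use Spark: `batches` and `reduceByKeyOverWindows` are names made up for this illustration, and the window length and slide are measured in batches rather than seconds. Unlike Spark, this simplified version only emits full windows.

```scala
object WindowSketch {
  // Hypothetical input: one Seq of (word, 1) pairs per batch interval.
  val batches: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1),
    Seq("a" -> 1, "b" -> 1),
    Seq("b" -> 1),
    Seq("c" -> 1)
  )

  // Sum the values per key over every window of `windowLen` consecutive batches,
  // moving forward `slide` batches at a time.
  def reduceByKeyOverWindows(windowLen: Int, slide: Int): List[Map[String, Int]] =
    batches.sliding(windowLen, slide).map { window =>
      window.flatten
        .groupBy(_._1)
        .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    }.toList

  def main(args: Array[String]): Unit =
    reduceByKeyOverWindows(windowLen = 2, slide = 1).foreach(println)
}
```

Each result covers only the batches inside its window, so older data ages out on its own. That is the main difference from the unbounded running totals kept by updateStateByKey and mapWithState.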

Summary

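1. The stateful operators used here (updateStateByKey and mapWithState) require a checkpoint directory to be set; the same applies to reduceByKeyAndWindow when it is given an inverse function.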

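2. updateStateByKey outputs the state of every key in each batch. For example, if the historical data is (("張三", 1), ("羅翔", 1))

and the input of the current batch is ("張三"),

the printed result is (("張三", 2), ("羅翔", 1)).

3. mapWithState outputs only the updated data in each batch. For example, if the historical data is (("張三", 1), ("羅翔", 1))

and the input of the current batch is ("張三"),

the printed result is (("張三", 2)).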

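4. mapWithState can set an expiration time for its state (via timeout on the StateSpec).
5. For a walkthrough of the mapWithState source code, see this blog post: https://blog.csdn.net/czmacd/article/details/54705988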
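The difference described in points 2 and 3 can be reproduced with a small plain-Scala sketch. It only imitates what each operator emits for a batch and is not how Spark implements them; `updateStateOutput` and `mapWithStateOutput` are hypothetical helper names.

```scala
object OutputContrastSketch {
  // Emits the full state after the batch, like the stream returned by updateStateByKey.
  def updateStateOutput(state: Map[String, Int], batch: Seq[String]): Map[String, Int] =
    batch.foldLeft(state) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0) + 1)
    }

  // Emits one record per input record, like the stream returned by mapWithState,
  // and also returns the updated state.
  def mapWithStateOutput(state: Map[String, Int],
                         batch: Seq[String]): (Seq[(String, Int)], Map[String, Int]) =
    batch.foldLeft((Seq.empty[(String, Int)], state)) { case ((out, acc), word) =>
      val sum = acc.getOrElse(word, 0) + 1
      (out :+ (word -> sum), acc.updated(word, sum))
    }

  def main(args: Array[String]): Unit = {
    val history = Map("張三" -> 1, "羅翔" -> 1)
    println(updateStateOutput(history, Seq("張三")))
    println(mapWithStateOutput(history, Seq("張三"))._1)
  }
}
```

Both helpers end up with the same internal state. Only the emitted output differs: the first returns every key, the second returns just the records touched by the current batch.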
