[Spark] Spark Streaming (2): DStream Transformation Operations

Main topics of this section

Part of this section is drawn from the official documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html

DStream Transformation Operations

1. Transformation Operations

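Transformation	Meaning
map(func)	Applies func to each element of the source DStream and returns a new DStream.
flatMap(func)	Similar to map, but each input item can be mapped to zero or more output items.
filter(func)	Returns a new DStream containing only the elements of the source DStream for which func returns true.
repartition(numPartitions)	Changes the level of parallelism of the DStream by increasing or decreasing its number of partitions.
union(otherStream)	Returns a new DStream containing the union of the elements of the source DStream and otherStream.
count()	Counts the elements in each RDD of the source DStream and returns a new DStream of single-element RDDs.
reduce(func)	Aggregates the elements in each RDD of the source DStream using func and returns a new DStream of single-element RDDs.
countByValue()	For a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the Long is the number of times each key occurs in each RDD of the source DStream.
reduceByKey(func, [numTasks])	For a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated using func.
join(otherStream, [numTasks])	For two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs.
cogroup(otherStream, [numTasks])	For two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func)	Applies an RDD-to-RDD function to every RDD of the source DStream and returns a new DStream; this allows arbitrary RDD operations on the DStream.
updateStateByKey(func)	Returns a new "state" DStream in which the state of each key is updated by applying func to the key's previous state and its new values.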
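
The difference between join and cogroup in the table is easy to miss. The following is a minimal sketch in plain Scala collections (no Spark required) of what the two operations produce for the pairs of a single batch. The object name PairSemantics and the sample data are hypothetical and only illustrate the semantics; they are not Spark's implementation.

```scala
// Plain-Scala sketch of join and cogroup on the pairs of one batch.
// This models the semantics only; it does not use or mirror Spark internals.
object PairSemantics {
  // Inner join: one output pair for every matching (v, w) combination of a key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  // Cogroup: every key from either side, with all of its values from each side.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val vs = left.filter(_._1 == k).map(_._2)
      val ws = right.filter(_._1 == k).map(_._2)
      (k, (vs, ws))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data for illustration
    val counts = Seq(("A", 1), ("B", 2), ("A", 3))
    val labels = Seq(("A", "x"), ("C", "y"))
    println(join(counts, labels))    // only key A appears on both sides
    println(cogroup(counts, labels)) // keys A, B and C all appear
  }
}
```

Keys that exist on only one side are dropped by join but kept by cogroup, where the missing side is an empty sequence.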

A concrete example:

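    // Read new files from a local directory (~/streaming in this example)
    val lines = ssc.textFileStream(args(0))
    val words = lines.flatMap(_.split(" "))
    val wordMap = words.map(x => (x, 1))
    val wordCounts = wordMap.reduceByKey(_ + _)
    val filteredWordCounts = wordCounts.filter(_._2 > 1)
    val numOfCount = filteredWordCounts.count()
    val countByValue = words.countByValue()
    // The original listing used an undefined stream named word1 here; unioning
    // words with itself is an assumption that matches the union output below
    val union = words.union(words)
    val transform = words.transform(rdd => rdd.map(x => (x, 1)))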
    // Print the original file contents
    lines.print()
    // Print the flatMap result
    words.print()
    // Print the map result
    wordMap.print()
    // Print the reduceByKey result
    wordCounts.print()
    // Print the filter result
    filteredWordCounts.print()
    // Print the count result
    numOfCount.print()
    // Print the countByValue result
    countByValue.print()
    // Print the union result
    union.print()
    // Print the transform result
    transform.print()

The following command adds file content while the job is running:

root@sparkmaster:~/streaming# echo "A B C D" >> test12.txt; echo "A B" >> test12.txt

The results of each of the operations above are as follows:

-------------------------------------------
lines.print()
-------------------------------------------
A B C D
A B

-------------------------------------------
flatMap result
-------------------------------------------
A
B
C
D
A
B

-------------------------------------------
map result
-------------------------------------------
(A,1)
(B,1)
(C,1)
(D,1)
(A,1)
(B,1)

-------------------------------------------
reduceByKey result
-------------------------------------------
(B,2)
(D,1)
(A,2)
(C,1)


-------------------------------------------
filter result
-------------------------------------------
(B,2)
(A,2)

-------------------------------------------
count result
-------------------------------------------
2

-------------------------------------------
countByValue result
-------------------------------------------
(B,2)
(D,1)
(A,2)
(C,1)

-------------------------------------------
union result
-------------------------------------------
A
B
C
D
A
B
A
B
C
D
...

-------------------------------------------
transform result
-------------------------------------------
(A,1)
(B,1)
(C,1)
(D,1)
(A,1)
(B,1)
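
To see where each of these results comes from, the same batch can be traced with ordinary Scala collections. This is only a sketch: it models a single batch as a Seq instead of an RDD, and the object name BatchWalkthrough is hypothetical. The order of entries in the Map values is not guaranteed, just as the order of the reduceByKey output above is not.

```scala
// Plain-Scala trace of one batch containing the two lines written to test12.txt.
// Each value mirrors one DStream in the example, modelled as a local collection.
object BatchWalkthrough {
  val lines: Seq[String] = Seq("A B C D", "A B")

  // flatMap: split every line into words
  val words: Seq[String] = lines.flatMap(_.split(" "))

  // map: pair every word with the count 1
  val wordMap: Seq[(String, Int)] = words.map(w => (w, 1))

  // reduceByKey(_ + _): sum the counts of each word
  val wordCounts: Map[String, Int] =
    wordMap.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2).sum) }

  // filter(_._2 > 1): keep only words that occur more than once
  val filteredWordCounts: Map[String, Int] = wordCounts.filter(_._2 > 1)

  // count(): number of elements left after the filter
  val numOfCount: Long = filteredWordCounts.size.toLong

  // countByValue(): occurrences of each distinct word
  val countByValue: Map[String, Long] =
    words.groupBy(identity).map { case (w, ws) => (w, ws.size.toLong) }

  def main(args: Array[String]): Unit = {
    println(words)
    println(wordCounts)
    println(filteredWordCounts)
    println(numOfCount)
    println(countByValue)
  }
}
```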

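Example 2:
The WordCount code demonstrated in the previous lesson only counted the words of each batch separately and did not keep the state of earlier counts. To keep a running count across batches, use the updateStateByKey method. The code below demonstrates how to use it: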

import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming._

object StatefulNetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Function literal: adds the values of the current batch to the previous state
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum

      val previousCount = state.getOrElse(0)

      Some(currentCount + previousCount)
    }

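    // Input is an iterator of (K, Seq[V], Option[S]); output is an iterator of (K, S)
    // V is the type of the values to be summed, S is the type of the previous state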
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[4]")

    // Process one batch per second
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // Use the current directory as the checkpoint directory; the use of
    // checkpoints in Spark Streaming is covered later
    ssc.checkpoint(".")

    // Initial state RDD
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))


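    // Use a socket as the input source; in this example the host is localhost and the port is 9999
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    // flatMap: split each line into words
    val words = lines.flatMap(_.split(" "))
    // map: pair each word with the count 1
    val wordDstream = words.map(x => (x, 1))

    // updateStateByKey with the update function, a partitioner,
    // rememberPartitioner = true, and the initial state RDD
    val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)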
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
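
The update rule in the listing above can be traced without a cluster. The following sketch applies the same updateFunc to a plain Map that stands in for the state, one batch at a time. The names StateSimulation and step are hypothetical, and the model ignores partitioning and checkpointing; it only shows how the previous state and the new values of each key are combined.

```scala
// Plain-Scala simulation of the per-key state update used by updateStateByKey.
// A Map stands in for the state DStream; each call to step models one batch.
object StateSimulation {
  // Same rule as updateFunc in the listing: new state = sum of new values + old state
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Apply one batch of words to the running state, key by key.
  def step(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    // New values per key in this batch: one 1 for every occurrence of the word
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(identity).map { case (w, ws) => (w, ws.map(_ => 1)) }
    // Every key with either previous state or new values is updated
    val keys = state.keySet ++ newValues.keySet
    keys.flatMap { k =>
      updateFunc(newValues.getOrElse(k, Seq.empty[Int]), state.get(k)).map(s => (k, s))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val initial = Map("hello" -> 1, "world" -> 1) // mirrors initialRDD
    val afterHello = step(initial, Seq("hello"))
    val afterWorld = step(afterHello, Seq("world"))
    println(afterHello)
    println(afterWorld)
  }
}
```

Keys that receive no new values in a batch still pass through updateFunc with an empty sequence, which is why their counts are carried forward unchanged.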

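Before any input arrives, the state contains only the entries from initialRDD, that is, (hello,1) and (world,1).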

Start a netcat server with the following command:

root@sparkmaster:~/streaming# nc -lk 9999

Then type:

root@sparkmaster:~/streaming# nc -lk 9999
hello

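Given the update function, the new occurrence of hello is added to its initial count of 1, so the state becomes (hello,2) and (world,1).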

Then type world as well:

root@sparkmaster:~/streaming# nc -lk 9999
hello
world

The count for world is then updated in the same way, so the state becomes (hello,2) and (world,2).
