Spark DStream Window Computation

Basic Concepts

Spark Streaming supports computing over the data that falls within a given time window.
[Figure: micro-batches grouped into window batches]
The figure above depicts a window whose length is 3 micro-batches and whose slide interval is 2 micro-batches. The micro-batches that fall into the same window are merged into one relatively larger batch, the window batch.

Spark requires that every window length and slide interval be an integer multiple of the micro-batch interval.

  1. Sliding window: window length > slide interval; adjacent windows share overlapping elements.
  2. Tumbling window: window length = slide interval; adjacent windows share no elements (both shapes are illustrated in the sketch after this list).
    Note: a window with window length < slide interval is not provided, since it would cause data loss.
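
A minimal sketch of the two window shapes, assuming a 1-second micro-batch interval and a socket source (the host, port, and object name are placeholders); the only difference between the two pipelines is whether the two durations are equal:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowShapes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowShapes")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source

    lines.window(Seconds(4), Seconds(2)).count().print()  // sliding: 4s length > 2s slide, windows overlap
    lines.window(Seconds(4), Seconds(4)).count().print()  // tumbling: 4s length = 4s slide, no overlap

    ssc.start()
    ssc.awaitTermination()
  }
}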

Time Semantics for Window Computation

Event Time (when the event occurred) < Ingestion Time (when the data enters the system) < Processing Time (when the data is processed)
Spark DStream currently supports only Processing Time, whereas Spark Structured Streaming also supports Event Time.

Window Operators

| Transformation | Meaning |
| --- | --- |
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
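
A hedged sketch of countByWindow and reduceByWindow on a word stream (window and the reduceByKeyAndWindow variants are covered in the examples below); the host, port, checkpoint path, and object name are placeholders, and the 1-second batch interval is an assumption:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowOperatorsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowOperatorsSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("file:///tmp/checkpoints")  // countByWindow is built on an incremental (inverse) reduce, so it needs a checkpoint directory

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))

    // number of words received in the last 4 seconds, updated every 2 seconds
    words.countByWindow(Seconds(4), Seconds(2)).print()

    // longest word received in the last 4 seconds, updated every 2 seconds
    words.reduceByWindow((a, b) => if (a.length >= b.length) a else b, Seconds(4), Seconds(2)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}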

window(windowLength,slideInterval)

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  * @type: window computation example
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf,Milliseconds(100))
    scc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)

      .flatMap(_.split("\\s+"))
      .map((_,1))
      .window(Seconds(4),Seconds(2))  // window length > slide interval (sliding window)
      .reduceByKey(_+_)
      .print()


    scc.start()
    scc.awaitTermination()
  }
}

After window, you can apply further operators such as count, reduce, reduceByKey, and countByValue. For convenience, Spark provides combined operators, for example:

window + count is equivalent to countByWindow(windowLength, slideInterval), and window + reduceByKey is equivalent to reduceByKeyAndWindow(func, windowLength, slideInterval).
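
A minimal sketch of the second equivalence (the host, port, 1-second batch interval, and object name are assumptions); both pipelines produce the same windowed word counts:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowEquivalence {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowEquivalence")
    val ssc = new StreamingContext(conf, Seconds(1))

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map((_, 1))

    // window + reduceByKey ...
    pairs.window(Seconds(4), Seconds(2)).reduceByKey(_ + _).print()

    // ... is equivalent to the combined operator reduceByKeyAndWindow
    pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(4), Seconds(2)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}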

reduceByKeyAndWindow

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  * @type: window computation example
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf,Milliseconds(100))
    scc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)

      .flatMap(_.split("\\s+"))
      .map((_,1))
      .reduceByKeyAndWindow((v1:Int,v2:Int)=>v1+v2,Seconds(4),Seconds(3))
      .print()
    scc.start()
    scc.awaitTermination()
  }
}

When successive windows overlap by more than half, the window result can be computed more efficiently in the incremental way shown below.

val sparkConf=new SparkConf().setAppName("WondowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf,Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoints") // checkpointing is required for the incremental (invertible) reduceByKeyAndWindow below
ssc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)
.flatMap(_.split("\\s+"))
.map((_,1))
.reduceByKeyAndWindow( // when successive windows overlap by more than 50%, this incremental form is more efficient
    (v1:Int,v2:Int)=>v1+v2, // add the elements entering the window to the previous window's result
    (v1:Int,v2:Int)=>v1-v2, // subtract the elements leaving the window
    Seconds(4),
    Seconds(3),
    filterFunc = (t)=> t._2 > 0 // drop keys whose count has fallen to zero
)
.print()

ssc.start()
ssc.awaitTermination()

DStream Output

| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
| foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
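
The snippet below combines reduceByKeyAndWindow with foreachRDD to push windowed word counts to Kafka; inside foreachRDD it iterates with foreachPartition, so the records of each partition are sent from the executor that holds them, using the KafkaSink helper defined after the snippet.
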
val sparkConf=new SparkConf().setAppName("WondowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf,Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)
.flatMap(_.split("\\s+"))
.map((_,1))
.reduceByKeyAndWindow(
    (v1:Int,v2:Int)=>v1+v2, // sum the count of each key over the window
    Seconds(60),
    Seconds(1)
)
.filter(t=> t._2 > 10) // keep only words that appeared more than 10 times within the window
.foreachRDD(rdd=>{
    rdd.foreachPartition(vs=>{ // runs on the executors, one iterator per RDD partition
        vs.foreach(v=>KafkaSink.send2Kafka(v._1,v._2.toString))
    })
})

ssc.start()
ssc.awaitTermination()
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink {

  private def createKafkaProducer(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer])
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer])
    props.put(ProducerConfig.BATCH_SIZE_CONFIG,"10")
    props.put(ProducerConfig.LINGER_MS_CONFIG,"1000")
    new KafkaProducer[String,String](props)
  }
  val kafkaProducer:KafkaProducer[String,String]=createKafkaProducer()

  def send2Kafka(k:String,v:String): Unit ={
    val message = new ProducerRecord[String,String]("topic01",k,v)
    kafkaProducer.send(message)
  }
  // flush and close the shared producer when the JVM shuts down
  Runtime.getRuntime.addShutdownHook(new Thread(){
    override def run(): Unit = {
      kafkaProducer.flush()
      kafkaProducer.close()
    }
  })
}
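
A note on the design of KafkaSink: the object holds a single producer per JVM, and because it is only referenced inside foreachPartition, each executor initializes its own instance rather than receiving one serialized from the driver; the shutdown hook flushes and closes the producer so that records still buffered by the 1000 ms linger setting are delivered before the process exits.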

Note: by default, Spark emits the final result of a window only after the window's time has ended; this output style is usually referred to as clamped output.
