Spark DStream Window Computation

Basic Concepts

Spark Streaming supports computing over the data that falls within a given time window.
[Figure: micro-batches grouped into window batches]
The figure above depicts a window whose length is 3 micro-batches and whose slide interval is 2 micro-batches. The micro-batches that fall into the same window are merged into one relatively larger batch, the window batch.

Spark requires that every window length and slide interval be an integer multiple of the micro-batch interval.

  1. Sliding window: window length > slide interval; adjacent windows share overlapping elements.
  2. Tumbling window: window length = slide interval; adjacent windows share no elements (both shapes are illustrated in the sketch after this list).
    Note: a window with window length < slide interval is not provided, since it would cause data loss.
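
A minimal sketch of the two window shapes, assuming a 1-second micro-batch interval and a socket source (the host, port, and object name are placeholders); the only difference between the two pipelines is whether the two durations are equal:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowShapes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowShapes")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source

    lines.window(Seconds(4), Seconds(2)).count().print()  // sliding: 4s length > 2s slide, windows overlap
    lines.window(Seconds(4), Seconds(4)).count().print()  // tumbling: 4s length = 4s slide, no overlap

    ssc.start()
    ssc.awaitTermination()
  }
}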

Time Semantics for Window Computation

Event Time (when the event occurred) < Ingestion Time (when the data enters the system) < Processing Time (when the data is processed)
Spark DStream currently supports only Processing Time, whereas Spark Structured Streaming also supports Event Time.

Window Operators

| Transformation | Meaning |
| --- | --- |
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
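
A hedged sketch of countByWindow and reduceByWindow on a word stream (window and the reduceByKeyAndWindow variants are covered in the examples below); the host, port, checkpoint path, and object name are placeholders, and the 1-second batch interval is an assumption:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowOperatorsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowOperatorsSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("file:///tmp/checkpoints")  // countByWindow is built on an incremental (inverse) reduce, so it needs a checkpoint directory

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))

    // number of words received in the last 4 seconds, updated every 2 seconds
    words.countByWindow(Seconds(4), Seconds(2)).print()

    // longest word received in the last 4 seconds, updated every 2 seconds
    words.reduceByWindow((a, b) => if (a.length >= b.length) a else b, Seconds(4), Seconds(2)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}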

window(windowLength,slideInterval)

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  * @type: window computation example
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf,Milliseconds(100))
    scc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)

      .flatMap(_.split("\\s+"))
      .map((_,1))
      .window(Seconds(4),Seconds(2))  // window length > slide interval (sliding window)
      .reduceByKey(_+_)
      .print()


    scc.start()
    scc.awaitTermination()
  }
}

After window, you can apply further operators such as count, reduce, reduceByKey, and countByValue. For convenience, Spark provides combined operators, for example:

window + count is equivalent to countByWindow(windowLength, slideInterval), and window + reduceByKey is equivalent to reduceByKeyAndWindow(func, windowLength, slideInterval).
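
A minimal sketch of the second equivalence (the host, port, 1-second batch interval, and object name are assumptions); both pipelines produce the same windowed word counts:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowEquivalence {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowEquivalence")
    val ssc = new StreamingContext(conf, Seconds(1))

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map((_, 1))

    // window + reduceByKey ...
    pairs.window(Seconds(4), Seconds(2)).reduceByKey(_ + _).print()

    // ... is equivalent to the combined operator reduceByKeyAndWindow
    pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(4), Seconds(2)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}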

reduceByKeyAndWindow

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  * @type: window computation example
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf,Milliseconds(100))
    scc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)

      .flatMap(_.split("\\s+"))
      .map((_,1))
      .reduceByKeyAndWindow((v1:Int,v2:Int)=>v1+v2,Seconds(4),Seconds(3))
      .print()
    scc.start()
    scc.awaitTermination()
  }
}

When successive windows overlap by more than half, the window result can be computed more efficiently in the incremental way shown below.

val sparkConf=new SparkConf().setAppName("WondowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf,Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoints") // checkpointing is required for the incremental (invertible) reduceByKeyAndWindow below
ssc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)
.flatMap(_.split("\\s+"))
.map((_,1))
.reduceByKeyAndWindow( // when successive windows overlap by more than 50%, this incremental form is more efficient
    (v1:Int,v2:Int)=>v1+v2, // add the elements entering the window to the previous window's result
    (v1:Int,v2:Int)=>v1-v2, // subtract the elements leaving the window
    Seconds(4),
    Seconds(3),
    filterFunc = (t)=> t._2 > 0 // drop keys whose count has fallen to zero
)
.print()

ssc.start()
ssc.awaitTermination()

DStream Output

| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
| foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
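
The snippet below combines reduceByKeyAndWindow with foreachRDD to push windowed word counts to Kafka; inside foreachRDD it iterates with foreachPartition, so the records of each partition are sent from the executor that holds them, using the KafkaSink helper defined after the snippet.
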
val sparkConf=new SparkConf().setAppName("WondowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf,Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS",9999,StorageLevel.MEMORY_AND_DISK)
.flatMap(_.split("\\s+"))
.map((_,1))
.reduceByKeyAndWindow(
    (v1:Int,v2:Int)=>v1+v2, // sum the count of each key over the window
    Seconds(60),
    Seconds(1)
)
.filter(t=> t._2 > 10) // keep only words that appeared more than 10 times within the window
.foreachRDD(rdd=>{
    rdd.foreachPartition(vs=>{ // runs on the executors, one iterator per RDD partition
        vs.foreach(v=>KafkaSink.send2Kafka(v._1,v._2.toString))
    })
})

ssc.start()
ssc.awaitTermination()
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink {

  private def createKafkaProducer(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer])
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer])
    props.put(ProducerConfig.BATCH_SIZE_CONFIG,"10")
    props.put(ProducerConfig.LINGER_MS_CONFIG,"1000")
    new KafkaProducer[String,String](props)
  }
  val kafkaProducer:KafkaProducer[String,String]=createKafkaProducer()

  def send2Kafka(k:String,v:String): Unit ={
    val message = new ProducerRecord[String,String]("topic01",k,v)
    kafkaProducer.send(message)
  }
  // flush and close the shared producer when the JVM shuts down
  Runtime.getRuntime.addShutdownHook(new Thread(){
    override def run(): Unit = {
      kafkaProducer.flush()
      kafkaProducer.close()
    }
  })
}
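
A note on the design of KafkaSink: the object holds a single producer per JVM, and because it is only referenced inside foreachPartition, each executor initializes its own instance rather than receiving one serialized from the driver; the shutdown hook flushes and closes the producer so that records still buffered by the 1000 ms linger setting are delivered before the process exits.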

Note: by default, Spark emits the final result of a window only after the window's time has ended; this output style is usually referred to as clamped output.
