Preface
Spark Streaming operations can be divided into stateful and stateless.
Stateless operations are isolated from one batch to the next; every batch starts from scratch.
Stateful operations provide a channel between batches, and what that channel carries is the historical state.
Common stateful operators include updateStateByKey, mapWithState, and the window functions.
updateStateByKey and mapWithState are quite similar. The difference is that updateStateByKey runs the update logic for every key with state in each batch, whether or not the batch contains new data for that key, while mapWithState is only triggered for keys that actually appear in the batch.
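For contrast, here is a minimal mapWithState sketch of the same kind of word count. This is only a sketch: wordPairDStream is an assumed name standing for any DStream[(String, Int)] such as the (word, 1) pairs built in the demos below. Its mapping function is invoked only for keys that arrive in the current batch:

import org.apache.spark.streaming.{State, StateSpec}

val spec = StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
  // Invoked only for keys that appear in the current batch.
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum) // write the new running total back into the state store
  (word, sum)
})
// wordPairDStream is assumed to be a DStream[(String, Int)].
wordPairDStream.mapWithState(spec).print()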
Let's walk through the main ways of using updateStateByKey:
1. The basic form
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
A word-count demonstration:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// The later demos assume the same imports.

object UpdateStateByKeyDemo1 {
  // Merge this batch's values for a key with the key's previous state.
  def updateFunc(one: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = one.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    // Stateful operators require a checkpoint directory.
    ssc.checkpoint("file:///E:/chk")
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
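As a hypothetical run: if "hello world" is typed into the socket during the first batch and "hello" during the second, the printed state would look roughly like (hello,1) (world,1) for the first batch and (hello,2) (world,1) for the second. Note that world is emitted again even though it received no new data, which is exactly the updateStateByKey behaviour described above.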
2. updateStateByKey with a configurable number of partitions
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param numPartitions Number of partitions of each RDD in the new DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
Demonstration:
object updateStateByKeyDemo2 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // The second argument fixes the number of partitions of the state RDDs (here 1).
      .updateStateByKey(updateFunc, 1)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
3. updateStateByKey with a custom partitioner
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
Demonstration:
object updateStateByKeyDemo3 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // Partition the state RDDs with our own partitioner instead of the default HashPartitioner.
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4))
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

// A simple hash partitioner; requires `import org.apache.spark.Partitioner`.
class MyPartitionerDemo(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    // hashCode can be negative, so shift negative results back into [0, numPartitions).
    if (code < 0) code + numPartitions else code
  }
}
4. updateStateByKey where you control whether the partitioner is remembered
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
  val cleanedFunc = ssc.sc.clean(updateFunc)
  val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
    cleanedFunc(it)
  }
  new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
}
Note that this overload expects the update function to take an iterator: the function is called once per partition with all of that partition's (key, values, previous state) triples, whereas the earlier overloads invoke the update function once per key.
object updateStateByKeyDemo4 {
  // Per-key logic, applied to each (key, values, state) triple drawn from the iterator.
  def MyFunction(key: String, value: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = value.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    // This overload expects an iterator-based update function: one call per partition.
    val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(it => MyFunction(it._1, it._2, it._3).map(s => (it._1, s)))
    }
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // `false` means the generated state RDDs do not remember the partitioner.
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
5. updateStateByKey with an initial state
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @param initialRDD initial state value of each key.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
Code demonstration:
object updateStateByKeyDemo5 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    // The four-argument overload used below takes the iterator-based update function,
    // not the simple (Seq[V], Option[S]) => Option[S] form.
    val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
      iter.flatMap { case (key, values, state) =>
        val sum = values.sum + state.getOrElse(0)
        Some((key, sum))
      }
    }
    // Seed the state: the key "hello" starts counting from 10.
    val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 10)))
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false, initialRDD)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
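With the initial RDD above, the state for "hello" starts at 10. So if, hypothetically, a single "hello" arrives in a batch, the job prints (hello,11) rather than (hello,1); keys not present in the initial RDD start from an empty state as before.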
Summary
Pick the updateStateByKey overload that fits the actual need: the basic form for simple per-key state, the numPartitions or custom Partitioner variants to control how the state RDDs are partitioned, the iterator-based form when you process per partition and decide whether the partitioner is remembered, and the initialRDD variant when the state must be seeded with existing values.