Preface
Spark Streaming operations can be divided into stateful and stateless.
Stateless operations are isolated from one batch to the next; every batch starts from scratch.
Stateful operations provide a channel between batches, and what that channel carries is the historical state.
Common stateful operators include updateStateByKey, mapWithState, and the window functions.
updateStateByKey and mapWithState are quite similar. The difference is that updateStateByKey runs the update logic for every key with state in each batch, whether or not the batch contains new data for that key, while mapWithState is only triggered for keys that actually appear in the batch.
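For contrast, here is a minimal mapWithState sketch of the same kind of word count. This is only a sketch: wordPairDStream is an assumed name standing for any DStream[(String, Int)] such as the (word, 1) pairs built in the demos below. Its mapping function is invoked only for keys that arrive in the current batch:

import org.apache.spark.streaming.{State, StateSpec}

val spec = StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
  // Invoked only for keys that appear in the current batch.
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum) // write the new running total back into the state store
  (word, sum)
})
// wordPairDStream is assumed to be a DStream[(String, Int)].
wordPairDStream.mapWithState(spec).print()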
Let's walk through the main ways of using updateStateByKey:
1. The basic form
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
A word-count demonstration:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// The later demos assume the same imports.

object UpdateStateByKeyDemo1 {
  // Merge this batch's values for a key with the key's previous state.
  def updateFunc(one: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = one.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    // Stateful operators require a checkpoint directory.
    ssc.checkpoint("file:///E:/chk")
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
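As a hypothetical run: if "hello world" is typed into the socket during the first batch and "hello" during the second, the printed state would look roughly like (hello,1) (world,1) for the first batch and (hello,2) (world,1) for the second. Note that world is emitted again even though it received no new data, which is exactly the updateStateByKey behaviour described above.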
2. updateStateByKey with a configurable number of partitions
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param numPartitions Number of partitions of each RDD in the new DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
Demonstration:
object updateStateByKeyDemo2 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // The second argument fixes the number of partitions of the state RDDs (here 1).
      .updateStateByKey(updateFunc, 1)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
3. updateStateByKey with a custom partitioner
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
Demonstration:
object updateStateByKeyDemo3 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // Partition the state RDDs with our own partitioner instead of the default HashPartitioner.
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4))
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

// A simple hash partitioner; requires `import org.apache.spark.Partitioner`.
class MyPartitionerDemo(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    // hashCode can be negative, so shift negative results back into [0, numPartitions).
    if (code < 0) code + numPartitions else code
  }
}
4. updateStateByKey where you control whether the partitioner is remembered
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
  val cleanedFunc = ssc.sc.clean(updateFunc)
  val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
    cleanedFunc(it)
  }
  new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
}
Note that this overload expects the update function to take an iterator: the function is called once per partition with all of that partition's (key, values, previous state) triples, whereas the earlier overloads invoke the update function once per key.
object updateStateByKeyDemo4 {
  // Per-key logic, applied to each (key, values, state) triple drawn from the iterator.
  def MyFunction(key: String, value: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = value.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    // This overload expects an iterator-based update function: one call per partition.
    val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(it => MyFunction(it._1, it._2, it._3).map(s => (it._1, s)))
    }
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // `false` means the generated state RDDs do not remember the partitioner.
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
5. updateStateByKey with an initial state
/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* In every batch the updateFunc will be called for each state even if there are no new values.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @param initialRDD initial state value of each key.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
Code demonstration:
object updateStateByKeyDemo5 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")
    // The four-argument overload used below takes the iterator-based update function,
    // not the simple (Seq[V], Option[S]) => Option[S] form.
    val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
      iter.flatMap { case (key, values, state) =>
        val sum = values.sum + state.getOrElse(0)
        Some((key, sum))
      }
    }
    // Seed the state: the key "hello" starts counting from 10.
    val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 10)))
    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false, initialRDD)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
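With the initial RDD above, the state for "hello" starts at 10. So if, hypothetically, a single "hello" arrives in a batch, the job prints (hello,11) rather than (hello,1); keys not present in the initial RDD start from an empty state as before.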
Summary
Pick the updateStateByKey overload that fits the actual need: the basic form for simple per-key state, the numPartitions or custom Partitioner variants to control how the state RDDs are partitioned, the iterator-based form when you process per partition and decide whether the partitioner is remembered, and the initialRDD variant when the state must be seeded with existing values.