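The role of State
State is the state of a particular task/operator of a Flink program at a given moment; state data is the result the program has computed up to that moment. State and checkpoint are separate concepts: a checkpoint is what persists the state data. By default, checkpoints are stored in the JobManager's memory. A checkpoint is a global snapshot of a Flink job's state at a specific point in time, which makes it possible to recover the data if the job fails.
Storing State values (the checkpoint is stored on HDFS)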
env.setStateBackend(new FsStateBackend("hdfs:///user/flink/app_statistics/checkpoint"))
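A checkpoint stores the state data, and the state is restored from it on restart
// Enable checkpointing so the data can be recovered when the job fails and restarts; the default mode is CheckpointingMode.EXACTLY_ONCE
// flink-conf.yaml configures the default restart strategy: fixed-delay(4, 10s)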
env.enableCheckpointing(60000)
// Do not let a failed checkpoint cause the task to fail
env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
// Set where checkpoints are stored
env.setStateBackend(new FsStateBackend("hdfs:///user/flink/app_statistics/checkpoint"))
Using State
State -> KeyedState (the most commonly used)
KeyedState is state on a KeyedStream. It is bound to a specific key: every key on the KeyedStream has its own state. Keyed State can only be used in Rich functions applied to a KeyedStream.
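To make "every key has its own state" concrete, here is a minimal plain-Scala model of the idea. It is only a sketch and not how Flink implements keyed state: `KeyedCounter` and `KeyedCounterDemo` are hypothetical names, and the `Map` merely imitates the per-key scoping that Flink applies to a ValueState automatically.

```scala
import scala.collection.mutable

// Hypothetical stand-in for keyed state: one independent slot per key.
// Flink scopes a ValueState to the current key for you; this Map only imitates that.
class KeyedCounter[K] {
  private val state = mutable.Map.empty[K, Long]

  // Read the state for this key (0 if unset), add one, and write it back.
  def process(key: K): Long = {
    val updated = state.getOrElse(key, 0L) + 1
    state(key) = updated
    updated
  }
}

object KeyedCounterDemo {
  // Feeds a small stream of keys through the counter and returns (key, count) pairs.
  def run(): Seq[(String, Long)] = {
    val counter = new KeyedCounter[String]
    Seq("a", "b", "a", "a", "b").map(k => (k, counter.process(k)))
  }
}
```

Records with key "a" never see or modify the count kept for key "b", which is the behaviour the KeyedState examples below rely on.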
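Example 1: Flink Keyed State, estimating Pi with a Monte Carlo simulation
Override the map method
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._
import scala.util.Random
// Define a MonteCarloPoint class
case class MonteCarloPoint(x: Double, y: Double) {
  // 1 if the point falls inside the unit circle, 0 otherwise
  def pi = if (x * x + y * y <= 1) 1 else 0
}
object MonteCarlo extends App {
  // Custom Source that generates random coordinate points
  // Assumption: a minimal run/cancel implementation that emits uniform points in [0, 1) x [0, 1)
  class MonteCarloSource extends RichSourceFunction[MonteCarloPoint] {
    @volatile private var running = true
    override def run(ctx: SourceFunction.SourceContext[MonteCarloPoint]): Unit = {
      while (running) {
        ctx.collect(MonteCarloPoint(Random.nextDouble(), Random.nextDouble()))
      }
    }
    override def cancel(): Unit = running = false
  }
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  // State has to be used inside a RichFunction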
  val myMapFun = new RichMapFunction[(Long, MonteCarloPoint), (Long, Double)] {
    // Declare the state handle: (total samples, samples inside the circle)
    var countAndPi: ValueState[(Long, Long)] = _
    override def map(value: (Long, MonteCarloPoint)): (Long, Double) = {
      // Read the current state via ValueState.value
      val tmpCurrentSum = countAndPi.value
      // The state is null the first time this key is seen
      val currentSum = if (tmpCurrentSum != null) {
        tmpCurrentSum
      } else {
        (0L, 0L)
      }
      val allcount = currentSum._1 + 1
      val picount = currentSum._2 + value._2.pi
      // Compute the new state value
      val newState: (Long, Long) = (allcount, picount)
      // Update the state
      countAndPi.update(newState)
      // Emit the total sample count and the simulated estimate of Pi
      (allcount, 4.0 * picount / allcount)
    }
    override def open(parameters: Configuration): Unit = {
      countAndPi = getRuntimeContext.getState(
        new ValueStateDescriptor[(Long, Long)]("MonteCarloPi", createTypeInformation[(Long, Long)])
      )
    }
  }
  // Add the data source
  val dataStream: DataStream[MonteCarloPoint] = env.addSource(new MonteCarloSource)
  // Convert to a KeyedStream; every record gets the same key, 1L
  val keyedStream = dataStream.map((1L, _)).keyBy(_._1)
  // Apply the RichFunction defined above and print the result
  keyedStream.map(myMapFun).print()
  env.execute("Monte Carlo Test")
}
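The arithmetic behind `4.0 * picount / allcount` can be checked without Flink. The sketch below assumes the points are uniform in the unit square, so the fraction landing inside the quarter circle approaches Pi/4. `MonteCarloPiSketch`, `estimate`, and the `seed` parameter are hypothetical names introduced only for this illustration.

```scala
import scala.util.Random

object MonteCarloPiSketch {
  // Estimate Pi from `samples` uniform points in [0, 1) x [0, 1).
  // `seed` makes the run repeatable; it is not part of the Flink job above.
  def estimate(samples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    var inside = 0L
    var i = 0
    while (i < samples) {
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      // Same test as MonteCarloPoint.pi: is the point inside the unit circle?
      if (x * x + y * y <= 1) inside += 1
      i += 1
    }
    // The quarter circle covers Pi/4 of the unit square, hence the factor 4.
    4.0 * inside / samples
  }
}
```

Calling `MonteCarloPiSketch.estimate(200000, 42L)` should give a value close to 3.14. The Flink version keeps the two running counters in a ValueState instead of local variables, so they survive a restart.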
Example 2: the example from the official documentation (overriding the flatMap method)
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.util.Collector
import org.apache.flink.configuration.Configuration
class CountWindowAverage extends RichFlatMapFunction[(Long, Long), (Long, Long)] {
  private var sum: ValueState[(Long, Long)] = _
  override def flatMap(input: (Long, Long), out: Collector[(Long, Long)]): Unit = {
    // access the state value
    val tmpCurrentSum = sum.value
    // If it hasn't been used before, it will be null
    val currentSum = if (tmpCurrentSum != null) {
      tmpCurrentSum
    } else {
      (0L, 0L)
    }
    // update the count
    val newSum = (currentSum._1 + 1, currentSum._2 + input._2)
    // update the state
    sum.update(newSum)
    // if the count reaches 2, emit the average and clear the state
    if (newSum._1 >= 2) {
      out.collect((input._1, newSum._2 / newSum._1))
      sum.clear()
    }
  }
  override def open(parameters: Configuration): Unit = {
    sum = getRuntimeContext.getState(
      new ValueStateDescriptor[(Long, Long)]("average", createTypeInformation[(Long, Long)])
    )
  }
}
object ExampleCountWindowAverage extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.fromCollection(List(
    (1L, 3L),
    (1L, 5L),
    (1L, 7L),
    (1L, 4L),
    (1L, 2L)
  )).keyBy(_._1)
    .flatMap(new CountWindowAverage()).print()
  // the printed output will be (1,4) and (1,5)
  env.execute("ExampleManagedState")
}
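To see why the output is (1,4) and (1,5), the same update rule can be traced in plain Scala. This is a sketch under the assumption that all records share one key. `CountWindowAverageSketch` is a hypothetical name, and the `Option` stands in for the ValueState, with `None` playing the role of the null (cleared) state.

```scala
object CountWindowAverageSketch {
  // Returns the (key, average) pairs emitted for a single key's values, in order.
  def run(key: Long, values: Seq[Long]): Seq[(Long, Long)] = {
    var state: Option[(Long, Long)] = None // (count, sum); None mimics a null ValueState
    val out = Seq.newBuilder[(Long, Long)]
    for (v <- values) {
      val (count, sum) = state.getOrElse((0L, 0L))
      val updated = (count + 1, sum + v)
      if (updated._1 >= 2) {
        // Integer division, exactly as in the Flink example
        out += ((key, updated._2 / updated._1))
        state = None // corresponds to sum.clear()
      } else {
        state = Some(updated)
      }
    }
    out.result()
  }
}
```

For the input 3, 5, 7, 4, 2 the first pair sums to 8 and emits 8 / 2 = 4, the second sums to 11 and emits 11 / 2 = 5 under integer division, and the trailing 2 stays in state without being emitted.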
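Example 3: computing the top 3 hottest items
ProcessFunction is a low-level API provided by Flink for implementing more advanced functionality. Its main feature is timers (on either EventTime or ProcessingTime). In this example we use a timer to decide when all item click counts for a given window have arrived. Because watermark progress is global, in processElement we register a timer for windowEnd+1 each time an ItemViewCount record arrives (Flink automatically ignores repeated registrations for the same timestamp). When the windowEnd+1 timer fires, the watermark has reached windowEnd+1, which means all per-item window counts for that windowEnd have been received. In onTimer() we sort all collected items by click count, pick the top N, and format the ranking as a string for output.
Here we also use ListState<ItemViewCount> to store every ItemViewCount message received, so that the state data is neither lost nor left inconsistent when a failure occurs. ListState is a State API provided by Flink that resembles the Java List interface; it is integrated with the framework's checkpoint mechanism and provides exactly-once guarantees automatically.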
import java.sql.Timestamp
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import scala.collection.mutable.ListBuffer
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int,
                        behavior: String, timestamp: Long)
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)
object UserBehaviorAnalysis {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val value: DataStream[UserBehavior] = env.readTextFile("D:\\projects\\flinkStudy\\src\\userBehavior.csv")
      .map(line => {
        val linearray = line.split(",")
        UserBehavior(linearray(0).toLong, linearray(1).toLong, linearray(2).toInt, linearray(3), linearray(4).toLong)
      })
    val watermarkDataStream: DataStream[UserBehavior] = value.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[UserBehavior](Time.milliseconds(1000)) {
        // Flink event time is in milliseconds, so the timestamp field is used as-is here
        override def extractTimestamp(element: UserBehavior): Long = element.timestamp
      })
    val itemIdWindowStream: DataStream[ItemViewCount] = watermarkDataStream
      .filter(_.behavior == "pv")
      .keyBy("itemId")
      .timeWindow(Time.minutes(60), Time.minutes(5))
      // Aggregate per window
      .aggregate(new CountAgg(), new WindowResultFunction())
    itemIdWindowStream.keyBy("windowEnd").process(new TopNHotItems(3)).print()
    env.execute("Hot Items Job")
  }
}
class CountAgg extends AggregateFunction[UserBehavior, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(userBehavior: UserBehavior, acc: Long): Long = acc + 1
  override def getResult(acc: Long): Long = acc
  override def merge(acc: Long, acc1: Long): Long = acc1 + acc
}
// Emits the result of each window
class WindowResultFunction extends WindowFunction[Long, ItemViewCount, Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
    // keyBy("itemId") yields a Flink Java Tuple1, whose field is f0 (not Scala's Tuple1._1)
    val itemId = key.asInstanceOf[org.apache.flink.api.java.tuple.Tuple1[Long]].f0
    val count = input.iterator.next()
    out.collect(ItemViewCount(itemId, window.getEnd, count))
  }
}
class TopNHotItems(topSize: Int) extends KeyedProcessFunction[Tuple, ItemViewCount, String] {
  private var itemState: ListState[ItemViewCount] = _
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // Name the state and declare the type it holds
    val itemsStateDesc = new ListStateDescriptor[ItemViewCount]("itemState-state", classOf[ItemViewCount])
    // Obtain the state handle
    itemState = getRuntimeContext.getListState(itemsStateDesc)
  }
  override def processElement(input: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
    // Save every record in state
    itemState.add(input)
    // Register an EventTime timer for windowEnd+1; when it fires, all item data for the windowEnd window has arrived
    // In other words, onTimer is called once the watermark reaches windowEnd + 1
    context.timerService.registerEventTimeTimer(input.windowEnd + 1)
  }
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    import scala.collection.JavaConverters._
    // Collect the click counts of all items received
    val allItems: ListBuffer[ItemViewCount] = ListBuffer()
    for (item <- itemState.get.asScala) {
      allItems += item
    }
    // Clear the state early to free space
    itemState.clear()
    // Sort by click count in descending order
    val sortedItems = allItems.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
    // Format the ranking as a String for printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n")
    for (i <- sortedItems.indices) {
      val currentItem: ItemViewCount = sortedItems(i)
      // e.g. No1: item ID=12224 views=2413
      result.append("No").append(i + 1).append(":")
        .append(" item ID=").append(currentItem.itemId)
        .append(" views=").append(currentItem.count).append("\n")
    }
    result.append("====================================\n\n")
    // Throttle the output to simulate results scrolling in real time
    Thread.sleep(1000)
    out.collect(result.toString)
  }
}
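The buffer-then-fire logic of TopNHotItems can be tried out without a cluster. The following is a rough plain-Scala sketch, not Flink's timer service: `TopNTimerSketch`, `TopN`, and `advanceWatermark` are hypothetical names, the `Map` keyed by windowEnd imitates the per-key ListState, and the sorted set imitates timers (duplicate registrations collapse, as the text above describes).

```scala
import scala.collection.mutable

object TopNTimerSketch {
  case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

  class TopN(topSize: Int) {
    // Stand-in for ListState, scoped by the key (windowEnd)
    private val buffered = mutable.Map.empty[Long, mutable.ListBuffer[ItemViewCount]]
    // Stand-in for registered event-time timers; a set ignores repeated registrations
    private val timers = mutable.SortedSet.empty[Long]

    def processElement(item: ItemViewCount): Unit = {
      buffered.getOrElseUpdate(item.windowEnd, mutable.ListBuffer.empty) += item
      timers += item.windowEnd + 1
    }

    // Fires every timer whose timestamp is <= watermark and returns
    // (windowEnd, top items by count) for each timer that fired.
    def advanceWatermark(watermark: Long): Seq[(Long, Seq[ItemViewCount])] = {
      val due = timers.takeWhile(_ <= watermark).toList
      due.map { ts =>
        timers -= ts
        val items = buffered.remove(ts - 1).getOrElse(mutable.ListBuffer.empty)
        (ts - 1, items.sortBy(_.count)(Ordering.Long.reverse).take(topSize).toList)
      }
    }
  }
}
```

Nothing is emitted while the watermark is still at windowEnd; only when it reaches windowEnd+1 is the buffered list sorted, truncated to the top N, and cleared, which mirrors the order of steps in onTimer above.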