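5 Tips for Using Flink Windows

Note: I came across this article today on the WeChat official account 小米技術雲 (Xiaomi's cloud technology account), found it good and practical, and am reposting it here.

Windows are one of Flink's core features, and using them well goes a long way toward solving certain business scenarios.

Today we share 5 tips for using Flink windows. Before we start, let's review a few core concepts.

A Window has several core components:

Assigner: determines which Window an incoming element belongs to;
Trigger: determines when a Window fires its computation;
Evictor: can be used to "clean up" elements in a Window;
Function: processes the data in the window;

Windows are stateful, and that state is bound to both the element's Key and the Window. Abstractly, we can think of it as a Map of the form (Key, Window) -> WindowState.

Windows fall into two categories, Keyed and Non-Keyed. Today we only discuss Keyed Windows.

OK, on to the main topic.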

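Tip 1: Mini-Batch Output

This title may look puzzling. One of Flink's big advantages is "pure streaming", which beats the Mini-Batch approach on latency, yet this tip is about Mini-Batch. Isn't that shooting ourselves in the foot?

Before answering, let's look at the background.

Most Flink jobs write their results to an external system, such as Kafka, after processing the data. There are two ways to write to an external system:

Send every message individually; latency is low, but so is throughput;
Accumulate some messages and send them as a Batch; latency goes up, but throughput improves;

In production, unless latency requirements are very strict, the first approach puts heavy QPS pressure on the external storage system, so the second approach is generally recommended.

As an aside, many storage systems already account for Batch sending in their SDK design. Kafka is one example: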

Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request.

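With this kind of SDK, life is easier for the user: just call send for each message. Messages are buffered in a queue, and an asynchronous thread sends them in Batches.

Note, however, that when Flink takes a Checkpoint, any unsent data must be pushed out via flush. Otherwise records still sitting in the buffer could be lost if the job fails after the Checkpoint.

Using FlinkKafkaProducer as an illustration: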

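// Simplified sketch of the relevant parts of FlinkKafkaProducer
class FlinkKafkaProducer<IN> extends RichSinkFunction<IN> implements CheckpointedFunction {
    @Override
    public void invoke(IN next, Context context) throws Exception {
        // Asynchronous: the record built from `next` is only queued in the producer's buffer
        kafkaProducer.send(record, callback);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Push out everything still buffered before the checkpoint completes
        kafkaProducer.flush();
    }
}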

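If the external storage system's SDK does not provide asynchronous Batch sending, users have to implement it themselves.

The first idea is to buffer data in a Flink ListState inside a RichSinkFunction, and decide when to send based on the message count and the elapsed delay.

There is nothing wrong with this in principle. The drawback is that users must maintain and clean up the state themselves, which is a bit of a hassle.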
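The "count or delay" decision described above can be sketched in plain Java, independent of Flink. This is only an illustration under stated assumptions: `BatchBuffer` and its method names are hypothetical, the buffer lives in ordinary memory rather than in a ListState (so it is not fault tolerant), and the caller passes in the current time so that the logic stays testable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers items and flushes on a size limit or an age limit. */
final class BatchBuffer<T> {
    private final int maxCount;
    private final long maxDelayMs;
    private final Consumer<List<T>> sender; // stands in for the external system's client
    private final List<T> buffer = new ArrayList<>();
    private long firstArrivalMs;

    BatchBuffer(int maxCount, long maxDelayMs, Consumer<List<T>> sender) {
        this.maxCount = maxCount;
        this.maxDelayMs = maxDelayMs;
        this.sender = sender;
    }

    /** Adds one item and flushes immediately once maxCount items are buffered. */
    void add(T item, long nowMs) {
        if (buffer.isEmpty()) {
            firstArrivalMs = nowMs; // the delay is measured from the first buffered item
        }
        buffer.add(item);
        if (buffer.size() >= maxCount) {
            flush();
        }
    }

    /** Meant to be called periodically; flushes once the oldest item is maxDelayMs old. */
    void onTick(long nowMs) {
        if (!buffer.isEmpty() && nowMs - firstArrivalMs >= maxDelayMs) {
            flush();
        }
    }

    /** Sends whatever is buffered; a sink would also call this when a checkpoint is taken. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Everything a real sink would add on top of this, such as storing the buffer in state and restoring it after a failure, is exactly the maintenance burden mentioned above.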

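We can actually meet this need with a Window. The State that comes with a Window works well for buffering data, and users don't have to worry about maintaining or cleaning it up.

Users mainly need to care about two points:

How to trigger sending after buffering a certain number of messages;
How to trigger sending after a certain delay;

The first point is easy: CountWindow + ProcessWindowFunction does the job: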

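DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(t -> t.f0)
    .countWindow(200) // batch size
    .process(new BatchSendFunction());

// countWindow is backed by GlobalWindow, hence the window type parameter
public class BatchSendFunction
        extends ProcessWindowFunction<Tuple2<String, Long>, Void, String, GlobalWindow> {

    @Override
    public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements,
            Collector<Void> out) throws Exception {
        List<Tuple2<String, Long>> batch = new ArrayList<>();
        for (Tuple2<String, Long> e : elements) {
            batch.add(e);
        }

        // Send the batch; `client` stands for the external system's SDK client
        client.send(batch);
    }
}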

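This approach is simple and effective and handles most cases, but it has one flaw: latency cannot be controlled.

If a Key receives few messages, they may sit for a while before being sent to the external system. In the extreme case, if a Key never accumulates enough messages to fill the configured Batch size, the window never fires, which is clearly unacceptable.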
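The starvation effect is easy to reproduce with a small plain-Java model of per-key count windows. This is a sketch rather than Flink code; `CountWindowSim` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of a keyed count window: one buffer per key, emitted only when full. */
final class CountWindowSim {
    private final int batchSize;
    private final Map<String, List<String>> pending = new HashMap<>();

    CountWindowSim(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Returns the full batch when the key's buffer reaches batchSize, otherwise empty. */
    Optional<List<String>> add(String key, String value) {
        List<String> buf = pending.computeIfAbsent(key, k -> new ArrayList<>());
        buf.add(value);
        if (buf.size() < batchSize) {
            return Optional.empty();
        }
        pending.remove(key);
        return Optional.of(buf);
    }

    /** Number of elements still waiting for the given key. */
    int pendingFor(String key) {
        List<String> buf = pending.get(key);
        return buf == null ? 0 : buf.size();
    }
}
```

With a batch size of 3, a key that receives three values emits a batch, while a key that receives a single value keeps it pending indefinitely; nothing in a count-only model ever releases it.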

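To solve this problem, that is, to satisfy the second point above, we need a Trigger. With a custom Trigger, the send time can be determined by both the message count and the delay: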

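DataStream<Tuple2<String, Long>> input = ...;

input
    .keyBy(t -> t.f0)
    .window(GlobalWindows.create())
    // Example values only: fire after 200 messages or 1000 ms, whichever comes first
    .trigger(new BatchSendTrigger<>(200, 1000))
    .process(new BatchSendFunction());

public class BatchSendTrigger<T> extends Trigger<T, GlobalWindow> {
    // Maximum number of buffered messages
    private final long maxCount;
    // Maximum buffering time in milliseconds
    private final long maxDelay;

    // One Trigger instance serves every key of a subtask, so the current message
    // count and the processing-time timer are kept in per-key, per-window state
    // rather than in plain fields.
    private final ValueStateDescriptor<Long> countDesc =
            new ValueStateDescriptor<>("element-count", Long.class);
    private final ValueStateDescriptor<Long> timerDesc =
            new ValueStateDescriptor<>("timer-time", Long.class);

    public BatchSendTrigger(long maxCount, long maxDelay) {
        this.maxCount = maxCount;
        this.maxDelay = maxDelay;
    }

    @Override
    public TriggerResult onElement(T element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
        ValueState<Long> count = ctx.getPartitionedState(countDesc);
        ValueState<Long> timer = ctx.getPartitionedState(timerDesc);
        long elementCount = count.value() == null ? 0L : count.value();

        if (elementCount == 0) {
            // First message of a batch: start the delay timer
            long timerTime = ctx.getCurrentProcessingTime() + maxDelay;
            ctx.registerProcessingTimeTimer(timerTime);
            timer.update(timerTime);
        }

        // maxCount condition met
        if (++elementCount >= maxCount) {
            ctx.deleteProcessingTimeTimer(timer.value());
            count.clear();
            timer.clear();
            return TriggerResult.FIRE_AND_PURGE;
        }

        count.update(elementCount);
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
        // maxDelay condition met
        ctx.getPartitionedState(countDesc).clear();
        ctx.getPartitionedState(timerDesc).clear();
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
        // Drop any pending timer together with the trigger state
        ValueState<Long> timer = ctx.getPartitionedState(timerDesc);
        if (timer.value() != null) {
            ctx.deleteProcessingTimeTimer(timer.value());
        }
        timer.clear();
        ctx.getPartitionedState(countDesc).clear();
    }
}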

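Tip 2: Deduplication

Message deduplication is a very common requirement in distributed computing. Taking Kafka as an example, an unstable network, or a crash and restart of the process hosting the Kafka Producer, can lead to duplicate messages in a Topic. How to deduplicate while consuming the Topic naturally becomes a concern for the business.

Deduplication has two key aspects:

A unique ID for each record;
The time span within which duplicates can occur;

The unique ID is the criterion for deciding whether a record is a duplicate. The time span determines how long the state for each ID is kept; storing it forever would inevitably cause memory and performance problems.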
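The interplay of the two aspects can be sketched in a few lines of plain Java. The class below is a hypothetical illustration, not Flink code: it remembers when each ID was first seen and forgets it once the configured time span has passed. The linear scan on every call is kept for brevity and would be too slow for a large ID set.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical deduplicator: an ID counts as a duplicate only within ttlMs of its first sighting. */
final class TtlDeduplicator {
    private final long ttlMs;
    private final Map<String, Long> firstSeen = new HashMap<>();

    TtlDeduplicator(long ttlMs) {
        this.ttlMs = ttlMs;
    }

    /** Returns true if the ID has not been seen within the last ttlMs milliseconds. */
    boolean isFirst(String id, long nowMs) {
        evictExpired(nowMs);
        return firstSeen.putIfAbsent(id, nowMs) == null;
    }

    /** Drops every ID whose first sighting is at least ttlMs old (O(n) per call). */
    private void evictExpired(long nowMs) {
        firstSeen.values().removeIf(seenAt -> nowMs - seenAt >= ttlMs);
    }

    /** Number of IDs currently held in memory. */
    int trackedIds() {
        return firstSeen.size();
    }
}
```

With a 2-minute span, an ID seen at time 0 is rejected at 1 minute and accepted again at 2 minutes, and the map never holds IDs older than the span.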

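When implementing deduplication with Flink, the first idea that comes to mind might be this: write a custom FilterFunction and use a HashMap<ID, Boolean> to record whether a message has been seen; to keep the HashMap from growing without bound, use a Guava Cache, which supports expiration, to hold the data.

A code sketch: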

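// T, ID and timeout are placeholders for the record type, its ID type and the expiration time
public class DedupeFilterFunction extends RichFilterFunction<T> {
    LoadingCache<ID, Boolean> cache;

    @Override
    public void open(Configuration parameters) throws Exception {
        cache = CacheBuilder.newBuilder()
            // Set the expiration time
            .expireAfterWrite(timeout, TimeUnit.MILLISECONDS)
            .build(...);
    }

    @Override
    public boolean filter(T value) throws Exception {
        ID key = value.getID();
        boolean seen = cache.get(key);
        if (!seen) {
            // First occurrence: remember the ID and let the record through
            cache.put(key, true);
            return true;
        } else {
            // Duplicate within the expiration period: drop it
            return false;
        }
    }
}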

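This code looks like it solves the problem, but there is a catch: if the job restarts after a failure, everything in the Cache is lost. So this approach needs extra logic: implement the CheckpointedFunction interface, and save and restore the Cache contents in snapshotState and initializeState.

We won't go into that code here. Instead, let's look at how to deduplicate with a Window.

The rough idea: first use keyBy to route records with the same ID to the same downstream node; the downstream window stores and processes the data and emits only one record.

There are two key points:

The window size is related to the time span within which duplicates can occur;
The window state does not need to, and should not, store all the data; storing one record is enough;

A code sketch: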

DataStream<String> input = ...;

input
    .keyBy(...)
    .timeWindow(Time.minutes(2))
    // Keep only the first record seen for each key in the window
    .reduce(new ReduceFunction<String>() {
        @Override
        public String reduce(String s, String t1) throws Exception {
            return s;
        }
    });

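This implementation looks simple and effective, but it has two problems.

First, Tumbling Windows are aligned to the window size, which does not match what we want.

The 2-minute windows above look like [00:00, 00:02), [00:02, 00:04), and so on. If a message arrives at 00:01, its window fires 1 minute later. The duplicate-detection span thus shrinks to 1 minute, which hurts deduplication.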
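The alignment effect can be checked with a little arithmetic. The sketch below assumes the window start is computed as `timestamp - (timestamp - offset + size) % size`, which is how I understand Flink's `TimeWindow.getWindowStartWithOffset` to work for non-negative timestamps; the class and method names are hypothetical.

```java
/** Hypothetical helper showing how much of a tumbling window is left when an element arrives. */
final class TumblingAlignment {
    /** Start of the tumbling window that contains timestamp; all values in milliseconds. */
    static long windowStart(long timestamp, long offset, long size) {
        return timestamp - (timestamp - offset + size) % size;
    }

    /** Time left until the window that contains timestamp closes. */
    static long remaining(long timestamp, long size) {
        return windowStart(timestamp, 0L, size) + size - timestamp;
    }
}
```

For a 2-minute window, `remaining(60_000L, 120_000L)` is 60000 ms, the 1-minute detection span from the example above, and an element arriving at 119999 ms is left with a single millisecond.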

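We can solve this with a Session Window, for example ProcessingTimeSessionWindows.withGap(Time.minutes(2)). Note that 2 minutes here means "if no duplicate arrives for 2 minutes, assume no more duplicates will follow", which differs from the meaning of the timeWindow parameter above.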
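To make the gap semantics concrete, here is a simplified plain-Java model of session windows for a single key. It is a sketch under assumptions: arrival times are given in ascending order, `SessionSim` is a hypothetical name, and the exact handling of an arrival that falls precisely on a session's end is a simplification rather than a statement about Flink's behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: groups arrival times of one key into sessions [start, lastArrival + gap). */
final class SessionSim {
    /** arrivals must be sorted in ascending order; all values in milliseconds. */
    static List<long[]> sessions(long[] arrivals, long gapMs) {
        List<long[]> result = new ArrayList<>();
        for (long t : arrivals) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                // A duplicate arrived before the gap elapsed: the session is extended
                last[1] = t + gapMs;
            } else {
                // The gap elapsed without a duplicate: a new session starts
                result.add(new long[] {t, t + gapMs});
            }
        }
        return result;
    }
}
```

With a 2-minute gap, arrivals at 0 s, 60 s and 170 s all land in one session that ends at 290 s, far longer than 2 minutes, while an arrival at 400 s starts a new session.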

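Second, because computation is triggered only when the window ends, records reach downstream with considerable delay.

This can be solved with a custom Trigger: fire when the first message arrives, and never fire again afterwards.

The revised code sketch: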

DataStream<String> input = ...;

input
    .keyBy(...)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(2)))
    .trigger(new DedupTrigger(Time.minutes(2).toMilliseconds()))
    .reduce(new ReduceFunction<String>() {
        @Override
        public String reduce(String s, String t1) throws Exception {
            return s;
        }
    });

public class DedupTrigger extends Trigger<Object, TimeWindow> {
    private final long sessionGap;

    public DedupTrigger(long sessionGap) {
        this.sessionGap = sessionGap;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
        // If the window size equals the session gap, treat this as the first record
        if (window.getEnd() - window.getStart() == sessionGap) {
            return TriggerResult.FIRE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
    }

    @Override
    public boolean canMerge() {
        // Required because session windows are merging windows
        return true;
    }

    @Override
    public void onMerge(TimeWindow window, OnMergeContext ctx) throws Exception {
        // Nothing to merge: this trigger keeps no state or timers of its own
    }
}

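A note: whether a message is the first one is determined by checking whether the window size equals the Session Gap. This works because once a second message arrives, the window Merge makes the window larger.

The edge case is two messages processed less than 1 ms apart: the merged window is then still exactly one Session Gap in size, so the second message is also treated as the first. Given how duplicates arise in real production, this case can be ignored. If that is a concern, consider using a ValueState<Boolean> to record and check whether a message is the first one; the code is not shown here.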

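Tip 3: Windows by "Day"

The requirement is roughly this: windows are split by calendar day, and computation is triggered every 5 minutes. It is not complicated, but it is quite common, so let's discuss it briefly.

The first thing that comes to mind is a Tumbling Window, timeWindow(Time.days(1)), but that cannot fire every 5 minutes.

The next is a Sliding Window, timeWindow(Time.days(1), Time.minutes(5)), but its windows are not calendar days; they are sliding windows one day in size.

Here we can use a Tumbling Window plus a custom Trigger. Note that the calendar day the business wants usually means the local calendar day (UTC+8), whereas Flink splits windows according to UTC by default. Fortunately, Flink provides an interface for this kind of need: TumblingProcessingTimeWindows#of(Time size, Time offset).

A code sketch: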

DataStream<String> input = ...;

input
    .keyBy(...)
    // An offset of -8 hours aligns the one-day windows with midnight in UTC+8
    .window(TumblingProcessingTimeWindows.of(Time.days(1), Time.hours(-8)))
    .trigger(new Trigger<String, TimeWindow>() {

        @Override
        public TriggerResult onElement(String element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
            // Interval between firings
            long interval = Time.minutes(5).toMilliseconds();
            // Register a timer at the next 5-minute boundary
            long timer = TimeWindow.getWindowStartWithOffset(ctx.getCurrentProcessingTime(), 0, interval) + interval;
            ctx.registerProcessingTimeTimer(timer);

            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
            return TriggerResult.FIRE;
        }

        @Override
        public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
            return TriggerResult.CONTINUE;
        }

        @Override
        public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        }
    })
    .reduce(...);
Tip 4: Use ProcessWindowFunction with Care

The official documentation describes the drawback of ProcessWindowFunction:

This comes at the cost of performance and resource consumption, because elements cannot be incrementally aggregated but instead need to be buffered internally until the window is considered ready for processing.

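Whereas ReduceFunction/AggregateFunction/FoldFunction can aggregate data incrementally as it arrives, ProcessWindowFunction buffers the data and processes it only after the Trigger fires. Its WindowState can be thought of simply as a Map of the form (Key, Window) -> List<StreamRecord>.

We did use ProcessWindowFunction in the "Mini-Batch Output" section above, but since the maximum amount of data per Window there is small, it is not a big problem. If the window spans a long time, say several hours or even a day, buffering that much data can cause serious memory and performance problems. Jobs that use filesystem as the state backend are especially prone to OOM errors.

Taking the average of the values in a window as an example, a ProcessWindowFunction might look like this: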

public class AvgProcessWindowFunction extends ProcessWindowFunction<Long, Double, Long, TimeWindow> {
    @Override
    public void process(Long key, Context context, Iterable<Long> elements, Collector<Double> out) throws Exception {
        int cnt = 0;
        double sum = 0;
        // elements holds every record buffered for this window
        for (long e : elements) {
            cnt++;
            sum += e;
        }
        out.collect(sum / cnt);
    }
}

An AggregateFunction, by contrast, only needs to keep the Sum and Count for each window:

public class AvgAggregateFunction implements AggregateFunction<Long, Tuple2<Long, Long>, Double> {
    @Override
    public Tuple2<Long, Long> createAccumulator() {
        // The Accumulator is a Tuple holding <SUM, COUNT>
        return new Tuple2<>(0L, 0L);
    }

    @Override
    public Tuple2<Long, Long> add(Long in, Tuple2<Long, Long> acc) {
        return new Tuple2<>(acc.f0 + in, acc.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return ((double) acc.f0) / acc.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> acc0, Tuple2<Long, Long> acc1) {
        return new Tuple2<>(acc0.f0 + acc1.f0, acc0.f1 + acc1.f1);
    }
}

This shows AggregateFunction's advantage in saving memory.
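The difference in state size can also be seen in a plain-Java comparison that does not depend on Flink. This is only a sketch with hypothetical names: the buffered variant needs the whole list of values, while the incremental variant carries two numbers no matter how many values arrive, and partial accumulators can be merged without revisiting the data.

```java
import java.util.List;

/** Hypothetical comparison of buffered versus incremental averaging. */
final class AvgStateComparison {
    /** Buffering style: every value must be kept until the window fires. */
    static double averageBuffered(List<Long> buffered) {
        long sum = 0;
        for (long v : buffered) {
            sum += v;
        }
        return (double) sum / buffered.size();
    }

    /** Incremental style: the whole state is just {sum, count}. */
    static final class Acc {
        long sum;
        long count;

        void add(long v) {
            sum += v;
            count++;
        }

        /** Combines two partial accumulators into a new one. */
        Acc merge(Acc other) {
            Acc merged = new Acc();
            merged.sum = sum + other.sum;
            merged.count = count + other.count;
            return merged;
        }

        double result() {
            return (double) sum / count;
        }
    }
}
```

Both styles produce the same average; only the amount of state held until the window fires differs.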

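Note, however, that if an Evictor is also specified, the Window buffers all the data even with ReduceFunction/AggregateFunction/FoldFunction, so that the Evictor can filter it. Use Evictors with care.

The source code illustrates this: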

// WindowedStream
public SingleOutputStreamOperator aggregate(...) {
    ...
    if (evictor != null) {
        TypeSerializer<StreamRecord<T>> streamRecordSerializer =
            (TypeSerializer<StreamRecord<T>>) new StreamElementSerializer(input.getType().createSerializer(getExecutionEnvironment().getConfig()));
        // With an Evictor configured, the raw StreamRecords are kept in a ListState
        ListStateDescriptor<StreamRecord<T>> stateDesc =
            new ListStateDescriptor<>("window-contents", streamRecordSerializer);

        operator = new EvictingWindowOperator<>(...);

    } else {
        // Without an Evictor, only the Accumulator is kept, via an AggregatingStateDescriptor
        AggregatingStateDescriptor<T, ACC, V> stateDesc = new AggregatingStateDescriptor<>("window-contents",
            aggregateFunction, accumulatorType.createSerializer(getExecutionEnvironment().getConfig()));

        operator = new WindowOperator<>(...);
    }
    ...
}

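Tip 5: Use "Fine-Grained" SlidingWindows with Care

A SlidingWindow requires a window_size and a window_slide. "Fine-grained" here refers to sliding windows whose window_size/window_slide ratio is very large.

Take timeWindow(Time.days(1), Time.minutes(5)) as an example: Flink maintains days(1) / minutes(5) = 288 windows per Key, so the total number of windows is keys * 288. Because each window maintains its own state and every element is applied to all the windows it belongs to, this weighs heavily on the job's state storage and computational efficiency.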
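The figure of 288 can be reproduced by enumerating the windows a single element falls into. The loop below is a sketch modeled on my understanding of how Flink's sliding window assigners enumerate windows (walk back from the latest window start in steps of the slide); the class and method names are hypothetical and the offset is assumed to be 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the start of every sliding window that contains a timestamp. */
final class SlidingAssignment {
    /** Start of the slide-aligned bucket that contains timestamp; all values in ms. */
    static long windowStart(long timestamp, long offset, long size) {
        return timestamp - (timestamp - offset + size) % size;
    }

    /** Start times of all windows [start, start + sizeMs) that contain timestamp. */
    static List<Long> windowStarts(long timestamp, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = windowStart(timestamp, 0L, slideMs);
        for (long start = lastStart; start > timestamp - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }
}
```

For a one-day size and a 5-minute slide, `windowStarts` returns 288 entries for any timestamp, so each incoming element updates 288 separate window states for its key.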

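The following approaches are worth trying:

First, use a TumblingWindow plus a custom Trigger, as shown in "Tip 3";

Second, skip Windows altogether and use a ProcessFunction: keep the aggregation state in Flink State, update the state and set a Timer in processElement, and emit the aggregated result downstream in onTimer.

A code sketch: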

// Must run on a keyed stream (after keyBy), since it uses keyed state and timers
public class MyProcessFunction extends ProcessFunction<Object, Object> {
    private transient MapState<Object, Object> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        state = getRuntimeContext().getMapState(new MapStateDescriptor<Object, Object>(...));
    }

    @Override
    public void processElement(Object value, Context ctx, Collector<Object> out) throws Exception {
        // Interval between firings
        long interval = Time.minutes(5).toMilliseconds();
        long timer = TimeWindow.getWindowStartWithOffset(ctx.timerService().currentProcessingTime(), 0, interval) + interval;
        ctx.timerService().registerProcessingTimeTimer(timer);
        // Update the State from the message
        state.put(...);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Object> out) throws Exception {
        super.onTimer(timestamp, ctx, out);
        // Compute the result from the State
        Object result = ...;
        out.collect(result);
    }
}

Summary

With a solid understanding and skilled use of Assigner/Trigger/Evictor/Function and Window State, many business requirements can be solved quite conveniently.
