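Flink watermarks are used to handle out-of-order data. There are already plenty of excellent articles about them online, so I won't repeat that material here. For reference: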
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/event_timestamps_watermarks.html
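What I want to talk about today is a pit I dug for myself while using watermarks: sideOutputLateData() was not emitting the late data I expected. I'm writing it down here.
First, let's look at the source code and its Javadoc: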
/**
 * Send late arriving data to the side output identified by the given {@link OutputTag}. Data
 * is considered late after the watermark has passed the end of the window plus the allowed
 * lateness set using {@link #allowedLateness(Time)}.
 *
 * <p>You can get the stream of late data using
 * {@link SingleOutputStreamOperator#getSideOutput(OutputTag)} on the
 * {@link SingleOutputStreamOperator} resulting from the windowed operation
 * with the same {@link OutputTag}.
 */
@PublicEvolving
public WindowedStream<T, K, W> sideOutputLateData(OutputTag<T> outputTag) {
    Preconditions.checkNotNull(outputTag, "Side output tag must not be null.");
    this.lateDataOutputTag = input.getExecutionEnvironment().clean(outputTag);
    return this;
}
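First, data is only emitted through the outputTag once it counts as late, and, per the Javadoc above, that means the watermark has already passed the end of the element's window plus the allowed lateness. An element whose event time merely lags behind the current watermark, but whose window is still within the allowed lateness, is added to the window as usual and does not go to the side output.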
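The lateness rule can be sketched in plain Java without any Flink dependency. This is a simplified model of the check as I understand it, not Flink's actual code: the class and method names (LatenessCheck, cleanupTime, isWindowLate, windowStart) are made up for illustration, and it assumes tumbling windows with no offset, non-negative timestamps in milliseconds, and the convention that the last timestamp belonging to a window [start, end) is end - 1.

```java
// Simplified, Flink-free model of "is this window already late?".
// All names here are hypothetical and exist only for illustration.
class LatenessCheck {

    // Last timestamp that still belongs to the window [start, end).
    static long maxTimestamp(long windowEnd) {
        return windowEnd - 1;
    }

    // Point in event time after which the window no longer accepts elements.
    static long cleanupTime(long windowEnd, long allowedLatenessMs) {
        return maxTimestamp(windowEnd) + allowedLatenessMs;
    }

    // A window is late once the watermark has reached its cleanup time.
    static boolean isWindowLate(long windowEnd, long allowedLatenessMs, long currentWatermark) {
        return cleanupTime(windowEnd, allowedLatenessMs) <= currentWatermark;
    }

    // Start of the tumbling window containing the timestamp
    // (assumes no window offset and a non-negative timestamp).
    static long windowStart(long timestamp, long windowSizeMs) {
        return timestamp - (timestamp % windowSizeMs);
    }

    public static void main(String[] args) {
        long windowSizeMs = 10_000L;
        long allowedLatenessMs = 5_000L;
        long eventTimestamp = 3_000L; // falls into the window [0, 10000)
        long windowEnd = windowStart(eventTimestamp, windowSizeMs) + windowSizeMs;

        // Watermark 12000: behind the window end, but still within allowed lateness.
        System.out.println(isWindowLate(windowEnd, allowedLatenessMs, 12_000L));
        // Watermark 14999: equals the cleanup time 9999 + 5000, so the window is late.
        System.out.println(isWindowLate(windowEnd, allowedLatenessMs, 14_999L));
    }
}
```

In this model, an element with timestamp 3000 that arrives when the watermark is 12000 still goes into its window; the same element arriving at watermark 14999 or later is the kind that would be routed to the side output.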
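Second, and here comes the pitfall: when you call getSideOutput() to obtain the late-data DataStream, you must call it on the SingleOutputStreamOperator returned by the windowed operation itself, using the same OutputTag. Only then do you get the late data you expect.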
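A minimal sketch of the correct wiring, assuming Flink 1.10 on the classpath. The socket source on localhost:9999, the "key,timestampMillis" line format, the window size, and all variable names are assumptions for illustration, not part of the original job:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataSideOutputExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(1);

        // Anonymous subclass so Flink can capture the element type of the tag.
        final OutputTag<Tuple2<String, Long>> lateTag =
            new OutputTag<Tuple2<String, Long>>("late-data") {};

        // Hypothetical input: lines of the form "key,timestampMillis".
        DataStream<Tuple2<String, Long>> events = env
            .socketTextStream("localhost", 9999)
            .map(line -> {
                String[] parts = line.split(",");
                return Tuple2.of(parts[0], Long.parseLong(parts[1]));
            })
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(2)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element) {
                        return element.f1;
                    }
                });

        // Keep a reference to the operator returned by the windowed operation (reduce).
        SingleOutputStreamOperator<Tuple2<String, Long>> windowed = events
            .keyBy(e -> e.f0)
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            .allowedLateness(Time.seconds(5))
            .sideOutputLateData(lateTag)
            .reduce((a, b) -> Tuple2.of(a.f0, Math.max(a.f1, b.f1)));

        windowed.print("window-result");

        // Correct: getSideOutput() on the windowed operation's own result, same tag.
        windowed.getSideOutput(lateTag).print("late");

        // Pitfall: a further transformation such as windowed.map(...) returns a
        // different operator; calling getSideOutput(lateTag) on that one does not
        // give you the window's late data.

        env.execute("late-data-side-output");
    }
}
```

The point of the sketch is simply to hold on to the variable returned by reduce() (or apply(), process(), aggregate()) and read the side output from that variable before chaining anything else onto it.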
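Summary:
sideOutputLateData() is a safety net: when data arrives severely late, it keeps that data from being silently dropped.
Before using a third-party library, read its source code first; don't rely on intuition.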