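Reading guide
1. Why do window computations exist?
2. In what situations would you use Window Apply?
3. What can Window Fold be used for?
4. Can window streams be unioned and joined?
5. Can a DataStream be split?
This article is mainly about windows, so let us start by asking why windows exist at all.
Couldn't we just process messages one at a time, as in the earlier stream processing examples? We could. Some scenarios are a better fit for windows, though, for example when we want to know how many orders were placed within 10 minutes. That is where window computation comes in. A window is a wrapper over the stream: all the data that falls within a given period of time is processed together.
Now that we know what windows are, let's continue: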
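1. Window
KeyedStream → WindowedStream
Windows can be defined on an already partitioned KeyedStream. Windows group the data in each key according to some characteristic (for example, the data that arrived within the last 5 seconds). For a description of windows, see the windows documentation.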
[Java]
dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
[Scala]
dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))) // Last 5 seconds of data
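To make "grouping by time" more concrete, here is a minimal plain-Java sketch of how a timestamp could be mapped to a 5-second tumbling window. It is not Flink code: the class and method names are hypothetical, and it assumes non-negative epoch-millisecond timestamps and a zero window offset.

```java
// Minimal sketch, not Flink code: which tumbling window does a timestamp fall into?
// Assumes non-negative millisecond timestamps and no window offset.
class TumblingWindowSketch {

    /** Start (inclusive) of the tumbling window that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    /** End (exclusive) of that same window. */
    static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis;
    }

    public static void main(String[] args) {
        long size = 5_000L; // 5 seconds
        for (long ts : new long[] {0L, 4_999L, 5_000L, 12_345L}) {
            System.out.println(ts + " -> [" + windowStart(ts, size) + ", " + windowEnd(ts, size) + ")");
        }
    }
}
```

Every element whose timestamp maps to the same start value lands in the same window; with keyBy, that grouping happens separately for each key.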
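2. WindowAll
DataStream → AllWindowedStream
Windows can be defined on regular DataStreams. Windows group all the stream events according to some characteristic (for example, the data that arrived within the last 5 seconds). For a complete description of windows, see the windows documentation.
In other words, this is a window over the whole stream that is not grouped by any key.
Note: in many cases this is a non-parallel transformation. All records are gathered in a single task of the windowAll operator.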
[Java]
dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
[Scala]
dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))) // Last 5 seconds of data
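3. Window Apply
WindowedStream → DataStream
AllWindowedStream → DataStream
Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.
If you are using a windowAll transformation, you need to use an AllWindowFunction instead.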
[Java]
windowedStream.apply(new WindowFunction<Tuple2<String, Integer>, Integer, Tuple, Window>() {
    @Override
    public void apply(Tuple tuple,
                      Window window,
                      Iterable<Tuple2<String, Integer>> values,
                      Collector<Integer> out) throws Exception {
        // Manually sum the second field of every element in the window.
        int sum = 0;
        for (Tuple2<String, Integer> t : values) {
            sum += t.f1;
        }
        out.collect(sum);
    }
});

// applying an AllWindowFunction on non-keyed window stream
allWindowedStream.apply(new AllWindowFunction<Tuple2<String, Integer>, Integer, Window>() {
    @Override
    public void apply(Window window,
                      Iterable<Tuple2<String, Integer>> values,
                      Collector<Integer> out) throws Exception {
        int sum = 0;
        for (Tuple2<String, Integer> t : values) {
            sum += t.f1;
        }
        out.collect(sum);
    }
});
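[Scala]
windowedStream.apply { WindowFunction }

// applying an AllWindowFunction on non-keyed window stream
allWindowedStream.apply { AllWindowFunction }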
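The effect of keyBy, a tumbling window and the manual sum can be imitated in plain Java. The sketch below is only an illustration under stated assumptions: the `Event` class, the `key@windowStart` map key and the method name are all hypothetical, and no Flink classes are involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch, not Flink code: sum values per key and per tumbling window.
class KeyedWindowSumSketch {

    /** Hypothetical event type: a key, a value and an event timestamp in milliseconds. */
    static final class Event {
        final String key;
        final int value;
        final long timestampMillis;

        Event(String key, int value, long timestampMillis) {
            this.key = key;
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Returns the sums indexed by "key@windowStart"; assumes non-negative timestamps. */
    static Map<String, Integer> sumPerKeyAndWindow(Event[] events, long sizeMillis) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Event e : events) {
            long windowStart = e.timestampMillis - (e.timestampMillis % sizeMillis);
            sums.merge(e.key + "@" + windowStart, e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        Event[] events = {
            new Event("a", 1, 1_000L), new Event("a", 2, 4_000L),
            new Event("b", 5, 2_000L), new Event("a", 10, 6_000L)
        };
        System.out.println(sumPerKeyAndWindow(events, 5_000L));
    }
}
```

Dropping the key from the map index gives the windowAll variant, where everything in the same time window is summed together regardless of key.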
4. Window Reduce
WindowedStream → DataStream
Applies a reduce function to the window and returns the reduced value.
[Java]
windowedStream.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
        return new Tuple2<String, Integer>(value1.f0, value1.f1 + value2.f1);
    }
});
[Scala]
windowedStream.reduce { _ + _ }
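5. Window Fold
WindowedStream → DataStream
Applies a fold function to the window and returns the folded value. The example function, when applied to the sequence (1,2,3,4,5), folds the sequence into the string "start-1-2-3-4-5":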
[Java]
windowedStream.fold("start", new FoldFunction<Integer, String>() {
    public String fold(String current, Integer value) {
        return current + "-" + value;
    }
});
[Scala]
val result: DataStream[String] =
    windowedStream.fold("start", (str, i) => { str + "-" + i })
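What the fold does to the elements of a single window can be reproduced without Flink. The following is a minimal sketch with a hypothetical helper name; it simply walks the window contents in order, carrying the accumulated string along.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch, not Flink code: fold the elements of one window into a string.
class WindowFoldSketch {

    /** Starts from the initial value and appends "-" plus each element in order. */
    static String fold(String initial, List<Integer> windowElements) {
        String current = initial;
        for (Integer value : windowElements) {
            current = current + "-" + value;
        }
        return current;
    }

    public static void main(String[] args) {
        // prints "start-1-2-3-4-5"
        System.out.println(fold("start", Arrays.asList(1, 2, 3, 4, 5)));
    }
}
```

Unlike reduce, the result type (String) differs from the element type (Integer), which is why fold needs an initial value.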
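6. Aggregations on windows
WindowedStream → DataStream
Aggregates the contents of a window. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (the same applies to max and maxBy).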
[Java]
windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");
[Scala]
windowedStream.sum(0)
windowedStream.sum("key")
windowedStream.min(0)
windowedStream.min("key")
windowedStream.max(0)
windowedStream.max("key")
windowedStream.minBy(0)
windowedStream.minBy("key")
windowedStream.maxBy(0)
windowedStream.maxBy("key")
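The min versus minBy distinction is easy to mix up, so here is a small plain-Java sketch that follows the description above. The `Reading` class and both helper methods are hypothetical and only model the idea; they are not Flink's implementation.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Minimal sketch, not Flink code: "minimum value" versus "element holding the minimum".
class MinVersusMinBySketch {

    /** Hypothetical window element: a sensor name and a temperature. */
    static final class Reading {
        final String sensor;
        final int temperature;

        Reading(String sensor, int temperature) {
            this.sensor = sensor;
            this.temperature = temperature;
        }
    }

    /** min: only the smallest temperature. Assumes a non-empty window. */
    static int min(List<Reading> window) {
        return window.stream().mapToInt(r -> r.temperature).min().getAsInt();
    }

    /** minBy: the whole element that carries the smallest temperature. Assumes a non-empty window. */
    static Reading minBy(List<Reading> window) {
        return window.stream().min(Comparator.comparingInt((Reading r) -> r.temperature)).get();
    }

    public static void main(String[] args) {
        List<Reading> window = Arrays.asList(
            new Reading("s1", 21), new Reading("s2", 17), new Reading("s3", 19));
        System.out.println(min(window));          // prints 17
        System.out.println(minBy(window).sensor); // prints s2
    }
}
```

Use minBy (or maxBy) when you need to know which record produced the extreme value, not just the value itself.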
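That covers the window computations. Next we look at something new: combining streams with each other and with windows.
7. Union
DataStream* → DataStream
Union of two or more data streams, creating a new stream that contains all the elements from all the streams. Note: if you union a data stream with itself, you get each element twice in the resulting stream.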
[Java]
dataStream.union(otherStream1, otherStream2, ...);
[Scala]
dataStream.union(otherStream1, otherStream2, ...)
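The note about a self-union can be checked with ordinary Java collections. This is a minimal sketch with a hypothetical `union` helper over finite lists; a real stream is unbounded and the interleaving of elements is not fixed, so only the element counts carry over.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch, not Flink code: union keeps every element of every input.
class UnionSketch {

    /** Concatenates two inputs; no deduplication takes place. */
    static <T> List<T> union(List<T> first, List<T> second) {
        return Stream.concat(first.stream(), second.stream()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3);
        // Union with itself: every element shows up twice.
        System.out.println(union(numbers, numbers)); // prints [1, 2, 3, 1, 2, 3]
    }
}
```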
8. Window Join
DataStream,DataStream → DataStream
Joins two data streams on a given key and a common window.
[Java]
dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply(new JoinFunction() {...});
[Scala]
dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply { ... }
9. Interval Join
KeyedStream,KeyedStream → DataStream
Joins two elements e1 and e2 of two keyed streams that share a common key within a given time interval, such that e1.timestamp + lowerBound <= e2.timestamp <= e1.timestamp + upperBound.
[Java]
// this will join the two streams so that
// key1 == key2 && leftTs - 2 < rightTs < leftTs + 2
keyedStream.intervalJoin(otherKeyedStream)
    .between(Time.milliseconds(-2), Time.milliseconds(2)) // lower and upper bound
    .upperBoundExclusive() // optional
    .lowerBoundExclusive() // optional
    .process(new ProcessJoinFunction() {...});
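The time condition of the interval join can be written down as a plain predicate. The sketch below is a hypothetical helper, not part of Flink; it assumes the two elements already share the same key and only checks the timestamps, with both bounds inclusive unless marked exclusive.

```java
// Minimal sketch, not Flink code: the timestamp condition of an interval join.
class IntervalJoinSketch {

    /**
     * True if leftTs + lowerBound <= rightTs <= leftTs + upperBound,
     * with either comparison made strict when the matching flag is set.
     */
    static boolean joins(long leftTs, long rightTs, long lowerBound, long upperBound,
                         boolean lowerExclusive, boolean upperExclusive) {
        long lowest = leftTs + lowerBound;
        long highest = leftTs + upperBound;
        boolean aboveLower = lowerExclusive ? rightTs > lowest : rightTs >= lowest;
        boolean belowUpper = upperExclusive ? rightTs < highest : rightTs <= highest;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // Bounds of -2 ms and +2 ms around a left element at t = 10.
        System.out.println(joins(10, 12, -2, 2, false, false)); // prints true (inclusive bound)
        System.out.println(joins(10, 12, -2, 2, true, true));   // prints false (exclusive bound)
    }
}
```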
10. Window CoGroup
DataStream,DataStream → DataStream
Cogroups two data streams on a given key and a common window.
[Java]
dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply(new CoGroupFunction() {...});
[Scala]
dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply {}
CoGroup and join are related: CoGroup can be used to implement a DataStream join.
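To see why CoGroup is enough to build a join, consider what a cogroup function receives for one key and one window: all elements from the left side and all elements from the right side. The plain-Java sketch below uses hypothetical helper names and strings as stand-ins for records; it is an illustration of the idea, not Flink's join implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch, not Flink code: building joins from the two groups of one key and window.
class CoGroupJoinSketch {

    /** Inner join: pair every left element with every right element. */
    static List<String> innerJoin(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(l + "|" + r);
            }
        }
        return out;
    }

    /** Left outer join: left elements survive even when the right group is empty. */
    static List<String> leftOuterJoin(List<String> left, List<String> right) {
        if (!right.isEmpty()) {
            return innerJoin(left, right);
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            out.add(l + "|null");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = List.of("a1", "a2");
        System.out.println(innerJoin(left, List.of("b1")));  // prints [a1|b1, a2|b1]
        System.out.println(leftOuterJoin(left, List.of()));  // prints [a1|null, a2|null]
    }
}
```

Because the function sees both groups at once, including an empty one, CoGroup can express outer joins that a plain window join cannot.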
11. Connect
DataStream,DataStream → ConnectedStreams
"Connects" two data streams while retaining their types. Connect allows state to be shared between the two streams.
[Java]
DataStream<Integer> someStream = //...
DataStream<String> otherStream = //...
ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);
[Scala]
someStream : DataStream[Int] = ...
otherStream : DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)
12. CoMap, CoFlatMap
ConnectedStreams → DataStream
Similar to map and flatMap, but on a connected data stream.
[Java]
connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() {
    @Override
    public Boolean map1(Integer value) {
        return true;
    }

    @Override
    public Boolean map2(String value) {
        return false;
    }
});
connectedStreams.flatMap(new CoFlatMapFunction<Integer, String, String>() {
    @Override
    public void flatMap1(Integer value, Collector<String> out) {
        out.collect(value.toString());
    }

    @Override
    public void flatMap2(String value, Collector<String> out) {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});
[Scala]
connectedStreams.map(
    (_: Int) => true,
    (_: String) => false
)
// flatMap functions must return a collection of results, mirroring the Java example
connectedStreams.flatMap(
    (num: Int) => List(num.toString),
    (str: String) => str.split(" ").toList
)
13. Split
DataStream → SplitStream
Splits the stream into two or more streams according to some criterion.
[Java]
SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<String>();
        if (value % 2 == 0) {
            output.add("even");
        } else {
            output.add("odd");
        }
        return output;
    }
});
[Scala]
val split = someDataStream.split(
    (num: Int) =>
        (num % 2) match {
            case 0 => List("even")
            case 1 => List("odd")
        }
)
14. Select
SplitStream → DataStream
Selects one or more streams from a split stream.
[Java]
SplitStream<Integer> split;
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even", "odd");
[Scala]
val even = split select "even"
val odd = split select "odd"
val all = split.select("even", "odd")
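Split and select work as a pair: split attaches one or more names to each element, and select keeps the elements whose names match. The plain-Java sketch below models this with hypothetical helpers over a finite list; it is not how Flink implements it internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch, not Flink code: name each element, then select by name.
class SplitSelectSketch {

    /** Plays the role of the output selector: returns the names an element belongs to. */
    static List<String> outputNames(int value) {
        return Collections.singletonList(value % 2 == 0 ? "even" : "odd");
    }

    /** Keeps every element that carries at least one of the requested names. */
    static List<Integer> select(List<Integer> input, String... names) {
        Set<String> wanted = new HashSet<>(Arrays.asList(names));
        List<Integer> selected = new ArrayList<>();
        for (int value : input) {
            for (String name : outputNames(value)) {
                if (wanted.contains(name)) {
                    selected.add(value);
                    break; // add each element at most once
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(select(input, "even"));        // prints [2, 4]
        System.out.println(select(input, "even", "odd")); // prints [1, 2, 3, 4]
    }
}
```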
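15. Iterate
DataStream → IterativeStream → DataStream
Creates a "feedback" loop in the flow by redirecting the output of one operator to some previous operator. This is especially useful for defining algorithms that continuously update a model. The following code starts with a stream and applies the iteration body continuously. Elements greater than 0 are sent back to the feedback channel, and the rest of the elements are forwarded downstream. For a complete description, see the iterations documentation.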
[Java]
IterativeStream<Long> iteration = initialStream.iterate();
DataStream<Long> iterationBody = iteration.map(/*do something*/);
// Elements greater than 0 go back into the loop.
DataStream<Long> feedback = iterationBody.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long value) throws Exception {
        return value > 0;
    }
});
iteration.closeWith(feedback);
// All other elements leave the loop and are forwarded downstream.
DataStream<Long> output = iterationBody.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long value) throws Exception {
        return value <= 0;
    }
});
[Scala]
initialStream.iterate {
    iteration => {
        val iterationBody = iteration.map { /*do something*/ }
        (iterationBody.filter(_ > 0), iterationBody.filter(_ <= 0))
    }
}
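The feedback loop can be simulated with a queue. In the sketch below the iteration body is assumed to be "subtract one", which is purely a hypothetical stand-in for the elided `/*do something*/`; the queue plays the role of the feedback channel, and the class is plain Java rather than Flink code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Minimal sketch, not Flink code: a feedback loop driven by a queue.
class IterateSketch {

    /** Runs the loop until nothing is fed back and returns the forwarded elements. */
    static List<Long> run(List<Long> initialElements) {
        Deque<Long> pending = new ArrayDeque<>(initialElements);
        List<Long> output = new ArrayList<>();
        while (!pending.isEmpty()) {
            long result = pending.poll() - 1; // assumed iteration body: minus one
            if (result > 0) {
                pending.add(result);          // feedback: back into the loop
            } else {
                output.add(result);           // forwarded downstream
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(3L, 1L, 0L))); // prints [0, -1, 0]
    }
}
```

The element 3 goes around the loop three times before leaving it, while 1 and 0 leave after a single pass.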
16. Extract Timestamps
DataStream → DataStream
Extracts timestamps from records so that they can be used with windows that rely on event time semantics.
[Java]
stream.assignTimestamps(new TimeStampExtractor() {...});
[Scala]
stream.assignTimestamps { timestampExtractor }