Apache Flink DataStream Transformations: Windows and Related Operations

Guiding questions


1. Why do we need window computations at all?
2. In which situations would you use Window Apply?
3. What can Window Fold be used for?
4. Can windowed streams be combined with union and join?
5. Can a DataStream be split?

 

This article is mainly about windows, so let's first ask why windows exist in the first place.
Couldn't we simply keep processing the stream one message at a time, as before? We could. But some scenarios are a much better fit for windows, for example when we want to know how many orders were placed within the last 10 minutes. That is where window computation comes in: it is a layer on top of the stream that gathers the data of a given time span and processes it together, as in the sketch below.
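As a minimal sketch of the "orders in the last 10 minutes" idea (the Tuple2 layout, the OrderSource and the env variable are illustrative assumptions, not part of the original example):

[Java]
// Hypothetical input: one (productId, 1) record per order, emitted by an assumed OrderSource.
DataStream<Tuple2<String, Integer>> orders = env.addSource(new OrderSource());

DataStream<Tuple2<String, Integer>> ordersPer10Min = orders
    .keyBy(0)                                                    // group by productId
    .window(TumblingProcessingTimeWindows.of(Time.minutes(10)))  // 10-minute tumbling windows
    .sum(1);                                                     // order count per product and window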

Now that we understand what windows are, let's continue:
1.Window
KeyedStream → WindowedStream        
Windows can be defined on an already partitioned KeyedStream. Windows group the data of each key according to some characteristic (for example, the data that arrived within the last 5 seconds). See the windows documentation for a description of windows.
 

[Java] 
dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data

 

[Scala] 
 
dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))) // Last 5 seconds of data



2.WindowAll
DataStream → AllWindowedStream
Windows can also be defined on a regular DataStream. Windows group all stream events according to some characteristic (for example, the data that arrived within the last 5 seconds). See the windows documentation for a complete description of windows.
In other words, this is the window variant that operates on the whole stream, without grouping by key.
Note: in many cases this is not a parallel transformation; all records are gathered in a single task of the windowAll operator.

[Java] 

dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data

 

[Scala] 

dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))) // Last 5 seconds of data


3.Window Apply
WindowedStream → DataStream
AllWindowedStream → DataStream
Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.
Note: if you are using a windowAll transformation, you need to use an AllWindowFunction instead.

windowedStream.apply (new WindowFunction<Tuple2<String,Integer>, Integer, Tuple, Window>() {
 
    public void apply (Tuple tuple,
            Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        int sum = 0;
        for (Tuple2<String, Integer> t : values) {
            sum += t.f1;
        }
        out.collect(sum);
    }
});
 
// applying an AllWindowFunction on non-keyed window stream
allWindowedStream.apply (new AllWindowFunction<Tuple2<String,Integer>, Integer, Window>() {
    public void apply (Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        int sum = 0;
        for (Tuple2<String, Integer> t : values) {
            sum += t.f1;
        }
        out.collect(sum);
    }
 
});

[Scala] 

windowedStream.apply { WindowFunction }

 

// applying an AllWindowFunction on non-keyed window stream

allWindowedStream.apply { AllWindowFunction }

4.Window Reduce
WindowedStream → DataStream
Applies a reduce function to the window and returns the reduced value.

[Java] 
windowedStream.reduce (new ReduceFunction<Tuple2<String,Integer>>() {
 
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
        return new Tuple2<String,Integer>(value1.f0, value1.f1 + value2.f1);
    }
 
});


 

[Scala] 
windowedStream.reduce { _ + _ }



5.Window Fold
WindowedStream → DataStream
Applies a fold function to the window and returns the folded value. The example function, applied to the sequence (1,2,3,4,5), folds the sequence into the string "start-1-2-3-4-5":

[Java] 
windowedStream.fold("start", new FoldFunction<Integer, String>() {
    public String fold(String current, Integer value) {
        return current + "-" + value;
    }
});

 

[Scala] 
val result: DataStream[String] =
    windowedStream.fold("start", (str, i) => { str + "-" + i })



6.Window Aggregations
WindowedStream → DataStream
Aggregates the contents of a window. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in that field (max and maxBy behave the same way); a short sketch follows the listings below.

[Java] 
windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");

 

[Scala] 
windowedStream.sum(0)
windowedStream.sum("key")
windowedStream.min(0)
windowedStream.min("key")
windowedStream.max(0)
windowedStream.max("key")
windowedStream.minBy(0)
windowedStream.minBy("key")
windowedStream.maxBy(0)
windowedStream.maxBy("key")
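
To make the min/minBy difference concrete, here is a small sketch; the Tuple3<String, Integer, Long> element layout (say key, price, timestamp) is an assumption for illustration.

[Java]
// Assume one window for key "a" contains ("a", 5, 100L) followed by ("a", 3, 200L).
windowedStream.min(1);   // emits ("a", 3, 100L): field 1 is replaced by the minimum,
                         // the remaining fields come from the first element of the window
windowedStream.minBy(1); // emits ("a", 3, 200L): the whole element that holds the minimum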



That covers the window operations. Next we move on to combining streams, windows and related constructs.

7.Union
DataStream* → DataStream

Union of two or more data streams, creating a new stream that contains all the elements of all input streams. Note: if you union a data stream with itself, you will see each element twice in the resulting stream.

[Java] 
dataStream.union(otherStream1, otherStream2, ...);

 

[Scala] 
dataStream.union(otherStream1, otherStream2, ...)



8.Window Join
DataStream,DataStream → DataStream        

Joins two data streams on a given key and a common window; a more concrete sketch follows the listings below.

[Java] 
dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new JoinFunction () {...});

 

[Scala] 
dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply { ... }
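
As a more concrete sketch, here is a filled-in JoinFunction; both input streams, their Tuple2<String, Integer> element type and the key selectors are assumptions for illustration.

[Java]
DataStream<Tuple2<String, Integer>> joined = leftStream.join(rightStream)
    .where(new KeySelector<Tuple2<String, Integer>, String>() {
        @Override
        public String getKey(Tuple2<String, Integer> value) { return value.f0; }
    })
    .equalTo(new KeySelector<Tuple2<String, Integer>, String>() {
        @Override
        public String getKey(Tuple2<String, Integer> value) { return value.f0; }
    })
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> join(Tuple2<String, Integer> left, Tuple2<String, Integer> right) {
            // combine the two matching elements of the same key and window
            return new Tuple2<>(left.f0, left.f1 + right.f1);
        }
    });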



9.Interval Join
KeyedStream,KeyedStream → DataStream
Joins elements e1 and e2 of two keyed streams that share a common key over a given time interval, such that e1.timestamp + lowerBound <= e2.timestamp <= e1.timestamp + upperBound; a filled-in sketch follows the listing below.

[Java] 
// this will join the two streams so that
// key1 == key2 && leftTs - 2 < rightTs < leftTs + 2
keyedStream.intervalJoin(otherKeyedStream)
    .between(Time.milliseconds(-2), Time.milliseconds(2)) // lower and upper bound
    .upperBoundExclusive(true) // optional
    .lowerBoundExclusive(true) // optional
    .process(new IntervalJoinFunction() {...});
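
A filled-in sketch: in current Flink versions the function passed to process() is a ProcessJoinFunction; the Tuple2<String, Integer> element types below are assumptions for illustration.

[Java]
keyedStream
    .intervalJoin(otherKeyedStream)
    .between(Time.milliseconds(-2), Time.milliseconds(2))
    .process(new ProcessJoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
        @Override
        public void processElement(Tuple2<String, Integer> left,
                                   Tuple2<String, Integer> right,
                                   Context ctx,
                                   Collector<String> out) {
            // called once for every pair whose timestamps satisfy the interval condition
            out.collect(left.f0 + ": " + (left.f1 + right.f1));
        }
    });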



10.Window CoGroup
DataStream,DataStream → DataStream

Cogroups two data streams on a given key and a common window.

[Java] 
dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new CoGroupFunction () {...});

 

[Scala] 
dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply {}


CoGroup and join are related: coGroup is the more general operation, and a DataStream join can be implemented on top of it, as the sketch below shows.
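
Element types, key selectors and the pairing logic below are assumptions for illustration: per key and window, coGroup receives all elements of both sides, so emitting only matching pairs reproduces an inner join, while also emitting unmatched left elements gives a left-outer-join flavour that a plain join cannot express.

[Java]
dataStream.coGroup(otherStream)
    .where(new KeySelector<Tuple2<String, Integer>, String>() {
        @Override
        public String getKey(Tuple2<String, Integer> v) { return v.f0; }
    })
    .equalTo(new KeySelector<Tuple2<String, Integer>, String>() {
        @Override
        public String getKey(Tuple2<String, Integer> v) { return v.f0; }
    })
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply(new CoGroupFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
        @Override
        public void coGroup(Iterable<Tuple2<String, Integer>> left,
                            Iterable<Tuple2<String, Integer>> right,
                            Collector<Tuple2<String, Integer>> out) {
            for (Tuple2<String, Integer> l : left) {
                boolean matched = false;
                for (Tuple2<String, Integer> r : right) {
                    matched = true;
                    out.collect(new Tuple2<>(l.f0, l.f1 + r.f1)); // inner-join style pairing
                }
                if (!matched) {
                    out.collect(new Tuple2<>(l.f0, l.f1));        // left element without a match
                }
            }
        }
    });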

11.Connect
DataStream,DataStream → ConnectedStreams

"Connects" two data streams, retaining their types. Connect allows shared state between the two streams; a sketch of this is shown after the listings below.

[Java] 
DataStream<Integer> someStream = //...
DataStream<String> otherStream = //...
ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);

 

[Scala] 
someStream : DataStream[Int] = ...
otherStream : DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)
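
To illustrate the shared-state point, here is a sketch (the stream contents, key positions, state name and threshold logic are assumptions for illustration): one stream carries threshold updates, the other carries values, and a RichCoFlatMapFunction keeps the latest threshold in keyed state that both inputs can access.

[Java]
DataStream<Tuple2<String, Integer>> thresholds = // ... (key, new threshold)
DataStream<Tuple2<String, Integer>> values = // ... (key, value)

DataStream<Tuple2<String, Integer>> aboveThreshold = thresholds.connect(values)
    .keyBy(0, 0)  // key both inputs by their String field so they share keyed state
    .flatMap(new RichCoFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
        private transient ValueState<Integer> threshold;

        @Override
        public void open(Configuration parameters) {
            threshold = getRuntimeContext().getState(
                new ValueStateDescriptor<>("threshold", Integer.class));
        }

        @Override
        public void flatMap1(Tuple2<String, Integer> update, Collector<Tuple2<String, Integer>> out) throws Exception {
            threshold.update(update.f1);    // first input updates the shared state
        }

        @Override
        public void flatMap2(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) throws Exception {
            Integer t = threshold.value();  // second input reads the shared state
            if (t != null && value.f1 > t) {
                out.collect(value);         // forward values above the current threshold
            }
        }
    });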



12.CoMap, CoFlatMap
ConnectedStreams → DataStream
Similar to map and flatMap, but on a connected data stream.

[Java] 
connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() {
    @Override
    public Boolean map1(Integer value) {
        return true;
    }
 
    @Override
    public Boolean map2(String value) {
        return false;
    }
});
connectedStreams.flatMap(new CoFlatMapFunction<Integer, String, String>() {
 
   @Override
   public void flatMap1(Integer value, Collector<String> out) {
       out.collect(value.toString());
   }
 
   @Override
   public void flatMap2(String value, Collector<String> out) {
       for (String word: value.split(" ")) {
         out.collect(word);
       }
   }
});


 

[Scala] 
connectedStreams.map(
    (_ : Int) => true,
    (_ : String) => false
)
// the flatMap variants must return a collection of output elements
connectedStreams.flatMap(
    (num : Int) => Seq(num.toString),
    (str : String) => str.split(" ").toSeq
)



13.Split
DataStream → SplitStream
Splits the stream into two or more streams according to some criterion.

[Java] 
SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<String>();
        if (value % 2 == 0) {
            output.add("even");
        }
        else {
            output.add("odd");
        }
        return output;
    }
});


 

[Scala] 
val split = someDataStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case 1 => List("odd")
    }
)



14.Select
SplitStream → DataStream
Selects one or more streams from a split stream.

[Java] 
SplitStream<Integer> split;
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even","odd");

 

[Scala] 
val even = split select "even"
val odd = split select "odd"
val all = split.select("even","odd")



15.Iterate
DataStream → IterativeStream → DataStream
Creates a "feedback" loop in the flow by redirecting the output of one operator to an earlier operator. This is especially useful for defining algorithms that continuously update a model. The following code starts with a stream and applies the iteration body continuously. Elements greater than 0 are sent back to the feedback channel, while the remaining elements are forwarded downstream. See the iterations documentation for a complete description.

[Java] 
IterativeStream<Long> iteration = initialStream.iterate();
DataStream<Long> iterationBody = iteration.map (/*do something*/);
DataStream<Long> feedback = iterationBody.filter(new FilterFunction<Long>(){
    @Override
    public boolean filter(Long value) throws Exception {
        return value > 0;
    }
});
iteration.closeWith(feedback);
DataStream<Long> output = iterationBody.filter(new FilterFunction<Long>(){
    @Override
    public boolean filter(Long value) throws Exception {
        return value <= 0;
    }
});

 

[Scala] 
initialStream.iterate {
  iteration => {
    val iterationBody = iteration.map {/*do something*/}
    (iterationBody.filter(_ > 0), iterationBody.filter(_ <= 0))
  }
}



16.Extract Timestamps
DataStream → DataStream
Extracts timestamps from records in order to work with windows that use event time semantics; a sketch of the watermark-assigning variant follows the listings below.

[Java] 
stream.assignTimestamps (new TimeStampExtractor() {...});

 

[Scala] 
stream.assignTimestamps { timestampExtractor }
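
As a sketch of the commonly used variant that assigns timestamps and watermarks together (the Tuple2<String, Long> element type with the event time in field f1, and the 5-second out-of-orderness bound, are assumptions for illustration):

[Java]
DataStream<Tuple2<String, Long>> withTimestamps = stream.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(Tuple2<String, Long> element) {
            // use the record's Long field as the event-time timestamp (milliseconds)
            return element.f1;
        }
    });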

 

 

 
