Chapter 2 Data Processing Using the DataStream API: Window

Window

The window function groups the records of an existing KeyedDataStream by time or by other conditions. The following transformation emits groups of records over a tumbling time window of 10 seconds:

In Java:

inputStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(10)));

In Scala:

inputStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(10)))

Flink defines slices of data in order to process (potentially) infinite data streams. These slices are called windows. Slicing makes it possible to process the data in chunks by applying transformations to each chunk. To window a stream, we need to assign a key on which the data can be distributed, and a function that describes which transformation to perform on the windowed stream.
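
To make the two ingredients concrete (a key to partition by and a function to apply per window), here is a minimal plain-Java sketch that does not use Flink itself. The class name, the event layout and the sample data are assumptions made for illustration. It groups events by key and by 10-second window, then sums the values in each group.

```java
import java.util.Map;
import java.util.TreeMap;

class KeyedWindowSumSketch {
    static final long WINDOW_MS = 10_000L; // 10-second windows, as in the text

    // Each event is {key, timestampMs, value} (a hypothetical layout).
    // Returns a map from "key@windowStart" to the sum of the values in that group.
    static Map<String, Long> sumPerKeyAndWindow(long[][] events) {
        Map<String, Long> sums = new TreeMap<>();
        for (long[] e : events) {
            // Start of the window that contains this (non-negative) timestamp.
            long windowStart = e[1] - (e[1] % WINDOW_MS);
            String group = e[0] + "@" + windowStart;
            sums.merge(group, e[2], Long::sum); // the "function" applied per window
        }
        return sums;
    }

    public static void main(String[] args) {
        long[][] events = {
            {1, 1_000, 5}, {1, 9_000, 7}, {1, 12_000, 1}, {2, 3_000, 4}
        };
        System.out.println(sumPerKeyAndWindow(events));
    }
}
```

In this sample, the two events of key 1 that arrive before the 10-second mark end up in one group, while the event at 12 seconds opens a second group for the same key.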

To slice streams into windows, we can use Flink's pre-implemented window assigners: tumbling windows, sliding windows, global windows, and session windows. Flink also allows you to write custom window assigners by extending the WindowAssigner class. Let's look at how the built-in assigners work.

Global windows

Global windows never end unless a trigger says otherwise. In this case, all elements with the same key are assigned to one single per-key global window. If we don't specify a trigger, no computation will ever be triggered.
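
The role of the trigger can be illustrated with a small plain-Java model that does not use Flink. It assumes a count-based trigger that fires after a fixed number of elements per key and then clears the buffer. The class name, the fire-and-purge behaviour and the sample values are assumptions of this sketch, not a description of Flink's internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GlobalWindowCountTriggerSketch {
    private final int fireEvery;
    // One never-ending buffer ("global window") per key.
    private final Map<String, List<Integer>> buffers = new HashMap<>();

    GlobalWindowCountTriggerSketch(int fireEvery) {
        this.fireEvery = fireEvery;
    }

    // Adds an element. Returns the buffered elements when the trigger fires, otherwise null.
    List<Integer> add(String key, int value) {
        List<Integer> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        if (buffer.size() < fireEvery) {
            return null; // the trigger has not fired, so nothing is computed yet
        }
        List<Integer> fired = new ArrayList<>(buffer);
        buffer.clear(); // fire and purge (an assumption of this sketch)
        return fired;
    }

    public static void main(String[] args) {
        GlobalWindowCountTriggerSketch window = new GlobalWindowCountTriggerSketch(3);
        System.out.println(window.add("a", 1));
        System.out.println(window.add("a", 2));
        System.out.println(window.add("a", 3));
    }
}
```

Without the size check, the buffer would simply grow forever and never produce a result, which is the situation the text describes for a global window with no trigger.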

Tumbling windows (non-overlapping)

Tumbling windows are created based on time. They have a fixed length and do not overlap. Tumbling windows are useful when you need to compute over the elements that fall within a specific period. For example, a tumbling window of 10 minutes can be used to compute over the group of events occurring in each 10-minute interval.
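
Because tumbling windows have a fixed length and never overlap, each timestamp belongs to exactly one window, and its bounds can be computed with simple arithmetic. The following plain-Java sketch (not Flink code; the class and method names are illustrative, and timestamps are assumed to be non-negative milliseconds) shows the idea.

```java
class TumblingWindowSketch {
    // Returns {start, end} of the window [start, end) that contains the timestamp.
    static long[] windowFor(long timestampMs, long sizeMs) {
        long start = timestampMs - (timestampMs % sizeMs);
        return new long[] {start, start + sizeMs};
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        // An event 12 minutes after time zero falls into the second 10-minute window.
        long[] window = windowFor(12 * 60 * 1000L, tenMinutes);
        System.out.println(window[0] + " to " + window[1]);
    }
}
```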

Sliding windows (overlapping)

Sliding windows are like tumbling windows, but they overlap. They are fixed-length windows that overlap the previous ones by an amount determined by a user-supplied window slide parameter. This type of windowing is useful when you want to compute something over a group of events occurring in a certain time frame and to update that result at every slide interval.
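
Since sliding windows overlap, one element can belong to several windows at once. The plain-Java sketch below (not Flink code; names and sample values are assumptions) lists the start times of every window of a given size and slide that contains a timestamp.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowSketch {
    // Returns the start times of all windows [start, start + sizeMs) that contain
    // the (non-negative) timestamp, newest window first.
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - (timestampMs % slideMs); // latest window already open
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 10-second windows sliding every 5 seconds: each element lands in two windows.
        System.out.println(windowStartsFor(12_000L, 10_000L, 5_000L));
    }
}
```

With a window size of 10 seconds and a slide of 5 seconds, the element at 12 seconds belongs both to the window starting at 10 seconds and to the one starting at 5 seconds.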

Session windows

Session windows are useful when window boundaries need to be decided based on the input data. Session windows allow flexibility in the window start time and window size. We can also provide a session gap configuration parameter, which indicates how long to wait without new input before considering the session closed.
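
The effect of the session gap can be sketched in plain Java without Flink. The example assumes timestamps that are already sorted and starts a new session whenever the pause since the previous event is longer than the gap. The class name and the exact handling of a pause equal to the gap are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class SessionWindowSketch {
    // Groups ascending timestamps into sessions. A new session starts when the
    // pause since the previous event is longer than gapMs.
    static List<List<Long>> sessions(long[] sortedTimestampsMs, long gapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestampsMs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMs) {
                result.add(current); // the gap elapsed, so the session is closed
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000L, 2_000L, 8_000L, 9_000L};
        System.out.println(sessions(timestamps, 3_000L));
    }
}
```

Here neither the start time nor the length of a session is fixed in advance; both follow from where the pauses in the input happen to fall.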

WindowAll

The windowAll function allows the grouping of regular (non-keyed) data streams. Generally this is a non-parallel data transformation, as it runs on non-partitioned streams of data.

In Java:

inputStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(10)));

In Scala:

inputStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(10)))

Similar to the regular data stream functions, there are window data stream functions as well. The only difference is that they work on windowed data streams. So Window reduce works like the Reduce function, Window fold works like the Fold function, and there are aggregations as well.

Union

The Union function combines two or more data streams into one. The streams are combined in parallel. If we union a stream with itself, each record is output twice.

In Java:

inputStream.union(inputStream1, inputStream2, ...);

In Scala:

inputStream.union(inputStream1, inputStream2, ...)
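
The remark about a stream being unioned with itself can be checked with a small plain-Java model that uses lists in place of streams. It is not Flink code: the sketch simply appends the inputs one after another, and that ordering is an assumption of the sketch rather than a property of a parallel union.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnionSketch {
    // Combines the first input with any number of further inputs, keeping duplicates.
    @SafeVarargs
    static <T> List<T> union(List<T> first, List<T>... others) {
        List<T> merged = new ArrayList<>(first);
        for (List<T> other : others) {
            merged.addAll(other);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        // Union of a stream with itself: every record appears twice.
        System.out.println(union(input, input));
    }
}
```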

Window join

We can also join two data streams on some keys within a common window. The following example joins two streams in a window of 5 seconds, where the join condition is that the first attribute of the first stream equals the second attribute of the other stream:

In Java:

inputStream.join(inputStream1)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .apply(new JoinFunction() {...});

In Scala:

inputStream.join(inputStream1)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .apply {...}
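
What this join produces can be modelled in plain Java without Flink. In the sketch below, every row is an array of the form {attribute0, attribute1, timestampMs}; this layout, the class name and the sample rows are assumptions made for illustration. Two rows are paired only when the first attribute of the left row equals the second attribute of the right row and both timestamps fall into the same 5-second tumbling window.

```java
import java.util.ArrayList;
import java.util.List;

class WindowJoinSketch {
    static final long WINDOW_MS = 5_000L; // 5-second tumbling windows, as in the text

    // Every row is {attribute0, attribute1, timestampMs} with a non-negative timestamp.
    static List<String> join(long[][] left, long[][] right) {
        List<String> joined = new ArrayList<>();
        for (long[] l : left) {
            for (long[] r : right) {
                boolean sameKey = l[0] == r[1]; // where(0).equalTo(1)
                boolean sameWindow = l[2] / WINDOW_MS == r[2] / WINDOW_MS;
                if (sameKey && sameWindow) {
                    joined.add("(" + l[0] + "," + l[1] + ")+(" + r[0] + "," + r[1] + ")");
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        long[][] left = {{7, 100, 1_000}, {7, 101, 6_000}};
        long[][] right = {{200, 7, 2_000}, {201, 8, 2_500}};
        System.out.println(join(left, right));
    }
}
```

In the sample data, the second left row has a matching key but arrives in the next window, so it is not joined. A real streaming join would buffer rows per window instead of scanning all pairs; the nested loop is only meant to show which pairs qualify.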

Split

This function splits a stream into two or more streams based on some criteria. It is useful when you receive a mixed stream and want to process each kind of data separately.

In Java:

SplitStream<Integer> split = inputStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        // Tag each value with the name of the output it belongs to.
        List<String> output = new ArrayList<String>();
        if (value % 2 == 0) {
            output.add("even");
        } else {
            output.add("odd");
        }
        return output;
    }
});

In Scala:

val split = inputStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case _ => List("odd") // also covers negative odd numbers, where num % 2 is -1
    }
)

Select

This function allows you to select a specific stream from the split stream.

In Java:

SplitStream<Integer> split;
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even", "odd");

In Scala:

val even = split select "even"
val odd = split select "odd"
val all = split.select("even", "odd")
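
How Split and Select work together can be modelled in plain Java without Flink: splitting tags every element with one or more output names, and selecting keeps the elements that carry at least one of the requested names. The class and method names below are assumptions made for illustration, with a list standing in for the stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SplitSelectSketch {
    // Mirrors the selector in the text: each value is tagged "even" or "odd".
    static List<String> tagsFor(int value) {
        return Collections.singletonList(value % 2 == 0 ? "even" : "odd");
    }

    // Keeps the values that carry at least one of the requested tags.
    static List<Integer> select(List<Integer> input, String... names) {
        List<String> wanted = Arrays.asList(names);
        List<Integer> selected = new ArrayList<>();
        for (int value : input) {
            if (!Collections.disjoint(tagsFor(value), wanted)) {
                selected.add(value);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(select(input, "even"));
        System.out.println(select(input, "even", "odd"));
    }
}
```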

Project

The Project function allows you to select a subset of attributes from the event stream and sends only the selected attributes to the next processing stream.

In Java:

DataStream<Tuple4<Integer, Double, String, String>> in = //{....}
DataStream<Tuple2<String, String>> out = in.project(3, 2);

In Scala:

var in: DataStream[(Int, Double, String, String)] = //{....}
var out = in.project(3, 2)

The preceding function selects the attributes at positions 3 and 2 (zero-based), in that order, from the given records. The following are sample input and output records:

(1, 10.0, A, B) => (B, A)
(2, 20.0, C, D) => (D, C)
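
The mapping from input to output records can be reproduced with a few lines of plain Java. This is not Flink code: the sketch uses an Object array in place of a tuple, and the class and method names are illustrative. It copies the fields at the given zero-based positions, in the order in which they are listed.

```java
import java.util.Arrays;

class ProjectSketch {
    // Builds a new record from the fields at the given zero-based positions.
    static Object[] project(Object[] record, int... fieldIndexes) {
        Object[] projected = new Object[fieldIndexes.length];
        for (int i = 0; i < fieldIndexes.length; i++) {
            projected[i] = record[fieldIndexes[i]];
        }
        return projected;
    }

    public static void main(String[] args) {
        Object[] record = {1, 10.0, "A", "B"};
        // Positions 3 and 2 select the fourth and third fields, in that order.
        System.out.println(Arrays.toString(project(record, 3, 2)));
    }
}
```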