Chapter 2. Data Processing Using the DataStream API: Partitioning

Physical partitioning

Flink allows us to perform physical partitioning of the stream data. You also have the option to provide custom partitioning. Let us have a look at the different types of partitioning.

Custom partitioning

As mentioned earlier, you can provide a custom implementation of a partitioner.

In Java:

inputStream.partitionCustom(partitioner, "somekey");
inputStream.partitionCustom(partitioner, 0);

In Scala:

inputStream.partitionCustom(partitioner, "somekey")
inputStream.partitionCustom(partitioner, 0)

While writing a custom partitioner, you need to make sure you implement an efficient hash function.
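Since a partitioner is essentially a function from a key and a partition count to a partition index, the idea can be sketched without a Flink dependency. The nested `Partitioner` interface below is a local stand-in that mirrors the shape of Flink's `Partitioner<K>` interface; the class name and example keys are illustrative, not Flink's actual API.

```java
// A minimal, dependency-free sketch of a custom hash partitioner.
// The local Partitioner interface mirrors the shape of Flink's
// org.apache.flink.api.common.functions.Partitioner<K>.
class CustomPartitionerSketch {

    // Stand-in for Flink's Partitioner<K> interface.
    interface Partitioner<K> {
        int partition(K key, int numPartitions);
    }

    // An efficient hash-based partitioner: cheap to compute and spreads
    // keys evenly. Math.floorMod keeps the result non-negative even when
    // hashCode() is negative.
    static final Partitioner<String> HASH_PARTITIONER =
            (key, numPartitions) -> Math.floorMod(key.hashCode(), numPartitions);

    public static void main(String[] args) {
        int numPartitions = 4;
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            int p = HASH_PARTITIONER.partition(key, numPartitions);
            System.out.println(key + " -> partition " + p);
        }
    }
}
```

Note that the same key always lands in the same partition, which is what makes hash partitioning suitable for keyed processing.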

Random partitioning

Random partitioning distributes the data stream across partitions randomly, in an even manner.
In Java:

inputStream.shuffle();

In Scala:

inputStream.shuffle()
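Conceptually, shuffle() assigns each record to a uniformly random downstream partition, which evens out the load over many records. The `ShuffleSketch` class below is a hypothetical, dependency-free illustration of that idea, not Flink's internal implementation:

```java
import java.util.Random;

// Sketch of random partitioning: each record is independently sent to a
// uniformly random partition, so counts even out over many records.
class ShuffleSketch {
    private static final Random RANDOM = new Random();

    // Pick a random channel for one record.
    static int selectChannel(int numPartitions) {
        return RANDOM.nextInt(numPartitions);
    }

    public static void main(String[] args) {
        int[] counts = new int[4];
        for (int i = 0; i < 100_000; i++) {
            counts[selectChannel(4)]++;
        }
        // Each partition receives roughly a quarter of the records.
        for (int p = 0; p < 4; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " records");
        }
    }
}
```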

Rebalancing partitioning

This type of partitioning helps distribute the data evenly, using a round-robin method for distribution. It is a good choice when the data is skewed.

In Java:

inputStream.rebalance();

In Scala:

inputStream.rebalance()
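The round-robin idea behind rebalance() can be sketched in a few lines. `RebalanceSketch` below is a hypothetical illustration of the distribution scheme, not Flink's code: a counter cycles through the partitions, so every partition receives the same share no matter how skewed the input keys are.

```java
// Sketch of round-robin rebalancing: a cycling counter guarantees an
// exactly even spread of records across partitions.
class RebalanceSketch {
    private int nextChannel = 0;

    // Each call returns the next partition in cyclic order.
    int selectChannel(int numPartitions) {
        int channel = nextChannel;
        nextChannel = (nextChannel + 1) % numPartitions;
        return channel;
    }

    public static void main(String[] args) {
        RebalanceSketch sketch = new RebalanceSketch();
        int[] counts = new int[3];
        for (int i = 0; i < 9; i++) {
            counts[sketch.selectChannel(3)]++;
        }
        // 9 records over 3 partitions: each partition gets exactly 3.
        for (int p = 0; p < 3; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " records");
        }
    }
}
```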

Rescaling

Rescaling is used to distribute the data across operations, perform transformations on subsets of the data, and combine them together. This rebalancing happens over a single node only, hence it does not require any data transfer across the network.
The following diagram shows the distribution:


In Java:

inputStream.rescale();

In Scala:

inputStream.rescale()
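One way to picture the rescale() wiring: each upstream subtask deals only with its own contiguous slice of downstream subtasks, which is what lets the exchange stay local. `RescaleSketch` below is an assumed, simplified model for the case where the downstream parallelism is an exact multiple of the upstream parallelism; Flink's actual channel assignment may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of rescale() wiring (assumed scheme, not Flink's code):
// each upstream subtask is connected to its own contiguous slice of
// downstream subtasks instead of to all of them.
class RescaleSketch {

    // Downstream subtasks assigned to one upstream subtask, assuming the
    // downstream parallelism divides evenly by the upstream parallelism.
    static List<Integer> downstreamFor(int upstreamIndex, int upstream, int downstream) {
        int perUpstream = downstream / upstream;
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < perUpstream; i++) {
            targets.add(upstreamIndex * perUpstream + i);
        }
        return targets;
    }

    public static void main(String[] args) {
        // 2 upstream subtasks, 4 downstream subtasks:
        // upstream 0 -> [0, 1], upstream 1 -> [2, 3]
        for (int u = 0; u < 2; u++) {
            System.out.println("upstream " + u + " -> " + downstreamFor(u, 2, 4));
        }
    }
}
```

Contrast this with rebalance(), where every upstream subtask round-robins over all downstream subtasks and therefore crosses the network.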

Broadcasting

Broadcasting distributes all records to each partition, fanning out each and every element to all partitions.

In Java:

inputStream.broadcast();

In Scala:

inputStream.broadcast()
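The effect of broadcast() can be sketched as copying the full stream into every partition. `BroadcastSketch` below is a hypothetical illustration of that fan-out, not Flink's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of broadcasting: every partition receives a copy of every record,
// so the total number of emitted records is numRecords * numPartitions.
class BroadcastSketch {

    static List<List<String>> broadcast(List<String> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            // Each partition gets its own copy of the full stream.
            partitions.add(new ArrayList<>(records));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> out = broadcast(List.of("a", "b"), 3);
        System.out.println(out); // every partition holds both records
    }
}
```

Because of this duplication, broadcasting is typically reserved for small streams, such as configuration or rule updates that every parallel instance needs to see.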

Data sinks

After the data transformations are done, we need to save the results somewhere. The following are some of the options Flink provides to save the results:

  • writeAsText(): Writes records one line at a time as strings.
  • writeAsCsv(): Writes tuples as comma-separated value files. Row and field delimiters can also be configured.
  • print()/printErr(): Writes records to the standard output. You can also choose to write to the standard error.
  • writeUsingOutputFormat(): You can also choose to provide a custom output format. While defining the custom format, you need to extend the OutputFormat, which takes care of serialization and deserialization.
  • writeToSocket(): Flink supports writing data to a specific socket as well. It is required to define a SerializationSchema for proper serialization and formatting.
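The per-record formatting that writeAsCsv() performs can be sketched as joining fields and rows with configurable delimiters. `CsvSinkSketch` below is a hypothetical illustration of that formatting step only, not Flink's sink implementation; the delimiter parameters stand in for the configurable row and field delimiters the text mentions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of CSV formatting with configurable delimiters: join the fields
// of each row with the field delimiter, then join rows with the row
// delimiter.
class CsvSinkSketch {

    static String toCsv(List<List<String>> rows, String fieldDelim, String rowDelim) {
        return rows.stream()
                .map(fields -> String.join(fieldDelim, fields))
                .collect(Collectors.joining(rowDelim));
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("1", "alice"),
                List.of("2", "bob"));
        System.out.println(toCsv(rows, ",", "\n"));
    }
}
```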