Chapter 2 Data Processing Using the DataStream API

Real-time analytics is currently an important issue. Many different domains need to process data in real time. So far there have been multiple technologies trying to provide this capability. Technologies such as Storm and Spark have been on the market for a long time now. Applications derived from the Internet of Things (IoT) need data to be stored, processed, and analyzed in real or near real time. In order to cater for such needs, Flink provides a streaming data processing API called the DataStream API.

In this chapter, we are going to look at the details relating to DataStream API, covering the following topics:

  • Execution environment
  • Data sources
  • Transformations
  • Data sinks
  • Connectors
  • Use case - sensor data analytics

Any Flink program works on a certain defined anatomy as follows:


We will be looking at each step and how we can use DataStream API with this anatomy.
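To make this anatomy concrete before diving into each step, here is a minimal Java sketch that wires the stages together (the host name, port, and the simple print sink are placeholders chosen for illustration):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AnatomySketch {
    public static void main(String[] args) throws Exception {
        // 1. Get an execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Create a data source (a socket source on a placeholder host/port)
        DataStream<String> lines = env.socketTextStream("localhost", 9000);

        // 3. Apply a transformation
        DataStream<String> upperCased = lines.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) throws Exception {
                return value.toUpperCase();
            }
        });

        // 4. Write the result to a data sink (here, standard output)
        upperCased.print();

        // 5. Trigger the program execution
        env.execute("Anatomy sketch");
    }
}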

Execution environment

In order to start writing a Flink program, we first need to get an existing execution environment or create one.
Depending upon what you are trying to do, Flink supports:

  • Getting an already existing Flink environment
  • Creating a local environment
  • Creating a remote environment

Typically, you only need to use getExecutionEnvironment(). This will do the right thing based on your context. If you are executing on a local environment in an IDE, then it will start a local execution environment. Otherwise, if you are executing the JAR, then the Flink cluster manager will execute the program in a distributed manner.
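In Java, for example, this is usually the first statement of the program:

// org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();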

If you want to create a local or remote environment on your own, then you can also choose to do so by using these methods:

  • createLocalEnvironment()
  • createRemoteEnvironment(String host, int port, String... jarFiles)
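As a rough sketch (the host name, port, and JAR path below are placeholders, not values from this chapter):

// Local environment with an explicit parallelism of 2
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(2);

// Remote environment pointing at a JobManager, shipping the job JAR with the request
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment(
        "flink-master-host", 6123, "/path/to/your-job.jar");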

Data sources

Sources are places where the Flink program expects to get its data from. This is the second step in the Flink program's anatomy. Flink supports a number of pre-implemented data source functions. It also supports writing custom data source functions, so anything that is not supported can be programmed easily. First let's try to understand the built-in source functions.

Socket-based

DataStream API supports reading data from a socket. You just need to specify the host and port to read the data from and it will do the work:

socketTextStream(hostName, port); // the default delimiter is "\n"

You can also choose to specify the delimiter:

socketTextStream(hostName, port, delimiter)

You can also specify the maximum number of times the API should try to fetch the data:

socketTextStream(hostName, port, delimiter, maxRetry)
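Putting this together, here is a hedged sketch of a socket source wired into the environment (the host name and port are placeholders):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read newline-delimited text from the socket, retrying the connection up to 3 times
DataStream<String> socketLines = env.socketTextStream("sensor-gateway-host", 9999, "\n", 3);
socketLines.print();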

File-based

You can also choose to stream data from a file source using file-based source functions in Flink. You can use readTextFile(String path) to stream data from the file specified in the path. By default it uses TextInputFormat and reads strings line by line.

If the file format is other than text, you can specify the same using these functions:

readFile(FileInputFormat<OUT> inputFormat, String path)
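For instance, a hedged sketch of both variants, assuming env is the StreamExecutionEnvironment obtained earlier (the paths are placeholders):

// Plain text file, read line by line with the default TextInputFormat
DataStream<String> textLines = env.readTextFile("/tmp/input/data.txt");

// Explicit input format (org.apache.flink.api.java.io.TextInputFormat is used
// here only as a stand-in; any FileInputFormat implementation can be passed)
DataStream<String> formattedLines = env.readFile(
        new TextInputFormat(new Path("/tmp/input/")), "/tmp/input/");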

Flink also supports reading file streams as they are produced, using the readFileStream() function:

// Note: readFileStream() is deprecated; use readFile(FileInputFormat, String, FileProcessingMode, long) instead
readFileStream(String filePath, long intervalMillis, FileMonitoringFunction.WatchType watchType)

For reference, here are the relevant enums from the Flink source code:


/**
 * The mode in which the {@link ContinuousFileMonitoringFunction} operates.
 * This can be either {@link #PROCESS_ONCE} or {@link #PROCESS_CONTINUOUSLY}.
 */
@PublicEvolving
public enum FileProcessingMode {

    /** Processes the current contents of the path and exits. */
    PROCESS_ONCE,

    /** Periodically scans the path for new data. */
    PROCESS_CONTINUOUSLY
}


/**
     * The watch type of the {@code FileMonitoringFunction}.
     */
    public enum WatchType {
        ONLY_NEW_FILES, // Only new files will be processed.
        REPROCESS_WITH_APPENDED, // When some files are appended, all contents
                                    // of the files will be processed.
        PROCESS_ONLY_APPENDED // When some files are appended, only appended
                                // contents will be processed.
    }

You just need to specify the file path, the polling interval at which the file path should be polled, and the watch type. Watch types consist of three types:

  • FileMonitoringFunction.WatchType.ONLY_NEW_FILES is used when the system should process only new files
  • FileMonitoringFunction.WatchType.PROCESS_ONLY_APPENDED is used when the system should process only the appended contents of files
  • FileMonitoringFunction.WatchType.REPROCESS_WITH_APPENDED is used when the system should re-process not only the appended contents of files but also the previous content in the files

If the file is not a text file, then we do have an option to use the following function, which lets us define the file input format:

readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)

Internally, it divides the file reading task into two sub-tasks. One sub-task only monitors the file path based on the given WatchType. The second sub-task does the actual file reading in parallel. The sub-task that monitors the file path is a non-parallel sub-task. Its job is to keep scanning the file path based on the polling interval, report the files to be processed, split the files, and assign the splits to the respective downstream threads.
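Building on the FileProcessingMode enum shown above, here is a hedged sketch of continuously monitoring a directory, again assuming env is the StreamExecutionEnvironment obtained earlier (the path and polling interval are placeholders):

// FileProcessingMode lives in org.apache.flink.streaming.api.functions.source
TextInputFormat format = new TextInputFormat(new Path("/tmp/incoming/"));
DataStream<String> monitored = env.readFile(
        format,                                   // how to parse the files
        "/tmp/incoming/",                         // directory to monitor
        FileProcessingMode.PROCESS_CONTINUOUSLY,  // keep scanning for new data
        10000L);                                  // polling interval in milliseconds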

Transformations

Data transformations transform the data stream from one form into another. The input could be one or more data streams and the output could also be zero, or one or more data streams. Now let's try to understand each transformation one by one.

Map

This is one of the simplest transformations, where the input is one data stream and the output is also one data stream.

In Java:

inputStream.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer value) throws Exception {
        return 5 * value;
    }
});

In Scala:

inputStream.map { x => x * 5 }

FlatMap

FlatMap takes one record and outputs zero, one, or more than one record.

In Java:

inputStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});

In Scala:

inputStream.flatMap { str => str.split(" ") }

Filter

Filter functions evaluate the conditions and then, if the result is true, emit the record. Filter functions can output zero records.

In Java:

inputStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 1;
    }
});

In Scala:

inputStream.filter { _ != 1 }

KeyBy

KeyBy logically partitions the stream based on the key. Internally it uses hash functions to partition the stream. It returns a KeyedDataStream.

In Java:

inputStream.keyBy("someKey");

In Scala:

inputStream.keyBy("someKey")

Reduce

Reduce rolls out the KeyedDataStream by reducing the last reduced value with the current value. The following code does a sum reduce of a KeyedDataStream.

In Java:

keyedInputStream.reduce(new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer value1, Integer value2) throws Exception {
        return value1 + value2;
    }
});

In Scala:

keyedInputStream.reduce { _ + _ }

Fold

Fold rolls out the KeyedDataStream by combining the last folded value with the current record. It emits a data stream back.
In Java:

keyedInputStream.fold("Start", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String current, Integer value) {
        return current + "=" + value;
    }
});

In Scala:

keyedInputStream.fold("Start")((str, i) => str + "=" + i)

The preceding function, when applied on a stream of (1,2,3,4,5), would emit a stream like this: Start=1=2=3=4=5

Aggregations

DataStream API supports various aggregations such as min, max, sum, and so on. These functions can be applied on a KeyedDataStream in order to get rolling aggregations.

In Java

keyedInputStream.sum(0)
keyedInputStream.sum("key")
keyedInputStream.min(0)
keyedInputStream.min("key")
keyedInputStream.max(0)
keyedInputStream.max("key")
keyedInputStream.minBy(0)
keyedInputStream.minBy("key")
keyedInputStream.maxBy(0)
keyedInputStream.maxBy("key")

In Scala:

keyedInputStream.sum(0)
keyedInputStream.sum("key")
keyedInputStream.min(0)
keyedInputStream.min("key")
keyedInputStream.max(0)
keyedInputStream.max("key")
keyedInputStream.minBy(0)
keyedInputStream.minBy("key")
keyedInputStream.maxBy(0)
keyedInputStream.maxBy("key")

The difference between max and maxBy is that max returns the maximum value in a stream, but maxBy returns the element that has the maximum value. The same applies to min and minBy.
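To make the distinction concrete, consider a hypothetical keyed stream of Tuple3<String, String, Integer> records keyed on the first field (the values below are illustrative, not from this chapter):

// Input, in arrival order: ("a", "x", 1), ("a", "y", 3)

keyedInputStream.max(2);
// Rolling output: ("a", "x", 1), ("a", "x", 3)
// The aggregated field carries the running maximum, but the other
// non-key field still comes from the first record that was seen.

keyedInputStream.maxBy(2);
// Rolling output: ("a", "x", 1), ("a", "y", 3)
// The entire record holding the maximum value is emitted.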
