Chapter 1: Introduction to Apache Flink

Checkpointing

Checkpointing is the backbone of Flink's consistent fault tolerance. It keeps taking consistent snapshots of the distributed data streams and executor states. It is inspired by the Chandy-Lamport algorithm but has been modified for Flink's tailored requirements. Details about the Chandy-Lamport algorithm can be found at: http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf.

The exact implementation details of snapshotting are provided in the following research paper: Lightweight Asynchronous Snapshots for Distributed Dataflows
(http://arxiv.org/abs/1506.08603)

The fault-tolerant mechanism keeps creating lightweight snapshots of the data flows, so they continue functioning without any significant overhead. The state of the data flow is generally kept in a configured location such as HDFS.
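Where these snapshots are kept is configurable through the state backend. A minimal sketch, assuming the Flink 1.x Scala DataStream API (the HDFS path here is illustrative):

```scala
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Store checkpoint snapshots in a durable, configured location such as HDFS.
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))
```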

In the case of any failure, Flink stops the executors, resets them, and starts executing from the latest available checkpoint.
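How many times Flink retries a failed job, and how long it waits between attempts, is configurable through a restart strategy. A minimal sketch, assuming the Flink 1.x API (the retry count and delay are illustrative):

```scala
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Retry the job up to 3 times, waiting 10 seconds between attempts;
// each attempt resumes from the latest available checkpoint.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10000))
```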

Stream barriers are core elements of Flink's snapshots. They are injected into the data streams without affecting the flow. Barriers never overtake the records. They group sets of records into a snapshot. Each barrier carries a unique ID. The following diagram shows how the barriers are injected into the data stream for snapshots:

Each snapshot state is reported to the checkpoint coordinator in Flink's Job Manager. While drawing snapshots, Flink handles the alignment of records in order to avoid reprocessing the same records because of a failure. This alignment generally takes a few milliseconds. But for some intense applications, where even millisecond latency is unacceptable, there is an option to choose low latency over exactly-once processing. By default, Flink processes each record exactly once. If an application needs low latency and can live with at-least-once delivery, we can switch off that trigger. This skips the alignment and improves latency.
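Both the checkpoint interval and the delivery guarantee are set on the execution environment. A minimal sketch, assuming the Flink 1.x API (the 10-second interval is illustrative):

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Take a consistent snapshot every 10 seconds (interval in milliseconds).
env.enableCheckpointing(10000)
// Skip barrier alignment for lower latency, accepting at-least-once
// delivery; the default is CheckpointingMode.EXACTLY_ONCE.
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE)
```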

Task manager

Task Managers are worker nodes that execute tasks in one or more threads in a JVM. The parallelism of task execution is determined by the task slots available on each Task Manager. Each task slot represents a fixed share of the Task Manager's resources. For example, if a Task Manager has four slots, then each slot is allocated 25% of the memory. One or more threads may run in a task slot. Threads in the same slot share the same JVM. Tasks in the same JVM share TCP connections and heartbeat messages:
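The number of slots is set per Task Manager in flink-conf.yaml. A minimal fragment (the values are illustrative):

```yaml
# Four task slots per Task Manager: each slot receives a quarter of the
# Task Manager's managed memory.
taskmanager.numberOfTaskSlots: 4
# Default parallelism for jobs that do not set their own.
parallelism.default: 4
```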

Job client

The Job client is not an internal part of Flink's program execution, but it is the starting point of the execution. The Job client is responsible for accepting the program from the user, creating a data flow, and then submitting the data flow to the Job Manager for further execution. Once the execution is completed, the Job client returns the results to the user.
A data flow is a plan of execution. Consider a very simple word count program:

val text = env.readTextFile("input.txt")                     // Source
val counts = text.flatMap { _.toLowerCase.split("\\W+").filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)                                                    // Transformation
counts.writeAsCsv("output.txt", "\n", " ")                   // Sink

When the client accepts a program from the user, it transforms it into a data flow. The data flow for the aforementioned program may look like this:

The preceding diagram shows how a program gets transformed into a data flow. Flink data flows are parallel and distributed by default. For parallel data processing, Flink partitions the operators and streams. Operator partitions are called sub-tasks. Streams can distribute the data in a one-to-one or a re-distributed manner.

The data flows directly from the source to the map operators, as there is no need to shuffle the data. But for a GroupBy operation, Flink may need to redistribute the data by keys in order to get the correct results:

Features

In the earlier sections, we tried to understand the Flink architecture and its execution model. Thanks to its robust architecture, Flink offers a wide range of features.

High performance

Flink is designed to achieve high performance and low latency. Unlike other streaming frameworks such as Spark, you don't need to do many manual configurations to get the best performance. Flink's pipelined data processing gives better performance compared to its counterparts.

Exactly-once stateful computation

As we discussed in the previous section, Flink's distributed checkpoint processing helps guarantee that each record is processed exactly once. For high-throughput applications, Flink provides a switch that allows at-least-once processing instead.

Flexible streaming windows

Flink supports data-driven windows. This means we can design a window based on time, counts, or sessions. A window can also be customized, which allows us to detect specific patterns in event streams.
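A minimal sketch of a keyed time window, assuming the Flink 1.x Scala DataStream API (the sample data, key position, and window size are illustrative; countWindow would window by counts instead):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val words = env.fromElements(("flink", 1), ("spark", 1), ("flink", 1))
// Sum the counts per word over tumbling 10-second windows.
val sums = words
  .keyBy(0)
  .timeWindow(Time.seconds(10))
  .sum(1)
```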

Fault tolerance

Flink's distributed, lightweight snapshot mechanism helps in achieving a great degree of fault tolerance. It allows Flink to provide high-throughput performance with guaranteed delivery.

Memory management

Flink is supplied with its own memory management inside a JVM which makes it independent of Java's default garbage collector. It efficiently does memory management by using hashing, indexing, caching, and sorting.

Optimizer

Flink's batch data processing API is optimized in order to avoid memory-consuming operations such as shuffle, sort, and so on. It also makes sure that caching is used in order to avoid heavy disk IO operations.

Stream and batch in one platform

Flink provides APIs for both batch and stream data processing. So once you set up the Flink environment, it can host stream and batch processing applications easily. In fact, Flink works on a streaming-first principle and treats batch processing as a special case of streaming.

Libraries

Flink has a rich set of libraries to do machine learning, graph processing, relational data processing, and so on. Because of its architecture, it is very easy to perform complex event processing and alerting. We are going to see more about these libraries in subsequent chapters.

Event time semantics

Flink supports event time semantics. This helps in processing streams where events arrive out of order. Sometimes events may come delayed. Flink's architecture allows us to define windows based on time, counts, and sessions, which helps in dealing with such scenarios.
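A minimal sketch of enabling event-time semantics, assuming the Flink 1.x API (in recent Flink versions event time is the default and this call has been removed):

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Drive windows and timers by the timestamps carried in the events
// rather than by the wall clock of the processing machines.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
```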
