Chapter 1 Introduction to Apache Flink

With distributed technologies evolving all the time, engineers are trying to push those technologies to their limits. Earlier, people were looking for faster, cheaper ways to process data. This need was satisfied when Hadoop was introduced. Everyone started using Hadoop and began replacing their ETL jobs with tools from the Hadoop ecosystem. Now that this need has been satisfied and Hadoop is used in production at so many companies, another need has arisen: to process data in a streaming manner. This gave rise to technologies such as Apache Spark and Apache Flink. Features such as fast processing engines, the ability to scale in no time, and support for machine learning and graph processing are popularizing these technologies among the developer community.

Some of you might already be using Apache Spark in your day-to-day work and might be wondering: if I have Spark, why do I need Flink? The question is quite expected and the comparison is natural. Let me try to answer it briefly. The very first thing we need to understand here is that Flink is based on the streaming-first principle, which means it is a real stream processing engine and not a fast processing engine that collects streams as mini-batches. Flink treats batch processing as a special case of streaming, whereas it is the other way around in the case of Spark. We will discover more such differences throughout this book.

This book is about one of the most promising of these technologies: Apache Flink. In this chapter, we are going to talk about the following topics:

  • History
  • Architecture
  • Distributed execution
  • Features
  • Quick start setup
  • Cluster setup
  • Running a sample application

History

Flink started as a research project named Stratosphere, with the goal of building a next-generation big data analytics platform at universities in the Berlin area. It was accepted as an Apache Incubator project on April 16, 2014. The initial versions of Stratosphere were based on a research paper on Nephele, available at http://stratosphere.eu/assets/papers/Nephele_09.pdf.

The following diagram shows how the evolution of Stratosphere happened over time:

The very first version of Stratosphere was focused on having a runtime, an optimizer, and the Java API. Later, as the platform got more mature, it started supporting execution on various local environments as well as on YARN. From version 0.6, Stratosphere was renamed Flink. The latest versions of Flink are focused on supporting various features such as batch processing, stream processing, graph processing, machine learning, and so on.

Flink 0.7 introduced the most important feature of Flink: Flink's streaming API. The initial release only had the Java API; later releases started supporting the Scala API as well. Now let's look at the current architecture of Flink in the next section.

Architecture

Flink 1.X's architecture consists of various components such as deploy, core processing, and APIs. We can easily compare the latest architecture with Stratosphere's architecture and see its evolution. The following diagram shows the components, APIs, and libraries:

Flink has a layered architecture where each component is a part of a specific layer. Each layer is built on top of the others for clear abstraction. Flink is designed to run on local machines, in a YARN cluster, or on the cloud. Runtime is Flink's core data processing engine, which receives a program through the APIs in the form of a JobGraph. A JobGraph is a simple parallel data flow with a set of tasks that produce and consume data streams.

The DataStream and DataSet APIs are the interfaces programmers use to define a job. JobGraphs are generated by these APIs when the programs are compiled. Once compiled, the DataSet API allows the optimizer to generate an optimal execution plan, while the DataStream API uses a stream builder for efficient execution plans.
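To make this concrete, here is a minimal DataStream program (a sketch; the socket host and port are placeholder values). When env.execute() is called, the dataflow defined through the API is compiled into a JobGraph and handed to the runtime:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class StreamingWordCount {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Read lines from a socket; host and port are placeholders
            DataStream<String> text = env.socketTextStream("localhost", 9999);

            DataStream<Tuple2<String, Integer>> counts = text
                    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                        @Override
                        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                            for (String word : line.toLowerCase().split("\\W+")) {
                                if (!word.isEmpty()) {
                                    out.collect(new Tuple2<>(word, 1));
                                }
                            }
                        }
                    })
                    .keyBy(0)   // key by the word field
                    .sum(1);    // running count per word

            counts.print();

            // Compiles the dataflow into a JobGraph and submits it for execution
            env.execute("Streaming WordCount");
        }
    }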

The optimized JobGraph is then submitted to the executors according to the deployment model. You can choose a local, remote, or YARN mode of deployment. If you have a Hadoop cluster already running, it is always better to use a YARN mode of deployment.
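At the API level, the choice of execution environment reflects this. The following sketch shows a local environment, which runs the job inside the current JVM, and a remote environment, which submits it to a running cluster (the host, port, and JAR path are placeholders; 6123 is Flink's default Job Manager RPC port):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class EnvironmentChoice {
        public static void main(String[] args) {
            // Runs the job inside the current JVM -- handy for development and testing
            StreamExecutionEnvironment localEnv =
                    StreamExecutionEnvironment.createLocalEnvironment();

            // Submits the job to a remote Job Manager when execute() is called;
            // the host, port, and JAR path below are placeholders
            StreamExecutionEnvironment remoteEnv =
                    StreamExecutionEnvironment.createRemoteEnvironment(
                            "jobmanager-host", 6123, "/path/to/your-job.jar");
        }
    }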

Distributed execution

Flink's distributed execution consists of two important types of process: master and worker. When a Flink program is executed, various processes take part in the execution, namely the Job Manager, the Task Managers, and the Job Client.

The following diagram shows the Flink program execution:


The Flink program needs to be submitted to a Job Client. The Job Client then submits the job to the Job Manager. It is the Job Manager's responsibility to orchestrate resource allocation and job execution. The very first thing it does is allocate the required resources. Once the resource allocation is done, the tasks are submitted to the respective Task Managers. On receiving a task, a Task Manager initiates a thread to start the execution. While the execution is in progress, the Task Managers keep reporting state changes to the Job Manager. There can be various states, such as starting the execution, in progress, or finished. Once the job execution is complete, the results are sent back to the client.
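As a small illustration of this round trip, the following batch sketch uses collect(), which submits the program through the Job Client, blocks until the Job Manager reports the job as finished, and then returns the results to the client:

    import java.util.List;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class BatchRoundTrip {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);

            // collect() triggers execution and brings the results back to the client
            List<Integer> squares = numbers
                    .map(new MapFunction<Integer, Integer>() {
                        @Override
                        public Integer map(Integer n) {
                            return n * n;
                        }
                    })
                    .collect();

            System.out.println(squares); // [1, 4, 9, 16, 25]
        }
    }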

Job Manager

The master processes, also known as Job Managers, coordinate and manage the execution of the program. Their main responsibilities include scheduling tasks, managing checkpoints, failure recovery, and so on.

There can be many masters running in parallel, sharing these responsibilities. This helps in achieving high availability. One of the masters needs to be the leader. If the leader node goes down, a standby master node will be elected as the new leader.

The Job Manager consists of the following important components:

  • Actor system
  • Scheduler
  • Checkpointing

Flink internally uses the Akka actor system for communication between the Job Managers and the Task Managers.

Actor system

An actor system is a container of actors with various roles. It provides services such as scheduling, configuration, logging, and so on. It also contains a thread pool from which all actors are initiated. All actors reside in a hierarchy: each newly created actor is assigned to a parent. Actors talk to each other using a messaging system. Each actor has its own mailbox, from which it reads all its messages. If the actors are local, the messages are shared through shared memory, but if the actors are remote, then the messages are passed through RPC calls.
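The following standalone sketch illustrates these ideas using the Akka Java API (Akka 2.5+; the actor and message names are made up for illustration and are not part of Flink):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;

    public class ActorSystemDemo {

        // An actor with its own mailbox; it processes messages one at a time
        static class Worker extends AbstractActor {
            @Override
            public Receive createReceive() {
                return receiveBuilder()
                        .match(String.class, msg ->
                                System.out.println(getSelf().path() + " received: " + msg))
                        .build();
            }
        }

        public static void main(String[] args) {
            // The actor system provides scheduling, configuration, logging, and a thread pool
            ActorSystem system = ActorSystem.create("demo-system");

            // Actors are created in a hierarchy; this one is a child of the user guardian
            ActorRef worker = system.actorOf(Props.create(Worker.class), "worker");

            // Communication happens by placing a message in the actor's mailbox
            worker.tell("hello", ActorRef.noSender());

            system.terminate();
        }
    }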

Each parent is responsible for the supervision of its children. If any error happens in a child, the parent gets notified. If a parent actor can solve the problem itself, then it can restart its children. If it cannot solve the problem, it can escalate the issue to its own parent.

In Flink, an actor is a container having state and behavior. An actor's thread sequentially keeps processing the messages it receives in its mailbox. The state and the behavior are determined by the messages it has received.

Scheduler

Executors in Flink are defined as task slots. Each Task Manager needs to manage one or more task slots. Internally, Flink decides which tasks need to share a slot and which tasks must be placed into a specific slot. It defines that through the SlotSharingGroup and CoLocationGroup.
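On the user-facing side, the DataStream API exposes slot sharing directly: you can place an operator into a named slot sharing group, and tasks in the same group may share a task slot (a sketch; the group name and the socket source are arbitrary placeholders):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)      // placeholder source
                    .map(new MapFunction<String, String>() {
                        @Override
                        public String map(String value) {
                            return value.toUpperCase();
                        }
                    })
                    .slotSharingGroup("preprocessing")   // place this operator in the "preprocessing" group
                    .print();

            env.execute("Slot sharing example");
        }
    }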
