Storm Basic Principles

                                                               



The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.




However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a “Hadoop of realtime” has become the biggest hole in the data processing ecosystem.


Storm fills that hole.



Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:

  1. Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.
  1. Brittle: There’s little fault-tolerance. You’re responsible for keeping each worker and queue up.
  1. Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
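The hand-built approach described above can be sketched in a few dozen lines. This is a minimal, single-process illustration that uses only the Python standard library; the word-count task, the names `splitter`, `counter`, `partition_for`, and `run_pipeline`, and the choice of two partitions are all assumptions made for the example, and the dictionaries merely stand in for a database.

```python
import queue
import threading
import zlib

NUM_COUNTERS = 2  # hypothetical number of partitions for the counting stage
STOP = None       # sentinel telling a worker to shut down


def partition_for(key, num_partitions):
    """Pick a partition so the same key always reaches the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def splitter(inbox, outboxes):
    """Stage 1: read sentences and route each word to the right queue."""
    while True:
        sentence = inbox.get()
        if sentence is STOP:
            for q in outboxes:  # shutdown has to be propagated by hand
                q.put(STOP)
            return
        for word in sentence.split():
            outboxes[partition_for(word, len(outboxes))].put(word)


def counter(inbox, counts):
    """Stage 2: update a store (a dict standing in for a database)."""
    while True:
        word = inbox.get()
        if word is STOP:
            return
        counts[word] = counts.get(word, 0) + 1


def run_pipeline(sentences):
    """Wire the queues and workers together, feed the input, collect counts."""
    sentence_q = queue.Queue()
    word_qs = [queue.Queue() for _ in range(NUM_COUNTERS)]
    stores = [{} for _ in range(NUM_COUNTERS)]

    workers = [threading.Thread(target=splitter, args=(sentence_q, word_qs))]
    workers += [
        threading.Thread(target=counter, args=(q, s))
        for q, s in zip(word_qs, stores)
    ]
    for w in workers:
        w.start()
    for s in sentences:
        sentence_q.put(s)
    sentence_q.put(STOP)
    for w in workers:
        w.join()

    merged = {}
    for store in stores:
        merged.update(store)  # keys are disjoint across partitions
    return merged
```

Even in this toy version, most of the code is routing, wiring, and shutdown logic rather than the counting itself, and changing `NUM_COUNTERS` on a real deployment would mean reconfiguring every upstream worker that sends to those queues.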

Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn’t lose data, scales to huge volumes of messages, and is dead-simple to use and operate?

Storm satisfies these goals.



Why Storm is important

Storm exposes a set of primitives for doing realtime computation. Just as MapReduce greatly eases the writing of parallel batch processing, Storm’s primitives greatly ease the writing of parallel realtime computation.




The key properties of Storm are:

  1. Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm’s small set of primitives satisfies a stunning number of use cases.

  1. Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm’s scale, one of Storm’s initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm’s usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.

  1. Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.

  1. Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
    
  1. Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).

  1. Programming language agnostic: Robust and scalable realtime processing shouldn’t be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
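To make the "no data loss" property above more concrete, here is a generic sketch of one way a system can ensure every message is eventually processed: treat a successful handler call as an acknowledgement and replay anything that fails. This is only an illustration of the general idea, not a description of how Storm implements its guarantee; the function `process_with_replay` and its `max_attempts` parameter are hypothetical names invented for this example.

```python
from collections import deque


def process_with_replay(messages, handler, max_attempts=3):
    """Process every message, replaying any whose handler raises.

    Returns (results, failed); failed holds messages that used up
    all of their attempts without succeeding.
    """
    pending = deque((msg, 1) for msg in messages)
    results, failed = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            # A normal return plays the role of an acknowledgement.
            results.append(handler(msg))
        except Exception:
            if attempt < max_attempts:
                # No acknowledgement: put the message back for another try.
                pending.append((msg, attempt + 1))
            else:
                failed.append(msg)
    return results, failed
```

Because a message can be handled more than once when a failure occurs partway through, a scheme like this gives at-least-once processing, so handlers should tolerate being called again with the same message.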
 
  