- Rationale
The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.
However, realtime data processing at massive scale is increasingly a requirement for businesses. The lack of a “Hadoop of realtime” has become the biggest hole in the data processing ecosystem.
Storm fills that hole.
Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:
- Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.
- Brittle: There’s little fault-tolerance. You’re responsible for keeping each worker and queue up.
- Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
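To make these limitations concrete, here is a minimal sketch of a hand-built queues-and-workers pipeline using only Python's standard library. It is an illustration, not code from Storm or from any real deployment: the names `worker`, `run_pipeline`, and `STOP`, the two stages, and the in-memory `counts` dictionary standing in for a database are all assumptions made for the example.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel telling a worker to shut down


def worker(inbox, outbox, transform):
    """Pull messages off inbox, process them, and forward results to outbox."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            if outbox is not None:
                outbox.put(STOP)  # propagate shutdown to the next stage
            break
        result = transform(msg)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(messages):
    # Every queue and every hop between workers is wired by hand.
    raw = queue.Queue()
    parsed = queue.Queue()
    counts = {}  # stands in for the database the last worker updates

    def parse(line):
        return line.strip().lower()

    def count(word):
        counts[word] = counts.get(word, 0) + 1

    stages = [
        threading.Thread(target=worker, args=(raw, parsed, parse)),
        threading.Thread(target=worker, args=(parsed, None, count)),
    ]
    for t in stages:
        t.start()
    for m in messages:
        raw.put(m)
    raw.put(STOP)
    for t in stages:
        t.join()
    return counts
```

Calling `run_pipeline([" Storm ", "storm", "Hadoop"])` returns `{"storm": 2, "hadoop": 1}`. Even in this toy version, most of the code is plumbing rather than processing logic, nothing restarts a worker that dies, and adding a second `count` worker would mean deciding by hand how words are partitioned between the two and rewiring the `parse` stage accordingly.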
Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn’t lose data, scales to huge volumes of messages, and is dead-simple to use and operate?
Storm satisfies these goals.
Why Storm is important
Storm exposes a set of primitives for doing realtime computation. Just as MapReduce greatly eases the writing of parallel batch processing, Storm’s primitives greatly ease the writing of parallel realtime computation.
The key properties of Storm are:
- Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm’s small set of primitives satisfies a stunning number of use cases.
- Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm’s scale, one of Storm’s initial applications processed 1,000,000 messages per second on a 10-node cluster, including hundreds of database calls per second as part of the topology. Storm’s use of ZooKeeper for cluster coordination makes it scale to much larger cluster sizes.
- Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.
- Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
- Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
- Programming language agnostic: Robust and scalable realtime processing shouldn’t be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
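The "guarantees no data loss" property can be illustrated with a small at-least-once processing loop: a message counts as done only after its handler succeeds, and a failed message is queued again for replay. This is a sketch of the general idea only, not Storm's actual mechanism or API; `process_with_replay`, `handler`, and `max_attempts` are hypothetical names chosen for the example.

```python
import collections


def process_with_replay(messages, handler, max_attempts=3):
    """Process every message at least once, replaying those whose handler fails."""
    pending = collections.deque((m, 1) for m in messages)
    done = []
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
        except Exception:
            if attempt >= max_attempts:
                raise  # fail loudly rather than drop the message silently
            pending.append((msg, attempt + 1))  # schedule a replay
        else:
            done.append(msg)  # acknowledged: the handler completed
    return done


# Usage: a handler that fails the first time it sees "b".
seen = collections.Counter()


def flaky(msg):
    seen[msg] += 1
    if msg == "b" and seen[msg] == 1:
        raise RuntimeError("simulated worker failure")


print(process_with_replay(["a", "b", "c"], flaky))  # prints ['a', 'c', 'b']
```

Note the trade-off this sketch makes visible: replaying means a handler may run more than once for the same message, so handlers in such a design need to tolerate repeats.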