Storm (2): Storm Basic Terminology

Topologies


The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.

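The "graph of spouts and bolts connected with stream groupings" idea can be modeled with a few lines of plain Java. This is only an illustrative sketch (MiniTopology and its methods are hypothetical, not the Storm TopologyBuilder API), showing that a topology is just components plus grouped subscriptions between them:

```java
import java.util.*;

// Minimal model of a topology as a directed graph: spouts and bolts are
// nodes, and each bolt subscribes to upstream components via a grouping.
// Illustrative only -- not the real Storm TopologyBuilder API.
public class MiniTopology {
    // component id -> list of "upstreamId/grouping" subscriptions
    private final Map<String, List<String>> inputs = new LinkedHashMap<>();

    public MiniTopology addSpout(String id) {
        inputs.put(id, new ArrayList<>());   // spouts have no inputs
        return this;
    }

    public MiniTopology addBolt(String id, String upstream, String grouping) {
        inputs.computeIfAbsent(id, k -> new ArrayList<>())
              .add(upstream + "/" + grouping);
        return this;
    }

    public List<String> inputsOf(String id) {
        return inputs.getOrDefault(id, List.of());
    }

    public static void main(String[] args) {
        MiniTopology t = new MiniTopology()
            .addSpout("sentences")
            .addBolt("split", "sentences", "shuffle")
            .addBolt("count", "split", "fields");
        System.out.println(t.inputsOf("count")); // [split/fields]
    }
}
```

Because a topology never finishes, this graph is wired up once at submission time and then runs until killed.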

Resources:

  • TopologyBuilder: use this class to construct topologies in Java
  • Cluster: Running topologies on a production cluster
  • Local mode: Read this to learn how to develop and test topologies in local mode.

Streams


The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream’s tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

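The "schema that names the fields in the stream's tuples" can be sketched as a declared field list plus a parallel value list per tuple. The NamedTuple class below is a hypothetical stand-in for Storm's real Tuple, to show why field-name lookup works:

```java
import java.util.*;

// Sketch of a schema'd tuple: the field names are declared once for the
// stream, and each tuple carries a matching list of values. Illustrative
// only -- not Storm's Tuple implementation.
public class NamedTuple {
    private final List<String> fields;
    private final List<Object> values;

    public NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size())
            throw new IllegalArgumentException("schema/value arity mismatch");
        this.fields = fields;
        this.values = values;
    }

    public Object getValueByField(String field) {
        int i = fields.indexOf(field);
        if (i < 0) throw new NoSuchElementException(field);
        return values.get(i);
    }

    public static void main(String[] args) {
        List<String> schema = List.of("word", "count");   // declared once per stream
        NamedTuple t = new NamedTuple(schema, List.of("storm", 3));
        System.out.println(t.getValueByField("count"));   // 3
    }
}
```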

Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of “default”.


Resources:

  • Tuple: streams are composed of tuples
  • OutputFieldsDeclarer: used to declare streams and their schemas
  • Serialization: Information about Storm’s dynamic typing of tuples and declaring custom serializations
  • ISerialization: custom serializers must implement this interface
  • Config.TOPOLOGY_SERIALIZATIONS: custom serializers can be registered using this configuration

Spouts

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.


Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.


The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.

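The non-blocking contract of nextTuple can be shown without Storm at all: poll an in-memory queue and return immediately when it is empty. QueueSpout here is a hypothetical stand-in, not a real IRichSpout implementation:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrates the non-blocking contract of nextTuple: poll a queue and
// return right away when it is empty, instead of blocking the thread.
public class QueueSpout {
    private final Queue<String> source = new ConcurrentLinkedQueue<>();

    public void offer(String msg) { source.add(msg); }

    // Returns the next message, or null immediately if nothing is pending.
    // A blocking call (e.g. BlockingQueue.take()) here would stall the
    // single thread Storm uses for all of a spout's methods.
    public String nextTuple() {
        return source.poll();
    }

    public static void main(String[] args) {
        QueueSpout spout = new QueueSpout();
        System.out.println(spout.nextTuple()); // null: empty queue, returns at once
        spout.offer("hello");
        System.out.println(spout.nextTuple()); // hello
    }
}
```

In a real spout, the "simply returns" branch would just emit nothing on that call; Storm will invoke nextTuple again shortly after.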

The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.


Resources:

  • IRichSpout: this is the interface that spouts must implement.
  • Guaranteeing message processing

Bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).


Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.


When you declare a bolt’s input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping(“1”) subscribes to the default stream on component “1” and is equivalent to declarer.shuffleGrouping(“1”, DEFAULT_STREAM_ID).


The main method in bolts is the execute method which takes in as input a new tuple. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it’s safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.


Please note that OutputCollector is not thread-safe, and all emits, acks, and fails must happen on the same thread. Please refer to Troubleshooting for more details.


Resources:

  • IRichBolt: this is general interface for bolts.
  • IBasicBolt: this is a convenience interface for defining bolts that do filtering or simple functions.
  • OutputCollector: bolts emit tuples to their output streams using an instance of this class
  • Guaranteeing message processing

Stream groupings

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt’s tasks.


There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:


  • Shuffle grouping: Tuples are randomly distributed across the bolt’s tasks in a way such that each bolt is guaranteed to get an equal number of tuples.


  • Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the “user-id” field, tuples with the same “user-id” will always go to the same task, but tuples with different “user-id”s may go to different tasks.


  • Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.


  • All grouping: The stream is replicated across all the bolt’s tasks. Use this grouping with care.


  • Global grouping: The entire stream goes to a single one of the bolt’s tasks. Specifically, it goes to the task with the lowest id.


  • None grouping: This grouping specifies that you don’t care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).


  • Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).


  • Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.

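The two most common groupings above can be simulated in a few lines of plain Java. These helpers are illustrative (the method names are invented, not Storm internals): taskForField mirrors how a fields grouping routes equal keys to the same task by hashing, and the round-robin shuffle mirrors the equal-distribution guarantee:

```java
import java.util.*;

// Simulates two built-in groupings over a fixed number of bolt tasks.
// Illustrative helpers only -- not how Storm implements groupings internally.
public class Groupings {
    // Fields grouping: hash the grouping field, mod the task count, so
    // every tuple with the same "user-id" lands on the same task.
    public static int taskForField(Object fieldValue, int numTasks) {
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    // Shuffle grouping (round-robin flavor): deal tuples out evenly.
    public static int[] shuffleCounts(int numTuples, int numTasks) {
        int[] counts = new int[numTasks];
        for (int i = 0; i < numTuples; i++) counts[i % numTasks]++;
        return counts;
    }

    public static void main(String[] args) {
        int a = taskForField("user-42", 4);
        int b = taskForField("user-42", 4);
        System.out.println(a == b);                                // true: same key, same task
        System.out.println(Arrays.toString(shuffleCounts(8, 4)));  // [2, 2, 2, 2]
    }
}
```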

Resources:

  • TopologyBuilder: use this class to define topologies
  • InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt’s input streams and how those streams should be grouped
  • CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct groupings

Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a “message timeout” associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.


To take advantage of Storm’s reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you’ve finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you’re finished with a tuple using the ack method.

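The tuple-tree bookkeeping described above can be sketched as a counter per spout tuple: anchoring a new tuple adds an outstanding edge, and acking removes one; the tree is complete when the count reaches zero. This is a toy model with invented names, not Storm's actual (XOR-based) acker implementation:

```java
import java.util.*;

// Toy model of Storm's reliability tracking: each spout tuple stays
// pending until every tuple in its tree has been acked. Illustrative
// only -- Storm's real acker uses a constant-space XOR scheme.
public class ReliabilityTracker {
    private final Map<String, Integer> pendingEdges = new HashMap<>();

    // Anchoring: emitting a tuple anchored to a spout tuple adds an edge.
    public void anchor(String spoutTupleId) {
        pendingEdges.merge(spoutTupleId, 1, Integer::sum);
    }

    // Acking removes one edge; the tree completes at zero outstanding edges.
    public boolean ack(String spoutTupleId) {
        int left = pendingEdges.merge(spoutTupleId, -1, Integer::sum);
        if (left == 0) pendingEdges.remove(spoutTupleId);
        return left == 0;
    }

    // Anything still pending past the message timeout would be failed
    // and replayed from the spout.
    public boolean isPending(String spoutTupleId) {
        return pendingEdges.containsKey(spoutTupleId);
    }

    public static void main(String[] args) {
        ReliabilityTracker tracker = new ReliabilityTracker();
        tracker.anchor("t1");          // spout tuple enters the topology
        tracker.anchor("t1");          // a bolt emits a child anchored to t1
        tracker.ack("t1");             // child acked, tree not yet complete
        System.out.println(tracker.isPending("t1")); // true
        tracker.ack("t1");             // last ack completes the tree
        System.out.println(tracker.isPending("t1")); // false
    }
}
```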

This is all explained in much more detail in Guaranteeing message processing.

Tasks

Each spout or bolt executes as multiple tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.


Workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

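The worker arithmetic in the example above (300 tasks over 50 workers gives 6 task threads per worker) is just an even spread, which a short sketch makes concrete:

```java
import java.util.Arrays;

// Spread a topology's total tasks as evenly as possible across worker JVMs,
// matching the even placement Storm aims for.
public class TaskSpread {
    public static int[] tasksPerWorker(int totalTasks, int numWorkers) {
        int[] spread = new int[numWorkers];
        for (int i = 0; i < totalTasks; i++) spread[i % numWorkers]++;
        return spread;
    }

    public static void main(String[] args) {
        int[] spread = tasksPerWorker(300, 50);
        System.out.println(spread[0]); // 6 tasks (threads) in each worker
    }
}
```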

Resources:

  • Config.TOPOLOGY_WORKERS: this config sets the number of workers to allocate for executing the topology


Reference: http://storm.apache.org/documentation/Concepts.html
