Storm概念

概念

本文列出了Storm的主要概念及相关的信息链接。讨论到的概念有：

Topologies
Streams
Spouts
Bolts
Stream groupings
Reliability
Tasks
Workers

Topologies

实时应用的逻辑被打包成了Storm topology。Storm topology跟MapReduce的job类似，它们之间的一个主要的不同是MapReduce job最终是要结束的，而Storm topology是永不停止的（直到你杀死它）。一个topology是对一组spouts、bolts、strem groupings的描述。

资源:

TopologyBuilder: 在Java中使用这个类构造topooygies
Running topologies on a production cluster
Local mode: 学习怎么在本地模式下开发、测试topologies

Streams

流是Storm中的核心抽象。流是无边界的、顺序的tuples，这些tuples在分布式环境中被并行的创建和处理。使用一个模式来定义流，这个模式里面有很多tuples，tuple里定义了字段。默认情况下，tuples可以包含 integers、 longs、 shorts、 bytes、 strings、 doubles、floats、 booleans和byte arrays. 你也可以定义你自己的serializer来自定义在tuple中可以用的类型。

每个流在声明的时候被赋予了一个id。因为只有一个流的spouts、bolts很常见，OutputFieldsDeclarer 提供了方便声明一个不用指定id的流的方法，id默认是default。

Resources:

Tuple: 流由tuples组成
OutputFieldsDeclarer: 用来声明流和它的模式
Serialization: tuple动态类型和自定义serialization信息
ISerialization: 自定义的serializer必须实现这个接口
CONFIG.TOPOLOGY_SERIALIZATIONS: 自定义的serializer可以通过这个配置注册

Spouts

Spout是一个topology的流的源头。通常情况下，spout从外部读取tuples（例如 Kestrel queue、Twitter API），然后将它们发射到topology。Spouts可以是可靠地，也可以是不可靠的。一个可靠的spout能够重新执行失败的tuple，而不可靠的spout将tuple发射出去后就将这个tuple忘记了。

Spouts能够发射多个流。使用OutputFieldsDeclarer的declareStream方法指定多个流模式，在SpoutOutputCollector的emit方法上制定要发射到的流。

Spout的主要方法是 nextTuple。 nextTuple 可以发射，也可以不发射流。实现nextTuple 方法时不能堵塞，因为Storm在同一个线程里调用所有的spout。

Spout其他主要的方法有 ack 和 fail。当Storm检测到从这个spout发射的tuple执行成功或失败后，会调用这两个方法。ack 和fail 仅仅被可靠的spout调用。

Resources:

IRichSpout: spout接口
Guaranteeing message processing

Bolts

topology的所有处理过程都在bolt里。Bolts可以做过滤、功能、聚合、连接、访问数据库等任何事情。

Bolts可以做简单的流转换。做复杂的流转换通常需要多个步骤即多个bolts。

Bolts可以发射多个流。使用OutputFieldsDeclarer的declareStream方法指定多个流模式，在OutputCollector的emit方法上制定要发射到的流。

如果你想订阅其他组件的所有流，你必须一个一个的订阅。InputDeclarer 提供了一些订阅流的方法，declarer.shuffleGrouping("1") 订阅了默认流的组件"1"的流，跟 declarer.shuffleGrouping("1", DEFAULT_STREAM_ID)的效果是一样的。

Bolt的主要方法是 execute ，这个方法接受一个tuple输入。Bolts使用 OutputCollector 来发射新的tuples。Bolts必须在处理完每个tuple后调用OutputCollector的 ack方法，Storm才能知道tuple什么时候被完成的（并且最终能决定反馈原始spout tuple是否安全？）(and can eventually determine that its safe to ack the original spout tuples).。通常的处理场景是，处理输入的tuple，发射0个或更多的tuple，然后ack输入的tuple，Storm提供了IBasicBolt接口来自动ack。

很好的做法是，在bolts中加载新的线程来做异步处理。OutputCollector 是线程安全的，可以在任何时候被调用。

Resources:

IRichBolt: Blot的通用接口.
IBasicBolt: Blot方便做过滤和简单功能的接口。
OutputCollector: Bolt用这个接口发射tuple
Guaranteeing message processing

Stream groupings

为每个bolt指定接受的流是定义topology的一部分。 Stream grouping定义了流按照什么方式进行分区。

在Storm中有7个内置stream grouping，你可以实现CustomStreamGrouping接口来自定义sream grouping：

Shuffle grouping: Tuples被随机分布到bolt上，每个bolt上处理的tuple数量是一样的。
Fields grouping: 流按照指定的字段进行分区。指定字段的值相同的tuple总是被送到同一个task；字段值不同的tuple可能在同一个task。
Partial Key grouping: 跟Fields grouping类似，也是根据指定字段值进行分区。但是，它是负载均衡的。当数据倾斜时，它能更好的利用资源。关于它是怎么工作的，有什么优势，这篇文章有详细的说明。
All grouping: 流在所有的blot上进行复制。使用这个grouping的时候要小心。
Global grouping: 整个流被发送到同故意恶bolt（拥有最小id的那个）。
None grouping: 使用这个grouping指明你不关心流式怎么分组的。当前，这个grouping相当于shuffle grouping。
Direct grouping: 这是一个特殊的gourping。tuple的生产者决定哪个消费任务来接受这个tuple。这个grouping只能用在被声明为direct stream的流上。必须使用dirct stream方法(/javadoc/apidocs/backtype/storm/task/OutputCollector.html#emitDirect(int, int, java.util.List) 。 Bolt获取task id的方式：1）TopologyContext 2）监听OutputCollector emit 方法的输出（返回这个tuple被送去的task的id）。
Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.

Resources:

TopologyBuilder: use this class to define topologies
InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped
CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct groupings

Reliability

Storm保证每个tuple都将被处理。 Storm监听tuples tree，检测这个tree什么时候被成功处理了。每个topology有个消息timeout配置，如果tuple在这个时间内没有被处理，则Storm认定这个tuple失败了，并重新处理它。It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a "message timeout" associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.

如果想利用Storm的可靠性功能，你必须告诉Storm tuple tree的tuple什么时候被创建，什么时候完成处理。在emit之后，使用ack声明这个tuple被成功处理了。To take advantage of Storm's reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you've finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you're finished with a tuple using the ack method.

This is all explained in much more detail in Guaranteeing message processing.

Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

Workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

Resources:

Config.TOPOLOGY_WORKERS: this config sets the number of workers to allocate for executing the topology

参考文档：http://storm.apache.org/documentation/Concepts.html