Introduction to Kafka and Its Components [Repost]

Kafka: A Distributed Publish-Subscribe Messaging System
Note: this article is translated from the official documentation.

1. Introduction
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a publish-subscribe messaging system, but with a unique design.

1) Kafka maintains feeds of messages in categories called topics.
2) Processes that publish messages to a Kafka topic are called producers.
3) Processes that subscribe to topics and process the feed of published messages are called consumers.
4) Kafka runs as a cluster of one or more servers, each of which is called a broker.

2. Components
Topics and Logs
A topic is a category or feed name to which messages are published; it can be thought of as one class of messages. For each topic, the Kafka cluster maintains a partitioned log: the topic is divided into multiple partitions, as the figure in the original documentation illustrates.

Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log. At the storage level a partition is an append-only log file, and every message published to the partition is appended to the end of that file. Each message in a partition is assigned a sequential id number called the offset, a long integer that uniquely identifies the message within the partition.
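
The append-only behavior described above can be sketched in a few lines. This is a toy in-memory model, not Kafka's storage code; the class name `PartitionLog` and its methods are made up for illustration.

```python
class PartitionLog:
    """A toy, in-memory stand-in for one partition (hypothetical, not Kafka's storage)."""

    def __init__(self):
        self._messages = []  # append-only list; list index == offset

    def append(self, message):
        """Append a message to the end of the log and return its offset."""
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read(self, offset, max_messages=10):
        """Return up to max_messages (offset, message) pairs starting at offset."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]


log = PartitionLog()
first = log.append("a")   # offsets are assigned sequentially
second = log.append("b")
third = log.append("c")
```

Because existing entries are never modified, an offset keeps pointing at the same message for as long as that message is retained.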

The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the log retention is set to two days, a message remains available for consumption for two days after it is published, after which it is discarded to free up disk space. Kafka's performance is effectively constant with respect to data size, so retaining lots of data is not a problem.
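
As a rough illustration of time-based retention, the sketch below filters records by publish time only and ignores whether they were consumed. The `Record` type and `apply_retention` function are hypothetical, and the sketch works record by record for simplicity, which should not be read as a description of how a broker actually cleans up its log.

```python
from dataclasses import dataclass


@dataclass
class Record:
    offset: int
    timestamp: float  # publish time, in seconds (assumed unit)
    value: str


def apply_retention(records, now, retention_seconds):
    """Keep records published within the retention window, consumed or not."""
    cutoff = now - retention_seconds
    return [r for r in records if r.timestamp >= cutoff]


TWO_DAYS = 2 * 24 * 60 * 60  # 172800 seconds

records = [
    Record(0, timestamp=0.0, value="old"),
    Record(1, timestamp=100_000.0, value="recent"),
]
# At time 200000 the cutoff is 27200, so only the second record is kept.
kept = apply_retention(records, now=200_000.0, retention_seconds=TWO_DAYS)
```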

The only metadata retained on a per-consumer basis is the consumer's position in the log, called the offset. The offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads messages, so messages are consumed in order, but it can set the offset to any position and consume messages in any order it likes. For example, a consumer can reset to an older offset to reprocess messages.
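
A minimal sketch of a consumer-owned offset, assuming the log is just a Python list whose index is the offset. `SimpleConsumer`, `poll`, and `seek` are illustrative names for this toy class, not a real client API.

```python
class SimpleConsumer:
    """Toy consumer that tracks its own position in a list-backed log."""

    def __init__(self, log):
        self.log = log      # list of messages; index == offset
        self.position = 0   # the only per-consumer state

    def poll(self):
        """Return the next message and advance the position, or None at the end."""
        if self.position >= len(self.log):
            return None
        message = self.log[self.position]
        self.position += 1
        return message

    def seek(self, offset):
        """Move to any offset, e.g. an older one to reprocess messages."""
        self.position = offset


log = ["m0", "m1", "m2"]
consumer = SimpleConsumer(log)
first_pass = [consumer.poll() for _ in range(3)]   # linear consumption
consumer.seek(1)                                   # rewind to offset 1
second_pass = [consumer.poll() for _ in range(2)]  # reprocess m1 and m2
```

Reading never changes the log itself, which is why rewinding one consumer has no effect on any other.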

This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use the command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.

Partition
The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and read/write requests for its share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader and stand by to replace it. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
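
The leader/follower roles and the failover step can be modeled very roughly as follows. The `ReplicatedPartition` class is hypothetical, and the rule "promote the first remaining replica" is an assumption made for this sketch; it is not meant to describe how Kafka actually elects a leader.

```python
class ReplicatedPartition:
    """Toy model of one partition's replica set (illustrative only)."""

    def __init__(self, replicas):
        self.replicas = list(replicas)   # broker ids hosting this partition
        self.leader = self.replicas[0]   # assumption: first replica leads

    @property
    def followers(self):
        return [b for b in self.replicas if b != self.leader]

    def fail(self, broker):
        """Remove a failed broker; promote a follower if it was the leader."""
        self.replicas.remove(broker)
        if broker == self.leader:
            if not self.replicas:
                raise RuntimeError("no replicas left for this partition")
            self.leader = self.replicas[0]


partition = ReplicatedPartition(["b1", "b2", "b3"])
partition.fail("b1")  # the leader fails, so a follower takes over
```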

Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which partition within the topic each message is assigned to. This can be done in a round-robin fashion simply to balance load, or according to some semantic partition function (say, based on a key in the message). More on the use of partitioning in a second.
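
The two partitioning strategies mentioned above might look like this in miniature. Both function names are made up for illustration, and CRC32 is an arbitrary hash chosen for the sketch, not necessarily what any Kafka client uses.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions):
    """Return a function that cycles through partitions to balance load."""
    counter = itertools.cycle(range(num_partitions))
    return lambda: next(counter)


def key_partitioner(key, num_partitions):
    """Map a key to a partition; the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


next_partition = round_robin_partitioner(3)
rr = [next_partition() for _ in range(5)]  # cycles 0, 1, 2, 0, 1

p1 = key_partitioner("user-1", 3)
p2 = key_partitioner("user-1", 3)  # same key, same partition
```

Round-robin spreads load evenly but scatters related messages, while key-based partitioning keeps all messages for one key in one partition.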

Consumers
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe, the message is broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both of these: the consumer group.

Consumers label themselves with a consumer group name, and each message published to a topic is delivered to exactly one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then this works just like a traditional queue, balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

More commonly, however, topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
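
The way one abstraction covers both queuing and publish-subscribe can be shown with a small simulation. The `deliver` function is hypothetical, and picking the consumer inside a group by message index is a simplification made for the sketch.

```python
from collections import defaultdict


def deliver(messages, groups):
    """Deliver each message to exactly one consumer in every group.

    groups maps a group name to its list of consumer names. Within a group
    the receiving consumer is chosen round-robin by message index.
    """
    received = defaultdict(list)
    for i, message in enumerate(messages):
        for members in groups.values():
            consumer = members[i % len(members)]
            received[consumer].append(message)
    return dict(received)


messages = ["m0", "m1", "m2", "m3"]

# One shared group: behaves like a queue, load is split between c1 and c2.
queue_like = deliver(messages, {"g": ["c1", "c2"]})

# One group per consumer: behaves like publish-subscribe, everyone sees everything.
pubsub_like = deliver(messages, {"g1": ["c1"], "g2": ["c2"]})
```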

Kafka also has stronger ordering guarantees than a traditional messaging system.

A traditional queue retains messages in order on the server, and if multiple consumers consume from the queue, the server hands out messages in the order they are stored. However, although the server hands out messages in order, they are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this with a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism within topics, the partition, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. This ensures that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions, this still balances the load over many consumer instances. Note, however, that a consumer group cannot have more consumer instances doing work than there are partitions.
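
A sketch of the assignment idea: every partition gets exactly one owner within the group. The `assign_partitions` function and its round-robin rule are assumptions for illustration, not a description of any particular Kafka assignment strategy.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions, two consumers: each consumer reads two partitions in order.
balanced = assign_partitions([0, 1, 2, 3], ["c1", "c2"])

# Two partitions, three consumers: the third consumer has nothing to read.
oversubscribed = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

The second case shows why adding consumers beyond the partition count does not add parallelism in this model.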

Kafka only provides a total order over messages within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over all messages in a topic, this can be achieved with a topic that has only one partition, though this means only one consumer process per consumer group.
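
To see why per-partition ordering plus key-based partitioning is often enough, the sketch below routes keyed messages into partitions and checks that each key's messages stay in send order. The `route` function is hypothetical and CRC32 is an arbitrary hash used only for this illustration.

```python
import zlib


def route(messages, num_partitions):
    """Append each (key, value) message to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


sent = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3)]

partitions = route(sent, num_partitions=4)
# All messages for key "a" sit in one partition, in the order they were sent.
values_for_a = [v for p in partitions for (k, v) in p if k == "a"]

# With a single partition the whole topic keeps its total order.
single = route(sent, num_partitions=1)
```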

Guarantees
At a high level, Kafka gives the following guarantees:
1) Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
2) A consumer instance sees messages in the order they are stored in the log.
3) For a topic with replication factor N, Kafka will tolerate up to N-1 server failures without losing any messages committed to the log.
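
The third guarantee is simple arithmetic: as long as at least one of the N replicas survives, committed messages are still available. The sketch below assumes every replica holds all committed messages; the function name and broker names are made up for illustration.

```python
def surviving_replicas(replicas, failed):
    """Return the replicas of a partition that are still alive."""
    return [b for b in replicas if b not in failed]


replicas = ["broker-1", "broker-2", "broker-3"]  # replication factor N = 3

# N - 1 = 2 failures: one replica remains, so committed messages survive.
after_two_failures = surviving_replicas(replicas, {"broker-1", "broker-2"})

# N = 3 failures: no replica remains.
after_three_failures = surviving_replicas(replicas, set(replicas))
```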
————————————————
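Copyright notice: this is an original article by CSDN blogger 「李小静」, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/a568078283/article/details/51464524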
