Paxos Made Simple

The Part-Time Parliament，Lamport，1998，ACM Transactions on Computer Systems. 晦涩的原文

http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf

Paxos Made Simple

http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F/notes/paxos-simple.pdf

http://blog.csdn.net/sparkliang/article/details/5740882

Paxos Made Live, 如何实现Paxos

http://labs.google.com/papers/paxos_made_live.pdf

http://duanple.blog.163.com/blog/static/7097176720121123114918404/

Paxos小议

http://stblog.baidu-tech.com/?p=1196

-----------------------------------------------------------------------------------------------------------------------------

Paxos Made Simple

词汇表

Process, 个体, 人, 可以扮演多种Agent(角色), 在paxos中往往需要同时扮演所有角色
三种类型角色, proposer(提案者)，acceptor(审批者)和listener(通知者,需要知道结果)

Proposal(提案), 由proposer提出
Proposal由number(编号)和value(提案内容)组成.
Issue(= propose), proposer issue a proposal

Message (消息), agent之间的通信
Accept(同意)
Majority (多数派), 被majority accept的value才认为被choose
Choose(选中, 定案)

The Problem, 解决什么问题?

Assume a collection of processes that can propose values.
A consensus algorithm ensures that a single one among the proposed values is chosen.
If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value.

确保在众多提案value中, 只有一个被选中, 并且保证该结果让所有processes都知道.

强调value, 因为可能多个proposal包含相同的value, 此时可能多个proposal被choose

并且需要满足如下的安全性需求,

只有被propose的value才可能被choose
有且仅有一个value被choose (只要有value被proposal)
process无法预先知道哪个value被choose, 除非该value已经被choose
Agents不稳定性, operate at arbitrary speed(可能反应很慢), may fail by stopping, and may restart.
所以可能Agent需要保存状态信息, 在restart后可以恢复状态
message被deliver的速度是不可预测的(可能很长时间), 消息可能会重复或丢失, 但是内容不会corrupted
即non-Byzantine model, 在拜占庭将军模式下, 内容可能被叛徒篡改, 而这里假设内容不会被篡改

Choosing a Value

思路和推演过程

单个Acceptor

最简单的方法, 指定单个acceptor, 所有proposal都发给它, 它接受最先收到的proposal. Problem, 单点问题

多个Acceptor

多个acceptor就解决了单点问题, 某个acceptor crash不会影响到choose
如果多数的acceptor(ex, 超过半数) accept某个value, 则该value被choose. Problem, 多个proposal被choose

Acceptor只能accept一个proposal. 并为了保证当只有一个value的情况下也会被choose, 加强为P1

In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. This suggests the requirement:

P1. An acceptor must accept the first proposal that it receives
P1. acceptor必须accept收到的第一个proposal. Problem, 可能所有proposal都达不到多数accept

比如, 规定只有超过50%才能被choose, 但是两个proposal都刚好获得50%的accept, 导致无法确定哪个proposal被choose

Acceptor必须被允许accept 多个proposal, 但怎么避免多个value被choose?

proposal由number和value组成

编号的唯一性, different proposals have different numbers

编号是全序的, numbers are totally ordered, 递增的, 后提出的proposal具有higher-number?

We can allow multiple proposals to be chosen, but we must guarantee that all chosen proposals have the same value. By induction on the proposal number, it suffices to guarantee:

P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
P2. 如果一个内容为v的proposal被chosen, 那么后面所有被choose的higher-numbered proposal的内容必须是v. (有效避免多个value被choose)

问题是怎么保证P2? 需要更进一步具体和细化

P2a . If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.
P2a . 如果一个内容为v的proposal被choose, 那么所有acceptor只能accept内容为v的higher-numbered proposal

在acceptor层面控制, 来保证后面被choose的proposal的内容也一定是v
可惜这个假设很难实现, 因为v被choose只能保证大多数acceptor, 你很难保证所有acceptor. Problem, 无法保证所有acceptor满足该条件
比如由于通信异步或其他问题, acceptor c从来没有收到过任何的proposal, 此时一个新的proposer ‘wakeup’, 并提出内容不为v的higher-numbered proposal
而c恰好收到该proposal, 按照p1, c一定会accept该proposal
就会导致违背P2a

P2b . If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.
P2b . 如果一个内容为v的proposal被choose, 那么所有proposer提的higher-numbered proposal的内容都必须为v

从proposer的层面解决问题, 因为acceptor能够accept的proposal首先需要被issue, 所以P2b可以包含P2a

怎样保证所有proposer满足P2b? 继续加强和具体化

P2c. For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either
(a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
P2c. 如果需要issue一个proposal ’number n, value v’, 必须有满足如下任一条件(a或b)的set S (S包含a majority of acceptors)存在
(a)S中所有的acceptor都没有accept过任何小于n的proposal (保证没有小于n的proposal被choose)
(b)S中所有的acceptor所accept过的小于n的proposal中, highest-numbered proposal的内容是v (保证被choose的proposal的内容是v)

P2c包含P2b是显而易见的

并且P2c比P2b容易操作, 只需要check是否有这样的set S存在即可
至此, 由P2演化至P2c, 只需要issue新proposal的时候符合P2c, 就可以满足P2

现在问题就变为, 在需要提新的proposal之前, 如何知道是否有这样的set S存在?
方法是发prepare请求给所有的acceptor, 所有满足条件的acceptor回复, 当收到的回复超过majority, 就认为set S存在
但是有个问题是网络通信往往需要较长时间, acceptor在回复后一直到新的proposal提出这段时间内, acceptor的状态可能会发生变化
所以必须约定, 一旦acceptor回复编号为n的prepare请求, 该acceptor就不能批准任何编号小于n的议案

议案提出算法

1. Prepare request

A proposer chooses a new proposal number n and sends a request to each member of some set of acceptors, asking it to respond with:
(a) A promise never again to accept a proposal numbered less than n, and
(b) The proposal with the highest number less than n that it has accepted, if any.
Proposer向某集合的所有acceptors发出编号n的prepare request, 并要求回应
(a)永不批准编号小于n的议案的承诺 (从未accept过的acceptor)
(b)回应编号小于n中highest number的proposal (对于accept过小于n的proposal的acceptor)

2. Accept request

If the proposer receives the requested responses from a majority of the acceptors, then it can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among the responses, or is any value selected by the proposer if the responders reported no proposals. A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.)

Proposer只要接收到超过majority的acceptor的回应, 就表明存在set S, 即可以issue一个编号为n的新proposal.
至于新proposal的value, 取决于majority acceptor的回复, 如果回复的是highest number的v, 那么新proposal的value必须为v; 如果majority回复的是promise, 那么新proposal的v可以是任意新值

Acceptor算法

This describes a proposer’s algorithm. What about an acceptor?

It can receive two kinds of requests from proposers: prepare requests and accept requests. An acceptor can ignore any request without compromising safety.
So, we need to say only when it is allowed to respond to a request. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, iff it has not promised not to. In other words:

P1a . An acceptor can accept a proposal numbered n iff it has not responded to a prepare request having a number greater than n.

P1a . Acceptor可以accept一个编号为n的proposal，当且仅当它没有回应过一个编号大于n的prepare request

P1a是包含P1的

We now have a complete algorithm for choosing a value that satisfies the required safety properties—assuming unique proposal numbers. The final algorithm is obtained by making one small optimization.
现在已经得到一个满足安全需求的chooseing value算法, 做些小的优化就得到最终的算法,

a. acceptor可以忽略所有小于或等于n的prepare请求 (n是它已经回应过的最大prepare请求编号)
b. acceptor可以忽略所有已经被accept过的proposal的prepare请求
c. acceptor必须保存它已经accept过的highest number的proposal(需要number和value), 以及已经回应过的highest number的prepare请求(只需number)
甚至当fail and restart, 仍然需要记住这些信息.
d. proposer可以随时放弃一个proposal, 并重新提新number的proposal
当出现开始说的, 由于碰巧多个proposal都没有达到majority时, 可以随时放弃, 并重新提

最终choosing value算法

Putting the actions of the proposer and acceptor together, we see that the algorithm operates in the following two phases.

Phase 1.

(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Phase 2.

(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

Proposer可以提多个proposals, 并且可以随时abandon一个proposal, 而不会影响协议的正确性.
但是当已经有其他的proposer提出higher-numbered的提案的时候, 该proposer的提案会被abandon, 此时为了效率优化, 应该通知该proposer已经有higher-numbered的提案被提出
A proposer can make multiple proposals, so long as it follows the algorithm for each one.
It can abandon a proposal in the middle of the protocol at any time. (Correctness is maintained, even though requests and/or responses for the proposal may arrive at their destinations long after the proposal was abandoned.)
It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one.
Therefore, if an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, then it should probably inform the proposer, who should then abandon its proposal. This is a performance optimization that does not affect correctness.

Learning a Chosen Value

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors.
在分布式环境中, 去中心化的设计, 每个节点都只知道自己是否accept该proposal, 问题如果知道是否majority的acceptor都accept该proposal, 从而导致该proposal被choose

1. 当然最简单的方法, 每个acceptor在accept该proposal的时候, 通知所有的learner(其实就是除自己外的所有其他节点) Problem, 消息量太大

2. 基于非拜占庭假设, 一个learner可以从另一个learner得知被选择的决议(可信的). 所以选一个master learner, 所有acceptor将消息发给master, 由master通知其他learner Problem, 单点问题

3. 折衷的方案, 多个master learner, balance消息量和安全性 Problem, 由于网络的不稳定, 不保证可以收到所有accptor的通知消息, 从而无法确定是否有proposal被choose

4. 无论是acceptor发消息给learner, 还是learner主动去询问acceptor, 由于网络消息的lose, 都有可能无法确定是否有propoasl被choose
最终的方案, 如果想知道该value是否被chosen, 最直接的方法是重新提一个proposal prepare request
如果返回的majority回复的是highest number的v, 那么就可以知道v已经被chosen
如果返回的majority回复的是promise, 那么就可以知道当前还没有v被chosen
如果返回达不到majority, 说明当前无法issue一个proposal, 这种情况说明网络不稳定, 无法知道是否有v被chosen

Paxos存在的问题

当然到目前为止, 只是个理论上的算法
如果需要去实现, 会存在很多问题需要去讨论, 这儿只列出部分

1. 如何保证全局编号问题

你提的时候怎么知道当前的最大number?
在分布式环境下, 确实无法直接确定
只能根据我知道的最大number先提一个proposal, 如果被拒绝, 那么拒绝的acceptor应该通知他知道的最大的number是多少

然后为了减少try的次数, 提高效率, number不是简单递增而是使用下面的方法递增, 很容易理解

在Google的Chubby论文中给出了这样一种方法：

假设有n个proposer，每个编号为i_r (0<=i_r <n)，proposol编号的任何值s都应该大于它已知的最大值，并且满足：s %n = i

2. 活锁问题, 需要proposer leader

比如两个proposer, p1提的number为1, p2提的number为2

当然p1提的prepare request当遇到已经promise过p2.2的acceptor, 一定会被拒, 并被告知已有更高的number

此时p1从效率考虑, 会立即放弃P1.1, 提出P1.3

但是此时, 你要考虑网络的延时, 所以有可能P2.2的prepare request也会遇到已经promise过p1.3的acceptor, 同样被拒
于是p2从效率考虑, 会立即放弃P2.2, 提出P2.4 …………

这样就会产生活锁问题, 大家不停的递增number…

这个问题怎么解决?

如果任何节点都可以提proposer, 虽然不影响正确性, 但是确实会大大影响算法效率
Lamport给出的解决办法是选举出一个proposer作leader，所有的proposal都通过leader来提交，当Leader宕机时马上再选举其他的Leader

问题似乎变的循环了, 如果能选一个leader, 就可以用M/S方法, leader可以决定一致性问题, 还需要paxos干吗?

首先, leader只是为了提高Paxos的效率, 避免活锁发生, 可以想象如果只有一个proposer的情况下, proposal被一次性accpet的可能性非常大, 不需要多次重新proposal, 造成资源浪费
leader的选取成功与否, 并不会影响paxos的正确性, 所以有可能发生leader选取失败或有多个proposer都认为自己是leader的case出现
这就是为什么leader并不能代替paxos本身的原因
另外, paxos算法要求无论发生什么错误, 都能够保证算法的正确性, 即选定一个value，并让大家知道.
所以如果仅仅基于leader来实现一致性, 当leader fail的时候所有信息丢失了, 就算重新选举leader, 之前所有的选举的结果也无法知道
而基于paxos的方法, 即使proposer fail, 也可以简单的从majority acceptors得到之前choose的结果

Paxos本来就是选举算法，能否用paxos来选举Leader呢?
存在称之为PaxosLease的paxos算法简化版可以完成leader的选举，像Keyspace、 Libpaxos、Zookeeper、goole chubby等实现中都采用了该算法

3. 轮(instance)概念

这个英文为啥是instance?
一轮会完成一次完整的paxos算法, choose出一个value
为了方便, 可以允许多轮paxos算法同时进行, 典型的例子就是下面对state machine的实现
但具体怎么实现, instance id在什么地方维护, instance怎么算结束等问题...还需要更多的考虑

4. 持久存储问题

在算法执行的过程中会产生很多的异常情况, 各种角色都可能fail

但无论何种错误必须保证paxos算法的正确性，需要各个节点都做持久存储, 以做到server”醒来"后仍能正确参与paxos处理

propose该存储已提交的最大proposal编号
acceptor存储已promise的最大proposal编号, 或已accept的最大proposal的编号和value
learn存储已学习过的proposal被accepted情况

Implementing a State Machine

The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system—an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems [4].

[4] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.

本节描述了对Paxos的一种应用, 用于保证搭建分布式系统中有限状态机模型的一致性问题, 这个问题作为[4]论文的主题而被大家熟知.
说白了, 就是在分布式系统中怎么保证全序问题

实现分布式系统最简单的方式就是C/S结构, 一堆client向server发命令.
此时server可以看作是有限状态机, 不断读入命令以改变状态.
这个系统明显有单点问题

因此可以使用一组服务器, 每一个都是独立的状态机. 因为状态机是确定的, 如果执行相同的命令序列, 所有的服务器将会产生同样的状态序列和输出.
这样当一台server fail的时候, client可以从其他的server读出状态结果
但是问题是, 在分布式的环境中, 无法保证不同server收到的命令序列是相同的, 比如由于网络延迟
如果命令执行序列不同, 就无法产生同样的状态, 所以如何保证每个server都可以得到相同的命令执行序列?

当然我们需要使用Paxos算法来解决这个问题, 其实就是投票或选取问题, 我们依次投票决定第一, 二...N个被执行的命令, 所以需要独立执行N轮(instance概念见上)Paxos算法
为了提高效率, 选举一个leader proposer来依次发起多轮命令选举
Paxos有个特点, phase1和phase2可以完全分开执行, 因为到phase2才决定具体的value
并且这里可以有个优化是, 当第一轮的phase1成功后, 后续轮的phase1可以直接省略...可以简单想想为啥

所以Leader的可以根据接受到的client命令的顺序依次发起phase2的commit请求, 但是这儿不能保证所有请求都成功, 有可能某些请求由于网络问题失败, 这样就执行命令序列中产生了Gap
比如对于1 ~ 140个命令, 其中136, 137失败了, 那么其他所有server可以先执行1~135, 而138~140就无法执行, 因为136和137没有确定
如果希望尽快的消除Gap, 可以对136和137提出一个特殊的"noop”命令(不改变状态), 用于快速弥补gap (比如某种情况下, client很长时间都不发新的命令, 导致136,137的gap无法填补, 后面的命令就一直被block)

上面的讨论假设一直存在单个leader, 除了当前leader失效和选举新leader之间的一小段时间外, 在异常环境中, leader 选择可能失败.
如果没有服务器被选为leader, 那么将不能接受命令. 如果多个服务器都认为自己是leader, 在同一个算法实例中, 它们将都能提出议案, 这可能会导致所有的议案都不能被选择.
然而这不会影响算法的安全性和正确性, leader只是为了提高效率和避免冲突

Paxos Made Simple

The Part-Time Parliament，Lamport，1998，ACM Transactions on Computer Systems. 晦涩的原文

Paxos Made Simple

Paxos Made Live, 如何实现Paxos

chubby论文

关于Paxos的历史

Paxos算法1-算法形成理论

Paxos Made Simple

词汇表

The Problem, 解决什么问题?

Choosing a Value

思路和推演过程

议案提出算法

Acceptor算法

最终choosing value算法

Learning a Chosen Value

Paxos存在的问题

Implementing a State Machine