The Paxos algorithm, when presented in plain English, is very simple
Consensus is one of the fundamental problems in distributed systems. In this post, we will see what consensus is and review the most famous consensus algorithm, Paxos. We will see that, despite exaggerations about its complexity, Paxos, at least the single-decree Paxos that aims to achieve consensus on a single value, is actually very intuitive and easy to understand.
Consensus
The purpose of a consensus algorithm, as the name suggests, is to allow multiple processes to reach an agreement in a distributed system. The consensus problem is defined as follows: we have a group of participants. These
participants must decide on a value of a register such
that the following properties are satisfied:
- Agreement: The value decided by the participants must be the same for all participants.
- Integrity: Once a participant has decided, it cannot change its decision.
- Validity: The value decided by the participants must be one of the values suggested by the participants.
- Termination: A value must be eventually decided by the participants.
The first three properties are safety properties, whereas the fourth
property (Termination) is a liveness property.
Consensus is a fundamental problem in distributed systems, and many other problems can be reduced to it. For example, leader election can be reduced to a consensus problem where the agreement is about who the leader is. As another example, consider a replicated datastore where clients can submit their writes to different replicas. Here, the replicas must agree on the next write in the system. By agreeing on the next write at each step, we guarantee that all replicas apply writes in the same order, which is necessary to keep the replicas consistent with each other.
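To make the replicated-datastore example concrete, here is a minimal sketch (names like apply_writes are illustrative, not from any particular system) showing why agreeing on the order of writes is enough to keep replicas consistent. The slot-by-slot agreement itself is what Paxos provides.

```python
def apply_writes(ordered_writes):
    """Apply an agreed sequence of key-value writes to a fresh replica."""
    state = {}
    for key, value in ordered_writes:
        state[key] = value  # later writes to the same key win
    return state

# Two replicas that apply the same agreed order end up identical,
# even though "x" is written twice.
agreed_order = [("x", 1), ("y", 2), ("x", 3)]
replica_a = apply_writes(agreed_order)
replica_b = apply_writes(agreed_order)
assert replica_a == replica_b == {"x": 3, "y": 2}
```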
FLP Impossibility Result
You may have heard of the FLP impossibility result, a theorem proved by
Fischer, Lynch, and Paterson [1], which shows it is impossible to achieve consensus in a distributed system. You may wonder, given FLP, how Paxos and other consensus protocols solve consensus at all. Note that FLP proves the impossibility of consensus under the very strict condition of total asynchrony. Under this asynchronous model, processes have no shared notion of time, and any step in the system may take an arbitrarily long time. As a result, we cannot assume any upper bound on the time a process or the network takes to do anything; thus, we cannot rely on timeouts to conclude that some failure has happened. The following paragraph is from [1]:
"Crucial to our proof is that processing is completely asynchronous, that is, we make no assumptions about the relative speeds of processes nor about the delay time in delivering a message. We also assume that processes do not have access to synchronized clocks, so algorithms based on timeouts, for example, cannot be used."
As you can see, the assumptions made by FLP are indeed very strict. In practice, we have some level of synchrony: we can assume a reasonable upper bound on the time it takes to process something or send it over the network. Similarly, although clocks drift between nodes, we can still assume some shared notion of time between processes. Consensus protocols assume this relaxed, partially synchronous model of a distributed system and try to solve consensus under it.
Thus, Paxos and other consensus protocols do not violate FLP; they simply live in a different world. FLP holds in asynchronous distributed systems, while consensus protocols work in partially synchronous distributed systems, where our assumptions about the synchrony between processes are correct most of the time but may occasionally be wrong. When those unlikely situations occur, consensus algorithms usually do not guarantee termination, but safety is always guaranteed. For example, as we will see with Paxos, proposers may step on each other continuously, preventing agreement forever, but in practice, with random backoffs, this is very unlikely to happen.
Paxos
Paxos [2] is the most famous consensus algorithm. Although you may have heard
exaggerations about its complexity, it is not that hard to understand.
Interestingly, the inventor of Paxos, Leslie Lamport, published a
follow-up paper [3] to show that Paxos is easy to understand once explained in
plain English.
Figure 1. Lamport published a follow-up paper [3] in 2001 just to show Paxos is not really that hard to understand. The abstract of the paper is just one sentence stating this fact.
So, let's see how Paxos works!
In Paxos, we have three roles: Proposer, Acceptor, and
Learner. These are logical roles, and all of them can be executed on the same
node; in practice, a node usually has all three roles. An execution of the
basic Paxos algorithm decides a single value: the proposers propose their
values to the acceptors, and once a value is decided by the acceptors, the learners
"learn" it, i.e., consider it decided.
Paxos runs in two phases:
- Phase 1: In the first phase, one of the proposers establishes a temporary leadership by receiving promises from a majority of acceptors. This temporary leadership lasts until another proposer gets promises from a majority of the acceptors.
- Phase 2: In the second phase, the proposer uses its leadership to get a value chosen. The proposer has a short window in which to use its established leadership to get acceptance from a majority of the acceptors. If the proposer is not quick enough, its temporary leadership may be taken over by another proposer, and the process must be repeated by the new proposer.
But don't we get into a situation where the proposers keep taking over
the leadership and never let the current leader get the job
done?
Yes, that can happen if proposers propose aggressively. To avoid it, one
solution is to make a proposer wait a random amount of time before
trying again when it fails to get its value accepted. We can use
randomized exponential backoff.
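Randomized exponential backoff can be sketched as follows (the function name and the base/cap parameters are illustrative choices, not part of Paxos itself):

```python
import random

def backoff_delay(attempt, base=0.05, cap=5.0):
    """Randomized exponential backoff: before retrying, wait a random
    time in [0, min(cap, base * 2**attempt)).

    The window doubles with each failed attempt, so two competing
    proposers quickly become unlikely to keep colliding."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A proposer would sleep for backoff_delay(k) after its k-th failed proposal before picking a larger proposal number and trying again.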
Let's look at the phases more closely. In phase 1, a proposer picks a unique
proposal number n and sends a prepare(n) message to all acceptors. Upon
receiving a prepare(n) message, an acceptor proceeds as follows:
- If the acceptor has already responded to a proposal by another proposer with the proposal number m such that m > n, it sends Nack to the proposer.
- This Nack message means: Sorry, your proposal number is too small. I have already promised (or even accepted) another proposal with a proposal number greater than yours.
- If the acceptor has already accepted a value v proposed by a different proposer with proposal number m < n, it responds with a promise(m, v) message.
- This promise(m, v) message means: Your proposal number looks good to me, as it is large enough, but I have already accepted a value. The value I accepted is v, and its proposal number was m.
- Otherwise, the acceptor returns a promise(n) message to the proposer.
- This promise(n) message means: OK, I promise to reject any proposal with a proposal number smaller than n.
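The three rules above can be sketched as a small acceptor class. This is a minimal in-memory sketch with illustrative names (on_prepare, promised_num, and so on); it omits the persistence to disk that a real acceptor needs.

```python
class Acceptor:
    def __init__(self):
        self.promised_num = 0     # highest proposal number promised so far
        self.accepted_num = None  # proposal number of the accepted value, if any
        self.accepted_value = None

    def on_prepare(self, n):
        if n <= self.promised_num:
            # Rule 1: proposal number too small -> Nack.
            return ("nack", self.promised_num)
        self.promised_num = n  # in a real system: persist to disk first!
        if self.accepted_value is not None:
            # Rule 2: promise, but report the already-accepted value.
            return ("promise", self.accepted_num, self.accepted_value)
        # Rule 3: plain promise.
        return ("promise", n, None)
```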
Once a proposer receives promise(n) or promise(m, v) messages from a majority of the acceptors, it has gained temporary leadership and is ready to go to phase 2. But before that, if any of the promises carried a value, the proposer picks the value v with the highest proposal number among them; otherwise, it keeps its own value. If the proposer does not receive promises from a majority of acceptors, it tries again with a larger proposal number, but, as said above, only after a random wait.
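The value-selection rule can be sketched as follows, assuming each promise is a (proposal number, value-or-None) pair (the function name choose_value is illustrative):

```python
def choose_value(promises, my_value):
    """Pick the value from the promise carrying the highest proposal
    number; fall back to our own value if no promise carried one."""
    carried = [(num, val) for (num, val) in promises if val is not None]
    if carried:
        return max(carried)[0:2][1]  # value paired with the highest number
    return my_value

# One acceptor already accepted "c" under number 3, so we must adopt it.
assert choose_value([(1, None), (2, "b"), (3, "c")], "mine") == "c"
# No promise carried a value: we are free to propose our own.
assert choose_value([(1, None), (2, None)], "mine") == "mine"
```

This rule is what makes Paxos safe: a proposer that gains leadership after a value may already have been chosen is forced to re-propose that value rather than its own.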
In phase 2, the proposer sends an Accept(n, v) message, where v is the value selected by the proposer, to all acceptors. Upon receiving an Accept(n, v) message, an acceptor accepts the value only if it has not received a prepare(m) message with m > n. After accepting a value, the acceptor sends an Accepted message to the proposer. Once the proposer has received Accepted messages from a majority of the acceptors, it considers the value decided and sends it to the learners. Alternatively, the acceptors can send Accepted messages directly to all learners; in that case, a learner learns the value once it receives it from a majority of the acceptors. Since in practice all these roles usually co-exist in all participants, the two approaches are equivalent.
Note that before sending any message in this algorithm, an acceptor must write
its state to stable storage such as a disk (i.e., we have
to fsync!). This is important, as an acceptor must remember
what it promised or accepted after recovering from a failure.
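Durably persisting the acceptor state before replying might look like this sketch (the file path, field names, and JSON encoding are illustrative; real implementations often use a write-ahead log instead):

```python
import json
import os

def persist_state(path, promised_num, accepted_num, accepted_value):
    """Write acceptor state durably before replying to a proposer."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"promised": promised_num,
                   "accepted_num": accepted_num,
                   "accepted_value": accepted_value}, f)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to disk, not just the page cache
    os.replace(tmp, path)     # atomic rename: readers never see a torn file
```

Writing to a temporary file and atomically renaming it ensures that a crash mid-write leaves the previous state intact rather than a corrupted file.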
With this algorithm, we can tolerate the failure of f acceptors when we have
2f+1 acceptors. For example, when we have 5 acceptors we can tolerate the
failure of 2 acceptors. Once f+1 or more acceptors have failed, the system
becomes unavailable, as we cannot get a majority of acceptors.
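The majority arithmetic above is simple enough to write down (function names are illustrative):

```python
def majority(n_acceptors):
    """Smallest number of acceptors that forms a majority."""
    return n_acceptors // 2 + 1

def tolerated_failures(n_acceptors):
    """With n = 2f+1 acceptors, f failures still leave a majority."""
    return (n_acceptors - 1) // 2

# 5 acceptors: a majority is 3, so up to 2 failures are tolerated.
assert majority(5) == 3 and tolerated_failures(5) == 2
# 3 acceptors: a majority is 2, so 1 failure is tolerated.
assert majority(3) == 2 and tolerated_failures(3) == 1
```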
Pseudocode
The pseudocode of the basic Paxos is as follows:
Pseudocode of the Proposer
Proposer {
    // the state of the proposer
    v;        // proposed value
    p_num;    // proposal number
    acks;     // number of acks received from acceptors
    promises; // promises received from acceptors

    /*
      The client calls this function of the proposer to propose its
      desired value. This will also be called after abort, to try again.
    */
    on propose(client_value) {
        p_num = generate a unique proposal number greater than the last generated p_num;
        v = client_value;
        acks = 0;
        promises = {};
        Send <Prepare, p_num> to all acceptors;
    }

    /*
      This function is called when the proposer receives a promise
      message from one of the acceptors.
    */
    on promise(n, promise) {
        if (n != p_num)
            return; // ignore it
        promises.add(promise);
        if (promises.size == floor(acceptors.size / 2) + 1) { // majority (1)
            if (promises.max_value() != NULL) { // at least one of the promises has a value
                v = promises.max_value(); // pick the value with the highest proposal number received from acceptors
            } else {
                // do nothing; let v be the value set in the propose function by the client
            }
            Send <Accept, p_num, v> to all acceptors;
        }
    }

    /*
      This function is called when the proposer receives an accepted
      message from one of the acceptors.
    */
    on accepted(n) {
        if (n != p_num)
            return; // ignore it
        acks++;
        if (acks == floor(acceptors.size / 2) + 1) { // majority (2)
            Send <Decide, v> to all learners;
        }
    }

    /*
      This function is called when the proposer receives a nack message
      from one of the acceptors.
    */
    on nack(n) {
        if (n != p_num)
            return; // ignore it
        abort();
        p_num = 0; // will cause all future messages to be ignored
    }
}
Pseudocode of the Acceptor
Acceptor {
    // the state of the acceptor
    accepted_num;   // the proposal number of the value accepted by the acceptor
    promised_num;   // the highest proposal number the acceptor has received
    accepted_value; // the value accepted by the acceptor

    /*
      This function is called when the acceptor receives a prepare
      message from one of the proposers.
    */
    on prepare(n, sender) {
        if (promised_num < n) {
            promised_num = n;
            Persist state on disk;
            Send <Promise, n, promise(accepted_num, accepted_value)> to sender;
        } else {
            Send <Nack, n> to sender;
        }
    }

    /*
      This function is called when the acceptor receives an accept
      message from one of the proposers.
    */
    on accept(n, v, sender) {
        if (promised_num <= n) {
            promised_num = n;
            accepted_num = n;
            accepted_value = v;
            Persist state on disk;
            Send <Accepted, n> to sender;
        } else {
            Send <Nack, n> to sender;
        }
    }
}
Pseudocode of the Learner
Learner {
    // the state of the learner
    decided_value = NULL;

    /*
      This function is called when the learner receives a decide message
      from one of the proposers.
    */
    on decide(v) {
        if (decided_value == NULL) {
            decided_value = v;
            Learn(v);
        }
    }
}
As said earlier, typically each node has all three roles. Thus, you can have a
general class named Participant and add the state variables and functions of all
classes above to it.
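As a minimal sketch of that idea, here is a runnable single-process version in Python that folds the acceptor and learner state into one Participant class, with the proposer logic as a function running against a cluster. All names are illustrative; message passing is replaced by direct method calls, and persistence, retries, and backoff are omitted.

```python
class Participant:
    def __init__(self):
        # acceptor state
        self.promised_num = 0
        self.accepted_num = None
        self.accepted_value = None
        # learner state
        self.decided_value = None

    # --- acceptor role ---
    def prepare(self, n):
        if n <= self.promised_num:
            return None  # nack
        self.promised_num = n
        return (self.accepted_num, self.accepted_value)

    def accept(self, n, v):
        if n < self.promised_num:
            return False  # nack
        self.promised_num = n
        self.accepted_num = n
        self.accepted_value = v
        return True

    # --- learner role ---
    def decide(self, v):
        if self.decided_value is None:
            self.decided_value = v

# --- proposer role, run against a cluster of participants ---
def propose(cluster, n, value):
    majority = len(cluster) // 2 + 1
    # Phase 1: collect promises.
    promises = [pr for pr in (p.prepare(n) for p in cluster) if pr is not None]
    if len(promises) < majority:
        return None
    # Adopt the value with the highest accepted proposal number, if any.
    carried = [(num, val) for num, val in promises if val is not None]
    if carried:
        value = max(carried)[1]
    # Phase 2: collect acceptances.
    acks = sum(1 for p in cluster if p.accept(n, value))
    if acks < majority:
        return None
    for p in cluster:
        p.decide(value)
    return value

cluster = [Participant() for _ in range(3)]
assert propose(cluster, 1, "A") == "A"   # first proposal is chosen
assert propose(cluster, 2, "B") == "A"   # later proposer must adopt "A"
```

Note how the second proposal returns "A", not "B": the promises force the new proposer to re-propose the already-accepted value, which is exactly the safety argument from phase 1.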
Conclusion
In this post, we saw that the Paxos algorithm, if not as simple as Leslie Lamport says, is at least not that hard to understand. We covered single-decree Paxos, which is used to decide on a single value. An extension of this algorithm, called Multi-Paxos, lets participants agree on a sequence of values and is more efficient than running Paxos multiple times. An alternative to Multi-Paxos is the Raft algorithm, which many find more understandable than Multi-Paxos.
References
[1] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson.
"Impossibility of distributed consensus with one faulty process." Journal of
the ACM (JACM) 32, no. 2 (1985): 374-382.
[2] Leslie Lamport. "The part-time parliament." In Concurrency: the Works of
Leslie Lamport, pp. 277-317. 2019.
[3] Leslie Lamport. "Paxos made simple." ACM Sigact News 32, no. 4 (2001):
18-25.