The Paxos algorithm, when presented in plain English, is very simple

Consensus is one of the fundamental problems in distributed systems. In this post, we will see what consensus is and review the most famous consensus algorithm, Paxos. We will see that, despite the exaggerations about its complexity, Paxos, at least the single-decree Paxos that aims to achieve consensus on a single value, is actually very intuitive and easy to understand.

Consensus

The purpose of a consensus algorithm, as the name suggests, is to allow multiple processes in a distributed system to reach an agreement. The consensus problem is defined as follows: we have a group of participants that must decide on the value of a register such that the following properties are satisfied: 
  • Agreement: The value decided by the participants must be the same for all participants. 
  • Integrity: Once a participant has decided on a value, it cannot change its decision. 
  • Validity: The value decided by the participants must be one of the values suggested by the participants. 
  • Termination: A value must be eventually decided by the participants. 
The first three properties are safety properties, whereas the last property, Termination, is a liveness property. 

Consensus is a fundamental problem in distributed systems, and many other problems can be reduced to it. For example, leader election can be reduced to a consensus problem where the agreement is about who the leader is. As another example, consider a replicated datastore where clients can submit their writes to different replicas. Here, the replicas must agree on the next write to apply. By agreeing on the next write at each step, we guarantee that all replicas apply writes in the same order, which is necessary to keep the replicas consistent with each other. 
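
As a rough illustration of the replicated-datastore example (a sketch, not part of Paxos itself), here is how a replica could run one consensus instance per write slot in Python; the consensus callable is a hypothetical primitive that returns the single value all replicas decide for a given slot:

class Replica:
    def __init__(self, consensus):
        # 'consensus' is a hypothetical callable: consensus(slot, value) runs one
        # instance of a consensus protocol (e.g., single-decree Paxos) for the given
        # slot and returns the single value that all replicas decide for that slot.
        self.consensus = consensus
        self.state = {}        # the replicated key-value state
        self.next_slot = 0     # index of the next write to agree on

    def submit_write(self, write):
        # Keep proposing the client's write until it becomes the decided value of
        # some slot; apply every decided write along the way.
        while True:
            decided = self.consensus(self.next_slot, write)
            self.apply(decided)
            self.next_slot += 1
            if decided == write:
                return

    def apply(self, write):
        key, value = write
        self.state[key] = value    # every replica applies the same writes in the same order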

FLP Impossibility Result 

You may have heard of the FLP impossibility result, a theorem proved by Fischer, Lynch, and Paterson [1], which shows that it is impossible to guarantee consensus in an asynchronous distributed system in which even a single process may fail. You may wonder, then, how Paxos and other consensus protocols manage to solve consensus. Note that FLP proves the impossibility of consensus under the very strict condition of total asynchrony. Under this asynchronous model, processes do not have a shared notion of time, and any step in the system may take an arbitrarily long time. As a result, we cannot assume any upper bound on the time a process or the network takes to do anything. Thus, we cannot rely on timeouts to conclude that a failure has happened. The following paragraph is from [1]:

"Crucial to our proof is that processing is completely asynchronous, that is, we make no assumptions about the relative speeds of processes nor about the delay time in delivering a message. We also assume that processes do not have access to synchronized clocks, so algorithms based on timeouts, for example, cannot be used."

As you can see, the assumptions made by FLP are indeed very strict. In practice, we have some level of synchrony: we can assume a reasonable upper bound on the time it takes to process a message or send it over the network. Similarly, although there is clock drift between nodes, we can still assume some shared notion of time between processes. Practical consensus protocols assume this relaxed, partially synchronous model of the distributed system and try to solve consensus under it.
 
Thus, Paxos and other consensus protocols do not violate FLP; they just live in a different world. FLP holds in asynchronous distributed systems, while consensus protocols work in partially synchronous distributed systems, where our assumptions about the synchrony between processes are correct most of the time but may occasionally be wrong. When those unlikely situations occur, consensus algorithms usually do not guarantee termination, but safety is always guaranteed. For example, as we will see, with Paxos we may have a situation where proposers keep stepping on each other, preventing agreement forever, but in practice, with random backoffs, this is very unlikely to happen. 

Paxos

Paxos [2] is the most famous consensus algorithm. Although you may have heard exaggerations about its complexity, Paxos is not that hard to understand. Interestingly, the inventor of Paxos, Leslie Lamport, published a follow-up paper [3] to show that Paxos is easy to understand once it is explained in plain English. 
Figure 1. Lamport published a follow-up paper [3] in 2001 just to show that Paxos is not really that hard to understand. The abstract of the paper is a single sentence stating this fact.

So, let's see how Paxos works!

In Paxos, we have three roles: Proposer, Acceptor, and Learner. These are logical roles, and all of them can be executed on the same node; in practice, a node usually plays all three roles. An execution of the basic Paxos algorithm decides a single value. The proposers propose their values to the acceptors; once a value is decided by the acceptors, the learners "learn" it, i.e., consider it decided. 

Paxos runs in two phases: 
  • Phase 1: In the first phase, one of the proposers establishes a temporary leadership by receiving promises from a majority of acceptors. This temporary leadership lasts until another proposer gets promises from a majority of the acceptors.
  • Phase 2: In the second phase, the proposer uses its leadership to get a value chosen. The proposer has a short window to use its established leadership and get acceptance from a majority of the acceptors. If the proposer is not quick enough, its temporary leadership may be taken over by another proposer, and the process must be repeated by the new proposer. 
 But don't we get into a situation where the proposers keep taking over the leadership and never let the current leader get the job done?
Yes, that can happen if proposers propose aggressively. To avoid it, one solution is to make a proposer wait for a random amount of time before trying again when it fails to get its value accepted. We can use exponential random backoffs, as sketched below.
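
For concreteness, here is a minimal Python sketch of such an exponential random backoff (the constants are arbitrary and only illustrative):

import random
import time

def backoff(attempt, base=0.05, cap=5.0):
    # Sleep for a random duration in [0, base * 2^attempt], capped at 'cap' seconds,
    # before the proposer retries with a larger proposal number.
    delay = random.uniform(0, min(cap, base * (2 ** attempt)))
    time.sleep(delay)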

The temporary leadership is assigned as follows: each proposer picks a number called the "proposal number". This number must be unique and monotonically increasing. To have unique proposal numbers, we can have proposers pick their numbers from disjoint sets of numbers. For example, if we have two proposers, one can use odd numbers and the other even numbers. Each proposer sends a prepare(n) message to all acceptors, where n is its proposal number. 
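
For illustration, one simple scheme that yields unique, monotonically increasing proposal numbers, assuming the proposers are numbered 0 to N-1, is the following sketch:

def proposal_number(round_num, proposer_id, num_proposers):
    # Round r of proposer i yields r * N + i, so numbers from different proposers
    # never collide and each proposer's numbers grow monotonically.
    return round_num * num_proposers + proposer_id

# With two proposers (N = 2), proposer 0 generates 0, 2, 4, ... and proposer 1 generates 1, 3, 5, ...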

Upon receiving a prepare(n) message, an acceptor proceeds as follows: 
  1. If the acceptor has already responded to a proposal by another proposer with a proposal number m such that m > n, it sends a Nack to the proposer. 
    • This Nack message means: Sorry, your proposal number is too small. I have already promised (or even accepted) another proposal with a proposal number greater than yours. 
  2. If the acceptor has already accepted a value v with a proposal number m < n, it sends a promise(m, v) message in response to the proposer. 
    • This promise(m, v) message means: Your proposal number looks good to me, as it is large enough, but note that I have already accepted a value. The value that I accepted is v, and its proposal number was m.
  3. Otherwise, the acceptor returns a promise(n) message to the proposer. 
    • This promise(n) message means: OK, I promise to reject any proposal with a proposal number smaller than n.
Once a proposer receives promise(n) or promise(m, v) messages from a majority of the acceptors, it has gained the temporary leadership and is ready to go to phase 2. But before that, it must pick the value it will propose: if any of the promises carried an accepted value, the proposer must pick the value with the highest proposal number among them; otherwise, it is free to propose its own value. If the proposer does not receive promises from a majority of the acceptors, it tries again with a larger proposal number, but, as said above, only after some random wait time.
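
For illustration, the value selection can be sketched in Python as follows, assuming each promise is a pair (accepted_num, accepted_value), with both fields None for acceptors that have not accepted anything yet:

def choose_value(promises, my_value):
    # promises: (accepted_num, accepted_value) pairs received from a majority of acceptors.
    accepted = [(num, val) for (num, val) in promises if val is not None]
    if not accepted:
        return my_value        # nobody has accepted a value yet; we are free to propose our own
    return max(accepted)[1]    # the value paired with the highest proposal number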

In phase 2, the proposer sends an Accept(n, v) message, where v is the value selected by the proposer, to all acceptors. Upon receiving an Accept(n, v) message, an acceptor accepts the value only if it has not received a prepare(m) message such that m > n. After accepting a value, the acceptor sends an Accepted message to the proposer. Once the proposer receives Accepted messages from a majority of the acceptors, it considers the value decided and sends it to the learners. Alternatively, the acceptors can send their Accepted messages directly to all learners. In that case, a learner learns the value once it receives it from a majority of the acceptors. Since in practice all these roles usually co-exist in every participant, the two approaches are effectively the same. 

Note that before sending any message in this algorithm, an acceptor must write its state to stable storage such as a disk (i.e. we have to fsync!).  That is important, as an acceptor must remember what it promised or accepted after recovering from a failure.
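
As a rough Python sketch of what "persist state on disk" can look like for an acceptor (the file name and JSON format are illustrative assumptions, not part of the algorithm):

import json
import os

def persist_state(path, promised_num, accepted_num, accepted_value):
    # Write the acceptor's state to a temporary file, fsync it, and atomically
    # replace the old state file, so the state survives a crash.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"promised_num": promised_num,
                   "accepted_num": accepted_num,
                   "accepted_value": accepted_value}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)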

With this algorithm, we can tolerate the failure of f acceptors when we have 2f+1 acceptors. For example, with 5 acceptors we can tolerate the failure of 2 acceptors. Once f+1 or more acceptors have failed, the system becomes unavailable, as we can no longer reach a majority of the acceptors. 
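
In other words, the majority (quorum) size used in both phases is floor(N/2) + 1. A quick sanity check in Python:

def majority(num_acceptors):
    # The smallest number of acceptors that forms a majority.
    return num_acceptors // 2 + 1

assert majority(5) == 3   # 2f+1 = 5 acceptors tolerate f = 2 failures
assert majority(3) == 2   # 2f+1 = 3 acceptors tolerate f = 1 failure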

Pseudocode 

The pseudocode of the basic Paxos is as follows: 

Pseudocode of the Proposer 
Proposer
{
    //the state of the proposer
    v;  //proposed value
    p_num; //proposal number
    acks;  //number of Accepted messages received from acceptors
    promises; //promises received from acceptors 

    /*
    The client calls this function of the proposer to propose its desired value.
    This will also be called after an abort to try again. 
    */
    on propose(client_value)
    {
        p_num = generate a unique proposal number greater than the last generated p_num;
        v = client_value; 
        acks = 0;
        promises = {}; 
        Send <Prepare, p_num> to all acceptors;
    }

    /* 
    This function is called when the proposer receives a promise message from one of the acceptors.
    */
    on promise(n, promise)
    {
        if (n != p_num) //ignore it
            return;
        promises.add(promise);
        if (promises.size == floor(acceptors.size / 2) + 1) {  //majority (1) 
            if (promises.max_value() != NULL) { //at least one of the promises carries an accepted value
                v = promises.max_value();  //pick the value with the highest proposal number among the received promises
            } else {
                //do nothing; keep v as the value set in the propose function by the client.
            }
            Send <Accept, p_num, v> to all acceptors; 
        }
    }

    /* 
    This function is called when the proposer receives an accepted message from one of the acceptors.
    */
    on accepted(n) {
        if (n != p_num)
            return; //ignore it
        acks++; 
        if (acks == floor(acceptors.size / 2) + 1) { //majority (2)
            Send <Decide, v> to all learners; 
        }
    }

    /* 
    This function is called when the proposer receives a nack message from one of the acceptors.
    */
    on nack(n) {
        if (n != p_num)
            return; //ignore it
        abort(); //give up this attempt; a retry with a larger proposal number should happen only after a random backoff
        p_num = 0; //causes any remaining messages for the aborted proposal to be ignored
        
    }

}

Pseudocode of the Acceptor
Acceptor {
    //the state of the acceptor
    accepted_num  //the proposal number of the value accepted by the acceptor
    promised_num //the highest proposal number the acceptor has promised
    accepted_value //the value accepted by the acceptor 

    /* 
    This function is called when the acceptor receives a prepare message from one of the proposers.
    */
    on prepare (n, sender) {
        if (promised_num < n) {
            promised_num = n; 
            Persist state on disk;
            Send <Promise, n, promise(accepted_num, accepted_value)> to sender; 
        } else {
            Send <Nack, n> to sender;
        }
    }

    /* 
    This function is called when the acceptor receives an accept message from one of the proposers.
    */
    on accept (n, v, sender) {
        if (promised_num <= n){
            promised_num = n; 
            accepted_num = n; 
            accepted_value = v; 
            Persist state on disk;
            Send <Accepted, n> to sender; 
        }else {
            Send <Nack, n> to sender;
        }
    }
}

Pseudocode of the Learner
Learner {
    //the state of the learner 
    decided_value = NULL;
    
    /* 
    This function is called when the learner receives a decide message from one of the proposers.
    */
    on decide (v){
        if (decided_value == NULL) {
            decided_value = v; 
            Learn (v);
        }
    }
}

As said earlier, each node typically plays all three roles. Thus, you can have a general class named Participant that contains the state variables and functions of all three classes above.
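
For example, such a combined participant could start from a skeleton like the following Python sketch (the names are illustrative, and the handler bodies would be the ones from the pseudocode above):

class Participant:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers             # the other participants
        # Proposer state
        self.v = None                  # proposed value
        self.p_num = 0                 # proposal number
        self.acks = 0                  # number of Accepted messages received
        self.promises = []             # promises received from acceptors
        # Acceptor state
        self.promised_num = 0          # highest proposal number promised
        self.accepted_num = None       # proposal number of the accepted value
        self.accepted_value = None     # accepted value, if any
        # Learner state
        self.decided_value = None      # decided value, if any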

Conclusion 

In this post, we saw that the Paxos algorithm, even if not quite as simple as Leslie Lamport claims, is at least not that hard to understand. We talked about single-decree Paxos, which is used to decide a single value. An extension of this algorithm, called Multi-Paxos, lets participants agree on a sequence of values and is more efficient than running basic Paxos multiple times. An alternative to Multi-Paxos is the Raft algorithm, which many find more understandable. 

References

[1] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. "Impossibility of distributed consensus with one faulty process." Journal of the ACM (JACM) 32, no. 2 (1985): 374-382.
[2] Leslie Lamport. "The part-time parliament." In Concurrency: the Works of Leslie Lamport, pp. 277-317. 2019.
[3] Leslie Lamport. "Paxos made simple." ACM Sigact News 32, no. 4 (2001): 18-25.


