Spanner, 2PC+2PL over Paxos
Spanner provides strictly serializable distributed transactions on a global
scale. Spanner was introduced in 2012, and today, it is available as a managed
service called Cloud Spanner via Google Cloud Platform. CockroachDB and
YugaByteDB are two open-source databases inspired by Spanner's design. In this post, we see
how Spanner provides strictly serializable distributed transactions while being a replicated and partitioned database.
In my previous post, we saw why achieving serializability (and strict serializability, a.k.a.
external consistency) for partitioned and replicated databases is a
challenging problem. Spanner leverages 2PC, 2PL, and MVCC methods to provide
strict serializability for both read-write and read-only transactions. To
achieve that, Spanner uses something that is usually considered unreliable in
distributed systems--time. With tightly synchronized clocks, Spanner uses
time to reason about the order of events in the system and provide strictly serializable transactions.
2PC+2PL over Paxos Leaders
When replicas apply the exact same updates in the exact same order, they will
end up in the exact same state. That is known as state machine replication. The agreement on which update to apply next is, in fact, the consensus problem. Spanner uses Paxos for state machine
replication. We don't want to cover Paxos in this post. For now, it is enough
to know that Paxos lets a group of replicas agree on the order of a sequence
of updates.
Ok, now consider a system where data is partitioned, and each partition is
replicated. The replicas for each partition form a Paxos group. Spanner
provides serializable transactions by simply running 2PC+2PL on the Paxos
leaders. One of the leaders is the coordinator, and the rest of the leaders
are the participants. All 2PC state for both the coordinator and the
participants is recorded in their Paxos state machines. Thus, if any of them
crashes in the middle of 2PC, the new leader has all the information to
continue and complete the 2PC. That increases fault-tolerance, especially for
the coordinator, as we know the crash of the coordinator is the main weakness
of 2PC.
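To make this concrete, here is a minimal sketch (not Spanner's actual code) of 2PC where every state transition is first appended to the group's replicated log before anyone acts on it; the names (PaxosLog, ParticipantLeader, run_2pc, and so on) are hypothetical stand-ins. The point is that the decision lives in replicated state, so a newly elected leader can read the log and resume the protocol.
```python
# A minimal sketch of 2PC run over replicated logs. PaxosLog is a stand-in for
# a Paxos-replicated log; in a real system, append() would return only after
# the record is committed by a majority of the group.

class PaxosLog:
    def __init__(self):
        self.records = []                 # replicated, in-order records

    def append(self, record):
        # Assume this blocks until the record is durably replicated.
        self.records.append(record)


class ParticipantLeader:
    def __init__(self, name):
        self.name = name
        self.log = PaxosLog()

    def prepare(self, txn_id):
        # Acquire 2PL locks here (omitted), then log the prepare record so that
        # a newly elected leader of this group can still answer for this txn.
        self.log.append(("prepare", txn_id))
        return "yes"

    def apply_decision(self, txn_id, decision):
        # Log the outcome, apply the writes if committed, and release the locks.
        self.log.append((decision, txn_id))


class CoordinatorLeader(ParticipantLeader):
    def run_2pc(self, txn_id, participants):
        votes = [p.prepare(txn_id) for p in participants]
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        # The decision is recorded in the coordinator's replicated state, so a
        # coordinator crash no longer blocks the protocol forever.
        self.log.append((decision, txn_id))
        for p in participants:
            p.apply_decision(txn_id, decision)
        return decision


# Example: one coordinator and two participants, each a Paxos leader.
coordinator = CoordinatorLeader("P1-leader")
participants = [ParticipantLeader("P2-leader"), ParticipantLeader("P3-leader")]
print(coordinator.run_2pc("txn-42", participants))   # -> "commit"
```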
So far so good. If we only access the leaders, for both reads and writes,
serializability is guaranteed, and we can rest assured that we read the most
recent versions, as the leaders have the most recent updates. In that case,
the replicas are used only for fault-tolerance, and the clients do not go to
them even for reads. By only accessing leaders and running 2PL we
guarantee that we achieve a serialization order that respects
the real-time order of the transactions, i.e., if transaction
T1 commits before transaction T2 starts, then it should appear that
T2 has executed after T1. Thus, the clients should never see a state that includes the effect of T2 but does not include the effect of
T1. This is called external consistency, which is also known
as strict serializability (see
this post).
If you only care about read-write transactions and you are OK with running all
reads and writes only on the leaders and using replicas only for
fault-tolerance, you can stop here; running 2PC+2PL over Paxos leaders is the
only thing you need to know about Spanner.
Read-only Transactions and Snapshot Reads
We usually want to scale our read operations by allowing users to go to any
replica that hosts the object, not just the leaders. Also, no matter where we
run our reads, we prefer not to block writes by reads. Note that when reads
are part of a read-write transaction, they have to block writes during the
2PL. However, when a client wants to just read a set of items without writing
anything, we don't want these reads to block the writes. For these pure
reads, Spanner provides another abstraction called
read-only transactions. The reads of read-only transactions do not
acquire any lock, so they never block write operations.
Spanner serves read-only transactions via snapshot reads. Using a
snapshot, we can read the values for a set of items at a given time such that
we are reading the values as if the system was frozen at that time. Spanner
assigns a timestamp to each version written for each item and maintains
multiple versions for each item. When it wants to read an item for a snapshot
at time t, it reads the most recent value of that item with a timestamp not
greater than t.
For example, suppose we have the following versions for item x:
<value=5, timestamp=6>
<value=3, timestamp=4>
<value=10, timestamp=1>
and we want to read x in a snapshot at time 5. Spanner returns 3 as this
version is the version with the highest timestamp no greater than 5.
But how can we be sure that there does not exist a version with timestamp
t' <= t that has not been received by the replica that we are reading
from? In the example above, we could have a version at time 5 that has not been
received by the replica that the client is accessing.
To make sure that a replica is up-to-date enough, the replica first
calculates a variable called t_safe. As we will explain later, Spanner
guarantees that a replica has received all versions with timestamps less than or equal to the calculated t_safe. Now, if t_safe is smaller than t, the
operation will be blocked until t_safe becomes greater than or equal to t.
This way, Spanner makes sure the scenario you described does not occur.
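As an illustration, here is a minimal sketch of such a snapshot read on one replica, assuming hypothetical helpers versions_of(item) (returning (value, timestamp) pairs) and current_t_safe(); it waits until t_safe catches up with the snapshot time and then picks the latest version that is not newer than it.
```python
import time

def snapshot_read(item, t, versions_of, current_t_safe, poll_interval=0.01):
    """Read `item` as of snapshot time `t` on one replica (a sketch)."""
    # Block until the replica is guaranteed to have received every version
    # with a timestamp <= t.
    while current_t_safe() < t:
        time.sleep(poll_interval)

    # Return the value with the highest timestamp that is not greater than t.
    candidates = [(ts, val) for (val, ts) in versions_of(item) if ts <= t]
    if not candidates:
        return None
    return max(candidates)[1]


# Demo with the example above: versions of x as (value, timestamp) pairs, and a
# replica whose t_safe has already reached 6.
versions = {"x": [(5, 6), (3, 4), (10, 1)]}
print(snapshot_read("x", 5, versions.get, lambda: 6))   # -> 3
```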
To serve a read-only transaction, Spanner picks a timestamp (as we will
explain below), and runs a snapshot read at that time. Thus, a read-only
transaction is basically a snapshot read at a timestamp picked by Spanner.
Ok, we now understand how read-only transactions and snapshot reads generally
work, given that we have the timestamps. The following are the details that we
will focus on next.
- How should we assign a timestamp to a transaction, i.e. to the version written by a transaction?
- How should we pick a timestamp for a read-only transaction?
- How can we calculate t_safe in a replica?
Timestamps
Spanner assigns timestamps such that any version written by a transaction T1
has a timestamp smaller than the timestamp assigned to any version written by a
transaction T2 that starts after T1 commits. For example,
suppose transaction T1 writes value v1 for
item x, and Spanner assigns timestamp 5 to this version. T1
commits. Then, transaction T2 starts and writes value v2 for
the same item x. Spanner guarantees that v2 has a timestamp
greater than 5. We call a versioning that respects the
real-time order of the transactions a proper versioning [1].
Isn't proper versioning easy? If we just use the system clock to
timestamp, won't the timestamp of v2 always be greater than the timestamp of
v1?
In a perfect world where clocks are perfectly synchronized, yes; but with current technology, no. Due to the clock skew between machines, if we simply use system clocks, we may violate proper versioning for two transactions executed by two different nodes.
Here is where Spanner uses TrueTime. TrueTime is a distributed clock
system whose API returns time as an interval instead of a number. The real
value of time is guaranteed to be within the returned interval.
What do you mean by the "real" value of time?
We mean the time with respect to a universal reference without any error.
Using these intervals, we might be able to argue about the order of events. For
example, when the TrueTime timestamps of
events e1 and e2 are [5,7] and
[10,12], respectively, we can be sure e2 has occurred
after e1 because 10 > 7. However, when there is an
overlap between the two intervals, e.g. [5,7] and [6,8], we can't conclude
any order.
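The following sketch, with a hypothetical TTInterval type, shows how such intervals let us order two events only when they do not overlap.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TTInterval:
    earliest: float
    latest: float   # the real time is guaranteed to be within [earliest, latest]

def definitely_before(a: TTInterval, b: TTInterval) -> Optional[bool]:
    """True/False if the order of the two events is certain, None if unknown."""
    if a.latest < b.earliest:
        return True       # a surely happened before b
    if b.latest < a.earliest:
        return False      # b surely happened before a
    return None           # the intervals overlap: we cannot conclude any order

print(definitely_before(TTInterval(5, 7), TTInterval(10, 12)))   # True
print(definitely_before(TTInterval(5, 7), TTInterval(6, 8)))     # None
```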
When Spanner wants to commit and assign a timestamp to a transaction, the
coordinator asks TrueTime for the time. TrueTime returns an interval [earliest, latest]. The coordinator assigns a timestamp t not less than latest (we will explain exactly how Spanner does that below). By choosing a
timestamp not less than latest, we guarantee that the timestamp of a transaction is greater than the real start time of the
transaction. We call this invariant start.
The coordinator then waits before actually committing the transaction,
repeatedly asking TrueTime, which keeps returning new [earliest, latest] intervals. The coordinator finally commits the transaction when
it finds earliest > t. This guarantees that the timestamp of a transaction is smaller than the real commit time of the
transaction. We call this invariant commit wait.
Figure 2. start and commit wait together guarantee that start time < timestamp < commit
time.
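To make the two invariants concrete, here is a minimal sketch in which truetime_now() is a simulated stand-in for the TrueTime API (the 4 ms uncertainty is just an assumed example value), not the real interface.
```python
import time

def truetime_now():
    """Simulated TrueTime: returns an (earliest, latest) interval."""
    uncertainty = 0.004                     # assumed 4 ms uncertainty
    now = time.time()
    return (now - uncertainty, now + uncertainty)

def commit_with_timestamp():
    # start: pick a timestamp no less than the current `latest`, so the
    # timestamp is greater than the real start time of the transaction.
    _, latest = truetime_now()
    commit_ts = latest

    # commit wait: do not commit until `earliest` has passed commit_ts, so the
    # timestamp is smaller than the real commit time of the transaction.
    while truetime_now()[0] <= commit_ts:
        time.sleep(0.001)

    return commit_ts     # safe to log the commit record at commit_ts now

print(commit_with_timestamp())
```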
Let's go back to external consistency. As we said above, we want this:
when T2 starts after T1 commits, then the timestamp of T2 must be greater than
that of T1. Let's see how this is guaranteed by Spanner:
- timestamp of T1 < real commit time of T1 (commit wait)
- real commit time of T1 < real start time of T2 (assumption)
- real start time of T2 < timestamp of T2 (start)
=> timestamp of T1 < timestamp of T2
Figure 3. start and commit wait guarantee that when T2 starts after T1 commits, its timestamp is greater
than that of T1.
Time synchronization protocols such as NTP also provide uncertainty
intervals. Can't we do the same with them?
Yes, you can do that. What is special about TrueTime is that it relies on
special hardware that keeps the uncertainty window very small, e.g., around 7 ms (or
even less than 1 ms according to this talk), while it is 100-250 ms for NTP. So, if we wanted to do the same with NTP, we
might have to deliberately delay each transaction for up to 250 ms; not good.
The start and commit wait are invariants that Spanner has to respect, but we have not talked about how
Spanner actually assigns timestamps so far. Now, let's see how Spanner
assigns timestamps to the versions written by a read-write transaction and
picks a timestamp for a read-only transaction.
Read-write Transactions
For both reads and writes in a read-write transaction, we have to go to the
leaders, and get the locks as part of 2PL. Once the client has performed all of
its reads and buffered all of its writes, it is ready to commit the
transaction. The client picks one of the leaders as the coordinator and
requests to commit. The coordinator then runs 2PC to commit the transaction.
Each participant, which is the leader of its own replica group, picks
a prepare timestamp that must be greater than any timestamp
that it has assigned to previous transactions (let's call this invariant
prepare_timestamp), and logs a prepare
message in its Paxos state, and responds to the coordinator. Once the
coordinator has received all the responses, it picks a timestamp that satisfies
the following constraints:
- It is not less than any of the prepare timestamps picked by the participants (let's call this invariant greater_than_all_prepare_timestamps).
- It is greater than any timestamp picked by itself.
- It is greater than the current TrueTime latest. (start)
Once the coordinator has picked a proper timestamp, it waits to satisfy the commit wait invariant explained above. Then, it writes a commit record to its Paxos
state and sends the commit timestamp to the client and all participants.
Once a participant receives the outcome of the transaction, it applies the
outcome to its Paxos state at the given timestamp and releases the
locks.
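Here is a minimal sketch of this timestamp choice, treating timestamps as integers for simplicity; the function and parameter names are hypothetical, and tt_latest stands for the latest bound of the current TrueTime interval.
```python
def choose_commit_timestamp(prepare_timestamps, coordinator_last_ts, tt_latest):
    """Pick a commit timestamp for a read-write transaction (a sketch).

    prepare_timestamps: prepare timestamps logged by the participants
    coordinator_last_ts: highest timestamp this coordinator has assigned so far
    tt_latest: the `latest` bound of the current TrueTime interval
    """
    return max(
        max(prepare_timestamps),    # not less than any prepare timestamp
        coordinator_last_ts + 1,    # strictly greater than its own past timestamps
        tt_latest,                  # the start invariant
    )

# Example: participants prepared at 8 and 11, the coordinator's last assigned
# timestamp is 9, and TrueTime currently reports latest = 10.
print(choose_commit_timestamp([8, 11], 9, 10))   # -> 11
```
After picking this timestamp, the coordinator still performs the commit wait sketched earlier before logging the commit record.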
Read-only Transactions
For read-only transactions that span more than one partition, Spanner gets
the current [earliest, latest] from TrueTime and picks
latest as the timestamp. This guarantees that the
real start time of the read-only transaction is smaller than the snapshot
read timestamp we assign to it. Let's call this invariant ro_start, which is similar to start in read-write transactions.
Let's see why this selection of snapshot read timestamp will guarantee
external consistency:
To prove external consistency for snapshot reads, we have to show that the
effect of any transaction committed before a read-only transaction starts is
reflected in it. Suppose transaction T1 is committed before the start of the
read-only transaction. Now, we have:
- timestamp of T1 < real commit time of T1 (commit wait)
- real commit time of T1 < real start time of the read-only transaction (assumption)
- real start time of the read-only transaction < snapshot read timestamp (ro_start)
=> timestamp of T1 < snapshot read timestamp
Ok, since snapshot read timestamp <= t_safe (see above), we have:
timestamp of T1 < t_safe, which guarantees that T1 has been received by the
replica, and since timestamp of T1 < snapshot read timestamp, the
effect of T1 will be reflected in the snapshot, i.e., external consistency.
(end of proof)
For read-only transactions that read items in a single partition, when there
are no prepared-but-not-committed transactions, we can use the timestamp of the
last committed Paxos write as the snapshot timestamp instead
of latest. Since there is no prepared transaction, picking the
highest Paxos timestamp guarantees that any transaction committed before the
snapshot has a smaller timestamp.
But picking latest is always correct whether the read-only transaction
is single-partition or not, right? Then, why not always pick latest? What is
the benefit of picking the highest Paxos write timestamp instead of latest?
That is true. Picking latest always results in external
consistency. However, note that when the snapshot read timestamp is greater than
t_safe, the snapshot will get blocked. Thus, to reduce the risk of being
blocked, we want to choose the minimum snapshot read timestamp that satisfies
external consistency. At any given time, the highest Paxos timestamp is
guaranteed to be smaller than latest.
It is important to note that we have to get the last Paxos timestamp from a node with the most recent Paxos log. The Paxos leader is guaranteed to have the most recent Paxos log. Thus, to get the timestamp for read-only transactions that span only one partition, the client must go to the leader. The actual reading via a snapshot at that timestamp, however, can be executed on any of the replicas. Note that if the replica is behind the leader, the snapshot read might get blocked.
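Putting the two cases together, here is a minimal sketch of the timestamp choice for a read-only transaction; the function and parameter names are hypothetical, and tt_latest again stands for the latest bound of the current TrueTime interval.
```python
def readonly_snapshot_timestamp(num_partitions, tt_latest, last_paxos_commit_ts=None):
    """Pick the snapshot timestamp of a read-only transaction (a sketch).

    num_partitions: how many partitions the transaction reads
    tt_latest: the `latest` bound of the current TrueTime interval
    last_paxos_commit_ts: for the single-partition case, the timestamp of the
      leader's last committed Paxos write (assuming no pending prepared txns)
    """
    if num_partitions == 1 and last_paxos_commit_ts is not None:
        # The leader's last Paxos write timestamp is smaller than tt_latest, so
        # the snapshot read is less likely to block waiting for t_safe.
        return last_paxos_commit_ts
    # General case: tt_latest is greater than the real start time of the
    # read-only transaction (the ro_start invariant).
    return tt_latest

print(readonly_snapshot_timestamp(3, 10))       # multi-partition -> 10
print(readonly_snapshot_timestamp(1, 10, 7))    # single-partition -> 7
```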
Calculating t_safe
As we said above, to execute a snapshot at a given time, the executing replica
needs to make sure it has received all transactions committed with a timestamp
smaller than or equal to the snapshot time. Otherwise, the snapshot may miss a
version that must have been included. The replica first computes t_safe and
blocks the snapshot at time t if t_safe < t. Again note that, as we said above,
the client is free to contact any of the replicas (not just the Paxos leaders)
to execute the snapshot at its desired timestamp.
Let's first consider the easier case where the replica is not aware of any
prepared but not committed transaction. Let t_p be the timestamp of the most
recently applied Paxos write. In this case, we set t_safe to t_p. Why is this
correct? Because we know that any future timestamp will be higher than t_p, as
a result of the invariants prepare_timestamp and greater_than_all_prepare_timestamps introduced
above.
Now, let's consider the case where the replica is aware of a prepared but not
committed transaction. In this case, we might see a future Paxos record with a
smaller timestamp than the most recent Paxos write due to older but pending
transactions. Thus, we have to take prepared transactions into account when we
compute t_safe. Let t_m be the minimum of the timestamps assigned to prepared
transactions, minus 1. Picking t_safe = min(t_p, t_m) guarantees that we will
never apply a transaction with a timestamp smaller than or equal to t_safe in
this replica, which is the guarantee that we need for external
consistency.
For example, suppose a replica receives <prepare T2, 4> and
then <commit T1, 6>. Now, t_p is 6, but t_m is 3 (i.e., 4 - 1). Thus, t_safe will
be min(6, 3) = 3.
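A minimal sketch of this computation, with hypothetical parameter names:
```python
def compute_t_safe(last_applied_paxos_ts, prepared_timestamps):
    """Compute t_safe on a replica (a sketch).

    last_applied_paxos_ts: timestamp of the most recently applied Paxos write (t_p)
    prepared_timestamps: timestamps of prepared-but-not-committed transactions
    """
    t_p = last_applied_paxos_ts
    if not prepared_timestamps:
        return t_p
    t_m = min(prepared_timestamps) - 1
    return min(t_p, t_m)

# The example above: <prepare T2, 4> followed by <commit T1, 6>.
print(compute_t_safe(6, [4]))   # -> 3
```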
But in a Paxos group, not all replicas are guaranteed to know the most
recent Paxos write. Thus, it is possible that we have a prepared transaction,
but this particular replica is not aware of it. In the example above, the
replica might not receive <prepare T2, 4>, and it may choose 6
incorrectly.
It is true that a replica may not be aware of the most recent Paxos write.
However, what you describe cannot happen, as writes in a Paxos group are
guaranteed to be applied in order. Thus, it is impossible for the replica to see
<commit T1, 6> but not <prepare T2, 4>. Not being aware of the
most recent Paxos writes and prepare records may result in a smaller t_safe,
which in turn results in a higher risk of blocking due to t_safe < snapshot
timestamp, but it does not violate external consistency.
What if a snapshot picks time t, but there is no activity in the system?
Then, t_safe won't advance and the snapshot will block forever. How can we
avoid that?
We don't want to go into the details of the Paxos algorithm used by Spanner
here, but basically, Spanner guarantees that t_p advances periodically
even when there is no new transaction for the partition.
In summary,
- Spanner uses 2PC+2PL over Paxos leaders to support serializability in a replicated and partitioned system.
- It uses MVCC to provide consistent snapshots.
- It uses TrueTime and waits out the uncertainty window to guarantee external consistency.
References
[1] Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher
Frost, Jeffrey John Furman, Sanjay Ghemawat et al. "Spanner: Google’s
globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 1-22.