Amazon DynamoDB: ACID Transactions using Timestamp Ordering
Amazon DynamoDB uses a lightweight transaction protocol based on timestamp ordering. Although both Spanner and DynamoDB use timestamps in their transaction protocols, DynamoDB's seems significantly simpler than Spanner's; while Spanner uses 2PL and MVCC and needs TrueTime, DynamoDB relies only on timestamp ordering and does not use any locking or MVCC scheme. In this post, we review the DynamoDB transaction protocol and compare it with Google Spanner.
We have covered several protocols for distributed transactions in this blog so far. We saw Spanner uses 2PC+2PL over Paxos leaders and uses TrueTime for external consistency, Calvin uses a deterministic approach, and FoundationDB uses OCC. In this post, we cover DynamoDB, which uses timestamp ordering, perhaps the simplest approach of those we have covered so far. DynamoDB does not require tight clock synchronization and, unlike Spanner, does not assume any uncertainty window. It does not need MVCC and avoids any locking scheme (e.g., 2PL or deterministic locking). The timestamp ordering approach used by DynamoDB is very simple: assign timestamps to transactions and let these timestamps define the serialization order, i.e., a transaction with a smaller timestamp must appear to be executed before one with a larger timestamp. The job of the database, then, is to abort any transaction that would violate this illusion. We will see how, with this simple approach, DynamoDB provides ACID transactions while avoiding the complexities of more sophisticated protocols.
Related Post: DynamoDB, Ten Years Later
ACID Transactions in DynamoDB
This section is a summary of the great talk by Doug Terry at FAST '19 [1], plus some additional discussion.
Design Goals
DynamoDB is designed to provide ACID transactions together with scalability. Terry defines scalability as the combination of 1) unbounded growth and 2) predictable performance [1]. The reason many cloud databases embrace simple data models without consistency or isolation guarantees is to satisfy these scalability requirements. When a database does not care about consistency or isolation, we can horizontally scale the system by simply partitioning the data and adding new shards as needed. Additional consistency or isolation, on the other hand, means higher latency as we scale.
In addition to scalability, DynamoDB transactions could not use MVCC, because the underlying storage does not support it; also, since DynamoDB is a cloud database where customers pay for storage, the extra storage required by MVCC would not have been straightforward to deal with. Another design goal for DynamoDB was to avoid any impact on non-transactional operations.
The following list is a summary of the design goals for DynamoDB transactions:
- Atomicity and Serializability
- Scalability
  - Unbounded growth
  - Predictable performance
- No MVCC
- No impact on non-transactional operations
API: One-Shot Transactions
Since DynamoDB is a multi-tenant cloud database, interactive transactions that may hold resources for a long time are not acceptable. DynamoDB instead adopts the simple one-shot transactions approach. Specifically, it provides the following two transaction APIs.
TransactGetItems { // (Read-only Transaction)
    Get (table: "T1", key: k1),
    Get (table: "T2", key: k2),
    Get (table: "T3", key: k3)
}

TransactWriteItems { // (Conditional Write Transaction)
    Put (table: "T1", key: k1, value: v1),
    Delete (table: "T2", key: k2),
    Update (table: "T3", key: k3, value: v1),
    Check (table: "T3", key: k3, value: <100)
}
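For concreteness, here is roughly what these two calls look like with the AWS SDK for Python (boto3); the table names, keys, and attribute names are made up for illustration. Note that the real API does not allow two operations on the same item in one transaction, so the Check on k3 above is expressed as a ConditionExpression on the Update itself.

    import boto3

    client = boto3.client("dynamodb")

    # Read-only transaction: all three Gets see one consistent state.
    client.transact_get_items(TransactItems=[
        {"Get": {"TableName": "T1", "Key": {"pk": {"S": "k1"}}}},
        {"Get": {"TableName": "T2", "Key": {"pk": {"S": "k2"}}}},
        {"Get": {"TableName": "T3", "Key": {"pk": {"S": "k3"}}}},
    ])

    # Conditional write transaction: all writes commit atomically, or none do.
    client.transact_write_items(TransactItems=[
        {"Put": {"TableName": "T1",
                 "Item": {"pk": {"S": "k1"}, "val": {"N": "1"}}}},
        {"Delete": {"TableName": "T2", "Key": {"pk": {"S": "k2"}}}},
        {"Update": {"TableName": "T3", "Key": {"pk": {"S": "k3"}},
                    "UpdateExpression": "SET val = :v",
                    # the Check from the example above, folded into the Update
                    "ConditionExpression": "val < :limit",
                    "ExpressionAttributeValues": {":v": {"N": "1"},
                                                  ":limit": {"N": "100"}}}},
    ])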
As you can see, there is no read-write transaction API available in DynamoDB. For example, we cannot have a transaction that reads the value of key k1 and sets the value of k2 to it. However, such a use case can be handled with a simple Get followed by a conditional write transaction, as follows.
v1 = Get (table: "T1", key: k1)
TransactWriteItems {
    Check (table: "T1", key: k1, value: =v1),
    Update (table: "T2", key: k2, value: v1)
}
If the transaction fails due to its check statement, we have to repeat both the Get operation and the transaction until the transaction succeeds. Note that, although we say we don't have interactive transactions in DynamoDB, the approach above basically emulates how interactive transactions work in other OCC systems. For example, in FoundationDB, writes are buffered on the client and submitted as a batch together with the read-set to the transaction system; the transaction is rejected if any of the values read by the client has been updated. The same approach can be used with Calvin to emulate interactive transactions.
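To make this pattern concrete, here is a minimal sketch in Python with boto3; the table names, key names, the "val" attribute, and the infinite retry policy are illustrative assumptions, not DynamoDB recommendations.

    import boto3

    client = boto3.client("dynamodb")

    def copy_k1_to_k2():
        """Emulate a read-write transaction: read k1, write its value to k2.

        Retries until the conditional write succeeds, i.e., until k1 has not
        changed between our Get and our TransactWriteItems.
        """
        while True:
            # Assumes the item exists; ConsistentRead avoids reading stale data.
            v1 = client.get_item(TableName="T1", Key={"pk": {"S": "k1"}},
                                 ConsistentRead=True)["Item"]["val"]
            try:
                client.transact_write_items(TransactItems=[
                    # Abort if k1 no longer holds the value we read.
                    {"ConditionCheck": {
                        "TableName": "T1", "Key": {"pk": {"S": "k1"}},
                        "ConditionExpression": "val = :v",
                        "ExpressionAttributeValues": {":v": v1}}},
                    {"Update": {
                        "TableName": "T2", "Key": {"pk": {"S": "k2"}},
                        "UpdateExpression": "SET val = :v",
                        "ExpressionAttributeValues": {":v": v1}}},
                ])
                return  # success
            except client.exceptions.TransactionCanceledException:
                continue  # k1 changed underneath us; re-read and retry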
Architecture
Figure 2 shows the overall architecture of DynamoDB. Transactional and non-transactional operations go through separate routes. This design tries to satisfy the "no impact on non-transactional operations" requirement mentioned above. There is a fleet of stateless transaction coordinators (TCs) that manage the execution of transactions. Although Terry says DynamoDB avoids 2PC, it seems 2PC is exactly what the TCs do. A TC sends the transaction request to the involved nodes. Each node checks whether it can execute the transaction and returns its decision to the coordinator. Once all nodes agree to accept a transaction, the TC asks them to complete the transaction by making its mutations durable. There are two points of no return in 2PC that must be durably stored (see this post): 1) the accept/reject vote of each node, and 2) the transaction decision on the coordinator. The state of the transaction on the coordinator is stored on a durable ledger. A recovery manager handles the failure of a TC and continues pending transactions recorded on the ledger. The coordinator makes sure that the mutations of an accepted transaction are applied to the state of all shards exactly once: it keeps sending the decision until all nodes acknowledge it, and to avoid duplicate mutations, nodes have to de-duplicate messages from the TC (see this post for more info on exactly-once delivery).
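The following sketch illustrates the flow a TC drives. The Ledger and node interfaces (prepare/deliver) are hypothetical stand-ins for this post, not DynamoDB's actual internals.

    class TransactionCoordinator:
        """Illustrative sketch of a TC's two-phase flow (not DynamoDB's real code)."""

        def __init__(self, ledger):
            self.ledger = ledger  # durable ledger recording transaction decisions

        def execute(self, tx_id, tx, involved_nodes):
            # Phase 1: each node durably votes accept/reject based on the
            # transaction's checks and on conflict-checking
            # (first point of no return).
            votes = [node.prepare(tx_id, tx) for node in involved_nodes]
            decision = "commit" if all(votes) else "abort"

            # Second point of no return: record the decision on the durable
            # ledger, so a recovery manager can finish the transaction if
            # this TC fails.
            self.ledger.record(tx_id, decision)

            # Phase 2: keep sending the decision until every node acknowledges
            # it. Nodes de-duplicate by tx_id, so mutations apply exactly once.
            for node in involved_nodes:
                while not node.deliver(tx_id, decision):
                    pass  # retry delivery until acknowledged
            return decision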
A node decides to accept or reject a transaction based on two things: 1) the check statements specified in the transaction, and 2) the result of conflict-checking against other transactions. We will see how this conflict-checking works in the next section.
Conflict-Checking
All right, we have arrived at the exciting part: the conflict-checking. Conflict-checking is done in the first phase of 2PC on each participant node. DynamoDB does not use 2PL, MVCC, or the deterministic approach. Instead, it follows a simple optimistic approach via timestamp ordering to serialize transactions. As we said earlier, serializability simply means that transactions must appear to be executed in some sequential order. The database is allowed to internally run transactions concurrently, but an external observer must see transactions in some sequential order. The order itself does not matter, i.e., any order is acceptable. DynamoDB uses physical timestamps to order transactions.
In DynamoDB, the serialization order is defined by the timestamps assigned by the TCs. The timestamps are physical clock timestamps, with ties broken by TC IDs, so the timestamps totally order the transactions. When we say this order "defines the serialization order," we mean the store must behave in a way that, to the client, it appears transactions are executed in this order. Thus, if executing a transaction would violate this illusion, DynamoDB aborts it. To enforce this order, the nodes receiving transaction requests from TCs follow a set of rules to decide whether to accept or reject each transaction.
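For instance, such a timestamp can be modeled as a (physical clock, TC ID) pair compared lexicographically; this encoding is an illustrative assumption for this post, not necessarily DynamoDB's.

    import time

    def new_timestamp(tc_id: int) -> tuple[int, int]:
        # Physical clock first, TC ID as tie-breaker: lexicographic tuple
        # comparison gives a total order over all transactions.
        return (time.time_ns(), tc_id)

    # Two TCs can assign timestamps with no coordination; the timestamps are
    # still always comparable and never equal.
    t1, t2 = new_timestamp(tc_id=1), new_timestamp(tc_id=2)
    assert t1 < t2 or t2 < t1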
Rules for Accepting a TransactWriteItems
Three reasons may cause a node to reject a transaction:
- Conflict with an already completed transaction.
- Conflict with an already accepted transaction.
- Violation of a transaction check.
The third one is self-explanatory. Below, we explain the first two.
Conflict with Completed Transactions
When a node receives a request to commit a new write transaction TxNew, it compares TxNew's timestamp with the timestamps of already completed transactions that conflict with it. Since DynamoDB uses the timestamp order to define the serialization order, if the new transaction has a smaller timestamp than any of the already completed conflicting transactions, the node rejects it.
What are conflicting transactions?
We say two transactions conflict if they both access the same data item. Note that, in practice, conflict-checking is usually done over ranges of keys instead of individual keys, to avoid high overhead.
Why do only conflicting transactions matter?
If two transactions do not conflict with each other, i.e., they access completely different parts of the data, their order does not matter; any order is indistinguishable from the order defined by the timestamps. So it is fine to accept a new transaction with a smaller timestamp if it touches parts of the data not touched by already completed transactions.
We can do one optimization here. If an already completed transaction Tx1 does not have any condition, we can go ahead and accept TxNew even if its timestamp is smaller than that of Tx1, because even if TxNew had executed before Tx1, Tx1 would still have executed exactly the same way.
But wait! What if TxNew and Tx1 both write the same item x? In that case, to be faithful to the order defined by the timestamps, Tx1, with the larger timestamp, must be the winner; but by accepting TxNew we end up having a value for x written by the smaller-timestamp transaction. Isn't that a problem?
No. Note that this is just the first phase, where we decide to accept or reject a transaction. The node can go ahead and accept the transaction. Later, when the transaction coordinator sends the final write command, the node can simply ignore the write by TxNew on x, because the current version has a higher timestamp.
Then what is the point of accepting a transaction whose write will eventually be ignored?
The point is that TxNew may have other writes that will not be ignored. The value it writes to x will be ignored, but TxNew may also write y, possibly even on a different node, and that write will not be ignored.
Side note: It seems there is a mistake in the optimization explained above, or maybe this case is simply not covered in [1]. Consider the following transactions:
T1 {
    x = 2
}
TxNew {
    check(x, =2)
    y = 3
}
T1 is just an unconditional put. Thus, based on the optimization above, when TxNew comes, we can still accept it even when its timestamp is smaller than T1's. However, TxNew has a check that is true due to the write by T1. Thus, in the serialization order, TxNew must be after T1, which does not conform to the order of the timestamps.
Conflict with Accepted Transactions
In addition to completed transactions, the node checks the new transaction against accepted (but not yet completed) conflicting transactions before accepting it. As with completed transactions, if the new transaction has a timestamp smaller than that of an already accepted conflicting transaction Tx2, we have to reject it. The optimization explained above can be used here as well: if Tx2 does not have any condition, then there is no issue in accepting TxNew with regard to Tx2.
Unlike with completed transactions, the node may also reject a transaction due to an accepted transaction Tx2 even when TxNew has the higher timestamp. Such a situation happens when TxNew has a condition that might be affected by Tx2. The reason we don't have this extra rejection case for completed transactions is that the effects of completed transactions are already applied in the store, so we can be sure the checks of TxNew will be evaluated correctly. For Tx2, however, the effect is not there yet, so if we go ahead and evaluate the checks of TxNew, we might wrongly accept it when it should have been rejected.
Figure 3 summarizes the rules explained for deciding on a new write
transaction.
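In code, a node's decision might look like the following sketch. The Tx shape, its key-set fields, and the conflicts() helper are hypothetical simplifications for this post; as noted above, real nodes compare key ranges rather than individual keys.

    from dataclasses import dataclass

    @dataclass
    class Tx:
        timestamp: tuple                     # (physical time, TC id), totally ordered
        write_keys: frozenset = frozenset()  # keys this tx puts/updates/deletes
        check_keys: frozenset = frozenset()  # keys this tx's conditions read
        read_keys: frozenset = frozenset()   # keys read by a TransactGetItems

        def accessed(self):
            return self.write_keys | self.check_keys | self.read_keys

        @property
        def has_checks(self):
            return bool(self.check_keys)

    def conflicts(a, b):
        # Two transactions conflict if they access a common item.
        return bool(a.accessed() & b.accessed())

    def accept_write(tx_new, completed, accepted):
        """Should a node accept a TransactWriteItems? (simplified sketch)"""
        for tx in completed:
            if conflicts(tx, tx_new) and tx_new.timestamp < tx.timestamp:
                # Optimization: a completed transaction with no condition would
                # have executed the same even if tx_new had run first.
                if tx.has_checks:
                    return False
        for tx in accepted:
            if not conflicts(tx, tx_new):
                continue
            if tx_new.timestamp < tx.timestamp and tx.has_checks:
                return False
            # Extra case: tx's effects are not applied yet, so tx_new's own
            # conditions cannot be evaluated reliably if tx may affect them.
            if tx.timestamp < tx_new.timestamp and (tx.write_keys & tx_new.check_keys):
                return False
        return True  # tx_new's own check statements are evaluated separately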
Rules for Accepting a TransactGetItems
Similar to write transactions, we should abort a read transaction that has a timestamp smaller than that of an already completed conflicting transaction on the node. Otherwise, we might read the effect of a transaction with a higher timestamp, which violates the order defined by the timestamps. In other words, a read transaction with a smaller timestamp is trying to read a value in the past. Since DynamoDB does not maintain older versions, we have no choice but to abort. If DynamoDB had MVCC, it could accept the transaction and return the older version requested by the read transaction.
Regarding accepted transactions, the rule is reversed: if the timestamp of the read transaction is smaller than that of an accepted transaction, it is fine, because the read transaction does not read the effect of the accepted transaction, and it is not supposed to. However, if there is an accepted transaction with a timestamp smaller than the timestamp of the read transaction, then we have to either reject the read transaction or wait for the accepted transaction to take effect before we execute our reads. Figure 6 shows the rules for accepting a new read transaction.
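Using the same hypothetical Tx shape and conflicts() helper from the write-transaction sketch above, the read rules might look like this (a node could also wait for the accepted transaction instead of rejecting in the second case):

    def accept_read(tx_read, completed, accepted):
        """Should a node accept a TransactGetItems? (simplified sketch)"""
        for tx in completed:
            # A smaller-timestamp read asks for a value from the past; with no
            # MVCC there is no older version to return, so we must reject.
            if conflicts(tx, tx_read) and tx_read.timestamp < tx.timestamp:
                return False
        for tx in accepted:
            # Reversed rule: reading "before" an accepted transaction is fine,
            # but reading "after" one whose effects are not yet applied is not.
            if conflicts(tx, tx_read) and tx.timestamp < tx_read.timestamp:
                return False
        return True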
Non-Transactional Operations
One of the design goals of DynamoDB transactions was not to affect non-transactional operations. That goal is satisfied for single Get operations: no change is needed, and we can go ahead and read the latest value written for the item. However, Put operations might be rejected in transactional DynamoDB. Specifically, a Put operation must be rejected if it might affect the condition of an already accepted but not yet completed transaction. Also, if the Put itself is a conditional Put and its condition might be affected by an accepted but not yet completed transaction, we have to reject it, because we don't yet have the values written by the accepted transaction, so we cannot evaluate the Put's condition. Note that a Put might only be rejected due to accepted-but-not-completed transactions; completed transactions never cause a Put operation to be rejected.
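A sketch of this rule, again reusing the hypothetical Tx shape from above:

    def accept_singleton_put(key, is_conditional, accepted):
        """Should a node accept a non-transactional Put? (simplified sketch)

        Only accepted-but-not-completed transactions matter here; completed
        transactions never cause a Put to be rejected.
        """
        for tx in accepted:
            # The Put could change the outcome of a pending transaction's check.
            if key in tx.check_keys:
                return False
            # The Put's own condition cannot be evaluated yet, because a pending
            # transaction's write to this key has not been applied.
            if is_conditional and key in tx.write_keys:
                return False
        return True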
Applying Transaction Mutations
The discussion above covered only the first phase of 2PC, i.e., deciding whether to accept or reject new transactions. For write transactions, we also have to be careful in the second phase, to make sure the execution does not violate the order defined by the timestamps. Specifically, there are two things we have to be careful about:
- We have to ignore a Put operation with a timestamp smaller than the timestamp of the current version.
- Update operations to keys being accessed by concurrent transactions must be delayed.
To understand the first case, consider Tx1 and Tx2 that both Put to items a and b, hosted on two different nodes. Tx1 and Tx2 are both accepted on both nodes, but the write commands (the second phase of 2PC) arrive in different orders. If we apply the Puts as they come, we end up with a=2 and b=1, which is unacceptable, as it is not the result of any serial execution. To avoid this, the nodes must simply ignore any Put to a key whose current version has a higher timestamp than the new update. In this example, the node hosting b must ignore the Put by Tx1, so we end up with a=2 and b=2, which conforms to the order Tx1, Tx2, i.e., the order defined by the timestamps.
Figure 4. We have to ignore a Put with a timestamp smaller than the timestamp of the current version [1]
For the second case, consider the example shown in Figure 5. It is the same as Figure 4, except Tx2 has two Update operations instead of Puts. Upon receiving the write message for Tx2, if we go ahead and apply the updates, we end up with a=2 and b=1, which does not conform to the order Tx1, Tx2. To avoid that, an Update operation must be delayed if a pending transaction with a smaller timestamp is accessing the same item.
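Both rules are easy to express in code. Here is a sketch, with store as a plain dict mapping each key to its (timestamp, value) version and Tx as in the earlier sketches; all names are hypothetical.

    def apply_put(store, key, value, timestamp):
        # Ignore the Put if the current version already has a higher timestamp:
        # the "later" (by timestamp) write must win regardless of arrival order.
        current = store.get(key)
        if current is not None and timestamp < current[0]:
            return
        store[key] = (timestamp, value)

    def can_apply_update(key, timestamp, accepted):
        # An Update must wait while any pending transaction with a smaller
        # timestamp still accesses the same item: an Update reads the current
        # value, so it cannot simply be ignored the way a stale Put can.
        return not any(key in tx.accessed() and tx.timestamp < timestamp
                       for tx in accepted)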
DynamoDB vs. Spanner
We covered Spanner in an earlier post on this blog. We know Spanner also provides ACID transactions and uses timestamps in its transaction protocol, so the natural question is how Spanner and DynamoDB compare with each other. While Spanner relies on 2PL for serializability, DynamoDB relies only on timestamps for serializability and does not use any sort of locking mechanism. In addition, Spanner uses timestamps that need special hardware to guarantee tight clock synchronization. DynamoDB, on the other hand, does not need clock synchronization for its correctness; we can have multiple transaction coordinators, and they can assign timestamps without any coordination with each other.
So Spanner seems more complicated and requires more expensive hardware than DynamoDB. The question, then, is what Spanner buys with this additional complexity and cost. To answer it, we first have to understand the main difference between Spanner and DynamoDB.
The Main Difference between Spanner and DynamoDB
- In DynamoDB, the serialization order is defined by the timestamps
assigned by the TCs.
- In Spanner, the serialization order is controlled by 2PL.
Using TrueTime, Spanner guarantees that timestamps respect the
real-time order of transactions, i.e., if T2 starts after T1 is
committed, then T2.timestamp > T1.timestamp.
Thus, unlike DynamoDB, Spanner does NOT need timestamps to guarantee serializability for its read-write transactions, as that is guaranteed by 2PL. Figure 6 shows the difference between DynamoDB and Spanner at a very high level. 2PL is much more complicated than the simple rules that DynamoDB uses for accepting/rejecting transactions, but are there any advantages to using 2PL? Yes! Consider this simple example. Suppose we have two transactions T1 and T2, both writing the same item. With 2PL, if T2 arrives after T1 is finished, we can go ahead and commit T2. With timestamp ordering, on the other hand, if the timestamp of T2 is smaller than that of T1, we have to reject T2, even though no concurrent transaction conflicts with it. Thus, with timestamp ordering, we always have some false rejections, i.e., rejections of transactions that would have been fine to accept. This issue is aggravated by poor clock synchronization: although DynamoDB does not need clock synchronization for its correctness, poor clock synchronization between transaction coordinators will result in rejected transactions. The issue also gets worse with higher contention; in general, we know OCC systems do poorly under high contention compared with pessimistic locking such as 2PL.
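With the accept_write() and Tx sketches from earlier, the false rejection is easy to reproduce: T2 is not concurrent with T1 at all, yet a lagging coordinator clock gives it a smaller timestamp, so it is rejected where 2PL would happily commit it. The timestamps and keys below are made up.

    # T1 committed with timestamp (100, 1); T2 arrives strictly afterwards, but
    # its coordinator's clock is behind, so it gets timestamp (90, 2).
    t1 = Tx(timestamp=(100, 1), write_keys=frozenset({"x"}),
            check_keys=frozenset({"x"}))  # T1 is conditional, so the
                                          # no-condition optimization cannot help
    t2 = Tx(timestamp=(90, 2), write_keys=frozenset({"x"}))

    # No transaction is concurrent with T2, yet it is rejected.
    assert accept_write(t2, completed=[t1], accepted=[]) is False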
Figure 6. DynamoDB vs. Spanner
OK, but if Spanner relies on 2PL for serializability, why does it bother
with timestamps, TrueTime, waiting out the uncertainty window, etc? Isn't
2PL enough?
2PL is enough to guarantee the serializability of read-write transactions. However, Spanner has another API: snapshots. A snapshot read is a read-only operation. Spanner uses MVCC and timestamps to provide strictly serializable snapshot reads without involving any locking. Using TrueTime, Spanner guarantees that if T2 starts after T1 is committed, then T2.timestamp > T1.timestamp. This way, Spanner provides linearizability for its snapshot reads. By ignoring values with a version higher than the snapshot time, it guarantees serializability for the snapshots. Thus, its snapshots are linearizable + serializable, which is referred to as strictly serializable (see this post).
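The version-filtering part is straightforward. Here is a minimal sketch of a snapshot read over a multi-versioned store; it is illustrative only, not Spanner's actual implementation.

    def snapshot_read(versions, key, snapshot_ts):
        # versions: key -> list of (commit_timestamp, value), sorted by timestamp.
        # Ignoring versions committed after snapshot_ts makes the read
        # serializable at snapshot_ts, with no locks taken.
        visible = [(ts, v) for ts, v in versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1][1] if visible else None

    # x was 1 at ts=5 and 2 at ts=12; a snapshot read at ts=10 returns 1.
    assert snapshot_read({"x": [(5, 1), (12, 2)]}, "x", 10) == 1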
DynamoDB also has read-only transactions. How do Spanner's read-only transactions/snapshots compare with DynamoDB's TransactGetItems?
Spanner's snapshots can be read from any replica: using MVCC and proper versioning via TrueTime, Spanner can serve strictly serializable read-only transactions from any replica. DynamoDB's read-only transactions, on the other hand, must be conflict-checked like read-write transactions, and they might be rejected. Thus, Spanner provides better support for snapshot reads, especially under high contention.
Conclusion
In this post, we saw how DynamoDB provides ACID transactions with a simple timestamp ordering technique, without any of the complexity of locking or MVCC. Its transaction protocol is easy to reason about and surely simpler to implement than those of more sophisticated systems. We specifically compared DynamoDB with Spanner, which also provides ACID transactions. We saw that although both systems rely on timestamps, DynamoDB uses timestamps to define the serialization order, while Spanner uses 2PL for serializability and makes sure its timestamps follow that order. Spanner is more complex and needs more storage and more expensive hardware, but in return it enjoys better performance, especially for workloads with higher contention. Its snapshots do not need to be conflict-checked, will never be rejected, and can be served by any replica. Thus, Spanner is expected to provide better scalability and performance for read-only operations.
References
[1] Doug Terry, talk at FAST'19:
https://www.youtube.com/watch?v=CK6h48zOY9k&t=2151s