Amazon DynamoDB: ACID Transactions using Timestamp Ordering

Amazon DynamoDB uses a lightweight transaction protocol based on timestamp ordering. Although both Spanner and DynamoDB use timestamps in their transaction protocols, DynamoDB is significantly simpler than Spanner: while Spanner uses 2PL and MVCC and needs TrueTime, DynamoDB relies only on timestamp ordering and does not use any locking or MVCC scheme. In this post, we review the DynamoDB transaction protocol and compare it with Google Spanner. 

We have covered several protocols for distributed transactions in this blog so far. We saw that Spanner uses 2PC+2PL over Paxos leaders and relies on TrueTime for external consistency, Calvin uses a deterministic approach, and FoundationDB uses OCC. In this post, we cover DynamoDB, which uses timestamp ordering, perhaps the simplest of the approaches we have covered so far. DynamoDB does not require tight clock synchronization, and unlike Spanner it does not assume any uncertainty window. It does not need MVCC and avoids any locking scheme (e.g., 2PL or deterministic locking). The timestamp ordering approach used by DynamoDB is very simple: assign timestamps to transactions and let these timestamps define the serialization order, i.e., a transaction with a smaller timestamp must appear to be executed before one with a larger timestamp. The job of the database is then to abort any transaction that would violate this illusion. We will see how, with this simple approach, DynamoDB provides ACID transactions while avoiding the complexities of more sophisticated protocols. 

ACID Transactions in DynamoDB

This section is a summary of the great talk by Doug Terry at FAST'19 [1], plus some additional discussion. 

Design Goals

DynamoDB is designed to provide ACID transactions together with scalability. Terry defines scalability as the combination of 1) unbounded growth, and 2) predictable performance [1]. The reason that many cloud databases embrace simple data models without consistency or isolation guarantees is to satisfy these scalability requirements. When a database does not care about consistency or isolation, we can horizontally scale the system by simply partitioning the data and adding new shards as needed. On the other hand, stronger consistency or isolation guarantees mean higher latency as we scale. 
Figure 1. Latency vs. scale. 

In addition to scalability, DynamoDB transactions could not use MVCC: the underlying storage does not support multi-versioning, and since DynamoDB is a cloud database where customers pay for storage, dealing with the additional storage required by MVCC was not straightforward. Another design goal for DynamoDB was to avoid any impact on non-transactional operations.   

The following list is a summary of the design goals for DynamoDB transactions:
  • Atomicity and Serializability 
  • Scalability
    • Unbounded growth
    • Predictable performance 
  • No MVCC 
  • No impact on non-transactional operations

API: One-Shot Transactions 

Since DynamoDB is a multi-tenant cloud database, interactive transactions that may hold resources for a long time are not acceptable. DynamoDB adopts the simpler one-shot transaction approach. Specifically, there are the following two APIs for transactions.  

TransactGetItems {   //(Read-only Transaction)
   Get (table: "T1", key: k1),
   Get (table: "T2", key: k2),
   Get (table: "T3", key: k3)
}

TransactWriteItems {  //(Conditional Write Transaction)
   Put    (table: "T1", key: k1, value: v1),
   Delete (table: "T2", key: k2),
   Update (table: "T3", key: k3, value: v1),
   Check  (table: "T3", key: k3, value: <100)                    
}
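
To make this API more concrete, here is roughly how these two calls look with the AWS Python SDK (boto3). The table names, the key attribute name ("pk"), and the values are made-up placeholders; only the overall shape of the calls is the point.

import boto3

client = boto3.client("dynamodb")

# Read-only transaction: all Gets are executed atomically, or the call fails.
response = client.transact_get_items(
    TransactItems=[
        {"Get": {"TableName": "T1", "Key": {"pk": {"S": "k1"}}}},
        {"Get": {"TableName": "T2", "Key": {"pk": {"S": "k2"}}}},
        {"Get": {"TableName": "T3", "Key": {"pk": {"S": "k3"}}}},
    ]
)

# Conditional write transaction: either every mutation is applied, or none is.
client.transact_write_items(
    TransactItems=[
        {"Put": {"TableName": "T1",
                 "Item": {"pk": {"S": "k1"}, "value": {"S": "v1"}}}},
        {"Delete": {"TableName": "T2", "Key": {"pk": {"S": "k2"}}}},
        {"Update": {"TableName": "T3", "Key": {"pk": {"S": "k3"}},
                    "UpdateExpression": "SET #v = :v1",
                    "ExpressionAttributeNames": {"#v": "value"},
                    "ExpressionAttributeValues": {":v1": {"S": "v1"}}}},
        {"ConditionCheck": {"TableName": "T3",
                            "Key": {"pk": {"S": "k3"}},
                            "ConditionExpression": "#v < :limit",
                            "ExpressionAttributeNames": {"#v": "value"},
                            "ExpressionAttributeValues": {":limit": {"N": "100"}}}},
    ]
)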
As you can see, there is no read-write transaction API available in DynamoDB. For example, we cannot have a transaction that reads the value of key k1 and sets the value of k2 to it. However, such a use case can be handled using a simple Get followed by a conditional write transaction, as follows. 
v1 = Get (table: "T1", key: k1)

TransactWriteItems {
   Check  (table: "T1", key: k1, value: =v1)
   Update (table: "T2", key: k2, value: v1)
}
If the transaction fails due to its check statement, we have to repeat both the Get operation and the transaction until the transaction succeeds. Note that, although we say we don't have interactive transactions in DynamoDB, the approach above essentially emulates how interactive transactions work in OCC systems. For example, in FoundationDB, writes are buffered on the client and submitted as a batch, together with the read-set, to the transaction system; the transaction is rejected if any of the values read by the client has been updated. The same approach can be used with Calvin to emulate interactive transactions. 
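
Here is a minimal sketch of this retry loop, continuing the boto3 shapes above (still with made-up table and attribute names); boto3 raises TransactionCanceledException when a condition check fails.

import boto3

client = boto3.client("dynamodb")

while True:
    # Read k1's current (typed) value outside of any transaction.
    v1 = client.get_item(TableName="T1",
                         Key={"pk": {"S": "k1"}})["Item"]["value"]
    try:
        client.transact_write_items(
            TransactItems=[
                # Fail the transaction if k1 changed since we read it ...
                {"ConditionCheck": {
                    "TableName": "T1", "Key": {"pk": {"S": "k1"}},
                    "ConditionExpression": "#v = :read",
                    "ExpressionAttributeNames": {"#v": "value"},
                    "ExpressionAttributeValues": {":read": v1}}},
                # ... otherwise write the value we read to k2.
                {"Put": {"TableName": "T2",
                         "Item": {"pk": {"S": "k2"}, "value": v1}}},
            ])
        break  # transaction committed
    except client.exceptions.TransactionCanceledException:
        continue  # k1 changed under us; re-read and try again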

Architecture 

Figure 2 shows the overall architecture of DynamoDB. Transactional and non-transactional operations go through separate routes. This design aims to satisfy the "no impact on non-transactional operations" requirement mentioned above. There is a fleet of stateless transaction coordinators (TCs) that manage the execution of transactions. Although Terry says DynamoDB avoids 2PC, it seems 2PC is exactly what the TCs do. A TC sends the transaction request to the involved nodes. Each node checks whether it can execute the transaction and returns its decision to the coordinator. Once all nodes agree to accept a transaction, the TC asks them to complete the transaction by making its mutations durable. There are two points of no return in 2PC that must be durably stored (see this post): 1) the accept/reject votes of the nodes, and 2) the transaction decision on the coordinator. The state of the transaction on the coordinator is stored in a durable ledger. A recovery manager handles the failure of a TC and continues pending transactions recorded in the ledger. The coordinator makes sure that the mutations of an accepted transaction are applied to the state of all shards exactly once: it keeps sending the decision until all nodes acknowledge it, and to avoid duplicate mutations, nodes de-duplicate messages from the TC (see this post for more info on exactly-once delivery). 
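
The following is a hypothetical sketch of this coordinator flow; the ledger, the node RPCs, and the retry loop are illustrative stand-ins for DynamoDB's actual internals, which [1] does not detail.

def run_transaction(tx, nodes, ledger):
    # Phase 1: every participant durably votes to accept or reject.
    votes = [node.prepare(tx) for node in nodes]
    decision = "COMMIT" if all(votes) else "ABORT"
    ledger.append((tx.id, decision))    # TC's durable point of no return

    # Phase 2: deliver the decision until every node acknowledges it.
    pending = set(nodes)
    while pending:
        for node in list(pending):
            # Nodes de-duplicate by tx.id, so re-delivery is harmless:
            # at-least-once delivery + de-duplication = exactly-once effect.
            if node.decide(tx.id, decision):
                pending.remove(node)
    ledger.append((tx.id, "DONE"))      # recovery can skip finished txs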

Figure 2. The Architecture of DynamoDB [1]

A node decides to accept or reject a transaction based on two things: 1) the check statements specified in the transaction, and 2) the result of conflict-checking against other transactions. We will see how this conflict-checking works in the next section.

Conflict-Checking

All right, we have arrived at the exciting part: conflict-checking! Conflict-checking is done in the first phase of 2PC on each participant node. DynamoDB does not use 2PL, MVCC, or a deterministic approach. Instead, it follows a simple optimistic approach based on timestamp ordering to serialize transactions. As we said earlier, serializability simply means that transactions must appear to be executed in some sequential order. The database is allowed to run transactions concurrently internally, but an external observer must see transactions in some sequential order. The order itself does not matter, i.e., any order is acceptable. DynamoDB uses the physical timestamps of transactions to order them. 
In DynamoDB, the serialization order is defined by the timestamps assigned by the TCs. 
The timestamps are physical clock timestamps, and ties are broken by TC IDs; thus, the timestamps totally order the transactions. When we say this order "defines the serialization order", we mean the store must behave in such a way that, to the client, transactions appear to be executed in this order. Thus, if executing a transaction would violate this illusion, DynamoDB aborts it. To make sure this order is followed, the nodes receiving transaction requests from TCs apply a set of rules to decide whether to accept or reject each transaction. 
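
A trivial sketch of how such timestamps can be generated (the class and names here are hypothetical): the idea is just (physical clock, TC ID) pairs compared lexicographically.

import time

class Coordinator:
    def __init__(self, tc_id):
        self.tc_id = tc_id

    def assign_timestamp(self):
        # (physical time, TC id) pairs compare lexicographically, so
        # ties on the clock value are broken by the coordinator ID and
        # all transactions end up totally ordered.
        return (time.time(), self.tc_id)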

Rules for Accepting a TransactWriteItems 

Three reasons may cause a node to reject a transaction. 
  1. Conflict with an already completed transaction. 
  2. Conflict with an already accepted transaction. 
  3. Violation of a transaction check. 
The third one is clear. Below, we explain 1 and 2. 

Conflict with Completed Transactions

Once a node receives a request to commit a new write transaction TxNew, it checks TxNew's timestamp against the timestamps of already completed transactions that conflict with it. Since DynamoDB uses timestamp order to define the serialization order, if the new transaction has a smaller timestamp than any of the already completed conflicting transactions, the node rejects it. 

 What are conflicting transactions? 
We say two transactions conflict if they both access the same data item. Note that in practice, conflict-checking is usually done on ranges of keys instead of individual keys to avoid high overhead. 

Why do only conflicting transactions matter? 
If two transactions do not conflict with each other, i.e., they access completely different parts of the data, their order does not matter; any order is indistinguishable from the order defined by the timestamps. So it is fine to accept a new transaction with a smaller timestamp if it touches parts of the data not touched by already completed transactions. 

We can apply one optimization here. If an already completed transaction Tx1 does not have any condition, we can go ahead and accept TxNew, even if its timestamp is smaller than that of Tx1, because even if TxNew had been executed before Tx1, Tx1 (having no condition) would still have executed exactly the same way. 

 But wait! What if TxNew and Tx1 both write the same item x? In that case, to be faithful to the order defined by the timestamps, Tx1, with the larger timestamp, must be the winner; but by accepting TxNew we end up with a value for x written by the transaction with the smaller timestamp. Isn't that a problem?
No. Note that this is just the first phase, where we decide to accept or reject a transaction. The node can go ahead and accept the transaction. Later, when the transaction coordinator sends the final write command, the node can simply ignore the write by TxNew on x, because the current version has a higher timestamp. 

Then what is the point of accepting a transaction whose write will eventually be ignored? 
The point is that TxNew may have other writes that won't be ignored. The value written to x by TxNew will be ignored, but TxNew may also write y, possibly even on a different node, and that write won't be ignored.

Side note: It seems there is a mistake in the optimization explained above, or maybe this case is simply not covered in [1]. Consider the following transactions: 
T1 {
    x=2
}

TxNew{
    check(x,=2)
    y=3 
}
T1 is just an unconditional put. Thus, based on the optimization above, when TxNew comes, we can still accept it even when its timestamp is smaller than T1's. However, TxNew has a check that is true due to a write by T1. Thus, in the serialization order, TxNew must come after T1, which does not conform to the order of the timestamps. 

Conflict with Accepted Transactions

In addition to completed transactions, the node checks the new transaction against accepted (but not yet completed) conflicting transactions before accepting it. As with completed transactions, if the new transaction has a timestamp smaller than that of an already accepted conflicting transaction Tx2, we have to reject it. The optimization explained above can be used here as well: if Tx2 does not have any condition, then accepting TxNew poses no problem with respect to Tx2. 

Unlike with completed transactions, the node may also reject a new transaction due to an accepted transaction Tx2 even when TxNew has a higher timestamp. This happens when TxNew has a condition that might be affected by Tx2. The reason we don't have this extra abort rule for completed transactions is that the effects of completed transactions are already applied in the store, so we can be sure the checks of TxNew will be evaluated correctly. For Tx2, however, the effect is not there yet, so if we went ahead and evaluated the checks of TxNew, we might wrongly accept it when it should have been rejected.  

Figure 3 summarizes the rules explained above for deciding on a new write transaction. 

Figure 3. Rules for accepting/rejecting a new write transaction [1]
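
Putting these rules together, here is a minimal sketch in Python. Tx, conflicts(), and the checks_pass callback are hypothetical stand-ins, not DynamoDB's actual data structures; the logic follows the rules of Figure 3, including the no-condition optimization discussed above.

from dataclasses import dataclass, field

@dataclass
class Tx:
    ts: tuple                               # (clock, TC id) timestamp
    keys: set = field(default_factory=set)  # items the transaction touches
    has_conditions: bool = False            # does it carry Check statements?

def conflicts(a, b):
    return bool(a.keys & b.keys)            # conflict = overlapping items

def accept_write(tx_new, completed, accepted, checks_pass):
    for tx in completed:
        # Reject if tx_new would be serialized before a completed
        # conflicting transaction, unless that transaction had no
        # condition (the optimization discussed above).
        if conflicts(tx, tx_new) and tx_new.ts < tx.ts and tx.has_conditions:
            return False
    for tx in accepted:
        if conflicts(tx, tx_new):
            # Same rule as above, applied to pending transactions.
            if tx_new.ts < tx.ts and tx.has_conditions:
                return False
            # Extra rule: tx's writes are not applied yet, so tx_new's
            # conditions cannot be evaluated reliably against them.
            if tx.ts < tx_new.ts and tx_new.has_conditions:
                return False
    return checks_pass(tx_new)              # finally, tx_new's own checks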

Rules for Accepting a TransactGetItems 

Similar to write transactions, we should abort a read transaction that has a timestamp smaller than that of an already completed conflicting transaction on the node. Otherwise, we might read the effect of a transaction with a higher timestamp, which violates the order defined by the timestamps. In other words, a read transaction with a smaller timestamp is trying to read a value in the past. Since DynamoDB does not maintain older versions, we have no choice but to abort. If DynamoDB had MVCC, it could accept the transaction and return the older version requested by the read transaction. 

Regarding accepted transactions, the rule is reversed: if the timestamp of the read transaction is smaller than that of an accepted transaction, we are fine, because the read transaction does not read the effect of the accepted transaction, nor is it supposed to. However, if there is an accepted transaction with a timestamp smaller than that of the read transaction, then we have to either reject the read transaction or wait for the accepted transaction to take effect before we execute our reads. Figure 4 shows the rules for accepting a new read transaction. 

Figure 4. Rules for accepting/rejecting a new read transaction [1]
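
A matching sketch for read transactions, reusing the hypothetical Tx and conflicts() from the previous sketch (rejecting in the case where the talk allows either rejecting or waiting):

def accept_read(tx_read, completed, accepted):
    for tx in completed:
        # tx_read would need to see a value from the past, but without
        # MVCC there are no old versions to serve: reject.
        if conflicts(tx, tx_read) and tx_read.ts < tx.ts:
            return False
    for tx in accepted:
        # A smaller-timestamp write is accepted but not applied yet;
        # reject (or wait for it to take effect) before reading.
        if conflicts(tx, tx_read) and tx.ts < tx_read.ts:
            return False
    return True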

Non-Transactional Operations

One of the design goals of DynamoDB transactions was not to affect non-transactional operations. That goal is satisfied for single Get operations: no change is needed, and we can go ahead and read the latest value written for the item. However, Put operations might be rejected in transactional DynamoDB. Specifically, a Put operation must be rejected if it might affect the condition of an already accepted but not yet completed transaction. Also, if the Put itself is a conditional Put and its condition might be affected by an accepted but not yet completed transaction, we have to reject it, because we don't yet have the value written by the accepted transaction, so we cannot evaluate the Put's condition. Note that a Put might be rejected only due to accepted and not yet completed transactions; completed transactions never cause a Put operation to be rejected. 

Figure 5. Rules for accepting/rejecting a non-transactional Put operation [1] 
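
A sketch of these rules, again with the hypothetical Tx from the earlier sketches; only accepted-but-not-completed transactions are consulted.

def accept_single_put(key, put_has_condition, accepted):
    for tx in accepted:
        if key in tx.keys and (tx.has_conditions or put_has_condition):
            # Either the Put might invalidate a condition of tx that was
            # already evaluated, or the Put's own condition depends on a
            # value tx will write but has not written yet: reject.
            return False
    return True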

Applying Transaction Mutations

The discussion above covered only the first phase of 2PC, i.e., deciding to accept or reject new transactions. For write transactions, however, we have to be careful in the second phase as well, to make sure the execution does not violate the order defined by the timestamps.

Specifically, there are two things we have to be careful about:
  1. We have to ignore a Put operation with a timestamp smaller than the timestamp of the current version. 
  2. Update operations on keys being accessed by concurrent transactions must be delayed. 
To understand the first case, consider transactions Tx1 and Tx2 that both Put to items a and b, hosted on two different nodes. Tx1 and Tx2 are both accepted on both nodes, but the write commands (the second phase of 2PC) arrive in different orders. If we go ahead and apply the Puts as they arrive, we can end up in a situation where a=2 and b=1, which is unacceptable, as it is not the result of any serial execution. To avoid this, the nodes must simply ignore any Put to a key whose current version has a higher timestamp than the new update. In this example, the node hosting b must ignore the Put by Tx1, so we end up with a=2 and b=2, which conforms to the order Tx1, Tx2, i.e., the order defined by the timestamps. 

Figure 6. We have to ignore a Put with a timestamp smaller than the timestamp of the current version [1]

For the second case, consider the example shown in Figure 7. It is the same as the example in Figure 6, except that Tx2 has two Update operations instead of Puts. Upon receiving the write message for Tx2, if we go ahead and apply the updates, we can end up with a=2 and b=1, which does not conform to the order Tx1, Tx2. To avoid that, Update operations must be delayed if a pending transaction with a smaller timestamp is accessing the same item. 

Figure 7. We have to delay updates to items accessed by concurrent transactions with a smaller timestamp, i.e., we have to wait for previous writes to be applied before we make our new update [1]
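
Here is a minimal sketch of both apply-phase rules on a storage node; the store layout, the accepted set (with the hypothetical Tx from earlier sketches), and wait_for() are all illustrative assumptions.

def apply_put(store, key, value, ts):
    # Rule 1: ignore a Put whose timestamp is older than the version
    # already in the store (it lost the timestamp race).
    _, current_ts = store.get(key, (None, (0, 0)))
    if ts < current_ts:
        return
    store[key] = (value, ts)

def apply_update(store, key, update_fn, ts, accepted, wait_for):
    # Rule 2: an Update must wait until every accepted transaction with
    # a smaller timestamp that touches this key has been applied, so it
    # reads the value it would have seen in timestamp order.
    while any(tx.ts < ts and key in tx.keys for tx in accepted):
        wait_for()                      # block/retry until those writes land
    old_value, _ = store[key]
    store[key] = (update_fn(old_value), ts)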

DynamoDB vs. Spanner

We covered Spanner in an earlier post on this blog. We know that Spanner also provides ACID transactions and uses timestamps in its transaction protocol, so the natural question is how Spanner and DynamoDB compare with each other. While Spanner relies on 2PL for serializability, DynamoDB relies only on timestamps and does not use any sort of locking mechanism. In addition, Spanner uses timestamps that require special hardware to guarantee tight clock synchronization. DynamoDB, on the other hand, does not need clock synchronization for its correctness; we can have multiple transaction coordinators that assign timestamps without any coordination with each other.

So Spanner seems more complicated and requires more expensive hardware than DynamoDB. The question, then, is what does Spanner buy with this additional complexity and cost? To answer this question, we first have to understand the main difference between Spanner and DynamoDB. 
The Main Difference between Spanner and DynamoDB
- In DynamoDB, the serialization order is defined by the timestamps assigned by the TCs. 
- In Spanner, the serialization order is controlled by 2PL.

Using TrueTime, Spanner guarantees that timestamps respect the real-time order of transactions, i.e., if T2 starts after T1 is committed, then T2.timestamp > T1.timestamp. 

Thus, unlike DynamoDB, Spanner does NOT need timestamps to guarantee serializability for its read-write transactions; that is guaranteed by 2PL. Figure 8 shows the difference between DynamoDB and Spanner at a very high level. 2PL is much more complicated than the simple rules that DynamoDB uses for accepting/rejecting transactions, but are there any advantages in using 2PL? Yes! Consider this simple example. Suppose we have two transactions T1 and T2, both writing the same item. With 2PL, if T2 arrives after T1 is finished, we can go ahead and commit T2. With timestamp ordering, on the other hand, if the timestamp of T2 is smaller than that of T1, we have to reject T2, even though no concurrent transaction conflicts with it. Thus, with timestamp ordering, we always have some false rejections, i.e., rejections of transactions that could safely have been committed. This issue is aggravated by poor clock synchronization: although DynamoDB does not need clock synchronization for its correctness, poor clock synchronization between transaction coordinators results in more rejected transactions. The issue also gets worse with higher contention; in general, OCC systems do poorly under high contention compared with pessimistic locking such as 2PL. 

Figure 8. DynamoDB vs. Spanner

OK, but if Spanner relies on 2PL for serializability, why does it bother with timestamps, TrueTime, waiting out the uncertainty window, and so on? Isn't 2PL enough?
2PL is enough to guarantee the serializability of read-write transactions. However, Spanner has another API: snapshot reads. A snapshot read is a read-only operation. Spanner uses MVCC and timestamps to provide strictly serializable snapshot reads without any locking. Using TrueTime, Spanner guarantees that if T2 starts after T1 is committed, then T2.timestamp > T1.timestamp; this way, Spanner provides linearizability for its snapshot reads. By ignoring values with versions higher than the snapshot time, it guarantees serializability for the snapshots. Thus, its snapshots are linearizable + serializable, which is referred to as strictly serializable (see this post). 
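
A sketch of the idea behind such MVCC snapshot reads (not Spanner's actual implementation): with versioned values, a read at a snapshot timestamp simply picks the newest version at or below that timestamp, with no locks and no rejection.

def snapshot_read(versions, key, snapshot_ts):
    # versions[key] is a list of (commit_ts, value) pairs, sorted by
    # commit_ts; versions newer than the snapshot are simply ignored.
    visible = [(ts, v) for ts, v in versions[key] if ts <= snapshot_ts]
    return visible[-1][1] if visible else None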

DynamoDB also has read-only transactions. How do Spanner's read-only transactions/snapshots compare with DynamoDB's TransactGetItems? 
Spanner's snapshots can be read from any replica. Using MVCC and proper versioning via TrueTime, Spanner can serve strictly serializable read-only transactions from any replica. DynamoDB's read-only transactions, on the other hand, must be conflict-checked like read-write transactions, and they might be rejected. Thus, Spanner provides better support for snapshots, especially under high contention. 

Conclusion 

In this post, we saw how DynamoDB provides ACID transactions with a simple timestamp ordering technique, without any of the complexity of locking or MVCC. Its transaction protocol is easy to reason about and simpler for developers to implement than more sophisticated protocols. We specifically compared DynamoDB with Spanner, which also provides ACID transactions. We saw that although both systems rely on timestamps, DynamoDB uses timestamps to define the serialization order, while Spanner uses 2PL for serializability and makes sure its timestamps follow that order. Spanner is more complex and needs more storage and more expensive hardware, but in return it enjoys better performance, especially for workloads with higher contention. Its snapshots do not need to be conflict-checked, will never be rejected, and can be served by any replica. Thus, Spanner is expected to provide better scalability and performance for read-only operations. 

References

[1] Doug Terry, talk at FAST'19: https://www.youtube.com/watch?v=CK6h48zOY9k&t=2151s

Comments

Unknown said…
It'd be more accessible if less assumption is made about reader. I have some very basic understanding of distributed system. I started reading and all those abbreviations without any one time explanation makes it difficult to follow. I could get some like 2PC, but MVCC, 2PL and others required me to leave and google.
Thanks for your comment. You are right, this article assumes the reader is familiar with standard approaches for transactions. I have covered distributed transactions in a series of posts on this blog:
https://www.mydistributed.systems/search/label/Transactions

You can start with my post on serializability if you are interested in learning more basics:
https://www.mydistributed.systems/2020/08/distributed-transactions-serializability.html

or just google as you did.
