Serializability is the strongest isolation level that a transactional system
can offer. This level of isolation removes all possible anomalies due to
concurrently running transactions. It provides a powerful abstraction
for application developers. The letter "I" in ACID stands for isolation, and
when people talk about ACID transactions, they most likely mean serializable
transactions. However, note that many systems that claim to provide ACID
transactions, in reality, provide weaker isolation levels. In this post, we
want to review serializability and see how a database system can provide
serializability for single-node and distributed transactions.
Serializability
Serializability simply means that the database should appear to run transactions in some sequential order. In other words, although we might run transactions
concurrently, to the clients, it should appear we are running transactions
one-after-another.
Strict Serializability: Serializability + Linearizability
There is something interesting in the definition of serializability;
serializability only requires "some" order. Thus, "any" order is acceptable under
serializability. As a result, if I run transaction T1 that changes the value
of x from 40 to 50, and then run transaction T2 that reads x, it is OK under
serializability to read 40 instead of 50 because we can say the system appears
to first execute T2 and then T1.
This might not be acceptable for some systems. Those systems require a
stronger guarantee called strict serializability. Strict serializability requires the serialization order to
respect the real-time order of transactions, i.e., if T1 commits before T2
starts, then the system should appear to run T2 after T1. With strict
serializability, in the example above, we are guaranteed to read 50 instead of 40.
Obviously, strict serializability is stronger than serializability. Then why do
people say serializability is the strongest isolation level?
Well, when we talk about isolation,
we are talking about the concurrency of the transactions. With regard to
isolation, serializability is the strongest level that we can achieve, as it
provides the ultimate isolation possible: running transactions one-after-another.
No isolation stronger than that is possible. What strict serializability adds to
serializability does not concern isolation. Instead, it concerns the recency of
the data. Strict serializability requires linearizability, which
requires the reads to respect the real-time order of events.
Thus, it is true to say that, regarding isolation alone, serializability is the
strongest level possible.
Serializability and Replica Consistency
Note that when we have only a single copy of our data, the traditional ways to
achieve serializability such as 2PL (as we will see next) actually provide
strict serializability by default. That's why strict serializability is also
known as strong one-copy serializability. The linearizability
aspect of strict serializability typically becomes important when we want to
manage transactions for replicated systems. In replicated systems, we can have
various levels of consistency between replicas. Linearizability is one of
those consistency levels, and it is actually the strongest level possible.
Weaker consistency levels include pure eventual consistency, session
guarantees, causal consistency, and sequential consistency. I will try to
cover them in a separate post. On the other hand, we have weaker isolation
levels such as read uncommitted, read committed, and snapshot isolation.
Figure 1. Isolation and replica consistency are orthogonal requirements. Strict serializability requires the strongest levels of both of these requirements.
How to Achieve Serializability
Concurrency control mechanisms can be classified into pessimistic and optimistic. Pessimistic solutions provide isolation by conservatively preventing transactions from accessing the same items at the same time. To do that, pessimistic solutions usually use some kind of locking mechanism. Optimistic solutions, on the other hand, do not restrict transactions from accessing objects. Instead, they let transactions do whatever they want, and only at commit time do they check whether committing a transaction violates the desired isolation level or not.
2-Phase Locking (2PL)
One pessimistic way of guaranteeing serializability is to actually run transactions sequentially, i.e., run transactions one by one in a single thread. While this approach is used for some in-memory databases, generally databases do not use this solution as it results in zero parallelism. Imagine you have a machine with multiple cores and you can easily afford running transactions in parallel. By running transactions sequentially, you lose all the processing power that is available to you. It would be great if we could allow transactions to run in parallel while guaranteeing serializability. Using 2-Phase Locking (2PL), we can do exactly that. It
works as follows: We assign a lock for each object in the database. This lock
can be acquired in two modes: shared or exclusive.
Operations acquire locks as follows, and once a transaction obtains a lock, it
never releases it until the end of the transaction (a minimal code sketch follows these rules):
- A read operation obtains the lock in the shared mode if no other
transaction has already obtained it in the exclusive mode.
- Note that if another transaction has obtained it in the
shared mode, it is OK, i.e., more than one transaction can obtain a
lock in the shared mode simultaneously.
- A write operation obtains the lock in the exclusive mode if no other
transaction has already obtained it in either shared or exclusive
mode.
- When a transaction first obtains a lock in the shared mode, but later
it wants to write the object, it must upgrade its lock from shared to
exclusive mode. Note that, just like getting the lock in exclusive mode,
upgrading the lock to exclusive mode requires that the lock is not held by
any other transaction.
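To make these rules concrete, here is a minimal sketch in Python of a lock table with the shared and exclusive modes described above. All class and method names are illustrative assumptions, not any particular database's API; a real lock manager would also handle blocking, queuing, and fairness.

```python
# A minimal sketch of a 2PL lock table (illustrative names only).
# Under 2PL, locks are held until the end of the transaction.

class LockTable:
    def __init__(self):
        # object key -> {"mode": "S" or "X", "holders": set of txn ids}
        self.locks = {}

    def acquire_shared(self, txn, key):
        entry = self.locks.setdefault(key, {"mode": "S", "holders": set()})
        # Shared access is allowed unless another transaction holds the
        # lock exclusively.
        if entry["mode"] == "X" and entry["holders"] != {txn}:
            return False  # caller must wait or abort
        entry["holders"].add(txn)
        return True

    def acquire_exclusive(self, txn, key):
        entry = self.locks.setdefault(key, {"mode": "S", "holders": set()})
        # Exclusive access (or an upgrade from shared) is allowed only if
        # no other transaction holds the lock in any mode.
        if entry["holders"] - {txn}:
            return False  # caller must wait or abort
        entry["mode"] = "X"
        entry["holders"].add(txn)
        return True

    def release_all(self, txn):
        # Called at commit/abort time: release every lock the transaction holds.
        for key, entry in list(self.locks.items()):
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[key]
```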
Phantoms
In the method explained above, we assign a lock to each object. To satisfy serializability with 2PL, we have to assign locks not only to objects that already exist in the database, but
also to those that are not in the database yet.
Let's see an example. Suppose two patients are trying to book an appointment
with a physician. They both want to book their appointments for Monday, at 1
pm. The application executes a transaction for each patient. Each transaction
first checks if the physician is available on Monday at 1pm by checking the
rows in the appointments table, then adds a row to this table only if it didn't
find any row for Monday at 1 pm.
Now, let’s see how the system may end up in a state where both patients are booked for the same time slot: the first transaction checks the rows. It does not find anything and does not lock anything. The second transaction also checks the rows and does not find or lock anything. Then, each of them adds a row for Monday 1 pm to the table. Now we have two appointments booked for the same time. That is not desirable and is a violation of serializability. This anomaly is called a phantom write.
We can prevent phantoms by locking non-existing rows. To do that, we can lock ranges using indexes. Suppose we have an index on the date of the
appointments table. Now any record that is for this Monday will fall under the same index entry. Thus, by locking that index entry we are basically locking those
objects whether they already exist or not. Any future object that falls under
this index value is associated with that lock, so we are basically locking
future and non-existing objects too.
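As a rough illustration of this idea, the booking transaction from the example above could lock the index entry for the requested slot (reusing the hypothetical LockTable sketched earlier) before scanning, so that two concurrent bookings for the same slot conflict even though no matching row exists yet.

```python
# Sketch: prevent phantoms by locking the index entry for a time slot
# rather than individual rows. Both bookings for "Monday 1pm" map to the
# same index key, so the two transactions conflict on that lock.

def book_appointment(txn, lock_table, appointments, slot):
    index_key = ("appointments.date_index", slot)  # e.g., the "Monday 1pm" entry
    # Read under a shared lock on the index entry; this also covers rows
    # for this slot that do not exist yet.
    if not lock_table.acquire_shared(txn, index_key):
        raise RuntimeError("lock conflict: wait or abort")
    if any(row["slot"] == slot for row in appointments):
        return False  # slot already taken
    # Upgrade to exclusive before inserting the new row.
    if not lock_table.acquire_exclusive(txn, index_key):
        raise RuntimeError("lock conflict: wait or abort")
    appointments.append({"slot": slot, "booked_by": txn})
    return True
```

Note that if two transactions both take the shared lock and then try to upgrade, neither can proceed, which is exactly the situation discussed next.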
Deadlocks
In the example above, suppose both transactions acquire the lock for index "date =
this Monday" in the shared mode to read. Now, they want to write, so they have
to upgrade their shared locks to exclusive locks. However, since the lock is
held in the shared mode by the other transaction, neither of them can proceed.
This situation is called a deadlock. There are two general
approaches to deal with deadlocks: 1) deadlock detection and 2) deadlock avoidance.
Deadlock Detection: In this approach, we try to detect and resolve
deadlocks. A simple method falling into this category is to introduce timeouts and retries. When a transaction sees that it cannot make progress, it aborts after
some time, and retries. However, this is not efficient, as anything that a
transaction has done so far will be lost due to the deadlock. Also, our
detection mechanism may have false positives, i.e., we might abort a transaction due to
timeout thinking it is in deadlock, while in reality it is not in deadlock and
it is just taking a long time to finish. In addition, there is no guarantee
that the transaction won't face deadlock again. Our two transactions may get
really unlucky and repeat the same loop over and over. This is called a livelock.
A better method to detect and resolve deadlocks is to use the wait-for
graph. In this graph, we have one vertex for each in-flight transaction. When
transaction A requests a lock held by transaction B, we add an edge from A to
B. Any time we have a cycle in the graph, we have a deadlock. Thus, we have a
way to detect deadlocks. After detection, we have to abort one of the
transactions involved in the cycle to break the cycle. Usually, we abort the
most recent transaction, as we prefer to let older transactions proceed.
Otherwise, a transaction might starve and never commit if it gets really
unlucky. We can do the detection either periodically or
continuously (i.e., on every edge we add to the graph).
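As a minimal sketch of this idea (with illustrative names), the wait-for graph can be represented as a dictionary from each transaction to the set of transactions it is waiting for, and a depth-first search can look for a cycle:

```python
# Sketch of deadlock detection with a wait-for graph. An edge A -> B means
# transaction A is waiting for a lock held by transaction B.

def find_deadlock(wait_for):
    """wait_for: dict mapping a txn id to the set of txn ids it waits for.
    Returns a list of transactions forming a cycle, or None."""
    def dfs(txn, path, visited):
        visited.add(txn)
        path.append(txn)
        for holder in wait_for.get(txn, ()):
            if holder in path:  # back edge: we found a cycle
                return path[path.index(holder):]
            if holder not in visited:
                cycle = dfs(holder, path, visited)
                if cycle:
                    return cycle
        path.pop()
        return None

    visited = set()
    for txn in list(wait_for):
        if txn not in visited:
            cycle = dfs(txn, [], visited)
            if cycle:
                return cycle
    return None

# Example: T1 waits for T2 and T2 waits for T1, so there is a deadlock.
# The detector would then abort one of them (typically the youngest).
print(find_deadlock({"T1": {"T2"}, "T2": {"T1"}}))  # ['T1', 'T2']
```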
Deadlock Avoidance: In this approach, we eliminate the possibility of
deadlocks. One way to avoid deadlocks is through conservative 2PL. In
normal 2PL, transactions gradually get the locks, hold them, and release them
at the end of the transaction. In conservative 2PL, a transaction gets
all the locks it might need during its lifetime right at the beginning. This
way, it might wait to start, but once started, it never waits for any lock
again, so it never gets into any deadlock. Conservative 2PL is not very
efficient, as it reduces the concurrency in our system, i.e., there might be
cases where it would be OK to let a transaction start, but due to the conservative
nature of this approach, we delay it.
Instead of conservative 2PL, transaction managers usually define priority for
transactions according to their timestamps. We typically honor the older
transactions (to avoid starvation). Thus older transactions (with smaller
timestamps) have higher priority than younger transactions (with larger
timestamps).
Now, suppose transaction T1 requests a lock held by transaction T2. The
transaction manager can take either of the following approaches to avoid deadlocks (a small sketch of both rules follows these lists):
Wait-die: If T1 is older than T2, T1 blocks and waits for T2 to release the lock. If T1 is younger than T2, T1 aborts.
- Older transactions wait for younger ones.
- Younger transactions never wait for older ones; they just abort right away.
Wound-wait: If T1 is older, T2 aborts right away. If T1 is younger, it waits for T2.
- Older transactions wound younger ones by aborting them.
- Younger transactions wait for older ones.
A better way to remember these two is: we know lower-priority transactions
must be sacrificed for higher-priority transactions somewhere.
- In wait-die, lower-priority transactions die when they request
higher-priority transactions' locks.
- In wound-wait, lower-priority transactions die when higher-priority
transactions request their locks.
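The two rules can be summarized in a few lines. This is only an illustrative sketch with assumed names and return values; a real lock manager would also handle restarting aborted transactions (typically with their original timestamps, so they eventually become the oldest and avoid starvation).

```python
# Sketch of the wait-die and wound-wait rules. A smaller timestamp means an
# older transaction, i.e., higher priority. Names are illustrative only.

def on_lock_conflict(policy, requester_ts, holder_ts):
    """Decide what happens when the requester asks for a lock held by the holder."""
    requester_is_older = requester_ts < holder_ts
    if policy == "wait-die":
        # Older requesters wait; younger requesters die (abort).
        return "requester waits" if requester_is_older else "requester aborts"
    if policy == "wound-wait":
        # Older requesters wound the holder (abort it); younger requesters wait.
        return "holder aborts" if requester_is_older else "requester waits"
    raise ValueError("unknown policy")

# T1 (timestamp 1, older) requests a lock held by T2 (timestamp 2, younger):
print(on_lock_conflict("wait-die", 1, 2))    # requester waits
print(on_lock_conflict("wound-wait", 1, 2))  # holder aborts
```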
Optimistic Concurrency Control
We don't want to go into the details of these methods, but basically, OCC systems
work as follows (a minimal sketch of the validation step follows this list):
- The database can provide a snapshot of the data, i.e., we can read all of
our objects in the context of a transaction at a frozen point in time. OCC
systems usually provide this feature by keeping multiple versions for each
object, a technique known as Multi-Version Concurrency Control
(MVCC). So, all reads of a transaction read from a snapshot.
- All writes are buffered without actually writing to the database.
- When the client wants to commit, the transaction manager checks what
versions the transaction has read. If it finds out that any of the versions
that the transaction has read has been overwritten by another
transaction, it aborts the transaction. Otherwise, it commits and applies
the buffered writes to the store.
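As a minimal sketch of that commit-time check (all names here are assumptions, not any particular system's API): the transaction records the version of every object it has read from its snapshot and buffers its writes; at commit time, the transaction manager validates the read set against the current versions in the store.

```python
# Sketch of OCC commit-time validation over a multi-versioned store.
# `read_set` records the version of each object the transaction read from
# its snapshot; `write_buffer` holds writes that have not yet been applied.

def try_commit(store, read_set, write_buffer):
    # Validation: if any object we read was overwritten by another
    # committed transaction, we must abort.
    for key, read_version in read_set.items():
        if store.current_version(key) != read_version:
            return False  # conflict detected: abort the transaction
    # No conflicts: apply the buffered writes (each creates a new version)
    # and report the transaction as committed.
    for key, value in write_buffer.items():
        store.write(key, value)
    return True
```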
Phantoms
Just like 2PL, with OCC too, there is a subtlety regarding non-existing
objects. As we said above, in 2PL, we have to assign locks not only to the
existing objects but also to the non-existing objects to avoid the phantom
anomaly. Similarly, in OCC, we have to keep track of both existing and non-existing
objects. For example, if a transaction queries the database to find
any booking row for Monday at 1 pm, and at commit time we see there
is such a record, we have to abort the transaction. Just as in 2PL, we can use
indexes to approximate the predicates that transactions have read.
Serializability for Distributed Transactions
In practice, many applications cannot fit their entire data in a single node. Thus,
they partition their data and host it on multiple machines. In these applications,
a transaction may span multiple nodes. Such transactions are called
distributed transactions.
We can use the same approaches explained above to provide serializability for
distributed transactions too. For example, to use 2PL, each node can manage
the locks for the keys that it hosts and respond to the requests to acquire and release locks. With regard to isolation, nothing is different
between single-node and distributed transactions. What makes distributed
transactions different from single-node transactions is the way to provide
atomicity, which requires a transaction to either apply completely or cancel completely. For single-node transactions, atomicity is usually provided by writing the updates to stable storage (in a write-ahead log) before committing the transactions. For distributed transactions, we need more.
2-Phase Commit (2PC)
A well-known algorithm to provide atomicity for distributed transactions is
called 2-Phase Commit (2PC). Once a client is ready, it requests a coordinator to commit the transaction. The
coordinator works with the participants, each hosting part of the objects included
in the transaction, to commit the transaction. As the name suggests, the
algorithm runs in two phases (a small coordinator sketch follows these two steps):
- Phase 1: In the first phase, the coordinator sends
Prepare messages to the participants. Each participant checks if it can
commit the transaction or not.
A participant's answer to the Prepare message cannot change.
Thus, before a participant sends YES back to the coordinator, it must make
sure that under any circumstance (even a crash) it will be able to commit the
transaction. This is crucial for the correctness of 2PC. Thus, usually,
each participant appends the writes of the transaction to a log on
persistent storage before sending YES to the coordinator.
- Phase 2: Once the coordinator receives YES from all the participants, it
sends Commit messages to all of them. If any of the participants sends NO,
the coordinator sends Abort messages instead. The participants commit/abort the
transaction based on the decision by the coordinator.
The final decision by the coordinator cannot change.
Thus, before sending a Commit/Abort message, the coordinator writes its decision to persistent storage to remember it after recovery from a crash.
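Here is a minimal sketch of the coordinator's side of 2PC. The `participants` and `log` objects and their methods are assumptions for illustration; a real implementation must also deal with timeouts, retries, and recovery of both the coordinator and the participants.

```python
# Sketch of a 2PC coordinator. Each participant persists the transaction's
# writes before voting YES in prepare(); the coordinator persists its
# decision before announcing it in phase 2.

def two_phase_commit(txn_id, participants, log):
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare(txn_id) for p in participants]  # True means YES
    decision = "commit" if all(votes) else "abort"

    # The decision must survive a coordinator crash, so log it first.
    log.write(txn_id, decision)

    # Phase 2: broadcast the final decision; participants apply it.
    for p in participants:
        if decision == "commit":
            p.commit(txn_id)
        else:
            p.abort(txn_id)
    return decision
```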
Mixing 2PC with Serializability Approaches
As explained earlier, 2PC concerns another letter in ACID: the "A", which stands for atomicity. This requirement is orthogonal to isolation. Thus, we can use 2PC with any of the serializability approaches explained above. For example, Google Spanner uses
2PC+2PL (with wound-wait). Another system called
GRIT uses 2PC + OCC.
Distributed Transactions for Replicated Systems
To provide high availability, many modern systems use replication. As soon as we replicate our data, another dimension is added to our transactional systems. As mentioned earlier, there are various levels of replica consistency. Providing strict serializability for partitioned and replicated systems is an interesting and challenging problem. The following systems are examples of transactional systems that solve
this problem using different approaches: