In-memory vs. On-disk Databases

Figure 1 shows the main difference between in-memory and on-disk databases:
An in-memory database stores the data in memory and uses disk for backup, while
an on-disk database stores the data on disk and uses memory for caching. 
Figure 1. In-memory vs On-disk storage engines

Of course, we all know memory (RAM) is much (multiple order of magnitude) faster than disk, so the advantage of an in-memory database is clear. However, the first question that can come up about an in-memory database is what about the durability of our data? We know the data in the memory is not durable. The answer is: we use disk for backing up the data. 

But wait! doesn't backing up data on the disk defeat the purpose of using the memory to have faster operations?! If we synchronously backup the data, then operations must wait for the disk. On the other hand, if we do it asynchronously, we might have data loss, when our system crashes after writing to the memory and returning the operation and before backing up on disk. So, how an in-memory database enjoys the higher speed of the RAM while guaranteeing the durability of our data? 

If we want to guarantee the durability of our data, we have to write it to non-volatile memory. That is something that we cannot avoid. Thus, even an in-memory database must make sure that data has been written to some sort of non-volatile memory such as disk before returning the operation to the client. However, note that the latency overhead of an on-disk database does not come only from writing to the disk, rather it comes from maintaining a data structure such as B-tree on disk. If we simply append data to a file, we are still faster than an on-disk database. We can batch together multiple writes and quickly write a whole disk block. Don't think all disk accesses are the same. There is a huge difference between quickly write blocks one after another to a file with seeking different locations and updating data in them. Thus, to guarantee durability, in-memory databases first simply append the data to an append-only log, and then write it to the memory and return the operation. Now, when the system recovers from a crash, it fills the memory with data that it has on the disk.

Figure 2. Durability in an in-memory database

But doesn't this log become very long overtime? If we simply append data, instead of updating the existing data, our data will continue growing. What is the end game?!

That's a valid point. Having a very long log has two big issues: 1) it wastes storage and 2) it makes recovery time very long, as the system has to read the whole log to rebuild the database on the memory. To prevent that, in-memory databases maintain a disk data structure such as B-tree and write the log entries periodically to the B-tree. After applying log entries to the B-tree, we purge them from the log. This is called checkpointing. Now, when the system restarts, we first load the checkpoint and write more recent log entries that are not included in the checkpoint to the memory. 

Figure 3. Recovery from a checkpoint


 In an on-disk database also we have a WAL and a B-tree, and we cache the data in the RAM. Can't we conclude an in-memory database is nothing but a normal on-disk database with a big cache?! 

In an on-disk database, we still have serialization and data layout overhead compared with an in-memory database. In an in-memory database, we assume the whole data is always in the RAM. We do not have such an assumption in an on-disk database. This will change how we design things like data structures and indexing for our database. 

Comments

Shital said…
Really helpful post to understand in-memory db.
Zhiheng said…
"However, note that the latency overhead of an on-disk database does come completely from writing to the disk, ... " It should be "does not come completely"?
Anonymous said…
Very useful, Thanks
Anonymous said…
Short, clear and useful. Great post Mohammad!

Popular posts from this blog

ByteGraph: A Graph Database for TikTok

DynamoDB, Ten Years Later

Amazon DynamoDB: ACID Transactions using Timestamp Ordering