Log Structured Storage

Log-structured storage is a powerful approach to organizing data that prioritizes sequential writes for efficiency and durability. In this post, I delve into the inner workings of log-structured storage, focusing on LSM-trees. We'll explore their write and read paths, the mechanics of compaction, and compare leveled and tiered compaction strategies. Additionally, I'll discuss the differences between LSM-trees and B-trees, and talk about variations like Jungle and Bitcask, highlighting their unique optimizations and trade-offs. 

Basics

Write Path

Writes are first appended to a Write-Ahead Log (WAL) for durability, and then applied to a MemTable, which is basically an in-memory sorted data structure such as a balanced tree (e.g., a red-black or AVL tree) or a skip list (I have a detailed post about skip lists on this blog). Once a write is recorded in the WAL and the MemTable, we can return success to the client. We can have multiple MemTables, but at any given time exactly one of them is writable; the others can only be used for reads. Once a MemTable reaches a certain size, it becomes read-only and a new writable MemTable is created. Read-only MemTables are later flushed to SSTables. Note that deletions are written as tombstones.
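To make this concrete, here is a minimal Python sketch of the write path described above. The class and constant names (LSMWriter, TOMBSTONE) and the 64 MB flush threshold are illustrative assumptions, not taken from any particular engine, and a plain dict stands in for the sorted MemTable structure.

```python
import json
import os

TOMBSTONE = "__tombstone__"          # marker for deletions (illustrative)
MEMTABLE_LIMIT = 64 * 1024 * 1024    # flush threshold; 64 MB is an arbitrary example


class LSMWriter:
    """Minimal write path: WAL append -> MemTable insert -> rotate when full."""

    def __init__(self, wal_path="wal.log"):
        self.wal = open(wal_path, "a")       # append-only write-ahead log
        self.memtable = {}                   # stand-in for a skip list / red-black tree
        self.readonly_memtables = []         # frozen MemTables waiting to be flushed
        self.size = 0

    def put(self, key, value):
        # 1. Durability first: append the record to the WAL and fsync it.
        record = json.dumps({"k": key, "v": value})
        self.wal.write(record + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. Apply to the writable MemTable; now we can acknowledge the client.
        self.memtable[key] = value
        self.size += len(record)
        # 3. Rotate: freeze the MemTable once it grows past the threshold.
        if self.size >= MEMTABLE_LIMIT:
            self.readonly_memtables.append(self.memtable)  # read-only from now on
            self.memtable, self.size = {}, 0               # fresh writable MemTable

    def delete(self, key):
        self.put(key, TOMBSTONE)             # deletions are just tombstone writes
```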

Figure 1. Write path

Read Path

When looking up a key, since it can be in the MemTable(s) or any of the SSTables, we keep searching them until we find the key. If there are multiple occurrences of the key, we only care about the last write. Thus, by searching the MemTable(s) and SSTables in reverse chronological order, we can return the first value we encounter for the key. Since both MemTables and SSTables are sorted data structures, we can search each of them with binary search in logarithmic time. Using bloom filters, we can know for sure when a key does not exist in an SSTable, saving time by not searching it. Thus, for each SSTable, we also store its bloom filter data in the file to use during searches. Note that without bloom filters, looking up a key that does not exist requires searching every SSTable. Also note that, most of the time, the index blocks of SSTables are cached, so searching an SSTable does not necessarily require a binary search on disk.
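A rough sketch of this lookup order, assuming each SSTable object exposes a bloom filter check (may_contain) and a point read (get) over its sorted entries; both method names are hypothetical, not a real engine's API.

```python
def lookup(key, memtables, sstables):
    """Search newest-to-oldest and return the first value found (or None).

    `memtables` and `sstables` are assumed to be ordered oldest-to-newest;
    each SSTable is assumed to expose `may_contain(key)` (bloom filter) and
    `get(key)` (binary search over its sorted entries, None if absent).
    A returned tombstone means the key was deleted.
    """
    # 1. The writable MemTable and then the frozen ones, newest first.
    for table in reversed(memtables):
        if key in table:
            return table[key]
    # 2. SSTables, newest first; skip any file whose bloom filter
    #    guarantees the key is absent.
    for sst in reversed(sstables):
        if not sst.may_contain(key):
            continue
        value = sst.get(key)
        if value is not None:
            return value
    return None
```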

Figure 2. Read path

Compaction

We never update the value of a key in place. Instead, we record its new value by appending it to the MemTable, which will eventually be flushed to an SSTable. Thus, we may have different values for the same key in multiple SSTables. We only care about the last value, and storing the older values is unnecessary. Using compaction, we can delete older values and reclaim space.

Compaction is basically running a merge sort on two or more SSTables and creating a new SSTable, possibly of a larger size. One way to merge sort SSTables is as follows: we iterate over the input SSTables, put the next entry of each one into a heap, and keep writing the top of the heap to the output SSTable (so each time we find the next element in O(log S), where S is the number of SSTables). Since we only care about the last value of a key, we order heap entries by <key, version> such that a lower key and a higher version have higher priority. Once we write a key to the output SSTable, we ignore any subsequent heap tops with the same key and just pop them until we reach a new key.
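Here is a small sketch of that heap-based merge using Python's heapq, with each input run modeled as an in-memory list of (key, version, value) tuples; real SSTables would be streamed from disk, but the ordering logic is the same.

```python
import heapq


def merge_sstables(runs):
    """k-way merge of sorted runs, keeping only the newest version of each key.

    Each run is a list of (key, version, value) tuples sorted by key;
    a higher version means a newer write.
    """
    heap = []
    for run_id, run in enumerate(runs):
        if run:
            key, version, value = run[0]
            # Order by (key, -version): for equal keys the newest version wins.
            heapq.heappush(heap, (key, -version, run_id, 0, value))

    output, last_key = [], object()
    while heap:
        key, neg_version, run_id, idx, value = heapq.heappop(heap)
        if key != last_key:                       # first (= newest) occurrence of this key
            output.append((key, -neg_version, value))
            last_key = key
        # Older duplicates of the same key are simply popped and dropped.
        if idx + 1 < len(runs[run_id]):
            nkey, nversion, nvalue = runs[run_id][idx + 1]
            heapq.heappush(heap, (nkey, -nversion, run_id, idx + 1, nvalue))
    return output


# Example: key "a" appears in both runs; only version 7 survives the merge.
run1 = [("a", 7, "new"), ("c", 3, "x")]
run2 = [("a", 2, "old"), ("b", 5, "y")]
assert merge_sstables([run1, run2]) == [("a", 7, "new"), ("b", 5, "y"), ("c", 3, "x")]
```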

One naive approach to compaction is as follows: we first merge sort all SSTables into a single big SSTable and delete the small SSTables. Then, once in a while, when we have more than a threshold number of SSTables, we merge sort all of them AND the big SSTable again, keep the new SSTable, and delete the small SSTables and the old big SSTable.

With this naive approach, consider this scenario: suppose we have 1 million keys in the big SSTable. Now, we start writing keys in the range 101-200, so our working set is only about 100 keys. The changes to these 100 keys create enough SSTables to kick off a compaction. With the approach above, what do we have to do? We have to re-write the 1 million keys in the big SSTable just to reflect changes to 100 keys! To avoid scenarios like this, LSM-trees use more sophisticated compaction strategies. The two most common ones are leveled and tiered compaction.

Leveled Compaction

In leveled compaction, SSTables are organized in levels. Level 0 consists of raw SSTables that are dumps of the MemTable. They may overlap, as we simply write out the MemTable contents when we create each of them. Each level i > 0 consists of non-overlapping SSTables, i.e., each SSTable in those levels covers a range of keys not included in any other SSTable in that level. The maximum size of a level grows exponentially. For example, L0 might have size 300 MB, and for each subsequent level we multiply the size by 10 (i.e., 3 GB, 30 GB, 300 GB, etc.). The ratio of the size of the next level to that of the current level is called the fanout.

We have a limit on the number of SSTables in each level. Once the number of SSTables in a level goes above the maximum allowed, we do a compaction. Suppose we need to compact level Li. We pick one (or more) SSTables in Li, see what key range they cover, and find which SSTables in the next level (Li+1) overlap with that range. Then we merge all of the SSTables we picked from Li with the overlapping SSTables in Li+1. The output of this merge might be a big SSTable; since we have a limit on SSTable size in each level, we may need to split the output into multiple non-overlapping SSTables. We put the output SSTables in Li+1, then delete the original SSTables we merged from Li and Li+1.
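A sketch of one such compaction step, under the same in-memory modeling as before; it reuses the merge_sstables function from the earlier sketch, and the 4-entry split limit is an arbitrary stand-in for a byte-size limit.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    # A simplified on-disk run: sorted (key, version, value) entries.
    entries: list

    @property
    def min_key(self):
        return self.entries[0][0]

    @property
    def max_key(self):
        return self.entries[-1][0]


def compact_into_next_level(level_i, level_next, max_entries_per_sstable=4):
    """One leveled-compaction step: pick an SSTable from Li, merge it with the
    overlapping SSTables in Li+1, split the result into size-bounded
    non-overlapping SSTables, and install them in Li+1."""
    victim = level_i.pop(0)                          # pick policy varies (e.g., round-robin)
    overlapping = [s for s in level_next
                   if s.min_key <= victim.max_key and s.max_key >= victim.min_key]
    # Reuse the heap-based merge sketched earlier.
    merged = merge_sstables([victim.entries] + [s.entries for s in overlapping])
    # Split the merged run so no output SSTable exceeds the size limit.
    outputs = [SSTable(merged[i:i + max_entries_per_sstable])
               for i in range(0, len(merged), max_entries_per_sstable)]
    # Install the outputs and drop the inputs we just merged.
    remaining = [s for s in level_next if s not in overlapping]
    level_next[:] = sorted(remaining + outputs, key=lambda s: s.min_key)
    return outputs
```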

Figure 3. Leveled Compaction

One issue with leveled compaction is that to push an SSTable from level Li down, we have to re-write the overlapping SSTables at level Li+1. Thus, the write amplification is size(Li+1)/size(Li), which is the fanout (see above). This write amplification can be limiting for write-heavy workloads.

Why do higher levels in Figure 3 seem to cover smaller ranges despite having larger SSTables?
Note that the lower and upper bounds of an SSTable do NOT mean that the SSTable includes all keys in that range. For example, in SSTable 1, having 200 and 250 as lower and upper bounds does not necessarily mean SSTable 1 contains every key in the range [200, 250]. As we go to higher levels, the SSTables become denser, i.e., they cover more keys within their range. That's why SSTables at higher levels are usually larger while covering smaller ranges.

Tiered Compaction

Tiered compaction is another compaction approach that has lower write amplification than leveled compaction. In this approach, we simply pick N SSTables of similar size and merge them. The output SSTable may be bigger and will later be merged with other SSTables of its own size. Another way of thinking about tiered compaction is again in terms of levels. Each level has a size limit, and for each level we have a limit on how many SSTables it can hold. Unlike leveled compaction, where each level consists of non-overlapping SSTables, in tiered compaction we can stack overlapping SSTables covering the same range on top of each other. When the stack is full, we merge all SSTables in the stack, possibly creating a new, larger SSTable that is stacked on a range in the next level.

The key difference between leveled and tiered is this:
  • In leveled, you compact SSTables from Li and Li+1 and write the output to Li+1
  • In tiered, you compact SSTables from Li and write the output to Li+1


Figure 4. Tiered Compaction

So we never re-write an SSTable into its own level; we only write from Li to Li+1. Thus, the write amplification is only 1, which is significantly lower than the fanout, the write amplification of leveled compaction.
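A minimal sketch of this per-level stacking, where each level is just a list of sorted runs and `merge` can be the heap-based merge sketched earlier; the threshold of 4 runs per level is an arbitrary example.

```python
def maybe_compact_tiered(levels, level, merge, max_runs_per_level=4):
    """Tiered compaction sketch: `levels` is a list of levels, each level a list
    of sorted runs (lists of (key, version, value) entries). When a level
    accumulates `max_runs_per_level` runs, merge them into one run and push it
    to the next level; nothing is ever rewritten within its own level.
    """
    runs = levels[level]
    if len(runs) < max_runs_per_level:
        return
    merged = merge(runs)                    # e.g., the merge_sstables sketch above
    runs.clear()                            # the inputs are dropped from Li
    if level + 1 == len(levels):
        levels.append([])                   # grow a new level if needed
    levels[level + 1].append(merged)        # the output lands in Li+1
    # The next level may now be full as well, so compaction can cascade.
    maybe_compact_tiered(levels, level + 1, merge, max_runs_per_level)
```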


Note that nothing prevents us from using one approach for one level and another approach for another level in the same LSM-tree. Thus, we can use a hybrid approach.

LSM-Tree vs. B-Tree

Both LSM-tree and B-tree are disk-based sorted data structures, but with very different internals making them good for different purposes. In general, LSM-tree is better for write-heavy workloads, while B-tree is better for read-heavy or balanced workloads. 

Note that the following is not a rigorous comparison; it just touches on some aspects to consider when comparing B-trees and LSM-trees.
  • Write amplification (writes to disk / writes to the DB):
    For a B-tree, write amplification results from at least two things: updating internal nodes, and having to write an entire page back for each change. Yes, we can cache pages, but when the memory buffer is full, to update a page we have to write back one of the existing dirty pages. For an LSM-tree, on the other hand, write amplification results from the re-writes done during compactions. For leveled compaction, the write amplification from level i to i+1 is equal to the fanout (size of Li+1 / size of Li). With tiered compaction, the write amplification from level i to i+1 is 1. The actual comparison depends on the parameters, but for write amplification we can generally say tiered LSM-tree < leveled LSM-tree < B-tree; the least write amplification is for a tiered LSM-tree (see the back-of-the-envelope numbers after this list).
  • Read amplification (pages read per query): For reads, performance heavily depends on how much data we assume is cached. If we assume all B-tree internal nodes are cached, then its read amplification is 1, as we search the tree in memory and do only one page read to fetch the row. Similarly, if we assume that for an LSM-tree all index blocks of all levels are cached in memory, the read amplification is again 1, as we can do all binary searches over SSTable indexes in memory. In the cold-cache case, for a B-tree we need to search the tree on disk to find the row, so read amplification is logarithmic in the size of the DB. For an LSM-tree, on the other hand, we need to do a binary search in every SSTable whose bloom filter does not rule out the key. Generally, the read amplification of B-trees is lower than that of LSM-trees.

  • Space amplification (size on disk / size of the DB): One cause of space amplification in a B-tree is fragmentation due to partially filled pages. For an LSM-tree, space amplification comes from keeping old versions of keys around until compaction removes them. Leveled compaction has lower space amplification than tiered compaction, at the cost of higher write amplification.
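As a rough illustration of the write-amplification ordering above, here is a back-of-the-envelope calculation with assumed parameters (fanout 10, 4 levels, 8 KB pages, 100-byte rows); the numbers only show the shape of the comparison and are not measured results.

```python
# Back-of-the-envelope write amplification (illustrative numbers, not a benchmark).
# Assume 4 levels beyond L0, a fanout of 10, and that every byte eventually
# travels down to the last level.
fanout, levels = 10, 4

# Leveled: moving a byte from Li to Li+1 rewrites ~fanout bytes of Li+1 alongside it.
leveled_wa = fanout * levels        # ~40 disk writes per byte written to the DB

# Tiered: a byte is written once per level it passes through.
tiered_wa = 1 * levels              # ~4

# B-tree: a small change rewrites at least one full page (plus WAL and
# internal-node updates); with 100-byte rows and 8 KB pages that alone is ~80x.
btree_page_wa = 8 * 1024 / 100      # ~82

print(leveled_wa, tiered_wa, round(btree_page_wa))
```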

Other Log Structured Systems

There are many variations of log-structured storage systems. You can refer to [1] to learn more.

Here, I briefly talk about two interesting variations.

Jungle: LSM-Tree with COW B+-Tree

As we said above, a tiered LSM-tree has lower write amplification than a leveled LSM-tree, but it suffers from higher read amplification, because at read time, in each level we have to search multiple SSTables, as opposed to a leveled LSM-tree where in each level > 0 we need to search only one SSTable. Jungle [2] solves this issue by maintaining an append-only (a.k.a. copy-on-write (COW)) B+-tree instead of a stack of SSTables. Like a tiered LSM-tree, Jungle avoids compacting a level into itself, so it still enjoys a write amplification of 1 between two levels, but unlike a tiered LSM-tree, at read time we don't need to search multiple SSTables. Instead, we can use the B+-tree to quickly find the key, similar to binary searching just one SSTable.


Figure 5. Jungle

How come Jungle does not increase read amplification even though we maintain an append-only B+-tree? To search an append-only B+-tree, don't we need to first figure out its latest state at read time, and doesn't that come with overhead?
Note that since, in the append-only B+-tree, the root is always at the end of the B+-tree log, we can quickly find the latest state of the tree and run an efficient search. In other words, although we have an un-compacted B+-tree, we can still find the latest state without any extra overhead: we quickly find the latest root, and from there the latest children, and so on. Thus, the uncompacted state of the B+-tree only affects space, not read amplification.
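A toy illustration of why the latest root is cheap to find: if copy-on-write node updates are appended to a log and the new root is always written last, a reader can start from the tail of the log. The node layout below is invented for illustration and is not Jungle's actual on-disk format.

```python
class CowBPlusTreeLog:
    """Toy append-only (copy-on-write) B+-tree; NOT Jungle's real format.
    Nodes are dicts appended to a list that stands in for the log file;
    child pointers are offsets into the log."""

    def __init__(self):
        self.log = []

    def append(self, node):
        self.log.append(node)          # copy-on-write: never overwrite in place
        return len(self.log) - 1       # the node's "offset" in the log

    def get(self, key):
        node = self.log[-1]            # the latest root is always the tail: O(1) to find
        while node["kind"] == "internal":
            # children: sorted (separator_key, child_offset) pairs;
            # follow the right-most child whose separator is <= key.
            _, offset = max((c for c in node["children"] if c[0] <= key),
                            default=node["children"][0])
            node = self.log[offset]
        return node["values"].get(key)


# A tiny tree: two leaves plus a root, appended in copy-on-write order.
tree = CowBPlusTreeLog()
left = tree.append({"kind": "leaf", "values": {"a": 1, "b": 2}})
right = tree.append({"kind": "leaf", "values": {"m": 3, "z": 4}})
tree.append({"kind": "internal", "children": [("a", left), ("m", right)]})
assert tree.get("z") == 4 and tree.get("b") == 2
```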

Since the B+-tree in Jungle is append-only, we need to compact it once in a while to reclaim space and to make range queries more efficient by laying out data in key order on disk. Thus, in Jungle we have two types of compaction: one that compacts the B+-tree of each sorted run, and one similar to tiered compaction that moves data from one level to the next. By tuning the B+-tree compaction threshold, we can trade off space and write amplification: a higher threshold means we allow the B+-tree to grow more before we compact it, which results in lower write amplification but higher space amplification. A lower threshold means more frequent B+-tree compaction, which results in higher write amplification but lower space amplification due to smaller B+-tree data.

Bitcask: Unordered Log Structured Storage

The LSM-trees we have discussed so far are all ordered. Ordered storage is great when we need to perform range queries. But what if we don't care about range queries and only want to perform point reads? For example, suppose we just want a persistent unordered map. In that case, we can do even better and achieve significantly lower write amplification. Bitcask is an example of an unordered log-structured storage system; it was originally designed for Riak. You can think of Bitcask as an LSM-tree without MemTables and SSTables, with only the WALs: we just append updates to a file, and once the file reaches a certain size, we make it read-only and start writing to a new file. Nothing else is written to disk! Just append-only sequential writes to these log files. Then, to have fast lookups, we keep a hashmap in memory that records the location of the latest value of each key in one of these files. Since we only append and never update files, we need to compact them once in a while to reclaim space. That's it! Again, the issue with Bitcask is that you cannot efficiently perform range queries, as keys are not stored in sorted order, so reading a range requires random access.
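A minimal sketch of this design: an append-only data file plus an in-memory hashmap (Bitcask calls it the keydir) pointing at the latest record of each key. The record format, the single active file, and the lack of rotation and merging here are simplifications for illustration, not Bitcask's real encoding.

```python
import os


class TinyBitcask:
    """Minimal Bitcask-style store: one append-only data file plus an in-memory
    hash map from key to the offset of its latest record. Tab-separated lines
    are a simplified record format; rotation and compaction are omitted."""

    def __init__(self, path="data.log"):
        self.file = open(path, "a+b")
        self.keydir = {}                     # key -> offset of its latest record

    def put(self, key, value):
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(f"{key}\t{value}\n".encode())  # sequential append only
        self.file.flush()
        self.keydir[key] = offset            # point the in-memory index at the new record

    def get(self, key):
        offset = self.keydir.get(key)
        if offset is None:
            return None
        self.file.seek(offset)               # exactly one random read per lookup
        _, _, value = self.file.readline().decode().rstrip("\n").partition("\t")
        return value
```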


Figure 6. Bitcask 

Conclusion

In this post, I talked about the basics of log-structured storage, focusing on how LSM-trees handle writes with WALs and MemTables, and reads using SSTables and bloom filters. I explained compaction techniques, comparing leveled and tiered approaches, and discussed their trade-offs in terms of write, read, and space amplification. Additionally, I touched on the differences between LSM-trees and B-trees, and explored variations like Jungle, which combines LSM-trees with B-trees, and Bitcask, a simpler unordered log-structured storage system.

References

[1] Chen Luo and Michael J. Carey. "LSM-based storage techniques: a survey." The VLDB Journal 29, no. 1 (2020): 393-418.

[2] Jung-Sang Ahn, Mohiuddin Abdul Qader, Woon-Hak Kang, Hieu Nguyen, Guogen Zhang, and Sami Ben-Romdhane. "Jungle: Towards Dynamically Adjustable Key-Value Store by Combining LSM-Tree and Copy-On-Write B+-Tree." In 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19). 2019.




