Log Structured Storage
Log-structured storage is a powerful approach to organizing data that prioritizes sequential writes for efficiency and durability. In this post, I delve into the inner workings of log-structured storage, focusing on LSM-trees. We'll explore their write and read paths, the mechanics of compaction, and compare leveled and tiered compaction strategies. Additionally, I'll discuss the differences between LSM-trees and B-trees, and talk about variations like Jungle and Bitcask, highlighting their unique optimizations and trade-offs.
Basics
Write Path
Writes are first appended to a Write-Ahead Log (WAL) for durability, and then applied to a MemTable, which is basically an in-memory sorted data structure such as a balanced tree (e.g., a red-black or AVL tree) or a skip list (I have a detailed post about skip list on this blog). After a write is recorded in the WAL and the MemTable, we can return success to the client. We can have multiple MemTables, but at any given time, we have exactly one writable MemTable; the other ones can only be used for reads. Once a MemTable reaches a certain size, it becomes read-only and a new MemTable is created. Read-only MemTables are later flushed to disk as SSTables. Note that deletions are written as tombstones.
Figure 1. Write path
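To make these steps concrete, here is a minimal sketch of the write path in Python, under simplifying assumptions: the on-disk formats and names (wal.log, sstable-N.jsonl, MEMTABLE_LIMIT) are made up for illustration, and a plain dict stands in for the sorted MemTable that a real engine would implement as a skip list or balanced tree.

```python
import json, os

MEMTABLE_LIMIT = 1024          # flush threshold (bytes of logged data); made-up value
TOMBSTONE = None               # deletions are recorded as tombstones

class LogStructuredStore:
    def __init__(self, directory="data"):
        os.makedirs(directory, exist_ok=True)
        self.dir = directory
        self.wal = open(os.path.join(directory, "wal.log"), "a")
        self.memtable = {}         # stand-in for a skip list / balanced tree
        self.wal_bytes = 0
        self.sstable_id = 0

    def put(self, key, value):
        record = json.dumps({"k": key, "v": value}) + "\n"
        self.wal.write(record)               # 1. append to the WAL for durability
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.memtable[key] = value           # 2. apply to the writable MemTable
        self.wal_bytes += len(record)
        if self.wal_bytes >= MEMTABLE_LIMIT: # 3. MemTable full: flush it as an SSTable
            self._flush()
        # only now would we return success to the client

    def delete(self, key):
        self.put(key, TOMBSTONE)             # a delete is just a write of a tombstone

    def _flush(self):
        path = os.path.join(self.dir, f"sstable-{self.sstable_id}.jsonl")
        with open(path, "w") as f:
            for key in sorted(self.memtable):        # SSTables are sorted by key
                f.write(json.dumps({"k": key, "v": self.memtable[key]}) + "\n")
        self.sstable_id += 1
        self.memtable = {}
        self.wal.truncate(0)                 # flushed data no longer needs the WAL
        self.wal_bytes = 0
```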
Read Path
When looking up a key, since it can be in the MemTable(s) or any of the SSTables, we keep searching them until we find it. Note that if there are multiple occurrences of the key, we only care about the most recent write. Thus, by searching the MemTable(s) and SSTables in reverse chronological order, we can return the first value we encounter for a key. Since both MemTables and SSTables are sorted data structures, we can search each of them with binary search in logarithmic time. Using bloom filters, we can know for sure when a key does not exist in an SSTable, and save time by not searching it. Thus, for each SSTable, we store its bloom filter in the file as well and use it during searches. Note that without bloom filters, searching for a key that does not exist would require searching all SSTables. Also note that, most of the time, the index blocks (the keys) of SSTables are cached, so searching an SSTable does not necessarily require a binary search on disk.
Figure 2. Read path
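The read path can be sketched similarly. The bloom filter below is a toy (a few hashes over a Python integer used as a bit array), and the SSTable is held in memory as two parallel lists; these names and parameters are illustrative, not from any particular engine.

```python
import hashlib
from bisect import bisect_left

class BloomFilter:
    """A toy bloom filter: if might_contain() is False, the key is definitely absent."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.num_hashes, self.bits = size_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

class SSTable:
    def __init__(self, sorted_items):                 # [(key, value), ...] sorted by key
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                               # definitely not here: skip the search
        i = bisect_left(self.keys, key)               # binary search over the sorted keys
        return self.values[i] if i < len(self.keys) and self.keys[i] == key else None

def lookup(key, memtables, sstables):
    """Both lists are ordered oldest -> newest; the most recent occurrence wins."""
    for memtable in reversed(memtables):              # writable MemTable first
        if key in memtable:
            return memtable[key]
    for sstable in reversed(sstables):                # then SSTables, newest first
        value = sstable.get(key)
        if value is not None:
            return value
    return None                                       # (tombstones ignored here for brevity)
```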
Compaction
We never update the value of a key in place. Instead, we record its new value by appending it to the MemTable, which will eventually be flushed as an SSTable. Thus, we may have different values for the same key in multiple SSTables. We only care about the latest value, and storing the older values is unnecessary. Using compaction, we can delete older values and reclaim space.
Compaction is basically running a merge sort on two or more SSTables and creating a new SSTable, possibly of a larger size. One way to merge sort SSTables is as follows: we iterate over the input SSTables, put the next entry of each one into a heap, and keep writing the top of the heap to the output SSTable (so each time we can find the next element in O(log S), where S is the number of SSTables). Note that since we care about the latest value of a key, we can order heap entries by <key, version> such that a lower key and a higher version have higher priority in the heap. Once we write a key to the output SSTable, we ignore subsequent heap tops with the same key and simply pop them until we reach a new key.
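Here is a minimal sketch of that heap-based merge in Python, assuming each input run is a list of (key, version, value) tuples sorted by key with at most one entry per key (as an SSTable flushed from a MemTable would have):

```python
import heapq

def merge_sstables(runs):
    """Merge sorted runs of (key, version, value) tuples, keeping only the
    newest version of each key. Assumes each run is sorted by key and holds
    at most one entry per key."""
    heap, iterators = [], [iter(run) for run in runs]
    for i, it in enumerate(iterators):
        entry = next(it, None)
        if entry is not None:
            key, version, value = entry
            heapq.heappush(heap, (key, -version, i, value))   # lower key, higher version first

    output, last_key = [], object()                  # sentinel: equals no real key
    while heap:
        key, neg_version, i, value = heapq.heappop(heap)      # O(log S) per element
        if key != last_key:                          # newest occurrence of this key
            output.append((key, -neg_version, value))
            last_key = key
        # older occurrences of the same key are popped and dropped here
        entry = next(iterators[i], None)             # refill from the run we just consumed
        if entry is not None:
            k, v, val = entry
            heapq.heappush(heap, (k, -v, i, val))
    return output
```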
One naive approach to compaction is as follows: we first merge sort all SSTables into a single big SSTable and delete the small ones. Then, whenever the number of small SSTables exceeds a threshold, we merge sort all of them together with the big SSTable, keep the new big SSTable, and delete the small SSTables and the old big SSTable.
With this naive approach, consider this scenario: suppose we have 1 million keys in the big SSTable. Now, we start writing keys in the range 101-200, so we have a working set of about 100 keys. The changes to these 100 keys create enough small SSTables to kick off a compaction. With the above approach, what do we do? We have to re-write the 1 million keys in the big SSTable just to reflect changes to 100 keys! To avoid scenarios like this, LSM-trees use more sophisticated compaction approaches. The two most common ones are leveled and tiered compaction.
Leveled Compaction
In leveled compaction, SSTables are organized in levels. Level 0 consists of raw SSTables that are dumps of the MemTable. They may overlap, as we simply write out the MemTable contents when we create each of them. Each level i > 0 consists of non-overlapping SSTables; each SSTable in those levels covers a range of keys not included in any other SSTable in that level. The maximum size of each level grows exponentially. For example, L0 might have a size limit of 300 MB, and each subsequent level multiplies the size by 10 (i.e., 3 GB, 30 GB, 300 GB, etc.). The ratio of the size of the next level to the current level is called the fanout.
We have a limit on the number of SSTables in each level. Once the number of SSTables in a level exceeds the maximum allowed, we do a compaction. Suppose we need to do a compaction for level Li. We pick one (or more) SSTables in level i, see what key range the picked SSTable(s) cover, and find which SSTables in the next level (Li+1) overlap with this range. Then we merge all of the SSTables we picked from Li with the overlapping SSTables in Li+1. The output of this merge might be a big SSTable. Since we have a limit on the size of SSTables in each level, we may need to split the output into multiple non-overlapping SSTables. We put the output SSTables in Li+1, then delete the original SSTables we merged from Li and Li+1.
Figure 3. Level Compaction
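A minimal sketch of one such leveled-compaction step, under simplifying assumptions: a "run" is just a sorted list of (key, value) pairs, MAX_RUN_ROWS stands in for a byte-size limit, and the merge uses a dict instead of the streaming heap merge shown earlier.

```python
MAX_RUN_ROWS = 4                        # stand-in for the per-SSTable size limit

def key_range(run):
    return run[0][0], run[-1][0]

def overlaps(a, b):
    (a_min, a_max), (b_min, b_max) = key_range(a), key_range(b)
    return a_min <= b_max and b_min <= a_max

def compact_one(picked, next_level):
    """picked: a run from Li; next_level: non-overlapping runs in Li+1."""
    victims = [r for r in next_level if overlaps(picked, r)]
    survivors = [r for r in next_level if not overlaps(picked, r)]

    merged = {}
    for r in victims:
        merged.update(r)
    merged.update(picked)               # picked is from the lower (newer) level, so it wins ties

    rows = sorted(merged.items())
    # split the output into non-overlapping runs that respect the size limit
    outputs = [rows[i:i + MAX_RUN_ROWS] for i in range(0, len(rows), MAX_RUN_ROWS)]
    return sorted(survivors + outputs, key=lambda r: r[0][0])   # new state of Li+1
```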
One issue with leveled compaction is that to merge an SSTable from level Li, we have to re-write the overlapping SSTables at level Li+1. Thus, the write amplification is size(Li+1)/size(Li), which is the fanout (see above). This write amplification can be limiting for write-heavy workloads.
Why do higher levels in Figure 3 seem to cover smaller ranges despite having larger SSTables?
Note that the lower and upper bounds of an SSTable do NOT mean that the SSTable includes every key in that range. For example, in SSTable 1, having 200 and 250 as lower and upper bounds does not necessarily mean SSTable 1 covers all keys in the range [200, 250]. As we go to higher levels, the SSTables become denser, i.e., they cover more keys within their range. That's why SSTables of higher levels are usually larger while covering smaller ranges.
Tiered Compaction
Tiered compaction is another compaction approach that has lower write amplification than leveled compaction. In this approach, we simply pick N SSTables of similar size and merge them. The output SSTable may be of a bigger size, and it will later be merged with SSTables of its own (larger) size. Another way of thinking about tiered compaction is again in terms of levels. Each level has a size limit, and for each level we have a limit on how many SSTables we can have. Unlike leveled compaction, where in each level we have non-overlapping SSTables, in tiered compaction we can stack overlapping SSTables covering the same range on top of each other. When the stack is full, we merge all SSTables in the stack and possibly create a new SSTable of a larger size that is stacked on a range in the next level.
The key difference between leveled and tiered is this:
- In leveled, you compact Li and Li+1 SSTables and write to Li+1
- In tiered, you compact Li SSTables and write the result to Li+1
Figure 4. Tiered Compaction
So we never re-write an SSTable into its own level; we only write from Li to Li+1. Thus, the write amplification from one level to the next is only 1, which is significantly lower than the fanout, the per-level write amplification of leveled compaction.
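Here is a minimal sketch of that bookkeeping, assuming each sorted run is a list of (key, value) pairs and a made-up threshold (MAX_RUNS_PER_LEVEL); the dict-based merge_runs stands in for the streaming heap merge sketched earlier.

```python
MAX_RUNS_PER_LEVEL = 4                    # made-up "stack is full" threshold

def merge_runs(runs):
    merged = {}
    for run in runs:                      # runs are in append order (oldest first),
        merged.update(run)                # so later (newer) runs win ties
    return sorted(merged.items())

def add_run(levels, run, level=0):
    """levels: list of lists of runs; run: sorted list of (key, value) pairs."""
    while len(levels) <= level:
        levels.append([])
    levels[level].append(run)             # just stack the run on its level
    if len(levels[level]) >= MAX_RUNS_PER_LEVEL:
        merged = merge_runs(levels[level])   # each byte is written once...
        levels[level] = []
        add_run(levels, merged, level + 1)   # ...into the next level (write amp ~1 per level)
    return levels
```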
Note that nothing prevents us from using one approach for one level and
another approach for another level in the same LSM-tree. Thus, we can use a
hybrid approach.
LSM-Tree vs. B-Tree
Both LSM-tree and B-tree are disk-based sorted data structures, but with very
different internals making them good for different purposes. In general,
LSM-tree is better for write-heavy workloads, while B-tree is better for
read-heavy or balanced workloads.
Note that the following is not a rigorous comparison; it just touches on some aspects to consider when comparing B-trees and LSM-trees.
- Write amplification (writes to disk / writes to DB): For a B-tree, write amplification results from at least two things: updating internal nodes, and having to write an entire page back for each change. Yes, we can cache pages, but when the memory buffer is full, updating a page requires writing back one of the existing dirty pages. For an LSM-tree, on the other hand, write amplification is the result of re-writes during compaction. For leveled compaction, the write amplification from level i to i+1 is equal to the fanout (the size of Li+1 divided by that of Li). With tiered compaction, the write amplification from level i to i+1 is 1. The actual comparison depends on the parameters, but for write amplification we can generally say tiered LSM-tree < leveled LSM-tree < B-tree; a tiered LSM-tree has the least write amplification (see the back-of-the-envelope sketch after this list).
- Read amplification (pages read per query): For reads, the performance heavily depends on how much data we assume is cached. If we assume all B-tree internal nodes are cached, then its read amplification is 1, as we search the tree in memory and do only one page read to fetch the row. Similarly, if we assume an LSM-tree has all index blocks of all levels cached in memory, the read amplification is again 1, as we can do all the binary searches over SSTable indexes in memory. In the cold-cache case, a B-tree needs to search the tree on disk to find the row, so its read amplification is logarithmic in the size of the DB. An LSM-tree, on the other hand, needs to do a binary search in every SSTable whose bloom filter does not rule the key out. Generally, the read amplification of B-trees is lower than that of LSM-trees.
- Space amplification (size on disk / DB size): One source of space amplification in a B-tree is fragmentation due to partially full pages. For an LSM-tree, space amplification is due to keeping old versions of keys until compaction removes them. As explained above, leveled compaction has lower space amplification than tiered compaction at the cost of higher write amplification.
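As a back-of-the-envelope illustration of the write amplification point above (with made-up numbers, and the common simplification that every byte eventually travels through all levels of the tree):

```python
fanout, num_levels = 10, 4               # made-up parameters

leveled_total_wa = fanout * num_levels   # each Li -> Li+1 move rewrites ~fanout bytes per new byte
tiered_total_wa = 1 * num_levels         # each byte is rewritten once per level it passes through

print(leveled_total_wa, tiered_total_wa) # 40 vs. 4
```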
Other Log Structured Systems
There are many variations of log-structured storage systems. You can refer to
[1] to learn more.
Here, I briefly talk about two interesting variations.
Jungle: LSM-Tree with COW B+-Tree
As we said above, a tiered LSM-tree has lower write amplification than a leveled LSM-tree, but it suffers from higher read amplification, because at read time, for each level we have to search multiple SSTables, as opposed to a leveled LSM-tree, where in each level > 0 we need to search only one SSTable. Jungle [2] solves this issue by maintaining an append-only (a.k.a. Copy-On-Write (COW)) B+-tree instead of a stack of SSTables. Similar to a tiered LSM-tree, Jungle avoids compacting a level into itself, so it still enjoys a write amplification of 1 between two levels, but unlike a tiered LSM-tree, at read time we don't need to search multiple SSTables. Instead, we can use the B+-tree to quickly find the key, similar to binary searching just one SSTable.
Figure 5. Jungle
How come Jungle does not increase the read amplification even though we maintain an append-only B+-tree? To search an append-only B+-tree, don't we need to first figure out the latest state of the tree at read time, and doesn't that come with overhead?
Note that since, in the append-only B+-tree, the root is always at the end of the B+-tree log, we can quickly find the latest state of the tree and run an efficient search. In other words, although we have an un-compacted B+-tree, we can still find the latest state without any extra overhead: we quickly find the latest root, and from there we quickly find the latest children, and so on. Thus, the uncompacted state of the B+-tree only affects the space amplification and not the read amplification.
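Here is a minimal sketch of the "root at the tail" idea, under heavy simplifications I'm assuming for illustration: string keys, a single leaf level with no node splits, and an in-memory list standing in for the append-only node log (offsets are list indices). This is not Jungle's actual layout, just the access pattern it relies on.

```python
LEAF, ROOT = "leaf", "root"

def empty_tree():
    # log[offset] is a node record: (kind, leaf_data, children)
    return [(LEAF, {}, None),                 # one empty leaf at offset 0
            (ROOT, None, [("", 0)])]          # root: sorted list of (min_key, leaf_offset)

def _child_index(children, key):
    return max(i for i, (min_key, _) in enumerate(children) if min_key <= key)

def lookup(log, key):
    _, _, children = log[-1]                  # the latest root is always the last record
    _, leaf_data, _ = log[children[_child_index(children, key)][1]]
    return leaf_data.get(key)

def insert(log, key, value):
    _, _, children = log[-1]
    idx = _child_index(children, key)
    _, old_leaf, _ = log[children[idx][1]]
    log.append((LEAF, {**old_leaf, key: value}, None))    # copy-on-write: append a new leaf
    new_children = list(children)
    new_children[idx] = (children[idx][0], len(log) - 1)  # point at the new leaf
    log.append((ROOT, None, new_children))                # append a new root last
    return log

# usage: every insert appends a new leaf and a new root; lookups start from log[-1]
# log = insert(insert(empty_tree(), "a", 1), "b", 2)
# lookup(log, "a")  ->  1
```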
Since the B+-tree in Jungle is append-only, we need to compact it once in a while to reclaim space and to make range queries more efficient by putting data in key order on disk. Thus, in Jungle we have two types of compaction: one that compacts the B+-tree of each sorted run, and one, similar to tiered compaction, that moves data from one level to the next. By tuning the B+-tree compaction threshold, we can trade off space and write amplification: a higher threshold means we allow the B+-tree to grow more before we compact it, which results in lower write amplification but higher space amplification; a lower threshold means more B+-tree compaction, which results in higher write amplification but lower space amplification due to the smaller B+-tree data.
Bitcask: Unordered Log Structured Storage
The LSM-trees we discussed so far are all ordered. Ordered storage is great when we need to perform range queries. But what if we don't care about range queries and only want to perform point lookups? For example, suppose we just want a persistent unordered map. In that case, we can do even better and have significantly lower write amplification. Bitcask is an example of an unordered log-structured storage system; it was originally designed for Riak. You can think of Bitcask as an LSM-tree without MemTables and SSTables, with only the WALs: we just append updates to a file, and once the file reaches a certain size, we make it read-only and start writing to a new file. There is nothing else written to disk! Just append-only sequential writes to these log files. Then, to allow fast lookups, we keep a hash map in memory that holds the location of the latest value of each key in one of these files. Since we only append and never update files, we need to compact them once in a while to reclaim space. That's it! The downside of Bitcask is that you cannot efficiently perform range queries, as keys are not stored in sorted order, so reading a range requires random access.
Figure 6. Bitcask
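Here is a minimal sketch of this idea, with a made-up record format (tab-separated lines) and segment naming; a real implementation also stores checksums and rebuilds the in-memory index by scanning the files at startup.

```python
import os

SEGMENT_LIMIT = 1 << 20                       # rotate to a new file at ~1 MB (made up)

class Bitcask:
    def __init__(self, directory="bitcask"):
        os.makedirs(directory, exist_ok=True)
        self.dir, self.segment_id = directory, 0
        self.active = open(self._path(0), "ab+")
        self.index = {}                       # key -> (segment_id, offset) of latest record

    def _path(self, segment_id):
        return os.path.join(self.dir, f"segment-{segment_id}.log")

    def put(self, key: str, value: str):
        self.active.seek(0, os.SEEK_END)                  # appends are sequential writes
        offset = self.active.tell()
        self.active.write(f"{key}\t{value}\n".encode())
        self.active.flush()
        self.index[key] = (self.segment_id, offset)       # remember where the latest value lives
        if offset >= SEGMENT_LIMIT:                       # segment full: freeze it and
            self.active.close()                           # start a new one
            self.segment_id += 1
            self.active = open(self._path(self.segment_id), "ab+")

    def get(self, key: str):
        if key not in self.index:
            return None
        segment_id, offset = self.index[key]
        f = self.active if segment_id == self.segment_id else open(self._path(segment_id), "rb")
        f.seek(offset)                                    # one random read, no search
        line = f.readline().decode()
        if f is not self.active:
            f.close()
        return line.rstrip("\n").split("\t", 1)[1]
```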
Conclusion
In this post, I talked about the basics of log-structured storage, focusing on how LSM-trees handle writes with WALs and MemTables, and reads using SSTables and bloom filters. I explained compaction techniques, comparing leveled and tiered approaches, and discussed their trade-offs in terms of write, read, and space amplification. Additionally, I touched on the differences between LSM-trees and B-trees, and explored variations like Jungle, which combines LSM-trees with B-trees, and Bitcask, a simpler unordered log-structured storage system.
References
[1] Chen Luo and Michael J. Carey. "LSM-based storage techniques: a survey." The VLDB Journal 29, no. 1 (2020): 393-418.
[2] Jung-Sang Ahn, Mohiuddin Abdul Qader, Woon-Hak Kang, Hieu Nguyen, Guogen Zhang, and Sami Ben-Romdhane. "Jungle: Towards Dynamically Adjustable Key-Value Store by Combining LSM-Tree and Copy-On-Write B+-Tree." In 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19), 2019.