NuRaftOnTheRocks: A Replicated Key-value Store using RocksDB and NuRaft

NuRaft is an open source C++ Raft implementation. NuRaft is highly pluggable; it allows you to provide your own persistent log store and state manager and integrate it with your state machine. That allows you to integrate NuRaft with any application whenever you need log replication. In this post, I show how you can integrate NuRaft with RocksDB to build a replicated key-value store. 

Raft

Raft is a consensus and log replication protocol. Raft allows multiple processes in a distributed system to agree on the entries of a shared log. Using such a guarantee, we can achieve state machine replication as follows: having a set of state machines starting at the same initial state, we record operations on the state machine on a shared log, and have each state machine perform operations specified in the shared log. This way all state machines remain consistent with each other. State machine replication is very useful. We can create replicated data stores. In this post, we want to create a replicated key-value store using Raft. In this example, our state machine is a key-value store for which we use RocksDB. 


Figure 1
. State machine replication with Raft.

I have talked in detail about Raft in an earlier post in this blog. If you like to know more about the internals of Raft you can refer to that post. 

NuRaft

NuRaft is an open source implementation of Raft in C++, developed by my ex-team at eBay. NuRaft is highly plugable; you can easily integrate it with your own persistent log implementation,  state manager and most importantly your own state machine. Specifically, to use NuRaft, you need to implement the following interfaces:
Once you have implementation for these interfaces, you don't need to worry about replication, as NuRaft takes care of that for you and guarantees sequential consistency, i.e., all state machines will apply operations in the log in the exact same order.

RocksDB

RocksDB is a popular embedded key-value store. RocksDB is not a distributed or replicated key-value store. Instead, you can use it as a storage engine for a distributed/replicated key-value store. RocksDB is an LSM-tree storage.
Figure 2. An LSM-tree with leveled compaction. 

I have talked in detail about log-structured storage and LSM-trees in an earlier post in this blog.  If you like to know more about the internals of LSM-trees you can refer to that post. 

Replicating RocksDB with NuRaft

In this post, I want to show how you can integrate NuRaft with RocksDB to build a replicated key-value store.

I use the in-memory implementations for log store and state manager provided in the NuRaft repository for examples. Note that with an in-memory log store we actually don't have the consistency guarantees. In a real system you need persistent implementations for these interfaces. You can refer to ClickHouse's log store for an example of real persistent log store.

Source Code

You can access the source code of the replicated key-value store explained in this post on Github:

Before reading the next section, refer to the repository's Readme and follow the instructions for building and running the tests. 

Code Walkthrough

The source code is organized as follows:
  • api: This folder contains a simple gRPC service definition. The service has two sets of operations: 1) key-value CRUD operations (e.g. put or get a key) and 2) Raft operations (e.g., for adding a new server to the raft group)
  • scripts: to run tests
  • src: The main code. 
  • tests: includes tests for all implementations
Let's walk through the src folder. I have four different implementations, under src folder:
  • single in-memory: a single in-memory key-value store.
  • single RocksDB: a single persistent key-value store backed by RockDB.
  • multi in-memory: a replicate in-memory key-value store backed by NuRaft.
  • multi RocksDB: a replicated persistent key-value store backed by NuRaft and RocksDB.
All of these implementations share code in the src/common folder. So let's see what we have in src/common. 
  • grpc_server: Defines GrpcServer class that implements the gRPC service. It has an instance of KeyValueServer interface.
  • kv_server_interface: Defines KeyValueServer interface that include the key-value CRUD operations and raft operations.
  • raft_kv_server: Implements the KeyValueServer interface and has an instance of KeyValueStateMachine interface.
  • state_machine_interface: Defines KeyValueStateMachine which is a virtual class that implements the nuraft::state_machine interface and requires its own pure virtual functions. 
    • getKey
    • scan
    • applyPayload
For single key-value stores (i.e. single in-memory and single RocksDB), we directly implement the KeyValueServer interface. For replicated key-value stores (i.e., multi in-memory and multi RocksDB), we implement the KeyValueStateMachine interface. 

Figure 3 shows the design and how classes and interfaces are related to each other.

Figure 3. Interfaces and classes

run_server.hxx file has the factory logic to create the appropriate GrpcServer instance based on the input arguments passed to the main function.

Conclusion

In this post, we saw how we can use NuRaft to create a replicated state machine system. Here, we used RocksDB as our state machine and built a replicated key-value store by integrating it with NuRaft, but you can plug NuRaft any state machine to achieve strong consistency and log replication.

Comments

Popular posts from this blog

In-memory vs. On-disk Databases

DynamoDB, Ten Years Later

ByteGraph: A Graph Database for TikTok

Eventual Consistency and Conflict Resolution - Part 1