Graph Databases

Graph data is very common in modern applications. The property of this kind of data is the high levels of connectivity between the entities. Although social networks are the most common example for demonstrating graph data and the importance of graph databases, graphs are not limited to social networks; many other applications can be modeled much more clearly and intuitively with graphs. Once we model our data as a graph, we usually want to explore certain parts of the graph to find relations or interesting patterns. That includes simple use cases such as finding a person's friends or friend-of-friends in a social network, or complex patterns revealing frauds in a financial graph. All these applications require fast traversal of graph nodes. Although we can store a graph in a tabular relational database or a NoSQL database such as key-value/document stores, they typically don't provide fast traversal which is critical for graph applications. In this post, we look at graph databases that are databases designed specifically for dealing with graph data.

Graph solutions are used in both online and offline processing. In this post, however, we only focus on graph databases for online/near-real-time transactional graph processing (OLTP). There are also different graph data models. The two main ones are Labeled Property Graphs (LPGs) and Resource Description Frameworks (RDFs). Here, we don't discuss different graph models. Instead, we only focus on the fast traversal property of the graph databases and see how they achieve it.

Listen to the Audio Blog

Storing Graph Data in Non-Graph Databases

Before talking about graph databases, let's see why can't we just use our familiar relational or NoSQL databases to store graph data. Consider a social network where users can be friends with each other. We can model the data of this social network as a graph where nodes/vertices represent users and the edges represent friendship relations between users. We can assume friendship is mutual, i.e., when A is friends with B, B is also friends with A. Here, however, we assume friendship is not mutual. Thus, when person A considers person B a friend, we have a directed edge from the node representing A to the node representing B. 

Sketch 1. Friendship graph in a social network. Nodes represent users. Edges represent one-directional friendships. 

We can use a relational schema shown in Sketch 2 to store the friendship data in a relational database. It is a nice schema, because we can easily add or remove friendships, and data is normalized, so we don't have any redundancy. Using foreign keys constraints, we can make sure our database prevents us from adding invalid edges. It also prevents us from deleting a node when we have an edge to it thereby preventing us from creating dangling edges in our graph. Thus, the database helps us to keep our graph "consistent" (the "C" in ACID).  

Sketch 2. A Schema for capturing friendship relations in a social network using a relational database

Now, suppose we want to get the list of A's friends. To answer this query, we have to search the friendship table to find rows having A's ID in column P_1. Typical relational databases store data using a sorted data structure. The most common data structures for this purpose are variants of B-Trees. By storing data in a B-Tree, we have a sorted index on the primary key of the table. Having this sorted index, we can find the rows that we are interested in with O(log E) time where E is the number of rows in the table. Right here, you can see one problem with this design: the amount of time we need to wait to find a user's friends grows as the total number of friendship relations grows in our system even if the actual number of the user's friends does not increase. Thus, our query time is a function of the overall size of the graph. That is a serious issue in a system with a huge number of users and friendship relations. 

To find A's friends' names, we need to pay O(log E) to find the IDs of the friends, and then for each of them, we need to pay a O(log N) to find the corresponding rows in the persons table, where N is the number of rows in the persons table (nodes). Thus, finding A's friends' names is of O(log E + F ☓ log N) where F is the average number of a person's friends. Now, suppose we want to find the A's second-degree friends' (i.e., friends of A's friends). The list of friends-of-friends is especially interesting for recommending people who are not already friends with A, but are good candidates to be A's new friends. To get A's second-degree friends' names, we have to first find the list of A's friends which require O(log E), then for each of them, we need to find friends again which require O(log E). Finally, for each of their friends, we need to look up persons table to find their names which requires O(log N). Thus, the total time complexity of finding A's second-degree friends' name is of O(log E + F ☓ (log E + F ☓ log N). In general, finding a user's mth-degree friends' names is of O(Fm-1 ☓ log E + Fm☓ log N). As you can see, the negative effect of the overall size of the graph just gets worse as we query deeper relations. Answering simple questions like finding neighbors or neighbors of neighbors in a graph using a typical relational database is inefficient, let alone finding complicated relations that require more explorations. 

If relational databases are inefficient in capturing relations, why are they called "relational"?! 
Relational databases got their name from the Relational Model defined by Edgar Codd. This model is based on first-order predicate logic. In the relational model, the database is considered as a set of propositions each states a relation between a set of variables from certain domains. We insert our data as propositions to the database, and then we can query it via relational algebra/calculus formulas. Formally, a relation is a subset of the Cartesian product of several sets. Tabular databases are an implementation of this model where relations are captured by tables. To avoid data redundancy, however, we have to normalize data by dividing our relations into multiple tables. To reconstruct the relations and have the full picture, we have to join tables, but to join we have to match rows, and this is where the performance issues show up. Thus, in such a relational database, a relation is divided into two parts; part of it is captured inside a table, and the other part is captured in other tables. Relational databases are efficient for dealing with the first part, but inefficient in dealing with the second part. For applications with non-graph data that occasionally need to join tables that is fine, but for applications dealing with highly connected data, relational databases just don't work. The point of a graph database, as we will see, is to capture all parts of relations equally and efficiently. 

We saw that relational databases are inefficient for storing graph data, but what about NoSQL data stores? Unfortunately, NoSQL data stores, such as key-value/document/wide-column stores, are also inefficient for dealing with graph data. In fact, they are even worse than relational databases in capturing connected data, because while in relational databases we can capture connections in the database using foreign keys, in non-graph NoSQL databases capturing relations must be handled 100% by the applications, as these stores provide no special facility for capturing and tracking relations. 

For example, let's see how we could use a document store to hold the data for our social network application explained above. In a document store, we can store a document (usually in JSON format) for each user. We can capture the list of the user's friends as an attribute in the JSON object. Now, we can quickly find the IDs of the friends in O(1). However, to access the documents storing the friends' information, e.g., to get friends' names, we still need to do an index lookup in O(log N). Thus, finding A's friends' names is still of O(F ☓ log N). In general, finding mth-degree friends' names is of O(Fm☓ log N). Unlike relational databases that provide referential integrity, in a NoSQL database with this schema, the application is responsible for keeping the data consistent. For example, when removing a person, we must make sure we remove it from other users' list of friends. Without this active housekeeping by the application, the database will soon be full of inconsistent information. 


Sketch 3. A Schema for Capturing Friendship Relations in a Social Network using a Document Store

Graph Databases

Graph databases are database management systems that are designed specifically for dealing with graph data. The goal of a graph database is to provide fast graph traversal. The key to fast graph traversal is the fast adjacency, i.e., to find neighbors of a node as fast as possible. Ideally, we want our query time to be independent of the overall size of the graph. 

The Main Idea

In the previous section, we saw why typical tabular relational databases or non-graph NoSQL databases perform poorly for dealing with graph data. As we said, the main issue is index lookups of O(log N) time complexity. The main idea of a graph database is to physically capture the relations between nodes in the write time, so we don't need to spend time figuring out the relations in the read time as we do in a relational database using join operations. By capturing the relations in the write time, graph databases can quickly get the list of direct pointers to edges of a node in the read time. The edges also directly point to the involved nodes. Thus, having these lists of edges, we can find the neighbors of a node in O(K) where K is the number of neighbors of the node. Below, we will see two designs implementing this idea using different approaches. Of course, in a real system, we can have optimizations to improve the performance of these approaches. Here, however, we only focus on the overall ideas. 

Fixed-size Record Storage

One way of implementing direct pointers on a disk is via fixed-sized record store files, i.e., files that consist of fixed-size records. If our files contain only fixed-sized records, then by knowing the offset of a record, we can jump right to its location on a disk without any index lookup. Thus, the offsets are used as the ID of the records. With this approach, a pointer is basically a tuple <filename,  offset>.  In addition to the fast lookup, fixed-sized record stores let us reuse the storage without having too much fragmentation that causes storage waste. Specifically, when a record is not needed anymore, we can easily mark it as deleted, and we can reuse it for a future record. Since records are of the same size, we don't need to worry whether the new record fits in space of the deleted record. Noe4j is a graph database that uses this approach [1]. Table 1 shows the record sizes in Neo4j.

Store File                        | Record size   | Contents
----------------------------------------------------------------------------------------------------------------------------
neostore.nodestore.db             | 15 B          | Nodes
neostore.relationshipstore.db     | 34 B          | Relationships
neostore.propertystore.db         | 41 B          | Properties for nodes and relationships
neostore.propertystore.db.strings | 128 B         | Values of string properties
neostore.propertystore.db.arrays  | 128 B         | Values of array properties
Table 1. Record sizes in Neo4j [1]

In Neo4j, each node has the ID of the first edge in its list of edges. The edges are stored as fixed-size records in the edges (relation) files. By multiplying the ID of the edge and the size of the edge records, we know the exact location of the edge.  Each edge points to the next and previous edges in the list of edges, thereby creating a doubly-linked list. Sketch 4 shows an example. The pink arrows move forward in the list, and the green arrows move backward. The edge record also includes the ID of the start and end nodes. By multiplying a node ID by the size of node records, we get the location of the node. Despite all these indirections, the time complexity is still O(1) as we find out the disk locations in O(1). Compare this to the case where we use a B-tree index, and we need O(log N) to find these locations. 

Note that, to avoid having two records per edge (one for the linked list of the start node and one for the linked list of the end node), we can store the linked list information for both nodes on the same record having four pointers in total: two pointers for the moving forward and backward in the list of edges for the start node, and two pointers for moving forward and backward in the list of edges for the end node. 

Sketch 4. Fixed-size record storage design

Now, let's calculate the time complexity of finding A's friends' names with this design. Since the pointer to the beginning of the linked list of A's friends is stored with A, we have the list of friends in O(1). And since these are direct pointers, we don't need any index lookup. Thus, we can get the name of each friend in O(1). Thus, the time complexity of finding A's friends' names is O(F). In general, finding A's mth-degree friends' names is of O(Fm). The nice thing about this complexity is that there is no term N in it. Thus, the latency of our query is independent of the overall size of our graph; we can grow to billions of users and the query time does not change (except for the initial index lookup to find A itself). 

Log-structured Storage 

In this section, we review the overall ideas of the design proposed in [2]. In this design, instead of fixed-sized record stores, we store data in an append-only log. As its name suggests, to add new data, we simply append it to the end of the log file and we never delete it. [2] provides an RFD-like data model where nodes are simply strings. Thus, to add a new node, we append a length-prefixed string to the end of the log file. There won't be two log entries for the same node. Thus, when the database is asked to add a new node, it first checks the log (via a hash index) to see if it already has the node. If the node is already added, it avoids adding it again. To add an edge, we also append a new entry to the log. Unlike nodes, edges can be removed from the graph. The deletion of an edge is also recorded on the log as a new entry. Thus, a log entry is either a node, the addition of an edge, or the deletion of an edge. Sketch 5 shows the graph shown in Sketch 1 in this design. Note that the size of records are provided just as an example. Each log entry is identified by its offset in the file. In this example, entries with ID (offset) 0, 8, 36, and 84 are nodes. And entries with ID 16, 44, 64, 92, and 112 are edges. Note that the edge from A to C which is added in offset 44 is later deleted in offset 112. 

An append-only file is great for writing the data (in fact, it is theoretically the optimal persistent storage from the write path perspective; what could be faster than just appending to the end of a file?!), but very inefficient for reading the data, because to answer a query, e.g., figuring out what are the neighbors of a given node, we have to read the whole history, every time! If we had only one node in the graph, that wasn't an issue, because all log entries would be related to our query. The issue is that on the log we have lots of entries that are unrelated to our target node. This is where the indexes come into the picture. Using different indexes, we can filter out our log and quickly read its entries related to our query. For example, we can have an index that maps the ID of a node and a certain edge type to log offsets that record addition/deletions of the edges of the given type to the given node. To store the log offsets, [2] uses the VList data structure. 

To avoid O(log E) time complexity in finding the related log entries, [2] uses hash indexes. Using a hash index we can find the related log offsets in O(1) and traverse them linearly in O(H) where H is the average size of the related history (i.e., number of related log entries). To answer a query, e.g., to find the neighbors of a node, we first find related log locations using the index, and then we check them one by one and figure out the list of edges. The process of figuring out the edges by reading the history is called materialization in [2]. With this design, finding A's mth-degree friends' names is of O(Hm). 

We can create various indexes to filter out the log from different perspectives that we are interested in. For example, if we are interested in traversing the relations backward, we can create an index that filters out the log based on <destination + edge type> instead of <source + edge type>

Sketch 5. Log structured design 

But isn't this O(H) an issue? For answering each query, we have to read the related history including all edge additions and deletions.  
For a highly dynamic graph where relations change frequently, yes; that would be a problem, because we will have long histories. However, for many applications such as social networks with low changes in relations, that is not an issue, especially when our data is in memory. 

Discussion: what can be called a graph database? 

We saw why we need a special type of database management system for dealing with graph data and reviewed two designs, but what is the definition of a graph database? There is a debate in the community of graph database developers on what can be called a "graph database". Of course, we don't have such a thing as a wrong or right definition; a definition is just a definition. Thus, the debate should be on what is a useful definition of a graph database. Some people believe as long as a database provides the graph API, it can be considered a graph database, no matter how that API is implemented. However, such definition is not useful, and in fact can be harmful, as application developers using such a database may store their connected data wrongly assuming that the database is designed for efficient handling of the connected data while the database and its underlying storage might not be designed for that purpose which leads to poor performance. 

The majority of developers believe the essence of graph data is fast adjacency/traversal. But now the question is what "fast"? Some vendors restrict fast adjacency with implementation details. They use the term "Index-free Adjacency" as the essence of graph database, and refer to graph databases with index-free adjacency as "native graph" databases. However, at least I was not able to find any formal and clear definition of index-free adjacency. Does it mean that database does not use an index at all? On the other hand, just using an index does not necessarily mean slow adjacency. Sorted indexes like B-Tree indexes needing O(log N) lookup might cause slow traversal, but what about hash indexes as used in [1] and other databases? Shouldn't they be considered a graph database because they use indexes? Overall, defining requirements through implementation details does not seem to be a good approach. 

Some may define "fast" traversal with time complexity, as we did above. For example, we can define a graph database as a database that provides O(K) time complexity for finding neighbors of a node where K is the size of neighbors. We may eagerly say "clearly O(K) is better than O(log N) where N is the overall size of the graph". However, we know that for sub-linear time complexities constants are critical. What if a graph database theoretically provides O(K) time complexity, but for most typical use cases its O(K) time complexity is slower than a relational database with O(log N)? 

As you can see, providing a formal definition for graph databases is not straightforward, and that is the source of debate in the community. From the practical point of view, fast traversal is what we want, but as always there is no one-size-fits-all technology. Thus, for each use case, we have to carefully see if the product we choose is capable of handling our graph data efficiently. 

Summary

In this post, we talked about graph databases, a kind of database management system that are designed for dealing with highly connected data. The key design point of all graph databases is the ability of fast traversal that is usually possible by physically tracking relations in the write time instead of figuring out them in the read time. We reviewed two different approaches for implementing this idea using fixed-size record stores and append-only logs. Both approaches make it possible to explore a subgraph in a time complexity independent of the overall size of the graph. 

References 

[1] Understanding Neo4j’s data on disk. https://neo4j.com/developer/kb/understanding-data-on-disk/  [August 8, 2021].

[2] Andrew Carter, Andrew Rodriguez, Yiming Yang, and Scott Meyer. "Nanosecond indexing of graph data with hash maps and VLists." In Proceedings of the 2019 International Conference on Management of Data, pp. 623-635. 2019. 

Comments

Anonymous said…
Very helpful post! A great starting point for those who are trying to understand systems aspect of graph databases or graph storage in general. Thank you.

Popular posts from this blog

In-memory vs. On-disk Databases

DynamoDB, Ten Years Later

ByteGraph: A Graph Database for TikTok

Eventual Consistency and Conflict Resolution - Part 1