263-3010-00: Big Data
Section 13
Graph Databases
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 12/08/2024
Disclaimer and Term of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Most of the textbook so far was dedicated to storing and querying data in the shape of tables or trees, collections of trees being a denormalized form of tables.
We now turn to data in the shape of graphs.
Why graphs¶
Relational tables are excellent at querying highly structured, normalized datasets. Normal form avoid redundancies and anomalies by spreading the dataset over multiple tables, with each table storing "one thing" (customers, products, etc). These tables are related to each other by means of primary keys, and foreign keys pointing to them, which is in particular apparent in the entity-relationship model. At runtime, this translates into the use of joining queries.
Even though a single join can be highly optimized: with hash tables, it is even possible to do it in linear time, and in fact primary keys are commonly indexed (hash indices, etc). But this does not scale well, and there are query patterns that bring the relational algebra to its limits (traversal of a high number of relationships, reverse traversal requiring also indexing foreign keys, looking for patterns in the relationships, etc).
Now, we do know a way to avoid joins and studied it at length: denormalizing the data to (homogeneous) collections of trees is a way of "pre-computing" the joins statically, so that the data is already joined (via nesting) at runtime.
Now, why is it efficient? Because going down a tree only necessitates following pointers in memory. But trees cannot have cycles. Graph databases provide a way of generalizing the use in-memory pointers to traverse data to the general case in which cycles are present: this is called "index-free adjacency."
Kinds of graph databases¶
There are many different graph database system products on the market, and they can be classified along several dimensions.
Labeled property graph model vs. triple stores: there are two competing high-level data models for graphs. We will study them both in this chapter.
Read-intensive vs. write-intensive: some graph database systems are more suited for large-scale analytics (read-intensive, OLAP), while others support efficient updates (OLTP).
Local vs. distributed: some graph database systems work on a single machine, some others replicate the (same) data across multiple machines, and recent developments even allowed graph sharding, meaning that the data and the queries are partitioned and distributed over several machines.
Native vs. non-native: some graph database systems are in fact a layer on top of another kind of database system in a different shape (for example, a relational database or a document store) or query engine (for example, Apache Spark). Other graph database systems are directly implemented from scratch to efficiently support graphs.
In fact, these dimensions are not specific to graph database systems, and also apply to other shapes (except, of course, that the competing data models will be other models than those two).
Graph data models¶
Labeled property graphs¶
The basic ingredients of a graph, from a mathematical perspective, are nodes and edges.
Mathematically, a directed graph $G = (N,E)$ is given by a set $N$ of nodes (also called vertices), and a set $E \subseteq N \times N$ of directed edges (also called arcs, or arrows). Given these, it is possible to visually represent the graph as follows (note that in this case, the graph is planar in the sense that edges do not cross, but this does not hold in general):
Graphs can also be undirected, in which case the arrows do not have a direction. Then formally $E \subseteq \{ P \in P(N) | \text{card}(P) = 2\}$. An alternative modeling is to keep $E \in N \times N$, but with the additional constraint that all edges go in both directions:
$$\forall n, m \in P, (n, m) \in E \Rightarrow (m, n) \in E$$
Such is the math at the logical level, and it is already powerful enough to overfill happy Bachelor’s students with joy over many years of algorithms and problem solving (graph coloring, constraints on planar graphs, max-flow/min-cut, Dijkstra’s algorithm, etc). In fact, one of the benefits of graph database systems (or at least some of them) is that they are well suited for implementing and running such algorithms.
Computer scientists need to go one step further and also design how to store graphs physically. One way of doing so is to create an associative array mapping each node to the list of nodes that it connects to via an edge (adjacency lists):
Another storage form is with an adjacency matrix: each row and each column represent a node, and a 0 or a 1 indicate the absence or presence of an edge between the row node and the column node:
Yet another way is with nodes on the rows and edges on the columns, with a-1 (origin node) and 1 (destination node) in each column and 0s otherwise:
The latter way generalizes even to so-called hypergraphs, where each edge can have more than one origin node and more than one destination node.
Now, this does not quite work for us, because labeled property graphs enhance mathematical graphs with extra ingredients: properties, and labels:
Labels are some sort of "tags", in the form of a string, that can be attached to a node or an edge. For example, a node may be tagged as a "Person" and another as a "Course", and the edge between them can be tagged as "Attends". A small graph with these nodes and edge indicates that a person is attending a course.
Labels can be repeated, for example we could have a graph with 23,000 Person nodes and 5,000 Course nodes, and 50’000 Attends edges.
Now, how would we then store information about the persons and courses? Labels are surely not an option, as it would make it cumbersome to know which label is what. Instead, we use properties. Each node and each edge can be associated with a map from strings to values, which represents its properties. You can think of the properties associated with each node or edge as a JSON object, or as a single record in a relational table or in a DataFrame (if nested).
This is a visual of a single node with some labels and its properties:
In the case of neo4j, many property types are supported and we list a few here together with their JSound/JSONiq/XML Schema equivalent:
Neo4j type | JSONiq/JSound/XML Schema type |
---|---|
STRING | string |
BOOLEAN | boolean |
INTEGER | long |
FLOAT | double |
NULL | null |
DATE | date |
TIME | time |
DATETIME | dateTime |
ZONED DATETIME | dateTimeStamp |
ZONED TIME | time (time zone mandatory) |
LOCAL DATETIME | dateTime (no time zone) |
LOCAL TIME | time (no time zone) |
DURATION | duration |
LIST | array |
Neo4j supports also an additional type (POINT) for geographic locations (GIS), which does not have a JSound/JSONiq equivalent. We will see that objects can appear in Neo4j’s query language (the type is called MAP) but objects cannot be used as the properties of a node or edge: the only way to “break the first normal form” in properties is via arrays.
Maybethe reader, at this point, starts seeing some pattern, or might have even already figured out how to "convert" a relational table to a labeled property graph: labels can be seen as table names, nodes as records, and properties as the attribute values for the records:
This shows that relational tables can be physically stored as labeled property graphs. Of course, this does not work the other way round: given a graph, it will often not be possible to convert it "back" to a table in this specific way.
If there are foreign keys pointing to primary keys, these can be represented with edges in the graph, like so: