263-3010-00: Big Data
Section 13
Graph Databases
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 12/08/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Most of the textbook so far was dedicated to storing and querying data in the shape of tables or trees, collections of trees being a denormalized form of tables.
We now turn to data in the shape of graphs.
Why graphs
Relational tables are excellent at querying highly structured, normalized datasets. Normal forms avoid redundancies and anomalies by spreading the dataset over multiple tables, with each table storing "one thing" (customers, products, etc.). These tables are related to each other by means of primary keys and foreign keys pointing to them, which is particularly apparent in the entity-relationship model. At runtime, this translates into the use of joining queries.
A single join can be highly optimized: with hash tables, it is even possible to perform it in linear time, and in fact primary keys are commonly indexed (hash indices, etc.). But joins do not scale well, and there are query patterns that push the relational algebra to its limits: traversal of a large number of relationships, reverse traversal (which also requires indexing foreign keys), looking for patterns in the relationships, and so on.
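To make the linear-time claim concrete, here is a minimal sketch of a hash join over two lists of dictionaries standing in for tables (the table and column names are hypothetical, purely for illustration):

```python
# A hash join runs in O(|left| + |right|) expected time:
# build a hash index on one side, then probe it with the other side.

def hash_join(left, right, left_key, right_key):
    # Build phase: index the left table by its join key.
    index = {}
    for record in left:
        index.setdefault(record[left_key], []).append(record)
    # Probe phase: stream over the right table and look up matches.
    result = []
    for record in right:
        for match in index.get(record[right_key], []):
            result.append({**match, **record})
    return result

# Hypothetical customers/orders tables.
customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"customer": 1, "item": "book"}, {"customer": 1, "item": "pen"}]
joined = hash_join(customers, orders, "id", "customer")
```

Note that this only makes a *single* join cheap; chains of many joins, as in deep relationship traversals, still multiply this cost.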
Now, we do know a way to avoid joins and studied it at length: denormalizing the data to (homogeneous) collections of trees is a way of "pre-computing" the joins statically, so that the data is already joined (via nesting) at runtime.
Now, why is it efficient? Because going down a tree only requires following pointers in memory. But trees cannot have cycles. Graph databases generalize the use of in-memory pointers for traversing data to the general case in which cycles are present: this is called "index-free adjacency."
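The idea can be illustrated with a toy Python sketch (this is not Neo4j's actual storage layout, just an illustration of the principle): each node object holds direct references to its neighbors, so traversal follows pointers instead of performing joins or index lookups, and cycles pose no problem.

```python
# Index-free adjacency, illustrated: each node stores direct
# in-memory references to its neighbours.

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct pointers to other Node objects

a, b, c = Node("a"), Node("b"), Node("c")
a.neighbours.append(b)
b.neighbours.append(c)
c.neighbours.append(a)  # a cycle: impossible in a tree, fine in a graph

# Traverse three hops starting from a, following pointers only.
current = a
path = []
for _ in range(3):
    path.append(current.name)
    current = current.neighbours[0]
```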
Kinds of graph databases
There are many different graph database system products on the market, and they can be classified along several dimensions.
Labeled property graph model vs. triple stores: there are two competing high-level data models for graphs. We will study them both in this chapter.
Read-intensive vs. write-intensive: some graph database systems are more suited for large-scale analytics (read-intensive, OLAP), while others support efficient updates (OLTP).
Local vs. distributed: some graph database systems work on a single machine, some others replicate the (same) data across multiple machines, and recent developments even allowed graph sharding, meaning that the data and the queries are partitioned and distributed over several machines.
Native vs. non-native: some graph database systems are in fact a layer on top of another kind of database system in a different shape (for example, a relational database or a document store) or query engine (for example, Apache Spark). Other graph database systems are directly implemented from scratch to efficiently support graphs.
In fact, these dimensions are not specific to graph database systems, and also apply to other shapes (except, of course, that the competing data models will be other models than those two).
Graph data models
Labeled property graphs
The basic ingredients of a graph, from a mathematical perspective, are nodes and edges.
Mathematically, a directed graph $G = (N,E)$ is given by a set $N$ of nodes (also called vertices), and a set $E \subseteq N \times N$ of directed edges (also called arcs, or arrows). Given these, it is possible to visually represent the graph as follows (note that in this case, the graph is planar in the sense that edges do not cross, but this does not hold in general):
Graphs can also be undirected, in which case the edges have no direction. Then formally $E \subseteq \{ P \in \mathcal{P}(N) \mid \text{card}(P) = 2\}$. An alternative modeling is to keep $E \subseteq N \times N$, but with the additional constraint that all edges go in both directions:
$$\forall n, m \in N, (n, m) \in E \Rightarrow (m, n) \in E$$
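This symmetry constraint is easy to check mechanically. A small sketch, modeling an undirected graph as a set of ordered pairs in which every edge appears in both directions (the edge data is illustrative, not from the text):

```python
# Check the constraint: for every (n, m) in E, (m, n) must also be in E.

E = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")}

def is_undirected(edges):
    """True iff every edge also appears in the reverse direction."""
    return all((m, n) in edges for (n, m) in edges)
```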
Such is the math at the logical level, and it is already powerful enough to fill Bachelor's students with joy over many years of algorithms and problem solving (graph coloring, constraints on planar graphs, max-flow/min-cut, Dijkstra's algorithm, etc.). In fact, one of the benefits of graph database systems (or at least some of them) is that they are well suited for implementing and running such algorithms.
Computer scientists need to go one step further and also design how to store graphs physically. One way of doing so is to create an associative array mapping each node to the list of nodes that it connects to via an edge (adjacency lists):
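Such an associative array can be sketched as a plain Python dictionary (the four-node graph here is hypothetical, for illustration):

```python
# Adjacency lists: each node maps to the list of nodes it connects to.

adjacency = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": [],  # isolated nodes simply map to an empty list
}

# The out-degree of a node is the length of its list. Reverse
# traversal (who points to "a"?), by contrast, requires scanning
# the whole structure.
out_degree = {node: len(targets) for node, targets in adjacency.items()}
incoming_to_a = [n for n, targets in adjacency.items() if "a" in targets]
```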
Another storage form is the adjacency matrix: each row and each column represents a node, and a 0 or a 1 indicates the absence or presence of an edge between the row node and the column node:
Yet another way is with nodes on the rows and edges on the columns, with a −1 (origin node) and a 1 (destination node) in each column and 0s otherwise:
The latter way generalizes even to so-called hypergraphs, where each edge can have more than one origin node and more than one destination node.
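The two matrix representations can be built side by side for the same small graph (a three-node cycle a→b, b→c, c→a, chosen here for illustration):

```python
# The cycle a->b, b->c, c->a in both matrix forms.
# Nodes are ordered a, b, c; edges are ordered e0, e1, e2.

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]

# Adjacency matrix: rows and columns are nodes, 1 marks an edge.
adj = [[0] * len(nodes) for _ in nodes]
for (src, dst) in edges:
    adj[nodes.index(src)][nodes.index(dst)] = 1

# Incidence matrix: rows are nodes, columns are edges;
# -1 marks the origin node, 1 the destination node.
inc = [[0] * len(edges) for _ in nodes]
for j, (src, dst) in enumerate(edges):
    inc[nodes.index(src)][j] = -1
    inc[nodes.index(dst)][j] = 1
```

In the incidence matrix, every column sums to 0 (one origin, one destination); a hypergraph edge would simply place more than one −1 or 1 in its column.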
Now, this does not quite work for us, because labeled property graphs enhance mathematical graphs with extra ingredients: properties, and labels:
Labels are some sort of "tags", in the form of a string, that can be attached to a node or an edge. For example, a node may be tagged as a "Person" and another as a "Course", and the edge between them can be tagged as "Attends". A small graph with these nodes and edge indicates that a person is attending a course.
Labels can be repeated; for example, we could have a graph with 23,000 Person nodes, 5,000 Course nodes, and 50,000 Attends edges.
Now, how would we then store information about the persons and courses? Labels are surely not an option, as it would make it cumbersome to know which label is what. Instead, we use properties. Each node and each edge can be associated with a map from strings to values, which represents its properties. You can think of the properties associated with each node or edge as a JSON object, or as a single record in a relational table or in a DataFrame (if nested).
This is a visual of a single node with some labels and its properties:
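Such a node, with its labels and its property map, can also be sketched as a plain Python structure (the labels and values below are illustrative, not taken from the figure):

```python
# A labeled property graph node: a set of string labels plus a map
# from property names to values.

node = {
    "labels": {"Person", "Student"},
    "properties": {
        "name": "Ada",
        "age": 36,
        "nicknames": ["countess"],  # arrays are allowed as property values
    },
}
```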
In the case of neo4j, many property types are supported and we list a few here together with their JSound/JSONiq/XML Schema equivalent:
| Neo4j type | JSONiq/JSound/XML Schema type |
|---|---|
| STRING | string |
| BOOLEAN | boolean |
| INTEGER | long |
| FLOAT | double |
| NULL | null |
| DATE | date |
| TIME | time |
| DATETIME | dateTime |
| ZONED DATETIME | dateTimeStamp |
| ZONED TIME | time (time zone mandatory) |
| LOCAL DATETIME | dateTime (no time zone) |
| LOCAL TIME | time (no time zone) |
| DURATION | duration |
| LIST | array |
Neo4j also supports an additional type (POINT) for geographic locations (GIS), which has no JSound/JSONiq equivalent. We will see that objects can appear in Neo4j’s query language (the type is called MAP), but objects cannot be used as the properties of a node or edge: the only way to “break the first normal form” in properties is via arrays.
Maybe the reader, at this point, starts seeing a pattern, or might even have already figured out how to "convert" a relational table to a labeled property graph: labels can be seen as table names, nodes as records, and properties as the attribute values of the records:
This shows that relational tables can be physically stored as labeled property graphs. Of course, this does not work the other way round: given a graph, it will often not be possible to convert it "back" to a table in this specific way.
If there are foreign keys pointing to primary keys, these can be represented with edges in the graph, like so:
Here ID is the primary key and Child the foreign key, and the table points to itself to keep the example concise.
In the presence of multiple tables, the corresponding nodes are just "merged" into a bigger graph (no need to create any partitions, as the labels do that job), and edges are added for foreign keys from some table pointing to primary keys of some other table.
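The conversion can be sketched in a few lines of Python. The table below is hypothetical, following the example above: a self-referencing table whose ID is the primary key and whose Child column is the foreign key.

```python
# Converting a relational table into a labeled property graph:
# rows become nodes labeled with the table name, attribute values
# become properties, and foreign keys become edges.

rows = [
    {"ID": 1, "Name": "Anna", "Child": 2},
    {"ID": 2, "Name": "Ben", "Child": None},
]

nodes = {}
edges = []
for row in rows:
    # One node per row, labeled with the table name.
    nodes[row["ID"]] = {"label": "Person",
                        "properties": {"Name": row["Name"]}}
for row in rows:
    # One edge per non-null foreign key, pointing at the row
    # whose primary key it references.
    if row["Child"] is not None:
        edges.append((row["ID"], "Child", row["Child"]))
```

With multiple tables, the same loops run once per table; the label on each node records which table it came from, so no partitioning is needed.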
Triple stores
Triple stores are based on a different and simpler model, which views the graph as nodes and edges that all have labels, but without any properties.
The graph is then represented as a list of edges, where each edge is a triple with the label of the origin node (called the subject), the label of the edge (called the property), and the label of the destination node (called the object).
Labels can be:
URIs, which look just like the namespaces we have seen before in XML and like URLs in REST APIs.
Literals, that is, atomic values: strings, integers, dates, etc (but not structured values such as objects or arrays). Literals are only allowed as objects.
Absent, in which case the node is called a blank node. Blank nodes are only allowed as subjects or objects, but not as properties.
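A triple store can thus be sketched as nothing more than a list of triples, with strings standing in for URIs and literals (the data and the `ex:` prefix below are illustrative):

```python
# A triple store: a list of (subject, property, object) triples.
# Literals (like the name string) appear only in object position.

triples = [
    ("ex:einstein", "ex:attends", "ex:bigdata"),
    ("ex:einstein", "ex:name", "Albert Einstein"),  # literal object
    ("ex:bigdata", "ex:title", "Big Data"),
]

def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]
```

This wildcard lookup is, in spirit, the basic access pattern that triple-store query languages such as SPARQL build upon.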
Querying graph data
We will now have a look at query languages for querying graphs, with a focus on Cypher, which is neo4j’s query language. Other languages include PGQL, SPARQL and Gremlin.
At the core of Cypher (and some of the other languages for graph data) lies graph patterns.
For example, let us consider the following graph:
and the following pattern that we are looking for (a dashed edge followed by a full edge):
Here is one instance of this pattern being recognized in the previous graph, which is called a match:
and this is another match (and there are several further matches, which we leave for the reader to find):
Now, let us look at the syntax for expressing a pattern. Of course, edge labels are not dashes and full lines, but have string names; let us assume A for a dashed edge and B for a solid-line edge. Then, the previous pattern can be expressed with this nice ASCII art:
()-[:A]->()-[:B]->()
where the label of an edge is given after the colon character inside square brackets, and nodes are denoted with regular parentheses.
Patterns expressed in this way are used in a MATCH clause, like so:
MATCH (alpha)-[:A]->(beta)-[:B]->(gamma)
RETURN alpha, beta, gamma
Inside the parentheses, we inserted variable names, which can be referred to in the RETURN clause. For each match, a binding for the three variables alpha, beta, and gamma is created, and the RETURN clause is evaluated for each one of these bindings (does it ring a bell with JSONiq's FLWOR expressions? This is indeed the same logic, with tuple streams, as we will shortly see).
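To make the binding logic tangible, here is a naive Python evaluation of the pattern ()-[:A]->()-[:B]->() over a list of labeled edges, producing one binding per match (the toy graph is hypothetical, and real engines are of course far more sophisticated than this nested loop):

```python
# Naive evaluation of (alpha)-[:A]->(beta)-[:B]->(gamma):
# join the edge list with itself on the shared middle node.

edges = [
    ("n1", "A", "n2"),
    ("n2", "B", "n3"),
    ("n2", "B", "n4"),
]

bindings = []
for (a, label1, b) in edges:
    for (b2, label2, c) in edges:
        # An A-edge followed by a B-edge through the same middle node.
        if label1 == "A" and label2 == "B" and b == b2:
            bindings.append({"alpha": a, "beta": b, "gamma": c})
```

Each dictionary in `bindings` plays the role of one tuple in the tuple stream that the RETURN clause consumes.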
One example of binding is visually given here:
The query above returns an output with three columns (alpha, beta and gamma) which are of type “node”. In this respect, the syntax of the RETURN clause is closer to that of the SELECT clause in SQL than the return clause in JSONiq.
The pattern can also filter nodes based on their labels; for example, with the query below, the variable beta will only be bound to nodes that have the label "yellow" (leading to fewer matches):
MATCH (alpha)-[:A]->(beta:yellow)-[:B]->(gamma)
RETURN alpha, beta, gamma
It is also possible to filter on node or edge properties (in a way that will probably remind the reader of MongoDB selections in the JavaScript API):
MATCH (alpha)-[:A]->(beta { name: 'Einstein' })-[:B]->(gamma)
RETURN alpha, beta, gamma
It is also possible to filter on both a label and a property, like so:
MATCH (alpha)-[:A]->(beta)-[:B]->(gamma:blue { name: 'ETH' })
RETURN alpha, beta, gamma
Reverse edges are also possible in the pattern syntax:
MATCH (alpha)-[:A]->(beta)-[:B]->(gamma)<-[:B]-(delta)
RETURN alpha, beta, gamma, delta
It is also possible to reuse a variable in the pattern, for example to look for a cycle:
MATCH (alpha)-[:A]->(beta)-[:B]->(gamma)<-[:B]-(alpha)
RETURN alpha, beta, gamma
Finally, it is also possible to match longer paths within a specified length interval:
MATCH (alpha)-[*1..4]->(beta)<-[:B]-(alpha)
RETURN alpha, beta