263-3010-00: Big Data
Section 12
Querying Denormalized Data
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 12/07/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny's lecture. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Motivation¶
Where we are¶
In several previous chapters, we studied the storage and processing of denormalized data: the syntax, the models, validation, and data frames.
We also looked at document stores, which are database systems that manage denormalized data not as a data lake, but as ETL-based collections, with an optimized storage format hidden from the user and additional managed features such as indices that make queries faster, as well as document-level atomicity.
If we now look at the stack we have built for denormalized data, both as data lakes and as managed database systems, it is not fully satisfactory: an important component is still missing, namely the query language. With what we have covered so far, users are left with two options to handle denormalized datasets:
They can use an API within an imperative host language (e.g., Pandas in Python, or the MongoDB API in JavaScript, or the Spark RDD API in Java or Scala).
Or they can push SQL, including ad-hoc extensions to support nestedness, to its limits.
APIs are unsatisfactory for complex analytics use cases. They are very convenient and suitable for data engineers who implement further data management layers on top of these APIs, but they are not suitable for end users who want to run queries to analyse data.
There is agreement in the database community that SQL is the more satisfactory choice when the data is flat and homogeneous (relational tables). Take the following query, for example:
SELECT foo
FROM input
WHERE bar = "foobar"
which is much simpler to write than the following lower-level equivalent in APIs. With Spark RDDs:
rdd1 = sc.textFile("hdfs:///input.json")
rdd2 = rdd1.map(line => parseJSON(line))
rdd3 = rdd2.filter(obj => obj.bar == "foobar")
rdd4 = rdd3.map(obj => obj.foo)
rdd4.saveAsTextFile("hdfs:///output.json")
With the Spark DataFrame API:
df1 = spark.read.json("hdfs:///input.json")
df2 = df1.filter(df1['bar'] == "foobar")
df3 = df2.select(df2['foo'])
df3.show()
Even when nesting SQL in a host language, additional logic is still needed to access the collection:
df1 = spark.read.json("hdfs:///input.json")
df1.createOrReplaceTempView("input")
df2 = spark.sql("""SELECT foo
                   FROM input
                   WHERE bar = 'foobar'""")
df2.show()
SQL, possibly extended with a few dots, lateral view syntax and explode-like functions, works nicely for the simplest use cases. But as soon as more complex functionality is needed, e.g., the dataset is nested up to a depth of 10, the user would like to denormalize a dataset from relational tables into a single, nested collection, or the user would like to explore and discover a heterogeneous dataset, this approach becomes intractable. At best, it leads to gigantic and hard-to-read SQL queries. At worst, there is no way to express the use case in SQL at all. In both cases, the user ends up writing most of the code in an imperative language, invoking the lower-level API or nesting and chaining simple blocks of SQL. A concrete real-world example is the high-energy-physics community, which works with dataframe APIs rather than SQL in spite of its (nested) data being homogeneous.
Here are a few examples of use cases that are simple enough to be manageable in Spark SQL, although they require some effort to be read and understood:
SELECT *
FROM person
LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age
and
SELECT key, values, collect_list(value + 1) AS values_plus_one
FROM nested_data
LATERAL VIEW explode(values) T AS value
GROUP BY key, values
But let us look at another use case that is simple to express: in the GitHub dataset, for each event, what are all the commits by the top-committer within this event?
The corresponding Spark SQL query is long and convoluted. In the language we will study in this chapter for denormalized data, the same query looks like this, much more compact and easier to read:
for $e in $events
let $top-committer := (
  for $c in $e.commits[]
  group by $author := $c.author
  stable order by count($c) descending
  return $author
)[1]
return [
  $e.commits[][$$.author eq $top-committer]
]
This language is called JSONiq and it is tailor-made for denormalized data. It offers a data-independent layer on top of both data lakes and ETL-based, database management systems, similar to what SQL offers for (flat and homogeneous) relational tables.
98% of JSONiq is directly the same as a W3C standard, XQuery, which is a language offering this functionality for XML datasets. This functionality is the fruit of more than 20 years of work by a working group of several dozen people from many different companies, many of them with extensive SQL experience (or themselves SQL editors), who carefully discussed every single corner case, leading to a long, precise and publicly available specification. JSONiq is basically XQuery without XML and with JSON instead, similar to how one could bake a blueberry cake by using a strawberry cake recipe and simply replacing the strawberries with blueberries. This is a reminder that JSON and XML are very similar when it comes to querying, because both models are based on tree structures. JSONiq was born during the working group discussions on how to add support for maps and arrays to the language, and became a standalone language optimized specifically for JSON. XQuery in its latest version supports maps and arrays, and is best suited to environments where XML and JSON co-exist, which is out of scope in this course.
Denormalized data¶
What do we mean by denormalized data? Let us simply recall that it is characterized by two features: nestedness and heterogeneity.
Consider the following example of a JSON Lines dataset (note that the objects are displayed on multiple lines only so that they fit on the printed page; in reality, each of them would be on a single line):
{
"Name" : { "First" : "Albert", "Last" : "Einstein" },
"Countries" : [ "D", "I", "CH", "A", "BE", "US" ]
}
{
"Name" : { "First" : "Srinivasa", "Last" : "Ramanujan" },
"Countries" : [ "IN", "UK" ]
}
{
"Name" : { "First" : "Kurt", "Last" : "G¨odel" },
"Countries" : [ "CZ", "A", "US" ]
}
{
"Name" : { "First" : "John", "Last" : "Nash" },
"Countries" : "US"
}
{
"Name" : { "First" : "Alan", "Last" : "Turing" },
"Countries" : "UK"
}
{
"Name" : { "First" : "Maryam", "Last" : "Mirzakhani" },
"Countries" : [ "IR", "US" ]
}
{
"Name" : "Pythagoras",
"Countries" : [ "GR" ]
}
{
"Name" : { "First" : "Nicolas", "Last" : "Bourbaki" },
"Number" : 9,
"Countries" : null
}
If one loads this dataset into a DataFrame in order to use Spark SQL, the columns "Name" and "Countries" end up typed as plain strings, because the system cannot deal with the fact that they contain a mix of atomic and structured values. What is in fact happening is that the burden of dealing with heterogeneity is pushed up to the end user, who is forced to write a lot of additional code in the host language (Python, Java...) to parse these strings back one by one and decide what to do with them. This, in turn, is likely to force the user to make heavy use of UDFs (User-Defined Functions), which are black boxes containing user code that can be called from SQL. But UDFs are very inefficient compared to native SQL execution: first, they need to be registered and shipped to all nodes in the cluster (which not all distributed processing technologies can do efficiently), and second, the SQL optimizer has no idea what is inside them, which prevents many (otherwise possible) optimizations from kicking in.
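For contrast, here is a small JSONiq sketch of how such heterogeneity can be handled without any UDFs (this is only an illustration; it assumes the dataset above is stored as a JSON Lines file named scientists.json and an engine, such as RumbleDB, that offers the json-file() function used later in this chapter). Object lookup simply skips items that are not objects, and array unboxing skips items that are not arrays, so the mixed "Name" and "Countries" fields require no string parsing:

(: count the citizenships of each person, whether Countries is an array, a string or null :)
for $p in json-file("scientists.json")
let $countries := ($p.Countries[], $p.Countries)[$$ instance of string]
return {
  "last" : ($p.Name.Last, $p.Name)[1],  (: falls back to the plain string name, e.g. "Pythagoras" :)
  "citizenships" : count($countries)
}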
In fact, denormalized datasets should not be seen as "broken tables pushed to their limits", but rather as collections of trees.
The GitHub archive dataset is a good illustration of this: it contains 2,900,000,000 events, each as a JSON document, taking 7.6 TB of space uncompressed. 10% of all the paths (you can think of them as "data frame columns", although for a heterogeneous dataset the data frame view is not very suitable) have mixed types. Furthermore, there are 1,300 different paths in total, although each individual event only uses 100 of them. One could think of fitting this into relational tables or dataframes with 1,300 attributes, but 1,300 is already beyond what many relational database systems can handle reasonably well.
Features of a query language¶
A query language for datasets has three main features.
Declarative¶
First, it is declarative. This means that the users do not focus on how the query is computed, but on what it should return. Thus, the database engine enjoys the flexibility to figure out the most efficient and fastest plan of execution to return the results.
Functional¶
Second, it is functional. This means that the query language is made of composable expressions that nest within each other, like a Lego game. Many, but not all, expressions can be seen as functions that take as input the output of their child expressions and send their output to their parent expression. However, the syntax of a good functional language should look nothing like a simple chain of function calls with parentheses and lambdas everywhere (that would be an API, not a query language; examples of such APIs are the Spark transformation APIs or Pandas): rather, the expression syntax is carefully and naturally designed for ease of writing and reading. In addition to the expressions (typically 20 or 30 different kinds), a rich function library (this time, with actual function-call syntax) completes them into a fully functional language.
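To give a small taste of what this looks like (a JSONiq sketch, ahead of the introduction below), an object constructor can nest inside a for expression, which itself nests inside a function call, with each expression consuming the output of its children:

count(
  for $x in 1 to 10
  where $x mod 2 eq 0
  return { "square" : $x * $x }
)

This query returns 5: the inner for expression produces five objects (one per even number), and count() consumes that sequence.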
Set-based¶
Finally, it is set-based, in the sense that the values taken and returned by expressions are not only single values (scalars), but are large sequences of items (in the case of SQL, an item is a row). In spite of the set-based terminology, set-based languages can still have bag or list semantics, in that they can allow for duplicates and sequences might be ordered on the logical level.
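For instance, in JSONiq (introduced below), which follows this set-based design with list semantics, the query

(1, 1, 3, 2)

returns the sequence 1 1 3 2: duplicates are kept and the order is preserved, rather than being collapsed into the mathematical set {1, 2, 3}.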
Query languages for denormalized data¶
The landscape for denormalized data querying is very different from that of structured, relational data: indeed, for structured data, SQL is undisputed.
For denormalized data though, sadly, the number of languages keeps increasing: the oldest ones are XQuery and JSONiq, but there are now also JMESPath, SpahQL, JSON Query, PartiQL, UnQL, N1QL, ObjectPath, JSONPath, ArangoDB Query Language (AQL), SQL++, GraphQL, MRQL, Asterix Query Language (AQL), RQL. One day, we expect the market to consolidate.
But the good news is that these languages share common features. In this course, we focus on JSONiq for several reasons:
It is fully documented;
Most of its syntax, semantics, function library and type system relies on a W3C standard (XPath/XQuery), meaning that a group of 30+ very smart people with expertise and experience in SQL swept into every corner to define the language;
It has several independent implementations.
Having learned JSONiq, the reader will find it very easy to pick up any of the other languages in the future.
JSONiq as a data calculator¶
The smoothest start with JSONiq is to understand it as a data calculator.
In particular, it can perform arithmetic:
Query: 1 + 1
Result: 2

Query: 3 + 2 + 4
Result: 9
but also comparison and logic:
Query: 2 < 5
Result: true
It is, however, more powerful than a common calculator and supports more complex constructs, for example variable binding:
Query: let $i := 2 return $i + 1
Result: 3
It also supports all JSON values. Any copy-pasted JSON value literally returns itself:
Query: [ 1, 2, 3 ]
Result: [ 1, 2, 3 ]

Query: { "foo" : 1 }
Result: { "foo" : 1 }
Things start to become interesting with object and array navigation, with dots and square brackets:
Query | { "foo" : 1 }.foo |
Result | 1 |
Query | [3, 4, 5][[1]] |
Result | 3 |
Query | { "foo" : [ 3, 4, 5 ] }.foo[[1]] + 3 |
Result | 6 |
Another difference with a calculator is that a query can return multiple items, as a sequence:
Query: { "foo" : [ 3, 4, 5 ] }.foo[]
Result: 3 4 5

Query: 1 to 4
Result: 1 2 3 4

Query: for $i in 3 to 5 return { $i : $i * $i }
Result: { "3" : 9 } { "4" : 16 } { "5" : 25 }

Query: for $i in { "foo" : [ 3, 4, 5 ] }.foo[] return { $i : $i * $i }
Result: { "3" : 9 } { "4" : 16 } { "5" : 25 }
And, unlike a calculator, it can access storage (data lakes, the Web, etc):
Query: keys( for $i in json-file("s3://bucket/myfiles/json/*") return $i )
Result: "foo" "bar"

Query: keys( for $i in parquet-file("s3://bucket/myfiles/parquet") return $i )
Result: "foo" "bar"
The JSONiq Data Model¶
Every expression of the JSONiq “data calculator” returns a sequence of items. Always.
An item can be either an object, an array, an atomic item, or a function item. For the purpose of this course, we ignore function items, but it might interest the reader to know that function items can be used to represent Machine Learning models, as JSONiq can be used for training and testing models as well.
Atomic items can be any of the “core” JSON types: strings, numbers (integers, decimals, doubles...), booleans and nulls, but JSONiq has a much richer type system, which we covered in Chapter 7 in the context of both JSound and XML Schema.
Sequences of items are flat, in that sequences cannot be nested. But they scale massively and can contain billions or trillions of items. The only way to nest lists is to use arrays (which can be recursively nested). If you read the previous chapters, you can think of sequences as being similar to RDDs in Spark, or to JSON lines documents with billions of JSON objects.
Sequences can be homogeneous (e.g., a million integers, or a million JSON objects valid against a flat schema, logically equivalent to a relational table) or heterogeneous (a messy list of various items: objects, arrays, integers, etc.).
One item is logically the same as a sequence of one item. Thus, when we say that 1+1 returns 2, it in fact means that it returns a singleton sequence containing one item, which is the integer 2.
A sequence can also be empty. Caution, the empty sequence is not the same logically as a null item.
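A few small examples make these distinctions concrete (a sketch; the exact serialization of the output may vary slightly across engines):

( 1, (2, 3), 4 )     (: sequences never nest: this is the flat sequence 1, 2, 3, 4 :)
[ 1, [ 2, 3 ], 4 ]   (: arrays do nest: this is a single item, an array containing another array :)
count( () )          (: 0: the empty sequence contains no items :)
count( null )        (: 1: null is an item, unlike the empty sequence :)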
Navigation¶
A good angle to start with JSONiq is to take a large dataset and discover its structure. The online sandbox has examples of this for the GitHub archive dataset, which is continuously growing. Each hour of log can be downloaded from URIs with the pattern
https://data.gharchive.org/2022-11-01-0.json.gz
where you can pick the year, month, date and hour of the day.
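For instance (a sketch, assuming an engine such as RumbleDB whose json-file() function, used later in this chapter, can read the downloaded and decompressed file as JSON Lines), one can already list the top-level keys appearing in one hour of events with:

keys( json-file("2022-11-01-0.json") )

which returns the top-level field names of the GitHub events, such as "id", "type", "actor", "repo", "payload" and "created_at".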
For the purpose of this textbook, we will pick made-up data patterns to illustrate our point. Let us consider the following JSON document, consisting of a single, large JSON object (possibly spanning multiple lines, as is common for single JSON documents). Let us assume it is named file.json.
{
"o": [
{
"a": {
"b": [
{
"c": 1,
"d": "a"
}
]
}
},
{
"a": {
"b": [
{
"c": 1,
"d": "f"
},
{
"c": 2,
"d": "b"
}
]
}
},
{
"a": {
"b": [
{
"c": 4,
"d": "e"
},
{
"c": 8,
"d": "d"
},
{
"c": 3,
"d": "c"
}
]
}
},
{
"a": {
"b": []
}
},
{
"a": {
"b": [
{
"c": 3,
"d": "h"
},
{
"c": 9,
"d": "z"
}
]
}
},
{
"a": {
"b": [
{
"c": 4,
"d": "g"
}
]
}
},
{
"a": {
"b": [
{
"c": 3,
"d": "l"
},
{
"c": 1,
"d": "m"
},
{
"c": 0,
"d": "k"
}
]
}