263-3010-00: Big Data
Section 12
Querying Denormalized Data
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 12/07/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Motivation¶
Where we are¶
In several previous chapters, we studied the storage and processing of denormalized data: the syntax, the models, the validation, data frames.
We also looked at document stores, which are database systems that manage denormalized data not as a data lake, but as ETL-based collections with an optimized storage format hidden from the user and additional managed features such as indices that make queries faster and document-level atomicity.
If we now look at the stack we have built for denormalized data, both as data lakes and as managed database systems, it is not fully satisfactory: an important component is still missing, namely the query language. With what we have covered so far, users are left with two options to handle denormalized datasets:
They can use an API within an imperative host language (e.g., Pandas in Python, or the MongoDB API in JavaScript, or the Spark RDD API in Java or Scala).
Or they can push SQL, including ad-hoc extensions to support nestedness, to its limits.
APIs are unsatisfactory for complex analytics use cases. They are very convenient and suitable for Data Engineers who implement more data management layers on top of these APIs, but they are not suitable for end users who want to run queries to analyse data.
There is agreement in the database community that SQL is the more satisfactory option when the data is flat and homogeneous (relational tables). Take the following query for example:
SELECT foo
FROM input
WHERE bar = "foobar"
which is much simpler to write than the following lower-level equivalent in APIs. With Spark RDDs:
import json

rdd1 = sc.textFile("hdfs:///input.json")
rdd2 = rdd1.map(lambda line: json.loads(line))
rdd3 = rdd2.filter(lambda obj: obj["bar"] == "foobar")
rdd4 = rdd3.map(lambda obj: obj["foo"])
rdd4.saveAsTextFile("hdfs:///output.json")
With the Spark DataFrame API:
df1 = spark.read.json("hdfs:///input.json")
df2 = df1.filter(df1['bar'] == "foobar")
df3 = df2.select(df2['foo'])
df3.show()
Or even if nesting SQL in a host language, there is still additional logic needed to access the collection:
df1 = spark.read.json("hdfs:///input.json")
df1.createOrReplaceTempView("input")
df2 = spark.sql("""SELECT foo
FROM input
WHERE bar = 'foobar'
""")
df2.show()
SQL, possibly extended with a few dots, lateral view syntax and explode-like functions, will work nicely for the most simple use cases. But as soon as more complex functionality is needed, e.g., the dataset is nested up to a depth of 10, or the user would like to denormalize a dataset from relational tables to a single, nested collection, or the user would like to explore and discover a dataset that is heterogeneous, this approach becomes intractable. At best, this leads to gigantic and hard-to-read SQL queries. At worst, there is no way to express the use case in SQL. In both cases, the user ends up writing most of the code in an imperative language, invoking the lower-level API or nesting and chaining simple blocks of SQL. A concrete real-world example is the high-energy-physics community, which works with DataFrame APIs rather than SQL in spite of its (nested) data being homogeneous.
Here are a few examples of use cases that are simple enough to be manageable in Spark SQL, although they require some effort to be read and understood:
SELECT *
FROM person
LATERAL VIEW EXPLODE(ARRAY(30, 60)) tableName AS c_age
LATERAL VIEW EXPLODE(ARRAY(40, 80)) AS d_age
and
SELECT key, values, collect_list(value + 1) AS values_plus_one
FROM nested_data
LATERAL VIEW explode(values) T AS value
GROUP BY key, values
But let us look at another use case that is simple to express: in the GitHub dataset, for each event, what are all the commits by the top-committer within this event?
The corresponding Spark SQL query is considerably longer and harder to read. In the language we will study in this chapter for denormalized data, this is what the query looks like; as you can see, it is much more compact:
for $e in $events
let $top-committer := (
for $c in $e.commits[]
group by $c.author
stable order by count($c) descending
return $c.author)[1]
return [
$e.commits[][$$.author eq $top-committer]
]
This language is called JSONiq and it is tailor-made for denormalized data. It offers a data-independent layer on top of both data lakes and ETL-based, database management systems, similar to what SQL offers for (flat and homogeneous) relational tables.
98% of JSONiq is directly the same as a W3C standard, XQuery, which is a language offering this functionality for XML datasets. This functionality is the fruit of more than 20 years of work by a two-digit-sized working group from many different companies, many of them with extensive SQL experience (or themselves SQL editors), who carefully discussed every single corner case, leading to a long, precise, publicly available specification. JSONiq is basically XQuery with XML replaced by JSON, similar to how one could bake a blueberry cake by taking a strawberry cake recipe and simply replacing the strawberries with blueberries. This is a reminder that JSON and XML are very similar when it comes to querying, because both models are based on tree structures. JSONiq was born during the working group discussions on how to add support for maps and arrays to the language and became a standalone language optimized specifically for JSON. XQuery in its latest version supports maps and arrays, and is best suited to an environment where both XML and JSON co-exist, which is out of scope in this course.
Denormalized data¶
What do we mean by denormalized data? Let us simply recall that it is characterized by two features: nestedness and heterogeneity.
Consider the following example of a JSON Lines dataset (note that the objects are displayed on multiple lines only so it fits on the printed page, in reality they would each be on a single line):
{
"Name" : { "First" : "Albert", "Last" : "Einstein" },
"Countries" : [ "D", "I", "CH", "A", "BE", "US" ]
}
{
"Name" : { "First" : "Srinivasa", "Last" : "Ramanujan" },
"Countries" : [ "IN", "UK" ]
}
{
"Name" : { "First" : "Kurt", "Last" : "G¨odel" },
"Countries" : [ "CZ", "A", "US" ]
}
{
"Name" : { "First" : "John", "Last" : "Nash" },
"Countries" : "US"
}
{
"Name" : { "First" : "Alan", "Last" : "Turing" },
"Countries" : "UK"
}
{
"Name" : { "First" : "Maryam", "Last" : "Mirzakhani" },
"Countries" : [ "IR", "US" ]
}
{
"Name" : "Pythagoras",
"Countries" : [ "GR" ]
}
{
"Name" : { "First" : "Nicolas", "Last" : "Bourbaki" },
"Number" : 9,
"Countries" : null
}
If one wants to put this dataset in a DataFrame in order to use Spark SQL, the columns "Name" and "Countries" end up typed as plain strings, because the system could not deal with the fact that they contain a mix of atomic and structured types. What is in fact happening is that the burden of dealing with heterogeneity is pushed up to the end user, who will be forced to write a lot of additional code in the host language (Python, Java...) to attempt to parse back these strings one by one and decide what to do. This, in turn, is likely to force the user to make heavy use of UDFs (User-Defined Functions), which are black boxes with user code inside that can be called from SQL. But UDFs are very inefficient compared to native SQL execution: first, they need to be registered and shipped to all nodes in the cluster (which not all distributed processing technologies can do efficiently), and second, the SQL optimizer has no idea what is inside them, which prevents many (otherwise possible) optimizations from kicking in.
In fact, denormalized datasets should not be seen as "broken tables pushed to their limits", but rather as collections of trees.
The GitHub archive dataset is a good illustration of this: it contains 2,900,000,000 events, each as a JSON document, taking 7.6 TB of space uncompressed. There are 1,300 distinct paths in total (you can think of them as "data frame columns", although for a heterogeneous dataset, viewing it as a data frame is not very suitable), of which 10% have mixed types, and each event only uses 100 of them. One could think of fitting this into relational tables or dataframes with 1,300 attributes, but 1,300 is already beyond what many relational database systems can handle reasonably well.
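To contrast with the UDF workaround described above, here is a small sketch of our own (not from the lecture) showing how such heterogeneity can be handled directly in JSONiq. We assume the dataset above is stored as a JSON Lines file named scientists.jsonl and read with the json-file function introduced later in this chapter; the query returns the last name when Name is an object, and the raw value otherwise:
for $p in json-file("scientists.jsonl")
return if ($p.Name instance of object)
       then $p.Name.Last
       else $p.Name
This returns "Einstein" "Ramanujan" "Gödel" "Nash" "Turing" "Mirzakhani" "Pythagoras" "Bourbaki", with no host-language glue code or UDFs involved.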
Features of a query language¶
A query language for datasets has three main features.
Declarative¶
First, it is declarative. This means that the users do not focus on how the query is computed, but on what it should return. Thus, the database engine enjoys the flexibility to figure out the most efficient and fastest plan of execution to return the results.
Functional¶
Second, it is functional. This means that the query language is made of composable expressions that nest with each other, like a Lego game. Many, but not all, expressions can be seen as functions that take as input the output of their children expressions, and send their output to their parent expressions. However, the syntax of a good functional language should look nothing like a simple chain of function calls with parentheses and lambdas everywhere (this would then be an API, not a query language; examples of APIs are the Spark transformation APIs or Pandas): rather, expression syntax is carefully and naturally designed for ease of write and read. In complement to expressions (typically 20 or 30 different kinds of expressions), a rich function library (this time, with actual function call syntax) completes the expressions to a fully functional language.
Set-based¶
Finally, it is set-based, in the sense that the values taken and returned by expressions are not only single values (scalars), but are large sequences of items (in the case of SQL, an item is a row). In spite of the set-based terminology, set-based languages can still have bag or list semantics, in that they can allow for duplicates and sequences might be ordered on the logical level.
Query languages for denormalized data¶
The landscape for denormalized data querying is very different from that of structured, relational data: indeed, for structured data, SQL is undisputed.
For denormalized data though, sadly, the number of languages keeps increasing: the oldest ones are XQuery and JSONiq, but there are now also JMESPath, SpahQL, JSON Query, PartiQL, UnQL, N1QL, ObjectPath, JSONPath, ArangoDB Query Language (AQL), SQL++, GraphQL, MRQL, Asterix Query Language (AQL), RQL. One day, we expect the market to consolidate.
But the good news is that these languages share common features. In this course, we focus on JSONiq for several reasons:
It is fully documented;
Most of its syntax, semantics, function library and type system relies on a W3C standard (XPath/XQuery), meaning that a group of 30+ very smart people with expertise and experience on SQL went through every corner case to define the language;
It has several independent implementations.
Having learned JSONiq, it will be very easy for the reader to learn any one of the other languages in the future.
JSONiq as a data calculator¶
The smoothest start with JSONiq is to understand it as a data calculator.
In particular, it can perform arithmetics
Query | 1 + 1 |
Result | 2 |
Query | 3 + 2 + 4 |
Result | 11 |
but also comparison and logic:
Query | 2 < 5 |
Result | true |
It is, however, more powerful than a common calculator and supports more complex constructs, for example variable binding:
Query |
let $i := 2 return $i + 1 |
Result | 3 |
It also supports all JSON values. Any copy-pasted JSON value literally returns itself:
Query | [ 1, 2, 3 ] |
Result | [ 1, 2, 3 ] |
Query | { "foo" : 1 } |
Result | { "foo" : 1 } |
Things start to become interesting with object and array navigation, with dots and square brackets:
Query | { "foo" : 1 }.foo |
Result | 1 |
Query | [3, 4, 5][[1]] |
Result | 3 |
Query | { "foo" : [ 3, 4, 5 ] }.foo[[1]] + 3 |
Result | 6 |
Another difference with a calculator is that a query can return multiple items, as a sequence:
Query |
{ "foo" : [ 3, 4, 5 ] }.foo[] |
Result |
3 4 5 |
Query |
1 to 4 |
Result |
1 2 3 4 |
Query |
for $i in 3 to 5 return { $i : $i * $i } |
Result |
{ "3" : 9 } { "4" : 16 } { "5" : 25 } |
Query |
for $i in { "foo" : [ 3, 4, 5 ] }.foo[] return { $i : $i * $i } |
Result |
{ "3" : 9 } { "4" : 16 } { "5" : 25 } |
And, unlike a calculator, it can access storage (data lakes, the Web, etc):
Query |
keys( for $i in json-file("s3://bucket/myfiles/json/*") return $i ) |
Result |
"foo" "bar" |
Query |
keys( for $i in parquet-file( "s3://bucket/myfiles/parquet" ) return $i ) |
Result |
"foo" "bar" |
The JSONiq Data Model¶
Every expression of the JSONiq “data calculator” returns a sequence of items. Always.
An item can be either an object, an array, an atomic item, or a function item. For the purpose of this course, we ignore function items, but it might interest the reader to know that function items can be used to represent Machine Learning models, as JSONiq can be used for training and testing models as well.
Atomic items can be any of the “core” JSON types: strings, numbers (integers, decimals, doubles...), booleans and nulls, but JSONiq has a much richer type system, which we covered in Chapter 7 in the context of both JSound and XML Schema.
Sequences of items are flat, in that sequences cannot be nested. But they scale massively and can contain billions or trillions of items. The only way to nest lists is to use arrays (which can be recursively nested). If you read the previous chapters, you can think of sequences as being similar to RDDs in Spark, or to JSON lines documents with billions of JSON objects.
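As a quick illustration of flatness (our own example): nesting parentheses in a sequence has no effect, and the empty sequence simply disappears.
Query | (1, (2, 3), (), 4) |
Result | 1 2 3 4 |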
Sequences can be homogeneous (e.g., a million integers, or a million of JSON objects valid against a flat schema, logically equivalent to a relational table) or heterogeneous (a messy list of various items: objects, arrays, integers, etc).
One item is logically the same as a sequence of one item. Thus, when we say that 1+1 returns 2, it in fact means that it returns a singleton sequence with one item, which is the integer 2.
A sequence can also be empty. Caution, the empty sequence is not the same logically as a null item.
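A minimal way to see the difference (our own example) is to count items: the empty sequence contains zero items, while a sequence containing a single null contains one item.
Query | count(()), count((null)) |
Result | 0 1 |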
Navigation¶
A good angle to start with JSONiq is to take a large dataset and discover its structure. The online sandbox has examples of this for the GitHub archive dataset, which is continuously growing. Each hour of log can be downloaded from URIs with the pattern
https://data.gharchive.org/2022-11-01-0.json.gz
where you can pick the year, month, date and hour of the day.
For the purpose of this textbook, we will pick made-up data patterns to illustrate our point. Let us consider the following JSON document, consisting of a single, large JSON object (possibly spanning multiple lines, as is common for single JSON documents). Let us assume it is named file.json.
{
"o": [
{
"a": {
"b": [
{
"c": 1,
"d": "a"
}
]
}
},
{
"a": {
"b": [
{
"c": 1,
"d": "f"
},
{
"c": 2,
"d": "b"
}
]
}
},
{
"a": {
"b": [
{
"c": 4,
"d": "e"
},
{
"c": 8,
"d": "d"
},
{
"c": 3,
"d": "c"
}
]
}
},
{
"a": {
"b": []
}
},
{
"a": {
"b": [
{
"c": 3,
"d": "h"
},
{
"c": 9,
"d": "z"
}
]
}
},
{
"a": {
"b": [
{
"c": 4,
"d": "g"
}
]
}
},
{
"a": {
"b": [
{
"c": 3,
"d": "l"
},
{
"c": 1,
"d": "m"
},
{
"c": 0,
"d": "k"
}
]
}
}
]
}
We can open this document and return its contents.
Query |
json-doc("file.json") |
Result |
{ "o": [ { "a": { "b": [ { "c": 1, "d": "a" } ] } }, { "a": { "b": [ { "c": 1, "d": "f" }, { "c": 2, "d": "b" } ] } }, { "a": { "b": [ { "c": 4, "d": "e" }, { "c": 8, "d": "d" }, { "c": 3, "d": "c" } ] } }, { "a": { "b": [] } }, { "a": { "b": [ { "c": 3, "d": "h" }, { "c": 9, "d": "z" } ] } }, { "a": { "b": [ { "c": 4, "d": "g" } ] } }, { "a": { "b": [ { "c": 3, "d": "l" }, { "c": 1, "d": "m" }, { "c": 0, "d": "k" } ] } } ] } |
We are going to start our dataset exploration with JSON navigation. Navigating semi-structured data is several decades old and was pioneered on XML with XPath. JSON navigation uses similar ideas, but is considerably simpler than XML navigation. The general idea of navigation is that it is possible to “dive” into the nested data with dots and square brackets (originally, these were slashes with XPath) - all in parallel: starting with an original collection of objects (or, possibly, a single document), each step (i.e., for each dot and each pair of square brackets) goes down the nested structure and returns another sequence of nested items.
The experience feels like scanning the entire collection, moving down the nested structure in parallel. Some steps might massively increase the sequence size (i.e., when unboxing arrays); some other steps might on the contrary contract the sequence to a smaller one (i.e., when filtering, or in the presence of heterogeneity when parts of the collection go deeper than others).
Object lookups (dot syntax)¶
It is possible to navigate into objects with dots, similar to object-oriented programming. For example, this is how we can get the value associated with key o in the document (which is a sequence of one object).
Query |
json-doc("file.json").o |
Result |
[ { "a" : { "b" : [ { "c" : 1, "d" : "a" } ] } }, { "a" : { "b" : [ { "c" : 1, "d" : "f" }, { "c" : 2, "d" : "b" } ] } }, { "a" : { "b" : [ { "c" : 4, "d" : "e" }, { "c" : 8, "d" : "d" }, { "c" : 3, "d" : "c" } ] } }, { "a" : { "b" : [ ] } }, { "a" : { "b" : [ { "c" : 3, "d" : "h" } ] } }, { "a" : { "b" : [ { "c" : 4, "d" : "g" } ] } }, { "a" : { "b" : [ { "c" : 3, "d" : "l" } ] } } ] |
This returned an array, more precisely, a sequence of one array item.
Array unboxing (empty square bracket syntax)¶
We can unbox the array, meaning, extract its members as a sequence of seven object items, with empty square brackets, like so:
Query |
json-doc("file.json").o[] |
Result |
{ "a" : { "b" : [ { "c" : 1, "d" : "a" } ] } } { "a" : { "b" : [ { "c" : 1, "d" : "f" }, { "c" : 2, "d" : "b" } ] } } { "a" : { "b" : [ { "c" : 4, "d" : "e" }, { "c" : 8, "d" : "d" }, { "c" : 3, "d" : "c" } ] } } { "a" : { "b" : [ ] } } { "a" : { "b" : [ { "c" : 3, "d" : "h" } ] } } { "a" : { "b" : [ { "c" : 4, "d" : "g" } ] } } { "a" : { "b" : [ { "c" : 3, "d" : "l" } ] } } |
Parallel navigation¶
The dot syntax, in fact, works on sequences, too. It will extract the value associated with a key in every object of the sequence (anything other than an object is ignored and thrown away):
Query |
json-doc("file.json").o[].a |
Result |
{ "b" : [ { "c" : 1, "d" : "a" } ] } { "b" : [ { "c" : 1, "d" : "f" }, { "c" : 2, "d" : "b" } ] } { "b" : [ { "c" : 4, "d" : "e" }, { "c" : 8, "d" : "d" }, { "c" : 3, "d" : "c" } ] } { "b" : [ ] } { "b" : [ { "c" : 3, "d" : "h" } ] } { "b" : [ { "c" : 4, "d" : "g" } ] } { "b" : [ { "c" : 3, "d" : "l" } ] } |
Array unboxing works on sequences, too. Note how all the members are concatenated to a single, merged sequence, similar to a flatMap in Apache Spark.
Query |
json-doc("file.json").o[].a.b[] |
Result |
{ "c" : 1, "d" : "a" } { "c" : 1, "d" : "f" } { "c" : 2, "d" : "b" } { "c" : 4, "d" : "e" } { "c" : 8, "d" : "d" } { "c" : 3, "d" : "c" } { "c" : 3, "d" : "h" } { "c" : 4, "d" : "g" } { "c" : 3, "d" : "l" } |
Filtering with predicates (simple square bracket syntax)¶
It is possible to filter any sequence with a predicate, where $$ in the predicate refers to the current item being tested. Let us only keep those objects that associate c with 3:
Query |
json-doc("file.json").o[].a.b[][$$.c = 3] |
Result |
{ "c" : 3, "d" : "c" } { "c" : 3, "d" : "h" } { "c" : 3, "d" : "l" } |
It is also possible to access the item at position n in a sequence with this same notation: if what is inside the square brackets is a Boolean, then it acts as a filtering predicate; if it is an integer, it acts as a position:
Query |
json-doc("file.json").o[].a.b[][5] |
Result |
{ "c" : 8, "d" : "d" } |
Array lookup (double square bracket syntax)¶
To access the n-th member of an array, you can use double-square-brackets, like so:
Query |
json-doc("file.json").o[[2]].a |
Result |
{ "b" : [ { "c" : 1, "d" : "f" }, { "c" : 2, "d" : "b" } ] } |
Like dot object navigation and unboxing, double square brackets (array navigation) work with sequences as well. For any array that has fewer members than the requested position, as well as for items that are not arrays, no items are contributed to the output:
Query |
json-doc("file.json").o[].a.b[[2]] |
Result |
{ "c" : 2, "d" : "b" } { "c" : 8, "d" : "d" } |
A common pitfall: Array lookup vs. Sequence predicates¶
Do not confuse sequence positions (single square brackets) with array positions (double square brackets)! The difference is easy to see on a simple example involving a sequence of two arrays with two members each:
Query |
([1, 2], [3, 4])[2] |
Result |
[ 3, 4 ] |
Query |
([1, 2], [3, 4])[[2]] |
Result |
2 4 |
Schema discovery¶
We now move on to simple querying functionality related to discovering datasets with an unknown structure.
Collections¶
While there exist files that contain a single JSON document (or a single XML document), many datasets are in fact found in the form of large collections of smaller objects (as in document stores).
Such collections are accessed with a function call together with a name or (if reading from a data lake) a path. The name of the function can vary and in this chapter we will just use the W3C-standard collection function. In RumbleDB, a JSON Lines dataset is accessed with the function json-file, in a similar way.
Query |
collection( "https://www.rumbledb.org/samples/git-archive.jsonl" ) |
Result |
{ "id" : "7045118886", "type" : "PushEvent", ... { "id" : "7045118891", "type" : "PushEvent", ... { "id" : "7045118892", "type" : "PullRequestEvent", ... { "id" : "7045118894", "type" : "PushEvent", ... { "id" : "7045118895", "type" : "WatchEvent", ... { "id" : "7045118896", "type" : "PushEvent", ... { "id" : "7045118899", "type" : "GollumEvent", ... { "id" : "7045118900", "type" : "PushEvent", ... { "id" : "7045118901", "type" : "PullRequestEvent", ... { "id" : "7045118904", "type" : "PushEvent", ... ... |
It is a good idea to look at the first object of a collection to get a rough idea of what the layout looks like (although there is always the risk of heterogeneity, and there is no guarantee all objects look the same):
Query |
collection( "https://www.rumbledb.org/samples/git-archive.jsonl" )[1] |
Result |
{ "id" : "7045118886", "type" : "PushEvent", ... |
One can also look at the top N objects using the position function in a predicate, which returns the position in the sequence of the current item being tested by the predicate (similar to the LIMIT clause in SQL):
Query |
collection( "https://www.rumbledb.org/samples/git-archive.jsonl" )[position() le 5] |
Result |
{ "id" : "7045118886", "type" : "PushEvent", ... { "id" : "7045118891", "type" : "PushEvent", ... { "id" : "7045118892", "type" : "PullRequestEvent", ... { "id" : "7045118894", "type" : "PushEvent", ... { "id" : "7045118895", "type" : "WatchEvent", ... ... |
Getting all top-level keys¶
The keys function retrieves all keys. It can be called on the entire sequence of objects and will return all unique keys found (at the top level) in that collection:
Query |
keys(collection( "https://www.rumbledb.org/samples/git-archive.jsonl" )) |
Result |
"repo" "org" "actor" "public" "type" "created_at" "id" "payload" |
Getting unique values associated with a key¶
With dot object lookup, we can look at all the values associated with a key like so:
Query |
collection( "https://www.rumbledb.org/samples/git-archive.jsonl" ).type |
Result |
"PushEvent" "PushEvent" "PullRequestEvent" "PushEvent" "WatchEvent" "PushEvent" "GollumEvent" "PushEvent" "PullRequestEvent" ... |
With distinct-values, it is then possible to eliminate duplicates and look at unique values:
Query |
distinct-values(collection( "https://www.rumbledb.org/samples/git-archive.jsonl" ).type) |
Result |
"PullRequestEvent" "MemberEvent" "PushEvent" "IssuesEvent" "PublicEvent" "CommitCommentEvent" "ReleaseEvent" "IssueCommentEvent" "ForkEvent" "GollumEvent" "WatchEvent" "PullRequestReviewCommentEvent" "CreateEvent" "DeleteEvent" ... |
Aggregations¶
Aggregations can be made on entire sequences with a single function call (this would be like a SQL aggregation over the entire table, without any grouping key). The five basic functions are count, sum, avg, min, max. Obviously, the last four require numeric values and will otherwise throw an error.
Query |
count(distinct-values(collection( "https://www.rumbledb.org/samples/git-archive.jsonl" ).type)) |
Result |
3597 |
Query |
count(collection( "https://www.rumbledb.org/samples/git-archive.jsonl" )) |
Result |
36577 |
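The other aggregation functions work the same way; here are two small self-contained examples of our own:
Query | sum(1 to 100) |
Result | 5050 |
Query | avg(1 to 4) |
Result | 2.5 |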
Construction¶
Let us now look into how to construct new values with JSONiq.
Construction of atomic values¶
Atomic values that are core to JSON can be constructed with exactly the same syntax as JSON.
Query |
"foo" |
Result |
"foo" |
Query |
"This is a line\nand this is a new line" |
Result |
"This is a line\nand this is a new line" |
Query |
42 |
Result |
42 |
Query |
3.1415926535897932384626433832795028 |
Result |
3.1415926535897932384626433832795028 |
Query |
-6.022E23 |
Result |
-6.022E23 |
Query |
true |
Result |
true |
Query |
false |
Result |
false |
Query |
null |
Result |
null |
For more specific types, a cast is needed. This works with any of the atomic types we covered in Chapter 7. There are two syntaxes for this:
Query |
nonNegativeInteger("42") |
Result |
42 |
Query |
"42" cast as nonNegativeInteger |
Result |
42 |
Construction of objects and arrays¶
Objects and arrays are constructed with the same syntax as JSON. In fact, one can copy-paste any JSON value, and it will always be recognized as a valid JSONiq query returning that value.
Query |
[ { "foo" : "bar" }, { "bar" : [ 1, 2, true, null ] } ] |
Result |
[ { "foo" : "bar" }, { "bar" : [ 1, 2, true, null ] } ] |
It is also possible to build objects and arrays dynamically (with computed values, not known at compile time), as will be shown shortly when we discuss composability of expressions.
Construction of sequences¶
Sequences can be constructed (and concatenated) using commas:
Query |
[ 2, 3 ], true, "foo", { "f" : 1 } |
Result |
[ 2, 3] true "foo" { "f" : 1 } |
Query |
1 to 10 |
Result |
1 2 3 4 5 6 7 8 9 10 |
Another way to build sequences is with FLWOR expressions, covered a bit further down.
Scalar expressions¶
Sequences of items can have any number of items. A few JSONiq expressions (arithmetic, logic, value comparison...) work on the particular case of sequences with zero or one items.
Arithmetic¶
JSONiq supports basic arithmetic: addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv) and modulo (mod).
If both sides have exactly one item, the semantics is relatively natural.
Query |
1 + 1 |
Result |
2 |
Query |
42 - 10 |
Result |
32 |
Query |
6 * 7 |
Result |
42 |
Query |
42.3 div 7.2 |
Result |
5.875 |
Query |
42 idiv 9 |
Result |
4 |
Query |
42 mod 9 |
Result |
6 |
If the data types are different, then conversions are made automatically:
If one side is a double and the other side a float, then the float is converted to a double and a double is returned.
If one side is a double and the other side a decimal (or integer, etc), then the decimal is converted to a double and a double is returned.
If one side is a float and the other side a decimal (or integer, etc), then the decimal is converted to a float and a float is returned.
Note that an integer and a decimal are not considered different here, because an integer is a special case of decimal. Adding a decimal with an integer returns a decimal.
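For instance (our own example), adding an integer and a decimal returns a decimal:
Query | 2 + 2.5 |
Result | 4.5 |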
The empty sequence enjoys special treatment: if one of the sides (or both) is the empty sequence, then the arithmetic expression returns an empty sequence (no error):
Query |
() + 1 |
Result |
|
Arithmetic expressions also work with dates, times, and durations in a natural fashion (e.g., a date + a duration = a date).
Query |
date("2024-12-06")- date ("2023-11-06") |
Result |
"P396D" |
Query |
date("2024-12-06") + dayTimeDuration("P31D") |
Result |
"2025-01-06" |
If one of the sides (or both) is not a number, a date, a time, a dateTime, a duration, or the empty sequence, or the involved types are inconsistent with each other, then a type error is thrown.
String manipulation¶
String concatenation is done with a double vertical bar:
Query |
"foo" || "bar" |
Result |
"foobar" |
Most other string manipulation primitives are available from the rich JSONiq builtin function library (itself relying on a W3C standard called XPath and XQuery Functions and Operators). The complete list of functions in this standard is available at https://www.w3.org/TR/xpath-functions/, although XML-related functions should be ignored, and JSONiq supports additional JSON-related functions (such as keys(), etc).
Query |
concat("foo", "bar") |
Result |
"foobar" |
Query |
string-join(("foo", "bar", "foobar"), "-") |
Result |
"foo-bar-foobar" |
Query |
substr(("foobar", 4, 3) |
Result |
"bar" |
Query |
string-length("foobar") |
Result |
6 |
Value comparison¶
Sequences of one atomic item can be compared with eq (equal), ne (not equal), le (lower or equal), ge (greater or equal), lt (lower than) and gt (greater than).
Query |
1 + 1 eq 2 |
Result |
true |
Query |
6 * 7 ne 21 * 2 |
Result |
false |
Query |
234 gt 123 |
Result |
true |
If one of the two sides is the empty sequence, then the value comparison expression returns an empty sequence as well.
Query |
() le 2 |
Result |
|
If one of the two sides is null, then the value comparison expression returns null as well.
Query |
null le 2 |
Result |
null |
Logic¶
JSONiq supports the three basic logic expressions and, or, and not. not has the highest precedence, then and, then or.
Query |
1 + 1 eq 2 and (2 + 2 eq 4 or not 100 mod 5 eq 0) |
Result |
true |
JSONiq also supports universal and existential quantification:
Query |
every $i in 1 to 10 satisfies $i gt 0 |
Result |
true |
Query |
some $i in 1 to 10 satisfies $i gt 5 |
Result |
true |
Note that unlike SQL, JSONiq logic expressions are two-valued and return either true or false.
If one of the two sides is not a sequence of a single Boolean item, then implicit conversions are made. This mechanism is called the Effective Boolean Value (EBV). For example, an empty sequence, or a sequence of one empty string, or a sequence of one zero integer, is considered false. A sequence of one non-empty string, or a sequence of one non-zero integer, or a sequence starting with one object (or array) is considered true.
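The builtin boolean function, inherited from the XPath/XQuery function library, makes the EBV explicit; a small illustration of our own:
Query | boolean(""), boolean("foo"), boolean(()) |
Result | false true false |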
General comparison¶
JSONiq has a shortcut for existential quantification on value comparisons. This is called general comparison.
For example, consider this query:
Query |
some $i in (1, 2, 3, 4, 5) satisfies $i eq 1 |
Result |
true |
It can be abbreviated to the shorter:
Query |
(1, 2, 3, 4, 5) = 1 |
Result |
true |
More generally,
Query |
some $i in 1 to 5, $j in 3 to 10 satisfies $i gt $j |
Result |
true |
can be abbreviated to:
Query |
1 to 5 > 3 to 10 |
Result |
true |
In other words, = (resp. !=, <, >, <=, >=) is a shortcut for an existential quantification on both input sequences on the value comparison eq (resp. ne, lt, gt, le, ge).
In particular, errors are thrown for incompatible types, and false is returned if any side is the empty sequence.
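For instance (our own example), a general comparison with the empty sequence returns false, whereas the corresponding value comparison would have returned the empty sequence:
Query | () = 1 |
Result | false |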
General comparison is very convenient when scanning datasets and looking for matching values.
Query |
json-doc("file.json").o[].a.b[].c = 1 |
Result |
true |
Composability¶
JSONiq, as a functional language, is modular. This means that expressions can be combined at will, exactly like one would combine addition, multiplication, etc, at will.
Any expression can appear as the operand of any other expression. Of course, if the output of an expression is not compatible with what the parent expression expects, an error is thrown.
We show below an example and the successive details of the evaluation.
Query |
(1 + (({"b":[{"a": 2+2}]}).b[].a)) to 10 |
Step 1 |
(1 + (({"b":[{"a": 4}]}).b[].a)) to 10 |
Step 2 |
(1 + ([{"a": 4}][].a)) to 10 |
Step 3 |
(1 + ({"a": 4}.a)) to 10 |
Step 4 |
(1 + 4) to 10 |
Step 5 |
5 to 10 |
Result |
5 6 7 8 9 10 |
Query |
{ "attr" : string-length("foobar") "values" : [ for $i in 1 to 10 return long($i) ] } |
Result |
{ "attr" : 6 "values" : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ] } |
Just like arithmetic, where multiplication has precedence over addition, expressions have a precedence order. Precedence can be easily overridden with parentheses. In practice, it is not realistic to know all precedences by heart, so that when in doubt, it is always a good idea to add parentheses to be on the safe side.
We list the expression in increasing order of precedence below. Beware, the low precedence of the comma is a common pitfall.
Precedence (low first): Comma; Data Flow (FLWOR, if-then-else, switch...); Logic; Comparison; String concatenation; Range; Arithmetic; Path expressions; Filter predicates, dynamic function calls; Literals, constructors and variables; Function calls, named function references, inline functions.
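As an illustration of the comma pitfall (our own example), the trailing 0 below is not part of the return clause: because the comma has the lowest precedence, the 0 is concatenated after the result of the entire FLWOR expression.
Query | for $i in 1 to 3 return $i, 0 |
Result | 1 2 3 0 |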
Data flow¶
A few expressions give some control over the data flow by picking the output of this or that expression based on the value of another expression. These expressions look quite close to their counterparts in imperative languages (Python, Java...), but it is important to understand that they have functional semantics here.
This includes conditional expressions. If the expression inside the if returns true (or if its Effective Boolean Value is true), the result of the expression in the then clause is taken, otherwise, the result of the expression in the else clause is taken.
Query |
if(count(json-doc("file.json").o) gt 1000) then "Large file!" else "Small file." |
Result |
"Small file." |
This also includes switch expressions. The expression inside the switch is evaluated and an error is thrown if more than one item is returned. Then, the resulting item is compared for equality with each one of the candidate values. The result of the expression corresponding to the first match is taken, and if there are no matches, the result of the default expression is taken.
Query |
switch(json-doc("file.json").o[[1]].a.b[[1]].c) case 1 return "one" case 2 return "two" default return "other" |
Result |
"one" |
Note that we covered data types and cardinality indicators (*, +, ?) in Chapter 7.
This also includes try-catch expressions. If the expression in the try clause is successfully evaluated, then its results are taken. If there is an error, then the results of the expression in the first catch clause matching the error is taken (* being the joker).
Query |
try { date(json-doc("file.json").o[[1]].a.b[[1]].c) } catch * { "This is not a date!" } |
Result |
"This is not a date!" |
Binding variables with cascades of let clauses¶
Let us go back to this example:
Query |
json-doc("file.json").o[].a.b[].c = 1 |
Result |
true |
We can rewrite it by explicitly binding intermediate variables, like so:
Query |
let $a := json-doc("file.json") let $b := $a.o let $c := $b[] let $d := $c.a let $e := $d.b let $f := $e[] let $g := $f.c return $g = 1 |
Result |
true |
Variables in JSONiq start with a dollar sign. This way of subsequently binding variables to compute intermediate results is typical of functional languages: OCaml, F#, Haskell... have a similar feature. It is important to understand that this is not a variable assignment that would change the value of a variable. This is only a declarative binding.
It is possible to reuse the same variable name, in which case the previous binding is hidden. Again, this is not an assignment or a change of value.
Query |
let $a := json-doc("file.json") let $a := $a.o let $a := $a[] let $a := $a.a let $a := $a.b let $a := $a[] let $a := $a.c return $a = 1 |
Result |
true |
Each variable is visible in all subsequent let clauses, as well as in the final return clause (unless/until it is hidden by another let clause with the same variable name). It is not visible to any parent expressions. In particular, this query returns an error because the last reference to variable $a is not within the scope of the let $a binding:
Query |
(let $a := json-doc("file.json") let $b := $a.o let $c := $b[] let $d := $c.a let $e := $d.b let $f := $e[] let $g := $f.c return $g = 1 ) + $a
FLWOR expressions¶
One of the most important and powerful features of JSONiq is the FLWOR expression. It corresponds to SELECT-FROM-WHERE queries in SQL, however, it is considerably more expressive and generic in several aspects:
In SQL, the clauses must appear in a specific order (SELECT, then FROM, then WHERE, then GROUP BY, then HAVING, then ORDER BY, then OFFSET, then LIMIT), although most are optional. In JSONiq the clauses can appear in any order with the exception of the first and last clause.
JSONiq supports a let clause, which does not exist in SQL. Let clauses make it very convenient to write and organize more complex queries.
In SQL, when iterating over multiple tables in the FROM clause, they “do not see each other”, i.e., the semantics is (logically) that of a Cartesian product. In JSONiq, for clauses (which correspond to FROM clauses in SQL), do see each other, meaning that it is possible to iterate in higher and higher levels of nesting by referring to a previous for variable. This is both easier to write and read than lateral views, and it is also more expressive.
The semantics of FLWOR clauses is simple, clean, and inherently functional; it is based on tuple streams containing variable bindings, which flow from clause to clause. There is no "spooky action at a distance" such as the explode() function, which indirectly causes a duplication of rows in Spark SQL.
Simple dataset¶
For the purpose of illustration, we will use a very simple dataset consisting of two JSON Lines files:
products.json |
{"pid":1, "type" : "tv", "store":1} {"pid":2, "type" : "tv", "store":2} {"pid":3, "type" : "phone", "store":2} {"pid":4, "type" : "tv", "store":3} {"pid":5, "type" : "teapot", "store":2} {"pid":6, "type" : "tv", "store":1} {"pid":7, "type" : "teapot", "store":2} {"pid":8, "type" : "phone", "store":4} |
stores.json |
{ "sid" : 1, "country" : "Switzerland" } { "sid" : 2, "country" : "Germany" } { "sid" : 3, "country" : "United States" } |
Note that the store with sid 4 is intentionally missing from stores.json; this is to show what happens when there are no matches.
For clauses¶
For clauses bind their variable in turn to each item of the provided expression. Here is an example:
Query |
for $x in 1 to 10 return { "number": $x, "square": $x * $x } |
Result |
{ "number" : 1, "square" : 1 } { "number" : 2, "square" : 4 } { "number" : 3, "square" : 9 } { "number" : 4, "square" : 16 } { "number" : 5, "square" : 25 } { "number" : 6, "square" : 36 } { "number" : 7, "square" : 49 } { "number" : 8, "square" : 64 } { "number" : 9, "square" : 81 } { "number" : 10, "square" : 100 } |
In the above query, the variable $x is bound with 1, then with 2, then with 3, etc, and finally with 10. It is always bound with a sequence of exactly one item. It is, however, possible to bind it with an empty sequence if the expression returns no items. This is done with "allowing empty".
Query |
for $x allowing empty in () return count($x) |
Result |
0 |
Note that, without "allowing empty", if the expression in the for clause evaluates to an empty sequence, the variable would not bind to anything at all and the FLWOR expression would simply return an empty sequence.
Query |
for $x in () return count($x) |
Result |
|
Each variable binding is also more generally called a tuple. In the above examples, there is only one variable binding in each tuple ($x), but it is possible to build larger tuples with more clauses. For example, this FLWOR expression involves two for clauses. The tuples after the first for clause and before the second one only bind variable $x (to 1, then to 2, then to 3), but the tuples after the second for clause and before the return clause bind variables $x and $y. There are six tuples in total, because the second for clause expands each incoming tuple to zero, one or more tuples (think of a flatMap transformation in Spark for an analogy).
Query |
for $x in 1 to 3 for $y in 1 to $x return [ $x, $y ] |
Result |
[1, 1] [2, 1] [2, 2] [3, 1] [3, 2] [3, 3] |
Now if we use our small example dataset, we can iterate on all objects, say, products:
Query |
for $product in json-file("products.json") return $product.type |
Result |
"tv" "tv" "phone" "tv" "teapot" "tv" "teapot" "phone" ... |
It can thus be seen that the for clause is akin to the FROM clause in SQL, and the return is akin to the SELECT clause.
Projection in JSONiq can be made with a project() function call, with the keys to keep:
Query |
for $product in json-file("products.json") return project($product, ("type", "store")) |
Result |
{"type" : "tv", "store":1} {"type" : "tv", "store":2} {"type" : "phone", "store":2} {"type" : "tv", "store":3} {"type" : "teapot", "store":2} {"type" : "tv", "store":1} {"type" : "teapot", "store":2} {"type" : "phone", "store":4} |
It is possible to implement a join with a sequence of two for clauses and a predicate (note that newlines in JSONiq are irrelevant, so we spread the for clause on two lines in order to fit the query on this page):
Query |
for $product in json-file("products.json") for $store in json-file("stores.json") [$$.sid eq $product.store] return { "product" : $product.type, "country" : $store.country } |
Result |
{"product" : "tv", "country":"Switzerland"} {"product" : "tv", "country":"Germany"} {"product" : "phone", "country":"Germany"} {"product" : "tv", "country":"United States"} {"product" : "teapot", "country":"Germany"} {"product" : "tv", "country":"Switzerland"} {"product" : "teapot", "country":"Germany"} |
Note that allowing empty can be used to perform a left outer join, i.e., to account for the case when there are no matching records in the second collection:
Query |
for $product in json-file("products.json") for $store allowing empty in json-file("stores.json") [$$.sid eq $product.store] return { "product" : $product.type, "country" : $store.country } |
Result |
{"product" : "tv", "country":"Switzerland"} {"product" : "tv", "country":"Germany"} {"product" : "phone", "country":"Germany"} {"product" : "tv", "country":"United States"} {"product" : "teapot", "country":"Germany"} {"product" : "tv", "country":"Switzerland"} {"product" : "teapot", "country":"Germany"} {"product" : "phone", "country": null} |
In the case of the last product, no matching record in stores.json is found and $store is bound to the empty sequence for that tuple. When constructing the object in the return clause's expression, the empty sequence obtained from $store.country is automatically replaced with a null value (because an object value cannot be empty). But if we add an array constructor around the country, we will notice the empty sequence:
Query |
for $product in json-file("products.json") for $store allowing empty in json-file("stores.json") [$$.sid eq $product.store] return { "product" : $product.type, "country" : [ $store.country ] } |
Result |
{"product" : "tv", "country": [ "Switzerland" ]} {"product" : "tv", "country": [ "Germany" ]} {"product" : "phone", "country": [ "Germany" ]} {"product" : "tv", "country": [ "United States" ]} {"product" : "teapot", "country": [ "Germany" ]} {"product" : "tv", "country": [ "Switzerland" ]} {"product" : "teapot", "country": [ "Germany" ]} {"product" : "phone", "country": [ ]} |
Let clauses¶
As seen before, the let clause can be used to bind a variable with any sequence of items, also more than one. FLWOR expressions with just a cascade of let clauses are quite popular.
Query |
let $x := 2 return $x * $x |
Result |
4 |
However, let clauses can also appear after other clauses, for example, after a for clause. Then, they will bind a sequence of items for each previous binding (tuple), like so:
Query |
for $x in 1 to 10 let $square := $x * $x return { "number": $x, "square": $square } |
Result |
{ "number" : 1, "square" : 1 } { "number" : 2, "square" : 4 } { "number" : 3, "square" : 9 } { "number" : 4, "square" : 16 } { "number" : 5, "square" : 25 } { "number" : 6, "square" : 36 } { "number" : 7, "square" : 49 } { "number" : 8, "square" : 64 } { "number" : 9, "square" : 81 } { "number" : 10, "square" : 100 } |
In the above example, $square is only bound with one item. Here is another example where it is bound with more than one:
Query |
for $x in 1 to 10 let $square-and-cube := ($x * $x, $x * $x * $x) return { "number": $x, "square": $square-and-cube[1], "cube": $square-and-cube[2] } |
Result |
{ "number" : 1, "square" : 1, "cube" : 1 } { "number" : 2, "square" : 4, "cube" : 8 } { "number" : 3, "square" : 9, "cube" : 27 } { "number" : 4, "square" : 16, "cube" : 64 } { "number" : 5, "square" : 25, "cube" : 125 } { "number" : 6, "square" : 36, "cube" : 216 } { "number" : 7, "square" : 49, "cube" : 343 } { "number" : 8, "square" : 64, "cube" : 512 } { "number" : 9, "square" : 81, "cube" : 729 } { "number" : 10, "square" : 100, "cube" : 1000 } |
Note the difference with a for clause:
Query |
for $x in 1 to 10 for $square-or-cube in ($x * $x, $x * $x * $x) return { "number": $x, "square or cube": $square-or-cube } |
Result |
{ "number" : 1, "square or cube" : 1 } { "number" : 1, "square or cube" : 1 } { "number" : 2, "square or cube" : 4 } { "number" : 2, "square or cube" : 8 } { "number" : 3, "square or cube" : 9 } { "number" : 3, "square or cube" : 27 } { "number" : 4, "square or cube" : 16 } { "number" : 4, "square or cube" : 64 } { "number" : 5, "square or cube" : 25 } { "number" : 5, "square or cube" : 125 } { "number" : 6, "square or cube" : 36 } { "number" : 6, "square or cube" : 216 } { "number" : 7, "square or cube" : 49 } { "number" : 7, "square or cube" : 343 } { "number" : 8, "square or cube" : 64 } { "number" : 8, "square or cube" : 512 } { "number" : 9, "square or cube" : 81 } { "number" : 9, "square or cube" : 729 } { "number" : 10, "square or cube" : 100 } { "number" : 10, "square or cube" : 1000 } |
A let clause outputs exactly one outgoing tuple for each incoming tuple (think of a map transformation in Spark). Unlike the for clause, it does not modify the number of tuples.
Let us now showcase the use of a let clause with our dataset.
Query |
for $product in json-file("products.json") let $type := $product.type return $type |
Result |
"tv" "tv" "phone" "tv" "teapot" "tv" "teapot" "phone" ... |
Let clauses also allow for joining the two datasets and denormalizing them by nesting the stores into the products. This would be considerably more difficult to do with (Spark) SQL, even with extensions. The results are pretty-printed for ease of reading.
Query |
for $store in json-file("stores.json") let $product := json-file("products.json") [$store.sid eq $$.store] return { "store" : $store.country, "available products" : [ distinct-values($product.type) ] } |
Result |
{ "store" : "Germany", "available products" : [ "tv", "teapot", "phone" ] } { "store" : "Switzerland", "available products" : [ "tv" ] } { "store" : "United States", "available products" : [ "tv" ] } |
Where clauses¶
Where clauses are used to filter variable bindings (tuples) based on a predicate on these variables. They are the equivalent to a WHERE clause in SQL.
This is a simple example of its use in conjunction with a for clause:
Query |
for $x in 1 to 10 where $x gt 7 return { "number": $x, "square": $x * $x } |
Result |
{ "number" : 8, "square" : 64 } { "number" : 9, "square" : 81 } { "number" : 10, "square" : 100 } |
A where clause can appear anywhere in a FLWOR expression, except that it cannot be the first clause (always for or let) or the last clause (always return).
Query |
for $x in 1 to 10 let $square := $x * $x where $square gt 60 for $y in $square to $square + 1 return { "number": $x, "y": $y } |
Result |
{ "number" : 8, "y" : 64 } { "number" : 8, "y" : 65 } { "number" : 9, "y" : 81 } { "number" : 9, "y" : 82 } { "number" : 10, "y" : 100 } { "number" : 10, "y" : 101 } |
A where clause always outputs a subset (or all) of its incoming tuples, without any alteration. In the case that the predicate always evaluates to true, it forwards all tuples, as if there had been no where clause at all. In the case that the predicate always evaluates to false, it outputs no tuple and the FLWOR expression will then return the empty sequence, with no need to further evaluate any of the remaining clauses.
Here is another example of use of the where clause with our datasets:
Query |
for $product in json-file("products.json") let $store := json-file("stores.json") [$$.sid eq $product.store] where $store.country = "Germany" return $product.type |
Result |
"tv" "phone" "teapot" "teapot" |
Order by clauses¶
Order by clauses are used to reorganize the order of the tuples, but without altering them. They are the same as ORDER BY clauses in SQL.
Query |
for $x in -2 to 2 let $square := $x * $x order by $square return { "number": $x, "square": $square } |
Result |
{ "number" : 0, "square" : 0 } { "number" :-1, "square" : 1 } { "number" : 1, "square" : 1 } { "number" :-2, "square" : 4 } { "number" : 2, "square" : 4 } |
It is also possible, like in SQL, to specify an ascending or a descending order. By default, the order is ascending.
Query |
for $x in -2 to 2 let $square := $x * $x order by $square ascending return { "number": $x, "square": $square } |
Result |
{ "number" : 0, "square" : 0 } { "number" :-1, "square" : 1 } { "number" : 1, "square" : 1 } { "number" :-2, "square" : 4 } { "number" : 2, "square" : 4 } |
Query |
for $x in -2 to 2 let $square := $x * $x order by $square descending return { "number": $x, "square": $square } |
Result |
{ "number" : 2, "square" : 4 } { "number" :-2, "square" : 4 } { "number" : 1, "square" : 1 } { "number" :-1, "square" : 1 } { "number" : 0, "square" : 0 } |
In case of ties between tuples, the order is arbitrary. But it is possible to sort on another variable in case there is a tie with the first one (compound sorting keys):
Query |
for $x in -2 to 2 let $square := $x * $x order by $square descending, $x ascending return { "number": $x, "square": $square } |
Result |
{ "number" :-2, "square" : 4 } { "number" : 2, "square" : 4 } { "number" :-1, "square" : 1 } { "number" : 1, "square" : 1 } { "number" : 0, "square" : 0 } |
It is possible to control what to do with empty sequences: they can be considered smallest or greatest.
Query |
for $x in 1 to 5 let $y := $x[$$ mod 2 = 1] order by $y ascending empty greatest return [ $y ] |
Result |
[ 1 ] [ 3 ] [ 5 ] [ ] [ ] |
Query |
for $x in 1 to 5 let $y := $x[$$ mod 2 = 1] order by $y ascending empty least return [ $y ] |
Result |
[ ] [ ] [ 1 ] [ 3 ] [ 5 ] |
Here is another example of use of the order by clause with our datasets:
Query |
for $product in json-file("products.json") let $store := json-file("stores.json") [$$.sid eq $product.store] group by $t := $product.type order by count($store) descending, string-length($t) ascending return $t |
Result |
"tv" "teapot" "phone" |
Group by clauses¶
Group by clauses organize tuples in groups based on matching keys, and then output only one tuple for each group, aggregating other variables (count, sum, max, min...). This is similar to GROUP BY clauses in SQL.
Query |
for $x in 1 to 5 let $y := $x mod 2 group by $y return { "grouping key" : $y, "count of x" : count($x) } |
Result |
{ "grouping key" : 0, "count of x" : 2 } { "grouping key" : 1, "count of x" : 3 } |
However, JSONiq's group by clauses are more powerful and expressive than SQL GROUP BY clauses: indeed, it is also possible to opt out of aggregating other (non-grouping-key) variables. Then, for a non-aggregated variable, the sequence of all its values within a group will be rebound to this same variable as a single binding in the outgoing tuple. It is thus possible to write many more queries than SQL would allow, which is one of the reasons why a language like JSONiq should be preferred for nested datasets.
Query |
for $x in 1 to 5 let $y := $x mod 2 group by $y return { "grouping key" : $y, "grouped x values" : [ $x ], } |
Result |
{ "grouping key" : 0, "grouped x values" : [ 2, 4 ] } { "grouping key" : 1, "grouped x values" : [ 1, 3, 5 ] } |
Finally, here is an example of use of a group by clause with our example dataset.
Query |
for $product in json-file("products.json") group by $sid := $product.sid order by $sid let $store := json-file("stores.json") [$$.sid = $sid] return {| $store, { "products" : [ distinct-values($product.type) ] } |} |
Result |
{ "sid" : 1, "country" : "Switzerland", "products" [ "tv" ] } { "sid" : 2, "country" : "Germany", "products" : [ "tv", "phone", "teapot" ] } { "sid" : 3, "country" : "United States", "products" : [ "tv" ] } |
Tuple stream visualization¶
Although it is not necessary for writing simple FLWOR expressions, a visualization can be helpful in order to understand how more complex FLWOR expressions are evaluated. We give below an example of how tuple streams within a FLWOR expression can be seen as tables (or DataFrames) in which each bound variable is represented in a column:
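For instance (this rendering is our own, for illustration), in the expression for $x in 1 to 3 let $y := $x * $x return { "n" : $x, "sq" : $y }, the tuple stream flowing out of the let clause and into the return clause can be pictured as a table with one column per bound variable:
$x | $y |
1 | 1 |
2 | 4 |
3 | 9 |
Each row is one tuple, and the return clause is evaluated once for each tuple, producing three objects.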
Note, however, that these tuple streams are not sequences of items, because clauses are not expressions; tuple streams are only a formal description of the semantics of FLWOR expressions and their visualization as DataFrames is pedagogical. Having said that, the reader may have guessed that tuple streams can be internally implemented as Spark DataFrames, and in fact, RumbleDB does just that (but it hides it from the user).
Types¶
The type system in JSONiq is consistent with what was covered in Chapter 7.
Variable types¶
It is possible to annotate any FLWOR variable with an expected type as shown below.
Query |
let $path as string := "git-archive-big.json" let $events as object* := json-file($path) let $actors as object* := $events.actor let $logins as string* := $actors.login let $distinct-logins as string* := distinct-values($logins) let $count as integer := count($distinct-logins) return $count |
Result |
53744 |
Since every value in JSONiq is a sequence of items, a sequence type consists of two parts: an item type, and a cardinality.
Item types can be any of the builtin atomic types (JSound) covered in Chapter 7, as well as "object", "array" and the most generic item type, "item". Cardinality can be one of the following four:
Any number of items (suffix *); for example object*
One or more items (suffix +); for example array+
Zero or one item (suffix ?); for example boolean?
Exactly one item (no suffix); for example integer
If it is detected, at runtime, that a sequence of items is bound to a variable but does not match the expected sequence type, either because one of the items does not match the expected item type, or because the cardinality of the sequence does not match the expected cardinality, then a type error is thrown and the query is not evaluated.
It is also possible to annotate variables in for clauses, however the cardinality of the sequence type of a for variable will logically be either one (no suffix), or zero-or-one (?) in the case that "allowing empty" is specified.
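For instance (our own example), annotating a for variable looks like this:
Query | for $x as integer in 1 to 3 return $x * 2 |
Result | 2 4 6 |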
Type expressions¶
JSONiq has a few expressions related to types.
An instance of expression checks whether a sequence matches a sequence type, and returns true or false. This is similar to the homonymous operator in Java.
Query |
(3.14, "foo") instance of integer*, ([1], [ 2, 3 ]) instance of array+ |
Result |
false true |
A cast as expression casts single items to an expected item type.
Query |
"3.14" cast as decimal |
Result |
3.14 |
A cast as expression can also deal with an empty sequence, and supports the zero-or-more cardinality in the expected resulting type. But it will throw an error if the sequence has more than one item: you need to use a FLWOR expression if you want to cast every item in a sequence.
Query |
[1, 2, 3, 4][][$$ > 4] cast as string? |
Result |
|
A castable as expression tests whether a cast would succeed (in which case it returns true) or not (false).
Query |
"3.14" castable as decimal |
Result |
true |
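A typical use (again an illustrative sketch on literal values) is to guard a cast so that non-castable items do not make the query fail with a type error:

for $s in ("3.14", "not a number")
return if ($s castable as decimal) then $s cast as decimal else null

which returns 3.14 null.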
A treat as expression checks whether its input sequence matches an expected sequence type (like a type annotation on a variable); if it does, the input sequence is returned unchanged. If not, an error is raised. This is useful in complex queries and for debugging purposes.
Query |
[ 1, 2, 3, 4][] treat as integer+ |
Result |
1 2 3 4 |
There are also typeswitch expressions. The expression inside the typeswitch is evaluated. Then, the resulting sequence is matched against each of the sequence types in the case clauses. The result of the expression corresponding to the first match is taken, and if there are no matches, the result of the default expression is taken.
Query |
typeswitch(json-file("file.json").o[[1]].a.b[[1]].c) case integer+ return "integer" case string return "string" default return "other" |
Result |
integer |
Types in user-defined functions¶
JSONiq supports user-defined functions. Parameter types can be optionally specified, and a return type can also be optionally specified.
Query |
declare function is-big-data( $threshold as integer, $objects as object* ) as boolean { count($objects) gt $threshold }; is-big-data(1000, json-file("git-archive.json")) |
Result |
true |
The same function can also be declared without any type annotations:
Query |
declare function is-big-data( $threshold, $objects ) { count($objects) gt $threshold }; is-big-data(1000, json-file("git-archive.json")) |
Result |
true |
Validating against a schema¶
It is possible to declare a schema, associating it with a user-defined type, and to validate a sequence of items against this user-defined type.
Query |
declare type local:histogram as { "commits" : "short", "count" : "long" }; validate type local:histogram* { for $event in json-file("git-archive-big.json") group by $nb-commits := (size($event.payload.commits), 0)[1] order by $nb-commits return { "commits" : $nb-commits, "count" : count($event) } } |
Result |
{ "commits" : 0, "count" : 94554 } { "commits" : 1, "count" : 92094 } { "commits" : 2, "count" : 9951 } { "commits" : 3, "count" : 3211 } { "commits" : 4, "count" : 1525 } { "commits" : 5, "count" : 877 } { "commits" : 6, "count" : 688 } { "commits" : 7, "count" : 426 } { "commits" : 8, "count" : 383 } { "commits" : 9, "count" : 259 } { "commits" : 10, "count" : 274 } { "commits" : 11, "count" : 193 } { "commits" : 12, "count" : 146 } |
If the results of a JSONiq query have been validated against a JSound schema, then under specific conditions (the same covered in Chapter 7 for a schema to be DataFrame-compatible) it is possible to save the output of the query in formats other than JSON, such as Parquet, Avro, or (if there is no nestedness) CSV.
Architecture of a query engine¶
We now cover the physical architecture and implementation of a query engine such as RumbleDB.
Static phase¶
When a query is received by an engine, it is text that needs to be parsed. The theory and techniques for doing this (context-free grammars, EBNF...) are covered in compiler design courses. The output of this is a tree structure called an Abstract Syntax Tree.
An Abstract Syntax Tree, even though it already has the structure of a tree, is tightly tied to the original syntax. Thus, it needs to be converted into a more abstract Intermediate Representation called an expression tree. Every node in this tree corresponds to either an expression or a clause in the JSONiq language, making the design modular.
At this point, static typing takes place: the engine infers the static type of each expression, that is, the most specific type that can be predicted for its value at runtime (without actually running the program). User-specified types are also taken into account in this step. Inferring static types facilitates the optimization step.
Engines like RumbleDB perform their optimization round on this Intermediate Representation. Optimizations consist in rewriting the tree into another one that evaluates faster, but without changing the semantics of the query (i.e., it must produce the same output). An example is that, if RumbleDB detects that both sides of a general comparison are single items, then the comparison is rewritten as a more efficient value comparison. Another example is that user-defined function calls are "inlined", meaning that the body of the function is copied over in place of the function call, as if the user had written it there manually.
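For example (a hedged illustration rather than a literal optimizer trace), if $x is statically known to be bound to exactly one integer, then the general comparison

$x = 3

can be rewritten to the value comparison

$x eq 3

since both have the same semantics when each side is a single item, and value comparisons are cheaper to evaluate.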
Once optimizations have been done, RumbleDB decides the mode in which each expression and clause will be evaluated (locally, sequentially, in parallel, with DataFrames, etc.). The resulting expression tree is then converted to a runtime iterator tree; this is the query plan that will actually be evaluated by the engine.
Every node in a runtime iterator tree outputs either a sequence of items (if it corresponds to an expression) or a tuple stream (if it corresponds to a clause other than the return clause).
Dynamic phase¶
During the dynamic phase, the root of the tree is asked to produce a sequence of items, which is to be the final output of the query as a whole.
Then, recursively, each node in the tree will ask its children to produce sequences of items (or tuple streams). Each node then combines the sequences of items (or tuple streams) it receives from its children in order to produce its own sequence of items according to its semantics, and pass it to its parent. That way, the data flows all the way from the bottom of the tree to its root, and the final results are obtained and presented to the user or written to persistent storage (drive or data lake).
There are many different ways for a runtime iterator to produce an output sequence of items (or tuple stream) and pass it to its parent runtime iterator in the tree:
By materializing sequences of items (or tuple streams) completely in local computer memory.
By locally iterating over each item in a sequence, one after the other (or over each tuple in a tuple stream, one after the other).
By working in parallel over the sequence of items, internally stored as a Spark RDD.
By working in parallel over the sequence of items (or tuple stream), internally stored as a Spark DataFrame.
By converting the semantics of the iterator directly to native Spark SQL.
Materialization¶
When a sequence of items is materialized, it means that an actual List (or Array, or Vector), native to the language of implementation (in this case Java) is stored in local memory, filled with the items. This is, of course, only possible if the sequence is small enough that it fits.
The parent runtime iterator then directly processes this List in place, in order to produce its output.
A special case is when an expression is statically known to return either zero or one item (e.g., an addition, or a logical expression), but not more. Then no List structure is needed, and a single Item can be returned via a simple method call in the language of implementation (Java).
Streaming¶
With larger sequences of items, materialization becomes impractical because the memory footprint becomes too large, and the size of the sequences that can be manipulated would be strictly limited by the total memory available.
Thus, another technique is used instead: streaming. When a sequence of items (or tuple stream) is produced and consumed in a streaming fashion, it means that the items (or tuples) are produced and consumed one by one, iteratively. But the whole sequence of items (or tuple stream) is never stored anywhere.
The classical pattern for doing so is known as the Volcano iterator architecture. It consists in first calling a method called open() to initialize the iterator, then hasNext() to check if there exists a next item (or tuple), and if so, then next() to consume it; and then hasNext() and next() are called again and repeatedly as long as hasNext() does not return false. When it finally does, close() is called to clean up the iterator.
With this technique, it is possible to process sequences that are much larger than memory, because the actual sequence is never fully stored. However, there are two problems with this: first, it can take a lot of time to go through the entire sequence (imagine doing so with billions or trillions of items). Second, there are expressions or clauses that are not compatible with streaming (consider, for example, the group by or order by clause, which cannot be implemented without materializing their full input).
Parallel execution (with RDDs)¶
When a sequence becomes unreasonably large, RumbleDB switches to parallel execution, leveraging Spark capabilities: the sequences of items are passed and processed as RDDs of Item objects. Each runtime iterator then calls Spark transformations on these RDDs to produce an output RDD, or in some cases (e.g., count()) calls a Spark action to produce a single, local, materialized Item.
A Spark transformation or action often needs to be supplied with an additional function (e.g., a map function, a filter function), called a Spark UDF (for “User-Defined Function”). What RumbleDB then does is that it squeezes an entire runtime iterator subtree into a UDF, so that this subtree can be recursively evaluated on each node of the cluster, as a local execution (materialized or streaming).
For example, imagine a filter expression, with a specific predicate, on a sequence of a billion items. If the input sequence is physically available as an RDD, RumbleDB squeezes the predicate’s runtime iterator tree into a UDF, and invokes the filter() transformation with this UDF, resulting in a smaller RDD that contains the filtered sequence of items. Physically, the predicate’s runtime iterator tree will be evaluated on items, in parallel, across thousands of machines in the cluster; relative to each one of these machines, this is a local execution (local to each machine), where the predicate iterator streams over each batch.
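As a concrete (and hedged) illustration reusing the GitHub archive dataset from earlier, a predicate query such as

json-file("git-archive-big.json")[$$.type eq "PushEvent"]

would, when the input sequence is available as an RDD, be evaluated by squeezing the runtime iterator tree of the predicate $$.type eq "PushEvent" into a Spark UDF and passing it to the filter() transformation; each worker then evaluates the predicate locally on its own partition of the data.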
The use of RDDs is specific to sequences of items and does not exist for tuple streams.
Parallel execution (with DataFrames)¶
The RDD implementation supports heterogeneous sequences by leveraging the polymorphism of Item objects. However, this is not the most efficient option when the Items in a sequence happen to have a regular structure.
Thus, if the Items in a sequence are valid against a specific schema, or even against an array type or an atomic type, the underlying physical storage in memory relies on Spark DataFrames instead of RDDs. Homogeneous sequences of arrays or of atomics (e.g., a sequence of integers) are physically implemented as a one-column DataFrame with the corresponding type.
Thus, there exists a mapping from JSONiq types to Spark SQL types. In the case that there is no corresponding Spark SQL type, the implementation falls back to RDDs.
To summarize, homogeneous sequences of the most common types are stored in DataFrames, and RDDs are used in all other cases.
DataFrames are also consistently used for storing tuple streams and parallelizing the execution of FLWOR clauses. In FLWOR DataFrames, every column corresponds to one FLWOR variable, similarly to the tuple stream visualizations shown earlier in this chapter. The column type is native if the variable type can be mapped seamlessly to a Spark SQL type; otherwise, the column type is binary, and Items are serialized to sequences of bytes and deserialized back on demand.
Parallel execution (with Native SQL)¶
In some cases (more in every release), RumbleDB is able to evaluate the query using only Spark SQL, compiling JSONiq to SQL directly instead of packing Java runtime iterators in UDFs. This leads to faster execution, because UDFs are slower than a native execution in SQL. This is because, to a SQL optimizer, UDFs are opaque and prevent automatic optimizations.
RumbleDB switches seamlessly between all execution modes, even within the same query, as shown in the following diagram.