263-3010-00: Big Data
Section 5
Syntax
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/12/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Why Syntax¶
A data lake, whether on public cloud services like S3 or Azure Blob Storage, or on a distributed file system like HDFS, is where datasets are stored in their native format, such as CSV files. Unlike traditional databases, which store data in proprietary formats and require ETL (Extract, Transform, Load) processes, data lakes allow in-situ querying: data can be accessed directly, without first importing it into a specific system. While ETL improves performance thanks to optimized formats and indices, it can be time-consuming and is not always necessary. In a data lake, the syntax of the data is visible and easily accessible, for example as CSV files for tabular data.
CSV¶
ID,Last name,First name
1,Einstein,Albert
2,Gödel,Kurt
| ID | Last name | First name |
|---|---|---|
| 1 | Einstein | Albert |
| 2 | Gödel | Kurt |
CSV is a textual format, in the sense that it can be opened in a text editor. This is in contrast to binary formats that are more opaque.
Each record (a table row) corresponds to one line of text in a CSV file. Having one record per line of text is a common pattern not unique to CSV; this is what makes it possible to scale up data processing to billions of records.
What appears on each line of text is specific to CSV. CSV stands for comma-separated values.
The main challenge with CSV files is that, in spite of a standard (RFC 4180), in practice there are many different dialects and variations, which limits interoperability. For example, another character can be used instead of the comma (tabs, semicolons, etc.). Also, when a comma (or the special character used in its stead) needs to actually appear in a value, it needs to be escaped. There are many ways to do so; one of them is to double-quote the cell, which implies in turn that quotes within quotes must be escaped. There are many different conventions for doing so.
ID,Last name,First name,Theory
1,Einstein,Albert,"General, Special Relativity"
2,Gödel,Kurt,"""Incompleteness"" Theorem"
| ID | Last name | First name | Theory |
|---|---|---|---|
| 1 | Einstein | Albert | General, Special Relativity |
| 2 | Gödel | Kurt | "Incompleteness" Theorem |
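As an illustration, Python's built-in csv module implements the RFC 4180 conventions shown above (comma delimiters, double-quoted cells, doubled quotes inside quoted cells). A minimal sketch, assuming the dataset above is held in a string:

```python
import csv
import io

data = (
    'ID,Last name,First name,Theory\n'
    '1,Einstein,Albert,"General, Special Relativity"\n'
    '2,Gödel,Kurt,"""Incompleteness"" Theorem"\n'
)

# DictReader takes the attribute names from the header line and
# resolves quoted cells and escaped quotes according to RFC 4180.
for row in csv.DictReader(io.StringIO(data)):
    print(row["Last name"], "-", row["Theory"])
```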
Data denormalization¶
We know that it is desirable to store data in so-called normal forms in a relational database management system. As you may recall, data in the first normal form cannot nest, and data in higher normal forms is split across multiple tables. As a rule of thumb, normalizing data means having to join it back at query time.
In the context of data lakes and large-scale data processing, it is often desirable to go exactly the opposite way. This is called data denormalization. Not only can several tables be merged into just one (with functional dependencies that would otherwise have been considered "undesirable"), we can also nest data: tables in tables in tables.
While this is likely to come as a shock to people who have just learned normal forms, data denormalization should be done with knowledge of normal forms: one needs a deep understanding of what one is doing and why one is doing it.
Data denormalization makes a lot of sense in read-intensive scenarios in which not having to join brings a significant performance improvement. In read-intensive scenarios, we love anything that is linear, which corresponds to a full scan of the dataset. This is as opposed to the point queries more commonly found in traditional databases.
Thanks to the way that we defined tables, data denormalization is straightforward to explain. Remember that a table is a collection of tuples.
We required identical support (relational integrity), flat rows (atomic integrity, which is also the first normal form), and homogeneous data types within a column (domain integrity). Denormalization simply means that we drop all three constraints (or two, or just one).
Let us dive into this.
A tuple, mathematically, can be formalized as a partial function mapping strings (the attribute names) to values.
As it turns out, a tuple can also be represented in a purely textual fashion:
{
  "product": "Phone",
  "price": 800,
  "customer": "John",
  "quantity": 1
}
The difference with CSV is that, in JSON, the attribute names appear in every tuple, while in CSV they appear only once, in the header line. JSON is appropriate for data denormalization because repeating the attributes in every tuple allows us to drop the identical support requirement.
If we now look at a table (which checks all three integrity boxes), we can re-express it in a JSON-based textual format with one tuple per line.
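For instance, three order tuples (with illustrative values) could be expressed like so:
{ "product": "Phone", "price": 800, "customer": "John", "quantity": 1 }
{ "product": "Phone", "price": 800, "customer": "Peter", "quantity": 2 }
{ "product": "Tablet", "price": 400, "customer": "Mary", "quantity": 1 }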
Now, if we are to drop atomic integrity and allow for nestedness, a table can contain nested tables.
CSV would not be powerful enough to express such data, but JSON is. For example, the first tuple of such a nested table, grouping all orders for a product, expressed in JSON looks like so:
{
  "product": "Phone",
  "orders": [
    { "customer": "John", "quantity": 1 },
    { "customer": "Peter", "quantity": 2 },
    { "customer": "Mary", "quantity": 1 }
  ]
}
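Once parsed, such a nested tuple can be navigated directly, with no join needed. A minimal sketch with Python's json module (the variable names are illustrative):

```python
import json

denormalized = '''
{
  "product": "Phone",
  "orders": [
    { "customer": "John", "quantity": 1 },
    { "customer": "Peter", "quantity": 2 },
    { "customer": "Mary", "quantity": 1 }
  ]
}
'''

tuple_ = json.loads(denormalized)
# The orders are nested inside the tuple: no join is needed to read them.
total = sum(order["quantity"] for order in tuple_["orders"])
print(tuple_["product"], "orders:", total)
```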
Concretely, data denormalization means that we abandon the paradigm of homogeneous collections of flat items (tables) and instead consider heterogeneous collections of nested items.
Semi-structured Data and Well-formedness¶
The generic name for denormalized data (in the sense of heterogeneous and nested) is "semi-structured data". Textual formats such as XML and JSON have the advantage that they can be processed by computers and can also be read, written, and edited by humans.
Another very important and characterizing aspect of XML and JSON is that they are standards: XML is a W3C standard. The W3C, also known as the World Wide Web Consortium, is the same body that standardizes HTML, HTTP, etc. JSON is now an ECMA standard; ECMA is the same body that standardizes JavaScript. In fact, the JS in JSON comes from JavaScript, because its look was inspired by JavaScript syntax.
This is what an XML document looks like:
<?xml version="1.0"?>
<country code="RU">
  <name>Russia</name>
  <population>144500000</population>
  <currency code="RUB">Russian Ruble</currency>
  <cities>
    <city>Moscow</city>
    <city>Saint Petersburg</city>
    <city>Novosibirsk</city>
  </cities>
  <description>
    We produce <b>excellent</b> vodka and caviar.
  </description>
</country>
This is what a JSON document looks like:
{
  "code": "RU",
  "name": "Russia",
  "population": 144500000,
  "currency": {
    "name": "Russian Ruble",
    "code": "RUB"
  },
  "confederation": false,
  "president": "Vladimir Putin",
  "capital": "Moscow",
  "cities": [ "Moscow", "Saint Petersburg", "Novosibirsk" ],
  "description": "We produce excellent vodka and caviar."
}
It is commonly believed that XML is losing in popularity and JSON is "the new cool stuff"; however, this is not fully accurate. While on the research side publications on XML have become less widespread, in companies XML is very popular due to its very mature ecosystem, supported by several other W3C standards. For example, the mandatory financial reports of US public companies must be filed in XML, and in Switzerland, electronic tax statements are also stored in XML. What is important to understand is that neither of them is better than the other; this is highly use-case dependent, and in some cases XML will be a better fit (typically in the publishing industry), while in other cases JSON will be a better fit.
XML and JSON share the concept of well-formedness: a document either complies with the syntax or it does not. In computer science, both are considered languages, and a "well-formed" document belongs to that language. A well-formed XML or JSON document can be successfully opened, enabling features like automatic formatting and color coding. Non-well-formed documents, however, cannot be processed until they are fixed. Thanks to the abundance of free and open-source tools for reading and writing well-formed XML and JSON, they are widely used, avoiding the need to create new syntaxes and tools.
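As an illustration, a parser is the simplest well-formedness checker: it either accepts a document or raises an error. A minimal sketch using Python's standard library (the broken documents here are illustrative):

```python
import json
import xml.etree.ElementTree as ET

try:
    json.loads('{"name": "Russia"')  # missing closing brace
except json.JSONDecodeError as e:
    print("not well-formed JSON:", e)

try:
    ET.fromstring("<foo><bar></foo></bar>")  # badly nested tags
except ET.ParseError as e:
    print("not well-formed XML:", e)
```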
JSON¶
Now let us dive into the details of the JSON syntax. JSON stands for JavaScript Object Notation because the way it looks originates from JavaScript syntax; however, it now lives its own life, completely independently of JavaScript.
JSON is made of exactly six building blocks: strings, numbers, Booleans, null, objects, and arrays.
Strings¶
Strings are simply text. In JSON, strings always appear in double quotes. This is a well-formed JSON string:
"This is a string"
Obviously, strings could contain quotes, and in order not to confuse them with the surrounding quotes, they need to be differentiated. This is called escaping and, in JSON, escaping is done with backslash characters (\).
"The word \"quoted\" is quoted."
There are several other escape sequences in JSON, the most popular ones being:

| Escape sequence | Functionality |
|---|---|
| \\ | backslash (\) |
| \n | new line |
| \r | carriage return |
| \t | tabulation |
| \u followed by four hexadecimal digits | any character |
The last one, in fact, allows the insertion of any character via its Unicode code point. Unicode is a standard that assigns a numeric code (called a code point) to each character in order to catalog them across all languages of the world, even including emojis. The catalog evolves with regular meetings of the working group. For example, the Russian letter П is \u041F.
The code point must be indicated in base 16 (digits 0 to 9, plus letters from A to F). Code points can easily be looked up with a search engine by typing a description of what you are looking for, even though more complex strings will typically be created automatically.
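As an illustration, any JSON parser resolves these escape sequences when reading a string literal. A minimal sketch with Python's json module:

```python
import json

# The literal below contains escaped quotes, a new line, and a
# Unicode escape for the Cyrillic letter П (code point U+041F).
literal = '"The word \\"quoted\\" is quoted.\\n\\u041F"'
print(json.loads(literal))
```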
Numbers¶
JSON generally supports numbers, without explicitly naming any types or making any distinction between numbers apart from how they appear in syntax. The way a number appears in syntax is called a lexical representation, or a literal. These two words, in fact, also apply generally to many other types.
Generally, a number is made of digits, possibly including a decimal period (which must be a dot) and optionally followed by the letter e (in either case) and a power of ten (scientific notation). Both the number and the optional power of ten can also have an optional sign.
These are a few examples of well-formed JSON number literals:
0
1234
12.34
-132.54
12.3E45
12.3e-45
-12.3e-45
JSON places a few restrictions:

- A leading + is not allowed.
- A leading 0 is not allowed, except if the integer part is exactly 0 (in which case it is even mandatory, i.e., .23 is not a well-formed JSON number literal; instead, it should be 0.23).
JSON numbers are unquoted. Otherwise, they would be recognized as strings by the parser and not as numbers.
A warning: the same (mathematical) number might have several literals to represent it.
2
20e-1
2.0
It is important to have in mind that the literal, which is the syntactic representation, is not the same as the actual, logical number. The above three literals have in common their "two-ness".
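As an illustration, a JSON parser enforces these restrictions and maps different literals of the same number to the same value. A minimal sketch with Python's json module:

```python
import json

# Three literals, one "two-ness": they all parse to the same value.
print(json.loads("2"), json.loads("20e-1"), json.loads("2.0"))
print(json.loads("20e-1") == json.loads("2.0"))  # True

# Restrictions on literals: a leading + or a bare .23 is rejected.
for bad in ["+1", ".23"]:
    try:
        json.loads(bad)
    except json.JSONDecodeError:
        print(bad, "is not a well-formed JSON number literal")
```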
Booleans¶
There are two Booleans, true and false, and each one is associated with exactly one possible literal, which are, well, true and false.
true
false
In spite of the fact that there is only exactly one literal for each Boolean, it is also important to distinguish the literal true, which is the sequence of letters t, r, u, and e appearing in JSON syntax, from the actual concept of "true-ness," which is an abstract mathematical concept.
Boolean literals are unquoted. Otherwise, they would be recognized as strings by the parser and not as Booleans.
Null¶
There is a special value, null, which corresponds to the (unique) literal:
null
The concept of "null-ness" can be subject to debate: some like to see this as an unknown or hidden value, others as equivalent to an absent value, etc. On the logical level, we will consider that an absent value is not the same thing as null value.
Null literals are unquoted. Otherwise, they would be recognized as strings by the parser and not as nulls.
Arrays¶
Arrays are simply lists of values. The concept of list is abstract and mathematical, i.e., lists are considered an abstract data type and correspond to finite mathematical sequences.
The concept of array is the syntactic counterpart of a list, i.e., an array is a physical representation of an abstract list.
The members of an array can be any JSON value: string, number, Boolean, null, array or object. They are listed within square brackets, and are separated by commas.
[ 1, 2, 3 ]
[ ]
[ null, "foo", 12.3, false, [ 1, 3 ] ]
It can also be convenient to let arrays "breathe" with extra spaces, which are irrelevant when parsing JSON (except if they are inside a string literal). In fact, there are plenty of libraries out there that can nicely do this, which is known as "pretty-printing":
[
  1,
  2,
  3
]

[]

[
  null,
  "foo",
  12.3,
  false,
  [
    1,
    3
  ]
]
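As an illustration, most JSON libraries can do this pretty-printing for us. A minimal sketch with Python's json module:

```python
import json

value = [None, "foo", 12.3, False, [1, 3]]
# indent=2 inserts the new lines and spaces shown above; the extra
# whitespace is irrelevant when parsing the result back.
print(json.dumps(value, indent=2))
```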
Objects¶
Objects are simply maps from strings to values. The concept of map is abstract and mathematical, i.e., maps are considered an abstract data type and correspond to mathematical partial functions with a string domain and the range of all values.
The concept of object is the syntactic counterpart of a map, i.e., an object is a physical representation of an abstract map that explicitly lists all string-value pairs (this is called an extensional definition of a function, as opposed to the way functions are typically defined in mathematics).
The keys of an object must be strings. This excludes any other kind of value: a key cannot be an integer, and it cannot be an object. This also implies that keys must be quoted. While some JSON parsers are lenient and will accept unquoted keys, it is very important to never create any JSON documents with unquoted keys, for full compatibility with all parsers.
The values associated with the keys can be any JSON value: string, number, Boolean, null, array, or object. The pairs are listed within curly brackets and are separated by commas. Within a pair, the value is separated from the key with a colon character.
{ "foo" : 1 }
{ }
{ "foo" : "foo", "bar" : [ 1, 2 ],
"foobar" : [ { "foo" : null }, { "foo" : true } ]
}
It can also be convenient to let objects "breathe" with extra spaces, as was already explained for arrays.
{
  "foo" : "foo",
  "bar" : [
    1,
    2
  ],
  "foobar" : [
    {
      "foo" : null,
      "bar" : 2
    },
    {
      "foo" : true,
      "bar" : 3
    }
  ]
}
The JSON standard recommends for keys to be unique within an object. Many parsers and products will reject duplicate keys, because they rely on the semantics of the map abstract data type. If one downloads a dataset that has duplicate keys and the engine one intends to use does not allow them, then this will require extra work: one needs to find a JSON library that accepts duplicate keys, and use it to fix the dataset by disambiguating the keys so that it becomes parseable with any engine. It is very important to never create any JSON documents with duplicate keys, for full compatibility with all parsers and to avoid creating this extra workload for the consumers.
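As an illustration, Python's json module silently keeps the last of several duplicate pairs, but its object_pairs_hook parameter can be used to detect them. A minimal sketch (the hook function name is illustrative):

```python
import json

def reject_duplicates(pairs):
    # Fail loudly instead of silently keeping the last pair.
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate keys: " + ", ".join(keys))
    return dict(pairs)

print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2}: the last pair wins
try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
except ValueError as e:
    print(e)
```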
XML¶
XML stands for eXtensible Markup Language. It resembles HTML, except that it allows for any tags and that it is stricter in what it allows.
XML is considerably more complex than JSON but, fortunately, most databases only use a subset of what XML can do. XML's most important building blocks are elements, attributes, text, and comments.
Elements¶
XML is a markup language, which means that content is "tagged". Tagging is done with XML elements.
An XML element consists of an opening tag and a closing tag. What is "tagged" is everything in between the opening tag and the closing tag.
This is an example with an opening tag, some content (which can be recursively anything), and then a closing tag. Tags consist of a name surrounded with angle brackets <...>, and the closing tag has an additional slash in front of the name.
<person>(any content here)</person>
If there is no content at all, the lazier among us will appreciate a convenient shortcut to denote the empty element with a single tag. Mind that the slash is at the end:
<person/>
is equivalent to:
<person></person>
Elements nest arbitrarily:
<person><first>(some content)</first><student/>
<last>(some other content)</last></person>
As with JSON, it is possible to use indentation and new lines to pretty-print the document for easier human reading:
<person>
  <first>(some content)</first>
  <student/>
  <last>(some other content)</last>
</person>
Unlike JSON keys, element names can repeat at will. In fact, it is even a common pattern to repeat an element many times under another element in plural form, like so:
<persons>
  <person>
    <first>(some content)</first>
    <last>(some other content)</last>
  </person>
  <person>
    <first>(some content)</first>
    <last>(some other content)</last>
  </person>
  <person>
    <first>(some content)</first>
    <last>(some other content)</last>
  </person>
</persons>
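As an illustration, such a document can be parsed and navigated with any XML library. A minimal sketch with Python's xml.etree.ElementTree module (the names inside the elements are illustrative):

```python
import xml.etree.ElementTree as ET

doc = """<persons>
  <person><first>Albert</first><last>Einstein</last></person>
  <person><first>Kurt</first><last>Gödel</last></person>
</persons>"""

root = ET.fromstring(doc)  # raises ParseError if not well-formed
for person in root.findall("person"):
    print(person.find("first").text, person.find("last").text)
```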
Some care needs to be put into "well-parenthesizing" the tags; for example, this is incorrect and not well-formed XML:
<foo><bar></foo></bar>
because the inner elements must close before the outer elements.
Elements cannot appear within opening or closing tags; they must appear between tags. This is not well-formed XML:
<foo <bar/>></foo>
At the top-level, a well-formed XML document must have exactly one element. Not zero, not two, exactly one. This is not well-formed XML:
<person>
  <first>(some content)</first>
  <last>(some other content)</last>
</person>
<person>
  <first>(some content)</first>
  <last>(some other content)</last>
</person>
<person>
  <first>(some content)</first>
  <last>(some other content)</last>
</person>
Element names can generally contain alphanumerical characters, dashes, points, and underscores. The first character of an element name has to be a letter or an underscore. The use of colons is restricted to the support of namespaces, which are explained below.