263-3010-00: Big Data
Section 5
Syntax
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/12/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Why Syntax¶
A data lake, whether on public cloud services like S3 or Azure Blob Storage, or on a distributed file system like HDFS, is where datasets are stored in their native format, such as CSV files. Unlike traditional databases, which store data in proprietary formats and require ETL (Extract, Transform, Load) processes, data lakes allow in-situ querying: data can be accessed directly, without first importing it into a specific system. While ETL improves performance thanks to optimized formats and indices, it can be time-consuming and is not always necessary. In a data lake, the syntax of the data is visible and easily accessible, for example as CSV files for tabular data.
CSV¶
ID,Last name,First name
1,Einstein,Albert
2,Gödel,Kurt
| ID | Last name | First name |
|---|---|---|
| 1 | Einstein | Albert |
| 2 | Gödel | Kurt |
CSV is a textual format, in the sense that it can be opened in a text editor. This is in contrast to binary formats that are more opaque.
Each record (a table row) corresponds to one line of text in a CSV file. Having one record per line of text is a common pattern not unique to CSV; this is what makes it possible to scale up data processing to billions of records.
What appears on each line of text is specific to CSV. CSV stands for comma-separated values.
The main challenge with CSV files is that, in spite of a standard (RFC 4180), in practice there are many different dialects and variations, which limits interoperability. For example, another character can be used instead of the comma (tabs, semicolons, etc.). Also, when a comma (or the special character used in its stead) needs to actually appear in a value, it needs to be escaped. There are many ways to do so; one of them is to double-quote the cell, which implies in turn that quotes within quotes must be escaped. There are many different conventions for doing so.
ID,Last name,First name,Theory
1,Einstein,Albert,"General, Special Relativity"
2,Gödel,Kurt,"""Incompleteness"" Theorem"
| ID | Last name | First name | Theory |
|---|---|---|---|
| 1 | Einstein | Albert | General, Special Relativity |
| 2 | Gödel | Kurt | "Incompleteness" Theorem |
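As an illustration, Python's built-in csv module implements the RFC 4180 conventions shown above (comma delimiters, double-quoted cells, doubled quotes inside quoted cells). A minimal sketch, assuming the dataset above is held in a string:

```python
import csv
import io

data = (
    'ID,Last name,First name,Theory\n'
    '1,Einstein,Albert,"General, Special Relativity"\n'
    '2,Gödel,Kurt,"""Incompleteness"" Theorem"\n'
)

# DictReader takes the attribute names from the header line and
# resolves quoted cells and escaped quotes according to RFC 4180.
for row in csv.DictReader(io.StringIO(data)):
    print(row["Last name"], "-", row["Theory"])
```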
Data denormalization¶
We know that it is desirable to store data in so-called normal forms in a relational database management system. As you may recall, data in the first normal form cannot nest, and data in higher normal forms is split across multiple tables. As a rule of thumb, normalizing data means having to join it back at query time.
In the context of data lakes and large-scale data processing, it is often desirable to go exactly the opposite way. This is called data denormalization. Not only can several tables be merged into just one (with functional dependencies that would otherwise have been considered "undesirable"), we can also nest data: tables in tables in tables.
While this is likely to come as a shock to people who have just learned normal forms, data denormalization should be done with knowledge of normal forms: one needs a deep understanding of what one is doing and why one is doing it.
Data denormalization makes a lot of sense in read-intensive scenarios in which not having to join brings a significant performance improvement. In read-intensive scenarios, we love anything that is linear, which corresponds to a full scan of the dataset. This is as opposed to the point queries more commonly found in traditional databases.
Thanks to the way that we defined tables, data denormalization is straightforward to explain. Remember that a table is a collection of tuples.
We required identical support (relational integrity), flat rows (atomic integrity, which is also the first normal form), and homogeneous data types within a column (domain integrity). Denormalization simply means that we drop all three constraints (or two, or just one).
Let us dive into this.
A tuple, mathematically, can be formalized as a partial function mapping strings (the attribute names) to values.
As it turns out, a tuple can also be represented in a purely textual fashion:
{
  "product": "Phone",
  "price": 800,
  "customer": "John",
  "quantity": 1
}
The difference with CSV is that, in JSON, the attribute names appear in every tuple, while in CSV they appear only once, in the header line. JSON is appropriate for data denormalization because repeating the attributes in every tuple allows us to drop the identical support requirement.
If we now look at a table (which checks all three integrity boxes), we can re-express it in a JSON-based textual format with one tuple per line.
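For instance, three order tuples (with illustrative values) could be expressed like so:
{ "product": "Phone", "price": 800, "customer": "John", "quantity": 1 }
{ "product": "Phone", "price": 800, "customer": "Peter", "quantity": 2 }
{ "product": "Tablet", "price": 400, "customer": "Mary", "quantity": 1 }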
Now, if we are to drop atomic integrity and allow for nestedness, a table can contain nested tables.
CSV would not be powerful enough to express such data, but JSON is. For example, the first tuple of such a nested table, grouping all orders for a product, expressed in JSON looks like so:
{
  "product": "Phone",
  "orders": [
    { "customer": "John", "quantity": 1 },
    { "customer": "Peter", "quantity": 2 },
    { "customer": "Mary", "quantity": 1 }
  ]
}
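Once parsed, such a nested tuple can be navigated directly, with no join needed. A minimal sketch with Python's json module (the variable names are illustrative):

```python
import json

denormalized = '''
{
  "product": "Phone",
  "orders": [
    { "customer": "John", "quantity": 1 },
    { "customer": "Peter", "quantity": 2 },
    { "customer": "Mary", "quantity": 1 }
  ]
}
'''

tuple_ = json.loads(denormalized)
# The orders are nested inside the tuple: no join is needed to read them.
total = sum(order["quantity"] for order in tuple_["orders"])
print(tuple_["product"], "orders:", total)
```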
Concretely, data denormalization means that we abandon the paradigm of homogeneous collections of flat items (tables) and instead consider heterogeneous collections of nested items.
Semi-structured Data and Well-formedness¶
The generic name for denormalized data (in the sense of heterogeneous and nested) is "semi-structured data". Textual formats such as XML and JSON have the advantage that they can be processed by computers and can also be read, written, and edited by humans.
Another very important and characterizing aspect of XML and JSON is that they are standards: XML is a W3C standard. The W3C, also known as the World Wide Web Consortium, is the same body that standardizes HTML, HTTP, etc. JSON is now an ECMA standard; ECMA is the same body that standardizes JavaScript. In fact, the JS in JSON comes from JavaScript, because its look was inspired by JavaScript syntax.
This is what an XML document looks like:
<?xml version="1.0"?>
<country code="RU">
  <name>Russia</name>
  <population>144500000</population>
  <currency code="RUB">Russian Ruble</currency>
  <cities>
    <city>Moscow</city>
    <city>Saint Petersburg</city>
    <city>Novosibirsk</city>
  </cities>
  <description>
    We produce <b>excellent</b> vodka and caviar.
  </description>
</country>
This is what a JSON document looks like:
{
  "code": "RU",
  "name": "Russia",
  "population": 144500000,
  "currency": {
    "name": "Russian Ruble",
    "code": "RUB"
  },
  "confederation": false,
  "president": "Vladimir Putin",
  "capital": "Moscow",
  "cities": [ "Moscow", "Saint Petersburg", "Novosibirsk" ],
  "description": "We produce excellent vodka and caviar."
}
It is commonly believed that XML is losing in popularity and JSON is "the new cool stuff"; however, this is not fully accurate. While on the research side publications on XML have become less widespread, in companies XML is very popular due to its very mature ecosystem, supported by several other W3C standards. For example, the mandatory financial reports of US public companies must be filed in XML, and in Switzerland, electronic tax statements are also stored in XML. What is important to understand is that neither of them is better than the other; this is highly use-case dependent, and in some cases XML will be a better fit (typically in the publishing industry), while in other cases JSON will be a better fit.
XML and JSON share the concept of well-formedness: a document either complies with the syntax or it does not. In computer science, both are considered languages, and a "well-formed" document belongs to that language. A well-formed XML or JSON document can be successfully opened, enabling features like automatic formatting and color coding. Non-well-formed documents, however, cannot be processed until they are fixed. Thanks to the abundance of free and open-source tools for reading and writing well-formed XML and JSON, they are widely used, avoiding the need to create new syntaxes and tools.
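As an illustration, a parser is the simplest well-formedness checker: it either accepts a document or raises an error. A minimal sketch using Python's standard library (the broken documents here are illustrative):

```python
import json
import xml.etree.ElementTree as ET

try:
    json.loads('{"name": "Russia"')  # missing closing brace
except json.JSONDecodeError as e:
    print("not well-formed JSON:", e)

try:
    ET.fromstring("<foo><bar></foo></bar>")  # badly nested tags
except ET.ParseError as e:
    print("not well-formed XML:", e)
```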
JSON¶
Now let us dive into the details of the JSON syntax. JSON stands for JavaScript Object Notation because the way it looks originates from JavaScript syntax; however, it now lives its own life, completely independently of JavaScript.
JSON is made of exactly six building blocks: strings, numbers, Booleans, null, objects, and arrays.
Strings¶
Strings are simply text. In JSON, strings always appear in double quotes. This is a well-formed JSON string:
"This is a string"
Obviously, strings could contain quotes, and in order not to confuse them with the surrounding quotes, they need to be differentiated. This is called escaping and, in JSON, escaping is done with backslash characters (\).
"The word \"quoted\" is quoted."
There are several other escape sequences in JSON, the most popular ones being:

| Escape sequence | Functionality |
|---|---|
| \\ | backslash (\) |
| \n | new line |
| \r | carriage return |
| \t | tabulation |
| \u followed by four hexadecimal digits | any character |
The last one, in fact, allows the insertion of any character via its Unicode code point. Unicode is a standard that assigns a numeric code (called a code point) to each character in order to catalog them across all languages of the world, even including emojis. The catalog evolves with regular meetings of the working group. For example, the Russian letter П is \u041F.
The code point must be indicated in base 16 (digits 0 to 9, plus letters from A to F). Code points can easily be looked up with a search engine by typing a description of what you are looking for, even though more complex strings will typically be created automatically.
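As an illustration, any JSON parser resolves these escape sequences when reading a string literal. A minimal sketch with Python's json module:

```python
import json

# The literal below contains escaped quotes, a new line, and a
# Unicode escape for the Cyrillic letter П (code point U+041F).
literal = '"The word \\"quoted\\" is quoted.\\n\\u041F"'
print(json.loads(literal))
```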
Numbers¶
JSON generally supports numbers, without explicitly naming any types or making any distinction between numbers apart from how they appear in syntax. The way a number appears in syntax is called a lexical representation, or a literal. These two words, in fact, also apply generally to many other types.
Generally, a number is made of digits, possibly including a decimal period (which must be a dot) and optionally followed by the letter e (in either case) and a power of ten (scientific notation). Both the number and the optional power of ten can also have an optional sign.
These are a few examples of well-formed JSON number literals:
0
1234
12.34
-132.54
12.3E45
12.3e-45
-12.3e-45
JSON places a few restrictions:

- A leading + is not allowed.
- A leading 0 is not allowed, except if the integer part is exactly 0 (in which case it is even mandatory, i.e., .23 is not a well-formed JSON number literal; instead, it should be 0.23).
JSON numbers are unquoted. Otherwise, they would be recognized as strings by the parser and not as numbers.
A warning: the same (mathematical) number might have several literals to represent it.
2
20e-1
2.0
It is important to have in mind that the literal, which is the syntactic representation, is not the same as the actual, logical number. The above three literals have in common their "two-ness".
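As an illustration, a JSON parser enforces these restrictions and maps different literals of the same number to the same value. A minimal sketch with Python's json module:

```python
import json

# Three literals, one "two-ness": they all parse to the same value.
print(json.loads("2"), json.loads("20e-1"), json.loads("2.0"))
print(json.loads("20e-1") == json.loads("2.0"))  # True

# Restrictions on literals: a leading + or a bare .23 is rejected.
for bad in ["+1", ".23"]:
    try:
        json.loads(bad)
    except json.JSONDecodeError:
        print(bad, "is not a well-formed JSON number literal")
```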
Booleans¶
There are two Booleans, true and false, and each one is associated with exactly one possible literal, which are, well, true and false.
true
false
In spite of the fact that there is only exactly one literal for each Boolean, it is also important to distinguish the literal true, which is the sequence of letters t, r, u, and e appearing in JSON syntax, from the actual concept of "true-ness," which is an abstract mathematical concept.
Boolean literals are unquoted. Otherwise, they would be recognized as strings by the parser and not as Booleans.
Null¶
There is a special value, null, which corresponds to the (unique) literal:
null
The concept of "null-ness" can be subject to debate: some like to see this as an unknown or hidden value, others as equivalent to an absent value, etc. On the logical level, we will consider that an absent value is not the same thing as null value.
Null literals are unquoted. Otherwise, they would be recognized as strings by the parser and not as nulls.
Arrays¶
Arrays are simply lists of values. The concept of list is abstract and mathematical, i.e., lists are considered an abstract data type and correspond to finite mathematical sequences.
The concept of array is the syntactic counterpart of a list, i.e., an array is a physical representation of an abstract list.
The members of an array can be any JSON value: string, number, Boolean, null, array or object. They are listed within square brackets, and are separated by commas.
[ 1, 2, 3 ]
[ ]
[ null, "foo", 12.3, false, [ 1, 3 ] ]
It can also be convenient to let arrays "breathe" with extra spaces, which are irrelevant when parsing JSON (except if they are inside a string literal). In fact, there are plenty of libraries out there that can nicely do this, which is known as "pretty-printing":
[
  1,
  2,
  3
]

[]

[
  null,
  "foo",
  12.3,
  false,
  [
    1,
    3
  ]
]
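As an illustration, most JSON libraries can do this pretty-printing for us. A minimal sketch with Python's json module:

```python
import json

value = [None, "foo", 12.3, False, [1, 3]]
# indent=2 inserts the new lines and spaces shown above; the extra
# whitespace is irrelevant when parsing the result back.
print(json.dumps(value, indent=2))
```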
Objects¶
Objects are simply maps from strings to values. The concept of map is abstract and mathematical, i.e., maps are considered an abstract data type and correspond to mathematical partial functions with a string domain and the range of all values.
The concept of object is the syntactic counterpart of a map, i.e., an object is a physical representation of an abstract map that explicitly lists all string-value pairs (this is called an extensional definition of a function, as opposed to the way functions are typically defined in mathematics).
The keys of an object must be strings. This excludes any other kind of value: a key cannot be an integer, and it cannot be an object. This also implies that keys must be quoted. While some JSON parsers are lenient and will accept unquoted keys, it is very important to never create any JSON documents with unquoted keys, for full compatibility with all parsers.
The values associated with the keys can be any JSON value: string, number, Boolean, null, array, or object. The pairs are listed within curly brackets and are separated by commas. Within a pair, the value is separated from the key with a colon character.
{ "foo" : 1 }
{ }
{ "foo" : "foo", "bar" : [ 1, 2 ],
"foobar" : [ { "foo" : null }, { "foo" : true } ]
}
It can also be convenient to let objects "breathe" with extra spaces, as was already explained for arrays.
{
  "foo" : "foo",
  "bar" : [
    1,
    2
  ],
  "foobar" : [
    {
      "foo" : null,
      "bar" : 2
    },
    {
      "foo" : true,
      "bar" : 3
    }
  ]
}
The JSON standard recommends for keys to be unique within an object. Many parsers and products will reject duplicate keys, because they rely on the semantics of the map abstract data type. If one downloads a dataset that has duplicate keys and the engine one intends to use does not allow them, then this will require extra work: one needs to find a JSON library that accepts duplicate keys, and use it to fix the dataset by disambiguating the keys so that it becomes parseable with any engine. It is very important to never create any JSON documents with duplicate keys, for full compatibility with all parsers and to avoid creating this extra workload for the consumers.
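As an illustration, Python's json module silently keeps the last of several duplicate pairs, but its object_pairs_hook parameter can be used to detect them. A minimal sketch (the hook function name is illustrative):

```python
import json

def reject_duplicates(pairs):
    # Fail loudly instead of silently keeping the last pair.
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate keys: " + ", ".join(keys))
    return dict(pairs)

print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2}: the last pair wins
try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
except ValueError as e:
    print(e)
```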
XML¶
XML stands for eXtensible Markup Language. It resembles HTML, except that it allows for any tags and that it is stricter in what it allows.
XML is considerably more complex than JSON but, fortunately, most databases only use a subset of what XML can do. XML's most important building blocks are elements, attributes, text, and comments.
Elements¶
XML is a markup language, which means that content is "tagged". Tagging is done with XML elements.
An XML element consists of an opening tag and a closing tag. What is "tagged" is everything in between the opening tag and the closing tag.
This is an example with an opening tag, some content (which can be recursively anything), and then a closing tag. Tags consist of a name surrounded with angle brackets <...>, and the closing tag has an additional slash in front of the name.
<person>(any content here)</person>
If there is no content at all, the lazier among us will appreciate a convenient shortcut to denote the empty element with a single tag. Mind that the slash is at the end:
<person/>
is equivalent to:
<person></person>
Elements nest arbitrarily:
<person><first>(some content)</first><student/>
<last>(some other content)</last></person>
As with JSON, it is possible to use indentation and new lines to pretty-print the document for easier human reading:
<person>
  <first>(some content)</first>
  <student/>
  <last>(some other content)</last>
</person>
Unlike JSON keys, element names can repeat at will. In fact, it is even a common pattern to repeat an element many times under another element in plural form, like so:
<persons>
  <person>
    <first>(some content)</first>
    <last>(some other content)</last>
  </person>
  <person>
    <first>(some content)</first>
    <last>(some other content)</last>
  </person>
  <person>
    <first>(some content)</first>
    <last>(some other content)</last>
  </person>
</persons>
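As an illustration, such a document can be parsed and navigated with any XML library. A minimal sketch with Python's xml.etree.ElementTree module (the names inside the elements are illustrative):

```python
import xml.etree.ElementTree as ET

doc = """<persons>
  <person><first>Albert</first><last>Einstein</last></person>
  <person><first>Kurt</first><last>Gödel</last></person>
</persons>"""

root = ET.fromstring(doc)  # raises ParseError if not well-formed
for person in root.findall("person"):
    print(person.find("first").text, person.find("last").text)
```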
Some care needs to be put into "well-parenthesizing" the tags; for example, this is incorrect and not well-formed XML:
<foo><bar></foo></bar>
because the inner elements must close before the outer elements.
Elements cannot appear within opening or closing tags; they must appear between tags. This is not well-formed XML:
<foo <bar/>></foo>
At the top-level, a well-formed XML document must have exactly one element. Not zero, not two, exactly one. This is not well-formed XML:
<person>
  <first>(some content)</first>
  <last>(some other content)</last>
</person>
<person>
  <first>(some content)</first>
  <last>(some other content)</last>
</person>
<person>
  <first>(some content)</first>
  <last>(some other content)</last>
</person>
Element names can generally contain alphanumerical characters, dashes, points, and underscores. The first character of an element name has to be a letter or an underscore. The use of colons is restricted to the support of namespaces, which are explained below.