263-3010-00: Big Data
Section 7
Data Models and Validation
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/28/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Even though the data is physically stored as bits - or as text directly encoded to bits in the case of XML and JSON - it would not be appropriate to manipulate the data directly at the bit or text level. Abstracting this away is in the spirit of data independence, and doing so is called data modeling.
A data model is an abstract view over the data that hides the way it is stored physically. For example, a CSV file should be abstracted logically as a table. This is because CSV enforces at least relational integrity as well as atomic integrity. As for domain integrity, it can be considered implicit, since an entire column can be interpreted as a string in the case of incompatible literals.
The JSON Information Set¶
Obviously, a model based on tables is not appropriate for JSON. This is because, unlike CSV, JSON enforces neither relational integrity, nor atomic integrity, nor domain integrity. In fact, we will see that the appropriate abstraction for any JSON document is a tree.
The nodes of that tree, which are JSON logical values, are naturally of six possible kinds: the six syntactic building blocks of JSON.
These are the four leaves corresponding to atomic values:
Strings
Numbers
Booleans
Nulls
As well as two intermediate nodes (possibly leaves if empty):
Objects (String-to-value map)
Arrays (List of values)
Formally, and not only for JSON but for all tree-based models, these nodes are generally called information items and form the logical building blocks of the model, called the information set.
Let us take the following example.
{
  "foo" : true,
  "bar" : [
    {
      "foobar" : "foo"
    },
    null
  ]
}
It is possible to draw this document as a logical tree, where each information item (node) corresponds to each one of the values present in the document: two objects, one array, and three atomics. Note that the information items are the rectangles; the ovals are not information items but labels on the edges connecting the information items. The ovals correspond to object keys.
It is possible to do so for any JSON document. Thus, we have now obtained a similar logical / physical mapping to what we previously did with CSV and tables, except that this is now with JSON and trees.
When a JSON document is being parsed by a JSON library, this tree is built in memory, the edges being pointers, and further processing will be done on the tree and not on the original syntax.
Conversely, it is possible to take a tree and output it back to JSON syntax. This is called serialization.
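For instance, with Python's standard json library (used here purely as an illustration), both directions look like this:

```python
import json

# Parsing the example document builds the tree in memory; serialization
# turns the tree back into JSON syntax.
text = '{ "foo" : true, "bar" : [ { "foobar" : "foo" }, null ] }'
tree = json.loads(text)                 # parsing: syntax -> tree
assert tree["foo"] is True
assert tree["bar"][0]["foobar"] == "foo"
assert tree["bar"][1] is None
serialized = json.dumps(tree)           # serialization: tree -> syntax
assert json.loads(serialized) == tree   # the round-trip preserves the tree
```

Further processing (navigation, updates) then happens on the in-memory tree, not on the original text.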
The XML Information Set¶
It is possible to do the same logical abstraction, also based on trees, with XML, where information items correspond to elements, attributes, text, etc:
A fundamental difference between JSON trees and XML trees is that for JSON, the labels (object keys) are on the edges connecting an object information item to each one of its children information items. In XML, the labels (these would be element and attribute names) are on the nodes (information items) directly. Another way to say it is that a JSON information item does not know with which key it is associated in an object (if at all), while an XML element or attribute information item knows its name.
Let us dive more into the details. In XML, there are many more information items:
Document information items
Element information items
Attribute information items
Character information items
Comment information items
Processing instruction information items
Namespace information items
Unexpanded entity reference information items
DTD information items
Unparsed entity information items
Notation information items
We only go into the most important ones from a data perspective here: documents, elements, attributes, and characters. We will leave comments and namespaces aside to keep things simple, even though we saw what they look like syntactically, and will also skip all other information items, for which we have not studied the syntax.
Let us take this example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE metadata>
<metadata>
  <title language="en" year="2019">Systems Group</title>
  <publisher>ETH Zurich</publisher>
</metadata>
Formally, the XML Information Set is defined in a standard of the World Wide Web consortium (W3C). Each kind of information item has specific properties, and some of these properties link it to other information items, building the tree.
Let us go through the information items for the above document and list some of its properties.
Document information item¶
The document information item is just the root of an XML tree. It does not correspond to anything syntactically or, if anything, it would correspond to the XML and doctype declarations.
The document information item has two important properties:
[children] Element information item metadata
[version] 1.0
Element information items¶
There is one element information item for each element. Here we have three.
The element information item metadata has four important properties:
[local name] metadata
[children] Element information item title, element information item publisher
[attributes] (empty)
[parent] Document information item
The element information item title has four important properties:
[local name] title
[children] Character information items (Systems Group)
[attributes] Attribute information item language, Attribute information item year
[parent] Element information item metadata
The element information item publisher has four important properties:
[local name] publisher
[children] Character information items (ETH Zurich)
[attributes] (empty)
[parent] Element information item metadata
Attribute information items¶
There is one attribute information item for each attribute. Here we have two.
The attribute information item language has three important properties:
[local name] language
[normalized value] en
[owner element] Element information item title
The attribute information item year has three important properties:
[local name] year
[normalized value] 2019
[owner element] Element information item title
Character information items¶
There are as many character information items as there are characters in text (between tags). For example, for the S in Systems Group:
[character code] the unicode point for the letter S
[parent] Element information item title
It is sometimes simpler to group them into a single (non-standard) "text information item":
[characters] S y s t e m s   G r o u p
[parent] Element information item title
The entire tree¶
All information items built previously can finally be assembled and drawn as a tree. The edges, corresponding to the children and parent (or owner element) properties, will correspond to pointers in memory when the tree is built by the XML library:
When an XML document is being parsed by an XML library, this tree is built in memory, the edges being pointers, and further processing will be done on the tree and not on the original syntax.
Conversely, it is possible to take a tree and output it back to XML syntax. This is called serialization.
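As an illustration, Python's standard xml.etree.ElementTree library can parse this document into a tree and serialize it back (the doctype and XML declaration are omitted for simplicity):

```python
import xml.etree.ElementTree as ET

# Parsing builds an element tree in memory; serialization writes it back out.
text = ('<metadata><title language="en" year="2019">Systems Group</title>'
        '<publisher>ETH Zurich</publisher></metadata>')
root = ET.fromstring(text)                  # parsing: syntax -> tree
title = root.find("title")
assert title.get("language") == "en"        # attribute information items
assert title.text == "Systems Group"        # character information items
assert root.find("publisher").text == "ETH Zurich"
serialized = ET.tostring(root, encoding="unicode")   # serialization
assert ET.fromstring(serialized).tag == "metadata"
```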
Validation¶
Once documents, JSON or XML, have been parsed and logically abstracted as a tree in memory, the natural next step is to check for further structural constraints.
For example, you could want to check whether your JSON documents all associate key "name" with a string, or if they all associate "years" with an array of positive integers. Or you could want to check whether your XML documents all have root elements called "persons," and whether the root element in each document has only children elements called "person," all with an attribute "first" and an attribute "last".
This might remind the reader of schemas in a relational database, but with a major difference: in a relational database, the schema of a table is defined before any data is populated into the table. Thus, the data in the table is guaranteed, at all times, to fulfill all the constraints of the schema. The exact term is that the data is guaranteed to be valid against the schema, because the schema was enforced at write time (schema on write).
But in the case of collections of JSON and XML documents, it is the other way around. A collection of JSON or XML documents can exist out there without any schema and contain arbitrary structures. Validation happens "ex post," that is, only after reading the data (schema on read).
Thus, it means that JSON and XML documents undergo two steps:
a well-formedness check: attempt to parse the document and construct a tree representation in memory
(if first step succeeded) a validation check given a specific schema
Thus, a text document can be either not well-formed, or well-formed and invalid against a specific schema, or well-formed and valid against a specific schema.
Note that, unlike well-formedness, validation is schema-dependent: a given well-formed document can be valid against schema A and invalid against schema B.
Validation is often performed on much more than a document at a time: an entire collection. Thus, we distinguish between heterogeneous collections, whose documents are not valid against any particular schema, and homogeneous collections, whose documents are all valid against a specific schema.
To give an intuitive feeling, this is what a non-validated, "messy" document looks like. Notice in particular how values in the same array are of different types.
{
  "a" : 1,
  "b" : [
    "foo",
    true,
    null,
    {
      "foo" : "bar"
    }
  ],
  "c" : {
    "d" : { "foo" : null },
    "e" : [ 1, 2, [ 3, 4 ] ],
    "f" : 3.14
  }
}
This, on the other hand, is a document that has more structure and could easily be validated against an appropriate schema: for example, the array associated with key "c" only has object elements, which all have a "foo" key associated with a string and an optional "bar" key associated with an array of integers.
{
  "a" : 1,
  "b" : true,
  "c" : [
    { "foo" : "bar1", "bar" : [ 1, 2 ] },
    { "foo" : "bar2", "bar" : [ 3, 4, 5 ] },
    { "foo" : "bar3" }
  ]
}
Item types¶
A fundamental aspect of validation is the type system. A well-designed type system, in turn, allows for storing the data in much more efficient, binary formats tailored to the model.
There are many different technologies and type systems for arborescent, denormalized data, well beyond only JSON and XML. But there is good news: all these type systems are very similar to each other and have a lot in common. Therefore, in this section, we will present an overview of an agnostic type system that is very representative of common practice. After reading this, learning a new validation or storage format technology will then amount to reading the documentation and immediately recognizing familiar patterns, to then focus on the small deviations and peculiarities of each specific technology.
The first aspect that almost all, if not all, type systems have in common is the distinction between atomic types and structured types. In fact, this distinction is so universal that even object-oriented languages like Java and Python make it.
The distinction should also not really come as a surprise to the reader, because we have seen it several times.
Atomic types¶
Atomic types correspond to the leaves of a tree data model: these are types that do not contain any further nesting.
The kinds of atomic types available are also relatively standard and common to most technologies.
Also, all atomic types have in common that they have a logical value space and a lexical value space. The logical value space corresponds to the actual logical abstraction of the type (ex: a mathematical set of integers), while the lexical value space corresponds to the representation of logical values in a textual format (such as the decimal, binary, or hexadecimal representation of an integer).
An atomic type also has a (not necessarily injective) mapping from its lexical value space to its logical value space (ex: mapping the hexadecimal literal x10 to the mathematical integer sixteen), and often, a canonical mapping from the logical value space to the lexical value space (ex: mapping the mathematical integer sixteen to its canonical decimal representation 16).
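Python's built-in integer conversions illustrate these mappings:

```python
# The lexical-to-logical mapping need not be injective: several lexical
# forms can denote the same logical integer, while the canonical mapping
# picks one lexical form per logical value.
assert int("10", 16) == 16        # hexadecimal lexical value -> sixteen
assert int("016") == int("16")    # leading zero: same logical value
assert str(16) == "16"            # canonical serialization, no leading zeros
```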
Atomic types can be in a subtype relationship: a type is a subtype of another type if its logical value space is a subset of that of the other. Normally, the same then holds for the lexical value spaces, and the related mappings should be consistent with each other. The subtype relationship over types organizes them into a hierarchy, called the type hierarchy. The type hierarchy gives a good visual to get a quick overview of all available types in a specific technology.
Strings¶
Strings are simply finite sequences of (usually printable) characters. Formally, strings form a monoid under concatenation, where the neutral element is the empty string.
These are three examples:
foo
Zurich
Ilsebill salzte nach.
Often, the lexical representation of a string is double-quoted, sometimes also single-quoted.
"foo"
"Zurich"
'Ilsebill salzte nach.'
The difference between the lexical representation and the logical value of a string becomes immediately apparent when escaping is used. For example, the lexical representation
"\\\""
corresponds to the (logical) two-character string \" (a backslash followed by a quote).
In "pure computer science" textbooks, strings are often presented as structured values rather than as atomic values because of their complexity on the physical layer. However, for us data scientists, strings are atomic values.
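The following Python sketch makes the distinction concrete:

```python
import json

# The logical string consists of two characters: a backslash and a quote.
logical = '\\"'
assert len(logical) == 2
lexical = json.dumps(logical)      # the six-character lexical form "\\\""
assert lexical == '"\\\\\\""'
assert json.loads(lexical) == logical   # parsing undoes the escaping
```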
Numbers¶
Integers¶
Integers correspond to finite cardinalities (counting) as well as their negative counterparts. These are decimal numbers without anything after the decimal period, or, equivalently, fractions with a denominator of 1.
In older programming languages, support for integers used to be bounded. This is why classical types, still in use today, correspond to 8-bit (often called byte), 16-bit (often called short), 32-bit (often called int), and 64-bit (often called long) integers. This means that, expressed in base 2, they use 8, 16, 32, or 64 bits (binary digits).
However, in modern databases, it is customary to support unbounded integers. In XML, the corresponding type is simply called integer, as opposed to int. Engines can optimize computations for small integers, but might become less efficient with very large integers.
The other, restricted integer types are called subtypes of the integer type, because their logical value space is a subset of the set of all integers.
Other commonly found integer subtypes include positive integers, non-negative integers (also called unsigned integers), negative integers, and non-positive integers.
The lexical representation of integers is usually done in base 10, in the familiar decimal system, even though base 2, 8, or 16 can be found too. Leading 0s are optional, but when a logical integer value is canonically serialized, it is done without leading 0s.
Note that the exact names of the types can vary. Some systems might use integer for 32-bit integers, or int for the entire value space.
Decimals¶
Decimals correspond to real numbers that can be written as a finite sequence of digits in base 10, with an optional decimal period. Equivalently, these are fractions that can be expressed with a power of 10 in the denominator.
Many modern databases or storage formats support the entire logical decimal value space with no restriction on how large, small or precise a decimal number is.
The lexical representation of decimals is usually done in base 10; it is not common to use other bases for decimals.
Floating-point¶
Support for the entire decimal value space can be costly in performance. In order to address this issue, a floating-point standard (IEEE 754) was invented and is still very popular today. These are the types known as float and double in many programming languages. They can be processed natively by processors.
Floating-point numbers are limited both in precision and magnitude (both upper and lower) in order to fit in 32 bits (float) or 64 bits (double). Floats have about 7 digits of precision and their absolute value can be between roughly $10^{-38}$ and $10^{38}$, while doubles have about 15 digits of precision and their absolute value can be between roughly $10^{-308}$ and $10^{308}$.
Double and float types also cover additional special values: NaN (not a number), positive and negative infinity, and negative zero (in addition to the "positive" 0).
Note that the exact names of the types can vary. Some systems might use float for double, or decimal for double, etc.
The lexical representation of floats and doubles often uses the scientific notation:
-12.34E-56
And the lexical values corresponding to the special logical values look like so:
NaN
INF
-INF
-0
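In Python, for instance, these special values behave as follows:

```python
import math

# The special IEEE 754 values and their behavior.
nan = float("NaN")
assert nan != nan                    # NaN is unequal even to itself
inf = float("INF")
assert inf > 1e308                   # larger than any finite double
neg_zero = -0.0
assert neg_zero == 0.0               # negative zero compares equal to zero
assert math.copysign(1.0, neg_zero) == -1.0   # but its sign bit is set
```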
Booleans¶
The logical value space of the Boolean type is made of two values: true and false. As in NoSQL queries, two-valued logic is typically assumed.
The corresponding lexical values are typically:
true
false
If an unknown value is needed in the spirit of three-valued logic, null can be used, or the value can be left absent.
Dates and times¶
Dates and times are a very important component of databases because they are heavily needed by users, albeit often neglected or forgotten by some formats.
Dates are commonly expressed in the Gregorian calendar (with some technologies possibly supporting more) with a year (BC or AD), a month, and a day of the month.
Times are expressed in the sexagesimal (base 60) system with hours, minutes, and seconds, where the seconds commonly go all the way to microseconds (six digits after the decimal period).
Datetimes are expressed with a year, a month, a day of the month, hours, minutes, and (decimal) seconds.
Some technologies additionally support a timestamp type, which corresponds to a datetime with a timezone, but where datetimes are treated as equivalent if they express the same point in time (formally, it is a timezoned datetime quotiented by an equivalence relation). Timestamp values are typically stored as longs (64-bit integers) expressing the number of milliseconds elapsed since January 1, 1970 by convention.
The lexical value can also vary, although many technologies follow the ISO 8601 standard, where lexical values look like so (with many parts optional):
2022-08-07T14:18:00.123456+02:00
2022-08-07
14:18:00.123456
14:18:00.123456Z
The names of date, time, datetime, and timestamp types vary largely between technologies and formats. Here we will focus on the standardized XML schema types, which are also the same as in JSound and the JSONiq language that we will study later. The types are called date, time, dateTime, and dateTimeStamp. XML schema additionally supports values made of just a year (gYear), just a month (gMonth), just a day of the month (gDay), a year and a month (gYearMonth), or a month and a day of the month (gMonthDay). XML schema, JSound, and JSONiq follow the ISO 8601 standard.
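As an illustration, Python's datetime module parses these ISO 8601 lexical values (the trailing-Z form requires Python 3.11+ and is omitted here):

```python
from datetime import date, datetime, time

# fromisoformat maps ISO 8601 lexical values to logical date/time values.
dt = datetime.fromisoformat("2022-08-07T14:18:00.123456+02:00")
assert (dt.year, dt.month, dt.day) == (2022, 8, 7)
assert dt.utcoffset().total_seconds() == 7200       # the +02:00 timezone
assert date.fromisoformat("2022-08-07") == dt.date()
assert time.fromisoformat("14:18:00.123456").microsecond == 123456
```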
Durations¶
Durations can be of many different kinds, generally a combination of years, months, days, hours, minutes, and (possibly with decimals) seconds.
What is important to understand is that there is a "wall" between months and days: what is the duration "1 month and 1 day"? It could be 29, 30, 31, or 32 days, and it should thus be avoided. Hence, for the sake of being unambiguous, most durations involve either only years and months, or only days, hours, minutes, and seconds.
The lexical representation can vary, but there is a standard defined by ISO 8601 as well, starting with a P and prefixing sub-day parts with a T.
For example 2 years and 3 months:
P2Y3M
4 days, 3 hours, 2 minutes, and 1.123456 seconds:
P4DT3H2M1.123456S
3 hours, 2 minutes, and 1.123456 seconds:
PT3H2M1.123456S
XML schema, JSound, and JSONiq, used in this course, follow the ISO 8601 standard.
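As a sketch, ISO 8601 durations of the above shapes can be parsed with a regular expression in Python; the pattern below is simplified (it ignores negative durations and week notation):

```python
import re

# Simplified ISO 8601 duration pattern: P, optional Y/M/D parts,
# then an optional T section with H/M/S parts.
PATTERN = re.compile(
    r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
)

def parse_duration(text):
    m = PATTERN.match(text)
    if not m:
        raise ValueError("not an ISO 8601 duration: " + text)
    years, months, days, hours, minutes, seconds = m.groups()
    return {
        "years": int(years or 0), "months": int(months or 0),
        "days": int(days or 0), "hours": int(hours or 0),
        "minutes": int(minutes or 0), "seconds": float(seconds or 0),
    }

assert parse_duration("P2Y3M")["months"] == 3
assert parse_duration("P4DT3H2M1.123456S")["days"] == 4
assert parse_duration("PT3H2M1.123456S")["seconds"] == 1.123456
```

Note how the T prefix disambiguates the two meanings of M (months vs. minutes).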
Binary data¶
Binary data is, logically, simply a sequence of bytes.
There are two main lexical representations used in data: hexadecimal and base64.
Hexadecimal expresses the data with two hexadecimal digits per byte, like so:
0123456789ABCDEF
which would correspond to the bit sequence:
0000000100100011010001010110011110001001101010111100110111101111
Base 64, formally, does the same but in base 64, which "wastes" less lexical space in the text. It does so by encoding the bits six by six, encoding each sequence of six bits with one base-64 digit. Equivalently, it means that each group of three bytes is encoded with four base-64 digits. The base-64 digits are, by convention, all uppercase Latin letters, all lowercase Latin letters, the decimal digits, as well as + and /. = is used for padding if the length of the base-64 string is not a multiple of four.
This is the base-64 lexical representation of the same binary data as above, which is textually more compact than the hexadecimal version:
ASNFZ4mrze8=
In XML schema, JSound, and JSONiq, this is covered with two types: hexBinary and base64Binary.
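Python's standard base64 and binascii modules can be used to verify this correspondence:

```python
import base64
import binascii

# The same 8 bytes in both lexical representations.
data = binascii.unhexlify("0123456789ABCDEF")
assert data.hex().upper() == "0123456789ABCDEF"     # hexBinary: 2 digits per byte
assert base64.b64encode(data) == b"ASNFZ4mrze8="    # base64Binary: more compact
assert base64.b64decode("ASNFZ4mrze8=") == data
```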
Note that the string type could also be considered to provide an additional lexical representation of binary data (ex: with the UTF-8 encoding), although this can cause issues with non-printable characters.
Null¶
Many technologies and formats also provide support for null values, although how this is done largely varies.
Some technologies allow null to appear as a (valid) value for any type. A schema can either allow, or disallow the null type. Often, the terminology used is that a type can be nullable (or nillable) or not.
Other technologies consider that there is a singleton-valued null type, containing only the null value with the lexical representation
null
Allowing nulls is done by taking the union of the desired type with the null type.
Yet other technologies consider null, when it appears as a value in an object, to be semantically equivalent to the absence of a value; allowing or disallowing null is then achieved by (ex: in JSON) making the field optional or required.
It is important to understand that the latter technologies are unable to distinguish between the following two JSON objects, so that information in the input dataset is lost upon validating:
{}
{ "foo" : null }
This can be problematic with datasets where this distinction is semantically relevant.
XML also supports null values, but calls them "nil" and does so with a special attribute and no content rather than with a lexical representation.
<foo
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:nil="true"/>
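With Python's xml.etree.ElementTree, for example, the nil attribute can be detected as follows:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in Clark notation,
# {namespace-uri}local-name.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
doc = '<foo xmlns:xsi="%s" xsi:nil="true"/>' % XSI
root = ET.fromstring(doc)
assert root.get("{%s}nil" % XSI) == "true"
assert root.text is None        # a nilled element has no content
```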
Structured types¶
Most technologies and formats offer four kinds of structured types (or a subset thereof).
Lists¶
Lists correspond to JSON arrays and are ordered sequences of (atomic or structured) values.
With a schema, it is possible to define types that are, for example, lists of integers, or lists of strings, or lists of lists of doubles, etc.
Records¶
Records, or structs, correspond to JSON objects and are maps from strings to values.
With a schema, it is possible to restrict the type accepted for every key. For example, name must be a string, birthday must be a date, etc.
Maps¶
Maps (not to be confused with records, which are similar) are maps from any atomic value to any value, i.e., they generalize objects to keys that are not necessarily strings (ex: numbers, dates, etc).
With a schema, it is possible to restrict the type of the keys, as well as the type of the values. However, unlike records, the type of the values must be the same for all keys. For example, a map from dates to numbers.
Sets¶
Sets are rarer and supported by rather few technologies and formats, but they exist.
Sets are like lists, but without any specific ordering, and without duplicated values.
XML elements and attributes¶
XML schema stands apart from most other technologies and formats in that it does not offer specific support for records and maps; it offers some limited support for lists, but considers them to be simple types, which are "in between" atomic types and structured types. In XML schema, structure is obtained, instead, with elements and attributes, and the machinery for elements and attributes is highly specific to XML. Elements and attributes can in particular easily emulate records and lists, but are more powerful (i.e., elements can be repeated, and intermixing text with elements is allowed).
Most formats besides XML are "JSON-like" from a modeling perspective and structure their data at the very minimum with lists and records.
Type names¶
For convenience, we provide below a summary of many types over various technologies and languages and how they correspond to each other. The most important thing to see there is that on the high level, atomic types are almost always the same everywhere, even though the names can vary.
Sequence types¶
Cardinality¶
In the context of data querying, but also of nested lists and arrays, items (single values) rarely appear alone. They often appear as a sequence of many values. Thus, many type systems give options regarding the number of occurrences of items in a sequence.
There are four main occurrence indicators:
just once (often implicit): exactly one item
optional: zero or one item. Often represented with a question mark (?).
any occurrence: zero, one, or more items. Often represented with a Kleene star (*).
at least once: one or more items. Often represented with a Kleene plus (+).
The symbols commonly used (?, *, +) correspond to those used in regular expressions.
There is a variety of other ways to specify such indicators: some technologies use keywords such as "repeated," "required," "optional," etc. Finally, some technologies even allow specifying a precise interval (ex: between 2 and 5 items).
Collections vs. nested lists¶
There are different kinds of sequences of items. It is common to distinguish between collections of items and lists (or arrays) of items.
A collection of items is on the outer level, and can be massively large (billions, trillions of items). A collection of items corresponds to a relational table, i.e., a relational table can be seen as a collection of flat object items in the context of this chapter.
A list (or array) of items, however, usually refers to a nested structure, for example an array nested inside a document or object. Such lists of items are usually restricted in size for reasons of performance and scalability. Many technologies do not allow items to exceed a two-digit number of Megabytes (ex: 10 MB, 16 MB, etc), or, if they do allow larger items, might become slow and inefficient with them; a nested list might hold a million values at best.
It is thus important to keep this subtle difference in mind, in particular, do not confuse a collection of integers with a collection that contains a single array of integers.
JSON validation¶
Let us now look at how to use what we have learned so far to validate JSON documents. First, it is important to remember that we can only attempt to validate well-formed JSON documents. If a document cannot be parsed as JSON, and thus, cannot be represented as a tree in memory, then validation makes no sense on it.
Validating flat objects¶
JSound is a schema language that was designed to be simple for 80% of the cases, making it particularly suitable in a teaching environment. It is independent of any programming language.
The compact syntax of JSound, which we focus on here, is very close to what the original documents look like. Let us take an example:
{
  "name" : "Einstein",
  "first" : "Albert",
  "age" : 142
}
Let us say we want to validate the above document, in the sense that "name" and "first" should be strings, and "age" should be an integer. The corresponding JSound schema looks like so:
{
  "name" : "string",
  "first" : "string",
  "age" : "integer"
}
"string" can be replaced with any other named type, in particular taken from the table shown in the former section. Let us list them here:
Strings: string, anyURI (for strings containing a URI)
Numbers: decimal, integer, float, double, long, int, short, byte, negativeInteger, positiveInteger, nonNegativeInteger, nonPositiveInteger, unsignedByte, unsignedShort, unsignedInt, unsignedLong
Dates and times: date, time, dateTime, gYearMonth, gYear, gMonth, gDay, gMonthDay, dateTimeStamp
Time intervals: duration, yearMonthDuration, dayTimeDuration
Binary types: hexBinary, base64Binary
Booleans: boolean
Nulls: null
This is a standardized list of types defined by the W3C.
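To make the semantics of such a flat schema concrete, here is an illustrative Python sketch (not the actual JSound implementation; only three atomic types are covered, and the schema is treated as closed with optional keys):

```python
# Map each JSound-style type name to a check on the parsed JSON value.
CHECKS = {
    "string": lambda v: type(v) is str,
    "integer": lambda v: type(v) is int,
    "boolean": lambda v: type(v) is bool,
}

def is_valid_flat(instance, schema):
    if set(instance) - set(schema):      # closed schema: no extra keys
        return False
    # every key that is present must carry a value of the declared type
    return all(CHECKS[t](instance[k])
               for k, t in schema.items() if k in instance)

schema = {"name": "string", "first": "string", "age": "integer"}
assert is_valid_flat({"name": "Einstein", "first": "Albert", "age": 142}, schema)
assert is_valid_flat({"age": 142}, schema)       # keys are optional by default
assert not is_valid_flat({"age": "142"}, schema) # wrong type
assert not is_valid_flat({"job": "physicist"}, schema)  # extra key: invalid
```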
JSON schema is another technology for validating JSON documents. A JSON schema against which the same document as above is valid would be:
{
  "type" : "object",
  "properties" : {
    "name" : { "type" : "string" },
    "first" : { "type" : "string" },
    "age" : { "type" : "number" }
  }
}
The available JSON schema types are string, number, integer, boolean, null, array, and object. This closely matches the original JSON syntax, with the only exception that numbers have an additional integer subtype. The type system of JSON schema is thus less rich than that of JSound, but extra checks can be done with so-called formats, which include date, time, duration, email, and so on, including generic regular expressions. Like JSound, JSON schema also allows restricting the length of a string, constraining numbers to intervals, etc.
Requiring the presence of a key¶
By default, the presence of a key is optional, so that each one of the following objects is also valid against the previous schema:
{ "name" : "Einstein" }
{ "first" : "Albert" }
{ "age" : 142 }
{ "name" : "Einstein", "age" : 142 }
{ }
The presence of a key can be required by adding an exclamation mark, like so:
{
  "!name" : "string",
  "!first" : "string",
  "age" : "integer"
}
which would make the following documents valid:
{ "name" : "Einstein", "first" : "Albert", "age" : 142 }
{ "name" : "Einstein", "first" : "Albert" }
This is the equivalent JSON schema, which uses a "required" property associated with the list of required keys to express the same:
{
  "type" : "object",
  "required" : [ "name", "first" ],
  "properties" : {
    "name" : { "type" : "string" },
    "first" : { "type" : "string" },
    "age" : { "type" : "number" }
  }
}
Open and closed object types¶
In JSound compact syntax, extra keys are forbidden, i.e., this document is valid against neither of the previous two schemas:
{
  "name" : "Einstein",
  "first" : "Albert",
  "profession" : "physicist"
}
The schema is said to be closed. There are ways to define JSound schemas to allow arbitrary additional keys (open schemas), with a more verbose syntax.
Unlike JSound, in JSON schema, extra properties are allowed by default, i.e., this document is also valid against the previous schemas:
{
  "name" : "Einstein",
  "first" : "Albert",
  "profession" : "physicist"
}
JSON schema then allows forbidding extra properties with the "additionalProperties" property, like so:
{
  "type" : "object",
  "required" : [ "name", "first" ],
  "properties" : {
    "name" : { "type" : "string" },
    "first" : { "type" : "string" },
    "age" : { "type" : "number" }
  },
  "additionalProperties" : false
}
Nested structures¶
What about nested structures? Let us consider this document, which contains a nested array of integers.
{
  "numbers" : [ 1, 2, 6, 2, 7, 1, 57, 4 ]
}
This document is valid against the following schema (where of course, "integer" can be replaced with any other type):
{ "numbers" : [ "integer" ] }
This also works with multiple dimensions:
{ "matrix" : [ [ "decimal" ] ] }
for validating matrices:
{ "matrix" : [ [ 0, 1 ], [ 1, 0 ] ] }
With the JSound compact syntax, object and array types can nest arbitrarily:
{
  "datapoints" : [
    {
      "features" : [ "double" ],
      "label" : "integer"
    }
  ]
}
The following document is valid against the above schema:
{
  "datapoints" : [
    {
      "features" : [ 1.2, 3.4, 5.6 ],
      "label" : 0
    },
    {
      "features" : [ 9.3, 2.6, 2.4 ],
      "label" : 1
    },
    {
      "features" : [ 1.1, 4.3, 6.5 ],
      "label" : 0
    }
  ]
}
The same document can also be validated against a more complex JSON schema with nested arrays and objects, like so:
{
"type" : "object",
"properties" : {
"datapoints" : {
"type" : "array",
"items" : {
"type" : "object",
"properties" : {
"features" : {
"type" : "array",
"items" : {
"type" : "number"
}
},
"label" : {
"type" : "integer"
}
}
}
}
}
}
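A recursive validator for the small subset of JSON Schema used above ("type", "properties", "items") can be sketched in a few lines of Python. This is an illustration, not a full implementation; real-world code would use a dedicated validation library:

```python
def validates(instance, schema):
    """Recursively check an instance against a small subset of JSON Schema
    ('type', 'properties', 'items'). A sketch, not a full validator."""
    t = schema.get("type")
    if t == "object":
        if not isinstance(instance, dict):
            return False
        # missing properties are allowed unless listed in "required"
        return all(validates(instance[k], s)
                   for k, s in schema.get("properties", {}).items()
                   if k in instance)
    if t == "array":
        if not isinstance(instance, list):
            return False
        item_schema = schema.get("items", {})
        return all(validates(v, item_schema) for v in instance)
    if t == "number":
        return isinstance(instance, (int, float)) and not isinstance(instance, bool)
    if t == "integer":
        return isinstance(instance, int) and not isinstance(instance, bool)
    if t == "string":
        return isinstance(instance, str)
    if t == "boolean":
        return isinstance(instance, bool)
    return True  # no type constraint: anything goes

schema = {
    "type": "object",
    "properties": {
        "datapoints": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "features": {"type": "array", "items": {"type": "number"}},
                    "label": {"type": "integer"},
                },
            },
        }
    },
}

doc = {"datapoints": [{"features": [1.2, 3.4, 5.6], "label": 0},
                      {"features": [9.3, 2.6, 2.4], "label": 1}]}
print(validates(doc, schema))  # True
print(validates({"datapoints": [{"features": ["oops"], "label": 0}]}, schema))  # False
```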
As we will see shortly, the type "shape" of a collection of documents captured with a compact JSound schema is of particular relevance in the context of efficient data processing, and a collection of valid JSON documents with such a shape is known as a data frame.
Every schema can be given a name, turning it into a type. When a document is valid against a schema, it is typical to also annotate the document, which means that its tree representation in memory contains additional type information and values are stored natively in their type, enabling efficient processing and space efficiency.
JSound allows for the definition not only of arbitrary array and object types as shown above, but also of additional atomic types, obtained by imposing constraints on existing types (e.g., airport codes by restricting the length of a string to 3 and requiring all three characters to be uppercase letters, shoe sizes with intervals, etc.). These are called user-defined types.
Primary key constraints, allowing for null, default values¶
There are a few more features available in the compact JSound syntax (not in JSON schema) with the special characters @, ?, and =:
{
"datapoints" : [
{
"@id" : "int",
"features" : [ "double" ],
"label?" : "integer",
"set" : "string=training"
}
]
}
The question mark (?) allows for null values (which are not the same as absent values). Technically, it creates a union of the specified type with the null type.
The at sign (@) indicates that one or more fields form a primary key for the list of objects that are members of the same array. In this case, it means all id fields must be different within the datapoints array of each document.
The equal sign (=) is used to indicate a default value that is automatically populated if the value is absent. In this case, if the field "set" is missing, then upon annotating the document after its validation, it will be added with a value "training".
The following document is valid against the above schema. Note that some values are quoted, which does not matter for validation: validation only checks whether lexical values are part of the type's lexical space.
{
"datapoints" : [
{
"id" : "10",
"features" : [ 1.2, 3.4, 5.6 ],
"label" : null,
"set" : "training"
},
{
"id" : "11",
"features" : [ "9.3", 2.6, 2.4 ],
"label" : 1
},
{
"id" : "12",
"features" : [ 1.1, 4.3, 6.5 ],
"label" : "0",
"set" : "test"
}
]
}
And, after annotating it, it will look like so (except it will be represented efficiently in memory, and no longer in actual JSON syntax).
{
"datapoints" : [
{
"id" : "10",
"features" : [ 1.2, 3.4, 5.6 ],
"label" : null,
"set" : "training"
},
{
"id" : "11",
"features" : [ 9.3, 2.6, 2.4 ],
"label" : 1,
"set" : "training"
},
{
"id" : "12",
"features" : [ 1.1, 4.3, 6.5 ],
"label" : 0,
"set" : "test"
}
]
}
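The effect of validation and annotation described above can be sketched in Python. The helper below is hypothetical (not part of any JSound implementation); it mirrors the field names of the example schema: it casts lexical values to their native types, fills in the default for "set", allows null labels, and checks that ids are unique:

```python
def annotate_datapoints(doc):
    """Hypothetical sketch of annotation against the example JSound schema:
    cast lexical values to native types, apply the = default, honor the
    ? null marker, and enforce the @ primary key."""
    seen_ids = set()
    for dp in doc["datapoints"]:
        dp["id"] = int(dp["id"])                          # "int" type: native integer
        dp["features"] = [float(f) for f in dp["features"]]  # "double" items
        if dp.get("label") is not None:                   # ? marker: null is allowed
            dp["label"] = int(dp["label"])
        dp.setdefault("label", None)
        dp.setdefault("set", "training")                  # = marker: default value
        if dp["id"] in seen_ids:                          # @ marker: uniqueness
            raise ValueError(f"duplicate id {dp['id']}")
        seen_ids.add(dp["id"])
    return doc

doc = {"datapoints": [
    {"id": "10", "features": [1.2, 3.4, 5.6], "label": None, "set": "training"},
    {"id": "11", "features": ["9.3", 2.6, 2.4], "label": 1},
]}
annotated = annotate_datapoints(doc)
print(annotated["datapoints"][1])
# {'id': 11, 'features': [9.3, 2.6, 2.4], 'label': 1, 'set': 'training'}
```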
Accepting any values¶
Accepting any values in JSound can be done with the type "item", which contains all possible values, like so:
{
"!name" : "item",
"!first" : "item",
"age" : "number"
}
In JSON schema, in order to declare a field to accept any values, you can use either true or an empty object in lieu of the type, like so:
{
"type" : "object",
"required" : [ "name", "first" ],
"properties" : {
"name" : {},
"first" : true,
"age" : { "type" : "number" }
},
"additionalProperties" : false
}
The following document validates successfully against the above JSound and JSON schemas:
{
"name" : [ "Ein", "st", "ein" ],
"first" : "Albert",
}
JSON schema additionally makes it possible to use false to forbid a field:
{
"type" : "object",
"properties" : {
"name" : { "type" : "string" },
"first" : false
}
}
Making this document invalid:
{
"name" : "Einstein",
"first" : "Albert",
}
Type unions¶
In JSON schema, it is also possible to combine validation checks with Boolean combinations.
First, disjunction (logical or) is done with
{
"anyOf" : [
{ "type" : "string" },
{ "type" : "array" }
]
}
JSound schema allows defining unions of types with the vertical bar inside type strings, like so:
"string|array"
Type conjunction, exclusive, or negation¶
In JSON schema only (not in JSound), it is also possible to do a conjunction (logical and) with
{
"allOf" : [
{ "type" : "string", "maxLength" : 3 },
{ "type" : "string", "minLength" : 2 }
]
}
as well as exclusive or (xor):
{
"oneOf" : [
{ "type" : "number", "minimum" : 2 },
{ "type" : "number", "multipleOf" : 2 }
]
}
as well as negation:
{
"not" : { "type" : "array" }
}
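The four combinators map directly onto Boolean logic: anyOf is a disjunction, allOf a conjunction, oneOf requires exactly one match, and not is a negation. A Python sketch, supporting only a few base checks, might look like this (again a simplified stand-in, not a real validator):

```python
def check(instance, schema):
    """Evaluate the Boolean combinators of JSON Schema ('anyOf', 'allOf',
    'oneOf', 'not') over a tiny set of base checks. A sketch only."""
    if "anyOf" in schema:
        return any(check(instance, s) for s in schema["anyOf"])
    if "allOf" in schema:
        return all(check(instance, s) for s in schema["allOf"])
    if "oneOf" in schema:
        # exactly one branch must succeed
        return sum(check(instance, s) for s in schema["oneOf"]) == 1
    if "not" in schema:
        return not check(instance, schema["not"])
    ok = True
    t = schema.get("type")
    if t == "string":
        ok = isinstance(instance, str)
    elif t == "array":
        ok = isinstance(instance, list)
    elif t == "number":
        ok = isinstance(instance, (int, float)) and not isinstance(instance, bool)
    if ok and "maxLength" in schema:
        ok = len(instance) <= schema["maxLength"]
    if ok and "minLength" in schema:
        ok = len(instance) >= schema["minLength"]
    if ok and "minimum" in schema:
        ok = instance >= schema["minimum"]
    if ok and "multipleOf" in schema:
        ok = instance % schema["multipleOf"] == 0
    return ok

print(check("abc", {"anyOf": [{"type": "string"}, {"type": "array"}]}))  # True
print(check(6, {"oneOf": [{"type": "number", "minimum": 2},
                          {"type": "number", "multipleOf": 2}]}))        # False: both match
print(check(3, {"oneOf": [{"type": "number", "minimum": 2},
                          {"type": "number", "multipleOf": 2}]}))        # True: only one matches
```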
XML validation¶
XML validation is also supported by several technologies. We will briefly show how one of them, XML Schema, works.
Similar to how a JSound schema or a JSON schema is a JSON document, an XML schema is also an XML document.
Simple types¶
This is an example (well-formed) XML document:
<?xml version="1.0" encoding="UTF-8"?>
<foo>This is text.</foo>
And this is an XML schema against which the above XML document is valid:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="foo" type="xs:string"/>
</xs:schema>
So what is going on here? First, you notice that all elements in an XML schema are in a namespace, the XML schema namespace. We explained the namespace machinery before, so that the xmlns:xs construct should be familiar. The namespace is prescribed by the XML schema standard and must be this one. It is recommended to stick to the prefix xs, or xsd, which is also quite popular. We do not recommend declaring the XML schema namespace as a default namespace, because it can create confusion in several respects.
Next you will notice that the top element in an XML schema document is the xs:schema element, and inside there is an element declaration done with the xs:element element. It has two attributes: one defines the name of the element to validate (foo) and the other one specifies its type (xs:string). The list of predefined atomic types is the same as in JSound, except that in XML schema, all these predefined types live in the XML schema namespace and thus bear the prefix xs as well. In fact, formally, this list of predefined types is standardized by XML schema (along with more XML-specific types that are less known).
Let us try to change the type. The integer type (prefixed with xs) needs no introduction.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="foo" type="xs:integer"/>
</xs:schema>
This document is then valid:
<?xml version="1.0" encoding="UTF-8"?>
<foo>142857</foo>
Note that extra whitespace, newlines, and indentation are fine:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
142857
</foo>
Builtin types¶
Let us list again the most important predefined types here, almost the same as JSound, but with the xs prefix:
Strings: xs:string, xs:anyURI (for strings containing a URI)
Numbers: xs:decimal, xs:integer, xs:float, xs:double, xs:long, xs:int, xs:short, xs:byte, xs:negativeInteger, xs:positiveInteger, xs:nonNegativeInteger, xs:nonPositiveInteger, xs:unsignedByte, xs:unsignedShort, xs:unsignedInt, xs:unsignedLong
Dates and times: xs:date, xs:time, xs:dateTime, xs:gYearMonth, xs:gYear, xs:gMonth, xs:gDay, xs:gMonthDay, xs:dateTimeStamp
Time intervals: xs:duration, xs:yearMonthDuration, xs:dayTimeDuration
Binary types: xs:hexBinary, xs:base64Binary
Booleans: xs:boolean
Nulls: does not exist as a type in XML schema (JSON specific).
XML schema allows you to define user-defined atomic types, for example restricting the length of a string to 3 for airport codes, and then use them with an element (there is no prefix because, to keep things simple for teaching, we are not working with any namespaces for our own declared elements and types):
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:simpleType name="airportCode">
<xs:restriction base="xs:string">
<xs:length value="3"/>
</xs:restriction>
</xs:simpleType>
<xs:element name="foo" type="airportCode"/>
</xs:schema>
With this valid document:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
ZRH
</foo>
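True XSD validation requires a schema-aware processor, but we can mimic the airportCode restriction with Python's standard library to see what the validator checks. This is a hand-written stand-in, not a real XML Schema validator; note that whitespace around the text is stripped first, just as a validator normalizes whitespace for this kind of simple type:

```python
import xml.etree.ElementTree as ET

def is_valid_airport_element(xml_text):
    """Hand-written check mirroring the airportCode simple type above:
    a foo element whose text, after trimming whitespace, is an
    xs:string of length exactly 3."""
    root = ET.fromstring(xml_text)
    value = (root.text or "").strip()
    return root.tag == "foo" and len(value) == 3

print(is_valid_airport_element("<foo>\n  ZRH\n</foo>"))  # True
print(is_valid_airport_element("<foo>ZURICH</foo>"))     # False: six characters
```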
Complex types¶
It is also possible to constrain structures and the element / attribute / text hierarchy with complex types applying to element nodes. There are four main kinds of complex types:
complex content: there can be nested elements, but there can be no text nodes as direct children.
simple content: there are no nested elements: just text, but attributes are also possible.
empty content: there are neither nested elements nor text, but attributes are also possible.
mixed content: there can be nested elements, and they can be intermixed with text as well.
This is an example of complex content:
<foo>
<twotofour>foobar</twotofour>
<twotofour>foobar</twotofour>
<twotofour>foobar</twotofour>
<zeroorone>true</zeroorone>
</foo>
which is valid against this schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="complex">
<xs:sequence>
<xs:element
name="twotofour"
type="xs:string"
minOccurs="2"
maxOccurs="4"/>
<xs:element
name="zeroorone"
type="xs:boolean"
minOccurs="0"
maxOccurs="1"/>
</xs:sequence>
</xs:complexType>
<xs:element name="foo" type="complex"/>
</xs:schema>
Note how child elements can be repeated, and how the number of occurrences can be constrained to an interval with the minOccurs and maxOccurs attributes in the schema. Of course, this all works recursively, i.e., the nested elements can also have complex types with complex content and so on (even though in this example they have simple types).
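Again for illustration only, the occurrence constraints of the schema above can be checked by hand with the standard library. A real validator would additionally enforce the order imposed by xs:sequence, which this sketch ignores:

```python
import xml.etree.ElementTree as ET

def check_occurrences(xml_text):
    """Count the children of foo and verify they fall in the intervals
    declared by minOccurs/maxOccurs in the schema above. Ordering
    (xs:sequence) is not checked here."""
    root = ET.fromstring(xml_text)
    twotofour = root.findall("twotofour")
    zeroorone = root.findall("zeroorone")
    return 2 <= len(twotofour) <= 4 and len(zeroorone) <= 1

doc = """<foo>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <twotofour>foobar</twotofour>
  <zeroorone>true</zeroorone>
</foo>"""
print(check_occurrences(doc))                                    # True
print(check_occurrences("<foo><twotofour>x</twotofour></foo>"))  # False: only one occurrence
```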
This is an example of simple content:
<foo country="Switzerland">2014-12-02</foo>
which is valid against this schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="dateCountry">
<xs:simpleContent>
<xs:extension base="xs:date">
<xs:attribute name="country" type="xs:string"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
<xs:element name="foo" type="dateCountry"/>
</xs:schema>
Note how a complex type with simple content is defined as the extension of a simple type, adding one or more attributes to it. If there are no attributes, of course, there is no need to bother with a complex type: a simple type does the trick as shown before.
This is an example of empty content:
<foo/>
which is valid against this schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="complex">
<xs:sequence/>
</xs:complexType>
<xs:element name="foo" type="complex"/>
</xs:schema>
As you can see, empty content is "boring", in that it is defined just like complex content, but with no nested elements at all (attributes, though, can absolutely be added).
Finally, this is an example of mixed content:
<foo>Some text and some <b>bold</b> text.</foo>
which is valid against this schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="mixedContent" mixed="true">
<xs:sequence>
<xs:element
name="b"
type="xs:string"
minOccurs="0"
maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:element name="foo" type="mixedContent"/>
</xs:schema>
Mixed content is also "boring" to define: all it takes is a structure just like complex content, plus an extra attribute in the xs:complexType declaration called "mixed" and set to true.
Attribute declarations¶
Finally, all types of content can additionally contain attributes. Attributes always have a simple type. An example with empty content involving one attribute:
<foo country="Switzerland"/>
which is valid against:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="withAttribute">
<xs:sequence/>
<xs:attribute name="country"
type="xs:string"
default="Switzerland"/>
</xs:complexType>
<xs:element name="foo" type="withAttribute"/>
</xs:schema>
The default attribute of the attribute declaration will automatically add an attribute with the corresponding name and specified value in memory in case it was missing in the original instance. This works just like in JSound.
Anonymous types¶
Finally, it is not mandatory to give a name to all types. It is possible, instead of the type attribute of an element or attribute declaration, to nest a type declaration with no name attribute:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="c">
<xs:complexType>
<xs:sequence/>
</xs:complexType>
</xs:element>
</xs:schema>
Be careful: if there is neither a type attribute nor a nested type declaration, then, anything is allowed.
Miscellaneous¶
XML schema has many more features than described here: primary keys, constraints, support for namespaces, etc.
Finally, XML schema documents are themselves XML documents, and can thus be validated against a "schema of schemas", itself written as an XML schema. This schema has the wonderful property of being valid against itself, which will delight aficionados of Douglas Hofstadter's Pulitzer-prize-winning book Gödel, Escher, Bach.
Data frames¶
Heterogeneous, nested datasets¶
Now that we have defined data models for JSON and XML, and added a mechanism (schema validation) to enforce additional constraints on collections of JSON objects or of XML documents, we can take a step back and look at the bigger picture.
First, it should now be clear to you that, in both XML and JSON, datasets are simply collections of trees. If we now consider the particular case of JSON, which is growing in popularity, we can express a collection of JSON trees formally as a list of maps.
Now we have already seen lists of maps before: relational tables are lists of maps, too, at least in the list semantics, as opposed to the set or bag semantics of the relational algebra. But in the context of relational tables, we call these maps tuples, or rows. The reason is that relational tables are not any list of maps: they have constraints, as we explained before. These constraints include atomic integrity, relational integrity, and domain integrity.
Casually explained, atomic integrity means that relational tables are flat.
A counterexample of flatness (or of atomic integrity) is the presence of nested tables:
Casually explained, relational integrity and domain integrity mean that relational tables are homogeneous.
A counterexample of homogeneity (or relational and domain integrity) is the presence of missing or additional columns, and of values with different types in the same columns:
Finally, we can even provide a counterexample of a dataset that is neither flat, nor homogeneous: it is both nested and heterogeneous.
The beauty of the JSON data model is that, unlike the relational model and the CSV syntax, it supports nested, heterogeneous datasets, while also supporting, as a particular case, flat, homogeneous datasets like so:
{"ID":1, "Name": "Einstein", "Living" : false}
{"ID":2, "Name": "Penrose", "Living" : true}
{"ID":3, "Name": "Turing", "Living" : false}
{"ID":4, "Name": "Dean", "Living" : true}
This is now an example of a nested (but homogeneous) collection in JSON syntax, corresponding to the nested visual above:
{
"ID":1,
"Profession": "Physicist",
"People": [
{"Name": "Einstein", "Living" : false},
{"Name": "Penrose", "Living" : true}
]
}
{
"ID":2,
"Profession": "Computer Scientist",
"People": [
{"Name": "Turing", "Living" : false},
{"Name": "Dean", "Living" : true}
]
}
This is now an example of a heterogeneous (but flat) collection in JSON syntax, corresponding to the heterogeneous visual above:
{
"ID":1,
"Name": "Einstein",
"Living" : false,
"Profession" : "Physicist"
}
{
"ID":2,
"Name": "Penrose",
"Living" : true,
"Profession" : "CS"
}
{
"ID":3,
"Name": "Turing"
}
{
"ID":4,
"Name": "Dean",
"Living" : true
}
And finally, this is an example of a heterogeneous and nested collection in JSON syntax, corresponding to the heterogeneous and nested visual above:
{
"ID":1,
"Profession": "Physicist",
"People": [
{"Name": "Einstein", "Living" : false},
"Penrose"
]
}
{
"ID":2,
"Profession": "Computer Scientist",
"People": [
{"Name": "Turing"},
{"Name": "Dean", "Living" : true}
],
"Comment": "They rock"
}
For completeness, this shows how the same can be expressed in XML. However, XML is more powerful as we explained before (it covers more use cases from the publishing industry with mixed content), so that in the remainder of the course, we will tend to focus more on JSON.
<professions>
<profession id="1">
<name>physicist</name>
<persons>
<person>
<name>Einstein</name>
<living>false</living>
</person>
Penrose
</persons>
</profession>
<profession id="2">
<name>Computer Scientist</name>
<persons>
<person>
<name>Turing</name>
</person>
<person>
<name>Dean</name>
<living>true</living>
</person>
</persons>
<comment>They rock</comment>
</profession>
</professions>
Dataframe visuals¶
There is a particular subclass of semi-structured datasets that is very interesting: valid datasets, which are collections of JSON objects valid against a common schema, with some requirements on the considered schemas.
The datasets belonging to this particular subclass are called data frames, or dataframes.
Specifically, for a dataset to qualify as a data frame, we firstly forbid schemas with open object types: schemas must disallow any additional attributes. Secondly, we forbid schemas whose object or array values are too permissive and allow any value: schemas must require specific types, such as integers, strings, dates, objects representing a person, arrays of binaries, etc. We do, however, include schemas that allow for null values and / or absent values.
Under the above conditions, we call the collection of objects a data frame. It should be immediate to the reader that relational tables are data frames, while data frames are not necessarily relational tables: data frames can be (and often are) nested, but they are still relatively homogeneous. Relatively, because schemas can still allow for a value to be missing.
Data frames have the nice property that they can be drawn visually as structures that look like generalized relational tables, a bit nicer and more structured than the previous visuals with nested tables. Further, the JSound compact syntax is a natural fit: compact schemas are always closed, and they allow for marking values as required or not, and as nullable or not. Thus, we can now give a few examples of JSound schemas and their corresponding visuals. Let us start with the example of a "flat" JSound schema:
{
"ID" : "integer",
"Name" : "string",
"Living" : "boolean"
}
This schema could just as well be described as a SQL CREATE TABLE statement. In fact, for this use case, a relational database might make more sense than a data lake altogether. Concretely, this means the data can be drawn as a table, like so:
But things get interesting if we denormalize and define one of the fields to be an array of strings, for example, like so:
{
"ID" : "integer",
"Name" : [ "string" ],
"Living" : "boolean"
}
Then, the data frame visual becomes:
Thus, data frames are a generalization of (normalized) relational tables allowing for (organized and structured) nestedness.
Data frames also can have nested objects, as described by the following schema:
{
"ID" : "integer",
"Name" : {
"First" : "string",
"Last" : "string"
},
"Living" : "boolean"
}
A valid instance can then be drawn like so:
Finally, a very common use case modelling "tables in tables" involves objects nested in arrays, like so:
{
"ID" : "integer",
"Who" : [
{
"Name" : "string",
"Type" : "string"
}
],
"Living" : "boolean"
}
Leading to the following visual:
Note that the former visual could also match the following different, but less natural, schema:
{
"ID" : "integer",
"Who" : {
"Name" : [ "string" ],
"Type" : [ "string" ]
},
"Living" : "boolean"
}
This schema structure is, however, much less common, even though savvy users can employ it to optimize the layout under specific circumstances (examples can be seen in high-energy physics datasets at CERN). Beware in particular that this alternate schema structure does not enforce that there be the same number of items in each array.
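The contrast between the two shapes can be made concrete with a small Python sketch (the helper names are hypothetical): a list of records ("array of structs") versus parallel columns ("struct of arrays"). Transposing between them is mechanical, but the columnar shape loses the guarantee that the arrays stay aligned:

```python
def to_struct_of_arrays(who):
    """Transpose a list of records into parallel columns."""
    return {"Name": [w["Name"] for w in who],
            "Type": [w["Type"] for w in who]}

def to_array_of_structs(who):
    """Transpose parallel columns back into a list of records.
    zip silently truncates if the arrays have different lengths --
    exactly the alignment hazard mentioned above."""
    return [{"Name": n, "Type": t} for n, t in zip(who["Name"], who["Type"])]

rows = [{"Name": "Einstein", "Type": "physicist"},
        {"Name": "Penrose", "Type": "physicist"}]
cols = to_struct_of_arrays(rows)
print(cols)  # {'Name': ['Einstein', 'Penrose'], 'Type': ['physicist', 'physicist']}
assert to_array_of_structs(cols) == rows  # the round trip is lossless here
```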
Data formats¶
Once data is structured as a valid data frame, it can be stored in various efficient binary formats like Parquet, Avro, Root, and Google’s Protocol Buffers. These formats are optimized for space and performance, offering significant advantages over JSON. If a JSON dataset is compatible with a data-frame-friendly schema (e.g., validated with JSound), converting it to Parquet can yield:
Space Efficiency: Binary formats like Parquet are compact, allowing more data to fit on local devices and speeding up cloud transfers.
Performance Efficiency: Binary files are faster to read, and many optimizers can skip irrelevant sections of the data based on queries, improving performance.
In cases where the data schema is simple (no nested structures, only closed object types) and under a terabyte, a relational database with SQL may be an ideal choice.
Binary formats are efficient because they store the schema as a header and avoid repeating field names in each record. Additionally, techniques for compressing homogeneous data sequences and performing direct lookups increase efficiency. Parquet, for example, uses a columnar storage approach, grouping values by field across records, which enhances query performance.
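The columnar idea can be illustrated in a few lines of plain Python. This is a toy model of the layout, not Parquet's actual on-disk format:

```python
# row-oriented layout: each record repeats the field names
rows = [{"name": "Einstein", "year": 1879},
        {"name": "Penrose",  "year": 1931},
        {"name": "Turing",   "year": 1912}]

# columnar layout: field names stored once, values grouped per field,
# so values of the same type sit together (and compress well)
columns = {field: [row[field] for row in rows] for field in rows[0]}

# a query touching a single field only scans that column
earliest = min(columns["year"])
print(earliest)  # 1879
```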
Data formats are typically classified by:
Schema Requirement: Some formats require a schema (Parquet, Protocol Buffers), while others do not (JSON, XML).
Support for Nested Data: Formats like Parquet allow nested structures, whereas CSV does not.
Binary vs. Textual: Binary formats (e.g., Parquet) are generally faster than textual ones (e.g., JSON, XML).
In summary, binary data formats like Parquet are efficient, schema-requiring, and well-suited to large datasets, with concepts similar to those we've explored in this chapter. Familiarity with format-specific terminology (e.g., int64 for integer) is all that's needed to make the switch.