263-3010-00: Big Data
Section 7
Data Models and Validation
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/28/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Even though the data is physically stored as bits (or, in the case of XML and JSON, as text that is itself encoded to bits), it would not be appropriate to manipulate the data directly at the bit or text level. In the spirit of data independence, the physical representation is abstracted away instead. Doing so is called data modeling.
A data model is an abstract view over the data that hides the way it is stored physically. For example, a CSV file should be abstracted logically as a table. This is because CSV enforces at least relational integrity as well as atomic integrity. As for domain integrity, it can be considered implicit, since an entire column can be interpreted as a column of strings when its literals are otherwise incompatible.
The JSON Information Set
A model based on tables is not appropriate for JSON. This is because, unlike CSV, JSON enforces neither relational integrity, nor atomic integrity, nor domain integrity. In fact, we will see that the appropriate abstraction for any JSON document is a tree.
The nodes of that tree, which are JSON logical values, are naturally of six possible kinds: the six syntactic building blocks of JSON.
These are the four leaves corresponding to atomic values:
Strings
Numbers
Booleans
Nulls
As well as two intermediate nodes (possibly leaves if empty):
Objects (String-to-value map)
Arrays (List of values)
Formally, and not only for JSON but for all tree-based models, these nodes are generally called information items. They form the logical building blocks of the model, which is called the information set.
Let us take the following example.
{
"foo" : true,
"bar" : [
{
"foobar" : "foo"
},
null
]
}
It is possible to draw this document as a logical tree, where each information item (node) corresponds to one of the values present in the document: two objects, one array, and three atomics. Note that the information items are the rectangles; the ovals are not information items but labels on the edges connecting the information items. The ovals correspond to object keys.
It is possible to do so for any JSON document. Thus, we have now obtained a logical / physical mapping similar to what we previously did with CSV and tables, except that this is now with JSON and trees.
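The count of information items given above can be checked with a short sketch. It assumes Python's standard `json` module as the parser; the helper names `kind` and `count_items` are made up for this illustration. The parsed value is walked recursively and each node is classified into one of the six kinds.

```python
import json
from collections import Counter

# The example document from above, on one line.
DOCUMENT = '{"foo": true, "bar": [{"foobar": "foo"}, null]}'

def kind(value):
    """Map a parsed Python value back to its JSON kind (hypothetical helper)."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bool):  # tested before numbers: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    raise TypeError(f"not a JSON value: {value!r}")

def count_items(value, counts=None):
    """Count information items per kind by walking the tree."""
    counts = Counter() if counts is None else counts
    counts[kind(value)] += 1
    if isinstance(value, dict):
        for child in value.values():   # keys are edge labels, not nodes
            count_items(child, counts)
    elif isinstance(value, list):
        for child in value:
            count_items(child, counts)
    return counts

counts = count_items(json.loads(DOCUMENT))
print(dict(counts))
print(sum(counts.values()), "information items in total")
```

Object keys are deliberately not counted: they label edges and are not information items themselves.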
When a JSON document is being parsed by a JSON library, this tree is built in memory, the edges being pointers, and further processing will be done on the tree and not on the original syntax.
Conversely, it is possible to take a tree and output it back to JSON syntax. This is called serialization.
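As a minimal sketch of parsing and serialization, again assuming Python's standard `json` module: the text is parsed into a tree, the tree is serialized back to text, and the result is parsed once more to confirm that it describes the same tree. Whitespace in the serialized text may differ from the original, since it is not part of the logical model.

```python
import json

text = """{
  "foo" : true,
  "bar" : [ { "foobar" : "foo" }, null ]
}"""

tree = json.loads(text)        # parsing: syntax -> tree in memory
serialized = json.dumps(tree)  # serialization: tree -> syntax

print(serialized)
# The two texts may differ in whitespace, but they describe the same tree.
same_tree = json.loads(serialized) == tree
print(same_tree)
```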
The XML Information Set
It is possible to do the same logical abstraction, also based on trees, with XML, where information items correspond to elements, attributes, text, etc.
A fundamental difference between JSON trees and XML trees is that for JSON, the labels (object keys) are on the edges connecting an object information item to each one of its child information items. In XML, the labels (these would be element and attribute names) are on the nodes (information items) directly. Another way to say it is that a JSON information item does not know with which key it is associated in an object (if at all), while an XML element or attribute information item knows its name.
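The difference can be observed in a small sketch, assuming Python's standard `json` and `xml.etree.ElementTree` modules. The same string value is obtained once through an object key and once from an array; nothing distinguishes the two. The parsed XML element, in contrast, carries its own name.

```python
import json
import xml.etree.ElementTree as ET

# JSON: the key lives on the edge. The value itself is just the string "foo".
from_object = json.loads('{"foobar": "foo"}')["foobar"]
from_array = json.loads('["foo"]')[0]
print(from_object == from_array)  # the value carries no trace of the key

# XML: the name lives on the node itself.
element = ET.fromstring('<title language="en">Systems Group</title>')
print(element.tag)  # the element knows its own name
```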
Let us go into more detail. In XML, there are many more kinds of information items:
Document information items
Element information items
Attribute information items
Character information items
Comment information items
Processing instruction information items
Namespace information items
Unexpanded entity reference information items
DTD information items
Unparsed entity information items
Notation information items
We only go into the most important ones from a data perspective here: documents, elements, attributes, and characters. We will leave comments and namespaces aside to keep things simple, even though we saw what they look like syntactically, and will also skip all other information items, for which we have not studied the syntax.
Let us take this example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE metadata>
<metadata>
<title
language="en"
year="2019"
>Systems Group</title>
<publisher>ETH Zurich</publisher>
</metadata>
Formally, the XML Information Set is defined in a standard of the World Wide Web Consortium (W3C). Each kind of information item has specific properties, and some of these properties link it to other information items, building the tree.
Let us go through the information items for the above document and list some of their properties.
Document information item
The document information item is just the root of an XML tree. It does not correspond to anything syntactically or, if at all, it would correspond to the XML declaration and the doctype declaration.
The document information item has two important properties:
[children] Element information item metadata
[version] 1.0
Element information items
There is one element information item for each element. Here we have three.
The element information item metadata has four important properties:
[local name] metadata
[children] Element information item title, element information item publisher
[attributes] (empty)
[parent] Document information item
The element information item title has four important properties:
[local name] title
[children] Character information items (Systems Group)
[attributes] Attribute information item language, Attribute information item year
[parent] Element information item metadata
The element information item publisher has four important properties:
[local name] publisher
[children] Character information items (ETH Zurich)
[attributes] (empty)
[parent] Element information item metadata
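The properties listed above can be read off a parsed tree. The sketch below assumes Python's standard `xml.dom.minidom`, whose DOM nodes are close to, but not identical with, the information items of the standard. The helper `describe` is made up for this illustration, and whitespace-only text used for indentation is skipped to match the simplified listing above.

```python
from xml.dom import minidom

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE metadata>
<metadata>
  <title
    language="en"
    year="2019"
  >Systems Group</title>
  <publisher>ETH Zurich</publisher>
</metadata>"""

doc = minidom.parseString(XML)

def describe(element):
    """List the four properties discussed above for one element (hypothetical helper)."""
    children = []
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            children.append("element " + child.tagName)
        elif child.nodeType == child.TEXT_NODE and child.data.strip():
            children.append("text " + repr(child.data))  # grouped characters
    parent = element.parentNode
    if parent.nodeType == parent.DOCUMENT_NODE:
        parent_label = "document"
    else:
        parent_label = "element " + parent.tagName
    return {
        # No namespaces are used here, so the tag name is the local name.
        "local name": element.tagName,
        "children": children,
        "attributes": sorted(element.attributes.keys()),
        "parent": parent_label,
    }

for name in ("metadata", "title", "publisher"):
    print(describe(doc.getElementsByTagName(name)[0]))
```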
Attribute information items
There is one attribute information item for each attribute. Here we have two.
The attribute information item language has three important properties:
[local name] language
[normalized value] en
[owner element] Element information item title
The attribute information item year has three important properties:
[local name] year
[normalized value] 2019
[owner element] Element information item title
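As a sketch under the same assumption (Python's standard `xml.dom.minidom`), the three properties can be read from the attribute nodes of the title element. The DOM names `name`, `value`, and `ownerElement` are used here as stand-ins for [local name], [normalized value], and [owner element]; the mapping is approximate rather than a statement about the standard.

```python
from xml.dom import minidom

doc = minidom.parseString('<title language="en" year="2019">Systems Group</title>')
title = doc.documentElement

attributes = []
for name in ("language", "year"):
    attr = title.getAttributeNode(name)  # an attribute node of the element
    attributes.append({
        "local name": attr.name,          # no namespaces are used in this document
        "normalized value": attr.value,   # value as delivered by the parser
        "owner element": attr.ownerElement.tagName,
    })

for item in attributes:
    print(item)
```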
Character information items
There are as many character information items as characters in text (between tags). For example, for S in Systems Group:
[character code] the Unicode code point for the letter S
[parent] Element information item title
It is sometimes simpler to group them into a single (non standard) "text information item":
[characters] S y s t e m s G r o u p
[parent] Element information item title
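The relation between individual character information items and the grouped "text information item" can be sketched in plain Python. The dictionaries below are only an illustrative representation, not a data structure of any XML library.

```python
text = "Systems Group"  # the text between the title tags

# One character information item per character, each with its code point.
character_items = [
    {"character code": ord(ch), "parent": "title"} for ch in text
]
print(len(character_items))                           # one item per character
print(hex(character_items[0]["character code"]))      # code point of the letter S

# Grouping them back into a single (non-standard) text information item.
text_item = {
    "characters": "".join(chr(item["character code"]) for item in character_items),
    "parent": "title",
}
print(text_item)
```

Note that the space between the two words is a character information item like any other.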
The entire tree
All information items built previously can finally be assembled and drawn as a tree. The edges, corresponding to children and parent (or owner element) properties, will correspond to pointers in memory when the tree is built by the XML library.
When an XML document is being parsed by an XML library, this tree is built in memory, the edges being pointers, and further processing will be done on the tree and not on the original syntax.
Conversely, it is possible to take a tree and output it back to XML syntax. This is called serialization.
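A minimal sketch of this round trip, assuming Python's standard `xml.etree.ElementTree`: the document is parsed, serialized, and parsed again, and the two trees are compared on names, attributes, and text. This particular library does not keep the doctype declaration or the original layout inside tags, so the serialized text is not expected to be byte-identical to the input.

```python
import xml.etree.ElementTree as ET

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE metadata>
<metadata>
  <title
    language="en"
    year="2019"
  >Systems Group</title>
  <publisher>ETH Zurich</publisher>
</metadata>"""

root = ET.fromstring(XML)                           # parsing: syntax -> tree
serialized = ET.tostring(root, encoding="unicode")  # serialization: tree -> syntax
print(serialized)

def summary(element):
    """Names, attributes and text of an element and its descendants (hypothetical helper)."""
    return [(e.tag, dict(e.attrib), (e.text or "").strip()) for e in element.iter()]

reparsed = ET.fromstring(serialized)
print(summary(reparsed) == summary(root))
```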
Validation
Once documents, JSON or XML, have been parsed and logically abstracted as a tree in memory, the natural next step is to check for further structural constraints.
For example, you might want to check whether your JSON documents all associate the key "name" with a string, or whether they all associate "years" with an array of positive integers. Or you might want to check whether your XML documents all have a root element called "persons", and whether the root element in each document has only child elements called "person", all with an attribute "first" and an attribute "last".
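The JSON constraints just mentioned can be written as hand-coded checks on the parsed tree. This is only a sketch: the function name `is_valid` and the sample documents are made up, both keys are assumed to be required, and a schema language would normally express such constraints declaratively rather than in code.

```python
import json

def is_valid(document):
    """Check that "name" is a string and "years" is an array of positive integers."""
    if not isinstance(document, dict):
        return False
    name = document.get("name")
    years = document.get("years")
    if not isinstance(name, str):
        return False
    if not isinstance(years, list):
        return False
    # bool is excluded explicitly because Python treats it as an integer type.
    return all(
        isinstance(y, int) and not isinstance(y, bool) and y > 0 for y in years
    )

# Hypothetical sample documents.
samples = [
    '{"name": "Einstein", "years": [1905, 1915]}',
    '{"name": 42, "years": [1905]}',
    '{"name": "Einstein", "years": [1905, -3]}',
]
results = [is_valid(json.loads(s)) for s in samples]
print(results)
```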
This might remind the reader of schemas in a relational database, but with a major difference: in a relational database, the schema of a table is defined before any data is populated into the table. Thus, the data in the table is guaranteed, at all times, to fulfill all the constraints of the schema. The exact term is that the data is guaranteed to be valid against the schema, because the schema was enforced at write time (schema on write).
But in the case of a collection of JSON or XML documents, it is the other way around. A collection of JSON or XML documents out there can exist without any schema and contain arbitrary structures. Validation happens "ex post," that is, only after reading the data (schema on read).
This means that JSON and XML documents undergo two steps:
a well-formedness check: attempt to parse the document and construct a tree representation in memory
(if first step succeeded) a validation check given a specific schema
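The two steps can be sketched as follows, assuming Python's standard `xml.etree.ElementTree` for the well-formedness check and hand-written checks in place of a real schema. The constraints are the "persons" example from above; the function name `check` and the sample inputs are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def check(text):
    """Return 'not well-formed', 'invalid', or 'valid' for an XML text."""
    # Step 1: well-formedness. Parsing either builds a tree or fails.
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    # Step 2: validation, only reached if step 1 succeeded.
    if root.tag != "persons":
        return "invalid"
    for child in root:
        if child.tag != "person":
            return "invalid"
        if "first" not in child.attrib or "last" not in child.attrib:
            return "invalid"
    return "valid"

print(check('<persons><person first="Albert" last="Einstein"/></persons>'))
print(check('<persons><person first="Albert"/></persons>'))
print(check('<persons><person>'))
```

A document that fails the first step never reaches the second: without a tree, there is nothing to validate.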