263-3010-00: Big Data
Section 11
Document Stores
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 11/24/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny's course. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
In the journey towards processing large-scale, real-world datasets, early systems had to make a few compromises. In particular, the ACID paradigm was replaced with the more lenient paradigm of the CAP theorem, notably with eventual consistency. Another compromise is that many early Big Data systems offer a low-level API in an imperative host language rather than a query language. Finally, many systems work as data lakes into which users throw all their datasets, rather than as fully integrated database management systems that take control of, and hide, the physical layout.
All of these compromises come at a high price because, as far as data independence is concerned, they send us back to the 1960s, when people wrote programs that read data directly from their file systems. The database community is well aware of this, and for this reason there are attempts to bring back all the data independence bells and whistles (ACID, query languages, data management).
Document stores are an example of a step in this direction: unlike a data lake, a document store manages the data directly, and its users do not see the physical layout.
Relational databases¶
As a reminder, in relational databases, everything is a table. We saw that a table can be seen as a set of maps (from attributes to values) that fulfils three constraints: relational integrity, domain integrity, and atomic integrity.
We can, of course, process tables through a data lake: we could upload CSV files to S3 or HDFS and then query them via Spark or, even better, Spark SQL.
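As a minimal sketch of what data lake access amounts to without a database management system, the following plain-Python snippet reads and filters a CSV "dataset" by hand (the file content, column names, and filter are made up for illustration; in practice this would be a file on S3 or HDFS queried via Spark SQL):

```python
import csv
import io

# A tiny CSV "dataset" standing in for a file uploaded to S3 or HDFS
# (the schema and values are invented for illustration).
data = io.StringIO(
    "name,age\n"
    "Alice,30\n"
    "Bob,25\n"
)

# Without a DBMS, the program itself must know the physical layout:
# the delimiter, the header row, and the type of each column.
reader = csv.DictReader(data)
over_28 = [row["name"] for row in reader if int(row["age"]) > 28]
print(over_28)  # ['Alice']
```

A query language such as Spark SQL hides exactly these details: the user writes `SELECT name FROM people WHERE age > 28` without caring about delimiters or column positions.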
But a relational database management system will offer more than this: it can optimize the layout of the data on disk and build additional structures (indices) to accelerate SQL queries without the need to modify them, and it can handle transactions.
Can we rebuild a similar system for collections of trees, that is, after dropping all three constraints: relational integrity, domain integrity, and atomic integrity?
Document stores bring us one step in this direction.
Challenges¶
Schema on read¶
Data that fulfills relational integrity, domain integrity, and atomic integrity always comes with a schema. In a relational database management system, it is not possible to populate a table without having defined its schema first.
We saw in Chapter 7 that schemas can be extended to data that breaks relational integrity (optional fields, open objects), domain integrity (union types or use of the "item" topmost type), or atomic integrity (nested arrays and objects). We also saw that the special case of valid data that only breaks atomic integrity (or relational integrity in reasonable amounts, i.e., optional fields but no open objects) is described with the dataframes framework.
However, when encountering such denormalized data in the real world, there is often no schema. In fact, one of the important features of a system that deals with denormalized data is the ability to discover a schema, i.e., to offer query functionality to find out which keys appear in the data, what kind of value is associated with each key, etc.; or even functionality that directly infers a schema, as we saw is the case with Apache Spark.
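The discovery part can be sketched in a few lines of Python: iterate over a heterogeneous JSON collection and record, for each key, which value types appear (the sample documents below are invented for illustration):

```python
# A minimal sketch of schema discovery over a heterogeneous JSON
# collection: for every key, collect the set of value types observed.
documents = [
    {"name": "Einstein", "theory": "Relativity"},
    {"name": "Goedel", "theorems": ["completeness", "incompleteness"]},
    {"name": "Planck", "constant": 6.62607015e-34},
]

discovered = {}
for doc in documents:
    for key, value in doc.items():
        discovered.setdefault(key, set()).add(type(value).__name__)

for key in sorted(discovered):
    print(key, "->", sorted(discovered[key]))
```

Systems like Spark do essentially this on a sample of the data, then merge the observed types into an inferred schema (e.g., widening integers and floats to a common numeric type).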
Making trees fit in tables¶
A first thought when trying to build a system that supports denormalized data, such as collections of JSON or XML objects, is to force-fit it into tables. In fact, it is a very natural thing to do if the collection is flat and homogeneous, i.e., respects the three fundamental integrity constraints.
For example, a flat JSON object can naturally be seen as the row of a relational table:
Likewise, a flat XML element can naturally be seen as the row of a relational table:
Thus, several XML elements (or, likewise, several JSON objects) can be naturally mapped to a relational table with several rows:
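This mapping can be sketched in Python: for a flat, homogeneous collection, the keys give the columns and each object yields one row (the sample objects are invented for illustration):

```python
# Mapping a flat, homogeneous JSON collection to a relational table:
# the (shared) keys become the columns, and each object becomes a row.
objects = [
    {"first": "Albert", "last": "Einstein", "year": 1879},
    {"first": "Kurt", "last": "Goedel", "year": 1906},
]

# Homogeneity guarantees every object has the same keys, so the
# columns can be read off the first object.
columns = list(objects[0].keys())
rows = [tuple(obj[c] for c in columns) for obj in objects]

print(columns)  # ['first', 'last', 'year']
print(rows)     # [('Albert', 'Einstein', 1879), ('Kurt', 'Goedel', 1906)]
```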
The corresponding XML Schemas can also be transformed (modulo an appropriate data type mapping, as explained in Chapter 10) naturally to a relational schema:
The same goes for JSound or JSON Schemas:
Is this not great? Does it mean we have nothing left to do, and that JSON and XML collections, and more generally semi-structured collections, just fit elegantly into relational tables? At the risk of raining on the parade, the matter is more complex than this, because semi-structured data can generally be nested and heterogeneous.
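A small check makes the problem concrete: the flat mapping only works when all objects share the same keys (relational integrity) and all values are atomic (atomic integrity). The following sketch, with invented sample data, tests exactly these two conditions:

```python
# A sketch of why nested and heterogeneous collections resist the flat
# mapping: check whether a collection of JSON objects satisfies the
# conditions under which it fits directly into a relational table.
def fits_in_a_table(objects):
    atomic = (str, int, float, bool, type(None))
    keys = set(objects[0].keys())
    for obj in objects:
        if set(obj.keys()) != keys:
            return False  # heterogeneity: relational integrity broken
        if not all(isinstance(v, atomic) for v in obj.values()):
            return False  # nestedness: atomic integrity broken
    return True

flat = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
nested = [{"a": 1, "b": {"c": 2}}]  # the value of b is a nested object

print(fits_in_a_table(flat))    # True
print(fits_in_a_table(nested))  # False
```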
It is tempting, then, to map nestedness: