263-3010-00: Big Data
Section 11
Document Stores
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 11/24/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
In the journey towards processing large-scale, real-world datasets, a few compromises had to be made in the early systems. In particular, the ACID paradigm was replaced with the more lenient CAP-theorem paradigm and, in particular, eventual consistency. Another compromise is that many of the early Big Data systems offer a low-level API in an imperative host language rather than a query language. Finally, many systems work as data lakes into which users throw all their datasets, rather than as a fully integrated database management system that takes over control of, and hides, the physical layout.
All of these compromises come at a high price because, as far as data independence is concerned, they send us back to the 1960s, when people wrote programs to read data directly from their file system. The database community is well aware of this, and for this reason there are attempts to bring back all the data independence bells and whistles (ACID, query languages, data management).
Document stores are an example of a step in this direction: a document store, unlike a data lake, manages the data directly, and the users do not see the physical layout.
Relational databases¶
As a reminder, in relational databases, everything is a table. We saw that a table can be seen as a set of maps (from attributes to values) that fulfils three constraints: relational integrity, domain integrity, and atomic integrity.
We can, of course, process tables through a data lake: we could upload CSV files to S3 or HDFS and then query them via Spark or, even better, Spark SQL.
But a relational database management system will offer more than this: it can optimize the layout of the data on disk and build additional structures (indices) to accelerate SQL queries without the need to modify them, and it can handle transactions.
Can we rebuild a similar system for collections of trees, in the sense that we drop all three constraints: relational integrity, domain integrity, and atomic integrity?
Document stores bring us one step in this direction.
Challenges¶
Schema on read¶
Data that fulfills relational integrity, domain integrity, and atomic integrity always comes with a schema. In a relational database management system, it is not possible to populate a table without having defined its schema first.
We saw in Chapter 7 that schemas can be extended to data that break relational integrity (optional fields, open objects), or domain integrity (union types or use of the “item” topmost type), or atomic integrity (nested arrays and objects). We also saw that the special case of valid data that only breaks atomic integrity (or relational integrity in reasonable amounts, i.e., optional fields but no open objects) is described with the dataframes framework.
However, when encountering such denormalized data in the real world, there is often no schema. In fact, one of the important features of a system that deals with denormalized data is the ability to discover a schema, i.e., offer query functionality to find out which keys appear in the data, what kind of value is associated with each key, etc.; or even functionality that directly infers a schema, as we saw is the case with Apache Spark.
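To make schema discovery concrete, here is a minimal sketch in the mongo shell (anticipating the MongoDB aggregation API introduced later in this chapter) that lists which top-level keys appear in a collection and which types their values take; the collection name scientists is only an example:
db.scientists.aggregate([
  // turn each document into an array of { k: key, v: value } pairs
  { "$project": { "kv": { "$objectToArray": "$$ROOT" } } },
  // output one document per key-value pair
  { "$unwind": "$kv" },
  // group by key and collect the set of BSON types observed for its values
  { "$group": { "_id": "$kv.k", "types": { "$addToSet": { "$type": "$kv.v" } } } }
])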
Making trees fit in tables¶
A first thought when trying to build a system that supports denormalized data, such as collections of JSON or XML objects, is to force-fit it into tables. In fact, it is a very natural thing to do if the collection is flat and homogeneous, i.e., respects the three fundamental integrity constraints.
For example, a flat JSON object can naturally be seen as the row of a relational table:
Likewise, a flat XML element can naturally be seen as the row of a relational table:
Thus, several XML elements (or, likewise, several JSON objects) can be naturally mapped to a relational table with several rows:
The corresponding XML Schemas can also be transformed (modulo an appropriate data type mapping, as explained in Chapter 10) naturally to a relational schema:
The same goes for JSound or JSON Schemas:
Is this not great? Does it mean we actually have nothing to do: JSON and XML collections, and more generally semi-structured collections, just fit elegantly into relational tables? At the risk of raining on the party, the matter is more complex than this, because semi-structured data is generally nested and heterogeneous.
So it is tempting to map nestedness, then:
as well as heterogeneity:
Doing so quickly leads to the realization that such a mapping will at best have to be done for every single dataset, and in most cases requires a schema, whereas we are looking for a generic solution for semi-structured data with no a priori schema information. At worst, dealing with heterogeneity will lead to an intractable number of columns, and nestedness to an intractable number of tables.
Document stores¶
Document stores provide a native database management system for semi-structured data. Document stores also scale to Gigabytes or Terabytes of data, and typically millions or billions of records (a record being a JSON object or an XML document).
A document store typically specializes in either JSON or XML data, even though some companies (e.g., MarkLogic) offer support for both.
Document stores work on collections of records, generalizing the way that relational tables can be seen as collections of rows. Such a collection can look like this in the case of JSON:
{
"foo": 1,
"bar": [ "foo", "bar" ],
"foobar" : true,
"a" : { "foo" : null, "b" : [ 3, 2 ] },
"b" : 3.14
}
{
"foo": 1,
"bar": "foo"
}
{
"foo": 2,
"bar": [ "foo", "foobar" ],
"foobar" : false,
"a" : { "foo" : "foo", "b" : [ 3, 2 ] },
"b" : 3.1415
}
Records, in a document store, are called, wait for it, documents. It is important to understand that document stores are optimized for the typical use cases of many records of small to medium sizes. Typically, a collection can have millions or billions of documents, while each single document weighs no more than 16 MB (or a size in a similar magnitude). Some document stores strictly enforce a maximum size and will not allow larger individual documents.
Finally, a collection of documents need not have a schema: it is possible to insert random documents that have dissimilar structures with no problem at all. Most document stores, however, do provide the ability to add a schema. If they do, it is then possible to validate the documents in a collection. This can be done before adding documents (schema-on-write), to ensure validity, or validation can be attempted after adding the schema to a previously schemaless collection, or while processing a collection (schema-on-read).
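In MongoDB, for example, a schema can be attached to a collection as a validator for schema-on-write; this is a minimal sketch, and the collection and field names are only examples:
db.createCollection("scientists", {
  "validator": {
    "$jsonSchema": {
      "bsonType": "object",
      "required": [ "Last" ],
      "properties": {
        "Last": { "bsonType": "string" },
        "Theory": { "bsonType": "string" }
      }
    }
  }
})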
As you can see, this model with collections of trees generalizes the relational model to nested and heterogeneous data elegantly, while retaining the ability to enforce a minimum structure thanks to schemas.
Document stores can generally do selection, projection, aggregation and sorting quite well, but many of them are typically not (yet) optimized for joining collections. In fact, often, their language or API does not offer any joining functionality at all, which pushes the burden to reimplement joins in a host language to the users. The same applies for complex queries that push the API to its limits and force users to write a significant part of their code in the host language, rather than push it down to the document store. This is a serious breach of data independence.
Implementations¶
There is a very large number of products in the document store space for both JSON and XML; let us mention, for example, MongoDB, CouchDB, ElasticSearch, Cloudant, existDB, ArangoDB, BaseX, MarkLogic, DocumentDB, CosmosDB, and so on. We will focus, as an example, on MongoDB.
Physical storage¶
In a data lake, the files are stored "as is" in the cloud or on a distributed file system. In a database management system, the storage format is typically proprietary and optimized for performance. It is hidden from the user. Just like the storage format is optimized for tabular data in a relational database management system, it is optimized for tree-like data in a document store.
In MongoDB, the format is a binary version of JSON called BSON. BSON is based on a sequence of tokens that efficiently encode the JSON constructs found in a document, like so:
The immediate benefit of BSON is that it takes less space in storage than JSON stored as a text file: for example, null, true and false literals need four or five bytes in text format at best, while they can be efficiently encoded as single bytes in BSON.
Furthermore, BSON supports additional types that JSON does not have, such as dates. These types are exposed to the user.
Querying paradigm (CRUD)¶
The API of MongoDB, like many document stores, is based on the CRUD paradigm. CRUD means Create, Read, Update, Delete and corresponds to low-level primitives similar to those we covered in Chapter 6 for HBase.
MongoDB supports several host languages to query collections via an API. This includes in particular JavaScript and Python, but many other languages are supported via drivers. We will use JavaScript here because this is the native host language of MongoDB. It is important to note that these APIs are not query languages; this is because they provide functionality to the user allowing the dynamic creation of a query plan (even though the engine will, of course, optimize and modify it automatically for performance), while a full-blown query language completely hides the concept of query plan from the user and creates it in the background.
MongoDB also provides access to the data via a shell called mongo or, more recently, mongosh. This is a simple JavaScript interface wrapped around MongoDB's node.js driver. We will give our examples as queries on the mongo shell, and we draw the reader's attention to the fact that, if they look for MongoDB documentation via a search engine, they might land on pages showing queries written with a different driver (e.g., Python). The syntax might thus differ, but the general API look and feel is the same.
Populating a collection¶
Creating and populating collections is more straightforward in a document store than in a relational database system, because no schema is required. Thus, to create a collection, one can simply insert a document in it, and it will be automatically created if it does not exist.
Generally, collections in MongoDB are accessed in JavaScript with
db.scientists
where "scientists" is the name of the collection.
In order to insert one document, the method insertOne() should be called, like so:
db.scientists.insertOne(
{
"Last" : "Einstein",
"Theory" : "Relativity"
}
)
In order to insert several documents at the same time, the method insertMany() should be called, with the documents to insert provided in an array, like so:
db.scientists.insertMany(
[
{
"Last" : "Lovelace",
"Theory" : "Analytical Engine"
},
{
"Last" : "Einstein",
"Theory" : "Relativity"
}
]
)
MongoDB automatically adds to every inserted document a special field called "_id", associated with a value called an Object ID, which has a type of its own. An Object ID can simply be thought of as a 12-byte binary value. Object IDs are convenient for deleting or updating a specific document with no ambiguity.
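The Object ID of a freshly inserted document can be recovered from the value returned by the insertion call and used to look the document up again; a small sketch:
const result = db.scientists.insertOne(
  { "Last" : "Curie", "Theory" : "Radioactivity" }
)
// result.insertedId holds the generated Object ID, e.g. ObjectId("...")
db.scientists.findOne({ "_id" : result.insertedId })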
Querying a collection¶
In the following examples, to ease learning, we will also attempt to give a SQL equivalent, but please take this with caution, as document stores are not relational databases and using SQL with them would be an impedance mismatch.
Scan a collection¶
Asking for the contents of an entire collection is done with a simple find() call on the previous object, like so:
db.collection.find()
An equivalent SQL query would be:
SELECT *
FROM collection
This function does not in fact return the entire collection; rather, it returns some pointer, called a cursor, to the collection; the user can then iterate on the cursor in an imperative fashion in the host language (this is another reason why this is not a query language).
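For example, a typical way of consuming the cursor in the mongo shell is an explicit loop in the host language (a sketch; printjson is a shell helper):
const cursor = db.scientists.find()
while (cursor.hasNext()) {
  const doc = cursor.next()   // fetch the next document from the cursor
  printjson(doc)              // print it as formatted JSON
}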
Selection¶
It is possible to perform a selection on a collection by passing a parameter to find() that is a JSON object (the JavaScript syntax is similar, in fact this is where the JSON syntax comes from):
db.collection.find({ "Theory" : "Relativity" })
The above query returns all documents in the collection that have a key called "Theory" associated with the string value "Relativity". It would correspond, in SQL, to a WHERE clause, like so:
SELECT *
FROM collection
WHERE Theory = 'Relativity'
Note that the MongoDB query is not written in any query language; rather, it is an API in an imperative host language (JavaScript) used to create a (high-level) query plan via chains of method calls (find() and others).
What is different from a relational database, then? In a document store, it is possible that some documents have this key, while others do not. The latter are excluded from the results. It is also possible that some documents have the key, but the value is something else than a string (a number, a date, etc). These documents would also be excluded from the results.
It is possible to select on several field values simply by adding more in the parameter object, like so:
db.collection.find(
{
"Theory":"Relativity",
"Last":"Einstein"
}
)
A disjunction (OR) uses a special MongoDB keyword, prefixed with a dollar sign, like so:
db.collection.find(
{
"$or" : [
{ "Theory":"Relativity" },
{ "Last":"Einstein" }
]
}
)
The special "$or" field is associated with an array, and the members of the array are (recursively) all the filtering predicates to take the disjunction of.
MongoDB offers many other keywords, for example for comparison other than equality:
db.collection.find(
{
"Publications" : { "$gte" : 100 }
}
)
Note that quoting keys in MongoDB is optional, and this is also true for dollar keywords; however, we recommend against making this a habit, as the result is not well-formed JSON.
Projection¶
By default, MongoDB returns entire documents.
Projections are made with the second parameter of the same find() method. This is done in the form of a JSON object associating all the desired keys in the projection with the value 1.
db.scientists.find(
{ "Theory" : "Relativity" },
{ "First" : 1, "Last": 1 }
)
equivalent to:
SELECT First, Last
FROM scientists
WHERE Theory = "Relativity"
By default, the object ID, in the field "_id" is always included in the results. It is possible to project it away with a 0:
db.scientists.find(
{ "Theory" : "Relativity" },
{ "First" : 1, "Last" : 1, "_id" : 0 }
)
It is also possible to project fields away in the same way with 0s; however, 1s and 0s cannot be mixed in the projection parameter, except in the specific case above of projecting away the object ID.
db.scientists.find(
{ "Theory" : "Relativity" },
{ "First" : 0, "Last" : 0 }
)
Counting¶
Counting can be done by chaining a count() method call, like so:
db.scientists.find(
{ "Theory" : "Relativity" }
).count()
equivalent to:
SELECT COUNT(*)
FROM scientists
WHERE Theory = "Relativity"
Sorting¶
Sorting can be done by chaining a sort() method call, like so:
db.scientists.find(
{ "Theory" : "Relativity" },
{ "First" : 1, "Last" : 1 }
).sort(
{
"First" : 1,
"Name" :-1
}
)
equivalent to:
SELECT First, Last
FROM scientists
WHERE Theory = "Relativity"
ORDER BY First ASC, Name DESC
1 is for ascending order and -1 for descending order, which is also an artefact and arbitrary convention due to the use of an API rather than a query language.
It is also possible to add limits and offsets to paginate results also by chaining more method calls:
db.scientists.find(
{ "Theory" : "Relativity" },
{ "First" : 1, "Last" : 1 }
).sort(
{
"First" : 1,
"Name" :-1
}
).skip(30).limit(10)
equivalent to:
SELECT First, Last
FROM scientists
WHERE Theory = "Relativity"
ORDER BY First ASC, Name DESC
LIMIT 10
OFFSET 30
Note that, contrary to intuition, the order of the calls does not matter, as this is really just the creation of a query plan by providing parameters (in any order).
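For example, the following two chains should build the same query plan: the sort is applied before skipping and limiting, regardless of the order in which the methods are chained (a sketch, with an example field name):
db.scientists.find().sort({ "Last" : 1 }).skip(30).limit(10)
db.scientists.find().limit(10).skip(30).sort({ "Last" : 1 })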
Duplicate elimination¶
It is possible to obtain all the distinct values for one field with a distinct() call:
db.scientists.distinct("Theory")
Querying for heterogeneity¶
Absent fields¶
Absent fields can be filtered with:
db.scientists.find(
{ "Theory" : null }
)
which is equivalent to the following SQL query (as NULLs, in relational databases, provide limited support for heterogeneity when used parsimoniously):
SELECT *
FROM scientists
WHERE Theory IS NULL
Filtering for values across types¶
Querying for several values with different types and in the same field can easily be made with a disjunctive query:
db.collection.find(
{
"$or" : [
{ "Theory": "Relativity" },
{ "Theory": 42 },
{ "Theory": null }
]
}
)
MongoDB also provides an alternate syntax for doing the same with the $in keyword:
db.scientists.find(
{
"Theory" : {
"$in" : [
"Relativity",
42,
null
]
}
}
)
MongoDB is also able to sort on fields that have heterogeneous data types. It does so by first ordering by type in some (arbitrary, but documented) order, and then within each type. The documented order for sorting across types is (ascending):
null values, numbers, strings, objects, arrays, binary data, Object IDs, Booleans, dates, timestamps, regular expressions.
There are also special "types" and (singleton) values called the min key and the max key whose sole purpose is for the value to come first or last.
Querying for nestedness¶
Nestedness in MongoDB is handled in several ad-hoc ways for specific use cases (a fully generic mechanism for this would require a query language native to denormalized data, which we will cover in a subsequent chapter).
Values in nested objects¶
We saw how to select documents based on values associated with top level keys. What about values that are not at the top-level, but in nested objects? Let us put arrays aside for now and assume only nested objects.
The first solution that might come to mind is something like this:
db.scientists.find({
"Name" : {
"First" : "Albert"
}
})
However, this query will not have the behavior many would have expected: instead of finding documents that have a value "Albert" associated with the key "First" in an object itself associated with the top-level key "Name", this query looks for an exact match of the entire object
{
"First" : "Albert"
}
associated with key “Name”, and will not include documents such as, for example:
{
"Name" : {
"First" : "Albert",
"Some" : "Other field"
}
}
In order to include documents such as above, MongoDB uses a dot syntax. This means that, like the dollar sign, dots are treated in a special way in MongoDB queries. This query will return the document above:
db.scientists.find({
"Name.First" : "Albert"
})
Values in nested arrays¶
MongoDB allows filtering documents based on whether a nested array contains a specific value, like so:
db.scientists.find({
"Theories" : "Special relativity"
})
The query above will return this document, as expected:
{
"Name" : "Albert Einstein",
"Theories" : "Special relativity"
}
but it will also return documents like this:
{
"Name" : "Albert Einstein",
"Theories" : [
"Special relativity",
"General relativity"
]
}
that is, a single value in a query will also be matched with arrays that contain it.
Deleting objects from a collection¶
Objects can be deleted from a collection either one at a time with deleteOne(), or several at a time with deleteMany():
db.scientists.deleteMany(
{ "century" : "15" }
)
will delete all documents matching the criteria, which uses the same syntax as selection in find() queries.
db.scientists.deleteOne(
{ "century" : "15" }
)
will delete just one document matching the criterion, leaving any other documents matching the same criterion unchanged in the original collection. If there is no such document, nothing happens.
Updating objects in a collection¶
Documents can be updated with updateOne() and updateMany() by providing both a filtering criterion (with the same syntax as the first parameter of find()) and an update to apply. The command looks like so:
db.scientists.updateMany(
{ "Last" : "Einstein" },
{ $set : { "Century" : "20" } }
)
In addition to $set, there are also $unset to remove a value and $replaceWith to completely change an entire document.
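For example, removing a field from all matching documents could look like this (a sketch):
db.scientists.updateMany(
  { "Last" : "Einstein" },
  { "$unset" : { "Century" : "" } }   // the value given to $unset is ignored
)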
The granularity of updates is per document, that is, a single document can be updated by at most one query at the same time. However, within the same collection, several different documents can be modified concurrently by different queries in parallel.
Complex pipelines¶
For grouping and such more complex queries, MongoDB provides an API in the form of aggregation pipelines, for which the syntax looks like so:
db.scientists.aggregate([
  { $match : { "Century" : 20 } },
  { $group : {
      "_id" : "$Year",
      "Count" : { "$sum" : 1 }
    }
  },
  { $sort : { "Count" : -1 } },
  { $limit : 5 }
])
Since we covered Apache Spark in Chapter 10, we believe the above code requires little explanation, as it is quite akin to a sequence of Spark transformations followed by a collect action.
For completeness, this would be a similar SQL query:
SELECT Year, COUNT(*) AS Count
FROM scientists
WHERE Century = 20
GROUP BY Year
ORDER BY Count DESC
LIMIT 5
Limitations of a document store querying API¶
The API provided by MongoDB and similar document stores is very powerful and goes beyond what SQL can do in functionality, especially in the context of nested, heterogeneous datasets.
Simple use cases are straightforward to handle; however, more complex use cases require a lot of additional code in the host language, be it JavaScript or Python. An example is that joins must be taken care of by the end user in the host language: the burden of implementing more complex use cases is pushed to the end user. This leads to code that is complex to write and to read, and also potentially to performance issues in cases where the optimizer does not catch non-optimal code written by the user.
Some document stores recently added higher-level query languages on top of their API in order to address this issue. We will cover query languages for denormalized data in Chapter 12.
Architecture¶
The architecture of MongoDB follows similar principles to what we covered before: scaling out the hardware to multiple machines, and sharding as well as replicating the data.
Sharding collections¶
Collections in MongoDB can be sharded. Shards are determined by selecting one or several fields. (Lexicographically-ordered) intervals over these fields then determine the shards. This is similar in spirit to regions in HBase covered in Chapter 6, which are sharded by Row ID intervals.
Shards then are stored in different physical locations. The fields used to shard must be organized in a tree index structure. Indices are described in Section 11.8.
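In the mongo shell, sharding a collection is requested explicitly by providing the shard key; a minimal sketch, assuming a sharded deployment and a hypothetical database called research:
sh.enableSharding("research")
sh.shardCollection("research.scientists", { "Name.Last" : 1 })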
Replica sets¶
A replica set is a set of several nodes running the MongoDB server process. The nodes within the same replica set all have a copy of the same data.
Each shard of each collection is assigned to exactly one replica set. Note that this architecture is not the same as that of HDFS, in which the replicas are spread over the entire cluster with no notion of "walls" between replica sets and no two DataNodes having the exact same block replicas. It is also not the same as HBase, in which nodes receive the responsibility of handling specific regions, but do not necessarily store them physically as this is done on the HDFS level.
Write concerns¶
When writing (be it delete, update or insert) to a collection, or more exactly, to a specific shard of a collection, MongoDB checks that a specific minimum number of nodes (within the replica set that is responsible for the shard) have successfully performed the update. This is similar in spirit to the W parameter in DynamoDB. Once this minimum number of replicas is reached, the user call returns (synchronously). Then replication continues in the background to a higher number of nodes (asynchronously).
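In MongoDB, this minimum number is exposed as the write concern parameter w, which can be passed along with a write call; a sketch with example values:
db.scientists.insertOne(
  { "Last" : "Hopper", "Theory" : "Compilers" },
  { "writeConcern" : { "w" : 2 } }   // wait until 2 nodes acknowledge the write
)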
Indices¶
Motivation¶
A document store, unlike a data lake, manages the physical data layout. This has a cost: the need to import (ETL) data before it is possible to query it. But this cost comes with a nice benefit: index support, just like in relational database management systems.
An index is an auxiliary structure that can accelerate certain queries.
Imagine, for example, a large collection with billions of objects like this:
{"Name":"Einstein", "Profession":"Physicist"}
{"Name":"Lovelace", "Profession":"Computer scientist"}
{"Name":"G¨odel", "Profession":"Mathematician"}
{"Name":"Ramanujan", "Profession":"Mathematician "}
{"Name":"Pythagoras", "Profession":"Mathematician "}
{"Name":"Curie", "Profession":"Chemist"}
{"Name":"Turing", "Profession":"Computer Scientist"}
{"Name":"Church", "Profession":"Computer Scientist"}
{"Name":"Nash", "Profession":"Economist"}
{"Name":"Mirzakhani", "Profession":"Mathematician"}
{"Name":"Euler", "Profession":"Mathematician"}
{"Name":"Bohm", "Profession":"Physicist"}
{"Name":"Galileo", "Profession":"Astrophysicist"}
{"Name":"Germain", "Profession":"Mathematician"}
{"Name":"Lagrange", "Profession":"Mathematician"}
{"Name":"Gauss", "Profession":"Mathematician"}
{"Name":"du Chatelet", "Profession":"Computer Scientist"}
{"Name":"Thales", "Profession":"Mathematician"}
...
A typical use case is performing a selection on this collection. A selection query can be very specific and pinpoint exactly one document. This is called a point query. Assuming the values associated with the Name field are unique, the following is a point query:
find({"Name":"Euler"})
With no index, the document store would have no choice but to scan the full collection. It can easily take minutes, if not hours, before the appropriate document is located.
With an index, this same query, with no change, will execute in just a few milliseconds.
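Whether an index is actually used can be checked by chaining the explain() method to the query; a sketch:
db.scientists.find({ "Name" : "Euler" }).explain("executionStats")
// the winning plan shows an index scan (IXSCAN) rather than a
// full collection scan (COLLSCAN) if a suitable index exists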
Even selection queries that are less specific will be considerably faster with an index, for example:
find("Profession":"Mathematician"})
There is another important class of selection queries called range queries. Range queries select an interval of values on an ordered field.
Imagine for example this collection:
{"Name":"Einstein", "Year":1879}
{"Name":"Lovelace", "Year":1815}
{"Name":"G¨odel", "Year":1906}
{"Name":"Ramanujan", "Year":1887}
{"Name":"Pythagoras", "Year":-570}
{"Name":"Curie", "Year":1867}
{"Name":"Turing", "Year":1912}
{"Name":"Church", "Year":1903}
{"Name":"Nash", "Year":1928}
{"Name":"Mirzakhani", "Year":1977}
{"Name":"Euler", "Year":1707}
{"Name":"Bohm", "Year":1917}
{"Name":"Galileo", "Year":1564}
{"Name":"Germain", "Year":1777}
{"Name":"Lagrange", "Year":1736}
{"Name":"Gauss", "Year":1777}
{"Name":"du Chatelet", "Year":1706}
{"Name":"Thales", "Year":-624}
Then the following query is a range query, which selects all years from 1900 onward.
find("Year":{"$gte":1900})
How can an index make queries faster? Let us start with a visual on a simple example. Let us consider the collection
{
  "Name": "Apple", "Color": [ "green", "red" ]
}
{
  "Name": "Orange", "Color": [ "orange" ]
}
{
  "Name": "Banana", "Color": [ "yellow" ]
}
{
  "Name": "Kiwi", "Color": [ "brown", "green" ]
}
{
  "Name": "Ananas", "Color": [ "yellow" ]
}
and imagine we would like to quickly locate fruits that have a specific colour. Then an index would visually look like so:
The general idea is that by picking a color on the left (which can be done quickly; how quickly depends on the index chosen), it then suffices to follow the pointer(s) to the documents with that color, which is just a disk read away.
We will look into two kinds of indices commonly used: hash indices and tree indices.
Hash indices¶
Hash indices are used to optimize point queries and, more generally, queries that select on a specific value of a field.
The general idea is that all the values that a field takes in a specific collection can be hashed to an integer. The value, together with pointers to the corresponding documents, is then placed in a physical array in memory, at the position corresponding to this integer (modulo the overall size of the array).
The primitives involved in this are all in O(1): computing the hash leverages cryptographic tools by means of a so-called hash function. Hash functions fulfill useful criteria such as making collisions unlikely, spreading values uniformly in the index array, etc. They are, in fact, more powerful than what we need, but most importantly powerful enough for what we need.
The lookup of a value at a specific position in the array also happens in O(1): this is really just adding the integer to a pointer address to obtain a new address.
Finally, following the pointer(s) to the document(s) also happens in O(1), although the constant can vary: if the collection fits in memory and is loaded in memory, then this is just an access to RAM. If the desired documents are not loaded in memory, this is a disk access, but even in this case, disks support random access, so it is fast too (compared to scanning the entire collection...).
However, there is no free lunch: before an index can be used, it must be built. Building an index consists of creating the array structure and then populating it by sequentially scanning the entire collection, computing the hash of the value, and adding an entry with a pointer to the document, document by document.
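To make the mechanism concrete, here is a small, purely illustrative JavaScript sketch of building and probing a hash index on one field; this is not MongoDB's internal implementation, and the hash function is a toy one:
// toy hash function over the string representation of a value
function hash(value) {
  let h = 0
  const s = String(value)
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0
  }
  return h
}

// build: scan the collection once and fill the buckets
function buildHashIndex(collection, field, size) {
  const buckets = Array.from({ length: size }, () => [])
  collection.forEach((doc, position) => {
    const b = hash(doc[field]) % size
    buckets[b].push({ value: doc[field], pointer: position })
  })
  return buckets
}

// probe: O(1) bucket lookup, then follow the pointers
function pointQuery(index, value, collection) {
  const b = hash(value) % index.length
  return index[b]
    .filter(entry => entry.value === value)   // handle hash collisions
    .map(entry => collection[entry.pointer])
}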
Indices are built at the request of the user, by executing a command. Building an index can happen either synchronously, in the sense that the entire collection (or data store) remains unavailable during build time, or asynchronously, meaning that the collection (or data store) remains available, but will be slower until the index is completely built.
The following illustrates how the index is used to perform a point query:
Tree indices¶
Hash indices are great and fast, but have limitations: first, they consume space. The more one wants to avoid hash collisions, the larger the array needs to be. Collisions can be handled by nesting additional lists in the array entries for all values that land into the same hash value, however if the array is too small these lists grow with the size of the collection and lookups become slower.
But more importantly, hash indices cannot support range queries. This is because hashes do not preserve the order of the values and distribute them “randomly” in the index array structure. At best, evaluating range queries with a hash index might work with discrete value spaces, by looking up each possible value in the sought interval, one by one. This is, however, intractable and inefficient for large ranges, and impossible for decimal numbers or double values.
Range queries are supported with tree indices. Instead of an array, tree indices use some sort of tree structure in which they arrange the possible values of the indexed field, such that the values are ordered when traversing the tree in a depth-first-search manner.
More precisely, the structure is called a B+-tree. Unlike a simple binary tree, nodes have a large number of children. The intent is that the leaves (and nodes) of the tree are large enough to match roughly a disk block, in order to optimize disk latency when fetching the nodes. In many cases, indices are so large (or numerous) that they do not fit in memory, and the database system only loads the parts of the index it needs. With nodes the size of a block, fewer disk accesses are needed.
There are a few constraints in a B+-tree. First, all leaves must be at exactly the same depth. This depth grows with larger collections. Second, the number of children of each node must be within a specific interval, generally parameterized as between d + 1 and 2d + 1 for some choice of d. For pedagogical purposes, we use a small value of d in our examples, but in practice d is larger. The only exception is the root node, which is not subject to the d + 1 minimum. The root can have fewer children, but at least two (had it only one child, it would be useless and could be removed).
The non-leaf nodes in the tree contain a list of increasing values, interlaced with pointers to the children. These values are compared to the sought value in order to locate the pointer that must be followed to resolve it. There is exactly one value fewer on a node than its number of children. Thus, each node has between d and 2d values.
In a B+-tree, all possible values appear on the leaves together with a pointer to the documents that contain them. The values can be "repeated" on non-leaf nodes, but values on non-leaf nodes are only used for comparison purposes. A B+-tree also typically chains all its leaves with pointers, for efficient full-traversals in ascending value order. This allows resolving a range query by only looking up the bounds in the B+-tree, and then traversing from the minimum to the maximum value.
This is unlike B-trees, in which non-leaf nodes can also contain pointers to documents. Insertion and deletion of values in a B+-tree are a non-trivial matter, which we will not dive into. The idea is that one navigates down the tree and compares the new value to the existing values, then inserts the new value in a free spot. If this breaks the constraint that there must be at most 2d values on the same node, then the node is split into two nodes. This, in turn, can cause the number of values on the parent node to exceed 2d, in which case the splitting recursively propagates up the tree, potentially even causing the depth to increase by one. Deletion is the other way around: if nodes become too small, they are merged, and this can also cause the depth of the tree to shrink. Below is a visual of a very small B+-tree index structure on a collection:
And an example of a lookup. The lookup is in logarithmic time (O(log n)) in the size of the collection, which is less efficient than a hash index, but nevertheless very good compared to a linear full scan.
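As a purely illustrative sketch (not MongoDB's internal code), resolving a range query over a B+-tree amounts to descending to the leaf that would contain the lower bound and then following the chained leaves until the upper bound is exceeded:
// descend from the root to the leaf whose interval contains the value;
// internal nodes hold separator values interlaced with child pointers
function findLeaf(node, value) {
  while (!node.isLeaf) {
    let i = 0
    while (i < node.values.length && value >= node.values[i]) i++
    node = node.children[i]
  }
  return node
}

// collect pointers for all values in [low, high] by walking the leaf chain
function rangeQuery(root, low, high) {
  const results = []
  let leaf = findLeaf(root, low)
  while (leaf !== null) {
    for (const entry of leaf.entries) {
      if (entry.value > high) return results   // past the upper bound: done
      if (entry.value >= low) results.push(entry.pointer)
    }
    leaf = leaf.next   // leaves are chained in ascending order of values
  }
  return results
}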
Secondary indices¶
By default, MongoDB always builds a tree index for the _id field. Users can request to build hash and tree indices for more fields. These indices are called secondary indices. The choice of which indices to build depends on the collection and expected queries, and is both a science and an art. Too many indices can be counterproductive and slow down the system.
The command for building a hash index looks like so:
db.scientists.createIndex({
"Name.Last" : "hash"
})
And for a tree index (1 means ascending order, -1 would be descending):
db.scientists.createIndex({
"Name.Last" : 1
})
Secondary tree indices can also involve several fields, as opposed to just one for hash indices. When several fields are involved, the value tuples are sorted in lexicographic order.
db.scientists.createIndex({
"Name.Last" : 1
"Name.First" : -1
})
(Note, en passant, that the order in which the fields are given is thus important, which is a breach of the JSON object model, in which key-value pairs are not supposed to be ordered; this shows the limits of this and similar APIs compared to high-level query languages, which we will study in a later chapter.)
What is important to understand is that, if one builds a tree index on fields A, B, C and D, then a tree index on field A is superfluous, and so is a tree index on fields A and B, and so is a tree index on fields A, B and C. This is because looking up documents with a specific value for A only, or for A and B, or for A, B and C can be efficiently done on the index on all four fields. Why this is, is left as a very interesting exercise (hint: this is because of the lexicographic ordering).
It is a common mistake by document store users to build superfluous indices in this way, which wastes valuable space.
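For instance, with a compound tree index on four hypothetical fields A, B, C and D (a sketch), queries on any prefix of the field list can use the index, while queries that skip the first field cannot:
db.collection.createIndex({ "A" : 1, "B" : 1, "C" : 1, "D" : 1 })

db.collection.find({ "A" : 1 })                     // can use the index
db.collection.find({ "A" : 1, "B" : 2 })            // can use the index
db.collection.find({ "A" : 1, "B" : 2, "C" : 3 })   // can use the index
db.collection.find({ "B" : 2, "C" : 3 })            // cannot use it efficiently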
When are indices useful¶
When building indices, it is important to get a feeling for whether a query will be faster or not with this index.
Some cases are obvious. For example, if we built a hash index on Name.Last, then this query will be instantaneous:
db.scientists.find({
"Name.Last" : "Einstein"
})
Let us now imagine we built a hash index on Profession. Then this query will be faster than a full scan, but not instantaneous:
db.scientists.find({
"Profession" : "Physicist",
"Theories" : "Relativity"
})
This is because the index can be used to pre-filter a superset of the results (those that have the correct value for Profession), but some post-processing is still needed, in memory, to retain only those documents that also have the correct value for Theories.
Now, let us look at range queries. If we consider this index:
db.scientists.createIndex({
"Birth" : 1,
"Death" :-1
})
Then this query will be very fast:
db.scientists.find({
"Birth" : 1887,
"Death" : 1946
})
However, since the index is a tree index, an interval lookup will also be faster than a full scan (note that this works because Birth is the first field of the index; it would not work if it were the second):
db.scientists.find({
"Birth" : { "$gte": 1946 }
})
This query will also be faster, but requires additional post-processing for the Death field in memory:
db.scientists.find({
"Birth" : { "$gte": 1946 },
"Death" : 1998
})