263-3010-00: Big Data
Section 6
Wide Column Stores
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/16/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any consequences of using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Now that we have looked into some textual data formats (syntaxes) like CSV, JSON, and XML, how do we store these CSV, JSON, and XML files? A first obvious solution is: on the local disk, on an object storage service (S3, Azure Blob Storage), or on a distributed file system (HDFS).
The problem with HDFS is its latency: HDFS works well with very large files (at least hundreds of MBs, so that blocks even start becoming useful), but it will have performance issues when accessing millions of small XML or JSON files.
Wide column stores were invented to provide more control over performance and, in particular, to achieve high throughput and low latency for objects ranging from a few bytes to about 10 MB, which are too big and too numerous to be efficiently stored as so-called clobs (character large objects) or blobs (binary large objects) in a relational database system, but also too small and too numerous to be efficiently accessed in a distributed file system.
A Sweet Spot between Object Storage and Relational Database Systems
The astute reader will argue that a large number of JSON or XML files would be handled quite well with an object storage service. But a wide column store has additional benefits:
a wide column store will be more tightly integrated with the parallel data processing systems. This is possible because the wide column store processes run on the same machines as the data processing processes, which makes the entire system faster.
wide column stores have a richer logical model than the simple key-value model behind object storage
wide column stores also handle very small values (bytes and kBs) well thanks to batch processing.
Note that a wide column store is not a relational database management system:
it does not have any data model for values, which are just arrays of bytes;
since it efficiently handles values up to 10 MB, the values can be nested data in various formats, which breaks the first normal form;
tables do not have a schema;
there is no language like SQL, instead the API is on a lower level and more akin to that of a key-value store;
tables can be sparse, allowing for billions of rows and millions of columns at the same time; this is another reason why data stored in HBase is denormalized.
History
Relational database management systems (RDBMS) from the 1970s were designed to run on a single machine, limiting their data capacity and processing speed to that machine's resources. To handle more data or improve performance, early solutions involved scaling up with bigger machines, but this had limits due to cost and feasibility.
In the early 2000s, tech companies began scaling out across multiple machines. This involved spreading data over several systems, similar to how data is partitioned in HDFS with redundancy across clusters. However, this required additional, complex software to manage data partitioning, replication, and querying across the cluster, making it costly and difficult to maintain.
Engineers realized they were essentially creating a new type of database system optimized for clusters. Google was the first to release a fully integrated solution, called BigTable, followed by the open-source HBase in the Hadoop ecosystem.
Logical Data Model
Rationale
The data model of HBase is based on the realization that joins are expensive and should be avoided or minimized on a cluster architecture. Joins can be avoided if they are pre-computed, that is, if instead of storing the data as separate tables, we store, and work on, the joined table.
Visually, what would be stored as several normalized tables in a traditional RDBMS is instead stored as a single, pre-joined (denormalized) table in HBase.
There is a direct consequence of this change: the number of columns in a traditional RDBMS is limited, typically to somewhere around 256, or maybe in the low four digits if the columns have simple and compact types. But in a denormalized table, the number of columns can easily skyrocket, with many cells being in fact empty. The table is thus very wide and sparse, which is a completely different use case than what RDBMS software was designed for.
The second design principle underlying HBase is that it is efficient to store together what is accessed together. In the big picture, this is a flavor of batch processing, one of the overarching principles in Big Data. Batch processing reduces the impact of latency - remember that latency barely improved in the past few decades in comparison to capacity and throughput - by involving fewer, larger requests instead of more, smaller requests.
These two design principles are tied to a specific usage pattern of the database management system: when the emphasis is on reading and processing data in large amounts. This is, at first sight, in direct conflict with efficiently writing data. First, writing denormalized data is more cumbersome: one needs to deal with insertion, update, and deletion anomalies. Second, writing data under the constraint of storing together what is accessed together is a challenging endeavor, because one cannot just "insert" new data at a specific physical location without moving the existing data around.
HBase provides an impressive solution for handling writes quite efficiently, while preserving a batch-processing paradigm. Moreover, the underlying idea also impacted the handling of intermediate data by MapReduce and Apache Spark.
Tables and row IDs
From an abstract perspective, HBase can be seen as an enhanced key-value store, in the sense that:
a key is compound and involves a row, a column and a version
keys are sortable
values can be larger (clobs, blobs), up to around 10 MB.
This is unlike traditional key-value stores, which have a flat key-value model, which (in the case of distributed hash tables) do not sort keys, which thus do not store "close" keys together, and which usually support smaller values.
On the logical level, the data is organized in a tabular fashion: as a collection of rows. Each row is identified with a row ID. Row IDs can be compared, and the rows are logically sorted by row ID:
A row ID is logically an array of bytes, although there is a library to easily create row ID bytes from specific primitive values (bytes, short, int, long, string, etc).
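As a brief illustration (a sketch only: the table contents and values are made up, but the Bytes utility is part of the standard HBase Java client), row IDs can be built from primitive values as follows:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RowIdExample {
    public static void main(String[] args) {
        // Row IDs are plain byte arrays; the Bytes utility encodes/decodes primitives.
        byte[] fromString = Bytes.toBytes("user#42");  // UTF-8 bytes of a string
        byte[] fromLong = Bytes.toBytes(42L);          // fixed-width 8-byte encoding

        // Because rows are sorted by the raw bytes, fixed-width encodings of
        // non-negative numbers preserve their numeric order across row IDs.
        System.out.println(Bytes.toString(fromString)); // user#42
        System.out.println(Bytes.toLong(fromLong));     // 42
    }
}
```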
Column families
The other attributes, called columns, are split into so-called column families. This is a concept that does not exist in relational databases and that allows scaling the number of columns. Very intuitively, one can think of column families as the tables that one would have if the data were actually normalized and the joins had not been pre-computed. On the picture above, we separated the columns into column families, using a different color and alphabet for each family. The name of a column family is a string, just like the name of a table. Often, we also use the terminology "column family" to refer to the name of the column family, i.e. we identify the column family with its name.
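For illustration, here is a minimal sketch of declaring a table with two column families using the HBase 2.x Java admin API; the table and family names are invented for this example.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // Only the table name and its column families are part of the schema;
            // column qualifiers will appear on the fly as data is inserted.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("customers"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("profile"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("orders"))
                    .build());
        }
    }
}
```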
Column qualifiers
Columns in HBase have a name (in addition to the column family) called the column qualifier; however, unlike in a traditional RDBMS, they do not have a particular type. In fact, as far as HBase is concerned, all values are binary (arrays of bytes) and what the user does with them (string, integer, large objects, XML, JSON, HTML, etc.) is really up to them. There are many different frameworks that can be used alongside HBase to add a type system (Avro, etc.), and it is, in fact, very common to store large blobs of data in the cells.
In fact, it goes further than that. Not only are there no column types: even the column qualifiers are not specified as part of the schema of an HBase table. Columns are created on the fly when data is inserted, and the rows need not have data in the same columns, which natively allows for sparsity. Column qualifiers are arrays of bytes (rather than strings), and as for row IDs, there is a library to easily create column qualifiers from primitive values.
Thus, on the logical level, columns come and go as the table lives its life:
Unlike the values, which can be large arrays of bytes (blobs), column families and column qualifiers should be kept short because, as we will see, they are repeated a gigantic number of times on the physical layer.
Versioning
HBase generally supports versioning, in the sense that it keeps track of the past versions of the data. As we will see, this is implemented by associating any value with a timestamp, also called version, at which it was created (or deleted). Users can also override timestamps with a value of their choice to have more control over versions.
A multidimensional key-value store
As we previously explained, one way to look at an HBase table is that it is an enhanced key-value store where the key is four-dimensional. Indeed, in HBase, the key identifying the values in the cells consists of:
the row ID
the column family
the column qualifier
the version
HBase is able to efficiently look up any cell, and even any version of any cell, given its key.
Logical Queries
Having realized that an HBase table is nothing but a four-dimensional key-value store, it follows logically that the HBase API also resembles that of a key-value store: HBase supports four kinds of low-level queries: get, put, scan, and delete. Unlike a traditional key-value store, HBase also supports querying ranges of row IDs and ranges of timestamps.
In comparison to a full-fledged RDBMS, this is quite limited and, as for data types, support for higher-level queries (such as SQL) is brought by additional frameworks that complement, and sit atop, HBase (Apache Hive, Apache Phoenix, etc.).
Get
With a get command, it is possible to retrieve a row specifying a table and a row ID.
Optionally, it is also possible to only request some but not all of the columns, or to request a specific version, or the latest $k$ versions (where $k$ can be chosen) within a time range (interval of versions).
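A minimal sketch of a get using the HBase 2.x Java client (the table, row, and column names are made up; older client versions use setMaxVersions instead of readVersions):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    static byte[] readCell(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("customers"))) {
            Get get = new Get(Bytes.toBytes("user#42"));                      // table + row ID
            get.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"));  // optional: only some columns
            get.setTimeRange(0L, System.currentTimeMillis());                 // optional: interval of versions
            get.readVersions(3);                                              // optional: latest k versions
            Result result = table.get(get);
            return result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
        }
    }
}
```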
Put
With a put command, it is possible to put a new value in a cell by specifying a table, row ID, column family, and column qualifier.
It is also possible to optionally specify the version. If none is specified, the current time is used as the version.
HBase offers a locking mechanism at the row level, meaning that different rows can be modified concurrently, but the cells in the same row cannot: only one user at a time can modify any given row.
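A corresponding sketch of a put (same invented table and columns as above); note how the version can be left implicit or supplied explicitly:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    static void writeCells(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("customers"))) {
            Put put = new Put(Bytes.toBytes("user#42"));  // row ID
            // family, qualifier, value: the version defaults to the current time
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            // family, qualifier, explicit version, value
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                          1700000000000L, Bytes.toBytes("Alice"));
            table.put(put);
        }
    }
}
```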
Scan
With a scan command, it is possible to query a whole table or part of a table, as opposed to a single row.
It is possible to restrict the scan to specific columns families or even columns.
It is possible to restrict the scan to an interval of rows.
It is possible to run the scan at a specific version, or on a time range.
Scans are fundamental for obtaining high throughput in parallel processing.
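A sketch of a scan over an interval of row IDs, restricted to one column family (again with invented names; withStartRow/withStopRow are the HBase 2.x methods):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    static void scanRange(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("customers"))) {
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("user#40"))   // interval of rows: start is included
                .withStopRow(Bytes.toBytes("user#50"))    // stop is excluded
                .addFamily(Bytes.toBytes("orders"));      // restrict to a column family
            scan.setTimeRange(0L, System.currentTimeMillis()); // optional: time range
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```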
Delete
With a delete command, it is possible to delete a specific value by specifying a table, row ID, column family, and qualifier. Optionally, it is also possible to delete the value with a specific version, or all values with a version less than or equal to a specific version.
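And a sketch of a delete (invented names again), showing both deleting the latest version of a cell and deleting all versions up to a given timestamp:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    static void deleteCells(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("customers"))) {
            Delete delete = new Delete(Bytes.toBytes("user#42"));                // row ID
            delete.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"));  // latest version of this cell
            delete.addColumns(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                              1700000000000L);                                   // all versions <= this timestamp
            table.delete(delete);
        }
    }
}
```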
Physical Architecture
Partitioning
A table in HBase is physically partitioned in two ways: on the rows and on the columns.
The rows are split in consecutive regions. Each region is identified by a lower and an upper row key, the lower row key being included and the upper row key excluded.
A partition is called a store and corresponds to the intersection of a region and of a column family.
Network Topology
HBase has exactly the same centralized architecture as HDFS.
The HMaster and the RegionServers should be understood as processes running on the nodes, rather than the nodes themselves, even though it is common to use "HMaster" to designate the node on which the HMaster process runs, and "RegionServer" to designate a node on which a RegionServer process runs.
It is common for the HMaster process to run on the same node as the NameNode process. Likewise, it is common for the RegionServer processes to run on those same nodes on which the DataNode processes run.
The HMaster assigns responsibility of each region to one of the RegionServers. This does mean that, for a given region (interval of row IDs), all column families - each one within this region being a store - are handled by the same RegionServer.
There is no need to attribute the responsibility of a region to more than one RegionServer at a time because, as we will see soon, fault tolerance is already handled on the storage level by HDFS.
If a region grows too big, for example because of many writes in the same row ID interval, then the region will be automatically split by the responsible RegionServer. Note, however, that concentrated writes ("hot spots") might be due to a poor choice of row IDs for the use case at hand. There are solutions to this, such as salting or using hashes in row ID prefixes.
If a RegionServer has too many regions compared to other RegionServers, then the HMaster can reassign regions to other RegionServers.
Likewise, if a RegionServer fails, then the HMaster can reassign all its regions to other RegionServers.
Physical storage
Let us now dive into the actual physical storage. As we saw, the data is partitioned in stores, so we need to look at how each store is physically stored and persisted.
The store is, physically, nothing less than an organized set of cells:
Each value in a cell is identified by a row ID (within the region handled by the store), a column family (the one handled by the store), a column qualifier (arbitrary) and a version (arbitrary). The version is often implicit, as several versions of the same cell can co-exist with the latest one being current, but it is an important component in the identification of a value in a cell. This tuple of four values will be referred to as the key (of the value in the cell).
Each value in a cell is thus handled physically as a key-value pair where the key is a (row ID, column family, column qualifier, version) tuple and the value is its content. On the physical level, a key-value pair is often referred to in CamelCase as a KeyValue to disambiguate from other contexts in which key-values might appear within HBase.
All the cells within a store are eventually persisted on HDFS, in files that we will call HFiles.
An HFile is, in fact, nothing else than a (boring) flat list of KeyValues, one per version of a cell. What is important is that, in an HFile, all these KeyValues are sorted by key in increasing order, meaning, first sorted by row ID, then by column family (trivially unique for a given store), then by column qualifier, then by version (in decreasing order, recent to old).
This means that all versions of a given cell that are in the same HFile are located together, and one of the values (within this HFile) is the latest.
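To make the ordering concrete, here is a small sketch of a comparator over simplified (row ID, family, qualifier, version) keys; this is an illustration only, not HBase's actual comparator, and the fields are simplified to strings and a long.

```java
import java.util.Comparator;

// Sketch of the HFile key order described above: ascending by row ID, column
// family and column qualifier, and descending by version, so that the most
// recent version of a cell comes first among its versions.
public class KeyOrderSketch {
    record Key(String rowId, String family, String qualifier, long version) {}

    static final Comparator<Key> KEY_ORDER =
        Comparator.comparing(Key::rowId)
                  .thenComparing(Key::family)
                  .thenComparing(Key::qualifier)
                  .thenComparing(Comparator.comparingLong(Key::version).reversed());
}
```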
Of course, on the disk, a file is a sequence of 0s and 1s with no tabular structure, so what in fact happens is that the KeyValues are stored sequentially, like so:
Now if we zoom in at the bit level, a KeyValue consists of four parts:
The length of the key in bytes (this length is itself encoded on a constant, known number of bits)
The length of the value in bytes (this length is itself encoded on a constant, known number of bits)
The actual key (of variable length)
The actual value (of variable length)
Why do we not just store the key and the value? This is because their lengths can vary. If we do not know their lengths, then it is impossible to know where they stop just by looking at the bits. Thus, the trick is to start with saving the length of the key, this length being itself always stored on 32 bits. Once the engine has obtained the length of the key (say $n_k$ bytes) and the length of the value (say $n_v$ bytes) from the first 64 bits, it can look at the next $n_k$ bytes (remember that 1 byte is 8 bits) and stop: we have the key. Then the engine looks at the next $n_v$ bytes and stops: we have the value. And then it proceeds to the next KeyValue.
Zooming in, the key is itself made of the row ID, the column family, the column qualifier and the timestamp. We also need a row ID length and a column family length (similar to the key length and the value length). The timestamp has a fixed length (64 bits) and does not need additional input on its length. Finally, the last byte (named "key type" for some reason) is mostly used as a deletion flag that indicates that the content of the cell, as of this version, was deleted.
Why does the column qualifier not have an additional column qualifier length? This is because we know the key length, so the column qualifier length would be superfluous: it can be deduced with simple arithmetic.
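The following sketch decodes one KeyValue from a byte buffer following the layout just described. The exact widths of the row ID length (2 bytes) and column family length (1 byte) fields are assumptions for illustration, not a specification of HBase's on-disk format.

```java
import java.nio.ByteBuffer;

public class KeyValueDecoder {
    // Decodes a single KeyValue and prints its components.
    public static void decodeOne(ByteBuffer buf) {
        int keyLength = buf.getInt();    // length of the key, always stored on 32 bits
        int valueLength = buf.getInt();  // length of the value, always stored on 32 bits
        int keyEnd = buf.position() + keyLength;

        short rowLength = buf.getShort();          // length of the row ID (assumed 2 bytes)
        byte[] rowId = new byte[rowLength];
        buf.get(rowId);
        byte familyLength = buf.get();             // length of the column family (assumed 1 byte)
        byte[] family = new byte[familyLength];
        buf.get(family);
        // The qualifier length is not stored: it is whatever remains of the key
        // once the timestamp (8 bytes) and the key type (1 byte) are subtracted.
        byte[] qualifier = new byte[keyEnd - buf.position() - 8 - 1];
        buf.get(qualifier);
        long timestamp = buf.getLong();            // the version
        byte keyType = buf.get();                  // e.g. a deletion flag

        byte[] value = new byte[valueLength];      // the cell content itself
        buf.get(value);

        System.out.printf("row=%s ts=%d type=%d valueBytes=%d%n",
                          new String(rowId), timestamp, keyType, value.length);
    }
}
```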
The next thing to know is that KeyValues, within an HFile, are organized in blocks. But to not confuse them with HDFS blocks, we will call them HBlocks. HBlocks have a size of 64 kB, but this size is variable: if the last KeyValue goes beyond this boundary, then the HBlock is simply longer and stops whenever the last KeyValue stops. This is in particular the case for large values exceeding 64 kB, which will be "their own HBlock".
The HFile then additionally contains an index of all blocks with their key boundaries.
This separate index is loaded in memory prior to reading anything from the HFile. It can then be kept in memory for subsequent reads. Thanks to the index, it is possible to efficiently find out in which HBlock the KeyValues with a specific key (or within a specific key range) are to be read.
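As a sketch of how such an index is used (keys simplified to strings, offsets to longs; this is not the actual HFile index structure), a floor lookup in a sorted map identifies the single HBlock that can contain a given key:

```java
import java.util.Map;
import java.util.TreeMap;

public class BlockIndexSketch {
    // Maps the first key of each HBlock to the block's offset within the HFile.
    private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();

    void addBlock(String firstKeyInBlock, long offsetInHFile) {
        firstKeyToOffset.put(firstKeyInBlock, offsetInHFile);
    }

    // Returns the offset of the only HBlock that can contain the key,
    // or null if the key sorts before the first block of the HFile.
    Long locate(String key) {
        Map.Entry<String, Long> entry = firstKeyToOffset.floorEntry(key);
        return entry == null ? null : entry.getValue();
    }
}
```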
The following is a summary of the entire physical storage hierarchy of KeyValues on HDFS:
But wait, if all the KeyValues within an HFile are sorted, and it is not possible to modify files on HDFS randomly, how is it even possible to insert any new KeyValues when the HBase user puts values into new cells?
Log-structured merge trees
Before we dive into writing and persisting new data, it is important to understand where the data is read from.
Generally, "old" data is persisted to HDFS in HFiles, while "fresh" data is still in memory on the RegionServer node, and has not been persisted to HDFS yet.
Thus, when accessing data, HBase needs to generally look everywhere for cell values (i.e., physically, KeyValues) to potentially return: in every HFile, and in memory.
As long as there is room in memory, freshly created KeyValues are added in memory. At some point, the memory becomes full (or some other limits are reached). When this happens, all the KeyValues need to be flushed to a brand new HFile.
Upon flushing, all KeyValues are written sequentially to a new HFile in ascending key order, HBlock by HBlock, concurrently building the index structure. In fact, the sorting is not done at the last minute when flushing. Rather, what happens is that when KeyValues are added to memory, they are added to a data structure that maintains them in sorted order (such as a tree map), and flushing is then a linear traversal of the tree.
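Here is a highly simplified sketch of that idea (keys and values as strings, plain local files instead of HFiles on HDFS, an arbitrary flush threshold): fresh KeyValues go into a sorted map, and a flush is a linear traversal of that map written out in key order.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class MemStoreSketch {
    private final TreeMap<String, String> memstore = new TreeMap<>(); // kept sorted on insertion
    private static final int FLUSH_THRESHOLD = 100_000;               // arbitrary for illustration
    private int fileCounter = 0;

    void put(String key, String value) throws IOException {
        memstore.put(key, value);
        if (memstore.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    void flush() throws IOException {
        Path hfile = Path.of("store-" + (fileCounter++) + ".hfile");
        try (BufferedWriter out = Files.newBufferedWriter(hfile)) {
            // Linear traversal of the tree: entries come out already sorted by key.
            for (Map.Entry<String, String> entry : memstore.entrySet()) {
                out.write(entry.getKey() + "\t" + entry.getValue());
                out.newLine();
            }
        }
        memstore.clear(); // the memory is emptied after the flush
    }
}
```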
What happens if the machine crashes and we lose everything in memory? We have a so-called write-ahead log for this. Before any fresh KeyValues are written to memory, they are written in sequential order (append) to an HDFS file called the HLog. There is one HLog per RegionServer. A full write-ahead log also triggers a flush of all KeyValues in memory to a new HFile.
If there is a problem and the memory is lost, the HLog can be retrieved from HDFS and "played back" in order to repopulate the memory and recreate the sorting tree structure.
After many flushes, the number of HFiles to read from grows and becomes impractical. For this reason, there is an additional process called compaction that takes several HFiles:
And outputs a single, merged HFile. Since the KeyValues within each HFile are already sorted, this can be done in linear time, as this is essentially the merge part of the merge-sort algorithm.
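A sketch of that merge step (inputs simplified to sorted lists of string keys): a priority queue over the current head of each input produces the compacted output in a single pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class CompactionSketch {

    private record Head(String key, Iterator<String> rest) {}

    // Merges several HFiles (each already sorted by key) into one sorted output.
    static List<String> compact(List<List<String>> sortedHFiles) {
        PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparing(Head::key));
        for (List<String> hfile : sortedHFiles) {
            Iterator<String> it = hfile.iterator();
            if (it.hasNext()) {
                heads.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            Head smallest = heads.poll();            // the smallest remaining key overall
            merged.add(smallest.key());
            if (smallest.rest().hasNext()) {
                heads.add(new Head(smallest.rest().next(), smallest.rest()));
            }
        }
        return merged;
    }
}
```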
With flushing and compaction, we are starting to see some cycle of persistence, as illustrated below. On a first level, the KeyValues in memory, on a second level, the KeyValues that have been flushed, on the third level, the KeyValues that have been flushed and compacted once, etc.
Compaction is not done arbitrarily but follows a regular, logarithmic pattern. Let us go through it. In a fresh HBase store, the memory becomes full at some point and a first HFile is output in a flush.
Then the memory, which was emptied, becomes full again and a second HFile is output in a flush.
This results in two HFiles of "standard size" that are immediately compacted to one HFile, twice as large.
Then the memory, which was emptied, becomes full again and a new "standard size" HFile is output in a flush.
Then the memory, which was emptied, becomes full again and a second "standard size" HFile is output in a flush.
This results in two HFiles of "standard size" that are immediately compacted to one HFile, twice as large.
This results in two HFiles of "double size" that are immediately compacted to one HFile, four times as large as the standard size:
Then the memory, which was emptied, becomes full again and a new "standard size" HFile is output in a flush.
By now, the process should be clear: when the memory is flushed again, a standard-size HFile is written and the two standard-size HFiles are immediately compacted to a double-size HFile.
When the memory is flushed again, a standard-size HFile is written, and so on, and so on.
If you paid attention, you will have noticed that the pattern of the HFiles, looked at in a mirror, is simply counting in base 2: 1, 10, 11, 100, 101, 110, 111, and so on. In fact, this number in base 2, multiplied by the size of a standard HFile, gives you the total persisted size on HDFS.
Additional Design Aspects
Bootstrapping lookups
In order to know which RegionServer a client should communicate with to receive KeyValues corresponding to a specific region, there is a main, big lookup table that lists all regions of all tables together with the coordinates of the RegionServer in charge of this region as well as additional metadata, for example to support splitting regions.
This big table is, in fact, also an HBase table, but it is special because it fits on just one machine, known to everybody. Thus, the clients use this so-called meta table to know which RegionServers to communicate with.
There also exists an alternate design (commonly found in early versions of BigTable or HBase) with two levels of meta tables in order to scale up the number of tables.
To create, delete, or update tables, clients communicate with the HMaster.
Caching
In order to improve latency, KeyValues that are normally persisted to HFiles (and thus no longer in memory) can be cached in a separate memory region, with the idea of keeping in the cache those KeyValues that are frequently accessed.
Bloom filters
HBase has a mechanism to avoid looking for KeyValues in every HFile. This mechanism is called a Bloom filter. It is basically a black box that can tell with absolute certainty that a certain key does not belong to an HFile, while it only predicts with good probability (albeit not with certainty) that it does belong to it. A Bloom filter is implemented using a multiple hashing mechanism that sets Boolean flags in an array, which is very efficient. By maintaining Bloom filters for each HFile (or even each column), HBase can know with certainty that some HFiles need not be read when looking up certain keys.
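A minimal Bloom filter sketch (the sizes and hashing below are illustrative, not HBase's implementation): k derived hash functions set k bits per inserted key; a clear bit proves absence, while all bits set only suggests presence.

```java
import java.util.BitSet;

public class BloomFilterSketch {
    private final BitSet bits = new BitSet(1 << 20); // about one million flags, illustrative size
    private final int numHashes = 4;

    // Derives several hash functions from one by mixing in the hash index.
    private int bitIndex(String key, int i) {
        return Math.floorMod((key + "#" + i).hashCode(), 1 << 20);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(bitIndex(key, i));
        }
    }

    // false -> the key is certainly not in the HFile, so the HFile can be skipped;
    // true  -> the key is probably in the HFile (false positives are possible).
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(bitIndex(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```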
Data locality and short-circuiting
It is informative to think about the interaction between HBase and HDFS. In particular, recollect when we said that HDFS writes the first replica of every block on the same (DataNode) machine as the client. Who is the client here? The RegionServer, which cohabits with a DataNode. Now the pieces of the puzzle should start assembling in your mind: this means that when a RegionServer flushes KeyValues to a new HFile, a replica of each (HDFS) block of the HFile is written, by the DataNode process living on the same machine as the RegionServer process, to the local disk. This makes accessing the KeyValues in future reads by the RegionServer extremely efficient, because the RegionServer can read the data directly from the local disk without even going through the DataNode process: this is known as short-circuiting in HDFS.
However, as time flies and the HDFS cluster lives its own life, some replicas might be moved to other DataNodes when rebalancing, making short-circuiting not (always) possible.
This, however, is not a problem, because with the log-structured merge tree mechanism, compactions happen regularly. And with every compaction, the replicas of the brand new HFile are again written on the local disk.