263-3010-00: Big Data
Section 6
Wide Column Stores
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/16/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Now that we have looked into some textual data formats (syntax) like CSV, JSON, XML, how do we store these CSV, JSON, and XML files? A first obvious solution is to store them on the local disk, on an object storage service (S3, Azure Blob Storage), or on a distributed file system (HDFS).
The problem with HDFS is its latency: HDFS works well with very large files (at least hundreds of MBs, so that blocks even start becoming useful), but will have performance issues if accessing millions of small XML or JSON files.
Wide column stores were invented to provide more control over performance and, in particular, to achieve high throughput and low latency for objects ranging from a few bytes to about 10 MB, which are too big and numerous to be efficiently stored as so-called clobs (character large objects) or blobs (binary large objects) in a relational database system, but also too small and numerous to be efficiently accessed in a distributed file system.
A Sweet Spot between Object Storage and Relational Database Systems
The astute reader will argue that a large number of JSON or XML files would be handled quite well with an object storage service. But a wide column store has additional benefits:
a wide column store will be more tightly integrated with parallel data processing systems. This is possible because the wide column store processes run on the same machines as the data processing processes, which makes the entire system faster.
wide column stores have a richer logical model than the simple key-value model behind object storage
wide column stores also handle very small values (bytes and kBs) well thanks to batch processing.
Note that a wide column store is not a relational database management system:
it does not have any data model for values, which are just arrays of bytes;
since it efficiently handles values up to 10 MB, the values can be nested data in various formats, which breaks the first normal form;
tables do not have a schema;
there is no language like SQL, instead the API is on a lower level and more akin to that of a key-value store;
tables can be sparse, allowing for billions of rows and millions of columns at the same time; this is another reason why data stored in HBase is denormalized.
History
Relational database management systems (RDBMS) from the 1970s were designed to run on a single machine, limiting their data capacity and processing speed to that machine's resources. To handle more data or improve performance, early solutions involved scaling up with bigger machines, but this had limits due to cost and feasibility.
In the early 2000s, tech companies began scaling out across multiple machines. This involved spreading data over several systems, similar to how data is partitioned in HDFS with redundancy across clusters. However, this required additional, complex software to manage data partitioning, replication, and querying across the cluster, making it costly and difficult to maintain.
Engineers realized they were essentially creating a new type of database system optimized for clusters. Google was the first to release a fully integrated solution, called BigTable, followed by the open-source HBase in the Hadoop ecosystem.
Logical Data Model
Rationale
The data model of HBase is based on the realization that joins are expensive, and that they should be avoided or minimized on a cluster architecture. Joins can be avoided if they are pre-computed, that is, instead of storing the data as separate tables, we store, and work on, the joined table.
Visually, what would look like this in a traditional RDBMS:
is instead stored like so in HBase:
There is a direct consequence of this change: the number of columns in a traditional RDBMS is limited, typically to somewhere around 256, or maybe in the low four digits if the columns have simple and compact types. But in a denormalized table, the number of columns can easily skyrocket, with many cells being in fact empty. The table is thus very wide and sparse, which is a completely different use case from what RDBMS software was designed for.
The second design principle underlying HBase is that it is efficient to store together what is accessed together. In the big picture, this is a flavor of batch processing, one of the overarching principles in Big Data. Batch processing reduces the impact of latency - remember that latency barely improved in the past few decades in comparison to capacity and throughput - by involving fewer, larger requests instead of more, smaller requests.
These two design principles are tied to a specific usage pattern of the database management system: when the emphasis is on reading and processing data in large amounts. This is, at first sight, in direct conflict with efficiently writing data. First, writing denormalized data is more cumbersome: one needs to deal with insertion, update, and deletion anomalies. Second, writing data under the constraint of storing together what is accessed together is a challenging endeavor, because one cannot just "insert" new data at a specific physical location without moving the existing data around.
HBase provides an impressive solution for handling writes quite efficiently, while preserving a batch-processing paradigm. Moreover, the underlying idea also impacted the handling of intermediate data by MapReduce and Apache Spark.
Tables and row IDs
From an abstract perspective, HBase can be seen as an enhanced key-value store, in the sense that:
a key is compound and involves a row, a column and a version
keys are sortable
values can be larger (clobs, blobs), up to around 10 MB.
This is unlike traditional key-value stores, which have a flat key-value model, which (in the case of distributed hash tables) do not sort keys, which thus do not store "close" keys together, and which usually support smaller values.
On the logical level, the data is organized in a tabular fashion: as a collection of rows. Each row is identified with a row ID. Row IDs can be compared, and the rows are logically sorted by row ID:
A row ID is logically an array of bytes, although there is a library to easily create row ID bytes from specific primitive values (bytes, short, int, long, string, etc).
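As an illustration, here is a minimal sketch using the `Bytes` utility class of the HBase Java client to build row IDs from primitive values; the composite key layout (a user name followed by an inverted timestamp) is a made-up example, not something prescribed by HBase.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RowIdExample {
    public static void main(String[] args) {
        // A row ID is just an array of bytes; Bytes converts primitives to bytes.
        byte[] fromString = Bytes.toBytes("user42");
        byte[] fromLong = Bytes.toBytes(1234567890L);

        // Hypothetical composite row ID: user name + inverted timestamp,
        // so that the most recent rows of a user sort first.
        long now = System.currentTimeMillis();
        byte[] composite = Bytes.add(Bytes.toBytes("user42"),
                                     Bytes.toBytes(Long.MAX_VALUE - now));

        // Row IDs are compared (and thus sorted) lexicographically as byte arrays.
        System.out.println(Bytes.compareTo(fromString, fromLong));
        System.out.println(Bytes.toStringBinary(composite));
    }
}
```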
Column families
The other attributes, called columns, are split into so-called column families. This is a concept that does not exist in relational databases and that allows scaling the number of columns. Very intuitively, one can think of column families as the tables that one would have if the data were actually normalized and the joins had not been pre-computed. On the picture above, we separated the columns into column families, using a different color and alphabet for each family. The name of a column family is a string, just like the name of a table. Often, we also use the terminology "column family" to refer to the name of the column family, i.e. we identify the column family with its name.
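To make this concrete, here is a sketch of table creation through the HBase 2.x Java admin API; the table name person and the family names address and work are invented for illustration. Column families are the only schema element declared up front:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // Only the table name and the column families are declared here;
            // column qualifiers will be created on the fly when data is inserted.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("person"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("address"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("work"))
                    .build());
        }
    }
}
```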
Column qualifiers
Columns in HBase have a name (in addition to the column family) called the column qualifier; however, unlike in a traditional RDBMS, they do not have a particular type. In fact, as far as HBase is concerned, all values are binary (arrays of bytes) and what the user does with them (strings, integers, large objects, XML, JSON, HTML, etc.) is really up to them. There are many different frameworks that can be used as a complement to HBase to add a type system (Avro, etc.) and it is, in fact, very common to store large blobs of data in the cells.
In fact, it goes further than that. Not only are there no column types: even the column qualifiers are not specified as part of the schema of an HBase table. Columns are created on the fly when data is inserted, and the rows need not have data in the same columns, which natively allows for sparsity. Column qualifiers are arrays of bytes (rather than strings) and, as for row IDs, there is a library to easily create column qualifiers from primitive values.
Thus, on the logical level, columns come and go as the table lives its life:
Unlike the values, which can be large arrays of bytes (blobs), it is important to keep column family names and column qualifiers short, because, as we will see, they are repeated a gigantic number of times on the physical layer.
Versioning
HBase generally supports versioning, in the sense that it keeps track of the past versions of the data. As we will see, this is implemented by associating any value with a timestamp, also called version, at which it was created (or deleted). Users can also override timestamps with a value of their choice to have more control over versions.
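A minimal sketch of how versions surface in the HBase 2.x Java client, assuming a hypothetical table person whose column family address was created to keep more than one version (by default, a family keeps only one):

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersioningExample {
    // Writes two versions of the same cell with explicit timestamps,
    // then reads both versions back.
    static void demo(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("person"))) {
            byte[] row = Bytes.toBytes("user42");
            byte[] family = Bytes.toBytes("address");
            byte[] qualifier = Bytes.toBytes("city");

            // Explicit versions (timestamps) chosen by the user.
            table.put(new Put(row).addColumn(family, qualifier, 1L, Bytes.toBytes("Zurich")));
            table.put(new Put(row).addColumn(family, qualifier, 2L, Bytes.toBytes("Basel")));

            // Ask for the latest 2 versions of that cell.
            Result result = table.get(new Get(row).addColumn(family, qualifier).readVersions(2));
            for (Cell cell : result.rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                    + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```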
A multidimensional key-value store
As we previously explained, one way to look at an HBase table is that it is an enhanced key-value store where the key is four-dimensional. Indeed, in HBase, the key identifying the values in the cells consists of:
the row ID
the column family
the column qualifier
the version
HBase is able to efficiently look up any cell, and even any version of any cell, given its key.
Logical Queries
Having realized that an HBase table is nothing but a four-dimensional key-value store, it follows logically that the HBase API also resembles that of a key-value store: HBase supports four kinds of low-level queries: get, put, scan, and delete. Unlike a traditional key-value store, HBase also supports querying ranges of row IDs and ranges of timestamps.
In comparison to a full-fledged RDBMS, this is quite limited and, as for data types, support for higher-level queries (such as SQL) is brought by additional frameworks that come as a complement of, and atop HBase (Apache Hive, Apache Phoenix, etc).
Get
With a get command, it is possible to retrieve a row specifying a table and a row ID.
Optionally, it is also possible to only request some but not all of the columns, or to request a specific version, or the latest $k$ versions (where $k$ can be chosen) within a time range (interval of versions).
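For illustration, here is a sketch of such a get with the HBase 2.x Java client; the table person, family address, and qualifier city are hypothetical names:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    static void demo(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("person"))) {
            Get get = new Get(Bytes.toBytes("user42"))                        // row ID
                .addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"))   // only this column
                .setTimeRange(0L, System.currentTimeMillis())                 // interval of versions
                .readVersions(3);                                             // at most 3 versions in it
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("address"), Bytes.toBytes("city"));
            System.out.println(value == null ? "(no value)" : Bytes.toString(value));
        }
    }
}
```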
Put
With a put command, it is possible to put a new value in a cell by specifying a table, row ID, column family, and column qualifier.
It is also possible to optionally specify the version. If none is specified, the current time is used as the version.
HBase offers a locking mechanism at the row level, meaning that different rows can be modified concurrently, but the cells in the same row cannot: only one user at a time can modify any given row.
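A corresponding sketch for put, with the same hypothetical table and column names as above; one cell is written with the implicit current-time version and another with an explicit version:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    static void demo(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("person"))) {
            Put put = new Put(Bytes.toBytes("user42"))  // row ID
                // family, qualifier, value; with no explicit version,
                // the current time is used as the timestamp.
                .addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"),
                           Bytes.toBytes("Zurich"))
                // an explicit version can also be supplied:
                .addColumn(Bytes.toBytes("address"), Bytes.toBytes("zip"),
                           42L, Bytes.toBytes("8092"));
            table.put(put);
        }
    }
}
```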
Scan
With a scan command, it is possible to query a whole table or part of a table, as opposed to a single row.
It is possible to restrict the scan to specific column families or even columns.
It is possible to restrict the scan to an interval of rows.
It is possible to run the scan at a specific version, or on a time range.
Scans are fundamental for obtaining high throughput in parallel data processing.
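A sketch of a scan over an interval of row IDs, restricted to one column family and a time range (again with hypothetical table, family, and row ID names):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    static void demo(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("person"))) {
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("user1"))           // start row (included)
                .withStopRow(Bytes.toBytes("user5"))            // stop row (excluded)
                .addFamily(Bytes.toBytes("address"))            // restrict to one column family
                .setTimeRange(0L, System.currentTimeMillis());  // restrict to a version interval
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```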
Delete
With a delete command, it is possible to delete a specific value by specifying a table, row ID, column family, and qualifier. Optionally, it is also possible to delete the value with a specific version, or all values with a version less than or equal to a specific version.
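A sketch of a delete, reusing the same hypothetical names; one call removes only the latest version of a cell, the other removes all versions of a cell up to a given version:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    static void demo(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("person"))) {
            Delete delete = new Delete(Bytes.toBytes("user42"))
                // delete only the latest version of this cell
                .addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"))
                // delete all versions of this cell with a version <= 42
                .addColumns(Bytes.toBytes("address"), Bytes.toBytes("zip"), 42L);
            table.delete(delete);
        }
    }
}
```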
Physical Architecture
Partitioning
A table in HBase is physically partitioned in two ways: on the rows and on the columns.
The rows are split in consecutive regions. Each region is identified by a lower and an upper row key, the lower row key being included and the upper row key excluded.
A partition is called a store and corresponds to the intersection of a region and of a column family.
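To illustrate region boundaries, here is a sketch that pre-splits a hypothetical table at creation time using the HBase 2.x admin API; each split key becomes the included lower bound of one region and the excluded upper bound of the previous one. The table name, family name, and split keys are arbitrary choices:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // Three split keys yield four regions:
            // (start, "g"), ["g", "n"), ["n", "t"), ["t", end)
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("logs"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("data"))
                    .build(),
                splitKeys);
        }
    }
}
```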
Network Topology
HBase has exactly the same centralized architecture as HDFS.
The HMaster and the RegionServers should be understood as processes running on the nodes, rather than the nodes themselves, even though it is common to use "HMaster" to designate the node on which the HMaster process runs, and "RegionServer" to designate a node on which a RegionServer process runs.
It is common for the HMaster process to run on the same node as the NameNode process. Likewise, it is common for the RegionServer processes to run on those same nodes on which the DataNode processes run.
The HMaster assigns the responsibility for each region to one of the RegionServers. This means that, for a given region (interval of row IDs), all column families - each one within this region being a store - are handled by the same RegionServer.
There is no need to attribute the responsibility of a region to more than one RegionServer at a time because, as we will see soon, fault tolerance is already handled on the storage level by HDFS.
If a region grows too big, for example because of many writes in the same row ID interval, then the region will be automatically split by the responsible RegionServer. Note, however, that concentrated writes ("hot spots") might be due to a poor choice of row IDs for the use case at hand. There are solutions to this such as salting or using hashes in row ID prefixes.
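As a sketch of the salting idea (the number of buckets and the key layout are arbitrary choices, not an HBase API): a short, deterministic prefix derived from the original key spreads otherwise consecutive row IDs, such as timestamps, over several regions.

```java
import java.util.Arrays;

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowIdExample {
    // Number of salt buckets; typically chosen close to the number of regions.
    static final int BUCKETS = 16;

    // Prepends a one-byte salt derived from the key's hash, so that
    // monotonically increasing keys (e.g. timestamps) no longer all land
    // in the same region.
    static byte[] saltedRowId(byte[] originalKey) {
        byte salt = (byte) Math.floorMod(Arrays.hashCode(originalKey), BUCKETS);
        return Bytes.add(new byte[] { salt }, originalKey);
    }

    public static void main(String[] args) {
        byte[] key = Bytes.toBytes(System.currentTimeMillis());
        System.out.println(Bytes.toStringBinary(saltedRowId(key)));
    }
}
```

The trade-off is that a scan over a contiguous logical range of keys then has to be issued once per salt bucket.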