263-3010-00: Big Data
Section 14
OLAP and Data Cubes
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 12/10/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any consequences of using this file.
This personal note is adapted from Professor Ghislain Fourny's course. Please contact us to delete this file if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
We now turn to a fourth data shape after covering tables, trees, and graphs: data cubes. Like all other shapes, we will see that it has its own data model, its own algebra, its own syntaxes, and its own query language.
OLTP vs. OLAP¶
While modern, computer-based database systems saw the light of day in the 1970s in the form of relational database systems, it was only twenty years later, in the 1990s, that another shape emerged: data cubes. This was still before the massive diversity and volume scale-up of the 2000s.
The main driver behind data cubes was business intelligence. While traditional relational database systems are focused on the day-to-day business of a company and record keeping (with customers placing orders, inventories kept up to date, etc), business intelligence is focused on the production of high-level reports for supporting C-level executives in making informed decisions. These two use cases differ in several fundamental ways.
First, running the day-to-day business of a company requires a database management system that is able to sustain a high pace of updates (deletions, insertions, replacements, etc) while maintaining consistency. This is called OLTP (OnLine Transaction Processing). On the other hand, the production of fancy, shiny reports does not require any updates: a frozen, read-only view of the data is sufficient for analysis purposes. This is called OLAP (OnLine Analytical Processing). Incidentally, even though OLAP is often associated with data cubes for historical reasons, a lot of the material covered in this lecture (data lakes, MapReduce, Apache Spark, etc) can also be classified as OLAP.
The difference between a write-heavy system (OLTP) and a read-heavy system (OLAP) has an impact on the entire architecture of the system. On the write-heavy side, we saw in the context of document stores (MongoDB) in Chapter 11 that a high number of point queries, each returning small quantities of results, can be implemented quite efficiently, to the point of being almost instant (a few milliseconds), with hash indices and B+-tree indices. These indices are perfectly compatible with a high pace of updates (it just feels slightly slower upon updating). On the read-heavy side, we saw in the context of MapReduce and Apache Spark that queries that aggregate massive amounts of data take more time (minutes, hours, days), although parallelism and batch processing help. This is first because large parts of the input data must be read (in contrast to just one or two records in the OLTP case), and second because the queries typically involve a lot of shuffling on the physical level.
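As a minimal, system-agnostic sketch of this difference (all data and names below are invented for illustration), a point query is a single hash lookup touching one record, while an analytical query must scan and aggregate everything:

```python
# Hypothetical toy dataset: order records keyed by order id.
orders = {
    1001: {"customer": "Alice", "amount": 40.0},
    1002: {"customer": "Bob", "amount": 25.0},
    1003: {"customer": "Alice", "amount": 35.0},
}

# OLTP-style point query: a single hash lookup, touching only
# the one record that is needed (conceptually, a hash index).
print(orders[1002])

# OLAP-style analytical query: scans the entire dataset to
# aggregate revenue per customer.
revenue_per_customer = {}
for order in orders.values():
    customer = order["customer"]
    revenue_per_customer[customer] = (
        revenue_per_customer.get(customer, 0.0) + order["amount"]
    )
print(revenue_per_customer)  # {'Alice': 75.0, 'Bob': 25.0}
```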
Another impact of OLTP vs. OLAP is about how normalized the data is: in OLTP and good old relational database management systems, we strive to have data in a high normal form (e.g., the Third Normal Form or Boyce-Codd Normal Form) to avoid update/deletion/insertion anomalies and to avoid redundancy. In OLAP, as was repeated a lot throughout the course, we want, on the contrary, to denormalize the data: we love redundancy because this is a way to avoid joins.
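As a toy illustration of this contrast (all names and numbers invented), the following Python sketch stores the same orders once in normalized form, where reading them back requires a join, and once in denormalized form, where the customer name is redundantly repeated in every record so that no join is needed:

```python
# Normalized (OLTP-style): two relations; reading requires a join.
customers = {1: "Alice", 2: "Bob"}          # customer id -> name
orders = [(1, 40.0), (2, 25.0), (1, 35.0)]  # (customer id, amount)

# The join: look up the customer name for every order.
joined = [(customers[cid], amount) for cid, amount in orders]
print(joined)  # [('Alice', 40.0), ('Bob', 25.0), ('Alice', 35.0)]

# Denormalized (OLAP-style): the name is stored redundantly in every
# record, so the same information is read back with a plain scan.
denormalized = [("Alice", 40.0), ("Bob", 25.0), ("Alice", 35.0)]
```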
Features of data warehousing¶
In the context of business intelligence and data cubes, OLAP is also referred to as data warehousing. Data warehousing is commonly characterized by four features:
It is subject-oriented. This means that it is highly focused on specific business use cases: census analytics (historically the oldest use case for computers), web analytics, sales analytics, science, etc.
It is time-variant. This means that we do not only care about the latest state of the data: we also want to go back in time and see the previous month, the previous quarter, or previous years, typically up to 10 years in the past.
It is non-volatile. This means that once the data has been imported, it remains frozen and no updates are made. We load, we read, and this is it.
It is integrated. This means that companies using data warehousing have an architecture that allows the regular and (semi)automatic import of data from day-to-day systems (Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), relational database management systems, file systems, etc) into the data warehouse.
Extract-Transform-Load (ETL)¶
One might ask why not run business-intelligence queries directly on the day-to-day systems. The main reason is that doing so would massively slow down the day-to-day systems. Thus, data cubes are typically stored separately in the data warehouse, with infrequent updates (e.g., weekly, monthly). These updates consist of three steps, as sketched in the example after this list:
Extract: the desired filtered subset of the data is extracted from the source, e.g., using gateways.
Transform: the data is cleaned up and rearranged (e.g., projected, joined, etc).
Load: the data is imported into the data warehouse system, which triggers constraint checks, building cube-specific indices, partitioning, etc.
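As a rough illustration of the three steps, here is a minimal Python sketch using pandas; the file names, column names, and filter are all hypothetical, and a real pipeline would also handle constraint checks and cube-specific indexing inside the warehouse itself:

```python
import pandas as pd

# Extract: pull the desired, filtered subset out of the operational
# source (here, a hypothetical CSV export of an orders table).
source = pd.read_csv("orders_export.csv")
extracted = source[source["year"] == 2024]

# Transform: clean up and rearrange, e.g., join in customer data from
# another hypothetical export and project the columns we need.
customers = pd.read_csv("customers_export.csv")
transformed = extracted.merge(customers, on="customer_id")
transformed = transformed[["customer_name", "country", "amount"]]

# Load: import into the warehouse (a Parquet file stands in for the
# warehouse here; in a real system, loading triggers constraint
# checks, index building, partitioning, etc).
transformed.to_parquet("warehouse/orders_2024.parquet")
```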
In fact, the term "ETL" is so widespread that it is even commonly used as a verb: one ETLs the data into a system (not necessarily data cubes: one can also ETL data into PostgreSQL or MongoDB).
Market products¶
There are currently four big players on the market regarding data warehousing: Oracle (Essbase), IBM (Cognos), Microsoft (Analysis Services), and SAP (HANA).
Historically, the first data warehouses were built by new, specialized companies, and then a market consolidation followed in which several of the big players purchased the data warehousing companies. We will see that there is a technical reason why companies making relational (tabular) database management systems have an interest in (cubic) data warehouse systems.
A new generation of players such as Databricks and Snowflake has also recently appeared, with innovative approaches bringing data lakes and data warehouse systems together (e.g., data lakehouse systems), and increasingly, artificial intelligence (machine learning, vector databases, etc).
The cubic data model¶
Cubes and dimensions¶
We now turn to the logical data model behind data cubes. Visually, a data cube can be pictured as a cube divided into small cells along its dimensions.
A data cube consists of small cells, called facts, arranged across multiple dimensions. Each cell thus corresponds to a combination of dimensions called dimensional coordinates, which associate each dimension with a value (called a member of that dimension). Logically, dimensional coordinates are similar to a flat map (JSON object) where keys are dimensions and values are members. Each cell furthermore contains one value. Typically, OLAP data cubes are dense, meaning that most if not all cells contain some value (even if it is zero).
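Since dimensional coordinates behave like a flat map from dimensions to members, they can be sketched in Python as plain dictionaries; the dimensions, members, and value below are invented for illustration:

```python
# A cell's dimensional coordinates: each dimension maps to one member.
coordinates = {
    "What": "Revenue",
    "When": "2024-Q3",
    "Where": "Switzerland",
}

# A (dense) cube can then be modeled as a mapping from coordinates to
# the single value stored in each cell. Dictionaries are not hashable,
# so we use a frozenset of the coordinate items as the key.
cube = {
    frozenset(coordinates.items()): 1000000.0,
}
print(cube[frozenset(coordinates.items())])  # 1000000.0
```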
The number of dimensions in a cube need not be three: it can be 2, or more often 4, 5, or even 10. This is, of course, difficult to visualize for a human, but we will shortly see that this is not such a big problem because there are alternate representations of data cubes.
While data cubes can have a much higher number of dimensions, high-dimensional data cubes will obviously be sparse, which leads to a different architecture and different optimizations. In fact, scaling up to massive numbers of dimensions even leads to ... vector databases, in which the cube loses its first-class-citizen status and simply becomes a vector space (with continuous members). In this chapter, we focus on dense data cubes with a reasonable number of dimensions.
Dimensions can be of many different sorts, including often:
What (measure): what is the fact about: revenues? quantities? costs? profits?
When (period): when is a fact valid?
Who (entity): which company is the fact about?
Where (geography): which country or city is the fact about?
Of what (unit): in which unit is the fact expressed (e.g., a currency)?
and there are many other possibilities, which can vary with each use case.
Fact tables¶
Because humans are not capable of visualizing more than three dimensions (and even three is difficult), cubes are typically presented as a flat list of facts called a fact table. Each column represents a dimension, except the last one, which contains the value of the fact. This is for example the fact table of a 2 × 2 × 2 data cube. As expected, it has 8 facts.
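As a minimal sketch of what such a fact table looks like (all dimension names, members, and values below are invented, not the lecture's example), one can build it with pandas:

```python
from itertools import product

import pandas as pd

# Hypothetical members for three dimensions (2 each -> 2 x 2 x 2 = 8 facts).
whats = ["Revenue", "Costs"]
whens = ["2023", "2024"]
wheres = ["Switzerland", "Germany"]

# One row per fact: a column per dimension, plus the fact's value
# (placeholder zeros here, since the real values are made up anyway).
rows = [
    {"What": w, "When": t, "Where": g, "Value": 0.0}
    for w, t, g in product(whats, whens, wheres)
]
fact_table = pd.DataFrame(rows)
print(len(fact_table))  # 8
```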
In practice, it is common to annotate dimensions (also called axes, e.g., by the U.S. Securities and Exchange Commission in the context of quarterly reports in the XBRL format) and members with square brackets to make it easier to understand.
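A minimal sketch of this bracket convention, reusing the invented dimensions and members from above:

```python
# The same hypothetical coordinates, with dimensions and members
# annotated with square brackets, as is common e.g. in XBRL filings.
annotated = {
    "[What]": "[Revenue]",
    "[When]": "[2024]",
    "[Where]": "[Switzerland]",
}
```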
The cubic algebra¶
We now turn to how data cubes can be manipulated, similar to how the relational algebra manipulates tables.
Slicing¶
A cube can be sliced, which means that one focuses on a single member of a specific dimension and ignores the rest of the cube.
For example, we can slice our previous example on a single member of one dimension, as in the sketch below.
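Reusing the hypothetical fact_table built earlier, slicing on the (invented) member Switzerland of the Where dimension keeps only the matching facts and drops the rest of the cube:

```python
# Slice: fix the "Where" dimension to the member "Switzerland".
# Only the 4 facts with these coordinates survive; the result is a
# 2 x 2 cube over the remaining dimensions (What, When).
switzerland_slice = fact_table[fact_table["Where"] == "Switzerland"]
print(len(switzerland_slice))  # 4
```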