263-3010-00: Big Data
Section 14
OLAP and Data Cubes
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 12/10/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny's lecture. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
We now turn to a fourth data shape after covering tables, trees, and graphs: data cubes. Like all other shapes, we will see that it has its own data model, its own algebra, its own syntaxes, and its own query language.
OLTP vs. OLAP¶
While modern, computer-based database systems saw the light of day in the 1970s in the form of relational database systems, it was only twenty years later, in the 1990s, that another shape emerged: data cubes. Still, this was before the massive diversity and volume scale-up of the 2000s.
The main driver behind data cubes was business intelligence. While traditional relational database systems are focused on the day-to-day business of a company and record keeping (with customers placing orders, inventories kept up to date, etc), business intelligence is focused on the production of high-level reports for supporting C-level executives in making informed decisions. These two use cases differ in several fundamental ways.
First, running the day-to-day business of a company requires a database management system that is able to sustain a high pace of updates (deletions, insertions, replacements, etc) while maintaining consistency. This is called OLTP (OnLine Transaction Processing). On the other hand, the production of fancy, shiny reports does not require any updates: a frozen, read-only view of the data is sufficient for analysis purposes. This is called OLAP (OnLine Analytical Processing). Incidentally, even though OLAP is often associated with data cubes for historical reasons, a lot of the material covered in this lecture (data lakes, MapReduce, Apache Spark, etc) can also be classified as OLAP.
The difference between a write-heavy system (OLTP) and a read-heavy system (OLAP) has an impact on the entire architecture of the system. On the write-heavy side, we saw in the context of document stores (MongoDB) in Chapter 11 that a high number of point queries, each returning small quantities of results, can be implemented quite efficiently, to the point of being almost instant (a few milliseconds), with hash indices and B+-tree indices. These indices are perfectly compatible with a high pace of updates (updates just feel slightly slower). On the read-heavy side, we saw in the context of MapReduce and Apache Spark that queries that aggregate massive amounts of data take more time (minutes, hours, days), although parallelism and batch processing help. This is first because large parts of the input data must be read (in contrast to just one or two records in the OLTP case), and second because the queries typically involve a lot of shuffling on the physical level.
Another impact of OLTP vs. OLAP is about how normalized the data is: in OLTP and good old relational database management systems, we strive to have data in a high normal form (e.g., the Third Normal Form or Boyce-Codd Normal Form) to avoid update/deletion/insertion anomalies and to avoid redundancy. In OLAP, as was repeated a lot throughout the course, we want, on the contrary, to denormalize the data: we love redundancy because this is a way to avoid joins.
Features of data warehousing¶
In the context of business intelligence and data cubes, OLAP is also referred to as data warehousing. Data warehousing distinguishes itself commonly via four features:
It is subject-oriented. This means that it is highly focused on specific business use cases: census analytics (historically the oldest use case for computers), web analytics, sales analytics, science, etc.
It is time-variant. This means that we do not only care about the latest state of the data: we also want to go back in time and see the previous month, the previous quarter, or previous years, typically up to 10 years in the past.
It is non-volatile. This means that once the data has been imported, it remains frozen and no updates are made. We load, we read, and this is it.
It is integrated. This means that companies using data warehousing have an architecture that allows the regular and (semi)automatic import of data from day-to-day systems (Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), relational database management systems, file systems, etc) into the data warehouse.
Extract-Transform-Load (ETL)¶
One might ask why business-intelligence queries are not run directly on the day-to-day systems. The main reason is that doing so would massively slow down the day-to-day systems. Thus, data cubes are typically stored separately in the data warehouse, with infrequent updates (e.g., weekly, monthly). These updates consist of three steps:
Extract: the desired filtered subset of the data is extracted from the source, e.g., using gateways.
Transform: the data is cleaned up and rearranged (e.g., projected, joined, etc).
Load: the data is imported into the data warehouse system, which triggers constraint checks, the building of cube-specific indices, partitioning, etc.
In fact, the term "ETL" is so widespread that it is even commonly used as a verb: one ETLs the data into a system (not necessarily data cubes: one can also ETL data into PostgreSQL or MongoDB).
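To make these three steps concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory source standing in for a CRM/ERP export and uses an SQLite database as a stand-in for the warehouse; all table and column names are made up for illustration.

```python
import sqlite3

# Extract: pull the desired, filtered subset of the data from the source.
# Here the "source" is just an in-memory list standing in for a CRM/ERP export.
source_rows = [
    {"order_id": 1, "country": "Switzerland", "amount": 120.0, "status": "shipped"},
    {"order_id": 2, "country": "Germany",     "amount":  80.0, "status": "cancelled"},
    {"order_id": 3, "country": "Switzerland", "amount": 200.0, "status": "shipped"},
]
extracted = [r for r in source_rows if r["status"] == "shipped"]

# Transform: clean up and rearrange (here, a simple projection to the needed columns).
transformed = [(r["country"], r["amount"]) for r in extracted]

# Load: import into the warehouse; constraint checks and index building happen here.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (country TEXT NOT NULL, amount REAL NOT NULL)")
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)", transformed)
warehouse.commit()

print(warehouse.execute(
    "SELECT country, SUM(amount) FROM sales_fact GROUP BY country").fetchall())
# [('Switzerland', 320.0)]
```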
Market products¶
There are currently four big players on the data warehousing market: Oracle (Essbase), IBM (Cognos), Microsoft (Analysis Services), and SAP (HANA).
Historically, the first data warehouses were built in new companies, and then a market consolidation followed in which several of the big players purchased the data warehousing companies. We will see that there is a technical reason why companies making relational (tabular) database management systems have an interest in (cubic) data warehouse systems.
A new generation of players such as Databricks and Snowflake has also recently appeared, with innovative approaches bringing data lakes and data warehouse systems together (e.g., data lakehouse systems), and increasingly, artificial intelligence (machine learning, vector databases, etc).
The cubic data model¶
Cubes and dimensions¶
We now turn to the logical data model behind data cubes. Here is a visualization of a data cube:
A data cube consists of small cells, called facts, arranged across multiple dimensions. Each cell thus corresponds to a combination of dimension values called its dimensional coordinates, which associate each dimension with a value (called a member of that dimension). Logically, dimensional coordinates are similar to a flat map (JSON object) where keys are dimensions and values are members. Each cell furthermore contains one value. Typically, OLAP data cubes are dense, meaning that most if not all cells contain some value (even if it is zero).
The number of dimensions in a cube need not be three: it can be 2, or more often 4, or 5 or even 10. This is, of course, difficult to visualize for a human, but we will shortly see that this is not such a big problem because there are alternate representations of data cubes.
While data cubes can have a much higher number of dimensions, high-dimensional data cubes will obviously be sparse, which leads to a different architecture and different optimizations. In fact, scaling up to massive numbers of dimensions even leads to ... vector databases, in which the cube loses its first-class citizen status and simply becomes a vector space (with continuous members). In this chapter, we focus on dense data cubes with a reasonable number of dimensions.
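As a minimal illustration of the model, here is a sketch in Python of a hypothetical 2 × 2 × 2 cube in which each cell is addressed by its dimensional coordinates, represented as a flat map from dimensions to members; all dimensions, members, and values are made up for the example.

```python
# Dimensional coordinates as a flat map from dimension to member,
# frozen into a hashable key so that they can index a dictionary of cells.
def coordinates(location, period, measure):
    return frozenset({"Location": location, "Period": period, "Measure": measure}.items())

# A hypothetical 2 x 2 x 2 cube: one value per combination of members.
cube = {
    coordinates("Switzerland", "2023", "Revenue"): 100,
    coordinates("Switzerland", "2024", "Revenue"): 120,
    coordinates("Germany",     "2023", "Revenue"):  80,
    coordinates("Germany",     "2024", "Revenue"):  90,
    coordinates("Switzerland", "2023", "Costs"):    60,
    coordinates("Switzerland", "2024", "Costs"):    70,
    coordinates("Germany",     "2023", "Costs"):    50,
    coordinates("Germany",     "2024", "Costs"):    55,
}

# Looking up one cell by its dimensional coordinates:
print(cube[coordinates("Switzerland", "2024", "Revenue")])  # 120
```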
Dimensions can be of many different sorts, including often:
What (measure): what is the fact about: revenues? quantities? costs? profits?
When (period): when is a fact valid?
Who (entity): which company is the fact about?
Where (geography): which country or city is the fact about?
Of what (unit): in which unit is the fact expressed (e.g., a currency)?
and there are many other possibilities, even for each use case.
Fact tables¶
Because humans are not capable of visualizing more than three dimensions (and even three is difficult), cubes are typically presented as a flat list of facts called a fact table. Each column represents a dimension, except the last one, which contains the value of the fact. This is for example the fact table of a 2 × 2 × 2 data cube. As expected, it has 8 facts:
In practice, it is common to annotate dimensions (also called axes, e.g., by the U.S. Securities and Exchange Commission in the context of quarterly reports in the XBRL format) and members with square brackets to make it easier to understand:
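The following Python sketch shows how such a fact table can be represented as a flat list of facts, one row per cell, with the bracket annotations just mentioned; the dimensions, members, and values are hypothetical.

```python
# A hypothetical 2 x 2 x 2 cube flattened into a fact table: one row per fact,
# one column per dimension, and a final column holding the value of the fact.
columns = ("Location [Axis]", "Period [Axis]", "Measure [Axis]", "Value")
fact_table = [
    ("Switzerland [Member]", "2023 [Member]", "Revenue [Member]", 100),
    ("Switzerland [Member]", "2024 [Member]", "Revenue [Member]", 120),
    ("Germany [Member]",     "2023 [Member]", "Revenue [Member]",  80),
    ("Germany [Member]",     "2024 [Member]", "Revenue [Member]",  90),
    ("Switzerland [Member]", "2023 [Member]", "Costs [Member]",    60),
    ("Switzerland [Member]", "2024 [Member]", "Costs [Member]",    70),
    ("Germany [Member]",     "2023 [Member]", "Costs [Member]",    50),
    ("Germany [Member]",     "2024 [Member]", "Costs [Member]",    55),
]
assert len(fact_table) == 2 * 2 * 2  # one fact per cell of the cube
```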
The cubic algebra¶
We now turn to how data cubes can be manipulated, similar to how the relational algebra manipulates tables.
Slicing¶
A cube can be sliced, which means that one focuses on a specific member of a specific dimension and ignores the rest of the cube.
For example, if we slice our previous example:
on Location [Axis] = Switzerland [Member], one obtains the following data cube (represented as a fact table):
It is common to display sliced dimensions (called slicers) not as a column, but rather as a separate header above the fact table, like so (we also do so for currency, as everything is in U.S. Dollars):
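A slice can be computed with a simple filter on the fact table, dropping the sliced dimension from the columns afterwards; here is a minimal Python sketch over hypothetical facts.

```python
# Hypothetical fact table with three dimensions and a value column.
fact_table = [
    {"Location": "Switzerland", "Period": "2023", "Measure": "Revenue", "Value": 100},
    {"Location": "Switzerland", "Period": "2024", "Measure": "Revenue", "Value": 120},
    {"Location": "Germany",     "Period": "2023", "Measure": "Revenue", "Value":  80},
    {"Location": "Germany",     "Period": "2024", "Measure": "Revenue", "Value":  90},
]

def slice_cube(facts, dimension, member):
    """Keep only the facts with the given member and move the dimension out of the columns."""
    return [{k: v for k, v in f.items() if k != dimension}
            for f in facts if f[dimension] == member]

# Slicing on Location = Switzerland: the Location dimension becomes a slicer.
print(slice_cube(fact_table, "Location", "Switzerland"))
# [{'Period': '2023', 'Measure': 'Revenue', 'Value': 100},
#  {'Period': '2024', 'Measure': 'Revenue', 'Value': 120}]
```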
Dicing¶
Notice that this last fact table, after the slicing, only has two dimensions left that are displayed as columns. This slice is a two-dimensional cube, that is, a square. But humans are quite good at looking at squares. This is the reason why, instead of using the fact table representation, it is common to reorganize the facts in a grid, like so:
Note that these are the same four facts, but displayed in a 2 $\times$ 2 grid rather than as a list of four rows. Doing so is called dicing. The two dimensions whose members are displayed on row headers and column headers are called dicers. A cleaner view (the display is due to Charles Hoffman, CPA, and widely used in XBRL renderings) looks like so, and is called a cross-tabulation, or a cross-tabulated view:
In practice, there are typically two dicers. It is also possible to have only one (in which case there is only one column, or only one row, in the cross-tabulation), or three (in which case user interfaces commonly use tabs for the third dicer, also called sheets in spreadsheet software).
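Dicing amounts to laying out the facts of a two-dimensional slice in a grid indexed by the two dicers, as in this small Python sketch (the salespersons, periods, and values are made up).

```python
# A hypothetical two-dimensional slice, with Salesperson as the row dicer
# and Period as the column dicer.
facts = [
    {"Salesperson": "Alice", "Period": "2023", "Value": 100},
    {"Salesperson": "Alice", "Period": "2024", "Value": 120},
    {"Salesperson": "Bob",   "Period": "2023", "Value":  80},
    {"Salesperson": "Bob",   "Period": "2024", "Value":  90},
]

rows = sorted({f["Salesperson"] for f in facts})
cols = sorted({f["Period"] for f in facts})
grid = {(f["Salesperson"], f["Period"]): f["Value"] for f in facts}

# Print the cross-tabulated view.
print("".ljust(10) + "".join(c.ljust(8) for c in cols))
for r in rows:
    print(r.ljust(10) + "".join(str(grid[(r, c)]).ljust(8) for c in cols))
#           2023    2024
# Alice     100     120
# Bob       80      90
```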
Roll-ups¶
Another manipulation primitive on a data cube is the aggregation of all the facts along one dimension. If the aggregation is a sum, this leads to a total. However, other aggregations can be used, such as average, min, max, or count.
If we come back to our example:
Then we can aggregate the data along the location dimension. This means that, for each combination of members of the other dimensions, an aggregated value is computed from all the facts with these members, varying the location. Below, we show with colors which facts are grouped together (there are four groups: yellow, red, blue, and green):
This leads to a new cube with the following fact table representation, in which all resulting facts have the whole world as their location:
We say that we have just completely rolled up the Location [Axis] dimension. Note that the member associated with a complete roll-up is also called a domain, as it represents the entire domain of the dimension.
Finally, in the context of a complete roll-up, it is common to move over the fully rolled-up dimension into the slicers section above the table, like so:
And we may as well cross-tabulate, as only two dimensions are left:
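A complete roll-up can be sketched as a group-by on all remaining dimensions followed by an aggregation; here is a minimal Python illustration with a sum over hypothetical facts.

```python
from collections import defaultdict

# Hypothetical facts with a Location dimension to be completely rolled up.
facts = [
    {"Location": "Switzerland", "Period": "2023", "Value": 100},
    {"Location": "Germany",     "Period": "2023", "Value":  80},
    {"Location": "Switzerland", "Period": "2024", "Value": 120},
    {"Location": "Germany",     "Period": "2024", "Value":  90},
]

# Group the facts by their coordinates on all other dimensions and sum the values.
totals = defaultdict(int)
for f in facts:
    other_coordinates = tuple(sorted(
        (k, v) for k, v in f.items() if k not in ("Location", "Value")))
    totals[other_coordinates] += f["Value"]

# The resulting facts all have the whole domain ("World") as their location.
rolled_up = [dict(coords, Location="World [Domain]", Value=value)
             for coords, value in totals.items()]
print(rolled_up)
# [{'Period': '2023', 'Location': 'World [Domain]', 'Value': 180},
#  {'Period': '2024', 'Location': 'World [Domain]', 'Value': 210}]
```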
Dimension hierarchies¶
It is also possible to compute totals without removing the original values. This is commonly done on a dicer, by adding an additional row (or column) and marking it cosmetically as a total. On the rows, this is often done with bold fonts and/or underlines, while for columns, this is often achieved with an L-shaped header cell:
This makes explicit that the members, in a dimension, are organized as a hierarchy. While our example with Salespersons is flat, hierarchies can be more deeply nested: for example, towns or cities can be organized under cantons, which can be organized under countries, which can be organized under continents, which together form the world domain:
Displaying nested hierarchies on columns can look beautiful, with nested L-shaped headers:
On rows, it looks less fancy as one commonly uses indentation and lines, like so. This is very commonly done by accountants and, if the VIS Quästor or Quästorin is reading these lines, they will feel right at home:
The nested hierarchies are also an opportunity to mention that the full roll-ups we covered before are in fact a special case. Roll-ups can also be partial, level by level. We now proceed to a visual of a succession of roll-ups. We start with the fully visible country hierarchy:
Now, we roll up by one level:
Another level:
Another level:
And now all the way up, which gives an example where there is only one dicer left (and thus, only one column of facts in the cross-tabulation):
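A partial roll-up by one level can be sketched by mapping every member to its parent in the hierarchy and re-aggregating; the following Python snippet uses a hypothetical city-to-country mapping (applying the same idea again with a country-to-continent mapping would roll up one more level).

```python
from collections import defaultdict

# Hypothetical one-level parent mapping of the geographic hierarchy.
parent = {
    "Zurich": "Switzerland", "Geneva": "Switzerland",
    "Berlin": "Germany",     "Munich": "Germany",
}
facts = [
    {"Location": "Zurich", "Value": 60}, {"Location": "Geneva", "Value": 40},
    {"Location": "Berlin", "Value": 50}, {"Location": "Munich", "Value": 30},
]

# Re-aggregate the facts at the parent level of the hierarchy.
rolled_up_one_level = defaultdict(int)
for f in facts:
    rolled_up_one_level[parent[f["Location"]]] += f["Value"]

print(dict(rolled_up_one_level))  # {'Switzerland': 100, 'Germany': 80}
```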
Physical implementation¶
The principle of data independence applies for data cubes as much as it applies to tables, trees, and graphs. This means in particular that there are two main approaches to cubic database management systems:
The native approach, where support for cubes and querying is implemented from scratch. This is called MOLAP (Multidimensional OLAP).
The approach on top of a database management system for another shape, where tables naturally come into the picture. This is called ROLAP (Relational OLAP).
In the remainder of the chapter, we focus on ROLAP.
ROLAP¶
In ROLAP, a cube is stored as a fact table in a relational table. All columns are dimensions except for the last one, which contains the fact values. Concretely, this means that the dimension columns jointly form a compound primary key of the table.
There is an alternate format for fact tables, with multiple value columns: these are called measures.
This can be seen as merging multiple cubes into one, with the measures sharing the same sets of possible dimensional coordinates. In accounting, examples of measures are sale quantities, profits, revenues, expenses, assets, etc. They can all be jointly reported against regions, periods, entities, etc.
The two approaches (only one value column, or multiple value columns) are in fact logically equivalent and can be obtained from each other via pivoting and unpivoting. Several relational database systems offer SQL extensions with PIVOT and UNPIVOT clauses. A fact table with multiple measures can be converted to a fact table with just one value column by “promoting” the measure to a dimension, with one member for each measure column.
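The following Python sketch illustrates this promotion by unpivoting a hypothetical fact table with two measure columns into the single-value-column form.

```python
# Hypothetical fact table with two measures (Revenue and Costs) as value columns.
multi_measure_facts = [
    {"Location": "Switzerland", "Period": "2023", "Revenue": 100, "Costs": 60},
    {"Location": "Germany",     "Period": "2023", "Revenue":  80, "Costs": 50},
]

# Unpivot: the measure becomes a dimension, with one member per former value column.
single_value_facts = [
    {"Location": f["Location"], "Period": f["Period"],
     "Measure": measure, "Value": f[measure]}
    for f in multi_measure_facts
    for measure in ("Revenue", "Costs")
]
print(single_value_facts[:2])
# [{'Location': 'Switzerland', 'Period': '2023', 'Measure': 'Revenue', 'Value': 100},
#  {'Location': 'Switzerland', 'Period': '2023', 'Measure': 'Costs', 'Value': 60}]
```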
In addition to the fact table, additional relational tables are typically available, one per dimension. The dimension members in the fact table are in fact foreign keys that point to the corresponding satellite table. The satellite tables contain more information about the dimension members. This is known as a star schema.
If the satellite tables are further normalized into several tables per dimension, this is known as the snowflake schema:
Here is an example of normalization of a geographical hierarchy into multiple tables for the geographic dimension:
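Here is a minimal Python sketch of such a normalized (snowflake) layout with hypothetical keys and members; resolving the full geography of a fact then requires a chain of joins across the dimension tables.

```python
# Snowflake schema: one table per level of the geographic hierarchy,
# each row pointing to its parent via a foreign key (all data is hypothetical).
continents = {10: {"Continent": "Europe"}}
countries  = {20: {"Country": "Switzerland", "ContinentKey": 10},
              21: {"Country": "Germany",     "ContinentKey": 10}}
cities     = {30: {"City": "Zurich", "CountryKey": 20},
              31: {"City": "Berlin", "CountryKey": 21}}

# The fact table only stores a foreign key into the lowest level of the hierarchy.
fact_table = [{"CityKey": 30, "Period": "2023", "Value": 100},
              {"CityKey": 31, "Period": "2023", "Value":  80}]

# Resolving the geography of one fact requires joining through all levels.
f = fact_table[0]
city = cities[f["CityKey"]]
country = countries[city["CountryKey"]]
continent = continents[country["ContinentKey"]]
print(city["City"], country["Country"], continent["Continent"])  # Zurich Switzerland Europe
```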
It is also possible to go the opposite way and to fully denormalize the star schema into a single, denormalized fact table, like so:
This is useful, for example, when importing such a denormalized fact table into spreadsheet software such as Excel, which is able to manipulate the corresponding data cubes: select all the cells of the fact table, then select “Summarize with PivotTable” in the Data menu. This gives you a user interface in which you can drag and drop the dimensions and measures into a cross-tabulated view at will. Here is a concrete example of how the geographic dimension can be denormalized into multiple columns in the fact table. Notice the functional dependencies (City $\rightarrow$ Country and Country $\rightarrow$ Continent), which are the visible signs that the fact table is denormalized.
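As a programmatic analogue of this spreadsheet workflow, the following sketch uses pandas (assuming it is installed) on a hypothetical denormalized fact table and produces a cross-tabulated view with pivot_table; the columns and values are made up for illustration.

```python
import pandas as pd

# Hypothetical fully denormalized fact table: the geographic hierarchy is spelled out
# in the columns, so City -> Country and Country -> Continent are visible functional dependencies.
fact_table = pd.DataFrame([
    {"Continent": "Europe", "Country": "Switzerland", "City": "Zurich", "Period": "2023", "Value": 100},
    {"Continent": "Europe", "Country": "Switzerland", "City": "Geneva", "Period": "2023", "Value":  40},
    {"Continent": "Europe", "Country": "Switzerland", "City": "Zurich", "Period": "2024", "Value": 120},
    {"Continent": "Europe", "Country": "Germany",     "City": "Berlin", "Period": "2023", "Value":  80},
    {"Continent": "Europe", "Country": "Germany",     "City": "Berlin", "Period": "2024", "Value":  90},
])

# Cross-tabulated view, analogous to "Summarize with PivotTable" in a spreadsheet:
# one dicer on the rows (Country), one on the columns (Period), values summed.
print(fact_table.pivot_table(index="Country", columns="Period", values="Value", aggfunc="sum"))
```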