263-3010-00: Big Data

Section 2 Lessons Learned from the Past

Swiss Federal Institute of Technology Zurich

Eidgenössische Technische Hochschule Zürich

Last Edit Date: 09/24/2024

Disclaimer and Term of Use:

We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file

This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Data Independence¶

Data independence refers to the separation of the logical view of data from its physical storage. Initially, it was proposed that the most intuitive logical view is in the form of tables, a format that is easily understood and has been used for thousands of years.

A relational database management system (RDBMS) provides a logical model for manipulating data, separate from the physical storage layer. In the 1970s, this physical layer was a single computer's hard drive, with storage formats irrelevant to the user. As the RDBMS and hardware are updated, the user’s interaction remains unchanged, while performance may improve automatically. Logical layer updates focus on long-term compatibility and functionality.

A datatbase management system stack can be viewed as four-layer stack:

A logical query language with which the user can query data
A logical model for the data
A physical compute layer that processes the query on an instance of the model
A physical storage layer where the data is physically stored

Relational Database Management Systems¶

Main concepts¶

Relational database management systems (RDBMS) are based on a tabular data format. The core of the relational model is thus the concept of table. The following summarize the elements in this model:

Element	Description	Example
Table	A collection of records. For example, in an employee table, each record represents a person, and in a products table, each record represents a product. Therefore, in some models that generalize tables, the term "collection" is used.
Attribute	A property that records can have. For example, if a record is an employee, an attribute could be their last name or city of residence. Other common synonyms for "attribute" include column, field, property, and key.
Row	A record in a collection. A row links properties with the values relevant to the record it represents. Common synonyms for "row" include record, entity, document, item, and the more formal term "business object," which is popular among MBA graduates.
Primary key	A particular attribute or set of attributes that uniquely identify a record in its table. For example, a social security number (AHV in Switzerland) can identify a person, or a code (HG, CAB) can identify a building at ETH.
Value	A specific part of a row. For example, the first name (a string) of a student in a table.

Relational Properties¶

Why relational table are called relation tables?

A mathematical relation over several domains is defined as a subset of the Cartesian product of the domains. Each element in a mathematical relation, that is, a record in the table, is thus a tuple made of values picked in each domain.

In general, we can say that a relational table is made of:

A set of attributes (schema)
A set / bag / list of tuples (extension)

Mathematical representation

JSON representation

Table representation

Relational integrity¶

A collection $T$ fulfills relational integrity if all its records have identical support:

$$\forall t, u \in T, \; \text{support}(t) = \text{support}(u)$$

If a collection, in particular a table, has relational integrity, then this common support is a property of the table and contains the attributes of the table $T$: $Attributes_{T}$.

The extension of the table, sometimes denoted $Extension_{T}$ when used together with $Attributes_{T}$, is its actual content, which is $T$ itself. We use $T$ or $Extention_{T}$ interchangeably depending ono the context.

The following collection is an example that respects relational integrity:

However, the following collection violates the relational integrity:

Domain integrity¶

A collection $T$ fulfills domain integrity if the values associated with each attribute are restricted to a domain. We define these domains with a function $D$ mapping strings (the attributes) to domains (unused attributes are just associated with empty domains, i.e. this does not need to be a partial function):

$$D \in P(V)^S$$

A collection $T$ fulfills the domain integrity constraint specified by $D$ if, for each row, the values are in the specified domains:

$$\forall t \in T, \forall a \in \text{support}(t), t.a \in D(a)$$

Typically, the domains $(D(a))_a$ will be standard sets such as integers, strings, booleans, dates, and so on. Concretely, the domain mapping $D$ is called a schema, i.e., the names of columns, and the domain of each column (string, integer, date, boolean, and so on).

The following collection respects the domain integrity:

However, the following collection does not respect the domain integrity:

Atomic integrity¶

A collection $T$ fulfills the atomic integrity constraint if the values used in it are only atomic values, i.e.:

$$T \subseteq S \mapsto A$$

This means that the collection does not contain any nested collections or sets or lists or anything that has a structure of its own: it is "flat."

The following collection respects atomic integrity:

However, the following collection does not respect atomic integrity:

The relational model and beyond¶

A relational table is defined by three key integrity constraints: all records have the same structure, a schema assigns domains to all attributes, and values are atomic. Emphasizing these constraints separately is pedagogically useful, as it highlights their importance in relational databases and sets the stage for selectively relaxing them. By relaxing one or more of these constraints, we move into the realm of NoSQL databases, which handle collections of trees rather than traditional tables.

Relational Algebra¶

Once relational tables have been formally defined, it is possible to manipulate them. Mathematically, the framework that allows for this is called the relational algebra. Like numbers can be manipulated with addition, multiplication, relational tables can be manipulated with many different operators.

These operators can be classified in four board categories:

Set queries act on relational tables as sets: one can take the union or the intersection of two sets, or subtract a set from another. These operators directly and naturally translate to relational tables.
Filter queries. These operators take a portion of a table: some or all columns, some or all rows, etc. Thye are known as projection and selection. There also exists a fancy operator called the "extended projection" that can be used to add more computed columns.
Renaming queries. These operators can rename columns.
Joining queries. These operators can take the Cartesian product of two tables, potentially filtering to match values from both sides. The latter is called a join.

Selection¶

A selection takes a subset of the records belonging to the table. It takes a parameter, which is a predicate on the attributes. This predicate is evaluated for each record: if it is true, the record is kept, if it is false, it does not appear in the selection.

The notation looks like following:

$$\sigma_{\text{condition}}(\text{relation})$$

For example, suppose the following relation is denoted as $R$, then the selection can be written as $\sigma_{\text{Country} = \text{"GB"}}(R)$.

Projection¶

A projection keeps all records, but removes columns. It takes as a parameter the names of the attributes to keep in the projection.

The notation looks like following:

$$\pi_{\text{column(s)}}(\text{relation})$$

For example, suppose the following relation is denoted as $R$, then the selection can be written as $\pi_{\text{Last, First}}(R)$.

Grouping¶

A grouping, also called aggregation, merges records by grouping on some attributes, and aggregating on all others.

The notation looks like following:

$$\gamma_{\text{group}, \text{function}(\text{column}) \rightarrow \text{column}}(\text{relation})$$

For example, suppose the following relation is denoted as $R$, then the grouping can be written as $\gamma_{\text{G}, \text{SUM}(\text{A}) \rightarrow \text{A}}(\text{R})$.

In this example, we uses the aggregration function SUM. There are a lot of other aggregration functions, and the most common ones are COUNT, SUM, MAX, MIN, and AVERAGE. Another thing that need to be noteiced is that the aggregration functions must aggregrate to a single value, not nested values, to meet the atomic integrity of relational database / algebra.

Renaming¶

Renaming a column changes the name of a column and is denoted with $\rho$. The notation looks like following:

$$\rho_{\text{old column} \rightarrow \text{new column}}(\text{R})$$

Sorting¶

This operation can sort the whole table based on a column.

Cartesian product¶

A Cartesian product combines each tuple from the left relational table with each tuple from the right relational table.

The notation used is the $\times$ symbol. For example,

$$T = R \times S$$

is the Cartesian products of R with S and corresponds visually to:

Cartesian products should be handled with care: imagine the size of the resulting table if the table on the left has a billion records and that on the right has a million records (tables with millions or billions of records are quite common in the real world).

Join¶

We often use joins instead of Cartesian products. Joins can be considered as a "filtered" Cartesian product in which we only combine directly related tuples and omit all other non-matching pairs.

The notation used is the $\bowtie$ symbol. For example,

$$T = R \bowtie S$$

joins R and S but only keeps records that coincide on the attributes common to both sides (here, B must match):

The joins as above, called natural joins, are very common in relational databases and are less expensive than Cartesian products, but remain more expensive than simpler operators such as projection and selection.

There are other kinds of joins:

Theta joins explicitly specify the joining critrtion instead of matching the common attributes. The ariterion is then supplied as a predicate in the subscript of $\bowtie$ with the same syntax as $\sigma$. The notation looks like $T = R \bowtie_{A = D} S$. Note that the join must have the a pair of same column name.
Outer joins keep records with no match on the other side, keeping the other attributes absent. Note that this breaks reltational integrity in its strictest sense, as these joined records will not have the full support of the resulting relational table.
Semi-outer joins are similar to (full) outer joins but only keep unmatched records on the left, or only on the right.

Note: Relational algebra operators can be combined and nested in more complex formulas.

Normal forms¶

When creating databases, the schemas, that is, the columns and their domains, should be designed with care. If some rules are not respected, things can go wrong as the data can easily become inconsistent because it is duplicated:

Deletion anomalies: Poorly designed databases can become inconsistent when a record is deleted. For example, deleting the only order of a specific phone in an online shop might erase all its data.
Insertion anomalies: Insertion anomalies occur in poorly designed databases when adding a record creates inconsistencies, such as duplicating product data.
Update anomalies: Update anomalies occur in poorly designed databases when modifying a record creates inconsistencies, such as conflicting phone data across orders.

In order to avoid such anomalies, best practice dicates to follow so-called normal forms.

Before introducing normal forms, there are a few concepts that have to be understanded.

Functional dependency: A functional dependency means that when the value of one attribute is known, one can always determine the value of another attribute.

Superkey: A superkey is any set of attributes that uniquely identifies each tuple of a relation.

Candidate key (minimal superkey): A candidate key is a column or combination of columns in a database table that uniquely identifies a record and can be selected as the primary key. It can have one key per table or many keys per table.

Primary key: A primary key is a single attribute or collection of attributes that uniquely identifies each record in a table. A table can only have one primary key, and it cannot contain null values.

1st Normal Form¶

The 1st normal form is simply following the atomic integrity.

2nd Normal Form¶

The second normal form takes it to the next level: it requires that each column in a record contains information on the entire record.

Take the following table as an example. As the column "Name" refers to a student (identified with their "Legi"), but the whole record semantically corresponds to an attendance of a student to a lecture; we say that the "Legi" and "Lecture ID", together, are a primary key of the record, but "Name" does not functionally depend on the full primary key, only on half of it. That's why this table is not in second normal form.

Here is another counter example:

The following is how we fix the table above.

Actually, if we join these three tables we will get the whole table we have before.

3rd Normal Form¶

Relation $R$ with functional dependicies $F$ is in 3rd normal form if, for all $X \rightarrow A$ in $F$:

$A \subseteq X$ or
$X$ is a superkey or
$A$ is a part of some minimal key for $R$,

here $X$ is subset of attributes, $A$ is single attribute.

The third normal form additionally forbids functional dependencies on anything else than the primary key, for example the following table is not in third normal form because "PLZ" functionally depends on "City" and "State," but it is not a primary key, in fact column "Legi" is the primary key here.

This can be fixed by separating the data into two tables, with the data on "City" and "State" moved out to another table.

Boyce-Codd Normal Form¶

Relation $R$ with functional dependicies $F$ is in 3rd normal form if, for all $X \rightarrow A$ in $F$:

$A \subseteq X$ or
$X$ is a superkey,

here $X$ is subset of attributes, $A$ is single attribute.

Note that Boyce-Codd Normal Form (BCNF) implies 3NF, but 3NF does not imply BCNF.

A counter example of BCNF is shown as follows:

The following is how we fix the issue:

The SQL Language¶

SQL (Structured Query Language) originated as a follow-up to Edgar Codd's idea of data independence. It was developed by Don Chamberlin and Raymond Boyce, who also contributed to the Boyce-Codd normal form. Initially called SEQUEL (Structured English QUEry Language), it was implemented in IBM's System R, the first commercial relational database. Due to trademark issues, the name was changed to SQL, explaining the varying pronunciations ("see-kwel" or "es-kew-el"). Oracle, initially known as Software Development Laboratories, later became a major player in the RDBMS market alongside IBM and Microsoft.

SQL is a declarative language, allowing users to specify what they want, while the system determines how to execute the query. It is set-based, meaning it handles sets of records rather than individual values, and has some functional aspects through nested queries, although this is limited.

Programming languages can be grouped into six main categories: assembly code, higher-level imperative languages (like C, C++, Java), functional and declarative query languages (SQL, JSONiq, SPARQL), lower-level APIs (like DataFrame APIs in Apache Spark), and machine learning frameworks (TensorFlow, PyTorch, Keras). ML frameworks are gradually being supplemented by higher-level declarative languages for machine learning.

A selecting query¶

A projecting query¶

A renaming query¶

A sorting query¶

A grouping query¶

Post-aggregation selection¶

A union query¶

Theta joins¶

Full outer joins¶

Natural join¶

Summary of basic SQL operations¶

SELECT city, COUNT(*) AS population
FROM persons
WHERE residence = "first"
GROUP BY city
HAVING COUNT(*) < 100000
ORDER BY population DESC
LIMIT 10
OFFSET 20

The FROM clause specifies the tables to read data from, typically producing a Cartesian product. The WHERE clause filters the results, often creating a theta join. Efficient SQL implementations optimize these joins.

The GROUP BY clause aggregates data using a key, while the SELECT clause defines projections and column renaming (with AS). The HAVING clause filters results after grouping, unlike WHERE.

ORDER BY sorts the output, and LIMIT with OFFSET handles pagination. Except for SELECT and FROM, all clauses are optional.

Internally a SQL query will be converted to a query plan that is closely related to the relational algebra, for example like so:

Nesting¶

Tips¶

Three-valued logics in SQL

DML and DDL

DML: Data Manipulation Language (query, insert, remove rows)
DDL: Data Definition Language (create or table / schema, drop it)

Indices

Looks like a tree structure

Optimize for read vs. write intensive

Transactions¶

The good old times of databases - ACID, which stands for Atomicity, Consistency, Isolation, and Durability.

Atomiticy: either an update (called a transaction if it consists of several updates) is applied to the database completely, or not at all.
Consistency: before and after the transactions, the data is in a consistent state, for example some values sum to another value, another value is positive.
Isolation: the system "feels like" the user is the only one using the system, where in fact maybe thousands of people are using it as well concurrently.
Durability: any data written to the database is durably stored and will not be lost, for example is there is an electricity shortage or a disk crash.

Scaling up and out¶

The limits of traditional relational database management systems can be reached in many ways:

Too many rows. Beyond a million records, a system on one machine ca start showing signs of weaknessl even though more recent system s meange to push it a bit higher on a single machine.
Too many columns. Beyond 225 columns, a system on one machine can start showing signs of weakness or even not support it at all.
Too many nesting. Many systems do not support nested data or, if they do, do so only in a limited fashion and it becomes quickly cumbersome.

This concludes our brush-up, as in fact most of what follows will be directly related to data that has lots of rows, lots of columns, or lot of nesting: