263-3010-00: Big Data
Section 1
Introduction
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 12/23/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content may be incorrect. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Scale
Humankind's understanding of the universe has evolved from observing the Sun, Moon, and planets to realizing the vastness of the cosmos, stretching from meters to yottameters. This expansion mirrors the growth of data, from bytes to exabytes, in our digital age. Just as astrophysicists need to understand the universe on both large and small scales, data science requires modeling data at the tiniest levels to grasp massive datasets. Like physics, data science studies the world as it actually is, which makes it more empirical than fields like mathematics and computer science.
A short history of databases
Prehistory of databases
Humans pass down knowledge from generation to generation by speaking, telling, and singing stories. However, this method is susceptible to distortions, loss of information, and the introduction of errors.
First revolution - Writing
The first revolution occurred thousands of years ago with the invention of writing, initially done on clay tablets. Once dried, these tablets could preserve information for thousands of years. This ability to encode and store information is why we have much more knowledge about human history after the advent of writing, allowing us to access well-preserved information from those earlier times.
Accounting
What is truly remarkable about clay tablets is that they weren't just used for writing text; some contained relational tables. The earliest known example is Plimpton 322, which is over 3,800 years old. This demonstrates how intuitive relational tables are to humans and foreshadows their significance in later centuries.
Second revolution - Printing
The second revolution was the invention of the printing press. Before its advent, duplicating documents was expensive and required manual copying, which hindered the spread of knowledge. The printing press made it easy to produce large quantities of the same text, ushering in the Golden Age of the written press.
Third revolution - Computers
The third revolution was the invention of modern silicon-based computers, which greatly accelerated data processing.
Modern databases
The birth of database management systems is often traced back to 1970, when Edgar Codd published a seminal paper introducing the concept of data independence. Early computer users managed data directly on storage devices, a resource-intensive process. Codd proposed that a database management system should hide this physical complexity and present a simple, table-based model, leading to the relational model and relational algebra. However, the explosion of data in recent decades pushed this model to its limits, leading to modern systems like key-value stores, wide column stores, document stores, and graph databases.
1960s: File systems
1970s: The relational era
2000s: The NoSQL era
The three V's of Big Data
The recent evolution of large-scale data processing in the past two decades is often summarized by the three Vs: Volume, Variety, and Velocity.
Volume
The volume of data stored worldwide is increasing exponentially, reaching nearly 100 zettabytes by 2021. This surge is due to automated data collection and ample storage capacity, allowing companies to retain data indefinitely. However, this practice may change with regulations like the European GDPR.
Powers-of-ten prefixes worth remembering:
Prefix | Number |
---|---|
kilo (k) | 1,000 ($10^3$) |
Mega (M) | 1,000,000 ($10^6$) |
Giga (G) | 1,000,000,000 ($10^9$) |
Tera (T) | 1,000,000,000,000 ($10^{12}$) |
Peta (P) | 1,000,000,000,000,000 ($10^{15}$) |
Exa (E) | 1,000,000,000,000,000,000 ($10^{18}$) |
Zetta (Z) | 1,000,000,000,000,000,000,000 ($10^{21}$) |
Yotta (Y) | 1,000,000,000,000,000,000,000,000 ($10^{24}$) |
Ronna (R) | 1,000,000,000,000,000,000,000,000,000 ($10^{27}$) |
Quetta (Q) | 1,000,000,000,000,000,000,000,000,000,000 ($10^{30}$) |
Note that 1,000 kB = 1 MB here.
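Computer scientists have often used the same prefixes to express powers of 2 rather than powers of 10, exploiting the coincidence that $2^{10}$ is very close to $10^3$. Under that convention, 1 kB means 1,024 B rather than 1,000 B. The binary prefixes (kibi, mebi, gibi, and so on) remove the ambiguity by naming powers of 2 explicitly, so that 1 KiB = 1,024 B.
The following binary prefixes are just for reference: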
Prefix | Number |
---|---|
Kibi (Ki) | 1,024 ($2^{10}$) |
Mebi (Mi) | 1,048,576 ($2^{20}$) |
Gibi (Gi) | 1,073,741,824 ($2^{30}$) |
Tebi (Ti) | 1,099,511,627,776 ($2^{40}$) |
Pebi (Pi) | 1,125,899,906,842,624 ($2^{50}$) |
Exbi (Ei) | 1,152,921,504,606,846,976 ($2^{60}$) |
Zebi (Zi) | 1,180,591,620,717,411,303,424 ($2^{70}$) |
Yobi (Yi) | 1,208,925,819,614,629,174,706,176 ($2^{80}$) |
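The gap between the two tables grows with each prefix. The following minimal Python sketch quantifies it using only the values listed above. The dictionary and function names are chosen for illustration.

```python
# Decimal (SI) versus binary prefixes, using the values from the two tables.
SI = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50}


def gap_percent(si_prefix: str, binary_prefix: str) -> float:
    """Return how much larger the binary unit is than the decimal one, in percent."""
    return (BINARY[binary_prefix] / SI[si_prefix] - 1) * 100


# Both dictionaries are ordered from smallest to largest, so zip pairs k/Ki, M/Mi, ...
for si, binary in zip(SI, BINARY):
    print(f"1 {binary}B is {gap_percent(si, binary):.1f}% larger than 1 {si}B")

# A capacity of 1 TB (decimal) expressed in TiB (binary).
one_tb_in_tib = SI["T"] / BINARY["Ti"]
print(f"1 TB = {one_tb_in_tib:.3f} TiB")
```

The difference is only 2.4% at the kilo level, which is why the two conventions were long used interchangeably, but it approaches 10% at the tera level.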
Variety
New shapes of data have emerged beyond the table-based model Edgar Codd proposed. These include:
Trees: Found in formats like XML, JSON, and Parquet, representing denormalized data.
Unstructured Data: Raw formats such as text, images, audio, and video.
Cubes: Popular in business analytics for multi-dimensional data analysis, especially in the 1990s.
Graphs: Used in databases like Neo4j, ideal for efficient data traversal.
While these new data shapes are increasingly relevant, maintaining a clean, logical abstraction remains essential. Contrary to a common misconception, data independence is just as crucial for these shapes as for tables, including in document stores, whether the data is normalized or denormalized.
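To make the contrast between tree-shaped and table-shaped data concrete, here is a small Python sketch. The customer record, its field names, and its values are invented for illustration and do not come from the course material.

```python
import json

# A tree-shaped (denormalized) record: the orders are nested inside the customer.
# All names and values here are hypothetical.
customer_doc = json.loads("""
{
  "name": "Ada",
  "city": "Zurich",
  "orders": [
    {"item": "disk", "quantity": 2},
    {"item": "tape", "quantity": 5}
  ]
}
""")

# The same information flattened into table rows, one row per order.
# The customer name is repeated on each row and acts as the link back to the customer.
order_rows = [
    {"customer": customer_doc["name"], "item": o["item"], "quantity": o["quantity"]}
    for o in customer_doc["orders"]
]

for row in order_rows:
    print(row)
```

Both forms carry the same logical content. Which one a system stores physically is a separate decision, which is the point of data independence.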
Velocity
A growing discrepancy has emerged between how much data we can store, how quickly we can read it, and how long we must wait before it starts arriving. This challenge led to the development of data processing technologies like MapReduce and Apache Spark. With data now being generated automatically by sensors and user interactions, understanding recent advancements requires examining three factors:
Capacity: how much data can we store per unit volume?
Throughput: how many bytes can we read per unit of time?
Latency: how much time do we need to wait until the bytes start arriving?
The first commercially available hard drive, the IBM RAMAC 350 from 1956, had a capacity of 5 MB, a throughput of 12.5 kB/s, and a latency of 600 ms, and it was physically very large. In contrast, a modern hard drive such as the Western Digital Ultrastar DC HC670 (2024) offers 26 TB of storage, 261 MB/s of throughput, and 4.16 ms of latency in a much smaller enclosure. Over that period, storage capacity per unit volume has increased roughly 23 billion-fold and throughput roughly 20,800-fold, while latency has improved only by a factor of about 144.
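The throughput and latency ratios can be rechecked from the figures quoted above. The sketch below also estimates how long a full sequential read of each drive would take, using the simplifying assumption that read time is latency plus capacity divided by throughput. That model and the variable names are illustrative assumptions. The per-volume capacity ratio is not recomputed because it depends on the physical dimensions of the drives, which are not listed here.

```python
# Figures quoted in the text, in decimal units (bytes, bytes per second, seconds).
ramac = {"capacity_B": 5e6, "throughput_Bps": 12.5e3, "latency_s": 0.600}
hc670 = {"capacity_B": 26e12, "throughput_Bps": 261e6, "latency_s": 4.16e-3}


def full_read_seconds(drive):
    """Simplified model: wait for the first byte, then stream the whole drive."""
    return drive["latency_s"] + drive["capacity_B"] / drive["throughput_Bps"]


throughput_gain = hc670["throughput_Bps"] / ramac["throughput_Bps"]
latency_gain = ramac["latency_s"] / hc670["latency_s"]
raw_capacity_gain = hc670["capacity_B"] / ramac["capacity_B"]

print(f"throughput gain:   {throughput_gain:,.0f}x")
print(f"latency gain:      {latency_gain:,.0f}x")
print(f"raw capacity gain: {raw_capacity_gain:,.0f}x (not per unit volume)")
print(f"full read, RAMAC 350: {full_read_seconds(ramac) / 60:.1f} minutes")
print(f"full read, HC670:     {full_read_seconds(hc670) / 3600:.1f} hours")
```

Under this model the 1956 drive can be read end to end in a few minutes, whereas the modern drive needs more than a day. A single disk has become much slower to scan relative to what it holds.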
Consider a book with 600,000 words that takes 10 hours to read at 1,000 words per minute. If the book's size increased 23 billion-fold and reading speed increased 20,800-fold, it would take about 1,300 years to finish. However, by using parallelization, that is, spreading the reading across 1.1 million people, it could still be completed in 10 hours. Parallelization is one key technique in modern data processing. The second is batch processing, which handles large datasets by processing records in batches rather than individually. Big Data, therefore, is a set of technologies designed to store, manage, and analyze data too large for a single machine, addressing the growing gap between capacity, throughput, and latency.
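The arithmetic behind the book analogy can be reproduced in a few lines of Python. The inputs are the figures from the paragraph above, and the variable names are illustrative.

```python
words = 600_000
words_per_minute = 1_000
base_hours = words / words_per_minute / 60  # reading time for the original book

size_factor = 23e9       # growth in capacity per unit volume
speed_factor = 20_800    # growth in throughput

# The book grows much faster than the reading speed does.
scaled_hours = base_hours * size_factor / speed_factor
scaled_years = scaled_hours / (24 * 365)

# Number of parallel readers needed to get back to the original reading time.
readers_needed = scaled_hours / base_hours

print(f"original book: {base_hours:.0f} hours")
print(f"scaled book, one reader: about {scaled_years:,.0f} years")
print(f"readers needed for {base_hours:.0f} hours: about {readers_needed:,.0f}")
```

The exact result is a little over 1,260 years of non-stop reading, which the text rounds to 1,300, and about 1.1 million readers.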
In short
Big Data is a portfolio of technologies designed to store, manage, and analyze data that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput, and latency.
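The two techniques named above, parallelization and batch processing, can be sketched on a toy word count. This is a minimal illustration under stated assumptions, not how MapReduce or Spark are implemented: the function names are hypothetical, and a thread pool on one machine stands in for a cluster of many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(records, batch_size):
    """Split a list of records into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def count_words(batch):
    """Process one batch as a unit: count the words in all of its lines."""
    return sum(len(line.split()) for line in batch)


def parallel_word_count(lines, batch_size=2, workers=4):
    """Count words by handing each batch to a worker, then combining the results."""
    batches = make_batches(lines, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, batches))
    return sum(partial_counts)


# Hypothetical input: four short lines of text.
lines = [
    "big data is a portfolio",
    "of technologies",
    "to store manage",
    "and analyze data",
]
total = parallel_word_count(lines)
print(total)
```

Each worker sees a whole batch rather than a single record, and the partial results are combined at the end. This is the same overall pattern as many readers each taking a share of the book.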
Big Data in the Sciences
Big Data has numerous applications, one of the largest being in High-Energy Physics at CERN, where 50 PB of data is produced annually from a billion collisions per second. With 15,000 servers and 230,000 cores, most raw data is filtered immediately, storing only a fraction for analysis. Surprisingly, much of this data is archived on tape, the most cost-effective long-term storage despite retrieval delays. Another example is the Sloan Digital Sky Survey (SDSS), which generated 200 GB of data nightly to create the most detailed 3D map of the universe. Emerging storage methods include DNA, with CRISPR-Cas9 and mRNA techniques reflecting advances in data storage and manipulation.