263-3010-00: Big Data
Section 3
Cloud Storage
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/02/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny's course. Please contact us to delete this file if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Storing Data¶
Suppose a dataset is 273 TB, organized into 176 million files across 680,000 directories, describing 1.23 billion objects, including 260 million stars and 208 million galaxies. As of 2021, no single machine can store this much data, so a single-machine solution is infeasible. How do we manage data at this scale?
The fact is that petabytes do not fit on a single machine. The good news is that we do not need to reinvent the wheel: about 99% of what we learned over 48 years of SQL and relational databases can be reused (see the sketch after this list):
Relational algebra: selection, projection, grouping, sorting, joining
Language: SQL, declarative languages, functional languages, optimizations, query plans, indices
Components of tables: table, rows, columns, primary key
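To make this concrete, here is a minimal sketch of these relational operations using pandas; the toy table of astronomical objects, its column names, and the second table used for the join are all made up for illustration:

```python
import pandas as pd

# A toy table of astronomical objects (hypothetical data).
objects = pd.DataFrame({
    "id":        [1, 2, 3, 4],
    "type":      ["star", "galaxy", "star", "galaxy"],
    "magnitude": [4.2, 9.1, 2.8, 11.5],
})

# Selection: keep only the stars.
stars = objects[objects["type"] == "star"]

# Projection: keep only the id and magnitude columns.
projected = stars[["id", "magnitude"]]

# Grouping and aggregation: average magnitude per object type.
avg_by_type = objects.groupby("type")["magnitude"].mean()

# Sorting: brightest objects first (lower magnitude means brighter).
sorted_objects = objects.sort_values("magnitude")

# Joining: attach survey information from a second (made-up) table.
surveys = pd.DataFrame({"id": [1, 2, 3, 4], "survey": ["SDSS"] * 4})
joined = objects.merge(surveys, on="id")

print(projected)
print(avg_by_type)
```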
NoSQL is the solution. Although it drops some constraints that SQL enforces, such as tabular integrity, domain integrity, atomic integrity (1NF), 2NF, 3NF, and BCNF, it offers features that SQL does not; it supports (all three illustrated in the sketch after this list):
Heterogeneous data: Data in which not all records have the same structure, violating relational integrity; the data may not even have a schema.
Nested data: Data that is not in first normal form, i.e., values can themselves be lists or nested structures, violating atomic integrity.
Denormalized data: At large scales, it sometimes becomes desirable to denormalize rather than normalize data, in order to get better performance.
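As an illustration, here is a small sketch of JSON-like documents (built in Python, with made-up fields) that exhibit all three properties at once:

```python
import json

# Heterogeneous: the two documents do not share the same attributes.
# Nested: "filters" holds an array and "shape" holds an object,
#         violating first normal form.
# Denormalized: the survey name is repeated in each document instead of
#         being factored out into a separate table.
objects = [
    {"id": 1, "type": "star", "survey": "SDSS",
     "filters": ["u", "g", "r"], "magnitude": 4.2},
    {"id": 2, "type": "galaxy", "survey": "SDSS",
     "redshift": 0.12,
     "shape": {"ellipticity": 0.3, "position_angle": 45}},
]

print(json.dumps(objects, indent=2))
```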
The technology stack¶
The technology stack we are going to rebuild goes from a level very close to the hardware (storage) all the way to interaction with the end user (a query language or even a voice assistant).
The following table gives examples and a short description of each layer; a sketch after the table shows how the lower layers compose:
Stack | Example | Description |
---|---|---|
Storage | Local file system, NFS, HDFS, S3 | Systems to store data, including local, network, distributed, and cloud storage options. |
Encoding | UTF-8, BSON | Methods to encode data into bits for computer processing, including textual encodings like UTF-8 and binary encodings like BSON. |
Syntax | CSV, XML, JSON, RDF/XML | Formats used to represent data in a human-readable text form, or in binary form for efficient storage. |
Data models | Relational model, RDF triples, XML infoset | High-level representations of data, such as models for tables (relational), trees, and graphs. |
Validation | XML Schema, JSON Schema, XBRL taxonomies | Languages used to impose constraints and validate the structure of data. |
Processing | MapReduce, Spark, Flink | Frameworks that process and analyze data, ranging from older two-stage systems to newer dataflow engines. |
Indexing | Hash indices, B+-trees, spatial indices | Techniques used to speed up data lookups, generally independent of the data's shape. |
Data stores | RDBMS, MongoDB, Snowflake | Products that provide a complete set of functionalities for storing and managing data, including relational databases and document stores. |
Querying | SQL, SPARQL, Cypher | Languages used to query different data models, such as SQL for tables, SPARQL for RDF graphs, and Cypher for property graphs. |
User interfaces | Tableau, Qlikview, Siri | Tools and applications that abstract away the query language, providing an easy-to-use interface for users. |
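To show how the lower layers of this stack compose in practice, here is a hedged sketch that takes one record from the data model level down to local storage; the record and the file name are hypothetical:

```python
import json

# Data model level: a tree-shaped record (a Python dictionary).
record = {"id": 42, "type": "star", "magnitude": 4.2}

# Syntax level: serialize the record to JSON text.
text = json.dumps(record)

# Encoding level: turn the text into bytes with UTF-8.
data = text.encode("utf-8")

# Storage level: persist the bytes on the local file system.
with open("record.json", "wb") as f:
    f.write(data)
```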
In this section, we focus only on the storage layer of the stack.
Databases vs. Data Lakes¶
Data must be stored somewhere, and there are two main paradigms for storing and retrieving it (contrasted in the sketch after this list):
Extract-Transform-Load (ETL): Data is imported into a database, requiring extra work before querying. This is used by relational databases and some NoSQL systems like MongoDB and HBase. ETLing data allows for faster queries by storing it in a proprietary format and building indices.
Data Lake Paradigm: Data is stored on a file system (local, distributed, or cloud) and queried in place using tools like Pandas, Apache Spark, or RumbleDB. This approach allows immediate querying without ETL but is generally slower, best suited for full data scans (e.g., MapReduce, Spark).
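The following sketch contrasts the two paradigms, using a hypothetical objects.csv file, pandas for in-place querying, and SQLite as a stand-in for a full RDBMS:

```python
import sqlite3
import pandas as pd

# Data lake style: query the file in place, no import step needed.
df = pd.read_csv("objects.csv")  # hypothetical file of astronomical objects
stars = df[df["type"] == "star"]

# ETL style: load the data into a database first (SQLite here as a
# stand-in for a full RDBMS); indices can then speed up later queries.
conn = sqlite3.connect("objects.db")
df.to_sql("objects", conn, if_exists="replace", index=False)
conn.execute("CREATE INDEX IF NOT EXISTS idx_type ON objects(type)")
stars_sql = pd.read_sql_query("SELECT * FROM objects WHERE type = 'star'", conn)
conn.close()
```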
From Your Laptop to a Data Center¶
Local file systems¶
On a single machine, data is typically stored on the local file system, limited by disk capacity to a few terabytes. SSDs (Solid-State Drives), though more expensive per byte and often smaller than hard disks, can now hold several terabytes. A file on the local file system consists of content and metadata. The content is sliced into blocks (around 4 kB each) and read block by block for optimized performance. Metadata includes file attributes such as name, access rights, owner, modification time, and size. Files are organized hierarchically in directories, a structure familiar to most users.
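Here is a short sketch of these two sides of a file, assuming the record.json file from the earlier sketch (any existing file works):

```python
import os

path = "record.json"  # any existing file on the local file system
st = os.stat(path)

# Metadata: attributes kept by the file system, separate from the content.
print("size in bytes:", st.st_size)
print("owner user id:", st.st_uid)
print("last modified:", st.st_mtime)

# Content: read block by block; 4096 bytes matches a common block size.
with open(path, "rb") as f:
    while block := f.read(4096):
        pass  # process one block at a time
```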
More users, more files¶
How can this file system be made accessible to more users? One way is to make a disk accessible over the network, a local area network (LAN), and share it with other people.
There also exist larger networks, wide area networks (WANs), but concurrent access is difficult to scale, and problems can arise when two people work on the same file at the same time.
Another difficulty is that a local file system cannot easily support billions of files.
To make this scale possible, the approach is called object storage (a sketch of such an API follows this list):
- Throw away the hierarchy: there are no directories.
- Make the metadata flexible: attributes can differ from file to file (no schema).
- Use a very simple, if not trivial, data model: a flat list of files (called objects), each identified by an identifier (ID); blocks are not exposed to the user, i.e., an object is a black box.
- Use a large number of cheap machines rather than a few supercomputers; an analogy is the battery of Tesla vehicles, which is made of many small batteries.
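As an example of what an object storage API looks like in practice, here is a sketch against Amazon S3 using boto3; the bucket name and object key are hypothetical, and configured AWS credentials are assumed:

```python
import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")

# Objects live in a flat namespace: a bucket plus a key, no real directories.
# The "sdss/" prefix in the key is just a naming convention, not a hierarchy.
s3.put_object(
    Bucket="my-example-bucket",        # hypothetical bucket name
    Key="sdss/object-42.json",
    Body=b'{"id": 42, "type": "star"}',
    Metadata={"survey": "SDSS"},       # flexible, per-object metadata
)

# Retrieve the object as a black box: we get bytes back, never blocks.
response = s3.get_object(Bucket="my-example-bucket", Key="sdss/object-42.json")
print(response["Body"].read())
```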
Scale¶
Aspect | Scale Up | Scale Out |
---|---|---|
Definition | Increase the resources (memory, CPU, disk) of a single machine. | Add more machines and distribute the workload across them. |
Cost | Becomes exponentially more expensive as resources increase. | Typically more cost-effective; allows adding lower-cost machines. |
Complexity | Relatively simpler, but limited by the capacity of a single machine. | Involves more complexity due to distributed system management. |
Limitations | Has hard limits in terms of cost and hardware capabilities. | Can theoretically scale indefinitely with more machines. |
Use Case | Useful for small-to-moderate data sizes that fit on a single machine. | Best for handling large datasets that don't fit on a single machine or need high throughput. |
Performance Gain | Improves performance up to a point, but with diminishing returns. | Performance improves by adding more machines, with better parallelism. |
Example | Upgrading a server from 8 GB to 32 GB of RAM. | Adding multiple servers to handle 50 TB of data. |
Scale-up process (from left to right)
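To illustrate the basic mechanism behind scaling out, here is a minimal sketch of hash partitioning, which deterministically assigns objects to machines; the machine names and object IDs are made up:

```python
import hashlib

machines = ["machine-0", "machine-1", "machine-2"]  # hypothetical cluster

def assign_machine(object_id: str) -> str:
    # Hash the object ID and map the digest to one of the machines.
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    return machines[int(digest, 16) % len(machines)]

for oid in ["star-1", "galaxy-2", "star-3"]:
    print(oid, "->", assign_machine(oid))
```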