263-3010-00: Big Data
Section 4
Distributed File Systems
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/07/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in this summary may not be correct. You should use this file properly and legally. We are not responsible for any consequences of using this file.
This personal note is adapted from Professor Ghislain Fourny's course. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Cloud storage services, like Amazon S3, can store vast amounts of large objects, with individual object sizes up to 5 TB. But what about even larger objects?
Yes, it is possible to store files larger than what a single machine can hold, but this requires a paradigm shift that:
Natively reintroduces file hierarchy.
Natively supports block-based storage.
Main requirements of a distributed file system¶
Data goes through a lifecycle: raw data (from sensors, logs, etc.) gets processed into derived datasets, which may undergo further processing. The storage backend for this must be resilient to frequent failures in large clusters (1,000+ machines) and should:
Monitor itself
Detect failures
Recover from failures
Be fault-tolerant overall (a minimal detection-and-recovery sketch follows this list)
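To make the monitor/detect/recover requirement concrete, here is a minimal, hypothetical Python sketch of the heartbeat idea that systems like HDFS rely on: nodes periodically report that they are alive, prolonged silence is treated as a failure, and recovery is triggered for the affected node. All names, timeouts, and the recovery placeholder are illustrative assumptions, not HDFS code.

```python
import time

# Hypothetical, simplified sketch of self-monitoring: detect failed nodes
# from missing heartbeats and trigger recovery. Names and timeouts are
# illustrative, not the actual HDFS implementation.

HEARTBEAT_TIMEOUT_S = 30.0  # consider a node dead after 30 s of silence


class ClusterMonitor:
    def __init__(self):
        self.last_heartbeat = {}   # node id -> timestamp of last heartbeat
        self.dead_nodes = set()

    def record_heartbeat(self, node_id: str) -> None:
        """Called whenever a node reports that it is alive."""
        self.last_heartbeat[node_id] = time.time()
        self.dead_nodes.discard(node_id)

    def detect_failures(self) -> list[str]:
        """Return nodes that have not sent a heartbeat recently."""
        now = time.time()
        newly_dead = [
            node for node, ts in self.last_heartbeat.items()
            if node not in self.dead_nodes and now - ts > HEARTBEAT_TIMEOUT_S
        ]
        self.dead_nodes.update(newly_dead)
        return newly_dead

    def recover(self, node_id: str) -> None:
        """Placeholder: restore access to the data the node was holding."""
        print(f"recovering data previously stored on {node_id}")

    def monitor_once(self) -> None:
        for node_id in self.detect_failures():
            self.recover(node_id)
```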
While laptop hard drives prioritize random access, distributed storage systems focus on:
Efficiently scanning large files (for analysis)
Appending new data (for logs/sensors)
Such systems must handle hundreds of concurrent users, with throughput being the primary concern rather than latency.
This design is directly consistent with a full-scan access pattern rather than with a random-access pattern; the latter is strongly latency-bound, in the sense that most of the time is spent waiting for data to arrive.
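A back-of-the-envelope calculation shows why. The numbers below are assumed, order-of-magnitude figures for a magnetic hard drive (about 100 MB/s of sequential throughput, about 10 ms per random seek), not values from the course; the point is only the shape of the trade-off.

```python
# Back-of-the-envelope comparison with illustrative numbers only.

FILE_SIZE_MB = 1_000_000        # a 1 TB file
SEQ_THROUGHPUT_MB_S = 100       # assumed sequential read throughput
SEEK_TIME_S = 0.010             # assumed average seek time per random access
RECORD_SIZE_MB = 0.001          # 1 KB records read at random

# Full scan: time is dominated by throughput.
full_scan_s = FILE_SIZE_MB / SEQ_THROUGHPUT_MB_S

# Random access to every record: time is dominated by latency (seeks).
num_records = FILE_SIZE_MB / RECORD_SIZE_MB
random_access_s = num_records * (SEEK_TIME_S + RECORD_SIZE_MB / SEQ_THROUGHPUT_MB_S)

print(f"full scan:       ~{full_scan_s / 3600:.1f} hours")     # ~2.8 hours
print(f"random accesses: ~{random_access_s / 86400:.0f} days")  # ~116 days
```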
Parallelism and batch processing help overcome throughput and latency limitations, forming the core of distributed file systems like GoogleFS and HDFS. HDFS, part of the Hadoop project, grew from 188 to 42,000 nodes between 2006 and 2016, storing hundreds of petabytes of data.
The Model behind HDFS¶
File hierarchy¶
In HDFS, the word "file" is used instead of "object", although HDFS files logically correspond to S3 objects.
HDFS does not follow a key-value model. Instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system.
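To illustrate what "file namespace" means, here is a toy Python model (purely hypothetical, not HDFS code) in which directories map names to sub-directories or files, so that every file is reachable via a path:

```python
# Toy model of a hierarchical file namespace: directories map names to
# either sub-directories or files, so every file is reachable through a
# path such as /user/alice/logs/day1.log.

class Directory:
    def __init__(self):
        self.children = {}          # name -> Directory or File


class File:
    def __init__(self):
        self.blocks = []            # block ids (see the next subsection)


class Namespace:
    def __init__(self):
        self.root = Directory()

    def create_file(self, path: str) -> File:
        """Create (or overwrite) a file at an absolute path like /a/b/c."""
        *dirs, name = path.strip("/").split("/")
        current = self.root
        for d in dirs:
            current = current.children.setdefault(d, Directory())
        f = File()
        current.children[name] = f
        return f

    def list_dir(self, path: str) -> list[str]:
        """List the entries of a directory given by its absolute path."""
        current = self.root
        for d in path.strip("/").split("/"):
            if d:
                current = current.children[d]
        return sorted(current.children)


ns = Namespace()
ns.create_file("/user/alice/logs/day1.log")
ns.create_file("/user/alice/logs/day2.log")
print(ns.list_dir("/user/alice/logs"))   # ['day1.log', 'day2.log']
```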
Blocks¶
Furthermore, unlike in S3, HDFS files are not stored as monolithic black boxes; instead, HDFS exposes them as lists of blocks, again similar to a local file system.
In Google's original file system, GFS, blocks are called chunks. In fact, throughout the course, many words are used across technologies to describe partitioning data in this way: blocks, chunks, splits, shards, partitions, and so on. It is more important to understand that they are almost interchangeable than to learn by heart which technology uses which word.
Why does HDFS use blocks?
First, a PB-sized file certainly does not fit on a single machine today, so files have to be split in some way.
Second, it is a level of abstraction simple enough that it can be exposed on the logical level.
Third, blocks can easily be spread at will across many machines, a bit like a large jar of cookies can easily be split and shared among several people (see the sketch below).
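As a rough sketch of the last two points, the hypothetical Python snippet below splits a file into fixed-size blocks and spreads them over the machines of a cluster in round-robin fashion. The 128 MB block size matches the HDFS default; the placement policy and everything else are simplified assumptions, not how HDFS actually places blocks.

```python
# Illustrative sketch (not HDFS internals): split a large file into
# fixed-size blocks and spread the blocks across the machines of a
# cluster in round-robin fashion.

BLOCK_SIZE = 128 * 1024 * 1024          # 128 MiB per block (HDFS default)


def split_into_blocks(file_size_bytes: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs; the last block may be shorter."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(BLOCK_SIZE, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_blocks(blocks, machines):
    """Round-robin assignment of block index -> machine (simplified)."""
    return {i: machines[i % len(machines)] for i in range(len(blocks))}


one_petabyte = 10 ** 15
blocks = split_into_blocks(one_petabyte)
machines = [f"machine-{i:04d}" for i in range(1000)]
placement = assign_blocks(blocks, machines)
print(len(blocks), "blocks; block 0 on", placement[0])
```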