263-3010-00: Big Data
Section 4
Distributed File Systems
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/07/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Cloud storage services, like Amazon S3, can store vast amounts of large objects, with individual object sizes up to 5 TB. But what about even larger objects?
Yes, it is possible to store files bigger than what a single machine can hold, but this requires a paradigm shift that:
Natively reintroduces file hierarchy.
Natively supports block-based storage.
Main requirements of a distributed file system¶
Data goes through a lifecycle: raw data (from sensors, logs, etc.) gets processed into derived datasets, which may undergo further processing. The storage backend for this must be resilient to frequent failures in large clusters (1,000+ machines) and should:
Monitor itself
Detect failures
Recover from failures
Be fault-tolerant overall
While laptop hard drives prioritize random access, distributed storage systems focus on:
Efficiently scanning large files (for analysis)
Appending new data (for logs/sensors)
Such systems must handle hundreds of concurrent users, with throughput being the primary concern rather than latency.
This aspect of the design is directly consistent with a full-scan pattern, rather than with a random access pattern, the latter being strongly latency-bound (in the sense that most of the time is spent waiting for data to arrive).
Parallelism and batch processing help overcome throughput and latency limitations, forming the core of distributed file systems like GoogleFS and HDFS. HDFS, part of the Hadoop project, grew from 188 to 42,000 nodes between 2006 and 2016, storing hundreds of petabytes of data.
The Model behind HDFS¶
File hierarchy¶
In HDFS, the word "file" is used instead of "object", although they correspond to S3 objects logically.
HDFS does not follow a key-value model. Instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system.
Blocks¶
Unlike in S3, HDFS files are furthermore not stored as monolithic black boxes: HDFS exposes them as lists of blocks - also similar to a local file system.
In Google's original file system, GFS, blocks are called chunks. In fact, throughout the course, many words will be used, in each technology, to describe partitioning data in this way: blocks, chunks, splits, shards, partitions, and so on. It is more important to understand that they are almost interchangeable, while it is secondary to learn "by heart" which technology uses which word.
Why does HDFS use blocks?
First, because a PB-sized file surely does not fit on a single machine as of now. Files have to be split in some way.
Second, it is a level of abstraction simple enough that it can be exposed on the logical level.
Third, blocks can easily be spread at will across many machines, a bit like a large jar of cookies can easily be split and shared among several people.
Putting it all together, the following picture best summarizes the logical model of an HDFS cluster:
The size of the blocks¶
As for the block size: HDFS blocks are typically 64 MB or 128 MB large, and are thus considerably larger than blocks on a local hard drive (around 4 kB). This is because of the prerequisites of HDFS:
It is not optimized for random access
Blocks will be shipped over the network
Thus, blocks of 128 MB are large enough that time is not lost in latency waiting for a block to arrive, i.e., access to the HDFS cluster will be throughput-bound during a full-scan analysis with MapReduce or Spark.
At the same time, they are small enough for a large file to be conveniently spread over many machines, allowing for parallel access, and also small enough for a block to be sent again without too much overhead in case of a network issue.
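As a quick worked example (with an arbitrary file size): a 700 MB file split into 128 MB blocks yields 6 blocks, namely 5 full blocks of 128 MB and a last block of 700 − 5 × 128 = 60 MB, and each of these blocks can be stored, replicated, and processed on a different machine.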
Physical Architecture¶
HDFS is designed to run on a cluster of machines. The number of machines can range from just a handful to thousands of machines. How are they connected?
One way to connect machines is a peer-to-peer, decentralized network. In a peer-to-peer network, each machine can talk to any other machine. This architecture, for example, is used in the Bitcoin blockchain and was also popular in the 1990s with the Napster network.
By contrast, HDFS is implemented on a fully centralized architecture, in which one node is special and all the others are interchangeable and connected to it.
The terminology used to describe the nodes in a centralized network architecture varies in the literature. In the case of HDFS, the central node is called the "NameNode" and the other nodes are called the "DataNodes". In fact, more precisely, the NameNode and DataNodes are processes running on these nodes, and the CamelCase notation often used to write their names down corresponds to the Java classes implementing these processes. By metonymy, we also use these names to describe the nodes themselves.
Now that we have the NameNode and the DataNodes, how is the logical HDFS model implemented on top of this architecture?
As we saw before, every file is divided into chunks called blocks. All blocks have a size of exactly 128 MB, except the last one, which is usually smaller.
Each of the blocks is then replicated and stored on several DataNodes. How many times? This is a parameter called the replication factor. By default, it is 3, but this can be changed by the user.
There is no such thing as a primary replica: all replicas are on an equal footing. In other words, by default, three replicas of each block are stored. Overall, in a production instance, there is a very high number of block replicas, and they are spread more or less evenly across the nodes, to have a certain balance.
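For illustration (the path and the new factor are arbitrary), the replication factor of a specific file can be changed with the HDFS Java API; the command line offers an equivalent -setrep command:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the default file system
        // Ask for 2 replicas instead of the default 3 for this particular file.
        fs.setReplication(new Path("/user/hadoop/dir/file.json"), (short) 2);
        fs.close();
    }
}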
An HDFS cluster is then accessed concurrently by many clients. Typically, the cluster is owned by one company or organization and centrally managed by a specialized team. The clients correspond to users within this organization who read from and write to the HDFS cluster.
The responsibilities of the NameNode¶
The NameNode is responsible for the system-wide activity of the HDFS cluster. It stores in particular three things:
The file namespace: the hierarchy of directory names and file names, as well as any access control (ACL) information similar to Unix-based systems.
A mapping from each file to the list of its blocks: each block, in this list, is represented with a 64-bit identifier; the content of the blocks is not on the NameNode.
A mapping from each block, represented with its 64-bit identifier, to the locations of its replicas, that is, the list of the DataNodes that store a copy of this block.
The NameNode updates this information whenever a client connects to it in order to update files and directories, as well as with the regular reports it receives from the DataNodes.
Clients connect to the NameNode via the Client Protocol. Clients can perform metadata operations such as creating or deleting a directory, but can also ask to delete a file, read a file, or write a new file. In the latter case, the NameNode will send back to the client block identifiers (for reading them), or lists of DataNode locations (for reading and writing them).
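To make the three pieces of metadata above more concrete, here is a minimal sketch in Java of how they can be pictured (this is our own simplification with made-up block identifiers and node names, not the actual NameNode classes):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NameNodeMetadataSketch {
    // 1. The file namespace: the hierarchy of directories and files (here just a set of paths).
    Set<String> namespace =
        Set.of("/user", "/user/hadoop", "/user/hadoop/file.json");

    // 2. A mapping from each file to the ordered list of its 64-bit block identifiers
    //    (the block content itself is never stored on the NameNode).
    Map<String, List<Long>> fileToBlocks =
        Map.of("/user/hadoop/file.json", List.of(1073741825L, 1073741826L));

    // 3. A mapping from each block identifier to the DataNodes holding a replica of it.
    Map<Long, List<String>> blockToDataNodes = new HashMap<>(Map.of(
        1073741825L, List.of("datanode-1", "datanode-4", "datanode-7"),
        1073741826L, List.of("datanode-2", "datanode-4", "datanode-9")));
}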
The responsibilities of the DataNodes¶
The DataNodes store the blocks themselves. These blocks are stored on their local disks. Each block is stored as a 128 MB local, physical file on the DataNode. If the block is the last block of an HDFS file and thus less than 128 MB, then the physical file has exactly the size of the block: there is no waste of space. This is different from the physical blocks (4 kB or so) of a local hard drive, which have a constant size.
DataNodes send regular heartbeats to the NameNode. The frequency of these heartbeats is configurable and is by default a few seconds. This is a way to let the NameNode know that everything is alright. The heartbeat may also contain a notification to the NameNode that a new block was received and successfully stored, and that the DataNode can thus be used as a location for this block. Finally, the DataNode also sends, every couple of hours, a full report including all the blocks that it contains.
If there is an issue with the local disk or the node, then the DataNode can report the block as corrupted to the NameNode, which will then ensure, asynchronously, that it is replicated somewhere else. "Asynchronously," as opposed to "synchronously," means that this is done in the background at some later time and the DataNode does not idly wait for this to happen. In the meantime, the block is simply marked as underreplicated.
A NameNode never initiates a connection to a DataNode. If the NameNode needs anything from a DataNode, for example, if it needs to request a DataNode to download an underreplicated block from another DataNode and store a new replica of it, then the NameNode will wait until the next heartbeat and respond to it with this request.
Finally, DataNodes are also capable of communicating with each other by forming replication pipelines. A pipeline happens whenever a new HDFS file is created. The client does not send a copy of the block to all the destination DataNodes, but only to the first one. This first DataNode is then responsible for creating the pipeline and propagating the block to its counterparts.
When a replication pipeline is ongoing and a new block is being written to the cluster, the content of the block is not sent in one single 128 MB packet. Rather, it is sent in smaller packets (64 kB for example) in a streaming fashion via a network protocol. That way, if some packets are missing, the client can send them again. The client receives the acknowledgements in a streaming fashion as well.
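With the numbers above, a full 128 MB block sent in 64 kB packets corresponds to 128 × 1024 / 64 = 2048 packets, each of which can be acknowledged, and resent individually if it gets lost.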
File system functionality¶
HDFS exposes to the client, via the NameNode, an API that allows typical operations available on a file system: creating a directory, deleting a directory, listing the contents of a directory, creating a file, appending to a file, reading a file, deleting a file.
Many of these involve the NameNode only. However, reading, writing, and deleting a file involve communication across the cluster, which we now detail.
Reading a file¶
Let us start with reading a file. The client first connects to the NameNode to initiate the read and request info on the file.
Then, the NameNode responds with the list of all blocks, as well as, for each block, a list of DataNodes that contain a replica of it. The DataNodes are furthermore sorted by increasing distance from the client (which is typically itself one of the nodes in the cluster).
The client then connects to the DataNodes in order to download a copy of the blocks. It starts with the first (closest) DataNode in the list provided by the NameNode for each block, and will go down the list if a DataNode cannot be reached or is not available. In the simple case that the client wants to stream its way through an HDFS file, bit by bit, it will download each block in turn.
However, this functionality is typically nicely encapsulated in additional, client-side Java libraries that expose this as a "single big input stream" with the InputStream classes familiar to Java programmers. Switching between the DataNodes is hidden and the user just sees a stream of bits flowing through.
Note that with streaming, it is possible to process files larger than the working memory, because older blocks can be thrown away from the memory of the client once processed. Alternatively, the client can also download a multi-block HDFS file and store it as a single big file on its local drive as long as it fits.
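As an illustration of this encapsulation, here is a minimal sketch using the HDFS Java API (the path is just an example): FileSystem.open() returns a single input stream over the whole file, and the switching between blocks and DataNodes happens behind the scenes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One logical input stream over the whole (possibly multi-block) file.
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/dir/file.json"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // process the stream line by line
            }
        }
        fs.close();
    }
}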
Writing a file¶
Let us now write a new file to HDFS. This can be either a simple upload of a large local file, or it can be from a stream of bits created on the fly by a program. The client first connects to the NameNode (just like for reading), formulating its intent to create a new file.
The NameNode then responds with a list of DataNodes to which the content of the first block should be sent. Note that, at that point, the file is not yet guaranteed to be available for read for other clients, and it is locked in such a way that nobody else can write to it at the same time.
The client then connects to the first DataNode and instructs it to organize a pipeline with the other DataNodes provided by the NameNode for this block.
The client then starts sending through the content of that block, as we explained earlier. The content will be pipelined all the way to the other DataNodes.
The client receives regular acknowledgements from the first DataNode that the bits have been received.
When all bits have been acknowledged, the client connects to the NameNode in order to move over to the second block. Then, the same steps (2, 3, 4, 5) as before are repeated for each block.
Once the last block has been written, the client informs the NameNode that the file is complete, so that the lock can be released.
The NameNode then checks for minimal replication through the DataNode protocol.
Finally, the NameNode gives its final acknowledgement to the client.
From now on, and separately (this is called "asynchronous"), the NameNode will continue to monitor the replicas and trigger copies from DataNode to DataNode whenever necessary, that is, when a block is underreplicated.
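Here is the corresponding minimal sketch for writing, again with the HDFS Java API and an example path: create() initiates the file on the NameNode, the bytes written to the stream are sent to the DataNode pipeline, and closing the stream completes the file.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() contacts the NameNode; the bytes are then streamed to the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/dir/output.json"))) {
            out.write("{\"hello\": \"world\"}\n".getBytes(StandardCharsets.UTF_8));
        } // closing the stream completes the file
        fs.close();
    }
}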
Replication Strategy¶
By default, three replicas of each block are stored, although this can be changed by users for every file. Note that there is no such thing as a "main replica", there are just several replicas.
The placement strategy of replicas is based on knowledge of the physical setup of the cluster. Recall that the servers, called nodes, are organized in racks, and several racks are placed in the same room or data center. This gives a natural tree topology.
With this topology in mind, one can define a notion of distance between two nodes: two nodes in the same rack have a distance of 2:
one edge from the first node to the rack
one edge from the rack to the other node
Two nodes in different racks have a distance of 4:
one edge from the first node to the first rack
one from the first rack to the data center
one from the data center to the other rack
one from the other rack to the other node
It is this distance that can then be used by the NameNode to sort DataNodes by increasing distance from the client.
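The distance computation itself is simple tree arithmetic. Below is a minimal sketch (with a simplified, made-up location naming scheme of the form /datacenter/rack/node; the real topology configuration and class names differ):

public class TopologyDistanceSketch {
    // Distance = number of edges from each node up to their lowest common ancestor.
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < Math.min(pa.length, pb.length) && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        // Same rack: distance 2; different racks in the same data center: distance 4.
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack1/node2")); // 2
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack2/node3")); // 4
    }
}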
Now, the strategy for placing replicas of a new block is as follows. First, it is important to understand that, in practice, the client that is writing the file is a process running on one of the nodes of the HDFS cluster. This will become obvious when we study massive parallel computing (MapReduce and Spark), where reading and writing are done in a distributed fashion over the same cluster. But even without parallel computing, when users create clusters in public clouds such as AWS, Azure, or Google Cloud, they connect to the machines using SSH (a secure protocol for remotely accessing a machine). Thus, anything they do will originate from the machine they are connected to, which is in the cluster.
Having this in mind, the first replica of the block, by default, gets written to the same machine that the client is running on - keep in mind that this machine is typically a DataNode.
The second replica is written to a DataNode sitting in a different rack than the client, which we call rack B. The third replica is written to another DataNode on the same rack B. Further replicas are written mostly at random, but respecting two simple rules for resilience:
at most one replica per node
at most two replicas per rack
Below is an example with the three replicas of two different blocks, written by the same client.
Fault Tolerance and Availability¶
HDFS has a single point of failure: the NameNode. If the metadata stored on it is lost, then all the data on the cluster is lost, because it is not possible to reassemble the blocks into files any more.
For this reason, the metadata is backed up. More precisely, the file namespace containing the directory and file hierarchy as well as the mapping from files to block IDs is backed up to a so-called snapshot. Note that the mapping of block IDs to DataNodes does not require a backup, as it can be recovered from the periodic block reports.
Since the HDFS system is constantly updated, it would not be viable to do a backup upon each update. It would also not be viable to do backups less often, as this could lead to data loss. Thus, what is done is that updates to the file system arriving after the snapshot has been made are instead stored in a journal, called the edit log, that lists the updates by time of arrival.
The snapshot and edit log are stored either locally or on a network-attached drive (not HDFS itself). For more resilience, they can also be copied over to more backup locations.
If the NameNode crashes, it can be restarted, and the snapshot can be loaded back into memory to recover the file namespace and the mapping from files to block IDs.
Then the edit log can be replayed in order to apply the latest changes.
And the NameNode can wait for (or trigger) block reports to rebuild the mapping from block IDs to the DataNodes that have a replica of them.
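The recovery logic can be pictured as follows (a conceptual sketch with made-up entries, not the actual NameNode code): load the snapshot, replay the edit log in order of arrival, and leave the block locations to be rebuilt from the block reports.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecoverySketch {
    public static void main(String[] args) {
        // 1. Load the last snapshot of the namespace (here: a hard-coded stand-in).
        Map<String, List<Long>> fileToBlocks = new HashMap<>();
        fileToBlocks.put("/user/hadoop/old.json", List.of(1L, 2L));

        // 2. Replay the edit log: the updates that arrived after the snapshot, in order.
        List<String[]> editLog = List.of(
            new String[]{"CREATE", "/user/hadoop/new.json", "3"},
            new String[]{"DELETE", "/user/hadoop/old.json", ""});
        for (String[] entry : editLog) {
            if (entry[0].equals("CREATE")) {
                fileToBlocks.put(entry[1], List.of(Long.parseLong(entry[2])));
            } else if (entry[0].equals("DELETE")) {
                fileToBlocks.remove(entry[1]);
            }
        }

        // 3. The block-to-DataNode mapping is not restored here: it is rebuilt from block reports.
        System.out.println(fileToBlocks);
    }
}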
However, as more and more updates are applied, the edit log grows. This can easily lead to a long delay of 30+ minutes to restart the NameNode after a crash. More strategies have to be put in place. This was done incrementally in each HDFS release, and there are many different kinds of additional NameNodes that came into place and succeeded one another: the Checkpoint NameNode, the Backup NameNode, the Secondary NameNode, the Standby NameNode, etc.
First, the edit log is periodically merged back into a new, larger snapshot, and the edit log is then reset to empty. This is called a checkpoint. This can be done with a "phantom NameNode" (our terminology) that keeps the exact same structures in its memory as the real NameNode, and performs checkpoints periodically.
Second, it is possible to configure a phantom NameNode to be able to instantly take over from the NameNode in case of a crash. This considerably shortens the amount of time to recover.
As of the time of writing, these are called Standby NameNodes, but we do warn that the HDFS architecture will continue to evolve.
Another change to the architecture was made later: Federated NameNodes. In this change, several NameNodes can run at the same time, and each NameNode is responsible for specific (non-overlapping) directories within the hierarchy. This spreads the management workload over multiple nodes.
Using HDFS¶
One of the ways to access HDFS is via the command line (a shell). The HDFS command line interface is mostly POSIX compliant, which means that the names of the commands are quite similar to those found on Linux or macOS.
HDFS commands start with:
hdfs dfs <put command here>
However, we recommend using the hadoop command instead, because it can also connect to other file systems (S3, local drive, etc.):
hadoop fs <put command here>
You can list the contents of the current directory with:
hadoop fs -ls
You can print to screen the contents of a file with:
hadoop fs -cat /user/hadoop/dir/file.json
You can delete a file with:
hadoop fs -rm /user/hadoop/dir/file.json
You can create a directory with:
hadoop fs -mkdir /user/hadoop/dir
You can upload files from your local machine to HDFS with:
hadoop fs -copyFromLocal localfile1 localfile2 /user/hadoop/targetdirectory
You can download a file from HDFS to your local computer with:
hadoop fs -copyToLocal /user/hadoop/file.json localfile.json
Paths and URIs¶
Although in the previous examples we used relative paths, generally, paths are URIs, as we have seen before. The scheme determines the file system.
This is what an HDFS URI looks like:
hdfs://www.example.com:8021/user/hadoop/file.json
This is what an S3 URI looks like (there are also the s3a and s3n schemes):
s3://my-bucket/directory/file.json
In Azure blob storage:
wasb://[email protected]/directory/file.json
And on the local file system:
file:///home/user/file.json
Typically, a default file system can be defined in a configuration file called core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs.mycluster.example.com:8020</value>
    <description>NameNode hostname</description>
  </property>
</configuration>
Then, the scheme is not needed to access files on this system, and an absolute path starting with a slash can be used instead:
/user/hadoop/file.json
Note that other file systems can still be accessed (if set up correctly) by using a URI scheme.
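For example (reusing the hypothetical NameNode host from the configuration above), the Java API lets you obtain the default file system or any other file system explicitly by its URI scheme:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default file system (whatever fs.defaultFS points to): absolute path, no scheme needed.
        FileSystem defaultFs = FileSystem.get(conf);
        System.out.println(defaultFs.exists(new Path("/user/hadoop/file.json")));

        // Another file system, selected explicitly via its URI scheme.
        FileSystem hdfs = FileSystem.get(
            URI.create("hdfs://hdfs.mycluster.example.com:8020/"), conf);
        System.out.println(hdfs.exists(new Path("/user/hadoop/file.json")));
    }
}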
As for relative paths, they are resolved against the working directory, which should be unsurprising to anybody familiar with the command line interface even outside the context of HDFS.
Please mind that the current working directory might be on a different file system than the default file system; then you have situations in which a relative path will be local, while an absolute path will be on cloud or distributed storage. This can cause unexpected issues. A common mistake is for users to use a relative path within a massive parallel computing framework (MapReduce, Apache Spark) when the working directory is local, which causes the data to be read from, or written to, the local disks rather than the HDFS cluster. If the job succeeds but the output is nowhere to be seen, it is likely that it was just split all over the local disks of the cluster. You can play treasure hunt and go after each machine to download the results, or you might realize that it is easier to just rerun your job with the correct paths.
Logging and Importing Data¶
Two tools are worth mentioning:
Apache Flume lets you collect, aggregate, and move log data to HDFS.
Apache Sqoop lets you import data from a relational database management system to HDFS.