263-3010-00: Big Data
Section 9
Resource Management
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 11/11/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny's course. Please contact us to have this file deleted if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
The first version of MapReduce is inefficient in several respects. For this reason, the architecture was fundamentally changed by adding a resource management layer to the stack, adding one more level of decoupling between scheduling and monitoring. A resource management system, here YARN, is a very important building block not only for a better MapReduce, but also for many other technologies running on a cluster (generalized parallel processing, ex: with Apache Spark, which we will cover later, the Message-Passing Interface, graph computing, etc.).
Limitations of MapReduce in its first version¶
Recall that the first version of MapReduce is based on a centralized architecture with a JobTracker, the coordinator node, and TaskTrackers, the worker nodes.
The JobTracker has a lot on its shoulders! It has to deal with resource management, scheduling, monitoring, the job lifecycle, and fault tolerance.
The first consequence of this is scalability: things start breaking beyond 4,000 nodes and/or 40,000 tasks.
The second consequence is the bottleneck that this introduces at the JobTracker level, which slows down the system.
The third issue is that it is difficult to design a system that does many things well: "Jack of all trades, master of none".
The fourth issue is that resources are statically allocated to the Map or the Reduce phase, meaning that parts of the cluster remain idle during both phases.
The fifth issue is the lack of fungibility between the Map phase and the Reduce phase: the system is closely tied to the two-phase mechanism of MapReduce, in spite of these two phases having a lot in common in terms of parallel execution.
YARN¶
General architecture¶
YARN stands for Yet Another Resource Negotiator. It was introduced as an additional layer that specifically handles the management of CPU and memory resources in the cluster.
YARN, unsurprisingly, is based on a centralized architecture in which the coordinator node is called the ResourceManager, and the worker nodes are called NodeManagers. NodeManagers furthermore provide slots (equipped with exclusively allocated CPU and memory) known as containers.
YARN provides generic support for allocating resources to any application and is application-agnostic. When the user launches a new application, the ResourceManager assigns one of the containers to act as the ApplicationMaster, which will take care of running the application (ex: a MapReduce job). The ApplicationMaster can then communicate with the ResourceManager in order to book and use more containers in which to run its jobs.
Thus, YARN cleanly separates the general management of resources and the bootstrapping of new applications, which remain centralized on the coordinator node, from the monitoring of the job lifecycle, which is now delegated to one or more ApplicationMasters running concurrently. This means, in particular, that several applications can run concurrently in the same cluster. This ability, known as multi-tenancy, is very important for large companies or universities in order to optimize the use of their resources. In fact, contrary to popular belief, this ecosystem was not meant to optimize performance, but to optimize resource use and costs. Technologies like MapReduce or Apache Spark are often described as slow, but this is a misunderstanding of the fact that these technologies were never designed for a single user running just one job on their own cluster. When there is only one user, this user typically has more control over the cluster and can optimize things in better ways than they can in a shared environment (an example of that is High-Performance Computing).
Resource management¶
In resource management, one abstracts away from hardware by distinguishing between four specific resources used in a distributed database system. These four resources are:
Memory
CPU
Disk I/O
Network I/O
Most resource management systems (including YARN) today focus mainly on allocating and sharing memory and CPU, but there is also a lot of ongoing work on allocating and sharing disk and network I/O.
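As a rough illustration of this abstraction, a container specification can be thought of as a small resource vector. The sketch below is a toy Python model (the names are illustrative, not the actual YARN API); only the memory and CPU dimensions are actually scheduled by YARN today.

```python
from dataclasses import dataclass

# Hypothetical resource vector covering the four dimensions named above.
# In practice YARN only schedules on the first two (memory and vCPU);
# disk and network I/O are shown for completeness only.
@dataclass(frozen=True)
class ResourceVector:
    memory_mb: int
    vcores: int
    disk_io_mbps: int = 0     # not enforced by YARN today
    network_io_mbps: int = 0  # not enforced by YARN today

# e.g. one container worth 2 cores and 16 GB of RAM
container_spec = ResourceVector(memory_mb=16 * 1024, vcores=2)
print(container_spec)
```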
ApplicationMasters can request and release containers at any time, dynamically. A container request is typically made by the ApplicationMaster with a specific demand (ex: "10 containers with 2 cores and 16 GB of RAM each"). If the request is granted by the ResourceManager, fully or partially, this is done indirectly by signing and issuing a container token to the ApplicationMaster that acts as proof that the resource was granted.
The ApplicationMaster can then connect to the allocated NodeManager and send the token. The NodeManager will then check the validity of the token and provide the memory and CPU granted by the ResourceManager. The ApplicationMaster ships the code (ex: as a jar file) as well as parameters, which then runs as a process with exclusive use of this memory and CPU.
To bootstrap a new application, the ResourceManager can also issue application tokens to external clients so they can start the ApplicationMaster.
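The request-grant-launch flow can be illustrated with a toy simulation. This is not the real YARN protocol or API; the classes, the shared secret, and the token format are simplified assumptions meant only to show the idea of a signed container token that the NodeManager verifies before launching anything.

```python
import hmac, hashlib, json

SECRET = b"rm-shared-secret"  # assumed shared between ResourceManager and NodeManagers

class ResourceManager:
    """Toy ResourceManager: grants container requests and signs container tokens."""
    def __init__(self, free_containers):
        self.free = free_containers

    def request_containers(self, app_id, count, memory_mb, vcores):
        granted = min(count, self.free)            # full or partial grant
        self.free -= granted
        tokens = []
        for i in range(granted):
            claim = json.dumps({"app": app_id, "mem": memory_mb, "vcores": vcores, "n": i})
            sig = hmac.new(SECRET, claim.encode(), hashlib.sha256).hexdigest()
            tokens.append((claim, sig))            # token = claim + signature
        return tokens

class NodeManager:
    """Toy NodeManager: verifies the token before starting the container."""
    def launch(self, claim, sig, code):
        expected = hmac.new(SECRET, claim.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            raise PermissionError("invalid container token")
        print(f"launching {code} in container {json.loads(claim)}")

rm = ResourceManager(free_containers=8)
tokens = rm.request_containers("app-42", count=10, memory_mb=16 * 1024, vcores=2)
print(f"granted {len(tokens)} of 10 containers")   # partial grant
NodeManager().launch(*tokens[0], code="job.jar")
```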
Job lifecycle management and fault tolerance¶
Version 2 of MapReduce works on top of YARN by leaving the job lifecycle management to an ApplicationMaster. The ApplicationMaster requests containers for the Map phase, and sets these containers up to execute Map tasks. As soon as a container is done executing a Map task, the ApplicationMaster will assign a new Map task to this container from the remaining queue, until no Map tasks are left.
Note that there is a subtle distinction between a slot and a container. Indeed, containers that have several cores can run several Map tasks concurrently (sharing the same memory space). In other words, a container in the Map phase can contain several Map slots. Sharing memory and containers across slots in this way improves the overall efficiency, because setting up a container adds latency. It is thus more efficient to allocate 10 containers with 4 cores each than 40 containers with 1 core each. Likewise, sharing memory does not mean that the threads executing the Map tasks within the same container communicate or are entangled in any way (they each live their own life), but rather that differences in memory usage across tasks can be spread out better.
When the end of the Map phase approaches, the ApplicationMaster then starts allocating containers for the Reduce phase and initiates shuffling. When the Map phase is complete, these new slots can then start processing Reduce tasks and the Map containers can be released. When the output has been written, the Reduce containers are also freed and returned to YARN for other users to book.
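The following sketch mimics, under simplified assumptions, how an ApplicationMaster could hand out Map tasks to a fixed pool of containers, each hosting several slots (e.g. 2 containers with 4 slots each rather than 8 single-slot containers). It is a toy loop, not MapReduce v2 code, and task completion is simply simulated.

```python
from collections import deque

def run_map_phase(num_map_tasks, num_containers, slots_per_container):
    """Toy ApplicationMaster loop: hand out Map tasks to free slots until the queue is empty."""
    pending = deque(range(num_map_tasks))
    # each slot is identified by (container, slot-within-container)
    free_slots = deque((c, s) for c in range(num_containers) for s in range(slots_per_container))
    running = {}

    while pending or running:
        # assign as many pending Map tasks as there are free slots
        while pending and free_slots:
            slot = free_slots.popleft()
            running[slot] = pending.popleft()
        # pretend the oldest running task finishes, freeing its slot for the next task
        slot, task = next(iter(running.items()))
        del running[slot]
        free_slots.append(slot)
        print(f"map task {task} finished on container {slot[0]}, slot {slot[1]}")

run_map_phase(num_map_tasks=12, num_containers=2, slots_per_container=4)
```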
In the event of a container crashing during the Map phase, the ApplicationMaster will handle this by re-requesting containers and restarting the failed tasks. If some data is lost in the Reduce phase, it is possible that the entire job must be restarted, because the only way to recreate the lost intermediate data is to re-execute the Map tasks.
Finally, version 2 of MapReduce supports 10,000 nodes and 100,000 tasks, which is an improvement on version 1.
Scheduling¶
The ResourceManager decides whether and when to grant resource requests based on several factors: capacity guarantees, fairness, service level agreements (remember the numbers with plenty of 9s?), and the goal of maximizing cluster utilization. It provides an interface to clients (users) and maintains a queue of applications with their status, as well as statistics. It also provides an admin interface to configure the queue.
The ResourceManager keeps track of the list of available NodeManagers (who can dynamically come and go) and their status. Just like in HDFS, NodeManagers send periodic heartbeats to the ResourceManager to give a sign of life.
The ResourceManager also offers an interface so that ApplicationMasters can register and send container requests. ApplicationMasters also send periodic heartbeats to the ResourceManager.
It is important to understand that, unlike the JobTracker, the ResourceManager does not monitor tasks, and will not restart slots upon failure. This job is left to the ApplicationMasters.
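A minimal sketch of heartbeat-based liveness tracking, assuming a simple timeout rule (the class and the 10-second threshold are illustrative, not YARN's actual mechanism):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # assumed: seconds of silence before a node is considered lost

class HeartbeatTracker:
    """Toy liveness tracking: NodeManagers (or ApplicationMasters) report in periodically,
    and the ResourceManager drops the ones that go silent."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def live_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t <= HEARTBEAT_TIMEOUT]

tracker = HeartbeatTracker()
tracker.heartbeat("nodemanager-1")
tracker.heartbeat("nodemanager-2")
print(tracker.live_nodes())  # both nodes reported recently, so both are live
```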
Capacity scheduling¶
In capacity scheduling, the resources of the cluster are partitioned into several sub-clusters of various sizes. These can correspond, for example, to subdivisions of a company or university: the CS department, the Mathematics department, etc.
Each one of these sub-clusters has its own queue of applications running in a FIFO fashion within this queue.
It is also possible to have more hierarchical levels: sub-sub-clusters. This is called a hierarchical queue. Applications then run with FIFO scheduling on the leaves of the hierarchical queue, which shows that scheduling strategies can mix with each other in a recursive fashion.
Capacity scheduling also exists in a more dynamic flavour in which, when a sub-cluster is not currently used, its resources can be temporarily lent to the other sub-clusters. This is also in the spirit of usage maximization, so that the company as a whole will not waste unused resources.
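The hierarchical queue with FIFO leaves can be sketched as a small tree. This toy model (class names, paths, and capacities are made up for illustration) only shows how applications are routed to a leaf and dequeued in FIFO order; the capacity fractions are recorded but not enforced here.

```python
from collections import deque

class Queue:
    """Toy hierarchical capacity queue: inner nodes split capacity among children,
    leaves run their applications in FIFO order."""
    def __init__(self, name, capacity, children=None):
        self.name = name
        self.capacity = capacity          # fraction of the parent's resources (bookkeeping only)
        self.children = children or []
        self.apps = deque()               # only used on leaves

    def submit(self, path, app):
        if not path:                      # reached the leaf: enqueue the application
            self.apps.append(app)
            return
        next(c for c in self.children if c.name == path[0]).submit(path[1:], app)

    def next_app(self):
        """Return the next application to run from this leaf (FIFO)."""
        return self.apps.popleft() if self.apps else None

# e.g. a cluster split between CS (60%) and Math (40%), with CS split again
root = Queue("root", 1.0, [
    Queue("cs", 0.6, [Queue("systems", 0.5), Queue("ml", 0.5)]),
    Queue("math", 0.4),
])
root.submit(["cs", "systems"], "job-1")
root.submit(["cs", "systems"], "job-2")
leaf = root.children[0].children[0]
print(leaf.next_app(), leaf.next_app())   # job-1 job-2  (FIFO within the leaf)
```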
Fair scheduling¶
Fair scheduling involves more complex algorithms that attempt to allocate resources in a way that is fair to all users of the cluster, based on the share they are normally entitled to. Fair scheduling also typically involves economic and game-theoretic thinking, which is necessary to ensure smooth cluster management with a large number of users within the same company.
Fair scheduling should be understood in a dynamic fashion: the cluster has, at any point in time, users from various departments running their applications. Applications are dynamically and regularly requesting many containers with specific memory and CPU requirements, and releasing them again. Thus, fair scheduling consists in making dynamic decisions regarding which requests get granted and which requests have to wait.
Fair scheduling combines several ways to compute cluster shares:
Steady fair share: this is the share of the cluster officially allocated to each department. The various departments agree upon this with each other in advance (e.g., based on their financial contribution to the maintenance of the cluster). This number is thus static and rarely changes. It is the theoretical share that each department would normally get if all departments constantly use the cluster.
Instantaneous fair share: this is the fair share that a department should ideally be allocated (according to economic and game-theoretic considerations) at any point in time. This is a dynamic number that changes constantly, based on which departments are idle: if a department is idle, then the instantaneous fair share of the other departments becomes higher than their steady fair shares.
Current share: this is the actual share of the cluster that a department effectively uses at any point in time. This is highly dynamic. The current share does not necessarily match the instantaneous fair share because there is some inertia in the process: a department might be using more resources while another is idle. When the other department later stops being idle, these resources are not immediately withdrawn from the first department; rather, the first department will stop getting more resources, and the second department will gradually recover these resources as they get released by the first department. This behavior can be changed with preemption and overrides with priority levels, in which case the resources are forcibly and immediately taken back from the first department (which typically leads to failed tasks or even failed jobs) and handed over to the second department.
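As a toy illustration of how the instantaneous fair share can diverge from the steady fair share when a department is idle, the sketch below simply redistributes the idle departments' steady shares proportionally among the active ones (one possible convention for the purposes of illustration, not necessarily the exact formula used by YARN's fair scheduler).

```python
def instantaneous_fair_shares(steady_shares, idle):
    """Toy computation: redistribute the steady shares of idle departments
    proportionally among the active ones."""
    active = {d: s for d, s in steady_shares.items() if d not in idle}
    total_active = sum(active.values())
    return {d: (s / total_active if d in active else 0.0)
            for d, s in steady_shares.items()}

steady = {"cs": 0.5, "math": 0.3, "physics": 0.2}
print(instantaneous_fair_shares(steady, idle={"physics"}))
# cs and math absorb physics' idle 20% pro rata: cs ≈ 0.625, math ≈ 0.375
```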
The easiest case of fair scheduling is when only one resource is considered: for example, only CPU cores, or only memory. In this case, the algorithm is relatively simple: the requests by users who are significantly below their instantaneous fair share get prioritized over those who are above (or closer to) their instantaneous fair share.
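A minimal sketch of this single-resource rule, assuming made-up share numbers: among the users with a pending request, the one furthest below its instantaneous fair share is served first.

```python
def next_request_to_grant(requests, current_share, fair_share):
    """Toy single-resource fair scheduler: grant the pending request of the user
    whose current share is furthest below its instantaneous fair share."""
    candidates = [u for u in requests if requests[u] > 0]
    return min(candidates, key=lambda u: current_share[u] - fair_share[u])

pending = {"cs": 3, "math": 1}          # pending container requests per user
current = {"cs": 0.55, "math": 0.10}    # current shares of the cluster
fair = {"cs": 0.625, "math": 0.375}     # instantaneous fair shares
print(next_request_to_grant(pending, current, fair))   # math (furthest below its fair share)
```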
Things become more complicated when several resources are considered, in practice both CPU cores and memory. This is because one application may request many containers with, say, one core and 100 GB of memory each, while another might request many containers with 4 cores and 10 GB of memory each. How can we, then, define what the current share is, when the containers are parameterized along several dimensions?
This problem was solved game-theoretically with the Dominant Resource Fairness algorithm. The two (or more) dimensions are projected back to a single dimension by looking at the dominant resource for each user. For example, imagine the total size of the cluster is 1000 cores and 10 TB of memory. Then, a container with one core and 100 GB of memory would use 0.1% of the cluster cores and 1% of the cluster memory. Its dominant resource is memory, and it is thus the 1% that is considered. Another container with 4 cores and 10 GB of memory would use 0.4% of the CPU cores and 0.1% of the memory. Then, CPU cores are the dominant resource, and it is 0.4% that is considered. Let us take the example where both users have an instantaneous fair share of 50% each. In this case, the fair ratio of the containers should be 2 to 5: each time the first user gets 2 containers of one core and 100 GB of memory, the other user gets 5 containers of 4 cores and 10 GB of memory. That way, both have the same dominant resource usage: they each get 2% of the cluster every time the first user gets 2 containers (2 times 1%) and the second user gets 5 containers (5 times 0.4%). The algorithm grants requests in order to get as close as possible to this fair ratio in cruise mode.
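The worked example above can be reproduced with a short sketch: the dominant share of each container type is computed against the cluster totals, and a toy greedy loop (assuming equal instantaneous fair shares) keeps granting a container to whichever user currently has the lowest accumulated dominant share, which converges to the 2-to-5 ratio. This is a simplified illustration of the idea, not the production scheduler.

```python
def dominant_share(container, cluster):
    """Fraction of the cluster that one container uses along its dominant resource."""
    return max(container[r] / cluster[r] for r in cluster)

cluster = {"vcores": 1000, "memory_gb": 10_000}   # 1000 cores, 10 TB of memory

user_a = {"vcores": 1, "memory_gb": 100}          # dominant resource: memory (1%)
user_b = {"vcores": 4, "memory_gb": 10}           # dominant resource: CPU (0.4%)

print(dominant_share(user_a, cluster))            # 0.01
print(dominant_share(user_b, cluster))            # 0.004

def drf_schedule(cluster, demands, rounds):
    """Toy DRF loop: repeatedly grant one container to the user with the lowest
    accumulated dominant share (assuming equal instantaneous fair shares)."""
    used = {u: 0.0 for u in demands}
    grants = {u: 0 for u in demands}
    for _ in range(rounds):
        u = min(used, key=used.get)
        grants[u] += 1
        used[u] += dominant_share(demands[u], cluster)
    return grants

print(drf_schedule(cluster, {"A": user_a, "B": user_b}, rounds=70))
# roughly a 2 : 5 ratio of grants between the two users,
# so each ends up with about the same dominant resource usage
```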