263-3010-00: Big Data
Section 10
Generic Dataflow Management
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 11/12/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny's course material. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
MapReduce is very simple and generic, but many more complex use cases involve not just one, but a sequence of several MapReduce jobs. Furthermore, the MapReduce API is low-level, and most users need higher-level interfaces, either in the form of APIs or query languages.
This is why, after MapReduce, another generation of distributed processing technologies was invented. The most popular one is the open-source Apache Spark.
A more general dataflow model¶
MapReduce consists of a map phase, followed by shuffling, followed by a reduce phase. In fact, the map phase and the reduce phase are not so different: both involve the computation of tasks in parallel on slots.
One could exaggerate a bit and view MapReduce as a map phase, then a shuffle, then another map phase (the reduce phase, acting on the already shuffled data):
Or, if we abstract away the partitions and consider the input, intermediate input and output as black boxes that these phases act on:
Resilient distributed datasets¶
The first idea behind generic dataflow processing is to allow the dataflow to be arranged in any directed acyclic graph (DAG), like so:
All the rectangular nodes in the above graph correspond to intermediate data. They are called resilient distributed datasets, or in short, RDDs. Resilient means that they remain in memory or on disk on a best effort basis, and can be recomputed if need be. Distributed means that, just like the collections of key-value pairs in MapReduce, they are partitioned and spread over multiple machines.
A major difference with MapReduce, though, is that RDDs need not be collections of pairs. In fact, RDDs can be (ordered) collections of just about anything: strings, dates, integers, objects, arrays, arbitrary JSON values, XML nodes, etc. The only constraint is that the values within the same RDD share the same static type, which does not exclude the use of polymorphism.
Since a key-value pair is a particular example of possible value, RDDs are a generalization of the MapReduce model for input, intermediate input and output.
The RDD lifecycle¶
Creation¶
RDDs undergo a lifecycle. First, they get created. RDDs can be created by reading a dataset from the local disk, or from cloud storage, or from a distributed file system, or from a database source, or directly on the fly from a list of values residing in the memory of the client using Apache Spark:
Transformation¶
Then, RDDs can be transformed into other RDDs.
Mapping or reducing, in this model, become two very specific cases of transformations. However, Spark allows for many, many more kinds of transformations. This also includes transformations with several RDDs as input (think of joins or unions in the relational algebra, for example).
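As an illustration, here is a minimal sketch of a transformation taking two RDDs as input, using the standard union transformation in the Spark Scala shell (variable names are illustrative):
val evens = sc.parallelize(List(2, 4, 6))
val odds = sc.parallelize(List(1, 3, 5))
// union is a binary transformation: it combines two input RDDs into one
val all = evens.union(odds)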
Action¶
RDDs can also undergo a final action, leading to making an output persistent. This can be done by outputting the contents of an RDD to the local disk, to cloud storage, to a distributed file system, to a database system, or directly to the screen of the user (assuming the data is small enough to display).
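For example, a few standard Spark actions, sketched in the Scala shell (the output path is hypothetical):
val numbers = sc.parallelize(List(1, 2, 3))
// count is an action: it returns the number of values to the client
numbers.count()
// collect is an action: it brings all values back to the client's memory
numbers.collect()
// saveAsTextFile is an action: it persists the RDD to the (default) file system
numbers.saveAsTextFile("/user/hadoop/output")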
Lazy evaluation¶
Another important aspect of the RDD lifecycle is that evaluation is lazy: creations and transformations, on their own, do nothing. It is only with an action that the entire computation pipeline is put into motion, leading to the computation of all the necessary intermediate RDDs, all the way down to the final output corresponding to the action.
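A small sketch makes this visible in the Scala shell (variable names are illustrative):
// This line is only recorded: map is a transformation, nothing is computed yet.
val squares = sc.parallelize(1 to 5).map(x => x * x)
// Only this action triggers the actual computation of the pipeline.
squares.collect() // Array(1, 4, 9, 16, 25)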
A simple example¶
Let us give a very simple example of the RDD lifecycle. One easy way to start with Spark is by using a shell. Let us use a Scala shell.
First, we can create an RDD with a simple list of two strings, and assign it to a variable, with the parallelize creation function, like so:
val rdd1 = sc.parallelize(
  List("Hello, World!", "Hello, there!")
)
Note, for completeness, that it would also be possible to read from an input dataset (let us assume a sharded dataset in some directory in HDFS, configured as the default file system) with
val rdd1 = sc.textFile("/user/hadoop/directory/data-*.txt")
Note that, at this point, nothing actually gets set into motion. But, for pedagogical purposes, this is what will be generated when the computation starts:
Next, we can transform the RDD by tokenizing the strings based on spaces with a flatMap transformation. We assign the resulting RDD to another variable.
val rdd2 = rdd1.flatMap(
  value => value.split(" ")
)
Note that, at this point, still nothing actually gets set into motion, but rdd2 would correspond to:
Finally, we invoke the action countByValue(), which will return a (local) map structure (object, dict...) associating each value with a count. This is when the computations are actually triggered:
rdd2.countByValue()
The result will be displayed on the screen of the shell:
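Splitting the two input strings on spaces yields the tokens "Hello,", "World!", "Hello," and "there!", so the count is 2 for "Hello," and 1 for each of the other two tokens. The output would look roughly like this (the ordering of the map entries may vary):
Map(Hello, -> 2, World! -> 1, there! -> 1)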
Transformations¶
Transformations, in Spark, are like a large restaurant menu to pick from. We now go through the most important transformations.
Unary transformations¶
Let us start with unary transformations, i.e., those that take a single RDD as their input.
The filter transformation, provided a predicate function taking a value and returning a Boolean, returns the subset of its input that satisfies the predicate, preserving the relative order.
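For example, a minimal sketch keeping only the even numbers (variable names are illustrative):
val numbers = sc.parallelize(List(1, 2, 3, 4, 5, 6))
// filter keeps the values satisfying the predicate, preserving their order
val evens = numbers.filter(value => value % 2 == 0) // 2, 4, 6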
The map transformation, provided a function taking a value and returning another value, returns the list of values obtained by applying this function to each value in the input.
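For example, a minimal sketch mapping each string to its length (variable names are illustrative):
val words = sc.parallelize(List("a", "bb", "ccc"))
// map produces exactly one output value per input value
val lengths = words.map(value => value.length) // 1, 2, 3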
The flatMap transformation, provided a function taking a value and returning zero, one or more values, returns the list of values obtained by applying this function to each value in the input, flattening the obtained values (i.e., the information on which values came from the same input value is lost). The flatMap transformation, in fact, corresponds to the MapReduce map phase (in the special case that the RDDs involved are RDDs of key-value pairs).
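The contrast with map can be seen in a small sketch (variable names are illustrative): tokenizing with map yields an RDD of arrays, while tokenizing with flatMap yields a flat RDD of tokens.
val lines = sc.parallelize(List("to be", "or not"))
// map keeps one array per input line: [to, be] and [or, not]
val arrays = lines.map(line => line.split(" "))
// flatMap flattens the arrays into four tokens: to, be, or, not
val tokens = lines.flatMap(line => line.split(" "))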