tldr: MapReduce + A Major Step Backwards?

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

MapReduce: Simplified Data Processing on Large Clusters (2004)

Jeffrey Dean and Sanjay Ghemawat

MapReduce is a functional programming pattern, popularized by Google for big data processing. It abstracts away parallelization, fault tolerance, and load balancing, letting programmers run jobs on a distributed system through a simple but powerful API with just two core functions.

Programming Model

The MapReduce library consists of two functions:
  • Map. (k1, v1) --> list(k2, v2). A user-defined function that takes an input key-value pair and outputs a list of intermediate key-value pairs.
  • Reduce. (k2, list(v2)) --> list(v2). A user-defined function that takes an intermediate key and the list of its corresponding values, and merges those values into a possibly smaller list of values.
The process from beginning to end is:
  • User specifies a list of input key-value pairs. 
  • Map takes each input key-value pair and outputs a list of intermediate key-value pairs.
  • Shuffle. The MapReduce library groups the intermediate key-value pairs by key, such that all the values with the same key are grouped together.
  • Reduce takes each input key with its list of associated values, and merges them into another list of values.
We can now walk through an example from the paper with some more details fleshed out: counting word occurrences in a large collection of documents. (See the paper for pseudocode and complete C++ code; a minimal Python sketch also follows the list below.)
  • User specifies a list of input key-value pairs. 
    • ("doc1", "cat zebra zebra dog"), ("doc 2", "duck cat")
  • Map takes each input key-value pair and outputs a list of intermediate key-value pairs. In this example, the user-defined Map function emits (word, 1) for every word in the documents.
    • [("cat", 1), ("zebra", 1), ("zebra", 1), ("dog", 1)], [("duck", 1), ("cat", 1)]
  • The MapReduce library groups the intermediate key-value pairs by key, such that all the values with the same key are grouped together.
    • ("cat", [1, 1]), ("zebra", [1, 1]), ("dog", [1]), ("duck", [1])
  • Reduce takes each input key with its list of associated values, and merges them into another list of values. In this example, the user-defined Reduce function sums all 1's corresponding to a word.
    • [("cat", 2), ("dog", 1), ("duck", 1), ("zebra", 2)]

Implementation

The implementation described in the paper was designed specifically for Google's environment: "large clusters of commodity PCs connected together with switched Ethernet". Machine failures are common and expected, since each cluster is made up of hundreds or thousands of machines.
The flow of a MapReduce operation (Figure 1 in the paper) is:
  1. The MapReduce library splits the input files into M pieces of 16-64 MB each. It also starts up workers on a cluster of machines.
  2. The master worker assigns either a map task or a reduce task to the other idle workers. In total, there are M map tasks and R reduce tasks. 
  3. Each map worker reads its input split, parses key-value pairs out of it, and passes each pair to the user-defined Map function. The Map function's output, intermediate key-value pairs, is buffered in memory.
  4. Periodically, the buffered pairs are written to the map worker's local disk, partitioned into R regions by a partitioning function (see the sketch after this list). The buffered pairs' locations on disk are sent to the master, which forwards them to the workers assigned reduce tasks.
  5. When a reduce worker is notified by the master of these locations, it reads the intermediate data from the map workers' local disks. After reading all the data, the reduce worker sorts it by key.
  6. The reduce worker passes each intermediate key and its corresponding values to the user-defined Reduce function. The Reduce function's output is appended to the corresponding partition's final output file, which is stored in a global file system.
  7. After all map and reduce tasks are done, the master wakes up the user program. The final output sits in R output files, which are typically fed into another MapReduce call or some other distributed system that can handle input split across multiple files.
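
Step 4's partitioning deserves a closer look: the paper's default partitioning function is hash(key) mod R, which guarantees that every occurrence of a key lands in the same reduce partition. Here is a minimal Python sketch; the helper names are mine, and Python's built-in hash is randomized across runs, so a real system would use a stable hash function.

    R = 4  # number of reduce tasks, chosen by the user

    def partition(key):
        # The paper's default partitioning function: hash(key) mod R.
        return hash(key) % R

    def partition_map_output(intermediate_pairs):
        # Split one map task's buffered output into R partitions, one per
        # reduce task; each partition is later written to local disk.
        partitions = [[] for _ in range(R)]
        for key, value in intermediate_pairs:
            partitions[partition(key)].append((key, value))
        return partitions

    print(partition_map_output([("cat", 1), ("zebra", 1), ("cat", 1)]))
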
Notes:
  • Input data is managed by Google File System (GFS) and stored on local disks. Each file is split into 64 MB blocks, and multiple (usually 3) copies of each block are stored on different machines. 
  • The master tries to schedule each map task on a machine that already stores the corresponding input data locally. Failing that, it tries a machine near one that stores the data (e.g., on the same network switch), since network bandwidth is a scarce resource.
  • M and R should be much larger than the number of workers for better load balancing.
  • Stragglers are machines that take an abnormally long time to finish one of the last few tasks of a MapReduce operation, and they are a common cause of poor performance. To address them, the master schedules backup executions of the remaining in-progress tasks; a task is marked complete whenever either the original or the backup execution finishes (a toy illustration follows this list).
  • This MapReduce implementation is not well suited to small datasets, or to small updates to big datasets, because of its performance overhead.
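
The backup-execution idea can be shown with a toy Python sketch: launch a duplicate attempt of a task and take whichever attempt finishes first. The real master tracks task state across many machines; this single-process simulation is purely illustrative.

    import concurrent.futures
    import random
    import time

    def run_task(task_id, attempt):
        # Simulate a task that sometimes straggles.
        time.sleep(random.choice([0.1, 2.0]))
        return f"task {task_id} finished ({attempt} execution)"

    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Near the end of the job, the master schedules a backup execution of
        # each in-progress task and takes whichever attempt finishes first.
        attempts = [pool.submit(run_task, 7, a) for a in ("original", "backup")]
        done, _ = concurrent.futures.wait(
            attempts, return_when=concurrent.futures.FIRST_COMPLETED)
        print(done.pop().result())
        # The executor still waits for the slower attempt on exit; the real
        # implementation simply ignores the loser's result.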

Fault Tolerance

Worker Failure

The master marks a worker as failed if the worker doesn't respond to the master's periodic pings within a certain amount of time.

Completed map tasks, in-progress map tasks, and in-progress reduce tasks from failed workers can be rescheduled on other workers. (Completed map tasks have to be rescheduled because their output stored on local disks is inaccessible once a worker fails. Completed reduce tasks don't have to be rescheduled because their output is stored in a global file system.)

All reduce workers are notified if a map worker fails. Reduce workers that have not yet read from the failed map worker will instead read from the worker that the map task was rescheduled on.
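
Putting these rules together, here is a condensed Python sketch of the master's failure-detection pass. The timeout value and data structures are illustrative, not from the paper.

    import time

    PING_TIMEOUT = 10.0  # seconds without a ping response before a worker is marked failed

    def check_workers(workers, task_table):
        # workers: {worker_id: last_ping_response_time}
        # task_table: {task_id: {"worker": ..., "type": "map" or "reduce",
        #                        "state": "in_progress", "completed", or "idle"}}
        now = time.time()
        for worker_id, last_seen in workers.items():
            if now - last_seen <= PING_TIMEOUT:
                continue  # worker is healthy
            for task in task_table.values():
                if task["worker"] != worker_id:
                    continue
                # In-progress map and reduce tasks are rescheduled. Completed map
                # tasks are rescheduled too, since their output lives on the failed
                # worker's local disk. Completed reduce tasks are not, since their
                # output is already in the global file system.
                if task["state"] == "in_progress" or (
                        task["state"] == "completed" and task["type"] == "map"):
                    task["state"] = "idle"  # eligible for reassignment
                    task["worker"] = None

    workers = {"w1": time.time() - 30, "w2": time.time()}
    tasks = {
        1: {"worker": "w1", "type": "map", "state": "completed"},
        2: {"worker": "w2", "type": "reduce", "state": "in_progress"},
    }
    check_workers(workers, tasks)
    print(tasks)  # task 1 is idle again; task 2 is untouched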

Master Failure

Failure of the single master is unlikely, so the MapReduce implementation in the paper simply aborts the operation if the master fails. The user can then retry the operation.

A Major Step Backwards (2008)

DeWitt and Stonebraker published an article in 2008 giving their perspective on MapReduce. To quote them:

MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:
  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on
Their article generated a lot of attention. People countered that RDBMSs were not as scalable as big data processing systems like MapReduce. As one commenter on Hacker News, vicaya, wrote:

The article is from 2008. Since then, the so called parallel DB goes no where, and Hadoop takes over the world.

The main problem of traditional (OLAP) DBMS in the era of big data is that ETL (Extract/Transform/Load) becomes the main bottle neck rather than complex queries, as big data is inherently semi-structured and noisy. MR is the tool to process big data.


Another commenter, thomas11, wrote:

The whole piece seems to be based on a false premise, namely that MapReduce is supposed to replace databases. That's not the case at all, it's a way to analyze and transform data in parallel. Afterwards, you can load it into a relational (or other) database if you want database features.

However, people have recently started developing scalable RDBMS- and SQL-based systems. Since the amount of data is still growing exponentially, indexing, rather than "brute force", is becoming more important than ever. Furthermore, machine learning algorithms involving huge matrices with O(N^2) data, instead of O(N) or O(N log N) data, can't use MapReduce directly.

Finally, Stonebraker recently addressed MapReduce as part of a larger issue in this talk [around 30 minutes in]. 

Conclusion

The MapReduce programming model succeeded at Google because it was easy to use, adaptable to many problems thanks to its simple interface, and scalable to clusters of thousands of machines. It inspired big data processing systems outside of Google, such as Hadoop.
