[2004 OSDI] MapReduce: Simplified Data Processing on Large Clusters
One-line Summary
MapReduce is a simple programming model for processing large datasets on large clusters where failures are common. It provides a limited but general functional API (Map, Reduce, Sort), fault tolerance through task re-execution, and straggler mitigation through backup (speculative) tasks.
Paper Structure Outline
Introduction
Programming Model
Example
Types
More Examples
Implementation
Execution Overview
Master Data Structures
Fault Tolerance
Locality
Task Granularity
Backup Tasks
Refinements
Partitioning Function
Ordering Guarantees
Combiner Function
Input and Output Types
Side-effects
Skipping Bad Records
Local Execution
Status Information
Counters
Performance
Cluster Configuration
Grep
Sort
Effect of Backup Tasks
Machine Failures
Experience
Large-Scale Indexing
Related Work
Conclusions
Programming Model
The data type for each record is of the form (key, value).
The terms "map" and "reduce" are borrowed from functional languages such as Lisp. The Map function processes individual input records in parallel (over a very large number of records) to generate intermediate (key, value) pairs. The Reduce function then merges, also in parallel, all intermediate values associated with the same key; intermediate keys are assigned to Reduce tasks by a partitioning function (e.g., hash partitioning).
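As a minimal sketch of the model (in Python rather than the paper's C++-style pseudocode; the single-process driver below stands in for the distributed framework), the classic word count job looks roughly like this:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for each word in the document."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all intermediate counts emitted for the same word."""
    yield (word, sum(counts))

def run_locally(documents):
    """Toy single-process driver; the real framework parallelizes and shuffles."""
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)        # "shuffle": group values by key
    results = {}
    for key, values in intermediate.items():
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

print(run_locally({"doc1": "the quick brown fox", "doc2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```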
Example Workloads
Word Count
Distributed grep
Reverse Web-Link Graph (a sketch follows this list)
Count of URL Access Frequency
Sort
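For instance, a sketch of the reverse web-link graph job, assuming a toy regex-based extract_links helper (hypothetical, not from the paper):

```python
import re

def extract_links(page_html):
    """Toy link extractor (hypothetical helper): pull href targets out of anchor tags."""
    return re.findall(r'href="([^"]+)"', page_html)

def map_fn(source_url, page_html):
    """Map: for each link target found in the page, emit (target, source)."""
    for target in extract_links(page_html):
        yield (target, source_url)

def reduce_fn(target_url, sources):
    """Reduce: collect the full list of source URLs that link to this target."""
    yield (target_url, list(sources))

print(list(map_fn("http://a.com", '<a href="http://b.com">b</a>')))
# [('http://b.com', 'http://a.com')]
```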
MapReduce Scheduling
Inside MapReduce
Users only need to write the Map and Reduce programs, then submit the job and wait for the results, without needing to know anything about the underlying parallel/distributed execution. The framework and its scheduler must handle the following:
Parallelize Map
Transfer data from Map to Reduce: use a partitioning function to ensure that all map output records with the same key are assigned to the same Reduce task (see the partitioner sketch after this list)
Parallelize Reduce
Implement storage for Map input, Map output, Reduce input, Reduce output
Map input: From distributed FS (GFS, HDFS, etc.)
Map output: To local FS/disk at Map node
Reduce input: Fetched from the local disks of (multiple) remote Map nodes; staged on the Reduce node's local FS
Reduce output: To distributed FS
Ensure the barrier between the Map phase and Reduce phase
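A sketch of the default partitioning rule (hash of the intermediate key mod the number of Reduce tasks, as in the paper's default partitioner; md5 is used here only to get a hash that is stable across processes):

```python
import hashlib

def partition(key, num_reduce_tasks):
    """Assign an intermediate key to one of R Reduce tasks.
    Every record with the same key gets the same partition, so a single
    Reduce task sees all values for that key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()   # stable across processes
    return int.from_bytes(digest, "big") % num_reduce_tasks

# "apple" always lands on the same Reduce task, no matter which Map task emitted it.
print(partition("apple", 4), partition("apple", 4))
```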
The YARN Scheduler
Yet Another Resource Negotiator (YARN) is offered in Hadoop 2.x+. It treats each server as a collection of containers (some CPU + some memory). It has three main components:
Global Resource Manager (RM): Scheduling
Per-Server Node Manager (NM): Daemon and Server-specific functions
Per-Application/Job Application Master (AM): Negotiates containers with the RM and NMs, and detects task failures within that job
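A toy model of the container abstraction (illustrative only, not the actual YARN API): the RM tracks free capacity per node and carves it into containers that AMs request.

```python
from dataclasses import dataclass

@dataclass
class Container:
    """A slice of one server's resources: some CPU plus some memory."""
    node: str
    vcores: int
    memory_mb: int

class ResourceManager:
    """Toy global RM: tracks free capacity per node and grants container requests."""
    def __init__(self, capacity):
        self.free = dict(capacity)                 # node -> (vcores, memory_mb)

    def allocate(self, vcores, memory_mb):
        for node, (free_cores, free_mem) in self.free.items():
            if free_cores >= vcores and free_mem >= memory_mb:
                self.free[node] = (free_cores - vcores, free_mem - memory_mb)
                return Container(node, vcores, memory_mb)
        return None                                # no room: the AM waits and retries

rm = ResourceManager({"node1": (8, 16384), "node2": (8, 16384)})
print(rm.allocate(2, 4096))                        # an AM requesting a 2-core / 4 GB container
```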
Other Designs
Fault Tolerance: Failures
Server Failure
NM heartbeats to RM: If a server fails, the RM notifies all affected AMs, and the AMs take action (the heartbeat pattern is sketched below)
NM keeps track of each task running at its server: If a task fails while in progress, mark the task as idle and restart it. If the same task fails repeatedly, end the job
AM heartbeats to RM: On failure, RM restarts AM, which then syncs up with its running tasks
RM Failure
Use old checkpoints and bring up secondary RM
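All three of these detections rest on the same heartbeat/timeout pattern; a minimal sketch (the 10-second timeout is an arbitrary illustration, not a value from the paper or from YARN):

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat time per member and flags those that time out."""
    def __init__(self, timeout_secs=10.0):
        self.timeout_secs = timeout_secs
        self.last_seen = {}

    def heartbeat(self, member_id):
        self.last_seen[member_id] = time.monotonic()

    def failed_members(self):
        now = time.monotonic()
        return [m for m, t in self.last_seen.items()
                if now - t > self.timeout_secs]

# e.g., the RM would run a monitor like this over NMs and AMs; on a dead NM it
# notifies the affected AMs, and on a dead AM it restarts that AM.
monitor = HeartbeatMonitor(timeout_secs=10.0)
monitor.heartbeat("node-manager-1")
print(monitor.failed_members())   # [] until 10 s pass without a heartbeat
```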
Fault Tolerance: Stragglers
The slowest machine slows the entire job down
Possible reasons: bad disk, network bandwidth, CPU, or memory
Keep track of the progress of each task (% done). When a straggler appears, launch a second copy of the task on another node and take the output of whichever copy finishes first (this is called Speculative Execution).
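A sketch of the speculative-execution decision (the "slow relative to the average" threshold here is an arbitrary illustration, not the scheduler's actual heuristic):

```python
def pick_speculative_candidates(task_progress, threshold=0.7):
    """task_progress: task_id -> fraction done (0.0 to 1.0).
    Flag tasks lagging far behind the average so a backup copy can be launched;
    whichever copy (original or backup) finishes first wins, the other is killed."""
    if not task_progress:
        return []
    avg = sum(task_progress.values()) / len(task_progress)
    return [t for t, p in task_progress.items() if p < threshold * avg]

print(pick_speculative_candidates({"m1": 0.95, "m2": 0.90, "m3": 0.20}))  # ['m3']
```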
Locality
The cloud has a hierarchical network topology (e.g., machines grouped into racks)
GFS/HDFS stores 3 replicas of each chunk (e.g., 64 MB in size), possibly on different racks
MapReduce attempts to schedule a Map task on (preference from high to low):
A machine that contains a replica of the corresponding input data
On the same rack as a machine containing the input
Anywhere
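A sketch of that placement preference, assuming the scheduler knows which hosts hold a replica of the input chunk, each host's rack, and which hosts currently have a free task slot:

```python
def place_map_task(replica_hosts, host_rack, free_hosts):
    """Pick a host for a Map task, preferring data-local, then rack-local, then any.
    replica_hosts: hosts holding a replica of the task's input chunk.
    host_rack:     host -> rack mapping.
    free_hosts:    hosts with a free task slot right now."""
    # 1. Data-local: a free host that already stores a replica.
    for host in free_hosts:
        if host in replica_hosts:
            return host
    # 2. Rack-local: a free host on the same rack as some replica.
    replica_racks = {host_rack[h] for h in replica_hosts}
    for host in free_hosts:
        if host_rack.get(host) in replica_racks:
            return host
    # 3. Anywhere: any free host (or None if the cluster is fully busy).
    return next(iter(free_hosts), None)

racks = {"a1": "rack-a", "a2": "rack-a", "b1": "rack-b"}
print(place_map_task(replica_hosts={"a1"}, host_rack=racks, free_hosts=["b1", "a2"]))
# 'a2' (rack-local: same rack as the replica on a1)
```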