[2012 NSDI] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
One-line Summary
Spark is a generalized MapReduce model (in-memory Hadoop :P). It uses RDDs (resilient distributed datasets) to support fault-tolerant in-memory computation.
Paper Structure Outline
Introduction
Resilient Distributed Datasets (RDDs)
RDD Abstraction
Spark Programming Interface
Example: Console Log Mining
Advantages of the RDD Model
Applications Not Suitable for RDDs
Spark Programming Interface
RDD Operations in Spark
Example Applications
Logistic Regression
PageRank
Representing RDDs
Implementation
Job Scheduling
Interpreter Integration
Memory Management
Support for Checkpointing
Evaluation
Iterative Machine Learning Applications
PageRank
Fault Recovery
Behavior with Insufficient Memory
User Applications Built with Spark (in-memory analytics, traffic modeling, Twitter spam classification)
Interactive Data Mining
Discussion
Expressing Existing Programming Models
Leveraging RDDs for Debugging
Related Work
Conclusion
Background & Motivation
Programmability
Most real applications require multiple MapReduce stages:
Google indexing pipeline: 21 steps
Analytics queries: 2-5 steps
Iterative algorithms: 10s of steps
Multi-step jobs create spaghetti code
21 MapReduce steps -> 21 mapper & reducer classes
Performance
MapReduce provides only a single pass of computation; multi-stage jobs must write intermediate data to the file system between stages
Expensive for apps that need to reuse data (e.g., multi-step algorithms like PageRank, interactive data mining)
Design and Implementation
Apache Spark
Simple, functional API
5x - 10x less code than MapReduce
Parallel transformations on collections
Available in Scala, Python, Java, and R
Can trace how operations are chained -> free type checking!
Mimics local programs
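As a feel for the "mimics local programs" point, here is word count as a Spark-style functional chain, sketched in plain Python (a toy stand-in, not actual PySpark; the input text is made up):

```python
from collections import Counter

# Spark-style: a short functional chain that reads like a local program.
# (Real Spark/Scala: lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _))
text = ["to be or not to be"]
words = [w for line in text for w in line.split()]  # flatMap
counts = Counter(words)                             # map + reduceByKey

print(counts["to"])  # 2
```

In Hadoop MapReduce the same job needs separate mapper and reducer classes plus driver boilerplate, which is where the 5-10x code-size difference comes from.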
Performance
In-memory computing primitives
In-mem caching: LRU
Cached partitions are also lost when workers fail (and get recomputed via lineage)
Optimization across operators
Lazy evaluation/execution
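Lazy evaluation can be sketched in a few lines of toy Python (illustrative only, not Spark's implementation): transformations merely record a plan, and an action triggers the whole chain, which is what lets Spark optimize across operators.

```python
# Toy sketch of lazy evaluation: transformations record what to do;
# an action actually runs the recorded pipeline.
class LazyDataset:
    def __init__(self, data, ops=()):
        self._data = data   # source collection
        self._ops = ops     # recorded transformations (the "plan")

    def map(self, f):       # transformation: returns a new plan, runs nothing
        return LazyDataset(self._data, self._ops + (("map", f),))

    def filter(self, p):    # transformation: also just extends the plan
        return LazyDataset(self._data, self._ops + (("filter", p),))

    def collect(self):      # action: now the recorded ops actually execute
        out = self._data
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

nums = LazyDataset([1, 2, 3, 4])
squares_of_evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has run yet; collect() executes the chain.
print(squares_of_evens.collect())  # [4, 16]
```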
Fault tolerance
Lineage graphs: Records of transformations that created this RDD
Assumption: Input file is still available
Assumption: Storage for lineage graphs is stable (similar to the MapReduce master)
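The recovery idea behind lineage can be sketched as follows (a toy Python illustration; the transformations and input are made up, and real Spark recomputes only the lost partitions, in parallel):

```python
# Toy sketch of lineage-based fault recovery: an RDD remembers the
# transformations that created it, so a lost partition can be rebuilt
# from the (still available) input instead of from a replica.
def recompute(lineage, source):
    """Replay the recorded transformations over the original input."""
    data = source
    for op in lineage:
        data = [op(x) for x in data]  # per-element (narrow) ops for simplicity
    return data

source = [1, 2, 3]                             # assumption: input still available
lineage = [lambda x: x + 1, lambda x: x * 10]  # transformations that built the RDD
cached = recompute(lineage, source)            # [20, 30, 40], held in memory

cached = None                                  # worker holding the data fails
recovered = recompute(lineage, source)         # rebuild from lineage
print(recovered)  # [20, 30, 40]
```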
Job scheduling
Captures RDD dependency graph
Pipelines chains of operations into "stages"
Cache-aware for data reuse, locality (move computation to where data is cached)
Partition-aware to avoid shuffles
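The stage-cutting idea can be sketched like this (toy Python, not Spark's scheduler; the operation list and the wide/narrow flags are made-up inputs): narrow dependencies get pipelined into one stage, while a wide (shuffle) dependency starts a new one.

```python
# Toy sketch of stage formation in the scheduler: pipeline narrow ops,
# cut a stage boundary at each wide (shuffle) dependency.
def build_stages(ops):
    stages, current = [], []
    for name, wide in ops:
        if wide and current:
            stages.append(current)  # cut the pipeline at the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

ops = [("map", False), ("filter", False),
       ("reduceByKey", True),      # wide dependency -> shuffle boundary
       ("map", False)]
print(build_stages(ops))  # [['map', 'filter'], ['reduceByKey', 'map']]
```

Being partition-aware means that when data is already partitioned the right way (e.g., both sides of a join hash-partitioned on the key), an otherwise-wide dependency needs no shuffle.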
RDDs
Immutable, partitioned collection of objects
Can be cached in memory for faster reuse
Operations on RDDs: Transformations (build RDDs) & Actions (compute results)
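The transformation/action split follows the paper's console log mining example; here it is sketched with plain Python lists (real Spark keeps `errors` partitioned across cluster memory and evaluates lazily; the log lines are made up):

```python
lines = [
    "ERROR MySQL connection refused",
    "INFO  request served",
    "ERROR HDFS block missing",
]

# "Transformations" build derived datasets (in Spark: lazily, as new RDDs)...
errors = [l for l in lines if l.startswith("ERROR")]  # filter
mysql_errors = [l for l in errors if "MySQL" in l]    # filter

# ...while "actions" compute and return a result (in Spark: count, collect, save).
print(len(mysql_errors))  # 1
```

In Spark, `errors` would also be `persist()`-ed so that later queries over it hit memory instead of re-reading the log from disk.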