[2012 NSDI] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
One-line Summary
Spark is a generalized MapReduce model (in-memory Hadoop :P). It uses RDDs to support fault-tolerant, in-memory computation.
Paper Structure Outline
- Introduction
- Resilient Distributed Datasets (RDDs)
  - RDD Abstraction
  - Spark Programming Interface
    - Example: Console Log Mining
  - Advantages of the RDD Model
  - Applications Not Suitable for RDDs
- Spark Programming Interface
  - RDD Operations in Spark
  - Example Applications
    - Logistic Regression
    - PageRank
- Representing RDDs
- Implementation
  - Job Scheduling
  - Interpreter Integration
  - Memory Management
  - Support for Checkpointing
- Evaluation
  - Iterative Machine Learning Applications
  - PageRank
  - Fault Recovery
  - Behavior with Insufficient Memory
  - User Applications Built with Spark (in-memory analytics, traffic modeling, Twitter spam classification)
  - Interactive Data Mining
- Discussion
  - Expressing Existing Programming Models
  - Leveraging RDDs for Debugging
- Related Work
- Conclusion
Background & Motivation
- Programmability
  - Most real applications require multiple MapReduce stages:
    - Google indexing pipeline: 21 steps
    - Analytics queries: 2-5 steps
    - Iterative algorithms: 10s of steps
  - Multi-step jobs create spaghetti code: 21 MapReduce steps -> 21 mapper & reducer classes
- Performance
  - MapReduce provides only one pass of computation; intermediate data must be written to the file system between jobs
  - This is expensive for apps that need to reuse data (e.g., multi-step algorithms like PageRank, interactive data mining); see the logistic regression sketch below
 
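The paper's logistic regression example (Section 3.2.1) is the standard illustration of why reuse matters. A minimal sketch, simplified to a single feature dimension with made-up inline data (the paper reads points from a text file):

```scala
import scala.math.exp
import org.apache.spark.{SparkConf, SparkContext}

object LRSketch {
  case class Point(x: Double, y: Double) // y is the label in {-1, +1}

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr").setMaster("local[*]"))

    val points = sc
      .parallelize(Seq(Point(1.0, 1.0), Point(2.0, 1.0), Point(-1.5, -1.0)))
      .persist() // the key point: the dataset stays in memory across iterations

    var w = 0.0
    for (_ <- 1 to 10) {
      // With plain MapReduce, every iteration would re-read the input
      // from the file system; here `points` is served from cache.
      val gradient = points
        .map(p => (1.0 / (1.0 + exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
        .reduce(_ + _)
      w -= gradient
    }
    println(s"final w = $w")
    sc.stop()
  }
}
```
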
Design and Implementation
Apache Spark
- Simple, functional API (see the log mining sketch after this list)
  - 5x-10x less code than MapReduce
  - Parallel transformations on collections
  - Available in Scala, Python, Java, and R
  - Operation chaining can be traced -> free type checking!
  - Mimics local programs
- Performance
  - In-memory computing primitives
    - In-memory caching with LRU eviction
    - Caches are also cleared when workers go away
  - Optimization across operators
  - Lazy evaluation/execution
- Fault tolerance
  - Lineage graphs: records of the transformations that created each RDD (see the lineage sketch below)
    - Assumption: the input file is still available
    - Assumption: storage for the lineage graph is stable (similar to the MapReduce master)
- Job scheduling
  - Captures the RDD dependency graph
  - Pipelines functions into "stages"
  - Cache-aware for data reuse and locality (moves computation to where data is cached)
  - Partition-aware to avoid shuffles (see the partitioning sketch below)
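
A lightly adapted version of the paper's console log mining example (Section 2.2.1) shows the API in action; it assumes spark-shell's pre-built SparkContext `sc`, and the input path is elided as in the paper:

```scala
// In spark-shell, `sc` is the pre-built SparkContext.
val lines  = sc.textFile("hdfs://...")           // input path elided, as in the paper
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()                                 // keep the error lines in memory for reuse

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning HDFS,
// with time as field 3 of a tab-separated line (as in the paper):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
```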
 
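A minimal sketch of what lineage looks like from the API side, again assuming an existing SparkContext `sc`; `toDebugString` prints the recorded transformation chain:

```scala
// Assuming `sc` is an existing SparkContext (e.g., in spark-shell).
val times = sc.textFile("hdfs://...")            // recovery assumes this input still exists
  .filter(_.startsWith("ERROR"))
  .map(_.split('\t')(3))

// Each RDD records the chain of transformations that produced it:
println(times.toDebugString)

// If a partition of `times` is lost, Spark reruns only
// textFile -> filter -> map on the corresponding input split.
```
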
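A sketch of the partition-aware point, loosely following the paper's PageRank discussion (Section 3.2.2); the data, key types, and partition count here are made up:

```scala
import org.apache.spark.HashPartitioner

// Assuming `sc` is an existing SparkContext.
val links = sc
  .parallelize(Seq(("a", Seq("b")), ("b", Seq("a"))))
  .partitionBy(new HashPartitioner(8)) // fix the partitioning once...
  .persist()                           // ...and keep the partitioned data cached

val ranks = sc.parallelize(Seq(("a", 1.0), ("b", 1.0)))

// `links` has a known partitioner, so the scheduler shuffles only
// `ranks`; the cached `links` partitions are joined in place.
val contribs = links.join(ranks)
```
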
RDDs
- Immutable, partitioned collections of objects
- Can be cached in memory for faster reuse
- Operations on RDDs: transformations (lazily build new RDDs, e.g., map, filter) & actions (compute and return results, e.g., count, collect); see the sketch below
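
A minimal sketch of the transformation/action split, assuming an existing SparkContext `sc`:

```scala
// Assuming `sc` is an existing SparkContext.
val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n) // transformation: builds an RDD, runs nothing
val evens   = squares.filter(_ % 2 == 0)  // transformation: still nothing has run

println(evens.count())                    // action: triggers the whole pipeline
println(evens.take(3).mkString(", "))     // action: runs again (unless persisted)
```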