[2012 NSDI] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
One-line Summary
Spark is a generalized MapReduce model (in-memory Hadoop :P). It uses RDDs (resilient distributed datasets) to support fault-tolerant in-memory computation.
Paper Structure Outline
Introduction
Resilient Distributed Datasets (RDDs)
RDD Abstraction
Spark Programming Interface
Example: Console Log Mining
Advantages of the RDD Model
Applications Not Suitable for RDDs
Spark Programming Interface
RDD Operations in Spark
Example Applications
Logistic Regression
PageRank
Representing RDDs
Implementation
Job Scheduling
Interpreter Integration
Memory Management
Support for Checkpointing
Evaluation
Iterative Machine Learning Applications
PageRank
Fault Recovery
Behavior with Insufficient Memory
User Applications Built with Spark (in-memory analytics, traffic modeling, Twitter spam classification)
Interactive Data Mining
Discussion
Expressing Existing Programming Models
Leveraging RDDs for Debugging
Related Work
Conclusion
Background & Motivation
Programmability
Most real applications require multiple MapReduce stages:
Google indexing pipeline: 21 steps
Analytics queries: 2-5 steps
Iterative algorithms: 10s of steps
Multi-step jobs create spaghetti code
21 MapReduce steps -> 21 mapper & reducer classes
Performance
MapReduce provides only a single pass of computation; multi-stage jobs must write intermediate data to the file system between stages
Expensive for apps that need to reuse data (e.g., multi-step algorithms like PageRank, interactive data mining)
Design and Implementation
Apache Spark
Simple, functional API
5x - 10x less code than MapReduce
Parallel transformations on collections
Available in Scala, Python, Java, and R
Can trace how operations are chained -> free type checking!
Mimics local programs
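As a feel for the "mimics local programs" point, here is word count as a Spark-style functional chain, sketched in plain Python (a toy stand-in, not actual PySpark; the input text is made up):

```python
from collections import Counter

# Spark-style: a short functional chain that reads like a local program.
# (Real Spark/Scala: lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _))
text = ["to be or not to be"]
words = [w for line in text for w in line.split()]  # flatMap
counts = Counter(words)                             # map + reduceByKey

print(counts["to"])  # 2
```

In Hadoop MapReduce the same job needs separate mapper and reducer classes plus driver boilerplate, which is where the 5-10x code-size difference comes from.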
Performance
In-memory computing primitives
In-mem caching: LRU
Cached partitions are also lost when workers fail (and get recomputed via lineage)
Optimization across operators
Lazy evaluation/execution
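Lazy evaluation can be sketched in a few lines of toy Python (illustrative only, not Spark's implementation): transformations merely record a plan, and an action triggers the whole chain, which is what lets Spark optimize across operators.

```python
# Toy sketch of lazy evaluation: transformations record what to do;
# an action actually runs the recorded pipeline.
class LazyDataset:
    def __init__(self, data, ops=()):
        self._data = data   # source collection
        self._ops = ops     # recorded transformations (the "plan")

    def map(self, f):       # transformation: returns a new plan, runs nothing
        return LazyDataset(self._data, self._ops + (("map", f),))

    def filter(self, p):    # transformation: also just extends the plan
        return LazyDataset(self._data, self._ops + (("filter", p),))

    def collect(self):      # action: now the recorded ops actually execute
        out = self._data
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

nums = LazyDataset([1, 2, 3, 4])
squares_of_evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has run yet; collect() executes the chain.
print(squares_of_evens.collect())  # [4, 16]
```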
Fault tolerance
Lineage graphs: Records of transformations that created this RDD
Assumption: Input file is still available
Assumption: Storage for lineage graphs is stable (similar to the MapReduce master)
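The recovery idea behind lineage can be sketched as follows (a toy Python illustration; the transformations and input are made up, and real Spark recomputes only the lost partitions, in parallel):

```python
# Toy sketch of lineage-based fault recovery: an RDD remembers the
# transformations that created it, so a lost partition can be rebuilt
# from the (still available) input instead of from a replica.
def recompute(lineage, source):
    """Replay the recorded transformations over the original input."""
    data = source
    for op in lineage:
        data = [op(x) for x in data]  # per-element (narrow) ops for simplicity
    return data

source = [1, 2, 3]                             # assumption: input still available
lineage = [lambda x: x + 1, lambda x: x * 10]  # transformations that built the RDD
cached = recompute(lineage, source)            # [20, 30, 40], held in memory

cached = None                                  # worker holding the data fails
recovered = recompute(lineage, source)         # rebuild from lineage
print(recovered)  # [20, 30, 40]
```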
Job scheduling
Captures RDD dependency graph
Pipelines chains of operations into "stages"
Cache-aware for data reuse, locality (move computation to where data is cached)
Partition-aware to avoid shuffles
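The stage-cutting idea can be sketched like this (toy Python, not Spark's scheduler; the operation list and the wide/narrow flags are made-up inputs): narrow dependencies get pipelined into one stage, while a wide (shuffle) dependency starts a new one.

```python
# Toy sketch of stage formation in the scheduler: pipeline narrow ops,
# cut a stage boundary at each wide (shuffle) dependency.
def build_stages(ops):
    stages, current = [], []
    for name, wide in ops:
        if wide and current:
            stages.append(current)  # cut the pipeline at the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

ops = [("map", False), ("filter", False),
       ("reduceByKey", True),      # wide dependency -> shuffle boundary
       ("map", False)]
print(build_stages(ops))  # [['map', 'filter'], ['reduceByKey', 'map']]
```

Being partition-aware means that when data is already partitioned the right way (e.g., both sides of a join hash-partitioned on the key), an otherwise-wide dependency needs no shuffle.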
RDDs
Immutable, partitioned collection of objects
Can be cached in memory for faster reuse
Operations on RDDs: Transformations (build RDDs) & Actions (compute results)
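The transformation/action split follows the paper's console log mining example; here it is sketched with plain Python lists (real Spark keeps `errors` partitioned across cluster memory and evaluates lazily; the log lines are made up):

```python
lines = [
    "ERROR MySQL connection refused",
    "INFO  request served",
    "ERROR HDFS block missing",
]

# "Transformations" build derived datasets (in Spark: lazily, as new RDDs)...
errors = [l for l in lines if l.startswith("ERROR")]  # filter
mysql_errors = [l for l in errors if "MySQL" in l]    # filter

# ...while "actions" compute and return a result (in Spark: count, collect, save).
print(len(mysql_errors))  # 1
```

In Spark, `errors` would also be `persist()`-ed so that later queries over it hit memory instead of re-reading the log from disk.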