[2012 NSDI] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster ...
...Computing
Spark is a generalized MapReduce model (in-memory Hadoop :P). It uses RDDs to support in-memory computations.
- 1.Introduction
- 2.Resilient Distributed Datasets (RDDs)
- 1.RDD Abstraction
- 2.Spark Programming Interface
- 1.Example: Console Log Mining
- 3.Advantages of the RDD Model
- 4.Applications Not Suitable for RDDs
- 3.Spark Programming Interface
- 1.RDD Operations in Spark
- 2.Example Applications
- 1.Logistic Regression
- 2.PageRank
- 4.Representing RDDs
- 5.Implementation
- 1.Job Scheduling
- 2.Interpreter Integration
- 3.Memory Management
- 4.Support for Checkpointing
- 6.Evaluation
- 1.Iterative Machine Learning Applications
- 2.PageRank
- 3.Fault Recovery
- 4.Behavior with Insufficient Memory
- 5.User Applications Built with Spark (In-mem analytics, traffic modeling, twitter spam classification)
- 6.Interactive Data Mining
- 7.Discussion
- 1.Expressing Existing Programming Models
- 2.Leveraging RDDs for Debugging
- 8.Related Work
- 9.Conclusion
- Programmability
- Most real applications require multiple MapReduce stages:
- Google indexing pipeline: 21 steps
- Analytics queries: 2-5 steps
- Iterative algorithms: 10s of steps
- Multi-step jobs create spaghetti code
- 21 MapReduce steps -> 21 mapper & reducer classes
- Performance
- MapReduce only provides one pass of computation (must write data to file system in between)
- Expensive for apps that need to reuse data (e.g., multi-step algorithms like PageRank, interactive data mining)
- Simple, functional API
- 5x - 10x less code than MapReduce
- Parallel transformations on collections
- Available in Scala, Python, Java, and R
- Can trace how operations are chained -> free type checking!
- Mimics local programs
- Performance
- In-memory computing primitives
- In-mem caching: LRU
- Caches also get cleared when workers go away
- Optimization across operators
- Lazy evaluation/execution
- Fault tolerance
- Lineage graphs: Records of transformations that created this RDD
- Assumption: Input file is still available
- Assumption: Storage for lineage graphs is stable (similar to the MapReduce master)
- Job scheduling
- Captures RDD dependency graph
- Pipelines function into "stages"
- Cache-aware for data reuse, locality (move computation to where data is cached)
- Partition-aware to avoid shuffles
- Immutable, partitioned collection of objects
- Can be cached in memory for faster reuse
- Operations on RDDs: Transformations (build RDDs) & Actions (compute results)