[2011 NSDI] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
One-line Summary
Mesos is a thin resource-sharing layer that multiplexes cluster resources among multiple frameworks/applications. Using two-level scheduling (a central allocator in the Mesos master plus application-specific framework schedulers) and resource offers, it provides scalable, decentralized scheduling.
Paper Structure Outline
Background & Motivation
Motivation: Share resources among multiple frameworks/services
Background: OS scheduling: time sharing
Mechanism: Preempt processes
Policy: Which process is chosen to run
Background: Cluster scheduling
Time sharing: "context switches" (checkpointing/migrating whole jobs) are far more expensive than on a single machine
Space sharing: Partition resources at the same time across jobs
Policies: Should be aware of locality
Scale
Fault tolerance
Background: Target environment
Multiple MapReduce versions
Mix of frameworks: MPI, Spark, MapReduce
Data sharing across frameworks
Avoid per-framework clusters (not good for resource utilization)
Design and Implementation
Architecture
Centralized master (& its backups)
Agent on every machine
Two-level scheduling: Framework scheduler attached to Mesos
"Simplicity across frameworks"
Resource offers
Slaves send heartbeats to the master with information on available resources
Mesos master sends resource offers to frameworks -> frameworks reply with the tasks to launch and the resources each task uses (see the sketch below)
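A minimal sketch of one offer round trip, assuming toy `Offer`/`Task` types and a `resource_offer` callback (illustrative names, not the real Mesos API):

```python
# Toy sketch of the resource-offer protocol. Offer/Task/MyScheduler and
# resource_offer() are illustrative names, not the actual Mesos API.
from dataclasses import dataclass

@dataclass
class Offer:
    slave_id: str
    cpus: float
    mem_mb: int

@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int

class MyScheduler:
    """Framework-side scheduler: packs pending tasks into an offer."""
    def __init__(self, pending):
        self.pending = pending

    def resource_offer(self, offer):
        accepted = []
        for task in list(self.pending):
            if task.cpus <= offer.cpus and task.mem_mb <= offer.mem_mb:
                offer.cpus -= task.cpus          # consume part of the offer
                offer.mem_mb -= task.mem_mb
                accepted.append(task)
                self.pending.remove(task)
        return accepted                          # [] == reject the offer

def master_offer_round(free, frameworks):
    """Master side: offer each slave's free resources to frameworks in the
    order chosen by the allocation policy; launch any accepted tasks."""
    for slave_id, (cpus, mem_mb) in free.items():
        for fw in frameworks:
            tasks = fw.resource_offer(Offer(slave_id, cpus, mem_mb))
            if tasks:
                print(f"launch on {slave_id}: {[t.name for t in tasks]}")
                break                            # offer consumed
```

Note the division of labor: the master decides *who* gets offered *what*; the framework decides *which* tasks to run on accepted resources.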
Constraints
Example of constraints
Soft constraints: Prefer to run the task at a particular location (e.g. for data locality)
Hard constraints: Task needs GPUs
Constraints in Mesos
Frameworks can reject offers that don't satisfy their constraints
Optimization: Filters (reduce the number of rejected-offer round trips; see the sketch below)
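Filters let a framework tell the master up front which offers would be rejected anyway, so the master can skip them. A sketch of filters as master-side predicates (the predicate form is an assumption for illustration; the paper's filters are "only offer nodes from list L" and "only offer nodes with at least R resources free"; reuses the `Offer` type from the sketch above):

```python
# Illustrative filter predicates checked by the master before sending an offer.

def nodes_filter(allowed_slave_ids):
    """Only offer me resources on these nodes."""
    return lambda offer: offer.slave_id in allowed_slave_ids

def min_resources_filter(min_cpus, min_mem_mb):
    """Only offer me nodes with at least this much free."""
    return lambda offer: offer.cpus >= min_cpus and offer.mem_mb >= min_mem_mb

def should_send(offer, filters):
    """Master-side check: send the offer only if it passes all filters."""
    return all(f(offer) for f in filters)
```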
Allocation
Assumption: Tasks are short -> reallocate resources as tasks finish
Long tasks: Revocation beyond guaranteed allocation
E.g., MPI has a guaranteed allocation of 8 machines but is currently assigned 12; Mesos can take away (revoke) 4 machines (see the sketch below)
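A back-of-the-envelope sketch of the revocation bound (illustrative helper, not paper code):

```python
def revocable(allocated, guaranteed):
    """Amount the allocator may reclaim without violating the framework's
    guaranteed allocation."""
    return max(0, allocated - guaranteed)

# The MPI example from above: guaranteed 8 machines, currently holding 12.
assert revocable(allocated=12, guaranteed=8) == 4
```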
Isolation
Containers: the paper used OS-level container mechanisms (e.g., Linux Containers, Solaris projects); Docker / Linux cgroups are the modern equivalents (see the sketch below)
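A rough sketch of cgroup-style isolation via the cgroup v1 file interface (the group name, PID, and limits are illustrative; must run as root):

```python
import os

def confine_task(group, pid, cpu_shares, mem_limit_bytes):
    """Cap a task's CPU weight and memory via Linux cgroups (v1 interface)."""
    cpu_dir = f"/sys/fs/cgroup/cpu/{group}"
    mem_dir = f"/sys/fs/cgroup/memory/{group}"
    os.makedirs(cpu_dir, exist_ok=True)
    os.makedirs(mem_dir, exist_ok=True)
    with open(os.path.join(cpu_dir, "cpu.shares"), "w") as f:
        f.write(str(cpu_shares))                 # relative CPU weight
    with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_limit_bytes))            # hard memory cap
    for d in (cpu_dir, mem_dir):
        with open(os.path.join(d, "tasks"), "w") as f:
            f.write(str(pid))                    # move the task into the group
```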
Fault tolerance
Node fails -> Mesos reports the failure to the affected frameworks (e.g., Hadoop) and lets their schedulers decide what to do
Master fails -> the new master recovers the (soft) state by communicating with framework schedulers and slaves
Hot-standby masters, with leader election via ZooKeeper
Framework scheduler fails -> Mesos can notify a backup scheduler registered by the framework to take over; sharing scheduler state is left to the framework
Placement preferences
Problem: more frameworks prefer a node than the node can accommodate. Which frameworks get its offers?
Lottery scheduling: offer the node to frameworks with probability weighted by their allocations (see the sketch below)
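A minimal lottery sketch, assuming each framework's current allocation serves as its ticket count (illustrative):

```python
import random

def pick_framework(allocations):
    """Lottery over frameworks contending for a preferred node: each
    framework's chance is proportional to its allocation (its 'tickets'),
    so offers average out to the intended shares over time."""
    frameworks = list(allocations)
    tickets = [allocations[fw] for fw in frameworks]
    return random.choices(frameworks, weights=tickets, k=1)[0]

# fw_a holds 30 slots, fw_b holds 10 -> fw_a wins ~75% of contended offers.
print(pick_framework({"fw_a": 30, "fw_b": 10}))
```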
Design choices & implications
Centralized vs. Distributed
Framework complexity: Every framework developer needs to implement a scheduler
Fragmentation, starvation: especially for jobs with large resource requirements, since small tasks from other frameworks can fragment nodes. Partial workaround: a minimum offer size on each slave
Inter-framework dependencies: hard to express constraints such as "these 2 frameworks cannot be colocated" (e.g., due to security or privacy requirements)
Evaluations