[2011 NSDI] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

One-line Summary

Mesos is a scheduler for sharing cluster resources between multiple services/applications. Using two-level scheduling (application-specific schedulers), its orchestration mechanism can provide scalable, decentralized scheduling.

Paper Structure Outline

Background & Motivation

Motivation: Share resources among multiple frameworks/services
Background: OS scheduling: time sharing
- Mechanism: Pre-empt processes
- Policy: Which process is chosen to run
Background: Cluster scheduling
- Time sharing: Context switches are more expensive
- Space sharing: Partition resources at the same time across jobs
- Policies: Should be aware of locality
- Scale
- Fault tolerance
Background: Target environment
- Multiple MapReduce versions
- Mix of frameworks: MPI, Spark, MapReduce
- Data sharing across frameworks
- Avoid per-framework clusters (not good for resource utilization)

Design and Implementation

Architecture

Centralized master (& its backups)
Agent on every machine
Two-level scheduling: Framework scheduler attached to Mesos
- "Simplicity across frameworks"

Resource offers

Slave send heartbeats with information on available resources
Mesos master sends resource offers to frameworks -> Frameworks replies tasks & their granularity
Constraints
- Example of constraints
  - Soft constraints: Prefer to run the task at a particular location (e.g. for data locality)
  - Hard constraints: Task needs GPUs
- Constraints in Mesos
  - Applications can reject offers
  - Optimization: Filters (reduces the number of rejected offers)
Allocation
- Assumption: Tasks are short -> allocate when they finish
- Long tasks: Revocation beyond guaranteed allocation
  - E.g., MPI has guaranteed allocation of 8 machines; currently assigned 12 machines; can take away 4 machines
Isolation
- Containers (Docker/Linux cgroups)

Fault tolerance

Node fails -> forward the failure to Hadoop, let them decide what to do
Master fails -> recover (soft) state by communicating with framework schedulers/workers
Also has a standby master
Framework scheduler fails -> Mesos doesn't handle that

Placement preferences

Problem: More frameworks have preferred nodes than available. Who gets the offers?
Lottery scheduling: Offers weighted by num allocations

Design choices & implications

Centralized vs. Distributed
- Framework complexity: Every framework developer needs to implement a scheduler
- Fragmentation, starvation: Especially if a job has large resource requirements. Partial workaround: min offer size
- Inter-dependent framework: 2 frameworks cannot be colocated (e.g. due to security, privacy, ...)

Evaluations

Links & References

Previous[2010 SIGMOD] Pregel: A System for Large-Scale Graph Processing Next[2012 NSDI] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster ...

Last updated 3 years ago

Was this helpful?