[2011 NSDI] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

One-line Summary

Mesos is a thin layer for sharing cluster resources among multiple frameworks/applications. It uses two-level scheduling: Mesos decides how many resources to offer each framework, and each framework's own (application-specific) scheduler decides which offers to accept and which tasks to run on them, keeping scheduling scalable and decentralized.

Paper Structure Outline

Background & Motivation

  • Motivation: Share resources among multiple frameworks/services

  • Background: OS scheduling: time sharing

    • Mechanism: Pre-empt processes

    • Policy: Which process is chosen to run

  • Background: Cluster scheduling

    • Time sharing: Context switches are more expensive

    • Space sharing: Partition resources at the same time across jobs

    • Policies: Should be aware of locality

    • Scale

    • Fault tolerance

  • Background: Target environment

    • Multiple MapReduce versions

    • Mix of frameworks: MPI, Spark, MapReduce

    • Data sharing across frameworks

    • Avoid per-framework clusters (not good for resource utilization)

Design and Implementation

Architecture

  • Centralized master (& its backups)

  • Agent on every machine

  • Two-level scheduling: each framework brings its own scheduler, which registers with the Mesos master (sketch below)

    • "Simplicity across frameworks": Mesos stays minimal; frameworks keep control over their own scheduling logic

Resource offers

  • Slaves send heartbeats to the master with information on their available resources

  • Mesos master sends resource offers to frameworks -> frameworks reply with the tasks to launch and the resources each task uses

  • Constraints

    • Example of constraints

      • Soft constraints: Prefer to run the task at a particular location (e.g. for data locality)

      • Hard constraints: Task needs GPUs

    • Constraints in Mesos

      • Frameworks can reject offers that do not satisfy their constraints

      • Optimization: Filters let a framework tell the master up front which offers to skip (e.g., only nodes from a given list, only offers of at least a given size), reducing the number of rejected offers; see the sketch after this list

  • Allocation

    • Assumption: Most tasks are short -> resources are reallocated as tasks finish

    • Long tasks: Mesos may revoke (kill) tasks from a framework that is using more than its guaranteed allocation

      • E.g., MPI has guaranteed allocation of 8 machines; currently assigned 12 machines; can take away 4 machines

  • Isolation

    • Containers (Docker/Linux cgroups)
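
A rough sketch of how rejection, filters, and revocation fit together (plain Python, not the real API; the filter format and the allocation numbers are invented, reusing the MPI example from the bullets above):

```python
# A framework installs a filter so the master stops sending offers it would reject anyway,
# e.g. "only offer me nodes with at least 4 free CPUs".
filters = {"mpi": {"min_cpus": 4}, "hadoop": {}}

def should_offer(framework, offer_cpus):
    """Master-side short-circuit: skip offers the framework's filter rules out."""
    return offer_cpus >= filters.get(framework, {}).get("min_cpus", 0)

assert should_offer("mpi", 2) is False     # filtered out, never sent
assert should_offer("hadoop", 2) is True

# Revocation: tasks above the guaranteed allocation may be killed under pressure.
guaranteed = {"mpi": 8}    # machines guaranteed to the framework
allocated  = {"mpi": 12}   # machines currently assigned to it

def revocable_machines(framework):
    """How many machines Mesos could take back without breaking the guarantee."""
    return max(0, allocated[framework] - guaranteed.get(framework, 0))

assert revocable_machines("mpi") == 4      # 12 assigned - 8 guaranteed
```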

Fault tolerance

  • Node/task fails -> Mesos reports the failure to the affected framework (e.g., Hadoop) and lets it decide how to react

  • Master fails -> the new master recovers the (soft) state by communicating with framework schedulers and slaves

  • Hot standby masters; the active master is chosen via ZooKeeper leader election

  • Framework scheduler fails -> Mesos doesn't handle that
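
A minimal sketch of what recovering soft state after failover could look like (plain Python with illustrative names; the actual implementation elects the new master via ZooKeeper):

```python
class NewlyElectedMaster:
    """The master's state is soft: instead of reading a durable log, the new
    leader rebuilds its view from re-registration messages sent by slaves
    and framework schedulers after failover."""
    def __init__(self):
        self.slaves = {}   # slave_id -> resources
        self.tasks = {}    # framework_id -> list of running task ids

    def on_slave_reregistered(self, slave_id, resources, running_tasks):
        self.slaves[slave_id] = resources
        for framework_id, task_id in running_tasks:
            self.tasks.setdefault(framework_id, []).append(task_id)

    def on_framework_reregistered(self, framework_id):
        # The framework's scheduler reconnects; outstanding offers are simply
        # dropped and regenerated, since offers are soft state too.
        self.tasks.setdefault(framework_id, [])
```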

Placement preferences

  • Problem: More frameworks prefer a node than the node can accommodate. Which framework gets offers for it?

  • Lottery scheduling: offer the node to frameworks with probability proportional to their intended allocations
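
A minimal sketch of the weighted lottery (plain Python; the allocation shares are hypothetical):

```python
import random

# Intended allocations of three frameworks that all prefer the same node.
allocations = {"hadoop": 0.5, "spark": 0.3, "mpi": 0.2}

def pick_framework(allocations):
    """Offer the node to one framework, chosen with probability proportional to its allocation."""
    names = list(allocations)
    return random.choices(names, weights=[allocations[n] for n in names], k=1)[0]

# Over many offer rounds each framework gets roughly its share of the preferred node.
wins = [pick_framework(allocations) for _ in range(10_000)]
print({n: round(wins.count(n) / len(wins), 2) for n in allocations})
```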

Design choices & implications

  • Centralized vs. Distributed

    • Framework complexity: Every framework developer needs to implement a scheduler

    • Fragmentation, starvation: a framework with large per-task requirements may never see a big enough offer while small tasks keep absorbing freed resources. Partial workaround: a minimum offer size (small worked example after this list)

    • Inter-dependent framework constraints: e.g., two frameworks that must not be colocated (security, privacy, ...); such global constraints are hard to express through offers
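
A tiny worked example of the fragmentation problem and the minimum-offer-size workaround (all numbers hypothetical):

```python
# Slaves free resources in small chunks as short tasks finish.
freed_cpus_per_offer = 1        # typical offer without a minimum size
big_task_cpus = 4               # a framework whose tasks each need 4 CPUs

# Every 1-CPU offer is rejected, so the big-task framework can starve
# while small tasks keep grabbing the freed CPUs.
assert big_task_cpus > freed_cpus_per_offer

# Workaround: the master withholds offers for a slave until at least
# min_offer_cpus are free, so a usable offer eventually appears.
min_offer_cpus = 4
assert min_offer_cpus >= big_task_cpus
```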

Evaluations
