[2021 OSDI] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
One-line Summary
Paper Structure Outline
Background & Motivation
Motivation: System Throughput & Statistical Efficiency, Dynamicity in DL Training Jobs



Background: Existing DL Schedulers
Design
Goodput = Throughput * Statistical Efficiency
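A minimal sketch of the definition above, assuming throughput is measured in training samples per second and statistical efficiency is a fraction between 0 and 1. The function name `goodput`, the candidate configurations, and every number are hypothetical and chosen only for illustration; none of them come from the paper. The sketch shows why a scheduler that optimizes goodput may prefer a configuration with lower raw throughput.

```python
def goodput(throughput: float, efficiency: float) -> float:
    """Goodput = throughput * statistical efficiency.

    throughput: samples processed per second (assumed unit).
    efficiency: fraction of each sample's work that counts as useful
                training progress, assumed to lie in [0, 1].
    """
    if throughput < 0:
        raise ValueError("throughput must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be in [0, 1]")
    return throughput * efficiency


# Hypothetical candidates: name -> (throughput in samples/s, efficiency).
candidates = {
    "small_batch": (600.0, 1.0),   # slower, but every sample is fully useful
    "large_batch": (1000.0, 0.5),  # faster, but each sample contributes less
}

# Pick the configuration with the highest goodput, not the highest throughput.
best = max(candidates, key=lambda name: goodput(*candidates[name]))

for name, (tput, eff) in candidates.items():
    print(f"{name}: goodput = {goodput(tput, eff)} useful samples/s")
print("best:", best)
```

With these illustrative numbers the large-batch configuration has the higher throughput, yet the small-batch configuration has the higher goodput.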

Modeling Statistical Efficiency



Modeling System Throughput



Implementation

Job-level optimization: PolluxAgent
Cluster-wide optimization: PolluxSched

Evaluation



Links