[2019 NSDI] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
One-line Summary
Paper Structure Outline
Background & Motivation
Unpredictable Training Time

Over-Aggressive Job Consolidation
Preemption is Costly

Design and Implementation

Scheduling


Priority Discretization

Placement


Evaluation



New Vocabulary
Links