[2018 OSDI] Gandiva: Introspective Cluster Scheduling for Deep Learning
One-line Summary
The authors present Gandiva, a cluster scheduling framework for deep learning training (DLT) jobs that exploits the intra-job predictability of DLT jobs to apply techniques like time-slicing, migration, intra-job elasticity (grow-shrink), and dynamic priority.
Paper Structure Outline
Introduction
Background
DLT Job Characteristics
Sensitivity to locality
Sensitivity to interference
Intra-job predictability
Design
Mechanisms
Scheduling Policy
Reactive Mode
Introspective Mode
Implementation
Scheduler
Modifications to DL toolkits
Evaluation
Micro-benchmarks
Model exploration in a multi-job
Cluster experiments: time-slicing and packing
Cluster experiments: time-slicing and migration
Related Work
Conclusion
Background & Motivation
Today's cluster schedulers (e.g., YARN, Kubernetes) treat deep learning training (DLT) jobs naively, as if they were traditional big-data jobs: a job is scheduled on a set of GPUs exclusively and holds those GPUs until completion. This causes two problems:
High latency (head-of-line blocking): DLT jobs run for hours or days, so short jobs get stuck behind them. Time-slicing would help, but GPUs are not efficiently virtualizable.
Low efficiency (placement fixed at job-arrival time): Sensitivity to locality varies across jobs, and an initial placement may turn out to be poor, so the scheduler needs the ability to migrate jobs after the fact.
DLT jobs have the following characteristics:
Sensitivity to locality: Different models have various levels of sensitivity to intra-server and inter-server locality that a DLT scheduler needs to take into account.
Sensitivity to interference: Similarly, different models demonstrate different levels of sensitivity to interference between jobs.
Intra-job predictability: DLT jobs' GPU memory usage reveals a pattern (goes up during forward pass of a minibatch and goes down during backward pass). Gandiva leverages this in three ways:
A job can be split into mini-batch iterations
If suspend/resume is performed at the nadir, far less memory needs to be copied between GPU and CPU
The progress rate can be profiled to evaluate the effectiveness of mechanisms
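As a toy illustration of the nadir-aware suspend idea above (nothing here is Gandiva's actual code; the function names and memory figures are invented for illustration), a job's GPU memory within one mini-batch peaks mid-iteration and returns to model state only at the iteration boundary, so honoring a suspend request only at that boundary minimizes the GPU-to-CPU copy:

```python
# Sketch (not Gandiva's code): suspend a DLT job at the memory nadir
# between mini-batches, where only persistent model state (parameters +
# optimizer state), not transient activations, must be copied off the GPU.

MODEL_STATE_MB = 500        # persistent state (assumed figure)
ACTIVATION_PEAK_MB = 3500   # transient activations (assumed figure)

def gpu_memory_in_use(phase):
    """Memory pattern within one mini-batch: peaks mid-iteration,
    returns to the nadir (model state only) at the boundary."""
    return MODEL_STATE_MB + (ACTIVATION_PEAK_MB if phase == "mid_iteration" else 0)

def run_minibatches(num_iters, suspend_requested_at):
    """Run iterations; honor a suspend request only at the nadir."""
    for i in range(num_iters):
        # ... forward + backward pass happens here (peak memory) ...
        assert gpu_memory_in_use("mid_iteration") == 4000
        # Iteration boundary: activations freed, memory at its nadir.
        if i == suspend_requested_at:
            copied_mb = gpu_memory_in_use("boundary")  # only model state
            return f"suspended after iter {i}, copied {copied_mb} MB"
    return "completed"

print(run_minibatches(10, suspend_requested_at=3))
# → suspended after iter 3, copied 500 MB
```

Under these made-up numbers, suspending at the nadir copies 500 MB instead of the 4000 MB in use mid-iteration, which is the effect the paper exploits.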
Design and Implementation
Gandiva employs the following mechanisms:
Suspend-Resume and Packing
Suspend-Resume: Intra-job predictability is leveraged to suspend/resume DLT jobs when their GPU memory usage is at its lowest, i.e., at the mini-batch boundary.
Packing: Run multiple jobs on a GPU concurrently (rather than time-slicing them via suspend-resume), with the premise that the packed jobs fit in GPU memory and do not interfere with each other. Packing is only considered during overload.
Migration: The set of GPUs assigned to a job can be changed for (1) moving time-sliced jobs to vacated GPUs, (2) moving interfering jobs away from each other, and (3) doing de-fragmentation of the cluster. The migration overhead is as little as a second or two.
Grow-Shrink: The number of GPUs allotted to a job can be grown during idle times and shrunk when the load goes up.
Profiling: Gandiva profiles each job's time for one forward/backward pass over a mini-batch. With this, Gandiva introspects DLT jobs to estimate their rate of progress, e.g., to check whether packing helped.
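A minimal sketch of the kind of decision this profiling enables (the helper names are hypothetical, and throughput is simplified to 1 / mini-batch-time; this is not Gandiva's actual policy code). Packing two jobs wins over time-slicing them only if the combined packed throughput beats alternating exclusive access, where each time-sliced job runs at half speed:

```python
import time

def profile_minibatch_time(step_fn, warmup=2, iters=5):
    """Estimate per-mini-batch time by timing a few iterations
    (warmup iterations are discarded)."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters

def packing_helped(solo_times, packed_times):
    """Compare two jobs packed on one GPU against time-slicing them.
    Time-slicing halves each job's throughput, so packing helps iff:
        sum(1 / packed_t)  >  sum(1 / (2 * solo_t))
    """
    packed_tput = sum(1.0 / t for t in packed_times)
    sliced_tput = sum(1.0 / (2.0 * t) for t in solo_times)
    return packed_tput > sliced_tput
```

For example, if each job takes 1.0 s/mini-batch alone and 1.5 s when packed, packing helps (combined throughput 1.33 vs. 1.0 jobs/s); if packing slows each to 3.0 s, it does not, and the scheduler would unpack.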
Gandiva's scheduler works in two modes: reactive and introspective. The reactive mode handles events such as job arrivals/departures and machine failures, while the introspective mode monitors and optimizes job placement to improve the overall utilization and the completion time.
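The two-mode split can be sketched as a toy event loop (purely illustrative; the event format, round-robin queue, and log strings are invented, and real Gandiva also weighs locality, interference, and profiling data). Reactive handling places arrivals and frees GPUs on departures; the introspective-style fixup here is moving a time-sliced job onto a vacated GPU, a simplified stand-in for migration:

```python
from collections import deque

def schedule(events, num_gpus):
    """Toy two-mode loop: reactive mode handles arrivals/departures;
    oversubscribed jobs queue for time-slicing and are moved onto
    vacated GPUs (standing in for Gandiva's migration mechanism)."""
    free = list(range(num_gpus))
    placement = {}          # job -> gpu
    waiting = deque()       # jobs awaiting a GPU (time-slicing pool)
    log = []
    for kind, job in events:
        if kind == "arrive":                  # reactive: new job
            if free:
                placement[job] = free.pop()
                log.append(f"{job} -> gpu{placement[job]}")
            else:
                waiting.append(job)           # overload: time-slice
                log.append(f"{job} queued for time-slicing")
        elif kind == "depart":                # reactive: job finished
            gpu = placement.pop(job)
            if waiting:                       # introspective-style fixup:
                nxt = waiting.popleft()       # migrate a time-sliced job
                placement[nxt] = gpu          # onto the vacated GPU
                log.append(f"{nxt} migrated -> gpu{gpu}")
            else:
                free.append(gpu)
    return log
```

With 2 GPUs and events `arrive A, arrive B, arrive C, depart A`, job C first queues for time-slicing and is then moved onto A's freed GPU, mirroring the "move time-sliced jobs to vacated GPUs" use of migration above.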
Evaluation
New Vocabulary
Introspection (反省): The examination of one's own conscious thoughts and feelings.