[2018 OSDI] Gandiva: Introspective Cluster Scheduling for Deep Learning
The authors present Gandiva, a cluster scheduling framework that employs techniques like time-slicing, migration, intra-job elasticity, and dynamic priority.
- 1. Introduction
- 2. Background
- 3. DLT Job Characteristics
  - 1. Sensitivity to locality
  - 2. Sensitivity to interference
  - 3. Intra-job predictability
- 4. Design
  - 1. Mechanisms
  - 2. Scheduling Policy
    - 1. Reactive Mode
    - 2. Introspective Mode
- 5. Implementation
  - 1. Scheduler
  - 2. Modifications to DL toolkits
- 6. Evaluation
  - 1. Micro-benchmarks
  - 2. Model exploration in a multi-job
  - 3. Cluster experiments: time-slicing and packing
  - 4. Cluster experiments: time-slicing and migration
- 7. Related Work
- 8. Conclusion
Today's cluster schedulers (e.g., YARN, Kubernetes) treat deep learning training (DLT) jobs naively, as if they were traditional big-data jobs: a job is scheduled on a set of GPUs exclusively and holds those GPUs until completion. This causes two problems:
- 1. High latency (head-of-line blocking): long DLT jobs run for hours or days, so short jobs queue behind them. This calls for time-slicing of jobs, but GPUs are not efficiently virtualizable.
- 2. Low efficiency (placement fixed at job start): sensitivity to locality varies across jobs, and a poor initial placement cannot be corrected without the ability to migrate jobs.
DLT jobs have the following characteristics:
- 1. Sensitivity to locality: different models have varying levels of sensitivity to intra-server and inter-server locality, which a DLT scheduler needs to take into account.
- 2. Sensitivity to interference: similarly, different models show different levels of sensitivity to interference from co-located jobs.
- 3. Intra-job predictability: a DLT job's GPU memory usage follows a cyclic pattern — it rises during the forward pass of a minibatch and falls during the backward pass. Gandiva leverages this in three ways:
  - 1. A job can be split into minibatch iterations.
  - 2. If suspend/resume is performed at the memory nadir (the minibatch boundary), much less memory needs to be copied from GPU to CPU.
  - 3. The per-minibatch progress rate can be profiled to evaluate the effectiveness of each mechanism.
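As a toy illustration of the second point, a training loop can honor suspend requests only at minibatch boundaries, where GPU memory is at its nadir and only model/optimizer state (not activations) would need copying to the CPU. All names below are hypothetical, not Gandiva's actual interface:

```python
# Toy sketch (hypothetical names, not Gandiva's real API): a training loop
# that acts on suspend requests only at minibatch boundaries, i.e. at the
# memory nadir, so only weights/optimizer state must move GPU -> CPU.

class SuspendableJob:
    def __init__(self):
        self.suspend_requested = False
        # Persistent state that would stay on the GPU between minibatches.
        self.state = {"weights": [0.0] * 4, "step": 0}

    def run_minibatch(self):
        # Forward pass allocates activations (memory rises);
        # backward pass frees them (memory falls back to the nadir).
        self.state["step"] += 1

    def train(self, max_steps, suspend_after=None):
        for _ in range(max_steps):
            self.run_minibatch()
            if suspend_after is not None and self.state["step"] >= suspend_after:
                self.suspend_requested = True
            # Only now, at the nadir, is it cheap to checkpoint and yield.
            if self.suspend_requested:
                return "suspended", dict(self.state)
        return "done", dict(self.state)
```

In a real system the scheduler-side agent would set `suspend_requested` asynchronously; `suspend_after` merely simulates that signal here.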


When suspending a job, as GPUs are not efficiently virtualizable, the state needs to be moved from GPU to CPU before suspension.

Gandiva employs the following mechanisms:
- 1. Suspend-Resume and Packing
  - 1. Suspend-Resume: intra-job predictability is leveraged to suspend/resume DLT jobs when their GPU memory usage is at its lowest.
  - 2. Packing: run multiple jobs on a GPU simultaneously, letting the GPU multiplex them, on the premise that the packed jobs do not interfere with each other. Packing is only considered when the cluster is overloaded.
- 2. Migration: the set of GPUs assigned to a job can be changed, for (1) moving time-sliced jobs to vacated GPUs, (2) moving interfering jobs away from each other, and (3) de-fragmenting the cluster. Migration overhead is as little as a second or two.
- 3. Grow-Shrink: the number of GPUs available to a job can be grown during idle times and shrunk when the load goes up.
- 4. Profiling: Gandiva profiles each job's time for one forward/backward pass over a minibatch. With this, Gandiva introspects DLT jobs to estimate their rate of progress, e.g., to check whether packing helped.
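The profiling mechanism can be made concrete with a small sketch: compare per-minibatch progress rates before and after packing two jobs, and unpack if the combined packed throughput falls below what one job achieved alone. This is an illustrative policy, a simplification of Gandiva's actual rule:

```python
def minibatch_rate(iter_times):
    """Minibatches per second, from a window of per-iteration timings."""
    return len(iter_times) / sum(iter_times)

def packing_decision(solo_times, packed_times_per_job):
    """Illustrative introspection rule (not Gandiva's exact policy):
    keep two jobs packed on one GPU only if their combined throughput
    beats one job running alone, which is roughly the total that
    time-slicing the GPU between them would deliver."""
    solo = minibatch_rate(solo_times)
    packed = sum(minibatch_rate(t) for t in packed_times_per_job)
    return "keep_packed" if packed > solo else "unpack"
```

Because progress is measured in minibatches per second, the comparison works without knowing anything model-specific about the jobs.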
Gandiva's scheduler works in two modes: reactive and introspective. The reactive mode handles events such as job arrivals/departures and machine failures, while the introspective mode monitors and optimizes job placement to improve the overall utilization and the completion time.
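One introspective pass might, for example, move a time-sliced job onto a GPU vacated by a departure. A minimal sketch, with a hypothetical data layout rather than Gandiva's implementation:

```python
def introspective_pass(gpus):
    """Move one job from each oversubscribed (time-sliced) GPU to a free
    GPU, while both kinds remain. `gpus` is a list of dicts of the form
    {"id": ..., "jobs": n} (an illustrative cluster model)."""
    free = [g for g in gpus if g["jobs"] == 0]
    crowded = [g for g in gpus if g["jobs"] > 1]
    moves = []
    while free and crowded:
        src, dst = crowded.pop(), free.pop()
        src["jobs"] -= 1
        dst["jobs"] += 1
        moves.append((src["id"], dst["id"]))
        if src["jobs"] > 1:
            crowded.append(src)  # still oversubscribed; retry if GPUs free up
    return moves
```

The reactive mode would run the same kind of placement logic, but triggered by arrival/departure/failure events rather than by a periodic monitoring loop.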


Microbenchmark: Time-slicing

Microbenchmark: Packing

Microbenchmark for AutoML: Gandiva provides much faster hyper-parameter exploration

Cluster utilization
- Introspection: the examination of one's own conscious thoughts and feelings.