[2020 OSDI] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning

Summary

AntMan is a cluster scheduler for GPU sharing. It introduces two techniques, dynamic memory scaling and opportunistic computation management, to co-locate multiple jobs on the same GPU while limiting interference between them.

Background & Motivation

GPUs in a shared cluster are under-utilized, in terms of both SMs and GPU memory (GRAM). One reason is that multi-GPU jobs require gang scheduling, which leaves GPUs idle. Moreover, the resource demand of a DL training job changes over time.
Training jobs in the Alibaba cluster have the following characteristics:
- Small model size: Most GPU memory can be shared
- Short mini-batch: Fast resource coordination
- Similar mini-batch: Mini-batch time can be used to quantify inter-job interference
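
As a rough illustration of the last point, interference can be expressed as the relative increase in mini-batch time under co-location. The sketch below is only an assumption about how such a metric could be computed; the function name, the use of the median, and the sample timings are hypothetical and not taken from the paper.

```python
from statistics import median


def interference_slowdown(dedicated_times, shared_times):
    """Relative slowdown of mini-batch time when a job shares its GPU.

    Both arguments are lists of per-mini-batch durations in seconds.
    The median is used (an assumption) to reduce the effect of outliers.
    """
    baseline = median(dedicated_times)
    shared = median(shared_times)
    return shared / baseline - 1.0


# Hypothetical timings: the job alone vs. co-located with another job.
dedicated = [0.100, 0.102, 0.098, 0.101]
colocated = [0.120, 0.118, 0.125, 0.121]
slowdown = interference_slowdown(dedicated, colocated)
print(f"slowdown: {slowdown:.1%}")
```

Because mini-batch times are similar from one iteration to the next, a short window of samples is enough to compare the two settings.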

Design & Implementation

Dynamic Memory Scaling

AntMan dynamically co-locates jobs on shared GPUs. The goal is for resource-guarantee jobs to maintain the same performance as dedicated execution while co-locating opportunistic jobs to best utilize the resources.

AntMan monitors the memory usage of each DL job and sets a corresponding upper bound on its GPU memory, so that other jobs can use the spare memory. Because DL jobs have dynamic resource demands, a job may later need more memory than its current bound allows, which would otherwise cause out-of-memory (OOM) failures. In this case (Fig. 7a), the tensors that do not fit are temporarily placed in host (CPU) memory and moved back to GPU memory once the bound has been raised. The same mechanism applies when a job must shrink its GPU memory footprint to make room for other jobs (Fig. 7b).
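
The behaviour in Fig. 7 can be modelled with a toy memory pool. This is a minimal sketch under simplifying assumptions, not AntMan's actual implementation: the class `ScalableGpuMemory`, its methods, the newest-first eviction order, and the sizes are all hypothetical, and tensors are represented only by their size in MB.

```python
class ScalableGpuMemory:
    """Toy per-job memory pool with an adjustable GPU upper bound."""

    def __init__(self, limit_mb):
        self.limit_mb = limit_mb
        self.on_gpu = {}   # tensor name -> size in MB, resident on the GPU
        self.on_host = {}  # tensor name -> size in MB, spilled to host memory

    def gpu_used(self):
        return sum(self.on_gpu.values())

    def allocate(self, name, size_mb):
        # Place the tensor on the GPU if it fits under the bound;
        # otherwise spill it to host memory instead of failing with OOM.
        if self.gpu_used() + size_mb <= self.limit_mb:
            self.on_gpu[name] = size_mb
        else:
            self.on_host[name] = size_mb

    def set_limit(self, new_limit_mb):
        self.limit_mb = new_limit_mb
        # Shrink: evict the most recently allocated tensors to host memory
        # until GPU usage is back under the bound (as in Fig. 7b).
        for name in reversed(list(self.on_gpu)):
            if self.gpu_used() <= self.limit_mb:
                break
            self.on_host[name] = self.on_gpu.pop(name)
        # Grow: move spilled tensors back to the GPU if they now fit
        # (as in Fig. 7a).
        for name in list(self.on_host):
            size = self.on_host[name]
            if self.gpu_used() + size <= self.limit_mb:
                self.on_gpu[name] = self.on_host.pop(name)


pool = ScalableGpuMemory(limit_mb=1000)
pool.allocate("weights", 600)
pool.allocate("activations", 300)
pool.allocate("burst", 400)        # exceeds the bound, goes to host memory
spilled_after_burst = dict(pool.on_host)

pool.set_limit(1500)               # bound raised: "burst" moves back to GPU
gpu_after_grow = pool.gpu_used()

pool.set_limit(800)                # bound lowered: newest tensors are evicted
gpu_after_shrink = dict(pool.on_gpu)
```

In this model the job keeps running through both transitions; the cost is that tensors held in host memory are slower to access until they are moved back.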