[2019 ATC] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
One-line Summary
This paper presents a characterization study of a large-scale, multi-tenant GPU cluster (Microsoft Philly) used for DNN training. It uncovers inefficiencies in cluster utilization caused by gang scheduling, locality requirements, and job failures, and distills lessons for better cluster scheduler design.
This paper means a lot to me in that this is the first paper that I read through thoroughly :)
Paper Structure Outline
Introduction
Philly: System Overview
Workloads
Cluster Architecture
Job Scheduling and Execution Workflow
Data Collection and Analysis
Impact of Locality Awareness
Queueing Delays
Impact of Locality-Driven Scheduling
GPU utilization
Impact of Distributed Learning
Training Progress and Completion
Effectiveness of Training Iterations
Job Failures
Failure Classification
Failure Frequency
Runtime to Failure
Impact of GPUs Allocated
Design Implications for Future Schedulers
Related Work
Conclusion
Background & Motivation
DNN-based workloads are different from traditional big data analytics workloads in two ways:
Cluster utilization: GPUs represent a monolithic resource that cannot be shared at a fine granularity across users
Workload: Deep learning frameworks require gang scheduling (all requested GPUs must be available simultaneously), reducing scheduling flexibility and the jobs' elasticity to runtime failures
The authors first present an overview of Philly, a large, multi-tenant GPU-based cluster for production-scale deep learning tasks. Then, they present a detailed workload characterization and study how factors like gang scheduling, locality requirements, and failures might affect cluster utilization.
Microsoft Philly
The three main steps in Fig. 1 are:
Incoming jobs and queueing: The scheduler needs to perform gang scheduling while being locality-aware. Each production group is given a virtual cluster and a quota (the number of GPUs allotted to that virtual cluster).
Job placement and utilization: The scheduler aims to maximize locality and minimize fragmentation of resources (caused by smaller jobs, e.g., 1-GPU jobs). There is a trade-off between colocation and distribution, though, as placing different jobs on the same server can lower GPU utilization because of interference on shared resources like RDMA and PCIe (a minimal placement sketch follows this list).
Training progress and completion: Jobs can finish with three statuses: passed, killed, or unsuccessful. Failed jobs are retried a few times to overcome non-deterministic failures.
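To make the placement step concrete, here is a minimal sketch of locality-aware gang placement, assuming a simple view where each server reports its count of free GPUs. The names (`place_job`, `free_gpus`) are made up for illustration; this is not Philly's actual scheduler code.

```python
from typing import Dict, List, Optional


def place_job(free_gpus: Dict[str, int], gpus_needed: int) -> Optional[List[str]]:
    """Return a server assignment (one entry per GPU), or None if the gang
    cannot be scheduled right now (gang scheduling is all-or-nothing)."""
    # Prefer servers with the most free GPUs: this maximizes intra-server
    # locality and reduces fragmentation caused by small (e.g., 1-GPU) jobs.
    servers = sorted(free_gpus, key=lambda s: free_gpus[s], reverse=True)

    placement: List[str] = []
    remaining = gpus_needed
    for server in servers:
        take = min(free_gpus[server], remaining)
        placement.extend([server] * take)
        remaining -= take
        if remaining == 0:
            return placement
    return None  # not enough free GPUs anywhere; the job keeps queueing
```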
The logs are collected over a 75-day period and cover 96,260 jobs across 14 virtual clusters. There are three main sources of logs:
YARN scheduler log: job arrival time, # GPUs requested, GPU allocation status, job finish status
stdout & stderr logs from ML frameworks
Ganglia monitoring system log: Per-minute statistics on hardware usage (CPU, memory, network, GPU utilization)
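To give a flavor of how these sources can be combined for the analysis, here is a minimal sketch that joins scheduler job records with per-minute Ganglia GPU statistics to compute per-job average GPU utilization. The file names and column names (`yarn_jobs.csv`, `gpu_util`, etc.) are assumptions for illustration, not the actual log schema.

```python
import pandas as pd

# Hypothetical exports of the log sources (schemas assumed for illustration).
jobs = pd.read_csv("yarn_jobs.csv", parse_dates=["start", "end"])   # YARN scheduler log
util = pd.read_csv("ganglia_gpu.csv", parse_dates=["timestamp"])    # per-minute GPU stats

# Join utilization samples to the job running on that server, keep only
# samples inside the job's lifetime, then average per job.
merged = util.merge(jobs, on="server")
in_window = merged[(merged["timestamp"] >= merged["start"]) &
                   (merged["timestamp"] <= merged["end"])]
per_job_util = in_window.groupby("job_id")["gpu_util"].mean()
print(per_job_util.describe())
```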
Impact of Locality Awareness
Queueing Delays
GPU Utilization
Relaxing locality constraints:
High intra-server locality
Pros
High communication efficiency
Cons
Long queueing time
Low intra-server locality
Pros
Low queueing time
Cons
Contention for shared network bandwidth
Risk of intra-server interference (across jobs)
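One way to navigate this trade-off, as discussed in the paper, is to relax a job's locality constraint the longer it waits in the queue. Below is a minimal sketch of such a delay-based relaxation policy; the thresholds and the `PendingJob`/`required_locality` names are made up and do not reflect the scheduler's real knobs.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    gpus_needed: int
    queued_minutes: float


def required_locality(job: PendingJob) -> int:
    """Maximum number of servers this job may be spread across right now.

    Start strict (fewest servers) and progressively relax the constraint the
    longer the job waits, trading communication efficiency for shorter
    queueing delay."""
    if job.queued_minutes < 30:
        return 1                    # insist on a single server
    if job.queued_minutes < 120:
        return 2                    # allow spanning two servers
    return job.gpus_needed          # fully relaxed: any feasible placement
```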
Training Progress and Completion
As shown in the paper's job-status breakdown, a notable fraction of jobs end up killed or unsuccessful. It is important to understand the reasons behind these failures, as fewer unsuccessful jobs would mean more resources available for successful jobs.
Training Iterations
~80% of passed jobs need all of their epochs to reach their lowest loss. However, on average 62% (for passed jobs) and 56% (for killed jobs) of a job's GPU time is spent improving convergence accuracy by merely 0.1%. This suggests that jobs could be terminated early to save considerable resources.
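A minimal sketch of the early-termination idea: stop a job once the best observed loss has improved by less than 0.1% (relative) over a recent window of epochs. The window size and the `should_stop` helper are illustrative assumptions, not the paper's mechanism.

```python
from typing import List


def should_stop(loss_history: List[float], window: int = 5,
                rel_improvement: float = 0.001) -> bool:
    """Return True if the best loss improved by less than `rel_improvement`
    (0.1% by default) over the last `window` epochs."""
    if len(loss_history) <= window:
        return False
    best_before = min(loss_history[:-window])   # best loss up to `window` epochs ago
    best_now = min(loss_history)                # best loss seen overall
    return (best_before - best_now) / best_before < rel_improvement
```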
Job Failures
The failures happen across the whole stack: infrastructure (GPU, HDFS, resource scheduler), ML frameworks (PyTorch, TensorFlow), and user programs (shitty code :P). A failure classifier over the collected logs is used to analyze the causes of job failures. The most important failure categories are:
Incorrect inputs: Model files or input data stored in the external HDFS storage cannot be read
Semantic error: Errors that happen due to a library version mismatch or other dependencies of the user training program not being set up correctly
Model checkpoint error: The job is not able to successfully create a model checkpoint after a certain number of epochs complete. This is usually due to either a transient HDFS error or HDFS NameNode recovery
MPI runtime failure: This is usually due to either a failure of network connection to peer MPI process, or possibly an internal failure of the MPI daemon itself
Job preempted: YARN reclaims any GPU currently in use to schedule another job
Invalid memory access: The training job dies because of an invalid access to its memory address space, e.g., dereferencing an invalid pointer or a race condition while copying data. This failure is observed both in CPU memory and in memory allocated for GPU access
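A minimal sketch of how such a signature-based classifier over stdout/stderr might look; the regex patterns below are illustrative guesses, not the paper's actual rules.

```python
import re

# Illustrative signatures only; the real classifier's rules are not public.
FAILURE_SIGNATURES = {
    "incorrect_inputs":       r"No such file or directory|HDFS.*read failed",
    "semantic_error":         r"ImportError|AttributeError|version mismatch",
    "model_checkpoint_error": r"checkpoint.*(failed|error)|NameNode",
    "mpi_runtime_failure":    r"MPI_ABORT|connection to peer.*(lost|refused)",
    "invalid_memory_access":  r"Segmentation fault|illegal memory access",
}


def classify_failure(stderr_text: str) -> str:
    """Map a job's stderr output to one of the failure categories above."""
    for category, pattern in FAILURE_SIGNATURES.items():
        if re.search(pattern, stderr_text, flags=re.IGNORECASE):
            return category
    return "unknown"
```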
An analysis of the failure frequency shows that:
Failures repeat for the same job/user
User/programming errors lead to a lot of failures
An analysis of the runtime to failure (RTF) shows that:
RTF exhibits high variability with many short RTFs
Infrastructure failures occur infrequently but have much longer RTF
The authors also find that large jobs with programming semantic errors tend to fail long after they start executing.
Guidelines
Prioritize locality: As the lack of locality impacts both utilization and job runtime, and because DNN training jobs are long-running, schedulers should trade queueing delay for adhering to locality constraints.
Mitigate interference: As different jobs on a single server might interfere with each other, schedulers should aim to isolate the jobs on dedicated servers while implementing techniques like migration for defragmentation to support the locality constraints of jobs that need more GPUs.
Improve failure handling: To catch failures early and avoid wasting cluster resources, each incoming job could first be run on a small dedicated pool of servers (or a single GPU) to catch simple programming and configuration errors before the full multi-GPU job is scheduled. Another possible improvement is for clusters to predictively mitigate failures by learning from previously observed failures: for example, the scheduler should stop retrying failure categories like incorrect data input but continue retrying transient ones like network timeouts (a retry-policy sketch follows).
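For the failure-handling guideline, a minimal sketch of a category-aware retry policy; the categories reuse the classifier sketch above, and the retry limit and retryable/non-retryable split are assumptions.

```python
# Deterministic user errors: retrying will not help, so give up immediately.
NON_RETRYABLE = {"incorrect_inputs", "semantic_error"}
MAX_RETRIES = 3   # small retry budget for transient failures


def should_retry(failure_category: str, attempts_so_far: int) -> bool:
    """Stop retrying deterministic errors; keep retrying transient ones
    (e.g., MPI/network timeouts, HDFS hiccups) up to a small limit."""
    if failure_category in NON_RETRYABLE:
        return False
    return attempts_so_far < MAX_RETRIES
```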