[2019 ATC] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

One-line Summary

This paper presents a characterization study of large-scale GPU clusters for DNN training. It uncovers inefficiencies in cluster utilization and distills guidelines for better cluster scheduler design.

This paper means a lot to me in that this is the first paper that I read through thoroughly :)

Paper Structure Outline

  1. Introduction

  2. Philly: System Overview

    1. Workloads

    2. Cluster Architecture

    3. Job Scheduling and Execution Workflow

    4. Data Collection and Analysis

  3. Impact of Locality Awareness

    1. Queueing Delays

      1. Impact of Locality-Driven Scheduling

    2. GPU utilization

      1. Impact of Distributed Learning

  4. Training Progress and Completion

    1. Effectiveness of Training Iterations

    2. Job Failures

      1. Failure Classification

      2. Failure Frequency

      3. Runtime to Failure

      4. Impact of GPUs Allocated

  5. Design Implications for Future Schedulers

  6. Related Work

  7. Conclusion

Background & Motivation

DNN-based workloads are different from traditional big data analytics workloads in two ways:

  1. Cluster utilization: GPUs represent a monolithic resource that cannot be shared at a fine granularity across users

  2. Workload: Deep learning frameworks require gang scheduling, which reduces scheduling flexibility and the jobs' elasticity to runtime failures

The authors first present an overview of Philly, a large, multi-tenant GPU-based cluster for production-scale deep learning tasks. Then, they present a detailed workload characterization and study how factors like gang scheduling, locality requirements, and failures might affect cluster utilization.

Microsoft Philly

The three main steps of a job's lifecycle in Philly (Fig. 1 in the paper) are:

  1. Incoming jobs and queueing: The scheduler needs to perform gang scheduling while being locality-aware. Each production group is provided with a virtual cluster and a quota (a number of GPUs assigned to that virtual cluster).

  2. Job placement and utilization: The scheduler aims to maximize locality and minimize fragmentation of resources (from smaller jobs, e.g. 1-GPU jobs). There is a trade-off between colocation and distribution, though, as placing different jobs on the same server could lead to lower GPU utilization (because of interference in shared resources like RDMA and PCIe).

  3. Training progress and completion: Jobs can finish with three statuses: passed, killed, or unsuccessful. Failed jobs are retried a few times to overcome non-deterministic failures.

The logs are collected over a 75-day period and consist of 96,260 jobs across 14 virtual clusters. There are three main sources of logs (a toy queueing-delay computation over such a trace is sketched after the list):

  1. YARN scheduler log: job arrival time, # GPUs requested, GPU allocation status, job finish status

  2. stdout & stderr logs from ML frameworks

  3. Ganglia monitoring system log: Per-minute statistics on hardware usage (CPU, memory, network, GPU utilization)
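
As a rough illustration of what can be derived from the scheduler log, the snippet below computes per-job queueing delay (time from submission to GPU allocation) with pandas. The column names and values are hypothetical and simplified; the published Philly trace uses its own schema.

```python
import pandas as pd

# Hypothetical, simplified schema -- the real Philly trace on GitHub
# uses different file layouts and field names.
jobs = pd.DataFrame({
    "job_id":      ["a", "b", "c"],
    "submit_time": pd.to_datetime(["2017-10-01 08:00", "2017-10-01 08:05", "2017-10-01 09:00"]),
    "start_time":  pd.to_datetime(["2017-10-01 08:02", "2017-10-01 08:30", "2017-10-01 09:01"]),
    "num_gpus":    [1, 8, 4],
    "status":      ["Pass", "Killed", "Failed"],
})

# Queueing delay = time between job submission and GPU allocation.
jobs["queue_delay_min"] = (jobs["start_time"] - jobs["submit_time"]).dt.total_seconds() / 60

# Grouping by job size loosely mirrors the paper's observation that jobs
# asking for more GPUs tend to wait longer under gang scheduling.
print(jobs.groupby("num_gpus")["queue_delay_min"].median())
```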

Impact of Locality Awareness

Queueing Delays

There are two types of queueing delays: fair-share delay, caused by waiting for a fair share of resources (common in conventional data analytics clusters), and fragmentation delay, caused by locality requirements and resource fragmentation (more prevalent in DL clusters). In practice, the scheduler trades off locality for lower scheduling delay.

GPU Utilization

GPU utilization is low, and even lower for distributed training, because of (1) distribution across servers and (2) intra-server interference. In general, DL training jobs underutilize GPU processing cycles regardless of their job sizes.

Relaxing locality constraints involves the following trade-offs (a small placement sketch follows the list):

  • High intra-server locality

    • Pros

      • High communication efficiency

    • Cons

      • Long queueing time

  • Low intra-server locality

    • Pros

      • Low queueing time

    • Cons

      • Contention in the use of network

      • Risk of intra-server interference (across jobs)
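
Philly's scheduler resolves this trade-off by preferring intra-server locality at first and relaxing the constraint as queueing delay grows. Below is a minimal, hypothetical sketch of that decision; the data structures and the 30-minute threshold are my own illustration, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    free_gpus: int

def place_job(gpus_needed: int, waited_minutes: float,
              servers: list[Server], relax_after_min: float = 30.0):
    """Toy placement logic: insist on intra-server locality at first,
    then relax to a multi-server spread once the job has queued too long.
    The 30-minute threshold is made up for illustration."""
    # Preferred: a single server with enough free GPUs (high locality).
    for s in servers:
        if s.free_gpus >= gpus_needed:
            return [(s.name, gpus_needed)]

    # Relaxed: spread the gang across servers, accepting network
    # contention and possible intra-server interference.
    if waited_minutes >= relax_after_min:
        placement, remaining = [], gpus_needed
        for s in sorted(servers, key=lambda s: -s.free_gpus):
            if remaining == 0:
                break
            take = min(s.free_gpus, remaining)
            if take > 0:
                placement.append((s.name, take))
                remaining -= take
        if remaining == 0:
            return placement

    return None  # keep queueing (gang scheduling: all GPUs or nothing)

# Example: an 8-GPU job that has waited 45 minutes on a fragmented cluster.
servers = [Server("s1", 4), Server("s2", 3), Server("s3", 2)]
print(place_job(8, waited_minutes=45, servers=servers))
```

Either way the job is gang-scheduled: it only runs once all requested GPUs can be allocated at the same time.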

Training Progress and Completion

A significant fraction (30.7%) of jobs are either terminated by users or are unsuccessful, and these jobs account for ~55% of the total GPU time. It is therefore important to understand the reasons behind these failures, as fewer unsuccessful jobs would mean more resources for successful jobs.

Training Iterations

~80% of passed jobs need all of their epochs to reach their lowest loss. However, on average 62% and 56% of a job's GPU time (for passed and killed jobs, respectively) is spent improving the convergence accuracy by merely 0.1%. This suggests that jobs could be terminated early to save considerable resources.
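
A scheduler (or user) could exploit this with a convergence-based early-termination rule: kill a job once its recent loss improvement falls below a small threshold. The sketch below is a hedged illustration; the 0.1% figure comes from the observation above, while the patience window is made up.

```python
def should_terminate_early(epoch_losses, min_rel_improvement=0.001, patience=5):
    """Toy early-termination rule: stop a job once the best loss has improved
    by less than 0.1% (min_rel_improvement) over the last `patience` epochs.
    Thresholds here are illustrative, not the paper's."""
    if len(epoch_losses) <= patience:
        return False
    best_recent = min(epoch_losses[-patience:])
    best_before = min(epoch_losses[:-patience])
    rel_improvement = (best_before - best_recent) / best_before
    return rel_improvement < min_rel_improvement

# Example: the loss has flattened out over the last 5 epochs.
losses = [1.00, 0.60, 0.45, 0.40, 0.399, 0.3989, 0.3988, 0.3988, 0.3987, 0.3987]
print(should_terminate_early(losses))  # True
```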

Job Failures

The failures happen across the whole stack: infrastructure (GPU, HDFS, resource scheduler), ML frameworks (PyTorch, TensorFlow), and user programs (shitty code :P). A failure classifier is used to analyze the causes of job failures (a toy sketch of such a classifier follows the list below). The most important failure categories are:

  1. Incorrect inputs: Model files or input data stored in the external HDFS storage cannot be read

  2. Semantic error: Errors that happen due to a library version mismatch or other dependencies of the user training program not being set up correctly

  3. Model checkpoint error: The job is not able to successfully create a model checkpoint after a certain number of epochs complete. This is usually due to either transient error in HDFS or HDFS name node recovery

  4. MPI runtime failure: This is usually due to either a failure of network connection to peer MPI process, or possibly an internal failure of the MPI daemon itself

  5. Job preempted: YARN reclaims any GPU currently in use to schedule another job

  6. Invalid memory access: The training job dies because of an invalid access in its memory address space, e.g., using an invalid pointer value or a race condition while copying data. This failure is observed in both CPU memory and memory allocated for GPU access
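
The paper builds its failure classifier from job logs; as a rough approximation, such a classifier can be thought of as matching known signatures against the tail of a job's stderr. The categories below mirror the list above, but the regular expressions are illustrative guesses, not the paper's actual signatures.

```python
import re

# Illustrative log signatures only -- the paper's classifier uses its own
# signatures mined from stdout/stderr and infrastructure logs.
FAILURE_PATTERNS = [
    ("incorrect inputs",       re.compile(r"No such file or directory|HDFS.*read failed", re.I)),
    ("semantic error",         re.compile(r"ImportError|ModuleNotFoundError|version mismatch", re.I)),
    ("model checkpoint error", re.compile(r"checkpoint.*(failed|error)", re.I)),
    ("MPI runtime failure",    re.compile(r"MPI_ABORT|connection to peer.*lost", re.I)),
    ("job preempted",          re.compile(r"preempted by scheduler", re.I)),
    ("invalid memory access",  re.compile(r"Segmentation fault|CUDA.*illegal memory access", re.I)),
]

def classify_failure(stderr_tail: str) -> str:
    """Return the first matching failure category, else 'unknown'."""
    for category, pattern in FAILURE_PATTERNS:
        if pattern.search(stderr_tail):
            return category
    return "unknown"

print(classify_failure("RuntimeError: CUDA error: an illegal memory access was encountered"))
# -> invalid memory access
```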

An analysis of failure frequency shows that:

  1. Failures repeat for the same job/user

  2. User/programming errors lead to a lot of failures

An analysis of the runtime to failure (RTF) shows that:

  1. RTF exhibits high variability with many short RTFs

  2. Infrastructure failures occur infrequently but have much longer RTF

The authors also find that large jobs with programming semantic errors tend to fail a while after execution starts.

Guidelines

  1. Prioritize locality: As the lack of locality impacts both utilization and job runtime, and because DNN training jobs are long-running, schedulers should trade queueing delay for adhering to locality constraints.

  2. Mitigate interference: As different jobs on a single server might interfere with each other, schedulers should aim to isolate the jobs on dedicated servers while implementing techniques like migration for defragmentation to support the locality constraints of jobs that need more GPUs.

  3. Improve failure handling: To catch failures early and avoid wasting cluster resources, each incoming job could first be run on a small dedicated pool of servers (or a single GPU) to surface simple programming and configuration errors before the full multi-GPU job is scheduled. Clusters could also predictively mitigate failures by proactively observing related failures: for example, the scheduler should stop retrying for failure categories like incorrect data input but keep retrying after network timeouts (a toy retry policy is sketched after this list).
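
As a toy illustration of the retry part of guideline 3, the policy below retries only categories that are plausibly transient and fails fast on deterministic user errors. The category names follow the hypothetical classifier sketch above and the retry limit is arbitrary.

```python
# Hypothetical retry policy: retry only plausibly transient failures.
TRANSIENT = {"MPI runtime failure", "model checkpoint error", "job preempted"}
NON_TRANSIENT = {"incorrect inputs", "semantic error", "invalid memory access"}

def should_retry(category: str, attempts: int, max_attempts: int = 3) -> bool:
    if category in NON_TRANSIENT:
        return False          # fail fast: retrying only wastes GPU time
    if category in TRANSIENT:
        return attempts < max_attempts
    return attempts < 1       # unknown category: give it one more chance

print(should_retry("incorrect inputs", attempts=1))     # False
print(should_retry("MPI runtime failure", attempts=1))  # True
```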

Links


  • Paper PDF
  • Full presentation video at USENIX ATC '19
  • Lightning talk at USENIX '19
  • Full presentation slides
  • Lightning talk slides
  • Philly traces on GitHub
  • Project Fiddle