[2021 VLDB] Analyzing and Mitigating Data Stalls in DNN Training
One-line Summary
Paper Structure Outline
Background & Motivation

Design and Implementation
DS-Analyzer: Perform predictive what-if analysis of data stalls
CoorDL: Mitigating data stalls
MinIO: DNN-aware software caching to reduce cache misses per epoch (benefits single-server training)

Partitioned caching to coordinate remote MinIO caches (benefits distributed training)
Coordinated prep to eliminate redundant fetch & prep across jobs (benefits hyperparameter search)
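To make the MinIO item above concrete, here is a minimal sketch. It assumes the policy is "fill the cache once, then never evict", which suits training because each epoch reads every item exactly once in random order. The names `NoEvictionCache`, `LRUCache`, and `misses_per_epoch` are hypothetical and are not CoorDL's API. The LRU cache is included only as a baseline for comparison.

```python
import random
from collections import OrderedDict


class NoEvictionCache:
    """Assumed MinIO-style policy: admit items until full, then never replace them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = set()

    def access(self, key):
        """Return True on a hit, False on a miss."""
        if key in self.items:
            return True
        if len(self.items) < self.capacity:
            self.items.add(key)  # admit only while there is free space
        return False


class LRUCache:
    """Baseline: evict the least recently used item when over capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return True
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry
        return False


def misses_per_epoch(cache, dataset_size, epochs, seed=0):
    """Read every item once per epoch in a fresh random order; count misses."""
    rng = random.Random(seed)
    order = list(range(dataset_size))
    result = []
    for _ in range(epochs):
        rng.shuffle(order)
        result.append(sum(not cache.access(key) for key in order))
    return result


if __name__ == "__main__":
    print("no-eviction:", misses_per_epoch(NoEvictionCache(40), 100, 4))
    print("LRU:        ", misses_per_epoch(LRUCache(40), 100, 4))
```

Under this access pattern, at most `capacity` items can hit in any epoch, because an item admitted during an epoch is not read again until the next one. The no-eviction cache reaches that bound after the warm-up epoch (dataset size minus cache size misses per epoch), while LRU can evict items before they are reused.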
Evaluation


Links