[2021 FAST] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
One-line Summary
CheckFreq pipelines checkpointing with computation for automated, frequent, fine-grained checkpointing in DNN training.
Paper Structure Outline
Introduction
Background
The Current State of Checkpointing
Checkpointing is Incorrect
Checkpointing is Inefficient
Summary
CheckFreq: Design and Implementation
Goals
CheckFreq Recovery Guarantees
Design
Checkpointing Mechanism
Checkpointing Policy
Implementation
Evaluation
Experimental Setup
Accuracy Implications
Performance of Checkpointing Mechanism
Checkpoint Stalls
Breakdown of Benefits
Checkpointing Policy
Recovery Time
End-to-End Training
Discussion
Related Work
Conclusion
Background & Motivation
During DNN training, checkpointing is performed to ensure fault tolerance. Current checkpointing schemes are synchronous, leading to large checkpoint stalls. Furthermore, epoch times are growing as models and datasets get bigger, yet checkpointing is typically performed only at epoch boundaries and the checkpointing frequency must be set manually. → We need fine-grained, iteration-level checkpointing.
Design and Implementation
CheckFreq is an automated checkpointing framework for DNN training.
Mechanism: Low-cost, pipelined checkpointing
Low checkpoint stalls: 2-phase DNN-aware checkpointing
CheckFreq decouples traditional checkpointing into two phases: snapshot() and persist(). snapshot() serializes the training state and copies it from GPU memory to a user-space buffer in CPU memory; persist() writes the serialized content to disk. These two phases are pipelined with the DNN training computation.
In the optimal case, because the model weights are only updated in the last phase of an iteration, the snapshot() of one iteration can be pipelined with the forward and backward passes of the next iteration (which only read the weights); training stalls only if the snapshot has not finished by the time of the next weight update.
The authors also found that taking the snapshot in GPU memory has an orders-of-magnitude lower cost than taking it in CPU memory, since the latter involves a memory copy from the GPU to the CPU. Therefore, if spare GPU memory is available, the snapshot is kept in GPU memory.
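Roughly, the two-phase mechanism could look like the PyTorch-style sketch below. This is a minimal illustration under assumptions rather than CheckFreq's actual implementation or API: the function and variable names (snapshot, persist, train, on_gpu, pending), the use of Python threads, and the checkpoint path format are invented for exposition, and details such as the spare-GPU-memory check, atomic on-disk writes, and error handling are omitted. The stall occurs only if the snapshot has not finished by the next weight update, mirroring the pipelining argument above.

```python
import copy
import threading
import torch
import torch.nn.functional as F

def snapshot(model, optimizer, on_gpu):
    # Phase 1: make an in-memory copy of the training state. The copy stays
    # on the GPU when spare memory is available (cheap), otherwise each
    # tensor is copied to CPU memory.
    model_copy = {k: (v.clone() if on_gpu else v.detach().cpu())
                  for k, v in model.state_dict().items()}
    optim_copy = copy.deepcopy(optimizer.state_dict())  # deep copy so later updates do not race with the write
    return {'model': model_copy, 'optim': optim_copy}

def persist(snap, path):
    # Phase 2: write the in-memory snapshot to disk (runs in the background).
    torch.save(snap, path)

def train(model, optimizer, loader, k, on_gpu, path):
    snap_thread = persist_thread = None
    pending = {}

    def take_snapshot():
        pending['snap'] = snapshot(model, optimizer, on_gpu)

    for it, (x, y) in enumerate(loader):
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()

        # A snapshot launched at the end of the previous iteration has been
        # running concurrently with this forward/backward pass; it must
        # finish before the weights are updated again (possible stall here).
        if snap_thread is not None:
            snap_thread.join()
            snap_thread = None
            persist_thread = threading.Thread(
                target=persist, args=(pending['snap'], f'{path}.{it}'))
            persist_thread.start()

        optimizer.step()

        if it % k == 0:
            # The previous persist() must complete before a new checkpoint starts.
            if persist_thread is not None:
                persist_thread.join()
            snap_thread = threading.Thread(target=take_snapshot)
            snap_thread.start()
```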
Maintaining the data invariant: Resumable data iterator
Current data iterators do not guarantee the same order of data items after resuming mid-epoch, so some items could be processed twice and others skipped. CheckFreq resolves this with a resumable data iterator: the shuffle order is derived from a seed that is a function of the epoch number, so the exact order can be reconstructed after resuming and the items already processed before the checkpoint can be skipped.
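As a toy illustration of the idea (not CheckFreq's actual iterator; the class and field names here are hypothetical), an epoch-seeded, resumable sampler might look like this:

```python
import numpy as np

class ResumableSampler:
    # Toy resumable sampler: the shuffle is seeded by the epoch number, so the
    # same order can be rebuilt after a restart, and items that were already
    # processed before the checkpoint are skipped on resume.
    def __init__(self, num_samples, epoch=0, items_done=0):
        self.num_samples = num_samples
        self.epoch = epoch            # restored from the checkpoint
        self.items_done = items_done  # restored from the checkpoint; advanced
                                      # by the training loop as batches are consumed

    def __iter__(self):
        # The seed is a deterministic function of the epoch number.
        order = np.random.RandomState(seed=self.epoch).permutation(self.num_samples)
        # Skip items that were already processed before the checkpoint.
        yield from order[self.items_done:]

    def state(self):
        # Saved alongside the model/optimizer state in each checkpoint.
        return {'epoch': self.epoch, 'items_done': self.items_done}
```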
Policy: When to checkpoint?
Initial frequency: Systematic online profiling
The key idea is to come up with a frequency of checkpointing every k iterations such that:
The cost of 1 checkpoint can be amortized over k iterations
The runtime overhead of checkpointing is within a user-defined threshold of the actual compute time (say 5%)
To accomplish this, CheckFreq profiles: the iteration time (Ti), the time to perform the weight update (Tw), the time to create an in-memory GPU copy (Tg), the time to create an in-memory CPU copy (Tc), the time to write a checkpoint to storage (Ts), the size of a checkpoint (m), the peak GPU memory utilization (M), and the total GPU memory (Mmax). These measurements determine where the snapshot is taken (on the GPU if M + m < Mmax, otherwise on the CPU) and the smallest checkpoint interval k that keeps the overhead under the threshold.
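A plausible reconstruction of this frequency selection is sketched below. It only captures the two constraints listed above (amortize the checkpoint cost, keep overhead under the threshold) and is not the paper's exact algorithm; the function name and the 5% default for p are assumptions.

```python
import math

def initial_frequency(Ti, Tw, Tg, Tc, Ts, m, M, Mmax, p=0.05):
    # Snapshot on the GPU if the copy fits in spare GPU memory.
    t_snapshot = Tg if (M + m) < Mmax else Tc

    # The snapshot overlaps with the next iteration's forward/backward pass
    # (roughly Ti - Tw of compute); only the excess shows up as a stall.
    stall = max(0.0, t_snapshot - (Ti - Tw))

    # persist() runs in the background and must finish before the next
    # checkpoint, so k iterations must at least cover the write time.
    k_amortize = math.ceil(Ts / Ti)

    # The residual stall per checkpoint must stay within a fraction p of the
    # compute time of the k iterations it is amortized over.
    k_overhead = math.ceil(stall / (p * Ti)) if stall > 0 else 1

    return max(k_amortize, k_overhead, 1)
```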
Adaptive rate tuning: Manages interference from other jobs
Consider the following example:
Isolated: When a job runs alone, the checkpointing overhead is kept at 5% as specified by the user
Static: When another job space-shares the same GPU, checkpointing at the previous frequency results in a 35% overhead
Adaptive: CheckFreq's adaptive policy reduces the checkpointing frequency and keeps the overhead at 5%
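One plausible way to implement such adaptive tuning is sketched below; the exact rule CheckFreq uses may differ, and the function name, the headroom factor, and the scaling heuristic are assumptions made here for illustration.

```python
import math

def adapt_interval(k, measured_overhead, p=0.05, headroom=0.5):
    # Called periodically with the checkpointing overhead measured over a
    # recent window of iterations; k is the current checkpoint interval
    # (checkpoint every k iterations).
    if measured_overhead > p:
        # Interference (e.g., a co-located job) made checkpointing relatively
        # more expensive: checkpoint less often to get back under the budget.
        return max(1, math.ceil(k * measured_overhead / p))
    if measured_overhead < headroom * p:
        # Plenty of slack: checkpoint more often to shrink the window of
        # lost work on a failure.
        return max(1, k // 2)
    return k
```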
Evaluation
Links