[2019 SOSP] PipeDream: Generalized Pipeline Parallelism for DNN Training
One-line Summary
PipeDream combines intra-batch parallelism with inter-batch parallelism (pipelining) to reduce the communication overhead of DNN training.
Paper Structure Outline
Introduction
Background & Related Work
DNN Training
Parallel Training in PipeDream
Pipeline Parallelism
Partitioning Layers Across Machines
Work Scheduling
Effective Learning
GPU Memory Management
Implementation
Evaluation
Experimental Setup
PipeDream vs. Data Parallelism
Value of Data Parallelism in stages
Conclusion
Background & Motivation
Existing parallelization techniques (data, model, and hybrid parallelism) are intra-batch: each iteration of the optimization algorithm is parallelized across a set of workers. Data parallelism incurs high communication overhead at large scale, while vanilla model parallelism under-utilizes system resources.
Design and Implementation
Pipeline parallelism
With pipeline parallelism, multiple minibatches are injected into the pipeline one after another (Fig. 4).
Pipeline parallelism is superior for two reasons:
Pipelining communicates less: Compared to data parallelism, where gradients for the entire model are aggregated across all workers every iteration, PipeDream only communicates the intermediate activations and gradients between consecutive layers placed on different workers, peer-to-peer (a rough back-of-the-envelope comparison follows below).
Pipelining overlaps computation and communication: Asynchronous communication of intermediate data results in significant overlap of communication with the computation of a subsequent minibatch (Fig. 5).
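To make the first point concrete, here is a rough back-of-the-envelope comparison (not taken from the paper); the parameter count, boundary activation size, and 4-byte elements are hypothetical assumptions:

```python
# Rough communication-volume comparison (a sketch with hypothetical sizes,
# not measurements from the paper).

def dp_bytes_per_worker(num_params, bytes_per_elem=4):
    """Ring all-reduce in data parallelism moves roughly 2x the model size
    per worker per iteration (reduce-scatter + all-gather)."""
    return 2 * num_params * bytes_per_elem

def pipeline_bytes_per_worker(boundary_activation_elems, bytes_per_elem=4):
    """Pipeline parallelism only sends the activations (forward) and the
    matching gradients (backward) across a stage boundary."""
    return 2 * boundary_activation_elems * bytes_per_elem

# Hypothetical example: a 138M-parameter model whose partition boundary
# carries a 2M-element activation tensor per minibatch.
print(dp_bytes_per_worker(138_000_000))      # ~1.1 GB per iteration
print(pipeline_bytes_per_worker(2_000_000))  # ~16 MB per minibatch
```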
Shenanigan 1: Work/model partitioning
PipeDream partitions the model's layers into pipeline stages. Ideally, every stage processes minibatches at the same rate; otherwise the slowest stage becomes a bottleneck and causes load imbalance. Communication between stages should also be minimized to improve overall throughput. To minimize the time taken by the slowest stage, PipeDream's optimizer runs a partitioning algorithm that takes as input the results of a short profiling run (~1000 minibatches), which records for each layer:
The total computation time across the forward & backward passes
The size of the output activations & input gradients
The size of weight parameters
and outputs a partitioning of layers into stages, the replication factor (number of workers) for each stage, and the optimal number of in-flight minibatches to keep the training pipeline busy (a simplified sketch of the partitioning search follows below).
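A heavily simplified sketch of the partitioning search over these inputs, assuming no stage replication, a single flat interconnect, and made-up profile numbers (the paper's optimizer is a more general dynamic program):

```python
from functools import lru_cache

# Split layers into `num_stages` contiguous stages so that the slowest stage
# (compute time plus the time to ship its output activations downstream) is
# as fast as possible. Stage replication and hierarchical networks, which the
# real optimizer handles, are omitted here.

def partition(compute_times, activation_sizes, bandwidth, num_stages):
    n = len(compute_times)

    def comm_time(boundary):
        # Time to send the activations produced after layer `boundary` downstream.
        return 0.0 if boundary == n - 1 else activation_sizes[boundary] / bandwidth

    @lru_cache(maxsize=None)
    def best(first_layer, stages_left):
        """Minimal bottleneck time for layers first_layer..n-1 using stages_left stages."""
        if stages_left == 1:
            return sum(compute_times[first_layer:])
        result = float("inf")
        # Try every possible last layer for the current stage.
        for last in range(first_layer, n - stages_left + 1):
            stage_time = sum(compute_times[first_layer:last + 1]) + comm_time(last)
            result = min(result, max(stage_time, best(last + 1, stages_left - 1)))
        return result

    return best(0, num_stages)

# Hypothetical profile: per-layer compute times (ms), output activation sizes (MB),
# a 1 MB/ms (~1 GB/s) link, and 2 stages -> 12.0 ms (split after layer 1).
print(partition((4.0, 6.0, 5.0, 3.0), (10.0, 2.0, 8.0, 0.0), 1.0, 2))
```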
Shenanigan 2: Work scheduling
Each stage must decide whether to perform a forward pass or a backward pass next, and on which minibatch. During the startup phase, PipeDream admits enough minibatches to keep the pipeline full in steady state. In steady state, each stage alternates between a forward pass for one minibatch and a backward pass for a different minibatch (the 1F1B schedule). For stages replicated with data parallelism, a 1F1B-RR (round-robin) schedule is used to deterministically assign minibatches to replicas (a minimal sketch of the schedule follows below).
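A minimal sketch of the 1F1B schedule from a single stage's point of view, assuming each stage admits (num_stages - stage_id) minibatches during startup; the round-robin handling of replicated stages (1F1B-RR) is omitted:

```python
def one_f_one_b(stage_id, num_stages, num_minibatches):
    """Yield ('F', mb) / ('B', mb) operations for one pipeline stage."""
    warmup = num_stages - stage_id          # minibatches admitted at startup
    next_fwd, next_bwd = 0, 0
    # Startup phase: fill the pipeline with forward passes only.
    for _ in range(min(warmup, num_minibatches)):
        yield ("F", next_fwd)
        next_fwd += 1
    # Steady state: alternate one backward and one forward pass.
    while next_bwd < num_minibatches:
        yield ("B", next_bwd)
        next_bwd += 1
        if next_fwd < num_minibatches:
            yield ("F", next_fwd)
            next_fwd += 1

# Example: 4 stages; what does stage 0 (the input stage) do for 8 minibatches?
print(list(one_f_one_b(stage_id=0, num_stages=4, num_minibatches=8)))
# [('F', 0), ('F', 1), ('F', 2), ('F', 3), ('B', 0), ('F', 4), ('B', 1), ('F', 5), ...]
```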
Shenanigan 3: Weight version mismatch
Naive pipelining leads to weight version mismatches. For example, in Fig. 4, minibatch 5's forward pass on stage 1 is performed after the weight update from minibatch 1's backward pass, whereas its backward pass is performed after the updates from minibatches 2-4 as well. This discrepancy results in invalid gradients and may prevent model convergence. PipeDream's workaround is weight stashing: each stage stores multiple versions of its weights, one for each active minibatch, so that a minibatch's backward pass uses the same weight version as its forward pass. Even with the extra memory footprint of stashing weight versions, PipeDream's peak per-worker memory usage is on par with data parallelism (a toy sketch follows below).
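A toy, runnable sketch of the weight-stashing idea on a one-parameter stage (the class, names, and update rule are illustrative, not PipeDream's actual API):

```python
# Each in-flight minibatch records the weight version (and input) its forward
# pass used, so its backward pass computes gradients against the same version
# even if later minibatches have already triggered weight updates.

class StashedStage:
    def __init__(self, w, lr=0.01):
        self.w = w               # latest weight, used for new forward passes
        self.lr = lr
        self.stash = {}          # minibatch id -> (weight version, input) from forward

    def forward(self, mb_id, x):
        self.stash[mb_id] = (self.w, x)
        return self.w * x        # toy stage: y = w * x

    def backward(self, mb_id, grad_out):
        w_used, x = self.stash.pop(mb_id)   # same version as this minibatch's forward
        grad_w = grad_out * x               # d(w*x)/dw with the stashed input
        grad_x = grad_out * w_used          # gradient sent to the upstream stage
        self.w -= self.lr * grad_w          # update the latest weights
        return grad_x

stage = StashedStage(w=1.0)
stage.forward(4, x=2.0)           # minibatch 4 stashes w = 1.0
stage.forward(5, x=3.0)           # minibatch 5 also sees w = 1.0 (no update yet)
stage.backward(4, grad_out=0.5)   # updates w, but minibatch 5's backward ...
stage.backward(5, grad_out=0.5)   # ... still uses its stashed version (1.0)
```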
Evaluation
Links