[2021 MLSys] Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification
One-line Summary
Accordion dynamically adjusts the gradient compression rate and batch size depending on whether training is in a critical regime, which reduces communication and yields an end-to-end speedup without losing accuracy.
Paper Structure Outline
Introduction
Related work
Distributed SGD
ACCORDION
Adaptive communication using critical regimes
ACCORDION's Design
Relationship between gradient compression and adaptive batch-size
Experimental evaluation
Experimental setup
Results
ACCORDION with PowerSGD
ACCORDION with TopK
ACCORDION with Large Batch size
Comparison with Prior Work
Future Work and Limitations
Conclusion
Appendix
Detailed Experimental Settings
Connection Between Gradient Compression and Batch Size
ACCORDION on Extremely Large Batch Size
Results and Detailed Analysis
Language Model
Computer Vision Models
Detailed Analysis of Batch Size Results
Compression Ratio Selection of Adasparse
Model Descriptions
Background & Motivation
Current methods to alleviate gradient communication bottlenecks include:
Lossy gradient compression (reduce the size of data communicated)
Choosing the compression ratio is a tradeoff between final accuracy and communication overhead
Can be generalized into three groups: quantization, sparsification, and low rank approximation
Increasing the batch size (fewer communication rounds per epoch)
This can degrade final accuracy
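To make the sparsification family concrete, here is a minimal sketch of TopK-style compression on a flat gradient, written in plain Python. The helper names `topk_sparsify` and `densify` are hypothetical and are not taken from the paper's code; a real implementation would operate on framework tensors rather than lists.

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a flat gradient.

    Returns (indices, values); only these would be communicated.
    """
    if k <= 0:
        return [], []
    k = min(k, len(grad))
    # Rank positions by absolute value, largest first, and keep the top k.
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    idx.sort()  # restore positional order for readability
    return idx, [grad[i] for i in idx]


def densify(indices, values, size):
    """Rebuild a dense gradient on the receiver, filling dropped entries with 0."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out


grad = [0.1, -2.0, 0.05, 1.5, -0.3, 0.0]
idx, vals = topk_sparsify(grad, k=2)
restored = densify(idx, vals, len(grad))
```

A smaller `k` means less data on the wire but a cruder approximation of the gradient, which is the accuracy versus communication tradeoff noted above.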
In this work, the authors relax the "fixed communication" scheme and use adaptive schemes instead. They build on the idea of critical regimes: compressing gradients less aggressively during critical regimes mitigates the accuracy loss, so heavier compression can be used everywhere else. Accordion can adjust the batch size in the same way.
Design and Implementation
In the example above, if low compression is used only for the first 20 epochs and the 10 epochs after epoch 150, with high compression everywhere else, the overall communication stays close to that of using high compression throughout, while the accuracy matches that of using low compression throughout.
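The arithmetic behind this claim can be sketched as follows. The total of 300 epochs and the per-epoch costs (1.0 for low compression, 0.25 for high compression) are assumptions chosen purely for illustration; only the 30 low-compression epochs come from the example.

```python
def relative_communication(total_epochs, critical_epochs, low_cost, high_cost):
    """Communication of an adaptive schedule relative to always using low compression.

    low_cost / high_cost are per-epoch communication volumes under low and
    high compression (arbitrary units, illustrative only).
    """
    adaptive = critical_epochs * low_cost + (total_epochs - critical_epochs) * high_cost
    always_low = total_epochs * low_cost
    return adaptive / always_low


# 20 early epochs + 10 epochs after epoch 150 use low compression.
critical_epochs = 20 + 10
ratio_adaptive = relative_communication(300, critical_epochs, low_cost=1.0, high_cost=0.25)
ratio_always_high = relative_communication(300, 0, low_cost=1.0, high_cost=0.25)
```

Under these assumed numbers the adaptive schedule sends about 32.5% of the low-compression volume, compared with 25% for high compression throughout.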
Critical regimes are identified by measuring the rate of change in gradient norms, a check that has low computational and memory overhead.
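A minimal sketch of such a detector is shown below. It flags a critical regime when the relative change between the previous and current gradient norms crosses a threshold. The function names, the threshold value `eta=0.5`, and the handling of a zero previous norm are assumptions made for illustration and should be checked against the paper before reuse.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient (list of floats)."""
    return math.sqrt(sum(x * x for x in vec))


def is_critical(prev_norm, curr_norm, eta=0.5):
    """Return True if the relative change in gradient norm is at least eta.

    eta is an illustrative threshold (assumption), not a tuned value.
    """
    if prev_norm == 0.0:
        # Assumption: with no usable reference norm, be conservative
        # and treat the regime as critical.
        return True
    return abs(prev_norm - curr_norm) / prev_norm >= eta


prev = l2_norm([6.0, 8.0])   # norm of an earlier gradient
curr = l2_norm([2.4, 3.2])   # norm of the current gradient
critical = is_critical(prev, curr)
```

Only two scalars per tracked gradient need to be stored between checks, which is consistent with the low overhead noted above.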
For batch sizes, small batches are used only in critical regimes, and this results in performance similar to using small batches everywhere.
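The batch-size switch can be sketched in the same spirit: given a per-epoch flag saying whether training is in a critical regime, pick the small or the large batch and count the resulting communication rounds. The batch sizes (512 and 4096), the dataset size (50,000), and the function names are hypothetical values chosen for illustration.

```python
import math


def choose_batch_sizes(critical_flags, small_batch=512, large_batch=4096):
    """Use the small batch in critical epochs and the large batch otherwise."""
    return [small_batch if flag else large_batch for flag in critical_flags]


def steps_per_epoch(dataset_size, batch_size):
    """Number of gradient exchanges in one epoch (one per mini-batch)."""
    return math.ceil(dataset_size / batch_size)


flags = [True, True, False, False, False]           # illustrative per-epoch flags
batches = choose_batch_sizes(flags)
rounds = [steps_per_epoch(50_000, b) for b in batches]
```

With these assumed values a critical epoch needs 98 communication rounds while a non-critical epoch needs 13, so keeping the small batch confined to critical regimes removes most of the rounds.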
Evaluation
More evaluations are available in the paper appendix. This paper has the longest appendix I've ever seen :)
Links
Presentation slides at MLSys '21