One-line Summary
In this paper, the authors introduce BytePS, a unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. Yes, that is the OG title of the paper, but it's so long that it doesn't fit in the GitBook title, so...
Paper Structure Outline
Motivation and BytePS Architecture
BytePS Communication Design
Inter-machine Communication
Communication Efficiency Analysis
Intra-machine Communication
Implementation
Address RDMA Performance Issues
Evaluation
Inter-machine Microbenchmarks
Adapt to Intra-machine Topology
Observations and Discussion
Background & Motivation
Existing architectures (all-reduce and parameter server) for distributed DNN training are insufficient.
There are two architectures for distributed training based on data parallelism: all-reduce and PS (parameter server).
Even with ByteScheduler, we are still about 30% away from the optimal performance.
The paper analyzes three problems that lead to this slowdown, and presents a solution to each:
Sub-optimal Inter-machine Communication
Sub-optimal Intra-machine Communication
The CPU Bottleneck
Sub-optimal Inter-machine Communication
For all-reduce, the CPU machines are not leveraged at all, since communication happens only between GPU machines. For PS, if there are not enough CPU machines to act as servers, the servers become a bandwidth bottleneck. The solution is an optimal communication strategy that unifies all-reduce and PS.
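To make this concrete, here is a hedged back-of-envelope comparison (my own simplification, not the paper's exact model), assuming n GPU machines, k CPU machines, a per-NIC bandwidth of B bytes/s, and M bytes of gradients per worker per iteration:

```python
# Back-of-envelope per-iteration communication time (my simplification, not
# the paper's exact model): n GPU machines, k CPU server machines, every NIC
# has full-duplex bandwidth B (bytes/s), M bytes of gradients per worker.

def allreduce_time(n, M, B):
    # Ring all-reduce: each machine sends and receives 2(n-1)/n * M bytes;
    # the k CPU machines sit idle.
    return 2 * (n - 1) * M / (n * B)

def ps_time(n, k, M, B):
    # PS: every worker pushes M bytes and pulls M bytes; the servers together
    # must receive n*M and send n*M through only k NICs, so with k < n the
    # server side dominates.
    worker_side = M / B
    server_side = n * M / (k * B)
    return max(worker_side, server_side)

n, k = 8, 4          # hypothetical cluster
M = 1.2e9            # ~1.2 GB of gradients per iteration (illustrative)
B = 100e9 / 8        # 100 Gbps NIC, in bytes/s

print(f"all-reduce: {allreduce_time(n, M, B) * 1e3:.0f} ms  (CPU machines idle)")
print(f"PS (k < n): {ps_time(n, k, M, B) * 1e3:.0f} ms  (servers bottlenecked)")
```

Under these toy numbers, all-reduce wastes the CPU machines and PS is limited by the servers' aggregate NIC bandwidth, which is exactly the gap a unified strategy can close.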
Sub-optimal Intra-machine Communication
IRL, a GPU machine often has multiple GPUs. The machine's internal topology (PCIe switches, NVLink, NICs) is itself a network, and some internal links can become bottlenecks. The solution is a set of intra-machine optimizations that accommodate diverse intra-machine topologies.
The CPU Bottleneck
In the parameter server setup, the GPU workers send their gradients to CPU servers. The CPU servers first aggregate the received gradients and then update the parameters with the optimizer function. The problem is that the CPUs may not keep up with the network rate, which creates a bottleneck. The solution is a Summation Service that moves parameter updates from the CPUs to the GPUs.
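For reference, here is a minimal NumPy sketch of what a conventional PS server does per step, entirely on the CPU (illustrative only; the function name and signature are mine, not a real PS implementation):

```python
import numpy as np

def ps_server_step(grads_from_workers, params, m, v, t,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One conventional PS server step, all on the CPU (illustrative sketch).

    Step 1 (summation) is a single pass of adds: cheap and easy to vectorize.
    Step 2 (an Adam update here) needs several multiplies, a sqrt and a
    divide per parameter, and this is the part that can fall behind a fast NIC.
    """
    # Step 1: aggregate the gradients pushed by all workers.
    g = np.sum(grads_from_workers, axis=0)

    # Step 2: optimizer (Adam) update, also on the CPU in the classic PS design.
    m[:] = beta1 * m + (1 - beta1) * g
    v[:] = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    params -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params  # the workers then pull the updated parameters
```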
Design and Implementation
Sub-optimal Inter-machine Communication
PS only uses the links between CPU machines and GPU machines and does not utilize the bandwidth between GPU machines. All-reduce, on the other hand, relies solely on communication between GPU machines and does not use the CPU machines at all. BytePS takes the best of both worlds and combines the two strategies: the authors present an optimal partition strategy that adjusts the proportion of CPU-GPU and GPU-GPU communication.
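Below is a minimal sketch of the idea under a simplified traffic model of my own (the paper derives the true optimum analytically; the model and the brute-force search here are only for intuition):

```python
# Sketch of tuning the CPU/GPU traffic split under a simplified model (mine,
# not the paper's exact derivation): each tensor of M bytes is split so that
# x bytes are summed by the k CPU machines (PS-style) and M - x bytes are
# summed across the n GPU machines (all-reduce-style). All NICs have
# full-duplex bandwidth B; the slowest NIC determines the iteration time.

def iteration_time(x, M, n, k, B):
    gpu_nic = x + 2 * (n - 1) / n * (M - x)  # bytes per GPU-machine NIC, one direction
    cpu_nic = n * x / k                      # bytes per CPU-machine NIC, one direction
    return max(gpu_nic, cpu_nic) / B

def best_cpu_share(M, n, k, B, steps=1000):
    # Brute-force the split that balances the GPU-machine and CPU-machine NICs.
    return min((i * M / steps for i in range(steps + 1)),
               key=lambda x: iteration_time(x, M, n, k, B))

M, n, k, B = 1.2e9, 8, 4, 100e9 / 8
x = best_cpu_share(M, n, k, B)
print(f"send {x / M:.0%} of each tensor through the CPU machines; "
      f"iteration time = {iteration_time(x, M, n, k, B) * 1e3:.0f} ms")
```

With the same toy numbers as the earlier back-of-envelope sketch, the balanced split comes in below both the pure all-reduce and pure PS times, which is the intuition behind "best of both worlds": setting the CPU share to 0 degrades to all-reduce, setting it to 1 degrades to PS.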
Sub-optimal Intra-machine Communication
I might come back to this later; it's a bit complicated to understand for now :P
The CPU Bottleneck
The PS server's role can be divided into two parts: gradient summation and parameter update. Typically, forward and backward propagation are placed on GPUs, while gradient summation and parameter update are placed on CPUs. The authors found that the gradient summation step is CPU-friendly, while the parameter update step is computation-heavy and not CPU-friendly. To resolve this, the authors present the Summation Service, which keeps summation on the CPUs but moves the parameter update to the GPUs, removing the aforementioned bottleneck.
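A minimal sketch of the resulting division of labor, using hypothetical helper names (not the real BytePS API): the CPU side now only sums, and the compute-heavy optimizer step (like the Adam math in the PS sketch above) runs on the GPU workers.

```python
import numpy as np

def summation_service(grad_shards):
    # CPU machine side in BytePS: nothing but an element-wise sum, which is
    # memory-bound, vectorizes well (AVX), and can keep up with a fast NIC.
    return np.sum(grad_shards, axis=0)

def gpu_worker_update(params, summed_grad, optimizer_step):
    # GPU worker side: the compute-heavy optimizer (Adam, LAMB, ...) is
    # applied locally on the GPU, so the CPU machines never execute it.
    return optimizer_step(params, summed_grad)

# Tiny usage example with a plain SGD step standing in for the optimizer.
shards = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
summed = summation_service(shards)                      # done on CPU machines
params = gpu_worker_update(np.zeros(4), summed,
                           lambda p, g: p - 0.1 * g)    # done on GPU workers
print(params)  # [-0.6 -0.6 -0.6 -0.6]
```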
System Architecture Overview
BytePS achieves near-optimal communication performance.
BytePS outperforms all-reduce and PS by up to 84% and 245%, respectively.
Breakdown of Performance Gains
Network Interface Controller (NIC): the hardware component that connects a machine to the network.
Goodput: throughput that's good. Goodput is the rate at which useful data passes through a link, while throughput is the rate at which all data passes through a link. For example, on a local area network (LAN), goodput counts only the original payload data, while throughput also counts protocol overhead (packet headers, etc.).
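A quick worked example with standard TCP-over-Ethernet framing numbers (generic values, not from the paper):

```python
# Rough goodput vs. throughput on TCP over Ethernet, using standard frame
# sizes (generic values, not from the paper).
mtu = 1500                          # IP packet size in bytes
tcp_ip_headers = 40                 # 20 B IPv4 + 20 B TCP (no options)
payload = mtu - tcp_ip_headers      # useful application data per packet
on_wire = mtu + 14 + 4 + 8 + 12     # + Ethernet header, FCS, preamble, inter-frame gap

link_gbps = 100
throughput = link_gbps                       # everything that crosses the link
goodput = link_gbps * payload / on_wire      # only the useful payload
print(f"goodput ~= {goodput:.1f} Gbps on a {throughput} Gbps link")
```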