MLSys Papers - Short Notes
This work proposes tensor parallelism (TP), where tensors are partitioned across devices and are only aggregated for operations that require the whole tensor. A key insight of TP is that matrix multiplication can be split between multiple GPUs to parallelize computation and save memory.
Each transformer layer consists of a self-attention block followed by a two-layer multi-layer perceptron (MLP). To parallelize the MLP, the first weight matrix is split by columns and the second by rows, so no synchronization is needed until the very end of the block. Parallelizing the multi-head attention layers is even easier since the heads are already independent of one another. As a result, each transformer layer requires two all-reduces during the forward pass and two all-reduces during the backward pass.
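The MLP split described above can be checked with a small simulation. This is a minimal pure-Python sketch, not the paper's implementation: the two "devices" are just loop iterations, ReLU stands in for the activation function, and every helper name (`split_columns`, `split_rows`, `all_reduce_sum`, `mlp_tensor_parallel`) is hypothetical. It assumes the first weight matrix is sharded by columns and the second by rows, so each device works on its own shard and a single sum at the end plays the role of the all-reduce.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU (stand-in for the MLP's activation function)."""
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of per-device partial outputs (simulated all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, a, b):
    """Single-device MLP: relu(x @ a) @ b."""
    return matmul(relu(matmul(x, a)), b)


def mlp_tensor_parallel(x, a, b, devices=2):
    """Column-shard `a`, row-shard `b`, and synchronize only once at the end."""
    partials = []
    for a_k, b_k in zip(split_columns(a, devices), split_rows(b, devices)):
        hidden_k = relu(matmul(x, a_k))         # local compute, no communication
        partials.append(matmul(hidden_k, b_k))  # partial sum of the final output
    return all_reduce_sum(partials)             # the only synchronization point


X = [[1.0, -2.0], [0.5, 3.0]]
A = [[1.0, -1.0, 2.0, 0.0], [0.5, 1.0, -1.0, 2.0]]
B = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 1.0]]

assert mlp_tensor_parallel(X, A, B) == mlp_reference(X, A, B)
```

The column split works because an element-wise activation can be applied to each column block independently; splitting the first matrix by rows instead would require a synchronization before the activation.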
Note that TP requires a very fast interconnect to approach its theoretical optimum, and in practice TP is usually used in conjunction with other forms of parallelism.
BlueConnect adapts to the hierarchy of communication bandwidths by leveraging topology-awareness to fully utilize the heterogeneous network architecture. It decomposes all-reduce (reduce-scatter + all-gather) into multiple stages of parallelizable reduce-scatter and all-gather operations, which gives finer granularity and more flexibility when mapping operations onto the underlying heterogeneous network hierarchy.
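To make the decomposition concrete, here is a minimal pure-Python simulation of a two-level all-reduce. It is an illustrative sketch under assumptions, not BlueConnect's actual code: it assumes a two-level hierarchy (GPUs within a node, then nodes), a vector length divisible by nodes × GPUs per node, and hypothetical helper names (`reduce_scatter`, `all_gather`, `hierarchical_all_reduce`).

```python
def chunk(vec, parts):
    """Split a vector into `parts` equal-sized chunks."""
    size = len(vec) // parts
    return [vec[i * size:(i + 1) * size] for i in range(parts)]


def reduce_scatter(vectors):
    """Rank i of the group ends up with the element-wise sum of chunk i."""
    n = len(vectors)
    chunks = [chunk(v, n) for v in vectors]
    return [
        [sum(vals) for vals in zip(*(chunks[r][i] for r in range(n)))]
        for i in range(n)
    ]


def all_gather(pieces):
    """Every rank ends up with the concatenation of all ranks' pieces."""
    full = [x for piece in pieces for x in piece]
    return [list(full) for _ in pieces]


def hierarchical_all_reduce(data):
    """data[node][gpu] is the input vector held by that GPU."""
    nodes, gpus = len(data), len(data[0])

    # Stage 1: reduce-scatter inside each node (fast intra-node links).
    s1 = [reduce_scatter(data[n]) for n in range(nodes)]

    # Stage 2: reduce-scatter across nodes among GPUs that share a local index.
    # Each of these groups moves only 1/gpus of the vector over the slow links.
    s2 = [[None] * gpus for _ in range(nodes)]
    for g in range(gpus):
        out = reduce_scatter([s1[n][g] for n in range(nodes)])
        for n in range(nodes):
            s2[n][g] = out[n]

    # Stage 3: all-gather across nodes within the same groups.
    s3 = [[None] * gpus for _ in range(nodes)]
    for g in range(gpus):
        out = all_gather([s2[n][g] for n in range(nodes)])
        for n in range(nodes):
            s3[n][g] = out[n]

    # Stage 4: all-gather inside each node to rebuild the full reduced vector.
    return [all_gather(s3[n]) for n in range(nodes)]


# Two nodes with two GPUs each; every GPU holds a 4-element vector.
inputs = [
    [[1, 2, 3, 4], [10, 20, 30, 40]],
    [[100, 200, 300, 400], [1000, 2000, 3000, 4000]],
]
result = hierarchical_all_reduce(inputs)
```

The groups in stages 2 and 3 are independent of each other, so they can run concurrently, and each one crosses the slower inter-node links with only a fraction of the data.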
This paper addresses the problem of link under-utilization caused by topology heterogeneity in distributed ML training. Topology heterogeneity mainly comes from (1) differing server configurations (e.g., different NVLink topologies across generations of DGX nodes) and (2) the scheduler’s topology-agnostic placements/allocations (e.g., an 8-GPU job uses 3 GPUs in one 8-GPU DGX node and 5 GPUs in another). To handle both cases, Blink dynamically generates optimal communication primitives for a given topology. Blink models collective communication operations as flows on a directed graph and uses a spanning-tree packing algorithm to maximize link bandwidth utilization.
Specialized CNNs (e.g., for offline video analytics) have low arithmetic intensity, which leads to severe under-utilization of server-grade accelerators when serving them. Increasing the batch size is a popular technique to boost arithmetic intensity, utilization, and application-level throughput by amortizing the cost of loading a CNN’s weights from memory. However, it suffers from diminishing returns. This paper proposes a technique to redesign specialized CNNs with the purpose of boosting inference utilization and throughput. The key insight is that, once arithmetic intensity has plateaued due to increased batch size, reading/writing activations accounts for most of the memory traffic in specialized CNNs. The authors show that this memory traffic can be significantly reduced, while performing the same number of FLOPs, by jointly decreasing the size of the batch of input/output activations for a layer and increasing the layer’s width.
Compared to vanilla CNNs, FoldedCNNs improve throughput and accelerator utilization at the cost of a slight accuracy loss.
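The trade-off can be illustrated with a back-of-the-envelope calculation. The sketch below is a simplification, not the paper's model: it uses a single dense layer instead of a convolution, counts only the bytes moved for inputs, outputs, and weights at 2 bytes per element, and assumes that folding by a factor f divides the batch by f and multiplies the layer's input and output width by sqrt(f), a scaling chosen here so that the FLOP count stays constant. The function names are hypothetical.

```python
import math


def layer_cost(batch, d_in, d_out, bytes_per_elem=2):
    """Return (flops, bytes_moved) for one dense layer under a simple cost model."""
    flops = 2 * batch * d_in * d_out                # multiply-accumulate = 2 FLOPs
    activations = batch * d_in + batch * d_out      # inputs read + outputs written
    weights = d_in * d_out                          # weights read once per batch
    return flops, (activations + weights) * bytes_per_elem


def fold(batch, d_in, d_out, f):
    """Assumed folding rule: batch / f, width * sqrt(f); f must be a perfect square."""
    root = math.isqrt(f)
    if root * root != f or batch % f != 0:
        raise ValueError("f must be a perfect square that divides the batch size")
    return batch // f, d_in * root, d_out * root


def arithmetic_intensity(batch, d_in, d_out):
    """FLOPs performed per byte of memory traffic."""
    flops, traffic = layer_cost(batch, d_in, d_out)
    return flops / traffic


original = (1024, 64, 64)       # large batch, narrow layer
folded = fold(*original, f=4)   # -> (256, 128, 128)

flops_before, bytes_before = layer_cost(*original)
flops_after, bytes_after = layer_cost(*folded)

print(f"FLOPs: {flops_before} -> {flops_after}")
print(f"bytes moved: {bytes_before} -> {bytes_after}")
print(f"arithmetic intensity: {arithmetic_intensity(*original):.1f} "
      f"-> {arithmetic_intensity(*folded):.1f}")
```

Under this model, folding shrinks activation traffic while growing weight traffic, so it pays off only when activations dominate, which is the large-batch regime the notes describe.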
TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms.
NCCL uses the topology of GPU connections and NIC placement, along with buffer size, to decide between two main types of communication algorithms: Ring and Tree. However, it is agnostic to the exact performance profile of the links, and is therefore often several times slower than TACCL’s custom collectives.
Chimera is another pipeline parallelism scheme. Compared with other state-of-the-art (SOTA) systems, it leaves less compute idle and has more balanced activation memory consumption.
Today's DNN workload schedulers in shared GPU clusters treat the GPU as the dominant resource and allocate other resource types (e.g., CPU and memory) in proportion to the number of GPUs. However, jobs differ in their sensitivity to these other resources, so current schedulers produce sub-optimal allocations.
Synergy is an idea that applies to all existing scheduling policies: it uses profiling to infer a workload's sensitivity to different resources and performs multi-resource, workload-aware resource allocation. The key idea is to co-locate two jobs on the same server, one CPU-sensitive and the other not. The CPU-insensitive job is not hurt by a reduced CPU allocation, while the CPU-sensitive job gains throughput, which improves cluster-wide aggregate throughput and metrics such as average job completion time (JCT), makespan, and fairness.
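The co-location idea can be sketched as a tiny search over CPU splits. This is an illustration only and not Synergy's scheduling algorithm: the throughput profiles are made-up numbers, the function name `best_cpu_split` is hypothetical, and the rule that no job may fall below its GPU-proportional throughput is an assumption of this sketch.

```python
def best_cpu_split(profile_a, profile_b, total_cpus, fair_share):
    """Pick the CPU split for two co-located jobs that maximizes total speedup.

    profile_x maps a CPU count to profiled throughput (samples/s).
    fair_share is the GPU-proportional CPU count each job would get by default.
    Assumption of this sketch: neither job may run slower than at fair_share.
    """
    base_a, base_b = profile_a[fair_share], profile_b[fair_share]
    best_split = (fair_share, total_cpus - fair_share)
    best_speedup = 2.0  # each job at 1.0x of its fair-share throughput

    for cpus_a in sorted(profile_a):
        cpus_b = total_cpus - cpus_a
        if cpus_b not in profile_b:
            continue  # only consider CPU counts that were actually profiled
        thr_a, thr_b = profile_a[cpus_a], profile_b[cpus_b]
        if thr_a < base_a or thr_b < base_b:
            continue  # would hurt one of the jobs
        speedup = thr_a / base_a + thr_b / base_b
        if speedup > best_speedup:
            best_split, best_speedup = (cpus_a, cpus_b), speedup
    return best_split, best_speedup


# Hypothetical profiles at discrete CPU allocations.
cpu_sensitive = {2: 100, 4: 200, 6: 300, 8: 390, 10: 450}    # keeps scaling with CPUs
cpu_insensitive = {2: 500, 4: 500, 6: 500, 8: 500, 10: 500}  # flat: CPUs do not matter

split, speedup = best_cpu_split(cpu_sensitive, cpu_insensitive,
                                total_cpus=12, fair_share=6)
print(split, speedup)
```

With these made-up profiles, the search shifts CPUs from the flat job to the one that still scales, which is the effect the co-location argument relies on.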
The main technical contributions of this paper are two-fold:
Profiling the workloads: Naively profiling all possible resource configurations can be expensive due to the large combination space. Synergy introduces an optimistic profiling technique that exploits the predictability in the relationship between job throughput and memory allocation. As for the CPU allocation, Synergy empirically profiles the job for varying, discrete CPU allocations at full memory allocation. The profiling time is tens of minutes, which is reasonable considering most DNN jobs are long-running.
Incorporating resource-sensitivity awareness into existing scheduling algorithms.