Machine Learning Systems - Index

Distributed Training & Parallelism Paradigms

  • [OSDI '14] Scaling Distributed Machine Learning with the Parameter Server
  • [SoCC '18] Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
  • [OSDI '20] BytePS: A High Performance and Generic Framework for Distributed DNN Training
  • [VLDB '20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
  • [MLSys '20] Resource Elasticity in Distributed Deep Learning
  • [NSDI '23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
  • Parallelism Paradigms & Strategies (Overview by Hugging Face)
    • [MLSys '19] FlexFlow: Beyond Data and Model Parallelism for Deep Neural Networks
    • [arXiv '19] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
    • [NIPS '19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
    • [SOSP '19] PipeDream: Generalized Pipeline Parallelism for DNN Training
    • [SC '20] ZeRO: memory optimizations toward training trillion parameter models
    • [ATC '20] HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
    • [ATC '21] ZeRO-Offload: Democratizing Billion-Scale Model Training
    • [SC '21] ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning
    • [SC '21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
    • [SC '21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
    • [ICML '21] Memory-Efficient Pipeline-Parallel DNN Training
    • [PPoPP '21] DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
    • [OSDI '22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
    • [OSDI '22] Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization
    • [EuroSys '22] Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
    • [arXiv '22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
    • [PPoPP '22] BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores
    • [NeurIPS '22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
    • [VLDB '23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud

Workload Scheduling, Cluster Resource Management

  • [NSDI '11] DRF: Fair Allocation of Multiple Resource Types
  • [OSDI '18] Gandiva: Introspective Cluster Scheduling for Deep Learning
  • [EuroSys '18] Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
  • [ATC '19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
  • [NSDI '19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
  • [NSDI '20] Themis: Fair and Efficient GPU Cluster Scheduling
  • [MLSys '20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
  • [OSDI '20] Gavel: Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
  • [OSDI '20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
  • [OSDI '20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
  • [EuroSys '20] Gandiva-Fair: Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning
  • [EuroSys '20] AlloX: Compute Allocation in Hybrid Clusters
  • [NSDI '21] AFS/CoDDL: Elastic Resource Sharing for Distributed Deep Learning
  • [MLSys '21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
  • [OSDI '21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
  • [ATC '21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
  • [SoCC '21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
  • [NSDI '22] MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters
  • [OSDI '22] Synergy: Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
  • [SIGCOMM '22] Multi-Resource Interleaving for Deep Learning Training
  • [arXiv '22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
  • [NSDI '23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
  • [NSDI '23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup

Serving/Inference

  • [NSDI '17] Clipper: A Low-Latency Online Prediction Serving System
  • [NIPS '17 MLSys workshop] TensorFlow-Serving: Flexible, High-Performance ML Serving
  • [arXiv '18] Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
  • [NIPS '18] Dynamic Space-Time Scheduling for GPU Inference
  • [SOSP '19] Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
  • [SOSP '19] Parity Models: Erasure-Coded Resilience for Prediction Serving Systems
  • [arXiv '19] No DNN left behind: Improving inference in the cloud with Multi-Tenancy
  • [ATC '19] MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving
  • [SoCC '20] GSLICE: controlled spatial sharing of GPUs for a scalable inference platform
  • [SoCC '20] InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines
  • [OSDI '20] Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
  • [OSDI '20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
  • [ATC '21] INFaaS: Automated Model-less Inference Serving
  • [EuroMLSys '21] Interference-Aware Scheduling for Inference Serving
  • [ICML '21] Boosting the Throughput and Accelerator Utilization of Specialized CNN Inference Beyond Increasing Batch Size
  • [arXiv '21] Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
  • [arXiv '21] Gati: Accelerating Deep Learning Inference via Learned Caches
  • [ICML '22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
  • [OSDI '22] Achieving μs-scale Preemption for Concurrent GPU-accelerated DNN Inferences
  • [OSDI '22] Orca: A Distributed Serving System for Transformer-Based Generative Models
  • [ATC '22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
  • [SIGMOD '22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving

Optimizing Networks/Communications for ML

  • [ATC '17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
  • [MLSys '19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
  • [MLSys '19] P3: Priority-Based Parameter Propagation for Distributed DNN Training
  • [MLSys '19] BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy
  • [SOSP '19] ByteScheduler: A Generic Communication Scheduler for Distributed DNN Training Acceleration
  • [MLSys '20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
  • [MLSys '20] Blink: Fast and Generic Collectives for Distributed ML
  • [NetAI '20] Is Network the Bottleneck of Distributed Training?
  • [SoCC '20] Network-accelerated Distributed Machine Learning for Multi-Tenant Settings
  • [NSDI '21] SwitchML: Scaling Distributed Machine Learning with In-Network Aggregation
  • [NSDI '21] ATP: In-network Aggregation for Multi-tenant Learning
  • [MLSys '21] In-network Aggregation for Shared Machine Learning Clusters
  • [SIGCOMM '21] Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning
  • [arXiv '21] Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering
  • [PPoPP '21] Synthesizing Optimal Collective Algorithms
  • [NSDI '22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
  • [NSDI '23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
  • [NSDI '23] Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL
  • Optical Networks for ML
    • [SIGCOMM '21] SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training
    • [SIGCOMM '21 OptSys workshop] IOI: In-network Optical Inference
    • [OFC '22] Emerging Optical Interconnects for AI Systems
    • [NSDI '23] TOPOOPT: Optimizing the Network Topology for Distributed DNN Training

ML for Systems, Video Analytics & Streaming

  • Kuntai Du's overview on video analytics
  • CS34702 @ UChi: Machine Learning for Networking and Systems
  • [SIGCOMM '17] Pensieve: Neural Adaptive Video Streaming with Pensieve
  • [HotNets '17] Congestion-Control Throwdown
  • [NSDI '18] PCC Vivace: Online-Learning Congestion Control
  • [NSDI '18] Salsify: Low-Latency Network Video through Tighter Integration between a Video Codec and a Transport Protocol
  • [SIGCOMM '18] Chameleon: Scalable Adaptation of Video Analytics via Temporal and Cross-camera Correlations
  • [HotEdge '19] Edge-based Transcoding for Adaptive Live Video Streaming
  • [SIGCOMM '20] DDS: Server-Driven Video Streaming for Deep Learning Inference
  • [SIGCOMM '20] Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics
  • [MobiCom '20] OnRL: Improving Mobile Video Telephony via Online Reinforcement Learning
  • [NSDI '20] Learning in situ: a randomized experiment in video streaming
  • [OSDI '21] Polyjuice: High-Performance Transactions via Learned Concurrency Control
  • [NSDI '22] Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers
  • [HotMobile '22] Understanding the Potential of Server-Driven Edge Video Analytics
  • [SIGCOMM '22] Genet: automatic curriculum generation for learning adaptation in networking
  • [NSDI '23] GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge

Tricks and Relaxations in Learning and Systems: Compression, Pruning, Freezing, and many more

  • Hongyi Wang's talk
  • [NIPS '13] More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
  • [arXiv '16] Training Deep Nets with Sublinear Memory Cost
  • [ICLR '16] Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding
  • [NIPS '17] Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
  • [ICLR '18] Mixed precision training
  • [NIPS '18] ATOMO: Communication-efficient Learning via Atomic Sparsification
  • [ICLR '19] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
  • [MLSys '21] Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification
  • [MLSys '21] Pufferfish: Communication-efficient Models At No Extra Cost
  • [arXiv '21] AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning
  • [PVLDB '21] BAGUA: Scaling up Distributed Learning with System Relaxations
  • [SOSP '21] Gradient Compression Supercharged High-Performance Data Parallel DNN Training
  • [MLSys '22] On the utility of gradient compression in distributed training systems
  • [arXiv '22] ByteComp: Revisiting Gradient Compression in Distributed Training
  • [arXiv '22] BagPipe: Accelerating Deep Recommendation Model Training
  • [arXiv '22] Efficient DNN Training with Knowledge-Guided Layer Freezing
  • [arXiv '22] Cuttlefish: Factorized Model Training without All the Tuning

Misc: Storage, Hyperparameter Tuning, Federated Learning, DL Compilers, Green Datacenters

  • [NIPS '16 workshop] Federated Learning: Strategies for Improving Communication Efficiency
  • [ICML '18 workshop] Tune: A research platform for distributed model selection and training
  • [OSDI '18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
  • [MLSys '19] Bandana: Using Non-Volatile Memory for Storing Deep Learning Models
  • [MLSys '19] Towards Federated Learning at Scale: System Design
  • [SOSP '19] TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions
  • [MLSys '20] A System for Massively Parallel Hyperparameter Tuning
  • [ICLR '20] Federated Learning with Matched Averaging
  • [OSDI '20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
  • [OSDI '20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
  • [FAST '21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
  • [VLDB '21] Analyzing and Mitigating Data Stalls in DNN Training
  • [EuroSys '21] RubberBand: Cloud-based Hyperparameter Tuning
  • [MLSys '21] Fluid: Resource-aware Hyperparameter Tuning Engine
  • [SoCC '21] Elastic Hyperparameter Tuning on the Cloud
  • [OSDI '21] Oort: Efficient Federated Learning via Guided Participant Selection
  • [OSDI '21] PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
  • [NSDI '22] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models
  • [ICML '22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
  • [HotCarbon '22] Treehouse: A Case For Carbon-Aware Datacenter Software
  • [NSDI '23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
