Rui's Blog

Machine Learning Systems - Index

Distributed Training & Parallelism Paradigms

Workload Scheduling, Cluster Resource Management


  • [NSDI '17] Clipper: A Low-Latency Online Prediction Serving System (pdf)
  • [NIPS '17 MLSys workshop] TensorFlow-Serving: Flexible, High-Performance ML Serving (pdf)
  • [arXiv '18] Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications (pdf)
  • [SOSP '19] Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis (pdf)
  • [arXiv '19] No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy (pdf)
  • [ATC '19] MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving (pdf)
  • [SoCC '20] GSLICE: Controlled Spatial Sharing of GPUs for a Scalable Inference Platform (pdf)
  • [SoCC '20] InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines (pdf)
  • [OSDI '20] Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (pdf)
  • [OSDI '20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications (pdf)
  • [ATC '21] INFaaS: Automated Model-less Inference Serving (pdf)
  • [arXiv '21] Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem (pdf)
  • [arXiv '21] Gati: Accelerating Deep Learning Inference via Learned Caches (pdf)
  • [ICML '22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (pdf)
  • [OSDI '22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (pdf)
  • [OSDI '22] Orca: A Distributed Serving System for Transformer-Based Generative Models (pdf)
  • [ATC '22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing (pdf)
  • [SIGMOD '22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving (pdf)

Optimizing Networks/Communications for ML

ML for Systems, Video Analytics & Streaming

Tricks and Relaxations in Learning and Systems: Compression, Pruning, Freezing, and many more

  • [NIPS '13] More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
  • [arXiv '16] Training Deep Nets with Sublinear Memory Cost
  • [ICLR '16] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
  • [NIPS '17] Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
  • [ICLR '18] Mixed Precision Training
  • [ICLR '19] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
  • [arXiv '21] AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning
  • [PVLDB '21] BAGUA: Scaling up Distributed Learning with System Relaxations
  • [arXiv '22] BagPipe: Accelerating Deep Recommendation Model Training
  • [arXiv '22] Efficient DNN Training with Knowledge-Guided Layer Freezing
    • [NIPS '18] ATOMO: Communication-efficient Learning via Atomic Sparsification (pdf)
    • [MLSys '21] Pufferfish: Communication-efficient Models At No Extra Cost (pdf)
    • [SOSP '21] Gradient Compression Supercharged High-Performance Data Parallel DNN Training (pdf)
    • [MLSys '22] On the Utility of Gradient Compression in Distributed Training Systems (pdf)
    • [arXiv '22] Cuttlefish: Factorized Model Training without All the Tuning
    • [arXiv '22] ByteComp: Revisiting Gradient Compression in Distributed Training (pdf)

Misc: Storage, Hyperparameter Tuning, Federated Learning, DL Compilers, Green Datacenters

  • [NIPS '16 workshop] Federated Learning: Strategies for Improving Communication Efficiency
  • [ICML '18 workshop] Tune: A Research Platform for Distributed Model Selection and Training
  • [OSDI '18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
  • [MLSys '19] Bandana: Using Non-Volatile Memory for Storing Deep Learning Models
  • [MLSys '19] Towards Federated Learning at Scale: System Design
  • [SOSP '19] TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions
  • [MLSys '20] A System for Massively Parallel Hyperparameter Tuning
  • [ICLR '20] Federated Learning with Matched Averaging
  • [OSDI '20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
  • [OSDI '20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
  • [EuroSys '21] RubberBand: Cloud-based Hyperparameter Tuning
  • [MLSys '21] Fluid: Resource-aware Hyperparameter Tuning Engine
  • [OSDI '21] Oort: Efficient Federated Learning via Guided Participant Selection (pdf)
  • [OSDI '21] PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
  • [SoCC '21] Elastic Hyperparameter Tuning on the Cloud (pdf)
  • [NSDI '22] Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models
  • [ICML '22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
  • [HotCarbon '22] Treehouse: A Case For Carbon-Aware Datacenter Software
  • [NSDI '23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training