Machine Learning Systems - Index

Distributed Training & Parallelism Paradigms

Workload Scheduling, Cluster Resource Management

Serving/Inference

  • [NSDI '17] Clipper: A Low-Latency Online Prediction Serving System (pdf)

  • [NIPS '17 MLSys workshop] TensorFlow-Serving: Flexible, High-Performance ML Serving (pdf)

  • [arXiv '18] Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications (pdf)

  • [SOSP '19] Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis (pdf)

  • [arXiv '19] No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy (pdf)

  • [ATC '19] MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving (pdf)

  • [SoCC '20] GSLICE: Controlled Spatial Sharing of GPUs for a Scalable Inference Platform (pdf)

  • [SoCC '20] InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines (pdf)

  • [OSDI '20] Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (pdf)

  • [OSDI '20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications (pdf)

  • [ATC '21] INFaaS: Automated Model-less Inference Serving (pdf)

  • [arXiv '21] Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem (pdf)

  • [arXiv '21] Gati: Accelerating Deep Learning Inference via Learned Caches (pdf)

  • [ICML '22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (pdf)

  • [OSDI '22] Achieving μs-scale Preemption for Concurrent GPU-accelerated DNN Inferences (pdf)

  • [OSDI '22] Orca: A Distributed Serving System for Transformer-Based Generative Models (pdf)

  • [ATC '22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing (pdf)

  • [SIGMOD '22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving (pdf)

Optimizing Networks/Communications for ML

ML for Systems, Video Analytics & Streaming

Tricks and Relaxations in Learning and Systems: Compression, Pruning, Freezing, and More

  • [NIPS '13] More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

  • [arXiv '16] Training Deep Nets with Sublinear Memory Cost

  • [ICLR '16] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

  • [NIPS '17] Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

  • [ICLR '18] Mixed Precision Training

  • [ICLR '19] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

  • [arXiv '21] AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning

  • [PVLDB '21] BAGUA: Scaling up Distributed Learning with System Relaxations

  • [arXiv '22] BagPipe: Accelerating Deep Recommendation Model Training

  • [arXiv '22] Efficient DNN Training with Knowledge-Guided Layer Freezing

  • Hongyi Wang's talk: On the Utility of Gradient Compression in Distributed Training Systems

    • [NeurIPS '18] ATOMO: Communication-efficient Learning via Atomic Sparsification (pdf)

    • [MLSys '21] Pufferfish: Communication-efficient Models At No Extra Cost (pdf)

    • [SOSP '21] Gradient Compression Supercharged High-Performance Data Parallel DNN Training (pdf)

    • [MLSys '22] On the Utility of Gradient Compression in Distributed Training Systems (pdf)

    • [arXiv '22] Cuttlefish: Factorized Model Training without All the Tuning

    • [arXiv '22] ByteComp: Revisiting Gradient Compression in Distributed Training (pdf)

Misc: Storage, Hyperparameter Tuning, Federated Learning, DL Compilers, Green Datacenters

  • [NIPS '16 workshop] Federated Learning: Strategies for Improving Communication Efficiency

  • [ICML '18 workshop] Tune: A research platform for distributed model selection and training

  • [OSDI '18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

  • [MLSys '19] Bandana: Using Non-Volatile Memory for Storing Deep Learning Models

  • [MLSys '19] Towards Federated Learning at Scale: System Design

  • [SOSP '19] TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions

  • [MLSys '20] A System for Massively Parallel Hyperparameter Tuning

  • [ICLR '20] Federated Learning with Matched Averaging

  • [OSDI '20] Ansor: Generating High-Performance Tensor Programs for Deep Learning

  • [OSDI '20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks

  • [EuroSys '21] RubberBand: Cloud-based Hyperparameter Tuning

  • [MLSys '21] Fluid: Resource-aware Hyperparameter Tuning Engine

  • [OSDI '21] Oort: Efficient Federated Learning via Guided Participant Selection (pdf)

  • [OSDI '21] PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections

  • [SoCC '21] Elastic Hyperparameter Tuning on the Cloud (pdf)

  • [NSDI '22] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models

  • [ICML '22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale

  • [HotCarbon '22] Treehouse: A Case For Carbon-Aware Datacenter Software

  • [NSDI '23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
