# Machine Learning Systems

- [Machine Learning Systems - Index](/machine-learning-systems/machine-learning-systems-index.md)
- [MLSys Papers - Short Notes](/machine-learning-systems/machine-learning-systems-index/mlsys-papers-short-notes.md)
- [\[2011 NSDI\] Dominant Resource Fairness: Fair Allocation of Multiple Resource Types](/machine-learning-systems/machine-learning-systems-index/dominant-resource-fairness-fair-allocation-of-multiple-resource-types.md)
- [\[2014 OSDI\] Scaling Distributed Machine Learning with the Parameter Server](/machine-learning-systems/machine-learning-systems-index/scaling-distributed-machine-learning-with-the-parameter-server.md)
- [\[2018 OSDI\] Gandiva: Introspective Cluster Scheduling for Deep Learning](/machine-learning-systems/machine-learning-systems-index/gandiva-introspective-cluster-scheduling-for-deep-learning.md)
- [\[2018 SIGCOMM\] Chameleon: Scalable Adaptation of Video Analytics via Temporal and Cross-camera Correlations](/machine-learning-systems/machine-learning-systems-index/2018-sigcomm-chameleon-scalable-adaptation-of-video-analytics-via-temporal-and-cross-camera-....md)
- [\[2018 NIPS\] Dynamic Space-Time Scheduling for GPU Inference](/machine-learning-systems/machine-learning-systems-index/2018-nips-dynamic-space-time-scheduling-for-gpu-inference.md)
- [\[2019 ATC\] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads](/machine-learning-systems/machine-learning-systems-index/analysis-of-large-scale-multi-tenant-gpu-clusters-for-dnn-training-workloads.md)
- [\[2019 NSDI\] Tiresias: A GPU Cluster Manager for Distributed Deep Learning](/machine-learning-systems/machine-learning-systems-index/tiresias-a-gpu-cluster-manager-for-distributed-deep-learning.md)
- [\[2019 SOSP\] ByteScheduler: A Generic Communication Scheduler for Distributed DNN Training Acceleration](/machine-learning-systems/machine-learning-systems-index/2019-sosp-bytescheduler-a-generic-communication-scheduler-for-distributed-dnn-training-....md)
- [\[2019 SOSP\] PipeDream: Generalized Pipeline Parallelism for DNN Training](/machine-learning-systems/machine-learning-systems-index/pipedream-generalized-pipeline-parallelism-for-dnn-training.md)
- [\[2019 SOSP\] Parity Models: Erasure-Coded Resilience for Prediction Serving Systems](/machine-learning-systems/machine-learning-systems-index/2019-sosp-parity-models-erasure-coded-resilience-for-prediction-serving-systems.md)
- [\[2019 NIPS\] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](/machine-learning-systems/machine-learning-systems-index/gpipe-efficient-training-of-giant-neural-networks-using-pipeline-parallelism.md)
- [\[2019 SC\] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](/machine-learning-systems/machine-learning-systems-index/2019-sc-zero-memory-optimizations-toward-training-trillion-parameter-models.md)
- [\[2020 OSDI\] Gavel: Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads](/machine-learning-systems/machine-learning-systems-index/gavel-heterogeneity-aware-cluster-scheduling-policies-for-deep-learning-workloads.md)
- [\[2020 OSDI\] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning](/machine-learning-systems/machine-learning-systems-index/2020-osdi-antman-dynamic-scaling-on-gpu-clusters-for-deep-learning.md)
- [\[2020 OSDI\] BytePS: A High Performance and Generic Framework for Distributed DNN Training](/machine-learning-systems/machine-learning-systems-index/byteps-a-high-performance-and-generic-framework-for-distributed-dnn-training.md)
- [\[2020 SIGCOMM\] Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics](/machine-learning-systems/machine-learning-systems-index/2020-sigcomm-reducto-on-camera-filtering-for-resource-efficient-real-time-video-analytics.md)
- [\[2020 MLSys\] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications](/machine-learning-systems/machine-learning-systems-index/2020-sigcomm-reducto-on-camera-filtering-for-resource-efficient-real-time-video-analytics/salus-fine-grained-gpu-sharing-primitives-for-deep-learning-applications.md)
- [\[2020 EuroSys\] AlloX: Compute Allocation in Hybrid Clusters](/machine-learning-systems/machine-learning-systems-index/allox-compute-allocation-in-hybrid-clusters.md)
- [\[2020 VLDB\] PyTorch Distributed: Experiences on Accelerating Data Parallel Training](/machine-learning-systems/machine-learning-systems-index/pytorch-distributed-experiences-on-accelerating-data-parallel-training.md)
- [\[2020 NetAI\] Is Network the Bottleneck of Distributed Training?](/machine-learning-systems/machine-learning-systems-index/2020-netai-is-network-the-bottleneck-of-distributed-training.md)
- [\[2020 NSDI\] Themis: Fair and Efficient GPU Cluster Scheduling](/machine-learning-systems/machine-learning-systems-index/themis-fair-and-efficient-gpu-cluster-scheduling.md)
- [\[2021 MLSys\] Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification](/machine-learning-systems/machine-learning-systems-index/accordion-adaptive-gradient-communication-via-critical-learning-regime-identification.md)
- [\[2021 VLDB\] Analyzing and Mitigating Data Stalls in DNN Training](/machine-learning-systems/machine-learning-systems-index/analyzing-and-mitigating-data-stalls-in-dnn-training.md)
- [\[2021 FAST\] CheckFreq: Frequent, Fine-Grained DNN Checkpointing](/machine-learning-systems/machine-learning-systems-index/checkfreq-frequent-fine-grained-dnn-checkpointing.md)
- [\[2021 EuroMLSys\] Interference-Aware Scheduling for Inference Serving](/machine-learning-systems/machine-learning-systems-index/2021-euromlsys-interference-aware-scheduling-for-inference-serving.md)
- [\[2021 OSDI\] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning](/machine-learning-systems/machine-learning-systems-index/pollux-co-adaptive-cluster-scheduling-for-goodput-optimized-deep-learning.md)
- [\[2021 MLSys\] Wavelet: Efficient DNN Training with Tick-Tock Scheduling](/machine-learning-systems/machine-learning-systems-index/wavelet-efficient-dnn-training-with-tick-tock-scheduling.md)
- [\[2021 NSDI\] SwitchML: Scaling Distributed Machine Learning with In-Network Aggregation](/machine-learning-systems/machine-learning-systems-index/2021-nsdi-switchml-scaling-distributed-machine-learning-with-in-network-aggregation.md)
- [Big Data Systems - Index](/machine-learning-systems/index.md)
- [Big Data Systems Papers - Short Notes](/machine-learning-systems/index/big-data-systems-papers-short-notes.md)
- [\[2003 SOSP\] The Google File System](/machine-learning-systems/index/the-google-file-system.md)
- [\[2004 OSDI\] MapReduce: Simplified Data Processing on Large Clusters](/machine-learning-systems/index/mapreduce-simplified-data-processing-on-large-clusters.md)
- [\[2010 SIGMOD\] Pregel: A System for Large-Scale Graph Processing](/machine-learning-systems/index/pregel-a-system-for-large-scale-graph-processing.md)
- [\[2011 NSDI\] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center](/machine-learning-systems/index/mesos-a-platform-for-fine-grained-resource-sharing-in-the-data-center.md)
- [\[2012 NSDI\] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](/machine-learning-systems/index/resilient-distributed-datasets-a-fault-tolerant-abstraction-for-in-memory-cluster-computing.md)
- [\[2012 OSDI\] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs](/machine-learning-systems/index/powergraph-distributed-graph-parallel-computation-on-natural-graphs.md)
- [\[2019 FAST\] DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching](/machine-learning-systems/index/2019-fast-distcache-provable-load-balancing-for-large-scale-storage-systems-with-distributed....md)
- [\[2021 HotOS\] From Cloud Computing to Sky Computing](/machine-learning-systems/index/from-cloud-computing-to-sky-computing.md)
- [\[2021 EuroSys\] NextDoor: Accelerating Graph Sampling for Graph Machine Learning Using GPUs](/machine-learning-systems/index/accelerating-graph-sampling-for-graph-machine-learning-using-gpus.md)
