# Machine Learning Systems

- [Machine Learning Systems - Index](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index.md)
- [MLSys Papers - Short Notes](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/mlsys-papers-short-notes.md)
- [\[2011 NSDI\] Dominant Resource Fairness: Fair Allocation of Multiple Resource Types](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/dominant-resource-fairness-fair-allocation-of-multiple-resource-types.md)
- [\[2014 OSDI\] Scaling Distributed Machine Learning with the Parameter Server](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/scaling-distributed-machine-learning-with-the-parameter-server.md)
- [\[2018 OSDI\] Gandiva: Introspective Cluster Scheduling for Deep Learning](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/gandiva-introspective-cluster-scheduling-for-deep-learning.md)
- [\[2018 SIGCOMM\] Chameleon: Scalable Adaptation of Video Analytics via Temporal and Cross-camera Correlations](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2018-sigcomm-chameleon-scalable-adaptation-of-video-analytics-via-temporal-and-cross-camera-....md)
- [\[2018 NIPS\] Dynamic Space-Time Scheduling for GPU Inference](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2018-nips-dynamic-space-time-scheduling-for-gpu-inference.md)
- [\[2019 ATC\] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/analysis-of-large-scale-multi-tenant-gpu-clusters-for-dnn-training-workloads.md)
- [\[2019 NSDI\] Tiresias: A GPU Cluster Manager for Distributed Deep Learning](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/tiresias-a-gpu-cluster-manager-for-distributed-deep-learning.md)
- [\[2019 SOSP\] ByteScheduler: A Generic Communication Scheduler for Distributed DNN Training Acceleration](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2019-sosp-bytescheduler-a-generic-communication-scheduler-for-distributed-dnn-training-....md)
- [\[2019 SOSP\] PipeDream: Generalized Pipeline Parallelism for DNN Training](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/pipedream-generalized-pipeline-parallelism-for-dnn-training.md)
- [\[2019 SOSP\] Parity Models: Erasure-Coded Resilience for Prediction Serving Systems](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2019-sosp-parity-models-erasure-coded-resilience-for-prediction-serving-systems.md)
- [\[2019 NIPS\] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/gpipe-efficient-training-of-giant-neural-networks-using-pipeline-parallelism.md)
- [\[2019 SC\] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2019-sc-zero-memory-optimizations-toward-training-trillion-parameter-models.md)
- [\[2020 OSDI\] Gavel: Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/gavel-heterogeneity-aware-cluster-scheduling-policies-for-deep-learning-workloads.md)
- [\[2020 OSDI\] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2020-osdi-antman-dynamic-scaling-on-gpu-clusters-for-deep-learning.md)
- [\[2020 OSDI\] BytePS: A High Performance and Generic Framework for Distributed DNN Training](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/byteps-a-high-performance-and-generic-framework-for-distributed-dnn-training.md)
- [\[2020 SIGCOMM\] Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2020-sigcomm-reducto-on-camera-filtering-for-resource-efficient-real-time-video-analytics.md)
- [\[2020 MLSys\] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2020-sigcomm-reducto-on-camera-filtering-for-resource-efficient-real-time-video-analytics/salus-fine-grained-gpu-sharing-primitives-for-deep-learning-applications.md)
- [\[2020 EuroSys\] AlloX: Compute Allocation in Hybrid Clusters](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/allox-compute-allocation-in-hybrid-clusters.md)
- [\[2020 VLDB\] PyTorch Distributed: Experiences on Accelerating Data Parallel Training](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/pytorch-distributed-experiences-on-accelerating-data-parallel-training.md)
- [\[2020 NetAI\] Is Network the Bottleneck of Distributed Training?](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2020-netai-is-network-the-bottleneck-of-distributed-training.md)
- [\[2020 NSDI\] Themis: Fair and Efficient GPU Cluster Scheduling](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/themis-fair-and-efficient-gpu-cluster-scheduling.md)
- [\[2021 MLSys\] Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/accordion-adaptive-gradient-communication-via-critical-learning-regime-identification.md)
- [\[2021 VLDB\] Analyzing and Mitigating Data Stalls in DNN Training](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/analyzing-and-mitigating-data-stalls-in-dnn-training.md)
- [\[2021 FAST\] CheckFreq: Frequent, Fine-Grained DNN Checkpointing](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/checkfreq-frequent-fine-grained-dnn-checkpointing.md)
- [\[2021 EuroMLSys\] Interference-Aware Scheduling for Inference Serving](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2021-euromlsys-interference-aware-scheduling-for-inference-serving.md)
- [\[2021 OSDI\] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/pollux-co-adaptive-cluster-scheduling-for-goodput-optimized-deep-learning.md)
- [\[2021 MLSys\] Wavelet: Efficient DNN Training with Tick-Tock Scheduling](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/wavelet-efficient-dnn-training-with-tick-tock-scheduling.md)
- [\[2021 NSDI\] SwitchML: Scaling Distributed Machine Learning with In-Network Aggregation](https://blog.ruipan.xyz/machine-learning-systems/machine-learning-systems-index/2021-nsdi-switchml-scaling-distributed-machine-learning-with-in-network-aggregation.md)
- [Big Data Systems - Index](https://blog.ruipan.xyz/machine-learning-systems/index.md)
- [Big Data Systems Papers - Short Notes](https://blog.ruipan.xyz/machine-learning-systems/index/big-data-systems-papers-short-notes.md)
- [\[2003 SOSP\] The Google File System](https://blog.ruipan.xyz/machine-learning-systems/index/the-google-file-system.md)
- [\[2004 OSDI\] MapReduce: Simplified Data Processing on Large Clusters](https://blog.ruipan.xyz/machine-learning-systems/index/mapreduce-simplified-data-processing-on-large-clusters.md)
- [\[2010 SIGMOD\] Pregel: A System for Large-Scale Graph Processing](https://blog.ruipan.xyz/machine-learning-systems/index/pregel-a-system-for-large-scale-graph-processing.md)
- [\[2011 NSDI\] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center](https://blog.ruipan.xyz/machine-learning-systems/index/mesos-a-platform-for-fine-grained-resource-sharing-in-the-data-center.md)
- [\[2012 NSDI\] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://blog.ruipan.xyz/machine-learning-systems/index/resilient-distributed-datasets-a-fault-tolerant-abstraction-for-in-memory-cluster-computing.md)
- [\[2012 OSDI\] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs](https://blog.ruipan.xyz/machine-learning-systems/index/powergraph-distributed-graph-parallel-computation-on-natural-graphs.md)
- [\[2019 FAST\] DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching](https://blog.ruipan.xyz/machine-learning-systems/index/2019-fast-distcache-provable-load-balancing-for-large-scale-storage-systems-with-distributed....md)
- [\[2021 HotOS\] From Cloud Computing to Sky Computing](https://blog.ruipan.xyz/machine-learning-systems/index/from-cloud-computing-to-sky-computing.md)
- [\[2021 EuroSys\] NextDoor: Accelerating Graph Sampling for Graph Machine Learning Using GPUs](https://blog.ruipan.xyz/machine-learning-systems/index/accelerating-graph-sampling-for-graph-machine-learning-using-gpus.md)

