Machine Learning Systems - Index
- [SoCC '18] Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training (pdf)
- [NSDI '23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (pdf)
- [ATC '20] HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism (pdf)
- [OSDI '22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (pdf)
- [OSDI '22] Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization (pdf)
- [arXiv '22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (pdf)
- [NeurIPS '22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness (pdf)
- [EuroSys '20] Gandiva-Fair: Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning (pdf)
- [NSDI '22] MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (pdf)
- [arXiv '22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (pdf)
- [NSDI '23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- [NSDI '23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- [arXiv '18] Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications (pdf)
- [ATC '19] MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving (pdf)
- [arXiv '21] Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem (pdf)
- [ICML '22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (pdf)
- [ATC '22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing (pdf)
- [ATC '17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters (pdf)
- [MLSys '20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training (pdf)
- [SIGCOMM '21] Efficient Sparse Collective Communication and Its Application to Accelerate Distributed Deep Learning (pdf)
- [MLSys '21] In-network Aggregation for Shared Machine Learning Clusters
- [arXiv '21] Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering (pdf)
- [NSDI '22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks (pdf)
- [NSDI '23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- Optical Networks for ML
- [SIGCOMM '21] SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training (pdf)
- [SIGCOMM '17] Neural Adaptive Video Streaming with Pensieve
- [HotNets '17] Congestion-Control Throwdown
- [NSDI '18] PCC Vivace: Online-Learning Congestion Control
- [NSDI '18] Salsify: Low-Latency Network Video through Tighter Integration between a Video Codec and a Transport Protocol
- [SIGCOMM '20] DDS: Server-Driven Video Streaming for Deep Learning Inference
- [MobiCom '20] OnRL: Improving Mobile Video Telephony via Online Reinforcement Learning
- [NSDI '20] Learning in situ: a randomized experiment in video streaming
- [NSDI '22] Ekya: Continuous Learning of Video Analytics Models on Edge Compute Servers
- [HotMobile '22] Understanding the Potential of Server-Driven Edge Video Analytics
- [SIGCOMM '22] Genet: Automatic Curriculum Generation for Learning Adaptation in Networking
- [NSDI '23] GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge
- [NIPS '13] More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
- [arXiv '16] Training Deep Nets with Sublinear Memory Cost
- [ICLR '16] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
- [NIPS '17] Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
- [ICLR '18] Mixed Precision Training
- [ICLR '19] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- [arXiv '21] AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning
- [PVLDB '21] BAGUA: Scaling up Distributed Learning with System Relaxations
- [arXiv '22] BagPipe: Accelerating Deep Recommendation Model Training
- [arXiv '22] Efficient DNN Training with Knowledge-Guided Layer Freezing
- [arXiv '22] Cuttlefish: Factorized Model Training without All the Tuning
- [NIPS '16 workshop] Federated Learning: Strategies for Improving Communication Efficiency
- [ICML '18 workshop] Tune: A Research Platform for Distributed Model Selection and Training
- [OSDI '18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- [MLSys '19] Bandana: Using Non-Volatile Memory for Storing Deep Learning Models
- [MLSys '19] Towards Federated Learning at Scale: System Design
- [SOSP '19] TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions
- [MLSys '20] A System for Massively Parallel Hyperparameter Tuning
- [ICLR '20] Federated Learning with Matched Averaging
- [OSDI '20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
- [OSDI '20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- [EuroSys '21] RubberBand: Cloud-based Hyperparameter Tuning
- [MLSys '21] Fluid: Resource-aware Hyperparameter Tuning Engine
- [OSDI '21] PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
- [NSDI '22] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models
- [ICML '22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
- [HotCarbon '22] Treehouse: A Case For Carbon-Aware Datacenter Software
- [NSDI '23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training