Rui's Blog
  • Rui's Blog/Paper Reading Notes - Introduction
  • Personal Blog
    • Personal Blog - Index
      • How to Create Picture-in-Picture Effect / Video Overlay for a Presentation Video
      • How to Do Your Part to Protect the Environment in Wisconsin
      • How to Get a Driver's License in Wisconsin
      • How to Travel from the U.S. to China onboard AA127 in June 2021
      • How to Transfer Credits Back to UW-Madison
      • Resources on Learning Academic Writing (for Computer Science)
    • Towards applying to CS Ph.D. programs
  • Machine Learning Systems
    • Machine Learning Systems - Index
      • MLSys Papers - Short Notes
      • [2011 NSDI] Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
      • [2014 OSDI] Scaling Distributed Machine Learning with the Parameter Server
      • [2018 OSDI] Gandiva: Introspective Cluster Scheduling for Deep Learning
      • [2018 SIGCOMM] Chameleon: Scalable Adaptation of Video Analytics via Temporal and Cross-camera ...
      • [2018 NIPS] Dynamic Space-Time Scheduling for GPU Inference
      • [2019 ATC] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
      • [2019 NSDI] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
      • [2019 SOSP] ByteScheduler: A Generic Communication Scheduler for Distributed DNN Training ...
      • [2019 SOSP] PipeDream: Generalized Pipeline Parallelism for DNN Training
      • [2019 SOSP] Parity Models: Erasure-Coded Resilience for Prediction Serving Systems
      • [2019 NIPS] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
      • [2019 SC] ZeRO: memory optimizations toward training trillion parameter models
      • [2020 OSDI] Gavel: Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
      • [2020 OSDI] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
      • [2020 OSDI] BytePS: A High Performance and Generic Framework for Distributed DNN Training
      • [2020 SIGCOMM] Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics
      • [2020 MLSys] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
      • [2020 EuroSys] AlloX: Compute Allocation in Hybrid Clusters
      • [2020 VLDB] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
      • [2020 NetAI] Is Network the Bottleneck of Distributed Training?
      • [2020 NSDI] Themis: Fair and Efficient GPU Cluster Scheduling
      • [2021 MLSys] Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification
      • [2021 VLDB] Analyzing and Mitigating Data Stalls in DNN Training
      • [2021 FAST] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
      • [2021 EuroMLSys] Interference-Aware Scheduling for Inference Serving
      • [2021 OSDI] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
      • [2021 MLSys] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
      • [2021 NSDI] SwitchML: Scaling Distributed Machine Learning with In-Network Aggregation
    • Big Data Systems - Index
      • Big Data Systems Papers - Short Notes
      • [2003 SOSP] The Google File System
      • [2004 OSDI] MapReduce: Simplified Data Processing on Large Clusters
      • [2010 SIGMOD] Pregel: A System for Large-Scale Graph Processing
      • [2011 NSDI] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
      • [2012 NSDI] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster ...
      • [2012 OSDI] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
      • [2019 FAST] DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed...
      • [2021 HotOS] From Cloud Computing to Sky Computing
      • [2021 EuroSys] NextDoor: Accelerating graph sampling for graph machine learning using GPUs
  • Earlier Readings & Notes
    • High Performance Computing Course Notes
      • Lecture 1: Course Overview
      • Lecture 2: From Code to Instructions. The FDX Cycle. Instruction Level Parallelism.
      • Lecture 3: Superscalar architectures. Measuring Computer Performance. Memory Aspects.
      • Lecture 4: The memory hierarchy. Caches.
      • Lecture 5: Caches, wrap up. Virtual Memory.
      • Lecture 6: The Walls to Sequential Computing. Moore’s Law.
      • Lecture 7: Parallel Computing. Flynn's Taxonomy. Amdahl's Law.
      • Lecture 8: GPU Computing Intro. The CUDA Programming Model. CUDA Execution Configuration.
      • Lecture 9: GPU Memory Spaces
      • Lecture 10: GPU Scheduling Issues.
      • Lecture 11: Execution Divergence. Control Flow in CUDA. CUDA Shared Memory Issues.
      • Lecture 12: Global Memory Access Patterns and Implications.
      • Lecture 13: Atomic operations in CUDA. GPU code optimization rules of thumb.
      • Lecture 14: CUDA Case Studies. (1) 1D Stencil Operation. (2) Vector Reduction in CUDA.
      • Lecture 15: CUDA Case Studies. (3) Parallel Prefix Scan on the GPU. Using Multiple Streams in CUDA.
      • Lecture 16: Streams, and overlapping data copy with execution.
      • Lecture 17: GPU Computing: Advanced Features.
      • Lecture 18: GPU Computing with thrust and cub.
      • Lecture 19: Hardware aspects relevant in multi-core, shared memory parallel computing.
      • Lecture 20: Multi-core Parallel Computing with OpenMP. Parallel Regions.
      • Lecture 21: OpenMP Work Sharing.
      • Lecture 22: OpenMP Work Sharing.
      • Lecture 23: OpenMP NUMA Aspects. Caching and OpenMP.
      • Lecture 24: Critical Thinking. Code Optimization Aspects.
      • Lecture 25: Computing with Supercomputers.
      • Lecture 26: MPI Parallel Programming General Introduction. Point-to-Point Communication.
      • Lecture 27: MPI Parallel Programming Point-to-Point communication: Blocking vs. Non-blocking sends.
      • Lecture 28: MPI Parallel Programming: MPI Collectives. Overview of topics covered in the class.
    • Cloud Computing Course Notes
      • 1.1 Introduction to Clouds, MapReduce
      • 1.2 Gossip, Membership, and Grids
      • 1.3 P2P Systems
      • 1.4 Key-Value Stores, Time, and Ordering
      • 1.5 Classical Distributed Algorithms
      • 4.1 Spark, Hortonworks, HDFS, CAP
      • 4.2 Large Scale Data Storage
    • Operating Systems Papers - Index
      • CS 736 @ UW-Madison Fall 2020 Reading List
      • All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
      • ARC: A Self-Tuning, Low Overhead Replacement Cache
      • A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications
      • Biscuit: The benefits and costs of writing a POSIX kernel in a high-level language
      • Data Domain: Avoiding the Disk Bottleneck in the Data Domain Deduplication File System
      • Disco: Running Commodity Operating Systems on Scalable Multiprocessors
      • FFS: A Fast File System for UNIX
      • From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees
      • LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation
      • LFS: The Design and Implementation of a Log-Structured File System
      • Lottery Scheduling: Flexible Proportional-Share Resource Management
      • Memory Resource Management in VMware ESX Server
      • Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks
      • NFS: Sun's Network File System
      • OptFS: Optimistic Crash Consistency
      • RAID: A Case for Redundant Arrays of Inexpensive Disks
      • RDP: Row-Diagonal Parity for Double Disk Failure Correction
      • Resource Containers: A New Facility for Resource Management in Server Systems
      • ReVirt: Enabling Intrusion Analysis through Virtual-Machine Logging and Replay
      • Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism
      • SnapMirror: File-System-Based Asynchronous Mirroring for Disaster Recovery
      • The Linux Scheduler: a Decade of Wasted Cores
      • The Unwritten Contract of Solid State Drives
      • Venti: A New Approach to Archival Storage
    • Earlier Notes
      • How to read a paper
  • FIXME
    • Template for Paper Reading Notes

Big Data Systems - Index

Table of Contents

Infrastructure, Frameworks, and Paradigms

  • NFS: Sun's Network File System
  • [SOSP '03] The Google File System
  • [OSDI '04] MapReduce: Simplified Data Processing on Large Clusters
  • [SOSP '09] FAWN: A Fast Array of Wimpy Nodes
  • [NSDI '12] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
  • [HotOS '15] Scalability! But at what COST?
  • [HotOS '21] From Cloud Computing to Sky Computing

Scheduling & Resource Allocation

  • [NSDI '11] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
  • [EuroSys '13] Omega: flexible, scalable schedulers for large compute clusters
  • [SoCC '13] Apache Hadoop YARN: Yet Another Resource Negotiator
  • [SoCC '14] Wrangler: Predictable and Faster Jobs using Fewer Resources
  • [OSDI '14] Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing
  • [SIGCOMM '14] Tetris: Multi-Resource Packing for Cluster Schedulers
  • [ASPLOS '14] Quasar: Resource-Efficient and QoS-Aware Cluster Management
  • [SIGCOMM '15] Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can
  • [OSDI '16] CARBYNE: Altruistic Scheduling in Multi-Resource Clusters
  • [OSDI '16] Packing and Dependency-aware Scheduling for Data-Parallel Clusters
  • [NSDI '16] HUG: Multi-Resource Fairness for Correlated and Elastic Demands
  • [EuroSys '16] TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters
  • [SoCC '17] Selecting the best VM across multiple public clouds: A data-driven performance modeling approach
  • [ATC '18] On the diversity of cluster workloads and its impact on research results

Cloud/Serverless Computing

  • [SoCC '17] Occupy the Cloud: Distributed Computing for the 99%
  • [arXiv '19] Cloud Programming Simplified: A Berkeley View on Serverless Computing
  • [SoCC '19] Centralized Core-granular Scheduling for Serverless Functions
  • [SoCC '19] Cirrus: a Serverless Framework for End-to-end ML Workflows
  • [NSDI '19] Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure
  • [SIGMOD '20] Le Taureau: Deconstructing the Serverless Landscape & A Look Forward
  • [SoCC '20] Serverless linear algebra
  • [ATC '20] Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider
  • [SIGMOD '21] Towards Demystifying Serverless Machine Learning Training
  • [OSDI '21] Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads
  • [SoCC '21] Atoll: A Scalable Low-Latency Serverless Platform
  • [NSDI '21] Caerus: Nimble Task Scheduling for Serverless Analytics
  • [ASPLOS '22] Serverless computing on heterogeneous computers
  • [arXiv '22] Groundhog: Efficient Request Isolation in FaaS

Network Flow Scheduling

  • [SIGCOMM '11] Managing Data Transfers in Computer Clusters with Orchestra
  • [HotNets '12] Coflow: A Networking Abstraction for Cluster Applications
  • [SIGCOMM '14] Efficient coflow scheduling with Varys
  • [SIGCOMM '14] Baraat: Decentralized task-aware scheduling for data center networks
  • [SIGCOMM '15] Aalo: Efficient coflow scheduling without prior knowledge
  • [SIGCOMM '16] CODA: Toward Automatically Identifying and Scheduling COflows in the DArk
  • [SIGCOMM '16] Scheduling Mix-flows in Commodity Datacenters with Karuna
  • [SIGCOMM '18] Sincronia: Near-Optimal Network Design for Coflows
  • [SPAA '19] Near Optimal Coflow Scheduling in Networks

Graphs

  • MIT's 6.886 Graph Analytics reading list by Prof. Julian Shun
  • [SIGMOD '10] Pregel: A System for Large-Scale Graph Processing
  • [OSDI '12] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
  • [PPoPP '13] Ligra: A Lightweight Graph Processing Framework for Shared Memory
  • [OSDI '14] GraphX: Graph Processing in a Distributed Dataflow Framework
  • [ATC '17] Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication
  • [EuroSys '17] MOSAIC: Processing a Trillion-Edge Graph on a Single Machine
  • [VLDB '18] A Distributed Multi-GPU System for Fast Graph Processing
  • [SoCC '20] PaGraph: Scaling GNN Training on Large Graphs via Computation-aware Caching
  • [EuroSys '21] NextDoor: Accelerating graph sampling for graph machine learning using GPUs
  • [OSDI '21] Marius: Learning Massive Graph Embeddings on a Single Machine
  • [arXiv '22] Marius++: Large-Scale Training of Graph Neural Networks on a Single Machine
  • [MLSys '22] Graphiler: Optimizing Graph Neural Networks with Message Passing Data Flow Graph

Distributed Tracing

  • [Textbook] Distributed Tracing in Practice
  • [SOSP '15] Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
  • [SoCC '16] Principled Workflow-Centric Tracing of Distributed Systems
  • [SOSP '17] Canopy: An End-to-End Performance Tracing And Analysis System
  • [SoCC '18] Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay
  • [SoCC '19] Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering
  • [HotNets '21] Snicket: Query-Driven Distributed Tracing
  • [NSDI '23] The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems

Caching

  • [SoCC '11] Small Cache, Big Effect: Provable Load Balancing for Randomly Partitioned Cluster Services
  • [NSDI '16] Be Fast, Cheap and in Control with SwitchKV
  • [SOSP '17] NetCache: Balancing Key-Value Stores with Fast In-Network Caching
  • [FAST '19] DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching

New Data, Hardware Models

  • [ISCA '17] In-Datacenter Performance Analysis of a Tensor Processing Unit

Databases

  • [SIGMOD '12] Towards a Unified Architecture for in-RDBMS Analytics
  • [arXiv '13] Bayesian Optimization in a Billion Dimensions via Random Embeddings
  • [SIGMOD '17] Automatic Database Management System Tuning Through Large-scale Machine Learning
  • [HotStorage '20] Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs
  • [VLDB '21] An Inquiry into Machine Learning-based Automatic Configuration Tuning Services on Real-World Database Management Systems
  • [arXiv '21] Facilitating Database Tuning with Hyper-Parameter Optimization: A Comprehensive Experimental Evaluation
  • [VLDB '22] LlamaTune: Sample-Efficient DBMS Configuration Tuning

Meta stuff

  • Reading lists
    • CS 294 @ Berkeley: Machine Learning Systems
    • CS 744 @ UW-Madison: Big Data Systems
    • CS 6787 @ Cornell: Advanced Machine Learning Systems, with a focus on the ML side
    • Awesome-System-for-Machine-Learning: An open-sourced reading list
    • CSE 559W @ U Washington Slides: Not a paper reading class, more of an end-to-end comprehensive introduction of foundations of DL Systems
    • CS 759 @ UW-Madison (HPC) Course Notes: A great overview of HPC, CUDA, OpenMP, MPI
  • Some other stuff
    • The MLSys conference
    • SOSP AI Systems workshop
    • Systems Benchmarking Crimes
    • Meta papers
      • A Berkeley View of Systems Challenges for AI
      • MLSys: The New Frontier of Machine Learning Systems