1.5 Classical Distributed Algorithms

Lesson 1: Snapshots

What is a Global Snapshot?

  • In the cloud: Each application or service runs on multiple servers that handle concurrent events and interact with each other. Thus, the ability to obtain a "global photograph" of the system is important

  • Global snapshot = global state = individual state of each process/channel in the distributed system

  • First solution: Synchronize clocks of all processes

    • Ask all processes to record their states at a known time t

    • Problems

      • Time synchronization always has error

      • Does not record the state of messages

    • Causality is enough!

Global Snapshot Algorithm

  • System model:

    • N processes in the system

    • Two uni-directional communication channels between each ordered process pair

    • Channels are FIFO

    • No failures

    • All messages arrive intact and are not duplicated

  • Requirements

    • Snapshot should not interfere with/block normal application actions

    • Each process is able to record its own state

    • The global state is collected in a distributed manner

    • Any process may initiate the snapshot

  • Chandy-Lamport Global Snapshot Algorithm (a minimal sketch in code follows this list)

    • First, Initiator Pi records its own state

      • The initiator process creates special messages called "Marker" messages

      • For all other processes j, Pi sends out a Marker message on outgoing channel Cij (N-1 channels in total)

      • Starts recording the incoming messages on each of the incoming channels at Pi: Cji for j = 1 to N excluding i

    • Whenever a process Pi receives a Marker message on an incoming channel Cki

      • If this is the first Marker Pi is seeing

        • Pi records its own state first

        • Marks the state of channel Cki as "empty"

        • For j = 1 to N except i, Pi sends out a Marker message on outgoing channel Cij

        • Starts recording incoming messages on each of the incoming channels at Pi: Cji for j = 1 to N except i and k

      • Else (if it has already seen a Marker message):

        • Mark the state of channel Cki as all the messages that have arrived on it since recording was turned on for Cki

    • The algorithm terminates when

      • All processes have received a Marker (to record their own state)

      • All processes have received a Marker on all of their N-1 incoming channels (to record the state of every channel)

    • Then, optionally, a central server collects all these partial state pieces to obtain the full global snapshot
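
The bullets above translate fairly directly into per-process event handlers. Below is a minimal sketch (my own illustration, not course material) of one process's bookkeeping; the names ChandyLamportProcess, send, and record_local_state are assumptions, and transport, failures, and the final collection step are omitted.

```python
# Minimal sketch of one process's role in the Chandy-Lamport algorithm.
# Assumptions (not from the notes): send(j, msg) puts msg on the FIFO
# channel C_ij, and record_local_state() captures the application state.

class ChandyLamportProcess:
    def __init__(self, pid, n, send, record_local_state):
        self.pid, self.n = pid, n
        self.send = send
        self.record_local_state = record_local_state
        self.state = None           # recorded local state (None = not yet)
        self.channel_state = {}     # k -> messages recorded on channel C_ki
        self.recording = set()      # incoming channels currently recording

    def _record_and_send_markers(self):
        self.state = self.record_local_state()
        for j in range(self.n):
            if j != self.pid:
                self.send(j, ("MARKER", self.pid))

    def initiate_snapshot(self):
        # Initiator: record own state, send Markers on all outgoing
        # channels, and start recording on all N-1 incoming channels.
        self._record_and_send_markers()
        self.recording = {j for j in range(self.n) if j != self.pid}
        self.channel_state = {j: [] for j in self.recording}

    def on_receive(self, k, msg):
        if msg[0] == "MARKER":
            if self.state is None:
                # First Marker seen: record own state, send Markers, mark
                # C_ki as empty, record on all other incoming channels.
                self._record_and_send_markers()
                self.channel_state = {j: [] for j in range(self.n)
                                      if j != self.pid}
                self.recording = {j for j in range(self.n)
                                  if j not in (self.pid, k)}
            else:
                # Later Marker on C_ki: its recorded state is whatever
                # arrived since recording started; stop recording it.
                self.recording.discard(k)
        elif k in self.recording:
            # Application message on a channel still being recorded.
            self.channel_state[k].append(msg)
```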

Consistent Cuts

  • Cut = time frontier at each process and at each channel

  • Events at the process/channel that happen before the cut are "in the cut"

  • Consistent cut: A cut that obeys causality (a small check is sketched after this list)

    • A cut C is consistent iff, for every pair of events (e, f) in the system such that e is in C and f -> e, f is also in C

  • Any run of the Chandy-Lamport Global Snapshot algorithm creates a consistent cut

    • Proof by contradiction
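
As a concrete illustration of the definition above, here is a tiny check (my own sketch, not from the course) that a cut obeys causality; happens_before is assumed to be the transitively closed set of pairs (f, e) with f -> e.

```python
# Check the consistent-cut condition: if e is in the cut and f -> e,
# then f must also be in the cut.

def is_consistent(cut, happens_before):
    """cut: a set of events; happens_before: a set of (f, e) pairs,
    assumed transitively closed, meaning f -> e."""
    return all(f in cut for (f, e) in happens_before if e in cut)

# e1 -> e2: a cut that contains e2 but not e1 violates causality.
hb = {("e1", "e2")}
print(is_consistent({"e1", "e2"}, hb))  # True  (consistent)
print(is_consistent({"e2"}, hb))        # False (inconsistent)
```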

Safety and Liveness

  • Liveness: Guarantee that something good will happen eventually

  • Safety: Guarantee that something bad will never happen

  • Can be difficult to satisfy both in an asynchronous distributed system

    • Failure detector: Completeness/liveness & accuracy/safety cannot both be guaranteed in an asynchronous distributed system

    • Consensus: Decisions/liveness and correct decisions/safety cannot both be guaranteed by any consensus protocol in an asynchronous distributed system

  • Liveness w.r.t. a property Pr in a given state S means

    • S satisfies Pr, or there is some causal path of global states from S to S' where S' satisfies Pr

  • Safety w.r.t. a property Pr in a given state S means

    • S satisfies Pr, and all global states S' reachable from S also satisfy Pr

  • The Chandy-Lamport algorithm can be used to detect global properties that are stable (once true, they stay true forever afterwards)
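
Restating the two definitions above compactly (my notation, not the course's): write S |= Pr for "global state S satisfies Pr" and S ~> S' for "S' is reachable from S along some causal path of global states".

```latex
\text{Liveness of } Pr \text{ at } S:\quad
  S \models Pr \;\lor\; \exists S'\;\bigl(S \rightsquigarrow S' \land S' \models Pr\bigr)

\text{Safety of } Pr \text{ at } S:\quad
  S \models Pr \;\land\; \forall S'\;\bigl(S \rightsquigarrow S' \Rightarrow S' \models Pr\bigr)
```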

Lesson 2: Multicast

Multicast Ordering

  • Different communication forms

    • Multicast: Message sent to a group of processes

    • Broadcast: Message sent to all processes anywhere

    • Unicast: Message sent from one sender process to one receiver process

  • FIFO Ordering

    • Multicasts from each sender are received, at all receivers, in the order they were sent (a small checker for this property is sketched after this list)

    • Doesn't care about multicasts from different senders

  • Causal Ordering

    • Multicasts whose send events are causally related must be received in the same causality-obeying order at all receivers

      • E.g., if P1 multicasts m1 and P2 multicasts m2 after delivering m1, every receiver must deliver m1 before m2; FIFO ordering alone would allow the opposite order, since the two messages have different senders

    • Concurrent multicasts may be received in different orders at different receivers

    • Causal ordering implies FIFO ordering (the reverse is not true)

  • Total Ordering/Atomic Broadcast

    • Ensures all receivers receive all multicasts in the same order

    • Doesn't care about the order of multicast sending

    • May need to delay delivery of some messages, even at the sender itself
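
For intuition, here is a small illustrative checker (my sketch, not course material) for the FIFO-ordering property: every receiver must deliver each sender's multicasts in increasing send order.

```python
# deliveries maps each receiver to the (sender, send_seq) pairs in the
# order that receiver delivered them; send_seq starts at 1 per sender.

def satisfies_fifo(deliveries):
    for log in deliveries.values():
        last = {}                        # sender -> last seq delivered here
        for sender, seq in log:
            if seq <= last.get(sender, 0):
                return False             # delivered out of send order
            last[sender] = seq
    return True

# P2 delivers P1's multicast #2 before #1: violates FIFO ordering.
print(satisfies_fifo({"P2": [("P1", 2), ("P1", 1)]}))  # False
print(satisfies_fifo({"P2": [("P1", 1), ("P1", 2)]}))  # True
```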

Implementing Ordering

  • FIFO Ordering

    • Each receiver Pi maintains a vector of per-sender sequence numbers Pi[1...N], initially all zeros

      • Pi[j] is the sequence number of the latest multicast from Pj that Pi has delivered

    • Send multicast from Pj:

      • Pj[j] += 1

      • Include new Pj[j] in the multicast message

    • Receive multicast (Pi receives from Pj with sequence number S in the message):

      • If S == Pi[j] + 1, then

        • Deliver message to application

        • Pi[j] += 1

      • Else

        • Buffer this multicast until the above condition is true

  • Causal Ordering

    • Each receiver Pi maintains a vector of per-sender sequence numbers Pi[1...N], initially all zeros

    • Send multicast from Pj:

      • Pj[j] += 1

      • Include the entire new vector Pj[1...N] in the multicast message

    • Receive multicast (Pi receives from Pj with vector M[1...N] in the message); buffer it until both of the following hold:

      • This message is the next one Pi is expecting from Pj, i.e. M[j] = Pi[j] + 1

      • All multicasts, anywhere in the group, which happened-before M, have been received at Pi, i.e.

        • For all k != j, M[k] <= Pi[k] (Receiver satisfies causality)

      • When the above two conditions are met, deliver M to application and set Pi[j] = M[j]

  • Total Ordering: Sequencer-based approach. A dedicated sequencer process assigns a global sequence number to each multicast, and all receivers deliver in that global order (sketches of these implementations follow this list)
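
The FIFO and causal delivery rules above, plus a toy sequencer for total ordering, might look like the following sketch (my own illustration; deliver is a hypothetical callback that hands a message to the application, and sending, reliability, and failures are ignored).

```python
# Minimal sketches of the delivery rules described above.

class FifoReceiver:
    """Deliver each sender's multicasts in send order (per-sender seq numbers)."""
    def __init__(self, n, deliver):
        self.seq = [0] * n        # P_i[1..N]: latest seq delivered per sender
        self.deliver = deliver
        self.buffer = []          # held-back (sender j, seq S, msg) triples

    def on_multicast(self, j, s, msg):
        self.buffer.append((j, s, msg))
        self._flush()

    def _flush(self):
        delivered = True
        while delivered:
            delivered = False
            for j, s, msg in list(self.buffer):
                if s == self.seq[j] + 1:          # next expected from P_j
                    self.deliver(msg)
                    self.seq[j] += 1
                    self.buffer.remove((j, s, msg))
                    delivered = True


class CausalReceiver:
    """Deliver a multicast only after everything that happened-before it
    has been delivered (vector of per-sender seq numbers, as in the notes)."""
    def __init__(self, n, deliver):
        self.vec = [0] * n
        self.deliver = deliver
        self.buffer = []          # held-back (sender j, vector M, msg) triples

    def _deliverable(self, j, m):
        return (m[j] == self.vec[j] + 1 and
                all(m[k] <= self.vec[k] for k in range(len(m)) if k != j))

    def on_multicast(self, j, m, msg):
        self.buffer.append((j, m, msg))
        delivered = True
        while delivered:
            delivered = False
            for j2, m2, msg2 in list(self.buffer):
                if self._deliverable(j2, m2):
                    self.deliver(msg2)
                    self.vec[j2] = m2[j2]
                    self.buffer.remove((j2, m2, msg2))
                    delivered = True


class Sequencer:
    """Toy sequencer for total ordering: stamps each multicast with a global
    sequence number; receivers then deliver strictly in stamp order."""
    def __init__(self):
        self.next_seq = 0

    def order(self, msg):
        self.next_seq += 1
        return (self.next_seq, msg)
```

On the sender side, the only extra work is incrementing the sender's own entry before multicasting (and, for causal ordering, attaching the whole vector), exactly as in the send rules above.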

Reliable Multicast

  • Reliable multicast loosely says that every process in the group receives all multicasts

    • Reliability is orthogonal to ordering

  • When it comes to process failures, the definition becomes vague

  • Definition: Need all correct/non-faulty processes to receive the same set of multicasts as all other correct processes

Virtual Synchrony

  • Each process maintains a membership list, called a View

  • An update to this membership list is called a View Change

  • Virtual synchrony guarantees that all view changes are delivered in the same order at all correct processes

    • A multicast M is said to be "delivered in a view V at process Pi" iff Pi receives view V, and then some time before Pi receives the next view, it delivers multicast M

  • Views may be delivered at different physical times at processes, but they are delivered in the same order

  • Virtual synchrony ensures that

    • The set of multicasts delivered in a given view is the same set at all correct processes that were in that view

      • What happens in a view, stays in that view

    • The sender of the multicast message also belongs to that view

    • If a process Pi does not deliver a multicast M in view V while the other processes in view V delivered M in V, then Pi will be forcibly removed from the next view delivered after V at the other processes

  • TODO: Add some examples

Lesson 3: Paxos

  • TODO

[Figures: FIFO ordering examples 1-3 and causal ordering examples 1-3, from the final exam]