1.2 Gossip, Membership, and Grids

Lesson 1: Gossip

Multicast Problem

  • In computer networking, multicast is group communication where data transmission is addressed to a group of destination computers simultaneously

  • Multicast can be one-to-many or many-to-many distribution

  • The difference between multicast and broadcast is that in broadcast, the packet is delivered to all the hosts connected to the network, whereas in multicast, the packet is delivered to intended recipients only.

  • The multicast protocol typically sits at the application layer (i.e., does not deal with the underlying network)

  • Challenges

    • Fault-Tolerance

      • Nodes may crash

      • Packets may be dropped

    • Scalability

      • Tens of thousands of nodes

  • Simplest implementation: Centralized

    • The sender sends UDP/TCP packets to each receiver in a loop

    • Problems

      • Not fault-tolerant: The sender may fail. If it fails halfway through, only half of the receivers get the message

      • High overhead: Not scalable -> O(N) latency, since the sender does all the work

  • Solution: Tree-Based implementation

    • Pro: For a good (balanced) tree, the height is O(log(N)) -> better latency

    • Con: High setup and maintenance costs

  • Tree-Based multicast protocols

    • Build spanning trees to disseminate multicasts

    • Use ACKs or NAKs to repair multicasts not received

    • SRM: Scalable Reliable Multicast

      • Uses NAKs

      • Uses random delays (each receiver waits a random time before sending a repair request) and exponential backoff (doubling the wait between successive NAKs) to avoid NAK storms (a sketch follows this list)

    • RMTP: Reliable Multicast Transport Protocol

      • Uses ACKs

      • ACKs are only sent to designated receivers, which then re-transmit missing multicasts

  • Studies show that despite these countermeasures, these protocols still suffer from O(N) ACK/NAK overheads, which motivated the development of gossip/epidemic protocols
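
To make the NAK-storm avoidance concrete, here is a minimal Python sketch of SRM-style repair requests. The `repair_received` and `multicast_nak` helpers are hypothetical stubs of my own; the real protocol is considerably more involved:

```python
import random
import time

def repair_received(seq_no):
    """Stub: in a real receiver, checks whether the missing packet has
    already been repaired (e.g., thanks to another receiver's NAK)."""
    return False

def multicast_nak(seq_no):
    """Stub: in a real receiver, multicasts a repair request (NAK)."""
    print(f"NAK for packet {seq_no}")

def request_repair(seq_no, base_delay=0.5, max_naks=4):
    # Random initial delay: if another receiver NAKs first and the repair
    # arrives during our wait, we suppress our own NAK entirely.
    delay = random.uniform(base_delay / 2, base_delay)
    for _ in range(max_naks):
        time.sleep(delay)
        if repair_received(seq_no):
            return
        multicast_nak(seq_no)
        delay *= 2  # exponential backoff: double the wait between NAKs
```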

Gossip Protocols

  • There are two "hyperparameters": t and b. Say we set t to be 5 seconds and b (fan-out) to be 2 nodes. In the following examples, we consider only 1 multicast message and only 1 sender.

  • Periodically (every t seconds), a sender picks b random targets and sends them the multicast/gossip message. UDP suffices for transport because the gossip protocol itself provides the reliability.

  • Once a node receives its gossip, it is said to be "infected" and becomes a sender.

  • The gossip protocol is not synchronized across nodes: each node uses its local clock to send messages in rounds. When doing analyses, we typically assume them to be synchronized, though.

  • The above describes "push" gossip: once you have a multicast message, you start gossiping about it (a minimal sketch of one push round follows this list).

    • There is also a "pull" gossip that periodically polls randomly selected processes for new multicast messages that haven't been received.

    • Another variant is the push-pull model. In this model, when sending out a pull query, the sender also includes some gossip messages it received recently.

  • Multiple messages -> push a random subset/recently-received ones/high-priority ones
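
As a minimal illustration of one push round, here is a Python sketch; the dict-based node representation and the `send_udp` stub are simplifying assumptions of mine, not part of the lecture:

```python
import random

def send_udp(target, msg):
    """Stub transport. UDP suffices because the gossip protocol itself
    tolerates dropped messages."""
    target.setdefault("messages", set()).add(msg)

def push_gossip_round(node, peers, b=2):
    """One round at an infected node: every t seconds (on the node's own
    local clock), pick b random targets and push the gossip to them."""
    for target in random.sample(peers, min(b, len(peers))):
        for msg in node.get("messages", set()):  # or a random/recent subset
            send_udp(target, msg)                # receivers become senders too

# Example: one sender, five receivers
nodes = [{"messages": set()} for _ in range(6)]
nodes[0]["messages"].add("m1")
push_gossip_round(nodes[0], nodes[1:], b=2)
```

Pull gossip would invert the last step: periodically poll b random peers for messages this node has not yet seen.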

Gossip Analysis

  • The simple push protocol:

    • Is lightweight in large groups

    • Spreads a multicast quickly

    • Is highly fault-tolerant

  • Analyze using Epidemiology

    • Population: (n+1) nodes

    • The contact rate between any individual pair is β

    • At any time, each individual is either uninfected (x) or infected (y)

    • x_0 = n, y_0 = 1. At all times, x + y = n + 1

  • Model this as a continuous-time process. Solving the resulting differential equations shows that as time progresses, x goes to 0 and y goes to n + 1, i.e., eventually everyone receives the gossip. We can also show that the gossip protocol is fast: it converges within a logarithmic number of rounds (the simulation after this list illustrates this).

  • Recap

    • Lightweight: Each node transmits no more than c·b·log(n) gossip messages

    • Low latency: Converges within c·log(n) rounds

    • Reliability: All but 1/n^(c·b - 2) of the nodes receive the multicast

  • While log(n) is not constant, in practice it grows very slowly (e.g., using base 2, log(1000) ~= 10, and log(number of all IPv4 addresses) = 32).

  • Packet loss: with 50% packet loss, analyze with b replaced by b/2. To achieve the same reliability as with 0% loss, it takes twice as many rounds.

  • Node failure: with 50% of the nodes failing, analyze with n replaced by n/2 and b by b/2. Same conclusion as above.

  • Fault tolerance: with failures, it is possible (but improbable) that the epidemic dies out quickly. If it happens, it happens early, but as gossips spread very fast, it is very difficult to kill a gossip after a few rounds (just like pandemics like COVID-19/rumors on the internet!)

  • In all forms of gossip, it takes O(log(n)) rounds before n/2 nodes get the gossip (the fastest structure along which a message can spread is a spanning tree). Thereafter, pull gossip is faster than push gossip: the second half of a pull gossip finishes in O(log(log(n))) time. Some more math is involved here...

  • Gossip protocols are not topology-aware -> core switches may get overloaded (O(n) load). E.g., consider two subnets/racks connected by a router: if nodes select targets uniformly at random, half of the gossips go through the router. The fix is to make gossips prefer nodes in the local subnet: in subnet i with n_i nodes, pick a gossip target inside the subnet with probability 1 - 1/n_i, and outside otherwise. With this fix, the router load becomes O(1), and the dissemination time is still O(log(n)).
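
To sanity-check the logarithmic-convergence claim, here is a rough Python simulation of synchronized push rounds (an idealization: real gossip is asynchronous and lossy):

```python
import math
import random

def rounds_to_infect_all(n, b=2):
    """Simulated push gossip with n + 1 nodes (1 sender + n receivers):
    each infected node picks b targets uniformly at random per round
    (possibly itself; good enough for a sketch). Returns the number of
    rounds until every node is infected."""
    infected = {0}
    rounds = 0
    while len(infected) < n + 1:
        new = set()
        for _ in infected:
            new.update(random.sample(range(n + 1), b))
        infected |= new
        rounds += 1
    return rounds

for n in (100, 1000, 10000):
    print(f"n={n}: {rounds_to_infect_all(n)} rounds"
          f" (log2(n) ~ {math.log2(n):.1f})")
```

The round counts grow roughly like log(n), matching the c·log(n) bound above.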

Gossip Implementations

  • Some implementations

  • Example: NNTP Inter-Server Protocol

Lesson 2: Membership

What is Group Membership List?

  • In data centers, failures are the norm, not the exception. For example, if the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years, a DC with 12000 servers has a mean time to failure (MTTF) of ~7.2 hours!

  • Thus, we need a mechanism to detect failures. Preferably, a failure detector runs distributedly and automatically, detecting failures and reporting them to your workstation.

  • Two sub-protocols are needed by a membership protocol:

    • A failure detector

    • A mechanism to disseminate information about joins, leaves, and failures of processes

Failure Detectors

  • Two correctness properties for failure detectors:

    • Completeness: Each failure is detected (eventually by one other non-faulty process)

    • Accuracy: There is no mistaken detection

    • In reality, over lossy networks, it is impossible to achieve both 100% completeness and 100% accuracy; a failure detector that did so could be used to solve consensus (getting all processes to agree on a value), which is provably impossible in asynchronous systems with failures. In real life, failure detectors guarantee completeness while providing only partial/probabilistic accuracy.

  • Two desirable properties:

    • Speed: Time until some process first detects a failure

    • Scale: Equal load on each member, and low total network message load

  • We want to satisfy all the above properties in spite of arbitrary, simultaneous process failures

  • A very simple failure detection protocol, centralized heartbeating:

    • Heartbeats sent periodically

    • If a heartbeat is not received from process p within timeout, mark p as failed

    • All processes send heartbeats to one central process

    • Cons

      • The central process may fail

      • The central process may be overloaded in a large process group

  • A variant: ring heartbeating

    • Cons

      • Unpredictable under simultaneous multiple failures: if both neighbors of a process p fail, p may itself fail undetected before its neighbors are repaired

  • A third variant: all-to-all heartbeating

    • Pros

      • Equal load per member

      • Guarantees completeness (as long as there is at least one non-faulty process in the group)

    • Cons

      • If there is one straggler process that receives packets at a longer delay than others, it may mark all other processes as failed, leading to a low accuracy/high false-positive rate

      • How to improve the robustness is covered in the next lecture

Gossip-Style Membership

  • Gossip-style heartbeating: a more robust (better accuracy properties) variant of all-to-all heartbeating

    • Each node maintains a membership list, with each entry being [node address, heartbeat counter, local time]

    • Nodes periodically gossip their membership list

    • On receipt, the local membership list is updated

      • Entries with a higher/newer heartbeat counter are updated; the updated entry's time is set to the recipient node's current local time

    • If an entry's heartbeat counter has not increased for more than T_fail seconds, the member is marked as failed

    • After T_cleanup seconds, the member is deleted from the list

    • Without this two-stage cleanup, a failed node's entry would be deleted as soon as the failure is detected; other processes that have not yet detected the failure (or deleted the entry) could then gossip the entry right back. (A sketch of the merge and cleanup logic follows this list.)

  • Analysis: tradeoff between false positive rate, detection time, and bandwidth

    • A single heartbeat takes O(log(N)) time to propagate

    • If the bandwidth allowed per node is O(N), N heartbeats take O(log(N)) time to propagate

    • If the bandwidth allowed per node is O(1) (only a few sampled entries of the membership list per gossip), N heartbeats take O(N*log(N)) time to propagate (propagation time is inversely proportional to bandwidth)

    • If the gossip period T_gossip is decreased:

      • We have a higher bandwidth/send out gossips much quicker

      • We can have a shorter time for the failure detection time T_fail and T_cleanup

      • As a result, we get a higher false positive rate: non-faulty nodes that are mistakenly suspected have slightly less time for their heartbeats to make it across
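
A minimal sketch of the merge-on-receipt and two-stage cleanup rules above (Python; the data layout and timeout constants are illustrative assumptions, not from the lecture):

```python
import time

T_FAIL = 10.0     # seconds without a heartbeat increase -> mark failed
T_CLEANUP = 10.0  # additional seconds before deleting the entry

def merge(local, received):
    """Merge a received membership list into the local one. Both map
    node address -> (heartbeat counter, last-update local time)."""
    now = time.time()
    for addr, (hb, _) in received.items():
        if addr not in local or hb > local[addr][0]:
            local[addr] = (hb, now)  # newer heartbeat: stamp OUR local time

def sweep(local, failed):
    """Mark entries failed after T_FAIL; delete them only after a further
    T_CLEANUP, so a straggling gossip cannot resurrect a failed node."""
    now = time.time()
    for addr, (_, t) in list(local.items()):
        if now - t > T_FAIL:
            failed.add(addr)
        if now - t > T_FAIL + T_CLEANUP:
            del local[addr]
            failed.discard(addr)
```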

Which is the best failure detector?

  • Metrics of comparisons

    • Completeness: we want it always guaranteed

    • Speed: denote the time to first detection of a failure to be T seconds

    • (In)Accuracy: denote as PM(T), probability of mistake in time T. In other words, this is the probability that a mistaken detection will be made in T time units.

    • Given the above requirements, we compare the total network message load, N*L (where L is the load per member), across protocols

  • All-to-all heartbeating

    • The load is linear per node: L = N / T (N heartbeats sent out every T time units)

  • Gossip-based all-to-all heartbeating

    • The gossip period is t_g seconds, during which O(N) gossip messages are sent

    • T = log(N) * t_g (gossip takes O(log(N)) rounds to propagate)

    • L = N / t_g = N * log(N) / T

    • Higher load than all-to-all heartbeating: the better accuracy is paid for with more messages (a quick numeric comparison follows this list)

  • What's the theoretical optimal?

    • Optimal L is independent of N (?!)

    • All-to-all and gossip-based protocols are sub-optimal (L = O(N / T))

    • Main reason: these two protocols mix up the failure detection and dissemination components. The keys to getting close to this optimal bound are:

      • Separate the two components

      • Use a non-heartbeat-based failure detection component
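
Plugging numbers into the formulas above makes the gap concrete (back-of-the-envelope only):

```python
import math

N = 1000   # group size
T = 10.0   # required detection time, in seconds

L_all_to_all = N / T             # each node heartbeats all N every T
L_gossip = N * math.log2(N) / T  # T = log(N) * t_g, so L = N / t_g

print(f"all-to-all heartbeating: L ~ {L_all_to_all:.0f} msgs/s per node")
print(f"gossip-based:            L ~ {L_gossip:.0f} msgs/s per node")
# The theoretical optimum is independent of N; both of the above are not.
```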

Another Probabilistic Failure Detector

  • SWIM: Scalable Weakly-consistent Infection-style Membership protocol

    • Instead of using heartbeating, we use pinging

    • Process pi runs the protocol every T time units (protocol period)

    • At the beginning of each protocol period, pi picks a process pj at random and sends it a ping message

    • If everything goes well, pj responds with an ACK

    • If no ACK comes back (the original ping or the ACK was dropped), pi tries to reach pj again via indirect paths

      • pi sends pings to K randomly selected processes, each of which pings pj directly and relays pj's ACK back to pi. If pi receives at least one ACK, all is well

      • Otherwise, pi marks pj as failed (a sketch of the whole period follows this list)

      • The reason for using indirect paths is that the pi-pj path may be congested, and it might be dropping more packets than other paths. Using indirect paths bypasses the potential congestion (spatial chance) and gives pj a second (temporal) chance.
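
A minimal sketch of one SWIM protocol period at process pi (Python; `direct_ping` and `indirect_ping` are stand-ins for the real messages, stubbed here with an assumed loss rate for illustration):

```python
import random

def direct_ping(pj):
    """Stub: ping pj and wait for an ACK; True iff the ACK arrives.
    Modeled as an assumed 10% chance of loss, purely for illustration."""
    return random.random() > 0.1

def indirect_ping(helper, pj):
    """Stub: ask `helper` to ping pj and relay the ACK back to us."""
    return direct_ping(helper) and direct_ping(pj)

def protocol_period(pi, members, k=3):
    """One protocol period at pi. Returns the node marked failed, if any."""
    pj = random.choice([m for m in members if m != pi])
    if direct_ping(pj):
        return None                      # direct ACK: pj is alive
    others = [m for m in members if m not in (pi, pj)]
    helpers = random.sample(others, min(k, len(others)))
    if any(indirect_ping(h, pj) for h in helpers):
        return None                      # ACK over an indirect path
    return pj                            # no ACK on any path: mark failed
```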

Dissemination and suspicion

  • Dissemination options

    • Multicast (Hardware/IP)

      • Unreliable

      • Multiple simultaneous multicasts

    • Point-to-point (TCP/UDP)

      • Expensive

    • Zero extra messages: Piggyback on Failure Detector messages

      • Infection-style dissemination

        • Maintain a buffer of recently joined/evicted processes

          • Piggyback from this buffer

          • Prefer recent updates

        • Buffer elements are garbage collected after a while

  • Suspicion mechanism

    • False positives might be due to

      • Perturbed processes

      • Packet losses, e.g. from congestion

      • Indirect pinging may not solve the problem (correlated message losses near pinged host)

    • Solution: suspect a process before declaring it failed to the group

    • To distinguish multiple suspicions of the same process, use per-process incarnation numbers (a small ordering sketch follows this list)

      • Higher inc# notifications override lower inc#'s

      • Within an inc#: (Suspected, inc#) > (Alive, inc#)

      • (Failed, inc#) overrides everything else
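
The three override rules can be collapsed into a single comparison (a sketch; the tuple encoding is my own, not the lecture's):

```python
# Within an incarnation number, Suspected outranks Alive; across
# incarnation numbers, higher wins; Failed outranks everything.
RANK = {"Alive": 0, "Suspected": 1}

def overrides(a, b):
    """True if notification a = (state, inc#) overrides notification b."""
    (state_a, inc_a), (state_b, inc_b) = a, b
    if state_a == "Failed" or state_b == "Failed":
        return state_a == "Failed"
    return (inc_a, RANK[state_a]) > (inc_b, RANK[state_b])

assert overrides(("Suspected", 3), ("Alive", 3))   # same inc#
assert overrides(("Alive", 4), ("Suspected", 3))   # higher inc# wins
assert overrides(("Failed", 0), ("Suspected", 9))  # Failed beats all
```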

Lesson 3: Grids

Grid Applications

  • Example: RAMS (Rapid Atmospheric Modeling System)

    • Compute-intensive computing (or HPC)

    • Can such programs be run without access to a supercomputer?

    • See picture below for a set of distributed computing resources in a grid

  • Grid applications...

    • May have several GBs of intermediate data

    • May take several hours/days

    • Have four stages: Init, Stage in, Execute, and Stage out, plus an optional Publish stage

    • Are computationally intensive and massively parallel

  • The core problem comes down to scheduling and resource allocation

Grid Infrastructure

  • 2-level scheduling infrastructure

    • Intra-site: for example, UW-Madison uses the HTCondor protocol

      • Such protocols are responsible for:

        • Internal allocation & scheduling

        • Monitoring

        • Distribution and publishing of files

      • HTCondor:

        • Runs on a lot of workstations

        • When a workstation is free, it asks the site's central server (or Globus) for tasks

        • If the user hits a keystroke, the task is stopped (either killed or rescheduled)

    • Inter-site: e.g., Globus protocol

      • It is responsible for:

        • External allocation & scheduling

        • Stage in & stage out of files

      • Internal structures of different sites may be invisible to Globus

  • Globus toolkit

  • Grids are federated, i.e. no single entity controls the entire infrastructure

  • Architectures & key concepts have a lot in common with those of clouds


Figure captions (images not included in this export):

  • Push gossip: in this example, we start with one sender at the bottom-left corner
  • Some gossip implementations
  • NNTP inter-server protocol
  • Distributed computing resources: for example, in the UW-Madison (!) CS lab, workstations can be harvested at night for running Grid applications
  • Example Grid applications expressed as DAGs