Rui's Blog

Lecture 27: MPI Parallel Programming. Point-to-point communication: blocking vs. non-blocking sends

Lecture Summary

  • Last time
    • HPC via MPI
    • MPI point-to-point communication: The blocking flavor
  • Today
    • Wrap up point-to-point communication
    • Collective communication

Point-to-point communication

  • Different "send" modes:
    • Synchronous send: MPI_SSEND
      • Risk of deadlock/waiting -> idle time
      • High latency but better bandwidth than bsend
    • Buffered (async) send: MPI_BSEND
      • Low latency, but also low bandwidth
    • Standard send: MPI_SEND
      • Up to the MPI implementation to decide whether to use the rendezvous or the eager protocol
      • Less overhead if in eager mode
      • In rendezvous mode the call blocks and behaves like a synchronous send
    • Ready send: MPI_RSEND
      • Works only if the matching receive has been posted
      • Rarely used, very dangerous
  • Receiving, all modes: MPI_RECV
  • Buffered send
    • Reduces overhead associated with data transmission
    • Relies on the existence of a buffer. Buffering incurs an extra memory copy
    • Return from an MPI_Bsend does not guarantee the message was sent: the message remains in the buffer until a matching receive is posted
Blocking options

Non-blocking point-to-point

  • Blocking send: Covered above. Upon return from a send, you can safely modify the buffer that held the outgoing data, since the data has either been sent or copied into a separate buffer
  • Non-blocking send: The call returns immediately, with no guarantee that the data has been transmitted, so the send buffer must not be modified until the operation completes
    • Routine name starts with MPI_I (e.g., MPI_Isend, MPI_Irecv)
    • The caller gets to do useful work (overlap communication with execution) upon return from the non-blocking call
    • Use a synchronization call (e.g., MPI_Wait) to wait for the communication to complete
  • MPI_Wait: Blocks until a certain request is completed
    • Wait for multiple requests: MPI_Waitall, MPI_Waitany, MPI_Waitsome
  • MPI_Test: Non-blocking, returns quickly with status information
    • int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
  • MPI_Probe: Allows for incoming messages to be queried prior to receiving them

Collective communications

  • Three types of collective actions:
    • Synchronization (barrier)
    • Communication (e.g., broadcast)
    • Operation (e.g., reduce)
  • Broadcast: MPI_Bcast
  • Gather: MPI_Gather
  • Scatter: MPI_Scatter
  • Reduce: MPI_Reduce
    • Result is collected by the root only
  • Allreduce: MPI_Allreduce
    • Result is sent out to all ranks in the communicator
  • Prefix scan: MPI_Scan
  • User-defined reduction operations: Register using MPI_Op_create()
Visualization of the operations, excerpted from the Distributed PyTorch documentation
Predefined reduction operations