Lecture 27: MPI Parallel Programming. Point-to-Point communication: Blocking vs. Non-blocking sends.
Lecture Summary
Last time
HPC via MPI
MPI point-to-point communication: The blocking flavor
Today
Wrap up point-to-point communication
Collective communication
Point-to-point communication
Different "send" modes (a minimal blocking example follows this list):
Synchronous send: MPI_Ssend
Risk of deadlock/waiting -> idle time
High latency, but better bandwidth than a buffered send
Buffered (asynchronous) send: MPI_Bsend
Low latency, but low bandwidth (pays for an extra buffer copy)
Standard send: MPI_Send
Up to the MPI implementation to decide whether to use the eager or rendezvous protocol
Less overhead if in eager mode
Blocks in rendezvous mode, effectively switching to synchronous behavior
Ready send: MPI_Rsend
Works only if the matching receive has already been posted
Rarely used, very dangerous
Receiving, all modes: MPI_Recv
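A minimal sketch of blocking point-to-point communication in standard mode, assuming exactly two ranks and illustrative variable names; MPI_Ssend, MPI_Bsend, and MPI_Rsend take the same argument list as MPI_Send, so the mode can be switched by swapping one call:

/* Blocking standard-mode send/receive between rank 0 and rank 1.
   Build with mpicc, run with: mpiexec -n 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double data[4] = {0.0, 1.0, 2.0, 3.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Standard send: the implementation picks eager or rendezvous. */
        MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: returns once the message has landed in data[]. */
        MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received data[3] = %f\n", data[3]);
    }

    MPI_Finalize();
    return 0;
}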
Buffered send
Reduces overhead associated with data transmission
Relies on the existence of a buffer. Buffering incurs an extra memory copy
Return from an MPI_Bsend does not guarantee the message was sent: the message remains in the buffer until a matching receive is posted
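A sketch of the buffered-send workflow, again assuming two ranks; the buffer size estimate here (message size plus MPI_BSEND_OVERHEAD) is a simplification of what MPI_Pack_size would report:

/* Buffered send: the user attaches the buffer that MPI_Bsend copies into. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, msg = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Size the attached buffer for one int message plus MPI bookkeeping. */
        int size = sizeof(int) + MPI_BSEND_OVERHEAD;
        char *buf = malloc(size);
        MPI_Buffer_attach(buf, size);

        /* Returns as soon as the message is copied into the attached buffer;
           delivery happens later, when the matching receive is posted. */
        MPI_Bsend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Detach blocks until all buffered messages have been transmitted. */
        MPI_Buffer_detach(&buf, &size);
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}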
Non-blocking point-to-point
Blocking send: Covered above. Upon return from a blocking send you may safely modify the send buffer, since MPI has already transmitted the data or copied it out of the buffer
Non-blocking send: The call returns immediately; there is no guarantee that the data has been transmitted
Routine names start with MPI_I (e.g., MPI_Isend, MPI_Irecv)
The caller can do useful work upon return from the non-blocking call, overlapping communication with computation
Use a synchronization call to wait for the communication to complete
MPI_Wait: Blocks until a certain request is completed
Wait on multiple requests: MPI_Waitall, MPI_Waitany, MPI_Waitsome
MPI_Test: Non-blocking, returns quickly with status information
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
MPI_Probe: Allows for incoming messages to be queried prior to receiving them
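A sketch of non-blocking communication with MPI_Isend/MPI_Irecv completed by MPI_Wait, assuming only ranks 0 and 1 exchange data; any extra ranks hold MPI_REQUEST_NULL, for which MPI_Wait returns immediately:

/* Non-blocking send/receive overlapped with computation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double outbuf[1000], inbuf[1000];
    MPI_Request req = MPI_REQUEST_NULL;  /* MPI_Wait on a null request is a no-op */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 1000; i++) outbuf[i] = rank + i;

    if (rank == 0) {
        /* Returns immediately; outbuf must not be modified until the request completes. */
        MPI_Isend(outbuf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        /* Returns immediately; inbuf must not be read until the request completes. */
        MPI_Irecv(inbuf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* ... useful work goes here, overlapping communication with computation ... */

    /* Synchronize: block until the pending transfer (if any) has completed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 1) printf("rank 1: inbuf[999] = %f\n", inbuf[999]);

    MPI_Finalize();
    return 0;
}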
Collective communications
Three types of collective actions:
Synchronization (barrier)
Communication (e.g., broadcast)
Operation (e.g., reduce)
The tutorial "Writing Distributed Applications with PyTorch" is a good reference for these concepts
Broadcast: MPI_Bcast
Gather: MPI_Gather
Scatter: MPI_Scatter
Reduce: MPI_Reduce
Result is collected by the root only
Allreduce: MPI_Allreduce
Result is sent out to all ranks in the communicator
Prefix scan: MPI_Scan
User-defined reduction operations: Register using MPI_Op_create()
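A sketch tying the collectives together, with the operation handle absmax_op and the combiner abs_max as illustrative names: the root broadcasts a scale factor with MPI_Bcast, MPI_Reduce delivers a sum to the root only, and a user-defined reduction registered via MPI_Op_create is applied through MPI_Allreduce so every rank receives the result:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

/* User-defined reduction: element-wise "value with the largest magnitude". */
static void abs_max(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype) {
    double *in = (double *)invec, *inout = (double *)inoutvec;
    (void)dtype;  /* only MPI_DOUBLE is used in this sketch */
    for (int i = 0; i < *len; i++)
        if (fabs(in[i]) > fabs(inout[i])) inout[i] = in[i];
}

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Broadcast: the root's value of 'scale' ends up on every rank. */
    double scale = (rank == 0) ? 2.5 : 0.0;
    MPI_Bcast(&scale, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local = scale * ((rank % 2) ? -rank : rank);  /* per-rank contribution */
    double global;

    /* Reduce: only the root (rank 0) receives the sum. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum at root = %f\n", global);

    /* Register the custom op (commutative), then Allreduce: all ranks get the result. */
    MPI_Op absmax_op;
    MPI_Op_create(abs_max, 1, &absmax_op);
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, absmax_op, MPI_COMM_WORLD);
    printf("rank %d: abs max = %f\n", rank, global);

    MPI_Op_free(&absmax_op);
    MPI_Finalize();
    return 0;
}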