Lecture 13: Atomic operations in CUDA. GPU ode optimization rules of thumb.

Lecture Summary

Last time
- GPU mem operations: focus on shared memory
- GPU mem operations: focus on global memory
- How parallel computing makes memory operations tricky
  - RAW, WAW, WAR hazards
Today
- Atomic operations
- Things that determine the speed of execution of a kernel
- Case studies: parallel reduction on the GPU

As threadIdx.x changes faster than threadIdx.y, we should have C[j][i] = A[j][i] + B[j][i] instead of C[i][j] = A[i][j] + B[i][j].

Two ways to store data in global memory:

Atomic memory operations is a mechanism that alleviates race conditions/access coordination problems
The order in which concurrent atomic updates are performed is not defined
While the order is not clear, none of the atomically performed updates will be lost
Performance becomes poor when many threads attempt to perform atomic operations on a small number of locations
When to use:
- Cannot fall back on normal memory operations because of possible race conditions
- Use for infrequent, sparse, and/or unpredictable global communication
- Use shared memory and/or customized data structures & algorithms to avoid synchronization whenever possible
Difference between __syncthreads() and an atomic operation:
- __syncthreads() establishes a barrier, i.e. of synchronization
- Atomic operations instead tie to the idea of coordination in relation to operations that involve memory transactions. Threads need not synchronize their execution, it’s only that a certain memory operation in a kernel is conducted in an atomic fashion

To get the maximum benefit from CUDA, focus first on finding ways to parallelize sequential code. Expose fine-grain parallelism
Minimize data transfer between the host and the device, even if it means running some kernels on the device that do not show performance gains when compared with running them on the host CPU
Strive to have aligned and coalesced global memory accesses. Design your implementation such that global memory accesses are coalesced for that part of the red-hot parts of the code
Minimize the use of global memory. Prefer shared memory access where possible (consider tiling as a design solution)

Accesses to shared memory should be designed to avoid serializing requests due to bank conflicts
Strive for sufficient occupancy
Keep the number of threads per block a multiple of 32 to avoid wasted lanes
Use the fast math library whenever speed is very important, and you can live with a tiny loss of accuracy
Avoid thread divergence

Last updated 4 years ago

Was this helpful?