Lecture 14: CUDA Case Studies. (1) 1D Stencil Operation. (2) Vector Reduction in CUDA.
Last time
Atomic operations
Things that shape the speed of execution of a kernel
The concept of "occupancy" and what impacts it (threads per block, registers per thread, shared memory per block)
Rules of thumb for good execution speed in GPU computing
The nvcc toolchain, and how code is routed to the host or GPU compilers
Today
Case studies: parallel reduction on the GPU & 1D convolution
Looking beyond today: a few more GPU computing features, then an extended look at optimization techniques
Problem: Ideally we want to synchronize across all thread blocks, but CUDA provides no global synchronization inside a kernel. Our workaround is to decompose the computation into multiple kernel launches.
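A minimal sketch of that decomposition (function names, parameters, and launch configuration here are placeholders, not the lecture's code): the kernel does a plain shared-memory tree reduction that leaves one partial sum per block, and the host loop re-launches it until a single value remains. Each launch boundary acts as the device-wide synchronization point we cannot get inside a kernel.

```cpp
#include <cuda_runtime.h>
#include <utility>

// One block-level pass: each block reduces blockDim.x elements in shared
// memory and writes its partial sum to g_out[blockIdx.x].
// Assumes blockDim.x is a power of two.
__global__ void reduce(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (sequential addressing).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

// Host-side workaround for the missing global barrier: keep launching the
// kernel until one value is left; each launch is a device-wide sync point.
float reduceToScalar(float* d_in, float* d_scratch, unsigned int n,
                     unsigned int threads = 256)
{
    while (n > 1) {
        unsigned int blocks = (n + threads - 1) / threads;
        reduce<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_scratch, n);
        std::swap(d_in, d_scratch);   // partial sums feed the next pass
        n = blocks;
    }
    float result;
    cudaMemcpy(&result, d_in, sizeof(float), cudaMemcpyDeviceToHost);
    return result;
}
```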
Optimization goal: Reaching GPU peak performance
Choosing the right metric
GFLOP/s: for compute-bound kernels
Bandwidth: for memory-bound kernels
Reductions have low arithmetic intensity (1 flop/2 elements loaded), so we should go for peak bandwidth
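Since the reduction is memory-bound, the number to watch is effective bandwidth rather than GFLOP/s. A hedged example of how one might measure it, written as a fragment that reuses the hypothetical names (reduce, d_in, d_scratch, blocks, threads, N) from the sketch above: time the launch with CUDA events and divide the bytes moved by the elapsed time.

```cpp
// Effective bandwidth of one reduction pass over N floats:
//   bytes moved ~ N * sizeof(float) loads (plus one partial sum stored per block)
//   GB/s        = bytes / (seconds * 1e9)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduce<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_scratch, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
double bytes = double(N) * sizeof(float);      // dominant traffic: the loads
double gbps  = bytes / (ms * 1.0e6);           // bytes per ms / 1e6 == GB/s
printf("effective bandwidth: %.1f GB/s\n", gbps);  // compare against device peak
```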
Kernel 4: Replace the single load with two loads, performing the first add of the reduction during the load (kernels 4-7 are combined in the sketch after this list)
Kernel 5: Loop unrolling (unroll last warp)
Kernel 6: Complete unrolling (using templates)
Kernel 7: Multiple elements per thread
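A sketch of what a kernel combining these four steps can look like (names and launch configuration are illustrative, not the lecture's exact code): the first add happens during the global load (kernel 4), a grid-stride loop lets each thread accumulate several elements (kernel 7), the shared-memory tree is fully unrolled through a compile-time block size (kernel 6), and the final warp runs without __syncthreads() (kernel 5). Note that the warp-synchronous last step relies on pre-Volta implicit warp synchronization; on newer architectures it should be replaced with __syncwarp() or warp-shuffle intrinsics.

```cpp
#include <cuda_runtime.h>

// Combined sketch of kernels 4-7. blockSize must be a power of two between
// 64 and 512 and must match the launch configuration.
template <unsigned int blockSize>
__global__ void reduceFinal(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    // Kernel 7: grid-stride loop, each thread sums many elements;
    // kernel 4: pairs are added as they are loaded from global memory.
    float sum = 0.0f;
    while (i < n) {
        sum += g_in[i];
        if (i + blockSize < n) sum += g_in[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Kernel 6: the tree reduction is completely unrolled; the branches on
    // blockSize are resolved at compile time.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    // Kernel 5: the last warp proceeds without __syncthreads(); volatile
    // keeps the compiler from caching shared-memory values in registers.
    if (tid < 32) {
        volatile float* v = sdata;
        if (blockSize >= 64) v[tid] += v[tid + 32];
        v[tid] += v[tid + 16];
        v[tid] += v[tid +  8];
        v[tid] += v[tid +  4];
        v[tid] += v[tid +  2];
        v[tid] += v[tid +  1];
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

// Illustrative launch: 256-thread blocks, shared memory sized to the block.
// reduceFinal<256><<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_scratch, n);
```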