# Lecture 14: CUDA Case Studies. (1) 1D Stencil Operation. (2) Vector Reduction in CUDA.

- Last time
  - Atomic operations
  - Factors that shape the execution speed of a kernel
  - The concept of "occupancy" and what impacts it (threads per block, registers per thread, shared memory per block)
  - Rules of thumb for good execution speed in GPU computing
  - The nvcc toolchain, and how code is routed to the host and GPU compilers

- Today
  - Case studies: parallel reduction on the GPU & 1D stencil (convolution)
  - Looking ahead: a few more GPU computing features, then an extended look at optimization techniques

Application optimization process

What the algorithm does

Serial implementation

Parallel implementation

nvprof points out opportunities for optimization

Use pinned memory (pinned memory cannot be paged out by the OS)
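A minimal sketch of using pinned host memory (sizes and names are illustrative). Because pinned pages cannot be swapped out, the GPU's DMA engine can transfer them directly, and `cudaMemcpyAsync` can actually run asynchronously:

```cuda
// Allocate page-locked (pinned) host memory instead of malloc().
float *h_data;
size_t bytes = N * sizeof(float);           // N is assumed defined elsewhere
cudaMallocHost((void **)&h_data, bytes);    // pinned allocation

// ... fill h_data, then use cudaMemcpyAsync for H2D/D2H transfers ...

cudaFreeHost(h_data);                       // pinned memory must be freed with cudaFreeHost
```

Note that pinning large amounts of memory reduces what the OS can page, so pin only the buffers involved in transfers.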

Data partitioning example (overlapping computation and memory transfers)
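A sketch of the partitioning idea, assuming two CUDA streams and a hypothetical `kernel` working on independent halves of the data; copies in one stream can overlap with computation in the other:

```cuda
// Split the work into two chunks, each processed in its own stream so that
// the H2D copy of chunk 1 can overlap with the kernel running on chunk 0.
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

int chunk = N / 2;                          // N, h_a, d_a assumed defined; h_a must be pinned
for (int s = 0; s < 2; ++s) {
    int off = s * chunk;
    cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    kernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_a + off, chunk);
    cudaMemcpyAsync(h_a + off, d_a + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();                    // wait for both pipelines to drain
```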

Performance improvements

Optimization summary

What the algorithm does (summing all entries in an array)
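As a reference point, the serial version is a single O(N) accumulation loop on the host (a minimal sketch; the function name is illustrative):

```c
/* Serial reference implementation: sum all entries of an array. */
int reduceSerial(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```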

Problem: Ideally we would synchronize across all thread blocks between steps of the reduction, but CUDA has no global synchronization. The workaround is to decompose the reduction into multiple kernel launches: each launch produces one partial sum per block, and launches on the same stream serialize, providing the synchronization we need.
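A host-side sketch of that decomposition (names like `reduce`, `d_in`, `d_out` are illustrative): each launch collapses `n` elements into one partial sum per block, and we re-launch until one value remains:

```cuda
// Repeatedly launch the per-block reduction; each level of the reduction
// tree is a separate kernel launch, which acts as a global barrier.
int n = N;                                  // N, d_in, d_out assumed allocated
int threads = 256;
while (n > 1) {
    int blocks = (n + threads - 1) / threads;
    reduce<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
    int *tmp = d_in; d_in = d_out; d_out = tmp;   // partial sums become input
    n = blocks;
}
// d_in[0] now holds the final sum.
```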

- Optimization goal: Reaching GPU peak performance
- Choosing the right metric
- GFLOP/s: for compute-bound kernels
- Bandwidth: for memory-bound kernels

- Reductions have low arithmetic intensity (1 flop per 2 elements loaded), so we should aim for peak bandwidth
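Effective bandwidth can be measured by timing the kernel with CUDA events and dividing the bytes moved by the elapsed time (a sketch; `reduce`, `blocks`, `threads`, `d_in`, `d_out`, `N` are assumed from context):

```cuda
// Time one kernel launch and report effective bandwidth in GB/s.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduce<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
// Bytes read / seconds, expressed in GB/s:
double gbps = (double)(N * sizeof(int)) / (ms * 1e6);
```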

Kernel 1, interleaved addressing: highly divergent warps are inefficient, and the % (modulo) operator is very slow
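A sketch of this baseline kernel, following the classic CUDA reduction example (the name `reduce1` is illustrative; N is assumed to be a multiple of the block size):

```cuda
__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = g_idata[i];                // each thread loads one element
    __syncthreads();

    // Interleaved addressing: at each step only threads whose id is a
    // multiple of 2*s are active, so every warp diverges, and the modulo
    // test itself is expensive.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];   // one partial sum per block
}
```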

Kernel 2, interleaved addressing with a strided index: change which thread works on which elements to remove warp divergence. New problem: shared memory bank conflicts
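A sketch of the strided-index variant (the name `reduce2` is illustrative): the active threads are now contiguous, so warps no longer diverge, but consecutive threads access shared memory words two or more banks apart, causing bank conflicts:

```cuda
__global__ void reduce2(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_idata[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Strided indexing: thread tid works on element 2*s*tid, so active
    // threads are contiguous (no divergence), but for s >= 1 the accesses
    // stride through shared memory banks, producing bank conflicts.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```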

Kernel 3, sequential addressing: no divergence and no bank conflicts
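A sketch of the sequential-addressing kernel (the name `reduce3` is illustrative): the stride starts at half the block size and halves each step, so both the active threads and the shared memory words they touch stay contiguous:

```cuda
__global__ void reduce3(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_idata[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Sequential addressing: threads 0..s-1 add in elements s..2s-1.
    // Active threads and their shared memory accesses are contiguous,
    // so there is no warp divergence and no bank conflict.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```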

- Kernel 4: Replace single load w/ two loads and first add of the reduction
- Kernel 5: Loop unrolling (unroll last warp)
- Kernel 6: Complete unrolling (using templates)
- Kernel 7: Multiple elements per thread
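A sketch combining these last optimizations in the style of the classic reduction example (names are illustrative): each thread performs its first add while loading, sums multiple elements in a grid-stride loop, and the final warp is unrolled. Note the warp-synchronous tail relies on `volatile`, and on modern GPUs also needs `__syncwarp()` to remain correct:

```cuda
// Unrolled final warp: within one warp no __syncthreads() is needed, but
// sdata must be volatile so each store is visible to the next statement.
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

__global__ void reduce7(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i        = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    // Multiple elements per thread: first add happens during the load.
    sdata[tid] = 0;
    while (i < n) {
        sdata[tid] += g_idata[i];
        if (i + blockDim.x < n)                  // guard the second load
            sdata[tid] += g_idata[i + blockDim.x];
        i += gridSize;
    }
    __syncthreads();

    // Tree reduction in shared memory down to one warp...
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // ...then finish with the unrolled last warp.
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

The complete-unrolling version (kernel 6) replaces the loop with a `template <unsigned int blockSize>` parameter so every stride comparison is resolved at compile time.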