# Lecture 14: CUDA Case Studies. (1) 1D Stencil Operation. (2) Vector Reduction in CUDA.

## Lecture Summary

* Last time
  * Atomic operations
  * Things that shape the speed of execution of a kernel
    * The concept of "occupancy" and what impacts it (how many threads per block, how many registers/thread, how much ShMem/block)
  * Rules of thumb, for good execution speed in GPU computing
  * The nvcc toolchain, and how code is routed to the host or GPU compilers
* Today
  * Case studies: parallel reduction on the GPU & 1D convolution
  * Looking beyond today: a few more GPU computing features, then a sustained focus on optimization techniques

![Application optimization process](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaGzNUqo5nvkZ8652d%2FScreen%20Shot%202021-02-27%20at%207.36.50%20PM.png?alt=media\&token=b71ca91a-4b11-4504-969b-8c94669c91bb)

## 1D Stencil Operation

![What the algorithm does](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaJHU_DXQVJAav8PI6%2FScreen%20Shot%202021-02-27%20at%207.46.52%20PM.png?alt=media\&token=ec3d15f0-f7a0-4467-a9a2-3b95169f0f61)

![Serial implementation](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaF6vRP6ESWAOteYoK%2FScreen%20Shot%202021-02-27%20at%207.28.39%20PM.png?alt=media\&token=3303e930-3e03-43d9-8a4d-c0f8f24d0db1)

![Parallel implementation](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaFBvQe_ao9L1H6S1U%2FScreen%20Shot%202021-02-27%20at%207.29.00%20PM.png?alt=media\&token=8da2c593-59b9-4b07-a672-ce040680ef1a)
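The shared-memory version in the slide can be sketched roughly as below (names such as `stencil1D`, `RADIUS`, and `BLOCK_SIZE` are illustrative, and the sketch assumes the caller pads `in` by `RADIUS` elements on each side so the halo reads stay in bounds):

```cuda
#define RADIUS     3
#define BLOCK_SIZE 256

// Each output element is the sum of the 2*RADIUS+1 input elements around it.
__global__ void stencil1D(const int *in, int *out) {
    __shared__ int tile[BLOCK_SIZE + 2 * RADIUS];
    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int lIdx = threadIdx.x + RADIUS;

    // Stage this thread's element; the first RADIUS threads also fetch
    // the left and right halo regions.
    tile[lIdx] = in[gIdx];
    if (threadIdx.x < RADIUS) {
        tile[lIdx - RADIUS]     = in[gIdx - RADIUS];
        tile[lIdx + BLOCK_SIZE] = in[gIdx + BLOCK_SIZE];
    }
    __syncthreads();  // the whole tile must be staged before anyone reads it

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += tile[lIdx + offset];
    out[gIdx] = result;
}
```

Each input element is read from global memory once into shared memory but used up to `2*RADIUS + 1` times, which is the point of the tiling.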

![nvprof pointed out spaces for optimizations](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaFhSgzV3a-743xa61%2FScreen%20Shot%202021-02-27%20at%207.31.13%20PM.png?alt=media\&token=0deb1628-d58d-413a-93ed-cbf925a1cefd)

![Use pinned memory (pinned memory cannot be paged out by the OS)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaFmXuI4iYW3mtJnTM%2FScreen%20Shot%202021-02-27%20at%207.31.34%20PM.png?alt=media\&token=db2c042c-6464-4a7c-9255-3aaffc86032c)
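Allocating pinned host memory is a one-line change on the host side; a minimal sketch (buffer name and size are hypothetical):

```cuda
float *h_buf;
// cudaMallocHost allocates page-locked host memory. Because the OS
// cannot page it out, the GPU's DMA engine can transfer from it
// directly, and it is required for cudaMemcpyAsync to truly overlap.
cudaMallocHost(&h_buf, N * sizeof(float));

// ... use h_buf as the host side of cudaMemcpy / cudaMemcpyAsync ...

cudaFreeHost(h_buf);  // pinned allocations have their own free call
```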

![Data partitioning example (overlapping compute & memory)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaHGbHf0eZozOEa_6H%2FScreen%20Shot%202021-02-27%20at%207.38.03%20PM.png?alt=media\&token=3ad2a42b-be3f-4f85-abe2-9fc47d70694b)
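The partitioning idea can be sketched with CUDA streams (chunk sizes, the `kernel` name, and `nStreams` are illustrative; `h_in`/`h_out` must be pinned for the copies to overlap):

```cuda
// Split the work into chunks, one stream per chunk, so that chunk i's
// copies can overlap chunk i-1's kernel execution.
const int nStreams = 4;
cudaStream_t streams[nStreams];
int chunk = N / nStreams;  // assume N divisible by nStreams
for (int i = 0; i < nStreams; i++) cudaStreamCreate(&streams[i]);

for (int i = 0; i < nStreams; i++) {
    int off = i * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    kernel<<<chunk / 256, 256, 0, streams[i]>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();  // wait for all streams before using h_out
```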

![Performance improvements](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaHVUUxxYU36hw0EQF%2FScreen%20Shot%202021-02-27%20at%207.39.04%20PM.png?alt=media\&token=0dc82e32-002d-447f-b7d2-86bd2932658c)

![Optimization summary](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaHl8Jiw7XsP_gAIWm%2FScreen%20Shot%202021-02-27%20at%207.40.12%20PM.png?alt=media\&token=4a73b052-2e6a-49e2-b327-78911445aa6b)

## Vector Reduction in CUDA

![What the algorithm does (summing all entries in an array)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaJQQHlwV7DAtdvMzz%2FScreen%20Shot%202021-02-27%20at%207.47.28%20PM.png?alt=media\&token=b87bc7db-c74e-46bd-b065-ce715851f736)

Problem: ideally we would synchronize across all thread blocks after each reduction step, but CUDA provides no global synchronization within a kernel launch. The workaround is to decompose the reduction into multiple kernel launches: kernel completion acts as the global synchronization point.
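The host-side decomposition might look like the following sketch (a hypothetical `reduce` kernel that writes one partial sum per block):

```cuda
// Each launch reduces n elements down to `blocks` partial sums. The
// implicit synchronization between launches replaces the missing
// global barrier; repeat until a single value remains.
int n = N;
while (n > 1) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    reduce<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
    float *tmp = d_in; d_in = d_out; d_out = tmp;  // next pass reads the partials
    n = blocks;
}
// d_in[0] now holds the final sum
```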

* Optimization goal: Reaching GPU peak performance
  * Choosing the right metric
    * GFLOP/s: for compute-bound kernels
    * Bandwidth: for memory-bound kernels
* Reductions have very low arithmetic intensity (1 flop per 2 elements loaded), so they are memory-bound and the right target is peak bandwidth
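As a worked example with hypothetical numbers: a reduction over \(2^{22}\) 4-byte floats reads about 16.8 MB, so if the kernel runs in 0.5 ms its effective bandwidth is

```
effective bandwidth = bytes accessed / elapsed time
                    = (2^22 elements × 4 B) / 0.5 ms ≈ 33.6 GB/s
```

which can then be compared against the device's theoretical peak to judge how much headroom remains.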

![Interleaved addressing: highly divergent warps are inefficient, and % operator is very slow](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaZQUej2Ywkez82zyK%2FScreen%20Shot%202021-02-27%20at%208.57.23%20PM.png?alt=media\&token=5b151f76-29e8-4058-954c-b865ae750db5)
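The naive interleaved version from the slide can be sketched as follows (a per-block tree reduction in shared memory; launch it with `blockDim.x * sizeof(float)` bytes of dynamic shared memory):

```cuda
// Kernel 1: interleaved addressing. Only threads whose id is a
// multiple of 2*s do work, so every warp keeps a mix of active and
// idle lanes (divergence), and the % operator is costly.
__global__ void reduce1(const float *g_in, float *g_out) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();  // every step depends on the previous one
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```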

![Change which thread works on what. New problem: shared memory bank conflicts](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUaZafkMRUMDusaJMYU%2FScreen%20Shot%202021-02-27%20at%208.58.08%20PM.png?alt=media\&token=7ea69cc7-53ee-40df-978a-532654986f49)
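Only the inner loop changes relative to the interleaved kernel; a sketch (assuming the usual per-block shared array `sdata` and `tid = threadIdx.x`):

```cuda
// Kernel 2: strided index instead of a divergent branch. Low-numbered
// threads do the work, so whole warps retire early (no divergence),
// but threads now hit shared memory with stride 2*s, which causes
// bank conflicts.
for (unsigned s = 1; s < blockDim.x; s *= 2) {
    unsigned index = 2 * s * tid;
    if (index < blockDim.x)
        sdata[index] += sdata[index + s];
    __syncthreads();
}
```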

![Sequential addressing](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUZyco4Pl9xOtTJ0-DF%2F-MUa_4Wm9gkhVOKy1Uit%2FScreen%20Shot%202021-02-27%20at%209.00.15%20PM.png?alt=media\&token=f6b0626b-f07c-4857-a63f-43aac75436f0)
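Sequential addressing reverses the loop so active threads read contiguous, conflict-free locations; a sketch of the loop (same `sdata`/`tid` conventions as before):

```cuda
// Kernel 3: sequential addressing. Threads 0..s-1 each add the element
// s positions away, so shared-memory accesses are contiguous and free
// of bank conflicts.
for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
```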

* Kernel 4: Replace the single load with two loads and the first add of the reduction
* Kernel 5: Loop unrolling (unroll the last warp)
* Kernel 6: Complete unrolling (using templates)
* Kernel 7: Multiple elements per thread
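The first two of these refinements can be sketched as fragments of the same reduction kernel (same `sdata`/`tid` conventions; the `volatile` idiom is the classic pre-Volta form, while newer code would use `__syncwarp()` or warp shuffles):

```cuda
// Kernel 4: halve the block count by having each thread load two
// elements and perform the first add during the load.
unsigned i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = g_in[i] + g_in[i + blockDim.x];
__syncthreads();

// ... sequential-addressing loop down to s == 32 ...

// Kernel 5: once s <= 32 only one warp is active, so the last six
// iterations need no __syncthreads(); unroll them. `volatile` stops
// the compiler from caching sdata values in registers between steps.
if (tid < 32) {
    volatile float *v = sdata;
    v[tid] += v[tid + 32]; v[tid] += v[tid + 16];
    v[tid] += v[tid + 8];  v[tid] += v[tid + 4];
    v[tid] += v[tid + 2];  v[tid] += v[tid + 1];
}
```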
