# Lecture 14: CUDA Case Studies. (1) 1D Stencil Operation. (2) Vector Reduction in CUDA.

## Lecture Summary

* Last time
  * Atomic operations
  * Things that shape the speed of execution of a kernel
    * The concept of "occupancy" and what impacts it (how many threads per block, how many registers/thread, how much ShMem/block)
  * Rules of thumb for good execution speed in GPU computing
  * The nvcc toolchain, and how code is sent to the host or GPU compilers
* Today
  * Case studies: parallel reduction on the GPU & 1D convolution
  * Looking beyond today: a few more GPU computing features, but the focus for a while will be on optimization features

![Application optimization process](/files/-MUaGzNUqo5nvkZ8652d)

## 1D Stencil Operation

![What the algorithm does](/files/-MUaJHU_DXQVJAav8PI6)

![Serial implementation](/files/-MUaF6vRP6ESWAOteYoK)

![Parallel implementation](/files/-MUaFBvQe_ao9L1H6S1U)
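For readers without the slides, here is a minimal sketch of what a parallel 1D stencil can look like. It assumes the stencil sums the `2 * RADIUS + 1` neighbors around each element and treats out-of-range neighbors as zero. The names `stencil1d`, `run_stencil`, and `RADIUS` are illustrative and are not taken from the lecture code.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define RADIUS 3  // assumed stencil radius (illustrative)

// One thread per output element; neighbors outside the array count as zero.
__global__ void stencil1d(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int sum = 0;
    for (int off = -RADIUS; off <= RADIUS; ++off) {
        int j = i + off;
        if (j >= 0 && j < n) sum += in[j];
    }
    out[i] = sum;
}

// Hypothetical host wrapper: copy input to the GPU, launch, copy result back.
// Assumes a non-empty input.
std::vector<int> run_stencil(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    std::vector<int> h_out(n);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    stencil1d<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

In this naive form each input element is read from global memory up to `2 * RADIUS + 1` times, which is one reason a profiler would flag room for improvement.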

![nvprof pointed out spaces for optimizations](/files/-MUaFhSgzV3a-743xa61)

![Use pinned memory (pinned memory cannot be paged out by the OS)](/files/-MUaFmXuI4iYW3mtJnTM)
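A minimal sketch of allocating pinned (page-locked) host memory with `cudaMallocHost` and releasing it with `cudaFreeHost`. The helper name `pinned_round_trip` is hypothetical; it only checks that data survives a host-to-device-to-host round trip through a pinned buffer.

```cuda
#include <cuda_runtime.h>

// Copies n floats host -> device -> host through a pinned host buffer.
// Returns true if the data comes back unchanged.
bool pinned_round_trip(int n) {
    float *h_buf = nullptr, *d_buf = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Page-locked host allocation: the OS cannot page this buffer out.
    if (cudaMallocHost(&h_buf, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return false;
    }

    for (int i = 0; i < n; ++i) h_buf[i] = static_cast<float>(i);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    for (int i = 0; i < n; ++i) h_buf[i] = 0.0f;  // wipe the host copy
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_buf[i] != static_cast<float>(i)) { ok = false; break; }
    }

    cudaFree(d_buf);
    cudaFreeHost(h_buf);  // pinned memory is released with cudaFreeHost, not free()
    return ok;
}
```

Pinned memory is a limited resource, so it is usually reserved for buffers that are transferred often or asynchronously.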

![Data partitioning example (overlapping compute & memory)](/files/-MUaHGbHf0eZozOEa_6H)
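The partitioning idea can be sketched with CUDA streams: split the array into chunks and issue each chunk's copy-in, kernel, and copy-out on its own stream so that transfers for one chunk can overlap with compute on another. The kernel `scale2` is a stand-in chosen to keep the sketch short, and `chunked_scale` is a hypothetical helper. A chunked stencil would additionally need the halo elements on each side of a chunk.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel (assumption): doubles each element of one chunk.
__global__ void scale2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes n floats in num_chunks pieces, one stream per piece.
// Returns true if every element ends up doubled.
bool chunked_scale(int n, int num_chunks) {
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, n * sizeof(float));  // async copies need pinned host memory
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(num_chunks);
    for (int c = 0; c < num_chunks; ++c) cudaStreamCreate(&streams[c]);

    int chunk = (n + num_chunks - 1) / num_chunks;
    for (int c = 0; c < num_chunks; ++c) {
        int offset = c * chunk;
        int len = (offset + chunk <= n) ? chunk : n - offset;
        if (len <= 0) continue;  // more chunks than elements
        size_t bytes = len * sizeof(float);
        cudaMemcpyAsync(d + offset, h + offset, bytes,
                        cudaMemcpyHostToDevice, streams[c]);
        scale2<<<(len + 255) / 256, 256, 0, streams[c]>>>(d + offset, len);
        cudaMemcpyAsync(h + offset, d + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[c]);
    }

    for (int c = 0; c < num_chunks; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h[i] != 2.0f) { ok = false; break; }
    }
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

Whether the copies and kernels actually overlap depends on the hardware and chunk size; a profiler timeline is the way to confirm it.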

![Performance improvements](/files/-MUaHVUUxxYU36hw0EQF)

![Optimization summary](/files/-MUaHl8Jiw7XsP_gAIWm)

## Vector Reduction in CUDA

![What the algorithm does (summing all entries in an array)](/files/-MUaJQQHlwV7DAtdvMzz)

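Problem: Ideally we want to synchronize across all thread blocks, but CUDA (as used in this lecture) offers no barrier across thread blocks inside a kernel. The workaround is to decompose the reduction into multiple kernel launches: each launch reduces its input to one partial sum per block, and the kernel boundary acts as the global synchronization point.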
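A minimal sketch of the multi-kernel decomposition, assuming an integer sum and a power-of-two block size. The names `block_sum`, `reduce_multi_kernel`, and `THREADS` are illustrative. Each launch writes one partial sum per block, and the host relaunches on the partial sums until a single value is left.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

constexpr int THREADS = 256;  // assumed block size, must be a power of two

// Each block reduces up to THREADS elements into out[blockIdx.x].
__global__ void block_sum(const int* in, int* out, int n) {
    __shared__ int s[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Launches block_sum repeatedly; the end of each kernel is the global barrier.
int reduce_multi_kernel(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return 0;
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, blocks * sizeof(int));
    cudaMemcpy(d_a, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    while (n > 1) {
        blocks = (n + THREADS - 1) / THREADS;
        block_sum<<<blocks, THREADS>>>(d_a, d_b, n);
        std::swap(d_a, d_b);  // partial sums become the next input
        n = blocks;
    }

    int result = 0;
    cudaMemcpy(&result, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

For one million elements and 256 threads per block this takes three launches (1,000,000 to 3,907 to 16 to 1 values).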

* Optimization goal: Reaching GPU peak performance
  * Choosing the right metric
    * GFLOP/s: for compute-bound kernels
    * Bandwidth: for memory-bound kernels
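* Reductions have low arithmetic intensity (roughly 1 flop per element loaded, since summing N elements takes N-1 additions), so the kernel is memory-bound and the target is peak memory bandwidth rather than peak GFLOP/s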
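One common way to turn the bandwidth metric into a number is the effective bandwidth: bytes read from global memory divided by kernel time. A sketch, assuming an input of $N$ elements of `sizeof(T)` bytes each and a measured kernel time $t$ (the numbers in the example are hypothetical, not measurements):

$$
\mathrm{BW}_{\text{eff}} = \frac{N \cdot \texttt{sizeof(T)}}{t}
$$

For example, with $N = 2^{22}$ 4-byte elements and $t = 1\ \text{ms}$:

$$
\mathrm{BW}_{\text{eff}} = \frac{2^{22} \cdot 4\ \text{B}}{10^{-3}\ \text{s}} \approx 16.8\ \text{GB/s}
$$

Comparing this figure against the device's peak memory bandwidth shows how much headroom a given reduction kernel leaves.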

![Interleaved addressing: highly divergent warps are inefficient, and % operator is very slow](/files/-MUaZQUej2Ywkez82zyK)
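A minimal sketch of an interleaved-addressing reduction kernel, assuming an integer sum and a power-of-two block size. The names `reduce_interleaved`, `sum_interleaved`, and `THREADS` are illustrative. The `tid % (2 * stride)` test is what leaves active and idle threads mixed inside the same warp.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

constexpr int THREADS = 256;  // assumed block size, must be a power of two

__global__ void reduce_interleaved(const int* in, int* out, int n) {
    __shared__ int s[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        // Only every (2 * stride)-th thread works: divergent, and % is costly.
        if (tid % (2 * stride) == 0) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Hypothetical host wrapper: one launch, then the per-block partial sums
// are added on the host to keep the sketch short.
int sum_interleaved(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return 0;
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reduce_interleaved<<<blocks, THREADS>>>(d_in, d_out, n);
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```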

![Change which thread works on what. New problem: shared memory bank conflicts](/files/-MUaZafkMRUMDusaJMYU)
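A minimal sketch of the remapped version, under the same assumptions (integer sum, power-of-two block size, illustrative names `reduce_strided_index`, `sum_strided_index`, `THREADS`). Consecutive threads now do the work, which removes the `%` test and keeps active threads packed into the lowest warps, but the shared memory index `2 * stride * tid` makes threads in a warp hit the same banks.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

constexpr int THREADS = 256;  // assumed block size, must be a power of two

__global__ void reduce_strided_index(const int* in, int* out, int n) {
    __shared__ int s[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        // Thread tid handles element 2 * stride * tid: active threads are
        // contiguous, but their shared memory accesses are strided.
        int idx = 2 * stride * tid;
        if (idx < blockDim.x) s[idx] += s[idx + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Hypothetical host wrapper: partial sums are finished on the host.
int sum_strided_index(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return 0;
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reduce_strided_index<<<blocks, THREADS>>>(d_in, d_out, n);
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```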

![Sequential addressing](/files/-MUa_4Wm9gkhVOKy1Uit)
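A minimal sketch of sequential addressing, under the same assumptions (integer sum, power-of-two block size, illustrative names `reduce_sequential`, `sum_sequential`, `THREADS`). The stride starts at half the block size and halves each step, so active threads are contiguous and each reads consecutive shared memory words.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

constexpr int THREADS = 256;  // assumed block size, must be a power of two

__global__ void reduce_sequential(const int* in, int* out, int n) {
    __shared__ int s[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        // Threads 0..stride-1 add the upper half onto the lower half.
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Hypothetical host wrapper: partial sums are finished on the host.
int sum_sequential(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return 0;
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reduce_sequential<<<blocks, THREADS>>>(d_in, d_out, n);
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

Note that half of the threads in each block are idle from the very first loop iteration, which motivates doing an add while loading.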

* Kernel 4: Replace the single load with two loads and the first add of the reduction
* Kernel 5: Loop unrolling (unroll the last warp)
* Kernel 6: Complete unrolling (using templates)
* Kernel 7: Multiple elements per thread
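The four ideas above can be combined in one kernel. The following is a sketch under stated assumptions, not the lecture's code: the names `reduce_unrolled` and `sum_unrolled` are illustrative, the sum is over `int`, and the last-warp step uses `__shfl_down_sync` warp shuffles as a swapped-in technique in place of unsynchronized shared memory accesses within the warp.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

// BLOCK is the block size, known at compile time (Kernel 6 idea).
template <int BLOCK>
__global__ void reduce_unrolled(const int* in, int* out, int n) {
    static_assert(BLOCK == 64 || BLOCK == 128 || BLOCK == 256 || BLOCK == 512,
                  "this sketch supports block sizes 64..512");
    __shared__ int s[BLOCK];
    int tid = threadIdx.x;
    int sum = 0;
    // Kernel 7 idea: each thread strides over the whole grid and handles
    // many elements. Kernel 4 idea: two loads and an add per iteration.
    for (int i = blockIdx.x * (BLOCK * 2) + tid; i < n; i += BLOCK * 2 * gridDim.x) {
        sum += in[i];
        if (i + BLOCK < n) sum += in[i + BLOCK];
    }
    s[tid] = sum;
    __syncthreads();

    // Kernel 6 idea: comparisons on BLOCK are resolved at compile time.
    if (BLOCK >= 512) { if (tid < 256) s[tid] += s[tid + 256]; __syncthreads(); }
    if (BLOCK >= 256) { if (tid < 128) s[tid] += s[tid + 128]; __syncthreads(); }
    if (BLOCK >= 128) { if (tid < 64) s[tid] += s[tid + 64]; __syncthreads(); }

    // Kernel 5 idea: the last warp finishes without __syncthreads().
    if (tid < 32) {
        int v = s[tid] + s[tid + 32];
        #pragma unroll
        for (int off = 16; off > 0; off >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, off);
        if (tid == 0) out[blockIdx.x] = v;
    }
}

// Hypothetical host wrapper: a fixed grid of 64 blocks, partial sums
// finished on the host.
int sum_unrolled(const std::vector<int>& h_in) {
    constexpr int BLOCK = 256, GRID = 64;
    int n = static_cast<int>(h_in.size());
    if (n == 0) return 0;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, GRID * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reduce_unrolled<BLOCK><<<GRID, BLOCK>>>(d_in, d_out, n);
    std::vector<int> partial(GRID);
    cudaMemcpy(partial.data(), d_out, GRID * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

The grid size of 64 is an arbitrary choice for the sketch; in practice it is tuned so that each thread has enough elements to hide memory latency.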

