# Lecture 11: Execution Divergence. Control Flow in CUDA. CUDA Shared Memory Issues.

## Lecture Summary

* Last time
  * GPU Computing: Execution Scheduling
    * Block scheduler (at the GPU level)
    * Warp scheduler (at the SM level)
  * Thread Divergence
* Today
  * Aspects related to how GPU memory operations take place

## The NVIDIA GPU Memory Ecosystem

![From high vantage point (2 blocks w/ 2 threads each)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-kgq9_xEuiTHANh22%2FScreen%20Shot%202021-02-20%20at%2012.45.24%20PM.png?alt=media\&token=169b0104-2289-44d1-a209-8b07e7119cbe)

Each thread can:

* R/W per-thread registers
* R/W per-thread local memory
* R/W per-block shared memory
* R/W per-grid global memory
* Read-only per-grid constant memory
* Read-only per-grid texture memory
* Read-only per-grid surface memory
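
The memory spaces above map onto CUDA variable qualifiers. A minimal sketch (all names are illustrative, and actual register/local placement is ultimately up to the compiler):

```cuda
// Illustrative: how CUDA declarations map onto the memory spaces listed above.

__constant__ float coeffs[16];   // per-grid constant memory, read-only in kernels
__device__   float gBuffer[256]; // per-grid global memory, R/W by all threads

__global__ void memorySpaces(const float *in, float *out) // in/out: global memory
{
    __shared__ float tile[128];  // per-block shared memory, R/W within the block

    int t = threadIdx.x;         // t typically lives in a per-thread register
    float local[8];              // may be placed in per-thread local memory
                                 // (e.g., if indexed dynamically or spilled)

    tile[t % 128] = in[t] * coeffs[t % 16];
    __syncthreads();

    for (int i = 0; i < 8; ++i)
        local[i] = tile[(t + i) % 128];
    out[t] = local[0] + gBuffer[t % 256];
}
```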

Some aspects of Local Memory:

* Physically, local memory is not a separate memory resource
  * In reality, data placed in local memory resides in device (global) memory; the compiler and runtime decide the placement, and accesses may be served from cache
* It is private to one thread and not visible to any other thread
* Local memory has the same latency as global memory, unless the access is cached
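
As a sketch of when local memory comes into play (illustrative kernel; the compiler makes the final placement decision):

```cuda
// Illustrative: a common trigger for local-memory placement.
__global__ void localMemoryExample(float *out, int k)
{
    float a[32];              // per-thread array; the dynamic indexing below
                              // usually forces it into local memory, since
                              // registers cannot be indexed at run time
    for (int i = 0; i < 32; ++i)
        a[i] = i * 0.5f;

    // k is not known at compile time, so a[k % 32] cannot be mapped to a
    // fixed register; the array spills to (cached) local memory
    out[threadIdx.x] = a[k % 32];
}
```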

Different memories:

* Global memory: Main means of communicating R/W data between host and device. `cudaMalloc()`, `cudaFree()`, and `cudaMemcpy()` operate here. Note that there are four types of `cudaMemcpy()` transfers ({host, device} to {host, device}), and host-device transfers travel over the PCIe connection.
* Texture and constant memories: Initialized by the host; contents are read-only in kernels and visible to all threads.

Global, texture, and constant memories are accessible by the host (at high latency and low bandwidth relative to on-device access).
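
A minimal host-side sketch of the global-memory workflow and the four transfer directions (error handling omitted for brevity):

```cuda
#include <cuda_runtime.h>

int main()
{
    const int N = 1024;
    float h_a[1024], h_b[1024];          // host arrays
    float *d_a, *d_b;                    // device (global memory) pointers

    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));

    // The four cudaMemcpy transfer kinds:
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);   // H -> D
    cudaMemcpy(d_b, d_a, N * sizeof(float), cudaMemcpyDeviceToDevice); // D -> D
    cudaMemcpy(h_b, d_b, N * sizeof(float), cudaMemcpyDeviceToHost);   // D -> H
    cudaMemcpy(h_a, h_b, N * sizeof(float), cudaMemcpyHostToHost);     // H -> H

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```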

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWvdlx1VKsrdzJ2o0d%2FScreen%20Shot%202021-02-26%20at%2011.20.44%20PM.png?alt=media\&token=d3aa5d2d-3f96-4a6e-8290-b9936d593e3d)

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWvh-iq5_1M-3g8CTm%2FScreen%20Shot%202021-02-26%20at%2011.20.58%20PM.png?alt=media\&token=8b644962-c6a3-4947-ab1b-335419f0e2e4)

![Memory Access Times](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-mPQFGKffrBKhLvJA%2FScreen%20Shot%202021-02-20%20at%2012.52.54%20PM.png?alt=media\&token=cb9b1af3-2522-428f-b8a0-2f641dc02517)

![Storage Locations](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-mVa98p1fKr9ClwX-%2FScreen%20Shot%202021-02-20%20at%2012.53.19%20PM.png?alt=media\&token=386ca24c-39b4-4dd1-a8b4-fb7d56ef8ede)

![The 3 most important GPU memory spaces](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-mewdFW4fThoFzjy2%2FScreen%20Shot%202021-02-20%20at%2012.54.01%20PM.png?alt=media\&token=6a33f36b-a539-4e58-b809-e1a05fb96d11)

## Case Studies: Matrix Multiplication, Revisited

Purpose:

* See an example where the use of multiple blocks of threads plays a central role

* Highlight the use/role of the shared memory

* Point out the `__syncthreads()` function call (synchronizes all threads in a block)

* The previous example: Low arithmetic intensity, with many redundant loads of the same data from global memory into the SM

* **Rule of thumb: If the data that you, as a thread, use can also be used by another thread in your block, then you should consider using shared memory**

* To use shared memory:
  * Partition the data into subsets (tiles), each of which fits into shared memory
  * Handle each tile with one thread block by:
    * Loading the tile from global memory into shared memory, using multiple threads to exploit memory-level parallelism
    * Performing the computation on the tile from shared memory; each thread can efficiently make multiple passes over any data element of the tile
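
The tiling recipe above can be sketched as a shared-memory matrix-multiply kernel. This is a simplified version assuming square matrices whose width is a multiple of the tile size; `TILE_WIDTH` and the kernel name are illustrative:

```cuda
#define TILE_WIDTH 16

// C = A * B for width x width matrices; width assumed a multiple of TILE_WIDTH.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int width)
{
    __shared__ float sA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        // Each thread loads one element of each tile: memory-level parallelism.
        sA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE_WIDTH + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();  // tile fully loaded before anyone computes (RAW hazard)

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // all threads done before the tile is overwritten (WAR hazard)
    }
    C[row * width + col] = acc;
}
```

Each element of the tile is loaded from global memory once but read `TILE_WIDTH` times from shared memory, which is what raises the arithmetic intensity.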

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-pK_llmHlfZTlVScE%2FScreen%20Shot%202021-02-20%20at%201.05.39%20PM.png?alt=media\&token=fc63abbb-37e3-4bee-90e1-395e9cc91fa3)

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-pOckb9xMGBF_cVqq%2FScreen%20Shot%202021-02-20%20at%201.05.57%20PM.png?alt=media\&token=83b0a65d-1e18-48aa-9375-a228873ee6a5)

* `__syncthreads()` synchronizes all threads in a block
  * Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
  * Be very careful when using it inside a conditional: every thread in the block must reach the same `__syncthreads()` call, or behavior is undefined (deadlock is likely)
* 3 ways to set aside shared memory:
  * Statically, declare inside a kernel
  * Through the execution configuration (see code block below)
  * Dynamically, via CUDA driver API `cuFuncSetSharedSize()` (out of scope)

```cuda
__global__ void MyFunc(float*) // __device__ or __global__ function
{
    extern __shared__ float shMemArray[];
    // Size of shMemArray is determined through the execution configuration
    // You can use shMemArray as you wish here...
}

// Invoke like this; Ns indicates the size in bytes to be allocated in shared memory
MyFunc<<<Dg, Db, Ns>>>(parameter);
```
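
To make the mechanism concrete, here is a sketch of the array-reversal pattern using dynamically allocated shared memory (kernel name and launch configuration are illustrative; assumes a single block of `n` threads):

```cuda
// Reverse an n-element array within one block using dynamic shared memory.
__global__ void reverseArray(float *d, int n)
{
    extern __shared__ float s[];   // size supplied at launch time
    int t = threadIdx.x;

    s[t] = d[t];
    __syncthreads();               // all writes to s[] complete before any reads
    d[t] = s[n - 1 - t];
}

// Launch: the third configuration argument is the dynamic shared memory size in bytes
// reverseArray<<<1, n, n * sizeof(float)>>>(d, n);
```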

![Example: Reversing an array using dynamic shared memory](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWy0e-NWQno8IS6iby%2FScreen%20Shot%202021-02-26%20at%2011.31.07%20PM.png?alt=media\&token=5e07659e-e6d0-4d66-87b8-214c4604a55d)

![How different technology fetches data into shared memory](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWyXTZw2J_L2O5Wrvs%2FScreen%20Shot%202021-02-26%20at%2011.33.21%20PM.png?alt=media\&token=c8e20090-4acf-4490-9ae1-33d0b721c806)

* Each SM has shared memory organized in 32 memory banks
  * Successive 32-bit words map to successive banks
  * Each bank has a bandwidth of 32 bits per clock cycle
* Shared memory and the L1 cache draw on the same physical memory inside an SM (on many architectures the split between the two is configurable)
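
As an illustration of the banked layout, here is a strided access pattern that causes bank conflicts, along with the common padding fix (a sketch; sizes are illustrative and the shared arrays are left uninitialized since only the access pattern matters):

```cuda
__global__ void bankConflictDemo(float *out)
{
    // 32 banks; successive 32-bit words map to successive banks.
    __shared__ float a[32][32];    // row i starts at word i*32 -> bank 0
    __shared__ float b[32][33];    // padded: one extra column per row

    int t = threadIdx.x;           // assume one warp: t = 0..31

    // Column access on 'a': a[t][0] sits at word t*32, and t*32 % 32 == 0,
    // so every thread in the warp hits bank 0 -> 32-way conflict (serialized).
    float x = a[t][0];

    // With padding, b[t][0] sits at word t*33, and t*33 % 32 == t, so the
    // column walk rotates through all 32 banks -> conflict-free.
    float y = b[t][0];

    out[t] = x + y;
}
```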

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUX12bQ1JhNcQJ2FIdc%2FScreen%20Shot%202021-02-26%20at%2011.48.42%20PM.png?alt=media\&token=2c9e3508-6db1-426c-8163-d11ca79b3339)
