# Lecture 11: Execution Divergence. Control Flow in CUDA. CUDA Shared Memory Issues.

## Lecture Summary

* Last time
  * GPU Computing: Execution Scheduling
    * Block scheduler (at the GPU level)
    * Warp scheduler (at the SM level)
  * Thread Divergence
* Today
  * Aspects related to how GPU memory operations take place

## The NVIDIA GPU Memory Ecosystem

![From high vantage point (2 blocks w/ 2 threads each)](/files/-MU-kgq9_xEuiTHANh22)

Each thread can:

* R/W per-thread registers&#x20;
* R/W per-thread local memory&#x20;
* R/W per-block shared memory&#x20;
* R/W per-grid global memory&#x20;
* Read only per-grid constant memory&#x20;
* Read only per-grid texture memory&#x20;
* Read only per-grid surface memory

Some aspects of Local Memory:

* Physically, local memory does not exist
  * In reality, data stored in local memory is placed in cache or the global memory at run time or by the compiler
* It's specific to one thread and not visible to any other thread
* Local memory has the same latency as global memory, unless cached

Different memories:

* Global memory: Main means of communicating R/W data between host and device. cudaMalloc(), cudaFree(), and cudaMemcpy() operate here. Note that there are four types of cudaMemcpy transfers ({host/device} to {host/device}), and things happen over a PCIe connection.
* Texture and Constant memories: Constants initialized by host, contents available to all threads.&#x20;

Global, texture and constant memories are accessible by host (done at high latency, low bandwidth).

![](/files/-MUWvdlx1VKsrdzJ2o0d)

![](/files/-MUWvh-iq5_1M-3g8CTm)

![Memory Access Times](/files/-MU-mPQFGKffrBKhLvJA)

![Storage Locations](/files/-MU-mVa98p1fKr9ClwX-)

![The 3 most important GPU memory spaces](/files/-MU-mewdFW4fThoFzjy2)

## Case Studies: Matrix Multiplication, Revisited

Purpose:

* See an example where the use of multiple blocks of threads play a central role

* Highlight the use/role of the shared memory

* Point out the \_\_syncthreads() function call (synchronizes all threads in a block)

* The previous example: Low arithmetic intensity, a lot of unnecessary movements from global memory to device

* **Rule of thumb: If the data that you, as a thread, use can also be used by another thread in your block, then you should consider using shared memory**

* To use shared memory:
  * Partition data into data subsets (tiles) that each fits into shared memory
  * Handle each data subset (tile) with one thread block by:
    * Loading the tile from global memory into shared memory, using multiple threads to exploit memory-level parallelism
    * Performing the computation on the tile from shared memory; each thread can efficiently multi-pass over any data element of the tile

![](/files/-MU-pK_llmHlfZTlVScE)

![](/files/-MU-pOckb9xMGBF_cVqq)

* `__syncthreads()` synchronizes all threads in a block
  * Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
  * Be very careful when using it in a conditional
* 3 ways to set aside shared memory:
  * Statically, declare inside a kernel
  * Through the execution configuration (see code block below)
  * Dynamically, via CUDA driver API `cuFuncSetSharedSize()` (out of scope)

```
__global__ void MyFunc(float*) // __device__ or __global__ function 
{
    extern __shared__ float shMemArray[];
    // Size of shMemArray determined through the execution configuration
    // You can use shMemArrayas you wish here...
}

// invoke like this. Ns indicates the size in bytes to be allocated in shared memory
MyFunc<<< Dg, Db, Ns>>>(parameter);
```

![Example: Reversing an array using dynamic shared memory](/files/-MUWy0e-NWQno8IS6iby)

![How different technology fetches data into shared memory](/files/-MUWyXTZw2J_L2O5Wrvs)

* Each SM has shared memory organized in 32 memory banks
  * Successive 32-bit words map to successive banks
  * Each bank has a bandwidth of 32 bits per clock cycle
* ShMem and L1 cache draw on the same physical memory inside an SM

![](/files/-MUX12bQ1JhNcQJ2FIdc)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://blog.ruipan.xyz/earlier-readings-and-notes/cs759-hpc-course-notes/lecture-11-execution-divergence.-control-flow-in-cuda.-global-memory-access-patterns-and.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
