# Lecture 11: Execution Divergence. Control Flow in CUDA. CUDA Shared Memory Issues.

## Lecture Summary

* Last time
  * GPU Computing: Execution Scheduling
    * Block scheduler (at the GPU level)
    * Warp scheduler (at the SM level)
  * Thread Divergence
* Today
  * Aspects related to how GPU memory operations take place

## The NVIDIA GPU Memory Ecosystem

![From high vantage point (2 blocks w/ 2 threads each)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-kgq9_xEuiTHANh22%2FScreen%20Shot%202021-02-20%20at%2012.45.24%20PM.png?alt=media\&token=169b0104-2289-44d1-a209-8b07e7119cbe)

Each thread can:

* R/W per-thread registers
* R/W per-thread local memory
* R/W per-block shared memory
* R/W per-grid global memory
* Read-only per-grid constant memory
* Read-only per-grid texture memory
* Read-only per-grid surface memory
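
The memory spaces above map onto CUDA variable qualifiers. A minimal sketch (all names are illustrative, and actual register/local placement is ultimately up to the compiler):

```cuda
// Illustrative: how CUDA declarations map onto the memory spaces listed above.

__constant__ float coeffs[16];   // per-grid constant memory, read-only in kernels
__device__   float gBuffer[256]; // per-grid global memory, R/W by all threads

__global__ void memorySpaces(const float *in, float *out) // in/out: global memory
{
    __shared__ float tile[128];  // per-block shared memory, R/W within the block

    int t = threadIdx.x;         // t typically lives in a per-thread register
    float local[8];              // may be placed in per-thread local memory
                                 // (e.g., if indexed dynamically or spilled)

    tile[t % 128] = in[t] * coeffs[t % 16];
    __syncthreads();

    for (int i = 0; i < 8; ++i)
        local[i] = tile[(t + i) % 128];
    out[t] = local[0] + gBuffer[t % 256];
}
```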

Some aspects of Local Memory:

* Physically, local memory is not a separate memory resource
  * In reality, data placed in local memory resides in device (global) memory; the compiler and runtime decide the placement, and accesses may be served from cache
* It is private to one thread and not visible to any other thread
* Local memory has the same latency as global memory, unless the access is cached
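
As a sketch of when local memory comes into play (illustrative kernel; the compiler makes the final placement decision):

```cuda
// Illustrative: a common trigger for local-memory placement.
__global__ void localMemoryExample(float *out, int k)
{
    float a[32];              // per-thread array; the dynamic indexing below
                              // usually forces it into local memory, since
                              // registers cannot be indexed at run time
    for (int i = 0; i < 32; ++i)
        a[i] = i * 0.5f;

    // k is not known at compile time, so a[k % 32] cannot be mapped to a
    // fixed register; the array spills to (cached) local memory
    out[threadIdx.x] = a[k % 32];
}
```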

Different memories:

* Global memory: Main means of communicating R/W data between host and device. `cudaMalloc()`, `cudaFree()`, and `cudaMemcpy()` operate here. Note that there are four types of `cudaMemcpy()` transfers ({host, device} to {host, device}), and host-device transfers travel over the PCIe connection.
* Texture and constant memories: Initialized by the host; contents are read-only in kernels and visible to all threads.

Global, texture, and constant memories are accessible by the host (at high latency and low bandwidth relative to on-device access).
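
A minimal host-side sketch of the global-memory workflow and the four transfer directions (error handling omitted for brevity):

```cuda
#include <cuda_runtime.h>

int main()
{
    const int N = 1024;
    float h_a[1024], h_b[1024];          // host arrays
    float *d_a, *d_b;                    // device (global memory) pointers

    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));

    // The four cudaMemcpy transfer kinds:
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);   // H -> D
    cudaMemcpy(d_b, d_a, N * sizeof(float), cudaMemcpyDeviceToDevice); // D -> D
    cudaMemcpy(h_b, d_b, N * sizeof(float), cudaMemcpyDeviceToHost);   // D -> H
    cudaMemcpy(h_a, h_b, N * sizeof(float), cudaMemcpyHostToHost);     // H -> H

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```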

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWvdlx1VKsrdzJ2o0d%2FScreen%20Shot%202021-02-26%20at%2011.20.44%20PM.png?alt=media\&token=d3aa5d2d-3f96-4a6e-8290-b9936d593e3d)

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWvh-iq5_1M-3g8CTm%2FScreen%20Shot%202021-02-26%20at%2011.20.58%20PM.png?alt=media\&token=8b644962-c6a3-4947-ab1b-335419f0e2e4)

![Memory Access Times](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-mPQFGKffrBKhLvJA%2FScreen%20Shot%202021-02-20%20at%2012.52.54%20PM.png?alt=media\&token=cb9b1af3-2522-428f-b8a0-2f641dc02517)

![Storage Locations](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-mVa98p1fKr9ClwX-%2FScreen%20Shot%202021-02-20%20at%2012.53.19%20PM.png?alt=media\&token=386ca24c-39b4-4dd1-a8b4-fb7d56ef8ede)

![The 3 most important GPU memory spaces](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-mewdFW4fThoFzjy2%2FScreen%20Shot%202021-02-20%20at%2012.54.01%20PM.png?alt=media\&token=6a33f36b-a539-4e58-b809-e1a05fb96d11)

## Case Studies: Matrix Multiplication, Revisited

Purpose:

* See an example where the use of multiple blocks of threads plays a central role

* Highlight the use/role of the shared memory

* Point out the `__syncthreads()` function call (synchronizes all threads in a block)

* The previous example: Low arithmetic intensity, with many redundant loads of the same data from global memory into the SM

* **Rule of thumb: If the data that you, as a thread, use can also be used by another thread in your block, then you should consider using shared memory**

* To use shared memory:
  * Partition the data into subsets (tiles), each of which fits into shared memory
  * Handle each tile with one thread block by:
    * Loading the tile from global memory into shared memory, using multiple threads to exploit memory-level parallelism
    * Performing the computation on the tile from shared memory; each thread can efficiently make multiple passes over any data element of the tile
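
The tiling recipe above can be sketched as a shared-memory matrix-multiply kernel. This is a simplified version assuming square matrices whose width is a multiple of the tile size; `TILE_WIDTH` and the kernel name are illustrative:

```cuda
#define TILE_WIDTH 16

// C = A * B for width x width matrices; width assumed a multiple of TILE_WIDTH.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int width)
{
    __shared__ float sA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        // Each thread loads one element of each tile: memory-level parallelism.
        sA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE_WIDTH + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();  // tile fully loaded before anyone computes (RAW hazard)

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // all threads done before the tile is overwritten (WAR hazard)
    }
    C[row * width + col] = acc;
}
```

Each element of the tile is loaded from global memory once but read `TILE_WIDTH` times from shared memory, which is what raises the arithmetic intensity.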

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-pK_llmHlfZTlVScE%2FScreen%20Shot%202021-02-20%20at%201.05.39%20PM.png?alt=media\&token=fc63abbb-37e3-4bee-90e1-395e9cc91fa3)

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MU-jxdTV1dHiHvTE3F4%2F-MU-pOckb9xMGBF_cVqq%2FScreen%20Shot%202021-02-20%20at%201.05.57%20PM.png?alt=media\&token=83b0a65d-1e18-48aa-9375-a228873ee6a5)

* `__syncthreads()` synchronizes all threads in a block
  * Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
  * Be very careful when using it inside a conditional: every thread in the block must reach the same `__syncthreads()` call, or behavior is undefined (deadlock is likely)
* 3 ways to set aside shared memory:
  * Statically, declare inside a kernel
  * Through the execution configuration (see code block below)
  * Dynamically, via CUDA driver API `cuFuncSetSharedSize()` (out of scope)

```cuda
__global__ void MyFunc(float*) // __device__ or __global__ function
{
    extern __shared__ float shMemArray[];
    // Size of shMemArray is determined through the execution configuration
    // You can use shMemArray as you wish here...
}

// Invoke like this; Ns indicates the size in bytes to be allocated in shared memory
MyFunc<<<Dg, Db, Ns>>>(parameter);
```
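
To make the mechanism concrete, here is a sketch of the array-reversal pattern using dynamically allocated shared memory (kernel name and launch configuration are illustrative; assumes a single block of `n` threads):

```cuda
// Reverse an n-element array within one block using dynamic shared memory.
__global__ void reverseArray(float *d, int n)
{
    extern __shared__ float s[];   // size supplied at launch time
    int t = threadIdx.x;

    s[t] = d[t];
    __syncthreads();               // all writes to s[] complete before any reads
    d[t] = s[n - 1 - t];
}

// Launch: the third configuration argument is the dynamic shared memory size in bytes
// reverseArray<<<1, n, n * sizeof(float)>>>(d, n);
```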

![Example: Reversing an array using dynamic shared memory](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWy0e-NWQno8IS6iby%2FScreen%20Shot%202021-02-26%20at%2011.31.07%20PM.png?alt=media\&token=5e07659e-e6d0-4d66-87b8-214c4604a55d)

![How different technology fetches data into shared memory](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUWyXTZw2J_L2O5Wrvs%2FScreen%20Shot%202021-02-26%20at%2011.33.21%20PM.png?alt=media\&token=c8e20090-4acf-4490-9ae1-33d0b721c806)

* Each SM has shared memory organized in 32 memory banks
  * Successive 32-bit words map to successive banks
  * Each bank has a bandwidth of 32 bits per clock cycle
* Shared memory and the L1 cache draw on the same physical memory inside an SM (on many architectures the split between the two is configurable)
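
As an illustration of the banked layout, here is a strided access pattern that causes bank conflicts, along with the common padding fix (a sketch; sizes are illustrative and the shared arrays are left uninitialized since only the access pattern matters):

```cuda
__global__ void bankConflictDemo(float *out)
{
    // 32 banks; successive 32-bit words map to successive banks.
    __shared__ float a[32][32];    // row i starts at word i*32 -> bank 0
    __shared__ float b[32][33];    // padded: one extra column per row

    int t = threadIdx.x;           // assume one warp: t = 0..31

    // Column access on 'a': a[t][0] sits at word t*32, and t*32 % 32 == 0,
    // so every thread in the warp hits bank 0 -> 32-way conflict (serialized).
    float x = a[t][0];

    // With padding, b[t][0] sits at word t*33, and t*33 % 32 == t, so the
    // column walk rotates through all 32 banks -> conflict-free.
    float y = b[t][0];

    out[t] = x + y;
}
```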

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWtkL91nzoQleK75mw%2F-MUX12bQ1JhNcQJ2FIdc%2FScreen%20Shot%202021-02-26%20at%2011.48.42%20PM.png?alt=media\&token=2c9e3508-6db1-426c-8163-d11ca79b3339)
