Rui's Blog

Lecture 11: Execution Divergence. Control Flow in CUDA. CUDA Shared Memory Issues.

Lecture Summary

  • Last time
    • GPU Computing: Execution Scheduling
      • Block scheduler (at the GPU level)
      • Warp scheduler (at the SM level)
    • Thread Divergence
  • Today
    • Aspects related to how GPU memory operations take place

The NVIDIA GPU Memory Ecosystem

From high vantage point (2 blocks w/ 2 threads each)
Each thread can:
  • R/W per-thread registers
  • R/W per-thread local memory
  • R/W per-block shared memory
  • R/W per-grid global memory
  • Read only per-grid constant memory
  • Read only per-grid texture memory
  • Read only per-grid surface memory
Some aspects of Local Memory:
  • Physically, local memory does not exist
    • In reality, data stored in local memory is placed in cache or the global memory at run time or by the compiler
  • It's specific to one thread and not visible to any other thread
  • Local memory has the same latency as global memory, unless cached
Different memories:
  • Global memory: Main means of communicating R/W data between host and device. cudaMalloc(), cudaFree(), and cudaMemcpy() operate here. Note that there are four types of cudaMemcpy transfers ({host/device} to {host/device}), and things happen over a PCIe connection.
  • Texture and Constant memories: Constants initialized by host, contents available to all threads.
Global, texture and constant memories are accessible by host (done at high latency, low bandwidth).
Memory Access Times
Storage Locations
The 3 most important GPU memory spaces

Case Studies: Matrix Multiplication, Revisited

  • See an example where the use of multiple blocks of threads play a central role
  • Highlight the use/role of the shared memory
  • Point out the __syncthreads() function call (synchronizes all threads in a block)
  • The previous example: Low arithmetic intensity, a lot of unnecessary movements from global memory to device
  • Rule of thumb: If the data that you, as a thread, use can also be used by another thread in your block, then you should consider using shared memory
  • To use shared memory:
    • Partition data into data subsets (tiles) that each fits into shared memory
    • Handle each data subset (tile) with one thread block by:
      • Loading the tile from global memory into shared memory, using multiple threads to exploit memory-level parallelism
      • Performing the computation on the tile from shared memory; each thread can efficiently multi-pass over any data element of the tile
  • __syncthreads() synchronizes all threads in a block
    • Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
    • Be very careful when using it in a conditional
  • 3 ways to set aside shared memory:
    • Statically, declare inside a kernel
    • Through the execution configuration (see code block below)
    • Dynamically, via CUDA driver API cuFuncSetSharedSize() (out of scope)
__global__ void MyFunc(float*) // __device__ or __global__ function
extern __shared__ float shMemArray[];
// Size of shMemArray determined through the execution configuration
// You can use shMemArrayas you wish here...
// invoke like this. Ns indicates the size in bytes to be allocated in shared memory
MyFunc<<< Dg, Db, Ns>>>(parameter);
Example: Reversing an array using dynamic shared memory
How different technology fetches data into shared memory
  • Each SM has shared memory organized in 32 memory banks
    • Successive 32-bit words map to successive banks
    • Each bank has a bandwidth of 32 bits per clock cycle
  • ShMem and L1 cache draw on the same physical memory inside an SM