Lecture 11: Execution Divergence. Control Flow in CUDA. CUDA Shared Memory Issues.

Lecture Summary

  • Last time

    • GPU Computing: Execution Scheduling

      • Block scheduler (at the GPU level)

      • Warp scheduler (at the SM level)

    • Thread Divergence

  • Today

    • Aspects related to how GPU memory operations take place

The NVIDIA GPU Memory Ecosystem

Each thread can:

  • R/W per-thread registers

  • R/W per-thread local memory

  • R/W per-block shared memory

  • R/W per-grid global memory

  • Read only per-grid constant memory

  • Read only per-grid texture memory

  • Read only per-grid surface memory

Some aspects of Local Memory:

  • Physically, "local" memory is not a separate memory

    • Data stored in local memory actually resides in global memory (and is cached in L1/L2 when possible); the placement is decided by the compiler or at run time

  • It's specific to one thread and not visible to any other thread

  • Local memory has the same latency as global memory, unless cached
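As an illustration (the kernel and names are hypothetical), a per-thread array indexed dynamically is a typical case the compiler cannot keep in registers and therefore places in local memory:

```cuda
__global__ void LocalMemExample(float* out)
{
    // A per-thread array indexed with a value unknown at compile time
    // typically cannot live in registers, so the compiler places it in
    // "local" memory -- physically backed by global memory, cached when possible.
    float scratch[64];
    for (int i = 0; i < 64; i++)
        scratch[i] = (float)(i * threadIdx.x);
    // Dynamic index: forces addressable (local-memory) storage for scratch
    out[threadIdx.x] = scratch[threadIdx.x % 64];
}
```

Despite the name, nothing here is faster than global memory unless the cache hierarchy helps; "local" refers to visibility (one thread), not to physical proximity.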

Different memories:

  • Global memory: the main means of communicating R/W data between host and device. cudaMalloc(), cudaFree(), and cudaMemcpy() operate here. Note that there are four types of cudaMemcpy transfers ({host/device} to {host/device}); host-device transfers travel over the PCIe connection.

  • Texture and Constant memories: initialized by the host; contents are read-only on the device and visible to all threads.

Global, texture and constant memories are accessible by host (done at high latency, low bandwidth).
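As a sketch (sizes and variable names are illustrative), the global-memory calls and the four cudaMemcpy directions look like this:

```cuda
// Sketch: allocating global memory and the four cudaMemcpy directions.
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    float h_src[1024], h_dst[1024];
    float *d_a, *d_b;

    cudaMalloc(&d_a, N * sizeof(float));   // allocate in device global memory
    cudaMalloc(&d_b, N * sizeof(float));

    cudaMemcpy(d_a, h_src, N * sizeof(float), cudaMemcpyHostToDevice);   // H2D (over PCIe)
    cudaMemcpy(d_b, d_a,   N * sizeof(float), cudaMemcpyDeviceToDevice); // D2D (stays on GPU)
    cudaMemcpy(h_dst, d_b, N * sizeof(float), cudaMemcpyDeviceToHost);   // D2H (over PCIe)
    cudaMemcpy(h_dst, h_src, N * sizeof(float), cudaMemcpyHostToHost);   // H2H (like memcpy)

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```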

Case Studies: Matrix Multiplication, Revisited


  • See an example where the use of multiple blocks of threads plays a central role

  • Highlight the use/role of the shared memory

  • Point out the __syncthreads() function call (synchronizes all threads in a block)

  • The previous example: low arithmetic intensity, with many redundant loads of the same data from global memory into the SM

  • Rule of thumb: If the data that you, as a thread, use can also be used by another thread in your block, then you should consider using shared memory

  • To use shared memory:

    • Partition the data into subsets (tiles), each of which fits into shared memory

    • Handle each data subset (tile) with one thread block by:

      • Loading the tile from global memory into shared memory, using multiple threads to exploit memory-level parallelism

      • Performing the computation on the tile from shared memory; each thread can efficiently make multiple passes over any data element of the tile

  • __syncthreads() synchronizes all threads in a block

    • Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory

    • Be very careful when using it in a conditional
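The tiling recipe above can be sketched as a CUDA kernel. Names (MatMulTiled, TILE) are illustrative, and this sketch assumes square N x N matrices with N divisible by TILE and a grid of (N/TILE, N/TILE) blocks of TILE x TILE threads:

```cuda
#define TILE 16

// Sketch of tiled matrix multiply: each block computes one TILE x TILE
// output tile; threads cooperatively stage tiles of A and B in shared memory.
__global__ void MatMulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];   // static shared memory, declared in-kernel
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int m = 0; m < N / TILE; m++) {
        // Each thread loads one element of each tile: memory-level parallelism
        As[threadIdx.y][threadIdx.x] = A[row * N + (m * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * N + col];
        __syncthreads();               // tiles fully loaded before use (avoid RAW)

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // all reads done before next load (avoid WAR)
    }
    C[row * N + col] = acc;
}
```

Note that both __syncthreads() calls are reached by every thread in the block unconditionally, which is exactly the discipline the caution above asks for.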

  • 3 ways to set aside shared memory:

    • Statically, declare inside a kernel

    • Through the execution configuration (see code block below)

    • Dynamically, via CUDA driver API cuFuncSetSharedSize() (out of scope)

__global__ void MyFunc(float* parameter) // __device__ or __global__ function
{
    extern __shared__ float shMemArray[];
    // Size of shMemArray is determined through the execution configuration
    // You can use shMemArray as you wish here...
}

// Invoke like this; Ns indicates the size in bytes to be allocated in shared memory
MyFunc<<< Dg, Db, Ns >>>(parameter);

  • Each SM has shared memory organized in 32 memory banks

    • Successive 32-bit words map to successive banks

    • Each bank has a bandwidth of 32 bits per clock cycle

  • ShMem and L1 cache draw on the same physical memory inside an SM
