Lecture 10: GPU Scheduling Issues.

Lecture Summary

Wrap up GPU computing: generalities, execution configuration
GPU computing: scheduling execution

Using Multiple Blocks

Execution Scheduling Issues

Scheduling questions:

What is the order for the blocks to be executed?
How is this execution process managed?
When/How are the threads in a block executed?

Two levels of schedulers:

Device-level scheduler (NVIDIA GigaThread engine): Assigns (large numbers of) blocks to (small numbers of) SMs that signal that they have “excess capacity”
1. Once a block is picked up for execution by one SM, it does not leave the SM before all threads in that block finish executing the kernel. Only when a block is finished & retired can we place another block on that SM. Thus, more SMs means a more expensive card.
SM-level scheduler (more interesting): Schedules the execution of the threads in a block onto the SM functional units

SM-Level Scheduling

Each block of threads are divided into 32-thread warps
- 32: Selected by NVIDIA
- Warp: A group of 32 thread of consecutive IDs, basic scheduling unit on the SM
SM hardware implements almost zero-overhead warp scheduling/switching

Thread IDs within a warp are consecutive and increasing
But we cannot assume ordering among warps

There are three possible states for warps:
- Active warps (deployed on an SM)
- Eligible warps (a subset of active warps)
- Issued warps (a subset of eligible warps)
Warp stalling: No new instruction issued at a clock cycle
- Possible reasons
  - Instruction fetch
  - Memory dependency
  - Execution dependency
  - Synchronization barrier
In execution configurations, we should have thread block sizes that result in mostly full warps

Thread Divergence (pre-Volta)

Consider this:

__global__ void odd_even(int n, int* x)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if( (i & 0x01) == 0 )
    {
        x[i] = x[i] + 1;
    }
    else
    {
        x[i] = x[i] + 2;
    }
}
// half of the threads in the warp execute the if clause, and the other half the else clause

The performance decreases with the degree of divergence in warps, say a 32-case switch statement
Solutions
- Pre-Volta: a single program counter is shared amongst all 32 threads, combined with an active mask that specifies which threads of the warp are active at any given time
- Post-Volta: enables equal concurrency between all threads, regardless of warp
  - Execution state (PC, program counter & S, call stack) are maintained per thread (as opposed to one per warp up until Pascal)

PreviousLecture 9: GPU Memory Spaces NextLecture 11: Execution Divergence. Control Flow in CUDA. CUDA Shared Memory Issues.

Last updated 4 years ago

Was this helpful?