Lecture 10: GPU Scheduling Issues.
- Wrap up GPU computing: generalities, execution configuration
- GPU computing: scheduling execution


Scheduling questions:
- What is the order for the blocks to be executed?
- How is this execution process managed?
- When/How are the threads in a block executed?
Two levels of schedulers:
- 1. Device-level scheduler (NVIDIA GigaThread engine): Assigns (large numbers of) blocks to (small numbers of) SMs that signal that they have “excess capacity”
- Once a block is picked up for execution by an SM, it does not leave that SM until all threads in the block finish executing the kernel. Only when a block is finished & retired can another block be placed on that SM. Thus, more SMs means a more expensive card, since it can host more blocks at once. (A small probe of this block-to-SM assignment is sketched right after this list.)
- 2. SM-level scheduler (more interesting): Schedules the execution of the threads in a block onto the SM functional units
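For instance, the sketch below (not from the lecture; record_sm and the array names are illustrative, and it reads the PTX special register %smid) lets each block report which SM it was placed on:

#include <cstdio>
#include <cuda_runtime.h>

// Each block records the ID of the SM it executed on; thread 0 of the block does the write.
__global__ void record_sm(int* sm_of_block)
{
    if (threadIdx.x == 0)
    {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));  // read the SM ID special register
        sm_of_block[blockIdx.x] = (int)smid;
    }
}

int main()
{
    const int numBlocks = 64;
    int h_sm[numBlocks];
    int* d_sm;
    cudaMalloc(&d_sm, numBlocks * sizeof(int));
    record_sm<<<numBlocks, 128>>>(d_sm);
    cudaMemcpy(h_sm, d_sm, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < numBlocks; ++b)
        printf("block %2d ran on SM %d\n", b, h_sm[b]);
    cudaFree(d_sm);
    return 0;
}

Running this repeatedly typically shows the blocks spread across the available SMs, and in no guaranteed order.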

Note that tensor cores are not present in older architectures
- Each block of threads is divided into 32-thread warps
- 32: a warp size selected by NVIDIA
- Warp: a group of 32 threads with consecutive IDs; the basic scheduling unit on the SM (a small sketch follows this list)
- SM hardware implements almost zero-overhead warp scheduling/switching
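A minimal device-side sketch of this split (the kernel and variable names are illustrative, not from the lecture); warpSize is a CUDA built-in variable equal to 32:

__global__ void warp_layout(int* warp_of_thread)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    // threads 0..31 of a block form warp 0, threads 32..63 form warp 1, and so on
    int warp_id = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
    int lane_id = threadIdx.x % warpSize;   // this thread's position (0..31) inside its warp
    warp_of_thread[i] = warp_id;
    (void)lane_id;                          // lane_id shown for illustration only
}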

SM Architecture Specifications (for one SM)
- Thread IDs within a warp are consecutive and increasing
- But we cannot assume ordering among warps
- There are three possible states for warps:
- Active warps (deployed on an SM)
- Eligible warps (a subset of active warps)
- Issued warps (a subset of eligible warps)
- Warp stalling: the warp issues no new instruction at a given clock cycle
- Possible reasons
- Instruction fetch
- Memory dependency
- Execution dependency
- Synchronization barrier
- In execution configurations, we should pick thread block sizes that result in full warps, i.e., multiples of 32 (see the launch sketch below)
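As a sketch of what such an execution configuration looks like (my_kernel, n, and d_x are hypothetical names, not from the lecture; d_x is assumed to be a device array of n ints allocated with cudaMalloc):

void launch_sketch(int n, int* d_x)
{
    int threadsPerBlock = 256;  // a multiple of 32, so every warp in the block is full
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division to cover all n elements
    my_kernel<<<blocksPerGrid, threadsPerBlock>>>(n, d_x);
}

Every launched warp is full by construction; threads in the last block whose index i >= n simply do no work (hence the bounds check in kernels like odd_even below).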
Consider this:
__global__ void odd_even(int n, int* x)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)  // guard the last block against out-of-range accesses
    {
        if ((i & 0x01) == 0)
        {
            x[i] = x[i] + 1;   // even-indexed threads take this path
        }
        else
        {
            x[i] = x[i] + 2;   // odd-indexed threads take this path
        }
    }
}
// within each warp, half of the threads execute the if clause and the other half the else clause: this is warp divergence
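One way to remove this particular divergence (a sketch, not from the lecture): the two branches differ only in the constant being added, so the branch can be folded into arithmetic on the thread index.

__global__ void odd_even_no_branch(int n, int* x)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
    {
        // even threads add 1, odd threads add 2, with no divergent branch inside the warp
        x[i] = x[i] + 1 + (i & 0x01);
    }
}

This trick only applies because both paths do essentially the same work; for genuinely different code paths (such as the 32-case switch mentioned below), the data or the thread-to-work mapping has to be reorganized so that threads of the same warp follow the same path.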

A visualization of what happens in the divergent odd_even kernel above: execution moves forward in lockstep for half of the threads at a time, while the other half sits idle (masked off)
- Performance decreases with the degree of divergence within a warp; in the worst case, e.g. a 32-case switch statement where every thread of the warp takes a different case, the 32 paths execute one after another
- How the hardware deals with divergence (see the active-mask sketch below)
- Pre-Volta: a single program counter is shared amongst all 32 threads of the warp, combined with an active mask that specifies which threads of the warp are active at any given time
- Post-Volta: independent thread scheduling enables equal concurrency between all threads, regardless of warp
- Execution state (program counter and call stack) is maintained per thread (as opposed to one per warp, up to and including Pascal)
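The active mask can be observed from device code (a sketch, assuming CUDA 9 or newer; the kernel name is illustrative): __activemask() returns a 32-bit mask of the lanes of the calling warp that are currently active and converged, and __syncwarp() explicitly reconverges the warp, which post-Volta code should not assume happens automatically after a branch.

#include <cstdio>

// Launch with a single warp, e.g. show_active_mask<<<1, 32>>>();
__global__ void show_active_mask()
{
    int lane = threadIdx.x;
    if ((lane & 0x01) == 0)
    {
        if (lane == 0)  // one representative lane prints for the even branch
            printf("even branch: active mask = 0x%08x\n", __activemask());
    }
    else
    {
        if (lane == 1)  // one representative lane prints for the odd branch
            printf("odd branch:  active mask = 0x%08x\n", __activemask());
    }
    __syncwarp();  // explicitly reconverge the full warp before continuing
}

On pre-Volta hardware the two prints show complementary masks (the 16 even lanes vs. the 16 odd lanes); with independent thread scheduling the reported mask covers whichever lanes happen to be converged at that point, which is exactly why explicit __syncwarp() calls matter.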
