Lecture 10: GPU Scheduling Issues.
Wrap up GPU computing: generalities, execution configuration
GPU computing: scheduling execution
Scheduling questions:
In what order are the blocks executed?
How is this execution process managed?
When/How are the threads in a block executed?
Two levels of schedulers:
Device-level scheduler (NVIDIA GigaThread engine): Assigns (large numbers of) blocks to (small numbers of) SMs that signal that they have “excess capacity”
Once a block is picked up for execution by one SM, it does not leave the SM before all threads in that block finish executing the kernel. Only when a block is finished & retired can another block be placed on that SM. Thus, more SMs means more blocks resident and executing at the same time, which is why cards with more SMs are more expensive.
SM-level scheduler (more interesting): Schedules the execution of the threads in a block onto the SM functional units
Each block of threads is divided into 32-thread warps
32: Selected by NVIDIA
Warp: A group of 32 threads with consecutive IDs; the basic scheduling unit on the SM (a short sketch of how thread indices map to warps follows this list)
SM hardware implements almost zero-overhead warp scheduling/switching
Thread IDs within a warp are consecutive and increasing
But we cannot assume ordering among warps
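A minimal sketch (not from the lecture; kernel name and launch parameters are made up) of how a thread can recover its warp and lane indices from its position within the block:

```cuda
#include <cstdio>

__global__ void whoAmI() {
    int tid    = threadIdx.x;  // 1D thread index within the block
    int warpId = tid / 32;     // which 32-thread warp this thread belongs to
    int laneId = tid % 32;     // position of the thread inside its warp

    // Lane 0 reports for its warp.
    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warpId, tid);
}

int main() {
    whoAmI<<<2, 128>>>();      // 2 blocks of 128 threads -> 4 full warps per block
    cudaDeviceSynchronize();
    return 0;
}
```

Threads 0-31 of a block form warp 0, threads 32-63 form warp 1, and so on; the order in which those warps are issued is up to the SM scheduler.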
There are three possible states for warps:
Active warps (deployed on an SM)
Eligible warps (a subset of active warps)
Issued warps (a subset of eligible warps)
Warp stalling: no new instruction can be issued for the warp in a given clock cycle (a sketch of a common cause follows this list)
Possible reasons
Instruction fetch
Memory dependency
Execution dependency
Synchronization barrier
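A hedged sketch (illustrative kernel, not from the lecture) of a memory-dependency stall: the dependent multiply-add cannot issue until the global load completes, so the warp stalls unless the scheduler can switch to another eligible warp.

```cuda
__global__ void scaleAdd(const float* in, float* out, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];          // long-latency global-memory load
        out[i] = alpha * v + v;   // must wait for v -> memory dependency stall
    }
}
```

Having many active warps per SM gives the scheduler other eligible warps to issue from while this one waits, which is how the latency gets hidden.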
In execution configurations, we should choose thread block sizes that result in full warps, i.e., multiples of 32 (see the sketch below)
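A sketch under assumed names (myKernel, d_data, and n are placeholders) of an execution configuration that keeps warps full by using a block size that is a multiple of 32 and rounding the grid size up:

```cuda
int blockSize = 256;                              // 8 full warps per block
int gridSize  = (n + blockSize - 1) / blockSize;  // ceiling division covers all n elements
myKernel<<<gridSize, blockSize>>>(d_data, n);     // myKernel and d_data are placeholders
```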
Consider this:
Performance decreases with the degree of divergence within a warp, e.g., a 32-case switch statement in which each thread of the warp takes a different case (illustrated in the sketch below)
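A hedged sketch (kernels made up for illustration) contrasting a divergent branch with a warp-uniform one; both do the same work, but in the first kernel every warp must execute both paths one after the other:

```cuda
__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)        // lanes within each warp disagree -> both paths run
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

__global__ void warpUniform(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0) // all 32 lanes of a warp take the same path
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
```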
Solutions
Pre-Volta: a single program counter is shared amongst all 32 threads, combined with an active mask that specifies which threads of the warp are active at any given time
Post-Volta: independent thread scheduling enables equal concurrency between all threads, regardless of warp
Execution state (program counter and call stack) is maintained per thread, as opposed to one per warp up until Pascal (a practical consequence is sketched below)
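One practical consequence, sketched under the assumption of a warp-level reduction (not an example from the lecture): since execution state is per thread on Volta and later, code should not assume the warp runs in lockstep, and warp-level primitives take an explicit participation mask.

```cuda
__inline__ __device__ float warpReduceSum(float val) {
    // 0xffffffff: all 32 lanes are expected to participate in the shuffle.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the sum for the warp
}
```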