# Lecture 10: GPU Scheduling Issues

## Lecture Summary

* Wrap up GPU computing: generalities, execution configuration
* GPU computing: scheduling execution

## Using Multiple Blocks

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MTaeIp80mOcc_zCV8x_%2F-MTag_62WJSGAkjketja%2FScreen%20Shot%202021-02-15%20at%2011.17.17%20AM.png?alt=media\&token=dd38da82-6176-435b-91b2-f1f1d366a28b)

## Execution Scheduling Issues

![Thread Index vs. Thread ID](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MTaeIp80mOcc_zCV8x_%2F-MTahxcIWx52K-kVbS2C%2FScreen%20Shot%202021-02-15%20at%2011.23.19%20AM.png?alt=media\&token=b4ada231-9929-42ba-ba1c-c3e5f39c3b5c)

Scheduling questions:

* What is the order for the blocks to be executed?
* How is this execution process managed?
* When/How are the threads in a block executed?

Two levels of schedulers:

1. Device-level scheduler (the NVIDIA GigaThread engine): Assigns (large numbers of) blocks to (small numbers of) SMs that signal that they have “excess capacity”
   1. Once a block is picked up for execution by an SM, it stays on that SM until all threads in the block finish executing the kernel. Only when a block finishes and retires can another block be placed on that SM. More SMs therefore means more blocks resident at once (and a more expensive card).
2. SM-level scheduler (more interesting): Schedules the execution of the threads in a block onto the SM functional units
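To get a feel for the device-level constraint above (a block occupies its SM until it retires), here is a rough sketch of how many blocks can be resident on one SM at a time. The per-SM limits used as defaults (64 warps, 32 blocks) are assumed values modeled on recent NVIDIA architectures, not figures from this lecture; query your device's properties for the real numbers.

```cpp
#include <algorithm>

// Sketch: how many blocks can be resident on one SM simultaneously.
// max_warps_per_sm / max_blocks_per_sm are ASSUMED architecture limits.
int resident_blocks_per_sm(int threads_per_block,
                           int max_warps_per_sm = 64,
                           int max_blocks_per_sm = 32) {
    int warps_per_block = (threads_per_block + 31) / 32; // round up to whole warps
    return std::min(max_blocks_per_sm, max_warps_per_sm / warps_per_block);
}
```

For example, under these assumed limits, 256-thread blocks (8 warps each) allow 8 resident blocks per SM, while tiny 32-thread blocks hit the block-count limit instead of the warp limit.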

### SM-Level Scheduling

![Note that tensor cores are not present in older architectures](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MTak3tlHx_PHHZFaair%2F-MTalNYiLac8hmBURtTZ%2FScreen%20Shot%202021-02-15%20at%2011.38.15%20AM.png?alt=media\&token=801519f1-3b55-4c21-850c-4b587a6514f5)

* Each block of threads is divided into 32-thread warps
  * 32: A value selected by NVIDIA
  * Warp: A group of 32 threads with consecutive IDs; the basic scheduling unit on the SM
* SM hardware implements almost zero-overhead warp scheduling/switching
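Because warps are formed from consecutive thread IDs, the warp/lane decomposition for a 1-D block is plain integer arithmetic. The helper names below are ours, for illustration only:

```cpp
// Sketch: which warp a thread belongs to, and its position (lane) in that warp.
// Assumes a 1-D thread block; WARP_SIZE = 32 per NVIDIA.
constexpr int WARP_SIZE = 32;

int warp_of(int threadIdxX) { return threadIdxX / WARP_SIZE; } // warp index within the block
int lane_of(int threadIdxX) { return threadIdxX % WARP_SIZE; } // position inside the warp
```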

![SM Architecture Specifications (for one SM)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MTak3tlHx_PHHZFaair%2F-MTanyadX_4VkVSK9BzN%2FScreen%20Shot%202021-02-15%20at%2011.49.35%20AM.png?alt=media\&token=22988050-9995-48b7-93cc-38821eb8b812)

* Thread IDs within a warp are consecutive and increasing

* But we cannot assume ordering among warps

* There are three possible states for warps:
  * Active warps (deployed on an SM)
  * Eligible warps (a subset of active warps: those ready to issue their next instruction)
  * Issued warps (a subset of eligible warps: those actually selected to issue on a given cycle)

* Warp stalling: A warp cannot issue a new instruction on a given clock cycle
  * Possible reasons
    * Instruction fetch
    * Memory dependency
    * Execution dependency
    * Synchronization barrier

* In execution configurations, choose thread block sizes that are multiples of 32, so that warps are full
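A quick way to see why block sizes that are not multiples of 32 waste capacity: compute the fraction of allocated warp lanes that actually carry a thread. This is an illustrative helper, not a CUDA API:

```cpp
// Sketch: fraction of warp lanes doing useful work for a given block size.
// A 48-thread block still occupies 2 full warps; the second is half empty.
double warp_utilization(int threads_per_block) {
    int warps = (threads_per_block + 31) / 32;        // warps allocated (rounded up)
    return double(threads_per_block) / (warps * 32);  // live lanes / total lanes
}
```

A block of 64 threads fills its warps completely, while a block of 48 threads leaves a quarter of its lanes idle.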

## Thread Divergence (pre-Volta)

Consider this:

```
__global__ void odd_even(int n, int* x)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= n) return; // guard: the last block may have threads past the end of x
    if ((i & 0x01) == 0)
    {
        x[i] = x[i] + 1;
    }
    else
    {
        x[i] = x[i] + 2;
    }
}
// half of the threads in each warp execute the if clause, and the other half the else clause
```

![A visualization of what happens (execution moves forward for half of the threads each time in lockstep fashion)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWikICWlUm8e-41cxu%2F-MUWr8P_1JMOTjugaEYD%2FScreen%20Shot%202021-02-26%20at%2011.01.01%20PM.png?alt=media\&token=6dc6c517-a10c-4e05-80d0-6895facdb62d)

* Performance decreases with the degree of divergence within a warp; the worst case is something like a 32-way switch statement where every thread in the warp takes a different case
* How divergence is handled
  * Pre-Volta: a single program counter is shared amongst all 32 threads, combined with an active mask that specifies which threads of the warp are active at any given time; divergent paths are serialized
  * Volta and later (Independent Thread Scheduling): enables equal concurrency between all threads, regardless of warp
    * Execution state (program counter and call stack) is maintained per thread (as opposed to per warp, up through Pascal)
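A host-side sketch of the pre-Volta mechanism applied to the `odd_even` kernel: one shared program counter walks both paths, and a 32-bit active mask gates which lanes commit results on each pass. This is a simulation for intuition only, not how the hardware is programmed:

```cpp
#include <cstdint>
#include <vector>

// Sketch of pre-Volta SIMT branching: the warp executes the 'if' path with
// even lanes enabled, then the 'else' path with odd lanes enabled.
// Two serialized passes instead of one -- that is the divergence penalty.
std::vector<int> run_diverged_warp(std::vector<int> x) {
    uint32_t if_mask = 0, else_mask = 0;
    for (int lane = 0; lane < 32; ++lane)            // build the active masks
        ((lane & 0x01) == 0 ? if_mask : else_mask) |= (1u << lane);

    for (int lane = 0; lane < 32; ++lane)            // pass 1: 'if' path only
        if (if_mask & (1u << lane)) x[lane] += 1;
    for (int lane = 0; lane < 32; ++lane)            // pass 2: 'else' path only
        if (else_mask & (1u << lane)) x[lane] += 2;
    return x;
}
```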

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUWikICWlUm8e-41cxu%2F-MUWtho4gdQUFJT7mq5U%2FScreen%20Shot%202021-02-26%20at%2011.12.16%20PM.png?alt=media\&token=59b8ce9b-590d-4a2a-a11e-664ce54c1cd3)
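Beyond the hardware mechanisms above, this particular kernel can also be restructured in software so that no branch exists at all, sidestepping divergence entirely. Shown here as a host-side loop for illustration; the per-element expression is what the CUDA kernel body would compute:

```cpp
// Sketch: the odd_even update written without a branch.
// Adding 1 to even indices and 2 to odd indices collapses to one expression,
// so a warp executing it never diverges.
void odd_even_branchless(int n, int* x) {
    for (int i = 0; i < n; ++i)       // in CUDA, i = threadIdx.x + blockDim.x * blockIdx.x
        x[i] = x[i] + 1 + (i & 0x01); // +1 if i is even, +2 if i is odd
}
```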
