# Lecture 16: Streams, and Overlapping Data Copy with Execution

## Lecture Summary

* Last time
  * Case study: Parallel prefix scan
  * Using streams in GPU computing
* Today
  * Wrap up streams in GPU computing: increasing problem size; improving execution speeds
  * Debugging & profiling GPU code: some nuts and bolts

## Streams

### Example 0

* Streams 1 & 2 are already defined and initialized
  * Use the two copy sub-engines at the same time: copy in (stream1), copy out (stream2)
  * Postpone launching myKernel in stream2 until the copy operation in stream1 has completed

```
cudaEvent_t event;
cudaEventCreate(&event);                           // create event
cudaMemcpyAsync(d_in, in, size, H2D, stream1);     // 1) H2D copy of new input
cudaEventRecord(event, stream1);                   // record event
cudaMemcpyAsync(out, d_out, size, D2H, stream2);   // 2) D2H copy of previous result
cudaStreamWaitEvent(stream2, event);               // wait for event in stream1
myKernel<<<1000, 512, 0, stream2>>>(d_in, d_out);  // 3) GPU must wait for 1 and 2
someCPUfunction(blah, blahblah);                   // this gets executed right away

```

### Example 1

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUioE8afH4hCpCSuxzN%2FScreen%20Shot%202021-03-01%20at%2011.23.24%20AM.png?alt=media\&token=5e472b9f-3151-4105-b35c-64d7c3d9c319)

![Stage 3 enqueues the set of GPU operations that need to be undertaken (the "chunkification")](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUioIE-U1H7NQ0bvAyj%2FScreen%20Shot%202021-03-01%20at%2011.23.40%20AM.png?alt=media\&token=f0cf1b84-8d06-450f-8e3c-5821f185c75b)

![Concurrency (manual pipelining)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUiouqO4gVQSOgbUNib%2FScreen%20Shot%202021-03-01%20at%2011.26.23%20AM.png?alt=media\&token=4b9a40e4-e492-4341-a6e9-4e6e3b587974)
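
The chunked pipeline in the figures can be sketched as below. This is an illustrative reconstruction, not the lecture's exact code: `N`, `chunk`, and `myKernel` are made-up names, and the host buffers are pinned (`cudaMallocHost`) since the async copies only overlap with execution when the host memory is page-locked.

```
#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];            // placeholder computation
}

int main() {
    const int N = 1 << 24, chunk = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  N * sizeof(float));   // pinned host memory:
    cudaMallocHost(&h_out, N * sizeof(float));   // required for true async copies
    cudaMalloc(&d_in,  chunk * sizeof(float));   // device buffers sized for one chunk
    cudaMalloc(&d_out, chunk * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // "Chunkification": enqueue copy-in, kernel, copy-out for each chunk.
    for (int off = 0; off < N; off += chunk) {
        cudaMemcpyAsync(d_in, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        myKernel<<<chunk / 512, 512, 0, stream>>>(d_in, d_out, chunk);
        cudaMemcpyAsync(h_out + off, d_out, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }
    cudaStreamSynchronize(stream);               // wait for the last chunk to land

    cudaStreamDestroy(stream);
    cudaFree(d_in);  cudaFree(d_out);
    cudaFreeHost(h_in);  cudaFreeHost(h_out);
    return 0;
}
```

With a single stream all three operations per chunk serialize; the point of the next examples is to split the work across streams so copy and execution can overlap.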

### Example 2.1

* Similar to Example 1, but with two streams, aiming to increase the speed of execution
* In practice this gives only a marginal speedup (62 ms -> 61 ms)

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUiqUBodq2ZSkIw6OH7%2FScreen%20Shot%202021-03-01%20at%2011.33.11%20AM.png?alt=media\&token=03ff72cc-5b59-44fd-8f18-f542a32fe906)

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUiqX9GJQEKaLrtVY2q%2FScreen%20Shot%202021-03-01%20at%2011.33.24%20AM.png?alt=media\&token=bebebe30-2771-49d8-a8f6-8076dcb07185)

![Note that the kernel stays the same](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUiqZp1B9Dzr7wqlkKf%2FScreen%20Shot%202021-03-01%20at%2011.33.38%20AM.png?alt=media\&token=3232e8b7-4561-4f31-af95-680496c14abf)

![There is actually no overlap of copy & execution...](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUirjp6weTcbnpgAcXG%2FScreen%20Shot%202021-03-01%20at%2011.38.44%20AM.png?alt=media\&token=29d8a0a1-acf4-4add-a7d0-7d810bdff318)
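
The two-stream version above issues work depth-first: all three operations of one chunk are queued before moving to the next chunk. A hedged sketch (names are illustrative; assume pinned host buffers and per-stream device buffers `d_in[i]`, `d_out[i]` plus streams `s[0]`, `s[1]` set up as in Example 1):

```
// Depth-first issue order: copy-in, kernel, copy-out for stream 0's
// chunk, then the same trio for stream 1's chunk. The operations reach
// the copy engine in that order, so stream 0's D2H copy sits in front
// of stream 1's H2D copy -- hence almost no overlap (62 ms -> 61 ms).
for (int off = 0; off < N; off += 2 * chunk) {
    for (int i = 0; i < 2; i++) {
        int o = off + i * chunk;
        cudaMemcpyAsync(d_in[i], h_in + o, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        myKernel<<<chunk / 512, 512, 0, s[i]>>>(d_in[i], d_out[i], chunk);
        cudaMemcpyAsync(h_out + o, d_out[i], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
}
```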

### Example 2.2

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MUijEtpkoZxGLdpTJoN%2F-MUirrUWb-2ttNJpko8u%2FScreen%20Shot%202021-03-01%20at%2011.39.16%20AM.png?alt=media\&token=1f2ed962-60ed-4047-a48a-a2ab82f5fd09)
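
The fix is to issue the work breadth-first, operation type by operation type across the streams, instead of queuing each chunk's three operations back to back. A sketch under the same assumptions as before (illustrative names; pinned host buffers; per-stream device buffers `d_in[i]`, `d_out[i]` and streams `s[0]`, `s[1]`):

```
// Breadth-first issue order: queue both H2D copies, then both kernels,
// then both D2H copies. Now the copy engine(s) and the execution engine
// each see a back-to-back supply of work, so stream 1's copy can
// overlap with stream 0's kernel execution.
for (int off = 0; off < N; off += 2 * chunk) {
    for (int i = 0; i < 2; i++)
        cudaMemcpyAsync(d_in[i], h_in + off + i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    for (int i = 0; i < 2; i++)
        myKernel<<<chunk / 512, 512, 0, s[i]>>>(d_in[i], d_out[i], chunk);
    for (int i = 0; i < 2; i++)
        cudaMemcpyAsync(h_out + off + i * chunk, d_out[i],
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
}
```

Note that the kernel itself is unchanged; only the order in which the host enqueues operations differs.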

* Streams recap
  * Concurrency comes in two flavors:
    * The copy and execution engines of the GPU working at the same time
    * Several different kernels executing at the same time on the GPU
* CUDA/GPU computing recap
  * Generally, any application that fits the SIMD paradigm can benefit from using GPUs
    * Good speedups for a small investment of time and money
  * Hardware is changing faster than software

## Debugging & Profiling in CUDA

### cuda-gdb

* Essentially gdb, but with extra GPU-specific state (kernels, blocks, threads) that needs our attention
* For more usage, see the slides
  * Program execution control
  * Thread focus
  * Program state inspection (stack trace, source variables, memory, HW registers, code disassembly)
  * Run-time error detection (cuda-memcheck)
  * Tips, best practices, and misc notes
* I still prefer `printf()`, change my mind. /s
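
A minimal session sketch (the program and kernel names are placeholders; `-G` embeds device-side debug info):

```
$ nvcc -g -G app.cu -o app         # -g: host debug info, -G: device debug info
$ cuda-gdb ./app
(cuda-gdb) break myKernel          # breakpoint inside device code
(cuda-gdb) run
(cuda-gdb) info cuda threads       # list active GPU threads
(cuda-gdb) cuda block 0 thread 5   # switch thread focus
(cuda-gdb) print in[i]             # inspect a variable from the focused thread
```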

### Profiling

* Nsight Compute (focuses only on the GPU; `ncu` collects data, `ncu-ui` visualizes it interactively)
* Nsight Systems (focuses on the whole system)
* nvprof (currently being deprecated)
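
Typical invocations (output file names are illustrative):

```
$ ncu -o report ./app              # Nsight Compute: kernel-level metrics
$ ncu-ui report.ncu-rep            # inspect the report interactively
$ nsys profile -o timeline ./app   # Nsight Systems: whole-system timeline
```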
