# Lecture 16: Streams, and overlapping data copy with execution.

## Lecture Summary

* Last time
  * Case study: Parallel prefix scan
  * Using streams in GPU computing
* Today
  * Wrap up streams in GPU computing: increasing problem size; improving execution speeds
  * Debugging & profiling GPU code: some nuts and bolts

## Streams

### Example 0

* Streams 1 and 2 are already defined and initialized
  * Use the two copy sub-engines at the same time: copy in (stream1), copy out (stream2)
  * Postpone launching `myKernel` in stream2 until the copy operation in stream1 has completed

```
cudaEvent_t event;
cudaEventCreate(&event);                                             // create event
cudaMemcpyAsync(d_in, in, size, cudaMemcpyHostToDevice, stream1);    // 1) H2D copy of new input
cudaEventRecord(event, stream1);                                     // record event after the H2D copy
cudaMemcpyAsync(out, d_out, size, cudaMemcpyDeviceToHost, stream2);  // 2) D2H copy of previous result
cudaStreamWaitEvent(stream2, event, 0);                              // stream2 waits for the event recorded in stream1
myKernel<<<1000, 512, 0, stream2>>>(d_in, d_out);                    // 3) GPU must wait for 1 and 2
someCPUfunction(blah, blahblah);                                     // CPU work; this gets executed right away
```

### Example 1

![](/files/-MUioE8afH4hCpCSuxzN)

![Stage 3 enqueues the set of GPU operations that need to be undertaken (the "chunkification")](/files/-MUioIE-U1H7NQ0bvAyj)

![Concurrency (manual pipelining)](/files/-MUiouqO4gVQSOgbUNib)
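Example 1 is only shown as slides above. As a rough sketch of what single-stream "chunkification" can look like, the code below processes a large array in fixed-size chunks so that the device only ever holds one chunk at a time. The element-wise add workload and the names `addKernel` and `addChunkedOneStream` are assumptions made for illustration, not the lecture's code.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise sum, standing in for the lecture's kernel.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Processes `total` elements in chunks of at most `chunk` elements, all in one stream.
// h_a, h_b, h_c are assumed to be pinned host buffers (e.g. from cudaMallocHost);
// cudaMemcpyAsync needs pinned memory to be truly asynchronous.
void addChunkedOneStream(const float* h_a, const float* h_b, float* h_c,
                         int total, int chunk) {
    assert(chunk > 0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Device buffers are sized for one chunk only, not for the whole problem.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, chunk * sizeof(float));
    cudaMalloc(&d_b, chunk * sizeof(float));
    cudaMalloc(&d_c, chunk * sizeof(float));

    for (int off = 0; off < total; off += chunk) {
        int n = (total - off < chunk) ? (total - off) : chunk;  // last chunk may be short
        size_t bytes = n * sizeof(float);
        // Operations in the same stream run in issue order, so reusing the
        // device buffers from one chunk to the next is safe.
        cudaMemcpyAsync(d_a, h_a + off, bytes, cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(d_b, h_b + off, bytes, cudaMemcpyHostToDevice, stream);
        addKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_a, d_b, d_c, n);
        cudaMemcpyAsync(h_c + off, d_c, bytes, cudaMemcpyDeviceToHost, stream);
    }

    cudaStreamSynchronize(stream);  // wait for every enqueued chunk to finish
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaStreamDestroy(stream);
}
```

With a single stream nothing overlaps yet; the gain is that the problem size is no longer limited by device memory. Error checking is omitted to keep the sketch short.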

### Example 2.1

* Similar to Example 1, but uses two streams in an attempt to speed up execution
* In practice this gives almost no speedup (62 ms -> 61 ms)

![](/files/-MUiqUBodq2ZSkIw6OH7)

![](/files/-MUiqX9GJQEKaLrtVY2q)

![Note that the kernel stays the same](/files/-MUiqZp1B9Dzr7wqlkKf)

![There is actually no overlap of copy & execution...](/files/-MUirjp6weTcbnpgAcXG)
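The slides for Example 2.1 are not reproduced as text, so the following is only a sketch of one way a two-stream version of a chunked loop might be written, assuming an element-wise add workload. The names `addKernel` and `addChunkedTwoStreamsDepthFirst` are hypothetical. Work is enqueued "depth-first": everything for one stream, then everything for the other.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise sum; the same kernel a one-stream version would use.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// h_a, h_b, h_c are assumed to be pinned host buffers.
// To keep the sketch short, `total` must be a multiple of 2 * chunk.
void addChunkedTwoStreamsDepthFirst(const float* h_a, const float* h_b, float* h_c,
                                    int total, int chunk) {
    assert(chunk > 0 && total % (2 * chunk) == 0);
    const size_t bytes = chunk * sizeof(float);
    const int blocks = (chunk + 255) / 256;

    cudaStream_t s[2];
    float *d_a[2], *d_b[2], *d_c[2];  // each stream gets its own device buffers
    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&s[k]);
        cudaMalloc(&d_a[k], bytes);
        cudaMalloc(&d_b[k], bytes);
        cudaMalloc(&d_c[k], bytes);
    }

    for (int off = 0; off < total; off += 2 * chunk) {
        for (int k = 0; k < 2; ++k) {
            int o = off + k * chunk;
            // Depth-first: copy in, kernel, copy out for stream k, then the next stream.
            cudaMemcpyAsync(d_a[k], h_a + o, bytes, cudaMemcpyHostToDevice, s[k]);
            cudaMemcpyAsync(d_b[k], h_b + o, bytes, cudaMemcpyHostToDevice, s[k]);
            addKernel<<<blocks, 256, 0, s[k]>>>(d_a[k], d_b[k], d_c[k], chunk);
            cudaMemcpyAsync(h_c + o, d_c[k], bytes, cudaMemcpyDeviceToHost, s[k]);
        }
    }

    for (int k = 0; k < 2; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaFree(d_a[k]);
        cudaFree(d_b[k]);
        cudaFree(d_c[k]);
        cudaStreamDestroy(s[k]);
    }
}
```

A commonly cited explanation for the missing overlap with this issue order is that, on hardware where all copies share one queue, the second stream's input copies sit behind the first stream's output copy, which in turn has to wait for the first stream's kernel. Whether this applies depends on the GPU generation, so treat it as a hypothesis to confirm with a profiler.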

### Example 2.2

![](/files/-MUirrUWb-2ttNJpko8u)
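The notes do not say in text what Example 2.2 changes. One common way to get copy/execute overlap out of two streams is to enqueue work "breadth-first", alternating between the streams at each stage instead of issuing all of one stream's work first. The sketch below shows that ordering under the assumption of an element-wise add workload; `addKernel` and `addChunkedTwoStreamsBreadthFirst` are hypothetical names, and this may or may not match the slide.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise sum. The kernel itself does not change;
// only the order in which the host enqueues work does.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// h_a, h_b, h_c are assumed to be pinned host buffers.
// To keep the sketch short, `total` must be a multiple of 2 * chunk.
void addChunkedTwoStreamsBreadthFirst(const float* h_a, const float* h_b, float* h_c,
                                      int total, int chunk) {
    assert(chunk > 0 && total % (2 * chunk) == 0);
    const size_t bytes = chunk * sizeof(float);
    const int blocks = (chunk + 255) / 256;

    cudaStream_t s[2];
    float *d_a[2], *d_b[2], *d_c[2];  // each stream gets its own device buffers
    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&s[k]);
        cudaMalloc(&d_a[k], bytes);
        cudaMalloc(&d_b[k], bytes);
        cudaMalloc(&d_c[k], bytes);
    }

    for (int off = 0; off < total; off += 2 * chunk) {
        const int o[2] = {off, off + chunk};  // chunk offsets for stream 0 and stream 1
        // Breadth-first: issue the same stage for both streams before moving on.
        for (int k = 0; k < 2; ++k)
            cudaMemcpyAsync(d_a[k], h_a + o[k], bytes, cudaMemcpyHostToDevice, s[k]);
        for (int k = 0; k < 2; ++k)
            cudaMemcpyAsync(d_b[k], h_b + o[k], bytes, cudaMemcpyHostToDevice, s[k]);
        for (int k = 0; k < 2; ++k)
            addKernel<<<blocks, 256, 0, s[k]>>>(d_a[k], d_b[k], d_c[k], chunk);
        for (int k = 0; k < 2; ++k)
            cudaMemcpyAsync(h_c + o[k], d_c[k], bytes, cudaMemcpyDeviceToHost, s[k]);
    }

    for (int k = 0; k < 2; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaFree(d_a[k]);
        cudaFree(d_b[k]);
        cudaFree(d_c[k]);
        cudaStreamDestroy(s[k]);
    }
}
```

The results are identical to a depth-first issue order; only the timeline differs. Whether the reordering actually produces overlap is best checked in a profiler timeline rather than assumed.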

* Streams recap
  * Concurrency comes in two flavors:
    * The copy and the execution engines of the GPU working at the same time
    * Several different kernels being executed at the same time on the GPU
* CUDA/GPU computing recap
  * Generally, any application that fits the SIMD paradigm can benefit from using GPUs
    * Good speedups for a small investment of time and money
  * Hardware is changing faster than software
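The second flavor of concurrency from the streams recap, several kernels on the GPU at once, can be sketched as follows. Two independent launches go into two different streams, which allows (but does not force) the hardware to run them side by side. `scaleKernel` and `scaleTwoArraysConcurrently` are hypothetical names chosen for this illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element of x by f.
__global__ void scaleKernel(float* x, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

// d_x and d_y are device pointers to two independent arrays of n floats.
void scaleTwoArraysConcurrently(float* d_x, float* d_y, int n) {
    assert(n > 0);
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const int blocks = (n + 255) / 256;
    // The two launches have no dependency on each other. Because they sit in
    // different streams, the GPU may overlap them if it has free resources;
    // large kernels that fill the device will still run one after the other.
    scaleKernel<<<blocks, 256, 0, s1>>>(d_x, n, 2.0f);
    scaleKernel<<<blocks, 256, 0, s2>>>(d_y, n, 3.0f);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

Had both launches used the same stream, they would be serialized by definition, no matter how small the kernels are.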

## Debugging & Profiling in CUDA

### cuda-gdb

* Works like gdb, but with extra GPU-specific things to keep track of
* For details on the following, see the slides
  * Program execution control
  * Thread focus
  * Program state inspection (stack trace, source variables, memory, HW registers, code disassembly)
  * Run-time error detection (cuda-memcheck)
  * Tips, best practices, and misc notes
* I still prefer `printf()`, change my mind. /s
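For the `printf()` camp, here is a minimal sketch of printf-style debugging in device code, combined with basic run-time error checks on the host. The `CUDA_CHECK` macro, `squareKernel`, and `squareWithChecks` are hypothetical helpers written for this note, not part of CUDA or of the lecture.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failing CUDA call with file/line, then
// return the error from the enclosing function.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return err_;                                              \
        }                                                             \
    } while (0)

__global__ void squareKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = x[i] * x[i];
        // Print from one thread only; an unguarded printf prints once per thread.
        if (i == 0)
            printf("block %d, thread %d: x[0] = %f\n", blockIdx.x, threadIdx.x, x[i]);
    }
}

// d_x is a device pointer to n floats. Returns the first CUDA error seen.
cudaError_t squareWithChecks(float* d_x, int n) {
    assert(n > 0);
    squareKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution; also flushes device printf
    return cudaSuccess;
}
```

Device-side `printf` output is buffered and only shows up once the host synchronizes, which is one reason the synchronize call matters here.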

### Profiling

* Nsight Compute (focuses only on the GPU; `ncu` collects data, `ncu-ui` visualizes it interactively)
* Nsight Systems (focuses on the whole system)
* nvprof (being deprecated)

