Rui's Blog

Lecture 16: Streams, and overlapping data copy with execution.

Lecture Summary

  • Last time
    • Case study: Parallel prefix scan
    • Using streams in GPU computing
  • Today
    • Wrap up streams in GPU computing: increasing problem size; improving execution speeds
    • Debugging & profiling GPU code: some nuts and bolts


Example 0

  • Stream 1 & 2 are defined and initialized already
    • Use the two copy sub-engines at the same time: copy in (stream1), copy out (stream2)
    • Postpone launching of myKernel in stream2until the copy operation in stream1is completed
cudaEvent_t event;
cudaEventCreate(&event); // create event
cudaMemcpyAsync(d_in, in, size, H2D, stream1); // 1) H2D copy of new input
cudaEventRecord(event, stream1); // record event
cudaMemcpyAsync(out, d_out, size, D2H, stream2); // 2) D2H copy of previous result
cudaStreamWaitEvent(stream2, event); // wait for event in stream1
myKernel<<<1000, 512, 0, stream2>>>(d_in, d_out); // 3) GPU must wait for 1 and 2
someCPUfunction(blah, blahblah) // this gets executed right away

Example 1

Stage 3 enqueues the set of GPU operations that need to be undertaken (the "chunkification")
Concurrency (manual pipelining)

Example 2.1

  • Similar to example 1, but with two streams to increase the speed of execution
  • This actually doesn't give a big speedup (62 ms -> 61 ms)
Note that the kernel stays the same
There is actually no overlap of copy & execution...

Example 2.2

  • Streams recap
    • Concurrency brings two flavors:
      • The copy and the execution engines of the GPU working at the same time
      • Several different kernels being executed at the same time on the GPU
  • CUDA/GPU computing recap
    • Generally, any application that fits the SIMD paradigm can benefit from using GPUs
      • Good speedups at a small time and financial investment
    • Hardware is changing faster than software

Debugging & Profiling in CUDA


  • gdb but with more things that need our attention
  • For more usage, see the slides
    • Program execution control
    • Thread focus
    • Program state inspection (stack trace, source variables, memory, HW registers, code disassembly)
    • Run-time error detection (cuda-memcheck)
    • Tips, best practices, and misc notes
  • I still prefer printf(), change my mind. /s


  • Nsight Compute (only focus on GPU; ncu to collect data, ncu-ui to visualize interactively)
  • Nsight Systems (focus on the whole system)
  • nvprof (being deprecated rn)