Lecture 16: Streams, and overlapping data copy with execution.
Lecture Summary
Last time
Case study: Parallel prefix scan
Using streams in GPU computing
Today
Wrap up streams in GPU computing: increasing problem size; improving execution speeds
Debugging & profiling GPU code: some nuts and bolts
Streams
Example 0
stream1 and stream2 are already defined and initialized
Use the two copy sub-engines at the same time: copy in (stream1), copy out (stream2)
Postpone launching of myKernel in stream2 until the copy operation in stream1 is completed
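The slide code for Example 0 is not reproduced in these notes. The following is a minimal sketch of the pattern described above, assuming an event is used to make stream2 wait for the copy in stream1. The kernel body, the buffer names (hIn, hPrev, hCheck, dIn, dOut), the event name copyInDone, and the launch configuration are all hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the notes only name myKernel, not what it computes.
__global__ void myKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hIn, *hPrev, *hCheck, *dIn, *dOut;
    cudaMallocHost(&hIn, bytes);    // pinned host memory, so async copies can overlap
    cudaMallocHost(&hPrev, bytes);
    cudaMallocHost(&hCheck, bytes);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    for (int i = 0; i < n; ++i) hIn[i] = 1.0f;
    cudaMemset(dOut, 0, bytes);     // stands in for a "previous result" on the device
    cudaDeviceSynchronize();        // make sure setup is finished before using the streams

    cudaStream_t stream1, stream2;
    cudaEvent_t copyInDone;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaEventCreate(&copyInDone);

    // 1) Copy in (stream1), then record an event right after the copy.
    cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, stream1);
    cudaEventRecord(copyInDone, stream1);

    // 2) Copy out (stream2): the previous result leaves while the new input arrives.
    cudaMemcpyAsync(hPrev, dOut, bytes, cudaMemcpyDeviceToHost, stream2);

    // 3) stream2 waits for the event, so myKernel starts only after 1) and 2).
    cudaStreamWaitEvent(stream2, copyInDone, 0);
    myKernel<<<(n + 255) / 256, 256, 0, stream2>>>(dIn, dOut, n);

    // Fetch the new result to check it; same stream, so it runs after the kernel.
    cudaMemcpyAsync(hCheck, dOut, bytes, cudaMemcpyDeviceToHost, stream2);
    cudaStreamSynchronize(stream2);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && hPrev[i] == 0.0f && hCheck[i] == 2.0f;
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaEventDestroy(copyInDone);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hPrev); cudaFreeHost(hCheck);
    return ok ? 0 : 1;
}
```

Whether the two copies actually overlap depends on the device having separate copy engines for each direction. The event only guarantees ordering, not overlap.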
Example 1
Example 2.1
Similar to example 1, but with two streams to increase the speed of execution
This actually doesn't give a big speedup (62 ms -> 61 ms)
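The code for Example 2.1 is not included in these notes. A rough sketch of the two-stream idea, assuming the work is a vector add processed in fixed-size chunks that alternate between the streams, might look as follows. The kernel addKernel, the chunk size, and all buffer names are hypothetical, and no timing from the lecture is reproduced here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: element-wise add of two integer arrays.
__global__ void addKernel(const int* a, const int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int chunk = 1 << 20;          // elements per chunk (assumed)
    const int numChunks = 8;            // total problem = numChunks * chunk
    const int total = chunk * numChunks;
    const size_t chunkBytes = chunk * sizeof(int);

    int *hA, *hB, *hC;
    cudaMallocHost(&hA, total * sizeof(int));   // pinned, required for real overlap
    cudaMallocHost(&hB, total * sizeof(int));
    cudaMallocHost(&hC, total * sizeof(int));
    for (int i = 0; i < total; ++i) { hA[i] = i; hB[i] = 2 * i; }

    // One set of chunk-sized device buffers per stream.
    cudaStream_t streams[2];
    int *dA[2], *dB[2], *dC[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&dA[s], chunkBytes);
        cudaMalloc(&dB[s], chunkBytes);
        cudaMalloc(&dC[s], chunkBytes);
    }

    // Chunks alternate between the two streams. Within one stream the
    // operations run in order, so reusing that stream's buffers is safe.
    for (int k = 0; k < numChunks; ++k) {
        const int s = k % 2;
        const int off = k * chunk;
        cudaMemcpyAsync(dA[s], hA + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dB[s], hB + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        addKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dA[s], dB[s], dC[s], chunk);
        cudaMemcpyAsync(hC + off, dC[s], chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    bool ok = true;
    for (int i = 0; i < total; ++i) ok = ok && hC[i] == 3 * i;
    printf("%s\n", ok ? "ok" : "mismatch");

    for (int s = 0; s < 2; ++s) {
        cudaFree(dA[s]); cudaFree(dB[s]); cudaFree(dC[s]);
        cudaStreamDestroy(streams[s]);
    }
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    return ok ? 0 : 1;
}
```

How much this helps depends on the hardware and on how copies and kernels end up scheduled, so it is worth timing both the one-stream and two-stream versions rather than assuming a gain.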
Example 2.2
Streams recap
Concurrency brings two flavors:
The copy and the execution engines of the GPU working at the same time
Several different kernels being executed at the same time on the GPU
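The first flavor is what the stream examples above exercise. For the second flavor, a small sketch, assuming several independent kernels that are each too small to fill the GPU, is to launch each one in its own stream. The kernel fillKernel and the sizes below are hypothetical. Separate streams only permit concurrent execution; they do not guarantee it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical small kernel: few blocks, so several launches may fit on the GPU at once.
__global__ void fillKernel(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main() {
    const int numStreams = 4;
    const int n = 1024;
    cudaStream_t streams[numStreams];
    int* dBuf[numStreams];
    int hBuf[numStreams][n];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&dBuf[s], n * sizeof(int));
    }

    // Kernels in different streams have no ordering between them,
    // so the GPU is free to run them at the same time.
    for (int s = 0; s < numStreams; ++s)
        fillKernel<<<n / 256, 256, 0, streams[s]>>>(dBuf[s], n, s + 1);

    cudaDeviceSynchronize();

    bool ok = true;
    for (int s = 0; s < numStreams; ++s) {
        cudaMemcpy(hBuf[s], dBuf[s], n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) ok = ok && hBuf[s][i] == s + 1;
        cudaFree(dBuf[s]);
        cudaStreamDestroy(streams[s]);
    }
    printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

A timeline view in a profiler is the usual way to confirm whether the kernels really overlapped.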
CUDA/GPU computing recap
Generally, any application that fits the SIMD paradigm can benefit from using GPUs
Good speedups for a small investment of time and money
Hardware is changing faster than software
Debugging & Profiling in CUDA
cuda-gdb
gdb but with more things that need our attention
For more usage, see the slides
Program execution control
Thread focus
Program state inspection (stack trace, source variables, memory, HW registers, code disassembly)
Run-time error detection (cuda-memcheck; replaced by compute-sanitizer in newer CUDA toolkits)
Tips, best practices, and misc notes
I still prefer printf(), change my mind. /s
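For the printf() route, a minimal sketch might look like the following. It assumes a hypothetical helper macro CUDA_CHECK (not part of CUDA) for checking runtime calls, and a hypothetical kernel debugKernel that prints from one thread per block to keep the output small.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                        \
    do {                                                        \
        cudaError_t err_ = (call);                              \
        if (err_ != cudaSuccess) {                              \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                    cudaGetErrorString(err_));                  \
            exit(EXIT_FAILURE);                                 \
        }                                                       \
    } while (0)

__global__ void debugKernel(const int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Print from the first thread of each block only; printing from every thread gets noisy fast.
    if (i < n && threadIdx.x == 0)
        printf("block %d, thread %d: data[%d] = %d\n", blockIdx.x, threadIdx.x, i, data[i]);
}

int main() {
    const int n = 64;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i * i;

    int* d;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(int)));
    CUDA_CHECK(cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice));

    debugKernel<<<4, 16>>>(d, n);
    CUDA_CHECK(cudaGetLastError());       // reports launch errors (bad configuration, etc.)
    CUDA_CHECK(cudaDeviceSynchronize());  // waits for the kernel; device printf output is flushed here

    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

The order of lines printed by different blocks is not guaranteed, so the output should be read as a set of messages rather than a sequence.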
Profiling
Nsight Compute (only focus on GPU; ncu to collect data, ncu-ui to visualize interactively)
Nsight Systems (focus on the whole system)
nvprof (being deprecated in favor of the Nsight tools)