Lecture 16: Streams, and overlapping data copy with execution.
Lecture Summary
Streams
Example 0
cudaEvent_t event;
cudaEventCreate(&event); // create event
cudaMemcpyAsync(d_in, in, size, cudaMemcpyHostToDevice, stream1); // 1) H2D copy of new input
cudaEventRecord(event, stream1); // record event in stream1, after the H2D copy
cudaMemcpyAsync(out, d_out, size, cudaMemcpyDeviceToHost, stream2); // 2) D2H copy of previous result
cudaStreamWaitEvent(stream2, event, 0); // stream2 waits for the event recorded in stream1
myKernel<<<1000, 512, 0, stream2>>>(d_in, d_out); // 3) GPU must wait for 1 and 2
someCPUfunction(blah, blahblah); // this gets executed right away, the host does not block
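Example 0 above omits the setup around the stream calls. The following is a minimal sketch of how it might be completed into a standalone program, under several assumptions that are not in the original: the kernel body (it simply doubles each element), the extra length parameter n, the CHECK macro, the wrapper function runExample0, the use of pinned host buffers via cudaMallocHost, and the final copy that fetches the new result for verification are all hypothetical additions for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Assumed stand-in for myKernel: doubles each input element.
__global__ void myKernel(const float *d_in, float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = 2.0f * d_in[i];
}

// Runs the Example 0 pattern once and returns the number of wrong outputs.
int runExample0() {
    const int n = 1000 * 512;              // matches the <<<1000, 512>>> launch
    const size_t size = n * sizeof(float);

    float *in, *out, *d_in, *d_out;
    CHECK(cudaMallocHost(&in, size));      // pinned memory, so async copies can overlap
    CHECK(cudaMallocHost(&out, size));
    CHECK(cudaMalloc(&d_in, size));
    CHECK(cudaMalloc(&d_out, size));
    CHECK(cudaMemset(d_out, 0, size));     // stands in for a "previous result"
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    cudaStream_t stream1, stream2;
    cudaEvent_t event;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));
    CHECK(cudaEventCreate(&event));

    // 1) H2D copy of new input in stream1, then mark its completion.
    CHECK(cudaMemcpyAsync(d_in, in, size, cudaMemcpyHostToDevice, stream1));
    CHECK(cudaEventRecord(event, stream1));
    // 2) D2H copy of the previous result in stream2.
    CHECK(cudaMemcpyAsync(out, d_out, size, cudaMemcpyDeviceToHost, stream2));
    // 3) The kernel in stream2 runs only after 2) and after the event from 1).
    CHECK(cudaStreamWaitEvent(stream2, event, 0));
    myKernel<<<1000, 512, 0, stream2>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());

    // Host code placed here would run right away, while the GPU is busy.

    // Added for verification only: fetch the new result and wait for stream2.
    CHECK(cudaMemcpyAsync(out, d_out, size, cudaMemcpyDeviceToHost, stream2));
    CHECK(cudaStreamSynchronize(stream2));

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (out[i] != 2.0f * in[i]) ++mismatches;

    CHECK(cudaEventDestroy(event));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(in));
    CHECK(cudaFreeHost(out));
    return mismatches;
}

int main() {
    int mismatches = runExample0();
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

Without the cudaStreamWaitEvent call, the kernel in stream2 could start before the H2D copy in stream1 has finished, since operations in different streams are otherwise unordered. The host buffers are pinned because copies from pageable memory may not run asynchronously with respect to the host.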
Example 1



Example 2.1




Example 2.2

Debugging & Profiling in CUDA
cuda-gdb
Profiling