Lecture 15: CUDA Case Studies. (3) Parallel Prefix Scan on the GPU. Using Multiple Streams in CUDA.
Last time
Case studies: parallel reduction on the GPU & 1D convolution
Looking beyond today: a few more GPU computing features, then a stretch focused on optimization topics
Today
One more case study: parallel prefix scan
Using streams in GPU computing: increasing the problem size; improving execution speed
Parallel prefix scan: two implementations
Simple, but suboptimal: O(N*log2(N)) work (sketched below)
Convoluted, but work-efficient: O(N) work (sketched after the balanced-tree notes)
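As a concrete illustration, here is a minimal sketch of the simple approach (a Hillis-Steele style inclusive scan), assuming the input fits in a single block's shared memory; the kernel name and the double-buffer layout are illustrative, not taken from the lecture:

```cuda
// Naive inclusive scan: O(N*log2(N)) additions in total.
// Assumes a single block with n threads; shared memory holds two buffers.
__global__ void naiveScan(const float *in, float *out, int n)
{
    extern __shared__ float temp[];      // 2*n floats, double-buffered
    int tid  = threadIdx.x;
    int pout = 0, pin = 1;

    temp[tid] = in[tid];                 // load input into shared memory
    __syncthreads();

    for (int offset = 1; offset < n; offset <<= 1) {
        pout = 1 - pout;                 // swap the two buffers each pass
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout*n + tid] = temp[pin*n + tid] + temp[pin*n + tid - offset];
        else
            temp[pout*n + tid] = temp[pin*n + tid];
        __syncthreads();
    }
    out[tid] = temp[pout*n + tid];       // result of the inclusive scan
}

// Launch for n elements (n <= block size):
//   naiveScan<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n);
```

Each of the log2(N) passes performs up to N additions, hence the O(N*log2(N)) work bound.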
Balanced trees: A common parallel algorithm pattern
Up-sweep from the leaves to the root building partial sums, then a down-sweep from the root back to the leaves
"Tree": Just a concept--the actual data structure is not used
A CUDA-enabled GPU has 2 engines
An execution engine
A copy engine (which contains 2 sub-engines that can work simultaneously)
A H2D copy sub-engine
A D2H copy sub-engine
Async execution
Examples: kernel launches, D2D memory copies, memory copies performed by functions with the Async suffix, etc.
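A small host-side sketch of this asynchrony, with a placeholder kernel (myKernel is illustrative, not from the lecture); note that cudaMemcpyAsync needs page-locked (pinned) host memory to actually run asynchronously:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n)         // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    float *h_data, *d_data;
    cudaMallocHost((void**)&h_data, N * sizeof(float));  // pinned host memory
    cudaMalloc((void**)&d_data, N * sizeof(float));

    // Both calls below return to the host immediately:
    cudaMemcpyAsync(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);

    // ... independent CPU work can execute here, overlapped with the GPU ...

    cudaDeviceSynchronize();             // block until the GPU has finished
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```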
Overlapping Host <--> Device data transfer with device execution
Issue: the device execution queue is FIFO, so back-to-back operations cannot overlap by default
Addressed through the use of CUDA "streams"
Concurrency can be managed through streams
Concurrency means one of two things:
The copy and execution engines of the GPU working at the same time
Several different kernels being executed at the same time on the GPU
A stream is a sequence of CUDA commands issued by the host that executes on the GPU in issue-order
CUDA operations in different streams may run concurrently
CUDA operations from different streams may be interleaved
As soon as a CUDA function is invoked, a default stream (stream 0) is created
Create a stream using cudaStreamCreate(); destroy it using cudaStreamDestroy()
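To tie these ideas together, here is a sketch of a two-stream pattern, assuming pinned host memory and a placeholder kernel (scaleKernel and the sizes are illustrative); splitting the work into chunks lets one stream's copy overlap another stream's kernel, keeping both the copy and execution engines busy:

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float *d, int n)      // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 22, NSTREAMS = 2, CHUNK = N / NSTREAMS;
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, N * sizeof(float));   // pinned: needed for overlap
    cudaMalloc((void**)&d_buf, N * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    // Work within each stream runs in issue order, but the streams can
    // overlap each other: stream 1's H2D copy can run while stream 0's
    // kernel executes.
    for (int i = 0; i < NSTREAMS; ++i) {
        int off = i * CHUNK;
        cudaMemcpyAsync(d_buf + off, h_buf + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        scaleKernel<<<(CHUNK + 255) / 256, 256, 0, streams[i]>>>(d_buf + off, CHUNK);
        cudaMemcpyAsync(h_buf + off, d_buf + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();             // wait for all streams to finish

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```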