# Lecture 15: CUDA Case Studies. (3) Parallel Prefix Scan on the GPU. Using Multiple Streams in CUDA.

## Lecture Summary

* Last time
  * Case studies: parallel reduction on the GPU & 1D convolution
  * Looking beyond today: a few more GPU computing features, then an extended look at optimization features
* Today
  * One more case study: parallel prefix scan
  * Using streams in GPU computing: increasing problem size; improving execution speeds

## Parallel Prefix Scan on the GPU

![Definition of the algorithm](/files/-MUagVRQSj9oQ_u7rnM0)
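The definition lives in the figure above, so as a point of reference here is a minimal sequential sketch of both flavors of scan. It assumes addition as the operator and `int` elements; the function names `inclusive_scan_ref` and `exclusive_scan_ref` are hypothetical and not from the lecture.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> inclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// Exclusive scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> exclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;  // sum of everything strictly before position i
        running += in[i];
    }
    return out;
}
```

For the input `{3, 1, 7, 0, 4, 1, 6, 3}`, the inclusive scan is `{3, 4, 11, 11, 15, 16, 22, 25}` and the exclusive scan is `{0, 3, 4, 11, 11, 15, 16, 22}`. Both loops do N-1 or N additions, which is the O(N) baseline the parallel algorithms are measured against.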

### Algo 1: Hillis & Steele (1986)

* Simple, but work-inefficient: O(N\*log2(N)) additions, versus O(N) for a sequential scan

![](/files/-MUai1zJdx8Fh1Gx69o8)

![](/files/-MUahma-WBNSWHRrRell)
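The stepping pattern in the figures can be tried out on the CPU. The sketch below is an emulation, not a GPU kernel: each pass of the outer loop stands for one parallel step, and the inner loop plays the role of the threads. It assumes addition on `int` and produces an inclusive scan (the variant in the figures may shift the input to get an exclusive one); the name `hillis_steele_scan` is hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// CPU emulation of a Hillis-Steele style inclusive scan.
// There are ceil(log2(n)) outer steps, each doing up to n additions,
// which is where the O(N*log2(N)) work comes from.
std::vector<int> hillis_steele_scan(std::vector<int> in) {
    const std::size_t n = in.size();
    std::vector<int> out(n);
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        // On a GPU, every i would be handled by its own thread. The
        // iterations are independent: they read only `in`, write only `out`.
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = (i >= offset) ? in[i] + in[i - offset] : in[i];
        }
        std::swap(in, out);  // double buffering: this step's output feeds the next
    }
    return in;
}
```

The two buffers matter: updating a single array in place would let one "thread" read a value that another has already overwritten in the same step. On the input `{3, 1, 7, 0, 4, 1, 6, 3}` the function returns `{3, 4, 11, 11, 15, 16, 22, 25}` after three steps (offsets 1, 2, 4).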

### Algo 2: Harris-Sengupta-Owens (2007)

* More involved, but work-efficient: O(N) additions
* Balanced trees: A common parallel algorithm pattern
  * Up-sweep from the leaves to the root, then down-sweep from the root back to the leaves
  * The "tree" is only a concept: no actual tree data structure is built, and both sweeps operate in place on the array

![The reduction/upsweep step](/files/-MUalCCjeF3jD1xH6QV1)

![The down sweep step. Sheesh, this is just...](/files/-MUalIN_NGokDWkN0SY3)
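The two sweeps are easier to follow in code than in the pictures. Below is a CPU emulation of the up-sweep/down-sweep pattern, offered as a sketch under assumptions rather than the lecture's kernel: it assumes addition on `int`, an array length that is a power of two, and it produces an exclusive scan. The name `work_efficient_scan` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the work-efficient exclusive scan (up-sweep + down-sweep).
// Within each stride, the inner-loop iterations touch disjoint pairs of
// elements, so on a GPU they could run in parallel.
std::vector<int> work_efficient_scan(std::vector<int> a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // assumption: power-of-two length

    // Up-sweep (reduction): partial sums climb from the leaves to the root.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            a[i] += a[i - stride];
        }
    }

    // Clear the root, then down-sweep from the root back to the leaves.
    a[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = a[i - stride];
            a[i - stride] = a[i];  // left child receives the parent's value
            a[i] += left;          // right child receives parent + old left value
        }
    }
    return a;
}
```

Each sweep performs N-1 additions in total, which is where the O(N) work bound comes from. On the input `{3, 1, 7, 0, 4, 1, 6, 3}` the array is `{3, 4, 7, 11, 4, 5, 6, 25}` after the up-sweep and the function returns `{0, 3, 4, 11, 11, 15, 16, 22}`.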

## CUDA Streams

* A CUDA-enabled GPU has 2 engines
  * An execution engine
  * A copy engine (which contains 2 sub-engines that can work simultaneously)
    * A H2D copy sub-engine
    * A D2H copy sub-engine
* Asynchronous execution: the call returns control to the host before the device finishes the work
  * Examples: kernel launches, D2D memory copies, memory copies by functions with the `Async` suffix, etc.
* Overlapping Host <--> Device data transfer with device execution
  * Issue: The device's execution queue is FIFO, so a copy queued behind a kernel waits for it even though the copy engine is idle
    * Addressed by using CUDA "streams"
* Concurrency can be managed through streams
  * Concurrency means one of two things:
    * The copy and the execution engines of GPU working at the same time
    * Several different kernels being executed at the same time on the GPU
* A stream is a sequence of CUDA commands issued by the host that executes on the GPU in issue-order
  * CUDA operations in different streams may run concurrently
  * CUDA operations from different streams may be interleaved
* As soon as a CUDA function is invoked, a default stream (stream 0) is created
* Create a stream with `cudaStreamCreate()`; destroy it with `cudaStreamDestroy()`

![](/files/-MUau-OIEYDDvXFpWp9r)

