# Lecture 17: GPU Computing: Advanced Features.

## Lecture Summary

* Last time
  * Streams in GPU computing
  * Debugging & profiling
* Today
  * Use of unified memory in CUDA GPU Computing

## Unified Memory (Managed Memory) in CUDA

* cudaMemCpy
  * Available in release 1.0
  * Moves data between host and device (over PCI-E)
* cudaHostAlloc
  * Allocate host memory rather than malloc-ing -> improve host/device data transfer speed if host memory is not pageable
  * Pros
    * Faster device <--> host transfer
    * Enables the use of asynchronous memory transfer and kernel execution
    * Enables mapping of the host pinned memory into the memory space of the device
  * Cons
    * Large memory impacts system performance
    * Memory allocation speed using cudaHostAlloc is low
  * `cudaError_t cudaHostAlloc(void** pHst, size_t sz, unsigned int flag);`
    * Using the flag `cudaHostAllocMapped` maps the memory allocated on the host in the memory space of the device for direct access
  * **Zero-Copy (Z-C)** GPU-CPU interaction
    * We no longer need an explicit CUDA runtime copy call to move data onto the GPU
    * This balloons the device memory so that it includes main memory that physically resides on the host
    * However, this requires the runtime call to cudaHostGetDevicePointer(). The need for this is eliminated by the Unified Virtual Addressing (UVA) mechanism.
* UVA: GPU and CPU share the virtual memory space. UVAS: UV Address Space.
  * CUDA runtime can identify where the data is stored based on the pointer
  * Instead of `cudaMemcpyxxx`, now we can use a generic `cudaMemcpyDefault`
* Z-C: Use pointer within device function to access host data
* UVA
  * Data access: A GPU can access data on a different GPU
  * Data transfer: Copy data in between GPUs
* UM (Unified Memory): Like UVA, but enabled the CPU to access GPU memory
  * UM works in conjunction with a "managed memory pool"
  * `cudaMallocManaged`replaces the need for explicit memory transfers between host and device, and cudaMalloc / cudaHostAlloc
  * Data is stored on the device but migrated where needed
  * Makes writing code easier, and will probably run faster due to locality (for the casual programmer)
  * Still evolving

![Unified Memory simplifies things](/files/-MVJM7qMZ-W9tbM_RnD7)

## Review

1. cudaMemcpy
2. Z-C: Device could access memory on the host
3. UVA: Unified virtual space
4. UM: Processors can access each other's memory


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://blog.ruipan.xyz/earlier-readings-and-notes/cs759-hpc-course-notes/lecture-17-gpu-computing-advanced-features.-unified-memory-usage..md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
