Lecture 19: Hardware aspects relevant in multi-core, shared memory parallel computing.
- Last time
- GPU computing via thrust & CUB
- Today
- Final project proposal discussion
- Parallel computing on the CPU: Hardware & OpenMP generalities

Opportunities for efficiency gains
- OpenMP targets parallelism on SMP (symmetric multiprocessing, i.e., shared-memory) architectures
- It is handy when
- You have a multi-core processor, say 16 cores/socket (beyond that, overheads tend to yield diminishing returns)
- Might have multiple sockets, say 2
- You have a good amount of system memory, say 64 GB
- Processes and threads are similar in the sense that they are both independent sequences of execution
- OpenMP works with threads, while MPI works with processes
- Threads of the same process run in a shared memory space and share a single set of page tables for address translation. Processes, on the other hand, run in separate memory spaces.
- We want to use OpenMP for both data parallelism and task parallelism (see the sketch after this list)
- Data parallelism: The processing of a large amount of data elements can be done in parallel
- Task parallelism: The execution of a collection of tasks can be performed in parallel
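- A minimal sketch contrasting the two flavors (the vector size and the build_mesh/load_materials stand-ins are made up for illustration):

#include <omp.h>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for two independent pieces of work
void build_mesh()     { std::printf("mesh built by thread %d\n", omp_get_thread_num()); }
void load_materials() { std::printf("materials loaded by thread %d\n", omp_get_thread_num()); }

int main() {
    // Data parallelism: the loop iterations are divided among the threads of the team
    std::vector<double> a(1000, 1.0), b(1000, 2.0), c(1000);
    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i)
        c[i] = a[i] + b[i];

    // Task parallelism: distinct pieces of work run concurrently on different threads
    #pragma omp parallel sections
    {
        #pragma omp section
        build_mesh();

        #pragma omp section
        load_materials();
    }
    return 0;
}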

Hello world for OpenMP
- The OMP parallel region is similar to a CUDA kernel: both are executed by threads
- A major difference
- Variables declared inside a GPU kernel are truly local to each thread, typically stored in registers
- Variables in an OpenMP parallel region may or may not be visible to the other threads executing the code of that region: the scoping (shared vs. private) is tricky (see the sketch below)
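- A small sketch of the scoping issue (the variable names and the thread count of 4 are made up for illustration): a variable declared before the parallel region is shared by default, while a variable declared inside it is private to each thread

#include <omp.h>
#include <cstdio>

int main() {
    int shared_counter = 0;                 // declared outside the region: shared by all threads

    #pragma omp parallel num_threads(4)
    {
        int my_id = omp_get_thread_num();   // declared inside the region: private to each thread

        // All threads see (and would race on) the same shared_counter without protection
        #pragma omp atomic
        shared_counter++;

        std::printf("thread %d incremented the shared counter\n", my_id);
    }

    std::printf("final counter value: %d\n", shared_counter);   // prints 4
    return 0;
}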
- The runtime library routines (e.g., omp_get_thread_num()) are declared in the header: #include <omp.h>
- Most OpenMP constructs are compiler directives; in C/C++, they take the form of pragmas (#pragma omp ...)
- Programming model (fork-join): a master thread spawns a team of threads at the start of a parallel region and joins them back at its end (see the sketch below)
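- A hello-world sketch along these lines; compile with the compiler's OpenMP flag (e.g., -fopenmp for gcc/clang)

#include <omp.h>
#include <cstdio>

int main() {
    // The master thread forks a team of threads at the parallel region...
    #pragma omp parallel
    {
        int id = omp_get_thread_num();      // this thread's index within the team
        int n  = omp_get_num_threads();     // size of the team
        std::printf("Hello from thread %d of %d\n", id, n);
    }
    // ...and the team joins back into the single master thread here
    return 0;
}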
