# Lecture 22: OpenMP Work Sharing

## Lecture Summary

* Last time
  * OpenMP: Tasks, variable scoping, synchronization (barrier & critical constructs)
* Today
  * Wrap up synchronization
  * OpenMP rules of thumb
  * Parallel computing w/ OpenMP: NUMA aspects & how caches come into play

## Synchronization

* The atomic directive
  * A guarded memory access operation
  * Can only protect a single assignment
  * Applies only to a simple update of memory
  * Is a special case of a critical section, with significantly less overhead since it can typically map to a single hardware atomic instruction
* The reduction construct (see example down below)
  * Each thread engaged in the reduction works on a private local copy of the reduction variable (here, sum)
    * Each local copy is initialized to the identity element of the operator in play. Here the operator is "+", so the initial value is 0.
  * At the end, all local copies of sum are combined and stored in the shared ("global") variable
  * \#pragma omp for reduction(op:list)
    * The variables in list will be shared in the enclosing parallel region
* The simd directive
  * Asks the compiler to also vectorize the loop; it composes with work sharing and reduction:
  * \#pragma omp for simd reduction(+:sum)
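
A minimal sketch contrasting `atomic` with `reduction`; the harmonic-sum loop body and the variable names are illustrative assumptions, not from the slides:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    double sum_atomic = 0.0, sum_reduction = 0.0;

    // atomic: guards each single memory update; correct, but every
    // addition to sum_atomic is serialized across threads
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic
        sum_atomic += 1.0 / (i + 1);
    }

    // reduction: each thread accumulates into a private copy of
    // sum_reduction, initialized to 0 (the identity for "+"); the
    // private copies are combined once at the end of the loop.
    // The simd clause additionally asks for vectorization.
    #pragma omp parallel for simd reduction(+:sum_reduction)
    for (int i = 0; i < N; i++) {
        sum_reduction += 1.0 / (i + 1);
    }

    printf("atomic: %f, reduction: %f\n", sum_atomic, sum_reduction);
    return 0;
}
```

Compiled with, e.g., `gcc -fopenmp`, both loops compute the same sum; the reduction version scales better because it avoids per-iteration synchronization.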

![The atomic directive](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MXyPSe7ti8WfVVThKMS%2F-MXyWOilcthtpfSHN-jS%2FScreen%20Shot%202021-04-10%20at%208.22.05%20PM.png?alt=media\&token=a674a3f6-4abf-4ee1-9f05-16222a46e35a)

![The reduction construct](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MXyPSe7ti8WfVVThKMS%2F-MXyVxnynW2rk1-OQdD4%2FScreen%20Shot%202021-04-10%20at%208.20.13%20PM.png?alt=media\&token=0807a3ac-9500-4deb-96e0-fca58e77ed65)

## Performance Issues

* Common causes are:
  * Too much sequential code in your app
    * Seek to reduce the amount of execution time in which only one thread executes code
  * Too much communication
    * Difficult to pin down costly memory operations
  * Load imbalance
    * One thread gets too much work, while others idle waiting for it
    * For an OpenMP for loop, one can use schedule(runtime) and pick the schedule at launch time (see the sketch after this list)
      * Example (csh): `setenv OMP_SCHEDULE "dynamic,5"`
  * Synchronization
    * Barriers can be expensive
    * Avoid them by
      * Careful use of the `nowait` clause
      * Parallelizing at the outermost level possible
      * Using `critical` or `atomic` instead of a barrier
      * Using other OpenMP facilities, like the `reduction` clause
  * Compiler (non-)optimizations
    * Sometimes the addition of parallel directives can prevent the compiler from performing sequential optimization
    * Symptom: parallel code running with 1 thread has a longer execution time and a higher instruction count than the sequential code
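
A small sketch combining two of these remedies: `schedule(runtime)` to tune load balance without recompiling, and `nowait` to skip the implied barrier between two independent loops. The arrays and loop bodies are made up for illustration:

```c
#include <omp.h>
#include <stdio.h>

#define N 100000

double a[N], b[N];

int main(void) {
    #pragma omp parallel
    {
        // schedule(runtime): the schedule is read at run time from
        // OMP_SCHEDULE, e.g. setenv OMP_SCHEDULE "dynamic,5"
        #pragma omp for schedule(runtime) nowait
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        // nowait above removed the implicit barrier: a thread that
        // finishes its chunks of the first loop moves straight into
        // this loop. Safe here only because b does not depend on a.
        #pragma omp for schedule(runtime)
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * i;
    }

    printf("%f %f\n", a[N - 1], b[N - 1]);
    return 0;
}
```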

## NUMA

* Up to this point we have assumed the Symmetric Multi-Processing (SMP) model and have not been concerned with the mechanics of shared memory access
* In today's servers/clusters, a node has several CPUs, each with many cores (a multi-socket configuration, as opposed to one chip per motherboard), and not all memory accesses are equal
* NUMA: Non-uniform memory access
  * Cost of memory access depends on which memory bank stores your data
* The NUMA factor: the ratio between the longest and shortest average time for a thread running on a particular core to reach data in memory
  * A low NUMA factor is desirable (it matters little in which bank the data is stored)
  * NUMA factor = 1: an SMP system
  * Accessing memory outside a thread's own NUMA node: a 20% slowdown for reads, 30% for writes
* NUMA aspects where OS comes into play
  * When a thread mallocs memory, on which NUMA node's memory bank should the pages be placed?
  * Affinity: How the runtime/OS assigns a thread to a certain core
    * OMP\_PROC\_BIND: Allows you to dictate a distribution policy
      * master: Collocate threads with the master thread
      * close: Place threads close to the master in the places list
        * Useful if the code is compute-bound and doesn't make many trips to main memory
        * Reduces synchronization costs (single, barrier, etc.)
      * spread (default): Spread out threads as much as possible
        * Useful if code is memory-bound as it improves aggregate system memory bandwidth
      * false: Set no binding
      * true: Lock threads to cores
    * OMP\_PLACES: Allows you to control locations. OMP\_PLACES can assume one of these values
      * threads: Hardware thread, assuming hyper-threading is on
      * cores: Core
      * sockets: Node (socket)
      * A place list: Defined by user, explicitly referencing the underlying hardware of the machine
    * An extensive list of examples can be found in the slides; a minimal sketch of the environment-variable workflow follows below
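
A minimal sketch for verifying where threads actually land, assuming an OpenMP 4.5+ runtime; the environment-variable values and the executable name are illustrative choices, not requirements:

```c
#include <omp.h>
#include <stdio.h>

// Run with, e.g. (bash syntax; one possible policy among many):
//   export OMP_NUM_THREADS=8
//   export OMP_PROC_BIND=spread
//   export OMP_PLACES=cores
//   ./affinity_check
int main(void) {
    #pragma omp parallel
    {
        // omp_get_place_num() reports the place (here, a core) in
        // which the calling thread executes, so you can confirm the
        // binding policy did what you asked for
        printf("thread %d of %d runs in place %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}
```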

![](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MXyPSe7ti8WfVVThKMS%2F-MXyaETS9HvGPYAu_adg%2FScreen%20Shot%202021-04-10%20at%208.43.15%20PM.png?alt=media\&token=d239130d-7d46-4619-ba04-3dca9b62c657)

![OMP\_PLACES usage. The CPU ids can be found through \`numactl -H\` or \`lscpu\` ](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MXyPSe7ti8WfVVThKMS%2F-MXyeZF1bhxe8BjtQtUh%2FScreen%20Shot%202021-04-10%20at%209.02.08%20PM.png?alt=media\&token=935b2aba-abd0-4b8f-9b41-8ce3acf4624d)
