Lecture 22: OpenMP Work Sharing

Lecture Summary

  • Last time

    • OpenMP: Tasks, variable scoping, synchronization (barrier & critical constructs)

  • Today

    • Wrap up synchronization

    • OpenMP rules of thumb

    • Parallel computing w/ OpenMP: NUMA aspects & how caches come into play

Synchronization

  • The atomic directive

    • A guarded memory access operation

    • Can only protect a single assignment

    • Applies only to a simple update of a memory location

    • A special case of a critical section, with significantly less overhead, since it can often be implemented with a single atomic hardware instruction

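  A minimal sketch of atomic in action (the counter and loop are our own, not from the slides):

      #include <omp.h>
      #include <stdio.h>

      int main() {
          int hits = 0;
          #pragma omp parallel for
          for (int i = 0; i < 1000; i++) {
              // Guards a single, simple memory update; cheaper than 'critical'
              #pragma omp atomic
              hits++;
          }
          printf("hits = %d\n", hits);  // always 1000
          return 0;
      }
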
  • The reduction construct (see example down below)

    • Each thread engaged in the reduction gets a private local copy of sum

      • Each local copy is initialized to the identity element of the reduction operator. In this case the operator is "+", so the initial value is 0.

    • At the end of the construct, all local copies of sum are combined (here, added together) into the shared "global" variable

    • #pragma omp for reduction(op:list)

      • The variables in list will be shared in the enclosing parallel region

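  A minimal sketch of the reduction pattern just described (the array and its contents are our own):

      #include <stdio.h>

      int main() {
          double a[1000], sum = 0.0;
          for (int i = 0; i < 1000; i++) a[i] = 1.0;

          // Each thread gets a private copy of 'sum' initialized to 0 (the
          // identity for "+"); the private copies are combined into the
          // shared 'sum' when the construct ends.
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < 1000; i++)
              sum += a[i];

          printf("sum = %f\n", sum);  // 1000.0
          return 0;
      }
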
  • The simd directive

    • #pragma omp for simd reduction(+:sum)

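  Continuing the reduction sketch above, the same loop with for simd both distributes iterations across threads and asks the compiler to vectorize each thread's chunk:

      // Drop-in replacement for the loop in the previous sketch
      #pragma omp parallel for simd reduction(+:sum)
      for (int i = 0; i < 1000; i++)
          sum += a[i];
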
Performance Issues

  • Common causes are:

    • Too much sequential code in your app

      • Seek to reduce the amount of execution time where only one thread executes code

    • Too much communication

      • Costly memory operations are difficult to pin down

    • Load imbalance

      • One thread gets too much work, while others idle waiting for it

      • For an OpenMP for loop, one can use schedule(runtime) and pick the policy via the environment (see the sketch after this list)

        • Example: setenv OMP_SCHEDULE "dynamic,5"

    • Synchronization

      • Barriers can be expensive

      • Ways to avoid them:

        • Careful use of the nowait clause (also in the sketch after this list)

        • Parallelize at the outermost level possible

        • Use critical or atomic

        • Use other OpenMP facilities, like the reduction clause

    • Compiler (non-)optimizations

      • Sometimes the addition of parallel directives can prevent the compiler from performing sequential optimizations

      • Symptom: parallel code running with one thread has a longer execution time and a higher instruction count than the sequential code

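  A minimal sketch combining two of the remedies above, schedule(runtime) and nowait (the two loops and their contents are our own; they are independent, so dropping the barrier between them is safe):

      #include <math.h>
      #include <stdio.h>

      #define N 10000

      int main() {
          double a[N], b[N];

          #pragma omp parallel
          {
              // schedule(runtime): the chunking policy is read from
              // OMP_SCHEDULE (e.g., "dynamic,5"), useful when iteration
              // costs are uneven. nowait removes the implicit barrier at
              // the end of this loop.
              #pragma omp for schedule(runtime) nowait
              for (int i = 0; i < N; i++)
                  a[i] = sin(0.001 * i);

              // Independent of the first loop, so threads may start on it
              // early instead of waiting for stragglers
              #pragma omp for schedule(runtime)
              for (int i = 0; i < N; i++)
                  b[i] = cos(0.001 * i);
          }

          printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
          return 0;
      }
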
NUMA

  • Up to this point, we have been using the Symmetric Multi-Processing (SMP) model and we haven't been concerned about the mechanics of shared memory access

  • In today's servers/clusters, a node has multiple CPUs, each with many cores (a multi-socket configuration, as opposed to one chip per motherboard), and not all memory accesses are equal

  • NUMA: Non-uniform memory access

    • Cost of memory access depends on which memory bank stores your data

  • The NUMA factor: the ratio between the longest and shortest average time it takes a thread running on a particular core to reach data in memory

    • A low NUMA factor is desirable (it makes little difference which bank the data is stored in)

    • NUMA factor = 1: an SMP system

    • Accessing memory outside a NUMA node: 20% slowdown for reads, 30% slowdown for writes

  • NUMA aspects where the OS comes into play

    • When a thread mallocs memory, on which NUMA node should the pages be placed? (Commonly decided by a first-touch policy; see the sketch at the end of this section)

    • Affinity: How the runtime/OS assigns a thread to a certain core

      • OMP_PROC_BIND: Allows you to dictate a distribution policy

        • master: Collocate threads with the master thread

        • close: Place threads close to the master in the places list

          • Useful if the code is compute-bound and doesn't make many trips to main memory

          • Reduces synchronization costs (single, barrier, etc.)

        • spread (default): Spread out threads as much as possible

          • Useful if the code is memory-bound, as it improves aggregate memory bandwidth across the system

        • false: Set no binding

        • true: Lock each thread to a core

      • OMP_PLACES: Allows you to control the locations ("places") threads are bound to. OMP_PLACES can assume one of these values:

        • threads: Hardware thread (assuming hyper-threading is on)

        • cores: Core

        • sockets: Node (socket)

        • An explicit place list: Defined by the user, referencing the underlying hardware of the machine

      • An extensive list of examples can be found in the slides

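  A minimal sketch of the first-touch idiom mentioned above (the array, sizes, and environment settings are our own, assuming a Linux-style first-touch page placement policy):

      #include <stdlib.h>
      #include <stdio.h>

      #define N (1 << 24)

      int main() {
          double *x = malloc(N * sizeof *x);

          // First touch: pages are typically placed on the NUMA node of the
          // thread that first writes them. Initializing in parallel, with
          // the same static schedule as the compute loop, keeps each
          // thread's slice in its local memory bank.
          #pragma omp parallel for schedule(static)
          for (long i = 0; i < N; i++)
              x[i] = 0.0;

          // The compute loop touches the same slices, now NUMA-local.
          // Run, e.g., with: OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out
          // to spread threads across sockets for more aggregate bandwidth.
          #pragma omp parallel for schedule(static)
          for (long i = 0; i < N; i++)
              x[i] = 2.0 * x[i] + 1.0;

          printf("x[0] = %f\n", x[0]);
          free(x);
          return 0;
      }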