Rui's Blog

Lecture 20: Multi-core Parallel Computing with OpenMP. Parallel Regions.

Lecture Summary

  • Last time: OpenMP generalities
  • This time: OpenMP nuts & bolts


Compiler directives examples (the directive goes behind `#pragma omp`)
User-level run time routines
Environment variables. This helps with bypassing the run-time function calls, but using env vars does not allow for dynamic OpenMP behavior. A function call overrides an env var setting, though.
  • OpenMP: portable and scalable model for shared memory parallel applications
    • No need to dive deep and work with POSIX pthreads
    • Under the hood, the compiler translates OpenMPfunctions and directives to pthread calls
  • Structured block and OpenMP construct are the two sides of the “parallel region” coin
  • In a structured block, the only "branches" allowed are exit() function calls. There is an implicit barrier after each structured block where threads wait on each other.

Nested Parallelism

  • The nested parallelism behavior can be controlled by using the OpenMP API
  • The single directive identifies a section of the code that must be run by a single thread
    • The difference between single and master is that in single, the code is executed by whichever thread reaches the region first
    • Another diff is that for single, there is an implicit barrier upon completion of the region

Work Sharing

  • Work sharing is a general term used in OpenMP to describe the distribution of work across threads
  • The three main constructs for automatic work division are:
    • omp for
    • omp sections
    • omp task

omp for

  • A #pragma omp for inside a #pragma omp parallel is equivalent to #pragma omp parallel for
  • Most OpenMP implementations use default block partitioning, where each thread is assigned roughly n/thread_count iterations. This may lead to load imbalance if the work per iteration varies
    • The schedule clause comes to the rescue!
    • Usage example: #pragma omp parallel for schedule(static, 8)
Effects of different schedules, assuming 3 threads
Choosing a schedule
  • OpenMP will only parallelize for loops that are in canonical form. Counterintuitive behavior may happen
  • The collapse clause supports collapsing the embedded loops into one uber loop
    • For example, if the outer loop has 10 iters, the inner loop has 10^7 iters, and we have 32 threads: parallelizing the outer loop is bad (10<32), parallelizing the inner loop is good, but we can do better using collapse

omp sections