Lecture 20: Multi-core Parallel Computing with OpenMP. Parallel Regions.

Lecture Summary

  • Last time: OpenMP generalities

  • This time: OpenMP nuts & bolts


  • OpenMP: portable and scalable model for shared memory parallel applications

    • No need to dive deep and work with POSIX pthreads

    • Under the hood, the compiler translates OpenMP functions and directives into calls to a threading library, typically pthreads on POSIX systems

  • Structured block and OpenMP construct are the two sides of the “parallel region” coin

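  • In a structured block, the only "branches" allowed are exit() function calls: control may not jump into or out of the block (no goto, break, or return across its boundary). There is an implicit barrier at the end of each parallel region, where the threads wait on each other.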

Nested Parallelism

  • The nested parallelism behavior can be controlled by using the OpenMP API

  • The single directive identifies a section of the code that must be run by a single thread

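    • The difference between single and master is that in single, the code is executed by whichever thread reaches the region first, whereas in master it is always executed by the master thread (thread 0)

    • Another difference is that single has an implicit barrier upon completion of the region (unless nowait is specified), whereas master has none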

Work Sharing

  • Work sharing is a general term used in OpenMP to describe the distribution of work across threads

  • The three main constructs for automatic work division are:

    • omp for

    • omp sections

    • omp task

omp for

  • A #pragma omp for inside a #pragma omp parallel is equivalent to #pragma omp parallel for

  • Most OpenMP implementations use default block partitioning, where each thread is assigned roughly n/thread_count iterations. This may lead to load imbalance if the work per iteration varies

    • The schedule clause comes to the rescue!

    • Usage example: #pragma omp parallel for schedule(static, 8)

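  • OpenMP will only parallelize for loops that are in canonical form: the number of iterations must be computable before the loop starts (loop-invariant bounds, a fixed increment, no break out of the loop). Loops outside this form are typically rejected by the compiler, and counterintuitive behavior may happen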

  • The collapse clause merges perfectly nested loops into one large loop whose iterations are then divided among the threads

    • For example, if the outer loop has 10 iterations, the inner loop has 10^7 iterations, and we have 32 threads: parallelizing the outer loop is bad (10<32, so at most 10 threads get work), parallelizing the inner loop is good, but we can do better using collapse

omp sections
