11 OpenMP Parallelization Pragmas for C and C++


The PGCC ANSI C and C++ compilers support the OpenMP C/C++ Application Program Interface. The OpenMP shared-memory parallel programming model is defined by a collection of compiler directives or pragmas, library routines and environment variables that can be used to specify shared-memory parallelism in Fortran, C and C++ programs. The OpenMP C/C++ pragmas include a parallel region construct for writing coarse grain SPMD programs, work-sharing constructs which specify that C/C++ for loop iterations should be split among the available threads of execution, and synchronization constructs. The data environment is controlled using clauses on the pragmas or with additional pragmas. Run-time library functions are provided to query the parallel runtime environment, for example to determine how many threads are participating in execution of a parallel region. Finally, environment variables are provided to control the execution behavior of parallel programs. For more information on OpenMP, and a complete copy of the OpenMP C/C++ API specification, see http://www.openmp.org.

11.1 Parallelization Pragmas

Parallelization pragmas are #pragma statements in a C or C++ program that are interpreted by the PGCC C and C++ compilers when the option -mp is specified on the command line. The form of a parallelization pragma is:

#pragma omp pragma_name [clauses]

The pragmas follow the conventions of the C and C++ standards. White space can appear before and after the #. Preprocessing tokens following the #pragma omp are subject to macro replacement. The order in which clauses appear in the parallelization pragmas is not significant. Spaces separate clauses within the pragmas. Clauses on pragmas may be repeated as needed subject to the restrictions listed in the description of each clause.

For the purposes of the OpenMP pragmas, a C/C++ structured block is defined to be a statement or compound statement (a sequence of statements beginning with { and ending with }) that has a single entry and a single exit. No statement or compound statement is a C/C++ structured block if there is a jump into or out of that statement.

The compiler option -mp enables recognition of the parallelization pragmas. The use of this option also implies:

-Mreentrant
local variables are placed on the stack and optimizations that may result in non-reentrant code are disabled (e.g., -Mnoframe)

Also, note that calls to I/O library functions are system-dependent and are not necessarily guaranteed to be thread-safe. I/O library calls within parallel regions should be protected by critical regions (see below) to ensure they function correctly on all systems.

In the examples given with each section, the functions omp_get_num_threads() and omp_get_thread_num() are used (refer to section 11.15, Runtime Library Routines). They return the number of threads currently in the team executing the parallel region and the thread number within the team, respectively.

11.2 omp parallel

Syntax:

#pragma omp parallel [clauses]
< C/C++ structured block >

Clauses:

private(list)
shared(list)
default(shared | none)
firstprivate(list)
reduction(operator: list)
copyin (list)
if (scalar_expression)

This pragma declares a region of parallel execution. It directs the compiler to create an executable in which the statements within the following C/C++ structured block are executed by multiple lightweight threads. The code that lies within the structured block is called a parallel region.

The OpenMP parallelization pragmas support a fork/join execution model in which a single thread executes all statements until a parallel region is encountered. At the entrance to the parallel region, a system-dependent number of symmetric parallel threads begin executing all statements in the parallel region redundantly. These threads share work by means of work-sharing constructs such as parallel for loops (see below). The number of threads in the team is controlled by the OMP_NUM_THREADS environment variable. If OMP_NUM_THREADS is not defined, the program will execute parallel regions using only one processor. Branching into or out of a parallel region is not supported.

All other shared-memory parallelization pragmas must occur within the scope of a parallel region. Nested omp parallel pragmas are not supported and are ignored. There is an implicit barrier at the end of a parallel region. When all threads have completed execution of the parallel region, a single thread resumes execution of the statements that follow.

It should be emphasized that by default there is no work distribution in a parallel region. Each active thread executes the entire region redundantly until it encounters a directive that specifies work distribution. For work distribution, see the omp for pragma.

Example:

#include <stdio.h>
#include <omp.h>

int main()
{
    int a[2] = {-1, -1};          /* assumes at most two threads */
#pragma omp parallel
    {
        a[omp_get_thread_num()] = omp_get_thread_num();
    }
    printf("a[0] = %d, a[1] = %d\n", a[0], a[1]);
    return 0;
}

The variables specified in a private list are private to each thread in a team. In effect, the compiler creates a separate copy of each of these variables for each thread in the team. When an assignment to a private variable occurs, each thread assigns to its local copy of the variable. When operations involving a private variable occur, each thread performs the operations using its local copy of the variable. Other important points to note about private variables are the following:

  • Variables declared private in a parallel region are undefined upon entry to the parallel region. If the first use of a private variable within the parallel region is in a right-hand-side expression, the results of the expression will be undefined (i.e. this is probably a coding error).
  • Likewise, variables declared private in a parallel region are undefined when serial execution resumes at the end of the parallel region.

The variables specified in a shared list are shared between all threads in a team, meaning that all threads access the same storage area for shared data.

The default clause allows the user to specify the default attribute for variables in the lexical extent of the parallel region. Individual clauses specifying private, shared, etc status override the declared default. Specifying default(none) declares that there is no implicit default, and in this case each variable in the parallel region must be explicitly listed with an attribute of private, shared, firstprivate, or reduction.

Variables that appear in the list of a firstprivate clause are subject to the same semantics as private variables, but in addition are initialized from the original object existing prior to entering the parallel region. Variables that appear in the list of a reduction clause must be shared. A private copy of each variable in list is created for each thread as if the private clause had been specified. Each private copy is initialized according to the operator as specified in table 11-1:

Table 11-1 Initialization of reduction Variables

OPERATOR    INITIALIZATION
+           0
*           1
-           0
&           ~0
|           0
^           0
&&          1
||          0


At the end of the parallel region, a reduction is performed on the instances of variables appearing in list using operator as specified in the reduction clause. The initial value of each reduction variable is included in the reduction operation. If the operator: portion of the reduction clause is omitted, the default reduction operator is "+" (addition).
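
As an illustration (a sketch that is not part of the original manual; the variable names are chosen for this example), the following program sums the elements of an array using a reduction clause on a parallel region:

#include <stdio.h>

int main()
{
    int i, sum = 0;
    int x[100];

    for (i = 0; i < 100; i++)
        x[i] = i + 1;
#pragma omp parallel reduction(+: sum)
    {
        /* Each thread works on a private copy of sum, initialized to 0
           as listed in table 11-1. */
#pragma omp for
        for (i = 0; i < 100; i++)
            sum += x[i];
    }
    /* The private copies and the original value of sum are combined with +. */
    printf("sum = %d\n", sum);    /* prints 5050 */
    return 0;
}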

The copyin clause applies only to threadprivate variables. In the presence of the copyin clause, data from the master thread's copy of the threadprivate variable is copied to the thread private copies upon entry to the parallel region.

In the presence of an if clause, the parallel region will be executed in parallel only if the corresponding scalar_expression evaluates to a non-zero value. Otherwise, the code within the region will be executed by a single processor regardless of the value of the environment variable OMP_NUM_THREADS.
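
For example, the following sketch (the function name scale and the threshold 1000 are illustrative, not from the original text) executes the region in parallel only when the trip count is large enough to be worthwhile:

void scale(float *a, int n)
{
    int i;
    /* When n <= 1000 the region is executed by a single thread. */
#pragma omp parallel if (n > 1000)
    {
#pragma omp for
        for (i = 0; i < n; i++)
            a[i] = 2.0f * a[i];
    }
}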

11.3 omp critical

Syntax:

#pragma omp critical [(name)]
< C/C++ structured block >

Within a parallel region, there may exist subregions of code that will not execute properly when executed by multiple threads simultaneously. This is often due to a shared variable that is written and then read again.

The omp critical pragma defines a subsection of code within a parallel region, referred to as a critical section, which will be executed one thread at a time. The first thread to arrive at a critical section will be the first to execute the code within the section. The second thread to arrive will not begin execution of statements in the critical section until the first thread has exited the critical section. Likewise each of the remaining threads will wait its turn to execute the statements in the critical section.

An optional name may be used to identify the critical region. Names used to identify critical regions have external linkage and are in a name space separate from the name spaces used by labels, tags, members and ordinary identifiers.

Critical sections cannot be nested, and any such specifications are ignored. Branching into or out of a critical section is illegal.

Example:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int a[100][100], mx = -1, lmx = -1, i, j;

    for (j = 0; j < 100; j++)
        for (i = 0; i < 100; i++)
            a[i][j] = 1 + (int)(10.0*rand()/(RAND_MAX+1.0));
#pragma omp parallel private(i) firstprivate(lmx)
    {
#pragma omp for
        for (j = 0; j < 100; j++)
            for (i = 0; i < 100; i++)
                lmx = (lmx > a[i][j]) ? lmx : a[i][j];
#pragma omp critical
        mx = (mx > lmx) ? mx : lmx;
    }
    printf("max value of a is %d\n", mx);
    return 0;
}

11.4 omp master

Syntax:

#pragma omp master
< C/C++ structured block >

In a parallel region of code, there may be a sub-region of code that should execute only on the master thread. Instead of ending the parallel region before this subregion, and then starting it up again after this subregion, the omp master pragma allows the user to conveniently designate code that executes on the master thread and is skipped by the other threads. There is no implied barrier on entry to or exit from a master section. Nested master sections are ignored. Branching into or out of a master section is not supported.

Example:

#include <stdio.h>
#include <omp.h>

int main()
{
    int a[2] = {-1, -1};          /* assumes at most two threads */
#pragma omp parallel
    {
        a[omp_get_thread_num()] = omp_get_thread_num();
#pragma omp master
        printf("YOU SHOULD ONLY SEE THIS ONCE\n");
    }
    printf("a[0]=%d, a[1]=%d\n", a[0], a[1]);
    return 0;
}

11.5 omp single

Syntax:

#pragma omp single [Clauses]
< C/C++ structured block >

Clauses:

private(list)
firstprivate(list)
nowait

In a parallel region of code, there may be a subregion of code that will only execute correctly on a single thread. Instead of ending the parallel region before this subregion, and then starting it up again after this subregion, the omp single pragma allows the user to conveniently designate code that executes on a single thread and is skipped by the other threads. There is an implied barrier on exit from a single process section unless the optional nowait clause is specified.

Nested single process sections are ignored. Branching into or out of a single process section is not supported. The private and firstprivate clauses are as described in section 11.2.
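
A minimal sketch (not taken from the original manual) of a typical use of omp single, in which one thread prints the team size while the others wait at the implied barrier:

#include <stdio.h>
#include <omp.h>

int main()
{
    int nthreads = 0;
#pragma omp parallel
    {
#pragma omp single
        {
            /* Executed by exactly one thread. */
            nthreads = omp_get_num_threads();
            printf("team size is %d\n", nthreads);
        }
        /* All threads see the updated nthreads here because of the
           implied barrier at the end of the single section. */
    }
    return 0;
}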

11.6 omp for

Syntax:

#pragma omp for [Clauses]
< C/C++ for loop to be executed in parallel >

Clauses:

private(list)
firstprivate(list)
lastprivate(list)
reduction(operator: list)
schedule (kind[, chunk])
ordered
nowait

The real purpose of supporting parallel execution is the distribution of work across the available threads. The user can explicitly manage work distribution with constructs such as:

if (omp_get_thread_num() == 0) {
...
}
else if (omp_get_thread_num() == 1) {
...
}

However, these constructs are not in the form of pragmas. The omp for pragma provides a convenient mechanism for the distribution of loop iterations across the available threads in a parallel region.

Variables declared in a private list are treated as private to each processor participating in parallel execution of the loop, meaning that a separate copy of the variable exists on each processor. Variables declared in a firstprivate list are private, and in addition are initialized from the original object existing before the construct. Variables declared in a lastprivate list are private, and in addition the thread that executes the sequentially last iteration updates the version of the object that existed before the construct. The reduction clause is as described in section 11.2. The schedule clause is explained below. If ordered code blocks are contained in the dynamic extent of the for directive, the ordered clause must be present. See section 11.11 for more information on ordered code blocks.

The omp for pragma directs the compiler to distribute the iterative for loop immediately following across the threads available to the program. The for loop is executed in parallel by the team that was started by an enclosing parallel region. Branching into or out of an omp for loop is not supported, and omp for pragmas may not be nested.

By default, there is an implicit barrier after the end of the parallel loop; the first thread to complete its portion of the work will wait until the other threads have finished their portion of work. If nowait is specified, the threads will not synchronize at the end of the parallel loop.

Other items to note about omp for loops:

  • The for loop index variable is always private and must be a signed integer
  • omp for loops must be executed by all threads participating in the parallel region or none at all.
  • The for loop must be a structured block and its execution must not be terminated by break
  • Values of the loop control expressions and the chunk expressions must be the same for all threads executing the loop

Example:

#include <stdio.h>
#include <math.h>

int main()
{
    float a[1000], b[1000];
    int i;

    for (i = 0; i < 1000; i++)
        b[i] = i;
#pragma omp parallel
    {
#pragma omp for
        for (i = 0; i < 1000; i++)
            a[i] = sqrt(b[i]);
        ...
    }
    ...
}

The schedule clause specifies how iterations of the for loop are divided up between processors. Given a schedule (kind[, chunk]) clause, kind can be static, dynamic, guided, or runtime. These are defined as follows:

When schedule (static, chunk) is specified, iterations are allocated in contiguous blocks of size chunk. The blocks of iterations are statically assigned to threads in a round-robin fashion in order of the thread ID numbers. The chunk must be a scalar integer expression. If chunk is not specified, a default chunk size is chosen equal to:

(number_of_iterations + omp_get_num_threads() - 1) / omp_get_num_threads()

When schedule (dynamic, chunk) is specified, iterations are allocated in contiguous blocks of size chunk. As each thread finishes a piece of the iteration space, it dynamically obtains the next set of iterations. The chunk must be a scalar integer expression. If no chunk is specified, a default chunk size is chosen equal to 1.

When schedule (guided, chunk) is specified, the chunk size is reduced in an exponentially decreasing manner with each dispatched piece of the iteration space. chunk specifies the minimum number of iterations to dispatch each time, except when fewer than chunk iterations remain, at which point all remaining iterations are assigned. If no chunk is specified, a default chunk size is chosen equal to 1.

When schedule (runtime) is specified, the decision regarding iteration scheduling is deferred until runtime. The schedule type and chunk size can be chosen at runtime by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the resulting schedule is equivalent to schedule(static).
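
The following sketch (illustrative only; the chunk sizes are arbitrary) shows how the schedule clause is written; with schedule(runtime), the policy is taken from the OMP_SCHEDULE environment variable described in section 11.16:

#include <math.h>

void schedules(double *a, double *b, int n)
{
    int i;

    /* Contiguous blocks of 8 iterations assigned round-robin to the threads. */
#pragma omp parallel for schedule(static, 8)
    for (i = 0; i < n; i++)
        a[i] = sqrt(b[i]);

    /* Blocks of 4 iterations handed out on demand as threads become free. */
#pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        a[i] = a[i] * b[i];

    /* Schedule chosen at run time from OMP_SCHEDULE. */
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        b[i] = a[i] + b[i];
}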

11.7 omp barrier

Syntax:

#pragma omp barrier

There may be occasions in a parallel region when it is necessary that all threads complete work to that point before any thread is allowed to continue. The omp barrier pragma synchronizes all threads at such a point in a program. Multiple barrier points are allowed within a parallel region. The omp barrier pragma must either be executed by all threads executing the parallel region or by none of them.
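
An illustrative sketch (not from the original text) in which a barrier guarantees that the array a is completely written before any thread reads it:

#include <stdio.h>
#include <omp.h>

int main()
{
    int a[2] = {0, 0}, b[2] = {0, 0};
#pragma omp parallel
    {
        int me = omp_get_thread_num();
        if (me < 2)
            a[me] = me + 1;       /* phase 1: fill a */
#pragma omp barrier
        /* No thread proceeds past this point until every thread has
           finished phase 1. */
        if (me < 2)
            b[me] = a[1 - me];    /* phase 2: safe to read the other element */
    }
    printf("b[0]=%d, b[1]=%d\n", b[0], b[1]);
    return 0;
}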

11.8 omp parallel for

The omp parallel for pragma is supported using the following syntax.

Syntax:

#pragma omp parallel for [clauses]
< C/C++ for loop to be executed in parallel >

Clauses:

private(list)
shared(list)
default(shared | none)
firstprivate(list)
lastprivate(list)
reduction(operator: list)
copyin (list)
if (scalar_expression)
ordered
schedule (kind[, chunk])

The semantics of the omp parallel for pragma are identical to those of a parallel region containing only a single parallel for loop and pragma. The available clauses are as defined in sections 11.2 and 11.6.
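
For example, the omp for example of section 11.6 can be written more compactly as a single omp parallel for construct (a sketch, equivalent in behavior):

#include <math.h>

int main()
{
    float a[1000], b[1000];
    int i;

    for (i = 0; i < 1000; i++)
        b[i] = i;
    /* Starts the team and distributes the loop iterations in one construct. */
#pragma omp parallel for
    for (i = 0; i < 1000; i++)
        a[i] = sqrt(b[i]);
    return 0;
}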

11.9 omp sections

The omp sections pragma is supported using the following syntax:

Syntax:

#pragma omp sections [clauses]
{
[#pragma omp section]
< C/C++ structured block executed by processor i >
[#pragma omp section]
< C/C++ structured block executed by processor j >
...
}

Clauses:

private (list)
firstprivate (list)
lastprivate (list)
reduction(operator: list)
nowait

The omp sections pragma defines a non-iterative work-sharing construct within a parallel region. Each section is executed by a single thread. If there are more threads than sections, some threads will have no work and will jump to the implied barrier at the end of the construct. If there are more sections than threads, one or more threads will execute more than one section.

An omp section pragma may only appear within the lexical extent of the enclosing omp sections pragma. In addition, the code within the omp sections pragma must be a structured block, and the code in each omp section must be a structured block.

The available clauses are as defined in section 11.2 and 11.6.
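
An illustrative sketch (not part of the original manual) in which two independent assignments are placed in separate sections so that different threads may execute them:

#include <stdio.h>

int main()
{
    int x = 0, y = 0;
#pragma omp parallel
    {
#pragma omp sections
        {
#pragma omp section
            x = 10;               /* executed by one thread */
#pragma omp section
            y = 20;               /* possibly executed by a different thread */
        }
        /* Implied barrier: both x and y are set before any thread continues. */
    }
    printf("x=%d, y=%d\n", x, y);
    return 0;
}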

11.10 omp parallel sections

The omp parallel sections pragma is supported using the following syntax:

Syntax:

#pragma omp parallel sections [clauses]
{
[#pragma omp section]
< C/C++ structured block executed by processor i >
[#pragma omp section]
< C/C++ structured block executed by processor j >
...
}

Clauses:

private(list)
shared(list)
default(shared | none)
firstprivate(list)
lastprivate (list)
reduction(operator: list)
copyin (list)
if (scalar_expression)
nowait

The omp parallel sections pragma defines a non-iterative work-sharing construct without the need to define an enclosing parallel region. Semantics are identical to a parallel region containing only an omp sections pragma and the associated structured block.

11.11 ordered

The OpenMP ordered pragma is supported using the following syntax:

Syntax:

#pragma omp ordered
< C/C++ structured block >

The ordered pragma can appear only in the dynamic extent of a for or parallel for pragma that includes the ordered clause. The structured code block appearing after the ordered pragma is executed by only one thread at a time, and in the order of the loop iterations. This sequentializes the ordered code block while allowing parallel execution of statements outside the code block. The following additional restrictions apply to the ordered pragma:

  • The ordered code block must be a structured block. It is illegal to branch into or out of the block.
  • A given iteration of a loop to which a for pragma applies cannot execute the same ordered pragma more than once, and cannot execute more than one ordered pragma.
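
A short sketch (illustrative) of an ordered code block: the loop bodies may execute in parallel, but the printf calls occur in iteration order:

#include <stdio.h>

int main()
{
    int i;
#pragma omp parallel for ordered
    for (i = 0; i < 8; i++) {
        int sq = i * i;           /* may execute out of order */
#pragma omp ordered
        printf("%d squared is %d\n", i, sq);   /* printed in loop order */
    }
    return 0;
}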

11.12 omp atomic

The omp atomic pragma is supported using the following syntax:

Syntax:

#pragma omp atomic
< C/C++ expression statement >

The omp atomic pragma is semantically equivalent to subjecting the following single C/C++ expression statement to an omp critical pragma. The expression statement must be of one of the following forms:

  • x <binary_operator>= expr
  • x++
  • ++x
  • x--
  • --x

where x is a scalar variable of intrinsic type, expr is a scalar expression that does not reference x, <binary_operator> is not overloaded and is one of +, *, -, /, &, ^, |, << or >>.
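
For example (a sketch, not from the original text), a shared counter can be updated atomically rather than inside a critical section:

#include <stdio.h>

int main()
{
    int i, count = 0;
#pragma omp parallel for
    for (i = 0; i < 1000; i++) {
#pragma omp atomic
        count += 1;               /* of the form x <binary_operator>= expr */
    }
    printf("count = %d\n", count);    /* always 1000 */
    return 0;
}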

11.13 omp flush

The omp flush pragma is supported using the following syntax:

Syntax:

#pragma omp flush [(list)]

The omp flush pragma ensures that all processor-visible data items, or only those specified in list when it's present, are written back to memory at the point at which the directive appears.
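
The following sketch (illustrative and simplified) uses flush in a producer/consumer handshake so that the shared data item becomes visible before the flag that announces it:

#include <stdio.h>

int main()
{
    int data = 0, flag = 0;
#pragma omp parallel sections shared(data, flag)
    {
#pragma omp section
        {                         /* producer */
            data = 42;
#pragma omp flush(data)
            flag = 1;
#pragma omp flush(flag)
        }
#pragma omp section
        {                         /* consumer */
            while (1) {
#pragma omp flush(flag)
                if (flag) break;  /* spin until the producer's flag is visible */
            }
#pragma omp flush(data)
            printf("data = %d\n", data);
        }
    }
    return 0;
}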

11.14 omp threadprivate

The omp threadprivate pragma is supported using the following syntax:

Syntax:

#pragma omp threadprivate (list)

Where list is a list of variables to be made private to each thread but global within the thread. This pragma must appear in the declarations section of a program unit after the declaration of any variables listed. On entry to a parallel region, data in a threadprivate variable is undefined unless copyin is specified on the omp parallel pragma. When a variable appears in an omp threadprivate pragma, each thread's copy is initialized once at an unspecified point prior to its first use as the master copy would be initialized in a serial execution of the program.

The following restrictions apply to the omp threadprivate pragma:

  • The omp threadprivate pragma must appear after the declaration of every threadprivate variable included in list.
  • It is illegal for an omp threadprivate variable to appear in any clause other than a copyin, schedule or if clause.
  • If a variable is specified in an omp threadprivate pragma in one translation unit, it must be specified in an omp threadprivate pragma in every translation unit in which it appears.
  • The address of an omp threadprivate variable is not an address constant.
  • An omp threadprivate variable must not have an incomplete type or a reference type.
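
An illustrative sketch (not from the original manual; the variable name counter is hypothetical) combining omp threadprivate with the copyin clause of section 11.2:

#include <stdio.h>
#include <omp.h>

/* File-scope variable made private to each thread by the threadprivate pragma. */
int counter = 10;
#pragma omp threadprivate(counter)

int main()
{
#pragma omp parallel copyin(counter)
    {
        /* copyin initializes every thread's copy from the master's value (10);
           each thread then modifies only its own copy. */
        counter += omp_get_thread_num();
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}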

11.15 Run-time Library Routines

User-callable functions are available to the OpenMP C/C++ programmer to query and alter the parallel execution environment. Any program unit that invokes these functions should include the statement #include <omp.h>. The omp.h include file contains definitions for each of the C/C++ library routines and two required type definitions.

#include <omp.h>
int omp_get_num_threads(void);

returns the number of threads in the team executing the parallel region from which it is called. When called from a serial region, this function returns 1. Because nested parallel regions are not supported, a nested parallel region behaves the same as a single parallel region. By default, the value returned by this function is equal to the value of the environment variable OMP_NUM_THREADS or to the value set by the most recent call to omp_set_num_threads(), defined below.

#include <omp.h>
void omp_set_num_threads(int num_threads);

sets the number of threads to use for the next parallel region. This function can only be called from a serial region of code. If it is called from within a parallel region, or within a function that is called from within a parallel region, the results are undefined. This function has precedence over the OMP_NUM_THREADS environment variable.
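
A brief sketch of the interaction between omp_set_num_threads() and a subsequent parallel region:

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(4);       /* takes precedence over OMP_NUM_THREADS */
#pragma omp parallel
    {
#pragma omp master
        printf("team size is %d\n", omp_get_num_threads());
    }
    return 0;
}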

#include <omp.h>
int omp_get_thread_num(void);

returns the thread number within the team. The thread number lies between 0 and omp_get_num_threads()-1. When called from a serial region, this function returns 0. Because nested parallel regions are not supported, a nested parallel region behaves the same as a single parallel region.

#include <omp.h>
int omp_get_max_threads(void);

returns the maximum value that can be returned by calls to omp_get_num_threads(). If omp_set_num_threads() is used to change the number of threads, subsequent calls to omp_get_max_threads() will return the new value. This function returns the maximum value whether executing from a parallel or serial region of code.

#include <omp.h>
int omp_get_num_procs(void);

returns the number of processors that are available to the program.

#include <omp.h>
int omp_in_parallel(void);

returns non-zero if called from within a parallel region and zero if called outside of a parallel region. When called from within a parallel region that is serialized, for example in the presence of an if clause evaluating to zero, the function will return zero.

#include <omp.h>
void omp_set_dynamic(int dynamic_threads);

is designed to allow automatic dynamic adjustment of the number of threads used for execution of parallel regions. This function is recognized, but currently has no effect.

#include <omp.h>
int omp_get_dynamic(void);

is designed to allow the user to query whether automatic dynamic adjustment of the number of threads used for execution of parallel regions is enabled. This function is recognized, but currently always returns zero.

#include <omp.h>
void omp_set_nested(int nested);

is designed to allow enabling/disabling of nested parallel regions. This function is recognized, but currently has no effect.

#include <omp.h>
int omp_get_nested(void);

is designed to allow the user to query whether nested parallel regions are enabled. This function is recognized, but currently always returns zero.

#include <omp.h>
void omp_init_lock(omp_lock_t *lock);
void omp_init_nest_lock(omp_nest_lock_t *lock);

initializes a lock associated with the variable lock for use in subsequent calls to the lock routines. The initial state of the lock is unlocked. It is illegal to make a call to this routine if lock is already associated with a lock.

#include <omp.h>
void omp_destroy_lock(omp_lock_t *lock);
void omp_destroy_nest_lock(omp_nest_lock_t *lock);

disassociates a lock associated with the variable lock.

#include <omp.h>
void omp_set_lock(omp_lock_t *lock);
void omp_set_nest_lock(omp_nest_lock_t *lock);

causes the calling thread to wait until the specified lock is available. The thread gains ownership of the lock when it is available. It is illegal to make a call to this routine if lock has not been associated with a lock.

#include <omp.h>
void omp_unset_lock(omp_lock_t *lock);
void omp_unset_nest_lock(omp_nest_lock_t *lock);

causes the calling thread to release ownership of the lock associated with lock. It is illegal to make a call to this routine if lock has not been associated with a lock.

#include <omp.h>
int omp_test_lock(omp_lock_t *lock);
int omp_test_nest_lock(omp_nest_lock_t *lock);

causes the calling thread to try to gain ownership of the lock associated with lock. The function returns non-zero if the thread gains ownership of the lock, and zero otherwise. It is illegal to make a call to this routine if lock has not been associated with a lock.
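
A sketch (illustrative) of the typical life cycle of a simple, non-nestable lock used to guard updates to a shared counter:

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_lock_t lck;
    int i, count = 0;

    omp_init_lock(&lck);          /* associate lck with a lock; initially unlocked */
#pragma omp parallel for
    for (i = 0; i < 100; i++) {
        omp_set_lock(&lck);       /* wait for, then acquire, the lock */
        count++;                  /* only one thread at a time executes this */
        omp_unset_lock(&lck);     /* release ownership */
    }
    omp_destroy_lock(&lck);       /* disassociate lck from the lock */
    printf("count = %d\n", count);
    return 0;
}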

11.16 Environment Variables

OMP_NUM_THREADS - specifies the number of threads to use during execution of parallel regions. The default value for this variable is 1. For historical reasons, the environment variable NCPUS is supported with the same functionality. In the event that both OMP_NUM_THREADS and NCPUS are defined, the value of OMP_NUM_THREADS takes precedence.



Note

OMP_NUM_THREADS threads will be used to execute the program regardless of the number of physical processors available in the system. As a result, you can run programs using more threads than physical processors and they will execute correctly. However, the performance of programs executed in this manner can be unpredictable and is often inefficient.

OMP_SCHEDULE - specifies the type of iteration scheduling to use for omp for and omp parallel for loops which include the schedule(runtime) clause. The default value for this variable is "static". If the optional chunk size is not set, a chunk size of 1 is assumed except in the case of a static schedule. For a static schedule, the default is as defined in section 11.6. Examples of the use of OMP_SCHEDULE are as follows:

$ setenv OMP_SCHEDULE "static, 5"
$ setenv OMP_SCHEDULE "guided, 8"
$ setenv OMP_SCHEDULE "dynamic"

OMP_DYNAMIC - currently has no effect.

OMP_NESTED - currently has no effect.

MPSTKZ - increases the size of the stacks used by threads executing in parallel regions. Use it with programs that utilize large amounts of thread-local storage in the form of private variables, or local variables in functions or subroutines called within parallel regions. The value should be an integer <n> concatenated with M or m to specify stack sizes of n megabytes. For example:

$ setenv MPSTKZ 8M

