You are here: Home → Manuals on-line → PGI Compiler → pgiws_ug → PGI Workstation User's Guide - 3 Optimization Features

Personal tools

Document Actions

PGI Workstation User's Guide - 3 Optimization Features

<< " border=0>

> " border=0>

3 Optimization Features

3.1 Using Optimization Features
3.2 Types of Optimization
- 3.2.1 Local Optimization
- 3.2.2 Global Optimization
3.3 Loop Unrolling
3.4 Vectorization
3.5 High-Level Transformations
3.6 Parallelization Transformations

The preceding chapter briefly described optimization options to the PGI compiler commands and showed how to invoke optimization from the command line. This chapter contains two parts: the first part describes additional optimization options, and the second part describes the types of optimization that the compilers perform for each optimization level.

Before reading this chapter, you should be familiar with the topics described in the preceding chapter. In particular, you should be somewhat familiar with the following:

Local optimization
Global optimization
Loop unrolling
Vectorization
Parallelization

3.1 Using Optimization Features

Chapter 2, Optimization and Parallelization, described the use of the -O option to control local and global optimization. This section describes several -Mpgflag options that invoke additional optimization features. If you use these options you need to be aware that they will only improve your code and provide execution speed ups if your code meets certain requirements (as specified in the accompanying descriptions). Furthermore, misuse of some these options may produce erroneous code and unpredictable results.

3.1.1 Using the -Mvect Vectorization Option

This section describes the options that control the operation of the vectorizer. Before reading this section you should read the description of the vectorizer in the previous chapter. For details concerning the types of transformations the vectorizer makes, refer to Section 3.4, Vectorization later in this chapter.

The vectorizer scans code searching for loops that are candidates for vectorizing transformations and transforms these loops. When the vectorizer finds vectorizing opportunities it rearranges or replaces sections of loops. In addition to performing these loop transformations, the vectorizer recognizes common constructs that can be replaced with calls to fast hand-coded library routines and produces extensive data dependence information.

The -Mvect option invokes the vectorizer. The vectorizer can speed up execution if the source code contains countable loops and the loops operate on large REAL, REAL*4, REAL*8, INTEGER*4, COMPLEX or COMPLEX DOUBLE arrays in Fortran and their C/C++ counterparts.

The vectorizer performs various operations that can be controlled by arguments to the -Mvect command line option. The following sections describe these arguments that affect the operation of the vectorizer. In addition, these vectorizer operations can be controlled from within code using directives and pragmas. For details on the use of directives and pragmas, refer to Chapter 9, Optimization Directives and Pragmas.

The vectorizer performs the following operations:

Inner-loop, and outer-loop distribution
Loop interchange
Memory-hierarchy (cache tiling) optimizations
Generation of Pentium III SSE instructions and Pentium III or Athlon prefetch instructions

Default Vectorizer Options

By default, -Mvect without any options is equivalent to:

-Mvect=assoc,cachesize:262144

This enables the options for nested loop transformation and various other vectorizer options. These defaults may vary depending on the target system. The following sections describe the vectorizer options

Nested Loop Transformation Options

The vectorizer performs high-level loop transformations on countable loops. A loop is countable if the number of iterations is set only before loop execution and cannot be modified during loop execution. These transformations, which include loop distribution, loop splitting, loop interchange, and cache tiling allow the resulting loop to be optimized more completely, and often result in more effective use of machine resources, such as registers. Refer to section 3.4, Vectorization, for examples of nested loop transformations.

Assoc Option

The option -Mvect=assoc instructs the vectorizer to perform associativity conversions that can change the results of a computation due to roundoff error (-Mvect=noassoc disables this option). For example, a typical optimization is to change one arithmetic operation to another arithmetic operation that is mathematically correct, but can be computationally different and generate faster code. This option is provided to enable or disable this transformation, since roundoff error for such associativity conversions may produce unacceptable results.

Cachesize Option

The option -Mvect=cachesize:n instructs the vectorizer to tile nested loop operations assuming a data cache size of n bytes. By default, the vectorizer will attempt to tile nested loop operations, such as matrix multiply, using multi-dimensional strip-mining techniques to maximize re-use of items in the data cache.

SSE Option

The option -Mvect=sse instructs the vectorizer to automatically generate Pentium III SSE (streaming SIMD extensions) and prefetch instructions when vectorizable loops are encountered. Pentium III SSE instructions operate only on 32-bit floating-point data, and hence apply only to vectorizable loops that operate on 32-bit floating-point data. Pentium III prefetch instructions can be used to improve the performance of vectorizable loops that operate on either 32-bit or 64-bit floating-point data.

Note

Programs units compiled with -Mvect=sse will not execute on Pentium, Pentium Pro, Pentium II or AMD Athlon processors. They will only execute correctly on Pentium III systems running an SSE-enabled operating system.

Prefetch Option

The option -Mvect=prefetch instructs the vectorizer to automatically generate Pentium III or AMD Athlon prefetch instructions when vectorizable loops are encountered.

NOTE

Program units compiled with -Mvect=prefetch will not execute correctly on Pentium, Pentium Pro or Pentium II processors. They will execute correctly only on Pentium III or Athlon systems. In addition, Pentium III and Athlon prefetch instructions are not compatible. Program units that use Pentium III prefetch instructions will not execute correctly on Athlon systems, and program units that use Athlon prefetch instructions will not execute correctly on Pentium III systems.

3.1.2 Using the -Mconcur Auto-parallelization Option

The driver switch -Mconcur specifies that automatic parallelization should occur, with block distribution as the default. The syntax of the -Mconcur switch is:

-Mconcur[=option[,option]]

where option is:

altcode:n: Generate alternate scalar code for parallelized loops. If altcode is specified without arguments, the parallelizer determines an appropriate cutoff length and generates scalar code to be executed whenever the loop count is less than or equal to that length. If altcode:n is specified, the scalar altcode is executed whenever the loop count is less than or equal to n.
noaltcode: Do not generate alternate scalar code. The parallelized version of the loop is always executed regardless of the loop count.
dist:block: Parallelize with block distribution. This is the default.
dist:cyclic: Parallelize with cyclic distribution. The outermost parallelizable loop in any loop nest is parallelized. If a parallelized loop is innermost, its iterations are allocated to processors cyclically. That is, processor 0 performs iterations 0, 3, 6, etc.; processor 1 performs iterations 1, 4, 7, etc.; and processor 2 performs iterations 2, 5, 8, etc.
cncall: Allow subroutine/function calls in parallel loops. Loops containing calls are candidates for parallelization. Also, no minimum loop count threshold must be satisfied before parallelization will occur, and last values of scalars are assumed to be safe.
noassoc: Disable parallelization of loops with reductions.

When linking, the -Mconcur switch must be specified.

The environment variable NCPUSis checked at runtime for a parallel program. If NCPUS is set to 1, a parallel program runs serially, but will use the parallel routines generated during compilation. If NCPUS is set to a value greater than 1, the specified number of processors will be used to execute the program. Setting NCPUS to a value exceeding the number of physical processors can produce inefficient execution. Executing a program on multiple processors in an environment where some of the processors are being time-shared with another executing job can also result in inefficient execution.

3.1.3 Determining Why a Loop is Not Vectorized or Parallelized

If the -Mvect switch is set and a loop is not vectorized or tiled, the compiler has determined that some condition necessary for vectorization or tiling has not been met. The compiler produces information on reasons why a loop is not vectorized/tiled with the -Mneginfo=loop option.

If the -Mconcur switch is set and a loop is not parallelized, the compiler has determined that some condition necessary for parallelization has not been met. The compiler produces information on reasons why a loop is not parallelized with the -Mneginfo=concur option.

3.1.4 Data Dependences and Safe Pointers

This section describes the pgcc and pgCC data dependence options -Mnodepchk and -Msafeptr. These options provide information about data dependences to the compiler. The -Mnodepchk option specifies that when data dependences cannot be computed, they are assumed not to exist. The -Msafeptr option provides information about data dependence between pointers and other variables. This may improve performance for C functions using local, dummy argument, static or global pointers that are known never to conflict with other data items.

Warning

When used improperly the -Msafeptr or the -Mnodepchk options may produce incorrect results. Exercise extreme caution when using these options.

Safe Pointers

The -Msafeptr option accepts the following arguments: arg, local, static or global. Using the
-Msafeptr option with no arguments specifies all arguments.

The arg argument specifies that C dummy arguments are treated with the same semantics as FORTRAN dummy arguments. This implies that PGCC C or C++ will assume that all dummy arguments declared as pointers or arrays never conflict with any other data item. With this argument, you can generate more efficient code for C functions that use dummy pointers and arrays. For example, a vector multiply-add function could be coded in either of two ways, as shown in the two versions of vmuladd() shown in Example 3-1.

vmuladd(a, b, c, d, cnt)
double *a, *b, *c, *d;
int *cnt;
{
  int i;
  for (i=0; i<*cnt; i++)
    d[i] = c[i]+a[i]*b[i];
}

vmuladd(a, b, c, d, cnt)
double *a, *b, *c, *d;
int *cnt;
{
  int i;
  for (i=0; i<*cnt; i++)
    *d++ = *c++ + *a++ * *b++;
}

Example 3-1 vmuladd()

Since the compiler does not know whether the pointers in this function conflict, it generates code where the loads of c[i], a[i] and b[i] are all performed before the store of d[i] and the store of d[i] is performed before another load. Another complication is caused by the pointer loop counter, *cnt which could change when storing d[i].

For this example, due to the pointers, the loop is not countable. Even if the loop were countable, PGCC C and C++ would still detect the dependence relation between the store of d and each of the loads of a, b and c.

However, using -Msafeptr, the compiler assumes that arguments do not conflict and better code is generated, for example:

$ pgcc -O2 -Msafeptr=arg vmuladd.c

Keep in mind that this example, and the use of the -Msafeptr option cause PGCC C and C++ to assume that each dummy argument supplied to vmuladd() does not conflict or overlap in any way. That is, each argument is unique. There are many cases where this may not be true. For example, if two pointers could point to the same data element you would not use the -Msafeptr option.

-Msafeptr=auto

This option specifies that C local (auto) pointers and arrays do not conflict with each other or with other data items. The implications of this are similar to those for the -Msafeptr=arg option described in the previous section, except that the data independence applies to local pointers and arrays.

Whenever local pointers are used, you may want to use the -Msafeptr=auto option. Code that assigns dummy arguments to local (auto) pointers can be optimized. Independence of pointers allows for common subexpression elimination within loops and reusing calculations for sub-expressions occurring multiple times within one iteration, such as the expression *i1*multiplier.

-Msafeptr=static

This option specifies that C static pointers and arrays do not conflict with each other or with other data items. For this argument data independence applies to static pointers and arrays.

Suppose in the vmuladd() example shown previously, the output vector *d could overlap with one of the input vectors *a, *b or *c. Rather than accepting slow performance, you could allocate a static workspace to hold the result and copy the result into the output vector *d.

This revised vmuladd() is shown in Example 3-2. In Example 3-2, if the compiler knows that the static pointer *w does not conflict with any other pointers, safe pointer optimizations can be used since the only other pointers are dummy arguments. Hence, the following compile line will inform the compiler that both dummy argument pointers and static pointers are safe and this will improve performance.

vmuladd_workspace(a, b, c, d, cnt)
double *a, *b, *c, *d; 
int *cnt;
{
    static double *w = NULL;  /* w is the workspace */
    static int sz = 0;
    int i;
 
    if ( sz < *cnt ) {
        if (w) free(w);
        w = (double *) malloc(*cnt * sizeof(double));
        sz = *cnt;
    }
    for (i = 0; i < *cnt; i++)
       w[i]  = c[i] + a[i] * b[i];

    bcopy(d,w,sz*sizeof(double));  
}

Example 3-2 Modified vmuladd()

Using -Msafeptr=global

This option specifies that C global pointers and arrays do not conflict with each other or other data items. The implications of this are similar to those described for the other -Msafeptr arguments, except that the data independence applies to global (or extern) pointers and arrays.

3.1.5 Using the -Munroll Loop Unroller Option

The loop unrolling option replicates the body of a loop and reduces the number of times a loop is executed. There are several possible benefits from loop unrolling, including reducing the loop's branching overhead and providing better opportunities for instruction scheduling. Refer to Section 3.3, Loop Unrolling for more information on loop unrolling.

A loop with a constant count may be completely or partially unrolled depending upon options supplied to the unroller. A loop with a non-constant count may also be unrolled. A candidate loop for loop unrolling must be an innermost loop containing one to four blocks of code. The following example shows a pgf90 invocation using the unroller.

$ pgf90 -Munroll=c:3 uloop.f

This invokes the unroller with its c option set to three. The valid unroll options are c and n.

The c option instructs the compiler to completely unroll a loop with a constant loop count less than or equal to m, a supplied constant. If this value is not supplied, m is set to a default value of four (4).

The n:u argument instructs the compiler to unroll u times a loop that cannot be completely unrolled. This option also applies to loops that have a non-constant loop count. If u is not supplied, the unroller computes the number of times a candidate loop is unrolled. If the supplied value for u is 1, the loop is not unrolled. The loop unroller sets the optimization level to level-two if no -O or -g option is supplied.

3.2 Types of Optimization

This section describes the types of optimization that the PGI compilers perform for each optimization level. It provides a more in depth look at the optimizer and the types of constructs that the optimizer recognizes and transforms. The following sections describe local optimizations, global optimizations, instruction scheduling and vectorization.

3.2.1 Local Optimization

As mentioned in the previous chapter, local optimization is performed on basic blocks. The PGI compilers perform most local optimization during code generation. Local optimization includes:

Algebraic identity removal
Common subexpression elimination
Constant folding
Redundant load and store elimination
Strength reduction

This section describes local optimization. Instruction scheduling is also considered local optimization, since it is performed on basic blocks. Instruction scheduling is covered in Section 3.2.3, Instruction Scheduling, later in this chapter.

Algebraic Identity Removal

The compiler recognizes instances of arithmetic operations in which one of the operands is an algebraic identity and simplifies them. The examples in Table 3-2 below demonstrate algebraic identity removal.

Table 3-2 Algebraic Identity Removal

Original	Replacement
`A + 0`	`A`
`A * 1`	`A`
`A * 0`	`0`

Common Subexpression Elimination

Common subexpression elimination refers to the process of detecting multiple occurrences of the same computation and replacing them with a single occurrence. For example, the following statement contains the common subexpression X(A) * Y(B):

I = X(A) * Y(B) + X(A) * Y(B) + X(A) * Y(B)

Instead of generating code to compute the common subexpression three times, the compiler generates the code and places the result in a register and reuses the register. If for some reason the register must be used for some other purpose, its value will be saved in a temporary variable and re-used as needed. For example:

T1 = X(A) * Y(B)
I = T1 + T1 + T1

Constant Folding

Constant folding refers to the process of replacing an expression by its value if the value can be computed at compile time. For examples, refer to Table 3-3.

Table 3-3 Examples of Constant Folding

Original	Replacement
`J = 1 + 2`	`J = 3`
`K = 3 - 4`	`K = -1`
`L = 5 * 6`	`L = 30`
`F = 1.2 + 3.8`	`F = 5.0`
`G = 10.0 * 0.5`	`G = 5.0`

Redundant Load and Store Elimination

Redundant load and store elimination removes store operations that are immediately followed by a load operation on the same variable (redundant code). For example, if the compiler did not eliminate redundant loads and stores in the code below, it would perform a store of A, then a load of A and an additional store of A:

A = B + C	
A = A + 2

By eliminating redundant loads and stores, the compiler keeps the value of A in a register and does not perform the first store of A (or the load of A).

Strength Reduction

Strength reduction replaces time-consuming operations, such as division, with faster operations, such as multiplication or shifts. The following example illustrates strength reduction:

Original	Replacement
J = I/4	J = ARSHIFT(I,2)

3.2.2 Global Optimization

Global optimization looks at code over all of a program's basic blocks. The optimizer performs control-flow and data-flow analysis for an entire program. All loops are detected and optimized. Global optimization includes:

Branch to branch elimination
Constant propagation
Copy propagation
Dead store elimination
Global register allocation
Induction variable elimination
Invariant code motion
Scalar replacement

Branch to Branch Elimination

This optimization eliminates unnecessary branch statements. For example, the following illustrates an unnecessary branch.

IF (A(I) .EQ. 0) THEN
	A(I) = B
END IF
GO TO 290

An opportunity for branch to branch elimination occurs when A(I) is not 0. In the optimized version, control can be transferred directly to statement 290:

IF (A(I) .NE. 0)  GO T0 290
A(I) = B
GO TO 290

Constant Propagation

Constant propagation is similar to copy propagation, but the value propagated is a constant. The following example illustrates constant propagation:

Original	Replacement
X = 5 ... A(T4) = X	X = 5 ..... A(T4) = 5

Now, if X is not subsequently used, the first store, X = 5 can be removed.

Copy Propagation

If one variable is copied to a second, and the second is later used but not changed, the initial copy was unnecessary and may be propagated to the point of later use. This process is referred to as copy propagation.

The following example illustrates copy propagation:

Original	Replacement
... X = T3 A(T2) = T5 A(T4) = X GOTO B2 ...	... X = T3 A(T2) = T5 A(T4) = T3 GOTO B2 ...

Now, if X is not subsequently used, the first store, X = T3 can be removed.

Dead-store Elimination

Dead-store elimination eliminates stores of variables that the program does not need. A store operation is not needed if the value it generates is not used subsequently in the program or for output. Such operations can arise from several sources, including:

Naive code generation can produce operations that are part of a generic translation, but are useless in some contexts.
Other optimizations, such as common subexpression elimination and copy propagation, can introduce useless stores.
Other code improvements or conditional compilation can eliminate use of the results of an operation.

By analyzing the data flow in a program the compiler can detect many useless store operations and remove them.

For example, with the following code:

T1 = X + Y
T2 = X + Y
T3 = T1 + T2

the compiler can revise the code to read:

T1 = X + Y
T2 = T1
T3 = T1 + T1

Now the second line of code, T2 = T1, can be reached but serves no purpose because the new third line, T3 = T1 + T1 which eliminates the use of T2. In other words, the second line is a dead store. Dead store elimination refers to the process of removing such code.

Global Register Allocation

Using global register allocation the compiler attempts to maintain frequently used variables in registers across blocks. This strategy attempts to eliminate "unnecessary" memory references of variables where a program stores the value of a variable at the end of one block and immediately loads the value at the beginning of the next block. This replaces the load with a use and the store with a move.

One way to allocate registers is to have each of them hold a particular value for the duration of a single basic block. The drawback to this approach is that the value must be stored at the end of one basic block and then reloaded, if needed, at the beginning of another basic block.

Global register allocation attempts to overcome this problem by maintaining frequently used variables in registers across basic blocks (globally). Because programs spend most of their time inside loops, one approach to global register allocation is to try to keep a particular variable in a particular register throughout a loop. One way to accomplish this is by reserving a set number of registers to hold frequently used variables.

Invariant Code Motion

The compiler may detect loops containing a computation whose result is always the same no matter how many times the loop is executed. Invariant code motion refers to the process of moving (hoisting) such a computation from inside the loop to the front of the loop so that it will be only executed once, thereby speeding up the loop.

For example, for the expression X = Y + Z to be treated as loop invariant code, it must meet the following conditions:

The variables Y and Z must not be assigned values within the loop.
If the expression Y + Z involves floating point computations then the expression must be executed before any possible exit from the loop.

Induction Variable Elimination

Induction variables are integer variables whose values are incremented by the same amount each time through a loop (they differ by a regular, calculable value). Induction variable elimination removes many of these variables from a loop and often leads to other optimizations, such as strength reduction. The next example illustrates induction variable elimination (assume x is not subsequently used).

Original	Replacement
... Y = X DO I = 1, 100 X = 4 * I Z1 = Z1 + X Z2 = Z2 + X + Y ENDDO ...	... Y = X TMP = 4 DO I = 1, 100 Z1 = Z1 + TMP Z2 = Z2 + TMP + Y TMP = TMP + 4 ENDDO ...

Unreachable Code Elimination

Unreachable code elimination removes code that cannot be accessed, for example, code that is immediately preceded by an unconditional branch around the code.

3.3 Loop Unrolling

The loop unroller expands the contents of a loop and reduces the number of times a loop is executed. Using the -Munroll option, the PGI compilers unroll loops either partially or completely. There are several possible benefits from loop unrolling, including:

Reducing the loop's branching overhead
Providing better opportunities for instruction scheduling

Branching overhead is reduced when a loop is unrolled two or more times, since each iteration of the unrolled loop corresponds to two or more iterations of the original loop; the number of branch instructions executed is proportionately reduced. When a loop is unrolled completely, the loop's branch overhead is eliminated altogether.

Loop unrolling may be beneficial for the instruction scheduler. When a loop is completely unrolled or unrolled two or more times, opportunities for improved scheduling may be presented. The code generator can take advantage of more possibilities for instruction grouping or filling instruction delays found within the loop. Examples 3-3 and 3-4 show the effect of code unrolling on a segment that computes a dot product.

REAL*4  A(100), B(100), Z
INTEGER I
DO I=1, 100 
  Z = Z + A(i) * B(i)
END DO
END

Example 3-3 Dot Product Code

REAL*4 A(100), B(100), Z
INTEGER I
DO I=1, 100, 2 
  Z = Z + A(i) * B(i)
  Z = Z + A(i+1) * B(i+1)
END DO
END

Example 3-4 Unrolled Dot Product Code

Using the -Minfo or -Minfo=loop option, the compiler informs you when code is being unrolled. For example a message indicating the line number, and the number of times the code is unrolled, similar to the following will display when a loop is unrolled:

dot:
      5, Loop unrolled 5 times

3.4 Vectorization

The vectorizer performs code transformations on countable loops and enhances parallelization by providing extensive data-dependence information to the parallelizer. The vectorizer transforms loops on two levels: transformations across loops (inter-loop) and inner-loop transformations. The following sections provide more information on vectorization.

Loops that are candidates for the vectorizer are countable, that is the number of loop iterations is determined prior to the loop's execution and the loop counter is incremented or decremented by a fixed amount at each iteration. In addition, candidate loops do not contain any subprogram calls.

3.4.1 Loop Interchange

One phase of the vectorizer looks for candidate interchangeable loops. Interchangeable loops are perfectly-nested loops. These loops are nested loops that do not contain any statements between the loop control statements. Perfectly nested loops can be transformed to ensure the innermost loop is vectorizable and to improve memory access patterns for the innermost loop. Loop interchange transformations also create opportunities for invariant vector motion. For example:

DO I = 1,1000
   DO J = 1,100
     X(I,J) = X(I,J) * A(I,J)
   ENDDO
ENDDO

The vectorizer rearranges this code so the i loop is the inner loop and the j loop is the outer loop. To judge which interchange is best the vectorizer tries to predict actual runtime performance. The interchange results in operations performed on 100 vectors of length 1000 instead of 1000 vectors of length 100 and the new inner loop is stride-1. The resulting generated code performs as if it was originally written as:

DO J = 1,100
   DO I = 1,1000
     X(I,J) = X(I,J) * A(I,J)
   ENDDO
ENDDO

In this example, using the -Minfo=loop switch, the vectorizer produces an informational message indicating that the loop was interchanged:

5, Interchange produces reordered loop nest: 5, 4.

3.4.2 Nested Loop Distribution

The vectorizer performs inner nested-loop distribution and outer nested-loop distribution to remove code that inhibits software pipelining and to optimize the loop size for register and cache usage. These transformations improve code in several ways:

Removing constructs that inhibit software pipelining. Such constructs include subprogram calls, conditionals (IF statements), and recurrences.
Optimizing the loop size. This reduces the size for the loop so that the software pipeliner can generate effective code for the loop.
Creating opportunities for loop interchange.

3.4.3 Outer Loop Distribution

Because loop interchange is only conducted on perfectly nested loops, the vectorizer transforms outer loops using outer loop distribution. This increases the number of perfect loop nests and increases opportunities for loop interchange. For example, consider the following code segment:

DO I = 1,N
     DO J = 1,N
          C(I,J) = 0
          DO K = 1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
          ENDDO
     ENDDO
ENDDO

The vectorizer could transform this loop to generate two sets of loops:

DO I = 1,N
     DO J = 1,N
          C(I,J) = 0
     ENDDO
ENDDO
DO I = 1,N
     DO J = 1,N
          DO K = 1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
          ENDDO
     ENDDO
ENDDO

Loop interchange can be applied as a further optimization on this loop. In this example, using the
-Minfo=loop, the vectorizer produces an informational message indicating that the loop was distributed and then interchanged:

 6, Distributing loop; 2 new loops.	
    Interchange produces reordered loop nest: 8, 6.

3.4.4 Inner Loop Distribution

Inner loop distribution isolates non-vectorizable statements and allows the vectorizer to operate on the vectorizable portions of a loop. For example, in the following loop the second statement is not vectorizable because it contains a recurrence. This can be moved to a separate loop allowing the first statement to be vectorized:

DO I = 1,N
   X = A(I) + C(I)
   B(I+1) = B(I) * X
ENDDO

The vectorizer transforms this loop to generate two loops:

DO I = 1,N
   X(I) = A(I) + C(I)
ENDDO
DO I = 1,N
   B(I+1) = B(I) * X(I)
ENDDO

Notice that the vectorizer has promoted the scalar X to a vector. In this example, using the
-Minfo=loop switch, the vectorizer produces an informational message indicating that the loop was distributed:

5, Distributing inner loop; 2 new loops.

In case there is a promoted scalar and the loop count is a variable or constant larger than 1024, loops created through distribution are surrounded by another loop whose stride is 1024, and the loop counts of the inner loops are set to a maximum of 1024. The vectorizer can then allocate 1024 elements to each promoted scalar. When this strip-mining occurs, and the -Minfo=loop switch is used, the message "Creating strip-mine loop" is issued. For examples of strip-mining, refer to the following section. This is nested-loop transformation strip-mining. Strip-mining also occurs in the inner-loop transformations described in the following section.

3.4.5 Automatic Usage of Pentium III SSE Instructions

When the compiler switch -Mvect=sse is used, the vectorizer in the PGI Workstation compilers automatically uses Pentium III SSE instructions where possible. This capability is supported by all of the PGI Fortran, C and C++ compilers, and is accomplished by generating Pentium III SSE and prefetch instructions when vectorizable loops are encountered (a modification in the generated assembly code - your source code remains unaltered). Using -Mvect=sse, performance improvements of up to two times over equivalent scalar code sequences are possible. However, the Pentium III SSE instructions apply only to 32-bit floating-point data.

In the program in example 3-5, the vectorizer recognizes the vector operation in subroutine 'loop' when the compiler switch -Mvect=sse is used. This example shows the compilation, informational messages, and runtime results using the SSE instructions on a Pentium III system, along with issues that affect SSE performance.

First note that the arrays in Example 3-5 are single-precision and that the vector operation is done using a unit stride loop. You can cause unconstrained data objects of size 16 bytes or greater to be cache-aligned by compiling with the -Mcache_align switch. An unconstrained data object is a data object that is not a common block member and not a member of an aggregate data structure.

Note

In order for stack-based local variables to be properly aligned, the main program or function must be compiled with -Mcache_align.

The -Mcache_align switch has no effect on the alignment of Fortran allocatable or automatic arrays. If you have arrays that are constrained, for example vectors that are members of Fortran common blocks, you must specifically pad your data structures to ensure proper cache alignment; -Mcache_align causes only the beginning address of each common block to be cache-aligned.

The examples below show results of compiling the example code with and without -Mcache_align.

      		program vector_op
      		parameter (N = 99999)
      		real*4 x(n),y(n),z(n),w(n)
      		do i = 1,n
         			y(i) = i
      			z(i) = 2*i
                	w(i) = 4*i
      		enddo
      		do j = 1, 10000
      		   call loop(x,y,z,w,1.0e0,n)
      		enddo
      		print*,x(1),x(771),x(3618),x(23498),x(99999)
      		end
      		subroutine loop(a,b,c,d,s,n)
      		integer i,n
      		real*4 a(n),b(n),c(n),d(n),s
      		do i = 1,n
         			a(i) =  b(i) + c(i) - s * d(i)
      		enddo
      		end

Example 3-5 Vector operation using Pentium III SSE instructions

Assume the above program is compiled as follows:

 % pgf90 -fast -Mvect -Minfo vadd.f
 vector_op:
      4, Loop unrolled 10 times
 loop:
     18, Loop unrolled 5 times

No compile-time vectorization messages are emitted. That is an indicator that no loops are optimized by the vectorizor. Following is the result if the generated executable is run and timed on a standalone Pentium III 733 Mhz system:

   % /bin/time a.out
   -1.000000  -771.000  -3618.000  -23498.00  -99999.00    
   40.36user 0.020system 0:40.41 elapsed 99.9%CPU

Now, recompile with Pentium III SSE vectorization enabled:

% pgf90 -fast -Mvect=sse -Minfo vadd.f
vector_op:
     4, Unrolling inner loop 8 times
        Loop unrolled 7 times (completely unrolled)
loop:
    18, Generating sse code for inner loop
        Generated prefetch instructions for 3 loads

Note the informational message indicating that the loop has been vectorized and SSE instructions have been generated. The second part of the informational message notes that prefetch instructions have been generated for 3 loads to minimize latency of transfers of data from main memory.

Executing again, you should see results similar to the following:

   % /bin/time a.out
   -1.000000  -771.000  -3618.00  -23498.00  -99999.0    
   30.410user 0.010system 0:30.50elapsed 99.7%CPU

A 33% performance improvement is realized. However, there are further potential improvements available. In the compilation above, there is no guarantee that the starting addresses of vector data computed on using SSE instructions are aligned to cache line boundaries. To ensure alignment of local arrays and common blocks, the -Mcache_align switch can be used. Using this switch in combination with those used previously results in the following:

% pgf90 -fast -Mvect=sse -Minfo vadd.f
vector_op:
     4, Unrolling inner loop 8 times
        Loop unrolled 7 times (completely unrolled)
loop:
    18, Generating sse code for inner loop
        Generated prefetch instructions for 3 loads

Note that the same informational messages are emitted. Executing this version of the code, you should see results similar to the following:

   % /bin/time a.out
   -1.000000  -771.000  -3618.00  -23498.00  -99999.0    
   25.120user 0.040system 0:25.21elapsed 99.8%CPU

The result is a speed-up of more than 67% over the equivalent scalar (i.e. non-SSE) version of the program.

By careful coding in combination with the -Mvect=sse and -Mcache_align switches, it is possible to get substantial speed-ups on programs which operate on 32-bit floating-point vectors running on the Pentium III. However, in some cases, codes which operate on unaligned or strided data will result in performance degradations when compiling with -Mvect=sse. For this reason, PGI recommends that you always measure the performance of codes with and without -Mvect=sse and -Mcache_align rather than using these switches as a default for optimization.

Note

Compiling with -Mvect=sse can result in numerical differences from the generated executable. Certain vectorizable operations, for example dot products, are sensitive to order of operations and the associative transformations necessary to enable vectorization (or parallelization).

3.4.6 Pentium III and Athlon Prefetch Instructions

When the compiler switch -Mvect=prefetch is used on a Pentium III or Athlon-based system, or in combination with the -tp p6 or -tp athlon switches, the PGI Workstation compilers automatically use (respectively) Pentium III or Athlon prefetch instructions where possible. This capability is supported by all of the PGI Fortran, C and C++ compilers.

Prefetch instructions are issued in advance of actual data accesses to ensure data is in cache when referenced with memory load and store instructions. Use of prefetch instructions can improve performance by minimizing the amount of time the processor stalls while waiting for data to arrive from main memory.

Unlike Pentium III SSE instructions, which operate only on 32-bit data, prefetch instructions can be used to improve performance of loops that operate on either 32-bit or 64-bit data structures.

Assume the code in Example 3-5 is converted to double precision. The examples below show results of compiling the example code with -Mvect=prefetch. Assume the program is compiled as follows:

% pgf90 -fast -Minfo vector.f
vector_op:
     4, Loop unrolled 5 times
loop:
    18, Loop unrolled 3 times

No compile-time prefetch messages are emitted, so that's an indicator that no loops are optimized using prefetching. Following is the result if the generated executable is run and timed on a standalone Pentium III 733 Mhz system:


   % time a.out
   -1.000000  -771.0000  -3618.000  -23498.00  -99999.00    
   54.640u 0.010s 0:54.65 100.0%  0+0k 0+0io 114pf+0w

Now, recompile with prefetch enabled:

% pgf90 -fast -Mvect=prefetch -Minfo vadd.f
vector_op:
   4, Unrolling inner loop 4 times
      Loop unrolled 3 times (completely unrolled)
loop:
  18, Unrolling inner loop 4 times
      Used streaming stores for 1 stores
      Generated prefetch instructions for 3 loads

Note the informational message indicating that the loop has been optimized using prefetch instructions. Executing again, you should see results similar to the following:


   % time a.out
   -1.000000  -771.0000  -3618.000  -23498.00  -99999.00    
   44.490u 0.010s 0:44.50 100.0%  0+0k 0+0io 114pf+0w

A 23% performance improvement is realized. Similar performance improvements can be realized using -Mvect=prefetch on AMD Athlon-based systems.

By using the -Mvect=prefetch option, it is possible to get substantial speed-ups on programs which operate on either 32-bit or 64-bit floating-point vectors. However, in some cases, codes which operate on unaligned or strided data can see performance degradations when compiling with
-Mvect=prefetch. For this reason, PGI recommends that you always measure the performance of codes with and without -Mvect=prefetch rather than using this switch as a default for optimization.

3.5 High-Level Transformations

3.5.1 Reductions

A reduction is an assignment of the form:

x = x op expr,

where x is a scalar, and op is +, -, min, or max. Reductions are evaluated on individual processors, then results are accumulated within critical sections.

When a loop contains a reduction, a new private accumulator is created on each processor, and initialized according to op. Let op' be + if op = -; otherwise let otherwise, let op' =op. If the new private accumulator is .xp, the following block is added immediately preceding label 25 in the transformed code from Example 1:

   call _mp_bcs 
   x = x op' .xp
   call _mp_ecs

3.5.2 Expandable Scalars

An expandable scalar is a scalar appearing in an innermost loop for which every use is reached by a single assignment to the scalar, or where all paths from the beginning of a loop to the scalar use contain a definition of that scalar. Without transformations for expandable scalars, processors may overwrite values produced by other processors. Using the transformation, all expandable scalars within a parallel loop are replaced by new private variables everywhere within the loop. If the last value of an expandable scalar is required outside the loop, the original expandable scalar is assigned the value of the private variable on the processor that performed the computation of the last iteration. Suppose the original expandable scalar s is replaced by a new private variable .xs. Using block iteration allocation, the following statements are added immediately preceding label 25 in the transformed code in Example 1:

     if (.n0 .ne. count + start) goto 21 
     s = .xs 
21   continue

Using cyclic iteration allocation, the following statements are added prior to label 25 in Example 2:

     if (.n0 .ne. count + start + _mp_ncpus() - 1) goto 21
     s = .xs 
21   continue

3.5.3 Calls

When an expression appears as the argument in a call within a Fortran program, the compiler creates a new temporary variable and assigns the expression to the temporary variable prior to the call. The address of the temporary is passed to the called program unit. A similar transformation is required when the address of a loop index is passed as the argument of a call.

Usually, vectorization and parallelization are not conducted on loops with calls. However, when
-Mconcur=cncall is specified during compilation, loops with calls are transformed. The high-level vectorizer replaces user-defined loop indices with zero-based compiler-created temporaries (.n0 in Example 1). When the address of a user-defined loop index is passed as an argument to a subprogram, the vectorizer must create yet another temporary to hold the address of the expression that yields the same value as the original loop index. For example, consider the following loop as it appears in a source program.

	   do 1 i = 1, n 
1	        call sub(i)

Since the address of loop index i is passed as an argument, a new integer expandable scalar is introduced, and set to be the value of i. Prior to parallelization, the loop is transformed to the following.

     .n0 = 0 
     .n1 = n 
     if (n .le. 0) goto 30
20   .i0 = .n0 + 1 
     call sub(&.i0) 
     .n0 = .n0 + 1 
     .n1 = .n1 - 1 
     if (.n1 .gt. 0) goto 20 
30   continue

3.6 Parallelization Transformations

This section describes the compiler's auto-parallelization phase. The auto-parallelizer scans code for loops that are candidates for auto-parallelization, and transforms these loops. The option -Mconcur invokes auto-parallelization

3.6.1 Candidate Parallel Loops

A loop is considered parallelizable if doesn't contain any cross-iteration data dependences. Cross-iteration dependences from reductions and expandable scalars are excluded from consideration, enabling more loops to be parallelizable. In general, loops with calls are not parallelized due to unknown side-effects. Also, loops with low trip counts are not parallelized since the overhead in setting up and starting a parallel loop will likely outweigh the potential benefits (compiler switches let you override some of these restrictions on auto-parallelization).

3.6.2 Loops Which Fail to Parallelize

In spite of the sophisticated analysis and transformations performed by the compiler, programmers will often note loops that are seemingly parallel, but are not parallelized. In this subsection, we'll look at some examples of common situations where parallelization does not occur.

3.6.3 Timing Loops

Often, loops will occur in programs that are similar to timing loops. The outer loop in the example below is one such loop.

	do 1 j = 1, 2 
	   do 1 i = 1, n 
	      a(i) = b(i) + c(i) 
1	continue

The outer loop above is not parallelized because the compiler detects a cross-iteration dependence in the assignment to a(i). Suppose the outer loop were parallelized. Then both processors would simultaneously attempt to make assignments into a(1:n). Now in general the values computed by each processor for a(1:n) will differ, so that simultaneous assignment into a(1:n) will produce values different from sequential execution of the loops.

In this example, values computed for a(1:n) don't depend on j, so that simultaneous assignment by both processors will not yield incorrect results. However, it is beyond the scope of the compilers' dependence analysis to determine that values computed in one iteration of a loop don't differ from values computed in another iteration. So the worst case is assumed, and different iterations of the outer loop are assumed to compute different values for a(1:n). Is this assumption too pessimistic? If j doesn't occur anywhere within a loop, the loop exists only to cause some delay, most probably to improve timing resolution. And, it's not usually valid to parallelize timing loops; to do so would distort the timing information for the inner loops.

3.6.4 Scalars

Quite often scalars will inhibit parallelization of non-innermost loops. There are two separate cases that present problems. In the first case, scalars appear to be expandable, but appear in non-innermost loops, as in the following example.

	do 1 j = 1, n 
	   x = b(j) 
	   do 1 i = 1, n 
	      a(i,j) = x + c(i,j) 
1	continue

There are a number of technical problems to be resolved in order to recognize expandable scalars in non-innermost loops. Until this generalization occurs, scalars like x above will inhibit parallelization of loops in which they are assigned. In the following example, scalar k is not expandable, and it is not an accumulator for a reduction.

	  k = 1 
	  do 3 i = 1, n 
	    do 1 j = 1, n 
1	      a(j,i) = b(k) * x 
	    k = i 
2	    if (i .gt. n/2) k = n - (i - n/2) 
3	  continue

If the outer loop is parallelized, conflicting values will be stored into k by the various processors. The variable k cannot be made local to each processor because the value of k must remain coherent among the processors. It is possible the loop could be parallelized if all assignments to k are placed in critical sections. However, it is not clear where critical sections should be introduced because in general the value for k could depend on another scalar (or on k itself), and code to obtain the value of other scalars must reside in the same critical section.

In the example above, the assignment to k within a conditional at label 2 prevents k from being recognized as an induction variable. If the conditional statement at label 2 is removed, k would be an induction variable whose value varies linearly with j, and the loop could be parallelized.

3.6.5 The safe_lastval Option

During parallelization, scalars within loops often need to be privatized; that is, each execution thread will have its own independent copy of the scalar. Problems can arise if a privatized scalar is accessed outside the loop. For example, consider the following loop:

	for (i = 1; i<N; i++){
	  if( f(x[i]) > 5.0 ) t = x[i];
 	}
	v = t;

The value of t may not be computed on the last iteration of the loop. Normally, if a scalar is assigned within a loop and used following the loop, the PGI compilers save the last value of the scalar. However, if the loop is parallelized and the scalar is not assigned on every iteration, it may be difficult (without resorting to costly critical sections) to determine on what iteration t is last assigned. Analysis allows the compiler to determine that a scalar is assigned on each iteration and hence that the loop is safe to parallelize if the scalar is used later. For example:

	for ( i = 1; i < n; i++){
	   if ( x[i] > 0.0 ) {
             t = 2.0;
          }
	   else {
	      t = 3.0;
	      y[i] = ...t;
          }
	}
	v = t

where t is assigned on every iteration of the loop. However, there are cases where a scalar may be privatizable, but if it is used after the loop, it is unsafe to parallelize. Examine this loop:

 for ( i = 1; i < N; i++ ){
     if( x[i] > 0.0 ){
         t = x[i];
	     ...
	     ...
	  y[i] = ...t;
      }
  }
  v = t;

where each use of t within the loop is reached by a definition from the same iteration. Here t is privatizable, but the use of t outside the loop may yield incorrect results since the compiler may not be able to detect on which iteration of the parallelized loop t is last assigned. The compiler detects the above cases. Where a scalar is used after the loop but is not defined on every iteration of the loop, parallelization will not occur.

When the programmer knows that the scalar is assigned on the last iteration of the loop, the programmer may use a directive or pragma to let the compiler know the loop is safe to parallelize. The Fortran directive which tells the compiler that for a given loop the last value computed for all scalars make it safe to parallelize the loop is:

cpgi$l safe_lastval

In addition, a command-line option, -Msafe_lastval, provides this information for all loops within the routines being compiled (essentially providing global scope) .