PGI Workstation User's Guide - 2 Optimization & Parallelization

2 Optimization & Parallelization


Source code that is readable, maintainable and produces correct results is not always organized for efficient execution. Normally, the first step in the program development process involves producing code that executes and produces the correct results. This first step usually involves compiling without much worry about optimization. After code is compiled and debugged, code optimization and parallelization become an issue. Invoking one of the PGI compiler commands with certain options instructs the compiler to generate optimized code. Optimization is not always performed since it increases compilation time and may make debugging difficult. However, optimization produces more efficient code that usually runs significantly faster than code that is not optimized.

The compilers optimize code according to the specified optimization level. Using the -O, -Mvect, and -Mconcur options, you specify the optimization levels. In addition, several -Mpgflag switches control specific types of optimization and parallelization.

This chapter describes the optimization options and describes how to choose an optimization level. Chapter 3, Optimization Features, provides more information on optimization. Chapter 4, Function Inlining, describes how to use the function inlining options.

2.1 Overview of Optimization

In general, optimization involves using transformations and replacements that generate more efficient code. This is done by the compiler and involves replacements that are independent of the particular target processor's architecture as well as replacements that take advantage of the IA-32 architecture, instruction set and registers. For the discussion in this and the following chapters, optimization is divided into the following categories:

Local Optimization

This optimization is performed on a block-by-block basis within a program's basic blocks. A basic block is a sequence of statements in which the flow of control enters at the beginning and leaves at the end without the possibility of branching, except at the end. The PGI compilers perform many types of local optimization including: algebraic identity removal, constant folding, common sub-expression elimination, pipelining, redundant load and store elimination, scheduling, strength reduction and peephole optimizations.
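
For example, in a fragment such as the following (illustrative only, not taken from the manual), constant folding evaluates 2.0 * 3.0 at compile time and common sub-expression elimination computes (b + c) once and reuses the result:

      x = (b + c) * (2.0 * 3.0)    ! constant expression folded to 6.0
      y = (b + c) / d              ! common sub-expression (b + c) reused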

Global Optimization

This optimization is performed on code over all its basic blocks. The optimizer performs control-flow and data-flow analysis for an entire program. All loops, including those formed by IFs and GO TOs, are detected and optimized. Global optimization includes: constant propagation, copy propagation, dead store elimination, global register allocation, invariant code motion and induction variable elimination.

Loop Optimization: Vectorization, Unrolling, Parallelization

The performance of certain classes of loops may be improved through vectorization or unrolling options. Vectorization transforms loops to improve memory access performance. Unrolling replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization and scheduling of instructions. Performance for loops on systems with multiple processors may also improve using the parallelization features of the compiler.

Function Inlining

This optimization allows a call to a function to be replaced by a copy of the body of that function. This optimization will sometimes speed up execution by eliminating the function call and return overhead. Function inlining may also create opportunities for other types of optimization. Function inlining is not always beneficial. When used improperly it may increase code size and generate less efficient code.

2.2 Invoking Optimization

Using the PGI compiler commands with the -Olevel option, you can specify any of the following optimization levels (the capital O is for Optimize):

-O0
level-zero specifies no optimization. A basic block is generated for each Fortran statement.
-O1
level-one specifies local optimization. Scheduling of basic blocks is performed. Register allocation is performed.
-O2
level-two specifies global optimization. This level performs all level-one local optimization as well as level-two global optimization.

You select the optimization level on the command line. For example, level-two optimization results in global optimization, as shown below:

$ pgf90 -O2 prog.f

Specifying -O on the command-line without a level designation is equivalent to -O2. In addition to the -O options, several of the -Mpgflag options affect optimization and parallelization. This chapter describes the -O options as well as the vectorizer option -Mvect and the auto-parallelization option -Mconcur. The following two chapters provide more detailed information on the -Mvect option and the other -Mpgflag options that control optimization and auto-parallelization features, including function inlining. Explicit parallelization through the use of OpenMP directives or pragmas is invoked using the -mp option, described in detail in Chapter 10, OpenMP Parallelization Directives for Fortran and Chapter 11, OpenMP Parallelization Pragmas for C and C++.

The default optimization level changes depending on which options you select on the command line. For example, when you select the -g debugging option, the default optimization level is set to level-zero (-O0). Refer to Section 2.10, Default Optimization Levels, for a description of the default levels.

2.3 Selecting Appropriate Optimizations - Checklist

This section outlines the steps that you can use to select appropriate optimizations. You should read the entire chapter for a thorough introduction to the options listed in this checklist. However, if you want to get started quickly, use the steps outlined in this section as a guide.

2.3.1 Guidelines for Selecting An Optimization Level

If you are looking to get started quickly, a good option to use with any of the PGI compilers is the following:

$ pgf90 -fast prog.f

For all of the PGI Fortran, C, and C++ compilers, this option will generally produce code that is well-optimized without the possibility of significant slowdowns due to pathological cases. The -fast option is equivalent to specifying -O2 -Munroll -Mnoframe. For C++ programs compiled using pgCC, add -Minline=levels:10 --no_exceptions:

$ pgCC -fast -Minline=levels:10 --no_exceptions prog.cc

By experimenting with individual -Mpgflag optimization options on a file-by-file basis, you can sometimes realize further significant performance gains. However, individual -Mpgflag optimizations can sometimes cause slowdowns depending on coding style, and they must be used carefully to ensure that performance improvements result.

Following is a good sequence of steps to follow if you are just getting started with one of the PGI compilers, or wish to experiment with individual optimizations.

Step One (Debugging)

Your first concern should be getting your program to execute and produce correct results. To get your program running, start by compiling and linking without optimization. Use the optimization level -O0, or select -g, which performs minimal optimization. At this level, you will be able to find coding errors and to debug and profile your program.

Step Two (Local and Global Optimizations, Loop unrolling)

After you know that your program compiles and executes as intended, the next step is to compile the program with optimizations and to time the results (refer to the last section of this chapter for information on timing). Examine your source files and select an initial optimization level based on the discussion of the optimization levels (the following sections describe levels -O0, -O1 and -O2). Once you select an optimization level, time your results. If you are not sure which optimization level to use initially, just use -fast.

Step Three (Loop Optimizations: Vectorization)

If your program contains many loops then try the -Mvect option to see if it is helpful. If you select -Minfo=loop and your code contains loops that can be vectorized or unrolled, the compiler reports relevant information on the optimization applied.
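
For example, assuming a source file named prog.f, vectorization and the accompanying informational messages might be requested as follows:

$ pgf90 -O2 -Mvect -Minfo=loop prog.f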

Step Four (Loop Optimizations: Parallelization)

If your program contains parallelizable loops and you are generating executables for a multi-processor system, then try the -Mconcur option to see if the compiler can auto-parallelize portions of your application. Execute and time the program using -Mconcur on your multiprocessor system. You will need to set the NCPUS environment variable to the number of processors on which you wish the program to run. See section 3.1.2, Using the -Mconcur Auto-parallelization Option, for more detailed information on setting this environment variable. If you select the -Minfo option on the command-line and your code contains loops that the compiler can auto-parallelize, the compiler will report where the code is parallelized.
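
For example, on a four-processor system the sequence might look like the following; csh syntax is shown for setting NCPUS (use the equivalent form for your shell), and a.out is simply the default executable name:

$ pgf90 -O2 -Mconcur -Minfo prog.f
$ setenv NCPUS 4
$ a.out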

If the compiler is not able to successfully auto-parallelize your application, you should refer to Chapter 10, OpenMP Parallelization Directives for Fortran, or Chapter 11, OpenMP Parallelization Pragmas for C and C++, to see if insertion of explicit parallelization directives or pragmas and use of the -mp compiler option will enable the application to run in parallel.

Step Five (Function Inlining)

If your program makes many calls to small functions, especially if the calls are in loops, your program may benefit from the -Minline option. See Chapter 7, Command-line Options for details on how to use the -Minline option.
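
For example, inlining might be requested together with -fast as shown below; the -Minline suboptions that control which functions are eligible for inlining are described in Chapter 7:

$ pgf90 -fast -Minline prog.f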

Step Six (Profiling)

Finally, try to determine areas within your program where the execution is concentrated. Use the PGPROF graphical profiler to analyze your code based on the information supplied by profiling. In order to produce a trace file for profiling, you must compile and link with the -Mprof=func (function-level profiling) or -Mprof=lines (line-level profiling) options. See Chapter 7, Command-line Options, as well as Chapter 14, The PGPROF Profiler for more information on using PGPROF.
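
For example, a line-level profiling run might look like the following sequence; this assumes the default trace file name pgprof.out and the default executable name a.out:

$ pgf90 -fast -Mprof=lines prog.f
$ a.out
$ pgprof pgprof.out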

In addition, there are a variety of -Mpgflag options to the PGI compiler commands that may decrease your execution time. The next two chapters describe these -Mpgflag optimization options in detail. If you try any of the -Mpgflag options, remember to time your results carefully to see if performance improves.

Most optimizations available on the command line can also be applied on a loop-by-loop basis by inserting directives in Fortran or pragmas in C and C++ (e.g. the number of times to unroll a loop) for specific loops based on information obtained through profiling. See Chapter 9, Optimization Directives and Pragmas, for more information on directive-based optimization.

2.4 Minimal Optimization (-O0)

Level-zero optimization specifies no optimization (-O0). At this level, the compiler generates a basic block for each Fortran statement. This level is useful for the initial execution of a program.

Performance will almost always be slowest using this optimization level. Level-zero is useful for debugging since there is a direct correlation between the Fortran program text and the code generated.

2.5 Local Optimization (-O1)

Level-one optimization specifies local optimization (-O1). The compiler performs scheduling of basic blocks as well as register allocation. This optimization level is a good choice when the code is very irregular; that is, when it contains many short statements with IF constructs and the program does not contain loops (DO or DO WHILE statements). For certain types of code, this optimization level may perform better than level-two (-O2), although this case rarely occurs.

The PGI compilers perform many different types of local optimizations. The following chapter describes these optimizations in more detail.

  • Algebraic identity removal
  • Constant folding
  • Common subexpression elimination
  • Local register optimization
  • Peephole optimizations
  • Redundant load and store elimination
  • Strength reductions

2.6 Global Optimization (-O2, -O)

Level-two optimization (-O2 or -O) specifies global optimization. This level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.

The PGI compilers perform many different types of global optimizations. The following chapter describes these optimizations in more detail.

  • Branch to branch elimination
  • Constant propagation
  • Copy propagation
  • Dead store elimination
  • Global register allocation
  • Invariant code motion
  • Induction variable elimination

2.7 Vectorization (-Mvect)

When a PGI compiler command is invoked with the -Mvect option, the vectorizer scans code searching for loops that are candidates for vectorization transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop or matrix multiplication, with optimized code sequences or function calls). When the vectorizer finds vectorization opportunities, it internally rearranges or replaces sections of loops (the vectorizer changes the code generated; your source code's loops are not altered). In addition to performing these loop transformations, the vectorizer produces extensive data dependence information for use by other phases of compilation.

The -Mvect option can speed up code which contains well-behaved countable loops which operate on arrays. However, it is possible that some codes will show a decrease in performance on IA-32 systems when compiled with -Mvect due to the generation of conditionally executed code segments and other code generation factors. For this reason, it is recommended that you check carefully whether particular program units or loops show improved performance when compiled with this option enabled.
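
The loops that benefit most are simple, countable loops operating on arrays. The following fragment (illustrative only, not taken from the manual) is typical of a loop the vectorizer targets:

      real :: a(n), b(n), c(n), s
      . . .
      do i = 1, n
         a(i) = b(i) + s * c(i)
      end do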

2.8 Parallelization (-Mconcur, -mp)

With the -Mconcur option the compiler scans code searching for loops that are candidates for auto-parallelization. When the parallelizer finds opportunities for auto-parallelization, it parallelizes loops and you are informed of the line or loop being parallelized if the -Minfo option is present on the compile line. See section 3.1.2, Using the -Mconcur Auto-parallelization Option, and Chapter 7, Command-line Options, for information on how to use -Mconcur, including information on how to control distribution of loop iterations among threads and how to auto-parallelize loops containing subroutine or function calls.

As with the vectorizer, the -Mconcur option can speed up code if it contains well-behaved countable loops and/or computationally intensive nested loops which operate on arrays. However, it is possible that some codes will show a decrease in performance on IA-32 multi-processor systems when compiled with -Mconcur due to parallelization overheads, memory bandwidth limitations in the target system, false-sharing of cache lines, or other architectural or code-generation factors. For this reason, it is recommended that you check carefully whether particular program units or loops show improved performance when compiled using this option.
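
As an illustration, a computationally intensive nested loop of the following form (a sketch, not from the manual) is the kind of candidate -Mconcur looks for; when the compiler can prove the iterations are independent, it distributes iterations of an outer loop among the available processors:

      real :: a(n,n), b(n,n), c(n,n)
      . . .
      do j = 1, n
         do i = 1, n
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do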

If the compiler is not able to successfully auto-parallelize your application, you should refer to Chapter 10, OpenMP Parallelization Directives for Fortran, or Chapter 11, OpenMP Parallelization Pragmas for C and C++, to see if insertion of explicit parallelization directives or pragmas and use of the -mp compiler option will enable the application to run in parallel.

2.9 Loop Unrolling (-Munroll)

This optimization unrolls loops, executing multiple instances of the loop body during each iteration. This reduces branch overhead and can improve execution speed. A loop with a constant count may be completely unrolled or partially unrolled. A loop with a non-constant count may also be unrolled. A candidate loop must be an innermost loop containing one to four blocks of code. The following shows the use of the -Munroll option:

$ pgf90 -Munroll prog.f
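
Conceptually, unrolling by a factor of two transforms a loop of the first form below into something like the second; the sketch is illustrative only, since the compiler performs the transformation internally (your source is not changed) and also handles trip counts that are not a multiple of the unroll factor:

      ! original loop
      do i = 1, n
         a(i) = a(i) + b(i)
      end do

      ! after unrolling by a factor of two (assuming n is even)
      do i = 1, n, 2
         a(i)   = a(i)   + b(i)
         a(i+1) = a(i+1) + b(i+1)
      end do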

2.10 Default Optimization Levels

Table 2-1 shows the interaction between the -O, -g, -Mvect, and -Mconcur options. In the table, level can be 0, 1 or 2. The default optimization level is dependent upon these command-line options.

Table 2-1 Optimization and -O, -g, -Mvect, and -Mconcur Options

Optimize Option    Debug Option    -M Option     Optimization Level
---------------    ------------    ---------     ------------------
none               none            none          1
none               none            -Mvect        2
none               none            -Mconcur      2
none               -g              none          0
-O                 none or -g      none          2
-Olevel            none or -g      none          level
-Olevel < 2        none or -g      -Mvect        2
-Olevel < 2        none or -g      -Mconcur      2

Without using any of these options the default optimization level is level-one (-O1). When you use the -O option without a level parameter, the default level is level-two (-O2). With debugging enabled (using -g), the default optimization level is set to level-zero (-O0). Using the -Mvect or -Munroll options, the default optimization level is set to level-two (-O2).
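
For example, following Table 2-1 and these defaults, the command lines below compile at the optimization levels noted in the comments:

$ pgf90 prog.f           # level-one (no options specified)
$ pgf90 -g prog.f        # level-zero
$ pgf90 -O prog.f        # level-two
$ pgf90 -Mvect prog.f    # level-two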

2.11 Local Optimization Using Directives and Pragmas

Command-line options allow you to specify optimizations for an entire source file. Directives supplied within a Fortran source file, and pragmas supplied within a C or C++ source file, provide information to the compiler and alter the effects of certain command-line options or of the compiler's default behavior (many directives have a corresponding command-line option). While a command-line option affects the entire source file being compiled, directives and pragmas let you apply, or disable, the effects of a particular command-line option (for example, an optimization) for selected subprograms or selected loops in the source file, or override command-line options globally. Directives and pragmas allow you to tune selected routines or loops based on your knowledge of the code or on information obtained through profiling. Chapter 9, Optimization Directives and Pragmas, provides details on how to add directives and pragmas to your source files.
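
As a simple illustration, a directive placed immediately before a loop can disable an optimization for that loop only, leaving the rest of the file unaffected. The fragment below is a sketch that assumes the cpgi$ sentinel with the l (loop) scope qualifier used by the PGI Fortran compilers; Chapter 9 defines the exact directive names and syntax:

cpgi$l novector
      do i = 1, n
         a(i) = b(i) + c(i)
      end do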

2.12 Execution Timing and Instruction Counting

As this chapter shows, once you have a program that compiles, executes and gives correct results, you may optimize your code for execution efficiency. Selecting the correct optimization level requires some thought and may require that you compare several optimization levels before arriving at the best solution. To compare optimization levels you need to measure the execution time for your program. There are several approaches you can take for timing execution. You can use shell commands that provide execution time statistics, you can include system calls in your code that provide timing information, or you can profile sections of code. In general, any of these approaches will work; however, there are several important timing considerations to keep in mind.

  • Execution should take at least five seconds (the choice of five seconds is somewhat arbitrary; the interval should be long enough to be statistically significant). If the program does not execute for five seconds, increase the iteration count of some internal loops or place a loop around the main body of the program to extend execution time.
  • Timing should eliminate or reduce the amount of system-level activity, such as program loading, I/O and task switching.
  • Use one of the 3F timing routines, if available, or a similar call available on your system, or use the SECNDS pre-declared function in PGF77 or PGF90, or the SYSTEM_CLOCK intrinsic in PGF90 or PGHPF. Example 2-1 below shows a fragment that indicates how to use SYSTEM_CLOCK effectively within either an HPF or F90 program unit.
. . .
integer :: nprocs, hz, clock0, clock1
real :: time
integer, allocatable :: t(:)
!hpf$ distribute t(cyclic)
#if defined (HPF)
allocate (t(number_of_processors()))
#elif defined (_OPENMP)
allocate (t(OMP_GET_NUM_THREADS()))
#else
allocate (t(1))
#endif
call system_clock (count_rate=hz)    ! clock ticks per second
!
call system_clock (count=clock0)     ! starting count
<do work>
call system_clock (count=clock1)     ! ending count
!
t = (clock1 - clock0)                ! elapsed ticks on each processor
time = real(sum(t)) / (real(hz) * size(t))   ! average elapsed time in seconds
. . .

Example 2-1 Using SYSTEM_CLOCK

