2 PGI Workstation 3.2-4
Release Notes

2.1 PGI Workstation 3.2 Contents
2.2 Supported Systems and Licensing
2.3 New Features
2.4 New Compiler Options
- 2.4.1 New Generic Options
- 2.4.2 New Win32 Options
2.5 OpenMP Directives and Pragmas
2.6 Pentium III, 4, and Athlon Support
2.7 Debugging with PGDBG
2.8 Profiling with PGPROF
- 2.8.1 Analyzing Scalability of Parallel Programs
2.9 LAPACK, the BLAS and FFTs
- 2.9.1 Pre-compiled BLAS and LAPACK Math Libraries
- 2.9.2 Assembly-coded Math Libraries
2.10 Fortran calling conventions on Win32
2.11 OpenMP Tutorial
2.12 PGCC C and C++ Compiler Notes
2.13 PGI Workstation 3.2 and glibc
2.14 PGI Workstation for Win32

This document describes changes between PGI Workstation 3.2 and previous releases, as well as late-breaking information not included in the current printing of the PGI User's Guide.

2.1 PGI Workstation 3.2 Contents

PGI Workstation 3.2 includes the following components:

PGHPF data parallel High Performance Fortran Compiler
PGF90 native OpenMP and auto-threading Fortran 90 Compiler
PGF77 native OpenMP and auto-threading F77 Compiler
PGCC native OpenMP and auto-threading ANSI and K&R C compiler
PGC++ native OpenMP and auto-threading ANSI C++ Compiler (available only on Linux and Solaris86)
PGPROF graphical profiler (command-level only on Win32)
PGDBG graphical debugger (available only on Linux and Solaris86)
Complete online HTML Documentation
A UNIX-like shell environment for Win32

Depending on the product you purchased, you may not have received all of the above components.

2.2 Supported Systems and Licensing

PGI Workstation 3.2-4 is supported on systems using Intel IA32 (e.g. Pentium or Pentium Pro/II/III/4) or AMD Athlon processors running Linux with a kernel version of 2.0 or above, Solaris 7 for Intel or higher, or Win32 operating systems including NT 4.0, Win98, and Win2K. Newer versions of Linux that use glibc 2.1.x, such as Redhat 6.x and SuSE 6.x, are supported. The latest 7.x releases of Linux, such as Redhat 7.0 and SuSE 7.1, which use glibc 2.2.x, are now supported as well.

The PGI compilers and tools are license-managed. For PGI Workstation products using PGI-style licensing (the default), a single user can run as many simultaneous copies of the compiler as desired, on a single system, and no license daemon or Ethernet card is required. However, usage of the compilers and tools is restricted to a pre-specified username. If you would like the PGI compilers and tools to be usable under any username, you must request FLEXlm-style license keys and use FLEXlm-style licensing. See section 1, PGI Workstation 3.2-4 Installation Notes, for a more detailed description of licensing options.

2.3 New Features

Following are the new features included in PGI Workstation 3.2-4:

A new and improved PGDBG that is fully thread-aware and OpenMP-aware.
Support for Pentium 4 processors and a new -tp piv switch to allow generation of Pentium 4 executables when compiling on non-Pentium 4 platforms
Full native OpenMP 1.1-compliant shared-memory parallel programming model in F77 and F90.
Full native OpenMP 1.0-compliant shared-memory parallel programming model in C and C++.
Automatic inline usage of Pentium 4 SSE2 (streaming SIMD extensions) instructions via the -Mvect=sse compile-time switch (must be compiling on a Pentium 4 or using the -tp piv switch to enable SSE2 code generation).
Automatic inline usage of Pentium III/4 SSE (streaming SIMD extensions) and prefetch instructions via the -Mvect=sse compile-time switch.
Automatic inline usage of Pentium III/4 or AMD Athlon prefetch instructions via the -Mvect=prefetch compile-time switch.
User-directed prefetch.
Loop unroll and jam optimizations in all languages.
PGF77 now supports automatic arrays.
PGPROF enhancements, including support for SMP profiling, and scaling analysis in OpenMP and HPF parallel programs.
Support for Linux distributions using glibc 2.0.x, glibc 2.1.x, and glibc 2.2.x, including enhanced PGI-supplied libpgthread.a and libpgthread.so libraries.
Updated EDG 2.40 C++ front-end and updated Rogue Wave C++ Standard Template Library (STL).
Support for debugging of PGI-compiled executables on Win32 using the gdb debugger, which is now included in the UNIX-like shell environment supplied along with the PGI software.
Support for dynamically-linked OpenMP and auto-threaded SMP parallel executables
End-user generation and management of license keys using a personalized account on the PGI web page
Updated hard copy and online documentation.

2.4 New Compiler Options

2.4.1 New Generic Options

Six new or updated generic compiler options (options which apply to all of the PGI compilers) have been added in release 3.2-4:

* -Mvect=sse - search for vectorizable loops and, where possible, use Pentium III SSE and/or prefetch instructions to improve performance. Using this switch, it is possible to automatically use the Pentium III SSE and/or prefetch instructions without making alterations in your source code. Pentium 4 optimizations are attempted as well.

* -Mvect=prefetch - search for vectorizable loops and, where possible, use prefetch instructions to improve performance. Using this switch, it is possible to automatically use the Pentium III or Athlon prefetch instructions without making alterations in your source code. Pentium 4 optimizations are attempted as well.

* -tp {px | p5 | p6 | piv | athlon} - set the target architecture. By default, the PGI compilers produce code specifically targeted to the type of processor on which compilation is performed. These executables may not be usable on processors that are not compatible with the processor on which compilation is performed. For example, an executable that uses Pentium III SSE instructions will fail to execute on a Pentium, Pentium II, or AMD Athlon processor. The -tp option allows you to specify a target architecture different from that on which compilation is performed. A blended style of code generation is produced when -tp px is specified, resulting in executables that will run on any x86-compatible system. Pentium-specific optimizations are specified with -tp p5. Pentium Pro/II/III-specific optimizations are specified with -tp p6. Pentium 4 specific optimizations are specified with -tp piv. AMD Athlon-specific optimizations are specified with -tp athlon.

* -M[no]dalign - (Don't) align double precision and long long variables in structures or common blocks on 8-byte boundaries. -Mnodalign may result in decreased performance.

* -Mnollalign - Don't align long long variables in structures on 8-byte boundaries.

* -M[no]free[form] - (Don't) Process source using Fortran 90 freeform specifications. The -Mnofree or -Mnofreeform option specifies fixed form formatting. By default, files with a .f90 extension use freeform formatting.

2.4.2 New Win32 Options

In addition to the generic options, Win32 users can specify the following options to enable debugging with gdb. A version of gdb is now included in the UNIX-like command environment that ships with the PGI compilers and tools for Win32 systems.

* -g - Compile/link for debugging using the gdb debugger. You must use this switch in combination with -Mstabs. Use of -g disables use of Pentium III/4 prefetch and SSE instructions.

* -Mstabs - Generate GNU STABS symbol information so that resulting executables can be debugged using gdb. You must use this switch in combination with -g . Use of -Mstabs disables use of Pentium III/4 prefetch and SSE instructions.

2.5 OpenMP Directives and Pragmas

Full support for the OpenMP Fortran Application Program Interface, Version 1.1 is included in release 3.2 of the PGF77 and PGF90 compilers. The PGI User's Guide, Chapter 10, contains a complete description of the OpenMP directives, functions, and environment variables supported by the PGF77 and PGF90 compilers.

Full support for the OpenMP C and C++ Application Program Interface, Version 1.0 is included in release 3.2 of the PGCC ANSI C and C++ compilers. The PGI User's Guide, Chapter 11, contains a complete description of the OpenMP pragmas, functions, and environment variables supported by the PGCC compilers.

For more information on the OpenMP programming model or to obtain copies of the OpenMP API specifications, see the URL http://www.openmp.org.

2.6 Pentium III, 4, and Athlon Support

2.6.1 Pentium III SSE Instructions

When the compiler switch -Mvect=sse is used on a Pentium III or in combination with -tp p6, the vectorizer in release 3.2 of the PGI Workstation compilers automatically uses Pentium III SSE and prefetch instructions where possible. This capability is supported by all of the PGI Fortran, C and C++ compilers, and is accomplished by generating SSE and prefetch instructions when vectorizable loops are detected (a modification in the generated assembly code - your source code remains unaltered).

Executables compiled using -Mvect=sse must be executed on a Pentium III system with an SSE-enabled operating system (Win32 4.0 Service Pack 4, or Linux kernel 2.2.10 or higher with the appropriate kernel patches).

NOTE

Program units compiled with -Mvect=sse will not execute correctly on Pentium, Pentium Pro, Pentium II or AMD Athlon processors. They will execute correctly only on Pentium III systems running an SSE-enabled operating system.

Using -Mvect=sse on the Pentium III, performance improvements of up to two times over equivalent scalar code sequences are possible. However, the Pentium III SSE instructions apply only to 32-bit floating-point data, and meaningful performance improvements occur only for unit-stride vector operations on data that is aligned on a cache-line boundary.

In the following program, the vectorizer recognizes the vector operation in subroutine 'loop' when the compiler switch -Mvect=sse is used. This example shows the compilation, informational messages, and runtime results using the SSE instructions, along with issues that affect SSE performance.


      program vector_op
      parameter (n = 99999)
      real*4 x(n),y(n),z(n),w(n)
      do i = 1,n
         y(i) = i
         z(i) = 2*i
         w(i) = 4*i
      enddo
      do j = 1, 10000
         call loop(x,y,z,w,1.0e0,n)
      enddo
      print*,x(1),x(771),x(3618),x(23498),x(99999)
      end
      subroutine loop(a,b,c,d,s,n)
      integer i,n
      real*4 a(n),b(n),c(n),d(n),s
      do i = 1,n
         a(i) =  b(i) + c(i) - s * d(i)
      enddo
      end

First note that the arrays are single-precision. SSE instructions only operate on single-precision data that is aligned on cache-line boundaries. You can guarantee that unconstrained local arrays (such as x, y and z defined in the program above) are aligned on cache-line boundaries by compiling with the -Mcache_align switch.

NOTE: Fortran common blocks are also aligned on cache-line boundaries when -Mcache_align is used. If you have arrays in common blocks on which you'd like to invoke SSE vectorization, you must pad the common blocks explicitly to ensure all arrays contained in the common blocks are properly aligned.

The examples below show results of compiling the example code above with and without -Mcache_align. Assume the program is compiled as follows:

 
% pgf90 -fast -Mvect -Minfo vector.f
vector_op:
     4, Loop unrolled 10 times
loop:
    18, Loop unrolled 5 times

No compile-time vectorization messages are emitted, which indicates that no loops are optimized by the vectorizer. Following is the result if the generated executable is run and timed on a standalone Pentium III 733 MHz system:


   % time a.out
   -1.000000  -771.0000  -3618.000  -23498.00  -99999.00    
   40.36u 0.020s 0:40.41 99.9%  0+0k 0+0io 115pf+0w

Now, recompile with SSE vector idiom recognition enabled:


% pgf90 -fast -Mvect=sse -Minfo vadd.f
vector_op:
   4, Unrolling inner loop 8 times
      Loop unrolled 7 times (completely unrolled)
loop:
    18, Generating sse code for inner loop
        Generated prefetch instructions for 3 loads

Note the informational message indicating that the loop has been vectorized and SSE and prefetch instructions have been generated.

Executing again, you should see results similar to the following:


   % time a.out
   -1.000000  -771.0000  -3618.000  -23498.00  -99999.00    
   30.410u 0.010s 0:30.50 99.7%  0+0k 0+0io 115pf+0w

The resulting executable is 33% faster than the non-SSE (scalar) version. However, there are further potential improvements available. In the compilation above, there is no guarantee that the starting addresses of vector data computed on using SSE instructions are aligned to cache-line boundaries. To ensure alignment of local arrays and common blocks, the -Mcache_align switch can be used. Using this switch combined with those used previously results in the following:


% pgf90 -fast -Mvect=sse -Mcache_align -Minfo vadd.f
vector_op:
   4, Unrolling inner loop 8 times
      Loop unrolled 7 times (completely unrolled)
loop:
    18, Generating sse code for inner loop
        Generated prefetch instructions for 3 loads

So, the same informational messages are emitted. Executing this version of the code, you should see results similar to the following:


   % time a.out
   -1.000000  -771.0000  -3618.000  -23498.00  -99999.00    
   25.120u 0.040s 0:25.21 99.8%  0+0k 0+0io 115pf+0w

The result is an executable that is 61% faster than the equivalent scalar (i.e. non-SSE) version of the program.

By careful coding in combination with the -Mvect=sse and -Mcache_align switches, it is possible to get substantial speed-ups on programs which operate on 32-bit stride-one floating point vectors. However, in some cases, codes which operate on unaligned or strided data can see performance degradations when compiling with -Mvect=sse. For this reason, PGI recommends that you always measure the performance of codes with and without -Mvect=sse rather than using this switch as a default for optimization.

2.6.2 Pentium III and Athlon Prefetch Instructions

When the compiler switch -Mvect=prefetch is used on a Pentium III/4 or Athlon-based system, or in combination with the -tp p6 or -tp athlon switches, the PGI Workstation 3.2 compilers automatically use (respectively) Pentium III/4 or Athlon prefetch instructions where possible. This capability is supported by all of the PGI Fortran, C and C++ compilers.

NOTE

Program units compiled with -Mvect=prefetch will not execute correctly on Pentium, Pentium Pro or Pentium II processors. They will execute correctly only on Pentium III/4 or Athlon systems. In addition, Pentium III/4 and Athlon prefetch instructions are not compatible. This means that program units that use Pentium III/4 prefetch instructions will not execute correctly on Athlon systems, and program units that use Athlon prefetch instructions will not execute correctly on Pentium III/4 systems.

Prefetch instructions are issued in advance of actual data accesses to ensure data is in cache when referenced with memory load and store instructions. Use of prefetch instructions can improve performance by minimizing the amount of time the processor stalls while waiting for data to arrive from main memory.

Unlike Pentium III SSE instructions, which operate only on 32-bit data, prefetch instructions can be used to improve performance of loops that operate on either 32-bit or 64-bit data structures.

Assume the example used in section 2.6.1 is converted to double precision. The examples below show the results of compiling that code without and then with -Mvect=prefetch. Assume the program is first compiled as follows:

 
% pgf90 -fast -Mvect -Minfo vector.f
vector_op:
     4, Loop unrolled 5 times
loop:
    18, Loop unrolled 3 times

No compile-time prefetch messages are emitted, which indicates that no loops are optimized using prefetching. Following is the result if the generated executable is run and timed on a standalone Pentium III 733 MHz system:


   % time a.out
   -1.000000  -771.0000  -3618.000  -23498.00  -99999.00    
   54.640u 0.010s 0:54.65 100.0%  0+0k 0+0io 114pf+0w

Now, recompile with prefetch enabled:


% pgf90 -fast -Mvect=prefetch -Minfo vadd.f
vector_op:
   4, Unrolling inner loop 4 times
      Loop unrolled 3 times (completely unrolled)
loop:
  18, Unrolling inner loop 4 times
      Used streaming stores for 1 stores
      Generated prefetch instructions for 3 loads

Note the informational message indicating that the loop has been optimized using prefetch instructions. Executing again, you should see results similar to the following:


   % time a.out
   -1.000000  -771.0000  -3618.000  -23498.00  -99999.00    
   44.490u 0.010s 0:44.50 100.0%  0+0k 0+0io 114pf+0w

A 23% performance improvement is realized. Similar performance improvements can be realized using -Mvect=prefetch on AMD Athlon-based systems.

By using the -Mvect=prefetch option, it is possible to get substantial speed-ups on programs which operate on either 32-bit or 64-bit floating point vectors. However, in some cases, codes which operate on unaligned or strided data can see performance degradations when compiling with -Mvect=prefetch. For this reason, PGI recommends that you always measure the performance of codes with and without -Mvect=prefetch rather than using this switch as a default for optimization.

2.6.3 User-directed Prefetch Instructions

Release 3.2 of the PGI compilers supports user-directed data prefetching. To use this capability, you must compile using the option -Mx,59,4. For the PGI Fortran compilers, the directive syntax is:


      c$mem prefetch <list of variables>

This directive prefetches the cache lines containing each of the listed variables at that point in the code. For example, the two inner loops of an unrolled matrix multiply with prefetch directives might look as follows:


         do j = 1, p
   c$mem prefetch arow(1),b(1,j)
   c$mem prefetch arow(5),b(5,j)
   c$mem prefetch arow(9),b(9,j)
            do k = 1, n, 4
   c$mem prefetch arow(k+12),b(k+12,j)
               c(i,j) = c(i,j) + arow(k) * b(k,j)
               c(i,j) = c(i,j) + arow(k+1) * b(k+1,j)
               c(i,j) = c(i,j) + arow(k+2) * b(k+2,j)
               c(i,j) = c(i,j) + arow(k+3) * b(k+3,j)
            enddo
         enddo

The syntax of the prefetch pragma in C/C++ is:


     #pragma mem prefetch(<list>)

Again, you must use the -Mx,59,4 option before the compilers will recognize the prefetch directive in Fortran and pragma in C/C++.

2.6.4 Pentium 4 SSE2 Instructions

Release 3.2-4 of the PGI compilers supports the Pentium 4 SSE2 instructions when the compiler switch -Mvect=sse is used on a Pentium 4, or in conjunction with the new -tp piv cpu-type code generation switch. In addition, as with Pentium III SSE instructions, the -Mcache_align switch may be used to obtain better performance.

As an illustration, here is an example similar to the Pentium III example in section 2.6.1.

vadd8.f:

      program vector_add
      parameter (N = 99999)
      common /x/ x
      common /y/ y
      common /z/ z
      real*8 x(N),y(N),z(N)
      real*8 t1,t2,t3,dclock
      do i = 1,N
         y(i) = i
         z(i) = 2*i
      enddo
      t1 = dclock()
      do j = 1, 10000
         call loop(x,y,z,N)
      enddo
      t2 = dclock()
      t3 = t2 - t1
      print*,x(1),x(771),x(3618),x(23498),x(99999)
      print*,"elapsed dclock time is: ",t3
      end
      subroutine loop(a,b,c,n)
      integer i,n
      real*8 a(n),b(n),c(n)
      do i = 1,n
         a(i) =  b(i) + c(i)
      enddo
      end

dclock.c:

#include <sys/time.h>
#include <sys/resource.h>
double dclock_() /* change to dclock() for ifc */
{
/* This structure should be portable */
    struct rusage ru;
    double t;

    getrusage(RUSAGE_SELF, &ru);
    t = (double)ru.ru_utime.tv_sec +
        (double)ru.ru_stime.tv_sec;
    t += ( (double)ru.ru_utime.tv_usec +
           (double)ru.ru_stime.tv_usec ) *
          1.0e-6;
    return t;
}

Assume vadd8.f and dclock.c are compiled as follows:

% pgcc -c dclock.c
% pgf77 -fast -Minfo  vadd8.f dclock.o
vadd8.f:
vector_add:
     8, Loop unrolled 5 times
    27, Loop unrolled 5 times
% a.out
  3.00   2313.00   10854.00   70494.00    299997.00     
  elapsed dclock time is:    14.86000000000000     
% pgf77 -fast -Minfo -Mvect=sse vadd8.f dclock.o
vadd8.f:
vector_add:
    8, Unrolling inner loop 4 times
       Loop unrolled 3 times (completely unrolled)
loop:
   27, Generating sse code for inner loop
       Generated prefetch instructions for 2 loads
% a.out
    3.00   2313.00   10854.00   70494.00   299997.00        
    elapsed dclock time is:    11.18000000000

2.7 Debugging with PGDBG

2.7.1 PGDBG 3.2 New Features

The following enhancements and bug fixes have been made in version 3.2 of PGDBG:

1. PGDBG 3.2 launches in graphical (GUI) mode by default; text mode is still fully supported using the -text command-line option

2. PGDBG 3.2-4 supports OpenMP on Solaris86 and linux86. The support includes a logical cpuid, a new threads window, and the ability to examine private variables and to step/next inside parallel regions (linux86 only). Thread support also applies to POSIX threads on Linux.

3. The PGDBG 3.2-4 X Windows GUI properly handles user-directed I/O - see http://www.pgroup.com/faq.htm for more details

4. PGDBG 3.2-4 properly handles non-prototyped C functions - previous releases of PGDBG would sometimes lose line information in the context of non-prototyped C functions

5. The PGDBG 3.2-4 X Windows GUI uses absolute rather than relative source file pathnames, allowing greater flexibility in where binaries are placed and/or debugged, provided the source files from which the binary was created are not moved. The font type can now be set in the GUI, and line numbers now appear in the source display area. Control-C now works in the GUI, and a thread control window has been added.

6. The number of directories that can be added to the search path for source files using the PGDBG dir command has been increased from 100 to 2048

7. PGDBG 3.2-4 now properly handles scoping of module procedures in Fortran 90 programs and member functions in C++ programs. Lexical scoping in Fortran 90 has been improved, and nested subroutines and private variables are now supported.

8. PGDBG 3.2 now prints the values in the floating-point stack and (if applicable) the SSE registers when the regs command is issued; PGDBG 3.2 also now supports initialization of these registers using the set command. PGDBG 3.2 does not recognize the SSE registers by default. Set the environment variable PGDBG_SSE to "on" to enable SSE support. For example "setenv PGDBG_SSE on". You must be running an SSE-enabled operating system for PGDBG SSE support to be well-defined.

9. PGDBG 3.2 correctly disassembles Pentium III SSE instructions under the disasm command. Pentium 4 SSE2 instructions are also disassembled.

10. PGDBG 3.2 has been enhanced to support hardware watchpoints on both Linux and Solaris86. See section 15.2.1.2 of the PGI User's Guide for a complete description of how to use hardware watchpoints. If you get the message: "ERROR: hardware watchpoints not supported" on Linux, you will need to patch your kernel to use this feature. See the PGI IA32 Linux compilers FAQ on the PGI website for a pointer to available patches.

11. In PGDBG 3.2, the call command now fully supports calling of functions, methods and procedures from the PGDBG command line (calls to library functions and system calls are not supported); the stacktrace and stackdump commands have been modified to report when a function has been called from the PGDBG command line

12. PGDBG 3.2 has been enhanced to properly handle Fortran 90 pointer variables; Fortran 90 pointers are now automatically dereferenced when used in an expression or PGDBG command

13. PGDBG 3.2 now supports optional triplet notations used in array subscripting (see section 15.1.4.8 of the PGI User's Guide); when a bound is excluded from a subscript expression, PGDBG automatically fills in the declared bound

14. PGDBG 3.2 supports debugging of dynamically linked executables and shared object files created using the PGI compilers

PGDBG can be used to debug F77, F90, C, C++, and assembly-language programs. It is not HPF-aware. PGDBG is currently available only on Linux and Solaris86. To use the graphical version of PGDBG, compile and link your program using the -g option and invoke the debugger as follows:


       % pgdbg a.out

If you wish to use the command-level interface (CLI), it is invoked using the command pgdbg with the -text option:


       % pgdbg -text a.out

Chapter 15 of the PGI User's Guide contains a complete description of PGDBG and how it is used, including an overview of the GUI.

2.7.2 Calling C++ Instance Methods

As noted above, the call command has been significantly enhanced in PGDBG 3.2. To call a C++ instance method, the object must explicitly be passed as the first parameter to the call. For example, given the following definition of class Person and the appropriate implementation of its methods:

        class Person {
            public:
            char name[10];
            Person(const char * name);
            void print();
        };
        int main(){
          Person * pierre;
          pierre =  new Person("Pierre");
          pierre->print();   // pierre is a pointer, so use ->
          return 0;
        }

To call the instance method print on object pierre, use the following syntax:

        pgdbg> call Person::print(pierre)

Notice that pierre is explicitly passed into the method, and the class name must also be specified.

2.7.3 PGDBG 3.2-4 Thread Support

PGDBG is able to debug SMP-parallel programs on both Linux86 and Solaris86. These are programs that are annotated with OpenMP directives and subsequently compiled with the -mp option using PGI compilers. These are also programs that have been auto-parallelized, using the -Mconcur option, by PGI compilers.

See the PGI User's Guide for more information on OpenMP, parallelization, and auto-parallelization.

Linux86

PGI compilers compile a parallel program (parallelized using either OpenMP directives or auto-parallelization) to the PGI pthreads library, libpgthread.so, based upon the Linuxthreads library. PGDBG can also debug programs that use libpthread.so directly (linked with -lpthread) on Linux86.

PGDBG can debug multiple threads on Linux86 systems with glibc version later than glibc-2.0.7 (or Linux systems later than 6.x).

The 'initial thread' and 'manager thread' described by PGDBG are the initial thread and manager thread employed by the Linux pthread library (the manager thread polls and services requests from other threads). These threads are distinguished by name in PGDBG. See the threads example below. While the state of the manager thread is accessible using PGDBG, it is an internal agent thread. Tampering with the manager thread could affect the behavior of the program in unexpected ways.

Solaris86

PGI compilers compile a parallel program (parallelized using either OpenMP directives or auto-parallelization) to Solaris Light Weight Processes (LWPs). PGDBG can debug parallel programs on systems running SunOS 5.6 or higher. PGDBG recognizes LWPs, which are not user threads, so PGDBG is not able to debug threaded programs linked directly with -lthread.

In PGDBG each thread is designated a unique thread ID or TID. On Linux86, this TID is the thread's OpenMP physical thread ID. Since threads are implemented as processes by the Linux pthread library, the process ID (PID) of each thread is also available on Linux86.

In PGDBG for Solaris86, the thread ID of a thread is its LWP ID.

PGDBG Commands and Threads

The following PGDBG commands are available for controlling and checking the status of active threads:

thread - Set active thread

threads - Print active threads

These commands are used in conjunction with the PGDBG control commands to run and inspect a single thread or all threads.

To view the list of currently active threads in PGDBG use the threads command. For example, on Linux86:

pgdbg> threads
(PID 873)  (manager thread) [Stopped by SIGSTOP]
__poll address: 0x400f0320
  Thread 1    (PID 874)                   [Stopped by SIGTRAP]
  main line: 56 in "simple.c" address: 0x804910a
=>Thread 0    (PID 872)  (initial thread) [Stopped by SIGTRAP]
  main line: 56 in "simple.c" address: 0x804910a

on Solaris86:

pgdbg> threads
=>Thread 4 [Stopped by SIGTRAP]
  f line: 18 in "simple.c" address: 0x804a4ee 
  Thread 3 [Stopped by SIGSTOP]
  __door_return address: 0xdff4ff8b 
  Thread 2 [Stopped by SIGSTOP]
  _signotifywait address: 0xdff5258a 
  Thread 1 [Stopped by SIGTRAP]
  f line: 18 in "simple.c" address: 0x804a4ee

For each thread, the threads command displays the

* thread ID

* reason for stopping [+]

* current location of thread in the program being debugged

The current thread is indicated by the arrow "=>". Use the thread command to change the current thread.

On Linux86 the TID or PID of each thread can be used to schedule a particular thread. The threads command distinguishes the manager and initial threads from the rest of the user threads on Linux86.

On Solaris86 the TID of each thread is its LWP ID. Some of the LWPs may be agent threads. See System Dependent Issues: Solaris86 issues below.

[+] SIGTRAP indicates that a breakpoint has been hit. A message is displayed whenever a thread hits a breakpoint. SIGSTOP is used internally by PGDBG. Its use is mostly invisible to the user.

PGDBG reports status messages in the following situations:

Situation              Message
Thread Created         [New Thread 2854]
Thread Exited          [Thread 2858 Exited with status 0]
Thread Killed          [Thread 2855 Killed by SIGSEGV]
Switch Current Thread  [Switching to Thread 2855]

Status messages report the LWP ID of the thread on Solaris86. Status messages report the PID of the thread in Linux86 (since the thread may not have been assigned its physical thread ID yet). See System Dependent Issues: Linux86 below.

The following commands control the execution of the current thread only (with a few exceptions):

step stepi stepout next nexti cont <TID>

See the PGI User's Guide for details. When in a serial region, these commands advance all threads. This allows PGDBG to step into parallel regions through thread initialization code.

OpenMP programs make use of implicit and explicit barriers at the end of parallel regions. An example of an explicit barrier is the:


#pragma omp barrier

directive. When used at a barrier, the above commands will advance all threads past the barrier.

The cont command takes an optional parameter, a thread ID, to continue a specific thread. With no parameter the cont command continues all threads.

cont - Continue program.

cont <TID> - Continue thread <TID> only.

The Linux86 manager thread is run freely during any PGDBG control command issued on a non-manager thread. This is so it can poll for thread creation requests. The users are still able to execute only the manager thread if they so desire.

The run, rerun, debug, and quit commands apply to the program as a whole, not to individual threads.

PGDBG user-defined events are defined across all threads:

break catch clear delete disable display do doi

enable hwatch hwatchread hwatchboth ignore status

stop stopi track tracki trace tracei undisplay

watch watchi when wheni

When an event fires for more than one thread, each thread ID is reported.

For example, the event display f@i (variable i in function f) is evaluated and displayed for each thread when all threads stop:

pgdbg> c
[ Thread 2751 in "simple2.c"@f ]
(1) i = 2 
[ Thread 2750 in __poll ]
(1) i = Can not access variable i; not in current function
[ Thread 2749 in "simple2.c"@f ]
(1) i = 0 
Stopped at 0x8048f58, function f, file simple2.c, line 15
 #15:         printf("in f\n");
pgdbg> threads
(PID 2750) (manager thread) [Stopped by SIGSTOP]
__poll address: 0x400f0320 
=>  Thread 1 (PID 2751)                  [Stopped by SIGTRAP]
    f line: 15 in "simple2.c" address: 0x8048f58 
    Thread 0 (PID 2749) (initial thread) [Stopped by SIGTRAP]
    f line: 15 in "simple2.c" address: 0x8048f58

When a single thread is continued, the event is evaluated for that thread only. For example:

pgdbg> c 0
(1) i = 2 
Stopped at 0x8048f58, function f, file simple2.c, line 15
 #15:         printf("in f\n");
pgdbg> threads
(PID 2750) (manager thread) [Stopped by SIGSTOP]
__poll address: 0x400f0320 
     Thread 1 (PID 2751)                  [Stopped by SIGTRAP]
     f line: 15 in "simple2.c" address: 0x8048f58 
=>   Thread 0 (PID 2749) (initial thread) [Stopped by SIGTRAP]
     f line: 15 in "simple2.c" address: 0x8048f58

c 0 says to continue thread 0 (c 2749 would have the same effect). Thread 2749 becomes the current thread.

Program locations are reported for the current thread (with the exception of events as described above). The following commands report information regarding the current thread only (use the thread command to switch the current thread):

/ ? arrive assign call call (*) class cread

decls disasm down dread dump enter file fp

fread func iread lines list lval names pc

regs retaddr rval scope set sp sread stacktrace

stackdump trace track up whatis where whereis

which

Expressions and registers are also evaluated with respect to the current thread. Each thread has its own state: registers, symbols, scope, and possibly private variables. The result of an expression evaluated in the context of one thread (the current thread) may differ from the result of the same expression evaluated in the context of another thread.

The PGDBG commands trace and track advance the current thread only. To continue all threads, disable trace and track events.

PGDBG limits the use of the call command to at most one thread at a time. See the PGI User's Guide for more information on PGDBG commands, or use the help command from within the debugger.

PGDBG Thread Behavior

When one thread stops, due to an event (for example a breakpoint), or a signal (for example SIGSEGV), PGDBG directs all threads to stop. On Linux86 it does so using SIGSTOP (signal 19). These 'internal stops' are never received by the process being debugged, and so they will not affect the behavior of your program.

On Solaris86 the threads are directed to stop in another way, which is invisible to the user (see proc(4): PIOCWSTOP).

PGDBG attempts to assign the current thread to an interesting thread when the program stops due to an event. For example, if a breakpoint is hit, PGDBG will set the current thread to the first thread in the list of active threads that hit the breakpoint. If a thread receives a signal, PGDBG will set the current thread to the thread that received the signal.

For example:

pgdbg> c
Program stopped
   Signal SIGSEGV
[Switching to Thread 2833]
Stopped at 0x8048f8d, function f, file seg.c, line 19
8048f8d:  88 02                         movb   %al,(%edx)
pgdbg> threads
(PID 2832) (manager thread) [Stopped by SIGSTOP]
__poll address: 0x400f0320 
=>   Thread 1 (PID 2833)                  [Stopped by SIGSEGV]
     f line: 19 in "seg.c" address: 0x8048f8d 
     Thread 0 (PID 2831) (initial thread) [Stopped by SIGSTOP]
     _mp_barrier address: 0x8049121

In the example above, thread 2833 received a segmentation fault (SIGSEGV), and PGDBG changed the current thread to the offending thread.

A thread is either running, stopped, killed, or exited. The threads command will display the stop, exit, or kill status of a thread. Whenever the user reaches a PGDBG command prompt, all threads are stopped with a stopped, killed, or exited status.

When a thread exits or is killed, all threads are stopped. Use the threads command to view the status of each thread. If the current thread exits or is killed, the current thread will become the first active thread on the currently active threads list (use threads command to view this list). If there is no such thread, then the program has exited.

PGDBG Threads and Signals

PGDBG intercepts all signals sent to any of the threads in a multi-threaded program, and passes them on according to that signal's disposition maintained by PGDBG (see the catch, ignore commands).

If a thread runs into a busy loop, or if the program runs into deadlock, type control-C at the debugger command line to interrupt the threads. This causes SIGINT to be sent to all threads. By default PGDBG does not relay SIGINT to any of the threads, so in most cases program behavior is not affected.

Sending a SIGINT (control-C) to a program while it is in the middle of initializing its threads (calling omp_set_num_threads(), or entering a parallel region) may kill some of the threads if the signal is sent before each thread is fully initialized. Avoid sending SIGINT in these situations. When the number of threads employed by a program is large, thread initialization may take a while.

Signals Used by PGDBG

SIGTRAP indicates a breakpoint has been hit. A message is displayed whenever a thread hits a breakpoint. SIGSTOP is used internally by PGDBG. Its use is mostly invisible to the user. Changing the disposition of these signals in PGDBG will result in undefined behavior.

Reserved Signals: On Linux86, the thread library uses SIGRT1, SIGRT3 to communicate among threads internally. In the absence of real-time signals in the kernel, SIGUSR1, SIGUSR2 are used. Changing the disposition of these signals in PGDBG will result in undefined behavior.

The PGDBG GUI and Threads

A new window has been added to the PGDBG GUI, which displays the state of each active thread. The current thread can be changed by a click of the mouse. The source code pane shows the current thread's position in the program. The REGISTER, DISASM, and CUSTOM windows also reflect the state of the current thread. Expressions are evaluated relative to the current thread.

The MEMORY window can be used to examine memory across all active threads.

System Dependent Issues

Under Linux86, PGDBG can debug multiple threads on systems with a glibc version later than 2.0.x (or later than Linux 6.x). Following are system-dependent issues on Linux86:

1. If the manager exits or is killed, thread behavior in the debugger is undefined. XMM state is not refreshed in this case. Under recent implementations, SIGHUP is sent to all threads before the manager dies.

2. Physical thread IDs are not filled in until the program has landed in a parallel region. The physical thread ID of a thread is in effect for the lifetime of a thread. Threads are fully initialized when their physical thread IDs appear when using the 'threads' command. The manager thread does not have a physical thread ID, since it is an agent thread, not a user thread.

3. Debugging programs linked explicitly with both -lpthread and -mp or -Mconcur will lead to undefined behavior in PGDBG. PGI compilers do not support this combination.

Following are system-dependent issues on Solaris86:

1. PGDBG can debug multiple threads on systems running SunOS 5.6 or higher.

2. LWP IDs are used to identify threads under PGDBG for Solaris86. Agent LWPs are not distinguished from user threads. By running to a breakpoint, source position can be used to determine which threads are user threads. Thread 2 and Thread 3 are usually agent LWPs:

pgdbg> threads
=>   Thread 7 [Stopped by SIGTRAP]
     f1 line: 18 in "small.c" address: 0x804a7ae
     Thread 6 [Stopped by SIGTRAP]
     f1 line: 38 in "small.c" address: 0x804a7e3
     Thread 5 [Stopped by SIGTRAP]
     f1 line: 18 in "small.c" address: 0x804a7ae
(**) Thread 3 [Stopped by SIGSTOP] 
     __door_return address: 0xdff4ff8b
(**) Thread 2 [Stopped by SIGSTOP]
     _signotifywait address: 0xdff5258a
     Thread 1 [Stopped by SIGTRAP]
     f1 line: 18 in "small.c" address: 0x804a7ae
 pgdbg>

3. When a program uses more than OMP_NUM_THREADS threads (by calling omp_set_num_threads(n) for n larger than OMP_NUM_THREADS), or if OMP_NUM_THREADS is set to a value that is larger than the number of processors on your system, some of the threads will be put to sleep until the kernel gives them cycles to run. A thread that is asleep will not run since the scheduler does not currently schedule it. Threads may move in and out of sleep many times while the program executes.

4. By using breakpoints together with the cont command (instead of commands that continue only one, possibly sleeping, thread), you can avoid this complication. Try setting a breakpoint at a point in a parallel region that all threads will eventually hit. Repeated use of the cont command will iterate through each thread as it hits the breakpoint.

5. Any threads created by a program linked with -lthread are not visible in PGDBG under Solaris86.

More generally:

1. When the environment variable PGDBG_SSE is set to on, PGDBG adds the SSE registers to the state of each thread. The PGDBG control commands become very slow when SSE support is enabled, since the SSE state must be extracted and flushed for each active thread. When debugging a multi-threaded program, unsetenv PGDBG_SSE when it is not necessary to inspect these registers.

2. To use physical thread IDs (Linux86 only), xmm registers, and hwatchpoints, compile your program with -g using the 3.2-4 PGI compilers.

3. Physical thread IDs (Linux86 only), xmm registers, and hwatchpoints are not available while debugging legacy programs compiled with pre-3.2-3 PGI compilers. Legacy objects compiled with pre-3.2-3 PGCC can be re-linked (using 3.2-3 or 3.2-4 compilers) to enable this support using the -Wl,-u,__pgdbg_stub option.

4. Nexting and stepping over an OpenMP barrier may be slow if the number of threads is large.

2.7.4 PGDBG Scoping

PGDBG 3.2-4 supports various levels of language scoping. The changes are reflected in nested subroutines, Fortran 90 modules, lexical blocks, and private variables.

Nested Subroutines

To reference a nested subroutine you must qualify its name with the name of its enclosing function using the scoping operator @.

For example:

subroutine subtest (ndim)
integer(4), intent(in) :: ndim
integer, dimension(ndim) :: ijk
call subsubtest ()
contains
    subroutine  subsubtest ()
    integer :: I
    i=9
    ijk(1) = 1
    end subroutine subsubtest
    subroutine  subsubtest2 ()
    ijk(1) = 1
    end subroutine subsubtest2
end subroutine subtest           
program testscope
integer(4), parameter :: ndim = 4
call subtest (ndim)
end program testscope

pgdbg> break subtest@subsubtest
breakpoint set at: subsubtest line: 8 in "ex.f90" address: 0x80494091
pgdbg> names subtest@subsubtest 
i = 0
pgdbg> decls subtest@subsubtest 
arguments:
variables:
integer*4 i;
pgdbg> whereis subsubtest
function:       "ex.f90"@subtest@subsubtest

Fortran 90 Modules

To access a member mm of a Fortran 90 module M you must qualify mm with M using the scoping operator @. If the current scope is M the qualification can be omitted.

For example:

module M
implicit none
real mm
contains
subroutine stub
print *,mm
end subroutine stub
end module M
program test
use M
implicit none
call stub()
print *,mm
end program test

pgdbg> Stopped at 0x80494e3, function MAIN, file M.f90, line 13
#13:       call stub()
pgdbg> which mm
"M.f90"@m@mm
pgdbg> print "M.f90"@m@mm
0
pgdbg> names m
mm = 0
stub = "M.f90"@m@stub
pgdbg> decls m
real*4 mm;
subroutine stub();
pgdbg> print m@mm
0
pgdbg> break stub
breakpoint set at: stub line:6 in "M.f90" address: 0x8049446      1
pgdbg> c
Stopped at 0x8049446, function stub, file M.f90, line 6
Warning: Source file M.f90 has been modified more recently than object file
#6:           print *,mm
pgdbg> print mm
0
pgdbg>

Lexical Blocks

PGDBG now understands lexical blocks. Line numbers are used to name lexical blocks. The line number of the first instruction contained by a lexical block indicates the start scope of the lexical block.

Below variable var is declared in the lexical block starting at line 5. The lexical block has the unique name "lex.c"@main@5. The variable var declared in "lex.c"@main@5 has the unique name "lex.c"@main@5@var.

For example:

lex.c:
main()
{
    int var = 0;
    {
        int var = 1;
        printf("var %d\n",var);
    }
    printf("var %d\n",var);
}
pgdbg> n
Stopped at 0x8048b10, function main, file
/home/pete/pgdbg/bugs/workon3/ctest/lex.c, line 6
#6:         printf("var %d\n",var);
pgdbg> print var
1
pgdbg> which var
"lex.c"@main@5@var
pgdbg> whereis var
variable:       "lex.c"@main@var
variable:       "lex.c"@main@5@var
pgdbg> names "lex.c"@main@5
var = 1

Private Variables

PGDBG 3.2-4 understands private variables with some restrictions. In particular, inspecting private variables while debugging Fortran programs is not supported.

Private variables in C must be declared in the enclosing lexical block of the parallel region in order for them to be visible using PGDBG.

For example:

{
    #pragma omp parallel     
    {
        int i;
        ...
        /* i is private to 'this' thread */
        ...
    }
}

In the above case, i would be visible inside PGDBG for each thread. However, in the following example, i is not visible inside PGDBG:

{
    int i;
    #pragma omp parallel private(i)    
    {
        ...
        /* i is private to 'this' thread 
           but not visible within PGDBG */
        ...
    }
}

A private variable of a Thread A is accessed by switching the current thread to A, and by using the name (qualified if necessary) of the private variable.

For example:

%pgcc -g -mp lex.c  
%pgdbg -text a.out
....
pgdbg> break 8
pgdbg> cont
....
[New Thread 2309 (manager thread)]
[New Thread 2310]
Stopped at 0x8048fe2, function main, file
/home/pete/pgdbg/bugs/workon3/ctest/lex.c, line 8
#8:         printf("var %d\n",var);
pgdbg> list
#3:       int var = 2;
#4:
#5:     #pragma parallel local(var)
#6:       {
#7:         int var = omp_get_thread_num();
#8:==>>     printf("var %d\n",var);
#9:     #pragma synchronize
#10:       }
#11:       printf("var %d\n",var);
#12:     }
pgdbg> threads		 // current thread is Thread 0
(PID 2309)  (manager thread) [Stopped by SIGSTOP]
__poll address: 0x400f0320
   Thread 1  (PID 2310)                   [Stopped by SIGTRAP]
   main line: 8 in "lex.c" address: 0x8048fe2
=> Thread 0  (PID 2308)  (initial thread) [Stopped by SIGTRAP]
   main line: 8 in "lex.c" address: 0x8048fe2
pgdbg> which var           // var is declared in
"lex.c"@main@7@var         // lexical block, and is
                           // private to each thread
pgdbg> print var           // print var from Thread 0
0
pgdbg> thread 1            // switch to Thread 1
Stopped at 0x8048fe2, function main, lex.c, line 8
#8:         printf("var %d\n",var);
pgdbg> print var           // print var from Thread 1
1

2.7.5 PGDBG GUI Modifications

Setting the Font

Use the xlsfonts command to list all fonts installed on your system, then choose one you like. For this example, we choose a sony font that is completely specified by the following string:

-sony-fixed-medium-r-normal--24-230-75-75-c-120-iso8859-1

There are two ways to set the font that your PGDBG GUI uses.

1. Use your .Xresources file:

Xpgdbg*font : <chosen font>
pgdbg*font : <chosen font>

For example:

pgdbg*font : -sony-fixed-medium-r-normal--24-230-75-75-c-120-iso8859-1

You will have to merge these changes into your X environment for them to take effect. You can use the following command:

       % xrdb -merge $HOME/.Xresources

2. Use the command-line option -fn <font>. For example:

% pgdbg -fn -sony-fixed-medium-r-normal--0-0-100-100-c-0-jisx0201.1976-0...

Thread Control Window

A new window has been added to the PGDBG GUI to display the state of each active thread. The current thread can be changed by a click of the mouse.

The source code pane shows the current thread's position in the program. The REGISTER, DISASM, and CUSTOM windows also reflect the state of the current thread. Expressions are evaluated relative to the current thread. The MEMORY window can be used to examine memory across all active threads.

Control-C from GUI

The active window must be the command window (upper window) where the PGDBG prompt appears for control-C to interrupt the program being debugged.

2.7.6 PGDBG 3.2 and Shared Object Files

PGDBG 3.2 supports debugging of dynamically linked executables that reference shared object files created using the PGI compilers. If the executable being debugged is dynamically linked, PGDBG will report when each shared object is loaded and/or unloaded.

For example:

  pgdbg> ...
  pgdbg> n
  Stopped at 0x8048bee, function main, file   
  dynload.c, line 36
  #36: handle = dlopen("libpetesSO2.so",RTLD_NOW);
  pgdbg> n
  libpetesSO2.so loaded by ld-linux.so.2.
  Stopped at 0x8048c31, function main, file
  dynload.c, line 41
  #41:       if (handle){
  pgdbg> n
  Stopped at 0x8048c37, function main, file
  dynload.c, line 42
  #42:         dlclose(handle);
  pgdbg> n
  libpetesSO2.so unloaded by ld-linux.so.2.
  Stopped at 0x8048c42, function main, file
  dynload.c, line 45
  #45:     }
  pgdbg> ...

The global symbols defined by a dynamically linked shared object are visible during a PGDBG debug session. These symbols are currently available only without type and line number information. The machine level PGDBG commands (breaki, dump, hwatch, disasm, etc) are useful for inspecting these symbols. Each symbol is available with respect to the load status of its defining shared object.

For example, dynamically-linkable Position Independent Code (PIC) is implemented using a Procedure Linkage Table (PLT) and Global Offset Table (GOT). Each PIC function is bound lazily at run-time. If a function has not been linked dynamically, PGDBG reports the address of its PLT entry as its address. If a function has been linked dynamically, PGDBG reports the virtual address of the function itself. So, PGDBG reports the current or "effective" address of symbols with respect to dynamic linking and loading. PGDBG treats global symbols defined in shared objects in a similar way. The address of a global variable may be the address of its GOT entry or an absolute address, depending in part on its load status.

2.7.7 PGDBG 3.2-4 Known Limitations

Version 3.2-4 of PGDBG has the following limitations:

1. It cannot process core files.

2. Output written to stdout from the process being debugged is block buffered to the GUI. In order to flush the buffer you must call fflush from within your program.

3. On Solaris86, PGDBG 3.2-4 does not support logical cpuid, and stepping in a threaded region can hang.

2.7.8 Debugging with gdb on Win32 Systems

The PGI Workstation 3.2 compilers for Win32 support generation of GNU STABS format debug information under control of the -g and -Mstabs compile/link switches. This enables debugging of PGI-compiled programs using the version of gdb included in mingw32, which is the UNIX-like environment included with the PGI Workstation for Win32 software package.

Once you have created an executable (for example a.out) using the above switches, simply invoke gdb as follows:

	
% gdb a.out

within a PGI Workstation 3.2 shell window.

Note that there are shortcomings in gdb with respect to its ability to debug Fortran - in particular it doesn't support COMPLEX data types and cannot examine data included in Fortran COMMON blocks. Also, on Win32 gdb doesn't understand the 'drive' (C:\) syntax of path names, so you must use gdb commands to set the source directory paths. The Win32 version of gdb does allow you to set and run to function and line breakpoints, examine variables, list source lines, and examine stack traces.

2.8 Profiling with PGPROF

The PGPROF profiler is a tool that analyzes tracefiles generated during execution of specially compiled C, C++, F77, F90 and HPF programs. It allows programmers to discover which functions and lines were executed, how often they were executed and how much of the total execution time they consumed.

On multiprocessor systems, the PGPROF profiler also allows you to view information on a processor-by-processor basis for HPF programs and on a thread-by-thread basis for OpenMP programs. You can view a summary of minimum or maximum execution times for each program unit or line, or view performance data for each individual processor or thread. This information can be used to identify communication patterns in HPF programs, load balancing problems in HPF or OpenMP programs, and identify the portions of a program that will benefit the most from performance tuning.

2.8.1 Analyzing Scalability of Parallel Programs

PGPROF 3.2 now supports scaling analysis of HPF and OpenMP parallel programs. If you have not used PGPROF previously, read through Chapter 14 of the PGI User's Guide for a description of the capabilities of PGPROF and how it is used. Once you are familiar with PGPROF, follow these steps to utilize PGPROF 3.2 scaling analysis:

1. Compile your parallel program using the appropriate parallelizing PGI compiler - PGHPF for HPF programs, PGF90 for F90 OpenMP programs, PGF77 for F77 OpenMP programs, PGCC for OpenMP C programs or PGC++ for OpenMP C++ programs. In addition to the options you normally use, add the option -Mprof=func during compilation and linking.

2. Run the resulting executable on a single processor. See section 1.4 of the PGI User's Guide for a brief introduction to running OpenMP parallel and HPF parallel programs on 1 or more processors.

3. At the completion of the single-processor run, a PGPROF tracefile named pgprof.out is automatically written to your current working directory. Rename pgprof.out to (for example) pgprof.out.1.

4. Rerun the executable on (for example) 2 processors. At the completion of the 2 processor run, a PGPROF tracefile named pgprof.out is again automatically written to your current working directory. Rename pgprof.out to (for example) pgprof.out.2.

5. Invoke PGPROF using the following command:

% pgprof -scale pgprof.out.1 pgprof.out.2

PGPROF opens a window for each pgprof.out file; the first one listed is taken to be the base run against which scaling is computed. In this example, the base run is on 1 processor, but it could be on any number of processors. A scaling metric is displayed in the window for each subsequent pgprof.out file, comparing the time values against those of the base run. Two or more pgprof.out files can be specified - a separate PGPROF window will be opened for each one. Negative scaling indicates the program slows down with additional processors, positive scaling indicates program speedup.

Alternatively, this can be done with the PGPROF GUI menus. Open a pgprof.out tracefile for the base run as usual, and subsequent files under the File menu using the Scalability Comparison option.

Performance scaling can be analyzed at the function level, or even at the line level if you compile a given program unit using the -Mprof=lines option. However, -Mprof=lines can sometimes incur substantial execution overhead. For this reason, it is advisable to compile only selected program units with this option rather than compiling your entire application with line profiling enabled.

2.9 LAPACK, the BLAS and FFTs

2.9.1 Pre-compiled BLAS and LAPACK Math Libraries

Precompiled versions of the BLAS and LAPACK math libraries are included for all target systems in the files $PGI/<target>/lib/libblas.a and $PGI/<target>/lib/liblapack.a. These can be linked in to your applications by simply placing the -llapack -lblas options on the link line:

% pgf77 myprog.F -llapack -lblas

Note that these libraries are compiled with switches that are relatively optimal but fully portable across the various IA32 architectures. In particular, they do not take advantage of Pentium III/4 SSE/SSE2 instructions, Pentium III prefetch instructions, or Athlon prefetch instructions. If you would like to rebuild libblas.a and liblapack.a on a Pentium III, PGI recommends using the following options:

-fast -pc 64 -Mvect=sse -Mcache_align -Kieee

NOTE: slamch.f and dlamch.f must be compiled -O0!

If you would like to rebuild libblas.a and liblapack.a on an AMD Athlon, PGI recommends using the following options:

-fast -pc 64 -Mvect=prefetch -Kieee

As on the Pentium III, slamch.f and dlamch.f must be compiled -O0.

2.9.2 Assembly-coded Math Libraries

On Win32 systems, assembly-coded BLAS and FFT routines are included in the file $PGI/<target>/lib/libmkl.a. You can specify that these should be linked in place of the standard (compiled Fortran) version of the BLAS using the -lmkl link time option:

% pgf90 myprog.F -lmkl -llapack

For more information about this library, see the URL:

http://support.intel.com/support/performancetools/libraries

A similar library is available for Linux systems, but cannot be shipped with the PGI compilers for legal reasons. However, you may obtain it at no cost at the following URL:

http://www.cs.utk.edu/~ghenry/distrib/index.htm

Follow the instructions for obtaining the software, install it in the file $PGI/linux86/lib/libmkl.a, and compile/link as above for Win32. NOTE: The contents of this library are similar but not identical to libmkl.a for Win32. Also, you must link with -g77libs when using this library.

2.10 Fortran calling conventions on Win32

All Microsoft calling conventions including Fortran STDCALL are supported by the PGI Fortran compilers. In addition, the PGI Fortran compilers support UNIX-style calling conventions on Win32. This allows simple porting of mixed Fortran/C applications from UNIX to Win32.

IMPORTANT: Object files compiled using release 1.7-6 or prior of the PGI Fortran compilers for Win32 are not compatible with object files compiled using releases 3.0 or 3.2.

Section 6.14 of the PGI User's Guide contains a detailed description of all supported Fortran calling conventions under Win32.

2.11 OpenMP Tutorial

A self-guided online tutorial is available to help you become familiar with how OpenMP parallelization directives are used. In particular, the tutorial takes the user step by step through the process of parallelizing the NAS FT benchmark using OpenMP directives. The tutorial can be found at:

    ftp://ftp.pgroup.com/pub/SMP

You can download this file using a web browser, and unpack the file using the following commands:

       % gunzip fftpde.tar.gz
       % tar xvf fftpde.tar

Change directories to the fftpde sub-directory, and follow the instructions in the README file.

2.12 PGCC C and C++ Compiler Notes

This release contains the EDG 2.40 C++ front-end.

2.13 PGI Workstation 3.2 and glibc

Release 3.2 of PGI Workstation is built and validated under both the Linux 2.0.36 and 2.2.x kernels. Newer distributions of Linux, such as Red Hat 6.x and 7.0 and SuSE 6.x and 7.0, incorporate revision 2.2.x of the Linux kernel and glibc2.1.x. If you are using a revision of Linux that includes the 2.2.x kernel and glibc 2.1.x, it will be detected automatically by the PGI Workstation installation script. Your installation will be modified as appropriate for these systems.

2.14 PGI Workstation for Win32

2.14.1 The PGI Workstation Shell Environment

On Win32, a UNIX-like shell environment is bundled with PGI Workstation. After installation, a double-left-click on the PGI Workstation icon on your desktop will launch a bash shell command window with pre-initialized environment settings. Most familiar UNIX commands are available (vi, emacs, sed, grep, awk, make, etc). If you are unfamiliar with the bash shell, reference the user's guide included with the online HTML documentation.

Alternatively, you can launch a standard Win32 command window pre-initialized for usage of the PGI compilers by selecting the appropriate option from the PGI Workstation program group accessed in the usual way through the "Start" button.

Except where noted in the PGI User's Guide, the command-level PGI compilers and tools on Win32 function identically to their UNIX counterparts. You can customize your command window (white background with black text, add a scroll bar, etc.) by right-clicking on the top border of the PGI Workstation command window, selecting "Properties", and making the appropriate modifications. When the changes are complete, Win32 will allow you to apply the modifications globally to any command window launched using the PGI Workstation desktop icon.

2.14.2 PGI Compilers for Win32 in MKS Toolkit

The PGI Workstation 3.2 command-level compilers can be used from within an MKS Toolkit shell window (for more information on the MKS Toolkit from Mortice Kern Systems, see http://www.mks.com).

After installing PGI Workstation as outlined in section 1, issue the following commands from within an MKS Korn shell to initialize your environment and path:


       % PGI=C:/pgi
       % export PGI
       % PATH="C:\PGI\nt86\bin;$PATH"

The UNIX-style manual pages must be viewed in their HTML form on Win32. See section 4 for information on how to view the HTML documentation.

2.14.3 DLLs under Win32

To create dynamically linked libraries (DLLs) using the PGI compilers for Win32, you must use the utilities dlltool and dllwrap, which are included as part of the PGI Workstation for Win32 command environment.

The tools dlltool and dllwrap need the full cygwin environment available through Redhat in order to provide the libraries related to the -cyglibs switch used below. Here are the steps in the process.

Step 1 - Use dlltool to create a .def file from the object file(s) you wish to have included in the DLL. The .def file includes entry points and intermediate code for all of the functions/subroutines in the DLL. This intermediate code replaces the actual objects in an executable that references the DLL, and causes the objects to be loaded from the static .a library file at runtime. Only the objects that are to be included in the DLL are entered here.

To create a DLL from the object code in files object1.o and object2.o, create a file obj12.def as follows:


    % dlltool --export-all --output-def obj12.def \
    object1.o object2.o

Step 2 - Create the intermediate DLL file using dllwrap. This step requires a complete linking of the objects declared previously, ensuring that any DLL entries referenced in the target DLLs have all of their symbols resolved by the linker (the resolved symbols can also be DLLs).

Assuming the objects object1.o and object2.o are compiled by PGF90, do the following to create obj12.dll from the objects and the required PGF90 libraries:


    % dllwrap  --def obj12.def  -o obj12.dll \
    --driver-name pgcc  object1.o object2.o \
    -L. -dll -cyglibs -lpgf90 -lpgf90_rpm1 -lpgf902 \
    -lpgf90rtl -lpgftnrtl

If the objects are compiled using PGF77, you need only include the reference to -lpgftnrtl (i.e. you can omit the references to -lpgf90, -lpgf90_rpm1, -lpgf902 and -lpgf90rtl). If the objects are compiled using PGCC, you need not include any of the PGI Fortran runtime library references.

The dllwrap command creates a series of commands to send to the linker, among which is -nostartfiles, which directs pgcc to not load various startup files into the list of object files sent to the linker.

Step 3 - Use dlltool again to create the libobj12dll.a library file from obj12.dll and obj12.def.


    % dlltool --dllname obj12.dll --def obj12.def  \
    --output-lib libobj12dll.a

As an example, consider the following source files, object1.f:


      subroutine subf1 (n)
      integer n
      n=1
      print *,"n=",n
      return
      end

and object2.f:


      function funf2 ()
      real funf2
      funf2 = 2.0
      return
      end

and prog.f:


      program test
      external subf1
      real funf2, val
      integer n
      call subf1(n)
      val = funf2()
      write (*,*) 'val = ', val
      stop
      end

Create the DLL libobj12dll.a using the steps above. To create the test program using libobj12dll.a, do the following:


    % pgf90 -o test prog.f -L. -lobj12dll

Should you wish to change libobj12dll.a without changing the subroutine or function interfaces, no rebuilding of test is necessary. Just recreate libobj12dll.a, and it will be loaded at runtime.
