Chapter 6. Performance Tuning

After analyzing your code to determine where performance bottlenecks are occurring, you can turn your attention to making your programs run their fastest. One way to do this is to use multiple CPUs in parallel processing mode. However, this should be the last step. The first step is to make your program run as efficiently as possible on a single processor system and then consider ways to use parallel processing.

This chapter describes the process of tuning your application for a single-processor system, and then tuning it for parallel processing, in the following sections:

  • “Single Processor Code Tuning”

  • “Multiprocessor Code Tuning”

  • “Understanding Parallel Speedup and Amdahl's Law”

It also describes how to improve the performance of floating-point programs in “Floating-point Programs Performance”.

Single Processor Code Tuning

Several basic steps are used to tune performance of single-processor code:

  • Getting the correct results

  • Managing heap corruption problems

  • Using tuned code

  • Determining tuning needs

  • Using compiler options where possible

  • Tuning the cache performance

  • Managing memory

Getting the Correct Results

One of the first steps in performance tuning is to verify that the correct answers are being obtained. Once the correct answers are obtained, tuning can be done. You can verify answers by initially disabling specific optimizations and limiting default optimizations. This can be accomplished by using specific compiler options and by using debugging tools.

The following compiler options emphasize tracing and porting over performance:

  • -O0: the -O0 option disables all optimization. The default is -O2.

  • -g: the -g option preserves symbols for debugging.

  • -mp: the -mp option limits floating-point optimizations and maintains declared precision.

  • -IPF_fltacc: the -IPF_fltacc option disables optimizations that affect floating-point accuracy.

  • -r8, -i8: the -r8 and -i8 options set default real, integer, and logical sizes to 8 bytes, which is useful for porting codes from Cray, Inc. systems. When you use these options, explicitly declare the types of intrinsic and external library functions.

Some debugging tools can also be used to verify that correct answers are being obtained. See “Debugging Tools” in Chapter 3 for more details.

Managing Heap Corruption Problems

Two methods can be used to check for heap corruption problems in programs that use glibc malloc/free dynamic memory management routines: environment variables and Electric Fence.

Set the MALLOC_CHECK_ environment variable to 1 to print diagnostic messages or to 2 to abort immediately when heap corruption is detected.

Electric Fence is a malloc debugger. It aligns either the start or end of an allocated block with an invalid page, causing segmentation faults to occur on buffer overruns or underruns at the point of error. It can also detect accesses to previously freed regions of memory.

Overruns and underruns are circumstances where an access to an array is outside the declared boundary of the array. Underruns and overruns cannot be simultaneously detected. The default behavior is to place inaccessible pages immediately after allocated memory, but the complementary case can be enabled by setting the EF_PROTECT_BELOW environment variable. To use Electric Fence, link with the libefence library, as shown in the following example:

% cat foo.c
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    int i;
    int *a;
    float *b;

    a = (int *)malloc(1000*sizeof(int));
    b = (float *)malloc(1000*sizeof(float));

    a[0] = 1;
    for (i=1; i<1001; i++) {
        a[i] = a[i-1] + 1;
    }
    for (i=1; i<1000; i++) {
        b[i] = a[i-1]*3.14;
    }

    printf("answer is %d %f \n", a[999], b[999]);
}

Compile and run the program as follows. The first loop writes one element past the end of the a array; note that this error is detected only when the program is linked with the Electric Fence library:

% ecc foo.c
% ./a.out
answer is 1000 3136.860107
% ecc foo.c -lefence
% ./a.out
Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens
Segmentation fault

To avoid potentially large core files, the recommended method of using Electric Fence is from within a debugger. See the efence man page for additional details.

Using Tuned Code

Where possible, use code that has already been tuned for optimum hardware performance.

The following mathematical libraries should be used where possible to help obtain the best results:

  • SCSL: SGI's Scientific Computing Software Library. This library includes BLAS, LAPACK, FFT, convolution/correlation, and iterative/direct sparse solver routines. Documentation is available via online man pages (see intro_scsl) and the SCSL User's Guide available on the SGI Technical Publications Library at http://docs.sgi.com.

  • MKL: Intel's Math Kernel Library. This library includes BLAS, LAPACK, and FFT routines.

  • VML: the Vector Math Library, available as part of the MKL package (libmkl_vml_itp.so).

  • Standard Math library

    Standard math library functions are provided with the Intel compiler's libimf.a file. If the -lm option is specified, glibc libm routines are linked in first.

Documentation is available for MKL and VML, as follows: http://intel.com/software/products/perflib/index.htm?iid=ipp_home+software_libraries& .
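
For example, rather than hand-coding a triple loop for a matrix multiply, you can call the tuned BLAS routine DGEMM. The following minimal C sketch assumes the common Fortran BLAS calling convention on Linux (a trailing underscore and arguments passed by reference) and assumes that you link with a tuned BLAS implementation, for example SCSL (-lscs) or the appropriate MKL libraries; the matrix size and values are purely illustrative.

#include <stdio.h>

/* Fortran BLAS entry point; provided by SCSL or MKL when you link with it. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

#define N 4

int main(void)
{
    /* Column-major storage, as the Fortran BLAS routines expect. */
    double a[N*N], b[N*N], c[N*N];
    double alpha = 1.0, beta = 0.0;
    int i, n = N;

    for (i = 0; i < N*N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.0;
    }

    /* C = alpha*A*B + beta*C, computed by the tuned library routine. */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    printf("c[0] = %f\n", c[0]);   /* expect 8.0 for these inputs */
    return 0;
}

The same call works with any tuned BLAS library; only the link line changes.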

Determining Tuning Needs

Use the following tools to determine what points in your code might benefit from tuning:

  • time: Use this command to obtain an overview of user, system, and elapsed time.

  • gprof: Use this tool to obtain an execution profile of your program (a pcsamp profile). Use the -p compiler option to enable gprof use.

  • VTune: This Intel performance monitoring tool is a Linux-server, Windows-client application. It supports remote sampling on all Itanium and Linux systems.

  • pfmon: This performance monitoring tool is designed for Itanium and Linux. It uses the Itanium Performance Monitoring Unit (PMU) to do counting and sampling on unmodified binaries.

For information about other performance analysis tools, see Chapter 3, “Performance Analysis and Debugging”.

Using Compiler Options Where Possible

Several compiler options can be used to optimize performance. For a short summary of ifort or ecc options, use the -help option on the compiler command line. Use the -dryrun option to show the driver tool commands that ifort or ecc generate. This option does not actually compile.

Use the following options to help tune performance:

  • -ftz: Flushes underflow to zero to avoid kernel traps. Enabled by default
    at -O3 optimization.

  • -fno-alias: Assumes no pointer aliasing. Aliasing occurs when two different pointers might refer to the same memory location; because the compiler cannot rule this out, it must assume that any two pointers can point to the same location in memory. This assumption removes optimization opportunities, particularly for loops. A sketch of the problem appears after this list.

    Other aliasing options include -ansi_alias and -fno_fnalias. Note that incorrect alias assertions may generate incorrect code.

  • -ip: Generates single file, interprocedural optimization; -ipo generates multifile, interprocedural optimization.

    Most compiler optimizations work within a single procedure (such as a function or a subroutine) at a time. This intra-procedural focus restricts optimization possibilities because a compiler is forced to make worst-case assumptions about the possible effects of a call to another procedure. Inter-procedural analysis optimizes across procedure boundaries; it performs two passes through the code and requires more compile time.

  • -O3: Enables -O2 optimizations plus more aggressive optimizations, including loop transformation and prefetching. Loop transformations are listed in a transformation file created by the compiler; you can examine this file to see what changes have been suggested for your loops. Prefetch instructions move data into the cache before it is used; a prefetch instruction is similar to a load instruction.

    Note that Level 3 optimization may not improve performance for all programs.

  • -opt_report: Generates an optimization report and places it in the file specified with -opt_report_file.

  • -override_limits: This is an undocumented option that sometimes allows the compiler to continue optimizing when it has hit an internal limit.

  • -prof_gen and -prof_use: Generates and uses profiling information. These options require a three-step compilation process:

    1. Compile with proper instrumentation using -prof_gen.

    2. Run the program on one or more training datasets.

    3. Compile with -prof_use, which uses the profile information from the training run.

  • -S: Compiles and generates an assembly listing in the .s files and does not link. The assembly listing can be used in conjunction with the output generated by the -opt_report option to try to determine how well the compiler is optimizing loops.
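
The following C fragment is a minimal sketch of the aliasing problem that -fno-alias addresses (the function and variable names are illustrative). Unless the compiler can prove that dst and src never overlap, it must assume that storing to dst[i] can modify src[i], which prevents it from keeping values in registers, software pipelining, or otherwise reordering the loop.

/* Without aliasing information, the compiler must assume that writing
 * dst[i] may change src[i], which blocks many loop optimizations. */
void scale(double *dst, const double *src, double factor, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        dst[i] = src[i] * factor;
    }
}

Compiling with -fno-alias asserts that no such overlap occurs anywhere in the file; if that assertion is wrong for your code, the generated code can produce incorrect results.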

Tuning the Cache Performance

There are several actions you can take to help tune cache performance:

  • Avoid large power-of-2 (and multiples thereof) strides and dimensions that cause cache thrashing. Cache thrashing occurs when multiple memory accesses require use of the same cache line. This can lead to an unnecessary number of cache misses.

    To prevent cache thrashing, redimension your vectors so that the size is not a power of two. Space the vectors out in memory so that concurrently accessed elements map to different locations in the cache. When working with two-dimensional arrays, make the leading dimension an odd number; for multidimensional arrays, change two or more dimensions to an odd number. A sketch of this kind of padding appears after this list.

    Consider the following example: a cache in the hierarchy has a size of 256 KB (or 65536 4-byte words). A Fortran program contains the following loop:

    real data(655360,24)
    ...
    do i=1,23
       do j=1,655360
          diff=diff+data(j,i)-data(j,i+1)
       enddo
    enddo

    The two accesses to data are separated in memory by 655360*4 bytes, which is a simple multiple of the cache size; they consequently map to the same location in the cache. Because both data items cannot simultaneously coexist in that cache location, a pattern of replace on reload occurs that considerably reduces performance.

  • Use a memory stride of 1 wherever possible. A loop over an array should access array elements from adjacent memory addresses. When the loop iterates through memory by consecutive word addresses, it uses every word of every cache line in sequence and does not return to a cache line after finishing it.

    If memory strides other than 1 are used, cache lines can be loaded multiple times if the array is too large to be held in cache at one time.

  • Cache bank conflicts can occur if there are two accesses to the same 16-byte-wide bank at the same time. Try different padding of arrays if the output from the pfmon -e L2_OZQ_CANCELS1_BANK_CONF command and the output from the pfmon -e CPU_CYCLES command show a high number of bank conflicts relative to total CPU cycles. These can be combined into one command:

    % pfmon -e CPU_CYCLES,L2_OZQ_CANCELS1_BANK_CONF a.out

    A maximum of four performance monitoring events can be counted simultaneously.

  • Group together data that is used at the same time and do not use vectors in your code, if possible. If elements that are used in one loop iteration are contiguous in memory, cache traffic is reduced and fewer cache lines are fetched for each iteration of the loop.

  • Try to avoid the use of temporary arrays and minimize data copies.
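
The following C sketch illustrates the padding and stride advice from the list above (the array names, the 256 KB cache size, and the pad amount are illustrative assumptions). In the first declaration, elements accessed in consecutive outer iterations are separated by exactly the cache size and therefore map to the same cache location; padding the row breaks that alignment, and the inner loop walks memory with a stride of 1.

#define WORDS_PER_CACHE 65536       /* 256 KB cache / 4-byte words (assumed) */

/* Bad: rows are separated by exactly the cache size, so data_bad[i][j] and
 * data_bad[i+1][j] map to the same cache location (cache thrashing).
 * Shown only for contrast; it is not used below. */
float data_bad[24][WORDS_PER_CACHE];

/* Better: pad each row so elements accessed together map to
 * different cache locations. */
float data_pad[24][WORDS_PER_CACHE + 16];

double sum_diffs(void)
{
    double diff = 0.0;
    int i, j;

    /* Stride-1 access: the inner loop walks consecutive memory addresses,
     * so every word of each fetched cache line is used. */
    for (i = 0; i < 23; i++) {
        for (j = 0; j < WORDS_PER_CACHE; j++) {
            diff += data_pad[i][j] - data_pad[i+1][j];
        }
    }
    return diff;
}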

Managing Memory

Nonuniform memory access (NUMA) hardware distributes memory and peripherals among many CPUs. This allows a shared memory system to scale, but a side effect is that the time it takes a CPU to access a memory location depends on where the memory is located. Because memory access times are nonuniform, program optimization is not always straightforward.

Codes that frequently allocate and deallocate memory through glibc malloc/free calls may accrue significant system time due to memory management overhead. By default, glibc strives for system-wide memory efficiency at the expense of performance.

In compilers up to and including version 7.1.x, to enable the higher-performance memory management mode, set the following environment variables:

% setenv MALLOC_TRIM_THRESHOLD_ -1
% setenv MALLOC_MMAP_MAX_ 0

Because allocations in ifort using the malloc intrinsic use glibc malloc internally, these environment variables are also applicable in Fortran codes using, for example, Cray pointers with malloc/free. However, they do not work for Fortran 90 allocatable arrays, which are managed directly through Fortran library calls and placed on the stack instead of the heap. The setenv example above applies to the csh and tcsh shells.

Multiprocessor Code Tuning

Before beginning any multiprocessor tuning, first perform single processor tuning. This can often obtain good results in multiprocessor codes also. For details, see “Single Processor Code Tuning”.

Multiprocessor tuning consists of the following major steps:

  • Decomposing the data

  • Parallelizing your code

  • Identifying parallel opportunities in existing code

  • Fixing false sharing, if it occurs

  • Using dplace and taskset

  • Using environment variables for performance tuning

Data Decomposition

In order to efficiently use multiple processors on a system, tasks have to be found that can be performed at the same time. There are two basic methods of defining these tasks:

  • Functional parallelism

    Functional parallelism is achieved when different processors perform different functions. This is a familiar approach for programmers trained in modular programming. Disadvantages of this approach include the difficulty of defining functions as the number of processors grows and of finding functions that use equivalent amounts of CPU power. This approach may also require large amounts of synchronization and data movement.

  • Data parallelism

    Data parallelism is achieved when different processors perform the same function on different parts of the data. This approach takes advantage of the large cumulative memory. One requirement of this approach, though, is that the problem domain be decomposed. There are two steps in data parallelism:

    1. Data decomposition

      Data decomposition is breaking up the data and mapping data to processors. Data can be broken up explicitly by the programmer by using message passing (with MPI) or data passing (using the SHMEM library routines), or it can be done implicitly by using compiler-based MP directives to find parallelism in implicitly decomposed data. A minimal MPI sketch of explicit decomposition appears after this list.

      There are advantages and disadvantages to implicit and explicit data decomposition:

      • Implicit decomposition advantages: No data resizing is needed; all synchronization is handled by the compiler; the source code is easier to develop and is portable to other systems with OpenMP or High Performance Fortran (HPF) support.

      • Implicit decomposition disadvantages: The data communication is hidden from the user; the compiler technology is not yet mature enough to deliver consistent top performance.

      • Explicit decomposition advantages: The programmer has full control over insertion of communication and synchronization calls; the source code is portable to other systems; code performance can be better than implicitly parallelized codes.

      • Explicit decomposition disadvantages: Harder to program; the source code is harder to read and the code is longer (typically 40% more).

    2. The final step is to divide the work among processors.
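
The following minimal C sketch shows explicit data decomposition with MPI, as described in step 1 above (the problem size and the use of MPI_Reduce are illustrative). Each process computes on its own contiguous chunk of the index range, and only the partial results are communicated.

#include <stdio.h>
#include <mpi.h>

#define N 1000000   /* total problem size (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs, i, start, end;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Decompose the index range: each rank gets a contiguous chunk. */
    start = rank * (N / nprocs);
    end   = (rank == nprocs - 1) ? N : start + (N / nprocs);

    for (i = start; i < end; i++) {
        local += (double)i;        /* stand-in for real work on this chunk */
    }

    /* Combine the partial results on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("total = %f\n", total);
    }
    MPI_Finalize();
    return 0;
}

On an Altix system this kind of program is compiled with the -lmpi option and launched with mpirun, as described in “Use MPT”.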

Parallelizing Your Code

The first step in multiprocessor performance tuning is to choose the parallelization methodology that you want to use for tuning. This section discusses those options in more detail.

You should first determine the fraction of your code that is parallelized, p. Use the following formula to calculate it from measured timings:

p = N(T(1) - T(N)) / (T(1)(N-1))

In this equation, T(1) is the time the code runs on a single CPU and T(N) is the time it runs on N CPUs. Speedup is defined as T(1)/T(N).

If speedup/N is less than 50% (that is, if N > (2-p)/(1-p)), stop using more CPUs and tune for better scalability.

CPU activity can be displayed with the top or vmstat commands, or it can be examined by using Performance Co-Pilot tools (for example, pmval kernel.percpu.cpu.user) or the Performance Co-Pilot visualization tool pmchart.

Next you should focus on a parallelization methodology, as discussed in the following subsections.

Use MPT

You can use the Message Passing Interface (MPI) from the SGI Message Passing Toolkit (MPT). MPI is optimized and more scalable for SGI Altix series systems than generic MPI libraries. It takes advantage of the SGI Altix architecture and SGI Linux NUMA features. MPT is included with the SGI ProPack for Linux software.

Use the -lmpi compiler option to use MPI. For a list of environment variables that are supported, see the mpi man page.

MPIO_DIRECT_READ and MPIO_DIRECT_WRITE are supported under Linux for local XFS filesystems in SGI MPT version 1.6.1 and beyond.

MPI provides the MPI-2 standard MPI I/O functions that provide file read and write capabilities. A number of environment variables are available to tune MPI I/O performance. See the mpi_io(3) man page for a description of these environment variables.

Performance tuning for MPI applications is described in more detail in Chapter 6 of the Message Passing Toolkit (MPT) User's Guide.

Use XPMEM DAPL Library with MPI

A Direct Access Programming Library (DAPL) is provided for high performance communication between processes on a partitioned or single host Altix system. This can be used with Intel MPI via the remote direct memory access (RDMA) MPI device. See dapl_xpmem(3) for more information.

To run Intel MPI with XPMEM DAPL on a system running SGI ProPack 5 for Linux, perform the following commands:

setenv I_MPI_DEVICE rdma:xpmem

To run an example program, perform commands similar to the following:

setenv DAPL_VERSION on    ( to get confirmation that you are using DAPL )
mpirun -np 2 /home/yoursystem/syslib/libtools/bin_linux/mpisanity.intel

On SGI ProPack 5 for Linux systems, to use XPMEM DAPL make sure you have the sgi-dapl module loaded, as follows:

module load sgi-dapl

For more information, see the following files:
/opt/sgi-dapl/share/doc/README
/opt/sgi-dapl/share/doc/README.sgi-dapl-api

See the following man pages: dapl_xpmem(3), libdat(3), and dat.conf(4).

Use OpenMP

OpenMP is a shared memory multiprocessing API, which standardizes existing practice. It is scalable for fine or coarse grain parallelism with an emphasis on performance. It exploits the strengths of shared memory and is directive-based. The OpenMP implementation also contains library calls and environment variables.

To use OpenMP directives with C, C++, or Fortran codes, you can use the following compiler options:

  • ifort -openmp or ecc -openmp: These options use the OpenMP front-end that is built into the Intel compilers. The resulting executable file makes calls to libguide.so, which is the OpenMP run-time library provided by Intel.

  • guide: An alternate way to invoke the Intel compilers for OpenMP code. Use guidec (in place of ecc), guideefc (in place of ifort), or guidec++ to translate code with OpenMP directives into code with calls to libguide. See “Other Performance Tools” in Chapter 3 for details.

    The -openmp option to ifort is the long-term OpenMP compilation method for Linux provided by Intel. However, if you have performance problems with this option, using guide might provide improved performance.

For details about OpenMP usage see the OpenMP standard, available at http://www.openmp.org/specs .
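
As a brief illustration, the following minimal C sketch parallelizes a loop with an OpenMP directive (the array size is illustrative). It can be compiled with ecc -openmp as described above, and the number of threads is controlled by the OMP_NUM_THREADS environment variable.

#include <stdio.h>
#include <omp.h>

#define MAX 1000000

int main(void)
{
    static double a[MAX], b[MAX], z[MAX];
    int j;

    for (j = 0; j < MAX; j++) {
        a[j] = j;
        b[j] = 2.0;
    }

    /* The iterations are divided among the threads in the team; the loop
     * variable is private to each thread. */
    #pragma omp parallel for
    for (j = 0; j < MAX; j++) {
        z[j] = a[j] * b[j];
    }

    printf("z[MAX-1] = %f (using up to %d threads)\n",
           z[MAX-1], omp_get_max_threads());
    return 0;
}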

OpenMP Nested Parallelism

This section describes OpenMP nested parallelism. For additional information, see the dplace(1) man page.

Here is a simple example of OpenMP nested parallelism with 2 "top" threads and 4 "bottom" threads; these are referred to as master and nested threads below:

% cat place_nested
firsttask cpu=0
thread name=a.out oncpu=0 cpu=4 noplace=1 exact onetime
thread name=a.out oncpu=0 cpu=1-3 exact
thread name=a.out oncpu=4 cpu=5-7 exact

% dplace -p place_nested a.out
Master thread 0 running on cpu 0
Master thread 1 running on cpu 4
Nested thread 0 of master 0 gets task 0 on cpu 0
Nested thread 1 of master 0 gets task 1 on cpu 1
Nested thread 2 of master 0 gets task 2 on cpu 2
Nested thread 3 of master 0 gets task 3 on cpu 3
Nested thread 0 of master 1 gets task 0 on cpu 4
Nested thread 1 of master 1 gets task 1 on cpu 5
Nested thread 2 of master 1 gets task 2 on cpu 6
Nested thread 3 of master 1 gets task 3 on cpu 7
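
The following minimal C sketch shows the corresponding program structure, assuming an OpenMP implementation that supports the num_threads clause and nested parallelism: two master threads are created, and each starts a nested team of four threads, matching the placement file above. The work inside the nested region is a placeholder.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                  /* allow nested parallel regions */

    #pragma omp parallel num_threads(2)
    {
        int master = omp_get_thread_num();

        #pragma omp parallel num_threads(4)
        {
            int nested = omp_get_thread_num();
            /* Placeholder for the real work of each nested thread. */
            printf("Nested thread %d of master %d\n", nested, master);
        }
    }
    return 0;
}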

Use Compiler Options

Use the compiler to invoke automatic parallelization. Use the -parallel and -par_report options to the efc or ecc compiler. These options report which loops were parallelized and the reasons why some loops were not parallelized. If a source file contains many loops, it might be necessary to add the -override_limits flag to enable automatic parallelization. The code generated by -parallel is based on the OpenMP API; the standard OpenMP environment variables and Intel extensions apply.

There are some limitations to automatic parallelization:

  • For Fortran codes, only DO loops are analyzed.

  • For C/C++ codes, only for loops that use explicit array notation or pointer increment notation are analyzed; for loops that use pointer arithmetic notation are not analyzed, nor are while or do/while loops. The compiler also does not check for blocks of code that can be run in parallel. (A sketch of the difference appears after this list.)
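
The following C sketch illustrates the distinction (the function names are illustrative): the first loop is a for loop written with explicit array notation, which the auto-parallelizer can analyze; the second walks the array with a pointer in a while loop, which it does not analyze.

/* A for loop with explicit array notation: a candidate for -parallel. */
void scale_array(double *a, double factor, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = a[i] * factor;
    }
}

/* A while loop that advances a pointer: not analyzed, per the
 * limitations listed above. */
void scale_pointer(double *a, double factor, int n)
{
    double *end = a + n;
    while (a < end) {
        *a = *a * factor;
        a++;
    }
}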

Identifying Parallel Opportunities in Existing Code

Another parallelization optimization technique is to identify loops that have a potential for parallelism, such as the following:

  • Loops without data dependencies; a data dependency conflict occurs when a loop has results from one loop pass that are needed in future passes of the same loop.

  • Loops with data dependencies because of temporary variables, reductions, nested loops, or function calls or subroutines.

Loops that do not have a potential for parallelism are those with premature exits, too few iterations, or those where the programming effort to avoid data dependencies is too great.
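
For example, the following minimal C sketch shows the difference: the first loop has no data dependency, so its iterations can be divided among CPUs, while the second carries a result from one pass into the next and cannot be parallelized as written.

/* No data dependency: each iteration is independent and the loop
 * can be divided among CPUs. */
void independent(double *a, double *b, double *c, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Data dependency: iteration i uses the result of iteration i-1,
 * so the passes cannot run in parallel as written. */
void recurrence(double *a, double *b, int n)
{
    int i;
    for (i = 1; i < n; i++) {
        a[i] = a[i-1] + b[i];
    }
}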

Fixing False Sharing

If the parallel version of your program is slower than the serial version, false sharing might be occurring. False sharing occurs when two or more data items that are not actually shared between threads of a shared memory application reside in the same cache line in the processor data caches. If two threads executing on different CPUs modify the same cache line, the cache line cannot remain resident and correct in both CPUs, and the hardware must move the cache line through the memory subsystem to retain coherency. This causes performance degradation and reduces the scalability of the application. If the data items are only read, not written, the cache line remains in a shared state on all of the CPUs concerned. False sharing can occur when different threads modify adjacent elements in a shared array: when an array is decomposed across CPUs and the chunk boundaries do not fall on cache line boundaries, two CPUs end up modifying the same cache line. A short C sketch of the problem and a common fix follows.
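
The following C sketch shows a typical case and a common fix (the thread count, the 128-byte cache line size, and the names are illustrative assumptions): each thread updates its own element of a small shared array, but because the elements are adjacent they share cache lines that move back and forth between CPUs; padding each element to a full cache line removes the false sharing.

#include <omp.h>

#define NTHREADS   8
#define LINE_BYTES 128          /* assumed cache line size */

/* False sharing: the counters are adjacent, so several of them fall in
 * the same cache line, and every update invalidates that line in the
 * other CPUs' caches. */
long counters[NTHREADS];

/* Fix: pad each counter so it occupies its own cache line. */
struct padded { long value; char pad[LINE_BYTES - sizeof(long)]; };
struct padded padded_counters[NTHREADS];

void count_events(long iterations)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        long i;
        for (i = 0; i < iterations; i++) {
            padded_counters[me].value++;   /* no false sharing */
            /* counters[me]++;                would cause false sharing */
        }
    }
}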

You can use the following methods to verify that false sharing is happening:

  • Use the performance monitor to look at output from pfmon and the BUS_MEM_READ_BRIL_SELF and BUS_RD_INVAL_ALL_HITM events.

  • Use pfmon to check DEAR events to track common cache lines.

  • Use the Performance Co-Pilot pmshub utility to monitor cache traffic and CPU utilization. You can also use the shubstats(1) tool to monitor Altix cache and directory traffic.

If false sharing is a problem, try the following solutions:

  • Use the hardware counter to run a profile that monitors storage to shared cache lines. This will show the location of the problem. You can use the profile.pl -e command or histx -e command. For more information, see “Profiling with profile.pl” in Chapter 3, “Using histx” in Chapter 3, and the profile.pl (1) man page.

  • Revise data structures or algorithms.

  • Check shared data, static variables, common blocks, and private and public variables in shared objects.

  • Use critical regions to identify the part of the code that has the problem.

Using dplace and taskset

The dplace command binds processes to specified CPUs in a round-robin fashion. Once a process is bound to a CPU, it does not migrate. This is similar to _DSM_MUSTRUN on IRIX systems. dplace numbering is done in the context of the current CPU memory set. See Chapter 4, “Monitoring Tools” for details about dplace.

The taskset command restricts execution to the listed set of CPUs; however, processes are still free to move among listed CPUs.

Environment Variables for Performance Tuning

You can use several different environment variables to assist in performance tuning. For details about environment variables used to control the behavior of MPI, see the mpi(1) man page.

Several OpenMP environment variables can affect the actions of the OpenMP library. For example, some environment variables control the behavior of threads in the application when they have no work to perform or are waiting for other threads to arrive at a synchronization semantic; other variables can specify how the OpenMP library schedules iterations of a loop across threads. The following environment variables are part of the OpenMP standard:

  • OMP_NUM_THREADS (The default is the number of CPUs in the system.)

  • OMP_SCHEDULE (The default is static.)

  • OMP_DYNAMIC (The default is false.)

  • OMP_NESTED (The default is false.)

In addition to the preceding environment variables, Intel provides several OpenMP extensions, some of which are controlled through the KMP_LIBRARY variable.

The KMP_LIBRARY variable sets the run-time execution mode, as follows:

  • If set to serial, single-processor execution is used.

  • If set to throughput, CPUs yield to other processes when waiting for work. This is the default and is intended to provide good overall system performance in a multiuser environment. This is analogous to the IRIX _DSM_WAIT=YIELD variable.

  • If set to turnaround, worker threads do not yield while waiting for work. This is analogous to the IRIX _DSM_WAIT=SPIN variable. Setting KMP_LIBRARY to turnaround may improve the performance of benchmarks run on dedicated systems, where multiple users are not contending for CPU resources.

If your program gets a segmentation fault immediately upon execution, you may need to increase KMP_STACKSIZE. This is the private stack size for threads. The default is 4 MB. You may also need to increase your shell stacksize limit.

Understanding Parallel Speedup and Amdahl's Law

There are two ways to obtain the use of multiple CPUs. You can take a conventional program in C, C++, or Fortran, and have the compiler find the parallelism that is implicit in the code.

You can write your source code to use explicit parallelism, stating in the source code which parts of the program are to execute asynchronously, and how the parts are to coordinate with each other.

When your program runs on more than one CPU, its total run time should be less. But how much less? What are the limits on the speedup? That is, if you apply 16 CPUs to the program, should it finish in 1/16th the elapsed time?

This section covers the following topics:

  • “Adding CPUs to Shorten Execution Time”

  • “Understanding Parallel Speedup”

  • “Understanding Superlinear Speedup”

  • “Understanding Amdahl's Law”

  • “Calculating the Parallel Fraction of a Program”

  • “Predicting Execution Time with n CPUs”

Adding CPUs to Shorten Execution Time

You can distribute the work your program does over multiple CPUs. However, there is always some part of the program's logic that has to be executed serially, by a single CPU. This sets the lower limit on program run time.

Suppose there is one loop in which the program spends 50% of the execution time. If you can divide the iterations of this loop so that half of them are done in one CPU while the other half are done at the same time in a different CPU, the whole loop can be finished in half the time. The result: a 25% reduction in program execution time.

The mathematical treatment of these ideas is called Amdahl's law, for computer pioneer Gene Amdahl, who formalized it. There are two basic limits to the speedup you can achieve by parallel execution:

  • The fraction of the program that can be run in parallel, p, is never 100%.

  • Because of hardware constraints, after a certain point, there is less and less benefit from each added CPU.

Tuning for parallel execution comes down to doing the best that you are able to do within these two limits. You strive to increase the parallel fraction, p, because in some cases even a small change in p (from 0.8 to 0.85, for example) makes a dramatic change in the effectiveness of added CPUs.

Then you work to ensure that each added CPU does a full CPU's work, and does not interfere with the work of other CPUs. In the SGI Altix architectures this means:

  • Spreading the workload equally among the CPUs

  • Eliminating false sharing and other types of memory contention between CPUs

  • Making sure that the data used by each CPU are located in a memory near that CPU's node

Understanding Parallel Speedup

If half the iterations of a DO-loop are performed on one CPU, and the other half run at the same time on a second CPU, the whole DO-loop should complete in half the time. For example, consider the typical C loop in Example 6-1.

Example 6-1. Typical C Loop

for (j=0; j<MAX; ++j) {
   z[j] = a[j]*b[j];
} 

The compiler can automatically distribute such a loop over n CPUs (with n decided at run time based on the available hardware), so that each CPU performs MAX/n iterations.

The speedup gained from applying n CPUs, Speedup(n), is the ratio of the one-CPU execution time to the n-CPU execution time: Speedup(n ) = T(1) ÷ T(n). If you measure the one-CPU execution time of a program at 100 seconds, and the program runs in 60 seconds with two CPUs, Speedup(2) = 100 ÷ 60 = 1.67.

This number captures the improvement from adding hardware. T(n) ought to be less than T(1); if it is not, adding CPUs has made the program slower, and something is wrong! So Speedup(n) should be a number greater than 1.0, and the greater it is, the better. Intuitively you might hope that the speedup would be equal to the number of CPUs (twice as many CPUs, half the time) but this ideal can seldom be achieved.

Understanding Superlinear Speedup

You expect Speedup(n) to be less than n, reflecting the fact that not all parts of a program benefit from parallel execution. However, it is possible, in rare situations, for Speedup(n) to be larger than n. When the program has been sped up by more than the increase of CPUs it is known as superlinear speedup.

A superlinear speedup does not really result from parallel execution. It comes about because each CPU is now working on a smaller set of memory. The problem data handled by any one CPU fits better in cache, so each CPU executes faster than the single CPU could do. A superlinear speedup is welcome, but it indicates that the sequential program was being held back by cache effects.

Understanding Amdahl's Law

There are always parts of a program that you cannot make parallel, where code must run serially. For example, consider the DO-loop. Some amount of code is devoted to setting up the loop, allocating the work between CPUs. This housekeeping must be done serially. Then comes parallel execution of the loop body, with all CPUs running concurrently. At the end of the loop comes more housekeeping that must be done serially; for example, if n does not divide MAX evenly, one CPU must execute the few iterations that are left over.

The serial parts of the program cannot be speeded up by concurrency. Let p be the fraction of the program's code that can be made parallel (p is always a fraction less than 1.0.) The remaining fraction (1-p) of the code must run serially. In practical cases, p ranges from 0.2 to 0.99.

The potential speedup for a program is the reciprocal of the sum of two terms: the parallel fraction divided by the number of CPUs you can apply (p/n), and the remaining serial fraction (1-p). As an equation, this appears as Example 6-2.

Example 6-2. Amdahl's law: Speedup(n) Given p

                  1 
Speedup(n) = ----------- 
             (p/n)+(1-p) 

Suppose p = 0.8; then Speedup(2) = 1 / (0.4 + 0.2) = 1.67, and Speedup(4) = 1 / (0.2 + 0.2) = 2.5. The maximum possible speedup (if you could apply an infinite number of CPUs) would be 1 / (1-p). The fraction p has a strong effect on the possible speedup.

The reward for parallelization is small unless p is substantial (at least 0.8); or to put the point another way, the reward for increasing p is great no matter how many CPUs you have. The more CPUs you have, the more benefit you get from increasing p. Using only four CPUs, you need only about p = 0.67 to get half the ideal speedup. With eight CPUs, you need about p = 0.86 to get half the ideal speedup.

Calculating the Parallel Fraction of a Program

You do not have to guess at the value of p for a given program. Measure the execution times T(1) and T(2) to calculate a measured Speedup (2) = T(1) / T(2). The Amdahl's law equation can be rearranged to yield p when Speedup (2) is known, as in Example 6-3.

Example 6-3. Amdahl's law: p Given Speedup(2)

        SpeedUp(2) - 1
p = 2 * --------------
          SpeedUp(2)

Suppose you measure T(1) = 188 seconds and T(2) = 104 seconds.

SpeedUp(2) = 188/104 = 1.81 
p = 2 * ((1.81-1)/1.81) = 2*(0.81/1.81) = 0.895 

In some cases, the Speedup(2) = T(1)/T(2) is a value greater than 2; in other words, a superlinear speedup (“Understanding Superlinear Speedup”). When this occurs, the formula in Example 6-3 returns a value of p greater than 1.0, which is clearly not useful. In this case you need to calculate p from two other more realistic timings, for example T(2) and T(3). The general formula for p is shown in Example 6-4, where n and m are the two CPU counts whose speedups are known, n>m.

Example 6-4. Amdahl's Law: p Given Speedup(n) and Speedup(m)

                Speedup(n) - Speedup(m)
p  =  ------------------------------------------- 
   (1 - 1/n)*Speedup(n) - (1 - 1/m)*Speedup(m) 


Predicting Execution Time with n CPUs

You can use the calculated value of p to extrapolate the potential speedup with higher numbers of CPUs. The following example shows the expected time with four CPUs, if p=0.895 and T(1)=188 seconds:

Speedup(4)= 1/((0.895/4)+(1-0.895)) = 3.04 
T(4)= T(1)/Speedup(4) = 188/3.04 = 61.8 

The calculation can be made routine by writing a small script or program that automates the calculations and extrapolates run times; a sketch follows.
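
The following minimal C sketch automates the calculation (the timings are the ones measured in the examples above): it derives p from T(1) and T(2) and then extrapolates speedups and run times for larger CPU counts.

#include <stdio.h>

/* Compute p from one-CPU and two-CPU timings (Example 6-3). */
static double parallel_fraction(double t1, double t2)
{
    double speedup2 = t1 / t2;
    return 2.0 * (speedup2 - 1.0) / speedup2;
}

/* Amdahl's law (Example 6-2). */
static double predicted_speedup(double p, int n)
{
    return 1.0 / (p / n + (1.0 - p));
}

int main(void)
{
    double t1 = 188.0, t2 = 104.0;      /* measured timings, in seconds */
    double p = parallel_fraction(t1, t2);
    int n;

    printf("p = %.3f\n", p);
    for (n = 2; n <= 16; n *= 2) {
        double s = predicted_speedup(p, n);
        printf("n = %2d  Speedup(n) = %.2f  predicted T(n) = %.1f s\n",
               n, s, t1 / s);
    }
    return 0;
}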

These calculations are independent of most programming issues such as language, library, or programming model. They are not independent of hardware issues, because Amdahl's law assumes that all CPUs are equal. At some level of parallelism, adding a CPU no longer affects run time in a linear way. For example, on some architectures, cache-friendly codes scale closely with Amdahl's law up to the maximum number of CPUs, but scaling of memory intensive applications slows as the system bus approaches saturation. When the bus bandwidth limit is reached, the actual speedup is less than predicted.

Floating-point Programs Performance

Certain floating-point programs experience slowdowns due to excessive floating-point traps called Floating-Point Software Assist (FPSWA) faults.

This happens when the hardware cannot complete a floating-point operation and requests help (emulation) from software. This occurs, for instance, with denormal numbers. See the following document for more details:

http://www.intel.com/design/itanium/downloads/245415.htm

The symptoms are slower than normal execution and FPSWA messages in the system log (run dmesg to see them). The average cost of an FPSWA fault is quite high, around 1000 cycles per fault.

By default, the kernel prints a message similar to the following in the system log:

foo(7716): floating-point assist fault at ip 40000000000200e1
            isr 0000020000000008

The kernel throttles the message in order to avoid flooding the console.

It is possible to control the behavior of the kernel on FPSWA faults using the prctl(1) command. In particular, it is possible to get a signal delivered at the first FPSWA. It is also possible to silence the console message.

Using pfmon you can count fp_true_sirstall to test for FPSWA faults, as follows:

$ pfmon --no-qual-check -ku --drange=fpswa_interface \
-eloads_retired,ia64_inst_retired,fp_true_sirstall -- test-fpswa

                           1 LOADS_RETIRED
                     2615140 IA64_INST_RETIRED 

To see a list of available options, use the pfmon --help command.