Chapter 3. Performance Analysis and Debugging

Tuning an application involves determining the source of performance problems and then rectifying those problems so that your programs run as fast as possible on the available hardware. Performance gains usually fall into one of three categories of measured time.

Any application tuning process involves the following steps:

  1. Analyzing and identifying a problem

  2. Locating where in the code the problem is

  3. Applying an optimization technique

This chapter describes the process of analyzing your code to determine performance bottlenecks. See Chapter 6, “Performance Tuning”, for details about tuning your application for a single processor system and then tuning it for parallel processing.

Determining System Configuration

One of the first steps in application tuning is to determine the details of the system on which you are running. Depending on your system configuration, different options might or might not provide good results.

The topology(1) command displays general information about SGI Altix systems, with a focus on node information. This includes node counts for blades, node IDs, NASIDs, memory per node, system serial number, partition number, UV Hub versions, CPU to node mappings, and general CPU information. Example output is as follows:

uv44-sys:~ # topology
Serial number: UV-00000044
Partition number: 0
4 Blades
64 CPUs
125.97 Gb Memory Total

Blade         ID       asic  NASID         Memory
-------------------------------------------------
    0 r001i01b00  UVHub 2.0      0    16757488 kB
    1 r001i01b01  UVHub 2.0      2    16777216 kB
    2 r001i01b02  UVHub 2.0      4    16777216 kB
    3 r001i01b03  UVHub 2.0      6    16760832 kB

CPU      Blade PhysID CoreID APIC-ID Family Model Speed L1(KiB) L2(KiB) L3(KiB)
-------------------------------------------------------------------------------
  0 r001i01b00     00     00       0      6    46  1866 32d/32i     256   18432
  1 r001i01b00     00     03       6      6    46  1866 32d/32i     256   18432
  2 r001i01b00     00     08      16      6    46  1866 32d/32i     256   18432
  3 r001i01b00     00     11      22      6    46  1866 32d/32i     256   18432
  4 r001i01b00     01     00      32      6    46  1866 32d/32i     256   18432
  5 r001i01b00     01     03      38      6    46  1866 32d/32i     256   18432
  6 r001i01b00     01     08      48      6    46  1866 32d/32i     256   18432
  7 r001i01b00     01     11      54      6    46  1866 32d/32i     256   18432
  8 r001i01b01     02     00      64      6    46  1866 32d/32i     256   18432
  9 r001i01b01     02     03      70      6    46  1866 32d/32i     256   18432
 10 r001i01b01     02     08      80      6    46  1866 32d/32i     256   18432
 11 r001i01b01     02     11      86      6    46  1866 32d/32i     256   18432
 12 r001i01b01     03     00      96      6    46  1866 32d/32i     256   18432
 13 r001i01b01     03     03     102      6    46  1866 32d/32i     256   18432
 14 r001i01b01     03     08     112      6    46  1866 32d/32i     256   18432
 15 r001i01b01     03     11     118      6    46  1866 32d/32i     256   18432
 16 r001i01b02     04     00     128      6    46  1866 32d/32i     256   18432
                                  ...
 63 r001i01b03     07     11     247      6    46  1866 32d/32i     256   18432

The cpumap(1) command displays logical CPUs and shows relationships between them in a human-readable format. Aspects displayed include hyperthread relationships, last level cache sharing, and topological placement. The cpumap command gets its information from /proc/cpuinfo, the /sys/devices/system directory structure, and /proc/sgi_uv/topology.

Example output is as follows:

uv44-sys:~ # cpumap
Mon Oct 18 13:40:26 CDT 2010
uv44-sys.mycompany.com

This an SGI Altix UV
model name           : Intel(R) Xeon(R) CPU E7520 @ 1.87GHz
Architecture         : x86_64
cpu MHz              : 1866.557
cache size           : 18432 KB (Last Level)

Total Number of Sockets                 : 8
Total Number of Cores                   : 32    (4 per socket)
Hyperthreading                          : ON
Total Number of Physical Processors     : 32
Total Number of Logical Processors      : 64    (2 per Phys Processor)

UV Information
 HUB Version:                            UVHub  2.0
 Number of Hubs (Blades):                4
 Number of connected NUMAlink ports:    12
=============================================================================

Hub-Processor Mapping

  Hub Location      Processor Numbers -- HyperThreads in ()
  --- ----------    ---------------------------------------
    0 r001i01b00       0    1    2    3    4    5    6    7
                  (   32   33   34   35   36   37   38   39 )
    1 r001i01b01       8    9   10   11   12   13   14   15
                  (   40   41   42   43   44   45   46   47 )
    2 r001i01b02      16   17   18   19   20   21   22   23
                  (   48   49   50   51   52   53   54   55 )
    3 r001i01b03      24   25   26   27   28   29   30   31
                  (   56   57   58   59   60   61   62   63 )

=============================================================================

Processor Numbering on Socket(s)

  Socket    (Logical) Processors
  ------    -------------------------
     0      0    1    2    3   32   33   34   35
     1      4    5    6    7   36   37   38   39
     2      8    9   10   11   40   41   42   43
     3     12   13   14   15   44   45   46   47
     4     16   17   18   19   48   49   50   51
     5     20   21   22   23   52   53   54   55
     6     24   25   26   27   56   57   58   59
     7     28   29   30   31   60   61   62   63

=============================================================================

Sharing of Last Level (3) Caches

  Socket    (Logical) Processors
  ------    -------------------------
     0      0    1    2    3   32   33   34   35
     1      4    5    6    7   36   37   38   39
     2      8    9   10   11   40   41   42   43
     3     12   13   14   15   44   45   46   47
     4     16   17   18   19   48   49   50   51
     5     20   21   22   23   52   53   54   55
     6     24   25   26   27   56   57   58   59
     7     28   29   30   31   60   61   62   63

=============================================================================

HyperThreading

  Shared Processors
  -----------------
 (    0,   32) (    1,   33) (    2,   34) (    3,   35)
 (    4,   36) (    5,   37) (    6,   38) (    7,   39)
 (    8,   40) (    9,   41) (   10,   42) (   11,   43)
 (   12,   44) (   13,   45) (   14,   46) (   15,   47)
 (   16,   48) (   17,   49) (   18,   50) (   19,   51)
 (   20,   52) (   21,   53) (   22,   54) (   23,   55)
 (   24,   56) (   25,   57) (   26,   58) (   27,   59)
 (   28,   60) (   29,   61) (   30,   62) (   31,   63)
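
If you want to use the hyperthread relationships in a script (for example, to avoid placing two busy threads on the same physical core), you can extract them from the cpumap text. The following Python sketch assumes that the output keeps the layout shown above, with a line containing only HyperThreading followed by "(a, b)" pairs. The function name hyperthread_pairs and the shortened sample text are illustrative and are not part of cpumap.

```python
import re

# Shortened sample copied from the HyperThreading section shown above.
SAMPLE = """\
HyperThreading

  Shared Processors
  -----------------
 (    0,   32) (    1,   33) (    2,   34) (    3,   35)
 (    4,   36) (    5,   37) (    6,   38) (    7,   39)
"""

# Assumed layout: a heading line that contains only "HyperThreading",
# followed by "(a, b)" pairs of logical CPU numbers.
HEADING_RE = re.compile(r"^HyperThreading\s*$", re.MULTILINE)
PAIR_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def hyperthread_pairs(text):
    """Return {logical_cpu: sibling_cpu} for each pair after the heading."""
    heading = HEADING_RE.search(text)
    if heading is None:
        return {}
    siblings = {}
    for first, second in PAIR_RE.findall(text[heading.end():]):
        first, second = int(first), int(second)
        siblings[first] = second
        siblings[second] = first
    return siblings


pairs = hyperthread_pairs(SAMPLE)
print(pairs[0], pairs[32])
```

The lookup is symmetric, so either logical CPU of a pair can be used as the key.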

Use the x86info(1) command to display x86 CPU diagnostic information, as follows:

uv44-sys:~ # x86info
x86info v1.25.  Dave Jones 2001-2009
Feedback to .

Found 64 CPUs
--------------------------------------------------------------------------
CPU #1
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU           E7520  @ 1.87GHz
Type: 0 (Original OEM)  Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x0    Package: 0  Core: 0   SMT ID 0
--------------------------------------------------------------------------
CPU #2
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU           E7520  @ 1.87GHz
Type: 0 (Original OEM)  Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x6    Package: 0  Core: 0   SMT ID 6
--------------------------------------------------------------------------
CPU #3
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU           E7520  @ 1.87GHz
Type: 0 (Original OEM)  Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x10   Package: 0  Core: 0   SMT ID 16
-------------------------------------------------------------------------- 
                       ...

You can also use the uname command, which returns the kernel version and other machine information. For example:

uv44-sys:~ # uname -a
Linux uv44-sys 2.6.32.13-0.4.1.1559.0.PTF-default #1 SMP 2010-06-15 12:47:25 +0200 x86_64 x86_64 x86_64 GNU/Linux

For more system information, change directory (cd) to the /sys/devices/system/node/node0/cpu0/cache directory and list its contents, as follows:

uv44-sys:/sys/devices/system/node/node0/cpu0/cache # ls
index0  index1  index2  index3

Change directory to index0 and list the contents, as follows:

uv44-sys:/sys/devices/system/node/node0/cpu0/cache/index0 # ls
coherency_line_size  level  number_of_sets  physical_line_partition  shared_cpu_list  shared_cpu_map  size  type  ways_of_associativity
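
Each file in an index directory holds one attribute of that cache as a short text string. Rather than reading the files one at a time, you can collect them with a small script. The following Python sketch assumes the directory layout shown above (index0, index1, and so on, each containing plain-text attribute files). The function names read_cache_index and describe_caches are illustrative.

```python
from pathlib import Path


def read_cache_index(index_dir):
    """Return {attribute_name: text} for the regular files in one index directory."""
    attrs = {}
    for entry in sorted(Path(index_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            attrs[entry.name] = entry.read_text().strip()
        except (OSError, UnicodeDecodeError):
            continue  # skip attributes that cannot be read as text
    return attrs


def describe_caches(cache_dir):
    """Return one attribute dictionary per indexN directory under cache_dir."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return [read_cache_index(d) for d in sorted(cache_dir.glob("index[0-9]*"))]


if __name__ == "__main__":
    # Path taken from the example above; prints nothing if it does not exist.
    for cache in describe_caches("/sys/devices/system/node/node0/cpu0/cache"):
        print(cache.get("level"), cache.get("type"),
              cache.get("size"), cache.get("shared_cpu_list"))
```

Each printed line corresponds to one index directory and shows the values of its level, type, size, and shared_cpu_list files, or None for any file that is missing.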

Sources of Performance Problems

Performance slowdowns usually arise in one of three areas of program execution:

  • CPU-bound processes: processes that perform slow operations (such as sqrt or floating-point divides) or non-pipelined operations, such as switching between add and multiply operations.

  • Memory-bound processes: code that uses poor memory strides, suffers from page thrashing or cache misses, or has poor data placement on NUMA systems.

  • I/O-bound processes: processes that wait on synchronous I/O or formatted I/O, or that are slowed by library-level or system-level buffering.
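
A rough first check, before you reach for a profiler, is to compare the CPU time that a piece of work consumes with the wall-clock time that it takes. The following Python sketch illustrates the idea under simple assumptions: it measures a single-threaded call inside the current process, and it treats a low ratio of CPU time to wall-clock time as a hint that the work is mostly waiting (for example, on I/O). The names measure, cpu_fraction, and busy_loop are illustrative, and the ratio alone cannot distinguish CPU-bound from memory-bound code.

```python
import resource
import time


def measure(func, *args):
    """Run func(*args); return (wall, user, system) seconds for this process."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    func(*args)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (wall,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def cpu_fraction(wall, user, system):
    """Fraction of wall-clock time spent on a CPU (0.0 if wall is zero)."""
    return (user + system) / wall if wall > 0 else 0.0


def busy_loop(count):
    """Floating-point work that keeps the CPU busy."""
    total = 0.0
    for i in range(1, count):
        total += (i * 0.5) / (i + 1.0)
    return total


if __name__ == "__main__":
    for name, func, arg in (("compute", busy_loop, 2_000_000),
                            ("waiting", time.sleep, 0.2)):
        wall, user, system = measure(func, arg)
        print(f"{name}: wall={wall:.3f}s user={user:.3f}s sys={system:.3f}s "
              f"cpu fraction={cpu_fraction(wall, user, system):.2f}")
```

A fraction close to 1 suggests that the time is spent computing or stalled on memory, while a fraction close to 0 suggests that the process spends most of its time waiting.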

Several profiling tools can help pinpoint where performance slowdowns are occurring. The following sections describe some of these tools.

Profiling with perf

The perf(1) command provides performance analysis tools for Linux. Performance counters for Linux are a kernel-based subsystem that provides a framework for performance analysis. The subsystem covers hardware-level CPU/performance monitoring unit (PMU) features as well as software features, such as software counters and tracepoints. To use the perf profiling tools, make sure that the perf RPM is installed. For more information, see the following man pages: perf-stat(1), perf-top(1), perf-record(1), perf-report(1), perf-list(1).
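
As an illustration of how perf stat might be driven from a script, the following Python sketch builds a perf stat command line and runs it only if perf is found on the PATH. The event names cycles and instructions are common perf event names, but whether they are available depends on your processor and kernel (check with perf list). The helper names perf_stat_argv and run_perf_stat are illustrative.

```python
import shutil
import subprocess


def perf_stat_argv(command, events=("cycles", "instructions")):
    """Build the argument list for 'perf stat -e <events> -- <command>'."""
    argv = ["perf", "stat"]
    if events:
        argv += ["-e", ",".join(events)]
    return argv + ["--"] + list(command)


def run_perf_stat(command, events=("cycles", "instructions")):
    """Run the command under perf stat; return the report text or None."""
    if shutil.which("perf") is None:
        return None  # perf is not installed on this system
    proc = subprocess.run(perf_stat_argv(command, events),
                          capture_output=True, text=True)
    # By default, perf stat writes its counter summary to standard error.
    return proc.stderr


if __name__ == "__main__":
    report = run_perf_stat(["true"])
    print(report if report is not None else "perf not found")
```

Replace ["true"] with the argument list of the program that you want to measure.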

Profiling with PerfSuite

“PerfSuite is an easy-to-use collection of tools, utilities, and libraries to support application software performance analysis on Linux-based systems. It includes components to assist with a wide variety of performance-related tasks, ranging from assistance with compiler optimization reports to hardware performance counting, profiling, and MPI usage summarization. PerfSuite is Open Source software, approved for licensing under the University of Illinois/NCSA Open Source License (OSI-approved). You can find out more about PerfSuite at the project web sites, located at: http://perfsuite.ncsa.uiuc.edu/ or http://perfsuite.sourceforge.net/.”

For NCSA-specific information about using PerfSuite tools, see http://www.ncsa.illinois.edu/UserInfo/Resources/Software/Tools/PerfSuite/.

“psrun is a PerfSuite command-line utility that can be used to gather hardware performance information on an unmodified executable. It's a convenient and flexible way to do quick performance monitoring/measurement.” For more information, see http://perfsuite.ncsa.uiuc.edu/psrun/.

Using VTune for Remote Sampling

The Intel VTune performance analyzer performs remote sampling experiments. The VTune data collector runs on the Linux system, and an accompanying GUI, which is used for analyzing the results, runs on an IA-32 Windows machine. VTune allows you to perform interactive experiments while connected to the host through its GUI. The Performance Tuning Utility (PTU) is another tool that requires the Intel VTune license.

For details about using VTune, see the following URL:

http://developer.intel.com/software/products/vtune/vpa/

Other Performance Tools

The following performance tools can also be of benefit when you are trying to optimize your code:

  • Guide OpenMP Compiler is an OpenMP implementation for C, C++, and Fortran from Intel.

  • Assure Thread Analyzer from Intel locates programming errors in threaded applications with no recoding required.

For details about these products, see the following website:

http://developer.intel.com/software/products/threading


Note: These products have not been thoroughly tested on SGI systems. SGI takes no responsibility for the correct operation of third party products described or their suitability for any particular purpose.


Debugging Tools

Several debuggers are available to help you analyze your code:

  • gdb: the GNU project debugger. This is useful for debugging programs written in C, C++, and Fortran 95. When compiling with C and C++, include the -g option on the compiler command line to produce the dwarf2 symbols database used by gdb.

    When using gdb for Fortran debugging, include the -g and -O0 options. Do not use gdb for Fortran debugging when compiling with -O1 or higher.

    The debugger to be used for Fortran 95 codes can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=56720. (Note that the standard gdb debugger does not support Fortran 95 codes.) To verify that you have the correct version of gdb installed, use the gdb -v command. The output should appear similar to the following:

    GNU gdb 5.1.1 FORTRAN95-20020628 (RC1)
    Copyright 2002 Free Software Foundation, Inc.

    For a complete list of gdb commands, see the gdb user guide online at http://sources.redhat.com/gdb/onlinedocs/gdb_toc.html or use the help command. Note that current instances of gdb do not report ar.ec registers correctly. If you are debugging rotating, register-based, software-pipelined loops at the assembly code level, try using idb instead.

  • idb: the Intel debugger. This is a fully symbolic debugger for the Linux platform. The debugger provides extensive support for debugging programs written in C, C++, FORTRAN 77, and Fortran 90. idb includes a GUI and it supports both Intel and GNU compilers.

    Running idb with the -gdb option on the shell command line provides gdb(1)-like user commands and debugger output.

  • ddd: a GUI to a command line debugger. It supports gdb and idb. For details about usage, see “Using ddd”.

  • TotalView: a licensed graphical debugger useful in an MPI environment (see http://www.totalviewtech.com/).

Figure 3-1 shows a TotalView session.

Figure 3-1. TotalView Session

Using the Intel Debugger idb

idb is part of the Intel Compiler suite for both Fortran and C/C++. During installation, you are asked whether you want to install it. Running the idb command starts the GUI; running the idbc command starts the command line interface.

Figure 3-2. Intel® Debugger GUI

Using ddd

The DataDisplayDebugger ddd(1) tool is a GUI to an arbitrary command line debugger, as shown in Figure 3-3. When starting ddd, use the --debugger option to specify the debugger to use (for example, --debugger "idb"). The default debugger is gdb.

Figure 3-3. DataDisplayDebugger ddd(1)

When the debugger is loaded, the DataDisplayDebugger screen appears, divided into panes that show the following information:

  • Array inspection

  • Source code

  • Disassembled code

  • A command line window to the debugger engine

These panes can be switched on and off from the View menu.

Some commonly used commands can be found on the menus. In addition, the following actions can be useful:

  • Select an address in the assembly view, click the right mouse button, and select lookup. The gdb command is executed in the command pane and it shows the corresponding source line.

  • Select a variable in the source pane and click the right mouse button. The current value is displayed. Arrays are displayed in the array inspection window. You can print these arrays to PostScript by using the Menu>Print Graph option.

  • You can view the contents of the register file, including general, floating-point, NaT, predicate, and application registers by selecting Registers from the Status menu. The Status menu also allows you to view stack traces or to switch OpenMP threads.