This chapter describes the perfcatch utility, which profiles the performance of an MPI program, and other tools that can be used to profile MPI applications. It covers the following topics:
The perfcatch utility runs an MPI program with a wrapper profiling library that prints MPI call profiling information to a summary file when the MPI program completes. By default, this MPI profiling result file is named MPI_PROFILING_STATS (see “MPI_PROFILING_STATS Results File Example”). It is created in the current working directory of the MPI process with rank 0.
The syntax of the perfcatch utility is as follows:
perfcatch [-v | -vofed | -i] cmd args
The perfcatch utility accepts the following options:
Option | Description
No option | Supports MPT
-v | Supports Voltaire MPI
-vofed | Supports Voltaire OFED MPI
-i | Supports Intel MPI
To use perfcatch with an SGI Message Passing Toolkit MPI program, insert the perfcatch command in front of the executable name. Here are some examples:
mpirun -np 64 perfcatch a.out arg1
mpirun host1 32, host2 64 perfcatch a.out arg1
To use perfcatch with Intel MPI, add the -i option. An example is as follows:
mpiexec -np 64 perfcatch -i a.out arg1
For more information, see the perfcatch(1) man page.
The MPI profiling result file has a summary statistics section followed by a rank-by-rank profiling information section. The summary statistics section reports overall statistics, including the percentage of time each rank spent in MPI functions and the MPI processes that spent the least and the most time in MPI functions. Similar reports are made for system time usage.
The rank-by-rank profiling information section lists every profiled MPI function called by a particular MPI process. The number of calls and the total time consumed by these calls are reported. Some functions report additional information, such as average data counts and communication peer lists.
An example MPI_PROFILING_STATS results file is as follows:
============================================================
PERFCATCHER version 22
(C) Copyright SGI. This library may only be used
on SGI hardware platforms. See LICENSE file for details.
============================================================
MPI program profiling information
Job profile recorded Wed Jan 17 13:05:24 2007
Program command line: /home/estes01/michel/sastest/mpi_hello_linux
Total MPI processes 2

Total MPI job time, avg per rank                       0.0054768 sec
Profiled job time, avg per rank                        0.0054768 sec
Percent job time profiled, avg per rank                100%

Total user time, avg per rank                          0.001 sec
Percent user time, avg per rank                        18.2588%
Total system time, avg per rank                        0.0045 sec
Percent system time, avg per rank                      82.1648%

Time in all profiled MPI routines, avg per rank        5.75004e-07 sec
Percent time in profiled MPI routines, avg per rank    0.0104989%

Rank-by-Rank Summary Statistics
-------------------------------

Rank-by-Rank: Percent in Profiled MPI routines
        Rank:Percent
        0:0.0112245%    1:0.00968502%
  Least:  Rank 1      0.00968502%
  Most:   Rank 0      0.0112245%
  Load Imbalance:  0.000771%

Rank-by-Rank: User Time
        Rank:Percent
        0:17.2683%      1:19.3699%
  Least:  Rank 0      17.2683%
  Most:   Rank 1      19.3699%

Rank-by-Rank: System Time
        Rank:Percent
        0:86.3416%      1:77.4796%
  Least:  Rank 1      77.4796%
  Most:   Rank 0      86.3416%

Notes
-----
Wtime resolution is 5e-08 sec

Rank-by-Rank MPI Profiling Results
----------------------------------

Activity on process rank 0
Single-copy checking was not enabled.
comm_rank    calls:      1   time: 6.50005e-07 s   6.50005e-07 s/call

Activity on process rank 1
Single-copy checking was not enabled.
comm_rank    calls:      1   time: 5.00004e-07 s   5.00004e-07 s/call

------------------------------------------------
recv profile                 cnt/sec for all remote ranks
local   ANY_SOURCE        0           1
rank
------------------------------------------------
recv wait for data profile   cnt/sec for all remote ranks
local             0           1
rank
------------------------------------------------
recv wait for data profile   cnt/sec for all remote ranks
local             0           1
rank
------------------------------------------------
send profile                 cnt/sec for all destination ranks
src               0           1
rank
------------------------------------------------
ssend profile                cnt/sec for all destination ranks
src               0           1
rank
------------------------------------------------
ibsend profile               cnt/sec for all destination ranks
src               0           1
rank
The MPI performance profiling environment variables are as follows:
Variable | Description |
MPI_PROFILE_AT_INIT | Activates MPI profiling immediately, that is, at the start of MPI program execution. |
MPI_PROFILING_STATS_FILE | Specifies the file where MPI profiling results are written. If not specified, the file MPI_PROFILING_STATS is written. |
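As an illustration of the defaulting behavior described in the table above, the following C fragment (a sketch only, not part of perfcatch or the MPT library) resolves the results file name the same way the table describes: it honors MPI_PROFILING_STATS_FILE if it is set and otherwise falls back to the default name MPI_PROFILING_STATS.

/* Illustrative sketch only: determine the MPI profiling results file
 * name, honoring MPI_PROFILING_STATS_FILE and falling back to the
 * documented default. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *file = getenv("MPI_PROFILING_STATS_FILE");
    if (file == NULL)
        file = "MPI_PROFILING_STATS";   /* documented default name */
    printf("MPI profiling results will be written to: %s\n", file);
    return 0;
}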
The supported profiled MPI functions are as follows:
Note: Some functions may not be implemented in all languages, as indicated below.
Languages | Function |
C Fortran | mpi_allgather |
C Fortran | mpi_allgatherv |
C Fortran | mpi_allreduce |
C Fortran | mpi_alltoall |
C Fortran | mpi_alltoallv |
C Fortran | mpi_alltoallw |
C Fortran | mpi_barrier |
C Fortran | mpi_bcast |
C Fortran | mpi_comm_create |
C Fortran | mpi_comm_free |
C Fortran | mpi_comm_group |
C Fortran | mpi_comm_rank |
C Fortran | mpi_finalize |
C Fortran | mpi_gather |
C Fortran | mpi_gatherv |
C | mpi_get_count |
C Fortran | mpi_group_difference |
C Fortran | mpi_group_excl |
C Fortran | mpi_group_free |
C Fortran | mpi_group_incl |
C Fortran | mpi_group_intersection |
C Fortran | mpi_group_range_excl |
C Fortran | mpi_group_range_incl |
C Fortran | mpi_group_union |
C | mpi_ibsend |
C Fortran | mpi_init |
C | mpi_init_thread |
C Fortran | mpi_irecv |
C Fortran | mpi_isend |
C | mpi_probe |
C Fortran | mpi_recv |
C Fortran | mpi_reduce |
C Fortran | mpi_scatter |
C Fortran | mpi_scatterv |
C Fortran | mpi_send |
C Fortran | mpi_sendrecv |
C Fortran | mpi_ssend |
C Fortran | mpi_test |
C Fortran | mpi_testany |
C Fortran | mpi_wait |
C Fortran | mpi_waitall
This section describes the use of profiling tools to obtain performance information. Compared to the performance analysis of sequential applications, characterizing the performance of parallel applications can be challenging. Often it is most effective to first focus on improving the performance of MPI applications at the single process level.
It may also be important to understand the message traffic generated by an application. A number of tools can be used to analyze this aspect of a message passing application's performance, including Performance Co-Pilot and various third party products. In this section, you can learn how to use these various tools with MPI applications. It covers the following topics:
You can write your own profiling wrapper by using the MPI-1 standard PMPI_* calls. In addition, either within your own profiling library or within the application itself, you can use the MPI_Wtime function call to time specific calls or sections of your code.
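For example, the following is a minimal sketch of such a wrapper, assuming a C MPI application; the counter names and output format are illustrative and are not part of perfcatch or any SGI-supplied library. The wrapper intercepts MPI_Send through the PMPI profiling interface, counts calls, accumulates elapsed time with MPI_Wtime, and reports per-rank totals from the wrapped MPI_Finalize.

/* Minimal sketch of a user-written PMPI wrapper (illustrative only).
 * It counts MPI_Send calls and accumulates their elapsed time with
 * MPI_Wtime, then reports per-rank totals from the wrapped
 * MPI_Finalize.  The MPI_Send prototype must match the mpi.h in use;
 * MPI-3 headers declare the send buffer as const void *. */
#include <mpi.h>
#include <stdio.h>

static long   send_calls = 0;
static double send_time  = 0.0;

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();                  /* start timestamp */
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;            /* accumulate elapsed time */
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: send calls %ld  time %e sec\n",
            rank, send_calls, send_time);
    return PMPI_Finalize();
}

Compile this file into the application, or into a separate profiling library linked ahead of the MPI library, so that the application's calls to MPI_Send and MPI_Finalize resolve to these wrappers, which in turn call the underlying PMPI_ entry points.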
The following example is actual output for a single rank of a program that was run on 128 processors, using a user-created profiling library that performs call counts and timings of common MPI calls. Notice that for this rank most of the MPI time is being spent in MPI_Waitall and MPI_Allreduce.
Total job time  2.203333e+02 sec
Total MPI processes  128
Wtime resolution is  8.000000e-07 sec

activity on process rank 0
comm_rank   calls      1   time 8.800002e-06
get_count   calls      0   time 0.000000e+00
ibsend      calls      0   time 0.000000e+00
probe       calls      0   time 0.000000e+00
recv        calls      0   time 0.00000e+00   avg datacnt 0   waits 0   wait time 0.00000e+00
irecv       calls  22039   time 9.76185e-01   datacnt 23474032   avg datacnt 1065
send        calls      0   time 0.000000e+00
ssend       calls      0   time 0.000000e+00
isend       calls  22039   time 2.950286e+00
wait        calls      0   time 0.00000e+00   avg datacnt 0
waitall     calls  11045   time 7.73805e+01   # of Reqs 44078   avg data cnt 137944
barrier     calls    680   time 5.133110e+00
alltoall    calls      0   time 0.0e+00       avg datacnt 0
alltoallv   calls      0   time 0.000000e+00
reduce      calls      0   time 0.000000e+00
allreduce   calls   4658   time 2.072872e+01
bcast       calls    680   time 6.915840e-02
gather      calls      0   time 0.000000e+00
gatherv     calls      0   time 0.000000e+00
scatter     calls      0   time 0.000000e+00
scatterv    calls      0   time 0.000000e+00

activity on process rank 1
...
MPI keeps track of certain resource utilization statistics. These can be used to determine potential performance problems caused by lack of MPI message buffers and other MPI internal resources.
To turn on the display of MPI internal statistics, use the MPI_STATS environment variable or the -stats option on the mpirun command. MPI internal statistics are always gathered, so displaying them does not add significant overhead. In addition, you can sample the MPI statistics counters from within an application, allowing finer-grained measurements. If the MPI_STATS_FILE variable is set, the internal statistics are written to the file specified by this variable when the program completes. For information about these MPI extensions, see the mpi_stats man page.
These statistics can be very useful in optimizing codes in the following ways:
To determine if there are enough internal buffers and if processes are waiting (retries) to acquire them
To determine if single copy optimization is being used for point-to-point or collective calls
For additional information on how to use the MPI statistics counters to help tune the run-time environment for an MPI application, see Chapter 8, “Run-time Tuning”.
Two third-party tools that you can use with the SGI MPI implementation are Vampir from Pallas (www.pallas.com) and Jumpshot, which is part of the MPICH distribution. Both tools are effective for smaller, short-duration MPI jobs. However, the trace files these tools generate can be enormous for longer-running or highly parallel jobs. Tracing slows the program down, and, even more problematic, the analysis tools are often overwhelmed by the amount of trace data.