This chapter discusses ways in which the user can tune the run-time environment to improve the performance of an MPI message passing application on SGI computers. None of these ways involve application code changes.
One of the most common problems with optimizing message passing codes on large shared memory computers is achieving reproducible timings from run to run. To reduce run-time variability, you can take the following precautions:
Do not oversubscribe the system. In other words, do not request more CPUs than are available and do not request more memory than is available. Oversubscribing causes the system to wait unnecessarily for resources to become available and leads to variations in the results and less than optimal performance.
Avoid interference from other system activity. The Linux kernel uses more memory on node 0 than on other nodes (node 0 is called the kernel node in the following discussion). If your application uses almost all of the available memory per processor, the memory for processes assigned to the kernel node can unintentionally spill over to nonlocal memory. By keeping user applications off the kernel node, you can avoid this effect.
In addition, by restricting system daemons to run on the kernel node, you can deliver an additional percentage of each application CPU to the user.
Avoid interference with other applications. You can use cpusets to address this problem as well: cpusets let you effectively partition a large, distributed memory host in a fashion that minimizes interactions between jobs running concurrently on the system. See the Linux Resource Administration Guide for information about cpusets.
On a quiet, dedicated system, you can use dplace or the MPI_DSM_CPULIST shell variable to improve run-time performance repeatability (see the sketch after this list). These approaches are not as suitable for shared, nondedicated systems.
Use a batch scheduler; for example, Platform LSF from Platform Computing Corporation or PBS Professional from Altair Engineering, Inc. These batch schedulers use cpusets to avoid oversubscribing the system and possible interference between applications.
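As a rough illustration of the pinning approaches mentioned in this list, the following sketch shows two alternative ways to fix process placement for a small job on a dedicated system. It assumes bash syntax, an illustrative CPU range of 8-11, and a placeholder executable ./a.out; check the dplace(1) man page for your release before relying on the -s1 and -c options shown here.

# Pin four MPI ranks to CPUs 8 through 11 with the MPI_DSM_CPULIST shell variable
export MPI_DSM_CPULIST=8-11
mpirun -np 4 ./a.out

# Alternatively, place the ranks with dplace, skipping the mpirun shepherd process
mpirun -np 4 dplace -s1 -c 8-11 ./a.out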
By default, the SGI MPI implementation buffers messages whose lengths exceed 64 bytes. Longer messages are buffered in a shared memory region to allow for exchange of data between MPI processes. In the SGI MPI implementation, these buffers are divided into two basic pools.
For messages exchanged between MPI processes within the same host, between partitioned systems when using the XPMEM driver, or when there are more than MPI_BUFS_THRESHOLD hosts, buffers from the “per process” pool (called the “per proc” pool) are used. Each MPI process is allocated a fixed portion of this pool when the application is launched. Each of these portions is logically partitioned into 16-KB buffers.
For MPI jobs running across multiple hosts, a second pool of shared memory is available. Messages exchanged between MPI processes on different hosts use this pool of shared memory, called the “per host” pool. The structure of this pool is somewhat more complex than the “per proc” pool.
For an MPI job running on a single host, messages that exceed 64 bytes are handled as follows. For messages with a length of 128 KB or less, the sender MPI process buffers the entire message. It then delivers a message header (also called a control message) to a mailbox, which is polled by the MPI receiver when an MPI call is made. Upon finding a matching receive request for the sender's control message, the receiver copies the data out of the shared memory buffer into the application buffer indicated in the receive request. The receiver then sends a message header back to the sender process, indicating that the shared memory buffer is available for reuse. Messages whose length exceeds 128 KB are broken down into 128 KB chunks, allowing the sender and receiver to overlap the copying of data to and from shared memory in a pipeline fashion.
Because there is a finite number of these shared memory buffers, this can be a constraint on the overall application performance for certain communication patterns. You can use the MPI_BUFS_PER_PROC shell variable to adjust the number of buffers available for the “per proc” pool. Similarly, you can use the MPI_BUFS_PER_HOST shell variable to adjust the “per host” pool. You can use the MPI statistics counters to determine if retries for these shared memory buffers are occurring.
For information on the use of these counters, see “MPI Internal Statistics” in Chapter 9. In general, you can avoid excessive numbers of retries for buffers by increasing the number of buffers for the “per proc” pool or “per host” pool. However, you should keep in mind that increasing the number of buffers does consume more memory. Also, increasing the number of “per proc” buffers does potentially increase the probability for cache pollution (that is, the excessive filling of the cache with message buffers). Cache pollution can result in degraded performance during the compute phase of a message passing application.
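For example, a minimal sketch of raising the buffer counts before launching a job might look like the following. It assumes bash syntax; the values 64 and 512 are purely illustrative, not recommendations, and ./a.out is a placeholder executable.

# Enlarge the "per proc" pool (intra-host and XPMEM traffic)
export MPI_BUFS_PER_PROC=64
# Enlarge the "per host" pool (InfiniBand and TCP multihost traffic)
export MPI_BUFS_PER_HOST=512
mpirun -np 32 ./a.out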
There are additional buffering considerations to take into account when running an MPI job across multiple hosts. For further discussion of multihost runs, see “Tuning for Running Applications Across Multiple Hosts”.
For further discussion on programming implications concerning message buffering, see “Buffering” in Chapter 4.
For message transfers between MPI processes within the same host or transfers between partitions, it is possible under certain conditions to avoid the need to buffer messages. Because many MPI applications are written assuming infinite buffering, the use of this unbuffered approach is not enabled by default for MPI_Send. This section describes how to activate this mechanism by default for MPI_Send.
For MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce, and MPI_Reduce, this optimization is enabled by default for large message sizes. To disable this default single copy feature used for the collectives, use the MPI_DEFAULT_SINGLE_COPY_OFF environment variable.
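For example, to disable the default single copy behavior for these calls when debugging or comparing performance, you might set the variable before launching the job. This sketch assumes bash syntax; setting the variable is what matters, so the value 1 is illustrative, and ./a.out is a placeholder.

# Turn off the default single copy optimization for the calls listed above
export MPI_DEFAULT_SINGLE_COPY_OFF=1
mpirun -np 16 ./a.out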
MPI takes advantage of the XPMEM driver to support single copy message transfers between two processes within the same host or across partitions.
Enabling single copy transfers may result in better performance, since this technique improves MPI's bandwidth. However, single copy transfers may introduce additional synchronization points, which can reduce application performance in some cases.
The threshold for message lengths beyond which MPI attempts to use this single copy method is specified by the MPI_BUFFER_MAX shell variable. Its value should be set to the message length in bytes beyond which the single copy method should be tried. In general, a value of 2000 or higher is beneficial for many applications.
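A minimal sketch using the 2000-byte threshold suggested above follows, assuming bash syntax and a placeholder executable ./a.out.

# Try single copy (unbuffered) transfers for MPI_Send messages longer than 2000 bytes
export MPI_BUFFER_MAX=2000
mpirun -np 16 ./a.out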
During job startup, MPI uses the XPMEM driver (via the xpmem kernel module) to map memory from one MPI process to another. The mapped areas include the static (BSS) region, the private heap, the stack region, and optionally the symmetric heap region of each process.
Memory mapping allows each process to directly access memory from the address space of another process. This technique allows MPI to support single copy transfers for contiguous data types from any of these mapped regions. For these transfers, whether between processes residing on the same host or across partitions, the data is copied using a bcopy process. A bcopy process is also used to transfer data between two different executable files on the same host or two different executable files across partitions. For data residing outside of a mapped region (a /dev/zero region, for example), MPI uses a buffering technique to transfer the data.
Memory mapping is enabled by default. To disable it, set the MPI_MEMMAP_OFF environment variable. Memory mapping must be enabled to allow single-copy transfers, MPI-2 one-sided communication, support for the SHMEM model, and certain collective optimizations.
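If you need to rule out memory mapping while troubleshooting, the following sketch disables it. It assumes bash syntax; the variable only needs to be set, so the value 1 is arbitrary, and ./a.out is a placeholder.

# Disable cross-process memory mapping; this also disables single copy transfers,
# MPI-2 one-sided communication, SHMEM support, and certain collective optimizations
export MPI_MEMMAP_OFF=1
mpirun -np 16 ./a.out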
The MPI library takes advantage of NUMA placement functions that are available. Usually, the default placement is adequate. Under certain circumstances, however, you might want to modify this default behavior. The easiest way to do this is by setting one or more MPI placement shell variables. Several of the most commonly used of these variables are described in the following sections. For a complete listing of memory placement related shell variables, see the MPI(1) man page.
The MPI_DSM_CPULIST shell variable allows you to manually select processors to use for an MPI application. At times, specifying a list of processors on which to run a job can be the best means to ensure highly reproducible timings, particularly when running on a dedicated system.
This setting is treated as a comma- and/or hyphen-delimited ordered list that specifies a mapping of MPI processes to CPUs. If running across multiple hosts, the per-host components of the CPU list are delimited by colons. Within hyphen-delimited lists, CPU striding may be specified by placing "/#" after the list, where "#" is the stride distance.
Note: This feature should not be used with MPI applications that use either of the MPI-2 spawn related functions.
Examples of settings are as follows:
Value | CPU Assignment
---|---
8,16,32 | Place three MPI processes on CPUs 8, 16, and 32.
32,16,8 | Place MPI process rank zero on CPU 32, rank one on CPU 16, and rank two on CPU 8.
8-15/2 | Place MPI processes 0 through 3 strided on CPUs 8, 10, 12, and 14.
8-15,32-39 | Place MPI processes 0 through 7 on CPUs 8 to 15. Place MPI processes 8 through 15 on CPUs 32 to 39.
39-32,8-15 | Place MPI processes 0 through 7 on CPUs 39 to 32. Place MPI processes 8 through 15 on CPUs 8 to 15.
8-15:16-23 | Place MPI processes 0 through 7 on CPUs 8 through 15 on the first host. Place MPI processes 8 through 15 on CPUs 16 to 23 on the second host.
Note that the process rank is the MPI_COMM_WORLD rank. The interpretation of the CPU values specified in the MPI_DSM_CPULIST depends on whether the MPI job is being run within a cpuset. If the job is run outside of a cpuset, the CPUs specify cpunum values beginning with 0 and up to the number of CPUs in the system minus one. When running within a cpuset, the default behavior is to interpret the CPU values as relative processor numbers within the cpuset.
The number of processors specified should equal the number of MPI processes that will be used to run the application. The number of colon delineated parts of the list must equal the number of hosts used for the MPI job. If an error occurs in processing the CPU list, the default placement policy is used.
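For example, the last entry in the table above could be used for a two-host job as in the following sketch. It assumes bash syntax, placeholder host names host1 and host2, and a placeholder executable ./a.out; note that -np gives the per-host process count, as in the hybrid MPI/OpenMP example later in this chapter.

# Ranks 0-7 on CPUs 8-15 of the first host, ranks 8-15 on CPUs 16-23 of the second host
export MPI_DSM_CPULIST=8-15:16-23
mpirun host1,host2 -np 8 ./a.out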
Use the MPI_DSM_DISTRIBUTE shell variable to ensure that each MPI process will get a physical CPU and memory on the node to which it was assigned. If this environment variable is used without specifying an MPI_DSM_CPULIST variable, it will cause MPI to assign MPI ranks starting at logical CPU 0 and incrementing until all ranks have been placed. Therefore, it is recommended that this variable be used only if running within a cpuset on a dedicated system.
Setting the MPI_DSM_VERBOSE shell variable directs MPI to display a synopsis of the NUMA and host placement options being used at run time.
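A short sketch combining the two variables when running inside a cpuset on a dedicated system follows; it assumes bash syntax, and because setting each variable is what matters, the value 1 is illustrative.

# Pin each rank to a CPU, starting at logical CPU 0 within the cpuset
export MPI_DSM_DISTRIBUTE=1
# Print a synopsis of the NUMA and host placement actually used at run time
export MPI_DSM_VERBOSE=1
mpirun -np 16 ./a.out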
The dplace tool offers another means of specifying the placement of MPI processes within a distributed memory host. The dplace tool and MPI interoperate to allow MPI to better manage placement of certain shared memory data structures when dplace is used to place the MPI job.
For instructions on how to use dplace with MPI, see the dplace(1) man page and the Linux Application Tuning Guide.
A hybrid MPI/OpenMP application is one in which each MPI process is itself a parallel threaded program. These programs often exploit OpenMP parallelism at the loop level while also implementing a higher-level parallel algorithm using MPI.
Many parallel applications perform better if the MPI processes and the threads within them are pinned to particular processors for the duration of their execution. For ccNUMA systems, this ensures that all local, non-shared memory is allocated on the same memory node as the processor referencing it. For all systems, it can ensure that some or all of the OpenMP threads stay on processors that share a bus or perhaps a processor cache, which can speed up thread synchronization.
MPT provides the omplace(1) command to help with the placement of OpenMP threads within an MPI program. The omplace command causes the threads in a hybrid MPI/OpenMP job to be placed on unique CPUs within the containing cpuset. For example, the threads in a 2-process MPI program with 2 threads per process would be placed as follows:
rank 0 thread 0 on CPU 0
rank 0 thread 1 on CPU 1
rank 1 thread 0 on CPU 2
rank 1 thread 1 on CPU 3
The CPU placement is performed by dynamically generating a dplace(1) placement file and invoking dplace.
For detailed syntax and a number of examples, see the omplace(1) man page. For more information on dplace, see the dplace(1) man page. For information on using cpusets, see the Linux Resource Administration Guide. For more information on using dplace, see the Linux Application Tuning Guide.
Example 8-1. How to Run a Hybrid MPI/OpenMP Application
Here is an example of how to run a hybrid MPI/OpenMP application with eight MPI processes that are two-way threaded on two hosts:
mpirun host1,host2 -np 4 omplace -nt 2 ./a.out
When using the PBS batch scheduler to schedule a hybrid MPI/OpenMP job such as the one shown in Example 8-1, use the following resource allocation specification:
#PBS -l select=8:ncpus=2
Then launch the job with the following mpiexec command:
mpiexec -n 8 omplace -nt 2 ./a.out
For more information about running MPT programs with PBS, see “Running MPI Jobs with a Work Load Manager” in Chapter 3.
When you are running an MPI application across a cluster of hosts, there are additional run-time environment settings and configurations that you can consider when trying to improve application performance.
Systems can use the XPMEM interconnect to cluster hosts as partitioned systems, or use the Voltaire InfiniBand interconnect or TCP/IP as the multihost interconnect.
When launched as a distributed application, MPI probes for these interconnects at job startup. For details of launching a distributed application, see “Launching a Distributed Application” in Chapter 3. When a high performance interconnect is detected, MPI attempts to use this interconnect if it is available on every host being used by the MPI job. If the interconnect is not available for use on every host, the library attempts to use the next slower interconnect until this connectivity requirement is met. Table 8-1 specifies the order in which MPI probes for available interconnects.
Table 8-1. Inquiry Order for Available Interconnects
Interconnect | Default Order of Selection | Environment Variable to Require Use
---|---|---
XPMEM | 1 | MPI_USE_XPMEM
InfiniBand | 2 | MPI_USE_IB
TCP/IP | 3 | MPI_USE_TCP
The third column of Table 8-1 also indicates the environment variable you can set to pick a particular interconnect other than the default.
In general, to ensure the best performance of the application, you should allow MPI to pick the fastest available interconnect.
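If you do need to force a particular interconnect, a sketch using the variables from Table 8-1 might look like the following. It assumes bash syntax; only one of the variables would normally be set, the value 1 is illustrative, and host1, host2, and ./a.out are placeholders.

# Require the InfiniBand driver for this multihost job (see MPI_USE_IB in Table 8-1)
export MPI_USE_IB=1
mpirun host1,host2 -np 16 ./a.out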
In addition to the choice of interconnect, you should know that multihost jobs may use different buffers from those used by jobs run on a single host. In the SGI implementation of MPI, the XPMEM interconnect uses the “per proc” buffers while the InfiniBand and TCP interconnects use the “per host” buffers. The default setting for the number of buffers per proc or per host might be too low for many applications. You can determine whether this setting is too low by using the MPI statistics described earlier in this section.
When using the TCP/IP interconnect, unless specified otherwise, MPI uses the default IP adapter for each host. To use a nondefault adapter, enter the adapter-specific host name on the mpirun command line.
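For example, if each host has a second adapter reachable through an adapter-specific host name, a sketch might look like the following; the names host1-eth1 and host2-eth1 are hypothetical and must resolve to the desired interfaces on your hosts.

# Use the adapter-specific host names instead of the default IP adapter on each host
mpirun host1-eth1,host2-eth1 -np 16 ./a.out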
When using the InfiniBand interconnect, MPT applications must not execute a fork() or system() call. The InfiniBand driver produces undefined results when an MPT process using InfiniBand forks.
MPI_USE_IB requires the MPI library to use the InfiniBand driver as the interconnect when running across multiple hosts or running with multiple binaries. MPT requires the ibhost software stack from Voltaire when the InfiniBand interconnect is used. If InfiniBand is used, the MPI_COREDUMP environment variable is forced to INHIBIT, to comply with the InfiniBand driver restriction that no fork()s may occur after InfiniBand resources have been allocated. Default: Not set
When this is set to 1 and the MPI library uses the InfiniBand driver as the inter-host interconnect, MPT will send its InfiniBand traffic over the first fabric that it detects. If this is set to 2, the library will try to make use of multiple available separate InfiniBand fabrics and split its traffic across them. If the separate InfiniBand fabrics do not have unique subnet IDs, then the rail-config utility is required. It must be run by the system administrator to enable the library to correctly use the separate fabrics. Default: 1 on all SGI Altix systems.
When MPI transfers data over InfiniBand, if the size of the cumulative data is greater than this value, MPI attempts to send the data directly between the processes' buffers and not through intermediate buffers inside the MPI library. Default: 32767
For more information on these environment variables, see the “ENVIRONMENT VARIABLES” section of the mpi(1) man page.
When running an MPI application across a cluster of hosts using the InfiniBand interconnect, there are additional run-time environment settings that you can consider to improve application performance, as follows:
Controls the number of other ranks that a rank can receive from over InfiniBand using a short message fast path. This is 8 by default and can be any value between 0 and 32.
For zero-copy sends over the InfiniBand interconnect, MPT keeps a cache of application data buffers registered for these transfers. This environment variable controls the size of the cache. It is 8 by default and can be any value between 0 and 32. If the application rarely reuses data buffers, it may make sense to set this value to 0 to avoid cache thrashing.
For very large MPI jobs, the time and resource cost to create a connection between every pair of ranks at job start time can be considerable. When the number of ranks is at least the value of this variable, the MPI library creates InfiniBand connections lazily, on a demand basis. The default is 2048 ranks.
When the MPI library uses the InfiniBand fabric, it allocates some amount of memory for each message header that it uses for InfiniBand. If the size of data to be sent is not greater than this amount minus 64 bytes for the actual header, the data is inlined with the header. If the size is greater than this value, then the message is sent through remote direct memory access (RDMA) operations. The default is 16384 bytes.
When an InfiniBand card sends a packet, it waits some amount of time for an ACK packet to be returned by the receiving InfiniBand card. If it does not receive one, it sends the packet again. This variable controls that wait period. The time spent is equal to 4 * 2^MPI_IB_TIMEOUT microseconds. By default, the variable is set to 18 (see the sketch after these descriptions).
When the MPI library uses InfiniBand and this variable is set, and an InfiniBand transmission error occurs, MPT will try to restart the connection to the other rank. It will handle a number of errors of this type between any pair of ranks equal to the value of this variable. By default, the variable is set to 4.
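As an illustration of the retransmission timeout described above, the following sketch raises the ACK wait period; the arithmetic follows the 4 * 2^MPI_IB_TIMEOUT formula given earlier. It assumes bash syntax, placeholder host names, and a placeholder executable ./a.out.

# Default of 18 gives 4 * 2^18 microseconds, roughly 1 second per retry;
# a value of 20 gives 4 * 2^20 microseconds, roughly 4.2 seconds per retry
export MPI_IB_TIMEOUT=20
mpirun host1,host2 -np 16 ./a.out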
The SGI Altix UV 100 and Altix UV 1000 series systems are scalable nonuniform memory access (NUMA) systems that support a single Linux image of thousands of processors distributed over many sockets and SGI Altix UV Hub application-specific integrated circuits (ASICs). The UV Hub is the heart of the SGI Altix UV 1000 or Altix UV 100 system compute blade. Each "processor" is a hyperthread on a particular core within a particular socket. Each Altix UV Hub normally connects to two sockets. All communication between the sockets and the UV Hub uses Intel QuickPath Interconnect (QPI) channels. The Altix UV Hub has four NUMAlink 5 ports that connect with the NUMAlink 5 interconnect fabric. The UV Hub acts as a crossbar between the processors, local SDRAM memory, and the network interface. The Hub ASIC enables any processor in the single-system image (SSI) to access the memory of all processors in the SSI. For more information on the SGI Altix UV hub, Altix UV compute blades, QPI, and NUMAlink 5, see the SGI Altix UV 1000 System User's Guide or the SGI Altix UV 100 System User's Guide, respectively.
When MPI communicates between processes, two transfer methods are possible on an Altix UV system:
By use of shared memory
By use of the global reference unit (GRU), part of the Altix UV Hub ASIC
MPI chooses the method depending on internal heuristics, the type of MPI communication that is involved, and some user-tunable variables. When using the GRU to transfer data and messages, the MPI library uses GRU resources that it obtains from the GRU resource allocator, which divides the available GRU resources, fairly allocating buffer space and control blocks among the logical processors used by the MPI job.
Running MPI jobs optimally on Altix UV systems is not very difficult. It is best to pin MPI processes to CPUs and to isolate multiple MPI jobs onto different sets of sockets and Hubs; this is usually achieved by configuring a batch scheduler to create a cpuset for every MPI job. By default, MPI pins its processes to the sequential list of logical processors within the containing cpuset, but you can control and alter the pinning pattern using MPI_DSM_CPULIST (see “MPI_DSM_CPULIST”), omplace(1), and dplace(1).
The MPI library chooses buffer sizes and communication algorithms in an attempt to deliver the best performance automatically to a wide variety of MPI applications. However, applications have different performance profiles and bottlenecks, so user tuning may help improve performance. Here are some application performance types and ways that MPI performance may be improved for them (a combined example of the suggested settings follows these items):
Odd HyperThreads are idle.
Most high performance computing MPI programs run best using only one HyperThread per core. When an Altix UV system has multiple HyperThreads per core, logical CPUs are numbered such that odd HyperThreads are the high half of the logical CPU numbers. Therefore, the task of scheduling only on the even HyperThreads may be accomplished by scheduling MPI jobs as if only half the full number of logical CPUs exist, leaving the high logical CPUs idle. You can use the cpumap(1) command to determine whether cores have multiple HyperThreads on your Altix UV system. The output shows the number of physical and logical processors, whether Hyperthreading is on or off, and how shared processors are paired (toward the bottom of the command's output).
If an MPI job uses only half of the available logical CPUs, set GRU_RESOURCE_FACTOR to 2 so that the MPI processes can utilize all the available GRU resources on a Hub rather than reserving some of them for the idle HyperThreads. For more information about GRU resource tuning, see the gru_resource(3) man page.
MPI large message bandwidth is important.
Some programs transfer large messages via the MPI_Send function. To switch on the use of unbuffered, single copy transport in these cases you can set MPI_BUFFER_MAX to 0. See the MPI(1) man page for more details.
MPI small or near messages are very frequent.
For small fabric hop counts, shared memory message delivery is faster than GRU messages. To deliver all messages within an Altix UV host via shared memory, set MPI_SHARED_NEIGHBORHOOD to "host". See the MPI(1) man page for more details.
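The following sketch collects the settings suggested for the three cases above. It assumes bash syntax and a placeholder executable ./a.out; each setting applies only to its corresponding situation and would normally be used on its own.

# Odd HyperThreads idle: let MPI use the GRU resources otherwise reserved for them
export GRU_RESOURCE_FACTOR=2
# Large-message bandwidth matters: use unbuffered single copy transport for MPI_Send
export MPI_BUFFER_MAX=0
# Frequent small or near messages: deliver all intra-host messages via shared memory
export MPI_SHARED_NEIGHBORHOOD=host
mpirun -np 32 ./a.out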
MPI application processes normally perform best if their local memory is allocated on the socket assigned to execute them. This cannot happen if memory on that socket is exhausted by the application or by other system consumption, for example, file buffer cache. Use the nodeinfo(1) command to view memory consumption on the nodes assigned to your job, and use bcfree(1) to clear out excessive file buffer cache. PBS Professional batch scheduler installations can be configured to issue bcfree commands in the job prologue. For more information, see the PBS Professional documentation and the bcfree(1) man page.
MPI software from SGI can internally use the XPMEM kernel module to provide direct access to data on remote partitions and to provide single copy operations to local data. Any pages used by these operations are prevented from paging by the XPMEM kernel module. If an administrator needs to temporarily suspend an MPI application to allow other applications to run, the administrator can unpin these pages so that they can be swapped out and made available for other applications.
Each process of an MPI application that uses the XPMEM kernel module has a /proc/xpmem/pid file associated with it. The number of pages owned by this process that are prevented from paging by XPMEM can be displayed by using the cat(1) command on the /proc/xpmem/pid file, for example:
# cat /proc/xpmem/5562
pages pinned by XPMEM: 17
To unpin the pages for this process, the administrator echoes a 1 into the /proc/xpmem/pid file, for example:

# echo 1 > /proc/xpmem/5562
The echo command will not return until that process's pages are unpinned.
When the MPI application is resumed, the XPMEM kernel module will again prevent these pages from paging as they are referenced by the application.