Portability is one of the main advantages MPI has over vendor-specific message passing software. Nonetheless, the MPI Standard offers sufficient flexibility for variations in vendor implementations, and there are often vendor-specific programming recommendations for optimal use of the MPI library. This chapter addresses topics of interest to those developing or porting MPI applications to SGI systems.
This section describes the behavior of the SGI MPI implementation upon normal job termination. Error handling and characteristics of abnormal job termination are also described.
In the SGI MPI implementation, a call to MPI_Abort causes the termination of the entire MPI job, regardless of the communicator argument used. The error code value is returned as the exit status of the mpirun command. A stack traceback is displayed that shows where the program called MPI_Abort.
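For illustration, the following minimal sketch (the error code 42 is arbitrary) shows how the value passed to MPI_Abort can later be inspected as the exit status of mpirun, for example with `echo $?` in the shell:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 aborts the entire job; the error code (42 here) is
       returned as the exit status of the mpirun command. */
    if (rank == 0)
        MPI_Abort(MPI_COMM_WORLD, 42);

    MPI_Finalize();
    return 0;
}
```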
Section 7.2 of the MPI Standard describes MPI error handling. Although almost all MPI functions return an error status, when an error occurs an error handler is invoked before the function returns. If the function has an associated communicator, the error handler associated with that communicator is invoked. Otherwise, the error handler associated with MPI_COMM_WORLD is invoked.
The SGI MPI implementation provides the following predefined error handlers:
- MPI_ERRORS_ARE_FATAL: when called, this handler causes the program to abort on all executing processes. This has the same effect as if MPI_Abort were called by the process that invoked the handler.
- MPI_ERRORS_RETURN: this handler has no effect.
By default, the MPI_ERRORS_ARE_FATAL error handler is associated with MPI_COMM_WORLD and any communicators derived from it. Hence, to handle the error statuses returned from MPI calls, it is necessary to associate either the MPI_ERRORS_RETURN handler or another user defined handler with MPI_COMM_WORLD near the beginning of the application.
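As a minimal sketch, the following code replaces the default handler on MPI_COMM_WORLD with MPI_ERRORS_RETURN and then checks the error status of a call that deliberately fails (the out-of-range destination rank is used only to provoke an error):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, err, len, payload = 1;
    char errstr[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so that
       subsequent errors are returned as status codes instead of
       aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately provoke an error: the destination rank is out of range. */
    err = MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, errstr, &len);
        fprintf(stderr, "MPI_Send returned an error: %s\n", errstr);
    }

    MPI_Finalize();
    return 0;
}
```

MPI_Comm_set_errhandler is the MPI-2 name for this operation; the older MPI_Errhandler_set call performs the same function.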
In the SGI implementation of MPI, all pending communications involving an MPI process must be completed before the process calls MPI_Finalize. If there are any unmatched or incomplete send or receive requests, the application will hang in MPI_Finalize. For more details, see section 7.5 of the MPI Standard.
If the application uses the MPI-2 spawn functionality described in Chapter 5 of the MPI-2 Standard, there are additional considerations. In the SGI implementation, all MPI processes are connected. Section 5.5.4 of the MPI-2 Standard defines what is meant by connected processes. When the MPI-2 spawn functionality is used, MPI_Finalize is collective over all connected processes. Thus all MPI processes, whether launched from the command line or subsequently spawned, synchronize in MPI_Finalize.
In the SGI implementation, MPI processes are UNIX processes. As such, the general rules regarding signal handling apply as they would to ordinary UNIX processes.
In addition, the SIGURG and SIGUSR1 signals can be propagated from the mpirun process to the other processes in the MPI job, whether they belong to the same process group on a single host, or are running across multiple hosts in a cluster. To make use of this feature, the MPI program must have a signal handler that catches SIGURG or SIGUSR1. When the SIGURG or SIGUSR1 signals are sent to the mpirun process ID, the mpirun process catches the signal and propagates it to all MPI processes.
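The following sketch, assuming a job launched with mpirun, installs a SIGUSR1 handler and waits for the propagated signal (sending, for example, kill -USR1 to the mpirun process ID delivers it to every rank):

```c
#include <mpi.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1 = 0;

/* Handler for the signal that mpirun propagates to all MPI processes. */
static void usr1_handler(int sig)
{
    got_usr1 = 1;
}

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    signal(SIGUSR1, usr1_handler);

    /* A real application would poll this flag from its work loop,
       for example to trigger a checkpoint. */
    while (!got_usr1)
        sleep(1);

    printf("rank %d received SIGUSR1\n", rank);

    MPI_Finalize();
    return 0;
}
```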
Most MPI implementations use buffering for overall performance reasons and some programs depend on it. However, you should not assume that there is any message buffering between processes because the MPI Standard does not mandate a buffering strategy. Table 4-1 illustrates a simple sequence of MPI operations that cannot work unless messages are buffered. If sent messages were not buffered, each process would hang in the initial call, waiting for an MPI_Recv call to take the message.
Because most MPI implementations do buffer messages to some degree, a program like this does not usually hang. The MPI_Send calls return after putting the messages into buffer space, and the MPI_Recv calls get the messages. Nevertheless, program logic like this is not valid according to the MPI Standard. Programs that require this sequence of MPI calls should employ one of the buffered MPI send calls, MPI_Bsend or MPI_Ibsend, or restructure the exchange so that it does not depend on buffering, as shown in the sketch after Table 4-1.
Table 4-1. Outline of Improper Dependence on Buffering
| Process 1 | Process 2 |
|---|---|
| MPI_Send(2,....) | MPI_Send(1,....) |
| MPI_Recv(2,....) | MPI_Recv(1,....) |
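A portable way to express the exchange in Table 4-1 without relying on buffering is to pair each send and receive with MPI_Sendrecv (or to post nonblocking receives first). The following sketch assumes exactly two ranks:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, other, sendbuf, recvbuf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other = 1 - rank;          /* assumes ranks 0 and 1 only */
    sendbuf = rank;

    /* MPI_Sendrecv pairs the send and receive internally, so the
       exchange completes whether or not messages are buffered. */
    MPI_Sendrecv(&sendbuf, 1, MPI_INT, other, 0,
                 &recvbuf, 1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}
```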
By default, the SGI implementation of MPI uses buffering under most circumstances. Short messages (64 or fewer bytes) are always buffered. Longer messages are also buffered, although under certain circumstances buffering can be avoided. For performance reasons, it is sometimes desirable to avoid buffering. For further information on unbuffered message delivery, see “Programming Optimizations”.
SGI MPI supports hybrid programming models, in which MPI is used to handle one level of parallelism in an application, while POSIX threads or OpenMP threads are used to handle another level. When mixing OpenMP with MPI, for performance reasons it is better to invoke MPI functions only outside parallel regions or only from within master regions. When used in this manner, it is not necessary to initialize MPI for thread safety, and you can use MPI_Init to initialize MPI. However, to safely invoke MPI functions from any OpenMP thread, or when using POSIX threads, MPI must be initialized with MPI_Init_thread.
When using MPI_Init_thread() to request the MPI_THREAD_MULTIPLE threading level, link your program with -lmpi_mt instead of -lmpi. See the mpi(1) man page for more information about compiling and linking MPI programs.
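As a sketch of the threaded initialization (the program must be linked with -lmpi_mt rather than -lmpi; see the mpi(1) man page for the exact compile options on your system):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request full thread support and check what the library grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not available\n");

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With MPI_THREAD_MULTIPLE, any OpenMP thread may call MPI. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}
```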
You can mix SHMEM and MPI message passing in the same program. The application must be linked with both the SHMEM and MPI libraries. Start with an MPI program that calls MPI_Init and MPI_Finalize.
When you add SHMEM calls, the PE numbers are equal to the MPI rank numbers in MPI_COMM_WORLD. Do not call start_pes() in a mixed MPI and SHMEM program.
When running the application across a cluster, some MPI processes may not be able to communicate with certain other MPI processes when using SHMEM functions. You can use the shmem_pe_accessible and shmem_addr_accessible functions to determine whether a SHMEM call can be used to access data residing in another MPI process. Because the SHMEM model functions only with respect to MPI_COMM_WORLD, these functions cannot be used to exchange data between MPI processes that are connected via MPI intercommunicators returned from MPI-2 spawn-related functions.
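The following minimal sketch combines the two models: MPI performs initialization and SHMEM performs the data transfer, with shmem_pe_accessible guarding the remote access (the ring-style exchange is only an example):

```c
#include <mpi.h>
#include <mpp/shmem.h>
#include <stdio.h>

/* Static (symmetric) storage is globally accessible to other PEs. */
static long target;

int main(int argc, char **argv)
{
    int rank, size, partner;

    /* Initialize with MPI; do not call start_pes() in a mixed program.
       SHMEM PE numbers equal the MPI ranks in MPI_COMM_WORLD. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    partner = (rank + 1) % size;

    /* On a cluster, confirm the partner PE is reachable through SHMEM
       before using a SHMEM call to access its data. */
    if (shmem_pe_accessible(partner))
        shmem_long_p(&target, (long)rank, partner);

    shmem_barrier_all();
    printf("rank %d: target = %ld\n", rank, target);

    MPI_Finalize();
    return 0;
}
```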
SHMEM get and put functions are thread safe. SHMEM collective and synchronization functions are not thread safe unless different threads use different pSync and pWork arrays.
For more information about the SHMEM programming model, see the intro_shmem man page.
This section describes other characteristics of the SGI MPI implementation that might be of interest to application developers.
In this implementation, stdin is enabled for only those MPI processes with rank 0 in the first MPI_COMM_WORLD (which does not need to be located on the same host as mpirun). stdout and stderr results are enabled for all MPI processes in the job, whether launched via mpirun, or via one of the MPI-2 spawn functions.
This section describes ways in which the MPI application developer can best make use of optimized features of SGI's MPI implementation. Following recommendations in this section might require modifications to your MPI application.
MPI provides a number of routines for point-to-point communication. The most efficient ones in terms of latency and bandwidth are the blocking and nonblocking send/receive functions (MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv).
Unless required for application semantics, the synchronous send calls (MPI_Ssend and MPI_Issend) should be avoided. The buffered send calls (MPI_Bsend and MPI_Ibsend) should also usually be avoided as these double the amount of memory copying on the sender side. The ready send routines (MPI_Rsend and MPI_Irsend) are treated as standard MPI_Send and MPI_Isend in this implementation. Persistent requests do not offer any performance advantage over standard requests in this implementation.
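As a sketch of the preferred pattern, the fragment below posts a nonblocking receive and send and waits on both, instead of using synchronous or buffered sends (two ranks assumed):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, other, sendbuf, recvbuf;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other = 1 - rank;            /* assumes ranks 0 and 1 only */
    sendbuf = rank;

    /* Post the nonblocking receive and send, then wait on both;
       no MPI_Ssend, MPI_Bsend, or MPI_Rsend is needed. */
    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);

    MPI_Finalize();
    return 0;
}
```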
The MPI collective calls are frequently layered on top of the point-to-point primitive calls. For small process counts, this can be reasonably effective. However, for higher process counts (32 processes or more) or for clusters, this approach can become less efficient. For this reason, a number of the MPI library collective operations have been optimized to use more complex algorithms.
Most collectives have been optimized for use with clusters. In these cases, steps are taken to reduce the number of messages that use the relatively slower interconnect between hosts.
Some of the collective operations have been optimized for use with shared memory. The barrier operation has also been optimized to use hardware fetch operations (fetchops). The MPI_Alltoall routines also use special techniques to avoid message buffering when using shared memory. For more details, see “Avoiding Message Buffering -- Single Copy Methods”.
Note: Collectives are optimized across partitions by using the XPMEM driver, which is explained in Chapter 8, “Run-time Tuning”. The collectives (except MPI_Barrier) will try to use single-copy by default for large transfers unless MPI_DEFAULT_SINGLE_COPY_OFF is specified.
While MPI_Pack and MPI_Unpack are useful for porting parallel virtual machine (PVM) codes to MPI, they essentially double the amount of data to be copied by both the sender and receiver. It is generally best to avoid the use of these functions by either restructuring your data or using derived data types. Note, however, that use of derived data types may lead to decreased performance in certain cases.
In general, you should avoid derived data types when possible. In the SGI implementation, use of derived data types does not generally lead to performance gains. Use of derived data types might disable certain types of optimizations (for example, unbuffered or single copy data transfer).
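For example, rather than describing a strided matrix column with MPI_Pack or a derived data type, the column can be gathered into a contiguous scratch buffer before sending. The helper below is a sketch only; the matrix size and static scratch buffer are illustrative assumptions:

```c
#include <mpi.h>

#define N 128

/* Send column `col` of an N x N row-major matrix to rank `dest`.
   Gathering the strided elements into a contiguous, statically
   allocated buffer keeps the message contiguous and in a globally
   accessible region, which preserves eligibility for single copy. */
static void send_column(const double *matrix, int col, int dest,
                        MPI_Comm comm)
{
    static double scratch[N];
    int i;

    for (i = 0; i < N; i++)
        scratch[i] = matrix[i * N + col];

    MPI_Send(scratch, N, MPI_DOUBLE, dest, 0, comm);
}
```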
The use of wild cards (MPI_ANY_SOURCE, MPI_ANY_TAG) involves searching multiple queues for messages. While this is not significant for small process counts, for large process counts the cost increases quickly.
One of the most significant optimizations for bandwidth-sensitive applications in the MPI library is the single copy optimization, which avoids the use of shared memory buffers. However, as discussed in “Buffering”, some incorrectly coded applications might hang because of buffering assumptions. For this reason, this optimization is not enabled by default for MPI_Send, but it can be turned on by the user at run time by using the MPI_BUFFER_MAX environment variable. The application developer can take the following steps to increase the opportunities for use of this unbuffered pathway (see the sketch after the note below):
- The MPI data type on the send side must be a contiguous type.
- The sender and receiver MPI processes must reside on the same host or, in the case of a partitioned system, the processes may reside on any of the partitions.
- The sender data must be globally accessible by the receiver. The SGI MPI implementation allows data allocated from the static region (common blocks), the private heap, and the stack region to be globally accessible. In addition, memory allocated via the MPI_Alloc_mem function or the SHMEM symmetric heap accessed via the shpalloc or shmalloc functions is globally accessible.
- Certain run-time environment variables must be set to enable the unbuffered, single copy method. For more details on how to set the run-time environment, see “Avoiding Message Buffering - Enabling Single Copy” in Chapter 8.
Note: With the Intel 7.1 compiler, ALLOCATABLE arrays are not eligible for single copy, since they do not reside in a globally accessible memory region. This restriction does not apply when using the Intel 8.0/8.1 compilers.
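The sketch below ties these conditions together: the buffer comes from MPI_Alloc_mem (globally accessible) and is sent as a contiguous type. Enabling the single copy path additionally requires the run-time settings described in Chapter 8, for example the MPI_BUFFER_MAX environment variable mentioned above (the message size here is arbitrary):

```c
#include <mpi.h>

#define NWORDS (1024 * 1024)

int main(int argc, char **argv)
{
    int rank;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Memory obtained from MPI_Alloc_mem is globally accessible and
       therefore eligible for the unbuffered, single copy path. */
    MPI_Alloc_mem(NWORDS * sizeof(double), MPI_INFO_NULL, &buf);

    if (rank == 0)
        MPI_Send(buf, NWORDS, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, NWORDS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}
```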
SGI Altix 4000 and SGI Altix UV 1000 series systems have a ccNUMA memory architecture. For single process and small multiprocess applications, this architecture behaves similarly to flat memory architectures. For more highly parallel applications, memory placement becomes important. MPI takes placement into consideration when laying out shared memory data structures, and the individual MPI processes' address spaces. In general, it is not recommended that the application programmer try to manage memory placement explicitly. There are a number of means to control the placement of the application at run time, however. For more information, see Chapter 8, “Run-time Tuning”.
The MPT software includes the Global Shared Memory (GSM) feature. This feature allows users to allocate globally accessible shared memory from within an MPI or SHMEM program. The GSM feature can be used to provide shared memory access across partitioned Altix systems and additional memory placement options within a single host configuration.
User-callable functions are provided to allocate a global shared memory segment, free that segment, and provide information about the segment. Once allocated, the application can use this new global shared memory segment via standard loads and stores, just as if it were a System V shared memory segment. For more information, see the GSM_Intro or GSM_Alloc man pages.
A number of additional programming options may be worth considering when developing MPI applications for SGI systems. For example, the SHMEM programming model can provide a means to improve the performance of latency-sensitive sections of an application. Usually, this requires replacing MPI send/recv calls with shmem_put/shmem_get and shmem_barrier calls. The SHMEM programming model can deliver significantly lower latencies for short messages than traditional MPI calls. As an alternative to shmem_get/shmem_put calls, you might consider the MPI-2 MPI_Put/MPI_Get functions. These provide almost the same performance as the SHMEM calls, while providing a greater degree of portability.
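As a sketch of the MPI-2 one-sided alternative, the following code exposes one long per process in an RMA window and performs a neighbor put, roughly analogous to a shmem_put followed by shmem_barrier_all (the ring pattern is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    long myrank, local = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    myrank = (long)rank;

    /* Expose one long per process for remote access. */
    MPI_Win_create(&local, sizeof(long), sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* Each rank writes its rank number into its right-hand neighbor. */
    MPI_Put(&myrank, 1, MPI_LONG, (rank + 1) % size, 0, 1, MPI_LONG, win);
    MPI_Win_fence(0, win);

    printf("rank %d: local = %ld\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```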
Alternately, you might consider exploiting the shared memory architecture of SGI systems by handling one or more levels of parallelism with OpenMP, with the coarser grained levels of parallelism being handled by MPI. Also, there are special ccNUMA placement considerations to be aware of when running hybrid MPI/OpenMP applications. For further information, see Chapter 8, “Run-time Tuning”.