Tuning an application involves making your program run as fast as possible on the available hardware. The first step is to make your program run as efficiently as possible on a single-processor system; after that, consider ways to use parallel processing.
Application tuning is different from system tuning, which involves topics such as disk partitioning, optimizing memory management, and configuration of the system. For information about system tuning on SGI Altix UV series systems, see the SGI Altix UV Systems Linux Configuration and Operations Guide.
This chapter provides an overview of concepts involved in working in parallel computing environments.
Scalability is computational power that can grow over a large number of CPUs. Scalability depends on the communication time between nodes on the system. Latency is the time to send the first byte between nodes.
A Symmetric Multiprocessor (SMP) is a parallel programming environment in which all processors have equally fast (symmetric) access to memory. These types of systems are easy to assemble but have limited scalability due to memory access times.
On a symmetric multiprocessor (SMP) machine, all data is visible from all processors. NonUniform Memory Access (NUMA) machines also have a shared address space. In both cases, there is a single shared memory space and a single operating system instance. However, in an SMP machine, each processor is functionally identical and has equal time access to every memory address. In contrast, a NUMA system has a shared address space, but the access time to memory varies over physical address ranges and between processing elements. The Intel Xeon 7500 series processor (Nehalem i7 architecture) is an example of NUMA architecture. Each processor has its own memory and can address the memory attached to another processor through the QuickPath Interconnect (QPI). For more information, see the system architecture overview in “Data Placement Tools Overview” in Chapter 5.
Another parallel environment is that of arrays, or clusters. Any networked computer can participate in a cluster. Clusters are highly scalable and easy to assemble, but are often hard to use. There is no shared memory and there are frequently long latency times.
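As a rough illustration of non-uniform access times, the Linux kernel publishes a table of relative distances between NUMA nodes under /sys/devices/system/node. The following Python sketch assumes that sysfs layout is present (it is absent on non-NUMA kernels and on other operating systems). The function names are illustrative and are not part of any SGI tool.

```python
import glob
import os
import re


def parse_distance_line(text):
    """Parse one sysfs 'distance' line, e.g. '10 21', into a list of ints."""
    return [int(field) for field in text.split()]


def read_numa_distances(root="/sys/devices/system/node"):
    """Return {node_id: [distance to node 0, distance to node 1, ...]}.

    Assumes the Linux sysfs NUMA layout. Returns an empty dict when the
    directory does not exist.
    """
    distances = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if match is None:
            continue
        try:
            with open(os.path.join(path, "distance")) as handle:
                row = parse_distance_line(handle.read())
        except OSError:
            continue  # node directory without a readable distance file
        distances[int(match.group(1))] = row
    return distances


if __name__ == "__main__":
    for node, row in sorted(read_numa_distances().items()):
        print(f"node {node}: {row}")
```

The values are relative, not measured times. By convention a node's distance to itself is reported as 10, and larger values indicate memory that is comparatively slower to reach from that node.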
Massively Parallel Processors (MPPs) have a distributed memory and can scale to thousands of processors; they have large memories and large local memory bandwidth.
Scalable Symmetric Multiprocessors (S2MPs), as in the ccNUMA environment, combine qualities of SMPs and MPPs. They are logically programmable like an SMP and have MPP-like scalability.
See the appropriate Altix hardware manual for system architecture overviews. The SGI Performance Suite 1.0 Start Here lists all the current SGI hardware manuals. The SGI Tempo System Administrator's Guide provides system architecture overviews for the SGI Altix ICE 8200 and SGI Altix ICE 8400 series systems. All books are available on the Tech Pubs Library at http://docs.sgi.com
Virtual memory (VM), also known as virtual addressing, is used to divide a system's relatively small amount of physical memory among the potentially larger number of logical processes in a program. It does this by dividing physical memory into pages, and then allocating pages to processes as the pages are needed.
A page is the smallest unit of system memory allocation. Pages are added to a process when either a page fault occurs or an allocation request is issued. Process size is measured in pages and two sizes are associated with every process: the total size and the resident set size (RSS). The number of pages being used in a process and the process size can be determined by using either the ps(1) or the top(1) command.
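The same page counts that ps(1) and top(1) report can also be read programmatically. The following Python sketch assumes a Linux /proc filesystem, where the first two fields of /proc/<pid>/statm give the total size and the resident set size in pages. The function names are illustrative only.

```python
import os


def parse_statm(text, page_size):
    """Convert the first two fields of /proc/<pid>/statm to pages and bytes.

    Field 1 is the total program size and field 2 is the resident set
    size (RSS); both are counted in pages.
    """
    fields = text.split()
    total_pages = int(fields[0])
    resident_pages = int(fields[1])
    return {
        "total_pages": total_pages,
        "resident_pages": resident_pages,
        "total_bytes": total_pages * page_size,
        "resident_bytes": resident_pages * page_size,
    }


def process_memory(pid="self"):
    """Read page-based sizes for a process (assumes Linux /proc)."""
    page_size = os.sysconf("SC_PAGE_SIZE")  # bytes per page
    with open(f"/proc/{pid}/statm") as handle:
        return parse_statm(handle.read(), page_size)


if __name__ == "__main__":
    info = process_memory()
    print(f"total: {info['total_pages']} pages, "
          f"resident: {info['resident_pages']} pages")
```

Multiplying a page count by the system page size gives the size in bytes, which is how the two process sizes described above relate to the figures shown by other tools.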
Swap space is used for temporarily saving parts of a program when there is not enough physical memory. The swap space may be on the system drive, on an optional drive, or allocated to a particular file in a filesystem. To avoid swapping, try not to overburden memory. Lack of adequate swap space limits the number and the size of applications that can run simultaneously on the system, and it can limit system performance. Access time to disk is orders of magnitude slower than access to random access memory (RAM). A system that runs out of memory and uses swap to disk while running a program will have its performance seriously affected, as swapping will become a major bottleneck. Be sure your system is configured with enough memory to run your applications.
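One way to check whether a system is dipping into swap is to compare the SwapTotal and SwapFree fields of /proc/meminfo. The following Python sketch assumes a Linux /proc filesystem. The helper names are hypothetical, and the sketch only reports usage; it does not decide how much swap is too much.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field name: integer value}.

    Most fields are reported in kB; a few (such as huge page counts)
    are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_summary(meminfo):
    """Summarize swap usage from a parsed meminfo dict (values in kB)."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    return {
        "swap_total_kb": total,
        "swap_used_kb": total - free,
        "mem_free_kb": meminfo.get("MemFree", 0),
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(swap_summary(parse_meminfo(handle.read())))
```

A swap_used_kb value that keeps growing while an application runs suggests that the working set does not fit in physical memory.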
Linux is a demand paging operating system, using a least-recently-used paging algorithm. Pages are mapped into physical memory when first referenced, and pages that have been swapped out are brought back into memory when they are referenced again. In a system that uses demand paging, the operating system copies a disk page into physical memory only if an attempt is made to access it, that is, when a page fault occurs. The page fault handler then performs the necessary action. For more information, see the mmap(2) man page.
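Demand paging can be observed from user space by mapping memory and counting page faults as the pages are first touched. The following Python sketch is an illustration under stated assumptions: it uses an anonymous mapping rather than a file, and it relies on the minor fault counter reported by getrusage(2). The function name and the default page count are arbitrary. On systems that use huge pages, one fault may cover many pages, so the count can be much lower than the number of pages touched.

```python
import mmap
import resource


def count_faults_while_touching(num_pages=4096):
    """Map anonymous memory, write one byte to each page, and return
    (minor faults observed during the writes, number of pages touched)."""
    page_size = mmap.PAGESIZE
    length = num_pages * page_size
    region = mmap.mmap(-1, length)  # anonymous mapping, not yet touched
    try:
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        for offset in range(0, length, page_size):
            # The first write to an unmapped page typically causes a fault.
            region[offset] = 1
        after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        # Read the bytes back to confirm every page was written.
        touched = sum(region[offset] for offset in range(0, length, page_size))
    finally:
        region.close()
    return after - before, touched


if __name__ == "__main__":
    faults, touched = count_faults_while_touching()
    print(f"touched {touched} pages, observed {faults} minor faults")
```

Creating the mapping is cheap because no physical pages are assigned at that point; the cost is paid page by page as the loop writes to the region.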