Tuning an application involves making your program run its fastest on the available hardware. The first step is to make your program run as efficiently as possible on a single processor system and then consider ways to use parallel processing.
Application tuning is different from system tuning, which involves topics such as disk partitioning, optimizing memory management, and configuration of the system. The Linux Configuration and Operations Guide discusses those topics in detail.
This chapter provides an overview of concepts involved in working in parallel computing environments.
Scalability is the ability of computational power to grow as processors are added. Scalability depends on the communication time between nodes on the system. Latency is the time required to send the first byte of a message from one node to another.
A Symmetric Multiprocessor (SMP) is a parallel programming environment in which all processors have equally fast (symmetric) access to memory. These types of systems are easy to assemble but have limited scalability due to memory access times.
Another parallel environment is that of arrays, or clusters. Any networked computer can participate in a cluster. Clusters are highly scalable and easy to assemble, but they are often hard to use: there is no shared memory, and latency times are frequently long.
Massively Parallel Processors (MPPs) have a distributed memory and can scale to thousands of processors; they have large memories and large local memory bandwidth.
Scalable Symmetric Multiprocessors (S2MPs), as in the ccNUMA environment, combine qualities of SMPs and MPPs. They are programmable like an SMP yet have MPP-like scalability.
This section provides a brief overview of the SGI Altix 3000 and 4000 series systems.
In order to optimize your application code, some understanding of the SGI Altix architecture is needed. This section provides a broad overview of the system architecture.
The SGI Altix 3000 family of servers and superclusters can have as many as 256 processors and 2048 gigabytes of memory. It uses Intel Itanium 2 processors and implements nonuniform memory access (NUMA) in SGI's NUMAflex global shared-memory architecture. An SGI Altix 350 system can have as many as 16 processors and 96 gigabytes of memory.
The NUMAflex design permits modular packaging of CPU, memory, I/O, graphics, and storage into components known as bricks. The bricks can then be combined and configured into different systems, based on customer needs.
On Altix 3700 systems, two Itanium processors share a common frontside bus and memory. This constitutes a node in the NUMA architecture. Access to other memory (on another node) by these processors has a higher latency, and slightly different bandwidth characteristics. Two such nodes are packaged together in each computer brick. For a detailed overview, see the SGI Altix 3000 User's Guide.
On an SGI Altix 3700 Bx2 system, the CR-brick contains the processors (8 processors per CR-brick) and two internal high-speed routers. The routers connect to other system bricks via NUMAlink cables and expand the compute or memory capacity of the Altix 3700 Bx2. For a detailed overview, see the SGI Altix 3700 Bx2 User's Guide.
All Altix 350 systems contain at least one base compute module that contains the following components:
One or two Intel Itanium 2 processors; each processor has integrated L1, L2, and L3 caches
Up to 24 GB of local memory
Four PCI/PCI-X slots
One IO9 PCI card that comes factory-installed in the lowermost PCI/PCI-X slot
The system software consists of a standard Linux distribution (Red Hat) and SGI ProPack, which is an overlay providing additional features such as optimized libraries and enhanced kernel support. See Chapter 2, “The SGI Compiling Environment”, for details about the compilers and libraries included with the distribution.
In the new SGI Altix 4000 series systems, functional blades (interchangeable compute, memory, I/O, and special-purpose blades in an innovative blade-to-NUMAlink architecture) are the basic building blocks for the system. Compute blades with a bandwidth configuration have one processor socket per blade. Compute blades with a density configuration have two processor sockets per blade. Cost-effective compute density is one advantage of this compact blade packaging.
The Altix 4000 series is a family of multiprocessor distributed shared memory (DSM) computer systems that currently scales from 8 to 512 CPU sockets (up to 1,024 processor cores) and can accommodate up to 6 TB of globally shared memory in a single system while delivering a teraflop of performance in a small-footprint rack. The SGI Altix 450 currently scales from 2 to 76 cores as a cache-coherent single system image (SSI). In a DSM system, each processor board contains memory that it shares with the other processors in the system. Because the DSM system is modular, it combines the advantages of low entry-level cost with global scalability in processors, memory, and I/O.

You can install and operate the Altix 4700 series system in a rack in your lab or server room. Each 42U SGI rack holds from one to four 10U-high enclosures, each of which supports up to ten processor and I/O submodules known as "blades." These blades are single printed circuit boards (PCBs) with ASICs, processors, and memory components mounted on a mechanical carrier. The blades slide directly in and out of the Altix 4700 individual rack unit (IRU) enclosures. Each IRU is 10U in height (see Figure 1-1).
For more information on this system, see the SGI Altix 4700 System User's Guide available on the SGI Technical Publications Library. It provides a detailed overview of the SGI Altix 4700 system components and it describes how to set up and operate the system. For an overview of the new SGI Altix 450 system, see Chapter 3, "System Overview" in the SGI Altix 450 System User's Guide.
Virtual memory (VM), also known as virtual addressing, is used to divide a system's relatively small amount of physical memory among the potentially much larger logical address spaces of its processes. It does this by dividing physical memory into pages, and then allocating pages to processes as the pages are needed.
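As a quick illustration (assuming a Linux shell with the standard getconf utility), the page size in use on a given system can be queried directly:

```shell
# Print the virtual memory page size, in bytes.
# Common values are 4096 (4 KB) on x86 systems and 16384 (16 KB)
# on Itanium-based systems such as the Altix family.
getconf PAGESIZE
```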
A page is the smallest unit of system memory allocation. Pages are added to a process when either a validity fault occurs or an allocation request is issued. Process size is measured in pages and two sizes are associated with every process: the total size and the resident set size (RSS). The number of pages being used in a process and the process size can be determined by using either the ps(1) or the top(1) command.
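For example, both sizes can be displayed for the current shell with ps (a minimal sketch; column names may vary slightly between procps versions):

```shell
# VSZ is the total process size and RSS is the resident set
# size, both reported in kilobytes; $$ is the current shell.
ps -o pid,vsz,rss,comm -p $$
```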
Swap space is used for temporarily saving parts of a program when there is not enough physical memory. The swap space may be on the system drive, on an optional drive, or allocated to a particular file in a filesystem. To avoid swapping, try not to overburden memory. Lack of adequate swap space limits the number and the size of applications that can run simultaneously on the system, and it can limit system performance.
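As a quick check (a sketch assuming the standard procps free utility), current physical memory and swap usage can be inspected from the shell:

```shell
# Report physical memory and swap usage, in megabytes.
free -m

# List the configured swap areas (device or file, type, size, use).
cat /proc/swaps
```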
Linux is a demand-paging operating system that uses a least-recently-used (LRU) paging algorithm. Pages are mapped into physical memory when they are first referenced (a validity fault), and pages that have been swapped out are brought back into memory when they are referenced again.
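Demand paging can be observed through a process's fault counters. For instance (assuming the Linux procps ps format specifiers min_flt and maj_flt), minor faults count pages mapped in on first reference, while major faults count pages that had to be read back from disk or swap:

```shell
# min_flt: minor page faults (page mapped on first reference,
#          no disk I/O required)
# maj_flt: major page faults (page read back from disk or swap)
ps -o pid,min_flt,maj_flt,comm -p $$
```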