This chapter contains suggested workarounds and shortcuts that you can use on your SGI Altix system. It covers determining process placement, resetting system limits such as file descriptors, stack size, and virtual memory, and calculating memory usage with the memacct package.
This section describes methods that can be used to determine where different processes are running. This can help you understand your application structure and help you decide if there are obvious placement issues.
There are some set-up steps to follow before determining process placement (note that all examples use the C shell):
Set up an alias as in this example, changing guest to your username:
% alias pu "ps -edaf|grep guest"
% pu
The pu command shows current processes.
Create the .toprc preferences file in your login directory to set the appropriate top options. If you prefer to use the top defaults, delete the .toprc file.
% cat <<EOF>> $HOME/.toprc
YEAbcDgHIjklMnoTP|qrsuzV{FWX
2mlt
EOF
Inspect all processes to determine which CPU each one is using, and create an alias for this procedure. The CPU number is shown in the first column of the top output:
% top -b -n 1 | sort -n | more
% alias top1 "top -b -n 1 | sort -n "
Use the following variation to produce output with column headings:
% alias top1 "top -b -n 1 | head -4 | tail -1;top -b -n 1 | sort -n"
View your files (replacing guest with your username):
% top -b -n 1 | sort -n | grep guest
% top -b -n 1 | head -4 | tail -1;top -b -n 1 | sort -n | grep guest
The following example demonstrates a simple usage with a program name of th. It sets the number of desired OpenMP threads and runs the program. Notice the process hierarchy as shown by the PID and the PPID columns. The command usage is the following, where n is the number of threads:
% th n
% th 4
% pu
UID        PID   PPID  C STIME TTY      TIME     CMD
root     13784  13779  0 12:41 pts/3    00:00:00 login -- guest1
guest1   13785  13784  0 12:41 pts/3    00:00:00 -csh
guest1   15062  13785  0 15:23 pts/3    00:00:00 th 4        <-- Main thread
guest1   15063  15062  0 15:23 pts/3    00:00:00 th 4        <-- daemon thread
guest1   15064  15063 99 15:23 pts/3    00:00:10 th 4        <-- worker thread 1
guest1   15065  15063 99 15:23 pts/3    00:00:10 th 4        <-- worker thread 2
guest1   15066  15063 99 15:23 pts/3    00:00:10 th 4        <-- worker thread 3
guest1   15067  15063 99 15:23 pts/3    00:00:10 th 4        <-- worker thread 4
guest1   15068  13857  0 15:23 pts/5    00:00:00 ps -aef
guest1   15069  13857  0 15:23 pts/5    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1
LC  %CPU   PID USER    PRI NI  SIZE  RSS SHARE STAT %MEM TIME COMMAND
 3   0.0 15072 guest1   16  0  3488 1536  3328 S     0.0 0:00 grep
 5   0.0 13785 guest1   15  0  5872 3664  4592 S     0.0 0:00 csh
 5   0.0 15062 guest1   16  0 15824 2080  4384 S     0.0 0:00 th
 5   0.0 15063 guest1   15  0 15824 2080  4384 S     0.0 0:00 th
 5  99.8 15064 guest1   25  0 15824 2080  4384 R     0.0 0:14 th
 7   0.0 13826 guest1   18  0  5824 3552  5632 S     0.0 0:00 csh
10  99.9 15066 guest1   25  0 15824 2080  4384 R     0.0 0:14 th
11  99.9 15067 guest1   25  0 15824 2080  4384 R     0.0 0:14 th
13  99.9 15065 guest1   25  0 15824 2080  4384 R     0.0 0:14 th
15   0.0 13857 guest1   15  0  5840 3584  5648 S     0.0 0:00 csh
15   0.0 15071 guest1   16  0 70048 1600 69840 S     0.0 0:00 sort
15   1.5 15070 guest1   15  0  5056 2832  4288 R     0.0 0:00 top
Now skip the Main and daemon processes and place the rest:
% /usr/bin/dplace -s 2 -c 4-7 th 4
% pu
UID        PID   PPID  C STIME TTY      TIME     CMD
root     13784  13779  0 12:41 pts/3    00:00:00 login -- guest1
guest1   13785  13784  0 12:41 pts/3    00:00:00 -csh
guest1   15083  13785  0 15:25 pts/3    00:00:00 th 4
guest1   15084  15083  0 15:25 pts/3    00:00:00 th 4
guest1   15085  15084 99 15:25 pts/3    00:00:19 th 4
guest1   15086  15084 99 15:25 pts/3    00:00:19 th 4
guest1   15087  15084 99 15:25 pts/3    00:00:19 th 4
guest1   15088  15084 99 15:25 pts/3    00:00:19 th 4
guest1   15091  13857  0 15:25 pts/5    00:00:00 ps -aef
guest1   15092  13857  0 15:25 pts/5    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1
LC  %CPU   PID USER    PRI NI  SIZE  RSS SHARE STAT %MEM TIME COMMAND
 4  99.9 15085 guest1   25  0 15856 2096  6496 R     0.0 0:24 th
 5  99.8 15086 guest1   25  0 15856 2096  6496 R     0.0 0:24 th
 6  99.9 15087 guest1   25  0 15856 2096  6496 R     0.0 0:24 th
 7  99.9 15088 guest1   25  0 15856 2096  6496 R     0.0 0:24 th
 8   0.0 15095 guest1   16  0  3488 1536  3328 S     0.0 0:00 grep
12   0.0 13785 guest1   15  0  5872 3664  4592 S     0.0 0:00 csh
12   0.0 15083 guest1   16  0 15856 2096  6496 S     0.0 0:00 th
12   0.0 15084 guest1   15  0 15856 2096  6496 S     0.0 0:00 th
15   0.0 15094 guest1   16  0 70048 1600 69840 S     0.0 0:00 sort
15   1.6 15093 guest1   15  0  5056 2832  4288 R     0.0 0:00 top
The following example demonstrates a simple OpenMP usage with a program name of md. Set the desired number of OpenMP threads and run the program, as shown below:
% alias pu "ps -edaf | grep guest1 % setenv OMP_NUM_THREADS 4 % md |
The following output is created:
% pu
UID        PID   PPID  C STIME TTY      TIME     CMD
root     21550  21535  0 21:48 pts/0    00:00:00 login -- guest1
guest1   21551  21550  0 21:48 pts/0    00:00:00 -csh
guest1   22183  21551 77 22:39 pts/0    00:00:03 md          <-- parent / main
guest1   22184  22183  0 22:39 pts/0    00:00:00 md          <-- daemon
guest1   22185  22184  0 22:39 pts/0    00:00:00 md          <-- daemon helper
guest1   22186  22184 99 22:39 pts/0    00:00:03 md          <-- thread 1
guest1   22187  22184 94 22:39 pts/0    00:00:03 md          <-- thread 2
guest1   22188  22184 85 22:39 pts/0    00:00:03 md          <-- thread 3
guest1   22189  21956  0 22:39 pts/1    00:00:00 ps -aef
guest1   22190  21956  0 22:39 pts/1    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1
LC  %CPU   PID USER    PRI NI  SIZE  RSS SHARE STAT %MEM TIME COMMAND
 2   0.0 22192 guest1   16  0 70048 1600 69840 S     0.0 0:00 sort
 2   0.0 22193 guest1   16  0  3488 1536  3328 S     0.0 0:00 grep
 2   1.6 22191 guest1   15  0  5056 2832  4288 R     0.0 0:00 top
 4  98.0 22186 guest1   26  0 26432 2704  4272 R     0.0 0:11 md
 8   0.0 22185 guest1   15  0 26432 2704  4272 S     0.0 0:00 md
 8  87.6 22188 guest1   25  0 26432 2704  4272 R     0.0 0:10 md
 9   0.0 21551 guest1   15  0  5872 3648  4560 S     0.0 0:00 csh
 9   0.0 22184 guest1   15  0 26432 2704  4272 S     0.0 0:00 md
 9  99.9 22183 guest1   39  0 26432 2704  4272 R     0.0 0:11 md
14  98.7 22187 guest1   39  0 26432 2704  4272 R     0.0 0:11 md
From the annotations on the right of the pu listing, you can derive the dplace -x 6 skip pattern, as follows:
place 1, skip 2 of them, place 3 more   [ 0 1 1 0 0 0 ]

now, reverse the bit order and create the dplace -x mask

[ 0 0 0 1 1 0 ] --> [ 0x06 ] --> decimal 6

(dplace does not currently process hex notation for this bit mask)
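If you want to compute this mask programmatically, the bit arithmetic can be expressed as in the following minimal C sketch. This is a hypothetical helper for illustration only; the names skip_mask and skip are not part of dplace:

/* mask_calc.c -- hypothetical helper, for illustration only.
 * Computes the decimal value to pass to dplace -x from a skip
 * pattern: skip[i] is nonzero if the (i+1)th process created by
 * the job should NOT be placed.  The first process maps to the
 * least-significant bit, as described above.                     */
#include <stdio.h>

static unsigned long skip_mask(const int *skip, int nprocs)
{
    unsigned long mask = 0;
    int i;

    for (i = 0; i < nprocs; i++)
        if (skip[i])
            mask |= 1UL << i;
    return mask;
}

int main(void)
{
    /* md example: main, daemon, daemon helper, then worker threads;
     * skip the second and third processes.                          */
    int skip[6] = { 0, 1, 1, 0, 0, 0 };

    printf("dplace -x %lu\n", skip_mask(skip, 6));  /* prints: dplace -x 6 */
    return 0;
}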
The following example confirms that a simple dplace placement works correctly:
% setenv OMP_NUM_THREADS 4
% /usr/bin/dplace -x 6 -c 4-7 md
% pu
UID        PID   PPID  C STIME TTY      TIME     CMD
root     21550  21535  0 21:48 pts/0    00:00:00 login -- guest1
guest1   21551  21550  0 21:48 pts/0    00:00:00 -csh
guest1   22219  21551 93 22:45 pts/0    00:00:05 md
guest1   22220  22219  0 22:45 pts/0    00:00:00 md
guest1   22221  22220  0 22:45 pts/0    00:00:00 md
guest1   22222  22220 93 22:45 pts/0    00:00:05 md
guest1   22223  22220 93 22:45 pts/0    00:00:05 md
guest1   22224  22220 90 22:45 pts/0    00:00:05 md
guest1   22225  21956  0 22:45 pts/1    00:00:00 ps -aef
guest1   22226  21956  0 22:45 pts/1    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1
LC  %CPU   PID USER    PRI NI  SIZE  RSS SHARE STAT %MEM TIME COMMAND
 2   0.0 22228 guest1   16  0 70048 1600 69840 S     0.0 0:00 sort
 2   0.0 22229 guest1   16  0  3488 1536  3328 S     0.0 0:00 grep
 2   1.6 22227 guest1   15  0  5056 2832  4288 R     0.0 0:00 top
 4   0.0 22220 guest1   15  0 28496 2736 21728 S     0.0 0:00 md
 4  99.9 22219 guest1   39  0 28496 2736 21728 R     0.0 0:12 md
 5  99.9 22222 guest1   25  0 28496 2736 21728 R     0.0 0:11 md
 6  99.9 22223 guest1   39  0 28496 2736 21728 R     0.0 0:11 md
 7  99.9 22224 guest1   39  0 28496 2736 21728 R     0.0 0:11 md
 9   0.0 21551 guest1   15  0  5872 3648  4560 S     0.0 0:00 csh
15   0.0 22221 guest1   15  0 28496 2736 21728 S     0.0 0:00 md
For this example, explicit placement using the dplace -e -c command is used to achieve the desired placement. If an x is used in one of the CPU positions, dplace does not explicitly place that process.
If running without a cpuset, the x processes run on any available CPU.
If running with a cpuset, you have to renumber the CPU numbers to refer to “logical” CPUs (0 ... n) within the cpuset, regardless of which physical CPUs are in the cpuset. When running in a cpuset, the unplaced processes are constrained to the set of CPUs within the cpuset.
For details about cpuset usage, see the Linux Resource Administration Guide.
The following example shows a “hybrid” MPI and OpenMP job with two MPI processes, each with two OpenMP threads and no cpusets:
% setenv OMP_NUM_THREADS 2
% efc -O2 -o hybrid hybrid.f -lmpi -openmp
% mpirun -v -np 2 /usr/bin/dplace -e -c x,8,9,x,x,x,x,10,11 hybrid

-------------------------
# if using cpusets ...
-------------------------
# we need to reorder cpus to logical within the 8-15 set [0-7]
% cpuset -q omp -A mpirun -v -np 2 /usr/bin/dplace -e -c x,0,1,x,x,x,x,2,3,4,5,6,7 hybrid

# We need a table of options for these pairs. "x" means don't
# care. See the dplace man page for more info about the -e option.
# examples at end

-np   OMP_NUM_THREADS   /usr/bin/dplace -e -c <as shown> a.out
---   ---------------   ---------------------------------------
 2    2                 x,0,1,x,x,x,x,2,3
 2    3                 x,0,1,x,x,x,x,2,3,4,5
 2    4                 x,0,1,x,x,x,x,2,3,4,5,6,7
 4    2                 x,0,1,2,3,x,x,x,x,x,x,x,x,4,5,6,7
 4    3                 x,0,1,2,3,x,x,x,x,x,x,x,x,4,5,6,7,8,9,10,11
 4    4                 x,0,1,2,3,x,x,x,x,x,x,x,x,4,5,6,7,8,9,10,11,12,13,14,15

Notes:
                        0 <- 1 -> <- 2 -> <- 3 -> <------ 4 ------------------>
0. mpi daemon process
1. mpi child procs, one per np
2. omp daemon procs, one per np
3. omp daemon helper procs, one per np
4. omp thread procs, (OMP_NUM_THREADS - 1) per np

---------------------------------------------
# Example - -np 2 and OMP_NUM_THREADS 2
---------------------------------------------
% setenv OMP_NUM_THREADS 2
% efc -O2 -o hybrid hybrid.f -lmpi -openmp
% mpirun -v -np 2 /usr/bin/dplace -e -c x,8,9,x,x,x,x,10,11 hybrid

% pu
UID        PID   PPID  C STIME TTY      TIME     CMD
root     21550  21535  0 Mar17 pts/0    00:00:00 login -- guest1
guest1   21551  21550  0 Mar17 pts/0    00:00:00 -csh
guest1   23391  21551  0 00:32 pts/0    00:00:00 mpirun -v -np 2 /usr/bin/dplace
guest1   23394  23391  2 00:32 pts/0    00:00:00 hybrid      <-- mpi daemon
guest1   23401  23394 99 00:32 pts/0    00:00:03 hybrid      <-- mpi child 1
guest1   23402  23394 99 00:32 pts/0    00:00:03 hybrid      <-- mpi child 2
guest1   23403  23402  0 00:32 pts/0    00:00:00 hybrid      <-- omp daemon 2
guest1   23404  23401  0 00:32 pts/0    00:00:00 hybrid      <-- omp daemon 1
guest1   23405  23404  0 00:32 pts/0    00:00:00 hybrid      <-- omp daemon hlpr 1
guest1   23406  23403  0 00:32 pts/0    00:00:00 hybrid      <-- omp daemon hlpr 2
guest1   23407  23403 99 00:32 pts/0    00:00:03 hybrid      <-- omp thread 2-1
guest1   23408  23404 99 00:32 pts/0    00:00:03 hybrid      <-- omp thread 1-1
guest1   23409  21956  0 00:32 pts/1    00:00:00 ps -aef
guest1   23410  21956  0 00:32 pts/1    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1
LC  %CPU   PID USER    PRI NI  SIZE  RSS SHARE STAT %MEM TIME COMMAND
 0   0.0 21551 guest1   15  0  5904 3712  4592 S     0.0 0:00 csh
 0   0.0 23394 guest1   15  0  883M 9456  882M S     0.1 0:00 hybrid
 4   0.0 21956 guest1   15  0  5856 3616  5664 S     0.0 0:00 csh
 4   0.0 23412 guest1   16  0 70048 1600 69840 S     0.0 0:00 sort
 4   1.6 23411 guest1   15  0  5056 2832  4288 R     0.0 0:00 top
 5   0.0 23413 guest1   16  0  3488 1536  3328 S     0.0 0:00 grep
 8   0.0 22005 guest1   15  0  5840 3584  5648 S     0.0 0:00 csh
 8   0.0 23404 guest1   15  0  894M  10M  889M S     0.1 0:00 hybrid
 8  99.9 23401 guest1   39  0  894M  10M  889M R     0.1 0:09 hybrid
 9   0.0 23403 guest1   15  0  894M  10M  894M S     0.1 0:00 hybrid
 9  99.9 23402 guest1   25  0  894M  10M  894M R     0.1 0:09 hybrid
10  99.9 23407 guest1   25  0  894M  10M  894M R     0.1 0:09 hybrid
11  99.9 23408 guest1   25  0  894M  10M  889M R     0.1 0:09 hybrid
12   0.0 23391 guest1   15  0  5072 2928  4400 S     0.0 0:00 mpirun
12   0.0 23406 guest1   15  0  894M  10M  894M S     0.1 0:00 hybrid
14   0.0 23405 guest1   15  0  894M  10M  889M S     0.1 0:00 hybrid
To regulate system limits on a per-user basis (for applications that do not rely on limit.h), the limits.conf file can be modified. System limits that can be modified include maximum file size, maximum number of open files, maximum stack size, and so on. You can view this file, as follows:
[user@machine user]# cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
#        - an user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open files
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit
#        - maxlogins - max number of logins for this user
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#
#<domain>      <type>  <item>         <value>
#
#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4
# End of file
For instructions on how to change these limits, see “Resetting the File Limit Resource Default”.
Several large user applications use the value set in the limit.h file as a hard limit on file descriptors and that value is noted at compile time. Therefore, some applications may need to be recompiled in order to take advantage of the SGI Altix system hardware.
To regulate these limits on a per-user basis (for applications that do not rely on limit.h), the limits.conf file can be modified. This allows the administrator to set the allowed number of open files per user and per group. This also requires a one-line change to the /etc/pam.d/login file.
Follow this procedure to execute these changes:
Add the following line to /etc/pam.d/login:
session required /lib/security/pam_limits.so
Add the following line to /etc/security/limits.conf, where username is the user's login and limit is the new value for the file limit resource:
[username] hard nofile [limit]
The following command shows the new limit:
ulimit -H -n
Because of the large number of file descriptors that some applications require, such as MPI jobs, you might need to increase the system-wide limit on the number of open files on your Altix system. The default value for the file limit resource is 1024. The default of 1024 file descriptors allows for approximately 199 MPI processes per host. You can increase the file descriptor value to 8196, which allows for more than 512 MPI processes per host, by adding the following lines to the /etc/security/limits.conf file:
* soft nofile 8196
* hard nofile 8196
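To confirm from within an application that the new descriptor limit is in effect, you can query RLIMIT_NOFILE with the standard getrlimit(2) call. The following minimal sketch uses only standard POSIX interfaces; the file name showfd.c is simply an example:

/* showfd.c -- minimal sketch: print the soft and hard open-file
 * limits seen by the current process using getrlimit(2).          */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("open files: soft=%lu hard=%lu\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
    return 0;
}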
For more information on setting system limits, see Chapter 5, “Kernel Tunable Parameters on SGI ProPack Servers”, in the Linux Configuration and Operations Guide.
Some applications will not run well on an Altix system with a small stack size. To set a higher stack limit, follow the instructions in “Resetting the File Limit Resource Default” and add the following lines to the /etc/security/limits.conf file:
* soft stack 300000
* hard stack unlimited
This sets a soft stack size limit of 300000 KB and an unlimited hard stack size for all users (and all processes).
Another method that does not require root privilege relies on the fact that many MPI implementations use ssh, rsh, or some sort of login shell to start the MPI rank processes. If you merely need to increase the soft limit, you can modify your shell's startup script. For example, if your login shell is bash, add something like the following to your .bashrc file:
ulimit -s 300000
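As another unprivileged option, a process can raise its own soft stack limit up to the hard limit by calling the standard setrlimit(2) interface early in execution, before it needs the larger stack. The following is a minimal sketch using only standard POSIX calls, not an SGI-specific interface:

/* raise_stack.c -- minimal sketch: raise the soft stack limit to the
 * hard limit with getrlimit(2)/setrlimit(2); no root privilege needed. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    rl.rlim_cur = rl.rlim_max;      /* soft limit may not exceed the hard limit */
    if (setrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("stack soft limit is now unlimited\n");
    else
        printf("stack soft limit is now %lu bytes\n", (unsigned long)rl.rlim_cur);
    return 0;
}

This affects only the calling process and its children; the limits.conf and shell startup methods described above remain the usual approach.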
Note that SGI MPT MPI allows you to set a larger stack size limit with the ulimit or limit shell command before launching an MPI program with mpirun(1) or mpiexec_mpt(1). MPT will propagate the stack limit setting to all MPI processes in the job.
For more information on default settings, also see “Resetting the File Limit Resource Default”.
The virtual memory parameter vmemoryuse determines the amount of virtual memory available to your application. If you are running with csh, use csh commands such as the following:
limit
limit vmemoryuse 7128960
limit vmemoryuse unlimited
The following MPI program fails with a memory-mapping error because the virtual memory parameter vmemoryuse is set too low:
% limit vmemoryuse 7128960
% mpirun -v -np 4 ./program
MPI: libxmpi.so 'SGI MPI 4.9 MPT 1.14  07/18/06 08:43:15'
MPI: libmpi.so  'SGI MPI 4.9 MPT 1.14  07/18/06 08:41:05'
MPI: MPI_MSGS_MAX = 524288
MPI: MPI_BUFS_PER_PROC= 32
mmap failed (memmap_base) for 504972 pages (8273461248 bytes)
Killed
The program now succeeds when virtual memory is unlimited:
% limit vmemoryuse unlimited
% mpirun -v -np 4 ./program
MPI: libxmpi.so 'SGI MPI 4.9 MPT 1.14  07/18/06 08:43:15'
MPI: libmpi.so  'SGI MPI 4.9 MPT 1.14  07/18/06 08:41:05'
MPI: MPI_MSGS_MAX = 524288
MPI: MPI_BUFS_PER_PROC= 32
HELLO WORLD from Processor 0
HELLO WORLD from Processor 2
HELLO WORLD from Processor 1
HELLO WORLD from Processor 3
If you are running with bash, use bash commands such as the following:
ulimit -a
ulimit -v 7128960
ulimit -v unlimited
The Linux operating system does not calculate memory utilization in a manner that is useful for certain applications in situations where regions are shared among multiple processes. This can lead to over-reporting of memory and to processes being killed by schedulers that erroneously detect memory quota violations.
The get_weighted_memory_size function weighs shared memory regions by the number of processes using the regions. Thus, if 100 processes are each sharing a total of 10GB of memory, the weighted memory calculation shows 100MB of memory shared per process, rather than 10GB for each process.
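Conceptually, the weighting divides each region's size by the number of processes that map it, as in the following illustrative C sketch. This shows only the arithmetic, not the numatools implementation; the names region and weighted_size are invented for the example:

/* weighting.c -- conceptual illustration only (not the numatools code).
 * A process's weighted size counts each mapped region at
 * region_size / number_of_processes_sharing_it.                        */
#include <stdio.h>

struct region {
    unsigned long size_bytes;   /* size of the mapping             */
    unsigned int  nsharers;     /* processes that map this region  */
};

static unsigned long weighted_size(const struct region *r, int nregions)
{
    unsigned long total = 0;
    int i;

    for (i = 0; i < nregions; i++)
        total += r[i].size_bytes / r[i].nsharers;
    return total;
}

int main(void)
{
    /* 10 GB shared by 100 processes, plus 50 MB private to this process. */
    struct region regions[] = {
        { 10UL * 1024 * 1024 * 1024, 100 },
        { 50UL * 1024 * 1024,          1 },
    };

    printf("weighted size: %lu MB\n",
           weighted_size(regions, 2) / (1024 * 1024));  /* prints: weighted size: 152 MB */
    return 0;
}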
Because this function applies mostly to applications with large shared-memory requirements, it is part of the SGI NUMA tools and is made available in the libmemacct library, from a new package called memacct. The library function makes a call to the numatools kernel module, which returns the weighted sum to the library, which in turn returns it to the application.
The usage statement for the memacct call is, as follows:
cc ... -lmemacct

#include <sys/types.h>

extern int get_weighted_memory_size(pid_t pid);
The syntax of the memacct call is, as follows:
int get_weighted_memory_size(pid_t pid);
Returns the weighted memory (RSS) size for a pid, in bytes. This weights the size of shared regions by the number of processes accessing them. Returns -1 when an error occurs and sets errno, as follows:
ESRCH      Process pid was not found.

ENOSYS     The function is not implemented. Check if numatools kernel package is up-to-date.
Normally, the following errors should not occur:
ENOENT     Can not open /proc/numatools device file.

EPERM      No read permission on /proc/numatools device file.

ENOTTY     Inappropriate ioctl operation on /proc/numatools device file.

EFAULT     Invalid arguments. The ioctl() operation performed by the function failed with invalid arguments.
For more information, see the memacct(3) man page.
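As an illustration, a small reporting program might look like the following sketch. It assumes the memacct package (libmemacct and the numatools kernel module) is installed and uses the declaration shown in the usage statement above; the file name weighted.c is just an example:

/* weighted.c -- minimal sketch; build with: cc weighted.c -o weighted -lmemacct */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

extern int get_weighted_memory_size(pid_t pid);

int main(int argc, char **argv)
{
    /* Report on the pid given as an argument, or on this process itself. */
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
    int size = get_weighted_memory_size(pid);

    if (size == -1) {
        fprintf(stderr, "get_weighted_memory_size(%d): %s\n",
                (int)pid, strerror(errno));
        return 1;
    }
    printf("weighted memory size of pid %d: %d bytes\n", (int)pid, size);
    return 0;
}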