Chapter 9. Suggested Shortcuts and Workarounds

This chapter contains suggested workarounds and shortcuts that you can use on your SGI Altix system. It covers the following topics:

  - “Determining Process Placement”
  - “Resetting System Limits”
  - “Linux Shared Memory Accounting”

Determining Process Placement

This section describes methods that can be used to determine where different processes are running. This can help you understand your application structure and help you decide if there are obvious placement issues.

There are some set-up steps to follow before determining process placement (note that all examples use the C shell):

  1. Set up an alias as in this example, changing guest to your username:

    % alias pu "ps -edaf|grep guest"
    % pu

    The pu command shows current processes.

  2. Create the .toprc preferences file in your login directory to set the appropriate top options. If you prefer to use the top defaults, delete the .toprc file.

    % cat <<EOF>> $HOME/.toprc
    
    YEAbcDgHIjklMnoTP|qrsuzV{FWX
    2mlt
    EOF

  3. Inspect all processes to determine which CPU each is using, and create an alias for this procedure. The CPU number is shown in the first column of the top output:

    % top -b -n 1 | sort -n | more
    % alias top1 "top -b -n 1 | sort -n "

    Use the following variation to produce output with column headings:

    % alias top1 "top -b -n 1 | head -4 | tail -1;top -b -n 1 | sort -n"

  4. View your processes (replacing guest with your username):

    % top -b -n 1 | sort -n | grep guest

    Use the following variation to produce output with column headings:
    % top -b -n 1 | head -4 | tail -1;top -b -n 1 | sort -n | grep guest

Example Using pthreads

The following example demonstrates a simple pthreads usage with a program name of th. It sets the number of desired threads and runs the program. Notice the process hierarchy as shown by the PID and PPID columns. The command usage is the following, where n is the number of threads:

% th n

% th 4
% pu

UID       PID   PPID   C STIME TTY          TIME CMD
root      13784 13779  0 12:41 pts/3    00:00:00 login -- guest1
guest1    13785 13784  0 12:41 pts/3    00:00:00 -csh
guest1    15062 13785  0 15:23 pts/3    00:00:00 th 4   <-- Main thread
guest1    15063 15062  0 15:23 pts/3    00:00:00 th 4   <-- daemon thread
guest1    15064 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 1
guest1    15065 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 2
guest1    15066 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 3
guest1    15067 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 4
guest1    15068 13857  0 15:23 pts/5    00:00:00 ps -aef
guest1    15069 13857  0 15:23 pts/5    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER     PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 3  0.0 15072 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
 5  0.0 13785 guest1     15   0  5872 3664  4592 S     0.0   0:00 csh
 5  0.0 15062 guest1     16   0 15824 2080  4384 S     0.0   0:00 th
 5  0.0 15063 guest1     15   0 15824 2080  4384 S     0.0   0:00 th
 5 99.8 15064 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
 7  0.0 13826 guest1     18   0  5824 3552  5632 S     0.0   0:00 csh
10 99.9 15066 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
11 99.9 15067 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
13 99.9 15065 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
15  0.0 13857 guest1     15   0  5840 3584  5648 S     0.0   0:00 csh
15  0.0 15071 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
15  1.5 15070 guest1     15   0  5056 2832  4288 R     0.0   0:00 top

Now skip the Main and daemon processes and place the rest:

% /usr/bin/dplace -s 2 -c 4-7 th 4
% pu

UID         PID  PPID  C STIME TTY          TIME CMD
root      13784 13779  0 12:41 pts/3    00:00:00 login -- guest1
guest1    13785 13784  0 12:41 pts/3    00:00:00 -csh
guest1    15083 13785  0 15:25 pts/3    00:00:00 th 4
guest1    15084 15083  0 15:25 pts/3    00:00:00 th 4
guest1    15085 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15086 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15087 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15088 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15091 13857  0 15:25 pts/5    00:00:00 ps -aef
guest1    15092 13857  0 15:25 pts/5    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER      PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 4 99.9 15085 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 5 99.8 15086 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 6 99.9 15087 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 7 99.9 15088 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 8  0.0 15095 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
12  0.0 13785 guest1     15   0  5872 3664  4592 S     0.0   0:00 csh
12  0.0 15083 guest1     16   0 15856 2096  6496 S     0.0   0:00 th
12  0.0 15084 guest1     15   0 15856 2096  6496 S     0.0   0:00 th
15  0.0 15094 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
15  1.6 15093 guest1     15   0  5056 2832  4288 R     0.0   0:00 top

Example Using OpenMP

The following example demonstrates a simple OpenMP usage with a program name of md. Set the desired number of OpenMP threads and run the program, as shown below:

% alias pu "ps -edaf | grep guest1"
% setenv OMP_NUM_THREADS 4
% md

The following output is created:

% pu

UID         PID  PPID  C STIME TTY          TIME CMD
root      21550 21535  0 21:48 pts/0    00:00:00 login -- guest1
guest1    21551 21550  0 21:48 pts/0    00:00:00 -csh
guest1    22183 21551 77 22:39 pts/0    00:00:03 md    <-- parent / main
guest1    22184 22183  0 22:39 pts/0    00:00:00 md    <-- daemon 
guest1    22185 22184  0 22:39 pts/0    00:00:00 md    <-- daemon helper
guest1    22186 22184 99 22:39 pts/0    00:00:03 md    <-- thread 1
guest1    22187 22184 94 22:39 pts/0    00:00:03 md    <-- thread 2
guest1    22188 22184 85 22:39 pts/0    00:00:03 md    <-- thread 3
guest1    22189 21956  0 22:39 pts/1    00:00:00 ps -aef
guest1    22190 21956  0 22:39 pts/1    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER      PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 2  0.0 22192 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
 2  0.0 22193 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
 2  1.6 22191 guest1     15   0  5056 2832  4288 R     0.0   0:00 top
 4 98.0 22186 guest1     26   0 26432 2704  4272 R     0.0   0:11 md
 8  0.0 22185 guest1     15   0 26432 2704  4272 S     0.0   0:00 md
 8 87.6 22188 guest1     25   0 26432 2704  4272 R     0.0   0:10 md
 9  0.0 21551 guest1     15   0  5872 3648  4560 S     0.0   0:00 csh
 9  0.0 22184 guest1     15   0 26432 2704  4272 S     0.0   0:00 md
 9 99.9 22183 guest1     39   0 26432 2704  4272 R     0.0   0:11 md
14 98.7 22187 guest1     39   0 26432 2704  4272 R     0.0   0:11 md

From the notation on the right of the pu list, you can see the -x 6 pattern.

place 1, skip 2 of them, place 3 more  [ 0 1 1 0 0 0 ]
  now, reverse the bit order and create the dplace -x mask   
  [ 0 0 0 1 1 0 ]  -->  [ 0x06 ]  --> decimal 6
  (dplace does not currently process hex notation for this bit mask)
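
As a quick check of the conversion, you can use bc(1) to turn the reversed bit pattern into the decimal value passed to -x, as in the following illustrative command:

% echo 'ibase=2; 000110' | bc
6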

The following example confirms that a simple dplace placement works correctly:

% setenv OMP_NUM_THREADS 4
% /usr/bin/dplace -x 6 -c 4-7 md
% pu
UID         PID  PPID  C STIME TTY          TIME CMD
root      21550 21535  0 21:48 pts/0    00:00:00 login -- guest1
guest1    21551 21550  0 21:48 pts/0    00:00:00 -csh
guest1    22219 21551 93 22:45 pts/0    00:00:05 md
guest1    22220 22219  0 22:45 pts/0    00:00:00 md
guest1    22221 22220  0 22:45 pts/0    00:00:00 md
guest1    22222 22220 93 22:45 pts/0    00:00:05 md
guest1    22223 22220 93 22:45 pts/0    00:00:05 md
guest1    22224 22220 90 22:45 pts/0    00:00:05 md
guest1    22225 21956  0 22:45 pts/1    00:00:00 ps -aef
guest1    22226 21956  0 22:45 pts/1    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER      PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 2  0.0 22228 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
 2  0.0 22229 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
 2  1.6 22227 guest1     15   0  5056 2832  4288 R     0.0   0:00 top
 4  0.0 22220 guest1     15   0 28496 2736 21728 S     0.0   0:00 md
 4 99.9 22219 guest1     39   0 28496 2736 21728 R     0.0   0:12 md
 5 99.9 22222 guest1     25   0 28496 2736 21728 R     0.0   0:11 md
 6 99.9 22223 guest1     39   0 28496 2736 21728 R     0.0   0:11 md
 7 99.9 22224 guest1     39   0 28496 2736 21728 R     0.0   0:11 md
 9  0.0 21551 guest1     15   0  5872 3648  4560 S     0.0   0:00 csh
15  0.0 22221 guest1     15   0 28496 2736 21728 S     0.0   0:00 md

Combination Example (MPI and OpenMP)

For this example, explicit placement using the dplace -e -c command is used to achieve the desired placement. If an x is used in one of the CPU positions, dplace does not explicitly place that process.

If running without a cpuset, the x processes run on any available CPU.

If running with a cpuset, the CPU numbers you pass to dplace refer to “logical” CPUs (0 ... n-1) within the cpuset, regardless of which physical CPUs are in the cpuset. For example, in a cpuset built from physical CPUs 8 through 15, the dplace command line uses logical CPUs 0 through 7. When running in a cpuset, the unplaced processes are constrained to the set of CPUs within the cpuset.

For details about cpuset usage, see the Linux Resource Administration Guide.

The following example shows a “hybrid” MPI and OpenMP job with two MPI processes, each with two OpenMP threads and no cpusets:

% setenv OMP_NUM_THREADS 2
% efc -O2 -o hybrid hybrid.f -lmpi -openmp

% mpirun -v -np 2 /usr/bin/dplace -e -c x,8,9,x,x,x,x,10,11 hybrid

-------------------------
# if using cpusets ... 
-------------------------
# we need to reorder cpus to logical within the 8-15 set [0-7]

% cpuset -q omp -A mpirun -v -np 2 /usr/bin/dplace -e -c x,0,1,x,x,x,x,2,3,4,5,6,7 hybrid

# We need a table of options for these pairs. "x" means don't
# care. See the dplace man page for more info about the -e option.
# examples at end 

  -np  OMP_NUM_THREADS  /usr/bin/dplace -e -c <as shown> a.out
  ---  ---------------  ---------------------------------------
   2         2          x,0,1,x,x,x,x,2,3
   2         3          x,0,1,x,x,x,x,2,3,4,5
   2         4          x,0,1,x,x,x,x,2,3,4,5,6,7 

   4         2          x,0,1,2,3,x,x,x,x,x,x,x,x,4,5,6,7
   4         3          x,0,1,2,3,x,x,x,x,x,x,x,x,4,5,6,7,8,9,10,11
   4         4          x,0,1,2,3,x,x,x,x,x,x,x,x,4,5,6,7,8,9,10,11,12,13,14,15

   Notes:               0 <- 1 -> <- 2 -> <- 3 -> <------ 4 ------------------>

   Notes:
     0. mpi daemon process
     1. mpi child procs, one per np
     2. omp daemon procs, one per np
     3. omp daemon helper procs, one per np
     4. omp thread procs, (OMP_NUM_THREADS - 1) per np
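
For example, the table row for -np 4 and OMP_NUM_THREADS 2 corresponds to a command line such as the following, run within the same cpuset as above (hybrid is the example binary built earlier):

% setenv OMP_NUM_THREADS 2
% cpuset -q omp -A mpirun -v -np 4 /usr/bin/dplace -e -c x,0,1,2,3,x,x,x,x,x,x,x,x,4,5,6,7 hybrid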

---------------------------------------------
# Example -   -np 2 and OMP_NUM_THREADS 2
---------------------------------------------

% setenv OMP_NUM_THREADS 2
% efc -O2 -o hybrid hybrid.f -lmpi -openmp

% mpirun -v -np 2 /usr/bin/dplace -e -c x,8,9,x,x,x,x,10,11 hybrid

% pu

UID        PID  PPID  C STIME TTY          TIME CMD
root  21550 21535  0 Mar17 pts/0 00:00:00 login -- guest1
guest1 21551 21550  0 Mar17 pts/0 00:00:00 -csh
guest1 23391 21551  0 00:32 pts/0 00:00:00 mpirun -v -np 2 /usr/bin/dplace
guest1 23394 23391  2 00:32 pts/0 00:00:00 hybrid   <-- mpi daemon
guest1 23401 23394 99 00:32 pts/0 00:00:03 hybrid   <-- mpi child 1
guest1 23402 23394 99 00:32 pts/0 00:00:03 hybrid   <-- mpi child 2
guest1 23403 23402  0 00:32 pts/0 00:00:00 hybrid   <-- omp daemon 2
guest1 23404 23401  0 00:32 pts/0 00:00:00 hybrid   <-- omp daemon 1
guest1 23405 23404  0 00:32 pts/0 00:00:00 hybrid   <-- omp daemon hlpr 1
guest1 23406 23403  0 00:32 pts/0 00:00:00 hybrid   <-- omp daemon hlpr 2
guest1 23407 23403 99 00:32 pts/0 00:00:03 hybrid   <-- omp thread 2-1
guest1 23408 23404 99 00:32 pts/0 00:00:03 hybrid   <-- omp thread 1-1
guest1 23409 21956  0 00:32 pts/1 00:00:00 ps -aef
guest1 23410 21956  0 00:32 pts/1 00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER      PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 0  0.0 21551 guest1     15   0  5904 3712  4592 S     0.0   0:00 csh
 0  0.0 23394 guest1     15   0  883M 9456  882M S     0.1   0:00 hybrid
 4  0.0 21956 guest1     15   0  5856 3616  5664 S     0.0   0:00 csh
 4  0.0 23412 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
 4  1.6 23411 guest1     15   0  5056 2832  4288 R     0.0   0:00 top
 5  0.0 23413 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
 8  0.0 22005 guest1     15   0  5840 3584  5648 S     0.0   0:00 csh
 8  0.0 23404 guest1     15   0  894M  10M  889M S     0.1   0:00 hybrid
 8 99.9 23401 guest1     39   0  894M  10M  889M R     0.1   0:09 hybrid
 9  0.0 23403 guest1     15   0  894M  10M  894M S     0.1   0:00 hybrid
 9 99.9 23402 guest1     25   0  894M  10M  894M R     0.1   0:09 hybrid
10 99.9 23407 guest1     25   0  894M  10M  894M R     0.1   0:09 hybrid
11 99.9 23408 guest1     25   0  894M  10M  889M R     0.1   0:09 hybrid
12  0.0 23391 guest1     15   0  5072 2928  4400 S     0.0   0:00 mpirun
12  0.0 23406 guest1     15   0  894M  10M  894M S     0.1   0:00 hybrid
14  0.0 23405 guest1     15   0  894M  10M  889M S     0.1   0:00 hybrid

Resetting System Limits

To regulate system limits on a per-user basis (for applications that do not rely on limit.h), you can modify the /etc/security/limits.conf file. System limits that can be modified include maximum file size, maximum number of open files, maximum stack size, and so on. You can view this file as follows:

[user@machine user]# cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - an user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open files
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit
#        - maxlogins - max number of logins for this user
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#
#<domain>      <type>  <item>         <value>
#

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file

For instructions on how to change these limits, see “Resetting the File Limit Resource Default”.

Resetting the File Limit Resource Default

Several large user applications use the value set in the limit.h file as a hard limit on file descriptors and that value is noted at compile time. Therefore, some applications may need to be recompiled in order to take advantage of the SGI Altix system hardware.

To regulate these limits on a per-user basis (for applications that do not rely on limit.h), the limits.conf file can be modified. This allows the administrator to set the allowed number of open files per user and per group. This also requires a one-line change to the /etc/pam.d/login file.

Follow this procedure to execute these changes:

  1. Add the following line to /etc/pam.d/login:

    session  required  /lib/security/pam_limits.so

  2. Add the following line to /etc/security/limits.conf, where username is the user's login and limit is the new value for the file limit resource:

    [username]  hard  nofile  [limit]
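
    For example, to give a hypothetical user guest1 a hard limit of 8192 open files, the entry would look like the following:

    guest1  hard  nofile  8192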

The following command shows the new limit:

ulimit -H -n
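
If your login shell is csh, you can display the same hard limit with the limit command, for example:

% limit -h descriptors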

Because of the large number of file descriptors that some applications require, such as MPI jobs, you might need to increase the system-wide limit on the number of open files on your Altix system. The default value for the file limit resource is 1024. The default of 1024 file descriptors allows for approximately 199 MPI processes per host. You can increase the file descriptor value to 8196, which allows for more than 512 MPI processes per host, by adding the following lines to the /etc/security/limits.conf file:

*     soft    nofile      8196
*     hard    nofile      8196

For more information on setting system limits, see Chapter 5, “Kernel Tunable Parameters on SGI ProPack Servers,” in the Linux Configuration and Operations Guide.

Resetting the Default Stack Size

Some applications will not run well on an Altix system with a small stack size. To set a higher stack limit, follow the instructions in “Resetting the File Limit Resource Default” and add the following lines to the /etc/security/limits.conf file:

* soft stack 300000
* hard stack unlimited

This sets a soft stack size limit of 300000 KB and an unlimited hard stack size for all users (and all processes).

Another method that does not require root privilege relies on the fact that many MPI implementations use ssh, rsh, or some sort of login shell to start the MPI rank processes. If you merely need to raise the soft limit, you can modify your shell's startup script. For example, if your login shell is bash, add something like the following to your .bashrc file:

ulimit -s 300000
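
If your login shell is csh, a line such as the following in your .cshrc file has the same effect (the value matches the soft limit shown above):

limit stacksize 300000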

Note that SGI MPT allows you to set your stack size limit larger with the ulimit or limit shell command before launching an MPI program with mpirun(1) or mpiexec_mpt(1). MPT will propagate the stack limit setting to all MPI processes in the job.
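
For example, with csh and a hard limit of unlimited (as set above), you could raise the limit and then launch the job as follows, where ./a.out stands in for your MPI program:

% limit stacksize unlimited
% mpirun -np 4 ./a.out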

For more information on default settings, also see “Resetting the File Limit Resource Default”.

Resetting Virtual Memory Size

The virtual memory parameter vmemoryuse determines the amount of virtual memory available to your application. If you are running with csh, use csh commands such as the following:

limit
limit vmemoryuse 7128960
limit vmemoryuse unlimited

The following MPI program fails with a memory-mapping error because the vmemoryuse virtual memory parameter is set too low:

% limit vmemoryuse 7128960

% mpirun -v -np 4 ./program
MPI: libxmpi.so 'SGI MPI 4.9 MPT 1.14  07/18/06 08:43:15'
MPI: libmpi.so  'SGI MPI 4.9 MPT 1.14  07/18/06 08:41:05'
MPI: MPI_MSGS_MAX = 524288
MPI: MPI_BUFS_PER_PROC= 32
mmap failed (memmap_base) for 504972 pages (8273461248 bytes)
Killed

The program now succeeds when virtual memory is unlimited:

%  limit vmemoryuse unlimited


% mpirun -v -np 4 ./program
MPI: libxmpi.so 'SGI MPI 4.9 MPT 1.14  07/18/06 08:43:15'
MPI: libmpi.so  'SGI MPI 4.9 MPT 1.14  07/18/06 08:41:05'
MPI: MPI_MSGS_MAX = 524288
MPI: MPI_BUFS_PER_PROC= 32

HELLO WORLD from Processor 0

HELLO WORLD from Processor 2

HELLO WORLD from Processor 1

HELLO WORLD from Processor 3

If you are running with bash, use bash commands such as the following:

ulimit -a
ulimit -v 7128960
ulimit -v unlimited

Linux Shared Memory Accounting

The Linux operating system does not calculate memory utilization in a manner that is useful for certain applications in situations where regions are shared among multiple processes. This can lead to over-reporting of memory and to processes being killed by schedulers erroneously detecting memory quota violation.

The get_weighted_memory_size function weighs shared memory regions by the number of processes using the regions. Thus, if 100 processes are each sharing a total of 10GB of memory, the weighted memory calculation shows 100MB of memory shared per process, rather than 10GB for each process.

Because this function applies mostly to applications with large shared-memory requirements, it is part of the SGI NUMA tools and is made available in the libmemacct library of a new package called memacct. The library function calls the numatools kernel module, which returns the weighted sum to the library, which in turn returns it to the application.

The usage statement for the memacct call is as follows:

cc ... -lmemacct
 #include <sys/types.h>
 extern int get_weighted_memory_size(pid_t pid);

The syntax of the memacct call is as follows:

int get_weighted_memory_size(pid_t pid);

Returns the weighted memory (RSS) size for a pid, in bytes. This weights the size of shared regions by the number of processes accessing them. Returns -1 when an error occurs and sets errno, as follows:

ESRCH 

Process pid was not found.

ENOSYS 

The function is not implemented. Check if numatools kernel package is up-to-date.

Normally, the following errors should not occur:

ENOENT 

Cannot open /proc/numatools device file.

EPERM 

No read permission on /proc/numatools device file.

ENOTTY 

Inappropriate ioctl operation on /proc/numatools device file.

EFAULT 

Invalid arguments. The ioctl() operation performed by the function failed with invalid arguments.

For more information, see the memacct(3) man page.
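
The following is a minimal sketch of a program that uses this call, assuming the libmemacct library and numatools kernel module are installed as described above; the program and file names are illustrative only:

/* weighted_rss.c - print the weighted memory (RSS) size of a process.
 * Build with:  cc -o weighted_rss weighted_rss.c -lmemacct
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Prototype provided by the memacct package (see above). */
extern int get_weighted_memory_size(pid_t pid);

int main(int argc, char **argv)
{
    pid_t pid;
    int bytes;

    if (argc != 2) {
        fprintf(stderr, "usage: %s pid\n", argv[0]);
        return 1;
    }
    pid = (pid_t)atol(argv[1]);

    /* Ask the numatools kernel module for the weighted RSS of this pid. */
    bytes = get_weighted_memory_size(pid);
    if (bytes == -1) {
        perror("get_weighted_memory_size");   /* errno set as described above */
        return 1;
    }
    printf("weighted RSS for pid %ld: %d bytes\n", (long)pid, bytes);
    return 0;
}

For example, to report the weighted size of your login shell:

% cc -o weighted_rss weighted_rss.c -lmemacct
% ./weighted_rss $$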