Chapter 7. Flexible File I/O

Flexible File I/O (FFIO) provides a mechanism for improving the file I/O performance of existing applications without requiring source code changes; the existing executable remains unchanged. Knowledge of the source code is not required, but some knowledge of how the source and the application software work can help you better interpret and optimize FFIO results. To take advantage of FFIO, all you need to do is set some environment variables before running your application. This chapter covers the following topics:

  • FFIO Operation

  • Environment Variables

  • Simple Examples

  • Multithreading Considerations

  • Application Examples

  • Event Tracing

  • System Information and Issues

FFIO Operation

The FFIO subsystem allows you to define one or more additional I/O buffer caches for specific files to augment the Linux kernel I/O buffer cache. The FFIO subsystem then manages this buffer cache for you. To accomplish this, FFIO intercepts standard I/O calls like open, read, and write and replaces them with FFIO equivalent routines. These routines route I/O requests through the FFIO subsystem, which utilizes the user-defined FFIO buffer cache.

FFIO can bypass the Linux kernel I/O buffer cache by communicating with the disk subsystem via direct I/O. This gives you precise control over cache I/O characteristics and allows for more efficient I/O requests. For example, doing direct I/O in large chunks (say, 16 megabytes) allows the FFIO cache to amortize disk accesses. All file buffering occurs in user space when FFIO is used with direct I/O enabled. This differs from the Linux buffer cache mechanism, which requires a context switch in order to buffer data in kernel memory. Avoiding this kind of overhead helps FFIO scale efficiently.

Another important distinction is that FFIO allows you to create an I/O buffer cache dedicated to a specific application. The Linux kernel, on the other hand, has to serve all the jobs on the entire system with a single I/O buffer cache. As a result, FFIO typically outperforms the Linux kernel buffer cache on I/O-intensive throughput workloads.
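
FFIO's interception code is internal to libFFIO.so, but the general LD_PRELOAD interposition technique it relies on can be sketched in C. The wrapper below is a minimal, hypothetical illustration (not FFIO source): it merely logs and forwards open(2) calls, whereas libFFIO.so's real replacements route each request through the FFIO buffer cache.

/* mywrap.c -- minimal LD_PRELOAD interposition sketch (illustrative only).
 * Build: gcc -shared -fPIC mywrap.c -o mywrap.so -ldl
 * Run:   LD_PRELOAD=./mywrap.so <application>                              */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/types.h>

int open(const char *path, int flags, ...)
{
    /* Locate the "real" open() in the next library in the search
     * order (normally glibc). */
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {          /* mode argument exists only with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    fprintf(stderr, "intercepted open(%s)\n", path);
    return real_open(path, flags, mode);   /* forward to glibc */
}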

Environment Variables

There are only two environment variables that you need to set in order to use FFIO. They are LD_PRELOAD and FF_IO_OPTS.

In order to enable FFIO to trap standard I/O calls, you must set the LD_PRELOAD environment variable.

For SGI Altix 4000 series systems, perform the following:

setenv LD_PRELOAD /usr/lib/libFFIO.so

For SGI Altix XE systems, perform the following:

setenv LD_PRELOAD /usr/lib64/libFFIO.so

LD_PRELOAD is a Linux feature that instructs the dynamic linker to preload the indicated shared libraries. In this case, libFFIO.so is preloaded and provides the routines that replace the standard I/O calls. An application that is not dynamically linked with the glibc library will not work with FFIO, because the standard I/O calls will not be intercepted (you can check an executable's dynamic dependencies with the ldd(1) command). To disable FFIO, perform the following:

unsetenv LD_PRELOAD

The FF_IO_OPTS environment variable controls the FFIO buffer cache. The syntax for setting this variable can be quite complex, so a simple method for defining it is as follows:

setenv FF_IO_OPTS  '<string>(eie.direct.mbytes:<size>:<num>:<lead>:<share>:<stride>:0)'

You can use the following parameters with the FF_IO_OPTS environment variable:

<string> 

A pattern that matches the names of files that use this buffer cache.

<size> 

Number of 4k blocks in each page of the I/O buffer cache.

<num> 

Number of pages in the I/O buffer cache.

<lead> 

The maximum number of "read ahead" pages.

<share> 

A value of 1 means the cache is shared; a value of 0 means it is private.

<stride> 

Note that the number after the stride parameter is always 0.

The following example shows a command that creates a shared buffer cache of 128 pages, where each page is 16 megabytes (that is, 4096 blocks of 4k each), for a total cache size of 2 gigabytes. The cache has a lead of six pages and uses a stride of one, as follows:

setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'

Each time the application opens a file, the FFIO code checks the file name to see if it matches the string supplied by FF_IO_OPTS. The file's path name is not considered when checking for a match against the string, only the base file name. So in the example supplied above, file names like /tmp/test16 and /var/tmp/testit would both be a match.
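
FFIO's actual matching code is not shown in this documentation, but the behavior just described, a glob-style match against the base file name that ignores the directory path, can be illustrated with the standard fnmatch(3) routine. The helper name ffio_would_match() below is hypothetical:

/* Illustrative sketch of basename-only pattern matching. */
#include <fnmatch.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>

static int ffio_would_match(const char *pattern, const char *path)
{
    char buf[4096];                       /* basename() may modify its arg */
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    return fnmatch(pattern, basename(buf), 0) == 0;   /* 0 means a match */
}

int main(void)
{
    printf("%d\n", ffio_would_match("test*", "/tmp/test16"));     /* 1 */
    printf("%d\n", ffio_would_match("test*", "/var/tmp/testit")); /* 1 */
    printf("%d\n", ffio_would_match("test*", "/tmp/mytest"));     /* 0 */
    return 0;
}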

More complicated uses of FF_IO_OPTS are built upon this simpler version. For example, multiple types of file names can share the same cache, as follows:

setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0)'

Multiple caches may also be specified with FF_IO_OPTS. In the example that follows, files of the form output* and test* share a 128 page cache of 16 megabyte pages, while the file special42 gets a 256 page private cache of 32 megabyte pages:

setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0) special42(eie.direct.mbytes:8192:256:6:0:1:0)'

Additional parameters can be added to FF_IO_OPTS to produce diagnostic feedback on standard output. Examples of this diagnostic output are presented in the following section.

Simple Examples

This section walks you through some simple examples using FFIO.

Assume that LD_PRELOAD is set to the correct library and FF_IO_OPTS is defined, as follows:

setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'

This example uses a small C program called fio that reads four megabyte chunks from a file for 100 iterations. When the program runs, it produces the following output:

./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time  = 7.383761
Throughput  = 56.804439 MB/sec
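
The source of the fio program itself is not shown in this guide; a minimal sketch of a comparable reader (with the -n option handling omitted and the iteration count fixed at 100 for brevity) might look like the following:

/* fio-like sketch: read 4 MB chunks from a file 100 times and report
 * throughput.  Illustrative only; not the actual fio source. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int iters = 100;
    char *buf = malloc(CHUNK);
    int fd = open(argv[argc - 1], O_RDONLY);
    if (fd < 0 || buf == NULL) {
        perror(argv[argc - 1]);
        return 1;
    }

    printf("Reading %d bytes %d times to %s\n", CHUNK, iters, argv[argc - 1]);
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        if (read(fd, buf, CHUNK) < 0) {   /* short reads ignored in this sketch */
            perror("read");
            return 1;
        }
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("Total time  = %f\n", secs);
    printf("Throughput  = %f MB/sec\n", iters * (CHUNK / 1048576.0) / secs);
    close(fd);
    free(buf);
    return 0;
}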

It can be difficult to tell what FFIO may or may not be doing, even with a simple program such as the one shown above. A summary of the FFIO operations that occurred can be directed to standard output by making a simple addition to FF_IO_OPTS, as follows:

setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.notrace)'

This new setting for FF_IO_OPTS generates the following summary on standard output when the program is run:

./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time  = 7.383761
Throughput  = 56.804439 MB/sec

event_close(testit)    eie <-->syscall   (496 mbytes)/( 8.72 s)=   56.85 mbytes/s
oflags=0x0000000000004042=RDWR+CREAT+DIRECT
sector size =4096(bytes)
cblks =0  cbits =0x0000000000000000
current file size =512 mbytes   high water file size =512 mbytes

function     times      wall     all     mbytes     mbytes       min        max       avg
             called     time    hidden  requested  delivered   request    request    request
    open          1     0.00
    read          2     0.61                 32         32        16         16         16
    reada        29     0.01       0        464        464        16         16         16
    fcntl
       recall
       reada     29     8.11
       other      5     0.00
    flush         1     0.00
    close         1     0.00

Two synchronous reads of 16 megabytes each were issued (for a total of 32 megabytes), and 29 asynchronous reads (reada) were also issued (for a total of 464 megabytes). Note that the program itself requested only 400 megabytes (100 reads of 4 megabytes each); because FFIO performs its disk I/O in 16 megabyte pages, 496 megabytes were fetched from disk. Additional diagnostic information can be generated by specifying the .diag modifier, as follows:

setenv FF_IO_OPTS 'test*(eie.direct.diag.mbytes:4096:128:6:1:1:0)'

The .diag modifier may also be used in conjunction with event.summary; the two operate independently of one another, as follows:

setenv FF_IO_OPTS 'test*(eie.diag.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.notrace)'

The following is an example of the diagnostic output generated when just the .diag modifier is used:

./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time  = 7.383761
Throughput  = 56.804439 MB/sec

eie_close EIE final stats for file /build/testit
eie_close  Used shared eie cache 1
eie_close  128 mem pages of 4096 blocks (4096 sectors), max_lead = 6 pages
eie_close  advance reads used/started :       23/29    79.31%   (1.78 seconds wasted)
eie_close  write hits/total           :        0/0      0.00%
eie_close  read  hits/total           :       98/100   98.00%
eie_close  mbytes transferred    parent --> eie --> child      sync        async
eie_close                                 0            0        0             0
eie_close                               400          496        2            29 (0,0)
eie_close                        parent <-- eie <-- child

eie_close EIE stats for Shared cache 1
eie_close  128 mem pages of 4096 blocks
eie_close  advance reads used/started :       23/29    79.31%   (0.00 seconds wasted)
eie_close  write hits/total           :        0/0      0.00%
eie_close  read  hits/total           :       98/100   98.00%
eie_close  mbytes transferred    parent --> eie --> child      sync        async
eie_close                                 0                     0             0
eie_close                               400         496         2            29 (0,0)

Information is listed for both the file and the cache. The mbytes transferred portion of the output is shown below:

eie_close  mbytes transferred    parent --> eie --> child      sync        async
eie_close                                 0                     0             0
eie_close                               400         496         2            29 (0,0)

The last two lines are for write and read operations, respectively. Only for very simple I/O patterns can the difference between the (parent --> eie) and (eie --> child) read statistics be explained by the number of read-aheads; for random reads of a large file over a long period of time, this is not the case. All write operations count as async.

Multithreading Considerations

FFIO will work with applications that use MPI for parallel processing. An MPI job assigns each thread a number, or rank. The master thread has rank 0, while the remaining threads (called slave threads) have ranks from 1 to N-1, where N is the total number of threads in the MPI job. It is important to consider that the threads comprising an MPI job do not (necessarily) have access to each other's address space. As a result, there is no way for the different MPI threads to share the same FFIO cache. By default, each thread defines a separate FFIO cache based on the parameters defined by FF_IO_OPTS.

Having each MPI thread define a separate FFIO cache based on a single environment variable (FF_IO_OPTS) can waste a lot of memory. Fortunately, FFIO provides a mechanism that allows you to specify a different FFIO cache for each MPI thread via the following environment variables:

setenv FF_IO_OPTS_RANK0 'result*(eie.direct.mbytes:4096:512:6:1:1:0)'
setenv FF_IO_OPTS_RANK1 'output*(eie.direct.mbytes:1024:128:6:1:1:0)'
setenv FF_IO_OPTS_RANK2 'input*(eie.direct.mbytes:2048:64:6:1:1:0)'
             .
             .
             .
setenv FF_IO_OPTS_RANKN-1 ...   (N = number of threads).

Each rank environment variable is set using the exact same syntax as FF_IO_OPTS, and each defines a distinct cache for the corresponding MPI rank. If the cache is designated shared, all files within the same rank will use the same cache. FFIO works with SGI MPI, HP MPI, and LAM MPI. In order to work with MPI applications, FFIO needs to determine the rank of callers by invoking the mpi_comm_rank_() MPI library routine. Therefore, FFIO needs to determine the location of the MPI library used by the application. This is accomplished by setting one (and only one) of the following environment variables:

setenv SGI_MPI /usr/lib   # ia64 only
         or
setenv LAM_MPI *see below
         or
setenv HP_MPI  *see below
          
*LAM and HP MPIs are usually distributed via a third-party application. The precise
paths to the LAM and the HP MPI libraries are application dependent. Refer to the
application installation guide to find the correct path.

In order to use the rank functionality, both the MPI library environment variable and the FF_IO_OPTS_RANK0 environment variable must be set. If either variable is not set, then the MPI threads all use FF_IO_OPTS. If both the MPI library variable and FF_IO_OPTS_RANK0 are defined but, for example, FF_IO_OPTS_RANK2 is undefined, all rank 2 files would generate no match with FFIO. This means that none of the rank 2 files would be cached by FFIO (in this case things DO NOT default to FF_IO_OPTS).
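
These selection rules can be summarized with a short sketch. The logic below simply mirrors the description above; the helper name pick_ffio_opts() is hypothetical, and the real decision is made inside libFFIO.so:

/* Illustrative sketch of the per-rank cache selection rules. */
#include <stdio.h>
#include <stdlib.h>

/* Returns the cache specification that would apply to a given MPI
 * rank, or NULL for "no match" (files for that rank are not cached). */
static const char *pick_ffio_opts(int rank)
{
    /* Rank variables take effect only when both the MPI library
     * variable and FF_IO_OPTS_RANK0 are set. */
    int mpi_located = getenv("SGI_MPI") || getenv("LAM_MPI") || getenv("HP_MPI");
    if (!mpi_located || getenv("FF_IO_OPTS_RANK0") == NULL)
        return getenv("FF_IO_OPTS");    /* every rank uses FF_IO_OPTS */

    /* Otherwise an unset per-rank variable means "no match";
     * there is NO fallback to FF_IO_OPTS. */
    char name[64];
    snprintf(name, sizeof(name), "FF_IO_OPTS_RANK%d", rank);
    return getenv(name);
}

int main(void)
{
    for (int rank = 0; rank < 4; rank++) {
        const char *opts = pick_ffio_opts(rank);
        printf("rank %d -> %s\n", rank, opts ? opts : "(no FFIO cache)");
    }
    return 0;
}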

Fortran and C/C++ applications that use the pthreads interface create threads that share the same address space. These threads can all make use of the single FFIO cache defined by FF_IO_OPTS.

Application Examples

FFIO has been deployed successfully with several HPC applications, such as Nastran and Abaqus. In a recent customer benchmark, an eight-way Abaqus throughput job ran approximately twice as fast when FFIO was used. The FFIO cache used 16 megabyte pages (that is, page_size = 4096) and the cache size was 8.0 gigabytes (512 pages of 16 megabytes each). As a rule of thumb, it was determined that setting the FFIO cache size to roughly 10-15% of the disk space required by Abaqus yielded reasonable I/O performance. For this benchmark, the FF_IO_OPTS environment variable was defined as follows:

setenv FF_IO_OPTS '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
       *.nck* *.sct *.lop *.ngr *.elm *.ptn* *.stp* *.eig *.lnz* *.mass *.inp* *.scn* *.ddm
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'

For the MPI version of Abaqus, different caches were specified for each MPI rank, as follows:

setenv FF_IO_OPTS_RANK0 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm  
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'
      
setenv FF_IO_OPTS_RANK1 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm     
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
      
setenv FF_IO_OPTS_RANK2 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm  
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
      
setenv FF_IO_OPTS_RANK3 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm     
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'

Event Tracing

By specifying the .trace option as part of the event parameter, you can enable the event tracing feature in FFIO, as follows:

setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.trace)'

This option generates files of the form ffio.events.pid for each process that is part of the application, where pid is the process ID. By default, event files are placed in /tmp, but this destination can be changed by setting the FFIO_TMPDIR environment variable. These files contain time-stamped events for files using the FFIO cache and can be used to trace I/O activity (for example, I/O sizes and offsets).

System Information and Issues

The SGI ProPack 5 Service Pack 1 release provided the first stable version of FFIO. Applications written in C, C++, and Fortran are supported. C and C++ applications can be built with either the Intel or gcc compiler. Only Fortran codes built with the Intel compiler will work.

The following restrictions on FFIO must also be observed:

  • The FFIO implementation of pread/pwrite is not correct: the file offset advances, although POSIX requires that it remain unchanged (see the sketch following this list).

  • Do not use FFIO to do I/O on a socket.

  • Do not link your application with the librt asynchronous I/O library.

  • Calls that operate on files in /proc, /etc, and /dev are not intercepted by FFIO.

  • Calls that operate on stdin, stdout, and stderr are not intercepted by FFIO.

  • FFIO is not intended for generic I/O utilities such as vi, cp, or mv.
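
For the pread/pwrite restriction noted above, the following minimal sketch (standard POSIX calls only) can be used to observe whether the file offset advances after a pread(), which POSIX requires it not to do:

/* Check whether pread() leaves the file offset unchanged. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[4096];
    int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = pread(fd, buf, sizeof(buf), 0);   /* read at offset 0 */
    off_t pos = lseek(fd, 0, SEEK_CUR);           /* where are we now? */

    /* POSIX: pread() must leave the offset unchanged, so pos should be 0. */
    printf("read %zd bytes, offset is now %lld\n", n, (long long)pos);
    close(fd);
    return 0;
}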