Flexible File I/O (FFIO) provides a mechanism for improving the file I/O performance of existing applications without requiring source code changes; the existing executable remains unchanged. Knowledge of the source code is not required, but some knowledge of how the application performs its I/O can help you better interpret and optimize FFIO results. To take advantage of FFIO, all you need to do is set some environment variables before running your application.
The FFIO subsystem allows you to define one or more additional I/O buffer caches for specific files to augment the Linux kernel I/O buffer cache. The FFIO subsystem then manages this buffer cache for you. To accomplish this, FFIO intercepts standard I/O calls such as open, read, and write, and replaces them with FFIO equivalent routines. These routines route I/O requests through the FFIO subsystem, which uses the user-defined FFIO buffer cache.

FFIO can bypass the Linux kernel I/O buffer cache by communicating with the disk subsystem via direct I/O. This gives you precise control over cache I/O characteristics and allows for more efficient I/O requests. For example, doing direct I/O in large chunks (say, 16 megabytes) allows the FFIO cache to amortize the cost of disk accesses. When FFIO is used with direct I/O enabled, all file buffering occurs in user space. This differs from the Linux buffer cache mechanism, which requires a context switch in order to buffer data in kernel memory. Avoiding this kind of overhead helps FFIO scale efficiently.

Another important distinction is that FFIO allows you to create an I/O buffer cache dedicated to a specific application, whereas the Linux kernel must serve all the jobs on the entire system with a single I/O buffer cache. As a result, FFIO typically outperforms the Linux kernel buffer cache on I/O-intensive workloads.
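The interception described above relies on the dynamic linker's preload mechanism. The following minimal sketch shows how a preloaded library can interpose on read; it is illustrative only and is not FFIO source code (the variable real_read is a hypothetical name):

/* Minimal sketch of LD_PRELOAD interposition, the mechanism FFIO uses
 * to trap standard I/O calls.  Illustrative only; not FFIO source. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

/* Same signature as read(2).  Because this library is preloaded, the
 * dynamic linker resolves the application's read() calls here first. */
ssize_t read(int fd, void *buf, size_t count)
{
    /* Locate the next read() in the link order (normally libc's). */
    static ssize_t (*real_read)(int, void *, size_t);
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))
                        dlsym(RTLD_NEXT, "read");

    /* A real interposer such as FFIO would route the request through
     * its own buffer cache here; this sketch simply forwards it. */
    return real_read(fd, buf, count);
}

Built as a shared library (for example, gcc -shared -fPIC -o libwrap.so wrap.c -ldl) and named in LD_PRELOAD, a wrapper like this sees every matching call the application makes.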
You need to set only two environment variables to use FFIO: LD_PRELOAD and FF_IO_OPTS.
In order to enable FFIO to trap standard I/O calls, you must set the LD_PRELOAD environment variable.
For SGI Altix 4000 series systems, perform the following:
setenv LD_PRELOAD /usr/lib/libFFIO.so
For SGI Altix XE systems, perform the following:
setenv LD_PRELOAD /usr/lib64/libFFIO.so
When you have finished running your application, unset LD_PRELOAD so that FFIO does not trap the I/O of subsequent commands:
unsetenv LD_PRELOAD
The FFIO buffer cache is managed through the FF_IO_OPTS environment variable. Its full syntax can be quite complex; a simple form for defining this variable is as follows:
setenv FF_IO_OPTS '<string>(eie.direct.mbytes:<size>:<num>:<lead>:<share>:<stride>:0)'
You can use the following parameters with the FF_IO_OPTS environment variable:
<string>    Matches the names of files that can use the buffer cache.
<size>      Number of 4k blocks in each page of the I/O buffer cache.
<num>       Number of pages in the I/O buffer cache.
<lead>      The maximum number of read-ahead pages.
<share>     A value of 1 means a shared cache; a value of 0 means a private cache.
<stride>    The read-ahead stride. Note that the number after the <stride> parameter is always 0.
The following example shows a command that creates a shared buffer cache of 128 pages where each page is 16 megabytes (that is, 4096*4k). The cache has a lead of six pages and uses a stride of one, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'
Each time the application opens a file, the FFIO code checks the file name to see if it matches the string supplied by FF_IO_OPTS. The file's path name is not considered when checking for a match against the string, so in the example above, file names such as /tmp/test16 and /var/tmp/testit would both match.
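You can picture this behavior as a glob-style match against the final path component only. The following sketch uses fnmatch() for illustration; ffio_match() is a hypothetical helper, not FFIO's actual matcher:

/* Sketch of basename-only pattern matching as described above. */
#include <fnmatch.h>
#include <libgen.h>
#include <stdio.h>

static int ffio_match(const char *pattern, const char *path)
{
    /* Compare against the final path component only, so /tmp/test16
     * and /var/tmp/testit both match "test*".  basename() may modify
     * its argument, so work on a copy. */
    char copy[4096];
    snprintf(copy, sizeof copy, "%s", path);
    return fnmatch(pattern, basename(copy), 0) == 0;
}

int main(void)
{
    printf("%d\n", ffio_match("test*", "/tmp/test16"));     /* 1 */
    printf("%d\n", ffio_match("test*", "/var/tmp/testit")); /* 1 */
    printf("%d\n", ffio_match("test*", "/tmp/other"));      /* 0 */
    return 0;
}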
More complicated uses of FF_IO_OPTS build on this simple form. For example, multiple types of file names can share the same cache, as follows:
setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0)'
Multiple caches may also be specified with FF_IO_OPTS. In the example that follows, files of the form output* and test* share a 128 page cache of 16 megabyte pages. The file special42 has a 256 page private cache of 32 megabyte pages, as follows:
setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0) special42(eie.direct.mbytes:8192:256:6:0:1:0)'
Additional parameters can be added to FF_IO_OPTS to produce diagnostic feedback on standard output. Examples of this diagnostic output are presented in the following section.
This section walks you through some simple examples using FFIO.
Assume that LD_PRELOAD is set for the correct library and FF_IO_OPTS is defined, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'
This example uses a small C program called fio that reads four-megabyte chunks from a file for 100 iterations. When run, the program produces the following output:
./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time = 7.383761
Throughput = 56.804439 MB/sec
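The fio source is not included in this text; the following is a hypothetical reconstruction of such a reader (the argument handling is an assumption), which you can use to experiment with FFIO settings:

/* Hypothetical stand-in for the fio reader used above: reads 4 MB
 * chunks for a given number of iterations and reports throughput. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: fio [-n iters] file\n"); return 1; }
    int iters = 100;                      /* matches -n 100 above */
    const char *path = argv[argc - 1];
    char *buf = malloc(CHUNK);
    int fd = open(path, O_RDONLY);        /* trapped by FFIO when preloaded */
    if (fd < 0 || !buf) { perror(path); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++)
        if (read(fd, buf, CHUNK) < 0) { perror("read"); return 1; }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("Total time = %f\nThroughput = %f MB/sec\n",
           secs, iters * (CHUNK / 1048576.0) / secs);
    close(fd);
    free(buf);
    return 0;
}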
It can be difficult to tell what FFIO may or may not be doing, even with a program as simple as this. A summary of the FFIO operations that occurred can be directed to standard output by making a simple addition to FF_IO_OPTS, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0,event.summary.mbytes.notrace)'
This new setting for FF_IO_OPTS generates the following summary on standard output when the program is run:
./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time = 7.383761
Throughput = 56.804439 MB/sec

event_close(testit)   eie <-->syscall   (496 mbytes)/( 8.72 s)= 56.85 mbytes/s
    oflags=0x0000000000004042=RDWR+CREAT+DIRECT
    sector size =4096(bytes)
    cblks =0  cbits =0x0000000000000000
    current file size =512 mbytes    high water file size =512 mbytes
 function  times   wall    all     mbytes     mbytes     min      max      avg
           called  time   hidden  requested  delivered  request  request  request
 open          1   0.00
 read          2   0.61              32         32        16       16       16
 reada        29   0.01      0      464        464        16       16       16
 fcntl
   recall
   reada      29   8.11
 other         5   0.00
 flush         1   0.00
 close         1   0.00
Two synchronous reads of 16 megabytes each were issued (for a total of 32 megabytes), and 29 asynchronous reads (reada) were also issued (for a total of 464 megabytes). In other words, the application requested 400 megabytes (100 reads of 4 megabytes each), while the underlying system calls transferred 496 megabytes (31 reads of a full 16 megabyte cache page each). Additional diagnostic information can be generated by specifying the .diag modifier, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.diag.mbytes:4096:128:6:1:1:0)'
The .diag modifier may also be used in conjunction with event.summary; the two operate independently of one another, as follows:
setenv FF_IO_OPTS 'test*(eie.diag.direct.mbytes:4096:128:6:1:1:0,event.summary.mbytes.notrace)'
The following is an example of the diagnostic output generated when just the .diag modifier is used:
./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time = 7.383761
Throughput = 56.804439 MB/sec

eie_close EIE final stats for file /build/testit
eie_close Used shared eie cache 1
eie_close 128 mem pages of 4096 blocks (4096 sectors), max_lead = 6 pages
eie_close advance reads used/started :  23/29   79.31%   (1.78 seconds wasted)
eie_close write hits/total :   0/0      0.00%
eie_close read  hits/total :  98/100   98.00%
eie_close mbytes transferred   parent --> eie --> child   sync  async
eie_close                  0          0                0     0
eie_close                400        496                2    29  (0,0)
eie_close                 parent <-- eie <-- child
eie_close EIE stats for Shared cache 1
eie_close 128 mem pages of 4096 blocks
eie_close advance reads used/started :  23/29   79.31%   (0.00 seconds wasted)
eie_close write hits/total :   0/0      0.00%
eie_close read  hits/total :  98/100   98.00%
eie_close mbytes transferred   parent --> eie --> child   sync  async
eie_close                  0          0           0
eie_close                400        496           2    29  (0,0)
Information is listed for both the file and the cache. An example of the mbytes transferred portion of the output is shown below:
eie_close mbytes transferred   parent --> eie --> child   sync  async
eie_close                  0          0           0
eie_close                400        496           2    29  (0,0)
The last two lines are for write and read operations, respectively. Only for very simple I/O patterns can the difference between the (parent --> eie) and (eie --> child) read statistics be explained by the number of read-aheads; for random reads of a large file over a long period of time, this is not the case. All write operations count as async.
FFIO works with applications that use MPI for parallel processing. An MPI job assigns each thread a number, or rank. The master thread has rank 0, while the remaining threads (called slave threads) have ranks from 1 to N-1, where N is the total number of threads in the MPI job. It is important to consider that the threads comprising an MPI job do not (necessarily) have access to each other's address spaces. As a result, there is no way for the different MPI threads to share the same FFIO cache. By default, each thread defines a separate FFIO cache based on the parameters defined by FF_IO_OPTS.
Having each MPI thread define a separate FFIO cache based on a single environment variable (FF_IO_OPTS) can waste a lot of memory. Fortunately, FFIO provides a mechanism that allows the user to specify a different FFIO cache for each MPI thread via the following environment variables:
setenv FF_IO_OPTS_RANK0 'result*(eie.direct.mbytes:4096:512:6:1:1:0)'
setenv FF_IO_OPTS_RANK1 'output*(eie.direct.mbytes:1024:128:6:1:1:0)'
setenv FF_IO_OPTS_RANK2 'input*(eie.direct.mbytes:2048:64:6:1:1:0)'
     .
     .
     .
setenv FF_IO_OPTS_RANKN-1 ...     (N = number of threads)
Each rank environment variable is set using exactly the same syntax as FF_IO_OPTS, and each defines a distinct cache for the corresponding MPI rank. If a cache is designated as shared, all files opened by threads of the same rank will use that cache. FFIO works with SGI MPI, HP MPI, and LAM MPI. In order to work with MPI applications, FFIO needs to determine the rank of each caller by invoking the mpi_comm_rank_() MPI library routine, and it therefore needs to know the location of the MPI library used by the application. This is accomplished by having the user set one (and only one) of the following environment variables:
setenv SGI_MPI /usr/lib     # ia64 only
   or
setenv LAM_MPI *see below
   or
setenv HP_MPI *see below

*LAM and HP MPIs are usually distributed via a third-party application. The precise paths to the LAM and HP MPI libraries are application dependent. Refer to the application installation guide to find the correct path.
In order to use the rank functionality, both the MPI library environment variable and FF_IO_OPTS_RANK0 must be set. If either variable is not set, all MPI threads use FF_IO_OPTS. If both variables are defined but, for example, FF_IO_OPTS_RANK2 is undefined, all rank 2 files would generate a "no match" with FFIO. This means that none of the rank 2 files would be cached by FFIO (in this case things DO NOT default to FF_IO_OPTS).
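The selection rules just described can be summarized in code. In the following sketch, ffio_opts_for_rank() is a hypothetical helper written for illustration; only the rules themselves come from the text above:

/* Sketch of the rank-based option lookup described above. */
#include <stdio.h>
#include <stdlib.h>

static const char *ffio_opts_for_rank(int rank)
{
    /* Rank mode requires both an MPI location variable and
     * FF_IO_OPTS_RANK0 to be set. */
    int rank_mode = getenv("FF_IO_OPTS_RANK0") &&
                    (getenv("SGI_MPI") || getenv("LAM_MPI") ||
                     getenv("HP_MPI"));
    if (!rank_mode)
        return getenv("FF_IO_OPTS");   /* all ranks share one spec */

    char name[32];
    snprintf(name, sizeof name, "FF_IO_OPTS_RANK%d", rank);
    /* NULL means "no match": this rank's files are not cached; there
     * is NO fallback to FF_IO_OPTS in rank mode. */
    return getenv(name);
}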
Fortran and C/C++ applications that use the pthreads interface will create threads that share the same address space. These threads can all make use of the single FFIO cache defined by FF_IO_OPTS.
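As a minimal illustration, both threads in the following sketch run in one address space, so their reads of matching files can go through the single cache defined by FF_IO_OPTS (the file names test1 and test2 are hypothetical):

/* Two pthreads in one address space; unlike MPI ranks, both can use
 * the single FFIO cache named by FF_IO_OPTS. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static void *reader(void *arg)
{
    char buf[4096];
    int fd = open((const char *)arg, O_RDONLY);  /* trapped by FFIO */
    if (fd >= 0) {
        while (read(fd, buf, sizeof buf) > 0)
            ;
        close(fd);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, (void *)"test1");  /* matches test* */
    pthread_create(&t2, NULL, reader, (void *)"test2");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}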
FFIO has been deployed successfully with several HPC applications, such as Nastran and Abaqus. In a recent customer benchmark, an eight-way Abaqus throughput job ran approximately twice as fast when FFIO was used. The FFIO cache used 16 megabyte pages (that is, <size> = 4096) and the cache size was 8.0 gigabytes (512 pages). As a rule of thumb, setting the FFIO cache size to roughly 10-15% of the disk space required by Abaqus yielded reasonable I/O performance. For this benchmark, the FF_IO_OPTS environment variable was defined as follows:
setenv FF_IO_OPTS '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 *.nck* *.sct *.lop *.ngr *.elm *.ptn* *.stp* *.eig *.lnz* *.mass *.inp* *.scn* *.ddm *.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'
For the MPI version of Abaqus, different caches were specified for each MPI rank, as follows:
setenv FF_IO_OPTS_RANK0 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm *.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'

setenv FF_IO_OPTS_RANK1 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'

setenv FF_IO_OPTS_RANK2 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'

setenv FF_IO_OPTS_RANK3 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
By specifying the .trace option as part of the event parameter, you can enable the event tracing feature in FFIO, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0,event.summary.mbytes.trace)'
This option generates files of the form ffio.events.pid for each process that is part of the application. By default, event files are placed in /tmp but this destination can be changed by setting the FFIO_TMPDIR environment variable. These files contain time stamped events for files using the FFIO cache and can be used to trace I/O activity (for example, I/O sizes and offsets).
The SGI ProPack 5 Service Pack 1 release provided the first stable version of FFIO. Applications written in C, C++, and Fortran are supported. C and C++ applications can be built with either the Intel or gcc compiler. Only Fortran codes built with the Intel compiler will work.
The following restrictions on FFIO must also be observed:
The FFIO implementation of pread/pwrite is not correct (the file offset advances).
Do not use FFIO to do I/O on a socket.
Do not link your application with the librt asynchronous I/O library.
Calls that operate on files in /proc, /etc, and /dev are not intercepted by FFIO.
Calls that operate on stdin, stdout, and stderr are not intercepted by FFIO.
FFIO is not intended for generic I/O applications such as vi, cp, or mv.