Chapter 7. Checkpoint/Restart

MPT 2.02 and later support application checkpoint/restart through the Berkeley Lab Checkpoint/Restart (BLCR) implementation. Checkpointing lets an application periodically save a copy of its state. If the application crashes, or if the job is aborted to free up resources for higher-priority jobs, the application can later resume from a saved checkpoint.

There are some important limitations to keep in mind, as follows:

For more information on BLCR, see https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml

BLCR Installation

To use checkpoint/restart with MPT, BLCR must first be installed. This requires installing the blcr-, blcr-libs-, and blcr-kmp- RPMs. BLCR must then be enabled by root, as follows:

% chkconfig blcr on

BLCR uses a kernel module that must be built against the specific kernel the operating system is running. If the kernel module fails to load, it must be rebuilt and reinstalled: install the blcr- SRPM, set the kernel variable in the blcr.spec file to the name of the current kernel, then rebuild and install the new set of RPMs.
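Before rebuilding, it can be useful to confirm whether the module is actually loaded. The following is a minimal sketch, not part of BLCR or MPT. It assumes the module is listed under the name blcr in /proc/modules; the function name blcr_module_loaded is hypothetical.

```shell
# Hypothetical helper, not part of BLCR or MPT.
# Assumption: the BLCR kernel module is listed under the name "blcr".
# Returns 0 if a module named "blcr" appears in the given modules file
# (default: /proc/modules), non-zero otherwise.
blcr_module_loaded() {
    modules_file="${1:-/proc/modules}"
    # Each line of /proc/modules starts with the module name and a space.
    grep -q '^blcr ' "$modules_file"
}

# Example use:
#   if ! blcr_module_loaded; then
#       echo "blcr module not loaded for kernel $(uname -r)" >&2
#   fi
```

The kernel name printed by uname -r is the value to compare against the kernel that the installed blcr-kmp- RPM was built for.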

Using BLCR with MPT

To enable checkpoint/restart within MPT, pass the -cpr option to mpirun or mpiexec_mpt, for example:

% mpirun -cpr hostA, hostB -np 8 ./a.out

To checkpoint a job, run the mpt_checkpoint command on the same host where mpirun is running. Pass mpt_checkpoint the PID of mpirun (-p) and a name to use as the prefix for all of the checkpoint files (-f). For example:

% mpt_checkpoint -p 12345 -f my_checkpoint

This creates a my_checkpoint.cps metadata file and a number of my_checkpoint.*.cpd files.
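A script that takes checkpoints may want to confirm that the expected files exist before relying on them. The following is a minimal sketch, not part of MPT; the function name checkpoint_files_present is hypothetical. It checks only that the files named above are present, not that their contents are valid.

```shell
# Hypothetical helper, not part of MPT.
# Succeeds if PREFIX.cps exists and at least one PREFIX.*.cpd file exists.
checkpoint_files_present() {
    prefix="$1"
    # The .cps metadata file must exist.
    [ -f "${prefix}.cps" ] || return 1
    for f in "${prefix}".*.cpd; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}

# Example use:
#   checkpoint_files_present my_checkpoint || echo "checkpoint files missing" >&2
```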

To restart the job, pass the name of the .cps file to mpirun, for example:

% mpirun -restart my_checkpoint.cps hostC, hostD -np 8 ./a.out

The job may be restarted on a different set of hosts, but the number of hosts must be the same, and each host must have the same number of ranks as the corresponding host in the original run of the job.
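The layout rule can be checked before a restart is attempted. The following is a minimal sketch, not part of MPT; the function name same_rank_layout and the convention of describing a layout as a space-separated list of per-host rank counts are assumptions made for illustration.

```shell
# Hypothetical helper, not part of MPT.
# Each argument is a space-separated list of ranks per host, in host order,
# for example "4 4" for two hosts running four ranks each.
# Succeeds only if both layouts have the same number of hosts and the same
# rank count on each corresponding host.
same_rank_layout() {
    original="$1"
    restart="$2"
    # Normalize whitespace by letting the shell re-split each list.
    set -- $original; original="$*"
    set -- $restart;  restart="$*"
    [ "$original" = "$restart" ]
}

# Example use:
#   same_rank_layout "4 4" "4 4" && echo "layouts match"
#   same_rank_layout "4 4" "8"   || echo "layouts differ"
```

Host names are deliberately left out of the comparison, since the hosts themselves may differ between the original run and the restart.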