MPT 2.02 (or later) supports application checkpoint/restart through the Berkeley Lab Checkpoint/Restart (BLCR) implementation. This allows an application to periodically save a copy of its state and later resume from that point if the application crashes or the job is aborted to free resources for higher-priority jobs.
There are some important limitations to keep in mind, as follows:
BLCR does not checkpoint the state of any data files that the application may be using.
Certain MPI features, including spawning and one-sided MPI, are not supported when checkpoint/restart is used.
InfiniBand XRC queue pairs are not supported.
Checkpoint files are often very large and require significant disk bandwidth to create in a timely manner.
For more information on BLCR, see https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
To use checkpoint/restart with MPT, BLCR must first be installed. This requires installing the blcr-, blcr-libs-, and blcr-kmp- RPMs. BLCR must then be enabled by root, as follows:
% chkconfig blcr on
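The RPM installation step can be sketched as follows, assuming the downloaded RPM files are in the current directory; the file names are placeholders, since actual BLCR package names carry version and architecture suffixes specific to your distribution:

```shell
# Hypothetical file names: actual BLCR RPMs carry version and
# architecture suffixes specific to your distribution.
rpm -ivh blcr-*.rpm blcr-libs-*.rpm blcr-kmp-*.rpm
```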
BLCR uses a kernel module that must be built against the specific kernel the operating system is running. If the kernel module fails to load, it must be rebuilt and installed: install the blcr- SRPM, set the kernel variable in the blcr.spec file to the name of the current kernel, then rebuild and install the new set of RPMs.
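Whether the module loaded correctly can be verified before launching any jobs. A minimal sketch, assuming the module is named blcr, as is common in BLCR packaging:

```shell
# Check whether the BLCR kernel module is currently loaded.
# The module name "blcr" is an assumption based on common BLCR packaging.
if lsmod | grep -q '^blcr'; then
    echo "BLCR module loaded"
else
    echo "BLCR module not loaded; rebuild the kmp RPM for this kernel"
fi
```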
To enable checkpoint/restart within MPT, mpirun or mpiexec_mpt must be passed the -cpr option, for example:
% mpirun -cpr hostA, hostB -np 8 ./a.out
To checkpoint a job, use the mpt_checkpoint command on the host where mpirun is running. mpt_checkpoint must be passed the PID of mpirun and a name with which to prefix all of the checkpoint files. For example:
% mpt_checkpoint -p 12345 -f my_checkpoint
This creates a my_checkpoint.cps metadata file and a number of my_checkpoint.*.cpd files.
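In practice, checkpoints are often taken on a schedule rather than once. A minimal sketch of a periodic checkpoint loop, where the PID, prefix, and one-hour interval are illustrative placeholders:

```shell
#!/bin/sh
# Periodically checkpoint a running MPT job until mpirun exits.
# MPIRUN_PID, PREFIX, and INTERVAL are illustrative values, not defaults.
MPIRUN_PID=12345
PREFIX=my_checkpoint
INTERVAL=3600

while kill -0 "$MPIRUN_PID" 2>/dev/null; do
    sleep "$INTERVAL"
    # Tag each checkpoint with a timestamp so earlier files are kept.
    mpt_checkpoint -p "$MPIRUN_PID" -f "$PREFIX.$(date +%Y%m%d%H%M%S)"
done
```

Because checkpoint files are large, older checkpoints should be pruned as new ones succeed.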
To restart the job, pass the name of the .cps file to mpirun, for example:
% mpirun -restart my_checkpoint.cps hostC, hostD -np 8 ./a.out
The job may be restarted on a different set of hosts, but the number of hosts must be the same, and each host must run the same number of ranks as the corresponding host in the original run of the job.