The CHECKPOINT.h5 file
What is this file?
It is an hdf5 file containing the essential data given as input to and generated by DIRAC.
All data stored in this file is defined and documented in the DIRAC data schema, the source of which is found in utils/DIRACschema.txt.
What can I do with it?
One purpose is to make restarting and data curation after a run easier.
Another purpose is to facilitate communication with other programs.
With the hdf5 format and h5py it is trivial to import the data into Python for further processing and viewing.
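For example, the total energy written by the checkpoint_write call shown further below can be read back in a few lines. A minimal sketch with h5py, assuming CHECKPOINT.h5 is in the current directory:

  import h5py

  # open the checkpoint file read-only and fetch a data set by its label
  with h5py.File('CHECKPOINT.h5', 'r') as f:
      energy = f['/result/wavefunctions/scf/energy'][()]
      print('SCF energy:', energy)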
Can I also extend the schema if I have data that is not listed here?
The first question is whether this data is indeed essential. The CHECKPOINT.h5 file is not intended for large data sets, as it is saved automatically after a run. It is also not intended for highly specialized or intermediate data. If you want to use hdf5 for such data, consider making a special hdf5 file using the interface provided in the labeled_storage module. This is even easier, as you need not define everything as thoroughly as with the schema (see below).
In case a new data type is indeed a generally useful addition, please start by documenting it (type and description) and ask for a peer review by one of the developers before proceeding to the next step.
How the schema is processed.
The source text in utils/DIRACschema.txt is processed at run time by the Python functions read_schema and write_schema, which are found in utils/process_schema.py and are called by the DIRAC run script pam. This produces a new text file called schema_labels.txt, which is placed in the work directory. This file is used by the actual DIRAC code and contains the set of labels also found on CHECKPOINT.h5. To familiarize yourself with this, copy schema_labels.txt from the work directory and compare it to DIRACschema.txt. Note that the hierarchical structure is defined by slashes (/), much as in a Unix directory path. This also means that slashes cannot be used in data labels, as hdf5 would interpret them as group separators.
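To see which labels are actually present on a given checkpoint file, you can let h5py walk the hierarchy. A small sketch:

  import h5py

  # print the full label of every data set found on the checkpoint file;
  # these should all appear in schema_labels.txt
  with h5py.File('CHECKPOINT.h5', 'r') as f:
      def show(name, obj):
          if isinstance(obj, h5py.Dataset):
              print('/' + name)
      f.visititems(show)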
In the Fortran code the generated labels are used directly; an example is found in gp/dircmo.F90:
call checkpoint_write ('/result/wavefunctions/scf/energy',rdata=toterg)
which writes the total energy (a single real number) with the appropriate label. Note that data is classified as optional or required in the schema. This classification is used to define whether restart is possible: for that purpose, all required data should be present on the checkpoint file.
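The same check can be made from Python, for instance when curating older checkpoint files. A minimal sketch with h5py; the label list here is a hypothetical example, the actual required labels follow from the schema:

  import h5py

  # hypothetical example; the real list follows from the entries
  # marked as required in DIRACschema.txt
  required = ['/result/wavefunctions/scf/energy']

  with h5py.File('CHECKPOINT.h5', 'r') as f:
      missing = [label for label in required if label not in f]
      restartable = not missing  # restart only possible if nothing is missing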
How can the schema be extended?
For extending the schema: do NOT edit the schema_labels.txt file. All edits should be made in DIRACschema.txt. Check first whether the data should be optional or required. Be careful when defining new data as required, as restart files will be considered invalid if this data is missing, which may hamper restarting from old checkpoint files.
If the data consists of a simple standard type (real, integer or string) and fits in an existing subsection, you can simply define it at the appropriate place and the scripts will automatically generate the label. After inspecting the generated label you can use it in calls in the Fortran code.
If the data is of composite type, you need to define its elements separately. This is done by creating a new subsection in the file; an example is the molecule data type, which is part of input and is defined in a separate section. Each section starts with a * and ends with *end. You may also nest sections: see for instance the data type wavefunctions, which has the composite type scf as an element. A sketch of such a section is given below.
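To give an idea, here is a purely illustrative excerpt. Only the section markers (* and *end) and the required/optional classification are taken from this documentation; the exact layout of the entry lines (here a name, a type, the classification and a description) is an assumption and should be checked against the existing DIRACschema.txt:

  *scf
  energy   real   required   total SCF energy
  *end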
What happens at run time and on the Fortran side?
At the start of the run DIRAC checks whether CHECKPOINT.h5 is present and contains all required data. If so, a restart will be attempted. Note that you can use the copy facilities of pam to place the file in the work directory.
During the run the only calls needed on the Fortran side are checkpoint_read, checkpoint_write and possibly checkpoint_query. These subroutines are found in the module checkpoint and support reading and writing of reals, integers and strings. The query routine can be used to determine whether a data set already exists and what its size is. The interface is intentionally kept simple and easily maintainable, so more complicated types should be split up into these standard types. Note that we have a layered structure: the labeled_storage module is responsible for the actual I/O, while the checkpoint module is merely an additional layer that checks the schema. If you feel inclined to change the checkpoint module, make sure not to call hdf5 routines directly, so as not to break the fallback option (see below) that is provided in case we hit a system or user without an hdf5 installation. There are two more public routines in this module (for opening and closing a checkpoint file), but these are already called at the appropriate places in DIRAC and should not be called elsewhere.

If the checkpoint_read routine is called, the data is located on the file and given back to the caller. Some error handling is provided, such as checking whether the array given to hold the data is large enough, but crashes are still possible, for instance when you try to read a file that is not in hdf5 format or is otherwise corrupted.
If the checkpoint_write routine is called, the data is stored on the file after checking that the label is indeed known in the schema. Undefined data will not be written and a warning is issued instead. This guarantees that all data placed on the checkpoint file is properly documented.
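On the Python side, the equivalent of such a query needs only a few lines of h5py. A small sketch:

  import h5py

  with h5py.File('CHECKPOINT.h5', 'r') as f:
      label = '/result/wavefunctions/scf/energy'
      if label in f:
          # size is the number of elements in the data set
          print(label, 'exists with size', f[label].size)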
At the end of the run pam checks whether CHECKPOINT.h5 is present and contains all required data. If so, the file is copied to the directory from which pam was called and renamed following the same convention as used for the output file, but with the file extension .h5 instead of .out.
What happens if hdf5 is not installed?
At configure time cmake will detect whether DIRAC can be linked with the required hdf5 routines (via the interface contained in the module mh5 that you find in src/gp/hdf5_util). If this is not the case, a simple Unix mimic of hdf5 is activated by creating a directory CHECKPOINT.noh5 in which the hierarchical data structure is implemented with subdirectories containing Fortran unformatted files. A hidden file .initialized is used to determine whether a data group has been defined; the data itself is stored in a two-record file, with the first record containing the data type, data size and data length (the latter differing from 1 only in the case of string data). All read, write and query functionalities work, but moving data to and from the work directory is now more complicated, as we need to move directories. This is currently implemented by tarring and gzipping the CHECKPOINT.noh5 directory. In case hdf5 can not be linked but h5py is found, the fallback format is converted into proper hdf5 format by the Python routines nohdf5_load_data and write_hdf5, which are called by the pam script at the end of the run. Note that use of the fallback option is strongly discouraged, as it is less efficient and offers only limited functionality.