The CHECKPOINT.h5 file

What is this file?

  • It is an hdf5 file containing the essential data given as input to and generated by DIRAC.

  • All data stored in this file is defined and documented in the DIRAC data schema, the source of which is found in utils/DIRACschema.txt.

What can I do with it?

  • One purpose is to make restarting and data curation after a run easier.

  • Another purpose is to facilitate communication with other programs.

  • With the hdf5 format and the h5py library it is trivial to import the data into Python and process or view it further; see the example below.
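
    A minimal sketch using h5py (it assumes a CHECKPOINT.h5 produced by an SCF run, so that the energy label used below is actually present):

      import h5py

      # Open the checkpoint file written by DIRAC (read-only).
      with h5py.File('CHECKPOINT.h5', 'r') as f:
          # Show the top-level groups defined by the schema.
          print(list(f.keys()))
          # Read a single dataset by its label, here the total SCF energy
          # (a single real number).
          energy = f['/result/wavefunctions/scf/energy'][()]
          print('Total SCF energy:', energy)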

Can I also extend the schema if I have data that is not listed here?

  • The first question is whether this data is indeed essential. The CHECKPOINT.h5 file is not intended for large data sets, as it gets saved automatically after a run. It is also not intended for highly specialized or intermediate data. If you want to use hdf5 for such data, consider creating a separate hdf5 file using the interface provided in the labeled_storage module. This is even easier, as you need not define everything as thoroughly as in the schema (see below).

  • In case a new data type is indeed a generally useful addition, please start by documenting it (type and description) and ask for a peer review by one of the developers before proceeding to the next step.

How is the schema processed?

  • The source text in utils/DIRACschema.txt is processed at run time by the Python functions read_schema and write_schema, which are found in utils/process_schema.py and are called by the DIRAC run script pam. This produces a new text file called schema_labels.txt, which is placed in the work directory. This is the file used by the actual DIRAC code; it contains the set of labels that are also found on CHECKPOINT.h5. To familiarize yourself with this, copy schema_labels.txt from the work directory and compare it to DIRACschema.txt.

  • Note that the hierarchical structure is defined by slashes (/), much like the paths in a Unix directory tree. This also means that one cannot use a / inside an individual data label, as hdf5 would interpret it as a group separator; see the illustration below.
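
    For example, the label /result/wavefunctions/scf/energy (used in the Fortran example below) maps onto nested hdf5 groups holding a single dataset:

      /result                           (group)
      /result/wavefunctions             (group)
      /result/wavefunctions/scf         (group)
      /result/wavefunctions/scf/energy  (dataset)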

  • In the Fortran code the generated labels are used directly; an example is found in gp/dircmo.F90: call checkpoint_write ('/result/wavefunctions/scf/energy',rdata=toterg) which writes the total energy (a single real number) with the appropriate label.

  • Note that data is classified as optional or required in the schema. This classification is used to decide whether a restart is possible: for this purpose all required data should be present on the checkpoint file (see the sketch below).
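
    A minimal sketch of such a check in Python (the list of required labels used here is purely illustrative; the actual set is generated from the required entries in the schema):

      import h5py

      # Purely illustrative list; the real set of required labels is derived
      # from DIRACschema.txt.
      required_labels = ['/result/wavefunctions/scf/energy']

      with h5py.File('CHECKPOINT.h5', 'r') as f:
          missing = [label for label in required_labels if label not in f]

      if missing:
          print('Restart not possible, required data is missing:', missing)
      else:
          print('All required data is present, restart is possible.')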

How can the schema be extended?

  • When extending the schema, do NOT edit the schema_labels.txt file; all edits should be made in DIRACschema.txt.

  • Check first whether the data should be optional or required. Be careful when defining new data as required: a restart is only considered possible when all required data is present, so adding required data may hamper restarting from old checkpoint files.

  • If the data is of a simple standard type (real, integer or string) and fits in an existing subsection, you can simply define it at the appropriate place and the scripts will automatically generate the label. After inspecting the generated label you can use it in calls in the Fortran code.

  • If the data is of a composite type, you need to define its elements as well. This is done by creating a new subsection in the file; an example is the molecule data type, which is part of input and is defined in a separate section. Each section starts with a * and ends with *end.

  • You may also nest sections; see for instance the data type wavefunctions, which has the composite type scf as an element. A hypothetical sketch of such a nested definition is given below.
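
    To connect this with the label generation: an entry energy defined inside the section scf, which is itself an element of wavefunctions (in turn part of result), is what gives rise to the label /result/wavefunctions/scf/energy used in the Fortran example above. Purely as a hypothetical sketch of such nesting (the layout, type and description below are invented for illustration; check the existing sections in utils/DIRACschema.txt for the exact syntax before adding anything):

      *wavefunctions
        *scf
          energy    real    total SCF energy
        *end
      *end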

What happens at run time and on the Fortran side?

  • At the start of the run DIRAC checks whether CHECKPOINT.h5 is present and contains all required data. If so, a restart will be attempted. Note that you can use the copy facilities of pam to place the file in the work directory.

  • During the run the only calls needed on the Fortran side are checkpoint_read, checkpoint_write and possibly checkpoint_query. These subroutines are found in the module checkpoint and support reals, integers and strings. The query routine can be used to determine whether a data set already exists and what its size is. The interface is intentionally kept simple and easily maintainable, so more complicated types should be split up into these standard types. There are two more public routines in this module (for opening and closing a checkpoint file), but these are already called at the appropriate places in DIRAC and should not be called elsewhere.

  • Note that we have a layered structure: the labeled_storage module is responsible for the actual I/O, while the checkpoint module is merely an additional layer that checks the schema. If you feel inclined to change the checkpoint module, make sure not to call hdf5 routines directly, so as not to break the fallback option (see below) that is provided for systems or users without an hdf5 installation.

  • When checkpoint_read is called, the data is located on the file and returned to the caller. Some error handling is provided, such as checking whether the array given to hold the data is large enough, but crashes are still possible, for instance when you try to read a file that is not in hdf5 format or is otherwise corrupted.

  • When checkpoint_write is called, the data is stored on the file after checking that the label is indeed known in the schema. Data with an undefined label will not be written and a warning is issued. This guarantees that all data placed on the checkpoint file is properly documented.

  • At the end of the run pam checks whether CHECKPOINT.h5 is present and contains all required data. If so, it is copied to the directory from which pam was called and renamed following the same convention as used for the output file, but with the extension .h5 instead of .out.

What happens if hdf5 is not installed?

  • At configure time cmake will detect whether DIRAC can be linked with the required hdf5 routines (via the interface contained in the module mh5 that you find in src/gp/hdf5_util). If this is not the case, a simple Unix mimic of hdf5 is activated: a directory CHECKPOINT.noh5 is created in which the hierarchical data structure is implemented with subdirectories containing Fortran unformatted files. A hidden file .initialized is used to record that a data group has been defined; the data itself is stored in a two-record file, the first record containing the data type, data size and data length (the latter differing from 1 only for string data). An illustrative sketch of this layout is given below.

  • All read, write and query functionality works in this fallback mode, but moving data to and from the work directory becomes more complicated as whole directories have to be moved. This is currently implemented by tarring and gzipping the CHECKPOINT.noh5 directory. In case hdf5 can not be linked but h5py is found, the fallback format is converted into proper hdf5 format by the Python routines nohdf5_load_data and write_hdf5, which are called by the pam script at the end of the run.
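
    As an illustrative sketch only (the exact placement of the marker files may differ; the dataset shown is the SCF energy label used earlier), such a fallback tree might look like:

      CHECKPOINT.noh5/
        result/
          wavefunctions/
            scf/
              .initialized    hidden marker: this data group has been defined
              energy          two-record Fortran unformatted file: a header record
                              with data type, size and length, followed by the data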

  • Note that use of the fallback option is strongly discouraged as it is less efficient and offers only limited functionality.