Implementation of the H5MD file format

While the H5MD specification is agnostic towards the software used to write and read H5MD files, this document provides implementation hints for common HDF5 libraries.

Language-independent

Dataset index lookup at a given step or time

The step and time datasets are in monotonically increasing order, which permits the use of a binary search to efficiently determine the index of an element of the value dataset corresponding to a given step or time.

Compact datasets

For storage efficiency, small-sized time-independent data should be stored using the compact dataset layout (see §5.4.5). The raw data of a compact dataset, which may not exceed 64 kb, is stored in the header block of the dataset object.

The enum datatype for the species data

The species data group in the particles group of a H5MD file may use the enum datatype (see §6.5.3) that is compatible with the integer datatype. This allows to map numerical species identifiers to a string while keeping the performance of integer data.

An enumerated-type dataset can be read using, e.g., H5T_NATIVE_INT as the memory datatype. For writing, however, the memory datatype needs to be an enumerated type. A program extending an existing enumerated-type dataset needs to be aware of this.

Using udunits2 for units interpretation

The udunits2 software from Unidata can be of help to interpret the units system SI from the units module. While all units in SI are present as a subset of the default udunits2 database, a database containing only the units from SI is available as a xml file.

C

This section describes the usage of the HDF5 C API.

#include <hdf5.h>

File format

The H5MD specification recommends HDF5 file format version 2. This format is supported by HDF5 library version 1.8 or later, but needs to be enabled explicitly when creating or opening an HDF5 file for writing.

This is achieved using the function H5Pset_libver_bounds.

/* Create HDF5 file with HDF5 file format version 2. */
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_libver_bounds(fapl_id, H5F_LIBVER_18, H5F_LIBVER_18);
hid_t file_id = H5Fcreate("name.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

Compact datasets

Compact datasets are created by setting the layout property of the dataset creation property list.

hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_layout(dcpl_id, H5D_COMPACT);
hid_t dset_id = H5Dcreate(loc_id, "name", type_id, space_id, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);

Object time tracking

For HDF5 file format version 2 or later, the HDF5 library automatically tracks access, modification, change, and birth time of an object, which applies to the file, and its groups and datasets.

The function H5Oget_info may be used to query object times in seconds since the epoch.

H5O_info_t info;
H5Oget_info(file_id, &info);
printf("atime %ld\n", info.atime);
printf("mtime %ld\n", info.mtime);
printf("ctime %ld\n", info.ctime);
printf("btime %ld\n", info.btime);

Python

This section describes the usage of HDF5 for Python.

import h5py

File format

The class h5py.File takes a “libver” argument to set the file format.

f = h5py.File("name.h5", libver="18")

Note that h5py up to version 2.1.3 lacks a binding for the H5F_LIBVER_18 constant.

Dataset index lookup at a given step

The function numpy.searchsorted may be used to perform a binary search:

""" Returns value of H5MD element at given step. """
def read_step(group, step):
  index = np.searchsorted(group["step"], step)
  assert(group["step"][index] == step)
  return group["value"][index]

Object time tracking

The low-level function h5py.h5o.get_info retrieves object times.

info = h5py.h5o.get_info(f.id)
print(info.atime)
print(info.mtime)
print(info.ctime)
print(info.btime)

Note that h5py up to version 2.1.3 lacks bindings for the above mentioned H5O_info_t members.