MPI-I/O with pre-computed offsets for each process

Pattern addressed: Writing unstructured data to linear file formats

Many applications deal with unstructured data, which results in an unbalanced data distribution across processes. A simple example is a particle simulation where particles move between processes over time. When the simulation writes the global state at the end - or in between, e.g., for a checkpoint - each process holds a different number of particles. This makes it hard to write the data efficiently to a file in a contiguous way. The same pattern can also be found, e.g., in applications using unstructured meshes with a different number of elements per process.

Writing unstructured data with differing amounts of data per process to a contiguous file is challenging. If the ordering of the data in the file does not matter, one approach to achieve good performance with MPI I/O is to pre-compute a per-process offset into the file so that each process can write its local data starting from that position without interfering with the other processes.

The computation of the offsets is a global operation. The MPI standard provides a dedicated function for this, MPI_Exscan, which computes an exclusive prefix sum over all processes:

int count;           /* local number of data elements */
int count_exsum = 0; /* exclusive prefix sum: total count on all lower-ranked processes */
MPI_Exscan(&count, &count_exsum, 1, MPI_INT, MPI_SUM, comm);
if (rank == 0) count_exsum = 0; /* MPI_Exscan leaves the result on rank 0 undefined */

Note: If the number of elements exceeds INT_MAX (about 2.1 billion) and an MPI-4.0 compliant MPI implementation is used, one may use the MPI_Count type instead of int (together with the MPI_COUNT datatype) in combination with the "_c" versions of the MPI functions.
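A minimal sketch of this large-count variant, assuming an MPI-4.0 library (variable names are illustrative; file_handle and data are set up as in the example below):

MPI_Count count;          /* local number of data elements, may exceed INT_MAX */
MPI_Count count_exsum = 0;
MPI_Exscan_c(&count, &count_exsum, 1, MPI_COUNT, MPI_SUM, comm);
if (rank == 0) count_exsum = 0; /* result on rank 0 is undefined, as above */
/* ... later, write with the large-count I/O variant: */
MPI_File_write_all_c(file_handle, data, count, MPI_DOUBLE, &status);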

For writing the actual data, regular (collective) MPI I/O functions should then be used after setting the per-process offset with a file view:

MPI_File file_handle;
MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &file_handle);
/* Set the view so that this process starts writing at its own offset;
 * the displacement has to be provided in bytes (here for contiguous doubles). */
MPI_Offset offset = (MPI_Offset)count_exsum * sizeof(double);
MPI_File_set_view(file_handle, offset, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
/* Write the local data to the file using collective I/O */
MPI_Status status;
MPI_File_write_all(file_handle, data, count, MPI_DOUBLE, &status);
MPI_File_close(&file_handle);

If it is important to preserve the information about which process owns which data elements, e.g., for an efficient restart of the simulation, one has to save the offsets or the numbers of data elements per process in the file in addition to the data elements. For this purpose the information can be added to a file header, as shown in the sketch after the layout below. Such a file may then look as follows:

# header
NP
N1, N2, N3, ..., N{NP}
# data
E^1_1, E^1_2, ..., E^1_{N1}, E^2_1, ..., E^2_{N2}, ...., E^{NP}_1, ..., E^{NP}_{N{NP}}

with NP being the number of processes writing to the file, N1 - N{NP} the number of data elements per process, and E^i_k the k-th data element of process i.
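The following sketch shows one way to produce such a file, here assuming a binary header of int values and double data elements (variable names such as filename and data are illustrative; count and count_exsum are the values computed above): rank 0 gathers the per-process counts and writes the header, then all processes write their data behind it.

int nprocs, rank;
MPI_Comm_size(comm, &nprocs);
MPI_Comm_rank(comm, &rank);

/* Rank 0 collects the number of elements of every process for the header */
int *counts = NULL;
if (rank == 0) counts = malloc(nprocs * sizeof(int));
MPI_Gather(&count, 1, MPI_INT, counts, 1, MPI_INT, 0, comm);

MPI_File fh;
MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
if (rank == 0) { /* header: NP followed by N1 ... N{NP} */
    MPI_File_write_at(fh, 0, &nprocs, 1, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_write_at(fh, sizeof(int), counts, nprocs, MPI_INT, MPI_STATUS_IGNORE);
}
/* Shift each process's offset by the header size */
MPI_Offset header_bytes = (MPI_Offset)(1 + nprocs) * sizeof(int);
MPI_Offset offset = header_bytes + (MPI_Offset)count_exsum * sizeof(double);
MPI_File_set_view(fh, offset, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
MPI_File_write_all(fh, data, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
free(counts);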

Implemented in program(s): Writing unstructured data to a contiguous file using MPI I/O with offsets