# Co-design at POP CoE project

## Parallelize packing and unpacking regions

Sequential communications in hybrid programming: A frequent practice in hybrid programming is to parallelize with OpenMP only the main computational regions. The communication phases are left as in the original MPI program and thus execute in order in the main thread while the other threads idle. This can limit the scalability of hybrid programs and often results in the hybrid code being slower than an equivalent pure MPI code using the same total number of cores.

We consider it good practice to taskify these operations, allowing them to execute concurrently and well before the actual MPI call in the case of packs, or well after it in the case of unpacks. These operations are typically not very large and tend to be memory-bandwidth bound. For that reason we think it is good practice not to parallelize each of them with a fork-join parallel loop, as the granularity would be very fine and the overhead would have a significant impact. The size typically varies a lot between packs/unpacks to/from different processes. Using a sufficiently large grain size may allow some of these operations to be parallelized when the message is large, while executing as a single task for small buffers (see the dependence flow between the pack and MPI_Isend() operations in the following code).

```c
#pragma omp parallel
{
   #pragma omp for
   for (int i = 0; i < SIZE; i++) compute(data);

   #pragma omp single
   {
      for (int n = 0; n < n_neighs; n++) {
         MPI_Irecv(&r_buff[n], n, ...);

         #pragma omp task depend(out: s_buff[n])
         pack(s_buff[n], data, n);      // packing sending buffer for neighbor 'n'

         #pragma omp task depend(in: s_buff[n])
         MPI_Isend(&s_buff[n], n, ...); // issued only once its pack task has finished
      }
      #pragma omp taskwait
      MPI_Waitall(...);

      for (int n = 0; n < n_neighs; n++) {
         #pragma omp task
         unpack(r_buff[n], data, n);    // unpacking receiving buffer for neighbor 'n'
      }
   } // end of single (implicit barrier: all unpack tasks complete here)
} // end of parallel
```


This option may have some overhead if the data is small or the call is an Isend. In that case it might be better to refactor the code into a first loop with all the taskified packs, followed by a second loop with all the Isends/sends (i.e., loop fission). Before entering the second loop we can wait for all previously created tasks:

```c
for (int n = 0; n < n_neighs; n++) {
   MPI_Irecv(&r_buff[n], n, ...);

   #pragma omp task depend(out: s_buff[n]) // dep. info. will not be used by MPI_Isend ops
   pack(s_buff[n], data, n);               // packing sending buffer for neighbor 'n'
}

#pragma omp taskwait // all buffers are packed beyond this point

for (int n = 0; n < n_neighs; n++) {
   MPI_Isend(&s_buff[n], n, ...);
}
```


Alternatively, we can synchronize each individual Isend with the task that packs its buffer by placing a taskwait with a depend clause (available since OpenMP 5.0) just before it. In this case a taskwait has to be used before each Isend/send:

```c
for (int n = 0; n < n_neighs; n++) {
   MPI_Irecv(&r_buff[n], n, ...);

   #pragma omp task depend(out: s_buff[n])
   pack(s_buff[n], data, n); // packing sending buffer for neighbor 'n'

   #pragma omp taskwait depend(in: s_buff[n]) // wait only for this neighbor's pack task
   MPI_Isend(&s_buff[n], n, ...);
}
```