Co-design at POP CoE project

Parallelize packing and unpacking regions

Sequential communications in hybrid programming: A frequent practice in hybrid programming is to parallelize with OpenMP only the main computational regions. The communication phases are left as in the original MPI program and thus execute in order on the main thread while the other threads idle. This may limit the scalability of hybrid programs and often results in the hybrid code being slower than an equivalent pure MPI code using the same total number of cores.

We consider it good practice to taskify these operations, allowing them to execute concurrently and well before the actual MPI call in the case of packs, or well after it in the case of unpacks. These operations are typically not very large and are strongly memory-bandwidth bound. For that reason we think it is not a good practice to parallelize each of them with a fork-join parallel worksharing loop: the granularity would be very fine and the overhead would have a significant impact. The size also typically varies a lot between the packs/unpacks to/from different processes. Using a sufficiently large grain size allows a pack to be split into several tasks when the message is large, while it executes as a single task for small buffers (see the dependence flow between the pack and MPI_Isend() operations in the following code).

#pragma omp parallel
{
#pragma omp for
for (int i=0; i<SIZE; i++) compute(data);

#pragma omp single
{
   for (int n=0; n<n_neighs; n++) {
      #pragma omp task 
      MPI_Irecv(&r_buff[n], n, ...);

      #pragma omp task depend(out: s_buff[n])
      pack( s_buff[n], data, n ); // packing sending buffer for neighbor 'n'
      #pragma omp task depend(in: s_buff[n])
      MPI_Isend(&s_buff[n], n, ...);
   }
   #pragma omp taskwait // ensure all Isend/Irecv tasks have issued their requests
   MPI_Waitall(...);

   for (int n=0; n<n_neighs; n++) {
      #pragma omp task
      unpack (r_buff[n], data, n); // unpacking receiving buffer for neighbor 'n'
   }
} // end of single
} // end of parallel

This option may still have some overhead when the data is small, as a task is created just to issue the Isend. In that case it might be better to refactor the code so that a first loop creates all the taskified packs and a second loop issues all the isends/sends (i.e., loop fission). Before entering the second loop we wait for all previously created tasks:

   for (int n=0; n<n_neighs; n++) {
      #pragma omp task 
      MPI_Irecv(&r_buff[n], n, ...);

      #pragma omp task depend(out: s_buff[n]) // dependence not consumed here: a plain taskwait precedes the sends
      pack( s_buff[n], data, n ); // packing sending buffer for neighbor 'n'
   }

#pragma omp taskwait

   for (int n=0; n<n_neighs; n++) {
      MPI_Isend(&s_buff[n], n, ...);
   }

Alternatively, we can synchronize each individual Isend operation with a taskwait with dependences placed just before it (a feature available since OpenMP 5.0). In this case a taskwait has to be issued before each isend/send:

   for (int n=0; n<n_neighs; n++) {
      #pragma omp task 
      MPI_Irecv(&r_buff[n], n, ...);

      #pragma omp task depend(out: s_buff[n])
      pack( s_buff[n], data, n ); // packing sending buffer for neighbor 'n'
   }

   for (int n=0; n<n_neighs; n++) {
      #pragma omp taskwait depend(in: s_buff[n])
      MPI_Isend(&s_buff[n], n, ...);
   }

The pattern on the receive side is even simpler: typically a loop waits for the message from each neighbour and instantiates the corresponding unpack task, to be executed by another thread while the main thread proceeds to the next receive operation. This structure allows a fast drain of the incoming link and thus can reduce network contention and stalls on the send side.
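A minimal sketch of this receive-side pattern, in the same elided style as the snippets above. The reqs array (assumed to have been filled by the earlier MPI_Irecv calls) is an illustrative assumption:

   for (int m = 0; m < n_neighs; m++) {
      int idx;
      // wait for whichever neighbour's message arrives first
      MPI_Waitany(n_neighs, reqs, &idx, MPI_STATUS_IGNORE);

      #pragma omp task firstprivate(idx)
      unpack(r_buff[idx], data, idx); // executed by another thread
      // meanwhile this thread loops back to drain the next incoming message
   }
   #pragma omp taskwait // all unpacks complete before 'data' is used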