Coarse grain (comp + comm) taskification with dependencies

Pattern addressed: Sequences of independent computations with expensive global communication

It is frequent to find codes where multiple independent operations are executed in sequence. If each operation has communication phases that constitute a high percentage of its total execution time, the overall performance will be low compared with the real potential of the application.

This best practice involves the creation of coarse grain tasks (including communication and computation), using task dependencies to drive the work-flow execution. Parallelism is then exposed at this coarser grain level whenever possible. The starting point is the following pattern:

for (i=0; i < ITERS; i++) {
   compute_1(<params[i]>); // OP1
   communication_A(<params[i]>);
   compute_2(<params[i]>); // OP2
   communication_B(<params[i]>);
}

Here there are dependencies between the different phases within the loop body, but the different iterations are completely independent of each other. The idea is to taskify the different phases of the code, creating a chain of dependencies per iteration, so that the execution of one iteration can overlap with that of the following one.

Taskifying communication requires a higher MPI threading level; that is, MPI must be initialized requesting the MPI_THREAD_MULTIPLE threading support, since communication tasks may issue MPI calls from several threads concurrently:

int provided;

MPI_Init_thread( NULL, NULL, MPI_THREAD_MULTIPLE, &provided );
if (provided < MPI_THREAD_MULTIPLE)
   MPI_Abort( MPI_COMM_WORLD, 1 ); // the required threading level is not available

The initial code could be transformed in the following manner:

for (i=0; i < ITERS; i++) {
#pragma omp task depend(out: <dependencies>)
   {
   compute_1(<params[i]>); // OP1
   communication_A(<params[i]>);
   }
#pragma omp task depend(in: <dependencies>)
   {
   compute_2(<params[i]>); // OP2
   communication_B(<params[i]>);
   }
}
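
The fragments above show only the loop body. For the tasks to actually execute concurrently, the loop has to be enclosed in an OpenMP parallel region where one thread generates the tasks; a minimal sketch of that enclosing structure (not shown in the original fragments, so its exact placement is an assumption) could look as follows:

#pragma omp parallel
#pragma omp single
   {
   for (i=0; i < ITERS; i++) {
      // ... taskified loop body as shown above ...
   }
#pragma omp taskwait // ensure the tasks of the last iterations have completed
   }

The explicit taskwait is optional here, since the implicit barrier at the end of the parallel region also guarantees task completion, but it makes the synchronization point visible.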

In some cases the iterations do not actually use different storage, but a simple code restructuring can be applied: individual buffers are transformed into multi-buffers with a given look-ahead factor (LAF). We call this factor look-ahead because it allows a certain number of iterations to make progress without dependencies among them. These cases are still restricted to independent operations (data generated in one iteration is not needed in the next one), even though the code was implemented reusing the same storage. Consider the following code:

data_t data[SIZE];

for (i=0; i < ITERS; i++) {
   read_input_data(data);

#pragma omp task depend(out: data)
   {
   compute_1(<params[i]>, data); // OP1
   communication_A(<params[i]>, data);
   }
#pragma omp task depend(in: data)
   {
   compute_2(<params[i]>, data); // OP2
   communication_B(<params[i]>, data);
   }
}

It can easily be transformed into:

#define LAF nn
data_t data[LAF][SIZE];

for (i=0; i < ITERS; i++) {
   read_input_data(data[i%LAF]);

#pragma omp task depend(out: data[i%LAF])
   {
   compute_1(<params[i]>, data[i%LAF]); // OP1
   communication_A(<params[i]>, data[i%LAF]);
   }
#pragma omp task depend(in: data[i%LAF])
   {
   compute_2(<params[i]>, data[i%LAF]); // OP2
   communication_B(<params[i]>, data[i%LAF]);
   }
}

This approach allows different iterations to overlap (as long as they work on completely disjoint additional data, e.g., partial results computed between phases). If the code also uses such additional temporary data, we can create as many temporary buffers as there are concurrent iterations (i.e., the look-ahead factor), in the same manner as for the main data. In this way, scalars become vectors, vectors become 2-dimensional arrays, etc. In the end, every variable of this kind needs an extra dimension to break its associated dependencies between iterations, as sketched below.
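
For example, a hypothetical temporary buffer holding the partial results produced by OP1 and consumed by OP2 (called partial here purely for illustration, and passed to the compute routines as an extra argument for the sake of the sketch; it is not part of the original code) would gain a leading LAF dimension, and each iteration would use its own slot i%LAF:

// Before: data_t partial[SIZE];  -- a single temporary reused (and shared) by every iteration
// After: one private slot per in-flight iteration
data_t partial[LAF][SIZE];

   compute_1(<params[i]>, data[i%LAF], partial[i%LAF]); // OP1 writes its partial results
   ...
   compute_2(<params[i]>, data[i%LAF], partial[i%LAF]); // OP2 reads them back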

It is also necessary to replicate the MPI communicators to avoid resource conflicts among iterations. This allows several communications to be in flight at the same time, decoupling the send/receive operations of different iterations.

#define LAF nn
data_t data[LAF][SIZE];

MPI_Comm comm[LAF];
comm[0] = MPI_COMM_WORLD;

for (i=1; i<LAF; i++)
   MPI_Comm_dup(MPI_COMM_WORLD, &comm[i]);

for (i=0; i < ITERS; i++) {
   read_input_data(data[i%LAF]);

#pragma omp task depend(out: data[i%LAF])
   {
   compute_1(<params[i]>, data[i%LAF]); // OP1
   communication_A(<params[i]>, data[i%LAF], comm[i%LAF]);
   }
#pragma omp task depend(in: data[i%LAF])
   {
   compute_2(<params[i]>, data[i%LAF]); // OP2
   communication_B(<params[i]>, data[i%LAF], comm[i%LAF]);
   }
}
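
Putting the pieces together, a minimal end-to-end sketch could look as follows. It is only an illustration under stated assumptions: data_t, read_input_data and the compute/communication routines stand for the application's own code (their argument lists are simplified here, with the iteration index i replacing <params[i]>), LAF, ITERS and SIZE are set to arbitrary example values, and the input read is moved inside the first task so that the reuse of a buffer slot after LAF iterations is ordered by the task dependencies.

#include <mpi.h>

#define LAF   4        // example look-ahead factor
#define ITERS 100      // example iteration count
#define SIZE  1024     // example buffer size

typedef double data_t; // placeholder element type

// Hypothetical prototypes standing in for the application's own routines;
// the iteration index i replaces the <params[i]> of the fragments above.
void read_input_data( data_t *buf );
void compute_1( int i, data_t *buf );                       // OP1
void compute_2( int i, data_t *buf );                       // OP2
void communication_A( int i, data_t *buf, MPI_Comm comm );
void communication_B( int i, data_t *buf, MPI_Comm comm );

data_t   data[LAF][SIZE];
MPI_Comm comm[LAF];

int main( int argc, char *argv[] )
{
   int i, provided;

   MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
   if (provided < MPI_THREAD_MULTIPLE)
      MPI_Abort( MPI_COMM_WORLD, 1 );  // threading level not available

   comm[0] = MPI_COMM_WORLD;
   for (i=1; i<LAF; i++)
      MPI_Comm_dup( MPI_COMM_WORLD, &comm[i] );

#pragma omp parallel
#pragma omp single
   {
   for (i=0; i < ITERS; i++) {
// the input read is placed inside the first task so that reuse of slot i%LAF
// after LAF iterations is ordered by the task dependencies;
// firstprivate(i) gives each task its own copy of the loop index
#pragma omp task depend(out: data[i%LAF]) firstprivate(i)
      {
      read_input_data( data[i%LAF] );
      compute_1( i, data[i%LAF] );                          // OP1
      communication_A( i, data[i%LAF], comm[i%LAF] );
      }
#pragma omp task depend(in: data[i%LAF]) firstprivate(i)
      {
      compute_2( i, data[i%LAF] );                          // OP2
      communication_B( i, data[i%LAF], comm[i%LAF] );
      }
   }
#pragma omp taskwait  // wait for the last in-flight iterations
   }

   for (i=1; i<LAF; i++)
      MPI_Comm_free( &comm[i] );

   MPI_Finalize();
   return 0;
}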

Recommended in program(s): FFTXlib (2.0)

Implemented in program(s): FFTXlib (ompss)