It is common to find codes in which multiple independent operations are executed in sequence. If each operation spends a high percentage of its total execution time in communication phases, the overall performance will be low compared with the real potential of the application.
This best practice involves the creation of coarse-grain tasks (including both communication and computation), using task dependencies to drive the work-flow execution. Parallelism is then exposed at this coarser grain level whenever possible. The starting point is the following pattern:
for (i=0; i < ITERS; i++) {
compute_1(<params[i]>); // OP1
communication_A(<params[i]>);
compute_2(<params[i]>); // OP2
communication_B(<params[i]>);
}
This pattern shows dependencies between the different phases within the loop body, while the iterations themselves are completely independent of each other. The idea is to taskify the different phases of the code, creating a chain of dependencies per iteration that can overlap with the execution of the following one.
Taskifying communication requires a higher level of MPI threading support, since communication calls may now be issued concurrently from different threads. That is, MPI must be initialized requesting the MPI_THREAD_MULTIPLE threading level:
int provided;
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
// the program should check that the provided level is MPI_THREAD_MULTIPLE
The initial code could be transformed in the following manner:
for (i=0; i < ITERS; i++) {
#pragma omp task depend(out: <dependencies>)
{
compute_1(<params[i]>); // OP1
communication_A(<params[i]>);
}
#pragma omp task depend(in: <dependencies>)
{
compute_2(<params[i]>); // OP2
communication_B(<params[i]>);
}
}
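Note that, to actually execute these tasks in parallel, the loop must be enclosed in an OpenMP parallel region in which a single thread generates the tasks while the whole team executes them; depending on where the loop index is declared, it may also need to be captured as firstprivate inside the tasks. A minimal sketch of the enclosing construct:
#pragma omp parallel
#pragma omp single
{
for (i=0; i < ITERS; i++) {
// ... taskified loop body as above ...
}
}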
In some cases the iterations are logically independent but do not actually use different storage. A simple code restructuring can then be applied, transforming individual buffers into multi-buffers with a given look-ahead factor (LAF). We call this factor look-ahead because it allows a certain degree of progress without dependencies. These cases are still restricted to independent operations (data generated in one iteration is not needed in the next one), even though the code was implemented reusing the same storage. So, the following code:
data_t data[SIZE];
for (i=0; i < ITERS; i++) {
read_input_data(data);
#pragma omp task depend(out: data)
{
compute_1(<params[i]>, data); // OP1
communication_A(<params[i]>, data);
}
#pragma omp task depend(in: data)
{
compute_2(<params[i]>, data); // OP2
communication_B(<params[i]>, data);
}
}
Could be easily transformed into:
#define LAF nn
data_t data[LAF][SIZE];
for (i=0; i < ITERS; i++) {
read_input_data(data[i%LAF]); // read into the multi-buffer slot used by this iteration
#pragma omp task depend(out: data[i%LAF])
{
compute_1(<params[i]>, data[i%LAF]); // OP1
communication_A(<params[i]>, data[i%LAF]);
}
#pragma omp task depend(in: data[i%LAF])
{
compute_2(<params[i]>, data[i%LAF]); // OP2
communication_B(<params[i]>, data[i%LAF]);
}
}
This approach allows different iterations to overlap (as long as they work with completely disjoint additional data, e.g., partial results computed between phases). If the code also works with such additional temporary data, we can still create as many temporary buffers as the number of concurrent iterations (i.e., the look-ahead factor), in the same manner as we have done for the main data. In this way, scalars become vectors, vectors become 2-dimensional arrays, etc. In the end, every such variable needs an extra dimension to break its associated dependencies between iterations.
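As a sketch (assuming a hypothetical temporary partial produced by OP1 and consumed by OP2; the read of the input data is omitted), the extra dimension would be introduced as follows:
data_t data[LAF][SIZE];
double partial[LAF];   // was: double partial;
for (i=0; i < ITERS; i++) {
#pragma omp task depend(out: data[i%LAF])
{
compute_1(<params[i]>, data[i%LAF], &partial[i%LAF]); // OP1 produces the partial result
communication_A(<params[i]>, data[i%LAF]);
}
#pragma omp task depend(in: data[i%LAF])
{
compute_2(<params[i]>, data[i%LAF], partial[i%LAF]);  // OP2 consumes it
communication_B(<params[i]>, data[i%LAF]);
}
}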
It will also be necessary to replicate the MPI communicators to avoid resource dependencies among iterations. This allows several communications to be in flight at the same time, decoupling send/receive operations belonging to different iterations.
#define LAF nn
data_t data[LAF][SIZE];
MPI_Comm comm[LAF];
comm[0] = MPI_COMM_WORLD;
for (i=1; i<LAF; i++)
MPI_Comm_dup(MPI_COMM_WORLD, &comm[i]);
for (i=0; i < ITERS; i++) {
read_input_data(data[i%LAF]); // read into the multi-buffer slot used by this iteration
#pragma omp task depend(out: data[i%LAF])
{
compute_1(<params[i]>, data[i%LAF]); // OP1
communication_A(<params[i]>, data[i%LAF], comm[i%LAF]);
}
#pragma omp task depend(in: data[i%LAF])
{
compute_2(<params[i]>, data[i%LAF]); // OP2
communication_B(<params[i]>, data[i%LAF], comm[i%LAF]);
}
}
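For reference, the following is a minimal self-contained sketch of the complete pattern (compiled, e.g., with mpicc -fopenmp). The kernels are illustrative assumptions rather than the actual application code: a local update plus an MPI_Allreduce stand in for compute_1/communication_A, and a local reduction plus a printf stand in for compute_2/communication_B; the input is generated inside the first task instead of calling read_input_data.
#include <stdio.h>
#include <mpi.h>

#define ITERS 16
#define SIZE  1024
#define LAF   4

static double data[LAF][SIZE];

int main(int argc, char *argv[])
{
    int provided, i;
    MPI_Comm comm[LAF];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* one communicator per concurrently in-flight iteration */
    comm[0] = MPI_COMM_WORLD;
    for (i = 1; i < LAF; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comm[i]);

    #pragma omp parallel
    #pragma omp single
    for (i = 0; i < ITERS; i++) {
        #pragma omp task depend(out: data[i%LAF]) firstprivate(i)
        {
            int j;
            for (j = 0; j < SIZE; j++)                 /* OP1 (toy kernel) */
                data[i%LAF][j] = (double)(i + j);
            MPI_Allreduce(MPI_IN_PLACE, data[i%LAF], SIZE, MPI_DOUBLE,
                          MPI_SUM, comm[i%LAF]);       /* communication_A */
        }
        #pragma omp task depend(in: data[i%LAF]) firstprivate(i)
        {
            int j;
            double sum = 0.0;
            for (j = 0; j < SIZE; j++)                 /* OP2 (toy kernel) */
                sum += data[i%LAF][j];
            printf("iteration %d: sum = %g\n", i, sum); /* stands in for communication_B */
        }
    }
    /* the implicit barrier at the end of the parallel region
       guarantees that all tasks have completed */

    for (i = 1; i < LAF; i++)
        MPI_Comm_free(&comm[i]);
    MPI_Finalize();
    return 0;
}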
Recommended in program(s): FFTXlib (2.0)
Implemented in program(s): FFTXlib (ompss)