The rebalanced version of the Communication Imbalance kernel redistributes the work computed per rank so that process 0 drastically reduces its share of the computation, compensating for the execution imbalance generated in the communication phase.
In this case, the domain decomposition consists of:
n_elems = GLOBAL_SIZE / we_are;  // Perfectly balanced
if (myself == 0) n_elems = n_elems / 1000;  // (drastically) reduced load in the process with more connectivity
else n_elems += n_elems * (1.0 - (double)1/1000) / (we_are - 1);  // redistribute the removed load so the global load is preserved
Connectivity[RANKS][RANKS] = { ... };  // Asymmetric
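The arithmetic above preserves the total work: the load removed from rank 0 is spread evenly over the remaining ranks. The following standalone sketch illustrates that decomposition; the values of GLOBAL_SIZE, REDUCTION_FACTOR and the rank count are illustrative assumptions, not the kernel's actual configuration, and the program only checks that the global load stays (approximately) constant.

/* Sketch of the rebalanced decomposition (GLOBAL_SIZE, REDUCTION_FACTOR and
 * we_are are assumed values for illustration only). */
#include <stdio.h>

#define GLOBAL_SIZE       1000000   /* assumed total number of elements */
#define REDUCTION_FACTOR  1000      /* rank 0 keeps 1/1000 of its share */

int main(void)
{
    int we_are = 8;                 /* assumed number of MPI ranks */
    long total = 0;

    for (int myself = 0; myself < we_are; myself++) {
        double n_elems = (double)GLOBAL_SIZE / we_are;       /* balanced split */
        if (myself == 0)
            n_elems = n_elems / REDUCTION_FACTOR;            /* shrink rank 0  */
        else
            n_elems += n_elems * (1.0 - 1.0 / REDUCTION_FACTOR) / (we_are - 1);
        printf("rank %d -> %ld elements\n", myself, (long)n_elems);
        total += (long)n_elems;
    }
    printf("global load: %ld (original: %d)\n", total, GLOBAL_SIZE);
    return 0;
}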
The algorithm still exhibits the same amount of communication imbalance, because the asymmetric connectivity matrix is unchanged. After these changes in the setup phase, the program executes the main loop:
for (int step = 0; step < N_STEPS; step++) {
    // Different number of neighbors per rank (unbalanced)
    for (int i = 0; i < n_neighs; i++)
        MPI_Irecv(&r_data[i][0], ..., neighbors[i], ..., &requests[i]);
    for (int i = 0; i < n_neighs; i++)
        MPI_Send(&s_data[i][0], ..., neighbors[i], ...);
    compute_local(n_elems, ...);  // deliberately imbalanced computation to compensate for the communication imbalance
    MPI_Waitall(n_neighs, requests, statuses);  // wait for all the receives to complete
    work();  // Some additional code before the collective; may process data that just arrived
    MPI_Allreduce(...);
}
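For reference, the sketch below puts the setup and the main loop together in a compilable MPI program. All concrete values (message size, tag, number of steps, the neighbor lists and the bodies of compute_local() and work()) are illustrative assumptions, not the kernel's actual parameters; only the structure of the loop follows the code above.

/* Compilable sketch of the rebalanced kernel under assumed parameters. */
#include <mpi.h>
#include <stdlib.h>

#define N_STEPS  10      /* assumed number of iterations   */
#define MSG_SIZE 1024    /* assumed message size (doubles) */
#define TAG      0       /* assumed message tag            */

static void compute_local(int n_elems)   /* stand-in for the real computation */
{
    volatile double x = 0.0;
    for (int i = 0; i < n_elems; i++) x += (double)i * 0.5;
}

static void work(void) { /* placeholder for extra work before the collective */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int myself, we_are;
    MPI_Comm_rank(MPI_COMM_WORLD, &myself);
    MPI_Comm_size(MPI_COMM_WORLD, &we_are);

    /* Illustrative asymmetric connectivity: rank 0 talks to every other
     * rank, every other rank talks only to rank 0. */
    int n_neighs = (myself == 0) ? we_are - 1 : 1;
    int *neighbors = malloc(n_neighs * sizeof(int));
    if (myself == 0)
        for (int i = 0; i < n_neighs; i++) neighbors[i] = i + 1;
    else
        neighbors[0] = 0;

    /* Rebalanced per-rank load, as in the setup phase above. */
    int n_elems = 1000000 / we_are;               /* assumed GLOBAL_SIZE */
    if (myself == 0) n_elems /= 1000;
    else             n_elems += (int)(n_elems * (1.0 - 1.0 / 1000) / (we_are - 1));

    double (*r_data)[MSG_SIZE] = malloc(n_neighs * sizeof(*r_data));
    double (*s_data)[MSG_SIZE] = calloc(n_neighs, sizeof(*s_data));
    MPI_Request *requests = malloc(n_neighs * sizeof(MPI_Request));

    double local = (double)myself, global;
    for (int step = 0; step < N_STEPS; step++) {
        for (int i = 0; i < n_neighs; i++)
            MPI_Irecv(&r_data[i][0], MSG_SIZE, MPI_DOUBLE, neighbors[i], TAG,
                      MPI_COMM_WORLD, &requests[i]);
        for (int i = 0; i < n_neighs; i++)
            MPI_Send(&s_data[i][0], MSG_SIZE, MPI_DOUBLE, neighbors[i], TAG,
                     MPI_COMM_WORLD);
        compute_local(n_elems);                         /* overlaps the receives */
        MPI_Waitall(n_neighs, requests, MPI_STATUSES_IGNORE);
        work();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    free(r_data); free(s_data); free(requests); free(neighbors);
    MPI_Finalize();
    return 0;
}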