The original version of the Communication Imbalance kernel decomposes the problem domain taking into account only the number of elements per rank. With this kind of distribution the computational phases are completely balanced, so the imbalance comes entirely from the communication phase.
In that case, the domain decomposition consists of:
n_elems = GLOBAL_SIZE / we_are;       // Perfectly balanced (we_are = number of ranks)
connectivity[RANKS][RANKS] = { ... }; // Asymmetric (rank 0 communicates with everybody)
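A minimal sketch of what such an asymmetric connectivity table might look like is shown below. Only connectivity and RANKS appear in the original pseudocode; the concrete 4-rank star pattern and the build_neighbor_list helper are illustrative assumptions, not taken from the kernel.

#define RANKS 4

/* Hypothetical 4-rank table: rank 0 is connected to everybody,
 * every other rank only to rank 0. */
static const int connectivity[RANKS][RANKS] = {
    {0, 1, 1, 1},   /* rank 0: 3 neighbors */
    {1, 0, 0, 0},   /* rank 1: 1 neighbor  */
    {1, 0, 0, 0},   /* rank 2: 1 neighbor  */
    {1, 0, 0, 0},   /* rank 3: 1 neighbor  */
};

/* Each rank derives its own neighbor list from the table; the list
 * length (n_neighs) is what makes the communication phase unbalanced. */
static int build_neighbor_list(int my_rank, int neighbors[RANKS])
{
    int n_neighs = 0;
    for (int r = 0; r < RANKS; r++)
        if (connectivity[my_rank][r])
            neighbors[n_neighs++] = r;
    return n_neighs;
}

With this table, rank 0 posts three receives and three sends per step while every other rank posts one of each, even though every rank computes on the same n_elems.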
The main loop of the algorithm is then executed:
for (int step = 0; step < N_STEPS; step++) {
    // Different number of neighbors per rank (unbalanced)
    for (int i = 0; i < n_neighs; i++)
        MPI_Irecv(&r_data[i][0], ..., neighbors[i], ..., &requests[i]);
    for (int i = 0; i < n_neighs; i++)
        MPI_Send(&s_data[i][0], ..., neighbors[i], ...);

    compute_local(n_elems, ...); // Same amount of computation on each process (balanced)
    MPI_Waitall(n_neighs, requests, MPI_STATUSES_IGNORE); // Wait for all the receives
    work();                      // Additional code before the collective; may process data just arrived
    MPI_Allreduce(...);
}
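For reference, the following is a self-contained, compilable sketch of this loop under assumed parameters: the star-shaped neighbor pattern, MSG_SIZE, N_ELEMS, N_STEPS, and the bodies of compute_local and work are all placeholders rather than the kernel's actual values.

/* Hypothetical sketch, not the kernel's actual source. */
#include <mpi.h>
#include <stdlib.h>

#define N_STEPS  10
#define MSG_SIZE 1024
#define N_ELEMS  4096

/* Same amount of computation on every rank (balanced). */
static void compute_local(int n_elems, double *elems)
{
    for (int i = 0; i < n_elems; i++)
        elems[i] = elems[i] * 0.5 + 1.0;
}

/* Placeholder for the additional pre-collective work. */
static void work(void) { }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, n_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

    /* Star-shaped pattern: rank 0 talks to everybody, the rest only
     * to rank 0, so n_neighs differs per rank (unbalanced). */
    int n_neighs = (rank == 0) ? n_ranks - 1 : 1;
    int *neighbors = malloc(n_neighs * sizeof(int));
    if (rank == 0)
        for (int i = 0; i < n_neighs; i++)
            neighbors[i] = i + 1;
    else
        neighbors[0] = 0;

    double (*r_data)[MSG_SIZE] = malloc(n_neighs * sizeof(*r_data));
    double (*s_data)[MSG_SIZE] = calloc(n_neighs, sizeof(*s_data));
    double *elems = calloc(N_ELEMS, sizeof(double));
    MPI_Request *requests = malloc(n_neighs * sizeof(MPI_Request));

    double local = 0.0, global;
    for (int step = 0; step < N_STEPS; step++) {
        for (int i = 0; i < n_neighs; i++)       /* post all receives first */
            MPI_Irecv(&r_data[i][0], MSG_SIZE, MPI_DOUBLE, neighbors[i],
                      0, MPI_COMM_WORLD, &requests[i]);
        for (int i = 0; i < n_neighs; i++)       /* then the blocking sends */
            MPI_Send(&s_data[i][0], MSG_SIZE, MPI_DOUBLE, neighbors[i],
                     0, MPI_COMM_WORLD);

        compute_local(N_ELEMS, elems);           /* balanced computation */
        MPI_Waitall(n_neighs, requests, MPI_STATUSES_IGNORE);
        work();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    free(neighbors); free(r_data); free(s_data); free(elems); free(requests);
    MPI_Finalize();
    return 0;
}

Posting all the MPI_Irecv calls before the blocking MPI_Send calls ensures every send finds a matching receive, so the exchange cannot deadlock. The imbalance then shows up as the extra time rank 0 spends in its longer send/receive loops, which the other ranks pay for as waiting time in the MPI_Allreduce.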