MPI processes often have to communicate with a list of neighbors. Depending on the order of the send and receive calls, many processes can become “synchronized”: all of them try to send to the same destination at the same time, and the limited incoming bandwidth at that destination then limits the overall communication performance.
A simple way to address the issue is to order each process's neighbor list so that such endpoint contention is avoided. Optimal communication schedules can be computed, but in practice a simple heuristic will probably reduce the effect: each process starts its list at the first neighbor whose rank is higher than its own and wraps around to its lowest-ranked neighbor once the end of the list is reached.
rank_id_t neighbors[N]; // list of this rank's neighbors, sorted by ascending rank
int next_neigh = search_idx(neighbors, myRank); // index of the first neighbor with rank greater than myRank (0 if there is none)
for (int i = 0; i < N; i++) {
    send(neighbors[(next_neigh + i) % N]); // circular traversal of the neighbor list
}
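The pseudocode above leaves the search step open. The following is a minimal sketch of one way to fill it in, assuming the neighbor list is sorted by ascending rank and stored as plain `int` ranks. The function names `first_higher_neighbor` and `contention_aware_order` are hypothetical, chosen for illustration only, and the actual send call (for example a nonblocking MPI send) is deliberately left out so that only the ordering logic is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the first neighbor whose rank
 * is greater than my_rank. Assumes neighbors[] is sorted in ascending order.
 * Returns 0 when no neighbor has a higher rank (wrap-around) or when n == 0. */
size_t first_higher_neighbor(const int *neighbors, size_t n, int my_rank)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {                      /* binary search for the upper bound */
        size_t mid = lo + (hi - lo) / 2;
        if (neighbors[mid] <= my_rank)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo == n) ? 0 : lo;
}

/* Hypothetical helper: writes into order[0..n-1] the neighbor ranks in the
 * sequence in which this process would send to them, starting at the first
 * higher-ranked neighbor and traversing the list circularly. */
void contention_aware_order(const int *neighbors, size_t n, int my_rank,
                            int *order)
{
    assert(n == 0 || (neighbors != NULL && order != NULL));
    size_t start = first_higher_neighbor(neighbors, n, my_rank);
    for (size_t i = 0; i < n; i++)
        order[i] = neighbors[(start + i) % n];
}
```

For example, a process with rank 3 and neighbors 0, 2, 5, 7 would send in the order 5, 7, 0, 2, whereas rank 1 with the same neighbors would start at 2. When every process is a neighbor of every other, step i of this ordering has rank r sending to rank (r + 1 + i) mod P, so no two processes target the same destination in the same step. For irregular neighbor lists the rotation only makes such collisions less likely.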