FFTXlib optimization with OmpSs

The following figure shows the runtime of the FFT phase for an increasing number of MPI ranks, comparing the original version with the OmpSs version. The original version uses Nx8 MPI ranks, i.e. N ranks in the first MPI layer and 8 FFT task groups. The OmpSs version uses N MPI ranks with 8 threads each, which replace the FFT task groups. The figure shows that the OmpSs version completes the FFT phase about 7-10% faster (not counting hyper-threading). In particular, the fastest OmpSs configuration (16x8) is about 10% faster than the fastest original configuration (8x8).

Figure: fftx scalability (runtime of the FFT phase versus number of MPI ranks, original and OmpSs versions)
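
To make the Nx8 notation concrete, the two configurations can be written down as a small Python sketch. This is bookkeeping only, not FFTXlib code: the names `Layout`, `original_layout` and `ompss_layout` are hypothetical, and the sketch assumes one core per MPI rank or thread and 8 threads on every OmpSs rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of one run configuration."""
    mpi_ranks: int          # number of MPI processes launched
    threads_per_rank: int   # threads inside each MPI process

    @property
    def cores(self) -> int:
        # Assumption: one core per rank/thread, no hyper-threading.
        return self.mpi_ranks * self.threads_per_rank


def original_layout(n: int, task_groups: int = 8) -> Layout:
    # Original version: N first-layer ranks times 8 FFT task groups,
    # all of them realised as separate MPI ranks.
    return Layout(mpi_ranks=n * task_groups, threads_per_rank=1)


def ompss_layout(n: int, threads: int = 8) -> Layout:
    # OmpSs version: N MPI ranks, with 8 threads per rank taking
    # the place of the FFT task groups.
    return Layout(mpi_ranks=n, threads_per_rank=threads)


if __name__ == "__main__":
    for n in (8, 16):
        orig, omp = original_layout(n), ompss_layout(n)
        print(f"{n}x8  original: {orig.mpi_ranks} ranks x {orig.threads_per_rank} thread, "
              f"OmpSs: {omp.mpi_ranks} ranks x {omp.threads_per_rank} threads, "
              f"cores: {orig.cores} vs {omp.cores}")
```

Under these assumptions, both versions occupy the same number of cores for a given N; what changes is how many of them are MPI ranks. The sketch also makes explicit that the fastest OmpSs configuration (16x8) and the fastest original configuration (8x8) differ in N, so that comparison is between runs of different total size.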