FFTXlib (ompss)

Version's name: FFTXlib (ompss); a version of the FFTXlib program.
Repository: [home] and version downloads: [.zip] [.tar.gz] [.tar.bz2] [.tar]
Patterns and behaviours:

Sequences of independent computations with expensive global communication

Implemented best practices: Coarse grain (comp + comm) taskyfication with dependencies

This version of FFTXlib, parallelized with the OmpSs programming model, targets the poor scalability of the computation phases. The objective is to reduce resource contention by replacing the second MPI layer (the FFT task groups) with tasks. Instead of parallelizing each loop within each step, as in the original version, this approach converts each loop iteration, i.e. each FFT, into a single task. Since there are no dependencies between the loop iterations, each task can be scheduled without any further constraints.

    DO I = 1, NB, NTG
        !$omp task default(shared) firstprivate(ipsi) &
        !$omp & private(aux, time, i, j) inout(psis) &
        !$omp & reduction(+:ncount, my_time)
        CALL pack NTG bands
        CALL multi-band FW-FFT along Z
        CALL multi-band Scatter
        CALL multi-band FW-FFT along XY
        CALL VOFR
        CALL multi-band BW-FFT along XY
        CALL multi-band Scatter
        CALL multi-band BW-FFT along Z
        CALL unpack NTG bands
        !$omp end task
    END DO
    !$omp taskwait

The individual tasks de-synchronize the computation phases. In the original version, all processes execute the computation phases at roughly the same time, since they are statically parallelized and synchronized by the MPI collective calls. With tasks that are scheduled dynamically by the runtime, the compute phases are instead executed according to dependencies and resource availability. The execution of the compute phases is therefore de-synchronized: at any time, only a subset of processes executes the main phase with high compute intensity, while the others execute phases with lower compute intensity. As a result, the growing resource contention that reduces IPC when scaling to the full node can be partly absorbed.
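
As a minimal sketch of the task-per-iteration pattern described above, the following self-contained Fortran program creates one task per group of bands. It is not FFTXlib code. It uses standard OpenMP tasking instead of OmpSs so that it builds with a common compiler (for example gfortran with -fopenmp). The sizes nb, ntg and npts and the routine process_band_group are hypothetical stand-ins for the pack / FFT / scatter / VOFR / unpack pipeline.

```fortran
! Sketch only: one task per group of bands, using standard OpenMP tasking
! (an assumption made for portability; the original version uses OmpSs).
program task_per_band_group
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: nb = 16       ! number of bands (assumed value)
    integer, parameter :: ntg = 4       ! bands per group (assumed value)
    integer, parameter :: npts = 1024   ! points per band (assumed value)
    real(dp) :: psis(npts, nb)
    integer :: ipsi, last

    psis = 1.0_dp

    !$omp parallel default(shared) private(last)
    !$omp single
    do ipsi = 1, nb, ntg
        last = min(ipsi + ntg - 1, nb)
        ! Each task works on a disjoint slice of psis, so the tasks are
        ! independent and the runtime may run them in any order.
        !$omp task default(shared) firstprivate(ipsi, last)
        call process_band_group(psis(:, ipsi:last))
        !$omp end task
    end do
    ! Wait for all band-group tasks before using the results.
    !$omp taskwait
    !$omp end single
    !$omp end parallel

    ! Self-check: every element must have been processed exactly once.
    if (any(abs(psis - 2.0_dp) > 1.0e-12_dp)) error stop "unexpected result"
    print *, "all band groups processed"

contains

    ! Hypothetical placeholder for the per-group work; it only scales the data.
    subroutine process_band_group(grp)
        real(dp), intent(inout) :: grp(:, :)
        grp = 2.0_dp * grp
    end subroutine process_band_group

end program task_per_band_group
```

Because each task touches only its own columns of psis, no dependency clauses are needed in this sketch. Which thread runs which band group, and in what order, is left to the runtime.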

The following experiments have been registered:

FFTXlib optimization with OmpSs