The Lattice Boltzmann kernel included simulates a 3D lid-driven cavity problem at a low Reynolds number in double-precision floating point. As a particle-based method, the kernel discretizes space into a 3D structured grid (a lattice), time according to the CFL condition, and velocity into a fixed number (here, 19) of discrete flow directions. Each lattice site therefore holds a particle distribution function (PDF) F(space, time, velocity) that describes the flow and allows one to recover macroscopic variables such as velocity and momentum through simple sums over these distribution functions and known weights. This yields the well-known D3Q19 model, in which the grid is traversed and the new distribution function values depend only on the old grid values.
[Figure: D3Q19 accesses for lattice sites]
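To make the recovery step concrete, in standard D3Q19 notation (not taken from the kernel's source) the density and momentum are the zeroth and first moments of the discrete distribution functions:

$$\rho = \sum_{q=0}^{18} f_q, \qquad \rho\,\mathbf{u} = \sum_{q=0}^{18} \mathbf{c}_q\, f_q,$$

where the $\mathbf{c}_q$ are the 19 fixed flow directions.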
The kernel has three main components: first the streaming step, in which the distribution functions are pulled from the 19 neighboring lattice sites; then the calculation of the macroscopic variables; and finally the collision step, in which the collision operator is applied to compute the new distribution functions and store them in the new grid.
[Figure: A 2D example of a lattice update]
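The sketch below illustrates these three steps for a single lattice site. It assumes a BGK-style collision operator and uses placeholder names (`fold`, `fnew`, `cx`, `cy`, `cz`, `w`, `omega`); the kernel's actual collision operator and data layout may differ.

```fortran
! Illustrative pull-style update of one lattice site (BGK collision assumed;
! names and layout are placeholders, not the kernel's actual identifiers).
subroutine site_update(fold, fnew, i, j, k, nx, ny, nz, cx, cy, cz, w, omega)
   implicit none
   integer, intent(in) :: i, j, k, nx, ny, nz
   integer, intent(in) :: cx(0:18), cy(0:18), cz(0:18)
   double precision, intent(in)    :: fold(0:nx+1, 0:ny+1, 0:nz+1, 0:18)
   double precision, intent(inout) :: fnew(0:nx+1, 0:ny+1, 0:nz+1, 0:18)
   double precision, intent(in)    :: w(0:18), omega
   double precision :: floc(0:18), rho, ux, uy, uz, cu, u2, feq
   integer :: q

   ! 1) Streaming: pull the PDFs from the 19 upstream neighbors.
   do q = 0, 18
      floc(q) = fold(i - cx(q), j - cy(q), k - cz(q), q)
   end do

   ! 2) Macroscopic variables: density and velocity as moments of the PDFs.
   rho = sum(floc)
   ux  = sum(dble(cx) * floc) / rho
   uy  = sum(dble(cy) * floc) / rho
   uz  = sum(dble(cz) * floc) / rho
   u2  = ux*ux + uy*uy + uz*uz

   ! 3) Collision and store: relax towards equilibrium, write the new grid.
   do q = 0, 18
      cu  = dble(cx(q))*ux + dble(cy(q))*uy + dble(cz(q))*uz
      feq = w(q) * rho * (1.0d0 + 3.0d0*cu + 4.5d0*cu*cu - 1.5d0*u2)
      fnew(i, j, k, q) = floc(q) - omega * (floc(q) - feq)
   end do
end subroutine site_update
```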
The distribution function is stored as a 4D array in one of two layouts: F(q,i,j,k) or F(i,j,k,q), where 'q' denotes the discrete velocity direction and the other three indices form a lexicographic numbering of the grid. Since Fortran is column-major, the first layout is an 'array of structures' pattern in which the 19 velocity entries of each lattice site are stored contiguously, whereas the second layout is a 'structure of arrays' pattern in which a spatial coordinate is the most rapidly varying dimension.
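As a small illustration (array names and grid size are placeholders, not the kernel's identifiers), the two layouts correspond to declarations like:

```fortran
! Illustrative declarations of the two layouts; Fortran is column-major,
! so the leftmost index varies fastest in memory.
integer, parameter :: nx = 128, ny = 128, nz = 128    ! illustrative grid size
double precision :: f_aos(0:18, nx, ny, nz)   ! F(q,i,j,k): the 19 PDFs of one site lie contiguously
double precision :: f_soa(nx, ny, nz, 0:18)   ! F(i,j,k,q): consecutive sites of one direction lie contiguously
```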
[Figure: MAQAO port model for Neoverse V2]
[Figure: Unrolled streaming - no ALU vectorization]
[Figure: Rolled streaming]
Non-temporal writes / write-allocate evasion: Since the new grid values are not used until the next lattice sweep, a write allocate would pollute the cache and hurt performance heavily, since the kernel is generally memory bound. It would result in memory traffic of 3x19x8 + 18x4 = 528 bytes per lattice update (the 19 PDFs are loaded from the old grid, loaded again by the write allocate, and stored to the new grid) instead of the 2x19x8 + 18x4 = 376 bytes we would get by writing directly to memory. Some architectures provide write-allocate evasion when they detect low cache reuse after a store. On older architectures, using `!$omp simd nontemporal` directives still does not seem to generate non-temporal stores.
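A hedged sketch of such a hint, using the OpenMP 5.0 `nontemporal` clause on a `simd` construct (array names are illustrative, and compilers are free to ignore the hint):

```fortran
! Request streaming (non-temporal) stores for the new grid via OpenMP 5.0.
! Compilers may or may not honor the hint; names are illustrative.
!$omp simd nontemporal(fnew)
do q = 0, 18
   fnew(i, j, k, q) = fpost(q)   ! post-collision PDFs computed earlier
end do
```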
FP Vectorization
[Figure: Loop structure]
The overall code structure is extremely challenging for compilers because, beyond the classical three-loop spatial nest, there are two additional loops inside the third loop level. These additional loops push the main computations out of the innermost loop level, which is where most compilers concentrate their vectorization effort, and therefore make them hard to vectorize.
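A sketch of this structure, with the collision arithmetic elided and illustrative names, might look as follows; the two short `q` loops sit inside the third spatial loop and thus become the innermost loops the compiler sees:

```fortran
! Illustrative loop structure; relax() is a hypothetical helper standing in
! for the collision computation, and all names are placeholders.
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         rho = 0.0d0
         do q = 0, 18                                        ! extra inner loop 1: stream + moments
            floc(q) = fold(i - cx(q), j - cy(q), k - cz(q), q)
            rho     = rho + floc(q)
         end do
         do q = 0, 18                                        ! extra inner loop 2: collide + store
            fnew(i, j, k, q) = relax(q, floc, rho)           ! collision elided behind relax()
         end do
      end do
   end do
end do
```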