The Lattice Boltzmann kernel included simulates a 3D lid-driven cavity problem at a low Reynolds number in double-precision floating point. As a particle-based method, the kernel discretizes space into a 3D structured grid (a lattice), time according to the CFL condition, and velocity into a fixed number (here, 19) of discrete flow directions. Each lattice site therefore holds a particle distribution function (PDF) F(space, time, velocity) that describes the flow and allows one to recover macroscopic variables such as velocity and momentum through simple sums over these distribution functions and known weights. This yields the well-known D3Q19 model, in which the grid is traversed and the new distribution function values depend only on the old grid values.
[Figure: D3Q19 accesses for lattice sites]
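To make the recovery step concrete, in standard D3Q19 notation (not taken from the kernel's source) the density and momentum are the zeroth and first moments of the discrete distribution functions:

$$\rho = \sum_{q=0}^{18} f_q, \qquad \rho\,\mathbf{u} = \sum_{q=0}^{18} \mathbf{c}_q\, f_q,$$

where the $\mathbf{c}_q$ are the 19 fixed flow directions.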
The kernel has three main components: first the streaming step, in which the distribution functions are pulled from the 19 neighboring lattice sites; then the calculation of the macroscopic variables; and finally the collision step, in which the collision operator is applied to compute the new distribution functions and store them in the new grid.
[Figure: A 2D example of a lattice update]
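The sketch below illustrates these three steps for a single lattice site. It assumes a BGK-style collision operator and uses placeholder names (`fold`, `fnew`, `cx`, `cy`, `cz`, `w`, `omega`); the kernel's actual collision operator and data layout may differ.

```fortran
! Illustrative pull-style update of one lattice site (BGK collision assumed;
! names and layout are placeholders, not the kernel's actual identifiers).
subroutine site_update(fold, fnew, i, j, k, nx, ny, nz, cx, cy, cz, w, omega)
   implicit none
   integer, intent(in) :: i, j, k, nx, ny, nz
   integer, intent(in) :: cx(0:18), cy(0:18), cz(0:18)
   double precision, intent(in)    :: fold(0:nx+1, 0:ny+1, 0:nz+1, 0:18)
   double precision, intent(inout) :: fnew(0:nx+1, 0:ny+1, 0:nz+1, 0:18)
   double precision, intent(in)    :: w(0:18), omega
   double precision :: floc(0:18), rho, ux, uy, uz, cu, u2, feq
   integer :: q

   ! 1) Streaming: pull the PDFs from the 19 upstream neighbors.
   do q = 0, 18
      floc(q) = fold(i - cx(q), j - cy(q), k - cz(q), q)
   end do

   ! 2) Macroscopic variables: density and velocity as moments of the PDFs.
   rho = sum(floc)
   ux  = sum(dble(cx) * floc) / rho
   uy  = sum(dble(cy) * floc) / rho
   uz  = sum(dble(cz) * floc) / rho
   u2  = ux*ux + uy*uy + uz*uz

   ! 3) Collision and store: relax towards equilibrium, write the new grid.
   do q = 0, 18
      cu  = dble(cx(q))*ux + dble(cy(q))*uy + dble(cz(q))*uz
      feq = w(q) * rho * (1.0d0 + 3.0d0*cu + 4.5d0*cu*cu - 1.5d0*u2)
      fnew(i, j, k, q) = floc(q) - omega * (floc(q) - feq)
   end do
end subroutine site_update
```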
The distribution function is stored as a 4D array in one of two layouts: F(q,i,j,k) or F(i,j,k,q), where 'q' denotes the discrete velocity direction and the other three indices form a lexicographic numbering of the grid. Since Fortran is column-major, the first layout is an 'array of structures' pattern in which the 19 velocity entries of each lattice site are stored contiguously, whereas the second layout is a 'structure of arrays' pattern in which a spatial coordinate is the most rapidly varying dimension.
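As a small illustration (array names and grid size are placeholders, not the kernel's identifiers), the two layouts correspond to declarations like:

```fortran
! Illustrative declarations of the two layouts; Fortran is column-major,
! so the leftmost index varies fastest in memory.
integer, parameter :: nx = 128, ny = 128, nz = 128    ! illustrative grid size
double precision :: f_aos(0:18, nx, ny, nz)   ! F(q,i,j,k): the 19 PDFs of one site lie contiguously
double precision :: f_soa(nx, ny, nz, 0:18)   ! F(i,j,k,q): consecutive sites of one direction lie contiguously
```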
[Figure: MAQAO port model for Neoverse V2]
[Figure: Unrolled streaming - no ALU vectorization]
[Figure: Rolled streaming]
Non-temporal writes / write-allocate evasion: Since the new grid values are not used until the next lattice sweep, a write allocate would pollute the cache and hurt performance heavily, since the kernel is generally memory bound. It would result in memory traffic of 3x19x8 + 18x4 = 528 bytes per lattice update (the 19 PDFs are loaded from the old grid, loaded again by the write allocate, and stored to the new grid) instead of the 2x19x8 + 18x4 = 376 bytes we would get by writing directly to memory. Some architectures provide write-allocate evasion when they detect low cache reuse after a store. On older architectures, using `!$omp simd nontemporal` directives still does not seem to generate non-temporal stores.
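A hedged sketch of such a hint, using the OpenMP 5.0 `nontemporal` clause on a `simd` construct (array names are illustrative, and compilers are free to ignore the hint):

```fortran
! Request streaming (non-temporal) stores for the new grid via OpenMP 5.0.
! Compilers may or may not honor the hint; names are illustrative.
!$omp simd nontemporal(fnew)
do q = 0, 18
   fnew(i, j, k, q) = fpost(q)   ! post-collision PDFs computed earlier
end do
```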
FP Vectorization
[Figure: Loop structure]
The overall code structure is extremely challenging for compilers because, beyond the classical three-loop spatial nest, there are two additional loops inside the third loop level. These additional loops push the main computations out of the innermost loop level, which is where most compilers concentrate their vectorization effort, and therefore make them hard to vectorize.
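A sketch of this structure, with the collision arithmetic elided and illustrative names, might look as follows; the two short `q` loops sit inside the third spatial loop and thus become the innermost loops the compiler sees:

```fortran
! Illustrative loop structure; relax() is a hypothetical helper standing in
! for the collision computation, and all names are placeholders.
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         rho = 0.0d0
         do q = 0, 18                                        ! extra inner loop 1: stream + moments
            floc(q) = fold(i - cx(q), j - cy(q), k - cz(q), q)
            rho     = rho + floc(q)
         end do
         do q = 0, 18                                        ! extra inner loop 2: collide + store
            fnew(i, j, k, q) = relax(q, floc, rho)           ! collision elided behind relax()
         end do
      end do
   end do
end do
```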