A regularized version of the lattice Boltzmann method for efficient simulation of soft materials is introduced. Unlike standard approaches, this method reconstructs the distribution functions from available hydrodynamic variables (density, momentum, and pressure tensor) without storing the full set of discrete populations. This scheme shows significantly lower memory requirements and data access costs. A series of benchmark tests of relevance to soft matter, such as collisions of fluid droplets, is discussed to validate the method. The results can be of particular interest for high-performance simulations of soft matter systems on future exascale computers.

## I. INTRODUCTION

Minimizing the cost of accessing data is one of the most compelling challenges of present-day high-performance computing (HPC).^{1} This is particularly true for memory-bound computational schemes used in computational fluid dynamics. Among them, the lattice Boltzmann (LB) method^{2} is characterized by a comparatively low arithmetic intensity spanning in a range between 1 and 5 Flops/Byte ratio depending on the implementation, hence below the roofline threshold beyond which computing resources become the limiting factor.^{3,4}

Given the widespread use of LB across a broad spectrum of scales and regimes, it is of prime interest to develop techniques to mitigate the impact of data access on future (exascale) LB simulations. Several such techniques have been developed in the past, based on hierarchical memory access,^{5} data organization in terms of array of structures and structure of arrays,^{6} as well as special crystallographic lattices that double the resolution at a given level of memory occupation.^{7} Furthermore, a number of approaches have been proposed to reduce memory occupancy. These can be summarized in two main strategies: the first one focuses on algorithmic implementation, while the second one exploits different numerical representations. As a remarkable example of the algorithmic approach, we recall the so-called “in-place” streaming, like the esoteric pull,^{8,9} while in the latest one, low precision and shifted equilibrium distributions are used either with single precision (instead of double) or with a mixed single precision/half-precision representation, with a factor two of memory saving and no loss of accuracy.^{10,11}

In this work, we retrieve the pioneering considerations given by Ladd and Verberg^{12} on the possibility of reducing memory consumption by only storing hydrodynamic fields and their gradients. Thus, we present a new strategy to implement the LB approach to model a bi-component system based on the idea of refraining from storing the populations and reconstructing them on the fly based on the available hydrodynamic information. This approach grants a significant reduction in memory usage since it avoids population storage (in the following simply called “without populations”). Furthermore, this is particularly useful for multi-component and multiple species applications in which each species is carried by a common flow field, thereby requiring a single hydrodynamic field (density) as opposed to the full kinetic representation with *O*(30) populations. A similar argument applies to flows far from equilibrium or relativistic hydrodynamics,^{13,14} which demand higher order lattices, sometimes entailing hundreds of discrete populations per species.

The present approach is tested on a selection of physically relevant situations in soft matter. More specifically, we simulate a 3D off-axis and a 3D head-on collision between two equally sized fluid droplets in which the merging is suppressed. These simulations are compared with those performed using a regularized LB version with populations, and the accuracy of the results is quantitatively assessed by computing the time evolution of the droplet densities and the relative error. Even though the reconstruction procedure suffers from an inherent approximation concerning the standard implementation of the LB method (see Sec. II), our results show that the error remains negligible, the numerical accuracy is generally preserved, and the physics is correctly reproduced. Furthermore, the reconstruction procedure benefits from the regularization techniques^{15–17} that grants stability properties filtering out the non-hydrodynamic content in off-equilibrium distribution functions. The distinctive feature of the reconstruction procedure discussed here is that, unlike a standard LB approach, it considerably enhances the computational performance of lattice Boltzmann codes, entailing about 40 percent of memory savings on large-scale simulations. This represents a significant step toward the study of the multiscale physics of soft materials, particularly on modern graphics processing unit (GPU)-accelerated machines where the minimization of the burden imposed by memory requirements and data access are often mandatory.

## II. LATTICE BOLTZMANN WITHOUT POPULATIONS: THE BASIC IDEA

The LB scheme is based on a set of discrete populations $fi(x\u20d7,t)=f(x\u20d7,\upsilon \u20d7,t)\delta (v\u20d7\u2212c\u20d7i)$ describing the probability of finding a representative fluid particle at position $x\u20d7$ and time *t* with a discrete molecular velocity $\upsilon \u20d7=c\u20d7i$. The discrete velocities are chosen according to suitable crystallographic symmetries ensuring compliance with the basic conservation laws of macroscopic fluids. This implies a set of order *b* ∼ 20 − 30 discrete populations *f*_{i}, *i* = 0, *b*, which exceed the number of associated hydrodynamic fields—scalar density *ρ*, (1), flow velocity *u*_{a} (*a* = 1, 3), and momentum–flux tensor Π_{ab} (six terms for each lattice point)—totaling ten independent fields. Hence, LB requires about twice as much memory as compared to a corresponding Navier–Stokes-based computational fluid dynamics method. This redundancy buys several major computational advantages, primarily the fact that streaming is exact and diffusion is emergent (no need for a second order spatial derivatives), which proved invaluable assets in achieving outstanding parallel efficiency across virtually any HPC platform, also in the presence of real-world complex geometries.

These extra-memory requirements put a premium on strategies aimed at minimizing the cost of data access in massively parallel LB codes. Among others, recognized techniques are the hierarchical access and the use of special crystallographic lattices.^{7}

*f*= [

*f*

_{1}, …,

*f*

_{b}] the

*b*-dimensional array (multiscalar) associated with the discrete populations and

*H*= [

*ρ*,

*u*

_{a}, Π

_{ab}] the array associated with the hydro-fields (including the dissipative component). The hydrodynamic fields are linear and local combinations of the populations,

^{18}that is,

*H*=

*Pf*, where

*P*is a projector matrix defined by the above relations. The inverse operation, namely, the reconstruction of

*f*from

*H*, reads as

*f*=

*RH*, where the reconstruction operator is the pseudo-inverse

*P*. Clearly, the latter is ill-posed since the information removed by the projector

*P*cannot be retraced back exactly. Formally, we write

*f*=

*h*+

*g*, where

*h*is the hydro-component of

*f*, the one that can be reconstructed exactly from

*H*, whereas

*g*is the residual lost in the projection, known as “ghost” (non-hydrodynamic) component in LB jargon.

^{2}At zeroth order, one can simply set

*g*= 0.

*f*

_{i}can be expanded around their equilibrium value

^{12,16–18}

*O*(

*ɛ*

^{k}), where

*ɛ*is the Knudsen number (the ratio between mean free path and a characteristic length) with

*ɛ*≪ 1. Using a multiscale Chapman–Enskog expansion,

^{19}it can be shown that only the two first terms, $fi(0)$ and $fi(1)$, are sufficient to recover (asymptotically) the dynamics of the Navier–Stokes equation.

^{20,21}Consequently, one can write $fi=fi(0)+fi(1)+O(\epsilon 2)$ and identify $fi(1)=fi\u2212fi(0)+O(\epsilon 2)\u2248hineq$ as the non-equilibrium hydrodynamic term, apart from

*O*(

*ɛ*

^{2}) contributions. Therefore, the distribution functions can be written as $fi=hieq+hineq+O(\epsilon 2)=hi+O(\epsilon 2)$. As reported in the literature,

^{22}the reconstruction operator amounts to the following expression:

*w*

_{i}are a set of coefficients whose values depend on the lattice geometry, $uia=ua/cs2$, and $Qiab=(ciacib\u2212cs2\delta ab)/(2cs4)$. Note that, unlike the implementation discussed in Refs. 23–25 (also called simplified lattice Boltzmann method), the populations

*h*

_{i}are reconstructed on the fly as the sum of the equilibrium and non-equilibrium terms. In particular, the first one is computed as

*g*

_{i}= 0. It is also worth reminding that $hieq$ fulfills the relations

^{12,18,20,26}

*σ*

_{ab}=

*η*(

*∂*

_{a}

*u*

_{b}+

*∂*

_{b}

*u*

_{a}) is the shear stress tensor with

*η*the dynamic viscosity, and

*E*

_{Π}is an additional term whose relative error

*e*

_{Π}=

*E*

_{Π}/Π was shown to scale with the second order in the Mach number.

^{18}

The reconstructed populations *h*_{i} are then relaxed and streamed without storing them (see Sec. III). Note that streaming revives the ghost components *g*_{i}, which affect only the non-equilibrium part of the set of populations. As a result, the operational sequence at each time step is Reconstruct–Collide–Stream (RCS). It should be emphasized that the RCS sequence does not match exactly the result of the original, populations preserving, LB scheme. The underlying assumption is that, whenever the present approach is compared to standard Bathnagar–Gross–Krook (BGK) collision,^{27} the discrepancy introduced by the R-step naturally heals itself via the ghost revival described earlier. Since the ghosts contain high-order, non-hydrodynamic non-equilibrium information, this assumption appears well justified, except perhaps in the case of very low viscous flows (turbulence), which lie outside the scope of this paper. However, discrepancies are lowered whenever the comparison is performed with a regularized LB version with populations, as in the present case. To be noted that a further source of discrepancy relates to the truncation error due to the floating-point representation, as well as to the different sequences of operations in the algorithmic implementation. However, none of these are expected to play a significant role in the physical phenomena under study in this work.

## III. LATTICE BOLTZMANN WITHOUT POPULATIONS: IMPLEMENTATION DETAILS

### A. Implementation procedure

The basic idea is to construct the populations on the fly, based on Eq. (3) without storing them. In order to caution against overwriting the values at time *t* and *t* + 1, two copies of the hydrodynamic arrays must be stored, doubling the memory request. For a single component fluid in the paradigm of flip–flop memory access pattern,^{28} the memory saving is significant, 20 hydro-arrays (1 for density, 3 for velocity vector, 6 for the pressure tensor, two times in the flip–flop paradigm) vs 38 populations for the D3Q19 lattice scheme, adopted in the present investigation. For two component fluids, the advantage increases. For example, using the color-gradient lattice Boltzmann (LBCG) model,^{29–31} we need to add two extra arrays for the second component, thus obtaining 22 hydro-arrays vs 76 populations (i.e., 38 populations per each fluid component).

*h*

_{i}are stored as temporary scalars, thus without arrays, and

*ω*is the BGK relaxation frequency that tunes the kinematic viscosity $\nu =cs2(\tau \u221212)$, being

*τ*= 1/

*ω*the relaxation time and $cs=1/3$ the isothermal sound speed of the model.

^{2,20}

*ρ*,

*u*

_{a}, and Π

_{ab}. Note that, since the populations are reconstructed on the fly, collision and streaming are more conveniently fused into a single step (fused approach), thus allowing for computation and storage of the hydro-fields only after Eq. (10). This is in contrast with a split approach, in which the fields would be stored after the collision step of Eq. (8). As aforementioned, the best strategy for the fused approach is to implement the Reconstruct–Collide–Stream (RCS) sequence, since it combines the locality of the collision step with the non-locality of the pulled streaming, allowing the reconstruction of the populations only at the pre-collision stage, in analogy with the regularized method described by Latt and Chopard.

^{16}

^{29–31}This method requires the calculation of the local density gradients, which are necessary for the perturbation and recoloring steps. Denoting the phase field as Θ = (

*ρ*

_{R}−

*ρ*

_{B})/(

*ρ*

_{R}+

*ρ*

_{B}), where

*ρ*

_{R}and

*ρ*

_{B}are the densities of the two fluids (red and blue), the reconstructed populations obtained from Eq. (9) need for the following extra step:

*ρ*=

*ρ*

_{R}+

*ρ*

_{B}, cos

*ϕ*

_{i}is the cosine of the angle between the phase field gradient and the

*i*th lattice vector and $hieq(\rho ,0)$ is the equilibrium distribution function computed for

*ρ*and zero velocity. In addition,

*B*

_{i}are lattice dependent weights given in Ref. 30, the parameter

*A*is chosen to fit the interfacial tension

*γ*through the relation

*β*is a free parameter that tunes the interface width, thus playing the role of an inverse diffusion length scale.

About the implementation of the boundary conditions (BCs), since the populations are reconstructed on the fly, BCs can be easily implemented by conditional statements for the applications of ad-hoc rules on the post-collision populations in the fused approach. For instance, halfway bounce-back can be easily implemented, including a minor correction to endure a Dirichlet boundary condition.^{12,32} On the other hand, the performance could notably decrease due to the extensive use of conditional statements in the fused implementations. The problem can be avoided by adopting the non-local regularized BCs,^{21,26} where the BCs are applied by modifying only the hydrodynamics fields used to reconstruct the populations. In particular, in the non-local regularized BCs, the target values of *ρ*_{BC} and *u*_{BC} at the wall can be assigned or extrapolated by finite differences. At the same time, the non-equilibrium tensor can be estimated^{18,26} as $\Pi abneq\u2248\u2212\rho cs2\tau (\u2202aub+\u2202bua)$. Both the proposed BC strategies preserve second-order accuracy in space and time.^{12,21}

### B. Near-contact interactions

Recently, an extension of the lattice Boltzmann method to model near contact interactions among interfaces (in the following denoted LBNCI)^{33,34} has granted the possibility to simulate a broad variety of complex flow systems, such as foams, emulsions, flowing collections of droplets, bubbles, and high internal phase emulsions, i.e., foamy-like media displaying ordered patterns of fluid drops when under confinement.^{35–37} In those systems, the relative motion of two fluid interfaces in close contact represents a complex hydrodynamic process due to the concurrent action of several repulsive and/or attractive forces, such as electrostatic and van der Waals forces, hydration repulsion, steric interactions, and depletion attraction.^{38,39} Importantly, these forces govern the physics of phenomena, such as coalescence and/or repulsion between neighboring droplets that can considerably alter the mechanics of the material, occurring at lengthscales much larger than the near-contact ones.^{40–44} Their nature has been initially elucidated by the pioneering works of Gibbs and Marangoni on the thermodynamics of liquid thin films^{39} and by the seminal contributions of Derjaguin and Overbeek^{45,46} concurring in the development of the DLVO theory.

The LBNCI method is a variant of an LBM approach for multicomponent flows, based on the color-gradient model,^{30} augmented with an extra forcing term capturing the effects of the near-contact forces acting at the fluid interface level. This method still holds to a continuum description of the interface dynamics, even though near interaction forces arise at a molecular level down to nanometers (and below), namely, the relevant spatial scale of contact forces. In other words, the mesoscale representation of near-contact forces provides a coarse-grained description able to retain computational viability without compromising the underlying physics.

In the above equation, $Ah[h(x\u20d7)]$ is the parameter controlling the strength (force per unit volume) of the near contact interactions, $h(x\u20d7)$ is the distance between two close interfaces, $n\u20d7$ is a unit vector normal to the interface, and *δ*_{I} ∝ ∇Θ is a function, proportional to the phase field Θ, which localizes the force at the interface.

*σ*denoting the surface tension,

^{47}and

*p*the pressure. This is the Navier–Stokes equation for a multicomponent system augmented with a surface-localized repulsive term.

## IV. VALIDATION

^{33}In the present article, we use a GPU-based version of the method that has been included in LBcuda,

^{48}a GPU-accelerated version of LBsoft.

^{49}We test the algorithm in conditions where the BGK operator is under-relaxed and over-relaxed; this is done by setting two different values of relaxation time

*τ*for the two fluid components, with

*τ*

_{R}= 0.6 and

*τ*

_{B}= 1.4. These ones define the corresponding kinematic viscosity

*ν*

_{R}≃ 0.03 and

*ν*

_{B}≃ 0.3. At the interface, the effective relaxation time

*τ*

_{eff}is linearly interpolated as

*ω*

_{eff}= 1/

*τ*

_{eff}, is inserted in Eqs. (9) and (13).

The numerical experiment is as follows: Two drops, having diameter *D* = 30 lattice sites, of a fluid (red component) are dispersed in a second fluid (blue component), with the viscosity ratio reported above and densities equal to one for both the components. The system is placed within a cubic box of linear size *L* = 128 and the drops are located at a distance such that reciprocal interactions are negligible. The impact parameter (i.e., the distance between the collision trajectories perpendicular to their relative direction of motion) is *b* = 0.85, the strength of near-contact forces preventing coalescence is *A*_{h} = 0.1, while other values can be found in Table I of Ref. 33. In Fig. 1, we show the time sequence of such a collision simulated using an LB approach without populations [(a)–(e)] and with populations [(f)–(j)]. In both cases, the relative impact velocity *U*_{rel} is set to 0.5, while Reynolds and Capillary numbers are *Re*_{R} = 450 and *Ca*_{R} ≃ 1.6 for the droplet (red fluid), and *Re*_{B} = 50 for the surrounding medium (blue fluid). The two drops initially approach and then impact without merging [Figs. 1(b), 1(c), 1(g), and 1(h)], since the repulsive near-contact interactions prevent the coalescence and enable the formation of a thin film that separates the drops.^{33} During the bouncing, the shape of the drops considerably departs from the initial spherical geometry, temporarily flattening where opposite interfaces come into close contact. Afterward, they separate and gradually reacquire a spherical shape when their reciprocal distance is sufficiently high.

The time sequence of Fig. 1 shows that the two numerical approaches with and without populations essentially lead to equivalent dynamic behaviors. On a quantitative basis, this result is corroborated by evaluating, for example, the time evolution of the density field $\u27e8\rho \u27e9|\rho >0.5$ shown in Fig. 2, where ⟨⋯⟩ represents a spatial average computed over lattice points for which *ρ* > 0.5 (i.e., within both drops, the red regions in Fig. 1). Larger fluctuations occur only at early times, i.e., when the droplets come into contact, while they progressively shrink as the droplets move away. Note that the differences between the two plots remain marginal over the full process. In Fig. 3, we compute, at two different lattice sites, the time evolution of the relative error $|\rho N\u2212\rho P|\rho NMAX$, where *ρ*_{N} and *ρ*_{P} are the densities of the LB without and with populations, while $\rho NMAX$ is the maximum value of *ρ* calculated using LB without populations. Unlike the previous one, this quantity provides a *local* measure of potential discrepancies between the two approaches. The plot shows that the relative error remains rather small over time, a further indication that the procedure of reconstructing the populations proposed in this work preserves, at a high level of accuracy, the physics of the systems.

The approach without populations has been also tested in simulations of a head-on collision between two fluid droplets (see Fig. 4). Here, *b* = 0, *U*_{rel} = 0.25, and *D* = 30. Once again, the physics is correctly reproduced and no appreciable discrepancies emerge, as also shown in Fig. 5 where the time evolution of ⟨*ρ*⟩_{ρ>0.5} is plotted. Note incidentally that, in this case, the interaction time is considerably longer than before since, after the bouncing, the drops remain in close contact for a longer period of time.

The numerical experiments discussed so far present results in which near-contact interactions function as an effective force field preventing droplet coalescence. But what if these forces are turned off? In Fig. 6, we simulate precisely this case by setting *A*_{h} equal to zero (other parameters are the same as those of Fig. 1). The drops, initially static and perfectly spherical, approach and connect through a tiny neck of red fluid, a region in which capillary forces are strong enough to prevent its breakup. Afterward, the two droplets merge into a single one that slowly attains an equilibrated spherical shape. Once again, both methods with and without populations describe the physics of the process with negligible differences, as also demonstrated in Fig. 7, where early times density fluctuations capture the dynamics of droplet coalescence.

### A. Performance data

^{®}Tesla

^{®}V100 Tensor Core equipped with 16 GB of GPU memory. We exploit the Mega Lattice Updates Per Second (MLUPS) metrics to probe the efficiency. In particular, the definition of MLUPS reads

*t*

_{s}is the run (wall-clock) time (in seconds) per single time step iteration, while

*L*

_{x},

*L*

_{y}, and

*L*

_{z}are the domain sizes. The current version without populations runs at about 622 MLUPS against the 277 MLUPS of the standard CGLB. Several specific CUDA optimization techniques have been developed to improve the performance, to be detailed in a future and more technical presentation. The main difficulty is the non-locality in the pulled streaming step, which requires reading the hydrodynamic fields (11 arrays) in the 18 neighbor points for a total of 198 data accesses (792 bytes in single precision). Although the readings are scattered, neighbor threads share several readings of the hydrodynamic fields. Consequently, threads in the same thread block may cooperate in the data access by using shared memory. Nonetheless, the shared memory is set to 48 Kbytes by default, and it is configurable up to 96 Kbytes per streaming multiprocess in the NVIDIA V100 card. In the present work, we exploit the shared memory to store only the density fields, obtaining a performance improvement of about 20 percent. In this context, the larger size of shared memory (up to 164 Kbytes) in the NVIDIA Ampere

^{®}A100 Tensor Core GPU will probably grant a significant performance gain, allowing to store also the velocity vector in the shared memory. Finally, the main advantage is observed in the GPU memory usage that is equal to 0.6782 GB in the version without populations compared to the version with populations where it is equal to 1.1411 GB, $\u223c40%$ of GPU memory saving. Note that the advantage is strictly connected to the three-dimensional lattice model implemented in the code. A 40% memory saving is highly appealing on very large-scale simulations with multi-billion grid points since it amounts to hundreds of Gbytes savings.

^{50}More importantly, such savings are destined to become much more significant for the case of multiple species, such as those occurring in many natural and industrial applications.

## V. SUMMARY AND OUTLOOK

In summary, we have introduced a regularized lattice Boltzmann method for the efficient simulation of soft materials that reconstructs the distribution functions from hydrodynamics quantities (density, momentum, and pressure tensor) with no need to store the full set of populations. This entails a significant decrease in memory requirements and a substantial enhancement of the computational performance, features of particular relevance for large-scale simulations. The accuracy of the method has been investigated in numerical experiments of relevance to soft matter, such as the collision of fluid droplets. For such cases, the formulation without populations shows satisfactory numerical stability and negligible differences with respect to a standard reconstruction-free LB approach. In practical terms, 40% memory reduction essentially means doubling the volume of the simulation, which would save roughly 4 TB memory, for the case of a simulation demanding about 10 TB of data.^{50–52} It would be of interest to further test the model on more complex states of matter, such as colloidal fluids and fluid droplets flowing within microfluidic channels, where the implementation of solid boundaries is often necessary.^{53} On a more general basis, our results could be helpful for the design of novel high-performance computing methods in soft matter, where customized numerical schemes are fundamental to tackle phenomena occurring over a broad spectrum of scales in space and time.

## ACKNOWLEDGMENTS

The research leading to these results has received funding from MIUR under the project “3D-Phys” (Grant No. PRIN 2017PHRM8X) and from the European Research Council under the European Union’s Horizon 2020 Framework Program (Grant No. FP/2014–2020)/ERC Grant Agreement No. 739964 (“COPMAT”). We acknowledge CINECA Project No. IsB25_DIW3F under the ISCRA initiative, for the availability of high-performance computing resources and support. F.B. acknowledges funding by the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Program (Grant Agreement No. 882340). M.L. acknowledges the support of the Italian National Group for Mathematical Physics (GNFM-INdAM).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Adriano Tiribocchi**: Software (equal); Validation (equal); Visualization (lead); Writing – original draft (lead). **Andrea Montessori**: Formal analysis (equal); Software (equal); Validation (equal); Writing – review & editing (equal). **Giorgio Amati**: Software (equal); Writing – review & editing (equal). **Massimo Bernaschi**: Software (equal); Writing – review & editing (equal). **Fabio Bonaccorso**: Software (equal); Writing – review & editing (equal). **Sergio Orlandini**: Software (equal); Writing – review & editing (equal). **Sauro Succi**: Formal analysis (equal); Supervision (equal); Writing – review & editing (equal). **Marco Lauricella**: Conceptualization (lead); Formal analysis (equal); Methodology (lead); Software (equal); Supervision (equal); Validation (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data supporting this study’s findings are available from the corresponding author upon reasonable request.

## REFERENCES

*Using HPC for Computational Fluid Dynamics: A Guide to High Performance Computing for CFD Engineers*

*The Lattice Boltzmann Equation: For Complex States of Flowing Matter*

*Relativistic Hydrodynamics*

*The Mathematical Theory of Non-uniform Gases: An Account of the Kinetic Theory of Viscosity, Thermal Conduction and Diffusion in Gases*

*The Lattice Boltzmann Method*