We extend our linear-scaling approach for the calculation of Hartree–Fock exchange energy using localized *in situ* optimized orbitals [Dziedzic *et al.*, J. Chem. Phys. **139**, 214103 (2013)] to leverage massive parallelism. Our approach has been implemented in the onetep (Order-N Electronic Total Energy Package) density functional theory framework, which employs a basis of non-orthogonal generalized Wannier functions (NGWFs) to achieve linear scaling with system size while retaining controllable near-complete-basis-set accuracy. For the calculation of Hartree–Fock exchange, we use a resolution-of-identity approach, where an auxiliary basis set of truncated spherical waves is used to fit products of NGWFs. The fact that the electrostatic potential of spherical waves (SWs) is known analytically, combined with the use of a distance-based cutoff for exchange interactions, leads to a calculation cost that scales linearly with the system size. Our new implementation, which we describe in detail, combines distributed memory parallelism (using the message passing interface) with shared memory parallelism (OpenMP threads) to efficiently utilize numbers of central processing unit cores comparable to, or exceeding, the number of atoms in the system. We show how the use of multiple time-memory trade-offs substantially increases performance, enabling our approach to achieve superlinear strong parallel scaling in many cases and excellent, although sublinear, parallel scaling otherwise. We demonstrate that in scenarios with low available memory, which preclude or limit the use of time-memory trade-offs, the performance degradation of our algorithm is graceful. We show that, crucially, linear scaling with system size is maintained in all cases. We demonstrate the practicability of our approach by performing a set of fully converged production calculations with a hybrid functional on large imogolite nanotubes up to over 1400 atoms. 
We finish with a brief study of how the employed approximations (exchange cutoff and the quality of the SW basis) affect the calculation walltime and the accuracy of the obtained results.

## I. INTRODUCTION

Owing to its favorable balance of accuracy and relatively low computational cost, Kohn–Sham (KS) density functional theory (DFT) is a widely used technique in many branches of computational chemistry and materials science.^{1} The accuracy of DFT crucially depends on the approximations invoked in the exchange–correlation (XC) functional used. Hybrid functionals, which include a fraction of Hartree–Fock exchange (HFx), are among the most accurate functionals in use today, offering an elegant way of reducing the self-interaction error and leading to a more faithful description of geometries and of several properties, such as bond energies and bandgaps, particularly for metal oxides.^{2}

where {*ψ*_{i}} are the canonical molecular orbitals (MOs), *z*_{i} are their occupancies, and *N*_{MO} is the total number of molecular orbitals present in the calculation. Given that MOs extend throughout the entire system and that the Coulomb operator is long-ranged, the cost of each volume integration in Eq. (1) is proportional to the size of the system, whether measured by the number of atoms *N* or the number of molecular orbitals *N*_{MO} (which is ∝ *N*). The presence of a *double* integral over volume together with a *double* sum over MOs makes a direct calculation of *E*_{HFx} scale as *O*(*N*^{4}).

Here, {*φ*_{α}} is a set of localized orbitals, and **K** is the representation of the single-particle density matrix in the duals of {*φ*_{α}}, known as the density kernel. The density kernel is, in general, a spin-dependent quantity. Here and in the text that follows, we will omit spin-dependence for clarity and brevity of notation.

Pre-screening electron repulsion integrals (ERIs), that is, avoiding their evaluation if they are deemed to be zero or below a given threshold, forms the basis of a number of methods for reducing the computational cost of HFx, particularly for Gaussian basis sets. Examples of such approaches include LinK^{3} and Order-N exchange (ONX).^{4} It has also been recognized that rigorous upper bounds for integrals are of secondary importance, with tighter estimates of integral values permitting better control of the precision of calculating HFx.^{5} Another approach is to use a truncated Coulomb operator to evaluate the ERIs, which makes them short-ranged. The long-range contribution can then be recovered in the form of a systematically improvable correction.^{6}

In the context of extended basis sets, such as plane waves, the non-locality of the exchange operator makes the calculation of HFx more challenging. Nevertheless, suitable techniques have been proposed for calculations on periodic systems^{7–9} with some of them being linear-scaling.^{10,11} Some of these approaches employ mixed basis sets (e.g., Gaussians and plane waves^{12,13} or Wannier functions and plane waves^{10,11}).

In the context of localized atom-centered basis sets, such as Gaussians or numerical atomic orbitals (NAOs), methods based on the resolution of the identity (RI) are commonly employed to make the computational effort associated with computing HFx more manageable. These techniques, pioneered in the 1970s^{14,15} and particularly popular in the field of correlated wavefunction methods,^{16–18} expand pair products of atomic orbitals in an auxiliary basis whose functions are similarly atom-centered.^{19}

One of the most notable developments in this area is the robust fitting formula of Dunlap,^{20,21} which, when used instead of directly replacing the pair products with their RI fits, ensures that the resultant error is bilinear in the error of the fitted products,^{19} with the linear error removed. Dunlap *et al.* are also credited with establishing^{22} that using the Coulomb metric in the fitting offers more accurate energies than other metrics, such as the overlap metric.

In recent years, certain difficulties associated with robust fitting have been recognized, and solutions or workarounds have been proposed. Merlot *et al.*^{19} observed that the two-electron integral matrix is not manifestly positive semidefinite under certain conditions. They proposed a pair-atomic resolution of identity (PARI) approach based on local fitting of either the bra or the ket side of the ERI, combined with the robust correction, to achieve quadratic accuracy.^{20} Tew^{23} recently proposed a “quasi-robust” local density fitting approach that addresses issues with undesired long-range behavior when the auxiliary basis is incomplete. Sodt and Head-Gordon^{24} proposed a local modification to RI that yields energies that are differentiable with respect to nuclear positions.

In this paper, we present a massively parallel approach for the efficient calculation of Hartree–Fock exchange in linear-scaling time. The technique employs the resolution of identity and uses truncated spherical waves as the auxiliary basis, with only the ket side of Eq. (4) being fitted, while the products in the bra are unexpanded. It is based on our previous developments,^{25} where the original implementation was serial. In Sec. II, we first briefly outline the basics of onetep (Order-N Electronic Total Energy Package)—the linear-scaling reformulation of KS-DFT in which our approach is implemented—followed by a description of the theoretical basis of our method. In Sec. III, we describe the implementation details of the algorithm, devoting particular attention to the parallel data distribution and time-memory trade-offs that enable its efficient parallelization. Section IV describes the setup of the calculations we performed to benchmark our method.

In Sec. V, we show the results of these calculations, demonstrate that our algorithm is, indeed, linear-scaling, compare calculation walltimes against non-hybrid functionals, and investigate how individual components of the calculation scale. We show excellent strong and weak parallel scaling of our approach, with superlinear speed-ups in many cases. We briefly demonstrate how preconverging the calculation with a non-hybrid functional before continuing with a hybrid functional significantly shortens calculation time with negligible loss of accuracy. We conclude this section with a demonstration of the feasibility of the proposed approach by performing calculations with the B3LYP hybrid functional on imogolite nanotube systems with 1416 atoms. We finish with Sec. VI, which contains conclusions and thoughts about future work. In the supplementary material, we show how the additional approximations (the use of an exchange cutoff and the finiteness of the auxiliary basis) are controllable and how the associated errors are very small.

## II. THEORY

### A. ONETEP

onetep^{26} reformulates Kohn–Sham DFT^{27} in terms of the single-particle density matrix, *ρ*(**r**, **r**′), which is represented in separable form as *ρ*(**r**, **r**′) = *φ*_{α}(**r**) *K*^{αβ} *φ*_{β}*(**r**′), where {*φ*_{ϵ}} are the non-orthogonal generalized Wannier functions (NGWFs)^{28} centered at points **r**_{ϵ}, which coincide with nuclear coordinates (*ϵ* being a generic NGWF index). The NGWFs are strictly localized within spherical regions with radii {*R*_{ϵ}}.

**K** is the density kernel, a sparse contravariant matrix, whose elements *K*^{αβ} are nonzero only if |**r**_{α} − **r**_{β}| < *r*_{K}, where *r*_{K} is a real-space cutoff length, known as the density kernel cutoff.

The NGWFs are in turn expanded in a basis of periodic cardinal sine (psinc) functions,^{29} *φ*_{ϵ}(**r**) = Σ_{m} *c*_{mϵ} *D*_{m}(**r**), where *D*_{m}(**r**) = *D*(**r** − **r**_{m}) and *m* runs over the points **r**_{m} of the real-space Cartesian grid (the centers of the psinc functions) inside the localization region *LR*(*ϵ*) of *φ*_{ϵ}. The psinc functions form an orthogonal basis and are related to plane waves by a Fourier transform, thus sharing many of their desirable properties, notably the independence from the nuclear coordinates and the ability of the basis set to be systematically improved by increasing a single parameter: the kinetic energy cutoff.

The total energy is minimized self-consistently with respect to the density kernel elements *K*^{αβ} and the NGWF expansion coefficients *c*_{mϵ} under the constraints of the idempotency of the density matrix and conservation of the number of electrons, *N*_{e}.

In typical ONETEP calculations, this is done in two nested loops. In the inner loop, **K** is optimized via a modified Li–Nunes–Vanderbilt (LNV) algorithm^{30–32} with the NGWFs fixed. The outer loop optimizes the NGWF expansion coefficients *c*_{mϵ} through gradient-based energy minimization. The fact that the NGWFs remain fixed in the inner loop (and thus during the majority of energy evaluations) will play a crucial role in the code optimizations employed in the calculation of the Hartree–Fock exchange energy.
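The two-loop structure can be illustrated on a toy convex energy; the quadratic model, learning rate, and iteration counts below are illustrative stand-ins, not onetep's actual LNV or NGWF updates:

```python
def alternating_minimize(grad_outer, grad_inner, x, y,
                         lr=0.2, outer_iters=50, inner_iters=20):
    """Schematic two-loop minimizer: the inner loop (analogous to LNV
    kernel optimization) relaxes y with x held fixed; the outer loop
    (analogous to NGWF optimization) then takes one gradient step in x."""
    for _ in range(outer_iters):
        for _ in range(inner_iters):      # inner loop: x ("NGWFs") fixed
            y -= lr * grad_inner(x, y)
        x -= lr * grad_outer(x, y)        # outer step: update x
    return x, y

# Toy energy E(x, y) = (x - 1)^2 + (y + 2)^2 + 0.5*x*y, minimized at
# x = 1.6, y = -2.4 (found by setting both gradients to zero).
gx = lambda x, y: 2.0 * (x - 1.0) + 0.5 * y
gy = lambda x, y: 2.0 * (y + 2.0) + 0.5 * x
x_opt, y_opt = alternating_minimize(gx, gy, x=0.0, y=0.0)
```

As in onetep, keeping one set of variables fixed throughout the inner loop is what makes caching of quantities that depend only on those variables worthwhile.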

### B. Hartree–Fock exchange in ONETEP

Having defined the exchange matrix **X**, we can express *E*_{HFx} [Eq. (3)] simply as a contraction of **X** with the density kernel [Eq. (11)]. The potential *u*_{γβ}(**r**) [Eq. (6)] can be obtained in reciprocal space^{33} or by solving the Poisson equation. Here, the difficulty lies in the fact that this has to be done for each pair-atomic quantity *φ*_{γ}*(**r**′)*φ*_{β}(**r**′) and that the potential *u*_{γβ}(**r**) is long-ranged (cf. Fig. 1). The latter precludes the use of onetep’s traditional tool, the FFT box,^{34} as the FFT box would have to coincide with the entire simulation cell. This approach is quadratically scaling with a very large prefactor.^{34}

The use of a finite exchange cutoff, e.g., by assuming *X*_{αβ} to vanish when |**r**_{α} − **r**_{β}| > *r*_{X}, where *r*_{X} is the exchange cutoff length, makes this approach linear-scaling, but the prefactor remains prohibitively large.

We now briefly recount the linear-scaling approach actually used in onetep. For more details, we refer the reader to Ref. 25 where this approach was first described.

To make the calculation tractable, we first define the electrostatic metric matrix **V**, with elements given in Eq. (12). Using the elements *V*^{ps} of the inverse metric matrix **V**^{−1}, we can define a resolution-of-identity (RI) operator, a technique widely used^{14–18,20,22–24,35–39} with the aim of replacing four-center ERIs with more tractable three-center integrals.

The auxiliary basis consists of truncated spherical waves (SWs),^{25} products of spherical Bessel functions *j*_{l}(*qr*) and spherical harmonics that are zeroed outside a localization sphere, where *m* is an integer from the interval [−*l*, *l*], and *a* is the radius of the sphere where the zero boundary condition is imposed. In onetep’s implementation, we assume ∀*ϵ* *R*_{ϵ} = *a*, that is, all NGWF localization radii are identical and equal to *a*. The value of *q* is chosen in such a way that *j*_{l}(*qa*) = 0, and the suitable values of *q* depend on the angular momentum index *l*. The maximum values of *q* and *l* are limited by the kinetic energy cutoff and the corresponding grid spacing. In typical scenarios, it is sufficient to truncate the SW basis at *l*_{max} = 3 and *q*_{max} = 12, where *q*_{max} is the number of different values of *q* used for each *l*. In the text that follows, we will use a single index (*p* or *s*) for the SWs for simplicity. This index covers all the possible combinations of *l*, *q*, and *m* and runs from 1 to *N*_{SW}. We will use *u*_{p}(**r**) to denote the potential of a SW (“SWpot” in the further text), which is known analytically^{33} [Eq. (17)].
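With these truncations, the flattened SW index *p* simply enumerates all (*l*, *q*, *m*) combinations. The bookkeeping can be sketched as follows; the specific loop order is our illustrative choice, not necessarily onetep's:

```python
def enumerate_sws(l_max=3, q_max=12):
    """Map each (l, q, m) combination to a flat SW index p = 1..N_SW."""
    index = {}
    p = 0
    for l in range(l_max + 1):
        for q in range(1, q_max + 1):      # q_max radial functions per l
            for m in range(-l, l + 1):     # 2l + 1 angular components
                p += 1
                index[(l, q, m)] = p
    return index

sw_index = enumerate_sws()
# N_SW = q_max * sum_{l=0}^{l_max} (2l + 1) = 12 * 16 = 192
n_sw = len(sw_index)
```

The resulting count of 192 SWs per center is consistent with the "≈200 SWpots" figure quoted later in Sec. III B.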

An important issue is the choice of centers for the SWs used in the expansion of *φ*_{γ}*(**r**′)*φ*_{β}(**r**′). Using SWs centered only on NGWF *γ* or only on NGWF *β* breaks symmetry in Eq. (4) because it leads to *c*_{γβ} ≈ *c*_{βγ} rather than *c*_{γβ} = *c*_{βγ}. Using SWs centered on both *β* and *γ* alleviates this problem, but there is still the problem of broken *αδ* ↔ *γβ* symmetry, i.e., the products in the bra of the ERI Eq. (4) are then exact, while the products in the ket are approximate (fitted). To formally satisfy all the symmetry requirements, one can resort to using SWs centered on all four atoms (*α*, *β*, *γ*, and *δ*) in the expansion, which keeps the fitting domain identical between all permutations of *α*, *β*, *γ*, and *δ* in the ERI. Such a fit is then robust in the sense of Ref. 20, that is, the fitted integral does not contain any terms linear in the error in the fitted densities. However, in our case, this is impracticable. Even though it maintains linear-scaling, using four centers in the expansion leads to prefactors that are prohibitively large because the product *φ*_{γ}*(**r**)*φ*_{β}(**r**) then needs to be re-expanded every time *α* or *δ* changes in Eq. (14).

Instead, we settle for a two-center expansion (SWs centered on *β* and *γ*), recovering the *αδ* ↔ *γβ* symmetry by symmetrizing the approximate (SW-expanded) exchange matrix, X̃ [Eq. (19)].^{25} While the underlying expression formally includes SWs centered on *α* and *δ*, we stress that the symmetrization procedure [Eq. (19)] makes it sufficient to only expand using SWs on *β* and *γ*. We have shown^{25} that, in practice, any differences between the elegant but prohibitively expensive four-center fit and the two-center fit we employ are negligible, although our approach is no longer robust in the sense of Ref. 20.
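Equation (19) is not reproduced here; assuming the symmetrization takes the usual form **X** ← (X̃ + X̃ᵀ)/2 (our assumption for real NGWFs, for which the *αδ* ↔ *γβ* swap amounts to a transpose), a minimal sketch reads:

```python
def symmetrize(x_tilde):
    """Restore alpha-delta <-> gamma-beta symmetry of the SW-expanded
    exchange matrix by averaging it with its transpose."""
    n = len(x_tilde)
    return [[0.5 * (x_tilde[i][j] + x_tilde[j][i]) for j in range(n)]
            for i in range(n)]

sym = symmetrize([[1.0, 2.0], [4.0, 3.0]])
```

The symmetrized result is, by construction, invariant under the index swap, regardless of which side of the ERI was fitted.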

An important advantage of the two-center expansion is that it only includes SWs centered on atoms whose NGWFs overlap, making the electrostatic matrix in Eq. (12) effectively sparse (even though the potential of a SW is not localized). This is necessary for the approach to be linear-scaling.

In brief, the calculation proceeds as follows. We expand each ket product of NGWFs *β* and *γ* in SWs and use the resultant potential in integrals over the bra products of NGWFs *α* and *δ*; contracting with the density kernel over *γ* and *δ*, we obtain an element of the exchange matrix, X̃_{αβ} [Eq. (15)]. In the final step, we contract the exchange matrix with the negative of the density kernel **K** over *α* and *β* according to Eq. (11) to obtain the Hartree–Fock exchange energy.

For every exchange matrix element [a pair of indices (*α*, *β*)], the cost of the above calculation is asymptotically constant (independent of system size). This is because all the NGWFs are strictly localized and their overlap matrix is sparse. For our approach to be linear-scaling, the number of pairs (*α*, *β*) must increase linearly with system size, i.e., the exchange interaction must be truncated. This is a standard practice in linear-scaling methods.^{40} Our method employs a simple distance-based cutoff, where the exchange matrix **X** is made sparse by neglecting contributions from pairs (*α*, *β*) when |**r**_{α} − **r**_{β}| > *r*_{X}, where *r*_{X} is an assumed real-space cutoff for exchange. We have shown^{25} that even in more demanding applications, *r*_{X} ≈ 20*a*_{0}, or even less, is sufficient to keep the truncation error below 0.01%.
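The effect of the distance-based cutoff can be illustrated on a toy 1D chain of atoms; positions, spacing, and cutoff below are arbitrary illustrative values. Because each atom has a bounded number of neighbors within *r*_{X}, the number of retained pairs grows linearly with chain length:

```python
def x_neighbor_pairs(positions, r_x):
    """Return the (i, j) atom pairs whose exchange-matrix block is kept,
    i.e. those with |r_i - r_j| <= r_x; all other blocks are neglected."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    return {(i, j)
            for i, ri in enumerate(positions)
            for j, rj in enumerate(positions)
            if dist(ri, rj) <= r_x}

# Toy chain: 6 atoms spaced 10 a0 apart, cutoff r_X = 20 a0, so each atom
# couples only to itself and to neighbors within two lattice spacings.
chain = [(10.0 * i, 0.0, 0.0) for i in range(6)]
pairs = x_neighbor_pairs(chain, r_x=20.0)
```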

Optimizing the NGWFs *in situ* necessitates calculating the gradient of the energy with respect to the NGWF expansion coefficients (which, mathematically, is the functional derivative of the energy with respect to the complex conjugate of some NGWF *α*). The derivation of the relevant expression for Hartree–Fock exchange was presented in Ref. 25; here, for the sake of brevity, we only recount the final form in Eq. (23).

In Eq. (23), *I*_{κλ} denotes the resolution-of-identity operator in terms of SWs centered on NGWFs *κ* and *λ*.

The first term in Eq. (23) involves summing large numbers of expansions of pair NGWF products *φ*_{γ}*φ*_{β} in terms of SWs centered on these NGWFs. In the second, more involved term, linear combinations of NGWF products are expanded in SWs centered not on these NGWFs, but rather on the NGWF with respect to which we differentiate (*α*) and the NGWFs that overlap with it (*δ*).

The expression in Eq. (23) is the *contravariant* gradient, which cannot be directly used to update the NGWFs *φ*_{α}, which are *covariant*. To achieve tensorial correctness, it must be converted to a covariant gradient: *G*_{ϵ}(**r**) = *G*^{α}(**r**)*S*_{αϵ}.
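The conversion *G*_{ϵ}(**r**) = *G*^{α}(**r**)*S*_{αϵ} is a plain contraction with the overlap matrix. A minimal sketch, with the gradient stored as one row of grid-point samples per NGWF (the storage layout is our illustrative assumption):

```python
def covariant_gradient(g_contra, s):
    """Convert the contravariant NGWF gradient to its covariant form,
    G_eps(r) = G^alpha(r) * S_{alpha eps}.
    g_contra[alpha][p] is the gradient of NGWF alpha at grid point p."""
    n_ngwf = len(s)
    n_pts = len(g_contra[0])
    return [[sum(g_contra[a][p] * s[a][e] for a in range(n_ngwf))
             for p in range(n_pts)]
            for e in range(n_ngwf)]

g = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
```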

### C. Summary of truncation parameters

As an aid to the reader, in Table I, we recount and briefly describe the truncation parameters used in ONETEP and in this work.

| Parameter | Symbol | Typical value | Used in | Description |
|---|---|---|---|---|
| Density kernel cutoff | *r*_{K} | 50–100 *a*_{0} | onetep | Distance between two atomic centers beyond which the density matrix is assumed to vanish. |
| NGWF localization radius | *R*_{ϵ} | 8–10 *a*_{0} | onetep | Radius of an atom-centered sphere beyond which the NGWF is assumed to be strictly zero. |
| Exchange cutoff | *r*_{X} | 20–25 *a*_{0} | HFx method | Distance between two atomic centers beyond which the exchange matrix is assumed to vanish. |

## III. IMPLEMENTATION

In this section, we describe the parallel implementation of our algorithm, highlighting choices of parallel decompositions and time-memory trade-offs that make it efficient and massively parallelizable.

### A. Preliminary comments and notation

We begin with a minor implementation detail that will gain more importance later in the text. As mentioned earlier [cf. Eq. (9)], in onetep, the NGWFs are stored as psinc expansion coefficients on a Cartesian grid. In practice, we use a so-called parallelepiped representation, illustrated in Fig. 2.

The simulation cell is tiled into suitably sized parallelepipeds (PPDs). Each NGWF localization region is encompassed by a number of PPDs (shaded areas in the diagram). The psinc coefficients are stored for all points in these PPDs, arranged into a contiguous 1D array. Such a packed representation, while introducing some overhead (points within PPDs but outside of the NGWF sphere), makes it easy to pass NGWF data between message passing interface (MPI) ranks and enables efficient processing. In the current implementation of HFx, this representation is also used for NGWF overlaps (dark-shaded area in the diagram).
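The packing of psinc coefficients into PPD-ordered contiguous storage can be sketched as follows; representing the grid as a Python dict keyed by integer grid coordinates is purely illustrative (onetep stores dense arrays):

```python
def pack_ppds(grid_values, ppd_shape, ppd_origins):
    """Pack psinc coefficients from the PPDs covering an NGWF sphere
    into one contiguous 1D list (points inside a PPD but outside the
    sphere are included, which is the overhead mentioned above)."""
    nx, ny, nz = ppd_shape
    packed = []
    for (ox, oy, oz) in ppd_origins:          # one PPD at a time
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    packed.append(grid_values[(ox + i, oy + j, oz + k)])
    return packed

# Toy grid: four points along x, covered by two 2x1x1 PPDs.
grid = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (2, 0, 0): 3.0, (3, 0, 0): 4.0}
packed = pack_ppds(grid, (2, 1, 1), [(0, 0, 0), (2, 0, 0)])
```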

In the text that follows, we will use a convention where capital letters (*A*, *B*, *C*, *D*, *I*, *J*, and *M*) denote atoms. Lowercase letters (*a*, *b*, *c*, and *d*) will be used to index NGWFs on a corresponding atom; for example, *a* will count NGWFs on atom *A*. For indexing NGWFs globally, we will use Greek letters (*α*, *β*, *γ*, *δ*, *ϵ*, and *ζ*) like we have already done in the Introduction. We will occasionally switch between these two ways of indexing NGWFs (*Aa* ≡ *α*, etc.) as some concepts are more easily explained in one notation or the other. In addition, implicit summation will be assumed only over repeated Greek indices.

Matrix sparsity plays an important role in the algorithm we present. The sparsity of **S** (due to strict localization of NGWFs), the sparsity of **K** (due to the assumed kernel cutoff), and the sparsity of X̃ (due to the assumed exchange cutoff) are all crucial for achieving linear scaling. In the text that follows, it will often be useful to think of matrix sparsity patterns in terms of pairs of atoms that are either *within* a sparsity pattern of a matrix (meaning that the corresponding matrix element is non-zero) or *outside* it (meaning that the corresponding element is zero and is not stored). We will use the following terminology to describe this: two atoms whose NGWFs overlap (and so the corresponding **S** element is non-zero) will be termed *S*-neighbors. Two atoms whose centers are within the exchange cutoff (and so the corresponding X̃ element is non-zero) will be termed *X*-neighbors. The relations of being an *S*-neighbor or *X*-neighbor are commutative. Finally, by saying an atom *I* is an *X*–*S*-neighbor of atom *J*, we will mean that there exists an atom *M* such that *J* and *M* are *X*-neighbors and *M* and *I* are *S*-neighbors. This is best understood by referring to Fig. 1, where atom *D* (NGWF *δ*) is an *X*–*S*-neighbor of atom *B* (NGWF *β*). Note that this relation is not, in general, commutative.
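The three neighbor relations can be sketched on toy 1D atomic positions; the positions and cutoffs below are arbitrary and serve only to show that the *X*–*S* relation need not be commutative:

```python
def neighbors(positions, cutoff):
    """Symmetric distance-based relation: i ~ j iff |r_i - r_j| < cutoff."""
    n = len(positions)
    return {(i, j) for i in range(n) for j in range(n)
            if abs(positions[i] - positions[j]) < cutoff}

def xs_neighbors(x_rel, s_rel, n):
    """I is an X-S-neighbor of J iff some atom M is an X-neighbor of J
    and an S-neighbor of I (note the fixed order of the two hops)."""
    return {(i, j) for i in range(n) for j in range(n)
            if any((j, m) in x_rel and (m, i) in s_rel for m in range(n))}

pos = [0.0, 10.0, 12.0]          # toy 1D positions
x_rel = neighbors(pos, 11.0)     # "X-neighbors" (exchange cutoff)
s_rel = neighbors(pos, 3.0)      # "S-neighbors" (NGWF overlap)
xs = xs_neighbors(x_rel, s_rel, 3)
```

Here atom 2 is an *X*–*S*-neighbor of atom 0 (via atom 1), but atom 0 is not an *X*–*S*-neighbor of atom 2, illustrating the non-commutativity.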

### B. Hybrid MPI-OpenMP parallelism

onetep employs hybrid parallelism for all its major algorithms.^{41} Distributed-memory process-based (MPI^{42}) parallelism is used in two ways: for the geometric decomposition of the simulation cell (where each MPI rank deals with a number of slabs comprising the cell) and to divide atoms across MPI ranks (each MPI rank “owns” or is responsible for a subset of atoms). Such a scheme is naturally limited in the number of MPI ranks that can be used—performance begins to deteriorate once the number of MPI processes exceeds the number of slabs in the cell (because some ranks no longer have any work to do) or when the number of MPI ranks exceeds the number of atoms in the system (at which point onetep cannot be run). This problem can be alleviated by reducing the number of MPI ranks and, instead, having each rank spawn a number of threads to saturate the available central processing unit (CPU) cores.

Shared-memory thread-based parallelism (OpenMP^{43}) is then used on a finer scale to subdivide larger grains of work among threads. There are two main advantages to using such a hybrid model: one is the ability to utilize a large number of CPU cores without significant loss of performance; the other stems from the fact that threads share an address space, so the memory load associated with quantities that would normally be replicated across MPI ranks is lowered when fewer MPI ranks are used. This second advantage does not play a large role in onetep, as the most memory-intensive quantities are distributed, not replicated, across MPI ranks.

The two main bottlenecks in onetep’s approach to Hartree–Fock exchange are as follows: (a) the evaluation of SWpots [Eq. (17)] from analytical expressions involving trigonometric and radial terms (see Ref. 33 for details) and (b) the evaluation of SW-expanded potentials of NGWF pair products [Eq. (21)]. Both of these are performed repeatedly, often, for the same parameters. For example, Eq. (21) will be evaluated with the same *γ* and *β* for multiple combinations of *α* and *δ*, of which it is independent.

Naturally, this opens up opportunities for time-memory trade-offs, where the already calculated results are cached in memory and subsequently reused, reducing the computational cost to mere look-ups. However, this has to be done carefully. Both quantities considered here (SWpots and expanded potentials) require very large amounts of memory to store. Further in the text (Sec. III E), we outline how the requirements scale with the system size; here, we will only provide an estimate—in typical scenarios, this memory requirement begins to exceed 1 TB at about 400 atoms, just at the onset of linear scaling in the evaluation of HFx. This is a requirement on total memory, and so it could, in principle, be divided across multiple compute nodes. In today’s high performance computing environments, using this quantity of distributed memory is practical since compute nodes are routinely equipped with 128–512 GB of memory.

Not to be neglected, however, is the cost of accessing such *distributed* cached data—accessing data that are not local to an MPI rank entails interprocess communication. The calculation of SWpots in onetep has been extensively optimized—calculating the ≈200 SWpots that are required in typical scenarios in one PPD that contains 125 grid points takes only about 5 *µ*s, and so any attempt to access a cached copy from a remote MPI rank would be much slower. For this reason, our algorithm relies on caching all relevant data *locally* (within the same MPI rank), even if this means replicating some data across ranks.

To minimize this replication and to increase the RAM available for the cache, it stands to reason to use as many OpenMP threads and as few MPI ranks as possible, as threads can then share a large cache without any need for message passing. Suitable thread synchronization mechanisms must be used when populating the cache to avoid data races, but once the cache is populated and becomes read-only, no synchronization is needed.

For the above approach to be efficient, our algorithm had to be optimized for large numbers of threads. This is accomplished by ordering operations in such a way as to move OpenMP parallelism as high in the loop structure as possible, making computational grains larger and by avoiding synchronization mechanisms (critical sections) whenever possible. As we will demonstrate in Sec. V, the result is an algorithm that scales well to at least tens of OpenMP threads per process and at least thousands of CPU cores.

Another component crucial for massive parallelism is load balancing. As described above, onetep distributes atoms across MPI ranks. Looking at relevant expressions [e.g., Eqs. (18), (21), and (23)], it becomes apparent that the key quantity being processed is the kets *φ*_{γ}*φ*_{β}. For this reason, our algorithm distributes atom *pairs* (*B*, *C*) rather than atoms as in onetep’s original parallel distribution.^{44} In the text that follows, we will term onetep’s original distribution “scheme 1,” while the scheme used by the HFx algorithm will be termed “scheme 2.” The quantities that are inputs to the algorithm, such as NGWFs, density kernel elements, and the metric matrix **V**, will need to be redistributed from scheme 1 to scheme 2 before actual processing begins, and the quantities produced by the algorithm [the exchange matrix X̃ and the NGWF gradients *G*^{α}(**r**)] will have to be redistributed from scheme 2 back to scheme 1 before they can be used by the rest of the onetep machinery. The walltime cost of these operations is very modest, below 0.5%. The associated memory cost (due to having to store additional NGWFs and density kernel elements) is also acceptable, at a level below 1 MB/atom per MPI rank.

### C. Algorithm overview

Examining the expressions in Eqs. (18) and (21) makes it clear that they are independent of the density kernel and involve only the NGWFs. This means that the calculation of Ẽ_{HFx} can be broken up into two distinct stages—one that is independent of the density kernel and one that is density-kernel-dependent. The first stage, which involves calculating the expansion coefficients of NGWF pair products [Eq. (18)] and the potentials of these expansions [Eq. (21)], does not need to be repeated in the inner (kernel optimization, LNV) loop. The second stage {Eq. (22), followed by the calculation of the elements of X̃ and of Ẽ_{HFx} itself [Eq. (11)]} must be repeated every time **K** changes, that is, multiple times in the inner loop.

The density-kernel-independent stage will be described in Sec. III D, and the density-kernel-dependent stage, which entails the actual calculation of Ẽ_{HFx}, will be described in Sec. III E.

The last component is the calculation of the gradient of Ẽ_{HFx} with respect to the NGWFs, which is required for the *in situ* optimization of NGWFs. It will be described in Sec. III F.

### D. Density-kernel-independent stage

The density-kernel-independent stage is described in Algorithm 1. In step 1, the workload associated with all kets in Eq. (14), that is, the set of all (*B*, *C*) atom pairs in the calculation, is distributed across MPI ranks. During load-balancing, we assume the computational effort associated with a (*B*, *C*) atom pair to be proportional to the product of the number of NGWFs on atoms *B* and *C*. This is justified by the fact that the algorithm deals with products of pairs of NGWFs on these atoms. In the final distribution, each MPI rank has a subset of (*B*, *C*) atom pairs and these subsets are mutually disjoint, i.e., a given (*B*, *C*) pair is only found on one MPI rank.
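Weighting each (*B*, *C*) pair by the product of the NGWF counts on the two atoms suggests a simple greedy assignment. The longest-processing-time heuristic below is our illustrative choice, not necessarily onetep's exact scheme:

```python
import heapq

def distribute_pairs(pair_weights, n_ranks):
    """Greedily assign (B, C) pairs to MPI ranks, always giving the next
    heaviest pair to the currently least-loaded rank (LPT heuristic).
    pair_weights maps each pair to n_NGWF(B) * n_NGWF(C)."""
    heap = [(0.0, rank) for rank in range(n_ranks)]   # (load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(n_ranks)}
    for pair, w in sorted(pair_weights.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(pair)
        heapq.heappush(heap, (load + w, rank))
    return assignment

# Toy example: B1 carries 2 NGWFs interacting with 2-NGWF and 1-NGWF atoms.
weights = {("B1", "C1"): 4, ("B1", "C2"): 4, ("B2", "C1"): 2, ("B2", "C2"): 2}
assign = distribute_pairs(weights, 2)
```

Each pair lands on exactly one rank, so the resulting subsets are mutually disjoint, as required.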

**Algorithm 1.** The density-kernel-independent stage.

1: Distribute (B, C) pairs across MPI ranks
2: Determine B and C atoms for each MPI rank
3: Determine A atoms for each MPI rank
4: Determine D atoms for each MPI rank
5: Redistribute the matrix **V** from scheme 1 to scheme 2
6: Redistribute NGWFs from scheme 1 to scheme 2
7: Determine the set of all PPDs where SW expansions of the potential will be needed for each MPI rank
8: Populate the SWpot cache on each MPI rank (→Algorithm 2)
9: Populate the NGWF product cache (→Algorithm 3)
10: Expand NGWF pair products (→Algorithm 4)


**Algorithm 2.** Populating the SWpot cache.

1: Perform a dry-run of the calculation, where only the numbers of accesses to each SWpot are counted (for every MPI rank separately)
2: Determine how many SWpots can be cached before the user-specified memory limit is reached, *N*_{SWpot}^{max}
3: Precalculate the *N*_{SWpot}^{max} most accessed SWpots and store them in a hash table


In step 2, each MPI rank constructs a set of all *B* atom indices and a set of all *C* atom indices present in its set of (*B*, *C*) pairs. We will refer to these as the rank’s *B* atoms and the rank’s *C* atoms. Unlike (*B*, *C*) pairs, more than one MPI rank can have the same *B* or *C* atom in its subset (i.e., the subsets are not disjoint).

In step 3, each MPI rank determines the set of its *A* atoms, which we define as all atoms that are *X*-neighbors to any *B* atom the MPI rank owns. Similarly, in step 4, each MPI rank determines the set of its *D* atoms, which we define as all atoms that are *X*–*S*-neighbors to any *B* atom the MPI rank owns.

In step 5, the electrostatic metric matrix **V** [Eq. (12)] is redistributed from scheme 1 (which was used during its calculation) to scheme 2 (which is used in all subsequent calculations). This also involves a change in the underlying data structure, from a distributed sparse matrix to rank-local hash tables.

In step 6, the NGWFs themselves are redistributed from scheme 1 to scheme 2. Each MPI rank requests the NGWFs it needs (in PPD format) from their scheme-1 owners; simultaneously, each rank listens for requests addressed to it and satisfies them by sending the NGWFs it owns under scheme 1. Following this step, each MPI rank locally holds all the NGWFs it will need in the calculation of HFx, and no further communication of NGWFs is necessary. This is an important trade-off: we sacrifice some memory (because most NGWFs are now replicated on more than one rank), but in return, we avoid interspersing calculation with communication in all subsequent stages of the calculation. Our algorithm is thus free of any idle waits of one rank for communication with another and of any potential convoy effects. The memory cost of this overhead is acceptable (below 1 MB/atom on each MPI rank, increasing linearly with system size owing to the sparsity of **S** and **X**). We use the same approach for communicating elements of the density kernel (see step 1 of Algorithm 5).
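The replication trade-off can be illustrated with a serial stand-in for the exchange (the ownership and need tables below are made up; the real code performs this with MPI point-to-point messages):

```python
# Serial sketch of the scheme-1 -> scheme-2 NGWF redistribution (step 6).
# 'owner' maps each NGWF index to its scheme-1 rank; 'needed' lists the
# NGWFs each rank requires under scheme 2. Both tables are illustrative.
# After the exchange every rank holds its NGWFs locally, at the price of
# replication: the same NGWF may end up on several ranks.
def redistribute(owner, needed):
    local = {rank: {} for rank in needed}
    for rank, ngwf_ids in needed.items():
        for i in ngwf_ids:
            # In the real code this is an MPI request served by owner[i];
            # here we simply copy a placeholder payload.
            local[rank][i] = ("ngwf-data", owner[i])
    return local

owner = {0: 0, 1: 0, 2: 1}
needed = {0: [0, 1], 1: [1, 2]}
stores = redistribute(owner, needed)
replicated = [i for i in owner if sum(i in s for s in stores.values()) > 1]
```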

In step 7, each MPI rank establishes the set of PPDs where SW-expanded potentials of ket NGWF pairs will be required. This is a union of the sets of PPDs on all the rank’s atoms *A*.

In step 8, each MPI rank populates its SWpot cache, following Algorithm 2.

In step 9, each MPI rank populates its NGWF product cache, following Algorithm 3.

Finally, in step 10, NGWF pair products are expanded in terms of SWs, following Algorithm 4.

This completes the description of the density-kernel-independent stage. We now briefly describe the algorithms used in the last three steps.

The SWpot cache is populated according to Algorithm 2.
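In outline, the counting-and-caching logic of Algorithm 2 can be sketched as follows (a minimal Python analog; the access log, entry limit, and `compute_swpot` stand-in are illustrative, not onetep's actual data structures):

```python
from collections import Counter

# Minimal Python analog of Algorithm 2: a dry run tallies accesses to each
# SWpot, then the most-accessed ones are precomputed into a hash table up
# to a memory-derived entry limit.
def build_swpot_cache(access_log, max_entries, compute_swpot):
    counts = Counter(access_log)              # step 1: dry-run access counts
    top = counts.most_common(max_entries)     # steps 2-3: keep the hottest
    return {key: compute_swpot(key) for key, _ in top}

log = ["sw3", "sw1", "sw3", "sw2", "sw3", "sw1"]
cache = build_swpot_cache(log, max_entries=2,
                          compute_swpot=lambda k: f"pot({k})")
```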

The NGWF product cache is populated following Algorithm 3. First, each rank counts and enumerates the atom pairs (*A*, *D*) relevant to it, building a list of pairs. This is done to switch from atom indices to a fused (pair) index, which lends itself better to OpenMP (OMP) parallelization. Subsequently, for all the atom pairs, the products of their NGWFs are calculated and stored, with the pair loop parallelized over OpenMP threads. If at any time the user-specified maximum size of the product cache is reached, the loop terminates early. Thus, we store only as many NGWF products as fit, keeping the memory requirement strictly bounded. In subsequent steps, products not found in the cache are simply recalculated on the fly. This precludes runaway memory use and ensures graceful performance degradation in low-memory scenarios.

1: n_{pairs} = 0 |

2: for all my A atoms do (⊳) count (A, D) pairs |

3: for all D atoms that are S-neighbors of this A do |

4: n_{pairs} = n_{pairs} + 1 |

5: end for D |

6: end for A |

7: OMP for i_{pair} = 1 to n_{pairs} do (⊳) process (A, D) pairs |

8: for all NGWFs d on atom D in pair do |

9: for all NGWFs a on atom A in pair do |

10: if room left in NGWF product cache then |

11: Compute and store φ_{Aa}φ_{Dd} in PPDs |

12: else |

13: exit OMP for |

14: end if |

15: end for a |

16: end for d |

17: end OMP for i_{pair} |
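The bounded-cache behavior of Algorithm 3, together with the on-the-fly fallback used in later steps, can be sketched as follows (a minimal Python analog with a hypothetical `ProductCache` class; entry counts stand in for the real memory bound):

```python
# Minimal Python analog of the bounded product cache of Algorithm 3.
# Keys are fused (A, D) pair indices; 'compute' stands in for forming
# the NGWF product in PPDs.
class ProductCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.store = {}
        self.recomputed = 0

    def fill(self, pair_list, compute):
        for key in pair_list:                  # loop over fused (A, D) pairs
            if len(self.store) >= self.max_entries:
                break                          # strict memory bound
            self.store[key] = compute(key)

    def get(self, key, compute):
        if key in self.store:
            return self.store[key]
        self.recomputed += 1                   # graceful degradation on miss
        return compute(key)

product = lambda key: ("phi-product", key)
cache = ProductCache(max_entries=2)
cache.fill([(0, 1), (0, 2), (3, 1)], product)
value = cache.get((3, 1), product)             # miss: recomputed on the fly
```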

NGWF pair products are expanded in terms of SWs following Algorithm 4. For all NGWF pairs on all (*B*, *C*) atom pairs owned by an MPI rank, we first calculate the products of the NGWFs and then calculate their overlaps with all SWs *s* (centered on *B* and *C*). The overlaps are then used as the constant term in a system of linear equations to determine the expansion coefficients (step 7). OpenMP parallelism is employed over PPDs in the product.

1: for all (B, C) pairs owned by this MPI rank do |

2: for all NGWFs b on atom B in pair do |

3: for all NGWFs c on atom C in pair do |

4: Compute φ_{Bb}φ_{Cc} in PPDs |

5: OMP for all PPDs in product φ_{Bb}φ_{Cc} do |

6: Calculate $b_{s,\gamma\beta} = \langle f_s | \phi_\gamma \phi_\beta \rangle$ for all SWs s |

7: For all SWs p, obtain expansion coefficients $c_{\gamma\beta}^{p} = V^{ps} b_{s,\gamma\beta}$ by solving the linear equation system $V_{sp} c_{\gamma\beta}^{p} = b_{s,\gamma\beta}$ |

8: end OMP for PPD |

9: end for c |

10: end for b |

11: end for (B, C) |
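The solve in step 7 can be illustrated with a toy one-dimensional analog (a numpy sketch; monomials stand in for spherical waves, and all quantities are illustrative):

```python
import numpy as np

# Toy 1D analog of steps 6-7 of Algorithm 4: fit a grid-sampled function in
# a small auxiliary basis by forming overlaps b_s = <f_s|target> and solving
# the metric system V_sp c^p = b_s.
r = np.linspace(0.0, 1.0, 201)
w = np.full_like(r, r[1] - r[0])            # simple quadrature weights
basis = np.stack([r**0, r, r**2])           # auxiliary functions f_s(r)

target = 2.0 - 3.0 * r + 0.5 * r**2         # lies exactly in the span
V = (basis * w) @ basis.T                   # metric matrix V_sp = <f_s|f_p>
b = (basis * w) @ target                    # overlaps b_s
c = np.linalg.solve(V, b)                   # expansion coefficients c^p
residual = float(np.max(np.abs(basis.T @ c - target)))
```

Because the target lies exactly in the span of the auxiliary functions, the metric solve recovers the exact coefficients regardless of the quadrature weights.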

### E. Density-kernel-dependent stage

In this stage, the Hartree–Fock exchange energy $\tilde{E}_{\mathrm{HFx}}$ is calculated using the SW expansion of NGWF products obtained from Algorithm 1. Unlike the previous stage, this one depends on the values of the density kernel, and as such, it needs to be performed multiple times within the kernel optimization loop. The procedure is outlined in Algorithm 5 and will now be described.

1: Redistribute K from scheme 1 to scheme 2 |

2: for all (B, C) pairs owned by this MPI rank do |

3: Find the PPDs relevant to atom B from the pair. |

This is a union of the sets of PPDs of all atoms A that are X-neighbors with this atom B |

4: for all NGWFs b on atom B in pair do |

5: for all NGWFs c on atom C in pair do |

6: Retrieve expansion coefficients for φ_{Bb}φ_{Cc} |

7: OMP for all PPDs relevant to atom B do |

8: Calculate the expanded potential |

$e_{\gamma\beta} = \sum_{p=1}^{N_{\mathrm{SW}}} c_{\gamma\beta}^{p} u_p(\mathbf{r})$ in PPD i |

9: if room left in expansion cache then |

Store e_{γβ} in expansion cache |

10: else |

11: exit (B, C) loop |

12: end if |

13: end OMP for PPD |

14: end for c |

15: end for b |

16: end for (B, C) |

17: for all my B atoms do |

18: Find the PPDs relevant to atom B like in step 3 |

19: Calculate $\tilde{u}_{\beta\delta} = \sum_\gamma K^{\delta\gamma} e_{\gamma\beta}$ in all relevant PPDs, thereby eliminating index γ (→Algorithm 6) |

20: Calculate contribution to $\tilde{X}$ from atom B, thereby eliminating index β (→Algorithm 7) |

21: if NGWF gradient needed then |

22: Accumulate contribution to contravariant gradient G_{1} from this atom B (→Algorithm 8) |

23: end if |

24: end for B |

25: Redistribute $\tilde{X}$ from scheme 2 back to scheme 1 |

26: if NGWF gradient needed then |

27: Finalize NGWF gradient term G_{1} (→Algorithm 9) |

28: Calculate the NGWF gradient term G_{2} (→Algorithm 10) |

29: end if |

30: Symmetrize $\tilde{X}$ [Eq. (19)] |

31: Calculate $\tilde{E}_{\mathrm{HFx}}$ [Eq. (20)] |

In step 1, the density kernel matrix **K** is redistributed from scheme 1 (which is used in the rest of onetep) to scheme 2 (which is used in all subsequent calculations). This also involves a change in the underlying data structure, from a distributed sparse matrix to rank-local hash tables.

Next, each MPI rank processes the atom pairs (*B*, *C*) it was assigned. The result of this processing is the expanded potentials of NGWF products *e*_{γβ} for all NGWFs on atoms *B* and *C* in each pair. The expanded potentials are calculated at all points where they will later be required, that is, in the PPDs of all atoms *A* that are *X*-neighbors to atom *B*. This set of PPDs (termed “PPDs relevant to atom *B*”) is established in step 3.

Subsequently, we iterate over all the NGWF pairs on atoms *B* and *C*. The previously calculated SW expansion coefficients $c_{\gamma\beta}^{p}$ for the NGWF products are retrieved (step 6), and the expanded potential is calculated (step 8) in all relevant PPDs. OpenMP parallelism is leveraged for the loop over PPDs. The expanded potential is stored in a hash table termed the “expansion cache.”

Storing *all* the expanded potentials would require enormous amounts of memory—for all but the smallest systems, they need to be calculated in a sphere with a radius *r*_{X} + *a* (where *r*_{X} is the exchange cutoff and *a* is the NGWF localization radius). At typical settings, this translates to about 6.5 MB of storage per single expansion. The number of expansions on an atom *B* is $n_B n_C N_B^{S}$, where *n*_{B} and *n*_{C} are the numbers of NGWFs on atoms *B* and *C*, respectively, and $N_B^{S}$ is the number of *C* atoms that are *S*-neighbors of atom *B*. At typical settings, this number is about 800. For typical *n*_{B} and *n*_{C}, we arrive at a value of about 5 GB per atom, which is clearly excessive even if we assume this cost to be distributed across compute nodes. For this reason, we set a user-adjustable upper bound on the size of the expansion cache. Once the cache is full, the loop exits (step 11). Expanded potentials that did not fit in the cache are later (Algorithm 6) calculated on the fly. This precludes runaway memory use and ensures graceful performance degradation in low-memory scenarios.
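The arithmetic behind this estimate is simple (values as quoted in the text):

```python
# Back-of-the-envelope check of the expansion-cache estimate: ~6.5 MB per
# stored expansion times ~800 expansions per B atom (n_B * n_C * N_B^S at
# typical settings) gives roughly 5 GB per atom.
mb_per_expansion = 6.5
expansions_per_atom = 800
gb_per_atom = mb_per_expansion * expansions_per_atom / 1024.0
```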

1: for all atoms C in my (B, C) pair list for current B do |

2: for all NGWFs b on atom B do |

3: for all NGWFs c on atom C do |

4: OMP for all PPDs relevant to atom B do |

5: Retrieve from cache or calculate e_{γβ} |

6: end OMP for PPD |

7: end for c |

8: end for b |

9: OMP for all PPDs relevant to atom B do |

10: for all atoms D participating in this PPD do |

11: for all NGWFs d on atom D do |

12: for all NGWFs b on atom B do |

13: for all NGWFs c on atom C do |

14: $\tilde{u}_{\beta\delta} = \tilde{u}_{\beta\delta} + K^{\delta\,Cc}\, e_{Cc\,\beta}$ |

15: end for c |

16: end for b |

17: end for d |

18: end for D |

19: end OMP for PPD |

20: end for C |

Once all the expanded potentials have been calculated, or the expansion cache cannot accommodate more elements, each MPI rank iterates over all its *B* atoms (step 17). First, the PPDs relevant to the current atom *B* are identified, just as was done in step 3. Subsequently, the expanded potentials are contracted with the density kernel over index *γ*, eliminating this index from further computation (step 19). This is done by Algorithm 6, described further in Sec. III E.

Next, the contribution from the current atom *B* to all elements of the exchange matrix $\tilde{X}$ is calculated (step 20). This is done by Algorithm 7, described further in this section. Similarly, contributions to the NGWF gradient from atom *B* are accumulated. This is done by Algorithm 8, described in Sec. III F.

In step 25, the exchange matrix is redistributed from scheme 2 back to scheme 1 to make it possible to use standard sparse algebra routines on it in the rest of onetep.

The main part of the NGWF gradient calculation takes place in step 27; this is carried out by Algorithm 10, described in Sec. III F.

Finally, the exchange matrix $X\u0303$ is symmetrized [cf. Eq. (19)], and the Hartree–Fock exchange energy is computed according to Eq. (20).

This concludes the general description of the density-kernel-dependent stage. We will now briefly describe Algorithms 6 and 7.

Algorithm 6 processes a single atom *B* owned by a given MPI rank. It iterates over all the *C* atoms that are *S*-neighbors of *B* and are assigned to this MPI rank. Thus, the loop body is executed for a (*B*, *C*) pair. First (steps 2–8), the expanded potentials for all combinations of NGWFs on atoms *B* and *C* are retrieved from the expansion cache into a local array for efficient access later on. This is done for all the PPDs relevant to atom *B*, and OpenMP parallelism is leveraged for the loop over PPDs. Subsequently, for all the PPDs (again leveraging OpenMP parallelism) and all the atoms *D* whose localization sphere features this PPD, the quantity $\tilde{u}_{\beta\delta}$, which is a contraction of expanded potentials with the density kernel over the index *γ*, is accumulated (step 14). As the density kernel depends on another index (*δ*), $\tilde{u}_{\beta\delta}$ is a two-center quantity. As it is stored only for a single atom *B* before being processed (via Algorithm 7) and discarded, the associated memory requirement remains modest. Algorithm 6 finishes after calculating and storing all requisite $\tilde{u}_{\beta\delta}$ for a given atom *B*.
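The contraction in step 14 has the structure of a small tensor contraction over the grid; a toy numpy analog (shapes and values illustrative, not onetep's) is:

```python
import numpy as np

# Toy analog of the contraction in step 14 of Algorithm 6: grid-sampled
# expanded potentials e[gamma, beta, point] are contracted with the density
# kernel K[delta, gamma], eliminating index gamma.
rng = np.random.default_rng(1)
n_gamma, n_beta, n_delta, n_pts = 4, 3, 5, 7
e = rng.standard_normal((n_gamma, n_beta, n_pts))   # e_{gamma beta}(r)
K = rng.standard_normal((n_delta, n_gamma))         # K^{delta gamma}

# u~_{beta delta}(r) = sum_gamma K^{delta gamma} e_{gamma beta}(r)
u = np.einsum("dg,gbp->bdp", K, e)
```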

The next stage (Algorithm 7) eliminates index *β* from the calculation. Like Algorithm 6, it is performed for a single atom *B*. The algorithm processes all PPDs relevant to this atom in an OpenMP-parallelized loop. For each PPD, the algorithm iterates over all atoms *D* whose localization sphere spans this PPD and all atoms *A* owned by the MPI rank whose localization sphere spans this PPD. The unusual loop ordering, with the loop over PPDs outermost, is cumbersome programmatically but was found to yield the best performance, allowing OpenMP parallelism to be leveraged at the largest grain sizes and avoiding parallel contention. No thread synchronization is required in this loop.

1: OMP for all PPDs relevant to atom B do |

2: for all atoms D participating in this PPD do |

3: for all my atoms A participating in this PPD do |

4: for all NGWFs d on atom D do |

5: for all NGWFs a on atom A do |

6: Retrieve from cache or calculate |

p_{AaDd} = φ_{Aa}φ_{Dd} in current PPD |

7: end for a |

8: for all NGWFs b on atom B do |

9: for all NGWFs a on atom A do |

10: $\tilde{X}_{Aa\,Bb} = \tilde{X}_{Aa\,Bb} + \sum_{\mathbf{r} \in \mathrm{PPD}} p_{AaDd}(\mathbf{r})\, \tilde{u}_{Bb\,Dd}$ |

11: if NGWF gradient needed then |

12: $k_{Bb} = k_{Bb} + \phi_{Dd}\, \tilde{u}_{Bb\,Dd}$ |

13: end if |

14: end for a |

15: end for b |

16: end for d |

17: end for A |

18: end for D |

19: end OMP for PPD |

First (step 6), the algorithm retrieves all NGWF products for the atom pair (*A*, *D*) from the NGWF product cache. Cache misses lead to the recalculation of the products on the fly. Subsequently (steps 8–11), elements of the exchange matrix are accumulated in a loop over NGWFs on atoms *B* and *A* and over all points in the PPD. The accumulated quantity is the product *φ*_{α}*φ*_{δ} multiplied by $\tilde{u}_{\beta\delta} = \sum_p \sum_s u_p(\mathbf{r})\, V^{ps} \langle f_s | \phi_\gamma \phi_\beta \rangle K^{\delta\gamma}$, which completes the integration required for obtaining the exchange matrix element $\tilde{X}_{\alpha\beta}$ [cf. Eq. (15)]. Once the algorithm completes, a column stripe of the exchange matrix corresponding to atom *B* has been calculated.
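The accumulation in step 10 is, in essence, a grid-weighted contraction; a toy numpy analog (illustrative shapes only, not onetep's) is:

```python
import numpy as np

# Toy analog of step 10 of Algorithm 7: a block of the exchange matrix is
# accumulated by summing NGWF products p[a, d, point] against contracted
# potentials u[b, d, point] over all PPD grid points (the sum over d mirrors
# the enclosing loops over atoms D and their NGWFs).
rng = np.random.default_rng(2)
n_a, n_b, n_d, n_pts = 2, 3, 4, 50
p = rng.standard_normal((n_a, n_d, n_pts))   # phi_Aa * phi_Dd on the grid
u = rng.standard_normal((n_b, n_d, n_pts))   # u~_{Bb,Dd} on the grid

# X~[Aa, Bb] += sum_{d, r} p_{AaDd}(r) * u~_{BbDd}(r)
X_block = np.einsum("adp,bdp->ab", p, u)
```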

### F. Gradient with respect to the NGWFs

The last component of our algorithm is the calculation of the functional derivative of $\tilde{E}_{\mathrm{HFx}}$ with respect to an NGWF—a rather involved procedure required for *in situ* optimization of NGWFs. The calculation is split into two terms, $G_1^{\alpha}$ and $G_2^{\alpha}$, where *α* is the NGWF with respect to which we differentiate, as defined in Eq. (23).

Calculating the *G*_{1} term is straightforward, beginning already in step 12 of Algorithm 7, where the auxiliary quantity *k*_{Bb} is calculated for every atom *B* occurring in the atom pairs (*B*, *C*) assigned to an MPI rank. Subsequently, in step 22 of Algorithm 5, Algorithm 8 is invoked for every atom *B* owned by the MPI rank. This is a simple algorithm, contracting the auxiliary quantity *k*_{Bb} with the density kernel over the index *Bb*. Following its completion, the contravariant version of term *G*_{1} has been calculated.

1: for all PPDs relevant to atom B do |

2: for all my atoms A participating in this PPD do |

3: if atom A is an X-neighbor of atom B then |

4: for all NGWFs b on atom B do |

5: for all NGWFs a on atom A do |

6: $G_1^{Aa} = G_1^{Aa} + k_{Bb} K^{Bb\,Aa}$ |

7: end for a |

8: end for b |

9: end if |

10: end for A |

11: end for PPD |

The calculation is finalized in step 27 of Algorithm 5, where Algorithm 9 is invoked. Here, the calculated NGWF gradient is prepared for subsequent use outside of the HFx code. First, the calculated contravariant gradient is redistributed back to scheme 1, that is, to the atom-based distribution. Each MPI rank obtains the gradient for the NGWFs of the atoms local to it. Next, the points within the PPDs but outside of the NGWF localization radius are zeroed (“shaved”), as they cannot contribute to the gradient, having only been calculated for numerical convenience. Finally, a prefactor is applied [the 2 that is in Eq. (23), together with a grid weight], the contravariant gradient is converted to the covariant form, and reciprocal-space preconditioning^{29} is performed on the gradient. At this point, the calculation of the first term, in its final form, is completed.

1: Redistribute the contravariant gradient G_{i} from scheme 2 back to scheme 1. |

2: for all atoms A scheme-1-local to this MPI rank do |

3: for all NGWFs a on atom A do |

4: for all PPDs spanned by NGWF Aa do |

5: Zero points outside of NGWF sphere |

6: Apply prefactor |

7: end for PPD |

8: Convert $G_i^{Aa}$ to covariant form |

9: Perform reciprocal-space preconditioning |

10: end for a |

11: end for A |
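The "shaving" and prefactor steps can be sketched in a few lines (pure-Python stand-in; the geometry and prefactor value are made up):

```python
# Sketch of the "shaving" in Algorithm 9: PPD grid points outside the NGWF
# localization sphere are zeroed, and the remaining values are scaled by
# the prefactor.
def shave_and_scale(values, points, center, radius, prefactor):
    out = []
    for v, (x, y, z) in zip(values, points):
        dx, dy, dz = x - center[0], y - center[1], z - center[2]
        inside = dx * dx + dy * dy + dz * dz <= radius * radius
        out.append(prefactor * v if inside else 0.0)
    return out

vals = [1.0, 1.0, 1.0]
pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
shaved = shave_and_scale(vals, pts, center=(0.0, 0.0, 0.0),
                         radius=1.0, prefactor=2.0)
```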

1: for all my C atoms do |

2: Determine atoms B relevant to this atom C |

3: Determine atoms A relevant to this atom C |

4: OMP for all relevant A atoms do |

5: Calculate the auxiliary quantity $P_{\alpha}^{\gamma} = \phi_\alpha \sum_\delta K^{\gamma\delta} \phi_\delta$, thereby eliminating index δ (→Algorithm 11) |

6: end OMP for A |

7: for all NGWFs c on atom C do |

8: OMP for all atoms B relevant to atom C do |

9: for all atoms A relevant to atom C do |

10: if A is not an S-neighbor of B then |

11: next A |

12: end if |

13: for all NGWFs b on atom B do |

14: $Q^{Cc\,Bb} = Q^{Cc\,Bb} + \sum_a P_{Aa}^{Cc} K^{Aa\,Bb}$ |

15: end for b |

16: end for A |

17: end OMP for B |

18: for all atoms B relevant to atom C do |

19: for all NGWFs b on atom B do |

20: Expand Q^{CcBb} in SWs on B and C |

21: OMP for all PPDs in B ∩ C do |

22: Calculate the expanded potential E^{CcBb} of Q^{CcBb} in PPD |

23: end OMP for PPD |

24: for PPDs in B ∩ C do |

25: Accumulate $G_2^{Bb} = G_2^{Bb} + \phi_{Cc} E^{Cc\,Bb}$ |

26: end for PPD |

27: end for b |

28: end for B |

29: end for c |

30: end for C |

31: Finalize NGWF gradient term G_{2} (→Algorithm 9) |

1: for all atoms D that are S-neighbors of atom A do |

2: for all NGWFs d on atom D do |

3: for all NGWFs a on atom A do |

4: Compute φ_{Dd}φ_{Aa} |

5: for all NGWFs c on atom C do |

6: $P_{Aa}^{Cc} = P_{Aa}^{Cc} + K^{Cc\,Dd}\, \phi_{Dd} \phi_{Aa}$ |

7: end for c |

8: end for a |

9: end for d |

10: end for D |

Calculating the *G*_{2} term is more involved. It is performed entirely by Algorithm 10, invoked in step 28 of Algorithm 5. The algorithm is best understood by referring to the second term of Eq. (23). This term, crucially, involves a version of the resolution-of-identity operator $\hat{I}_{\delta\alpha}$ that employs SWs centered on atoms *D* and *A*, which feature in the *bra* of the ERIs in Eqs. (3) and (4). This is unfortunate because, as explained earlier, our algorithm divides the workload by distributing *kets* [atom pairs (*B*, *C*)] across MPI ranks. Furthermore, in the evaluation of the HFx energy and the first term of the gradient (Algorithm 4, step 6 and Algorithm 5, step 8), our algorithm always operates on SWs centered on *ket* atoms (*B* and *C*), and it is those SWs that are cached. Fortunately, we can simply rename dummy indices in the sums to arrive at a less cumbersome expression for the second term in the gradient. Once we rename *Aaα* ↔ *Bbβ* and *Ddδ* ↔ *Ccγ*, we arrive at the following (where we explicitly indicate summations for clarity):

This expression is more amenable to our approach since its RI operator involves SWs centered on *ket* atoms, as in the previous steps of the calculation. The fact that we now obtain a gradient term for NGWF *β* and not NGWF *α* does not matter, as we will need to redistribute the calculated gradients to scheme 1 in any case. The fact that we will operate on *φ*_{α} and *φ*_{δ} is also not a problem since our algorithm already requires them (cf. Algorithm 7, step 10).

We will now describe Algorithm 10, which calculates the second term in the NGWF gradient. The main loop iterates over atoms *C* owned by an MPI rank, corresponding to the sum over *γ* in Eq. (25). For each atom *C* in the sum, the algorithm first determines the atoms *B* with respect to whose NGWFs the gradient needs to be calculated. These are *S*-neighbors of atom *C*, which are in the list of (*B*, *C*) pairs assigned to this MPI rank and will be termed “*B* atoms relevant to atom *C*.” Similarly, atoms *A* featured in the second sum (over *α*) in Eq. (25) are determined. These atoms, which we term “*A* atoms relevant to atom *C*,” are all atoms that are *X*-neighbors of relevant *B* atoms.

For each relevant atom *A*, the algorithm then (steps 4–6) calculates an auxiliary quantity $P_{\alpha}^{\gamma}$ for all the NGWFs on atom *A* [cf. the last sum in Eq. (25)]. This is done by a very simple algorithm, Algorithm 11, which we present below. The algorithm is a simple accumulation of the linear combination of NGWF products. The loop over *A* leverages OpenMP parallelism to make this operation efficient, requiring Algorithm 11 itself to be thread-safe. Given that every instance of Algorithm 11 operates on a different *Aa*, this can be accomplished without resorting to synchronization mechanisms. Once this is complete, index *δ* has been eliminated from further calculations.

We note in passing that the same $P_{\alpha}^{\gamma}$ can be calculated on more than one MPI rank since there is some overlap between *C* atoms in (*B*, *C*) atom pairs assigned to MPI ranks. Even so, as will be shown in Sec. V C, the cost of this stage of the algorithm is almost negligible, except in conduction calculations, where it accounts for ≈20% of the total cost—a result of large numbers of NGWFs per atom and lower overlap matrix sparsity.

Subsequently, $P_{\alpha}^{\gamma}$ is contracted with the density kernel over all NGWFs *a* on the relevant atoms *A*, that is, over index *α* (steps 13–15). In this way, another auxiliary quantity, $Q^{\gamma\beta}$, is calculated, eliminating index *α* from further calculations.

The next stage is the application of the RI operator $\hat{I}_{\gamma\beta}$, that is, the calculation of the SW-expanded potential of *Q*^{γβ} in all PPDs spanned by the intersection of the localization spheres of atoms *B* and *C*. *Q*^{γβ} is first expanded in SWs (step 20), similarly to what was done earlier for NGWF pair products (steps 5–8 of Algorithm 4). The expanded potential itself (denoted as *E*^{CcBb}) is calculated in steps 21–23 in the same PPDs, leveraging OpenMP parallelism. Finally, the contributions are multiplied by *φ*_{γ}(**r**) and summed over *γ* in steps 24–26, corresponding to the leftmost operation in Eq. (25). The gradient is finalized (“shaved,” converted to the covariant form, and preconditioned) through the application of Algorithm 9, just as was done for its first term.

### G. Conduction calculations

Apart from the usual mode of operation where only occupied states are considered, onetep has the capability to perform conduction calculations, which find use, e.g., in calculating optical absorption spectra. Conduction calculations optimize an energy expression involving a separate (conduction) density kernel and a projected conduction Hamiltonian.^{45} They typically use larger numbers of NGWFs (i.e., not a minimal basis) and larger NGWF localization radii (by a factor of ≈1.5), which means that the overlap matrices are not as sparse as in valence calculations. This, coupled with the fact that in conduction calculations with hybrid functionals the inner (LNV) loop does not involve HFx, shifts the computational bottlenecks to different parts of the algorithm. Because of the larger NGWF basis, the memory requirement of conduction calculations is also increased relative to valence calculations. For these reasons, we will benchmark conduction calculations separately, and only for the polymer systems, where they are more relevant.

### H. Preconverging with a non-hybrid functional

A simple, commonly used optimization is to preconverge the calculation with a non-hybrid functional [typically a GGA (generalized gradient approximation)] and, once it is converged, to continue it with a hybrid. The expectation is for the pre-converged calculation to reach convergence faster than when starting from the initial guess, thus reducing the number of expensive (hybrid functional) self-consistency iterations. The cost of the initial steps performed with a GGA is typically much lower.

We briefly examined two such optimizations. In the first one (referred to in the text as B3LYP-1), we pre-converged the calculation with the Perdew–Burke–Ernzerhof (PBE)^{46} functional and, once convergence had been obtained, continued it with B3LYP, optimizing both the density kernel and the NGWFs. In the second optimization (referred to in the text as B3LYP-2), we similarly pre-converged with PBE, but then kept the NGWF basis fixed, optimizing only the density kernel. This corresponds to running a fixed-NGWF calculation in a GGA-optimized NGWF basis. In this scenario, onetep operates similarly to other fixed-atomic-orbital-basis electronic structure codes, but with the advantage of using a pre-optimized minimal set of atomic orbitals (NGWFs).

We demonstrate the excellent accuracy of both optimizations and significant performance gains in Sec. V H.

## IV. CALCULATION DETAILS

We demonstrate the behavior and performance of our implementation on two classes of systems. The first class comprises roughly spherical “scoops” of a protein (cGMP-specific phosphodiesterase type 5, PDE5) of increasing size, suitably truncated and protonated. These range from 44 atoms (≈12 Å across) to 4048 atoms (≈54 Å across) and constitute a good model of a typical three-dimensional system of interest in biochemical applications. The second class comprises stacked polymer chains of increasing length (four chains of the PBTZT-stat-BDTT-8 polymer analog, as studied in Ref. 47). These range from 228 atoms (≈21 Å in length) to 2868 atoms (≈265 Å in length) and are a good model of typical systems of interest in materials science. The two classes differ in topology (true 3D systems vs 1D chains) and composition (H-dominated vs C-dominated).

Unless stated otherwise, all calculations used a kinetic energy cutoff of 827 eV, an NGWF localization radius of 8*a*_{0} (≈4.2 Å), and a minimal NGWF basis (one NGWF on H and four NGWFs on p-block elements). Eight inner loop (LNV) iterations were used throughout. The SW basis quality was *l*_{max} = 3 and *q*_{max} = 12, and an exchange interaction cutoff of 20*a*_{0} (≈10.6 Å) was used. Calculations on polymer chains employed density kernel truncation with a cutoff of 50*a*_{0} (≈26.5 Å), while the calculations on proteins did not employ kernel truncation. Because the size of the polymer systems could easily be increased in uniform steps, we will use these to demonstrate weak parallel scaling (where the system size is proportional to the number of CPU cores). Strong parallel scaling (where the system size remains constant and the number of CPU cores is increased) will be demonstrated for both classes of systems.
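The quoted conversions between bohr radii and ångströms can be verified directly (1 *a*_{0} ≈ 0.529177 Å):

```python
# Consistency check of the radii quoted above.
BOHR_TO_ANG = 0.529177
ngwf_radius_ang = 8 * BOHR_TO_ANG       # ~4.2 Angstrom
exchange_cutoff_ang = 20 * BOHR_TO_ANG  # ~10.6 Angstrom
kernel_cutoff_ang = 50 * BOHR_TO_ANG    # ~26.5 Angstrom
```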

Additionally, for the polymer systems, we benchmarked calculations of *conduction states*, where, as explained in Sec. III, the computational bottlenecks are expected to be different. Here, to account for the expected orbital delocalization, we increased the NGWF radius to 12*a*_{0} (≈6.35 Å), used an extended basis (five NGWFs on H and 13 NGWFs on p-block elements), and increased the number of inner loop iterations to 12, reflecting typical settings used for conduction calculations in onetep.

## V. RESULTS

All walltimes shown in this section correspond to fully converged calculations with default convergence thresholds and so are good representations of real-life scenarios. However, we feel obliged to note two important points. First, in the interest of reducing the load on the high-performance computing (HPC) facility, we only ran calculations for one outer loop (NGWF) iteration and carefully extrapolated the timings to a fully converged calculation from these measurements, separately taking into account the measured constant overhead and the measured per-iteration time. The number of iterations was taken from GGA calculations that could be run at a fraction of the cost. Cross-checks with full hybrid functional calculations that were actually run to convergence (for a subset of the systems) validated our estimates to be within 0.5% of the true walltimes.
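The extrapolation scheme described above amounts to a simple linear model in the number of outer loop iterations. A minimal sketch (the function name and the timings below are illustrative, not measured values):

```python
# Sketch of the walltime extrapolation described above: a single-iteration
# run yields the constant overhead and the per-iteration time, and the
# fully converged walltime is then estimated for the iteration count taken
# from the corresponding (much cheaper) GGA calculation.

def extrapolate_walltime(t_overhead_h: float, t_iter_h: float, n_iter: int) -> float:
    """Estimated walltime (hours) of a fully converged calculation."""
    return t_overhead_h + n_iter * t_iter_h

# Hypothetical example: 2 h constant overhead, 55 h per NGWF iteration,
# 12 outer loop iterations to convergence.
estimate = extrapolate_walltime(2.0, 55.0, 12)
print(estimate)  # 662.0 hours
```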

Second, in the interest of clarity, we show results normalized to a constant number of outer loop iterations (e.g., 12 for the protein systems), whereas in reality, the actual number of iterations differed slightly (*σ* = 3.0) between individual systems (for calculations with a GGA and hybrid functional alike). This slight difference in the number of iterations was not systematic; rather, it reflected statistical noise from how the fragments of the total protein were carved out and truncated, obscuring the linear-scaling behavior that we set out to demonstrate. For the polymer systems, the effect was less pronounced (*N̄*_{iter} = 10 and *σ* = 2.2) but also present. To disentangle this statistical noise from the actual performance of our algorithm, we chose to show values normalized to the same number of outer loop iterations in all cases.

### A. Linear scaling with system size

We begin by demonstrating that our approach is, indeed, linear-scaling, i.e., beyond a certain system size (“onset of linear scaling”), the walltime of the calculation is linearly proportional to the number of atoms in the system. First, however, we stress an important point regarding *memory scaling*. The performance of our approach heavily depends on the amount of available memory, with maximum performance attained with a greedy algorithm that allocates as much memory as it needs to cache all requisite quantities. The RAM requirement of such a greedy approach increases superlinearly with the size of the system, at least for system sizes considered here, and quickly exceeds typical RAM sizes available on today’s computing nodes—this happens already at about 150 atoms. Thus, to ensure a fair and realistic benchmark, in this work, we employ the conditions of linear scaling in memory, i.e., we settle on a constant memory requirement per atom and limit the RAM use to this value. We choose this value in such a way as to be able to perform the largest calculations demonstrated here without exceeding the available RAM on the Iridis5 system at the University of Southampton. We will term this the *standard memory requirement*. Additionally, we will show how our implementation scales at a quarter of this value (mimicking a low-memory environment) and with four times as much RAM (mimicking typical high-memory nodes often available on HPC systems), giving an idea of how the employed time-memory trade-offs perform. The detailed values are given in Table II.

| Memory requirement | SWpot cache | Expansion cache | NGWF product cache | Total |
|---|---|---|---|---|
| Low (×0.25) | 2.5 | 1.25 | 0.25 | 4 |
| Standard | 10 | 5 | 1 | 16 |
| High (×4) | 40 | 20 | 4 | 64 |
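The per-atom allowance is divided among the three caches in a fixed 10:5:1 proportion, as can be read off the table. A minimal sketch of this split (an illustration of the arithmetic only, not ONETEP's actual allocation code):

```python
# Hypothetical sketch: divide a fixed per-atom memory allowance across the
# three HFx caches in the 10:5:1 ratio of Table II. Units are those used
# in the table.

RATIO = {"swpot": 10, "expansion": 5, "ngwf_product": 1}

def cache_budgets(total_per_atom: float) -> dict:
    """Split a total per-atom allowance according to the 10:5:1 ratio."""
    denom = sum(RATIO.values())  # 16
    return {name: total_per_atom * part / denom for name, part in RATIO.items()}

print(cache_budgets(16))  # standard allowance: 10.0 / 5.0 / 1.0
print(cache_budgets(4))   # low allowance:      2.5 / 1.25 / 0.25
```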

For demonstrating the effect of available RAM on calculation walltime, we chose a setup where all calculations are run on 16 compute nodes (40 CPU cores each). In order to maximize the size of the caches shared by threads, we use the maximum number of OMP threads per non-uniform memory access (NUMA) region, that is, we run with 32 MPI ranks, each spawning 20 OMP threads. In this way, each node holds only two MPI ranks and each of the node’s two NUMA regions is fully populated by OMP threads.

Figure 3 shows the scaling of the total time for the protein systems. Linear scaling is achieved in all cases, with the onset at about 400 atoms. In the low-memory scenario (green squares), all calculations were feasible, although the largest ones barely completed one outer loop iteration within the maximum job walltime on Iridis5 (60 h). We deem infeasible any calculation that cannot complete even one iteration within this limit (and thus cannot be checkpointed and continued later). Given that the total calculation runs for 12 iterations, we marked a “walltime limit” on the plot at 720 h. We can thus estimate the maximum system size feasible on 16 Iridis5 nodes to be about 4200 atoms.

At standard memory requirements (blue crosses), all calculations ran 17%–29% faster, but an out-of-memory condition prevented jobs larger than 3328 atoms from running, corresponding to a memory requirement of ≈100 GB per node. This means that with the standard memory requirement, the maximum limit on job size is dictated not by the available job time but by the available memory per node.

At high memory requirements, a further speedup of 15%–50% was achieved, but the maximum job size was only 909 atoms before available RAM was exhausted.

Figure 4 demonstrates linear scaling for the polymer systems. Here, it was feasible to run all systems studied (up to 2868 atoms) with low and standard memory requirements, with all calculations completing in well under 200 h. Compared to the low-memory scenario, the gain from having standard memory requirements was almost constant for all system sizes and averaged at a sizable 26%. In a high-memory scenario, a further speedup of about 20% was achieved, but the largest system that did not run out of memory was 1108 atoms.

We now turn our attention to the performance of conduction state calculations. Because of the different way the energy minimization is structured compared to valence calculations (cf. Sec. III), we note that there is no benefit to caching NGWF expansions in this case, as NGWFs always change between invocations of the HFx engine. In addition, because of the vastly increased number of NGWFs per atom and the increased NGWF localization radius, caching NGWF products becomes counterproductive, as the cache hit ratio achieved with default memory allowance is very low (in the order of 1%–2%). For the above reasons, when benchmarking conduction calculations, we decided to devote the entire memory allowance to the SWpot cache while keeping the total amount of RAM per atom the same as in valence calculations. This is summarized in Table III.

| Memory requirement | SWpot cache | Expansion cache | NGWF product cache | Total |
|---|---|---|---|---|
| Low (×0.25) | 4 | 0 | 0 | 4 |
| Standard | 16 | 0 | 0 | 16 |
| High (×4) | 64 | 0 | 0 | 64 |

Figure 5 shows the scaling of the total walltime for a conduction calculation on the polymer systems. The largest system that was feasible on 16 nodes had 1768 atoms, and beyond that, calculations with the low memory requirements could not complete a single iteration within the 60 h job time window. With standard memory requirements, calculations ran about 22% faster, but beyond 1548 atoms, they ran out of memory. With high memory requirements, further gains were very modest and the maximum system size was limited to only 448 atoms. Linear scaling was achieved in all cases.

In summary, we find that 16 compute nodes, with about 100 GB RAM on each node devoted to HFx engine caches, were sufficient to perform valence calculations on systems up to about 4000 atoms and conduction calculations up to about 1800 atoms. These calculations were performed with linear-scaling CPU effort and linear-scaling memory.

We highlight that smaller calculations (up to several hundred atoms) could be made faster, in practice, by not limiting their memory as much. We demonstrate this briefly in Fig. 6, where we show the calculation walltime for a 443-atom protein system as a function of memory devoted to HFx caches. This particular calculation could be made just over twice as fast by using a generous memory allowance.

### B. Calculation walltime compared to a GGA calculation

To give a better idea about the cost of HFx calculations, in Figs. 7 and 8, we show calculation walltimes relative to a calculation with a GGA functional (PBE) for the protein systems and polymer systems.

For the protein systems (Fig. 7), the computational effort plateaus at ≈100 (for standard memory requirements), meaning that for larger systems, hybrid calculations should be expected to be two orders of magnitude slower than calculations with a GGA functional. Using less or more memory has the effect discussed already in Sec. V A. For small systems (below 100 atoms), hybrid calculations are only about an order of magnitude slower than calculations with a GGA functional. This is because exchange matrix sparsity sets in later than NGWF overlap sparsity, which puts larger HFx calculations at a greater relative disadvantage.

For the polymer systems (Fig. 8), HFx performs somewhat better, being consistently slower than a GGA by a factor of ≈60 for all system sizes, for standard memory requirements. Using less or more memory has the effect discussed already in Sec. V A. The reason for better performance here is the one-dimensional nature of the polymer systems, which enables exchange matrix sparsity to be reached already for the smallest system, and a more homogeneous structure, which makes load balancing easier.

Finally, we turn to the conduction state calculations on the polymer systems. A comparison against a GGA calculation is shown in Fig. 9. We observe that overall HFx does not perform as well as it did for valence calculations on the same systems, but relative performance clearly improves systematically as the systems become bigger. This indicates that the employed algorithm scales better with system size than standard (GGA) onetep, at least at this range of system sizes. For the largest systems, the cost of a hybrid calculation is ≈50–60 times larger than for a GGA.

We note that all calculations shown in this subsection were performed with the maximum number of OMP threads (20), which is optimal for HFx but suboptimal for standard GGA calculations with onetep. This is because the scaling with the number of threads is much better for the HFx engine than for the rest of onetep, with the latter typically achieving optimum performance at 4–5 OMP threads. Indeed, once the GGA calculations are switched from an *N*_{MPI} = 32 and *N*_{OMP} = 20 setup to a more favorable *N*_{MPI} = 160 and *N*_{OMP} = 4 setup, their performance increases almost exactly by a factor of 2 for all systems. Thus, in practical scenarios, the reported slowdown of HFx relative to GGA would be twice as big.

### C. Individual components of computational effort

We will now provide a detailed breakdown of which components of the calculation are responsible for most of the computational effort for the classes of systems under study, demonstrating that each component also separately scales linearly with system size.

The majority of computational effort in HFx calculations with onetep is associated with the following stages of the calculation:

- **Evaluating SWpots**—calculating the values of the spherical wave potentials^{33} originating on atomic centers at Cartesian grid points of NGWFs whose centers are within the exchange cutoff. Although this stage has been extensively optimized, it remains a bottleneck in all studied systems. This effort is mitigated by the SWpot cache.
- **Expansions**—evaluating the linear combinations (cf. Algorithm 5, step 8) of SWpots originating on atomic centers at Cartesian grid points within the exchange cutoff. This effort is mitigated by the expansion cache.
- **Sum over *Cc***—calculation of the sums over *Cc* (cf. Algorithm 5, step 19 and Algorithm 6).
- **Gradient**—calculation of the auxiliary term *P* in the exchange NGWF gradient, Eq. (26).
- **Load imbalance**—idle time spent waiting for other MPI ranks to finish their share of calculations. This component would be zero if load could be balanced perfectly.
- **MPI comms**—time spent on message passing (communication between MPI ranks).
- **HFx other**—all the remaining stages of evaluating the HFx energy and gradient, excluding initialization (calculation of the metric matrix).
- **Rest of onetep**—all the non-HFx related calculations performed within onetep, excluding initialization.
- **Initialization**—constant overhead, independent of the number of energy evaluations, HFx-related and HFx-unrelated alike, e.g., calculating the metric matrix and structure factor, setting up the parallel distribution, calculating the Ewald energy term, and initialization of radial Bessel functions.

Figure 10 shows a breakdown of the total calculation walltime for the protein systems, on 16 compute nodes, under standard memory requirements. The same data are presented in a cumulative percentage form in Fig. 11. It is clear that all individual components are linear-scaling with the size of the system and that the evaluation of SWpots (red) constitutes the most computationally intensive part of the calculation, accounting for about half of the walltime. For these systems, at standard memory requirements, only (2.6 ± 1.0)% of SWpots can be stored in the cache. Since the SWpots are heavily reused and are stored in the order of reusability, actual SWpot cache hit ratios are much higher here, at (42.6 ± 4.2)%, but this still means that in over half of the cases, SWpots have to be re-evaluated. This cost can be mitigated by increasing the memory allowance for the SWpot cache, but after a certain point, this strategy yields diminishing returns—SWpots evaluated at the outer edges of the system, where there is little NGWF overlap, are necessarily less often reused, and caching them offers less benefit compared to SWpots evaluated at grid points where many NGWFs overlap.
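The disproportion between the cached fraction and the hit ratio follows directly from reusability-ordered caching: under skewed reuse, a cache holding only the most frequently requested entries serves far more than its proportional share of accesses. A minimal sketch of this effect (assumed logic with synthetic access data, not ONETEP's cache implementation):

```python
# Sketch: with capacity for only a small fraction of entries, caching the
# most frequently reused items (reusability order) yields a hit ratio far
# above that fraction when reuse is heavily skewed.

from collections import Counter

def hit_ratio(accesses: list, capacity: int) -> float:
    """Fraction of accesses served from a cache holding the `capacity`
    most frequently requested items."""
    counts = Counter(accesses)
    cached = {key for key, _ in counts.most_common(capacity)}
    hits = sum(1 for a in accesses if a in cached)
    return hits / len(accesses)

# Synthetic example: 100 distinct items, but items 0-4 are requested far
# more often; caching only 5% of the entries serves over half the accesses.
accesses = [i % 100 for i in range(100)] + [i % 5 for i in range(100)]
print(hit_ratio(accesses, 5))  # 0.525
```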

Load imbalance (light green) accounted for (17.4 ± 4.8)% of the calculation walltime. This relatively large value reflects the fact that the protein constitutes a rather heterogeneous system, where it is difficult to simultaneously achieve a good load balance for the energy calculation and for the NGWF gradient calculation. Here, it is the imbalance in the energy calculation that accounts for ≈80% of the total load imbalance. This is mostly caused by the differences in SWpot cache hit ratios between MPI ranks—the current balancing algorithm strives to balance the total number of *Bb*–*Cc* NGWF products, without accounting for the fact that SWpots on the outer edges of the system are unlikely to be cached.

Calculating expansions of NGWF products in terms of SWs (cf. Algorithm 5, step 8) (blue) accounted for (14.5 ± 1.0)% of the calculation walltime. This cost can be mitigated by increasing the allowance for the expansions cache, but the associated memory requirement grows superlinearly, meaning that with linear-scaling memory allowance, cache hit ratios will decrease as the systems are made bigger. Here, in the smallest (44-atom) system, the cache hit ratio was 46%, while in the largest (3228-atom) system, it was only 2.8%.

The sum over the index *Cc* (cf. Algorithm 5, step 19 and Algorithm 6) and all the remaining HFx operations were minor contributions, accounting for less than 7% of the total walltime each. The remaining terms (MPI communication, calculation of the NGWF gradient term *P*, and non-HFx operations and initialization) were almost negligible, at below 2% each.

We now turn our attention to the polymer systems, where corresponding plots are shown in Figs. 12 and 13. Qualitatively, the breakdown of timings resembles that for the protein systems, and all components retain their linear-scaling behavior. Evaluating SWpots remains the most costly component, accounting for (53.7 ± 1.7)% of the walltime. The main difference is the reduced load imbalance (light green) and an increase in the cost of generating the SW expansions (blue). The polymer systems are composed of identical repeat units and increase only along one dimension, which explains the much easier load balancing—associated overheads were below 9%, except for the smallest systems. Higher effort associated with SW expansions, compared to the protein systems, is a consequence of a higher average number of NGWFs per atom (≈3.0 vs ≈2.5). Other contributions to the walltime, similarly as for the protein systems, are much smaller. The slight anomaly for the final data point, where the cost of evaluating SWpots decreases despite the increase in the size of the system, is a result of fortuitous load balancing, increasing the average SWpot cache hit ratio from 50.0% to 54.5%.

For conduction state calculations, the distribution of the computational effort is somewhat different. The main reasons are as follows: the use of much bigger NGWF localization regions (which lead to a decrease in metric matrix and overlap matrix sparsities), the use of a large number of NGWFs per atom (vastly increasing the number of *Bb*–*Cc* NGWF products), and the fact that inner loop (LNV) iterations do not require the evaluation of HFx energy in conduction NGWF optimization.

The plots for conduction state calculations on the polymer chains are shown in Figs. 14 and 15. Linear-scaling behavior is retained for all the components of the algorithm. Evaluating SWpots remains the bottleneck, but its cost is slightly reduced to (37.9 ± 2.4)%. One reason is the relative increase in the SWpot cache memory allowance (cf. Table III), and another one is the increase in the effort associated with other components, particularly those associated with the NGWF gradient term.

The main difference from the valence calculation is the increased cost of “HFx: other” (orange diamonds) from (7.5 ± 0.2)% to (17.3 ± 1.8)%. This increase is driven by the increased NGWF overlap, leading to larger efforts associated with computing SWpot-NGWF-product overlaps (cf. Algorithm 4, step 6) and the auxiliary term *Q* in the NGWF gradient [Eq. (27)]. The same reason drives the significant increase in the effort associated with calculating the term *P*, which increased to (16.0 ± 3.4)% from (1.0 ± 0.4)% for the valence calculation, becoming more costly on average than the calculation of SW expansions.

The cost of evaluating SW expansions is significantly decreased [from (21.4 ± 2.3)% in the valence calculation to (12.8 ± 1.9)% here], which is a direct consequence of the fact that inner loop iterations in a conduction calculation do not involve HFx. The cost of imperfect load balancing is similar [(9.3 ± 3.0)%], but here, it is mostly the NGWF gradient calculation that is less than optimally balanced. Finally, the non-HFx components of the calculation (“rest of onetep”) are no longer negligible, accounting for (1.4 ± 0.3)% of the total calculation walltime.

In summary, we showed that all components of the calculation scale linearly with the system size, that evaluating SWpots is the bottleneck in all types of calculations, that the relative effort associated with other components depends on the type of calculation (valence vs conduction) and type of the system, that communication overheads are negligible, and that load balancing could be improved for more difficult systems.

### D. Calculation walltime and feasibility depending on number of compute nodes

Here, we briefly demonstrate how the total walltime of HFx calculations depends on the number of compute nodes and what system sizes are feasible. In Fig. 16, we plot the walltimes of calculations for protein systems, with standard memory requirements (cf. Fig. 3), run on 4, 8, 16, and 32 Iridis5 compute nodes (160, 320, 640, and 1280 CPU cores, respectively). The computational effort is linear-scaling (with an onset at ≈400 atoms) in all cases. Calculations on four and eight nodes reach the walltime limit (60 h per outer loop iteration) beyond 1396 and 2486 atoms, respectively. Larger calculations would only be possible on these configurations if the memory allowance could be increased significantly or if the walltime window per job could be extended beyond 60 h.

On 16 and 32 nodes, the maximum system size is much larger (3228 and 3622 atoms, respectively) and, in this case, is bounded by available RAM. Larger calculations would be possible on these configurations if the memory allowance were decreased. Indeed, we showed in Fig. 3 that *all* the studied systems (up to 4048 atoms) could already be run on 16 compute nodes under low memory requirements.

For the polymer chain systems, we show an equivalent plot in Fig. 17. Again, the calculation is linear-scaling (with the onset already at the smallest system). Here, owing to the fact that the systems are effectively one-dimensional and thus exchange matrix sparsity is reached very early, all the studied systems fit within the walltime limit (60 h per outer loop iteration) even on four compute nodes, and available RAM is not exhausted under standard memory conditions.

### E. Strong parallel scaling

The strong parallel scaling of an algorithm is typically quantified by the speedup, *s*(*N*_{cores}) = *t*(1)/*t*(*N*_{cores}), where *t*(*N*_{cores}) is the walltime of the calculation on *N*_{cores} CPU cores. Since in many scenarios it is not feasible to run the calculation on one CPU core, the speedup relative to a fixed number of cores, $N_{\text{cores}}^{0}$, is often used instead: *s*(*N*_{cores}) = *t*($N_{\text{cores}}^{0}$)/*t*(*N*_{cores}).

Well-parallelized algorithms achieve near-linear speedup, which usually becomes sublinear and then plateaus for larger *N*_{cores}, due to small but non-zero fractions of the algorithm that could not or have not been parallelized, an effect captured by Amdahl’s law.^{48} Non-negligible communication overheads and imperfect load balancing also play a role. A linear speedup is typically termed *perfect* or *optimal* scaling because it corresponds to a scenario where all the overheads vanish and the fraction of the algorithm that has not been parallelized is zero.
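The strong-scaling metrics used in this section can be sketched in a few lines (the walltimes below are illustrative, not measured values; only the 4.17-fold speedup figure comes from the text):

```python
# Sketch of strong-scaling metrics: speedup relative to a reference core
# count, and the corresponding parallel efficiency (e = 1 is "perfect"
# scaling; e > 1 is superlinear).

def speedup(t_ref: float, t_n: float) -> float:
    """Speedup s = t(N_ref) / t(N) relative to the reference run."""
    return t_ref / t_n

def efficiency(t_ref: float, n_ref: int, t_n: float, n: int) -> float:
    """Parallel efficiency e = s / (N / N_ref)."""
    return speedup(t_ref, t_n) / (n / n_ref)

# Example: quadrupling the cores from 320 to 1280 with a 4.17-fold speedup
# (cf. the large protein system) gives e > 1, i.e., superlinear scaling.
print(efficiency(100.0, 320, 100.0 / 4.17, 1280))  # ~1.04
```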

In the case of our implementation, this simplified analysis does not strictly hold because our algorithm makes good use of the extra memory that becomes available as additional compute nodes are allocated to the calculation. We showed in Sec. V A (see, e.g., Fig. 6) that the performance of our implementation strongly depends on the amount of memory that can be devoted to the calculation. This particular feature allows our approach to *exceed* what is typically deemed to be perfect parallel scaling by improving the degree of caching as more CPU cores are added. We demonstrate this for representative protein systems (one small and one large in Fig. 18) and for representative polymer systems (one small and one large in Fig. 19). We show parallel speedups relative to 160 CPU cores (four nodes) or, where the system is too large to run on four nodes, relative to 320 CPU cores (eight nodes).

For the small protein system (Fig. 18, top), we modestly exceed perfect scaling, except for the last two data points, where the speedup becomes marginally worse than linear. This happens due to imperfect load balancing, which is difficult to achieve when the number of CPU cores exceeds the number of atoms by a factor of about 7. For the large protein system (Fig. 18, bottom), we exceed perfect scaling in all cases, even for the largest number of CPU cores, where at 1280 cores we achieve a 4.17-fold speedup over 320 cores.

The polymer chains, being more homogeneous and practically one-dimensional, constitute an easier case for our approach, which achieves better-than-perfect scaling at least up to 1280 CPU cores for the smaller and larger system alike (Fig. 19). For the larger system in particular, the gains are substantial—an eightfold increase in the number of CPU cores (from 160 to 1280) yields a 10.2-fold speedup.

For conduction state calculations, our algorithm does not scale superoptimally (see Fig. 20), but the scaling remains respectable. For a small system (228 atoms), an eightfold increase in the number of CPU cores offers a 4.22-fold speedup, corresponding to a parallel efficiency of 0.53. For a larger system (888 atoms), which was benchmarked against 320 cores, a fourfold increase in the number of CPU cores yielded a 2.72-fold speedup (efficiency of 0.68). The main culprit responsible for the worse scaling of conduction calculations is parallel load imbalance in the NGWF gradient part. A more careful load balancing scheme, perhaps one dedicated to the particular requirement of conduction calculations, might be able to mitigate the problem.

### F. Weak parallel scaling

To quantify weak parallel scaling, we use the *parallel efficiency*, defined as the ratio of the walltime on the reference core count to the walltime on *N*_{cores} cores, with the system size increased in proportion to the number of cores.

From the definition of efficiency and the discussion in Sec. V E, it follows that perfect scaling will correspond to *e* = 1. In typical scenarios, Amdahl’s law together with load imbalance and communication overheads will cause efficiency to drop below 1 for larger core counts. As our algorithm can make good use not only of the additional CPU cores but also of the additional RAM, we expect it to achieve “superoptimal” efficiency at least in some scenarios.

When investigating weak parallel scaling, one must be able to increase the system size in a uniform fashion so that the number of atoms per CPU core is constant. For this reason, we will only show weak scaling for the polymer systems, where this can be achieved by adding identical units to the system.
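Under the constant-atoms-per-core condition just described, the per-core workload is fixed, so the weak-scaling efficiency reduces to a ratio of walltimes. A minimal sketch (the numbers are illustrative, not measured values):

```python
# Sketch of weak-scaling efficiency under the assumption stated above:
# the system grows in proportion to the core count, so the per-core
# workload is constant and e = t(N_ref) / t(N).

def weak_efficiency(t_ref: float, t_n: float) -> float:
    """Weak-scaling efficiency; e = 1 means the walltime stays constant as
    both the system size and the core count grow proportionally."""
    return t_ref / t_n

# Example: if the walltime rises from 40 h to 50 h as the system and the
# core count both grow eightfold, the weak-scaling efficiency is 0.8.
print(weak_efficiency(40.0, 50.0))  # 0.8
```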

The results of our measurements are presented in Fig. 21, which shows plots of efficiency (relative to 160 CPU cores) for both the valence and conduction state calculations. For valence calculations, scaling is near-optimal at lower system sizes and core counts, slightly exceeding perfect scaling for larger systems, which is an excellent result. Standard GGA calculations do not scale as well, although their performance is still good, with efficiencies of about 0.8 for all setups.

For conduction calculations, scaling is suboptimal, but respectable, across the board, with *e* ≈0.7–0.8, which is still much better than GGA conduction calculations, where efficiency drops to below 0.4 for the largest core counts. Several factors are responsible for the worse scaling performance of conduction calculations compared to valence calculations. First, because of how conduction calculations are structured (cf. Sec. III), there is no benefit to caching NGWF expansions in this case, as NGWFs always change between invocations of the HFx engine. This means that the SW expansion stage does not benefit from the additional RAM at higher core counts. Second, the calculation of the *P* term in the NGWF gradient does not scale as well as the other components, presumably because NGWF products are not cached either in this case. The additional RAM is not wasted; it is devoted to caching SWpots (see Table III), and this stage does achieve *e* > 1. Finally and most importantly, the impact of parallel load imbalances increases as the calculations move to larger core counts. In contrast to valence calculations, here, it is the NGWF gradient calculation stage that becomes increasingly poorly balanced. This again (cf. the end of Sec. V E) suggests that taking the NGWF gradient stage into account in the load balancing scheme would be worth pursuing.

### G. Calculation walltime vs MPI/OMP balance

We will now briefly consider how the calculation walltime depends on the division of work across MPI ranks and OMP threads. onetep supports so-called hybrid parallelism, that is, it runs multiple MPI processes (termed ranks), each of which spawns OMP threads. Processes reside in separate address spaces (and often on distinct physical machines, termed nodes), while threads spawned from a single rank share memory. All large data structures are thus distributed across ranks, but shared across threads. It is up to the user to divide the pool of available CPU cores *N*_{cores} across *N*_{MPI} ranks and *N*_{OMP} threads such that *N*_{cores} = *N*_{MPI}*N*_{OMP}. The maximum number of ranks is limited by system size. Roughly speaking, the number of atoms must be larger than *N*_{MPI}, and in practice, load balancing issues cause performance to deteriorate when the number of atoms per rank becomes small. The maximum number of threads is limited by the number of CPU cores on a node.

Standard (non-hybrid) onetep calculations typically attain best performance at four or five OMP threads, where overheads are optimally balanced between MPI ranks and OMP threads.^{41} The HFx engine, however, benefits from additional memory that becomes available when the MPI/OMP balance is shifted toward using more OMP threads and fewer MPI ranks. For instance, on a typical 128 GB node with 32 CPU cores, one could allocate 16 GB of memory per MPI rank when using *N*_{MPI} = 8 and *N*_{OMP} = 4 but as much as 64 GB per MPI rank when using *N*_{MPI} = 2 and *N*_{OMP} = 16 (neglecting the memory used by the OS and supporting software architecture). The additional memory devoted to HFx engine caches will substantially increase performance (cf. Fig. 6). However, for this to happen, the algorithm needs to scale well with the number of OMP threads.
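The per-rank memory arithmetic above can be sketched as follows (the 128 GB/32-core node is the hypothetical example from the text; OS overheads are neglected, as there):

```python
# Sketch of the per-rank memory arithmetic discussed above: shifting the
# MPI/OMP balance toward more OMP threads leaves fewer MPI ranks per node,
# and hence more memory per rank for the shared HFx caches.

def memory_per_rank(node_mem_gb: float, cores_per_node: int, n_omp: int) -> float:
    """Memory available per MPI rank when each rank spawns n_omp threads
    (neglecting OS and supporting-software overheads)."""
    ranks_per_node = cores_per_node // n_omp
    return node_mem_gb / ranks_per_node

print(memory_per_rank(128, 32, 4))   # 16.0 GB per rank (8 ranks/node)
print(memory_per_rank(128, 32, 16))  # 64.0 GB per rank (2 ranks/node)
```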

In Fig. 22, we show how the calculation walltime changes with the number of OMP threads. Crucially, the total memory use of the HFx engine has been kept constant across all data points in each of the two curves. Clearly, for both systems, the calculation is fastest when 20 OMP threads are used, becoming somewhat slower as balance is shifted toward lower numbers of threads. The performance decrease at 40 OMP threads, despite allocating all of the node’s memory to a single MPI rank, is due to the fact that Iridis5 nodes encompass two NUMA regions, and splitting a process across two regions results in a severe performance hit due to nonlocal memory accesses. This is a typical situation on many HPC machines. We did not run the small system with one or two OMP threads because the system was too small for the resulting large *N*_{MPI}. The large system could not be run with four or fewer OMP threads because the resulting large number of MPI ranks per node led to an out-of-memory condition due to the cost of non-HFx components of onetep.

In summary, we confirmed our expectation that our algorithm performs best at the highest possible number of OMP threads, provided that processes do not cross NUMA boundaries. This also confirms excellent OMP scaling of the algorithm, without which the above result would not have been possible.

### H. Preconverging with a non-hybrid functional

We now briefly show how the performance of hybrid functional calculations can be improved by employing the optimization described in Sec. III H, and we assess the effect of this optimization on energies.

We begin by demonstrating that the error in the energies associated with restarting from a pre-converged calculation is practically negligible (and certainly much smaller than the error introduced by switching to PBE) in both scenarios—when the calculation is continued with a hybrid functional, optimizing both the density kernel and the NGWFs (“B3LYP-1”), and when only the density kernel is optimized (“B3LYP-2”)—with the latter approach avoiding the NGWF gradient computation stage entirely.

As the investigated quantities, we use (a) the bond-stretch energy curve of ethene, (b) the interaction energy curve of water with a chloride ion, (c) the interaction energy of a sodium cation with its first solvation shell for a number of snapshots obtained from classical molecular dynamics, and (d) the HOMO–LUMO gap of the 888-atom polymer system investigated earlier in the text and shown in Fig. 20. In this way, we probe not only bonded but also non-bonded interactions, and we investigate both small systems [(a)–(c)] and a large system [(d)].

The smallest systems were run using correspondingly modest computational resources—two MPI processes with four OpenMP threads each for ethene and H_{2}O:Cl^{−} and eight MPI processes with five OpenMP threads each for Na^{+}:6H_{2}O. Memory load did not exceed 4 GB per MPI process, and as such, it was not capped. The 888-atom polymer system was run on 16 Iridis5 nodes (two MPI processes per node, with 20 OpenMP threads each), with memory capped to 50 GB per MPI process.

The results are shown in Figs. 23–25 and in Tables IV and V. For ethene (Fig. 23), the bond-stretch curves are practically indistinguishable between the full B3LYP calculation and both approaches based on restarts, with mean errors on the order of 0.1 kcal/mol or less, while PBE consistently overbinds by as much as 17 kcal/mol (Table IV). For the H_{2}O:Cl^{−} system (Fig. 24), the interaction energy curves are also very similar between the full B3LYP calculation and both approaches based on restarts, with errors never exceeding 0.1 kcal/mol, while PBE overbinds by 1–2 kcal/mol (Table IV). In the Na^{+}:6H_{2}O system, PBE underbinds by 1–2.5 kcal/mol, and again, the results of the two restart-based approaches are almost always within 0.1 kcal/mol of the full B3LYP calculation. For these very small systems (*N*_{atoms} < 20), the efficiency gains from using a restart-based approach are either very modest or non-existent (Table V). This is due to the fact that for small systems, all requisite SWs and expansions can be cached in RAM, making the employed time-memory trade-offs very efficient and the calculation of HFx near-optimal. As shown earlier in Figs. 3 and 4, the memory load of caching everything quickly becomes prohibitive, and indeed, the efficiency gains from using a restart-based approach for the large polymer system become quite significant—over twofold for B3LYP-1 and over 19-fold for B3LYP-2 (Table V). As we did not look at the energy of binding for the polymer system, we instead calculated its HOMO–LUMO gap to assess the accuracy of the restart-based approach. Table VI shows that the associated error was on the order of 0.02 eV (less than 2%), again much smaller than the one associated with switching to PBE for the entire calculation.
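The three error measures quoted in Tables IV and VI can be computed as follows. This is a self-contained sketch; the function name is ours, and any numbers fed to it are placeholders, not data from this work.

```python
import math

def error_statistics(e_model, e_ref):
    """Mean signed, RMS, and maximum absolute error of a model's
    values against reference (here, full-B3LYP) values, in the units
    of the inputs (kcal/mol in Table IV, eV in Table VI)."""
    d = [m - r for m, r in zip(e_model, e_ref)]
    n = len(d)
    mean_signed = sum(d) / n
    rms = math.sqrt(sum(x * x for x in d) / n)
    max_abs = max(abs(x) for x in d)
    return mean_signed, rms, max_abs
```

Applied to, e.g., a bond-stretch curve, `e_model` would hold the PBE or restart-based energies and `e_ref` the full-B3LYP ones.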

**TABLE IV.** Mean signed, RMS, and maximum errors in the investigated energies, relative to a full B3LYP calculation.

| System | Model | Mean signed error (kcal/mol) | RMS error (kcal/mol) | Maximum error (kcal/mol) |
|---|---|---|---|---|
| C_{2}H_{4} | PBE | −17.017 | 17.268 | 21.651 |
| | B3LYP-1 | −0.016 | 0.035 | 0.091 |
| | B3LYP-2 | 0.110 | 0.121 | 0.159 |
| H_{2}O:Cl^{−} | PBE | −1.046 | 1.121 | 1.953 |
| | B3LYP-1 | 0.021 | 0.022 | 0.030 |
| | B3LYP-2 | 0.085 | 0.085 | 0.096 |
| Na^{+}:6H_{2}O | PBE | 1.650 | 1.689 | 2.490 |
| | B3LYP-1 | −0.087 | 0.091 | 0.148 |
| | B3LYP-2 | −0.012 | 0.042 | 0.111 |


**TABLE V.** Speed-up of the restart-based approaches relative to a full B3LYP calculation.

| System | Model | Speed-up relative to a full B3LYP calculation |
|---|---|---|
| C_{2}H_{4} | B3LYP-1 | 0.75 |
| | B3LYP-2 | 1.02 |
| H_{2}O:Cl^{−} | B3LYP-1 | 0.71 |
| | B3LYP-2 | 1.01 |
| Na^{+}:6H_{2}O | B3LYP-1 | 1.10 |
| | B3LYP-2 | 1.72 |
| 888-atom polymer | B3LYP-1 | 2.21 |
| | B3LYP-2 | 19.44 |


**TABLE VI.** HOMO–LUMO gap of the 888-atom polymer system and its error relative to the full B3LYP calculation.

| Model | HOMO–LUMO gap (eV) | Error relative to B3LYP (eV) | Error relative to B3LYP (%) |
|---|---|---|---|
| PBE | 0.482 | −0.810 | −62.69 |
| B3LYP | 1.293 | 0.000 | 0.00 |
| B3LYP-1 | 1.267 | −0.026 | −2.01 |
| B3LYP-2 | 1.281 | −0.012 | −0.89 |


We thus conclude that it is practicable, and sufficiently accurate, to significantly reduce the cost of hybrid functional calculations by first preconverging with a GGA for all systems except the smallest ones.

### I. Demonstration of practicability—example calculations on large imogolite nanotube systems

We finish with a demonstration of the practicability of the presented approach for calculating Hartree–Fock exchange by showing fully converged results obtained with B3LYP for large nanotube systems (up to 1416 atoms). The potential of these aluminosilicate nanotubes and their derivatives for selective photo-catalytic applications has recently been explored computationally^{49–53} and has begun to be verified experimentally.^{54,55} In this work, we calculated the electronic density of states (Fig. 26), the HOMO–LUMO bandgap (Fig. 27), and the free energy of solvation in implicit solvent water (Fig. 28) of pristine (undefected), hydrated, aluminosilicate imogolite nanotubes of three different sizes—three, five, and seven units (see Ref. 49 for more details).

The structure of the largest considered nanotube, surrounded with implicit solvent, is shown in Fig. 29. Our calculations ran on 32 compute nodes, each with 40 CPU cores and 192 GB of RAM, and, for the largest system, took about 14 h per NGWF optimization iteration; as calculations typically take 10–20 iterations to fully converge (roughly 6–12 days of walltime), several restarts were required to complete them.

As expected, we observe (Figs. 27 and 28) a widening of the HOMO–LUMO gap by about 1.5 eV once the functional is switched from PBE to a hybrid (B3LYP). The subsequent addition of the water environment (modeled using our minimal-parameter solvent model^{56,57}) further increases the gap by about 0.4 eV. The magnitude of the gap appears reasonably well-converged with system size at seven nanotube units.

The calculated free energy of solvation is on the order of −200 kcal/mol per nanotube unit of length and is more favorable by about 13 kcal/mol when calculated with B3LYP, compared to PBE. It changes appreciably between system sizes, presumably due to the effect of the ends of the nanotube and different structural relaxation.^{49} The calculated free energy of solvation still seems slightly underconverged with system size even at seven nanotube units. The presented method development in onetep paves the way for follow-up research into this aspect of imogolite nanotubes, as well as hybrid (linear-scaling) DFT simulations of the mechanisms of the assembly of solvated proto-imogolite fragments into the final nanotubes.^{58,59}

## VI. CONCLUSIONS

We presented a massively parallel linear-scaling algorithm for the calculation of Hartree–Fock exchange and hybrid functionals, subsequently discussing and benchmarking its implementation on a number of systems relevant to industrial applications.

Our approach is based on expressing products of localized orbitals (NGWFs) in terms of truncated spherical waves, the expressions for the electrostatic potential of which are known analytically. The carefully thought out parallel distribution of both data and algorithms, together with the aggressive use of time-memory trade-offs, allows our approach to achieve very high parallel efficiency in scenarios where the number of CPU cores is comparable to the number of atoms and beyond.

We showed how, on today’s machines, our approach is able to treat systems of up to about 1500 atoms routinely—requiring several hundred CPU cores to achieve a walltime of under one week. The largest system that we demonstrated to be practicable on 32 compute nodes contained 4048 atoms. The excellent scaling properties of our approach mean that systems even larger than that will be treatable, although they would require substantial computational resources (*N*_{cores} ≈ *N*_{atoms}). Conduction state calculations have larger requirements due to reduced matrix sparsity and larger numbers of NGWFs, and the largest system we showed to be practicable on 16 nodes contained 1768 atoms.

The computational effort of our approach scales linearly with system size, and the same is true for all of its components individually. The strong parallel scaling is excellent, occasionally becoming superlinear owing to the extensive use of time-memory trade-offs, and although there is still room for optimization in conduction state calculations, even these scale better than corresponding calculations with a GGA. Our implementation scales very well to high thread counts and to large numbers of MPI processes, retaining very good efficiency even in the regime where *N*_{cores} ≫ *N*_{atoms}.

Because our approach, through the use of a finite auxiliary basis, is an approximation, in the supplementary material we assessed the magnitude of the introduced error and showed how it converges with the tunable parameters of the SW basis set. In all cases, we found the magnitude of the error to be extremely small and controllable.

The methods presented in this paper not only significantly narrow the performance gap in linear-scaling DFT between hybrid functionals and GGAs or *meta*-GGAs but also pave the way for future developments in onetep that would employ four-center electron repulsion integrals—such as Random Phase Approximation (RPA) or Møller–Plesset perturbation theories, or calculations employing screened hybrids.

## SUPPLEMENTARY MATERIAL

See the supplementary material for how the additional approximations employed in our approach (the use of an exchange cutoff and the finiteness of the auxiliary basis) are controllable and how the associated errors are very small.

## ACKNOWLEDGMENTS

This work was funded by the Engineering and Physical Sciences Research Council (EPSRC), UK, as part of a flagship project of the CCP9 consortium (EPSRC Grant No. EP/P02209X/1). We acknowledge the support of the high-performance computing centers where we ran the calculations: Iridis5 at the University of Southampton (UK) and tryton at the TASK Academic Computer Centre (Gdańsk, Poland). We also acknowledge the UKCP for access to ARCHER and ARCHER2 (EPSRC Grant No. EP/P022030/1) and the MMM hub for access to Young (EPSRC Grant No. EP/T022213/1). J.D. would like to thank Gilberto Teobaldi and Emiliano Poli for fruitful discussions regarding the imogolite nanotube systems and for making their structures available.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors declare that there is no conflict of interest.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.