Simulations of condensed matter systems at the hybrid density functional theory level pose significant computational challenges. The elevated costs arise from the non-local nature of the Hartree–Fock exchange (HFX) in conjunction with the necessity to approach the thermodynamic limit. In this work, we address these issues with the development of a new efficient method for the calculation of HFX in periodic systems, employing k-point sampling. We rely on a local atom-specific resolution-of-the-identity scheme, the use of atom-centered Gaussian type orbitals, and the truncation of the Coulomb interaction to limit computational complexity. Our real-space approach exhibits a scaling that is, at worst, linear with the number of k-points. Issues related to basis set diffuseness are effectively addressed through the auxiliary density matrix method. We report the implementation in the CP2K software package, as well as accuracy and performance benchmarks. The method demonstrates excellent agreement with equivalent Γ-point supercell calculations in terms of relative energies and nuclear gradients. Good strong and weak scaling performance, together with graphics processing unit (GPU) acceleration, makes this implementation a promising candidate for high-performance computing.
I. INTRODUCTION
Finding their roots in the adiabatic connection theory,1–3 hybrid functionals have emerged as an accuracy standard for density functional theory (DFT) calculations of molecules.4–8 Beyond the local descriptions provided by functionals in the local density approximation (LDA) and generalized gradient approximation (GGA), hybrid functionals incorporate a fraction of non-local Hartree–Fock exchange (HFX). While such functionals are known to be accurate for the simulation of solids in periodic boundary conditions (PBCs),9–15 their computational cost increases dramatically due to their non-local nature and the necessity to approach the thermodynamic limit (TDL). This can be achieved by sampling k-points over the Brillouin zone (BZ) or by considering large supercells at the Γ-point only. Although the former strategy scales better toward the TDL, implementing Γ-point methods in molecular codes is simpler, leading to the availability of both approaches in various quantum chemistry codes.16–23 In addition to an improved description of the electronic structure, hybrid DFT calculations of solids can serve as a foundation for higher level methods, such as GW24–26 or RPA.27–29 Periodic post-HF methods30–33 also benefit from such implementations.
The development of HFX methods for PBC calculations has a rich history and remains an active area of research. In the context of quantum chemistry utilizing local atom-centered basis sets, typically Gaussian type orbitals (GTOs), fundamental work was carried out by Pisani and co-workers in the 1980s.34–38 Their approach directly extended molecular HFX to PBCs, involving the calculation of two-electron four-center electron repulsion integrals (ERIs) and the truncation of the resulting infinite sums. This latter point is highly non-trivial, as the exact exchange energy diverges for systems with finite sampling of the BZ.39 To tackle this, novel approaches were developed, relying on the minimum image convention40 (MIC) or the truncated Coulomb (TC) interaction.41 While these new methods effectively dealt with the divergence issue, they were initially developed only for Γ-point supercell calculations. More recently, Irmler and co-workers extended them to k-point sampling.42 Note that both the MIC and TC approaches were adapted from equivalent plane wave schemes.43–45 Over the years, various strategies were devised for the efficient evaluation of the molecular HFX energy, with some adapted for PBCs. Variations of the resolution-of-the-identity (RI) technique46,47 have been employed for calculations with k-point sampling,48–53 while the auxiliary density matrix method (ADMM) was specifically developed for Γ-point calculations.54 We note that efficient algorithms have also been developed for plane wave calculations.55–59
In this work, we present the development of a novel atom-specific RI method tailored for the calculation of periodic HFX with k-point sampling and GTOs. Particular emphasis is put on efficiency, which is enhanced by the adaptation of the ADMM approximation to k-point sampling, as well as the use of the truncated Coulomb interaction with a fixed range. We discuss the implementation of the method in the open-source CP2K software package23 and demonstrate its potential for high-performance computing applications.
II. THEORY
A. General
Two approaches to building the exact exchange matrix are found in the literature, each with its variations, advantages, and drawbacks. The direct approach outlined in Eq. (3) confines the Bloch AO indices to the unit cell, facilitating the efficient handling of diffuse basis functions. This approach is favored in most plane-wave codes, as well as in the occ-RI-K method of Lee and co-workers.52,53 Its primary limitation lies in the quadratic scaling of the cost with respect to the number of k-points. On the other hand, the cost of building the exact exchange matrix in real-space, as depicted in Eq. (5), is, in principle, independent of the number of k-points. Instead, it depends entirely on the extent of the cell translation vectors a, b, and c, which is determined by the basis set diffuseness and the range of the exchange potential. Variations of this latter approach were adopted in the pioneering work of Pisani and co-workers,34 as well as in more recent publications.42,48,51 Hybrid methods combining aspects of both techniques have also emerged.60
Considering that most periodic hybrid DFT calculations are performed with small unit cells and dense k-point meshes, we believe that the real-space approach is more suitable. Indeed, the overall cost scales at worst linearly with the number of k-points, and linear scaling only sets in once diagonalization dominates. In this work, we undertake several measures to reduce the cost of building the real-space exact exchange matrices and implement them in the CP2K software package. To achieve this, we reduce the extent of the a, b, and c cell indices by using the truncated Coulomb operator. Concurrently, we improve the overall computational scaling and efficiency with a novel atom-specific resolution-of-the-identity (RI) scheme. Furthermore, we address the challenge of basis set diffuseness by employing the auxiliary density matrix method (ADMM).54 Each of these steps is detailed in the following sections: Sec. II B for the truncated Coulomb operator, Sec. II C for the RI approximation, and Sec. II D for ADMM.
Note that real-space approaches are also well suited for band structure calculations. Given a converged SCF calculation with a sufficiently dense k-point mesh, the resulting collection of real-space Fock matrices Fb can be Fourier transformed back to any point along a k-point path. After diagonalization, the eigenvalues yield the band structure along that path.
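As an illustration, a minimal Python sketch of this procedure is given below; the names F_b, S_b (real-space Fock and overlap matrices), and cells (Cartesian translation vectors) are ours, and a generalized eigenvalue problem is assumed since GTOs are non-orthogonal:

```python
import numpy as np
import scipy.linalg as la

def matrix_at_kpoint(M_b, cells, k):
    """Back Fourier transform: M(k) = sum_b exp(i k.b) M^b,
    summing real-space matrices M^b over cell translations b."""
    Mk = np.zeros(M_b[0].shape, dtype=complex)
    for Mb, b in zip(M_b, cells):
        Mk += np.exp(1j * np.dot(k, b)) * Mb
    return Mk

def band_structure(F_b, S_b, cells, k_path):
    """Eigenvalues along a k-point path: at each k, solve the
    generalized eigenvalue problem F(k) C = S(k) C eps."""
    bands = []
    for k in k_path:
        Fk = matrix_at_kpoint(F_b, cells, k)
        Sk = matrix_at_kpoint(S_b, cells, k)
        bands.append(la.eigvalsh(Fk, Sk))  # Hermitian generalized solver
    return np.array(bands)
```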
B. Truncated Coulomb operator
C. Atom-specific resolution-of-the-identity scheme
From the real-space exact exchange matrix outlined in Eq. (5), it becomes apparent that calculating two-electron four-center electron repulsion integrals (ERIs) comes with an unfavorable scaling with respect to the system and basis set sizes. In Γ-point calculations, ERIs stemming from different periodic images can be summed in advance before being contracted with the density matrix. This typically allows for the computation and storage of all ERIs at the first SCF step, after which they can be efficiently retrieved from memory for all subsequent iterations. In four-center k-point HFX calculations, however, the ERIs can only be partially pre-summed, and storing them in memory would rapidly become intractable. They would therefore have to be completely recalculated at each SCF step, leading to substantial additional costs. On the other hand, RI schemes only require two- and three-center ERIs, making integral storage much more manageable. Moreover, previous research70 has demonstrated that RI-HFX is particularly efficient for small- to medium-sized dense systems, which is the typical application case for k-point calculations. Finally, adopting an RI-based method shifts the computational bottleneck from four-center ERI evaluation to tensor contractions, a task better suited for graphics processing unit (GPU) acceleration. For all these reasons, we developed an RI-based method for periodic HFX k-point calculations.
The atom-specific RI basis for atom i in the reference cell should span the same space as all possible products φμi(r)φσ(r − a), where φμi(r) is an AO centered on atom i and φσ(r − a) is a general AO in cell a. For these products to be non-zero, the AOs have to overlap, which only happens within a region around atom i where φμi(r) is non-negligible. In practical terms, we define the range of a primitive Gaussian function as Rmax, such that φμi(Rmax) = ɛPGF. Two functions overlap if the distance between their centers is smaller than the sum of their ranges. The extent of the atom-specific RI basis is defined by a sphere centered on atom i, with a radius equal to the range of the most diffuse AO in the system. All atoms within that sphere contribute their RI basis functions. To avoid a discontinuous change of the basis upon atomic displacements, atoms on the edge of the RI region are given a lesser weight, enforced with a bump function, as introduced by Sodt and co-workers.71 Figure 1 illustrates how the atom-specific RI basis is constructed. This choice of a local RI basis is based on the properties of the system and does not require user input, besides the general ɛPGF accuracy parameter. Any standard optimized or automatically generated RI basis can be used. Note that the extent of the RI basis is independent of the system size. Further details about the bump function used in this work can be found in the supplementary material. Other local RI schemes, where each product is projected on GTOs centered on atoms i in the reference cell and j in cell a, can be found in the literature.48,51 The RI method used in this work is not robust in the sense defined by Dunlap.72 A strictly robust density fitting scheme51,73 would, however, lead to a much greater number of tensor contractions and steep computational costs. Furthermore, the accuracy of RI-HFXk is not very sensitive to the value of ɛPGF (see the supplementary material).
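To make the construction concrete, a minimal Python sketch of the atom selection and edge weighting follows; the smoothstep polynomial stands in for the actual bump function (given in the supplementary material), and r_inner_frac is a hypothetical parameter controlling where the fade-out begins:

```python
import numpy as np

def gaussian_range(alpha, eps_pgf=1e-5):
    """Range R_max of a primitive Gaussian exp(-alpha r^2), defined
    through exp(-alpha R_max^2) = eps_pgf."""
    return np.sqrt(-np.log(eps_pgf) / alpha)

def bump(r, r_inner, r_outer):
    """Smooth weight: 1 inside r_inner, 0 beyond r_outer. A smoothstep
    placeholder; the actual bump function is in the supplementary material."""
    if r <= r_inner:
        return 1.0
    if r >= r_outer:
        return 0.0
    x = (r_outer - r) / (r_outer - r_inner)
    return x * x * (3.0 - 2.0 * x)

def ri_basis_weights(coords, i, r_max, r_inner_frac=0.8):
    """Atoms contributing RI functions to atom i: all atoms within a
    sphere of radius r_max (range of the most diffuse AO), with weights
    fading to zero at the edge to keep the basis continuous."""
    dists = np.linalg.norm(coords - coords[i], axis=1)
    weights = [bump(d, r_inner_frac * r_max, r_max) for d in dists]
    return {j: w for j, w in enumerate(weights) if w > 0.0}
```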
D. Auxiliary density matrix method
The use of the ADMM with the RI-HFXk method described in this work offers several advantages. Calculations are accelerated by selecting less diffuse auxiliary basis sets, thereby reducing the extent of cell indices a, b, and c in Eq. (9). Additionally, the use of smaller auxiliary basis sets contributes to faster matrix–matrix multiplications and tensor contractions. Finally, the approach enables calculations with more diffuse primary basis sets that would typically be prohibitively expensive.
E. Implementation
Equation (9) for the exact exchange contribution to the Fock matrix is straightforward, but its efficient implementation requires effort. Numerous aspects hinder performance. The triple loop over periodic cells is problematic, given that the number of images to consider typically runs into the hundreds. Coupled with the additional double loop over atom indices, this results in a very large number of small tensor contractions, which is not ideal for distributed computing. Additionally, a trade-off has to be found between what can be stored in memory and what needs recomputing at each SCF step. The most important aspects of our implementation are discussed below.
Equation (13) can be efficiently handled by batched contractions. Large matrices/tensors can be built by stacking the tensors involved in a proper order, such that various a and c pairs sum up to the same d. While stacking along d can also take place in Eq. (14), the tensors remain much smaller there, leading to a hard limit on strong scaling. The triple loop over b, i, and j becomes, in most cases, the bottleneck of the calculation. To deal with it efficiently, we distribute the b, i, and j triplets over multiple MPI subgroups working independently. Because each subgroup has fewer resources, it performs the small matrix/tensor contractions at a higher parallel efficiency. This comes with some communication and memory overhead, as the three-center tensors must be replicated on each subgroup. Part of this data can be replicated once and for all at the first SCF step, while the rest has to be broadcast at each step. Note that i, j atomic pairs are assigned to subgroups according to index i, such that the corresponding three-center tensor is only partially copied.
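A minimal sketch of such a distribution is given below (plain Python); the round-robin assignment by atom index i is illustrative and not CP2K's actual load balancer:

```python
def assign_triplets(n_cells_b, n_atoms, n_subgroups):
    """Distribute the (b, i, j) work units over independent MPI
    subgroups. Assigning all pairs (i, *) by atom index i means each
    subgroup only needs the slice of the three-center tensor that
    belongs to its own atoms i, so the tensor is only partially copied."""
    work = {g: [] for g in range(n_subgroups)}
    for i in range(n_atoms):
        g = i % n_subgroups
        for j in range(n_atoms):
            for b in range(n_cells_b):
                work[g].append((b, i, j))
    return work

# e.g. 4 subgroups for a 10-atom cell with 300 periodic images b:
# work = assign_triplets(300, 10, 4)
```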
Finally, we exploit the linearity of the exact exchange matrix with respect to the density matrix. At the nth SCF step, we build the exact exchange matrix from the density difference ΔP = Pn − Pn−1 only, and add the result to the exchange matrix of the previous step. As the SCF converges, ΔP becomes smaller, leading to sparser tensor contractions and faster SCF iterations. To maintain a low overall number of SCF steps, we implemented a k-point-specific DIIS solver.75 We also leverage symmetry to reduce the computational effort.
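A minimal sketch of this incremental build, assuming some linear exchange builder build_K (a hypothetical callable standing in for the contractions of Eqs. (13) and (14)):

```python
import numpy as np

def exchange_update(build_K, P_new, P_old=None, K_old=None, tol=1e-9):
    """Incremental exact-exchange build exploiting linearity:
    K(P_n) = K(P_{n-1}) + K(P_n - P_{n-1}). As the SCF converges,
    the density difference becomes sparse and K(dP) gets cheap."""
    if P_old is None:           # first SCF step: full build
        return build_K(P_new)
    dP = P_new - P_old
    dP[np.abs(dP) < tol] = 0.0  # filter small entries -> sparsity
    return K_old + build_K(dP)
```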
Depending on the system and the method used to generate k-point meshes, symmetries can be exploited to reduce the effective number of k-points. This can be useful to lower the time spent in diagonalization [Eq. (1)]. Because these operations are usually not a bottleneck in the real-space approach developed in this work, only the straightforward time-reversal symmetry (k = −k) is exploited.
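As an illustration, a minimal Python sketch of a time-reversal reduction of a k-point mesh (fractional coordinates; names are ours):

```python
import numpy as np

def reduce_time_reversal(kpoints, tol=1e-8):
    """Keep one member of each (k, -k) pair, doubling its weight.
    k-points are in fractional coordinates and folded into [-0.5, 0.5)
    modulo a reciprocal lattice vector before comparison."""
    kept, weights = [], []
    for k in kpoints:
        k = (np.asarray(k, dtype=float) + 0.5) % 1.0 - 0.5
        minus_k = (-k + 0.5) % 1.0 - 0.5
        for idx, kk in enumerate(kept):
            if np.allclose(kk, minus_k, atol=tol):
                weights[idx] += 1.0
                break
        else:
            kept.append(k)
            weights.append(1.0)
    return np.array(kept), np.array(weights)
```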
In our implementation, we use the Libint library76 to evaluate all two- and three-center ERIs. Matrix and tensor contractions are efficiently managed by the GPU-accelerated block-sparse DBM and DBT libraries.70 The default filtering threshold for sparsity is set to 10−9. By default, the RI basis is generated on the fly according to the scheme of Stoychev and co-workers,77 but user-provided RI basis sets are also possible. Nuclear gradients and stress tensors were also implemented, including within the ADMM approximation, for the ADMM2, ADMMP, ADMMQ, and ADMMS flavors.
III. RESULTS
Many calculations and benchmarks were performed with the aim of showcasing the method's accuracy and computational performance. First, the choice of the RI metric is discussed. A study of the TC truncation radius to be used in RI-HFXk calculations and of the application of the ADMM approximation follows. Finally, various scaling tests are reported. Unless specified otherwise, we systematically use Monkhorst–Pack78 k-point meshes in this work.
We measure accuracy in terms of relative energy errors, i.e., the error made on the energy difference between two systems of the same composition but with different atomic structures. For a given accuracy test, we subtract the energy of the pristine crystal structure from that of a system with broken symmetry, obtained by randomly displacing atoms by a maximum of ±0.1 Å. The same structures are used for all data points of the test. While absolute energies may vary, the focus on relative energies ensures that consistent comparisons between different calculations at the same level of theory can be made. We systematically report relative energy errors in μHa/atom, referring to the number of atoms in the calculation unit cell (independent of the k-point mesh).
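For concreteness, this metric can be sketched as follows (Python; names are ours, and the reference energies would come from, e.g., a Γ-point supercell calculation):

```python
HA_TO_MICROHA = 1.0e6

def relative_energy_error(e_pristine, e_displaced,
                          e_pristine_ref, e_displaced_ref, n_atoms):
    """Error on the pristine-vs-displaced energy difference with
    respect to a reference calculation, in microHa per atom of the
    calculation unit cell (independent of the k-point mesh)."""
    de = e_displaced - e_pristine              # method under test
    de_ref = e_displaced_ref - e_pristine_ref  # reference method
    return (de - de_ref) * HA_TO_MICROHA / n_atoms
```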
A. Choice of the RI metric
The RI metric, which enters Eq. (9), is an input parameter to any RI-HFX and low-scaling post-HF calculation within CP2K. We are interested in making the optimal choice, under the assumption that a short-range RI metric is more efficient due to sparsity, while longer-ranged metrics improve accuracy. Figure 2 illustrates the relative errors and SCF timings observed when running RI-HFXk calculations on two-dimensional hBN and bulk silicon with RI metrics of increasing range. Two basis sets are used: the solid-state specific all-electron pob-DZVP-rev279 and the more diffuse, general purpose ccGRB-D,23 with GTH pseudopotentials.80 The results reveal a general trend of error reduction with increasing RI metric range, albeit with non-monotonic convergence and occasional significant jumps, as exemplified by TC 3.0 for silicon. Additionally, the SCF timings vary little, with a maximum increase of 20%. This is due to the almost constant levels of sparsity resulting from the local nature of the RI basis, in stark contrast with global RI schemes, where such long-range RI metric choices lead to a significant increase in computational costs.70 In conclusion, for optimal accuracy with a maximum error of 21 μHa/atom (for the specified setup of bulk silicon with pob-DZVP-rev2), we recommend employing an RI metric identical to the HFX potential. Furthermore, we note that the ccGRB-D basis set consistently outperforms pob-DZVP-rev2. A similar investigation of nuclear gradients corroborates these findings; additional details can be found in the supplementary material. Unless explicitly stated, all calculations in this work use the same operator for the RI metric and the HFX potential.
TABLE I. Number of periodic images considered as a function of the HFX potential truncation radius RC (Å).

RC (Å) | hBN | LiH | Silicon | Diamond
---|---|---|---|---
4.0 | 116 | 229 | 270 | 713
5.0 | 136 | 274 | 321 | 916
6.0 | 155 | 313 | 407 | 1117
7.0 | 165 | 413 | 530 | 1371
8.0 | 192 | 428 | 599 | 1650
9.0 | 213 | 540 | 703 | 2005
10.0 | 256 | 634 | 798 | 2352
TABLE II. ADMM relative energy errors (μHa/atom) and speedups for 2D hBN and bulk silicon with various primary/auxiliary basis set combinations.

Primary/aux basis | hBN (13 × 13 × 1) error | hBN speedup | Silicon (8 × 8 × 8) error | Silicon speedup
---|---|---|---|---
ccGRB-T/admm-tzp | −200.1 | 32 | −24.4 | 62
ccGRB-T/admm-tz2p | −128.1 | 8 | 7.9 | 12
ccGRB-T/cFIT3 | −121.1 | 68 | 454.9 | 233
ccGRB-T/pFIT3 | −81.8 | 16 | −7.9 | 39
TZVP-HYB/admm-tzp | −172.5 | 18 | −16.9 | 182
TZVP-HYB/admm-tz2p | −122.9 | 4 | 12.1 | 36
TZVP-HYB/cFIT3 | −75.6 | 37 | 424.2 | 703
TZVP-HYB/pFIT3 | −55.1 | 9 | −49.2 | 115
6-31G*/3-21G | −68.1 | 3 | 484.7 | 9
6-31G*/pob-DZVP-rev2 | −33.6 | 4 | −60.7 | 8
6-31G*/pob-TZVP-rev2 | −9.8 | 1 | 9.1 | 7
TABLE III. Converged cell parameters (Å) from cell optimizations with Γ-point supercells and equivalent k-point meshes.

Method | Graphene (11 × 11 × 1) | hBN (11 × 11 × 1) | Silicon (5 × 5 × 5) | LiH (5 × 5 × 5)
---|---|---|---|---
Γ-4center | 2.449 | 2.496 | 3.867 | 4.038
RI-HFXk | 2.450 | 2.496 | 3.862 | 4.037
ADMM-Γ-4center | 2.450 | 2.497 | 3.868 | 4.037
ADMM-RI-HFXk | 2.450 | 2.497 | 3.863 | 4.037
TABLE IV. Strong scaling speedups and parallel efficiencies for the pob-TZVP-rev2 basis, when independently increasing the number of MPI subgroups (workers) and the size of each subgroup.

Factor | Speedup (number of workers) | Efficiency (number of workers) | Speedup (size of workers) | Efficiency (size of workers)
---|---|---|---|---
1 | 1.00 | 1.00 | 1.00 | 1.00
2 | 1.93 | 0.97 | 1.79 | 0.90
4 | 3.65 | 0.91 | 2.88 | 0.72
8 | 6.73 | 0.84 | 4.28 | 0.53
16 | 11.05 | 0.69 | 6.69 | 0.42
32 | 14.43 | 0.45 | 8.32 | 0.26
B. Ideal TC truncation radius
All RI-HFXk calculations performed in this work use the truncated Coulomb operator as the HFX potential. As discussed in Sec. II B, this choice allows for a finite number of periodic images to be considered, while ensuring convergence of the exact exchange energy term. In CP2K, unlike the original work of Spencer and Alavi44 and the GTO-based implementation of Irmler and co-workers,42 we do not systematically increase the truncation radius of the TC operator with the extent of the BvK supercell. Instead, a fixed value is chosen as an input parameter, large enough for relative energies to be converged. The general requirement that RC must be smaller than the radius of the largest sphere fitting in the BvK supercell still holds. Figure 3 shows the relative energy errors and timings measured on 2D hBN, bulk LiH, bulk silicon, and diamond, as a function of the HFX potential truncation radius. The errors exhibit a clear monotonic decrease, with RC = 6 Å sufficiently converged for all systems considered. The maximum relative error for this cutoff is 43 μHa/atom, for silicon with the pob-TZVP-rev2 basis set. Note that increasing the truncation radius leads to significant additional costs for dense solids because of the larger number of periodic images involved; Table I summarizes the number of periodic images considered as a function of the HFX potential truncation radius for these systems. In the case of diamond, switching from 6 to 10 Å induces a 2.5× increase in cost. We stress that RC is a parameter of the RI-HFXk method, and its value should be properly tested for convergence. In particular, small-gap and metallic systems might require larger truncation radii.
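As an illustration, the largest admissible truncation radius for a given cell and k-mesh can be sketched as follows (Python; the function name is ours, and the inscribed-sphere radius is computed as half the minimum distance between opposite faces of the BvK supercell):

```python
import numpy as np

def max_truncation_radius(cell, kmesh):
    """Radius of the largest sphere inscribed in the Born-von Karman
    supercell. `cell` holds the unit-cell vectors as rows (Angstrom),
    `kmesh` the Monkhorst-Pack mesh, e.g. (8, 8, 8)."""
    bvk = cell * np.asarray(kmesh)[:, None]  # supercell vectors
    volume = abs(np.linalg.det(bvk))
    heights = [volume / np.linalg.norm(np.cross(bvk[(i + 1) % 3],
                                                bvk[(i + 2) % 3]))
               for i in range(3)]
    return 0.5 * min(heights)

# e.g. a 5.43 A cubic cell with an 8x8x8 mesh allows R_C up to ~21.7 A
```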
C. Auxiliary density matrix method
The results presented in this section demonstrate the performance of the ADMM approximation in combination with the RI-HFXk method. Table II outlines the relative errors and speedups achieved with various combinations of basis sets for 2D hBN and bulk silicon. A consistent trend is observed, in which the most substantial speedups, reaching up to 700×, are associated with larger errors, typically an order of magnitude higher than the atom-specific RI-induced errors. This observation aligns with the findings of Rebolini and co-workers for molecular calculations.81 Note that the cFIT354 auxiliary basis for bulk silicon leads to particularly high errors. In the spirit of ADMM, high-quality triple-zeta valence basis sets (ccGRB-T and TZVP-MOLOPT-HYB-GTH82) were chosen as primary bases for this test. Given the strong dependence of the HFX cost on basis set size and diffuseness, significant benefits are expected from ADMM in this context. Furthermore, ADMM facilitates k-point HFX calculations with diffuse basis sets, provided that the selected auxiliary basis is less diffuse. The related accuracy and speedup measurements are not reported here because the reference RI-HFXk calculations become computationally intractable. Finally, ADMM can also be used with all-electron basis sets, as illustrated with 6-31G*83–85 in the table. Few ADMM-optimized all-electron basis sets are available, with the exception of those from Kumar and co-workers,86 which are too diffuse for efficient solid-state calculations. Instead, we show that the CRYSTAL16 pob family of basis sets can be efficiently used as auxiliary basis sets because of their limited diffuseness. Note that while 6-31G* is of double-zeta quality, speedups and low errors are still achieved with the pob-TZVP-rev2 auxiliary basis for silicon. The pob family can only be used as auxiliary basis sets in all-electron calculations. The results presented in Table II cover a selected few basis sets and systems, but the general trends are expected to hold. All calculations were done with the ADMMS flavor.
Additional tests were conducted to confirm the accuracy of RI-HFXk nuclear gradients and stress tensors, with and without ADMM. Cell optimizations were performed for multiple crystals, namely, graphene, hBN, silicon, and LiH, with Γ-point supercells and equivalent k-point meshes. The results are presented in Table III. The maximum difference in the converged cell parameter is 0.005 Å (0.1%) between the four-center Γ-point supercell approach and RI-HFXk, while it is only 0.001 Å between ADMM and non-ADMM runs. Two important lessons can be drawn from this experiment: the new RI-HFXk method is consistent with the well-established Γ-point implementation of CP2K, and ADMM nuclear gradients and stress tensors are accurate.
D. Performance
1. RI-HFXk vs Γ 4-center
The main interest of k-point approaches is to reach the thermodynamic limit more efficiently than with Γ-point supercell methods. This is shown in Figs. 4 and 5, where the costs of equivalent supercell and k-point mesh calculations are reported for 2D hBN and bulk silicon. Regardless of the system and the basis set used, RI-HFXk calculations are much more efficient. Moreover, their cost remains constant with respect to the k-point mesh, whereas the Γ-point computation scales, at best, linearly with the supercell size. For both methods, this behavior is only possible because the cutoff radius of the TC HFX potential is kept constant. The cost of Γ-point ccGRB-D calculations with silicon increases rapidly, illustrating the difficulty of running hybrid DFT simulations of solids with diffuse basis sets. In contrast, the RI-HFXk calculations with the same settings are very well behaved, notably because of the local nature of our atom-specific RI scheme. We point out that the y axes of the two figures have very different bounds, demonstrating how much more costly the simulations of dense bulk systems are. These results show the Γ-point four-center method in its most favorable light, where enough resources are always allocated to pre-compute and store all ERIs at the first SCF step. Note that the Γ-point calculations were conducted with different amounts of resources, leading to slight variations in parallel efficiency, which are clearly visible in the ccGRB-D curve of Fig. 4. In both figures, the cost of RI-HFXk is constant with respect to the k-point mesh because building the real-space exact exchange matrices dominates in this regime. Eventually, however, performing a diagonalization at each k-point becomes the bottleneck of the calculation, leading to linear scaling with the number of k-points. For bulk silicon, this regime transition starts at around 800–1000 k-points, as illustrated in the supplementary material. Additional tests showcasing the convergence of the total energies as a function of the number of k-points, as well as the acceleration of individual SCF steps due to increasing ΔP sparsity, are also available there.
2. Strong scaling
As discussed in Sec. II E, the most expensive step of the algorithm is the triple loop over the cell vector b and the i, j atom pairs. Because this represents a large number of small matrix/tensor contractions, we decided to split the work among independent workers, each calculating a fraction of the b and i, j atomic blocks. This comes at the cost of some data replication and communication overhead. There are two ways this bottleneck can be accelerated: by increasing the number of workers (MPI subgroups) and by increasing the resources allocated to each worker. Figure 6 shows the strong scaling performance of the RI-HFXk method for both approaches independently. The measured speedups and parallel efficiencies for the pob-TZVP-rev2 basis are reported in Table IV. The strong scaling performance is high across the board, especially when increasing the number of independent MPI subgroups. We expect better efficiencies with respect to the worker size for larger simulation cells and/or larger basis sets, because the size of individual tensors would increase. Note that calculations using the ccGRB-D basis set are significantly more expensive than those using pob-TZVP-rev2 because of diffuseness rather than basis size. In both subfigures, the first data point of the ccGRB-D series is missing. This is due to memory limitations on a single node, as the ccGRB-D basis requires more than twice as many periodic cells to be considered (697 vs 321). While running with a single MPI subgroup on one node is not possible, running with two workers on two nodes is, because not all data are copied to each subgroup. Note that while the b, i, and j triple loop is considerably accelerated with additional MPI subgroups, the overall increase in resources also speeds up the rest of the calculation, notably the evaluation of the ERIs and the contraction with the density matrices in Eq. (13). Additional tests covering various systems and basis sets, as well as measurements of GPU acceleration, are available in the supplementary material. GPU runs were measured to be 1.3–2.4 times faster than CPU-only calculations on the same computer. The GPUs used in this work date back to 2016, and greater acceleration is expected on more modern hardware; a previous study using the state-of-the-art LUMI-G supercomputer70 suggests that the GPU acceleration could double.
3. Weak scaling
A given calculation can be accelerated by allocating more resources to it, both by increasing the number and the size of the MPI subgroups working on the b, i, and j loop of Eq. (14). On the other hand, the computational effort grows with the simulation unit cell size, according to the analysis performed in Sec. II E. In Fig. 7, we demonstrate good weak scaling performance by keeping the time to solution constant for 2D hBN, bulk silicon, and diamond systems of increasing sizes. The CPU time necessary to complete these calculations grows significantly, following a linear trend on the log–log plot. Polynomial fitting yields the effective scaling of the method with the system size, measured at , , and for hBN, silicon, and diamond, respectively. While these tests were conducted with the overlap RI metric to keep the resource consumption low, the observed trends are not expected to be affected by that choice. As the cell size of the test systems increased, so did both the number of MPI subgroups and their sizes. The largest calculation, the 32-atom diamond cell, was run on 192 nodes with 16 MPI subgroups. Note the significant increase in required resources when going from 2D hBN to bulk silicon and, finally, to diamond. This again highlights the difficulty of running HFX calculations of dense solids. Additional analysis, available in the supplementary material, shows that the method scales slightly more favorably than the predicted . We attribute this to sparsity in the tensor/matrix contractions.
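Such effective exponents follow from a simple linear fit on the log–log data, sketched below (Python; the data points in the usage comment are hypothetical, not the measured values):

```python
import numpy as np

def effective_scaling(n_atoms, cpu_times):
    """Effective exponent alpha in t ~ N^alpha, from a linear fit of
    log(t) vs log(N)."""
    slope, _intercept = np.polyfit(np.log(n_atoms), np.log(cpu_times), 1)
    return slope

# hypothetical data points (not the measured values):
# effective_scaling([8, 16, 32], [1.0, 5.2, 28.0])  # ~2.4
```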
IV. CONCLUSION
In this work, we have reported the development of the RI-HFXk method and its implementation in the CP2K software package. This novel approach enables accurate and efficient periodic Hartree–Fock exchange calculations with k-point sampling. Our results are in excellent agreement with equivalent Γ-point supercell calculations based on a well-established implementation, in terms of relative energies and nuclear gradients. To enhance efficiency, both methods leverage the truncated Coulomb operator with a fixed, converged truncation radius.
We adopt a real-space approach that ensures a computational cost scaling, at worst, linearly with the number of k-points when diagonalization dominates. The primary computational bottleneck occurs during the construction of real-space exact exchange matrices. In this context, our local atom-specific RI scheme demonstrates a notable advantage over more conventional two-electron four-center approaches. Indeed, its modest memory requirements allow for the storage and replication of pre-calculated quantities, leading to a very efficient parallel implementation.
Due to the use of RI, the computational effort is largely dominated by linear algebra. This is efficiently handled by the DBM and DBT libraries,70 designed around block sparsity. Besides the inherent sparsity coming from atom-centered Gaussian basis functions, we leverage the linearity of the exact exchange with respect to the density matrix, leading to decreasing costs as the SCF converges. Overall, our implementation exhibits good strong and weak scaling performance, as well as some level of GPU acceleration. This makes the method highly suitable for high-performance computing, offering rapid time to solution. The effective scaling of the method with system size was measured at .
The auxiliary density matrix method can be seamlessly combined with RI-HFXk. Significant speedups are achieved in calculations using large basis sets, with reasonable accuracy for relative energies. Assessments of cell optimization runs also demonstrate the high quality of ADMM-derived forces and stress tensors. Additionally, ADMM makes calculations with highly diffuse basis sets possible, provided a suitably less diffuse auxiliary basis is selected. This is of significant importance, as real-space approaches to HFX encounter substantial computational challenges with basis set diffuseness.
This fully open-source implementation opens the door to a new range of calculations with the CP2K software package. High quality hybrid DFT calculations of solids near the thermodynamic limit are now possible at a limited computational cost. Furthermore, the RI-HFXk method can be used as a starting point for higher level methods, such as periodic GW with k-point sampling,88 and holds potential for future periodic post-HF developments.
SUPPLEMENTARY MATERIAL
See the supplementary material for details on the RI bump function used in this work. Additional figures covering nuclear gradient accuracy, ɛPGF convergence tests, scaling with respect to k-point number, total energy convergence as a function of k-point number, an illustrative band structure calculation, and strong scaling/GPU acceleration are also available. An illustrative CP2K input file is included as well.
ACKNOWLEDGMENTS
This work was funded by the Swiss Platform for Advanced Scientific Computing (PASC) as part of the 2021–2024 “Ab Initio Molecular Dynamics at the Exa-Scale” project. Computing time on Piz Daint was provided by the CSCS Swiss National Supercomputing Centre under Account Nos. d110 and c18. The authors would like to thank Tiziano Müller for the fruitful discussions on ADMM.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Augustin Bussy: Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal). Jürg Hutter: Funding acquisition (equal); Project administration (equal); Resources (equal); Software (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study, namely, CP2K input and output files, are openly available in the Materials Cloud at https://doi.org/10.24435/materialscloud:rg-dn.89