We investigate the applicability of single-precision (fp32) floating point operations within our linear-scaling, seminumerical exchange method sn-LinK [Laqua et al., J. Chem. Theory Comput. 16, 1456 (2020)] and find that the vast majority of the three-center-one-electron (3c1e) integrals can be computed with reduced numerical precision with virtually no loss in overall accuracy. This leads to a near doubling in performance on central processing units (CPUs) compared to pure fp64 evaluation. Since the cost of evaluating the 3c1e integrals is less significant on graphic processing units (GPUs) compared to CPU, the performance gains from accelerating 3c1e integrals alone is less impressive on GPUs. Therefore, we also investigate the possibility of employing only fp32 operations to evaluate the exchange matrix within the self-consistent-field (SCF) followed by an accurate one-shot evaluation of the exchange energy using mixed fp32/fp64 precision. This still provides very accurate (1.8 µEh maximal error) results while providing a sevenfold speedup on a typical “gaming” GPU (GTX 1080Ti). We also propose the use of incremental exchange-builds to further reduce these errors. The proposed SCF scheme (i-sn-LinK) requires only one mixed-precision exchange matrix calculation, while all other exchange-matrix builds are performed with only fp32 operations. Compared to pure fp64 evaluation, this leads to 4–7× speedups for the whole SCF procedure without any significant deterioration of the results or the convergence behavior.
I. INTRODUCTION
The evaluation of the exact (Fock)-exchange matrix usually represents the major computational bottleneck (typically >80% of the computation time) within Hartree–Fock or hybrid-density functional theory (DFT) calculations. Traditionally, this requires the computation of the four-center-two-electron (4c2e) integral-tensor leading to a formal scaling, which is particularly problematic when combined with larger atomic-orbital (AO) basis sets, even if efficient screening techniques1–10 are employed.
Therefore, seminumerical integration techniques, e.g., the pseudospectral method of Friesner et al.,11–16 the chain-of-spheres-exchange (COS-X) method of Neese et al.,17 the seminumerical methods of Plessow and Weigend,18 Liu et al.,19–21 and Kaupp and co-workers (the latter focusing more on local-hybrid functionals),22–26 the semi-JK algorithm of Holzer,27 and our sn-LinK method28,29 are more efficient for large basis sets due to their superior formal scaling and their scaling with the basis set size. Moreover, as shown in Refs. 27 and 29, seminumerical integration is perfectly suited for modern, highly parallel hardware, where performance relies heavily on the use of single-instruction-multiple-data (SIMD) instructions, which applies to both modern central processing units (CPUs) and even to a greater extent to graphic processing units (GPUs).
By default, most quantum chemistry programs execute the necessary floating point operations with double numeric precision (fp64) due to its reliable accuracy (about 10−16 relative precision).30 However, since most “gaming grade” GPUs typically provide significantly less computational performance for fp64 operations compared to single-precision floating point (fp32) operations (e.g., the fp32:fp64 ratio for the GTX 1080Ti is 32:1), the possibility of executing as many of the necessary computation using fp32 operations needs to be explored, despite its lower numerical precision of about 10−7. In addition, the possibility of twofold speedups on CPUs also justifies the need for such a study even for pure CPU code. Investigations of pure fp32 or mixed fp32/fp64 execution have indeed already been carried out for the traditional 4c2e integral based Fock-exchange evaluation31–36 and post-HF correlation methods,32,37–39 but, to our knowledge, the applicability of reduced numerical precision within seminumerical integration has not yet been studied.
Therefore, we provide such a study in this work, which is organized as follows: First, we briefly review our seminumerical integration method sn-LinK from Ref. 29 in Sec. II. After reporting on the computational details in Sec. III, we explore the applicability of fp32 operations regarding performance and numerical stability in Sec. IV. This exploration is partitioned into three parts: In part 1 (Sec. IV A), we explore mixed fp32/fp64 evaluation of the three-center-one-electron (3c1e) integral tensor. In part 2 (Sec. IV B), we then explore the possibility of pure fp32 evaluation in all steps instead of only the 3c1e integral evaluation. Finally, in part 3 (Sec. IV C), we propose a specific self-consistent-field (SCF) method, denoted as i-sn-LinK, that employs incremental exchange-builds to reduce the numerical error from pure fp32 execution. Afterward, we illustrate the performance and accuracy of the so developed mixed-precision methods for a wide variety of molecules and basis sets in Sec. V and finally summarize our results in Sec. VI.
II. THEORY: SN-LINK REVIEWED
A. The seminumerical exchange method
Seminumerical integration decomposes the 4c2e integral tensor
symmetrically as
employing numeric integration grids with grid points rg and corresponding weights wg.
Inserting this decomposition into the atomic orbital (AO) representation of the exact-exchange matrix leads to
which is evaluated in three consecutive steps:
The so-obtained exchange-matrix is finally symmetrized to account for the transpose in Eq. (3).
If only the exchange energy
is of interest (e.g., to compute the final energy after converging the SCF), the evaluation of Eq. (7) can be avoided and EX can instead be obtained as
Equations (5) and (7) are evaluated with dense matrix–matrix multiplications employing batch-local matrices of asymptotically constant size (see Sec. 2.4 of Ref. 29 for details) utilizing highly optimized BLAS-3 libraries.
In contrast, evaluation of Eq. (6) requires the computation of the 3c-1e integral tensor
which usually represents the most expensive step. Effective integral screening techniques are therefore essential for an efficient implementation.
B. Screening for 3c1e-integrals reviewed
In order to assess the significance of a 3c1e-integral Aνλg, we consider its contribution to the total exchange energy
and to the final exchange matrix
An integral is then labeled as significant if either of the contributions is larger than a given threshold, i.e.,
III. COMPUTATIONAL DETAILS
Unless stated otherwise, all calculations are performed with our FermiONs++ quantum chemistry program,7,8 employing the “Karlsruhe” (“def2-”) basis sets40 (the prefix “def2-” is omitted for simplicity) and the “gm4” multi-grid41 (∼5000 points per atom within the SCF and ∼15 000 points per atom for the final energy calculation). For yttrium, the respective effective core potential (ECP)40 was employed to replace core electrons. For the evaluation of the Coulomb-interaction, the resolution-of-the-identity (RI-J) approximation42–44 in combination with the RI-J-optimized basis set of Ref. 45 was used throughout this work. These settings are chosen to represent typical applications.
Since the computation of the exact-exchange contributions is usually the most expensive step within hybrid-DFT calculations, the discussion for Hartree–Fock calculations within this work is equally meaningful for hybrid-DFT calculations. Indeed, since only a fraction of exact Fock-exchange is employed in hybrid-DFT methods, the numerical errors from single-precision execution are proportionally lower in this case. We provide analogous results to Tables IX and XI using the PBE0 hybrid functional46 instead of the Hartree–Fock method within the supplementary material.
For optimal performance, the CPU code was compiled with the Intel C++ compiler (ICPC) version 19.1.047 with all compiler-optimization enabled (“-Ofast,” “-march = native”), which is necessary to fully utilize SIMD-vector-instructions within the 3c1e integral kernels. The GPU code was compiled with NVCC-10.1 (CUDA-10.1),48 also employing all possible compiler-optimizations (“-O3” and “-use fast math”). To provide sufficient parallel workload for each device, grid-batches of 512 grid points on CPUs, 20 480 points on the NVIDIA GV100 GPU, and 10 240 points on the 1080Ti GPU are employed.
SCF convergence is always measured by the root mean square of the DIIS error matrices (SPF–FPS).49,50 Unless stated otherwise, all timings are given for one full exchange-matrix build averaged over all but the very first SCF cycles, since the very first Fock-build is considerably faster due to the sparsity of the superposition-of-atomic-densities (SAD) guess density matrix. For rigorous comparisons, errors in the converged energy (denoted ΔE), the root mean square deviation (RMSD) of the converged density matrix (denoted as ΔP), and the root mean square deviation of the converged nuclear forces (denoted as ΔForces) are always referenced to pure fp64-execution with all other settings being identical.
The test geometries51,52 in this work are chosen to represent typical applications. In particular, A–T DNA-fragments [(DNA)x] exemplify biochemistry applications, spherical water balls [(H2O)68], explicit solvent environments, LiF cutouts [(LiF)36], materials science applications, and Y2C5H20N20O13 inorganic complex chemistry. All geometries employed in this work are provided in the supplementary material.
IV. THE MIXED FP32/FP64 PRECISION SN-LINK METHOD
In this section, the possibility of employing single-precision algebra is systematically studied, focusing on the speedups and accuracy. The total execution times corresponding to the here presented speedups are given in Sec. 2 of the supplementary material.
A. Part 1: Mixed precision evaluation of the 3c1e integral tensor
The evaluation and contraction of the 3c1e-integral tensor [Eq. (6)] usually represents the most time-consuming step within sn-LinK, particularly on CPUs and for smaller systems. Therefore, we first investigate the applicability of fp32 operations for this step.
We can assign each 3c1e-integral (or batch of integrals) an upper bound to its contribution to the final exchange-energy [Eq. (11)] and exchange-potential [Eq. (12)]. These upper bounds not only enable the linear-scaling evaluation of the 3c1e integrals via computing only the significant subset of the tensor but also enable a finer-grained partitioning of the 3c1e tensor into three instead of only two categories:
Category one contains the most significant 3c1e integrals (). These are computed with fp64 operations.
Category two contains all 3c1e integrals that are too significant to be completely neglected, but not so significant that they require fp64 evaluation (). These are computed and accumulated with fp32 operations.
Category three contains all the insignificant integrals (). These are not computed at all, even with pure fp64-execution.
Compared to the original sn-LinK from Ref. 29, we just split the set of significant integrals into two subsets (category one and two), resulting in only a slight adjustment of the pseudocode for the 3c1e-integral evaluation, which is given in Fig. 1.
Impact of the fp64 multiplier on the accuracy and performance for (DNA)4/HF/SVP. Speedups are given for two Intel Xeon E5-2630 CPUs (20 cores@2.20 GHz) using AVX-2 instructions. The corresponding thresholds for switching between fp32- and fp64 evaluation derived from the “parent” screening-thresholds ϑE = 10−11 and ϑK = 10−8 are given for context.
fp64-multiplier . | 1012 . | 1010 . | 108 . | 107 . | 106 . | 105 . | 104 . |
---|---|---|---|---|---|---|---|
104 | 102 | 1 | 10−1 | 10−2 | 10−3 | 10−4 | |
101 | 10−1 | 10−3 | 10−4 | 10−5 | 10−6 | 10−7 | |
Energy error (μEh) | 34.19 | 0.55 | 0.08 | 0.01 | <0.01 | <0.01 | <0.01 |
ΔP (10−6) | 0.20 | 0.20 | 0.15 | 0.04 | 0.01 | <0.01 | <0.01 |
% of integrals fp64 | 0.000 | 0.001 | 0.083 | 0.37 | 1.2 | 3.0 | 7.9 |
Speedup (only integrals) | 1.99× | 1.98× | 1.96× | 1.95× | 1.94× | 1.89× | 1.79× |
fp64-multiplier . | 1012 . | 1010 . | 108 . | 107 . | 106 . | 105 . | 104 . |
---|---|---|---|---|---|---|---|
104 | 102 | 1 | 10−1 | 10−2 | 10−3 | 10−4 | |
101 | 10−1 | 10−3 | 10−4 | 10−5 | 10−6 | 10−7 | |
Energy error (μEh) | 34.19 | 0.55 | 0.08 | 0.01 | <0.01 | <0.01 | <0.01 |
ΔP (10−6) | 0.20 | 0.20 | 0.15 | 0.04 | 0.01 | <0.01 | <0.01 |
% of integrals fp64 | 0.000 | 0.001 | 0.083 | 0.37 | 1.2 | 3.0 | 7.9 |
Speedup (only integrals) | 1.99× | 1.98× | 1.96× | 1.95× | 1.94× | 1.89× | 1.79× |
The complete discarding of all category three integrals (line 3 of Fig. 1) is just standard integral screening, so we can employ the same “parent” thresholds as in the original sn-LinK, i.e., , within the SCF, and in the final energy calculation. The screening-errors from these thresholds are usually well below 1 µEh. This integral screening is not altered within this work, and instead, the speedups presented here are solely caused by a more efficient handling of the category two integrals by using a lower precision arithmetic.
We decided to choose our second set of thresholds, which distinguish between fp32- and fp64-execution (l.5 and 8 of Fig. 1) dynamically, based on the original thresholds ϑE/K. Indeed, since the relative numerical precision of fp32 numbers is about 10−7, one would naturally choose . In Table I, we test a variety of threshold-multipliers and find that we can employ even tighter fp64-thresholds, since still only a very small fraction of 3c1e-integrals need to be computed with fp64 operations. Therefore, we decided to use a fp64 threshold multiplier of 105 for all further calculations, providing virtually perfect accuracy regarding both the converged energy and the converged density matrix, while over 97% of all integrals are computed using fp32 instructions. As expected, this results in nearly 2× speedup for the evaluation of the 3c1e integrals.
We further evaluate the robustness and possible speedups of mixed fp32/fp64 3c1e-integral evaluation for different molecular systems and basis sets in Table II. Mixed precision evaluation provides essentially error-free results for both the converged energy and the density matrix and converges within exactly the same amount of SCF cycles. That is, mixed-precision evaluation of the 3c1e integrals is essentially error-free while providing close to 2× speedups (1.73–1.97×) for the 3c1e integral evaluation, in line with the expected 2× performance gains for CPUs.
Comparison of mixed fp32/fp64 3c1e integral evaluation and pure fp64 evaluation. Speedups for only the 3c1e integrals and one whole exchange build are given for two Intel E5-2630 CPUs (20 cores @2.20 GHz) using AVX-2 instructions. The number of SCF cycles is given for a convergence threshold of ϑconv = 10−8 for the DIIS-error.
System/Basis . | Speedup 3c1e . | Speedup (K) . | Error (μEh) . | ΔP (10−6) . | niter (fp64) . | niter (mp) . |
---|---|---|---|---|---|---|
(DNA)1/SVP | 1.73× | 1.55× | <0.01 | <0.01 | 15 | 15 |
(DNA)4/SVP | 1.86× | 1.62× | <0.01 | <0.01 | 14 | 14 |
(H2O)68/SVP | 1.92× | 1.61× | <0.01 | <0.01 | 11 | 11 |
(LiF)36/SVP | 1.90× | 1.73× | <0.01 | 0.19 | 11 | 11 |
Y2C5H20N20O13/SVP | 1.79× | 1.62× | <0.01 | <0.01 | 15 | 15 |
(DNA)1/TZVP | 1.87× | 1.69× | <0.01 | 0.03 | 14 | 14 |
(DNA)4/TZVP | 1.87× | 1.68× | <0.01 | 0.04 | 14 | 14 |
(H2O)68/TZVP | 1.88× | 1.58× | <0.01 | <0.01 | 11 | 11 |
(LiF)36/TZVP | 1.97× | 1.74× | <0.01 | 0.18 | 10 | 10 |
Y2C5H20N20O13/TZVP | 1.91× | 1.72× | <0.01 | 0.01 | 14 | 14 |
(DNA)1/QZVP | 1.80× | 1.65× | <0.01 | 0.07 | 14 | 14 |
(H2O)68/QZVP | 1.78× | 1.53× | <0.01 | <0.01 | 10 | 10 |
System/Basis . | Speedup 3c1e . | Speedup (K) . | Error (μEh) . | ΔP (10−6) . | niter (fp64) . | niter (mp) . |
---|---|---|---|---|---|---|
(DNA)1/SVP | 1.73× | 1.55× | <0.01 | <0.01 | 15 | 15 |
(DNA)4/SVP | 1.86× | 1.62× | <0.01 | <0.01 | 14 | 14 |
(H2O)68/SVP | 1.92× | 1.61× | <0.01 | <0.01 | 11 | 11 |
(LiF)36/SVP | 1.90× | 1.73× | <0.01 | 0.19 | 11 | 11 |
Y2C5H20N20O13/SVP | 1.79× | 1.62× | <0.01 | <0.01 | 15 | 15 |
(DNA)1/TZVP | 1.87× | 1.69× | <0.01 | 0.03 | 14 | 14 |
(DNA)4/TZVP | 1.87× | 1.68× | <0.01 | 0.04 | 14 | 14 |
(H2O)68/TZVP | 1.88× | 1.58× | <0.01 | <0.01 | 11 | 11 |
(LiF)36/TZVP | 1.97× | 1.74× | <0.01 | 0.18 | 10 | 10 |
Y2C5H20N20O13/TZVP | 1.91× | 1.72× | <0.01 | 0.01 | 14 | 14 |
(DNA)1/QZVP | 1.80× | 1.65× | <0.01 | 0.07 | 14 | 14 |
(H2O)68/QZVP | 1.78× | 1.53× | <0.01 | <0.01 | 10 | 10 |
Speedups for one full K-build from employing mixed-precision kernels for 3c1e-integral evaluation on two Intel E5-2630 CPUs (20 cores@2.20 GHz) using AVX-2 instructions on one NVIDIA GV100 GPU (fp32:fp64 ratio 2:1) and one GTX 1080Ti GPU (fp32:fp64 ratio 32:1).
System/Basis . | CPU . | GV100 . | 1080Ti . |
---|---|---|---|
(DNA)4/SVP | 1.55× | 1.01× | 1.61× |
(DNA)1/SVP | 1.62× | 1.21× | 1.95× |
(H2O)68/SVP | 1.61× | 0.96× | 1.83× |
(LiF)36/SVP | 1.73× | 1.03× | 2.51× |
Y2C5H20N20O13/SVP | 1.62× | 0.97× | 1.94× |
(DNA)1/TZVP | 1.69× | 1.22× | 2.19× |
(DNA)4/TZVP | 1.68× | 1.42× | 2.05× |
(H2O)68/TZVP | 1.58× | 1.22× | 1.70× |
(LiF)36/TZVP | 1.74× | 1.21× | 2.04× |
Y2C5H20N20O13/TZVP | 1.72× | 1.34× | 2.00× |
(DNA)1/QZVP | 1.65× | 1.79× | 2.14× |
(H2O)68/QZVP | 1.53× | 1.44× | 1.73× |
System/Basis . | CPU . | GV100 . | 1080Ti . |
---|---|---|---|
(DNA)4/SVP | 1.55× | 1.01× | 1.61× |
(DNA)1/SVP | 1.62× | 1.21× | 1.95× |
(H2O)68/SVP | 1.61× | 0.96× | 1.83× |
(LiF)36/SVP | 1.73× | 1.03× | 2.51× |
Y2C5H20N20O13/SVP | 1.62× | 0.97× | 1.94× |
(DNA)1/TZVP | 1.69× | 1.22× | 2.19× |
(DNA)4/TZVP | 1.68× | 1.42× | 2.05× |
(H2O)68/TZVP | 1.58× | 1.22× | 1.70× |
(LiF)36/TZVP | 1.74× | 1.21× | 2.04× |
Y2C5H20N20O13/TZVP | 1.72× | 1.34× | 2.00× |
(DNA)1/QZVP | 1.65× | 1.79× | 2.14× |
(H2O)68/QZVP | 1.53× | 1.44× | 1.73× |
As presented in Sec. 1 of the supplementary material, this speedup is only achieved if vector-instructions (AVX2 in this work) are utilized. This matches our expectations, since one 256-bit vector instruction processes four fp64 but eight fp32 values, whereas one scalar instruction always processes 1 number, regardless of the precision.
However, the 2×-speedup for the 3c1e integral evaluation does not perfectly translate to the full K-build, since the other two steps [Eqs. (5) and (7)] have not been accelerated. As depicted in Table III, this effect is even more pronounced with GPU evaluation because the 3c1e integral evaluation contributes less to the total runtime on GPUs (compare also with the discussion in Sec. 5.2.1 of Ref. 29). For this reason, only minor speedups from employing the mixed precision evaluation of the 3c1e integrals are obtained for the GV100 GPU (0.96–1.79×) and the speedups on the GTX 1080Ti (1.6× to 2.5×) are far away from the theoretical value of 32×.
To summarize part 1, the 3c1e integral tensor is partitioned so that only the most significant contributions are evaluated with fp64 operations. The introduced errors are negligible while providing nearly 2× speedups in the integral-evaluation part. However, when employing GPUs with low fp64-performance, the BLAS-3 steps [Eqs. (5) and (7)] also need to be performed with fp32 operations to achieve optimal performance.
B. Part 2: Pure fp32 evaluation
In contrast to the integral evaluation, computing the BLAS-3 steps [Eqs. (5) and (7)] with mixed precision instruction is not straightforward because they are best evaluated with a single call of a highly optimized dense linear algebra routine for optimal performance. Although algorithms for mixed precision linear-algebra have been proposed in the literature,53 we prefer to, if possible, avoid them for the sake of simplicity.
Therefore, the possibility of pure fp32 evaluation, i.e., evaluation of Eqs. (5)–(7) with only fp32 operations and accumulation of the batch-local fp32-K-matrices into a global fp64-K-matrix, needs to be explored. Note that in the final energy calculation, Eqs. (5) and (6) are evaluated with single-precision, but the accumulation in Eq. (9) is performed with double-precision.54
As depicted in Table IV, pure fp32 evaluation results in up to 7.4× speedups compared to pure fp64 execution. This speedup is still significantly less than the theoretically possible 32× speedups because the code is not purely limited by the floating point throughput alone and the demand for memory bandwidth and local storage (cache) is only reduced by a factor of two.
Speedups (averaged over all but the very first K-builds) of mixed precision integral evaluation (mp) and full fp32 evaluation (fp32) on GTX 1080Ti compared to pure fp64-execution. Errors of the converged energy from pure fp32-evaluation (ΔEfp32) and from fp32-evaluation in the SCF followed by a post-SCF mixed precision energy evaluation (ΔEfp32*) and the root-mean-square error of the converged density matrix (ΔP) are given referenced to pure fp64 execution.
System/Basis . | Speedup (mp) . | Speedup (fp32) . | ΔEfp32 (μEh) . | ΔEfp32* (μEh) . | ΔP (10−6) . |
---|---|---|---|---|---|
(DNA)1/SVP | 1.61× | 5.14× | 14.2 | <0.01 | 0.40 |
(DNA)4/SVP | 1.95× | 6.38× | 68.3 | 0.08 | 2.56 |
(H2O)68/SVP | 1.83× | 5.78× | 20.1 | <0.01 | 0.05 |
(LiF)36/SVP | 2.51× | 5.45× | 89.3 | <0.01 | 6.26 |
Y2C5H20N20O13/SVP | 1.94× | 6.39× | 15.8 | <0.01 | 0.18 |
(DNA)1/TZVP | 2.19× | 7.12× | 12.2 | 0.17 | 39.44 |
(DNA)4/TZVP | 2.05× | 7.42× | 85.4 | 1.82 | 26.79 |
(H2O)68/TZVP | 1.70× | 6.49× | 27.9 | 0.01 | 0.24 |
(LiF)36/TZVP | 2.04× | 5.65× | 19.8 | 0.09 | 15.94 |
Y2C5H20N20O13/TZVP | 2.00× | 6.94× | 19.9 | 0.08 | 5.52 |
(DNA)1/QZVP | 2.14× | 6.27× | 5.3 | 1.32 | 31.83 |
(H2O)68/QZVP | 1.73× | 7.32× | 14.5 | 0.33 | 1.89 |
System/Basis . | Speedup (mp) . | Speedup (fp32) . | ΔEfp32 (μEh) . | ΔEfp32* (μEh) . | ΔP (10−6) . |
---|---|---|---|---|---|
(DNA)1/SVP | 1.61× | 5.14× | 14.2 | <0.01 | 0.40 |
(DNA)4/SVP | 1.95× | 6.38× | 68.3 | 0.08 | 2.56 |
(H2O)68/SVP | 1.83× | 5.78× | 20.1 | <0.01 | 0.05 |
(LiF)36/SVP | 2.51× | 5.45× | 89.3 | <0.01 | 6.26 |
Y2C5H20N20O13/SVP | 1.94× | 6.39× | 15.8 | <0.01 | 0.18 |
(DNA)1/TZVP | 2.19× | 7.12× | 12.2 | 0.17 | 39.44 |
(DNA)4/TZVP | 2.05× | 7.42× | 85.4 | 1.82 | 26.79 |
(H2O)68/TZVP | 1.70× | 6.49× | 27.9 | 0.01 | 0.24 |
(LiF)36/TZVP | 2.04× | 5.65× | 19.8 | 0.09 | 15.94 |
Y2C5H20N20O13/TZVP | 2.00× | 6.94× | 19.9 | 0.08 | 5.52 |
(DNA)1/QZVP | 2.14× | 6.27× | 5.3 | 1.32 | 31.83 |
(H2O)68/QZVP | 1.73× | 7.32× | 14.5 | 0.33 | 1.89 |
More important, however, is the fact that pure fp32 execution leads to a significant deterioration of the accuracy, with errors up to 90 µEh. In contrast, the exchange matrix from pure fp32 evaluation is much more accurate than the errors in the energy indicate. That is, even though converging the SCF using fp32 only can lead to quite significant errors in the converged density matrix of up to 4 × 10−5 a.u., computing the final energy from this inaccurate matrix employing mixed precision integral evaluation instead of pure fp32-execution (denoted as “fp32*”) leads to significantly smaller errors (1.8 µEh at most) while providing the performance of full fp32 execution (i.e., up to 7× speedups) within the SCF.
The situation is analogous to adaptive integration grids, i.e., employing three times coarser grids within the SCF than for the final energy calculation provides virtually the same result as performing the whole calculation with the large grid.41 That is, the exchange energy is generally more sensitive to numeric errors than the exchange matrix.
However, as presented in Table V, executing all K-builds within the SCF with pure fp32 arithmetic deteriorates the SCF convergence behavior measurably, meaning that very tight convergence, e.g., to 10−8, is impossible due to the increase in numerical fluctuations. Although the level of precision is sufficient for many applications (e.g., ab initio molecular dynamics), a truly error-free method similar to part 1 (i.e., no measurable change in the SCF convergence behavior and as little change in the converged energy and density matrix as possible) is our goal in the following.
C. Part 3: Incremental K-builds (i-sn-LinK)
The idea of incremental Fock/exchange-builds, i.e., computing K[ΔP] and then incrementally updating K instead of always recomputing the full exchange matrix, was initially proposed by Almloef et al.55 and later improved by Haeser and Ahlrichs56 to allow for tighter density-matrix-based screening of the 4c2e-integrals within the SCF. However, because only small increments to K are computed, incremental Fock-builds have also been shown to reduce the numerical error introduced by fp32-evaluation, since the absolute error is proportional to the magnitude of the contribution.31 In part 2 (Sec. IV B), we showed that pure fp32 exchange-builds are already quite accurate and allow for unproblematic convergence to about 10−6. Therefore, the possibility to perform the last exchange-builds incrementally should be explored.
We thus propose i-sn-LinK, a special SCF scheme given in Fig. 2. First, the SCF is converged to a relative loose threshold with non-incremental fp32 K-builds. Since the difference density is comparatively large for these early SCF steps, there are no significant performance gains from incremental updates in these steps. In this work, we choose a convergence criterion of 10−5 to ensure that convergence to this point is always reached.
Number of SCF steps to achieve a given SCF convergence ϑconv quantified by the DIIS error. “n.c.” denotes non-converged SCF.
System/Basis . | ϑconv . | fp64 . | mp . | fp32 . |
---|---|---|---|---|
(DNA)1/SVP | 10−6 | 9 | 9 | 9 |
10−8 | 15 | 15 | n.c. | |
(DNA)4/SVP | 10−6 | 9 | 9 | 9 |
10−8 | 14 | 14 | n.c. | |
(H2O)68/SVP | 10−6 | 7 | 7 | 7 |
10−8 | 11 | 11 | 11 | |
(LiF)36/SVP | 10−6 | 7 | 7 | 7 |
10−8 | 10 | 11 | n.c. | |
Y2C5H20N20O13/SVP | 10−6 | 9 | 9 | 9 |
10−8 | 15 | 15 | n.c. | |
(DNA)1/TZVP | 10−6 | 9 | 9 | 9 |
10−8 | 15 | 15 | n.c. | |
(DNA)4/TZVP | 10−6 | 9 | 9 | 9 |
10−8 | 14 | 14 | n.c. | |
(H2O)68/TZVP | 10−6 | 7 | 7 | 7 |
10−8 | 12 | 12 | 12 | |
(LiF)36/TZVP | 10−6 | 7 | 7 | 7 |
10−8 | 10 | 10 | n.c. | |
Y2C5H20N20O13/TZVP | 10−6 | 9 | 9 | 9 |
10−8 | 14 | 14 | n.c. | |
(DNA)1/QZVP | 10−6 | 8 | 8 | 8 |
10−8 | 13 | 13 | n.c. | |
(H2O)68/QZVP | 10−6 | 7 | 7 | 7 |
10−8 | 10 | 10 | n.c. |
System/Basis . | ϑconv . | fp64 . | mp . | fp32 . |
---|---|---|---|---|
(DNA)1/SVP | 10−6 | 9 | 9 | 9 |
10−8 | 15 | 15 | n.c. | |
(DNA)4/SVP | 10−6 | 9 | 9 | 9 |
10−8 | 14 | 14 | n.c. | |
(H2O)68/SVP | 10−6 | 7 | 7 | 7 |
10−8 | 11 | 11 | 11 | |
(LiF)36/SVP | 10−6 | 7 | 7 | 7 |
10−8 | 10 | 11 | n.c. | |
Y2C5H20N20O13/SVP | 10−6 | 9 | 9 | 9 |
10−8 | 15 | 15 | n.c. | |
(DNA)1/TZVP | 10−6 | 9 | 9 | 9 |
10−8 | 15 | 15 | n.c. | |
(DNA)4/TZVP | 10−6 | 9 | 9 | 9 |
10−8 | 14 | 14 | n.c. | |
(H2O)68/TZVP | 10−6 | 7 | 7 | 7 |
10−8 | 12 | 12 | 12 | |
(LiF)36/TZVP | 10−6 | 7 | 7 | 7 |
10−8 | 10 | 10 | n.c. | |
Y2C5H20N20O13/TZVP | 10−6 | 9 | 9 | 9 |
10−8 | 14 | 14 | n.c. | |
(DNA)1/QZVP | 10−6 | 8 | 8 | 8 |
10−8 | 13 | 13 | n.c. | |
(H2O)68/QZVP | 10−6 | 7 | 7 | 7 |
10−8 | 10 | 10 | n.c. |
Comparison of i-sn-LinK compared to non-incremental pure fp64 evaluation. Speedups on the GTX 1080Ti are given for the sum of all exchange-builds including the final exchange energy calculation. The percentage of the two mixed-precision K-builds to the total time for all exchange-builds (%fp64), the error of the converged energy of i-sn-LinK ΔE, and the RMSD of the converged density matrix ΔP are referenced to full fp64 evaluation, and the number of SCF cycles for ϑconv = 10−8 is given for reference.
System/Basis . | Speedup . | %fp64 (%) . | ΔE (μEh) . | ΔP (10−6) . | niter (fp64) . | niter (i-sn-LinK) . |
---|---|---|---|---|---|---|
(DNA)1/SVP | 4.81× | 42 | <0.01 | 0.01 | 15 | 15 |
(DNA)4/SVP | 5.03× | 40 | <0.01 | 0.01 | 14 | 14 |
(H2O)68/SVP | 4.59× | 46 | <0.01 | 0.00 | 11 | 11 |
(LiF)36/SVP | 4.27× | 45 | <0.01 | 0.65 | 10 | 10 |
Y2C5H20N20O13/SVP | 4.93× | 45 | <0.01 | 0.00 | 15 | 15 |
(DNA)1/TZVP | 4.97× | 42 | <0.01 | 0.15 | 15 | 15 |
(DNA)4/TZVP | 5.19× | 39 | 0.03 | 0.29 | 14 | 14 |
(H2O)68/TZVP | 4.67× | 47 | <0.01 | 0.00 | 11 | 11 |
(LiF)36/TZVP | 4.40× | 42 | 0.01 | 0.37 | 10 | 10 |
Y2C5H20N20O13/TZVP | 4.99× | 45 | <0.01 | 0.02 | 14 | 14 |
(DNA)1/QZVP | 4.42× | 36 | 0.01 | 0.29 | 13 | 14 |
(H2O)68/QZVP | 4.62× | 47 | <0.01 | 0.01 | 10 | 10 |
System/Basis . | Speedup . | %fp64 (%) . | ΔE (μEh) . | ΔP (10−6) . | niter (fp64) . | niter (i-sn-LinK) . |
---|---|---|---|---|---|---|
(DNA)1/SVP | 4.81× | 42 | <0.01 | 0.01 | 15 | 15 |
(DNA)4/SVP | 5.03× | 40 | <0.01 | 0.01 | 14 | 14 |
(H2O)68/SVP | 4.59× | 46 | <0.01 | 0.00 | 11 | 11 |
(LiF)36/SVP | 4.27× | 45 | <0.01 | 0.65 | 10 | 10 |
Y2C5H20N20O13/SVP | 4.93× | 45 | <0.01 | 0.00 | 15 | 15 |
(DNA)1/TZVP | 4.97× | 42 | <0.01 | 0.15 | 15 | 15 |
(DNA)4/TZVP | 5.19× | 39 | 0.03 | 0.29 | 14 | 14 |
(H2O)68/TZVP | 4.67× | 47 | <0.01 | 0.00 | 11 | 11 |
(LiF)36/TZVP | 4.40× | 42 | 0.01 | 0.37 | 10 | 10 |
Y2C5H20N20O13/TZVP | 4.99× | 45 | <0.01 | 0.02 | 14 | 14 |
(DNA)1/QZVP | 4.42× | 36 | 0.01 | 0.29 | 13 | 14 |
(H2O)68/QZVP | 4.62× | 47 | <0.01 | 0.01 | 10 | 10 |
Method . | 3c1e (SCF) . | 3c1e (final) . | BLAS-3 (SCF) . | BLAS-3 (final) . |
---|---|---|---|---|
fp64 | fp64 | fp64 | fp64 | fp64 |
mp (part1) | fp32 + fp64 | fp32 + fp64 | fp64 | fp64 |
fp32 (part2) | fp32 | fp32 | fp32 | fp32 |
fp32* (part2) | fp32 | fp32 + fp64 | fp32 | fp64 |
i-sn-LinK (part3) | fp32 (fp32 + fp64)a | fp32 + fp64 | fp32 (fp64)a | fp64 |
Method . | 3c1e (SCF) . | 3c1e (final) . | BLAS-3 (SCF) . | BLAS-3 (final) . |
---|---|---|---|---|
fp64 | fp64 | fp64 | fp64 | fp64 |
mp (part1) | fp32 + fp64 | fp32 + fp64 | fp64 | fp64 |
fp32 (part2) | fp32 | fp32 | fp32 | fp32 |
fp32* (part2) | fp32 | fp32 + fp64 | fp32 | fp64 |
i-sn-LinK (part3) | fp32 (fp32 + fp64)a | fp32 + fp64 | fp32 (fp64)a | fp64 |
Higher precision in one non-incremental Fock-build.
Number of SCF steps (Niter) and error in the converged SCF energy of full-fp64 execution, mixed-precision integral evaluation (mp), and i-sn-LinK for the ten most difficult molecules of the ASCDB database.
. | . | fp64 . | mp . | i-sn-LinK . | ||
---|---|---|---|---|---|---|
Molecule . | ϑconv . | Niter . | Niter . | ΔE (μEh) . | Niter . | ΔE(μEh) . |
cisHO3 | 10−6 | 32 | 32 | 0.06 | 32a | 0.83 |
CrCl2 | 10−6 | 11 | 11 | 0.02 | 11a | 0.10 |
H6LiB3OMg2AlSi2 | 10−7 | 28 | 28 | <0.01 | 28a | 0.04 |
VO | 10−7 | 25 | 25 | <0.01 | 25b | <0.01 |
C2H5N4O2MgS2 | 10−7 | 23 | 23 | <0.01 | 23a | <0.01 |
transHO3 | 10−7 | 23 | 23 | <0.01 | 23a | <0.01 |
ClOO | 10−7 | 23 | 23 | <0.01 | 23a | <0.01 |
C3H6B3NAlSi2 | 10−7 | 22 | 22 | 0.01 | 22a | 0.08 |
FOO | 10−7 | 22 | 22 | <0.01 | 22a | <0.01 |
OHCl | 10−7 | 20 | 20 | <0.01 | 20a | <0.01 |
. | . | fp64 . | mp . | i-sn-LinK . | ||
---|---|---|---|---|---|---|
Molecule . | ϑconv . | Niter . | Niter . | ΔE (μEh) . | Niter . | ΔE(μEh) . |
cisHO3 | 10−6 | 32 | 32 | 0.06 | 32a | 0.83 |
CrCl2 | 10−6 | 11 | 11 | 0.02 | 11a | 0.10 |
H6LiB3OMg2AlSi2 | 10−7 | 28 | 28 | <0.01 | 28a | 0.04 |
VO | 10−7 | 25 | 25 | <0.01 | 25b | <0.01 |
C2H5N4O2MgS2 | 10−7 | 23 | 23 | <0.01 | 23a | <0.01 |
transHO3 | 10−7 | 23 | 23 | <0.01 | 23a | <0.01 |
ClOO | 10−7 | 23 | 23 | <0.01 | 23a | <0.01 |
C3H6B3NAlSi2 | 10−7 | 22 | 22 | 0.01 | 22a | 0.08 |
FOO | 10−7 | 22 | 22 | <0.01 | 22a | <0.01 |
OHCl | 10−7 | 20 | 20 | <0.01 | 20a | <0.01 |
Incremental K-builds starting at 10−4.
Incremental K-builds starting at 10−3.
Then, one full K-build is performed with mixed fp32/fp64 precision (i.e., employing part 1) and the SCF is subsequently converged further with incremental fp32 K-builds. These incremental K-builds are always built from the difference density matrix ΔP referenced to the last full K-build, not to the previous density matrix. This removes the possibility of incremental error accumulation, further improving the numerical stability in this way.57 In order to reduce errors from the density-matrix including integral screening when operating on incremental matrices, we employ two orders of magnitude tighter screening thresholds ϑE/K for these incremental builds, which, due to the elements of ΔP being much smaller than P, leads to a similar performance as full K-builds with looser thresholds. The final energy (also optionally nuclear forces) is then computed using mixed fp32/fp64 precision, analogous to part 2.
In contrast to the works of Almloef et al.55 and Haeser and Ahlrichs,56 the aim of our incremental scheme i-sn-LinK is only to improve the convergence behavior while using as many fp32 K-builds as possible, not to improve the tightness of the integral-selection.
The performance, accuracy, and SCF convergence of i-sn-LinK are presented in Table VI. Overall, we obtain virtually the same accuracy and numerical stability as for pure fp64 execution while obtaining up to 5.2× speedups (averaged over the whole calculation). As expected, the speedups are higher if more SCF cycles need to be performed because the cost of the two remaining mixed-precision K-builds, which comprise 36%–47% of the computation time, is comparatively less impactful in this situation.
However, the most significant result of Table VI is the fact that i-sn-LinK nearly always converges within the same amount of SCF cycles as pure fp64 execution, proving the numerical stability of the method.
D. Parts 1, 2, and 3 in comparison
In order to compare the different approaches discussed in Secs. IV A–IV C, a brief summary of them is given in Table VII. In this context, the mixed-precision 3c1e integral method (“mp”) of part 1 represents the simplest introduction of reduced numerical precision; that is, only the 3c1e integral evaluation part [Eq. (6)] is adjusted. This approach has virtually no impact on the SCF convergence behavior or the accuracy of the final result. Since the 3c1e integral evaluation is by far the most significant bottleneck for CPU execution, accelerating the other steps cannot provide significant speedups. Hence, we recommend to only employ the “mp” approach on CPUs.
On GPUs, however, the BLAS-3 steps with fp64 execution can become significant. Therefore, we investigated the possibility of pure fp32-execution in Sec. IV B (part 2) and found that the so-converged SCF can yield surprisingly accurate energies, if the final energy was evaluated with higher precision. However, the convergence behavior and the quality of the converged density matrix were measurably deteriorated.
Comparison of mixed-precision 3c1e integral evaluation (mp) and i-sn-LinK in terms of the speed of convergence [number of SCF steps to reach ] and in terms of the accuracy of the converged energy (ΔE), the root mean square deviation of the converged density matrix (ΔP), and the root mean square deviation of the nuclear forces (ΔForces).
. | . | Niter(10−7) . | ΔE (μEh) . | ΔP (10−6) . | ΔForces (μEh a0−1) . | |||||
---|---|---|---|---|---|---|---|---|---|---|
Molecule . | Basis . | fp64 . | mp . | i-sn-LinK . | mp . | i-sn-LinK . | mp . | i-sn-LinK . | mp . | i-sn-LinK . |
Taxol | 6-31G* | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.03 |
Valinomycin | 6-31G* | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.06 |
Olestra | 6-31G* | 11 | 11 | 11 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.05 |
Fullerene C60 | TZVP | 10 | 10 | 11 | <0.01 | <0.01 | 0.48 | 10.36 | 0.03 | 0.20 |
Fullerene C60 | QZVP | 10 | 10 | 10 | −0.01 | 0.01 | 0.34 | 8.00 | 0.04 | 0.67 |
(S8)20 | TZVP | 10 | 10 | 10 | <0.01 | <0.01 | <0.01 | 0.01 | <0.01 | 0.20 |
Crambin | TZVP | 10 | 10 | 10 | 0.01 | 0.05 | 0.09 | 0.31 | 0.01 | 0.20 |
(DNA)1 | SVP | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | 0.01 | <0.01 | 0.10 |
(DNA)4 | SVP | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | 0.03 | <0.01 | 0.13 |
(DNA)16 | SVP | 11 | 11 | 11 | <0.01 | −0.02 | <0.01 | 0.03 | 0.05 | 0.14 |
(DNA)1 | TZVP | 12 | 12 | 12 | <0.01 | <0.01 | 0.01 | 0.25 | <0.01 | 0.20 |
(DNA)4 | TZVP | 11 | 11 | 11 | <0.01 | 0.05 | 0.02 | 0.39 | <0.01 | 0.15 |
(DNA)1 | QZVP | 11 | 11 | 11 | <0.01 | <0.01 | 0.03 | 0.32 | <0.01 | 0.12 |
(DNA)4 | QZVP | 10 | 10 | 10 | <0.01 | −0.03 | 0.05 | 0.60 | 0.01 | 0.40 |
Y2C5H20N20O13 | TZVP | 11 | 11 | 12 | <0.01 | <0.01 | <0.01 | 0.04 | <0.01 | 0.15 |
Y4C10H42N40O27 | TZVP | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | 0.02 | <0.01 | 0.26 |
. | . | Niter(10−7) . | ΔE (μEh) . | ΔP (10−6) . | ΔForces (μEh a0−1) . | |||||
---|---|---|---|---|---|---|---|---|---|---|
Molecule . | Basis . | fp64 . | mp . | i-sn-LinK . | mp . | i-sn-LinK . | mp . | i-sn-LinK . | mp . | i-sn-LinK . |
Taxol | 6-31G* | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.03 |
Valinomycin | 6-31G* | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.06 |
Olestra | 6-31G* | 11 | 11 | 11 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.05 |
Fullerene C60 | TZVP | 10 | 10 | 11 | <0.01 | <0.01 | 0.48 | 10.36 | 0.03 | 0.20 |
Fullerene C60 | QZVP | 10 | 10 | 10 | −0.01 | 0.01 | 0.34 | 8.00 | 0.04 | 0.67 |
(S8)20 | TZVP | 10 | 10 | 10 | <0.01 | <0.01 | <0.01 | 0.01 | <0.01 | 0.20 |
Crambin | TZVP | 10 | 10 | 10 | 0.01 | 0.05 | 0.09 | 0.31 | 0.01 | 0.20 |
(DNA)1 | SVP | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | 0.01 | <0.01 | 0.10 |
(DNA)4 | SVP | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | 0.03 | <0.01 | 0.13 |
(DNA)16 | SVP | 11 | 11 | 11 | <0.01 | −0.02 | <0.01 | 0.03 | 0.05 | 0.14 |
(DNA)1 | TZVP | 12 | 12 | 12 | <0.01 | <0.01 | 0.01 | 0.25 | <0.01 | 0.20 |
(DNA)4 | TZVP | 11 | 11 | 11 | <0.01 | 0.05 | 0.02 | 0.39 | <0.01 | 0.15 |
(DNA)1 | QZVP | 11 | 11 | 11 | <0.01 | <0.01 | 0.03 | 0.32 | <0.01 | 0.12 |
(DNA)4 | QZVP | 10 | 10 | 10 | <0.01 | −0.03 | 0.05 | 0.60 | 0.01 | 0.40 |
Y2C5H20N20O13 | TZVP | 11 | 11 | 12 | <0.01 | <0.01 | <0.01 | 0.04 | <0.01 | 0.15 |
Y4C10H42N40O27 | TZVP | 12 | 12 | 12 | <0.01 | <0.01 | <0.01 | 0.02 | <0.01 | 0.26 |
These shortcomings were then addressed in Sec. IV C (i-sn-LinK; part 3) by introducing incremental K-builds, which update a once-computed high-precision K-matrix. This recovers the accuracy and stable convergence behavior of the “mp” approach, but only one non-fp32 K-build has to be performed within the SCF. The final energy is, of course, never computed with pure single-precision, since this leads to unacceptably large errors (cf. “fp32” in Table IV). Because the i-sn-LinK scheme represents the best trade-off between accuracy and performance on GPUs, i.e., close to pure-fp32 performance and essentially pure-fp64 accuracy, we recommend the i-sn-LinK method for GPUs.
V. ILLUSTRATIVE CALCULATIONS
A. SCF convergence for difficult electronic structures
Since deterioration in SCF convergence stability was the main reason to develop the i-sn-LinK method of part 3 (Sec. IV C) and since it was also shown to be problematic in other works on single-precision execution,31–36 we provide a more detailed investigation on difficult molecules regarding SCF stability in Table VIII. In order to exemplify the worst-case scenario, we selected the ten molecules with the worst convergence behavior of the ASCDB benchmark set58 and employ the def-TZVPPD basis set, which, due to the additional diffuse basis-functions, requires even higher numerical precision.
The results verify that both the mixed precision 3c1e evaluation and i-sn-LinK have no impact on the stability of the SCF convergence, even when considering very difficult electronic structures. Furthermore, the largest error of 0.83 µEh occurs for the most difficult structure (cisHO3) and is still smaller than the SCF convergence criterion of 10−6.
B. Performance for large systems and large basis sets
To illustrate the practical applicability of the mixed-precision methods developed in this work, a range of systems are tested with up to quadruple-ζ basis sets in Table IX.
First, the reliability and accuracy of the two methods is proven again, i.e., nearly all molecules converge within the same amount of SCF steps (10–12 steps) regardless of the method used for the Fock-exchange. Furthermore, the converged energies match the fp64 results to <0.1 µEh even in the case of heavy elements (e.g., yttrium-complexes in Table XI), the nuclear forces are accurate within 0.7 µEh a0−1, and the converged density matrices are accurate within 10−6 a.u. except for C60.
For the C60-Fullerene, larger deviations (∼10−5 a.u.) are, however, not directly due to fp32-execution but are instead caused by the slightly different integral screening in i-sn-link, where ΔP is employed instead of P within the incremental updates. This is supported by Table X, where tighter thresholds are employed for both the fp64-reference and the i-sn-LinK calculations in order to remove the effect of the integral screening. The remaining deviations are then only caused by the reduced numerical precision in i-sn-LinK. These remaining errors are significantly smaller for the most challenging molecules such as C60 or (DNA)4/QZVP, proving that those larger deviations in Table IX were indeed caused by the altered integral screening and not by the reduced numerical precision.
Summarizing the accuracy comparisons, the two tested mixed-precision methods “mp” and i-sn-LinK lead to essentially negligible errors (ΔE < 0.1 µEh, ΔP ≤ 10−6, ΔForces <1 µEh a0−1), which are multiple orders of magnitude smaller than “chemical accuracy”(∼1000 µEh).
Meanwhile, the methods lead to considerable speedups compared to pure fp64-execution, as presented in Table XI: 1.4–1.8× speedups are obtained from the mixed-precision method on CPUs, around 2× speedups are obtained from the mixed-precision method on GPUs, and up to 6.9× speedups are obtained with the incremental i-sn-LinK method on GPUs. The speedups of i-sn-LinK are typically larger for more expensive computations (larger molecules and larger basis sets) because the BLAS-3 steps [Eqs. (5) and (7)] are comparatively more expensive in these situations.
Comparison of the root mean square deviation of the converged density matrix (ΔP) from i-sn-LinK with the original thresholds / (same values as in Table IX) and with very tight thresholds /.
Molecule . | Basis . | ΔP (default thresh) (10−6) . | ΔP (tight thresh) (10−6) . |
---|---|---|---|
Fullerene C60 | TZVP | 10.36 | 0.63 |
Fullerene C60 | QZVP | 8.00 | 1.00 |
Crambin | TZVP | 0.31 | 0.12 |
(DNA)1 | TZVP | 0.25 | 0.16 |
(DNA)4 | TZVP | 0.39 | 0.26 |
(DNA)1 | QZVP | 0.32 | 0.17 |
(DNA)4 | QZVP | 0.60 | 0.21 |
Molecule . | Basis . | ΔP (default thresh) (10−6) . | ΔP (tight thresh) (10−6) . |
---|---|---|---|
Fullerene C60 | TZVP | 10.36 | 0.63 |
Fullerene C60 | QZVP | 8.00 | 1.00 |
Crambin | TZVP | 0.31 | 0.12 |
(DNA)1 | TZVP | 0.25 | 0.16 |
(DNA)4 | TZVP | 0.39 | 0.26 |
(DNA)1 | QZVP | 0.32 | 0.17 |
(DNA)4 | QZVP | 0.60 | 0.21 |
Performance-comparison of pure fp64 execution (fp64), mixed-precision 3c1e integral evaluation (mp), and i-sn-LinK on GTX 1080Ti GPU and two Xeon E5-2630 CPUs (CPU). Timings are given in seconds as the cumulative time for all exchange-builds, including the final exchange-energy calculation. The speedups compared to the respective pure fp64 implementation are given in parenthesis.
. | . | . | K-time (s) GPU . | K-time (s) CPU . | |||
---|---|---|---|---|---|---|---|
Molecule . | Basis . | Nbfc . | fp64 . | mp . | i-sn-LinK . | fp64 . | mp . |
Taxol | 6-31G* | 1 013 | 194 | 92 (2.1×) | 40 (4.8×) | 187 | 107 (1.7×) |
Valinomycin | 6-31G* | 1 350 | 385 | 180 (2.1×) | 77 (5.0×) | 353 | 203 (1.7×) |
Olestra | 6-31G* | 3 181 | 698 | 421 (1.7×) | 147 (4.7×) | 567 | 376 (1.5×) |
Fullerene C60 | TZVP | 2 160 | 718 | 343 (2.1×) | 167 (4.3×) | 771 | 436 (1.8×) |
Fullerene C60 | QZVP | 4 320 | 2 690 | 1 220 (2.2×) | 622 (4.3×) | 2 780 | 1 590 (1.7×) |
(S8)20 | TZVP | 6 720 | 2 290 | 1 220 (1.9×) | 333 (6.9×) | 2 470 | 1 480 (1.7×) |
Crambina | TZVP | 13 698 | 22 400 | 11 200 (2.0×) | 5260 (4.3×) | 28 800 | 16 800 (1.7×) |
(DNA)1 | SVP | 660 | 48 | 27 (1.8×) | 12 (4.0×) | 52 | 32 (1.6×) |
(DNA)4 | SVP | 2 904 | 1 240 | 627 (2.0×) | 256 (4.8×) | 1 120 | 678 (1.6×) |
(DNA)16 | SVP | 11 880 | 9 600 | 6 200 (1.5×) | 1890 (5.1×) | 8 800 | 6 390 (1.4×) |
(DNA)1 | TZVP | 1 422 | 205 | 105 (2.0×) | 45 (4.6×) | 176 | 102 (1.7×) |
(DNA)4 | TZVP | 6 336 | 5 070 | 2 480 (2.0×) | 1000 (5.0×) | 5 200 | 3 060 (1.7×) |
(DNA)1 | QZVP | 3 465 | 975 | 450 (2.2×) | 212 (4.6×) | 920 | 550 (1.7×) |
(DNA)4a | QZVP | 15 030 | 19 700 | 9 490 (2.1×) | 3700 (5.3×) | 19 952 | 12 349 (1.6×) |
Y2C5H20N20O13 | TZVP | 1 580 | 237 | 120 (2.0×) | 53 (4.5×) | 248 | 143 (1.7×) |
Y4C10H42N40O27 | TZVP | 3 208 | 1 060 | 514 (1.7×) | 202 (5.2×) | 1 060 | 614 (1.7×) |
. | . | . | K-time (s) GPU . | K-time (s) CPU . | |||
---|---|---|---|---|---|---|---|
Molecule . | Basis . | Nbfc . | fp64 . | mp . | i-sn-LinK . | fp64 . | mp . |
Taxol | 6-31G* | 1 013 | 194 | 92 (2.1×) | 40 (4.8×) | 187 | 107 (1.7×) |
Valinomycin | 6-31G* | 1 350 | 385 | 180 (2.1×) | 77 (5.0×) | 353 | 203 (1.7×) |
Olestra | 6-31G* | 3 181 | 698 | 421 (1.7×) | 147 (4.7×) | 567 | 376 (1.5×) |
Fullerene C60 | TZVP | 2 160 | 718 | 343 (2.1×) | 167 (4.3×) | 771 | 436 (1.8×) |
Fullerene C60 | QZVP | 4 320 | 2 690 | 1 220 (2.2×) | 622 (4.3×) | 2 780 | 1 590 (1.7×) |
(S8)20 | TZVP | 6 720 | 2 290 | 1 220 (1.9×) | 333 (6.9×) | 2 470 | 1 480 (1.7×) |
Crambina | TZVP | 13 698 | 22 400 | 11 200 (2.0×) | 5260 (4.3×) | 28 800 | 16 800 (1.7×) |
(DNA)1 | SVP | 660 | 48 | 27 (1.8×) | 12 (4.0×) | 52 | 32 (1.6×) |
(DNA)4 | SVP | 2 904 | 1 240 | 627 (2.0×) | 256 (4.8×) | 1 120 | 678 (1.6×) |
(DNA)16 | SVP | 11 880 | 9 600 | 6 200 (1.5×) | 1890 (5.1×) | 8 800 | 6 390 (1.4×) |
(DNA)1 | TZVP | 1 422 | 205 | 105 (2.0×) | 45 (4.6×) | 176 | 102 (1.7×) |
(DNA)4 | TZVP | 6 336 | 5 070 | 2 480 (2.0×) | 1000 (5.0×) | 5 200 | 3 060 (1.7×) |
(DNA)1 | QZVP | 3 465 | 975 | 450 (2.2×) | 212 (4.6×) | 920 | 550 (1.7×) |
(DNA)4a | QZVP | 15 030 | 19 700 | 9 490 (2.1×) | 3700 (5.3×) | 19 952 | 12 349 (1.6×) |
Y2C5H20N20O13 | TZVP | 1 580 | 237 | 120 (2.0×) | 53 (4.5×) | 248 | 143 (1.7×) |
Y4C10H42N40O27 | TZVP | 3 208 | 1 060 | 514 (1.7×) | 202 (5.2×) | 1 060 | 614 (1.7×) |
Reduced grid-batch size due to limited GPU memory.
Comparing the CPU and GPU performance, we note that our incremental i-sn-LinK method is essential for an effective GPU acceleration with gaming GPUs. That is, pure fp64 execution is about as fast on the GPU as on CPUs,59 but i-sn-LinK is up to 4.4× faster on the GPU [333 vs 1475 s for (S8)20] than on CPUs.
VI. CONCLUSION AND OUTLOOK
This work presents a systematic study of the applicability of single-precision instructions to evaluate the Fock-exchange matrix within seminumerical integration schemes. First, we demonstrated that only a very small fraction of the 3c1e integrals needs to be evaluated with fp64 instructions to still provide virtually the same result while providing nearly 2× speedups for this step on CPUs. We then demonstrated that pure fp32 execution still leads to surprisingly accurate results, as long as the final energy is computed at higher accuracy, although the SCF convergence behavior is somewhat deteriorated. Finally, we proposed the i-sn-LinK method, where incremental exchange-builds are employed for the last SCF steps, which removes all instabilities and inaccuracies from pure fp32 execution while requiring only a single high precision K-matrix build. This method provides up to 7× speedups for the whole SCF on typical “gaming” GPUs (1080Ti) without any significant impact on the numerical stability or accuracy and is therefore essential for an effective GPU acceleration using more affordable “gaming” GPUs.
Finally, we note that the present study on single-precision execution within seminumerical integration is also relevant for the evaluation of local-hybrid functionals, where the most time consuming step, i.e., the evaluation of the local exchange contributions, is identical to the seminumerical expression for the exchange matrix. Thus, we expect the performance benefits presented in this work to also translate to this novel and exciting class of functionals.
SUPPLEMENTARY MATERIAL
ACKNOWLEDGMENTS
The authors acknowledge financial support by the “Deutsche Forschungsgemeinschaft” (DFG) via the Grant No. SFB 1309-32587107 and the Cluster of Excellence “Munich Center for Quantum Science and Technology” (MCQST, Grant No. EXC2111-390814868). C.O. further acknowledges financial support as a Max-Planck-Fellow at the MPI-FKF, Stuttgart.
DATA AVAILABILITY
The data that support the findings of this study are available within the article and its supplementary material.
REFERENCES
We also tested evaluating the exchange energy as tr(PK) instead, which results in up to 10 times larger errors.
We also tested conventional incremental updates, which lead to the need of slightly less integrals in the last SCF steps (<10% performance gains for the whole calculation) at the cost of slightly less stable convergence.
Note that we compare a 700$ GPU with 1400$ (2 × 700$) CPUs.