We introduce Quantum Machine Learning (QML)-Lightning, a PyTorch package containing graphics processing unit (GPU)-accelerated approximate kernel models, which can yield trained models within seconds. QML-Lightning includes a cost-efficient GPU implementation of FCHL19; together, these provide energy and force predictions with competitive accuracy on a microsecond-per-atom timescale. Using modern GPU hardware, we report learning curves of energies and forces, as well as timings, as numerical evidence for select legacy benchmarks from atomistic simulation, including QM9, MD-17, and 3BPA.

## I. INTRODUCTION

Data-driven approximate machine learning (ML) methods have become increasingly prominent in theoretical chemistry in recent years.^{1–11} In particular, supervised learning can be used to augment accurate but computationally prohibitive electronic structure calculations. For tasks such as *ab initio* molecular dynamics (AIMD), approximate variants of the electronic Schrödinger equation are solved for every coordinate update. Consequently, surrogate ML models that can partially substitute for quantum calculations are extremely beneficial for reducing the overall computational burden (and carbon footprint). In general, these ML models first transform atomic coordinates into an intermediate symmetry-preserving representation, which is then passed on to a nonlinear machine learning model, usually one based on neural networks (NNs)^{12} or Gaussian processes (GPs).^{13}

While NNs have received significant attention to enable them to scale performantly on graphics processing unit (GPU) architectures through GPU-accelerated packages, such as PyTorch^{14} or TensorFlow,^{15} much less focus has been placed on providing similar acceleration and tooling for GP methods, as the vanilla variants of these tend to scale unfavorably on GPU devices.

However, there are GP models with improved time complexity that rely on sparsification, whereby a subset of the training data is used to enable training on the full dataset.^{16,17} In particular, the training step time complexity becomes linear with respect to the amount of sparse training data, and the inference time of these models is favorable. This strategy has enabled sparse GP models to be trained in a “Big Data” context while retaining sufficient accuracy with respect to vanilla GP.

Alternatively, the mapping provided by the kernel function central to GP models can be approximated to yield a desired linearization. In this work, we briefly discuss the seminal work on random Fourier features (RFF)^{18} in which the kernel mapping is approximated via a low-dimensional feature function. We also summarize a significantly more computationally efficient variant, termed structured orthogonal random features (SORF),^{19} which further reduces this cost.

We provide a software package to perform training and prediction of Quantum Machine Learning (QML) models, termed QML-Lightning. It includes GPU implementations of sparse GP, RFF, and SORF models, as well as a GPU implementation of FCHL19,^{20} an accurate atom-centered representation. QML-Lightning is built upon the PyTorch software package, with additional CUDA C implementations of critical components to improve computational throughput. A thorough benchmark of the predictive accuracy of QML-Lightning has been performed on several established datasets of chemical compounds from the literature, comparing against existing kernel- and neural-network-based models. To assess the models’ performance across chemical compound space, we have benchmarked the model against the QM9 dataset,^{21} which contains 134k small organic molecules with elements C, H, O, N, and F. For dynamics and structural relaxation applications, we have benchmarked the model against both the MD17^{22–24} and revised rMD17^{25} datasets, which contain trajectories of 10 small organic molecules. To probe the models’ extrapolative performance, QML-Lightning has been benchmarked against the challenging 3BPA dataset,^{26} which contains three sets of MD trajectories at temperatures of 300, 600, and 1200 K. We note that we have focused on energetic properties in this work as these are the most critical for AIMD applications; however, other QM properties can also be used within the QML-Lightning framework straightforwardly. Finally, we detail the wall time to perform 1k MD steps for a variety of systems, including periodic systems with up to ∼14k atoms, using an OpenMM^{27} plugin that is provided as part of this work.

Note that while finalizing this work, Dhaliwal *et al.*^{28} published RFF-based interatomic potentials for performing molecular dynamics, with promising results for central processing unit (CPU) based applications. Here, however, we focus on a GPU implementation of the more efficient and accurate SORF model.

## II. SOFTWARE AVAILABILITY

The QML-Lightning software is provided under an MIT license at https://github.com/nickjbrowning/qml-lightning. We additionally provide an OpenMM wrapper for QML-Lightning under the same license at https://github.com/nickjbrowning/openmm-qml-lightning.

## III. THEORY

This section first summarizes a subset of kernel methods to learn quantum mechanical properties. First, Gaussian process regression (GP)^{29,30} is introduced to learn both energies and forces of chemical compounds. Then, sparse GP^{17,30} is discussed, and finally the approximate kernel methods random Fourier features (RFF)^{18} and structured orthogonal random features (SORF)^{19} are introduced.

In the following, upper case indices denote the index of chemical compounds in a dataset, while lower-case indices denote the index of atomic centers in each chemical compound.

### A. Gaussian process regression (GP)

In GP kernel models of quantum mechanical properties, one constructs a linear model using a basis of kernel functions *k*(*ρ*_{J}, ⋅). For example, to learn potential energies, a suitable functional form would be

where *ρ*_{i} and *ρ*_{j} are the atomic representations of atoms *i* and *j*, respectively, and *N*_{t} is the number of training molecules. We note that these atomic representations are functions of the set of local atomic charges and coordinates {*z*, **r**}, which has been omitted for brevity. The sets *C*_{I} and *C*_{J} contain the atoms of the training and query molecules *I* and *J*, respectively. The coefficients *α*_{I} are obtained from the regularized minimization problem given by

which has the following solution in matrix form:

The hyperparameter *λ* is a small number in order to regularize and ensure numerical stability upon kernel inversion.^{31} For learning potential energies and atomic forces simultaneously, one can construct an expression for the potential energy as follows:

where matrix notation is introduced for simplicity and $\partial/\partial\vec{r}$ is used as a shorthand to stack the following derivatives:

where *l* indexes the coordinate components of the query atom *J*. The Hessian in Eq. (1) has the following form:

where *k* indexes the coordinate components of training atom *I*. The dimension of the full GP kernel is (3*MN*_{t} + *N*_{t}) × (3*MN*_{t} + *N*_{t}), where *M* is the average number of atoms per molecule in the entire training set. In particular, the Hessian term in Eq. (1) has a compute time scaling as $O(36N_t^2M^4)$, which severely limits applicability with respect to both large training set sizes and large systems.
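As a minimal illustration of the energy-only case, the regularized kernel solve can be sketched as follows. This is a NumPy sketch that assumes each molecule has already been reduced to a single fixed-length representation vector (rather than the per-atom kernel sums of the full model); `sigma` and `lam` stand in for the σ and λ hyperparameters:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel between two sets of per-molecule representation vectors."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_gp(X_train, U_train, sigma=4.0, lam=1e-6):
    """Solve (K + lam * I) alpha = U for the regression coefficients alpha."""
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(U_train)), U_train)

def predict(X_query, X_train, alpha, sigma=4.0):
    """Predicted energies: linear combination of kernels centered on training points."""
    return gaussian_kernel(X_query, X_train, sigma) @ alpha
```

With λ small, the model interpolates the training energies nearly exactly; λ is kept just large enough to regularize and stabilize the inversion, as noted above.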

### B. Sparse Gaussian process regression (sparse GP)

To alleviate the complexity of the above, which will be referred to as “full GP” in the following, Bartók *et al.*^{16,29} introduced sparse GPs for modeling molecular systems. For a thorough review of GP and sparsification, please see Ref. 30. In sparse GP, rather than using all available training points, one selects a subset of representations as the basis. For example, a subset of the kernels centered on atoms can be used as the sparse basis, such that the expression for the potential energy is as follows:

where *N*_{s} is the number of sparse points selected. The sparse points *I* are typically selected from the training database using uniform random sampling, CUR matrix decomposition,^{32} or farthest point sampling (FPS).^{17} The key difference from the full GP presented above is that one coefficient *α*_{I} is used per sparse point, rather than one coefficient per training molecule. For potential energy models, this formalism reduces the cost of building the kernel and of predictions to $O(N_s^2 N_t M d)$ and $O(N_s M d)$, respectively, where *M* in this context is the (average) number of atoms in the (training) query molecule and *d* is the size of the representation. The model coefficients can be obtained by minimizing the following loss function in matrix form:

which yields the solution

Additionally, atomic forces can be included in the training scheme resulting in the following model:

where the matrix $-\frac{\partial}{\partial \vec{r}}\mathbf{K} \in \mathbb{R}^{3N_tM \times N_s}$ stacks the 3*N*_{t}*M* partial Cartesian derivatives with all sparse points. In the following, we have dropped the sGP label for simplicity, and we also introduce the notation *∂***K** as a synonym for the matrix $-\frac{\partial}{\partial \vec{r}}\mathbf{K}$. This yields the loss function,

where the notation [**U**, **F**] is used to indicate the concatenation of the matrices **U** and **F**. The coefficients are then obtained from the following equation:

We stress that including derivatives in the above formulation does not increase the size of the basis in which the total energy is expanded, so these derivatives only appear in the loss function. To compute these model coefficients, the dominant term is now the formation of the *∂***K**^{T}*∂***K** matrix product, which scales as $O(3N_tMN_s^2)$.
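Schematically, and assuming the kernel blocks **K** and *∂***K** have already been evaluated, the combined energy-and-force fit reduces to a single set of normal equations. Below is a sketch with a plain λ**I** regularizer standing in for the exact regularization term of the loss above:

```python
import numpy as np

def solve_sparse_gp(K, dK, U, F, lam=1e-8):
    """Normal equations for a sparse GP trained on energies U and forces F.

    K  : (N_t, N_s)       energy kernel between training molecules and sparse points
    dK : (3*N_t*M, N_s)   stacked Cartesian derivatives of K (the dK matrix above)
    U  : (N_t,)           energies;  F : (3*N_t*M,) flattened forces
    """
    # Dominant cost is dK.T @ dK, scaling as O(3 N_t M N_s^2).
    A = K.T @ K + dK.T @ dK + lam * np.eye(K.shape[1])
    b = K.T @ U + dK.T @ F
    return np.linalg.solve(A, b)
```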

### C. Random Fourier features (RFF)

As an alternative to sparsifying over the training data to reduce the computational complexity of training and inference, it is also possible to approximate the mapping provided by the kernel function such that linear models result. In the seminal work of Rahimi and Recht,^{18} a lower-dimensional lifting function was introduced to approximate the inner product synonymous with the kernel method,

As a consequence of Bochner’s theorem, the Fourier transform of a shift-invariant kernel *k*(*ρ*_{I}, *ρ*_{J}) = *k*(*ρ*_{I} − *ρ*_{J}) is a proper probability distribution. Consequently, one can readily define an explicit feature map that approximates the kernel via Monte Carlo integral estimation,

where *N*_{f} is the number of independent vectors **w** drawn from the probability distribution *p*(**w**) and *i* is the imaginary unit. For different kernels, the distribution *p*(**w**) takes different forms; however, for the Gaussian kernels used here, *p*(**w**) is also Gaussian. After applying Euler’s formula to **z**(*ρ*) and simplifying, the following low-dimensional feature map is obtained:

where *b* is sampled from a uniform distribution on [0, 2*π*]. Since the potential energies are extensive, one can partition them into atomic energy contributions *ϵ*_{I}, and the representation of each atomic environment is passed into this low-dimensional feature mapping,

where $\mathbf{Z} \in \mathbb{R}^{N_t \times N_f}$ is the feature matrix corresponding to *N*_{t} training observations. The weights **α** are the solution to the following regularized normal equation:

where the coefficients are obtained first by an LU decomposition of $\mathbf{Z}^T\mathbf{Z} + \lambda \mathbf{I}$. The time complexities of constructing the **Z**^{T}**Z** matrix and performing inference scale as $O(N_f^2 N_t M d)$ and $O(N_f M d)$, respectively. To include forces in the training scheme, the derivatives of the feature vectors $\partial z_{Il}/\partial r_{ik}$ are computed, where *l* and *k* are the feature and coordinate component indices, respectively, and stored in a derivative feature matrix $\partial\mathbf{Z} \in \mathbb{R}^{3N_tM \times N_f}$. The following regularized normal equation is then solved:

where the notation $[\mathbf{Z}, \partial\mathbf{Z}]$ indicates the concatenation of the feature matrix **Z** with the derivative features *∂***Z**. The dominant term in constructing the normal equations is the *∂***Z**^{T}*∂***Z** matrix product, which scales as $O(3N_tMN_f^2)$.
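The energy-only RFF pipeline can be sketched compactly. Here, per-molecule representation vectors are assumed for brevity (the full model sums atomic features), **w** is drawn from the Gaussian *p*(**w**), and *b* from U[0, 2π] as above:

```python
import numpy as np

def rff_features(X, n_features, sigma, rng):
    """z(x) = sqrt(2/N_f) cos(W x + b), approximating exp(-||x - y||^2 / (2 sigma^2))."""
    W = rng.normal(0.0, 1.0 / sigma, size=(n_features, X.shape[1]))  # rows w ~ p(w)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)

def fit_rff(Z, y, lam=1e-8):
    """Regularized normal equations: (Z^T Z + lam I) alpha = Z^T y."""
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)
```

As *N*_{f} grows, the inner product z(x)·z(y) converges to the exact Gaussian kernel at the Monte Carlo rate.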

### D. Structured orthogonal random features (SORF)

The above formulation revolves around computing the affine transformation $\mathbf{W}[\rho_I, 1]^T$, which is applied to *ρ*_{I} prior to the cosine operation $(z(\rho_I) = \cos(\mathbf{W}[\rho_I, 1]^T))$. Storing and computing this linear transformation has *O*(*N*_{f}*d*) space and time complexity. To reduce this, Yu *et al.*^{19} introduced structured orthogonal random features (SORF). In this method, the matrix **W** is replaced by a special structured matrix consisting of products of random binary diagonal matrices and Walsh–Hadamard matrices. The resulting linear transformation has *O*(*N*_{f} log *d*) time complexity and *O*(*d*) or *O*(1) space complexity, depending on its implementation. Briefly, in the case of Gaussian kernel approximation, one can replace the transformation

where $\mathbf{G} \in \mathbb{R}^{N_f \times d}$ is a random Gaussian matrix, with the following transformation:

where **Q** is a uniformly distributed random orthogonal matrix (e.g., obtained via QR decomposition of **G**) and **S** is a diagonal matrix with entries sampled i.i.d. from the *χ*-distribution with *d* degrees of freedom. The resulting matrix **SQ** is an unbiased estimator of the Gaussian kernel with low variance.^{19} While this construction still has *O*(*N*_{f}*d*) time complexity as well as the additional cost of computing the QR decomposition in a preprocessing step, one can further approximate this transformation as

where **S** has first been replaced by a scalar $\sqrt{d}$ and the random orthogonal matrix **Q** has been replaced by a special type of structured matrix. The matrices $\mathbf{D}_i \in \mathbb{R}^{d \times d}$ are diagonal sign-flipping matrices, where each entry is sampled from a Rademacher distribution, and **H** is the Walsh–Hadamard matrix,

Note that when the number of features *N*_{f} > *d*, the operation is simply repeated $N_f/d$ times, with the resulting vectors concatenated into a length-*N*_{f} vector. Crucially, the product **W**_{SORF}*ρ*_{I} now has time complexity *O*(*N*_{transform}*N*_{f} log *d*), since the multiplication with **H** can be efficiently implemented via the fast Hadamard transform using in-place operations in *O*(*d* log *d*) time. Finally, since the Walsh–Hadamard matrix is only defined in $\mathbb{R}^{2^n \times 2^n}$, *ρ*_{I} must also be projected into 2^{n} dimensions. This is achieved via a singular value decomposition (SVD) of a subset of the atomic environment representations for each element *e*, concatenated into the matrix $\tilde{\mathbf{Z}}_e$ given by

and the atomic representations are projected into a lower dimension via the following matrix-vector product,

where only the first *N*_{PCA} columns of the matrix **P**_{e} are used. The subscript *e* indicates that the matrix $\tilde{\mathbf{Z}}_e$ is built using only atomic representations of atom type *e*; hence, each element has its own projection matrix $\mathbf{P}_e^{N_{\mathrm{PCA}}}$. Here, we have found *N*_{PCA} = 128 or 256 to be sufficient. In the rest of the text, *d* refers to the dimension of the atomic representation after projection.
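The structured transform itself fits in a few lines. The NumPy sketch below applies the normalized Walsh–Hadamard transform via the *O*(*d* log *d*) butterfly, with three Rademacher diagonals playing the role of **D**_{1}, **D**_{2}, **D**_{3}; *d* must be a power of 2, and the normalization conventions here are one common choice rather than the package's exact ones:

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform via the O(d log d) butterfly.

    Operates on the last axis; d must be a power of 2. Normalized by 1/sqrt(d)
    so that the transform is orthogonal (norm-preserving).
    """
    x = x.copy()
    d = x.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly: sum half
            x[..., i + h:i + 2 * h] = a - b  # butterfly: difference half
        h *= 2
    return x / np.sqrt(d)

def sorf_transform(rho, D, sigma=1.0):
    """Apply (sqrt(d)/sigma) * H D1 H D2 H D3 to a batch of projected representations."""
    d = rho.shape[-1]
    out = rho
    for Di in D:                # D = [D1, D2, D3], each a +/-1 (Rademacher) diagonal
        out = fwht(out * Di)    # diagonal sign flip, then Hadamard transform
    return np.sqrt(d) / sigma * out
```

Since each **HD**_{i} factor is orthogonal under this normalization, the whole product only rescales vector norms by √d/σ, which the test below checks.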

We note that there are a number of other approximate kernel methods that aim to reduce computational complexity, including other RFF-type approximations^{33–35} as well as those based on the Nyström method,^{36} which relies on low-rank structure in the kernel matrix. Here, however, we have opted to use SORF due to its simplicity, computational efficiency, and accuracy.^{35}

Finally, for RFF and SORF models there are two user-defined hyperparameters: *σ*, *λ*. In all results presented in this work, the kernel width parameter (*σ*) and regularization parameter (*λ*) are selected from a grid using fivefold cross-validation; however, we note that suitable *σ* and *λ* values can be found within the range of 2–12 and 10^{−5} to 10^{−8}, respectively.
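The grid search over (σ, λ) can be sketched as follows. This is illustrative only: an exact Gaussian-kernel ridge model on per-molecule vectors stands in for the full SORF pipeline, and the fold-splitting is the simplest contiguous variant:

```python
import numpy as np

def five_fold_cv(X, y, sigma, lam, n_folds=5):
    """Mean out-of-fold MAE of a Gaussian-kernel ridge model for one (sigma, lam) pair."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    maes = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        d2 = ((X[train, None] - X[None, train]) ** 2).sum(-1)
        K = np.exp(-d2 / (2.0 * sigma**2))
        alpha = np.linalg.solve(K + lam * np.eye(len(train)), y[train])
        d2_test = ((X[test, None] - X[None, train]) ** 2).sum(-1)
        pred = np.exp(-d2_test / (2.0 * sigma**2)) @ alpha
        maes.append(np.abs(pred - y[test]).mean())
    return float(np.mean(maes))

def grid_search(X, y, sigmas=(2, 4, 8, 12), lams=(1e-5, 1e-6, 1e-7, 1e-8)):
    """Return the (sigma, lam) pair with the lowest cross-validated MAE."""
    scores = {(s, l): five_fold_cv(X, y, s, l) for s in sigmas for l in lams}
    return min(scores, key=scores.get)
```

The default grid mirrors the ranges quoted above (σ in 2–12, λ in 10^{−5}–10^{−8}).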

### E. Representation

In this work, we use FCHL19^{20} as the permutationally and rotationally invariant atomic environment featurization layer. FCHL19 is an atom-centered representation consisting of two- and three-body elemental bins, similar in construction to the atom-centered symmetry functions (ACSFs) used in the work of Behler.^{12,37} The functional form is briefly summarized here. For every unique combination of two elements $\{X, Y\}$, the representation for each atom *I* is constructed as follows:

where $\{Z, R\}_{X,Y}$ refers to the set of atomic charges and coordinates of atoms of element $X$ or $Y$. The two-body function is given by

where *R*_{s} are the *n*_{2} radial grid centers linearly distributed between 0 and *r*_{cut}, and *μ*(*r*_{IJ}) and *σ*(*r*_{IJ}) are the parameters of the log-normal distribution,

where *η*_{2} is a hyperparameter. The cutoff function to smoothly decay the representation to zero at *r*_{cut} is defined as

The three-body term $G^{3\text{-}\mathrm{body}}_{X,Y}$ is given by the following function:

The radial term $G^{3\text{-}\mathrm{body}}_{\mathrm{radial}}$ is given by the following expression:

where *η*_{3} is a parameter that controls the width of the *n*_{3} radial distribution functions, located at *R*_{s} grid points. The three-body scaling function *ξ*_{3} is the Axilrod–Teller–Muto term^{38,39} with modified exponents,^{40}

where *θ*_{KIJ} is the angle between atoms *I*, *J*, *K*, with *I* at the center, *c*_{3} is a weight term, and *N*_{3} is a three-body scaling factor. Finally, the angular term is given by a Fourier expansion as follows:

where the cosine and sine terms are given by

where *ζ* is a parameter describing the width of the angular Gaussian function and *n* > 0 is the expansion order. Similar to previous work, only the two *n* = 1 cosine and sine terms are used.

### F. Computational details

#### 1. Optimization of representation parameters

The optimal parameters used to generate the FCHL19 representation here differ from those used in the original implementation.^{20} We have found improved parameters using a lower cutoff of *r*_{cut} = 6.0 Å. To fit these parameters, we employed a subset of 576 distorted geometries of small molecules with up to five C, N, or O atoms, saturated with hydrogen atoms, for which forces and energies have been obtained from density functional theory (DFT) calculations.^{20,41} This dataset is identical to that used in the original FCHL19 publication. The dataset is randomly divided into a training set of 384 geometries and a test set of 192 geometries. Models are fitted to the training set, and predictions on the test set are used to minimize the following cost function with respect to the parameters:

where *U*_{i} is the energy of molecule *i* and *F*_{i} and *n*_{i} are the forces and number of atoms for the same molecule, respectively. A greedy Monte Carlo optimization was used to perform this optimization, where real-type parameters are optimized by multiplying with a factor randomly chosen from a normal distribution centered on 1 with a variance of 0.05 and integer-type parameters by randomly adding +1 or −1. The final parameters that were found to work best are listed in Table I.
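The optimization loop described above can be sketched as follows. A toy quadratic cost stands in for the actual train-and-evaluate cycle, and the parameter names are illustrative:

```python
import numpy as np

def greedy_mc_optimize(params, cost_fn, n_steps=500, seed=0):
    """Greedy Monte Carlo: perturb one parameter at a time, keep it only if the cost drops.

    Real-valued parameters are scaled by a factor ~ N(1, variance 0.05);
    integer parameters are randomly incremented or decremented by 1.
    """
    rng = np.random.default_rng(seed)
    best = dict(params)
    best_cost = cost_fn(best)
    names = list(best)
    for _ in range(n_steps):
        trial = dict(best)
        name = names[rng.integers(len(names))]
        if isinstance(trial[name], int):
            trial[name] = int(max(1, trial[name] + rng.choice([-1, 1])))
        else:
            # np.random uses the standard deviation, hence sqrt of the 0.05 variance
            trial[name] = trial[name] * rng.normal(1.0, np.sqrt(0.05))
        c = cost_fn(trial)
        if c < best_cost:           # greedy: accept only improvements
            best, best_cost = trial, c
    return best, best_cost
```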

| Parameter | U | U + F |
|---|---|---|
| *n*_{2} | 23 | 24 |
| *n*_{3} | 22 | 22 |
| *η*_{2} | 0.27 | 0.28 |
| *η*_{3} | 5.6 | 3.2 |
| *N*_{2} | 2.78 | 2.3 |
| *N*_{3} | 2.1 | 0.65 |
| *c*_{3} | 60.1 | 18.8 |
| *ζ* | π | π |
| *r*_{cut} (Å) | 6.0 | 6.0 |


#### 2. GPU implementation: Representation

The FCHL19 representation is constructed by assigning each atom *I* to a block in a batch; one block is launched for each atom in the system. Each block uses a total of 256 threads in a two-dimensional thread grid. The first dimension of this grid contains 16 threads and enumerates all two-body interactions with central atom *I* to construct $G_I^2$, while the second dimension contains eight threads, which enumerate the third index of the three-body interaction $G_I^3$. For the forward pass of the model, the reduction of all $G_I^2$ and $G_I^3$ scalar elements is performed in global memory. For the backward pass, however, the atomic force components are stored and summed in (local, on-chip) memory, significantly increasing throughput. A simple tiled neighbor list is used to linearize the cost of FCHL19 with respect to the increasing number of atoms in the local environments; therefore, only atoms within the cutoff radius are considered when constructing the representation. Once the atomic representation has been constructed, it is projected to a lower dimension of size *N*_{PCA} using a matrix obtained from an SVD of a randomly selected subset of atomic representations from the training set. The size of this lower-dimensional vector is constrained to be a power of 2 for the purposes of the Hadamard transform. Each element within the training database has its own projection matrix, which is used to project down each atomic representation separately.
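The projection step can be sketched as follows, assuming `reps_e` holds a randomly selected subset of raw atomic representation vectors for one element *e*:

```python
import numpy as np

def fit_projection(reps_e, n_pca=128):
    """SVD of a subset of atomic representations of element e.

    Keeps the first n_pca right-singular vectors as the projection matrix P_e;
    n_pca should be a power of 2 for the subsequent Hadamard transform.
    """
    _, _, Vt = np.linalg.svd(reps_e, full_matrices=False)
    return Vt[:n_pca].T            # P_e, shape (d_rep, n_pca)

def project(rho, P_e):
    """Project a raw atomic representation into the n_pca-dimensional subspace."""
    return rho @ P_e
```

Because the rows of `Vt` are orthonormal, the columns of `P_e` are orthonormal as well, so the projection is a proper orthogonal reduction of the representation.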

#### 3. GPU implementation: Structured orthogonal features

For the SORF forward pass, each block handles all $Nfd$ Hadamard transforms for a single atom to produce a feature vector of the desired length *N*_{f}. The Hadamard transform itself operates on the projected FCHL19 representation (dimension *d* = 2^{n}, *n* ≥ 1), after multiplication with the diagonal sign-flipping matrix, using a shared-memory butterfly operation that has O(*d* log *d*) complexity. This operation is performed iteratively *N*_{transform} times. In this work, either *d* = 128 or *d* = 256 is used and *N*_{transform} is set to 1. For the backward pass, the gradients are stored and reduced in shared memory.

## IV. RESULTS AND DISCUSSION

We begin by briefly comparing the performance of RFF-type and SORF-type approximations with that of a sparse GP method. We have implemented a sparse GP model in QML-Lightning that uses a Gaussian kernel, where the sparse points for each element are uniformly randomly selected from the available training data. The left column of Fig. 1 shows the convergence of the out-of-sample mean absolute errors (MAEs) for energy with respect to increasing *N*_{f} used to approximate the kernel, as well as the wall time to compute 1k configurations. For the sparse GP model, we set the total number of sparse points *N*_{s} equal to *N*_{f}. Here, we have used the aspirin trajectory from the unrectified MD17 database.^{22–24} The amount of training data remains fixed, using 1k configurations and training on energies only. The right column of Fig. 1 displays the same energy MAEs, now plotted against an estimate of the number of operations performed per atom for the SORF, RFF, and sparse GP models. Since we are only concerned with relative differences, the horizontal axis has been normalized to 1.

For a fixed number of features, the SORF and RFF models outperform the sparse GP model, with SORF having the lowest error. Furthermore, SORF models are faster than both RFF and sparse GP models for all model sizes. We note that, while RFF and sparse GP models have the same formal cost, warp matrix-multiply-accumulate (WMMA) operations are accelerated on TensorCore-enabled GPUs, such as the RTX 3080 used here, thereby making the calculation of the matrix-vector product $\mathbf{W}[\rho_I, 1]^T$ significantly faster. However, increasing *N*_{f} incurs a numerical error in this matrix-vector product for the RFF model if it is performed in single precision. To remedy this, error-controlled low-precision matrix-vector products can be computed with three WMMA operations, which has not been implemented in this work. While this would increase the cost of RFF, it would not significantly change the ordering of the results presented here. We note that this does not appear to be an issue for the SORF model, since the product $\mathbf{W}[\rho_I, 1]^T$ is computed via a series of fast Hadamard transforms requiring only simple additions.

### A. QM9 dataset

In Fig. 2, the predictive accuracy of several explicit kernel models and SORF models for the atomization energies of molecules in the QM9 dataset^{21} is compared. These models include FCHL18^{40} and FCHL19^{20} using the OQML^{20} regressor, which is equivalent to a sparse GP model in which all training atoms are used in the kernel basis. For the SORF models, learning curves using both *N*_{f} = 16 384 and *N*_{f} = 32 768 are shown. We find that the SORF models with FCHL19 perform similarly to OQML with FCHL19: the MAEs for OQML/FCHL19 and SORF-32k/FCHL19 at 75 000 training samples are 11 and 12 meV, respectively. We note that there is a small deviation away from linearity in the learning curve at ≈50 000 training samples, indicating that more features may be required.

### B. MD-17 and rMD-17 datasets

Figure 3 reports the energy and force MAEs as a function of the number of training samples using three molecules from the MD-17 dataset.^{22–24} To be consistent with previous literature,^{20} we use the unrectified MD-17 dataset, which is known to contain significant noise on the energy values.^{25} The learning curves for SORF/FCHL19 with both *N*_{f} = 16 384 and 32 768 are reported. We compare against OQML models based on FCHL19^{20} as well as GDML^{22} and sGDML^{42} models. Additionally, the SchNet^{43} and state-of-the-art NequIP^{44} neural networks have been included.

For energy learning, the SORF/FCHL19 models in general display similar accuracies to the OQML/FCHL19 model. For toluene, naphthalene, and salicylic acid, the OQML/FCHL19 model slightly outperforms the SORF/FCHL19 models for both *N*_{f} = 16 384 and *N*_{f} = 32 768. However, in all other cases, the SORF models display similar or better accuracy. Both sGDML and GDML perform worse than OQML and SORF models in general; however, for toluene and naphthalene specifically, sGDML has the lowest error among the kernel models.

For force learning, the SORF-based models are as accurate as or better than OQML/FCHL19 for *N*_{f} = 16 384, and they are more accurate than OQML for *N*_{f} = 32 768. These models outperform SchNet in all cases, while NequIP outperforms the SORF models in all cases. We note that the training times for SORF/FCHL19 are on the order of seconds, while OQML/FCHL19 and sGDML models take several minutes to train. Furthermore, GDML models take several hours, and SchNet and NequIP are trained over hours to days. We additionally provide a comparative benchmark on the revised MD-17 dataset,^{25} a recomputed version of the original MD-17 dataset with tighter SCF convergence criteria. Table II lists the out-of-sample MAEs for the largest SORF model constructed in this work with FCHL19, GP with FCHL19,^{25} sGDML, ACE,^{45} and NequIP with rotation orders *l* = 0 and *l* = 3. We note that FCHL19 is a comparatively simplistic atomic featurization layer; consequently, it is expected that it does not perform as well as state-of-the-art equivariant many-body neural networks^{46–48} such as NequIP. For a more reasonable comparison, the *l* = 0 NequIP model, which contains at most three-body terms similar to FCHL19, has been included here. While the force errors are similar to the MD-17 results, the energy errors are significantly lower across all models, which is consistent with the observation that the noise floor of the original MD-17 data is higher for the energies.

| Molecule | | SORF-32k | GP/FCHL19 | sGDML | ACE | NequIP (l = 0) | NequIP (l = 3) |
|---|---|---|---|---|---|---|---|
| Aspirin | Energy | 9.6 | 6.2 | 7.2 | 6.1 | 25.2 | 2.3 |
| | Forces | 25.9 | 20.9 | 31.8 | 17.9 | 41.9 | 8.5 |
| Azobenzene | Energy | 5.6 | 2.8 | 4.3 | 3.6 | 20.3 | 0.7 |
| | Forces | 13.6 | 10.8 | 19.2 | 10.9 | 42.3 | 3.6 |
| Ethanol | Energy | 1.5 | 0.9 | 2.4 | 1.2 | 2.0 | 0.4 |
| | Forces | 7.5 | 6.2 | 16.0 | 7.3 | 13.7 | 3.4 |
| Malonaldehyde | Energy | 2.3 | 1.5 | 3.1 | 1.7 | 4.4 | 0.8 |
| | Forces | 12.1 | 10.3 | 18.8 | 11.1 | 23.4 | 5.2 |
| Naphthalene | Energy | 4.9 | 1.2 | 0.8 | 0.9 | 14.7 | 0.2 |
| | Forces | 9.1 | 6.5 | 5.4 | 5.1 | 20.1 | 1.2 |
| Salicylic acid | Energy | 4.3 | 1.8 | 2.1 | 1.8 | 11.4 | 0.7 |
| | Forces | 12.5 | 9.5 | 12.8 | 9.3 | 28.7 | 4.0 |
| Toluene | Energy | 4.2 | 1.7 | 1.0 | 1.1 | 9.7 | 0.3 |
| | Forces | 11.6 | 8.8 | 6.3 | 6.5 | 27.2 | 1.6 |
| Uracil | Energy | 1.4 | 0.6 | 1.4 | 1.1 | 10.0 | 0.4 |
| | Forces | 6.0 | 4.2 | 10.4 | 6.6 | 25.8 | 3.2 |


### C. Flexible 3BPA dataset

The 3BPA dataset^{26,45} contains both ambient- and high-temperature configurational samples for the small drug-like molecule 3-(benzyloxy)pyridin-2-amine (3BPA). This molecule has three central rotatable dihedral angles (*α*, *β*, *γ*), leading to a complex dihedral potential energy surface with many local minima, which can be challenging for both classical and ML-based potentials.^{49} In particular, at ambient temperature (300 K) the accessible phase space is small; however, the dataset also contains 600 and 1200 K configurational samples, which cover an increasingly large phase space. Therefore, a crucial test of whether an ML model can extrapolate well is whether the model, when trained on the 300 K samples, can accurately predict the 600 and 1200 K samples. Consequently, we have trained the SORF/FCHL19 model on the 300 K subset of the 3BPA dataset in order to compare against results obtained from ACE, two kernel models, sGDML and GAP^{29} with SOAP features,^{50,51} as well as two related neural network architectures, ANI^{52,53} and ANI-2x.^{54} We note that these results have been summarized directly from the ACE publication.^{45}

Table III lists the energy and force root-mean-squared errors (RMSEs) of a variety of the models listed in the ACE paper,^{45} together with the SORF model using both 16 384 and 32 768 features. For the 300 K test dataset, the SORF model with *N*_{f} = 32 768 achieves very low out-of-sample errors, having more accurate forces than all models except ACE. For the 600 K dataset, the SORF models do not fare as well as the ACE and ANI models; however, they significantly outperform both kernel-based models, sGDML and GAP with SOAP features. For the 1200 K dataset, the SORF model has lower force errors than ACE; however, the ANI models have appreciably lower errors. We note that ACE contains up to five-body terms in its cluster expansion, while FCHL19 contains only up to three-body terms; therefore, one would expect ACE to outperform the FCHL19/SORF models in this setting.

| T (K) | Property | SORF-16k | SORF-32k | ACE | sGDML | GAP | ANI | ANI-2x |
|---|---|---|---|---|---|---|---|---|
| 300* | Energy | 15.0 | 13.7 | 7.1 | 9.1 | 22.8 | 23.5 | 38.6 |
| | Forces | 39.2 | 36.2 | 27.1 | 46.2 | 87.3 | 42.8 | 84.4 |
| 600 | Energy | 49.3 | 37.7 | 24.0 | 484.8 | 61.4 | 37.8 | 54.5 |
| | Forces | 86.8 | 75.9 | 64.3 | 439.2 | 151.9 | 71.7 | 102.8 |
| 1200 | Energy | 118.4 | 99.2 | 85.3 | 774.5 | 166.8 | 76.8 | 88.8 |
| | Forces | 175.6 | 159.3 | 187.0 | 711.1 | 305.5 | 129.6 | 139.6 |


### D. Timings

Table IV shows the total time in seconds spent training various CPU- and GPU-accelerated models on both energies and forces, given a fixed amount of training data (*N*_{t} = 1k) from the MD-17 database. The device used to train each model is listed for reference. Three kernel models are shown: OQML, GP, and sGDML. We note that the OQML and GP models use the same representation as the GPU variant implemented here. We additionally compare against NequIP; however, the training times shown are an upper bound, as models with good predictive accuracy can likely be obtained with early stopping. Timings for the SORF model with both 16 384 and 32 768 features are shown using a consumer RTX 3080 GPU and a data-center A100 GPU. It should be noted that there is significant cumulative round-off error when constructing the **Z**^{T}**Z** matrix in single precision (FP32), which leads to ill-conditioning. However, Kahan’s summation^{55} has been implemented here to minimize this error, allowing the normal equations to be constructed in FP32 format to yield a threefold reduction in training time compared to FP64 precision. For the A100 GPU, only FP64 performance is shown, as we observed minimal gain in speed when using FP32. Using the RTX 3080 GPU, SORF models are found to be several orders of magnitude faster to train than both GP models and neural networks. Furthermore, they are on average ≈3 and ≈210 times faster to train than the OQML and sGDML models, respectively, across the datasets shown when using FP64 matrix multiplication, and ≈10 and ≈700 times faster when using FP32 with Kahan’s summation. On the A100 GPU, trained models can be obtained in seconds. Note that this work focuses on models that fit into GPU global memory, i.e., both the **Z**^{T}**Z** matrix and the normal equations are constructed and solved on the GPU.
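A minimal NumPy sketch of the compensated (Kahan) accumulation follows; the function name is hypothetical, and the library's actual implementation is in CUDA C. The idea is that a running compensation matrix captures the low-order bits lost each time a block's partial Gram matrix is added to the FP32 accumulator.

```python
import numpy as np

def kahan_accumulate_gram(Z, block=256):
    """Accumulate Z^T Z in FP32 with Kahan (compensated) summation.

    Z is processed in row-blocks; c holds the low-order bits lost when
    each block's partial Gram matrix is added to the accumulator s.
    """
    d = Z.shape[1]
    s = np.zeros((d, d), dtype=np.float32)  # FP32 accumulator
    c = np.zeros((d, d), dtype=np.float32)  # compensation term
    for start in range(0, Z.shape[0], block):
        zb = Z[start:start + block].astype(np.float32)
        y = zb.T @ zb - c   # block contribution, corrected
        t = s + y           # add (low-order bits are lost here)
        c = (t - s) - y     # recover what was lost
        s = t
    return s

# The compensated FP32 result closely tracks the FP64 reference:
rng = np.random.default_rng(0)
Z = rng.standard_normal((20000, 8))
ref = Z.T @ Z                       # FP64 reference
gram32 = kahan_accumulate_gram(Z)   # compensated FP32
```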
However, QML-Lightning also supports an out-of-core approach, in which the **Z**^{T}**Z** matrix is tiled, with each tile copied to host memory and summed there; the normal equations are then solved on the CPU. This second variant is necessary when the number of features yields matrices too large to both construct the **Z**^{T}**Z** matrix and solve the normal equations in GPU memory. In these cases, the choice of CPU(s) and the efficiency of the out-of-core CPU/GPU implementation will largely determine the training time; however, this aspect has not been investigated in this work.
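The out-of-core scheme can be sketched similarly. In this minimal NumPy illustration, each partial Gram matrix stands in for a GPU-computed tile that is copied to host memory and summed there, after which the regularized normal equations are solved on the CPU; all names and the ridge value are illustrative, not the library API.

```python
import numpy as np

def out_of_core_gram(Z, batch=16):
    """Accumulate G = Z^T Z batch-by-batch; each partial Gram matrix
    stands in for a GPU-computed tile copied to host memory and summed."""
    n, d = Z.shape
    G = np.zeros((d, d))              # host-side accumulator
    for start in range(0, n, batch):
        Zb = Z[start:start + batch]   # one tile's worth of rows
        G += Zb.T @ Zb                # "copy to host and sum"
    return G

rng = np.random.default_rng(1)
Z = rng.standard_normal((64, 10))
y = rng.standard_normal(64)
G = out_of_core_gram(Z)
lam = 1e-8                            # illustrative ridge regularizer
alpha = np.linalg.solve(G + lam * np.eye(10), Z.T @ y)  # solved on the CPU
```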

Model training times (s):

| Molecule | OQML/FCHL19 (2 × E5-2680) | GP/FCHL19 (2 × E5-2680) | sGDML (E5-2640) | NequIP (V100) | SORF 16k-DP (RTX-3080) | SORF 16k-FP (RTX-3080) | SORF 32k-DP (A100) |
|---|---|---|---|---|---|---|---|
| Ethanol | 66 | 2 252 | 144 | ≈h | 24 | 12 | 4 |
| Salicylic acid | 249 | 6 836 | 282 | ≈h | 46 | 23 | 7 |
| Aspirin | 527 | 101 451 | 570 | ≈h | 84 | 44 | 10 |
| Malonaldehyde | 51 | 1 926 | 150 | ≈h | 30 | 15 | 6 |
| Toluene | 271 | 7 976 | 216 | ≈h | 30 | 15 | 8 |
| Naphthalene | 455 | 11 782 | 348 | ≈h | 60 | 30 | 7 |
| Uracil | 87 | 2 576 | 120 | ≈h | 24 | 12 | 6 |


In Fig. 4, we show the force accuracy of the SORF model with FCHL19 as a function of the number of features, together with the corresponding wall time to complete 1k MD steps for azobenzene using the OpenMM QML-Lightning interface. For additional context, we compare against Performant ACE^{56} (PACE), an MPI-parallelized and optimized version of ACE.^{45} For the SORF model, an RTX 3080 was used, whereas results for PACE were provided by the first author of the ACE paper, for whom 8 MPI tasks were found to achieve the best performance on a 16-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30 GHz processor. The reader should be cautioned that directly comparing GPU and CPU codes poses a number of challenges, and constructing a perfectly fair benchmark is difficult. PACE yields significantly lower wall times for smaller models, being approximately an order of magnitude faster than the SORF model when PACE uses 4k basis functions and SORF uses 4k features. Small molecules represent worst-case systems for GPU codes, since a number of streaming multiprocessors are idle, and those that are active perform relatively little work. However, it is encouraging that even for small systems the SORF model achieves a similar error and wall time as PACE at 16k features. Doubling the number of features to 32k appreciably improves the error while increasing the wall time by only 1 s; however, the error begins to saturate here, and we do not expect a significant improvement at 65k features. Conversely, the linearization achieved in PACE is over an increasing number of higher-order many-body correlations; therefore, the model’s accuracy improves as the number of basis functions increases, albeit at a larger computational cost.
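Although the SORF transform in QML-Lightning is implemented in CUDA C, the underlying construction can be sketched compactly: the dense random projection of random Fourier features is replaced by products of diagonal Rademacher sign matrices and fast Walsh-Hadamard transforms, reducing the per-stack cost from O(*d*²) to O(*d* log *d*). The NumPy sketch below follows the common SORF formulation of Yu et al.; normalizations and names are illustrative and may differ in detail from the library's implementation.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized, Sylvester order),
    computing H @ x in O(d log d) without forming H."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def sorf_features(x, signs, sigma=1.0):
    """One SORF stack: approximate W x with (sqrt(d)/sigma) H D3 H D2 H D1 x,
    using three FWHTs in place of a dense d x d multiply.
    `signs` is a (3, d) array of random +/-1 (Rademacher) entries."""
    d = len(x)
    v = x
    for D in signs:
        v = fwht(D * v) / np.sqrt(d)  # normalized Hadamard per stage
    v *= np.sqrt(d) / sigma
    return np.sqrt(2.0 / d) * np.cos(v)  # random Fourier-style features

rng = np.random.default_rng(2)
d = 8                                    # must be a power of two
x = rng.standard_normal(d)
signs = rng.choice([-1.0, 1.0], size=(3, d))
z = sorf_features(x, signs)
```

Stacking several independent sign triplets yields *N*_{f} > *d* features while keeping the cost linear in the number of stacks.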

Table V lists the total wall time in seconds for completing 1000 MD steps for a variety of systems using the OpenMM QML-Lightning interface. Additionally, per atom per MD step timings and the average and maximum number of neighbors per atom are provided. For representative small-molecule timings, azobenzene is listed. For larger nonperiodic systems, a brass cluster^{57,58} and a vitamin D receptor (5OWD)^{59} with interstitial water molecules are included. For larger periodic systems, two metal-organic frameworks (MOFs) obtained from the Materials Project^{60} and liquid water are included. For the two monoclinic MOF systems, a 3 × 3 × 3 supercell was used. For the Zn_{4}H_{44}C_{51}N_{8}O_{18} system (qmof-0b5e989),^{61,62} the lattice constants and angles are [*a*, *b*, *c*] = [16.02, 21.41, 29.93] Å and [*α*, *β*, *γ*] = [90.0, 90.0, 97.8]°, respectively. For the Nd_{2}C_{36}H_{42}N_{12}O_{18}S_{3} system (qmof-bf74329),^{61,62} the lattice constants and angles are [*a*, *b*, *c*] = [12.14, 16.95, 25.66] Å and [*α*, *β*, *γ*] = [102.2, 90.0, 90.0]°, respectively. For liquid water, a 2 × 2 × 2 supercell of a cubic unit cell with lattice constant *a* = 25.0 Å was used, generated using OpenMMTools.^{63}

We note that since our optimization strategy parallelizes the computation over atoms, with each atom assigned to a different thread block in round-robin fashion, smaller systems yield an underutilized GPU, as each atom is mapped to a different streaming multiprocessor (SM), of which there are 68 on an RTX 3080. Furthermore, the small number of neighbors yields a limited amount of thread-parallel work when constructing FCHL19 within each SM. For the brass cluster, the timings per atom per step are significantly improved, since the SMs are fully utilized. However, only approximately four atoms are computed per SM; therefore, there is still limited work, which yields comparatively poor timings with respect to the larger systems.
For large systems, near-microsecond per atom per MD step timings are observed; for example, forces are evaluated in 5.39 *µ*s/(atom ⋅ step) for the Zn_{4}H_{44}C_{51}N_{8}O_{18} system and 8.56 *µ*s/(atom ⋅ step) for the Nd_{2}C_{36}H_{42}N_{12}O_{18}S_{3} system. We expect that sub-microsecond timings could be obtained using low-precision computation.
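The ms/(atom ⋅ MD step) column of Table V follows directly from the wall times; for example, for the two MOF systems:

```python
def per_atom_step_ms(wall_time_s, n_atoms, n_steps=1000):
    """Convert total wall time for n_steps MD steps into ms/(atom * step)."""
    return wall_time_s * 1000.0 / (n_atoms * n_steps)

zn_mof = per_atom_step_ms(72.8, 13500)    # Zn_4H_44C_51N_8O_18 entry
nd_mof = per_atom_step_ms(104.5, 12204)   # Nd_2C_36H_42N_12O_18S_3 entry
```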

| System | *N*_{atoms} | Periodic | Wall time (s) | ms/(atom ⋅ MD step) | *N*_{neighbors} avg (max) |
|---|---|---|---|---|---|
| Azobenzene | 24 | No | 3.0 | 1.25 × 10^{−1} | 16 (23) |
| Brass cluster | 245 | No | 4.5 | 1.8 × 10^{−2} | 46 (78) |
| Vitamin D receptor | 4 369 | No | 45.7 | 1.0 × 10^{−2} | 78 (116) |
| Zn_{4}H_{44}C_{51}N_{8}O_{18} | 13 500 | Yes | 72.8 | 5.39 × 10^{−3} | 49 (74) |
| Nd_{2}C_{36}H_{42}N_{12}O_{18}S_{3} | 12 204 | Yes | 104.5 | 8.56 × 10^{−3} | 85 (124) |
| Water (l) | 12 000 | Yes | 116.4 | 9.7 × 10^{−3} | 90 (103) |


For a component breakdown of the computational cost, Fig. 5 shows the approximate on-device time for key components in computing energies and forces for azobenzene (*N*_{atoms} = 24), a brass cluster (*N*_{atoms} = 245), a vitamin D receptor (*N*_{atoms} = 4369), and the periodic MOF system Nd_{2}C_{36}H_{42}N_{12}O_{18}S_{3} (*N*_{atoms} = 12 204) using *N*_{f} = 16 384. Profiling PyTorch GPU code adds overhead to each node of the autograd graph; therefore, we display the results as percentages of the total device time. For azobenzene, the forward and backward passes of FCHL19 and SORF require only 28% of the total device time, while indexing operations, initialization, and intermediary operations require ∼70%. We note that many of these operations could be optimized away by combining them into single kernel calls. In particular, it is likely that the indexing costs, which are mostly associated with sub-selecting the atomic representations for projection with the relevant matrices $P_e^{\mathrm{PCA}}$, could be removed by using customized CUDA C functions to generate these indices for use directly in the FCHL19 and SORF modules. For the larger brass cluster, the forward and backward passes of FCHL19 and SORF occupy 47% of the total device time; however, indexing and other costs still occupy a large percentage. Here, the percentage cost of the forward and backward passes of SORF is reduced from 23% (azobenzene) to 18%. For the vitamin D receptor, the forward and backward passes of FCHL19 begin to dominate, accounting for 24.5% and 45.7%, respectively, while the cost of the SORF transforms is reduced to 10% of the total device time. Finally, for the periodic MOF system, the forward and backward passes of FCHL19 occupy 80.5% of the total cost, while the SORF transform requires only 7.6% of the total device time.
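The indexing pattern in question, gathering the atoms of each element and projecting their representations with the corresponding $P_e^{\mathrm{PCA}}$ matrix, can be sketched in NumPy as follows; function and variable names are illustrative and not the QML-Lightning API.

```python
import numpy as np

def project_by_element(reps, elements, projections):
    """Sub-select atomic representations by element and apply each
    element's projection matrix; `projections[e]` plays the role of
    the per-element P_e^PCA matrix discussed above."""
    out = {}
    for e, P in projections.items():
        idx = np.flatnonzero(elements == e)  # gather: atoms of element e
        out[e] = (idx, reps[idx] @ P)        # project their representations
    return out

rng = np.random.default_rng(3)
reps = rng.standard_normal((6, 4))           # 6 atoms, representation size 4
elements = np.array([1, 6, 6, 1, 8, 6])      # H, C, C, H, O, C
projections = {e: rng.standard_normal((4, 2)) for e in (1, 6, 8)}
proj = project_by_element(reps, elements, projections)
```

It is this gather/scatter bookkeeping, cheap in isolation but repeated every step, that a fused CUDA C kernel could absorb.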

## V. CONCLUSION

In this paper, we have introduced a PyTorch-based library, termed QML-Lightning, which contains approximate kernel models and an efficient representation designed for learning quantum mechanical properties. We have provided a low-cost, PyTorch-wrapped CUDA C implementation of structured orthogonal random features, a variant of the well-known random Fourier features, as well as a computationally efficient implementation of FCHL19, an accurate atom-centered representation.

The combination of structured orthogonal random features and FCHL19 has been benchmarked against existing datasets, yielding not only similar or better accuracy than explicit kernel models with FCHL19 but also competitive accuracy with contemporary models, at significantly reduced training time and with performant prediction times.

## ACKNOWLEDGMENTS

This research was supported by the NCCR MARVEL, a National Center of Competence in Research, funded by the Swiss National Science Foundation (Grant No. 182892). Calculations were performed at sciCORE (http://scicore.unibas.ch/) scientific computing center at University of Basel.

We would also like to thank David Kovacs, Ilyes Batatia, and Gábor Csányi for fruitful discussions and David Kovacs for providing PACE timings.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Nicholas J. Browning**: Conceptualization (lead); Data curation (lead); Formal analysis (equal); Investigation (lead); Methodology (lead); Software (lead); Visualization (equal); Writing – original draft (lead); Writing – review & editing (equal). **Felix A. Faber**: Formal analysis (equal); Validation (lead); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **O. Anatole von Lilienfeld**: Funding acquisition (lead); Project administration (lead); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available within the article.

## Nomenclature

- $C$: set of atoms belonging to a compound
- *d*: size of the atomic representation
- **F**: matrix containing force components
- *G*: atom density function
- **H**: Walsh-Hadamard matrix
- **I**: identity matrix
- *I*, *J*: indices denoting atoms *I* and *J*, respectively
- **K**: kernel matrix
- *k*(⋅, ⋅): kernel function
- *L*: loss function
- $P_e^{\mathrm{PCA}}$: atomic projection matrix for element *e*
- *N*_{f}: number of features for the RFF approximation
- *N*_{s}: number of sparse points in the sparse GP basis
- *N*_{t}: number of training instances
- **U**: matrix containing total formation energies
- *U*: total formation energy of a compound
- *w*: angular frequency of a Fourier transform
- **Z**: matrix containing approximate feature mappings
- *z*(⋅): RFF approximation of the feature mapping
- *α*: regression weights
- *ρ*: atomic representations
- *ϕ*(⋅): feature mapping of a kernel function
- *ϵ*: atomic energy contributions