We introduce Quantum Machine Learning (QML)-Lightning, a PyTorch package containing graphics processing unit (GPU)-accelerated approximate kernel models, which can yield trained models within seconds. QML-Lightning also includes a cost-efficient GPU implementation of FCHL19; together, these provide energy and force predictions with competitive accuracy on a microsecond-per-atom timescale. Using modern GPU hardware, we report learning curves of energies and forces, as well as timings, as numerical evidence for select legacy benchmarks from atomistic simulation, including QM9, MD-17, and 3BPA.

Data-driven approximate machine learning (ML) methods have become increasingly prominent in theoretical chemistry in recent years.1–11 In particular, supervised learning can be used to augment accurate but computationally prohibitive electronic structure calculations. For tasks such as ab initio molecular dynamics (AIMD), approximate variants of the electronic Schrödinger equation are solved for every coordinate update. Consequently, surrogate ML models that can partially substitute for quantum calculations are extremely beneficial for reducing the overall computational burden (and carbon footprint). In general, these ML models first transform atomic coordinates into an intermediate symmetry-preserving representation, which is then passed on to a nonlinear machine learning model, usually based on neural networks (NNs)12 or Gaussian process (GP) methods.13

While considerable effort has gone into making NNs scale performantly on graphics processing unit (GPU) architectures through GPU-accelerated packages such as PyTorch14 or TensorFlow,15 much less attention has been paid to providing similar acceleration and tooling for GP methods, as vanilla variants of the latter tend to scale unfavorably on GPU devices.

However, there are GP models with improved time complexity that rely on sparsification, whereby a subset of the training data is used as a basis while still training on the full dataset.16,17 In particular, for a fixed sparse basis the training-step time complexity becomes linear with respect to the amount of training data, and the inference time of these models is favorable. This strategy has enabled sparse GP models to be trained in a “Big Data” context while retaining sufficient accuracy with respect to the vanilla GP.

Alternatively, the mapping provided by the kernel function central to GP models can be approximated to yield the desired linearization. In this work, we briefly discuss the seminal work on random Fourier features (RFF),18 in which the kernel mapping is approximated via a low-dimensional feature function. We also summarize a significantly more computationally efficient variant, termed structured orthogonal random features (SORF).19

We provide a software package to perform training and prediction of Quantum Machine Learning (QML) models, termed QML-Lightning. It includes GPU implementations of sparse GP, RFF, and SORF models as well as a GPU implementation of FCHL19,20 an accurate atom-centered representation. QML-Lightning is built upon the PyTorch software package, with additional CUDA C implementations for critical components to improve its computational throughput. A thorough benchmark of the predictive accuracy of QML-Lightning has been performed on several established datasets of chemical compounds from the literature, comparing against existing kernel- and neural-network-based models. To assess the models’ performance across chemical compound space, we have benchmarked the model against the QM9 dataset,21 which contains 134k small organic molecules with the elements C, H, O, N, and F. For dynamics and structural relaxation applications, we have benchmarked the model against both the MD1722–24 and revised MD17 (rMD17)25 datasets, which contain trajectories of 10 small organic molecules. To infer the models’ extrapolative performance, QML-Lightning has been benchmarked against the challenging 3BPA dataset,26 which contains three sets of MD trajectories at temperatures of 300, 600, and 1200 K. We note that we have focused on energetic properties in this work as these are the most critical for AIMD applications; however, other QM properties can also be used within the QML-Lightning framework straightforwardly. Finally, we detail the wall time to perform 1k MD steps for a variety of systems, including periodic systems with up to ∼14k atoms, using an OpenMM27 plugin that is provided as part of this work.

Note that while finalizing work on this paper, Dhaliwal et al.28 published RFF-based interatomic potentials for performing molecular dynamics, with promising results for central processing unit (CPU) based applications. Here, however, we focus on a GPU implementation of the more efficient and accurate SORF model.

The QML-Lightning software is provided under an MIT license at https://github.com/nickjbrowning/qml-lightning. We additionally provide an OpenMM wrapper for QML-Lightning under the same license at https://github.com/nickjbrowning/openmm-qml-lightning.

This section summarizes a subset of kernel methods used to learn quantum mechanical properties. First, Gaussian process regression (GP)29,30 is introduced to learn both energies and forces of chemical compounds. Then, sparse GP17,30 is discussed, and finally the approximate kernel methods random Fourier features (RFF)18 and structured orthogonal random features (SORF)19 are introduced.

In the following, upper-case indices denote the index of atomic centers in each chemical compound, while lower-case indices denote the index of chemical compounds in a dataset.

In GP kernel models of quantum mechanical properties, one constructs a linear model using a basis of kernel functions k(ρJ, ⋅). For example, to learn potential energies, a suitable functional form would be

$$U^{\mathrm{pred}}\left(\{z_J, \mathbf{r}_J\}_{J \in C_j}\right) = \sum_i^{N_t} \alpha_i K_{ij}^{\mathrm{GP}}, \qquad K_{ij}^{\mathrm{GP}} = \sum_{I \in C_i} \sum_{J \in C_j} k(\rho_I, \rho_J),$$

where ρI and ρJ are the atomic representations of atoms I and J, respectively, and Nt is the number of training molecules. We note that these atomic representations are functions of the set of local atomic charges and coordinates {z, r}, a dependence which has been omitted for brevity. The sets Ci and Cj contain the atoms of the training and query molecules i and j, respectively. The coefficients αi are obtained from the regularized minimization problem given by

$$\mathcal{L} = \sum_i^{N_t} \left(U_i^{\mathrm{ref}} - U_i^{\mathrm{pred}}\right)^2 + \lambda \sum_i^{N_t} \sum_j^{N_t} \alpha_i \alpha_j K_{ij}^{\mathrm{GP}},$$

which has the following solution in matrix form:

$$\boldsymbol{\alpha} = \left(\mathbf{K}^{\mathrm{GP}} + \lambda \mathbf{I}\right)^{-1} \mathbf{U}^{\mathrm{ref}}.$$

The hyperparameter λ is a small number that regularizes the problem and ensures numerical stability upon kernel inversion.31 For learning potential energies and atomic forces simultaneously, one can construct an expression for the potential energy as follows:

$$\begin{bmatrix} \mathbf{U} \\ \mathbf{F} \end{bmatrix} = \begin{bmatrix} \mathbf{K}^{\mathrm{GP}} & \partial_{\mathbf{r}} \mathbf{K}^{\mathrm{GP}} \\ \partial_{\mathbf{r}^*} \mathbf{K}^{\mathrm{GP}} & \partial^2_{\mathbf{r}\mathbf{r}^*} \mathbf{K}^{\mathrm{GP}} \end{bmatrix} \boldsymbol{\alpha}, \tag{1}$$

where matrix notation is introduced for simplicity and ∂r is used as a shorthand to stack the following derivatives:

$$\frac{\partial K_{ij}^{\mathrm{GP}}}{\partial r_J^l} = \sum_{I \in C_i} \sum_{J \in C_j} \frac{\partial k(\rho_I, \rho_J)}{\partial r_J^l},$$

where l indexes the coordinate components of the query atom J. The Hessian in Eq. (1) has the following form:

$$\frac{\partial^2 K_{ij}^{\mathrm{GP}}}{\partial r_I^k \partial r_J^l} = \sum_{I \in C_i} \sum_{J \in C_j} \frac{\partial^2 k(\rho_I, \rho_J)}{\partial r_I^k \partial r_J^l},$$

where k indexes the coordinate components of the training atom I. The dimension of the full GP kernel is (3MNt + Nt) × (3MNt + Nt), where M is the average number of atoms per molecule in the entire training set. In particular, the Hessian term in Eq. (1) has a compute time scaling as O(36Nt²M⁴), which will severely limit applicability with respect to both large training set sizes and large systems.
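To make the expressions above concrete, a minimal PyTorch sketch of the energy-only full-GP model is given below: a Gaussian kernel between atomic representations is summed over atom pairs to form K^GP, and the regularized linear system is solved for α. The helper names, sizes, and random data are illustrative placeholders and not part of the QML-Lightning API.

```python
import torch

def gaussian_kernel(rho_i, rho_j, sigma):
    # rho_i: (n_i, d), rho_j: (n_j, d) atomic representations of two molecules
    d2 = torch.cdist(rho_i, rho_j).pow(2)        # pairwise squared distances between atoms
    return torch.exp(-d2 / (2.0 * sigma ** 2))   # (n_i, n_j) block of atomic kernels

def molecular_kernel(reps, sigma):
    # K_ij = sum_{I in C_i} sum_{J in C_j} k(rho_I, rho_J); reps is a list of (n_atoms, d) tensors
    n = len(reps)
    K = torch.zeros(n, n, dtype=torch.float64)
    for i in range(n):
        for j in range(n):
            K[i, j] = gaussian_kernel(reps[i], reps[j], sigma).sum()
    return K

# toy data: five "molecules" with random atomic representations (placeholders)
torch.manual_seed(0)
reps = [torch.randn(4, 8, dtype=torch.float64) for _ in range(5)]
U_ref = torch.randn(5, dtype=torch.float64)

lam = 1e-8
K = molecular_kernel(reps, sigma=4.0)
alpha = torch.linalg.solve(K + lam * torch.eye(5, dtype=torch.float64), U_ref)

# the energy of a query molecule q is then sum_i alpha_i * K_iq, with K_iq built the same way
```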

To alleviate the complexity of the above, which will be referred to as “full GP” in the following, Bartók et al.16,29 introduced sparse GPs for modeling molecular systems. For a thorough review of GP and sparsification, please see Ref. 30. In sparse GP, rather than using all available training points, one selects a subset of representations as the basis. For example, a subset of the kernels centered on atoms can be used as the sparse basis such that the expression for the potential energy is as follows:

$$U^{\mathrm{pred}}\left(\{z_J, \mathbf{r}_J\}_{J \in C_j}\right) = \sum_I^{N_s} \alpha_I K_{Ij}^{\mathrm{sGP}}, \qquad K_{Ij}^{\mathrm{sGP}} = \sum_{J \in C_j} k(\rho_I, \rho_J),$$

where Ns is the number of sparse points selected. The sparse points I are typically selected from the training database using uniform random sampling, CUR matrix decomposition,32 or farthest point sampling (FPS).17 The key difference from the full GP presented above is that one coefficient αI is used per sparse point, rather than one coefficient per training molecule. For potential energy models, this formalism reduces the cost of building the kernel and of predictions to O(Ns²NtMd) and O(NsMd), respectively, where M in this context is the (average) number of atoms in the (training) query molecule and d is the size of the representation. The model coefficients can be obtained by minimizing the following loss function in matrix form:

$$\mathcal{L}(\boldsymbol{\alpha}) = \left\lVert \mathbf{U}^{\mathrm{ref}} - \mathbf{U}^{\mathrm{pred}} \right\rVert^2 + \lambda \left\lVert \boldsymbol{\alpha} \right\rVert^2_{\mathbf{K}^{\mathrm{sGP}}},$$

which yields the solution

$$\boldsymbol{\alpha} = \left( \left(\mathbf{K}^{\mathrm{sGP}}\right)^T \mathbf{K}^{\mathrm{sGP}} + \lambda \mathbf{I} \right)^{-1} \left(\mathbf{K}^{\mathrm{sGP}}\right)^T \mathbf{U}^{\mathrm{ref}}.$$

Additionally, atomic forces can be included in the training scheme resulting in the following model:

$$\begin{bmatrix} \mathbf{U} \\ \mathbf{F} \end{bmatrix} = \begin{bmatrix} \mathbf{K}^{\mathrm{sGP}} \\ \partial_{\mathbf{r}^*} \mathbf{K}^{\mathrm{sGP}} \end{bmatrix} \boldsymbol{\alpha},$$

where the matrix ∂r*K ∈ ℝ^(3NtM×Ns) stacks the 3NtM partial Cartesian derivatives against all sparse points. In the following, we have dropped the sGP label for simplicity, and we also introduce the notation ∂K as a synonym for the matrix ∂r*K. This yields the loss function,

$$\mathcal{L}(\boldsymbol{\alpha}) = \left\lVert [\mathbf{U}, \mathbf{F}]^{\mathrm{ref}} - [\mathbf{U}, \mathbf{F}]^{\mathrm{pred}} \right\rVert^2 + \lambda \left\lVert \boldsymbol{\alpha} \right\rVert^2_{[\mathbf{K},\, \partial\mathbf{K}]},$$

where the notation [U, F] is used to indicate the concatenation of the matrices U and F. The coefficients are then obtained from the following equation:

$$\boldsymbol{\alpha} = \left( [\mathbf{K}, \partial\mathbf{K}]^T [\mathbf{K}, \partial\mathbf{K}] + \lambda \mathbf{I} \right)^{-1} [\mathbf{K}, \partial\mathbf{K}]^T [\mathbf{U}, \mathbf{F}]^{\mathrm{ref}}.$$

We stress that including derivatives in the above formulation does not increase the size of the basis in which the total energy is expanded, so these derivatives only appear in the loss function. To compute these model coefficients, the dominant term is now the formation of the [K, ∂K]ᵀ[K, ∂K] matrix product, which scales as O(3NtMNs²).
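Below is a minimal PyTorch sketch of the force-including sparse-GP normal equations, assuming the kernel block K and its stacked Cartesian derivatives ∂K have already been formed (in practice these require the representation derivatives); shapes and names are hypothetical and do not reflect the package interface.

```python
import torch

def solve_sparse_gp(K, dK, U_ref, F_ref, lam=1e-8):
    """Solve ([K, dK]^T [K, dK] + lam*I) alpha = [K, dK]^T [U, F].

    K     : (Nt, Ns)      energy kernel between training molecules and sparse points
    dK    : (3*Nt*M, Ns)  stacked Cartesian derivatives of K w.r.t. atomic positions
    U_ref : (Nt,)         reference energies
    F_ref : (3*Nt*M,)     reference force components (flattened)
    """
    KdK = torch.cat([K, dK], dim=0)                        # [K, dK]
    target = torch.cat([U_ref, F_ref], dim=0)              # [U, F]
    A = KdK.T @ KdK + lam * torch.eye(K.shape[1], dtype=K.dtype)
    b = KdK.T @ target
    return torch.linalg.solve(A, b)                        # alpha, shape (Ns,)

# toy shapes (hypothetical): 10 molecules, 6 atoms each, 50 sparse points
Nt, M, Ns = 10, 6, 50
K = torch.randn(Nt, Ns, dtype=torch.float64)
dK = torch.randn(3 * Nt * M, Ns, dtype=torch.float64)
alpha = solve_sparse_gp(K, dK,
                        torch.randn(Nt, dtype=torch.float64),
                        torch.randn(3 * Nt * M, dtype=torch.float64))
```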

As an alternative to sparsifying over the training data to reduce the computational complexity of training and inference, it is also possible to approximate the mapping provided by the kernel function such that linear models result. In the seminal work of Rahimi and Recht,18 a lower-dimensional lifting function was introduced to approximate the inner product that defines the kernel,

$$k(\rho_I, \rho_J) = \left\langle \phi(\rho_I), \phi(\rho_J) \right\rangle \approx z(\rho_I)^T z(\rho_J).$$

As a consequence of Bochner’s theorem, the Fourier transform of a shift-invariant kernel k(ρI, ρJ) = k(ρIρJ) is a proper probability distribution. Consequently, one can readily define an explicit feature map that approximates the kernel via Monte Carlo integral estimation,

$$k(\rho_I - \rho_J) = \int_{\mathbb{R}^d} p(w)\, e^{i w^T(\rho_I - \rho_J)}\, dw \approx \frac{1}{N_f}\sum_{i=1}^{N_f} e^{i w_i^T(\rho_I - \rho_J)} = \frac{1}{\sqrt{N_f}}\begin{bmatrix} e^{i w_1^T\rho_I} \\ e^{i w_2^T\rho_I} \\ \vdots \\ e^{i w_{N_f}^T\rho_I}\end{bmatrix}^T \frac{1}{\sqrt{N_f}}\begin{bmatrix} e^{i w_1^T\rho_J} \\ e^{i w_2^T\rho_J} \\ \vdots \\ e^{i w_{N_f}^T\rho_J}\end{bmatrix}^* \equiv z(\rho_I)^T z(\rho_J)^*,$$

where Nf is the number of independent vectors w drawn from the probability distribution p(w) and i is the imaginary unit. For different kernels, the distribution p(w) takes different forms; however, for the Gaussian kernels used here, p(w) is also Gaussian. After applying Euler’s formula to z(ρ) and after simplification, the following low-dimensional feature map is obtained:

$$z(\rho_I) = \sqrt{\frac{2}{N_f}}\left[\cos\!\left(w_1^T\rho_I + b_1\right), \ldots, \cos\!\left(w_{N_f}^T\rho_I + b_{N_f}\right)\right]^T,$$

where b is sampled from a uniform distribution on [0, 2π]. Since the potential energies are extensive, one can partition them into atomic energy contributions ϵI, and the representation of the atomic environment is passed into this low-dimensional feature mapping,

$$U_i\left(\{z, \mathbf{r}\}\right) = \sum_{I\in C_i} \epsilon_I = \sum_{I\in C_i} \boldsymbol{\alpha}^T z(\rho_I) = \left[\mathbf{Z}\boldsymbol{\alpha}\right]_i,$$

where Z ∈ ℝ^(Nt×Nf) is the feature matrix corresponding to the Nt training observations, whose ith row is the sum of the atomic feature vectors of molecule i. The weights α are the solution to the following regularized normal equation:

$$\left(\mathbf{Z}^T\mathbf{Z} + \lambda\mathbf{I}\right)\boldsymbol{\alpha} = \mathbf{Z}^T\mathbf{U},$$

where the coefficients are obtained via an LU decomposition of ZᵀZ + λI. The time complexities of constructing the ZᵀZ matrix and of performing inference scale as O(Nf²NtMd) and O(NfMd), respectively. To include forces in the training scheme, the derivatives of the feature vectors, ∂zI^l/∂ri^k, are computed, where l and k are the feature and coordinate component indices, respectively, and stored in a derivative feature matrix ∂Z ∈ ℝ^(3NtM×Nf). The following regularized normal equation is then solved:

$$\left([\mathbf{Z}, \partial\mathbf{Z}]^T[\mathbf{Z}, \partial\mathbf{Z}] + \lambda\mathbf{I}\right)\boldsymbol{\alpha} = [\mathbf{Z}, \partial\mathbf{Z}]^T[\mathbf{U}, \mathbf{F}],$$

where the notation [Z, ∂Z] indicates the concatenation of the feature matrix Z with the derivative features ∂Z. The dominant term in constructing the normal equations is the [Z, ∂Z]ᵀ[Z, ∂Z] matrix product, which scales as O(3NtMNf²).
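To illustrate the RFF construction above, the following PyTorch sketch draws Gaussian frequencies w and uniform phases b, sums the atomic feature vectors z(ρI) into per-molecule rows of Z, and solves the energy-only normal equations; all sizes and the random "representations" are toy placeholders.

```python
import torch

def rff_features(rho, W, b, Nf):
    # rho: (n_atoms, d) atomic representations of one molecule
    # W:   (Nf, d) Gaussian random frequencies; b: (Nf,) uniform phases in [0, 2*pi]
    z_atoms = (2.0 / Nf) ** 0.5 * torch.cos(rho @ W.T + b)
    return z_atoms.sum(dim=0)   # atomic contributions sum to the molecular feature vector

d, Nf, sigma = 16, 512, 4.0
torch.manual_seed(0)
W = torch.randn(Nf, d) / sigma                 # p(w) is Gaussian for a Gaussian kernel
b = 2.0 * torch.pi * torch.rand(Nf)

# toy training set (hypothetical): 20 molecules with random atomic representations
mols = [torch.randn(5, d) for _ in range(20)]
U_ref = torch.randn(20)

Z = torch.stack([rff_features(rho, W, b, Nf) for rho in mols])   # (Nt, Nf)
lam = 1e-6
alpha = torch.linalg.solve(Z.T @ Z + lam * torch.eye(Nf), Z.T @ U_ref)

U_pred = Z @ alpha   # predicted energies of the training molecules
```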

The above formulation revolves around computing the affine transformation W[ρI, 1]ᵀ, which is applied to ρI prior to the cosine operation (z(ρI) = cos(W[ρI, 1]ᵀ)). Storing and computing this linear transformation has O(Nfd) space and time complexity. To reduce this, Yu et al.19 introduced structured orthogonal random features (SORF). In this method, the matrix W is replaced by a special structured matrix consisting of products of random binary diagonal matrices and Walsh–Hadamard matrices. The resulting linear transformation has O(Nf log d) time complexity and O(d) or O(1) space complexity, depending on its implementation. Briefly, in the case of Gaussian kernel approximation, one can replace the transformation

$$\mathbf{W}_{\mathrm{RFF}} = \frac{1}{\sigma}\mathbf{G},$$

where G ∈ ℝ^(Nf×d) is a random Gaussian matrix, with the following transformation:

$$\mathbf{W}_{\mathrm{ORF}} = \frac{1}{\sigma}\mathbf{S}\mathbf{Q},$$

where Q is a uniformly distributed random orthogonal matrix (e.g., obtained via a QR decomposition of G) and S is a diagonal matrix with entries sampled i.i.d. from the χ-distribution with d degrees of freedom. The features built from SQ yield an unbiased estimator of the Gaussian kernel with low variance.19 While this construction still has O(Nfd) time complexity, as well as the additional cost of computing the QR decomposition in a preprocessing step, one can further approximate this transformation as

$$\mathbf{W}_{\mathrm{SORF}} = \frac{\sqrt{d}}{\sigma}\,\mathbf{Q} \approx \frac{\sqrt{d}}{\sigma}\begin{bmatrix}\prod_{i}^{N_{\mathrm{transform}}}\mathbf{H}\mathbf{D}_{i,1}\\ \vdots \\ \prod_{i}^{N_{\mathrm{transform}}}\mathbf{H}\mathbf{D}_{i,N_f/d}\end{bmatrix},$$

where S has first been replaced by the scalar √d and the random orthogonal matrix Q has been replaced by a special type of structured matrix. The matrices Di ∈ ℝ^(d×d) are diagonal sign-flipping matrices, where each entry is sampled from a Rademacher distribution, and H is the Walsh–Hadamard matrix,

$$\mathbf{H}_{2^n} = \begin{cases}\begin{bmatrix}\mathbf{H}_{2^{n-1}} & \mathbf{H}_{2^{n-1}}\\ \mathbf{H}_{2^{n-1}} & -\mathbf{H}_{2^{n-1}}\end{bmatrix}, & n > 0,\\ 1, & n = 0.\end{cases}$$

Note that when the number of features Nf > d, the operation is simply repeated Nf/d times, with the resulting vectors concatenated into a length-Nf vector. Crucially, the product WSORFρI now has time complexity O(NtransformNf log d), since the multiplication with H can be efficiently implemented via the fast Hadamard transform using in-place operations in O(d log d) time. Finally, since the Walsh–Hadamard matrix is only defined in ℝ^(2ⁿ×2ⁿ), ρI must also be projected into 2ⁿ dimensions. This is achieved via an SVD decomposition on a subset of the atomic environment representations for each element e, concatenated into the matrix Z̃e given by

$$\tilde{\mathbf{Z}}_e = \mathbf{P}_e\mathbf{S}_e\mathbf{V}_e^T,$$

and the atomic representations are projected into a lower dimension via the following matrix-vector product,

$$\rho_I^{\mathrm{proj}} = \rho_I^T\,\mathbf{P}_e^{N_{\mathrm{PCA}}},$$

where only the first NPCA columns from the matrix Pe are used. The subscript e indicates that the matrix Z̃e is built using only atomic representations of atom type e; hence, each element has its own projection matrix PeNPCA. Here, we have found NPCA = 128 or 256 to be sufficient. In the rest of the text, d refers to the dimension of the atomic representation after projection.
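The following PyTorch sketch illustrates the SORF transform described above: Rademacher sign flips followed by a fast Walsh–Hadamard transform, stacked Nf/d times and scaled by √d/σ, with the cosine feature map applied afterwards. It is a plain reference implementation with hypothetical sizes, not the shared-memory CUDA kernel used by QML-Lightning; the random phase b follows the RFF form given earlier.

```python
import torch

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along the last dimension (length 2^n)."""
    d = x.shape[-1]
    out = x.clone()
    h = 1
    while h < d:
        out = out.view(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], d)
        h *= 2
    return out

def sorf_transform(rho, D_signs, sigma):
    """Apply W_SORF ~ (sqrt(d)/sigma) * prod_i (H D_i) to projected representations.

    rho     : (n_atoms, d) projected atomic representations, d a power of two
    D_signs : (n_stacks, n_transform, d) tensor of +/-1 entries (Rademacher samples)
    Returns (n_atoms, n_stacks * d) = (n_atoms, Nf) values before the cosine map.
    """
    d = rho.shape[-1]
    stacks = []
    for Ds in D_signs:                        # each stack contributes d of the Nf features
        v = rho
        for D in Ds:                          # N_transform sign-flip + Hadamard rounds
            v = fwht(D * v) / d ** 0.5        # normalized Hadamard keeps the scale bounded
        stacks.append((d ** 0.5 / sigma) * v)
    return torch.cat(stacks, dim=-1)

# toy usage (hypothetical sizes): d = 128 projected features, Nf = 512
torch.manual_seed(0)
d, n_stacks, n_transform, sigma = 128, 4, 1, 8.0
rho = torch.randn(10, d)
D_signs = torch.randint(0, 2, (n_stacks, n_transform, d)).float() * 2 - 1
Wx = sorf_transform(rho, D_signs, sigma)                     # (10, 512)
b = 2 * torch.pi * torch.rand(Wx.shape[-1])
z = (2.0 / Wx.shape[-1]) ** 0.5 * torch.cos(Wx + b)          # per-atom SORF random features
```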

We note that there are a number of other approximate kernel methods that aim to reduce computational complexity, including other RFF-type approximations33–35 as well as those based on the Nyström method,36 which relies on low-rank structure in the kernel matrix. Here, however, we have opted to use SORF due to its simplicity, computational efficiency, and accuracy.35 

Finally, for RFF and SORF models there are two user-defined hyperparameters: σ and λ. In all results presented in this work, the kernel width parameter (σ) and regularization parameter (λ) are selected from a grid using fivefold cross-validation; however, we note that suitable σ and λ values can be found within the ranges of 2–12 and 10⁻⁵–10⁻⁸, respectively.
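As an illustration of this selection procedure, a minimal sketch of a fivefold cross-validated grid search is given below, using a toy Gaussian kernel ridge model as a stand-in for the actual SORF pipeline; the data, grid values, and helper functions are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

def ridge_predict(X_tr, y_tr, X_te, sigma, lam):
    """Toy stand-in for a kernel model: Gaussian kernel ridge regression."""
    K = np.exp(-((X_tr[:, None] - X_tr[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
    K_te = np.exp(-((X_te[:, None] - X_tr[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    return K_te @ alpha

def cv_mae(X, y, sigma, lam, n_splits=5):
    # mean absolute error averaged over the k folds for one (sigma, lambda) pair
    errs = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        errs.append(np.mean(np.abs(ridge_predict(X[tr], y[tr], X[te], sigma, lam) - y[te])))
    return np.mean(errs)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 8)), rng.normal(size=100)
grid = [(s, l) for s in (2.0, 4.0, 8.0, 12.0) for l in (1e-5, 1e-6, 1e-7, 1e-8)]
best_sigma, best_lam = min(grid, key=lambda p: cv_mae(X, y, *p))
```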

In this work, we use FCHL1920 as the permutationally and rotationally invariant atomic environment featurization layer. FCHL19 is an atom-centered representation consisting of two- and three-body elemental bins, similar in construction to the atom-centered symmetry functions (ACSFs) used in the work of Behler.12,37 The functional form is briefly summarized here. For every unique combination of two elements X,Y, the representation for each atom I is constructed as follows:

$$G_I\left(\{Z_J, R_J\}_{X,Y}\right) = \left[G_X^{\text{2-body}},\ G_Y^{\text{2-body}},\ G_{X,Y}^{\text{3-body}}\right],$$

where {Z, R}X,Y refers to the set of atomic charges and coordinates of atoms that are of either element X or Y. The two-body function is given by

$$G^{\text{2-body}} = f_{\mathrm{cut}}(r_{IJ})\,\frac{1}{r_{IJ}^{N_2}}\,\frac{1}{R_s\,\sigma(r_{IJ})\sqrt{2\pi}}\exp\!\left(-\frac{\left(\ln R_s - \mu(r_{IJ})\right)^2}{2\sigma(r_{IJ})^2}\right),$$

where Rs are the n2 radial grid centers linearly distributed between 0 and rcut and μ(rIJ) and σ(rIJ) are the parameters of the log normal distribution,

$$\mu(r_{IJ}) = \ln\!\left(\frac{r_{IJ}}{\sqrt{1 + \eta_2/r_{IJ}^2}}\right), \qquad \sigma(r_{IJ})^2 = \ln\!\left(1 + \frac{\eta_2}{r_{IJ}^2}\right),$$

where η2 is a hyperparameter. The cutoff function to smoothly decay the representation to zero at rcut is defined as

$$f_{\mathrm{cut}}(r_{IJ}) = \frac{1}{2}\left[\cos\!\left(\frac{\pi r_{IJ}}{r_{\mathrm{cut}}}\right) + 1\right].$$

The three-body term GX,Y3-body is given by the following function:

$$G^{\text{3-body}} = \xi_3\, G^{\text{3-body}}_{\mathrm{radial}}\, G^{\text{3-body}}_{\mathrm{angular}}\, f_{\mathrm{cut}}(r_{IJ})\, f_{\mathrm{cut}}(r_{JK})\, f_{\mathrm{cut}}(r_{IK}).$$

The radial term Gradial3-body is given by the following expression:

$$G^{\text{3-body}}_{\mathrm{radial}} = \sqrt{\frac{\eta_3}{\pi}}\,\exp\!\left(-\eta_3\left(\frac{r_{IJ} + r_{IK}}{2} - R_s\right)^2\right),$$

where η3 is a parameter that controls the width of the n3 radial distribution functions, located at Rs grid points. The three-body scaling function ξ3 is the Axilrod–Teller–Muto term38,39 with modified exponents,40 

$$\xi_3 = c_3\,\frac{1 + 3\cos\theta_{KIJ}\cos\theta_{IJK}\cos\theta_{JKI}}{\left(r_{IJ}\, r_{JK}\, r_{IK}\right)^{N_3}},$$

where θKIJ is the angle between atoms I, J, K, with I at the center, c3 is a weight term, and N3 is a three-body scaling factor. Finally, the angular term is given by a Fourier expansion as follows:

$$G^{\text{3-body}}_{\mathrm{angular}} = \left[G_n^{\cos},\ G_n^{\sin}\right],$$

where the cosine and sine terms are given by

$$G_n^{\cos} = \exp\!\left(-\frac{\zeta n^2}{2}\right)\left[\cos\!\left(n\theta_{KIJ}\right) - \cos\!\left(n\left(\theta_{KIJ} + \pi\right)\right)\right], \qquad G_n^{\sin} = \exp\!\left(-\frac{\zeta n^2}{2}\right)\left[\sin\!\left(n\theta_{KIJ}\right) - \sin\!\left(n\left(\theta_{KIJ} + \pi\right)\right)\right],$$

where ζ is a parameter describing the width of the angular Gaussian function and n > 0 is the expansion order. Similar to previous work, only the two n = 1 cosine and sine terms are used.
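For concreteness, a small PyTorch sketch of the two-body term defined above is given below, evaluated for a single pair distance with the U + F parameters of Table I; the placement of the radial grid (skipping Rs = 0 so that ln Rs stays finite) and the scalar interface are simplifications of ours, not the package's CUDA implementation.

```python
import torch

def fchl19_two_body(r_ij, r_cut=6.0, n2=24, eta2=0.28, N2=2.3):
    """Log-normal two-body bins for a single pairwise distance r_ij (scalar tensor).

    Parameter values follow Table I (U + F column); illustrative only.
    """
    Rs = torch.linspace(0.0, r_cut, n2 + 1)[1:]             # n2 radial grid centres, skipping r = 0
    mu = torch.log(r_ij / torch.sqrt(1.0 + eta2 / r_ij ** 2))
    var = torch.log(1.0 + eta2 / r_ij ** 2)
    fcut = 0.5 * (torch.cos(torch.pi * r_ij / r_cut) + 1.0) * (r_ij < r_cut)
    prefac = fcut / r_ij ** N2 / (Rs * torch.sqrt(var) * (2.0 * torch.pi) ** 0.5)
    return prefac * torch.exp(-(torch.log(Rs) - mu) ** 2 / (2.0 * var))

g2 = fchl19_two_body(torch.tensor(1.4))     # contribution of one neighbour at 1.4 angstrom
```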

1. Optimization of representation parameters

The optimal parameters used to generate the FCHL19 representation here differ from those used in the original implementation.20 We have found improved parameters using a lower cutoff of rcut = 6.0 Å. To fit these parameters, we employed a subset of 576 distorted geometries of small molecules with up to five C, N, and O atoms, saturated with hydrogen atoms, for which forces and energies have been obtained from density functional theory (DFT) calculations.20,41 This dataset is identical to that used in the original FCHL19 publication. The dataset is randomly divided into a training set of 384 geometries and a test set of 192 geometries. Models are fitted to the training set, and predictions on the test set are used to minimize the following cost function with respect to the parameters:

$$\mathcal{L} = 0.01\sum_i\left(U_i - U_i^{\mathrm{ref}}\right)^2 + \sum_i\frac{1}{n_i}\left\lVert \mathbf{F}_i - \mathbf{F}_i^{\mathrm{ref}}\right\rVert^2,$$

where Ui is the energy of molecule i and Fi and ni are the forces and the number of atoms of the same molecule, respectively. A greedy Monte Carlo scheme was used to perform this optimization, where real-valued parameters are perturbed by multiplying with a factor randomly chosen from a normal distribution centered on 1 with a variance of 0.05, and integer-valued parameters by randomly adding +1 or −1. The final parameters that were found to work best are listed in Table I.
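A sketch of this greedy Monte Carlo scheme is given below: a candidate parameter set is perturbed (real-valued parameters by a factor drawn from N(1, 0.05), integer-valued parameters by ±1) and kept only if the cost decreases; the loss_fn placeholder stands in for refitting a model and evaluating the cost function above on the held-out geometries.

```python
import random

def greedy_mc_optimize(params, loss_fn, n_steps=500, seed=0):
    """Greedy Monte Carlo search over representation parameters.

    params  : dict of name -> value; ints are perturbed by +/-1, floats by a factor
              drawn from a normal distribution with mean 1 and variance 0.05.
    loss_fn : callable evaluating the cost function on a held-out test set.
    """
    rng = random.Random(seed)
    best, best_loss = dict(params), loss_fn(params)
    for _ in range(n_steps):
        trial = dict(best)
        name = rng.choice(list(trial))
        if isinstance(trial[name], int):
            trial[name] += rng.choice((-1, 1))
        else:
            trial[name] *= rng.gauss(1.0, 0.05 ** 0.5)   # std dev = sqrt(variance)
        loss = loss_fn(trial)
        if loss < best_loss:                              # greedy: keep only improving moves
            best, best_loss = trial, loss
    return best, best_loss

# toy usage with a quadratic target (hypothetical; the paper instead minimizes
# energy/force errors on 192 held-out DFT geometries)
toy_loss = lambda p: (p["eta2"] - 0.28) ** 2 + (p["n2"] - 24) ** 2
opt_params, opt_err = greedy_mc_optimize({"eta2": 0.5, "n2": 20}, toy_loss)
```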

TABLE I.

Optimized representation parameters for FCHL19 for both energy-only (U) and simultaneous energy and force (U + F) learning.

Parameter   U      U + F
n2          23     24
n3          22     22
η2          0.27   0.28
η3          5.6    3.2
N2          2.78   2.3
N3          2.1    0.65
c3          60.1   18.8
ζ           π      π
rcut (Å)    6.0    6.0

2. GPU implementation: Representation

The FCHL19 representation is constructed by assigning each atom I to a separate block in a batch; one block is launched for each atom in the system. For each block, a total of 256 threads are used in a two-dimensional thread grid. The first dimension of this grid contains 16 threads and enumerates all two-body interactions with the central atom I to construct GI2, while the second dimension contains eight threads, which enumerate the third index in the three-body interaction GI3. For the forward pass of the model, the reduction of all GI2 and GI3 scalar elements is performed in global memory. For the backward pass, however, the atomic force components are stored and summed in local, on-chip memory, significantly increasing throughput. A simple tiled neighbor list is used to linearize the cost of FCHL19 with respect to the number of atoms in the local environments; therefore, only atoms within the cutoff radius are considered when constructing the representation. Once the atomic representation has been constructed, it is projected to a lower dimension of size NPCA using a matrix obtained from an SVD of a randomly selected subset of atomic representations from the training set. The size of this lower-dimensional vector is constrained to be a power of 2 for the purposes of the Hadamard transform. Each element within the training database has its own projection matrix, which is used to project down the corresponding atomic representations separately.

3. GPU implementation: Structured orthogonal features

For the SORF forward pass, each block handles all Nf/d Hadamard transforms for a single atom to produce a feature vector of the desired length Nf. The Hadamard transform itself operates on the projected FCHL19 representation (dimension d = 2ⁿ, n ≥ 1), after multiplication with the diagonal sign-flipping matrix, using a shared-memory butterfly operation that has O(d log d) complexity. This operation is performed iteratively Ntransform times. In this work, either d = 128 or d = 256 is used and Ntransform is set to 1. For the backward pass, the gradients are stored and reduced in shared memory.

We begin by briefly comparing the performance of RFF-type and SORF-type approximations with that of a sparse GP method. We have implemented a sparse GP model in QML-Lightning that uses a Gaussian kernel, where the sparse points for each element are uniformly randomly selected from the available training data. The left column of Fig. 1 shows the convergence of the out-of-sample mean absolute errors (MAEs) for energy with respect to increasing Nf used to approximate the kernel, as well as the wall time to compute 1k configurations. For the sparse GP model, we set the total number of sparse points Ns to be equal to Nf. Here, we have used the aspirin trajectory from the original MD17 database.22–24 The amount of training data remains fixed, using 1k configurations and training on energies only. The right column of Fig. 1 displays the aforementioned energy MAEs plotted against an estimate of the number of operations performed per atom for the SORF, RFF, and sparse GP models. Since we are only concerned with relative differences, the horizontal axis has been normalized to 1.

FIG. 1.

Left: Wall time and mean absolute energy errors for computing 1k out-of-sample aspirin configurations given a fixed training budget of 1k configurations for RFF, SORF, and sparse GP models. Right: Estimated normalized number of operations per atom required to compute the aforementioned errors. The number of features for each model is indicated with different markers. For the sparse GP model, Ns = Nf.


For a fixed number of features, the SORF and RFF models outperform the sparse GP model, with SORF having the lowest error. Furthermore, SORF models are faster than both RFF and sparse GP models for all model sizes. We note that, while RFF and sparse GP models have the same formal cost, warp matrix-multiply-accumulate (WMMA) operations are accelerated on TensorCore-enabled GPUs such as the RTX 3080 used here, thereby making the calculation of the matrix-vector product W[ρI, 1]ᵀ significantly faster. However, increasing Nf incurs a numerical error in this matrix-vector product for the RFF model if it is performed in single precision. To remedy this, controlled-error low-precision matrix-vector products can be computed with three WMMA operations, which has not been implemented in this work. While this would increase the cost of RFF, it would not significantly change the ordering of the results presented here. We note that this does not appear to be an issue for the SORF model, since the product W[ρI, 1]ᵀ is computed via a series of fast Hadamard transforms requiring only simple additions.

In Fig. 2, the predictive accuracy of several explicit kernel models and SORF models for the atomization energies of molecules in the QM9 dataset21 is compared. These models include FCHL1840 and FCHL1920 using the OQML20 regressor, which is equivalent to a sparse GP model in which all training atoms are used in the kernel basis. For the SORF models, learning curves using both Nf = 16 384 and Nf = 32 768 are shown. We find that the SORF models with FCHL19 perform similarly to OQML with FCHL19: The MAEs for OQML/FCHL19 and SORF-32k/FCHL19 at 75 000 training samples are 11 and 12 meV, respectively. We note that there is a small deviation away from linearity in the learning curve at 50 000 training samples, indicating that more features may be required.

FIG. 2.

Learning curves for the QM9 dataset. The mean absolute error (MAE) of the atomization energy (in meV) is shown for two explicit kernel models, OQML with FCHL18 and FCHL19, and two SORF models with FCHL19 using Nf = 16 384 and Nf = 32 768, respectively.


Figure 3 reports the energy and force MAEs as a function of the number of training samples for seven molecules from the MD-17 dataset.22–24 To be consistent with previous literature,20 we use the original MD-17 dataset, which is known to contain significant noise on the energy values.25 The learning curves for SORF/FCHL19 with both Nf = 16 384 and 32 768 are reported. We compare against OQML models based on FCHL1920 as well as GDML22 and sGDML42 models. Additionally, SchNet43 and state-of-the-art NequIP44 neural networks have been included.

FIG. 3.

Energy and force learning curves for the molecules (left to right): ethanol, salicylic acid, aspirin, malonaldehyde, toluene, naphthalene, and uracil from the MD-17 dataset. The top row contains learning curves for out-of-sample MAE energy prediction (meV). The bottom row contains learning curves for MAE force component prediction (meV/Å).


For energy learning, the SORF/FCHL19 models in general display similar accuracies to the OQML/FCHL19 model. For toluene, naphthalene, and salicylic acid, the OQML/FCHL19 model slightly outperforms the SORF/FCHL19 models for both Nf = 16 384 and Nf = 32 768. However, in all other cases, the SORF models display similar or better accuracy. Both sGDML and GDML perform worse than OQML and SORF models in general; however, for toluene and naphthalene specifically, sGDML has the lowest error among the kernel models.

For force learning, the SORF-based models are as accurate as or better than OQML/FCHL19 for Nf = 16 384, and they are noticeably more accurate than OQML for Nf = 32 768. These models outperform SchNet in all cases, while NequIP outperforms the SORF models in all cases. We note that the training times for SORF/FCHL19 are on the order of seconds, while OQML/FCHL19 and sGDML models take several minutes to train. Furthermore, GDML models take several hours, and SchNet and NequIP are trained over hours to days. We additionally provide a comparative benchmark on the revised MD-17 dataset,25 a recomputed version of the original MD-17 dataset with tighter SCF convergence criteria. Table II lists the out-of-sample MAEs for the largest SORF model constructed in this work with FCHL19, GP with FCHL19,25 sGDML, ACE,45 and NequIP with rotation orders l = 0 and l = 3. We note that FCHL19 is a comparatively simple atomic featurization layer; consequently, it is expected that it does not perform as well as state-of-the-art equivariant many-body neural networks46–48 such as NequIP. For a more reasonable comparison, the l = 0 channel NequIP model, which contains at most three-body terms similar to FCHL19, has been included here. While the force errors are similar to the MD-17 results, the energy errors are significantly lower across all models, which is consistent with the observation that the noise floor of the original MD-17 data is higher on the energies.

TABLE II.

Energy and force MAEs for models trained on 1k configurations from the revised MD-17 dataset. Errors are reported in meV and meV/Å for energies and forces, respectively. Bold face values denote best-performing model among those listed.

Molecule         Property   SORF-32k   GP/FCHL19   sGDML   ACE    NequIP (l = 0)   NequIP (l = 3)
Aspirin          Energy     9.6        6.2         7.2     6.1    25.2             2.3
                 Forces     25.9       20.9        31.8    17.9   41.9             8.5
Azobenzene       Energy     5.6        2.8         4.3     3.6    20.3             0.7
                 Forces     13.6       10.8        19.2    10.9   42.3             3.6
Ethanol          Energy     1.5        0.9         2.4     1.2    2.0              0.4
                 Forces     7.5        6.2         16.0    7.3    13.7             3.4
Malonaldehyde    Energy     2.3        1.5         3.1     1.7    4.4              0.8
                 Forces     12.1       10.3        18.8    11.1   23.4             5.2
Naphthalene      Energy     4.9        1.2         0.8     0.9    14.7             0.2
                 Forces     9.1        6.5         5.4     5.1    20.1             1.2
Salicylic acid   Energy     4.3        1.8         2.1     1.8    11.4             0.7
                 Forces     12.5       9.5         12.8    9.3    28.7             4.0
Toluene          Energy     4.2        1.7         1.0     1.1    9.7              0.3
                 Forces     11.6       8.8         6.3     6.5    27.2             1.6
Uracil           Energy     1.4        0.6         1.4     1.1    10.0             0.4
                 Forces     6.0        4.2         10.4    6.6    25.8             3.2

The 3BPA dataset26,45 contains both ambient and high-temperature configurational samples for the small drug-like molecule 3-(benzyloxy)pyridin-2-amine (3BPA). This molecule has three central rotatable dihedral angles (α, β, γ), leading to a complex dihedral potential energy surface with many local minima, which can be challenging for both classical and ML-based potentials.49 In particular, at ambient temperature (300 K) the accessible phase space is small; however, the dataset contains both 600 and 1200 K configurational samples, which cover increasingly large regions of phase space. A crucial test of whether an ML model can extrapolate well is therefore whether the model, when trained on the 300 K samples, can accurately predict the 600 and 1200 K samples. Consequently, we have trained the SORF/FCHL19 model on the 300 K subset of the 3BPA dataset in order to compare against results obtained from ACE, two kernel models, sGDML and GAP29 using SOAP,50,51 as well as two related neural network architectures, ANI52,53 and ANI-2x.54 We note that these results have been summarized directly from the ACE publication.45

Table III lists the energy and force root-mean-squared errors (RMSEs) of a variety of different models listed in the ACE paper,45 including the SORF model with both 16 384 and 32 768 features. For the 300 K test dataset, the SORF model with Nf = 32 768 achieves very low out-of-sample errors, having more accurate forces than all models except ACE. For the 600 K dataset, the SORF models do not fare as well as the ACE and ANI models; however, they significantly outperform both kernel-based models, sGDML and GAP with SOAP features. For the 1200 K dataset, the SORF model has lower force errors than ACE; however, the ANI models have noticeably lower errors. We note that ACE contains up to five-body terms in its cluster expansion, while FCHL19 contains only up to three-body terms. Therefore, one would expect ACE to outperform the FCHL19/SORF models in this setting.

TABLE III.

Root-mean-squared errors of energy and force predictions, in meV and meV/Å, respectively, for different models trained on the 300 K 3BPA dataset, using the 600 and 1200 K sets as out-of-distribution test sets. The asterisk indicates the dataset used for training. Bold face values denote the best-performing model among those listed.

T (K)   Property   SORF-16k   SORF-32k   ACE     sGDML   GAP     ANI     ANI-2x
300*    Energy     15.0       13.7       7.1     9.1     22.8    23.5    38.6
        Forces     39.2       36.2       27.1    46.2    87.3    42.8    84.4
600     Energy     49.3       37.7       24.0    484.8   61.4    37.8    54.5
        Forces     86.8       75.9       64.3    439.2   151.9   71.7    102.8
1200    Energy     118.4      99.2       85.3    774.5   166.8   76.8    88.8
        Forces     175.6      159.3      187.0   711.1   305.5   129.6   139.6

Table IV shows the total time spent in seconds when training various CPU- and GPU-accelerated models using both energies and forces, given a fixed amount of training data (Nt = 1k) from the MD-17 database. The device used to train each model is listed for reference. Three kernel models are shown: OQML, GP (both using the same FCHL19 representation as the GPU variant implemented here), and sGDML. We additionally compare against NequIP; however, the training times shown are an upper bound, as models with good predictive accuracy can likely be obtained with early stopping. The timings for the SORF model with both 16 384 and 32 768 features are shown using a consumer RTX 3080 GPU and an A100 data-center GPU. It should be noted that there is a significant cumulative round-off error when constructing the ZᵀZ matrix in single precision (FP32), which leads to ill-conditioning. However, Kahan summation55 has been implemented here to minimize this error, allowing the normal equations to be constructed in FP32 format to yield a threefold reduction in training time compared to FP64 precision. For the A100 GPU, only FP64 performance is shown, as we observed minimal gain in speed when using FP32. Using the RTX 3080 GPU, SORF models are found to be several orders of magnitude faster to train than both GP and neural networks. Furthermore, they are on average ≈3 and ≈210 times faster to train than OQML and sGDML models, respectively, across the datasets shown when using FP64 matrix multiplication, while they are ≈10 and ≈700 times faster to train when using FP32 with Kahan summation. On the A100 GPU, trained models can be obtained in seconds. Note that this work focuses on models that fit into GPU global memory, i.e., both the ZᵀZ matrix and the normal equations are constructed and solved on the GPU. However, QML-Lightning also supports an out-of-core approach, where the ZᵀZ matrix is tiled, with each tile being copied to the host device and summed there; the normal equations are then solved on the CPU. This second variant is necessary when the number of features yields matrices whose memory requirements exceed the GPU memory available to both construct the ZᵀZ matrix and solve the normal equations. The choice of CPU(s) and the out-of-core CPU/GPU implementation will significantly determine the model training time in these cases; however, this aspect has not been investigated in this work.
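For reference, a compensated (Kahan) accumulation of per-batch contributions to the ZᵀZ matrix in FP32 might look as follows; this is a generic sketch of the summation scheme, not the QML-Lightning implementation, and the shapes are arbitrary.

```python
import torch

def kahan_accumulate(blocks):
    """Sum a sequence of FP32 matrices with Kahan compensation to limit round-off.

    blocks yields, e.g., per-batch Z_b^T Z_b contributions; the compensated sum keeps
    a running correction term that re-injects the low-order bits lost at each step.
    """
    total, comp = None, None
    for b in blocks:
        if total is None:
            total, comp = b.clone(), torch.zeros_like(b)
            continue
        y = b - comp              # corrected contribution
        t = total + y             # low-order bits of y may be lost here
        comp = (t - total) - y    # recover what was lost
        total = t
    return total

# toy check in single precision (hypothetical shapes)
torch.manual_seed(0)
batches = [torch.randn(64, 64) * 1e-3 for _ in range(2000)]
naive = sum(batches)
kahan = kahan_accumulate(batches)
ref = torch.zeros(64, 64, dtype=torch.float64)
for b in batches:
    ref += b.double()
# the compensated sum typically tracks the FP64 reference more closely than the naive sum
print((naive.double() - ref).abs().max(), (kahan.double() - ref).abs().max())
```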

TABLE IV.

Training time in seconds for various kernel (CPU) and neural network (GPU) models using Nt = 1k and training on both energies and forces. Device used to train the model indicated where appropriate. For NequIP, training times are approximate. For the QML-Lightning models, both double precision (DP) and single precision (FP) have been specified for the RTX 3080 device, whereas only DP has been specified for A100 (see discussion for details). Bold face values denote best-performing model among those listed.

Model train times (s)

Molecule         OQML/FCHL19    GP/FCHL19      sGDML      NequIP   SORF 16k-DP   SORF 16k-FP   SORF 32k-DP
Device           2 × E5-2680    2 × E5-2680    E5-2640    V100     RTX-3080      RTX-3080      A100
Ethanol          66             2 252          144        ≈h       24            12            4
Salicylic acid   249            6 836          282        ≈h       46            23            7
Aspirin          527            101 451        570        ≈h       84            44            10
Malonaldehyde    51             1 926          150        ≈h       30            15            6
Toluene          271            7 976          216        ≈h       30            15            8
Naphthalene      455            11 782         348        ≈h       60            30            7
Uracil           87             2 576          120        ≈h       24            12            6

In Fig. 4, we show the force accuracy of the SORF model with FCHL19 as a function of the number of features, together with the corresponding wall time to complete 1k MD steps for azobenzene using the OpenMM QML-Lightning interface. For additional context, we compare against Performant ACE56 (PACE), which is an MPI-parallelized and optimized version of ACE.45 For the SORF model, an RTX 3080 was used, whereas the results for PACE were provided by the first author of the ACE paper, where 8 MPI tasks were found to achieve the best time performance on a 16-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30 GHz processor. The reader should be cautioned that directly comparing GPU and CPU codes poses a number of difficulties and that constructing a perfectly fair benchmark is very challenging. PACE yields significantly lower wall times for smaller models, being approximately an order of magnitude faster than the SORF model when using 4k basis functions and features, respectively. Small molecules represent worst-case systems for GPU codes, since a number of streaming multiprocessors sit idle and those that are active perform relatively little work. However, it is encouraging to see that even for small systems the SORF model achieves a similar error and wall time as PACE at 16k features. Doubling the number of features to 33k noticeably improves the error while increasing the wall time cost by only 1 s; however, the error begins to saturate here, and we do not expect a significant improvement in error at 65k features. Conversely, the linearization achieved in PACE is over an increasing number of higher-order many-body correlations; therefore, the model’s accuracy improves as the number of basis functions increases, albeit with a larger computational cost.

FIG. 4.

Mean-absolute force error and wall time to complete 1k MD steps for a model trained on 1k azobenzene configurations on a log-log scale. Parenthesized numbers show the number of features used to approximate the kernel for the SORF model, and for PACE they show the number of basis functions.


Table V lists the total wall time in seconds for completing 1000 MD steps for a variety of systems using the OpenMM QML-Lightning interface. Additionally, per-atom per-MD-step timings and the average and maximum number of neighbors per atom are provided. For representative small-molecule timings, azobenzene is listed. For larger nonperiodic systems, a brass cluster57,58 and a vitamin D receptor (5OWD)59 with interstitial water molecules are included. For larger periodic systems, two metal–organic frameworks (MOFs) obtained from the Materials Project60 and liquid water are included. For the two monoclinic MOF systems, a 3 × 3 × 3 supercell was used. For the Zn4H44C51N8O18 system (qmof-0b5e989),61,62 the lattice constants and angles are [a, b, c] = [16.02, 21.41, 29.93] Å and [α, β, γ] = [90.0, 90.0, 97.8]°, respectively. For the Nd2C36H42N12O18S3 system (qmof-bf74329),61,62 the lattice constants and angles are [a, b, c] = [12.14, 16.95, 25.66] Å and [α, β, γ] = [102.2, 90.0, 90.0]°, respectively. For liquid water, a 2 × 2 × 2 supercell was used with a cubic unit cell with a lattice constant of a = 25.0 Å, generated using OpenMMTools.63 We note that since our optimization strategy parallelizes the computation over atoms, with each atom being assigned to a different block in round-robin fashion, smaller systems yield an underutilized GPU, as each atom is mapped to a different streaming multiprocessor (SM), of which there are 68 on an RTX 3080. Furthermore, the small number of neighbors yields a limited amount of thread-parallel work to perform when constructing FCHL19 within each SM. For the brass cluster, the timings per atom per step are significantly improved, since the SMs are fully utilized. However, only approximately four atoms are computed per SM; therefore, there is limited work, which yields comparatively poor timings with respect to the larger systems. For large systems, near-microsecond per-atom per-MD-step timings can be observed; for example, forces are evaluated in 5.5 µs/(atom ⋅ step) for the Zn4H44C51N8O18 system and 8.56 µs/(atom ⋅ step) for the Nd2C36H42N12O18S3 system. We expect that sub-microsecond timings could be obtained using low-precision computation.

TABLE V.

Wall time (s) to complete 1k MD steps using the OpenMM QML-Lightning interface for azobenzene, a brass cluster, a vitamin D receptor, two MOF systems (details in text), and liquid water. The average number of neighbors per atom and the maximum number of neighbors (in parentheses) are provided. All timings listed here were computed using a SORF-16k model on an RTX 3080.

System               Natoms   Periodic   Wall time (s)   ms/(atom ⋅ MD step)   Nneighbors
Azobenzene           24       No         3.0             1.25 × 10⁻¹           16 (23)
Brass cluster        245      No         4.5             1.8 × 10⁻²            46 (78)
Vitamin D receptor   4 369    No         45.7            1.0 × 10⁻²            78 (116)
Zn4H44C51N8O18       13 500   Yes        72.8            5.39 × 10⁻³           49 (74)
Nd2C36H42N12O18S3    12 204   Yes        104.5           8.56 × 10⁻³           85 (124)
Water (l)            12 000   Yes        116.4           9.7 × 10⁻³            90 (103)

For a component breakdown of the computational cost, Fig. 5 shows the approximate on-device time for key components in computing energies and forces for azobenzene (Natoms = 24), a brass cluster (Natoms = 245), a vitamin D receptor (Natoms = 4369), and the periodic MOF system Nd2C36H42N12O18S3 (Natoms = 12 204) using Nf = 16 384. Profiling PyTorch GPU codes adds an overhead to each node of the autograd graph; therefore, we display the results as percentages of the total device time. For azobenzene, the forward and backward passes of FCHL19 and SORF require only 28% of the total device time, while indexing operations, initialization, and intermediary operations require ∼70%. We note that many of these operations could be optimized away by combining them into single kernel calls. In particular, it is likely that the indexing costs, which are mostly associated with sub-selecting the atomic representations for projection with the relevant matrices PeNPCA, could be removed by using customized CUDA C functions to generate these indices for use directly in the FCHL19 and SORF modules. For the larger brass cluster, the forward and backward passes of FCHL19 and SORF occupy 47% of the total device time; however, indexing and other costs still occupy a large percentage of the total device time. Here, the percentage cost of the forward and backward passes of SORF is reduced from 23% to 18% compared to azobenzene. For the vitamin D receptor, the forward and backward passes of FCHL19 begin to dominate the cost, accounting for 24.5% and 45.7%, respectively, while the cost of the SORF transforms is reduced to 10% of the total device time. Finally, for the periodic MOF system, the forward and backward passes of FCHL19 occupy 80.5% of the total cost, while the SORF transform requires only 7.6% of the total device time.

FIG. 5.

Percentage of on-device time for inference of component functions for energy and force evaluation for a configuration of (a) azobenzene, (b) Cu225Zn20 brass cluster, (c) vitamin D receptor, and (d) Nd2C36H42N12O18S3 MOF supercell with periodic boundary conditions using a SORF-16k model on an RTX 3080. The shorthand notations Init. and Idx. refer to initialization and indexing operations, respectively, while the notation [func] refers to the backward pass of the given function.


In this paper, we have introduced a PyTorch-based library, termed QML-Lightning, which contains approximate kernel models and an efficient representation designed for learning quantum mechanical properties. We have provided a low-cost, PyTorch-wrapped CUDA C implementation of structured orthogonal random features, a variant of the well-known random Fourier features, as well as a computationally efficient implementation of FCHL19, an accurate atom-centered representation.

The combination of structured orthogonal random features and FCHL19 has been benchmarked against existing datasets, yielding not only similar or better accuracy than explicit kernel models with FCHL19 but also competitive accuracy with contemporary models, with significantly reduced training times and fast prediction times.

This research was supported by the NCCR MARVEL, a National Center of Competence in Research, funded by the Swiss National Science Foundation (Grant No. 182892). Calculations were performed at sciCORE (http://scicore.unibas.ch/) scientific computing center at University of Basel.

We would also like to thank David Kovacs, Ilyes Batatia, and Gábor Csányi for fruitful discussions and David Kovacs for providing PACE timings.

The authors have no conflicts to disclose.

Nicholas J. Browning: Conceptualization (lead); Data curation (lead); Formal analysis (equal); Investigation (lead); Methodology (lead); Software (lead); Visualization (equal); Writing – original draft (lead); Writing – review & editing (equal). Felix A. Faber: Formal analysis (equal); Validation (lead); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). O. Anatole von Lilienfeld: Funding acquisition (lead); Project administration (lead); Writing – review & editing (equal).

The data that support the findings of this study are available within the article.

C

set of atoms belonging to a compound

d

size of atomic representation.

F

matrix containing force components

G

atom density function

H

Walsh-Hadamard matrix

I

identity matrix

I, J

indices denoting atoms I and J, respectively

K

Kernel matrix

k(⋅, ⋅)

Kernel function

L

loss function

PeNPCA

atomic projection matrix for element e

Nf

number of features for the RFF approximation

Ns

number of sparse points in sparse GP basis

Nt

number of training instances

U

matrix containing total formation energies

U

total formation energy of a compound

w

angular frequency of a Fourier transform

Z

matrix containing approximate feature mappings

z(⋅)

RFF approximation of the feature mapping

α

regression weights

ρ

atomic representations

ϕ(⋅)

feature mapping of a kernel function

ϵ

atomic energy contributions

1. B. Huang and O. A. von Lilienfeld, “Ab initio machine learning in chemical compound space,” Chem. Rev. 121, 10001–10036 (2021).
2. O. A. von Lilienfeld and K. Burke, “Retrospective on a decade of machine learning for chemical discovery,” Nat. Commun. 11, 4895 (2020).
3. M. Ceriotti, C. Clementi, and O. Anatole von Lilienfeld, “Machine learning meets chemical physics,” J. Chem. Phys. 154, 160401 (2021).
4. O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, “Exploring chemical compound space with quantum-based machine learning,” Nat. Rev. Chem. 4, 347–358 (2020).
5. M. Ceriotti, C. Clementi, and O. Anatole von Lilienfeld, “Introduction: Machine learning at the atomic scale,” Chem. Rev. 121, 9719–9721 (2021).
6. A. Kamath, R. A. Vargas-Hernández, R. V. Krems, T. Carrington, and S. Manzhos, “Neural networks vs Gaussian process regression for representing potential energy surfaces: A comparative study of fit quality and vibrational spectrum accuracy,” J. Chem. Phys. 148, 241702 (2018).
7. J. Cui and R. V. Krems, “Efficient non-parametric fitting of potential energy surfaces for polyatomic molecules with Gaussian processes,” J. Phys. B: At. Mol. Opt. Phys. 49, 224001 (2016).
8. W. Zhu, J. Luo, and A. D. White, “Federated learning of molecular properties with graph neural networks in a heterogeneous setting,” Patterns 3, 100521 (2021).
9. Z. Yang, M. Chakraborty, and A. D. White, “Predicting chemical shifts with graph neural networks,” Chem. Sci. 12, 10802–10809 (2021).
10. B. J. Braams and J. M. Bowman, “Permutationally invariant potential energy surfaces in high dimensionality,” Int. Rev. Phys. Chem. 28, 577–606 (2009).
11. P. L. Houston, C. Qu, A. Nandi, R. Conte, Q. Yu, and J. M. Bowman, “Permutationally invariant polynomial regression for energies and gradients, using reverse differentiation, achieves orders of magnitude speed-up with high precision compared to other machine learning methods,” J. Chem. Phys. 156, 044120 (2022).
12. J. Behler, “Atom-centered symmetry functions for constructing high-dimensional neural network potentials,” J. Chem. Phys. 134, 074106 (2011).
13. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning (MIT Press, Cambridge, MA, 2006).
14. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.
15. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.-S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” software available from tensorflow.org, 2015.
16. A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, “Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons,” Phys. Rev. Lett. 104, 136403 (2010).
17. E. Snelson and Z. Ghahramani, “Local and global sparse Gaussian process approximations,” in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, San Juan, Puerto Rico, edited by M. Meila and X. Shen (PMLR, 2007), Vol. 2, pp. 524–531.
18. A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, edited by J. Platt, D. Koller, Y. Singer, and S. Roweis (Curran Associates, Inc., 2007), Vol. 20.
19. F. X. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar, “Orthogonal random features,” in Advances in Neural Information Processing Systems, edited by D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016), Vol. 29.
20. A. S. Christensen, F. A. Faber, and O. A. von Lilienfeld, “Operators in quantum machine learning: Response properties in chemical space,” J. Chem. Phys. 150, 064105 (2019).
21. R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Sci. Data 1, 140022 (2014).
22. S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller, “Machine learning of accurate energy-conserving molecular force fields,” Sci. Adv. 3, e1603015 (2017).
23. K. T. Schütt, F. Arbabzadah, S. Chmiela, K.-R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nat. Commun. 8, 13890 (2017).
24. S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko, “Towards exact molecular dynamics simulations with machine-learned force fields,” Nat. Commun. 9, 3887 (2018).
25. A. S. Christensen and O. A. von Lilienfeld, “On the role of gradients for machine learning of molecular energies and forces,” Mach. Learn.: Sci. Technol. 1, 045018 (2020).
26. D. J. Cole, L. Mones, and G. Csányi, “A machine learning based intramolecular potential for a flexible organic molecule,” Faraday Discuss. 224, 247–264 (2020).
27. P. Eastman, J. Swails, J. D. Chodera, R. T. McGibbon, Y. Zhao, K. A. Beauchamp, L.-P. Wang, A. C. Simmonett, M. P. Harrigan, C. D. Stern, R. P. Wiewiora, B. R. Brooks, and V. S. Pande, “OpenMM 7: Rapid development of high performance algorithms for molecular dynamics,” PLoS Comput. Biol. 13(7), e1005659 (2017).
28. G. Dhaliwal, P. B. Nair, and C. V. Singh, “Machine learned interatomic potentials using random features,” npj Comput. Mater. 8, 7 (2022).
29. A. P. Bartók and G. Csányi, “Gaussian approximation potentials: A brief tutorial introduction,” Int. J. Quantum Chem. 115, 1051–1057 (2015).
30. V. L. Deringer, A. P. Bartók, N. Bernstein, D. M. Wilkins, M. Ceriotti, and G. Csányi, “Gaussian process regression for materials and molecules,” Chem. Rev. 121, 10073–10141 (2021).
31. R. A. Willoughby, “Solutions of ill-posed problems (A. N. Tikhonov and V. Y. Arsenin),” SIAM Rev. 21, 266–267 (1979).
32. R. K. Cersonsky, B. A. Helfrecht, E. A. Engel, S. Kliavinek, and M. Ceriotti, “Improving sample and feature selection with principal covariates regression,” Mach. Learn.: Sci. Technol. 2, 035038 (2021).
33. Q. V. Le, T. Sarlos, and A. J. Smola, “Fastfood: Approximating kernel expansions in loglinear time,” in Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA (JMLR, 2013), Vol. 28, pp. III-244–III-252.
34. L. Wu, I. E. H. Yen, J. Chen, and R. Yan, “Revisiting random binning features: Fast convergence and strong parallelizability” (Association for Computing Machinery, New York, NY, 2016), pp. 1265–1274.
35. F. Liu, X. Huang, Y. Chen, and J. A. K. Suykens, “Random features for kernel approximation: A survey on algorithms, theory, and beyond,” IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 7128–7148 (2022).
36. E. J. Nyström, “Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben,” Acta Math. 54, 185–204 (1930).
37. M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, and P. Marquetand, “wACSF—Weighted atom-centered symmetry functions as descriptors in machine learning potentials,” J. Chem. Phys. 148, 241709 (2018).
38. Y. Muto, “Force between nonpolar molecules,” J. Phys. Math. Soc. Jpn. 17, 629 (1943).
39. B. M. Axilrod and E. Teller, “Interaction of the van der Waals type between three atoms,” J. Chem. Phys. 11, 299 (1943).
40. F. A. Faber, A. S. Christensen, B. Huang, and O. A. von Lilienfeld, “Alchemical and structural distribution based representation for universal quantum machine learning,” J. Chem. Phys. 148, 241717 (2018).
41. A. S. Christensen, F. A. Faber, and O. A. von Lilienfeld, figshare, https://doi.org/10.6084/m9.figshare.7000280.v1, 2018.
42. S. Chmiela, H. E. Sauceda, I. Poltavsky, K.-R. Müller, and A. Tkatchenko, “sGDML: Constructing accurate and data efficient molecular force fields using machine learning,” Comput. Phys. Commun. 240, 38–45 (2019).
43. K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, “SchNet—A deep learning architecture for molecules and materials,” J. Chem. Phys. 148, 241722 (2018).
44. S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, “E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials,” Nat. Commun. 13(1), 2453 (2022).
45. D. P. Kovács, C. van der Oord, J. Kucera, A. E. A. Allen, D. J. Cole, C. Ortner, and G. Csányi, “Linear atomic cluster expansion force fields for organic molecules: Beyond RMSE,” J. Chem. Theory Comput. 17, 7696–7711 (2021).
46. M. Geiger, T. Smidt, M. Alby, B. K. Miller, W. Boomsma, B. Dice, K. Lapchevskyi, M. Weiler, M. Tyszkiewicz, S. Batzner, M. Uhrin, J. Frellsen, N. Jung, S. Sanborn, J. Rackers, and M. Bailey, “Euclidean neural networks: e3nn,” arXiv:2207.09453 (2022).
47. K. T. Schütt, O. T. Unke, and M. Gastegger, “Equivariant message passing for the prediction of tensorial properties and molecular spectra,” in International Conference on Machine Learning (PMLR, 2021), pp. 9377–9388.
48. M. Haghighatlari, J. Li, X. Guan, O. Zhang, A. Das, C. J. Stein, F. Heidar-Zadeh, M. Liu, M. Head-Gordon, L. Bertels, H. Hao, I. Leven, and T. Head-Gordon, “NewtonNet: A Newtonian message passing network for deep learning of interatomic potentials and forces,” Digital Discovery 1, 333–343 (2022).
49. V. Vassilev-Galindo, G. Fonseca, I. Poltavsky, and A. Tkatchenko, “Challenges for machine learning force fields in reproducing potential energy surfaces of flexible molecules,” J. Chem. Phys. 154, 094119 (2021).
50. A. P. Bartók, R. Kondor, and G. Csányi, “On representing chemical environments,” Phys. Rev. B 87, 184115 (2013).
51. F. Musil, M. Veit, A. Goscinski, G. Fraux, M. J. Willatt, M. Stricker, T. Junge, and M. Ceriotti, “Efficient implementation of atom-density representations,” J. Chem. Phys. 154, 114109 (2021).
52. J. S. Smith, O. Isayev, and A. E. Roitberg, “ANI-1: An extensible neural network potential with DFT accuracy at force field computational cost,” Chem. Sci. 8, 3192–3203 (2017).
53. X. Gao, F. Ramezanghorbani, O. Isayev, J. S. Smith, and A. E. Roitberg, “TorchANI: A free and open source PyTorch-based deep learning implementation of the ANI neural network potentials,” J. Chem. Inf. Model. 60, 3408–3415 (2020).
54. C. Devereux, J. S. Smith, K. K. Huddleston, K. Barros, R. Zubatyuk, O. Isayev, and A. E. Roitberg, “Extending the applicability of the ANI deep learning molecular potential to sulfur and halogens,” J. Chem. Theory Comput. 16, 4192–4202 (2020).
55. W. Kahan, “Pracniques: Further remarks on reducing truncation errors,” Commun. ACM 8, 40 (1965).
56. Y. Lysogorskiy, C. v. d. Oord, A. Bochkarev, S. Menon, M. Rinaldi, T. Hammerschmidt, M. Mrovec, A. Thompson, G. Csányi, C. Ortner, and R. Drautz, “Performant implementation of the atomic cluster expansion (PACE) and application to copper and silicon,” npj Comput. Mater. 7, 97 (2021).
57. J. Weinreich, A. Römer, M. L. Paleico, and J. Behler, “Properties of α-brass nanoparticles. 1. Neural network potential energy surface,” J. Phys. Chem. C 124, 12682–12695 (2020).
58. J. Weinreich, M. L. Paleico, and J. Behler, “Properties of α-brass nanoparticles II: Structure and composition,” J. Phys. Chem. C 125, 14897–14909 (2021).
59. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank,” Nucleic Acids Res. 28, 235–242 (2000).
60. A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, “Commentary: The Materials Project: A materials genome approach to accelerating materials innovation,” APL Mater. 1, 011002 (2013).
61. A. S. Rosen, S. M. Iyer, D. Ray, Z. Yao, A. Aspuru-Guzik, L. Gagliardi, J. M. Notestein, and R. Q. Snurr, “Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery,” Matter 4, 1578–1597 (2021).
62. A. S. Rosen, V. Fung, P. Huck, C. T. O’Donnell, M. K. Horton, D. G. Truhlar, K. A. Persson, J. M. Notestein, and R. Q. Snurr, “High-throughput predictions of metal–organic framework electronic properties: Theoretical challenges, graph neural networks, and data exploration,” npj Comput. Mater. 8, 112 (2022).
63. A. Rizzi, J. D. Chodera, L. N. Naden, P. Grinaway, K. A. Beauchamp, J. Fass, B. Rustenburg, G. Ross, D. W. H. Swenson, H. B. Macdonald, I. Pulido, I. Zhang, D. Rufa, and M. Henry, “OpenMMTools,” https://github.com/choderalab/openmmtools, 2022.