Neural networks fit to reproduce the potential energy surfaces of quantum chemistry methods offer a realization of analytic potential energy surfaces with the accuracy of *ab initio* methods at a computational cost similar to classical force field methods. One promising class of neural networks for this task is the SchNet architecture, which is based on the use of continuous-filter convolutional neural networks. Previous work has shown the ability of the SchNet architecture to reproduce density functional theory energies and forces for molecular configurations sampled during equilibrated molecular dynamics simulations. Due to the large change in energy when bonds are broken and formed, fitting global potential energy surfaces is normally a more difficult task than fitting the potential energy surface in the region of configurational space sampled during equilibrated molecular dynamics simulations. Herein, we demonstrate the ability of the SchNet architecture to reproduce the energies and forces of the potential energy surfaces of the H + H_{2} and Cl + H_{2} reactions and the OCHCO^{+} and H_{2}CO/*cis*-HCOH/*trans*-HCOH systems. The SchNet models reproduce the potential energy surfaces of the reactions well, with the best performing SchNet model having a test set root-mean-square error of 0.52 meV and 2.01 meV for the energies of the H + H_{2} and Cl + H_{2} reactions, respectively, and a test set mean absolute error for the force of 0.44 meV/bohr for the H + H_{2} reaction. For the OCHCO^{+} and H_{2}CO/*cis*-HCOH/*trans*-HCOH systems, the best performing SchNet model has a test set root-mean-square error of 2.92 meV and 13.55 meV, respectively.

## I. INTRODUCTION

Fitting machine learning-based potential energy surfaces to reproduce high-level quantum chemistry calculations has been the subject of extensive previous research. For larger biological systems, machine learning-based potentials offer the possibility of force fields with an *ab initio* level of accuracy at a computational cost similar to classical force field methods.^{1–7} For smaller systems, machine learning-based potentials offer the possibility of allowing many more simulations to be run in dynamical studies leading to better convergence of dynamical properties.^{8–11}

As an example, in fewest-switches surface-hopping (FSSH) studies, a large number of trajectories are needed to determine excitation lifetimes. In principle, configuration interaction (CI)-based approaches are the most straightforward to apply for FSSH studies,^{12} but due to their computational cost, CI-based approaches are infeasible for larger systems, and many FSSH studies utilize semiempirical methods such as AM1^{13} or PM3^{14} in conjunction with FOMO-CASCI^{15,16} or other related methods.^{17} The development of CASPT2 analytic gradients and derivative couplings^{18–20} compounds this problem further because of the even higher computational cost of CASPT2. CASPT2-FSSH studies have been performed on the anionic GFP model chromophore, *p*HBI,^{21} but due to the computational expense, only 20 trajectories could be run, which led to poor convergence of the excitation lifetime, as 20 trajectories are much fewer than the hundreds or thousands of trajectories that are typical for FSSH studies when using semiempirical methods. Machine learning-based methods that accurately reproduce the CASPT2 potential energy surfaces and nonadiabatic couplings would enable more trajectories to be run in FSSH calculations on the CASPT2 potential energy surface, and because of this potential impact, the extension of machine learning-based methods to nonadiabatic simulations is an active area of research.^{22,23}

For most FSSH simulations or dynamical studies of chemical reactivity, it is essential to accurately reproduce the potential energy surface in regions of configurational space that are distant from any equilibrated structure. Commonly, initial studies that develop new machine learning-based methods for fitting potential energy surfaces only examine the ability of the machine learning method to reproduce the potential energy surface in regions of configurational space sampled during equilibrated molecular dynamics calculations.^{5,24} In this study, we show the ability of one newly introduced machine learning architecture, the SchNet architecture,^{24–26} to fit regions of configurational space far from equilibrium by the reproduction of global potential energy surfaces.

A key concern when using machine learning to reproduce potential energy surfaces is the choice of representation of the molecule.^{27} Ideally, the predicted properties of the machine-learning model, such as the energy, are invariant to translations or rotations of the entire system and to the interchange of identical particles, as the energy of the electronic Hamiltonian is invariant to these transformations. Previous approaches for satisfying this invariance include mirroring the fitting data with respect to all possible symmetry operations,^{28,29} using neural networks whose structure is invariant to the interchange of identical particles,^{30,31} using permutationally invariant polynomials,^{32,33} and using symmetry functions, where the total energy is given as a sum of atomic energy contributions.^{1,2}

For image recognition tasks in computer science, a class of deep neural networks called convolutional neural networks^{34,35} (CNNs) has proven to be highly accurate,^{36,37} with the best performing CNNs winning the ImageNet challenge^{38} in multiple years. CNNs are a type of multilayer perceptron network built from filters, which are collections of shared weights and biases that each represent some feature of the input. The use of weight and bias sharing limits the number of parameters that must be optimized during training, which allows training on larger data sets than is practical for dense multilayer perceptron networks. An additional major advantage of CNNs over other machine learning-based approaches is that they require very little preprocessing; i.e., they are able to learn the correct representation for the system. Because of these advantages, CNNs have seen wide use in chemistry for the prediction of molecular properties.^{39–43}

A difficulty when using CNNs to fit potential energy surfaces is that the standard convolution filters used in CNNs are discrete mathematical objects that convolve the filter with the input data on a grid rather than in continuous space.^{44} This does not present a problem for image recognition tasks because computer images are made up of discrete pixels, but it is an issue when computing the energy of a molecule, as the distance between two atoms of the molecule can be any real number. Energies can still be obtained from discrete convolution filters, but the resulting potential energy surface is no longer smooth,^{25} so any dynamical study on the potential energy surface would suffer from a lack of energy conservation. Seeking to eliminate this deficiency, in 2017, Schütt *et al.*^{24–26} pioneered the use of continuous-filter CNNs for obtaining molecular energies and properties by introducing the SchNet architecture, which demonstrated excellent performance for molecular properties on the QM9 database^{45,46} and for the prediction of energies and forces on the MD17 database.^{47} The MD17 database contains eight systems and consists of points in configurational space obtained from equilibrated molecular dynamics simulations. While the region of configurational space sampled in the MD17 database is useful for many molecular dynamics studies, reactive chemical dynamics studies regularly sample configurations in nonequilibrated regions of configurational space, which are not contained in the MD17 database. The goal of this study is to assess the suitability of the SchNet architecture for reactive chemical dynamics studies by reproducing previously derived global potential energy surfaces that include regions of configurational space where bonds are broken and formed.

## II. SchNet ARCHITECTURE

Here, we outline the important features of the SchNet architecture. The interested reader should consult the original SchNet articles for additional details.^{24,25} A schematic of the SchNet architecture is presented in Fig. 1. The SchNet architecture has four main components, which we outline in sequence: atom embeddings, atomwise layers, interaction blocks, and filter generating networks.

### A. Atom embedding

A molecule with *n* atoms is completely specified by its nuclear charges, *Z* = (*Z*_{1}, …, *Z*_{n}), and the positions of these nuclear charges, *R* = (**r**_{1}, …, **r**_{n}). In the SchNet architecture, the system is represented by a tuple of features $X^l = (\mathbf{x}_1^l, \ldots, \mathbf{x}_n^l)$, where each $\mathbf{x}_i^l$ is a real-valued vector of dimension *F* describing the features of atom *i*. *F* is called the feature map dimension, and $\mathbf{x}_i^l$ is defined as the representation of atom *i*. The initial values of $\mathbf{x}_i^0$ are given by an atom-specific embedding layer, $\mathbf{x}_i^0 = \mathbf{a}_{Z_i}$, which is initialized randomly at the start of training and subsequently optimized during training.
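As a minimal NumPy sketch of this lookup (the feature dimension of 8 matches the value used later in this study; `max_Z` and the random initialization are illustrative assumptions, since the trained embedding values are model specific):

```python
import numpy as np

rng = np.random.default_rng(0)

F = 8        # feature map dimension (value used for the fits in this study)
max_Z = 20   # hypothetical upper bound on nuclear charge

# Embedding matrix a: one randomly initialized F-dimensional row per element.
# During training these rows would be optimized; here they stay fixed.
a = rng.standard_normal((max_Z, F))

Z = np.array([17, 1, 1])   # nuclear charges for Cl + H2

# Initial atomic representations x_i^0 = a_{Z_i}: a simple row lookup.
X0 = a[Z]

assert X0.shape == (3, F)
# The two hydrogens start from identical representations.
assert np.allclose(X0[1], X0[2])
```

Because identical nuclei share a row of the embedding matrix, identical atoms always enter the network with identical initial representations.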

### B. Atomwise layer

Atomwise layers are dense neural-network layers applied to each of the atomic representations, $\mathbf{x}_i^l$,

$$\mathbf{x}_i^{l+1} = W^l \mathbf{x}_i^l + \mathbf{b}^l. \quad (1)$$

Weights, *W*^{l}, and biases, **b**^{l}, are shared among different atoms, which prevents the number of parameters in the SchNet architecture from growing with the number of atoms. The weights and biases are optimized during training.
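A sketch of a single atomwise layer in NumPy (random values stand in for trained weights and biases; the batching over atoms is just a matrix product):

```python
import numpy as np

rng = np.random.default_rng(1)
n, F = 3, 8

X = rng.standard_normal((n, F))   # atomic representations x_i^l
W = rng.standard_normal((F, F))   # weights shared by all atoms
b = rng.standard_normal(F)        # biases shared by all atoms

# Atomwise layer: the same affine map applied independently to each atom,
# x_i^{l+1} = W x_i^l + b, so the parameter count is independent of n.
X_next = X @ W + b

assert X_next.shape == (n, F)
# Weight sharing makes the layer equivariant to reordering the atoms.
perm = [2, 0, 1]
assert np.allclose(X[perm] @ W + b, X_next[perm])
```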

### C. Interaction block

The interaction blocks update each atomic representation due to the presence of the other atoms. The key improvement of the SchNet architecture over other CNN-based approaches is the introduction of continuous-filter convolutional layers rather than the discrete-filter convolutional layers that are more commonly used in computer science.^{34,37} Given the tuple, $X^l = (\mathbf{x}_1^l, \ldots, \mathbf{x}_n^l)$, of atomic representations along with the position vector, *R* = (**r**_{1}, …, **r**_{n}), of the atoms, the continuous-filter convolutional layer for atom *i* is defined as

$$\mathbf{x}_i^{l+1} = \sum_{j} \mathbf{x}_j^l \circ W^l(\mathbf{r}_j - \mathbf{r}_i), \quad (2)$$

where $\circ$ is elementwise multiplication and $W^l(\mathbf{r}_j - \mathbf{r}_i)$ is a filter-generating network between atoms *i* and *j* defined in Sec. II D.
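The mechanics of the continuous-filter convolution can be sketched with a hypothetical stand-in for the filter-generating network (any smooth function of the interatomic displacement that depends only on its length works for illustration; summing over neighbors $j \ne i$ is one common convention):

```python
import numpy as np

rng = np.random.default_rng(2)
n, F = 3, 8
X = rng.standard_normal((n, F))   # atomic representations x_j^l
R = rng.standard_normal((n, 3))   # atomic positions r_j

def filter_net(d):
    # Hypothetical stand-in for W^l: a smooth map from an interatomic
    # displacement to F filter values, depending only on the distance,
    # so the result is invariant to translations and rotations.
    return np.cos(np.linalg.norm(d) * np.arange(1, F + 1))

# Continuous-filter convolution: elementwise product of each neighbor's
# representation with its filter, summed over neighbors j != i.
X_conv = np.stack([
    sum(X[j] * filter_net(R[j] - R[i]) for j in range(n) if j != i)
    for i in range(n)
])

assert X_conv.shape == (n, F)
```

Because the filter is evaluated at arbitrary real-valued displacements rather than on a grid, the output varies smoothly as the atoms move.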

As shown in Fig. 1(b), an interaction block is defined as an atomwise layer, followed by a continuous-filter convolutional layer, an atomwise layer, the application of the activation function to each feature element, and a final atomwise layer. The resulting interaction for atom *i* is then added to the representation of atom *i*. For the activation function, the original SchNet studies chose a shifted softplus function, *f*(*x*) = ln(1 + *e*^{x}) − ln(2), for all activations, but we will explore using other activation functions. In this study, we will refer to the application of the activation function to each feature element as an activation layer.

### D. Filter generating networks

A schematic of the filter-generating network introduced in Eq. (2) is shown in Fig. 1(c). The input to the filter depends only on the distances between atoms, which gives the SchNet architecture invariance to both translations and rotations of the system. These distances are then used as input for a set of radial basis functions (RBF), *e*_{k}, defined as

$$e_k(\mathbf{r}_j - \mathbf{r}_i) = \exp\left(-\gamma \left(\lVert \mathbf{r}_j - \mathbf{r}_i \rVert - \mu_k\right)^2\right). \quad (3)$$

The number of radial basis functions, *K*, is a metaparameter, and the center of each radial basis function, *μ*_{k}, is chosen such that the *K* radial basis functions are equally spaced between 0 and a distance cutoff, *d*_{K}, another metaparameter. The distance, *d*_{K}, provides a soft cutoff such that if two atoms are more than *d*_{K} apart, then the direct interaction between the two atoms is greatly reduced. *γ* in Eq. (3) is a scaling metaparameter. The use of radial basis functions improves the speed of training by reducing the correlation of the filter values. After the values of the radial basis functions have been obtained, a dense layer is applied that has the output dimension equal to the feature dimension. An activation layer is applied to this output, then another dense layer with the output equal to the feature dimension, and finally one more activation layer to obtain the final output of the filter-generating network.
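The radial basis expansion of Eq. (3) is compact enough to sketch directly; the metaparameter values below mirror those used for the H + H_{2} fits in this study (*K* = 100, *d*_{K} = 20.0 bohr, *γ* = 5.0):

```python
import numpy as np

# RBF expansion of a distance: e_k(d) = exp(-gamma * (d - mu_k)^2),
# with K centers mu_k equally spaced between 0 and the cutoff d_K.
K, d_K, gamma = 100, 20.0, 5.0
mu = np.linspace(0.0, d_K, K)

def rbf(d):
    return np.exp(-gamma * (d - mu) ** 2)

e = rbf(1.4)   # a typical H-H distance, in bohr
assert e.shape == (K,)
# The expansion peaks at the center closest to the input distance.
assert abs(mu[np.argmax(e)] - 1.4) <= d_K / (K - 1)
```

A single scalar distance is thus turned into a *K*-dimensional, nearly one-hot vector, which decorrelates the inputs to the subsequent dense layers.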

### E. Other notes

After all interaction layers have been applied to each embedded atomic representation, a dense layer with output dimension half the feature dimension is applied followed by an activation layer. Finally, a dense layer with an output dimension of one is applied giving the final output value for each atom. From these final outputs, sum pooling or average pooling is applied depending on the property of interest similar to the method of Behler and Parrinello.^{1} The use of pooling gives the SchNet architecture invariance to the interchange of identical particles. For the present study, where we only fit the energy and force, we will only apply sum pooling.
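The permutation invariance conferred by sum pooling can be seen in a two-line sketch (random per-atom outputs stand in for the network's final atomic contributions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
atomic_outputs = rng.standard_normal(n)   # final scalar output for each atom

# Sum pooling: the total energy is the sum of atomic contributions, so any
# permutation of identical atoms leaves the predicted energy unchanged.
E = atomic_outputs.sum()

perm = rng.permutation(n)
assert np.isclose(atomic_outputs[perm].sum(), E)
```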

For each SchNet model, training is performed by optimization of a loss function. For the prediction of energies with the SchNet architecture, the loss function is the mean-squared error between the predicted energy of the SchNet model and the reference energy,

$$l(\hat{E}, E) = \lVert E - \hat{E} \rVert^2, \quad (4)$$

where *l* is the loss, $\hat{E}$ is the energy prediction of the SchNet model, and *E* is the reference energy.

Automatic differentiation is available in many machine-learning code packages such as TensorFlow.^{48} Using automatic differentiation, the gradient of the potential, and thus the force for dynamical calculations on the potential energy surface, is obtained. Using automatic differentiation in this way guarantees that the forces are conservative, which is essential for dynamics. Unfortunately, if the loss function only contains information about the energy, the predicted force on the potential energy surface of a SchNet model can be poor. This is particularly true for reactive potential energy surfaces, as will be seen in Sec. III. Because of this issue, for systems in which the force is needed, the SchNet architecture can be modified to include force information in the loss function,

$$l\big(\hat{E}, (E, \mathbf{F}_1, \ldots, \mathbf{F}_n)\big) = \rho \lVert E - \hat{E} \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \left\lVert \mathbf{F}_i - \left(-\frac{\partial \hat{E}}{\partial \mathbf{r}_i}\right) \right\rVert^2. \quad (5)$$

Here, *ρ* is a metaparameter that controls the relative importance of the energy to the force in the loss function.^{49} In this study, we will specify SchNet models where no force information was included in the loss function as having *ρ* = ∞.
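A sketch of the combined energy/force loss for a single configuration (the per-atom normalization of the force term follows one common convention and is an assumption here; *ρ* = ∞ is handled explicitly as the energy-only limit, matching the notation above):

```python
import numpy as np

def loss(E_pred, E_ref, F_pred, F_ref, rho):
    """Per-configuration energy/force loss; rho = np.inf recovers the
    energy-only loss (the force term is dropped)."""
    energy_term = (E_ref - E_pred) ** 2
    if np.isinf(rho):
        return energy_term
    n = len(F_ref)
    # Mean over atoms of the squared force residual (3 components each).
    force_term = sum(np.sum((F_ref[i] - F_pred[i]) ** 2) for i in range(n)) / n
    return rho * energy_term + force_term

F_ref = np.zeros((3, 3))
F_pred = np.ones((3, 3))
assert np.isclose(loss(1.0, 2.0, F_pred, F_ref, np.inf), 1.0)
assert np.isclose(loss(1.0, 2.0, F_pred, F_ref, 0.1), 0.1 * 1.0 + 3.0)
```

In practice `F_pred` would be the negative gradient of the model energy obtained by automatic differentiation, so minimizing this loss fits energies and conservative forces simultaneously.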

## III. INITIAL BENCHMARKING

To demonstrate the ability of the SchNet architecture to reproduce global potential energy surfaces, we test on two triatomic systems: the H + H_{2} reaction and the Cl + H_{2} reaction. These systems are chosen to enable a comparison of the SchNet architecture to another machine learning-based approach, as they were previously used for testing the permutationally invariant polynomial neural network (PIP-NN) method.^{32} As will be discussed later, due to differences in the fitting procedure, a quantitative comparison between the two methods is still difficult. Rather than using the SchNet architecture to generate a new potential energy surface from *ab initio* calculations, we seek to reproduce the Boothroyd-Keogh-Martin-Peterson (BKMP2) potential^{50} and the Capecchi-Werner (CW) potential^{51} for the H + H_{2} reaction and the Cl + H_{2} reaction, respectively, to speed up the generation of the fitting data. As both the BKMP2 and CW potentials accurately reproduce an *ab initio* potential energy surface, the use of an analytic potential rather than *ab initio* calculations to generate the fitting data should have no impact on the conclusion of whether the SchNet architecture is able to reproduce global potential energy surfaces with data from *ab initio* calculations.

The coordinates for sampling the potential energy surfaces were generated using a Sobol sequence,^{52,53} a quasirandom low-discrepancy sequence, which ensures that all of the configurational space in the fitting region is evenly sampled. For the H + H_{2} reaction, two different Sobol sequences were used to generate coordinates in a reactant channel and interaction region defined by *R*_{HH′} ∈ [0.9, 3.0] bohr, *R*_{HH″} ∈ [4.5, 10.0] bohr, *θ*_{H′HH″} ∈ [0, π] radians for the reactant channel and *R*_{HH′} ∈ [0.9, 4.5] bohr, *R*_{HH″} ∈ [0.9, 4.5] bohr, *θ*_{H′HH″} ∈ [0, π] radians for the interaction region. For the H + H_{2} reaction, the reactant and product channels are identical, so no additional fitting data for the product channel are needed. A small number of coordinates were generated from a Sobol sequence in a larger region defined by *R*_{HH′} ∈ [0.9, 10.0] bohr, *R*_{HH″} ∈ [0.9, 10.0] bohr, *θ*_{H′HH″} ∈ [0, π] radians, which were used to provide a better overall fit. For all of these points, all atoms were constrained to be in the YZ plane. This does not affect the energy but does constrain the force to only have nonzero y- and z-components. The Sobol sequence was generated with 8000, 8000, and 200 points in the reactant channel, interaction region, and larger region, respectively. The energies and forces of these points were then calculated using the BKMP2 potential.
From these combined three sets of points, a point was removed from the fitting data if *R*_{H′H″} < 0.9 bohr, if the energy of the point was more than 6.0 eV higher than the minimum of the BKMP2 potential, or, for points in the larger region, if the point had a norm of less than 0.1 bohr from that of any point in the reactant channel or interaction region, where the norm is defined as $\mathrm{norm} = \sqrt{\sum_{i}^{N_{\mathrm{atoms}}} d_i^2}$, where *d*_{i} is the Cartesian distance between the coordinates of atom *i* for a fitting point in the larger region and a point in the reactant channel or interaction region and *N*_{atoms} is the number of atoms in the system. After the removal of these points, a total of 15 787 points remained.
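The sampling and pruning scheme can be sketched with `scipy.stats.qmc` (a power-of-two sample count is used here because it preserves the balance properties of the Sobol sequence; the study used 8000 points, and only the interaction region and the *R*_{H′H″} cutoff are illustrated):

```python
import numpy as np
from scipy.stats import qmc

# Quasirandom points in [0, 1)^3 from a Sobol sequence.
sampler = qmc.Sobol(d=3, scramble=False)
u = sampler.random(8192)

# Map to the H + H2 interaction region: R_HH', R_HH'' in bohr, theta in rad.
lo = np.array([0.9, 0.9, 0.0])
hi = np.array([4.5, 4.5, np.pi])
pts = qmc.scale(u, lo, hi)

# Prune points with too-short H'-H'' separation, computed from the
# sampled coordinates by the law of cosines (H sits at the vertex).
r1, r2, theta = pts.T
r12 = np.sqrt(r1**2 + r2**2 - 2.0 * r1 * r2 * np.cos(theta))
pts = pts[r12 >= 0.9]

assert pts.shape[1] == 3
assert np.all(pts[:, 0] >= 0.9) and np.all(pts[:, 0] <= 4.5)
```

The surviving coordinates would then be passed to the reference potential (BKMP2 here) to generate energies and forces for fitting.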

For the Cl + H_{2} reaction, the generation of fitting data was similar to that of the H + H_{2} reaction. With the reactant channel defined as the Cl + H_{2} channel and the product channel defined as the HCl + H channel, three Sobol sequences were used to generate coordinates in a reactant channel, product channel, and interaction region defined by *R*_{HCl} ∈ [5.5, 11.0] bohr, *R*_{HH′} ∈ [0.9, 3.0] bohr, *θ*_{ClHH′} ∈ [0, π] radians for the reactant channel; *R*_{HCl} ∈ [2.0, 4.0] bohr, *R*_{HH′} ∈ [4.5, 10.0] bohr, *θ*_{ClHH′} ∈ [0, π] radians for the product channel; and *R*_{HCl} ∈ [2.0, 5.5] bohr, *R*_{HH′} ∈ [0.9, 4.5] bohr, *θ*_{ClHH′} ∈ [0, π] radians for the interaction region. A small number of coordinates were generated from a Sobol sequence in a larger region defined by *R*_{HCl} ∈ [2.0, 11.0] bohr, *R*_{HH′} ∈ [0.9, 10.0] bohr, *θ*_{ClHH′} ∈ [0, π] radians to provide a better overall fit. The Sobol sequence was generated with 4000, 4000, 12 000, and 100 points in the reactant channel, product channel, interaction region, and larger region, respectively. The energies of these points were then calculated using the CW potential. From these combined four sets of points, a point was removed from the fitting data if *R*_{H′Cl} < 2.0 bohr, if the energy of the point was more than 3.5 eV higher than the minimum of the CW potential, or, for points in the larger region, if the point had a norm of less than 0.1 bohr from that of a point in the reactant channel, product channel, or interaction region. After the removal of these points, a total of 16 946 points remained.

Adopting best practices from computer science,^{54} studies that seek to fit potential energy surfaces using neural networks normally partition the fitting data into three disjoint sets: (i) a training set, on which the parameters of the machine-learning model are trained by minimization of the loss function of the training set, (ii) a validation set, against which the accuracy of the machine-learning model is monitored during training, and (iii) a test set, which is used to determine the final accuracy of the machine-learning model. Machine-learning models can suffer from overfitting, which causes the models to not generalize well to data outside of the training set. The use of a validation set that is independent of the test set enables early stopping, whereby the final parameters of the model are taken to be the parameters that minimize the loss function of the validation set.

For the H + H_{2} reaction, the fitting data were partitioned into a training set of 12 000 points and a validation set of 2400 points with the remaining fitting data in the test set. For the Cl + H_{2} reaction, the fitting data were partitioned into a training set of 13 000 points and a validation set of 2600 points with the remaining fitting data in the test set. To address the possibility of a randomly chosen partition leading to poor results, we partition our fitting data five times and fit a SchNet model to each of these partitions. This training of multiple sets of parameters to assess performance is common when training machine-learning models for the reproduction of potential energy surfaces.^{52}
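The partitioning for the H + H_{2} data amounts to a shuffled three-way split, sketched below (the study repeats this five times, each with a fresh shuffle, and fits a model to each partition):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 15_787   # H + H2 fitting data remaining after pruning

# One random train/validation/test partition of the fitting data.
idx = rng.permutation(n_points)
train, val, test = idx[:12_000], idx[12_000:14_400], idx[14_400:]

assert len(train) == 12_000 and len(val) == 2_400 and len(test) == 1_387
# The three sets are disjoint.
assert len(np.intersect1d(train, val)) == 0
assert len(np.intersect1d(val, test)) == 0
```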

All SchNet models fit to reproduce the potential energy surface of the H + H_{2} reaction used a filter size and feature dimension of 8. A set of 100 RBF with a distance cutoff of 20.0 bohr was used, and *γ* in Eq. (3) was set to 5.0. As 20.0 bohr is larger than the extent of the fitting region, for practical purposes, no distance cutoff was used. SchNet models were fit with *ρ* in Eq. (5) set to ∞, 0.1, and 1.0 and with 2 and 3 interaction blocks. 5 × 10^{6} batch gradient training steps were performed with a batch size of 32 using stochastic gradient descent with the ADAM optimizer.^{55} An initial learning rate of 0.001 was used for the training steps. The learning rate was reduced by exponential decay with a ratio of 0.96 every 100 000 steps. The loss of the validation set was computed every 1000 batch gradient training steps for early stopping.
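The learning-rate schedule described above can be written as a one-line function (whether the decay is staircased, as assumed here, or applied continuously is an implementation detail not specified in the text):

```python
# Exponential learning-rate decay: starting from 0.001, multiply by 0.96
# every 100 000 batch gradient steps (staircase decay assumed).
def learning_rate(step, lr0=1e-3, ratio=0.96, decay_steps=100_000):
    return lr0 * ratio ** (step // decay_steps)

assert learning_rate(0) == 1e-3
assert abs(learning_rate(200_000) - 1e-3 * 0.96**2) < 1e-15
```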

The original SchNet architecture used the shifted softplus activation function. While the shifted softplus activation function has proven to be accurate for CNNs, recent studies have shown that other activation functions, such as the swish activation function, can improve the predictions of CNNs.^{56} Other recent studies have claimed that shifting vs not shifting the activation function can also lead to better models.^{57} Therefore, in this study, we also explore the use of different activation functions in the SchNet architecture by fitting SchNet models not only with the shifted softplus activation function but also with the swish activation function,

$$f(x) = \frac{x}{1 + e^{-x}}, \quad (6)$$

and the unshifted softplus activation function,

$$f(x) = \ln(1 + e^{x}). \quad (7)$$
A graph of the activation functions used in this study is shown in Fig. 2. All the activation functions have a continuous first derivative, which is necessary to generate continuous potential energy surfaces.
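The three activation functions compared in this study are simple to state in code; all are smooth, so forces obtained by differentiating the fitted potential are continuous:

```python
import numpy as np

def shifted_softplus(x):
    # f(x) = ln(1 + e^x) - ln(2); passes through the origin.
    return np.log1p(np.exp(x)) - np.log(2.0)

def unshifted_softplus(x):
    # f(x) = ln(1 + e^x); f(0) = ln(2).
    return np.log1p(np.exp(x))

def swish(x):
    # f(x) = x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

assert np.isclose(shifted_softplus(0.0), 0.0)
assert np.isclose(unshifted_softplus(0.0), np.log(2.0))
assert np.isclose(swish(0.0), 0.0)
```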

For the reproduction of the potential energy surface of the Cl + H_{2} reaction, the fitting of each SchNet model was performed in a manner identical to that for the H + H_{2} reaction, except that because no analytic gradient was derived for the CW potential,^{51} no fitting was performed that included any contributions from the force in the loss function; i.e., *ρ* = ∞ for all models for the Cl + H_{2} reaction.

The test set root-mean-square error (RMSE) for the energy of the best performing SchNet model for the H + H_{2} and Cl + H_{2} reactions for a given SchNet architecture is presented in Table I. Results for every SchNet model from the five different training runs can be found in the supplementary material. Overall, the SchNet architecture performs well at reproducing the energies of the reference potentials for all combinations of activation functions, interaction blocks, and values of *ρ*, with test set RMSEs for the energy ranging from 0.52 meV to 3.46 meV for the H + H_{2} reaction and from 2.01 meV to 2.24 meV for the Cl + H_{2} reaction.

**TABLE I.** Test set RMSE (meV) for the energy of the best performing SchNet models for the H + H_{2} and Cl + H_{2} reactions.

|  | H + H_{2} |  |  |  |  |  | Cl + H_{2} |  |
|---|---|---|---|---|---|---|---|---|
| *ρ* | ∞ |  | 1.0 |  | 0.1 |  | ∞ |  |
| Interaction blocks | 2 | 3 | 2 | 3 | 2 | 3 | 2 | 3 |
| Shifted softplus | 1.44 | 1.76 | 1.44 | 0.60 | 3.46 | 0.52 | 2.21 | 2.01 |
| Swish | 1.31 | 0.88 | 1.35 | 0.77 | 0.91 | 0.65 | 2.01 | 2.07 |
| Unshifted softplus | 0.77 | 0.84 | 1.17 | 0.89 | 1.67 | 1.69 | 2.24 | 2.02 |


Setting *ρ* ≠ ∞ lessens the weight of the energy in the loss function, but for the H + H_{2} reaction, the lowest test set RMSE for the energy for the shifted softplus and swish activation functions is obtained with *ρ* = 0.1, confirming previous evidence that, when fitting potential energy surfaces, including the force in the loss function can improve the prediction of energies.^{24} For the shifted softplus and swish activation functions with *ρ* = 1.0 and *ρ* = 0.1, energy predictions are significantly improved using three interaction blocks rather than two interaction blocks, with the test set RMSE decreasing by nearly a factor of seven in one case. This is likely because including the force in the loss function effectively increases the dimension of the training data: the loss function, in the case of the H + H_{2} reaction, becomes a ten-dimensional function of the reference data (one energy and nine force components) rather than a one-dimensional function. Two interaction blocks appear to lack the flexibility to fit the energy and force at the same time, leading to a larger energy error.

For the Cl + H_{2} reaction, all activation functions and numbers of interaction blocks give similar results, so it is hard to draw any broad conclusion from the data other than that the SchNet model is able to fit reactive systems with more than one type of atom. As no analytic gradient was published for the CW potential, we did not train any SchNet models using the force in the loss function.

The error of every point in the test set for the best performing SchNet model relative to the energy of the BKMP2 potential for the H + H_{2} reaction is shown in Fig. 3. The SchNet model performs slightly worse for higher energy points, with a maximum absolute error of 2.7 meV. From the results presented here, the SchNet model shows a slight bias toward overestimating the energy. We note that an overall bias in fitting the energy should have no effect on dynamical properties, which depend on the forces, or on relative energies, which are typically of more interest than absolute energies. The best performing model for fitting the energy has *ρ* = 0.1, and we hypothesize that this bias is due to the SchNet model placing more weight in the loss function on fitting the force rather than the energy. Evidence for this hypothesis can be seen by examining the error of every point in the test set relative to the energy of the BKMP2 potential for the best performing SchNet model that does not contain any force information in the loss function, i.e., *ρ* = ∞, as shown in Fig. 4. The spread of the errors of the SchNet model in Fig. 4 is qualitatively similar to that of the errors of the SchNet model in Fig. 3, except that the spread covers a larger range and does not show any positive bias. Further evidence for this hypothesis can be seen in Fig. 5, where the error of every point in the test set for the best performing SchNet model for the Cl + H_{2} reaction relative to the energy of the CW potential is presented. The SchNet model in Fig. 5 contains no force information in the loss function and shows no positive bias in the error. The best performing SchNet model for the Cl + H_{2} reaction has a larger maximum error of 16 meV relative to that of the best performing SchNet model for the H + H_{2} reaction, but a maximum error of 16 meV is still within the normal range of error for previous machine learning-based models for fitting the CW potential.^{32}

Contour plots of the energy of the best performing SchNet model and reference potentials as a function of H–H distance or Cl–H distance with the bond angle relaxed to a minimum are shown in Fig. 6. For both the H + H_{2} and Cl + H_{2} reactions, the SchNet model accurately reproduces the contour plot over the entire fitting surface. We note that when finding the minimum energy bond angle of the SchNet model, the bond angle must be restricted to angles such that the distances between all pairs of atoms remain within the bounds of the fitting data.

The test set mean absolute error (MAE) for the force of the best performing SchNet model for a given SchNet architecture is presented in Table II. We report the MAE rather than the RMSE to enable a better comparison with previous SchNet results. Results for the test set MAE and test set RMSE of the force for every SchNet model from the five different training runs can be found in the supplementary material. As expected, the SchNet models do a poor job of fitting the force when force data are not included in the loss function. For the SchNet models fitted with the shifted softplus and swish activation functions and *ρ* ≠ ∞, models that use three interaction blocks perform significantly better than those with two interaction blocks.

**TABLE II.** Test set MAE (meV/bohr) for the force of the best performing SchNet models for the H + H_{2} reaction.

|  | H + H_{2} |  |  |  |  |  |
|---|---|---|---|---|---|---|
| *ρ* | ∞ |  | 1.0 |  | 0.1 |  |
| Interaction blocks | 2 | 3 | 2 | 3 | 2 | 3 |
| Shifted softplus | 2.72 | 2.77 | 1.32 | 0.49 | 1.06 | 0.44 |
| Swish | 2.40 | 2.08 | 1.37 | 0.53 | 0.81 | 0.49 |
| Unshifted softplus | 1.70 | 1.80 | 1.00 | 0.77 | 1.08 | 1.07 |


The error of the force for every force element in the test set for the best performing SchNet model relative to the force of the BKMP2 potential for the H + H_{2} reaction is shown in Fig. 7. The best performing model for the force in Fig. 7 is the same model as the best performing model for the energy in Fig. 3. The maximum error in the force is 26 meV/bohr for a force element with value 1805 meV/bohr. The outlying points with the highest error are clustered toward force values with smaller absolute magnitude, but overall, the SchNet model fits the force well over the entire range of force values. In Fig. 7, the SchNet model shows a small bias toward underestimating the magnitude of the force. This bias mainly arises for force values with larger magnitude, which are not representative of typical force values.

Of the three activation functions tested in this study, the shifted softplus and swish activation functions perform similarly. For the H + H_{2} reaction, the unshifted softplus activation function performs worse than the other two activation functions, especially in the best-case scenario. For the shifted softplus and swish activation functions, force errors are greatly reduced when using *ρ* = 0.1 and three interaction blocks. This combination of *ρ* and number of interaction blocks also gives the smallest energy error, so we recommend always training with the force included in the loss function when the force is available.

Comparing the accuracy of the SchNet model to other machine learning-based approaches for fitting potential energy surfaces is difficult due to the lack of community-agreed-upon systems or data sets for benchmarking newly developed methods for reproducing global potential energy surfaces, like those that exist for molecular properties^{46,58,59} or for molecular systems near equilibrium.^{47} The closest point of comparison for the initial benchmarking in this study is the original PIP-NN study,^{32} as that study also examined the ability of a machine-learning architecture to reproduce the potential energy surfaces of the H + H_{2} and Cl + H_{2} reactions generated with the BKMP2 and CW potentials, respectively. We emphasize that due to the larger set of fitting data used in the present study and the lack of a partition of the fitting data into training, validation, and test sets in the PIP-NN study, any comparison should not be treated as a quantitative measure of the accuracy of the SchNet architecture or the PIP-NN method. With that caveat, the SchNet architecture performs well for the H + H_{2} and Cl + H_{2} reactions, with a test set RMSE for the energy of 0.5 meV and 2.0 meV for the best performing SchNet models, respectively, compared to an RMSE for the energy of 3.6 meV and 4.2 meV, respectively, for the PIP-NN method.^{32} The PIP-NN study did not report any error for the force of the H + H_{2} reaction, but we can compare to previous studies that examined the error of the GDML^{47} and SchNet^{25} architectures for predicting the force with the MD17 data set.^{47} For the eight systems in the MD17 data set, the MAE for the force ranged from 5.3 meV/bohr to 22.7 meV/bohr for the GDML architecture and from 1.1 meV/bohr to 7.6 meV/bohr for the SchNet architecture, compared to a MAE for the force of 0.44 meV/bohr for the SchNet architecture applied to the H + H_{2} reaction.
Fitting the force for configurations obtained from equilibrated molecular dynamics simulations, as in the previous SchNet study, should be easier than fitting the force for a global potential energy surface, as in this study, because the force should be smaller on average in the former case. Conversely, the systems in the previous study consisted of molecules with up to four different atom types rather than the single atom type of the H + H_{2} reaction. Fitting the force with multiple atom types is likely more difficult, so once again a direct comparison of the errors of the SchNet models in the initial benchmarking of this study to the errors of other studies is difficult, other than to say the errors in this study appear to fall within a range deemed acceptable in previous machine learning-based studies that fit the force of potential energy surfaces.
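For concreteness, the RMSE and MAE figures quoted throughout these comparisons can be computed as below; this is a minimal NumPy sketch, with the function name ours and the unit conversion (e.g., to meV or meV/bohr) left to the caller.

```python
import numpy as np

def error_metrics(pred, ref):
    """Return (RMSE, MAE, max absolute error) over a test set.

    pred and ref are array-like in the same units, e.g. energies in
    meV or flattened force components in meV/bohr.
    """
    err = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    max_abs = float(np.max(np.abs(err)))
    return rmse, mae, max_abs
```

Note that RMSE weights outliers more heavily than MAE, which is one reason both metrics appear in the literature being compared here.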

## IV. OCHCO^{+} AND H_{2}CO

As potential energy surfaces with more than three internal degrees of freedom are regularly fit with analytic potentials, benchmarking on test systems more complex than the H + H_{2} and Cl + H_{2} reactions is needed to fully assess the accuracy of the SchNet architecture. Therefore, SchNet fits of the global potential energy surfaces of the OCHCO^{+} and H_{2}CO systems were performed. Although containing fewer degrees of freedom than the OCHCO^{+} system, the H_{2}CO system is challenging to reproduce, as an accurate description of this system requires the fitting of three different isomers and their saddle points (Fig. 8). The OCHCO^{+} and H_{2}CO systems and their corresponding fitting data^{60,61} were previously used in a recent study^{62} investigating the accuracy of the permutationally invariant polynomial (PIP),^{63} the Gaussian process regression (GP),^{64,65} and the PIP-GP approaches.^{62} The systems and fitting data are chosen to enable a comparison of the SchNet architecture to these other approaches for reproducing global potential energy surfaces.

The fitting data for the OCHCO^{+} system consist of 7800 CCSD(T)-F12/aVTZ energies selected from a set of 8613 CCSD(T)-F12/aVTZ energies previously used for fitting the OCHCO^{+} system with the PIP method.^{60} The fitting data for the H_{2}CO system consist of 34 750 MRCI + Q/VTZ energies selected from a set of 67 193 MRCI + Q/VTZ energies previously used for fitting the H_{2}CO system with the PIP method.^{61} More information about the generation of the fitting data can be found in the previous studies. The fitting data for these two systems are randomly partitioned into training, validation, and test sets of sizes 7020, 390, and 390, respectively, for the OCHCO^{+} system and 31 275, 1738, and 1737, respectively, for the H_{2}CO system. We partition the fitting data five times and fit a SchNet model for each of the partitions. The parameters and procedure for training the SchNet models for the OCHCO^{+} and H_{2}CO systems are identical to those of the H + H_{2} and Cl + H_{2} reactions except that, due to the poor performance of the unshifted softplus activation function for the H + H_{2} and Cl + H_{2} reactions, we train no SchNet models with the unshifted softplus activation function. As the fitting data contain no gradient information, *ρ* = ∞ for all models for the OCHCO^{+} and H_{2}CO systems.
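The random partitioning described above can be sketched as follows. The function name and seed are illustrative; the sizes mirror the OCHCO^{+} split (7020/390/390 out of 7800 points) from the text, and the actual partitioning procedure of the study may differ in detail.

```python
import numpy as np

def partition_indices(n_points, n_train, n_val, seed=0):
    """Randomly split indices into disjoint train/validation/test sets.

    The remainder after the training and validation sets forms the
    test set. Repeating with different seeds gives the five
    independent partitions used for the five training runs.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_points)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

# OCHCO+ split from the text: 7020 / 390 / 390 out of 7800 points.
train, val, test = partition_indices(7800, 7020, 390)
```

Because the permutation covers all indices exactly once, the three sets are guaranteed to be disjoint and to exhaust the fitting data.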

The test set RMSE for the energy of the best performing SchNet model for the OCHCO^{+} and H_{2}CO systems for a given SchNet architecture is presented in Table III. Results for every SchNet model from the five training runs can be found in the supplementary material. The test set RMSE of the best performing SchNet model, 2.92 meV, is similar to that of previous studies that fit this data set for the OCHCO^{+} system, which had a lowest test set RMSE of 1.4 meV, 1.9 meV, and 12.1 meV for the GP, PIP-GP, and PIP methods, respectively.^{62} We note that the PIP method when trained on all 8613 points performs excellently with an overall RMSE of 0.05 meV.^{60} Similar to the initial benchmarking results, we emphasize that any comparison among the SchNet models and previous studies should be done qualitatively as the previous study that fit the OCHCO^{+} potential energy surface used a different number of training points and a nonrandom procedure for partitioning the fitting data.

**TABLE III.** Test set RMSE (meV) for the energy of the best performing SchNet model for the OCHCO^{+} and H_{2}CO systems (*ρ* = ∞ for all models).

| Activation function | OCHCO^{+}, 2 interaction blocks | OCHCO^{+}, 3 interaction blocks | H_{2}CO, 2 interaction blocks | H_{2}CO, 3 interaction blocks |
|---|---|---|---|---|
| Shifted softplus | 2.92 | 3.26 | 13.85 | 17.41 |
| Swish | 4.01 | 3.68 | 13.55 | 20.84 |


The best performing SchNet model for the H_{2}CO system has a test set RMSE of 13.55 meV. While this initially seems poor, as the test set RMSE is over a factor of three higher than that of the best performing SchNet model for any of the previous systems, the SchNet architecture performs well relative to previous studies that fit the H_{2}CO potential energy surface using this fitting set, as the PIP-GP and PIP methods had a test set RMSE of 17.7 meV and 29.6 meV, respectively.^{62} In fact, when the PIP method is fit with all 34 750 points in the fitting data, the overall training RMSE is 13.02 meV,^{62} which is only 0.53 meV lower than the test set RMSE for the best performing SchNet model. The good performance of the SchNet architecture for the H_{2}CO system is indicative of the ability of continuous-filter convolutional neural networks to generalize well, as the global potential energy surface of the H_{2}CO system contains three distinct chemical isomers.

The error of every point in the test set for the best performing SchNet model relative to the energy of the fitting data for the OCHCO^{+} and H_{2}CO systems is shown in Figs. 9 and 10, respectively. The distribution of errors for the OCHCO^{+} system is similar to that for the H + H_{2} and Cl + H_{2} reactions fit with *ρ* = ∞, with a maximum absolute error of 20.7 meV. For the H_{2}CO system, the larger test set RMSE can be seen to arise in part from several outlier points that the SchNet model does not reproduce well. The maximum absolute error for the H_{2}CO system is 335.4 meV, which is over a factor of 15 higher than the next highest maximum absolute error of the best performing SchNet model for any of the other systems in this study. Outlier points with large absolute errors were also seen previously with the PIP-GP and PIP methods,^{62} which had maximum absolute errors of over 500 meV and 850 meV, respectively, for their best performing models. The high errors of the outlier points for all of these methods are an indication of the difficulty of fitting the H_{2}CO potential energy surface.

## V. CONCLUSIONS

In this study, we have applied the recently developed SchNet architecture, a machine-learning model based on the use of continuous-filter convolutions, to reproduce the potential energy surfaces of the H + H_{2} and Cl + H_{2} reactions and the OCHCO^{+} and H_{2}CO systems. Overall, the SchNet architecture performs well, with the best performing models having a test set RMSE for the energy of 0.5 meV and 2.0 meV for the H + H_{2} and Cl + H_{2} reactions, respectively. For the H + H_{2} reaction, the best performing SchNet model has a MAE for the force of 0.44 meV/bohr. For the more challenging OCHCO^{+} and H_{2}CO/*cis*-HCOH/*trans*-HCOH systems, the best performing SchNet model has a test set RMSE of 2.92 meV and 13.55 meV, respectively. With the accuracy of the SchNet architecture for fitting reactive potential energy surfaces demonstrated, future work will focus on applying the SchNet architecture to chemical dynamics studies and on forming a set of best practices for the generation and partitioning of the fitting data for these systems.

## SUPPLEMENTARY MATERIAL

See the supplementary material for the training set error of all SchNet models.

## ACKNOWLEDGMENTS

K.R.B. thanks the University of Missouri for generous startup funding. The computations for this work were performed on the high performance computing infrastructure provided by the Research Computing Support Services and in part by the National Science Foundation under Grant No. CNS-1429294 at the University of Missouri, Columbia, MO.