In recent years, machine learning techniques have shown great potential in various problems across a multitude of disciplines, including materials design and drug discovery. The high computational speed on the one hand, and the accuracy comparable to that of density functional theory on the other, make machine learning algorithms efficient for high-throughput screening of chemical and configurational space. However, the machine learning algorithms available in the literature require large training datasets to reach chemical accuracy and also show large errors for the so-called outliers: out-of-sample molecules not well represented in the training set. In the present paper, we propose a new machine learning algorithm for predicting molecular properties that addresses these two issues: it is based on a local model of interatomic interactions, providing high accuracy when trained on relatively small training sets, and on an active learning algorithm that optimally chooses the training set and significantly reduces the errors for the outliers. We compare our model to other state-of-the-art algorithms from the literature on widely used benchmark tests.

## I. INTRODUCTION

A permanent demand for the discovery of new compounds in numerous fields of industry requires the development of computational tools for the prediction of molecular properties. Many quantum-mechanical algorithms can accurately predict the properties of, in principle, arbitrary atomic systems; in practice, however, these algorithms are too computationally expensive to be applied to a very large number of molecules. Density functional theory, which is frequently used due to its favorable trade-off between accuracy and computational cost, is still too time-consuming for the high-throughput (rapid) screening of a large number of molecules.

Machine learning (ML) algorithms are a class of algorithms that allow for fast and accurate calculations by constructing a surrogate model which “interpolates” between reference quantum-mechanical and experimental data. In particular, such models were proposed for organic molecules^{1–10} and they can already reach the accuracy of a few kcal/mol or lower by fitting the quantum-mechanical data. Some of the existing approaches are based on different kernels: kernels based on the Coulomb matrix,^{11,12} the SOAP (smooth overlap of atomic positions) kernel,^{13} HDAD (histogram of distances, angles, and dihedrals^{8}), BOB (bag of bonds^{2}), BAML (bonds, angles, and machine learning^{6}), and MBTR (many-body tensor representations^{10}). Other models are based on neural networks: MPNN (message passing neural networks^{9}), DTNN (deep tensor neural networks^{7}), HIP-NN (Hierarchically Interacting Particle Neural Network^{14}), and SchNet.^{15}

In this work, we propose a new algorithm for fitting molecular properties. Our model resembles neural networks in that it employs several computing layers. We show that our model requires less training data to achieve chemical accuracy when compared against the state-of-the-art approaches on the existing benchmark tests. For example, on the benchmark database consisting of 130k molecules,^{3} the majority of recent state-of-the-art algorithms achieve chemical accuracy only when trained on tens of thousands of samples, while our model does so with only a few thousand samples. We attribute this to our local model of interatomic interaction, which effectively relates molecular properties to atomic environments and makes predictions for molecules not present in the training set by accounting for the contributions of individual atomic environments.

The other problem we address with our algorithm is the issue of the so-called *outliers*. In the discovery of new molecules, the most “interesting” molecules are often the most atypical ones. This is a challenge for ML approaches: if no molecules with a similar structure are present in the training set, ML models extrapolate and make large prediction errors for these outliers. With the proposed active learning algorithm, we reduce the errors for outliers by including the molecules with the most distinct geometries and compositions in the training set. This prevents attempts to predict the properties of molecules that are too different from any of the training samples; instead, the properties of out-of-sample molecules are interpolated by the ML model and are thus predicted accurately.

This paper has the following structure: we first formulate the problem of predicting molecular properties in an ML framework and present our machine learning model. Then, in Sec. III, we describe our active learning algorithm. In Sec. IV, we compare our algorithm to the existing algorithms^{1–10} on two benchmark datasets: QM9^{3} and QM7.^{2} Concluding remarks are given in Sec. V.

## II. MACHINE LEARNING MODEL

We formulate the problem in a machine learning framework in the following way. Suppose there is a large number, *N*, of molecules, whose structure (and composition) is encoded by *x*^{(1)}, …, *x*^{(N)}. The task is to construct a function *F* that predicts a certain property (e.g., an atomization energy) of each molecule, *F*(*x*^{(1)}), …, *F*(*x*^{(N)}). The function *F*, which we refer to as the *model*, will be constructed based on data—the properties of the first *n* (*n* < *N*) molecules, *y*^{(1)}, …, *y*^{(n)}, which are called the *labels*. We will call the set {*x*^{(1)}, …, *x*^{(n)}} together with {*y*^{(1)}, …, *y*^{(n)}} the *labeled dataset*. The labeled dataset is often chosen randomly. It is a separate problem to select an optimal subset of molecules such that the prediction error for the rest of the molecules is minimized—we solve it by an *active learning* (or *optimal experiment design*) algorithm outlined in Sec. III. The remaining set {*x*^{(n+1)}, …, *x*^{(N)}} will be called the *evaluation set* or the *validation set*. The model has a number of free parameters **θ** = (*θ*_{1}, …, *θ*_{m}) that are found by minimizing the total loss functional

$$
\sum_{i=1}^{n} \big| F(x^{(i)}; \boldsymbol{\theta}) - y^{(i)} \big|^2. \tag{1}
$$

We next describe in detail the model used in the present work and its functional form *F*. We note that our testing suggests that these models do not need regularization, provided that the number of parameters, *m*, is chosen significantly smaller than the number of molecules, *n*. Hence, *m* is the only tunable hyperparameter of our model, and we found that a good rule of thumb for choosing it is *n* ⩾ 2*m*.

We encode a molecule by its atomic positions and types which are denoted collectively by *x*. Its molecular property (e.g., its atomization energy) is denoted by *y* and is a function of *x*. To include our knowledge of the underlying physics into the model, we require that the representation of *y* is invariant with respect to translation, rotation, and permutation of chemically equivalent atoms in *x*.

We anticipate that some properties, such as the atomization energy or polarizability, are spatially local, i.e., depend on the local energy or polarizability of interatomic bonds. Therefore, in addition to the physical invariances, we also introduce locality in our model.

### A. Locality

We introduce locality into our model by partitioning *y* into a sum of contributions of individual atoms. The contribution of atom *i* depends on its atomic neighborhood \(\mathbf{n}_i = \{(\mathbf{r}_{ij}, z_j)\}\), where *j* indexes all atoms within the distance *R*_{cut} of atom *i*, and \(\mathbf{r}_{ij}\) and *z*_{j} denote the radius-vectors (i.e., the position of the *j*th atom relative to the *i*th atom) and the types of the neighboring atoms, respectively; see Fig. 1 for an illustration of a neighborhood. Thus, our model is

$$
F(x) = \sum_i V(\mathbf{n}_i). \tag{2}
$$

Here the summation is performed over all the atoms *i* in the molecule. Our locality assumption makes this model similar to an *interatomic potential* or a *force field* as used in molecular dynamics. This model is schematically illustrated in Fig. 2. The problem thus reduces to finding the right function *V*.
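To make the locality assumption concrete, here is a minimal Python sketch of a model of the form (2); the function names (`neighborhood`, `predict`), the cutoff value, and the toy `V` below are illustrative choices of ours, not an implementation from the paper:

```python
import numpy as np

R_CUT = 5.0  # cutoff radius; an illustrative value, not taken from the paper

def neighborhood(positions, types, i, r_cut=R_CUT):
    """Collect the relative vectors r_ij and types z_j of all atoms within r_cut of atom i."""
    r_ij, z_j = [], []
    for j in range(len(positions)):
        if j == i:
            continue
        d = positions[j] - positions[i]
        if np.linalg.norm(d) < r_cut:
            r_ij.append(d)
            z_j.append(types[j])
    return np.array(r_ij), np.array(z_j)

def predict(positions, types, V):
    """Local model: the molecular property is the sum of per-atom contributions V(n_i)."""
    return sum(V(*neighborhood(positions, types, i)) for i in range(len(positions)))
```

Because only relative vectors enter each neighborhood, translation invariance is automatic; rotation and permutation invariance must come from the function `V` itself, which is the subject of the next subsection.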

### B. Moment tensors as the descriptors of atomic environment

We take *V* as a generalization of the *moment tensor potentials*^{16} to the case of several chemically different atoms. Thus, we let

$$
V(\mathbf{n}_i) = \sum_\alpha \xi_\alpha B_\alpha(\mathbf{n}_i), \tag{3}
$$

where the *ξ*_{α}, which we call the *linear parameters*, are fitted from the data, and \(B_\alpha(\mathbf{n}_i)\) are the basis functions, which we describe below. To define the basis functions, we choose a cut-off radius *R*_{cut} and introduce, as described below, a representation of the neighborhoods that is invariant with respect to rotations and permutations of chemically equivalent atoms. We note that translation invariance is already built into (2).

We represent atomic environments using the so-called *moment tensor* descriptors of the following form:

$$
M_{\mu,\nu}(\mathbf{n}_i) = \sum_j f_\mu(|\mathbf{r}_{ij}|, z_i, z_j)\, \underbrace{\mathbf{r}_{ij} \otimes \cdots \otimes \mathbf{r}_{ij}}_{\nu\ \text{times}}, \tag{4}
$$

where the index *j* goes through all the neighbors of atom *i*. The symbol “⊗” stands for the outer product of vectors; thus, in (4), \(\mathbf{r}_{ij} \otimes \cdots \otimes \mathbf{r}_{ij}\) is a tensor of rank *ν*. This way, each descriptor in (4) is composed of the radial part *f*_{μ}(|**r**_{ij}|, *z*_{i}, *z*_{j}), which depends only on the relative distances between atoms and on their chemical types, and the angular part \(\mathbf{r}_{ij} \otimes \cdots \otimes \mathbf{r}_{ij}\) resembling the moments of inertia. Indeed, if in (4) we set *f*_{μ} = *m*(*z*_{j}) = *m*_{j}, where *m*_{j} is the mass of an atom of type *z*_{j}, and take the values of *ν* equal to 0, 1, and 2, we get the descriptors \(M_{\mu,0}(\mathbf{n}_i) = \sum_j m_j\), \(M_{\mu,1}(\mathbf{n}_i) = \sum_j m_j \mathbf{r}_{ij}\), and \(M_{\mu,2}(\mathbf{n}_i) = \sum_j m_j \mathbf{r}_{ij} \otimes \mathbf{r}_{ij}\), resembling the mass, scaled center of mass, and tensor of inertia of the neighborhood of atom *i*, respectively. Note that by construction the \(M_{\mu,\nu}(\mathbf{n}_i)\) are invariant with respect to rotations (in a tensorial sense) and permutations of chemically equivalent atoms. It should be emphasized that the *M*_{μ,0} are the standard two-body descriptors of atomic environments that do not contain information about angles between bonds. The general moment tensor descriptors *M*_{μ,ν} remain two-body and thus offer an alternative way of including the angular information; the traditional way is to include at least three-body descriptors.
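As an illustration of how the rank-ν moments are accumulated from outer products, here is a minimal Python sketch (our function names; the radial part is assumed precomputed and passed in as per-neighbor weights):

```python
import numpy as np

def moment(r_ij, f_vals, nu):
    """Rank-nu moment of a neighborhood: sum_j f_vals[j] * (r_ij ⊗ ... ⊗ r_ij), nu factors.
    `f_vals[j]` stands for the precomputed radial part f_mu(|r_ij|, z_i, z_j) for neighbor j."""
    total = np.zeros((3,) * nu) if nu > 0 else 0.0
    for r, f in zip(r_ij, f_vals):
        t = f  # start from the scalar radial weight (a rank-0 tensor)
        for _ in range(nu):
            t = np.multiply.outer(t, r)  # each outer product raises the rank by one
        total = total + t
    return total
```

With `f_vals` set to atomic masses and ν = 0, 1, 2, this reproduces the mass-, center-of-mass-, and inertia-like quantities mentioned above; permutation invariance is evident, since the result is a plain sum over neighbors.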

The descriptors *M*_{μ,ν} defined above are tensors of different ranks. We hence define the basis functions as all possible contractions of these tensors to a scalar, for example, the dot product of two rank-one moments, \(M_{\mu_1,1} \cdot M_{\mu_2,1}\), or the full contraction of two rank-two moments, \(M_{\mu_1,2} : M_{\mu_2,2}\).

Note that all the functions \(B_\alpha(\mathbf{n}_i)\) constructed this way are rotation invariant. The basis functions *B*_{α} resemble invariant polynomials of the atomic positions **r**_{ij} (if the radial functions *f*_{μ}(|**r**_{ij}|, *z*_{i}, *z*_{j}) were polynomials of **r**_{ij}, then *B*_{α} would be polynomials). We therefore define the level of *M*_{μ,ν} by lev *M*_{μ,ν} = 2*μ* + *ν*, and if *B*_{α} is obtained from \(M_{\mu_1,\nu_1}\), \(M_{\mu_2,\nu_2}\), …, then lev *B*_{α} = (2*μ*_{1} + *ν*_{1}) + (2*μ*_{2} + *ν*_{2}) + ⋯. In practice, we obtain different models by including all basis functions such that lev *B*_{α} ≤ *d*. We denote such a model by MTM_{d} (moment tensor model of level *d*). The larger the level, the more parameters *ξ*_{α} the model has.
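The rotation invariance of such contractions can be checked numerically. The sketch below (our construction) uses an illustrative radial weighting that depends only on |**r**_{ij}|, so the rank-one moment transforms as a vector and its dot product with itself is invariant:

```python
import numpy as np

def moment1(r_ij, w):
    """Rank-1 moment: sum_j w_j * r_ij."""
    return sum(wj * r for r, wj in zip(r_ij, w))

r_ij = np.array([[1.0, 0.2, 0.0], [0.3, 1.0, 0.5]])
w = np.exp(-np.linalg.norm(r_ij, axis=1))  # radial part: a function of |r_ij| only

# one possible contraction to a scalar: dot product of a rank-1 moment with itself
B = moment1(r_ij, w) @ moment1(r_ij, w)

theta = 0.7  # rotate the whole neighborhood about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
B_rot = moment1(r_ij @ R.T, w) @ moment1(r_ij @ R.T, w)
assert np.isclose(B, B_rot)  # the contracted basis function is rotation invariant
```

The weights `w` are unchanged by the rotation because |R **r**| = |**r**|, which is exactly why only the tensorial parts of the moments need to be contracted away.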

The radial functions *f*_{μ} from (4) are represented in the following way:

$$
f_\mu(|\mathbf{r}_{ij}|, z_i, z_j) = \sum_k c^{(k)}_{\mu, z_i, z_j}\, Q^{(k)}(|\mathbf{r}_{ij}|),
$$

where

$$
Q^{(k)}(r) = T_k(r)\,(R_{\text{cut}} - r)^2.
$$

Here *T*_{k}(*r*) are the Chebyshev polynomials on the interval [*R*_{min}, *R*_{cut}], and the term \((R_{\text{cut}} - r)^2\) is introduced to ensure a smooth cut-off to 0. Taking into account that in real atomic systems atoms never come too close to each other, we can always, in practice, choose some minimal distance *R*_{min}.
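A sketch of such a radial basis in plain Python; mapping [*R*_{min}, *R*_{cut}] affinely onto the Chebyshev interval [−1, 1] is our assumption about the construction, and the default radii are illustrative:

```python
def radial_basis(r, k_max, r_min=1.0, r_cut=5.0):
    """Q_k(r) = T_k(x(r)) * (r_cut - r)^2, where T_k are Chebyshev polynomials
    evaluated at x(r), the affine map of [r_min, r_cut] onto [-1, 1].
    The (r_cut - r)^2 factor enforces a smooth cutoff to 0 at r_cut."""
    x = 2.0 * (r - r_min) / (r_cut - r_min) - 1.0
    T = [1.0, x]  # T_0 = 1, T_1 = x
    for _ in range(2, k_max + 1):
        T.append(2.0 * x * T[-1] - T[-2])  # Chebyshev recurrence T_{k+1} = 2x T_k - T_{k-1}
    envelope = (r_cut - r) ** 2
    return [t * envelope for t in T[:k_max + 1]]
```

A chemistry-aware radial function *f*_{μ} is then a linear combination of these Q_k with learned, species-dependent coefficients \(c^{(k)}_{\mu,z_i,z_j}\).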

The point that makes this model different from the single-component moment tensor potentials^{16} is that the radial functions *f*_{μ} now depend on the type of the central atom, *z*_{i}, and on the types of the other atoms in the neighborhood, *z*_{j}. This dependence is encoded in the coefficients \(c^{(k)}_{\mu,z_i,z_j}\). Together with the *ξ*_{α} from (3), they form the set of model parameters \(\boldsymbol\theta = \{\xi_\alpha, c^{(k)}_{\mu,z_i,z_j}\}\) that are found during the fitting procedure; see Fig. 3.

For the case of single-component systems, it was proved in Ref. 16 that the moment tensor descriptors form a complete set of descriptors, in the sense that any function of a local atomistic environment can be approximated by a polynomial of the descriptors. It is hence easy to see that the same property holds for the multi-component systems: since the radial functions *f*_{μ} have an arbitrary dependence on the types of the neighboring atoms, the moments *M*_{μ,ν} can describe the way atoms of type one, two, etc., are placed around the *i*th atom independently of the other types of atoms. Since the description of each atomic type in the neighborhood is complete, the description of the entire neighborhood is complete.

### C. Intensive quantities

To further include the knowledge of physics/chemistry into our model, we distinguish between intensive and extensive quantities. An extensive quantity (such as the atomization energy) scales with the number of atoms, while an intensive quantity (such as the HOMO/LUMO level) remains of the same order when the number of atoms in a molecule increases. The above description of the model is valid for extensive quantities, while for an intensive quantity we modify the partition relation (2) as

$$
F(x) = \frac{1}{\#(x)} \sum_i V(\mathbf{n}_i),
$$

where #(*x*) is the number of atoms in *x*.

### D. A nonlocal model

To go beyond the locality assumption (2), we include nonlocal effects by introducing two different local models *V*_{1} and *V*_{2} (each with its own set of parameters) and let

and

We then define the nonlocal model *nlMTM* in the following form:

Figure 4 shows how the parameters *p*_{i} form an additional computing “layer” compared to the local model in Fig. 2 and extend the set of parameters, \(\boldsymbol\theta = \{\xi_\alpha, c^{(k)}_{\mu,z_i,z_j}, p_i\}\), that are found in the training process. Such a model architecture is motivated by the fact that molecular orbitals depend largely on local environments described by the several *V*_{i} (although we considered only *i* = 1, 2). However, these orbitals are occupied by electrons not independently of each other; hence, in (7) we assume a nonlinear dependence of the prediction on the local features.

## III. ACTIVE LEARNING

The accuracy of prediction depends not only on the machine learning model but also on the training set that is used for the fitting. On the one hand, the size of the training set is typically limited by the number of experiments or *ab initio* calculations we can conduct. On the other hand, the training set should represent the full variety of molecules to prevent extrapolation during the evaluation of molecular properties. It has been shown^{17} that an optimal choice of a training set of fixed size can, in principle, significantly reduce the out-of-sample errors if we were allowed to use the labels (*y*^{(i)}) of all the available data. However, a practical selection algorithm needs to choose the molecules for the training set based only on the unlabeled data (*x*^{(i)}), since in practice we want to compute the labels only after the selection. The approaches related to the construction (or selection) of an optimal training set are known as *active learning* approaches.

Recently, an approach for active learning of interatomic potentials with a linear dependence on the model parameters was proposed.^{18} This approach is based on a D-optimality criterion for selecting the training dataset, which is equivalent to choosing atomistic configurations that maximize the determinant of the matrix of the linear equations on the model parameters. This algorithm effectively detects samples on which the model extrapolates. Hence, training on such samples prevents extrapolation and thus ensures reliable treatment of the remaining molecules at the evaluation stage.

### A. Generalized D-optimality criterion

The model proposed in this paper has a nonlinear dependence on its parameters **θ**. Therefore, we propose a generalization of the D-optimality criterion to the nonlinear case. To that end, we assume that the values of the parameters \(\bar{\boldsymbol\theta}\) found from the training procedure (Sec. II) are already near the optimal ones, and we hence linearize each term in the loss function (1) with respect to the parameters,

$$
F(x; \boldsymbol\theta) \approx F(x; \bar{\boldsymbol\theta}) + \sum_{j=1}^{m} \frac{\partial F}{\partial \theta_j}\big(x; \bar{\boldsymbol\theta}\big)\,(\theta_j - \bar\theta_j).
$$

We can then interpret the fitting as the solution of the following overdetermined system of equations with respect to *θ*_{j}:

$$
F\big(x^{(i)}; \bar{\boldsymbol\theta}\big) + \sum_{j=1}^{m} \frac{\partial F}{\partial \theta_j}\big(x^{(i)}; \bar{\boldsymbol\theta}\big)\,(\theta_j - \bar\theta_j) = y^{(i)}, \qquad i = 1, \ldots, n,
$$

where *n* is the size of the labeled dataset {*x*^{(1)}, …, *x*^{(n)}}. The matrix of this system is a tall Jacobi matrix,

$$
\mathsf{B} = \left( \frac{\partial F}{\partial \theta_j}\big(x^{(i)}; \bar{\boldsymbol\theta}\big) \right)_{\substack{i = 1, \ldots, n \\ j = 1, \ldots, m}},
$$

where each row corresponds to a particular molecule from the training set.

Next, we select for training a subset of molecules yielding the most linearly independent rows of **B**. This is equivalent to finding a square *m* × *m* submatrix **A** of maximal volume (i.e., with the maximal value of |det(**A**)|). We do this using the so-called maxvol algorithm.^{19}
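As a rough illustration of volume-maximizing row selection, here is a greedy stand-in of ours (the actual maxvol algorithm of Ref. 19 is different; it iteratively swaps rows to increase |det(A)|), which conveys the idea of picking rows that span the largest volume:

```python
import numpy as np

def select_rows(B, m):
    """Greedily choose m rows of the tall matrix B whose square submatrix has large |det|.
    Each step picks the row with the largest component orthogonal to the rows chosen so far
    (pivoted-orthogonalization style), a simple surrogate for maximal-volume selection."""
    R = B.astype(float).copy()
    chosen = []
    for _ in range(m):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))
        if norms[i] == 0.0:          # B has rank < m; stop early
            break
        chosen.append(i)
        v = R[i] / norms[i]          # unit vector along the newly chosen row
        R = R - np.outer(R @ v, v)   # project that direction out of all remaining rows
    return chosen
```

Rows that are nearly linear combinations of already-chosen ones shrink to near-zero residuals and are never picked, mirroring how D-optimal selection avoids redundant molecules.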

Active learning allows us not only to extract a D-optimal training set from the labeled dataset, but also (probably, equally importantly) to estimate, at the evaluation stage, the novelty of molecules compared to those in the training set. If some molecule *x** significantly differs from all the molecules in the training set, then its properties cannot be reliably calculated. Therefore, it should instead be labeled and added to the training set.

We define the “novelty” grade *γ*(*x**) as the maximal factor by which |det(**A**)| can grow if *x** is added to the training set. According to Ref. 19, it can be calculated as

$$
\gamma(x^*) = \max_{1 \le j \le m} |c_j|,
$$

where

$$
(c_1, \ldots, c_m) = \left( \frac{\partial F}{\partial \theta_1}\big(x^*; \bar{\boldsymbol\theta}\big), \ldots, \frac{\partial F}{\partial \theta_m}\big(x^*; \bar{\boldsymbol\theta}\big) \right) \mathsf{A}^{-1}.
$$

Thus, we add the molecule *x** to the training set if *γ*(*x**) ≥ *γ*_{trsh}, where *γ*_{trsh} ≥ 1 is a threshold parameter that prevents the algorithm from training on molecules whose novelty grade *γ* is not sufficiently high.
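In a linear-algebra sketch (our variable names: `A` is the selected *m* × *m* submatrix of the Jacobian and `b` the row of parameter derivatives of a candidate molecule), the grade reduces to a single linear solve:

```python
import numpy as np

def novelty_grade(A, b):
    """gamma = max_j |c_j| with c = b A^{-1}: the factor by which |det A| could grow
    if the candidate's derivative row b replaced the best-matching row of A."""
    c = np.linalg.solve(A.T, b)  # solves c A = b without explicitly forming A^{-1}
    return float(np.max(np.abs(c)))
```

A grade γ ≤ 1 means the candidate's row lies inside the "span" of the current training set (interpolation), while γ > 1 flags extrapolation, which is why the selection threshold satisfies γ_trsh ≥ 1.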

Taking into account that \(F(x) = \sum_i V(\mathbf{n}_i)\), for a given molecule *x** one can write

$$
\frac{\partial F}{\partial \theta_j}(x^*) = \sum_i \frac{\partial V}{\partial \theta_j}(\mathbf{n}_i),
$$

where the individual atomic environments of *x** are enumerated by *i*. Such partitioning over neighborhoods prevents the AL algorithm from selecting molecules composed of already known neighborhoods: for such a molecule, the vector of derivatives \(\partial V(\mathbf{n}_i)/\partial \theta_j\) is linearly dependent on the corresponding vectors of smaller molecules, and thus the big molecule will not be recognized as sufficiently new to be chosen.

### B. Active learning scheme

We next describe our active learning procedure. It expands the training set iteratively, each time increasing its size by no more than 10%.

1. Start with a random initial training set.
2. Train the model on the current training set.
3. Using the active learning algorithm,^{18} select molecules with *γ* ≥ *γ*_{trsh} and add either all such molecules or the molecules with the maximal *γ*, such that the size of the training set increases by 10% or less.
4. Unless satisfied with the current model (see the discussion below), go to step 2.

For the purpose of studying the performance of the proposed algorithm, we stopped the loop (at step 4) when the number of molecules in the training set reached 6000. In practice, however, another stopping criterion could be considered: stop when the difference in accuracy between the models of two consecutive iterations, measured on the molecules selected at step 3, is too small. This is an intuitive criterion; however, we have observed that the accuracy improvement may sometimes be small on one particular iteration and then increase again on the next one. Therefore, it may be better to track the improvement over several iterations of the cycle.
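The iterative scheme above can be sketched as a simple loop; the callbacks `train` and `select` and all names here are illustrative placeholders of ours, not an API from the paper:

```python
def active_learning_loop(pool, initial, train, select, max_size=6000, grow=0.10):
    """Iteratively expand the training set, adding at most `grow` (10%) of its
    current size per iteration, until a stopping criterion (here: max_size).
    `train(training)` fits a model; `select(model, candidates, budget)` returns
    up to `budget` candidates with the highest novelty grade."""
    training = list(initial)
    while len(training) < max_size:
        model = train(training)                      # fit on the current set
        candidates = [x for x in pool if x not in training]
        budget = max(1, int(grow * len(training)))   # grow by at most 10%
        new = select(model, candidates, budget)      # most novel molecules
        if not new:                                  # nothing extrapolative left
            break
        training.extend(new[:budget])
    return training
```

Only the newly selected molecules need to be labeled (e.g., by a DFT calculation) at each iteration, which is the practical payoff of selecting from unlabeled data.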

## IV. RESULTS

We have conducted a number of tests to answer the following three questions: what accuracy can our model achieve on a dataset when trained on a randomly chosen subset of it, by how much can the accuracy be improved using our active learning technique (Sec. III), and how many training samples are required to reach the chemical accuracy of 1 kcal/mol. Recall the terminology we use: the *training set* is the set on which we train our model, and the *validation set* consists of the full database excluding the training set. All errors quoted below are measured on the validation set.

We have tested the models of levels 16, 20, and 24 denoted MTM_{16}, MTM_{20}, and MTM_{24}, respectively. The fitting of the models was done with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm, by performing between 2000 and 5000 iterations.

### A. Fitting enthalpy on QM9

First, we fit our model on the so-called QM9 dataset^{3} consisting of 130 831 molecules formed by C, H, O, N, and F atoms, with up to nine heavy (C, O, N, and F) atoms. This is a subset of the originally proposed database^{20} of 166 × 10^{9} organic molecules. The number 130 831 excludes 3054 molecules from the database that failed a consistency test, as reported in Ref. 3. Following the existing studies,^{4,5,7,9,10} we demonstrate the performance of our method by fitting the enthalpy (or atomization energy) at 300 K.

### B. Random choice of the training dataset

To investigate the accuracy of the MTMs with different numbers of parameters, we calculated the learning curves showing the dependence of the mean absolute error (MAE) on the training set size; see Fig. 5. Our results are averaged over three independent random choices of the training set. As expected, the models with fewer parameters show good results for small training datasets but are outperformed by the models with more parameters as the number of training samples grows.

We next compare different models by how fast (i.e., with what training dataset size) they reach chemical accuracy. Table I lists the prediction errors for the MTMs and the existing state-of-the-art methods when training on random training sets. While filling this table, we used MTM_{16} for the 1k training set size, MTM_{20} for the 3.5k training set size, and MTM_{24} for the 10k training set size. For the sizes of 25k and 50k, we used the MTM_{28} model, which has more parameters to fit than MTM_{24}. At the same time, Fig. 5 shows that using either the MTM_{20} or MTM_{24} model still provides competitive results.

**Table I.** MAE (in kcal/mol) of the enthalpy prediction on QM9 for different training set sizes.

| Model | 1k | 3.5k | 10k | 25k | 35k^{a} | 50k | 110k |
|---|---|---|---|---|---|---|---|
| DTNN^{7} | … | … | 1.2 | 1.0 | … | 0.94 | … |
| BAML^{6} | … | … | 2.4 | … | … | … | … |
| \(\Delta_{\text{B3LYP}}^{\text{MP7}}\)-ML^{5} | 4.8 | … | 3.0 | … | … | … | … |
| MPNN^{9} | … | … | … | … | … | … | 0.39 |
| HDAD^{8} | … | … | … | … | … | … | 0.58^{a} |
| HIP-NN^{14} | … | … | … | … | … | 0.35 | 0.26 |
| SchNet^{15} | … | … | … | … | … | 0.59 | 0.31 |
| aSLATM^{17} | 1.8 | 0.98^{b} | … | … | … | … | … |
| MTM^{c} | 1.8 | 1.0^{d} | 0.86 | 0.63 | … | 0.41 | … |


The aSLATM^{17} model reaches chemical accuracy with a training set size of about 3200, while the learning curve in Fig. 5 shows that the MTM_{24} model requires almost the same number, 3500 molecules, to reach this accuracy. We obtained the results for MTM_{16}, MTM_{20}, and MTM_{24} by doing 2000 iterations of the BFGS algorithm, and 5000 iterations for MTM_{28}. For a more accurate comparison with aSLATM, we trained the MTM_{24} model on 10 different random training sets of 3000 molecules by doing 3000 iterations of BFGS. This provided us with a validation MAE of 1.006 ± 0.0316 kcal/mol.

As can be seen from Table I, for training set sizes of no more than 10k, MTM shows a better learning curve than the existing state-of-the-art methods, and only the model from Ref. 17 shows nearly equal results. The studies using deep NNs (Refs. 15 and 14) show better accuracy for training set sizes over 50k. References 7 and 8 need 25k and 35k training samples, respectively, to attain chemical accuracy. We note another work^{9} that also reaches chemical accuracy, but it reports an MAE of 0.55 kcal/mol only when training on 110k molecules plus another 10k molecules used as a hold-out set for the early-stopping criterion.

### C. Active learning

While the MAE of MTMs trained on random training sets is small, the corresponding maximal error is of the order of 100 kcal/mol, resulting in outliers for which the atomization energy prediction is too inaccurate. To eliminate the outliers, we applied the active learning algorithm described in Sec. III. We started with a random training set of 1k molecules, and it took about 20 iterations to reach a training set size of 6k molecules. At each iteration, our training procedure takes time proportional to the training set size, while the selection always takes nearly constant time, approximately the same as required to train a model on a training set of 1k molecules.

Figure 6 shows the MAE, root-mean-square error (RMSE), and maximal absolute error as functions of the training set size, comparing random and active selection. The “Active” curve corresponds to the scenario of actively selecting molecules from the entire set of molecules and measuring the error on the remaining set. In the “Active 2” scenario, we instead separate out a validation set of 30k molecules on which we measure the error, while selecting from and learning on the remaining 100k molecules. The error bars correspond to the 95% confidence interval measured over three independent runs, in each of which the initial training set of 1000 molecules and, in the “Active 2” scenario, the validation set were chosen randomly. We used the MTM_{24} model. It has about 2000 parameters to fit, which explains why the error in Fig. 6 exhibits *overfitting* when the model is trained on fewer than 2000 molecules. As stated in Sec. II, we could improve our results for such small training set sizes by introducing regularization (which is otherwise not required).

In the active learning approach (see Sec. III), the training set tends to be as diverse as possible, in the sense of spanning the largest volume of configurational (or molecular) space. The most extrapolative molecules lie on the boundaries of this volume, and if the number of molecules is less than or close to the number of model parameters, we do not have a sufficient number of training samples between the boundaries. With random selection, we cover the configurational space more evenly, which results in lower RMS errors. From the part of the graph starting at 4000 training samples (twice the number of model parameters, the rule of thumb suggested in Sec. II), we can see that there is no improvement in MAE (though it remains close to the random-sampling MAE), but the RMS and maximal errors are lower than those for random sampling.

From Fig. 6, it can be seen that the maximal error is much smaller when we are allowed to add any molecule to the training set (the “Active” scenario) than in the “Active 2” scenario; however, the error in the “Active 2” scenario is still smaller than that of random sampling. This indicates that active learning decreases the error through two mechanisms: the actively chosen molecules better represent the unusual molecules in the validation set of the “Active 2” scenario, and if we are also allowed to select those unusual molecules for training, the error drops further. We argue that the latter can be useful in applications where the region of interest in chemical space is fixed *a priori*. As an example of such an application, in Ref. 21 the authors found six candidate Co superalloys among about 2k *a priori* identified potential chemical compositions.

To better understand the impact of active learning, in Fig. 7, we plotted the true and predicted enthalpies for the models trained on randomly and actively chosen training sets of 10k molecules. The plot is focused on a small region of enthalpies to show the scale of the error. As can be seen, active learning makes the error small uniformly over all the samples, while the random choice of the training set results in outliers for which the error is large.

We investigated the sizes of the molecules entering the training set at each iteration of the selection (see Sec. III B) and compared them to the average molecule size in the QM9 database. The results are shown in Fig. 8. From these graphs, it can be seen that the algorithm tends to select molecules smaller than average. This could indicate that some small molecules contain representative atomic neighborhoods that occur in many bigger molecules, and the active learning algorithm detects and selects such small molecules. This result is consistent with the observations of Huang and von Lilienfeld,^{17} who state that an accurate model can be obtained by training on a small number of the most relevant atomic environments (neighborhoods).

### D. Fitting the QM7 database

Another common benchmark database we used, referred to as QM7,^{22} consists of 7.2k small organic molecules with up to seven heavy atoms (C, N, O, S) saturated by H. We chose four properties to fit: the atomization energy, polarizability, and the HOMO and LUMO levels. The atomization energy and polarizability are extensive quantities; the other two are intensive. The training set consisted of 5k randomly chosen molecules, while the remaining 2.2k were used for validation. In Table II, we compare the accuracy of the MTMs to the other state-of-the-art methods; our results are averaged over three independent runs. We see that the local properties, enthalpy and polarizability, are fitted with the same or better accuracy than by the existing methods, while the predictions of the HOMO and LUMO levels have an error about 50% larger than the state of the art.

**Table II.** MAEs of the MTMs and the existing methods for four properties of the QM7 database.

| Model | E (kcal/mol) | α (Å^{3}) | HOMO (eV) | LUMO (eV) |
|---|---|---|---|---|
| BAML^{6} | 1.15 | 0.07 | 0.10 | 0.11 |
| SOAP^{13} | 0.92 | 0.05 | 0.12 | 0.12 |
| DTNN^{7} | 1.04 | … | … | … |
| MBTR^{10} | 0.60 | 0.04 | … | … |
| MTM^{a} | 0.52 | 0.04 | 0.16 | 0.16 |
| nlMTM^{a} | … | … | 0.11 | 0.11 |


^{a}This work.

To address this issue, we applied the nonlocal modification of our algorithm, nlMTM, given by (7). From Table II, we see that accounting for nonlocality improves the HOMO and LUMO errors to state-of-the-art accuracy. As the nonlocal scheme is, essentially, a three-layer model, it was found to suffer from overfitting similarly to deep neural networks; therefore, we used the early-stopping technique for training nlMTM. To that end, we used 1100 samples for estimating the error during training and measured the error on the remaining 1100 samples.

## V. CONCLUSIONS

We have introduced the moment tensor model (MTM) for the prediction of molecular properties based on the contributions of local atomic environments represented by the moment tensor descriptors. The accuracy of the proposed model is comparable to or better than that of the state-of-the-art algorithms. The proposed model outperforms the existing algorithms in how fast it reaches chemical accuracy on a database of 130k molecules. We attribute this to the provable completeness of our moment tensor descriptors of local environments. It should be emphasized that although our descriptors (4) are, essentially, two-body descriptors, their completeness and, in particular, angular dependence come from their tensorial structure.

In addition, we have proposed an active learning algorithm that significantly reduces the maximal error (in other words, the error for the outliers). The algorithm effectively selects the molecules most different from those already in the training set and adds them to it.

## ACKNOWLEDGMENTS

We thank Matthias Rupp and Anatole von Lilienfeld for fruitful discussions of our results. This work was supported by the Skoltech NGP Program No. 2016-7/NGP (a Skoltech-MIT joint project). This work was performed, in part, by A.S. during the Fall 2016 long program at the Institute of Pure and Applied Mathematics, UCLA; and, in part, at the Center for Integrated Nanotechnologies, an Office of Science User Facility operated for the U.S., Department of Energy (DOE) Office of Science by Los Alamos National Laboratory (Contract No. DE-AC52-06NA25396) and Sandia National Laboratories (Contract No. DE-NA-0003525).