Two different classes of molecular representations for use in machine learning of thermodynamic and electronic properties are studied. The representations are evaluated by monitoring the performance of linear and kernel ridge regression models on well-studied data sets of small organic molecules. One class of representations studied here counts the occurrence of bonding patterns in the molecule. These require only the connectivity of atoms in the molecule, as may be obtained from a line diagram or a SMILES string. The second class utilizes the three-dimensional structure of the molecule. These include the Coulomb matrix and Bag of Bonds, which list the inter-atomic distances present in the molecule, and Encoded Bonds, which encode such lists into a feature vector whose length is independent of molecular size. The Encoded Bond features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules. A wide range of feature sets are constructed by selecting, at each rank, either a graph- or geometry-based feature. Here, rank refers to the number of atoms involved in the feature; e.g., atom counts are rank 1, while Encoded Bonds are rank 2. For atomization energies in the QM7 data set, the best graph-based feature set gives a mean absolute error of 3.4 kcal/mol. Inclusion of 3D geometry substantially enhances the performance, with Encoded Bonds giving 2.4 kcal/mol when used alone and 1.19 kcal/mol when combined with graph features.

## I. INTRODUCTION

Machine learning (ML) has the potential to predict the properties of molecules at a small fraction of the cost of quantum computations.^{1,2} Recent successes on organic molecules include ML models of thermodynamic properties,^{1–11} electronic properties,^{12–25} and molecular structures.^{26–29} Considerable success has also been obtained on inorganic systems.^{30,31} In developing ML models, an important decision is how to represent the molecule through features, or descriptors, that provide useful inputs to the ML algorithm. Here, we assemble a large variety of feature sets from two main classes of features. These classes differ regarding the need for a full three-dimensional structure, as opposed to simply a graph that summarizes the bonding between atoms. The utility of these feature sets is evaluated by comparing the performance on linear and kernel ridge regression (KRR) models of small organic molecules.

The first class of features is that of “graph” features, which utilize only the bonding patterns in a molecule. The graph features considered here are based on connectivity counts. Such counts have been explored in a number of different ways.^{6,32–35} Here, these features include counts of atom types and bond types, which we refer to as rank-1 and rank-2 features because they summarize patterns related to one and two atoms, respectively. Higher-rank features, that count occurrences of patterns related to three or more bonded atoms, are also explored. An advantage of graph features is that they do not require a full three-dimensional structure of the molecule and, instead, may be extracted from a line drawing or from a SMILES^{36} string.

The second class is that of “geometry” features, which utilize the 3D molecular geometry. This class includes the Coulomb matrix, Bag of Bonds (BoB), and Bonds and Angles Machine Learning (BAML) features, which have been shown to lead to good performance on thermodynamic and electronic properties.^{2,37,38} A disadvantage of these representations is that the length of the feature vectors scales with the number of interactions present in the molecules.^{19} This is in contrast to the connectivity counts of the graph features, where the length of the feature vector depends only on the types of atoms and bonds in the molecules. Below, a constant-size geometry feature is introduced that encodes the distances between atoms in the molecule as a smoothed histogram of inter-atomic distances, including those between non-bonded atoms. Such Encoded Bond features provide a constant-size feature vector that contains information similar to that present in the Coulomb matrix or BoB.

A potential benefit of using feature vectors whose length is independent of molecular size is that the resulting models may generalize well from smaller to larger systems. This generalization assumes (a) an effective chemical periodicity where large systems consist of smaller fragments which resemble exemplary chemistries present in the database used for training and (b) locality implying additivity. For certain systems, these assumptions can break down, in particular, the latter one when strong electron delocalization effects are present as in metals or in the *π*-bonds of conjugated polymers.

Both classes of features summarize information regarding the local environment of atoms and bonds in the molecule. The resulting feature vectors therefore contain information on the fragments present in the molecule.^{39} In particular, when two molecules possess the same molecular fragment, the interactions among the atoms in that fragment make similar contributions to the feature vector. Use of these feature vectors to predict similarity between two molecules, as in the kernel methods below, therefore incorporates similarity resulting from the presence of similar fragments. In addition, the feature vectors include information regarding interactions among the fragments. The features may therefore enable models to generalize from smaller to larger molecules. The degree to which this is true is tested below by training models on organic molecules with up to 7 heavy atoms and testing these models on molecules with up to 9 heavy atoms.

A variety of feature sets are assembled by selecting the type of feature to use for each rank. For the tasks considered here, the best performing feature sets use geometry features for the second rank and graph features for other ranks.

## II. RELATED WORK

The features studied here are somewhat related to features developed for neural net models of potentials.^{8,40–43} In those models, the total energy is written as a sum over atomic energies, with the atomic energies being functions of the local environment.^{44} The environment is described using features based on two-body and three-body correlation functions around the atom. The Encoded Bonds’ features explored here may be viewed as smoothed versions of two-body correlation functions between each pair of elements. We note that inclusion of only two-body correlations will fail to distinguish homometric molecules.^{19}

The graph and geometry features compared here differ in the degree to which results from chemical computations are used as features for the prediction task. For the geometry features, quantum chemistry is typically used to compute the molecular structure. Delta models go further in their use of results from quantum chemistry by including, as features, computed values for the atomization energy or other properties of interest. In this manner, delta models correct predictions from standard computational chemistry methods so that they better match more expensive calculations^{16,17,45} or experiments.^{3,46}

Graph features lie at the other end of the spectrum, with the only information required being presented in a line drawing or a SMILES^{36} string. At the latter level, there have been many examples of using simple counts of atoms and bonds to make predictions of molecular properties.^{2,37} The low computational cost of such models makes them useful for searching molecular databases.^{47} The graph features explored here generalize these features in a manner that connects well with the geometry-based Encoded Bonds’ features introduced below. Alternative features that require only a graph of the molecule include molecular fingerprints; however, the direct use of these as features does not lead to good performance on predictions of thermodynamic and electronic properties.^{48} Other styles of graph-based features that are not explored here include the direct use of SMILES representations in conjunction with a variational autoencoder^{49} and graph convolutions.^{50,51}

Below, we use kernel-based learning methods to explore the utility of the features investigated here. A number of approaches have been developed that compute the kernel directly, without introducing a feature vector that can be used in other ML approaches.^{5,22} The features introduced here are not limited to kernel methods and, when used in kernel methods, allow us to separately consider the influences of three factors: the feature vector, the distance metric used to connect features to the similarity between molecules, and the kernel function. Because our focus is on molecular representations as opposed to kernels, direct comparison to these methods is not included below.

## III. DATA SETS

This work compares the performance of various molecular features on data sets that have been the subject of past work investigating the ability of machine learning models to predict molecular properties. These data sets draw their molecules from the following two databases, each of which contains only the bonding pattern as summarized in a SMILES string. The **GDB13** data set contains 977 × 10^{6} molecules made of C, H, O, N, S, and Cl, with up to 13 heavy atoms,^{52} and the **GDB17** data set contains 166 × 10^{9} molecules made up of C, H, O, N, and F, with up to 17 heavy atoms.^{53}

The **QM7** data set corresponds to a subset of the GDB13 molecules, consisting of all 7165 molecules that contain 7 or fewer heavy atoms of elements C, N, O, or S for which optimized structures and atomization energies were calculated with the Perdew-Burke-Ernzerhof hybrid functional (PBE0).^{1} Models for the atomization energies in QM7 have been extensively studied, primarily using features derived from the Coulomb matrix.^{1,2,4,9,37} As such, this data set provides a useful benchmark for testing models and comparing to previous work.

The **QM7b** data set extends the QM7 data set to molecules containing chlorine and includes a broader range of properties such as HOMO and LUMO energies, excitation energies, and polarizabilities calculated at various levels of theory.^{15} QM7b allows the performance of machine learning models to be explored on a wider variety of properties. Past studies have also used QM7b to explore multi-target regression methods which, by simultaneously predicting various properties, can potentially benefit from relationships between the target properties.^{4,15,37}

The **QM9** data set^{54} contains data for a subset of the GDB17 molecules, consisting of 133 885 molecules that contain nine or fewer heavy atoms. (The name QM9 is used here to be consistent with that of QM7.) The QM9 data set has been used in several studies^{18–20,37} and includes optimized structures and 18 different properties calculated using B3LYP/6-31g(2df,p). Below, QM9 is used to explore the degree to which model training on smaller molecules can transfer to larger molecules. The large amount of data in QM9 also allows more extensive studies on the degree to which inclusion of additional data improves the model performance.

## IV. MACHINE LEARNING

A variety of machine learning methods have been used to predict molecular properties, including linear regression, linear ridge regression (LRR), support vector regression, kernel ridge regression, regression trees, k-nearest neighbors, and neural net/deep learning.^{2,8,15} For this study, we have elected to use two standard machine learning methods,^{55} linear ridge regression (LRR) and kernel ridge regression (KRR) with the Gaussian and Laplacian kernels. These methods were chosen due to their relative simplicity, compared, for instance, to deep learning, and because past work has shown them to be effective on the data sets studied here.^{2,37}

The models involve many parameters which, during “model training,” are adjusted to obtain predictions that agree with a given set of examples. In addition, both LRR and KRR involve hyperparameters that serve to define further the model form. The wide range of values for the hyperparameters leads to a wide range of possible models. “Model selection” involves searching through these to find a model, or equivalently a set of hyperparameters, that leads to good performance. Here, we do an exhaustive search over a grid of hyperparameter values using cross validation as described below.

In LRR, there is one hyperparameter, *α*, which is the weight on the ridge term. When *α* is zero, the model produces the same result as linear regression. Finite values for *α* penalize larger regression parameters, and so using finite values for *α* favors simpler models (those with smaller regression parameters). For LRR, the following *α* values were scanned over: {10^{−5}, 10^{−3}, …, 10^{1}}. KRR retains the *α* parameter of LRR and adds two more. The choice of kernel for KRR can be considered as a discrete hyperparameter, and here the Gaussian ($\exp(-\gamma\,\lVert x-y\rVert_2^2)$) and Laplacian ($\exp(-\gamma\,\lVert x-y\rVert_1)$) kernels are included in the search. For each of these kernels, there is an additional continuous hyperparameter, *γ*, which sets the width of the kernel. The search over *α* and *γ* is exhaustive on a square grid with values {10^{−11}, 10^{−9}, …, 10^{−1}}.

Our analysis begins by randomly selecting 20% of the data to use as a test set. The remaining data become the training set which is used for both model selection and model training. Model selection is done using 5-fold cross validation (CV). In 5-fold CV, the training data are split into 5 equally sized sets or folds. One of these folds is then held out as a validation set, a model is trained on the remaining folds, and the mean absolute error (MAE) is computed for the validation set. This process is repeated 5 times, with each one of the folds serving as the validation set. This leads to 5 different MAEs, and the average of these MAEs is taken as a measure of model performance. The set of hyperparameters, *α* and *γ*, that leads to the best model performance is then selected as the final model. (See the supplementary material for the hyperparameters of each selected model.) Once the hyperparameters are selected, the model is trained on all data in the training set. This model is then applied to the test data, and the MAEs reported below are from predictions on these test data. Note that the test data are not used in model selection and training.
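As a concrete sketch of this selection procedure, the following uses scikit-learn (an assumption; the paper does not name its software) to run the exhaustive grid search with 5-fold CV described above, on toy data standing in for molecular feature vectors:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.kernel_ridge import KernelRidge

# Toy stand-in for molecular feature vectors and atomization energies.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = X.sum(axis=1) + 0.01 * rng.standard_normal(200)

# Hold out 20% as a test set; the rest is used for selection and training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Exhaustive grid over alpha and gamma for Gaussian (rbf) and Laplacian
# kernels, scored by mean absolute error under 5-fold cross validation.
grid = {
    "kernel": ["rbf", "laplacian"],
    "alpha": [10.0 ** k for k in range(-11, 0, 2)],
    "gamma": [10.0 ** k for k in range(-11, 0, 2)],
}
search = GridSearchCV(KernelRidge(), grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)          # refits on the full training set

test_mae = np.mean(np.abs(search.predict(X_test) - y_test))
```

After the search, `search.best_params_` holds the selected hyperparameters, and `test_mae` is the held-out error analogous to the MAEs reported below.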

## V. FEATURES

The primary goal of this work is a comparison of models developed using different types of molecular features. Ideally, identical molecular geometries should lead to identical feature vectors, implying that the features are invariant with respect to molecular translations and rotations, and to reordering of the atoms.

For models trained on smaller molecules to transfer to larger molecules, a number of other properties of the feature vectors are desirable. One such property is that the length of the feature vector be independent of molecular size. For the Coulomb matrix and BoB features, this property does not hold, as the number of elements in the feature vector scales quadratically with the number of atoms in the largest molecule in the data set. To support models of differently sized molecules, such feature vectors are expanded to a length that can accommodate the largest molecules in the data set, with zeros added to pad the extended regions for smaller molecules. For the remaining features considered below, the feature vectors have a length that depends on the molecular diversity, e.g., atom and bond types, but does not depend on the molecular size.

### A. Null model

A “null” model is used to provide a baseline measure of the difficulty of the prediction task. The null model always predicts the mean value of the training data. The MAE of the null model therefore reflects the spread of the data.
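A minimal version of this baseline, here using scikit-learn's `DummyRegressor` (any mean predictor would do):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

y_train = np.array([10.0, 20.0, 30.0, 40.0])
y_test = np.array([15.0, 35.0])

# The null model ignores the features entirely and predicts the training mean.
null = DummyRegressor(strategy="mean").fit(np.zeros((4, 1)), y_train)
pred = null.predict(np.zeros((2, 1)))   # both predictions are 25.0
mae = np.mean(np.abs(pred - y_test))    # 10.0, reflecting the spread of y
```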

### B. Coulomb matrix and Bag of Bonds

A class of features that has led to well-performing models of molecular properties is that derived from the Coulomb matrix^{1,2,4,15–18,20}

$$M_{ij} = \begin{cases} \frac{1}{2}\,Z_i^{2.4}, & i = j \\ \dfrac{Z_i Z_j}{\lVert \mathbf{r}_i - \mathbf{r}_j \rVert}, & i \neq j \end{cases} \qquad (1)$$

where *Z*_{i} and **r**_{i} are the atomic number and Cartesian position of the *i*th atom. This feature is appealing in that it summarizes the molecular structure in terms of Coulomb interactions between atoms. Alternative forms for the interaction, other than Coulomb interactions between bare nuclei, have been tried but were not found to enhance the model performance.^{37}
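A direct implementation of the Coulomb matrix as defined above (a sketch; the diagonal 0.5 Z_i^{2.4} and off-diagonal Z_i Z_j/|r_i − r_j| elements follow the standard definition, and distances are in whatever units the input coordinates use):

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal,
    Z_i * Z_j / |r_i - r_j| off the diagonal."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# H2 with a 0.74 Å bond: diagonal entries are 0.5, off-diagonal 1/0.74.
M = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])
```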

While the Coulomb matrix is invariant to molecular rotations and translations, it is not invariant to reordering of the atoms. A number of means have been explored to address the dependence on atom order, including sorting the matrix based on the magnitude of the norm of the rows, using randomly permuted sets of matrices, and using the eigenvalues of the matrix.^{2,4,15} (The results generated here do not sort the Coulomb matrix and simply arrange the lower triangle of the symmetric matrix into a vector.) A recent successful approach rearranges the Coulomb matrix into a “Bag of Bonds,” BoB, feature.^{37} In BoB, off-diagonal elements of the matrix are first divided into bags, based on the elements involved in the Coulomb interaction (CH, CC, CO, etc.). The values in each bag are then sorted from high to low values. The maximum size of each bag, across the molecules in the data set, is then determined, and zeros are added to each bag so that all molecules have bags of the same length. The bags are then concatenated to make a single feature vector that is invariant to reordering of the atoms. Because the BoB model is just a reordering of Coulomb matrix elements, the length of the vector still grows quadratically with the size of the largest molecule in the data set. BoB has also recently been extended to include three-body and higher terms,^{38} leading to feature vectors whose length grows as the third or higher power of the number of atoms in the largest molecule.
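The bagging, sorting, and zero-padding steps above can be sketched as follows (a hypothetical minimal implementation; the bag sizes must be fixed from the whole data set before featurization so that all molecules share one feature length):

```python
import numpy as np
from itertools import combinations

def bag_of_bonds(molecules, bag_sizes):
    """molecules: list of (elements, Z, R) tuples.
    bag_sizes: {unordered element pair: max entries across the data set}."""
    feats = []
    for elems, Z, R in molecules:
        bags = {key: [] for key in bag_sizes}
        # Off-diagonal Coulomb terms, binned by the element pair involved.
        for i, j in combinations(range(len(Z)), 2):
            key = tuple(sorted((elems[i], elems[j])))
            r = np.linalg.norm(np.asarray(R[i], float) - np.asarray(R[j], float))
            bags[key].append(Z[i] * Z[j] / r)
        vec = []
        for key in sorted(bag_sizes):          # fixed bag order for all molecules
            vals = sorted(bags[key], reverse=True)  # high-to-low within each bag
            vec.extend(vals + [0.0] * (bag_sizes[key] - len(vals)))  # zero-pad
        feats.append(vec)
    return np.array(feats)

water = (["O", "H", "H"], [8, 1, 1],
         [[0, 0, 0], [0.96, 0, 0], [-0.24, 0.93, 0]])
h2 = (["H", "H"], [1, 1], [[0, 0, 0], [0.74, 0, 0]])
feats = bag_of_bonds([water, h2], {("H", "H"): 1, ("H", "O"): 2})
```

H2 has no H–O pairs, so its H–O bag is entirely zero padding; both molecules nonetheless share the same 3-element feature vector.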

### C. Connectivity counts

These graph features summarize the bonding pattern of a molecule by counting occurrences of some pattern regarding the bonding between atoms. Each feature may be assigned a rank, based on the number of distinct atoms that are examined when testing for the presence of the pattern. Rank-1 features count atom types, such as an element or an element plus a coordination number, leading to a vector whose length is equal to the number of atom types. Rank-2 features count bond types (single, aromatic, double, and triple), leading to a vector whose length is the number of bond types. Rank-3 and rank-4 features count triplets and quartets of bonded atoms.

For rank 1, each element of the feature vector, $v_T$, holds the number of atoms in the molecule that have type *T*,

$$v_T = \sum_i \delta\big(A(i),\,T\big) \qquad (2)$$

where the sum is over atoms, *i*, *A*(*i*) is a function that returns the atom type, and *δ* is the Kronecker symbol. Two atom typing schemes are explored here, leading to two different rank-1 feature vectors denoted as **1** and **1**^{C}. For **1**, the atom type is defined by an element only. For **1**^{C}, the atom type is defined by a combination of the element and coordination number. The coordination number is taken as the number of bonds in which the atom participates.

For rank 2, each element of the feature vector, $v_T$, holds the number of bonds in the molecule that have type *T*,

$$v_T = \sum_{i<j} \delta\big(B(i,j),\,T\big) \qquad (3)$$

where the sum is over pairs of atoms and *B*(*i*, *j*) is a function that returns the bond type between atoms *i* and *j* or null if a bond is not present. Two bond typing schemes are explored here, leading to two different rank-2 feature vectors denoted as **2** and **2**^{B}. For **2**, the bond type is defined by the two elements participating in the bond. A bond is assumed to be present if the separation between atoms is less than the cutoffs listed in the supplementary material. For **2**^{B}, the bond type is defined by the two elements participating in the bond and the bond order, with bond orders assigned based on the bond length as discussed in the supplementary material.
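These counts are straightforward to compute from a bond list; a sketch using element-only atom types (feature **1**) and element-pair bond types (feature **2**):

```python
from collections import Counter

def rank1_counts(elements):
    """Feature vector 1: count of each element (rank-1 connectivity count)."""
    return Counter(elements)

def rank2_counts(elements, bonds):
    """Feature vector 2: count of bonds keyed by the unordered element pair."""
    return Counter(tuple(sorted((elements[i], elements[j])))
                   for i, j in bonds)

# Ethanol heavy-atom graph: C-C-O
elems = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
v1 = rank1_counts(elems)         # {'C': 2, 'O': 1}
v2 = rank2_counts(elems, bonds)  # {('C', 'C'): 1, ('C', 'O'): 1}
```

A fixed ordering of the atom and bond types then turns these dictionaries into vectors whose length depends only on the types present in the data set, not on molecular size.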

For rank 3, each element of the feature vector corresponds to a tuple of two bond types (*T*_{1}, *T*_{2}), with a value

$$v_{T_1,T_2} = \sum_{i,j,k} \delta\big(B(i,j),\,T_1\big)\,\delta\big(B(j,k),\,T_2\big) \qquad (4)$$

where the sum is over triples of atoms. The feature counts only situations where atoms *i* and *k* are bonded to a common atom, *j*. The summation is thus over all triples of atoms to which a bond angle would typically be ascribed. Higher-rank patterns are generalizations of Eq. (4) to higher order. For example, the rank-4 summation is

$$v_{T_1,T_2,T_3} = \sum_{i,j,k,l} \delta\big(B(i,j),\,T_1\big)\,\delta\big(B(j,k),\,T_2\big)\,\delta\big(B(k,l),\,T_3\big) \qquad (5)$$

with the summation being over all sets of four atoms to which a dihedral angle would typically be ascribed.
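The rank-3 count above can be sketched by enumerating, for each central atom, all pairs of its neighbors (each such pair is a triple to which a bond angle would be ascribed):

```python
from collections import Counter
from itertools import combinations

def rank3_counts(elements, bonds):
    """Count angle triples i-j-k (i and k both bonded to the central atom j),
    keyed by the unordered pair of element-pair bond types (T1, T2)."""
    nbrs = {}
    for i, j in bonds:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    btype = lambda a, b: tuple(sorted((elements[a], elements[b])))
    counts = Counter()
    for j, neigh in nbrs.items():            # j is the central atom
        for i, k in combinations(sorted(neigh), 2):
            counts[tuple(sorted((btype(i, j), btype(j, k))))] += 1
    return counts

# Water: H-O-H gives a single angle with bond types (H,O) and (H,O).
elems = ["O", "H", "H"]
bonds = [(0, 1), (0, 2)]
v3 = rank3_counts(elems, bonds)
```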

### D. Encoded Bonds

Here, we introduce a new set of features, Encoded Bonds, that combine desirable aspects of connectivity counts with desirable aspects of the BoB feature. The connectivity counts of Sec. V C have the desirable aspect of leading to a feature vector whose length is independent of the size of the molecules being studied and instead depends only on the types of atoms present in the molecules. BoB is a rank-2 feature that has the desirable aspect of including information from the 3D structure of the molecule. However, because BoB is a list of interatomic distances present in the molecule, the length of the BoB feature vector increases with the molecular size. The Encoded Bond features developed here incorporate information from the 3D structure of the molecule into a feature vector whose length is independent of molecular size. We include only rank-2 encodings, which summarize information regarding atom-atom distances. Extension to higher ranks would lead to a substantial increase in the length of the total feature vector and is not explored here.

Encoded Bond features may be viewed as generalizations of the rank-2 connectivity features of Sec. V C. Because the bond types of Sec. V C are based on atom-atom distances, the rank-2 connectivity counts of Sec. V C may be viewed as encoding atom-atom distances in a coarse-grained manner. To introduce the notation we will use for encoding, we rewrite Eq. (3) as

$$v_T = \sum_{j<k} \mathbf{1}\big(r_{jk} \in (r_T^{\mathrm{low}},\,r_T^{\mathrm{high}}]\big) \qquad (6)$$

where $r_T^{\mathrm{low}}$ and $r_T^{\mathrm{high}}$ are the lower and upper limits of the distance range that defines the bond type, *T*, for the corresponding pair of elements and *r*_{jk} is the distance between atoms *j* and *k*. Eq. (6) may be viewed as creating a histogram for each pair of elements. For the previously defined feature vector **2**, the histogram has just one bin and reports the number of bonds between those elements in the molecule. For feature vector **2**^{B}, the number of bins in the histogram is the number of allowed bond orders between those elements.

A simple approach to adding more information to the feature vectors is to add more bins to the histogram. The information encoded can also be expanded to include distances between non-bonded atoms. To accomplish this, we create a uniformly spaced grid,

$$d_i = d_{\min} + i\,\frac{d_{\max} - d_{\min}}{N_{\mathrm{grid}}}, \qquad i = 0, 1, \ldots, N_{\mathrm{grid}} \qquad (7)$$

For each pair of elements, we then get a set of *N*_{grid} features,

$$v_i = \sum_{j<k} \mathbf{1}\big(r_{jk} \in (d_i,\,d_{i+1}]\big) \qquad (8)$$

The feature vector of Eq. (8) has two disadvantages. The first is that a small change in *r*_{jk} can lead to a discontinuous change in the feature vector when *r*_{jk} is near a grid point, *d*_{i}. In addition, for values of *N*_{grid} that are large compared to the number of pairs of elements in the molecules of interest, the feature vector becomes sparse, making the learning prone to overfitting.

To overcome these disadvantages of the histogram-like features, we generalize Eq. (8) by replacing $\mathbf{1}(r_{jk} \in (d_i,\,d_{i+1}])$ of Eq. (8) with an encoding function, *f*,

$$v_i = \sum_{j<k} f(r_{jk},\,d_i,\,\beta) \qquad (9)$$

*f* returns a value in the range [0, 1] and may depend on a parameter, *β*, that is used to alter the smoothness of the function. We consider two general classes of encoding functions: probability density functions (PDFs) and cumulative distribution functions (CDFs). For both of these, we consider three types of functions: the normal distribution, the logistic distribution, and the spike distribution (Table I).

**TABLE I.** Encoding functions used in Eq. (9).

| Feature | Encoding function | *f*(*r*_{jk}, *d*_{i}, *β*) in Eq. (9) |
|---|---|---|
| **2**^{NP} | Normal PDF | exp(−x^{2}) |
| **2**^{NC} | Normal CDF | 1 + erf(x) |
| **2**^{LP} | Logistic PDF | exp(−x)/(1 + exp(−x))^{2} |
| **2**^{LC} | Logistic CDF | 1/(1 + exp(−x)) |
| **2**^{SP} | Spike PDF | $\mathbf{1}$(\|r_{jk} − d_{i}\| < ϵ/2) |
| **2**^{SC} | Spike CDF | $\mathbf{1}$(r_{jk} > d_{i}) |


For the PDF class of functions, the feature vector may be viewed as adding noise to each of the atom-atom distances, *r*_{jk}, with noise distributed according to the distribution of the encoding function. The CDF class of functions integrate the corresponding PDFs, leading to a feature vector that increases monotonically with distance and has a slope that indicates the presence of atom-atom separations near *d*_{i}.
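A minimal sketch of an Encoded Bond feature with the normal-PDF encoding, assuming x = β(r_{jk} − d_{i}) as the argument of f (the exact scaling is given in the supplementary material, so this form is an illustrative assumption). The element-pair list is fixed in advance so that the feature length is independent of molecular size:

```python
import numpy as np
from itertools import combinations

def encoded_bonds(elements, R, pairs, grid, beta=20.0):
    """Rank-2 Encoded Bonds sketch, normal-PDF encoding f = exp(-x**2) with
    x = beta * (r_jk - d_i). 'pairs' fixes the element-pair ordering so the
    feature vector length is len(pairs) * len(grid) for every molecule."""
    R = np.asarray(R, float)
    feats = {p: np.zeros(len(grid)) for p in pairs}
    for j, k in combinations(range(len(elements)), 2):
        key = tuple(sorted((elements[j], elements[k])))
        r = np.linalg.norm(R[j] - R[k])
        # Each atom pair contributes a smooth bump centered at its distance.
        feats[key] += np.exp(-(beta * (r - grid)) ** 2)
    return np.concatenate([feats[p] for p in pairs])

grid = np.linspace(0.2, 6.0, 100)        # 100 points between 0.2 and 6 Å
pairs = [("H", "H"), ("H", "O")]
v = encoded_bonds(["O", "H", "H"],
                  [[0, 0, 0], [0.96, 0, 0], [-0.24, 0.93, 0]],
                  pairs, grid)
```

The same molecule featurized with a larger data set's `pairs` list simply gains blocks of zeros, which is what allows training on small molecules and predicting on larger ones.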

Empirical results, discussed below, suggest that the model performance may be improved by limiting the geodesic distance, *G*, between atoms included when computing the feature,

$$v_i = \sum_{j<k} f(r_{jk},\,d_i,\,\beta)\;\mathbf{1}\big(G_{jk} \le G^{\max}\big) \qquad (10)$$

where *G*_{jk} is an integer specifying the number of bonds along the shortest path between atoms *j* and *k*. The last term limits the summation to pairs of atoms that are separated by up to *G*^{max} bonds. Other studies have taken a similar approach by setting a cutoff based on the Euclidean distance between atoms.^{8,9,40–42} In the current studies, the model performance was found to be somewhat better when using a cutoff based on the geodesic distance. Examples of Encoded Bond features for benzene can be seen in Fig. 1.
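The geodesic distances *G*_{jk} can be computed by breadth-first search on the bond graph; a sketch:

```python
from collections import deque

def geodesic_distances(n_atoms, bonds):
    """All-pairs shortest path lengths, in bond counts, via BFS from each atom."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    G = {}
    for s in range(n_atoms):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for t, d in dist.items():
            G[(s, t)] = d
    return G

# Benzene ring carbons: atoms on opposite sides of the ring are 3 bonds apart.
ring = [(i, (i + 1) % 6) for i in range(6)]
G = geodesic_distances(6, ring)
```

Pairs with `G[(j, k)] > G_max` would then simply be skipped in the Eq. (10) summation.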

The parameters specifying the Encoded Bond features were tuned by evaluating the performance on the QM7 data set. The supplementary material discusses the choice of *N*_{grid} and *β* of Eq. (9). For the results reported below, *β* was 20 Å^{−1} and the grid consists of 100 points between 0.2 Å and 6 Å. The choice of cutoff distance, *G*^{max} of Eq. (10), is based on the results in Fig. 2, which show how the model performance varies with *G*^{max} for various Encoded Bond features. The results for the Coulomb matrix and BoB features are also shown, in which case *G*^{max} is implemented by setting to zero those values of the Coulomb matrix that correspond to atoms separated by a geodesic distance greater than *G*^{max}. For nearly all features, the best performance is obtained with a *G*^{max} of 2 or 3, corresponding to distances over which angles and dihedrals would be assigned to the molecular structure. This is true of all features when LRR is used and of nearly all features when KRR is used. The exceptions are KRR with **2**^{SP} (results not shown as the MAE is greater than 50 kcal/mol) and KRR with the Coulomb matrix. Note, however, that when the Coulomb matrix elements are reordered to form BoB feature vectors, the optimal *G*^{max} is in the 2–3 range. The observation that *G*^{max} is in this range for a broad range of feature types and for both KRR and LRR models suggests that the atomization energy has a local character, whereby the energy of a given bond is primarily a function of its local environment. For the remainder of this work, *G*^{max} is set to 3 for Encoded Bond features. To allow comparison with past work, the BoB features do not limit *G*^{max}.

Full results from both LRR and KRR are tabulated in the supplementary material. The KRR models typically outperform LRR. For the Coulomb matrix and BoB, KRR outperforms LRR by 7.4 and 23 kcal/mol, respectively, on QM7. For connectivity counts and Encoded Bonds, the difference between LRR and KRR is considerably smaller, with KRR outperforming LRR by about 0.5–1.0 kcal/mol on QM7 for the best performing features. Because the KRR models outperform LRR, we limit the following discussion to KRR results.

## VI. RESULTS

Throughout this section, we summarize results by listing the features, with notation as in Sec. V, followed in parentheses by the MAE of a KRR model developed as in Sec. IV. For example, **2**^{B} (7.69 kcal/mol) indicates that a KRR model using a feature vector **2**^{B} leads to an MAE of 7.69 kcal/mol. The supplementary material contains an extensive list of results from both LRR and KRR models based on different feature sets. Selected results are discussed here.

We first consider models that use a single type of feature from Sec. V to predict the atomization energies of the QM7 data set. For connectivity counts, the best performing single features are **2**^{B} (7.69 kcal/mol) and **3**^{B} (6.36 kcal/mol). The use of the 3D molecular structure via Encoded Bonds’ features lowers the error substantially, with the best feature being **2**^{NP} (2.45 kcal/mol). The other encoding functions perform nearly as well, except for **2**^{SP} (146 kcal/mol) which is essentially a histogram and so gives a discontinuous and sparse feature vector. This suggests that smooth encoding functions lead to enhanced performance.

The model performance can be enhanced by concatenating features of different ranks. The effects of adding features of increasing rank are shown in Fig. 3, with selected results shown in Table II. The upper branch in Fig. 3 (solid lines) shows results when only connectivity counts are included and all feature vectors begin with **12**^{B}. These have the advantage of requiring only a line drawing or a SMILES string of the molecule. Addition of higher-rank connectivity counts improves the performance up through rank 4, with addition of rank 5 leading to a small degradation. Figure 3 also examines whether inclusion of bond order improves the performance, e.g., whether **3**^{B} (red line) or **3** (black line) leads to a lower MAE. For feature vectors including only connectivity counts, inclusion of bond order enhances the performance at rank 3, but degrades the performance at higher ranks. The lowest MAE for concatenated connectivity counts is **12**^{B}**3**^{B}**4** (3.40 kcal/mol).

**TABLE II.** MAE (kcal/mol) for atomization energy models. The last three columns are for models of the first 20, 50, and 133 thousand molecules of QM9.

| Feature | QM7 | CV | 20 | 50 | 133 |
|---|---|---|---|---|---|
| Null | 179.01 | 189.45 | 290.57 | 322.72 | 425.12 |
| Coulomb matrix | 3.37 | 3.83 | 43.78 | 66.56 | 106.09 |
| BoB | 2.40 | 2.43 | 9.68 | 22.21 | 29.99 |
| **1** | 14.58 | 15.46 | 18.20 | 21.65 | 20.99 |
| **2**^{B} | 7.69 | 7.52 | 10.42 | 16.67 | 15.73 |
| **3**^{B} | 6.36 | 6.84 | 8.26 | 15.39 | 15.73 |
| **12**^{B} | 6.88 | 6.37 | 8.75 | 11.11 | 11.17 |
| **12**^{B}**3** | 5.28 | 4.19 | 6.53 | 10.03 | 9.99 |
| **12**^{B}**3**^{B} | 4.51 | 4.25 | 6.34 | 8.36 | 8.59 |
| **12**^{B}**34** | 3.64 | 3.68 | 5.70 | 9.18 | 9.89 |
| **12**^{B}**34**^{B} | 4.13 | 3.81 | 6.10 | 9.90 | 10.51 |
| **12**^{B}**3**^{B}**4** | 3.40 | 3.56 | 5.43 | 8.64 | 9.42 |
| **12**^{B}**3**^{B}**4**^{B} | 3.87 | 3.69 | 5.92 | 9.53 | 10.25 |
| **2**^{LC} | 2.74 | 2.10 | 3.26 | 5.83 | 5.76 |
| **2**^{NP} | 2.45 | 1.65 | 2.65 | 5.18 | 4.71 |
| **12**^{LC} | 1.68 | 1.57 | 2.42 | 4.20 | 4.09 |
| **12**^{LC}**3** | 1.61 | 1.44 | 2.32 | 4.13 | 3.95 |
| **12**^{LC}**3**^{B} | 1.43 | 1.28 | 2.03 | 3.72 | 3.60 |
| **12**^{LC}**34** | 1.73 | 1.38 | 2.21 | 4.36 | 4.21 |
| **12**^{LC}**34**^{B} | 1.46 | 1.34 | 1.89 | 3.77 | 3.52 |
| **12**^{LC}**3**^{B}**4** | 1.56 | 1.24 | 1.98 | 3.93 | 3.77 |
| **12**^{LC}**3**^{B}**4**^{B} | 1.45 | 1.31 | 1.83 | 3.66 | 3.41 |
| **12**^{NP} | 1.45 | 1.14 | 2.03 | 3.84 | 3.51 |
| **12**^{NP}**3**^{B} | 1.19 | 1.05 | 1.63 | 3.37 | 3.05 |
| **12**^{NP}**3**^{B}**4**^{B} | 1.25 | 1.16 | 1.62 | 3.44 | 3.30 |


The lower branch (dashed lines) of Fig. 3 shows results from feature vectors that begin with **12**^{LC} and so include encoded bond distances from 3D molecular structures. For feature vectors that include Encoded Bonds, inclusion of rank 3 leads to substantial improvements in the model performance and inclusion of rank 5 degrades the performance. Inclusion of rank 4 either slightly improves or slightly degrades the performance, depending on both the type of rank-2 encoding employed and the property being predicted. In contrast to the connectivity only features, inclusion of bond order leads to better performance at both rank 3 and rank 4. The best performance on QM7 atomization energies is **12**^{NP}**3**^{B} (1.19 kcal/mol). Inclusion of rank 4, as **12**^{NP}**3**^{B}**4**^{B} (1.25 kcal/mol), slightly degrades the performance.
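The core idea of Encoded Bonds, turning a variable-length list of interatomic distances into a feature vector of fixed length, can be sketched by smearing each distance with a Gaussian onto a grid, one grid per element pair. The following is a minimal illustration; the grid range, Gaussian width, and pair list are made-up assumptions, not the parameters of the Encoded Bonds feature used in this work.

```python
import numpy as np
from itertools import combinations

def encoded_bonds(elements, coords, pairs, r_grid, width=0.2):
    """Fixed-length encoding of interatomic distances (Encoded Bonds sketch).

    Each element pair in `pairs` (sorted tuples, e.g. ("C", "O")) gets one
    Gaussian-smeared histogram over r_grid; the feature is their concatenation,
    of length len(pairs) * len(r_grid) regardless of molecule size.
    """
    coords = np.asarray(coords, dtype=float)
    feats = {p: np.zeros(len(r_grid)) for p in pairs}
    for i, j in combinations(range(len(elements)), 2):
        pair = tuple(sorted((elements[i], elements[j])))
        if pair not in feats:
            continue  # skip element pairs not present in the training vocabulary
        r = np.linalg.norm(coords[i] - coords[j])
        feats[pair] += np.exp(-((r_grid - r) ** 2) / (2 * width ** 2))
    return np.concatenate([feats[p] for p in pairs])

# Carbon monoxide at an illustrative 1.2 A separation
pairs = [("C", "O"), ("C", "C")]
r_grid = np.linspace(0.5, 4.0, 32)
x = encoded_bonds(["C", "O"], [[0, 0, 0], [0, 0, 1.2]], pairs, r_grid)
```

Because the length depends only on the element-pair vocabulary and grid size, the same vector layout accommodates molecules of any size, which underlies the transferability results discussed below.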

In the above models, the 3D geometry is used at rank 2 to obtain distances between the atoms. The similar performance obtained from **2**^{NP} (2.45 kcal/mol) and BoB (2.40 kcal/mol) indicates that the specific encoding of the interatomic distances does not substantially alter the performance on QM7 (although it does substantially alter some aspects of the performance, as explored below). There are two primary ways in which the use of the 3D geometry to obtain interatomic distances can enhance the performance. The first is that the 3D geometry gives precise values for the distances between bonded atoms. Because bond strength correlates with bond length, it is plausible that this could have a substantial effect. However, similar performance is obtained for **12**^{B}**34** (3.64 kcal/mol) and **12**^{LC}**34** with *G*_{max} = 1 (3.62 kcal/mol) (see the supplementary material). The only difference between these feature sets is that **2**^{B} includes the bond type, which is derivable from the molecular graph, while **2**^{LC} with *G*_{max} = 1 includes a precise value for the bond length. The similar performance of these models indicates that little benefit is gained from the correlation between bond length and bond strength. Rather, the enhanced performance from the 3D geometry comes from including distances between non-bonded atoms, as are included at rank 2 with *G*_{max} > 1.

We next use the QM9 data set to explore the extent to which models trained on smaller molecules may be transferred to larger molecules. KRR models, using the features of Table II, were trained to the 3993 molecules of the QM9 data set that have seven or fewer heavy atoms. Two-fold cross validation was used to select hyperparameters, and the average MAE of the two folds, computed using the selected hyperparameters, is listed in the column labeled CV (for cross validation) in Table II. The remaining columns list the MAE as molecules from the QM9 data set are added to the test data set. Because the molecules are sorted by size, the average molecular size increases as molecules are added. A useful quantitative measure of this increase is provided by the MAE of the null model, which reports on the spread of the atomization energies. The MAE of the null model for the entire QM9 data set is 2.4 times that of the smaller molecules of QM7.
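The hyperparameter selection described above can be sketched in a few lines: fit kernel ridge regression on one fold, evaluate on the other, and average. The Laplacian kernel, the hyperparameter grid, and the toy target below are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    # K[i, j] = exp(-||A_i - B_j||_1 / sigma)
    return np.exp(-np.abs(A[:, None, :] - B[None, :, :]).sum(-1) / sigma)

def krr_fit_predict(X_tr, y_tr, X_te, sigma, lam):
    # Solve (K + lam*I) alpha = y, then predict with the cross-kernel
    K = laplacian_kernel(X_tr, X_tr, sigma) + lam * np.eye(len(X_tr))
    alpha = np.linalg.solve(K, y_tr)
    return laplacian_kernel(X_te, X_tr, sigma) @ alpha

def two_fold_cv_mae(X, y, sigma, lam):
    """Average MAE over two folds, in the spirit of the CV column of Table II."""
    half = len(X) // 2
    folds = ((slice(0, half), slice(half, None)), (slice(half, None), slice(0, half)))
    maes = [np.abs(krr_fit_predict(X[tr], y[tr], X[te], sigma, lam) - y[te]).mean()
            for tr, te in folds]
    return float(np.mean(maes))

# Toy data: a smooth synthetic target standing in for a molecular property
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.sin(X.sum(axis=1))

# Select hyperparameters on a small illustrative grid
grid = [(s, l) for s in (1.0, 4.0, 16.0) for l in (1e-6, 1e-3)]
scores = {p: two_fold_cv_mae(X, y, *p) for p in grid}
best = min(scores, key=scores.get)
```

The selected pair `best` would then be used to train on the full small-molecule set before evaluating transfer to larger molecules.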

We first consider models that use only connectivity features. For simple models based on features that count elements (**1**) or bond types (**2**^{B}), the MAE for QM9 is 1.4 and 2.0 times that of QM7, respectively. This is less than the factor of 2.4 seen for the null model (Table II) and suggests that the average atomization energy per element, or average strength of a given bond type, does not change substantially with molecular size. Inclusion of higher-rank connectivity counts substantially improves the model performance, with **12**^{B}**3**^{B}**4** leading to the best performance on both QM7 and QM9. With **12**^{B}**3**^{B}**4**, the MAE of QM9 is 2.8 times that of QM7, only slightly larger than the 2.4 increase in the null model.

Not surprisingly, models that use the Coulomb matrix and BoB features do not transfer well to larger molecules. The MAEs of such models increase by over a factor of ten as one moves from small molecules to the entire QM9 data set. In addition, the MAEs on the full QM9 data set are larger than those from models based either on atom counts, **1**, or bond counts, **2**^{B}.

Good transfer to larger molecules is, on the other hand, obtained for models based on Encoded Bond features. For models based on the **2**^{LC} or **2**^{NP} feature alone, the MAE for QM9 relative to QM7 increases by factors of 2.1 and 1.9, respectively, which is comparable to or less than the 2.4 increase of the null model. This advantage is retained upon inclusion of connectivity features of different ranks. For both **12**^{NP}**3**^{B} and **12**^{NP}**3**^{B}**4**^{B}, the MAE of QM9 is 2.6 times that of QM7, which is only slightly larger than the 2.4 increase in the null model.

The degree to which various feature vectors are useful for creating models of other target properties is explored using the QM7b data set. Table III lists the MAE from BoB models developed here, along with results reported from the use of the Coulomb matrix.^{15} These are compared to models developed using **12**^{NP}**3**^{B}, which gives the best performance on QM7 atomization energies (Table II), and **12**^{NP}**3**^{B}**4**^{B}, which extends this to rank 4 and gives comparable results on QM7b. For the remaining studies reported below, which include larger molecules and larger data sets, we use the longer **12**^{NP}**3**^{B}**4**^{B} feature. The results show that the performance with **12**^{NP}**3**^{B} and **12**^{NP}**3**^{B}**4**^{B} is comparable to that of BoB; all of these models lead to predictions whose MAEs are at least as low as the typical deviation from experiment of the quantum methods used to generate the QM7b data set (QM in Table III).

TABLE III. Mean absolute errors for models of QM7b properties.

| Property | Null | 12^{NP}3^{B} | 12^{NP}3^{B}4^{B} | CM^{15} | BoB^{a} | QM |
|---|---|---|---|---|---|---|
| E (PBE0) (eV) | 7.69 | 0.04 | 0.05 | 0.16 | 0.08 | 0.09–0.23^{b} |
| α (PBE0) (Å^{3}) | 1.04 | 0.06 | 0.07 | 0.11 | 0.08 | 0.04–0.27^{c} |
| α (SCS) (Å^{3}) | 1.15 | 0.05 | 0.06 | 0.08 | 0.04 | 0.04–0.27^{c} |
| HOMO (GW) (eV) | 0.54 | 0.13 | 0.14 | 0.16 | 0.14 | … |
| HOMO (PBE0) (eV) | 0.49 | 0.12 | 0.12 | 0.15 | 0.13 | 2.08^{d} |
| HOMO (ZINDO) (eV) | 0.78 | 0.13 | 0.13 | 0.15 | 0.14 | 0.79^{e} |
| LUMO (GW) (eV) | 0.31 | 0.13 | 0.13 | 0.13 | 0.15 | … |
| LUMO (PBE0) (eV) | 0.53 | 0.09 | 0.09 | 0.12 | 0.11 | 1.30^{e} |
| LUMO (ZINDO) (eV) | 1.08 | 0.10 | 0.10 | 0.11 | 0.14 | 0.93^{e} |
| IP (ZINDO) (eV) | 0.78 | 0.16 | 0.18 | 0.17 | 0.19 | 0.20, 0.15^{f} |
| EA (ZINDO) (eV) | 1.17 | 0.09 | 0.11 | 0.11 | 0.15 | 0.16^{g}, 0.11^{f} |
| E*_{1st} (ZINDO) (eV) | 1.54 | 0.23 | 0.28 | 0.13 | 0.20 | 0.18^{g}, 0.21^{h} |
| E*_{max} (ZINDO) (eV) | 2.62 | 1.23 | 1.27 | 1.06 | 1.30 | … |
| I_{max} (ZINDO) (arb.) | 0.15 | 0.06 | 0.07 | 0.07 | 0.08 | … |


^{a} Calculations done in this work for comparison.

^{b} 0.15 eV from PBE0, MAE of formation enthalpy for the G3/99 set.^{56,57} 0.23 eV from PBE0, MAE of atomization energy for six small molecules.^{58,59} 0.09–0.22 eV from B3LYP, MAE of atomization energy from various studies.^{60}

^{c} 0.05–0.27 Å^{3} from B3LYP, MAE from various studies.^{60} 0.04–0.14 Å^{3} from MP2, MAE from various studies.^{60}

^{d} MAE from GW values.^{15}

^{e} ZINDO, MAE for a set of 17 retinal analogs.^{61}

^{f} B3LYP, MAE from various studies.^{60}

^{g} PBE0, MAE for the G3/99 set.^{56,57}

^{h} TD-DFT(PBE0), MAE for a set of 17 retinal analogs.^{61}

The learning rates of models based on various feature vectors are shown in Fig. 4. Similar learning rates are obtained across feature vectors, with the decrease in MAE for models based on simple bond counting, **2**^{B}, being similar to that of BoB, **2**^{LC}, and **12**^{NP}**3**^{B}**4**^{B}. Figure 5 provides additional insight by showing the MAE for both the training and test data. The point at which the training and test errors converge indicates the limit beyond which inclusion of additional data may no longer be expected to improve the model performance. The model based on the concatenated **12**^{NP}**3**^{B}**4**^{B} feature approaches this convergence significantly more slowly than the single **2**^{B} or **2**^{LC} features, suggesting that the improved performance of **12**^{NP}**3**^{B}**4**^{B} comes at the cost of a somewhat slower learning rate. For BoB, the training error is nearly zero, suggesting that inclusion of additional data has the potential to further reduce errors.
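The train/test convergence diagnostic of Fig. 5 can be reproduced on toy data by fitting models to growing training subsets and tracking both errors. The Laplacian kernel, target function, and hyperparameter values below are arbitrary illustrative choices, not the settings used for the figures in this work.

```python
import numpy as np

def krr_maes(X_tr, y_tr, X_te, y_te, sigma=3.0, lam=1e-6):
    """Fit Laplacian-kernel KRR and return (training MAE, test MAE)."""
    def kern(A, B):
        return np.exp(-np.abs(A[:, None, :] - B[None, :, :]).sum(-1) / sigma)
    alpha = np.linalg.solve(kern(X_tr, X_tr) + lam * np.eye(len(X_tr)), y_tr)
    mae = lambda X, y: float(np.abs(kern(X, X_tr) @ alpha - y).mean())
    return mae(X_tr, y_tr), mae(X_te, y_te)

# Toy learning curve: a smooth synthetic target, growing training subsets,
# and a fixed held-out test set
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = np.sin(X.sum(axis=1))
curve = [krr_maes(X[:n], y[:n], X[300:], y[300:]) for n in (25, 50, 100, 200)]
train_mae, test_mae = zip(*curve)
```

The gap between `train_mae` and `test_mae` narrowing with training-set size is the signature of convergence discussed above; a near-zero training error with a large gap, as for BoB, indicates that more data can still help.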

The performance and learning rates of models based on the **12**^{NP}**3**^{B}**4**^{B} feature vector, targeting a range of properties in the QM9 data set, are shown in Table IV. The performance and learning rate on all thermodynamic energies are roughly equivalent, giving an MAE of 1.6 kcal/mol. The heat capacity is also well predicted, with an MAE of 0.10 cal/(mol K), that is, 2% that of the null model. Models of the HOMO and LUMO energies, and the HOMO-LUMO gap, also lead to reasonable performance, with MAEs below 0.2 eV. The performance is considerably poorer on the rotational constants, A, B, and C. These are trivially derivable from the structure and so are not a relevant target for machine learning. However, the poor performance on these properties suggests that the features explored here may not be well suited for properties related to the global structure of the molecule, as opposed to the local environments of atoms and bonds. This may account for the poor performance on dipole moment, *μ*, which has a strong dependence on the overall geometry of the molecule and for which the **12**^{NP}**3**^{B}**4**^{B} model reduces the MAE of the null model by only about half. Significantly better performance is obtained for the polarizability, *α*, where the MAE of the **12**^{NP}**3**^{B}**4**^{B} model is 6.5% that of the null model. Although polarizability is a global property, it is primarily dependent on the total volume of the molecule and thus is likely well modeled as a sum of fragment properties. The dipole moment, on the other hand, depends on the detailed global arrangement of the atoms.

TABLE IV. Mean absolute errors for **12**^{NP}**3**^{B}**4**^{B} models of QM9 properties as a function of the number of sampled training molecules. Parenthesized digits give the uncertainty in the final digits.

| Property | Null | 500 | 1000 | 2000 | 4000 | 8000 | 16 000 |
|---|---|---|---|---|---|---|---|
| μ (D) | 1.19 | 0.92(1) | 0.86(1) | 0.83(1) | 0.76(1) | 0.68(0) | 0.63(0) |
| α (bohr^{3}) | 6.30 | 0.98(4) | 0.77(2) | 0.66(1) | 0.60(4) | 0.49(2) | 0.41(1) |
| HOMO (eV) | 0.44 | 0.25(0) | 0.22(1) | 0.19(0) | 0.17(0) | 0.15(0) | 0.12(0) |
| LUMO (eV) | 1.05 | 0.39(1) | 0.32(2) | 0.26(0) | 0.22(0) | 0.18(0) | 0.15(0) |
| Gap (eV) | 1.08 | 0.44(0) | 0.39(2) | 0.32(0) | 0.28(0) | 0.24(0) | 0.19(0) |
| ⟨R^{2}⟩ (bohr^{2}) | 202.52 | 86(3) | 71(2) | 59.41(91) | 50.40(54) | 44.94(35) | 36.31(19) |
| zpve (kcal/mol) | 16.63 | 0.29(1) | 0.21(1) | 0.17(0) | 0.14(0) | 0.11(0) | 0.09(0) |
| U_{0} (kcal/mol) | 188.47 | 7.28(33) | 4.67(10) | 3.64(13) | 2.75(5) | 2.08(5) | 1.58(2) |
| U (kcal/mol) | 190.18 | 7.31(33) | 4.69(10) | 3.63(11) | 2.77(5) | 2.09(5) | 1.59(2) |
| H (kcal/mol) | 191.55 | 7.32(33) | 4.69(10) | 3.63(11) | 2.77(5) | 2.09(5) | 1.59(2) |
| G (kcal/mol) | 173.30 | 7.08(31) | 4.58(7) | 3.57(11) | 2.73(4) | 2.06(5) | 1.56(2) |
| C_{v} [cal/(mol K)] | 4.95 | 0.32(0) | 0.24(0) | 0.20(0) | 0.16(0) | 0.12(0) | 0.10(0) |


## VII. DISCUSSION

This paper explores two classes of molecular representations for machine learning. The first class requires only the graph of the molecular structure and counts occurrences of bonding patterns in the molecule. The second class requires the 3D structure of the molecule. For this second class, an Encoded Bonds’ feature vector is introduced whose length depends on the diversity of molecules in the data set, e.g., the number of elements and bond types, but is independent of the size of the molecules.

The feature sets are tested using both LRR and KRR. For the Coulomb matrix and BoB features, KRR substantially outperforms LRR, with KRR lowering the MAE on the QM7 data set by over 7 kcal/mol. For the connectivity counts and Encoded Bonds’ features, the benefits of KRR over LRR are more modest, being only about 0.5 kcal/mol for the better performing feature vectors.

The best performance obtained with a model based solely on connectivity counts is 3.4 kcal/mol for QM7. Inclusion of 3D geometry substantially enhances the performance, with models based solely on Encoded Bonds giving 2.4 kcal/mol. Comparison of various models reveals that the performance enhancement from the use of the 3D geometry arises from inclusion of distances between non-bonded atoms, as opposed to precise values for distances between bonded atoms, i.e., the bond lengths. Merging connectivity counts with Encoded Bonds leads to additional performance enhancements, lowering the MAE on QM7 to 1.19 kcal/mol.

For models trained and tested on molecules of similar size, Encoded Bonds and BoB lead to similar performance. However, Encoded Bonds’ features allow models trained on small molecules to be applied successfully to larger molecules.

The current work focuses on KRR models, in which the features are used to compute distances between pairs of molecules. These distances are then passed through a kernel for use in the regression. The above results suggest that the features developed here are useful for computing distances between molecules when coupled to the standard kernels employed in machine learning. The development of kernels that better describe molecular similarity is left to future work. The degree to which the features introduced here lead to improved performance in deep learning and other ML models also remains to be explored, although we note that the models with the best current performance on the data sets studied here use KRR.

As part of this work, we also developed an open-source python library, MolML, that supports the creation of the feature sets investigated here and their use in model development.^{62}

## SUPPLEMENTARY MATERIAL

See supplementary material for the following: distances between atoms used to define the bond types of Sec. V C; details regarding the implementation of Encoded Bonds’ features of Sec. V D; and a listing of the MAE and hyperparameters from LRR and KRR models of the QM7 data set, obtained using a wide variety of different feature vectors.

## ACKNOWLEDGMENTS

The authors thank Haichen Li for helpful discussions. The authors would also like to thank Raghunathan Ramakrishnan for helpful discussions on details regarding the BoB model and the data sets studied here. This work was supported by the National Science Foundation under Grant No. CHE-1027985.