Equilibrium structures determine material properties and biochemical functions. Here, we propose to machine learn phase space averages, conventionally obtained by ab initio or force-field-based molecular dynamics (MD) or Monte Carlo (MC) simulations. In analogy to ab initio MD, our ab initio machine learning (AIML) model does not require bond topologies and, therefore, enables a general machine learning pathway to obtain ensemble properties throughout the chemical compound space. We demonstrate AIML for predicting Boltzmann averaged structures after training on hundreds of MD trajectories. The AIML output is subsequently used to train machine learning models of free energies of solvation using experimental data and to reach competitive prediction errors (mean absolute error ∼0.8 kcal/mol) for out-of-sample molecules—within milliseconds. As such, AIML effectively bypasses the need for MD or MC-based phase space sampling, enabling exploration campaigns of Boltzmann averages throughout the chemical compound space at a greatly accelerated pace. We contextualize our findings by comparison to state-of-the-art methods, resulting in a Pareto plot for the free energy of solvation predictions in terms of accuracy and time.

Structure determines function—a hallmark paradigm in the atomistic sciences, ranging from biologists studying protein functions based on X-ray structures to organic chemists discussing reaction mechanisms based on NMR measurements.12 The connection between structure R and a compound’s function is given through statistical mechanics averages A over the ensemble E of Boltzmann weighted configurations,

$E \rightarrow R \rightarrow A.$
(1)

The function of a molecule depends on the biological or chemical context, e.g., solubilities or binding affinities—all of which can be expressed as phase space averages A. Understanding this relation is of fundamental importance as the temperature-dependent balance of configurations dictates the biological function of proteins and their macroscopic behavior (think of egg white). Unfortunately, the quantitative prediction of thermal averages that minimize the free energy imposes major computational challenges due to the necessity of sampling phase space. Furthermore, since the existing experimental methods to obtain a compound’s structure R are cumbersome, several computational routes have been introduced.

However, the use of such methods to cover molecular structures throughout chemical space poses a monumental challenge. We also highlight the bigger picture of a chemical compound space13 (CCS) and its hierarchical structure given by composition, constitution, and conformation. The inherent curse of dimensionality of CCS means that even considering all possible molecules of a single fixed composition quickly results in a combinatorial explosion, as illustrated in Fig. 1. Thus, most approaches follow a divide-and-conquer strategy, addressing the combinatorial problem of CCS at individual levels. Still, the numerical complexity of studying such relationships using molecular dynamics14 (MD) or Monte Carlo15,16 (MC) is overwhelming, and to this day most methods with accurate yet rapid predictions suffer from the curse of conformer sampling. For instance, atomistic simulations study statistical mechanics (SM) ensembles through molecular dynamics, $E \xrightarrow{\mathrm{MD}} R \xrightarrow{\mathrm{SM}} A$, and are deeply intertwined with insights into biological functions. Ab initio molecular dynamics (AIMD) simulations not only allow studying molecules but also chemical reactions.2,17–19 However, they are much more costly than force fields1,5,6,20,21 because approximate quantum mechanical equations must be solved at every time step. To account for this, hybrid setups, $E \xrightarrow{\mathrm{MD}} R \xrightarrow{\mathrm{ML}} A$, using both atomistic simulation and machine learning (ML) have been introduced, uniting quantum mechanical equations with surrogate potentials learned on the fly.22–24 This helps mitigate some of the ab initio costs but may still require extensive MD sampling. These challenges have driven technological advancements of dedicated computer hardware, e.g., the supercomputer Anton,25 specifically designed to accelerate MD simulations. Conversely, large MD codes5,26 have been rewritten in CUDA27 just so that they can run on GPUs. Decentralized global computing network initiatives such as Folding@home28,29 also predominantly run MD.
MD also routinely consumes major fractions of the resources and energy budgets of high-performance computing centers, as recently seen with the Gordon Bell award to Car and co-workers for running MD on 100M atoms.30

FIG. 1.

Example of a chemical compound space (CCS) for Np = Ne = 32 protons (and electrons) with a hierarchy of composition, constitution, and conformation. Each level corresponds to distinct temperature regimes and is described by specific quantum chemical methods.1–10 Ab initio machine learning (AIML) can act on all three levels and does not require fixed constitutions but allows general ensemble predictions A using ML-based averaged structures R and free energy machine learning11 (FML).


To address the length and timescale problem of conformational space across the CCS, we have recently introduced free energy machine learning (FML), which relies on the averaged structure R as input to predict ensemble averages, such as the free energy:11 $R \xrightarrow{\mathrm{FML}} A$. Averaging the structure is necessary because ensemble properties inherently depend on multiple configurations, and as such using a single geometry per molecule introduces ambiguities into the ML model. Here, we propose to replace the preceding step, i.e., the generation of the averaged input through extensive molecular dynamics runs for any query compound, by an ab initio machine learning (AIML) model, $E \xrightarrow{\mathrm{AIML}} R$.

AIML makes use of the Graph-To-Structure37 (G2S) method to predict three-dimensional structures through chemical compound space. Training data and labels, however, are fundamentally different from G2S: Instead of using a single optimized conformer per molecule, AIML training data consist of averages over complete MD trajectories, enabling the prediction of thermodynamically averaged conformers. This accounts for the fundamentally important difference between a Boltzmann average and a single atomic configuration for ensemble property predictions, as previously discussed in free energy machine learning11 (FML).

In particular, we replace ensemble sampling with machine learning of ensemble averages A of an equilibrium property a(r, p) by

$A = \frac{1}{Z} \int a(\mathbf{r}, \mathbf{p})\, e^{-\beta E(\mathbf{r}, \mathbf{p})}\, \mathrm{d}\mathbf{r}\, \mathrm{d}\mathbf{p} \approx A_{\mathrm{ML}}(R),$
(2)

with $\beta = 1/(k_{\mathrm{B}}T)$, Z, and E being the inverse temperature, the partition function, and the total energy, respectively.38 The approximate equality of Eq. (2) is achieved by training an ML model $A_{\mathrm{ML}}$ that uses only the averaged conformer R with label values A that include the rigorous integral over the ensemble.
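The left-hand side of Eq. (2), discretized over sampled configurations, can be evaluated in a few lines. This is an illustrative sketch in reduced units ($k_{\mathrm{B}} = 1$) with a hypothetical two-state system, not the sampling protocol used in this work:

```python
import numpy as np

def boltzmann_average(values, energies, temperature):
    """Discretized phase space average of Eq. (2): sum_i a_i * w_i with
    w_i = exp(-beta * E_i) / Z over sampled configurations (k_B = 1)."""
    beta = 1.0 / temperature
    w = np.exp(-beta * (energies - energies.min()))  # energy shift for numerical stability
    w /= w.sum()                                     # normalization by the partition function Z
    return float(np.dot(values, w))

# toy two-state system: the low-energy state dominates at low temperature
a = np.array([1.0, 3.0])   # property values of the two states
E = np.array([0.0, 2.0])   # state energies (reduced units)
print(boltzmann_average(a, E, temperature=0.1))  # ≈ 1.0 (ground state dominates)
print(boltzmann_average(a, E, temperature=1e6))  # ≈ 2.0 (uniform high-T limit)
```

The surrogate $A_{\mathrm{ML}}(R)$ on the right-hand side of Eq. (2) replaces exactly this sampling step at prediction time.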

AIML is a purely ML-based framework that properly accounts for the underlying Boltzmann statistics and can subsequently be used to generate the appropriate input for FML model-based predictions. The goal of our work is not to build an ML model that perfectly reflects thermodynamic expectation values but to construct a surrogate model that can predict these integrals with high accuracy and speed. By including Boltzmann averaged conformers, we make sure to include a canonical mapping of the underlying ensemble of each molecule to the ensemble property. By training a second FML model on experimental values, we ensure that our mapping AML(R) from the averaged structure also includes contributions from the complete ensemble as defined in the proper phase space integral A. Our numerical results, i.e., the systematic improvement of the models’ accuracy with training set size, indicate that our assumption [Eq. (5)] is sufficiently valid for the dataset we studied.

After briefly introducing AIML in the following, we will demonstrate its applicability for the prediction of aqueous free energies of solvation of out-of-sample molecules without having to perform explicit MD simulations. For training, however, extensive MD trajectories at corresponding temperatures are necessary, as well as experimental measurements of solvation reference energies. Finally, we provide an overview of the efficiency of AIML in the context of alternative state-of-the-art solvation methods. We find that AIML offers respective speedups by four to seven orders of magnitude when compared to classical or ab initio MD-based predictions of free energies of solvation.

For the previously published free energy machine learning11 (FML) approach, all sampled geometries r first had to be transformed to representation vectors x(r) before the ensemble average X over the representation vectors,

$X = \frac{1}{Z} \int x(\mathbf{r})\, e^{-\beta E(\mathbf{r})}\, \mathrm{d}\mathbf{r},$
(3)

could be computed. The key advantage of AIML is that instead of explicitly sampling the conformer space, a single AIML evaluation is required to predict a surrogate vector x(R) evaluated for the system average R to replace X [see Fig. 2(d)].
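The distinction between the two orders of averaging can be made concrete with a toy ensemble. Below, the FML quantity ⟨x(r)⟩ (average of per-snapshot representation vectors) is compared with x(R) evaluated once at the mean structure; the "representation" here is simply the vector of pairwise distances, and the jittered three-atom trajectory is a hypothetical stand-in for an MD run:

```python
import numpy as np

def pairwise_distances(r):
    """Flattened upper-triangle distance vector of one configuration (N, 3)."""
    d = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    i, j = np.triu_indices(len(r), k=1)
    return d[i, j]

rng = np.random.default_rng(0)
# toy "trajectory": three atoms jittering around a reference geometry
ref = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
traj = ref + 0.05 * rng.standard_normal((200, 3, 3))

# FML: average the representation over all snapshots, <x(r)>
X_fml = np.mean([pairwise_distances(r) for r in traj], axis=0)
# AIML shortcut: evaluate the representation once at the average structure, x(R)
x_aiml = pairwise_distances(traj.mean(axis=0))
print(np.max(np.abs(X_fml - x_aiml)))  # small for a narrow ensemble
```

For a single narrow basin the two vectors nearly coincide, which is why the single evaluation x(R) can serve as a surrogate for the explicit average X.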

FIG. 2.

Centered histograms and standard deviation σd of all distances between ten non-hydrogen atoms of aspirin extracted from an ab initio molecular dynamics (AIMD) trajectory.31 (a) Conventional AIMD and ab initio ML (AIML) map the ensemble E to averaged structure R to the statistical mechanics (SM) average A. (b) Bold labeled index pairs (i, j) define a bond topology. (c) Sketch of two principal components PC1, PC2 of ML-based representations32–35 on a fictitious free energy surface (d) corresponding to conformers represented by a disconnectivity graph.36 Instead of MD average representations11 (FML) X, we propose AIML predicted representations of averaged conformers x(R).


Graph-To-Structure (G2S) exploits implicit correlations among relaxed structures in training datasets to infer interatomic distances for out-of-sample compounds across the chemical space. G2S effectively enables direct reconstruction of three-dimensional coordinates and, thereby, allows circumventing conventional energy optimization. G2S can reach an accuracy on par with or better than conventional structure generators. As query input, G2S requires only bond-network and stoichiometry-based information. G2S learns the direct mapping from a chemical graph to the structure recorded in the training dataset. For the prediction of new structures, only molecular connectivity is needed, which can be provided, e.g., via SMILES39 or SELFIES.40 The G2S machines predict all pairwise distances. The full 3D geometry is then reconstructed using DGSOL41 for heavy atoms and a Lebedev sphere optimization scheme for hydrogen atoms.
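The final reconstruction step can be illustrated with classical multidimensional scaling, which recovers coordinates (up to rotation and translation) from a complete Euclidean distance matrix. This is a simple stand-in for DGSOL, which additionally handles the noisy and mutually inconsistent distances produced by an ML model:

```python
import numpy as np

def coords_from_distances(D):
    """Classical multidimensional scaling: recover 3D coordinates
    (up to rotation/translation) from a Euclidean distance matrix.
    A simplified stand-in for a distance-geometry solver such as DGSOL."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered coordinates
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:3]         # keep the three largest eigenvalues
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

# round-trip check on a toy four-atom geometry
R = np.array([[0.0, 0, 0], [1.5, 0, 0], [0, 1.5, 0], [0, 0, 1.5]])
D = np.linalg.norm(R[:, None] - R[None, :], axis=-1)
R_rec = coords_from_distances(D)
D_rec = np.linalg.norm(R_rec[:, None] - R_rec[None, :], axis=-1)
print(np.allclose(D, D_rec, atol=1e-8))  # True: all distances are reproduced
```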

An essential step in developing our AIML model was extending the previous Graph-To-Structure37 (G2S) method via the introduction of an averaged geometry R. This enables a computationally efficient ML-based map between ensemble and free energies and addresses the conformer sampling bottleneck. AIML provides an effective representation of the conformer ensemble by mapping all degrees of freedom to a single averaged structure, approximating the ensemble averaged structure with the corresponding AIML representation vector x (cf. Fig. 2). A schematic overview of the steps for AIML training and prediction is given in Fig. 3. As we will discuss in the following, two steps are needed to combine both methods: (i) the use of Boltzmann weighted distance matrices D for training and (ii) the use of AIML average conformer predictions as input for an ensemble average model (see Fig. 3, right). The required training steps for the AIML structure prediction use Boltzmann averaged intramolecular distance matrices. More specifically, as a first step, the molecule is transformed to a graph-based representation37 g, with the average distance matrix D as training labels, resulting in separate ML models for heavy-atom pairs as well as for heavy-atom–hydrogen pairs. Before entering the next step of free energy prediction, AIML is trained with the maximal number of molecules (N = 512) to construct the average training and test set conformers. This process is repeated for the complete dataset with consistent training–test splits between the AIML structure and free energy predictions. Next, a machine (see Sec. II C) for learning free energies is trained using AIML predicted average geometries. Finally, AIML can be used for out-of-sample predictions (see Fig. 3). Based on the molecular graph, the Boltzmann weighted distance matrix D is predicted. Next, a distance geometry solver41 (DGSOL) is used to convert the distance matrix to three-dimensional coordinates.
The predicted average conformer then serves as a link between the graph and corresponding three-dimensional geometry (see Fig. 2). Subsequently, the average predicted conformer R is transformed to a single Bag-of-Bonds42 (BoB) representation vector x(R) and used to predict the free energy.
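A minimal sketch of a Bag-of-Bonds-style vector is given below. The published BoB42 descriptor fixes bag sizes across the whole dataset; here, the pairwise Coulomb terms, the hand-chosen bag sizes, and the water geometry are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def bag_of_bonds(charges, coords, bag_sizes):
    """Simplified Bag-of-Bonds sketch: group pairwise Coulomb terms
    Z_i*Z_j/r_ij by element pair, sort each bag in descending order, and
    zero-pad to fixed per-bag lengths so that vectors are comparable
    across molecules. bag_sizes maps element-pair tuples to bag lengths."""
    bags = {pair: [] for pair in bag_sizes}
    for i, j in combinations(range(len(charges)), 2):
        pair = tuple(sorted((charges[i], charges[j])))
        r = np.linalg.norm(coords[i] - coords[j])
        bags[pair].append(charges[i] * charges[j] / r)
    vec = []
    for pair in sorted(bag_sizes):
        entries = sorted(bags[pair], reverse=True)
        vec.extend(entries + [0.0] * (bag_sizes[pair] - len(entries)))
    return np.array(vec)

# water as a toy example (hypothetical averaged geometry, Angstrom)
Z = [8, 1, 1]
R = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
x = bag_of_bonds(Z, R, bag_sizes={(1, 1): 1, (1, 8): 2})
print(x)  # one H-H entry followed by two sorted O-H entries
```

Because each bag has a fixed length and sorted entries, the resulting vector is invariant to atom indexing, which is what makes it usable as KRR input.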

FIG. 3.

To generate ab initio machine learning (AIML) training data, the conformer space is sampled to obtain Boltzmann weighted average distance matrices for the given molecular graphs. AIML predicts the averaged distance matrix, which is then converted to a three-dimensional geometry R. Finally, the ensemble average A is predicted using free energy machine learning11 (FML).


ML, in particular of solutions to the Schrödinger equation, i.e., quantum machine learning43–47 (QML), allows navigating the chemical compound space (CCS) with high efficiency and precision. ML has become a popular avenue for the study of materials science, with applications such as the prediction of atomization energies,48,49 crystal formation energies,50 carbene properties,51 excited states,52,53 oxidation states,54 nuclear magnetic resonance spectra,55 reaction barriers,56,57 magnetic systems,58 and charge transfer59 or molecular fragments.60,61 Many recent approaches are based on individual levels of CCS, i.e., composition62,63 and constitution48,64–66 (see Fig. 1). Moreover, ensemble properties, such as the free energy of solvation,11,68–73 melting points,24,74,75 magnetic anisotropy tensors,76 and phases of water,78,79 have previously been addressed with ML.

Our ML approach is based on kernel–ridge regression80 (KRR), a supervised learning method that approximates arbitrary functional relationships between input data, given as molecular representations x, and properties A(x). Using KRR, x is mapped into a high-dimensional feature space, rendering the regression problem linear. A remarkable result of KRR81,82 is that the mapping does not need to be carried out explicitly; instead, only the distances between representations xi and xj are computed, e.g., using Gaussian kernel functions,

$k(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right),$
(4)

that measure the similarity between two compounds i and j, resulting in the kernel matrix K, where σ is the kernel-width hyperparameter. We use KRR to predict the vector of all averaged interatomic distances Dq of a query compound q and a distance geometry solver41 (DGSOL) for subsequent reconstruction of the three-dimensional geometry.
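Eq. (4) translates directly into code; a vectorized sketch with toy two-dimensional representation vectors:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) between
    two sets of representation vectors stored as rows, Eq. (4)."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X, X, sigma=1.0)
print(np.diag(K))  # identical compounds have similarity exactly 1
```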

The vector Dq contains all averaged distances between atoms in the query molecule. The average conformer prediction R(xq = gq) of a query molecule q, represented by a graph37 gq, is given by

$R(g_q) = \mathrm{DGSOL}(D(g_q)),$
(5)

where the distance prediction is as follows:

$D_q = K \alpha.$
(6)

Here, the kernel matrix K is evaluated between query and training compounds gi and gq with regression coefficients α. The optimal regression coefficients α are obtained by solving a set of equations,

$\alpha = (K + \lambda I)^{-1} D,$
(7)

where the vector D contains all distances between atoms for each of the training molecules and λ is a regularization parameter. The ensemble property prediction is given by

$A_q = \tilde{K} \tilde{\alpha}.$
(8)

As before, for training K̃ is evaluated between all training compounds now using molecular representation vectors x(R):

$\tilde{\alpha} = (\tilde{K} + \lambda I)^{-1} A,$
(9)

where the vector A contains the values of the ensemble property in the training set. Learning curves quantify the model prediction error, often measured as mean absolute error (MAE), against the number of training samples N, and are key to understanding the efficiency of ML models. It is generally found80 that they are linear on a log–log scale,

$\log(\mathrm{MAE}/\mathrm{unit}) \approx I - S \log(N),$
(10)

where I is the initial error and S is the slope indicating model improvement given more training data.
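Eqs. (6)–(9) amount to a standard KRR train/predict cycle. The following self-contained sketch fits a smooth one-dimensional toy function in place of the distance or free energy labels; the values of σ and λ are illustrative choices, not the hyperparameters used in this work:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix between row-wise representation vectors, Eq. (4)."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def krr_train(K, y, lam):
    """Solve (K + lam*I) alpha = y for the regression coefficients, Eq. (7)/(9)."""
    return np.linalg.solve(K + lam * np.eye(len(K)), y)

rng = np.random.default_rng(1)
X_train = rng.uniform(-3.0, 3.0, size=(200, 1))
y_train = np.sin(X_train[:, 0])     # toy stand-in for training labels D or A

sigma, lam = 0.5, 1e-6
alpha = krr_train(gaussian_kernel(X_train, X_train, sigma), y_train, lam)

# out-of-sample prediction, Eq. (6)/(8): evaluate K between query and training set
X_test = np.array([[0.5], [1.0]])
y_pred = gaussian_kernel(X_test, X_train, sigma) @ alpha
print(np.abs(y_pred - np.sin(X_test[:, 0])).max())  # small interpolation error
```

The same solve-then-multiply pattern is used twice in AIML: once for the distance labels D and once, with representation vectors x(R), for the ensemble property A.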

Within AIML, we view the averaged structure itself as an ensemble property, $R = A$ for the choice $a(\mathbf{r}, \mathbf{p}) = \mathbf{r}$ in Eq. (2), representing the connection to the aforementioned overarching theme of structure determining function.12 In addition, the thermal equilibrium structure is relevant for NMR spectroscopy, which accounts for protein flexibility by yielding time averaged structures equivalent to R due to the ergodic theorem.38 The purely ML-based implementation bypasses routinely encountered sampling issues, de facto replacing MD simulations with predicted averages as ensemble fingerprints for subsequent property prediction.

We first use kernel–ridge regression80 (KRR) to predict the symmetric matrix of averaged interatomic distances of a query compound q, i.e., $D_q = K \alpha$, which contains all inferred averaged distances of atoms in the query molecule. K and α correspond to the kernel matrix and training weights obtained for a training set consisting of MD trajectories with averaged interatomic distances as labels. In analogy to Graph-To-Structure37 (G2S), we subsequently rely on the distance geometry solver41 (DGSOL) for reconstruction of the three-dimensional structure R, as well as on graph-based representations gq. To exemplify the AIML approach, consider the ab initio molecular dynamics (AIMD) trajectory published in Ref. 31 of the aspirin molecule at 300 K, resulting in the atomic distance histograms shown in Fig. 2(a). In order to establish a graph-based representation to replace ab initio MD with ab initio ML (AIML), as illustrated in Fig. 2(b), it is necessary to assign bonds. This is straightforward using the distance histogram, as covalently bonded atoms will not move far from each other [see Fig. 2(c)]—a concept that can be generalized via coarse-graining.21,84,85 While AIML requires a suggested molecular graph, it is not restricted to a single fixed bond topology but allows adapting the molecular graph to the relevant degrees of freedom, depending on temperature or the environment of the molecule. Thus, AIML can include the formation and breaking of bonds (cf. Fig. 1), corresponding to adding or erasing a one in the bond topology matrix gq (cf. Fig. 2). Note that the predicted pairwise distance matrix accounts not only for nearest-neighbor effects but also for the complete many-body description since it includes all cross combinations of atomic distances.
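The bond-assignment heuristic described above (covalently bonded atoms stay close and fluctuate little) can be sketched as follows; the distance and fluctuation thresholds as well as the jittered toy trajectory are illustrative assumptions, not the values used for aspirin:

```python
import numpy as np

def infer_bonds(traj, d_max=1.8, sigma_max=0.1):
    """Assign a bond between atoms i and j when their trajectory-averaged
    distance is short and its standard deviation sigma_d is small.
    traj has shape (n_frames, n_atoms, 3); thresholds are in Angstrom."""
    d = np.linalg.norm(traj[:, :, None, :] - traj[:, None, :, :], axis=-1)
    mean_d, sigma_d = d.mean(axis=0), d.std(axis=0)
    g = ((mean_d < d_max) & (sigma_d < sigma_max)).astype(int)
    np.fill_diagonal(g, 0)   # no self-bonds
    return g

rng = np.random.default_rng(2)
# toy triatomic chain: pairs 0-1 and 1-2 are covalently bonded, 0-2 is not
ref = np.array([[0.0, 0.0, 0.0], [1.4, 0.0, 0.0], [2.8, 0.4, 0.0]])
traj = ref + 0.02 * rng.standard_normal((500, 3, 3))
g = infer_bonds(traj)
print(g)  # adjacency matrix of the inferred bond topology
```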

Using the graph as the representation for constructing kernels, the AIML model then learns the center of each off-diagonal element in the distance histogram as a label. Next, the AIML predicted distances are used to reconstruct the average conformer. Subsequently, the ML representation vector x(R) is computed. Therefore, AIML allows exchanging the order of average evaluation compared to the previous FML11 approach (illustrated in Fig. 2), resulting in a dramatic reduction of computational costs.

AIML proposes a different paradigm by connecting all hierarchies of CCS with ensemble properties in a single ML-based framework. Because of its generality, AIML does not require a priori information about bonds but only averaged atomic distances. In analogy to the hierarchy of CCS, AIML has several special cases with fundamental physical interpretations (see Fig. 1): At temperatures higher than most bond energies, AIML operates on atomic clusters, corresponding to a topology matrix that (mostly) contains zeros, i.e., the composition. At moderate temperatures, bonds exist but may occasionally break, corresponding to removing or adding a one in the topology matrix. In this case, multiple molecular graphs can be extracted and AIML predicts the constitution averaged structure.

In this section, we demonstrate the usefulness of AIML for the problem of accurate predictions of free energies of solvation. In particular, we focus on experimental free energies of solvation of 642 charge-neutral small to medium-sized bioorganic molecules, as encoded in the FreeSolv database.88 Solvation free energies5,6,90–99 are of fundamental importance for chemistry, and the FreeSolv database has become a popular benchmark for performance testing of novel models.24,68–74

To use AIML to predict averaged structures and FML to predict free energies of solvation for out-of-sample molecules, we first trained AIML models as described above, using molecular graphs as input and averaged distances as labels.

These were obtained from extended force-field-based MD runs and density functional theory (DFT) conformer sampling with the Def2TZVPD-FINE101–103 basis set and the Becke–Perdew86,87 (BP) functional in the gas phase (see Sec. IV A for details). Note that we neglect the effect of water on the phase space of the solute (e.g., through hydrogen bonds). We believe that this aspect warrants further in-depth investigation in future studies.

Depending on the temperature (cf. Fig. 1), some degrees of freedom are not averaged out by the phase space integral [Eq. (2)], restricting the domain to certain local basins of the total free energy. These remaining degrees of freedom can be identified as described in Fig. 2(a). We demonstrate the idea of AIML for ambient temperatures in the gas phase, for which conventional molecular graph topologies as in biochemistry hold—without any loss of generality. Within such a regime, we can safely assume that any distance histogram matrix is consistent with one topology, which represents the coarse-grained backbone that is not averaged out by the phase space integral. Of course, this approach could also be applied to any other set of conditions, presuming that there is some way to easily infer valid topologies as a function of conditions (such as temperature and composition). The latter is a separate problem that goes beyond the scope of this work.

As numerical evidence of the functionality of the AIML model, we present in Fig. 4(a) prediction errors of the three-dimensional Boltzmann averaged structures R as a function of the training set size (aka learning curves80,104,105).

Numerical results shown in Fig. 4(a) indicate a systematic linear improvement on a log–log scale106 as a function of the size of the training set Ns, that is, the number of averaged training structures. Note that N stands for the number of training points for free energy values and that we have trained and evaluated two different machines one after the other, for structure and free energy prediction, respectively. For the maximal training set size considered (Ns = 512 molecules, 80% of FreeSolv), the average root-mean-square deviation108,109 (RMSD), a measure of the distance between structures, has decayed to only 0.80 Å for the predicted FF and 0.82 Å for the DFT averaged conformers; the slope of the learning curve, however, indicates that learning has not yet saturated. We find structure prediction of large molecules to be a particularly hard problem, i.e., on average the RMSDs increase with the size of the structure. This was also observed for a random subset of the GDB17 molecular database110 [see the supplementary material, Fig. 7(b), for a scatter plot of molecular size vs RMSD indicating a rough correlation].
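The RMSD used here as the structural error measure is conventionally evaluated after optimal superposition of the two conformers; a sketch using the Kabsch algorithm:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two conformers (N, 3) after removing translation and
    applying the optimal (Kabsch) rotation of P onto Q."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against improper rotations
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))

rng = np.random.default_rng(3)
P = rng.standard_normal((10, 3))
theta = 0.7
Rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
Q = P @ Rot.T + np.array([1.0, 2.0, 3.0])  # rotated and translated copy
print(kabsch_rmsd(P, Q))  # ≈ 0: superposition removes the rigid-body motion
```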

FIG. 4.

Learning curve of the root-mean-square deviation (RMSD), a measure of structural distance, of ab initio machine learning (AIML) three-dimensional structure predictions as a function of the number of training structures Ns, using the GAFF2 force field5,6 and density functional theory (DFT) with the Becke–Perdew86,87 (BP) functional for training structure sampling (a). Predicted average conformers R show the improvement of structure prediction along the learning curve (at Ns = 32, 128, 512). The mean absolute error (MAE) of predicted free energies F of the FreeSolv88 database as a function of N for free energy machine learning (FML) (b) using MD sampling11 vs AIML (no sampling) and the Bag-of-Bonds42 (BoB) representation. The AIML Ns = 32 model was trained with only 32 structures for conformer prediction.


Note that the corresponding learning curve predicting optimal (not Boltzmann averaged) distances is less steep for the molecules in FreeSolv and exhibits a higher offset (see the supplementary material, Fig. 4). This could be because learning thermal averages is less ambiguous for AIML than learning potential energy minima, as is the case for G2S. This also holds for the heavy atom–hydrogen distances (see the supplementary material, Table 1 and Fig. 1). From a different point of view, by computing the Boltzmann average over distances, the conformer flexibility is effectively integrated out, thus simplifying learning compared to optimized structures.

Next, we demonstrate the learning efficiency [see Fig. 4(b)] of AIML for free energy prediction based on previously predicted structures R. The main disadvantage of the preceding free energy machine learning11 (FML) model was that it requires explicit conformer sampling for free energy prediction. The novel advantage of AIML is that no sampling is required. Instead, after prediction of the average conformer R, the free energy prediction is based on the ML representation vector x(R), where x is the ML-based Bag-of-Bonds42 (BoB) representation. It is encouraging that FML (requiring explicit MD sampling) and AIML exhibit similar learning curves, achieving mean absolute errors (MAE) of 0.68 and 0.82 kcal mol−1, respectively, as shown in Fig. 4(b). This indicates that the performance of distance-based representations in conjunction with AIML is fairly robust, showing only a 0.14 kcal mol−1 loss of accuracy compared to running MD simulations. Using predicted averaged DFT conformers, we obtain roughly the same accuracy of 0.84 kcal mol−1 (see the supplementary material, Fig. 8). This is consistent with our previous assessment of out-of-sample AIML conformer RMSDs using FF and DFT, which also resulted in comparable errors for both methods. To illustrate the importance of structure prediction for subsequent property prediction, we have added a learning curve in Fig. 4(b) using an AIML model with only Ns = 32 average conformers for training the structure prediction. This model exhibits a much smaller learning rate. In general, we find that better average structure prediction leads to improved subsequent FML models (discussed in more detail in the supplementary material, Fig. 5).
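The reported learning behavior follows Eq. (10): the offset I and slope S can be extracted from measured errors by a least-squares fit in log–log space. The MAE values below are placeholders for illustration, not the measured data of Fig. 4:

```python
import numpy as np

# illustrative learning-curve data (placeholder MAEs in kcal/mol)
N = np.array([32, 64, 128, 256, 512])
mae = np.array([2.1, 1.6, 1.25, 0.98, 0.80])

# least-squares fit of log(MAE) = I - S * log(N), cf. Eq. (10)
slope, intercept = np.polyfit(np.log(N), np.log(mae), 1)
S, I = -slope, intercept
print(f"offset I = {I:.2f}, slope S = {S:.2f}")  # S > 0: error decays with N
```

A positive slope S on the log–log scale is the signature of systematic, not-yet-saturated learning discussed above.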

We find AIML to perform consistently better when the training distances result from Boltzmann sampling instead of optimized structures—similar to what we noted above for the structure prediction (see the supplementary material, Fig. 9). Even more surprising is the observation that training the free energy prediction with AIML predicted structures resulted in slightly better models than training with the ground-truth averaged conformers resulting from MD. This indicates that AIML effectively smooths the conformer space by isolating the most important degrees of freedom, thus facilitating structure-based regression of thermal averages.

To gain a more comprehensive idea of AIML’s importance to the field, we have assessed its cost–accuracy trade-off. Testing AIML on the FreeSolv88 database, we have measured average prediction times of 41 ms/molecule. These prediction times are dominated by the structure reconstruction task (40 ms), while only 1 ms is required to yield the free energy estimate (on a single core of an AMD EPYC 7402P chip). For comparison, the corresponding prediction based on a classical force-field MD simulation protocol would have consumed three to four orders of magnitude more time, not to mention the costs associated with quantum chemistry-based ab initio MD. To gain a comprehensive overview of the field, we have performed solvation free energy calculations for all of the FreeSolv molecules using the following methods (all computed free energies are available, see Sec. VI; MAEs are listed in the supplementary material, Table 2):

  1. Solvation model based on density93 (SMD) at M06-2X111/Def2-TZVPP101–103 (timings for SMD as implemented in Gaussian112 are not to be published)

  2. COSMO-RS-B1 and COSMO-RS-B2 referring to COSMO-RS94,95,113,114 with Def2TZVP101–103 and Def2TZVPD-FINE101–103 basis sets, respectively, and Becke–Perdew86,87 (BP) functional in the gas phase

  3. Reaction mechanism generator group (RMG) solvation125

  4. ddCOSMO115 results obtained with PySCF116 and PBE-0117,118/Def2TZVPD101–103 

  5. Generalized Born119,120 (GBSA) model, results obtained using AMBER5,6

  6. Free energy machine learning11 (FML) with explicit conformer sampling on FreeSolv database

To complete the picture, we also included literature values for FreeSolv concerning the methods ARROW-PIMD,8,20 thermodynamic integration (TI) with the GAFF2 force field5,6 extracted from the FreeSolv88 database, and the 3D reference interaction site model99,123,124 (3D-RISM). The trade-off between cost and accuracy, including an outline of the resulting Pareto front, is shown in Fig. 5.
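Given per-method (cost, error) pairs, the Pareto front of Fig. 5 contains exactly those methods not dominated by any other, i.e., for which no alternative is simultaneously faster and more accurate. A sketch with placeholder numbers; only the AIML point (41 ms, 0.82 kcal/mol) is taken from the text, the other entries are invented for illustration:

```python
def pareto_front(points):
    """Return the (cost, error) points not dominated by any other point:
    no competitor has both lower-or-equal cost and lower-or-equal error."""
    front = []
    for c, e in points:
        dominated = any(c2 <= c and e2 <= e and (c2, e2) != (c, e)
                        for c2, e2 in points)
        if not dominated:
            front.append((c, e))
    return sorted(front)

# (cost in s/molecule, MAE in kcal/mol); all but the first are placeholders
methods = [(0.041, 0.82), (600.0, 0.55), (3600.0, 0.45), (100.0, 1.5)]
front = pareto_front(methods)
print(front)  # the (100.0, 1.5) entry is dominated and dropped
```

Shifting any single point down or to the left (e.g., by adding AIML training data without increasing prediction time) extends the front toward higher accuracy, as argued below.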

FIG. 5.

Comparison of solvation methods (red if multiple weighted conformers were used, else blue) in terms of the mean absolute error (MAE) and order of magnitude of per molecule prediction time t for the FreeSolv88 database. Pareto front (dotted) is formed by methods with best accuracy and cost per prediction trade-off. The central processing unit (CPU) computation time varies depending on hardware, code, etc., and it was estimated if not available. All references and MAE of free energies are listed in the supplementary material, Table 2.


We note that AIML adds to the convexity of the Pareto front, representing a meaningful compromise: Although roughly twice as slow as the Reaction Mechanism Generator124 (RMG) model (MAE of 0.98 kcal mol−1), it is slightly more accurate at 0.82 kcal mol−1, while still being four orders of magnitude faster than the next-best ab initio method, COSMO-RS.94,95,113,114 A list of all MAEs is provided in the supplementary material, Table 2. Thus, AIML is positioned on the Pareto front of the available solvation methods, located in a sweet spot between speed and accuracy: it provides the fastest predictions at the given accuracy of about 0.82 kcal mol−1. Note that the AIML learning curves have not yet saturated and that its accuracy will likely improve further as more training samples are included (cf. Fig. 4). Enlarging the AIML training set will hardly worsen the prediction time because KRR predictions scale linearly with training set size (see Sec. II C) and will therefore shift the Pareto front toward higher accuracy. Furthermore, it is important to note that arbitrarily accurate ab initio trajectories can be used for training, while the prediction time is independent of the level of theory. Recently published ML models tailored toward solvation, such as SoluteML,125 may outperform the presented AIML model’s accuracy; however, we note that only a small training set of N = 512 molecules was used here, and we expect the MAE to decay further with training set size. A3D-PNAConv-FT126 combines 2D and 3D structure with transfer learning, achieving an MAE of 0.417 kcal mol−1 for the FreeSolv dataset; however, its predictions require force-field-based conformer sampling and selection of the lowest-energy conformer. In contrast, AIML does not require sampling and in fact replaces the functionality of a force field or of an ab initio calculation.
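The Pareto-front argument above can be sketched in a few lines: a method lies on the front exactly when no other method is at least as fast and at least as accurate. Only the AIML entry (41 ms, 0.82 kcal/mol) is taken from the text; the other (time, MAE) pairs are placeholders for illustration, not values from Table 2.

```python
# Sketch: outline of the Pareto front in the (prediction time, MAE) plane,
# as in Fig. 5. Lower is better on both axes.

def pareto_front(methods):
    """Return names of methods not dominated in both cost and error."""
    front = []
    for name, t, mae in methods:
        dominated = any(t2 <= t and mae2 <= mae and (t2, mae2) != (t, mae)
                        for _, t2, mae2 in methods)
        if not dominated:
            front.append(name)
    return front

methods = [
    ("AIML", 0.041, 0.82),      # 41 ms/molecule, MAE 0.82 kcal/mol (from text)
    ("RMG", 0.020, 0.98),       # placeholder timing
    ("COSMO-RS", 600.0, 0.60),  # placeholder values
    ("GBSA", 0.5, 2.0),         # placeholder values; dominated by AIML
]
print(pareto_front(methods))
```

With these placeholder numbers, GBSA is dominated by AIML (slower and less accurate) and drops off the front, while the fastest, the most accurate, and the best-compromise methods remain.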

Note that we have also tried to combine RMG and AIML/FML via the Δ-ML approach,127 where RMG is used as a baseline for AIML; unfortunately, the prediction errors did not improve (see the supplementary material, Fig. 8). Moreover, AIML performs worse for large molecules with many conformers [see the supplementary material, Fig. 7(b)]. Unfortunately, combining a randomly sampled GDB17110 dataset of 10 000 molecules with the FreeSolv average conformers did not improve structure predictions due to the small overlap of the two datasets (see the supplementary material, Fig. 6). Specifically, the small training set size and the very high chemical diversity of the FreeSolv database, which includes the elements C, H, O, S, N, F, I, Br, P, and Cl and up to 24 non-hydrogen atoms per molecule, limit the accuracy of structure prediction for large compounds. Instead of adding random structures (see the supplementary material, Fig. 6), we could show that sampling128 the local chemical space of the largest FreeSolv molecules helps to improve the model’s accuracy (see the supplementary material, Fig. 8). Alternatively, these problems may be resolved by improved graph-based representations that include information about local chemical substructures, leading to better structure and improved free energy predictions.

Recent graph-based ML models129,130 can achieve competitive accuracy with root-mean-squared errors (RMSEs) around 1 kcal mol−1. We have achieved a similar RMSE of 1.35 kcal mol−1 for a training set size of N = 512. We emphasize that the AIML approach is very different: First, AIML uses three-dimensional conformations, which can lead to much-improved accuracy (MAE of 0.57 kcal mol−1 for N = 490), as we have shown earlier,11 and allows going beyond fixed graphs. Second, AIML predicts an ensemble-based representation, whereas SMILES-based ML uses molecular graphs as input. We found a direct comparison with the two aforementioned ML methods129,130 difficult because they either use a different training–test split129 or neglect130 certain molecules of the FreeSolv database.88 The comparison shows that our method might have a slightly higher initial offset because it must first learn the representation, i.e., the averaged conformer, before predicting the free energy. Our learning curves (see Fig. 4) do not indicate saturation of the MAE with training set size N, and AIML might still surpass graph-based models for large training set sizes because it contains information about molecular conformations.

We have introduced ab initio machine learning (AIML), allowing for efficient predictions of ensemble averages that systematically improve in accuracy as the training set size grows. To the best of our knowledge, AIML is the first approach found to effectively bypass the need for extensive MD or MC simulations by directly inferring Boltzmann averaged geometries. Unlike all other solvation models shown in Fig. 5, the AIML framework could easily be applied to other ensemble properties, e.g., melting points, without much adaptation, since no manual preselection of features for molecular fingerprints is required. AIML does not require any additional sampling for inferring ensemble averages of new out-of-sample query molecules: Instead, AIML accounts for multiple Boltzmann weighted configurations implicitly through its training data. We have exemplified AIML for estimating experimental solvation free energies, and our numerical results amount to evidence that the conformer ensemble can effectively be linked to a single averaged conformer that serves as a canonical representative. AIML predictions are consistent with the previous free energy machine learning11 (FML) approach without the need to run an MD simulation for each prediction, reaching errors as low as 0.82 kcal mol−1 at a prediction cost of 41 CPU ms/molecule.
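The link between a conformer ensemble and a single canonical representative can be illustrated with a minimal Boltzmann-weighting sketch. The two-component “representations” and energies below are toy values, not actual AIML descriptors.

```python
import math

# Sketch: collapse a conformer ensemble into one Boltzmann-weighted average
# representation, the idea behind the single "canonical representative".

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def boltzmann_average(reps, energies, T=300.0):
    """Weight each conformer representation by exp(-E/kT) and average."""
    weights = [math.exp(-(e - min(energies)) / (KB * T)) for e in energies]
    z = sum(weights)
    dim = len(reps[0])
    return [sum(w * r[d] for w, r in zip(weights, reps)) / z
            for d in range(dim)]

# Two toy conformers: at 300 K the low-energy one dominates the average.
reps = [[1.0, 0.0], [0.0, 1.0]]
energies = [0.0, 2.0]  # kcal/mol
avg = boltzmann_average(reps, energies)
```

At 300 K, a 2 kcal/mol gap suppresses the higher conformer’s weight to a few percent, so the averaged representation stays close to that of the minimum-energy conformer.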

Further analysis has revealed that AIML does not yet work well with all available molecular representations.32 More specifically, we find that representations tailored toward atomization energies and including explicit angular dependencies, such as FCHL19,131,132 yield less favorable AIML models (see the supplementary material, Fig. 9). However, it might be possible to further improve AIML by tailoring and optimizing representations and architecture (e.g., using locality, symmetry, or neural networks).

The question of uniqueness is fundamental for molecular representations.32,35,133–136 For the present dataset, this was not an issue, as the averaged representations distinguished all data items. It is possible, however, to imagine a scenario where this is not the case: For two different ensembles E and Ē with the same average conformer R = R̄ but two different averages A ≠ Ā, the presented AIML model would make the same predictions and could not distinguish between the two systems. This problem could be resolved by including higher-order moments in the prediction of the representation, for example, by adding an AIML model for the standard deviation of the representation. Future studies will address this question. Here, our main focus was on molecules with uniquely defined bond topologies. In future applications, AIML could be applied to cases where assigning bond topologies is ambiguous or impossible, such as transition states56,137 or molecules at very high temperatures; these are important cases where AIML can be used but graph-based ML models cannot.
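The higher-order-moment fix can be illustrated numerically: two synthetic ensembles share the same first moment (the averaged representation) but differ in their second moment, so adding the standard deviation restores distinguishability. All values below are made up.

```python
import math

# Sketch: two one-dimensional "ensembles" with identical means become
# distinguishable once the standard deviation is included as a feature.

def moments(samples):
    """First two moments: mean and (population) standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(var)

ens_a = [0.75, 1.0, 1.25]  # narrow ensemble
ens_b = [0.5, 1.0, 1.5]    # broad ensemble, same mean

ma, sa = moments(ens_a)
mb, sb = moments(ens_b)
print(ma == mb, sa < sb)  # prints: True True
```

The means coincide exactly, so a mean-only representation cannot separate the two ensembles, whereas the standard deviations differ by a factor of two.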

In summary, compared to classical or ab initio MD-based predictions of free energies of solvation, AIML offers speedups of four to seven orders of magnitude, respectively. AIML achieves these speedups by effectively shifting the computational cost from the query prediction to the training set generation. In light of the sheer scale of the chemical compound space available for molecular queries, this trade-off might well be worthwhile.
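The cost asymmetry can be made concrete with a toy kernel ridge regression (KRR) sketch: once the coefficients α have been solved for offline (cubic in the training set size N), each query needs only one kernel evaluation per training point, i.e., O(N). The one-dimensional data, kernel width, and coefficients below are made up for illustration.

```python
import math

# Sketch: KRR prediction cost is linear in N — one kernel evaluation per
# training point — while the expensive alpha solve happens once, offline.

def gaussian_kernel(x, xi, sigma=1.0):
    return math.exp(-((x - xi) ** 2) / (2 * sigma ** 2))

def krr_predict(x, training_points, alphas, sigma=1.0):
    """O(N) per query: sum_i alpha_i * k(x, x_i)."""
    return sum(a * gaussian_kernel(x, xi, sigma)
               for a, xi in zip(alphas, training_points))

training_points = [0.0, 1.0, 2.0]
alphas = [0.5, -0.2, 0.1]  # assumed to come from the offline training solve
y = krr_predict(1.0, training_points, alphas)
```

Doubling the training set doubles the per-query work, which is why larger AIML models barely affect prediction time while shifting the Pareto front toward higher accuracy.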

ML based on a single geometry can lead to ambiguous11 ensemble property predictions because predictions can vary substantially depending on the conformer. Many-body representations32,131,132,138,139 rely on three-dimensional geometries, which become even more relevant if the target property depends on multiple relevant conformers. A solution to this issue is to sample the configuration space to obtain a conformer-invariant ML representation.11 Sampling can be achieved by different strategies: MD simulations,5,6 systematic conformer ensemble scans,140–142 and conformer generation methods that are either knowledge- or force-field (FF)-based, such as ETKDG,144 Gen3D,143 and others.145–149 A more expensive but accurate alternative is to obtain conformations using ab initio approaches, such as density functional theory (DFT)18,19 or tight binding140–142 (TB). A common pitfall of these methods is that their extension to arbitrary chemistries is not always straightforward. To this end, ML-based methods37,150–152 hold the promise of providing faster and more general structure predictions. There are only very few ML structure generation methods, e.g., based on reinforcement learning153 or stochastic normalizing flows,154 that take energy weights of different conformers into account.

Here, to obtain a diverse set of conformers as the AIML training set, we have performed MD simulations in vacuum at an elevated temperature of T = 350 K using OpenMM26 with a Langevin integrator. The GAFF25,6 force field was used with a time step of Δt = 2 fs and a total simulation time of 2 ns. Partial charges were computed with antechamber5,6 at the AM1-BCC155 level. MD samples were selected with 2 ps time separation.
To compare AIML with the COSMO-RS94,95,113,114 solvation method, we used the COSMO-RS workflow based on ab initio DFT calculations with Turbomole156 and the Becke–Perdew (BP)86,87 functional, as implemented in COSMOconf,157 with two different basis sets, Def2TZVP and Def2TZVPD-FINE101–103 (referred to in the following as B1 and B2). Based on these results, free energies were extracted using the COSMOtherm158 program. In addition, the reaction mechanism generator group (RMG)-based approach was used to compute free energies of solvation125 of the FreeSolv database via the leruli.com API.159 The FreeSolv88 dataset contains 642 charge-neutral compounds and their experimental free energies of solvation. The average unsigned error of the experimental values is 0.57 kcal mol−1, which is close to the level of thermal energy fluctuations (kB · 300 K ≈ 0.6 kcal mol−1). All ML models use a maximal training set size of 80%, corresponding to N = 512 molecules. Hyperparameters are optimized with nested fivefold cross-validation.
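The sampling bookkeeping described above works out as follows (a trivial sketch; the numbers are those stated in the text):

```python
# Sketch: a 2 ns trajectory integrated with a 2 fs time step, keeping one
# snapshot every 2 ps, yields 1000 training frames per molecule.

dt_fs = 2             # integration time step (fs)
total_fs = 2_000_000  # total simulation time: 2 ns
stride_fs = 2_000     # snapshot separation: 2 ps

steps = total_fs // dt_fs          # total number of MD steps
stride_steps = stride_fs // dt_fs  # save every 1000th step
n_frames = steps // stride_steps   # frames entering the training set
print(steps, stride_steps, n_frames)  # prints: 1000000 1000 1000
```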

See the supplementary material for learning curves of the AIML distance predictions, scaling of prediction error with molecular size, and a complete list of references of all the solvation models mentioned in this article.

We acknowledge support from the European Research Council (ERC-CoG Grant No. QML). This project received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 772834. This research was also supported by the NCCR MARVEL, a National Center of Competence in Research, funded by the Swiss National Science Foundation (Grant No. 182892).

D.L., G.F.v.R., and O.A.v.L. are shareholders of Leruli GmbH.

The AIML code and all free energies of solvation of the FreeSolv database (if produced by the authors) are openly available at https://doi.org/10.5281/zenodo.6401711. The authors will be glad to provide more data for specific requests.

1. A. C. T. van Duin, S. Dasgupta, F. Lorant, and W. A. Goddard, J. Phys. Chem. A 105, 9396 (2001).
2. R. Car and M. Parrinello, Phys. Rev. Lett. 55, 2471 (1985).
3. C. Oostenbrink, A. Villa, A. E. Mark, and W. F. van Gunsteren, J. Comput. Chem. 25, 1656 (2004).
4. M. Christen, P. H. Hünenberger, D. Bakowies, R. Baron, R. Bürgi, D. P. Geerke, T. N. Heinz, M. A. Kastenholz, V. Kräutler, C. Oostenbrink, C. Peter, D. Trzesniak, and W. F. van Gunsteren, J. Comput. Chem. 26, 1719 (2005).
5. J. Wang, W. Wang, P. A. Kollman, and D. A. Case, J. Mol. Graphics Modell. 25, 247 (2006).
6. J. Wang, R. M. Wolf, J. W. Caldwell, and P. A. Kollman, J. Comput. Chem. 25, 1157 (2004).
7. C. Vega, J. L. F. Abascal, M. M. Conde, and J. L. Aragones, Faraday Discuss. 141, 251 (2009).
8. W. L. Jorgensen, D. S. Maxwell, and J. Tirado-Rives, J. Am. Chem. Soc. 118, 11225 (1996).
9. W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein, J. Chem. Phys. 79, 926 (1983).
10. J. B. Abrams, L. Rosso, and M. E. Tuckerman, J. Chem. Phys. 125, 074115 (2006).
11. J. Weinreich, N. J. Browning, and O. A. von Lilienfeld, J. Chem. Phys. 154, 134113 (2021).
12. D. Voet and J. G. Voet, Biochemistry (John Wiley & Sons, Inc., Hoboken, NJ, 2006).
13. O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, Nat. Rev. Chem. 4, 347 (2020).
14. D. Frenkel and B. Smit, Understanding Molecular Simulation, 2nd ed. (Academic Press, Inc., 2001).
15. M. N. Rosenbluth, AIP Conf. Proc. 690, 22 (2003).
16. W. K. Hastings, Biometrika 57, 97 (1970).
17. P. Hohenberg and W. Kohn, Phys. Rev. 136, B864 (1964).
18. G. Kresse and J. Furthmüller, Phys. Rev. B 54, 11169 (1996).
19. G. Kresse and D. Joubert, Phys. Rev. B 59, 1758 (1999).
20. L. Pereyaslavets, G. Kamath, O. Butin, A. Illarionov, M. Olevanov, I. Kurnikov, S. Sakipov, I. Leontyev, E. Voronina, T. Gannon, G. Nawrocki, M. Darkhovskiy, I. Ivahnenko, A. Kostikov, J. Scaranto, M. G. Kurnikova, S. Banik, H. Chan, M. G. Sternberg, and B. Fain, Nat. Commun. 13, 414 (2022).
21. S. Kmiecik, D. Gront, M. Kolinski, L. Wieteska, A. E. Dawid, and A. Kolinski, Chem. Rev. 116, 7898 (2016).
22. G. Csányi, T. Albaret, M. C. Payne, and A. De Vita, Phys. Rev. Lett. 93, 175503 (2004).
23. Z. Li, J. R. Kermode, and A. De Vita, Phys. Rev. Lett. 114, 096405 (2015).
24. R. Jinnouchi, F. Karsai, and G. Kresse, Phys. Rev. B 100, 014105 (2019).
25. D. E. Shaw, M. M. Deneroff, R. O. Dror, J. S. Kuskin, R. H. Larson, J. K. Salmon, C. Young, B. Batson, K. J. Bowers, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. P. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossváry, J. L. Klepeis, T. Layman, C. McLeavey, M. A. Moraes, R. Mueller, E. C. Priest, Y. Shan, J. Spengler, M. Theobald, B. Towles, and S. C. Wang, Commun. ACM 51, 91–97 (2008).
26. P. Eastman, J. Swails, J. D. Chodera, R. T. McGibbon, Y. Zhao, K. A. Beauchamp, L.-P. Wang, A. C. Simmonett, M. P. Harrigan, C. D. Stern, R. P. Wiewiora, B. R. Brooks, and V. S. Pande, PLoS Comput. Biol. 13, e1005659 (2017).
27. P. Vingelmann, F. Fitzek, and NVIDIA, https://developer.nvidia.com/cuda-toolkit, 2020.
28. B. Zagrovic, C. D. Snow, M. R. Shirts, and V. S. Pande, J. Mol. Biol. 323, 927 (2002).
29. V. A. Voelz, G. R. Bowman, K. Beauchamp, and V. S. Pande, J. Am. Chem. Soc. 132, 1526 (2010).
30. See https://www.acm.org/media-center/2020/november/gordon-bell-prize-2020 for information about the Gordon Bell prize.
31. S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller, Sci. Adv. 3, e1603015 (2017).
32. B. Parsaeifard, D. Sankar De, A. S. Christensen, F. A. Faber, E. Kocer, S. De, J. Behler, O. Anatole von Lilienfeld, and S. Goedecker, Mach. Learn.: Sci. Technol. 2, 015018 (2021).
33. A. P. Bartók, R. Kondor, and G. Csányi, Phys. Rev. B 87, 184115 (2013).
34. J. Nigam, G. Fraux, and M. Ceriotti, “Unified theory of atom-centered representations and graph convolutional machine-learning schemes,” J. Chem. Phys. 156, 204115 (2022).
35. B. Huang and O. A. von Lilienfeld, J. Chem. Phys. 145, 161102 (2016).
36. D. J. Wales, Philos. Trans. R. Soc. London, Ser. A 370, 2877 (2012).
37. D. Lemm, G. F. von Rudorff, and O. A. von Lilienfeld, Nat. Commun. 12, 4468 (2021).
38. M. Tuckerman, Statistical Mechanics: Theory and Molecular Simulation (Oxford Graduate Texts, 2001).
39. D. Weininger, J. Chem. Inf. Comput. Sci. 28, 31 (1988).
40. M. Krenn, F. Häse, A. Nigam, P. Friederich, and A. Aspuru-Guzik, Mach. Learn.: Sci. Technol. 1, 045024 (2020).
41. J. J. Moré and Z. Wu, J. Global Optim. 15, 219 (1999).
42. K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, J. Phys. Chem. Lett. 6, 2326 (2015).
43. B. Huang and O. A. von Lilienfeld, “Ab initio machine learning in chemical compound space,” Chem. Rev. 121(16), 10001–10036 (2021).
44. M. Ceriotti, C. Clementi, and O. Anatole von Lilienfeld, J. Chem. Phys. 154, 160401 (2021).
45. J. Westermayr, M. Gastegger, K. T. Schütt, and R. J. Maurer, J. Chem. Phys. 154, 230903 (2021).
46.
47. J. A. Keith, V. Vassilev-Galindo, B. Cheng, S. Chmiela, M. Gastegger, K.-R. Müller, and A. Tkatchenko, Chem. Rev. 121, 9816 (2021).
48. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett. 108, 058301 (2012).
49. F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, J. Chem. Theory Comput. 13, 5255 (2017).
50. F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett. 117, 135502 (2016).
51. M. Schwilk, D. N. Tahchieva, and O. A. von Lilienfeld, “Large yet bounded: Spin gap ranges in carbenes,” arXiv:2004.10600 [physics.chem-ph] (2020).
52. J. Westermayr and P. Marquetand, Chem. Rev. 121, 9873 (2021).
53. P. O. Dral and M. Barbatti, Nat. Rev. Chem. 5, 388 (2021).
54. M. Eckhoff, K. N. Lausch, P. E. Blöchl, and J. Behler, J. Chem. Phys. 153, 164107 (2020).
55. A. Gupta, S. Chakraborty, and R. Ramakrishnan, Mach. Learn.: Sci. Technol. 2, 035010 (2021).
56. G. F. von Rudorff, S. Heinen, M. Bragato, and O. A. von Lilienfeld, Mach. Learn.: Sci. Technol. 1, 045026 (2020).
57. M. Bragato, G. F. von Rudorff, and O. A. von Lilienfeld, Chem. Sci. 11, 11859 (2020).
58. M. Eckhoff and J. Behler, npj Comput. Mater. 7, 170 (2021).
59. T. W. Ko, J. A. Finkler, S. Goedecker, and J. Behler, Acc. Chem. Res. 54, 808 (2021).
60. B. Huang and O. A. von Lilienfeld, Nat. Chem. 12, 945 (2020).
61. B. Huang and O. A. von Lilienfeld, “Dictionary of 140k GDB and ZINC derived AMONs,” arXiv:2008.05260 [physics.chem-ph] (2020).
62. R. E. Goodall and A. A. Lee, Nat. Commun. 11, 6250 (2020).
63. R. E. A. Goodall, A. S. Parackal, F. A. Faber, R. Armiento, and A. A. Lee, “Rapid discovery of novel materials by coordinate-free coarse graining,” arXiv:2106.11132 [cond-mat.mtrl-sci] (2021).
64. O. Wieder, S. Kohlbacher, M. Kuenemann, A. Garon, P. Ducrot, T. Seidel, and T. Langer, Drug Discovery Today: Technol. 37, 1 (2020).
65. D. Rogers and M. Hahn, J. Chem. Inf. Model. 50, 742 (2010).
66. K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, J. Chem. Phys. 148, 241722 (2018).
67. A. Alibakhshi and B. Hartke, Nat. Commun. 12, 3584 (2021).
68. S. Riniker, J. Chem. Inf. Model. 57, 726 (2017).
69. S. Axelrod and R. Gomez-Bombarelli, “Molecular machine learning with conformer ensembles,” arXiv:2012.08452 [cs.LG] (2020).
70. J. Gebhardt, M. Kiesel, S. Riniker, and N. Hansen, J. Chem. Inf. Model. 60, 5319 (2020).
71. J. Scheen, W. Wu, A. S. J. S. Mey, P. Tosco, M. Mackey, and J. Michel, J. Chem. Inf. Model. 60, 5331 (2020).
72. H. Lim and Y. Jung, J. Cheminf. 13, 56 (2021).
73. F. H. Vermeire and W. H. Green, Chem. Eng. J. 418, 129307 (2021).
74. V. Venkatraman, S. Evjen, H. K. Knuutila, A. Fiksdahl, and B. K. Alsberg, J. Mol. Liq. 264, 318 (2018).
75. K. Mansouri, C. M. Grulke, R. S. Judson, and A. J. Williams, J. Cheminf. 10, 10 (2018).
76. V. Zaverkin, J. Netz, F. Zills, A. Köhn, and J. Kästner, J. Chem. Theory Comput. 18, 1 (2022).
77. A. Reinhardt and B. Cheng, Nat. Commun. 12, 588 (2021).
78. B. Cheng, M. Bethkenhagen, C. J. Pickard, and S. Hamel, arXiv:2103.09035 (2021).
79. B. Monserrat, J. G. Brandenburg, E. A. Engel, and B. Cheng, Nat. Commun. 17, 1228–1232 (2021).
80. V. N. Vapnik, Statistical Learning Theory (Wiley-Interscience, 1998).
81. M. P. Deisenroth, A. A. Faisal, and C. S. Ong, Mathematics for Machine Learning (Cambridge University Press, 2020).
82. K. T. Schütt, S. Chmiela, O. A. von Lilienfeld, A. Tkatchenko, K. Tsuda, and K.-R. Müller, Machine Learning Meets Quantum Physics (Springer, 2020).
83. A. Koliński et al., Acta Biochim. Pol. 51(2), 349–371 (2004).
84. A. Liwo, M. Baranowski, C. Czaplewski, E. Gołaś, Y. He, D. Jagieła, P. Krupa, M. Maciejczyk, M. Makowski, M. A. Mozolewska, et al., J. Mol. Model. 20, 2306 (2014).
85. M. Levitt and A. Warshel, Nature 253, 694 (1975).
86. A. D. Becke, Phys. Rev. A 38, 3098 (1988).
87. J. P. Perdew, Phys. Rev. B 33, 8822 (1986).
88. G. Duarte Ramos Matos, D. Y. Kyu, H. H. Loeffler, J. D. Chodera, M. R. Shirts, and D. L. Mobley, J. Chem. Eng. Data 62, 1559 (2017).
89. J. W. Gibbs, Trans. Conn. Acad. Arts Sci. 2, 382–404 (1873).
90. F. Fogolari, A. Brigo, and H. Molinari, J. Mol. Recognit. 15, 377 (2002).
91. B. Mennucci, J. Tomasi, R. Cammi, J. R. Cheeseman, M. J. Frisch, F. J. Devlin, and P. J. Stephens, J. Phys. Chem. A 106, 6102 (2002).
92. A. V. Marenich, C. J. Cramer, and D. G. Truhlar, J. Chem. Theory Comput. 9, 609 (2013).
93. A. V. Marenich, C. J. Cramer, and D. G. Truhlar, J. Phys. Chem. B 113, 6378 (2009).
94. A. Klamt and G. Schüürmann, J. Chem. Soc., Perkin Trans. 2 1993, 799.
95. A. Klamt, J. Phys. Chem. 99, 2224 (1995).
96. A. Klamt and F. Eckert, Fluid Phase Equilib. 172, 43 (2000).
97. D. Beglov and B. Roux, J. Phys. Chem. B 101, 7821 (1997).
98. A. Kovalenko and F. Hirata, Chem. Phys. Lett. 290, 237 (1998).
99. D. Roy and A. Kovalenko, J. Phys. Chem. A 123, 4087 (2019).
100. D. Rappoport and F. Furche, J. Chem. Phys. 133, 134105 (2010).
101. A. Schäfer, C. Huber, and R. Ahlrichs, J. Chem. Phys. 100, 5829 (1994).
102. F. Weigend and R. Ahlrichs, Phys. Chem. Chem. Phys. 7, 3297 (2005).
103. F. Weigend, Phys. Chem. Chem. Phys. 8, 1057 (2006).
104. K.-R. Müller, M. Finke, N. Murata, K. Schulten, and S. Amari, Neural Comput. 8, 1085 (1996).
105. C. Cortes, L. D. Jackel, S. A. Solla, V. N. Vapnik, and J. S. Denker, in NIPS, 1993.
106. C. Cortes, L. D. Jackel, S. A. Solla, V. Vapnik, and J. S. Denker, in Advances in Neural Information Processing Systems, Proceedings of the First 12 Conferences, edited by M. I. Jordan, Y. LeCun, and S. A. Solla (MIT Press, 1994), pp. 327–334.
107. J. C. Kromann, “Calculate root-mean-square deviation (RMSD) of two molecules using rotation,” GitHub dataset, https://github.com/charnley/rmsd.
108. M. W. Walker, L. Shao, and R. A. Volz, CVGIP: Image Understanding 54, 358 (1991).
109. W. Kabsch, Acta Crystallogr., Sect. A 32, 922 (1976).
110. L. Ruddigkeit, R. van Deursen, L. C. Blum, and J.-L. Reymond, J. Chem. Inf. Model. 52, 2864 (2012).
111. Y. Zhao and D. G. Truhlar, Theor. Chem. Acc. 120, 215 (2008).
112. M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, G. A. Petersson, H. Nakatsuji, X. Li, M. Caricato, A. V. Marenich, J. Bloino, B. G. Janesko, R. Gomperts, B. Mennucci, H. P. Hratchian, J. V. Ortiz, A. F. Izmaylov, R. Cammi, J. W. Ochterski, R. L. Martin, K. Morokuma, O. Farkas, J. B. Foresman, and D. J. Fox, Gaussian 16 Revision C.01 (Gaussian, Inc., Wallingford, CT, 2016).
113. A. Klamt, V. Jonas, T. Bürger, and J. C. W. Lohrenz, J. Phys. Chem. A 102, 5074 (1998).
114. F. Eckert and A. Klamt, AIChE J. 48, 369 (2002).
115. F. Lipparini, B. Stamm, E. Cancès, Y. Maday, and B. Mennucci, J. Chem. Theory Comput. 9, 3637–3648 (2013).
116. Q. Sun, T. C. Berkelbach, N. S. Blunt, G. H. Booth, S. Guo, Z. Li, J. Liu, J. D. McClain, E. R. Sayfutyarova, S. Sharma, S. Wouters, and G. K.-L. Chan, Wiley Interdiscip. Rev.: Comput. Mol. Sci. 8, e1340 (2018).
117. J. P. Perdew, M. Ernzerhof, and K. Burke, J. Chem. Phys. 105, 9982 (1996).
118. C. Adamo and V. Barone, J. Chem. Phys. 110, 6158 (1999).
119. A. Onufriev, D. Bashford, and D. A. Case, Proteins 55, 383 (2004).
120. J. Weiser, P. S. Shenkin, and W. C. Still, J. Comput. Chem. 20, 217 (1999).
121. D. Roy and A. Kovalenko, J 4, 604 (2021).
122. A. Kovalenko and F. Hirata, J. Chem. Phys. 110, 10095 (1999).
123. M. Reimann and M. Kaupp, J. Phys. Chem. A 124, 7439 (2020).
124. Y. L. Chung, R. J. Gillis, and W. H. Green, AIChE J. 66, e16976 (2020).
125. Y. Chung, F. H. Vermeire, H. Wu, P. J. Walker, M. H. Abraham, and W. H. Green, J. Chem. Inf. Model. 62, 433 (2022).
126. D. Zhang, S. Xia, and Y. Zhang, J. Chem. Inf. Model. 62, 1840 (2022).
127. R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, J. Chem. Theory Comput. 11, 2087 (2015).
128. A. Nigam, R. Pollice, M. Krenn, G. d. P. Gomes, and A. Aspuru-Guzik, Chem. Sci. 12, 7079 (2021).
129. Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, Chem. Sci. 9, 513 (2018).
130. P. Gao, X. Yang, Y.-H. Tang, M. Zheng, A. Andersen, V. Murugesan, A. Hollas, and W. Wang, Phys. Chem. Chem. Phys. 23, 24892 (2021).
131. F. A. Faber, A. S. Christensen, B. Huang, and O. A. von Lilienfeld, J. Chem. Phys. 148, 241717 (2018).
132. A. S. Christensen, L. A. Bratholm, F. A. Faber, and O. Anatole von Lilienfeld, J. Chem. Phys. 152, 044107 (2020).
133. J. E. Moussa, Phys. Rev. Lett. 109, 059801 (2012).
134. M. F. Langer, A. Goeßmann, and M. Rupp, npj Comput. Mater. 8, 41 (2022).
135. S. N. Pozdnyakov, M. J. Willatt, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti, Phys. Rev. Lett. 125, 166001 (2020).
136. O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, Int. J. Quantum Chem. 115, 1084 (2015).
137. S. Heinen, G. F. von Rudorff, and O. A. von Lilienfeld, J. Chem. Phys. 155, 064105 (2021).
138. J. Behler, J. Chem. Phys. 134, 074106 (2011).
139. M. P. Bircher, A. Singraber, and C. Dellago, Mach. Learn.: Sci. Technol. 2, 035026 (2021).
140. P. Pracht, F. Bohle, and S. Grimme, Phys. Chem. Chem. Phys. 22, 7169 (2020).
141. S. Grimme, J. Chem. Theory Comput. 15, 2847 (2019).
142. P. Pracht and S. Grimme, Chem. Sci. 12, 6551 (2021).
143. N. Yoshikawa and G. Hutchison, J. Cheminf. 11, 49 (2019).
144. S. Riniker and G. A. Landrum, J. Chem. Inf. Model. 55, 2562 (2015).
145. M. A. Miteva, F. Guyon, and P. Tufféry, Nucleic Acids Res. 38, W622 (2010).
146. P. C. D. Hawkins, A. G. Skillman, G. L. Warren, B. A. Ellingson, and M. T. Stahl, J. Chem. Inf. Model. 50, 572 (2010).
147. G. Landrum et al., Software, 2006.
148. N. M. O’Boyle, T. Vandermeersch, C. J. Flynn, A. R. Maguire, and G. R. Hutchison, J. Chem. Inf. Model. 3 (2011).
149. M. J. Vainio and M. S. Johnson, J. Chem. Inf. Model. 47, 2462 (2007).
150. E. Mansimov, O. Mahmood, S. Kang, and K. Cho, Sci. Rep. 9, 20381 (2019).
151. J. Kästner, J. M. Carr, T. W. Keal, W. Thiel, A. Wander, and P. Sherwood, J. Phys. Chem. A 113(43), 11856 (2009).
152. L. Chan, G. Hutchison, and G. Morris, J. Cheminf. 11, 32 (2019).
153. S. A. Meldgaard, J. Köhler, H. L. Mortensen, M.-P. V. Christiansen, F. Noé, and B. Hammer, “Generating stable molecules using imitation and reinforcement learning,” Mach. Learn.: Sci. Technol. 3, 015008 (2022).
154. H. Wu, J. Köhler, and F. Noé, “Stochastic normalizing flows,” in NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS, 2020), pp. 5933–5944.
155. A. Jakalian, B. L. Bush, D. B. Jack, and C. I. Bayly, J. Comput. Chem. 21, 132 (2000).
156. TURBOMOLE V7.2 2017, a development of University of Karlsruhe and Forschungszentrum Karlsruhe GmbH, 1989–2007, TURBOMOLE GmbH, since 2007; available from http://www.turbomole.com.
157. COSMOconf 4.3, COSMOlogic GmbH & Co. KG, http://www.cosmologic.de, Leverkusen, Germany.
158. F. Eckert and A. Klamt, COSMOtherm, 2018; BIOVIA COSMOtherm, Release 2021, Dassault Systèmes, http://www.3ds.com.
159. D. Lemm, G. F. von Rudorff, and A. von Lilienfeld, LERULI.com, online molecular property predictions in real time and for free, www.leruli.com, 2021.

Supplementary Material