DASH properties: Estimating atomic and molecular properties from a dynamic attention-based substructure hierarchy

Recently, we presented a method to assign atomic partial charges based on the DASH (dynamic attention-based substructure hierarchy) tree with high efficiency and quantum mechanical (QM)-like accuracy. In addition, the approach can be considered “rule based” (the rules are derived from the attention values of a graph neural network), and thus each assignment is fully explainable by visualizing the underlying molecular substructures. In this work, we demonstrate that these hierarchically sorted substructures capture the key features of the local environment of an atom and allow us to predict different atomic properties with high accuracy without building a new DASH tree for each property. The fast prediction of atomic properties in molecules with the DASH tree can, for example, be used as an efficient way to generate feature vectors for machine learning without the need for expensive QM calculations. The final DASH tree with the different atomic properties as well as the complete dataset with wave functions is made freely available.


I. INTRODUCTION
The properties of atoms in molecules are often used to describe or estimate the properties of the corresponding molecule. For instance, atomic properties, such as partial charges, are valuable for fixed-charge force fields used in molecular dynamics (MD) simulations or for machine learning (ML) of molecular properties, such as binding affinity or aqueous solubility, where atomic properties can serve as input features. However, in many of these examples, either simple tabulated values are employed or the accuracy of the atomic property is dictated by the computational cost of the quantum mechanical (QM) calculation required to obtain it. 2][3][4][5] Although this approach provides the desired computational efficiency while retaining the underlying QM accuracy, it often lacks explainability and is dependent on the chosen software.
Recently, we presented the dynamic attention-based substructure hierarchy (DASH), 6 which is a tree structure built to assign partial charges to atoms based on 2D topological substructures. The substructures are expanded iteratively from the initial atom guided by attention values that represent the importance of neighbors and linearize the search, making the method comparable in accuracy to state-of-the-art ML models but even faster. Importantly, the DASH tree has the added benefit of full explainability and error estimates of the assigned charges based on the matched substructure. The DASH partial charges were found to have small errors compared to the minimal basis iterative stockholder (MBIS) 7 reference values, even when tested on the external test set VEHICLe (virtual exploratory heterocyclic library), 8 which demonstrated the generalizability of the DASH approach to unseen molecules. In many cases, the inherent 2D nature of the predictions with the DASH tree can be an advantage. For example, partial charges for fixed-charge force fields need to be conformationally averaged, which other force fields handle by performing multiple QM calculations, 9 adding to the computational cost.
With the DASH tree at hand, the following question arose: how much of the electronic neighborhood of an atom is captured by the DASH tree, and how transferable are these learned substructures? Besides the partial charge, other atomic properties (e.g., atomic polarizability and dispersion) depend on the local electronic environment of an atom. In this work, we demonstrate that such additional atomic properties can be assigned with the same DASH tree (built for MBIS partial charges), without the need to retrain the associated graph neural network and/or rebuild the tree. These properties include other partial charge models (Mulliken, 10 AM1-BCC, 11 and RESP 12), dispersion (C6), 13 atomic polarizability, and atomic parameters to describe the electro- and nucleophilicity (i.e., Fukui 14 and the dual descriptor 15). Many of these properties were evaluated directly on the wave functions calculated for the DASH dataset, 6,16 while for others, semi-empirical methods were used. In addition, we were able to speed up the prediction of these properties with the DASH tree by orders of magnitude with improvements in the tree implementation and algorithm, making the DASH approach an exceptional source of features for ML approaches on organic molecules.
The Journal of Chemical Physics ARTICLE pubs.aip.org/aip/jcp

II. METHODS
We use the dataset from the previous work by Lehner et al. 6,16 consisting of 348 935 diverse molecules (determined by Morgan fingerprints 29 with radius two). For each molecule, the wave functions of up to three conformers, calculated at the TPSSh/def-TZVP level of theory with Psi4, 17 were available, allowing us to efficiently calculate additional properties. These properties were chosen based on their expected correlation with the local electronic environment of the atom and the availability of computational methods for generating them. A schematic depiction of the two-step workflow is shown in Fig. 1. First, the properties were calculated for all molecules in the DASH dataset 16 and the external test set (VEHICLe 8). Second, the DASH tree previously built for the assignment of MBIS partial charges was populated with these additional properties. To accommodate the new data, the implementation of the DASH tree was optimized for memory efficiency and runtime. The improvements in runtime over the previous implementation are discussed in Sec. S1 and shown in Fig. S1 of the supplementary material.

A. Test sets
In Ref. 6, the dataset was randomly split into a training set and a validation set. We made use of the same split; however, since no classical training was performed here (i.e., the validation set was not used for any meta- or hyperparameter tuning), we consider the validation set from Ref. 6 as a de facto test set for this work. In addition, the VEHICLe set 8 with 24 657 heterocycles was used again as an external test set to validate all results. The 56 molecules that are in both the DASH dataset and the VEHICLe set were removed from the latter. The results with the DASH test set are shown in the main text, while the results for the VEHICLe test set are shown in the supplementary material (Figs. S5-S7).

B. Calculation of the atomic properties

1. Properties at DFT level of theory
The previously calculated wave functions could be used directly for the calculation of additional properties. The function oeprop in Psi4 was used to compute Mulliken charges, 10 and the Psi4 RESP package was used to calculate RESP (restrained electrostatic potential) partial charges. 12 We previously 6 showed that the differences between the commonly used B3LYP functional and the TPSSh functional used for the DASH tree are small for the calculation of RESP charges. From the calculation of the MBIS 7 partial charge, the dipole moment of the atomic charge distribution could be recovered. The projection of the atomic dipole vectors onto the bond vectors of an atom was used for the calculation of the molecular dipoles according to Eq. (3) as discussed below. The dual descriptor 15 was calculated from the wave function using the Psi4 function cubeprop with the default grid spacing of 0.2 bohr.

FIG. 1. Schematic depiction of the workflow to populate a pre-built DASH tree 6 (built to assign MBIS 7 partial charges) with additional computed properties. In the first step (blue, left), the available wave functions were used to calculate new properties. In the second step (green, right), the nodes of the DASH tree were populated with the new properties.

2. Properties at semi-empirical level of theory
To calculate the dispersion and polarizability of each atom, we employed DFT-D4. 13 In addition, AM1-BCC 11 partial charges were calculated due to their usage in many force fields. 9,18 These partial charges were calculated with the OpenFF toolkit (version 0.10.0) 19 using the Amber toolkit (version 22.0). 20

C. Populating the DASH tree with new atomic properties
New properties were assigned to the nodes in the existing DASH tree in the same way as the MBIS partial charges were previously assigned. This means that the new atomic properties were assigned to the same nodes as the corresponding MBIS partial charge of the atom. After all molecules were matched against the tree, the median and variance values of each property in each node were calculated. By reusing the DASH tree structure built for the MBIS partial charges, we make the assumption (and test it) that this DASH tree captures the important aspects of an atom's electronic neighborhood, which are applicable to other atomic properties. This allows the prediction of different properties on the basis of the same substructures.
Note that this population process does not include any fitting. The DASH tree is a classification model that groups atom environments in a hierarchical structure. Each node's property is simply the median of the property values of all atoms matching this node. Due to this classification based on 2D molecular substructures, DASH cannot accurately predict properties that have large 3D conformational dependencies or involve strong delocalization, which requires a treatment of the full molecular orbitals.
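The population step described above can be sketched in a few lines. This is a minimal illustration and not the published DASH implementation: the function name `populate_nodes`, the `(node_id, property_name, value)` input format, and the choice of the population variance are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median, pvariance


def populate_nodes(assignments):
    """Aggregate per-atom property values on the tree nodes they matched.

    `assignments` is an iterable of (node_id, property_name, value) tuples,
    one per matched atom and property (hypothetical format). No fitting is
    involved: each node simply stores summary statistics of its members.
    """
    values = defaultdict(list)
    for node_id, property_name, value in assignments:
        values[(node_id, property_name)].append(value)

    return {
        key: {
            "median": median(vals),        # value assigned at prediction time
            "variance": pvariance(vals),   # spread, usable as an error estimate
            "count": len(vals),            # number of atoms matched to the node
        }
        for key, vals in values.items()
    }


# Hypothetical example: three atoms matched node 0, one atom matched node 1.
stats = populate_nodes([
    (0, "mulliken", -0.30),
    (0, "mulliken", -0.34),
    (0, "mulliken", -0.32),
    (1, "mulliken", 0.10),
])
```

Because the statistics are keyed by node and property name, adding a further property in this sketch only adds new keys and leaves the tree topology untouched.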

D. Calculation of molecular properties
While the atomic properties are straightforward to assign using the atom-centered substructures in the DASH tree, assigning molecular properties requires us to be able to decompose the property value p into atomic contributions $a_{i,p}$ of the N atoms, which can then be combined using a function f,

$$p = f(a_{1,p}, a_{2,p}, \ldots, a_{N,p}). \tag{1}$$

The function f can have many forms, but ideally, it would be a simple equation in order to allow for fast property assignment. Here, we focused on two molecular properties for which such a function f exists. For both, the DASH tree was used to predict the atomic contributions, and then, the molecular property was calculated with the corresponding function f.

1. Molecular polarizability
The molecular polarizability $\chi_{\mathrm{mol}}$ can simply be calculated from the sum of atomic polarizabilities $\chi_i$, 13,21

$$\chi_{\mathrm{mol}} = \sum_{i=1}^{N} \chi_i. \tag{2}$$

While the molecular polarizability is experimentally measurable, the agreement between QM calculations and experimental values is not very high; 22 thus, we compare the values estimated from the DASH tree with the calculated reference values.
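As a small worked illustration of combining atomic contributions into a molecular property, the sketch below sums per-atom polarizabilities. The function names and the numerical values are hypothetical and only stand in for values that would be looked up in the DASH tree.

```python
import math


def molecular_property(atomic_contributions, combine=math.fsum):
    """Combine per-atom contributions with a function f (default: sum)."""
    return combine(atomic_contributions)


def molecular_polarizability(atomic_polarizabilities):
    """Molecular polarizability as the plain sum of atomic polarizabilities."""
    return molecular_property(atomic_polarizabilities, combine=math.fsum)


# Hypothetical per-atom values (arbitrary units) for a three-atom molecule.
chi_mol = molecular_polarizability([1.0, 2.5, 0.5])
```

Keeping the combination function as a parameter mirrors the general form p = f(a_1, ..., a_N): other molecular properties would only need a different `combine` callable.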

2. Molecular dipole
Another experimentally measurable property with a simple combination function is the molecular dipole,

$$\vec{\mu}_{\mathrm{mol}} = \sum_{i=1}^{N} \left( q_i \vec{r}_i + \vec{d}_i \right), \tag{3}$$

where $q_i$ is the partial charge of atom i, $\vec{r}_i$ is the position vector of the atom with respect to the center of mass, and $\vec{d}_i$ is the bond dipole vector. 23 As described by Bader et al., 23 the contribution of the bond dipole vector $\vec{d}_i$ to the molecular dipole is often small compared to $q_i \vec{r}_i$. From the MBIS calculation, we can obtain the atomic dipole moments, which can then be projected onto the bonds of each atom to calculate the bond dipole vector. In DASH, the bonds are distinguished by the attention order; thus, we can use the bond to the first atom in the DASH tree (highest attention) as the first projection vector and so forth.
Again, the molecular dipole is experimentally measurable, but we compare the values estimated from the DASH tree with density functional theory (DFT) reference values due to the known deviations between QM and experiment. 22
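The dipole combination described in this subsection can be sketched as follows, assuming the molecular dipole is the sum over atoms of the point-charge term (charge times position relative to the center of mass) plus the bond dipole vector. The function names, the tuple-based vector format, and the example numbers are assumptions for illustration; units are whatever the inputs use (e.g., e·Å).

```python
def center_of_mass(masses, positions):
    """Mass-weighted mean position; positions are (x, y, z) tuples."""
    total = sum(masses)
    return tuple(
        sum(m * r[k] for m, r in zip(masses, positions)) / total
        for k in range(3)
    )


def molecular_dipole(charges, masses, positions, bond_dipoles=None):
    """Sum q_i * (r_i - r_com) + d_i over all atoms.

    `bond_dipoles` holds one (x, y, z) vector per atom; if omitted, the
    bond-dipole term is treated as zero (point-charge approximation).
    """
    if bond_dipoles is None:
        bond_dipoles = [(0.0, 0.0, 0.0)] * len(charges)
    com = center_of_mass(masses, positions)
    mu = [0.0, 0.0, 0.0]
    for q, r, d in zip(charges, positions, bond_dipoles):
        for k in range(3):
            mu[k] += q * (r[k] - com[k]) + d[k]
    return tuple(mu)


# Hypothetical diatomic: +0.2 e at the origin, -0.2 e one unit along x.
mu = molecular_dipole(
    charges=[0.2, -0.2],
    masses=[1.0, 1.0],
    positions=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
)
```

For a neutral molecule the point-charge term does not depend on the choice of origin, so the center of mass matters only for charged species.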

E. Optimization of the DASH implementation
The implementation of the DASH tree published in Ref. 6 was adapted in this work to reduce runtime and to allow for an arbitrary number of additional properties. Pre-calculation of atom features and bond relations, as well as a more efficient search for possible new subgraph expansions, reduced the matching time by an order of magnitude compared to the original implementation. Separating the tree structure from the stored data, at the level of both the data structure and the file structure, allows for the addition of many properties. The improvements and code changes are described in more detail in the supplementary material.

III. RESULTS AND DISCUSSION
In this work, we aim to test the hypothesis that the DASH tree built to assign MBIS partial charges captures sufficient general information about the environment of the atoms that the tree is directly applicable to the prediction of other atomic properties. The same data splits as in Lehner et al. 6 were used to compare the new properties with the corresponding reference values. Note that in contrast to Ref. 6, the validation set was not used for any meta-parameter optimization in this work. The results for the validation set are shown in Secs. III A-III D, while those for the VEHICLe 8 test set are given in the supplementary material.

A. Different partial charge models
The atomic properties most closely related to MBIS partial charges are other charge models such as AM1-BCC, 11 Mulliken, 10 or RESP. 12 Given the similarity, we expected the DASH tree to perform well for these, which is confirmed in Fig. 2.
All three charge models show small root-mean-square errors (RMSEs) and high correlations (as measured by R² and Kendall's tau) between the partial charges assigned with the DASH tree and the reference values. Ninety percent of the predictions are found within 0.01 e of the reference for AM1-BCC, 0.02 e for Mulliken, and 0.04 e for RESP. The very good performance of the DASH tree on the AM1-BCC and RESP charges (both used in classical force fields for MD simulations), combined with the reproducibility, high speed, and explainability of the approach, renders DASH a flexible tool for generating partial charges for different simulation setups. Although the Mulliken charges are not as popular in MD, they are widely used as descriptors in ML approaches, where the DASH tree could be used to assign charges without the need for generating 3D conformers and performing QM calculations.
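The two error summaries quoted above (the RMSE and the error bound that contains 90% of the predictions) can be computed as in the following sketch. The function names and the nearest-rank definition of the bound are assumptions for illustration; the evaluation scripts behind the reported numbers are not shown in the text.

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between reference and predicted values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = math.fsum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


def error_bound(reference, predicted, percent=90):
    """Smallest absolute error t such that at least `percent` % of the
    predictions satisfy |reference - predicted| <= t (nearest-rank rule)."""
    abs_errors = sorted(abs(r - p) for r, p in zip(reference, predicted))
    # Integer ceiling of n * percent / 100 avoids floating-point rounding.
    rank = (len(abs_errors) * percent + 99) // 100
    return abs_errors[max(rank, 1) - 1]


# Hypothetical charges (in e): ten atoms with errors of 0.00, 0.01, ..., 0.09.
reference = [0.0] * 10
predicted = [0.01 * i for i in range(10)]
summary = (rmse(reference, predicted), error_bound(reference, predicted))
```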
A comparison of all four different partial charge models for an example molecule is provided in Fig. 3. The different charges are calculated with the DASH tree for the atom highlighted in dark green in the right panel for an increasing matching depth in the tree. The left plot shows how all charge models converge to their final values during the matching process, while their prediction error decreases as more atoms are added to the subgraph. The added atoms are indicated with different shadings from dark yellow (initial) to light yellow (last added atom) in the right panel. These types of figures can be easily generated with the DASH tree and can serve as a source of explainability and error estimation.

B. Atomic dispersion and polarizability
Next, the same DASH tree was used to predict the dispersion (C6) and polarizability of atoms. The reference values were calculated with DFT-D4. 13 As shown in Fig. 4, the DASH tree performs exceptionally well in predicting these properties.

C. Electrophilicity and nucleophilicity
5][26] Fukui functions are a measure of the change in the electron density of a molecule when an electron is added or removed. Morell et al. 15 proposed to merge the Fukui functions into a single function termed the dual descriptor, which captures the same properties in a single, more robust function. 27 The dual descriptor is typically calculated as a function of the position in space relative to the molecule and visualized as a surface.
Here, the dual descriptor function was integrated for each atom, and the resulting parameter can be predicted with the DASH tree. The integration drastically reduces the resolution of the property since it is possible that the same atom has both strong electrophilic and nucleophilic sites, as seen in the simple case of the oxygen atom in water (see Fig. S4 of the supplementary material). Due to the loss in resolution, the task was modified into a classification problem to distinguish between weak and strong nucleophiles or electrophiles. We defined four bins as follows: strong nucleophiles (< −0.05), weak nucleophiles (−0.05 to 0), weak electrophiles (0 to 0.05), and strong electrophiles (> 0.05). As can be seen in Fig. 5, the DASH tree performs well overall, with some confusion between weak nucleophiles and weak electrophiles.
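The four-bin classification can be written down directly from the thresholds given above. In this sketch the function names are hypothetical, and the handling of values that fall exactly on a bin edge (0 or ±0.05) is an assumption, since the text does not specify it.

```python
from collections import Counter


def classify_dual_descriptor(value, threshold=0.05):
    """Map an atom-integrated dual descriptor value to one of four classes.

    Assumed edge handling: -threshold counts as weak nucleophile, while
    0 and +threshold count as weak electrophile.
    """
    if value < -threshold:
        return "strong nucleophile"
    if value < 0.0:
        return "weak nucleophile"
    if value <= threshold:
        return "weak electrophile"
    return "strong electrophile"


def confusion_counts(reference_values, predicted_values):
    """Count (reference class, predicted class) pairs, as in a confusion matrix."""
    return Counter(
        (classify_dual_descriptor(r), classify_dual_descriptor(p))
        for r, p in zip(reference_values, predicted_values)
    )


# Hypothetical values: the second atom flips from weak electrophile to
# weak nucleophile, the kind of confusion mentioned in the text.
counts = confusion_counts([-0.10, 0.01], [-0.08, -0.01])
```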

D. Molecular properties
The polarizability of a molecule was calculated with Eq. (2) and compared to the reference values [Fig. 6(a)]. The DASH approach shows a high correlation and small RMSE compared to the reference values. Kendall's tau is even better than for the atomic polarizabilities, possibly due to some cancellation of errors.
The molecular dipole moment was calculated from the atomic contributions with the function described by Bader et al.

IV. CONCLUSIONS
We report that the DASH tree developed to predict MBIS partial charges can be exploited to predict other atomic properties, without re-training the graph neural network from which the tree was built. Thus, the substructures in the DASH tree are shown to be transferable between properties and encode the important local electronic environment of an atom. We demonstrated this by predicting different partial charge models (AM1-BCC, Mulliken, and RESP), atomic dispersion (C6), atomic polarizability, and the dual descriptor for electrophilicity and nucleophilicity. Furthermore, molecular properties, such as molecular polarizability and molecular dipole, are accessible if there is a relationship with atomic contributions predicted by the DASH tree. Even if the atomic property is negatively impacted by the 2D nature of the description in DASH (i.e., the 3D conformational dependence is ignored), as seen for the atomic dipole moment, a good performance was found for the resulting molecular dipole.
Predicting AM1-BCC and RESP charges with the DASH tree can be beneficial for existing fixed-charge force fields, as it provides a computationally efficient and user-friendly method. As the DASH approach works with 2D structures, conformational averaging is not required, and the runtime scales linearly with the number of atoms in a molecule instead of the higher-order scaling of semi-empirical or DFT methods. In addition, it provides high accuracy while maintaining a high level of explainability due to the fragment-based assignment.
We conclude that the DASH tree can be used as an accurate, explainable, and computationally efficient model to predict a diverse set of atomic and molecular properties. It presents an attractive alternative to semi-empirical methods or ML approaches for property calculation, especially for applications with a large number of molecules. The final DASH tree with the different atomic properties as well as the complete dataset with wave functions is made freely available.

SUPPLEMENTARY MATERIAL
See the supplementary material for details on implementation optimizations, additional information on the atomic dipole moment and dual descriptor, and the results for the VEHICLe test set.

FIG. 2. Comparison between the computed reference values and the predictions with the DASH tree for partial charges of the validation set calculated with three different models: (a) AM1-BCC, 11 (b) Mulliken, 10 and (c) RESP. 12 The insets show a histogram of the error (difference between the reference and DASH values) without logarithmic scaling.

FIG. 3. Example of the charge assignment process with the DASH tree for the dark yellow atom of the molecule shown on the right (atom 0). By expanding the subgraph from the initial atom, atoms in lighter yellow and with higher numbers are added, and the charge predictions begin to converge to the final value, with decreasing assignment uncertainty (error bars).

FIG. 4. Comparison of the atomic dispersion (a) and polarizability (b) between DFT-D4 13 and the DASH tree for the validation set. The insets show a histogram of the error (difference between the reference and predicted values) without logarithmic scaling.