Density functional theory (DFT) is the most successful and widely used approach for computing the electronic structure of matter. However, for tasks involving large sets of candidate molecules, running DFT separately for every compound of interest is prohibitively expensive. In this paper, we propose a neural network based machine learning algorithm which, given a sufficiently large training sample of actual DFT results, can instead learn to predict certain properties of molecules purely from their molecular graphs. Our algorithm is based on the recently proposed covariant compositional networks framework and involves tensor reduction operations that are covariant with respect to permutations of the atoms. This new approach avoids some of the representational limitations of other neural networks that are popular in learning from molecular graphs and yields promising results in numerical experiments on the Harvard Clean Energy Project and QM9 molecular datasets.
I. INTRODUCTION
Density functional theory1 (DFT) is the workhorse of modern quantum chemistry due to its ability to calculate many properties of molecular systems with high accuracy. However, this accuracy comes at a significant computational cost, generally making DFT too costly for applications such as chemical search and drug screening, which may involve hundreds of thousands of compounds. Methods that help overcome this limitation could lead to rapid developments in biology, medicine, and materials engineering.
Recent advances in machine learning, specifically, deep learning,2 combined with the appearance of large datasets of molecular properties obtained both experimentally and theoretically,3–5 present an opportunity to learn to predict the properties of compounds from their chemical structure rather than computing them explicitly with DFT. A machine learning model could allow for rapid and accurate exploration of huge molecular spaces to find suitable candidates for a desired molecule.
Central to any machine learning technique is the choice of a suitable set of descriptors or features used to parametrize and describe the input data. A poor choice of features will limit the expressiveness of the learning architecture and make accurate predictions impossible. On the other hand, providing too many features may make training difficult, especially when training data is limited. Hence, there has been a significant amount of work on designing good features for molecular systems. Predictions of energetics based on molecular geometry have been explored extensively, using a variety of parametrizations.6,7 This includes bond types and/or angles,8 radial distance distributions,9 the Coulomb matrix and related structures,10–12 the Smooth Overlap of Atomic Positions (SOAP),13,14 permutation-invariant distances,15 symmetry functions for atomic positions,16,17 moment tensor potentials,18 and scattering networks.19
Recently, the problem of learning from the structure of chemical bonds alone, i.e., the molecular graph, has attracted a lot of interest, especially in the light of the appearance of a series of neural network architectures designed specifically for learning from graphs.20–26 Much of the success of these architectures stems from their ability to pick up on structures in the graph at multiple different scales, while satisfying the crucial requirement that the output be invariant to permutations of the vertices (which, in molecular learning, correspond to atoms). However, as we will show, the specific way that most of these architectures ensure permutation invariance still limits their representational power.
In this paper, we propose a new type of neural network for learning from molecular graphs, based on the idea of covariant compositional networks (CCNs), recently introduced in Ref. 27. Importantly, CCNs are based on explicitly decomposing compound objects (in our case, molecular graphs) into a hierarchy of subparts (subgraphs), offering a versatile framework that is ideally suited to capturing the multiscale nature of molecular structures, from functional groups through local structures to the overall shape. In a certain sense, the resulting models are the neural network analog of coarse graining. In addition, CCNs offer a more nuanced take on permutation invariance than other graph learning algorithms; while the overall output of a CCN is still invariant to the permutation of identical atoms, internally, the activations of the network are covariant rather than invariant, allowing us to better preserve information about the relationships between atoms. We demonstrate the success of this architecture through experiments on benchmark datasets, including QM93 and the Harvard Clean Energy Project.4
II. LEARNING FROM MOLECULAR GRAPHS
This paper addresses the problem of predicting chemical properties directly from each compound’s molecular graph, which we will denote G (Fig. 1, left). As such, it is related to the sizable literature on learning from graphs. In the kernel machines domain, this includes algorithms based on random walks,28–30 counting subgraphs,31 spectral ideas,32 label propagation schemes with hashing,33,34 and even algebraic ideas.35
Left: Molecular graphs for C18H9N3OSSe and C22H15NSeSi from the Harvard Clean Energy Project (HCEP)4 dataset with corresponding adjacency matrices. Center and right: The comp-net of a graph G is constructed by decomposing G into a hierarchy of subgraphs {Pi} and forming a neural network in which each “neuron” ni corresponds to one of the subgraphs Pi and receives inputs from other neurons that correspond to smaller subgraphs contained in Pi. The center pane shows how this can equivalently be thought of as an algorithm in which each vertex of G receives and aggregates messages from its neighbors. To keep the figure simple, we only marked aggregation at a single vertex in each round (layer).
Recently, a sequence of neural network based approaches has also appeared, starting with Ref. 36. Some of the proposed graph learning architectures21,22,37 directly seek inspiration from the type of classical convolutional neural networks (CNNs) that are used for image recognition.38,39 These methods involve moving a filter across vertices to create feature representations of each vertex based on its local neighborhoods. Other notable studies on graph neural networks include Refs. 24 and 40–42. Very recently, Ref. 25 showed that many of these approaches can be seen as message passing schemes and coined the term message passing neural networks (MPNNs) to refer to them collectively.
Regardless of the specifics, the two major issues that graph learning methods need to grapple with are invariance to permutations and capturing structures at multiple different scales. Let A denote the adjacency matrix of G, and suppose that we change the numbering of the vertices by applying a permutation σ. The adjacency matrix will then change to A′, with

A′_{σ(i), σ(j)} = A_{i,j},    or equivalently    A′ = Pσ A Pσ⊤,

where Pσ is the permutation matrix of σ.
However, topologically, A and A′ still represent the same graph. Permutation invariance means that the representation learned or implied by our graph learning algorithm must be invariant to these transformations. Naturally, in the case of molecules, invariance is restricted to permutations that map each atom to another atom of the same type.
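To make this transformation concrete, the following numpy snippet (an illustration of ours, not part of the paper's code) applies a permutation σ to the adjacency matrix of a toy graph and checks that a permutation-invariant quantity, here the eigenvalue spectrum, is unchanged:

```python
import numpy as np

# Toy 4-vertex adjacency matrix (a path graph 0-1-2-3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# A permutation sigma of the vertices: sigma[i] is the new index of vertex i.
sigma = np.array([2, 0, 3, 1])

# Permutation matrix with P[sigma[i], i] = 1, so that A' = P A P^T
# satisfies A'[sigma(i), sigma(j)] = A[i, j].
P = np.zeros((4, 4))
P[sigma, np.arange(4)] = 1.0

A_prime = P @ A @ P.T

# Entrywise relabeling holds, and the sorted spectrum (a permutation-invariant
# quantity) is the same for A and A'.
assert np.allclose(A_prime[sigma[:, None], sigma[None, :]], A)
print(np.allclose(np.linalg.eigvalsh(A), np.linalg.eigvalsh(A_prime)))  # True
```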
The multiscale property is equally crucial for learning molecular properties. For example, in the case of a protein, an ideal graph learning algorithm would represent G in a manner that simultaneously captures structures at the level of individual atoms, functional groups, interactions between functional groups, subunits of the protein, and the protein’s overall shape.
A. Compositional networks
The idea of representing complex objects in terms of a hierarchy of parts has a long history in machine learning.43–48 We have recently introduced a general framework for encoding such part-based models in a special type of neural network, called covariant compositional networks.27 In this paper, we show how the CCN formalism can be specialized to learning from graphs, specifically, the graphs of molecules. Our starting point is the following definition.
Let G be the graph of a molecule made up of n atoms e1, …, en. The compositional neural network (comp-net) N corresponding to G is a directed acyclic graph (DAG) in which each node (neuron) ni is associated with a subgraph Pi of G and carries an activation fi. Moreover,
If ni is a leaf node, then Pi is just a single vertex of G, i.e., an atom eξ(i), and the activation fi is some initial label li. In the simplest case, fi is a “one-hot” vector that identifies what type of atom resides at the given vertex.
N has a unique root node nr for which Pr = G, and the corresponding fr represents the entire molecule.
For any two nodes ni and nj, if nj is a descendant of ni, then Pj is a subgraph of Pi, and fi = Φ(f_{c1}, …, f_{ck}), where n_{c1}, …, n_{ck} are the children of ni and Φ is an aggregation function.
There is some freedom in how the system of subgraphs is defined, but the default choice is to take Pi at level ℓ to be the ball B(i, ℓ), i.e., the subgraph induced by the vertices within a distance ℓ of vertex i. In this case, Pi at level 0 is just vertex i itself, at level 1 it consists of the immediate neighbors of i plus i itself, and so on. The aggregation function Φ is discussed in detail in Sec. IV.
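For concreteness, a small sketch (ours, not from the paper's implementation) of these default receptive fields, computed by breadth-first search on an ethylene-like toy graph (cf. Fig. 4):

```python
from collections import deque

def ball(adj, i, ell):
    """Return the set of vertices within graph distance ell of vertex i,
    i.e., the default receptive field P_i at level ell."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        v = queue.popleft()
        if dist[v] == ell:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

# Toy ethylene-like graph: carbons 0 and 1 bonded to each other,
# hydrogens 2, 3 on carbon 0 and hydrogens 4, 5 on carbon 1.
adj = {0: [1, 2, 3], 1: [0, 4, 5], 2: [0], 3: [0], 4: [1], 5: [1]}
print(ball(adj, 2, 0))  # {2}: just the hydrogen itself
print(ball(adj, 2, 1))  # {0, 2}: the hydrogen plus its carbon neighbor
print(ball(adj, 2, 2))  # {0, 1, 2, 3}: grows with the local topology
```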
Conceptually, comp-nets are closely related to convolutional neural networks (CNNs), which are the mainstay of neural networks in computer vision. In particular,
Each neuron in a CNN only aggregates information from a small set of neurons from the previous layer, similar to how each node of a comp-net only aggregates information from its children.
The so-called effective receptive fields of the neurons in a CNN, i.e., the image patches that each neuron is responsible for, form a hierarchy of nested sets similar to the hierarchy of subgraphs.
As mentioned above, an alternative popular framework for learning from graphs is message passing neural networks (MPNNs).25 An MPNN operates in rounds: in each round ℓ = 1, …, L, every vertex collects the labels of its immediate neighbors, applies a nonlinear function Ψ, and updates its own label accordingly. From the neural networks point of view, the rounds correspond to layers and the labels correspond to the activations (Algorithm 1). More broadly, the classic Weisfeiler–Lehman test of isomorphism follows the same logic49–51 and so does the related Weisfeiler–Lehman kernel, arguably the most successful kernel-based approach to graph learning.33
In the above mentioned base case, when the system of subgraphs consists of all neighborhoods of G of radii ℓ = 0, 1, 2, …, a comp-net can also be thought of as a message passing algorithm: the messages received by vertex i in round ℓ are the activations {fj at round ℓ − 1 : j ∈ B(i, 1)}, where B(i, 1) is the ball of radius one centered at i (note that this contains not just the neighbors of i but also i itself). Conversely, MPNNs can be seen as comp-nets, where Pi at level ℓ is the subgraph defined by the receptive field of vertex i in round ℓ. A common feature of MPNNs, however, is that the Ψ aggregation functions that they employ are invariant to permuting their arguments. Most often, Ψ just sums all incoming messages and then applies a nonlinearity. This certainly guarantees that the final output of the network, ϕ(G), will be permutation invariant. However, in Sec. III, we argue that it comes at the price of a significant loss of representational power.
High level schematic of message passing type algorithms25 in the notations of the present paper. Here li are the starting labels, and N(i) denotes the set of neighbors of vertex i.
for each vertex i:   fi^0 ← li
for ℓ = 1 to L:
    for each vertex i:   fi^ℓ ← Ψ( fi^{ℓ−1}, { fj^{ℓ−1} : j ∈ N(i) } )
for each vertex i:   collect fi^L and combine into the permutation invariant output ϕ(G)
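A minimal runnable sketch of this message passing scheme (our illustration; the sum-then-linear-map-then-tanh update is one simple choice of Ψ, not the specific one used by any particular MPNN):

```python
import numpy as np

def message_passing(adj, labels, weights):
    """Plain MPNN-style updates: each vertex sums the previous-round labels of
    itself and its neighbors, then applies a learned linear map and a tanh.
    adj: dict vertex -> list of neighbors; labels: dict vertex -> feature vector;
    weights: list of (d, d) matrices, one per round."""
    f = {i: np.asarray(l, dtype=float) for i, l in labels.items()}
    for W in weights:                      # rounds ell = 1..L
        f = {i: np.tanh(W @ (f[i] + sum(f[j] for j in adj[i])))
             for i in adj}
    # Permutation-invariant readout: sum the final labels over all vertices.
    return sum(f.values())

# Tiny usage example on a triangle graph with 2-dimensional one-hot-like labels.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
labels = {0: [1, 0], 1: [0, 1], 2: [0, 1]}
weights = [np.eye(2), np.eye(2)]           # L = 2 rounds with identity maps
print(message_passing(adj, labels, weights))
```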
III. COVARIANT COMPOSITIONAL NETWORKS
One of the messages of the present paper is that invariant aggregation functions, of the type popularized by message passing neural networks, are not the most general way to build compositional models for compound objects, such as graphs. To understand why this is the case, once again, an analogy with image recognition is helpful. Classical CNNs face two types of basic image transformations: translations and rotations. With respect to translations (barring pooling, edge effects, and other complications), CNNs behave in a quasi-invariant way, in the sense that if the input image is translated by an integer amount (tx, ty), the activations in each layer ℓ = 1, 2, …, L translate the same way: the activation of the neuron at position (i, j) is simply transferred to the neuron at position (i + tx, j + ty), i.e., f′^ℓ_{i+tx, j+ty} = f^ℓ_{i,j}. This is the simplest manifestation of a well studied property of CNNs called equivariance.52,53
With respect to rotations, however, the situation is more complicated: if we rotate the input image by, e.g., 90°, not only does the part of the image that fell in the receptive field of a given neuron move to the receptive field of a different neuron, but the orientation of the receptive field also changes. For example, features which were previously picked up by horizontal filters will now be picked up by vertical filters. Therefore, in general, f′^ℓ_{i′,j′} ≠ f^ℓ_{i,j} (Fig. 2).
In a convolutional neural network, if the input image is translated by some amount (t1, t2), what used to fall in the receptive field of the neuron at (i, j) is moved to the receptive field of the neuron at (i + t1, j + t2). Therefore, the activations transform in the very simple way f′_{i+t1, j+t2} = f_{i,j}. By contrast, rotations not only move the receptive fields around but also permute the neurons in the receptive field internally, so the activations transform in a more complicated manner. The right hand figure shows that if the CNN has a horizontal filter (blue) and a vertical one (red), then after a rotation by 90°, their activations are exchanged. In steerable CNNs, if (i, j) is moved to (i′, j′) by the transformation, then f′_{i′,j′} = R(f_{i,j}), where R is some fixed linear function of the rotation angle.
It can be shown that one cannot construct a CNN for images that behaves in a quasi-invariant way with respect to both translations and rotations, unless every filter is directionless. It is, however, possible to construct a CNN in which the activations transform in a predictable and reversible way, f′_{i′,j′} = R(f_{i,j}), for some fixed function R. This phenomenon is called steerability and has a significant literature in both classical signal processing54–58 and the neural networks field.59
The situation in compositional networks is similar. The comp-net and message passing architectures that we have examined so far, by virtue of the aggregation function Φ being symmetric in its arguments, are all quasi-invariant with respect to permutations, in the sense that if N and N′ are two comp-nets for the same graph differing only in a reordering σ of the vertices of the underlying graph G, and ni is a neuron in N while n′_{i′} is the corresponding neuron in N′, then f′_{i′} = fi for any permutation σ.
Quasi-invariance amounts to asserting that the activation fi at any given node must only depend on the receptive field Pi as a set and not on the internal ordering of the atoms making up the receptive field. At first sight, this seems desirable since it is exactly what we expect from the overall representation ϕ(G). On closer examination, however, we realize that this property is potentially problematic since it means that fi loses all information about which vertex in the receptive field has contributed what to the aggregate. In the CNN analogy, we can say that we have lost information about the orientation of the receptive field. In particular, if higher up in the network fi is combined with some other feature vector fj from a node with an overlapping receptive field, the aggregation process has no way of taking into account which parts of the information in fi and fj come from shared vertices and which parts do not (Fig. 3).
These two graphs are not isomorphic, but after a single round of message passing (red arrows), the labels (activations) at vertices 2 and 3 will be identical in both graphs. Therefore, in the second round of message passing, vertex 1 will get the same messages in both graphs (blue arrows) and will have no way of distinguishing whether 5 and 5′ are the same vertex or not.
The solution is to regard the receptive fields as ordered sets and explicitly establish how fi co-varies with the internal ordering of the receptive fields. To emphasize that henceforth the sets are ordered, we will use parentheses rather than braces to denote them.
Assume that N is the comp-net of a graph G and N′ is the comp-net of the same graph after its vertices have been permuted by some permutation σ. Given any node ni of N with receptive field Pi = (e_{p_1}, …, e_{p_m}), let n′_{i′} be the corresponding node in N′ with receptive field P′_{i′} = (e_{q_1}, …, e_{q_m}). Let π be the permutation that aligns the orderings of the two receptive fields, i.e., for which e_{q_{π(a)}} = e_{p_a}. We say that the comp-nets are covariant to permutations if for any π, there is a corresponding function Rπ such that f′_{i′} = Rπ(fi).
To make the form of covariance prescribed by this definition more specific, we make the assumption that the maps Rπ are linear. This allows us to treat them as matrices, {Rπ}. Furthermore, linearity also implies that the {Rπ} matrices form a representation of the symmetric group in the group theoretic sense of the word60 (this notion of representation should not be confused with the neural network sense of representations of objects, as in “fi is a representation of Pi”).
We say that ni is a first order covariant node in a comp-net if under the permutation of its receptive field Pi by any π, its activation transforms as fi ↦ Pπfi, where Pπ is the permutation matrix of π.
We say that ni is a second order covariant node in a comp-net if under the permutation of its receptive field Pi by any π, its activation transforms as fi ↦ PπfiPπ⊤.
In general, we will call any quantity which transforms under permutations according to

F_{i_1, …, i_k} ↦ F′_{i_1, …, i_k} = F_{π^{-1}(i_1), …, π^{-1}(i_k)}    (1)

a k’th order P-tensor. Saying that a given quantity is a P–tensor then not only means that it is representable by an m × m × ⋯ × m array but also that this array transforms in a specific way under permutations.
Since scalars, vectors, and matrices can be considered 0th, 1st, and 2nd order tensors, the following definition covers both quasi-invariance and first and second order covariance as special cases. To unify notation and terminology, in the following, we will always talk about feature tensors rather than feature vectors and denote the activations as Fi rather than fi.
We say that ni is a k’th order covariant node in a comp-net if the corresponding activation Fi is a k’th order P–tensor, i.e., it transforms under permutations of its receptive field according to (1), or if the activation is a sequence of d separate P–tensors corresponding to d distinct channels.
A covariant compositional network (CCN) is a comp-net in which each node’s activation is covariant to permutations in the above sense. Hence we can talk about first, second, third, etc., order CCNs (CCN 1D, 2D, …) (Fig. 4).
The real significance of covariance, both here and in classical CNNs, is that while it allows for a richer internal representation of the data than fully invariant architectures, the final output of the network can still easily be made invariant. In covariant comp-nets, this is achieved by collapsing the input tensors of the root node at the top of the network into invariant scalars, for example, by computing their sums and traces (reducing them to zeroth order tensors) and outputting permutation invariant combinations of these scalars, such as their sum.
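A small numpy check (ours, not from the paper's code) of this point: permuting a receptive field changes first and second order P-tensors covariantly, yet their sums and traces, the kind of scalars used in the invariant readout, are unchanged.

```python
import numpy as np

m = 5
rng = np.random.default_rng(0)
F1 = rng.normal(size=m)        # first order P-tensor (one channel)
F2 = rng.normal(size=(m, m))   # second order P-tensor (one channel)

pi = rng.permutation(m)
P = np.zeros((m, m)); P[pi, np.arange(m)] = 1.0

F1p = P @ F1           # first order:  f -> P_pi f
F2p = P @ F2 @ P.T     # second order: F -> P_pi F P_pi^T

# Invariant readouts: the sum of entries and the trace are unchanged by pi.
print(np.isclose(F1.sum(), F1p.sum()),
      np.isclose(F2.sum(), F2p.sum()),
      np.isclose(np.trace(F2), np.trace(F2p)))   # True True True
```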
Feature tensors in a first order CCN for ethylene (C2H4) assuming three channels (red, green, and blue). Vertices e1, e2, e5, e6 are hydrogen atoms, while vertices e3, e4 are carbons. Edge (e3, e4) is a double bond and the other edges are single bonds. (a) At the input layer, the receptive field for each atom is just the atom itself, so the feature matrix has size 1 × 3. (b) At level ℓ = 1, the size of the receptive field of each atom grows depending on the local topology. The receptive fields for each hydrogen atom grow to include its neighboring carbon atom, resulting in a feature tensor of size 2 × 3; the receptive fields of the carbon atoms grow to include four atoms each and therefore have size 4 × 3. (c) At layer ℓ = 2, the receptive fields of the carbons will include every atom in the molecule, while the receptive fields of the hydrogens will only be of size 4.
IV. COVARIANT AGGREGATION FUNCTIONS
It remains to explain how to define the Φ aggregation functions so as to guarantee covariance. Specifically, we show how to construct Φ such that if the inputs of a given node at level ℓ are covariant k’th order P–tensors, then its activation will also be a k’th order P–tensor. The aggregation function that we define consists of five operations executed in sequence: promotion, stacking, reduction, mixing, and an elementwise nonlinear transform. (See Fig. 5.) Practically relevant CCNs tend to have multiple channels, so each activation is actually a sequence of d separate P–tensors, one per channel. However, except for the mixing step, each channel behaves independently, so for simplicity, for now we drop the channel index.
Schematic of the aggregation process at a given vertex of G in a second order CCN, not involving tensor products with the adjacency matrix and assuming a single input channel. Feature tensors F0, …, F3 are collected from the neighbors of the vertex as well as from the vertex itself, promoted, and stacked to form a third order tensor T. In this example, we compute just three reductions Q1, Q2, Q3. These are then combined with the weights W and passed through the σ nonlinearity to form the output tensors. For simplicity, in this figure, the “ℓ” and “a” indices are suppressed.
A. Promotion
Each child tensor F_{c_j} captures information about a different receptive field P_{c_j}, so before combining them we must “promote” each F_{c_j} to a tensor F̃_{c_j} whose dimensions are indexed by the vertices of Pa rather than the vertices in each P_{c_j}. Assuming that |P_{c_j}| = m′ and |Pa| = m, this is done by defining an m × m′ indicator matrix χ, with

χ_{i,j} = 1 if the j’th vertex of P_{c_j} is the i’th vertex of Pa, and χ_{i,j} = 0 otherwise,

and setting

F̃_{i_1, …, i_k} = χ_{i_1 j_1} χ_{i_2 j_2} ⋯ χ_{i_k j_k} F_{j_1, …, j_k},

where, once again, Einstein summation is in use, so summation over j_1, …, j_k is implied. Effectively, the promotion step aligns all the child tensors by permuting the indices of each F_{c_j} to conform to the ordering of the atoms in Pa and padding with zeros where necessary.
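A sketch of the promotion step for a second order (k = 2) child tensor, written with numpy's einsum (the function name and vertex-list convention are ours, chosen only for illustration):

```python
import numpy as np

def promote(F_child, child_vertices, parent_vertices):
    """Promote a second order child tensor indexed by child_vertices to a
    tensor indexed by parent_vertices, padding with zeros where necessary."""
    m, mp = len(parent_vertices), len(child_vertices)
    chi = np.zeros((m, mp))
    for j, v in enumerate(child_vertices):
        # chi[i, j] = 1 if the j'th child vertex is the i'th parent vertex.
        chi[parent_vertices.index(v), j] = 1.0
    # F_tilde_{i1 i2} = chi_{i1 j1} chi_{i2 j2} F_{j1 j2}  (Einstein summation)
    return np.einsum('ij,kl,jl->ik', chi, chi, F_child)

F_child = np.arange(4.0).reshape(2, 2)         # tensor over child receptive field (vertices 0 and 2)
print(promote(F_child, [0, 2], [0, 1, 2, 3]))  # 4x4 tensor, zero rows/columns for vertices 1 and 3
```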
B. Stacking
Now that the promoted tensors all have the same shape, they can be stacked along a new dimension to form a (k + 1)’th order tensor T, with

T_{j, i_1, …, i_k} = [F̃_{c_j}]_{i_1, …, i_k}.
It is easy to see that T itself transforms as a P-tensor of order (k + 1).
We may also add additional information to T by taking its tensor product with another P-tensor. In particular, to explicitly add information about the local topology, we may tensor multiply T by A|_{Pa}, the restriction of the adjacency matrix to Pa. This will give an order (k + 3) tensor S = T ⊗ A|_{Pa}. Otherwise, we just set S = T. Note that in most other graph neural networks, the topology of the underlying graph is only accounted for implicitly, by the pattern in which different activations are combined. Being able to add the local adjacency matrix explicitly greatly extends the representational power of CCNs.
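A toy sketch of the stacking and optional tensor product steps in the same second order setting (the shapes and names below are our own convention, not the paper's implementation):

```python
import numpy as np

def stack_and_multiply(promoted, A_restricted=None):
    """promoted: list of s promoted second order tensors, each of shape (m, m).
    Returns T of shape (s, m, m); if A_restricted (an m x m matrix) is given,
    returns instead S = T (x) A_restricted of shape (s, m, m, m, m)."""
    T = np.stack(promoted, axis=0)                     # order k+1 = 3 tensor
    if A_restricted is None:
        return T
    # Outer (tensor) product with the restricted adjacency matrix: order k+3 = 5.
    return np.einsum('aij,kl->aijkl', T, A_restricted)
```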
C. Reduction
Stacking and the optional tensor product increase the order of our tensors from k to k + 1 or k + 3. We reduce this to k again by a combination of generalized tensor contractions, more accurately called tensor reductions. For example, one way to drop the rank of S from k + 3 to k is to sum out three of its indices,

Q_{i_1, …, i_k} = \sum_{i_{k+1}, i_{k+2}, i_{k+3}} S_{i_1, …, i_k, i_{k+1}, i_{k+2}, i_{k+3}}.

Note that while this notation does not make it explicit, i_{k+1}, i_{k+2}, and i_{k+3} must be removed from amongst the indices of Q. Another way to drop the rank is to contract over three indices,

Q_{i_1, …, i_k} = S_{i_1, …, i_k, a, b, c}\, \delta^{a,b,c},

where δ^{a,b,c} is the generalized Kronecker symbol and Einstein summation over a, b, c is implied. A third way is to contract over two indices and sum over one index and so on. The crucial fact is that the result of each of these tensor operations is still a covariant P–tensor.
In general, the number of different ways that an order k + q tensor S can be reduced to order k depends on both q and k. For example, when k = 2 and q = 3, there are 50 different possible tensor reductions (excluding diagonals). By contrast, when q = 1 (i.e., we are not multiplying by the adjacency matrix), we only have k + 1 possibilities, corresponding to summing S over each of its dimensions. No matter which case we are in, however, and no matter how many contractions Q1, …, Qr our network actually computes (in our experiments using second order nodes, we compute 18 different ones), what is important is that the resulting order k tensors satisfy the P–tensor property.
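For illustration, a few such reductions written with einsum (a small, arbitrary subset chosen by us; it is not the set of 18 contractions used in the experiments, and the toy shapes are our own):

```python
import numpy as np

s, m = 4, 6                          # number of stacked child tensors, receptive field size (toy sizes)
S = np.random.randn(s, m, m, m, m)   # order k+3 = 5 tensor from the stacking step (k = 2)

# Example reductions back to an order-2 (m x m) tensor; each is a generalized
# tensor contraction of the kind described in the text.
Q1 = np.einsum('aijkl->ij', S)   # sum out three indices (stacking index and both adjacency indices)
Q2 = np.einsum('aijkk->ij', S)   # contract the two adjacency indices against each other, sum the stacking index
Q3 = np.einsum('aijil->jl', S)   # contract one adjacency index against a feature index, sum the rest
print(Q1.shape, Q2.shape, Q3.shape)   # (6, 6) (6, 6) (6, 6)
```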
D. Mixing with learnable weights
The reduction step can potentially produce quite a large number of order k tensors, Q1, …, Qr. We reduce this number by linearly mixing Q1, …, Qr, i.e., taking d′ < r linear combinations of the form

F̃^{(j)} = \sum_{i=1}^{r} w_{j,i} Q_i,    j = 1, …, d′.    (2)
This is again a covariant operation, and the mixing weights are the actual learnable parameters of our neural network.
It is natural to interpret F̃^{(1)}, …, F̃^{(d′)} as d′ separate channels, in neural network terminology. Recall that we also allow the input tensors to have multiple channels, but up to now the corresponding index has been suppressed. The mixing step is the point where these channels are all allowed to interact, so (2) becomes

F̃^{(j)} = \sum_{c=1}^{d} \sum_{i=1}^{r} w^{ℓ}_{j,c,i} Q_i^{(c)},    j = 1, …, d′,

and the learnable weights of the network at each level form a third order tensor W^ℓ of size d′ × d × r. The channel index c is not to be confused with c1, …, ck denoting the children of node a. Equation (2) is the main mechanism whereby CCNs are able to learn increasingly complex topological features as we move up the network.
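In code, the multichannel mixing is a single contraction over the reduction index and the input channel index; a sketch with einsum (shapes follow our own convention above):

```python
import numpy as np

m, d, d_out, r = 6, 3, 5, 18        # receptive field size, input channels, output channels, reductions
Q = np.random.randn(d, r, m, m)     # second order reductions, one set per input channel
W = np.random.randn(d_out, d, r)    # learnable third order weight tensor W^ell

mixed = np.einsum('jcr,crpq->jpq', W, Q)   # d_out separate (m x m) P-tensors, one per output channel
print(mixed.shape)                          # (5, 6, 6)
```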
E. Nonlinearity
Finally, to get the actual activation of our neuron, we add an additive bias bℓ,c and apply an elementwise nonlinearity σ (specifically, the ReLU operator σ(x) = max(0, x)), as is standard in neural networks. Thus, ultimately, the output of the aggregation function is the collection of P–covariant tensors

F^{(c)} = σ( F̃^{(c)} + bℓ,c ),    c = 1, …, d′,

with the bias added entrywise. As usual in neural networks, the Wℓ weight tensors are learned by backpropagation and some form of stochastic gradient descent.
V. EXPERIMENTS
We tested our CCN framework on three different types of datasets that involve learning the properties of molecules from their structure: (a) four relatively small datasets of small molecules that are standard benchmarks in the kernel literature (MUTAG, PTC, NCI1 and NCI109), (b) the Harvard Clean Energy Project,4 and (c) QM9.3 The first two types of datasets are pure graph classification/regression tasks. QM9 also has physical (spatial) features, which go somewhat beyond the scope of the present paper. Therefore, on QM9, we conducted separate experiments with and without these physical features [QM9(b) vs. QM9(a)].
In each case, we used second order CCNs (CCN 2D) and included the tensor product with the restricted adjacency matrix, as described in Sec. IV B. However, for computational reasons, we only used 18 of the 50 possible contractions. The base features of each vertex were initialized with computed histogram alignment kernel features63 of depth up to 10: each vertex receives a base label formed by concatenating the vectors Hj(i) for j = 1, …, 10, where Hj(i) (a vector of length d, with d being the total number of distinct discrete node labels) holds the relative frequencies of each label for the set of vertices at a distance equal to j from vertex i. The network was chosen to be three levels deep, with the number of output channels at each level fixed to 10.
To run our experiments, we used our own custom-built neural network library called GraphFlow.64 Writing our own deep learning software became necessary because at the time we started work on the experiments, none of the standard frameworks such as TensorFlow,65 PyTorch,66 or MXNet67 had efficient support for the type of tensor operations required by CCN. GraphFlow is a fast, general purpose deep learning library that offers automatic differentiation, dynamic computation graphs, multithreading, and Graphics Processing Unit (GPU) support for tensor contractions. In addition to CCN, it also implements other graph neural networks, including neural graph fingerprints,21 PSCN,37 and gated graph neural networks.40 We also provide a reference implementation of CCN1D and CCN2D in PyTorch.68
In each experiment, we used 80% of the dataset for training, 10% for validation, and 10% for testing. For the kernel datasets, we performed the experiments on 10 separate training/validation/test stratified splits and averaged the resulting classification accuracies. Our training technique used mini-batch stochastic gradient descent with the Adam optimization method69 and a batch size of 64. The initial learning rate was set to 0.001 and decayed linearly after each step toward a minimum of 10⁻⁶.
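For reference, a minimal PyTorch-style sketch of these optimization settings (the model is a placeholder, and the length of the decay schedule is an assumption on our part; only the optimizer, batch size, initial learning rate, and floor come from the text above):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder standing in for the actual CCN model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Linear decay of the learning rate toward a floor of 1e-6 over `total_steps` steps.
total_steps = 100_000            # assumed; the paper does not state the number of steps
floor = 1e-6 / 1e-3
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(floor, 1.0 - step / total_steps))

# Inside the training loop (mini-batches of 64):
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```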
A. Graph kernel datasets
Our first set of experiments involved the standard “graph kernel” datasets: (1) MUTAG, which is a dataset of 188 mutagenic aromatic and heteroaromatic compounds,70 (2) PTC, which consists of 344 chemical compounds that have been tested for positive or negative toxicity in lab rats,71 and (3) NCI1 and NCI109, which have 4110 and 4127 compounds, respectively, each screened for activity against small cell lung cancer and ovarian cancer lines.72 In each of these datasets, each molecule has a discrete label (e.g., toxic/non-toxic or mutagenic/non-mutagenic), and the goal is to predict this label. We compare CCN 2D against the graph kernel results reported in Ref. 62 (C-SVM algorithm with the Weisfeiler–Lehman kernel,33 Weisfeiler–Lehman edge kernel,33 shortest paths graph kernel,29 graphlet kernel,31 and the multiscale Laplacian graph kernel62), neural graph fingerprints21 (with up to 3 levels and a hidden size of 10), and the “patchy-SAN” convolutional algorithm (PSCN).37 The results are presented in Table I.
Classification of results on the kernel datasets (accuracy ± standard deviation).
Method | MUTAG | PTC | NCI1 | NCI109
---|---|---|---|---
Weisfeiler–Lehman kernel33 | 84.50 ± 2.16 | 59.97 ± 1.60 | 84.76 ± 0.32 | 85.12 ± 0.29
Weisfeiler–Lehman edge kernel33 | 82.94 ± 2.33 | 60.18 ± 2.19 | 84.65 ± 0.25 | 85.32 ± 0.34
Shortest path kernel29 | 85.50 ± 2.50 | 59.53 ± 1.71 | 73.61 ± 0.36 | 73.23 ± 0.26
Graphlets kernel31 | 82.44 ± 1.29 | 55.88 ± 0.31 | 62.40 ± 0.27 | 62.35 ± 0.28
Random walk kernel28 | 80.33 ± 1.35 | 59.85 ± 0.95 | Timed out | Timed out
Multiscale Laplacian graph kernel62 | 87.94 ± 1.61 | 63.26 ± 1.48 | 81.75 ± 0.24 | 81.31 ± 0.22
PSCN (k = 10)37 | 88.95 ± 4.37 | 62.29 ± 5.68 | 76.34 ± 1.68 | N/A
Neural graph fingerprints21 | 89.00 ± 7.00 | 57.85 ± 3.36 | 62.21 ± 4.72 | 56.11 ± 4.31
Second order CCN (our method) | 91.64 ± 7.24 | 70.62 ± 7.04 | 76.27 ± 4.13 | 75.54 ± 3.36
B. Harvard clean energy project
The Harvard Clean Energy Project (HCEP)4 dataset consists of 2.3 × 10⁶ organic compounds that are candidates for use in solar cells. The inputs are molecular graphs (derived from their SMILES strings), and the regression target is the power conversion efficiency (PCE). The experiments were run on a random sample of 50 000 molecules.
On this dataset, we compared CCN to the following algorithms: Lasso, ridge regression, random forests, Gradient Boosted Trees (GBT), the optimal assignment Weisfeiler–Lehman graph kernel,33 neural graph fingerprints,21 and PSCN.37 For the first four of these baseline methods, we created simple feature vectors from each molecule: the number of bonds of each type (i.e., the number of H—H bonds, the number of C—O bonds, etc.) and the number of atoms of each type. Molecular graph fingerprints use atom labels of each vertex as base features. For ridge regression and Lasso, we cross-validated over the regularization parameter λ. For random forests and GBT, we used 400 trees and cross-validated over maximum depth, minimum samples for a leaf, minimum samples to split a node, and, for GBT, the learning rate. For neural graph fingerprints, we used up to 3 layers and a hidden layer size of 10. In PSCN, we used a patch size of 10 with two convolutional layers and a dense layer on top, as described in their paper (Tables II–IV).
HCEP regression results. Error of predicting power conversion efficiency in units of percent. (Best results indicated in bold.)
Method | Test MAE | Test RMSE
---|---|---
Lasso | 0.867 | 1.437
Ridge regression | 0.854 | 1.376
Random forest | 1.004 | 1.799
Gradient boosted trees | 0.704 | 1.005
Weisfeiler–Lehman kernel33 | 0.805 | 1.096
Neural graph fingerprints21 | 0.851 | 1.177
PSCN (k = 10)37 | 0.718 | 0.973
Second order CCN (our method) | 0.340 | 0.449
QM9(a) regression results (mean absolute error). Here we have only used the graph as the learning input without any physical features. (Best results indicated in bold.)
Target | WLGK | NGF | PSCN | CCN 2D
---|---|---|---|---
α (bohr³) | 3.75 | 3.51 | 1.63 | 1.30
Cv [cal/(mol K)] | 2.39 | 1.91 | 1.09 | 0.93
G (eV) | 4.84 | 4.36 | 3.13 | 2.75
GAP (eV) | 0.92 | 0.86 | 0.77 | 0.69
H (eV) | 5.45 | 4.92 | 3.56 | 3.14
HOMO (eV) | 0.38 | 0.34 | 0.30 | 0.23
LUMO (eV) | 0.89 | 0.82 | 0.75 | 0.67
μ (D) | 1.03 | 0.94 | 0.81 | 0.72
ω1 (cm⁻¹) | 192.16 | 168.14 | 152.13 | 120.10
R2 (bohr²) | 154.25 | 137.43 | 61.70 | 53.28
U (eV) | 5.41 | 4.89 | 3.54 | 3.02
U0 (eV) | 5.36 | 4.85 | 3.50 | 2.99
ZPVE (eV) | 0.51 | 0.45 | 0.38 | 0.35
The mean absolute error of CCN compared to the DFT error when using the complete set of physical features used in Ref. 25 in addition to the graph of each molecule. (Results below DFT error indicated in bold.)
Target | CCN | DFT error
---|---|---
α (bohr³) | 0.22 | 0.4
Cv [cal/(mol K)] | 0.07 | 0.34
G (eV) | 0.06 | 0.1
GAP (eV) | 0.12 | 1.2
H (eV) | 0.06 | 0.1
HOMO (eV) | 0.09 | 2.0
LUMO (eV) | 0.09 | 2.6
μ (D) | 0.48 | 0.1
ω1 (cm⁻¹) | 2.81 | 28
R2 (bohr²) | 4.00 | …
U (eV) | 0.06 | 0.1
U0 (eV) | 0.05 | 0.1
ZPVE (eV) | 0.0039 | 0.0097
C. QM9 dataset
QM9 has recently emerged as a molecular dataset of significant interest. QM9 contains ∼134 000 organic molecules with up to nine heavy atoms (C, O, N, and F), drawn from the GDB-17 universe of molecules. Each molecule is described by a SMILES string, which is converted to the molecular graph. The molecular properties were then calculated with DFT at the B3LYP/6-31G(2df,p) level of theory, yielding the spatial configuration of each atom along with thirteen molecular properties:
U0: atomization energy at 0 K (eV),
U: atomization energy at room temperature (eV),
H: enthalpy of atomization at room temperature (eV),
G: free energy of atomization (eV),
ω1: highest fundamental vibrational frequency (cm⁻¹),
ZPVE: zero point vibrational energy (eV),
HOMO: highest occupied molecular orbital, energy of the highest occupied electronic state (eV),
LUMO: lowest unoccupied molecular orbital, energy of the lowest unoccupied electronic state (eV),
GAP: difference between HOMO and LUMO (eV),
R2: electronic spatial extent (bohr²),
μ: norm of the dipole moment (D),
α: norm of the static polarizability (bohr³),
Cv: heat capacity at room temperature [cal/(mol K)].
We performed two experiments on the QM9 dataset, with the goal of providing a benchmark of CCN as a graph learning framework and demonstrating that our framework can predict molecular properties to the same level as DFT. In both cases, we trained our system on each of the thirteen target properties of QM9 independently and report the MAE for each.
1. QM9(a)
We use the QM9 dataset to benchmark the CCN architecture against the Weisfeiler–Lehman graph kernel, neural graph fingerprints, and PSCN. For this test, we consider only heavy atoms and exclude hydrogen. The CCN architecture is as described above, and settings for NGF and PSCN are as described for HCEP.
2. QM9(b)
To compare against the DFT error, we performed a test on the QM9 dataset with each molecule including its hydrogen atoms. We used both physical atomic information (vertex features) and bond information (edge features), including the atom type, atomic number, acceptor, donor, aromatic, hybridization, number of hydrogens, Euclidean distance, and Coulomb distance between pairs of atoms. All the information is encoded in a vectorized format. Our physical features were taken directly from the dataset used in Ref. 25 without any special feature engineering.
To include the edge features into our model along with the vertex features, we used the concept of line graphs from graph theory. We constructed the line graph for each molecular graph in such a way that an edge of the molecular graph corresponds to a vertex in its line graph, and if two edges in the molecular graph share a common vertex, then there is an edge between the two corresponding vertices in the line graph. (See Fig. 6.) The edge features become vertex features in the line graph. The inputs of our model contain both the molecular graph and its line graph. The feature vectors Fℓ between the two graphs are merged at each level ℓ.
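The line graph construction is standard; for illustration, a sketch using networkx (not the paper's GraphFlow code) on the ethylene graph of Fig. 6:

```python
import networkx as nx

# Ethylene (C2H4): carbons 0 and 1, hydrogens 2-5 (cf. Fig. 6).
G = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 4), (1, 5)])

# Each vertex of the line graph L is an edge of G; two vertices of L are
# adjacent iff the corresponding edges of G share an endpoint.
L = nx.line_graph(G)
print(L.number_of_nodes(), L.number_of_edges())  # 5 6: five bonds, six bond pairs meeting at a carbon
```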
Molecular graph of C2H4 (left) and its corresponding line graph (right). The vertices of the line graph correspond to edges of the molecular graph; two vertices of the line graph are connected by an edge if their corresponding edges on the molecular graph share a vertex.
We present results for both the CCN 1D and CCN 2D architectures. For CCN 1D, our network is seven layers deep with 64 input channels; the number of channels is halved at each of the first and second layers, and beyond that each layer has 16 channels. For the CCN 2D architecture, we used three layers, with 32 channels at the input and 16 at the remaining layers. We report the mean absolute error for each learning target in its corresponding physical unit and compare it against the DFT error given in Ref. 25.
D. Discussion
Overall, our CCN outperforms the other algorithms on a significant fraction of the experiments we implemented. On the subsampled HCEP dataset, CCN outperforms all other methods by a very large margin. For the graph kernel datasets, the SVM with the Weisfeiler–Lehman kernels achieves the highest accuracy on NCI1 and NCI109, while CCN wins on MUTAG and PTC. Perhaps the weaker performance on NCI1 and NCI109 is to be expected, since the datasets are small and neural networks usually require tens of thousands of training examples to be effective. Indeed, neural graph fingerprints and PSCN also perform poorly compared to the Weisfeiler–Lehman kernels. In the QM9(a) experiment, CCN obtains better results than the three other graph learning algorithms on all 13 learning targets.
In the QM9(b) experiment, the error of CCN is smaller than that of DFT itself on 11 of 12 learning targets (Ref. 25 does not report a DFT error for R2). However, other recent studies14,25 have obtained even stronger results. Looking at our results, we find that quantities that depend strongly on position information, such as the dipole moment and the average electronic spatial extent, are predicted poorly when we include physical features. By contrast, properties that are not expected to depend strongly on spatial extent are predicted significantly better. This suggests that our spatial input features were not fully exploited and that better engineering of the position information could significantly enhance the power of our CCN.
Our custom deep learning library64 enabled all the above results to be obtained reasonably efficiently. The prediction time for CCN 1D and CCN 2D on QM9(b) comes out to 6.0 ms/molecule and 7.2 ms/molecule, respectively, making it possible to search through a million candidate molecules in less than 2 h.
VI. CONCLUSIONS
In this paper, we presented a general framework called covariant compositional networks (CCNs) for learning the properties of molecules from their graphs. Central to this framework are two key ideas: (1) a compositional structure that generalizes message passing neural networks (MPNNs) and (2) the concept of covariant aggregation functions based on tensor algebra.
We argued that CCNs can extract multiscale structures from molecular graphs and keep track of the local topology in a manner that MPNNs are not able to. We also introduced the GraphFlow software library, which provides an efficient implementation of CCNs. Using GraphFlow, we were able to show that CCNs often outperform existing state-of-the-art algorithms in learning molecular properties.
ACKNOWLEDGMENTS
The authors would like to thank the Institute for Pure and Applied Mathematics and the participants of its “Understanding Many Particle Systems with Machine Learning” program for inspiring this research. Our work was supported by DARPA Young Faculty Award No. D16AP00112 and used computing resources provided by the University of Chicago Research Computing Center.