Density functional theory (DFT) is the most successful and widely used approach for computing the electronic structure of matter. However, for tasks involving large sets of candidate molecules, running DFT separately for every possible compound of interest is prohibitively expensive. In this paper, we propose a neural network based machine learning algorithm which, assuming a sufficiently large training sample of actual DFT results, can instead *learn* to predict certain properties of molecules purely from their molecular graphs. Our algorithm is based on the recently proposed covariant compositional networks framework and involves tensor reduction operations that are covariant with respect to permutations of the atoms. This new approach avoids some of the representational limitations of other neural networks that are popular in learning from molecular graphs and yields promising results in numerical experiments on the Harvard Clean Energy Project and QM9 molecular datasets.

## I. INTRODUCTION

Density functional theory^{1} (DFT) is the workhorse of modern quantum chemistry due to its ability to calculate many properties of molecular systems with high accuracy. However, this accuracy comes at a significant computational cost, generally making DFT too costly for applications such as chemical search and drug screening, which may involve hundreds of thousands of compounds. Methods that help overcome this limitation could lead to rapid developments in biology, medicine, and materials engineering.

Recent advances in machine learning, specifically, deep learning,^{2} combined with the appearance of large datasets of molecular properties obtained both experimentally and theoretically,^{3–5} present an opportunity to *learn* to predict the properties of compounds from their chemical structure rather than computing them explicitly with DFT. A machine learning model could allow for rapid and accurate exploration of huge molecular spaces to find suitable candidates for a desired molecule.

Central to any machine learning technique is the choice of a suitable set of descriptors or *features* used to parametrize and describe the input data. A poor choice of features will limit the expressiveness of the learning architecture and make accurate predictions impossible. On the other hand, providing too many features may make training difficult, especially when training data is limited. Hence, there has been a significant amount of work on designing good features for molecular systems. Predictions of energetics based on molecular geometry have been explored extensively, using a variety of parametrizations.^{6,7} This includes bond types and/or angles,^{8} radial distance distributions,^{9} the Coulomb matrix and related structures,^{10–12} the Smooth Overlap of Atomic Positions (SOAP),^{13,14} permutation-invariant distances,^{15} symmetry functions for atomic positions,^{16,17} moment tensor potentials,^{18} and scattering networks.^{19}

Recently, the problem of learning from the structure of chemical bonds alone, i.e., the molecular graph, has attracted a lot of interest, especially in the light of the appearance of a series of neural network architectures designed specifically for learning from graphs.^{20–26} Much of the success of these architectures stems from their ability to pick up on structures in the graph at multiple different scales, while satisfying the crucial requirement that the output be invariant to permutations of the vertices (which, in molecular learning, correspond to atoms). However, as we will show, the specific way that most of these architectures ensure permutation invariance still limits their representational power.

In this paper, we propose a new type of neural network for learning from molecular graphs, based on the idea of *covariant compositional networks* (CCNs), recently introduced in Ref. 27. Importantly, CCNs are based on explicitly decomposing compound objects (in our case, molecular graphs) into a hierarchy of subparts (subgraphs), offering a versatile framework that is ideally suited to capture the multiscale nature of molecular structures from functional groups through local structures to the overall shape. In a certain sense, the resulting models are the neural network analog of coarse graining. In addition, CCNs offer a more nuanced take on permutation invariance than other graph learning algorithms; while the overall output of a CCN is still invariant to the permutation of identical atoms, internally, the activations of the network are *covariant* rather than invariant, allowing us to better preserve information about the relationships between atoms. We demonstrate the success of this architecture through experiments on benchmark datasets, including QM9^{3} and the Harvard Clean Energy Project.^{4}

## II. LEARNING FROM MOLECULAR GRAPHS

This paper addresses the problem of predicting chemical properties directly from each compound’s molecular graph, which we will denote as $G$ (Fig. 1, left). As such, it is related to the sizable literature on learning from graphs. In the kernel machines domain, this includes algorithms based on random walks,^{28–30} counting subgraphs,^{31} spectral ideas,^{32} label propagation schemes with hashing,^{33,34} and even algebraic ideas.^{35}

Recently, a sequence of neural network based approaches has also appeared, starting with Ref. 36. Some of the proposed graph learning architectures^{21,22,37} directly seek inspiration from the type of classical convolutional neural networks (CNNs) that are used for image recognition.^{38,39} These methods involve moving a filter across vertices to create feature representations of each vertex based on its local neighborhoods. Other notable studies on graph neural networks include Refs. 24 and 40–42. Very recently, Ref. 25 showed that many of these approaches can be seen as message passing schemes and coined the term *message passing neural networks* (MPNNs) to refer to them collectively.

Regardless of the specifics, the two major issues that graph learning methods need to grapple with are invariance to permutations and capturing structures at multiple different scales. Let $A$ denote the adjacency matrix of $G$, and suppose that we change the numbering of the vertices by applying a permutation $\sigma$. The adjacency matrix will then change to $A'$, with

$$A'_{\sigma(a),\sigma(b)} = A_{a,b}, \qquad \text{i.e.,} \qquad A' = P_\sigma A P_\sigma^\top,$$

where $P_\sigma$ is the permutation matrix with $[P_\sigma]_{a,b} = 1$ if $\sigma(b) = a$ and $0$ otherwise.

However, topologically, *A* and *A*′ still represent the same graph. Permutation invariance means that the representation $\varphi (G)$ learned or implied by our graph learning algorithm must be invariant to these transformations. Naturally, in the case of molecules, invariance is restricted to permutations that map each atom to another atom of the same type.
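As a small illustrative sketch (not part of the paper's pipeline; the example graph is arbitrary), the effect of relabeling on the adjacency matrix can be checked numerically:

```python
# Sketch: relabeling the vertices of a graph by a permutation sigma changes the
# adjacency matrix to A' = P_sigma A P_sigma^T, but graph invariants such as
# the degree sequence are unaffected.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]])

sigma = [2, 0, 3, 1]                 # sigma maps vertex j to position sigma[j]
P = np.zeros((4, 4), dtype=int)
P[sigma, range(4)] = 1               # [P_sigma]_{sigma(j), j} = 1

A_prime = P @ A @ P.T                # adjacency matrix after relabeling

# Entry (sigma(a), sigma(b)) of A' equals entry (a, b) of A:
assert A_prime[sigma[0], sigma[1]] == A[0, 1]
# The sorted degree sequence is a permutation invariant:
assert sorted(A.sum(axis=0)) == sorted(A_prime.sum(axis=0))
```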

The multiscale property is equally crucial for learning molecular properties. For example, in the case of a protein, an ideal graph learning algorithm would represent $G$ in a manner that simultaneously captures structures at the level of individual atoms, functional groups, interactions between functional groups, subunits of the protein, and the protein’s overall shape.

### A. Compositional networks

The idea of representing complex objects in terms of a hierarchy of parts has a long history in machine learning.^{43–48} We have recently introduced a general framework for encoding such part-based models in a special type of neural network, called *covariant compositional networks*.^{27} In this paper, we show how the CCN formalism can be specialized to learning from graphs, specifically, the graphs of molecules. Our starting point is the following definition.

*Definition 1.*

*Let $G$ be the graph of a molecule made up of $n$ atoms $e_1, \dots, e_n$. The* **compositional neural network (comp-net)** *corresponding to $G$ is a directed acyclic graph (DAG) $N$ in which each node (neuron) $n_i$ is associated with a subgraph $P_i$ of $G$ and carries an activation $f_i$. Moreover,*

1. *If $n_i$ is a leaf node, then $P_i$ is just a single vertex of $G$, i.e., an atom $e_{\xi(i)}$, and the activation $f_i$ is some initial label $l_i$. In the simplest case, $f_i$ is a "one-hot" vector that identifies what type of atom resides at the given vertex.*
2. *$N$ has a unique root node $n_r$ for which $P_r = G$, and the corresponding $f_r$ represents the entire molecule.*
3. *For any two nodes $n_i$ and $n_j$, if $n_i$ is a descendant of $n_j$, then $P_i$ is a subgraph of $P_j$, and*
$$f_i = \Phi_i(\,f_{c_1}, f_{c_2}, \dots, f_{c_k}),$$
*where $f_{c_1}, \dots, f_{c_k}$ denote the activations of the children of $n_i$. Here, $\Phi_i$ is called the aggregation function of node $i$.*

Note that we now have two separate graphs: the molecular graph $G$ and a corresponding comp-net $N$ constructed according to Definition 1. One of the fundamental ideas of this paper is that $N$ can be interpreted as a neural network, in which each node $n_i$ is a "neuron" that receives inputs $(f_{c_1}, f_{c_2}, \dots, f_{c_k})$ and outputs the activation $f_i = \Phi_i(f_{c_1}, f_{c_2}, \dots, f_{c_k})$ (Fig. 1, right). For now, we treat the activations as vectors but will soon generalize them to be tensors.

There is some freedom in how the system of subgraphs $P_i$ is defined, but the default choice is $P_i = P_i^\ell$, where $P_i^\ell$ is the subgraph of vertices within a radius of $\ell$ of vertex $i$. In this case, $P_i^0 = \{i\}$, $P_i^1$ consists of the immediate neighbors of $i$, plus $i$ itself, and so on. The aggregation function is discussed in detail in Sec. IV.
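These default receptive fields can be sketched in a few lines; the function name and the example path graph below are illustrative, not from the paper:

```python
# Sketch: the default receptive field P_i^l is the set of vertices within
# graph distance l of vertex i, obtained by repeatedly expanding one-hop
# neighborhoods.
import numpy as np

def receptive_field(A, i, l):
    """Return the set of vertices within distance l of vertex i."""
    field = {i}
    for _ in range(l):
        field |= {v for u in field for v in np.flatnonzero(A[u])}
    return field

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])          # a path graph 0-1-2-3

assert receptive_field(A, 0, 0) == {0}        # P_0^0 = {0}
assert receptive_field(A, 0, 1) == {0, 1}     # neighbors of 0, plus 0 itself
assert receptive_field(A, 0, 2) == {0, 1, 2}
```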

Conceptually, comp-nets are closely related to convolutional neural networks (CNNs), which are the mainstay of neural networks in computer vision. In particular,

1. Each neuron in a CNN only aggregates information from a small set of neurons from the previous layer, similar to how each node of a comp-net only aggregates information from its children.
2. The so-called *effective receptive fields* of the neurons in a CNN, i.e., the image patches for which each neuron is responsible, form a hierarchy of nested sets similar to the hierarchy of $P_i$ subgraphs.

By analogy, we will refer to $P_i$ as the receptive field of neuron $i$, dropping the "effective" qualifier for brevity.

As mentioned above, an alternative popular framework for learning from graphs is message passing neural networks (MPNNs).^{25} An MPNN operates in rounds: in each round *ℓ* = 1, …, *L*, every vertex collects the labels of its immediate neighbors, applies a nonlinear function Ψ, and updates its own label accordingly. From the neural networks point of view, the rounds correspond to layers, and the labels correspond to the $f_i^\ell$ activations (Algorithm 1). More broadly, the classic Weisfeiler–Lehman test of isomorphism follows the same logic^{49–51} and so does the related Weisfeiler–Lehman kernel, arguably the most successful kernel-based approach to graph learning.^{33}

In the above mentioned base case, when $\{P_i^\ell\}$ is the collection of all subgraphs of $G$ of radii $\ell = 0, 1, 2, \dots$, a comp-net can also be thought of as a message passing algorithm: the messages received by vertex $i$ in round $\ell$ are the activations $\{f_{u_j}^{\ell-1}\}_{u_j \in B(i,1)}$, where $B(i, 1)$ is the ball of radius one centered at $i$ (note that this contains not just the neighbors of $i$ but also $i$ itself). Conversely, MPNNs can be seen as comp-nets, where $P_i^\ell$ is the subgraph defined by the receptive field of vertex $i$ in round $\ell$. A common feature of MPNNs, however, is that the $\Psi$ aggregation functions that they employ are invariant to permuting their arguments. Most often, $\Psi$ just *sums* all incoming messages and then applies a nonlinearity. This certainly guarantees that the final output of the network, $\varphi(G) = f_r$, will be permutation invariant. However, in Sec. III, we argue that it comes at the price of a significant loss of representational power.

**Algorithm 1.** Message passing neural networks.

```
for each vertex i:
    f_i^0 ← l_i
for ℓ = 1 to L:
    for each vertex i:
        f_i^ℓ ← Ψ(f_{ne_1(i)}^{ℓ−1}, …, f_{ne_k(i)}^{ℓ−1})
φ(G) ≡ f_r ← Ψ_r(f_1^L, …, f_n^L)
```

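A minimal NumPy sketch of this message passing scheme follows; the sum over the one-hop ball, the ReLU nonlinearity, and the random weight matrices are illustrative assumptions standing in for a trained model, and the final sum over vertices makes the output permutation invariant:

```python
# Sketch of Algorithm 1: sum-based message passing with a permutation
# invariant read-out. Weights are random placeholders, not learned parameters.
import numpy as np

def mpnn(A, labels, Ws):
    """f_i^l = ReLU(sum of f_u^{l-1} W_l over the ball B(i,1)); phi(G) = sum_i f_i^L."""
    B = A + np.eye(A.shape[0])           # B(i,1) contains i's neighbors and i itself
    f = labels.astype(float)
    for W in Ws:
        f = np.maximum(B @ f @ W, 0.0)   # aggregate messages, mix channels, ReLU
    return f.sum(axis=0)                 # summing over vertices gives invariance

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
labels = np.eye(3)                       # one-hot vertex labels
Ws = [rng.standard_normal((3, 3)) for _ in range(2)]
phi = mpnn(A, labels, Ws)

# Relabeling the vertices leaves the output unchanged:
P = np.zeros((3, 3)); P[[2, 0, 1], range(3)] = 1
assert np.allclose(mpnn(P @ A @ P.T, P @ labels, Ws), phi)
```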

## III. COVARIANT COMPOSITIONAL NETWORKS

One of the messages of the present paper is that *invariant* aggregation functions, of the type popularized by message passing neural networks, are *not* the most general way to build compositional models for compound objects, such as graphs. To understand why this is the case, once again, an analogy with image recognition is helpful. Classical CNNs face two types of basic image transformations: translations and rotations. With respect to translations (barring pooling, edge effects, and other complications), CNNs behave in a quasi-invariant way, in the sense that if the input image is translated by an integer amount $(t_x, t_y)$, the activations in each layer $\ell = 1, 2, \dots, L$ translate the same way: the activation of neuron $n^\ell_{i,j}$ is simply transferred to neuron $n^\ell_{i+t_x,\,j+t_y}$, i.e., $f'^{\,\ell}_{i+t_x,\,j+t_y} = f^\ell_{i,j}$. This is the simplest manifestation of a well studied property of CNNs called *equivariance*.^{52,53}

With respect to rotations, however, the situation is more complicated: if we rotate the input image by, e.g., 90°, not only will the part of the image that fell in the receptive field of neuron $n^\ell_{i,j}$ move to the receptive field of a different neuron $n^\ell_{-j,i}$ but the orientation of the receptive field will also change. For example, features which were previously picked up by horizontal filters will now be picked up by vertical filters. Therefore, in general, $f'^{\,\ell}_{-j,i} \neq f^\ell_{i,j}$ (Fig. 2).

It can be shown that one cannot construct a CNN for images that behaves in a quasi-invariant way with respect to both translations and rotations, unless every filter is directionless. It is, however, possible to construct a CNN in which the activations transform in a *predictable and reversible* way, $f'^{\,\ell}_{-j,i} = R(f^\ell_{i,j})$, for some fixed function $R$. This phenomenon is called *steerability* and has a significant literature in both classical signal processing^{54–58} and the neural networks field.^{59}

The situation in compositional networks is similar. The comp-net and message passing architectures that we have examined so far, by virtue of the aggregation function being symmetric in its arguments, are all *quasi-invariant* with respect to permutations, in the sense that if $N$ and $N'$ are two comp-nets for the same graph differing only in a reordering $\sigma$ of the vertices of the underlying graph $G$, and $n_i$ is a neuron in $N$ while $n'_j$ is the corresponding neuron in $N'$, then $f_i = f'_j$ for any permutation $\sigma \in S_n$.

Quasi-invariance amounts to asserting that the activation $f_i$ at any given node must only depend on $P_i = \{e_{j_1}, \dots, e_{j_k}\}$ as a *set* and not on the internal ordering of the atoms $e_{j_1}, \dots, e_{j_k}$ making up the receptive field. At first sight, this seems desirable, since it is exactly what we expect from the overall representation $\varphi(G)$. On closer examination, however, we realize that this property is potentially problematic, since it means that $n_i$ loses all information about which vertex in its receptive field has contributed what to the aggregate information $f_i$. In the CNN analogy, we can say that we have lost information about the *orientation* of the receptive field. In particular, if higher up in the network $f_i$ is combined with some other feature vector $f_j$ from a node with an overlapping receptive field, the aggregation process has no way of taking into account which parts of the information in $f_i$ and $f_j$ come from shared vertices and which parts do not (Fig. 3).

The solution is to regard the $Pi$ receptive fields as *ordered sets* and explicitly establish how *f*_{i} *co*-*varies* with the internal ordering of the receptive fields. To emphasize that henceforth the $Pi$ sets are ordered, we will use parentheses rather than braces to denote them.

*Definition 2.*

*Assume that $N$ is the comp-net of a graph $G$ and $N'$ is the comp-net of the same graph after its vertices have been permuted by some permutation $\sigma$. Given any $n_i \in N$ with receptive field $P_i = (e_{p_1}, \dots, e_{p_m})$, let $n'_j$ be the corresponding node in $N'$ with receptive field $P'_j = (e_{q_1}, \dots, e_{q_m})$. Assume that $\pi \in S_m$ is the permutation that aligns the orderings of the two receptive fields, i.e., for which $e_{q_{\pi(a)}} = e_{p_a}$. We say that the comp-nets are* **covariant to permutations** *if for any $\pi$, there is a corresponding function $R_\pi$ such that $f'_j = R_\pi(f_i)$.*

To make the form of covariance prescribed by this definition more specific, we make the assumption that the maps $f \mapsto R_\pi(f)$, $\pi \in S_m$, are *linear*. This allows us to treat them as matrices, $\{R_\pi\}_{\pi \in S_m}$. Furthermore, linearity also implies that $\{R_\pi\}_{\pi \in S_m}$ form a *representation* of $S_m$ in the group theoretic sense of the word^{60} (this notion of representation should not be confused with the neural networks sense of the word, as in "$f_i$ is a representation of $P_i$").

However, there is one particular representation of $S_m$ that is likely familiar even to non-algebraists,^{61} the so-called *defining representation*, given by the $P_\pi \in \mathbb{R}^{m \times m}$ permutation matrices

$$[P_\pi]_{a,b} = \begin{cases} 1 & \text{if } \pi(b) = a,\\ 0 & \text{otherwise.} \end{cases}$$

If the $f_i$ activations in a given comp-net are dictated by this representation, then each $f_i$ must necessarily be a $|P_i|$ dimensional vector, and intuitively each component of $f_i$ carries information related to one specific atom in the receptive field or the interaction of that specific atom with all the others collectively. We call this case first order permutation covariance.
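The action of the defining representation on a first order activation can be illustrated with a small sketch (the activation values below are hypothetical):

```python
# Sketch of first order covariance: the permutation matrix P_pi acts on an
# activation f that has one entry per atom in the receptive field.
import numpy as np

pi = [1, 2, 0]                      # a permutation of a 3-atom receptive field
P = np.zeros((3, 3))
P[pi, range(3)] = 1                 # [P_pi]_{pi(b), b} = 1

f = np.array([0.5, -1.0, 2.0])     # hypothetical first order activation
f_new = P @ f                       # first order covariance: f -> P_pi f

# Entry pi(a) of the new activation is entry a of the old one:
assert f_new[pi[0]] == f[0]
```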

*Definition 3.*

*We say that $n_i$ is a* **first order covariant node** *in a comp-net if under the permutation of its receptive field $P_i$ by any $\pi \in S_{|P_i|}$, its activation transforms as $f_i \mapsto P_\pi f_i$.*

The natural generalization of $P_\pi$-covariant comp-nets is $P_\pi \otimes P_\pi$-covariant comp-nets, in which the $f_i$ feature vectors are now $|P_i|^2$ dimensional vectors that transform under permutations of the internal ordering by $\pi$ as $f_i \mapsto (P_\pi \otimes P_\pi)\, f_i$. If we reshape $f_i$ into a matrix $F_i \in \mathbb{R}^{|P_i| \times |P_i|}$, then the action of $P_\pi \otimes P_\pi$ on $f_i$ is equivalent to

$$F_i \mapsto P_\pi F_i P_\pi^\top.$$

In the following, we will prefer this more intuitive matrix view, since it makes it clear that feature vectors that transform this way express *relationships* between the different constituents of the receptive field. Note, in particular, that if we define $A{\downarrow}_{P_i}$ as the restriction of the adjacency matrix to $P_i$ (i.e., if $P_i = (e_{p_1}, \dots, e_{p_m})$, then $[A{\downarrow}_{P_i}]_{a,b} = A_{p_a,p_b}$), then $A{\downarrow}_{P_i}$ transforms exactly as $F_i$ does in the equation above.

*Definition 4.*

*We say that $n_i$ is a* **second order covariant node** *in a comp-net if under the permutation of its receptive field $P_i$ by any $\pi \in S_{|P_i|}$, its activation transforms as $F_i \mapsto P_\pi F_i P_\pi^\top$.*
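A quick numerical sketch (with an arbitrary activation matrix) confirms that a second order activation and the restricted adjacency matrix transform identically:

```python
# Sketch of second order covariance: F -> P_pi F P_pi^T, and the adjacency
# matrix restricted to the receptive field transforms the same way.
import numpy as np

pi = [2, 0, 1]
P = np.zeros((3, 3))
P[pi, range(3)] = 1                       # defining representation of pi

A_restricted = np.array([[0., 1., 0.],
                         [1., 0., 1.],
                         [0., 1., 0.]])   # adjacency restricted to a 3-atom field

F = np.arange(9.0).reshape(3, 3)          # hypothetical second order activation
F_new = P @ F @ P.T

# The (pi(a), pi(b)) entry of the transformed tensor is the old (a, b) entry:
assert F_new[pi[0], pi[1]] == F[0, 1]
# Relabeling the atoms conjugates the restricted adjacency the same way:
A_new = P @ A_restricted @ P.T
assert A_new[pi[0], pi[1]] == A_restricted[0, 1]
```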

Definitions 3 and 4 generalize naturally to *k*'th order nodes, in which the activations are *k*'th order tensors, transforming under permutations as $F_i \mapsto F'_i$, where

$$[F'_i]_{j_1,\dots,j_k} = [F_i]_{\pi^{-1}(j_1),\dots,\pi^{-1}(j_k)}. \qquad (1)$$

In general, we will call any quantity which transforms under permutations according to (1) a *k’th order P*-*tensor*. Saying that a given quantity is a *P*–tensor then not only means that it is representable by an *m* × *m* × ⋯ × *m* array but also that this array transforms in a specific way under permutations.

Since scalars, vectors, and matrices can be considered 0th, 1st, and 2nd order tensors, the following definition covers quasi-invariance, first order covariance, and second order covariance as special cases. To unify notation and terminology, in the following, we will always talk about *feature tensors* rather than *feature vectors* and denote the activations as *F*_{i} rather than *f*_{i}.

*Definition 5.*

*We say that $n_i$ is a* **k'th order covariant node** *in a comp-net if the corresponding activation $F_i$ is a k'th order P–tensor, i.e., it transforms under permutations of $P_i$ according to* (1)*, or the activation is a sequence of d separate P–tensors $F_i^{(1)}, \dots, F_i^{(d)}$ corresponding to d distinct channels.*

A **covariant compositional network (CCN)** is a comp-net in which each node’s activation is covariant to permutations in the above sense. Hence we can talk about first, second, third, etc., order CCNs (CCN 1D, 2D, …) (Fig. 4).

The real significance of covariance, both here and in classical CNNs, is that while it allows for a richer internal representation of the data than fully invariant architectures, the final output of the network can still easily be made invariant. In covariant comp-nets, this is achieved by collapsing the input tensors of the root node $nr$ at the top of the network into invariant scalars, for example, by computing their sums and traces (reducing them to zeroth order tensors) and outputting permutation invariant combinations of these scalars, such as their sum.

## IV. COVARIANT AGGREGATION FUNCTIONS

It remains to explain how to define the $\Phi$ aggregation functions so as to guarantee covariance. Specifically, we show how to construct $\Phi$ such that if the inputs $F_{c_1}, \dots, F_{c_k}$ of a given node $n_a$ at level $\ell$ are covariant $k$'th order $P$–tensors, then $F_a = \Phi(F_{c_1}, \dots, F_{c_k})$ will also be a $k$'th order $P$–tensor. The aggregation function that we define consists of five operations executed in sequence: promotion, stacking, reduction, mixing, and an elementwise nonlinear transform (see Fig. 5). Practically relevant CCNs tend to have multiple channels, so each $F_{c_i}$ is actually a sequence of $d$ separate $P$–tensors $F_{c_i}^{(1)}, \dots, F_{c_i}^{(d)}$. However, except for the mixing step, each channel behaves independently, so for simplicity, for now we drop the channel index.

### A. Promotion

Each child tensor $F_{c_i}$ captures information about a different receptive field $P_{c_i}$, so before combining them, we must "promote" each $F_{c_i}$ to a $|P_a| \times \cdots \times |P_a|$ tensor $\tilde{F}_{c_i}$, whose dimensions are indexed by the vertices of $P_a$ rather than the vertices in each $P_{c_i}$. Assuming that $P_{c_i} = (e_{q_1}, \dots, e_{q_{|P_{c_i}|}})$ and $P_a = (e_{p_1}, \dots, e_{p_{|P_a|}})$, this is done by defining a $|P_a| \times |P_{c_i}|$ indicator matrix

$$\chi_{a,b} = \begin{cases} 1 & \text{if } q_b = p_a,\\ 0 & \text{otherwise,} \end{cases}$$

and setting

$$[\tilde{F}_{c_i}]_{j_1,\dots,j_k} = \chi_{j_1, j'_1} \chi_{j_2, j'_2} \cdots \chi_{j_k, j'_k} [F_{c_i}]_{j'_1,\dots,j'_k},$$

where, once again, Einstein summation is in use, so summation over $j'_1, \dots, j'_k$ is implied. Effectively, the promotion step aligns all the child tensors by permuting the indices of $F_{c_i}$ to conform to the ordering of the atoms in $P_a$ and padding with zeros where necessary.
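For second order tensors, the promotion step amounts to multiplying by the indicator matrix on both sides; a sketch with hypothetical field orderings and activation values:

```python
# Sketch of promotion for a second order child tensor: the indicator matrix
# chi (chi[a, b] = 1 if the b'th atom of the child field equals the a'th atom
# of the parent field) aligns and zero-pads it to the parent's ordering.
import numpy as np

parent = [0, 1, 2, 3]               # atoms in P_a
child = [1, 3]                      # atoms in P_{c_i}, a subset of the parent

chi = np.array([[1.0 if p == q else 0.0 for q in child] for p in parent])

F_child = np.array([[1., 2.],
                    [3., 4.]])      # second order child activation

F_promoted = chi @ F_child @ chi.T  # |P_a| x |P_a|, zero outside the child field

# Child entry (0, 1), i.e., the pair of atoms (1, 3), lands at parent (1, 3):
assert F_promoted[1, 3] == F_child[0, 1]
assert F_promoted[0, 0] == 0.0      # atom 0 is not in the child field
```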

### B. Stacking

Now that the promoted tensors $\tilde{F}_{c_1}, \dots, \tilde{F}_{c_s}$ all have the same shape, they can be stacked to form a $|P_a| \times \cdots \times |P_a|$ dimensional $(k+1)$'th order tensor $T$, with

$$[T]_{j, i_1, \dots, i_k} = [\tilde{F}_{c_j}]_{i_1, \dots, i_k}.$$

It is easy to see that *T* itself transforms as a *P*-tensor of order (*k* + 1).

We may also add additional information to $T$ by taking its tensor product with another $P$-tensor. In particular, to explicitly add information about the local topology, we may tensor multiply $T$ by the restriction of the adjacency matrix to $P_a$, $[A{\downarrow}_{P_a}]_{i,j} = A_{p_i, p_j}$. This gives an order $(k+3)$ tensor $S = T \otimes A{\downarrow}_{P_a}$. Otherwise, we just set $S = T$. Note that in most other graph neural networks, the topology of the underlying graph is only accounted for implicitly, by the pattern in which different activations are combined. Being able to add the local adjacency matrix explicitly greatly extends the representational power of CCNs.
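The stacking and tensor product steps can be sketched as follows; the child tensors, field size, and restricted adjacency matrix below are placeholder values:

```python
# Sketch of stacking and the optional tensor product: promoted second order
# (k = 2) child tensors are stacked into an order k+1 tensor T, and tensoring
# with the restricted adjacency matrix raises the order to k+3.
import numpy as np

m = 3                                              # |P_a|
children = [np.ones((m, m)), 2 * np.ones((m, m))]  # promoted child tensors

T = np.stack(children)                             # order k+1, shape (s, m, m)
A_res = np.array([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])                   # adjacency restricted to P_a

S = np.einsum('jab,cd->jabcd', T, A_res)           # S = T tensor A, order k+3

assert T.shape == (2, 3, 3)
assert S.shape == (2, 3, 3, 3, 3)
```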

### C. Reduction

Stacking and the optional tensor product increase the order of our tensors from $k$ to $k+1$ or $k+3$. We reduce this to $k$ again by a combination of generalized tensor contractions, more accurately called *tensor reductions*. For example, one way to drop the rank of $S$ from $k+3$ to $k$ is to sum out three of its indices,

$$[Q]_{\dots} = \sum_{i_{\alpha_1}} \sum_{i_{\alpha_2}} \sum_{i_{\alpha_3}} [S]_{\dots, i_{\alpha_1}, \dots, i_{\alpha_2}, \dots, i_{\alpha_3}, \dots}.$$

Note that while this notation does not make it explicit, $i_{\alpha_1}, i_{\alpha_2}$, and $i_{\alpha_3}$ must be removed from amongst the indices of $Q$. Another way to drop the rank is to *contract* over three indices,

$$[Q]_{\dots} = \sum_{i_{\alpha_1}, i_{\alpha_2}, i_{\alpha_3}} \delta_{i_{\alpha_1}, i_{\alpha_2}, i_{\alpha_3}} [S]_{\dots, i_{\alpha_1}, \dots, i_{\alpha_2}, \dots, i_{\alpha_3}, \dots},$$

where $\delta_{i_{\alpha_1}, i_{\alpha_2}, i_{\alpha_3}}$ is the generalized Kronecker symbol (equal to one when $i_{\alpha_1} = i_{\alpha_2} = i_{\alpha_3}$ and zero otherwise). A third way is to contract over two indices and sum over one index and so on. The crucial fact is that the result of each of these tensor operations is still a covariant *P*–tensor.

In general, the number of different ways that an order $k + q$ tensor $S$ can be reduced to order $k$ depends on both $q$ and $k$. For example, when $k = 2$ and $q = 3$, there are 50 different possible tensor reductions (excluding diagonals). By contrast, when $q = 1$ (i.e., we are not multiplying by the adjacency matrix), we only have $k + 1$ possibilities, corresponding to summing $S$ over each of its dimensions. No matter which case we are in, however, and no matter how many contractions $Q_1, \dots, Q_r$ our network actually computes (in our experiments using second order nodes, we compute 18 different ones), what is important is that the resulting order $k$ tensors satisfy the *P*–tensor property.
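Two of these reductions can be written as one-line `einsum` calls; the tensor below is an arbitrary example, not an actual network activation:

```python
# Sketch of tensor reductions: dropping an order k+3 = 5 tensor S back to
# order k = 2, either by summing out three indices or by contracting them
# together (summing over the joint diagonal, i.e., the Kronecker delta).
import numpy as np

m = 3
S = np.arange(float(m ** 5)).reshape(m, m, m, m, m)   # an order 5 tensor

Q_sum = np.einsum('abcde->ab', S)        # sum out the last three indices
Q_contract = np.einsum('abccc->ab', S)   # contract indices 3, 4, 5 together

assert Q_sum.shape == (m, m)
assert Q_contract.shape == (m, m)
# The contraction keeps only entries with equal values in the last 3 slots:
assert Q_contract[0, 0] == S[0, 0, 0, 0, 0] + S[0, 0, 1, 1, 1] + S[0, 0, 2, 2, 2]
```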

### D. Mixing with learnable weights

The reduction step can potentially produce quite a large number of order $k$ tensors, $Q_1, \dots, Q_r$. We reduce this number by linearly *mixing* $Q_1, \dots, Q_r$, i.e., taking $d' < r$ linear combinations of the form

$$\tilde{Q}^{(j)} = \sum_{j'=1}^{r} w_{j,j'} \, Q^{(j')}, \qquad j = 1, \dots, d'. \qquad (2)$$

This is again a covariant operation, and the mixing weights are the actual learnable parameters of our neural network.

It is natural to interpret $\tilde{Q}^{(1)}, \dots, \tilde{Q}^{(d')}$ as $d'$ separate channels, in neural network terminology. Recall that we also allow the input tensors to have multiple channels, but up to now the corresponding index has been suppressed. The mixing step is the point where these channels are all allowed to interact, so (2) becomes

$$\tilde{Q}^{(c)} = \sum_{c'} \sum_{j} w^{(\ell)}_{c,c',j} \, Q^{(c',j)},$$

and the learnable weights of the network at each level form a third order tensor $W_\ell = \big(w^{(\ell)}_{c,c',j}\big)_{c,c',j}$. The channel index $c$ is not to be confused with $c_1, \dots, c_k$ denoting the children of node $a$. Equation (2) is the main mechanism whereby CCNs are able to learn increasingly complex topological features as we move up the network.
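A single-channel version of the mixing step can be sketched as follows; the dimensions match the paper's experimental setup (18 reductions, 10 output channels), but the random values are placeholders for learned weights:

```python
# Sketch of the mixing step: r reduced tensors Q_1..Q_r are combined into d'
# output channels by learnable weights. Each linear combination of P-tensors
# is again a P-tensor, so the operation is covariant.
import numpy as np

rng = np.random.default_rng(0)

m, r, d_out = 3, 18, 10
Q = rng.standard_normal((r, m, m))        # r reduced second order P-tensors
W = rng.standard_normal((d_out, r))       # mixing weights (learned in practice)

Q_mixed = np.einsum('jr,rab->jab', W, Q)  # d' linear combinations of the Q's

assert Q_mixed.shape == (d_out, m, m)
assert np.allclose(Q_mixed[0], sum(W[0, k] * Q[k] for k in range(r)))
```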

### E. Nonlinearity

Finally, to get the actual activation of our neuron $n_a$, we add an additive bias $b_{\ell,c}$ and apply an elementwise nonlinearity $\sigma$ (specifically, the ReLU operator $\sigma(x) = \max\{0, x\}$), as is standard in neural networks. Thus, ultimately, the output of the aggregation function is the collection of $P$–covariant tensors

$$F_a^{(c)} = \sigma\big(\tilde{Q}^{(c)} + b_{\ell,c}\big),$$

with $c \in \{1, 2, \dots, d'\}$. As usual in neural networks, the $W_\ell$ weight tensors are learned by backpropagation and some form of stochastic gradient descent.

## V. EXPERIMENTS

We tested our CCN framework on three different types of datasets that involve learning the properties of molecules from their structure: (a) four relatively small datasets of small molecules that are standard benchmarks in the kernel literature (MUTAG, PTC, NCI1 and NCI109), (b) the Harvard Clean Energy Project,^{4} and (c) QM9.^{3} The first two types of datasets are pure graph classification/regression tasks. QM9 also has physical (spatial) features, which go somewhat beyond the scope of the present paper. Therefore, on QM9, we conducted separate experiments with and without these physical features [QM9(b) vs. QM9(a)].

In each case, we used second order CCNs (CCN 2D) and included the tensor product with the restricted adjacency matrix, as described in Sec. IV B. However, for computational reasons, we only used 18 of the 50 possible contractions. The base features of each vertex were initialized with computed histogram alignment kernel features^{63} of depth up to 10: each vertex receives a base label $l_i = \mathrm{concat}_{j=1}^{10} H_j(i)$, where $H_j(i) \in \mathbb{R}^d$ (with $d$ being the total number of distinct discrete node labels) is the vector of relative frequencies of each label for the set of vertices at a distance equal to $j$ from vertex $i$. The network was chosen to be three levels deep, with the number of output channels at each level fixed to 10.

To run our experiments, we used our own custom-built neural network library called GraphFlow.^{64} Writing our own deep learning software became necessary because at the time we started work on the experiments, none of the standard frameworks such as TensorFlow,^{65} PyTorch,^{66} or MXNet^{67} had efficient support for the type of tensor operations required by CCN. GraphFlow is a fast, general purpose deep learning library that offers automatic differentiation, dynamic computation graphs, multithreading, and Graphics Processing Unit (GPU) support for tensor contractions. In addition to CCN, it also implements other graph neural networks, including neural graph fingerprints,^{21} PSCN,^{37} and gated graph neural networks.^{40} We also provide a reference implementation of CCN1D and CCN2D in PyTorch.^{68}

In each experiment, we used 80% of the dataset for training, 10% for validation, and 10% for testing. For the kernel datasets, we performed the experiments on 10 separate training/validation/test stratified splits and averaged the resulting classification accuracies. Our training technique used mini-batch stochastic gradient descent with the Adam optimization method^{69} and a batch size of 64. The initial learning rate was set to 0.001 and decayed linearly after each step toward a minimum of 10^{−6}.

### A. Graph kernel datasets

Our first set of experiments involved four standard graph kernel datasets: (1) MUTAG, a dataset of 188 mutagenic aromatic and heteroaromatic compounds,^{70} (2) PTC, which consists of 344 chemical compounds that have been tested for positive or negative toxicity in lab rats,^{71} and (3) NCI1 and NCI109, which have 4110 and 4127 compounds, respectively, each screened for activity against small cell lung cancer and ovarian cancer lines.^{72} In each of these datasets, every molecule has a discrete label (e.g., toxic/non-toxic or aromatic/heteroaromatic), and the goal is to predict this label. We compare CCN 2D against the graph kernel results reported in Ref. 62 (C-SVM algorithm with the Weisfeiler–Lehman kernel,^{33} Weisfeiler–Lehman edge kernel,^{33} shortest paths graph kernel,^{29} graphlet kernel,^{31} and the multiscale Laplacian graph kernel^{62}), neural graph fingerprints^{21} (with up to 3 levels and a hidden size of 10), and the "patchy-SAN" convolutional algorithm (PSCN).^{37} The results are presented in Table I.

**Table I.** Classification accuracy (%) on the graph kernel datasets.

| Method | MUTAG | PTC | NCI1 | NCI109 |
|---|---|---|---|---|
| Weisfeiler–Lehman kernel^{33} | 84.50 ± 2.16 | 59.97 ± 1.60 | 84.76 ± 0.32 | 85.12 ± 0.29 |
| Weisfeiler–Lehman edge kernel^{33} | 82.94 ± 2.33 | 60.18 ± 2.19 | 84.65 ± 0.25 | 85.32 ± 0.34 |
| Shortest path kernel^{29} | 85.50 ± 2.50 | 59.53 ± 1.71 | 73.61 ± 0.36 | 73.23 ± 0.26 |
| Graphlets kernel^{31} | 82.44 ± 1.29 | 55.88 ± 0.31 | 62.40 ± 0.27 | 62.35 ± 0.28 |
| Random walk kernel^{28} | 80.33 ± 1.35 | 59.85 ± 0.95 | Timed out | Timed out |
| Multiscale Laplacian graph kernel^{62} | 87.94 ± 1.61 | 63.26 ± 1.48 | 81.75 ± 0.24 | 81.31 ± 0.22 |
| PSCN (k = 10)^{37} | 88.95 ± 4.37 | 62.29 ± 5.68 | 76.34 ± 1.68 | N/A |
| Neural graph fingerprints^{21} | 89.00 ± 7.00 | 57.85 ± 3.36 | 62.21 ± 4.72 | 56.11 ± 4.31 |
| Second order CCN (our method) | 91.64 ± 7.24 | 70.62 ± 7.04 | 76.27 ± 4.13 | 75.54 ± 3.36 |


### B. Harvard clean energy project

The Harvard Clean Energy Project (HCEP)^{4} dataset consists of 2.3 × 10^{6} organic compounds that are candidates for use in solar cells. The inputs are molecular graphs (derived from their SMILES strings), and the regression target is the power conversion efficiency (PCE). The experiments were run on a random sample of 50 000 molecules.

On this dataset, we compared CCN to the following algorithms: Lasso, ridge regression, random forests, gradient boosted trees (GBT), the optimal assignment Weisfeiler–Lehman graph kernel,^{33} neural graph fingerprints,^{21} and PSCN.^{37} For the first four of these baseline methods, we created simple feature vectors from each molecule: the number of bonds of each type (i.e., the number of H—H bonds, the number of C—O bonds, etc.) and the number of atoms of each type. Molecular graph fingerprints use the atom labels of each vertex as base features. For ridge regression and Lasso, we cross-validated over *λ*. For random forests and GBT, we used 400 trees and cross-validated over the maximum depth, the minimum samples per leaf, the minimum samples to split a node, and (for GBT) the learning rate. For neural graph fingerprints, we used up to 3 layers and a hidden layer size of 10. In PSCN, we used a patch size of 10 with two convolutional layers and a dense layer on top, as described in their paper (Tables II–IV).
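As an illustration, the count-based feature vectors used for the four non-graph baselines can be sketched as follows. This is a hypothetical helper, not the paper's actual preprocessing code; all names are our own:

```python
from collections import Counter

def baseline_features(atoms, bonds):
    """Count-based features of the kind used for the Lasso/ridge/
    random-forest/GBT baselines: the number of atoms of each element
    and the number of bonds of each (unordered) element-pair type.
    `atoms` is a list of element symbols; `bonds` is a list of (i, j)
    index pairs into `atoms`. (Illustrative sketch only.)"""
    atom_counts = Counter(atoms)
    # Sort each bond's endpoint labels so C-O and O-C count as the same type.
    bond_counts = Counter(tuple(sorted((atoms[i], atoms[j]))) for i, j in bonds)
    return atom_counts, bond_counts

# Heavy-atom skeleton of ethanol, C-C-O (hydrogens omitted for brevity):
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
atom_counts, bond_counts = baseline_features(atoms, bonds)
# atom_counts["C"] == 2; bond_counts[("C", "O")] == 1
```

The resulting counts would then be flattened into a fixed-length vector over a global vocabulary of atom and bond types before being fed to the regressor.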

**TABLE II.** Regression errors on the subsampled HCEP dataset (PCE prediction).

| Method | Test MAE | Test RMSE |
|---|---|---|
| Lasso | 0.867 | 1.437 |
| Ridge regression | 0.854 | 1.376 |
| Random forest | 1.004 | 1.799 |
| Gradient boosted trees | 0.704 | 1.005 |
| Weisfeiler–Lehman kernel^{33} | 0.805 | 1.096 |
| Neural graph fingerprints^{21} | 0.851 | 1.177 |
| PSCN (k = 10)^{37} | 0.718 | 0.973 |
| Second order CCN (our method) | 0.340 | 0.449 |

**TABLE III.** MAE on the thirteen QM9(a) targets (heavy atoms only).

| Target | WLGK | NGF | PSCN | CCN 2D |
|---|---|---|---|---|
| α (bohr^{3}) | 3.75 | 3.51 | 1.63 | 1.30 |
| C_{v} [cal/(mol K)] | 2.39 | 1.91 | 1.09 | 0.93 |
| G (eV) | 4.84 | 4.36 | 3.13 | 2.75 |
| GAP (eV) | 0.92 | 0.86 | 0.77 | 0.69 |
| H (eV) | 5.45 | 4.92 | 3.56 | 3.14 |
| HOMO (eV) | 0.38 | 0.34 | 0.30 | 0.23 |
| LUMO (eV) | 0.89 | 0.82 | 0.75 | 0.67 |
| μ (D) | 1.03 | 0.94 | 0.81 | 0.72 |
| ω_{1} (cm^{−1}) | 192.16 | 168.14 | 152.13 | 120.10 |
| R^{2} (bohr^{2}) | 154.25 | 137.43 | 61.70 | 53.28 |
| U (eV) | 5.41 | 4.89 | 3.54 | 3.02 |
| U_{0} (eV) | 5.36 | 4.85 | 3.50 | 2.99 |
| ZPVE (eV) | 0.51 | 0.45 | 0.38 | 0.35 |

**TABLE IV.** MAE of CCN on QM9(b) (including hydrogens) compared to the DFT error reported in Ref. 25.

| Target | CCN | DFT error |
|---|---|---|
| α (bohr^{3}) | 0.22 | 0.4 |
| C_{v} [cal/(mol K)] | 0.07 | 0.34 |
| G (eV) | 0.06 | 0.1 |
| GAP (eV) | 0.12 | 1.2 |
| H (eV) | 0.06 | 0.1 |
| HOMO (eV) | 0.09 | 2.0 |
| LUMO (eV) | 0.09 | 2.6 |
| μ (D) | 0.48 | 0.1 |
| ω_{1} (cm^{−1}) | 2.81 | 28 |
| R^{2} (bohr^{2}) | 4.00 | … |
| U (eV) | 0.06 | 0.1 |
| U_{0} (eV) | 0.05 | 0.1 |
| ZPVE (eV) | 0.0039 | 0.0097 |

### C. QM9 dataset

QM9 has recently emerged as a molecular dataset of significant interest. QM9 contains ∼134 k organic molecules with up to nine heavy atoms (C, O, N, and F, in addition to hydrogen), drawn from the GDB-17 universe of molecules. Each molecule is described by a SMILES string, which is converted to a molecular graph. The molecular properties are then calculated using DFT at the B3LYP/6-31G(2df,p) level of theory, yielding the spatial configuration of each atom along with thirteen molecular properties:

- *U*_{0}: atomization energy at 0 K (eV),
- *U*: atomization energy at room temperature (eV),
- *H*: enthalpy of atomization at room temperature (eV),
- *G*: free energy of atomization (eV),
- *ω*_{1}: highest fundamental vibrational frequency (cm^{−1}),
- ZPVE: zero point vibrational energy (eV),
- HOMO: energy of the highest occupied molecular orbital (eV),
- LUMO: energy of the lowest unoccupied molecular orbital (eV),
- GAP: difference between the HOMO and LUMO energies (eV),
- *R*^{2}: electronic spatial extent (bohr^{2}),
- *μ*: norm of the dipole moment (D),
- *α*: norm of the static polarizability (bohr^{3}),
- *C*_{v}: heat capacity at room temperature [cal/(mol K)].

We performed two experiments on the QM9 dataset, with the goal of providing a benchmark of CCN as a graph learning framework and demonstrating that our framework can predict molecular properties to the same level as DFT. In both cases, we trained our system on each of the thirteen target properties of QM9 independently and report the MAE for each.
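Concretely, the reported metric for each target is the mean absolute error over the test molecules; a minimal reference implementation:

```python
def mean_absolute_error(y_true, y_pred):
    """MAE = (1/n) * sum_i |y_i - yhat_i|, computed separately for each
    of the thirteen learning targets, in that target's physical unit."""
    assert len(y_true) == len(y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Example: predictions for one target on three molecules
# mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]) -> 0.5
```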

#### 1. QM9(a)

We use the QM9 dataset to benchmark the CCN architecture against the Weisfeiler–Lehman graph kernel, neural graph fingerprints, and PSCN. For this test, we consider only heavy atoms and exclude hydrogen. The CCN architecture is as described above, and settings for NGF and PSCN are as described for HCEP.

#### 2. QM9(b)

To compare against the DFT error, we performed a test on the QM9 dataset with each molecule including its hydrogen atoms. We used both physical atomic information (vertex features) and bond information (edge features), including the atom type, atomic number, acceptor/donor status, aromaticity, hybridization, number of hydrogens, and the Euclidean and Coulomb distances between pairs of atoms. All of this information is encoded in a vectorized format. Our physical features were taken directly from the dataset used in Ref. 25 without any special feature engineering.

To include the edge features into our model along with the vertex features, we used the concept of line graphs from graph theory. We constructed the line graph for each molecular graph in such a way that an edge of the molecular graph corresponds to a vertex in its line graph, and if two edges in the molecular graph share a common vertex, then there is an edge between the two corresponding vertices in the line graph. (See Fig. 6.) The edge features become vertex features in the line graph. The inputs of our model contain both the molecular graph and its line graph. The feature vectors *F*_{ℓ} between the two graphs are merged at each level *ℓ*.
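The line graph construction described above can be sketched in a few lines. This is an illustrative pure-Python version, not the GraphFlow implementation:

```python
from itertools import combinations

def line_graph(edges):
    """Build the line graph L(G) of an undirected graph given as a list
    of edges (i, j). Each edge of G becomes a vertex of L(G); two such
    vertices are adjacent iff the corresponding edges of G share an
    endpoint. (Illustrative sketch, not the GraphFlow code.)"""
    lg_edges = []
    for (a, e1), (b, e2) in combinations(enumerate(edges), 2):
        if set(e1) & set(e2):          # shared endpoint -> adjacent in L(G)
            lg_edges.append((a, b))
    return lg_edges

# Path graph 0-1-2 has two bonds that share atom 1, so its line graph
# is a single edge between the two bond-vertices:
edges = [(0, 1), (1, 2)]
# line_graph(edges) == [(0, 1)]
```

In the model, the edge features of the molecular graph become vertex features of this line graph, and both graphs are processed in parallel with their feature vectors merged at each level.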

We present results for both the CCN 1D and CCN 2D architectures. For CCN 1D, the network has seven layers with 64 input channels; the number of channels is halved at each of the first and second layers, and every layer beyond that has 16 channels. For CCN 2D, we used three layers, with 32 channels at the input and 16 at the remaining layers. We report the mean absolute error for each learning target in its corresponding physical unit and compare it against the DFT error given in Ref. 25.

### D. Discussion

Overall, our CCN outperforms the other algorithms on a significant fraction of the experiments we implemented. On the subsampled HCEP dataset, CCN outperforms all other methods by a very large margin. On the graph kernel datasets, the SVM with the Weisfeiler–Lehman kernels achieves the highest accuracy on NCI1 and NCI109, while CCN wins on MUTAG and PTC. CCN's weaker performance on the NCI datasets is perhaps to be expected, since these datasets are small by deep learning standards and neural networks usually require tens of thousands of training examples to be effective. Indeed, neural graph fingerprints and PSCN also perform poorly compared to the Weisfeiler–Lehman kernels. In the QM9(a) experiment, CCN obtains better results than the three other graph learning algorithms on all thirteen learning targets.

In the QM9(b) experiment, the error of CCN is smaller than that of DFT itself on 11 of 12 learning targets (Ref. 25 does not report DFT errors for *R*^{2}). However, other recent studies^{14,25} have obtained even stronger results. Looking at our results, we find that quantities that depend strongly on position information, such as the dipole moment and the electronic spatial extent, are predicted poorly even when we include physical features. By contrast, properties that are not expected to depend strongly on spatial extent are predicted significantly better. This suggests that our spatial input features were not fully exploited and that feature engineering of position information could significantly enhance the power of our CCN.

Our custom deep learning library^{64} enabled all the above results to be obtained reasonably efficiently. The prediction time for CCN 1D and CCN 2D on QM9(b) comes out to 6.0 ms/molecule and 7.2 ms/molecule, respectively, making it possible to search through a million candidate molecules in less than 2 h.
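As a back-of-the-envelope check of this throughput claim, using only the per-molecule timings quoted above:

```python
# Screening time for a hypothetical library of one million candidates
# at the measured CCN 2D prediction rate on QM9(b).
ms_per_molecule = 7.2        # CCN 2D; CCN 1D is 6.0 ms/molecule
n_candidates = 1_000_000
hours = n_candidates * ms_per_molecule / 1000.0 / 3600.0
# ~2.0 h for CCN 2D; the same calculation with 6.0 ms/molecule gives
# ~1.7 h, consistent with screening a million molecules in under 2 h.
```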

## VI. CONCLUSIONS

In this paper, we presented a general framework called covariant compositional networks (CCNs) for learning the properties of molecules from their graphs. Central to this framework are two key ideas: (1) a compositional structure that generalizes message passing neural networks (MPNNs) and (2) the concept of covariant aggregation functions based on tensor algebra.

We argue that CCNs can extract multiscale structure from molecular graphs and keep track of local topology in a manner that MPNNs cannot. We also introduced the GraphFlow software library, which provides an efficient implementation of CCNs. Using GraphFlow, we were able to show that CCNs often outperform existing state-of-the-art algorithms in learning molecular properties.

## ACKNOWLEDGMENTS

The authors would like to thank the Institute for Pure and Applied Mathematics and the participants of its “Understanding Many Particle Systems with Machine Learning” program for inspiring this research. Our work was supported by DARPA Young Faculty Award No. D16AP00112 and used computing resources provided by the University of Chicago Research Computing Center.