Molecular Hypergraph Neural Networks

Graph neural networks (GNNs) have demonstrated promising performance across various chemistry-related tasks. However, conventional graphs only model the pairwise connectivity in molecules and fail to adequately represent higher-order connections such as multi-center bonds and conjugated structures. To tackle this challenge, we introduce molecular hypergraphs and propose Molecular Hypergraph Neural Networks (MHNN) to predict the optoelectronic properties of organic semiconductors, where hyperedges represent conjugated structures. A general algorithm is designed for irregular high-order connections, which can efficiently operate on molecular hypergraphs with hyperedges of various orders. The results show that MHNN outperforms all baseline models on most tasks of the OPV, OCELOTv1, and PCQM4Mv2 datasets. Notably, MHNN achieves this without any 3D geometric information, surpassing the baseline model that utilizes atom positions. Moreover, MHNN achieves better performance than pretrained GNNs under limited training data, underscoring its excellent data efficiency. This work provides a new strategy for more general molecular representations and for property prediction tasks related to high-order connections.


I. INTRODUCTION
Graph representations of molecular structures, also called molecular graphs, find extensive application in computational chemistry and machine learning, where atoms serve as nodes and chemical bonds as edges. Graph neural networks (GNNs) are a class of deep learning models that can handle graph-structured data and are related to geometric deep learning [1][2][3][4][5]. Unlike traditional neural networks that operate on regular grids (e.g., images) or sequential data (e.g., text), GNNs can handle interconnected and non-Euclidean data, making them suitable for tasks involving graphs with complex topologies 4. This inherent advantage enables GNNs to directly learn the complex topological relationships of atoms and chemical bonds through molecular graphs 6. In recent years, GNNs have demonstrated excellent molecular representation capabilities and achieved promising performance on many chemistry-related tasks, such as molecular property prediction [6][7][8], drug design [9][10][11], interatomic potentials [12][13][14], spectroscopic analysis [15][16][17], and reaction prediction and retrosynthesis [18][19][20].
However, ordinary graphs are limited to modeling pairwise connectivity within molecular structures and fall short in effectively representing higher-order connections 11,21,22. A substantial number of molecules have delocalized bonds, such as multi-center bonds 23 and conjugated bonds 24. In contrast to classical chemical bonds localized between pairs of atoms, each delocalized bond involves three or more atoms 25. As illustrated in Figure 1a, two B atoms and one H atom share two electrons to form a 3-center-2-electron bond, which cannot be represented by a pairwise edge 26. Similarly, conjugated organic molecules like the porphyrin in Figure 1b possess long-range dispersed π electrons beyond the descriptive capability of conventional edges 24. Therefore, developing a more comprehensive graph representation of molecular structures becomes imperative to address this limitation inherent to conventional graphs.
A hypergraph is a generalization of the graph in which a hyperedge can join any number of nodes 27,28. Due to this innate ability to capture higher-order relationships, hypergraphs can powerfully model complex topological structures such as social networks 29, chemical reactions 30, and compound-protein interactions 11,31,32. Hypergraph neural networks (HGNs) are a category of neural networks designed to work with hypergraphs, extending the ideas of GNNs to handle hyperedges 28,31. Several studies 33,34 have employed HGNs in the field of chemistry, depicting atoms as hyperedges and bonds between two atoms as nodes. While these approaches improve the validity of molecule generation and enhance edge representation learning 33,34, they do not yet leverage hyperedges to articulate high-order connections within molecules. For diverse molecular structures, especially organometallic complexes and conjugated molecules, the hyperedges of hypergraphs are well suited to represent multi-atomic connections like delocalized bonds due to their inherent advantages 35,36.
Conjugated molecules, characterized by alternating single and multiple bonds along a molecular backbone, play a pivotal role in optoelectronic applications such as organic light-emitting diodes (OLEDs) and organic solar cells (OSCs) 37,38. Their distinctive advantage stems from the delocalized π electrons within conjugated structures, which facilitate charge transport and optical absorption, establishing them as indispensable components of organic semiconductors 38. Although various machine learning models, especially GNNs, have been developed for predicting optoelectronic properties and accelerating the design of organic semiconductors [39][40][41][42], high-order conjugated connections have still not been properly modeled. Herein, we introduce the concept of molecular hypergraphs and propose the Molecular Hypergraph Neural Network (MHNN) based on a simple but general message-passing method. MHNN was implemented to predict the optoelectronic properties of organic semiconductors, where hyperedges represent conjugated structures. On three photovoltaic-related datasets, MHNN outperforms all baseline models on most tasks. Despite not using any 3D geometric information, MHNN exhibits better results than 3D-based models like SchNet 43, which require atom coordinates as input. Moreover, MHNN possesses high data efficiency even compared with pretrained models, which could be useful for data-scarce applications. This work provides a new model for property prediction of complex molecules containing higher-order connections.

II. METHODS

A. Molecular hypergraph
A hypergraph G = (V, E, H, L) is defined by a set of n nodes V, a set of m hyperedges E, node features H ∈ R^(n×d), and hyperedge features L ∈ R^(m×d′). Each hyperedge e = {v_1, ..., v_|e|} is a subset of V with order |e| ≥ 2. In a molecular hypergraph, it is natural to employ nodes to represent atoms and hyperedges to represent pairwise bonds, delocalized bonds, conjugated bonds, and other higher-order associations. It is worth noting that the definition of hyperedges is important and should be related to the prediction target. For example, conjugated structures can significantly affect the light absorption and emission of molecules, so it is reasonable to describe conjugated bonds with hyperedges for the prediction of optoelectronic properties (e.g., band gap) 38. Moreover, hyperedges could be defined by pharmacophores 44 or toxicophores 45 for the prediction of molecular activity or toxicity, respectively. In this work, we show an example of using molecular hypergraphs to describe conjugated molecules (Fig. 2a), where hyperedges are constructed from pairwise bonds and conjugated bonds. For example, benzene (C 6 H 6 ) contains 12 atoms, six C-H σ bonds, six C-C σ bonds, and one large delocalized π bond, so its molecular hypergraph consists of twelve nodes, twelve 2-order hyperedges, and one 6-order hyperedge.
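As an illustration, the benzene hypergraph described above can be constructed explicitly. The following is a minimal sketch (the atom indexing and the benzene_hypergraph helper are ours for illustration, not part of any released MHNN code), representing each hyperedge as a set of node indices:

```python
# Minimal sketch of the molecular hypergraph of benzene (C6H6).
# Assumption: carbons are indexed 0-5 and hydrogens 6-11; each hyperedge is a
# frozenset of node indices, so pairwise sigma bonds are 2-order hyperedges and
# the delocalized pi bond is a single 6-order hyperedge.

def benzene_hypergraph():
    carbons = list(range(6))
    hydrogens = list(range(6, 12))
    nodes = carbons + hydrogens
    hyperedges = []
    # six C-C sigma bonds around the ring (2-order hyperedges)
    for i in carbons:
        hyperedges.append(frozenset({i, (i + 1) % 6}))
    # six C-H sigma bonds (2-order hyperedges)
    for c, h in zip(carbons, hydrogens):
        hyperedges.append(frozenset({c, h}))
    # one delocalized pi bond over all six carbons (6-order hyperedge)
    hyperedges.append(frozenset(carbons))
    return nodes, hyperedges

nodes, hyperedges = benzene_hypergraph()
print(len(nodes), len(hyperedges))  # 12 nodes, 13 hyperedges
```

This yields exactly the twelve nodes, twelve 2-order hyperedges, and one 6-order hyperedge mentioned in the text.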

B. Algorithm
The higher-order relations in complex molecules are often very diverse; that is, the orders of hyperedges in molecular hypergraphs often vary. For example, the number of atoms contained in a conjugated bond can be any integer greater than four. Therefore, model algorithms should not be limited to hyperedges of a specific order or within a specific order range. In addition, the model should have good extrapolation ability for hyperedges of unseen orders. Inspired by recent works on hypergraph diffusion algorithms 46,47, we propose the Molecular Hypergraph Neural Network (MHNN) based on bipartite representations of hypergraphs, which can efficiently operate on hypergraphs with hyperedges of various orders (Fig. 2b,c).
The molecular hypergraph is initially transformed into an equivalent bipartite graph (Fig. 2b), wherein two distinct sets of vertices denote the nodes and hyperedges of the molecular hypergraph, respectively. The message passing of MHNN then alternates between these two vertex sets. At step t, each hyperedge e first aggregates messages from its member nodes and updates its hidden state according to:

m_e^(t) = Σ_{v∈e} f_1(h_v^(t)),    l_e^(t+1) = f_2([l_e^(t) || m_e^(t)]).

Then, the hidden states h_v of each node are updated based on the messages m_{e→v}^(t) from the involved hyperedges (e : v ∈ e) according to:

m_{e→v}^(t) = Σ_{e : v∈e} f_3(l_e^(t+1)),    h_v^(t+1) = f_4([h_v^(t) || m_{e→v}^(t)]),

where h_v^(0) and l_e^(0) are derived from initial atom features and bond features (Appendix B). After T steps of message passing, the hypergraph-level prediction is calculated in the readout part based on the final hidden states of the nodes and of the hyperedges with |e| > 2, according to:

g_v = Σ_{v∈G} h_v^(T),    g_e = Σ_{e∈G, |e|>2} l_e^(T),    ŷ = MLP([g_v || g_e]),

where MLP(•) is a multi-layer perceptron. The output ŷ is the prediction target of MHNN, which can be a scalar or a vector.
In this work, four MLPs are used as the message and update functions (f_1, f_2, f_3, f_4). The schematic diagram of the MHNN architecture is shown in Fig. 3 and Algorithm 1.
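A single message-passing step of this bipartite scheme can be sketched in plain NumPy. The sum aggregation and the one-layer ReLU stand-ins for f_1-f_4 below are illustrative assumptions, not the authors' exact architecture or hyperparameters:

```python
import numpy as np

# Sketch of one MHNN message-passing step on the bipartite view of a
# hypergraph. Assumptions: sum aggregation, and single linear+ReLU layers
# standing in for the MLPs f1-f4; hidden size d = 8 is arbitrary.

rng = np.random.default_rng(0)
d = 8  # hidden dimension

def mlp(in_dim, out_dim):
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W, 0.0)  # one linear layer + ReLU

f1, f3 = mlp(d, d), mlp(d, d)            # message functions
f2, f4 = mlp(2 * d, d), mlp(2 * d, d)    # update functions on concatenated inputs

def mhnn_step(h, l, edges):
    """h: (n, d) node states; l: (m, d) hyperedge states;
    edges: one node-index list per hyperedge (any order |e| >= 2)."""
    # V -> E: each hyperedge sums messages from its member nodes
    m_e = np.stack([f1(h[e]).sum(axis=0) for e in edges])
    l_new = f2(np.concatenate([l, m_e], axis=1))
    # E -> V: each node sums messages from the hyperedges containing it
    m_v = np.zeros_like(h)
    for j, e in enumerate(edges):
        m_v[e] += f3(l_new[j:j + 1])[0]
    h_new = f4(np.concatenate([h, m_v], axis=1))
    return h_new, l_new

# toy hypergraph: 4 nodes, one 2-order and one 3-order hyperedge
edges = [[0, 1], [1, 2, 3]]
h, l = rng.standard_normal((4, d)), rng.standard_normal((2, d))
h, l = mhnn_step(h, l, edges)
print(h.shape, l.shape)  # (4, 8) (2, 8)
```

Because the aggregations simply sum over whichever nodes a hyperedge contains, the same step handles hyperedges of any order, which is the key property the text describes.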

Algorithm 1 Algorithm of MHNN
Input: molecular hypergraph G = (V, E, H, L)
1: initialize h_v^(0) and l_e^(0) from atom and bond features
2: for t = 0 to T − 1 do
3:   send messages from V to E for all e ∈ E: m_e^(t) = Σ_{v∈e} f_1(h_v^(t))
4:   update hyperedge embeddings: l_e^(t+1) = f_2([l_e^(t) || m_e^(t)])
5:   send messages from E to V for all v ∈ V: m_{e→v}^(t) = Σ_{e : v∈e} f_3(l_e^(t+1))
6:   update node embeddings: h_v^(t+1) = f_4([h_v^(t) || m_{e→v}^(t)])
7: end for
8: hypergraph embedding from nodes: g_v = Σ_{v∈G} h_v^(T)
9: hypergraph embedding from hyperedges: g_e = Σ_{e∈G, |e|>2} l_e^(T)
10: ŷ = MLP([g_v || g_e])
Output: ŷ

C. Input features
For the 2D GNN baselines, the atom features and bond features designed by OGB 48 are used as the initial features of the models. For MHNN, the initial atom features are also from OGB 48, and only the bond type is used as the initial feature of all hyperedges.

D. Datasets
The OPV dataset 39, the organic photovoltaic dataset, contains 90,823 unique molecules (monomers and soluble small molecules) together with their SMILES strings, 3D geometries, and optoelectronic properties from DFT calculations. OPV has four molecular tasks for monomers: the energy of the highest occupied molecular orbital (ε HOMO), the energy of the lowest unoccupied molecular orbital (ε LUMO), the HOMO-LUMO gap (∆ε), and the spectral overlap I overlap. In addition, OPV has four polymeric tasks: the polymer ε HOMO, polymer ε LUMO, polymer gap ∆ε, and optical LUMO O LUMO 39.
The OCELOTv1 dataset 40 comprises about 25,000 organic π-conjugated molecules, along with their optoelectronic and reaction characteristics calculated by accurate DFT or TD-DFT methods. The dataset encompasses 15 molecular properties: vertical (VIE) and adiabatic (AIE) ionization energy, vertical (VEA) and adiabatic (AEA) electron affinity, cation (CR) and anion (AR) relaxation energy, HOMO and LUMO energy, HOMO-LUMO energy gap (H-L), electron (ER) and hole (HR) reorganization energy, and lowest-lying singlet (S0S1) and triplet (S0T1) excitation energy.
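The hyperedge featurization described under Input features can be sketched as follows; the bond-type vocabulary and the extra "conjugated" category assigned to hyperedges with more than two nodes are illustrative assumptions, not the paper's exact encoding:

```python
# Sketch of initial hyperedge features. Assumption (per the text): only the
# bond type is encoded. The category list below and the dedicated "conjugated"
# type for high-order hyperedges are our illustrative choices.

BOND_TYPES = ["single", "double", "triple", "aromatic", "conjugated"]

def hyperedge_feature(nodes, bond_type=None):
    """One-hot bond-type feature; hyperedges with |e| > 2 are treated as
    conjugated structures regardless of bond_type."""
    kind = "conjugated" if len(nodes) > 2 else bond_type
    return [1.0 if t == kind else 0.0 for t in BOND_TYPES]

print(hyperedge_feature([0, 1], "single"))    # [1.0, 0.0, 0.0, 0.0, 0.0]
print(hyperedge_feature([0, 1, 2, 3, 4, 5]))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

This keeps pairwise bonds and conjugated structures in one shared feature space, so a single hyperedge update function can process both.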
PCQM4Mv2 49 is based on the PubChemQC project 50 and aims to predict the HOMO-LUMO energy gap of molecules from their SMILES strings. PCQM4Mv2 is unprecedentedly large (>3.8M graphs) compared to other labeled graph databases.
We follow the standard train/validation/test dataset splits of OPV and PCQM4Mv2, and use a random split for the OCELOTv1 dataset. The experimental results are derived from three separate runs using different random seeds, except for PCQM4Mv2, which is based on a single run.
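A seeded random split of the kind used for OCELOTv1 can be sketched as below; the 80/10/10 ratio and the random_split helper are assumptions for illustration, since the text only specifies a random split repeated with different seeds:

```python
import random

# Sketch of a reproducible random train/validation/test split.
# Assumption: an 80/10/10 ratio (the text does not state the exact fractions).

def random_split(n, seed, frac=(0.8, 0.1, 0.1)):
    """Return disjoint train/validation/test index lists for n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded shuffle for reproducibility
    a = int(frac[0] * n)
    b = int((frac[0] + frac[1]) * n)
    return idx[:a], idx[a:b], idx[b:]

train, val, test = random_split(25000, seed=0)
print(len(train), len(val), len(test))  # 20000 2500 2500
```

Running this with three different seeds and averaging the resulting metrics reproduces the evaluation protocol described above.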

III. RESULTS AND DISCUSSION
In this section, we first assess the predictive performance of MHNN on optoelectronic properties across three datasets. Among them, the OPV 39 and OCELOTv1 40 datasets consist of conjugated molecules and their optoelectronic properties, while the PCQM4Mv2 dataset is employed to investigate the large-scale learning capability of MHNN. Subsequently, we explore the data efficiency of MHNN at different training data sizes.

A. Analysis of datasets
The OPV and OCELOTv1 datasets, composed of conjugated molecules, are utilized to explore the learning ability of MHNN on conjugated structures and its prediction performance for optoelectronic properties. As shown in Fig. 4a, the conjugated molecules in the OPV dataset have a broader molar mass distribution (80-1800 g/mol) than those in the OCELOTv1 dataset (90-1400 g/mol). The molecular weights in the OPV dataset are predominantly concentrated in the range of 500 to 1000 g/mol, whereas the OCELOTv1 dataset shows a concentration in the range of 200 to 400 g/mol. Therefore, the OPV dataset not only has more data points than the OCELOTv1 dataset but also contains more large conjugated molecules. As depicted in Fig. 4b, molecules with larger conjugated structures are present in the OPV dataset compared to the OCELOTv1 dataset. The number of atoms in each conjugated structure of the OPV dataset spans a range from 4 to 120, with a concentration between 25 and 50.
In contrast, the OCELOTv1 dataset exhibits a narrower range of atom numbers in conjugated structures (5-66), mainly concentrated between 15 and 30. Moreover, the conjugated molecules in the OPV dataset generally have lower band gaps (∼1.9 eV) than those in the OCELOTv1 dataset (∼6.2 eV).
It can be concluded from Fig. 4b that molecules with larger conjugated structures tend to have smaller band gaps, although this trend is not absolute. The lack of an obvious regularity in the distribution also demonstrates the complex relationship between optoelectronic properties and conjugated structures. This underscores the significance of utilizing hyperedges to represent conjugated structures.

B. Performance on OPV dataset
For the OPV dataset, we compared MHNN with multiple baselines: GCN 51, GIN 52, GAT 53, GATv2 54, MPNN 55, and SchNet 43. Table II shows the test performances of MHNN and the competitive baselines on the OPV dataset, where the best results are marked in bold. Except for SchNet 43, which uses 3D molecular geometries from DFT calculations, all other models, including MHNN, use only 2D topology information from SMILES strings. For the molecular properties, SchNet clearly outperforms the 2D baselines, since 3D information is crucial for these properties 39. However, MHNN outperforms all baselines on three tasks (∆ε, ε HOMO, ε LUMO) without any 3D information, indicating that molecular hypergraphs with additional conjugation information are reliable representations of organic semiconductors. SchNet significantly outperforms the other models in predicting the target I overlap, indicating that 3D molecular geometries provide crucial and unique insights for this target. For the polymer property prediction tasks, SchNet 43 does not exhibit better performance, because only the atom positions of monomers are available. This also suggests that polymer properties may be less dependent on the precise 3D structures of monomers 39. Overall, MHNN achieves the best results on 7 out of 8 tasks compared to the baselines, which demonstrates the significance of molecular hypergraphs and the excellent performance of MHNN for property prediction of conjugated molecules.

C. Performance on OCELOTv1 dataset
All models from the original paper 40 were selected as baselines to compare with the performance of MHNN on the OCELOTv1 dataset. Extended connectivity fingerprints (ECFP2) and 266 molecular descriptors were calculated from SMILES strings and used as the input for ridge regression (RR), support vector machine (SVM), kernel ridge regression (KRR), and feed-forward network (FFN) models 40. For the MPNN+MolDes model, the graph embeddings computed by MPNN are concatenated with the vectors of molecular descriptors and passed to an FFN for predicting molecular properties 40. More details about the baseline models can be found in Reference 40. Table III shows the test performances of MHNN and the baselines, where the best results are marked in bold. On tasks such as AIE, AEA, S0S1, and S0T1, MPNN exhibits better performance than the models (RR, SVM, KRR, FFN) using molecular descriptors. However, the descriptor-based models show superior performance to MPNN on tasks like HOMO, H-L, and HR. Moreover, with the assistance of extra molecular descriptors, the MPNN+MolDes model demonstrates greater predictive performance across most tasks compared to the other models. This indicates that molecular graphs and molecular descriptors each provide important and specific information for optoelectronic property prediction. Despite not using molecular descriptors, MHNN outperforms all baseline models in the 15 tasks, demonstrating its excellent prediction performance. This illustrates that molecular hypergraphs are strong representations of conjugated molecules.

D. Performance on PCQM4Mv2 dataset
To explore the learning ability on a large-scale dataset, MHNN is compared with message-passing GNN baselines on the PCQM4Mv2 dataset (Table IV). It should be pointed out that this dataset contains a large number of small molecules without conjugated structures, even though the prediction target is the band gap, one of the optoelectronic properties. As shown in Table IV, MHNN obtains lower MAE results with fewer model parameters, which demonstrates its high learning efficiency. This also shows that MHNN has reliable large-scale learning ability and could reduce the training cost on huge datasets.

E. Data efficiency
To explore the data efficiency of MHNN, we compare it to GIN with and without pretraining on the three most important tasks of the OPV dataset under the same data partition. All 80,823 unlabeled molecules in the training set were used to pretrain the GIN model with a self-supervised learning (SSL) strategy 57. Different amounts of data were randomly selected from the training set to directly train GIN and MHNN or to fine-tune the pretrained GIN. As shown in Figure 5, MHNN exhibits better results on all three tasks than GIN and pretrained GIN at the different training data sizes. For instance, with 1000 labeled training data points, MHNN surpasses pretrained GIN by 31% and 25% on the ε HOMO and ε LUMO tasks, respectively. In addition, directly trained GIN needs 4-6 times more training data to attain performance equivalent to MHNN. All these results show that MHNN is highly data-efficient and could be useful for applications without abundant labeled data.

IV. CONCLUSION
The molecular hypergraph and the corresponding MHNN were designed to overcome the limitations of traditional molecular graphs in representing high-order connections within complex molecules. The optoelectronic property prediction task of organic semiconductors was selected to evaluate its prediction performance. The definition of molecular hyperedges is specified to focus on the conjugated structures of molecules, which relies on human knowledge of relevant connections rather than learning directly from data. Across all three datasets (OPV, OCELOTv1, PCQM4Mv2), MHNN exhibits superior performance to the baselines on most tasks. Impressively, even in the absence of 3D geometric information, MHNN surpasses SchNet, which relies on atom positions. Moreover, MHNN demonstrates higher data efficiency compared to pretrained models, making it valuable for applications where labeled data is scarce.

FIG. 2. (a) The method of constructing molecular hypergraphs for conjugated molecules. (b) The conversion from a hypergraph to an equivalent bipartite graph. (c) The message passing method of our MHNN model.

FIG. 3. The MHNN architecture. || denotes concatenation. The embeddings of nodes and hyperedges are updated in multiple MHNN blocks, which can share parameters or not. The final embeddings of nodes and hyperedges are passed into an output block to generate predictions.


FIG. 4. (a) Distribution of molecular weights for the OPV and OCELOTv1 datasets. (b) Distribution of band gaps and numbers of atoms in conjugated structures for the OPV and OCELOTv1 datasets.

FIG. 5. The test results of different models on the HOMO-LUMO gap, HOMO, and LUMO tasks of the OPV dataset under different amounts of training data. The green lines represent the results of GIN pretrained by self-supervised learning 57, while the blue and red lines show the results of GIN and MHNN without pretraining, respectively. Except for the MHNN model, all data are from Reference 57.

TABLE I. Overview of the datasets.

TABLE II. MAE results on the OPV testing set. The unit of the I overlap target is W/mol, and the unit of the other targets is meV. * represents using DFT-optimized atom coordinates during model training. The results of MPNN and SchNet are from Reference 39.

TABLE III. MAE results of the baselines and MHNN on the OCELOTv1 testing set. The unit of all targets is eV. The results of the baselines are from Reference 40.

TABLE IV. Validation MAE results of MHNN and other message-passing GNN baselines on PCQM4Mv2. The results of the baselines are from References 49 and 56. This dataset does not publish its test set. VN represents the use of virtual nodes to improve performance.