Quantum chemistry calculations have been very useful in providing key detailed properties and enhancing our understanding of molecular systems. However, such calculations, especially with *ab initio* models, can be time-consuming. For example, predicting charge-transfer properties often requires working with an ensemble of different thermally populated structures. A possible alternative to such calculations is a machine-learning based approach. In this work, we show that a general prediction of electronic coupling, a property that is very sensitive to intermolecular degrees of freedom, can be obtained with artificial neural networks, with improved performance compared to the popular kernel ridge regression method. We propose strategies for optimizing the learning rate and batch size, improving model performance, and further evaluating models to ensure that the physical signatures of charge-transfer coupling are well reproduced. We also address the effect of feature representation as well as statistical insights obtained from the loss function and the data structure. Our results pave the way toward a general strategy for training such neural-network models for accurate prediction.

## I. INTRODUCTION

In the search for novel molecular materials with desirable properties, it is important to establish reliable theoretical predictions.^{1} When simple theoretical approaches reach the limit of generating useful predictions, computer simulations, especially those that do not depend on empirical parameters, become essential. This is especially true for organic optoelectronic materials. For example, in predicting the charge transport properties of organic semiconducting materials, conventional Marcus theory^{2,3} for charge hopping and band theory^{4–6} are both limited: the intermolecular electronic coupling is often large enough to delocalize the charge,^{7–11} whose dynamics couples with the nuclear movements, yet not large enough to form a robust polaron band to which standard solid-state theory can be applied.^{12–14} A direct simulation containing both the quantum and classical degrees of freedom of the problem is perhaps the best solution, given the tremendous advancement of computational technologies in this modern era.^{15–17}

One of the major difficulties for direct dynamics propagation is establishing the Hamiltonian for the quantum mechanical part of the system. All *ab initio* electronic structure approaches offer high-quality matrix elements, but they could take unrealistically large amounts of computational resources to cover all the cases in an amorphous or thermally disordered state. Recently, machine-learning (ML) based models have garnered attention since they can produce high-quality quantum chemistry (QC) predictions at a small fraction of the computational time.^{18–32} While searching over chemical space to find a target “dream molecule” remains quite challenging,^{33–35} the available algorithms and tools have made it much easier to build models for systems with a fixed number and types of nuclei.^{36–41} As an example, potential energy surfaces can be constructed if the model is properly trained over a range of nuclear positions.^{18–20,42–50}

ML models for predicting electronic coupling (denoted as *V*) for electron transfer (ET) have been developed and reported recently.^{51–58} Such schemes offer a way to construct the Hamiltonian quickly and precisely for the study of complex dynamics, allowing the system to be investigated without compromising the model settings. However, the detailed know-how for using ML to calculate electronic coupling is still being developed. Recently, it has been reported that ML approaches work successfully in the cases of anthracene,^{51} pentacene and its derivatives in the crystalline (or quasi-crystalline) phase,^{52–54} and ethylene in the liquid phase.^{55} However, for amorphous (or highly disordered) morphologies, including naphthalene,^{55} tris(8-hydroxyquinolinato) aluminum,^{56} thiophene,^{57} and deoxyribonucleic acid,^{58} the reported mean absolute errors (MAEs) were larger than 10 meV, which is slightly above the desirable error, since the electronic coupling for charge transfer typically ranges from 10^{−5} meV to 10^{2} meV. The poor ML performance observed for data derived from amorphous conformations is likely due to the complex conformational dependence of the ET coupling, which has been observed and reported before.^{55–57} The oscillatory nature of electronic coupling with respect to the movements of the molecule pairs makes the construction of ML models more difficult than, for example, that of an ML-based potential energy surface, since the latter is generally a smoother function. Therefore, applying ML to predict coupling for amorphous conformations is still a developing area. It is worth noting that improving ML models for such systems is an important task, as amorphous conformations closely resemble the molecular arrangement formed during the fabrication of optoelectronic devices.

In building an ML model, the challenges lie in the choice of the model from a wide variety of ML algorithms as well as in the optimization of the corresponding hyperparameters. Hyperparameters are the parameters and numerical schemes that need to be determined before the training process begins, and their choice directly affects the performance. Each ML algorithm has its own set of hyperparameters, and therefore, developing an effective ML model becomes a case-by-case problem. In the physical chemistry community, kernel ridge regression (KRR) and artificial neural networks (ANNs) are two of the most popular ML algorithms.^{18–21} These two have been systematically investigated in the prediction of potential energy surfaces and other molecular properties, as seen in the literature.^{18–21} However, when it comes to predicting new physical properties, simply extending these existing approaches could fail because of the unique nature of the new quantity and its dependence on, or sensitivity to, the structural features. Electronic coupling is one such example, where it is still necessary to perform many trials to optimize hyperparameters. To reduce the effort of training a model, especially of tuning hyperparameters, systematic analyses of the numerical results are required. For example, in our previous work,^{55} we reported that a KRR model learning the phase-matched, signed coupling *V* always performed better than the same model trained to learn either |*V*| or log _{10}|*V*|, possibly because the model can smoothly fit the nodal regions without having to match the abrupt changes near zero coupling present in |*V*| or log _{10}|*V*|. Such transferable technical insights can help improve ML performance and enable the field to grow and progress.

With these goals in mind, we aim to provide the strategies we have learned and developed for ANNs in the prediction of electronic coupling. In this work, we include (a) guidelines for establishing both KRR and ANN models for the coupling values, (b) strategies for optimizing hyperparameters, especially for ANN models, (c) verification of important physical properties, and (d) statistical insights derived from numerical observations. We choose naphthalene, the simplest polycyclic aromatic hydrocarbon, in its amorphous phase as the target system. Among the ML models we developed, the best ANN model achieves an MAE of 6.5 meV, with the correct distance and orientation dependence. Most importantly, it takes only a few milliseconds to predict thousands of coupling values. This approach offers a promising opportunity to further investigate many fundamental aspects of material properties that are otherwise limited by computational power.

## II. METHODS

The general process of supervised ML we developed is depicted in Fig. 1. The first step was to acquire a large amount of data with known answers: in this case, molecular pairs with different structures and their corresponding coupling values (obtained through quantum chemistry calculations). In the present work, molecular dynamics (MD) simulations were employed to generate a diverse set of naphthalene dimer structures, followed by density functional theory (DFT) calculations and a “phase-matching” scheme to generate the corresponding couplings, which are our “ground-truth” values (the *reference labels*, denoted as *V*^{QC}). Then, the structural information was transformed into the *input features* (also known as *descriptors*). In the training process, a set of data (including input features and reference labels) was assigned as the training dataset; the remainder became the testing dataset. The ML models based on different ML algorithms were trained using the training dataset. Afterward, the ML models were systematically evaluated with an independent testing dataset, which had never been seen during the training process. Further details are described in the following sections.

### A. Building data sets

#### 1. Generating molecular pairs by MD simulation

We first performed MD simulations of a system of 20 000 naphthalene molecules at 400 K and 1.0 bar, for a melted state of naphthalene, from which we collected densely packed molecules. Further details of the MD simulations are provided in the supplementary material accompanying the present work. The structures of pairs were extracted from the MD trajectories. 250 000 pairs were collected based on their center-of-mass distances: 100 000 pairs for distances ranging from 3 Å to 5 Å and 150 000 pairs in the 5 Å to 10 Å range. Among them, 240 000 samples were assigned to the training dataset, while the remaining 10 000 pairs constitute the testing dataset. For KRR, the model was trained using five-fold cross validation on the training dataset, and its accuracy was evaluated with the testing dataset. For ANNs, we followed the common strategy of using the testing set for the early-stop method.^{38} We note that doing so involves a higher risk of undetected over-fitting of the model, which can be improved in the future. In the present work, we stress the importance of evaluating model performance with additional tests, such as the distance and orientation dependence, as discussed in Sec. IV B.
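As a minimal sketch of the data partitioning (the random seed and the exact assignment procedure are our assumptions; only the counts are given in the text), the train/test split and the five-fold partition used for KRR cross validation can be written as:

```python
import numpy as np

def split_and_folds(n_total=250_000, n_test=10_000, k=5, seed=0):
    """Hold out a testing set and build k cross-validation folds
    from the remaining training indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    folds = np.array_split(train_idx, k)  # five folds of 48 000 samples each
    return train_idx, test_idx, folds
```

Any equivalent shuffling scheme works; the essential point is that the 10 000 testing pairs never enter the training or cross-validation folds.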

#### 2. Calculation of the reference labels

The corresponding electronic coupling for each pair was evaluated by DFT with the frontier molecular orbital (FMO) approach.^{59–61} In this step, the calculation was performed with the DZ^{*} (extended Dunning) basis set and the long-range corrected functional LC-BLYP,^{62,63} with the range-separation parameter *μ* set to 0.26 bohr^{−1} for a good hole-transfer description.^{64} In the FMO calculation, individual fragments were calculated in their neutral singlet states, and the off-diagonal Kohn–Sham matrix elements were calculated as the coupling. In the present work, we aimed at calculating the hole-transfer case and, thus, calculated the off-diagonal Kohn–Sham matrix element for the highest occupied molecular orbital (HOMO). All QC calculations were performed with a developmental version of Q-Chem.^{65}

Before assigning the electronic coupling obtained from QC calculations as the ground-truth values, the problem of the sign of the electronic coupling must be addressed. The sign of the electronic coupling reflects the choice of the signs of the individual orbitals, which is arbitrary in quantum chemistry programs. Inconsistent signs lead to incorrect estimates of electronic properties, such as the band structure or charge mobility. They also make training the machine much harder if structurally similar systems carry different signs. Phase-matching problems have been addressed in several works under different contexts.^{30,66,67} Here, we developed a scheme^{55} that compares the sign of a molecular orbital coefficient (the *p* orbital of one carbon atom, C) with a geometric vector in the planar framework (in this case, the cross product of the vectors of two C–C bonds) in order to obtain the electronic coupling as a continuous function of molecular rotation. Further details of this scheme have been reported before.^{55}
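A minimal sketch of such a sign convention (the function names and calling convention are ours for illustration; the actual implementation follows Ref. 55) compares the p-orbital coefficient vector on a chosen carbon with the plane normal built from two C–C bond vectors, and flips the raw coupling accordingly:

```python
import numpy as np

def phase_sign(p_coeff, bond1, bond2):
    """Sign factor for one fragment: compare the p-orbital coefficient
    vector on the chosen carbon with the plane normal given by the
    cross product of two C-C bond vectors."""
    normal = np.cross(bond1, bond2)
    return 1.0 if np.dot(p_coeff, normal) >= 0.0 else -1.0

def phase_matched_coupling(v_raw, p_a, bonds_a, p_b, bonds_b):
    """Flip the raw FMO coupling so its sign follows the geometric
    convention on both fragments."""
    return v_raw * phase_sign(p_a, *bonds_a) * phase_sign(p_b, *bonds_b)
```

With this convention, the coupling becomes a continuous function of molecular rotation, since the geometric reference rotates rigidly with each fragment.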

### B. Data preprocessing: Coulomb matrix representation

When we apply ML, the molecular geometries need to be transformed into *feature vectors*, numerical representations that should be translationally and rotationally invariant. It has been found that the success of ML relies heavily on whether this vectorial representation can correctly infer the relation between molecular structures and predicted properties.^{68–72} In the present work, we used the Coulomb matrix (CM) representation^{73} as the feature vector. It has been successfully implemented in various ML algorithms for predicting electronic couplings.^{51,55,56} There could well be more effective input features, but developing the optimal representation is beyond the scope of this work. The matrix element of the CM representation, between atoms *i* and *j*, is defined as follows:

$$C_{ij} = \begin{cases} \dfrac{1}{2} Z_i^{2.4}, & i = j \\[6pt] \dfrac{Z_i Z_j}{\left\lVert \mathbf{R}_i - \mathbf{R}_j \right\rVert}, & i \neq j, \end{cases} \tag{1}$$

where *Z*_{i} represents the nuclear charge of atom *i*, **R**_{i} is the Cartesian position vector of atom *i*, and ∥**R**_{i} − **R**_{j}∥ is the Euclidean distance between atoms *i* and *j*. We note that the conventional CM has been used to describe the features of a single-molecule geometry and to infer atomization energies,^{73,74} molecular frontier orbital energies,^{75} excited state energies,^{26} adiabatic potential energy surfaces,^{28} or general electronic properties.^{76,77}

Here, we work on intermolecular electronic coupling, which depends on a pair of molecular structures. Electronic coupling has different sensitivities to the intra- and intermolecular degrees of freedom. In Eq. (1), the off-diagonal CM elements (*i* ≠ *j*) contain the Coulomb repulsion from intra- and intermolecular interactions. The diagonal elements encode a polynomial dependence of the atomic energies on the nuclear charge,^{73} and they carry information on the atom type. Given these different sensitivities, we tested two different representations derived from the CM as input features: (1) the complete CM and (2) only the intermolecular block. For each molecular pair, the selected elements of the CM were flattened into a single row vector. The ML models based on these two representations are denoted as CM and CM-inter, respectively, in the following discussion.
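As a concrete sketch (function names are ours), the two feature choices for a naphthalene pair (18 atoms per molecule, 36 in total) can be generated as follows; since the full CM is symmetric, flattening its upper triangle gives 666 elements, while the intermolecular block gives 18 × 18 = 324 elements:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix of Eq. (1) for nuclear charges Z and positions R (Å)."""
    n = len(Z)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

def dimer_features(Z_a, R_a, Z_b, R_b):
    """Return (full-CM vector, intermolecular-block vector) for a pair."""
    Z = np.concatenate([Z_a, Z_b])
    R = np.vstack([R_a, R_b])
    C = coulomb_matrix(Z, R)
    n_a = len(Z_a)
    cm_full = C[np.triu_indices(len(Z))]   # 666 elements for a naphthalene pair
    cm_inter = C[:n_a, n_a:].ravel()       # 324 intermolecular elements
    return cm_full, cm_inter
```

This sketch omits the row-norm sorting sometimes used to make the CM permutation invariant; for a fixed atom ordering, as here, the flattened vectors are directly comparable across pairs.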

### C. Training artificial neural networks

In ANNs, the functional form of the predicted value (*V*^{ML}) is defined by the architecture of the neural network, which is parameterized by a set of weights, and these weighting coefficients are optimized iteratively with a given training set. The neural-network architecture is specified by the number of hidden layers, the number of neurons in each layer, the activation function (or normalization function) of each node, and the network connectivity. A general *N*-hidden-layer ANN (with a one-dimensional dense output layer) can be represented in the analytic form,

$$V^{\mathrm{ML}}(\mathbf{x}) = \sum_{m_N=1}^{M_N} w_{1,m_N}^{(N+1)}\, f\!\left(\sum_{m_{N-1}=1}^{M_{N-1}} w_{m_N,m_{N-1}}^{(N)}\, f\!\left(\cdots f\!\left(\sum_{m_0=1}^{D} w_{m_1,m_0}^{(1)} x_{m_0} + w_{m_1,0}^{(1)}\right)\cdots\right) + w_{m_N,0}^{(N)}\right) + w_{1,0}^{(N+1)}, \tag{2}$$

where *M*_{i} denotes the total number of nodes of the *i*th hidden layer and $w_{m,n}^{(l)}$ are the weighting coefficients connecting the nodes of two adjacent layers. To simplify the notation of Eq. (2), the bias weights, which adjust the offset of each node, are included and denoted as $w_{m,0}^{(l)}$, and *f* denotes the activation functions. An illustrative scheme for Eq. (2) is provided in Fig. 2.

In this work, an ANN was composed of a *D*-dimensional input layer followed by four batch-normalized^{78} hidden layers with the same number of neurons in each layer. The number of neurons in each hidden layer was set to be consistent with the dimension of the input features; *D* equals 666 (324) when the ML model was trained using the CM (CM-inter) input features. We used a leaky rectified linear unit (leaky ReLU) with a negative slope of 0.01 as the activation function for each neuron. The neurons were fully connected, and all the weights were initialized with random numbers. Since our goal is to test the performance of the ANN against KRR and subsequently to develop strategies for optimizing the ANN hyperparameters, applying more advanced network architectures is beyond the scope of this work.
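Under the stated architecture, a PyTorch sketch could look as follows. The ordering of batch normalization before the activation, and the placement of normalization on every hidden layer, are our assumptions, since the exact layer ordering is not specified in the text:

```python
import torch
import torch.nn as nn

def make_ann(D: int) -> nn.Sequential:
    """Four fully connected, batch-normalized hidden layers of width D,
    leaky-ReLU activations (negative slope 0.01), and a one-dimensional
    dense output for the predicted coupling V^ML."""
    layers = []
    for _ in range(4):
        layers += [nn.Linear(D, D), nn.BatchNorm1d(D), nn.LeakyReLU(0.01)]
    layers.append(nn.Linear(D, 1))  # scalar output: predicted coupling
    return nn.Sequential(*layers)

model = make_ann(666)  # D = 666 for CM features; use 324 for CM-inter
```

PyTorch's default initialization already draws the weights randomly, matching the random initialization described above.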

Once the network architecture was determined, the capability of an ANN model to accurately predict electronic coupling depends on the numerical values of the weights. The so-called *loss function* quantifies the accuracy of any particular set of weights, and the objective of the training is to find a set of weights that minimizes the loss function. The choice of loss function is also regarded as a hyperparameter. In the present work, we built the ANN models with three commonly used loss functions: the mean absolute error (MAE), the mean squared error (MSE), and the Huber loss. The MAE, also known as the *L*_{1} loss, is the simplest loss function, defined as follows:

$$L_{\mathrm{MAE}} = \frac{1}{N_T} \sum_{i=1}^{N_T} \left| V_i^{\mathrm{ML}} - V_i^{\mathrm{QC}} \right|, \tag{3}$$

where the subscript *i* indicates the *i*th sample in the training dataset of size *N*_{T}. *N*_{T} was replaced by the batch size, *N*_{BS}, when the stochastic gradient descent (SGD) method was implemented. The MSE, the squared *L*_{2} norm, is a related loss function that is more sensitive to outliers than the MAE,

$$L_{\mathrm{MSE}} = \frac{1}{N_T} \sum_{i=1}^{N_T} \left( V_i^{\mathrm{ML}} - V_i^{\mathrm{QC}} \right)^2. \tag{4}$$

The Huber loss has a quadratic *L*_{2} dependence for small discrepancies and a linear *L*_{1} dependence for large errors. It has been reported to be less sensitive to outliers than the MSE, which prevents exploding gradients in some cases.^{79} The Huber loss function is calculated as follows:

$$L_{\mathrm{Huber}} = \frac{1}{N_T} \sum_{i=1}^{N_T} Z_i, \tag{5}$$

where *Z*_{i} is given by

$$Z_i = \begin{cases} \dfrac{1}{2} \left( V_i^{\mathrm{ML}} - V_i^{\mathrm{QC}} \right)^2, & \left| V_i^{\mathrm{ML}} - V_i^{\mathrm{QC}} \right| < 1 \\[6pt] \left| V_i^{\mathrm{ML}} - V_i^{\mathrm{QC}} \right| - \dfrac{1}{2}, & \text{otherwise.} \end{cases} \tag{6}$$

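The three loss functions can be sketched in NumPy as below. We take the Huber threshold δ = 1, matching PyTorch's default `SmoothL1Loss`; the threshold actually used in this work is an assumption on our part:

```python
import numpy as np

def mae_loss(v_ml, v_qc):
    """Mean absolute error (L1 loss)."""
    return np.mean(np.abs(v_ml - v_qc))

def mse_loss(v_ml, v_qc):
    """Mean squared error (squared L2 loss); penalizes outliers more."""
    return np.mean((v_ml - v_qc) ** 2)

def huber_loss(v_ml, v_qc, delta=1.0):
    """Huber loss: quadratic below delta, linear above it."""
    z = np.abs(v_ml - v_qc)
    return np.mean(np.where(z < delta, 0.5 * z ** 2,
                            delta * (z - 0.5 * delta)))
```

For small residuals the Huber loss behaves like half the MSE, while for large residuals it grows only linearly like the MAE, which is the origin of its robustness to outliers.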
Once the loss function is defined, a standard way to improve the weights (and lower the loss) is using the SGD method or its variants, such as AdaGrad,^{80} AdaDelta,^{81} and Adam.^{82} The SGD-based schemes are used to optimize the ANN models, where the full training data are divided into batches of *N*_{BS} samples. Each batch of training data was used to update the optimized parameters,

$$\mathbf{W}_{q+1} = \mathbf{W}_q - \frac{\gamma}{N_{\mathrm{BS}}} \sum_{i=1}^{N_{\mathrm{BS}}} \nabla_{\mathbf{W}} L(\mathbf{W}_q, \mathbf{x}_i), \tag{7}$$

where **W**_{q} denotes all the weights [$w_{m,n}^{(l)}$ in Eq. (2)] at the *q*th update step. *N*_{BS} is the number of training data in each batch, or the *batch size*, and *γ*, the step size, is also known as the *learning rate* in the ANN community. *L*(**W**_{q}, **x**_{i}) is the value of the loss function for the given weights and input features. In practice, at the *q*th iteration, the gradient is estimated from the sum over the gradients of the loss for *N*_{BS} training samples, and then **W**_{q+1} is determined. The *N*_{BS} training samples are randomly drawn from the training dataset, and each sample has an opportunity to update the weights in one epoch, i.e., one pass of the learning algorithm through the entire training dataset. Therefore, the SGD-based methods can be interpreted as gradient descent with noisy gradients caused by the randomly composed batches of data.^{83–86} In Fig. 3, we include an illustration demonstrating the stochastic evolution of the training and validating processes in a single epoch.

In the present work, all the ML models were trained using the Adam (adaptive moment estimation) algorithm,^{82} with 1000 epochs. Since both the batch size (*N*_{BS}) and the learning rate (*γ*) are hyperparameters that significantly affect the ML performance,^{83,87,88} we implemented a grid search to tune both hyperparameters for all the ANN models. The search space of *N*_{BS} was 100, 200, 300, 400, 500, 600, 800, 1000, 2000, 3000, and 4000, and for the learning rate, the values 0.05, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005, 0.004, 0.003, 0.002, 0.001, 0.0009, 0.0008, and 0.0007 were searched. Therefore, with different *γ* and *N*_{BS}, we trained a total of 154 models for each type of loss function. All the ANN algorithms used in this work are implemented in the PyTorch package.^{38} For KRR, the detailed methodology has been introduced in our previous work^{55} and is included in the supplementary material of this work for interested readers.
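A condensed sketch of one such training run (minibatch Adam with the MAE loss and an early-stopping check on the testing MAE) is shown below; the `patience` parameter and the exact loop structure are our assumptions, since only the overall protocol is described in the text:

```python
import torch
import torch.nn as nn

def train_ann(model, X_tr, y_tr, X_te, y_te,
              lr=0.0008, batch_size=500, max_epochs=1000, patience=50):
    """Minibatch Adam with MAE loss; stop early when the testing MAE
    has not improved for `patience` epochs. Returns the best testing MAE."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()                    # the MAE loss
    best_mae, stall = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        perm = torch.randperm(len(X_tr))     # randomly composed batches
        for i in range(0, len(X_tr), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss = loss_fn(model(X_tr[idx]).squeeze(-1), y_tr[idx])
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            mae = (model(X_te).squeeze(-1) - y_te).abs().mean().item()
        if mae < best_mae:                   # a real run would checkpoint here
            best_mae, stall = mae, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return best_mae
```

Swapping `nn.L1Loss()` for `nn.MSELoss()` or `nn.SmoothL1Loss()` reproduces the MSE and Huber variants of the grid search.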

## III. TRAINING STRATEGY FOR THE ANN: TUNING LEARNING RATE AND BATCH SIZE

Compared to the ANN, KRR requires relatively little effort to determine its hyperparameters, such as the functional form of the kernel and the regularization parameter, as discussed before.^{55} In this section, we summarize our strategy for tuning hyperparameters for the ANN, which we found effective and useful. We demonstrate an optimization strategy for finding *γ* and *N*_{BS} based on the ratio *γ*/*N*_{BS}. Technical details such as batch normalization, normalization at the preprocessing step, and early stops are included in the supplementary material. We found that employing batch normalization and early stops can substantially improve ANN performance. Although such tips may sound tedious and are often neglected in similar works, we believe that these findings are important, as they help the field progress.

The optimized ANN model is obtained by searching for the weights **W** that achieve a desired minimum. A standard way to find the minimum is to follow the negative gradient of the loss function. An important hyperparameter, the learning rate (*γ*), determines the size of the change in the optimized parameters at every iteration. On the other hand, the commonly used SGD-based schemes evaluate the gradient from a randomly selected subset of the training data instead of the entire dataset, and thus, the number of data in such subsets, the batch size, is another hyperparameter. Computing the loss over smaller subsets leads to more frequent updates of the weights [**W** in Eq. (7)], so the optimization can converge faster, especially for high-dimensional problems such as ANNs.^{89} In addition, this leads to oscillations of the loss values during the iterations, which can help in escaping from local minima. It has been reported that models trained with a small batch size [*N*_{BS} in Eq. (7)] show better generalization than those trained with large ones in the case of image classification.^{83,87} However, using a small *N*_{BS} can be computationally expensive because one loses the advantage of vectorized operations when dealing with a small number of samples at a time. Therefore, finding an appropriate batch size and learning rate becomes one of the most time-consuming steps in training an ANN.

To address this issue, the relationship between learning rate, batch size, and the properties of the final minima has been discussed.^{83–86} In these works, it is found that the process of SGD can be considered as a discretized propagation of a stochastic differential equation (SDE),

$$d\mathbf{W} = -\nabla_{\mathbf{W}} L(\mathbf{W})\, dq + \sqrt{\frac{\gamma}{N_{\mathrm{BS}}}}\, \mathbf{R}(\mathbf{W})\, d\omega(q), \tag{8}$$

where **R**(**W**)**R**(**W**)^{T} = **C**(**W**), with **C**(**W**) being the covariance between elements of **W** and *dω*(*q*) is the Wiener process. Detailed derivation from Eq. (7) to Eq. (8) has been provided in Ref. 83. In Eq. (8), the first term on the right-hand side averages the negative gradients of the whole training dataset. The second term describes the noise of SGD dynamics, and the ratio of learning rate to batch size (*γ*/*N*_{BS}) dominates the scale of noise.

The relationship described in Eq. (8) implies that the “noise” in training scales with the ratio *γ*/*N*_{BS}, which prompted us to search over this ratio rather than tuning *γ* and *N*_{BS} individually. To verify this assumption, we investigated the model performance with respect to different *γ* and *N*_{BS} values. In Fig. 4, it is reassuring to see that models with *γ*/*N*_{BS} in the range from 10^{−6} to 10^{−5} have a relatively low averaged MAE on the testing dataset. As illustrated in Figs. 3 and 5, the term “averaged MAE” throughout the present work refers to the average of the MAE over all the batches in a single epoch. In addition, results from different loss functions (*L*_{MAE}, *L*_{MSE}, and *L*_{Huber}) follow essentially the same trend. We also note that in some cases, with different *γ* and *N*_{BS} settings, models at the same level of accuracy may have very different patterns of the averaged MAE as the epochs evolve (Fig. S1 in the supplementary material). A systematic investigation of this intriguing behavior would be desirable, but it is beyond the present focus of refining the ANN models. In this work, we propose that one can assign several values of *γ* and *N*_{BS* }in an appropriate range according to *γ*/*N*_{BS} and then tune the other hyperparameters.
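The grid described in Sec. II C can be enumerated directly; filtering by the ratio γ/*N*_{BS} then narrows the search to the promising region:

```python
# Enumerate the (learning rate, batch size) grid from Sec. II C and
# keep only the combinations whose ratio falls in the favorable window.
batch_sizes = [100, 200, 300, 400, 500, 600, 800, 1000, 2000, 3000, 4000]
lrs = [0.05, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005, 0.004, 0.003,
       0.002, 0.001, 0.0009, 0.0008, 0.0007]

grid = [(g, n, g / n) for g in lrs for n in batch_sizes]   # 154 combinations
promising = [(g, n) for g, n, r in grid if 1e-6 <= r <= 1e-5]
```

Restricting the grid to `promising` pairs before training reduces the number of full training runs substantially, at the cost of possibly missing outliers that perform well despite an unfavorable ratio.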

## IV. RESULTS AND DISCUSSION

In order to demonstrate various effects of ML models on their prediction capability, we selected eight representative ML models, as listed in Table I, for further investigation of the effect of feature representation, the distance and orientation dependence, and the relationship between the loss function and the data structure. The models we picked include two developed with KRR (using the full CM and the intermolecular block as input features, denoted as KRR-CM and KRR-CM-inter, respectively), one ANN model with the intermolecular part of the CM as its input (denoted as ANN-1-CM-inter), and five ANN models using the full CM but derived from different hyperparameter settings (as listed in Table I, denoted as ANN-1 through ANN-5). For simplicity, the MAE based on the testing (training) dataset is also called the “testing (training) MAE” throughout the present work, as listed in Table I and illustrated in Fig. 5. Meanwhile, the direct comparison of the coupling values *V*^{QC} vs the ML results *V*^{ML} is included in Fig. 6 for the selected ML models.

| Model^{a} | Loss^{b} | *N*_{BS}^{c} | *γ*^{d} | Testing MAE^{e} | Training MAE^{f} | Phase accuracy^{g} |
|---|---|---|---|---|---|---|
| KRR-CM | … | … | … | 10.85 | 1.05 | 87.9/91.9 |
| KRR-CM-inter | … | … | … | 11.15 | 6.87 | 88.6/90.1 |
| ANN-1-CM-inter | MAE | 500 | 0.0008 | 9.85 | 8.67 | 88.0/93.1 |
| ANN-1 | MAE | 500 | 0.0008 | 6.57 | 4.72 | 90.9/92.7 |
| ANN-2 | MSE | 600 | 0.004 | 6.79 | 5.69 | 94.1/92.6 |
| ANN-3 | MSE | 800 | 0.007 | 6.59 | 5.35 | 95.1/89.8 |
| ANN-4 | Huber | 400 | 0.007 | 6.64 | 5.09 | 93.2/92.8 |
| ANN-5 | Huber | 600 | 0.008 | 6.27 | 5.44 | 93.9/92.6 |


^{a}Only the KRR-CM-inter and ANN-1-CM-inter models used CM-inter as the input feature; the remaining models used the full CM.

^{b}The loss function, as introduced in Sec. II C.

^{c}The batch size for the ANN.

^{d}The learning rate for the ANN.

^{e}The mean absolute error, in units of milli-electron volt (meV), for the testing dataset.

^{f}The mean absolute error, in units of milli-electron volt (meV), for the training dataset.

^{g}The phase accuracy, in number percentage (%), for the evaluation of the orientation dependence of the naphthalene pair-I/pair-II (see Fig. 9).

### A. Effect of feature representation

The choice of feature representation plays an important role in training a successful ML model. The CM representation employed in the present work contains inter- and intramolecular elements, which have different sensitivities in determining the intermolecular charge-transfer coupling. For KRR, Fig. 7(a) shows the testing MAE for varying numbers of training samples, *N*_{T}. The KRR model using CM-inter as the input feature outperformed the one using the CM when they were trained with thousands of samples. As the number of training samples increases to 10^{5}, both models show the same level of accuracy. This performance trend is consistent with what we observed in the amorphous ethylene system.^{55} The dimension of the feature vector (or the number of descriptors) for the CM is much larger than that of CM-inter (666 vs 324). The larger dimension of the feature space leads to higher sparseness in the data, and consequently, more data are needed to achieve the same level of accuracy.

The best KRR model, which employs the full CM with 120 000 training samples, yielded a testing MAE of 10.85 meV. While it is still possible to further reduce the error with even more training data, employing KRR with 120 000 or more samples is impractical in both training and prediction. It has been reported that training KRR with more than 80 000 samples would require more than 100 gigabytes (GB) of memory.^{55} Moreover, for prediction, KRR requires computing the kernels with all the training samples to obtain a coupling value, so both the memory requirement and the computational cost of the KRR model increase with the number of training data.

For the ANN, on the other hand, we noted that the best ANN model using the full CM with the same size of training dataset, *N*_{T} = 120 000, can reach a testing MAE of 8.09 meV (data not shown). Moreover, it is much easier to train and apply an ANN model with a much larger number of samples. Thus, to further improve model accuracy, all the ANN models discussed in Secs. IV B and IV C were trained with a training dataset of 240 000 samples. Since the size of the training dataset was fixed, the comparison between distinct CM representations is based on the averaged MAE over the epochs. In Fig. 7(b), the best averaged MAEs for ANN-1 models using CM and CM-inter are 6.88 meV and 10.04 meV, respectively. The other ANN models follow the same trend (data not shown). The higher accuracy of the ANN-CM model indicates that the ANN can automatically extract the important feature elements in the nonlinear domain. This property of ANNs has also been verified in the case of image classification.^{90}

Concerning the model performance of KRR vs the ANN, we first compare the model complexity, judged by the total number of fitting parameters, such as the coefficients *α* for KRR [Eq. (S1) in the supplementary material] and **W** in Eq. (2). It is believed that a complex model with more parameters leads to a higher accuracy on the training set, but there is also a risk of over-fitting (or poor generalization). The number of KRR fitting parameters equals the number of training samples *N*_{T}; the KRR models trained with 120 000 samples have 1.2 × 10^{5} parameters. On the other hand, the number of ANN fitting parameters depends on the given network architecture, as described in Sec. II C. The number of **W** elements for our current ANN models is about 2.2 × 10^{6} and 9.8 × 10^{5} for CM and CM-inter, respectively. As listed in Table I, it is not surprising that the testing MAE of the ANN models (∼6.5 meV) is smaller than that of the KRR models (∼11 meV), given the higher model complexity of the ANN. However, it is KRR that suffers from over-fitting: the difference between the training MAE and the testing MAE of the KRR-CM model is larger than 9 meV, whereas for all the ANN models, the difference is less than 2 meV (see Table I). The reason for the success of ANNs in generalization has been discussed in the literature, but it remains an open question.^{83,87,91}

### B. Reproducing distance and orientation dependence

Analysis of the error on the testing dataset (e.g., the testing MAE) is the most common way to evaluate ML performance. Additional checks of the physical characteristics of the predicted quantity are also important. In the case of charge-transfer coupling, the important properties include the exponential distance dependence and the orientation dependence.^{61,64,92} The exponential distance dependence arises from the asymptotic exponential decay of the wavefunction. The distance dependence can be described as

|*V*(*d*)| = |*V*_{0}|e^{−*βd*},

where *d* is the intermolecular distance and *β* is the decay rate. Typically, the value of *β* varies between 1 Å^{−1} and 1.5 Å^{−1}. The orientation of the interacting pair involves other intermolecular degrees of freedom with large variations in the phase of the wavefunctions and, consequently, in the coupling values. These characteristics have typically not been verified in previous work. In the present work, we tested the distance and orientation dependence for the five ANN models and the KRR-CM model.
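Assuming the standard exponential form |*V*(*d*)| = |*V*_{0}|exp(−*βd*), the decay rate *β* can be extracted from a set of coupling values by a linear fit of ln|*V*| against *d*. A minimal sketch with synthetic data:

```python
import numpy as np

def fit_decay_rate(d, V):
    """Estimate beta and |V0| from |V(d)| = |V0| * exp(-beta * d)
    via linear least squares on ln|V| versus d."""
    d = np.asarray(d, dtype=float)
    lnV = np.log(np.abs(np.asarray(V, dtype=float)))
    slope, intercept = np.polyfit(d, lnV, 1)
    return -slope, np.exp(intercept)   # beta (1/angstrom), |V0| (meV)

# Synthetic check: generate couplings with beta = 1.2 / angstrom.
d = np.linspace(3.5, 7.0, 8)
V = 500.0 * np.exp(-1.2 * d)
beta, V0 = fit_decay_rate(d, V)
```

The same fit applied to QC and ML couplings of a separated pair provides a quantitative comparison of the decay rates, as done in the distance-dependence tests below.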

To test the distance dependence, we arbitrarily selected three naphthalene pairs and separated each pair along the vector connecting the centers of mass. We then compared the corresponding electronic couplings obtained from the QC calculation and from the ML models listed in Table I. In Fig. 8(a), the KRR-CM model failed to capture the exponential decay with increasing distance for one of the pair conformations. In Figs. 8(b)–8(f), most of the ANN models successfully captured the distance dependence, but the results of the ANN-1 and ANN-5 models show significant deviations from the QC calculation for the third pair. It is worth noting that these two models have the lowest testing MAE among the five selected ANN models (see Table I). These results indicate that the ANN model with the lowest testing MAE does not guarantee that the distance dependence is reproduced.

Next, to check the orientation dependence, we arbitrarily picked two naphthalene pairs and performed a scan by rotating one naphthalene. In Figs. 9(a)–9(g), it is seen that most of the models capture the major pattern obtained by the QC calculation. A closer inspection shows that only the ANN-2 and ANN-3 models reproduced the regions in deep black or crimson in Fig. 9(a). These regions correspond to pair conformations with coupling strengths larger than 100 meV. In Fig. 9(g), the black regions marked by red dashed ovals indicate that the ANN-5 model severely overestimates the coupling strength, by more than 4000 meV (also shown in Fig. S2 in the supplementary material). We note that this failure in reproducing the orientation dependence did not occur when ANN-5 was tested on a second naphthalene pair, pair-II, as seen in Figs. 9(h)–9(n). We also include the distribution of the error, the difference between *V*^{ML} and *V*^{QC}, in the supplementary material; the overall error distribution of ANN-5 is hardly different from those of the other models. These results show that further checks of the distance and orientation dependences are necessary, because the ANN model with the lowest testing MAE does not guarantee reproducing these conformation dependences. The failure to reproduce these dependences could be attributed to the choice of loss function, which we discuss below (Sec. IV C).

An additional check of model performance is whether the ML model evaluates the phase (sign) of the electronic coupling accurately; incorrect predictions are shown as blue dots in Fig. 9. The accuracy of phase prediction for all the selected ANN models is higher than 90%. Most importantly, incorrect phase predictions occur only in the transition regions where the electronic coupling changes sign. In QC-based modeling, an inconsistent phase leads to an incorrect evaluation of the band structure or charge mobility, and therefore, it is necessary to keep track of the sign of the coupling values. The high predictive power of ML addresses this problem effectively.
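The phase-accuracy metric used here can be sketched as a simple sign comparison between predicted and reference couplings (the function name is ours):

```python
import numpy as np

def phase_accuracy(V_pred, V_ref):
    """Fraction of pairs whose predicted coupling has the same sign
    as the reference (QC) coupling."""
    V_pred = np.asarray(V_pred, dtype=float)
    V_ref = np.asarray(V_ref, dtype=float)
    return float(np.mean(np.sign(V_pred) == np.sign(V_ref)))
```

Sign errors concentrated near zero crossings, as observed here, matter little in practice, since the corresponding coupling magnitudes are small.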

### C. Relationship between the loss function and the data structure

In the present work, we employed the most commonly used regression loss functions, *L*_{MAE}, *L*_{MSE}, and *L*_{Huber} in Eqs. (3)–(5). A loss function allows us to quantify the error of any particular set of weights and to find the set of weights that minimizes the loss. It also estimates how closely the predictions match the training dataset, and therefore, the statistics of the loss function values can be informative. In our case, the total dataset contains 250 000 samples, and their coupling values span several orders of magnitude: 42.6% of the naphthalene pairs have electronic couplings smaller than 10 meV, 45.7% range from 10 meV to 100 meV, and only 11.7% are larger than 100 meV. On one hand, the naphthalene pairs with coupling values larger than 100 meV can be regarded as *outliers* that differ significantly from the majority of data points. On the other hand, these pairs represent conformations in which the molecular orbitals overlap strongly, and they are the most important ones for enabling charge transfer in applications.
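For reference, the three losses can be sketched in their standard forms; this is our own minimal implementation, assuming errors in meV and the 1-meV switching point quoted for Eq. (5) below:

```python
import numpy as np

def loss_mae(err):
    """Mean absolute error: robust to outliers."""
    return float(np.mean(np.abs(np.asarray(err, dtype=float))))

def loss_mse(err):
    """Mean squared error: weights large errors (outliers) heavily."""
    err = np.asarray(err, dtype=float)
    return float(np.mean(err ** 2))

def loss_huber(err, delta=1.0):
    """Huber loss: quadratic for |err| <= delta, linear beyond.
    delta = 1.0 meV follows the switching point stated for Eq. (5)."""
    a = np.abs(np.asarray(err, dtype=float))
    return float(np.mean(np.where(a <= delta,
                                  0.5 * a ** 2,
                                  delta * (a - 0.5 * delta))))
```

Because the quadratic branch of the MSE amplifies the contribution of the rare large-coupling pairs, models trained with it are pulled harder toward fitting those conformations, which is the mechanism behind the trend discussed below.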

Generally speaking, the MAE loss is more robust to outliers than the MSE loss. For each loss, a total of 154 models were trained with the different *γ* and *N*_{BS} values listed in Sec. II C. We note that more models demonstrate good performance [as defined by a testing MAE lower than 7.2 meV or by a testing root mean square error (RMSE) lower than 10 meV] when they were trained with the MAE loss, as listed in Table II.

| Models trained with^{a} | *L*_{MAE} | *L*_{MSE} | *L*_{Huber} |
|---|---|---|---|
| Testing MAE < 7.2 meV | 98 (63.6%)^{b} | 65 (42.2%) | 64 (41.5%) |
| Failed in the orientation dependence | 9 (9.2%)^{c} | 3 (4.6%) | 10 (15.6%) |
| Testing RMSE < 10 meV | 79 (51.3%) | 62 (40.3%) | 62 (40.3%) |
| Failed in the orientation dependence | 6 (7.6%)^{d} | 2 (3.2%) | 10 (16.1%) |

^{a}With different learning rates and batch sizes, the total number of models trained with a particular loss function is 154. The mean absolute error (MAE) and root mean square error (RMSE) quantify the model performance on the testing dataset.

^{b}Number of models passing the criterion. In parentheses are the percentages over the 154 models.

^{c}Number of models that failed the orientation dependence tests and their percentage (in parentheses) among those satisfying the small-MAE criterion.

^{d}Number of models that failed the orientation dependence tests and their percentage (in parentheses) among those satisfying the small-RMSE criterion.

It is seen that models trained with the MAE loss were more likely to perform well because they better match the major distribution of coupling values. However, when evaluating the orientation dependence, which can involve many conformations with |*V*| > 100 meV, we encountered more problematic cases, such as ANN-5, among the good-performing models. We found that the percentages failing the orientation tests are 9.2% (9/98), 4.6% (3/65), and 15.6% (10/64) for models trained with *L*_{MAE}, *L*_{MSE}, and *L*_{Huber}, respectively. Evidently, the models trained with *L*_{MSE}, which gives more weight to the outliers, reduced the probability of these severe and somewhat unexpected problems in the orientation dependence. We note that the trend is very similar even if we change the criterion of "good-performing" to the testing RMSE (Table II).

The Huber loss is expected to combine the good properties of the MAE and MSE loss functions: it is quadratic when the error is small and reduces to the absolute error otherwise. In the present work, the switching point (the size of the error at which the loss changes form) was chosen to be 1 meV, as shown in Eq. (5). This choice is critical because it determines the sensitivity to the outliers. For instance, the performance of some models, such as ANN-5, was not better than that of models trained with the MAE or MSE loss in the check of the orientation dependence. The switching point is a hyperparameter, and determining its optimal value is challenging because the electronic coupling values span several orders of magnitude. Moreover, the expression of the loss function can be customized to incorporate the physical nature of the predicted quantity, as demonstrated by several successful examples of predicting potential energy surfaces.^{18,43} Recently, the "phase-less loss,"^{31} which skips the preprocessing of the phase-matching scheme, was designed to predict couplings of excited states with wavefunction-like features. Developing a good loss function for the prediction of electronic coupling will be one of our future directions. So far, the MSE loss gives the best performance while remaining simple to implement.

### D. Other perspectives

#### 1. Computational cost

Applying the novel technologies of ML to calculate electronic coupling offers a promising route to studying charge mobility or polaron dynamics in organic semiconductors, where a large distribution of molecular configurations can exist. For example, to investigate polaron dynamics, the Hamiltonian of a system with thousands of molecules is needed at each time step. However, it is almost impossible to build a high-quality Hamiltonian by traditional means. In our case, building a Hamiltonian for 500 naphthalenes in the condensed phase requires more than 60 000 electronic-coupling calculations for a single time step, which takes more than 1800 core hours with the FMO calculation (110 s/pair × 60 000 pairs) on a reasonably configured modern computer (Intel Xeon Silver 4216). The KRR model with 120 000 training samples takes about 1 min on one central processing unit (CPU) core (1.1 × 10^{−3} s/pair × 60 000 pairs) once all the structures are loaded in memory. By contrast, it takes only 0.16 s for an ANN model to finish the same calculation (2.6 × 10^{−6} s/pair × 60 000 pairs) with graphics processing unit (GPU) acceleration (NVIDIA Tesla P100). With ML, the charge mobility or polaron dynamics in organic molecular solids or conjugated polymers can be investigated with much more realistic models.
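The quoted totals follow directly from the measured per-pair timings; a back-of-the-envelope check:

```python
# Cost of building one 60 000-pair Hamiltonian from the per-pair timings
# quoted above (FMO on one CPU core; KRR on one CPU core; ANN on a GPU).
pairs = 60_000
fmo_core_hours = 110.0 * pairs / 3600.0   # ~1833 core hours (">1800")
krr_seconds = 1.1e-3 * pairs              # ~66 s ("about 1 min")
ann_seconds = 2.6e-6 * pairs              # ~0.156 s ("0.16 s")
speedup_vs_fmo = (110.0 * pairs) / ann_seconds
```

The roughly seven-orders-of-magnitude speedup of the ANN over the FMO calculation is what makes per-time-step Hamiltonian construction feasible.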

#### 2. Improvement with a scaling factor

The FMO approach we used to obtain the coupling is based on the 1-electron orbital interaction. Coupling values for two naphthalenes have also been reported from spin-component-scaled approximate coupled-cluster (SCS-CC2) calculations in the literature.^{93,94} In Refs. 93 and 94, it was found that a simple scaling relation connects the couplings obtained from different levels of theory. With the data included in Sec. SVI of the supplementary material, we found that the couplings from our ML models can be scaled by factors of 1.19 and 1.17 for ANN-2 and ANN-3, respectively, and the results became very close to those derived from SCS-CC2. The exponential-decay factors obtained in these tests are 1.54 Å^{−1}, 1.55 Å^{−1}, 1.62 Å^{−1}, and 1.63 Å^{−1} for SCS-CC2, FMO, ANN-2, and ANN-3, respectively.
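A scaling factor of this kind can be obtained as a least-squares fit through the origin between the two sets of couplings; this is a generic sketch (our own function, not the procedure of Refs. 93 and 94):

```python
import numpy as np

def best_scale(V_ml, V_ref):
    """Least-squares scaling factor s minimizing ||s * V_ml - V_ref||^2,
    i.e., a linear fit through the origin: s = <V_ml, V_ref> / <V_ml, V_ml>."""
    V_ml = np.asarray(V_ml, dtype=float)
    V_ref = np.asarray(V_ref, dtype=float)
    return float(np.dot(V_ml, V_ref) / np.dot(V_ml, V_ml))
```

Applied to ML couplings against SCS-CC2 reference values, such a fit yields a single multiplicative correction like the factors of 1.19 and 1.17 quoted above.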

#### 3. The coupling at very large distances

The ML approach opens up a new way to evaluate electronic coupling, but the present ML models can fail for pairs with intermolecular distances larger than 7 Å. For instance, in Figs. 6(f) and 6(h), the data in the small vertical line at *V*^{QC} = 0 correspond to naphthalene pairs at large separation distances (see Sec. SVII in the supplementary material for further discussion). This problem probably arises from insufficient sampling in the long-range region, which can be fixed by an improved sampling strategy.^{95–98} Since these pairs are known to have very small coupling, in practice, they can be neglected in the Hamiltonian of charge transport models in a condensed phase.
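In practice, such a neglect can be implemented as a simple distance cutoff on the predicted couplings before assembling the Hamiltonian (a sketch under the 7 Å threshold mentioned above; the function is ours):

```python
import numpy as np

def apply_distance_cutoff(V, d, cutoff=7.0):
    """Zero out couplings of pairs separated by more than `cutoff` angstrom,
    where the ML prediction is unreliable but the true coupling is negligible."""
    V = np.asarray(V, dtype=float).copy()
    V[np.asarray(d, dtype=float) > cutoff] = 0.0
    return V
```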

#### 4. Other limitations

Despite the high efficiency and good performance, the ML model obtained with this approach is not transferable. For a new molecule, one needs to carry out the processes depicted in Fig. 1 again. To address this issue, developing a transferable input feature is an alternative strategy. The electronic coupling is related to the overlap between molecular orbitals, so Gaussian-like or orbital-like expressions could be potential input features.^{22,23,31,40,72,99} In addition, the bottleneck in applying ML is still the accuracy of the trained model. So far, the two best models in the present work are ANN-2 and ANN-3, with a testing MAE of ∼6.5 meV. By applying the scaling factor, the ML model can reach the accuracy of a benchmark *ab initio* level, which is already very useful. Testing more advanced ANN algorithms, such as generative adversarial networks or residual neural networks, or ANN optimization schemes remains an important option.

#### 5. Future perspectives

The development of ML models for electronic coupling is only beginning. Learning from the development history of classical force fields for MD simulations in the 1990s, we believe that one possible pathway is to build a universal database of ML models for individual molecules before a transferable feature model becomes available. In the present work, we used a simple network architecture to address fundamental numerical and statistical insights. We believe that our results will help the computational chemistry community establish more comprehensive ANN models for other molecular species. Since the ML approach greatly reduces the computational effort of calculating electronic coupling, subsequent applications are no longer limited by the computational cost.

## V. CONCLUSIONS

Machine learning has promising potential for addressing the time-consuming problem of massively calculating charge-transfer couplings. However, establishing a reliable ML model for complex molecules in the condensed phase is still at a developing stage. In this work, we developed a scheme to build an ANN model. We proposed an effective way to optimize the learning rate and batch size, discussed strategies for improving model performance, and showed the importance of checking the model against the distance and orientation dependences. The relationship between the loss function and the data structure was characterized. In addition, we investigated the effect of the feature representation on the ANN and KRR models and systematically compared their predictive power. The best ANN models (ANN-2 and ANN-3) achieved a testing MAE of ∼6.5 meV with good distance and orientation dependences. Most importantly, it takes only a few microseconds to calculate the coupling value for a single molecular pair. We believe that the reported schemes and statistical insights can serve as a foundation for developing more advanced ML algorithms in the near future.

## SUPPLEMENTARY MATERIAL

See the supplementary material for additional computational details of molecular dynamics simulations and the kernel ridge regression scheme, strategies employed in improving ANN performance (including batch normalization, feature normalization, and early stops), learning curves of the selected ANN models, comparison with other *ab initio* calculations, and prediction for pairs with large intermolecular distances.

## ACKNOWLEDGMENTS

We are grateful for the support from the Academia Sinica Investigator Award (Grant No. AS-IA-106-M01) and the Ministry of Science and Technology of Taiwan (Project Nos. 105-2113-M-001-009-MY4 and 109-2113-M-001-022-MY4).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in GitHub at https://github.com/cherrihsu/MachineLearning.