Recent developments in machine learning have enabled accurate predictions of the dynamics of slow structural relaxation in glass-forming systems. However, existing machine learning models for these tasks are mostly designed to learn a single dynamic quantity and relate it to the structural features of glassy liquids. In this study, we propose a graph neural network model, “BOnd TArgeting Network,” that learns relative motion between neighboring pairs of particles in addition to the self-motion of particles. By relating the structural features to these two different dynamical variables, the model autonomously acquires the ability to discern how the self-motion of particles undergoing slow relaxation is affected by two distinct dynamical processes, strain fluctuations and particle rearrangements, and can thus predict with high precision how slow structural relaxation develops in space and time.

## I. INTRODUCTION

When a liquid is rapidly cooled, it freezes into a glass, which retains a random amorphous structure. This phenomenon, referred to as the glass transition, is ubiquitous, but its origin remains to be clarified and has been an important research topic for decades.^{1,2} Along with the drastic slowdown of atomic motion accompanying the glass transition, localized domains of particles that rearrange more readily than others grow.^{3–6} Identifying the structural origin of this phenomenon has long been a central problem in the field. Structural analyses of local geometric orders^{7–9} and of real-space normal modes of vibration derived from static structures^{10–14} have accumulated evidence for a structure–dynamics correspondence. However, the random, featureless structures of glasses still prevent us from fully understanding this origin and from predicting the dynamics based on the static structure.

Recently, machine learning methodologies have enabled the accurate determination of structure–dynamics correspondence in glassy systems.^{15–23} Among them, graph neural networks (GNNs),^{24} a type of deep learning algorithm that operates on a graph, have been found to be suitable.^{20} GNNs had already proven to be useful for making predictions of material properties by directly representing structures of molecules or atomic clusters as nodes and edges of the graph.^{25–28} Bapst *et al.*^{20} applied this ability of GNNs to the task of predicting particle mobility in a glass-forming liquid from its static structure and showed that GNNs outperform other machine learning methodologies. This high accuracy in the predictions of glassy dynamics has enabled the development of inverse modeling as a further application.^{29} It has also led to the development of a cost-effective linear-regression (LR) model that can overcome the high computational cost of GNNs but still achieves equivalent precision in its prediction.^{21,22}

In the study by Bapst *et al.*,^{20} the prediction accuracy of GNNs was found to exceed not only that of other machine learning methods but also that of normal mode analysis at all temperatures accessible by simulations and across all temporal scales up to the structural relaxation time. Normal mode analysis is a method to predict the spatial distribution of short-time vibrations. Because short-time vibrations (soft modes) are intimately linked to the slow *α* relaxations,^{9,30–34} normal mode analysis yields a strong baseline of physics-based prediction of glassy dynamics from a static structure.^{10,35} Nevertheless, the prediction accuracy of GNNs is relatively low for the fast *β* dynamics compared to the slow *α* relaxation.^{20,22} We may therefore conjecture that the fast dynamics is difficult to predict because of elastic vibrations, which are averaged out once the short-time dynamics is coarse-grained. This decline in prediction accuracy can be rectified, as shown by a recent study,^{36} if short-time vibrations obtained from the actual simulations are encoded. However, there is still no methodology for making predictions solely from a static structure that bypasses the challenges caused by such transient fluctuations.

In this article, we propose a GNN model named “BOnd TArgeting Network (BOTAN)” that realizes highly precise predictions of the entire glassy relaxation from a static structure. BOTAN can learn a target quantity assigned to edges, corresponding to pairs of close-by particles, in addition to the previous target quantity assigned to nodes. BOTAN can be regarded as a straightforward but powerful extension of the GNN used in the previous study by Bapst *et al.*,^{20} which we term the “node-targeting GNN” (NT-GNN). In this study, this feature of BOTAN is exploited to characterize and learn relative motion: the model can characterize structural rearrangements unaffected by elastic fluctuations, as well as the degree of local elastic strain. We show that this modeling allows for greatly improved predictions of the complex dynamics of slow structural relaxation in glassy systems.

## II. MACHINE LEARNING MODEL

In this section, we explain the neural network architecture of BOTAN. Similar to NT-GNN, it is based on the interaction network^{37,38} consisting of an encoder–process–decoder architecture, where features of nodes and edges in a graph are mutually computed by exchanging messages between a pair of two-layer multilayer perceptrons (MLPs) assigned for nodes and edges. This architecture is suitable for extracting intricate relationships from simulation data computed with high precision. Because the network treats node and edge features equivalently, it is possible to regress the network toward the target quantities on edges as well as nodes^{39} by decoding edge embeddings using an MLP.

Let *G*(*V*, *E*) be a graph that includes a set of nodes *V* and edges *E*. For a pair of nodes *v*, *u* ∈ *V*, an edge between *v* and *u* is represented as *e*_{v,u}. The neighboring edges of a vertex *v* ∈ *V* are represented as *N*(*v*) (⊂ *E*). Given input feature vectors $h_v^{\mathrm{in}}$ and $h_e^{\mathrm{in}}$ for a node *v* ∈ *V* and an edge *e* ∈ *E*, respectively, the goal is to compute the corresponding output features $h_v^{\mathrm{out}}$ and $h_e^{\mathrm{out}}$. As shown in the block diagram in Fig. 1, and denoting an encoder, a multilayer perceptron (MLP), and a decoder as EN(·), MLP(·), and DE(·), respectively, each layer used to compute the feature vectors is formulated as follows. First, the input feature vectors are encoded by the respective encoders,

$$h_v^{0} = \mathrm{EN}_v\left(h_v^{\mathrm{in}}\right), \qquad h_e^{0} = \mathrm{EN}_e\left(h_e^{\mathrm{in}}\right). \tag{1, 2}$$

The encoded features are then updated *n* times, where *m* stands for the iteration index of the *n* repeat cycles (0 < *m* ≤ *n*). In each cycle, messages are passed between nodes and edges, wherein the encoded feature vectors are concatenated together before every update,

$$h_{e_{v,u}}^{m} = \mathrm{MLP}_e\left(\left[h_{e_{v,u}}^{m-1},\; h_v^{m-1},\; h_u^{m-1},\; h_e^{0}\right]\right), \tag{3}$$

$$h_{v}^{m} = \mathrm{MLP}_v\left(\left[h_v^{m-1},\; \textstyle\sum_{e \in N(v)} h_e^{m},\; h_v^{0}\right]\right). \tag{4}$$

For the results that will be presented in this article, it is critical that an edge receives messages from both of its neighbor nodes *v* and *u* in each repeated cycle of message passing in Eq. (3), as indicated in the inset of Fig. 1 (“A. Edge update”; an edge is updated bidirectionally, based on the states of the nodes on both sides). Furthermore, at each stage of these updates, the encoded information, $h_e^0$ or $h_v^0$, is concatenated so that the training becomes stable. The decoders DE(·), placed as the final layers, yield the output feature on each node and each edge,

$$h_v^{\mathrm{out}} = \mathrm{DE}_v\left(h_v^{n}\right), \qquad h_e^{\mathrm{out}} = \mathrm{DE}_e\left(h_e^{n}\right). \tag{5, 6}$$

In each cycle of message passing, structural information within the edge cutoff distance (2.0*σ*_{AA}; see Sec. IV) is propagated; therefore, the number of repeat cycles *n* determines how far the structural information can be incorporated.^{20,21} The prediction accuracy of the model declines as *n* decreases, as shown in Appendix A. In this study, we set *n* = 7, as in the previous study by Bapst *et al.*
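The update cycle above can be sketched in plain numpy. This is a minimal illustration, not the actual BOTAN implementation: the two-layer MLPs below are stand-ins with externally supplied weights, the hidden width is arbitrary, and details of the real network (normalization, biases, initialization) are omitted.

```python
import numpy as np

def mlp(w1, w2, x):
    # Stand-in two-layer perceptron with ReLU, playing the role of MLP(.)
    return np.maximum(w2 @ np.maximum(w1 @ x, 0.0), 0.0)

def message_pass(h_v, h_e, h_v0, h_e0, edges, w_edge, w_node):
    """One cycle of Eqs. (3)-(4): each edge is updated from both endpoint
    nodes plus its initial encoding; each node then aggregates the updated
    incident edges plus its own initial encoding."""
    d = h_v.shape[1]
    h_e_new = np.empty_like(h_e)
    for k, (v, u) in enumerate(edges):
        h_e_new[k] = mlp(*w_edge, np.concatenate([h_e[k], h_v[v], h_v[u], h_e0[k]]))
    h_v_new = np.empty_like(h_v)
    for v in range(h_v.shape[0]):
        agg = np.zeros(d)
        for k, (a, b) in enumerate(edges):
            if v in (a, b):
                agg += h_e_new[k]  # sum over edges in N(v)
        h_v_new[v] = mlp(*w_node, np.concatenate([h_v[v], agg, h_v0[v]]))
    return h_v_new, h_e_new
```

Applying this function *n* times, with the decoders as final linear layers, reproduces the encode–process–decode flow described above.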

## III. DATASET DETAILS

The dataset is generated by molecular dynamics simulations of the Kob–Andersen binary mixture,^{41,42} a canonical model of glass-forming liquids. Interactions between the particles are determined by a Lennard-Jones pairwise potential^{40} wherein the pairwise force is smoothed at its cutoff *r* = *r*_{c}; the cutoff lengths are set differently depending on the combination of pair species as *r*_{c} = 2.5*σ*_{μν}. Here, *μ*, *ν* ∈ {A, B} denote the particle types, where the number of particles belonging to type A (denoted as *N*_{A}) is 80% of the total number *N* = 4096. The interaction energies and particle sizes are defined as *ɛ*_{AA} = 1.0, *ɛ*_{AB} = 1.5, *ɛ*_{BB} = 0.5, *σ*_{AA} = 1.0, *σ*_{AB} = 0.8, and *σ*_{BB} = 0.88. Throughout this study, therefore, distances, time, and temperature are given in units of *σ*_{AA}, $\tau = \sqrt{m\sigma_{AA}^2/\epsilon_{AA}}$, and *ɛ*_{AA}/*k*_{B}, with *k*_{B} the Boltzmann constant. In the remainder of this article, we use these dimensionless Lennard-Jones units, and the two species of the mixture are denoted as types A and B.

Simulation conditions follow the previous study by Bapst *et al.*,^{20} except that the system size is fixed to satisfy *N*/*V* = 1.2. The time step of numerical integration is kept at Δ*t* = 10^{−3} throughout. We set three target temperatures: *T* = 0.44, 0.50, and 0.64. In Fig. 2, the self-parts of the intermediate scattering functions for particles of type A are shown (together with those at *T* = 0.47 and 0.56), where the wavenumber *k* is set at 2*π*/*σ*_{AA}. The *α*-relaxation times *τ*_{α} are evaluated as 4120, 217, and 16.5 for *T* = 0.44, 0.50, and 0.64, respectively, by fitting to a double stretched exponential function.^{44,45} In this study, the time points *t* = 130 (0.03*τ*_{α}) and 4120 (*τ*_{α}) are mainly investigated at the lowest temperature *T* = 0.44; these are indicated in Fig. 2 with vertical dotted lines.
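For reference, the pair potential can be sketched as follows. The particular smoothing scheme of Ref. 40 is not reproduced here; this sketch instead uses a common alternative (an assumption for illustration only) that shifts the potential and force so that both vanish continuously at *r*_{c}.

```python
import numpy as np

# Kob-Andersen parameters in Lennard-Jones units (type A is 80% of particles)
EPS = {("A", "A"): 1.0, ("A", "B"): 1.5, ("B", "B"): 0.5}
SIG = {("A", "A"): 1.0, ("A", "B"): 0.8, ("B", "B"): 0.88}

def lj(r, eps, sig):
    x6 = (sig / r) ** 6
    return 4.0 * eps * (x6 * x6 - x6)

def dlj(r, eps, sig):
    # derivative dV/dr of the bare Lennard-Jones potential
    x6 = (sig / r) ** 6
    return 4.0 * eps * (-12.0 * x6 * x6 + 6.0 * x6) / r

def smoothed_lj(r, mu, nu):
    """Pair potential shifted so that both V and the force vanish at
    r_c = 2.5 sigma_{mu,nu} (one common smoothing scheme, not Ref. 40's)."""
    key = tuple(sorted((mu, nu)))
    eps, sig = EPS[key], SIG[key]
    rc = 2.5 * sig
    if r >= rc:
        return 0.0
    return lj(r, eps, sig) - lj(rc, eps, sig) - (r - rc) * dlj(rc, eps, sig)
```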

For each of these three target temperatures, 500 independent particle configurations are generated as follows: After keeping the temperature at *T* = 5.0 for the time lapse of *t*_{h} = 10^{4}, the system is cooled rapidly (in *t*_{cool} = 20) to the target temperature. The temperature is subsequently kept constant by using the Nosé–Hoover thermostat until the system reaches a steady state up to the time scale of 40*τ*_{α}. The final particle configuration thus obtained is used as the initial configuration at *t* = 0 in the production run for generating the dataset.

From the 500 initial configurations, we introduce the isoconfigurational ensemble^{10,43} by running 32 separate microcanonical (*NVE*) simulations with each starting from the same configuration; however, the initial velocities are randomly given from the Maxwell–Boltzmann distribution at the target temperature. These simulations are conducted approximately up to 30*τ*_{α}, where the particle configurations are sampled at logarithmically increasing time points, as summarized in Table I. Among the 500 initial configurations, 400 configurations (12 800 runs) are used for training and the other 100 (3200 runs) are used for testing.
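Drawing the initial velocities for the isoconfigurational runs can be sketched as below; this is a simplified illustration, and the removal of the net momentum is an assumed detail not stated in the text.

```python
import numpy as np

def isoconfig_velocities(n_particles, mass, temperature, n_runs, seed=0):
    """Independent Maxwell-Boltzmann velocity sets (k_B = 1) for the
    isoconfigurational ensemble: same positions, fresh velocities per run."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(temperature / mass)  # per-component thermal width
    runs = rng.normal(0.0, sigma, size=(n_runs, n_particles, 3))
    runs -= runs.mean(axis=1, keepdims=True)  # remove net momentum drift
    return runs
```

Each of the 32 velocity sets would then seed one *NVE* run from the shared initial configuration.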

TABLE I. Time points *t* at which particle configurations are sampled for each target temperature.

| Tagging | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| *T* = 0.44 | 0.13 | 1.30 | 13.0 | 130 | 412 | 1300 | 4120 | 13 000 | 41 200 | 130 000 |
| *T* = 0.50 | 0.108 | 0.434 | 2.17 | 6.86 | 21.7 | 68.6 | 217 | 686 | 2 170 | 6 860 |
| *T* = 0.64 | 0.100 | 0.165 | 0.52 | 1.65 | 5.2 | 16.5 | 52 | 165 | 520 | |

## IV. PROCEDURES OF MACHINE LEARNING

A graph for training or prediction tests is constructed from the particle configuration, wherein the particles are represented by nodes. Edges are assigned to pairs of particles *i* and *j* within a distance threshold, *r*_{ij}(*t* = 0) ≤ *r*_{e}, so that the neighborhood relation can be represented. We choose the threshold length *r*_{e} = 2.0*σ*_{AA}, which has the best predictive performance regarding particle propensity.^{20} The constructed graphs are then fed into the model by encoding particle types (A or B) into the nodes and 3D relative positions between particles (*r*_{ij} = *r*_{i} − *r*_{j}) into the edges. Specifically, for input feature vectors $h_v^{\mathrm{in}} \in \{\mathrm{A},\mathrm{B}\}$ and $h_e^{\mathrm{in}}$ (= *r*_{ij}) $\in \mathbb{R}^3$, the first layers of the node and edge encoders [Eqs. (1) and (2)] are implemented by linear transformations $f_n: \mathbb{R} \to \mathbb{R}^d$ and $f_e: \mathbb{R}^3 \to \mathbb{R}^d$, respectively, where *d* is the dimension of the hidden layers (64 in our case). For the node feature, types A and B are converted into the real numbers 0 and 1 beforehand, and the input features on both nodes and edges are treated as real numbers with 32-bit single floating-point precision.
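Constructing the input graph can be sketched as follows; this is a plain-numpy illustration assuming a cubic periodic box, whereas the actual implementation uses PyTorch Geometric.

```python
import numpy as np

def build_graph(pos, box, r_e=2.0):
    """Connect particle pairs with minimum-image distance <= r_e and return
    the relative positions r_ij = r_i - r_j used as edge input features."""
    n = len(pos)
    senders, receivers, rel = [], [], []
    for i in range(n):
        d = pos[i] - pos              # r_ij = r_i - r_j for all j
        d -= box * np.round(d / box)  # minimum-image convention
        dist = np.linalg.norm(d, axis=1)
        for j in np.nonzero((dist <= r_e) & (np.arange(n) != i))[0]:
            senders.append(i)
            receivers.append(j)
            rel.append(d[j])
    return np.array(senders), np.array(receivers), np.array(rel)
```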

In the previous studies,^{20,22} only the node output feature $h_v^{\mathrm{out}}$ was exploited for learning and prediction tests, whereas in BOTAN the edge output feature $h_e^{\mathrm{out}}$ is additionally extracted for these tasks. In this study, the target quantities for training and prediction tests are the particle propensity, defined from the displacement (traveling distance) of each particle *i*,

$$S_i(t) = \left\langle \left| \mathbf{r}_i(t) - \mathbf{r}_i(0) \right| \right\rangle_{\mathrm{IC}},$$

and the pair-distance change, which characterizes the *relative motion* between neighbor pairs,

$$E_{ij}(t) = \left\langle \left| \mathbf{r}_{ij}(t) \right| - \left| \mathbf{r}_{ij}(0) \right| \right\rangle_{\mathrm{IC}},$$

where ⟨·⟩_{IC} denotes the average over the isoconfigurational ensemble.
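Given isoconfigurational trajectory data, these two targets can be computed as in the following sketch; the array layout and the assumption of unwrapped coordinates are ours.

```python
import numpy as np

def propensity_and_pair_change(traj0, trajt, pairs):
    """traj0, trajt: (n_runs, N, 3) unwrapped positions at t = 0 and t.
    Returns S_i (isoconfigurational mean displacement magnitude) and
    E_ij (mean change of pair distance) for the given (i, j) pairs."""
    S = np.linalg.norm(trajt - traj0, axis=2).mean(axis=0)
    i, j = pairs[:, 0], pairs[:, 1]
    d0 = np.linalg.norm(traj0[:, i] - traj0[:, j], axis=2)
    dt = np.linalg.norm(trajt[:, i] - trajt[:, j], axis=2)
    E = (dt - d0).mean(axis=0)
    return S, E
```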

The learning setups and hyperparameters in the present study are similar to those used by Bapst *et al.*^{20} The training dataset is learned to minimize the loss function, wherein the neural network is optimized using the Adam algorithm without norm regularization, with a learning rate of 10^{−4} and a standard implementation in PyTorch. The data loaded for training are augmented by applying random rotations of the simulation box to the particle positions, as all the target quantities for learning are scalar variables that are rotationally invariant.
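The rotation augmentation can be sketched as follows; the QR-based sampling of a uniformly random rotation is an implementation choice of ours, not necessarily that of the original code.

```python
import numpy as np

def random_rotation(rel_vectors, rng):
    """Augment edge features by a random 3D rotation; scalar targets such
    as S_i and E_ij are rotation invariant, so the labels stay unchanged."""
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs for a unique Q
    if np.linalg.det(q) < 0:          # enforce a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return rel_vectors @ q.T
```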

The loss function for simultaneous training on nodes and edges is set as a weighted sum of the $L_2$-norm losses for the two target quantities,

$$\mathcal{L}_M = p \sum_{j} \left[ \hat{S}_j(t, \{\mathbf{r}_i\}) - S_j(t) \right]^2 + (1 - p) \sum_{j} \sum_{k} \left[ \hat{E}_{jk}(t, \{\mathbf{r}_i\}) - E_{jk}(t) \right]^2,$$

where ⟨·⟩_{IC} in the definitions of the target quantities represents the average over the isoconfigurational ensemble, and $\hat{S}_j(t, \{\mathbf{r}_i\})$ and $\hat{E}_{jk}(t, \{\mathbf{r}_i\})$ denote the values of the particle propensity and pair-distance change as output from the GNN, respectively. The second summation is over all neighbor nodes *k* (within the distance *r*_{e}) connected with *j* via edges of the graph. Using the loss function $\mathcal{L}_M$, the model can be trained via both $S_j(t)$ and $E_{jk}(t)$ simultaneously; furthermore, these quantities can also be inferred simultaneously. The parameter *p* is treated as a hyperparameter (0 ≤ *p* ≤ 1) that decides the weight of learning between nodes and edges. The two limiting cases are *p* = 0, wherein the model only learns the pair-distance changes $E_{ij}(t)$, and *p* = 1, which reduces to NT-GNN.^{20} For simultaneous learning on nodes and edges (0 < *p* < 1), we set *p* = 0.4 in this study. This choice is justified by the ablation experiment described in Appendix B.
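A minimal sketch of this weighted loss follows; it uses mean squared errors, and the exact normalization of the paper's loss may differ.

```python
import numpy as np

def botan_loss(S_pred, S_true, E_pred, E_true, p=0.4):
    """Weighted sum of L2 losses on nodes (propensity) and edges
    (pair-distance change); p = 1 recovers NT-GNN, p = 0 trains edges only."""
    node_loss = np.mean((S_pred - S_true) ** 2)
    edge_loss = np.mean((E_pred - E_true) ** 2)
    return p * node_loss + (1.0 - p) * edge_loss
```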

In all these training scenarios, all the particles (all the particle pairs) are learned together, regardless of the particle types. No minibatching of graph inputs is applied in the training, except that the minibatch size of the graph input is fixed at 5 in the training of NT-GNN (*p* = 1). Although Bapst *et al.* applied no minibatching in their previous study,^{20} this difference does not significantly change the results presented in this article. The training of NT-GNN required ∼2 h using one NVIDIA A100 Tensor Core GPU (40 GB SXM). A similar amount of time is required for training BOTAN, except that there may be a substantial overhead arising from the computation of interparticle distances, for which the actual elapsed time depends on the implementation.

In fact, many training epochs are required if the training of BOTAN starts from random weight parameters. The reason is that, as will be discussed in Sec. V A, there is a simple relation between the initial distance *r*_{ij}(*t* = 0) and its change after time *t*, $E_{ij}(t)$, when the interval *t* is short. Such a relation defines a “metabasin” toward which many of the weight parameters are largely optimized. Therefore, in this study, we start all the training of BOTAN (*p* < 1) from weight parameters that are adjusted by pretraining on edges at a low temperature, *T* = 0.44. This pretraining is performed for 2000 epochs on the pair-distance changes, keeping *p* = 0 and using trajectory data at a short time *t* = 13, which results in sufficient convergence of the loss function, as shown in Fig. 3. Such pretraining allows the main part of the training to finish within 1000 epochs.

## V. RESULTS

### A. Predicting pair-distance changes as an edge feature

We first show the results of training only on edges by setting *p* = 0. In Fig. 4, prediction results of the pair-distance change $E_{ij}(t)$, obtained for *t* = (a) 130 and (b) 4120 using BOTAN trained at the respective time points and at the lowest temperature *T* = 0.44, are shown for a particle configuration in the test dataset. The data are shown in color maps projected onto segments satisfying *r*_{ij}(*t* = 0) < 1.35 in a two-dimensional cross section. The “actual” pair-distance changes (the isoconfigurational ensemble average) from the simulation are also shown for each as the ground truth. The time points correspond to 0.03*τ*_{α} and *τ*_{α}, where *τ*_{α} is the *α*-relaxation time. BOTAN effectively discerns which specific pairs become separated, particularly for the short time in (a). Over a long time, as in (b), the prediction deviates from the ground truth, but it still captures the pair-level propensity for distance changes as well as the spatial contrast between mobile and immobile regions. In Fig. 5, the prediction accuracy of BOTAN for $E_{ij}(t)$ is quantified by using the Pearson correlation coefficient,

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}},$$

evaluated between the prediction ({*y*_{i}}) and the ground truth ({*x*_{i}}) data.^{20–22} Here, $\bar{x}$ and $\bar{y}$ denote the averages of *x*_{i} and *y*_{i}. This coefficient should be evaluated carefully to ensure that the comparison is made over equivalent particle pairs. Therefore, among pairs of type A particles (pairs involving type B are excluded), pairs within the first neighboring distance, *r*_{ij}(*t* = 0) < 1.35, are chosen for this comparison, because neighbor pairs in the first and second neighbor shells are expected to exhibit different distance changes on average. The prediction accuracy thus evaluated is high at short times and *monotonically* declines with time.
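The coefficient used here is the standard sample Pearson correlation, which in numpy reads:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between ground truth x and prediction y."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
```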

The high prediction accuracy in the short-time region originates from a simple relation between the pair-distance change $E_{ij}(t)$ and the initial distance *r*_{ij}(*t* = 0). Figure 6 shows a map of the probability distribution function as a function of these two quantities for the closest neighbor pairs of type A particles in the training dataset, at time points ranging from *t* = 13 to 13 000 at *T* = 0.44. In the short-time region (*t* ≲ 0.1*τ*_{α} = 412), a simple anticorrelation exists, wherein the pair-distance change $E_{ij}(t)$ remains one order of magnitude smaller than the particle radii for most of the neighbor pairs. This quantity, therefore, characterizes local strains induced by the elastic vibrations. As time increases, the anticorrelation gradually weakens, persisting up to *t* = *τ*_{α}, and the relation between the two quantities becomes less trivial. The anticorrelation determines the metabasin in the landscape of the loss function in the vast weight-parameter space. The number of training epochs required to optimize BOTAN drastically increases as the anticorrelation disappears at longer times *t*, because the loss landscape becomes more complex and the metabasin less “noticeable.” Therefore, we pretrained BOTAN on the pair-distance changes at *T* = 0.44 and *t* = 13 and then used the resulting weight parameters for training at other times. As the metabasins in the parameter space are likely to be located close to each other for different time points, using the pretrained model allows the optimization to concentrate on the weight parameters around the metabasins and thus reduces the number of training epochs required for the main part of training.

### B. Predicting particle propensity for motion as a node feature

Because BOTAN has a pair of decoders that respectively compute the output features for nodes and edges, it is possible to train BOTAN simultaneously with target quantities for both: the particle propensity $S_i(t)$ and the pair-distance changes $E_{ij}(t)$. This is realized by setting the loss function as a weighted sum of the $L_2$-norm losses for $S_i(t)$ and $E_{ij}(t)$. The model is then trained with the other hyperparameters unchanged.

We now compare the predictions of particle propensity by BOTAN with those by the NT-GNN of Bapst *et al.* Figure 7 shows the prediction results for (1) *t* = 130 (0.03*τ*_{α}) and (2) *t* = 4120 (*τ*_{α}) at *T* = 0.44 of (a) NT-GNN and (b) BOTAN, in addition to (c) the “actual” propensity directly evaluated from the trajectory data in the test dataset. As all the snapshots are shown in the same cross section as in Fig. 4, the spatial correspondence between the predictions of $S_i(t)$ and $E_{ij}(t)$ can be seen. BOTAN clearly makes better predictions of the spatially heterogeneous patterns of particle propensity than NT-GNN. In (d), 3D vector plots are also provided to show the displacement field locally averaged over the nearest neighbors, defined by

$$\mathbf{u}_i(t) = \frac{\sum_{j} \theta\!\left(r_{\mathrm{f}} - r_{ij}(0)\right) \left\langle \Delta \mathbf{r}_j(t) \right\rangle_{\mathrm{IC}}}{\sum_{j} \theta\!\left(r_{\mathrm{f}} - r_{ij}(0)\right)},$$

where *θ*(*x*) denotes Heaviside’s step function and *r*_{f} = 1.35 is the cutoff length of the coarse-graining. Each is plotted in a cross section with a thickness of 1.2, where the 3D vectors are enlarged for better visibility. These plots can extract particle motion taking place as a consequence of quasi-localized and phonon vibrations^{46–48} after removing the effect of interparticle rearrangement. For both *t* = 130 and 4120, we observe collective motion on a length scale considerably exceeding the particle size, whereas the flows are more aligned in the former case, suggesting the effect of phononic motion. In (e) and (f), error maps of the predictions are shown for BOTAN and NT-GNN, respectively, showing the existence of spatial heterogeneity in the prediction errors. By comparing the error map of NT-GNN (f) with the locally averaged displacement field in (d), we find that these spatial patterns correspond: the locally averaged displacements tend to be parallel to the contour lines of the prediction-error map, especially in regions where the displacement vectors are aligned. This clear spatial correlation suggests that the collective fluctuations, which are less intimately related to the local structure around each particle, are a possible reason for the decline in the prediction accuracy of NT-GNN.^{20,22}
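The coarse-graining used for these vector plots can be sketched as below; this illustration uses open boundaries for brevity, and the inclusion of the particle itself in its own average is our assumption.

```python
import numpy as np

def local_mean_displacement(pos0, disp, r_f=1.35):
    """Displacement field averaged over all particles within r_f of each
    particle (Heaviside-weighted mean, normalized by the neighbor count)."""
    n = len(pos0)
    out = np.empty_like(disp)
    for i in range(n):
        mask = np.linalg.norm(pos0 - pos0[i], axis=1) <= r_f
        out[i] = disp[mask].mean(axis=0)
    return out
```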

We next assess the prediction accuracy of BOTAN in comparison with NT-GNN. Because particles of types A and B are expected to exhibit different diffusivity over a long period, the prediction accuracy is evaluated in terms of the Pearson correlation coefficient using only the data for type A particles. Figure 8 shows how accurately BOTAN and NT-GNN predict the particle propensity $S_i(t)$ for three different temperatures, *T* = (a) 0.44, (b) 0.50, and (c) 0.64. The time dependence for NT-GNN agrees with the previous results:^{20} at shorter times before reaching the plateau region (*t* < 0.2), the Pearson correlation coefficient assumes high values, then falls below 0.5, and afterward gradually increases to reach its peak at around the *α*-relaxation time. BOTAN, in contrast, outperforms NT-GNN in predictive accuracy over the entire temporal range and for all temperatures under investigation.

Finally, we note that the pair-distance change in Eq. (11) has its own significance despite the extreme simplicity of its definition. Because the pair-distance changes characterize relative motion, they can capture structural *rearrangements*. This feature is distinct from standard quantities defined on the basis of particle displacements, which are affected by nonlocal phononic fluctuations; it enables the model to better distinguish rearranging hotspots by separating out the nonlocal effects. Similar ideas have also been implemented in the method of counting the replacement of neighboring pairs,^{10,49,50} also termed “bond breakage.”^{5,51,52} Furthermore, the pair-distance changes can characterize the extent to which each pair tends to change owing to the *local shear deformation* around rearranging cores; this aspect of structural change has long been addressed using the concept of “shear transformation zones.”^{47,53–55} By learning how the strains of non-rearranging regions are distributed and how rearranging hot spots are localized, BOTAN autonomously corrects the errors in the prediction of particle displacements and overall acquires an unprecedented predictive ability regarding glassy dynamics.

### C. Comparison with prediction accuracy of other quantities by NT-GNN

In this section, we examine whether the prediction accuracy of NT-GNN improves when the target quantity itself is defined relative to the neighbor environment. The first such quantity is the neighbor-relative propensity,

$$R_i(t) = \left\langle \left| \Delta \mathbf{r}_i(t) - \frac{1}{N_{\mathrm{n.n.}}} \sum_{j \in \mathrm{n.n.}} \left[ \Delta \mathbf{r}_j(t) - \varepsilon\, \mathbf{r}_{ij}(0) \right] \right| \right\rangle_{\mathrm{IC}},$$

where the Cartesian components *α*, *β* ∈ {*x*, *y*, *z*} of the local strain tensor *ɛ*_{αβ} are evaluated from the change in the local arrangement of neighboring particles within a distance of 1.6, and the summation is over the nearest neighbors within the same distance, with *N*_{n.n.} the number of these neighbors. When the local strain reduces to zero (*ɛ*_{αβ} = 0), which is valid in the limit *t* → 0, this quantity reduces to a more easily interpretable form, $C_i(t) = \Delta \mathbf{r}_i(t) - N_{\mathrm{n.n.}}^{-1} \sum_{j \in \mathrm{n.n.}} \Delta \mathbf{r}_j(t)$, clearly manifesting itself as the relative displacement with respect to the neighbor environment.^{45,56}

The second quantity is the nonaffine squared displacement,^{14,50,53,54,57} defined as the mean square of the difference between the actual displacement and the uniform strain displacement of the neighbors,

$$D_i(t) = \left\langle \frac{1}{N_{\mathrm{n.n.}}} \sum_{j \in \mathrm{n.n.}} \left| \Delta \mathbf{r}_j(t) - \Delta \mathbf{r}_i(t) - \varepsilon\, \mathbf{r}_{ij}(0) \right|^2 \right\rangle_{\mathrm{IC}}.$$

This quantity grows for an *i*–*j* pair if that specific *i*–*j* pair moves farther apart than the other pairs. Therefore, $D_i(t)$ is more susceptible to hot spots where particle rearrangement preferentially occurs.
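The zero-strain limit $C_i(t)$ is simple to compute directly, as in the following sketch; open boundaries are assumed, and the full strain-corrected $R_i(t)$ is omitted.

```python
import numpy as np

def neighbor_relative_displacement(pos0, disp, r_cut=1.6):
    """C_i(t): displacement of particle i minus the mean displacement of
    its neighbors within r_cut (the zero-strain limit of R_i(t))."""
    n = len(pos0)
    C = np.empty_like(disp)
    for i in range(n):
        d = np.linalg.norm(pos0 - pos0[i], axis=1)
        nbr = (d <= r_cut) & (d > 0)  # neighbors, excluding i itself
        C[i] = disp[i] - disp[nbr].mean(axis=0)
    return C
```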

In Fig. 8, the Pearson correlation coefficients between the predictions of NT-GNN and the actual values of the neighbor-relative particle propensities, $R_i(t)$ and $D_i(t)$, are shown; in (a), this correlation coefficient is also plotted for the prediction made by BOTAN. The correlation coefficient is evaluated only for type A particles, as a measure of proximity between the predicted and ground-truth values. The prediction accuracy of NT-GNN improves for these quantities, especially in the time region shorter than 0.1*τ*_{α}. This clearly implies that spatially extended static fluctuations cause the decline in the prediction accuracy of $S_i(t)$ by NT-GNN at short times, and that the accuracy recovers for these quantities because the effect of the spatially extended static fluctuations is removed from them to a certain extent.

## VI. CONCLUSION

In conclusion, we have introduced a GNN model that realizes faithful predictions of glassy dynamics over the entire temporal range, with accuracy greatly improved over previous models, including the GNN model by Bapst *et al.*^{20} Typical deep learning models, including both convolutional neural networks and GNNs, are designed to characterize local features; thus, they are not particularly good at capturing nonlocal characteristics. The key to the present improvement is that the model learns the relative motion between neighbor pairs, relating it to two-body structural correlations, in addition to the self-motion of the particles. In this way, the GNN model bypasses direct learning of nonlocal motion itself and acquires the ability to autonomously “interpret” how particle motion is affected by two different dynamical effects: nonlocal strain fluctuations and local particle rearrangements. As a consequence, BOTAN reconciles the differences between the fast *β* and slow *α* relaxation dynamics and achieves state-of-the-art performance in the task of predicting glassy dynamics from the static structure.

In the present study, the particle propensity for motion defined on self-displacement is chosen as the dynamical quantity of interest, in line with many previous studies on the structure–dynamics correspondence in glasses.^{7,9,10,20,21} It was shown that the GNN model proposed by Bapst *et al.* (NT-GNN) can be reduced to a linear-regression (LR) model once properly defined local structural parameters are averaged over neighbor shells, wherein the predictive abilities of the LR model and NT-GNN are essentially the same.^{21} Our results have, somewhat contrarily, shown that such local structural parameters are not the only descriptors for self-displacement, because self-displacement is under the influence of strain fluctuations. Recently, new neural network models that improve the predictive ability on glassy dynamics by introducing different ideas have emerged. For example, a recently proposed machine learning framework, GlassMLP, has raised the predictive ability by introducing physics-informed descriptors as input together with bottleneck layers;^{58} a couple of recent studies also appear to have improved the predictive ability of GNNs by introducing rotation-equivariant structural descriptors^{59} or by additionally introducing self-attention layers into GNNs.^{60} As the reasons for the improved predictions are likely to differ, it may be possible to develop a more advanced framework by combining these features, which is left as an open task for the future. Given that a much higher prediction ability is realized, such models may also become useful for finding the “reaction coordinates” of glassy dynamics, along which fluctuations should be enhanced for efficient sampling of molecular trajectories. Extensions to the recently discussed machine learning-aided sampling techniques may be interesting.^{61,62}

## ACKNOWLEDGMENTS

We thank T. Kawasaki for critical input regarding this research. We also thank Y. Asahi, J. J. Molina, and F. Landes for helpful discussions and comments. H. Shiba and T. Shimokawabe were supported by JSPS KAKENHI Grant No. JP19H05662 and by “Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures” and “High Performance Computing Infrastructure” in Japan (Project No. jh220052). M. Hanai and T. Suzumura were supported by “Advanced Research Infrastructure for Materials and Nanotechnology” from MEXT, Japan. Training and testing are performed using NVIDIA A100 Tensor Core GPUs (40 GB SXM) on Wisteria/BDEC-01 Aquarius subsystem at Information Technology Center, University of Tokyo. For generation of the dataset for training and testing, usage of the following computational resources is acknowledged: Oakforest-PACS at the Joint Center for Advanced High Performance Computing (JCAHPC), Oakbridge-CX at Information Technology Center, University of Tokyo, Supercomputer System B at Institute for Solid State Physics, University of Tokyo, and MASAMUNE-IMR at Institute for Materials Research, Tohoku University.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Hayato Shiba**: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Resources (equal); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (lead). **Masatoshi Hanai**: Formal analysis (supporting); Methodology (supporting); Software (supporting); Writing – original draft (supporting). **Toyotaro Suzumura**: Software (supporting); Supervision (supporting); Writing – review & editing (supporting). **Takashi Shimokawabe**: Conceptualization (supporting); Funding acquisition (lead); Investigation (supporting); Project administration (lead); Resources (equal); Software (supporting); Writing – review & editing (supporting).

## DATA AVAILABILITY

The source code used for training and evaluation is openly available in a GitHub repository at https://github.com/h3-Open-BDEC/pyg_botan. This code is an implementation of BOTAN using PyTorch Geometric.^{63} The dataset of simulation trajectories for evaluation is also available via the hyperlink indicated in the same GitHub repository for three years after the publication of this paper.

### APPENDIX A: EFFECT OF THE ITERATION NUMBER IN MESSAGE PASSING

In the model architecture of BOTAN, a pair of two-layer MLPs characterizing the features of nodes and edges are mutually updated. Each edge is updated based on the embeddings of the nodes it connects, in addition to its own previous embedding. Each node is updated based on the sum of the embeddings of its incident edges, in addition to its own previous embedding. This is how message passing between the nodes is realized in the present GNNs. Therefore, the number of repeat cycles determines how far away structural information can be incorporated into a node (or into an edge). In Fig. 9, the dependence of the Pearson correlation coefficient on this iteration number (denoted as *n* in Sec. II) is shown for *t* = 130 and 4120 at *T* = 0.44, where the training is again performed over 1000 epochs. The predictive ability increases with the number of iterations *n*, but the increase is much slower when *n* < 4.

### APPENDIX B: OPTIMIZATION DETAILS RELATED TO TRAINING ON EDGES

In the simultaneous training on nodes and edges [via both $S_i(t)$ and $E_{ij}(t)$], a value of 0.4 is chosen for the hyperparameter *p* in Eq. (12), which determines the training weight between nodes and edges. This choice is justified by an ablation experiment, in which BOTAN is trained for various values of *p* without using the pretrained model. In Fig. 10, we show the Pearson correlation coefficients *ρ* computed from the predicted values of $S_i(t)$ and $E_{ij}(t)$ after training for 3000 epochs at *t* = 130 and *T* = 0.44. The results suggest that 0.2 ≤ *p* ≤ 0.6 yields near-optimal conditions for the simultaneous training.

## REFERENCES

*Colloquium*: The glass transition and elastic models of glass-forming liquids