We introduce a general framework for constructing coarse-grained potential models without *ad hoc* approximations such as limiting the potential to two- and/or three-body contributions. The scheme, called the Deep Coarse-Grained Potential (abbreviated DeePCG), exploits a carefully crafted neural network to construct a many-body coarse-grained potential. The network is trained with full atomistic data in a way that preserves the natural symmetries of the system. The resulting model is very accurate and can be used to sample the configurations of the coarse-grained variables in a much faster way than with the original atomistic model. As an application, we consider liquid water and use the oxygen coordinates as the coarse-grained variables, starting from a full atomistic simulation of this system at the *ab initio* molecular dynamics level. We find that the two-body, three-body, and higher-order oxygen correlation functions produced by the coarse-grained and full atomistic models agree very well with each other, illustrating the effectiveness of the DeePCG model on a rather challenging task.

## I. INTRODUCTION

In molecular dynamics (MD), we are often faced with two types of coarse-graining tasks. In a first set of applications, we are interested in evaluating the Landau free energy, which is a function of a small subset of coarse-grained (CG) variables. In this case, the CG variables are either scalar or low-dimensional vector variables. In a second set of applications, we are interested in sampling with MD or with Monte Carlo (MC) the configurations of an extensive set of CG variables. In this case, the dimensionality of the CG space is proportional to the size of the system but is reduced relative to the full space of atomistic coordinates. The first type of CG variables is typically adopted to study problems like phase transitions, where the objective is to perform detailed analyses of the Landau free energy surface by finding the metastable states, the free energy barriers between these states, the transition pathways, etc. Take the melting of a solid as an example; the Steinhardt order parameters^{1} have been used as CG variables to differentiate solid (crystal) and liquid phases. The second type of CG variables is typically used to accelerate configurational sampling relative to full atomistic simulations. For example, one may coarse-grain a polymer by replacing the monomers with point-like particles, or beads, connected by springs.

For a good description of the Landau free energy surface, one needs to find good order parameters acting as CG variables and address the issues associated with crossing high energy barriers. Typically these approaches are limited to a few CG variables, but recent work demonstrated that machine learning methods allow us to describe the functional dependence of the Landau free energy surface on several CG variables.^{2–8}

When considering extensive CG variables, the difficulty is often associated with finding an accurate free energy function in the space of the CG variables. Such a free energy function usually depends on the CG variables in a complex and nonlinear way. Therefore, finding a good representation of this function often requires substantial physical/chemical intuition.^{9–25} In principle, machine learning methods can address this problem more accurately and in an automated way,^{26–31} but most machine learning approaches so far have focused on the representation of the potential energy surface in the space of the atomistic degrees of freedom rather than the representation of the free energy surface in the space of the CG variables. For example, the deep potential method,^{30} a model based on deep neural networks, has made it possible to parameterize an atomistic potential energy function derived from quantum mechanics without *ad hoc* approximations. A subsequent development of this approach, called Deep Potential Molecular Dynamics (DeePMD),^{31} has allowed us to perform MD simulations of comparable quality to *ab initio* molecular dynamics (AIMD)^{32} at the cost of classical empirical force fields.

The free energy surface, rather than the potential energy surface, is the key physical quantity that we need to represent when dealing with CG variables. In this work, we introduce the Deep Coarse-Grained Potential (DeePCG) scheme, an approach that generalizes the deep potential and DeePMD methods to representations of the free energy surface in the space of the CG variables, a quantity that will be called the CG potential in the following. A related method to represent the many-body character of the CG potential in molecules was recently reported in Ref. 33. In our approach, similar to the deep potential and DeePMD methods, no *ad hoc* approximations are required, in addition to the network model itself, to represent the CG potential. The scheme is very accurate as demonstrated by the almost perfect agreement of the many-body correlations extracted from CG simulations with the corresponding correlations extracted from the original atomistic model. In the present work, we use liquid water as an example to illustrate the approach. We choose AIMD as the underlying atomistic model and replace the individual water molecules with point-like particles located at the oxygen sites in the CG model. The excellent agreement of the second-, third-, and higher-order correlation functions between CG and atomistic models shows the promise of the DeePCG approach.

## II. METHODOLOGY

### A. Basic theory

We consider a *d*-dimensional system with *N* atoms in the constant-volume canonical (*NVT*) ensemble. The coordinates of the atoms, in the laboratory frame, are $q = \{q_1, q_2, \ldots, q_{dN}\} \in \mathbb{R}^{dN}$. The configurational distribution function is defined by

$$p(q) = \frac{e^{-\beta V(q)}}{Z}, \qquad (1)$$

where $\beta = 1/(k_{\mathrm{B}}T)$ and $Z = \int e^{-\beta V(q)}\,dq$ is the partition function.

The coarse-grained variables $\xi(q) = \{\xi_1(q), \xi_2(q), \ldots, \xi_M(q)\}$ are a reduced set of coordinates (*M* < *dN*). *M* can be finite and independent of the system size, or it can be extensive with the system size, as in the two cases discussed in the Introduction. When *M* is finite, the CG variables are the so-called order parameters of the system. When *M* is extensive, the CG variables replace molecular objects with simpler sub-objects. The configurational distribution of the CG system is the projection of the configurational distribution of the microscopic (atomistic) system onto the space of the CG variables,

$$p^{\xi}(\xi) = \int p(q)\,\prod_{j=1}^{M}\delta\big(\xi_j(q) - \xi_j\big)\,dq. \qquad (2)$$

The probability distribution in Eq. (2) allows us to define the CG potential and the forces acting on the CG variables as

$$U(\xi) = -\frac{1}{\beta}\ln p^{\xi}(\xi) \qquad (3)$$

and

$$F_i(\xi) = -\frac{\partial U(\xi)}{\partial \xi_i}, \qquad (4)$$

respectively. Equation (3) tells us that a good CG potential should reproduce accurately the *full* configurational distribution of the CG variables in the atomistic model. Testing the quality of the full configurational distribution of the CG variables is difficult, and typically tests have been based only on two- and three-body correlation functions.^{15,16,19,20}
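To make the definition in Eq. (3) concrete, the sketch below estimates $U(\xi)$ numerically for a hypothetical two-dimensional Gaussian "microscopic" model with a single linear CG variable, using the relation $U(\xi) = -\beta^{-1}\ln p^{\xi}(\xi)$ up to an additive constant. All data and parameters here are illustrative stand-ins, not part of the original work.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0
# hypothetical "microscopic" model: 2D standard Gaussian, V(q) = |q|^2 / 2
q = rng.normal(size=(500_000, 2))
xi = q[:, 0]                       # linear CG map keeping one coordinate

hist, edges = np.histogram(xi, bins=61, range=(-3.0, 3.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
U = -np.log(hist) / beta           # CG potential, up to an additive constant
U -= U.min()

# for this Gaussian toy model, U(xi) should be close to xi^2 / 2
mask = np.abs(centers) < 2.0
err = np.max(np.abs(U[mask] - 0.5 * centers[mask] ** 2))
print("max deviation from xi^2/2 on [-2, 2]:", round(err, 3))
```

In higher dimensions a histogram is of course hopeless, which is precisely why a parameterized representation of $U(\xi)$ is needed.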

$U(\xi)$ is uniquely specified by the underlying atomistic model and the definition of the CG degrees of freedom. Therefore, constructing a CG model involves two steps: (1) the choice of an appropriate CG-potential representation and (2) the optimization of the parameters that define the potential representation. The way in which these two issues are addressed differentiates alternative schemes.

We notice that, even if we knew exactly $U(\xi)$, we would not have a closed deterministic form for the equation of motion of the CG variables, due to the $dN - M$ missing degrees of freedom in the CG potential. The issue of the dynamics of the CG variables has been addressed in the literature; see, e.g., Refs. 34 and 35. Further assumptions, like a time scale separation between the CG variables and the remaining degrees of freedom, are usually required to recover dynamical information of the atomistic model. In the following, we shall focus on the accurate construction of the CG potential. We will leave to future studies the investigation of CG dynamics.

### B. CG potential representation

We adopt a neural network representation $U^w(\xi)$ for the CG potential $U(\xi)$. Here $w$ denotes the parameters to be optimized by the training process. $U^w(\xi)$ should be constructed using as input only the generalized coordinates $\xi$, without any human intervention in the optimization process. The $U^w(\xi)$ constructed in this way should preserve the symmetry properties of $U(\xi)$. In this manuscript, we limit ourselves to considering CG objects that behave as point particles and have only positional dependence. In this case, the $\xi$ variables are the coordinates of the CG particles. More general choices of the CG objects have been suggested in the literature^{36–39} when dealing with, e.g., polymers, biological molecules, or colloidal particles. In these cases, it may be useful to consider rods, ellipsoids, particles connected by springs, etc., as the CG objects. In principle, all these cases could be treated with the present formalism. In the setup adopted here of point-like CG objects, the CG potential $U^w$ is extensive, intrinsically many-body, and should preserve the translational and rotational invariance, as well as the permutational symmetry, of the CG objects.

All the properties of the CG potential described above are preserved by the deep potential model.^{30} To illustrate how it works, we use the example of liquid water. We write the CG potential as a sum of local contributions of the CG particles, i.e., $U^w(\xi) = \sum_i U_i^w(\xi)$. $U_i^w(\xi)$, the potential contribution of the CG particle *i*, is constructed in two steps. First, the coordinates of the CG particle *i* and its neighbors within a cut-off radius *R*_{c} are transformed into the descriptors {*D*_{ij}} of the local environment of the CG particle *i*. We call this procedure the local frame transformation and refer to Fig. 1 for more details. In the following, we use the symbol $D_i$ to denote the entire set of descriptors for atom *i*. Next, as illustrated in Fig. 2, the descriptors $D_i$ are given as input to a fully connected feedforward neural network to compute the potential contribution of the CG particle *i*. The mathematical formulation of the network structure is also presented in Fig. 2, where the operation of each layer of the network corresponds to a linear mapping of the output from the previous layer combined with a nonlinear mapping. The translational and rotational symmetries are preserved by the local frame transformation. The permutational symmetry is preserved because (a) for each CG particle *i*, its descriptors *D*_{ij} are sorted in ascending order according to the inverse distances between particles *i* and *j*; (b) the subnetworks associated with the same type of particles share the same parameters $w$; and (c) the total potential $U^w(\xi) = \sum_i U_i^w(\xi)$ is additive. More details on the deep potential method can be found in Refs. 30 and 31. Due to the adoption of a finite cutoff radius, the simulation cost of the DeePCG model scales linearly with the system size.
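The additive, symmetry-preserving construction can be sketched in a few lines. For brevity, the toy below uses only sorted radial descriptors $1/r_{ij}$ and random stand-in weights; the actual model also keeps angular information in a local frame, as described above. All names and parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Rc, n_desc = 6.0, 8                                  # cutoff and descriptor length

# shared subnetwork weights ("same parameters w" for all particles of a type)
W1, b1 = 0.1 * rng.normal(size=(n_desc, 16)), np.zeros(16)
W2, b2 = 0.1 * rng.normal(size=(16, 1)), np.zeros(1)

def descriptors(positions, i):
    r = np.linalg.norm(positions - positions[i], axis=1)
    inv = np.sort(1.0 / r[(r > 1e-9) & (r < Rc)])    # ascending inverse distances
    inv = inv[-n_desc:]                              # keep at most n_desc nearest
    return np.pad(inv, (n_desc - len(inv), 0))       # zero-pad empty slots

def cg_potential(positions):
    # U^w(xi) = sum_i U_i^w(xi): one shared subnetwork evaluation per particle
    energy = 0.0
    for i in range(len(positions)):
        h = np.tanh(descriptors(positions, i) @ W1 + b1)
        energy += (h @ W2 + b2).item()
    return energy

pos = rng.uniform(0.0, 10.0, size=(20, 3))
perm = rng.permutation(20)
U = cg_potential(pos)
print("permutation-invariant:", np.isclose(U, cg_potential(pos[perm])))
```

Because the descriptors depend only on sorted interparticle distances and the per-particle contributions are summed, the result is invariant under relabeling, rotation, and translation of the particles, which is the point of the construction.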

### C. CG potential optimization

The construction of the CG potential $U^w(\xi)$, introduced in Subsection II B, has many similarities with the construction of the potential energy $V(q)$ in the DeePMD method. There is, however, a very important difference between the two cases. In the DeePMD case, the potential energy $V(q)$ is directly available from the underlying AIMD simulations. In the DeePCG case, the CG potential is a free energy and is not directly available. Therefore, the optimization of the CG potential requires a specific formulation. We adopt a force-matching scheme like the one in the multi-scale coarse-graining (MS-CG) method introduced by Voth *et al.*^{15} In addition, we pay special attention to the fact that a neural network representation $U^w(\xi)$ may have tens of thousands, or more, variational parameters, so a suitable optimization algorithm is needed.

A straightforward force-matching approach would consist in fitting accurate mean forces from atomistic simulations. There have been many efforts in this direction.^{40–42} Of particular interest is a simple formula proposed by Ciccotti *et al.*,^{40} in which a set of *dN*-dimensional vectors $b_j(q)$ that satisfy

$$b_j(q) \cdot \nabla \xi_i(q) = \delta_{ij} \qquad (5)$$

is introduced. Then the mean force on $\xi_i(q)$, namely, the negative gradient of $U(\xi)$ with respect to the position of the *i*th CG particle, can be expressed as

$$F_i(\xi) = \big\langle \mathcal{F}_i(q) \big\rangle_{\xi = \xi(q)}, \qquad (6)$$

with an instantaneous force estimator

$$\mathcal{F}_i(q) = -\,b_i(q) \cdot \nabla V(q) + \frac{1}{\beta}\,\nabla \cdot b_i(q). \qquad (7)$$

Here $\langle\cdots\rangle_{\xi=\xi(q)}$ denotes the conditional expectation over the equilibrium distribution of the system restricted to the hypersurface $\xi = \xi(q)$. To train the DeePCG model, one needs to minimize the so-called loss function with respect to the model parameters $w$. The most natural choice of loss function in terms of force-matching is

$$\hat{L}(w) = \frac{1}{D}\sum_{n=1}^{D}\frac{1}{M}\sum_{i=1}^{M}\big|F_i(\xi_n) + \nabla_{\xi_i} U^w(\xi_n)\big|^2, \qquad (8)$$

where *D* is the number of configurations of CG variables $\xi_n$ in the dataset and the mean force $F_i(\xi_n)$ is estimated with Eq. (6). We notice that the sample of CG configurations in Eq. (8) is very general, in the sense that it does not need to be an equilibrium sample at the thermodynamic conditions of the atomistic simulation. For example, the sample could include, in an enhanced way, accessible CG configurations that have a small probability of occurrence at the thermodynamic conditions of interest, such as in the case of rare events. We stress, however, that the sampling of the microscopic degrees of freedom orthogonal to $\xi$ in Eq. (6) must be done at the appropriate thermodynamic conditions. In practice, the different configurations $\xi_n$ in the dataset can be extracted from unconstrained MD or Monte Carlo (MC) simulations of the microscopic atomistic model at different temperatures.

The above straightforward approach is not convenient when the conditional expectation values in Eq. (6) require computationally expensive constrained/restrained simulations. In this situation, we find it more convenient to approximate the ensemble average $\langle\cdots\rangle_q$ with the average $\frac{1}{D}\sum_{n=1}^{D}\cdots$ over the configurations $\xi_n$ [see Eq. (10) below]. The latter average does not require constrained/restrained simulations, but it requires the $\xi_n$ to be extracted from equilibrium atomistic simulations at the temperature selected in Eq. (1). Then the mean force $F_i$ in the loss function (8) can be replaced by the instantaneous force $\mathcal{F}_i$. In other words, this corresponds to using an instantaneous version of the loss function,

$$\hat{L}^{\mathrm{ins}}(w) = \frac{1}{D}\sum_{n=1}^{D}\frac{1}{M}\sum_{i=1}^{M}\big|\mathcal{F}_i(q_n) + \nabla_{\xi_i} U^w\big(\xi(q_n)\big)\big|^2. \qquad (9)$$

With a sufficiently large representative dataset, we expect that the ensemble average of the squared difference between predicted and instantaneous forces should be approximated quite well by $\hat{L}^{\mathrm{ins}}(w)$, i.e.,

$$L^{\mathrm{ins}}(w) = \frac{1}{M}\sum_{i=1}^{M}\Big\langle \big|\mathcal{F}_i(q) + \nabla_{\xi_i} U^w\big(\xi(q)\big)\big|^2 \Big\rangle_q \approx \hat{L}^{\mathrm{ins}}(w). \qquad (10)$$

This amounts to an ergodicity requirement for the atomistic system and is always valid if the system samples an equilibrium thermodynamic state.

By definition, $\hat{L}^{\mathrm{ins}}(w)$ is much easier to compute than $\hat{L}(w)$. Below we argue that $\hat{L}^{\mathrm{ins}}(w)$ is also a valid loss function for optimizing the CG potential. To see this, note that the instantaneous force can be viewed as the mean force plus a random error $R$, which depends on the microscopic configuration $q$, i.e.,

$$\mathcal{F}_i(q) = F_i\big(\xi(q)\big) + R_i(q). \qquad (11)$$

By Eq. (6), the average $\langle R_i(q)\rangle_{\xi=\xi(q)}$ in the constrained ensemble vanishes, so the average $\langle R_i(q)\rangle_q$ also vanishes. By inserting Eq. (11) into Eq. (10), the instantaneous loss function (10) becomes

$$L^{\mathrm{ins}}(w) = L(w) + \frac{1}{M}\sum_{i=1}^{M}\big\langle |R_i(q)|^2 \big\rangle_q, \qquad (12)$$

with

$$L(w) = \frac{1}{M}\sum_{i=1}^{M}\Big\langle \big|F_i\big(\xi(q)\big) + \nabla_{\xi_i} U^w\big(\xi(q)\big)\big|^2 \Big\rangle_q. \qquad (13)$$

The cross term vanishes because, at fixed $\xi(q)$, the factor $F_i + \nabla_{\xi_i} U^w$ is constant while $\langle R_i\rangle_{\xi=\xi(q)} = 0$. Since the second term on the right-hand side of Eq. (12) is independent of $w$, $L^{\mathrm{ins}}(w)$ and $L(w)$ have the same minimizer. This equivalence justifies our use of $\hat{L}^{\mathrm{ins}}(w)$ as the loss function.
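The argument above can be checked numerically on a toy problem: fitting to instantaneous forces that differ from the mean forces by zero-mean noise returns the same least-squares minimizer as fitting to the exact mean forces. Everything below (the linear force model, the data) is an illustrative stand-in, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(2)
xi = rng.uniform(-2.0, 2.0, size=50_000)
mean_force = -1.7 * xi                                         # exact mean force (toy)
inst_force = mean_force + rng.normal(scale=3.0, size=xi.size)  # plus zero-mean R(q)

# least-squares stiffness for the model F_w(xi) = -w * xi, in closed form
w_mean = -np.sum(mean_force * xi) / np.sum(xi * xi)
w_inst = -np.sum(inst_force * xi) / np.sum(xi * xi)
print("w from mean forces:", round(w_mean, 4),
      "| from instantaneous forces:", round(w_inst, 4))
```

The noise inflates the value of the loss at the minimum, exactly as the $w$-independent term in Eq. (12), but leaves the location of the minimum essentially unchanged.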

In the application example that we discuss in Sec. III, we use CG variables that depend linearly on the microscopic coordinates, similar to Ref. 15. However, the method that we have illustrated in Eqs. (6) and (7) can deal with non-linear dependencies as well, as in the method discussed in Ref. 43.
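For CG variables that depend linearly on the atomic coordinates, the estimator simplifies considerably. The short derivation below, assuming the standard blue-moon form of the instantaneous estimator, sketches why bare atomic forces can then be used directly (cf. Sec. III):

```latex
% Special case: CG variables linear in the atomic coordinates.
% If \xi_i(q) = q_{O_i} (the coordinates of oxygen i), one may take b_i to be
% the constant vector selecting those coordinates, so that
%
%     b_i \cdot \nabla \xi_j = \delta_{ij}, \qquad \nabla \cdot b_i = 0 .
%
% The divergence contribution to the instantaneous estimator then vanishes and
%
%     \mathcal{F}_i(q) = - b_i \cdot \nabla V(q) + \beta^{-1} \nabla \cdot b_i
%                      = - \frac{\partial V(q)}{\partial q_{O_i}} ,
%
% i.e., the bare atomistic force acting on oxygen i, which is why the oxygen
% forces from the microscopic model can serve directly as the estimator.
```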

In practice, we find that the stochastic gradient descent (SGD) method is very efficient for optimizing the loss function (9), which is a highly non-convex function corresponding to a rugged landscape in the large parameter space, due to the nonlinearity of the neural network interpolation. This ruggedness does not seem to constitute an essential difficulty, since the different local minima found with SGD approximate equally well the physics associated with the target function. We will discuss this issue in more detail later. Within our approach, the stochastic gradients $\nabla_w l(w)$ applied to update the parameters at each step are provided by the average over a small batch $B$,

$$\nabla_w l(w) = \frac{1}{|B|}\sum_{n \in B} \nabla_w\, \frac{1}{M}\sum_{i=1}^{M}\big|\mathcal{F}_i(q_n) + \nabla_{\xi_i} U^w\big(\xi(q_n)\big)\big|^2, \qquad (14)$$

where $B$ is a subset of the whole dataset and $|B|$ denotes the batch size. The above procedure is different from the scheme adopted in Ref. 15, in which the full gradients $\nabla_w \hat{L}^{\mathrm{ins}}(w)$ are applied to update the parameters at each step. We find that SGD greatly reduces the number of gradient evaluations that are required.
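A minimal sketch of such a minibatch update, using a hypothetical one-parameter linear force model in place of the network (plain SGD rather than the Adam variant used later; batch size 4 as in Sec. III):

```python
import numpy as np

rng = np.random.default_rng(3)
xi = rng.uniform(-2.0, 2.0, size=10_000)
F = -1.7 * xi + rng.normal(scale=1.0, size=xi.size)   # instantaneous force data

w, lr, batch = 0.0, 0.05, 4
for step in range(20_000):
    B = rng.integers(0, xi.size, size=batch)           # random minibatch
    pred = -w * xi[B]                                  # model force F_w = -w xi
    grad = np.mean(2.0 * (pred - F[B]) * (-xi[B]))     # d/dw of the batch loss
    w -= lr * grad
print("fitted stiffness:", round(w, 2))
```

Each step touches only `batch` samples, which is why SGD needs far fewer gradient evaluations than full-gradient descent to reach a comparable parameter estimate.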

We note that other systematic procedures to optimize the parameters have been discussed in the literature. Of particular interest is the iterative Boltzmann inversion method,^{12} which works by iteratively optimizing the CG interactions until the radial distribution functions (RDF) of the CG system match those of the target atomistic simulation. By construction, it provides accurate two-body correlations.
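The iterative Boltzmann inversion update can be stated compactly: $V_{k+1}(r) = V_k(r) + k_{\mathrm B}T \ln[g_k(r)/g_{\mathrm{target}}(r)]$. The sketch below applies one such update to tabulated arrays; the RDF values are invented placeholders, since a real iteration would require a CG simulation to produce $g_k$ at each step.

```python
import numpy as np

kT = 1.0
r = np.linspace(0.25, 1.0, 4)
g_target = np.array([0.2, 1.8, 1.1, 1.0])   # target RDF from the atomistic run
g_k = np.array([0.4, 1.5, 1.2, 1.0])        # RDF of the current CG model
V_k = np.zeros_like(r)                       # e.g., start from Boltzmann inversion

# raise the pair potential where the CG model is too structured, lower it
# where it is not structured enough
V_next = V_k + kT * np.log(g_k / g_target)
print(np.round(V_next, 3))
```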

## III. COARSE-GRAINING OF LIQUID WATER

To show how we construct the CG potential for an extensive CG system, we use the coarse graining of a liquid water model from an *ab initio* density functional theory (DFT)^{44} based simulation into effective “water particles” as an example. Because of its importance as a solvent in chemical and biological systems and its unique properties, the study of water is of wide interest. The DFT potential energy surface is intrinsically many-body. Developing an accurate CG model that represents a water molecule by a single particle is an ever-evolving and ongoing quest.^{9,12,15–20}

Constructing effective interactions to achieve this goal has usually required a large amount of human effort combined with substantial physical/chemical intuition. For example, in the mW monatomic potential,^{18} which has been successfully used to study crystallization of water,^{45} a specially designed Lennard-Jones-like form is used for two-body interactions, while three-body interactions are adapted from the Stillinger-Weber potential.^{46} In principle, coarse graining approaches that do not require physical/chemical intuition are possible by exploiting general variational principles, such as the one adopted in the multi-scale coarse-graining (MS-CG)^{15} or the iterative Boltzmann inversion^{12} methods. However, even when using general variational principles to bridge atomistic and CG scales, the faithfulness of distribution of the CG variables may still depend on the CG representation. For example, in the applications of the MS-CG method to liquid water, the two- and three-body distribution functions of the CG variables still show non-negligible deviations from the corresponding target microscopic distributions.^{19,20}

The DFT dataset in our example comes from Ref. 47. The electronic structure of the water system is modeled by DFT with the PBE0 exchange-correlation functional^{48} and includes the long-range dispersion interactions self-consistently through the Tkatchenko-Scheffler model.^{49} The corresponding AIMD simulation^{32} adopts periodic boundary conditions, and deuterons replace protons to permit a larger integration time step (0.5 fs). The simulation data consist of snapshots from a 20 ps-long trajectory in the NVT ensemble, where *N* = 192 (64 H_{2}O molecules), *V* = 1.9275 nm^{3} (simple cubic periodic simulation cell), and *T* = 330 K. In total, 40 000 snapshots are recorded. The important difference between training DeePMD and DeePCG is that in DeePCG one estimates mean forces, which are conditional averages of fluctuating atomic forces at fixed CG configurations, while in DeePMD one estimates the deterministic atomic forces in a fixed atomic configuration. Therefore, a short AIMD trajectory is not sufficient to train a DeePCG model with satisfactory accuracy, but this difficulty is circumvented by constructing a DeePMD model^{31} from the AIMD data and sampling the configurations with a much longer DeePMD trajectory (15 ns). Figures 3–5 compare DeePMD and AIMD configurations in terms of the O—O radial distribution function (RDF), the O—O—O angular distribution functions (ADFs), and the distributions of two averaged local Steinhardt parameters (defined in the Appendix),^{50} respectively. The configurations sampled by DeePMD are in almost perfect agreement with the AIMD data. Therefore, as far as the oxygen configurations are concerned, training with data generated by DeePMD is essentially indistinguishable from training with data generated by AIMD.
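The RDF comparisons above rely on a standard estimator; a minimal version for a cubic periodic box, with random placeholder coordinates standing in for trajectory data, might look as follows:

```python
import numpy as np

def rdf(pos, box, bins=50, r_max=None):
    """O-O radial distribution function in a cubic periodic box
    (minimum-image convention), normalized so that g(r) -> 1."""
    n = len(pos)
    r_max = box / 2 if r_max is None else r_max
    d = pos[:, None, :] - pos[None, :, :]
    d -= box * np.round(d / box)                     # minimum image
    r = np.linalg.norm(d, axis=-1)[np.triu_indices(n, k=1)]
    hist, edges = np.histogram(r, bins=bins, range=(0.0, r_max))
    rc = 0.5 * (edges[:-1] + edges[1:])
    shell = 4.0 * np.pi * rc**2 * (edges[1] - edges[0])
    rho = n / box**3
    return rc, hist / (shell * rho * n / 2)

rng = np.random.default_rng(4)
pos = rng.uniform(0.0, 12.4, size=(512, 3))          # ideal-gas-like test data
rc, g = rdf(pos, box=12.4)
print("g(r) at large r:", round(g[-5:].mean(), 3))
```

For uncorrelated random positions, as used in this check, $g(r)$ fluctuates around 1 at all distances; for real oxygen coordinates the familiar solvation-shell structure appears.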

Now we construct the DeePCG model. We use oxygen as the CG particle. We define the local environment of an O atom with the same cutoff radius adopted in the DeePMD model, i.e., *R*_{c} = 6 Å. We use the full radial and angular information for the 16 CG particles closest to the particle at the origin (see, e.g., particle *j* in Fig. 1), while retaining only radial information for all the other particles within *R*_{c} (see, e.g., particle *k* in Fig. 1). Next, the local environment of each CG particle defines a sub-network, and we use 4 hidden layers with decreasing number of nodes per layer, i.e., 120, 60, 30, and 15 nodes from the innermost to the outermost layer, to construct the corresponding contribution to the CG potential.

The training process minimizes $\hat{L}^{\mathrm{ins}}(w)$ defined in Eq. (9). The force on each oxygen in the atomistic model serves as the instantaneous estimator $\mathcal{F}_i$ in Eq. (7). We employ the stochastic gradient descent method with the Adam optimizer^{51} to update the parameters of each layer, with a batch size of 4 and a learning rate that exponentially decays with the training step. In our current implementation, the training process requires 15 h on a ThinkPad p50 laptop computer with an Intel Core i7-6700HQ CPU and 32 GB memory. The DeePMD-kit^{52} is used for optimizations and MD simulations of both the DeePMD and the DeePCG models.

After training, we perform an NVT simulation on the CG variables. The initial snapshot for this simulation is taken directly from a snapshot selected along the AIMD trajectory. The CG force is generated directly by the analytical gradient of the CG potential, the volume and the temperature are the same as in the AIMD simulation, and the temperature is controlled using a Langevin thermostat with a damping time *τ* = 0.1 ps. In addition, using the same strategy, we perform an NVT simulation on 512 CG variables, where the only difference is that the number of CG variables and the size of the simulation region are 8 times larger than those of the AIMD simulation.
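The Langevin-thermostatted NVT sampling described above can be sketched as below, with a placeholder harmonic force standing in for the gradient of the trained CG potential. The integrator splitting and every parameter except the damping time $\tau$ are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(5)
kT, m, tau, dt = 1.0, 1.0, 0.1, 0.005      # damping time tau as in the text
gamma = 1.0 / tau
n = 256
x = np.zeros(n)
v = rng.normal(scale=np.sqrt(kT / m), size=n)
c1 = np.exp(-gamma * dt)                    # exact Ornstein-Uhlenbeck decay
c2 = np.sqrt((1.0 - c1 * c1) * kT / m)      # matching noise amplitude

def force(x):
    return -x                               # placeholder for -grad U^w

ke = []
for step in range(20_000):
    v += 0.5 * dt * force(x) / m            # velocity Verlet half kick
    x += dt * v
    v += 0.5 * dt * force(x) / m
    v = c1 * v + c2 * rng.normal(size=n)    # thermostat (OU) step
    if step >= 2_000:                       # discard equilibration
        ke.append(0.5 * m * np.mean(v * v))
kT_measured = 2.0 * np.mean(ke)             # equipartition: <m v^2> = kT per DOF
print("measured kT:", round(kT_measured, 3))
```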

## IV. DISCUSSION

In Figs. 3–5, we show that the DeePCG model reproduces very well the oxygen correlation functions of the atomistic DeePMD model and, by extension, of the underlying AIMD model. In addition to comparing 2- and 3-body correlations, as done in standard protocols,^{19,20} we also perform tests on how well the DeePCG model preserves higher order distribution properties. In this regard, we calculate the sample averaged local Steinhardt bond order parameters $q\xaf4$ and $q\xaf6$ and find satisfactory agreement between the DeePCG and DeePMD models.

In the example that we discussed above, we use $\hat{L}^{\mathrm{ins}}(w)$ to optimize a CG model of water. We find that basing the optimization on $\hat{L}(w)$, defined in Eq. (8), is significantly less efficient. This is because, when the oxygens are the CG variables, very long constrained simulations using Eq. (6) are required to sample exhaustively the allowed configurations of the hydrogen bond network (HBN). Typically, when the oxygen positions are fixed, as in a constrained simulation, different HBN configurations are compatible with the fixed oxygen configuration, but it takes a very long time, typically of the order of a few nanoseconds, for the system to sample different HBN configurations. This is a consequence of the long-range correlations imposed on the HBN by the Pauling ice rules (i.e., each oxygen has two nearer and two more distant hydrogen neighbors).^{53} Thus, the scheme used here, matching the instantaneous forces on the fly, is much more efficient.

It is well-known that neural network models are highly nonlinear functions of the parameters $w$. Multiple local minima exist in the landscape of the loss functions *L*($w$) or *L*^{ins}($w$). Indeed, different initializations of the weights often lead to different local minimizers of the loss function. This, however, does not seem to be a serious problem as demonstrated by the test described below.

In this test, we prepare 1000 configurations randomly selected from the DeePMD data and use the oxygen positions to define the CG particle configurations. For a CG particle *i* in each configuration, we define the model deviation $\Sigma_i$ as the standard deviation, across models, of the force $F_i^w$ predicted on CG particle *i* by CG models that differ among themselves only in the initialization of the training procedure, i.e.,

$$\Sigma_i = \sqrt{\Big\langle \big\| F_i^w - \langle F_i^w \rangle_w \big\|^2 \Big\rangle_w}, \qquad (15)$$

where the ensemble average ⟨⋯⟩_{$w$} is taken with respect to models obtained from the same training process and the same training data set,^{56} but different initialization of the parameters $w$. In this way, 64 000 instances of the model deviation Σ_{i} are computed, and they are used to show the consistency of the predictions of different DeePCG models quantitatively. As shown by Fig. 6, with DeePMD data corresponding to 6 independent 2.5 ns-long trajectories, 99.3% of the model deviations, i.e., a large majority of them, are below 50 meV/Å. Moreover, the deviations do not become more significant when the magnitude of the CG force is large (inset in Fig. 6). Therefore, the differences of the CG forces predicted by different DeePCG models are generally consistent. Indeed the configurational distribution functions generated by DeePCG models that differ only in the initialization are indistinguishable. If we use shorter trajectories, the model deviations increase, as shown in Fig. 6 for DeePMD data corresponding to 2 independent 2.5 ns-long trajectories and for DeePMD data corresponding to a single 2 ns-long trajectory. This confirms that longer trajectories give better approximations of the ensemble average for *L*^{ins}($w$).
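The model-deviation statistic itself is straightforward to compute once an ensemble of force predictions is available. In the sketch below the ensemble is synthetic (truth plus small independent per-model errors), standing in for models trained from different initializations; the sizes mirror the test above (4 models, 1000 configurations, 64 particles).

```python
import numpy as np

rng = np.random.default_rng(6)
n_models, n_conf, n_part = 4, 1000, 64
true_F = rng.normal(size=(n_conf, n_part, 3))
# each "model": truth + small independent error, mimicking different minimizers
preds = true_F[None] + 0.02 * rng.normal(size=(n_models, n_conf, n_part, 3))

mean_F = preds.mean(axis=0)
# per-(configuration, particle) standard deviation of the force across models
sigma = np.sqrt(np.mean(np.sum((preds - mean_F) ** 2, axis=-1), axis=0))
frac_small = np.mean(sigma < 0.1)
print("fraction of the 64 000 deviations below threshold:", frac_small)
```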

In terms of computational cost and scalability, in the current implementation, DeePCG is 7.5 times faster than DeePMD. Since all the physical quantities in DeePCG are sums of local contributions, once trained, the DeePCG model can be directly applied to much larger systems with linear scaling of the cost. To test the reliability of DeePCG for larger systems, we perform a 1 ns-long NVT coarse-grained MD simulation on a system containing 512 water beads. This system is at the same temperature as the original DeePMD data but is 8 times larger than the system used to construct the DeePCG model. The corresponding RDF, as shown in Fig. 3, is only very slightly less structured than the DeePCG result with 64 water beads and tends to unity at large separation over a longer range, as expected. This is consistent with the result in Ref. 54, which shows that the pair correlation function is almost converged in a 64-water fixed-cell system and that larger cells only loosen the structure very slightly.

Finally, as in the case of the deep potential and DeePMD schemes,^{30,31} in our implementation discontinuities are present in the forces, due to the adoption of a sharp cutoff radius, the restriction of angular information to a fixed number of atoms, and abrupt changes in the atomic lists caused by sorting. Upon training, these discontinuities become much smaller than the thermal fluctuations and can be subsumed in the thermal noise applied by the stochastic thermostat used to sample the canonical ensemble. The accuracy of the canonical distributions of the CG variables reported in Figs. 3–5, relative to the corresponding canonical distributions in the underlying atomistic simulation, validates our approach. While irrelevant for canonical sampling, the discontinuities make the CG potential a piecewise continuous function of the CG coordinates, whereas in principle it should be a fully continuous function. We have recently generalized the present neural network representation in order to construct potentials that are fully continuous both in the space of the microscopic variables and in that of the CG degrees of freedom.^{55} We leave to future work a discussion of applications using the fully continuous (smooth) version of our approach.

## V. CONCLUSION AND FUTURE WORK

In summary, DeePCG is a promising tool for parameterizing the CG potential and sampling CG configurations via MD. Due to the generality of the procedure adopted to construct the CG potential function, we expect DeePCG to be useful for a wide variety of tasks. In the case of water, we note that one reason for the great success of the mW potential is that it allows us to accelerate ice nucleation by several orders of magnitude because the absence of the hydrogen coordinates in the CG coordinate set eliminates the constraint imposed by the Pauling ice rules.^{18} It would be interesting to investigate whether the CG water model introduced in this paper could describe not only the liquid but also the crystalline ice phase and whether the freezing temperature of the CG model could approximate closely that of the underlying microscopic model. Direct ice nucleation studies would be greatly facilitated by the CG model.

Coarse-grained models are often used to describe the conformations of polymers, represented, for example, by a sequence of beads and springs. Until now, these models have typically been constructed phenomenologically, by requiring that a small set of force constants match experimental and/or molecular simulation data. The DeePCG model presented here has the potential to completely eliminate phenomenological assumptions, such as the restriction to harmonic spring interactions, by systematically constructing a many-body potential for the beads that depends on their configurations. We leave these studies, and a more rigorous investigation of the dynamical properties of the CG models, to future work.

## ACKNOWLEDGMENTS

The work of L. Zhang, J. Han, and W. E is supported in part by ONR Grant No. N00014-13-1-0338, DOE Grant Nos. DE-SC0008626 and DE-SC0009248, and NSFC Grant Nos. U1430237 and 91530322. The work of R. Car is supported in part by DOE-SciDAC Grant No. DE-SC0008626. The work of H. Wang is supported by the National Science Foundation of China under Grant Nos. 11501039 and 91530322, the National Key Research and Development Program of China under Grant Nos. 2016YFB0201200 and 2016YFB0201203, and the Science Challenge Project No. JCKY2016212A502.

### APPENDIX: DEFINITION OF THE LOCAL AVERAGED STEINHARDT PARAMETERS

The bond orientational order of particle *i* (atom or molecule) in a condensed environment is often described by a local Steinhardt parameter *q*_{l}(*i*),^{1} defined as

$$q_l(i) = \sqrt{\frac{4\pi}{2l+1}\sum_{m=-l}^{l}\big|q_{lm}(i)\big|^2}, \qquad (A1)$$

with

$$q_{lm}(i) = \frac{\sum_{j \in N_b(i)} s(r_{ij})\, Y_{lm}(\hat{r}_{ij})}{\sum_{j \in N_b(i)} s(r_{ij})}. \qquad (A2)$$

Here $N_b(i)$ denotes the set of neighbors of particle *i*, $Y_{lm}(\hat{r}_{ij})$ are spherical harmonics, and $s(r_{ij})$ is a switching function that equals 1 for $r_{ij} \le r_{\min}$ and decays smoothly to 0 at $r_{ij} = r_{\max}$.

In this work, we take $r_{\min}$ = 0.31 nm and $r_{\max}$ = 0.36 nm and adopt the modification of the local Steinhardt parameter proposed by Lechner and Dellago,^{50} which is more sensitive than the original bond order parameter in distinguishing different crystal structures. The modified Steinhardt parameter is defined by

$$\bar{q}_l(i) = \sqrt{\frac{4\pi}{2l+1}\sum_{m=-l}^{l}\big|\bar{q}_{lm}(i)\big|^2}, \qquad (A3)$$

with

$$\bar{q}_{lm}(i) = \frac{1}{|\tilde{N}_b(i)|}\sum_{k \in \tilde{N}_b(i)} q_{lm}(k), \qquad (A4)$$

where $\tilde{N}_b(i)$ includes $N_b(i)$ and the tagged particle *i* itself. In the full expansion of the local averaged Steinhardt parameters, 4-body terms like $Y_{lm}(\hat{r}_{ik}) \cdot Y_{lm}(\hat{r}_{jl})$, with *i* ≠ *j* ≠ *k* ≠ *l*, appear. Therefore, the distribution of the values of the local averaged Steinhardt parameters includes the effect of 4-body angular correlations.
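The Lechner-Dellago averaging step can be sketched given precomputed per-particle $q_{lm}$ vectors; the spherical-harmonic evaluation over neighbor lists is omitted here, and the arrays below are synthetic placeholders.

```python
import numpy as np

def qbar_l(q_lm, neighbors, l):
    """Average each particle's complex q_lm over its neighbors plus itself,
    then return the rotationally invariant magnitude qbar_l."""
    n = len(q_lm)
    qbar_lm = np.array([q_lm[[i] + list(neighbors[i])].mean(axis=0)
                        for i in range(n)])
    return np.sqrt(4.0 * np.pi / (2 * l + 1) * np.sum(np.abs(qbar_lm) ** 2, axis=1))

# synthetic check: identical q_lm on every particle is a fixed point of averaging
rng = np.random.default_rng(7)
v = rng.normal(size=9) + 1j * rng.normal(size=9)      # l = 4 -> 2l + 1 = 9
q_lm = np.tile(v, (5, 1))
nbrs = [[j for j in range(5) if j != i] for i in range(5)]
out = qbar_l(q_lm, nbrs, l=4)
print(np.round(out, 6))
```

In a crystal-versus-liquid analysis, this neighborhood averaging is what suppresses thermal noise in $q_l$ and sharpens the separation between the phases.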

## REFERENCES

^{56}The training data set is the same, but the batches used at each step of the Adam iteration are picked from the data set randomly and independently for different models.