Machine learned reactive force fields based on polynomial expansions have been shown to be highly effective for describing simulations involving reactive materials. Nevertheless, the highly flexible nature of these models can give rise to a large number of candidate parameters for complicated systems. In these cases, reliable parameterization requires a well-formed training set, which can be difficult to achieve through standard iterative fitting methods. Here, we present an active learning approach based on cluster analysis and inspired by Shannon information theory to enable semi-automated generation of informative training sets and robust machine learned force fields. The use of this tool is demonstrated for development of a model based on linear combinations of Chebyshev polynomials explicitly describing up to four-body interactions, for a chemically and structurally diverse system of C/O under extreme conditions. We show that this flexible training database management approach enables development of models exhibiting excellent agreement with Kohn–Sham density functional theory in terms of structure, dynamics, and speciation.

## I. INTRODUCTION

Machine Learning (ML) has gained significant traction in the force field development community,^{1–3} largely due to its versatility and potential to significantly decrease the human effort required to generate high-fidelity, complex models. ML methods are well suited for problems demanding first-principles-level accuracy in conjunction with the computational efficiency of force field-based methods. This is particularly true for materials under extreme conditions (e.g., thousands of K and 10s–100s of GPa), which can be highly reactive and exhibit phase separation over several nanoseconds, leading to chemical and structural heterogeneity on several-nanometer scales (e.g., in the case of reaction-driven carbon condensation in shock-compressed materials^{4–6}). In these cases, machine learned interatomic potentials can offer an ideal balance between predictive power and computational efficiency, allowing simulations to more closely approach experimental time and length scales than is possible with quantum simulations alone.

Machine learning can be leveraged in a diversity of manners for model development, ranging from the model and/or mathematical representation^{7,8} to the scheme used for training^{9–14} and data generation.^{15–19} Though powerful, each of these applications has significant associated challenges. For example, the literature contains several success stories surrounding neural network- and polynomial expansion-based interatomic interaction potentials, but these models can be enormous (e.g., comprised of thousands to tens of thousands of parameters), and thus, highly susceptible to overfitting. Generating quality training data for these cases is difficult because the immense amount of training data needed to prevent overfitting makes human inspection challenging (i.e., determining what the data “look like” is not a trivial matter). Moreover, these data are often obtained from molecular dynamics (MD) trajectories, which can yield correlated configurations on computationally accessible timescales.

Active learning (AL) has become a popular means of managing these challenging training problems. Typically, this concept centers around the manner in which a training database is updated and maintained and can be accomplished through a variety of methods.^{20} In general, these approaches involve an iterative framework [i.e., initial models are fit to density functional theory (DFT)-MD data, and subsequent refinements are made via single-point DFT calculations on frames generated through MD simulations with the *i*th model and subsequent addition to the training database, from which the *i* + 1th model is generated] and also include a scheme for deciding which new configurations should be added to the central training database. The first such method applied to force field development was based on the concept of “committee-driven” decisions, where configurations yielding large disagreement among multiple models fit to distinct subsets of a training set are added to the central database.^{15,18,19} Other successful methods include “distance” based decisions, where configurations exhibiting a great enough distance from previous configurations in fingerprint space are added to the central database,^{16} as well as approaches arising from design of experiments,^{17} etc.^{21} The success and efficiency of each of these methods are strongly linked to the nature of the target model and its implementation. For example, committee-based active learning is best used with models of high computational efficiency but relatively slow non-linear fitting as the final “model” requires evaluating each run-time simulation frame with the entire “committee.”

Here, we are concerned with developing an active learning approach well suited for parametrically linear machine learned models (e.g., SNAP,^{12} ChIMES,^{22} Shapeev's moment tensor potentials,^{23} etc.), for which the parameter solution step is rapid compared to parametrically nonlinear models. We present an alternative active learning approach leveraging cluster analysis in conjunction with concepts from Shannon information theory^{24} and demonstrate its application to one such model with explicit two-, three-, and four-body interactions (i.e., ∼4000 parameters, and thus highly susceptible to overfitting) for a C/O system under reactive conditions. The following sections provide the model functional form and a description of the proposed AL approach. The resulting models are benchmarked against DFT in terms of predicted small-molecule chemistry, pressure, diffusion coefficients, atomic forces, and molecular and overall system energetics.

In this section, we provide an overview of ChIMES (i.e., the testbed machine learned force field for this work), revisit a fitting problem for which the necessary model complexity precluded generating a ChIMES force field through the standard iterative approach, and present the active learning framework developed to overcome this challenge.

### A. The ChIMES force field

The recently developed machine learned Chebyshev Interaction Model for Efficient Simulation (ChIMES) provides an excellent testbed for the proposed active learning framework. ChIMES models are comprised of linear combinations of Chebyshev polynomials explicitly describing many-body interactions, where the high degree of flexibility afforded by the basis makes ChIMES highly suitable for problems in chemistry and capable of "quantum accuracy." As a consequence of this polynomial basis, ChIMES models are completely linear in their fitted coefficients and thus rapidly parameterizable and computationally efficient, with cost that scales linearly with system size. As will be described in greater detail below, ChIMES force fields are fit to forces (and optionally energies and stresses) arising from Kohn–Sham density functional theory (DFT) simulations of 2–20 ps and have typically leveraged 2 + 3-body interaction terms together with an iterative refinement scheme: frames from molecular dynamics (MD) simulations with the *i*th ChIMES force field are occasionally sent back to DFT for single-point calculation and combined with the training database, from which an *i* + 1th ChIMES model is generated; the cycle is repeated until the desired model performance is achieved. This iterative approach has worked well for molten carbon,^{22} ambient water,^{25} and dissociative carbon monoxide,^{26,27} where in all cases species were small and chemistry, when present, was rapid. However, upon application to systems in which larger and more complex species form, shortcomings arising from the 2 + 3-body ChIMES many-body truncation have been identified.^{26} Though the ChIMES equations can be easily extended to include higher-bodied interactions, the resulting increase in model complexity necessitates intelligent and automated model development tools.

The generalized ChIMES potential energy equation is given by

$$E_{n_B} = \sum_{i_1}^{n_a} {}^{1}E_{i_1} + \sum_{i_1 > i_2}^{n_a} {}^{2}E_{i_1 i_2} + \sum_{i_1 > i_2 > i_3}^{n_a} {}^{3}E_{i_1 i_2 i_3} + \cdots + \sum_{i_1 > i_2 \cdots > i_{n_B}}^{n_a} {}^{n_B}E_{i_1 i_2 \ldots i_{n_B}}, \tag{1}$$

where $E_{n_B}$ is the total ChIMES system energy, *n*_{B} is the maximum bodiedness, ${}^{n}E_{i_1 i_2 \ldots i_n}$ is the *n*-body ChIMES energy for a given set of *n* atoms with indices **i** = {*i*_{1}, *i*_{2}, …, *i*_{n}}, and *n*_{a} is the total number of atoms in the system.

In the ChIMES framework, single-body energies are constant values and *n*-body energies are constructed from the product of polynomials of transformed atom pair distances. Thus, a 2-body interaction would involve a single pair, *ij*, while a three-body interaction would involve three pairs, *ij*, *ik*, and *jk*, a 4-body interaction would involve $\binom{4}{2} = 6$ pairs, and so on. Taking a 3-body interaction as an example, we define the following: **A** = {*i*, *j*, *k*} is the index over atoms within an interaction cluster, with the corresponding set of pairs given by **P** = {*ij*, *ik*, *jk*}, their element pair types by **E** = {*e*_{i}*e*_{j}, *e*_{i}*e*_{k}, *e*_{j}*e*_{k}}, and the polynomial orders for each pair given by **O** = {*α*, *β*, *γ*}. Analogous conventions are used for a 4-body interaction: **A** = {*i*, *j*, *k*, *l*}, **P** = {*ij*, *ik*, *il*, *jk*, *jl*, *kl*}, **E** = {*e*_{i}*e*_{j}, *e*_{i}*e*_{k}, *e*_{i}*e*_{l}, *e*_{j}*e*_{k}, *e*_{j}*e*_{l}, *e*_{k}*e*_{l}}, and **O** = {*α*, *β*, *γ*, *δ*, *ϵ*, *ζ*}. Two mapping functions are used to relate pair indices **P** to the element pair types and polynomial orders: *m*_{1} = **P** → **E** and *m*_{2} = **P** → **O**. The index *y* refers to a particular component of **P**, defining an interaction pair.

Using these definitions, we write the generalized ChIMES energy for a cluster of *n* atoms as

$${}^{n}E_{\mathbf{A}} = \sum_{\mathbf{O}}^{*} c_{\mathbf{O}}^{\mathbf{E}} \prod_{y=1}^{\binom{n}{2}} f_s^{m_1(y)}(r_y)\, T_{m_2(y)}\!\left(s_y^{m_1(y)}\right). \tag{2}$$

As given above, the $\sum_{\mathbf{O}}^{*}$ notation indicates a multiple sum for which there are $\binom{n}{2}$ distinct indices, $O_n^{*}$ is the maximum polynomial order for an *n*-body interaction, and the asterisk indicates that a sufficient number of non-zero terms exist that the graph formed by the edges of interacting atoms connects all *n* atoms, which guarantees a true *n*-body interaction. $T_{m_2(y)}\!\left(s_y^{m_1(y)}\right)$ is a Chebyshev polynomial of order *m*_{2}(*y*) that depends on the pair distance $s_y^{m_1(y)}$ for pair *y* of atom types *m*_{1}(*y*), which has been transformed from *r*_{y} to ensure it lies in the [−1, 1] domain over which Chebyshev polynomials are defined, and $f_s^{m_1(y)}(r_y)$ is a cutoff function that ensures smooth behavior at the outer cutoff. For the special case of a two-body interaction, one has

$${}^{2}E_{ij} = f_p(r_{ij}) + f_s^{e_i e_j}(r_{ij}) \sum_{\alpha=1}^{O_2} c_{\alpha}^{\,e_i e_j}\, T_{\alpha}\!\left(s_{ij}^{e_i e_j}\right), \tag{3}$$

where *f*_{p} is a short-ranged repulsive interaction described later. Higher-body interactions follow the form of Eq. (2). We have for a three-body interaction,

$${}^{3}E_{ijk} = f_s^{e_i e_j}(r_{ij})\, f_s^{e_i e_k}(r_{ik})\, f_s^{e_j e_k}(r_{jk}) \sum_{\alpha=0}^{O_3}\sum_{\beta=0}^{O_3}\sum_{\gamma=0}^{O_3}{}^{*}\; c_{\alpha\beta\gamma}^{\,e_i e_j\, e_i e_k\, e_j e_k}\, T_{\alpha}\!\left(s_{ij}^{e_i e_j}\right) T_{\beta}\!\left(s_{ik}^{e_i e_k}\right) T_{\gamma}\!\left(s_{jk}^{e_j e_k}\right), \tag{4}$$

and for a four-body interaction,

$${}^{4}E_{ijkl} = \prod_{y=1}^{6} f_s^{m_1(y)}(r_y) \sum_{\alpha=0}^{O_4}\sum_{\beta=0}^{O_4}\sum_{\gamma=0}^{O_4}\sum_{\delta=0}^{O_4}\sum_{\epsilon=0}^{O_4}\sum_{\zeta=0}^{O_4}{}^{*}\; c_{\alpha\beta\gamma\delta\epsilon\zeta}^{\,\mathbf{E}}\, T_{\alpha}\!\left(s_{ij}^{e_i e_j}\right) T_{\beta}\!\left(s_{ik}^{e_i e_k}\right) T_{\gamma}\!\left(s_{il}^{e_i e_l}\right) T_{\delta}\!\left(s_{jk}^{e_j e_k}\right) T_{\epsilon}\!\left(s_{jl}^{e_j e_l}\right) T_{\zeta}\!\left(s_{kl}^{e_k e_l}\right). \tag{5}$$

Transformed pair distances $s_y^{m_1(y)}$ are obtained via^{8}

$$s_y^{m_1(y)} = \frac{e^{-r_y/\lambda_{m_1(y)}} - \frac{1}{2}\left(x_{\text{in}} + x_{\text{out}}\right)}{\frac{1}{2}\left(x_{\text{in}} - x_{\text{out}}\right)}, \qquad x_{\text{in/out}} = e^{-r_{c,\text{in/out}}^{m_1(y)}/\lambda_{m_1(y)}}, \tag{6}$$

where *r*_{y} is the pair distance, $\lambda_{m_1(y)}$ can be considered a characteristic bonding distance, typically set to the location of the first peak in the DFT radial distribution function for the *m*_{1}(*y*) atom pair type, and $r_{c,\text{in}}^{m_1(y)}$/$r_{c,\text{out}}^{m_1(y)}$ are the corresponding inner/outer cutoff radii. We note that Eq. (6) enforces a natural decrease in interaction strength as the distance is increased, and increasing $\lambda_{m_1(y)}$ has the effect of decreasing the rate of interaction decay as $r_{c,\text{out}}^{m_1(y)}$ is approached, in a Morse-like^{8} fashion.
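As a concrete illustration, the Morse-like transform described above can be sketched in a few lines. The function below is a hypothetical helper (not part of the ChIMES distribution) that maps a raw pair distance onto the Chebyshev domain, assuming the exponential transform is rescaled linearly between the values it takes at the inner and outer cutoffs:

```python
import math

def transform_distance(r, lam, rc_in, rc_out):
    """Map a pair distance r onto the Chebyshev domain [-1, 1] via a
    Morse-like transform (sketch): x = exp(-r/lambda), linearly rescaled
    so that x(rc_in) -> +1 and x(rc_out) -> -1."""
    x = math.exp(-r / lam)
    x_in = math.exp(-rc_in / lam)
    x_out = math.exp(-rc_out / lam)
    x_avg = 0.5 * (x_in + x_out)
    x_diff = 0.5 * (x_in - x_out)
    return (x - x_avg) / x_diff
```

Because the transform is exponential in *r*, equal increments of the transformed coordinate correspond to progressively larger increments of raw distance, which is the "natural decrease in interaction strength" noted above.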

Finally, $f_s^{m_1(y)}(r_y)$ in Eq. (2) ensures the potential goes smoothly to zero at the outer cutoff, $r_{c,\text{out}}^{m_1(y)}$. In departure from earlier ChIMES work, where $f_s^{m_1(y)}(r_y)$ took on a cubic form, we employ a Tersoff-style cutoff function^{28} in the present work,

$$f_s^{m_1(y)}(r_y) = \begin{cases} 1, & r_y \le d_t, \\[4pt] \dfrac{1}{2} + \dfrac{1}{2}\cos\!\left(\pi\,\dfrac{r_y - d_t}{r_{c,\text{out}}^{m_1(y)} - d_t}\right), & d_t < r_y \le r_{c,\text{out}}^{m_1(y)}, \\[4pt] 0, & r_y > r_{c,\text{out}}^{m_1(y)}, \end{cases} \tag{7}$$

where the threshold distance is given by $d_t = r_{c,\text{out}}^{m_1(y)}(1 - f_O)$, and *f*_{O} is a value in [0, 1], taken to be 0.5 here. This new form exhibits a smooth step, allowing ^{n}*E*_{**A**} to remain unmodified by the smoothing function for all $r_y < d_t$; this is particularly useful for many-body interactions of large *n*, where the product of $\binom{n}{2}$ $f_s^{m_1(y)}(r_y)$ factors is used, which can otherwise severely reduce ^{n}*E*_{**A**} contributions to the total energy for *n* > 2.
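A minimal sketch of such a smooth-step cutoff follows; the half-cosine ramp between the threshold and the outer cutoff is an assumption of this sketch (the implementation's exact functional form may differ), but it reproduces the behavior motivated above: unity below *d*_{t}, zero beyond the outer cutoff, and a smooth step between.

```python
import math

def tersoff_cutoff(r, rc_out, f_o=0.5):
    """Smooth-step cutoff sketch: unity below d_t = rc_out * (1 - f_o),
    a half-cosine ramp between d_t and rc_out, and zero beyond."""
    d_t = rc_out * (1.0 - f_o)
    if r <= d_t:
        return 1.0
    if r >= rc_out:
        return 0.0
    return 0.5 + 0.5 * math.cos(math.pi * (r - d_t) / (rc_out - d_t))
```

With *f*_{O} = 0.5, any pair distance inside half the outer cutoff contributes a factor of exactly 1, so products of many such factors do not erode the many-body energy.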

To prevent spurious close contacts, a penalty term is added to each two-body energy ^{2}*E* in the system,

$$f_p(r_y) = \begin{cases} A_p^{m_1(y)}\left(d_p^{m_1(y)} + r_{c,\text{in}}^{m_1(y)} - r_y\right)^3, & r_y < r_{c,\text{in}}^{m_1(y)} + d_p^{m_1(y)}, \\[4pt] 0, & \text{otherwise}, \end{cases} \tag{8}$$

where $A_p^{m_1(y)}$ is the penalty prefactor and $d_p^{m_1(y)}$ is the penalty initiation distance, set here to 10^{5} kcal/(mol Å^{3}) and 0.01 Å, respectively.

Permutational invariance of the energy is explicitly enforced. In particular, we require ^{n}*E*_{**A**} = ^{n}*E*_{Π**A**}, where Π is a permutation operator acting on the *n* atoms in the cluster. This leads to equality conditions among the coefficients,

$$c_{\mathbf{O}}^{\mathbf{E}} = c_{\Pi\mathbf{O}}^{\Pi\mathbf{E}}. \tag{9}$$

As an example, for a 4-body interaction, we have

$$c_{\alpha\beta\gamma\delta\epsilon\zeta}^{\,e_i e_j\, e_i e_k\, e_i e_l\, e_j e_k\, e_j e_l\, e_k e_l} = c_{\alpha\delta\epsilon\beta\gamma\zeta}^{\,e_i e_j\, e_j e_k\, e_j e_l\, e_i e_k\, e_i e_l\, e_k e_l}, \tag{10}$$

which is derived by permuting atoms *i* and *j*. In our implementation of ChIMES, permutational invariance is enforced by treating permutationally related coefficients as the same unique fitting variable.
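To make the bookkeeping concrete, the sketch below (a hypothetical helper, not the ChIMES source) collapses permutationally related (element-pair, polynomial-order) tuples onto a single canonical key, so that all equivalent coefficients map to one unique fitting variable:

```python
from itertools import combinations, permutations

def canonical_key(atom_types, orders):
    """Return a canonical key for a cluster's per-pair polynomial orders.
    `atom_types` is a tuple of element symbols for the n cluster atoms;
    `orders[y]` is the order for the y-th pair in itertools.combinations
    ordering. Permutationally related tuples yield identical keys."""
    n = len(atom_types)
    pairs = list(combinations(range(n), 2))
    slot = {p: y for y, p in enumerate(pairs)}  # canonical slot per pair
    best = None
    for perm in permutations(range(n)):
        key = [None] * len(pairs)
        for (a, b), order in zip(pairs, orders):
            pa, pb = sorted((perm[a], perm[b]))
            # element types travel with the atoms under the permutation
            types = tuple(sorted((atom_types[a], atom_types[b])))
            key[slot[(pa, pb)]] = (types, order)
        key = tuple(key)
        if best is None or key < best:
            best = key
    return best
```

For a 3-body C/O/O cluster, for instance, swapping the roles of the two oxygen atoms exchanges the *ij* and *ik* orders, and both order tuples reduce to the same key.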

In this work, we will consider the following objective function, which contains terms for per-atom forces and per-configuration energies, though we note that additional terms for the system stress tensor can also easily be included:^{14,22,25,26}

$$F_{\text{obj}} = \sqrt{\frac{1}{n_f\left(3 n_a + 1\right)} \sum_{i=1}^{n_f}\left[ w_{E_i}^2\, \Delta E_i^2 + \sum_{j=1}^{n_a}\sum_{k=1}^{3} w_{F_{ijk}}^2\, \Delta F_{ijk}^2 \right]}, \tag{11}$$

where Δ*X* = *X*^{DFT} − *X*^{ChIMES}{*c*}. *F*_{obj} and {*c*} are the weighted root-mean-squared error (RMSE) and model coefficients, respectively. The number of frames and atoms are given by *n*_{f} and *n*_{a}, respectively, and the factor of 1 in the denominator arises from inclusion of a single per-configuration energy, *E*_{i}. *F*_{ijk} indicates the *k*th Cartesian component of the force on atom *j* in configuration *i*. Units of kcal mol^{−1} Å^{−1} and kcal mol^{−1} and weights of 1.0 and 5.0 were used for forces ($w_{F_{ijk}}$) and energies ($w_{E_i}$), respectively. The superscripts "ChIMES" and "DFT" indicate forces/energies predicted from the present force-matched model and from the DFT molecular dynamics (DFT-MD) training trajectory, respectively.
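The weighted RMSE described above can be evaluated directly from paired DFT/model data; the following minimal sketch (hypothetical function name, weights as in the text) counts 3*n*_{a} force components plus one energy per frame in the denominator:

```python
import math

def f_obj(dft_forces, mod_forces, dft_energies, mod_energies, w_f=1.0, w_e=5.0):
    """Weighted RMSE sketch over per-atom force components and per-frame
    energies. Forces are passed as flat per-frame lists of 3*n_a values;
    each frame contributes its force terms plus one energy term."""
    num, den = 0.0, 0
    for f_dft, f_mod, e_dft, e_mod in zip(dft_forces, mod_forces,
                                          dft_energies, mod_energies):
        for fd, fm in zip(f_dft, f_mod):
            num += (w_f * (fd - fm)) ** 2
            den += 1
        num += (w_e * (e_dft - e_mod)) ** 2
        den += 1
    return math.sqrt(num / den)
```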

Since ChIMES is entirely linear in its fitted parameters, the model optimization problem can be recast as the following over-determined matrix equation:

$$\mathbf{w}\,\mathbf{M}\,\mathbf{c} = \mathbf{w}\,X_{\text{DFT}}, \tag{12}$$

where *X*_{DFT} is the vector of $F_{ijk}^{\text{DFT}}$ and *E*^{DFT} values, $\mathbf{w}$ is a diagonal matrix of weights applied to the elements of *X*_{DFT} and the rows of **M**, and the elements of the design matrix **M** are given by

$$M_{ab} = \frac{\partial X_a^{\text{ChIMES}}}{\partial c_b}. \tag{13}$$

In the above, *a* represents a combined index over force and energy components, and *b* is the index over permutationally invariant model parameters.

To avoid overfitting in determining **c**, we find that a regularization method must be used for most problems. In the present work, the least absolute shrinkage and selection operator (LASSO)^{29} method is used to regularize **c**. The LASSO method minimizes the objective function

$$F_{\text{LASSO}} = \frac{1}{2}\left\|\mathbf{w}\left(X_{\text{DFT}} - \mathbf{M}\,\mathbf{c}\right)\right\|_2^2 + \lambda \sum_{i=1}^{n_p} |c_i|, \tag{14}$$

where *n*_{p} is the total number of unique fitting parameters. The parameter *λ* penalizes large-magnitude coefficients and reduces the likelihood of overfitting. Moreover, the absolute values (L1 norm) used in the LASSO objective function have the effect of setting certain coefficient values *c*_{i} exactly to 0, which can lead to substantial gains in model efficiency.
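To illustrate the sparsifying effect of the L1 penalty, here is a minimal coordinate-descent LASSO sketch for the unweighted problem min 0.5‖*x* − **Mc**‖² + *λ*‖**c**‖₁; note this is a pedagogical stand-in, not the LARS solver used in this work:

```python
import numpy as np

def lasso_cd(M, x, lam, n_iter=200):
    """Coordinate-descent LASSO sketch: cycle over coefficients, applying
    the soft-thresholding update. Coefficients whose correlation with the
    residual falls below lam are set exactly to zero."""
    c = np.zeros(M.shape[1])
    col_sq = (M ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(M.shape[1]):
            # residual with feature j's current contribution removed
            r = x - M @ c + M[:, j] * c[j]
            rho = M[:, j] @ r
            c[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return c
```

LARS is preferred in practice because it traces the entire solution path from large *λ* down to the requested value, allowing intermediate solutions to be inspected.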

We use a locally written code that implements the Least Angle Regression (LARS) algorithm^{30,31} for LASSO. The LARS algorithm proceeds in stages where variables are added or removed one at a time. Each stage in the LARS algorithm is a solution of LASSO optimization for a value of *λ* larger than the requested value. When the requested value is reached, the algorithm terminates. We find that the LARS algorithm has better convergence properties than direct optimization of *F*_{LASSO} and additionally allows the analysis of solutions for each iteration. Our code distributes the ** M** matrix between computing nodes, allowing for solution of large (e.g., >1 TB) problems, and uses rank 1 Cholesky decomposition updates to solve linear equations that arise in LARS.

The standard ChIMES iterative fitting approach is given below. We note that the success of this method hinges upon the notion that the relatively inaccurate models produced during early iterations are more likely to reach unsampled regions of physicochemical space and can thus be considered a means of rare-event sampling. As briefly mentioned above, this approach has been successful for systems of non-reactive small molecules, or those exhibiting rapid chemistry. To generate these models, a simple iterative refinement framework was employed, where (i) training trajectories were obtained from short DFT-MD simulations at the state points of interest for the system, (ii) a model was obtained by minimizing the objective function [i.e., Eq. (14)], (iii) an MD trajectory was launched using this *i*th force field, (iv) configurations from this simulation were periodically sent to DFT for single-point calculation and merged with the existing database, and (v) steps (ii)–(iv) were repeated until the model exhibited the target level of accuracy.
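Schematically, steps (i)–(v) amount to the following loop, where every callable is a placeholder for the corresponding DFT, MD, or fitting machinery (hypothetical names, for illustration only):

```python
def iterative_fit(initial_frames, fit, run_md, label_with_dft, accurate_enough):
    """Sketch of the standard iterative refinement loop: fit, run MD with
    the current model, label sampled configurations with single-point DFT,
    merge them into the training set, and refit until converged."""
    training = list(initial_frames)                      # (i) initial DFT-MD data
    model = fit(training)                                # (ii) minimize the objective
    while not accurate_enough(model):
        frames = run_md(model)                           # (iii) MD with the i-th model
        training += [label_with_dft(f) for f in frames]  # (iv) label and merge
        model = fit(training)                            # (v) (i+1)-th model
    return model
```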

### B. Shortcomings of the standard parameterization approach

The ChIMES model development scheme described above (i.e., iterative fitting) has been found insufficient when at least one of the following two conditions is met: the system undergoes relatively slow chemistry, with species lifetimes exceeding 1 ps (i.e., where kinetics are limited by relatively large reaction barriers), or the system contains structurally complex species, e.g., oligomers, heterocycles, and fused rings, where greater-than-3-body interactions (e.g., intramolecular torsion) have been shown^{26} to play a significant role. As an example, consider a system of 50/50% C/O at 2400 K and 1.79 g cm^{−3}. As shown in Figs. 1(a) and 1(c), DFT simulations predict that both criteria are met for this system. Previously, a ChIMES model was developed with the intention of transferability over *T* = 9350 K, 6500 K, and 2400 K, and *ρ* = 2.56 g cm^{−3}, 2.5 g cm^{−3}, and 1.79 g cm^{−3}.^{26} This model was fit through the standard iterative approach using 2 + 3-body interactions and was found to work well between the 9350 K and 6500 K state points, but failed at 2400 K. As indicated in Figs. 1(b) and 1(c), one of two structural themes emerged during ChIMES MD simulations at 2400 K using models derived from successive iterations: either exclusively small linear species or unphysical ring-like structures featuring highly coordinated oxygen. These two outcomes arose from an insufficiently complex ChIMES interaction model; to a model containing only 2- and 3-body interactions, the coordinated oxygen structure is reasonable, containing bond distances and angles reminiscent of those found in carbon dioxide, ethylene oxide, dimethyl ether, etc. However, when such configurations are iteratively added to the training database, DFT assigns them a high energy, resulting in subsequent models that bias away from cyclic structures. As a consequence, predicted chemistry is far from the DFT-computed values.

The above issues can be resolved through addition of 4-body terms, which in the language of molecular mechanics force fields would allow for description of bonds (2-body), angles (3-body), and dihedrals/impropers (4-body). However, doing so substantially increases the number of parameters considered in the fitting process. For example, in a system with two atom types, polynomial orders of $O_{2\mathrm{B}}/O_{3\mathrm{B}}/O_{4\mathrm{B}} = 12/7/0$ yield a maximum of 806 parameters, whereas $O_{2\mathrm{B}}/O_{3\mathrm{B}}/O_{4\mathrm{B}} = 12/7/3$ (i.e., the polynomial orders used in the present work) yield a maximum of 3978 parameters. Increasing model complexity gives rise to an additional set of problems; beyond the concomitant increase in the risk of over-fitting, a far greater number of iterative cycles is required to generate a converged force field, which is prohibitive for reactive carbon-containing systems (i.e., systems exhibiting significantly higher chemical complexity than those of only O, H, and/or N). In Secs. I C–I F, we describe strategies to overcome these challenges through development of an active learning framework combining cluster analysis and concepts from Shannon information theory.

### C. Active learning framework overview

As described in Sec. I A, development of our active learning framework began with examination of our standard fitting approach. ChIMES training repositories contain both DFT- and iteratively obtained frames from MD trajectories, comprised of 3*n*_{a} coordinates and forces and a single overall system energy. To illustrate the convoluted nature of this fitting problem, consider a hypothetical scenario in which one attempts to fit a $O_{2\mathrm{B}}/O_{3\mathrm{B}}/O_{4\mathrm{B}} = 12/7/3$ force field for a C/O system to only energies from DFT (noting that in practice, forces are always included in a ChIMES fit); doing so requires assignment of 3978 parameters such that the energies for each dimer, trimer, and tetramer sum to the single value for each frame.

The total energy in a given frame arises from the sum of bonded and non-bonded interactions, inter- and intra-molecular interactions, etc. Thus, by decomposing a given frame into a collection of individual atom clusters (i.e., nominal molecules), computing the corresponding DFT forces and energies, and adding them to the training repositories, the resulting fits contain greater information on how to assign the 10s–1000s of parameters giving rise to different 2-, 3-, and 4-body energetic contributions. This concept is similar to generating training configurations from a direct cluster expansion, but can be more effective for molecular systems because the resulting configurations are relevant to both the model and the target physicochemical space. Note that the ChIMES model space can allow formation of unphysical many-body cluster configurations (e.g., three atoms forming an equilateral triangle with all distances equal to the corresponding 3-body inner cutoff). Thus, it is important that the training database contain configurations of this nature to inform the fit of their unfavorability and prevent their spurious appearance during MD simulations. However, this approach is still highly inefficient for several reasons: (1) successive MD frames are highly correlated [e.g., due to the limited exploration of physicochemical space in short-time MD simulations at a fixed temperature(s)], (2) the species contained in each frame can be very similar (e.g., distributed about some chemical and/or conformational minimum energy), and (3) the computational cost of evaluating the possible tens of thousands of species via DFT would be prohibitive from a practical standpoint. To increase the efficiency of our fits, we aim to increase the information contained in the training database while simultaneously maintaining a minimal size, by developing a method for selecting subsets of possible species.
As will be discussed in greater detail in the remaining sections, we do so by defining a “feature” for each species, simply taken to be *E*_{ChIMES,i}, the energy per species atom as computed by the *i*th ChIMES force field; using this construct, the information contained in any given subset of species is maximized when the corresponding probability distribution of *E*_{ChIMES,i} values is flattened. These concepts can be combined with the standard ChIMES iterative fitting procedure to form the active learning cycle shown in Fig. 2. We note the distinction between active learning (“AL”), represented by the orange components of the cycle in Fig. 2, and an active learning cycle (“ALC”), which involves *both* iterative refinement and active learning, i.e., the blue *and* orange components in Fig. 2.

### D. Cluster-based species identification

Figure 3 provides a pictorial representation of the active learning (i.e., orange) portion of the overall fitting scheme given in Fig. 2. The first step of this process is extraction of all possible molecular species from the MD trajectory, i.e., step “a” in both Figs. 2 and 3. In this section, we describe the simple clustering approach through which this is achieved. We note that our choice of the clustering method and criteria is such that the resulting clusters resemble molecules, intermediates, and transition state species. High accuracy condensed phase chemistry is a central ChIMES goal and requires both molecules and “transition”/intermediate species be well described—capturing the former is critical for recovering reaction energy minima and thus speciation, while the latter influences reaction barriers, and thus predicted lifetimes. To ensure both types of species are identified, we use a simple double-pass clustering approach.

The first pass is used to identify nominal molecules, where a relatively short-ranged distance-based criterion is identified for each possible pair type; atoms are then considered part of the same cluster if their distance to any other cluster member is within this "tight" criterion. The second pass identifies nominal transition-state species and uses looser distance criteria; typically, this pass results in a smaller number of clusters with larger overall sizes. Because species identified in the second pass may be identical to those in the first pass (i.e., in the case of well-separated molecules), we take the final set, **s**, of identified species as those unique across both passes, noting that in our algorithm, two clusters are considered identical only if they are obtained from the same frame with the same atomic coordinates and indices. In this work, values of 1.9 Å, 1.8 Å, and 1.7 Å are used as the "tight" criteria for CC, CO, and OO pairs, respectively (i.e., the location of the first minimum in each pair radial distribution function), while "loose" criteria were taken as ≈117% of these values (i.e., corresponding roughly to the midpoint between the first minimum and second maximum). Species were extracted from 250 evenly spaced frames of the 6 ps MD simulations.
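Each clustering pass reduces to a connected-component analysis under pair-type distance cutoffs. Below is a minimal single-pass sketch (hypothetical helper; periodic minimum-image handling is omitted for brevity); in the double-pass scheme, it would be run once with the "tight" and once with the "loose" cutoffs, keeping the unique clusters across both passes:

```python
from itertools import combinations

def find_clusters(coords, types, cutoffs):
    """Group atoms into clusters via union-find: two atoms join the same
    cluster when their distance is within the cutoff for their pair type.
    `cutoffs` maps a sorted element pair, e.g. ('C', 'O'), to Å."""
    parent = list(range(len(coords)))

    def root(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(coords)), 2):
        d = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])) ** 0.5
        if d <= cutoffs[tuple(sorted((types[i], types[j])))]:
            parent[root(i)] = root(j)

    clusters = {}
    for i in range(len(coords)):
        clusters.setdefault(root(i), []).append(i)
    return sorted(clusters.values())
```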

### E. Monte Carlo for cluster subset selection

Once the set of candidate clusters have been extracted from the MD trajectory as described above, the next task is identification of a much smaller subset of those clusters, which are maximally informative to the fit (i.e., steps b and c in Figs. 2 and 3). This batch mode pool-based process^{20} is critical from an efficiency standpoint as it drastically reduces the number of single-point DFT calculations required during addition to the training database (i.e., step d); moreover, in selecting this subset by maximizing information, fewer active learning cycles (i.e., Fig. 2) will be required to achieve a converged result.

To achieve this, we look to statistical mechanics (SM) and Shannon information theory (IT). IT provides a measure of the information entropy (*I*) contained in a given dataset according to *I* = −∑_{i} *p*_{i} ln *p*_{i}, where *p*_{i} is the probability of observing the *i*th value; the information contained in the dataset is said to increase with increasing *I*. Note that this definition of *I* differs from the statistical mechanical definition of entropy by only a factor of *k*_{B}, where in SM, *p*_{i} is the probability of observing the *i*th energy state; in both cases, information (entropy) is maximized when the distribution of *p*_{i} is uniform. In the present work, we are concerned with obtaining a subset of all extracted clusters that will maximally inform our fits. According to SM, a subset constructed via *random selection* from a larger set will exhibit the same energetic distribution (e.g., Boltzmann) as its parent set. Randomly *generated* clusters would avoid this problem; however, generation of such a set would be computationally inefficient, as our training data need only contain clusters of relevance to the model domain. Thus, we have implemented an intermediate approach, which combines the SM and IT definitions of entropy to establish that the down-selected subset of fixed size *n*_{sel} possessing the maximum possible information in our feature space (i.e., *E*_{ChIMES,i}, the energy per cluster atom predicted by the *i*th ChIMES force field) is the one yielding a uniform probability distribution. Because our goal is selection of a finite number of species from a set, our probability distributions are comprised of bins with finite width, where high-complexity species within a given bin can be structurally or conformationally isoenergetic.
Thus, we devise a simple Monte Carlo (MC) approach for cluster selection; by using a statistical method (i.e., rather than selection of clusters from each bin in a single pass), we ensure a degree of randomness across selected species contained in each distribution bin.
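The information measure driving this selection can be computed directly from the binned feature values; a minimal sketch using direct summation over histogram bins:

```python
import math

def shannon_information(values, n_bins):
    """I = -sum(p_i * ln p_i) over a histogram of the feature values;
    maximized (= ln n_bins) when the binned distribution is flat."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against all-equal values
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return -sum(c / len(values) * math.log(c / len(values))
                for c in counts if c > 0)
```

A sharply peaked distribution yields *I* well below ln *n*_{bins}, so tracking *I* during selection directly reports how flat, and thus how informative, the current subset is.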

Prior to the start of the Monte Carlo (MC) selection process, clusters extracted from the trajectory of interest are stored in a set (**s**) of size *n*, as shown in Fig. 3. An energy ($E_{\text{cluster}}^{\text{ALC-}X}$) is assigned to each of the clusters in **s** with the *X*th ChIMES model. The MC process aims to select a subset (*s*_{sel}) of *n*_{sel} clusters from **s** that yields a flat distribution in energy. To initiate the process, *n*_{sel} random clusters are moved from **s** to *s*_{sel}, and the remainder are put into a pool ("*s*_{pool}") containing *n*_{pool} = *n* − *n*_{sel} clusters. A histogram (**h**) of cluster energies in *s*_{sel} is then constructed, defined over [min(**s**), max(**s**)], with *n*_{bins} bins. We note that Secs. II B and II C will explore alternative definitions of **h** and the effect of changing *n*_{bins}. Figure 3 shows that, as expected, this initial histogram is sharply peaked (i.e., the orange line) and closely resembles the histogram obtained by considering all possible clusters (i.e., the blue line). In the remainder of this section, we discuss how this histogram shape, indicating that *s*_{sel} contains many similar clusters, can be flattened to evolve *s*_{sel} into a maximally informative subset of species for subsequent addition to our training database (i.e., step d).

The MC selection process proceeds as follows: (1) a random cluster (*i*_{old}) having associated energy $E_{i_\text{old}}^{\text{ALC-}X}$ is selected from *s*_{sel}, and its probability is taken as $p_{i,\text{old}} = h(E_{i_\text{old}}^{\text{ALC-}X})/n_{\text{bins}}$; (2) a random cluster (*i*_{new}) having associated energy $E_{i_\text{new}}^{\text{ALC-}X}$ is selected from *s*_{pool}, and its probability is taken as $p_{i,\text{new}} = h(E_{i_\text{new}}^{\text{ALC-}X})/n_{\text{bins}}$; (3) if $(1.0 + p_{i,\text{new}} - p_{i,\text{old}})/2 > \text{rand}[0,1]$, the attempted move is rejected; otherwise, the move is accepted (i.e., *i*_{old} is moved to *s*_{pool}, *i*_{new} is moved to *s*_{sel}, and **h** is recomputed). rand[*a*, *b*] is defined to be a uniformly distributed random number between *a* and *b*. We note that many MC acceptance criteria could be used here and that ours is simply intended to bias toward selections yielding a flat probability distribution (i.e., equivalent histogram bins). Specifically, our MC acceptance rule was chosen to encourage a flat histogram (i.e., we aim to have the value of each histogram bin equal 1/*n*_{bins}) and to encourage acceptance of moves where *p*_{i,new} is significantly less than *p*_{i,old}, which expedites histogram flattening. This process is repeated for the total requested number of MC cycles (*n*_{cyc}), where one cycle is equal to the number of requested Monte Carlo steps divided by *n*, allowing each cluster in **s** to be considered for selection at least *n*_{cyc} times, in principle. Typically, *n*_{cyc} is set to *n*_{bins}/10.
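Steps (1)–(3) can be sketched as follows; this is a simplified stand-in for our implementation, and the bin-probability normalization used here (bin count over *n*_{sel}) is an assumption of the sketch:

```python
import random

def mc_select(energies, n_sel, n_bins, n_steps, seed=0):
    """Histogram-flattening Monte Carlo selection sketch: repeatedly
    propose swapping a member of s_sel with one from s_pool, rejecting
    with probability (1 + p_new - p_old)/2, so that moves out of crowded
    bins into sparse bins are favored. Returns selected cluster indices."""
    rng = random.Random(seed)
    lo, hi = min(energies), max(energies)
    width = (hi - lo) / n_bins or 1.0
    bin_of = lambda e: min(int((e - lo) / width), n_bins - 1)

    idx = list(range(len(energies)))
    rng.shuffle(idx)
    sel, pool = idx[:n_sel], idx[n_sel:]
    hist = [0] * n_bins
    for i in sel:
        hist[bin_of(energies[i])] += 1

    for _ in range(n_steps):
        a = rng.randrange(len(sel))    # candidate to demote from s_sel
        b = rng.randrange(len(pool))   # candidate to promote from s_pool
        p_old = hist[bin_of(energies[sel[a]])] / n_sel
        p_new = hist[bin_of(energies[pool[b]])] / n_sel
        if (1.0 + p_new - p_old) / 2.0 > rng.random():
            continue                   # move rejected
        hist[bin_of(energies[sel[a]])] -= 1
        hist[bin_of(energies[pool[b]])] += 1
        sel[a], pool[b] = pool[b], sel[a]
    return sel
```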

The Monte Carlo process is rapid, contributing only ≈5 min to the overall time required for any given ALC. We note that the MC portion was run on a single Intel Xeon E5-2695 core. Users must specify the following options to execute the above MC algorithm: *n*_{sel}, *n*_{cyc}, the number of histogram bins (*n*_{bins}), and the specific definition of ** h** utilized. The two former options are set to 400 and 2 in this work unless otherwise stated. We note that the choice of *n*_{sel} was based on identification of four small molecules present in the DFT-MD simulation with mole fractions of at least 0.02, i.e., CO, CO_{2}, C_{2}O_{2}, and C_{3}O_{2}. In principle, *n*_{sel} = 400 allows for 100 of each aforementioned species to appear in the final *s*_{sel} set; however, in practice, a diversity of species is observed. We note that optimal values of *n*_{cyc} can be rapidly identified by tracking the evolution of Shannon information (which increases as the histogram flattens) during the MC process and will depend on the target system. We calculate information via trapezoidal numerical integration of the normalized histograms (i.e., rather than by summation) to ensure the calculated information does not significantly differ for calculations of the same distribution using different numbers of histogram bins. Section II will present studies on the influence of the latter two user-specified options, *n*_{bins} and the ** h** definition, as they can significantly impact the ALC efficiency.
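The flatness measure can be sketched as below. Treating the normalized histogram as a density and applying the trapezoidal rule over bin centers is our reading of the procedure described above; the bin-center grid is an implementation assumption.

```python
import numpy as np

def shannon_information(energies, n_bins=20):
    """Shannon information of a cluster-energy distribution (sketch).

    Trapezoidal integration of -p(E) ln p(E) over the normalized density
    (rather than a discrete bin sum) keeps the value nearly independent
    of the choice of n_bins.
    """
    density, edges = np.histogram(energies, bins=n_bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    safe = np.where(density > 0.0, density, 1.0)   # treat 0*ln(0) as 0
    integrand = -density * np.log(safe)
    dx = np.diff(centers)
    # trapezoidal rule: I = -integral p(E) ln p(E) dE
    return float(np.sum(0.5 * dx * (integrand[:-1] + integrand[1:])))
```

For a fixed energy range, this quantity is maximized by a flat distribution, so its plateau during MC selection signals a converged *n*_{cyc}.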

### F. Computational details

The initial training trajectory comprised full frames arising from short (<10 ps) spin-restricted DFT-MD simulations of dissociative carbon monoxide at (9350, 2.56), (6500, 2.5), and (2400, 1.79) (K, g cm^{−3}). We note that previous studies have shown that the choice of spin restriction minimally impacts the condensed-phase and isolated molecule forces for the present system.^{26} Simulations at 9350 K comprised 64 atoms, while simulations at 6500 K and 2400 K contained 128 atoms each. We note that DFT-MD simulations at each temperature are initialized as molecular CO but decompose into various species as the simulation progresses. All DFT data were generated with VASP,^{32–35} with a planewave cutoff of 700 eV, Fermi smearing with a width equivalent to the ion temperature, the Perdew–Burke–Ernzerhof (PBE) generalized gradient approximation functional,^{36,37} projector-augmented wave (PAW) pseudopotentials,^{38,39} and the DFT-D2 method for description of dispersion interactions.^{40} DFT-MD simulations utilized a 0.5 fs time step and a Nose–Hoover thermostat.^{41,42} The initial training trajectory contained 80 evenly spaced full frames from each of the 2400 K and 6500 K state points, and 160 from the 9350 K state point, for a total of 320 initial frames. Simulations at higher temperatures were included because they are likely to sample a broader region of configuration space as well as close contacts, which informs the repulsive portion of the potential and ultimately reduces the number of required iterative cycles. All other simulations (i.e., during active learning) are conducted at 2400 K and 1.79 g cm^{−3}, which we emphasize are the target thermodynamic conditions for benchmarking the actively learned models.

Recall that an active learning cycle involves both a simple iterative and an active learning component (i.e., the blue and orange portions of Fig. 2). During active learning cycles, ChIMES-MD simulations contained 128 atoms and were run at 2400 K and 1.79 g cm^{−3} for 6 ps with our locally developed ChIMES-MD software, using a global Hoover thermostat with periodic boundary conditions. A 0.1 fs time step is used during the ALC process since early models generally learn from incomplete training repositories and tend to exhibit rapidly varying potential energy surfaces. As will be discussed in Sec. II A, models generated in successive cycles yield far smoother potential energy surfaces, enabling the use of a larger time step during production simulations. The present 2400 K/1.79 g cm^{−3} DFT simulations yield significantly slower kinetics than is observed at the two higher *T*/*P* state points, which serves to exacerbate uncertainty in predicted chemistry arising from small system sizes and the effects of the initial conditions. Because multiple independent DFT-MD simulations are too time consuming and block averaging is dubious for short (i.e., non-equilibrated) simulations, we instead estimate errors in examined properties by computing the standard deviation across eight independent ChIMES-MD simulations.

In addition to the cluster selection, the iterative component of any given ALC involves the selection of a handful of full frames from ChIMES-MD for single-point DFT calculation and subsequent addition to the central training database, for more complete exploration of training phase space. For this process, we select 20 frames evenly spaced over the duration of the simulation, rather than using a special technique to identify optimal frames. We take this approach for two reasons: (i) to ensure the entire simulation progression is represented in the training database and (ii) because the AL portion of each ALC is performed on much more frequently sampled frames (i.e., 250) and thus should identify important events. In addition to the 20 frames discussed above, any frames in which pair distances fall within $r_{c,\mathrm{in}}(y) < r_y < r_{c,\mathrm{in}}(y) + d_p$ are added to the database and considered during the ALC process, to inform the fit in the typically undersampled region near the model’s inner cutoffs.
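The frame-selection logic just described can be sketched as follows; the function names are illustrative, and `r_inner`/`d_p` stand in for the per-pair-type inner cutoff and window width, which the text does not specify numerically.

```python
import numpy as np

def evenly_spaced_frames(n_frames, n_keep=20):
    """Indices of n_keep frames evenly spaced over the trajectory (sketch)."""
    return np.linspace(0, n_frames - 1, n_keep).round().astype(int)

def samples_inner_cutoff(pair_dists, r_inner, d_p):
    """Flag a frame containing any pair distance in the undersampled window
    just outside the inner cutoff, r_inner < r < r_inner + d_p (sketch)."""
    d = np.asarray(pair_dists, dtype=float)
    return bool(np.any((d > r_inner) & (d < r_inner + d_p)))
```

Frames flagged by the second test would be added to the training database regardless of the even-spacing selection.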

## II. ACTIVE LEARNING RESULTS AND DISCUSSION

In Secs. II A–II C, we present application of the AL approach discussed above to the development of ChIMES models explicitly describing 2-, 3-, and 4-body interactions for a C/O system under reactive conditions. As described in Secs. I B and I C, coupling between the model and physicochemical complexity makes this a particularly challenging problem and thus well suited for validating the present AL scheme. Three versions of our AL approach are described, which vary by how the histogram ** h** used during MC selection is constructed. In the protocol described in Sec. I E and Fig. 3 (i.e., where ** h** is constructed considering only clusters in ** s**), there is no memory of clusters selected in previous active learning cycles (i.e., those in the central database *s*_{cent}). In addition to this “no-memory” mode, we explore cases where there is “partial memory” (i.e., only some *s*_{cent} clusters are remembered when constructing ** h**) and “full memory” (i.e., all *s*_{cent} clusters are remembered when constructing ** h**).

### A. No-memory active learning

We begin with investigation of the most basic active learning approach. The first task is generation of the ALC-0 force field, trained to the 320 frames extracted from the 2400 K, 6500 K, and 9350 K DFT trajectories, each of which contains 3*n*_{a} coordinates and forces, and a single frame energy. Following identification, all clusters are extracted from 250 evenly spaced frames spanning each DFT-MD trajectory (i.e., forming ** s**), using the clustering approach described in Sec. I D. Energies for these clusters $E_i^{\mathrm{ALC}-0}$ are then computed with the ALC-0 force field, and a subset of these species, *s*_{sel}, is selected through the Monte Carlo (MC) approach presented in Sec. I E (i.e., step a in Fig. 3). For the MC procedure (i.e., steps b and c in Fig. 3), *n*_{cyc} and *n*_{bins} are set to 2 and 20, respectively, for identification of *n*_{sel} = 400 species. We take ** h** defined over $[\min(s), \max(s)]$, populated with values for the current *s*_{sel}, as the “no-memory” definition of ** h**. In this scheme, each active learning cycle selects clusters “from scratch,” with no knowledge of clusters selected and added to the central training database in previous ALCs.

Focusing on the top plot of Fig. 4, the black data show the normalized histogram over all $E_i^{\mathrm{ALC}-0}$ values for clusters in ** s** (i.e., all clusters extracted from the initial training trajectory). The histogram spans approximately −125 kcal mol^{−1} to 45 kcal mol^{−1} and, rather than exhibiting a Boltzmann or normal distribution, is sharply peaked at −25 kcal mol^{−1} with a tail extending to positive $E_i^{\mathrm{ALC}-0}$ values. The remaining lines in the top plot of Fig. 4 show how the distribution of $E_i^{\mathrm{ALC}-0}$ in *s*_{sel} changes as the MC selection process continues. Initially, these distributions are similar to that for ** s**, but rapid flattening is observed with successive MC steps. We note that the sloping feature at positive $E_i^{\mathrm{ALC}-0}$ in the final distribution is due to an insufficient number of high energy configurations to yield *n*_{sel}/*n*_{bins} configurations in each bin. The bottom plot of Fig. 4 provides the evolution of Shannon information, *I* (i.e., a measure of histogram flatness), as the MC selection process progresses and also indicates rapid convergence during the selection process. Specifically, *I* values are converged within ≈5000 MC steps (i.e., ≈1 cycle) and result in a 30% increase in information relative to the initial value. The final 400 selected species are then sent to DFT for single point force/energy calculation and added to the central training database (i.e., step d in Fig. 3), completing ALC-0. Successive ALCs progress by generating new ChIMES models fit to the updated training database, launching ChIMES-MD simulations with the resulting models, repeating the MC selection, and updating the central database with both selected clusters and 20 evenly spaced full frames from the ChIMES-MD simulation. These steps are repeated until the target model accuracy is achieved.

Here, we consider eight active learning cycles (i.e., up to ALC-7). Each ALC requires approximately 6 h of walltime on Intel Xeon E5-2695 hardware, where tasks are either serial on a single processor or parallel across four 36-core nodes. Tasks used the following numbers of walltime hours: two for model generation (parallel), molecular dynamics (parallel), and post-processing (serial); one for cluster extraction (serial), $E_i^{\mathrm{ALC}-X}$ calculations (parallel), and *s*_{sel} generation (serial); and three for single point DFT calculations (parallel), though we note that these timings do not necessarily represent an optimized process. Figure 5 shows the evolution of normalized ** h** for ALCs 3 and 6, from which several notable features emerge. Distributions exhibit a similar overall shape to ALC-0, with a sharp peak (shifted to −30 kcal mol^{−1}) and a tail at positive $E_i^{\mathrm{ALC}-X}$. We find the ranges of observed $E_i^{\mathrm{ALC}-X}$ to be similar between the two ALCs, but substantially smaller than that observed in ALC-0; here, values are found to fall between ≈−40 kcal mol^{−1} and −10 kcal mol^{−1}. We note that ** s** for ALC-1 includes clusters extracted from DFT at 6500 K and 9350 K in addition to 2400 K. This is in contrast to configurations extracted in subsequent ALCs, which are harvested from ChIMES-MD simulations (i.e., at *only* the target 2400 K state point); as a consequence, the range of $E_i^{\mathrm{ALC}-X}$ considered in ALC-1 is expected to be greater than in successive ALCs. Nevertheless, the similarity in the range of $E_i^{\mathrm{ALC}-X}$ values between ALC-3 and ALC-6 speaks to convergence in the ChIMES force fields generated at late ALC and the resulting ChIMES-MD simulations.

For a more direct means of comparing the performance of ChIMES models arising from successive ALCs, we consider speciation of the small molecules CO, CO_{2}, C_{2}O_{2}, and C_{3}O_{2}, in terms of mole fractions (*x*_{i}) and corresponding lifetimes (*t*_{i}). For this analysis, a molecule was considered persistent if all constituent bonds remained within a specified cutoff distance for a specified time. The distance criterion for each pair was set to the location of the first minimum in the radial pair distribution function (i.e., 1.9 Å, 1.8 Å, and 1.7 Å for CC, CO, and OO pairs, respectively), while the lifetime cutoff was set to 50 fs, allowing one-to-two bond vibrations. Overall, the results, given in Fig. 6, indicate a tendency for values to converge to the DFT result at late ALC, with error bars generally decreasing for rarer species, such as C_{2}O_{2} and C_{3}O_{2}. We note that, in early ALCs, simulations may become unstable and terminate prematurely, contributing to large error bars in predicted values. In terms of species lifetimes, significant fluctuations about the DFT value for CO_{2} are found in conjunction with relatively large error bars, which could indicate a large variance in physically reasonable values.
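The persistence criterion can be sketched as follows for a single molecule; the bond-length time-series input format is an assumption, and a production analysis would operate on full bond topologies across all candidate molecules.

```python
import numpy as np

def persistent_lifetimes(bond_series, cutoffs, dt_fs, t_min_fs=50.0):
    """Lifetimes of one molecule from its bond-length time series (sketch).

    bond_series: (n_frames, n_bonds) bond lengths over the trajectory.
    cutoffs:     per-bond distance criteria (e.g., 1.8 A for a C-O bond).
    A stretch counts as a lifetime only if all constituent bonds stay
    below their cutoffs for at least t_min_fs.
    """
    intact = np.all(np.asarray(bond_series) < np.asarray(cutoffs), axis=1)
    lifetimes, run = [], 0
    for ok in np.append(intact, False):     # False sentinel flushes last run
        if ok:
            run += 1
        else:
            if run * dt_fs >= t_min_fs:
                lifetimes.append(run * dt_fs)
            run = 0
    return lifetimes
```

With the 50 fs threshold above, transient bond-breaking events shorter than one-to-two vibrational periods are excluded from the lifetime statistics.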

Per-atom force and system energy recovery were also investigated for the 2400 K ChIMES-MD trajectory. As shown in Fig. 7, excellent force recovery is observed, suggesting the model should yield good reproduction of DFT structure. The largest deviations are found for large magnitude forces, which generally correspond to unusual and less favorable structures that are useful for fitting purposes but generally not indicative of model performance under the target conditions. Overall, we find a reduced root-mean-squared error in the forces (i.e., RMSE_{r,F} = RMSE_{F}/⟨|*F*|⟩) of 0.345, consistent with previous ChIMES models.^{22,25,26} Figure 7 also provides a comparison between overall frame energies predicted by DFT and ChIMES. The data are found to be highly linear, with an *R*^{2} and slope of 0.993 and 0.969, respectively; however, the data are slightly offset from the *x* = *y* line, with an intercept at −5.978 kcal mol^{−1} 100^{−1}. This behavior is attributed to the inclusion of frame energies corresponding to simulations at 6500 K and 9350 K, each of which used distinct electronic temperatures and thus strictly correspond to different electronic potential energy surfaces; in fact, when frame energies are plotted for all three state points, we find data for 6500 K and 9350 K to be slightly below and above *x* = *y*, respectively. Nevertheless, we note that this intercept improves upon that of the previous 2 + 3-body parameterization^{26} (i.e., ≈−57 kcal mol^{−1} 100^{−1}) by an order of magnitude. Rather than modifying the current study to account for this effect, we simply note that it can be mitigated by excluding the highest temperature state points or re-computing forces and energies using a smaller thermal smearing parameter. Overall, these data exhibit a reduced RMSE of $\mathrm{RMSE}_{r,E_T} = 0.003$, i.e., an order of magnitude less than in previous work,^{26} suggesting the present model should still yield high accuracy condensed phase energetics.
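The reduced RMSE metric used throughout is straightforward to compute; a minimal sketch:

```python
import numpy as np

def reduced_rmse(pred, ref):
    """Reduced RMSE, RMSE_r = RMSE / <|ref|>, e.g., RMSE_F / <|F|> for forces."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return np.sqrt(np.mean((pred - ref) ** 2)) / np.mean(np.abs(ref))
```

Normalizing by the mean absolute reference value makes errors comparable across quantities with very different magnitudes (per-atom forces vs frame energies).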

In systems such as the present, which exhibit a diversity of chemical species of varied complexity, it is helpful to further decompose energetic contributions; thus, the final two plots of Fig. 7 provide comparisons between DFT and ChIMES energetics for small species (i.e., CO, CO_{2}, C_{2}O_{2}, and C_{3}O_{2}) and for species for which 10 ≤ *n*_{C} + *n*_{O} ≤ 50; the former plot speaks to the predicted concentrations and lifetimes of species that seed formation of larger structures, and the latter to molecular conformations in complex species. We note that these species correspond to the final *s*_{sel} from ALC-0 and include configurations from all three temperatures identified as nominal molecules *and* nominal intermediates (i.e., both high- and low-energy configurations). Focusing first on the small molecules, we find reasonable agreement, with a corresponding reduced RMSE of $\mathrm{RMSE}_{r,E_{\mathrm{sml}}} = 0.036$. The largest discrepancies between DFT and ChIMES are found for higher-energy configurations involving CO_{2}, i.e., pseudo-intermediate state species, which inform energetic maxima in reaction coordinate space and thereby play an important role in predicting species lifetimes. Thus, it is somewhat unsurprising that the error bars for $t_{\mathrm{CO}_2}$ are among the largest observed for lifetimes given in Fig. 6. Possible explanations for this include insufficiently high 2- or 3-body order or inconsistent DFT energetics for CO_{2} arising from the three considered state points (i.e., due to the use of different smearing parameters). We find excellent recovery of large species energetics, with $\mathrm{RMSE}_{r,E_{\mathrm{lrg}}} = 0.009$, suggesting the present model provides a high-accuracy description of conformations in complex species. These RMSE_{r} values are also listed in Table I, along with pressures, diffusion coefficients, and minimized CO and CO_{2} structures predicted by DFT and the present ChIMES model.
Noting that stress tensors were not included in the present fits, we find a 20% over-prediction in pressure, but otherwise, all other metrics are in excellent agreement with DFT.

| | DFT | No-memory | Full-memory | Partial-memory (20 bins) | Partial-memory (40 bins) |
|---|---|---|---|---|---|
| *n*_{bins} | … | 20 | 20 | 20 | 40 |
| *P* (GPa) | 9 | 10.8_{3} | 8.9_{8} | 9.6_{1} | 11.4_{4} |
| $d_s^{\mathrm{O}}$ (10^{−8} m^{2}/s) | 1.8 | 1.7_{2} | 1.4_{2} | 1.5_{2} | 1.4_{4} |
| $d_s^{\mathrm{C}}$ (10^{−8} m^{2}/s) | 1.5 | 1.4_{2} | 1.4_{2} | 1.3_{2} | 1.4_{4} |
| RMSE_{r,F} | … | 0.345 | 0.356 | 0.338 | 0.331 |
| $\mathrm{RMSE}_{r,E_T}$ | … | 0.003 | 0.004 | 0.003 | 0.003 |
| $\mathrm{RMSE}_{r,E_{\mathrm{sml}}}$ | … | 0.036 | 0.047 | 0.033 | 0.035 |
| $\mathrm{RMSE}_{r,E_{\mathrm{lrg}}}$ | … | 0.009 | 0.010 | 0.009 | 0.008 |
| $l_{\mathrm{eq,CO}}^{\mathrm{C-O}}$ (Å) | 1.15 | 1.14 | 1.14 | 1.14 | 1.14 |
| $l_{\mathrm{eq,CO_2}}^{\mathrm{C-O}}$ (Å) | 1.18 | 1.18 | 1.17 | 1.17 | 1.17 |
| $\theta_{\mathrm{eq,CO_2}}^{\mathrm{O-C-O}}$ (deg) | 180 | 180 | 180 | 180 | 180 |

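The self-diffusion coefficients ($d_s$) in Table I are conventionally obtained from the Einstein relation; the sketch below assumes unwrapped coordinates and a linear fit over the second half of the mean-squared displacement, details not specified in the text.

```python
import numpy as np

def self_diffusion(positions, dt_fs):
    """Self-diffusion coefficient via the Einstein relation (sketch):
    D = lim_{t->inf} <|r(t) - r(0)|^2> / (6 t).

    positions: (n_frames, n_atoms, 3) unwrapped coordinates in meters.
    dt_fs:     time between frames in femtoseconds.
    Returns D in m^2/s from a least-squares slope over the late-time MSD.
    """
    pos = np.asarray(positions, dtype=float)
    msd = ((pos - pos[0]) ** 2).sum(axis=-1).mean(axis=-1)  # <|dr|^2>(t)
    t = np.arange(len(msd)) * dt_fs * 1e-15                 # seconds
    half = len(msd) // 2
    tc = t[half:] - t[half:].mean()
    yc = msd[half:] - msd[half:].mean()
    slope = np.dot(tc, yc) / np.dot(tc, tc)                 # d(MSD)/dt
    return slope / 6.0
```

Restricting the fit to late times avoids the ballistic regime, where the MSD is not yet linear in time.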

### B. Full-memory active learning

Though the above “no-memory” active learning approach improved substantially upon early parameterization efforts for CO at 2400 K and 1.79 g cm^{−3} (e.g., by removing the propensity for the ChIMES-MD simulations to form hyper-coordinated structures), there is still room for refinement, most notably in recovery of pressure and small molecule energetics. In this section, we consider an active learning framework with “full memory” of the central database, where we aim to flatten a histogram of *both* clusters selected in previous ALCs (i.e., *s*_{cent}) and current selections from ** s**, *s*_{sel}. This is achieved by constructing ** h** from *both* the set of clusters in the central database (*s*_{cent}) and those in *s*_{sel}; ** h** is defined over the minimum and maximum values among both *s*_{cent} and ** s**. In principle, this approach should improve the model performance by preventing redundancy among species selected in successive ALCs. As a practical point, we note that this approach requires re-calculation of $E_i^{\mathrm{ALC}-X}$ for *s*_{cent} species at each ALC, which, for the present system, adds no more than 15 min to the overall time required to complete an ALC. Figure 8 provides the mole fractions and lifetimes predicted from full-memory ALCs 1–7. In general, average mole fractions, ⟨*x*_{i}⟩, predicted by the no-memory ALC-7 model are closer to DFT than those from full-memory ALC-7; however, both methods produce values within error of one another. In contrast, ALC-7 full-memory average lifetimes, ⟨*t*_{i}⟩, are generally closer to DFT, with three out of four values within error between the two methods. In both methods, ⟨*x*_{i}⟩ and ⟨*t*_{i}⟩ values across ALCs 1–7 appear converged to the same extent. Moving to Table I, we find the pressure in better agreement with DFT, while diffusion coefficients and predicted CO/CO_{2} structures are close to the no-memory values.

It is not entirely surprising that full-memory active learning yields such marginal improvements over the no-memory model. In the former framework, early ALC ChIMES-MD simulations can give rise to unphysical structures to which DFT typically assigns high energy; moreover, ALC-1 includes clusters from the 6500 K and 9350 K simulations, which can exhibit substantially higher energies than those sampled at 2400 K. As more ALCs are performed, the resulting ChIMES models begin yielding more reasonable configurations and assigning energies more consistent with those arising from DFT. As a consequence, early ALCs set the upper bound of the histogram domain, causing late-ALC selections to cluster about relatively low histogram values and giving rise to a long tail at high histogram values. The practical implication is that only a fraction of the 20 available bins are allocated to the histogram “active space” at late ALC.

### C. Partial-memory active learning

The pitfalls of full-memory active learning can readily be overcome by setting an energetic cutoff for central database configurations “remembered” during the sub-selection process, effectively imposing bounds on the possible range of ** h**. In the present work, this is achieved by constraining the domain of ** h** to the minimum and maximum $E_i^{\mathrm{ALC}-X}$ values among *only* ** s** and populating it only with *s*_{cent} and *s*_{sel} values that fall within this domain; any *s*_{cent} values outside of that range are ignored during the MC selection process. Henceforth, we refer to this approach as “partial-memory” active learning. Note that this is in contrast to the full-memory approach, which constructed ** h** from *all* clusters in *s*_{cent} *and* those in *s*_{sel}. Mole fractions and corresponding lifetimes predicted by ChIMES force fields developed with partial-memory ALC are given in Fig. 9. Comparing with the no-memory model, we find ALC-7 ⟨*x*_{i}⟩ for the present model is generally closer to DFT. Values for no-memory ⟨*t*_{i}⟩ are closer to those predicted by DFT for all but C_{2}O_{2}; however, all partial-memory predictions for these species are within error of the full-memory results. Table I also shows that both models predict pressures and diffusion coefficients within error of one another, and identical predictions for CO and CO_{2} geometries. Additionally, Table I indicates that, along with no-memory, the partial-memory model exhibits the lowest value of $\mathrm{RMSE}_{r,E_T}$ and has the lowest value for all other RMSE_{r} considered. Ultimately, the partial-memory model is found to provide the best overall performance, though differences between the various models are small.
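The partial-memory histogram construction can be sketched as follows; the function signature and array handling are illustrative assumptions.

```python
import numpy as np

def partial_memory_hist(e_s, e_cent, e_sel, n_bins=20):
    """Build the selection histogram h for partial-memory AL (sketch).

    The domain is fixed by the current candidate clusters s alone; central
    database energies (e_cent) outside [min(s), max(s)] are simply ignored,
    in contrast to full memory, where they would stretch the domain.
    """
    e_s = np.asarray(e_s, dtype=float)
    lo, hi = e_s.min(), e_s.max()
    e_cent = np.asarray(e_cent, dtype=float)
    kept = e_cent[(e_cent >= lo) & (e_cent <= hi)]       # energy cutoff
    sample = np.concatenate([kept, np.asarray(e_sel, dtype=float)])
    h, edges = np.histogram(sample, bins=n_bins, range=(lo, hi))
    return h / h.sum(), edges
```

Fixing the domain this way keeps all *n*_{bins} bins inside the energetically relevant “active space,” avoiding the bin dilution described for the full-memory mode.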

It stands to reason that the present active learning framework should yield improved results when the selection histogram is constructed with finer resolution. Thus, we explore the effect of doubling *n*_{bins}, i.e., from 20 to 40. To ensure the MC selection process is converged, we also double *n*_{cyc}, i.e., from 2 to 4. Figure 10 gives the mole fractions and corresponding lifetimes predicted from this 40-bin partial-memory active learning process. Most notably, this approach yields speciation that is more obviously converged than that of the previous methods, suggesting enhanced efficiency; it also indicates that the likelihood of serendipitous predictions is decreased in the 40-bin case. We note that C_{2}O_{2} is the least prevalent and shortest lived of the considered small species and thus the most likely to be sensitive to the active learning approach. Nevertheless, mole fractions for C_{2}O_{2} are significantly improved over the no- and full-memory active learning predictions, indicating partial-memory active learning is better suited to generating models targeting chemistry.

Table I shows the 40-bin partial-memory model yields the worst pressure prediction relative to DFT, but its diffusion coefficients and CO and CO_{2} geometries agree with the other three approaches within error. Furthermore, $\mathrm{RMSE}_{r,E_T}$ and RMSE_{r,F} are both the smallest of all considered models. Computed $\mathrm{RMSE}_{r,E_{\mathrm{sml}}}$ and $\mathrm{RMSE}_{r,E_{\mathrm{lrg}}}$ values for the 40-bin partial-memory model are not directly comparable with the other three models, since the final ALC-0 *s*_{sel} contains different species (i.e., because they were selected using different *n*_{bins} values). Nevertheless, we find $\mathrm{RMSE}_{r,E_{\mathrm{sml}}}$ is the second smallest and $\mathrm{RMSE}_{r,E_{\mathrm{lrg}}}$ the smallest of the four active learning methods considered. We note that additional simulations of the same size and length were run for the 40-bin partial-memory ALC-7 model using an increased time step (i.e., from 0.1 fs to 0.5 fs), for which the resulting predictions were consistent with the 0.1 fs results presented here. As a final comparison between the active learning approaches, Fig. 11 provides the radial pair distribution functions (RDFs) predicted for the 2400 K system by the ALC-7 model from each approach. As with all other explored validation metrics, the RDFs are all similar to one another and to DFT, though slight over-structuring is observed in the second CC peak for the 20-bin partial-memory model ($r \approx 2.5$ Å) and in the first OO peak predicted by the 20-bin partial-memory and no-memory models.
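For reference, a single-frame RDF under periodic boundary conditions can be computed as sketched below (a minimal, orthorhombic-box illustration; production RDFs are averaged over many frames and resolved by pair type, e.g., CC, CO, OO).

```python
import numpy as np

def rdf(positions, box, r_max, n_bins=100):
    """Single-frame radial pair distribution function g(r) (sketch).

    positions: (n_atoms, 3) coordinates; box: (3,) orthorhombic box lengths.
    Pair counts are normalized by the ideal-gas expectation per shell.
    """
    pos = np.asarray(positions, dtype=float)
    box = np.asarray(box, dtype=float)
    n = len(pos)
    d = pos[:, None, :] - pos[None, :, :]
    d -= box * np.round(d / box)                      # minimum-image convention
    r = np.sqrt((d ** 2).sum(axis=-1))[np.triu_indices(n, k=1)]
    counts, edges = np.histogram(r, bins=n_bins, range=(0.0, r_max))
    shell = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    ideal = n * (n - 1) / 2.0 * shell / np.prod(box)  # expected pair counts
    return counts / ideal, 0.5 * (edges[1:] + edges[:-1])
```

For an ideal (uncorrelated) system, this normalization yields g(r) ≈ 1 at all distances below half the box length.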

## III. CONCLUSIONS

Active learning provides an automated and flexible means of achieving training database completeness, a primary factor determining the accuracy and robustness of high complexity machine learned force fields. In this paper, we have demonstrated the design of an AL approach for semi-automated development of high-fidelity reactive ChIMES models by sampling only from relevant configurations found in the condensed phase. At a high level, this approach involves identification, extraction, and selection of potentially important species by way of clustering, energetics, and a criterion inspired by Shannon information theory. The result is an evolving central training database that enables deconvolution of inter- and intra-molecular contributions to DFT forces and energies, allowing for an improved description of structure, dynamics, and speciation. Though this approach is specifically designed for, and particularly well suited to, parametrically linear high-complexity models, we note that it can be utilized with other machine learned models.

To demonstrate the suitability of this AL framework, we have applied it to the development of a high-complexity (i.e., ∼4000 parameter), explicitly many-bodied machine-learned force field for C/O systems under reactive conditions and shown that the resulting models exhibit excellent agreement with DFT. Overall, we find our active learning method relatively insensitive to the choice of memory mode and histogram resolution (i.e., no-memory, full-memory, or partial-memory; 20 bins or 40 bins). Each approach produces a stable model yielding close agreement with DFT in structure, dynamics, and pressure. Of all investigated validation metrics, predicted small molecule mole fractions and lifetimes appear most sensitive to the active learning memory mode. Partial-memory active learning is found to yield improved small molecule chemistry over no- and full-memory learning and performs best when a larger number of histogram bins (i.e., 40) is utilized. Specifically, model development through a partial-memory active learning process with 40 bins was found to yield convergent behavior within eight active learning cycles and to predict structure, dynamics, and speciation in excellent agreement with DFT. We note that the present AL approach has an added benefit: instabilities are often encountered during initial ChIMES runs on large systems due to an enhanced probability of sampling new regions of phase space with increased system size, and these unstable configurations can be dealt with at the cluster scale, rather than the overall-system scale, drastically reducing the computational requirements for additional ALCs. This work represents a significant step forward in ML model development methodology by substantially enhancing automation and reproducibility.
Furthermore, this approach is highly flexible; current efforts are focused on extending this fitting framework to enable transferability through parallel learning at multiple state points and further refining this process by addition of structural criteria during MC selection, as well as a configuration filter based on the expected model change method.^{20}

## SUPPLEMENTARY MATERIAL

Initial training data used for model generation and the resulting parameters of the 40-bin partial-memory model are available in the supplementary material.

## DATA AVAILABILITY

The data that support the findings of this study are available within the article and its supplementary material.

## ACKNOWLEDGMENTS

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344. The project 17-ERD-011 was funded by the Laboratory Directed Research and Development Program at LLNL with S.B. as principal investigator. Document no. LLNL-JRNL-812206.