We utilize connections between molecular coarse-graining (CG) approaches and implicit generative models in machine learning to describe a new framework for systematic molecular CG. Focus is placed on the formalism encompassing generative adversarial networks. The resulting method enables a variety of model parameterization strategies, some of which show similarity to previous CG methods. We demonstrate that the resulting framework can rigorously parameterize CG models containing CG sites with no prescribed connection to the reference atomistic system (termed virtual sites); however, this advantage is offset by the lack of a closed-form expression for the CG Hamiltonian at the resolution obtained after integration over the virtual CG sites. Computational examples are provided for cases in which these methods ideally return identical parameters as relative entropy minimization CG but where traditional relative entropy minimization CG optimization equations are not applicable.

## I. INTRODUCTION

Classical atomistic molecular dynamics (MD) simulation has provided significant insight into many biological and materials processes.^{1–4} However, its efficacy is often restricted by its computational cost: for example, routine atomic resolution studies of biomolecular systems are currently limited to microsecond simulations of millions of atoms. Phenomena that cannot be characterized in this regime often require investigation using modified computational approaches. Coarse-grained (CG) molecular dynamics can be effective for studying systems where the motions of nearby atoms are highly interdependent.^{5–9} By simulating at the resolution of CG sites or “beads,” each associated with multiple correlated atoms, biomolecular processes at the second time scale and beyond can be accurately probed. High-fidelity coarse-grained molecular dynamics (CGMC) models often depend on flexible parameterizations; as a result, the design of systematic parameterization strategies is an active area of study (e.g., methods and applications in Refs. 10–35).

The CGMD models considered in this article are similar to their atomistic counterparts. They are composed of point-mass CG beads, a corresponding CG force-field, and a simulation protocol that produces Boltzmann statistics in the long-time limit. We restrict the bulk of our study to the parameterization of the CG effective force-field. Here, and in the remainder of the article, we refer to these models simply as CG models. We only consider the static equilibrium properties of these models and not their dynamics. There are two nonexclusive classes of parameterization strategies for CG models of interest to this article: *top-down* and *bottom-up* approaches.^{5–7} Top-down approaches aim to parameterize CG models to recapitulate specific macroscopic properties, such as pressure and partition coefficients,^{36,103} while bottom-up methods attempt to parameterize CG models to reproduce the multidimensional distribution given by explicitly mapping each atomistic configuration (produced by a suitable reference simulation) to a specific CG configuration.^{13–17} The distribution of this mapped system is produced via a Boltzmann distribution with respect to an effective CG Hamiltonian referred to as the many-body Potential of Mean Force (PMF).

Certain scientific inferences could be informally drawn from the fit CG force-field itself, assuming that the force-field is constrained to intuitive low dimensional contributions (e.g., pairwise forces, such as in Ref. 37). For example, one could attempt to infer the effect of an amino acid mutation on protein behavior by considering how the approximated PMF differs when fit on reference wild type and mutant proteins simulations, similar to the analysis of low dimensional free energy surfaces. However, the primary use of CG models is typically based on their ability to generate CG configurations of a system of interest using their approximate force-field.^{21,28,38,39} The computational similarity of CG models with their atomistic counterparts allows CG models to be simulated using the same high performance software packages as those used in atomistic simulation.^{40–46} As a result, the computational profile of CG models is often controlled by the same dominating factor as atomistic models: the calculation of the force-field at each time step.^{40,47,48} This cost similarly provides motivation for specific low dimensional force-field contributions. However, there is no guarantee that a force-field characterized solely by traditional bonded and pairwise nonbonded terms either describes the true PMF of the CG variables or can accurately reproduce all observables of interest to the parameterization.^{5–7} In the case of bottom-up methods, while typical approaches will produce the PMF in the infinite sampling limit when they are capable of representing any CG force-field, in practice each method creates a characteristic approximation (e.g., reproducing two-body at the expense of higher order correlations).

The compromises invoked by various bottom-up CG methods in realistic applications are critical to the utility of the resulting models. Certain methods focus on reproducing correlations dual to the potential form used;^{11,12,49,50} for example, when using a pairwise nonbonded potential, these methods recapitulate the radial distribution function of the target system. Other particular methods are characterized by attempting to reproduce both these dual correlations along with certain higher order correlations intrinsically connected to the CG potential.^{13–15,18,49} The nature of the distributions approximated suggests three natural approaches for improving an inaccurate model: improve the CG force-field basis used, modify the CG representation, or select a different procedure to generate the CG force-field. The first two options are often a central part of the design of a systematic CG model; however, realistic systems, such as proteins, may not be well described by correlations that are typically connected to computationally efficient CG potentials coupled with appropriate CG representations.^{7} More generally, the specific correlations critical to a reasonably accurate CG model may depend on the study at hand and may be representable by simple force-fields—but only at the expense of other correlations connected to that potential form as dictated by a particular bottom-up method. As a result, the diversity of possible applications motivates the creation of additional strategies for bottom-up CG modeling, each of which has different biases in the approximations it produces.

The task of generating examples (such as images) similar to a known empirical sample is of significant interest to the Machine Learning (ML) community.^{51–54} The creation of an artificial process that can produce realistic samples often entails encoding an understanding of the true mechanism underlying the real world distribution; internal representations of an accurately parameterized generative model, such as neural network parameters, can be transferred for use in secondary tasks such as classification^{55} or image retrieval.^{56} The artificial samples produced by the models themselves have additionally shown value by providing novel molecular targets for synthesis^{57,58} or as labeled images for training in classification or regression.^{59,60} A substantial number of these complex applications utilize implicit generative models.^{54,61–63} Implicit generative models, such as Generative Adversarial Networks (GANs),^{61} are characterized by their lack of an explicit probability distribution, or an associated Hamiltonian, at the resolution they produce examples.^{54} For example, a GAN may be trained to generate pictures containing human faces.^{61} Each picture that could be generated has a parameterization specific probability of being a reasonable picture of a human face (admittedly, this probability is often very close to one or zero); however, the GAN itself does not have explicit knowledge of this probability. Instead, the GAN is simply characterized as a procedure that transforms random numbers from a simple noise distribution to images that follow the probability distribution of plausible pictures. The methods used to parameterize (i.e., train) GANs therefore focus on the ability to critique a model distribution against reference samples without knowledge of the probability density function characterizing the model. This is in strong contrast to typical molecular simulation,^{1,64,65} which traditionally requires a known Hamiltonian to produce samples through molecular dynamics or Markov chain Monte Carlo—and whose systematic parameterization techniques often naturally explicitly involve evaluation of the corresponding model Hamiltonian surface.^{10,11,13–18,20–32,34} However, both methods are focused on accurately producing samples, or configurations, as their primary goal.

This article focuses on making this intuitive connection between GANs and molecular models explicit, allowing us to apply established insight from the adversarial community to bottom-up CG modeling, giving rise to new strategies for CG parameterization we term Adversarial-Residual-Coarse-Graining (ARCG). By doing so, we facilitate the use of additional classes of CG model quality measures that may show promise in modifying the approximations characterizing the optimal CG model when using a constrained set of candidate potentials to represent the CG force-field. We additionally find that it is possible to decouple the resolution at which one critiques the behavior of the CG model and the resolution at which a CG force-field is required: as an example, we describe a novel rigorous avenue to increase the expressiveness of bottom-up CG models through the use of virtual sites. Critically, we do not utilize a full GAN architecture to generate CG samples; rather, we utilize the supporting theory^{61,66–71} to optimize traditional CG force-fields.

In this work, we discuss formal connections between CG and GAN-type implicit generative models and provide an initial implementation of the resulting ARCG framework. Section II provides both an informal and a formal summary of the theoretical underpinnings, while Sec. III provides details on a particular instance of ARCG and a public computational implementation. Section IV then provides results on three simple test systems, and Sec. V outlines the consequences of the results and possible future studies. Section VI provides concluding remarks.

## II. THEORY

The purpose of this section is to both informally describe and formally define ARCG and to summarize connections between ARCG and previous CG parameterization methods. We begin by presenting an intuitive understanding of a specific form of ARCG to provide clarity for the subsequent mathematical description. We then follow by defining notation and the fine-grained (FG)/CG systems to which ARCG applies. We define ARCG and describe its estimation and optimization. We then move to decouple the resolution at which one critiques the CG model from the resolution native to the CG Hamiltonian, thereby generalizing our application to systems containing virtual CG sites. We continue by discussing the corresponding challenges with momentum consistency, and we finish by summarizing ARCG’s relationship to previous CG methods.

### A. Informal description of ARCG

Bottom-up CG models are parameterized to approximate the free energy surface implied by mapping fine-grained (FG) configurations to the CG resolution.^{6,7} Generally, this entails considering many different possible CG models (each, for example, characterized by a different pair potential) and their relationship to the reference FG data. Often, this is operationalized by creating a variational statement and searching for the CG model that minimizes it (for example, minimizing the relative entropy between the CG model and FG data^{17}). After such a procedure is complete, the modeler is well advised to visually inspect and compare the configurations produced by the selected CG model to those produced by the reference FG model. If the configurations are dissimilar, then the CG model is likely not adequate, and aspects of the variational statement or set of initial models considered must be modified and the parameterization process repeated.

It is natural to ask whether the final inspection of configurations produced by the FG and CG models can be intrinsically linked to the variational statement parameterizing the CG model. It is intuitive that for systematic CG parameterization methods derived from configurational consistency^{10,11,13–18,20–25} that when an indefinite amount of samples are used and all possible CG models are considered that the optimized CG model will perfectly reproduce the mapped FG statistics, and as a result, the configurations produced by the FG and CG models will be indistinguishable. However, in cases where perfectly reproducing the FG statistics is infeasible, it seems natural to ask if a model could be trained using this criteria of distinguishability directly.

While it could be possible in simple situations to use a human observer to intuitively rank CG models by considering the configurations they produce, this procedure quickly becomes subjective and untenable for complex models. A natural progression in the method design is then to train a computer to distinguish CG models by comparing their samples against the reference data set. One appropriate statistical procedure is classification,^{72} where a computer attempts to differentiate individual configurations based on whether they are more likely drawn from either the CG or mapped FG data sets. The implied procedure for CG parameterization is then to optimize the CG model such that it is intrinsically difficult to complete this task: as a result, the computer will inevitably make many mistakes on average when attempting to isolate configurations characteristic to only the FG and CG data. One possible intuitive manifestation of ARCG theory concretely implements this classification procedure while maintaining clear connection to CG methods such as relative entropy minimization (REM).^{17} Previous CG parameterization methods have used similar, but not identical, motivations to produce parameterization strategies.^{17,25,29} ARCG theory serves to connect, clarify, and reframe these methods where possible while extending beyond the classification metaphor.

It is important to note that the task of classification is a variational procedure itself:^{66,72} the ideal estimate of the true sources of a set of molecular configurations has a lower error than all other estimates. The optimization in classification searches over these various possible hypotheses. As a result, at each step of force-field optimization, ARCG must perform this variational search over possible hypotheses, resulting in two nested variational statements in the full model optimization procedure: one required for classification and the other for choosing the resulting CG model. Importantly, the error rate of the optimal classifier can be explicitly linked to various *f*-divergences (e.g., relative entropy) evaluated between the mapped FG and CG distributions.^{66} This suggests an equivalent formalism with which to view ARCG: the variational estimation of divergences. This alternate interpretation illustrates how additional divergences, such as the Wasserstein distance,^{71} can be estimated, even without a clear connection to classification. As a result, ARCG theory is primarily treated through the lens of variational divergence estimation in the remainder of this work.

The variational estimation intrinsic to ARCG affords an interesting extension to traditional parameterization strategies: the resolution at which the CG Hamiltonian acts may be finer than the resolution at which the model is compared to the reference data. Equivalently, CG samples can be mapped before being compared to the mapped reference FG samples. For example, additional particles may be introduced to facilitate complex effective interactions between the CG particles and then may be mapped out before comparing to the mapped reference FG samples. Applying such a mapping creates issues with many other parameterization strategies as discussed in Sec. II F.

### B. Model definitions and selection

We consider a FG probability density $prefFG$ and a mapping operator $M$ that maps a FG configuration to a CG configuration. The FG simulation is constructed such that it produces samples from the Boltzmann distribution with respect to a FG Hamiltonian giving the following probability density:

where *β* is $1kbT$ with the temperature *T* set by the simulation protocol, *m*_{i} are the FG masses, **r**^{3n} and **p**^{3n} are the FG positions and momenta variables, and our partition functions are defined as expected^{73} such that

where the integrals are taken over the full domains of the position and momentum variables (denoted via $XrFG$ and $XpFG$). The application of the CG map $M$ produces CG configurations that follow an implied probability distribution. $M$ is constrained such that it is linear and can be decomposed into momentum and position components, i.e., $M$(**r**^{3n}, **p**^{3n}) = [$M$_{r}(**r**^{3n}); $M$_{p}(**p**^{3n})],^{74} implying a factorizable probability density *p*_{ref}(**R**^{3N}, **P**^{3N}) ≔ *p*_{ref,R}(**R**^{3N})*p*_{ref,P}(**P**^{3N}) over the CG variables defined as

Bottom-up CG models aim to directly produce samples from the distribution described by *p*_{ref}.^{14,15} Ideally, this is achieved by defining a model CG Hamiltonian $\u2211i=1NPi22Mi+Umod(R3N)$ such that the corresponding Boltzmann statistics

are ideally identical to the mapped FG statistics, criteria expressed with the following CG consistency equations:^{15}

Momentum and configurational consistency are generally treated separately, with momentum consistency exactly satisfied through direct definition of CG masses *M*_{i} and configurational consistency approximated through a variational statement (as the corresponding integral is not generally tractable).^{15} We defer further discussion of momentum consistency until Subsection II E 1. The configurational variational statement is specific to the particular bottom-up CG method chosen and utilizes a variety of information depending on the method considered. Generally, knowledge of $UrefFG$, *U*_{ref}, and $M$ is used. In many cases, the corresponding variational principle can be considered in the following form:

where ** θ** denotes the finite parameterization of our CG potential,

*θ*^{†}parameterizes our ideal model, and $F$ is a function characterizing the quality of our model. Often,

^{10,13–18,25}the exact form of the variational statement contains intractable integrals which are approximated via empirical averages from atomistic and coarse-grained trajectories.

Importantly, while the models discussed in the remainder of this article fit into this framework, they differ in one important respect to many previous CG parameterization strategies: they introduce a variational definition of $F$ itself, resulting in two nested variational statements in the numerical optimization procedure.

### C. Adversarial-residual-coarse-grained models

ARCG models are characterized by a set of possible $F$ that are defined variationally as the difference in ensemble averages of a pair of coupled scalar functions. The functions, *f*^{75} and *g*, are found as producing the maximum of the following variational definition:

leading to a minimax variational statement for the fit model itself,

In other words, for a specific choice of *p*_{mod} and *p*_{ref}, the numerical value of our residual is determined by a specific (*f*, *g*) pair; all other choices of pairs of observables in $Q$ produce a more optimistic estimate of the quality of our model. These observables are evaluated via their configurational average at the CG resolution. As we update ** θ**, the optimal choice of (

*f*,

*g*) will change.

The definition of $Q$ depends on the particular $F$ specified. For example, if *f* = *g* for all pairs in $Q$, this expression defines the class of Maximum Mean Discrepancy (MMD) distances,^{76,77} with each MMD distance then being defined by further constraints on $Q$. Typically, the function space in MMD is restricted to the unit ball in a reproducing kernel Hilbert space, a choice which allows the maximization to be resolved via a closed expression. The examples in this paper will estimate *f*-divergences: in this case,

where we have used ◦ to denote the function composition, e.g., *f* ◦ *g*(*x*) ≔ *f*(*g*(*x*)), $[0,1]XR$ denotes the set of functions from $XR$ to [0, 1], and *l*_{ref} and *l*_{mod} are functions determined by the specific *f*-divergence estimated and whose closed form is given in Sec. II D.

Model selection requires an optimization over ** θ** to satisfy the external minimization in Eq. (11). The strategies available for doing so depend on the structure of $Q$. For low dimensional parameterizations, it may be feasible to do a grid search over possible models and to use Eq. (10) to select the ideal model. However, for higher dimensional parameter spaces, an attractive option is to use methods utilizing the gradient with respect to

**. If the maximized estimate over $Q$ is differentiable at a particular point with respect to**

*θ***, then (due to the envelope theorem,**

*θ*^{78}see Appendix A) the derivatives with respect to

**at that point only include terms related to the ensemble average over the model distribution,**

*θ**p*

_{mod,R},

where *f*^{†} represents one of the optimal observables found at the internal maximum. When the maximized inner estimate is expressible in closed form (which is true in the case of the *f*-divergences estimated in this paper), we can directly confirm the existence of this derivative. Assuming that the integral and derivative operators may be exchanged, simple substitution provides a covariance expression for estimation,

These results suggest a straightforward numerical optimization of Eq. (11) using gradient descent and related first order methods (e.g., RMSprop^{79}). We represent $Q$ by indexing with a finite dimensional vector ** ψ**. At each iteration of optimization, holding

**constant, we maximize over**

*θ***using samples from the model and reference distributions to estimate our expected values; then, holding**

*ψ***constant, we take a single step on the gradient of**

*ψ***estimated by the sample average of the covariance expression. This two step process is completed until convergence of**

*θ***.**

*θ*Not all definitions of $Q$ produce meaningful procedures for creating CG models. Generally, particular forms of $F$ are derived individually, each of which is amenable to the procedures outlined here. We continue by investigating an informative subset of possible $F$, characterized via *f*-divergences, that will provide functionality directly encompassing REM CG,^{17} as well as previous approaches by Stillinger^{10} and Vlcek and Chialvo.^{25}

### D. *f*-divergences

The *f*-divergences are a category of functions characterizing the difference between two distributions.^{66,80,81} When probability density functions are available, we can express this family of divergences as

where each member of the family is indexed by a convex function *f* that satisfies *f*(1) = 0. Relative entropy, the divergence central to REM CG, can be obtained by defining *f*(*x*) ≔ *x* log *x*,^{82} and the Hellinger distance, central to previous methods by Stillinger^{10} and Vlcek and Chialvo,^{25} can be obtained by via $f(x)\u2254x\u221212$.

The *f*-divergence between *p*_{mod} and *p*_{ref} can be expressed in multiple variational statements.^{66–68,70} We here utilize its duality with the training difficulty of classification which is mathematically expressed in the following formulation:

where

The function $\eta ^$ is a function of a CG configuration, mapping each configuration to a real number in [0, 1].^{83} Note that substitution into Eq. (11) (along with the removal of prefactors) provides us with our training residual,

where the optimal $\eta ^$ producing the corresponding *f*-divergence, denoted *η*, is known to be^{66}

While here we have denoted our inner variational statement as optimizing over a space of functions instead of pairs of functions, this is equivalent to Eq. (11) when defining $Q$ via Eq. (12).

The optimization present in Eq. (17) considers a diverse set of functions over a high dimensional domain. However, as expanded on in Sec. III, numerically tractable strategies typically restrict the set of functions considered by only including functions which are expressible through selected features of the molecular systems studied (e.g., functions of the distance matrix). When this restricted set still contains *η*, however, the presented equality with *f*-divergences still applies. We introduce the following term to describe such suitably expressive $Q$: in the context of *f*-divergences, when $Q$ contains $(\u221212lmod\u25e6\eta ,12lref\u25e6\eta )$, i.e., when optimization with the population averages would return the corresponding *f*-divergence, we will refer to $Q$ as being *complete*. Additionally, when $Q$ is expressive enough such that it is complete for each step in an optimization process, we will additionally refer to it as complete, with the distinction evident from the context. For example, if for any two configurational distributions produced by Lennard-Jones (LJ) potentials $Q$ contained *η*, then $Q$ would be complete when optimizing a model Lennard-Jones system to fit data produced by a reference Lennard-Jones system with initially differing parameters. The implications of optimizing via an incomplete $Q$ are discussed in Sec. III.

Despite the seemingly opaque form of Eq. (18), the variational statement provided has a notable intuitive description, which will be useful when considering implementation and connections to similar methods. Consider an external observer that has access to a mixture of molecular configurational samples, some of which are produced by our mapped reference simulation and others from our CG model (termed our reference and model samples, respectively). The observer is faced with the following task: they must distinguish which examples came from which source based solely on configurational details. We represent the observer’s guess by the function $\eta ^$, which maps each molecular configuration to a number in the interval [0, 1]. We associate the label 0 with configurations from our model and the label 1 with configurations from our reference set (note that the labels are discrete, but our estimate is a real number between 0 and 1 inclusive). We decide in this case to use the square loss, giving us the following definitions for our loss functions:

where *x* is a particular molecular configuration. For example, if the observer guesses a probability of 0.68 for a configuration that was drawn from the reference set, they are penalized (1 − 0.68)^{2} = 0.1024. If the configuration instead came from the model data set, they are penalized 0.68^{2} = 0.4624. The observer wishes to minimize their penalty, and if they are able to guess 1 for all configurations drawn from the reference set and 0 for all the configurations drawn from the model set, then their loss will be minimized at 0.

If the model is very poor, achieving an average loss of 0 will be easy—the configurations from the model will be distinct from the reference configurations. However, for higher quality models, many of their configurations will plausibly come from either the model or the reference simulation. Even with the perfect $\eta ^$, a configuration which has a 50% probability of coming from the reference and model sets will entail a minimum loss of 0.25 (this minimum is entailed when the estimated probability is *also* 50%); this loss cannot be reduced further. We refer to this loss as the irreducible loss. This is analogous to the least squares residual present in linear regression with Gaussian noise. The ideal line minimizes the least squares residual, but the least squares residual is nonzero as the line cannot perfectly fit the data.

This inability to perfectly distinguish samples is directly related to our *f*-divergences (e.g., relative entropy).^{66} Modifying the manner in which we penalize incorrect predictions (via *l*_{mod} and *l*_{ref}) specifies which divergence is produced. In this example, we have decided on the form of our losses directly; when estimating a particular *f*-divergence, the expressions defining the losses are given by Eq. (17). This loss function is asymmetric depending on the true origin of the sample: *l*_{mod} penalizes a prediction on a sample gained from the model, while *l*_{ref} penalizes a prediction on a reference sample. Notably, while there are constraints on what functions *l*_{mod} and *l*_{ref} can be defined as in order for *η* (the optimal $\eta ^$) to obey Eq. (19), these constraints are already taken into account by Eq. (17): a valid *f*-divergence will always yield losses whose optimal estimate is given by Eq. (19).

As a result, we simply need to train a classifier with a loss on our samples and consider the average loss implied by its probabilistic predictions. An extended formal description of this task and the corresponding duality is presented in Reid and Williamson.^{66} This interpretation is central to the term adversary in the name of generative adversarial networks:^{61} the adversary attempts identify the source of each sample, and we wish to make its task as difficult as possible.

### E. Virtual sites

The ARCG framework can be lightly generalized to decouple the resolution at which the CG potential acts and the resolution at which we compare our CG and reference systems. More specifically, we see that we can apply a distinct mapping operator to our CG system before it is compared to the mapped FG samples. To better illustrate the practical use of this extension, we begin by providing a motivating example.

As previously discussed, many bottom-up CG methods are shown to produce the ideal PMF when they are allowed to adopt any force-field in the ideal sampling limit. However, CG models are often limited to molecular mechanics type potentials (e.g., pairwise nonbonded potentials), which often do not contain the ideal PMF as a possible parameterization. For example, one might use Multiscale Coarse-Graining^{13–16} (MS-CG) to parameterize a CG lipid bilayer in which all of the solvent and some of the lipid degrees of freedom have been removed. Upon generating samples using the CG model, we may find that certain properties of the membrane, such as its thermodynamic force of bilayer assembly, are poor. However, the MS-CG method has likely provided one with its correct characteristic approximation; in order to improve the model with the same parameterization method, one must either increase the complexity of the CG force-field via higher order terms or change the FG details retained via modification of $M$, the CG map seen in Eqs. (4) and (5). Here, we discuss a third option: augmenting the CG representation directly without modifying $M$. As a simple example, consider modeling the interaction of two benzene molecules using a CG pairwise potential. Here, the CG representation is given by three sites per benzene ring. It may be difficult to capture the *π*-stacking effect using this type of potential at this CG resolution. As a remedy, one could add particles normal to the plane containing the benzene molecule, as shown in Fig. 1, without associating these additional CG sites to FG sites via $M$. Importantly, however, we will only critique the behavior of our CG model after these virtual sites have been integrated out: the CG model is optimized to minimize the relative entropy between the mapped FG reference and CG model after the integration over the possible positions of these virtual CG sites.

Description of the formalism encompassing these situations requires us to suitably expand our notation. We still consider all distributions described previously but use the following modifications: first, samples from *p*_{mod} are no longer generated by a simulation using the approximated PMF as its Hamiltonian. Instead, these samples are produced via a new mapping operator $G$ and simulation of a new finer grained representation characterized by $pmodpre$ via its own Hamiltonian $\u2211i=1\nu pi2/2mi+Umodpre(r3\nu )$, where $m$_{i} are the masses at the pre-CG resolution. As a result, *p*_{mod} is redefined with the following relations:

The resulting relations between resolutions are summarized in Fig. 2.

Importantly, our training procedure needs two minor modifications. First, the variational estimation of divergences presented in Eq. (10) is composed solely of ensemble averages, which are approximated via sample averages; these averages can be evaluated by generating empirical samples from *p*_{mod} via samples drawn from $pmodpre$ and application of $G$. This is a consequence of the following equation:

Second, the gradients required for optimization of the parameters of the variational search (** θ**) are calculable again through Eq. (24), allowing us to utilize our previous expression Eq. (15) at the resolution native to our new pre-CG Hamiltonian by minimizing the variationally optimized observable composed with $G$.

Importantly, while our examples in this section have primarily concerned situations in which fictional particles are added to the CG representation and then completely integrated over before calculating divergences, $G$ can easily be generalized. Fundamentally, it has the full flexibility of $M$; similarly, additional constraints are born from maintaining momentum consistency via methods described in Subsection II E 1. However, if one discards momentum consistency, it is possible to maintain an intuitive pre-CG representation while nonlinearly modifying $M$ and $G$ to represent custom high-dimensional observables. In this case, these mapped distributions are used for determining the quality of the pre-CG model. For example, one can consider improving an existing CG model without virtual sites by modifying both $M$ and $G$ to remove problematic secondary particles and focus on fitting at a reduced resolution. We reserve the bulk of our discussion and investigation of this more complex option to a future article.

#### 1. Momentum consistency

Previous sections have discussed the configurational variational statement central to ARCG; here, we discuss how to ensure momentum consistency. In the case that no pre-CG resolution is considered, momentum consistency in ARCG may be achieved through identical methods as stated in previous approaches, such as MS-CG.^{15} However, when considering three distinct resolutions, momentum consistency takes on a slightly modified form. We provide suitable constraints for a common case below although extensions are straightforward.

Momentum consistency is characterized by the following equation:

We here consider the specific case where both $M$_{r} and $Gr$ are linear functions that satisfy the constraints defined in the MS-CG work:^{15},$Gr$ is limited to associate each CG site in $X\u2009pre$ unambiguously to at most a single site in $X$ and has imposed translational and positivity constraints, and analogous constraints are placed on $M$_{r} (see Appendix B for more details). The momentum map $M$_{p} (and $Gp$ with appropriate modifications) is assumed to take the following form as in Ref. 15:

In this case, previous work^{15} has shown that the constants defining $M$_{p} (and similarly $Gp$) can be combined with the masses of the sites contributing to a mapped site to provide a definition of the mapped masses [Eq. (27)] that define a Boltzmann distribution equal to the mapped momentum distribution,

where $MIM$ is the mass of CG particle *I* as implied by map $M$, $IIM$ is the set of all atoms that map to CG site *I* according to map $M$, and $cIiM$ is the coefficient describing how the positions of FG particle *i* contribute to CG particle *I* according to map $M$. More generally, this implies that we can explicitly characterize the mapped momentum distributions for both the mapped FG and mapped CG systems, which when combined with Eq. (25) provides the following relation implying momentum consistency in a system with virtual particles,

The only solution to this equation is to set $MIG=MIM$ for each CG site *I*; in this case, we find a set of equations implying consistency,

Note that these equations are positively constrained with respect to masses and mapping constants (along with the previously stated constraints). This provides a simple condition connecting our FG masses, pre-CG masses, $M$, and $G$ and allows one to check for momentum consistency if all the relevant variables are defined. It is important to note that *I* indexes the CG sites at the resolution of *p*_{ref} and *p*_{mod}—that is, without the virtual particles. As such, in the case of $G$ simply dropping virtual particles, consistency is trivially satisfied by simply matching the masses of the nondropped particles to those implied by the FG system with $M$. Additional details may be found in Appendix B.

### F. Related methods

Despite differences in representation, ARCG can be formulated to elucidate connections to a variety of previous CG parameterization strategies. This is performed via the appropriate design of the characteristic function space $Q$ in Eq. (10). Additionally, ARCG bears resemblance to a recent CG method based on distinguishability and classification.^{29} In this section, we make explicit connections between the *f*-divergence implementation presented in this article and such external methods. The applications of the *f*-divergence duality presented here are in the infinite sampling limit, with a complete $Q$, and with converged optimization; in practice, significant differences in seemingly equivalent methods may arise.

Classification has been recently used to train a CG model by using the resulting decision function $\eta ^$ to directly update the CG configurational free energy.^{29} This is motivated by noticing that *η* that satisfies the variational bound in Eq. (17) can be related to the pointwise free energy difference as

suggesting a procedure where $log(1\u2212\eta ^)\u2212log(\eta ^)$ is scaled and used as an additive update to the CG potential. This procedure is similarly valid using any of the *f*-divergence losses discussed in this article.^{66} However, beyond the differing update rules, the variational divergence approach presented in this article is differentiated by a subtle but important difference in characteristic assumptions. The divergence interpretations of ARCG rely on the completeness of $Q$ but place no constraint on ${pmod,\theta}\theta \u2208\Theta $, where Θ denotes the set of all model parameterizations considered. In contrast, the interpretation of the method of Lemke and Peter^{29} also requires an fully expressive $Q$; however, as the update to *p*_{mod} inherently utilizes members of $Q$, the method naturally also forces ${pmod,\theta}\theta \u2208\Theta $ to be fully expressive, i.e., $pref\u2208{pmod,\theta}\theta \u2208\Theta $. In other words, $Q$ and ${pmod,\theta}\theta \u2208\Theta $ are directly coupled. As a result, in the case that the classifier used in the additive update method similarly has a relation to a specific *f*-divergence, an ideal model would always be chosen, rendering the specific choice of *f*-divergence inconsequential. Beyond this, it is unclear how to expand the update rule of Lemke and Peter^{29} to apply to virtual sites as the classifier is only directly present at the resolution of *p*_{mod} and extension of the update to the resolution of $pmodpre$ is unclear.

REM CG proposes that approximate CG models should be parameterized by minimizing the relative entropy,^{17} or KL-divergence, between the distributions produced at the FG resolution,

where we have introduced a new quantity, $pmod,rFG$, defined to be the probability density implied by the CG model over FG space (which is not used in ARCG theory); the exact form implied over the FG space depends on the interpretation of REM CG considered.^{49} This differs by a constant (when considering CG force-field optimization) from the relative entropy considered at resolution of the CG model, given by

KL-divergence is an *f*-divergence [generated by *f*(*x*): = *x* log *x*] and in the case of Eq. (33) can resultingly be formulated and solved for in the current framework, providing the following losses through Eq. (17):

We utilize this method for the computational examples presented in Sec. III. We note that the full specification of REM CG considers comparing a coarser CG model to a finer FG model at the FG resolution by defining a new model density at the FG resolution, where we have used many-to-one functions to reduce the resolution of the FG and pre-CG model in our theoretical approach. However, calculation of the relative entropy at the CG resolution produces the same model selection rule as the FG relative entropy when considering the CG force-field. Optimization of systems with virtual particles is not straightforward via REM CG as most refinement schemes require $\u27e8\u2202\theta Umod,\theta \u27e9$ which is difficult to calculate as the explicit form of $Umod,\theta $ is unknown in the case of virtual particles.

Schöberl, Zabaras, and Koutsourelakis^{33} extended REM CG by framing coarse-graining as a generative process where the FG statistics are nondeterministically produced by the CG variables by means of a backmapping operator, a method termed Predictive Coarse-Graining (PCG). This approach allows optimization of the backmapping operator itself and additionally allows more flexibility in describing the connection between the FG and CG systems. This allows PCG to describe CG models with virtual particles. Additionally, PCG is trained using expectation-maximization, which can be framed as a two part process with a variational search providing the information for a gradient update of the parameters. PCG differs from ARCG in multiple ways. First, PCG aims to optimize an iteratively tightened lower bound on the relative entropy of the CG model, whereas ARCG encompasses the optimization of a larger variety of possible metrics, including relative entropy. Additionally, the variational estimation in PCG is solved via a closed form expression and generates a gradient update which optimizes said lower bound, where the variational optimization in ARCG is solved iteratively in practice and provides the exact gradient of relative entropy. Finally, ARCG is not formulated as generating statistics at the FG resolution and instead is formulated on the CG resolution. Despite these differences, the overall similarity between PCG and ARCG suggests that the two methods could be used to extend each other. We reserve a detailed analysis of these connections for a future work.

Alternatively, recent work by Vlcek and Chialvo^{25} (as well as previous work by Stillinger^{10}) suggests that the Bhattacharyya distance (*BD*) Eq. (37) is a natural metric to judge approximate models,

While the Bhattacharyya distance is not an *f*-divergence, it is related to one via a monotonic transformation: the Hellinger distance (*H*),

This can be variationally approximated in the same framework as REM CG, resulting^{66} in the following losses:

Justification of the Bhattacharyya distance may be grounded in information geometry and the distinguishability of samples produced by the FG and CG models. Despite the apparent similarity to the fictional game described earlier, the justification of Vlcek and Chialvo^{25} is grounded in distinguishing populations via their collective empirical samples, while our game focuses on distinguishing individual configurations. The stated connection simply occurs through our duality with *f*-divergences.

Inverse Monte Carlo (IMC),^{11} also known as Newton Inversion (NI), optimizes the expected value of a vector-valued observable that characterizes the difference between the mapped FG and CG systems (often through their radial distribution functions) and can be adapted for use on systems with virtual particles. The covariance derivative term presented in Eq. (15) may be used to create a set of equations analogous to those used in traditional IMC/NI in order to couple an arbitrary potential to a vector observable quantifying the quality of the model.^{84} Importantly, observables (e.g., the radial distribution function) defined on nonvirtual sites may still be calculated with virtual particles present, which allows the optimization to be performed at the pre-CG resolution using reference data defined only at the CG resolution. ARCG may be viewed similarly as optimizing the expected value of observables. However, in ARCG, the observable minimized at each step is scalar and must be variationally found at each step of optimization, and the corresponding derivatives additionally equal those of a formal divergence. However, due to the envelope theorem, the derivatives calculated for both ARCG and IMC/NI share a similar covariance form, as shown in Eq. (15). Additionally, the vector observables considered in IMC/NI are typically connected to the form of potential being optimized. As an example, when optimizing pairwise spline potentials, the observable is closely related to the empirical radial distribution function. The calculation of observables of this type generally follows a histogram based approach which can limit the generalization to higher dimensional potentials. We note that while ARCG does in practice use similar statistics, no ensemble histogramming is performed.

There exist additional CG methods that are difficult to directly compare to ARCG (e.g., Refs. 11, 13–16, and 18). However, in general, most methods considered make assumptions that strongly inhibit virtual site application. Specifically, methods often assume that the CG potential (or its derivatives) can be applied at the resolution of the CG samples acquired (either through calculation of the residual or the update strategy facilitating optimization) although extensions are sometimes feasible. For example, traditional MS-CG force-matching optimizes the CG force-field to optimally match mapped forces; with a general linear $G$ and $Umodpre$, this would likely require an iterative procedure to determine the mean force implied at the CG resolution by $G$ and $Umodpre$. Alternatively, gYBG inverts two- and three-body CG correlations to produce a force-field at the corresponding resolution of the observed correlations; similarly, iterative Boltzmann inversion requires a map to define the iterations that connect modifications in the potential to changes in the observed correlations (which is nonintuitive when considering parameters associated with general virtual sites). These limitations often do not appear to be fundamental ones, but rather one of implementation; extensions to these methods that circumvent this limitation are likely possible. There are three straightforward strategies to remove this limitation, the first two of which the authors know are in current use. First, methods such as binning or kernel density estimation are used to approximate the probability density at a resolution differing from the CG configurational Hamiltonian (e.g., the radial distribution approach in Ref. 25). This approach is often limited to lower dimensional spaces when comparing models. Second, constraints are placed on virtual sites such that $Umodpre$ may be related via closed expression to *U*_{mod}.^{85} This approach inherently requires limiting the type of virtual site considered. Third, methods that allow the observed mapped FG sample to be backmapped to the pre-CG domain are applied, and then, traditional approaches are used on the backmapped sample. This approach requires that the reference virtual particles are assigned a distribution using a set rule. We note that this final strategy does not directly optimize *U*_{mod}; rather, it optimizes $Umodpre$ to match the backmapped distribution (which would still result in configurational consistency at the resolution of $XR$). These extensions may additionally be used with various methods discussed in the previous paragraphs; for example, traditional REM CG is applicable without modification if a suitable backmapping strategy exists. In contrast, ARCG is well suited to higher dimensions, imposes no constraint on the virtual sites, and does not require backmapping; however, it incurs increased training complexity.

Finally, we note that while there is significant overlap between ARCG and GANs with respect to the residual calculation and optimization, the method by which samples are produced in the models is conceptually distinct. GANs are characterized by transforming noise to fit a desired distribution; the optimization of the model parameters modifies the nature of this transformation. In contrast, the transformation present in ARCG is held constant, while the underlying sample generating process is modified.

## III. IMPLEMENTATION

Section II provided an abstract definition of the ARCG method, including the specific form with connection to *f*-divergences. In this section, we provide the corresponding concrete expressions for optimizing models using relative entropy by implementing the classification based approach described in Sec. II D. Additional practical points on implementation, relaxations of the method for stability, and the specification of $Q$ are also discussed.

As previously noted, the relative entropy between *p*_{ref} and *p*_{mod} is an *f*-divergence and is obtained by setting *f*(*x*) ≔ *x* log *x*. This implies equivalence with a classification task with the aforementioned losses in Eq. (34), from which we derive the model optimization statement using Eq. (18) and associated gradients using Eq. (15), such that

This comprises a full residual and associated gradient for optimization. However, in practice, the loss functions are poorly behaved: pointwise values of $\eta ^=1$ easily create a divergent residual value (identical to the corresponding situation with the traditional relative entropy estimation methods). Fortunately, the optimal *η* is shared among all proper losses.^{66} As a result, $\eta ^\u2020$ (which denotes the optimal $\eta ^$ found in practice) can be similarly discovered with the corresponding statement using the log-loss^{66,72}

while the gradient estimation remains unchanged. To summarize, the models trained in this article indirectly minimize Eq. (42) by producing derivatives over ** θ** via Eqs. (44) and (43), where $\eta ^\u2020$ retains the same meaning across equations. This equality only holds assuming that $\eta ^\u2020=\eta $; incomplete $Q$ can cause the resulting $\eta ^\u2020$’s to differ.

The numerical examples in Sec. IV are computed in the following way. First, the CG (or pre-CG) model is represented using a force-field and samples are generated using standard molecular dynamics software. These samples are mapped if necessary using $G$. Reference examples are similarly generated and mapped using $M$ if needed. The variational estimator is represented using either a neural network or through logistic regression, which implement Eq. (44). The estimator then is trained on the reference and model samples. Finally, the gradient is calculated using the output of the variational estimator [Eq. (43)] and the model samples; this gradient is then used to update the model parameters. This process is iterated although the reference sample is not regenerated. The variational estimators are not fed the Cartesian coordinates of the input system directly; instead, various features are calculated for each frame, and these features are given as input to the variational estimator. This has the effect of constraining that $\eta ^$ be a function of these features. Additional points on each of these details are discussed for each example or may be found in Appendix D.

In some cases of ARCG, including the case of *f*-divergence estimation, the functions achieving the inner maximum with a complete $Q$ can be expressed as a pointwise function of the mapped distributions. Specifically, as noted in Eq. (19), in the case of relative entropy, the optimal witness function *η* is expressible as a function of the conditional class densities (*p*_{mod} and *p*_{ref}, where we have dropped the configurational subscripts). This can guide how elements of a tractable $Q$ are parameterized. When the algebraic forms of *p*_{ref} and *p*_{mod} are known to be functions of summary statistics of their respective systems (e.g., the inverse 6 and 12 moments in a traditional Lennard-Jones potential^{86}), we can often express a complete $Q$ exactly with a manageable number of terms. However, this is not true of practical bottom-up CG application: the form of the mapping operator does not provide us with an algebraic understanding of the implied mapped free energy surfaces. The resulting *η* does maintain invariances with the free energy surfaces it is composed of (e.g., rotational and translational invariances).

When considering the creation of CG models for realistic chemical systems, it is therefore reasonable to examine how an incomplete $Q$ would affect the approximate model produced by ARCG. This situation may arise when providing an incomplete feature set as input to a parametric representation of a function or when the parametric representation itself is limited in its capacity (e.g., when logistic regression is used). The implications of general restrictions on $Q$ are complex and reserved for future studies; however, in the case of *f*-divergences, certain conclusions are clear without substantial analysis:

When considering identical

*p*_{ref}and*p*_{mod},*η*is a constant function and is typically included in $Q$. In other words when*p*_{mod}completely satisfies configurational consistency, $Q$ is typically complete. Additionally, this*η*provides a lower bound on the divergence estimate given by ARCG for any*p*_{mod}considered. Assuming $Q$ contains this function (but not assuming completeness), if the residual estimated for a candidate*p*_{mod}is*not*zero, the distribution considered does not satisfy configurational consistency and the $Q$ considered is always expressive enough to determine that the model satisfying configurational consistency is preferable. Alternatively, if a divergence of zero is returned for a given*p*_{mod}, this is motivation to expand the expressiveness of $Q$.If the parametric estimator acting on the given set of features is assumed to be fully expressive, then the divergence returned by ARCG is simply the divergence calculated at the resolution of the features themselves. The model optimization process can be recast as one with a complete $Q$ at the resolution given by composing $M$ and $G$ with the featurization function. For example, suppose one is using ARCG to create a CG lipid bilayer with a solvent comprised of virtual particles. However, additionally suppose there exists solvent in the FG system that

*could*be connected to the behavior of the CG solvent. One could reclassify the virtual particles present instead as typical CG sites but then limit the variational search to functions which do not take the solvent behavior into account, producing an identical optimization principle as that of the virtual solvent approach described previously.An incomplete $Q$ can change which approximate model ARCG returns. For example, consider selecting one of two models which reduces the relative entropy with respect to a reference distribution. In this example, our model and reference distributions each contain two degrees of freedom and their probability distributions are independent with respect to these dimensions. Suppose that the first candidate model has a different probability distribution along the first degree of freedom as the reference data but the same distribution along the second degree of freedom. Additionally, suppose that the second candidate model is similar except that the roles of the first and second degrees of freedom are reversed: the first degree of freedom has the same distribution as the reference data, but the second degree of freedom differs. If we consider variational estimators which use a $Q$ limited to functions of only the first or second degrees of freedom, it is clear that the optimal model chosen is a function of the particular incomplete $Q$ selected.

The effect of these properties on the parameterization of complex models will be investigated in a future work.

MS-CG and REM CG are known to be convex when paired with potentials which vary linearly with respect to their parameterization.^{15,16,87} ARCG requires two types of optimization: the solution of the variational statement defining the divergence and the optimization of the CG model parameters.^{88,104} The convexity of inner variational statement defining the divergence depends on the particular divergence chosen and the representation of the function space used (e.g., logistic regression or a neural network). The optimization estimating the divergence for the examples presented in this paper is convex when using logistic regression and nonconvex when using a neural network. The outer optimization modifying the CG model is characterized by the parameterization of the CG force-field, the divergence emulated, and the completeness of $Q$. When $Q$ is complete, the characteristics of the divergence emulated can be used to check for convexity; for example, the optimization of the CG parameters for the simple Lennard-Jones liquid example presented is convex. However, when $Q$ is incomplete or the potential is nonlinear, convexity is unclear, as is true for the remainder of the examples presented.^{89}

The integrals characterizing the variational residual are computationally approximated as sample averages. Optimizing a function using a sample average introduces the possibility that the function which maximizes the sample average is a poor approximation of the function which maximizes population average. In the context of classification, this error is captured by considering whether the classifier is overfitting the data sample. There are multiple strategies to overcome this;^{72} in the current study, we use *L*2 regularization and only allow the variational estimator to update a limited number of times at each iteration. When optimizing examples with flexible potentials and large feature sets (for example, the water and methanol models presented in Sec. IV), we have found that using a neural network quickly overfits the data provided, even with relatively strong regularization. However, reducing the number of iterations allowed at each step of variational optimization causes the neural network to exhibit considerable hysteresis between iterations, causing the force-field being optimized to orbit around an ideal solution. To ameliorate this, we use logistic regression in these more complex cases, where the solution to the logistic regression is optimized using a limited number of iterations of l-BFGS. Note that the output of logistic regression readily affords estimates of the class conditional probabilities, which in turn directly connects its optimal solution to *η*.

The issues associated with overfitting and hysteresis are intrinsically connected to the size of the finite samples used to approximate the integrals in Eq. (10). Overfitting would be reduced by increasing the sample size, which in turn would allow additional optimization at each variational iteration and would therefore reduce hysteresis. In practice, we have found that increasing the sample size to the point that the hysteresis is removed slows down the rate of force-field optimization considerably due to the time needed to both generate molecular samples and evaluate high dimensional gradients. This difficulty naturally suggests the use of modified sampling strategies to reduce the discrepancy between the sample and population averages. As Eq. (10) only involves expectation values, any modified sampling scheme which allows for the calculation of an unbiased ensemble average is a candidate for this strategy. It is worth noting that ARCG selects an optimal observable function based on the ensemble averages of a large set of candidate observables and that the error for each observable may be different, complicating the use of variance reduction techniques which are designed for a single observable. It would also be possible to improve the sampling of the feature space on top of which $Q$ is represented; for example, if $Q$ is represented by a neural network acting on two statistics calculated for each sample, free energy estimation may be used to better resolve the joint distribution of these two statistics themselves. This approach could be extended to produce a parameterization method which improves the observable estimates on the fly, such as that in the work of Abrams and Vanden-Eijnden.^{90} While we have not pursued these strategies here, future applications to more complex systems will likely need to consider these options.

The variational search over possible $\eta ^$ was either performed via a neural network outputting class probability predictions penalized via the log-loss or through logistic regression. Logistic regression was used in the cases of examples using *b*-spline based potentials, and neural networks were used in all other cases. All neural networks used in examples in this paper utilized a simple feed-forward architecture with at least two layers (not including the input and output layers). The results were found to be insensitive to the architecture chosen, and the specific architectures used may be found in Appendix D. All internal nodes used rectified linear activation functions with the output normalized via softmax. The duality with classification underpins the utility such traditional choices have in our variational search.

In practice, we have noticed that ARCG optimization may suffer from instability, especially when optimizing the parameters of a model that produces a distribution significantly different than its optimization target. This issue can be noted by observing that the classifier achieves 100% accuracy during parameterization, producing uninformative gradients. In these cases, we find that an effective strategy is to introduce standard Gaussian noise into both the model and reference samples; the variance of this noise is gradually reduced to zero as the optimization progress. It is likely that a correct local minima is achieved in this case as the optimization appears stationary at the end of minimization, but it is unclear if the selection of a specific local minima is biased using this strategy.

A public proof-of-concept Python/Lammps based implementation is available at the weblink https://github.com/uchicago-voth/ARCG. This code base makes extensive use of the theano, theanets, pyLammps, numpy, scikit, and dill libraries. All computational examples presented in this paper may be found in the test portion of this code, which includes the complete settings used to generate the data used. Visualizations and analysis were performed with the matplotlib and seaborn libraries, as well as the base plotting system, rgl, and data.table packages in R. Extensions providing scalability for more complex systems and potentials will be considered in future work.

## IV. RESULTS

The relative entropy approach described in Sec. III was applied to five test systems. First, a simple single component 12-6 Lennard-Jones (LJ) system was optimized to approximate a reference LJ system at the same resolution (no virtual particles were present, and no coarse-graining of either the reference or model was performed). Second, a system representing two bonded real particles where force is partially mediated by a single harmonically bonded virtual particle was optimized to approximate a reference system of the same type. Third, a binary LJ liquid undergoing phase separation was simulated and optimized after particles of a single type had been integrated out; this distribution was fit to match a similarly integrated binary LJ system. Fourth, a CG model using pairwise *b*-spline interactions and a single site per molecule was used to approximate liquid methanol. Fifth, a CG model using pairwise *b*-spline interactions and a single site per molecule was used to approximate liquid water. In these cases, we observed good convergence of suitable correlation functions; however, in cases with virtual particles, we found that numerically recovering the known parameters of the reference system is difficult; in other words, it seems likely that the parameter space is either redundant or sloppy,^{91} with similar correlation functions arising from distinct parameter sets. We note that while the potentials considered here are relatively simple, ARCG is fully applicable to more complex potentials such as those in the work of Zhang *et al.*^{34}

The first three examples considered here are theoretically able to capture the reference distributions used for fitting (i.e., the model optimized is not misspecified). This is ensured by generating reference data using a force-field that is directly representable by the CG force-field family. For example, the LJ CG model in the first example was optimized to reproduce the statistics generated by a particular LJ reference potential. Additionally, when either the reference or model is modified using a mapping function, this mapping operator is forced to be the same between the two systems, and the reference data are again produced using a force-field which is expressible by the CG model. For example, in the case of the virtual solvent LJ system, a distinct system of binary LJ particles was simulated for both the model and reference data samples, each with differing parameter sets. Both systems then had the particles of a specific shared type integrated out. The resulting integrated distributions were then the basis of comparison used to train the model parameters (with new statistics being created for the CG model at each iteration). This is not true for the examples approximating water and methanol: here, the CG model is approximating the mapped distributions using a pairwise potential and is unable to capture the true free energy surface.

### A. Lennard-Jones fluid

A single component 12-6 LJ fluid was simulated with 864 particles at 300 K [the potential form is given in Eq. (45) with *r*_{ij} denoting the Euclidean distance between particles *i* and *j*],

The system was simulated at constant NVT conditions using a Langevin thermostat with coupling parameter set to 100.0 fs and a time step of 1.0 fs. No virtual particles were present, i.e., $G$ and $M$ are set to be the identity function. Inverse sixth and twelfth moments were used as input to the variational estimator (in this case, this set of features is known to be complete, see Appendix D for details). System *A*_{initial} with $\u03f5Ainitial=0.6\u2009kcal/mol$ and $\sigma Ainitial=3.5$ Å was optimized to match the statistics of system *B* characterized by *ϵ*_{B} = 0.75 kcal/mol and *σ*_{B} = 3.0 Å. Upon optimization, the parameters of *A* were seen to quantitatively converge to those of *B*: $\u03f5Aopt=0.746\u2009kcal/mol$ and $\sigma Aopt=3.00$ Å. Additionally, convergence of the pairwise correlation functions (Fig. 3) was observed. The initial parameters of *A* resulted in a homogeneous liquid, while those of system *B* (and system *A* upon optimization) resulted in liquid-vapor coexistence. During training, Gaussian noise was used to smooth out initial gradients to resolve initial soft wall differences; this noise is reduced to zero by the end of optimization. Optimization was performed using RMSprop^{79} with individual rates for each parameter. These results demonstrate good convergence properties with small parameter sets when no virtual particles are considered in the pre-CG resolution.

### B. Virtual bond site

A system of three particles completely connected via harmonic bonds was simulated at 300 K. The system was propagated in constant NVT conditions using a Langevin thermostat with the coupling parameter set to 100.0 fs and a time step of 1.0 fs. Two types of particles were present; we denote the types of the particles *X*, *Y*, *X*. Upon application of $M$ and $G$, the *Y* particle is removed, resulting in a system composed of two particles of type *X* (i.e., the *Y* particle is a virtual site). This mapped system is optimized using the distance between the two *X* particles as input to the discriminator; in this case, this feature set is complete. Initial, optimized, and reference parameters are seen in Table I. Optimization was performed using RMSprop. Convergence to a specific parameter set that reproduces observed correlations (Fig. 4) is fast; however, these parameters differ from the parameters of the reference system. Additional simulations were run where the CG model was initialized with parameters set to those of the reference system (results not shown); in this case, we observed local diffusion around a small set of parameters including the true set. This suggests that virtual particles may create degeneracy in model specification in practice (i.e., even if the model parameters are identifiable, the specification is sloppy). This case represents an application where a pairwise force-field may be augmented via bonded virtual particles to create modified correlations.^{92,105–107} For example, a heterogeneous elastic network^{93} may be augmented by introducing virtual particles to facilitate higher order correlations.

. | x_{XY}
. | $kXY$ . | x_{XX}
. | $kXX$ . |
---|---|---|---|---|

System . | (Å) . | (kcal/mol)/Å^{−2}
. | (Å) . | (kcal/mol)/Å^{−2}
. |

B | 2 | 2.7 | 2.3 | 0.4 |

A_{initial} | 0.65 | 2.2 | 1.4 | 0.15 |

A_{opt} | 1.70 | 2.06 | 2.66 | 0.224 |

. | x_{XY}
. | $kXY$ . | x_{XX}
. | $kXX$ . |
---|---|---|---|---|

System . | (Å) . | (kcal/mol)/Å^{−2}
. | (Å) . | (kcal/mol)/Å^{−2}
. |

B | 2 | 2.7 | 2.3 | 0.4 |

A_{initial} | 0.65 | 2.2 | 1.4 | 0.15 |

A_{opt} | 1.70 | 2.06 | 2.66 | 0.224 |

### C. Virtual solvent Lennard-Jones fluid

A binary system composed of 864 LJ particles of types *X* and *Y* was simulated at 300 K. The system was simulated at constant NVT conditions using a Langevin thermostat with coupling parameter set to 100.0 fs and a time step of 1.0 fs. Equal numbers of *X* and *Y* particles were present prior to the application of mapping operators; upon application, all particles of type *Y* were removed (i.e., the *Y* particles are virtual sites). The target system was parameterized to undergo phase coexistence, while the unoptimized CG model was well mixed. Parameters are found in Table II. Optimization was performed using RMSprop with rates adjusted for each parameter. Gaussian noise was used to stabilize initial training. Visual inspection of representative molecular configurations showed greatly improved similarity for the optimized parameter set (Fig. 5). Again, while convergence of correlation functions is readily observed (Fig. 6), parameters do not converge to those of the reference system, likely due to sloppiness in specification.

System . | σ_{XX} (Å)
. | $\u03f5XX$ (kcal/mol) . | σ_{YY}(Å)
. | $\u03f5YY$ (kcal/mol) . | σ_{XY} (Å)
. | $\u03f5XY$ (kcal/mol) . |
---|---|---|---|---|---|---|

B | 0.7 | 3.6 | 0.7 | 3.6 | 0.35 | 3.5 |

A_{initial} | 0.6 | 3.5 | 0.6 | 3.2 | 0.5 | 3.1 |

A_{opt} | 0.713 | 3.600 | 0.722 | 3.594 | 0.349 | 3.494 |

System . | σ_{XX} (Å)
. | $\u03f5XX$ (kcal/mol) . | σ_{YY}(Å)
. | $\u03f5YY$ (kcal/mol) . | σ_{XY} (Å)
. | $\u03f5XY$ (kcal/mol) . |
---|---|---|---|---|---|---|

B | 0.7 | 3.6 | 0.7 | 3.6 | 0.35 | 3.5 |

A_{initial} | 0.6 | 3.5 | 0.6 | 3.2 | 0.5 | 3.1 |

A_{opt} | 0.713 | 3.600 | 0.722 | 3.594 | 0.349 | 3.494 |

This case is representative of the situation where higher order correlations may be captured by the addition of virtual solvent particles. For example, the hydrophobic driving force underlying a CG lipid bilayer could be facilitated by a virtual solvent. This is distinct from using traditional explicit solvent where each solvent molecule is directly connected to the FG reference system: there, the behavior of the solvent is incorporated into the quality of the model, where the approach of ARCG ignores the direct solvent behavior.

### D. Single site methanol

Methanol was modeled using single site CG liquid. The reference atomistic (FG) trajectory of 512 molecules was simulated in the NVT ensemble at 300 K with a Nose-Hoover damping time of 1 ps after NPT equilibration at 1 atm. The all-atom optimized potential for liquid systems (OPLS-AA)^{94–96} force-field was used in the atomistic system. The FG system was mapped to the CG resolution by retaining only the central carbon; no virtual sites were present in the CG system. The CG potential was described using a pairwise *b*-spline potential using 15 equally spaced knots and a 10 Å cutoff (the last three control points were set to zero to enforce a smooth decay at the cutoff). The CG system was run at the same temperature and volume as the FG system using a Langevin coupling parameter of 100 fs and a time step of 1 fs. The starting potential used for the simulation was a Weeks-Chandler-Andersen (WCA)^{108} potential (Fig. 7). Quantitative reproduction of the radial distribution function was observed (Fig. 8). Convergence was smoothed using Gaussian noise whose standard deviation decayed to zero by the end of the optimization.

### E. Single site water

A single site model of water was trained using 512 molecules of extended simple point charge (SPC/E)^{109} water simulated at 300 K. The molecular (FG) system was equilibrated at 1 atm, and production NVT samples were produced with the Nose-Hoover thermostat with a 1 ps damping time. The mapping connecting the FG system to the CG system was the center of mass mapping; no virtual sites were present in the CG system. The CG system potential was limited to pairwise interactions with a 7 Å cutoff and was parameterized using *b*-splines with 37 knots (see Appendix D for knot locations and further details). The last three spline control points were set to zero to enforce continuity at the cutoff. The starting potential used for the simulation was a WCA potential (Fig. 9). CG simulations were run at the same volume as the FG system with a Langevin thermostat using a coupling parameter of 100 fs and a 1 fs time step. Quantitative reproduction of the radial distribution function was obtained (Fig. 10). Gaussian noise was used initially to smooth convergence and was tapered to a standard deviation of zero by the end of the optimization.

## V. DISCUSSION

In this work, we have described a broad new class of variational statements for optimizing CG models and described methods for their optimization by utilizing the theory underpinning adversarial models in ML. Subsequently, we have shown that it is possible to parameterize a CG model via ARCG at a coarser resolution than that native to the CG Hamiltonian. A clear application of ARCG is the parameterization of models that contain virtual sites; however, the CG distribution may be critiqued at any coarser resolution, providing the intriguing ability to control what aspects of a CG model are visible for optimization purposes. In the process of doing so, we showed that gradients needed at each step of divergence minimization can be reformulated as modifying the system Hamiltonian to minimize the value of a specific observable, but that this observable depends on the distributions being considered at that step of optimization. We note that more generally the method presented can be used to calculate the KL divergence (and any of the other divergences discussed) between distributions for which no probability density/mass is known and for which one cannot be approximated via kernel density approximation or binning.

Beyond our central results, we have provided work and discussion on two supporting topics.

We have provided comparisons to multiple contemporary methods for CG parameterization. In certain cases, we have shown that divergences characterizing their configurational variational principles can be used in ARCG modeling. In one case, we showed that a classifier based approach bears striking but not complete similarity to the presented approach. In the remaining cases, we have discussed how decoupling the resolution at which we critique a model from the resolution of the CG Hamiltonian creates difficulties in said approaches.

We have provided a set of sufficient conditions for momentum consistency in the case of virtual sites and described how these conditions may be extended. These are closely related to consistency requirements for traditional bottom-up CG models.

Additionally, we have provided simple numerical examples (and a public computational implementation) for which we have optimized CG potentials to match specific distributions, some of which utilize CG virtual particles. The results show quantitative agreement for calculated correlations, visual agreement, and qualitative agreement in matching exact coefficients when the answer is known (quantitative agreement is seen when virtual particles are not present). Difficulties in convergence appear to be either due to instability in the parameterization process or sloppiness in the model specifications. The manner in which this will affect realistic systems is yet to be seen but may present a significant challenge. It is clear that in the most general case, parameter uniqueness is not guaranteed: if CG consistency can be obtained without virtual particles, then a model that can both decouple the virtual particle interaction from the real particles and modify the behavior of the virtual particles independently of said coupling will inherently be nonidentifiable. Additionally, it is likely that in the case of *f*-divergence based ARCG optimization, a relatively good initial hypothesis for the CG potential may be necessary, or significant amounts of noise must be added initially during optimization.

There are multiple additional studies that could naturally expand and clarify the results presented.

The methods provided can be applied to approximate nontrivial molecular systems without virtual particles. This will require multiple steps: first, the proof-of-concept software framework presented will have to be expanded for larger system sizes. Second, the training method used will have to be developed such that it remains stable, whether through the systematic addition of noise or the use of enhanced sampling techniques. Third, the feature-space used to index $Q$ will likely have to be correctly engineered based on knowledge of the FG and CG Hamiltonians. All three of these are tractable challenges.

The effect of using virtual particles should be investigated both computationally and theoretically as previous analysis on incomplete basis sets (e.g., that on relative entropy and MS-CG

^{49}) does not apply transparently. In the process of doing so, a better theoretical understanding of how to utilize these methods to capture specific higher order correlations in the training data should additionally be investigated, possibly leading to new ways in which bottom-up CG parameterization may be tuned to reproduce specific novel correlation functions.The effect of various divergences on training approximate CG models should be investigated theoretically and through simulation. This will facilitate the design of CG parameterization methods that have different biases in the approximations they produce when coupled with realistic CG potentials. This applies to not only various

*f*-divergences but also the wider set of divergences not heavily discussed in this article, such as the Wasserstein,^{71}Sobolev,^{97}Energy,^{69}and MMD^{77}distances. The Wasserstein and Energy distances share the interesting property of taking into account the spatial organization of the domain of the probability distributions considered through a separate spatial metric. Combined with kinetically informed coordinate transforms such as time-lagged independent component analysis (TICA)^{98}and variants,^{99,100}it may be possible to parameterize models to have stationary distributions that are kinetically close to one another.The effect of an incomplete $Q$ should be investigated. In this case, the presented divergence based interpretation is not trivially accurate.

^{101}Understanding of how imperfect classifiers affect the parameterization of approximate models may have large implications on the optimization of complex multicomponent systems; overly expressive $Q$ will likely impede model parameterization as more sampling of the CG and FG system may be required.

## VI. CONCLUDING REMARKS

In this article, we discussed a new class of methods for the systematic bottom-up parameterization of a CG model. In doing so, we illustrated concrete connections between CG models and algorithms such as generative adversarial networks. Utilizing these connections, we both decoupled the resolution at which we critique our CG model from the CG potential itself and enabled the use of a variety of novel measures of quality for CG model parameterization. We provided a proof of concept implementation and several numerical examples. Additionally, we illustrated precise connections to several previous methods for CG model parameterization. Finally, we noted multiple future branches of studies that can now be pursued. Together, these results open a new conceptual basis for future systematic CG parameterization strategies.

## ACKNOWLEDGMENTS

This research was supported in part by the National Science Foundation (NSF) under Grant No. CHE-1465248, by the Office of Naval Research (ONR) through Award No. N00014-18-1-2574, by the Air Force Office of Scientific Research (AFOSR) through Grant No. FA9550-16-1-0118, and through the National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a. Simulations were performed using the resources provided by the University of Chicago Research Computing Center (RCC).

### APPENDIX A: ENVELOPE THEOREM

The envelope theorem is used to justify optimizing ** θ** using only the partials calculated holding the optimal observable constant. A general statement of the envelope theorem is given by Theorem 1 in the work of Milgrom and Segal,

^{78}which also notes that the theorem applies to directional derivatives in a normed vector space.

Let *X* denote the choice set, and let the relevant parameter be *t* ∈ [0, 1]. Let *f*: *X* × [0, 1] → **R** denote the parameterized objective function. The value function *V* and the optimal choice correspondence *X*^{*} are given by

Take *t* ∈ (0, 1) and *x*^{*} ∈ *X*^{*}(*t*), and suppose that *∂*_{t}*f*(*x*^{*}, *t*) exists. If *V* is differentiable at *t*, then *V*′(*t*) = *∂*_{t}*f*(*x*^{*}, *t*).

This result puts no constraint on *X*, which corresponds to $Q$ in the current work, except that its maximal member has a derivative at that point. Additionally, as noted in the work of Milgrom and Segal,^{78} this result is only useful if *V* is known to be differentiable. This is compatible with our *f*-divergence variational statement when considered in the context of a complete $Q$ and population averages but must, in general, be confirmed for each choice of $Q$. In situations where a closed form expression corresponding to the maximum is not known, constraints may be put on each member of $Q$ to ensure applicability. Suitable constraints may be found in the remainder of the work of Milgrom and Segal.^{78}

### APPENDIX B: MOMENTUM CONSISTENCY

The described approach to achieve momentum consistency requires that we put more specific constraints on $G$. This is needed due to our minimal strategy for providing sufficiency conditions for consistency: primarily, we utilize arguments in previous work to provide sufficient constraints. The resulting conditions given suffice for the case of virtual particles that are simply dropped from the system by $G$. Generalizations to linear mappings that share particles between sites can additionally be inferred. First, we discuss the approach of previous work on momentum consistency as is relevant to our work and then concisely give a route to momentum consistency.

#### 1. MS-CG

Generally, we will here assume that $M$_{r} satisfies specific properties. Once $M$_{r} is defined, we construct an appropriate $M$_{p}. First, $M$_{r} must be expressible in the following linear form, where $MrI$ denotes the *I*th particle entry of the output of $M$_{r}, *i* iterates over the particles contribute to site *I*, and *c* denotes positive constants:

As in MS-CG,^{15} we impose translational consistency,

From this, we allow $M$_{r} to imply $M$_{p} up to the factor of the CG masses ${MI}I$ as stated in MS-CG,

As before, this type of map transforms global consistency into constituent momentum and position space components, i.e.,

where the vector valued delta functions are understood to be products of scalar delta functions. If $M$ does not associate any individual atoms to more than a single CG site, then

with

We will additionally assume that analogous constraints are put on $Gr$ when considering momentum consistency below.

#### 2. Momentum consistency

Using these points, we now move forward directly discussing momentum consistency. As stated previously, by constraining $G$ and $M$ as above, and assuming the underlying systems are characterized by separable probability densities, we find

As a result, we split up our consistency statement (omitting arguments for clarity),

Configurational consistency is handled via divergence matching as described in the main article; we here consider momentum consistency algebraically,

We substitute these using two sets of properly designed CG masses, each set implied by a mapping operator and the masses at the resolution it maps,

The only solution is to set $MIG=MIM$ for each CG site *I*; in this case, find a set of equations implying consistency,

Note that these equations are still subject to the aforementioned constraints (positivity, etc.). This provides a simple condition connecting our FG masses, pre-CG masses, $M$, and $G$ and allows one to check for momentum consistency.

When considering a CG model with no pre-CG resolution, the FG mapping $M$ must associate each atom with at most a single CG site in order for the mapped momentum distribution to factorize with respect to each CG site. This is required for momentum consistency if the CG model is simulated using traditional molecular dynamics software as the momentum distribution produced by traditional molecular dynamics is necessarily factorizable. This same constraint to $M$ and $G$ is assumed in the preceding analysis, but this is not generally required for ARCG models as nonfactorizable momentum distributions may be produced by both $M$ and $G$. However, the analysis provided to illustrate momentum consistency is based on a generalizable strategy: previous approaches to momentum consistency which produced a closed form expression for a function proportional to the Boltzmann density of the mapped atomistic distribution may be extended to the current setting by simply calculating the density implied by both $M$ and $G$ and setting them to be equivalent. In this way, more sophisticated approaches such as the one in the work of Han, Dama, and Voth^{102} may be applied analogously to approach more complex mapping operators.

### APPENDIX C: LOSS DERIVATIONS

The basis of the duality central to *f*-divergences is translated from theorem 9 in the work of Reid and Williamson.^{66} The equations relating loss functions *l* from the combined loss $L\u0332$ may be confirmed via algebra after the two following identities are noted, both of which may be found in the work of Reid and Williamson:^{66}

The terms needed to define *l*_{ref} and *l*_{mod} for relative entropy are given as follows. First, note that the function generating the appropriate relative entropy is *x* log *x* (not log *x*). From this, we find (only in the case of relative entropy)

and

Through substitution, we then arrive at Eq. (34). A similar procedure may be used to emulate other *f*-divergences.

### APPENDIX D: NUMERICAL SIMULATION DETAILS

This appendix contains details of the molecular potentials used, the features used as input to the variational estimator, and the noise used to smooth the optimization.

#### 1. Spline potentials

The *b*-splines describing the potential used to approximate water used knots that were not spaced evenly. Instead, various uniform regions of high and low knot density were used. This was due to computational constraints on the current implementation used to numerically optimize the potentials, not limitations of the methodology itself. It was found that a high knot density was needed to capture the inner well of the potential. The knots used for the water potential were 0, 0.417, 0.833, 1.25, 1.67, 2.08, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.4, 5.8, 6.2, 6.6, and 7.0 Å. This corresponds to a higher density of knots near the inner well. In contrast, the methanol potential instead used uniform knots spaced from 0 to 10 Å.

#### 2. Variational features

This subsection describes the features used as input to the variational estimator. The single component LJ fluid and the integrated bonded particle used relatively simple feature sets, while the examples of the integrated binary LJ system, the approximated methanol system, and the approximated water system used a more complex feature set as input for the variational estimator. We here define the classes of features used and then describe the set used for each of those examples. These features are calculated on each frame to produce the input for the variational estimator.

The first class of features is defined as the framewise average of a power of the distances between all the particles in the system,

where *n*_{p} is the number of pairs in the system.

The second class characterizes the average local density of each frame. The local environment is characterized by passing the softened number of neighbors within a certain cutoff through a hyperbolic tangent function. Note that an offset and scaling factor is applied to this local density before the hyperbolic tangent is applied,

where *f* is the hyperbolic tangent, *g* is the logistic sigmoid, and *n* is the number of particles in the system.

The third class of features is given by calculating the statistic central to the empirical radial distribution: given a radial bin, it returns the number of particle pairs whose separating distance is in that bin,

The single component LJ system used *H*_{moment}(·, −6) and *H*_{moment}(·, −12). This set of features is sufficient to create a complete $Q$ as we are able to write the potential of both the reference and models systems as a function of it. The virtual bonded particle only used the distance between the two real particles as input; this is can be seen to be sufficient by considering the rotational and translational symmetry present in the system. The integrated binary LJ system and the approximated methanol system used the same set of features: this was composed of features from the 3 classes described above. The *H*_{moment} features were parameterized with 2, 4, 6, and 12. The parameterization of the *H*_{density} features is given in Table III. The *H*_{RDF} features were parameterized with 50 equally spaced bins from 2.5 Å to 10 Å. The featurization using the water example was identical except for the RDF features: in this case, they were parameterized 2 Å–12 Å with 100 bins. The extended radial features were due to the higher resolution knot density.

r_{cut} (Å)
. | a
. | b
. |
---|---|---|

4 | 0 | 2 |

4 | 1 | 2 |

7 | 7 | 2 |

7 | 9 | 2 |

10 | 9 | 2 |

10 | 11 | 2 |

r_{cut} (Å)
. | a
. | b
. |
---|---|---|

4 | 0 | 2 |

4 | 1 | 2 |

7 | 7 | 2 |

7 | 9 | 2 |

10 | 9 | 2 |

10 | 11 | 2 |

The neural network architectures used were simple feed-forward networks. Not including the input and output layers, the LJ example used 2 layers of 5 nodes, the virtual bonded site example used 3 layers of 10 nodes, and the binary LJ system used 4 layers of 15 nodes. The architecture did not have a noticeable effect as long as at least two layers were present.

#### 3. Noise

Noise was added to improve convergence of a variety of the numerical examples in this paper (all except the case of the virtual bonded particle). This is helpful in the cases examined when the distributions being optimized are highly dissimilar. The procedure used to apply noise is summarized as follows. First, a data set composed of the combined samples from both the reference and model trajectories is whitened to have a mean of zero and a standard deviation of one in each dimension. Gaussian noise of a specified variance with a mean of zero was applied to each dimension. This variance noise was geometrically decayed when the reported accuracy of the classifier (produced by optimization of the variational statement) was below a set threshold for a set number of iterations. The decay factor was set to 0.95–0.97 for the examples presented. Additional details can be found in the tests presented in the public code base.

## REFERENCES

We note that recent work^{104} has reinforced how macroscopic observable matching has unexpected challenges when trying to establish a firm microscopic connection to atomistic systems if care is not taken in choosing which observable form to optimize with.

Throughout this paper, we omit proportionality constants related to indistinguishability and unit systems (including the factors of Planck’s constant often introduced through quantum mechanical limits); reintroduction of these constants is straightforward.

$M$ is also typically^{15} additionally constrained such that the resulting coordinates are linearly independent and unambiguously associate at most one CG site to each atom. These constraints are mostly unimportant to the work at hand except for when momentum consistency is considered, but some care must be taken so that the corresponding densities exist.

These functions, as well as other functions throughout the paper, must be integrable with respect to the measures defining the model or reference distributions. We will informally refer to integrable functions as functions. We do not, however, assume such functions are continuous unless noted. Additionally, we will refer to functions which differ by measure zero as the same function when considering those which maximize expectation values.

We note that the *x* proceeding the log here effectively changes the distribution over which log is averaged; relative entropy traditionally averages $logpref(x)pmod(x)$ over *p*_{ref}.

We note that certain proper losses may only be defined on (0, 1], [0, 1), or (0, 1). This is partially reflective of the fact that certain *f*-divergences, such as relative entropy, are not always defined when the support of the corresponding densities is not the same. There exist cases such that the limiting behavior of such losses is still valid for distributions with differing support, such as the log loss. The optimization performed in this paper only compares distributions for which the KL divergence is defined and for which *η* ∈ (0, 1). The variational statements hold more generally; see the work of Reid and Williamson^{66} for more details.

In this form, the observables considered no longer must be dual to the potential form optimized. Additionally, if the number of force-field parameters does not equal the number of observables considered, the linear system typically used in IMC/NI optimization may be overdetermined or underdetermined, which implies that the Newton method used for optimization cannot be applied without modification. However, first order methods such as those used in this paper (RMSprop) remain valid options to minimize a scalar residual crafted from the vector observable typical to IMC/NI, e.g., through the vector’s 2-norm.

Convergence results exist for generative adversarial networks in the case where it is not assumed that the variational divergence estimation converges at each iteration; as an example, see the work of Heusel *et al.*^{105} However, such analysis moves beyond the divergence based interpretation presented and we reserve further discussion of this topic to future work.

It is important to note when using virtual particles that potentials which use linear parameterizations at the pre-CG resolution are not necessarily linear at the CG resolution.

Systems using harmonically bonded virtual particles have been considered in the past and under certain constraints can be shown to not affect the stationary distribution of the CG particles (see the work of Zwanzig^{106,107} and Davtyan *et al.*^{108}). The particles considered here do not satisfy these constraints.

^{*}CM1A-LBCC: Localized bond-charge corrected cm1a charges for condensed-phase simulations