Coarse-grained (CG) models allow efficient molecular simulation by reducing the degrees of freedom in the system. To recapitulate important physical properties, including many-body correlations at the CG resolution, an appropriate mapping from the atomistic to CG level is needed. Symmetry exhibited by molecules, especially when aspherical, can be lost upon coarse-graining due to the use of spherically symmetric CG effective potentials. This mismatch can be efficiently amended by imposing symmetry using virtual CG sites. However, there has been no rigorous bottom-up approach for constructing a many-body potential of mean force that governs the distribution of virtual CG sites. Herein, we demonstrate a statistical mechanical framework that extends a mapping scheme of CG systems involving virtual sites to provide a thermodynamically consistent CG model in the spirit of the principle of maximum entropy. Utilizing the extended framework, this work defines a center of symmetry (COS) mapping and applies it to benzene and toluene systems such that the planar symmetry of the aromatic ring is preserved by constructing two virtual sites along a normal vector. Compared to typical center of mass (COM) CG models, COS CG models correctly recapitulate radial and higher order correlations, e.g., orientational and three-body correlations. Moreover, we find that COS CG interactions from bulk phases are transferable to mixture phases, whereas conventional COM models deviate between the two states. This result suggests a systematic approach to construct more transferable CG models by conserving molecular symmetry, and the new protocol is further expected to capture other many-body correlations by utilizing virtual sites.

## I. INTRODUCTION

Coarse-grained (CG) models enable one to extend the temporal and spatial scales of molecular simulations by averaging out the finer details below the resolution of the CG site (or “bead”).^{1–7} While computationally efficient, some detailed configurations, or spatial correlations, are lost during the CG model construction and resulting simulation. Thus, an appropriate CG mapping from the fine-grained (FG), e.g., atomistic, degrees of freedom to the CG model is a prerequisite for parameterizing CG models that can faithfully represent the physical properties of the system of interest using the reduced CG degrees of freedom.^{8–13}

A mapping operator represents the correspondence between the FG configurations **r**^{n} to the mapped CG configurations **R**^{N}. Since there are no general or definitive rules for mappings, various operators have been suggested and implemented for different systems. As one example, the MARTINI force field has a straightforward mapping rule that is to approximately map four heavy atoms with associated hydrogen atoms to a single CG bead.^{14–16} On the other hand, CG models can be constructed from the dynamical information inferred from the essential dynamics of all-atom simulations. One such approach is known as Essential Dynamics Coarse-Graining (ED-CG), which characterizes and groups atoms sharing similar dynamical behavior, such as fluctuations in the low-frequency regime.^{17–21} Similarly, other approaches, the diffusion map and sketch map approaches, have also been developed to extract relevant low-dimensional dynamic information by reducing dimensions in a nonlinear sense.^{22–26}

In a more physically transparent manner, a group of atoms can be mapped into their center of mass (COM) as the CG variable. As an example, the multiscale coarse-graining (MS-CG) approach^{8–12} has been applied to various scales of molecular systems ranging from condensed matter (liquids) to biological systems (such as lipids) using COM mappings. In such cases, the equilibrium distribution of the CG variables should be consistent with that of the reference FG statistical ensemble (known as the “consistency” relationship) for those CG mappings, which establishes an equivalence between the force acting on a CG site and the summation of exerted forces on the underlying FG atoms. A derivation of this consistency condition is detailed in Ref. 11. However, in some cases, the COM mapping scheme might not be a good choice to thoroughly represent and reproduce certain physical properties of the system. Since a simple two-body interaction between the COM distances might not be enough to fully account for the many-body effects in condensed phases, various CG models have been suggested to include higher order interactions, which significantly affect the computational efficiency.^{27–32}

Alternatively, one can improve CG mapping while only using the pair interactions between the CG sites. One example is the center of charge (COC) mapping that provides a better description of Coulombic interactions for molecules where the conventional MS-CG model based on COM configurations is not enough to capture long-range interactions.^{33} However, two drawbacks still remain in the COC mapping. First, the COC mapping assumes that the mapping weighting factor of the COC CG beads is proportional to atomic charges within the subgroups. In turn, it does not provide a rigorous thermodynamic relationship between the suggested CG mapping and the FG system. Also, this approach is most valid if the system has highly charge-separated groups, such as in polar molecules.

In addition, one of the basic limitations of conventional CG models is that each CG site is generally treated as a spherically symmetric particle, resulting in discrepancies when asymmetric molecules are mapped to single (or multiple) sites. For example, we recently demonstrated that the symmetry mismatch between the CG and FG beads leads to the failure of MS-CG models in interfacial systems where the symmetry or orientation plays an important role.^{34} One might instead use the Gaussian overlap formalism^{35} or Gay-Berne potential,^{36} which have been extended to parameterize CG interactions,^{37–39} or explicitly include dipole moments,^{40–42} respectively. However, these strategies often require expensive computational overhead to both parameterize and utilize such CG models in order to maintain high accuracy with faithful reproduction of the FG structure.

To overcome the aforementioned discrepancies in symmetry and to provide an efficient and accurate CG interaction, the present work presents a center of symmetry (COS) CG mapping that is designed to preserve the symmetry of the molecule in CG configuration space by introducing virtual CG sites. Here, “virtual” sites are defined as fictitious particles that are introduced, in contrast to other mapping schemes (e.g., COM) that rely on physical atoms. A straightforward example to demonstrate this mapping is benzene, which has a planar geometry. In order to preserve the planar symmetry, the COS CG mapping yields two virtual CG sites that are normal to the aromatic plane, introducing additional degrees of freedom (rather than only using their pairwise distances as order parameters) that can provide higher quality basis sets for elucidating a CG many-body potential of mean force (PMF). This approach provides a simple way to represent the local many-body correlation of nonspherical molecules with marginally more computational cost. We further envisage that our proposed CG model will enhance the transferability of the effective CG interactions,^{5,43} which will be discussed later.

In order to represent the effective interactions between the virtual CG sites correctly, we require a rigorous statistical mechanical approach to construct the PMF describing the distributions of virtual CG sites. In this article, we will develop a systematic CG theory involving virtual sites for the first time. The thermodynamic consistency condition is first derived at the resolution of the COM variables. A bottom-up parameterization protocol utilizing the force-matching equations is then introduced based on these conditions. It is demonstrated that the MS-CG method can be extended in this framework with minimal modification. Our approach is expected to facilitate the modeling of complex biomolecules such as lipids and membranes by adopting the symmetry or shape from the all-atom resolution into the FG to CG mapping operator. This is in line with shape-based CG^{44,45} or related *ad hoc* CG approaches^{46,47} that have been applied to certain biosystems.

The remaining sections of this paper are as follows: In Sec. II, we first address the CG mapping involving virtual sites and derive a thermodynamic consistency relationship based on the principle of maximum entropy. Then, we construct and review the CG models for benzene and toluene. Practical implementation details for the CG simulations are also given. The proposed CG mapping is next evaluated in Sec. III by analyzing the structural properties including radial, orientational, and three-body correlations for benzene derivatives. We also present an analysis of transferability by comparing the effective CG interactions from bulk and mixed phases at the end of Sec. III. Section IV provides a summary and concludes the paper.

## II. METHODOLOGY

### A. COS mapping

As discussed in Sec. I, a CG model is defined by a set of mapping operators $MRN:rn\u2192RN$. These mapping operators are conventionally designed as a linear function to most readily capture the physics of the FG system at the level of resolution that one would like to focus on. For instance, the COM mapping operator captures the translational motion of the FG particles, while at the same time coarse-graining out the internal rotational and vibrational motions. Similarly, the center of geometry (COG) mapping operator constructs a representation of the geometric center of FG particles. However, in the COS mapping, the CG procedure involves two sets of linear mapping operators in the following forms: one for the CG mapping from the FG trajectory onto the COM representation with $R\alpha COM\u2261R\u0303\alpha $ and the other for the reverse mapping from the COM to the virtual CG bead positions, respectively,

These mappings can be written in a more compact form as

A schematic description of the proposed mapping relationship is shown in Fig. 1. The reverse mapping function is carefully designed to manifest intrinsic symmetrical features, such as the dipole moment vector of polar molecules or the normal vector of planar molecules (e.g., benzene), of the underlying FG system that is integrated out in the COM level of description. Detailed design and setup of the COS model will be addressed in Sec. II D.

Unlike the conventional COM and COG mappings, the physical meaning of CG beads in the COS mapping does not directly correspond to a concrete physical entity. The center of mass of the COS CG beads represents the most resolved physical entity, whereas the COS CG beads themselves are just virtual particles characterizing certain symmetrical features of the underlying FG molecules. We will show in Sec. III that the presence of these virtual CG beads helps to characterize the complicated many-body orientational correlations of the FG system in a more intuitive and computationally efficient way.

It should be noted that many-body correlations can be intrinsically captured by the COM mapping when appropriate basis sets are utilized (e.g., containing terms beyond pairwise two-body). However, since information on the orientation of CG beads is missing in the case of simple pairwise interactions, more complicated functional forms for CG interactions are often necessary to fully take this many-body effect into account.^{27–32} Embedding this complexity can introduce challenges for both the parameterization and propagation of the resultant CG models. However, the aforementioned shortcoming is alleviated in the COS mapping approach since the proposed reverse mapping explicitly embeds the orientational information into the virtual CG beads.

### B. The thermodynamic consistency conditions

With the CG mapping operator in hand, the “perfect” CG system is required to behave exactly like the FG system at the resolution that the CG system was constructed. To reiterate, the so-called structural thermodynamic consistency is a sufficient condition for the CG model to be consistent with the corresponding atomistic model in configuration space. Thus, in order to ensure the so-called structural thermodynamic consistency, a quantitative relationship between the CG and FG representations should first be developed. This consistency requires that at the resolution of the “center of masses,” the FG model **r**^{n} and COS CG model **R**^{N} generate the same statistical distribution (we focus on the consistency in configuration space only). Suppose that the probability distribution function for the COM variables is known. In order to ensure the thermodynamic consistency conditions at the resolution of the COM, the COS CG and COM distribution functions must be connected

in which the delta function is a shorthand notation for a product of delta functions of each CG center of mass, i.e.,

To infer the COS CG distribution functions based on the above consistency condition, one can maximize the Gibbs-Jaynes entropy of the COS CG systems with respect to the distribution function

under the constraint of Eq. (3) as well as the normalization condition $\u222bdRN\rho CG(RN)=1$. By introducing the Lagrange multiplier function *λ*(**R**), one can maximize the entropy functional by solving an unconstraint optimization of the following Lagrangian $L\rho CG(R),\lambda (R),\lambda 0$ using the calculus of variations

The result from Eq. (6) gives

in which the measure $\Omega M\u0303(RN)$ is defined as

The measure function $\Omega M\u0303(RN)$ represents the number of states for the COS CG system given a specific center of mass configuration. In general, this function is dependent on the COM configurations reflecting the geometric features of the mapping operator $M\u0303$. In the case of linear mapping operators, the measure can be evaluated as a constant independent of specific COM configurations.

On the other hand, as has been previously shown in prior work,^{11} the center of mass distribution functions can be evaluated from the FG atomistic distribution functions through

in which the Dirac delta function is similarly defined as $\delta R\u0303I\u2212Mrn=\u220fI=1N\delta R\u0303I\u2212MIrn$. Combining Eq. (7) with Eq. (9), we obtain

which provides the least biased estimation of the COS CG distribution functions based on the atomistic statistical distributions.

More specifically, if the mapping from the COS sites to the COM site is linear, the number of states measure becomes a constant. In the scope of this paper, we choose to map two COS sites onto their COM under equal weights, and thus the measure function will have the form $\Omega M\u0303(RN)=2V\xd1$. Finally, we obtain the thermodynamic consistency condition relating the probability distribution functions for the atomistic and COS CG systems writing as

A detailed derivation for the measure function as well as an alternative approach to derive Eq. (11) from the observation consistency is discussed in Appendix B.

Based on Eq. (11), supposing that the FG and CG systems all obey classical Boltzmann statistics, their effective Hamiltonians are linked by

in which *u*(**r**^{n}) denotes the potential energy of the FG system, whereas *U*(**R**^{N}) represents the CG counterpart. By taking a natural logarithm on both sides of Eq. (12), the effective CG potential is expressed as

Once the effective CG potential function is defined, the force exerted on the CG bead *I* as a function of the CG system configuration **R**^{N} can be obtained by differentiating the potential energy profile, namely,

In the above derivation, we have assumed that each COS CG bead is involved in the only one particular definition of the COM variable. The same assumption is also applied to each FG atom with respect to COM variables.

To simplify the force expression presented in Eq. (14), one can note that the following relationship holds due to the linear property of both the “forward” and “reverse” CG mapping operators

in which the set of coefficients {*d*} satisfies the following condition from the definition of a CG mapping

Readers can refer to Appendix C for the details. Utilizing the above condition and integrating by parts, Eq. (14) can be further simplified; the CG effective force field that ensures the thermodynamic consistency can be written as a constrained ensemble average

where the constrained ensemble average of a function $grn$ takes the following form:

In the case of linear CG mapping operators, we can choose the coefficients {*d*} so that the following condition holds:

The condition of Eq. (16) still holds since translational invariance requires that the mapping coefficients {*c*} sum to unity for each of the COM configurations. By putting all the previous equations together, a thermodynamically consistent force field of the COS CG system can be expressed as

in which the COS CG particle *I* is specific to the COM denoted by *α*. The physical meaning of this equation is that the total force on the CG bead *I* is a constrained ensemble average of a partition of the total atomistic forces exerted on the center of mass. This relationship further implies that for any CG mapping that accompanies the virtual CG sites as an explicit coordinate with respect to its corresponding COM position, the thermodynamic consistency condition is valid as long as the configurations of the virtual CG sites can be written as a linear combination of COM CG sites. This equation forms the basis of the variational principle that will be employed in this work to extract the CG effective force field.

### C. Force-matching

Once the CG mapping operator is defined, an atomistic simulation (FG) trajectory or other atomistic configurational information can be readily mapped into a CG representation that preserves all of the structural correlations. The next goal is to derive the equation of motions for the CG system using mapped information. In other words, the thermodynamic consistency conditions derived in Subsection II B provide the theoretical basis for the inference of the CG effective interactions. According to Eq. (20), the underlying force field describing the CG system should, in principle, be extracted by calculating the constraint ensemble averages. However, the numerical implementation of such an approach is quite challenging due to the difficulty in handling the constraints as well as the high dimensional nature of many-body systems. Practically designed variational principles based on the consistency conditions can be formulated in order to construct numerical approximations to the thermodynamically consistent CG force fields. The MS-CG method^{8,9,11,12} built upon the force-matching equation as well as the Relative Entropy Minimization (REM) method^{6,48–50} are two examples of such approaches.

In this work, the CG force field is parameterized utilizing the MS-CG method. Specifically, the CG force field is determined by variationally minimizing the following force residual functional:^{8,9,11,12}

For consistency with the original MS-CG method,^{8,9,11,12}$FI(rn)$ in Eq. (21) denotes the microscopic force on the CG site *I* and $fI(\u03c2I(rn))$ is the model parameter given by the CG mapping $\u03c2I(rn)$. In general, the above functional optimization problem in an infinite dimensional space is computationally forbidding. For practical implementation, the test CG force field function $fI(\u03c2I(rn))$, or $fI(RN)$, is often expanded in a subspace of the infinite dimension space, which is constructed by a set of basis functions. Such basis functions are usually determined based on some degree of physical intuition about the system of interest. Many of these basis functions can be expressed as key variables of the system, such as the pair distance and triplet angle for nonbonded and bonded atoms, respectively. In this work, we assume that the CG force field only includes contributions from pair nonbonded interactions to achieve a physically accurate and efficient CG model. Under this assumption, the CG force field in this work takes the following form:

The basis function *ϕ*_{2}(*R*_{IJ}) is approximated by a set of spline functions {*u*_{k}}, i.e.,

The set of coefficients {*c*_{k}} parameterizing the pairwise basis function are then solved by minimizing the residual functional, which is now cast as a quadratic function through the above approximations, in the least square sense. Readers are referred to Ref. 51 for the detailed analysis of different solvers, regularizations, and spline function specifications of the MS-CG algorithm.

### D. Model description and system setup

In this subsection, we will introduce the COS and conventional COM mapping operators for benzene and toluene systems in the liquid phase. First, we reiterate that benzene as a model system was chosen to test the COS mapping in the case of nonspherical symmetry because benzene molecules exhibit a planar structure due to the sp^{2} hybridization throughout the aromatic ring. Figure 2 shows three possible CG mapping schemes for a benzene molecule. We did not consider CG models with more than two sites per molecule to maximize the efficiency of the mapping (i.e., resolution) and computational cost.

One can intuitively map a molecule into a single CG bead corresponding to the COM of the whole molecule, as shown in Fig. 2(a). However, as mentioned in Sec. I, a benzene model consisting of CG beads governed by spherically symmetric interactions is not able to capture the correct planar symmetry of the original molecule at the FG resolution. As a result, the single-site COM model only preserves the overall translational motion while losing local orientational correlation. In particular, the single-site model is not able to distinguish between two possible configurations for a pair of benzene molecules at the same intermolecular distance, e.g., the T-shape and sandwich shape configurations, which have different energetics.^{52–57} A detailed analysis of the discrepancies of single-site CG models for different geometries is also discussed in Ref. 34.

Alternatively, Fig. 2(b) shows a two-site COM CG model where two adjacent C_{3}H_{3} moieties are coarse-grained to two CG sites at their respective COM and connected by a harmonic bond (denoted as a gray line). Even though this model incorporates the planar geometry to the previous single-site COM model and removes the implicit isotropy, the rotation of the benzene ring along the vector between the two COM sites is degenerate, and thus we expect a poor description of the planar symmetry. Specifically, with regard to the aforementioned dimer configurations, the two-site COM mapping is not able to distinguish between stacking or relatively tilted configurations.^{55} While maintaining the same CG mapping resolution as Fig. 2(b), one could perform the COS CG mapping to preserve the planar symmetry. A schematic description of mapped COS beads is shown in Fig. 2(c). As discussed in Sec. II, the COM of the COS system is equivalent to the COM of the benzene molecule, which is shown as a green bead in Fig. 2(a). To satisfy this condition, the two COS beads must be equidistant from the COM, giving *χ*_{αI} = 0.5. Therefore, from Eq. (20), the effective CG forces acting on the COS CG bead *I* in the configuration **R**^{N} are simply expressed as half of the total force that is exerted on the molecule

This model resembles the charge separation model suggested by Hunter and Sanders, in which the *π*-*π* interactions are modeled by placing particles along the normal vector of the plane. However, our CG model provides nonbonding interactions in the CG configuration space where the molecular symmetry is preserved, while their model is a simple electrostatic model that incorporates charge interactions to represent the *π*-system.^{58} In terms of the force-matching procedure, our intent here is to include an additional basis set that can be conceived as a projected quadrupole moment of benzene to correctly reproduce the CG interactions.^{59–62}

In practice, the COS mapping *ς*:**r**^{n}→**R**^{N} is performed by first constructing the normal vector of the benzene plane $d\u2192\u2225I=(rC3I\u2212rC1I)\xd7(rC5I\u2212rC1I)$ and the COM configuration of the benzene molecule *I* which is denoted as $R\u0303I$. Then, two virtual CG sites are constructed as $(R\u0303I\u2212u\u2192\u2225I\u22c5l,R\u0303I+u\u2192\u2225I\u22c5l)$, where *l* is the center-to-particle distance or the separation distance with the normalized vector $u\u2192\u2225I=d\u2192\u2225I/d\u2192\u2225I$. The thermodynamic consistency relationship derived in Sec. II does not confine the value of the separation distance *l*, but practical limitations and constraints should be considered. As designed to impose planar symmetry and represent *π*-*π* interactions, the value of *l* should be chosen to enable efficient sampling of the intermolecular region between COS CG sites that are not in the same molecule. Based on the reported minimum at 3.6 Å from the dimer potential energy surface using the OPLS force field,^{57} *l* must be smaller than $R\u0303IJ\u22123.62$ Å. Since the first coordination shell of liquid benzene is positioned near 6.6–7 Å,^{34,63} this rule of thumb yields an upper bound of (6.6 − 3.6)/2 = 1.5 Å in order to correctly represent structural correlations of benzene. If the value of *l* is too large, the COS CG model may sample unphysical configurations in which benzene molecules penetrate each other. On the other hand, consider the opposite limit when *l* approaches zero as expressed below

In other words, as the separation distance approaches zero, the COS sites reduce back to the COM site. Conversely, this limit clearly demonstrates that the COS mapping breaks rotational degeneracy by introducing a normal vector to the single-site COM configuration. We checked that at very small *l* < 0.05 Å, structural properties of the COS mapping converge to that of the single-site COM model (see Fig. 11 in Appendix A). Considering the above-mentioned points carefully, the *l* value was set to be 0.5 Å so that the separation between two COS beads is 1 Å.

In a similar vein, we constructed CG models for toluene; recall that toluene is equivalent to benzene except where an –H atom is substituted for a –CH_{3} group. As a consequence of the broken C_{6} symmetry, the single-site COM CG bead is located slightly off center of the aromatic ring, as depicted in Fig. 3(a), and thus the CG sites in the two-site COM CG model have different types of interactions [Fig. 3(b)]. Even in this case, we are able to utilize the proposed COS mapping scheme because the COM of the molecule is still within the aromatic plane. In other words, the mapped CG beads are placed equidistant and perpendicular to the toluene plane, and the resultant CG beads are homogeneous as designed. We will examine the effect of C_{6} symmetry breaking in Subsection III B.

### E. Simulation details

Each reference atomistic system for the two liquids (i.e., benzene or toluene systems) is comprised of 1000 molecules in a cubic box with periodic boundary conditions. For all liquid molecules, the optimized potentials for liquid simulations all-atom (OPLS-AA) force field was chosen as a force field to conduct the FG (all-atom) simulation.^{64,65} We also used the Ewald summation to account for the long-range electrostatics at the FG resolution.^{66} All MD simulations including both all-atom and CG models were conducted using the LAMMPS simulation package.^{67} To prepare the simulations, initial configurations for the liquid system were constructed by randomly distributing molecules using the Packmol package,^{68} and then, the VMD package^{69} was used to build the bonding topology for the systems. After energy-minimization using the steepest descent algorithm, each system of interest was annealed to a temperature *T* = 300 K with a Nosé-Hoover thermostat with *τ* = 0.1 ps.^{70,71} Constant *NPT* dynamics with *P* = 1 atm with *τ* = 1 ps were performed for 1 ns in order to equilibrate the system.^{72} Finally, we further sampled each system using constant *NVT* dynamics at *T* = 300 K with a Nosé-Hoover thermostat for 5 ns to collect the configurations for structural analysis every 0.5 ps (same damping constant was also used for *NVT* dynamics).

The initial configurations for force-matching were prepared from snapshots of the FG trajectories during *NVT* runs. The effective CG pair potentials from force-matching were obtained as a linear combination of sixth-order B-splines where the resolution of the spline was fixed to 0.2 Å. Due to statistical sensitivity and poor sampling, polynomial terms of *A***r**^{−b} form were fit after the B-spline fitting at the hard-core region of the potentials.^{73} For all CG models, we used an outer cutoff of 10.0 Å. The inner cutoff parameters vary for each CG model. For example, the three different inner cutoff values that correspond to the one-site COM, two-site COM, and two-site COS mapped CG models for benzene were 3.33 Å, 3.04 Å, and 2.40 Å, respectively. The parameterization of the CG interactions was conducted using the MS-CG program package, which is available at the LAMMPS or Github webpages free of use.^{74}

The CG simulations were then performed for 5 ns using constant *NVT* dynamics at *T* = 300 K with a Nosé-Hoover thermostat.^{70,71} The initial configuration of the CG simulation was mapped from the last snapshot of the constant *NVT* runs from the all-atom simulations. For propagating the COS CG model, the bonds between virtual CG sites within the same molecule were constrained via the SHAKE algorithm^{75} to maintain their relative spacing. However, we note that the proposed approach does not necessarily require any constrained or fixed geometry.

### F. Velocity distributions

In Sec. II D, we discussed the mapping between atomistic and CG virtual site representations in configuration space. In this subsection, we would also like to explore the mapping relationship between the velocities of atomistic and CG virtual representations. It should be noted that the analysis in this subsection does not affect any of the statistical behavior in configuration space that we demonstrated earlier, which is of our primary focus.

By design, the mapping operators between atomistic and COM representations, *M*(**r**^{n}), and the mapping operators between the COM and CG virtual sites representations, $M\u0303(RN)$, are linear. However, the direct mapping from atomistic to CG virtual sites representation is nonlinear with an explicit dependency on configuration. In the case of benzene, for example, this dependency could be the orientation of the benzene ring. Hence, the velocities of the virtual sites are a superposition of the COM velocities, $R\u0307Ic$, and a line speed of the benzene ring rotation, $u\u0307||I$, expressed as $R\u0303\u0307Iu=R\u0307Ic+u\u0307||Il$. Since the nonlinearity complicates the CG distribution functions in momentum space, two important questions arise: (i) are the velocities still normally distributed and, if so, (ii) is the temperature of the CG system evaluated using the total atomic mass corresponding to the CG bead intuitively consistent with the temperature underlying the atomistic system?

To investigate these two points, we first formulate the velocity mapping relationship. By taking the first order time derivatives of the configuration mapping relationship, we obtain the velocity of virtual CG beads

In Eq. (26), matrix $B\u0303$, which is a function of the FG configurations, defines the velocity mapping relationship between the CG virtual sites and the underlying atomistic system.

With this definition, the second moment of CG velocity distribution can be estimated as

in which **Z** represents the partition function of the atomistic system. In the last line of Eq. (27), the shorthand notation for the CG constrained ensemble average takes the form

The above result from Eq. (27) suggests that if the mass of the virtual CG beads is defined as

then the temperature consistency between the virtual CG sites and the atomistic representation of the underlying system is preserved. Therefore, *M*_{T}, which we denote as the thermomass, should be our choice for the mass of the CG beads in the following parameterization and CG system propagation processes. In Appendix D, we numerically examined the velocity distribution and system temperature evaluated from the thermomass using the benzene system. The virtual site velocities are indeed normally distributed in this case, and the choice of thermomass quantitatively preserves the system temperature.

## III. RESULTS AND DISCUSSION

### A. Benzene

For benzene, we performed and compared the aforementioned two-site CG models constructed by two different CG mapping schemes: based on the center of mass (COM) or the center of symmetry (COS). The comparison in this section was mainly held on the CG models with the same resolution in order to fix and minimize the auxiliary impact caused by the resolution on the quality of CG model.^{13,76,77} Figure 4 shows the effective pairwise CG interactions for both CG mapping schemes.

In Fig. 4, it is clear that the two CG types show identical interactions for both COM (solid lines) and COS (dashed lines) CG models since the COM mapping is applied to equally coarse-grained –C_{3}H_{3} moieties, while the COS mapping is also designed to map symmetric virtual sites. A notably promising feature in the COS CG interactions is a cusp-like point at short distances (near 4 Å), and this cusp from the pair potential is attributed to the sign changes of effective forces between CG sites, indicating the existence of a locally favorable structure in this region. Namely, this feature suggests an ability to reproduce stacking behavior.

Next, we calculated the structural distribution function of the CG sites to compare the quality of the two CG models. To do so, we remapped the CG particles to a single-site CG model (i.e., one bead represents one molecule) to compare the distributions with respect to the center of benzene molecules as described in Fig. 5. Figure 5(a) displays molecule-wise radial distribution functions (RDFs) from the mapped all-atom (red line), COM (blue line), and COS (green line) models.

It is immediately apparent that the two-site COM CG models do not correctly reproduce the structural behavior from the FG resolution, especially at the first minimum near 8 Å, indicating that the pair structures within the first coordination shell, which might be affected by *π*-*π* interactions, are not captured correctly.

In order to investigate the orientational order throughout the fluid, we also calculated the orientational distribution function (ODF), which is related to the pair orientation. Here, the ODF is defined as^{78}

where **u**_{I} and **u**_{J} denote the unit vector of molecules *I* and *J*. This ODF was introduced and applied to a system of biphenyl rings.^{78} While the choice of **u** vector is system dependent, it is intuitive to use the vector between the two CG sites that are within the same molecule in the COS mapping. By definition, perfectly aligned molecules at a certain distance have an ODF value of unity. By contrast, the ODF value converges to 0.5 when the system is randomly oriented. The ODF calculated from the COS CG simulation is shown in Fig. 5(b) and compared to that from the remapped FG trajectory. The COM model is not shown in Fig. 5(b) since the vector between the COM CG sites lies in the aromatic plane and is, thus, irrelevant with respect to the orientational motions. The ODF calculated from the mapped all-atom configuration demonstrates a similar profile with that from the COS CG model, suggesting that the COS mapping also captures the orientational correlations as expected. Furthermore, when the pair distance is near 6 Å, both ODF functions have a value near one, indicating that neighboring benzene molecules are highly aligned in a *π*-stacked configuration.

Finally, we examined the higher order correlations of the system by calculating the three-body distribution functions. As discussed earlier, we expected the COS CG model to correctly recapitulate the three-body correlations due to an orientational preference of the benzene ring. In practice, we considered the molecule-wise triplet of the CG systems within the first coordination shell *r*_{cut} = 8.0 Å. In other words, the three-body distribution function *P*(*θ*), i.e., angular distribution function (ADF) in this case, is a normalized histogram over triplets of remapped single-site CG sites *I*, *J*, *K* in a similar manner to Fig. 5(a)

It is clear from Fig. 5(c) that the COS mapping is able to precisely reproduce the three-body structure of the benzene system within the first coordination shell.

### B. Toluene

For toluene, the effective CG interactions illustrated in Fig. 6 denote that the two CG types in the COM model are no longer degenerate (due to the substitution of a –CH_{3} group), while the two COS CG sites still show identical behavior.

The different potential profiles observed from the COM model appear to originate from size effects or steric interactions. In toluene, the CG bead corresponding to type 1 (–C_{4}H_{5}) is larger than that corresponding to type 2 (–C_{3}H_{3}). Likewise, the solid lines in Fig. 6 where the 1-1 interaction shows the strongest repulsion at short distances exhibit an analogous trend. By using Lennard-Jones like interactions as an exemplary model, this behavior resembles the following situation: *σ*_{11} > *σ*_{22} and *ε*_{11} ≈ *ε*_{22}, where *σ*_{12} is in a close agreement to the combined value from the general Lorentz-Berthelot rule.^{79,80}

Likewise, the molecule-wise RDF, ODF, and ADF that correspond to the COM of toluene are calculated in Fig. 7. As expected, the COM model yields a higher *g*(*r*) value at the first minimum near 8 Å [lower inset, Fig. 7(a)] that varies from the mapped all-atom structure, while the COS model is able to correctly recapitulate the FG structure. Similarly, ODF profiles of the mapped all-atom and COS models are in agreement, as seen in Fig. 7(b). ADF functions with a cutoff distance at the first coordination shell *r*_{cut} = 8.4 Å further demonstrate that the COS model also recapitulates the trend seen in the mapped all-atom model. Compared to benzene, the deviations of the COM model from the FG structure is smaller, and this can be understood due to the different CG types (–C_{4}H_{5} and –C_{3}H_{3} beads) in toluene. Thus, the center of molecule slightly deviates from the center of the ring, resulting in less degenerate configurations. Despite such small differences, the COS model still corrects the angular profile, especially at 70°–110°. Altogether, our designed model faithfully reproduces both the radial (two-body) and many-body correlations from the atomistic model.

### C. Benzene and toluene mixtures

As a final example, we constructed a mixed system of benzene and toluene molecules. We hypothesized that the CG interactions from the COS mapping would be more transferable between bulk and mixture phases of benzene and toluene. Our reasoning is that the CG interactions in this system are based on an aromatic ring as a structural motif. Therefore, we envisaged that the COS CG interactions are invariant under different phases while reproducing the correct structural properties since the COS mapping conserves the symmetry of each molecule. The reference FG system was composed of 500 benzene and 500 toluene molecules where the initial structure for the FG simulations was generated by the same protocol described in Sec. II. Simulation details were the same as the bulk cases including the use of the OPLS-AA force field and thermostat/barostat conditions. During the constant *NVT* simulation, the equilibrated box size of the mixture system was 55.38 Å. The final snapshot of the all-atom configuration is shown in Fig. 8(a), and the corresponding final CG configuration using COS mapping is depicted in Fig. 8(b) with four different CG sites (red/blue beads for benzene and orange/green beads for toluene).

As expected from the shared aromatic (phenyl) ring motif, we observed that the two molecular components did not create a segregated phase. From the simulated CG trajectories, we first calculated the moleculewise RDF by remapping the two-site COS and COM models to a single COM site. Distributions of the three molecule pairs (benzene-benzene, benzene-toluene, and toluene-toluene) are demonstrated in Fig. 9. We also included a single-site COM CG model to compare the quality of the CG model with different resolution and purely spherical symmetry.

As anticipated, the CG sites via the COS mapping provide correct molecule-wise pair structures even in a mixed phase. While the two-site COM models deviate from the all-atom case at the first minimum of the RDFs, the COS models correctly capture this feature. Interestingly, this region is well described by the single-site COM model. However, the single-site model gives a poor description of the first maximum with a strongly over-structured peak, presumably arising from the mismatch of symmetry. Furthermore, by comparing the general shape of the RDFs from the COS mapping to that of bulk liquids (Figs. 5 and 7), one might expect the transferability of the CG potential from the bulk phase to the mixture phase, following Henderson’s theorem^{81} and the basic concept of iterative Boltzmann inversion.^{82} We note that solving the MS-CG equation does not guarantee exact pairwise structures but provides CG interactions that satisfy the generalized Yvon-Born-Green (YBG) equations.^{10,83,84} The transferability of the CG interactions for different mapping schemes was examined by comparing the effective CG interactions between the mixed system and bulk liquids, as summarized in Fig. 10. The COM interactions shown in Figs. 10(a) and 10(b) are obtained from the single-site COM model, while the COS interactions shown in Figs. 10(c) and 10(d) are from the site-site interactions due to the homogeneity of COS CG types.

Clearly, the COM interaction is not as transferable between bulk and mixture phases as seen by the different minimum predicted in the two phases. This trend is in accord with previously reported differences in pair potentials between bulk and mixed phases using single-site MS-CG models.^{31,34} To reiterate, performing the MS-CG force-matching procedure in the mixed case does not guarantee identical pair interactions as that of the bulk even though the single-site CG model nicely reproduced the first minima in the RDF of both phases. Interestingly, almost an exact agreement between the interactions from the bulk and mixture phases is identified from the COS models [Figs. 10(c) and 10(d)], thereby confirming our hypothesis about enhanced transferability.

In order to further examine the transferability of the CG interactions, we introduce a statistical measure to quantify the similarity of the many-body PMFs from bulk and mixture phases. Previously, we utilized conventional distance metrics such as the Euclidean distance between the PMFs $W1(RN)\u2212W2(RN)=\u2211I,JW1(RIJ)\u2212W2(RIJ)2$ or the distance from smoother manifold by integrating the PMF along the pair distance $\u222b\u221ercutW1(r)dr$ corresponding to the area under curve.^{34} These distance metrics can be easily utilized to evaluate the similarity of different PMFs within the same mapping scheme such as a comparison of COM CG models between bulk liquid and interface. However, these distance metrics are not applicable to compare the PMFs from different mapping schemes due to different interaction ranges as well as lacking in physical interpretation of the CG system. As a universal metric that can be applied to different mapping schemes (COM and COS) and systems (bulk and mixture), here we utilize the relative entropy $Srel$, also known as the Kullback-Leibler divergence,^{85} to characterize differences in CG interactions between different CG systems

As an analogy to the systematic CG approaches by minimizing the relative entropy,^{6,48–50}$Srel$ provides information loss incurred upon different system (or state points). $Srel$ values for different mapping schemes are readily obtained by $Pbulk(r)=exp(\u2212\beta Wbulk(r))\u222bdr\u2061exp(\u2212\beta Wbulk(r))$ and $Pmix(r)=exp(\u2212\beta Wmix(r))\u222bdr\u2061exp(\u2212\beta Wmix(r))$. By its form given in Eq. (32), it is clear that the relative entropy is not confined in different interaction ranges because its contribution is normalized in the $lnPbulk(r)Pmix(r)$ term. Therefore, comparing the relative entropy for different mapping can be a potentially universal way to score the transferability of a CG model in different state points. Table I shows the relative entropy values of benzene and toluene molecules $Srel=DKLPmix(r)||Pbulk(r)$ from bulk to mixture.

. | Relative entropy values . | ||
---|---|---|---|

Type . | COM ($SrelCOM$) . | COS ($SrelCOS$) . | $SrelCOM/SrelCOS$ . |

Benzene | 4.428 × 10^{−3} | 4.294 × 10^{−4} | 10.31 |

Toluene | 1.345 × 10^{−3} | 9.847 × 10^{−5} | 13.66 |

. | Relative entropy values . | ||
---|---|---|---|

Type . | COM ($SrelCOM$) . | COS ($SrelCOS$) . | $SrelCOM/SrelCOS$ . |

Benzene | 4.428 × 10^{−3} | 4.294 × 10^{−4} | 10.31 |

Toluene | 1.345 × 10^{−3} | 9.847 × 10^{−5} | 13.66 |

In contrast to the COM mapping, the COS mapping yields much smaller relative entropy values for both benzene and toluene systems. In a quantitative manner, the COS model shows 10-13 times closer agreement, confirming a better transferability between bulk and mixture systems. To clarify, the additional basis (in this case, the normal vector of the aromatic ring) increases the ability to express the effective CG interaction via COS mapping. Furthermore, as a quantitative measure, the relative entropy calculation is expected to further determine the optimal COS model. Practically, for models with different separation distances, one could calculate the Kullback-Leibler divergence of the PMFs from the parameterized CG models with the target atomistic sampling, as demonstrated in Table I. Then, the optimal separation distance can be chosen as the distance that gives the smallest Kullback-Leibler divergence. As a natural extension from our interpretation, it would be of interest to understand the role of additional basis sets in force-matching to systematically improve the reproducibility and transferability of the CG models.

## IV. CONCLUDING REMARKS

In summary, this paper aims to provide a systematic framework for the construction of CG models using virtual sites, i.e., fictitious particles designed to capture essential physical correlations of the system at the CG resolution. First, we have introduced a statistical mechanical theory of MS-CG involving virtual CG sites based on the principle of maximum entropy. The extended thermodynamic consistency relationship from the MS-CG formalism is satisfied if the overall COM of each molecule is expressed as a linear combination of the configurations of virtual sites. In such cases, the overall forces are partitioned among the virtual CG types based on the coefficient used to construct the overall COM. Namely, this framework opens up a new way of designing CG models with virtual particles based on the MS-CG methodology. Depending on the system of interest, one can design a new CG model and explicitly consider the vectors between virtual sites as an additional basis set to the MS-CG algorithm without any additional complications.

A new center of symmetry (COS) CG mapping scheme involving virtual sites is also presented in this work. The COS mapping is designed to preserve the symmetry from the finer resolution through the use of virtual sites, which we demonstrate for molecules that share aromatic ring motifs with planar symmetry. In this case, the imposed normal vector formed by the virtual sites can capture the *π*-*π* interaction or quadrupolar interaction for benzene and toluene systems. In turn, the COS CG models can provide very accurate delineation of radial, orientational, and angular distributions, as well as transferable interactions between bulk and mixed liquid phases.

The aforementioned reproducibility and transferability of the COS mapping scheme further suggests potential applicability of shape-based or symmetry-based CG models^{44–47} through the MS-CG approach. Defining CG mapping operators and utilizing such bottom-up CG models in more complicated systems such as biomolecules is still challenging because various motions with different time scales are entangled in the bead.^{4} Nevertheless, the current framework can effectively model some biomolecules that have a certain shape or form such as lipids^{86–88} as the symmetry of the CG sites might play an important role in understanding the physical picture of the CG models. This general approach may also facilitate the construction of more transferable bottom-up CG models for sophisticated bio- and macromolecules.

## ACKNOWLEDGMENTS

This material is based on the work supported by the National Science Foundation (NSF, Grant No. CHE-1465248). Simulations were performed using the resources provided by the University of Chicago Research Computing Center (RCC). J.J. acknowledges a graduate fellowship from the Kwanjeong Educational Foundation. We thank Dr. Alexander Pak for his valuable comments and a careful reading of the manuscript. We especially thank the anonymous reviewer for valuable comments and suggestions to further improve the manuscript.

### APPENDIX A: DISCUSSION OF THE LIMIT: COS MODEL CONVERGES TO A SINGLE-SITE COM MODEL WHEN *l*→0

### APPENDIX B: DERIVATION OF $\Omega R\u0303\xd1$ AND EQ. (11)

The number of state measure $\Omega R\u0303\xd1$ in general is a nonlinear function of the COM configurations, which is difficult to obtain in an explicit analytical form. However, in this work where the two COS beads correspond to their COM, we can explicitly evaluate the integral in the measure function. In order to perform such an integral, we change the integration variable from the COS bead coordinates (**R**_{I1}, **R**_{I2}), to their center of mass and half relative distances $R\xafI,\Delta RI$, where $R\xafI=RI1+RI2/2$ and $\Delta RI=RI1\u2212RI2/2$. Then, the measure function can be explicitly integrated as Eq. (B1), resulting in Eq. (11),

Moreover, one can alternatively derive Eq. (11) from the perspective of consistent observation without utilizing the maximum entropy principle. We consider any operator as a function of the COM information (conventionally denoted as $RCN$, but as $R\u0303\xd1$ in this work). The expectation value of such an operator for the atomistic ensemble as well as the CG virtual beads ensemble should remain identical. Therefore, the expectation value for the COM ensemble is further expressed as

Similar to the evaluation of the measure functions, the term $2V\xd1$ comes from changing the integral measure from *d***R**^{N} to $dR\u0303\xd1d\Delta R\xd1$. The integration of the half relative distance between virtual sites, $\Delta R\xd1$, gives the volume terms, and the Jacobian of the transformation gives the factor of 2. In order to satisfy the consistency, *ρ*_{CG}(**R**^{N}) should be expressed as

giving Eq. (11).

### APPENDIX C: DERIVATION OF EQ. (15)

Equation (15) is derived based on the change of variables for derivatives. Notice that the following equation holds due to the linearity of the two mapping operators

Without loss of generality, any fine-grained atom *i* belongs to the center of mass *α*. Therefore, assigning a weighting factor *d*_{αi} for each side of the equation above and summing over the terms result in

Once the choice of the weighting factors satisfies $\u2211i\u2208\alpha d\alpha i=1$, we arrive at Eq. (15).