Multiscale modeling of polymers exchanges information between coarse and fine representations of molecules to capture material properties over a wide range of spatial and temporal scales. Restoring details at a finer scale requires us to generate information following embedded physics and statistics of the models at two different levels of description. Techniques designed to address this persistent challenge balance among accuracy, efficiency, and general applicability. In this work, we present an image-based approach for structural backmapping from coarse-grained to atomistic models with cis-1,4 polyisoprene melts as an illustrative example. Through machine learning, we train conditional generative adversarial networks on the correspondence between configurations at the levels considered. The trained model is subsequently applied to provide predictions of atomistic structures from the input coarse-grained configurations. The effect of different data representation schemes on training and prediction quality is examined. Our proposed backmapping approach shows remarkable efficiency and transferability over different molecular weights in the melt based on training sets constructed from oligomeric compounds. We anticipate that this versatile backmapping approach can be readily extended to other complex systems to provide high-fidelity initial configurations with minimal human intervention.
Multiscale modeling is critical in diverse fields of research, including physics, chemistry, biology, and materials science and engineering.1–4 Developing multiscale techniques in a hierarchical and consistent manner provides a comprehensive description of materials from bottom-up and, as a predictive tool, permits rational design of materials in silico. Coarse-graining of an atomistic/microscopic model simplifies the representation of a physical system by reducing the degrees of freedom and hence drastically accelerates calculations. In consequence, coarse-graining allows simulations to reach larger temporal and spatial scales that are inaccessible to detailed models retaining a fine level of description. Systematic methods to construct coarse-grained (CG) models by passing information from fine to coarse levels continue to develop and improve over recent years.5–9 This is particularly critical for macromolecular systems, whose structural properties and dynamical behavior span over a vast range of length and time scales.10,11 Commercial grade polymeric materials consist of molecules with hundreds to thousands of repeat units that necessitate the simulation studies of large-scale systems, which evolve in exceedingly slow rates. As a result, systematic coarse-graining, based on atomistic simulations of small molecules, facilitates the quantitative interrogation of conformational and thermodynamic properties of industrial polymers, such as polystyrene,12,13 polyethylene,14 polyisoprene,15–17 and polybutadiene,18–21 at molecular weights where atomistic modeling presents insurmountable demands even for state-of-the-art high performance computing resources.
For many applications, there is a need to study large systems with long correlation times and also fine details; examples include the studies of local structure and dynamics of long macromolecules at the interface with solid surfaces15,22 and cross-linking reactions that depend on the atomistic-level configuration of specific repeat units.23 In those cases, high-fidelity backmapping is essential for closing the loop starting from the atomistic model (AT) to a derived CG model and then going back to the detailed AT description. Although the accuracy of such a scheme is intimately coupled with the capability of the CG model to propose realistic structures, it is important to note that equally essential attributes are efficiency, flexibility, and broad applicability with minimal human intervention. Given the multitude of the already existent chemically specific CG models, a general backmapping technique, readily applicable with little modification that is not necessarily tied to the technique used to develop the CG force fields in the first place, is highly desirable. Several studies have focused on addressing this need with different strategies, based on random mapping, geometrical and mechanical considerations, Bayesian inference, hybrid scheme (using a time-dependent switching function), and subsequent position-restrained molecular dynamics (MD) or Monte Carlo relaxations.13,15,24–34 However, those strategies more or less rely on maintaining libraries of molecular structures or force fields that are system specific, while often facing a delicate trade-off between efficiency and accuracy.
For polymers, CG models commonly preserve chemical specificity by promoting local configurations dictated by bond, bending angle, and dihedral angle potentials that are designed to match specific order parameters with atomistic structures. Through backmapping, these local configurations can be translated to fine structures absent in the CG configuration via sequential building of successive repeat units. Nevertheless, subsequent procedures are often needed to increase the quality of the resulting structures that are specific to the system at hand (as, for example, in our previous work with polyisoprene15 and similar studies with polystyrene,26 where dihedral angles in the retrieved atomistic models required further refinement). Such efforts take significant insight and algorithmic design, which are not generally applicable. In this work, we present a general method for backmapping polymer CG models, based on machine learning. While machine learning is currently being exploited in forward coarse-graining studies,35–39 we reveal its versatility in the direct application of structural backmapping. Recently, a study on backmapping polystyrene oligomers constructed deep convolutional neural networks embedded with the force field description and utilized machine learning to predict atom placement sequentially through recurrent training.40 By contrast, the proposed method in this work separates particle positions from the molecular topology and force field and hence avoids tedious bookkeeping of molecular details during the reconstruction while making predictions in groups. We expect that this method can be easily modified to apply to a broad class of polymeric materials given existing CG models. Noting that any backmapping procedure would result in “imperfections,” our effort herein is to develop a systematic, model agnostic procedure that excels in flexibility and efficiency.
We formulated the backmapping from CG to atomistic as an image-to-image translation problem that translates “low-resolution” CG pictures to “high-resolution” atomistic pictures with fine details. This problem is essentially analogous to super-resolution,41 while learning-based generative methods should be sought.42 The rationale behind our approach is to use machine learning to capture complex statistical relationships between a CG model and the atomistic counterpart, in terms of probability distributions, through training with substantial samples. In the class of generative models, generative adversarial networks (GANs)43 present a powerful deep learning approach, which simultaneously trains two types of artificial neural networks, namely, generator and discriminator, in a synergistic and opposing manner. Specifically, a generator network is trained to produce synthetic data following the same statistics as the training dataset, while a discriminator network learns to evaluate the authenticity of generated data as if they belong to the real dataset or not. Ideally, the quality of generated prediction improves through training without human intervention.
We employed modeling of cis-1,4 polyisoprene (PI) melts as an illustrative example and adopted a general-purpose conditional GAN model, “Pix2Pix,”44 for the image-to-image translation. Pix2Pix captures the mapping between paired images through training on a corresponding dataset, predicts pixels from pixels, and synthesizes output images with conditioning on the input ones. Specifically, it utilizes a U-Net architecture45 as a generator and a convolutional “patchGAN” classifier as a discriminator, which skips tweaking the loss function in machine learning and is found to be effective on a wide variety of image prediction problems.44 In this study, the training sets were prepared based on atomistic configurations from MD simulation trajectories. The mapped CG configurations were employed as input, while the atomistic counterpart served as ground truth. After training the generative model, it was applied to data outside of the training sets to make backmapping predictions via the generator. The efficiency of this machine learning approach was examined, and the prediction quality was evaluated. We further compared the results between machine learning and a commonly used “random mapping” method, which generates configurations using random rotation of repeat units around their center-of-mass (COM) locations. The outcome demonstrates that the proposed image-based machine learning approach presents significant advantages by introducing correlations absent in the random mapping method, without sacrificing efficiency and general applicability.
A united-atom (UA) representation was selected for the atomistic model of PI, where each carbon with directly bonded hydrogens is lumped into a single interaction site. In this simplified atomistic representation, every PI monomer is defined by the positions of five particles [Fig. 1(a)]. The implemented UA force field has been extensively described in past studies, where it was shown to faithfully reproduce material properties.46–49 Our CG mapping herein places a CG bead at the COM of each monomer [Fig. 1(a)]. The CG force field was developed using the Iterative Boltzmann Inversion (IBI) method,15,50–52 a technique that has been widely used for systematic coarse-graining of atomistic polymer models. Detailed results and validation of the CG mapping are presented in Figs. S1 and S2 in the supplementary material.
A CG configuration comprises of coordinates of CG beads representing monomers for each polymer chain in the system [Fig. 1(a)]. Backmapping involves generating atom coordinates, while molecular topology (describing the architecture and connectivity among atoms) can be extracted separately and treated as metadata that is not required for the coordinate generation. To achieve high-fidelity backmapping, the statistical relationships between CG and UA configurations need to be captured. For instance, the arrangement of atom positions in a polymer chain should follow the correlations among successive CG beads along the chain contour, as expressed by bond distances, angles, dihedrals, etc. Nevertheless, these relationships are mostly ignored in random mapping and can only be partially traced by the sequential building of polymers (via stochastic sampling and local optimization in a system-specific manner).15,34 By contrast, machine learning, in principle, parses hidden representations of the full relationships and casts them into neural networks, regardless of the chemical species and force fields.
To highlight the correlations embedded in spatial configurations, we exploited an adaptive reference strategy for establishing the correspondence between CG and UA representations. Specifically, CG beads of monomers along the chain contour were referenced by vectors from the COM of a polymer fragment or entire chain [Fig. 1(a)], instead of the absolute positions in space. Meanwhile, UA particles were referenced by vectors that depict relative positions within an individual CG bead. As illustrated in Fig. 1(c), three UA reference schemes were considered: Scheme 1 takes five position vectors from the monomer COM to the united atoms; Scheme 2 treats the monomer as rigid body, where the relative positions of atoms are fixed, and the rotation of the rigid body can be described by two unit orientation vectors; Scheme 3 takes four bond vectors among pairs of atoms.
In Scheme 1, four position vectors are sufficient for determining particle coordinates inside the monomer, given the position constraint of the COM. However, all the five vectors were adopted to maximize the degrees of freedom for machine learning. For Scheme 2, a fixed unit monomer is required to construct approximate configurations using rotations; we selected the one prescribed by the preferred bonds and angles from the force field [as shown in Fig. 1(c), where all five particles lie in the same plane]. From the setup, there is no extra positional constraint among five atoms in Scheme 1 besides the COM position. Scheme 2 constrains both bonds and angles inside the monomer, and the same bond constraints can be applied in Scheme 3 (i.e., bond lengths are fixed to 0.150 nm, 0.134 nm, 0.150 nm, and 0.150 nm for 1–2, 2–3, 3–4, and 2–5 bonds, respectively), which is the case in our UA simulations (constraining bonds allows for a larger time step in MD simulations). In general, these three schemes provide different data representations of the UA configuration. Their effect on the training and prediction of machine learning will be examined later.
To structure UA and CG data, spatial configurations were mapped into images. The translation was executed by converting XYZ components of vectors to RGB values through normalization with respect to adaptive reference cells and rescaling [Fig. 1(b)]. Each color channel, corresponding to an axis, takes integer values from 0 to 255. The spatial resolution of this mapping is thus dictated by l/255, with l being the side length of the reference cells, which is optimized for performance. Following the mapping schemes in Fig. 1(c), UA images, carrying different features of individual monomers, were obtained from mapping characteristic vectors (i.e., , , and ) to pixels or color blocks. In particular, the relative position vectors in Scheme 1 were normalized to a reference box of l = 0.46 nm centered at the monomer COM (i.e., the CG bead location); the orientation and bond vectors in Schemes 2 and 3 were treated as unit vectors. CG color blocks, containing positional information of monomers, were designed to take the total size of the UA color blocks from the member vectors, which composed “low-resolution” CG images.
For a case study, we selected a PI melt system consisting of 100 chains of 30 monomers each (noted as 30-mer). MD simulations were performed at temperature T = 413 K and pressure P = 1 bar to generate system configurations (using Gromacs,53,54 version 2019.3, see simulation details and Fig. S3 in the supplementary material). Typical chain conformations in UA and mapped CG representations are shown in Figs. 2(a) and 2(b), respectively. Image translation was performed per polymer chain. A reference cell of l = 7.5 nm was adopted at the chain COM for the color mapping of CG beads (the average 30-mer end-to-end distance REE ≈ 3.28 nm, see the supplementary material and Fig. S4 for detailed discussion). Figure 2(c) presents three images translated from the conformations via the mapping schemes in Fig. 1(c). Each image represents single training data, consisting of two concatenated 256 × 256 pixel matrices of RGB values from the UA and CG representations. The CG image is used as the input for machine learning, while the UA image is the target. The particular data arrangements displayed in those images were employed to maximize the use of information in training. The training sets were constructed accordingly from an atomistic MD trajectory (each frame corresponds to 100 images).
The implementation of Pix2Pix44 was carried out in the open source TensorFlow 2 platform, and a single NVIDIA GeForce GTX 1080 GPU was employed for calculations. The efficiency of this machine learning mechanism in training was optimized, and the adjustment of relevant control parameters is presented in Figs. S5–S10 in the supplementary material. As the machine learning is based on images, the acquired parameter set should also exhibit good performance in other systems with similar data representations. A typical training process using Scheme 3 mapping is shown in Fig. 3(a), which is characterized by the evolution of defined loss functions along with the number of epochs. Specifically, generator loss contains cross entropy between generated images (namely, predictions) and a matrix of the same shape with value ones, plus a weighted L1 difference between predictions and targets (the effect of weight coefficient was tested, cf. Fig. S10 in the supplementary material); discriminator loss contains cross entropy between predictions and a matrix of zeros, plus cross entropy between targets and a matrix of ones.44 The decay of total loss (i.e., sum of the generator and discriminator losses) as learning progresses indicates that the training improves the quality of predicted images from the generator (cf. Fig. S8).
Our machine learning approach was validated by testing configurations not included in the training set as input. Figure 3(b) displays an example of a predicted image from the trained model after 400 epochs, alongside the corresponding target image. Besides mild blurry near the margins, the prediction in general exhibits an outstanding visual similarity with the target. The averaged RGB values, , over individual color stripes are translated to bond vectors (i.e., ). Taking the CG bead coordinates from the pairing CG image, the atom coordinates with respect to the chain COM can thus be determined. Figure 3(c) shows the chain conformations converted from images in Fig. 3(b), with the original chain conformation also presented for comparison (the image translation results in a marginal difference between the target and original conformations due to float to integer conversion). One finds a good consistency between the prediction and the target or original. Meanwhile, the deviation of the prediction from the original contour exposes the inherent feature of the one-to-many correspondence from CG to UA configurations, as such a backmapping process generates information and is not deterministic in solutions. With the information of chain COM (recorded during the image translation), a UA configuration of the entire system can be handily constructed from assembling chains in space [see Fig. 3(d)].
Obtaining the trained models, the backmapping from CG to UA proceeds in three steps: convert a CG configuration to CG images following the image translation protocols used in the training set, input the images into machine learning for a prediction, and translate the predicted images to a UA configuration. The CG configuration can be mapped from atomistic trajectories or taken directly from CG simulations with no original atomistic counterpart. Due to adaptive reference, the predictions are made simultaneously for all monomers in the image (i.e., 30 under the current setting as a typical choice). This offers a significant improvement over sequential building in efficiency. In addition, as the data arrangement in image translation is flexible, the trained models can directly apply to other PI melt systems of different size or chain length. Specifically, shorter chains can be treated by extra zero-padding in the input CG images; longer chains can be processed on the basis of a 30-mer or shorter fragments. We note that although the random mapping method (introducing repeat units based on the COM location and random orientation) inherently has the same feature (i.e., not limited to systems), it requires a system-specific tuning in practice and fails to capture intra-chain correlations, in contrast to the machine learning approach. As an example, we present a CG chain conformation of a 500-mer chain alongside a prediction from random mapping in Fig. 4(a). For comparison, backmapped UA configurations from machine learning are shown in Fig. 4(b) using different data representation schemes.
To quantify the prediction quality, we examined the distributions of all five types of bonds, five types of angles, and four types of dihedrals along the chain contour and computed the probability densities (pi, i indicates individual distribution) for original, target, and prediction UA configurations, respectively (cf. Figs. S7 and S11–S13 in the supplementary material, where radial distribution functions are also shown). A comparison between the original and predicted UA configurations for a dihedral distribution, illustrating a typical intra-chain correlation, is summarized in Fig. 4(c). Random mapping does not carry any correlation among particles in different monomers. By contrast, certain degrees of correlations were captured through machine learning on the chain connectivity depicted by the color block sequence inside the images. As no constraint was applied for the connection between neighboring monomers, machine learning yielded a spread distribution of bond length for the connecting bonds and smeared distributions of the associated dihedral angles. These variations enlarged the configurational space of predictions for backmapping.
We defined a metric, , as the sum over L1 norm of the difference between pi of prediction and original for all the distributions (see Fig. S14 for the comparison of different dihedral distributions). Lower values of Ψ correspond to higher prediction quality. In Fig. 4(d), we plot Ψ of backmapped configurations from random mapping and machine learning for systems of different chain lengths and numbers. Specifically, the backmapping from different schemes was performed using the same trained models based on 30-mer UA configurations. From the figure, one finds that all the schemes exhibit similar performance for different melt systems, demonstrating the transferability and efficiency of the machine learning approach using the chain/fragment-based adaptive reference strategy. Furthermore, machine learning shows much improved backmapping credibility in comparison to the random mapping. Scheme 3 provides the best predictions compared to Schemes 2 and 1, implying that data representation can influence the results of training. Comparatively, reasonable constraints in data representation simplify the statistical relationships that lead to better results.
To further validate the backmapping, we initiated atomistic MD simulations with predicted configurations and compared the trajectories with long UA simulations that present converged metrics of properties. As the adopted chain-based strategy does not fully take into account the local environment of chains, neighboring steric overlapping from different chains cannot be resolved. The resulting stiffness of interaction potentials can be relaxed via energy minimization, as is standard in backmapping methods.15,24–30,34 In Fig. 4(e), we present the radial distribution function of UA particles in 500-mer configurations obtained after an energy minimization procedure on the predictions (see Fig. S15 for the inter-chain result). Particularly, the input CG configuration for backmapping was selected from a CG MD trajectory (rather than being mapped forwardly from a UA configuration). Figure 4(f) plots the temporal evolution of the system’s total energy in subsequent MD simulations (in the NVT ensemble with initial velocities generated based on a Maxwell distribution). The energy evolution proceeds to an equilibrium value rapidly (in ∼60 ps, which is close to the segmental relaxation time of PI at T = 413 K15,49). Upon reaching the equilibrium value, conformational properties recovered the original ones (see Fig. S16, for example). The quick convergence indicates that the predicted configurations from the trained models are close to the target one in the free energy landscape, as also implied by Fig. 4(e). In contrast, random mapping requires longer time to reach equilibrium, which can be attributed to kinetic trappings induced by steric constraints. From the comparison, we conclude that machine learning presents a significant improvement in capturing intra-chain correlations that provides faithful backmapping of long macromolecules in dense systems.
In summary, we have proposed an image-based approach for structural backmapping from coarse-grained to atomistic configurations. We note that the image mapping offers a straightforward and efficient representation to structure data of particle coordinates. Combined with machine learning tools, such as conditional generative adversarial networks, it provides an effective and reliable strategy for high throughput and high accuracy conversion. We demonstrated the success of the approach with a case study on cis-1,4 polyisoprene melt simulations using adaptive reference, which fully exhibits the generality and transferability of the strategy. Our approach separates configurational information from molecular topology and force field for backmapping, and it makes predictions only based on relative positions and hidden correlations. This underlying mechanism can be easily implemented and applied to other complex systems at different coarse-graining levels, such as heteropolymers and biomacromolecules, where bookkeeping and tuning atomistic details via human intervention become particularly tedious and labor-intensive. Specifically, the mapping scheme and data representation can be modified to depict different molecular architectures and chemical species through the variation in color range, color block size, and spatial arrangement, while the optimized control parameters of the machine learning mechanism should be generally applicable. We expect that this backmapping approach further complements multiscale modeling techniques addressing material properties over a broad range of length and time scales.
See the supplementary material for the development of coarse-grained force field, simulation details, benchmark of parameters for machine learning, and additional results.
The data that support the findings of this study are available from the corresponding author upon reasonable request.
This work was supported by the Foundation of Research and Technology—Hellas and the Goodyear Tire and Rubber Company. W.L. acknowledges Yunhe Feng for helpful discussions.