We outline a machine learning strategy for quantitatively determining the conformation of AB-type diblock copolymers with excluded volume effects using small angle scattering. Complemented by computer simulations, a correlation matrix connecting conformations of different copolymers according to their scattering features is established on the mathematical framework of a Gaussian process, a multivariate extension of the familiar univariate Gaussian distribution. We show that the relevant conformational characteristics of copolymers can be probabilistically inferred from their coherent scattering cross sections without any restriction imposed by model assumptions. This work not only facilitates the quantitative structural analysis of copolymer solutions but also provides reliable benchmarking for the related theoretical development of scattering functions.

## I. INTRODUCTION

Block copolymers, in which the chemically distinct monomer segments are grouped into discrete blocks along the polymer chain, have played essential roles in science and technology. The self-assembly of block copolymers in bulk, on surfaces, or in solution has been intensively studied because of their domain sizes and rich chemical tunabilities.^{1,2} In particular, block copolymers spontaneously undergo microphase separation in solutions to form an array of discrete nanostructures with a size range from sub-10 nm to micrometers.^{3} A suite of micellar structures can be achieved by tuning structural parameters of a block copolymer, such as the chemical compositions of the monomers, the molecular weights, and the copolymer architecture. In addition, solvent quality is one of the most important parameters in forming different self-assembled structures because it dictates the conformations of the polymer chains.

To control the thermodynamic properties and functionality of self-assembled structures by the *de novo* design of materials at the nanoscale, there is a critical need to understand the molecular configuration of individual copolymer amphiphiles, such as the persistence length and the contour length of each constituent segment.^{4,5} Prominent among the experimental tools for determining conformations of copolymers is the technique of small angle scattering, with both neutrons and x rays.^{6} Central to this existing experimental approach is the development of a theoretical model of static two-point correlation functions, through which the conformations of copolymers can be quantitatively described in terms of the optimized structural parameters obtained from regression analysis of the measured coherent scattering intensity.

In the past few decades, there has been much interest in developing the scattering functions of homopolymers in the presence of excluded volume effects.^{7–9} In this pursuit, one continued endeavor is to analytically derive the static two-point correlation functions of semiflexible chains with the self-avoiding effect on the intra-chain density distribution treated as a correction to ideal Gaussian chain statistics.^{10–16} The domain of interest also includes the phenomenological approaches, which develop composite scattering functions of semiflexible chains by joining the asymptotic expressions of coherent scattering intensity manifesting at different length scales with the aid of computer simulations.^{17} However, the question of whether a scattering function of semiflexible diblock copolymers consisting of two conformationally distinct segments can be developed based on these two existing approaches has not been addressed unambiguously. Here, the two-point spatial correlation of an AB-type diblock copolymer in the reciprocal *Q* space, $S(Q)$, takes the form

where $S_{AA}(Q)$, $S_{BB}(Q)$, and $S_{AB}(Q)$ represent the coherent scattering contributions from the intra-chain correlation of segment A, the intra-chain correlation of segment B, and the cross correlation between segments A and B, respectively. $S_{AB}(Q)$ can be explicitly expressed as

where *b*_{i} and *b*_{j} are the bound scattering lengths at positions *i* and *j* located in segments *A* and *B*, respectively, and ⟨⟩ represents the ensemble average. While $S_{AA}(Q)$ and $S_{BB}(Q)$ in Eq. (1) can be modeled satisfactorily,^{17,18} defining an analytical expression of $S_{AB}(Q)$ is a formidable challenge because of the difficulty in identifying the functional expression of the relative positional vector *r*_{i} − *r*_{j}, given the heterogeneous intra-chain flexibility. Moreover, as indicated in Fig. 1(a), for a copolymer in the asymptotic power law regime^{6} of 10^{1} < *QL* < 10^{2}, where *L* is the contour length, the corresponding $S_{DC}(Q)$ reflects the collective contributions of $S_{AA}(Q)$, $S_{BB}(Q)$, and $S_{AB}(Q)$. The development of $S(Q)$ based on the existing phenomenological approach for homopolymers^{17} is severely hampered by the heterogeneity in segmental flexibility, since the parameterized crossover function used to connect different asymptotic regions in reciprocal space for homopolymers is no longer valid. As a result, no scattering function of diblock copolymers with excluded volume effect has been reported thus far. This challenge provides the motivation for our study.
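For reference, a form of Eqs. (1) and (2) that is consistent with the definitions in this paragraph can be written as below. The factor of 2 on the cross term and the absence of an overall normalization are assumptions made here: they hold if $S_{AB}(Q)$ counts each A-B pair once and the ensemble average is real. The symbol $\mathrm{i}$ denotes the imaginary unit, to keep it distinct from the site index $i$.

$$
S(Q)=S_{AA}(Q)+S_{BB}(Q)+2S_{AB}(Q),
\qquad
S_{AB}(Q)=\left\langle \sum_{i\in A}\sum_{j\in B} b_{i}b_{j}\exp\left[-\mathrm{i}\,\mathbf{Q}\cdot\left(\mathbf{r}_{i}-\mathbf{r}_{j}\right)\right]\right\rangle
$$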

To eliminate critical deficiencies of the existing parametric methods, which are unable to treat the scattering behavior of copolymers realistically in a tractable manner, we propose a data-driven machine learning (ML) approach based on the algorithm of the Gaussian process (GP)^{19–21} for inversely determining the conformation of copolymers from the coherent scattering intensity $S(Q)$. Our approach presents a conceptually drastic departure from the conventional protocol of scattering data analysis: The relevant conformational parameters are now probabilistically inferred from the expressive features of $S(Q)$, instead of deterministically obtained based on its functional expression that must be specified *a priori*.

## II. RESULTS AND DISCUSSION

The prerequisite of every quantitative inverse problem of scattering is to identify an explicit mathematical relation between the experimentally measured coherent scattering cross section and the relevant parameters so that the structure of the studied systems can be described based on the geometric picture built upon the optimized parameters obtained from regression analysis of collected spectra. Before establishing this relation, it is instructive to quantitatively examine the structure factor of the copolymer as a function of conformational parameters. In this study, we first identified a set of conformational parameters $Y=\{l_{A},\gamma,f\}$, where *l*_{A} is the persistence length of segment A in units of the contour length *L* of the copolymer, *γ* is the ratio of *l*_{B}, the persistence length of segment B, to *l*_{A}, and *f* is the ratio of the contour length of segment B to the contour length *L* of the copolymer. With the constraint *γ* ≥ 1, these parameters provide a unique description of the conformation of a diblock copolymer with contour length *L*. Using Monte Carlo (MC) simulations (see Appendix A), an extensive library of intra-copolymer structure factors $S(QL)_{\mathrm{training}}$ based on experimentally relevant *Y*_{training} was generated. Here, we present the simulated structure factor as a function of the dimensionless variable *QL*. Experimentally, *L* can be precisely determined in the intermediate *Q* range of collected spectra. As indicated in Fig. 1, continuous variations of *l*_{A}, *γ*, and *f* resulted in smooth changes in $S(QL)$. One can therefore infer that these parameters are statistically correlated in the vector space of $S(QL)$. To further highlight the expressiveness of the two-point correlation of data, the scattering function of a rigid rod with length *L*, termed $S_{\mathrm{rod}}(QL)$, was used as the comparative reference in this study. The excess structure factor $\Delta S(QL)$ was defined as the deviation of $S(QL)$ from $S_{\mathrm{rod}}(QL)$ in logarithms, namely, $\Delta S(QL)\equiv\ln S(QL)-\ln S_{\mathrm{rod}}(QL)$.
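The definition of the excess structure factor can be illustrated with a short numerical sketch. It is not the simulation code used in this work: the bead-based Debye sum, the function names, the 65-point logarithmic *QL* grid between 1 and 100, and the right-angle "bent" test chain are all assumptions chosen only to make the example self-contained.

```python
import numpy as np

def debye_structure_factor(positions, q_values):
    """Orientationally averaged S(Q) of identical point scatterers, with S(0) = 1."""
    diff = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diff, axis=-1)          # (N, N) matrix of pair distances
    # np.sinc(x) = sin(pi x) / (pi x), hence sinc(Q r / pi) = sin(Q r) / (Q r)
    return np.array([np.mean(np.sinc(q * r / np.pi)) for q in q_values])

def rigid_rod(n_beads, contour_length=1.0):
    """Beads placed uniformly on a straight line of the given contour length."""
    z = np.linspace(0.0, contour_length, n_beads)
    return np.column_stack([np.zeros(n_beads), np.zeros(n_beads), z])

def excess_structure_factor(s_ql, s_rod_ql):
    """Delta S(QL) = ln S(QL) - ln S_rod(QL)."""
    return np.log(s_ql) - np.log(s_rod_ql)

# With L = 1 the wave vector Q is numerically equal to QL.
ql = np.logspace(0, 2, 65)                     # assumed sampling grid
s_rod = debye_structure_factor(rigid_rod(200), ql)

# Hypothetical test object: two straight halves joined at a right angle (L = 1).
half = np.linspace(0.0, 0.5, 100)
bent = np.vstack([
    np.column_stack([np.zeros(100), np.zeros(100), half]),
    np.column_stack([half[1:], np.zeros(99), np.full(99, 0.5)]),
])
delta_s = excess_structure_factor(debye_structure_factor(bent, ql), s_rod)
```

Each `delta_s` array plays the role of one point in the 65-dimensional vector space discussed in the text; a more compact object than the rod gives a positive excess at small *QL*.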

Each simulated $\Delta S(QL)$ is represented by a single point in this vector space of the structure factor. Therefore, by labeling each point with the values of *l*_{A}, *γ*, and *f*, one can qualitatively examine the statistical dependence of $\Delta S(QL)$ on *Y*. For each simulated $\Delta S(QL)$, there are 65 sampled *Q* points. The dimension of this vector space is therefore 65, and we denote it as $\mathbb{R}^{65}$. The distribution of these data points in $\mathbb{R}^{65}$ cannot be directly visualized. To facilitate the inspection of the data distribution in $\mathbb{R}^{65}$, we conducted a singular value decomposition (SVD) (Appendix B) to identify the orthogonal coordinates. By recasting the data into the space spanned by the principal components, the intrinsic correlations of these parameters can be highlighted. Our analysis shows that the first three principal axes retain more than 95% of the variance of the original data. Therefore, the vector space $\mathbb{R}^{3}$ spanned by their singular vectors SVD0, SVD1, and SVD2 given in Fig. 2(a) is sufficiently expressive to demonstrate the data correlation. The insets of Figs. 1 and 2(a) show that the conformational difference between the copolymer and the reference system of the rigid rod causes a positive deviation in the structure factor, as reflected by the shape of SVD0 and its retained variance of 70%, within the range of 1 < *QL* < 100. SVD1 (20%) and SVD2 (5%) are responsible for subtle changes in $\Delta S(QL)$. As demonstrated in (b)–(d) of Fig. 2 and the associated insets, the data points representing the simulated $\Delta S(QL)$ are seen to distribute over a twisted two-dimensional manifold: The black crosses on the origin represent $\Delta S(QL)$ of a rigid rod with contour length *L*, and the red edges represent that of homopolymers characterized by different persistence lengths. Upon changing *l*_{A}, *γ*, and *f*, the characteristic developments of $\Delta S(QL)$ are observed as reflected by the color evolution. The nonuniform data distribution reflects that the dependence of $\Delta S(QL)$ on these parameters is highly nonlinear, which again manifests the mathematical challenge of developing a parametric expression of $\Delta S(QL)$ based on *l*_{A}, *γ*, and *f*.

In what follows, we outline the mathematical foundation of our nonparametric approach, which establishes the relation between $\Delta S(QL)$ and *Y* probabilistically, based on the Gaussian process.^{19–21} Readers are referred to Appendix C for more details of the implementation. *l*_{A}, *γ*, and *f* are treated as normally distributed random variables in $\mathbb{R}$, which follow multivariate Gaussian statistics. Three covariance matrices $K\equiv\{K_{l_{A}},K_{\gamma},K_{f}\}$ are further modeled by a radial basis function (RBF) kernel^{20–22} to define the statistical relationship between samples. It is important to mention that *K* is determined by the data distribution in $\mathbb{R}^{65}$. The so-called training process in the ML procedure is to identify the optimized *K* using a kernel function to quantify data correlation in $\mathbb{R}^{65}$ with $\Delta S(QL)_{\mathrm{training}}$ as input. For an experimentally measured $S_{m}(QL)$, its conformational parameters *Y*_{m} must follow the Gaussian statistics described by *K*. Judging from the similarity between $\Delta S_{m}(QL)$ and $S(QL)_{\mathrm{training}}$, which is quantitatively decided by their Euclidean distance in $\mathbb{R}^{65}$, *Y*_{m} can be obtained by probabilistic inference. The scattering data analysis is therefore a linear operation of conditioning *K* by $\Delta S_{m}(QL)$ to extract the desired *Y*_{m} and the statistical errors as the mean and variance of the conditional univariate Gaussian distribution. It is important to verify the feasibility of this ML-based approach for inversely determining the conformation of copolymers using scattering. For this benchmarking purpose, another library of the structure factor containing 1266 samples, $S(QL)_{\mathrm{testing}}$, is computationally generated based on a new set of *Y*_{testing} different from the *Y*_{training} corresponding to $S(QL)_{\mathrm{training}}$. In (a)–(c) of Fig. 3, we present the comparison of the computational inputs *Y*_{testing}, which are termed the ground truths according to the convention of information science,^{20,21} and those inverted from conditioning *K* using $S(QL)_{\mathrm{testing}}$. We found that the inverted conformational parameters are strongly correlated with their corresponding ground truths, although with varying degrees of statistical uncertainty. For $\ln l_{A}$, $\ln\gamma$, and *f*, the coefficients of determination^{23} *R*^{2} are 0.9992, 0.9749, and 0.9144, and the mean absolute percentage errors (MAPE)^{24} of the inverted conformational parameters relative to the ground truths are 0.51%, 7.05%, and 5.46%, respectively. The average values of the confidence intervals of these parameters returned by Gaussian process regression (GPR) are 0.46%, 6.93%, and 6.44%, which are of the same magnitude. This observation reflects the difference in the susceptibility of $S(QL)$ toward the change in these parameters. As indicated by the probability density distributions given in (d)–(f) of Fig. 3, the majority of the extracted parameters are in quantitative agreement with the ground truths. However, as indicated by the dispersion of data, especially in (b) and (c) of Fig. 3, there remain certain $S(QL)$ whose inverted conformational parameters deviate significantly from the corresponding ground truths. We found that the numerical deviation becomes more severe in copolymers characterized by more asymmetric segmental chain lengths, namely, larger or smaller *f*.
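The two figures of merit quoted above can be computed as in the following sketch. The function names and the three-point toy arrays are illustrative assumptions and are unrelated to the actual test set; the definitions are the standard coefficient of determination and mean absolute percentage error.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Undefined when a ground-truth value is exactly zero, which matters
    for a parameter such as f that can approach 0.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

# Toy ground truths and inverted values (hypothetical numbers).
ground_truth = np.array([1.0, 2.0, 4.0])
inverted = np.array([1.1, 1.8, 4.0])
r2 = r_squared(ground_truth, inverted)
err = mape(ground_truth, inverted)
```

Applying both functions separately to each parameter gives one *R*^{2} and one MAPE per parameter, which is how the three pairs of values in the text are organized.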

To investigate the origin of this observed disagreement, MC simulations were conducted to generate $S(QL)$ based on the inverted parameters (red circles in Fig. 3) and the corresponding ground truths (white circles). The results are given in Fig. 4 as red curves and black dashed curves, respectively. In general, these two sets of $S(QL)$ are indistinguishable on the numerical scales of Fig. 4, which provides compelling evidence that the origin of this numerical inaccuracy lies in the mathematical properties of $S(QL)$ rather than in our GP-based ML inversion algorithm. As a consequence of Fermi’s golden rule,^{25} the structural information offered by any elastic scattering experiment is the statistical average of the static two-point spatial correlation. With the inherent information loss due to coarse-graining by ensemble averaging, any higher-order correlation is not explicitly registered by the structure factor in reciprocal space or, equivalently, the pair distribution function in real space. The structural difference may be more discernible in higher-order spatial correlations, which are not accessible by scattering. Moreover, although every structural inversion problem of scattering experiments is theoretically underpinned by the existence of a one-to-one mapping between the structural variables and the two-point static correlation function,^{26} the numerical sensitivity of a two-point correlator to any variation in the relevant parameters still needs to be taken into consideration.^{27} As indicated by the residual plots given in the insets, the numerical differences are merely 2%–4% and only manifest in the range of 10 ≲ *QL* ≲ 100. The statistical error in small angle neutron scattering (SANS) coherent intensity within this *QL* range is generally around 5% within reasonable measurement time.^{6} The two $S(QL)$ are therefore experimentally indistinguishable. Given the additional data smearing encountered in practical experimental conditions, such as instrument resolution and polydispersity of materials, it is essentially impossible to precisely quantify the conformation of a copolymer, especially a highly asymmetric one. The reason can also be intuitively appreciated from Fig. 1(a): Assuming that *f* is exceedingly large, the measured $S(QL)$ can be well approximated by $S_{AA}(QL)$. It is not surprising that the conformation of segment B cannot be accurately inverted from $S(QL)$. It is instructive to mention that this issue of structural degeneracy in spectral inversion is not unique to the conformational study of copolymers: It has been known that a fluid characterized by a radially symmetric repulsive interaction potential is structurally equivalent to a hard-sphere system with an effective radius.^{26,28,29} Again, the static structure, in terms of two-point spatial correlation functions, is only equivalent up to a certain numerical resolution.^{27}

## III. CONCLUSION AND PROSPECT

We have outlined an ML strategy, based on the framework of the Gaussian process, to inversely determine the conformation of diblock copolymers from their coherent scattering. By treating the probability distributions of the relevant conformational parameters in the structure factor vector space, we have demonstrated that the conformation of copolymers can be inversely determined without relying on any pre-determined analytical model of scattering functions.

The present work also demonstrates that a scattering study of polymers with complex conformation is of interest in its own right: Complemented by computer simulations, the application of our proposed ML approach should be able to clarify the subtle connection between their scattering signature and the properly identified conformational parameters, thereby facilitating the related quantitative scattering characterization. Moreover, there has been considerable interest in the theoretical description of the structure of interacting polymer solutions based on the mathematical framework of the Ornstein–Zernike equation.^{26,30} There exists a reasonable amount of computational results to indicate the important factors to be considered in developing a proper structural description of these highly correlated systems in the context of two-point static correlation.^{26,30} Now that an ML inversion tool for dilute polymer solutions has been established, perhaps the more challenging inversion problems of these highly correlated systems can be addressed based on the GPR approach or other ML-based schemes for elastic scattering.^{31–33}

## ACKNOWLEDGMENTS

A portion of this research used resources at the Spallation Neutron Source and Center for Nanophase Materials Sciences, two DOE Office of Science User Facilities operated by the Oak Ridge National Laboratory. C.-H.T. and S.-Y.C. acknowledge support from the Ministry of Science and Technology of Taiwan under Grant No. MOST 108-2221-E007-054-MY3. Y.W. was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Early Career Research Program Award KC0402010, under Contract No. DE-AC05-00OR22725. Y.S. was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division. B.G.S. acknowledges support from the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences Data, Artificial Intelligence and Machine Learning at the DOE Scientific User Facilities Program, under Award No. 34532. The authors acknowledge the National Center for High-Performance Computing of Taiwan for providing computational and storage resources.

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://www.energy.gov/downloads/doe-public-access-plan).

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request. The codes for MC simulation and GPR are available for evaluation at https://github.com/ch-tung/polymer_chain/tree/main/worm_like_micelle/batchscatter/block.

### APPENDIX A: MONTE CARLO SIMULATION

A discrete Kratky–Porod model^{34} composed of *N* beads connected by bonds of length *l*_{0} = *L*/*N* was used to model the worm-like chain. Based on Boltzmann statistics, the distribution of the zenith angle *θ* between successive bonds follows $P(\theta)=\frac{1}{M}\exp\left(-\frac{a\theta^{2}}{2}\right)\sin\theta$, where *a* is the chain flexibility and *M* is the normalization factor.^{26} In this framework, the persistence length for sufficiently large *N* can be expressed as $b=\frac{l_{0}}{1-\langle\cos\theta\rangle}\sim al_{0}$. The azimuthal angle *ϕ* is sampled randomly from a uniform distribution between 0 and 2*π*. Intra-chain self-intersection was checked by the criterion that the distance between two non-successive segments should not be less than 0.1*b*.^{14} Each $S(Q)$ is obtained by averaging 1000 trajectories that satisfy the self-avoiding condition.
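The chain-growth procedure described in this appendix can be sketched as follows for a single flexibility *a*. This is a minimal illustration and not the code in the repository cited under Data Availability: the tabulated inverse-CDF sampler, the function names, and in particular the `skip` argument (the minimum index separation between beads that are tested for overlap) are assumptions introduced here, since the text does not specify how contour neighbours are excluded from the check.

```python
import numpy as np

def sample_zenith(a, size, rng, n_grid=4096):
    """Draw angles from P(theta) ~ exp(-a theta^2 / 2) sin(theta) on [0, pi]
    by inverting a tabulated cumulative distribution."""
    theta = np.linspace(0.0, np.pi, n_grid)
    pdf = np.exp(-0.5 * a * theta**2) * np.sin(theta)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    return np.interp(rng.random(size), cdf, theta)

def grow_chain(n_beads, a, rng, contour_length=1.0):
    """Bead positions of one discrete Kratky-Porod chain with bond length L / N."""
    l0 = contour_length / n_beads
    theta = sample_zenith(a, n_beads - 2, rng)          # one angle per joint
    phi = rng.uniform(0.0, 2.0 * np.pi, n_beads - 2)
    t = np.array([0.0, 0.0, 1.0])                       # first bond direction
    bonds = [t]
    for th, ph in zip(theta, phi):
        # Any unit vector perpendicular to t will do, because phi is uniform.
        helper = np.array([1.0, 0.0, 0.0]) if abs(t[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        u = np.cross(t, helper)
        u /= np.linalg.norm(u)
        v = np.cross(t, u)
        t = np.cos(th) * t + np.sin(th) * (np.cos(ph) * u + np.sin(ph) * v)
        t /= np.linalg.norm(t)
        bonds.append(t)
    steps = l0 * np.array(bonds)
    return np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])

def is_self_avoiding(positions, min_distance, skip):
    """True if all beads at least `skip` indices apart are farther than min_distance."""
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    i, j = np.triu_indices(len(positions), k=skip)
    return bool(np.all(d[i, j] >= min_distance))

def sample_trajectories(n_traj, n_beads, a, rng, skip, max_attempts=100000):
    """Collect self-avoiding chains by simple rejection."""
    l0 = 1.0 / n_beads
    accepted, attempts = [], 0
    while len(accepted) < n_traj and attempts < max_attempts:
        attempts += 1
        chain = grow_chain(n_beads, a, rng)
        if is_self_avoiding(chain, 0.1 * a * l0, skip):  # 0.1 b with b ~ a l0
            accepted.append(chain)
    return accepted

rng = np.random.default_rng(0)
chains = sample_trajectories(n_traj=5, n_beads=100, a=20.0, rng=rng, skip=5)
angles = sample_zenith(50.0, 20000, rng)
```

A diblock chain would follow the same loop with one value of *a* for the joints in segment A and another for those in segment B; averaging the Debye sum over the accepted trajectories then gives one simulated structure factor.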

### APPENDIX B: SINGULAR VALUE DECOMPOSITION

Through MC simulations, $S(QL)$ was obtained based on the conformational parameters of $\log b_{A}\in[-2,-1]$, $\log\gamma\in[0,2]$, and $f\in[0,1]$, respectively. Each conformation parameter was sampled uniformly from these intervals. Overall, 6331 $S(QL)$ were generated by thoroughly sampling each conformational state where one fifth of generated $S(QL)$ with $f\in[0.3,0.7]$ were assigned as the test set and excluded from the training process. The dimension of the vector space of $S(QL)$ was 65 because there were 65 sampled *QL* points in each simulated $S(QL)$. Each $S(QL)$ was therefore represented by a point in this $\mathbb{R}^{65}$ vector space. Visualization of the distribution of these points can be facilitated by dimensionality reduction: The data were arranged into a 65 × 6331 matrix $\mathbf{F}$. Using singular value decomposition (SVD),^{23} $\mathbf{F}$ can be decomposed into $\mathbf{F}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$, where $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices, whose columns are the eigenvectors of $\mathbf{F}\mathbf{F}^{T}$ and $\mathbf{F}^{T}\mathbf{F}$, respectively. $\boldsymbol{\Sigma}^{2}$ is a diagonal matrix whose entries are the eigenvalues of $\mathbf{F}\mathbf{F}^{T}$. $\mathbf{F}$ can be further centralized, and by definition, $\frac{\mathbf{F}\mathbf{F}^{T}}{6331-1}$ is the covariance matrix $\mathbf{C}$, which contains the information of correlation between different data points. $\mathbf{C}$ can be expressed as $\mathbf{C}=\frac{\mathbf{U}\boldsymbol{\Sigma}^{2}\mathbf{U}^{T}}{6331-1}$, where the column vectors in $\mathbf{U}$ are the principal axes, which form an orthonormal basis in $\mathbb{R}^{65}$. The eigenvalues of $\frac{\boldsymbol{\Sigma}^{2}}{6331-1}$ are the percentages of the variance of the original data projected onto each corresponding principal axis. The variance of original data was found to be mostly retained by the first three singular value ranks. This principal component analysis allows us to re-express the data as a set of three orthogonal variables to extract the intrinsic correlations of original data.
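The decomposition described in this appendix can be reproduced in a few lines. The sketch below uses a synthetic 65 × 500 matrix with three latent directions in place of the simulated library, so the matrix size, the random data, and the function name `principal_axes` are assumptions for illustration only. The explained-variance fractions are obtained by normalizing the covariance eigenvalues by their sum.

```python
import numpy as np

def principal_axes(F):
    """SVD-based principal component analysis.

    F has shape (n_features, n_samples): each column is one sampled curve.
    Returns the principal axes U, the covariance eigenvalues, and the
    fraction of the total variance carried by each axis.
    """
    n_samples = F.shape[1]
    F_centered = F - F.mean(axis=1, keepdims=True)   # centre every Q channel
    U, s, _ = np.linalg.svd(F_centered, full_matrices=False)
    variances = s**2 / (n_samples - 1)               # eigenvalues of the covariance matrix
    return U, variances, variances / variances.sum()

# Synthetic stand-in for the 65 x 6331 data matrix (hypothetical data).
rng = np.random.default_rng(0)
basis = rng.normal(size=(65, 3))
latent = rng.normal(size=(3, 500)) * np.array([[3.0], [1.5], [0.7]])
F = basis @ latent + 0.01 * rng.normal(size=(65, 500))

U, variances, ratio = principal_axes(F)

# Coordinates of every sample along the first three principal axes.
scores = U[:, :3].T @ (F - F.mean(axis=1, keepdims=True))
```

Plotting the three rows of `scores` against each other, colored by a conformational parameter, gives the kind of three-dimensional view used in Fig. 2.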

### APPENDIX C: GAUSSIAN PROCESS REGRESSION

In the context of the Gaussian process (GP), a function *g* relating *S*(*QL*) and *l*_{A}, *γ*, and *f* can be formulated as $g\sim\mathcal{GP}(\mu,k)$ in terms of a prior mean function *μ* and a prior covariance function *k*. Given a training set (*X*, *Y*), where *X* represents the *n* sets of *S*(*QL*) in the training set and *Y* represents the corresponding regression targets of *l*_{A}, *γ*, and *f*, the purpose of the ML process is to determine *μ* and *k* from the knowledge of training data. Following the standard procedure of GP,^{19–21} a constant function was used to specify *μ*. The *n* × *n* covariance matrix *K*_{XX} specifies the correlations between the training data pairs modeled by the radial basis function (RBF) kernel.^{20–22} Specifically, for *x*, *x*′ in *X*, kernels $k(x,x')$ as entries of *K*_{XX} were formulated by the following squared exponential expression:

where *l* is the correlation length, which sets the scale over which training data points are regarded as similar. The Gaussian observational noise term was added to the kernel matrix, *K* = *K*_{XX} + *σ*^{2}*I*, where *σ*^{2} is the variance of the observational noise. Given a test point *X*_{*}, the goal of GPR is to estimate *Y*_{*} = *g*(*X*_{*}), where the covariance matrix of the test set is denoted as $K_{X_{*}X_{*}}$, and $K_{XX_{*}}$ consists of entries measuring the correlations between training and test points. For simplicity, *K* ≡ *K*_{X}, $K_{*}\equiv K_{XX_{*}}$, and $K_{**}\equiv K_{X_{*}X_{*}}$. For an *S*(*QL*) measured under ergodic conditions, the relevancy of its *l*_{A}, *γ*, and *f* with those in the training sets should follow the same correlation patterns deduced from the training process. Therefore, they can be determined from this joint distribution,

where *T* indicates the matrix transposition and *N* denotes the normal distribution. The hyperparameters were determined by maximizing the log marginal likelihood using the gradient descent algorithm during training. Given the Gaussian likelihood, *Y*_{*} and its variance can be predicted from the posterior $p(Y_{*}\,|\,X,X_{*},Y)$ defined by the conditioned Gaussian distribution,
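In standard GPR notation, the kernel, the joint distribution, and the posterior referred to in this appendix take the form below. Two simplifications are assumed here for compactness and may differ from the convention actually used: the kernel has unit amplitude, and the constant prior mean *μ* has been subtracted from the targets so that the joint distribution is centered at zero.

$$
k(x,x')=\exp\left(-\frac{\lVert x-x'\rVert^{2}}{2l^{2}}\right)
$$

$$
\begin{bmatrix} Y \\ Y_{*} \end{bmatrix}
\sim N\left(0,\;
\begin{bmatrix} K & K_{*} \\ K_{*}^{T} & K_{**} \end{bmatrix}\right)
$$

$$
p\left(Y_{*}\,|\,X,X_{*},Y\right)
= N\left(K_{*}^{T}K^{-1}Y,\; K_{**}-K_{*}^{T}K^{-1}K_{*}\right)
$$

With *n* training and *m* test points, *K* is *n* × *n* (noise term included), $K_{*}$ is *n* × *m*, and $K_{**}$ is *m* × *m*, so the posterior mean has *m* entries and the posterior covariance is *m* × *m*.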

In this paper, we adopted the GaussianProcessRegressor class of the scikit-learn (sklearn) library^{19,35} to implement GPR for the sake of efficiency and ease of deployment.
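A minimal usage sketch of `GaussianProcessRegressor` is given below for one regression target. It is an illustration under stated assumptions rather than the configuration used in this work: the toy curves standing in for $\Delta S(QL)$, the kernel composition (`ConstantKernel * RBF + WhiteKernel`), the initial hyperparameter values, and the use of `normalize_y=True` to play the role of a constant prior mean are all choices made here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
n_q = 65
q = np.linspace(0.0, 1.0, n_q)

def toy_curve(p):
    """Hypothetical smooth map from a parameter to a 65-point curve."""
    return np.exp(p[:, None] * q[None, :])

# Training set: curves X and the parameter Y that generated them.
y_train = rng.uniform(-2.0, -1.0, size=200)
X_train = toy_curve(y_train) + 1e-3 * rng.normal(size=(200, n_q))

# RBF kernel for data similarity plus a white-noise term for observational noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X_train, y_train)      # hyperparameters set by maximizing the log marginal likelihood

# Inversion of unseen curves: posterior mean and standard deviation.
y_test = rng.uniform(-1.9, -1.1, size=20)
X_test = toy_curve(y_test)
y_mean, y_std = gpr.predict(X_test, return_std=True)
```

One regressor of this kind would be trained per conformational parameter, in line with the three covariance matrices introduced in Sec. II; `y_std` is the quantity from which a confidence interval for each inverted value can be reported.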