Tutorial: AI-assisted exploration and active design of polymers with high intrinsic thermal conductivity

Designing polymers with high intrinsic thermal conductivity (TC) is critically important for the thermal management of organic electronics and photonics. However, this is a challenging task owing to the diversity of the chemical space and the barriers to advanced synthetic experiments/characterization techniques for polymers. In this Tutorial, the fundamentals and implementation of combining classical molecular dynamics simulation and machine learning (ML) for the development of polymers with high TC are comprehensively introduced. We begin by describing the core components of a universal ML framework, involving polymer datasets, property calculators, feature engineering and informatics algorithms. Then, the process of constructing interpretable regression algorithms for TC prediction is introduced, aiming to extract the underlying relationships between microstructures and TCs for polymers. We also explore the design of sequence-ordered polymers with high TC using lightweight and mainstream active learning algorithms. Lastly, we conclude by addressing the current limitations and suggesting potential avenues for future research on this topic.


I. INTRODUCTION
Achieving high TC in polymers for industrial applications is an urgent demand, and some progress has been realized recently. 21,22 Different fabrication techniques such as micromechanical stretching, 23-26 electrostatic spinning 27-29 and nanoscale templating 30,31 have been employed in the exploitation of polymers; these techniques effectively improve the ordering and crystallinity of the polymer chains, which thus exhibit high TC. Moreover, the construction of thermal networks by engineering interchain interactions such as hydrogen bonding, 32 π-π stacking 33 and side-chain modification 34 in polymer blends and copolymers has also been demonstrated to be beneficial for enhanced TC. Taking polyethylene (PE) as an example, the TC of PE films 25 and nanofibers 24 obtained by mechanical stretching was found to be as high as ~62 and ~104 W m-1 K-1, respectively, two to three orders of magnitude greater than that of typical polymers.
Although engineered polymers can be produced experimentally to achieve increased TC, doing so requires a strong chemical background from investigators and is limited by processing and characterization instruments. Further, the applicability of different techniques is restricted; e.g., micromechanical stretching is unsuitable for brittle polymers. 35-40 Computational approaches including first-principles calculations 41,42 and molecular dynamics (MD) simulations 43 have led the way in revealing the effect of polymer nanostructures on TC. The first-principles calculation of TC is based on the computation of interatomic force constants via density-functional theory (DFT). On this basis, all relevant phonon properties can be calculated using lattice dynamics and the Boltzmann transport equation. 42 This method has been successfully applied to molecular crystals such as PE, 41 polyvinylidene fluoride 41 and polythiophene, 42 but is challenging to apply to amorphous systems, owing to quantum nuclear motion and their complex primitive cells.
MD simulations employ classical force fields combined with Newtonian mechanics and statistical physics to derive the macroscopic properties of systems. 44 MD simulations can handle polymer systems containing tens of thousands of atoms and are widely used in polymer thermal transport studies, not only to predict the TC of hierarchical structures, but also to probe the linkages between microstructures and TCs. 46,47 These results are encouraging and inspire researchers to make further efforts to develop polymers with high TC. Moreover, the effects on TC of polymer properties such as molecular weight, 48,49 chain length, 50-54 side chains, 34,55-57 and chain conformation; 49,58-61 intra-chain effects such as bonds, 62-64 angles, 65,66 and dihedrals; 67-69 and inter-chain behaviours such as molecular cross-linking, 70-75 hydrogen bonding networks 76-79 and π-π stacking 56,80-82 have been extracted in separate MD simulations. However, current investigations on the thermal transport mechanisms of polymers have mainly focused on common polymers such as PE, polytetrafluoroethylene (PTFE), polyvinylidene fluoride, and other conjugated structures. 83 For a long time, researchers have been working on exploring quantitative structure-activity relationships (QSAR) from chemical data, which in turn has enabled the rational design of innovative materials, including polymers, inorganic and biological components. 84 QSAR models reveal empirical, linear or non-linear relationships between descriptors extracted from chemical structures and computational/experimental properties or activities.
85 As modern research methods have facilitated the proliferation of chemical data, the data-driven research paradigm is of critical importance for QSAR modeling. 86-88 Moreover, virtual chemical reactions 93 or generative algorithms 94 for small molecules can create a nearly infinite chemical space. 117-120 In this Tutorial, we introduce some development paradigms combining high-throughput MD simulations and ML for high-TC polymers, in the hope of inspiring new researchers who are interested in becoming involved in this field. We start with a description of the three core components of polymer informatics, namely polymer datasets, TC simulation methods and polymer representations, in Section II. Following this, interpretable regression models are constructed for mapping polymer microstructures to TCs in Section III. Next, active optimization algorithms are utilized for the design of polymers with high thermal conductivity in Section IV, including single- and multi-objective cases. Our conclusions and outlooks for this area are provided in Section V.

II. METHODOLOGY
The principle of polymer informatics is to establish patterns from a sufficient amount of existing or generated polymer data, thus facilitating the design/discovery of new functional polymers with improved target properties. 89

A. Polymer datasets
Well-organized and clean data is a fundamental prerequisite for the accomplishment of high-fidelity informatics algorithms. To promote the openness and sharing of scientific data, the findable, accessible, interoperable, reusable (FAIR) principles have been proposed for data storage and management, 121,122 which have also greatly contributed to the progression of materials genetic engineering. 123 Inorganic materials have more extensive and accessible databases than polymeric materials, such as Materials Project, 124 Atomly, 125 ICSD 126 and AFLOW. 127 These inorganic databases usually follow the FAIR guidelines and provide application programming interfaces (APIs) for access and download. However, the development of polymer databases is relatively slow, and most are limited to access only. This is attributed to the fact that large polymer primitive cells make experimentation/computation difficult and costly, 6 and empirical nomenclature 128 and diverse forms of expression (strings and graphs) of polymers prevent text-mining techniques from obtaining property data from the published scientific literature. 129 A few representative databases, including PoLyInfo, 130 Khazana, 131 Polymers: A Property Database, 132 Polymer Property Predictor and Database, 133 CAMPUS 134 and PI1M, 94 are listed in Table I. PoLyInfo 130 is one of the largest polymer experimental databases, with over 18,000 homopolymers and 7,000 copolymers from about 20,000 scientific publications, covering hundreds of properties such as refractive index, dielectric constant and glass transition temperature. Nevertheless, the abundance of different properties is quite variable; for example, more than 8,000 homopolymers have a recorded glass transition temperature, while only 84 homopolymers expose TC (accessed 2023.12.14). Moreover, PoLyInfo is maintained by the National Institute for Materials Science of Japan, and downloading large amounts of data is prohibited. To address this concern, Hayashi et al.
135 selected 1070 amorphous polymers from the PoLyInfo database and calculated the associated 15 properties, including TC, using all-atom classical MD simulations. Ma et al. 94 trained a recurrent neural network on ~12,000 polymers collected from PoLyInfo, and then generated ~1 million polymers to form the benchmark database PI1M, which covers a similar chemical space to the training dataset.
Databases of computational properties of polymers are rarer than experimental databases; Khazana 131 is a typical one. The Khazana database is maintained by Ramprasad's group, includes DFT-computed refractive index, dielectric constant and band gap, and was used as training data to build the open informatics platform Polymer Genome. 136 In Table I, we also describe three databases containing small-molecule datasets, namely PubChem, 90 eMolecules 137 and Materials Project. 138 With a grasp of the laws of chemical reactions, virtual reactions can convert small molecules into polymers through ML 139,140 or specific grammar rules. 93,112,141 Yang et al. 112 established a polymer dataset containing more than 8 million hypothetical polyimides formed by known dianhydride and diamine/diisocyanate pairs from PubChem. 90 Kim et al. 93 released a vast polymer database known as the Open Macromolecular Genome, which began with approximately 24 million potential reactant molecules from eMolecules, 137 and then formed synthesizable polymer chemistries compatible with 17 polymerization reactions covering step-growth, chain-growth addition, ring-opening and metathesis polymerizations. Materials Project, a well-known material database, recently integrated more than 170,000 molecules studied using density functional theory, which can be queried through an OpenAPI-compliant application programming interface.

B. Polymer simulation
Polymer simulation includes modeling, force field assignment, equilibrium simulation, and TC calculation. While discrete software packages such as RDKit, 142 Materials Studio, 143 CHARMM-GUI, 144 Packmol, 145 Moltemplate, 146 Pysimm, 147 Polymer Structure Predictor (PSP), 148 Enhanced Monte Carlo (EMC) 149 and the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) 150 each cover parts of this workflow, RadonPy 151 is a Python library that fully automates the calculation of various properties of polymers using all-atom classical MD simulations. The entire simulation process, including modeling, equilibrium and non-equilibrium MD simulations, can be automated by taking only the simplified molecular-input line-entry system (SMILES) 152 string of the polymer repeating unit as input. Furthermore, a computational database has been released containing more than 1000 unique amorphous polymers with various thermophysical properties calculated using RadonPy. Given the powerful and convenient capabilities of RadonPy, the polymer simulations and ML training data for this Tutorial rely primarily on RadonPy and its derived computational dataset.

Polymer modeling
The polymer modeling procedure is illustrated in Fig. 2(a). The SMILES string is used as a unique identifier to distinguish between different polymer structures, where two asterisks denote the connection points, and it is fed as an input parameter to RadonPy. Then, the repeating unit is linked by a self-avoiding random-walk algorithm to form an individual polymer chain. 153,154 The degree of polymerization of the polymer chain is controlled by the total number of atoms, which was uniformly set to around 1000 to eliminate the dependence of the physical properties on the molecular weight. After that, the second-generation General AMBER Force Field (GAFF2) was assigned to the polymer chain, whose potential energy is expressed as: 155-159

E = Σ_bonds K_b(r − r_0)² + Σ_angles K_θ(θ − θ_0)² + Σ_dihedrals K_d[1 + cos(n_d φ − γ)] + Σ_impropers K_χ(χ − χ_0)² + Σ_{i<j} { ε_ij[(σ_ij/r_ij)¹² − 2(σ_ij/r_ij)⁶] + q_i q_j/(4πε_0 r_ij) }

where K_b, K_θ, K_d and K_χ are the force constants of the bond, bond angle, dihedral angle, and improper angle, respectively; r, θ, φ and χ are the bond length, bond angle, dihedral angle, and improper angle, respectively; r_0, θ_0 and χ_0 are the equilibrium structural parameters of the bond, bond angle, and improper angle, respectively; n_d is the multiplicity and γ is the phase angle; q_i and q_j are the charges of the i-th and j-th atoms; r_ij is the distance between atoms i and j; and ε_ij and σ_ij are the depth of the potential well and the equilibrium distance of the Lennard-Jones potential, respectively.
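As a small illustration of how the chain length is chosen, the sketch below picks a degree of polymerization so the chain reaches roughly 1000 atoms. The helper function and the repeat-unit atom counts are our own illustrations, not RadonPy's API:

```python
# Sketch: choose a degree of polymerization so that the chain has ~1000
# atoms, mimicking the RadonPy-style setup described in the text.
TARGET_ATOMS = 1000

def degree_of_polymerization(atoms_per_repeat_unit, target=TARGET_ATOMS):
    """Number of repeat units needed to reach roughly `target` atoms."""
    return max(1, round(target / atoms_per_repeat_unit))

# Polyethylene repeat unit -CH2-CH2- has 6 atoms (2 C + 4 H, H included)
n_pe = degree_of_polymerization(6)
print(n_pe)        # 167 repeat units
print(n_pe * 6)    # 1002 atoms in the chain
```

Fixing the atom count rather than the repeat count keeps chains of different monomers at comparable molecular sizes.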
In RadonPy, if the bond angle parameters K_θ and θ_0 are missing from the predefined parameter set, they are automatically estimated in the same manner as GAFF2.
To obtain a simulation cell, the single polymer chain was duplicated into 10 copies by translational and rotational operations to prevent overlap, and the copies were placed in a large box with a density of ~0.05 g/cm3. A packing simulation was then performed to increase the density of the amorphous system to a suitable level. An NVT (constant number of atoms, volume, and temperature) simulation with a Nosé-Hoover thermostat was applied to the system in three sequential stages (at a temperature of 300 K, from 300 K to 700 K, and held at 700 K) under periodic boundary conditions (PBC) with a time step of 1 fs. Each of these stages took 1 ns, and all bonds and angles were constrained by the SHAKE algorithm, resulting in a packed cell with a density of around 0.80 g/cm3.
Equilibrium simulation was executed for the structural relaxation of the amorphous polymers, following the 21-step compression/relaxation scheme. 160 During the simulation, by combining NVT and NPT (constant number of atoms, pressure, and temperature) simulations with a Nosé-Hoover thermostat, the temperature was repeatedly raised to 600 K and lowered to 300 K for about 1.5 ns while the system was compressed to 50,000 atm and then depressurized to 1 atm. In RadonPy, the amorphous system is considered to be in equilibrium when it satisfies the following conditions: the fluctuations in the total, kinetic, bonding, bond angle, dihedral, van der Waals, and long-range Coulomb energies have relative standard deviations (RSDs) of less than 0.05%, 0.05%, 0.1%, 0.1%, 0.2%, 0.2%, and 0.1%, respectively, and, at the same time, the RSDs of the fluctuations in density and the radius of gyration are less than 0.1% and 1%, respectively. The NPT simulation was run at 300 K and 1 atm with a time step of 1 fs. The simulated system was checked for an equilibrium state every 50 ns until the equilibrium requirements were realized. 135 The density ρ of the equilibrated system can be denoted as:

ρ = m / ⟨V⟩

where m is the sum of the atomic masses and ⟨V⟩ is the time-averaged system volume.
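The RSD-based equilibration check above can be sketched in a few lines of numpy; the time series below is synthetic, and the 0.1% threshold is the density criterion from the text:

```python
import numpy as np

# Sketch of a RadonPy-style equilibration check: the relative standard
# deviation (RSD = std / |mean|) of a property's time series is compared
# against a threshold (e.g. 0.1% for density).
def rsd(series):
    series = np.asarray(series, dtype=float)
    return np.std(series) / abs(np.mean(series))

def is_equilibrated(series, threshold):
    return rsd(series) < threshold

rng = np.random.default_rng(0)
density = 0.95 + 1e-4 * rng.standard_normal(5000)  # tiny fluctuations
print(is_equilibrated(density, 0.001))  # True: RSD well below 0.1%
```

In practice each monitored quantity (energies, density, radius of gyration) gets its own threshold, as listed in the text.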
The number density n is calculated from the number of atoms N and the volume V of the equilibrated system:

n = N / V

The radius of gyration R_g is given as:

R_g = [ (1/N_r) Σ_{i=1}^{N_r} (r_i − r_m)² ]^{1/2}

where r_i is the position of a repeating unit, N_r is the number of repeating units in the polymer chain and r_m is the mean position of these repeating units.
The persistence length λ_p can be further obtained as: 51

λ_p = 3R_g² / (Nb)

where N is the degree of polymerization of the polymer chain and b is approximated as the length of the repeating unit, so that Nb is the chain contour length (this is the flexible-chain limit of the worm-like chain model, R_g² = λ_p Nb/3).
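The structural quantities above can be computed directly from the equilibrated coordinates. A minimal numpy sketch follows; the random-walk coordinates are synthetic, and the persistence-length relation is the flexible-chain estimate given above:

```python
import numpy as np

def radius_of_gyration(positions):
    """R_g from repeating-unit positions: R_g^2 = <|r_i - r_mean|^2>."""
    r = np.asarray(positions, dtype=float)
    r_mean = r.mean(axis=0)
    return np.sqrt(((r - r_mean) ** 2).sum(axis=1).mean())

def persistence_length(rg, n_units, unit_length):
    """Flexible-chain estimate: lambda_p = 3 R_g^2 / (N * b)."""
    return 3.0 * rg ** 2 / (n_units * unit_length)

# Synthetic chain: a random walk of 200 repeat units, step length 1.5 A
rng = np.random.default_rng(1)
steps = rng.standard_normal((200, 3))
steps *= 1.5 / np.linalg.norm(steps, axis=1, keepdims=True)
chain = np.cumsum(steps, axis=0)

rg = radius_of_gyration(chain)
print(persistence_length(rg, n_units=200, unit_length=1.5))
```

For a real trajectory one would additionally average R_g over frames, as implied by the RSD criterion above.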

Equilibrium molecular dynamics methods
Equilibrium molecular dynamics (EMD) simulation is executed in an equilibrium state without temperature gradients, so reasonable structural configurations and careful relaxation and optimization are essential for the accuracy of the TC estimation. The TC of polymers in EMD simulation is calculated by the Green-Kubo formalism: 45,161-165

λ_μ = [V / (k_B T²)] ∫_0^∞ ⟨J_μ(0) J_μ(τ)⟩ dτ

where V is the volume of the amorphous system, k_B is the Boltzmann constant, T is the temperature, J_μ is the heat flux in the μ-direction, τ is the correlation delay time, and ⟨J_μ(0) J_μ(τ)⟩ denotes the heat flux autocorrelation function (HACF). Figs. 2(b)-(d) provide a case study of the TC calculation for PTFE using EMD. The optimized system contains ~10,000 atoms, and we additionally performed an NVE simulation of 20 ps to obtain 10 HACFs with a sampling interval of 2 fs. After the HACFs decayed to 0, the TC of PTFE stabilized at an average value of 0.27 W m-1 K-1.
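Numerically, the Green-Kubo TC amounts to averaging the HACF over time origins and integrating it over the delay time. A minimal sketch with a synthetic, exponentially correlated heat-flux trace (the flux, volume, and temperature values are placeholders, not simulation output):

```python
import numpy as np

def green_kubo_tc(J, dt, V, T, kB=1.380649e-23, max_delay=1000):
    """Green-Kubo TC in one direction:
    lambda = V/(kB T^2) * integral of <J(0) J(tau)> d tau.
    J: heat-flux time series; dt: sampling interval."""
    n = len(J)
    # HACF averaged over time origins, for each delay index k
    hacf = np.array([np.mean(J[: n - k] * J[k:]) for k in range(max_delay)])
    integral = hacf.sum() * dt            # rectangle-rule integration
    return V / (kB * T ** 2) * integral, hacf

rng = np.random.default_rng(2)
# Synthetic exponentially-correlated flux (Ornstein-Uhlenbeck-like)
J = np.zeros(20000)
for i in range(1, len(J)):
    J[i] = 0.99 * J[i - 1] + rng.standard_normal()

tc, hacf = green_kubo_tc(J, dt=2e-15, V=1e-25, T=300.0)
print(hacf[0] > hacf[200])  # HACF decays with delay time
```

In a real EMD workflow the integral is truncated once the HACF has decayed to zero, and several independent HACFs are averaged, as in the PTFE case above.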

Non-equilibrium molecular dynamics methods
Non-equilibrium molecular dynamics (NEMD) methods impose thermostats (normal NEMD) or swap the kinetic energy of atoms between two regions (reverse NEMD) to establish a temperature gradient and a corresponding heat flux. Once the system reaches a steady state, the TC can be derived from Fourier's law:

λ = −J / (∂T/∂x)

where J is the heat flux and ∂T/∂x is the temperature gradient in the thermal transport direction. In RadonPy, the TC is calculated by the reverse NEMD method proposed by Müller-Plathe. 166 As shown in Fig. 2(e), the equilibrated system was replicated three times along the x-direction and then divided into N (N = 20) slabs. By exchanging the velocity between the coldest atom in slab N/2 and the hottest atom in slab 0, temperature gradients were formed and recorded for TC evaluation. To prevent temperature shifts after cell replication, the system was initially run at 300 K for 2 ps. After that, the reverse NEMD simulation was run at 300 K for 1 ns with a velocity swapping interval of 200 fs. 135 From the exchanged energy ΔE obtained using the Müller-Plathe algorithm and the temperature gradient ∂T/∂x output by the reverse NEMD simulation (Fig. 2(f)), the TC λ can be expressed as:

λ = ΔE / [2AΔt(∂T/∂x)]

where A is the cross-sectional area and Δt is the simulation time; the factor of 2 accounts for the heat flowing in both directions under periodic boundary conditions.
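The final expression is simple arithmetic once ΔE and the gradient are known; a sketch with made-up numbers of roughly the right magnitude (all values are illustrative, not simulation output):

```python
# Sketch: TC from the Mueller-Plathe reverse NEMD outputs.
#   lambda = dE / (2 * A * dt * dT/dx)
# The factor 2 reflects heat flowing in both directions under PBC.

def muller_plathe_tc(delta_E, area, sim_time, grad_T):
    return delta_E / (2.0 * area * sim_time * grad_T)

delta_E = 4.0e-17   # exchanged kinetic energy, J (illustrative)
area = 9.0e-18      # cross-sectional area, m^2 (~3 nm x 3 nm)
sim_time = 1.0e-9   # s (1 ns of reverse NEMD)
grad_T = 1.0e10     # K/m (~10 K/nm)
print(muller_plathe_tc(delta_E, area, sim_time, grad_T))  # ~0.22 W/(m K)
```

The result lands in the typical amorphous-polymer range of a few tenths of a W m-1 K-1, consistent with the PTFE example above.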
Additionally, an NEMD simulation of 100 ps was implemented for a thermal conductivity decomposition analysis. The energy flux along the direction of the unit vector u can be expressed as the contribution of convection (first term) and interatomic interactions (second term): 155,167

J_u = (1/V) [ Σ_i e_i (v_i · u) − Σ_i (S_i v_i) · u ]

where v_i is the velocity of atom i, e_i is the sum of the potential and kinetic energies of atom i, i is the index of atoms, and S_i is the per-atom stress tensor. For components (a, b), the stress tensor S_ab can be detailed as: 135

S_ab = − [ Σ_{n=1}^{N_p} r_{0a} F_b + Σ_{n=1}^{N_b} r_{0a} F_b + Σ_{n=1}^{N_a} r_{0a} F_b + Σ_{n=1}^{N_d} r_{0a} F_b + Σ_{n=1}^{N_i} r_{0a} F_b + S_ab^{K-space} ]

The stress tensor is thus divided into six contributions: pairwise, bond, angle, dihedral, improper and K-space, respectively, where F is the force acting on atom i due to the corresponding interaction, r_0 stands for the relative position of atom i to the geometric center of its interacting atoms, and N_p, N_b, N_a, N_d, and N_i are the numbers of pair, bond, bond angle, dihedral angle, and improper angle interactions, respectively. The normalized TC contribution c_α of each of the heat flux terms is:

c_α = ⟨J_α⟩ / ⟨J_tot⟩

where J_tot is the total heat flux calculated by Eq. (9) and J_α is the partial heat flux solved from Eqs. (10) and (11).

C. Feature engineering
Polymer descriptors translate structural and chemical information about polymers into a machine-readable form for ML model training. 168 Successful descriptors are required to express polymer information uniquely, completely and minimally, and are a prerequisite and important condition for guaranteeing the high accuracy of ML models. To date, many polymer representations have emerged for polymer informatics, which can be grouped into three categories: 169 string-based descriptors, graph-based descriptors and physical-based descriptors, as illustrated in Fig. 3.

String-based descriptors
String-based descriptors are efficient and convenient line notations, such as SMILES 152,170 and self-referencing embedded strings (SELFIES). 171,172 SMILES is a popular polymer representation that uniquely encodes atoms, bonds, rings and branches in polymer monomers as an ASCII string. SMILES allows a relatively uniform expression of polymers, does not depend on a strong chemical background, and has good readability for both humans and computers. Thus, SMILES is widely used in data storage, 136 modeling 135,147,148,173 and ML 168,174 of polymers, and has become a standard tool in computational chemistry. However, along with the progress of polymer informatics, SMILES has also exposed some limitations. For instance, in some inverse design tasks based on evolutionary or deep learning algorithms, many of the generated SMILES strings do not correspond to valid polymers. 175 SELFIES, an improved alternative, 93 was proposed to address this issue; it can represent every polymer and be directly applied in arbitrary ML models without adaptation of the models. 178

Graph-based descriptors
Graph-based descriptors are organic chemical representations based on topological information, involving substructure statistics, interatomic connections, and relative positional relationships.
Molecular access system (MACCS) keys 179 are one of the most commonly used structural keys, and have been integrated into some open-source cheminformatics software, including RDKit, 142 CDK 180 and OpenBabel. 181 The generated MACCS keys contain 167 bits: the first bit determines whether the molecule has a predefined feature (set to 1 if it exists, else 0), and the last 166 bits correspond to 166 substructures. 182-184 The generation of Morgan fingerprints for a polymer monomer requires three steps: 185 1) Initialization: each atom to be encoded is initialized with a unique integer identifier. 2) Iteration: in each iteration, the identifier of each atom is updated to a combination of its own and its neighbours' identifiers, and the emerging identifiers are hashed to yield a fixed-length bit vector. Once all atoms have been given new identifiers, the old ones are replaced and the new identifiers are retained. 3) Post-processing: after a prespecified number of iterations, duplicate atom identifiers are removed, and the Morgan fingerprint is formed by the retained identifiers. Morgan fingerprints for polymer representation have the advantages of high efficiency, convenience and the absence of a pre-training process, but their feature dimensions are very high and sparse, and bit collisions may be introduced by the hashing process. Mol2vec 186 is an unsupervised method inspired by natural language processing techniques that considers compound substructures derived from the Morgan algorithm as words and compounds as sentences. Ma et al.
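The three Morgan steps can be sketched in plain Python on a toy molecular graph. This is a simplified stand-in for the RDKit implementation: real Morgan fingerprints hash much richer atom invariants (element, degree, charge, ring membership), while here the "molecule" is just an adjacency list with element labels:

```python
# Toy Morgan-style fingerprint: initialization, neighbour-merging
# iterations, and hashing the retained identifiers into a bit vector.

def morgan_like_fingerprint(elements, bonds, radius=2, n_bits=64):
    neighbours = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # 1) Initialization: one integer identifier per atom
    ids = [hash(e) for e in elements]
    all_ids = set(ids)

    # 2) Iteration: merge each atom's identifier with its neighbours'
    for _ in range(radius):
        ids = [hash((ids[i], tuple(sorted(ids[j] for j in neighbours[i]))))
               for i in range(len(elements))]
        all_ids.update(ids)

    # 3) Post-processing: deduplicated identifiers -> fixed-length bits
    bits = [0] * n_bits
    for ident in all_ids:
        bits[ident % n_bits] = 1
    return bits

# Ethanol-like toy graph: C-C-O
fp = morgan_like_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sum(fp))  # number of set bits
```

The `ident % n_bits` step is where the bit collisions mentioned above can occur: two distinct substructure identifiers may map to the same bit.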

Physical-based descriptors
Exploring a collection of physically independent descriptors for characterizing molecular structures is important for quantitative structure-property relationship building and provides a more intuitive guide to molecular performance assessment. 189 Benefiting from advances in the feature engineering of drug-like molecules, some chemoinformatics software 190-192 is available for the automatic calculation of molecular descriptors. For example, Mordred 192 is a mainstream descriptor-calculation package that can calculate more than 1800 descriptors, including constitutional descriptors, 2D topological descriptors and 3D geometric descriptors. 168 However, polymer monomers differ from small molecules owing to the presence of connection-site information, which prevents the creation of 3D structures and leads to the inability to obtain some geometric descriptors. To compensate for the lack of 3D information, we incorporated molecular force field (FF) parameters as additional descriptors and designed automated physical feature engineering for polymer descriptor optimization in our previous work. 83,120 The initial descriptors are a joint collection, calculated by the Mordred software and extracted from the parameters of the polymer structural data file after the FF assignment. Afterwards, feature down-selection technology is employed to acquire the optimized descriptors, which consists of three stages: 1) Evaluate the numerical fluctuations of each physical descriptor using a variance indicator and remove features with low variance, since these features have less impact on the target properties. This is beneficial for reducing the complexity and improving the performance of ML models. For a specific physical descriptor, the variance can be denoted as:

σ² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

where n is the number of candidates, x_i is the value of the feature for each candidate, and x̄ is the average of the feature values.
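Stage 1 can be sketched with numpy; the descriptor matrix and the 0.10 threshold below are illustrative:

```python
import numpy as np

def variance_filter(X, threshold=0.10):
    """Drop descriptor columns whose variance falls below the threshold.
    X: (n_candidates, n_descriptors) descriptor matrix."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)   # (1/n) * sum (x_i - x_mean)^2 per column
    keep = variances >= threshold
    return X[:, keep], keep

rng = np.random.default_rng(3)
X = np.column_stack([
    rng.normal(0.0, 2.0, 100),   # high-variance descriptor (kept)
    np.full(100, 5.0),           # constant descriptor (removed)
    rng.normal(0.0, 0.01, 100),  # near-constant descriptor (removed)
])
X_kept, mask = variance_filter(X, threshold=0.10)
print(mask.tolist())  # [True, False, False]
```

Constant or near-constant descriptors carry no discriminating information across candidates, which is why they can be dropped before any model is trained.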
2) Filter and remove features with poor correlation to the target property using four correlation metrics: the Pearson, Spearman, distance, and maximum information coefficients (MIC). For two variables X and Y, the Pearson coefficient can be solved by:

r_p = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n σ_X σ_Y)

where X̄ and Ȳ are the mean values of the two variables X and Y, respectively, σ_X and σ_Y are their standard deviations, and the subscript i indexes the i-th sample of the variables X and Y.
The Spearman coefficient is defined as:

r_s = 1 − 6 Σ_{i=1}^{n} d_i² / [n(n² − 1)]

where d_i is the difference in the ranks of the corresponding variables, i.e., the difference in the positions (ranks) of a pair of values after the two variables have been individually ranked. In addition, the distance correlation coefficient 193 and MIC 194 are used to measure possible nonlinear correlations between variables.
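The two coefficients above can be computed directly from their definitions; a numpy sketch (for distance correlation and MIC one would instead use dedicated libraries such as dcor or minepy, which are not shown here):

```python
import numpy as np

def pearson(x, y):
    """Pearson r from the definition, with population standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) * x.std() * y.std())

def spearman(x, y):
    """Spearman r_s = 1 - 6*sum(d_i^2) / (n*(n^2-1)), assuming no ties."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    d = rank(np.asarray(x)) - rank(np.asarray(y))
    n = len(x)
    return 1.0 - 6.0 * (d ** 2).sum() / (n * (n ** 2 - 1))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson(x, 2 * x + 1))    # 1.0: perfectly linear relationship
print(spearman(x, np.exp(x)))   # 1.0: perfectly monotonic, though nonlinear
```

The contrast in the two prints is the point of using both metrics: Pearson captures linear dependence, while Spearman still scores a nonlinear but monotonic relationship as perfect.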
3) Optimize the features using ML models to better match model performance, e.g., by recursive feature elimination (RFE). 195
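RFE iteratively drops the least important descriptors according to a fitted model until the requested number remains. A scikit-learn sketch on synthetic data (the dataset sizes and the choice of random forest as the base estimator are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic dataset: 10 descriptors, but the target depends on only 3
rng = np.random.default_rng(4)
X = rng.standard_normal((300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + X[:, 7] + 0.1 * rng.standard_normal(300)

# Keep the 3 best descriptors by recursive elimination
rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
          n_features_to_select=3)
rfe.fit(X, y)
print(np.flatnonzero(rfe.support_))  # indices of the retained descriptors
```

With a strong signal like this, RFE recovers the informative columns; on real descriptor data the retained count is usually chosen by monitoring model accuracy, as done in Section III A.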

III. INTERPRETABLE REGRESSION MODELS FOR TC PREDICTION
Establishing the mapping from microstructures to TCs using interpretable ML contributes to the understanding of the intrinsic thermal transport mechanism and guides the design of novel promising structures. Shapley additive explanations (SHAP) 197 is a powerful tool for explaining ML models and alleviating black-box challenges, which links the optimal credit allocation among the model's input features to local explanations. Besides, the symbolic regression (SR) 198-201 technique can construct analytical models of key physical parameters for TC prediction, and intuitively assists in capturing the underlying physical correlations. Some open-source frameworks for SR, such as gplearn, 199 PySR 202 and SISSO, 203 facilitate the discovery of optimal combinations of features and arithmetic symbols, enabling the creation of interpretable explicit mathematical models. 204-206 Here, we present three case studies of interpretable TC prediction models constructed using the SHAP or SR approaches.

A. Regression models trained with physical descriptors
The training of the regression model started with 1051 polymers sourced from a computational database, and all candidates were labelled with TCs computed by NEMD simulations in RadonPy. 135 We then calculated 325 initial descriptors, of which 294 are Mordred-based descriptors and 31 are MD-based descriptors. The down-selection process is displayed in Fig. 4(a). The threshold for the variance assessment was 0.10, and a total of 202 descriptors were retained. A weight assignment mechanism was developed for filtering descriptors: each metric was assigned a factor of 0.25, and descriptors with a cumulative weighting factor of 1 were retained. That is, a descriptor is valid only if all four correlation coefficients reach the corresponding thresholds. We filtered out 53 descriptors with a cumulative factor of 1 using thresholds of 0.050, 0.050, 0.213 and 0.186 for the Pearson, Spearman, distance and maximum information coefficients, respectively. Ultimately, the random forest (RF) model combined with RFE was employed in Scikit-learn 207 for descriptor optimization, where 25 descriptors were determined based on the evaluation of the model's accuracy, as listed in Table II.
The 1051 polymers were represented by the optimized physical descriptors and randomly split into training/test sets at 80%/20%. We constructed an RF model using those optimized descriptors, where the hyperparameters of the RF were tuned by Bayesian optimization with R² as the target. 208 A Gaussian process regressor and an acquisition function with ten random pairs of parameters were selected for the initial training, and the ideal parameters for each ML model were determined after 100 optimization iterations. 209
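The train/tune/evaluate loop can be sketched with scikit-learn. Here a small grid search stands in for the Bayesian optimization used in the text (a library such as scikit-optimize would provide the Gaussian-process search), and the descriptor matrix is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 1051-polymer descriptor matrix
rng = np.random.default_rng(5)
X = rng.standard_normal((400, 8))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(400)

# 80%/20% train/test split, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameter search scored by R^2 (grid search as a simple stand-in
# for Bayesian optimization)
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 8]},
    scoring="r2", cv=3)
search.fit(X_tr, y_tr)
print(round(search.score(X_te, y_te), 3))  # held-out R^2
```

Bayesian optimization explores the same hyperparameter space more efficiently than a grid by modeling the score surface, which matters once the search space grows beyond a few parameters.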

B. Deep neural network trained using MFF
The utilization of substructures of polymers as descriptors is more intuitive than physical descriptors in revealing structure-TC relationships and facilitates the design of new structures. We give a guide to building a deep neural network (DNN) model using 1144 polymer data with MFF. Among the 1144 polymers, the TCs of most of the structures were collected from a computational database, 135 and the rest were calculated in RadonPy using the same setup. 120 The pairs of DNN-predicted and MD-calculated TCs are plotted in Fig. 5(a), showing good consistency, with a test root mean square error (RMSE) of 0.040 W m-1 K-1 and a test R² of 0.79. We performed an additional five-fold cross-validation (CV) to evaluate the accuracy of the DNN model. In the five-fold CV, the test R² of the DNN models is 0.72 ± 0.03 (mean value and one standard deviation), reflecting that the trained DNN model is reliable. 213 Further, the trained DNN model was interpreted by SHAP, and the role of the 16 most important substructures on TC is shown in Fig. 5(b). When the descriptor values follow the same trend as the SHAP values, the substructure contributes to the realization of high TC. Thus, eight substructures were found to have a positive effect on TC, while six substructures suppressed TC; these are marked in Fig. 5(c) in blue and red text, respectively. The insights extracted from these substructures coincide with the conclusions previously gained from the RF model trained with physical descriptors in Section III A, namely that conjugated, linear, side-chain-free polymers are beneficial for maintaining large chain stiffness and high TC.
Additionally, when the polymer system contains heavy atoms such as fluorine (F), efficient heat flux transport is hindered, preventing the achievement of high TC.
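SHAP itself requires the shap package; as a lightweight stand-in, the same "which inputs drive the prediction" question can be asked with permutation importance in scikit-learn. The sketch below uses synthetic substructure counts and a random forest rather than the DNN from the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: column 0 mimics a TC-promoting substructure count,
# column 1 a TC-suppressing one, columns 2-4 are irrelevant.
rng = np.random.default_rng(6)
X = rng.integers(0, 5, size=(500, 5)).astype(float)
y = 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.02 * rng.standard_normal(500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(np.argsort(imp.importances_mean)[::-1][:2])  # two most influential
```

Unlike SHAP, permutation importance reports only the magnitude of each feature's influence, not its sign; the sign information used in Fig. 5(b)-(c) is what distinguishes TC-promoting from TC-suppressing substructures.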

C. Symbolic regression for revealing structure-TC correlations
SR is another interpretable ML method that discovers specific mathematical expressions to fit a dataset. 200 SR simultaneously searches for a set of parameters and the optimal mathematical form of a function. 214 Reliable training data is critical for SR, which does not require massive amounts of data. We calculated the TC of 104 promising amorphous polymers recommended by the RF model 120 and found that their R_g values extracted from the equilibrium systems have a strong positive linear correlation with the TCs, as shown in Fig. 6(a). Besides the radius of gyration, we additionally acquired 10 parameters from the equilibrium amorphous systems for SR, as listed in Table III. The SR was implemented in gplearn, 199 and the hyperparameter settings are listed in Table IV. gplearn is a proven tool based on genetic programming that provides a scheme for optimizing mathematical expressions using genetic algorithms. Thus, the main input parameters of the genetic algorithm in gplearn include the number of optimization generations, the number of formulas produced in each generation, the crossover probability, and the probability of mutation at each node. The formulas were selected based on the criterion of simultaneously having low complexity and high fitting accuracy. The complexity was defined as the length of the formula, with each constant, variable, or operation symbol represented as a unit of length. Fig. 6(b) shows a density plot of the 3364 mathematical formulae with complexity within 15 and R² over 0.70. The six formulae F1-F6 at the Pareto front are marked by stars, and their corresponding analytic functions are listed in Table V. The Pareto front is the set of all feasible, Pareto-optimal solutions, and is often considered the optimal trade-off between competing objectives.
215,216 Overall, formulae with larger complexity are more likely to yield higher prediction accuracies. Of the six formulas, the smallest complexity is only 5 and the highest precision is 0.876, and their predicted TCs are all in good agreement with those calculated by NEMD, as depicted in Figs. 6(c) and (d). All six formulas capture the positive correlation between R_g and TC, and three of them reflect the fact that large atomic masses in the system are unfavorable for TC. R_g was calculated to express the spatial extent of the molecular chains. When the molecular chains in the amorphous system have a high R_g, it is beneficial for maintaining large rigid chain segments and enhancing heat transport along the chain backbone through intra-chain bonding interactions, thus increasing the TC. 58,120 In addition, it is worth mentioning that some of the variables utilized for SR, such as R_g, were extracted from the equilibrated amorphous systems and are closely related to the TC, which results in a higher prediction accuracy of the obtained analytic mathematical models than that of the RF model in Fig. 4(b). The key material parameters include K_bond_ave, Mass_max, nHBDon, the number density n, the radius of gyration R_g and the persistence length λ_p.
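The Pareto-front selection used to pick F1-F6 can be sketched in a few lines: a formula is kept when no other formula is both simpler (lower complexity) and more accurate (higher R²). The (complexity, R²) pairs below are toy values, not the actual gplearn output:

```python
# Toy (complexity, R^2) pairs for candidate SR formulas; keep the Pareto
# front: a formula is dominated if another one is at least as simple AND
# at least as accurate.
candidates = [(5, 0.72), (7, 0.80), (7, 0.75), (9, 0.81),
              (11, 0.876), (13, 0.85), (15, 0.86)]

def pareto_front(points):
    front = []
    for c, r2 in points:
        dominated = any(c2 <= c and r2_2 >= r2 and (c2, r2_2) != (c, r2)
                        for c2, r2_2 in points)
        if not dominated:
            front.append((c, r2))
    return sorted(front)

print(pareto_front(candidates))
# -> [(5, 0.72), (7, 0.8), (9, 0.81), (11, 0.876)]
```

Everything off the front is strictly worse in one objective without being better in the other, which is why only the starred formulas in Fig. 6(b) are tabulated.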

IV. ACTIVE DESIGN OF POLYMERS WITH HIGH TC
Active learning is oriented toward the design of new polymers driven by target properties, breaking through the limitation of regression tasks restricted to a fixed exploration chemical space. The inverse design of polymers with high TC can be achieved by lightweight and smart optimization algorithms, such as the genetic algorithm, Bayesian optimization and quantum annealing. In this section, we not only present cases of polymer design with the single target of high TC, but also additionally consider the synthesizability of the polymers in multi-objective optimization trials.

TABLE VI.
Polymer fragments as basic units for high-thermal-conductivity polymer design. Each fragment was binary encoded according to its serial number (No.), and the structure of each fragment is given as a SMILES string.213

A. Construction of the triblock polymer dataset

The composition of the polymer fragments is directionless, i.e., the two polymers consisting of the fragments [0, 2, 29] and [29, 2, 0] are equivalent. The entire benchmark dataset contains 16,896 triblock polymers, which were classified into 13 categories following the same classification scheme as PoLyInfo, such as polyolefins, polyethers and polyimides (Fig. 7(c)). The TC of the emerging polymers was estimated by the high-fidelity DNN model trained in Section III B and ranges from 0.16 to 1.03 W m-1 K-1, with 42.6% of the polymers having TC greater than 0.40 W m-1 K-1 (Fig. 7(d)). Moreover, the synthesizability of these polymers was evaluated by the synthetic accessibility (SA) score.217 The SA score, which ranges from 1 (easy) to 10 (hard), is calculated by considering both fragment contributions and complexity penalties, and is used to evaluate the synthesizability of molecules or polymer repeating units. The SA scores of the polymers fall within a range of 2.28 to 6.21, with 6.3% of them scoring less than 3.0. Out of all 16,896 generated polymers, only 4.5% meet the predefined criteria for both TC and SA; these are referred to as ideal polymers.
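The directionless encoding and the dataset size can be reproduced in a few lines of stdlib Python. The 5-bit-per-fragment encoding below is our assumption, consistent with a 32-fragment library (2^5 = 32); enumerating ordered fragment triples up to reversal indeed yields 16,896 unique triblocks.

```python
from itertools import product

N_FRAGMENTS = 32  # size of the fragment library (Table VI)

def canonical(seq):
    """Directionless canonical form: [0, 2, 29] and [29, 2, 0] coincide."""
    return min(tuple(seq), tuple(reversed(seq)))

# All ordered fragment triples, de-duplicated under reversal
unique_triblocks = {canonical(p) for p in product(range(N_FRAGMENTS), repeat=3)}
print(len(unique_triblocks))  # -> 16896

def encode(seq, bits=5):
    """Binary encoding: 5 bits per fragment serial number, 15 bits per polymer."""
    return "".join(format(i, f"0{bits}b") for i in seq)

print(encode([0, 2, 29]))  # -> '000000001011101'
```

The count follows from (32^3 + 32^2) / 2 = 16,896: palindromic triples are their own reversal, all others are paired.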

B. Single-objective optimization trials

1. Genetic algorithm
The genetic algorithm (GA) is a heuristic search algorithm inspired by the biological evolutionary process and based on the mechanics of natural genetics and selection.218,219 The operation of the GA involves the stages of population initialization, fitness evaluation, selection, crossover, and mutation. The core operators of crossover and mutation are illustrated in Fig. 8(a). Once the parents are selected according to the fitness function, the crossover operator combines parents into one or several offspring. After that, mutation is executed with a predefined probability to increase the diversity of the population. In our case, the GA was realized in the pymoo 220 package with simulated binary crossover and polynomial mutation.221 Fig. 8(b) depicts the convergence curve of the GA in a single optimization run, using a setup of 10 initial random structures, 200 iterations × 10 candidates per batch, and crossover and mutation probabilities both set to 100%. Thanks to the inheritance of excellent fragments from the parents, the GA can efficiently find optimized polymers while simulating only a few candidates (64 non-repeating structures). To probe the effect of the initial structures on the convergence capacity, 20 GAs with different initial structures were executed, and the results are shown in Fig. 8(c). The TC of the best structures over the 20 runs ranges from 0.72 to 1.03 W m-1 K-1, within the top 0.3% of all possible candidates. Six of the 20 optimization runs yielded polymers with the globally optimal TC, which indicates that the initial population has a significant influence on the optimization performance of the GA.
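As a concrete illustration of these stages, the loop below is a minimal stdlib-only GA sketch: one-point crossover, bit-flip mutation, truncation selection with elitism, and a toy bit-count fitness standing in for the DNN TC surrogate. The actual study used pymoo with simulated binary crossover and polynomial mutation; this is only a sketch of the mechanics.

```python
import random

random.seed(0)
L = 15  # 15-bit genome, mirroring a 3-fragment x 5-bit polymer encoding

def fitness(bits):
    """Toy stand-in for the trained TC surrogate: number of 1-bits."""
    return sum(bits)

def crossover(p1, p2):
    cut = random.randrange(1, L)        # one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(bits, p=0.1):
    return [b ^ 1 if random.random() < p else b for b in bits]

# Population initialization: 10 random structures
pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(10)]
for _ in range(200):                    # 200 generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]                   # truncation selection
    children = [mutate(crossover(*random.sample(parents, 2))) for _ in range(10)]
    pop = parents + children            # elitism keeps the best parents

best = max(pop, key=fitness)
print(fitness(best))
```

Because elitism never discards the incumbent best, the best-so-far fitness is monotone, which is why GA convergence curves such as Fig. 8(b) are non-decreasing.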

2. Bayesian optimization
Bayesian optimization (BO) is a method for finding the optimum of black-box functions by using sequential design strategies that rely on probabilistic surrogate models and acquisition functions.222 Previously, we released a tutorial on the design of thermal functional materials by coupling thermal transport calculations and BO.223 Moreover, additional instruction is available to help understand the core components of BO more intuitively.224 The general process of BO can be described as follows:219 (i) fit a probabilistic surrogate model to the data observed so far; (ii) maximize an acquisition function that balances exploitation and exploration to select the next candidate; (iii) evaluate the candidate and add the result to the dataset; and (iv) repeat until the evaluation budget is exhausted.
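The loop below is a deliberately crude, stdlib-only sketch of this surrogate/acquisition cycle: a nearest-neighbour "surrogate" whose uncertainty is the distance to the closest observation, and a UCB-style acquisition, standing in for the Gaussian process and expected improvement used in practice (e.g., in BoTorch). The objective f and the candidate grid are hypothetical.

```python
import random

random.seed(0)

def f(x):                                 # hidden black-box objective
    return -(x - 0.3) ** 2

grid = [i / 100 for i in range(101)]      # discrete candidate pool
observed = {x: f(x) for x in random.sample(grid, 3)}  # initial random data

def surrogate(x):
    """Crude surrogate: nearest observation as mean, distance as uncertainty."""
    nearest = min(observed, key=lambda xi: abs(xi - x))
    return observed[nearest], abs(nearest - x)

def acquisition(x, kappa=2.0):
    mu, sigma = surrogate(x)
    return mu + kappa * sigma             # UCB: exploit mean, explore uncertainty

for _ in range(30):
    # Pick the unevaluated candidate maximizing the acquisition, then evaluate
    x_next = max((x for x in grid if x not in observed), key=acquisition)
    observed[x_next] = f(x_next)

best_x = max(observed, key=observed.get)
print(best_x)
```

Early on the uncertainty term dominates and the loop spreads samples across the grid; once coverage is dense, the mean term concentrates sampling near the optimum, which is the same exploration-to-exploitation transition seen in the BO convergence curves of Fig. 9.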

3. Quantum annealing in a quantum virtual machine
Quantum annealing (QA) is an optimization algorithm, assisted by Ising machines, that searches for the global minimum of a given problem over a given set of candidate solutions (candidate states),226,227 and is implemented with superconducting qubits, ASICs, GPUs, and so on.228 QA has been successfully applied to the design of inorganic and organic materials.228,229 Ising machines were developed specifically for solving quadratic unconstrained binary optimization (QUBO) problems, which is well suited to the binary sequence encoding of polymers. A QUBO over N bits is described as:230

E(x) = Σ_i Q_ii x_i + Σ_{i<j} Q_ij x_i x_j,

where Q_ii and Q_ij are real-valued parameters and x_i ∈ {0, 1}, so that x_i² = x_i. As shown in Fig. 10(a), the factorization machine (FM) was employed as the surrogate model, and the optimal binary solution was identified from the trained FM with a quantum annealer. The form of the FM is given by:230

ŷ(x) = w_0 + Σ_i w_i x_i + Σ_{i<j} (v_i · v_j) x_i x_j,

where w_0, w_i and the latent vectors v_i are real-valued parameters, and the rank K of the latent vectors was set to 8 as in the previous literature.230 Due to hardware limitations, we utilized a sampler from the dimod 231 Python framework instead of an Ising machine to simulate real quantum annealing. As shown in Figs. 10(c) and (d), although the simulated QA increases the TC of the polymers, the TC of the optimal polymers remains below 0.80 W m-1 K-1, and the method fails to find the globally optimal structure in the 20 parallel optimizations with various initial candidates.
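For intuition, the snippet below minimizes a tiny QUBO instance by exhaustive enumeration of all candidate states, which is what dimod's ExactSolver reference sampler does for small problems. The matrix Q is an arbitrary illustrative instance, unrelated to the polymer encoding.

```python
from itertools import product

# Illustrative 3-bit QUBO: E(x) = sum over (i, j) of Q[i, j] * x_i * x_j,
# with diagonal entries giving the linear terms (x_i * x_i = x_i).
Q = {(0, 0): -1.0, (1, 1): -1.0, (2, 2): 2.0, (0, 1): 0.5, (1, 2): -1.0}

def energy(x):
    return sum(q * x[i] * x[j] for (i, j), q in Q.items())

# Exhaustive search over all 2^3 candidate states
best = min(product((0, 1), repeat=3), key=energy)
print(best, energy(best))  # -> (1, 1, 0) -1.5
```

A hardware or simulated annealer replaces this enumeration with stochastic sampling of low-energy states, which is what makes QUBO tractable at the 15-bit (and larger) scale of the polymer encoding.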

C. Multi-objective optimization trials
Two state-of-the-art multi-objective optimization algorithms, the unified non-dominated sorting genetic algorithm III (U-NSGA-III) and q-noisy expected hypervolume improvement (qNEHVI), were employed for the design of triblock polymers with both high TC and synthetic accessibility. As depicted in Fig. 12(a), U-NSGA-III is a multi-objective evolutionary algorithm (MOEA) that is an updated version of NSGA-III.232,233 It generalizes to objective problems of different dimensions by increasing the selection pressure through a scalar selection operator. U-NSGA-III was implemented in pymoo 220 with all hyperparameters kept at their default values. qNEHVI is a multi-objective Bayesian optimization (MOBO) algorithm that extends the expected improvement acquisition function to use the hypervolume (HV) as the objective. It evaluates samples collected by a quasi-Monte Carlo (QMC) sampler from the model posterior and identifies the candidate with the largest objective value. qNEHVI was run in BoTorch,225 with the base and raw sampling set to 256 and 128, respectively, to enhance computational efficiency. The number of unique polymers generated per MOEA run is significantly lower than that per MOBO run, with a mean value of approximately 77, less than 5.0% of the average value for MOBO.

V. SUMMARY AND OUTLOOK

Over the past few years, data-driven informatics algorithms have contributed to a revolution in the materials development paradigm, greatly facilitating the design of polymers and enhancing our understanding of their underlying mechanisms. In this Tutorial, we discuss the basic principles and implementation of ML for the exploitation of high-thermal-conductivity polymers, covering polymer datasets, polymer modeling and TC calculation, feature engineering, as well as informatics algorithms.
We begin by describing the construction of interpretable regression models via physical or graph descriptors, revealing the mapping between polymer microstructures and TCs. Based on the trained surrogate prediction model and the knowledge derived from ML, we create a library containing 32 motifs and employ lightweight active learning algorithms to design sequence-ordered triblock polymers with high TCs. We not only design polymers with the single optimization target of high TC using GA, BO and simulated QA, but also consider the synthetic feasibility of polymers in multi-objective optimization trials realized by two state-of-the-art algorithms, U-NSGA-III and qNEHVI.
Although ML has facilitated the development of macromolecules with high TC, there is still a large gap in satisfying the various demands of realistic engineering applications, which also provides great opportunities for future investigations.
(1) Sufficient high-quality polymer data is a fundamental prerequisite for polymer informatics.
Accessible polymer databases are rare compared to inorganic databases, and the recorded data are rather sparse, with strict acquisition rules. Developing and maintaining publicly accessible polymer databases that adhere to the FAIR principles and encompass a wide range of properties necessitates collaboration and consensus among chemical researchers. In addition, providing APIs for the automated batch downloading of data would greatly benefit polymer informatics.
(2) Several open-source software packages 135,151 enable automated modeling and TC calculations of polymers through classical MD simulation, which promotes the development of polymer informatics.
It is crucial to ensure the reliability of the calculated TCs of polymers. Therefore, efforts are being made at the computational level to enhance the accuracy of force fields for MD, and to develop first-principles computational methods that can be applied efficiently and economically to the simulation of macromolecular systems.
(3) Most current work on ML in polymer science focuses on the computed TC, given the convenience and consistency of such studies. Subsequently, with the success of applying ML to automated chemistry experiments,[234][235][236] we look forward to the emergence of automated platforms that integrate polymer literature mining, polymer synthesis, TC measurement, data storage and analysis, as well as novel structure generation and TC evaluation. This will enable the expansion of reliable experimental polymer data and the identification of promising polymers with high TCs.
(4) State-of-the-art informatics algorithms are always sought after by the polymer community. Deep learning algorithms such as transfer learning,91 recurrent neural networks and reinforcement learning 117 have been successfully applied to the exploitation of polymers with high TC. On the one hand, more intelligent molecular generation algorithms are required for the design of polymers with high TC; on the other hand, efforts should be made to explore the application of large language models and multitask learning to the design of multifunctional polymers with enhanced TC.
Last but not least, with the rapid advancement of artificial intelligence and automated experiments, we foresee that ML will become a powerful driving force in accelerating the design of advanced polymers to meet the immense demand in various fields.The attractive characteristics of polymers for ML extend beyond TC to encompass other properties such as optical, electrical, and mechanical properties.

FIG. 2. Molecular dynamics simulation for polymer TC calculations. (a) Polymer modeling pipeline. (b) Example snapshot of the relaxed amorphous system. (c) and (d) Heat flux autocorrelation function and TC curves calculated using the Green-Kubo formula. (e) Simulation setup for TC calculation in the non-equilibrium molecular dynamics simulations, where the relaxed system was replicated three times along the heat transport direction and then divided equally into N slabs, with the red and blue slabs corresponding to the hottest and coldest regions. (f) Temperature profile.

FIG. 3. Feature engineering for polymer informatics. Three different options for polymer representation are shown: string-based, graph-based and physics-based descriptors.

FIG. 4. Interpretable machine learning model on physical descriptors and TC. (a) Polymer descriptor down-selection process. The initial descriptors (Init.) undergo dimensionality reduction by removing features with low variance (Var.), correlation-coefficient filtering (Cor.) and recursive feature elimination to obtain the optimized (Opt.) descriptors. (b) Comparison of MD-calculated and RF-model-predicted TC. (c) Average SHAP importances for the optimized descriptors. The blue and red bars indicate positive and negative importance. (d) Impact of each optimized descriptor on TC. (e) and (f) SHAP values for the MW_ratio and K_bond_ave of the training-set polymers as functions of the descriptor value. The MW_ratio represents the ratio of the molecular weight of the backbone to the total molecular weight of the monomer, and the K_bond_ave indicates the average of the different bond force constants.120

Fig. 4(c) exhibits the eight most important physical descriptors, related to properties such as atomic number, atomic mass, bond connection, bond strength, sigma electrons, and monomer length. Combined with the distribution of SHAP values for the first eight physical descriptors of the training candidates in Fig. 4(d), the promotion/inhibition effect of these key descriptors on TC can be recognized in general. The most significant descriptor is BCUTZ-1h, the first (highest) eigenvalue of the Burden matrix weighted by atomic number, which is associated with atomic and bonding properties.210 BCUTZ-1h was observed to have a strong positive correlation with the maximum atomic mass (Mass_max) in the monomer.120 Typically, the presence of heavy atoms such as chlorine, bromine and iodine in the system suppresses lattice vibrations, resulting in small phonon group velocities and low TC. We also analyzed two MD-inspired descriptors, MW_ratio and K_bond_ave, in detail, as shown in Figs. 4(e) and (f). The MW_ratio is the ratio of the molecular weight of the backbone to the total molecular weight of the repeating unit, and the K_bond_ave indicates the average of the different bond force constants. Both descriptors have a positive effect on TC, as polymers with fewer side chains and stronger chain stiffness favor efficient thermal transport.

FIG. 5. ML model performance and feature importance evaluation. (a) Predicted results for the DNN. (b) Interpretation of the DNN model for TC prediction by SHAP evaluation. (c) The key substructures that act on TC, where blue text indicates a positive effect and red an inhibitory effect.213

FIG. 6. Symbolic regression for analytic model construction. (a) The relationship between the radius of gyration R_g and the TC of the 104 polymers in this work. (b) Accuracy R² vs. complexity of 3364 mathematical formulas, shown via density plot. The six points F1-F6 were picked by Pareto search. (c) and (d) Calculated TC vs. fitted TC of formulas F1 and F6, respectively.120

FIG. 7. Construction of the triblock polymer dataset. (a) Example of the generation of a triblock polymer. (b) SA score versus TC for all 16,896 triblock polymers, where stars indicate candidates at the Pareto front. (c) and (d) Distributions of the TC and SA for all triblock polymers. The gray strips highlight the statistics of polymers with TC > 0.4 W m-1 K-1 or SA < 3.0.213

FIG. 8. Performance evaluation of the genetic algorithm (GA). (a) Illustration of crossover and mutation behaviors in the GA. (b) and (c) Convergence of the GA for a single run and for 20 parallel runs.

FIG. 9. Performance evaluation of the Bayesian optimization (BO) algorithm. (a) Gaussian process regression and acquisition functions in BO. (b) and (c) Convergence of the BO algorithm for a single run and for 20 parallel runs.

FIG. 10. Performance evaluation of the quantum annealing (QA) algorithm in a quantum virtual machine. (a) Factorization machine and Ising machine. (b) and (c) Convergence of the simulated QA algorithm for a single run and for 20 parallel runs.

FIG. 11. Comparison of different optimization algorithms. (a) Convergence curves of the three optimization algorithms; each curve was averaged over the outcomes of 20 runs, and the shading indicates one standard deviation. (b) Number of candidates designed by the three optimization algorithms after de-duplication over 20 runs, where the diamonds are the mean values and the error bars represent one standard deviation.

Figs. 12(b) and (c) illustrate the optimization trajectories for a single run of MOEA and MOBO, each with 10 random initial structures and 200 iterations × 10 candidates per batch. Nine gray stars indicate the sites of the globally optimal polymers, while the polymer dots are color-coded by generation. The distribution of the searched non-duplicated polymer structures in a MOBO run is much denser than that in a MOEA run. qNEHVI integrates the HV into the expected improvement acquisition function.

TABLE I.
Available organic databases, including mainstream polymer and small molecule datasets.
Refs. 116 and 187 trained the Mol2vec model on more than 1 million monomer structures from a combination of the PoLyInfo and PI1M databases,110,111,188 and obtained a pre-trained model called polymer embedding. The polymer embedding is a 300-dimensional continuous-valued vector, and has been successfully used for the prediction of polymer density, melting temperature, glass transition temperature and TC, respectively. The Morgan fingerprint with frequency (MFF) 112 is an expansion of Morgan fingerprints that captures the frequency of chemical moieties (substructures) present in monomers. Across the entire explored chemical space, substructures (identifiers) whose frequency of occurrence exceeds a predefined threshold are retained to compose the descriptor vectors. MFF has a much lower feature dimension than Morgan fingerprints, which effectively suppresses overfitting of ML models, and it is widely utilized in the prediction of various polymer properties.110,111,188

TABLE II.
Description of the 25 optimized descriptors.120 The trained RF model was explained by the SHAP approach, and the importance of the features was analyzed based on the output SHAP values.

TABLE III.
Symbols of the parameters used in symbolic regression.120

TABLE V.
The six mathematical formulas at the Pareto front in Fig. 6(b).120