Hydrocarbon pyrolysis is a complex process involving large numbers of chemical species and types of chemical reactions. Its quantitative description is important for planetary sciences, in particular, for understanding the processes occurring in the interior of icy planets, such as Uranus and Neptune, where small hydrocarbons are subjected to high temperature and pressure. We propose a computationally cheap methodology based on an originally developed ten-reaction model and the configurational model from random graph theory. This methodology generates accurate predictions for molecule size distributions for a variety of initial chemical compositions and temperatures ranging from 3200 to 5000 K. Specifically, we show that the size distribution of small molecules is particularly well predicted, and the size of the largest molecule can be accurately predicted provided that this molecule is not too large.

## I. INTRODUCTION

Complex chemical systems, in which a large number of chemical species are generated via numerous types of chemical reactions, play a key role in a variety of applications, such as combustion,^{1,2} astrophysics,^{3–5} or battery interfaces.^{6} Understanding the kinetic and thermodynamic forces involved is crucial for making accurate predictions of the evolution of these systems. However, estimating equilibrium chemical composition for such systems is extremely complicated. A common way to obtain the equilibrium molecular composition is by running atomistic simulations.^{7–9} While these simulations eventually equilibrate, the equilibration typically takes a long time, which can be measured in weeks or even months depending on the conditions. Furthermore, each modification of the conditions, such as the temperature and the initial molecular composition, requires a brand-new simulation.

An example of a chemical system facing these challenges is the pyrolysis of hydrocarbons at high temperature and pressure,^{10–14} which we consider in this work. Hydrocarbon pyrolysis occurs in the interior of icy giant planets, such as Neptune and Uranus, where small hydrocarbons, such as methane, are subject to high temperatures, typically on the order of several thousand kelvins, and high pressure, typically on the order of tens of gigapascals.^{3–5} Small hydrocarbons react to form long carbon structures and these long structures can undergo phase transition to a solid phase, such as diamond. The conditions where diamond is formed have been studied in several experimental works.^{3,15,16} However, these experiments do not always agree with each other, even within conditions of temperature and pressure that are similar. These disagreements can be explained by the fact that the experimental set-ups are different between these experiments. However, a lack of understanding of the precise mechanism of hydrocarbon pyrolysis prevents researchers from having a clear understanding of the forces that are in play and their effect on the kinetic and thermodynamic equilibrium of this system. Indeed, as these experiments are run in extreme conditions on small time scales, it is very challenging to obtain precise concentrations of the different structures that could be used to obtain these mechanisms.

As experimentally studying the hydrocarbon pyrolysis mechanism is intricate, many researchers have studied it via modeling and simulation. Many widely used combustion kinetic models^{17,18} can make predictions for hydrocarbon pyrolysis; however, these models assume a gaseous phase, whereas in the conditions of the interior of Neptune, hydrocarbons are in a liquid phase. In addition, these kinetic models only consider species with less than 20 atoms and hence are not suitable for the study of the growth of large carbon structures that occur in this system. Ancilotto *et al.*^{19} instead performed *ab initio* Molecular Dynamics (MD) simulations to observe the evolution of methane pyrolysis at high pressure and temperature. They observed some dissociation of methane below 100 GPa and 5000 K and the formation of diamond above 300 GPa. However, due to the high computational cost of *ab initio* MD simulations, the simulations were only run for 2 ps with 16 CH_{4} molecules. At these scales, it can be difficult to know if a system is at equilibrium or if it needs longer than a few picoseconds to react. It can also be challenging to observe the different reaction pathways that methane can undergo during its pyrolysis.

As an alternative to the high computational cost of *ab initio* MD simulations, classical MD simulations relying on force fields can be performed. Particularly, the ReaxFF force field^{20–23} is an alternative to study hydrocarbon pyrolysis simulations in conditions close to the ones in Neptune or Uranus. Indeed, the parameters of this force field developed for combustion and pyrolysis of hydrocarbons have shown several successes in reproducing experimental Arrhenius parameters,^{7,23–26} DFT-calculated energies,^{25,27} and experimentally observed reaction pathways^{23,28,29} in conditions of temperatures between 2000 and 4000 K. In addition, these parameters were shown to be relevant for the study of condensed carbon phases by Orekhov *et al.*^{31} where they validated Ashraf *et al.* parameters^{23} with DFT calculations. The calculations by the ReaxFF forcefields^{20–23} developed for pyrolysis and combustion have not yet been validated with DFT calculations for the conditions of pressures in Neptune and Uranus (tens of gigapascals), which is the range of pressures we study in this work.

Although ReaxFF had several successes, several of its limitations have also been reported: for example, it failed to reproduce some *ab initio* energies,^{32} some common reactions were not observed in ReaxFF simulations,^{32,33} and excited electronic states are not taken into account. However, due to the prohibitive computational cost of *ab initio* MD simulations and the intricacy of doing experiments, ReaxFF is a sensible alternative to study hydrocarbon pyrolysis in the conditions studied here, with reasonable time scales and system sizes. Unfortunately, ReaxFF simulations can still take several days to run, which can make it time-consuming to know the evolution of hydrocarbon pyrolysis in different conditions.

In the works of Yang *et al.*^{10,12–14} and Chen *et al.*,^{11} kinetic models for hydrocarbon pyrolysis were built from ReaxFF MD simulations data. These models were able to reproduce the concentrations of small molecules as well as the bulk counts for larger structures arising in the MD simulations, but no information on the size of the larger structure was given. Another kinetic model for the growth of polycyclic aromatic compounds has been developed by Wang *et al.*;^{34} nonetheless, this model does not study structures that could lead to diamond formation.

In the recent work of Reed’s group,^{35,36} a “local” kinetic model capable of predicting the dynamics of counts of small and large carbon molecules was developed. In this model, trained on ReaxFF MD simulations, the reactions were described in terms of local changes at the reactive site. The local kinetic model was able to accurately reproduce the time series of the training and test MD data of counts for the largest hydrocarbons in the system. Remarkably, the initial composition and the temperature of the training set was different from that of the test set. Nevertheless, the local kinetic model suffers from a major drawback: full trajectories, using kinetic Monte Carlo simulations,^{37,38} need to be run in order to obtain the equilibrium composition. Therefore, obtaining equilibrium molecular compositions in many different conditions would require significant time. In addition, this time would scale with the size of the system making it impractical to study larger size systems.

One *broad goal of the scientific community studying hydrocarbon pyrolysis* is to predict the structures and sizes of the products of hydrocarbon pyrolysis from any initial conditions. Of particular interest are the conditions of the interior of Neptune and Uranus where this process occurs continuously. As these structures are difficult to observe experimentally or in *ab initio* simulations, we chose to work with the structures observed in classical MD simulations run with the ReaxFF force field. Therefore, the *goal of this work* is to develop an easy-to-implement and fast-to-run model for hydrocarbon pyrolysis that predicts the equilibrium distribution observed in ReaxFF MD simulations of molecule sizes measured in the numbers of carbon atoms per molecule for any given initial temperature within the range of 3200–5000 K and any initial composition of hydrocarbons. The *specific objective* of this work is to develop an approach based on random graph theory that can quickly replicate the results of MD simulations at various temperatures and initial molecular compositions using a small amount of MD data for training.

The proposed methodology consists of two major components.

We introduce the

*ten-reaction model*that predicts the degree distribution given the temperature and the initial composition. This model accounts only for*local*reactions of the form

where the subscripts indicate the number of carbon–carbon bonds (the degree) that the reacting carbon atoms have. The number of the reactions, ten, is determined by the fact that the number of carbon–carbon bonds per carbon atom takes values only between 0 and 4. The model is calibrated on MD data with initial composition consisting of 64 C_{4}H_{10} molecules at various temperatures.

We use the

*configuration model*from random graph theory^{39–43}to predict molecule size distribution given the degree distribution of a system, calculated with the ten-reaction model, or directly extracted from MD data. The approach based on generating functions introduced by Newman, Strogatz, and Watts^{41}is employed.

All the simulations studied here are at a pressure of 40 GPa, and no pressure extrapolation was investigated in this work. Indeed, contrary to temperature extrapolation where the Arrhenius model is available, there are no readily available model for pressure extrapolation. Developing such a model was considered out of the scope of this work and is left to future work.

Knowing only the temperature and initial composition, we can use the ten-reaction model to estimate the degree distribution from which the Newman–Strogatz–Watts generating function formalism allows us to calculate the equilibrium distribution of hydrocarbon sizes we desire. The need for any ordinary differential equation (ODE) solver, such as kinetic Monte Carlo, is completely eliminated, as these models are analytical or based on random sampling, which makes them computationally cheap. Furthermore, one can predict the histograms of small molecule counts and the histogram for the largest molecule size by sampling random graphs for the configuration model. The flowchart of the proposed methodology is shown in Fig. 1.

## II. MOLECULAR DYNAMICS SIMULATIONS FOR OBTAINING TRAINING AND TEST DATA

The raw data used for training and testing the model developed in this work were extracted from ReaxFF MD simulations. All these data were extracted when the simulations were at equilibrium. Equilibrium was assumed to be reached once the counts of each of the most common small molecules, CH_{4}, H_{2}, and C_{2}H_{6}, and the size of the largest molecule converged to a steady state value. These extracted data include

the time-averaged distribution for the numbers of carbon–carbon bonds per carbon atom, or, in other words, the time-averaged degree distribution for the graphs consisting of the union of the carbon skeletons of all molecules present at each time step;

the time-averaged counts of molecules of each size, i.e., the molecule size distribution.

The MD simulations were run using the LAMMPS software^{44,45} and the KOKKOS package.^{46} The ReaxFF potential^{20,47,48} with the parameters described by Ashraf *et al.*^{23} was used as the interatomic potential. The initial molecules present in the simulations were exclusively C_{4}H_{10}, C_{2}H_{6}, CH_{4}, C_{8}H_{18}, or a mix of molecules. The numbers of molecules for each simulation were chosen to have a total number of atoms between 800 and 1000 in each simulation. The simulations were run for a duration allowing the system to reach an equilibrium and remain in it for a sufficient period of time (see the supplementary material, Sec. S1 and Fig. S1, for the definition of equilibrium). The simulation was between 40 and 1000 ps. To take into account the variance between MD replicates, two MD simulations for each set of conditions were run. The simulations details, the conditions of each of the simulations, and the duration during which the simulations are at equilibrium can be found in the supplementary material, Sec. S1 and Table S1. The computation time to run the MD simulations was around 1 week on 40 CPUs for 500 ps of simulation.

An important aspect of extracting data from MD simulations is the criterion for when two atoms should be treated as bonded. We used the bond duration and bond length criterion proposed in Ref. 11: two atoms are considered to be bonded if the interatomic distance has been less than a threshold distance *λ* for a threshold duration *τ*. The values for *λ* were taken from Chen *et al.*:^{11} *λ* = 1.98 Å for C–C bonds, *λ* = 1.57 Å for C–H bonds, and *λ* = 1.09 Å for H–H bonds. The bond duration was studied at different temperatures using the protocol developed by Chen *et al.*^{11} (see the supplementary material, Sec. S2), and the value *τ* = 0.096 ps was set for all temperatures.

## III. METHODS

The proposed methodology consists of two steps. Step one predicts the degree distribution for the random graph comprised by carbon skeletons of all molecules with the aid of the originally developed *ten-reaction model*. Step two calculates the molecule size distributions from the degree distribution using the generating function approach^{41} or a random graph sampling approach. Below we detail each of these steps.

### A. The ten-reaction model for predicting the degree distribution

In each MD simulation, tens of thousands of chemical reactions are observed. These reactions can be described at the molecular scale, where the whole molecules involved are described, or at the atomic scale, where only atoms surrounding the reactive site are described. A comparison of these two descriptions in the case of hydrocarbon pyrolysis was performed previously.^{35} Here, we wish to obtain the degree distribution of the carbons, which is an atomic property; therefore, we will describe reactions at the atomic scale. When observing the reactions at the atomic level, these reactions can be divided into three groups according to the change in the total number of the carbon–carbon bonds (CC-bonds) in the system:

there is no change in the CC-bonds;

a single CC-bond is added or removed; and

more than one CC-bonds are added or removed.

The histogram of changes in the CC-bonds in the reactions observed in the MD simulation at *T* = 3300 K initialized with 64 molecules C_{4}H_{10} is shown in Fig. S3. (Here and further the symbol S indicates that the referred object is found in the supplementary material.) In this case, ∼50% of all reactions observed did not change the total number of CC-bonds. Respectively, 24.25% and 24.32% of the reactions removed or added a single CC-bond. The remaining 1.4% of the reactions added or removed two or more CC-bonds. The reactions that do not lead to changes in CC-bonds do not affect the molecule size distribution measured in the number of carbon atoms in the molecules. Over 97% of the remaining reactions either add or remove exactly one CC-bond. These observations motivate our first simplifying assumption:

**Assumption 1.** *Each reaction changes exactly one CC-bond.*

As we are studying the equilibrium degree distribution, we consider only the molecules that are stable. This means that we ignore radicals as these molecules are only transition species and do not remain for a long time in this state. We also ignore double bonds. The reason for this assumption is that they are not prevalent. Except at temperatures above 4500 K more than 90% of carbons are bonded to four atoms, as can be observed in Fig. S4. The rest of the carbons are mostly bonded to three other atoms. These carbons bonded to three other atoms are either radicals or forming a double bonds and their total percentages represent less than 10% of all the carbons. Ignoring radicals and double bonds brings us to our second simplifying assumption:

**Assumption 2.** *All carbons have four single bonds*.

Using Assumptions 1 and 2, we obtain the following set of possible reactions:

The subscripts of the carbons indicate their degrees, i.e., *C*_{k} is bonded to *k* other carbon atoms. Since the valence of carbon is 4, the degree *k* can be 0, 1, 2, 3, or 4, resulting in ten possible reactions. We note that these reactions are not elementary reactions and would likely need several elementary steps to occur. However, these reactions are considered to govern the degree distribution of the carbons at equilibrium.

Using this simplified mechanism, the equilibrium constants *K*_{ij}s can be fitted to the data obtained with the MD simulations. The *K*_{ij}s are fitted at different temperatures to MD simulations starting with C_{4}H_{10}. A least squares fit to the Arrhenius model is used to obtain the *K*_{ij}s as a function of temperature.

Once the parameters for the Arrhenius models for *K*_{ij}s are obtained, the degree distribution at equilibrium can be readily calculated given the temperature and the initial chemical composition of the system. The calculation consists in solving a system of nonlinear algebraic equations. The details of the fitting of the equilibrium constants *K*_{ij}s, finding the degree distribution, and the uncertainty quantification for the degree distribution are provided in the supplementary material, Sec. S3.

### B. Generating molecule size distribution from degree distribution

The collection of carbon skeletons of hydrocarbon molecules at each step of an MD simulation can be viewed as the realization of a random graph whose vertices are the carbon atoms (C-atoms) and edges are the carbon–carbon bonds (CC-bonds). We remind the reader that a graph consists of sets of nodes and edges. In undirected graphs, each edge connects an unordered pair of nodes. The hydrogens are not represented in the graphs. The degree of a node is the number of edges that this node forms. In our case, the degree of a C-atom represents the number of CC-bonds that this carbon forms.

The degree distribution for this random graph can be extracted from the MD simulation or predicted as described in Sec. III A. Therefore, this random graph is an instance of the so-called *configuration model*^{39–41} briefly described below. A detailed derivation of this model is given in Ref. 41. Its adaptation for the case of hydrocarbons is presented in the supplementary material, Sec. S4.

Our goal is to obtain the probabilities *π*_{s}, $s\u2208N$, that a randomly picked molecule has *s* C-atoms, using the degree distribution *p*_{k}. For this purpose, we define the probabilities *P*_{s}, $s\u2208N$, that a randomly chosen C-atom belongs to a molecule containing *s* C-atoms. The distribution {*π*_{s}} is readily obtained from the distribution {*P*_{s}},

To obtain the distribution *P*_{s} from the degree distribution *p*_{k}, we use the generating functions of these distributions.^{49} For any discrete probability mass function (PMF) {*a*_{k}}, the generating function is defined by *G*(*x*)≔*∑*_{k}*a*_{k}*x*^{k}. Generating functions possess a number of useful properties. For example, since any PMF sums up to 1, we have *G*(1) = 1. The definition of the mean implies that ⟨*k*⟩ = *G*′(1). We denote the generating functions of the distributions *p*_{k} and *P*_{s} by *G*_{0}(*x*) and *H*_{0}(*x*), respectively. While *G*_{0}(*x*) can be readily written out (*G*_{0}(*x*) = *p*_{0} + *p*_{1}*x* + *p*_{2}*x*^{2} + *p*_{3}*x*^{3} + *p*_{4}*x*^{4}), finding *H*_{0}(*x*) is nontrivial and requires more work.

To find *H*_{0}(*x*), we need to define the so-called *excess degree distribution* {*q*_{k}} as follows: When an edge is randomly picked and then one of its endpoints *v* is chosen, the probability that *v* has *k* other edges incident to it is

The generating function *G*_{1}(*x*) of the excess degree distribution is readily found: $G1(x)=G0\u2032(x)/G0\u2032(1)$. Using *G*_{1}(*x*), we can obtain the generating function *H*_{1}(*x*) for the probability distribution for the component sizes hanging off from one of the ends of a randomly picked edge. The calculation of *H*_{1}(*x*) stems from the following two ideas. The first idea is the use of *the power property* of any generating function: if *G*(*x*) generates the distribution for some quantity associated with an object, then [*G*(*x*)]^{m} generates the distribution for the sum of these quantities of *m* independent samples of the object. The second idea is that *H*_{1}(*x*) must satisfy the following self-consistency relationship:

An insightful graphical explanation for (4) was offered by Newman, Strogatz, and Watts.^{41} Figure 2 (top) displays a version of it adapted for our case where the excess degree distribution has at most four nonzero components: *q*_{0}, *q*_{1}, *q*_{2}, and *q*_{3}. Since *G*_{1}(*x*) is a cubic polynomial, Eq. (4) with respect to *H*_{1} has three roots. Only one of them does not blow up as *x* → 0, allowing us to obtain a unique solution for *H*_{1}(*x*).

The generating function *H*_{0}(*x*) can be obtained from *H*_{1}(*x*) using the formula [see Fig. 2 (bottom)]

The distribution {*P*_{s}} can be extracted from its generating function *H*_{0} using Cauchy’s integral formula [Eq. (S18)]. Then, the probabilities *π*_{s} that a randomly picked molecule has *s* C-atoms can be obtained from {*P*_{s}} using Eq. (2).

Two different regimes can be observed for the molecule size distribution: either all molecules remain small (subcritical regime) or one molecule, called the *giant component*, and contains a substantial fraction of carbons at all times (supercritical regime). These two regimes are separated by the criterion^{39,41}

The distributions *H*_{0} and *H*_{1} account only for small tree-like connected components of a random graph. In the subcritical regime where there is no giant component, these distributions sum up to 1. If there is a giant component, *H*_{0} sums up to the fraction of vertices *not* lying in the giant component. Hence, when the system is in the supercritical phase, one can estimate the fraction *S* of vertices in the giant component using the following relationship:^{41}

Therefore, the expected number of carbons in the largest hydrocarbon molecule is *N*_{C}(1 − *H*_{0}(1)), where *N*_{C} is the number of carbons in the system. Equation (4) and Fig. 2 (top) imply that *H*_{1}(1) = *u* < 1 in the supercritical phase, where *u* is the smallest positive solution to *u* = *G*_{1}(*u*).

The model described in this section will be referred to as the “RGT” model in the rest of this paper. RGT stands for *Random Graph Theory*.

### C. Random graph sampling

The algorithm for obtaining the molecule size distribution {*π*_{s}} and the size of the giant component (if any) summarized in Sec. III B is extremely cheap: it runs in less than a second on a single central processing unit (CPU). However, it is based on the assumption that the number of vertices (carbons) is very large. An alternative way to predict molecule size distribution from the degree distribution is by sampling random graphs for the configuration model and applying standard graph algorithms, such as depth-first search,^{50} to find all connected components. Random graph sampling has several advantages. First, it allows us to fix the number of vertices to match the number of carbons in the considered system and hence eliminate the assumption that the system is large. Second, it offers a direct way to approximate any distribution we want. In particular, we are interested in the size distribution for the largest molecule. This distribution depends on the size of the system and hence cannot be obtained via the generating function approach. On the downside, it takes about a minute to generate and process a large enough collection of random graph samples on a regular modern laptop. Moreover, all estimates obtained using random graph sampling unavoidably contain statistical errors.

Given the degree distribution and the total number of vertices *N*_{C} (carbon atoms), we can create a collection of random graphs as outlined in Ref. 41. The main idea of this procedure is to generate a collection of vertices with “stubs” (i.e., edges sticking out of them) whose number obeys the degree distribution and then randomly pair the stubs:

The carbon atoms are enumerated from 1 to

*N*_{C}. The first $round(p0NC)$ carbons get zero stubs. The next $round(p1NC)$ carbons get one stub each. The next $round(p2NC)$ carbons get two stubs each, and so on.The total number of edges is one-half of the number of the stubs. If the number of stubs turns out to be odd, the last atom with zero stubs is given one stub. Then, the stubs are matched as follows: First, the stubs are enumerated. Next, a random permutation of the stubs is generated. Then, the stubs from the top half of this permutation are paired with the stubs of its bottom half. As a result, a realization of the random graph is generated.

Finally, the realization of the random graph is postprocessed by removing self-loops and repeated edges.

This random graph sampling model will be referred to as the “RGS” model in the rest of the paper.

## IV. RESULTS

In this section, we apply our proposed methodology to predict the molecule size distribution for hydrocarbon pyrolysis. Since our methodology involves several components, we test each of them separately to achieve a more detailed assessment of its strengths and limitations.

### A. Prediction of the degree distribution

We have assumed that the equilibrium constants *K*_{ij} obey the Arrhenius law for any hydrocarbon composition. The validity of this assumption is demonstrated in Fig. S5 displaying Arrhenius plots for *K*_{ij}s for a collection of training and test data for temperatures in the range 3200 K $\u2264T\u2264$ 5000 K.

To compare the degree distributions predicted by the ten-reaction model and extracted from MD simulations, we define the following error metric:

where the superscripts 10RM and MD mean that the corresponding data come, respectively, from the ten-reaction model and from the average of two MD simulations.

Figure 3 visualizes this comparison. It can be observed that even for the highest error, the predicted degree distribution matches well the degree distribution extracted from the MD data.

While the error in the predicted degree distribution is fairly small for all initial compositions and temperatures that we have considered, there is an apparent trend observed for C_{4}H_{10} data: the error increases as the temperature grows. This trend and its consequences will be discussed in Sec. V.

### B. Prediction of small molecule size distribution

Hydrocarbon pyrolysis usually generates a mix of small molecules, and one much larger molecule, if the H/C ratio is low enough. In order to accurately predict the equilibrium conditions of hydrocarbon pyrolysis we need to correctly describe both of these components of the system. The algorithm summarized in Sec. III B predicts the size distribution for small molecules as well as the mean size of the largest carbon agglomeration, called the *giant component* in the random graph theory framework. Here, we start by studying the small molecules size distribution. We chose 20 carbons in a molecule as a reasonable size cut-off for “small” molecules.

For each initial composition and temperature reported in Table S1, the following three small molecule size distributions were compared:

The molecule size distribution extracted from MD simulations.

The molecule size distribution predicted by the configuration model described in Sec. III B using the degree distribution extracted from the MD simulations (“RGT” model).

The molecule size distribution predicted by the configuration model described in Sec. III B using the degree distribution predicted by the ten-reaction model described in Sec. III A (“10RM + RGT” model).

Figure 4 (left and middle columns) illustrates four samples of predicted and extracted from MD simulations molecule size distributions. These samples are chosen representatively with respect to initial compositions, temperatures, and discrepancies between the distributions extracted from MD simulations and the ones calculated via the random graph theory. More plots of this kind are found in Fig. S6 in the supplementary material. We observe that the predictions of the RGT and 10RM + RGT models match generally well the values observed in the MD simulations, especially for the smallest molecules (containing less than five carbons). We can also note that the RGT and 10RM + RGT model predictions are very close to each other.

In order to compare the accuracy of the prediction for the small molecule size distribution for different temperatures and initial chemical compositions, we use the Wasserstein *W*_{1} distance, a standard measure for comparing probability distributions^{51}

The molecule size distributions extracted from MD simulations have been used as the distribution *P*, while the predicted molecule size distributions have been used as *Q*. In both cases, we truncated these distributions at size *s* = 20 and renormalized them so that they sum up to 1. The values of the Wasserstein *W*_{1} distance obtained for various initial compositions and various temperatures are visualized in Fig. 4 (right column). This figure allows us to make the following observations.

The Wasserstein

*W*_{1}distance tends to increase as the H/C ratio decreases. This phenomenon is related to the presence of large hydrocarbons. We will discuss it in Sec. V.The temperature dependence observed for C

_{4}H_{10}data is nonmonotone.The difference between the accuracy of RGT and 10RM + RGT for small molecule size distributions is insignificant.

The RGT model is derived under the assumption that the number of vertices in the graph is very large. Hence, we expect that the predictions by RGT will become more accurate as the number of carbons in the system increases. This effect is illustrated in Sec. S6 of the supplementary material.

The random graph sampling (RGS) model can also be used to study the small molecule size distribution. In particular, this method can be used to obtain a histogram of the counts of molecules of a particular size at equilibrium. More details and figures are given in Sec. S7 of the supplementary material.

### C. Prediction for the size of the largest molecule

There are two ways to predict the size of the largest molecule given the degree distribution. If the degree distribution implies the supercritical phase and the number of carbons is large enough, the size of the giant component can be estimated by *N*_{C}(1 − *H*_{0}(1)) (RGT, see the end of Sec. III B). Alternatively, for a system of any size and independent of phase, the size distribution of the largest molecule can be obtained using random graph sampling (RGS) (see Sec. III C). The results of both of these approaches are presented here. The degree distribution is obtained by the ten-reaction model (10RM).

The existence of the giant component was predicted for initial compositions consisting of C_{4}H_{10} molecules and C_{8}H_{18} molecules, while for the mix and the initial compositions consisting of C_{2}H_{6} and CH_{4}, the fraction of carbons in the giant component *S* was found to be zero. We report the numbers of carbons in the giant component *S* · *N*_{C} in Table I. We can see that the size of the giant component is generally over-estimated by the RGT model. This observation will be discussed in Sec. V.

Init. Comp. . | Temp. (K) . | N_{C}
. | RGT . | RGS . | MD . |
---|---|---|---|---|---|

C_{4}H_{10} | 3200 | 256 | 79 | 112 ± 25 | 57 ± 18 |

C_{4}H_{10} | 3300 | 256 | 81 | 110 ± 26 | 49 ± 17 |

C_{4}H_{10} | 3400 | 256 | 115 | 113 ± 26 | 70 ± 28 |

C_{4}H_{10} | 3500 | 256 | 112 | 113 ± 27 | 64 ± 21 |

C_{4}H_{10} | 3600 | 256 | 125 | 117 ± 26 | 76 ± 24 |

C_{4}H_{10} | 4000 | 256 | 92 | 129 ± 25 | 57 ± 22 |

C_{4}H_{10} | 4500 | 256 | 105 | 150 ± 22 | 58 ± 21 |

C_{4}H_{10} | 5000 | 256 | 96 | 171 ± 18 | 54 ± 19 |

C_{8}H_{18} | 3300 | 288 | 195 | 208 ± 13 | 127 ± 36 |

C_{2}H_{6} | 3300 | 250 | ⋯ | 24 ± 8 | 20 ± 6 |

C_{2}H_{6} | 4000 | 250 | ⋯ | 31 ± 11 | 20 ± 5 |

CH_{4} | 3600 | 160 | ⋯ | 6 ± 1 | 6 ± 1 |

CH_{4} | 4500 | 160 | ⋯ | 10 ± 3 | 10 ± 3 |

Mix | 3500 | 240 | ⋯ | 21 ± 7 | 18 ± 3 |

Init. Comp. . | Temp. (K) . | N_{C}
. | RGT . | RGS . | MD . |
---|---|---|---|---|---|

C_{4}H_{10} | 3200 | 256 | 79 | 112 ± 25 | 57 ± 18 |

C_{4}H_{10} | 3300 | 256 | 81 | 110 ± 26 | 49 ± 17 |

C_{4}H_{10} | 3400 | 256 | 115 | 113 ± 26 | 70 ± 28 |

C_{4}H_{10} | 3500 | 256 | 112 | 113 ± 27 | 64 ± 21 |

C_{4}H_{10} | 3600 | 256 | 125 | 117 ± 26 | 76 ± 24 |

C_{4}H_{10} | 4000 | 256 | 92 | 129 ± 25 | 57 ± 22 |

C_{4}H_{10} | 4500 | 256 | 105 | 150 ± 22 | 58 ± 21 |

C_{4}H_{10} | 5000 | 256 | 96 | 171 ± 18 | 54 ± 19 |

C_{8}H_{18} | 3300 | 288 | 195 | 208 ± 13 | 127 ± 36 |

C_{2}H_{6} | 3300 | 250 | ⋯ | 24 ± 8 | 20 ± 6 |

C_{2}H_{6} | 4000 | 250 | ⋯ | 31 ± 11 | 20 ± 5 |

CH_{4} | 3600 | 160 | ⋯ | 6 ± 1 | 6 ± 1 |

CH_{4} | 4500 | 160 | ⋯ | 10 ± 3 | 10 ± 3 |

Mix | 3500 | 240 | ⋯ | 21 ± 7 | 18 ± 3 |

Random graph sampling allows us to estimate the probability *ζ*_{s} that the size of the largest molecule is *s*. We used the 10 000 samples of random graphs to obtain the probability distribution for size of the largest component and compared them to the probability distribution for the largest molecule size extracted from the MD data. This was done for each initial composition and temperature listed in Table S1. The agreement between these distributions is reasonably good for the cases with high H/C ratio where the giant component does not exist, i.e., for the mix and the initial compositions consisting of only C_{2}H_{6} or CH_{4} (see the last five rows in Table I). All cases where the giant component was predicted exhibited poor agreement. Examples of good and bad agreements are shown in Figs. 5(a)–5(c), respectively.

We would like to comment on the highly noisy distributions for the size of the largest molecule extracted from MD simulations. First, various MD simulations stayed at equilibrium for different duration ranging from 25% to 67% of simulation time. Second, the largest molecule size distribution is naturally very noisy: for example, if a bond is broken in the middle of the largest molecule, its size is divided roughly by two.

The Wasserstein *W*_{1} distance between the size distributions for the largest molecule obtained using random graph sampling and extracted from MD data is plotted in Fig. 5(d) as a function of H/C ratio. The temperature is shown using color coding.

Figure 5(d) suggests that the prediction for the distribution of the largest molecule size becomes less accurate in terms of the Wasserstein *W*_{1} distance as the H/C ratio decreases and the temperature increases.

## V. DISCUSSION

### A. Ten-reaction model analysis

As shown in Sec. IV A, the prediction error for the degree distribution by the ten-reaction model increases with temperature. This behavior of the error can be explained by the growth of the number of radicals and double carbon–carbon bonds with temperature. In Fig. S4, we have plotted the probability to observe a carbon bonded to a total of *k* other atoms, hydrogens and carbons, vs temperature. The ten-reaction model relies on Assumption 2 that all carbons are bonded to four other atoms. This assumption becomes less accurate as the temperature increases. At *T* = 3, 200 K, over 96% of carbons are bonded to four other atoms. At *T* = 3, 600 K, *T* = 4, 000 K, and *T* = 5, 000 K, this percentage drops down to 91%, 87%, and 74%, respectively, due to the increase of the number of double bonds and radicals.

To determine if the errors due to the ten-reaction model are important or not, we compare the small molecule size distributions predicted by the RGT model and the 10RM + RGT model in Fig. 4 (right column). Remarkably, these predictions are very close to each other. In particular, the discrepancy between these predictions is very small for the system initially composed of C_{4}H_{10} at 5,000 K. Note that the error in the degree distribution predicted by the ten-reaction model for this system was the largest of all systems considered here. This means that the errors observed in Sec. IV A are low enough and are not a limiting factor for our method in the case of small molecules.

However, in Fig. 5(d), there is a clear trend that the 10RM + RGS model performs worse for the largest molecule size prediction when the temperature increases. In this model, temperature only affects the 10RM part through the calculation of the *K*_{ij}s. Therefore, for the largest molecule size distribution, it looks like the prediction error from the ten-reaction model is critical and affects the performance of the model. This phenomenon is explained by the fact that errors in the probabilities *p*_{3} and *p*_{4} for a carbon to have degree three and four, respectively, affect the relative error in size distribution for large molecules the most. To improve the ten-reaction model in future work, assumption 2 could be relaxed and double bonds or radicals could be taken into account.

### B. Random graph theory and sampling model analysis

In Sec. IV, we have observed that the RGT and RGS models accurately predict size distributions for small molecules (Fig. 4). When no giant component exists, as in the case of systems starting with CH_{4}, C_{2}H_{6} or the mix, the size of the largest molecules is also well-predicted [see the last five rows of Table I and Figs. 5(a), 5(b), and 5(d)]. Importantly, the RGT not only predicts the molecule size distribution but also explains it. The distribution is completely determined by three assumptions: (*i*) the degree distribution, (*ii*) a large number of carbons, and (*iii*) equal probability for any pair of carbons to be bonded. These three assumptions imply that small molecules are tree-like with high probability, and hence the component size distribution *H*_{0} is given by Eqs. (4) and (5).

However, when the random graph theory [specifically, Eq. (6)] predicts that there is a giant component, the performance of these models worsens, especially for the prediction of the largest molecule [Table I, Figs. 5(c) and 5(d)]. Therefore, *the larger the giant component of the system is the worse these random graph models perform*. This observation explains why the model consistently performs worse when the H/C ratio decreases. When there are fewer hydrogens per carbon in the system, larger giant components tend to form, resulting in worse predictions by the random graph theory and sampling.

The fact that both the RGT and the RGS model result in a poor prediction for the size of the largest molecule is due to the fact that both of these approaches ignore geometric aspects of the graphs. This means that some large connected components of graphs predicted by the random graph models may be nonphysical, i.e., impossible to realize in 3D when taking into account the geometric restriction for bond lengths, bond angles, and minimal inter-atomic distance of unbonded atoms. Without these geometric constrains, the molecules predicted by the RGT and RGS models tend to be larger than in the MD simulations, resulting in the poor predictions by the RGT and RGS models. One possible approach to tackle this issue is to learn physical molecular substructures such as branching patterns, cycles, combinations of cycles, etc. from data and reject molecules generated by random graph sampling with nonphysical substructures. Another possible approach is to develop a model for hydrocarbons based on hypergraphs rather than graphs. Both of these approaches require major developments and are left to future work.

Finally, we would like to make a few remarks on computational cost, the cost of the prediction of molecule size distribution by 10RM + RGT is independent of the size of the system and involves only a solution of a small system of nonlinear equations and function evaluations. Runtimes are about one second. The cost of 10RM + RGS is determined by the number of samples and scales with the number of carbons *N*_{C} as the cost of the depth-first search algorithm: *O*(*N*_{C}).^{50} The runtimes for the systems studied in this work are about one minute. It is important to note that 10RM + RGS has an advantage over kinetic Monte Carlo (KMC) hypothetically used with the ten-reaction model that the former would immediately sample random graphs from the invariant distribution while the latter would require some equilibration period. The accuracy of these two techniques would be comparable. Kinetic Monte Carlo combined with a more sophisticated atomic–level reaction model can give accurate predictions even for the distribution of the largest molecule in the system,^{35,36} but would take more time to run (10 s of minutes).

## VI. CONCLUSION

We have proposed a methodology to predict the molecule size distribution for hydrocarbon pyrolysis suitable for various initial molecular compositions and various temperatures, at least in the range 3200–5000 K. The methodology involves two steps. First, the ten-reaction model analytically gives the degree distribution for the desired initial composition and temperature. Second, given this degree distribution, we predict the molecule size distribution using two random graph theory models. These models can estimate the size distribution of small molecules and of the largest molecule. These two steps are executed within seconds to minutes on a consumer-grade laptop, which is several orders of magnitude faster than what is possible by running MD simulations. Remarkably, the ten-reaction model and the RGT model runtimes do not depend on the size of the system and can easily be computed for any system size. The predictions for small molecule size distributions are remarkably accurate and the predictions for the largest molecule size distributions are good if there are no giant components. Further improvements of the predicting power of the random graph approaches, particularly for the largest molecules, will be performed in future work. Additionally, with more accurate MD data from improved methods or more powerful computers, this methodology can take advantage of the improved data and provide even better predictions.

Our method can also be applied to other systems where the size distribution is of interest. From the first conclusions of our work, our methods can be applied to systems in regimes where no structural effects (cycles, etc.) are observed. If molecules become large but no structural effects are observed (only branching), we foresee that our method could be applied successfully as our method assumes that molecules are tree-like. For example, our method could be applied to many polymerization processes.

## SUPPLEMENTARY MATERIALS

See the supplementary material for the following information: GitHub link to code, molecular dynamics simulations details, bond duration definition, ten-reaction model details, Arrhenius parameters and plots, generating molecule size distribution from degree distribution (detailed version), additional plots for the size distribution for small molecules, system size effect, and prediction of histograms for particular molecule sizes from random graph sampling.

## ACKNOWLEDGMENTS

This work was partially supported by the Air Force Office of Scientific Research under Award No. FA9550-20-1-0397.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Vincent Dufour-Décieux**: Conceptualization (equal); Data curation (lead); Formal analysis (equal); Investigation (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Christopher Moakler**: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Evan J. Reed**: Conceptualization (supporting); Funding acquisition (equal); Supervision (equal); Writing – review & editing (supporting). **Maria Cameron**: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## REFERENCES

^{1}-regularization

_{2}

_{2}/CO/C1-C4 Compounds