Large-scale nonlinear dynamical systems, such as models of atmospheric hydrodynamics, chemical reaction networks, and electronic circuits, often involve thousands or more interacting components. In order to identify key components in the complex dynamical system as well as to accelerate simulations, model reduction is often desirable. In this work, we develop a new data-driven method utilizing $\ell_1$-regularization for model reduction of nonlinear dynamical systems, which involves minimal parameterization and has polynomial-time complexity, allowing it to easily handle large-scale systems with as many as thousands of components in a matter of minutes. A primary objective of our model reduction method is interpretability, that is, to identify key components of the dynamical system that contribute to behaviors of interest, rather than just finding an efficient projection of the dynamical system onto lower dimensions. Our method produces a family of reduced models that exhibit a trade-off between model complexity and estimation error. We find empirically that our method chooses reduced models with good extrapolation properties, an important consideration in practical applications. The reduction and extrapolation performance of our method are illustrated by applications to the Lorenz model and chemical reaction rate equations, where performance is found to be competitive with or better than state-of-the-art approaches.

We develop a new data-driven paradigm for efficient model reduction of a broad class of nonlinear dynamical systems. Our model reduction method directly enables the interpretation of key components of the dynamical system, unlike traditional projection-based model reduction methods that focus on reducing computational complexity more than interpretability. Our method is not application specific and is simple to implement on nonlinear dynamical systems arising from a variety of different fields. It requires minimal parameterization, using a single parameter to trade off between model complexity and estimation error. We use a data-driven paradigm to formulate model reduction as an efficient convex optimization problem that scales polynomially in the original size of the complex system, enabling systems with as many as thousands of components to be reduced in a matter of minutes.

## I. INTRODUCTION

The numerical solution of nonlinear dynamical systems underlies many problems in computational science from biology^{1} to atmospheric chemistry^{2,3} to combustion.^{4,5} An important challenge in this field is the ability to identify key components of the dynamical system that govern properties of the system of scientific interest. For example, in the case of chemical reaction networks, one might be interested in the time to equilibration and the equilibrium state given an initial chemical composition of the system, and the goal would be to find a small set of key reactions that are sufficient to characterize how the system evolves toward equilibrium. This is difficult to do for complex systems that may consist of thousands or more components.

In this work, we develop a new data-driven paradigm that enables the computationally efficient model reduction of a wide class of nonlinear dynamical systems. Specifically, we focus on systems of the form

$$\frac{dx}{dt}=F(x(t))=\sum_{j=1}^{N}f_j(x(t))=\sum_{j=1}^{N}h_j(x(t))\,r_j,\qquad (1)$$

where $x(t)\in\mathbb{R}^M$, the $r_j\in\mathbb{R}^M$ are constant, and the $h_j:\mathbb{R}^M\to\mathbb{R}$ are possibly nonlinear. We identify every $f_j:=h_j r_j$ with an *interpretable* component of the system. The aim is to perform *model reduction*, that is, to find a surrogate to (1) that consists of as few of these components as possible while still being able to reproduce the coarse-scale behavior of $x$. Specifically, we focus on finding a surrogate dynamical system,

$$\frac{d\hat{x}}{dt}=\sum_{j=1}^{N}\hat{k}_j\,f_j(\hat{x}(t)),\qquad (2)$$

such that $\hat{k}\in\mathbb{R}^N$ is a *sparse* vector, namely, a vector such that most of its components are equal to zero. This surrogate system becomes the *reduced model* (or *system*) for (1).

Evidently, $\hat{k}=0$ would be a maximally reduced system, but the time evolution of $\hat{x}$ would bear no resemblance to that of $x$. For this reason, we require that

$$\ell(x,\hat{x})\le\epsilon\qquad (3)$$

for some suitably defined loss function $\ell$ and prescribed $\epsilon>0$. Consequently, model reduction is constrained by the trade-off between the *reduction* achieved, in terms of how many $f_j$ can be neglected in (1), and the *error* incurred, as quantified by the loss function $\ell$. Typically, the error will be minimal if $\hat{k}=\mathbf{1}$ and maximal if $\hat{k}=0$. For instance, one choice of loss function is

$$\ell(x,\hat{x})=\int_{t_a}^{t_b}\big\lVert P\,\big(x(t)-\hat{x}(t)\big)\big\rVert_2^2\,dt,$$

where $P$ is an $M\times M$ projection matrix; in particular, this matrix may select the components of the state vector $x$ that we consider important to reproduce. In this case, we allow possibly large errors in the unimportant components in order to permit a large reduction with a small error in the important components.

In contrast to other model reduction approaches in the literature, such as proper orthogonal decomposition (POD),^{6–8} we are not concerned primarily with finding a projection of the state vector $x$ onto a low-dimensional space in order to reduce the number of components of the system and thus its computational complexity. Instead, we are interested in reducing the number of $f_j$ in (1) until only those that are essential to reproduce the behavior of $x$ over time remain. This allows us to interpret what components of the dynamical system are necessary to reproduce the behavior of $x$, as opposed to which subspace of $\mathbb{R}^M$ contains the directions of the largest variability of $x$ over time. It may be the case that by doing this, we also happen to find a reduction in the dimensionality of the relevant state space of $x$, but this is a secondary concern. In particular, we may reduce the dimensionality of the state space if, by removing some of the $f_j$ from the dynamical system, some components $\hat{x}_i$ of the solution $\hat{x}$ to the reduced system are constant for $t>0$. If these components are not considered important, for example, as defined by the matrix $P$ in the loss function above, then model reduction also leads to a reduction in computational cost.

We illustrate our method with two different types of dynamical systems. We start with a simple example using the widely known Lorenz equations^{9} from chaos theory. We show that our model reduction method is not only able to find a reduced model of the Lorenz equations near their stationary point at the origin that corresponds to its linear approximation there but also able to identify a stable dynamical system as the optimal next smaller reduced model, rather than an unstable system that seems to reproduce the full model more closely near the origin but extrapolates poorly. Then, we follow with a larger scale problem in the form of chemical reaction networks described by the reaction rate equations,^{10} where each $f_j$ corresponds to a distinct chemical reaction. We would like to find a subset of chemical reactions that are capable of reproducing the molecular concentrations at time $t$ of certain important species with as little error as possible. We study an 89-reaction network that has been used in the combustion community to describe propane pyrolysis and show that our method is not only able to identify a significantly reduced set of reactions that predicts the dynamics of six important molecular species with less than 5% error, but also that the reduced models extrapolate over orders of magnitude in time comparably to or better than two previously determined reduced models computed using other methods.

## II. BACKGROUND

Model reduction of nonlinear dynamical systems has traditionally been aimed at choosing a subspace of dimension $m\ll M$ and projecting the state vector $x\in\mathbb{R}^M$ onto this subspace at every instant. Either the singular value decomposition (SVD) or Krylov subspace methods play a key role in many of these methods.^{11} This approach has the ability to speed up the computation of solutions to the dynamical system and can help determine the regions of state space that play an important role in the evolution of the dynamical system; in fact, the low-dimensional subspace is typically computed from the directions of large variability of the trajectory. However, projection-based methods are difficult to apply to nonlinear dynamical systems, and their extension to such systems remains an active area of research.^{12} They also do not naturally lead to interpretability about the significant components of a system in the sense of (1), since typically the component functions $f_j$ do not have a one-to-one correspondence to components of $x$.

The ability to find significant components of a complex dynamical system is important from a scientific perspective in many practical applications. In particular, the combustion and biochemistry communities are often interested in the key chemical reactions that are drivers of a reaction or gene regulatory network. Thus, model reduction of chemical reaction networks has been extensively studied in both these communities, among others.^{32–37} Popular methods that find reduced mechanisms consisting of a subset of reactions and thus preserve interpretability include sensitivity analysis,^{13,14} graph-theoretic methods,^{15} and integer programming.^{16,17} For example, sensitivity analysis involves computing, at various selected times, how the state variables (e.g., molecular concentrations) change with respect to changes in system parameters (e.g., reaction rate constants), building a sensitivity matrix. This sensitivity matrix is then analyzed using methods such as principal component analysis (PCA)^{18} in order to find reactions that can be eliminated. The directed relation graph method^{15} is another method that computes, at selected snapshots of the system, a graph that describes whether the rate of change of a particular state variable depends significantly on another state variable. The nodes of the graph are the state variables (e.g., molecular species), and there is a directed edge between nodes if there is a dependency between them. Given a set of state variables that we are interested in, we can then find all other state variables that may be relevant to their dynamics by searching over this graph for dependencies. Then, the reduced reaction model consists of all components (e.g., reactions) that contain the found set of state variables (e.g., molecular species).
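The graph-search step just described amounts to a reachability computation. The following is a minimal sketch of our own (not the algorithm of Ref. 15; the names `reachable_species` and `graph` are illustrative) of how the relevant species can be collected from a directed relation graph:

```python
from collections import deque

def reachable_species(graph, targets):
    """Breadth-first search over a directed relation graph.

    graph: dict mapping each species to the species its rate of change
           depends on (the directed edges of the relation graph).
    targets: species whose dynamics we want to reproduce.
    Returns the set of species that may be relevant to the targets.
    """
    seen, queue = set(targets), deque(targets)
    while queue:
        s = queue.popleft()
        for dep in graph.get(s, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# Toy example: A depends on B, B depends on C; D is not reachable from A.
graph = {"A": ["B"], "B": ["C"], "C": [], "D": ["A"]}
relevant = reachable_species(graph, ["A"])  # the set {A, B, C}
```

The reduced mechanism would then retain every reaction whose species all lie in the returned set.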

These methods involve user expertise in choosing a set of times and system conditions under which to observe the system state and construct a sensitivity matrix or a directed relation graph, respectively, so that the reduced reaction networks consist of the union of reactions found over these different snapshots. They also involve user-specified thresholds to determine which components to retain in the reduced mechanisms, or what qualifies as a significant dependence between states, respectively, and different choices of thresholds can affect the ability to achieve optimal reduction. On the other hand, direct global optimization methods such as integer programming do not scale well and are computationally prohibitive for very large-scale systems.

In contrast, we have designed our algorithm to require minimal parameterization and polynomial-time computational complexity. In fact, we find empirically that the solution to our model reduction procedure is relatively insensitive, over a wide range of values, to the choice of all parameters except the one controlling the sparsity of the model. Our method uses $\ell_1$-regularization and leverages the linearity in the components $f_j$ in (1) to formulate model reduction as a convex optimization problem, allowing it to scale polynomially in $N$, the number of components $f_j$ in the full system. Finally, our method is simple to implement and easily applied to any dynamical system in the form of (1). Its simple form gives rise to a number of straightforward variations that can be used to achieve different model reduction goals.

In recent years, several methods leveraging $\ell_1$-regularization have been proposed in the literature to solve similar problems. However, the model reduction problem, as we have formulated it, has particular features that make our method distinct from others proposed in the literature. In particular, it is these differences that preclude us from leveraging the extensive literature on $\ell_1$-recovery methods to provide guarantees for our method.

*Model reduction and model estimation:* In *model estimation* problems, we observe (possibly several) trajectories $x_i$ of an unknown dynamical system. The aim is to estimate the $f_j$ in (1) from these observed trajectories. This is achieved by choosing a collection of functions $\hat{f}_j$, not necessarily of the form we have considered, and then using a fitting procedure to find a suitable $\hat{k}$, e.g., by minimizing

$$\sum_i\int_0^T\Big\lVert \frac{dx_i(t)}{dt}-\sum_j\hat{k}_j\,\hat{f}_j(x_i(t))\Big\rVert_2^2\,dt.$$

If the true $f_j$ are part of the collection $\{\hat{f}_j\}$, then we expect this procedure to find the unknown dynamical system. In recent years, SINDy^{19,20} has been proposed as a method that leverages $\ell_1$-regularization of the above loss function to estimate dynamical systems from a set of observed trajectories.

Model reduction, as we have formulated it, assumes *full knowledge* of the dynamical system that we aim to reduce. In particular, we do not assume that there is any "true" reduced model, i.e., a sparse $\hat{k}$ for which the error is zero. Instead, we focus on characterizing the trade-off between reduction and error. Although this is a conceptual difference, model estimation methods such as SINDy can nonetheless be applied for model reduction, for instance, by setting $\hat{f}_j\equiv f_j$ for all $j$. However, in some multiscale systems, such as chemical reaction networks, where the reaction rate parameters may differ from each other by several orders of magnitude, knowing the true model is critical to being able to learn a reduced model from the data. A naïve application of SINDy to chemical reaction networks would encounter challenges due to the vastly different scales of the reaction rates; our method does not have similar difficulties because it leverages the knowledge of the true dynamical system to scale the problem appropriately. This leads to improved reduced models, as we demonstrate in Sec. IV.

*Model reduction and sparse recovery:* Suppose that there is an unknown sparse vector $k_0$ in a high-dimensional space representing an object of interest and that the only information we have about this vector is a set of $n\ll N$ measurements of the form $a_i^T k_0$ for some known vectors $a_i$. The aim of *sparse recovery* is to leverage the sparsity of $k_0$ to recover it from the $n$ linear measurements available. Compressed sensing^{21–23} characterizes the conditions on the sparsity of $k_0$ and on the measurement vectors $a_i$ that ensure we can recover $k_0$ exactly from this incomplete set of linear measurements. Although model estimation problems are sometimes successfully framed as sparse recovery problems,^{19,24} in our case it is not possible to do so.

In principle, we could assume that the sparse $\hat{k}$ that achieves a loss of at most $\epsilon$ is unknown and that we observe a *noisy* version $k=\mathbf{1}$, where $\mathbf{1}$ denotes the vector of all ones. From this point of view, we want to leverage the sparsity of $\hat{k}$ to recover it from $k$. However, we have a *complete* set of *direct* measurements of $\hat{k}$, and thus critical conditions on the measurements, such as incoherence^{25} or the restricted isometry property (RIP),^{26} do not hold. In fact, the model reduction problem we have formulated is closer to a denoising problem where we have observations of the form

$$k_j=\hat{k}_j+w_j,$$

where $w$ is the noise. In this formulation, the properties of $w$ can be deduced from the constraint (3). Consequently, we cannot leverage the results of sparse recovery to provide guarantees for our method.

## III. ALGORITHM

Many nonlinear dynamical systems can be expressed in the form (1), reproduced here:

$$\frac{dx}{dt}=F(x(t))=\sum_{j=1}^{N}h_j(x(t))\,r_j.$$

For example, if the dynamical system is defined by a polynomial function, then we can assign to each distinct term that appears a scalar function $h_j$ multiplying a constant vector $r_j$. Now, suppose that we are interested in reproducing the dynamics of a subset of state vector components $x_i$, $i\in C$, where $C\subset\{1,\ldots,M\}$, that are important, and we would like to know the smallest set of components $f_j$ that are needed to do so. Ideally, we would like to find a vector $\hat{k}\in\mathbb{R}^N$ such that as many of the $\hat{k}_j$ are 0 as possible while minimizing a loss function $\ell$ that quantifies the discrepancy between $x$ and $\hat{x}$ over a time interval $t\in[t_a,t_b]$ of interest; here, $\hat{x}$ is the solution to the reduced model

$$\frac{d\hat{x}}{dt}=\sum_{j=1}^{N}\hat{k}_j\,h_j(\hat{x}(t))\,r_j$$

with initial condition $\hat{x}(0)=x(0)$.

Since $x$ and $\hat{x}$ are both solutions to nonlinear dynamical systems, directly minimizing any loss function $\ell$ that depends on the values of $x$ and $\hat{x}$ over an interval $[t_a,t_b]$ of interest is computationally challenging, as evaluating $\hat{x}(t)$ for any $\hat{k}$ requires integrating a nonlinear ODE, and minimization methods typically involve doing this many times iteratively. Instead, we make the observation that if $dx/dt=d\hat{x}/dt$ at all times, then it is clear that $x=\hat{x}$ as well, since both share the same initial condition. Intuitively, then, when $dx/dt\approx d\hat{x}/dt$, we should have $x\approx\hat{x}$. Thus, we develop a model reduction algorithm that seeks to minimize the error between $dx/dt$ and $d\hat{x}/dt$ over a set of sampling times. The idea of minimizing an error on the derivatives can also be found in the literature on collocation methods for solving systems of ordinary differential equations (ODEs).^{27}

We make the key observation that $d\hat{x}/dt$ varies linearly with respect to the vector $\hat{k}\in\mathbb{R}^N$ at any given time $t$. For any $x\in\mathbb{R}^M$ and $k\in\mathbb{R}^N$, we can write $F$ in (2) in matrix form as

$$\sum_{j=1}^{N}k_j\,h_j(x)\,r_j=F(x)\,k,\qquad F(x)=\big[h_1(x)\,r_1\;\cdots\;h_N(x)\,r_N\big]\in\mathbb{R}^{M\times N}.$$

To capture the nonlinearity of the system, we can then sample $x$ at a set of times $t_1,\ldots,t_m$ and stack all of the corresponding linear systems into the $mM\times N$ linear system

$$\begin{bmatrix}\dot{x}(t_1)\\\vdots\\\dot{x}(t_m)\end{bmatrix}=\begin{bmatrix}F(x(t_1))\\\vdots\\F(x(t_m))\end{bmatrix}k.\qquad (4)$$

This relation has the crucial property that $\dot{x}=F\mathbf{1}$, where $\dot{x}$ and $F$ now denote the stacked vector and matrix. We note that the sampling times $t_1,\ldots,t_m$ that we use to build our data-driven model reduction problem have some effect on the quality of the final reduced model. In our example problems, we choose the sampling times to be evenly spaced in $t\in[t_a,t_b]$ for a specified sample size $m$. The reduced models that we find, while fairly robust to the sample size $m$ on our example problems, could in general depend on the density of samples in certain regions where the derivative changes rapidly. It is usually sufficient to use as many samples as are available from the numerical solution of the original ODE.
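To make the construction of (4) concrete, here is a minimal sketch of our own, using the five-term Lorenz decomposition discussed in Sec. IV A; the names `hs`, `rs`, and `build_F` are illustrative, not part of the method's specification:

```python
import numpy as np

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

# Scalar factors h_j and constant direction vectors r_j for the five
# Lorenz terms x, y, z, xy, xz (state s = [x, y, z]).
hs = [lambda s: s[0], lambda s: s[1], lambda s: s[2],
      lambda s: s[0] * s[1], lambda s: s[0] * s[2]]
rs = np.array([[-sigma, rho, 0.0],
               [sigma, -1.0, 0.0],
               [0.0, 0.0, -beta],
               [0.0, 0.0, 1.0],
               [0.0, -1.0, 0.0]]).T          # columns r_j, shape (M=3, N=5)

def build_F(samples):
    """Stack F(x(t_i)) over sampled states; F(x) has columns h_j(x) r_j."""
    blocks = [rs * np.array([h(x) for h in hs]) for x in samples]
    return np.vstack(blocks)                  # shape (m*M, N)

samples = [np.array([0.1, 0.2, 0.3]), np.array([0.2, 0.1, 0.4])]
F = build_F(samples)
# Full model: F @ 1 reproduces dx/dt at each sampled state.
```

The same pattern applies to any system in the form (1): only the lists of $h_j$ and $r_j$ change.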

After setting up the linear system (4), we take the additional step of weighting each *row* of $F$ by its $\ell_2$-norm. This can be done by multiplying both $F$ and $\dot{x}$ by a diagonal matrix $W$ whose $r$-th diagonal entry is equal to the reciprocal of the $\ell_2$-norm of the $r$-th row of $F$. We do this as a normalization step to guard against differences in the scale of different $dx_i/dt$. We do not want to ignore perturbations to the rate of change of particular state variables simply because their rate of change is smaller in magnitude than that of others, because, especially in a nonlinear system, these small changes may still have a large impact on the overall dynamics of the system. We note that other choices of $W$ are possible to achieve this effect, for example, weighting rows by their $\ell_1$-norm. We use the $\ell_2$-norm because it has the additional effect of contributing to better numerical stability when solving (5).
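The weighting step amounts to a row normalization, which can be sketched as follows (our own illustration on a random stand-in matrix):

```python
import numpy as np

# Sketch of the normalization step: divide each row of F (and the
# matching entry of xdot) by that row's l2-norm, i.e., apply W = diag(w).
rng = np.random.default_rng(0)
F = rng.normal(size=(6, 5))                 # stand-in stacked system
xdot = F @ np.ones(5)                       # full model satisfies F @ 1 = xdot
w = 1.0 / np.linalg.norm(F, axis=1)         # reciprocal row norms
WF, Wxdot = w[:, None] * F, w * xdot        # equivalent to W @ F, W @ xdot
# Every row of WF now has unit l2-norm, so no equation dominates the fit.
```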

Now, we solve the weighted system for a sparse $\hat{k}$, subject to a constraint on the $\ell_1$-norm of $\hat{k}$ in order to induce sparsity in the solution. We add two additional inequality constraints to bound $\hat{k}$. This leads us to an $\ell_1$-regularization problem with box constraints,

$$\min_{\hat{k}\in\mathbb{R}^N}\;\lVert W(F\hat{k}-\dot{x})\rVert_2^2\quad\text{subject to}\quad\lVert\hat{k}\rVert_1\le\lambda,\quad 0\le\hat{k}\le\kappa\mathbf{1},\qquad (5)$$

where vector inequalities are interpreted componentwise and $\kappa\ge 1$. The inequality constraints bounding the components of $\hat{k}$ are intended to promote a solution that is close to that of the corresponding integer program. In the special case of chemical reaction networks, as we will discuss below, the lower bound $\hat{k}\ge 0$ also enforces a physical constraint, namely, that reaction rates must be nonnegative. If we choose $\kappa=1$, our problem is exactly the convex relaxation of the corresponding integer program in which the second constraint in (5) is replaced by $\hat{k}\in\{0,1\}^N$. Note that the upper bound $\hat{k}\le\kappa\mathbf{1}$ is only relevant when $\kappa\le\lambda$. In practice, we find that for large $\lambda$, choosing $\kappa=10$ or $\kappa=100$ allows some added flexibility in finding reduced models where some of the remaining components $f_j$ now need to be weighted more heavily in $F$. In general, however, the overall solution is not too sensitive to different choices of $\kappa$, and any improvements from an optimal choice are typically small.

Our problem bears a strong resemblance to the nonnegative garrote^{28,29} that was an early inspiration for the popular least absolute shrinkage and selection operator (LASSO) method,^{30} which has become an important tool for finding sparse models. By varying $\lambda$, we find a family of potential reduced models. The solution to the quadratic program above results in a vector $\hat{k}$ for which some entries are 0; its sparsity depends on the choice of $\lambda$. We use this solution $\hat{k}$ only for model selection, i.e., to determine which components to retain, since a well-known property of $\ell_1$-regularization is that it tends to underestimate the magnitudes of the $\hat{k}_j$; this phenomenon is known as the shrinkage effect in the statistics literature. While we could choose to retain only the components that correspond to nonzero $\hat{k}_j$, empirically, for $\lambda$ such that we can find a reasonable reduced model, the $\hat{k}_j$ exhibit a bimodal distribution, so that either $\hat{k}_j\approx 0$ or $\hat{k}_j\in[10^{-2},10^2]$ (that is, the constrained solution does not vary from the unconstrained solution of $\hat{k}=\mathbf{1}$ by more than two orders of magnitude). Thus, we retain all components $j$ such that $\hat{k}_j\ge\epsilon$, where, due to the strong separation of the modes, the particular choice of $\epsilon$ is not very important; generally, $\epsilon=0.01$ can be used. Finally, we debias the solution by re-solving the weighted linear system restricted to the retained set $S$,

$$\hat{k}^*=\arg\min_{\hat{k}\,:\,\hat{k}_j=0\ \forall j\notin S}\;\lVert W(F\hat{k}-\dot{x})\rVert_2^2.$$
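The selection-and-debias step can be sketched as follows (our own illustration; we use nonnegative least squares for the re-fit, which also respects the $\hat{k}\ge 0$ constraint relevant for chemical networks):

```python
import numpy as np
from scipy.optimize import nnls

def select_and_debias(WF, Wxdot, k_l1, eps=0.01):
    """Threshold the l1 solution at eps to pick the retained set S, then
    re-solve the weighted system over those columns only, undoing the
    shrinkage effect of the l1 penalty. Sketch only."""
    S = np.flatnonzero(k_l1 >= eps)          # retained components
    k_star = np.zeros_like(k_l1)
    k_star[S], _ = nnls(WF[:, S], Wxdot)     # nonnegative least squares
    return S, k_star
```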

## IV. EXPERIMENTAL RESULTS

We demonstrate our method on two different types of nonlinear dynamical systems of the form (1) described above: the Lorenz equations and chemical reaction networks. In both cases, we study the resulting reduced models for their extrapolation capability and other properties of practical interest.

In the following discussions, we evaluate the quality of the reduced model by the error on a subset of important state variables with indices in $C$. For the Lorenz model, there are only three dimensions and we are interested in all three state variables. In the chemical reaction network example, we choose a small subset of important molecular species, and note that the best reduced model found may be affected by the choice of $C$. We use the following expression for the error between the full and reduced models (the relative norm error), which for comparison purposes is the same as that used by Hannemann-Tamás *et al.*,^{17}

$$e=\max_{i\in C}\;\frac{\big(\sum_j\big(x_i(t_j)-\hat{x}_i(t_j)\big)^2\big)^{1/2}}{\big(\sum_j x_i(t_j)^2\big)^{1/2}+\mathrm{Tol}_i},$$

where $\mathrm{Tol}_i$ is a tolerance parameter used to prevent large relative errors as the denominator decreases to zero. In Hannemann-Tamás *et al.*,^{17} as well as in this work, we use $\mathrm{Tol}_i=10^{-4}\times\max_j x_i(t_j)$. We note that this error metric is different from the objective function of our optimization problem (5), which we have constructed as a surrogate for the error. The empirical relationship between the objective error and the relative norm error is explored in more detail in the experiments below.

We refer to the smallest model with no more than 5% relative norm error as the best reduced model. In general, there is a sharp increase in the error after the best reduced model as we increase $\lambda $ to try to find smaller models, so again the choice of 5% relative error as a threshold is not a very sensitive parameter.

In all experiments, we choose $\kappa=100$ and $\epsilon=0.01$, although, as noted above, the results are not too sensitive to these parameter choices. We vary $\lambda$ over a range of values to find a family of reduced models of different sizes, characterizing the trade-off between reduction and error. Note that our problem is scaled so that $\lambda$ is a rough approximation to the number of components in the reduced model.

### A. Lorenz equations

We begin by considering the Lorenz equations as a small illustrative example. The Lorenz system is a nonlinear system of ordinary differential equations that is famous for exhibiting chaotic behavior under certain parameter values and initial conditions. The original model developed by Edward Lorenz in 1963^{9} is as follows:

$$\frac{dx}{dt}=\sigma(y-x),\qquad \frac{dy}{dt}=x(\rho-z)-y,\qquad \frac{dz}{dt}=xy-\beta z.$$

Lorenz used the parameter values $\sigma =10$, $\rho =28$, and $\beta =8/3$ to demonstrate an example of when the system exhibits chaotic behavior.

The Lorenz system is a dynamical system of the form (1), where the $h_j$ are given by the five different terms: $x$, $y$, $z$, $xy$, and $xz$. Note that we can rewrite the system as

$$\frac{d}{dt}\begin{bmatrix}x\\y\\z\end{bmatrix}=x\begin{bmatrix}-\sigma\\\rho\\0\end{bmatrix}+y\begin{bmatrix}\sigma\\-1\\0\end{bmatrix}+z\begin{bmatrix}0\\0\\-\beta\end{bmatrix}+xy\begin{bmatrix}0\\0\\1\end{bmatrix}+xz\begin{bmatrix}0\\-1\\0\end{bmatrix}.$$

The origin is a fixed point of the system for all values of $\sigma$, $\rho$, and $\beta$. We test the ability of our method to find a reduced model at this fixed point. We initialize the system close to the origin and study the system for $t\in[0,0.1]$ so that the system remains in a regime where the values of $x$, $y$, and $z$ are small. We experimented with many different randomly chosen initial conditions close to the origin and find in almost all cases that the smallest reduced model with less than 5% relative norm error is given by

$$\frac{dx}{dt}=\sigma(y-x),\qquad \frac{dy}{dt}=\rho x-y,\qquad \frac{dz}{dt}=-\beta z.$$

We plot the results for one such test in Fig. 1 with initial conditions $x(0)=0.2415$, $y(0)=0.1434$, and $z(0)=0.3260$. The reduced models correspond to $\lambda\in\{5,3,1,0.5,0.05\}$, which result in reduced models with 5, 4, 3, 2, and 1 terms, respectively. Note that the smallest reduced model before a large growth in error corresponds to omitting the nonlinear $xy$ and $xz$ terms from the Lorenz equations. Our method has thus automatically detected that a good reduced model for the Lorenz equations close to the origin is its linear approximation!
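This finding can be checked directly by integrating the full Lorenz system and its linearization at the origin from the same initial condition (a sketch of our own using SciPy; the error tolerance in the comment is ours, not the paper's metric):

```python
import numpy as np
from scipy.integrate import solve_ivp

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
full = lambda t, s: [sigma * (s[1] - s[0]),
                     s[0] * (rho - s[2]) - s[1],
                     s[0] * s[1] - beta * s[2]]
linear = lambda t, s: [sigma * (s[1] - s[0]),      # xy, xz terms dropped
                       rho * s[0] - s[1],
                       -beta * s[2]]
x0 = [0.2415, 0.1434, 0.3260]
t = np.linspace(0.0, 0.1, 50)
a = solve_ivp(full, (0.0, 0.1), x0, t_eval=t).y
b = solve_ivp(linear, (0.0, 0.1), x0, t_eval=t).y
rel = np.linalg.norm(a - b) / np.linalg.norm(a)
# near the origin the linearization tracks the full model closely
```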

There are a few interesting observations from this experiment. First, our model reduction method does indeed find the correct smallest reduced model consistently, and its ability to do so is not particularly sensitive to the initial conditions as long as they are close to the origin. Second, we notice that the next term that is removed from the system by our model reduction method is not the $z$ term, which might have been expected given that in the 3-term linear model $z$ has become separated from $x$ and $y$, so that close to the origin the additional error incurred by simply setting $z\equiv z(0)$ is small. Instead, it is the $y$ term that is removed first. If we compare these two 2-term models, the model that removed the $z$ term first has much lower training error over $t\in[0,0.1]$ (see Table I). This can make it seem that the 2-term reduced model found by our method is not the "optimal" 2-term model. However, on closer inspection, it turns out that when extrapolating the system to longer times, the error grows more slowly for the reduced model that our method found.

TABLE I. Training error and extrapolation error over increasing time horizons for the three possible 2-term reduced models.

| 2-term reduced model | Training err | $t\in[0,1]$ | $t\in[0,10]$ | $t\in[0,100]$ |
|---|---|---|---|---|
| $x$, $z$ ($\ell_1$ model) | 0.216 | 911 | $2.36\times10^{3}$ | $1.26\times10^{3}$ |
| $x$, $y$ | 0.0424 | $5.96\times10^{5}$ | $4.33\times10^{50}$ | $\infty$ |
| $y$, $z$ | 0.251 | 904 | $2.47\times10^{3}$ | $1.36\times10^{3}$ |


We can see this by directly computing the analytical solutions of the two reduced models. In the 2-term model that contains only the $x$ and $y$ terms, the analytical solution is

$$\begin{bmatrix}x(t)\\y(t)\end{bmatrix}=e^{At}\begin{bmatrix}x(0)\\y(0)\end{bmatrix},\qquad z(t)=z(0),$$

where

$$A=\begin{bmatrix}-\sigma&\sigma\\\rho&-1\end{bmatrix}.$$
For our choices of parameters, $\sigma=10$ and $\rho=28$, we have $\det A=\sigma(1-\rho)<0$, so $A$ has one positive eigenvalue; thus, the solution is unstable and exhibits exponential growth in extrapolation error from the solution of the true dynamical system. On the other hand, in the 2-term model that contains only the $x$ and $z$ terms, the analytical solution works out to

$$x(t)=x(0)\,e^{-\sigma t},\qquad z(t)=z(0)\,e^{-\beta t},\qquad y(t)=y(0)+\frac{\rho\,x(0)}{\sigma}\big(1-e^{-\sigma t}\big).$$

Then, it is clear that as $t\to\infty$, the system approaches a constant: $x(t),z(t)\to 0$ and $y(t)\to y(0)+\rho x(0)/\sigma$. For positive values of $\sigma$ and $\beta$, this is a globally stable system and has bounded extrapolation error, even though it seemed to have much worse training error. In Table I, we list how the error grows for the three possible 2-term models. For each model, we use the training data to solve for the debiased $\hat{k}^*$ as described above. We note that the 2-term model that contains only the $y$ and $z$ terms has the same structure as the 2-term model with only the $x$ and $z$ terms; the only difference is in the constants.
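The stability claims can be verified numerically (a quick check of our own):

```python
import numpy as np

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

# x,y-only model: linear system with matrix A; det(A) = sigma*(1 - rho)
# is negative for rho > 1, so A has one positive eigenvalue (unstable).
A = np.array([[-sigma, sigma], [rho, -1.0]])
eigs = np.sort(np.linalg.eigvals(A).real)
# x,z-only model: x and z decay exponentially (rates sigma and beta) and
# y tends to the constant y(0) + rho*x(0)/sigma, so that model is stable.
```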

### B. Chemical reaction networks

Model reduction of chemical reaction networks is an important problem in the combustion literature. Chemical reaction networks can be described by the reaction rate equations (RREs), which are a deterministic set of ordinary differential equations that give the concentration of chemical species continuously over time.^{10} The RRE is applicable to chemical systems that obey the law of mass action, under constant pressure and temperature conditions, and is a nonlinear dynamical system of the form (1).

Consider a chemical system with $M$ molecular species and $N$ distinct reactions. Let

$$x(t)=\big(x_1(t),\ldots,x_M(t)\big)^T$$

be the vector of concentrations of each molecular species at time $t$. Then, the RRE states that the instantaneous rate of change in the concentration $x_i$ of the $i$-th molecular species can be expressed as the sum of the changes caused by each reaction in the system for which the $i$-th molecular species is a reactant or product. The effect of each reaction is proportional to the concentrations of the reactants as well as the stoichiometry of the molecular species in the reaction. This can be summarized in vector notation as

$$\frac{dx}{dt}=\sum_{j=1}^{N}a_j(x(t))\,r_j.\qquad (7)$$

The column vectors $r_j\in\mathbb{R}^M$ correspond to the stoichiometry of reaction $j$, so that entries are negative if they correspond to reactant species and positive if they correspond to product species. Each $a_j$ is a function proportional to the concentration at time $t$ of the reactants in reaction $j$. For example, if reaction $j$ is unimolecular with species $i\to$ products, then $a_j(x)=k_j x_i(t)$, where $k_j$ is a reaction-specific rate constant. If reaction $j$ is bimolecular between species $i'$ and $i''$, so that $i'+i''\to$ products, then $a_j(x)=k_j x_{i'} x_{i''}$.

Since the functions $a_j$ are nonlinear in $x$ for any reaction that is not unimolecular, the RRE for most chemical reaction networks is a nonlinear dynamical system. However, we can factor out the rate constants $k_j$ from the nonlinear functions $a_j$, writing $a_j(x)=k_j h_j(x)$, to rewrite (7) as

$$\frac{dx}{dt}=\sum_{j=1}^{N}k_j\,h_j(x(t))\,r_j.$$

Consider the toy chemical reaction network

The corresponding reaction rate equations for this toy system are

At any given time $t$, we can write the rate of change in concentration as a linear function of the vector of reaction rate constants $k$,

$$\frac{dx}{dt}=F(x(t))\,k,$$

where the $j$-th column of $F(x)$ is $h_j(x)\,r_j$.

After choosing $m$ sampling times, we follow the same argument in (4) to obtain a linear system $\dot{x} = Fk$. Then, we take the crucial step of first scaling the coefficient matrix in that linear system by the known vector $k$ that parameterizes the full chemical reaction network. In other words, we perform the scaling

$$F \leftarrow F\, \operatorname{diag}(k)$$

to solve (5) with this rescaled matrix; the key property is that after rescaling we have $F\mathbf{1} = \dot{x}$. This step is important because the true reaction rates $k$ can span many orders of magnitude, and we would like our $\ell_1$-regularization constraint to apply to each reaction uniformly.
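The assembly and rescaling steps can be sketched in a few lines. The example below reuses a hypothetical two-reaction network (the matrix, samples, and rates are illustrative, not from the paper); the point is that after column rescaling by the known $k$, the full model corresponds to the all-ones coefficient vector:

```python
import numpy as np

# Hypothetical network A -> B, B + B -> C with true rates k = (k1, k2).
# Column j of F(x) is r_j * b_j(x), where b_j is a_j with the rate
# constant k_j factored out.
R = np.array([[-1.0,  0.0],
              [ 1.0, -2.0],
              [ 0.0,  1.0]])
k = np.array([0.5, 0.2])

def F_of(x):
    b = np.array([x[0], x[1] ** 2])  # b_j(x) = a_j(x) / k_j (mass action)
    return R * b                     # scale column j of R by b_j(x)

# Stack the per-sample blocks into one tall linear system xdot = F k.
samples = [np.array([1.0, 0.0, 0.0]),
           np.array([0.6, 0.3, 0.05])]
F = np.vstack([F_of(x) for x in samples])
xdot = F @ k  # instantaneous rates of change at the sampled states

# Rescale columns by the known k, so that F_scaled @ 1 == xdot:
# the full model is now the all-ones coefficient vector.
F_scaled = F * k  # same as F @ np.diag(k)
```

Because the rescaled columns all carry unit "true" coefficients, an $\ell_1$ penalty on the coefficient vector treats fast and slow reactions uniformly, regardless of the magnitude of $k_j$.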

We use a medium-sized reaction mechanism to demonstrate the efficiency of our data-driven approach to model reduction. The mechanism describes propane pyrolysis and effectively involves 89 reactions and 32 molecular species (originally 98 reactions, but 9 reactions are eliminated since their reactants have no formation steps and are not initially present in the system). We compare our reduction method to two works that have studied the same system previously. This network was studied by Turanyi in 1990,^{18} where model reduction was performed using principal component analysis on the rate sensitivity matrix; Turanyi’s method first identified important and then necessary species in an iterative manner. Then, in 2013, the system was revisited by Hannemann-Tamás *et al.*, who used mixed-integer quadratic programming (MIQP).^{17} Both of these methods require careful parameterization. The MIQP method was also too computationally expensive to use directly and required a prefiltering step that first reduced the model to fewer than 40 reactions. In contrast, our method has a single significant parameter, $\lambda$, to trade off between sparsity and error, and it scales polynomially with the number of reactions in the full mechanism. In Fig. 2, the plotted reduced models correspond to 41 evenly spaced values of $\lambda$ between 10 and 90. We see that our method finds a model at the 5% error threshold that is close in size, at around 30 reactions, to the models found by both the sensitivity matrix and prefiltering $+$ MIQP methods, a two-thirds reduction of the network.
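The optimization problem (5) is not reproduced in this excerpt. As a minimal stand-in, assuming (5) is a nonnegative least-squares fit of the rescaled system $F\mathbf{1} = \dot{x}$ with an $\ell_1$ penalty weighted by $\lambda$, the sketch below (projected ISTA on a hypothetical small matrix) shows how sweeping $\lambda$ traces out the sparsity/error trade-off:

```python
import numpy as np

# Projected ISTA for  min_{w >= 0}  0.5*||F w - y||^2 + lam*||w||_1,
# a minimal stand-in for the l1-regularized problem (5).
def nonneg_lasso(F, y, lam, iters=5000):
    t = 1.0 / np.linalg.norm(F, 2) ** 2   # step size from the Lipschitz constant
    w = np.ones(F.shape[1])               # start from the full model (all ones)
    for _ in range(iters):
        grad = F.T @ (F @ w - y)
        # gradient step, soft-threshold by t*lam, project onto w >= 0
        w = np.maximum(w - t * (grad + lam), 0.0)
    return w

# Hypothetical rescaled data matrix; the full model is w = 1, so y = F @ 1.
F = np.array([[1.0, 0.1],
              [0.2, 1.0],
              [0.5, 0.3]])
y = F @ np.ones(2)

w_dense = nonneg_lasso(F, y, lam=0.0)   # lam = 0 recovers the full model
w_sparse = nonneg_lasso(F, y, lam=2.0)  # large lam drives coefficients to zero
```

Sweeping `lam` over a grid and counting the surviving nonzero coefficients yields the family of reduced models plotted against error, analogous to the 41 values of $\lambda$ used in Fig. 2.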

Reducing the number of reactions in the network also leads to a reduction in the number of molecular species involved in those reactions. From Fig. 3, we see that our method finds reaction networks that consistently involve 23 species, even as the number of reactions decreases. Both the sensitivity matrix and prefiltering $+$ MIQP methods find reduced models involving fewer species.

We note that the prefiltering $+$ MIQP and the sensitivity matrix methods determine quite different sets of reactions from the $\ell_1$-regularization method. The prefiltering $+$ MIQP model contains 23 reactions, which happen to be a subset of the 34 reactions found by the sensitivity matrix method. By contrast, there are many differences between these models and the $\ell_1$-models of approximately the same size and relative norm error.

In the following discussion, we use for illustration the 29-reaction model found by $\ell_1$-regularization, which is the smallest model with less than 5% relative error found by that method. Nineteen reactions appear in both this model and the 23-reaction prefiltering $+$ MIQP model. The four reactions that appear in the 23-reaction prefiltering $+$ MIQP model but not in the 29-reaction $\ell_1$-model are given in Table II. Note that $1\text{-}C_3H_7$ indicates a particular isomer of $C_3H_7$ given by locant 1.

**Table II.** Reactions in the 23-reaction prefiltering $+$ MIQP model that are absent from the 29-reaction $\ell_1$-model.

- $C_3H_8 + 2\text{-}C_3H_7 \to 1\text{-}C_3H_7 + C_3H_8$
- $C_3H_8 + C_3H_5 \to 1\text{-}C_3H_7 + C_3H_6$
- $C_3H_6 + CH_3 \to C_3H_5 + CH_4$
- $2\text{-}C_3H_7 + 2\text{-}C_3H_7 \to C_3H_6 + C_3H_8$

On the other hand, the ten reactions listed in Table III appear in the 29-reaction $\ell_1$-regularization model but not in the 23-reaction prefiltering $+$ MIQP model. Many of them were removed in the prefiltering step before the MIQP was solved.

**Table III.** Reactions in the 29-reaction $\ell_1$-regularization model that are absent from the 23-reaction prefiltering $+$ MIQP model.

- $1\text{-}C_6H_{12} \to 1\text{-}C_3H_7 + C_3H_5$
- $H + C_3H_6 \to 2\text{-}C_3H_7$
- $CH_3 + CH_3 \to C_2H_6$
- $CH_3 + 1\text{-}C_3H_7 \to n\text{-}C_4H_{10}$
- $C_2H_5 + 2\text{-}C_3H_7 \to$ 2-methylbutane
- $1\text{-}C_3H_7 + 2\text{-}C_3H_7 \to$ 2-methylpentane
- $1\text{-}C_3H_7 + C_3H_5 \to 1\text{-}C_6H_{12}$
- $2\text{-}C_3H_7 + 2\text{-}C_3H_7 \to$ product
- $C_2H_5 + C_2H_5 \to n\text{-}C_4H_{10}$
- $C_2H_5 + 2\text{-}C_4H_9 \to 2\text{-}C_4H_8 + C_2H_6$

One important and interesting point illustrated by the above is that when evaluating reduced models, especially for large chemical reaction networks where only a small number of important molecular species are tracked, there may be many possible “reduced networks” satisfying a particular error threshold. Our ultimate choice of reduced model may be based on a number of factors. For example, we may simply want the model that results in the smallest state space. In this case, we might, for example, choose the prefiltering $+$ MIQP model given above. However, usually we are interested in using the reduced model in regions it was not trained on. In that case, we may want to choose the model that has the best extrapolation properties, whether in time or initial state. Crucially, our analysis shows that reduced models that at first glance may seem to “miss” well-known chemical mechanisms do not necessarily do worse at extrapolating.

We see in Fig. 4 that both the sensitivity matrix and prefiltering $+$ MIQP models seem optimized for a particular period of time: the MIQP model is highly optimized for the time period it was trained on, while the sensitivity matrix model is optimized for about an order of magnitude longer. We choose two comparable models determined by our $\ell_1$-regularization method with around the same number of reactions and approximately the same relative norm error. The smaller $\ell_1$-reduced model extrapolates better than the sensitivity matrix model, to two orders of magnitude longer in time, despite containing fewer reactions. The two $\ell_1$-reduced models have extrapolation error comparable to the prefiltering $+$ MIQP model, although their error does not begin to grow until after an order of magnitude in time. Note that propane pyrolysis, the chemical process being modeled in this reaction network, reaches completion after about $10^6$ s.

Furthermore, we note that our reduced models can be less sensitive to perturbations in initial conditions than both the sensitivity matrix and prefiltering $+$ MIQP models. The original initial conditions contain only propane ($1.912 \times 10^{-3}\,\mathrm{mol\,dm^{-3}}$ initial propane concentration). In Fig. 4, we see that if we decrease the initial propane by a factor of 10, our best reduced models at around 30 reactions, trained on the original initial conditions, have lower error than both the other models. If, on the other hand, we increase the initial propane by a factor of 10, the sensitivity matrix model is best, but our models are still better than the prefiltering $+$ MIQP model. Finally, if we change the initial concentration to have two orders of magnitude more propane and also include small amounts of $CH_3$ and $H_2$, a practical possibility in combustion systems, all reduced models are much worse than before; however, the MIQP and sensitivity matrix models both exhibit catastrophic failure, whereas our models remain within reasonable error (Fig. 5).

Our model reduction algorithm finds a family of reduced models whose relative norm error decreases approximately monotonically with increasing model size, even though the objective function in (5) never directly minimizes the relative norm error. To provide empirical intuition for this behavior, we show in Fig. 6 that the relative norm error (6) is well correlated with the objective error of our algorithm when we choose the set of significant species $i \in C$ to be the six physically relevant ones (propane and the main products of propane pyrolysis^{18}). However, if we instead compute the relative norm error over all species, there is a subset of reduced models for which it is roughly correlated with the objective error, after which the error plateaus. This is because reduced models necessarily eliminate some of the species, each of which then contributes a constant value to the relative norm error.

Since our model reduction method is formulated as a convex optimization problem, specifically a quadratic program, it is fast and can in practice handle reaction networks with thousands of reactions.^{31} Due to the computational efficiency of our algorithm, we can compute a family of reduced models that describe propane pyrolysis over a range of temperatures. The sizes of the resulting models are plotted in Fig. 7. By construction, the reduced models that we find represent an upper bound on the total number of reactions necessary to capture the dynamics of the key molecular species at the given temperature. We can therefore draw a convex hull below the data points to reveal the approximate rate of growth in the complexity of the chemical reaction network as a function of temperature.

### C. Comparison with SINDy

As noted previously, several methods leveraging $\ell_1$-regularization have been proposed in the recent literature to solve similar problems, perhaps most notably SINDy for discovering governing equations from data.^{19} We discussed above how a naïve application of SINDy is insufficient for multiscale systems such as chemical reaction networks, where the parameters of interest may vary across many orders of magnitude and are also subject to physical constraints. Here, we demonstrate applying SINDy to model reduction of our propane pyrolysis reaction network, with a few modifications. First, we add the physical constraint that $\hat{k} \ge 0$, since no reaction rate can be negative. Second, in addition to normalizing the columns of $F$ as recommended by the authors of SINDy, we also normalize the rows of $F$ as discussed above to handle differences in scale of the derivatives of the $x_i$. Finally, we use $\epsilon = 0.01$ as the threshold for retaining components $j$ such that $\hat{k}_j \ge \epsilon$, noting that without an $\epsilon$ threshold the model is barely reduced before the error grows out of a reasonable range. The key differences of this SINDy-type problem from our proposed method are that: (i) we do not first scale the problem by the true reaction rate parameters, since it is assumed these are not known *a priori*; and (ii) we do not debias the solution. It is important to note that the row normalization and $\epsilon$ thresholding, which are **not** part of the originally proposed SINDy method, are key to finding a reasonable reduced model for this dynamical system. Even with these additional adjustments to SINDy, however, we find, as shown in Fig. 8, that the results are unsatisfactory for finding meaningfully reduced reaction networks.
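For concreteness, the preprocessing and thresholding just described can be sketched as sequentially thresholded least squares with a nonnegativity clip. The matrix, the true coefficients, and the helper `stlsq_nonneg` below are hypothetical illustrations under these assumptions, not the paper's actual code:

```python
import numpy as np

# Hypothetical small system: column 1 of k_true is truly zero.
F = np.array([[1.0, 0.2, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.3, 1.0],
              [0.5, 0.1, 0.4]])
k_true = np.array([1.0, 0.0, 0.5])
xdot = F @ k_true

# Row normalization: put each sampled derivative on a comparable scale.
r = np.linalg.norm(F, axis=1, keepdims=True)
Fn = F / r
yn = xdot / r.ravel()

# Column normalization (as recommended for SINDy); keep the scales so the
# recovered coefficients can be mapped back to the original units.
c = np.linalg.norm(Fn, axis=0)
Fn = Fn / c

def stlsq_nonneg(A, y, scale, eps=0.01, iters=10):
    # Sequentially thresholded least squares with a k >= 0 clip and an
    # eps cutoff for retaining components j with k_j >= eps.
    k = np.linalg.lstsq(A, y, rcond=None)[0] / scale
    for _ in range(iters):
        k = np.clip(k, 0.0, None)
        active = k >= eps
        k[~active] = 0.0
        if not active.any():
            break
        k[active] = np.linalg.lstsq(A[:, active], y, rcond=None)[0] / scale[active]
    return np.clip(k, 0.0, None)

k_hat = stlsq_nonneg(Fn, yn, c)
```

Unlike the method proposed in this work, this SINDy-style fit never uses the true rate vector $k$, so its coefficients are not uniformly penalized across reactions of very different speeds.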

### D. Discussion of validation and test error

We note that in the traditional context of data-driven methods and machine learning, choosing a linear model learned from sampled data points would require a validation set, and the chosen model should afterwards be evaluated on a test set. However, this is not necessary in our case because we know the exact full model and can therefore compute a true error, for example, the relative norm error given in (6). One might also be interested in whether our reduced model extrapolates to regions it was not trained on, that is, whether it extrapolates in time or to different initial states (noting that extrapolation in time is equivalent to initializing at a different state along the trajectory); this, too, can be determined exactly by comparing with the full model. Furthermore, from an application point of view, it may be neither necessary nor desirable to extrapolate too far from the training data, since we would not expect a reduced model for propane pyrolysis to also be the appropriate reduced model for, say, methane pyrolysis. As we saw in the family of temperature-dependent reduced models for propane pyrolysis above, as the temperature increased, the size of the best reduced model tended to increase as well. However, in order to effectively use our reduced models as part of multiscale codes, we would like to find models that have stable error under a reasonable amount of extrapolation; the requirements will be different for every application. For a given desired range of extrapolation in initial states, one possible solution is to use a combination of data from several trajectories with different initial conditions to construct our linear system (4) and find reduced models that are applicable across them all.
In that case, the prediction error for new initial conditions can be estimated as typically done for data-driven learning problems, by using a collection of test trajectories to measure the expected performance of the model for new problems.

## V. CONCLUSIONS

In this work, we have developed an efficient method for model reduction of nonlinear dynamical systems of the form (1) that is capable of reducing systems with thousands of components in a matter of minutes. The key to our method is exploiting the linearity of the dynamics with respect to the components $f_j$ of interest. We use data sampled from the solution of the full dynamical system to capture a series of linear systems that describe the instantaneous rate of change of the system. From these samples we find a sparse set of linear coefficients, with as many coefficients equal to zero as possible so that the corresponding components can be eliminated from the system, while preserving the dynamics of key state variables $x_i$, $i \in C$. To induce sparsity in a computationally efficient way, we use $\ell_1$-regularization. We show that our method finds a reduced model of the Lorenz equations near the origin that corresponds to its linear approximation. We also find that the method quickly, and with minimal parameterization, finds a significantly reduced chemical reaction network for modeling propane pyrolysis, and furthermore that the reduced model maintains a low approximation error over several orders of magnitude in time and extrapolates better to perturbed initial conditions than comparable methods. The computational efficiency of our method enables us to study how the complexity of chemical reaction networks changes as parameters such as temperature are varied. We hope that our method will find use for a wide range of dynamical systems, from combustion to biochemistry, atmospheric chemistry, and even very-large-scale integration circuit design, and will reduce the amount of time and human expertise required to develop reduced models.

## ACKNOWLEDGMENTS

This work was supported by the Department of Energy National Nuclear Security Administration under Award No. DE-NA0002007 and the National Science Foundation under Grant No. DMR-1455050. C.S.L. was partially funded by the Fondecyt Iniciacion (Grant No. 11160728).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## REFERENCES
