The complexity of mathematical models in biology has rendered model reduction an essential tool in the quantitative biologist’s toolkit. For stochastic reaction networks described using the Chemical Master Equation, commonly used methods include time-scale separation, Linear Mapping Approximation, and state-space lumping. Despite the success of these techniques, they appear to be rather disparate, and at present, no general-purpose approach to model reduction for stochastic reaction networks is known. In this paper, we show that most common model reduction approaches for the Chemical Master Equation can be seen as minimizing a well-known information-theoretic quantity between the full model and its reduction, the Kullback–Leibler divergence defined on the space of trajectories. This allows us to recast the task of model reduction as a variational problem that can be tackled using standard numerical optimization approaches. In addition, we derive general expressions for propensities of a reduced system that generalize those found using classical methods. We show that the Kullback–Leibler divergence is a useful metric to assess model discrepancy and to compare different model reduction techniques using three examples from the literature: an autoregulatory feedback loop, the Michaelis–Menten enzyme system, and a genetic oscillator.

## NOMENCLATURE

## I. INTRODUCTION

Stochastic biochemical reaction networks, such as those involved in gene expression, immune response, or cellular signaling,^{1–4} are often described using the Chemical Master Equation (CME). The CME describes the dynamics of biochemical processes on a mesoscopic level, viewing them as a discrete collection of molecules interacting and undergoing reactions stochastically; as such, it is generally considered more accurate than continuum approximations, such as rate equations and the Chemical Langevin Equation.^{4} Despite its explanatory power, the CME poses significant analytical and computational difficulties to modelers that have limited its use in practice. Closed-form solutions to the CME are difficult to obtain and are only known for a small number of biologically relevant systems, and solving the CME numerically requires using approximations such as the Finite State Projection (FSP).^{5} Numerical approaches tend to scale poorly with the number of species and reactions present in a system, and as a result, there is significant interest in finding ways to simplify a description of a stochastic reaction network that make it easier to analyze and study—this is the goal of model reduction.

Model reduction for deterministic and continuum-limit models in biology is an active research topic,^{6,7} but very few existing methods can be applied to the discrete, stochastic setting of the CME. The Quasi-Steady State Approximation (QSSA) is perhaps the best known technique, first considered in the stochastic case in Ref. 8. Here the system is partitioned into “slow” and “fast” species such that the fast species evolve very quickly on the timescale of the slow species. On the slow timescale, the states of the fast species can therefore be approximated by their steady-state value (conditioned on the slow species), effectively allowing for a description of the system in terms of the slow species only. By its nature, the QSSA is only applicable to systems with a clear separation of timescales between species, the existence of which cannot always be established. The QSSA for stochastic systems is generally believed to require more stringent conditions than in the deterministic case, but the exact validity conditions are not well-understood.^{9–14} Recent work^{15} has analyzed a modification of the QSSA, the total QSSA (tQSSA), in systems involving reversible binding of molecules and shown that it has a wide range of applicability in the stochastic case.

Similar to the QSSA is the Quasiequilibrium Approximation (QEA), which was first considered in Refs. 16 and 17 for stochastic reaction networks. Here the reaction network is decomposed into “slow” and “fast” *reactions*, and fast reactions are assumed to equilibrate rapidly on the timescale of slow reactions. Similar to the QSSA, the QEA can be used to reduce the number of species and reactions in a system, but it relies on the existence of a clear timescale separation between reactions, which is not always present for large systems with many distinct reactions. Much like the QSSA, the validity of the QEA for systems without the appropriate timescale separation has not been generally established, and from the asymptotic nature of descriptions, it is not usually possible to quantify the approximation error. Despite this, both the QSSA and the QEA are by far the most commonly used model-reduction technique for chemical reaction networks owing to their physical interpretability and analytical tractability, most famously in the Michaelis–Menten model of enzyme kinetics.

A distinct approach for model reduction with the Chemical Master Equation is state-space lumping, which originates from the theory of finite Markov chains; see for example Ref. 18. Here different states in a system are lumped together such that the coarse-grained system is still Markovian and can be modeled using the CME. For a typical biochemical reaction network, it may not be possible to perform any nontrivial lumping while preserving Markovianity, whence approximate lumping methods have been considered, e.g., in Refs. 19–21. Here the coarse-grained system is approximated by a Markov process, and the approximation error is quantified using the Kullback–Leibler (KL) divergence between the original model and a lift of the approximation to the state space of the original model. State-space lumping for the CME often occurs in the context of the QEA as states related by fast reactions are effectively lumped together or averaged.^{22–24} For this reason, we will not consider this approach separately, although many of our results, such as the optimal form of the lumped propensity functions, extend to state-space lumping.

Finally, a more recent model reduction technique specifically for gene expression systems is the Linear Mapping Approximation (LMA).^{25} The LMA replaces bimolecular reactions of the form *G* + *P* → (⋯), where *G* is a binary species such as a gene, by a linear approximation where *P* is replaced by its mean conditioned on [*G*] = 1. While the LMA does not reduce the number of species or reactions, linear reaction networks are generally easier to analyze: their moments can be computed exactly, and closed-form solutions are known for many cases.^{26–29}

At a first glance, these approaches—timescale separation, state space lumping, and the Linear Mapping Approximation—seem rather disparate, and it is unclear what, if any, relationships exist between them. In this paper, we show that all these methods can be viewed as minimizing a natural information theoretic quantity, the Kullback–Leibler (KL) divergence, between the full and reduced models. In particular, they can be seen as maximal likelihood approximations of the full model, and one can assess the quality of the approximation in terms of the resulting KL divergence. Based on these results, we show how the KL divergence can be estimated and minimized numerically, therefore providing an automated method to choose between different candidate reductions of a given model in situations where none of the above model reduction techniques are classically applicable. In its full generality, the Chemical Master Equation describing a reduced model can be obtained by marginalization of the original CME, and hence, our approach recovers the method based on conditional expectations presented in Ref. 30.

The KL divergences we consider in this paper are computed on the space of trajectories and as such include both static information and dynamical information in contrast to purely distribution-matching approaches. The KL divergence and similar information-theoretic measures between continuous-time Markov chains have previously been considered in Refs. 31 and 32 in the context of variational inference (with the true model and the approximation reversed compared to our approach), in Refs. 33 and 34 to obtain approximate non-Markovian reductions, in Ref. 35 to analyze the information flow for stochastic reaction networks, and in Ref. 36 in quantifying model discrepancy for Markovian agent-based models.

In Sec. II, we introduce the mathematical framework within which we consider model reduction for the Chemical Master Equation based on KL divergences between continuous-time Markov chains. We show how the KL divergence can be minimized analytically in some important cases, recovering standard results in the literature and providing a mathematical justification for commonly used mean-field arguments as in the QSSA, the QEA, and the LMA. We furthermore provide numerical algorithms for estimating and minimizing the KL divergence in cases where analytical solutions are not available. In Sec. III, we illustrate the use of KL divergences as a metric for approximation quality using three biologically relevant examples: an autoregulatory feedback loop exhibiting critical behavior; Michaelis–Menten enzyme kinematics, where we reanalyze the QSSA and the QEA; and an oscillatory genetic feedback system taken from Ref. 11, for which we compare different reductions using our approach. Finally, in Sec. IV, we discuss our observations and how our approach could be used as a stepping-stone toward automated reduction of complex biochemical reaction pathways.

## II. METHODS

### A. Stochastic reaction networks

The Chemical Master Equation describes a biochemical reaction network as a stochastic process on a discrete state space $X$. We will use the letter *q* to denote such a stochastic process, which for the purposes of this paper can be seen as a probability distribution over *trajectories* on $X$. For a biochemical reaction network, the state space will be $X=Ns$, where *s* is the number of species in the system: every state consists of a tuple ** n** = (

*n*

_{1}, …,

*n*

_{s}) of

*s*integers describing the abundances of each species.

Since the model *q* can consist of many species interacting in complicated ways, we often want to find a reduction that is more tractable, yet approximates *q* as closely as possible. The reduced model, which we will call *p*, should be of the same form as *q*, i.e., described by the CME, but will typically involve fewer species and simpler reactions. In particular, *p* can be defined on a lower-dimensional state space $X\u0303$. A state ** n** in the original model can then be described by its projection $n\u0303$ onto this lower-dimensional space, together with some unobserved components, which we will denote

**. See Figs. 1(a) and 1(b) for an illustration.**

*z*In this paper, we assume that the basic structure of *p* is known *a priori*, i.e., the species and reactions we wish to retain are fixed. Our approach to model reduction, therefore, consists in finding optimal propensity functions for the reduced model, and we shall see how this can give rise to various known approximations, such as the QSSA, the QEA, or the LMA depending on what reductions are performed. We will return to the related problem of choosing the structure of the reduced model *p* in the discussion.

The original model *q* defines a probability distribution over trajectories in $X$, and projecting each trajectory onto the chosen reduced state space, we get the (exact) projection of *q* onto this space, which we denote $q\u0303$ [Fig. 1(a)]. This is a stochastic process on $X$ that is generally not Markovian and thus cannot be modeled using the CME. We aim to find a tractable approximation *p* to $q\u0303$ that can be described using the CME, and we will do this by minimizing the KL divergence $KL(q\u0303\Vert p)$ between the two models on the space of trajectories. Several well-known examples of model reduction for the CME are illustrated in Fig. 2.

Jumps in *q* come in two kinds: those that affect the observed species $n\u0303$, which we will call visible jumps, and those that only change ** z**, which we call hidden. The jumps in $q\u0303$ correspond to visible jumps in

*q*. In the context of the CME, jumps are typically grouped into reactions with fixed stoichiometry, often also called reaction channels, and we can similarly distinguish visible and hidden reactions in

*q*. We will always assume that different reactions have different stoichiometries so that every jump in

*q*and

*p*corresponds to a unique reaction. Reactions with the same stoichiometry can always be combined by summing their propensities.

We introduce some more notation at this point, which is summarized in the nomenclature and illustrated in Fig. 1(c). A single realization, or trajectory, of *q* is defined by the sequence of states $n0,n1,n2,\u2026\u2208X$ visited and jump times 0 < *t*_{1} < *t*_{2} < ⋯. We will write *n*_{[0,T]} = {** n**(

*t*)}

_{0≤t≤T}for a trajectory, where

**(0) =**

*n*

*n*_{0}and

**(**

*n**t*) =

*n*_{i}for

*t*

_{i}≤

*t*<

*t*

_{i+1}, and denote by

*τ*

_{i}=

*t*

_{i+1}−

*t*

_{i}the waiting times between jumps.

For a continuous-time Markov process *p*, e.g., one defined using the CME, we denote the transition rate from state ** n** to

**≠**

*m***by**

*n**p*

_{m←n}. We let

*p*

_{←n}= ∑

_{m≠n}

*p*

_{m←n}be the total transition rate out of

**. The transition probabilities at**

*n***are then given by**

*n**p*

_{m|n}=

*p*

_{m←n}/

*p*

_{←n}. For completeness we define

*p*

_{n|n}≔ 0.

### B. Minimizing the KL divergence

*p*is defined as

*p*is therefore equivalent to minimizing the cross-entropy. Alternatively, it is equivalent to maximizing the average log-likelihood under

*p*of samples drawn from $q\u0303$, which is a statistical inference problem.

*p*are trajectories. We therefore need to compute the log-likelihood of an arbitrary trajectory

*n*_{[0,T]}under

*p*, which by assumption is a stochastic reaction network described by the CME. In Appendix A, we show that the log-likelihood for a trajectory visiting the states

*n*_{0},

*n*_{1}, …,

*n*_{k}with waiting times

*τ*

_{0},

*τ*

_{1}, …,

*τ*

_{k}is given by

Note that computing the likelihood of a fully observed trajectory is easier than computing the likelihood of a trajectory sampled at discrete time points, which is the form of data typically observed in biological experiments. The difference is that computing the likelihood in the latter case requires integrating over all possible courses of the trajectory between observations; this integral is computed implicitly by the Chemical Master Equation, which is generally hard to solve. For a fully observed trajectory, there are no hidden variables to integrate out, which considerably simplifies the task of computing likelihoods.

*N*trajectories $n\u0303[0,T](i)$,

*i*= 1, …,

*N*, from $q\u0303$ and computing the estimate

*p*. Formulas for the gradients can be computed by differentiating Eq. (4) by hand or by using automatic differentiation software. Minimizing the estimated cross-entropy via a gradient descent yields a probabilistic estimate for optimal parameters, which will generally converge to the true solution as

*N*→ ∞. Summarizing the contents of this section we arrive at Algorithm 1 for automatically fitting the optimally reduced model.

Inputs: number of simulations N, simulation length T, full model q, model family {p_{θ}}, learning rate η |

Output: reduced model parameters $\theta \u0302$ |

for all i = 1, …, N do |

sample $n[0,T](i)$ from q |

project $n[0,T](i)$ to $n\u0303[0,T](i)$ |

end |

initialize parameters $\theta \u0302$ |

while not converged do |

compute $H\u0302(q\u0303;p\theta \u0302)$ using (4), (5) |

compute $\u2207\theta H\u0302(q\u0303;p\theta \u0302)$ |

$\theta \u0302\u2261\eta \u2207\theta H\u0302(q\u0303;p\theta \u0302)$ |

return $\theta \u0302$ |

Inputs: number of simulations N, simulation length T, full model q, model family {p_{θ}}, learning rate η |

Output: reduced model parameters $\theta \u0302$ |

for all i = 1, …, N do |

sample $n[0,T](i)$ from q |

project $n[0,T](i)$ to $n\u0303[0,T](i)$ |

end |

initialize parameters $\theta \u0302$ |

while not converged do |

compute $H\u0302(q\u0303;p\theta \u0302)$ using (4), (5) |

compute $\u2207\theta H\u0302(q\u0303;p\theta \u0302)$ |

$\theta \u0302\u2261\eta \u2207\theta H\u0302(q\u0303;p\theta \u0302)$ |

return $\theta \u0302$ |

This algorithm has the advantage of being completely general, but unlike standard model reduction algorithms, it involves numerical optimization. In Sec. II C, we will analyze our loss function more closely to obtain alternative expressions for the optimum, allowing us to bypass Algorithm 1 for a wide range of problems.

Instead of minimizing $KL(q\u0303\Vert p)$, an alternative approach would minimize the opposite KL divergence $KL(p\Vert q\u0303)$. The latter has been analyzed, e.g., in Refs. 31 and 32 for continuous-time Markov processes. Despite the almost symmetric definition, it is well-known that these two quantities behave rather differently in an optimization context and lead to very different results; cf. Ref. 37. From a pragmatic point of view, minimizing $KL(q\u0303\Vert p)$ involves computing likelihoods under *p*, while minimizing $KL(p\Vert q\u0303)$ involves computing probabilities of trajectories under $q\u0303$, which is significantly more difficult.

#### 1. Example: Telegraph model

We will illustrate the above by means of a well-known example, the telegraph model of gene transcription,^{38} simplified to neglect degradation and depicted in Fig. 3. The full model *q* consists of three species: a gene in the on state (*G*_{on}) and the off state (*G*_{off}), together with mRNA (*M*). The gene switches between both states with constant activation rate *σ*_{on} and inactivation rate *σ*_{off}, and in the active state produces mRNA with rate *ρ*_{on}. The reduced model *p* consists only of mRNA (*M*), which is produced at constant rate *ρ*_{eff}. This instructive example has previously been treated in Ref. 33 in a similar context.

A state in the full model can be represented as ** n** = (

*g*,

*m*), where

*g*is the gene state (on or off) and

*m*is the number of mRNAs present, while a reduced state $n\u0303=(m)$ is given by the number of mRNA molecules only, with the unobserved component

**= (**

*z**g*) being the gene state. The only reaction in

*q*that descends to $q\u0303$ is mRNA production, whereas the two gene switching reactions are not observed. Note that the projection $q\u0303$ of the telegraph model is not Markovian as the instantaneous mRNA production rate depends on the current gene state

*g*, but we can sample from $q\u0303$ by simulating

*q*and discarding the information about the gene state.

*p*is a Poisson process whose only parameter is the production rate

*ρ*

_{eff}. The only trajectories possible under

*p*are those starting at

*m*= 0 and increasing by one at every jump. The log-likelihood of such a trajectory $n\u0303[0,T]$ under

*p*is given by

*T*, as can be verified using Eq. (4). The log-likelihood of any other trajectory is −∞.

Note that Eq. (6) describes the likelihood for an entire trajectory, as opposed to the likelihood of observing *n*_{T} molecules at time *T*. The latter follows a Poisson distribution with rate *ρ*_{eff}*T* and can be obtained by integrating Eq. (4) over all trajectories that end at the same *n*_{T}. Indeed, conditioned on observing *n*_{T} mRNA molecules at time *T*, their individual production times are uniformly and independently distributed on [0, *T*] by the properties of the Poisson process, and integrating Eq. (6) over all possible combinations of production times, we recover the usual Poisson likelihood, keeping in mind that any permutation of the production times yields the same trajectory.

The reduced model *p* can model every trajectory obtained from $q\u0303$, so the log-likelihood of any sample from $q\u0303$ is finite. The mRNA transcription reactions in both models correspond. If *q* were to include, e.g., mRNA degradation as is usual in the literature, some trajectories from the full model would feature decreases in $n\u0303$ and be impossible under *p*; in this case, the KL divergence would infinite, which signals that the reduced model *p* is not appropriate and needs to be extended to include degradation.

*p*only depends on the total mRNA produced. We can therefore compute the cross-entropy explicitly,

*ρ*

_{eff}yields the optimum

*T*]. Thus the optimal approximation to the telegraph model is obtained by setting the mRNA production rate to its mean value. This is the result we obtain by applying the QEA to this system (see Sec. II E 2), which suggests that the approximation will be better if

*σ*

_{on}and

*σ*

_{off}are large compared to

*ρ*

_{on}. We will see how to evaluate the approximation quality in Sec. II D.

Even if the analytical minimization above was not possible, the above minimization can be performed using Algorithm 1. To do so, we start by sampling a fixed number *N* of mRNA trajectories from the telegraph model; the more the trajectories we use, the better our estimate of *ρ*_{eff} will be. The next step is constructing the cross-entropy loss function for the reduced model, which only depends on one parameter *ρ*_{eff} and is given by Eq. (7). We then perform the gradient descent on the cross-entropy, averaged over all trajectories, to arrive at our estimate of *ρ*_{eff}. Here the learning rate *η* in Algorithm 1 has to be determined by experimentation. The code accompanying this paper contains an example implementation of this procedure in Julia.

### C. Analyzing the KL divergence

In Sec. II B, we have derived a numerical procedure to minimize the KL divergence (1) or, equivalently, the cross-entropy (3). While this yields a direct computational procedure to fit reduced models, we are equally interested in the theoretical insights to be gained from considering model reduction as a variational problem. Understanding how our approach works in more detail will help us establish links with existing model reduction techniques and derive general principles that can help us generalize these to new systems.

*p*, the cross-entropy can be written as

*t*defined as

*t*will be affected by the history of the process, and hence on the initial distribution

*q*

_{0}. If $q\u0303$ is Markovian, Eq. (12) reduces to the classical transition rate, which is independent of

*q*

_{0}.

Looking at Eq. (9), we make two important observations. As intimated in the discussion of the telegraph example of Sec. II B 1, if any trajectory under $q\u0303$ has jumps that are not allowed under *p*, i.e., if $q\u0303m\u0303\u2190n\u0303\u22600$ while $pm\u0303\u2190n\u0303=0$, then the cross-entropy (9) will be infinite: the reduced model *p* is not flexible enough to model $q\u0303$. On the other hand, if *p* contains transitions that are impossible under $q\u0303$, i.e., if $pm\u0303\u2190n\u0303\u22600$ while $q\u0303m\u0303\u2190n\u0303=0$, then these can only increase the cross-entropy (9). This means that the optimal reduced model *p* does not contain transitions beyond those in $q\u0303$, and in the context of the CME we can therefore assume that *p* and $q\u0303$ have the same reactions, with different propensities (Markovian for *p*, not necessarily for $q\u0303$).

*i*th reaction in

*p*has propensity function $\rho i(n\u0303;t)$ and net stoichiometry

*s*_{i}we obtain the following decomposition of the cross-entropy rate at time

*t*:

*p*or, equivalently, all visible reactions in

*q*. The total cross-entropy is obtained by integrating Eq. (13) over [0,

*T*], and we can find the optimal

*p*by optimizing the cross-entropy for each reaction separately.

*q*is Markovian. Assume that there is precisely one reaction in

*q*with net stoichiometry $si\u0303$ in the projection, and let

*σ*

_{i}be its propensity function. Differentiating the above equation with respect to $\rho i(n\u0303;t)$ and setting the derivative to zero, we obtain

*p*is the expected propensity under the original model conditioned on the observed state $n\u0303$. In particular, if the propensity of the original reaction does not depend on unobserved species, it can be removed. We explore the implications of these results on the moments of the reduced system in Sec. II E 4.

*ρ*

_{i}is taken to be independent of time, we have to integrate Eq. (13) over [0,

*T*], and as

*T*→ ∞, it can be verified that the total cross-entropy is minimized for

For other forms of propensity functions, the optimum can generally be obtained by a similar averaging procedure. In all cases, the main difficulties lie in estimating relevant conditional expectations, which can be done numerically using the Monte Carlo approach represented in Sec. II B. Equation (14) is perhaps the central result of this paper, and we will see that it coincides with the propensities obtained using the QSSA, the QEA, and the LMA when these are applied. These methods can, therefore, all be seen as special cases of our variational approach based on minimizing the KL divergence. As an aside, our results provide a new derivation of the marginal CME proposed in Refs. 30 and 34 and the nontrivial result that Eq. (14) and Algorithm 1 target equivalent objectives.

In particular, Algorithm 1 can be bypassed by computing certain conditional moments of a system, which provides an alternative way to perform model reduction. For many systems, these conditional moments can be efficiently computed using moment equations, possibly involving moment closure approximations, which have attracted a large amount of literature.^{4} The LMA, for example, involves a bootstrapping approach to estimate relevant moments. Alternatively, conditional expectations could be computed by numerically integrating the CME although we do not expect this to scale well to large systems. In consequence, Algorithm 1, while a convenient way to implement our approach in full generality and empirically verify its performance, can often be bypassed.

#### 1. Example: Telegraph model (cont.)

*p*more flexible by allowing a nonlinear and time-dependent propensity

*ρ*(

*m*;

*t*) =

*ρ*

_{eff}(

*m*;

*t*), then the cross-entropy between $q\u0303$ and

*p*can be computed as

*p*have the same initial conditions. The effective propensities of $q\u0303$ can be computed as

*ρ*

_{on}weighed by the probability that the gene is on at time

*t*, given that there are

*m*mRNA molecules present. This is the optimal propensity for the reduced model

*p*in accordance with Eq. (14), featuring a nonlinear dependence on

*m*, and this can be shown to be the optimal Markovian approximation to $q\u0303$. We will see in Sec. II E 4 that this approximation exactly matches the mRNA distribution of the full model at all times.

*ρ*(

*m*;

*t*) =

*ρ*

_{eff}(

*t*) for an effective rate constant

*ρ*

_{eff}(

*t*), minimizing the cross-entropy (16) yields

*p*is the expected transcription rate of

*q*at time

*t*, this time averaged over all

*m*.

*ρ*(

*m*;

*t*) =

*ρ*

_{eff}to be time-independent, we obtain by a similar procedure

*t*. This is the special case considered in Sec. II B 1, where

*p*was restricted to be a Poisson process with constant intensity. As

*T*→ ∞, optimum (19) converges to the expected transcription rate of

*q*at steady state.

It can be checked that propensity functions (17) and (18) exactly preserve the mean mRNA numbers of the full system. Similarly, the time-independent propensity function (19) preserves the mean mRNA numbers at steady state. For mass-action propensities, however, variances will be underestimated. This is because reduced models do not model gene switching, which increases the noise in the system. This is a common occurrence with model reduction and is well known for the telegraph model since the mRNA numbers are always distributed following a Poisson mixture distribution, which will always have a higher variance than a single Poisson distribution with the same mean.^{39} We will further explore how well reduced models capture marginal distributions of the original in Sec. II E 4.

### D. Computing KL divergences

*p*. Perhaps counter-intuitively, minimizing the KL divergence is easier than computing it since the entropy

*N*samples $n\u0303[0,T](i)$ from $q\u0303$ and computing

*q*is a Markov process, the projection $q\u0303$ is just a partially observed Markov process and we can use the so-called forward algorithm for hidden Markov models

^{40}to compute the log-likelihood of a trajectory.

*t*and the current state of the hidden species,

*z*_{t}. As shown in Appendix C, the forward algorithm in this case yields the following set of jump ordinary differential equations (ODEs):

*q*, with propensity functions given by

*σ*

_{i}. The marginal likelihood of the observed trajectory can then be computed by summing over the unobserved variables

**,**

*z*The above equations take the form of a modified CME where the observed variables are fixed. They can be solved using an adaptation of the Finite State Projection algorithm,^{5} and in cases where the unobserved state space is finite, exact solutions can be computed numerically. We summarize the procedure to estimate the KL divergence defined in Eq. (1) in Algorithm 2.

Inputs: number of simulations N, simulation length T, full model q, reduced model p |

Output: KL divergence estimate $KL\u0302(q\u0303\Vert p)$ |

$L\u0302\u22540$ |

for all i = 1, …, N do |

sample n_{[0,T]} from q |

project n_{[0,T]} to $n\u0303[0,T]$ |

compute $p(n\u0303[0,T])$ using (4) |

compute $q\u0303(n\u0303[0,T])$ using (22)–(25). |

$L\u0302+=q\u0303(n\u0303[0,T])\u2212p\u0303(n\u0303[0,T])$ |

return $L\u0302/N$ |

Inputs: number of simulations N, simulation length T, full model q, reduced model p |

Output: KL divergence estimate $KL\u0302(q\u0303\Vert p)$ |

$L\u0302\u22540$ |

for all i = 1, …, N do |

sample n_{[0,T]} from q |

project n_{[0,T]} to $n\u0303[0,T]$ |

compute $p(n\u0303[0,T])$ using (4) |

compute $q\u0303(n\u0303[0,T])$ using (22)–(25). |

$L\u0302+=q\u0303(n\u0303[0,T])\u2212p\u0303(n\u0303[0,T])$ |

return $L\u0302/N$ |

**. These equations can be found in various places in the literature, e.g., Refs. 33, 35, and 41–44, although we are not aware of a consistent nomenclature for this modification of the CME.**

*z*The equations for the log-likelihood are linear and can be split into a sum of contributions for each reaction in $q\u0303$. Integrating over all reduced trajectories, $n\u0303[0,T]$ shows that the entropy $H(q\u0303)$ can thus be expressed as a sum of contributions from each reaction in $q\u0303$. Equation (13) shows that the same decomposition holds for the cross-entropy $H(q\u0303;p)$, so the entire KL divergence $KL(q\u0303\Vert p)$ can be decomposed into contributions coming from the individual reactions, which allows us to differentially analyze the accuracy of the reduced model *p* for each reaction separately. We note that all reactions together determine the distribution *q* on trajectories over which we integrate, so this decomposition is not strict in practice.

#### 1. Example: Telegraph model (cont. 2)

*σ*

_{on}and

*σ*

_{off}and the visible transcription reaction

*ρ*

_{on}. Specializing Eq. (23) to this system, we obtain

*m*

_{[0,t]}denotes the observed mRNA trajectory. At each mRNA production event we update the probabilities according to Eq. (24), which in this case reads

As a by-product, the above equations yield the conditional distribution over the gene state at each time *t*, given the trajectory prior to that time point. These are the filtered distributions and illustrated in Fig. 3(c), but we only need the marginal likelihood for our purposes.

Using the above equations to compute the log-likelihood for a large number of trajectories sampled from the telegraph model *q* yields a numerical approximation of the entropy $H(q\u0303)$, and together with Eq. (6), we obtain an estimate of the KL divergence $KL(q\u0303\Vert p)$. For time-homogeneous propensities, we empirically found the KL divergence to be asymptotically proportional to *T*, and we call the proportionality factor the (steady-state) KL divergence rate. Note that the cross-entropy rate can be defined directly using Eq. (11).

In Fig. 3(d), we show how this KL divergence rate between the telegraph model and its Poisson approximation changes for various choices of *σ*_{on} and *σ*_{off}. We observe that the KL divergence rate tends to 0 as the switching rates increase; indeed, the Poisson approximation can be obtained from the QEA for this example. We further note that the KL divergence is maximal when the gene spends substantial amounts of time in each state; for *p*_{on} close to 1, the gene is almost always active and the system approaches constitutive expression with rate *ρ*, while for *p*_{on} close to 0, the gene only activates sporadically and resembles a constitutive gene with a very low expression rate *ρ* · *p*_{on}. Visually inspecting mRNA trajectories generated in both regimes shows that the results become harder to distinguish from the Poisson process.

Our results on the telegraph model without degradation apply almost verbatim when degradation is included: adding the reaction $M\u2192\u2205$ to the full and reduced models does not affect the KL divergences considered. *A priori*, adding degradation changes the probability distribution over trajectories, which can now exhibit decreasing mRNA numbers, but the log-likelihood contributed by the degradation reaction is the same between the telegraph model and its reduction, and therefore the difference cancels in Eqs. (26)–(28) and Eq. (13). Since the log-likelihood contributed by the gene switching and transcription reactions does not depend on mRNA numbers, which do change in the presence of degradation, the total KL divergence is unaffected. In the presence of feedback, degradation would indirectly affect the KL divergence via its effect on mRNA numbers; we refer the interested reader to the recent paper^{35} for an analysis of this type of information flow in stochastic biochemical reaction networks. We can readily verify that our observations on the mean and variance of mRNA numbers under the telegraph model and its reduction remain valid in the presence of degradation.

### E. Relationship with known approaches

#### 1. The Quasi-Steady State Approximation

In deterministic chemical kinetics, the Quasi-Steady State Approximation (QSSA) is a model reduction technique that can be applied when a system can be partitioned into the so-called slow species *n*^{s} and fast species *n*^{f} such that the fast species *n*^{f} evolve on a much faster time scale than the slow species. On the timescale on which the slow species evolve, the fast species can, therefore, be assumed to reach their steady state almost instantaneously. Thus, one assumes $ddtnf=0$, which allows for the simplification of the remaining equations for the slow species. The QSSA has famously been applied to Michaelis–Menten enzyme kinetics with the enzyme–substrate complex *ES* as the fast species, resulting in the classical Michaelis–Menten propensity for product formation (see Sec. III B for more information).

*n*^{s}, the fast species reach the steady state nearly instantaneously compared to the timescale of interest. In our formulation, the reduced state space consists of the slow species, $n\u0303=ns$, and the unobserved species are the fast species,

**=**

*z*

*n*^{f}. As shown in Ref. 8, the QSSA yields the following reduced CME for the approximation

*p*[Eqs. (10) and (11) in Ref. 8]:

*q*

_{0},

*q*, in general, depend on the unobserved species, whose distribution is correlated with the history of the current trajectory. If we assume that the timescale of the unobserved species

*n*^{f}is very fast, then this correlation will decay very quickly and the time-dependence of the conditional distributions in Eqs. (26)–(28) will be negligible, so we can replace these equations by

#### 2. The Quasiequilibrium Approximation

While the QSSA relies on a reaction network being divisible into slow and fast species that evolve on two different timescales, in practice, it is more frequently the case that some *reactions* in a system will be fast and that some will be slow. The Quasiequilibrium Approximation (QEA) developed in Refs. 7, 45, and 46 modifies the QSSA to model this scenario by reformulating the system using extents ** a** = (

*a*

_{1}, …,

*a*

_{r}), where the extent

*a*

_{i}is defined as the number of times reaction

*i*has occurred.

**(**

*n**t*) =

*n*_{0}+

**(**

*Sa**t*), where

**is the stoichiometry matrix. As derived, e.g., in Ref. 17, the CME of this system is**

*S*

*e*_{i}is the vector with a 1 in the

*i*th position and 0 elsewhere. The QEA assumes that the extent vector can be divided into slow and fast components

*a*^{s}and

*a*^{f}, respectively, corresponding to slow and fast reactions, and we define our reduced system to consist only of slow reactions. From here, we can proceed analogously to the QSSA discussed above and obtain that the reduced propensities are given by

A subtlety of this argument is the fact that the fast extents *a*^{f} can only increase in time and, therefore, will not admit a steady-state distribution, in general. The conditional means, however, will converge in many cases, e.g., if *a*^{f} consists of both directions of a reversible reaction. If this is the case, the QEA yields a well-defined reduction, which, moreover, agrees with (14) and, therefore, minimizes the KL divergence to the full model on the space of extents.

#### 3. The Linear Mapping Approximation

^{25}replaces a bimolecular reaction of the form $G+X\u2192\sigma G*$, where

*G*is a binary species representing a gene state, by a reaction $G\u2192\sigma \u0304G*$ with effective propensity $\sigma \u0304$. Assuming mass action kinetics, the propensity function of the bimolecular reaction is

*ρ*(

**) =**

*n**gxσ*, where

*g*is the state of

*G*and

*x*is the abundance of the species

*X*, and the propensity of the linearized version is $\rho \u0303(n)=\sigma \u0304g$ for an appropriate choice of $\sigma \u0304$. Taking the derivative with respect to $\sigma \u0304$ of the cross-entropy rate (13) at time

*t*yields

^{25}The LMA approximates this conditional expectation by imposing a self-consistency condition on the linearized system, thereby deriving an effective approximation to (43).

*q*and the optimal reduction

*p*given by (43) analytically to obtain

*t*is suppressed. This expression for the discrepancy incurred by the LMA, which resembles the definition of the variance, quantifies the intuition that the LMA should be accurate if fluctuations of

*X*in the unbound gene state are small.

The above derivation is not entirely rigorous as the linearization has a different net stoichiometry than the original reaction and KL divergence between *q* and *p* is infinite. To remedy this, we can either neglect fluctuations in *X* due to binding in the original model or consider the KL divergence on a reaction-by-reaction basis, as discussed in Sec. II D, since the two reactions correspond despite their different net stoichiometries.

*X*are absent or can be neglected, the optimal reduction using Eq. (43) preserves first-order moments for all species, assuming that all other reactions are linear. Indeed, replacing the reaction $G\u2192\sigma [X]G*$ by its linearization only changes the moment equation for $E[g]$. For the full model, the latter is given by

*g*(the remaining terms only depend on species means by our linearity assumption). Second- and higher-order moments will, in general, differ between the two as we will see. This sheds light on a more subtle aspect of the LMA: the self-consistent approach in Ref. 25 involves computing the second-order moment $E[gx;t]$, which can differ in the reduction. For this reason, the original approach, while generally quite accurate, can lead to slightly suboptimal propensities, as we shall see in Sec. III A.

#### 4. Moment-matching

In Appendix D, we show that the optimal reduced model *p*, defined using the propensity functions (14), exactly preserves marginal distributions of the full model. In particular, if we place no constraints on the reduced propensity functions, the resulting system *p* will have the same means, variances, and higher-order moments for the observed species as the full system *q*. Thus, our approach is also related to moment-matching, in particular to the proposed moment-based methods in Refs. 11 and 47.

The fact that the reduced model *p* can exactly capture the probability distribution of the projected model $q\u0303$ at any time *t* does not mean that the two are equivalent. The difference can be seen on the level of individual trajectories. The reduced model *p* is always memoryless, whereas this is not usually the case for $q\u0303$, resulting in different dynamical properties such as autocorrelations. This ability to discriminate models on a trajectory-level illustrates an advantage of our path-based approach over purely moment-based model reduction.

We emphasize that exact preservation of moments only holds for the globally optimal propensity functions defined by Eq. (14), which can depend on $n\u0303$ and *t* in a complicated way. In practice, one often use parametric propensity functions, such as mass-action propensities or Hill functions, and as a consequence, this feature is lost. As with the LMA or the telegraph model, it may still be possible to preserve some moments (such as means) exactly with the right choice of propensity function. Our results imply that discrepancies in the moments between the full and reduced models result from the fact that Eq. (14) is only computed approximately for methods such as the QSSA or the QEA.

## III. NUMERICAL EXPERIMENTS

### A. Autoregulatory feedback loop

In this section, we analyze a simple model of stochastic gene expression featuring positive autoregulation [see Fig. 4(a)]. The system in question consists of a single gene found in two states, bound (*G*_{b}) and unbound (*G*_{u}), as well as the coded protein *P*, which can bind to the gene to increase its own transcription rate. For simplicity, we do not consider mRNA dynamics explicitly, instead modeling protein production as occurring in geometrically distributed bursts (see Ref. 48 for a derivation). The linearized version of the system, using the optimal propensity derived in Sec. II E 3, is shown in Fig. 4(b). In Fig. 4(c), we compare steady-state distributions of the full model with its linearization, where the effective binding rate is computed numerically using Algorithm 1, and the LMA, where the binding rate is approximated using the self-consistent approach in Ref. 25.

This system exhibits critical behavior for some parameter values [see Fig. 4(c)], and it was shown in Ref. 49 that computing the moments of this system is difficult near points of criticality as most moment closure techniques and the LMA yield inaccurate results. Our results show that for all parameters, self-consistent equations of the LMA provide results close to the numerically computed optimum. As discussed in Sec. II E 3, the exact reduction very closely reproduces mean protein numbers, with the LMA incurring a small bias near the critical point. In contrast, neither reduction is able to capture the increased fluctuations near the critical point, significantly underestimating the variance in protein numbers. Comparing the protein distributions for the full and the optimally reduced model shows a large discrepancy near the critical point, compared to parameters far from it.

We compute the KL divergence rate at the steady state between the full model and its linearization using (44) and via Monte Carlo estimation [see Fig. 4(d)]. The steady-state KL divergence rate exhibits a notable peak near the critical point, coinciding with the parameter regime where the linearization fails to capture the full model. As we move away in either direction from the critical point, the KL divergence decreases in accordance with the better approximation of the system by its linearized version. This shows how the KL divergence can be used to assess how well model reduction works for different parameter regimes.

### B. Michaelis–Menten kinetics

We next consider one of the most-studied reaction networks in biology, the Michaelis–Menten model of enzyme kinetics, consisting of an enzyme *E* and a substrate *S* that reversibly bind to form an enzyme–substrate complex *ES*, which converts the substrate into the product *P* [see Fig. 5(a)]. This system is often treated as an example application of both the QSSA and the QEA, which remain standard model reduction techniques applied to this example.

*E*and the enzyme–substrate complex

*ES*as the fast species and replaces the full system by a one-step reaction where the substrate is converted to the product with a Hill-type propensity function [see Fig. 5(b)]. This approximation is known to be valid when

*E*

_{T}is the total number of enzymes and

*K*

_{m}is the Michaelis–Menten constant (

*k*

_{−1}+

*k*

_{2})/

*k*

_{1}.

^{50,51}Note that the reduced model is invariant under the scaling

*ɛ*can be seen as controlling the accuracy of the reduction: the rescaled QSSA condition (47) reads

*ɛ*should result in the full model more closely resembling the QSSA reduction.

*k*

_{1},

*k*

_{−1}are sufficiently large. Note that the QEA reduction is invariant under the scaling

*c*controls the timescale at which quasiequilibrium is reached and therefore the accuracy of the QEA. We stress that (48) and (50) are two independent scalings and can be performed simultaneously, as will be done in Fig. 5(c).

In general, it may be difficult to predict which of the two approaches is more accurate unless one is clearly in one of the limiting regimes. Based on the KL divergence between the full model and either of the two reactions, we can investigate this question for a range of parameters. The reduced model consists of two species, *S* and *P*, and since [*S*] + [*P*] is conserved, we can describe it in terms of either of the two. The correct projection for this example identifies the species *S* and *ES* in the full model, i.e., we define [*S*]_{red} = [*S*] + [*ES*] in order to eliminate the binding and unbinding reactions from the projection. This lumping of two rapidly equilibrating species is standard when using the QEA (see, e.g., Ref. 24), but it applies equally well to the QSSA in this case.

In Fig. 5(c), we use Algorithm 2 to estimate the total KL divergence over [0, ∞] between the full model and both reductions for a fixed number of substrate molecules. The total KL divergence is finite since all trajectories enter the same absorbing state defined by [*S*] = 0 in a finite amount of time. As expected, for the QEA, its accuracy increases with *c*, but whereas the QSSA tends to become less accurate for large *E*_{T}; in the low enzyme regime, we observe a similar decrease in accuracy that is not explained by the deterministic theory.

The observed decrease is a small copy number effect caused by the fact that the waiting time distribution between two productions of *P* is not exponential: for very small *E*_{T}, the unbinding of an enzyme immediately after such an event implies that a significant fraction of such productions occur as a two-step process (where the free enzyme binds another substrate and then converts it).^{52} For large enough *c*, the binding step is very fast, and assuming that *k*_{2} ≫ *k*_{−1} so that unbinding is unlikely to occur again, the conversion of substrates to products can be viewed as an effective one-step process. The effect of nonexponential waiting time distributions has previously been analyzed in Ref. 53; we see that the KL divergence on trajectories can be sensitive to subtle dynamical effects such as waiting time distribution, which are not visible when considering, e.g., the moments of a system, which are well predicted by the QSSA for small values of *c* and *ɛ*.

Overall, we can see that for the chosen parameter values, the QSSA generally performs better than the QEA in the regime of small *ɛ*, corresponding to low enzyme numbers, and the QEA is most accurate for *c*, keeping in mind the scalings in (48) and (50). Neither approximation is satisfactory if the number of total enzymes is larger than the amount of substrates, but the binding and unbinding rates are small. In this scenario, other approximations, such as the total QSSA,^{54} which we do not investigate here, will generally be more accurate.

Figure 5(d) compares the effective propensities for the full system, as a function of the unconverted substrate abundance, with the predictions made by the QEA or the QSSA. In the case of the QEA, we see that the effective propensities slowly converge to their limit as the timescale parameter *c* increases. In contrast, the QSSA provides a good approximation to the effective propensity as long as the number of substrate molecules is larger than the number of enzymes. While the effective propensities are the optimal choice for the reduced model, the actual quality of the approximation is affected both by the size of the fluctuations of the actual propensities around their mean and the degree to which the waiting times of the original system follow an exponential distribution.

### C. Genetic oscillator

Our final example is the gene expression system shown in Fig. 6(a). This system consists of a gene, a mRNA, and protein and a Michaelis–Menten-type protein degradation mechanism. Up to two protein molecules can bind the gene, and in the twice bound state, the gene pauses transcription. This model has been considered in Ref. 11 as an example system where the naive reduction of the CME using Hill-type effective propensities is unable to accurately capture the noise in the original. The true system can exhibit oscillations in mRNA numbers that are caused by the two-step negative feedback and are not present in the reduced version. In this section, we want to analyze what reductions can be performed on the full model while still keeping oscillatory behavior.

We consider five different reductions for this model, which are listed in Table I. Each reduction removes or combines several species, such as gene states, thereby reducing the dimensionality of the system. The rightmost column describes the correct projection for each model, derived using an analogous argument as in the Michaelis–Menten example. We fit the unknown (effective) parameters for each model numerically by minimizing the KL divergence from the full model using Algorithm 1. Typical mRNA trajectories for all models can be seen in Fig. 6(b). The oscillations in the full model are clearly visible, and they are inherited by reductions I, II, and I + II. In contrast, simplifying the first binding step as in IV and V results in a non-oscillatory system [see Fig. 6(c)].

Model . | Full reactions . | Reduced reactions . | Projection map . |
---|---|---|---|

I | |||

II | |||

I + II | (I and II above) | (I and II above) | |

IV | |||

V |

Model . | Full reactions . | Reduced reactions . | Projection map . |
---|---|---|---|

I | |||

II | |||

I + II | (I and II above) | (I and II above) | |

IV | |||

V |

In Fig. 6(d), we compare the KL divergence rates between the full model and reductions I–V (note that the different reductions are defined on different state spaces). While models I, II, and I + II have comparatively low KL divergences from the original, there is a sharp increase with models IV and V. From this alone, one could expect that the latter perform significantly worse, which is indeed the case, as shown in Fig. 6(b). Figure 6(e) represents a comparison of the mean and standard deviations for mRNA and protein abundances; while all models approximate the means very closely, the predicted standard deviations for models IV and V are very far from their true values consistent with the lack oscillatory behavior. This resembles the results in Sec. III A, which also show excellent agreement of the full and reduced model on the mean level, independent of the total approximation quality.

This comparison of various reductions shows how variational model reduction can be employed on a more sophisticated scale, where multiple reductions are possible. Given a list of possible simplifications, we can automatically find optimal parameters for each reduction and compute the KL divergence from the true model. Reductions that do not significantly affect the output will generally lead to very low KL divergences compared to those which do, as in this example where a large gap between KL divergence rates separates reductions I, II, and I + II from IV and V. The latter do not retain much of the moments or oscillatory dynamics of the full model compared to the other reductions.

## IV. DISCUSSION

In this work we presented an information-theoretic approach to model reduction for the CME based on minimizing the KL divergence between the full model and the proposed reduction. Based on this variational principle, we determine the optimal reaction propensities for the reduced model in Eq. (14) and show that it underlies some of the most common approaches to model reduction for the CME: the QSSA,^{8–10} the QEA,^{16,17} and the LMA.^{25} As a consequence, we establish that these methods can be seen as special cases of our approach. We furthermore obtain a general justification for the mean-field arguments proposed in the literature and connect them with information theory and likelihood-based approaches. We provide a numerical algorithm for automated fitting of a reduced model based on minimizing the KL divergence via stochastic gradient descent and a numerical algorithm for estimating this KL divergence in order to assess model fit. While the KL divergence between Markov chains has been considered before in, e.g., Refs. 19–21, 31, 32, 34, 35, and 55, to our knowledge it has not been studied in connection with standard model reduction methods as done in this paper. Our numerical results show how the KL divergence can be computed, in practice, and used in the context of model reduction.

Using three biologically relevant examples, we illustrated how these model reduction techniques can be analyzed from this variational perspective. For the autoregulatory feedback loop, we showed that the KL divergence provides a useful metric of approximation quality and can detect parameter regimes where the mean-field approximation behind the LMA fails. Using the Michaelis–Menten system, we demonstrated how our approach can be used to decide between possible reductions of a model, particularly in the non-asymptotic regime where neither the QSSA nor the QEA is strictly valid. Finally, we used the genetic oscillator in Ref. 11 as an example to show how different reductions of a given model can be fit automatically, separately and in combination, and assess which steps in a putative model reduction procedure would impact approximation quality more than others.

In this paper, we only focused on discrete stochastic models, mainly those described using the Chemical Master Equation. This approach disregards continuum approximations, such as the chemical Langevin equation and the System Size Expansion,^{56} which are frequently used in practice, and it is worth asking whether information-theoretic approaches can be extended to this context. *A priori* this appears difficult since the KL divergence and related quantities tend to diverge in the continuum limit, such as the divergence between Brownian motions with different diffusion coefficients. This similarly applies to the wide class of hybrid approaches.^{30,42,57,58} In particular, abundance separation, as opposed to time-scale separation with the QSSA and the QEA, does not fall under the umbrella approach we considered. In this scenario, other perspectives on model reduction are likely needed.

A major drawback of the general approach we presented is that the KL divergences we use cannot usually be optimized analytically. As we have shown, the minimum typically corresponds to computing conditional expectations as in (14), which is rarely possible exactly, e.g., in the case of the LMA, where moment equations are used to approximate conditional means of protein numbers, or the QSSA, where deterministic rate equations are commonly used to derive the approximate propensities. The Monte Carlo approach we proposed in Algorithm 1 has the advantage of being completely generic, but unlike most existing techniques, it requires numerical optimization and will inevitably be slower than those. The optimization problem in question typically being low-dimensional and convex, most of the computational time is required to generate sample trajectories and compute gradients, with variable computational complexity that heavily depends on the system.

One important aspect of model reduction we have not addressed is that of choosing an appropriate architecture for a reduced model. As we investigated in the case of the genetic oscillator, while we can always optimize the parameters given an architecture, the quality of the approximation can greatly depend on which reductions are performed. Although methods for automatically choosing from a predefined set of approximations of a system exist in special cases,^{15} a fully general algorithm that computes a maximally reduced version of a given model within a given threshold is still missing from the literature. In combination with other machine learning approaches to model reduction such as Ref. 59, however, we hope that our approach may be one of the several steps toward such a procedure.

The KL divergence is a well-studied quantity in statistics, but it lacks some desirable properties in the context of model reduction. While we were able to establish that optimizing the KL divergence in theory leads to a reduced system with the same marginal distributions, establishing a relationship between the KL divergence and dynamical information, such as the autocorrelation and power spectra, is more difficult. Our investigations suggest that the KL divergence on the space of trajectories is a useful quantity that relates to both the marginal distributions and dynamical properties of the system and can be a useful metric of model fit. We are optimistic that further investigations in this area will demarcate more clearly which properties of a model are well captured by the KL divergence and which are not.

## ACKNOWLEDGMENTS

This work was supported by the EPSRC Centre for Doctoral Training in Data Science (EPSRC, Grant No. EP/L016427/1) and the University of Edinburgh for K.Ö. and a Leverhulme Trust grant (Grant No. RPG-2020-327) for R.G. The authors are grateful to the anonymous reviewers for their useful feedback and suggestions.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Kaan Öcal**: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). **Guido Sanguinetti**: Conceptualization (equal); Supervision (equal); Writing – review & editing (equal). **Ramon Grima**: Conceptualization (equal); Supervision (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

Code implementing the methods introduced in this paper can be found at https://github.com/kaandocal/modred.

### APPENDIX A: LIKELIHOODS OF FULLY OBSERVED TRAJECTORIES

Let *p* be a continuous-time Markov chain defined on the space $X$ with initial distribution *p*_{0}. If *p* has time-independent transition rates, the Stochastic Simulation Algorithm,^{60} reproduced in Algorithm 3, returns exact samples from *p*. We can inspect the SSA to obtain the probability of drawing any given trajectory as follows:

The first state

(0) is drawn from the initial distribution*n**p*_{0}and has probability*p*_{0}((0)).*n*The time

*τ*_{1}until the next jump is drawn from an exponential distribution with rate $p\u2190n0$ and has probability distribution function $p\u2190n0\u2061exp(\u2212\tau 1p\u2190n0)$.The next state

*n*_{1}is drawn from the transition distribution $pn1|n0$ and has probability $pn1\u2190n0/p\u2190n0$.The time

*τ*_{2}until the next jump is drawn from an exponential distribution with rate $p\u2190n1$ and has probability distribution function $p\u2190n1\u2061exp(\u2212\tau 2p\u2190n1)$, etc.

Input: simulation length T, Markov chain p, initial distribution p_{0} |

Output: n_{[0,T]} - sampled trajectory |

sample n_{0} ∼ p_{0} |

t ← 0, i ← 0 |

while t < T do |

sample $\tau i+1\u223cExp(1/p\u2190ni)$ |

sample n_{i+1} = mwith probability $pm\u2190ni/p\u2190ni$ |

t + = τ_{i+1} |

return states (n_{0}, n_{1}, …), jump times (t_{1}, t_{2}, …) |

Input: simulation length T, Markov chain p, initial distribution p_{0} |

Output: n_{[0,T]} - sampled trajectory |

sample n_{0} ∼ p_{0} |

t ← 0, i ← 0 |

while t < T do |

sample $\tau i+1\u223cExp(1/p\u2190ni)$ |

sample n_{i+1} = mwith probability $pm\u2190ni/p\u2190ni$ |

t + = τ_{i+1} |

return states (n_{0}, n_{1}, …), jump times (t_{1}, t_{2}, …) |

*t*

_{k+1}>

*T*, i.e., that the next jump occurs after time

*T*(note the absence of the prefactor $p\u2190nk$). Each term $p\u2190ni$ in the above expression cancels with the denominator in the next term, and upon taking logarithms, we arrive at Eq. (4).

### APPENDIX B: KULLBACK–LEIBLER DIVERGENCE BETWEEN MARKOV CHAINS

*q*and

*p*defined on a common state space $X$. For any integer

*N*> 1, let

*q*

^{(N)}or

*p*

^{(N)}be the time discretizations of

*q*and

*p*with time step

*δt*=

*T*/

*N*. These define probability distributions on $XN+1$ with the KL divergence

*t*

^{(i)}=

*iT*/

*N*is the

*i*th discretization time point. Using these identities for

*q*and

*p*, it can be verified that the KL divergence (B1) converges to the following as

*N*→ ∞:

*H*(

*q*;

*p*) and the entropy

*H*(

*q*;

*p*), and we define the cross-entropy of two continuous-time Markov chains as

*q*

_{m←n}≠ 0 implies

*p*

_{m←n}≠ 0; otherwise, the cross-entropy and KL divergence are both infinite.

*q*is not Markovian, e.g., when it is the projection of a Markov process. In the discrete-time case, we have

*q*since $q\u0303$ is no longer assumed to be memoryless. Expressing the transition probabilities for

*p*in terms of transition rates using Eqs. (B2) and (B3) and the identities

*N*log(

*δt*) is independent of

*p*and we will renormalize it to 0; this is standard when deriving continuum-limit versions of many information-theoretic quantities and was implicitly used in our definition of the cross-entropy in Eq. (B5). Proceeding with the above equation, we get

*N*log(

*δt*) term and rearranging, yields Eq. (9) in the limit.

### APPENDIX C: LIKELIHOODS OF REDUCED TRAJECTORIES

*q*be a continuous-time Markov chain defined on the discrete state space $X$ with initial distribution

*q*

_{0}, and let $q\u0303$ be its projection onto $X$. Computing the log probability of a trajectory $n\u0303[0,T]$ under $q\u0303$ requires integrating over all possible full trajectories $n[0,T]=(n\u0303[0,T],z[0,T])$ that are compatible with $n\u0303[0,T]$, that is,

*q*and $q\u0303$. The marginal likelihood in this case can be obtained using the well-known forward algorithm for HMMs, which sequentially computes the joint probabilities $q(N)(n\u03030:i,zi)$,

*i*th step, respectively. Equation (C4) can be reorganized as

*σ*

_{j}and taking the continuum limit

*δt*→ 0 now yield Eqs. (22)–(24).

**. Marginalizing Eq. (22) immediately yields Eq. (26), and for the other two, we obtain**

*z*

*z*_{t}as a conditional expectation,

### APPENDIX D: MARGINAL DISTRIBUTIONS AFTER REDUCTION

*p*and $q\u0303$ agree at all times.

## REFERENCES

*Advances in Neural Information Processing Systems*

*Stochastic Processes in Physics and Chemistry*

*Computational Methods in Systems Biology*