The complexity of mathematical models in biology has rendered model reduction an essential tool in the quantitative biologist’s toolkit. For stochastic reaction networks described using the Chemical Master Equation, commonly used methods include time-scale separation, Linear Mapping Approximation, and state-space lumping. Despite the success of these techniques, they appear to be rather disparate, and at present, no general-purpose approach to model reduction for stochastic reaction networks is known. In this paper, we show that most common model reduction approaches for the Chemical Master Equation can be seen as minimizing a well-known information-theoretic quantity between the full model and its reduction, the Kullback–Leibler divergence defined on the space of trajectories. This allows us to recast the task of model reduction as a variational problem that can be tackled using standard numerical optimization approaches. In addition, we derive general expressions for propensities of a reduced system that generalize those found using classical methods. We show that the Kullback–Leibler divergence is a useful metric to assess model discrepancy and to compare different model reduction techniques using three examples from the literature: an autoregulatory feedback loop, the Michaelis–Menten enzyme system, and a genetic oscillator.

Nomenclature:
n: full state vector
ñ: reduced state vector
p: reduced model
q: full model
q̃: projected model
q_n: total transition rate
q_0(n): initial distribution
q_m|n: transition probability
q_mn: transition rate
q_t(n): single-time marginal
z: unobserved state vector

Stochastic biochemical reaction networks, such as those involved in gene expression, immune response, or cellular signaling,1–4 are often described using the Chemical Master Equation (CME). The CME describes the dynamics of biochemical processes on a mesoscopic level, viewing them as a discrete collection of molecules interacting and undergoing reactions stochastically; as such, it is generally considered more accurate than continuum approximations, such as rate equations and the Chemical Langevin Equation.4 Despite its explanatory power, the CME poses significant analytical and computational difficulties to modelers that have limited its use in practice. Closed-form solutions to the CME are difficult to obtain and are only known for a small number of biologically relevant systems, and solving the CME numerically requires using approximations such as the Finite State Projection (FSP).5 Numerical approaches tend to scale poorly with the number of species and reactions present in a system, and as a result, there is significant interest in finding ways to simplify a description of a stochastic reaction network that make it easier to analyze and study—this is the goal of model reduction.

Model reduction for deterministic and continuum-limit models in biology is an active research topic,6,7 but very few existing methods can be applied to the discrete, stochastic setting of the CME. The Quasi-Steady State Approximation (QSSA) is perhaps the best known technique, first considered in the stochastic case in Ref. 8. Here the system is partitioned into “slow” and “fast” species such that the fast species evolve very quickly on the timescale of the slow species. On the slow timescale, the states of the fast species can therefore be approximated by their steady-state value (conditioned on the slow species), effectively allowing for a description of the system in terms of the slow species only. By its nature, the QSSA is only applicable to systems with a clear separation of timescales between species, the existence of which cannot always be established. The QSSA for stochastic systems is generally believed to require more stringent conditions than in the deterministic case, but the exact validity conditions are not well-understood.9–14 Recent work15 has analyzed a modification of the QSSA, the total QSSA (tQSSA), in systems involving reversible binding of molecules and shown that it has a wide range of applicability in the stochastic case.

Similar to the QSSA is the Quasiequilibrium Approximation (QEA), which was first considered in Refs. 16 and 17 for stochastic reaction networks. Here the reaction network is decomposed into “slow” and “fast” reactions, and fast reactions are assumed to equilibrate rapidly on the timescale of slow reactions. Similar to the QSSA, the QEA can be used to reduce the number of species and reactions in a system, but it relies on the existence of a clear timescale separation between reactions, which is not always present for large systems with many distinct reactions. Much like the QSSA, the validity of the QEA for systems without the appropriate timescale separation has not been generally established, and owing to the asymptotic nature of these descriptions, it is not usually possible to quantify the approximation error. Despite this, the QSSA and the QEA are by far the most commonly used model-reduction techniques for chemical reaction networks, owing to their physical interpretability and analytical tractability, most famously in the Michaelis–Menten model of enzyme kinetics.

A distinct approach for model reduction with the Chemical Master Equation is state-space lumping, which originates from the theory of finite Markov chains; see for example Ref. 18. Here different states in a system are lumped together such that the coarse-grained system is still Markovian and can be modeled using the CME. For a typical biochemical reaction network, it may not be possible to perform any nontrivial lumping while preserving Markovianity, whence approximate lumping methods have been considered, e.g., in Refs. 19–21. Here the coarse-grained system is approximated by a Markov process, and the approximation error is quantified using the Kullback–Leibler (KL) divergence between the original model and a lift of the approximation to the state space of the original model. State-space lumping for the CME often occurs in the context of the QEA as states related by fast reactions are effectively lumped together or averaged.22–24 For this reason, we will not consider this approach separately, although many of our results, such as the optimal form of the lumped propensity functions, extend to state-space lumping.

Finally, a more recent model reduction technique specifically for gene expression systems is the Linear Mapping Approximation (LMA).25 The LMA replaces bimolecular reactions of the form G + P → (⋯), where G is a binary species such as a gene, by a linear approximation where P is replaced by its mean conditioned on [G] = 1. While the LMA does not reduce the number of species or reactions, linear reaction networks are generally easier to analyze: their moments can be computed exactly, and closed-form solutions are known for many cases.26–29 

At first glance, these approaches (timescale separation, state-space lumping, and the Linear Mapping Approximation) seem rather disparate, and it is unclear what, if any, relationships exist between them. In this paper, we show that all these methods can be viewed as minimizing a natural information-theoretic quantity, the Kullback–Leibler (KL) divergence, between the full and reduced models. In particular, they can be seen as maximum-likelihood approximations of the full model, and one can assess the quality of the approximation in terms of the resulting KL divergence. Based on these results, we show how the KL divergence can be estimated and minimized numerically, therefore providing an automated method to choose between different candidate reductions of a given model in situations where none of the above model reduction techniques are classically applicable. In its full generality, the Chemical Master Equation describing a reduced model can be obtained by marginalization of the original CME, and hence, our approach recovers the method based on conditional expectations presented in Ref. 30.

The KL divergences we consider in this paper are computed on the space of trajectories and as such include both static and dynamical information, in contrast to purely distribution-matching approaches. The KL divergence and similar information-theoretic measures between continuous-time Markov chains have previously been considered in Refs. 31 and 32 in the context of variational inference (with the roles of the true model and the approximation reversed compared to our approach), in Refs. 33 and 34 to obtain approximate non-Markovian reductions, in Ref. 35 to analyze the information flow in stochastic reaction networks, and in Ref. 36 to quantify model discrepancy for Markovian agent-based models.

In Sec. II, we introduce the mathematical framework within which we consider model reduction for the Chemical Master Equation based on KL divergences between continuous-time Markov chains. We show how the KL divergence can be minimized analytically in some important cases, recovering standard results in the literature and providing a mathematical justification for commonly used mean-field arguments as in the QSSA, the QEA, and the LMA. We furthermore provide numerical algorithms for estimating and minimizing the KL divergence in cases where analytical solutions are not available. In Sec. III, we illustrate the use of KL divergences as a metric for approximation quality using three biologically relevant examples: an autoregulatory feedback loop exhibiting critical behavior; Michaelis–Menten enzyme kinetics, where we reanalyze the QSSA and the QEA; and an oscillatory genetic feedback system taken from Ref. 11, for which we compare different reductions using our approach. Finally, in Sec. IV, we discuss our observations and how our approach could be used as a stepping-stone toward automated reduction of complex biochemical reaction pathways.

The Chemical Master Equation describes a biochemical reaction network as a stochastic process on a discrete state space X. We will use the letter q to denote such a stochastic process, which for the purposes of this paper can be seen as a probability distribution over trajectories on X. For a biochemical reaction network, the state space will be X = ℕ^s, where s is the number of species in the system: every state consists of a tuple n = (n1, …, ns) of s integers describing the abundances of each species.

Since the model q can consist of many species interacting in complicated ways, we often want to find a reduction that is more tractable, yet approximates q as closely as possible. The reduced model, which we will call p, should be of the same form as q, i.e., described by the CME, but will typically involve fewer species and simpler reactions. In particular, p can be defined on a lower-dimensional state space X̃. A state n in the original model can then be described by its projection ñ onto this lower-dimensional space, together with some unobserved components, which we will denote z. See Figs. 1(a) and 1(b) for an illustration.

FIG. 1.

Model reduction for the Chemical Master Equation. (a) Model reduction approximates a high-dimensional model q by a lower-dimensional version. Since the direct projection q̃ of the full model is not easy to describe, we approximate it using a family of tractable candidate models: in this paper, the approximation p is described by the CME. (b) Comparison of the full state space of a system, consisting of two species S1 and S2, and a reduced state space containing S1 only. Species that are not deemed essential can be discarded in the reduction and become unobserved variables. The dynamics of the original system involves all species, whereas the reduced model aims at an effective description only in terms of the reduced species. (c) Sample trajectory for a one-dimensional system, defined by the sequence n0, n1, … of states visited and the corresponding jump times t1, t2, … (or alternatively, the waiting times τ0, τ1, …).


In this paper, we assume that the basic structure of p is known a priori, i.e., the species and reactions we wish to retain are fixed. Our approach to model reduction, therefore, consists in finding optimal propensity functions for the reduced model, and we shall see how this can give rise to various known approximations, such as the QSSA, the QEA, or the LMA depending on what reductions are performed. We will return to the related problem of choosing the structure of the reduced model p in the discussion.

The original model q defines a probability distribution over trajectories in X, and by projecting each trajectory onto the chosen reduced state space, we get the (exact) projection of q onto this space, which we denote q̃ [Fig. 1(a)]. This is a stochastic process on X̃ that is generally not Markovian and thus cannot be modeled using the CME. We aim to find a tractable approximation p to q̃ that can be described using the CME, and we will do this by minimizing the KL divergence KL(q̃ ∥ p) between the two models on the space of trajectories. Several well-known examples of model reduction for the CME are illustrated in Fig. 2.

FIG. 2.

Common model reduction techniques for the Chemical Master Equation. (a) The QSSA eliminates intermediate species, which evolve on a faster timescale than others, replacing them by their steady-state values. In the case of the pictured Michaelis–Menten enzyme system, this is often applied to the enzyme E and the substrate–enzyme complex ES. (b) The QEA is an analog of the QSSA that can be applied when a reaction and its reverse equilibrate rapidly on timescales of interest. This can occur for example when a gene switches rapidly between states. (c) The LMA replaces protein-gene binding reactions by effective unimolecular reactions, where the concentration of proteins is approximated by its mean conditioned on the gene state. While this does not remove any species from the system, it simplifies the propensities of the reactions.


Jumps in q come in two kinds: those that affect the observed species ñ, which we will call visible jumps, and those that only change z, which we call hidden. The jumps in q̃ correspond to visible jumps in q. In the context of the CME, jumps are typically grouped into reactions with fixed stoichiometry, often also called reaction channels, and we can similarly distinguish visible and hidden reactions in q. We will always assume that different reactions have different stoichiometries so that every jump in q and p corresponds to a unique reaction. Reactions with the same stoichiometry can always be combined by summing their propensities.

We introduce some more notation at this point, which is summarized in the nomenclature and illustrated in Fig. 1(c). A single realization, or trajectory, of q is defined by the sequence of states n0, n1, n2, … ∈ X visited and the jump times 0 < t1 < t2 < ⋯. We will write n_[0,T] = {n(t)}_{0≤t≤T} for a trajectory, where n(0) = n0 and n(t) = ni for ti ≤ t < ti+1, and denote by τi = ti+1 − ti the waiting times between jumps.

For a continuous-time Markov process p, e.g., one defined using the CME, we denote the transition rate from state n to m ≠ n by p_mn. We let p_n = Σ_{m≠n} p_mn be the total transition rate out of n. The transition probabilities at n are then given by p_m|n = p_mn/p_n. For completeness, we define p_n|n ≔ 0.
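Trajectories in this representation can be generated with a standard Gillespie-type stochastic simulation algorithm (SSA). The following minimal Julia sketch uses an illustrative `rates` interface of our own devising; it is not taken from the code accompanying the paper.

```julia
using Random

# Minimal Gillespie (SSA) sampler for a continuous-time Markov chain.
# `rates(n)` returns a vector of (m, rate) pairs: the states m reachable
# from n and the corresponding transition rates q_mn. (Illustrative interface.)
function sample_trajectory(rates, n0, T; rng=Random.default_rng())
    states, times = [n0], [0.0]
    n, t = n0, 0.0
    while true
        out = rates(n)
        isempty(out) && break        # absorbing state: no outgoing transitions
        qn = sum(last, out)          # total exit rate q_n
        t += randexp(rng) / qn       # waiting time ~ Exponential(rate q_n)
        t > T && break
        r = rand(rng) * qn           # choose the next state with prob. q_mn / q_n
        for (m, rate) in out
            r -= rate
            if r <= 0
                n = m
                break
            end
        end
        push!(states, n); push!(times, t)
    end
    return states, times             # visited states n_i and jump times t_i
end
```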

The Kullback–Leibler divergence between two distributions q̃ and p is defined as
KL(q̃ ∥ p) = E_q̃[ log( q̃(ñ_[0,T]) / p(ñ_[0,T]) ) ]   (1)
          = H(q̃; p) − H(q̃),   (2)
where the quantity H(q̃;p) is known as the cross-entropy,
H(q̃; p) = −E_q̃[ log p(ñ_[0,T]) ].   (3)
Minimizing the KL divergence with respect to p is therefore equivalent to minimizing the cross-entropy. Alternatively, it is equivalent to maximizing the average log-likelihood under p of samples drawn from q̃, which is a statistical inference problem.
In this paper, samples from q̃ and p are trajectories. We therefore need to compute the log-likelihood of an arbitrary trajectory n[0,T] under p, which by assumption is a stochastic reaction network described by the CME. In  Appendix A, we show that the log-likelihood for a trajectory visiting the states n0, n1, …, nk with waiting times τ0, τ1, …, τk is given by
log p(n_[0,T]) = log p_0(n_0) + Σ_{i=1}^{k} log p_{n_i n_{i−1}} − Σ_{i=0}^{k} τ_i p_{n_i}.   (4)

Note that computing the likelihood of a fully observed trajectory is easier than computing the likelihood of a trajectory sampled at discrete time points, which is the form of data typically observed in biological experiments. The difference is that computing the likelihood in the latter case requires integrating over all possible courses of the trajectory between observations; this integral is computed implicitly by the Chemical Master Equation, which is generally hard to solve. For a fully observed trajectory, there are no hidden variables to integrate out, which considerably simplifies the task of computing likelihoods.
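A direct transcription of Eq. (4) into code is correspondingly short. The sketch below uses an illustrative interface mirroring the notation above (`rate(n, m)` for the transition rate p_mn, `total_rate(n)` for p_n) and consumes trajectories in the states-and-jump-times representation of Fig. 1(c).

```julia
# Log-likelihood of a fully observed trajectory under a Markov model p, Eq. (4).
# `logp0` is the log initial distribution; `states` and `times` are as returned
# by the sampler above, and T is the observation horizon.
function trajectory_loglikelihood(logp0, rate, total_rate, states, times, T)
    ll = logp0(states[1])
    for i in 2:length(states)
        tau = times[i] - times[i-1]                   # waiting time τ_{i-1}
        ll += log(rate(states[i-1], states[i]))       # jump term log p_{n_i n_{i-1}}
        ll -= tau * total_rate(states[i-1])           # survival term −τ_{i-1} p_{n_{i-1}}
    end
    ll -= (T - times[end]) * total_rate(states[end])  # no jump in (t_k, T]
    return ll
end
```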

Returning to the problem of minimizing the KL divergence (1), or equivalently the cross-entropy, we need to compute expectations over q̃ that are not in general available in closed form. Instead, we approximate these expectations by simulating N trajectories ñ_[0,T]^(i), i = 1, …, N, from q̃ and computing the estimate
Ĥ(q̃; p) = −(1/N) Σ_{i=1}^{N} log p(ñ_[0,T]^(i)).   (5)
We can then minimize this with respect to the reduced model parameters by gradient descent, noting that the cross-entropy is a convex function of p. Formulas for the gradients can be computed by differentiating Eq. (4) by hand or by using automatic differentiation software. Minimizing the estimated cross-entropy via gradient descent yields a probabilistic estimate of the optimal parameters, which will generally converge to the true solution as N → ∞. Summarizing the contents of this section, we arrive at Algorithm 1 for automatically fitting the optimally reduced model.
ALGORITHM 1.

Model reduction via gradient descent.

Inputs: number of simulations N, simulation length T, full model q, model family {pθ}, learning rate η
Output: reduced model parameters θ̂
for all i = 1, …, N do
 sample n_[0,T]^(i) from q
 project n_[0,T]^(i) to ñ_[0,T]^(i)
end
initialize parameters θ̂
while not converged do
 compute Ĥ(q̃; p_θ̂) using (4) and (5)
 compute ∇_θ Ĥ(q̃; p_θ̂)
 θ̂ ← θ̂ − η ∇_θ Ĥ(q̃; p_θ̂)
end
return θ̂

This algorithm has the advantage of being completely general, but unlike standard model reduction algorithms, it involves numerical optimization. In Sec. II C, we will analyze our loss function more closely to obtain alternative expressions for the optimum, allowing us to bypass Algorithm 1 for a wide range of problems.

Instead of minimizing KL(q̃ ∥ p), an alternative approach would minimize the opposite KL divergence KL(p ∥ q̃). The latter has been analyzed, e.g., in Refs. 31 and 32 for continuous-time Markov processes. Despite the almost symmetric definition, it is well known that these two quantities behave rather differently in an optimization context and lead to very different results; cf. Ref. 37. From a pragmatic point of view, minimizing KL(q̃ ∥ p) involves computing likelihoods under p, while minimizing KL(p ∥ q̃) involves computing probabilities of trajectories under q̃, which is significantly more difficult.

1. Example: Telegraph model

We will illustrate the above by means of a well-known example, the telegraph model of gene transcription,38 simplified to neglect degradation and depicted in Fig. 3. The full model q consists of three species: a gene in the on state (Gon) and the off state (Goff), together with mRNA (M). The gene switches between both states with constant activation rate σon and inactivation rate σoff, and in the active state produces mRNA with rate ρon. The reduced model p consists only of mRNA (M), which is produced at constant rate ρeff. This instructive example has previously been treated in Ref. 33 in a similar context.

FIG. 3.

Model reduction illustrated using the telegraph model without degradation. (a) Schematic of the model. (b) The reduced model is equivalent to a Poisson process with rate ρeff. (c) Example trajectory from the telegraph model, showing mRNA numbers (top) and the gene state (bottom) in time. If we only observe mRNA numbers, we can infer the current gene state by filtering, obtaining the probability for the gene being on at a given time (blue line). When mRNA numbers increase, the on probability jumps to 1 since the transcription only happens in that state. (d) KL divergence rate between the telegraph model and its Poisson reduction for various choices of the switching timescale τ = σon + σoff and on probability pon = σon/(σon + σoff), assuming a fixed transcription rate ρon = 1. The Poisson approximation becomes more accurate as the switching timescale increases compared to the transcription timescale. KL divergences were estimated numerically using Algorithm 2.


A state in the full model can be represented as n = (g, m), where g is the gene state (on or off) and m is the number of mRNAs present, while a reduced state ñ=(m) is given by the number of mRNA molecules only, with the unobserved component z = (g) being the gene state. The only reaction in q that descends to q̃ is mRNA production, whereas the two gene switching reactions are not observed. Note that the projection q̃ of the telegraph model is not Markovian as the instantaneous mRNA production rate depends on the current gene state g, but we can sample from q̃ by simulating q and discarding the information about the gene state.

The reduced model p is a Poisson process whose only parameter is the production rate ρeff. The only trajectories possible under p are those starting at m = 0 and increasing by one at every jump. The log-likelihood of such a trajectory ñ[0,T] under p is given by
log p(ñ_[0,T]) = ñ(T) log ρ_eff − ρ_eff T,   (6)
where ñ(T) is the total number of mRNAs produced up to time T, as can be verified using Eq. (4). The log-likelihood of any other trajectory is −∞.

Note that Eq. (6) describes the likelihood for an entire trajectory, as opposed to the likelihood of observing nT molecules at time T. The latter follows a Poisson distribution with mean ρeffT and can be obtained by integrating Eq. (6) over all trajectories that end at the same nT. Indeed, conditioned on observing nT mRNA molecules at time T, their individual production times are uniformly and independently distributed on [0, T] by the properties of the Poisson process, and integrating Eq. (6) over all possible combinations of production times, we recover the usual Poisson likelihood, keeping in mind that any permutation of the production times yields the same trajectory.
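Spelling out that computation (a standard Poisson-process identity; the algebra is ours):

```latex
% Integrating the trajectory likelihood (6) over the n_T production times in [0, T];
% the prefactor 1/n_T! accounts for permutations yielding the same trajectory.
P\bigl(n(T) = n_T\bigr)
  = \frac{1}{n_T!} \int_0^T \!\cdots\! \int_0^T
    \rho_{\mathrm{eff}}^{\,n_T}\, e^{-\rho_{\mathrm{eff}} T}\,
    \mathrm{d}t_1 \cdots \mathrm{d}t_{n_T}
  = \frac{(\rho_{\mathrm{eff}} T)^{n_T}}{n_T!}\, e^{-\rho_{\mathrm{eff}} T}.
```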

The reduced model p can model every trajectory obtained from q̃, so the log-likelihood of any sample from q̃ is finite; the mRNA transcription reactions in the two models correspond. If q were to include, e.g., mRNA degradation, as is usual in the literature, some trajectories from the full model would feature decreases in ñ and be impossible under p; in this case, the KL divergence would be infinite, which signals that the reduced model p is not appropriate and needs to be extended to include degradation.

For this simple example, the likelihood of a reduced trajectory under p only depends on the total mRNA produced. We can therefore compute the cross-entropy explicitly,
H(q̃; p) = ρ_eff T − E_q̃[ñ(T)] log ρ_eff.   (7)
Minimizing this with respect to ρeff yields the optimum
ρ_eff = E_q̃[ñ(T)] / T,   (8)
which is the average mRNA production rate on the interval [0, T]. Thus the optimal approximation to the telegraph model is obtained by setting the mRNA production rate to its mean value. This is the result we obtain by applying the QEA to this system (see Sec. II E 2), which suggests that the approximation will be better if σon and σoff are large compared to ρon. We will see how to evaluate the approximation quality in Sec. II D.

Even when the analytical minimization above is not possible, the minimization can be performed numerically using Algorithm 1. To do so, we start by sampling a fixed number N of mRNA trajectories from the telegraph model; the more trajectories we use, the better our estimate of ρeff will be. The next step is constructing the cross-entropy loss function for the reduced model, which only depends on the single parameter ρeff and is given by Eq. (7). We then perform gradient descent on the cross-entropy, averaged over all trajectories, to arrive at our estimate of ρeff. Here, the learning rate η in Algorithm 1 has to be determined by experimentation. The code accompanying this paper contains an example implementation of this procedure in Julia.
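The following condensed sketch illustrates the procedure for this example; it is our own illustration rather than the accompanying code. The gradient of the cross-entropy (7) with respect to ρeff is T − k̄/ρeff, where k̄ estimates E_q̃[ñ(T)] from the sampled trajectories.

```julia
# Algorithm 1 specialized to the telegraph example: gradient descent on the
# estimated cross-entropy (7). `counts[i]` is the total number of mRNAs
# produced in the i-th sampled trajectory (illustrative variable names).
function fit_rho_eff(counts, T; eta=1e-3, iters=10_000)
    kbar = sum(counts) / length(counts)    # Monte Carlo estimate of E[ñ(T)]
    rho = 1.0                              # initial guess for ρ_eff
    for _ in 1:iters
        grad = T - kbar / rho              # d/dρ (ρT − k̄ log ρ)
        rho = max(rho - eta * grad, 1e-8)  # gradient step, keeping ρ > 0
    end
    return rho                             # approaches the optimum k̄/T of Eq. (8)
end
```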

In Sec. II B, we have derived a numerical procedure to minimize the KL divergence (1) or, equivalently, the cross-entropy (3). While this yields a direct computational procedure to fit reduced models, we are equally interested in the theoretical insights to be gained from considering model reduction as a variational problem. Understanding how our approach works in more detail will help us establish links with existing model reduction techniques and derive general principles that can help us generalize these to new systems.

As shown in  Appendix B, for a Markovian reduced model p, the cross-entropy can be written as
(9)
(10)
with the instantaneous cross-entropy rate at time t defined as
(11)
Here, we use the effective transition rates
(12)
If q̃ is not Markovian, the transition probability from a state ñ at time t will be affected by the history of the process, and hence by the initial distribution q0. If q̃ is Markovian, Eq. (12) reduces to the classical transition rate, which is independent of q0.

Looking at Eq. (9), we make two important observations. As intimated in the discussion of the telegraph example of Sec. II B 1, if any trajectory under q̃ has jumps that are not allowed under p, i.e., if q̃_m̃ñ ≠ 0 while p_m̃ñ = 0, then the cross-entropy (9) will be infinite: the reduced model p is not flexible enough to model q̃. On the other hand, if p contains transitions that are impossible under q̃, i.e., if p_m̃ñ ≠ 0 while q̃_m̃ñ = 0, then these can only increase the cross-entropy (9). This means that the optimal reduced model p does not contain transitions beyond those in q̃, and in the context of the CME, we can therefore assume that p and q̃ have the same reactions, with different propensities (Markovian for p, not necessarily for q̃).

If the ith reaction in p has propensity function ρi(ñ; t) and net stoichiometry si, we obtain the following decomposition of the cross-entropy rate at time t:
h_t(q̃; p) = Σ_i Σ_ñ q_t(ñ) [ ρ_i(ñ; t) − q̃_{ñ+s_i, ñ}(t) log ρ_i(ñ; t) ],   (13)
where the first sum is over all reactions in p or, equivalently, all visible reactions in q. The total cross-entropy is obtained by integrating Eq. (13) over [0, T], and we can find the optimal p by optimizing the cross-entropy for each reaction separately.
We can minimize Eq. (13) analytically if the full model q is Markovian. Assume that there is precisely one reaction in q with net stoichiometry s̃_i in the projection, and let σi be its propensity function. Differentiating the above equation with respect to ρi(ñ; t) and setting the derivative to zero, we obtain
ρ_i(ñ; t) = E_q[ σ_i(n(t)) | ñ(t) = ñ ].   (14)
The optimal propensity for a reaction under p is the expected propensity under the original model conditioned on the observed state ñ. In particular, if the propensity of the original reaction does not depend on the unobserved species, the conditional expectation disappears and the propensity carries over unchanged. We explore the implications of these results on the moments of the reduced system in Sec. II E 4.
In practice, we often place constraints on reduced propensities ρi(ñ;t), such as time-homogeneity and mass-action kinetics, which result in constrained optima. For example, if ρi is taken to be independent of time, we have to integrate Eq. (13) over [0, T], and as T → ∞, it can be verified that the total cross-entropy is minimized for
ρ_i(ñ) = lim_{t→∞} E_q[ σ_i(n(t)) | ñ(t) = ñ ],   (15)
which is the steady-state version of Eq. (14).

For other forms of propensity functions, the optimum can generally be obtained by a similar averaging procedure. In all cases, the main difficulties lie in estimating the relevant conditional expectations, which can be done numerically using the Monte Carlo approach presented in Sec. II B. Equation (14) is perhaps the central result of this paper, and we will see that it coincides with the propensities obtained using the QSSA, the QEA, and the LMA when these are applied. These methods can, therefore, all be seen as special cases of our variational approach based on minimizing the KL divergence. As an aside, our results provide a new derivation of the marginal CME proposed in Refs. 30 and 34 and the nontrivial result that Eq. (14) and Algorithm 1 target equivalent objectives.

In particular, Algorithm 1 can be bypassed by computing certain conditional moments of a system, which provides an alternative way to perform model reduction. For many systems, these conditional moments can be efficiently computed using moment equations, possibly involving moment closure approximations, which have attracted a large literature.4 The LMA, for example, involves a bootstrapping approach to estimate the relevant moments. Alternatively, conditional expectations could be computed by numerically integrating the CME, although we do not expect this to scale well to large systems. In consequence, Algorithm 1, while a convenient way to implement our approach in full generality and empirically verify its performance, can often be bypassed.
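For instance, when (approximately) stationary samples of the full model are available, the conditional expectation in Eq. (15) can be estimated by simple averaging. The sketch below tabulates the sample mean of a full propensity σi for each observed reduced state ñ; the interface and names are illustrative.

```julia
# Monte Carlo estimate of the optimal stationary propensity (15): average the
# full propensity `sigma` over samples of the full model, conditioned on the
# projected state ñ = project(n). Returns a lookup table ñ -> E_q[σ_i | ñ].
function reduced_propensity(samples, project, sigma)
    sums, counts = Dict{Any,Float64}(), Dict{Any,Int}()
    for n in samples
        nt = project(n)
        sums[nt] = get(sums, nt, 0.0) + sigma(n)
        counts[nt] = get(counts, nt, 0) + 1
    end
    return Dict(nt => sums[nt] / counts[nt] for nt in keys(sums))
end
```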

1. Example: Telegraph model (cont.)

Consider the telegraph model from Sec. II B 1. If we make the reduced model p more flexible by allowing a nonlinear and time-dependent propensity ρ(m; t) = ρeff(m; t), then the cross-entropy between q̃ and p can be computed as
(16)
Note that both q̃ and p have the same initial conditions. The effective propensities of q̃ can be computed as
ρ̃(m; t) = ρ_on q_t(g = on | m),   (17)
which is ρon weighted by the probability that the gene is on at time t, given that there are m mRNA molecules present. This is the optimal propensity for the reduced model p in accordance with Eq. (14), featuring a nonlinear dependence on m, and it can be shown to be the optimal Markovian approximation to q̃. We will see in Sec. II E 4 that this approximation exactly matches the mRNA distribution of the full model at all times.
If we require mass-action kinetics for the reduced model, i.e., if we let ρ(m; t) = ρeff(t) for an effective rate constant ρeff(t), minimizing the cross-entropy (16) yields
ρ_eff(t) = ρ_on q_t(g = on).   (18)
The effective transcription rate learned by p is the expected transcription rate of q at time t, this time averaged over all m.
Finally, if we require ρ(m; t) = ρeff to be time-independent, we obtain by a similar procedure
ρ_eff = (1/T) ∫_0^T ρ_on q_t(g = on) dt,   (19)
which is just the previous result averaged over t. This is the special case considered in Sec. II B 1, where p was restricted to be a Poisson process with constant intensity. As T → ∞, optimum (19) converges to the expected transcription rate of q at steady state.

It can be checked that propensity functions (17) and (18) exactly preserve the mean mRNA numbers of the full system. Similarly, the time-independent propensity function (19) preserves the mean mRNA numbers at steady state. For mass-action propensities, however, variances will be underestimated. This is because reduced models do not model gene switching, which increases the noise in the system. This is a common occurrence with model reduction and is well known for the telegraph model since the mRNA numbers are always distributed following a Poisson mixture distribution, which will always have a higher variance than a single Poisson distribution with the same mean.39 We will further explore how well reduced models capture marginal distributions of the original in Sec. II E 4.
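The last point is an instance of the law of total variance: for mRNA numbers that are conditionally Poisson with a random rate Λ,

```latex
% Over-dispersion of a Poisson mixture, m | \Lambda ~ Poisson(\Lambda):
\operatorname{Var}[m]
  = \mathbb{E}[\operatorname{Var}(m \mid \Lambda)]
  + \operatorname{Var}(\mathbb{E}[m \mid \Lambda])
  = \mathbb{E}[\Lambda] + \operatorname{Var}[\Lambda]
  \;\geq\; \mathbb{E}[\Lambda] = \mathbb{E}[m],
```

with equality only when Λ is deterministic, so a single Poisson distribution matched to the mean necessarily underestimates the variance whenever the rate fluctuates.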

To this point, we have been occupied with minimizing the KL divergence (1), or rather the cross-entropy (3), with respect to p. Perhaps counter-intuitively, minimizing the KL divergence is easier than computing it since the entropy
H(q̃) = −E_q̃[ log q̃(ñ_[0,T]) ]   (20)
can be ignored during optimization. In order to assess the performance of the reduced model obtained this way, we want to compute the full KL divergence, and hence the entropy, explicitly. As for the cross-entropy, the integral in Eq. (20) is generally intractable, and we will approximate the entropy by simulating N samples ñ[0,T](i) from q̃ and computing
Ĥ(q̃) = −(1/N) Σ_{i=1}^{N} log q̃(ñ_[0,T]^(i)).   (21)
Here we face another difficulty, for if q̃ is not Markovian, we cannot use Eq. (4) to compute the log-likelihood of a trajectory ñ[0,T]. Since by assumption q is a Markov process, the projection q̃ is just a partially observed Markov process and we can use the so-called forward algorithm for hidden Markov models40 to compute the log-likelihood of a trajectory.
The forward algorithm computes the joint probability q(ñ[0,t],zt) for the observed trajectory up to time t and the current state of the hidden species, zt. As shown in  Appendix C, the forward algorithm in this case yields the following set of jump ordinary differential equations (ODEs):
(22)
(23)
(24)
with jumps at the jump times of the observed trajectory ñ[0,T]. The first sum in Eq. (23) represents hidden reactions (those that do not affect ñ), and the second sum is over all visible reactions in q, with propensity functions given by σi. The marginal likelihood of the observed trajectory can then be computed by summing over the unobserved variables z,
q̃(ñ_[0,t]) = Σ_z q(ñ_[0,t], z_t = z).   (25)

The above equations take the form of a modified CME where the observed variables are fixed. They can be solved using an adaptation of the Finite State Projection algorithm,5 and in cases where the unobserved state space is finite, exact solutions can be computed numerically. We summarize the procedure to estimate the KL divergence defined in Eq. (1) in Algorithm 2.

ALGORITHM 2.

Computing the projected KL divergence between a full and a reduced model.

Inputs: number of simulations N, simulation length T, full model q, reduced model p
Output: KL divergence estimate KL̂(q̃ ∥ p)
L̂ ← 0
for all i = 1, …, N do
 sample n_[0,T] from q
 project n_[0,T] to ñ_[0,T]
 compute log p(ñ_[0,T]) using (4)
 compute log q̃(ñ_[0,T]) using (22)–(25)
 L̂ ← L̂ + log q̃(ñ_[0,T]) − log p(ñ_[0,T])
end
return L̂/N

As further shown in  Appendix C, a corollary of the above equations is the following jump ODE for the marginal distribution itself:
(26)
(27)
(28)
These equations generalize Eq. (4) to the presence of unobserved species z; they can be found in various places in the literature, e.g., Refs. 33, 35, and 41–44, although we are not aware of a consistent nomenclature for this modification of the CME.

The equations for the log-likelihood are linear and can be split into a sum of contributions for each reaction in q̃. Integrating over all reduced trajectories ñ[0,T] shows that the entropy H(q̃) can thus be expressed as a sum of contributions from each reaction in q̃. Equation (13) shows that the same decomposition holds for the cross-entropy H(q̃; p), so the entire KL divergence KL(q̃ ∥ p) can be decomposed into contributions coming from the individual reactions, which allows us to differentially analyze the accuracy of the reduced model p for each reaction separately. We note that all reactions together determine the distribution q on trajectories over which we integrate, so this decomposition is not strict in practice.

1. Example: Telegraph model (cont. 2)

We can solve the filtering problem for the telegraph model exactly, as the unobserved state space is two-dimensional. Let F(i; t) = q(g(t) = i, m_[0,t]). The telegraph model consists of three reactions: the hidden switching reactions with rate constants σon and σoff and the visible transcription reaction with rate constant ρon. Specializing Eq. (23) to this system, we obtain
dF(on; t)/dt = σ_on F(off; t) − (σ_off + ρ_on) F(on; t),   (29)
dF(off; t)/dt = σ_off F(on; t) − σ_on F(off; t),   (30)
where m[0,t] denotes the observed mRNA trajectory. At each mRNA production event we update the probabilities according to Eq. (24), which in this case reads
F(on; t_i⁺) = ρ_on F(on; t_i⁻),   (31)
F(off; t_i⁺) = 0.   (32)
The marginal likelihood of the observed mRNA trajectory is then given by
q̃(m_[0,T]) = F(on; T) + F(off; T).   (33)
These derivations can also be found in Ref. 33.
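A direct implementation is straightforward. The Julia sketch below integrates (29) and (30) with a fixed-step Euler scheme between production events, applies the jump updates (31) and (32) at each event, and accumulates the log marginal likelihood (33); the stationary initialization of the gene state and the crude integrator are simplifying assumptions of this illustration.

```julia
# Exact filter (29)–(33) for the telegraph model. `jump_times` are the observed
# mRNA production times on [0, T]; returns the log marginal likelihood log q̃(m_[0,T]).
function telegraph_loglikelihood(jump_times, T, s_on, s_off, rho_on; dt=1e-4)
    F = [s_on / (s_on + s_off), s_off / (s_on + s_off)]  # [F(on; 0), F(off; 0)]
    t, ll = 0.0, 0.0
    for tj in vcat(collect(jump_times), T)
        while t < tj                                     # Euler steps of (29), (30)
            h = min(dt, tj - t)
            dF_on = s_on * F[2] - (s_off + rho_on) * F[1]
            dF_off = s_off * F[1] - s_on * F[2]
            F[1] += h * dF_on
            F[2] += h * dF_off
            t += h
        end
        if tj < T
            F[1] *= rho_on                               # production event: (31)
            F[2] = 0.0                                   # and (32)
        end
        s = F[1] + F[2]                                  # renormalize for numerical
        ll += log(s); F ./= s                            # stability; accumulates (33)
    end
    return ll
end
```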

As a by-product, the above equations yield the conditional distribution over the gene state at each time t, given the trajectory prior to that time point. These are the filtered distributions illustrated in Fig. 3(c); for our purposes, however, we only need the marginal likelihood.

Using the above equations to compute the log-likelihood for a large number of trajectories sampled from the telegraph model q yields a numerical approximation of the entropy H(q̃), and together with Eq. (6), we obtain an estimate of the KL divergence KL(q̃ ∥ p). For time-homogeneous propensities, we empirically found the KL divergence to be asymptotically proportional to T, and we call the proportionality factor the (steady-state) KL divergence rate. Note that the cross-entropy rate can be defined directly using Eq. (11).

In Fig. 3(d), we show how the KL divergence rate between the telegraph model and its Poisson approximation changes for various choices of σon and σoff. We observe that the KL divergence rate tends to 0 as the switching rates increase; indeed, the Poisson approximation can be obtained from the QEA for this example. We further note that the KL divergence is maximal when the gene spends substantial amounts of time in each state; for pon close to 1, the gene is almost always active and the system approaches constitutive expression with rate ρon, while for pon close to 0, the gene only activates sporadically and resembles a constitutive gene with a very low expression rate ρon · pon. Visually inspecting mRNA trajectories generated in both regimes confirms that they become harder to distinguish from a Poisson process.

Our results on the telegraph model without degradation apply almost verbatim when degradation is included: adding the degradation reaction M → ∅ to the full and reduced models does not affect the KL divergences considered. A priori, adding degradation changes the probability distribution over trajectories, which can now exhibit decreasing mRNA numbers, but the log-likelihood contributed by the degradation reaction is the same between the telegraph model and its reduction, and therefore the difference cancels in Eqs. (26)–(28) and Eq. (13). Since the log-likelihood contributed by the gene switching and transcription reactions does not depend on mRNA numbers, which do change in the presence of degradation, the total KL divergence is unaffected. In the presence of feedback, degradation would indirectly affect the KL divergence via its effect on mRNA numbers; we refer the interested reader to the recent paper35 for an analysis of this type of information flow in stochastic biochemical reaction networks. We can readily verify that our observations on the mean and variance of mRNA numbers under the telegraph model and its reduction remain valid in the presence of degradation.

1. The Quasi-Steady State Approximation

In deterministic chemical kinetics, the Quasi-Steady State Approximation (QSSA) is a model reduction technique that can be applied when a system can be partitioned into the so-called slow species ns and fast species nf such that the fast species nf evolve on a much faster time scale than the slow species. On the timescale on which the slow species evolve, the fast species can, therefore, be assumed to reach their steady state almost instantaneously. Thus, one assumes d n_f/dt = 0, which allows for the simplification of the remaining equations for the slow species. The QSSA has famously been applied to Michaelis–Menten enzyme kinetics with the enzyme–substrate complex ES as the fast species, resulting in the classical Michaelis–Menten propensity for product formation (see Sec. III B for more information).
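For reference, the classical deterministic computation runs as follows:

```latex
% Deterministic QSSA for Michaelis–Menten: set d[ES]/dt = 0 with [E] = E_T - [ES].
k_1 [E][S] - (k_{-1} + k_2)[ES] = 0
\;\;\Longrightarrow\;\;
[ES] = \frac{E_T [S]}{K_m + [S]},
\qquad
\frac{\mathrm{d}[P]}{\mathrm{d}t} = k_2 [ES] = \frac{k_2 E_T [S]}{K_m + [S]},
```

with K_m = (k_{−1} + k_2)/k_1 the Michaelis–Menten constant.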

The QSSA was extended to the stochastic case in Ref. 8. Here, one assumes that conditioned on ns, the fast species reach the steady state nearly instantaneously compared to the timescale of interest. In our formulation, the reduced state space consists of the slow species, ñ=ns, and the unobserved species are the fast species, z = nf. As shown in Ref. 8, the QSSA yields the following reduced CME for the approximation p [Eqs. (10) and (11) in Ref. 8]:
dp(n_s; t)/dt = Σ_i [ ρ̃_i(n_s − s̃_i) p(n_s − s̃_i; t) − ρ̃_i(n_s) p(n_s; t) ],   (34)
where the reduced propensities ρ̃i are defined as
ρ̃_i(n_s) = Σ_{n_f} σ_i(n_s, n_f) q(n_f | n_s),   (35)
where q(n_f | n_s) denotes the steady-state distribution of the fast species conditioned on the slow species.
This agrees precisely with Eq. (14), which illustrates how the QSSA can be seen as minimizing the KL divergence (B4). We remark that Eq. (34) is a special case of the phenomenon discussed in Sec. II E 4. We can compare (35) with the true propensities of the projection q̃, which depend on the entire history of the observed trajectory and the initial distribution q0,
ρ̃_i(t) = E_q[ σ_i(n(t)) | ñ_[0,t] ].   (36)
This equation, which describes the instantaneous reaction rate of the non-Markovian process q̃, has previously been considered, e.g., in Refs. 8 and 33.
Some intuition for the relationship between timescale separation and our approach can be gained from Eqs. (26)–(28), which describe the probability of a reduced trajectory ñ[0,T] under q̃. Reaction propensities under q, in general, depend on the unobserved species, whose distribution is correlated with the history of the current trajectory. If we assume that the timescale of the unobserved species nf is very fast, then this correlation will decay very quickly and the time-dependence of the conditional distributions in Eqs. (26)–(28) will be negligible, so we can replace these equations by
(37)
(38)
(39)
which is another way of showing that q̃ has propensities given by (35).

2. The Quasiequilibrium Approximation

While the QSSA relies on a reaction network being divisible into slow and fast species that evolve on two different timescales, in practice, it is more frequently the case that some reactions in a system will be fast and that some will be slow. The Quasiequilibrium Approximation (QEA) developed in Refs. 7, 45, and 46 modifies the QSSA to model this scenario by reformulating the system using extents a = (a1, …, ar), where the extent ai is defined as the number of times reaction i has occurred.

The extents themselves form a Markov chain, and the state of a system can be obtained from its extents as n(t) = n0 + Sa(t), where S is the stoichiometry matrix. As derived, e.g., in Ref. 17, the CME of this system is
dp(a; t)/dt = Σ_i [ σ_i(n_0 + S(a − e_i)) p(a − e_i; t) − σ_i(n_0 + S a) p(a; t) ].   (40)
Here, the sum is over reactions and ei is the vector with a 1 in the ith position and 0 elsewhere. The QEA assumes that the extent vector can be divided into slow and fast components as and af, respectively, corresponding to slow and fast reactions, and we define our reduced system to consist only of slow reactions. From here, we can proceed analogously to the QSSA discussed above and obtain that the reduced propensities are given by
ρ̃_i(a_s) = E_q[ σ_i(n(t)) | a_s(t) = a_s ],   (41)
if the fast reactions can be assumed to equilibrate instantaneously.

A subtlety of this argument is the fact that the fast extents af can only increase in time and, therefore, will not admit a steady-state distribution, in general. The conditional means, however, will converge in many cases, e.g., if af consists of both directions of a reversible reaction. If this is the case, the QEA yields a well-defined reduction, which, moreover, agrees with (14) and, therefore, minimizes the KL divergence to the full model on the space of extents.

3. The Linear Mapping Approximation

The Linear Mapping Approximation (LMA)25 replaces a bimolecular reaction of the form G + X → G* with rate constant σ, where G is a binary species representing a gene state, by a unimolecular reaction G → G* with effective rate constant σ̄. Assuming mass-action kinetics, the propensity function of the bimolecular reaction is ρ(n) = σgx, where g is the state of G and x is the abundance of the species X, and the propensity of the linearized version is ρ̃(n) = σ̄g for an appropriate choice of σ̄. Taking the derivative with respect to σ̄ of the cross-entropy rate (13) at time t yields
∂h_t(q̃; p)/∂σ̄ = E[g; t] − (σ/σ̄) E[gx; t],   (42)
which vanishes if and only if
σ̄ = σ E[gx; t] / E[g; t] = σ E[x | g = 1; t].   (43)
This provides a mathematical justification for the mean-field assumption underlying the LMA.25 The LMA approximates this conditional expectation by imposing a self-consistency condition on the linearized system, thereby deriving an effective approximation to (43).
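As a quick numerical check, the optimum (43) can be estimated directly from paired steady-state samples of the gene state and the copy number of X, e.g., from SSA runs of the full model; the one-liner below uses illustrative names.

```julia
# Estimate the optimal linearized rate σ̄ = σ E[x | g = 1] from paired samples:
# `gs[i]` is the gene state (0 or 1) and `xs[i]` the copy number of X.
lma_rate(sigma, gs, xs) = sigma * sum(gs .* xs) / sum(gs)
```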
As an aside, we can compute the KL divergence rate between q and the optimal reduction p given by (43) analytically to obtain
(44)
Here, the dependence of expectations on the time t is suppressed. This expression for the discrepancy incurred by the LMA, which resembles the definition of the variance, quantifies the intuition that the LMA should be accurate if fluctuations of X in the unbound gene state are small.

The above derivation is not entirely rigorous: the linearization has a different net stoichiometry than the original reaction (it does not consume X), so the KL divergence between q and p is formally infinite. To remedy this, we can either neglect fluctuations in X due to binding in the original model or consider the KL divergence on a reaction-by-reaction basis, as discussed in Sec. II D, since the two reactions correspond despite their different net stoichiometries.

If fluctuations in X are absent or can be neglected, the optimal reduction using Eq. (43) preserves first-order moments for all species, assuming that all other reactions are linear. Indeed, replacing the reaction G + X → G* by its linearization only changes the moment equation for E[g]. For the full model, the latter is given by
dE[g; t]/dt = −σ E[gx; t] + (⋯),   (45)
while the reduction has
dE[g; t]/dt = −σ̄ E[g; t] + (⋯),   (46)
which is equivalent for a binary species g (the remaining terms only depend on species means by our linearity assumption). Second- and higher-order moments will, in general, differ between the two as we will see. This sheds light on a more subtle aspect of the LMA: the self-consistent approach in Ref. 25 involves computing the second-order moment E[gx;t], which can differ in the reduction. For this reason, the original approach, while generally quite accurate, can lead to slightly suboptimal propensities, as we shall see in Sec. III A.

4. Moment-matching

In  Appendix D, we show that the optimal reduced model p, defined using the propensity functions (14), exactly preserves the marginal distributions of the full model. In particular, if we place no constraints on the reduced propensity functions, the resulting system p will have the same means, variances, and higher-order moments for the observed species as the full system q. Our approach is thus also related to moment-matching, in particular to the moment-based methods proposed in Refs. 11 and 47.

The fact that the reduced model p can exactly capture the probability distribution of the projected model q̃ at any time t does not mean that the two are equivalent. The difference can be seen on the level of individual trajectories. The reduced model p is always memoryless, whereas this is not usually the case for q̃, resulting in different dynamical properties such as autocorrelations. This ability to discriminate models on a trajectory-level illustrates an advantage of our path-based approach over purely moment-based model reduction.

We emphasize that exact preservation of moments only holds for the globally optimal propensity functions defined by Eq. (14), which can depend on ñ and t in a complicated way. In practice, one often uses parametric propensity functions, such as mass-action propensities or Hill functions, and as a consequence, this feature is lost. As with the LMA or the telegraph model, it may still be possible to preserve some moments (such as means) exactly with the right choice of propensity function. Our results imply that discrepancies in the moments between the full and reduced models result from the fact that Eq. (14) is only computed approximately for methods such as the QSSA or the QEA.

In this section, we analyze a simple model of stochastic gene expression featuring positive autoregulation [see Fig. 4(a)]. The system in question consists of a single gene found in two states, bound (Gb) and unbound (Gu), as well as the encoded protein P, which can bind to the gene to increase its own transcription rate. For simplicity, we do not consider mRNA dynamics explicitly, instead modeling protein production as occurring in geometrically distributed bursts (see Ref. 48 for a derivation). The linearized version of the system, using the optimal propensity derived in Sec. II E 3, is shown in Fig. 4(b). In Fig. 4(c), we compare steady-state distributions of the full model with its linearization, where the effective binding rate is computed numerically using Algorithm 1, and with the LMA, where the binding rate is approximated using the self-consistent approach of Ref. 25.

FIG. 4.

Reduction of an autoregulatory feedback loop using the Linear Mapping Approximation. (a) Schematic of the full reaction network. The number k ≥ 0 of proteins produced at each reaction is geometrically distributed with mean b. (b) Reduced form of the same reaction system, where the protein-gene binding reaction is replaced by a mean-field approximation. (c) For the chosen parameter values, the system exhibits critical behavior around σb = 2.5, transitioning from a mostly unbound to a mostly bound state. Near the transition, protein fluctuations increase and the mean-field assumption in the LMA breaks down. A comparison of steady-state moments and distributions for the full model, the optimal reduction, and the LMA shows that the effective reaction rate computed by the latter is generally close to optimal. (d) KL divergence rate at the steady state between the full model and the reduced version, computed analytically using Eq. (44) and using Monte Carlo simulations. The peak around the transition matches the observed discrepancy between the full and the reduced model. The remaining model parameters are σu = 400, ρu = 0.3, ρb = 105, b = 2, and d = 1.


This system exhibits critical behavior for some parameter values [see Fig. 4(c)], and it was shown in Ref. 49 that computing the moments of this system is difficult near points of criticality, as most moment closure techniques and the LMA yield inaccurate results there. Our results show that, for all parameters, the self-consistent equations of the LMA provide results close to the numerically computed optimum. As discussed in Sec. II E 3, the exact reduction very closely reproduces mean protein numbers, with the LMA incurring a small bias near the critical point. In contrast, neither reduction is able to capture the increased fluctuations near the critical point, significantly underestimating the variance in protein numbers. Comparing the protein distributions for the full and the optimally reduced model shows a large discrepancy near the critical point, compared to parameters far from it.

We compute the KL divergence rate at the steady state between the full model and its linearization using (44) and via Monte Carlo estimation [see Fig. 4(d)]. The steady-state KL divergence rate exhibits a notable peak near the critical point, coinciding with the parameter regime where the linearization fails to capture the full model. As we move away in either direction from the critical point, the KL divergence decreases in accordance with the better approximation of the system by its linearized version. This shows how the KL divergence can be used to assess how well model reduction works for different parameter regimes.

We next consider one of the most studied reaction networks in biology, the Michaelis–Menten model of enzyme kinetics, consisting of an enzyme E and a substrate S that reversibly bind to form an enzyme–substrate complex ES, which converts the substrate into the product P [see Fig. 5(a)]. This system is a standard example application of both the QSSA and the QEA, which remain the model reduction techniques most commonly applied to it.

FIG. 5.

Comparison of the QEA and the QSSA for Michaelis–Menten kinetics. (a) Schematic of the Michaelis–Menten system. We assume mass action kinetics. (b) Reduction of the Michaelis–Menten system under the QEA and the QSSA yields a one-step process with effective (non-mass action) propensities that depend on the method used. (c) Total KL divergence between the full model and the QEA and QSSA for different values of the timescale parameter c in (50) and the enzyme abundance parameter ɛ in (48). The QEA generally becomes more accurate as c increases, and the QSSA becomes more accurate as ɛ decreases. Note the small copy number effects that become apparent in the QSSA for low values of ɛ and c. We used analytically obtained propensities for the QEA and the QSSA and numerically estimated KL divergences using Algorithm 2. (d) Comparison of effective reaction propensities computed according to Eq. (14) with those predicted by the QEA and the QSSA. Here, substrates include enzyme-bound substrates. The top panel shows the effective propensities for ɛ = 1: as c → ∞, the effective propensities converge to the function predicted by the QEA. The bottom panel shows the effective reaction propensities for c = 1000, and as ɛ decreases, the propensities approach the function predicted by the QSSA. The remaining parameters were fixed to k1 = k−1 = 0.001, k2 = 0.1, ET = 10, and ST = 100.

The QSSA classically models the enzyme E and the enzyme–substrate complex ES as the fast species and replaces the full system by a one-step reaction where the substrate is converted to the product with a Hill-type propensity function [see Fig. 5(b)]. This approximation is known to be valid when
$$\frac{E_T}{S_T + K_m} \ll 1, \tag{47}$$
where ET is the total number of enzymes and Km is the Michaelis–Menten constant (k−1 + k2)/k1.50,51 Note that the reduced model is invariant under the scaling
$$E_T \to \varepsilon E_T, \qquad k_1 \to k_1/\varepsilon, \qquad k_{-1} \to k_{-1}/\varepsilon, \qquad k_2 \to k_2/\varepsilon, \tag{48}$$
where the parameter ɛ can be seen as controlling the accuracy of the reduction: the rescaled QSSA condition (47) reads
$$\frac{\varepsilon E_T}{S_T + K_m} \ll 1, \tag{49}$$
which suggests that small values of ɛ should result in the full model more closely resembling the QSSA reduction.
The stochastic QEA for this system is analyzed in Ref. 17 and also results in a one-step reaction where the propensity function is now piecewise linear [see Fig. 5(b)]. As opposed to the QSSA, the QEA is accurate when substrate binding and unbinding are fast, i.e., k1, k−1 are sufficiently large. Note that the QEA reduction is invariant under the scaling
$$k_1 \to c\,k_1, \qquad k_{-1} \to c\,k_{-1}, \tag{50}$$
where c controls the timescale at which quasiequilibrium is reached and therefore the accuracy of the QEA. We stress that (48) and (50) are two independent scalings and can be performed simultaneously, as will be done in Fig. 5(c).

In general, it may be difficult to predict which of the two approaches is more accurate unless one is clearly in one of the limiting regimes. Based on the KL divergence between the full model and either of the two reductions, we can investigate this question for a range of parameters. The reduced model consists of two species, S and P, and since [S] + [P] is conserved, we can describe it in terms of either of the two. The correct projection for this example identifies the species S and ES in the full model, i.e., we define [S]red = [S] + [ES] in order to eliminate the binding and unbinding reactions from the projection. This lumping of two rapidly equilibrating species is standard when using the QEA (see, e.g., Ref. 24), but it applies equally well to the QSSA in this case.

In Fig. 5(c), we use Algorithm 2 to estimate the total KL divergence over [0, ∞) between the full model and both reductions for a fixed number of substrate molecules. The total KL divergence is finite since all trajectories enter the same absorbing state, defined by [S] = 0, in a finite amount of time. As expected, the accuracy of the QEA increases with c, and the QSSA tends to become less accurate for large ɛ; in the low enzyme regime, however, we observe a similar decrease in accuracy that is not explained by the deterministic theory.

The observed decrease is a small copy number effect caused by the fact that the waiting time distribution between two productions of P is not exponential: for very small ET, the release of the enzyme immediately after such an event implies that a significant fraction of productions occur as a two-step process (where the free enzyme binds another substrate and then converts it).52 For large enough c, the binding step is very fast, and assuming that k2 ≫ k−1 so that unbinding is unlikely to occur again, the conversion of substrates to products can be viewed as an effective one-step process. The effect of nonexponential waiting time distributions has previously been analyzed in Ref. 53; we see that the KL divergence on trajectories can be sensitive to subtle dynamical effects such as waiting time distributions, which are invisible when considering, e.g., the moments of a system, which are well predicted by the QSSA even for small values of c and ɛ.

Overall, we see that for the chosen parameter values, the QSSA generally performs better than the QEA in the regime of small ɛ, corresponding to low enzyme numbers, while the QEA is most accurate for large c, keeping in mind the scalings in (48) and (50). Neither approximation is satisfactory when the total number of enzymes exceeds the number of substrate molecules but the binding and unbinding rates are small. In this scenario, other approximations, such as the total QSSA,54 which we do not investigate here, will generally be more accurate.

Figure 5(d) compares the effective propensities for the full system, as a function of the unconverted substrate abundance, with the predictions made by the QEA and the QSSA. In the case of the QEA, the effective propensities converge slowly to their limit as the timescale parameter c increases. In contrast, the QSSA provides a good approximation to the effective propensity as long as the number of substrate molecules exceeds the number of enzymes. While the effective propensities are the optimal choice for the reduced model, the actual quality of the approximation is affected both by the size of the fluctuations of the true propensities around their mean and by the degree to which the waiting times of the original system follow an exponential distribution.
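Since the optimal reduced propensities in Eq. (14) are conditional expectations of the full propensities given the projected state, they can be estimated directly from simulations of the full model. Below is a minimal sketch for the product-formation reaction of the Michaelis–Menten system, assuming a long steady-state SSA trajectory of the full model is available as arrays of copy numbers and dwell times (all variable names are illustrative):

```python
import numpy as np

def effective_propensity(S, ES, dwell, k2):
    # Estimate the optimal reduced propensity of the product-formation
    # reaction as a function of the projected state s = [S] + [ES]:
    # the occupancy-weighted conditional mean of the full propensity k2*[ES].
    #   S, ES : copy numbers along a long steady-state trajectory (arrays)
    #   dwell : time spent in each visited state (occupancy weights)
    #   k2    : rate constant of the catalytic step ES -> E + P
    s_red = np.asarray(S) + np.asarray(ES)
    ES = np.asarray(ES, dtype=float)
    dwell = np.asarray(dwell, dtype=float)
    result = {}
    for s in np.unique(s_red):
        mask = s_red == s
        result[int(s)] = k2 * np.average(ES[mask], weights=dwell[mask])
    return result
```

Weighting by the dwell times matters here because the conditional expectation is taken with respect to the occupancy measure of the trajectory, not the raw sequence of visited states.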

Our final example is the gene expression system shown in Fig. 6(a). It consists of a gene, an mRNA, a protein, and a Michaelis–Menten-type protein degradation mechanism. Up to two protein molecules can bind the gene, and in the doubly bound state, the gene pauses transcription. This model was considered in Ref. 11 as an example where a naive reduction of the CME using Hill-type effective propensities is unable to accurately capture the noise in the original model. The full system can exhibit oscillations in mRNA numbers that are caused by the two-step negative feedback and that are absent in the reduced version. In this section, we analyze which reductions can be performed on the full model while preserving its oscillatory behavior.

FIG. 6.

Comparison of different reductions for the oscillatory gene network. (a) Schematic of the reaction system. We assume mass action kinetics for all reactions in the full model. (b) Example trajectories for the full model and its reductions. Note the oscillatory behavior in the original model that is not preserved in reductions IV and V. We used Algorithm 1 to compute the effective parameters for each reduction (cf. Table I) from a single long trajectory probing steady-state behavior. (c) Power spectra of mRNA and protein concentrations (defined as copy numbers normalized by the system size Ω). Models I, II, and I + II closely reproduce the oscillatory behavior of the full model, while IV and V do not show sustained oscillations. (d) KL divergence rates between the full model and all reductions, estimated using Algorithm 2. (e) Steady-state means and standard deviations for mRNA and protein numbers, estimated numerically using the Gillespie algorithm. While all reductions closely approximate the mean, models IV and V do not match the variance of the full model. The parameters for the full model, taken from Ref. 11, are Ω = 1000, ET = 10, ρm = 50, ρp = 0.0045, dm = 0.01, k1 = 0.1, k−1 = 10, k2 = 10, σb,P = 0.001, σu,P = 100, σb,PP = 1000, and σu,PP = 1.


We consider five different reductions of this model, listed in Table I. Each reduction removes or combines several species, such as gene states, thereby reducing the dimensionality of the system. The rightmost column describes the correct projection for each model, derived using an argument analogous to that in the Michaelis–Menten example. We fit the unknown (effective) parameters of each model numerically by minimizing the KL divergence from the full model using Algorithm 1. Typical mRNA trajectories for all models are shown in Fig. 6(b). The oscillations in the full model are clearly visible, and they are inherited by reductions I, II, and I + II. In contrast, simplifying the first binding step as in IV and V results in a non-oscillatory system [see Fig. 6(c)].

TABLE I.

Five possible reductions for the oscillator model. Model I + II is the combination of the reductions in models I and II. All propensities are shown without their mass action terms. Species that are lumped or removed in the reduced model are shown in gray, and species that do not exist in the full model are in blue. Parameters that are introduced in the reduced model are shown in red and found by the minimization of the KL divergence between the models; all remaining parameters are taken over from the full model. The functional forms of the propensity functions for models II, IV, and V were derived using the QEA.

Model | Full reactions | Reduced reactions | Projection map
I | … | … | …
II | … | … | …
I + II | (I and II above) | (I and II above) | …
IV | … | … | …
V | … | … | …

In Fig. 6(d), we compare the KL divergence rates between the full model and reductions I–V (note that the different reductions are defined on different state spaces). While models I, II, and I + II have comparatively low KL divergences from the original, there is a sharp increase for models IV and V. From this alone, one would expect the latter to perform significantly worse, which is indeed the case, as shown in Fig. 6(b). Figure 6(e) compares the means and standard deviations of mRNA and protein abundances; while all models approximate the means very closely, the standard deviations predicted by models IV and V are far from their true values, consistent with the lack of oscillatory behavior. This resembles the results in Sec. III A, which also showed excellent agreement between the full and reduced models at the level of the mean, independent of the overall approximation quality.

This comparison of various reductions shows how variational model reduction can be employed in more complex settings, where multiple reductions are possible. Given a list of candidate simplifications, we can automatically find optimal parameters for each reduction and compute its KL divergence from the true model. Reductions that do not significantly affect the output will generally lead to much lower KL divergences than those that do, as in this example, where a large gap in KL divergence rates separates reductions I, II, and I + II from IV and V. The latter retain little of the moments or oscillatory dynamics of the full model compared to the other reductions.

In this work, we presented an information-theoretic approach to model reduction for the CME based on minimizing the KL divergence between the full model and its proposed reduction. Based on this variational principle, we derived the optimal reaction propensities for the reduced model in Eq. (14) and showed that the same principle underlies some of the most common approaches to model reduction for the CME: the QSSA,8–10 the QEA,16,17 and the LMA.25 As a consequence, these methods can be seen as special cases of our approach. We furthermore obtained a general justification for the mean-field arguments proposed in the literature and connected them with information theory and likelihood-based approaches. We provide a numerical algorithm for the automated fitting of a reduced model by minimizing the KL divergence via stochastic gradient descent, as well as a numerical algorithm for estimating this KL divergence in order to assess model fit. While the KL divergence between Markov chains has been considered before in, e.g., Refs. 19–21, 31, 32, 34, 35, and 55, to our knowledge it has not been studied in connection with standard model reduction methods as done in this paper. Our numerical results show how the KL divergence can be computed in practice and used in the context of model reduction.

Using three biologically relevant examples, we illustrated how these model reduction techniques can be analyzed from this variational perspective. For the autoregulatory feedback loop, we showed that the KL divergence provides a useful metric of approximation quality and can detect parameter regimes where the mean-field approximation behind the LMA fails. Using the Michaelis–Menten system, we demonstrated how our approach can be used to decide between possible reductions of a model, particularly in the non-asymptotic regime where neither the QSSA nor the QEA is strictly valid. Finally, we used the genetic oscillator in Ref. 11 as an example to show how different reductions of a given model can be fit automatically, separately and in combination, and to assess which steps in a putative model reduction procedure would impact approximation quality more than others.

In this paper, we focused only on discrete stochastic models, mainly those described by the Chemical Master Equation. This leaves out continuum approximations, such as the chemical Langevin equation and the System Size Expansion,56 which are frequently used in practice, and it is worth asking whether information-theoretic approaches can be extended to this context. A priori, this appears difficult since the KL divergence and related quantities tend to diverge in the continuum limit; the KL divergence between two Brownian motions with different diffusion coefficients, for instance, is infinite. The same applies to the wide class of hybrid approaches.30,42,57,58 In particular, abundance separation, as opposed to the time-scale separation underlying the QSSA and the QEA, does not fall under the umbrella of the approach we considered. In such scenarios, other perspectives on model reduction are likely needed.

A major drawback of the general approach we presented is that the KL divergences involved cannot usually be optimized analytically. As we have shown, the minimum typically corresponds to computing conditional expectations as in (14), which is rarely possible exactly: in the case of the LMA, moment equations are used to approximate conditional means of protein numbers, and in the QSSA, deterministic rate equations are commonly used to derive the approximate propensities. The Monte Carlo approach we proposed in Algorithm 1 has the advantage of being completely generic, but unlike most existing techniques, it requires numerical optimization and will inevitably be slower. Since the optimization problem in question is typically low-dimensional and convex, most of the computational time is spent generating sample trajectories and computing gradients, at a cost that depends heavily on the system.
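For concreteness, the following is a minimal sketch of the stochastic-gradient idea behind Algorithm 1, assuming that projected trajectories of the full model can be sampled and that the reduced model exposes a trajectory log-likelihood; the central finite-difference gradient below is a generic placeholder rather than the estimator used in our implementation:

```python
import numpy as np

def fit_reduced_model(sample_full, log_prob_reduced, theta0,
                      n_iters=200, batch=16, lr=0.05, eps=1e-4):
    # Minimize D_KL(q || p_theta) over theta by stochastic gradient descent.
    # Since the entropy of q does not depend on theta, this is equivalent to
    # maximizing the expected log-likelihood of projected q-trajectories
    # under the reduced model p_theta.
    #   sample_full      : () -> projected trajectory sampled from the full model
    #   log_prob_reduced : (trajectory, theta) -> log-likelihood under p_theta
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iters):
        trajs = [sample_full() for _ in range(batch)]
        grad = np.zeros_like(theta)
        for k in range(len(theta)):
            d = np.zeros_like(theta)
            d[k] = eps
            up = np.mean([log_prob_reduced(tr, theta + d) for tr in trajs])
            dn = np.mean([log_prob_reduced(tr, theta - d) for tr in trajs])
            grad[k] = (up - dn) / (2 * eps)
        theta += lr * grad  # ascend the log-likelihood = descend the KL divergence
    return theta
```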

One important aspect of model reduction we have not addressed is the choice of an appropriate architecture for the reduced model. As we saw in the case of the genetic oscillator, while we can always optimize the parameters of a given architecture, the quality of the approximation can depend greatly on which reductions are performed. Although methods for automatically choosing from a predefined set of approximations exist in special cases,15 a fully general algorithm that computes a maximally reduced version of a given model within a given error threshold is still missing from the literature. In combination with other machine learning approaches to model reduction, such as Ref. 59, we hope that our approach may be one of several steps toward such a procedure.

The KL divergence is a well-studied quantity in statistics, but it lacks some desirable properties in the context of model reduction. While we were able to establish that minimizing the KL divergence leads, in theory, to a reduced system with the same marginal distributions, establishing a relationship between the KL divergence and dynamical information, such as autocorrelations and power spectra, is more difficult. Our investigations suggest that the KL divergence on the space of trajectories relates to both the marginal distributions and the dynamical properties of a system and can serve as a useful metric of model fit. We are optimistic that further investigations in this area will demarcate more clearly which properties of a model are well captured by the KL divergence and which are not.

This work was supported by the EPSRC Centre for Doctoral Training in Data Science (EPSRC, Grant No. EP/L016427/1) and the University of Edinburgh for K.Ö. and a Leverhulme Trust grant (Grant No. RPG-2020-327) for R.G. The authors are grateful to the anonymous reviewers for their useful feedback and suggestions.

The authors have no conflicts to disclose.

Kaan Öcal: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Guido Sanguinetti: Conceptualization (equal); Supervision (equal); Writing – review & editing (equal). Ramon Grima: Conceptualization (equal); Supervision (equal); Writing – review & editing (equal).

Code implementing the methods introduced in this paper can be found at https://github.com/kaandocal/modred.

Let p be a continuous-time Markov chain defined on the space X with initial distribution p0. If p has time-independent transition rates, the Stochastic Simulation Algorithm,60 reproduced in Algorithm 3, returns exact samples from p. We can inspect the SSA to obtain the probability of drawing any given trajectory as follows:

  • The first state n0 is drawn from the initial distribution p0 and has probability p0(n0).

  • The time τ1 until the first jump is drawn from an exponential distribution with rate pn0 and has probability density function pn0 exp(−τ1pn0).

  • The next state n1 is drawn from the transition distribution pn1|n0 and has probability pn1n0/pn0.

  • The time τ2 until the next jump is drawn from an exponential distribution with rate pn1 and has probability density function pn1 exp(−τ2pn1), etc.

ALGORITHM 3.

The Stochastic simulation algorithm to simulate samples from a Markov chain p.

Input: simulation length T, Markov chain p, initial distribution p0
Output: n[0,T] - sampled trajectory
sample n0 ∼ p0
t ← 0, i ← 0
while t < T do
  sample τi+1 ∼ Exp(1/pni) (exponential with mean 1/pni, i.e., rate pni)
  sample ni+1 = m with probability pmni/pni
  t ← t + τi+1
  i ← i + 1
return states (n0, n1, …), jump times (t1, t2, …)

We thus arrive at the probability
$$\mathbb{P}\big(n_{[0,T]}\big) = p_0(n_0)\; p_{n_0} e^{-\tau_1 p_{n_0}}\, \frac{p_{n_1 n_0}}{p_{n_0}}\; p_{n_1} e^{-\tau_2 p_{n_1}}\, \frac{p_{n_2 n_1}}{p_{n_1}} \cdots p_{n_{k-1}} e^{-\tau_k p_{n_{k-1}}}\, \frac{p_{n_k n_{k-1}}}{p_{n_{k-1}}}\; e^{-(T - t_k)\, p_{n_k}}, \tag{A1}$$
where the last term corresponds to the probability that tk+1 > T, i.e., that the next jump occurs after time T (note the absence of the prefactor pnk). Each term pni in the above expression cancels with the denominator in the next term, and upon taking logarithms, we arrive at Eq. (4).
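For reference, a minimal Python rendition of Algorithm 3 for a finite state space, assuming the chain is specified by an explicit rate matrix (an assumption of this sketch; the algorithm itself only requires the rates pmn):

```python
import numpy as np

def ssa(P, p0, T, seed=0):
    # Direct rendition of Algorithm 3 for a finite state space.  Convention:
    # P[m, n] = p_mn for m != n and P[n, n] = -p_n, so each column sums to 0.
    # Assumes no absorbing state is reached before time T.
    rng = np.random.default_rng(seed)
    n = rng.choice(len(p0), p=p0)          # sample n_0 ~ p_0
    states, jump_times = [n], []
    t = 0.0
    while True:
        p_n = -P[n, n]                     # total transition rate out of n
        t += rng.exponential(1.0 / p_n)    # tau ~ Exp with rate p_n
        if t >= T:
            return states, jump_times
        jump_probs = np.clip(P[:, n], 0.0, None) / p_n   # p_{m|n} = p_mn / p_n
        n = rng.choice(len(jump_probs), p=jump_probs)    # sample next state
        states.append(n)
        jump_times.append(t)
```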
In Ref. 31, the authors derived an expression for the Kullback–Leibler divergence between two continuous-time Markov chains q and p defined on a common state space X. For any integer N > 1, let q(N) and p(N) be the time discretizations of q and p with time step δt = T/N. These define probability distributions on X^{N+1} with the KL divergence
$$D_{\mathrm{KL}}\big(q^{(N)} \,\big\|\, p^{(N)}\big) = \sum_{n_{0:N}} q^{(N)}(n_{0:N}) \log \frac{q^{(N)}(n_{0:N})}{p^{(N)}(n_{0:N})}. \tag{B1}$$
The transition rates are related to the discrete-time transition probabilities as follows:
$$q^{(N)}\big(n_{t(i+1)} = m \,\big|\, n_{t(i)} = n\big) = q_{mn}\, \delta t + o(\delta t), \qquad m \neq n, \tag{B2}$$
$$q^{(N)}\big(n_{t(i+1)} = n \,\big|\, n_{t(i)} = n\big) = 1 - q_n\, \delta t + o(\delta t), \tag{B3}$$
where t(i) = iT/N is the ith discretization time point. Using these identities for q and p, it can be verified that the KL divergence (B1) converges to the following as N → ∞:
$$D_{\mathrm{KL}}(q \,\|\, p) = D_{\mathrm{KL}}(q_0 \,\|\, p_0) + \int_0^T \mathrm{d}t \sum_n q_t(n) \left[\sum_{m \neq n} q_{mn} \log \frac{q_{mn}}{p_{mn}} + p_n - q_n\right]. \tag{B4}$$
In general, the KL divergence can be written as the difference of the cross-entropy H(q‖p) and the entropy H(q‖q), and we define the cross-entropy of two continuous-time Markov chains as
$$H(q \,\|\, p) = -\sum_n q_0(n) \log p_0(n) + \int_0^T \mathrm{d}t \sum_n q_t(n) \left[p_n - \sum_{m \neq n} q_{mn} \log p_{mn}\right]. \tag{B5}$$
The above derivation is valid only if qmn ≠ 0 implies pmn ≠ 0; otherwise, the cross-entropy and KL divergence are both infinite.
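To make this expression concrete, the integrand of Eq. (B4) can be evaluated directly when both chains are given by explicit rate matrices; a minimal sketch:

```python
import numpy as np

def kl_rate_integrand(Q, P, qt):
    # Integrand of Eq. (B4) at a single time point:
    #   sum_n q_t(n) [ sum_{m != n} q_mn log(q_mn / p_mn) + p_n - q_n ],
    # with the convention Q[m, n] = q_mn and Q[n, n] = -q_n (likewise for P).
    total = 0.0
    for n in range(len(qt)):
        q_n, p_n = -Q[n, n], -P[n, n]
        inner = p_n - q_n
        for m in range(len(qt)):
            if m != n and Q[m, n] > 0:
                inner += Q[m, n] * np.log(Q[m, n] / P[m, n])
        total += qt[n] * inner
    return total
```

At steady state, evaluating this integrand at the stationary distribution of q gives the corresponding KL divergence rate; integrating it over [0, T] against the time-dependent marginal q_t recovers (B4) up to the initial-distribution term.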
This definition of the cross-entropy generalizes to the case where q is not Markovian, e.g., when it is the projection of a Markov process. In the discrete-time case, we have
(B6)
where the marginal and conditional distributions of q̃ depend on the initial distribution of q since q̃ is no longer assumed to be memoryless. Expressing the transition probabilities for p in terms of transition rates using Eqs. (B2) and (B3) and the identities
(B7)
(B8)
which follow from the definition of the marginal transition rates in Eq. (12), we obtain
(B9)
The term N log(δt) is independent of p and we will renormalize it to 0; this is standard when deriving continuum-limit versions of many information-theoretic quantities and was implicitly used in our definition of the cross-entropy in Eq. (B5). Proceeding with the above equation, we get
which, after dropping the N log(δt) term and rearranging, yields Eq. (9) in the limit.
Let q be a continuous-time Markov chain defined on the discrete state space X with initial distribution q0, and let q̃ be its projection onto the reduced state space. Computing the log probability of a trajectory ñ[0,T] under q̃ requires integrating over all possible full trajectories n[0,T] = (ñ[0,T], z[0,T]) that are compatible with ñ[0,T], that is,
$$\mathbb{P}_{\tilde q}\big(\tilde n_{[0,T]}\big) = \int \mathbb{P}_q\big(\tilde n_{[0,T]}, z_{[0,T]}\big)\, \mathcal{D}z_{[0,T]}. \tag{C1}$$
In order to compute the integral on the right-hand side, we consider the time-discretizations of q and q̃. The marginal likelihood in this case can be obtained using the well-known forward algorithm for HMMs, which sequentially computes the joint probabilities q(N)(ñ0:i,zi),
$$q^{(N)}(\tilde n_{0:0}, z_0) = q_0(\tilde n_0, z_0), \tag{C2}$$
$$q^{(N)}(\tilde n_{0:i+1}, z_{i+1}) = \sum_{z_i} q^{(N)}\big(\tilde n_{i+1}, z_{i+1} \,\big|\, \tilde n_i, z_i\big)\, q^{(N)}(\tilde n_{0:i}, z_i). \tag{C3}$$
Using (B2) and (B3), we can write the second equation as
(C4)
(C5)
Here, the two cases correspond to no visible jump and a visible jump occurring at the ith step, respectively. Equation (C4) can be reorganized as
(C6)
Here, the first and second sums correspond to hidden reactions that do not change ñ, and the last double sum represents visible reactions that affect ñ. Expressing the transition rates for visible reactions in terms of the reaction propensities σj and taking the continuum limit δt → 0 now yields Eqs. (22)–(24).
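In discrete time, the recursion in Eqs. (C2) and (C3) is the standard forward algorithm with noise-free emissions; a minimal sketch for a finite state space (the names T_full, proj, and obs are illustrative, and the per-step rescaling is a standard numerical-stability device):

```python
import numpy as np

def forward_loglik(T_full, proj, obs, p0):
    # Scaled forward algorithm for a noise-free projection of a discrete-time
    # Markov chain: propagates alpha_i(z) = q(ñ_{0:i}, z_i) as in (C2) and (C3)
    # and returns log q(ñ_{0:N}).  Convention: T_full[x', x] = q(x' | x) on the
    # full state space, proj[x] is the visible part of state x, obs = (ñ_0, ..., ñ_N).
    alpha = np.where(proj == obs[0], p0, 0.0)     # Eq. (C2), restricted to ñ_0
    loglik = 0.0
    for nn in obs[1:]:
        c = alpha.sum()                           # running likelihood factor
        loglik += np.log(c)
        alpha = T_full @ (alpha / c)              # propagate the joint, Eq. (C3)
        alpha = np.where(proj == nn, alpha, 0.0)  # keep states consistent with ñ
    return loglik + np.log(alpha.sum())
```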
To arrive at Eqs. (26)–(28), we sum Eqs. (22)–(24) over all z. Marginalizing Eq. (22) immediately yields Eq. (26), and for the other two, we obtain
(C7)
(C8)
The first double sum in Eq. (C7), corresponding to all hidden reactions, vanishes. We write the joint distributions in terms of conditional distributions as
(C9)
where we explicitly include the dependence on the initial distribution. We arrive at
(C10)
(C11)
We recognize the sum over zt as a conditional expectation,
(C12)
(C13)
The two equations rearrange to yield (27) and (28).
In this section, we compute the marginal distributions of a projected reaction network q̃,
(D1)
(D2)
(D3)
This corresponds exactly to the CME for the reduced system with propensities given by Eq. (14). As a result, the marginal distributions of the optimal reduction p and q̃ agree at all times.
1. M. Kærn, T. C. Elston, W. J. Blake, and J. J. Collins, "Stochasticity in gene expression: From theories to phenotypes," Nat. Rev. Genet. 6(6), 451–464 (2005).
2. R. Satija and A. K. Shalek, "Heterogeneity in immune responses: From populations to single cells," Trends Immunol. 35(5), 219–229 (2014).
3. T. Lipniacki, B. Hat, J. R. Faeder, and W. S. Hlavacek, "Stochastic effects and bistability in T cell receptor signaling," J. Theor. Biol. 254(1), 110–122 (2008).
4. D. Schnoerr, G. Sanguinetti, and R. Grima, "Approximation and inference methods for stochastic biochemical kinetics—A tutorial review," J. Phys. A: Math. Theor. 50(9), 093001 (2017).
5. B. Munsky and M. Khammash, "The finite state projection algorithm for the solution of the Chemical Master Equation," J. Chem. Phys. 124(4), 044104 (2006).
6. N. Ali Eshtewy and L. Scholz, "Model reduction for kinetic models of biological systems," Symmetry 12(5), 863 (2020).
7. T. J. Snowden, P. H. van der Graaf, and M. J. Tindall, "Methods of model reduction for large-scale biological systems: A survey of current methods and trends," Bull. Math. Biol. 79(7), 1449–1486 (2017).
8. C. V. Rao and A. P. Arkin, "Stochastic chemical kinetics and the quasi-steady-state assumption: Application to the Gillespie algorithm," J. Chem. Phys. 118(11), 4999–5010 (2003).
9. J. K. Kim, K. Josić, and M. R. Bennett, "The relationship between stochastic and deterministic quasi-steady state approximations," BMC Syst. Biol. 9(1), 87 (2015).
10. J. K. Kim, K. Josić, and M. R. Bennett, "The validity of quasi-steady-state approximations in discrete stochastic simulations," Biophys. J. 107(3), 783–793 (2014).
11. P. Thomas, A. V. Straube, and R. Grima, "The slow-scale linear noise approximation: An accurate, reduced stochastic description of biochemical networks under timescale separation conditions," BMC Syst. Biol. 6(1), 39 (2012).
12. P. Thomas, A. V. Straube, and R. Grima, "Limitations of the stochastic quasi-steady-state approximation in open biochemical reaction networks," J. Chem. Phys. 135(18), 181103 (2011).
13. H.-W. Kang, W. R. KhudaBukhsh, H. Koeppl, and G. A. Rempała, "Quasi-steady-state approximations derived from the stochastic model of enzyme kinetics," Bull. Math. Biol. 81(5), 1303–1336 (2019).
14. J. Eilertsen, K. Srivastava, and S. Schnell, "Stochastic enzyme kinetics and the quasi-steady-state reductions: Application of the slow-scale linear noise approximation à la Fenichel," J. Math. Biol. 85(1), 3 (2022).
15. Y. M. Song, H. Hong, and J. K. Kim, "Universally valid reduction of multiscale stochastic biochemical systems using simple non-elementary propensities," PLoS Comput. Biol. 17(10), e1008952 (2021).
16. E. L. Haseltine and J. B. Rawlings, "Approximate simulation of coupled fast and slow reactions for stochastic chemical kinetics," J. Chem. Phys. 117(15), 6959–6969 (2002).
17. J. Goutsias, "Quasiequilibrium approximation of fast reaction kinetics in stochastic biochemical systems," J. Chem. Phys. 122(18), 184102 (2005).
18. L. Cardelli, I. Pérez Verona, M. Tribastone et al., "Exact maximal reduction of stochastic reaction networks by species lumping," Bioinformatics 37(15), 2175 (2021).
19. R. A. Amjad, C. Blöchl, and B. C. Geiger, "A generalized framework for Kullback–Leibler Markov aggregation," IEEE Trans. Autom. Control 65(7), 3068–3075 (2020).
20. K. Deng, P. G. Mehta, and S. P. Meyn, "Optimal Kullback-Leibler aggregation via spectral theory of Markov chains," IEEE Trans. Autom. Control 56(12), 2793–2808 (2011).
21. B. C. Geiger, T. Petrov, G. Kubin, and H. Koeppl, "Optimal Kullback–Leibler aggregation via information bottleneck," IEEE Trans. Autom. Control 60(4), 1010–1022 (2015).
22. S. Bo and A. Celani, "Multiple-scale stochastic processes: Decimation, averaging and beyond," Phys. Rep. 670, 1–59 (2017).
23. J. Holehouse, A. Sukys, and R. Grima, "Stochastic time-dependent enzyme kinetics: Closed-form solution and transient bimodality," J. Chem. Phys. 153(16), 164113 (2020).
24. C. Jia and R. Grima, "Dynamical phase diagram of an auto-regulating gene in fast switching conditions," J. Chem. Phys. 152(17), 174110 (2020).
25. Z. Cao and R. Grima, "Linear Mapping Approximation of gene regulatory networks with stochastic dynamics," Nat. Commun. 9(1), 3305 (2018).
26. T. Jahnke and W. Huisinga, "Solving the Chemical Master Equation for monomolecular reaction systems analytically," J. Math. Biol. 54(1), 1–26 (2007).
27. C. Gadgil, C. Lee, and H. Othmer, "A stochastic analysis of first-order reaction networks," Bull. Math. Biol. 67(5), 901–946 (2005).
28. T. Zhou and J. Zhang, "Analytical results for a multistate gene model," SIAM J. Appl. Math. 72(3), 789–818 (2012).
29. Y. Li, D. Q. Jiang, and C. Jia, "Steady-state joint distribution for first-order stochastic reaction kinetics," Phys. Rev. E 104(2), 024408 (2021).
30. T. Jahnke, "On reduced models for the Chemical Master Equation," Multiscale Model. Simul. 9(4), 1646–1676 (2011).
31. M. Opper and G. Sanguinetti, "Variational inference for Markov jump processes," in Advances in Neural Information Processing Systems (MIT Press, 2007), Vol. 20.
32. C. Wildner and H. Koeppl, "Moment-based variational inference for Markov jump processes," in Proceedings of the 36th International Conference on Machine Learning (Curran Associates, Inc., 2019), pp. 6766–6775.
33. C. Zechner and H. Koeppl, "Uncoupled analysis of stochastic reaction networks in fluctuating environments," PLoS Comput. Biol. 10(12), e1003942 (2014).
34. L. Bronstein and H. Koeppl, "Marginal process framework: A model reduction tool for Markov jump processes," Phys. Rev. E 97(6), 062147 (2018).
35. A.-L. Moor and C. Zechner, "Dynamic information transfer in stochastic biochemical networks," Phys. Rev. Research 5, 013032 (2023).
36. W. R. KhudaBukhsh, A. Auddy, Y. Disser, and H. Koeppl, "Approximate lumpability for Markovian agent-based models using local symmetries," J. Appl. Probab. 56(3), 647–671 (2019).
37. K. P. Murphy, Probabilistic Machine Learning: Advanced Topics (The MIT Press, 2023).
38. J. Peccoud and B. Ycart, "Markovian modeling of gene-product synthesis," Theor. Popul. Biol. 48(2), 222–234 (1995).
39. J. Kim and J. C. Marioni, "Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data," Genome Biol. 14(1), R7 (2013).
40. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE 77(2), 257–286 (1989).
41. F. Confortola and M. Fuhrman, "Filtering of continuous-time Markov chains with noise-free observation and applications," Stochastics 85, 216–251 (2013).
42. J. Hasenauer, V. Wolf, A. Kazeroonian, and F. J. Theis, "Method of conditional moments (MCM) for the Chemical Master Equation: A unified framework for the method of moments and hybrid stochastic-deterministic models," J. Math. Biol. 69(3), 687–735 (2014).
43. L. Duso and C. Zechner, "Selected-node stochastic simulation algorithm," J. Chem. Phys. 148(16), 164108 (2018).
44. M. Rathinam and M. Yu, "State and parameter estimation from exact partial state observation in stochastic reaction networks," J. Chem. Phys. 154(3), 034103 (2021).
45. Y. Cao, D. T. Gillespie, and L. R. Petzold, "The slow-scale stochastic simulation algorithm," J. Chem. Phys. 122(1), 014116 (2005).
46. E. L. Haseltine and J. B. Rawlings, "On the origins of approximations for stochastic chemical kinetics," J. Chem. Phys. 123(16), 164115 (2005).
47. D. W. Kim, H. Hong, and J. K. Kim, "Systematic inference identifies a major source of heterogeneity in cell signaling dynamics: The rate-limiting step number," Sci. Adv. 8(11), eabl4598 (2022).
48. C. Jia and R. Grima, "Small protein number effects in stochastic models of autoregulated bursty gene expression," J. Chem. Phys. 152(8), 084115 (2020).
49. K. Öcal, R. Grima, and G. Sanguinetti, "Parameter estimation for biochemical reaction networks using Wasserstein distances," J. Phys. A: Math. Theor. 53(3), 034002 (2019).
50. L. Segel, "On the validity of the steady state assumption of enzyme kinetics," Bull. Math. Biol. 50(6), 579–593 (1988).
51. K. R. Sanft, D. T. Gillespie, and L. R. Petzold, "Legitimacy of the stochastic Michaelis–Menten approximation," IET Syst. Biol. 5(1), 58–69 (2011).
52. R. Grima and A. Leier, "Exact product formation rates for stochastic enzyme kinetics," J. Phys. Chem. B 121(1), 13–23 (2017).
53. D. T. Gillespie, Y. Cao, K. R. Sanft, and L. R. Petzold, "The subtle business of model reduction for stochastic chemical kinetics," J. Chem. Phys. 130(6), 064103 (2009).
54. S. MacNamara, A. M. Bersani, K. Burrage, and R. B. Sidje, "Stochastic chemical kinetics and the total quasi-steady-state assumption: Application to the stochastic simulation algorithm and Chemical Master Equation," J. Chem. Phys. 129(9), 095105 (2008).
55. Z. Rached, F. Alajaji, and L. L. Campbell, "The Kullback-Leibler divergence rate between Markov sources," IEEE Trans. Autom. Control 50(5), 917–921 (2004).
56. N. van Kampen, Stochastic Processes in Physics and Chemistry, 3rd ed. (Elsevier, 2007).
57. A. Hellander and P. Lötstedt, "Hybrid method for the Chemical Master Equation," J. Comput. Phys. 227(1), 100–122 (2007).
58. S. Smith, C. Cianci, and R. Grima, "Model reduction for stochastic chemical systems with abundant species," J. Chem. Phys. 143(21), 214105 (2015).
59. G. Caravagna, L. Bortolussi, and G. Sanguinetti, "Matching models across abstraction levels with Gaussian processes," in Computational Methods in Systems Biology, edited by E. Bartocci, P. Lio, and N. Paoletti (Springer International Publishing, Cham, 2016), pp. 49–66.
60. D. T. Gillespie, "A general method for numerically simulating the stochastic time evolution of coupled chemical reactions," J. Comput. Phys. 22(4), 403–434 (1976).
Published open access through an agreement with University of Edinburgh