Emerging techniques now allow for precise quantification of distributions of biological molecules in single cells. These rapidly advancing experimental methods have created a need for more rigorous and efficient modeling tools. Here, we derive new bounds on the likelihood that observations of single-cell, single-molecule responses come from a discrete stochastic model, posed in the form of the chemical master equation. These strict upper and lower bounds are based on a finite state projection approach, and they converge monotonically to the exact likelihood value. These bounds allow one to discriminate rigorously between models and with a minimum level of computational effort. In practice, these bounds can be incorporated into stochastic model identification and parameter inference routines, which improve the accuracy and efficiency of endeavors to analyze and predict single-cell behavior. We demonstrate the applicability of our approach using simulated data for three example models as well as for experimental measurements of a time-varying stochastic transcriptional response in yeast.
I. INTRODUCTION
Many physical, chemical, and biological processes are characterized by discrete particles that randomly fluctuate in space, time, or number. These microscopic fluctuations often provide the key to understand and modify mechanisms that control macroscopic phenomena. As one example, stochastic fluctuations in discrete numbers of specific genes, RNA, or proteins across genetically identical populations of cells play an increasingly important role in the understanding of gene regulation.1 Emerging experimental techniques, such as flow cytometry, single-cell RNA sequencing, and single-molecule fluorescence in-situ hybridization (smFISH),2–4 allow for the precise quantification of these fluctuations at the single-cell level. Several approaches have been developed to fit models to the statistical moments,5–7 stochastic trajectories,8 or full probability distributions4 of data collected with these experimental techniques. However, substantial work remains to establish more rigorous and efficient approaches to integrate stochastic analyses with single-cell experimental data.
A little over ten years ago, the finite state projection (FSP9) approach was introduced to approximate the solution of the Chemical Master Equation (CME10,11) and to capture the dynamics of discrete molecular events that control single-cell gene regulation. Since that time, the FSP has received substantial attention, has seen numerous computational improvements, and has become a benchmark tool in the analysis of stochastic gene regulation. Most recently, the FSP has been used to fit and predict experimental measurements of RNA transcription in yeast, bacteria, and human cells.12 The main utility of the FSP is to provide precise bounds on the accuracy of its approximation as well as a systematic approach to improve that accuracy. However, improved accuracy comes with increased computational cost, and no attention has been given to how one could optimize this tradeoff. Careful evaluation of this tradeoff is needed to improve the rigor and efficiency with which FSP models can be matched to experimentally measured data. In this work, we develop new FSP-based bounds on the likelihood of single-cell data given a stochastic model. We show how these bounds can be used to reduce computational costs without compromising accuracy. Finally, we use a combination of simulated and experimentally collected single-cell data to demonstrate how the co-design of FSP tools and experimental data can lead to efficient inference of discrete stochastic models.
II. MATHEMATICAL BACKGROUND
Like many single-molecule kinetic events, gene expression is often modeled as a Markov process, where each discrete state corresponds to the integer numbers of N chemical species (e.g., RNA or protein). Transition events between states are different reactions such as transcription, translation, or degradation, and these reactions can be indexed by μ ∈ {1, 2, …, M}. Each reaction occurs with a propensity wμ(xi) dt, which is the probability that the μth reaction will occur in the next infinitesimal time step (t, t + dt) given the current state xi. State transitions are described as xi → xj = xi + sμ, where sμ is the stoichiometry vector that describes the change in population after the μth reaction. In such models, each node has a continuous valued probability P(xi, t) that evolves in time according to the linear ODE known as the chemical master equation (CME),10,11
By enumerating all possible {x1, x2, …} ∈ X and corresponding probabilities, , the CME can be posed in matrix form as , where A is known as the infinitesimal generator (examples of A are provided in Secs. IV and V).
The CME dimension is often infinite, making it impossible to solve directly for most systems. The finite state projection (FSP) approach allows one to approximate the CME solution within strict error bounds.4,9,13 In its formulation, the FSP approach selects a finite set of indices, J = {j1, …, jL} with which it separates the full state space X into two exhaustive and disjoint sets, XJ = {xj1, …, xjL} and its complement XJ′. Under this reorganization, the full master equation can be written in matrix form as
To approximate the CME, the FSP approach then forms a finite state Markov process, where all nodes in XJ′ are aggregated to one or more sink states, denoted by g, which record the probability mass that leaves XJ. However, all probability mass within g is now restricted to remain in g as time proceeds. The new, reduced FSP-CME becomes
The resulting approximation in Eq. (3) provides three key insights into the exact CME solution. First, it provides a lower bound on the true solution,
This can be interpreted easily by noting that probability can only leave XJ in the FSP-CME (Eq. (3)), but can return from XJ′ to XJ in the original CME (Eq. (2)). Second, the FSP provides an exact measure of the approximation error,
where | ⋅ |1 denotes the absolute sum of a vector. Finally, as states are added to the set XJ, the error g(t) decreases monotonically. Proofs of these three results are provided in Refs. 13 and 14.
The state space of the FSP, XJ, can be defined through use of polynomial projection shapes,13
Here {fk(xi)} is a set of polynomials of the population counts, and the constraints {ck} are weights on these polynomials that may be increased (decreased) to include more (fewer) states. In practice, each kth constraint is associated with its own sink, gk, which aggregates all states that satisfy the {1, …, (k − 1)}th constraints, but not the kth constraint. The value of gk(tf) then quantifies probability of violation for the kth constraint, which in turn guides the systematic increase of ck. With this expansion, the FSP algorithm presented in Refs. 9 and 13 can be used to select an XJ to make small for a specified finite time. However, lower error will require more states and greater computational expense. Until now, there has been little attention given to how small g(t) should be, especially for comparison to a given set of experimental data. In this article, we will show how the combination of experimental data and the FSP results in Eqs. (4) and (5) combines to reveal a natural requirement for the accuracy of g(t).
III. FSP-DERIVED BOUNDS ON THE LIKELIHOOD
To quantify how experimental data affect the accuracy requirement for the FSP, we consider single-molecule, single-cell data, such as that obtained using the technique of smFISH. This technique allows experimentalists to count the number of specific RNA molecules in individual cells. Let D = {d1, d2, …} be the set of data corresponding to di cells with measured populations corresponding to each state xi. Define as the subset of all i such that di≠0 (i.e., the support of the observed data). Assuming independence of individual cells, the logarithm of the likelihood of observing data D is
In many cases only partial state observations are available (e.g., only one mRNA species is measured), and with no loss of generality, P may be projected onto the observable dimension prior to the summation in Eq. (7). With these definitions in place, we now explore how the FSP bounds affect the computation of the likelihoods of experimental data.
The solution for the FSP is and is guaranteed to be a lower bound on the model’s true solution P = [P(x1), P(x2), …] by Eq. (4). The log-likelihood in Eq. (7) is monotonic in each P(xi); therefore provides a lower bound on the log-likelihood of D given the model
However, in Eq. (3) the FSP also provides the exact error in the solution of a particular model described by the CME. By redistributing the known FSP error back onto the CME solution in an optimal manner, an upper bound on logL(D) can be derived,
where εi is the probability error redistributed to state xi. To optimize the redistribution of g and determine UBJ(D), we developed a modified water-filling algorithm similar to those used to determine the amount of power to send to different channels in communication systems.15,16 To simplify notation, we define the FSP probability for each state as and the corresponding portion of the total objective due to state xi as fi ≡ dilog(pi + εi). Clearly, adding probability mass to the xi with the greatest ∂fi/∂εi will have the largest increase in the . When a set of two or more {∂fi/∂εi} are maximal and equal, probability mass must be redistributed to equally increase all members of that set. At the optimal redistribution, all xi to which probability mass has been added will have identical values for ∂fi/∂εi. This approach, where extreme values are first equalized before adjusting more moderate values, is analogous to raising the water level over a rough surface, filling holes to equal levels before covering hills.
The derivative of fi with respect to εi is computed from Eq. (9) to get
and we define ND as the number of distinct observations (i.e., the size of ). These values are then ranked in decreasing magnitude according to
where the notation refers to the data-ordered state space . An optimal redistribution of g equalizes the first n terms of Eq. (11) and satisfies the linear constraints
In this formulation, the number of states to which probability is redistributed, n, is the largest dimension for which the solution of Eq. (13) is strictly positive for all εi. If the states XJ used by the approximation do not span the support of the distribution of data, the corresponding values of pi are zero and are infinite. Therefore, the corresponding redistribution of {εi} will always include some mass for those states. Algorithm 1 provides pseudocode for the proposed error redistribution approach. At most, the FSP error redistribution algorithm requires ND iterations, and in practice computation of the upper bound on the likelihood takes only a fraction of the time needed for the FSP solution itself, especially for cases where the data correspond to partial state observations.
FSP error redistribution algorithm.
Rank in decreasing order. |
n = 1, ε1 = g |
while and n < ND do |
n → n + 1 |
Solve for ε1, ε2, …, εn using Eq. (12) |
end while |
ε = [ε1, …, εn] is the optimal redistribution of g. |
Rank in decreasing order. |
n = 1, ε1 = g |
while and n < ND do |
n → n + 1 |
Solve for ε1, ε2, …, εn using Eq. (12) |
end while |
ε = [ε1, …, εn] is the optimal redistribution of g. |
The FSP-derived bounds on likelihoods have several important implications for the comparison of stochastic models to single-cell data. Let Λ denote a particular combination of a model and its parameters, and let L(D|Λ) denote the likelihood of D given Λ. In the case when XJ = ∅ (i.e., the FSP set is empty), all of the probability mass must be redistributed, and the FSP-derived upper bound is given by
This result is understood easily—the maximum of the log-likelihood function occurs when the distribution of the data exactly matches the model, and in this case the FSP upper bound describes the best any potential model can ever do. To interpret bounds for non-trivial FSP projections, we make use of the facts that (i) the FSP approximation lower bound pi increases monotonically as XJ is expanded9 and (ii) the likelihood increases monotonically as each pi increases. As a result, LBJ(D) and UBJ(D) are guaranteed to be monotonically increasing and decreasing functions of the projection size. Figure 1 illustrates the converging upper- and lower-bounds for the likelihoods of two FSP models as the size of the index set J, or equivalently the size of XJ, is increased.
A. Using FSP-derived bounds for model discrimination
For any two models and their parameter sets, Λi and Λj, we define the set of sufficient discriminating projections, Φ(Λi, Λj), as any pair of projection index sets, Ji and Jj, that guarantees the correct ranking of likelihoods for the two models,
Intuitively, these are any two projections such that the worst possible likelihood for one model is greater than best possible likelihood of the other. In Fig. 1, the red and green circles denote pairs of projections sufficient to guarantee that parameter set Λ2 is more likely than Λ1. Because Φ(Λ1, Λ2) can contain an infinite set of such pairs with varying projection sizes, we define a minimal symmetric discriminatory projection, ϕs(Λi, Λj), as
In Fig. 1, the blue circle denotes ϕs(Λ1, Λ2). Finally, in many sequential parameter searches, previous FSP models may already be computed to high accuracy, and it may not be necessary to demand the same accuracy for subsequent models. For this case, we define a minimum non-symmetric projection size such that
The utility of this particular discriminatory projection size definition becomes important in parameter search problems, where it enables sequential likelihood evaluations to be conducted at minimal projection sizes. In Fig. 1, the green circles denote ϕ2(Λ1, Λ2) a particular combination where the size of XJ2 is minimized given a previous computation for XJ1. As the examples below will demonstrate, utilization of these minimal discriminating projections can substantially reduce the computational effort in parameter inference by eliminating much of the potential parameter space with smaller projection sizes.
Schematic of discriminatory projection sizes. Monotonically converging upper and lower FSP bounds on the likelihood are shown for two parameter sets, Λ1 and Λ2. The red and green circles illustrate two pairs of projections in Φ(Λ1, Λ2) that enable exact ranking of the two parameter sets. The blue circle illustrates the minimal symmetric discriminatory projection, ϕs(Λ1, Λ2). The filled green circle illustrates the minimal nonsymmetric discriminatory projection, ϕ2(Λ1, Λ2), needed to discriminate the system given the previous analysis of Λ1 (open green circle).
Schematic of discriminatory projection sizes. Monotonically converging upper and lower FSP bounds on the likelihood are shown for two parameter sets, Λ1 and Λ2. The red and green circles illustrate two pairs of projections in Φ(Λ1, Λ2) that enable exact ranking of the two parameter sets. The blue circle illustrates the minimal symmetric discriminatory projection, ϕs(Λ1, Λ2). The filled green circle illustrates the minimal nonsymmetric discriminatory projection, ϕ2(Λ1, Λ2), needed to discriminate the system given the previous analysis of Λ1 (open green circle).
FSP bound for the birth/death model. (a) Schematic of the model. (b) Probability distributions for the model at t = 1. Simulated data (500 runs of the SSA19) are in black. The lower bound Eq. (4) is shown in red, and the upper bound Eq. (9) is shown in blue. The shaded region denotes the redistribution of FSP error to maximize the likelihood of data. See Table I for parameters.
FSP bound for the birth/death model. (a) Schematic of the model. (b) Probability distributions for the model at t = 1. Simulated data (500 runs of the SSA19) are in black. The lower bound Eq. (4) is shown in red, and the upper bound Eq. (9) is shown in blue. The shaded region denotes the redistribution of FSP error to maximize the likelihood of data. See Table I for parameters.
Demonstration of the converging bounds for the birth and death model. Upper and lower bounds on the likelihood of a simulated data set given two different parameter sets, Λ1 and Λ2, as a function of the number of states. As the number of states increases, the upper and lower bounds monotonically converge to the true likelihood of each parameter set. The dotted horizontal lines are the likelihood values found using the analytical solutions in Eq. (20). ϕs indicates the minimum symmetric projection size that guarantees correct discrimination between the two parameter sets. See Table I for parameters.
Demonstration of the converging bounds for the birth and death model. Upper and lower bounds on the likelihood of a simulated data set given two different parameter sets, Λ1 and Λ2, as a function of the number of states. As the number of states increases, the upper and lower bounds monotonically converge to the true likelihood of each parameter set. The dotted horizontal lines are the likelihood values found using the analytical solutions in Eq. (20). ϕs indicates the minimum symmetric projection size that guarantees correct discrimination between the two parameter sets. See Table I for parameters.
B. Relationship of FSP bounds to other CME truncations
The FSP upper and lower bounds present an opportunity to better understand relationships between the FSP and other CME approximations in the context of single-cell data. For example, several groups have imposed limits of species17 or total molecule populations,18 which result in truncated master equations with reflecting boundary conditions (in contrast to the FSP’s absorbing boundary condition). In a similar vein, one could renormalize the FSP solution to make use of the original FSP computation. For this renormalization, the CME can be written in terms similar to the FSP formulation,
where is the rate of flow of probability out of XJ, which is now redistributed back into XJ according to the probability . However, the solution of this non-Markovian system is identical to simply renormalizing at all times,
Because the FSP likelihood bounds provide the best- and worst-case redistribution of the probability absorbed in g, likelihoods computed by reflection, renormalization, or any other arbitrary strategy are guaranteed to lay between the computed FSP bounds. Therefore, it is possible that reflecting boundaries may provide an improved approximation of the true likelihood for a particular combination of data and model. Unfortunately, the likelihoods of reflected or renormalized solutions are not necessarily monotonic, and the likelihoods of renormalized solutions for two different parameter sets or models may change rank depending on the projection size. These issues will be addressed further in Section IV C.
IV. APPLICATION OF FSP-DERIVED BOUNDS
To demonstrate the application of the FSP bounds, this section uses simulated data and three examples of stochastic gene regulation: an unregulated birth-death model, a genetic toggle switch, and a non-linear self-activating gene.
A. Birth-death model
The variability in mRNA copy numbers for housekeeping genes is well captured by the standard single-species birth and death model, where xi = xi is the discrete number of RNA. This model consists of two reactions that describe transcription and degradation as shown in Fig. 2(a),
where the propensity functions, w = {w1, w2}, are
Table I shows three different parameter sets for this example. A well-known analytical solution for this model, assuming x(0) = 0, is the time-varying Poisson distribution
The FSP formulation for this model can be written from Eq. (21) as
where Nm is the size of the FSP truncation (i.e., XJ = {x0, x1, …, xNm}). In this case, we used a single constraint function in Eq. (12)f1(xi) = xi ≤ c1, where c1 = Nm. To expand the state space for this model, c1 is simply increased by one in each iteration.
For this model and the parameters provided in the bottom row of Table I, we simulated 500 trajectories of the stochastic simulation algorithm,19 and plotted the resulting “data” in Fig. 2(b) (black). For a projection size defined by Nm = 50, Fig. 2(b) also shows the FSP lower bound computed using Eq. (4) in red and the FSP upper bound from Eq. (9) in blue. Figure 3 demonstrates the convergence of the upper and lower FSP bounds as c1 is increased. Results are plotted for two different parameter sets. Increasing c1 adds more states to X and monotonically decreases the error g. In turn, less error is available to be redistributed back onto the FSP solution, and UBJ(Λ) and LBJ(Λ) converge monotonically to the analytical value of logL(D|Λ).
B. Toggle model
We next explore the application of the FSP bounds on a model of the toggle switch20 in which two genes λcI and lacI mutually inhibit one another, as illustrated in Fig. 4(a). Here we consider a simple model similar to that presented by Tian and Burrage.21 In this model, the system state is defined by the discrete populations of the two proteins, x = [λcILacI]. The four reactions are
where the propensity functions w = {w1, w2, w3, w4}, are given by
Table II provides the toggle model parameters that have been used to generate simulated data shown in black in Fig. 4(b).
The infinitesimal generator for this toggle model can be written as
To apply the FSP to truncate the toggle model CME, we consider three constraint functions from Eq. (6), where c1, c2, and c3 define the projection as
These constraints are illustrated in Fig. 5. By monotonically increasing ck, more states are included, and the error g(t) decreases. For simplicity in presentation, c2 and c3 were initially set at high values of 95 and 55, respectively, and the expansion only modifies the criteria c1. Similar results can be obtained with more general expansion routines, provided that ck are constant or monotonically increasing for each k. Figure 4(b) shows the marginal probability distribution for LacI with c1 = 150, where the FSP lower and upper bounds are in red and blue, respectively. Although Algorithm 1 distributes the error onto the joint probability distribution of both species, results are plotted only for the marginal distribution. Figure 6 shows the converging likelihood bounds for two parameter sets as c1 is increased to expand the number of states captured by the projection. Because these bounds are monotonic, the analysis reveals a minimal symmetric projection size, ϕs, at which one can guarantee that the model in green has a greater likelihood than the model in blue.
C. Comparing FSP bounds to other CME truncation approaches
To illustrate potential issues that arise through application of other CME truncation approaches, we consider a simple gene regulation model with nonlinear self-activation. In this model, the propensity of production is given by the positive feedback function
and the propensity of degradation is a first order process given by w2 = γx. In this formulation, k1 is the rate of production for small values of x, k1 + k2 is the rate of production for large values of x, m is the value of x at which the rate of production is halfway between k1 and k1 + k2, and the cooperativity factor n determines the steepness of the activation function. In the deterministic regime, this model of self-regulated gene expression can lead to one or two stable equilibria as illustrated in Fig. 7. In the stochastic regime, this corresponds to bimodal distributions in some parameter regimes, and unimodal distributions in other regimes. Table III provides three possible parameter sets for this model. We simulated data from the first of these models, ΛTrue, which admits a single stable point (see the dashed line in Fig. 7) and yields a unimodal distribution of data as shown in black in Fig. 8.
FSP bounds for the toggle model. (a) Schematic of the toggle model with two mutually repressing proteins, LacI and λcI. (b) Marginal probability distributions of LacI at t = 8 h. Simulated data are in black. The lower bound Eq. (4) is shown in red and the upper bound Eq. (9) is shown in blue. The shaded region denotes the redistribution of FSP error to maximize the likelihood of data. See Table II for parameters.
FSP bounds for the toggle model. (a) Schematic of the toggle model with two mutually repressing proteins, LacI and λcI. (b) Marginal probability distributions of LacI at t = 8 h. Simulated data are in black. The lower bound Eq. (4) is shown in red and the upper bound Eq. (9) is shown in blue. The shaded region denotes the redistribution of FSP error to maximize the likelihood of data. See Table II for parameters.
Λ . | kr(Nt−1) . | γ(t−1) . |
---|---|---|
Λ1 | 50.0 | 0.5 |
Λ2 | 49.65 | 0.5 |
45.0 | 0.5 |
Λ . | kr(Nt−1) . | γ(t−1) . |
---|---|---|
Λ1 | 50.0 | 0.5 |
Λ2 | 49.65 | 0.5 |
45.0 | 0.5 |
Λ . | bλcI . | kλcI . | αLacI . | ηLacI . | γλcI . | bLacI . | kLacI . | αλcI . | ηλcI . | γLacI . |
---|---|---|---|---|---|---|---|---|---|---|
. | (s−1) . | (s−1) . | (N−ηLacI) . | − . | (N−1 s−1) . | (s−1) . | (s−1) . | (N−ηλcI) . | − . | (N−1 s−1) . |
Λ1 | 6.8 × 10−5 | 1.6 × 10−2 | 6.1 × 10−3 | 2.1 | 6.7 × 10−4 | 2.2 × 10−3 | 1.7 × 10−2 | 2.6 × 10−3 | 3.0 | 3.8 × 10−4 |
Λ2 | 6.8 × 10−5 | 1.6 × 10−2 | 6.1 × 10−3 | 2.1 | 8.0 × 10−4 | 2.2 × 10−3 | 1.7 × 10−2 | 2.6 × 10−3 | 3.0 | 3.8 × 10−4 |
6.8 × 10−5 | 1.4 × 10−2 | 6.1 × 10−3 | 2.1 | 6.7 × 10−4 | 2.2 × 10−3 | 1.6 × 10−2 | 2.6 × 10−3 | 3.0 | 3.8 × 10−4 |
Λ . | bλcI . | kλcI . | αLacI . | ηLacI . | γλcI . | bLacI . | kLacI . | αλcI . | ηλcI . | γLacI . |
---|---|---|---|---|---|---|---|---|---|---|
. | (s−1) . | (s−1) . | (N−ηLacI) . | − . | (N−1 s−1) . | (s−1) . | (s−1) . | (N−ηλcI) . | − . | (N−1 s−1) . |
Λ1 | 6.8 × 10−5 | 1.6 × 10−2 | 6.1 × 10−3 | 2.1 | 6.7 × 10−4 | 2.2 × 10−3 | 1.7 × 10−2 | 2.6 × 10−3 | 3.0 | 3.8 × 10−4 |
Λ2 | 6.8 × 10−5 | 1.6 × 10−2 | 6.1 × 10−3 | 2.1 | 8.0 × 10−4 | 2.2 × 10−3 | 1.7 × 10−2 | 2.6 × 10−3 | 3.0 | 3.8 × 10−4 |
6.8 × 10−5 | 1.4 × 10−2 | 6.1 × 10−3 | 2.1 | 6.7 × 10−4 | 2.2 × 10−3 | 1.6 × 10−2 | 2.6 × 10−3 | 3.0 | 3.8 × 10−4 |
FSP state space expansion. Maximum species counts, as in Eq. (24) is c2 = NLacI = 95 and c3NλcI = 55. States included within a truncation of c1 = 500 are in gray. As c1 is increased, the boundary (dashed lines) increases to include more states, the FSP error decreases, and the FSP bounds converge to one another (see also Fig. 6).
FSP state space expansion. Maximum species counts, as in Eq. (24) is c2 = NLacI = 95 and c3NλcI = 55. States included within a truncation of c1 = 500 are in gray. As c1 is increased, the boundary (dashed lines) increases to include more states, the FSP error decreases, and the FSP bounds converge to one another (see also Fig. 6).
Demonstration of the converging bounds for a two dimensional system. Upper and lower bounds on the likelihood of a simulated data set given two different parameter sets, Λ1 and Λ2, as a function of the number of states included in the toggle model. The vertical dashed line denotes the size of the minimal symmetric discriminating projection, ϕs(Λ1, Λ2). Parameters are given in Table II.
Demonstration of the converging bounds for a two dimensional system. Upper and lower bounds on the likelihood of a simulated data set given two different parameter sets, Λ1 and Λ2, as a function of the number of states included in the toggle model. The vertical dashed line denotes the size of the minimal symmetric discriminating projection, ϕs(Λ1, Λ2). Parameters are given in Table II.
Positive self-regulated gene expression. First order degradation (black line) and positive feedback in production (red, green, dashed) can give rise to bistable (red) or monostable dynamics (green, dashed). Blue dots represent stable equilibria and the cross represents an unstable equilibrium. See Table III for parameters and Fig. 8(c) for examples of the corresponding distributions.
Positive self-regulated gene expression. First order degradation (black line) and positive feedback in production (red, green, dashed) can give rise to bistable (red) or monostable dynamics (green, dashed). Blue dots represent stable equilibria and the cross represents an unstable equilibrium. See Table III for parameters and Fig. 8(c) for examples of the corresponding distributions.
Effective parameters for self-activating gene expression model. Parameters are defined in the main text. Parameter sets ΛA and ΛB are compared to the reference set ΛTrue in Fig. 8.
Λ . | k1 . | k2 . | m . | n . | γ . |
---|---|---|---|---|---|
. | (1/s) . | (1/s) . | (N) . | − . | (1/s) . |
ΛTrue | 20 | 40 | 70 | 4 | 1 |
ΛA | 22.5 | 40 | 70 | 4 | 1 |
ΛB | 20 | 125 | 70 | 4 | 1 |
Λ . | k1 . | k2 . | m . | n . | γ . |
---|---|---|---|---|---|
. | (1/s) . | (1/s) . | (N) . | − . | (1/s) . |
ΛTrue | 20 | 40 | 70 | 4 | 1 |
ΛA | 22.5 | 40 | 70 | 4 | 1 |
ΛB | 20 | 125 | 70 | 4 | 1 |
Comparing FSP bounds with other CME truncation approaches. (a) Distributions of response of the self-regulated gene. Simulated data are in black. The renormalized FSP solutions for ΛB are shown in red and in green for ΛA. Both methods use a projection size of 45. (b) Likelihood versus projection size for different CME truncations. The renormalized scheme is shown with dashed lines, and the reflecting scheme is shown with dotted lines. All schemes lie within the FSP bounds (solid lines) and eventually approach the correct likelihood values. Results for parameter set ΛA are in green and for ΛB are in red. At moderate projection sizes, the renormalized and reflecting boundary schemes appear to converge to a higher likelihood for ΛB than for ΛA. At higher projection sizes, the trend is switched. (c) The same as (a), but now the projection size has been increased to 285.
Comparing FSP bounds with other CME truncation approaches. (a) Distributions of response of the self-regulated gene. Simulated data are in black. The renormalized FSP solutions for ΛB are shown in red and in green for ΛA. Both methods use a projection size of 45. (b) Likelihood versus projection size for different CME truncations. The renormalized scheme is shown with dashed lines, and the reflecting scheme is shown with dotted lines. All schemes lie within the FSP bounds (solid lines) and eventually approach the correct likelihood values. Results for parameter set ΛA are in green and for ΛB are in red. At moderate projection sizes, the renormalized and reflecting boundary schemes appear to converge to a higher likelihood for ΛB than for ΛA. At higher projection sizes, the trend is switched. (c) The same as (a), but now the projection size has been increased to 285.
Next, we consider two perturbations of this true parameter set,
where ΛA and ΛB correspond to the red and green lines, respectively, in Figs. 7 and 8, and their parameters are given in Table III. For ΛB the system has a bimodal response and for ΛA, the response is unimodal. It is interesting to explore how the application of the renormalization scheme in Eq. (19) would affect comparison of these two models to simulated data. For a projection size of 45, Fig. 8(a) suggests that ΛB provides a better match to the data than ΛA. Moreover, the likelihood of the data given ΛA and ΛB appears to be nearly constant over a substantial portion of the projection space as shown by the dashed lines and the shaded region of Fig. 8(b). Based on this information, it would be easy to conclude that ΛB is the more appropriate parameter set. However, only at large projections that include the second peak in the distribution for ΛB, does it becomes apparent that the ΛA is the better choice. This scenario illustrates how the renormalization scheme can complicate parameter discrimination for certain combinations of models and data. Similar cautions also apply for reflecting boundary approximations of the CME (see dotted lines in Fig. 8). The strict upper and lower bounds provided by the FSP provide a rigorous approach to eliminate this ambiguity as a function of projection size.
It should also be noted that likelihood computations using reflection or renormalization based truncations require the support of the CME to include that of the experimental data. Otherwise, these approaches will match the FSP lower bound, which suggests that the data are infinitely unlikely. Such a lower bound may appear uselessness at first, but as we will see in Sec. V, the FSP upper bound may still be sufficient for rigorous and efficient model selection.
Likelihoods and sufficient projection sizes. (a) Log-likelihood versus RNA production rate kr. The horizontal line denotes the likelihood at the fixed comparison parameter . (b) The symmetric projection size ϕs required to compare parameter set Λ to . For Λ such that , larger projections are needed. The green (red) region represents parameter sets that are better (worse) than . (c,d) Same as (a,b), except for the toggle model and two variable parameters, kLacI and kλcI. Parameters inside (outside) the dashed contour represent parameter sets are better (worse) than the comparison set denoted with the black dots. In all plots, dashed lines denote parameter combinations with likelihoods that are equivalent to the reference set.
Likelihoods and sufficient projection sizes. (a) Log-likelihood versus RNA production rate kr. The horizontal line denotes the likelihood at the fixed comparison parameter . (b) The symmetric projection size ϕs required to compare parameter set Λ to . For Λ such that , larger projections are needed. The green (red) region represents parameter sets that are better (worse) than . (c,d) Same as (a,b), except for the toggle model and two variable parameters, kLacI and kλcI. Parameters inside (outside) the dashed contour represent parameter sets are better (worse) than the comparison set denoted with the black dots. In all plots, dashed lines denote parameter combinations with likelihoods that are equivalent to the reference set.
V. FSP LIKELIHOOD BOUNDS IN PARAMETER SEARCHES
The FSP’s constricting upper and lower bounds on the likelihood enable rigorous discrimination between two parameter sets, Λ1 and Λ2, without using unnecessarily strict error tolerances. The following examples will demonstrate the utility of the sufficient discriminatory projections, Φ(Λi, Λj), ϕs(Λi, Λj), and ϕi(Λi, Λj).
A. Parameter search for the birth-death and toggle models
We return to the simulated data presented in Figs. 2 and 4, but this time we apply the FSP for many different parameter combinations and for many different projections. Figure 9 illustrates the practical strength of the minimal symmetric discriminating projection, ϕs(Λi, Λj). For the birth/death model in Figs. 9(a) and 9(b), parameter Λ = kr is allowed to vary. Figure 9(a) shows the likelihood of the data as a function of parameter Λ, and Fig. 9(b) shows the size of the sufficient symmetric projection, , needed to discriminate between Λ and a fixed parameter . Similarly, for the toggle model, both the maximum rates of production for LacI and λcI were varied, . In this case, Fig. 9(c) shows contour plots of the likelihoods, and Fig. 9(d) shows the corresponding contours of the size of . For models whose likelihood is better or worse than , Figs. 9(b) and 9(d) show that the comparison can be made with smaller projection sizes. Considering that the solution of Eq. (3) has a complexity that increases with order or worse depending on the solution scheme,22 such reductions can lead to substantial computational savings. In past studies, the FSP has been solved to uniformly strict error tolerances such as 10−6 in Neuert et al.,4 yet consideration of the FSP bounds allows for error tolerances that are relaxed by several orders of magnitude.
B. Using FSP bounds to compare CME models using experimental data
To demonstrate the application of the FSP bounds on real data, we examine a recent model and single-cell experimental data for Mitogen Activated Protein Kinase (MAPK) control of STL1 gene regulation in S. cerevisiae (budding yeast). Yeast activate a variety of regulatory pathways to mitigate the osmotic pressure difference that arises from solute imbalance across the cell membrane. One such mechanism is the high-osmolarity glycerol (HOG) pathway.23 Upon osmotic shock, the Hog1 kinase phosphorylates and localizes to the nucleus of the cell, where it initiates transcription of several genes, including STL1. After cells adapt to the new condition, the kinase leaves the nucleus, and the transcription pathways turn back off. Interestingly, while the nuclear localization and transcription initialization are a largely deterministic temporal signal, transcript abundance varies considerably between isogenic cells. This variability has been quantified in detail, using the smFISH technique to quantify transcript abundances in single cells at sixteen different time points from zero to 60 min after osmotic shock.4
The current time-varying CME model of the STL1 regulation process allows the gene to switch between four possible states with different transcription rates as shown in Fig. 10(a). Reactions that change the gene from state i to j occur with propensities {kij}, and the transcription rates are given by kri, for each of the i = {1, 2, 3, 4}th gene states. In this model, one particular transition rate k21 varies as a function of the Hog1p kinase in the nucleus as
where the temporal signal profile for Hog1p was measured experimentally4 and is reproduced in Fig. 10(b). As a result of this dependence on a time-varying parameter, the infinitesimal generator for the CME is a function of time. The FSP truncation of the CME can be written as
where Nm is the maximum number of RNA in the FSP approximation, and the matrices S, T, and Γ are given by
Figure 11 shows examples of distributions for six time points during the osmotic shock response. Experimental data4 (black) were collected using smFISH, and the FSP lower bound for a moderate projection size is shown in red. At each experimentally measured time point during the dynamic process, the FSP error g(t) is computed based on the FSP truncation, and the FSP upper bound on logL(D|Λ) (shown in blue) is computed using Algorithm 1. In this illustrative example, we note that the projection size is substantially smaller than the support of the experimental data, yet the reduced FSP adequately captures the distribution, especially for the earlier time points. Because a good model must capture early and late time points, this observation suggests that smaller projections may be quite informative for model discrimination.
Gene regulation in the HOG-STL1 system. (a) The four-state model of Hog1p-induced STL1 gene regulation, in which each gene state (S1…S4) has a distinct transcription rate. (b) The parameterized nuclear enrichment signal, Hog1p(t), that controls the rate of transition from S2 to S1. This signal was parameterized from experimental measurements at 0.4M NaCl by Neuert et al.4
Gene regulation in the HOG-STL1 system. (a) The four-state model of Hog1p-induced STL1 gene regulation, in which each gene state (S1…S4) has a distinct transcription rate. (b) The parameterized nuclear enrichment signal, Hog1p(t), that controls the rate of transition from S2 to S1. This signal was parameterized from experimental measurements at 0.4M NaCl by Neuert et al.4
FSP bounds on STL1 mRNA distributions. Experimentally measured distributions of STL1 transcripts are in black for each time point.4 The FSP lower bounds are shown in red and upper bounds are shown in blue. The Hog1-STL1 pathway is activated at t = 0 with a 0.4M treatment of NaCl (see also Fig. 10). Parameters are given as ΛN in Table IV.
FSP bounds on STL1 mRNA distributions. Experimentally measured distributions of STL1 transcripts are in black for each time point.4 The FSP lower bounds are shown in red and upper bounds are shown in blue. The Hog1-STL1 pathway is activated at t = 0 with a 0.4M treatment of NaCl (see also Fig. 10). Parameters are given as ΛN in Table IV.
To explore the impact that experimental data has on the required projection for the FSP, we use the non-symmetric minimal projection to determine the necessary projection size needed for parameter discrimination. In this case, each sequential comparison of two models begins with a previous FSP model that is already solved to a known precision. For example, in Fig. 12, we plot the FSP error bounds versus the projection size for two Hog-STL1 model parameter sets, ΛN and ΛN+1. If the likelihood is already known for the Nth parameter set (horizontal dashed line in Figs. 12(a) and 12(b)), then the FSP model for the (N + 1)th parameter set need only be solved with a projection size corresponding to ϕN+1. Figure 12 represents this minimal non-symmetric projection size with black circles. In many cases, such as that shown in Fig. 12(b), the new parameter is worse than the previous case, and the necessary discriminating projection size can be much smaller than the support of the experimental data. Such situations where the next parameter set is worse than the current set are the norm in a typical parameter search.
Figure 13(a) shows the likelihood of the experimental data versus two parameters in the Hog-STL1 model, and Fig. 13(b) shows the size of the necessary discriminating projection for the new ΛN+1 given that the old ΛN is at the black circle. For new parameters that give smaller likelihoods than ΛN (i.e., those outside of the halo), parameter discrimination can be achieved with projection sizes that are a fraction of the support of the data. In fact the median projection needed to compare the old and new models is 70, compared to a data support size of 107.
Using FSP-bounds to search Hog1-STL1 models. (a) FSP upper and lower bounds versus projection size for old parameters ΛN−1 and new parameters ΛN. In this case, ΛN is better, and sufficient discrimination is made at ϕs, which corresponds to the support of the experimental data. (b) Comparison of the bounds for old parameters ΛN and new parameters ΛN+1. In this case, reusing the FSP bounds for ΛN makes it possible to reject ΛN+1 at a projection size that is less than the support of the experimental data in Fig. 11. Parameters ΛN−1, ΛN, and ΛN+1 are given in Table IV.
Using FSP-bounds to search Hog1-STL1 models. (a) FSP upper and lower bounds versus projection size for old parameters ΛN−1 and new parameters ΛN. In this case, ΛN is better, and sufficient discrimination is made at ϕs, which corresponds to the support of the experimental data. (b) Comparison of the bounds for old parameters ΛN and new parameters ΛN+1. In this case, reusing the FSP bounds for ΛN makes it possible to reject ΛN+1 at a projection size that is less than the support of the experimental data in Fig. 11. Parameters ΛN−1, ΛN, and ΛN+1 are given in Table IV.
Hog model parameters. Parameters are defined in the main text and in Neuert et al.4 Parameter sets ΛN−1, ΛN, and ΛN+1 are compared in Fig. 12. Parameter set Λfixed is used as the reference parameter set in Fig. 13. In the units, Nr refers to the discrete number of RNA and [A.U.] refers to the Hog1p nuclear signal in arbitrary units.
Λ . | k12 . | α . | β . | k23 . | k32 . | k34 . | k43 . | kr1 . | kr2 . | kr3 . | kr4 . | γ . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
. | (1/s) . | (1/s) . | (1/(s∗[A.U.])) . | (1/s) . | (1/s) . | (1/s) . | (1/s) . | (Nr/s) . | (Nr/s) . | (Nr/s) . | (Nr/s) . | (1/s) . |
ΛN−1 | 2.096 | 5406.3 | 25 116.6 | 0.009 79 | 0.008 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1372 | 1.953 | 5.53 × 10−3 |
ΛN | 2.096 | 5406.3 | 18 116.6 | 0.007 79 | 0.006 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1072 | 1.953 | 5.53 × 10−3 |
ΛN+1 | 2.096 | 5406.3 | 30 116.6 | 0.013 79 | 0.010 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1372 | 1.953 | 5.53 × 10−3 |
Λfixed | 2.096 | 5406.3 | 18 116.6 | 0.02 | 0.006 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1072 | 1.953 | 5.53 × 10−3 |
Λ . | k12 . | α . | β . | k23 . | k32 . | k34 . | k43 . | kr1 . | kr2 . | kr3 . | kr4 . | γ . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
. | (1/s) . | (1/s) . | (1/(s∗[A.U.])) . | (1/s) . | (1/s) . | (1/s) . | (1/s) . | (Nr/s) . | (Nr/s) . | (Nr/s) . | (Nr/s) . | (1/s) . |
ΛN−1 | 2.096 | 5406.3 | 25 116.6 | 0.009 79 | 0.008 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1372 | 1.953 | 5.53 × 10−3 |
ΛN | 2.096 | 5406.3 | 18 116.6 | 0.007 79 | 0.006 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1072 | 1.953 | 5.53 × 10−3 |
ΛN+1 | 2.096 | 5406.3 | 30 116.6 | 0.013 79 | 0.010 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1372 | 1.953 | 5.53 × 10−3 |
Λfixed | 2.096 | 5406.3 | 18 116.6 | 0.02 | 0.006 68 | 0.0448 | 0.465 | 9.16 × 10−4 | 0.012 32 | 0.1072 | 1.953 | 5.53 × 10−3 |
Likelihoods and sufficient projection sizes for Hog1-STL1 data. (a) The likelihood of smFISH data for 2500 different parameter combinations of kr3 and k23. (b) The size of ϕN+1 versus ΛN+1 = [kr3, k23], where . Over most of the parameter space, sufficient discrimination does not require the full support of the experimental data (Nm = 107), with a median sufficient projection size of 70.
Likelihoods and sufficient projection sizes for Hog1-STL1 data. (a) The likelihood of smFISH data for 2500 different parameter combinations of kr3 and k23. (b) The size of ϕN+1 versus ΛN+1 = [kr3, k23], where . Over most of the parameter space, sufficient discrimination does not require the full support of the experimental data (Nm = 107), with a median sufficient projection size of 70.
VI. SUMMARY AND CONCLUSIONS
In recent years, substantial interest has arisen to integrate discrete stochastic models with single-cell experimental data. This has motivated many approaches to solve the chemical master equation, including stochastic simulations, moment closure analyses, and the finite state projection approach. Progress in this arena will continue in the future to open new biochemical processes for discrete, stochastic computational analyses. However, until now there has been little discussion of how accurate CME models need to be in order to adequately interpret experimental data. In this article, we have explored the benefit by which careful consideration of experimental data can help to reduce computational complexity and enable more efficient and rigorous comparison of multiple models in the context of experimental data. We have shown that this advance can substantially reduce the complexity of model identification for single-cell gene regulation models using real data, and we believe this approach opens new doors for gene regulation models in many pathways and organisms.
In light of the above results, it would be interesting to re-examine other approaches to fit stochastic models to single-cell data. For example, a common and highly flexible tool for this task is the stochastic simulation algorithm (SSA19). As one runs more and more SSA trajectories, the collected statistics converge to the solution of the CME, and the computed likelihood of the data given the model will also converge to the correct value. Unlike the FSP approach derived here, convergence of the SSA or other kinetic Monte Carlo schemes will not be monotonic, and long distribution tails can be very difficult to estimate. However, although the SSA does not provide a direct computation or bounds on its computational error, one can estimate the rate of convergence with increasing numbers of trajectories. With these one could imagine that the insight gained from the optimal redistribution of the FSP error could be adapted to explore similar ad hoc redistribution methods for the SSA. Such analyses provide intriguing paths for future theoretical, computational, and experimental investigations.
In our examples, we assumed perfect measurements of the number of RNA per cell. While the RNA-FISH approach provides a more precise quantification of molecule numbers than flow cytometry or single-cell sequencing approaches, all experiments are subject to some measurement error. Drawing insight from past efforts to predict flow cytometry data in the presence of background fluorescence,24 the analyses described above can be extended to account for experimental uncertainties. Specifically, we are currently exploring experimental controls to estimate how variations in experiment or image analysis affect the probability to under- or over-count RNA spots. Such analyses will provide a baseline to calibrate how the distribution of counted spots conditions upon the actual spot number. Assuming the measurement error is identical and independently distributed for each cell, models could be convolved with the calibrated conditional distribution to predict the probability distribution for observed spot counts, and the FSP error bounds derived above will hold with no loss in generality.
In conclusion, it is now well established that stochastic models can help to better understand single-cell gene regulatory responses. Here, we have complemented this fact by showing how single-cell data may inform the design of rigorous and yet more efficient computational analyses. Together, these insights offer further motivation for tighter integration and co-design of computational and experimental investigations of biological phenomena.
Acknowledgments
This study was partially supported by the NIH-NIGMS under Award No. R25GM105608 and by the W.M. Keck Foundation. Z.F. was funded under NSF-NRT Grant No. DGE-1450032. G.N. was supported by the NIH under Award No. DP2GM114849. Contributions were as follows: B.M. designed the study; Z.F. and B.M. developed the computational approaches; Z.F. performed the numerical analyses; G.N. performed the experimental studies and advised on interpretation of the experimental data; Z.F. and B.M. wrote the manuscript.