Finding reduced models of spatially distributed chemical reaction networks requires an estimation of which effective dynamics are relevant. We propose a machine learning approach to this coarse graining problem, where a maximum entropy approximation is constructed that evolves slowly in time. The dynamical model governing the approximation is expressed as a functional, allowing a general treatment of spatial interactions. In contrast to typical machine learning approaches which estimate the interaction parameters of a graphical model, we derive Boltzmann-machine like learning algorithms to estimate directly the functionals dictating the time evolution of these parameters. By incorporating analytic solutions from simple reaction motifs, an efficient simulation method is demonstrated for systems ranging from toy problems to basic biologically relevant networks. The broadly applicable nature of our approach to learning spatial dynamics suggests promising applications to multiscale methods for spatial networks, as well as to further problems in machine learning.

## I. INTRODUCTION

### A. Model reduction of statistical many-body systems

Master equations are broadly applicable to stochastic systems in biology. For reaction-diffusion systems, the solution to the chemical master equation (CME) fully characterizes the probability distribution over system states and all observables at all times.^{1} However, solving the CME for relevant moments is challenging when the interactions of two or more reagents lead to non-linear differential equation systems for the moments and even more challenging when considering spatially distributed systems. A wealth of analytical and numerical approaches has been developed in pursuit of approximate solutions, each of which is optimally suited for a distinct dynamical regime.^{2} For example, at the low and spatially heterogeneous concentrations of molecules present in dendritic spines in synapses, particle-based methods may best describe the highly stochastic signaling activity.^{3–5} In the larger volumes such as the dendritic shaft, simpler geometries and higher concentrations allow more efficient partial differential equation (PDE) methods. It remains an open problem to develop a modeling framework that is able to flexibly transition across different dynamical regimes or to describe their coexistence in the same spatial domain.

One key problem is that the number of states appearing in the CME increases exponentially with the number of underlying random variables describing the system. This system state space explosion poses a computational challenge for Monte Carlo algorithms such as the popular Gillespie stochastic simulation algorithm (SSA),^{6} requiring the sampling of a sufficiently large number of trajectories to estimate observables.

The direct estimation of observables also poses a challenge. Generally, many-body systems result in a hierarchy of moments [analogous to a Bogoliubov–Born–Green–Kirkwood–Yvon (BBGKY) hierarchy^{7–9}], where the differential equation for any given moment depends on the current value of higher order ones. This requires the use of a moment closure technique (see Ref. 10 for review), but a poor choice here can unduly restrict modeling of the rich correlation structures available.

Machine learning approaches present an opportune setting for addressing these problems because a central goal of these approaches is the estimation of the structures underlying complex correlations. For example, machine learning has recently been proposed to approximate quantum systems.^{11} Previous applications to chemical systems include the predictions of molecular reactions^{12} and synaptic activity.^{13} However, to our knowledge, no general formulation for learning chemical dynamics exists that incorporates the complex spatial interactions central to many problems in biology. In this work, we present such a framework and derive algorithms for simulating reaction-diffusion systems in continuous space. This has promising general implications for both moment closure and model reduction of the CME and of other generally spatially distributed systems.

### B. Inferring Markov random fields for reduced dynamics

Previous work has shown the applicability of machine learning to model reduction. In particular, the “Graph Constrained Correlation Dynamics” (GCCD) framework^{10} uses a Markov Random Field (MRF) of plausible state variables and interactions as input, incorporating human expertise into the model reduction process. The probability distribution associated with this MRF is written in a form that separates the time evolution *μ*(*t*) from the graph structure *V*_{α}(*s*),

This time-evolving mean field model can be learned separately at each time-point using the well-known Boltzmann Machine (BM) learning algorithm^{14} and then approximated with its own dynamics for *μ*(*t*). For a suitably chosen MRF, the resulting temporal dynamics of *μ*(*t*) result in a large degree of model reduction, as reported in Ref. 10.

In Sec. II, we formalize these ideas and extend to the spatial domain a general variational problem for estimating the dynamical model dictating the time evolution of a Boltzmann distribution. The importance of spatial networks has been widely studied,^{15,16} with continued interest in mean-field methods, such as for evolving networks.^{17} Our formulation using functionals presents a flexible framework for capturing the dynamics of any desired spatial correlations in the system. This leads to algorithms closely related to a BM, with a modified learning rule for directly estimating the functionals dictating the time evolution of the mean-field model. We anticipate that such an approach is broadly applicable to other spatially organized networks and will have further practical applications in machine learning.

## II. LEARNING ALGORITHMS FOR MODEL REDUCTION

This section is organized as follows. Section II A reviews recent developments that have enabled the derivation of stochastic simulation algorithms for chemical kinetics in the Doi-Peliti formalism and introduces further extensions for describing spatial dynamics. Section II B introduces time-evolving Boltzmann distributions as reduced models, and Sec. II C sets up the associated variational problem for their dynamics. For the case of well-mixed systems, this is solvable by an algorithmic approach, as shown in Sec. II D. Section II E treats some analytic solutions for simple systems used to guess parameterizations needed to algorithmically solve the general spatially heterogeneous case in Sec. II F.

### A. Field theoretic approaches to deriving stochastic simulation algorithms

An equivalent description parallel to the CME is the quantum-field-theoretic Doi-Peliti^{18–20} operator algebra formalism (see, e.g., Ref. 21 for review). Extensions to this formalism have developed it as a natural framework for deriving stochastic simulation algorithms of chemical kinetics. These extensions include:

The introduction of parameterized objects has generalized bare molecules to allow the description of macromolecular complexes such as phosphorylation states

^{22,23}or other structures with size, type, and other internal parameters that affect their dynamics.Dynamically graph-linked collections of objects

^{24}have allowed collections of inter-related objects and extended objects to associate and dissociate according to specified rules and propensities.Differential operators have been introduced that express differential equations and stochastic differential equations,

^{22,23}as in the Lie Series.^{25}

These innovations naturally lead to the rederivation of the popular Gillespie SSA from the CME and from there to extensions to the parametric, graph-matching, and mixed-dynamics cases.^{23} Here, we consider further extensions to this formalism to develop *model reduction* techniques for spatial reaction-diffusion systems.

The raising and lowering operators, *â* and *a*, create and destroy identical particles of a single species. For states consisting of a single species distributed on a discrete lattice |{*n*}⟩, where {*n*} describes the occupancy of each lattice site, the action of these operators on the *i*-th lattice site (in some ordering) is

Further, they satisfy the Heisenberg algebra commutation relationship [*a*_{i}, *â*_{j}] = *δ*_{ij}, where *δ*_{ij} is the Kronecker delta function. We note that these are different from the ladder operators in quantum mechanics in which *a*_{i} is not conjugate to *â*_{i}. However, in the present context, they are of key importance as they capture mass action chemical kinetics.

These operators admit an equivalent generating function representation

where the product runs over all spatial lattice sites. Then the operators may be represented as

In the spatially continuous case, let the state of the system be denoted by |*n*, ** α**,

**⟩, consisting of**

*x**n*particles at locations

**, consisting of**

*x**n*positions in 3D space, with species labels

**, also of length**

*α**n*. The equivalent generating function representation is

The raising and lowering operators are now

where, switching from the discrete to the continuous case, partial derivatives for the annihilation operator turn into functional derivatives. Importantly, the CME

### B. Reduced states in a dynamic Boltzmann distribution

Let |*n*, ** α**,

**,**

*x**t*⟩ denote the true state of the system at time

*t*. In the spirit of a MRF, construct states in a coarse-scale model,

where $ikn={i1<i2<\u2026<ik:i\u2208[1,n]}$ denotes ordered subsets of *k* indexes each in {1, 2, …, *n*}, and $\nu k(\alpha ikn,xikn,t)$ are *k*-particle interaction functions up to a cutoff order *K*. We note that ${\u2026\u2009}k=1K$ is used to denote an index-ordered set in this context. This expansion of *n*-body interactions is a specific case of more general dimension-wise decompositions, such as analysis of least variance (ANOVA).^{26} The probability of being in a state |*n*, ** α**,

**,**

*x**t*⟩ is given by a dynamic and instantaneous Boltzmann distribution

The true probability distribution *p*(*n*, ** α**,

**,**

*x**t*) evolves according to the CME (7). To describe the time evolution of the reduced model, introduce a set of functionals ${Fk}k=1K$, forming a differential equation system for the interaction functions ${\nu k}k=1K$,

where

denotes all possible *ν* functions evaluated at the given arguments. Here, the right hand side $Fk$ may be a global functional, in the sense that the arguments {*ν*(** α**,

**,**

*x**t*)} are not restricted to the arguments appearing on the left hand side of (10). We consider particular local parameterizations of this general form in Sec. II F.

In addition to the connection to MRFs, we note several advantages of the form of this reduced model (9) and (10):

Since the states $|{\nu k}k=1K,t$ define a grand canonical ensemble (GCE), (9) exactly describes equilibrium systems and is expected to reasonably approximate systems approaching equilibrium.

If the interactions between two groups of particles are independent, their joint probability distribution equals the product of their probabilities and their interaction functions

*ν*_{k}in (9) sum. The Boltzmann distribution thus preserves the locality of interactions.A further important result pertains to linearity, as stated in the following proposition.

*Proposition 1.*

*Given a reaction network and a fixed collection of K interaction functions* ${\nu k}k=1K$, *the linearity of the CME in reaction operators* $p\u0307$ = ∑_{r}**W**^{(r)}*p extends to the functionals* $Fk=\u2211rFk(r)$.

*Proof.*

*δ*(

**) denotes a multi-dimensional Dirac delta function. Note that there are $L=\u2211k=1Knk$ interaction terms and equally many moments they control. Switching to vector notation, let**

*x***of length**

*ν**L*denote the interaction functions and

**denote the corresponding moments.**

*μ**ϕ*

_{l}for

*l*= 1, …,

*L*. This solution depends only on the interaction functions and not on the reaction operators appearing in the CME. For a single reaction process, let the differential equations for the moments be $\mu \u0307(r)$, resulting from $p\u0307(r)$ =

**W**

^{(r)}

*p*

^{(r)}, where

*ẋ*denotes a time derivative. Taking the time derivatives of both sides of (13) gives

Due to Proposition 1, the functionals $F$ will be referred to as *basis functionals*. In Sec. III C, the utility of this property is explored further in a machine learning context.

### C. Formulation of general problem to determine functionals governing spatial dynamics

We next formulate a general problem to determine the functionals leading at all times to the MaxEnt Boltzmann distribution. Define the action as the Kullback-Leibler (KL) divergence between the true and reduced models (extending Ref. 10),

Next, we introduce a notation to define a higher-order variational problem. Since the interaction functions are defined specifying by the set of functionals $Fk[{\nu (\alpha ,x,t)}]$ for all *k* = 1, …, *K*, we use the notation $\nu k[{F}]$ to denote that *ν*_{k} is a higher-order generalization of a functional. The action is a functional of the set of all interaction functions, which we denote by $S[{\nu [{F}]}]$, where ${x}={xk}k=1K$.

The higher-order variational problem for the basis functionals is given by the chain rule

where we use the notation $\delta ^$ to denote that this is not an ordinary variational problem, in the sense that a variation with respect to a functional is implied. Equation (17) should therefore be regarded as a purely notation solution, generalizing the well-known chain rule for functionals where a variational derivative is taken of a functional: $\delta F[G[\varphi ]]\delta \varphi (y)=\u222bdx\u2009\delta F[G]\delta G(x)\delta G[\varphi ]\delta \varphi (y)$. The first term is a variational derivative analogous to that appearing in the derivation of the BM learning algorithm,^{14} giving

where the moments *μ* are defined in (12), with $\mu \u0303$ having *p* replaced by $p\u0303$. Next, we consider well-mixed systems where the de-escalation from functionals to ordinary functions *F*_{k} makes this problem (18) well-defined. In Sec. II F, we parameterize the functional form of $F$ to consider spatially distributed systems.

### D. Learning algorithm for reduced dynamics of well-mixed systems in one species

In the case of well-mixed systems in one species, the state of the system is entirely characterized by the number of individuals |*n*, *t*⟩. Dropping the species and position labels in the dynamic Boltzmann distribution gives

where we use the notation ${\nu}={\nu k\u2032}k\u2032=1K$. The time evolution is now described by basis functions forming the autonomous differential equation system

where *F*_{k} are now functions rather than functionals $Fk$. The variational problem (18) for the basis functions becomes

where $Xp(t\u2032)=\u2211n=0\u221eXp(n,t\u2032)$ and similarly for $p\u0303$.

The variational term on the RHS of (21) may be determined by a number of methods, including an ordinary differential equation (ODE) formulation derived in Appendix A 1, a PDE formulation derived from applying the chain rule at the initial condition, and using a Lie series approach (see the supplementary material). The first of these and arguably the most practical is

An algorithmic solution to (21) is therefore possible in the form of a PDE-constrained optimization problem: Solve (21) and (22) subject to the PDE-constraint (20). An example algorithm using a simple gradient descent is given by Algorithm 1.

1: Initialize |

2: Grid of values {ν} to solve over. |

3: F_{k}({ν}) for k = 1, …, K. |

4: Max. integration time T. |

5: A formula for the learning rate λ. |

6: while not converged do |

7: Initialize ΔF_{k}({ν}) = 0 for all k, {ν}. |

8: Generate a sample of random initial conditions {η}. |

9: for η_{i} ∈{η} do |

10: ⊳Generate trajectory in reduced space {ν}: |

11: Solve the PDE constraint (20) with IC η_{i} |

for 0 ≤ t ≤ T. |

12: Solve (22) for variational term $\delta \nu k\u2032(t)/\delta Fk({\nu})$. |

13: ⊳Sampling step: |

14: Evaluate moments $nk\u2032p\u0303(t\u2032)$ of the Boltzmann |

distribution by sampling or analytically. |

15: Evaluate true moments $nk\u2032p(t\u2032)$ by stochastic |

simulation or analytic solution. |

16: ⊳Evaluate the objective function: |

17: Update ΔF_{k}({ν}) as the cumulative moving |

average of (21) over initial conditions {η}. |

18: ⊳Update to decrease objective function: |

19: Update F_{k}({ν}) to decrease the objective function: |

F_{k}({ν}) → F_{k}({ν}) − λΔF_{k}({ν}). |

1: Initialize |

2: Grid of values {ν} to solve over. |

3: F_{k}({ν}) for k = 1, …, K. |

4: Max. integration time T. |

5: A formula for the learning rate λ. |

6: while not converged do |

7: Initialize ΔF_{k}({ν}) = 0 for all k, {ν}. |

8: Generate a sample of random initial conditions {η}. |

9: for η_{i} ∈{η} do |

10: ⊳Generate trajectory in reduced space {ν}: |

11: Solve the PDE constraint (20) with IC η_{i} |

for 0 ≤ t ≤ T. |

12: Solve (22) for variational term $\delta \nu k\u2032(t)/\delta Fk({\nu})$. |

13: ⊳Sampling step: |

14: Evaluate moments $nk\u2032p\u0303(t\u2032)$ of the Boltzmann |

distribution by sampling or analytically. |

15: Evaluate true moments $nk\u2032p(t\u2032)$ by stochastic |

simulation or analytic solution. |

16: ⊳Evaluate the objective function: |

17: Update ΔF_{k}({ν}) as the cumulative moving |

average of (21) over initial conditions {η}. |

18: ⊳Update to decrease objective function: |

19: Update F_{k}({ν}) to decrease the objective function: |

F_{k}({ν}) → F_{k}({ν}) − λΔF_{k}({ν}). |

We note the implicit connection between this approach and using Boltzmann machines, such as in GCCD, by the algorithm’s objective function. Here, the whole trajectory of moments from stochastic simulations is used to directly estimate time evolution operators, rather than estimating the interaction parameters at each time step. We make this connection explicit in Algorithm 2 in Sec. III C.

1: Initialize |

2: Initial θ^{(r)} for all r. |

3: Max. integration time T. |

4: A formula for the learning rate λ. |

5: Time-series of lattice spins {s}(t) from stochastic |

simulations from some known IC h_{0}, J_{0}. |

6: Fully visible MRF with NN connections and as many |

units as lattice sites N. |

7: while not converged do |

8: ⊳Generate trajectory in reduced space: |

9: Solve the PDE constraint (52) with IC h_{0}, J_{0} |

for 0 ≤ t ≤ T. |

10: ⊳Awake phase: |

11: Evaluate true moments μ(t), Δ(t) from the |

Stochastic simulation data {s}(t). |

12: ⊳Asleep phase: |

13: Evaluate moments $\mu \u0303(t),\Delta \u0303(t)$ of the Boltzmann |

distribution by Gibbs sampling. |

14: ⊳Update to decrease objective function: |

15: Solve (54) for derivative terms. |

16: Update θ^{(s)} to decrease the objective function |

for all s by taking: θ^{(s)} →θ^{(s)} − λ × (53). |

1: Initialize |

2: Initial θ^{(r)} for all r. |

3: Max. integration time T. |

4: A formula for the learning rate λ. |

5: Time-series of lattice spins {s}(t) from stochastic |

simulations from some known IC h_{0}, J_{0}. |

6: Fully visible MRF with NN connections and as many |

units as lattice sites N. |

7: while not converged do |

8: ⊳Generate trajectory in reduced space: |

9: Solve the PDE constraint (52) with IC h_{0}, J_{0} |

for 0 ≤ t ≤ T. |

10: ⊳Awake phase: |

11: Evaluate true moments μ(t), Δ(t) from the |

Stochastic simulation data {s}(t). |

12: ⊳Asleep phase: |

13: Evaluate moments $\mu \u0303(t),\Delta \u0303(t)$ of the Boltzmann |

distribution by Gibbs sampling. |

14: ⊳Update to decrease objective function: |

15: Solve (54) for derivative terms. |

16: Update θ^{(s)} to decrease the objective function |

for all s by taking: θ^{(s)} →θ^{(s)} − λ × (53). |

Further improvements to Algorithm 1 are possible, such as to replace ordinary gradient descent by an accelerated version, e.g., Nesterov accelerated gradient descent.^{27} Furthermore, the wealth of methods available to solve PDE-constrained optimization problems, e.g., adjoint methods,^{28} offers rich possibilities for further development.

#### 1. Example: Mean of the Galton-Watson branching process

As a simple illustrative example, consider a reduced model that captures the time-evolving mean of the Galton-Watson branching process consisting of the birth process *A* → *A* + *A* with rate *k*_{b} and death *A* → *∅* with rate *k*_{d}.

In this case, there are only self-interactions (*K* = 1) described by *ν*(*t*) with basis function *F*(*ν*(*t*)). The dynamic Boltzmann distribution is

Using the fact that

and from the CME,

gives the analytic solution for the basis functions

This solution is reproduced using Algorithm 1, as shown in Fig. 1 for *k*_{d} = 3*k*_{b}/2. Here, the solution is constructed on a grid of *ν* ∈ [0^{+}, 3.0] with spacing Δ*ν* = 0.1, with maximum integration time *T* = 1 (arbitrary units). The learning rate is decreased exponentially over iterations to improve convergence. The convergence of the algorithm is shown in Fig. 2.

Figure 3 shows the variational terms (22) evaluated at a single point *ν* = 2 for several initial conditions *η*. These resemble step functions, where the position in time of the step corresponds to the time at which the solution trajectory crosses *ν* = 2, with value *δν*(*t*)/ *δF*(*ν* = 2) ≈ 0 before this time, and non-zero value afterwards. These terms assume different characteristic shapes as the differential equation model (10) is varied, another example of which is shown in Sec. II F 1.

#### 2. Example: Two basis functions controlling mean and variance

Consider again the process of Sec. II D 1, but with *K* = 2 basis functions *ν*_{1}(*t*) and *ν*_{2}(*t*) controlling the mean and variance in the number of particles. The dynamic Boltzmann distribution is

This may be interpreted as a Gaussian distribution in the number of particles, provided we treat *n* as continuous and extend its range to ±*∞* or consider systems with means far from *n* = 0. In this case, the mean *μ* and variance *σ*^{2} can be related to the interaction functions as *μ* = 1/2 − *ν*_{1}/*ν*_{2} and *σ*^{2} = 1/*ν*_{2}. The differential equations derived from the CME for the moments of this system are

which can be converted to analytic solutions for the basis functions

These are shown in Fig. 4.

Figure 5 shows the variational terms *δν*_{1}(*t*)/*δF*_{1}(*ν*_{1}, *ν*_{2}) and *δν*_{2}(*t*)/*δF*_{1}(*ν*_{1}, *ν*_{2}), resulting from Algorithm 1 and determined by (22). Interestingly, the self-varying term *δν*_{1}(*t*)/*δF*_{1}(*ν*_{1}, *ν*_{2}) more closely resembles the multivariate delta-function appearing in (22), while the cross term *δν*_{2}(*t*)/*δF*_{1}(*ν*_{1}, *ν*_{2}) shows a greater temporal memory of the solution trajectory.

### E. Analytic MaxEnt solutions

We next consider special cases where analytic solutions for the basis functionals are possible, to motivate a parameterization leading to a solvable version of the variational problem (18).

#### 1. Gaussian distributions

The well-mixed case (19) is the MaxEnt distribution consistent with $nkp$ for *k* = 1, …, *K*. If *K* = 2, then (19) may be interpreted as a Gaussian distribution in continuous *n*, as discussed in Sec. II D. Generalizing these results, the basis functions are generally given by

where *dμ*/*dt*, *dσ*^{2}/*dt* are evaluated from the CME and expressed in terms of *ν*_{1}, *ν*_{2}. Here, a moment closure approximation must be applied if the reactions have a total stoichiometry greater than one on the starting (input) side. For example, the higher order moments appearing in the CME may be approximated by those of the reduced model $p\u0303$ and expressed in terms of lower order *μ*, *σ*^{2} following the well-known property of Gaussian distributions. This closure technique is described further in Sec. III A.

#### 2. Diffusion from point source

In the spatial case, consider a diffusion process of a fixed number of particles *n* with diffusion constant *D* spreading out from a point source at *x*_{0}. The analytic solution to the CME is

reflecting that only self-interactions (*K* = 1) are necessary to describe the process. The reduced model (9) becomes

It is straightforward to verify that $p(x,t)=p\u0303(x,t)$ if

Consequentially, from $\u2202t\nu \xaf(y,t)$ , the basis functional is

#### 3. Unimolecular reaction-diffusion

For reaction networks that involve only diffusion and unimolecular reactions, two key properties hold for the CME solution:

Separable spatial and particle number distributions

*p*(*n*,) =*x**p*(*n*)*p*() where each distribution is normalized*x**∑*_{n}*p*(*n*) = 1 and*∫d**x**p*() = 1.*x*Independence of spatial distribution

*p*() =*x**p*(*x*_{1})*p*(*x*_{2})…*p*(*x*_{n}) where each*∫dx p*(*x*) = 1 is normalized. This assumes that initial*p*(*x*_{i}) are independent—otherwise, a fixed mixture of independent components must be considered.

Analogous to the purely diffusive process above, this allows analytic solutions to the inverse Ising problem by imposing these conditions upon the dynamic Boltzmann distribution $p\u0303$. Here, we exploit the fact that multiplication of Boltzmann distributions results in addition of the energy functions.

Introduce a single interaction function $\nu \xaf(x,t)$ to capture the diffusion process and the usual *ν*_{1}(*t*), …, *ν*_{K}(*t*) to describe the reactions (for brevity, omit further time arguments in this section). Furthermore, impose the normalization $\u222bdx\u2061exp[\u2212\nu \xaf(x)]=1$. The dynamic Boltzmann distribution becomes

where the partition function is

The distribution $p\u0303(n,x)$ is the MaxEnt distribution consistent with the moments $\u27e8nk\u27e9$ for all *k* = 1, …, *K*, as well as the spatial moment

such that the solution to the inverse Ising problem is

The solution for the inverse Ising problem for $\u27e8nk\u27e9$ is independent of this spatial moment and is analytically possible, e.g., for *K* = 1 or 2, as demonstrated in Secs. II D 1 and II E 1 above.

Taking the time derivatives of these solutions $\nu \xaf\u0307$ and $\nu k\u0307$ and using the CME to derive differential equations for the moments give the basis functionals. For unimolecular reactions, the diffusion process does not affect the reactions such that the functional controlling $\nu \xaf$ is always that of diffusion (34). For example, for a branching random walk consisting of diffusion from a point source and the Galton-Watson process with *K* = 2, the basis are the functional (34) and the functions (29).

### F. Parameterizations for spatially heterogeneous systems

For reaction-diffusion systems that involve reactions which have a total stoichiometry greater than one on the starting (input) side, it generally becomes difficult to analytically solve the inverse Ising problem and identify basis functionals. However, an algorithmic solution remains possible, where we guess a local parameterization of the functional (10) based on the analytic solutions presented above.

Let ** β**,

**be of length**

*y**k*and use the notation

Then choose the *spatially local* parameterization of $Fk$ in (10),

where *∂*_{m} denotes the derivative with respect to the *m*-th component of $yi\lambda k$, and $Fk(\gamma )({\nu (\beta ,y,t)})$ for (*γ*) = (0), (1, *λ*), (2, *λ*) are local functions, i.e., functions of the arguments on the left hand side of (40).

The variational problem (18) now becomes

for (*γ*) = (0), (1, *λ*), (2, *λ*), where ** β**′,

**′ are of length**

*y**k*′.

Analogously to the well-mixed case, it is possible to derive a PDE system governing the variational term $\delta \nu k\u2032(\beta \u2032,y\u2032,t\u2032)/\delta Fk(\gamma )({\nu (\beta ,y)})$. In Appendix A 2, an illustrative example is derived for a diffusion process.

#### 1. Example: Branching random walk

Consider a branching random walk consisting of the Galton-Watson process and diffusion from a point source with rate *D* in one spatial dimension and one species. From the true solutions for the basis functionals (34) and (29), use one spatial interaction function $\nu \xaf(y,t)$ and two purely temporal *ν*_{1}(*t*), *ν*_{2}(*t*) and further restrict the parameterization (40) of the basis functionals to be

for *k* = 1, 2. The variational problem for *γ* = 1, 2 is

Differential equations governing the variational terms $\delta \nu \xaf(y\u2032,t\u2032)/\delta F\xaf(\gamma )(\nu \xaf)$ for *γ* = 1, 2 are derived in Appendix A 2, given by (A15). Differential equations governing $\delta \nu k\u2032(t\u2032)/\delta Fk(0)(\nu 1,\nu 2)$ are given by (22).

The optimization problems (44) and (45) subject to the PDE-constraints (42) and (43) may be solved algorithmically using Algorithm 1 in each $Fk,F\xaf(1),F\xaf(2)$ analogously to the well-mixed case. We note that the true solutions are given by (34) and (29), in particular: $F\xaf(1)=D$ and $F\xaf(2)=\u2212D$.

Figure 6 plots the spatial variational terms resulting from the true basis functionals. Here, the reaction rates used are as before *k*_{d} = 3*k*_{b}/2 = 1.5, with a diffusion constant of *D* = 1. Contrary to the well-mixed case, these terms do not resemble step functions but rather exhibit some extended temporal dynamics.

## III. ESTIMATING EFFECTIVE REDUCED DYNAMICS IN 1D

The PDE-constrained optimization problems given above are the general solution for finding the basis functionals that govern the time evolution of the reduced MaxEnt model. Here, we present a more efficient machine learning approach for learning the basis functions from the solutions of simple, analytically solvable models. In Sec. III A, we present a method for finding such analytic solutions in the discrete lattice limit and present examples for a variety of simple processes in Sec. III B. In Sec. III C, we demonstrate the utility of using such analytic solutions in a Boltzmann machine-like learning algorithm and further in Sec. III D to learn non-linear combinations of solutions using artificial neural networks (ANNs).

### A. Mapping to spin glass systems in 1D

At low particle densities, a feasible model of a reaction-diffusion system in one spatial dimension and one species is a 1D lattice in the single occupancy limit. Let the spin values occupying each lattice site be *s*_{i} ∈ {0, 1}, *i* = 1, …, *N*, denoting the absence or presence of a particle.

The reduced model (9) now becomes the discrete analog. We note that this model is consistent with the continuous version in some parameter regime where the separation between molecules is large compared to the interaction radius. By including only self-interactions described by an interaction function *h*(*t*), and two particle nearest-neighbor interactions *J*(*t*), we obtain the well-known Ising model, with partition function

This may be evaluated explicitly using the standard transfer matrix method. In the thermodynamic limit, $lnZ\u2248\lambda +N$ is analytically accessible, where *λ*_{+} is the largest eigenvalue of the transfer matrix.

The inverse Ising problem has the solution

Taking the derivatives of both sides of (47),

The time derivatives of the moments on the left may be obtained directly from the CME $p\u0307$ = **W***p* using the Doi-Peliti formalism described in Sec. II A. If the system is *linear*, these may be expressed further in terms of *h*, *J* using (47), and the basis functions are given directly by inverting (48). If the system is *non-linear*, the presence of a moment hierarchy requires an approximation in the form of a moment closure technique. Here, we choose to express the higher order moments that appear through the CME in terms of *h*, *J*, which is possible for any higher order moment since the partition function (46) is analytically accessible. As a result of inverting (48),

where the RHS has been expressed in terms of *h*, *J* as described above, and we use the notation $F\u0303h,F\u0303J$ to indicate that these are generally only approximations to the true basis functions *F*_{h}, *F*_{J} and only exact for systems with closed moments. Effectively, we have replaced the probability distribution *p* in the CME $p\u0307$ = *Wp* by the dynamic Boltzmann distribution $p\u0303$ and evaluated the effect of the operator on the RHS on this new distribution. The analytic solution to the 1D inverse Ising problem therefore provides an elegant approach to moment closure (see Refs. 10 and 29 for related MaxEnt approaches to moment closure). Similar extensions to 2D Ising models^{30} are likewise possible and possibly to 3D as well.^{31}

Furthermore, we note that analogous to the continuous case proven in Proposition 1, the linearity of reaction operators in the CME extends to the basis function approximations $F\u0303$ (regardless of whether $Z$ is analytically accessible as in the 1D case). This requires that the inverse Ising problem has not changed, as discussed further in Sec. III C.

### B. Analytic approximations to basis functions of simple reaction motifs

Figure 7 shows the basis function approximations calculated using the 1D Ising model (49) for several simple unimolecular reaction processes. Note that the reaction rates/diffusion constants provide an overall multiplicative factor to each process. Computer algebra systems can be used to determine these analytic forms, which contain sums on the order of ten to a hundred terms in length, depending on the operator (see the supplementary material for the code used to generate Figs. 7 and 8).

Generalizing these simple systems, we solve for the basis function approximations of the trivalent reaction *A* + *B* → *C* with its reverse process *C* → *A* + *B*. This process is fundamentally important as a generalization of many simple biochemical processes and has been studied extensively.^{23,32} For example, it is the building block of the broadly applicable substrate-enzyme-product (SEP) motif *S* + *E ⇌ C* → *P* + *E*, where *S*, *E*, *P* denote the substrate, enzyme, and product (see Sec. III D).

In the Ising model, the description of this process involves 9 time dependent interaction functions *h*_{A}, *h*_{B}, *h*_{C}, *J*_{AA}, *J*_{AB}, *J*_{AC}, *J*_{BB}, *J*_{BC}, *J*_{CC},

where the species label *α*_{i} ∈ {*A*, *B*, *C*}, and we implicitly note that the sum *∑*_{{α}} runs only over occupied sites *s*_{i} = 1. Figure 8 shows several 2D slices of three of the nine basis function approximations for the forward process *A* + *B* → *C*.

Due to the species labels, (49) leads to analytic expressions containing on the order of hundreds of terms. Here, we use a numerical strategy as described in Appendix B for evaluating the basis functions over the chosen domain. While a computer algebra system may be employed as before, this strategy is computationally faster.

### C. Boltzmann machine-style learning algorithm for dynamics

The basis function approximations derived above constitute a space of possible reduced dynamics. Here, we consider using these analytic insights to describe large spatially distributed reaction networks in 1D. This approach faces two key problems:

For non-linear systems, $p\u0303$ obeying (49) will over the time diverge from the MaxEnt distribution consistent with the CME moments due the moment closure approximation made. As a fundamental consequence of this moment hierarchy, it is not possible to find exact basis functions over the

*entire*interaction parameter space (e.g.,*h*,*J*). Another way to see this is that trajectories of the CME system will intersect in*h*,*J*space.

However, we postulate that it may be possible to learn approximately well the basis functions for a *single trajectory* (from a single initial condition) which does not self-intersect over some domain. This model may be used for extrapolation with a reasonable accuracy close to the stochastic trajectory.

For large reaction networks, the basis functions are generally not linear in the basis functions of individual processes because the collection of interaction functions is not fixed, violating the assumption in Proposition 1. For example, consider the process

*A*→*B*→*C*. Here, nine basis functions are required to capture all means and nearest neighbor (NN) correlations such that (49) is nine dimensional. Denote these by=*b***A**^{−1}where*m*denotes the basis functions,*b*denotes the time evolving moments, and*m***A**denotes the matrix of partition function derivatives.

Next, consider the separate processes *A* → *B* and *B* → *C*, described by five basis functions each. Let these be denoted by $b(r)=(A(r))\u22121m(r)$ for each of the two reactions *r*. Clearly, not all nine basis functions in ** b** are present in each

*b*^{(r)}. Furthermore, for those that are present in both, it is not necessarily true that the

*i*-th basis function is expressible as $bi\u2260bj(1)+bk(2)$ for appropriate

*j*,

*k*.

Generally, a reaction network involves more interaction parameters than each of the individual processes such that Proposition 1 does not apply. It is only for a subset of networks, such as reaction networks in one species, where the linearity in the CME extends *exactly* to the basis functions. Regardless, we postulate that many networks may be described *approximately* well by linear combinations of basis functions corresponding to individual processes.

In light of these observations, we return to the variational problem (21) and its PDE-constraint. In the discrete lattice case considered in Sec. III A, it becomes for each *γ* = *h*, *J*,

where we have used the notation *μ*, Δ to denote the average number of particles, nearest neighbors (NN) over *p*, and similarly $\mu \u0303,\Delta \u0303$ to denote averages over $p\u0303$.

Here, we exploit the analytic results derived above to simplify this problem and derive an efficient Boltzmann-machine type learning algorithm for the dynamics. In particular, we assume that the true basis functions are linear combinations of the approximations derived in Sec. III B above, given by

Here, the reaction rates and diffusion constant are all set to unity such that the coefficients *θ* indicate the rates. The variational problem now turns into a regular optimization problem for the coefficients *θ* that will yield at all times the MaxEnt distribution consistent with the CME moments. The resulting optimization problem is: Subject to the PDE constraint (52), solve:

where the derivative terms are given by the solution to the ordinary differential equation system

with initial condition *∂h*(0)/*∂θ*^{(s)} = *∂J*(0)/*∂θ*^{(s)} = 0.

Parameter estimation is greatly simpler to solve than the function estimation (51). Furthermore, the variational problem (54) is significantly simplified since $F\u0303(r)$ and consequentially its derivatives are analytically accessible. We capitalize upon these practical qualities in Algorithm 2, which solves this problem in a Boltzmann-machine learning style approach.

As an illustrative example, we apply Algorithm 2 to a branching and annihilating random walk (BARW) on a 1D lattice, described by the following processes: *A* → *A* + *A* with rate *k*_{b}, *A* + *A* → 0 with reaction probability *p*_{a} upon encounters, and diffusion. BARWs have been studied in the context of universality classes, in particular, the directed percolation universality class.^{33,34}

Stochastic simulations are used to generate training data for this system on a chain of length *N* = 100 for maximum time of *T* = 1 with time step Δ*t* = 0.01. Here, we follow the numerical procedure described in Ref. 33, with particles attempting to hop to neighboring lattice sites at every time step and reacting with *k*_{b} = 10 and *p*_{a} = 0.1. The basis functions used in (52) are those of the three processes present, as shown in Fig. 7. The initial coefficients $\theta 0(s)$ used are *k*_{b} for branching, *p*_{a}/Δ*t* for annihilation, and 10 for diffusion.

Figure 9 shows the moments of the BARW system. Due to the moment closure problem, the system predicted by solving the constraint equations (52) diverges from the stochastic simulations. After running 400 iterations of Algorithm 2, the new coefficients lead to much closer agreement to the true system.

Figure 10 shows the coefficients converge over the iterations. In particular, the effective rates for bimolecular annihilation and branching have decreased, while the effective diffusion constant has increased. Since the final values are sensitive to the initial $\theta 0(s)$ chosen, an *L*_{2} regularization term is included in the action. A further constraint in Algorithm 2 to keep *θ*^{(s)} positive enforces the connection to effective reaction rates.

### D. ANNs for learning non-linear combinations of basis functions

As a more general approach than linear combinations, we use ANNs (artificial neural networks) to describe non-linear combinations of basis functions. Consider the SEP system diffusing on a 1D lattice described by

The full Ising model for this system consists of four external fields and 10 NN coupling parameters.

Figure 11 shows several moments of this system evolving in time from stochastic simulations. Here, the parameters used are *k*_{−1} = 0.1, *k*_{2} = 0.5, bimolecular reaction probability *p*_{1} = 0.1, maximum time *T* = 2 with time step 0.01, and lattice length *N* = 100. The system evolves from an initial lattice generated by Gibbs sampling with parameters *h*_{S} = 0.5, *h*_{E} = 1, *h*_{C} = −1, *h*_{P} = −1 and all NN terms set to zero.

The inputs to the ANN are the basis functions for the three separate processes, each of which belongs to the trivalent reaction motif of Fig. 8, thereby contributing 9 basis functions. Additionally, the two basis functions for the diffusion of each of the four species are included from Fig. 7 for a total of 35 inputs. The other layers in the ANN are two layers of 40 units, and an output layer of 14 units, with tanh activation functions between each layer. One third of the total length *T* of the time-series is used for training. These are converted to trajectories in interaction parameter space using Boltzmann machine learning and smoothed using a low-pass filter before being used to evaluate the 35 input basis functions. The outputs to be learned are the time derivatives of these 14 parameters also smoothed by a low-pass filter.

The network learns the dynamics of these parameters to high precision. We infer from the fast training times that the usage of these analytic solutions as input greatly reduces the difficulty of training the network from the interaction parameters directly.

Figure 11(a) shows the extrapolated parameters and corresponding moments, compared to the remaining third of the simulation time. These extrapolations are approximately linear in interaction space and may diverge quickly, such as *h*_{P}. However, the moments show considerable robustness to these variations. In Fig. 11(b), we compare the symmetric relative error of the interaction parameters and the moments. For each, the error *ε* is calculated by comparing the extrapolated values from the ANN and the remaining two thirds of the stochastic simulation data, averaged over the *K* = 14 moments,

and similarly for the interaction parameters. The error in the moments grows considerably slower than in the parameters, suggesting that using ANNs for extrapolation is possible for practical applications.

We compare the learned model to a well-mixed model described by mass action kinetics for (55). The reaction rates appearing in the ODEs for this model are fit by non-linear regression to the training data, obtaining *k*_{1} ≈ 0.09, *k*_{−1} ≈ 1.34, *k*_{2} ≈ 0.66. These parameters are used to further extrapolate the model over the remaining 2/3 of the timeseries, as shown in Fig. 11(a). The symmetric relative error shown in Fig. 11(b) is at all times larger than for the ANN model. The better performance of the ANN method results from it explicitly capturing spatial correlations, rather than implicitly as in the well-mixed model. The tradeoff for this accuracy is the computational cost of training the ANN. However, as typical for ANNs, forward passes through the network are fast, and the cost of solving the resulting differential equations after training is comparable between the two models. This ANN approach is therefore ideal for stochastic spatial modeling applications where the training time is not a bottleneck.

A further feature learned by the ANN is a closure approximation for the dynamics of *J*_{SP} and its corresponding NN moment. This parameter is not included in any of the basis functions or inputs to the ANN. The basis function learned, $FJSP$, shown in Fig. 11(b), expresses the dynamics of this moment in terms of the interactions made available as input to the network. Similar extensions to higher order moments are likewise possible.

## IV. DISCUSSION AND CONCLUSIONS

This paper presents a new approach to model reduction of spatial chemical systems. Slowly time-evolving MaxEnt models are employed to capture the key correlations in the system. This approach is particularly useful for multiscale problems, where different spatial and temporal correlations become more or less relevant over time to accurately describe the system. For example, in synaptic level neuroscience, the stochastic influx of signaling molecules in the post-synaptic spine produces complex spatial correlations between ion channels and downstream targets, but these are less relevant during quiescent periods. We anticipate that such problems stand to benefit greatly from modeling approaches that are able to adjust which correlations are included to optimize simulation efficiency and accuracy.

A general model that is functional in nature is introduced to describe dynamic Boltzmann distributions. This extends and formalizes ideas originally developed in GCCD in Ref. 10—in particular:

A general variational problem has been formulated to determine the functions in the dynamical system controlling the interaction parameters, taking the form of a PDE-constrained optimization problem.

The reduced model has been extended to capture spatial correlations, with particular relevance to biological applications. By motivating parameterizations of the functionals from analytically solvable cases, practical optimization algorithms for learning spatial systems are made possible.

ANNs have been employed to learn

*non-linear*combinations of basis functions, derived for individual reaction processes using computer algebra systems.

Mapping the chemical system onto a spin lattice allows a direct connection to the more traditional formulation of a Boltzmann machine. Here, the connection to the new learning algorithm is evident in (51), and we anticipate this will suggest numerous further applications to diverse areas of machine learning where estimating the dynamics of a time series is required.

The learning rule for dynamic Boltzmann distributions developed in Sec. II applies to systems of any spatial dimension, generally 3D. The results of Sec. III for chemical kinetics on 1D lattices can be extended to treat 2D and 3D systems. For example, for 2D systems with no external field, the basis function approximations may be derived as before since exact solutions for observables of the Ising model exist.^{30} In general, the inverse Ising problem is not solvable. Instead, approximations such as mean-field theory may be used as an input to the ANNs or as initialization for further learning.

Numerous strategies are possible for improving the efficiency of the PDE-constrained optimization problem formulated here, such as adjoint methods.^{28} In this work, we have shown that the complexity of this problem can be greatly reduced by instead learning linear and non-linear combinations of analytically accessible approximations. Deconstructing the problem in this way can offer physical insight into a complex reaction system, such as in Sec. III C, where effective reaction rates are learned. Future work in this direction may further explore these principled methods for integrating human intuition with machine inference in the model reduction process.

## SUPPLEMENTARY MATERIAL

See supplementary material for alternate derivations of the differential equation system (22) and for the code used to implement Algorithms 1 and 2.

## ACKNOWLEDGMENTS

This work was supported by NIH Grant Nos. R01HD073179, USAF/DARPA FA8750-14-C-0011 (E.M.), NIH P41-GM103712, and AFOSR MURI FA9550-18-1-0051 (O.K.E., T.B., and T.S.).

### APPENDIX A: DERIVATION OF DIFFERENTIAL EQUATION SYSTEM FOR VARIATIONAL TERM

#### 1. Well-mixed case

Consider the differential equation system (20). Represent the solution as a functional of the basis functions *F* using the notation

where {*F*} = {*F*_{l} |*l* = 1, …, *K*}, and *J* results from solving (20). Further, let {*J*[{*F*}]} = {*J*_{l}[{*F*}] |*l* = 1, …, *K*}, then (20) is

To find the variational term $\delta \nu k\u2032(t\u2032)/\delta Fk({\nu})$, let *F*_{k} → *F*_{k} + *ϵη* using the notation

then,

Differentiating with respect to *ϵ* at *ϵ* = 0 gives

Substitute the definition of the functional derivative

to obtain (22)

#### 2. Spatially heterogeneous example: Diffusion in 1D

Consider a diffusion process in 1D, described by a single basis functional parameterized according to

Use the functional notation

where {*F*} = {*F*^{(1)}, *F*^{(2)}} and *J* results from solving (40), then,

To find the variational term *δν*(*y*′, *t*′)/*δF*^{(γ)}(*ω*) for *γ* = 1, 2, let *F*^{(γ)} → *F*^{(γ)} + *ϵη*. Use the notation

then

Take the derivative with respect to *ϵ* at *ϵ* = 0,

where *ν* = *ν*(*y*′, *t*′) everywhere. Substituting the definition of the functional derivative

gives

### APPENDIX B: EVALUATING BASIS FUNCTIONS NUMERICALLY

To compute the basis functions numerically using (49), an efficient method is possible if the eigenvalues of the transfer matrix *M* are singular. Let the eigenvalues be *λ*_{i} with corresponding eigenvectors *u*_{i}. Define

for *α* = *h*, *J*, where *∂*_{α}*M* denotes component-wise differentiation of *M*. Also note that *p*_{ij}(*α*) = *p*_{ji}(*α*) is symmetric. Then the derivatives of the eigenvalues are given by^{35}

for *β* = *h*, *J*. The principle advantage of this approach lies in the fact that the analytic expressions for *∂*_{α}*M* and *∂*_{α}*∂*_{β}*M* are simpler to derive than differentiating the analytic expressions for the eigenvalues *λ*.

It is now straightforward to numerically evaluate the components $\u2202\alpha \u2202\beta \u2061lnZ$ of (49) in the thermodynamic limit $lnZ\u2248N\u2061ln\lambda +$, where *λ*_{+} is the largest eigenvalue of the transfer matrix and *N* the length of the chain.