Polymers consisting of more than one type of monomer, known as copolymers, are vital to both living and synthetic systems. Copolymerization has been studied theoretically in a number of contexts, often by considering a Markov process in which monomers are added or removed from the growing tip of a long copolymer. To date, the analysis of the most general models of this class has necessitated simulation. We present a general method for analyzing such processes without resorting to simulation. Our method can be applied to models with an arbitrary network of sub-steps prior to addition or removal of a monomer, including non-equilibrium kinetic proofreading cycles. Moreover, the approach allows for a dependency of addition and removal reactions on the neighboring site in the copolymer and thermodynamically self-consistent models in which all steps are assumed to be microscopically reversible. Using our approach, thermodynamic quantities such as chemical work; kinetic quantities such as time taken to grow; and statistical quantities such as the distribution of monomer types in the growing copolymer can be directly derived either analytically or numerically from the model definition.

## I. INTRODUCTION

Copolymers are polymers consisting of more than one type of monomeric unit; the order of these monomers in the chain defines the copolymer sequence. In general, copolymerization mechanisms can be classified into two main categories: free copolymerization that does not rely on a template,^{1} as shown in Fig. 1(a); and templated copolymerization, in which a template (usually another copolymer) is used to bias the distribution of sequences produced, as shown in Figs. 1(b) and 1(c). Polymers produced via both types of mechanism are of relevance to both biological and industrial systems. In living systems, O-glycans are sequences of monosaccharides that grow by free copolymerization from serine or threonine amino acids.^{2} They play a key role as a physical protective barrier for cells from pathogens, as well as participating in other cellular processes.^{2,3} Free copolymerization is also a common method for producing plastics and rubbers in commercial and industrial systems.^{4,5} In addition, there have been recent experimental designs for free copolymerization systems that produce specific products utilizing DNA-nanotechnology-based reaction schemes.^{6,7}

Templated copolymerization is the mechanism by which DNA, RNA, and polypeptides are produced in DNA replication, RNA transcription, and protein translation, respectively. These processes are at the heart of the central dogma of molecular biology^{8} and are the basis of the informational and biochemical complexity of life. In DNA replication, DNA templates the production of copies of itself; in transcription, DNA templates the production of RNA; and, in translation, messenger RNA (mRNA) is the template for the production of a polypeptide.^{9} Inspired by these biological templated copolymerization mechanisms, there has been recent interest in designing synthetic systems that can produce other sequence-controlled molecules via templated copolymerization.^{10–16}

Free polymerization can be modeled as a Markovian growth process under which monomers bind to the end of a growing polymer at a certain rate. Early free copolymerization models^{17–19} built on this framework to allow for copolymerization via the incorporation of multiple types of monomeric unit, as shown in Fig. 1(a), albeit with irreversible polymerization reactions. In particular, Mayo and Lewis^{19} emphasized that in polymerization models, if the monomer binding events are irreversible and their rates are conditional on the terminal monomer type, then intra-sequence correlations are generated within the copolymer.

Although the use of models with irreversible transitions is reasonable in many contexts, thermodynamically self-consistent models require all transitions to be microscopically reversible.^{20} In particular, if a transition from state *A* to state *B* is possible, then transitions from *B* to *A* must also be possible. Models with fully microscopically reversible polymerization reactions, as in Fig. 1(a), are more challenging to analyze but can be interpreted in a thermodynamic sense.^{1,21,22}

Templates can affect the rate at which monomers are added or removed from a growing copolymer, and hence, templated copolymerization models can be more complex than free copolymerization models. When the template consists of just one type of templating monomer (homopolymer), a templated copolymerization process can be mapped onto a free copolymerization model. Furthermore, if one assumes some symmetries regarding interactions between monomers in the growing copolymer and those in the template (such as all complementary bonds have equal strength and all non-complementary bonds have equal strength), models of sequence-bearing templates may be mapped onto models with homopolymeric templates and, hence, to models of free copolymerization.^{23–27}

Templated copolymerization models can be further divided into two main categories: templated self-assembly Fig. 1(b)^{,}^{1,27–41} and autonomously separating mechanisms Fig. 1(c).^{23,24,42} Templated self-assembly models are those in which all the monomers in the growing copolymer remain bound to the template. In autonomously separating models, the growing copolymer detaches as it extends.^{23–25} There has been recent interest in explicitly modeling autonomous separation in templated growth in an attempt to understand models that give a better description of transcription or translation.^{23–25} In autonomously separating models, the simultaneous growth and separation of the copolymer and template mean that the copy-template interactions are not permanent, and therefore, free energy released from such interactions cannot be part of the driving force of polymerization. In addition, since these copy-template bonds are temporary, they cannot stabilize the accurate copy directly in the long time limit. Furthermore, an ensemble of accurate polymers is a lower entropy state than an ensemble of random polymers. These conditions mean that non-equilibrium driving is required to generate accurate copies of the template if the copies are to spontaneously detach.^{42} Moreover, the separation of the lagging tail from the template as the copolymer grows naturally causes intra-sequence correlations within the product.^{23}

The models described above are maximally coarse-grained, in that they treat the binding of monomers to the growing tip of the copolymer as a simple, usually single-step, process. However, more generally, one may wish to study models in which polymerization occurs via a more detailed series of steps, as in Fig. 1(d). For instance, in order to explain the high accuracy observed in biological polymer copying systems, Hopfield^{43} and Ninio^{44} independently introduced the concept of kinetic proofreading: a reaction motif in which a monomer undergoes a free energy consuming activation reaction before it is polymerized into the copolymer. The introduction of kinetic proofreading reaction motifs presaged the investigation of more complex copolymerization mechanisms.^{35,45}

In summary, models that allow for multiple monomer types, intra-sequence correlations, reversible reactions, and general, multi-step monomer inclusion reactions represent a wide class of copolymerization processes. Previous techniques^{17–19,22,23,27–35,39–41,46–51,66–69} have not allowed analysis of thermodynamically self-consistent models of generalized free copolymerization processes in which monomer addition is given by an arbitrarily complex network of reversible reactions with rates that may depend on the terminal monomer type, and templated copolymerization models with a high symmetry that can be mapped to these free processes. Investigating the most general type of model in this class would require simulation.

In this paper, we present a universal method for studying this large class of copolymerization models. Drawing on the work of Gaspard and Andrieux^{52} for analyzing linear copolymerization processes, and Hill^{53,54} for analyzing absorbing Markov processes, we present analytical methods for extracting: explicit expressions for the probability of inclusion of a given monomer; the growth rate of a copolymerization process; and the chemical work done by the process. Our method removes the need to extract the same features by simulation and often produces simple, analytic results.

In Sec. II A, we review and refine methods relating to absorbing Markov chains that are crucial to understanding our approach. In Sec. II B 1, we present our method. In Sec. III, we apply the method to a few example processes to demonstrate its use and power when considering models with certain features. First, we apply the method to models for which the rate of adding new monomers only depends on the monomer type being added. Next, we apply the method to templated copolymerization systems with autonomous-separation that do not have non-equilibrium kinetic proofreading cycles. Finally, we solve a generalized version of Hopfield’s kinetic proofreading model applied to a templated copolymerization system with an autonomously separating product.

## II. METHODS

### A. Absorbing Markov chains

We begin by reviewing and adapting some diagrammatic techniques introduced by Hill to analyze absorbing Markov chains.^{53,54} An absorbing Markov chain is a Markov chain for which any trajectory through its state space with arbitrary initial conditions will reach an absorbing state in finite time almost surely.^{55} We can decompose the state space of an absorbing Markov chain into absorbing states $A$ and transient states $X$ such that the state space is $V=A\u222aX$. Let us denote the rate function that describes the chain as $K:V\xd7V\u2192R+$ such that *K*(*x*, *y*) is the rate of the transition from state *x* to state *y*. Then, we denote a Markov process as the tuple (*V*, *K*).

Throughout this section, we shall refer to the absorbing Markov chain given in Fig. 2(a), which possesses two absorbing states and non-trivial cycles, for illustrative purposes.

#### 1. Expectations of an absorbing process are steady state averages of a “closed process”

We will derive expressions for four main quantities: the probabilities of reaching certain absorbing states, the expected time taken to absorption, the expected net number of times traversing a given edge before absorption, and the expected number of times that a trajectory goes round a cycle before absorption. These quantities depend on the starting (transient) state $\sigma \u2208X$ and can be found in terms of the “closed” process.^{54} The closed process is a modified version of an absorbing Markov chain in which transitions to the absorbing states are redirected to the starting state. Figure 2(b) shows the closed process starting at state 1 of our example absorbing chain of Fig. 2(a). The closed process for a Markov process $(X\u222aA,K)$ starting at state *σ* is a new Markov process $(X,K\sigma )$ with a rate function given by

for $x\u2208X$ and agreeing with *K* on $X\xd7X/{\sigma}$.

The closed process has a unique stationary distribution for the following reasons. From the definition of an absorbing Markov chain, there exists a path from any state to an absorbing state, taking finite time. Thus, in the closed process, there is a path from any state to the starting state, taking finite time. The set of states including the starting state and all those that may be reached from the starting state is, therefore, positive recurrent and furthermore, this set is the only recurrent set of states and will be reached from any other state. Since there is only one recurrent set of states, there is a unique stationary distribution.^{55}

Expected quantities of an absorbing Markov chain, such as the expected probability that a particular absorbing state is reached, can be found in terms of steady state quantities in the closed process. Whenever a trajectory of the original process reaches an absorbing state, in the closed process that same trajectory would have been reset back to the starting state. Hence, running the closed process for long times is equivalent to generating many independent trajectories to absorption for the original chain. Thus, averaging quantities in the steady state of the closed process is equivalent to taking expectations over independent trials of quantities in the absorbing chain. It is worth noting that the dependence of expected quantities on the starting state is encoded in the definition of the closed process. Finally, we can see that the definition of the closed process may permit self-transitions *σ* → *σ*, which, for continuous time Markov processes, have little meaning. However, for the purposes of calculating steady state probabilities of the closed process they may be ignored.

#### 2. Steady state averages of the closed process are calculated using the Markov chain tree theorem

Given that we can turn the calculation of expectations of absorbing processes into steady state averages over closed processes, we can make use of tools developed for analyzing the steady state of Markov processes, such as the Markov chain tree theorem (MCTT).^{56} The MCTT states that the steady state distribution of a Markov chain with a unique stationary distribution may be found by summing over rooted spanning trees of the process, where the transition rates are taken as weights on the edges of the graph. Explicitly, let $G$ be a directed weighted graph, with weight *K*(*e*) for an edge *e* of $G$. A spanning tree of $G$, rooted at a vertex *ν*, is a subgraph of $G$ with no cycles that connects all the vertices of $G$ and for which the out degree of every vertex, except *ν*, is one. The sets of spanning trees rooted at nodes 3 and 4 of the closed process given in Fig. 2(b), are shown in Fig. 3. Denote the set of all spanning trees rooted at *x* by $T(x)$. The MCTT states that the steady state probability *π*(*x*) to be in state *x* is given by

with *e* ∈ *T* representing the edges of the tree. The denominator here is simply a normalization constant.

We can define steady state currents from the steady state distribution of the closed process that corresponds to expected currents of the absorbing chain. Let a subscript *σ* denote quantities in the closed process starting at state *σ*. Then, *π*_{σ} is the steady state probability distribution and *K*_{σ} the rate function. The current along a given edge, *e* = *x* → *y*, is given by the probability to be in state *x*, *π*_{σ}(*x*), multiplied by the rate along said edge. We can, therefore, write the steady state current along all edges that originally led to absorbing states as

where, as before, $X$ is the set of transient states and $A$, the set of absorbing states. *J*_{Tot}(*σ*) is the expected total current to absorbing states from state *σ*, and, therefore, its reciprocal is the expected time to absorption.

For the example process shown in Fig. 2, we present the spanning trees of the corresponding closed process rooted at node 3 and 4 in Fig. 3. Given the spanning trees, we can directly write down the total current to the absorbing states as

where $N$ is the normalization term, given in Appendix B. The terms multiplied by *k*_{A} are the partial current to absorbing state *A*, i.e., the current along transition 3 → *A*, coming from the trees rooted at node 3, and equivalently for *k*_{B} with state *B*, i.e., the current along transition 4 → *B*, coming from trees rooted at node 4.

#### 3. Absorbing probabilities

Given a Markov chain with multiple absorbing states, we can ask for the probability of absorption in each absorbing state in the long time limit. The probability that a trajectory eventually ends in a specific absorbing state can be calculated from the closed process, by dividing the expected current along transitions that originally led to the absorbing state in question by *J*_{Tot}(*σ*) [Eq. (3)]. Therefore, the absorption probabilities can be written

using the notation $P[\sigma \u2192A]$ to denote probability of being absorbed to *A* given that the trajectory started in state *σ*. It is worth noting here that given that this quantity is a ratio of currents, there is a factor of *π*_{σ} in both the denominator and the numerator of the expression. In practice, we see that the normalization factor from the MCTT [Eq. (2)] cancels out, which simplifies the quantities in the calculation.

For our example process shown in Fig. 2, we can use the partial currents to absorbing states *A* and *B* to write down the absorbing probabilities

The normalization factor, $N$, propagated through from Eq. (2), conveniently cancels out with the $1/N$ implicit in *J*_{tot}.

#### 4. Counting edge and cycle transitions

In this subsection, we shall calculate the expected net number of times traversing a given edge of an absorbing Markov process before absorption. In addition, we shall calculate the expected number of times a non-recurrent cycle of an absorbing Markov process is traversed before absorption. Both of these will be of use later in defining a notion of chemical work.

To calculate the net number of times crossing a given edge, we find the expected current along the transition *x* ⇋ *y*, between states *x* and *y* of an absorbing process, $(X\u222aA,K)$, as in Sec. II A. The expected current through this edge, denoted *J*_{x⇋y}(*σ*), given starting in state $\sigma \u2208X$, can be calculated from the closed process, $(X,K\sigma )$, as in Eq. (1), as the difference between the steady state probability to be in state *x* multiplied by the rate from *x* → *y* and the steady state probability to be in state *y* multiplied by the rate from *y* → *x*,

The net number of times traversing this edge (number of observed transitions *x* → *y* - number of observed transitions *y* → *x*) before absorption is, then, just the ratio between this current and total current to absorbing states,

The current, given in Eq. (7), is intimately linked to the notion of cycles as pointed out by Wachtel *et al.*^{57} and detailed in Appendix C. Thus, we also wish to find the expected number of times traversing a non-recurrent cycle. We define a non-recurrent cycle for a Markov chain to be a cycle of states, where each state, aside from the originating state, does not appear more than once in the cycle. For example, the cycle A → B → C → D → A is non-recurrent, but A → B → C → D → B → A is recurrent. Note that the originating state is arbitrary, and so A → B → C → D → A is equivalent to B → C → D → A → B. For a stationary process, the expected frequency with which a cycle is completed can be calculated from the one-way cycle current,^{54,58} which is the probability current going around the cycle. For a chosen non-recurrent cycle, the one-way cycle current can be calculated diagrammatically from three terms. First, a cycle term given by the product of rates around the cycle in the chosen direction. Second, a spanning tree term that can be found by collapsing the nodes in the cycle into a single node [in Fig. 2(c), the cycle 1 → 2 → 3 → 1 has been collapsed in this way] and finding the sum of spanning trees of this new graph rooted at the collapsed cycle node. Finally, there is a normalization factor, which is the same normalization factor as for the current, $N$. Explicitly, consider an absorbing Markov chain $G=(X\u222aA,K)$ and its closed process starting at $\sigma \u2208X$, $G\sigma =(X,K\sigma )$. Let *C* denote both the set of edges and set of nodes of a cycle in the closed process. To calculate the spanning tree term for the one-way cycle current, construct a new Markov chain, the cycle process $G\sigma ,C=({c}\u222a(X/C),K\sigma ,C)$, where ${c}\u222a(X/C)$ is the set of transient states of the original process with the states in the cycle replaced by the single node *c* and *K*_{σ,C} is given by

for $x\u2208X/C$ and agreeing with *K*_{σ} elsewhere. The cycle process for the cycle *C* = 1231 (or *C*′ = 1321) of the example system in Fig. 2(a) is shown in Fig. 2(c). Let $T\sigma (x),TC(x)$ be the sets of spanning trees rooted at *x* of the closed process, $G\sigma $, and cycle process, $G\sigma ,C$, respectively. Then, the cycle current is given by^{58}

Note that, for the cycle term, the edges are taken from the original process rather than the closed process. Given the cycle current for the closed process, the expected number of circulations of the cycle before absorption, *N*_{Cyc}(*σ*, *C*), is the ratio of the cycle current to the total current to absorbing states

For our example process, the expected number of circulations of *C* = 1231 is

with the same implicit cancelation of normalization as before, since $JTot\u221d1N$.

For an absorbing process starting at a given state, we may divide the cycles into internal and external cycles. External cycles are those which appear in the closed process and involve edges which were absorbing edges in the original process. The set of all cycles, sorted into internal and external, for the example process Fig. 2, is shown in Appendix D. The external cycles correspond to the pathways from the starting state to an absorbing state. Therefore, the expected number of times traversing an external cycle before absorption will be at most one and corresponds to the probability of following a given path to absorption. Furthermore, the sum of Eq. (11) over all external cycles will be one.

### B. Copolymer methods

#### 1. Philosophy of coarse-graining complex underlying copolymerization reaction networks

Armed with the techniques for solving absorbing Markov chains, here, we set out the method for the analysis of copolymerization processes. Gaspard and Andrieux^{52} presented a method to analyze Markov polymerization processes in which each monomer is added in a single step [i.e., if the internal reaction network shown in Fig. 1(d) were trivial], assuming long polymers. We shall present a method for mapping more complex models for the individual polymerization step onto coarse-grained descriptions that can be analyzed using this framework and, subsequently, show how to get the behavior of the full model from the results.

Consider a growing copolymer with *M* monomer types, which are assumed to be present in the environment at fixed concentrations. At a coarse grained level, we can define a state space of finite length sequences {*x*_{1}*x*_{2}…*x*_{l}|*x*_{i} ∈ {1, 2, …, *M*}, *l* ≥ 0}, where *l* is the length of the sequence. Let us refer to the coarse-grained states in this state space as *completed states*. On this coarse-grained level, a sequence of length *l* may increase in length by one unit by polymerizing one of *M* units at the growing tip (*x*_{1}*x*_{2}…*x*_{l} → *x*_{1}*x*_{2}…*x*_{l}*x*_{l+1}) or it may decrease in length by one unit (*x*_{1}*x*_{2}…*x*_{l} → *x*_{1}*x*_{2}…*x*_{l−1}). Such a coarse grained model is depicted in Figs. 1(a)–1(c) for free polymerization, templated self-assembly, and templated polymerization with simultaneous separation.

In general, copolymerization processes may be best described by models in which the underlying copolymerization reaction networks are complex, featuring multiple sub-steps in arbitrarily complex networks connecting the completed states, as suggested in Fig. 1(d). Hence, overall, we could consider a copolymerization process as having a tree-like structure with networks of reactions connecting completed states, as in Fig. 4. Such a class of models is wide-reaching, with many examples from the literature included in this class.^{17–19,22,23,27,28,30–35,39–41,45–51}

We will define a Markov process at the level of the coarse-grained completed states that, by construction, preserves the probabilities of the transitions between the completed states of the fine-grained process and, therefore, preserves the statistics of the sequences produced. The coarse-grained Markov process does not preserve the distribution of the transition times between the completed states implied by the fine-grained model, which will, in general, be non-Markovian. Moreover, it does not provide fine-grained information on trajectories between the coarse-grained completed states. However, the temporal details and information about the fine-grained dynamics can be added back in at a later stage, once the statistics have been analyzed at the coarse-grained level.

#### 2. Identifying propensities in the coarse-grained model

We find the transition rates of the coarse-grained model (hereafter labeled as propensities to avoid confusion with the underlying rates of the fine-grained process) by considering first passage problems between completed states. From a given completed state, there are *M* + 1 completed states that may be reached, corresponding to the *M* possible additions of a monomer and the removal of the monomer currently at the tip of the copolymer. For a first passage problem, we can convert each of these reachable completed states into an absorbing state by removing the transitions out of said states, as in Fig. 4(a), in the same vein as Cady and Qian.^{59} Let us refer to this absorbing Markov process as the *step-wise process* and define step to mean the addition/removal of a monomer.

We shall work with the assumption that the transition rates depend on the two monomers at the growing tip of the copolymer following.^{1,17–19,21–23,36–38,52,60} There will therefore be *M*^{2} flavors of this process corresponding to the combinations of the two terminal monomers of the copolymer, the central state &*xy* (here & represents an arbitrary sequence). We wish to find the absorbing probabilities, $P[&xy\u2192&xyz],z\u2208{1,\u2026M},P[&xy\u2192&x]$ given an initial condition of the central state &*xy*. As outlined in Sec. II B 1, Eq. (5), we can find these probabilities by constructing the closed process and finding sums over spanning trees rooted at different states. The step-wise process has *M* + 1 petal-like graphs each connected to the central state but disconnected from each other. Due to this structure, any sums over spanning trees of the full process will factorize into a product of sums over spanning trees of the petals. Thus, we find that the absorbing probabilities take the following form:

Here, *z* ∈ {1, …, *M*}, $N$ is the normalization factor from Eq. (2); Λ^{+}(*z*, *y*) is the sum over spanning trees of the petal connecting states monomers &*xy* and &*xyz*, rooted at the forward completed state &*xyz*; Λ^{−}(*y*, *x*) is the sum over spanning trees of the petal connecting states monomers &*x* and &*xy*, rooted at the backward completed state &*x*, Fig. 4(b); and *Q*(*y*, *x*) is the sum over spanning trees of the petal connecting states &*x* and &*xy*, linked back to the central state and rooted at the central state, i.e., with edges redirected to the starting state as in the closed process, as in Fig. 4(c). Since *Q* is a sum over spanning trees rooted at the node to which edges have been redirected, the sum takes the same form for both the forward and backward petals, only depending on which two completed states it is connecting.

From these probabilities, we see that choosing propensities *ω*_{±yx} for the transitions $&x\u2192\omega +yx&xy$ and $&xy\u2192\omega \u2212yx&x$ such that

not only preserves the ratios of probabilities of transitions to completed states, but also ensures that *ω*_{±yx} only depends on monomers *x* and *y*.

We note here that this coarse graining process is different from lumping,^{55,61} in which the state space is reduced while attempting to retain trajectory dynamics. In our approach, the coarse-grained process does not reproduce the dynamics of the fine-grained process, only the statistics of the completed states that are visited. However, dynamic quantities may be extracted exactly from the step-wise process, as we show in Sec. II B 4.

#### 3. Solving the coarse-grained model

We now use the methods developed by Gaspard and Andrieux^{52} to solve the coarse-grained Markov model over the completed states with propensities *ω*_{±yx}. Gaspard and Andrieux’s approach considers a frame of reference that is comoving with the tip of the growing polymer and assumes that the state of the tip and nearby monomers reaches a stationary distribution to derive quantities at this steady state, such as the set of tip incorporation velocities *v*_{x} (the rates of adding monomers to a copolymer &*x*), the tip probabilities *μ*(*x*) (the probability at a given time that the growing polymer is in state &*x*), and the pair tip probabilities *μ*(*x*, *y*) (the probability of being in state &*xy*). However, we note that the time-dependent information is not physical at this stage due to the coarse-graining process. The above quantities are found from solving the following equations:^{52}

Using *μ* and *v*, we can calculate the statistics of the copolymer sequence far behind the growing tip.^{52} We note that the distribution of monomers at the tip *μ*(*x*) is different from the distribution of monomers at sites behind the tip; we assume that this distribution reaches some limit far behind the growing tip in the bulk of the copolymer. This limiting distribution describes the probability that a monomer in the bulk of the copolymer takes a value *x*. Using *ɛ*(*x*) to denote the frequency of monomer *x* in the bulk of the copolymer,^{52}

We may similarly define *ɛ*(*y*|*x*) as the probability that in the bulk of the copolymer, a monomer *y* is observed given a monomer *x* behind it. *ɛ*(*x*) and *ɛ*(*y*|*x*) fully characterize the statistics of the bulk copolymer since under our assumptions, transitions only depend on the two monomers at the tip; the completed copolymer sequence is itself a Markov chain.^{23}

#### 4. Extracting properties of the fine-grained model from the solution of the coarse-grained model

The easiest quantities to extract are the frequencies of monomers in the bulk of the copolymer. These quantities are identical in the coarse-grained and fine-grained models, since the coarse-graining preserves the statistical distribution of the sequences produced. Therefore *ɛ*(*x*), as defined in Eq. (18), and *ɛ*(*y*|*x*) apply directly to the fine-grained process.

The tip probabilities *μ* above give the fraction of the time spent in each tip state in the coarse-grained model. However, the coarse-grained model will not reproduce the time series of the fine-grained model, but only the sequences of completed states visited. We, therefore, quotient out the lifetime *τ*(*x*, *y*) of the tip state (*x*, *y*) to obtain the frequency with which the tip states are visited in the coarse-grained model,

This frequency defines a new tip distribution *ξ*. *ξ*(*x*, *y*) is the frequency that a given pair of monomers *x*, *y* is observed at the tip of the growing copolymer in the sequence of transitions. This distribution, *ξ*(*x*, *y*), applies to both the coarse-grained model and the sequence of completed states visited in the full fine-grained model. It can, therefore, be used to find averages of the key dynamic properties.

For example, we can calculate the probability *P* that a growing copolymer increases in length at each step of the step-wise process. *P* is calculated by averaging the probability of adding a monomer over the possible states &*xy*:

Upon averaging out the sequence information, we may treat the growth of a polymer as a random walk with probability *P* of stepping forwards and (1 − *P*) of stepping back. We can find the expected number of monomer inclusion/removal steps per net forward step as 1/(2*P* − 1) (for proof see Appendix E). A number of quantities scale with the total number of steps rather than the net number of steps, making the number of steps per net forward step a necessary quantity. For example, in order to find the expected time taken per net forward step, one can find the expected time to absorption for the step-wise process *T*(*x*, *y*) for a copolymer in state &*xy* by calculating 1/*J*_{Tot}(&*xy*) for the step-wise process using Eq. (3). Then, the expected time per net forward step is

1/*τ*_{step} is, therefore, the physical average growth rate of the copolymer in the fine-grained model.

We may also calculate the chemical work done by the system in producing the copolymer. In a purely chemical system, with no time-varying externally applied protocols, the entropy increase of the universe is given by the decrease in the generalized free energy of the chemical system, including any coupled reservoirs of fuel molecules.^{20} Since the total free energy must decrease, any increase in one contribution must be paid for by a decrease of at least the same magnitude in another contribution. It is common to describe the latter subsystem as doing work on the former.

For the polymerization systems analyzed here, the generalized free energy can be split into a term corresponding to the chemical free energy of the system averaged over the uncertain state of the system and a term related to the entropy arising due to the uncertainty of the state occupied.^{62}

where we use natural units such that *k*_{B}*T* = 1. Here, *a* is a chemical state of the system as a whole, *G*_{chem}(*a*) is the chemical free energy of state *a*, and *p*(*a*) is the probability that the system occupies the state *a*. *G*_{chem}(*a*) = −ln *Z*_{a}, where *Z*_{a} is the partition function of the system (explicitly including any large chemical buffers) restricted to the chemical state *a*. *Z*_{a} represents the contribution of concentrations and bond strength to the favorability of a molecular state. The principle of detailed balance^{20} states that the chemical free energy change associated with a transition from *a* to *b* is given by

The second term in Eq. (23) is information theoretic in character; it is equal to the negative of the Shannon entropy associated with the distribution over chemical states. For the systems studied here, in which we consider infinitely long copolymers that have reached steady state growth, the only relevant contribution to this term is the increase in Shannon entropy of the copolymer sequence produced as the polymer gets longer. Since the copolymer sequence is itself a discrete time Markov chain^{23} the additional entropy per net forward step (the entropy rate) can be readily calculated,^{63}

with *x*, *y* representing the monomer types. Since the purpose of a copolymerization system is often to produce a low entropy (or “accurate”) sequence, it is reasonable to think of the chemical free-energy decrease per net forward step as the chemical work done to reduce the information entropy of Eq. (25) below that of a uniform, random polymer. Extending the definition provided by Poulton *et al.*,^{23} we may define the efficiency of copolymerization as

where ln *M* is the entropy per monomer (or entropy rate) of a uniform, random copolymer with *M* monomer types and $Wchem$ is the average decrease in chemical free energy per net forward step. This efficiency is, then, the ratio between the entropy drop due to the accuracy of the copolymer compared to that of a random one (ln *M* − *H*) and the chemical work used to drive the system $(Wchem)$ above that required to make a random copolymer in equilibrium is (−ln *M*).^{64}

The expected work done during a transition adding or removing a monomer given starting in completed state &*xy* can be calculated by summing the contribution from Eq. (24) multiplied by the expected net current along the edge *a* ⇋ *b* prior to absorption over all the edges in the step-wise process,

where *N*_{a⇋b}(&*xy*) is the expected net number of times traversing edge *a* ⇋ *b* before absorption given started in the central state of the step-wise process &*xy*, as in Eq. (8). This sum will also require contributions from edges that lead to absorbing states. For such edges, the rate for the reverse transition in the logarithm of Eq. (27) is the rate from the full process.

Equivalently, however, as outlined in Appendix C, we may find this chemical work by considering the non-recurrent cycles of the process.^{57} For a given internal cycle, *C*, we may define the affinity^{20}

where the products are over the edges *e* composing the cycle and *C*′, which is the cycle with the edges in a reversed direction. For external cycles, we may define the affinity in the same way, inferring the rate for the reversed edge of the transition to absorbing states from the full process. The expected work done before absorption of the cycle *C* having started in the state &*xy* is

Averaging *w*_{chem}(*x*, *y*) with *ξ* and multiplying by the expected number of steps per net forward step gives the expected chemical work done per net forward step,

Furthermore, the forms of Eqs. (22) and (30) may be applied to an arbitrary quantity for which one can find the expected value in the step-wise process starting in state &*xy*. Let this arbitrary quantity be *A*(*x*, *y*). One can then average this quantity using the distribution *ξ* to obtain the expected value of the quantity per step. Then, if appropriate, multiplying by 1/(2*P* − 1) gives the expected value of the quantity per net forward step. In practice, as shall be seen in Sec. III C, since the quantities we wish to calculate may be written in terms of sums over spanning trees, the quantities for the step-wise process may be written as a sum over the terms per petal, with the quantity for a given petal factorizing into some quantity that depends on the petal multiplied by *Q*’s for the other petals.

#### 5. Stalled growth

Explicit simulation of copolymer growth is particularly challenging in regimes where *P* ≳ 0.5, since many backward and forward steps are taken per net forward step. At *P* = 0.5, then, the process will not reliably produce copolymers; for *P* < 0.5 polymers will tend to shrink. In general, for *P* = 0.5, we can say the model has stalled. Our approach is particularly beneficial in this case; indeed, it is possible to check whether a model is at the stall point by considering an *M* × *M* dimensional matrix of the ratios of forward to backward propensities,^{52} $Zyx=\omega +yx\omega \u2212yx=\Lambda +(y,x)\Lambda \u2212(y,x)$. The model is at the stall point if and only if

where $1M$ is the *M* × *M* identity matrix. The polymer will shrink if $detZ\u22121M<0,$ Since *Z* gives the ratios of adding a monomer to removing one, this condition essentially says that models will stall if the total rate of adding a monomer is equal to the total rate of removing one.

In a typical model, there exists at least one parameter that controls the driving. Often this parameter is related to the backbone strength of the polymer produced: e.g., the free energy drop associated with the formation of a generic backbone bond Δ*G*_{pol}. This parameter will be present in the rates of each external cycle so that by tuning it, the model can be moved all the way from stalling to irreversible growth, whereby monomers cannot be removed once polymerized. If such a parameter exists, we may rephrase the stall condition, given in Eq. (31), in terms of this parameter. For example, for the case of the parameter being Δ*G*_{pol}, we may find some threshold value Γ such that the model will stall for Δ*G*_{pol} = Γ.

#### 6. Limiting behavior

We shall note two limits for which we may give analytic expressions for the frequency of monomer types in the copolymer bulk for all models. First, consider the case that the system is at the stall point [Eq. (31)]. In general, entropy production can still occur within cycles in the step-wise process; therefore, these frequencies cannot be determined from equilibrium arguments and are non-trivial. Nonetheless, at the stall point, we may express the monomer frequencies in the bulk relatively simply. The frequency *ɛ*_{stall}(*x*) of monomer *x* is proportional (up to normalization) to the cofactor of the diagonal element (corresponding to monomer *x*) of the matrix $1M\u2212Z$, as proven in Appendix F. For example, for *M* = 2,

and for *M* = 3, we have

On the other end of the spectrum, we can also solve for the monomer bulk frequencies in the irreversible limit, where *ω*_{−yx} = 0 for all *x*, *y*. Intuitively, we could consider the Markov process on the state space {1, …, *M*} representing copolymers with a given monomer at its tip and the transitions between those states with rates *K*_{irrev}(*x* → *y*) = *ω*_{+yx}. The steady state of this process will give the time dependent frequencies of having a given monomer at the tip of the copolymer. Therefore, dividing by the time spent in each state will give the bulk frequencies. A nice way to write out these frequencies in the style of the methods described thus far is as a sum over the spanning trees on the complete graph on *M* vertices with rate functions *K*_{irrev}(*x*, *y*) = *ω*_{+yx}. Explicitly, we may write these frequencies (up to normalization) as

where $T(x)$ is the set of spanning trees of the complete graph on *M* vertices. This expression is derived formally in Appendix G. For example, with *M* = 2,

#### 7. Simplification for factorizable propensities

The presented method applies to arbitrary complex copolymerization models obeying the structure of Fig. 4. However, if we make some further common assumptions, much of the analysis simplifies. For example, consider the case in which the ratios of propensities may be factored,

where *Y* is a function of monomer *y* only and *X* is a function of monomer *x* only. Intuitively, such a condition holds in the cases where there are no direct, type-dependent interactions between monomers in the growing polymer, such as when monomers only interact with a template.^{1,23,24,27–38} Under such an assumption, multiple calculations simplify, see Appendix H. For example, the stall condition becomes simply that the model will stall at

Bulk frequencies at stall are just,

## III. EXAMPLE APPLICATIONS

We shall now consider some exemplar classes of models to: provide examples of how to utilize the methods; validate their accuracy; and to show the types of quantities and information that may be extracted.

A useful initial classification of these models is into those which we shall call balanced. We shall refer to a model as being balanced if its petals [see Fig. 4(b)] are detailed balanced. Such models are useful baseline checks as their cycles all have zero affinity, meaning no chemical work is done internally and, hence, the only contributions to chemical work are from external cycles. Furthermore, these models exhibit a proper equilibrium at the stall point and allow for equilibrium arguments to validate the method at this point. It is worth noting that although related to the notion of detailed balanced, the full model with its infinite state space is not detailed balanced.

### A. Stalling behavior in a polymerization model with no neighbor–neighbor interactions

We shall start with the simplest case, where the propensities in the coarse-grained model only depend on the monomer type being added/removed: *ω*_{±yx} = *ω*_{±y}, such as in a simple model for templated self-assembly, shown in Fig. 1(b). Assume that there exists a backbone free energy Δ*G*_{pol} controlling the driving as described in Sec. II B 5. Any spanning tree in Λ^{±} must involve at least one incidence of Δ*G*_{pol}, since it appears in every external cycle. Therefore, we can split the ratio of propensities as follows:

where Δ*G*_{y} encompasses the rest of the details about the models. We note, in general, that Δ*G*_{y} may be a function of Δ*G*_{pol}, however, in many cases, it is not. These cases include when there is only one completion reaction [highlighted in red in Fig. 4(a)] that contains the dependence on Δ*G*_{pol} or if the model is balanced. We may then interpret −Δ*G*_{y} as an effective binding free energy of monomer *y*. If we think of Δ*G*_{pol} as the free energy drive of the model away from stalling, we look for a threshold value Δ*G*_{pol} = Γ above which the model will not stall. Using Eq. (37), we see that

where $Z$ is the partition function for a system with one state for each monomer type, each state labeled by *y* and with free energy −Δ*G*_{y}. Furthermore, using Eq. (38), the bulk frequencies at the stall point may be written as

which is the probability of selecting a state *y* with free energy −Δ*G*_{y} as predicted by equilibrium statistical mechanics. In these results, −Δ*G*_{y} looks like the equilibrium contribution to free energy, and the results follow fairly directly in equilibrium. However, these results hold even if the process involves fuel-consuming cycles: entropy may still be being produced at stall. In such cases, the effect of breaking the equilibrium will be to change the effective free energies of selecting a given monomer type.

### B. Balanced models of templated polymerization with autonomous separation

Next, we shall consider a class of models where the ratio of propensities may be written as

As before, Δ*G*_{pol}, coming from the polymerization reactions represents the driving of this process. Such a class of models includes, most notably, balanced models of templated polymerization with autonomous separation.^{23} In these cases, the breaking of the previous copy-template bond every time a new bond is formed enforces the structure in Eq. (42). We shall assume, as in Ref. 23, that Δ*G*_{y} is independent of Δ*G*_{pol}.

Using Eqs. (37) and (38), we find the stall point to be Δ*G*_{pol} = Γ = −ln *M* and the bulk frequencies at stall $\epsilon stall(y)=1M$, where *M* the number of monomer types. Physically, we can understand these results by considering balanced models of templated polymerization with autonomous separation. For such models, by definition, there is no entropy production in internal cycles and therefore, the stall point must be thermodynamic equilibrium. In such models, the only driving comes from the polymerization Δ*G*_{pol} and the entropic effect having *M* monomers to choose. These two effect balance at equilibrium.^{64}

Next, let us consider the limit that the completion reactions highlighted in red in Fig. 4(a) are much slower than the other reactions. Explicitly, let *k* be some rate constant at the same order of magnitude of the rates of the process that are not the rates for the completion transitions indicated in red in Fig. 4(a). Write the completion rates as $kcomRcom+(y,x)$, where *k*_{com} ≪ *k* is a rate constant controlling the overall speed of the completion reactions and $Rcom+(y,x)$ provides any sequence dependence. Similarly, the reverse transitions along the completion edges have the rate $kcomRcom\u2212(y,x)$. Furthermore, let there be *n*_{com} such completion reactions in a given petal of the step-wise process (we shall assume this number is the same for all pairs of monomers *x*, *y*).

Assume for simplicity that all completion reactions, $Rcom\xb1(y,x)$, take the same form in a given petal. Then, we can write the sum over spanning trees *Q*(*y*, *x*) as

since Λ^{−}(*y*, *x*) has first order terms in *k*_{com}/*k*. This fact can be seen from noting that the leading order terms in *Q*(*y*, *x*) are the trees with no completion reactions and the leading order terms in Λ^{−}(*y*, *x*) are those same leading order trees of *Q*(*y*, *x*), except with one completion reaction added in. There are *n*_{com} such completion reactions and each adds the same leading order term to Λ^{−}(*y*, *x*). With *Q*(*y*, *x*) taking this form, and remembering Eq. (42), the propensities take the following form:

The *n*_{com}*k*_{com} term cancels in ratios of *ω*_{±yx} variables and, therefore, does not affect the sequence statistics. Thus, in the slow completion limit, such models are only affected by the binding free energy differences (Δ*G*_{y} − Δ*G*_{x}), the driving (Δ*G*_{pol}), and the nature of the final completion step $(Rcom\u2212)$. Therefore, the finer details do not affect the statistics of the polymers.

Assuming that all completion edges are associated with the same free energy change −Δ*G*_{pol}, so that $Rcom\u2212(y,x)=e\u2212\Delta Gpol$, we may solve for the statistics explicitly. For the case of two monomer types, *M* = 2, we find the bulk frequency to be ( Appendix I):

where *DG* = Δ*G*_{1} − Δ*G*_{2}. This expression is plotted in Fig. 5 for *DG* = 4. From this expression, we can confirm explicitly by substituting in the stall driving, Δ*G*_{pol} = −ln 2, that the bulk frequency, indeed, becomes $\epsilon (1)=12$. Furthermore, taking the irreversible limit, Δ*G*_{pol} → *∞*, we find that the bulk frequency becomes

which is the equilibrium statistical mechanics probability of choosing state 1 with free energy −Δ*G*_{1}, given state 2 has free energy −Δ*G*_{2}. Since the completion reactions are slow and irreversible, in this limit, the process of selecting the monomers is allowed to equilibrate. Therefore, copolymerization is simply sampling from the equilibrium distribution of this process and, hence, tends to the result predicted by equilibrium statistical mechanics.

Equation (44) shows that in the slow completion limit, the finer details of the reaction network leading to the selection of a specific monomer becomes unimportant and the models collapse onto a single accuracy curve determined by *DG*, Δ*G*_{pol}, and $Rcom\u2212$. Conversely, if we fix all parameters except *k*_{com}, we seem to see that the bulk frequencies will tend monotonically to their limits as *k*_{com}/*k* → 0, either from above or below.

We can use this fact to compare bulk frequencies for certain types of models. For example, we may compare on-rate discrimination,^{26} where incorrect monomers bind more slowly, to off-rate discrimination,^{26} where incorrect monomers unbind more quickly. An example model comparing on-rate and off-rate discrimination is plotted in Fig. 5 for a model defined in Appendix J. Consider the bulk frequency of an incorrect monomer. On-rate discrimination benefits from fast polymerization and, therefore, tends to its slow polymerization limit from below, whereas off-rate discrimination benefits from allowing the process selecting monomers to equilibrate and, hence, tends to its slow copolymerization limit from above. This fact sets up a hierarchy for a given set of parameters and moderate or strong driving; for the bulk frequency of incorrect monomers, off-rate discrimination > slow copolymerization > on-rate discrimination. This observation is consistent with the results of Sartori and Pigolotti^{26} and Poulton *et al.*^{23} for kinetic (on-rate) and energetic (off-rate) discrimination.

### C. Hopfield’s kinetic proofreading in a model of templated copying with autonomous separation

For our final example, we shall consider an explicit model of copolymerization, with Hopfield’s kinetic proofreading mechanism incorporated into a templated copolymerization system with autonomously separating product in a thermodynamically valid way. From this setup, we can provide a fully worked example of an explicit model and demonstrate the power of the method for analyzing sequences of models with recursive structures as we look at a generalized version of Hopfield’s proofreading incorporated into a model of templated polymerization with autonomous separation.

Explicitly, we first consider the one-loop model of kinetic proofreading shown in Fig. 6(a). There are two monomer types, the right ones *x* = *r* and wrong ones *x* = *w*. Note that we have already transformed the model so that the sequence of the copy is defined relative to that of the template.^{35} These monomer types exist in inactive and active states with concentrations *M*_{in} and *M*_{act}, respectively, relative to some reference concentration, with each monomer type having the same concentration. As previously, we shall assume the environment is sufficiently large such that these concentrations remain constant.

The monomers may bind to the template either in an active or inactive state with binding free energies −Δ*G*_{x} for monomer type *x*. Inactive monomers may be activated on the template with a free-energy change of Δ*G*_{act}. Finally, active monomers may be polymerized into the copolymer chain, with a free-energy change −Δ*G*_{pol}. Subsequently, the penultimate monomer of the copolymer unbinds from the template. Each of these reactions is assigned a forward and reverse reaction rate consistent with the thermodynamic model; the full model is illustrated in Fig. 6(a). Conceptually, the proofreading motif functions by providing two opportunities to reject the unwanted monomer *w*: first, when the unactivated monomer binds and second, after it has been activated. To be effective, a non-zero affinity is required to drive the system around the cycle of states in the correct order: unbound template site → unactivated monomer bound → activated monomer bound.^{20,43} We emphasize that this model differs from Hopfield’s original description in two important ways: first, we consider a full, microscopically reversible polymerization process, rather than a single incorporation step with irreversible polymerization; and second, we embed the proofreading motif into a non-trivial polymerization process involving autonomous detachment from the template.

Given the model as described in Fig. 6(a), we first identify the propensities *ω*_{xy} connecting completed states. Due to the petal-like structure, we can follow Eq. (13) and simply consider spanning trees of the petal sub-processes illustrated in Fig. 6; Λ^{−}(*y*, *x*) rooted at &*x*, Λ^{+}(*y*, *x*) rooted at &*xy*, and *Q*(*y*, *x*) for a petal connecting &*x* and &*xy*. Explicitly writing out the sums of spanning trees, we obtain

Here, we add a subscript 1 to denote these quantities as applying to the simple, “1-loop,” Hopfield model, which we shall extend to allow more loops later. We note that the ratio Λ^{+}(*y*, *x*)/Λ^{−}(*y*, *x*) factorizes as in Eq. (36), and so, we can easily write down the stall condition as Δ*G*_{pol} = Γ with

Note that by setting *M*_{in} = *M*_{act} = 1, Δ*G*_{act} = 0 in Eq. (50), Γ collapses to −ln 2 as these conditions reduce the system to a balanced one with a stall point at equilibrium, as in Sec. III B.

The frequency of the right and wrong monomers *ɛ*(*x* = *r*, *w*) may be calculated from Eq. (18) (the calculation is implemented in the supplementary material). We plot copying error, as represented by *ɛ*(*w*), in Fig. 7(a), and demonstrate that it agrees well with the results found from a Gillespie simulation^{65} of the same model. We also compare to a “0-loop” version of the model, in which the inactivated monomers and the inactivated monomer bound state are omitted. As can be seen, the proofreading motif generally improves accuracy when driven above its stall point Δ*G*_{pol} = Γ. Indeed, we may write down expressions for the bulk frequency in the irreversible limit (Δ*G*_{pol} → *∞*) using Eq. (34). In this irreversible limit, we recover Hopfield’s classic argument by taking some further limits consistent with his analysis. Namely, letting *M*_{act}, *k*_{act}, *k*_{pol} → 0, we find $\epsilon (w)/\epsilon (r)=e2(\Delta Gw\u2212\Delta Gr)$. In this limit, the ratio of incorrect monomers to correct ones involves the square of the binding free energy difference, reflecting the fact that two steps of discrimination have occurred.

We may also write down expressions for the expected chemical work done per net step of the process. This quantity will involve the total current to absorbing states of the step-wise process for starting with a copolymer &*xy*, which we may write as

where $N$ is a normalization factor that will cancel out of calculations. In order to track each of the terms here, we shall break down the contributions to the chemical work done into three parts, one for each of the petals present in the step-wise process. These three petals correspond to adding a monomer type *r*, adding a monomer type *w*, or removing a monomer type *y*. Let us label each of these contributions to the chemical work with a subscript, $Gr(y,x)$ for the transition &*xy* → &*xyr*, $Gw(y,x)$ for the transition &*xy* → &*xyw*, and $Gq(y,x)$ for the transition &*xy* → &*x*. From the *r* petal, we have

The first line of Eq. (52) corresponds to the chemical work associated with the internal cycle (inactive monomer binds, gets activated, and the activated monomer unbinds). The second line corresponds to an external cycle: an inactive monomer binds to the template, is activated, and is polymerized into the chain with the previous monomer *y*, detaching from the template. The third line corresponds to the alternative external cycle: an active monomer binds to the template and is polymerized with monomer *y*, unbinding from the template. We may similarly write down $Gw(y,x)$ as Eq. (52), except swapping *r* and *w*. Finally, the contribution to the chemical work from the petal for removing monomer *y* may be written as

Here, only external cycles are possible. The first line corresponds to monomer *x* rebinding to the template, monomer *y* being depolymerized, this monomer being deactivated and an inactive monomer *y* unbinding from the template; and the second line to *x* rebinding, *y* being depolymerized, and an active monomer *y* unbinding from the template. The distribution *ξ*(*y*, *x*) may be calculated from Eq. (20) and *P* from Eq. (21) (both demonstrated in the supplementary material), letting the chemical work done per net step of the 1-loop model be written as

This chemical work done is plotted for a certain set of parameters in Fig. 7(b) and is also compared to both the results of direct simulation and the simpler “0-loop” model, which has chemical work Δ*G*_{pol}. The free-energy cost of the proofreading mechanism diverges as Δ*G*_{pol} → Γ since there will be a finite chemical work done per monomer addition/removal step due to the proofreading internal cycle and the number of addition/removal steps per net step diverges. Furthermore, for large Δ*G*_{pol}, the work tends to be dominated by Δ*G*_{pol}, albeit very slowly, as shown by the orange line gradually approaching Δ*G*_{pol} (the blue line) in Fig. 7(b).

In addition, we can find an expression for the time taken per net step forwards, Eq. (22). For this quantity, we need the explicit expression for the normalization, $N$. Similar to the chemical work, we can split this term into contributions from the petal adding an *r*, $Nr(y,x)$; from the petal adding a *w*, $Nw(y,x)$; from the petal removing monomer *y*, $Nq(y,x)$ and a contribution from the central node. These normalization terms come from the sums of the spanning trees directed to the individual nodes in the closed step-wise process. We see that

with a similar result for $Nw(y,x)$ except swapping *r* and *w*. Finally, for the monomer removal petal, we have

The total normalization is, then,

with the last term being the contribution from the starting, central node. This normalization can be used in Eq. (51) to give the current to absorbing states, which can be used in Eq. (22) to find the expected time per net step. This time is plotted in Fig. 7(c), alongside a simulation of the same model and the simplified 0-loop model for comparison. Like the chemical work in Fig. 7(b), the time per net step diverges as Δ*G*_{pol} → Γ, since each monomer addition/removal step will take finite time, but the number of such steps required for a net forward step diverges. Unsurprisingly, the time taken for a given driving for the Hopfield model is longer than that of the simple model, due to the proofreading cycle.

Hopfield’s model for proofreading may be naturally extended to include *N* activation stages instead of just one.^{48,50} We shall call these extensions the *N*-loop Hopfield models. These models can be solved recursively to write down expressions for the sums over spanning trees $\Lambda N\xb1(y,x),QN(y,x)$ as a function of the number of loops *N*. We shall consider the model as in Fig. 6(b). A detailed derivation of the sums over the spanning trees is given in Appendix K. From these sums over spanning trees, we calculate the bulk frequencies, the time taken per net step, and the chemical work done per net step using recursive relations (see Appendix K).

For simplicity, we shall discuss the case where the monomer binding free energy is only dependent on monomer type, not on the activation stage; each activation stage is associated with a free energy change of Δ*G*_{act}; each active monomer is present in the environment at a concentration *M*_{act} except for the inactive monomers at concentration *M*_{in}; and the overall rate constants are *k*_{1} for binding of inactive monomer, *k*_{KP} for binding of active monomers, and *k*_{act} for activation of monomers. Under these assumptions, the corresponding rates are given in Appendix K.

To reduce the frequency of incorrect monomers in the product, we wish to have a low concentration *M*_{act} of active monomers in solution to force the system into utilizing the proofreading cycles. Indeed, the bulk error probability in the irreversible limit [calculated using Eq. (34) and plotted in Fig. 8(a)] shows a strong improvement with loop number for low *M*_{act}, but larger values of *M*_{act} lead to a much worse performance and limited (or negative) returns to increasing the number of loops.

However, for finite driving strength Δ*G*_{pol}, we cannot allow this concentration to be arbitrarily small. To see why, consider the stall point Γ(*N*) derived in Appendix K and plotted for a certain set of parameters in Fig. 8(b). It is observed that the stall point driving increases monotonically with *N* and that this increase is faster and tends to a higher limit for smaller *M*_{act}. We find that the limiting Γ scales approximately linearly with −ln(*M*_{act}). Intuitively, introducing more monomer states at a low concentration in the environment destabilizes the polymer. For a small *M*_{act} and driving Δ*G*_{pol}, the depolymerization of the polymer into these activated states competes with its tendency to grow by binding to and activating the inactive monomers.

One drawback of proofreading with a large number of loops is, therefore, that the tendency to disassemble the growing polymer increases. A second effect is a tendency to introduce errors by alternate pathways if *M*_{act} is non-zero. In particular, for *M*_{act} ≠ 0, we observe in Fig. 8(a) a minimum in *ɛ*_{irrev}(*w*) for a relatively small value of *N*. This minimum can be explained by splitting the pathways by which a monomer can go from solution to being incorporated into the polymer into two; either starting from a fully inactive monomer or from a partially activated one. The pathway starting with an inactive monomer will have the highest discrimination between the right and wrong monomers and will improve exponentially with more loops, as demonstrated by the exponential decrease in error for *M*_{act} = 0. However, the probability that a monomer taking this path will reach polymerization falls exponentially with loop number at the same time. On the other hand, the pathway from partially active monomers will give an error that reaches some non-zero limit as the number of loops *N* increases. Furthermore, the rate with which activated monomers bind to an available template site and subsequently get incorporated into the polymer will also tend to a constant. As such, the error will initially decrease exponentially with *N*, but for non-zero *M*_{act}, will eventually become dominated by the less discriminating, partially active monomer pathways through which monomers are more likely to be incorporated into the polymer.

Having calculated the error probability *ɛ*(*w*) at finite driving, plotted in Fig. 9(a); having used *ɛ*(*x*, *y*) to calculate the entropy rate; and having calculated $\Delta G$, we can evaluate the efficiency *η*, as in Eq. (26) (see the supplementary material for demonstrations). This efficiency is plotted in Fig. 9 for *N* = 0, 1, 5, 10 and a certain set of parameters. Although accuracy is generally increased above the stall point, we see that in this particular model, kinetic proofreading requires much more work than the minimum required to generate information and as such is inefficient. In addition, the gradient of the efficiency at minimum driving Γ_{N} is zero for *N* > 0, reflecting how at minimum driving, the number of monomer addition/removal steps diverges, but the chemical work done per such step remains finite.

## IV. CONCLUSION

We have presented a method for analyzing copolymerization models with complex networks of reactions leading to the incorporation or removal of monomers. By coarse graining, a model may be transformed into a simpler model that may be solved and, thereafter, information from the fine-grained process may be put back into the model to extract thermodynamic or kinetic quantities such as chemical work done, molecule exchange, or time taken. The approach allows for complex incorporation motifs to be considered alongside the nearest neighbor interactions in a thermodynamically well-defined model of polymerization with microscopic reversibility. We note that all of these features were present in the kinetic proofreading example in Sec. III C. Moreover, phenomena such as the shift in stall point with loop number and the non-monotonicity of the error rate with the loop number rely on these features being present in the model.

In general, this method provides a way to extract model predictions numerically quickly and without the need for simulations. Doing so is particularly useful when simulating polymer growth is slow, either due to the details of the incorporation process or because the polymer is near its stall point. In addition, the approach makes screening of a large parameter space for a given model topology feasible.

In addition to the numerical performance, this approach allows for analytic results in simpler models or those with helpful symmetries and in certain limits for more complex models. The process of summing over the spanning trees is particularly well suited to identifying the structure of the process and providing simplified results.

Moving forward, it is an open question as to whether components of the techniques developed here can be applied outside of the context of infinitely long polymers whose tips have reached a steady state. An obvious goal would be a simplified way to analyze finite-length “oligomers.”^{24} More generally, we believe that the key equation of this paper, Eq. (14), may be applied more generally for the coarse graining of Markov processes and, in particular, that if a set of states are enclosed between two boundary states, in the sense that any path from one of the trapped states to outside must pass through one of the boundary states, then this set of states may be replaced by a pair of edges analogously to Eq. (14) that shall preserve the steady state properties of the Markov process.

This framework could be applied to explore models of copolymerization processes such as those presented in Refs. 17–19, 22, 23, 27, 28, 30–35, 39–41, and 45–51, more straightforwardly or more thoroughly. Alternatively, the method would allow for more complex reaction steps to be included in such models. The framework presented here is particularly useful when backward steps are relevant, either when the system is weakly driven and, thus, operating near the stall point or when thermodynamics is of importance or interest. We also predict that it will be useful to guide design principles for synthetic copolymerization systems, which are often particularly well-described by the class of models studied here.

## SUPPLEMENTARY MATERIAL

The supplementary material contains a C++ script implementing the Gillespie algorithm that reproduces the data for the 1-loop Hopfield kinetic proofreading model presented in Fig. 7, and a MATLAB script for numerically calculating quantities of the 1-Loop and N-Loop Hopfield kinetic proofreading models presented in Sec. III C and shown in the solid lines of Fig. 7 and the points of Figs. 8 and 9.

## ACKNOWLEDGMENTS

This work was part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Program (Grant Agreement No. 851910). T.E.O. was supported by a Royal Society University Fellowship. J.J. was supported by a Royal Society Ph.D. studentship.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

All authors conceived of the project. B.Q. produced the methodology and analysis and wrote the initial draft. All authors interpreted results and reviewed and edited this paper.

**Benjamin Qureshi**: Conceptualization (equal); Formal analysis (lead); Methodology (equal); Writing – original draft (lead). **Jordan Juritz**: Conceptualization (equal); Visualization (equal); Writing – review & editing (equal). **Jenny M. Poulton**: Conceptualization (equal); Writing – review & editing (equal). **Adrian Beersing-Vasquez**: Conceptualization (equal); Writing – review & editing (equal). **Thomas E. Ouldridge**: Conceptualization (equal); Funding acquisition (lead); Supervision (lead); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.7271702.

### APPENDIX A: FACTORIZING SUMS OF SPANNING TREES

We note here that sums of spanning trees can be factorized in terms of self-avoiding walks (SAWs), a result which is both useful for generating sets of spanning trees and allows us to make statements about ratios of propensities of balanced models. For a given process, $G=(X,K)$, for which we wish to find the sum of spanning trees rooted at $x1\u2208X$, we may factorize this sum in terms of self-avoiding walks (SAWs) between two vertices in the graph. Select some other arbitrary vertex $x2\u2208X/{x1}$ and let $S(x2,x1)$ be the set of SAWs from *x*_{2} to *x*_{1}. For each $S\u2208S(x2,x1)$, we can construct $GS=({s}\u222a(X/S),KS)$ analogously to Eq. (9), whereby we collapse the nodes *S* in the SAW into the single node *s*. The sum over spanning trees rooted at *x*_{1} may then be written as

where $T(x),TS(x)$ are the sets of spanning trees directed to *x* for the original process $G$ and the new process $GS$. For example, in Fig. 3(a), the spanning trees are arranged in terms of SAWs from node 1 to node 3, with the first row for SAW: 1 → 2 → 4 → 3; the second row for 1 → 2 → 3; and the last three rows for 1 → 3. Similarly, for Fig. 3(b), the trees are arranged in terms of SAWs from node 1 to node 4 with row one for 1 → 2 → 3 → 4; row two for 1 → 2 → 4; row three for 1 → 3 → 4; and row four for 1 → 3 → 2 → 4.

### APPENDIX B: NORMALIZATION CONSTANT FOR EXAMPLE ABSORBING MARKOV PROCESS

The normalization constant for the closed example process, given in Fig. 2(b), can be found by considering the spanning trees rooted at each of the nodes. Factorizing these in terms of SAWs, we write

The first square bracket corresponds to the trees rooted at node 1, organized by SAWs from node 3; the second, to trees rooted at 2 organized by SAWs from 1; the third, to trees rooted at 3 organized by SAWs from 1; and the fourth, to trees rooted at 4 organized by SAWs from 1.

### APPENDIX C: EQUIVALENCE BETWEEN CHEMICAL WORK CALCULATED FROM EDGES AND CYCLES

Here, we shall show the equivalence of chemical work for a process calculated by summing over edges vs summing over cycles. For this comparison, consider a process $(X,K)$ without any absorbing states (for simplicity) and such that every edge is microscopically reversible, and let *π*(*x*) be the steady state probability to be in state *x*. For an edge *x* ⇋ *y*, as described in Sec. II A 4, the net current through this edge is

We can write *π*(*x*) in terms of the spanning trees by MCTT, and by Appendix A, we may expand the sum over the spanning trees by SAWs from *y* to *x*. For *π*(*y*), we may expand by SAWs from *x* to *y* such that the spanning tree terms of both expansions are the same and only the direction of the edges in the SAW terms is flipped. The net current may then be written as

where $S(x,y)$ is the set of SAWs from node *x* to node *y*; $N$ is the normalization as in Eq. (2), and *e*′ is the edge in the opposite direction, i.e., if *e* = *x* → *y*, *e*′ = *y* → *x*; and the last bracketed term is the spanning tree part for SAW, *S*, as in Eq. (A1). One of the SAWs from *y* to *x* will simply be the single transition *x* → *y*, however, this term will cancel out from the sum leaving just the non-trivial SAWs. Taking a non-trivial SAW from *y* to *x* and multiplying by the rate *K*(*x*, *y*) gives a cycle containing the edge *x* → *y*. Therefore, the current may be written as a sum over cycle currents, as in Sec. II A 4, of cycles which contain the edge *x* → *y* minus those which contain *y* → *x*. Each of the edges contains a contribution to the chemical work $lnK(x,y)K(y,x)$. The total chemical work before absorption is the sum over all edges of these contributions,

Since *J*_{x⇋y} may be split up as a sum over cycles in this sum, we may collect the parts of this sum corresponding to given cycles and convert the sum over edges into a sum over cycles. By performing these operations, we find the contribution to the chemical work from cycle *C* to be $lnA(C)A(C\u2032)$, i.e., the affinities as we might expect. Hence, the sum over cycles is equivalent to the sum over edges.

### APPENDIX D: CYCLES OF THE EXAMPLE ABSORBING PROCESS

We divide the cycles of the example process, shown in Fig. 2(a), into internal cycles, external cycles to absorbing state *A*, and external cycles to absorbing state *B*. First, the internal cycles are

where the cycle is written out below in the clockwise direction. Similarly, we find the external cycles to state *A*,

Finally, the external cycles to absorbing state *B* are

### APPENDIX E: NUMBER OF STEPS PER NET FORWARD STEP OF A RANDOM WALK

Here, we shall derive the number of steps per net forward step of a random walk. Let us set up a random walk as follows: let the state space be the nodes {0, 1, …*L*}, where *L* is the length of the walk (polymer). Let the transition 0 → 1 have probability 1, *i* → *i* + 1 for *i* = 1, …*L* − 1 have probability *p*, *i* → *i* − 1 for *i* = 1, …*L* − 1 have probability *q* = 1 − *p*, and let state *L* be an absorbing state as in Fig. 10.

Then, we wish to find the expected number of steps to absorption given we start from state 0, for which we can utilize the spanning tree methods with Eq. (22). Since the total rate out of any state sums to one, the expected number of steps equals the expected time to absorption. Thus, we can form the closed process starting at 0. Let *f*(*n*) be the sum over spanning trees rooted at node *n* for the closed process. *f*(*n*) is given by

From this equation, the expected number of steps before absorption is

By utilizing the formulas for finite geometric series, we can find the expected number of steps to be

Most of this expression is sub-linear in *L* and as such,

which is the net number of steps per net forward step.

### APPENDIX F: THE FREQUENCY AT STALL IS GIVEN BY THE DIAGONAL COFACTORS OF A MATRIX

We wish to show that, at the stall point, the frequency with which a monomer appears in the bulk of the copolymer is proportional to the cofactor of the corresponding diagonal element of a matrix,

where *A*_{ij} is the cofactor of element *i*, *j* of the matrix $1\u2212Z$. To show this relation, we will rely on the relationship between the cofactors and vectors of the nullspace of a matrix. Let *M* be an arbitrary matrix with a one dimensional nullspace, and let *A* be its matrix of cofactors. Recall that

Thus, any column of *A*^{T} is in the nullspace of *M*. In anticipation, let $\mu \u20d7$ be a vector in the nullspace of *M* and $v\u20d7$ be a vector in the nullspace of *M*^{T}. Since *M* has a one dimensional nullspace, then,

for some arbitrary *i*. Similarly,

for arbitrary *j*.

Looking at Eqs. (15) and (16), noting that near the stall point, *v*_{z} ≪ *ω*_{±y,x}, we see that the tip probabilities *μ*(*x*) form a vector in the nullspace of $1M\u2212Z$ and the tip velocities *v*_{x} form a vector in the nullspace of $1M\u2212ZT$. Hence, we have that

for arbitrary *i*, *j*. Thus, we may choose *j* = *y* and *i* = *x* leading to cancellation such that

Since

we get the required result.

### APPENDIX G: THE FREQUENCIES IN THE IRREVERSIBLE LIMIT ARE GIVEN BY THE STEADY STATE OF A PROCESS OF THE COMPLETE GRAPH

We wish to find an expression for the frequency with which monomer *x* appears in the bulk of the copolymer in the irreversible limit. This limit is such that the backward propensities *ω*_{−yx} = 0. With this assumption, from Eq. (15), we have

With this form for the velocities, we may manipulate Eq. (16),

The last line is the equation for the steady state of a Markov process with probability *μ*(*x*) to be in state *x* and rate *ω*_{+yx} of transition from state *x* to state *y*. Thus, set *μ*(*x*) to be the steady state probability distribution of the Markov process on *M* states with transition rates from state *x* to *y* given by *ω*_{+yx} and *v*_{x} = *∑*_{y}*ω*_{+yx}. Then, calculating

gives the required result. Finding the distribution *μ*(*x*) in terms of the spanning trees of the complete graph on *M* elements gives Eq. (36).

### APPENDIX H: SIMPLIFICATION OF RESULTS FOR FACTORIZABLE RATIOS OF PROPENSITIES

We shall show that, if the ratio of propensities factorizes as in Eq. (38), then we may simplify the stall condition and frequency of monomers at stall. Thinking of the functions *X* and *Y* as column vectors, since they have a discrete domain, the matrix *Z* may be written,

By a well-known result,^{70}

Here, rearranging gives Eq. (39). As shown, the frequency of monomer *x* in the bulk of the copolymer is given by the cofactor of the diagonal elements of $1M\u2212X\u20d7Y\u20d7T$. The cofactor *A*_{xx} may be written as

using the stall condition, where $X\u20d7[x]$ is the vector $X\u20d7$ missing element *X*(*x*), i.e., $X\u20d7[x]=(X(1),\u2026X(x\u22121),X(x+1),\u2026X(M))T$. In addition, because of the stall condition *∑*_{x}*X*(*x*)*Y*(*x*) = 1,

is already normalized.

### APPENDIX I: FREQUENCY FOR A BALANCED MODEL WITH TWO MONOMER TYPES IN THE SLOW POLYMERIZATION LIMIT

We shall derive the frequency of monomer *x* in the bulk of the copolymer with propensities given by Eq. (46), canceling *n*_{com}*k*_{com}, with $Rcom\u2212=e\u2212\Delta Gpol$, and with *M* = 2. With these propensities, Eq. (15) becomes

where *DG* = Δ*G*_{1} − Δ*G*_{2}. These equations may be solved by the following form the velocities:

Doing so reduces Eq. (15) to a quadratic equation,

with one positive root,

when the system is not stalling. *v*_{2} can be found from in terms of *v*_{1} as *v*_{2} = *e*^{DG}*v*_{1}. Furthermore, a quick check confirms *v*_{1} = 0 if Δ*G*_{pol} = −ln 2. Furthermore, with *v*_{y} known, Eq. (16) is a simple linear equation, and *μ* can be found as the eigenvector of the matrix

### APPENDIX J: MODEL USED FOR BALANCED ON-RATE VS OFF-RATE DISCRIMINATION COMPARISONS

### APPENDIX K: EQUATIONS FOR *N*-LOOP HOPFIELD MODEL

The sum over the spanning trees of the *N*-loop model can be written in terms of sums over spanning trees of the lower loop number models. We label the reaction rates for the *N*-loop process as shown in Fig. 6. The *N*-loop model has one more node and two more edges than the *N* − 1-loop model. Let a subscript *N* denote the sums over the spanning trees for the *N*-loop models. Tracking the spanning trees, we see,

with the initial conditions

The sum-product can be eliminated by subtracting terms proportional to $\Lambda N\u22121\xb1,QN\u22121$, leaving just

with the same initial conditions as above. This system of recursion relations may be used to generate the terms of the spanning tree sums quickly.

In certain simple cases, Eqs. (K5)–(K7) can be solved as a function of *N*. For example, when the reaction rates are not a function of *N*, such as

for *n* ∈ {1, …*N*}, for the step-wise process with monomer tip &*xy*, where *k*_{i} are some overall rates, *M*_{in} and *M*_{act} represent the concentrations of inactive or active monomers, respectively, and Δ*G*_{act} represents the chemical work upon moving a monomer up one activation stage. The rates in Eq. (K8) are used for the numeric results in Figs. 8 and 9. In this case, the sums over spanning trees are

where

From Eqs. (K9) and (K10), we may write the stall condition. Noting that, Λ^{+}(*y*, *x*) is independent of Δ*G*_{pol} and Λ^{−}(*y*, *x*) is proportional to $e\u2212\Delta Gpol$, we may write the stall point as

such that the dependence on Δ*G*_{pol} in the logarithm is canceled out.