In this work, we present a kinetic Markov state Monte Carlo model designed to complement temperature-jump (T-jump) infrared spectroscopy experiments probing the kinetics and dynamics of short DNA oligonucleotides. The model is designed to be accessible to experimental researchers in terms of both computational simplicity and expense while providing detailed insights beyond those provided by experimental methods. The model is an extension of a thermodynamic lattice model for DNA hybridization utilizing the formalism of the nucleation-zipper mechanism. Association and dissociation trajectories were generated utilizing the Gillespie algorithm and parameters determined via fitting the association and dissociation timescales to previously published experimental data. Terminal end fraying, experimentally observed following a rapid T-jump, in the sequence 5′-ATATGCATAT-3′ was replicated by the model that also demonstrated that experimentally observed fast dynamics in the sequences 5′-C(AT)_{n}G-3′, where *n* = 2–6, were also due to terminal end fraying. The dominant association pathways, isolated by transition pathway theory, showed two primary motifs: initiating at or next to a G:C base pair, which is enthalpically favorable and related to the increased strength of G:C base pairs, and initiating in the center of the sequence, which is entropically favorable and related to minimizing the penalty associated with the decrease in configurational entropy due to hybridization.

## I. INTRODUCTION

The kinetics and dynamics of nucleic acids have been shown to be a significant factor in both biological processes and a number of applications outside of biology. DNA duplexes are dynamic molecules that constantly undergo motions including configurational changes that can, but do not always, include a localized loss of base pairs. A large number of distortions to the “ideal” double helix have been identified that significantly impact the physical properties of DNA and play a role in biological processes.^{1,2} DNA duplexes also dynamically break and form base pairs ranging from the fluctuations involving a single base pair, or base flipping,^{3} to continuous stretches of broken base pairs, i.e., bubbles.^{4} It has been proposed that the dynamics through which a base adopts an extrahelical configuration play a significant role in biological processes, for instance, in the recognition of the target base by an enzyme.^{3,5} DNA breathing modes, where base pairs dynamically open and close along a stretch of the double helix, have been thoroughly studied and shown to play a role in a large number of biological processes. Two examples among many others are the recognition of thymine dimers formed due to UV damage and the initiation of DNA transcription.^{4,6–10}

While association mechanisms of short oligonucleotides have been discussed since the 1950s, many open questions remain. One of the primary conceptual pictures used to discuss DNA association is the nucleation-zipper mechanism.^{11–17} In the nucleation-zipper picture, there are three distinct phases of the association process. In the first phase, two single DNA strands, known as monomers, diffuse together and form the first base pair, which is largely considered to be diffusion-controlled.^{13,17} Upon forming the first base pair, the process enters a phase commonly known as the pre-equilibrium.^{11,13,17} During this phase, the partially formed duplex rapidly interconverts between a variety of configurations, all of which are thermodynamically metastable but include at least one intact base pair. The partially formed duplex remains in the pre-equilibrium until it either fully dissociates or the partially formed duplex forms the “critical nucleus.” The critical nucleus contains the minimum number of intact base pairs such that the partially formed duplex is stable and has a significantly greater probability of rapidly zipping up the remaining bases relative to returning to the pre-equilibrium.^{11,13} That is, it signifies that the transition state to hybridization has been crossed. Finally, downhill zipping describes the rapid sequential formation of the remaining base pairs from the critical nucleus to the fully formed duplex, a process that occurs orders of magnitude faster than the formation of the critical nucleus.^{11,12,15,17}

Recently, computational advances in this field have significantly outpaced experimental work. The hybridization, dehybridization, and dynamics of short DNA oligos have been studied utilizing both all-atom^{18–20} and coarse-grained^{21–24} molecular dynamics (MD) simulations. Computational studies have observed a number of different association mechanisms such as the “slithering” and “inch-worm” mechanisms that may play a role in the formation of DNA duplexes. The formation of non-native contacts occurs in both of these mechanisms, suggesting that these contacts play a significant role in facilitating the hybridization of DNA. The significance of the “slithering” and “inch-worm” mechanisms relative to the “zippering” mechanism carries a sequence dependence.^{21,22,25,26} However, the sophistication of modern computational methods, particularly all-atom or coarse-grained MD simulations, makes them not easily accessible to experimentally focused researchers due to their computational expense and the complexity of running and analyzing them. As a result, it would be useful to have a model that provides more detailed mechanistic insight than can be derived from experiment alone while remaining accessible to experimentally focused researchers. Bridging this gap will allow experimentally focused researchers to harness the power of computational modeling to gain additional dynamical and mechanistic insights into DNA systems without making a significant resource investment.

Kinetic Monte Carlo methods, which simulate trajectories of a system evolving in a state space, provide computational and conceptual simplicity and are commonly utilized to study DNA and other biomolecules. Such models can be sophisticated and provide detailed structural information, including instances where their state space is based on Markov state models.^{27–29} In the case of DNA, simpler approaches could use states generated by a lattice model^{30} or the thermodynamic nearest neighbor (NN) model.^{31} In this paper, we describe a simplified kinetic Monte Carlo model for DNA hybridization that builds on states obtained from a new lattice model^{32} and implements Monte Carlo simulations in continuous-time using the Gillespie algorithm.^{33,34} The Gillespie algorithm has been used to study DNA in a variety of contexts including breathing dynamics^{7,35} and the hybridization of a variety of DNA motifs and structures.^{31,36–38}

The Markov state kinetic Monte Carlo model presented here is applied to a number of DNA sequences of differing length and base pair composition to probe the different variables that impact the dynamics, mechanism, and energetic driving forces of DNA hybridization and dehybridization. The primary body of experimental results analyzed using the model are recent temperature-jump (T-jump) dehybridization measurements on the sequences 5′-C(AT)_{n}G-3′, where *n* = 2–6, which will be referred to as the length series or CG-ends.^{39} Additional T-jump experiments are analyzed for the sequence 5′-ATATATATAT-3′, referred to as AT-all, and the sequence, 5′-ATATGCATAT-3′, or GC-core.^{40} The purpose of utilizing these varied sequences is twofold. First, to test how well the model matches the experimental results when the number and location of G:C base pairs and the overall length of the sequence are altered. Second, these sequences provide a window into understanding how these variables affect the dynamics, mechanism, and energetic driving forces behind DNA association and dissociation.

## II. MODEL DESCRIPTION

The Markov state kinetic Monte Carlo model presented here utilizes a state space obtained from a lattice model that has been validated against short DNA oligonucleotides of differing length and composition.^{32} In this lattice model, the thermodynamics of base pair association and dissociation are drawn from the NN parameters published by SantaLucia,^{41} the configurational entropy of partially formed duplexes is determined from a self-avoiding random walk on a cubic lattice, and the translational entropy of the monomer and dimer states at fixed concentration comes from a separate cubic lattice that assigns configurations with two monomers residing in adjacent cells to a dimer. Drawing on the common nucleation-zipper mechanism,^{11,13–16,42–44} where base pairs are added sequentially in hybridization, the kinetic model builds trajectories by stepping through configurations with each step adding or removing a single base pair. The thermodynamics from the lattice model are used in combination with detailed-balance relations in the calculation of rates for moving between states in the kinetic model.

### A. Reaction scheme

To help visualize the state space and how trajectories move through it, Fig. 1 shows the reaction scheme for the sequence 5′-CATATG-3′. The main reaction coordinate trajectories progress along is the number of intact base pairs (*N*_{BP}). This is demonstrated in Fig. 1 by the states denoted either as 2M, designating the monomer state, two dissociated single strands of DNA, or D_{n}, which designates a state with *n* intact base pairs. Within each possible *N*_{BP}, the model can adapt any configuration where all intact base pairs are in-register, meaning that they are aligned properly for the formation of the fully formed dimer, and the intact base pairs form a continuous stretch. All other configurations are excluded. The configurations of unpaired bases, either frayed ends or monomers, are not explicitly considered in the kinetic model, though they are included indirectly when calculating the free energy of each configuration. In doing so, the model in essence assumes that free terminal chains sample all possible configurations very quickly relative to the making and breaking of base pairs. The movement of the two monomers as they diffuse and encounter one another is also not explicitly resolved. When moving between states, only one base pair can form or break during a single step and the value of *N*_{BP} must change. As a result, a base pair may only form or break at the end of the continuous stretch of intact base pairs, with the exception of moving between the 2M and D_{1} states. The first base pair can form anywhere along the sequence with each position sharing the same probability.

### B. Model parameters and rate determination

Moves that form a base pair are broken down into two categories: nucleation, the formation of the first base pair, and propagation, the formation of a base pair next to an already formed base pair. The forward rates utilize three parameters *k*_{f}, *σ*_{i}, and *β* that are conceptually similar to the parameters utilized in the model of Craig, Crothers, and Doty^{11} but with somewhat different definitions. *k*_{f}, which has units of s^{−1}, is the zipping rate constant and can be thought of as the “speed limit” for forming a base pair next to an already formed base pair.^{11–13,45} *σ*_{i} and *β* are unitless scaling factors that describe the attenuation of the base pair formation rate relative to *k*_{f}.^{11}

The attenuation factor for the nucleation step, *β*, describes the rate of forming the first base pair in a nucleation step, *k*_{N}, relative to the zipping speed limit, i.e., *k*_{N} = *βk*_{f}. Calculating *β* assumes that nucleation can be broken down into two independent and sequential steps, the single strands diffusing into the proper orientation followed by the formation of the base pair. With this assumption, the timescale for forming the first base pair can be written as

where *τ*_{D} is the timescale for the two monomers diffusing into proper orientation and *τ*_{f} is the timescale for the formation of the first base pair. Assuming that the two monomers are aligned after the first step, the rate of base pair formation can be presumed to be *k*_{f}; thus, $\tau f=1kf$. Utilizing this and Eq. (1), *β* can be expressed as

To determine *τ*_{D}, we make use of the Stokes–Einstein expression for the diffusion-limited association rate for two identical spheres for a concentration of monomers [M],^{46}

giving

Here, *R* is the ideal gas constant, *η* is the viscosity, and [M] is the initial monomer concentration of the system, which in the context of this work is the equilibrium monomer concentration at the initial temperature prior to the arrival of the T-jump pulse. Determining *β* in this way also incorporates the expected concentration dependence for the association of self-complementary DNA single strands.

The attenuation parameter for propagation steps is denoted as *σ*_{i}, the values of which are contained in the interval (0, 1].^{11} The subscript *i* denotes *N*_{BP} for the initial state, with respect to the forward direction, for the formation of a base pair along an already formed stretch. The value of *σ*_{i} increases with *N*_{BP} in agreement with the conceptual understanding of *k*_{f} being the rate of formation for a base pair at the end of a long series of intact base pairs. While there are a number of different factors that contribute to the increasing rate of formation for each sequential base pair, it has been considered to be primarily due to the additional stability that is associated with the formation of the helical structure that occurs when multiple consecutive intact base pairs exist.^{11}

The definition of *σ*_{i} requires that the values fall between zero and one and that it starts small, monotonically increases, and asymptotically approaches a value of one,^{11}

The hyperbolic tangent function fits these requirements and the intuitive understanding of *σ*_{i}, resulting in the definition

where *α* is a fit parameter in the model that dictates how quickly *σ*_{i} approaches a value of one and *x*_{i} is the fraction of intact base pairs relative to the total number of base pairs $NBPN$.

The only remaining term in Fig. 1 is the free energy difference between the initial state (*n*) and the final state (*m*), Δ*G*_{nm}. The value of Δ*G*_{nm} is the difference between the free energy of the configurations in the final and initial states for a particular move, with the free energy of both configurations calculated using the thermodynamic lattice model previously published by our group,^{32} and is used in the calculation of the reverse rate starting with the detailed-balance relation

where *s* denotes an equilibrium constant for a single base pairing step following the notation used by many in the literature,^{11,17,43}*k*_{n→m} is the forward rate, and *k*_{m→n} is the reverse rate. Since the free energy of each individual configuration is used to calculate Δ*G*_{nm}, the reverse steps in the model carry all of the sequence specificity, an assumption widely used in kinetic Monte Carlo models utilized to study DNA.^{7,31,35–37,47}

Two more assumptions are made when calculating the rates, both involving *k*_{f}. The first assumes *k*_{f} is independent of base pair composition, an assumption commonly made for similar models in the literature^{7,11,12,17,35,45,48} since A:T and G:C base pairs are sterically similar and *k*_{f} should not significantly depend on stacking interactions.^{7}

It is also assumed that *k*_{f} is independent of temperature, which is more contentious in the literature. Models that do not include a temperature dependence exist,^{45} while others do by incorporating an activation energy or directly fitting each individual temperature; however, among these models, the results are inconclusive. It has been proposed that the activation barrier is small and positive, generally in the range of 1 kcal mol^{−1}–5 kcal mol^{−1}.^{12,17,47,48} This leads to the proposal that the elementary formation of a single base pair adjacent to an intact base pair is diffusion-controlled.^{12,17,48} However, caution should be exercised due to studies in the literature examining significantly longer sequences^{17} or fitting as few as two temperatures and acknowledging that under certain experimental conditions, the correct rate as a function of temperature was obtained using an activation energy of zero.^{12} Other experimental results, examining sequences with lengths of 8–14 base pairs, demonstrate that *k*_{f} varies insignificantly and inconsistently with temperature for a given chain length.^{11} It is also worth noting that the formation of a base pair being either diffusion-controlled or barrierless provides further support for both *k*_{f} and the overall forward rate not carrying a sequence dependence, as one would not be expected to exist in either case.

### C. Optimizing fit parameters and obtaining trajectories

Association and dissociation trajectories were generated with the Gillespie algorithm^{33,34} using the “direct method”^{33} to determine both the time interval and the final state for each step. For association (dissociation) trajectories, the system starts in the monomer (fully formed dimer) state and terminates upon reaching the fully formed dimer (monomer) state. To determine the fit parameters *α* and *k*_{f}, the model was parameterized against the experimental T-jump results for sequences of varying base pair composition and length that have been published previously.^{39,40} The parameters for each sequence were fit independently to the observed rate constants, *k*_{obs}, with five or six temperatures included for each sequence. To compare the simulations to the experimental results, a set of association and dissociation trajectories were run for a given set of parameters to determine the mean first passage time for both directions. The first passage time is the time it takes for a trajectory to proceed from the initial state to the final state, which in the case of association (dissociation) is the monomer (fully formed dimer) and the fully formed dimer (monomer), respectively. The mean first passage time is then determined by averaging over all trajectories. To model the conditions of the T-jump experiments, the model is run at the final temperature of the sample after the arrival of the T-jump pulse.

The observed rate constant, which is the standard relaxation rate from perturbative chemical kinetics for the two-state D *⇌* 2M process, is calculated according to a standard two-state kinetic analysis making the assumption that these rates are in response to a weak perturbation, which the temperature jump is assumed to be for the experiments the model is fit to in this work.^{39,40,49} Under this assumption, the observed rate constant is given by^{50}

where [M] is the monomer concentration at the initial temperature prior to the T-jump pulse, *k*_{a} is the association rate constant, and *k*_{d} is the dissociation rate constant. The association rate *k*_{a} is calculated from^{38}

where ⟨*τ*_{a}⟩ is the association mean first passage time from the model. The dissociation rate, *k*_{d}, is calculated from^{38}

where ⟨*τ*_{d}⟩ is the dissociation mean first passage time from the model. The parameters were optimized utilizing a pattern search algorithm that minimized the sum of the squared residuals at each temperature. It is worth noting that these equations are correct for the self-complementary sequences analyzed here and would need to be altered for the case of non-self-complementary sequences.

The number of trajectory sets run during each iteration of the fitting algorithm is twice the number of fit parameters, so in the case of fitting *k*_{f} and *α*, four trajectory sets must be run each iteration and thousands of iterations are required. To reduce the computational expense, the trajectory sets run during the course of the fitting are relatively small, on the order of hundreds of trajectories. A number of optimization routines were run for each sequence with randomized initial parameters until a more concise range in which the parameters were converging was determined. The best parameters from these initial fits were selected and used as the initial parameters for additional optimization routines, using the same method, to determine the final parameters. Once these parameters were determined, a large trajectory set was run with these parameters to ensure that the values compared to the experiment during the fitting routine were representative of the results of the large trajectory set. For the 5′-ATATGCATAT-3′ (GC-core) sequence, 5000 association and dissociation trajectories were run, while for all other sequences, 100 000 trajectories were run. This ensured that there was no error due to the small trajectory sets used during the optimization routines. These large trajectory sets, and the transition rate matrices used to generate them, are the results analyzed in this paper.

The parameters returned by the fitting algorithm are given in Table I. To get a better sense for the values of *σ*_{i}, Fig. 2 shows the *σ*_{i} values at each *N*_{BP} value for each sequence in the length series. This helps to both demonstrate the functional form of *σ*_{i} given in Eq. (6) and provide a clearer conceptual understanding of how the rate of formation for a single base pair increases with an increasing number of previously intact base pairs. One note on the inclusion of *σ*_{i} in this form is that preliminary results were obtained from an earlier version of the model that did not incorporate *σ*_{i} values of any kind, making *β* the only attenuation parameter. This early version generated fits to the experimental data that were relatively comparable to those from the model presented here. The use of a single attenuation parameter would be comparable to the use of an apparent *β*, or *β*_{app}, which has been done by others.^{11,16} However, even though the fit quality was not substantially improved, the *σ*_{i} parameter was incorporated in an effort to provide a clearer physical interpretation of the model’s parameters since it is expected that the rate of formation for multiple base pairs should be attenuated.

Sequence . | Length . | k_{f} (s^{−1})
. | α
. |
---|---|---|---|

CG-ends | 6 | 5.4344 × 10^{11} | 0.6101 |

8 | 2.0969 × 10^{11} | 1.0790 | |

10 | 5.9513 × 10^{10} | 1.5424 | |

12 | 4.0526 × 10^{10} | 1.5665 | |

14 | 7.4036 × 10^{9} | 2.2705 | |

GC-core | 10 | 3.5993 × 10^{10} | 3.7285 |

AT-all | 10 | 3.2769 × 10^{10} | 2.8186 |

Sequence . | Length . | k_{f} (s^{−1})
. | α
. |
---|---|---|---|

CG-ends | 6 | 5.4344 × 10^{11} | 0.6101 |

8 | 2.0969 × 10^{11} | 1.0790 | |

10 | 5.9513 × 10^{10} | 1.5424 | |

12 | 4.0526 × 10^{10} | 1.5665 | |

14 | 7.4036 × 10^{9} | 2.2705 | |

GC-core | 10 | 3.5993 × 10^{10} | 3.7285 |

AT-all | 10 | 3.2769 × 10^{10} | 2.8186 |

It is interesting to note that for all sequences in the length series, due to the value of *α* increasing with increasing length, the values of *σ*_{i} consistently approach a value of one within the formation of 4–5 base pairs, in good agreement with predictions from the literature.^{11} This suggests that the values of *σ*_{i} approach a value of one within approximately a half turn of the DNA double helix, supporting the idea that beginning to obtain the helical structure, and the structural stability associated with it, likely plays a significant role in the rate at which a single base pair forms. The consistency across the different sequences also suggests that defining *σ*_{i} based on the number of previously intact base pairs, rather than the number of unpaired base pairs in the frayed end, results in a more accurate representation of the degree to which the rate of forming an individual base pair is attenuated. This supports the definition of *σ*_{i} utilized in this model while also providing further evidence for the physical interpretation of this parameter.

The observed rate constants, calculated from the mean first passage time for association and dissociation using the two-state analysis, are compared to those determined experimentally in Fig. 3 for all sequences except AT-all. AT-all was excluded since only two temperatures are available, making it a poor metric of fit quality relative to the other sequences. The model is generally in good agreement with the experimental data, particularly at higher temperatures.

The main discrepancy between the model and the experimental results is the ability of the model to replicate the degree to which the different sequences and lengths demonstrate nonlinear trends in the Arrhenius plots. The observed rate constant is closely related to the dissociation rate constant, and a linear Arrhenius plot is indicative of two-state kinetics dictated by a single temperature independent activation barrier. The Arrhenius plots for the dissociation rates are also shown in Fig. 4. The values of *k*_{obs} from the model consistently demonstrate a small degree of nonlinearity such that for shorter CG-ends sequences, where the experimental trends are linear, the model does not fully replicate the linear trend, resulting in some deviation at low temperature. The degree of nonlinearity demonstrated by the model appears to be relatively unaffected by the sequence length and composition. which can be seen by the fact that the model is unable to fully replicate the degree of nonlinearity observed in the GC-core sequence, again deviating at low temperature. Outside of capturing the change in the degree of linearity as a function of length and sequence, the results from the model are shown to be in excellent agreement with the experimental results.

One final note on the agreement between the model and the experimental data concerns the primary insights gained from the model. The primary analysis presented here focuses on the mechanistic insights gained from the association and dissociation trajectories and how they compare to the experimental results. While trends in the model’s parameters exist that are worth discussing, caution should be taken before drawing strong conclusions from them since, while rooted in physical processes, they are a combination of a number of factors that are difficult to disentangle. Further studies that expand the number of different sequences examined by the model should help to alleviate this by providing a larger dataset of model parameters and a more robust analysis, particularly with regards to trends with sequence, length, and other variables, which will allow conclusions to be drawn with more confidence. With respect to future studies, it is worth noting that, in practice, we have found that the lattice model is currently useful for sequences of up to 20 base pairs in length. The Markov state Monte Carlo model has not been used for sequences longer than 14 base pairs due to a lack of experimental data to fit to. However, its current form is capable of simulating sequences of up to 20 base pairs, the limit of the lattice model, given sufficient computing power.

## III. RESULTS

The analysis here will have two different foci. The first aspect analyzed in detail is the barrier crossing event in the trajectories. The barrier crossing event is defined here for association (dissociation) as the portion of the trajectory between the last time the trajectory is in the monomer (fully formed dimer) state until it reaches the fully formed dimer (monomer) state. The barrier crossing events generally occur on a timescale of hundreds of picoseconds for the dissociation and nanoseconds for the association. These barrier crossing events can involve both forward and backward moves, meaning that there is no set number of steps that make up the barrier crossing. However, the minimum number of steps is equal to the number of base pairs in the sequence. For certain sequences, it is not uncommon for the barrier crossing event to include two to three times more steps than the minimum.

It is also informative to analyze the entire trajectory as a whole. The main focus here is examining the early time dynamics, which take place on an approximately nanosecond timescale prior to the dissociation of the full duplex, experimentally observed in some sequences. These early time dynamics will in some cases be referred to as the “fast response” as they experimentally appear as an increase in signal that occurs on a faster timescale than the full dissociation. This analysis examines how the trajectory moves through the state space prior to the barrier crossing event and how long the trajectory spends in states with each *N*_{BP}. Doing so allows for a quantitative comparison between the dissociation trajectories for each sequence, and the resulting trends with length and sequence composition, to the changes observed in the experimental results. This comparison will help to further clarify the dissociation dynamics that are occurring.

### A. Barrier crossing pathways

Utilizing transition pathway theory (TPT),^{51–54} individual pathways for barrier crossing events can be isolated and ranked according to the frequency at which they occur. It is important to distinguish individual pathways from the overall mechanism. Individual pathways are one possible way the system can move through different configurations between the initial and final states in the association or dissociation barrier crossing events. In this context, the overall mechanism incorporates the entire distribution of individual pathways and is a more general view of how the system progresses through a barrier crossing event.

Figure 5 shows the top six association pathways isolated by TPT for three different sequences and the probability of a successful association event occurring along each of the pathways, shown in the bar graphs on the left. The probability is calculated from the percentage of the overall flux between the monomer state and fully formed dimer state for each individual pathway.^{27} Black boxes represent intact base pairs, and time proceeds up the chart with the first base pair formed in the second row from the bottom and the reaction proceeding upward to reach the fully formed dimer state in the top row. Since these sequences are all self-complementary, each pathway has a symmetric partner that is identical and carries the same probability. For example, the first 5′-CATATATATG-3′ pathway in Fig. 5 initiates at one end and zips up sequentially across the sequence. The symmetric partner of this pathway is identical except that it starts at the other terminus. For all self-complementary sequences, each pair will be referred to as a unit, for example, referring to the two most dominant pathways refers to the two most dominant sets of pathways, which are the top four individual pathways. The probabilities shown in the bar graphs in Fig. 5 represent the probability of an association event following each individual pathway. For example, the probability of an association event following the top 5′-CATATATATG-3′ pathway shown or its symmetric pair is 16% since the probability for each pathway is 8%.

After examining the different pathways shown in Fig. 5, two prominent motifs emerge. The first pathway motif initiates in the center of the sequence, and the two sides symmetrically form base pairs, keeping the two frayed ends of similar or identical length, until the duplex is formed. This pathway motif can be observed in Fig. 5 as the top pathways for both 5′-ATATATATAT-3′ and 5′-ATATGCATAT-3′ in addition to the third pathway for 5′-CATATATATG-3′. The second motif initiates at or near a G:C base pair, and in the case where the pathway does not initiate at the G:C base pair, it forms as early as possible. This is best seen in the top two pathways for 5′-CATATATATG-3′, where this pathway is distinct from the center-initiated pathway, and the top pathways for 5′-ATATGCATAT-3′, where the two motifs overlap due to the location of the G:C base pairs. The probability of each pathway occurring is correlated with how closely the pathway follows each of these motifs. Initiating closer to the center increases the probability of the pathway as does initiating closer to a G:C base pair. Looking at 5′-ATATGCATAT-3′, where the two pathway motifs overlap, initiating further from the center and increasing the difference in length between the two frayed ends result in the probability of these pathways rapidly decreasing. While deviating from either motif results in decreasing probability, the G:C base pair driven motif demonstrates a more rapid drop off. This can be seen by comparing the probability of the first two pathways for 5′-CATATATATG-3′ and 5′-ATATATATAT-3′. Comparing these two sequences also shows that initiating at or near a G:C base pair carries a more dominant effect, as seen by examining the ranking of the top six pathways for 5′-CATATATATG-3′.

### B. Overall mechanistic insights

While the individual pathways are informative on a microscopic scale, it is important to more generally consider the mechanism for monomer–dimer transitions in terms of the overall two-state reaction. However, our focus at this point remains on the barrier crossing event itself. One interesting aspect of the overall mechanism is the probability of initiating a barrier crossing at different positions. This can be determined by summing over all TPT pathways, the result of which is shown in Fig. 6.

One of the more interesting observations of Figs. 5 and 6 is that while the dominant individual pathway for 5′-CATATATATG-3′ initiates at a terminus, those positions are the least likely to initiate a successful association barrier crossing event. The most probable position is actually either next to the G:C termini, the position at which the second most probable individual pathway initiates, or in the center of the sequence depending on temperature. It is also interesting to note that the third position has a lower probability than either of its neighbors, which is related to the previously discussed pathway motifs. The center-initiated preference results in the dome shape and explains why the fourth position from the end is more probable than the third position. The second position is more probable than the third position because it receives a significant benefit from forming next to a terminal G:C base pair. This does not, however, explain why the most dominant pathway for 5′-CATATATATG-3′ initiates at the least likely position to initiate a successful association barrier crossing. The reason for this has to do with Fig. 6 incorporating all possible pathways. Combinatorics dictates that more pathways initiate at the positions near the center relative to the positions near the end. The number of potential pathways at each position can be found by looking along the row of Pascal’s triangle whose length is equal to the number of base pairs in the sequence. Even though these pathways become less probable, the cumulative effect makes a significant difference. This explains why the termini are the least likely positions for initiating a successful association event. While the most probable pathway initiates at that position, it is the only pathway. The results in Fig. 6 for 5′-CATATATATG-3′ are in stark contrast to those for the GC-core sequence that demonstrate a very strong preference for initiating in the middle of the sequence, which makes sense when considering the pathways in Fig. 5 and the overlap between the two main pathway motifs.

Figure 6 can also be analyzed to gain insights into the likely locations of the critical nucleus for a given sequence. The critical nucleus is an important structure in the mechanism of DNA hybridization and dehybridization since its identity impacts the energetics and resulting reaction rates for the processes.^{11,13,21,55} While the size of the critical nucleus is debated in the literature and likely depends on factors including sequence composition^{14} and temperature,^{21} it is generally considered to be small and on the order of two to four base pairs.^{11,13,21,55} Considering the relatively small size of the critical nucleus, the probability of initiating at different positions within the sequence can be used as a proxy for the location of the critical nucleus. As such, we can consider what the preference for initiating centrally or at or near a G:C base pair means with respect to the critical nucleus. The fact that association events that initiate near a G:C base pair will form the G:C base pair as early as possible, as shown in Fig. 5, shows that the preference for initiating association at or near a G:C base pair implies that there is a preference for including the G:C base pair in the critical nucleus, which makes sense considering the added stability provided by G:C base pairs. The preference for initiating near the center of the sequence implies that the critical nucleus also has a preference for forming centrally. This means that the critical nucleus preferentially forms either in the center of a sequence or, in the case where there are no G:C base pairs in the center of the sequence, incorporates one or more G:C base pairs. In instances where there are one or more G:C base pairs near the center of a sequence, such as the GC-core sequence analyzed in this work, the critical nucleus will very likely incorporate them.

It is worth briefly comparing these results to the literature, particularly coarse-grained MD simulations. Coarse-grained MD simulations have found that contacts in the center of the sequence are critical for hybridization, particularly in the case of more randomized sequences.^{23,24} For both randomized and repetitive sequences, nucleation is biased toward the center,^{23} and one study found that middle to middle nucleation events represent more than 80% of all those possible for all oligos examined.^{22} All of which is consistent with the findings from the model presented here which also demonstrates a strong preference for initiating association events centrally.

It is also interesting to examine the influence of G:C base pairs on the association initiation position. It has been proposed that sequences that contain them are expected to initiate at their position.^{14,56} While a preference for forming at or near G:C base pairs exists, it is still location dependent and not overwhelming. While the findings for GC-core do show that a large number of initiations will occur at the G:C base pairs, it is still less than 50% of all initiations for all temperatures studied here. For CG-ends, this number is even lower with initiation at the terminal G:C base pairs making up less than 28% of all initiations for the shortest sequence and less than 12% for the longest sequence. This further demonstrates that for CG-ends, while initiating at a G:C base pair does appear to result in a dominant individual pathway, when considering the mechanism as a whole, the relative significance of that pathway diminishes, particularly for longer lengths.

### C. Energetic driving forces

Our attention now turns to the driving forces behind the trends observed in the individual pathways and overall barrier crossing mechanisms. Two general motifs were observed, initiating association in the center and initiating at or near a G:C base pair that forms early on. We will now discuss why the center-initiated motif is entropically driven, while the G:C base pair initiated motif is enthalpically driven. These motifs may overlap, resulting in the enthalpic and entropic components driving the same, or competing, pathways.

To demonstrate how entropy favors the center-initiated motif, consider the increased preference for the pathways that follow it, best observed in the top pathway for AT-all in Fig. 5. There are two contributions to this entropically driven preference for initiating centrally. When treating the configurational entropy of frayed ends as a self-avoiding random walk, as our thermodynamic lattice model does,^{32} the highest entropy state for configurations with a given *N*_{BP} has two frayed ends of equal length. The next highest entropy states are those that have two frayed ends of nearly equal lengths and the entropy decreases as the difference between the length of the two ends grows. Finally, the lowest entropy configuration has a single frayed end.^{32} As a result, the preferential configurational entropy drives the preference for initiating association in the center and forming the remaining base pairs by building out symmetrically.

The second factor that drives preference for initiating association events in the center of the sequence, as seen in Fig. 6, is also entropic in nature. Initiating in the center of a sequence results in an increased multiplicity of pathways to the final fully hybridized state. While this additional driving force contributes to the preference for initiating in the center, unlike the contribution from the configurational entropy, it does not result in any preference for building symmetrically out from the center.

Since the pathways that initiate at or next to the terminal G:C base pairs in the CG-ends sequences are expected to have a higher entropic cost, the relative preference for these pathways must be enthalpically driven. This is not particularly surprising as G:C base pairs are known to be more stable than A:T base pairs, and it has been previously proposed that they play a role in the early stages of the association process for this reason.^{14}

### D. Full trajectory analysis

Now, we will step back from looking at the barrier crossing event and examine the entire trajectory, with a particular eye toward the experimentally observed fast response. The primary purpose of this is twofold: to understand how fraying appears in the model by looking at GC-core, where fraying has been experimentally observed, and using this knowledge to gain insight into the fast response that grows in with length in the CG-ends sequences. Analyzing the entire trajectory is difficult since there are thousands of steps in each trajectory. However, due to the construction of the model requiring that all dissociation initiate at the ends any early time response must be fraying. As a result, the individual configurations are insignificant, and the *N*_{BP} at each step can simply be tracked.

It is worth briefly discussing the previously published experimental results that the model is compared to. The GC-core sequence demonstrated a significant early time rise in signal that was not observed for the sequence 5′-GATATATATC-3′. This early time response was attributed to the fast fraying dynamics of the terminal A:T base pairs.^{40} Since the fraying of the terminal A:T base pairs, with no loss of G:C base pairs, was so clearly observed in the experiment, this will be used as a point of comparison for the model to investigate if the model correctly captures this behavior. The study examining the CG-ends sequences found that the sequences showed a small increase in early time signal, significantly smaller than GC-core, which grew with both increasing length and temperature for a given length. Interestingly, rather than adopting the biexponential signal trace observed for the GC-core sequence,^{40} the fast response for the CG-ends series adopted a stretched exponential form and was nearly indistinguishable between A:T and G:C base pairs.^{39} Because the CG-ends results did not demonstrate a clear qualitative mechanistic interpretation like the GC-core sequence, the origin of the fast response in the case of the CG-ends series could not be determined through experiment alone.

Figure 7 contains plots showing the percentage of time spent in states as a function of *N*_{BP} averaged over all trajectories for each length, except for the shortest sequence, of the CG-ends series and the GC-core sequence at either 333 K or 334 K. Looking at the partially intact states, there is a clear distinction between the GC-core and the CG-ends sequences. The increased amount of time GC-core spent in partially intact states is consistent with the experimentally observed fraying.^{40} This demonstrates that the kinetic model replicates the fraying behavior experimentally observed in GC-core while indicating less early time dissociation in the ten base pair CG-ends sequence, also in agreement with experiment.^{39,40}

Not only does the kinetic model agree with the experimental results but also with the thermodynamic lattice model.^{32} Figure 7 compares the percentage of time spent in states during a trajectory with the probability, presented as a percentage, of occupying the same states according to the thermodynamic lattice model, both as a function of *N*_{BP}. Note that the probability of occupying a state is proportional to the free energy of the system and the free energy surface along *N*_{BP}, which is given by $\u2212kBT\u2061lnP$. This demonstrates that over a sufficient number of trajectories, the amount of time spent in different states is primarily dictated by the thermodynamic free energy of the system rather than any significant kinetic factors.

The agreement with experiment is particularly good since the model also demonstrates that for the GC-core sequence, primarily A:T base pairs are dissociating at early time. Any configuration with six or more intact base pairs must have both G:C base pairs intact. While configurations with four or five base pairs are not accessed to any significant degree. The lattice model shows that even when states with four or five intact base pairs are accessed, the probability of still having both G:C base pairs intact is over 99%.^{32} This demonstrates that the kinetic model is accurately depicting the fast fraying dynamics of the GC-core sequence.

Now that it has been established that the kinetic model demonstrates the fraying behavior expected for GC-core, the same analysis can be applied to the CG-ends sequences to understand what is behind the experimentally observed fast response that grows in with increasing length. Figure 7 shows that with the increasing length, and temperature held constant, the amount of time spent in partially intact states increases. This provides clear evidence that even with the stabilizing G:C base pairs on the termini, these sequences become more susceptible to fraying with increasing length. This is in agreement with studies from the literature that have also observed fraying in sequences with G:C end caps.^{57,58} While the trends with length are not nearly of the magnitude observed for GC-core, this is reasonable since the trends observed in the experimental results for the CG-ends sequences are also of smaller magnitude. This suggests that fraying is the likely source of the increasing stretching factor observed in the CG-ends sequences with increasing length. It is worth noting briefly that in conjunction with CG-ends, the sequence 5′-GATATATATC-3′ was briefly examined to identify if switching the bases at the 5′ and 3′ ends affected the behavior observed by the model. Preliminary work examined the probability of occupying non-monomer states with a given *N*_{BP}, similar to Fig. 7. This showed that the 5′-GATATATATC-3′ sequence would undergo similar fraying behavior to CG-ends with the only difference, resolved by the model, being a slight increase in terms of the average size of the frayed ends.

## IV. CONCLUSION

The two parameter Markov state Monte Carlo model presented here is able to reasonably reproduce the experimental results and is in agreement with the existing coarse-grained MD simulations of more significant complexity. With regards to fast dynamics prior to the full dissociation of the duplex, the model recreates the fast fraying dynamics experimentally observed for GC-core and provides further evidence that these dynamics are primarily driven by thermodynamic factors and the reshaping of the free energy surface rather than kinetic factors. Additionally, the model suggests that the origins of the increasing fast response observed with increasing length in the CG-ends series is similar to those observed in the GC-core sequence, suggesting that fraying plays a significant role in these dynamics.

The model also shows that the initiation position for a successful association barrier crossing, which corresponds to the location of the critical nucleus and transition state, is driven by two factors. Entropic favorability, due to both a contribution from minimizing the penalty due to configurational entropy and from an increased number of available association pathways, preferentially drives initiating at the center of the sequence. The second factor is an enthalpic favorability that preferentially drives initiating near G:C base pairs, if present in the sequence. The effects of these energetic forces, particularly the enthalpic benefit, become far less significant after the formation of the first few base pairs, which is in agreement with the canonical nucleation-zipper mechanism and the corresponding critical nucleus. This demonstrates that the simple, accessible, two parameter Markov state kinetic Monte Carlo model presented here provides additional insights that complement the existing experimental methods to aid in understanding the mechanisms and energetics of DNA hybridization and dehybridization. Additionally, the insights gained from this model build the foundation for continuing research on this topic moving forward. Utilizing it hand-in-hand with future sequence specific kinetic results from different sequence and length motifs will further validate the model and drive a more complete understanding of DNA association and dissociation.

## ACKNOWLEDGMENTS

The authors would like to thank Brennan Ashwood for his thoughtful comments and suggestions and Paul Sanstead for sharing his sequence-dependent dehybridization data. The authors would also like to thank the University of Chicago Research Computing Center for their support of this work. This work was supported by a grant from the National Science Foundation (Grant No. CHE-1561888).

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.