Bayesian nonparametrics (BNPs) are poised to have a deep impact on the analysis of single molecule data as they provide posterior probabilities over entire models consistent with the supplied data, not just the parameters of one preferred model. Thus, they provide an elegant and rigorous solution to the difficult problem of selecting an appropriate candidate model. Nevertheless, BNPs’ flexibility to learn models and their associated parameters from experimental data is a double-edged sword. Most importantly, BNPs are prone to increasing the complexity of the estimated models due to artifactual features present in time traces. Thus, because of experimental challenges unique to single molecule methods, naive application of available BNP tools is not possible. Here, we consider traces with time correlations and, as a specific example, we deal with force spectroscopy traces collected at high acquisition rates. While high acquisition rates are required in order to capture dwells in short-lived molecular states, in this setup, the slow response of the optical trap instrumentation (i.e., trapped beads, ambient fluid, and tethering handles) distorts the molecular signals, introducing time correlations into the data that may be misinterpreted as additional states by naive BNPs. Our adaptation of BNP tools explicitly takes these response dynamics into consideration, in addition to drift and noise, and makes unsupervised time series analysis of correlated single molecule force spectroscopy measurements possible, even at acquisition rates similar to or below the trap’s response time.

## I. INTRODUCTION

Bayesian nonparametrics (BNPs) have recently been introduced to single molecule biophysics.^{1–4} Their unique characteristics make them ideal tools in modeling complex biomolecules as they offer clear physical interpretation and allow full posterior probabilities over molecular-level models to be inferred with minimal subjective choices.^{3,5}

In particular, in single molecule analysis, it is typical to associate a (mathematical) model with a molecular state space of a specific size.^{1,4,6} For example, a molecule with 2 distinct biophysical states is represented by a model with 2 (mathematical) states and the corresponding number of state-specific parameters. Since traditional analysis methods can only infer parameters of particular models,^{7–9} but not populations of parameters, they require an *a priori* particular choice among all such possible models. In other words, a critical requirement is that the size of the molecular state space already be known, which is often not the case.

To this end, BNPs avoid the crucial step of having to select a single model out of many plausible ones as they can infer parameters while simultaneously inferring the appropriate model.^{1–5} Accordingly, BNPs have been used for the analysis of single molecule time series as natural extensions of well-established parametric methods,^{10–14} such as Hidden Markov Models (HMMs), without the need to impose additional model selection procedures.

The infinite Hidden Markov Model^{15} (iHMM) exploits BNPs and is the nonparametric extension of the routinely used HMM. The iHMM performs well in the analysis of decorrelated time series^{1,2} though it is challenged by correlated time series. Indeed, when time correlated traces are analyzed under the assumption of decorrelated data, iHMMs may give rise to a severe overestimate in the number of states because, unlike HMMs which have a fixed number of states, iHMMs recruit states to model the data.

Time correlations are unavoidable experimentally. For instance, time correlations arise when data acquisition times are fast, as is required to capture rapid dynamics of molecules. In such cases, transitions between discrete molecular states may appear smooth or as artifacts of the data acquisition process.

For the sake of concreteness, we focus on traces obtained from single molecule force spectroscopy (FS) where fast measurements may introduce time correlations in the data because the observed dynamics of the molecule are heavily influenced by the response of the optical trap instrumentation. This response includes the motion of the associated bead, the settling of the ambient fluid, and the relaxation of the tethering DNA handles.^{16–18} In fact, measurements in FS probe the motion of the combined molecule-trap system which, at fast time scales, may deviate substantially from the motion of the underlying molecule alone. As a result, sharp molecular transitions may be distorted or smeared out because the combined system, roughly idealized as an elastic spring, transforms transitions from discrete into continuous events that are sampled over multiple time points during the molecule-trap settling time, Fig. 1.

Due to their ability to infer the size of the molecular state space simultaneously with the properties of the constituent states,^{3,5,19} iHMMs can over-interpret the distorted signals and overpopulate the state space with unphysical states in those regions of the time trace where transitions between states appear continuous. Inevitably, if the raw FS traces are used directly, the number of states may be severely overestimated, Fig. 1.

To lessen such inaccuracies, one may naively choose to remove correlations by pre-processing the time trace. For instance, this can be achieved by artificially downsampling the experimental trace,^{20} where multiple consecutive data points are binned together and replaced by their average. Such preprocessing, while reducing correlations between successive measurements within a time series, has two main disadvantages: (i) it defeats the purpose of collecting data at high frequencies in the first place by reducing the effective measurement rate and leaving originally resolved fast kinetics unresolved and (ii) it introduces artificial short-lived dwells due to aliasing.^{20} In turn, these artificial short-lived dwells may also be interpreted by iHMMs as artificial states, leading to the same pitfalls as when using the original, unprocessed, raw signal, Fig. 1.

In summary, iHMMs are advantageous because they do not assume a fixed number of states. Yet iHMMs fail when confronted with the, often unavoidable, problem of correlated time traces. A nonparametric method that can address this new challenge is needed in order to interpret high frequency measurements.

To address this limitation, we propose an extension of the iHMM with continuous control (ICON),^{2} a generalization of the iHMM previously developed to deal with time traces exhibiting drift (such as FRET traces), that makes it appropriate for the correlated traces encountered in FS. With the extended formulation presented here, we account for slow instrumentation responses, noise, and also drift. Additionally, we introduce computational improvements that facilitate the analysis of long traces such as those in FS. A working implementation of the resulting method, termed **xl-ICON**, is available in the supplementary material.

## II. METHODS

Consider a FS single molecule time series $\bar{z}=(z_1,z_2,\ldots,z_N)$, where $z_n$ is the extension measured at time $t_n$, with $t_1$ and $t_N$ denoting the onset and conclusion of the experimental time course, respectively. As the timing accuracy of a typical detector is far better than the extension measurement accuracy, for example, errors of less than a few ps, in this study we ignore timing errors and model $t_n$ as noiseless and known in advance.

To extract the molecular signature from $\bar{z}$, which, as explained in Sec. I, may be masked by slow responses, noise, and drift, we develop a detailed model of the whole biophysical/experimental system. Our formulation, which relies on ICON^{2} and the iHMM,^{1} is described in Sec. II A and depicted schematically in Fig. 2. In Sec. II B, we present the statistical machinery that is required to derive posterior probabilities over multiple models and the associated parameters, and in Sec. II C we describe the acquisition of the datasets used in the examples that follow in Sec. III. For quick reference, a summary of our notational conventions can be found in Appendix A.

A working implementation of the method described in this section can be found in the supplementary material.

### A. Description

#### 1. Molecular kinetics

At the single molecule level, similar to ICON^{2} and the initial iHMM,^{1} we assume that the system of interest alternates spontaneously between a number of distinct molecular states *σ*_{1}, *σ*_{2}, …, where *σ*_{m} is used to represent the biophysical states attainable over the course of the time trace. For example, for a protein undergoing conformational changes,^{16} as shown in Fig. 2, *σ*_{1} = “folded” and *σ*_{2} = “unfolded.” In general, the precise number of unique states *σ*_{m} available to the molecule is unknown and so we must also estimate it along with the other quantities of the model that follow (see Sec. II B 1).

To facilitate the presentation, we let $\bar{s}=(s_1,s_2,\ldots,s_N)$ be the sequence of successive states attained by the molecule during the experiment. Here, $s_n$ denotes the state $\sigma_m$ that the molecule attained at time $t_n$. To model $\bar{s}$, as in previous studies,^{1,10–13,21,22} we assume memoryless switching and adopt simple Markovian kinetics. In statistical notation, these kinetics can be formally expressed as $s_{n+1}\,|\,s_n \sim \mathrm{Categorical}_{(\sigma_1,\sigma_2,\ldots)}(\tilde{\pi}_{s_n})$, where $\tilde{\pi}_{s_n}=(\pi_{s_n\to\sigma_1},\pi_{s_n\to\sigma_2},\ldots)$ gathers the probabilities of the molecule switching from $s_n$ to any $\sigma_m$ at any time step.

The formulation we adopt here assumes that whenever the molecule departs from a state $\sigma_m$ it arrives at a state $\sigma_{m'}$ on average a fraction $\pi_{\sigma_m\to\sigma_{m'}}$ of the time. So, $\pi_{\sigma_m\to\sigma_{m'}}=0$ indicates that $\sigma_m$ never leads to $\sigma_{m'}$, $\pi_{\sigma_m\to\sigma_{m'}}=1$ indicates that $\sigma_m$ always leads to $\sigma_{m'}$, while $\pi_{\sigma_m\to\sigma_m}$ describes self-transitions. The latter can be used to find^{23} the state’s mean dwell time $T_{\sigma_m}=\Delta t/(1-\pi_{\sigma_m\to\sigma_m})$, where $\Delta t$ is the time lapse between successive measurements. Similarly, transition rates^{23} between any pair of states can be obtained from the probabilities $\pi_{\sigma_m\to\sigma_{m'}}$.
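The dwell-time and rate relations above can be sketched numerically. The following fragment uses a hypothetical 3-state transition matrix and a hypothetical time step, neither taken from the paper, purely for illustration:

```python
import numpy as np

# Illustrative sketch: recovering mean dwell times and transition rates
# from the switching probabilities pi_{sigma_m -> sigma_m'} of a
# discrete-time Markov chain. Matrix and time step are assumed values.
dt = 15e-6  # intermeasurement period Delta t, in seconds (assumed)
pi = np.array([[0.90, 0.07, 0.03],
               [0.05, 0.85, 0.10],
               [0.02, 0.08, 0.90]])  # row m holds pi_{sigma_m -> sigma_m'}

# Mean dwell time of state sigma_m: T_m = Delta t / (1 - pi_{m -> m})
T = dt / (1.0 - np.diag(pi))

# For Delta t much shorter than the dwells, off-diagonal probabilities
# map onto transition rates, k_{m -> m'} ~ pi_{m -> m'} / Delta t
rates = pi / dt
np.fill_diagonal(rates, 0.0)

print(T)            # state 1 dwells on average 15 us / 0.10 = 150 us
print(rates[0, 1])  # rate from state 1 to state 2, in s^-1
```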

#### 2. Trap instrumentation response

Unlike the cases modeled by ICON and the iHMM,^{1,2} as the molecule switches from state to state producing $\bar{s}$, it initiates the motion of the optical trap instrumentation that yields the trace $\bar{z}$. To represent the driving signal, we consider $\varphi_{\sigma_m}$ to be the extension associated with state $\sigma_m$ and the extension produced over the course of the experiment to be $\bar{\varphi}=(\varphi_{s_1},\varphi_{s_2},\ldots,\varphi_{s_N})$.

To describe the dynamics of the trap system, we first consider the idealized case where noise is absent. For this case, we let $d_n$ denote the distance that would be recorded by the trap at $t_n$. For sufficiently fast trap systems, $d_n$ is indistinguishable from the driving extension $\varphi_{s_n}$; however, for traps with response times comparable to the data acquisition period, $d_n$ may deviate considerably from $\varphi_{s_n}$ since the influence of past values $d_{n-1}$, $d_{n-2}$, etc., increases with increasing response times. For instance, for traps with exponentially decaying responses, such as those illustrated in Fig. 2, driving extensions and recorded extensions are related by $d_n=\eta_0\varphi_{s_n}+\eta_1 d_{n-1}$, where $\eta_0=\Delta t/(t_{\mathrm{resp}}+\Delta t)$ and $\eta_1=t_{\mathrm{resp}}/(t_{\mathrm{resp}}+\Delta t)$ with $t_{\mathrm{resp}}$ denoting the trap’s response time. In the case of traps with pronounced non-linearities in their response or traps incorporating feedback, generally more than one lag term, $d_{n-1}$, $d_{n-2}$, etc., is required in order to accurately represent $d_n$. Since very little is typically known about the dynamics of the combined molecule/handles/bead/ambient fluid system, we model the response using a general autoregressive process^{8} having the form

$$d_n = \eta_0\,\varphi_{s_n} + \sum_{k=1}^{K}\eta_k\, d_{n-k}, \qquad (1)$$

and, to ensure that for static molecules the recorded extension is identical to the extension of the molecule, we set $\eta_0 = 1-\sum_{k\ge 1}\eta_k$. We note that this condition also ensures “unit magnification” in the optical trap, that is, a change in the molecular extension is transmitted to a change in the recorded extension of the same magnitude, i.e., $\Delta\varphi=\Delta d$, and *not* to a constant multiple of it, i.e., $\Delta\varphi=c\,\Delta d$ for some $c\neq 1$.

The assumption underlying Eq. (1) is that the extension $d_n$ measured by the trap depends on (i) the current state of the molecule, through $\varphi_{s_n}$, and (ii) the extensions measured at the previous $K$ time steps. A consequence is that $d_n$ is determined not only by $\varphi_{s_n}$ but also, implicitly, by the preceding extensions $d_{n-1}$, $d_{n-2}$, …. Therefore, the modeled trap inherits memory, and the strength of this memory is ultimately determined by the values of the autoregressive coefficients $\eta_1,\ldots,\eta_K$. In order to capture memory effects accurately, we learn the values of $\eta_k$ directly from the data (see Sec. II B 1).
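To see how this memory smears sharp transitions, the following sketch simulates the single-exponential case ($K = 1$) with assumed values for $\Delta t$ and $t_{\mathrm{resp}}$; a step in the molecular extension is recorded only gradually:

```python
import numpy as np

# Hedged sketch of the trap memory for the single-exponential case
# (K = 1), with assumed, illustrative numbers: a sharp step in the
# molecular extension phi is smeared over roughly t_resp.
dt = 15e-6       # acquisition period, s (assumed)
t_resp = 150e-6  # trap response time, s (assumed)
eta1 = t_resp / (t_resp + dt)
eta0 = 1.0 - eta1  # enforces unit magnification for static molecules

phi = np.concatenate([np.full(100, 10.0), np.full(100, 50.0)])  # nm, a step
d = np.empty_like(phi)
d[0] = phi[0]
for n in range(1, len(phi)):
    d[n] = eta0 * phi[n] + eta1 * d[n - 1]

# Immediately after the step, the recording lags far behind the molecule;
# only after many multiples of t_resp does it settle at the new extension.
print(d[100], d[-1])
```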

#### 3. Noise

In addition to the driving molecular signal and the induced response of the trap system, the actual recorded extension during the experiment may also be influenced by unspecified perturbations such as mechanical forces, unresolved dynamics, perturbative ambient flows, and thermal fluctuations. Thus, a realistic description of the trap dynamics requires modifying the extension $d_n$ of Eq. (1), which excludes such perturbations.

For a well-calibrated experimental apparatus, the perturbations are expected to be small relative to the typical distances between the molecular extensions $\varphi_{\sigma_m}$. So, we model these perturbations collectively as random fluctuations $w_n$ having the characteristics of white noise. Accordingly, in our formulation, the actual extension $x_n$ recorded by the trap at $t_n$ is given by a modification of Eq. (1) of the form

$$x_n = \eta_0\,\varphi_{s_n} + \sum_{k=1}^{K}\eta_k\, x_{n-k} + w_n, \qquad (2)$$

where the perturbations $w_n \sim \mathrm{Normal}(0,\tau)$ have zero mean and variance equal to $1/\tau$. As a side note, we point out that unlike $d_n$ mentioned earlier, which describes the idealized situation of absent random perturbations, $x_n$ explicitly includes them.
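A noisy recording can be sketched in the same way; here, white perturbations $w_n$ of assumed precision $\tau$ are added to the autoregressive recursion of Eq. (2), with illustrative values throughout:

```python
import numpy as np

# Hedged sketch of Eq. (2) with assumed parameters: the recorded
# extension follows the same autoregressive recursion as before, plus
# white perturbations w_n with precision tau (variance 1/tau).
rng = np.random.default_rng(0)
dt, t_resp, tau = 15e-6, 150e-6, 4.0  # tau in nm^-2 (assumed values)
eta1 = t_resp / (t_resp + dt)
eta0 = 1.0 - eta1

phi = np.full(2000, 30.0)  # a static molecule held at 30 nm
x = np.empty_like(phi)
x[0] = phi[0]
for n in range(1, len(phi)):
    w = rng.normal(0.0, 1.0 / np.sqrt(tau))  # zero mean, variance 1/tau
    x[n] = eta0 * phi[n] + eta1 * x[n - 1] + w

# For a static molecule, the recording fluctuates around the true
# extension; the fluctuations are time correlated through the eta terms.
print(x.mean())
```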

#### 4. Drift

Similar to ICON,^{2} to account for drift in the measured time series, which, as shown previously, may compromise the analysis if ignored,^{2} we couple the molecular trace $\bar{x}$ to an additional process $\bar{y}$ that explicitly incorporates drift. To accommodate drift, we place a set of nodes $\theta_j=(B_j,D_j)$ over the time series and use interpolation^{24} to link across them as shown in Fig. 3. In particular, we place the nodes over the available time series $\bar{z}$ at positions $B_j$ and heights $D_j$. To allow the model drift to fit tightly even near sharp changes or discontinuities, we allow both $D_j$ and $B_j$ to shift. To achieve this, we restrict each $B_j$ within a region $[B_j^{\min},B_j^{\max}]$ and choose $B_j^{\min}$ and $B_j^{\max}$ from a regular grid of time levels that spans the entire time series.

In summary, the drift contribution to the measured extension at $t_n$ is evaluated as

$$y_n = G_{\vec{\theta}}(t_n), \qquad (3)$$

where $G_{\vec{\theta}}(t)$ is the interpolating function^{24} that uses the nodes $\vec{\theta}=(\theta_1,\theta_2,\ldots,\theta_J)$. Depending on the specifics of the experimental setup at hand, $G_{\vec{\theta}}(t)$ may utilize splines or more specialized basis functions. For example, for drift with severe discontinuities, Hermite polynomials^{25} might provide better fits than splines.

For well-designed FS experiments, drift is expected to fluctuate on a slower time scale than the molecular transitions. As a result, assuming separation of time scales in $\bar{x}$ and $\bar{y}$, a sufficient condition to avoid degeneracies in our formulation is $J/N \ll 1$. That is, to accurately capture the shape of the drift, the number of nodes $J$ used in the interpolation needs to be much lower than the total number of time points $N$ in the supplied trace. Generally, a precise upper bound on the ratio $J/N$ is hard to predict as it depends on (i) the length of the trace $N$, (ii) the intermeasurement period $\Delta t$, and (iii) the intrinsic time scale of the drift. As a rule of thumb, we suggest $J/N \approx 0.01$ or lower.

Finally, to resolve non-identifiability issues that arise because the drift may be shifted vertically while the state extensions shift oppositely,^{2} we restrict the allowed combinations of nodes $\theta_j$ to those that yield either a zero global contribution, i.e., $\sum_n y_n = 0$, or a zero contribution locally at the experiment onset, i.e., $y_1 = 0$.
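A minimal sketch of the drift construction follows, using plain linear interpolation (`np.interp`) as a stand-in for the splines discussed above, with assumed node positions and heights, and with the zero-global-contribution constraint enforced:

```python
import numpy as np

# Hedged sketch of the drift model: J nodes (B_j, D_j) are linked by an
# interpolant G (linear interpolation here stands in for splines), and
# the curve is centered so that sum_n y_n = 0, removing the
# vertical-shift degeneracy. All node values below are assumed.
N, J = 5000, 6                  # J/N ~ 0.001, below the 0.01 rule of thumb
t = np.linspace(0.0, 5.0, N)    # measurement times, s (assumed)
B = np.linspace(0.0, 5.0, J)    # node positions (held fixed here)
D = np.array([0.0, 1.5, -0.5, 2.0, 0.5, 1.0])  # node heights, nm (assumed)

y = np.interp(t, B, D)          # y_n = G_theta(t_n)
y -= y.mean()                   # enforce the zero-global-contribution constraint

print(y.sum())                  # ~0 up to floating-point error
```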

#### 5. Measurements

Finally, having incorporated trap instrumentation responses [Eq. (1)], random fluctuations [Eq. (2)], and drift [Eq. (3)], we model any remaining noise as having little to no additional effect. Thus, our net observations are $z_n = x_n + y_n$.

### B. Inference

The parameters that we can infer from a time trace—quantities such as $\tilde{\pi}_{\sigma_m}$, $\varphi_{\sigma_m}$, etc., introduced above—are directly related to the molecular properties of the system, e.g., molecular extensions, kinetics, and dwell times, that we wish to learn. To infer parameter values, we would normally follow the Bayesian paradigm.^{7–9,13} That is, we would ascribe priors to each parameter, combine priors with the likelihood implied by the formulation developed in Sec. II A 5, and subsequently derive a posterior probability distribution.^{7,8} However, since the number of attained molecular states is *a priori* unknown, so is the number of parameter values that need to be inferred. To handle this more general case, we use a nonparametric prior^{1,3,5,19} that allows us to estimate models and their associated parameters simultaneously.

Below we describe the inference procedure for our formulation. A graphical summary is shown in Fig. 4.

#### 1. Specifying the nonparametric prior

To infer a posterior over models incorporating a varying number of states in the molecular state space, we use a (nonparametric) generalization of the Dirichlet distribution^{1,15,26,27} that is routinely used in applications involving HMMs.^{8,28,29} In this formulation, the states available to the molecule are infinite in number and our primary goal is to identify which and how many of them have been visited during the experiment. As is apparent, the majority of these states will remain unvisited since, for example, the molecule can visit at most $N$ of them. In practice, through a careful choice of the prior on the state transitions (see below), most of the visited states will be revisited multiple times and so the number of unique states in $\bar{s}$ will be much lower than $N$.

Specifically, the prior we place on the molecular kinetics is a hierarchical Dirichlet process^{26} of the form

$$\tilde{\beta} \,|\, \gamma \sim \mathrm{GEM}(\gamma), \qquad (4)$$

$$\tilde{\pi}_{\sigma_m} \,|\, \tilde{\beta}, \xi, \rho \sim \mathrm{DP}\!\left(\xi,\ (1-\rho)\,\tilde{\beta} + \rho\,\tilde{\delta}_{\sigma_m}\right), \qquad (5)$$

where $\tilde{\delta}_{\sigma_m}$ is the Dirac delta centered at $\sigma_m$ and $\gamma$, $\xi$, and $\rho$ are hyperparameters that, as explained below, determine the prior probability distributions on the number of molecular states and mean dwell times. In Eqs. (4) and (5), GEM denotes the Griffiths-Engen-McCloskey stick breaking^{19,26,30} and DP denotes the Dirichlet process^{5,19,26,27,31,32} prior.

The unique characteristic of this prior is that when assigning molecular transitions, say from $s_n$ to $s_{n+1}$, it considers three possibilities: (i) the molecule stays in the same state, (ii) the molecule revisits one of the states visited before, and (iii) the molecule visits a state never visited before. Similar to ICON, it is because of the last possibility that, *a priori*, the molecule has access to infinitely many states, and it is up to the model’s likelihood to bias the transitions toward options (i) and (ii) as more data are gathered, i.e., as $n$ moves to higher values. In other words, it is solely up to the likelihood to distinguish *which and how many* of the *a priori* infinite states have actually been visited during the experiment. However, unlike the prior used in ICON, which places *a priori* equal weights on these three possibilities, the present formulation places a higher weight on self-transitions, as induced by the delta terms $\rho\,\tilde{\delta}_{\sigma_m}$ in Eq. (5). This makes the current formulation more appropriate for describing molecules exhibiting long dwells,^{33} for which multiple recurring self-transitions are required, while still maintaining an infinite pool of potential states.
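The behavior of such a prior can be illustrated with a finite, truncated approximation (this is a hedged sketch, not the paper’s exact construction or sampler): global state weights are drawn by GEM stick breaking, and each state’s transition probabilities are drawn from a Dirichlet whose base up-weights self-transitions through $\rho$, mimicking the delta terms of the prior.

```python
import numpy as np

# Hedged sketch: weak-limit (truncated) approximation of a sticky
# hierarchical Dirichlet process prior. All hyperparameter values are
# illustrative assumptions.
rng = np.random.default_rng(1)
M_trunc = 20                     # truncation level for the infinite state space
gamma, xi, rho = 1.0, 10.0, 0.9  # hyperparameters (assumed)

# GEM(gamma) stick breaking: beta_m = v_m * prod_{l<m} (1 - v_l)
v = rng.beta(1.0, gamma, size=M_trunc)
beta = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
beta /= beta.sum()               # renormalize the truncated sticks

# Finite-Dirichlet approximation of the DP rows, with extra prior mass
# rho on each row's self-transition entry
pi = np.empty((M_trunc, M_trunc))
for m in range(M_trunc):
    base = (1.0 - rho) * beta
    base[m] += rho
    pi[m] = rng.dirichlet(xi * base + 1e-12)  # guard against zero alphas

# Self-transitions dominate each row, favoring long dwells a priori
print(np.diag(pi).mean())
```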

In principle, the values of the hyperparameters *γ*, *ξ*, and *ρ* can be inferred from the time series by invoking further hierarchical constructions.^{26,31,33,34} However, because prior information about the anticipated number of visited molecular states and characteristic time scales is often available—for example, either from previous experiments, computational modeling, or simply by intuition based on molecules of similar structure—such constructions may be unnecessary. Instead, below we describe how *γ*, *ξ*, and *ρ* can be chosen to reflect the available information.

For a time series $\bar{s}$ containing $N$ measurements, provided $N$ is large as is typically the case in FS, it has been shown^{31} that the number of distinct molecular states to be visited, implied by the above prior, has a probability distribution that is approximately proportional to $\gamma^m (0.58+\log N)^m/(m-1)!$. That is, when the only information about the molecule is that it has generated $N$ measurements, but the values of these measurements are unknown, the prior in Eqs. (4) and (5) implies that the number of unique states $\sigma_m$ in $\bar{s}$ is random and that its probability distribution is as given above. This relationship can be inverted to initialize a value for $\gamma = e^{\psi(M^*_{\mathrm{mean}})}/(0.58+\log N)$, where $M^*_{\mathrm{mean}}$ is the anticipated number of unique molecular states and $\psi$ is the digamma function.^{35}

Furthermore, the induced state mean dwell times $T_{\sigma_m}$, as shown in Appendix B, have a prior probability distribution that is illustrated in Fig. 5. This distribution has mean $T^*_{\mathrm{mean}}=\Delta t/(1-\rho)$ and mode $T^*_{\mathrm{mode}}=\Delta t\,\xi/((1-\rho)\xi+1)$, which may be used to choose appropriate values for the hyperparameters $\xi = T^*_{\mathrm{mode}}\,T^*_{\mathrm{mean}}/(T^*_{\mathrm{mean}}-T^*_{\mathrm{mode}})/\Delta t$ and $\rho = 1-\Delta t/T^*_{\mathrm{mean}}$. More precisely, in this case, the mean sets the anticipated time scale of the molecular kinetics. In addition, a mode close to the mean indicates low uncertainty in the chosen time scale, as most of the prior probability is concentrated in a region around $T^*_{\mathrm{mean}}$, while a mode distant from the mean indicates high uncertainty in the anticipated time scale, as the prior probability is spread over a wider region, Fig. 5.
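These hyperparameter choices can be evaluated numerically; the sketch below uses assumed anticipated values (the trace length, dwell times, and state count are illustrative, not the paper’s) and a simple asymptotic approximation of the digamma function:

```python
import math

# Hedged numeric sketch of the hyperparameter choices quoted in the
# text, with assumed anticipated values.
dt = 15e-6     # intermeasurement period, s (assumed)
N = 333_333    # number of measurements (assumed, ~5 s at 15 us)
M_mean = 5     # anticipated number of unique molecular states (assumed)
T_mean = 0.01  # anticipated mean dwell time, s (assumed)
T_mode = 0.005 # mode well below the mean -> broad, uncertain prior (assumed)

def digamma(x: float) -> float:
    # Asymptotic approximation of psi(x): shift x upward via the
    # recurrence psi(x+1) = psi(x) + 1/x, then apply the expansion.
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    return result + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x)

gamma = math.exp(digamma(M_mean)) / (0.58 + math.log(N))
rho = 1.0 - dt / T_mean
xi = T_mode * T_mean / (T_mean - T_mode) / dt

print(gamma, rho, xi)
```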

At this point, it is worth noting that although the values of $M^*_{\mathrm{mean}}$, $T^*_{\mathrm{mean}}$, and $T^*_{\mathrm{mode}}$ influence the prior, their effect on the resulting posterior—which incorporates information from the data introduced through the likelihood—decreases as the amount of available data increases or the noise level decreases.^{8,29}

For the remaining parameters, which may be treated within the parametric context, motivated by simplicity, we use normal priors for the molecular extensions $\varphi_{\sigma_m}\sim\mathrm{Normal}(\mu_\varphi,\sigma_\varphi^2)$ and a gamma prior for the precision of the noise terms $\tau\sim\mathrm{Gamma}(\alpha_\tau/2,\,\alpha_\tau\sigma_\tau^2/2)$. Both choices are conjugate^{8} to their respective likelihoods. Consequently, their hyperparameters may be interpreted in terms of pseudo-observations,^{7,8} which in turn may be helpful in choosing appropriate values. Finally, we use normal priors for the autoregressive coefficients of the trap’s instrumentation response $\eta_k\sim\mathrm{Normal}(\mu_\eta,\sigma_\eta^2)$, symmetric generalized beta priors for the placement of the drift nodes $B_j\sim B_j^{\min}+(B_j^{\max}-B_j^{\min})\,\mathrm{Beta}(\alpha_B,\alpha_B)$, and improper^{8} flat priors for the node heights, $D_j\propto 1$. At this point, we want to mention that although these prior choices are mainly motivated by modeling simplicity, due to the typically long length of FS traces, they have little influence on the resulting estimates, though they remain necessary ingredients of Bayesian modeling.^{36}

#### 2. Obtaining estimates

In Bayesian methods, it is often possible to obtain posterior probabilities in analytic form; however, due to the nonparametric prior of Eqs. (4) and (5), this is impossible in our case. Instead, to compute the posterior $P(\bar{s},\tilde{\beta},\tilde{\tilde{\pi}},\tilde{\varphi},\tau,\overleftarrow{\eta},\vec{\theta}\,|\,\bar{z})$, we have to adopt a computational approach. For this reason, we have developed a specialized scheme described in detail in Appendix C. Using this scheme, we can generate samples from $P(\bar{s},\tilde{\beta},\tilde{\tilde{\pi}},\tilde{\varphi},\tau,\overleftarrow{\eta},\vec{\theta}\,|\,\bar{z})$ and subsequently obtain full estimates, such as expected values and credible intervals, of the involved physical quantities.^{1,8,34}

For most parameters of interest—for example, the molecular kinetics $\tilde{\tilde{\pi}}$, extensions $\tilde{\varphi}$, and drift $\bar{y}$—the evaluation of the corresponding estimators is straightforward. Difficulties may arise only when trying to estimate the most probable state sequence $\hat{\bar{s}}$ since the popular Viterbi sequence,^{29,37} common in engineering applications, cannot be easily generalized to the nonparametric context.^{34,38} To that end, following existing approaches,^{38–41} here we propose as an alternative estimator for $\bar{s}$ the state sequence $\hat{\bar{s}}$ that lies closest to the expectation $E(\bar{s}\,|\,\bar{z})$ of the marginal posterior probability distribution $P(\bar{s}\,|\,\bar{z})$. Similar to the expectation $E(\bar{s}\,|\,\bar{z})$, the proposed $\hat{\bar{s}}$ considers the full extent of the model posterior;^{38} however, unlike $E(\bar{s}\,|\,\bar{z})$, which might violate the model kinetics,^{29} the proposed $\hat{\bar{s}}$ is ensured to be in agreement with them as it is chosen exclusively among the feasible sequences $\bar{s}$.
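The idea behind this estimator can be sketched as follows (a hedged illustration with mock posterior draws, not the paper’s sampler output): among sampled state sequences, represented here as hypothetical numeric extension traces, select the one closest to the posterior mean; unlike the mean itself, the selected sequence is always a feasible trace.

```python
import numpy as np

# Hedged sketch of the point estimate described in the text. The
# "posterior samples" below are mock two-level traces generated at
# random purely for illustration.
rng = np.random.default_rng(2)
samples = rng.choice([10.0, 50.0], size=(200, 1000))  # mock posterior draws

mean_trace = samples.mean(axis=0)                     # E(s|z), may be infeasible
dists = np.linalg.norm(samples - mean_trace, axis=1)  # L2 distance per sample
s_hat = samples[np.argmin(dists)]                     # nearest feasible sequence

# s_hat only ever takes values attained by actual sampled sequences,
# whereas mean_trace generally takes intermediate, infeasible values.
print(s_hat.shape)
```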

### C. Example time series

To test and validate our method and demonstrate its applicability in the analysis of single molecule data, we have used synthetic as well as experimental traces.

For the synthetic traces, i.e., Figs. 6, 8, and 9, we simulated the switching of a hypothetical biomolecule with 5 states of characteristic extensions 10, 20, 30, 40, and 50 nm and a common mean dwell time of 0.01 s. These traces are generated for a total duration of 5 s, with individual measurements every 15 μs. The generated molecular signals are distorted by (i) the response of a trap system with linear dynamics, (ii) random fluctuations, and (iii) drift that consists of linear and sinusoidal terms. Intentionally, for these traces, we used an exceedingly slow response time $t_{\mathrm{resp}}=150$ μs, so that the effect of the induced correlations may be clearly visualized. The random fluctuations introduced in these examples simulate a single noise source that combines every noise process occurring experimentally. Such noise processes include but are not limited to thermal fluctuations, unspecified mechanical perturbations, detection shot-noise, read-out noise, and background signals. In more detail, we have added fluctuations as described by Eq. (2) with perturbations $w_n$ that are normally distributed, as one would expect for many noise sources collectively obeying the central limit theorem.^{42,43} Finally, to achieve a comparison with real data, we have selected the variance (i.e., $1/\tau$) of $w_n$ such that the signal-to-noise ratio in the resulting traces is similar to (Figs. 6 and 8) or much lower than (Fig. 9) that of the experimental trace used in this study (see below). The precise values of $\tau$ used are 4 nm^{−2} and 0.04 nm^{−2} for the high and low signal-to-noise cases, respectively.

The optical trap instrument used to obtain the experimental trace, i.e., Fig. 10, was a home-built timeshared dual optical trap constructed and operated generally as described in previous studies.^{44,45} In brief, an acousto-optic modulator rapidly deflects a 1064 nm laser between two positions to create two independent optical traps (66.7 kHz modulation rate). Time traces were obtained with the traps held at fixed positions. A pair of beads was held in the dual traps and tethered together *in situ* by a single molecule consisting of an 89-bp dsDNA hairpin flanked by two 1.5-kbp dsDNA “handles.” The construct and hairpin sequence are exactly the same as “hairpin 1” described earlier.^{45,46} The change in the positions of the pair of trapped beads is the readout of the changing tether length and thus of the hairpin’s folded/unfolded configuration. The tether tension was chosen so as to observe the long DNA hairpin reversibly unfold and refold in multiple steps (four clear steps are observed in this case). Hairpins such as these are often used as test substrates for investigations of helicase protein machines. Bead position data were recorded unfiltered at a 66.7 kHz sampling rate (i.e., one data point per individual trap “on” cycle, or every 15 μs) for a total of 5 s. Because the measured trace appeared void of drift, we artificially contaminated it with added drift similar to that of the synthetic traces.

## III. RESULTS

We implemented **xl-ICON** as described in Appendix C and used it to analyze the single molecule time series described above. The full set of options used in these analyses is listed in Appendix D.

### A. Synthetic measurements

Figure 6 shows a sample synthetic trace with low noise as well as a collection of characteristic estimates. In this trace, the simulated trap instrumentation response consists of a single exponential decay. However, because in the experimental setting the trap’s response is generally uncharacterized, we have used a broader third-order autoregressive model, i.e., $K=3$, to represent it. Despite the intentional misspecification, the estimated coefficients $\eta_1=0.91$, $\eta_2\approx 0$, $\eta_3\approx 0$ indicate that the model accurately identifies the correct autoregressive order. The corresponding estimate of the response time is $t_{\mathrm{resp}}=150.8\pm 1.1$ μs, which is in agreement with the simulated value of 150 μs. Furthermore, **xl-ICON** also accurately identifies and removes the drift, which is also shown in Fig. 6. Overall, in the estimated state trace $\hat{\bar{s}}$, **xl-ICON** misses or misidentifies less than 1.5% of the transitions (missed transitions are not shown for clarity).
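As a quick consistency check of these numbers (ours, for illustration), the single-exponential relation $\eta_1 = t_{\mathrm{resp}}/(t_{\mathrm{resp}}+\Delta t)$ can be inverted to recover the response time from the leading coefficient:

```python
# Inverting eta_1 = t_resp / (t_resp + dt) gives
# t_resp = dt * eta_1 / (1 - eta_1). With the quoted fit value
# eta_1 = 0.91 and dt = 15 us, this lands close to the simulated 150 us.
dt = 15e-6   # 15 us between measurements
eta1 = 0.91  # leading autoregressive coefficient from the fit
t_resp = dt * eta1 / (1.0 - eta1)
print(t_resp)  # ~1.52e-4 s, i.e., roughly 152 us
```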

Figure 7 shows the combinations of state extensions $\varphi_{\sigma_m}$ and mean dwell times $T_{\sigma_m}$ estimated through the posterior probability distribution. Here, the color encodes the average fraction of time that the molecule is estimated to occupy a state $\sigma_m$ with the corresponding extension and mean dwell time. For example, a fraction of 10% indicates that the molecule produced in total 10% of the measurements in the time trace while occupying state $\sigma_m$; further, a null fraction indicates that the molecule never occupied the corresponding combination during the experiment. As expected, the estimated occupation fractions cluster around the simulated combinations of extensions of 10, 20, 30, 40, and 50 nm and dwell time of 0.01 s, indicating that **xl-ICON** successfully identifies and localizes the correct states.

We emphasize that the success of **xl-ICON** in clearly identifying 5 states at the correct extensions, despite the severe correlations in the analyzed time series, relies heavily on the explicit incorporation of the trap instrumentation response (Sec. II A 2), which is the novelty of our method. To demonstrate how critical this is, we analyzed the same trace without considering the response. This is achieved, for example, by keeping the coefficients $\eta_1,\ldots,\eta_K$ in Eqs. (1) and (2) fixed at zero. Figure 8 demonstrates how the estimates deteriorate in this case. By not accounting for the induced correlations, which is effectively equivalent to applying ICON or a plain iHMM, the method estimates an unrealistic total of ≈40 states with characteristic extensions spread throughout the full range of the data.

To facilitate the presentation, for the examples shown so far, we have intentionally utilized low noise levels. Nevertheless, **xl-ICON** provides robust estimates even at exceedingly high noise. For instance, Fig. 9 shows corresponding estimates obtained from a considerably noisier trace than those used earlier. As can be seen, the estimated molecular extensions and drift remain in excellent agreement with the true ones despite the significant overlap in the state emissions. Good agreement is also maintained with the estimated molecular trace $\hat{\bar{s}}$, in which missed transitions account for less than 10% of the entire trace. Although this is already a low error rate, we point out that the molecular traces shown throughout are only point estimates $\hat{\bar{s}}$ (see Sec. II B 2). These represent the single best choices out of every possible trace $\bar{s}$ in the sense that they simultaneously optimize the number of occupied states, state kinetics, state extensions, as well as drift, trap response, and noise globally over the entire time series; a non-zero error rate is therefore expected when the supplied data are noisy. Such errors typically disappear once the estimator $\hat{\bar{s}}$ is enclosed within the appropriate credible region,^{8,9} i.e., traces that engulf a significant portion of the posterior probability $P(\bar{s}|\bar{z})$. In principle, this region may be obtained, albeit in a computationally intensive manner.^{38} We avoid the required computations here, since most commonly the main task in the analysis of single molecule measurements is the quantification of molecular properties, such as state extensions $\varphi_{\sigma_m}$ and mean dwell times $T_{\sigma_m}$, rather than the precise quantification of the molecular trace $\bar{s}$ itself.

### B. Experimental measurements

Having validated our model using synthetic data with known ground truth in Sec. III A, we now turn to real FS data where the ground truth is unknown. Given that the assumptions underlying our method (see Sec. II) are expected to hold in the experimental context—for example, Markovian molecular kinetics and white noise perturbations of the trap’s recorded extension—we anticipate that **xl-ICON** performs equally well on experimental traces.

Figure 10 shows the experimental trace and selected estimates. During the acquisition of the trace shown, the optical trap was configured such that the underlying DNA hairpin folds/unfolds in discrete steps that can be observed in the recorded extension (for further details, see Sec. II C). Similar to the synthetic traces shown earlier, the method successfully isolates the underlying molecular transitions from the drift and trap responses, despite the high noise levels and the correlations in the raw trace. As an extra layer of validation, we note that the drift in this time series is known, as it is introduced artificially, which allows us to verify the accuracy of its estimate (Fig. 10).

Figure 11 illustrates the estimated state extensions and mean dwell times. As can be seen, **xl-ICON** clearly identifies 4 distinct molecular states with extensions varying between −25 and 10 nm. Unlike the synthetic trace in Fig. 7, which ascribed similar kinetics to every state, here one state exhibits relatively stable kinetics (mean dwell time of ≈0.1 s), while the remaining states are more transient (mean dwell times between 0.1 ms and 1 ms). To facilitate the interpretation of these estimates, in Fig. 11 we also show a simple histogram of the experimental data (without including drift). A visual inspection of this histogram reveals the presence of at least three states, all of which are accurately identified by **xl-ICON**. Further, since **xl-ICON** also considers kinetic properties, which are lost when the data are collapsed to produce a histogram, it reveals the presence of a fourth state at −4.5 nm. This state is barely discernible from the histogram; however, its presence in the experimental trace can be verified from the blow-ups of the time courses presented in Fig. 10.

At this point, a comparison of the estimates obtained through **xl-ICON** with those that may be obtained visually through simple downsampling of the raw data is appropriate. In Fig. 12, we show histograms of the experimental trace at the full rate of 15 *μ*s (leftmost panel) and successively downsampled prior to binning by factors of 10 down to a rate of 15 ms. The state located near 6 nm, due to its slow kinetics (≈100 ms) that exceed by several orders of magnitude the inter-measurement times (15 *μ*s – 15 ms) and the trap response (≈60 *μ*s), is sharpened as expected. However, the remaining states, due to their transient kinetics (≈0.1–1 ms), contain only a small fraction of the points in the entire time series. As a result, mild downsampling may enhance them (for example, at 0.15 ms, the lower state is enhanced, and the middle two states can also be visualized as the histogram’s middle peak spreads asymmetrically, suggesting at least two states). Nevertheless, with further downsampling, these transient states are completely eliminated as they merge into a single one located at around −12 nm. Besides causing transient states to merge, downsampling also introduces artifactual states (for example, the states at extensions between −10 nm and 10 nm seen at 15 ms). These stem from the mixing (aliasing) of the transient states with the stable one. For all these cases, **xl-ICON** provides superior estimates as (i) it considers state kinetics that are lost when the trace is binned and thus resolves noisy states robustly, (ii) it resolves fast molecular kinetics comparable to the experimental acquisition rate without downsampling, and (iii) it avoids aliasing.
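The elimination of a short-lived state by downsampling can be mimicked in a few lines: block-averaging a two-state Markov signal whose transient dwells are much shorter than the averaging block pulls every block mean toward the occupancy-weighted average, erasing the transient level. All kinetics and levels below are made up for illustration.

```python
import random

random.seed(2)

# Two-state Markov signal: a transient state (level 0, mean dwell 5 samples)
# and a stable state (level 10, mean dwell 50 samples); values illustrative
p_stay = {0: 1 - 1 / 5, 1: 1 - 1 / 50}
levels = [0.0, 10.0]
N = 50000

s, trace = 1, []
for _ in range(N):
    if random.random() > p_stay[s]:
        s = 1 - s
    trace.append(levels[s] + random.gauss(0.0, 0.5))

def near_transient_fraction(xs):
    # Fraction of samples near the transient level (within 2 units of 0)
    return sum(1 for v in xs if abs(v - levels[0]) < 2.0) / len(xs)

# Block-average (downsample) by a factor of 200
block = 200
down = [sum(trace[i:i + block]) / block for i in range(0, N, block)]

frac_raw = near_transient_fraction(trace)
frac_down = near_transient_fraction(down)
```

In the raw trace, roughly the stationary occupancy (≈9% here) of the samples sit at the transient level; after block-averaging, essentially none do, since each block mean is dominated by the stable level.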

## IV. DISCUSSION

High data acquisition rates make it possible to capture increasingly faster dynamics of biomolecules. However, the interpretation of such datasets presents unique challenges that are not fully met by existing analysis methods, not only because the datasets are larger in size but also because they are noisier and more complex.

For the sake of concreteness, in this study, we focused on force spectroscopy (FS) datasets collected at high rates that fall within the low kHz range. Although these datasets present a unique opportunity to observe molecular dynamics at faster scales, their interpretation is challenging because fast molecular motions are distorted by continuous aberrations, such as drift and correlation, which pose a severe limitation not only for traditional (parametric) approaches that specify models ahead of time but also for naive nonparametric methods.^{1–3} For this reason, experimental measurements are often averaged down or otherwise artificially reduced to render them analyzable. This strategy, however, is appropriate only for measurements from biomolecules with relatively slow kinetics that could also be captured in datasets obtained at lower acquisition rates. Fundamentally, averaging or reducing the collected datasets undermines the whole reason for collecting data at higher rates in the first place.

In part to resolve emergent modeling challenges and in part to maintain a feasible computational cost (as high rate datasets may be vastly larger), existing analysis methods often require crucial user-dependent choices that yield subjective or partial conclusions. In the extreme example of the hidden Markov model (HMM), the workhorse of biophysical time series analysis,^{10–14} the model itself (i.e., the number of attained states) is commonly specified by hand, and extensive data pre-processing is used to remove drift and correlations. Subsequently, to discriminate between plausible models, different model candidates are compared using additional criteria.^{4,6} The challenge here is that pre-processing (e.g., to remove drift) and/or post-processing (e.g., to select the number of states) yields an output that strongly depends on multiple independent steps in the analysis procedure.^{4,6}

To that end, Bayesian nonparametrics (BNPs)^{1–4} offer powerful and feasible alternatives for obtaining globally optimal model and parameter estimates. BNPs do so by determining model parameters and models themselves simultaneously, thereby relaxing many of the theoretical requirements of traditional (parametric) approaches. More specifically, infinite hidden Markov models (iHMMs) can relax the requirements of traditional HMMs to a bare minimum,^{3,5,19} though they remain more computationally costly than HMMs. Nevertheless, current research into efficient algorithmic alternatives to the slower Monte Carlo techniques typically employed, for example, variational methods,^{47–50} may reduce computational cost in the years to come. As an example, **xl-ICON**, which utilizes the same computational tools^{33,34} as the iHMMs, requires ≈2 days of computational time, on an average desktop computer, to analyze a trace of 5 s collected at 66.7 kHz (for example, similar to the trace of Fig. 10). Although such computational requirements are certainly within the capabilities of modern labs, anticipated computational improvements could reduce them by several orders of magnitude in the near future.

Our analysis method directly addresses the challenges presented by data collected at high rates by exploiting BNPs. In doing so, we present the first strategy, to the best of our knowledge, in the single molecule literature toward making raw experimental datasets collected at high rates amenable to rigorous analysis. The resulting method represents the entire experimental system encountered in force spectroscopy and allows clear interpretation of the involved processes, for example, molecular transitions, trap instrumentation responses, and drift. As a result, our formulation of **xl-ICON**—but also the logic behind our formulation—may be used for unsupervised time series analysis of raw single molecule data or used as a building block for models of larger biophysical/experimental setups.

## SUPPLEMENTARY MATERIAL

See supplementary material for a software implementation of **xl-ICON**.

## ACKNOWLEDGMENTS

S.P. acknowledges support from NSF CAREER Grant No. MCB-1719537. M.J.C. acknowledges support from NSF Grant No. MCB-1514706.

### APPENDIX A: NOTATION

Table I summarizes the notation used. Throughout this study, *n* = 1, 2, …, *N* indexes time levels; *m* = 1, 2, …, *M* indexes molecular states; *k* = 1, 2, …, *K* indexes autoregressive coefficients; and *j* = 1, 2, …, *J* indexes drift nodes. To group similar variables, we use bars to denote lists over time, e.g., $\bar{z}=(z_1,z_2,\ldots,z_N)$; tildes to denote lists over the molecular state space, e.g., $\tilde{\varphi}=(\varphi_{\sigma_1},\varphi_{\sigma_2},\ldots,\varphi_{\sigma_M})$; double tildes to denote lists of lists over the molecular state space, e.g., $\tilde{\tilde{\pi}}$; and arrows to gather the trap’s autoregressive coefficients, e.g., $\vec{\eta}=(\eta_1,\eta_2,\ldots,\eta_K)$, and the drift nodes, e.g., $\vec{\theta}=(\theta_1,\theta_2,\ldots,\theta_J)$. Occasionally, we use carets to denote estimators, e.g., $\hat{\bar{s}}$.

| Variable | Description |
|---|---|
| Δ*t* | Measurement period (i.e., inverse acquisition rate) |
| *t*_{resp} | Response time of trap’s instrumentation |
| *t*_{n} | Time levels of individual measurements |
| *z*_{n} | Extension measured at *t*_{n} |
| *y*_{n} | Drift at *t*_{n} |
| *x*_{n} | Recorded position at *t*_{n} (i.e., with noise) |
| *d*_{n} | Idealized position at *t*_{n} (i.e., without noise) |
| *w*_{n} | Trap perturbation at *t*_{n} |
| *s*_{n} | Molecular state at *t*_{n} |
| *σ*_{m} | Distinct molecular state |
| $\varphi_{\sigma_m}$ | Extension associated with *σ*_{m} |
| $T_{\sigma_m}$ | Mean dwell time of *σ*_{m} |
| $\pi_{\sigma_m\to\sigma_{m'}}$ | Transition probability from *σ*_{m} to $\sigma_{m'}$ |
| *η*_{k} | Trap response autoregressive coefficient |
| *θ*_{j} | Drift node |
| *B*_{j} | Time position of drift node *θ*_{j} |
| *D*_{j} | Extension height of drift node *θ*_{j} |
| $G_{\vec{\theta}}(t)$ | Drift interpolating function |


The notation “*X* ∼ *F*,” where *X* is a random variable and *F* is a probability distribution, is used extensively throughout this study and indicates^{7–9,36} that *X* is sampled from *F*, or simply that *X* follows the probability measure associated with *F*.

### APPENDIX B: PRIORS FOR STATE MEAN DWELL TIMES

According to the prior in Eq. (5), the probabilities of a self-transition, i.e., $\pi_{\sigma_m\to\sigma_m}$, and a non-self transition, i.e., $\pi_{\sigma_m\nrightarrow\sigma_m}=\sum_{m'\neq m}\pi_{\sigma_m\to\sigma_{m'}}$, are jointly distributed according to

since $\sum_{m'\neq m}\beta_{\sigma_{m'}}=1-\beta_{\sigma_m}$. Here, Dir denotes the Dirichlet distribution.^{8,9,29} Given that $\pi_{\sigma_m\nrightarrow\sigma_m}=1-\pi_{\sigma_m\to\sigma_m}$, the above joint probability may be reduced to

where Beta denotes the beta distribution.

Finally, due to the Markov property of the state transitions,^{29} the number of dwell time steps $n_{\sigma_m}$ in state *σ*_{m} follows a geometric probability distribution with mean $N_{\sigma_m}=1/(1-\pi_{\sigma_m\to\sigma_m})$. In turn, according to Eq. (B2), $N_{\sigma_m}$ follows an inverse beta distribution

Therefore, the mean dwell time $T_{\sigma_m}=N_{\sigma_m}\Delta t$ in state *σ*_{m} follows a scaled inverse beta distribution whose precise probability density is

where B denotes the beta function.^{35}
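The relation $N_{\sigma_m}=1/(1-\pi_{\sigma_m\to\sigma_m})$ can be checked by direct simulation of the geometric dwell distribution. This is a numerical sanity check of the derivation above, not part of the **xl-ICON** code; the self-transition probability is illustrative.

```python
import random

random.seed(3)

pi_self = 0.99     # self-transition probability; illustrative
n_dwells = 20000

dwells = []
for _ in range(n_dwells):
    n = 1          # at least one time step is spent in the state
    while random.random() < pi_self:
        n += 1
    dwells.append(n)

mean_dwell = sum(dwells) / n_dwells
expected = 1.0 / (1.0 - pi_self)   # about 100 time steps
# The mean dwell TIME is then T = expected * dt for measurement period dt
```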

### APPENDIX C: COMPUTATIONAL SCHEME

A fully working version of **xl-ICON**, in MATLAB source code and GUI formats, is available in the supplementary material. The implementation utilizes the sampler briefly mentioned in Sec. II B 2 and generates samples from the model’s augmented posterior probability distribution $P(\bar{s},\tilde{\beta},\tilde{\tilde{\pi}},\tilde{\varphi},\tau,\vec{\eta},\vec{\theta}\,|\,\bar{z})$ through a Markov chain Monte Carlo (MCMC) scheme^{51} that combines Gibbs sampling with embedded Metropolis updates.^{51}

Specifically for the nonparametric components related to the molecular state space, as in previous studies,^{33,34,49} we use a weak limit approximation that replaces Eq. (4) with

Results from statistical theory^{32,34,49} can be invoked to ensure that, for sufficiently large *M*, Eq. (C1) yields estimates that are indistinguishable from those of Eq. (4). The main advantage of adopting this approximation is that we can invoke efficient machine learning algorithms, such as forward filtering,^{1,29} without modifications^{33} in order to minimize the associated computational cost. Additionally, sampling from Eq. (C1), compared to sampling from Eq. (4), for example, through the popular beam sampler^{52} as in previous implementations,^{1,2} leads to faster mixing of the MCMC chain.^{33,34} The latter is crucial in the analysis of large traces such as those encountered in force spectroscopy.
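A weak-limit (finite Dirichlet) approximation with a sticky self-transition boost can be sketched as below, in the spirit of Eq. (C1): a shared base measure is drawn over a truncated state space, and each transition row receives extra mass on its self-transition. The hyperparameter names and values here (`gamma`, `alpha`, `kappa`) are illustrative stand-ins and do not reproduce **xl-ICON**'s exact prior.

```python
import random

random.seed(4)

def dirichlet(alphas):
    # Sample a Dirichlet vector by normalizing independent Gamma draws;
    # a tiny floor guards against numerically zero concentrations
    g = [random.gammavariate(max(a, 1e-12), 1.0) for a in alphas]
    total = sum(g)
    return [v / total for v in g]

M = 10         # weak-limit truncation (number of candidate states)
gamma = 1.0    # base concentration; illustrative
alpha = 1.0    # transition concentration; illustrative
kappa = 5.0    # stickiness boosting self-transitions; illustrative

# Shared base measure over the truncated state space
beta = dirichlet([gamma / M] * M)

# One transition row per state, with extra mass on the self-transition
pi = [dirichlet([alpha * beta[j] + (kappa if j == m else 0.0) for j in range(M)])
      for m in range(M)]

mean_self = sum(pi[m][m] for m in range(M)) / M
```

With a large `kappa`, the sampled rows concentrate on self-transitions, encoding the prior expectation of long dwells discussed in Appendix B.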

In detail, our implementation generates the outlined samples by iterating the following steps:

- Step 1: Update the molecular states $\bar{s}$ using forward filtering-backward sampling.^{1} Briefly, in a forward pass of the trace, compute recursively the state probabilities $A_n(\sigma_m)=P(s_n=\sigma_m\,|\,z_1,\ldots,z_n)$ according to the filtering relation $A_n(\sigma_m)\propto P(z_n\,|\,s_n=\sigma_m)\sum_{\sigma_{m'}}A_{n-1}(\sigma_{m'})\,\pi_{\sigma_{m'}\to\sigma_m}$, while in a backward pass of the trace, sample states according to $s_n\,|\,s_{n+1}\sim\pi_{s_n\to s_{n+1}}\,A_n(s_n)$. The resulting state trace $\bar{s}=(s_1,\ldots,s_N)$ can be shown^{1,28,52} to follow the conditional probability $P(\bar{s}\,|\,\tilde{\beta},\tilde{\tilde{\pi}},\tilde{\varphi},\tau,\vec{\eta},\vec{\theta},\bar{z})$ that is required by the overall Gibbs sampling scheme.^{51} A detailed exposition of the algorithm, which lies beyond the scope of this study, can be found in the description of ICON.^{1,2}
- Step 2: Update the nonparametric base $\tilde{\beta}$, molecular kinetics $\tilde{\tilde{\pi}}$, and hyperparameters *γ*, *ρ*, and *ξ* (if necessary), similar to the blocked sampler of the sticky-HDP-HMM.^{33}
- Step 3: Update the molecular extensions $\tilde{\varphi}$.
- Step 4: Update the noise precision *τ*.
- Step 5: Update the autoregressive coefficients $\vec{\eta}$ using a Metropolis-within-Gibbs update.^{51}
- Step 6: Update the drift nodes $\vec{\theta}$ using a Metropolis-within-Gibbs update similar to ICON’s sampler.^{2}
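Step 1 can be sketched for a finite-state HMM with Gaussian emissions as follows. This is a simplified illustration of forward filtering-backward sampling, not **xl-ICON**'s full conditional; the transition matrix, emission means, and noise level are made up.

```python
import math
import random

random.seed(5)

def ffbs(z, P, means, sd):
    """Draw one state trace from P(s | z) for a finite HMM."""
    M, N = len(P), len(z)

    def emit(zn, m):
        # Gaussian emission likelihood (unnormalized)
        return math.exp(-0.5 * ((zn - means[m]) / sd) ** 2)

    # Forward pass: A[n][m] is proportional to P(s_n = m | z_1..z_n),
    # normalized at each step for numerical stability
    A, prev = [], [1.0 / M] * M
    for n in range(N):
        a = [emit(z[n], m) * sum(prev[j] * P[j][m] for j in range(M))
             for m in range(M)]
        total = sum(a)
        prev = [v / total for v in a]
        A.append(prev)

    # Backward pass: sample s_N from A_N, then s_n | s_{n+1}
    # with weights A_n(m) * P[m][s_{n+1}]
    s = [0] * N
    s[-1] = random.choices(range(M), weights=A[-1])[0]
    for n in range(N - 2, -1, -1):
        w = [A[n][m] * P[m][s[n + 1]] for m in range(M)]
        s[n] = random.choices(range(M), weights=w)[0]
    return s

# Illustrative two-state trace with well-separated emission levels
P = [[0.95, 0.05], [0.05, 0.95]]
means, sd = [0.0, 5.0], 0.5
truth, z, state = [], [], 0
for _ in range(500):
    if random.random() > P[state][state]:
        state = 1 - state
    truth.append(state)
    z.append(random.gauss(means[state], sd))

sampled = ffbs(z, P, means, sd)
accuracy = sum(t == s for t, s in zip(truth, sampled)) / len(truth)
```

Because the levels are far apart relative to the noise, a single posterior draw already recovers nearly all of the generating states.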

Note that since $\tilde{\varphi}$ and *τ* utilize conjugate prior distributions,^{8} sampling in steps 3 and 4 is achieved directly. Also, note that to improve mixing,^{51} (i) steps 3–6 are swept in random order, (ii) additional repetitions of steps 4 and 5 are embedded within steps 3 and 6, and (iii) steps 2–6 are repeated several times in each iteration. Finally, the proposals for the Metropolis updates of steps 5 and 6 are adjusted during burn-in^{51} to yield individual acceptance rates of ≈25%.
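The burn-in adjustment of Metropolis proposals toward a target acceptance rate can be sketched generically as below, here for a random-walk update on a standard normal stand-in target. The batch size and adaptation rule are illustrative and not **xl-ICON**'s exact scheme.

```python
import math
import random

random.seed(6)

def log_target(x):
    # Standard normal target; stands in for a conditional posterior
    return -0.5 * x * x

step, x = 1.0, 0.0
target_rate = 0.25

# Burn-in: rescale the proposal after every batch of 100 proposals
for _ in range(50):
    accepted = 0
    for _ in range(100):
        prop = x + random.gauss(0.0, step)
        diff = log_target(prop) - log_target(x)
        if diff >= 0 or random.random() < math.exp(diff):
            x, accepted = prop, accepted + 1
    # Grow the step if accepting too often, shrink it if too rarely
    step *= math.exp(accepted / 100 - target_rate)

# Post burn-in: measure the achieved acceptance rate with the frozen step
accepted = 0
for _ in range(5000):
    prop = x + random.gauss(0.0, step)
    diff = log_target(prop) - log_target(x)
    if diff >= 0 or random.random() < math.exp(diff):
        x, accepted = prop, accepted + 1
rate = accepted / 5000
```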

Fine details of the outlined implementation can be found in the **xl-ICON**’s source code, available in the supplementary material.

### APPENDIX D: EXAMPLES

Table II summarizes the options used in Sec. III. For the computations, a total of 45 000 MCMC samples are generated for each example. From these, the initial 1000 are discarded as burn-in, while, from the remainder, only 1 out of every 64 samples (thinning) is used for the results shown in Figs. 6–11.
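The burn-in and thinning described above amount to simple slicing of the sample list; a sketch, where `samples` stands in for actual MCMC output:

```python
# Mimic the post-processing described above: 45 000 MCMC samples,
# drop the first 1000 as burn-in, then keep 1 of every 64 (thinning)
samples = list(range(45000))   # placeholder for actual MCMC output

kept = samples[1000:][::64]
n_kept = len(kept)             # samples entering the reported estimates
```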

| Variable | Synthetic | Experimental | Units |
|---|---|---|---|
| *M* | 50 | 50 | Dimensionless |
| *K* | 3 | 3 | Dimensionless |
| *J* | 10 | 10 | Dimensionless |
| $M^*_{\mathrm{mean}}$ | 2 | 2 | Dimensionless |
| $T^*_{\mathrm{mean}}$ | (*t*_{N} − *t*_{1})/10 | (*t*_{N} − *t*_{1})/10 | Time |
| $T^*_{\mathrm{mode}}$ | $T^*_{\mathrm{mean}}/3$ | $T^*_{\mathrm{mean}}/3$ | Time |
| *μ*_{ϕ} | $\mathrm{mean}(\bar{z})$ | $\mathrm{mean}(\bar{z})$ | Extension |
| *σ*_{ϕ} | $\mathrm{std}(\bar{z})$ | $\mathrm{std}(\bar{z})$ | Extension |
| *α*_{τ} | 2 | 2 | Dimensionless |
| *σ*_{τ} | $\mathrm{std}(\bar{z})$ | $\mathrm{std}(\bar{z})$ | Extension |
| *μ*_{η} | 0 | 0 | Dimensionless |
| *σ*_{η} | 1/3 | 1/3 | Dimensionless |
| *α*_{B} | 2 | 2 | Dimensionless |
| $B_j^{\min}$ | $t_1+\frac{j-1}{J}(t_N-t_1)$ | $t_1+\frac{j-1}{J}(t_N-t_1)$ | Time |
| $B_j^{\max}$ | $t_1+\frac{j}{J}(t_N-t_1)$ | $t_1+\frac{j}{J}(t_N-t_1)$ | Time |
| Basis | Splines | Splines | … |
| Tethering | Global | Global | … |
