Bayesian nonparametrics (BNPs) are poised to have a deep impact in the analysis of single molecule data as they provide posterior probabilities over entire models consistent with the supplied data, not just model parameters of one preferred model. Thus they provide an elegant and rigorous solution to the difficult problem encountered when selecting an appropriate candidate model. Nevertheless, BNPs’ flexibility to learn models and their associated parameters from experimental data is a double-edged sword. Most importantly, BNPs are prone to increasing the complexity of the estimated models due to artifactual features present in time traces. Thus, because of experimental challenges unique to single molecule methods, naive application of available BNP tools is not possible. Here we consider traces with time correlations and, as a specific example, we deal with force spectroscopy traces collected at high acquisition rates. While high acquisition rates are required in order to capture dwells in short-lived molecular states, in this setup, a slow response of the optical trap instrumentation (i.e., trapped beads, ambient fluid, and tethering handles) distorts the molecular signals introducing time correlations into the data that may be misinterpreted as true states by naive BNPs. Our adaptation of BNP tools explicitly takes into consideration these response dynamics, in addition to drift and noise, and makes unsupervised time series analysis of correlated single molecule force spectroscopy measurements possible, even at acquisition rates similar to or below the trap’s response times.

Bayesian nonparametrics (BNPs) have recently been introduced to single molecule biophysics.1–4 Their unique characteristics make them ideal tools in modeling complex biomolecules as they offer clear physical interpretation and allow full posterior probabilities over molecular-level models to be inferred with minimal subjective choices.3,5

In particular, in single molecule analysis, it is typical to associate a (mathematical) model with a molecular state space of a specific size.1,4,6 For example, a molecule with 2 distinct biophysical states is represented by a model with 2 (mathematical) states and the corresponding number of state-specific parameters. Since traditional analysis methods can only infer parameters of particular models,7–9 but not populations of parameters, they require an a priori particular choice among all such possible models. In other words, a critical requirement is that the size of the molecular state space already be known, which is often not the case.

To this end, BNPs avoid the crucial step of having to select a single model out of many plausible ones as they can infer parameters while simultaneously inferring the appropriate model.1–5 Accordingly, BNPs have been used for the analysis of single molecule time series as natural extensions of well-established parametric methods,10–14 such as Hidden Markov Models (HMMs), without the need to impose additional model selection procedures.

The infinite Hidden Markov Model15 (iHMM) exploits BNPs and is the nonparametric extension of the routinely used HMM. The iHMM performs well in the analysis of decorrelated time series1,2 though it is challenged by correlated time series. Indeed, when time correlated traces are analyzed under the assumption of decorrelated data, iHMMs may give rise to a severe overestimate in the number of states because, unlike HMMs which have a fixed number of states, iHMMs recruit states to model the data.

Time correlations are unavoidable experimentally. For instance, time correlations arise when data acquisition times are fast, as is required to capture rapid dynamics of molecules. In such cases, transitions between discrete molecular states may appear smooth or as artifacts of the data acquisition process.

For the sake of concreteness, we focus on traces obtained from single molecule force spectroscopy (FS) where fast measurements may introduce time correlations in the data because the observed dynamics of the molecule are heavily influenced by the response of the optical trap instrumentation. This response includes the motion of the associated bead, the settling of the ambient fluid, and the relaxation of the tethering DNA handles.16–18 In fact, measurements in FS probe the motion of the combined molecule-trap system which, at fast time scales, may deviate substantially from the motion of the underlying molecule alone. As a result, sharp molecular transitions may be distorted or smeared out because the combined system, roughly idealized as an elastic spring, transforms transitions from discrete into continuous events that are sampled over multiple time points during the molecule-trap settling time, Fig. 1.

FIG. 1.

To resolve fast molecular dynamics, measurements are obtained at high acquisition rates. Because of the slow response time of the optical trap instrumentation used in FS, the molecular signal is masked (upper left). As a result, sharp molecular transitions are distorted or smeared out. Consequently, downsampling is often used to reduce (upper right) the induced correlations. Downsampling itself, however, introduces aliasing, which also distorts/smears out sharp transitions. In either case, naive application of BNP methods, such as iHMMs, leads to an overestimation of the number of molecular states (bottom), since artifactual short-lived dwells (arising from either finite time trap responses or aliasing upon downsampling) are interpreted as visits to unphysical states.

FIG. 1.

To resolve fast molecular dynamics, measurements are obtained at high acquisition rates. Because of the slow response time of the optical trap instrumentation used in FS, the molecular signal is masked (upper left). As a result, sharp molecular transitions are distorted or smeared out. Consequently, downsampling is often used to reduce (upper right) the induced correlations. Downsampling itself, however, introduces aliasing, which also distorts/smears out sharp transitions. In either case, naive application of BNP methods, such as iHMMs, leads to an overestimation of the number of molecular states (bottom), since artifactual short-lived dwells (arising from either finite time trap responses or aliasing upon downsampling) are interpreted as visits to unphysical states.

Close modal

Due to their ability to infer the size of the molecular state space simultaneously with the properties of the constituting states,3,5,19 iHMMs can over-interpret the distorted signals and overpopulate the state space with unphysical states in those regions of the time trace where transitions between states appear continuous. Inevitably, if the raw FS traces are directly used, the estimated number of states may be severely overestimated, Fig. 1.

To lessen such inaccuracies, one may naively choose to remove correlations by pre-processing the time trace. For instance, this can be achieved by artificially downsampling the experimental trace,20 where multiple consecutive data points are binned together and replaced by their average. Such preprocessing, while reducing correlations between successive measurements within a time series, has two main disadvantages: (i) it undermines the purpose of collecting data at high frequencies in the first place by reducing the measurement rate and leaving originally observed fast kinetics unresolved and (ii) it introduces artificial short-lived dwells due to aliasing.20 In turn, these artificial short-lived dwells may also be interpreted by iHMMs as artificial states leading to the same downfalls as when using the original, unprocessed, raw signal, Fig. 1.

In summary, iHMMs are advantageous because they do not assume a fixed number of states. Yet iHMMs fail when confronted with the, often unavoidable, problem of correlated time traces. A nonparametric method that can address this new challenge is needed in order to interpret high frequency measurements.

To address this limitation, we propose an extension of iHMM with continuous control ICON,2 a generalization of the iHMM method previously developed to deal with time traces with drift (such as FRET), that is appropriate for correlated traces as they appear in FS. With the extended formulation presented here, we account for slow instrumentation responses, noise, and also drift. Additionally, we introduce computational improvements that facilitate the analysis of long traces such as those in FS. A working implementation of the resulting method, termed xl-ICON, is available in the supplementary material.

Consider a FS single molecule time series z¯=(z1,z2,,zN), where zn is the extension measured at time tn, with t1 and tN denoting the onset and conclusion of the experimental time course, respectively. As the timing accuracy in a typical detector is far better than the extension measurement accuracy, for example, less than few ps, in this study we ignore timing errors and model tn as noiseless and known in advance.

To extract the molecular signature from z¯, which, as explained in Sec. I, may be masked by slow responses, noise, and drift, we develop a detailed model of the whole biophysical/experimental system. Our formulation, which relies on ICON2 and the iHMM,1 is described in Sec. II A and depicted schematically in Fig. 2. In Sec. II B, we present the statistical machinery that is required to derive posterior probabilities over multiple models and the associated parameters, and in Sec. II C we describe the acquisition of the datasets used in the examples that follow in Sec. III. For quick reference, a summary of our notational conventions can be found in  Appendix A.

FIG. 2.

For the analysis of the measurements, the molecule of interest may be represented as undergoing sudden changes. Here, the molecule (top) switches abruptly between a folded and an unfolded state. The signature of such spontaneous switching is contained in the extension time traces (bottom). The molecular signal (black) reflects the true molecular extension. Because of the slow response of the optical trap instrumentation, the molecular signal is distorted (red). In addition to the distortion, the measured signal (blue) is contaminated with unspecified drift components and corrupted by noise. In a typical analysis procedure, only the resulting signal (blue) is available and the main objective is the recovery of the underlying molecular signature (black).

FIG. 2.

For the analysis of the measurements, the molecule of interest may be represented as undergoing sudden changes. Here, the molecule (top) switches abruptly between a folded and an unfolded state. The signature of such spontaneous switching is contained in the extension time traces (bottom). The molecular signal (black) reflects the true molecular extension. Because of the slow response of the optical trap instrumentation, the molecular signal is distorted (red). In addition to the distortion, the measured signal (blue) is contaminated with unspecified drift components and corrupted by noise. In a typical analysis procedure, only the resulting signal (blue) is available and the main objective is the recovery of the underlying molecular signature (black).

Close modal

A working implementation of the method described in this section can be found in the supplementary material.

1. Molecular kinetics

At the single molecule level, similar to ICON2 and the initial iHMM,1 we assume that the system of interest alternates spontaneously between a number of distinct molecular states σ1, σ2, …, where σm is used to represent the biophysical states attainable over the course of the time trace. For example, for a protein undergoing conformation changes,16 as shown in Fig. 2, σ1 = “folded” and σ2 = “unfolded.” In general, the precise number of unique states σm available to the molecule is unknown and so we must also estimate it along the other quantities of the model that follow (see Sec. II B 1).

To facilitate the presentation, we let s¯=(s1,s2,,sN) be the sequence of successive states attained by the molecule during the experiment. Here, sn denotes the state σm that the molecule attained at time tn. To model s¯, as in previous studies,1,10–13,21,22 we assume memoryless switching and adopt simple Markovian kinetics. In statistical notation, these kinetics can be formally expressed as sn+1|snCategorical(σ1,σ2,)(π̃sn), where π̃sn=(πsnσ1,πsnσ2,) gathers the probabilities of the molecule switching from sn to any σm at any time step.

The formulation we adopt here assumes that whenever the molecule departs from a state σm it arrives at a state σm on average πσmσm of the times. So, πσmσm=0 indicates that σm never leads to σm, πσmσm=1 indicates that σm always leads to σm, while πσmσm describes self-transitions. The latter can be used to find23 the state’s mean dwell time Tσm=Δt/(1πσmσm), where Δt is the time lapse between successive measurements. Similarly, transition rates23 between any pair of states can be found based on the probabilities πσmσm.

2. Trap instrumentation response

Unlike the cases modeled by ICON and the iHMM,1,2 as the molecule switches from state to state producing s¯, it initiates the motion of the optical trap instrumentation that yields the trace z¯. To represent the driving signal, we consider ϕσm to be the extension associated with state σm and the extension produced over the course of the experiment to be ϕ¯=(ϕs1,ϕs2,,ϕsN).

To describe the dynamics of the trap system, we first consider the idealized case where noise is absent. For this case, we let dn denote the distance that would be recorded by the trap at tn. For sufficiently fast trap systems, dn is indistinguishable from the driving extension ϕsn; however, for traps with response times comparable to the data acquisition rate, dn may deviate considerably from ϕsn since the influence of past values dn−1, dn−2, etc., increases with increasing response times. For instance, for traps with exponentially decaying responses, such as those illustrated in Fig. 2, driving extensions and recorded extensions are related by dn=η0ϕsn+η1dn1, where η0 = Δt/(tresp + Δt) and η1 = tresp/(tresp + Δt) with tresp denoting the trap’s response time. In the case of traps with pronounced non-linearities in their response or traps incorporating feedback, generally more than one lag term, dn−1, dn−2, etc., is required in order to accurately represent dn. Since very little is typically known about the dynamics of the combined molecule/handles/bead/ambient fluid system, we model the response using a general autoregressive process8 having the form

(1)

and, to ensure that for static molecules the recorded extension is identical to the extension of the molecule, we set η0 = 1 − k≥1ηk. We note that this condition also ensures “unit magnification” in the optical trap, that is, a change in the molecular extension is transmitted to a change in the recorded extension of the same magnitude, i.e., Δϕ = Δd, and not to a constant multiple of it, i.e., Δϕ = cΔd for some c ≠ 1.

The assumption underlying Eq. (1) is that the extension dn measured by the trap depends on (i) the current state of the molecule ϕsn and (ii) the extensions measured at the previous K time steps. A consequence is that dn is determined not only by ϕsn but is also implicitly determined by the preceding extensions dn−1, dn−2, …. Therefore, the model trap inherits memory and the strength of the memory is ultimately determined by the values of the autoregressive coefficients η1, …, ηK. In order to capture memory effects accurately, we learn the values of ηk directly from the data (see Sec. II B 1).

3. Noise

In addition to the driving molecular signal and the induced response of the trap system, the actual recorded extension during the experiment may also be influenced by unspecified perturbations such as mechanical forces, unresolved dynamics, perturbative ambient flows, and thermal fluctuations. Thus, a realistic description of the trap dynamics requires a modification of the extension dn in Eq. (1) which excludes them.

For a well-calibrated experimental apparatus, the perturbations are expected to be small relative to the typical distances between the molecular extensions ϕσm. So, we model these perturbations collectively as random fluctuations wn having the characteristics of white noise. Accordingly, in our formulation, the actual extension xn recorded by the trap at tn is given by a modification of Eq. (1) of the form

(2)

where the perturbations wn ∼ Normal(0, τ) have zero mean and variance equal to 1/τ. As a side note, we point out that unlike dn mentioned earlier, which describes the idealized situation of absent random perturbations, xn explicitly includes them.

4. Drift

Similar to ICON,2 to account for drift in the measured time series, which, as shown previously, may compromise the analysis if ignored,2 we couple the molecular trace x¯ to an additional process ȳ that explicitly incorporates drift. To accommodate drift, we place a set of nodes θj = (Bj, Dj) over the time series and use interpolation24 to link across them as shown in Fig. 3. In particular, we place the nodes over the available time series z¯ at positions Bj and heights Dj. To allow the model drift to fit tightly even near sharp changes or discontinuities, we allow both Dj and Bj to shift. To achieve this, we restrict each Bj within regions [Bjmin,Bjmax] and choose Bjmin and Bjmax from a regular grid of time levels that spans the entire time series.

FIG. 3.

For the representation of drift, nodes θj = (Bj, Dj) are placed over the time interval between the first and last available measurements and interpolation is used to link across them. Here, arrows indicate that each node is adjusted perpendicular as well as parallel to the time axis. This way, allowing neighboring nodes to approach each other results in better fits near drift discontinuities. To ensure an a priori regular spacing of the nodes over the experimental time course, each node position Bj is restricted to lie between preset times Bjmin and Bjmax that are kept fixed. Furthermore, weak priors P(Bj) are used to influence placement of the nodes near the center of the designated intervals.

FIG. 3.

For the representation of drift, nodes θj = (Bj, Dj) are placed over the time interval between the first and last available measurements and interpolation is used to link across them. Here, arrows indicate that each node is adjusted perpendicular as well as parallel to the time axis. This way, allowing neighboring nodes to approach each other results in better fits near drift discontinuities. To ensure an a priori regular spacing of the nodes over the experimental time course, each node position Bj is restricted to lie between preset times Bjmin and Bjmax that are kept fixed. Furthermore, weak priors P(Bj) are used to influence placement of the nodes near the center of the designated intervals.

Close modal

In summary, the drift contribution to the measured extension at tn is evaluated as

(3)

where Gθ(t) is the interpolating function24 that uses the nodes θ=(θ1,θ2,,θJ). Depending on the specifics of the experimental setup at hand, Gθ(t) may utilize splines or more specialized basis functions. For example, for drift with severe discontinuities, Hermite polynomials25 might provide superior fits than splines.

For well-designed FS experiments, drift is expected to fluctuate on a slower time scale than the molecular transitions. As a result, assuming separation of times scales in x¯ and ȳ, a sufficient condition to avoid degeneracies of our formulation is J/N ≪ 1. That is, to be able to accurately capture the shape of the drift, the number of nodes J used in the interpolation needs to be lower than the total number of time points in the supplied trace N. Generally, a precise upper bound on the ratio J/N is hard to predict as it depends on (i) the length of the trace N, (ii) the intermeasurement period Δt, and (iii) the intrinsic time scale of the drift. As a rule of thumb, we suggest J/N ≈ 0.01 or lower.

Finally, to resolve non-identifiability issues caused because the drift may move perpendicularly,2 we restrict the allowed combinations of nodes θj to those that yield either a zero global contribution, i.e., nyn = 0, or a zero contribution locally at the experiment onset, i.e., y1 = 0.

5. Measurements

Finally, having incorporated trap instrumentation responses Eq. (1), random fluctuations Eq. (2), and drift Eq. (3), we model any remaining noise as having little to no additional effect. Thus, our net observations are zn = xn + yn.

A graphical model summarizing the dependencies among the variables of the developed formulation, leading from molecular switching to observations, is shown in Fig. 4. In Sec. II B, we describe methods to obtain estimates.

FIG. 4.

For the analysis of the experimental data, a detailed statistical formulation of the combined molecule-trap system is developed. Namely, the molecule’s time course is represented by a Markov chain sn; the response of the trap’s instrumentation is represented by an autoregressive chain xn that is driven by sn; and drift is represented by interpolation yn. The measurements zn are directly influenced by the trap system and drift and indirectly by the molecule and noise. Here, blue marks the nonparametric prior placed on the molecular state space, red marks the prior on the trap dynamics, and cyan marks the prior for the drift interpolation. For details, see Sec. II.

FIG. 4.

For the analysis of the experimental data, a detailed statistical formulation of the combined molecule-trap system is developed. Namely, the molecule’s time course is represented by a Markov chain sn; the response of the trap’s instrumentation is represented by an autoregressive chain xn that is driven by sn; and drift is represented by interpolation yn. The measurements zn are directly influenced by the trap system and drift and indirectly by the molecule and noise. Here, blue marks the nonparametric prior placed on the molecular state space, red marks the prior on the trap dynamics, and cyan marks the prior for the drift interpolation. For details, see Sec. II.

Close modal

The parameters that we can infer from a time trace—quantities such as π̃σm, ϕσm, etc., introduced above—are directly related to the molecular properties of the system, e.g., molecular extensions, kinetics, and dwell times, that we wish to learn. To infer parameter values, we would normally follow the Bayesian paradigm.7–9,13 That is, we would ascribe priors to each parameter, combine priors with the likelihood implied by the formulation developed in Sec. II A 5, and subsequently derive a posterior probability distribution.7,8 However, since the number of attained molecular states is a priori unknown, so is the number of parameter values that need to be inferred. To handle this more general case, we use a nonparametric prior1,3,5,19 that allows us to estimate models and their associated parameters simultaneously.

Below we describe the inference procedure for our formulation. A graphical summary is shown in Fig. 4.

1. Specifying the nonparametric prior

To infer a posterior over models incorporating a varying number of states in the molecular state space, we use a (nonparametric) generalization of the Dirichlet distribution1,15,26,27 that is routinely used in applications involving HMMs.8,28,29 In this formulation, the states available to the molecule are infinite in number and our primary goal is to identify which and how many of them have been visited during the experiment. As it is apparent, the majority of these states will remain unvisited since, for example, the molecule can visit at most N of them. In practice, through a careful choice of the prior on the state transitions (see below), most of the states will be revisited multiple times and so the number of unique states in s¯ will be much lower than N.

Specifically, the prior we place on the molecular kinetics is a hierarchical Dirichlet process26 of the form

(4)
(5)

where δ̃σm is the Dirac delta centered at σm and γ, ξ, and ρ are hyperparameters that, as explained below, determine the prior probability distributions on the number of molecular states and mean dwell times. In Eqs. (4) and (5), GEM denotes the Griffiths-Engen-McCloskey stick breaking19,26,30 and DP denotes the Dirichlet process5,19,26,27,31,32 prior.

The unique characteristic of this prior is that when assigning molecular transitions, say from sn to sn+1, it considers three possibilities: (i) the molecule stays in the same state, (ii) the molecule revisits one of the states visited before, and (iii) the molecule visits a state never visited before. Similar to ICON, it is because of the last possibility that, a priori, the molecule has access to infinitely many states and it is up to the model’s likelihood to bias the transition toward options (i) and (ii) as more data are gathered, i.e., as n moves to higher values. In other words, it is solely up to the likelihood to distinguish which and how many of the a priori infinite states have been actually visited during the experiment. However, unlike the prior used in ICON, which places a priori equal weights on these three possibilities, the present formulation places a higher weight on self-transitions as induced by the delta terms ρδ̃σm in Eq. (5). This makes the current formulation more appropriate in describing molecules exhibiting long dwells,33 for which multiple recurring self-transitions are required, while still maintaining an infinite pool of potential states.

In principle, the values of the hyperparameters γ, ξ, and ρ can be inferred from the time series by invoking further hierarchical constructions.26,31,33,34 However, because prior information about the anticipated number of visited molecular states and characteristic time scales is often available—for example, either from previous experiments, computational modeling, or simply by intuition based on molecules of similar structure—such constructions may be unnecessary. Instead, below we describe how γ, ξ, and ρ can be chosen to reflect the available information.

For a time series s¯ containing N measurements, provided N is large as is typically the case in FS, it has been shown31 that the number of distinct molecular states to be visited, implied by the above prior, has a probability distribution that is approximately proportional to γm(0.58 + log N)m/(m − 1)!. That is, when the only information about the molecule is that it has generated N measurements, but the values of these measurements are unknown, the prior in Eqs. (4) and (5) implies that the number of unique states σm in s¯ is random and that its probability distribution is as given above. This relationship can be inverted to initialize a value for γ=eψ(Mmean*)/(0.58+logN), where Mmean* is the anticipated number of unique molecular states and ψ is the digamma function.35 

Furthermore, the induced state mean dwell times Tσm, as shown in  Appendix B, have a prior probability distribution that is illustrated in Fig. 5. This distribution has mean Tmean*=Δt/(1ρ) and mode Tmode*=Δtξ/((1ρ)ξ+1), which may be used to choose appropriate values for the hyperparameters ξ=Tmode*Tmean*/(Tmean*Tmode*)/Δt and ρ=1Δt/Tmean*. More precisely, in this case, the mean sets the anticipated time scale of the molecular kinetics. In addition, a mode close to the mean indicates low uncertainty on the chosen time scale, as most of the prior probability is concentrated in a region around Tmean*, while a mode distant from the mean indicates high uncertainty on the anticipated time scale, as the prior probability is spread over a wider region, Fig. 5.

FIG. 5.

Anticipated time scales of the molecular kinetics can be represented from quantities appearing in the nonparametric prior. Here, the prior probability distribution of the mean dwell times Tσm is shown for three choices of the hyperparameters ξ and ρ. Vertical lines indicate the mean (solid) and mode (dashed) of the associated distributions. Choosing a large mean influences slower molecular switching (top), while choosing a smaller mean influences faster switching (middle) as most of the prior probability is located at lower values. Additionally, a mode close to the mean indicates strong beliefs about the chosen time scale (middle), while a mode distant from the mean indicates weak beliefs about the chosen time scale (bottom) as the prior probability is spread over a larger region.

FIG. 5.

Anticipated time scales of the molecular kinetics can be represented from quantities appearing in the nonparametric prior. Here, the prior probability distribution of the mean dwell times Tσm is shown for three choices of the hyperparameters ξ and ρ. Vertical lines indicate the mean (solid) and mode (dashed) of the associated distributions. Choosing a large mean influences slower molecular switching (top), while choosing a smaller mean influences faster switching (middle) as most of the prior probability is located at lower values. Additionally, a mode close to the mean indicates strong beliefs about the chosen time scale (middle), while a mode distant from the mean indicates weak beliefs about the chosen time scale (bottom) as the prior probability is spread over a larger region.

Close modal

At this point, it is worth noting that although the values of Mmean*, Tmean*, and Tmode* influence the prior, their effect on the resulting posterior—which incorporates information from the data introduced through the likelihood—decreases as the amount of data available increases or the noise level decreases.8,29

For the remaining parameters, which may be treated within the parametric context, motivated by simplicity, we use normal priors for the molecular extensions ϕσm Normal(μϕ,σϕ2) and a gamma prior for the precision of the noise terms τGamma(ατ/2,ατστ2/2). Both choices are conjugate8 to their respective likelihoods. Consequently, their hyperparameters may be interpreted in terms of pseudo-observations,7,8 which in turn may be helpful in choosing appropriate values. Finally, we use normal priors for the autoregressive coefficients of the trap’s instrumentation response ηk Normal(μη,ση2), symmetric generalized beta priors for the placement of the drift nodes BjBjmin+(BjmaxBjmin)Beta(αB,αB), and improper8 flat priors for the node heights Dj ∝ 1. At this point, we want to mention that although these prior choices in our formulation are mainly motivated by modeling simplicity, due to the typical long length of FS traces, they have little influence on the resulting estimates though they remain necessary ingredients of Bayesian modeling.36 

2. Obtaining estimates

In Bayesian methods, it is often possible to obtain posterior probabilities in analytic form; however, due to the nonparametric prior of Eqs. (4) and (5), this is impossible in our case. Instead, to compute the posterior P(s¯,β̃,π̃̃,ϕ̃,τ,η,θ|z¯), we have to adopt a computational approach. For this reason, we have developed a specialized scheme described in detail in  Appendix C. Using this scheme, we can generate samples from P(s¯,β̃,π̃̃,ϕ̃,τ,η,θ|z¯) and subsequently obtain full estimates, such as expected values and credible intervals, of the involved physical quantities.1,8,34

For most parameters of interest—for example, molecular kinetics π̃̃, extensions ϕ̃, and drift ȳ—the evaluation of the corresponding estimators is straightforward. Some difficulties may arise only when trying to estimate the most probable state sequence s¯^ since the popular Viterbi sequence,29,37 common in engineering applications, cannot be easily generalized to the nonparametric context.34,38 To that end, following existing approaches,38–41 here we propose as an alternative estimator for s¯ the state sequence s¯^ that lies closest to the expectation E(s¯|z¯) of the marginal posterior probability distribution P(s¯|z¯). Similar to the expectation E(s¯|z¯), the proposed s¯^ considers the full extent of the model posterior;38 however, unlike E(s¯|z¯) which might violate the model kinetics,29 the proposed s¯^ is ensured to be in agreement as it is chosen exclusively among the feasible sequences s¯.

To test and validate our method and demonstrate its applicability in the analysis of single molecule data, we have used synthetic as well as experimental traces.

For the synthetic traces, i.e., Figs. 6, 8, and 9, we simulated the switching of a hypothetical biomolecule with 5 states of characteristic extensions 10, 20, 30, 40, 50 nm and a common mean dwell time of 0.01 s. These traces are generated for a total duration of 5 s, with individual measurements every 15 μs. The generated molecular signals are distorted by (i) the response of a trap system with linear dynamics, (ii) random fluctuations, and (iii) drift that consists of linear and sinusoidal terms. Intentionally, for these traces, we used exceedingly slow response times tresp = 150 μs, so the effect of the induced correlations may be clearly visualized. Random fluctuations introduced in these examples simulate a single noise source that combines every noise process occurring experimentally. Such noise processes include but are not limited to thermal fluctuations, unspecified mechanical perturbations, detection shot-noise, read-out noise, background signals, etc. In more detail, we have added fluctuations as described by Eq. (2) with perturbations wn that are normally distributed as one would expect in cases of noise collectively obeying the law of large numbers.42,43 Finally, to achieve a comparison with real-data, we have selected the variance (i.e., 1/τ) of wn such that the signal-to-noise ratio in the resulting traces is similar to (Figs. 6 and 8) or much lower than (Fig. 9) the experimental trace used in this study (see below). The precise values of τ used are 4 nm−2 and 0.04 nm−2 for the high and low signal-to-noise cases, respectively.

FIG. 6.

xl-ICON is used to analyze a synthetic single molecule trace (upper left) intended to represent real force spectroscopy measurements. To facilitate visual inspection, the synthetic signal intentionally contains long trap instrumentation responses (response time of 150 μs) and low noise levels (τ = 4 nm−2). We show the estimated sequence of molecular extensions and drift in the entire trace (lower left) and the estimated trap response (upper and lower right). For clarity, we omit showing true molecular and drift traces since they overlap with the estimated ones. Note that axes are drawn to different scales.

FIG. 6.

xl-ICON is used to analyze a synthetic single molecule trace (upper left) intended to represent real force spectroscopy measurements. To facilitate visual inspection, the synthetic signal intentionally contains long trap instrumentation responses (response time of 150 μs) and low noise levels (τ = 4 nm−2). We show the estimated sequence of molecular extensions and drift in the entire trace (lower left) and the estimated trap response (upper and lower right). For clarity, we omit showing true molecular and drift traces since they overlap with the estimated ones. Note that axes are drawn to different scales.

Close modal

The optical trap instrument used to obtain the experimental trace, i.e., Fig. 10, was a home-built timeshared dual optical trap constructed and operated generally as described in previous studies.44,45 In brief, an acousto-optic modulator rapidly deflects a 1064 nm laser between two positions to create two independent optical traps (66.7 kHz modulation rate). Time traces were obtained with the traps held at fixed positions. A pair of beads was held in the dual traps and tethered together in situ by a single molecule consisting of an 89-bp dsDNA hairpin flanked by two 1.5-kbp dsDNA “handles.” The construct and hairpin sequence are the exact same as “hairpin 1” described earlier.45,46 The change in the positions of the pair of trapped beads is the readout of the changing tether length and thus the hairpin folded/unfolded configuration. The tether tension was chosen so as to observe the long DNA hairpin reversibly unfold and refold in multiple steps (four clear steps are observed in this case). Hairpins such as these are often used as test substrates for investigations of helicase protein machines. Bead position data were recorded unfiltered at 66.7 kHz sampling rate (i.e., one data point per individual trap on cycle or every 15 μs) for a total of 5 s. Because the measured trace appeared void of drift, we artificially contaminated it with the addition of drift similar to the synthetic traces.

We implemented xl-ICON as described in  Appendix C and used it to analyze the single molecule time series described above. The full set of options used in these analyses is listed in  Appendix D.

Figure 6 shows a sample synthetic trace with low noise as well as a collection of characteristic estimates. In this trace, the simulated trap instrumentation response consists of a single exponential decay. However, because in the experimental setting the trap’s response is generally uncharacterized, we have used a broader 3 order autoregressive model, i.e., K = 3, to represent it. Despite the intentional misspecification, the estimated coefficients η1 = 0.91, η2 ≈ 0, η3 ≈ 0 indicate that the model accurately identifies the correct autoregressive order. The corresponding estimate of the response time is tresp = 150.8 ± 1.1 s, which is in agreement with the simulated one of 150 s. Furthermore, xl-ICON also accurately identifies and removes drift which is also shown in Fig. 6. Overall, in the estimated state trace s¯^, xl-ICON misses or miss-identifies less than 1.5% of the transitions (missed transitions are not shown for clarity).

Figure 7 shows the combinations of state extensions ϕσm and mean dwell times Tσm estimated through the posterior probability distribution. Here, color coded is the average fraction of time that the molecule is estimated to occupy a state σm with the corresponding extension and mean dwell time. For example, a fraction of 10% indicates that the molecule produced in total 10% of the measurements in the time trace while occupying state σm; further, a null fraction indicates that the molecule never occupied the corresponding combination during the experiment. As expected, the estimated occupation fraction clusters around the simulated combinations of extensions of 10, 20, 30, 40, 50 nm and dwell time 0.01 s indicating that xl-ICON successfully identifies and localizes the correct states.

FIG. 7.

xl-ICON provides estimates of single molecule properties. Here, the estimates of the mean dwell times Tσm and molecular extensions ϕσm shown are obtained using the expectation of the marginal posterior probability distribution P(T̃,ϕ̃|z¯) of the synthetic trace in Fig. 6. Dashed lines indicate the measurement acquisition period Δt (left) and the total duration of the trace (right). Dotted lines indicate the true molecular extensions (horizontal) and mean dwell time (vertical) used in the generation of the trace. Note that the horizontal axis is drawn in logarithmic scale.

FIG. 7.

xl-ICON provides estimates of single molecule properties. Here, the estimates of the mean dwell times Tσm and molecular extensions ϕσm shown are obtained using the expectation of the marginal posterior probability distribution P(T̃,ϕ̃|z¯) of the synthetic trace in Fig. 6. Dashed lines indicate the measurement acquisition period Δt (left) and the total duration of the trace (right). Dotted lines indicate the true molecular extensions (horizontal) and mean dwell time (vertical) used in the generation of the trace. Note that the horizontal axis is drawn in logarithmic scale.

Close modal

We emphasize that the success of xl-ICON in clearly identifying 5 states at the correct extensions, despite the severe correlations in the time series analyzed, relies heavily on the explicit incorporation of the trap instrumentation response (Sec. II A 2) that is the novelty of our method. To demonstrate how critical this is, we analyzed the same trace without considering the response. This is achieved, for example, by keeping the coefficients η1, …, ηK in Eqs. (1) and (2) fixed at zero. Figure 8 demonstrates how the estimates deteriorate in this case. By not accounting for the induced correlations, which is effectively equivalent to applying ICON or a plain iHMM, the method estimates an unrealistic total of ≈40 states with characteristic extensions spread throughout the full range of the data.

FIG. 8.

Ignoring the response dynamics of the trap’s instrumentation leads to inaccurate estimation of the underlying molecular state space. Here, the trace shown in Fig. 6 (blue trace) is analyzed using ICON or a plain iHMM. As can be seen, the number of states is severely overestimated as consecutive tightly correlated measurements are interpreted as dwells to short-lived non-physical states (red trace). By contrast, proper consideration of the response dynamics (black trace), through xl-ICON, accurately identifies and localizes the correct states (dotted lines).

FIG. 8.

Ignoring the response dynamics of the trap’s instrumentation leads to inaccurate estimation of the underlying molecular state space. Here, the trace shown in Fig. 6 (blue trace) is analyzed using ICON or a plain iHMM. As can be seen, the number of states is severely overestimated as consecutive tightly correlated measurements are interpreted as dwells to short-lived non-physical states (red trace). By contrast, proper consideration of the response dynamics (black trace), through xl-ICON, accurately identifies and localizes the correct states (dotted lines).

Close modal

To facilitate the presentation, for the examples shown so far, we have intentionally utilized low noise levels. Nevertheless, it is worth noting that xl-ICON provides robust estimates even at exceedingly high noise. For instance, Fig. 9 shows corresponding estimates obtained from a considerably noisier trace than those used earlier. As can be seen, the estimated molecular extensions and drift remain in excellent agreement with the true ones despite the significant overlap in the state emissions. Further, good agreement is also maintained with the estimated molecular trace s¯^. In the latter, missed transitions account for less than 10% of the entire trace. Although this is already a low error rate, we want to point out that the molecular traces shown throughout are only point estimates s¯^ (see Sec. II B 2) that represent the single best choices out of every possible trace s¯ in the sense that they simultaneously optimize the number of occupied states, state kinetics, state extensions, as well as drift, trap response, and noise globally over the entire time series. So a non-zero error rate is expected to occur when the supplied data are noisy. Such errors typically disappear once the estimator s¯^ is enclosed within the appropriate credible region,8,9 i.e., traces that engulf a significant portion of the posterior probability P(s¯|z¯). In principle, this region may be obtained, nonetheless, in a computationally intensive manner.38 We avoid the required computations here, since most commonly the main task in the analysis of single molecule measurements is the quantification of the molecular properties, such as state extensions ϕσm and mean dwell times Tσm, rather than the precise quantification of the molecular trace s¯ itself.

FIG. 9.

xl-ICON is used to analyze synthetic measurements contaminated with high noise levels (τ = 0.04 nm−2). Similar to Fig. 6, we show here the estimated sequence of molecular extensions and drift in the entire trace (left) and the estimated trap response (right). Note that for clarity, axes are drawn to different scales.

FIG. 9.

xl-ICON is used to analyze synthetic measurements contaminated with high noise levels (τ = 0.04 nm−2). Similar to Fig. 6, we show here the estimated sequence of molecular extensions and drift in the entire trace (left) and the estimated trap response (right). Note that for clarity, axes are drawn to different scales.

Close modal

Having validated our model using synthetic data with known ground truth in Sec. III A, we now turn to real FS data where the ground truth is unknown. Given that the assumptions underlying our method (see Sec. II) are expected to hold in the experimental context—for example, Markovian molecular kinetics and white noise perturbations of trap’s recorded extension—it is anticipated that xl-ICON performs equally well on experimental traces as well.

Figure 10 shows the experimental trace and selected estimates. During the acquisition of the trace shown, the optical trap has been configured such as the underlying DNA hairpin foldes/unfoldes in discrete steps that can be observed in the recorded extension (for further details, see Sec. II C). Similar to the synthetic traces shown earlier, the method successfully isolates the underlying molecular transitions from the drift and trap responses, despite the high noise levels and the correlations in the raw trace. As an extra layer of validation, we note here that since the drift in this time series is known, as it is introduced artificially, we verify the accuracy of its estimate, Fig. 10.

FIG. 10.

xl-ICON is used to analyze experimental measurements (upper left) from single molecule force spectroscopy. Similar to previous figures, we show here the estimated sequence of molecular extensions in the entire trace (lower left) and the estimated trap response (upper and lower right), while, for clarity, we omit drift traces as estimated and true ones overlap. Note that for clarity, axes are drawn to different scales.

FIG. 10.

xl-ICON is used to analyze experimental measurements (upper left) from single molecule force spectroscopy. Similar to previous figures, we show here the estimated sequence of molecular extensions in the entire trace (lower left) and the estimated trap response (upper and lower right), while, for clarity, we omit drift traces as estimated and true ones overlap. Note that for clarity, axes are drawn to different scales.

Close modal

Figure 11 illustrates the estimated state extensions and mean dwell times. As can be seen, xl-ICON clearly identifies 4 distinct molecular states with extensions varying between −25 and 10 nm. Unlike the synthetic trace in Fig. 7, which ascribed similar kinetics to every state, here one state exhibits relatively stable kinetics (mean dwell time of ≈0.1 s), while the remaining states are more transient (mean dwell times between 1 ms and 0.1 ms). To facilitate the interpretation of these estimates, in Fig. 11 we also show a simple histogram of the experimental data (without including drift). A visual inspection of this histogram reveals the presence of at least three states. All of them have been accurately identified by xl-ICON. Further, since xl-ICON considers also kinetic properties which are lost when the data are collapsed to produce a histogram, it also reveals the presence of a fourth one at −4.5 nm. This state is barely discernible from the histogram; however, its presence in the experimental trace can be verified from the blow ups of the time courses presented in Fig. 10.

FIG. 11.

xl-ICON provides estimates of single molecule properties. Here, similar to Fig. 7, the estimates of the mean dwell times Tσm and molecular extensions ϕσm shown on the left are obtained using the expectation of the marginal posterior probability distribution P(T̃,ϕ̃|z¯) of the trace in Fig. 10. Horizontal lines mark the estimated molecular extensions. A comparison with a histogram of the raw data (void of drift) shown on the right indicates that the developed method successfully identifies the three prominent peaks as well as a fourth that cannot be resolved when ignoring kinetics.

FIG. 11.

xl-ICON provides estimates of single molecule properties. Here, similar to Fig. 7, the estimates of the mean dwell times Tσm and molecular extensions ϕσm shown on the left are obtained using the expectation of the marginal posterior probability distribution P(T̃,ϕ̃|z¯) of the trace in Fig. 10. Horizontal lines mark the estimated molecular extensions. A comparison with a histogram of the raw data (void of drift) shown on the right indicates that the developed method successfully identifies the three prominent peaks as well as a fourth that cannot be resolved when ignoring kinetics.

Close modal

At this point, a comparison of the estimates obtained through xl-ICON with those that may be obtained visually through simple downsampling of the raw data is appropriate. In Fig. 12, we show histograms of the experimental trace at full rate 15 μs (left most panel) and successively downsampled prior to binning by a factor of 10 until a rate of 15 ms. The state located near 6 μm, due to its slow kinetics (≈100 ms) that exceed by several orders the inter-measurement times (15 μs – 15 ms) and the trap response (≈60 μs), is sharpened as expected. However, the remaining states, due to their transient kinetics (≈0.1–1 ms), contain only a small fraction of the points in the entire time series. As a result, a mild downsampling may enhance them (for example, at 0.15 ms, the lower state is enhanced, and also the middle two states can be visualized as the histogram’s middle peak spreads asymmetrically suggesting at least two states). Nevertheless, with further downsampling these transient states are completely eliminated as they merge into a single one located at around −12 nm. Besides causing transient states to merge, downsampling also introduces artifactual states (for example, states at extensions between −10 nm and 10 nm that may be seen at 15 ms). These stem from the mixing (aliasing) of the transient states with the stable one. For all these cases, xl-ICON provides superior estimates as (i) it considers state kinetics that are lost when the trace is binned and thus resolves noisy states robustly, (ii) it resolves fast molecular kinetics that are comparable to the experimental acquisition rate without downsampling, and (iii) it avoids aliasing.

FIG. 12.

Downsampling the raw data may reduce noise and remove correlations making states with slow kinetics more prominent; however, downsampling smears out states with fast kinetics and introduces artifactual states due to aliasing. Here, we show histograms of the experimental trace of Fig. 7 at full rate (left most panel) and successively 10-fold downsampled. Dotted lines marks the states resolved by xl-ICON.

FIG. 12.

Downsampling the raw data may reduce noise and remove correlations making states with slow kinetics more prominent; however, downsampling smears out states with fast kinetics and introduces artifactual states due to aliasing. Here, we show histograms of the experimental trace of Fig. 7 at full rate (left most panel) and successively 10-fold downsampled. Dotted lines marks the states resolved by xl-ICON.

Close modal

High data acquisition rates make it possible to capture increasingly faster dynamics of biomolecules. However, the interpretation of those datasets presents unique challenges that are not fully met by existing analysis methods both because such data sets are larger in size but also because the data sets themselves are noisier and more complex.

For the sake of concreteness, in this study, we focused on force spectroscopy (FS) datasets collected at high rates that fall within the low kHz timeframe. Although these datasets present a unique opportunity to observe molecular dynamics at faster scales, their interpretation is challenging because fast molecular motions are distorted by continuous aberrations, such as drift and correlation, which pose a severe limitation not only for traditional (parametric) approaches which specify models ahead of time but also for naive nonparametric methods.1–3 For this reason, the experimental measurements are often averaged down or otherwise artificially reduced to render them analyzable. However, this strategy is appropriate only for measurements from biomolecules with relatively slow kinetics that could be also captured in datasets obtained at lower acquisition rates. However, fundamentally, averaging or reducing the collected datasets undermines the whole reason behind collecting data at higher rates in the first place.

In part to resolve emergent modeling challenges and in part to maintain a feasible computational cost (as high rate datasets may be vastly larger), existing analysis methods often require crucial user dependent choices that yield subjective or partial conclusions. In the extreme example of the hidden Markov model (HMM), the workhorse of biophysical time series data analysis,10–14 the model itself (i.e., the number of attained states) is commonly specified by hand and extensive data pre-processing is used to remove drift and correlations. Subsequently, to discriminate between plausible models, different model candidates are compared using additional criteria.4,6 The challenge here is that pre-processing (e.g., to remove drift) and/or post-processing (e.g., to select the number of states) yields an output that strongly depends on multiple independent steps in the analysis procedure.4,6

To that end, Bayesian nonparametrics (BNPs)1–4 offer powerful and feasible alternatives to obtaining globally optimal model and parameter estimates. BNPs do so by determining model parameters and models themselves simultaneously thereby relaxing many of the theoretical requirements of traditional (parametric) approaches. More specifically, infinite hidden Markov models (iHMMs) can relax the requirements of the traditional HMMs to a bare minimum,3,5,19 though they remain more computationally costly than HMMs. Nevertheless, current research into efficient algorithmic alternatives to the slower Monte Carlo techniques typically employed, for example, variational methods,47–50 may reduce computational cost in the years to come. As an example, xl-ICON, which utilizes the same computational tools33,34 as the iHMMs, requires ≈2 days of computational time, in an average desktop computer, to analyze a trace of 5 s collected at 66.7 kHz (for example, similar to the trace of Fig. 10). Although, such computational requirements are certainly within the capabilities of modern labs, anticipated computational improvements could reduce it by several orders in the near future.

Our analysis method directly addresses challenges presented by data collected at high rates by exploiting BNPs. By utilizing BNPs, we present the first strategy, to the best of our knowledge, in the single molecule literature toward making raw experimental datasets collected at high rates amenable to rigorous analysis. The resulting method represents the entire experimental system encountered in force spectroscopy and allows clear interpretation of the involved processes, for example, molecular transitions, trap instrumentation responses, and drift. As a result, our formulation of xl-ICON—but also the logic behind our formulation—may be used for unsupervised time series analysis of raw single molecule data or used as a building block for models of larger biophysical/experimental setups.

See supplementary material for a software implementation of xl-ICON.

S.P. acknowledges support from NSF CAREER Grant No. MCB-1719537. M.J.C. acknowledges support from NSF Grant No. MCB-1514706.

Table I summarizes the notation used. Throughout this study, n = 1, 2, …, N indexes time levels; m = 1, 2, …, M indexes molecular states; k = 1, 2, …, K indexes autoregressive coefficients; and j = 1, 2, …, J indexes drift nodes. To group similar variables, we use bars to denote lists over time, e.g., z¯=(z1,z2,,zN); tildes to denote lists over the molecular state space, e.g., ϕ̃=(ϕσ1,ϕσ2,,ϕσM); double tildes to denote lists of lists over the molecular state space, e.g., π̃̃; right arrows to gather the trap’s autoregressive coefficients, e.g., η=(η1,η2,,ηK); and left arrows to gather the drift nodes, e.g., θ=(θ1,θ2,,θJ). Occasionally, we use carets to denote estimators, e.g., s¯^.

TABLE I.

Summary of notation used in this study.

Variable Description
Δt  Measurement period (i.e., inverse acquisition rate) 
tresp   Response time of trap’s instrumentation 
t n   Time levels of individual measurements 
z n   Extension measured at tn 
y n   Drift at tn 
xn   Recorded position at tn (i.e., with noise) 
dn   Idealized position at tn (i.e., without noise) 
w n   Trap perturbation at tn 
s n   Molecular state at tn 
σ m   Distinct molecular state 
ϕ σ m   Extension associated with σm 
T σ m   Mean dwell time of σm 
π σ m σ m   Transition probability from σm to σm 
η k   Trap response autoregressive coefficient 
θ j   Drift node 
B j   Time position of drift node θj 
D j   Extension height of drift node θj 
G θ ( t )   Drift interpolating function 
Variable Description
Δt  Measurement period (i.e., inverse acquisition rate) 
tresp   Response time of trap’s instrumentation 
t n   Time levels of individual measurements 
z n   Extension measured at tn 
y n   Drift at tn 
xn   Recorded position at tn (i.e., with noise) 
dn   Idealized position at tn (i.e., without noise) 
w n   Trap perturbation at tn 
s n   Molecular state at tn 
σ m   Distinct molecular state 
ϕ σ m   Extension associated with σm 
T σ m   Mean dwell time of σm 
π σ m σ m   Transition probability from σm to σm 
η k   Trap response autoregressive coefficient 
θ j   Drift node 
B j   Time position of drift node θj 
D j   Extension height of drift node θj 
G θ ( t )   Drift interpolating function 

The notation “XF,” where X is a random variable and F is a probability distribution, that is used extensively throughout this study, indicates7–9,36 that X is sampled from F or simply that X follows the probability measure associated with F.

According to the prior in Eq. (5), the probabilities of a self-transition, i.e., πσmσm, and a non-self transition, i.e., πσmσm=mmπσmσm, are jointly distributed according to

(B1)

since mmβσm=1βσm. Here, Dir denotes the Dirichlet distribution.8,9,29 Given that πσmσm=1πσmσm, the above joint probability may be reduced to

(B2)

where Beta denotes the beta distribution.

Finally, due to the Markov property of the state transitions,29 the number of dwell time steps nσm in state σm follows a geometric probability distribution with mean Nσm=1/(1πσmσm). In turn, according to Eq. (B2), Nσm follows an inverse beta distribution

(B3)

Therefore, the mean dwell time Tσm=NσmΔt in state σm follows a scaled inverse beta distribution and the precise probability density is

(B4)

where B denotes the beta function.35 

A fully working version of xl-ICON, in Matlab source code and GUI formats, is available in the supplementary material. The implementation utilizes the sampler briefly mentioned in Sec. II B 2 and generates samples from the model’s augmented posterior probability distribution P(s¯,β̃,π̃̃,ϕ̃,τ,η,θ|z¯) through an Markov Chain Monte Carlo (MCMC) scheme51 that utilizes Gibbs sampling with embedded Metropolis updates.51 

Specifically for the nonparametric components related to the molecular state space, as in previous studies,33,34,49 we use a weak limit approximation that replaces Eq. (4) with

(C1)

Results from the theory of statistics32,34,49 can be invoked to ensure that for sufficiently large M, Eq. (C1) yields estimates that are indistinguishable from Eq. (4). The main advantage of adopting this approximation is that we can invoke efficient machine learning algorithms, such as forward filtering1,29 without modifications,33 in order to minimize the associated computational cost. Additionally, sampling from Eq. (C1), compared to sampling from Eq. (4), for example, through the popular beam sampler52 as in previous implementations,1,2 leads to faster mixing of the MCMC chain.33,34 The latter is crucial in the analysis of large traces such as those encountered in force spectroscopy.

In detail, our implementation generates the outlined samples by iterating the following steps:

  • Step 1:

    Update the molecular states s¯ using forward filtering-backward sampling.1 Briefly, in a forward pass of the trace, compute recursively the state probabilities An(σm)=Psn=σmz1,,zn according to the filtering relation An(σm)Pznsn=σmσmAn1(σm)πσmσm, while in a backward pass of the trace, sample states according to snsn+1πsnsn+1An(sn). The resulting state trace s¯=(s1,,sN) can be shown1,28,52 to follow the conditional probability Ps¯β̃,π̃̃,ϕ̃,τ,η,θ,z¯ that is required by the overall Gibbs sampling scheme.51 A detailed exposition of the algorithm, which lies beyond the scope of this study, can be found in the description of ICON.1,2

  • Step 2:

    Update the nonparametric base β̃, molecular kinetics π̃̃, and hyperparameters γ, ρ, and ξ (if necessary), similar to the blocked sampler of the sticky-HDP-HMM.33 

  • Step 3:

    Update the molecular extensions ϕ̃.

  • Step 4:

    Update the noise precision τ.

  • Step 5:

    Update the autoregressive coefficients η using a Metropolis-within-Gibbs update.51 

  • Step 6:

    Update the drift nodes θ using a Metropolis-within-Gibbs update similar to ICON’s sampler.2 

Note that since ϕ̃ and τ utilize conjugate prior distributions,8 sampling in steps 3 and 4 is achieved directly. Also, note that to improve mixing51 (i) steps 3–6 are swept in random order, (ii) additional repetitions of steps 4 and 5 are embedded within steps 3 and 6, and (iii) steps 2–6 are repeated several times in each iteration. Finally, the proposals for the Metropolis updates of steps 5 and 6 are adjusted during burn-in51 to yield individual acceptance rates of ≈25%.

Fine details of the outlined implementation can be found in the xl-ICON’s source code, available in the supplementary material.

Table II summarizes the options used in Sec. III. For the computations, a total of 45 000 MCMC samples are generated for each example. From these, the initial 1000 are subtracted as burn-in, while, from the remaining, only 1 out of every 64 samples is used for the results shown in Figs. 6–11.

TABLE II.

Summary of hyperparameter options used in the examples of Sec. III. Here, units of time and extension refer to the units of the supplied time series z¯.

Variable Synthetic Experimental Units
M   50  50  Dimensionless 
K   Dimensionless 
J   10  10  Dimensionless 
M mean *   Dimensionless 
T mean *   (tNt1)/10  (tNt1)/10  Time 
T mode *   T mean * / 3   T mean * / 3   Time 
μ ϕ   mean ( z ¯ )   mean ( z ¯ )   Extension 
σ ϕ   std ( z ¯ )   std ( z ¯ )   Extension 
α τ   Dimensionless 
σ τ   std ( z ¯ )   std ( z ¯ )   Extension 
μ η   Dimensionless 
σ η   1/3  1/3  Dimensionless 
α B   Dimensionless 
B j min   t 1 + j 1 J ( t N t 1 )   t 1 + j 1 J ( t N t 1 )   Time 
B j max   t 1 + j J ( t N t 1 )   t 1 + j J ( t N t 1 )   Time 
Basis  Splines  Splines  … 
Tethering  Global  Global  … 
Variable Synthetic Experimental Units
M   50  50  Dimensionless 
K   Dimensionless 
J   10  10  Dimensionless 
M mean *   Dimensionless 
T mean *   (tNt1)/10  (tNt1)/10  Time 
T mode *   T mean * / 3   T mean * / 3   Time 
μ ϕ   mean ( z ¯ )   mean ( z ¯ )   Extension 
σ ϕ   std ( z ¯ )   std ( z ¯ )   Extension 
α τ   Dimensionless 
σ τ   std ( z ¯ )   std ( z ¯ )   Extension 
μ η   Dimensionless 
σ η   1/3  1/3  Dimensionless 
α B   Dimensionless 
B j min   t 1 + j 1 J ( t N t 1 )   t 1 + j 1 J ( t N t 1 )   Time 
B j max   t 1 + j J ( t N t 1 )   t 1 + j J ( t N t 1 )   Time 
Basis  Splines  Splines  … 
Tethering  Global  Global  … 
1.
I.
Sgouralis
and
S.
Pressé
, “
An introduction to infinite HMMS for single-molecule data analysis
,”
Biophys. J.
112
,
2021
2029
(
2017
).
2.
I.
Sgouralis
and
S.
Pressé
, “
Icon: An adaptation of infinite HMMS for time traces with drift
,”
Biophys. J.
112
,
2117
2126
(
2017
).
3.
K.
Hines
,
J.
Bankston
, and
R.
Aldrich
, “
Analyzing single-molecule time series via nonparametric Bayesian inference
,”
Biophys. J.
108
,
540
556
(
2015
).
4.
A.
Lee
,
K.
Tsekouras
,
C.
Calderon
,
C.
Bustamante
, and
S.
Pressé
, “
Unraveling the thousand word picture: An introduction to super-resolution data analysis
,”
Chem. Rev.
117
,
7276
(
2017
).
5.
S. N.
MacEachern
, “
Nonparametric Bayesian methods: A gentle introduction and overview
,”
Commun. Stat. Appl. Methods
23
,
445
466
(
2016
).
6.
M.
Tavakoli
,
J.
Taylor
,
C.
Li
,
T.
Komatsuzaki
, and
S.
Pressé
, “
Single molecule data analysis: An introduction
,” preprint arXiv:1606.00403 (
2016
).
7.
D.
Sivia
and
J.
Skilling
,
Data Analysis: A Bayesian Tutorial
(
OUP Oxford
,
2006
).
8.
A.
Gelman
,
J. B.
Carlin
,
H. S.
Stern
,
D. B.
Dunson
,
A.
Vehtari
, and
D. B.
Rubin
,
Bayesian Data Analysis
(
CRC Press
,
Boca Raton, FL
,
2014
), Vol. 2.
9.
U.
Von Toussaint
, “
Bayesian inference in physics
,”
Rev. Mod. Phys.
83
,
943
(
2011
).
10.
S.
McKinney
,
C.
Joo
, and
T.
Ha
, “
Analysis of single-molecule FRET trajectories using hidden Markov modeling
,”
Biophys. J.
91
,
1941
1951
(
2006
).
11.
F.
Qin
,
A.
Auerbach
, and
F.
Sachs
, “
A direct optimization approach to hidden Markov modeling for single channel kinetics
,”
Biophys. J.
79
,
1915
1927
(
2000
).
12.
S.-H.
Chung
,
J. B.
Moore
,
L.
Xia
,
L.
Premkumar
, and
P.
Gage
, “
Characterization of single channel currents using digital signal processing techniques based on hidden Markov models
,”
Philos. Trans. R. Soc., B
329
,
265
285
(
1990
).
13.
K.
Hines
, “
A primer on Bayesian inference for biophysical systems
,”
Biophys. J.
108
,
2103
2113
(
2015
).
14.
K.
Okamoto
and
Y.
Sako
, “
Variational bayes analysis of a photon-based hidden Markov model for single-molecule FRET trajectories
,”
Biophys. J.
103
,
1315
1324
(
2012
).
15.
M.
Beal
,
Z.
Ghahramani
, and
C.
Rasmussen
, “
The infinite hidden Markov model
,” in
Advances in Neural Information Processing Systems
, (
2001
), pp.
577
584
.
16.
B.
Jagannathan
and
S.
Marqusee
, “
Protein folding and unfolding under force
,”
Biopolymers
99
,
860
869
(
2013
).
17.
K. C.
Neuman
and
A.
Nagy
, “
Single-molecule force spectroscopy: Optical tweezers, magnetic tweezers and atomic force microscopy
,”
Nat. Methods
5
,
491
505
(
2008
).
18.
O. K.
Dudko
,
G.
Hummer
, and
A.
Szabo
, “
Theory, analysis, and interpretation of single-molecule force spectroscopy experiments
,”
Proc. Natl. Acad. Sci. U. S. A.
105
,
15755
15760
(
2008
).
19.
S. J.
Gershman
and
D. M.
Blei
, “
A tutorial on Bayesian nonparametric models
,”
J. Math. Psychol.
56
,
1
12
(
2012
).
20.
A.
Antoniou
,
Digital Signal Processing
(
McGraw-Hill
,
2016
).
21.
J.
Bronson
,
J.
Fei
,
J.
Hofman
,
R.
Gonzalez
, and
C.
Wiggins
, “
Learning rates and states from biophysical time series: A Bayesian approach to model selection and single-molecule FRET data
,”
Biophys. J.
97
,
3196
3205
(
2009
).
22.
M.
Blanco
and
N.
Walter
, “
Analysis of complex single-molecule FRET time trajectories
,”
Methods Enzymol.
472
,
153
178
(
2010
).
23.
C. D.
Kinz-Thompson
,
N. A.
Bailey
, and
R. L.
Gonzalez
, Jr
, “
Precisely and accurately inferring single-molecule rate constants
,”
Methods Enzymol.
581
,
187
225
(
2016
).
24.
K. E.
Atkinson
,
An Introduction to Numerical Analysis
, 2nd ed. (
Wiley
,
New York
,
1989
).
25.
F. N.
Fritsch
and
R. E.
Carlson
, “
Monotone piecewise cubic interpolation
,”
SIAM J. Numer. Anal.
17
,
238
246
(
1980
).
26.
Y. W.
Teh
,
M. I.
Jordan
,
M. J.
Beal
, and
D. M.
Blei
, “
Hierarchical Dirichlet processes
,”
J. Am. Stat. Assoc.
101
,
1566
(
2006
).
27.
T.
Ferguson
, “
A Bayesian analysis of some nonparametric problems
,”
Ann. Stat.
209
230
(
1973
).
28.
O.
Cappé
,
E.
Moulines
, and
T.
Rydén
, “
Inference in hidden Markov models
,” in
Proceedings of EUSFLAT Conference
(
2009
), pp.
14
16
.
29.
C.
Bishop
,
Pattern Recognition and Machine Learning (Information Science and Statistics)
, 1st ed. (
2006
), corr. 2nd Printing ed. (2007).
30.
J.
Pitman
, “
Poisson–Dirichlet and gem invariant distributions for split-and-merge transformations of an interval partition
,”
Combinatorics, Probab. Comput.
11
,
501
514
(
2002
).
31.
M.
West
,
Hyperparameter Estimation in Dirichlet Process Mixture Models
(
Duke University ISDS Discussion Paper No. 92–A03
,
1992
).
32.
H.
Ishwaran
and
M.
Zarepour
, “
Exact and approximate sum representations for the Dirichlet process
,”
Can. J. Stat.
30
,
269
283
(
2002
).
33.
E.
Fox
,
E.
Sudderth
,
M.
Jordan
, and
A.
Willsky
, “
A sticky HDP-HMM with application to speaker diarization
,”
Ann. Appl. Stat.
5
,
1020
1056
(
2011
).
34.
H.
Ishwaran
and
M.
Zarepour
, “
Markov chain monte carlo in approximate Dirichlet and beta two-parameter process hierarchical models
,”
Biometrika
87
,
371
390
(
2000
).
35.
M.
Abramowitz
and
I. A.
Stegun
,
Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables
(
Courier Corporation
,
1964
), Vol. 55.
36.
P. M.
Lee
,
Bayesian Statistics: An Introduction
(
John Wiley & Sons
,
2012
).
37.
L.
Rabiner
and
B.
Juang
, “
An introduction to hidden Markov models
,”
IEEE ASSP Mag.
3
,
4
16
(
1986
).
38.
S.
Wade
and
Z.
Ghahramani
, “
Bayesian cluster analysis: Point estimation and credible balls
,” preprint arXiv:1505.03339 (
2015
).
39.
F. A.
Quintana
and
P. L.
Iglesias
, “
Bayesian clustering and product partition models
,”
J. R. Stat. Soc., Ser. B
65
,
557
574
(
2003
).
40.
A.
Fritsch
,
K.
Ickstadt
 et al., “
Improved criteria for clustering based on the posterior similarity matrix
,”
Bayesian Anal.
4
,
367
391
(
2009
).
41.
J. W.
Lau
and
P. J.
Green
, “
Bayesian model-based clustering procedures
,”
J. Comput. Graphical Stat.
16
,
526
558
(
2007
).
42.
R.
Darling
,
J. R.
Norris
 et al., “
Differential equation approximations for Markov chains
,”
Probab. Surv.
5
,
37
79
(
2008
).
43.
S. N.
Ethier
and
T. G.
Kurtz
,
Markov Processes: Characterization and Convergence
(
John Wiley & Sons
,
2009
), Vol. 282.
44.
M. J.
Comstock
,
T.
Ha
, and
Y. R.
Chemla
, “
Ultrahigh-resolution optical trap with single-fluorophore sensitivity
,”
Nat. Methods
8
,
335
340
(
2011
).
45.
M.
Comstock
,
K.
Whitley
,
H.
Jia
,
J.
Sokoloski
,
T.
Lohman
,
T.
Ha
, and
Y.
Chemla
, “
Direct observation of structure-function relationship in a nucleic acid–processing enzyme
,”
Science
348
,
352
354
(
2015
).
46.
Z.
Qi
,
R. A.
Pugh
,
M.
Spies
, and
Y. R.
Chemla
, “
Sequence-dependent base pair stepping dynamics in xpd helicase unwinding
,”
eLife
2
,
e00334
(
2013
).
47.
M.
Johnson
and
A.
Willsky
, “
Stochastic variational inference for Bayesian time series models
,” in
International Conference on Machine Learning
(
2014
), pp.
1854
1862
.
48.
N.
Ding
and
Z.
Ou
, “
Variational nonparametric Bayesian hidden Markov model
,” in
2010 IEEE International Conference on Acoustics, Speech and Signal Processing
(
IEEE
,
2010
), pp.
2098
2101
.
49.
K.
Kurihara
,
M.
Welling
, and
Y. W.
Teh
, “
Collapsed variational Dirichlet process mixture models
,” in
International Joint Conference on Artificial Intelligence
(
2007
), Vol. 7, pp.
2796
2801
.
50.
J.
Paisley
and
L.
Carin
, “
Nonparametric factor analysis with beta process priors
,” in
Proceedings of the 26th Annual International Conference on Machine Learning
(
ACM
,
2009
), pp.
777
784
.
51.
C.
Robert
and
G.
Casella
,
Monte Carlo Statistical Methods
(
Springer Science & Business Media
,
2013
).
52.
J.
Van Gael
,
Y.
Saatci
,
Y.
Teh
, and
Z.
Ghahramani
, “
Beam sampling for the infinite hidden Markov model
,” in
Proceedings of the 25th International Conference on Machine Learning
(
ACM
,
2008
), pp.
1088
1095
.

Supplementary Material