Over 15 years ago, the ability to electrically detect and characterize individual polynucleotides as they are driven through a single protein ion channel was suggested as a potential method for rapidly sequencing DNA, base-by-base, in a ticker tape-like fashion. More recently, a variation of this method was proposed in which a nanopore would instead detect single nucleotides cleaved sequentially by an exonuclease enzyme in close proximity to one pore entrance. We analyze the exonuclease/nanopore-based DNA sequencing engine using analytical theory and computer simulations that describe nucleotide transport. The available data and analytical results suggest that the proposed method will be limited to reading <80 bases, imposed, in part, by the short lifetime each nucleotide spends in the vicinity of the detection element within the pore and the ability to accurately discriminate between the four mononucleotides.

## INTRODUCTION

The ability to sequence DNA rapidly, at low cost, is of significant interest because of its potential positive impact on health care and in many other applications. Over four decades ago, Maxam and Gilbert devised a method for sequencing DNA in which the biopolymer is chemically modified with labels and subsequently cleaved at specific bases with an organic compound.^{1} Sanger and colleagues developed another method based on dideoxynucleoside triphosphates as DNA chain terminators.^{2} Modern variations of the Sanger method use contemporary technology to sequence DNA templates by capillary electrophoresis with optical detection of fluorescent bases, and requires time consuming procedures.^{3}

High-throughput methods based on sequencing by synthesis^{4} were refined in the 1990s, and are typically performed with instruments that are relatively large and expensive to operate. The method was recently miniaturized via microtiter plate technology that reads DNA sequences optically from fluorescent tags^{3,5} or electronically with CMOS/ISFET technology.^{6–8} The latter technique is particularly significant because the device is massively parallel and label-free.

The next logical steps in genomic DNA sequencing include reducing the number of copies needed (ideally, the limit of detection would be a single molecule, to avoid the need for PCR amplification^{9} with its attendant errors^{10}), increasing the read length (to decrease the complexity of genome sequence reconstruction), increasing the measurement accuracy, and decreasing the time to read each base. Towards that goal, it was suggested that individual molecules of DNA might be sequenced in a ticker-tape fashion by threading them through a single nanopore and measuring the change each base makes to the pore's ionic current.^{11} However, this conceptually simple approach has been elusive,^{12} in part, because understanding the physics of polymer translocation through a single nanopore and controlling the rate of polynucleotide translocation have both been challenging.^{13–18}

One variation on nanopore-based DNA sequencing suggested the use of an exonuclease attached to the cap domain of the α-hemolysin channel.^{19} In this scheme, the enzyme would cleave nucleotides, one at a time, from duplex DNA. Conceptually, each base would migrate serially into the pore, bind to a detection element there, and be uniquely identified by the degree of ionic current blockade. Because β-cyclodextrin binds mononucleotides,^{20–22} and fits well inside the nanopore formed by *Staphylococcus aureus* α-hemolysin, it was a logical choice for use as DNA base detection element (Fig. 1(a)). Based on the degree by which the four DNA mononucleotides reduced the conductance of the α-hemolysin channel with a β-cyclodextrin adapter in it (Fig. 1(b)),^{23} an excellent degree of separation of the four bases was achieved. The actual mean overlap between the peaks (8%) determined using a fit to the data with Voigt functions^{24} (Figs. 1(b)–1(e), indigo) is somewhat greater than the 1% reported with a Gaussian mixture model fit (Figs. 1(b)–1(e), orange).^{23} Nevertheless, the relatively high degree of accuracy of the β-cyclodextrin adapter suggested that the concept has merit and deserves a detailed theoretical analysis. Here, we study the critical parts of this system's performance, determine the probability that a given length polynucleotide can be accurately sequenced with the technology, and suggest how the method might be implemented.

## MODEL

### One-dimensional diffusion model and the probability of capture

First, we analyze theoretically the capture of single mononucleotides that are released near one entrance of a nanopore. The time dependent capture probability for an idealized diffusing particle is computed using solutions of the Fokker-Planck drift-diffusion equation.^{25–27} Closed form analytical solutions for this equation can usually be found for geometries with simple boundary conditions. The more complicated case of a particle diffusing in a semi-infinite solution in the vicinity of a nanopore has no known closed form solution because it requires solving and matching boundary conditions between two distinct regions (i.e., the nanopore and the bulk). The problem was simplified recently by assuming that a one-dimensional line can represent the nanopore with a perfect absorber at one end (the particle detector/reader inside the nanopore) and a radiating boundary (which describes the escape of a particle from the nanopore to the bulk) at the other.^{28,29}

Each particle is initially located at the entrance of the nanopore, and the Fokker-Planck equation is solved to estimate the capture time probability distribution and from it the time-integrated capture probability. Figure 2 shows a schematic of the analytical model system. The diffusing particle propagator *G* under a biased flow in a one-dimensional nanopore is

where *v*_{d} and *D* are the particle drift velocity and diffusion coefficient, respectively. The drift includes electrophoretic and electro-osmotic (EO) contributions, which are functions of the applied transmembrane potential, *V* (see Appendix). A radiating boundary condition is applied at the pore exit (*z* = *L*) to model the loss of nucleotide particles into the bulk solution

where *κ* is an effective rate constant.^{28} For a pore coupled to a three-dimensional bulk solution, *κ* = 8*D*/π*d*^{29} where *d* is the pore diameter. The location of the particle sink (the β-cyclodextrin adapter), which defines *L*, is assumed to be that defined in the original experiment.^{23}

The boundary condition for particle capture by the detector located at *z* = 0 is *G*(*z* = 0, *t*) = 0. We assume the particles are initially uniformly distributed at the nanopore entrance. The localization is further assumed to occur over a region of thickness *δ*, such that

The parameter of interest is the survival time probability density; |$\phi \left( t \right) = D\left. {\frac{{\partial G}}{{\partial z}}} \right|_{z = 0}.$|$\varphi t=D\u2202G\u2202zz=0.$*G* is solved from Eq. (1) by applying the absorption and radiating boundary conditions and working in Laplace space (see Appendix). This leads to the following expression for |$\hat \phi \left( s \right)$|$\varphi \u0302s$ the Laplace transform of the survival time probability density:

where *α* = |*v*_{d}| *L*/*D*, *y*^{2} = α^{2}/4 + 2τ*s*,*τ* = *L*^{2}/2*D*, and β = κ*L*/*D*.

## RESULTS AND DISCUSSION

The capture probability density function, *ϕ*(*t*), is estimated by numerically inverting Eq. (4).^{40} Figure 3 shows a comparison between these capture probabilities and results from numerical simulations at applied potentials of *V* = 0 mV and 300 mV. The results show that an exonuclease-based nanopore sequencing device will be rate-limited by the turnover of exonuclease (∼1 ms to 10 ms) and not the capture rate of each base by the nanopore (∼10 ns to 100 ns). Therefore, we estimate the capture probability from the time-integrated capture probability density, which follows from Eq. (4) because |$\hat \phi \left( {s = 0} \right) = \int\nolimits_0^\infty {\phi \left( t \right)dt},$|$\varphi \u0302s=0=\u222b0\u221e\varphi tdt,$

The two parameters that describe the total capture probability are the nanopore's aspect ratio (*L*/*d*) and the voltage-dependent drift velocity *v*_{d}. Figure 4 illustrates the voltage dependence of the time integrated capture probability estimated from the analytical model and numerical simulations with no free parameters. The results demonstrate the difficulty in capturing small diffusive particles. However, they also show that increasing the potential could increase the probability of capturing a given cleaved base. Thus, Fig. 4 suggests that a near certain capture probability can be achieved with transmembrane potentials that greatly exceed 1 V. Implicit in this discussion is the notion that the membrane is impervious to dielectric breakdown at extreme voltages.^{30–33} However, there are other experimental issues that preclude the use of such high voltages. For example, only a relatively small electrostatic potential drop (∼1% of the total field^{34–36}) biases the random walk of each negatively charged mononucleotide into the pore, and most of the potential will drop across the β-cyclodextrin adapter in the nanopore (see Fig. 2 in Clarke *et al.*^{23}). Thus, the corresponding mean residence time of each mononucleotide on the detector will decrease if the applied potential is increased.

Assuming that the reaction between the mononucleotides and the molecular adapter proceed with first order kinetics, the mean residence time of the mononucleotides in the αHL nanopore should decrease exponentially with potential, i.e., |$\tau (V) = \tau _0 e^{ - V/V_c } $|$\tau (V)=\tau 0e\u2212V/Vc$, where *V*_{c} is the mean potential of the distribution. That was indeed the case for thymine with applied potentials between 80 mV and 200 mV (τ_{0} = 0.55 s and *V*_{c} = 47 mV),^{23} but not for the other three mononucleotides (see molecular adapter chemistry below). Thus, for relatively small applied potentials, the normalized probability density of the residence times will be |$\rho _V (t^\prime ;\tau (V)) = e^{ - t^\prime /\tau (V)} /\tau (V)$|$\rho V(t\u2032;\tau (V))=e\u2212t\u2032/\tau (V)/\tau (V)$, for *t*^{′} ⩾ 0 and where *t*^{′} is the time from capture for a given nucleotide. We make the simplifying assumption that a given base cannot be read if it is bound to the adapter for times shorter than the inverse bandwidth (*B*) of the detection system. This absolute limit ignores the possible increase in noise with detector bandwidth that may further limit the probability of correctly reading a base. Nevertheless, the assumption is instructive and points to a limitation of increasing the capture probability through increased voltage. The total probability of successfully reading a base follows from the normalized probability density,

Assuming that the processes governing the capture and read of individual nucleotides are independent, we can multiply Eqs. (5) and (6) to calculate the total probability of sensing a single base

Figure 5 shows plots of the combined detection probability as a function of the transmembrane potential (Eq. (7)) for three detection bandwidths (*B* = 5 kHz, 30 kHz, and 500 kHz). The optimum probability, *P*_{max}, follows from Eq. (7) by setting ∂*P*/∂*V* = 0 and is shown as a function of bandwidth in Fig. 5 (inset). The optimum read probability initially increases as a function of bandwidth, but saturates at ∼0.75. This result suggests that increasing the detection bandwidth beyond 500 kHz will have a limited impact on the efficiency of reading a given nucleotide and will most likely be worse given that detector noise will grow with increasing bandwidth. However, at high potentials, the mean residence time distributions for the negatively charged nucleotides on the molecular adapter may only decrease linearly with voltage, not exponentially, if the potential drop across the adapter is much greater than the binding energy. The effect of this transition between the two regimes will cause the capture probability to decrease more slowly with increasing bandwidth than is shown in Fig. 5. Nevertheless, the probability of detecting a base will be limited in this extreme voltage regime.

In addition to the system bandwidth, which limits the fastest blockade events that can be accurately measured, the relatively small number of ions that probe each base during fast events will also be a limiting factor. For example, ∼160 ions flow through the pore per microsecond under the experimental conditions reported elsewhere (400 mM KCl, 180 mV^{23}), and therefore events that are measured at the 500 kHz bandwidth limit will have sufficiently large statistical counting error (∼8%).

### Molecular adapter chemistry

The choice of β-cyclodextrin as a molecular adapter for detecting and discriminating the four DNA mononucleotides^{23} was sensible because the two species have affinity for each other.^{20–22} However, the question remains whether that particular adapter has precisely the right chemistry for sequencing DNA.

Experimentally, it was determined that 40 *μ*M total bulk mononucleotide concentration causes ∼30 blockades s^{−1} of the αHL channel modified with the β-cyclodextrin adapter.^{23} An estimate of the collision rate between point particles in the bulk and pore mouth can be obtained using a relationship derived by Berg and Purcell.^{37} That rate (2*DdC*_{0}, where *D*, *d*, and *C*_{0} are the mononucleotide diffusion coefficient, pore diameter, and mononucleotide concentration in the bulk, respectively) is ∼58 000 blockades s^{−1}. Assuming that ∼60% of these particles reach the β-cyclodextrin adapter when *V* = 180 mV (Fig. 4), ∼30 000 events/second should have been observed. Thus, one might conclude that only ∼1 out of 1000 particles that collide with the pore mouth reached the adapter. However, it is more likely that many blockade events are simply too short-lived to detect given the electronic measurement bandwidth in the nanopore apparatus (∼10 kHz) and the additional digital filtering used to present the data (2 kHz).^{23}

Indeed, the binding constants for the reactions between mononucleotides with β-cyclodextrin in the bulk are weak^{20–22} (i.e., *K* = *k*_{off}*/k*_{on} ∼ 10 mM to 100 mM, where *k*_{off} and *k*_{on} are the dissociation and association rate constants, respectively). This suggests the reaction will have relatively short mean lifetimes (*τ* = 1/*k*_{off} ∼ 0.1 *μ*s to 1 *μ*s, assuming *k*_{on} ∼ 10^{9} M^{−1} s^{−1}), consistent with experimentally measured values (*τ* = 0.75 *μ*s to 2 *μ*s^{21,22}). If the binding constant for the complex in the pore is similar to that in the bulk, then most of the blockades would be too short-lived to resolve at 2 kHz bandwidth.^{23} Because Clarke *et al.* observed ∼0.1% of the events (see above), we estimate that the mean lifetimes are ∼70-fold greater when the adapter is attached to the inside of the pore, assuming the residence times of the bases on the adapter are exponentially distributed. The increase in mean residence time is due to the amines on the β-cyclodextrin.^{41} Finally, we note that the addition of one or more fixed positive charges on the adapter should facilitate considerably stronger binding with the mononucleotides, and thereby markedly increase the observed apparent capture rate.

### Analysis of the exonuclease/nanopore-based DNA sequencing algorithm

Figure 5 illustrates the limitations on capturing and reading a single base cleaved by the exonuclease, but suggests that for a given electronic measurement bandwidth, we can select an optimum voltage to maximize the likelihood of correctly reading a single base with probability *p*. Assuming the probability of discriminating between the four bases is unity (but see Fig. 1) and that the system treats all bases alike, the probability of correctly sequencing the first *m* bases of a polynucleotide is *p*^{m}. However, because *p <* 0.75 (Fig. 5), accurately sequencing a single polynucleotide molecule in this manner is not practical. The statistics could of course be improved by reading many identical copies of the molecule by either the same nanopore or by an array of identical pores.

We next describe two algorithms to address the question of how many identical DNA copies must be measured to deduce the polynucleotide's sequence with a given degree of accuracy. We first adapt a “survivor” algorithm^{38} in which a number of sequence measurements (i.e., subsequences), *c*, are aligned from one end, and an assignment of the sequence is based on the majority of assignments from each subsequence. On average, only *cp* subsequences survive the first “vote,” and the remaining *c*(1 − *p*) subsequences are discarded. For the second letter in the sequence, the *cp* remaining subsequences will have, on average, *cp*^{2} majority “votes” and the remaining *cp*(1 − *p*) subsequences are discarded. Therefore, the average number of remaining subsequences after *T* “votes” is *N*_{T} = *cp*^{T} and the maximum expected number of correctly sequenced letters *T*_{max} is

Figure 6 shows the results of a numerical simulation using this sequencing algorithm. As an example, we generate *N*_{seq} subsequences, each with an average accuracy of 80% (*p* = 0.8, assuming random deletions of letters, but no misreads) for a 100-mer DNA strand. We then calculate the probability of accurately reconstructing the original sequence using the algorithm described above. Finally, we fit the simulated results using Eq. (8) with *N*_{T} as an adjustable parameter and a combined capture and discrimination probability of *p* = 0.8 (a slight overestimate of this sequencing engine's potential, Fig. 5).

The curve in Fig. 6 (open squares) highlights the drawbacks of discarding sequences, especially as the algorithm progresses. For 10^{3} subsequences (potentially obtained from reading identical copies using either a single nanopore or an array of nanopores), the amount of correctly read bases is only 25%, which increases to 45% and 80% with the use of 10^{5} and 10^{8} nanopores, respectively (not shown). Given the increased uncertainty arising from discarding data, this method may not be amenable to accurately measuring long sequences. However, it may be possible to improve this approach by breaking the original sequence into a sufficient number of overlapping sequences of shorter lengths. These shorter strands could be sequenced and further analyzed with existing sequencing methodologies based on DNA subset overlapping.

The above algorithm is vastly improved by retaining information in the “discarded” subsequences. As before, if we assume that the errors in reading a given base are from deletions only (i.e., perfect discrimination between the bases), then at the end of one voting round we left-shift every subsequence that agrees with the majority and hold all other sequences at their previous positions. The results of numerical simulations show that this relatively simple modification permits the accurate reading of ∼85 bases of a 100 mer's sequence (*p* = 0.8) with only ∼200 subsequences (Fig. 6, open circles).

### Less than perfect discrimination between nucleotides

Because the mononucleotide discrimination afforded by the β-cyclodextrin adapter is not perfect (Figs. 1(a)–1(e)), the probability of correctly reading DNA sequences with the exonuclease/nanopore-based technique would be proportionally lower than the estimates described above. How to achieve better discrimination with an internal molecular adapter is not yet clear. Further study into this problem would be helpful for both the separation of the four bases and the development of an adapter with sufficiently strong affinity for the mononucleotides.

## CONCLUSIONS

An analysis of a proposed exonuclease/nanopore-based DNA sequencing engine illustrates that the system has significant merit, but several aspects of the technique need improvement. Our analysis shows that many bases will escape to the bulk before being detected in the proper sequence. The results also show that increasing the applied transmembrane potential, which also increases the potential just outside the pore, will increase the cleaved mononucleotide capture efficiency. However, the increased potential decreases the lifetime of a given mononucleotide on the molecular adapter, which will therefore decrease the probability of reading each base correctly (or even detecting them). The results also show that because of the relatively low probability of capturing a given base (∼74%), the number of DNA molecules that would need to be read scales exponentially with the number of bases. Based on the ability of a molecular adapter in the αHL nanopore^{23} to discriminate between different mononucleotides (average accuracy = 92%), the probability of correctly reading a given base is at best 68%, which is likely insufficient for sequencing even moderate length strands of DNA.

Those conclusions notwithstanding, there are several possibilities for improving the exonuclease/nanopore-based DNA sequencing method. For example, the capture probability (*P*_{cap}) might be increased by orienting the exonuclease in such a manner that the cleaved mononucleotide is closer to the pore mouth. Conceivably, *P*_{cap} could also be increased by increasing the access resistance on the side that contains the exonuclease, but engineering that capability in a massively parallel array would need to be developed. Finally, the use of computer simulations may lead to the design of more accurate molecular adapters, if the required stereochemistry can be rationally designed and produced.

## ACKNOWLEDGMENTS

This work was supported in part by grants from National Institute of Standards and Technology (NIST) Office of Law Enforcement Standards (J.J.K., J.W.F.R., J.E.R.), the Wheaton College Alumni Foundation (D.L.B., B.S.D.), the National Science Foundation (NSF) (D.L.B.), and NRC/NIST NIH Research Fellowship (A.B.). We thank Dr. James Russo at Columbia University for helpful comments on the paper.

### APPENDIX: DETAILS OF THE CALCULATIONS

##### Drift velocity

The drift velocity of each base originates from electrophoretic and electro-osmotic effects. The electrophoretic velocity of a charged particle in an electric field is *v*_{ep} = *μE*, where *μ* is the electrophoretic mobility.

A net electro-osmotic velocity will occur in an ion selective nanopore. The net flux of ions through a nanopore under an applied potential is^{39}

where *J*_{±} is the net flux of cations or anions through the nanopore, *g* is the nanopore conductance, *V* is the applied transmembrane potential, *q* is the electric charge on each ion, and *P*_{+}*/P*_{−} is the nanopore ion selectivity ratio. Each ion is hydrated by a number of water molecules, *N*_{W}, as it flows through the pore. Therefore, the flux of water molecules through the nanopore is given by *J*_{W} = *N*_{W}*J*_{±} and the resulting net electro-osmotic velocity *v*_{eo} of water through the nanopore *v*_{eo} = *J*_{w} / *c*_{w}*A*_{p}, where *c*_{w} is the molar concentration of water and *A*_{p} is the average cross-sectional area of the nanopore.

We assume the applied potential creates a uniform electric field inside the nanopore. This condition is relaxed in another report, in which numerical simulations consider more complex electric field distributions. Nevertheless, the salient features of the electric field dependence can be understood here.

Combining the electro-osmotic and electrophoretic terms leads to the following expression for the drift velocity in terms of the applied transmembrane:

For comparison to numerical simulations, we use the following parameters: *μ* = −2.4 × 10^{−4} cm^{2}/Vs, *L*′ = 7.5 × 10^{−7} cm, *g* = 2.95 × 10^{−10} C/Vs, *N*_{W} = 10, *c*_{W} = 3.4 × 10^{22} cm^{−3}, *A*_{p} = 5.3 × 10^{−14} cm^{2}, *e* = 1.6 × 10^{−19} C, *P*_{+}/*P*_{−} = 0.1. The computer simulation assumes there is a small drop of the potential outside the pore. Thus *L*′ > *L*. Substitution of these values into Eq. (A2) leads to the following drift velocity voltage dependence:

It is worth noting that the largest possible electro-osmotic contribution that would enhance the drift of nucleotides through the nanopore occurs when *P*_{+}/*P*_{−} = 0 (i.e., purely anion selective). In this case, the prefactor in Eq. (A3) would be slightly greater (−330 cm/Vs).

##### Details of the model

In the given geometry, the drift velocity points towards *z* = 0 so *v* < 0. The Laplace transform of Eqs. (1) and (2) leads to

and

We solve for *G* and *ϕ*(*s*) subject to the absorption and radiating boundary conditions, setting |$\hat G\left( {L - \delta,s} \right)_l\break = \hat G\left( {L - \delta,s} \right)_r$|$G\u0302L\u2212\delta ,sl=G\u0302L\u2212\delta ,sr$ and |$\hat G^\prime \left( {L - \delta,s} \right)_l = \hat G^\prime \left( {L - \delta,s} \right)_r $|$G\u0302\u2032L\u2212\delta ,sl=G\u0302\u2032L\u2212\delta ,sr$ (where *l* and *r* refer to the limit from the left and right side of *L*-*s* and the prime refers to differentiation with respect to *z*), and taking the limit as δ → 0. This leads to the Eq. (4) in the text.

##### Probability of discriminating mononucleotides

Each mononucleotide can be uniquely identified by the degree to which it blocks the ionic current in the pore (Fig. 1). The probability of identifying each nucleotide (A, T, G, or C) is calculated by fitting the data in the figure to four Voigt profiles (Igor Pro, WaveMetrics, Inc., Lake Oswego, OR, USA). Equation (A7) gives the probability density of each peak, where *i* represents the blockade current, α is the amplitude, σ is the width, *μ* is the location of the peak, and γ is a shape parameter:^{24}

The probability of uniquely discriminating each base is given by the area under each peak in Fig. 1(b) using the expression, |$P_d = \int\nolimits_{i_1 }^{i_2 } {\Phi (i;\alpha,\sigma,\mu,\gamma)di}$|$Pd=\u222bi1i2\Phi (i;\alpha ,\sigma ,\mu ,\gamma )di$, with the limits of integration (*i*_{1} and *i*_{2}) set to the points of intersection between neighboring peaks. The probabilities of uniquely identifying a mononucleotide are then calculated to be 0.94, 0.93, 0.85, and 0.95 for dGMP, dTMP, dAMP, and dCMP, respectively. Finally, the average probability of uniquely discriminating a single mononucleotide is 0.92.

##### Numerical simulations

The diffusion simulator uses bulk physical parameters (such as the analyte diffusion constant, electrophoretic mobility, nanopore selectivity, total transmembrane voltage, etc.) to perform a random walk in 3D, constrained by the protein and membrane volumes. No periodic boundary conditions are implemented, because each particle is allowed to translate until it is successfully captured, or a specified maximum run time is exceeded (typically 1 *μ*s). Each diffusing species moves with a randomly determined cardinal direction and a step length determined by sampling from a Gaussian distribution with a mean of zero and a standard deviation (Eq. (A8)), *dx*, which is related to the characteristic diffusion coefficient:^{37}

As long as the simulation time step (*dt*) is small, translational motion inside the nanopore can be modeled accurately. A time step of 1 ps (step size = 40 pm) produces trajectories with adequate spatial resolution. Drift distance arising from electrophoretic and EO flow (Eq. (A2)) over the 1 ps time step is added to *dx* to give the particle displacement.

Nucleotides are simulated as point particles released from *z* = 0 (Fig. 1) and reflect off the ion channel lumen walls. The top exterior of the nanopore extends radially to infinity and is also assumed to be reflective. Nucleotides freely migrate above the nanopore in bulk solution without geometric constraints. The capture probability is defined as the ratio of the number of successful nucleotide captures to the total number of nucleotides used in the simulation.