A fast and accurate sampling method is in high demand in order to bridge the large gap between molecular dynamics simulations and experimental observations. Recently, an integrated tempering enhanced sampling (ITS) method was proposed and successfully applied to various biophysical examples, significantly accelerating conformational sampling. However, the mathematical validation of its effectiveness has not yet been elucidated. Here we show that the integrated tempering enhanced sampling method can be viewed as a reformulation of the infinite switching limit of the simulated tempering method over a mixed potential. Moreover, based on the large deviation principle for empirical distributions, we demonstrate that the efficiency of simulated tempering molecular dynamics improves as the frequency of switching between the temperatures is increased. Our theory thus provides a theoretical justification of the advantage of ITS. Finally, we illustrate the utility of the infinite switching simulated tempering method through several numerical examples.

## I. INTRODUCTION

Molecular dynamics simulation is a powerful method for investigating microscopic biochemical systems. However, when the system contains energetic or entropic barriers, direct molecular dynamics simulation incurs a very high computational cost. One therefore turns to enhanced sampling methods, sacrificing the real dynamical information in order to capture the desired Gibbs distribution. To date, sampling methods have been successfully applied in multiple disciplines, including statistical physics, chemical physics, Bayesian statistics, machine learning, and related areas.

The efficiency of the sampling approach highly depends on the convergence rate to the thermodynamic equilibrium. Tempering schemes, such as simulated tempering molecular dynamics (STMD)^{1,2} and replica exchange molecular dynamics (REMD),^{3–5} were developed to overcome metastability and to enhance the convergence to equilibrium, and they are among the most popular methods thanks to their effectiveness and ease of use. The basic idea of both is to use one or more artificial high temperatures to accelerate the exploration of the conformational space, while using the physical temperature to sample the desired physical observables. The interaction between the different temperatures is designed so as to guarantee the unbiasedness of the estimates. The two methods are quite comparable to each other: STMD gives a higher rate of delivering the system between different temperature states as well as a higher rate of traversing the energy space,^{6} but requires an estimate of the partition functions first.

Another tempering algorithm, named “integrated tempering enhanced sampling” (ITS) and recently introduced by Gao,^{7} uses a temperature-biased effective potential energy to run the molecular dynamics (MD) simulation and has been successfully applied to various biochemical examples.^{8,9} Since ITS also uses auxiliary temperatures to accelerate the convergence, it is desirable to compare it with the earlier methods and to understand why it performs better in many cases, which, to the best of our knowledge, has not been studied theoretically.

As one of the main results in this paper, we discover that the integrated tempering enhanced sampling method is in fact a reformulated version of the infinite switching limit of STMD, that is, the limit of simulated tempering when the attempt switching frequency goes to infinity. Moreover, using the large deviation principle, we show that as the switching rate in STMD increases, the empirical measure converges faster toward the stationary distribution; thus the sampling efficiency of STMD increases with the switching frequency. Combining the two theoretical findings, we justify the efficiency of ITS over conventional STMD. Finally, we compare the infinite switching STMD against normal STMD on two numerical examples: an artificial high-dimensional system and a more realistic Lennard-Jones example^{10} with 16 atoms. The numerical results validate our theoretical findings.

Our study of the switching rate of STMD is closely related to recent progress in understanding the swapping frequency of tempering schemes, starting with replica exchange methods. For REMD, it has been discovered through numerical examples that the sampling efficiency increases when the frequency of swapping of temperatures is pushed up to infinity;^{11} but directly increasing the swapping rate to reach this limit is computationally infeasible, as many swaps are needed per MD step. A breakthrough was made in Refs. 12 and 13, which proposed an explicit way to reach this limit and also proved that the sampling efficiency indeed increases in the limit. A natural reformulation of infinite swapping REMD was later proposed in Ref. 14, which leads to an easy implementation as a simple patch to conventional molecular dynamics. For multiple temperatures, an efficient implementation based on ideas of a multiscale integrator was also proposed for infinite swapping REMD,^{15,16} which leads to practical applications in sampling the configurational space of large biomolecules. We note that increasing the switching rate in simulated tempering has been mentioned in Refs. 16 and 17 under a general framework of viewing tempering schemes as an MD process augmented by a jump process, but without much detail and, in particular, without the connection with ITS.

In the remainder of this article, we first recall the simulated tempering molecular dynamics as a stochastic process in Sec. II. We justify using a large switching frequency via the large deviation principle for the empirical distribution in Sec. III, with a detailed derivation given in the Appendix. The infinite switching limit of STMD and its reformulation are derived and generalized in Secs. IV and V. The identification of ITS with infinite switching STMD is given and discussed in Sec. VI, followed by numerical examples in Sec. VII. Some concluding remarks are given in Sec. VIII.

## II. SIMULATED TEMPERING MOLECULAR DYNAMICS

We begin by recalling the simulated tempering molecular dynamics. For the sake of simplicity, we will first discuss the algorithm for a system governed by the overdamped Langevin equation (i.e., when inertia can be neglected); more general dynamics will be discussed in Sec. V. Consider

$$\dot{x}(t) = f(x(t)) + \sqrt{2\beta^{-1}}\,\eta(t), \qquad (1)$$

where $x \in \mathbb{R}^{3n}$ denotes the configuration of the system with $n$ particles, $f(x) = -\nabla_x V(x)$ is the force associated with the potential $V(x)$, $\beta = 1/(k_B T)$ is the inverse temperature, and $\eta$ is a $3n$-dimensional white noise with independent components. We choose a unit system in which the friction coefficient is one for notational simplicity. Equation (1) is consistent with the Boltzmann equilibrium probability density

$$\varrho_{\beta}(x) = \frac{1}{Z_{\beta}}\,e^{-\beta V(x)}, \qquad (2)$$

where the normalization constant $Z_{\beta} = \int_{\mathbb{R}^{3n}} e^{-\beta V(x)}\,dx$ is the partition function.

Assuming we only take two temperatures in the simulated tempering (the generalization to more temperatures will be considered in Sec. V), we replace (1) by

$$\dot{x}(t) = f(x(t)) + \sqrt{2\beta(t)^{-1}}\,\eta(t), \qquad (3)$$

where the inverse temperature $\beta(t)$ now attempts switches between the physical and the artificial inverse temperatures with frequency $\nu$, and an attempted switch from $\beta_1$ to $\beta_2$ is accepted with the probability

$$g_{\beta_1,\beta_2}(x) = \min\Big(1, \frac{n_2\,e^{-\beta_2 V(x)}}{n_1\,e^{-\beta_1 V(x)}}\Big), \qquad (4)$$

where *n*_{1} and *n*_{2} are some weighting parameters that will be further discussed below.

Note that, in the simulated tempering overdamped dynamics, both the configuration $x$ and the inverse temperature $\beta$ are dynamical variables: $x$ follows (3), $\beta$ follows a jump process, and the two dynamics are coupled. To understand the dynamics, it is useful to consider its infinitesimal generator, given by (we use $\beta'$ to denote the temperature other than $\beta$, namely, $\beta' = \beta_2$ if $\beta = \beta_1$ and vice versa)

for a smooth function $u: \mathbb{R}^{3n}\times\{\beta_1,\beta_2\} \to \mathbb{R}$. This means that the evolution of a physical observable $u$ under the dynamics is given by

or equivalently, the density $\rho$ of $(x,\beta)$ evolves under the adjoint operator

For the infinitesimal generator, the first two terms on the right-hand side of (5) correspond to the overdamped dynamics (3), while the last two terms correspond to the jump process of $\beta$: $-\nu g_{\beta,\beta'}(x)$ being the rate of switching from temperature $\beta$ to $\beta'$, and $\nu g_{\beta',\beta}(x)$ the rate of switching to $\beta$ from the other temperature $\beta'$ (the attempt switching frequency adjusted by the acceptance rate).

It is straightforward to check the equilibrium distribution of the coupled dynamics of $(x,\beta)$: the stationary solution of (7) is given by

$$\varrho(x,\beta) = \frac{n_1\,e^{-\beta_1 V(x)}\,\delta_{\beta,\beta_1} + n_2\,e^{-\beta_2 V(x)}\,\delta_{\beta,\beta_2}}{n_1 Z_{\beta_1} + n_2 Z_{\beta_2}}, \qquad (8)$$

where $\delta_{\beta,\beta_1}$ denotes the Kronecker delta: $\delta_{\beta,\beta_1} = 1$ if and only if $\beta = \beta_1$. This is nothing but a weighted average of the Boltzmann densities at temperatures $\beta_1$ and $\beta_2$, with weights $n_1 Z_{\beta_1}$ and $n_2 Z_{\beta_2}$, respectively. At each fixed temperature, the distribution is generated by the ordinary molecular dynamics and thus follows the Boltzmann distribution; meanwhile, at each fixed configuration, the proportion of the two temperatures is given by

$$\omega_j(x) = \frac{n_j\,e^{-\beta_j V(x)}}{n_1\,e^{-\beta_1 V(x)} + n_2\,e^{-\beta_2 V(x)}}, \quad j = 1, 2, \qquad (9)$$

as determined by the acceptance probability (4). Therefore, (3) together with the jump process of $\beta$ samples $\varrho(x,\beta)$ as the equilibrium distribution.

Summing $\varrho(x,\beta)$ over $\beta$ then gives the marginal equilibrium density for the configuration alone,

$$\varrho(x) = \frac{n_1\,e^{-\beta_1 V(x)} + n_2\,e^{-\beta_2 V(x)}}{n_1 Z_{\beta_1} + n_2 Z_{\beta_2}}, \qquad (10)$$

which is a weighted average of the Boltzmann densities at the two temperatures $\beta_1$ and $\beta_2$. As a result, the ensemble average at the physical temperature ($\beta_1$) of any observable $A(x)$ can be estimated from

$$\langle A\rangle_{\beta_1} = \frac{\displaystyle\lim_{T\to\infty}\frac{1}{T}\int_0^T A(x(t))\,w(x(t))\,dt}{\displaystyle\lim_{T\to\infty}\frac{1}{T}\int_0^T w(x(t))\,dt}, \qquad w(x) = \frac{e^{-\beta_1 V(x)}}{n_1\,e^{-\beta_1 V(x)} + n_2\,e^{-\beta_2 V(x)}}, \qquad (11)$$

where we have used that the ensemble average equals the time average, thanks to ergodicity. Note that, in principle, one should use two independent realizations to estimate the numerator and denominator on the right-hand side of (11) to obtain an unbiased estimator; in practice, however, the bias is well controlled, and the variance of the estimator usually dominates the error.
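To make the reweighting concrete, the estimator on the right-hand side of (11) can be sketched in Python as follows for two temperatures. The function name and array-based interface are our own illustration (not part of the original algorithm), and the weight is evaluated in log space to avoid overflow for large energies.

```python
import numpy as np

def reweighted_average(A_vals, V_vals, beta1, beta2, n1, n2):
    """Estimate <A> at the physical inverse temperature beta1 from a
    simulated-tempering trajectory, using the reweighting factor
    w(x) = e^{-beta1 V} / (n1 e^{-beta1 V} + n2 e^{-beta2 V}),
    which simplifies to 1 / (n1 + n2 e^{(beta1 - beta2) V})."""
    V = np.asarray(V_vals, dtype=float)
    # Work in log space for numerical stability.
    log_w = -np.logaddexp(np.log(n1), np.log(n2) + (beta1 - beta2) * V)
    w = np.exp(log_w - log_w.max())  # a common factor cancels in the ratio
    return np.sum(np.asarray(A_vals, dtype=float) * w) / np.sum(w)
```

When the energies along the trajectory are equal, the weights are uniform and the estimator reduces to the plain time average, as it should.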

Note that the above holds for arbitrary positive weighting factors $n_1$ and $n_2$, while the sampling efficiency of the method depends on their choice. The conventional wisdom in the simulated tempering method is to choose $n_1$ and $n_2$ according to the partition functions,

$$n_i = Z_{\beta_i}^{-1}, \quad i = 1, 2. \qquad (12)$$

With this choice, we have

$$\int_{\mathbb{R}^{3n}} \varrho(x,\beta_i)\,dx = \frac{n_i Z_{\beta_i}}{n_1 Z_{\beta_1} + n_2 Z_{\beta_2}} = \frac{1}{2}, \quad i = 1, 2, \qquad (13)$$

and thus the sampler spends an equal amount of time at the two temperatures. It is also possible to choose different $n_i$ to emphasize one of the temperatures. Of course, the partition functions $Z_{\beta_i}$ are not known *a priori*, and hence the usual approach is to start with some initial guess of the weighting factors and then adaptively adjust them on the fly using an iterative method; see, e.g., Refs. 7 and 17–19. In what follows, we will mainly focus on the choice of the switching frequency $\nu$; for simplicity of discussion, we will assume that the $n_i$ are fixed during the sampling process and only make some remarks about adjusting them afterwards.
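As a concrete illustration of the scheme in this section, the following sketch integrates (3) by Euler–Maruyama and attempts Metropolis switches (4) between two inverse temperatures at rate $\nu$. The one-dimensional setting, function names, and the Poisson-arrival model for switch attempts are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def simulated_tempering(force, V, betas, n_weights, nu, dt, n_steps, x0, rng):
    """Integrate the overdamped dynamics (3) by Euler-Maruyama while the
    inverse temperature beta(t) jumps between betas[0] and betas[1] via
    Metropolis attempts (4) arriving at rate nu."""
    x, k = float(x0), 0                 # k indexes the current inverse temperature
    xs = np.empty(n_steps)
    ks = np.empty(n_steps, dtype=int)
    p_attempt = 1.0 - np.exp(-nu * dt)  # probability of a switch attempt per step
    for i in range(n_steps):
        x += force(x) * dt + np.sqrt(2.0 * dt / betas[k]) * rng.standard_normal()
        if rng.random() < p_attempt:
            j = 1 - k
            # log of the acceptance ratio in (4): n_j e^{-beta_j V} / (n_k e^{-beta_k V})
            log_acc = (np.log(n_weights[j]) - betas[j] * V(x)) \
                    - (np.log(n_weights[k]) - betas[k] * V(x))
            if np.log(rng.random()) < min(0.0, log_acc):
                k = j
        xs[i] = x
        ks[i] = k
    return xs, ks
```

A typical use would pass a double-well potential `V(x) = (x**2 - 1)**2` with its force `-4*x*(x**2 - 1)` and reweight the output with the estimator of (11).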

## III. LARGE DEVIATION PRINCIPLE FOR EMPIRICAL MEASURE

As we have discussed above, for any choice of the switching frequency $\nu$, the trajectory $(x^{\nu}, \beta^{\nu})$ (we put the superscript $\nu$ to emphasize the $\nu$ dependence) of the simulated tempering overdamped dynamics can be used to sample the equilibrium distribution (8). In particular, the empirical distribution of the trajectory, defined below,

$$\lambda_T^{\nu} := \frac{1}{T-B}\int_B^T \delta_{(x^{\nu}(t),\,\beta^{\nu}(t))}\,dt, \qquad (14)$$

will converge to the equilibrium distribution $\varrho$ as $T \to \infty$. Here $\delta_{(x,\beta)}$ denotes the Dirac delta function, that is, the point mass at $(x,\beta)$, and $B$ is a fixed burn-in time (which for simplicity will be taken as $B = 0$ in the numerical experiments).

To discuss the choice of $\nu$, it is then important to quantify the speed of convergence of the empirical distribution (14) to the equilibrium: a better $\nu$ corresponds to faster convergence and thus a shorter trajectory. To quantify this convergence for STMD, the theoretical tool we use is the large deviation principle for the empirical measure of stochastic processes. We discuss the idea and conclusions of the large deviation principle below, and defer the rigorous definitions and derivation to the Appendix.

When $T$ becomes large, the empirical distribution is expected to be very close to the equilibrium. The large deviation principle quantifies the probability that the empirical distribution is still far away from the equilibrium: more specifically, the probability that the empirical distribution equals $\mu$ is of order $O(e^{-I^{\nu}(\mu)T})$, with a rate functional $I^{\nu}(\mu) \ge 0$ (specific form given below) that vanishes only when $\mu$ is the equilibrium distribution $\varrho$. This tells us, in particular, that as $T \to \infty$, the likelihood that the empirical distribution deviates from the equilibrium is exponentially small, with the functional $I^{\nu}$ quantifying the rate of the exponential decay. Hence a larger rate functional indicates a faster rate of convergence.

To specify the rate functional, let $\mu$ be a probability measure on $\mathbb{R}^{3n}\times\{\beta_1,\beta_2\}$ with a smooth density, and define $\theta(x,\beta) := [d\mu/d\varrho](x,\beta)$, the ratio of the probability density of $\mu$ to that of the equilibrium distribution. For the simulated tempering process with switching frequency $\nu$, the large deviation rate functional for the empirical measure converging to the stationary one is given by

$$I^{\nu}(\mu) = J_0(\mu) + \nu J_1(\mu), \qquad (15)$$

where

Note that $J_1$ is non-negative, and hence the large deviation rate functional is pointwise monotonically increasing in $\nu$. Thus, we conclude that a larger switching rate $\nu$ corresponds to a faster convergence of the empirical distribution to equilibrium.

## IV. INFINITE SWITCHING LIMIT

As shown in Sec. III, the higher the switching rate $\nu$, the faster the convergence of the empirical distribution, and thus the more efficient the sampling of the simulated tempering overdamped dynamics. On the other hand, a large $\nu$ requires making many switching attempts, which slows down the actual simulation. The key to resolving this problem is that the limit $\nu \to \infty$ can be taken explicitly. This observation was first made for the replica exchange dynamics in Ref. 13. As for simulated tempering, when the switching frequency between the temperatures $\nu \to \infty$, the dynamics (3) converges to the following:

where the effective temperature $T(x) = \beta_1^{-1}\omega_1(x) + \beta_2^{-1}\omega_2(x)$ is a weighted average of $T_1 = \beta_1^{-1}$ and $T_2 = \beta_2^{-1}$, with the weights, as functions of $x$, given by (9). Intuitively, this can be understood as follows: with the configuration $x$ fixed, the proportion of time the trajectory spends at temperature $\beta_j$ is given by $\omega_j(x)$, and thus the effective temperature is the weighted average of the two temperatures.

The Fokker-Planck equation for the probability density of $x(t)$ corresponding to (18) can be written as

where we have defined the effective potential

$$U(x) = -\beta_1^{-1}\ln\frac{n_1\,e^{-\beta_1 V(x)} + n_2\,e^{-\beta_2 V(x)}}{n_1 Z_{\beta_1} + n_2 Z_{\beta_2}} \qquad (20)$$

and the mobility

It is easy to check that the stationary solution of (19) is $\exp(-\beta_1 U(x))$, which is just $\varrho(x)$ by the definition of $U$.

Note that this point of view also leads to a further simplification of (18), in the spirit of Lu and Vanden-Eijnden.^{14} We may replace $B$ in the Fokker-Planck equation (19) by a constant mobility, which, for convenience, we will simply take to be the identity. This substitution does not affect the stationary distribution, and hence preserves the sampling property, but it changes the overdamped Langevin dynamics associated with the Fokker-Planck equation. It is easy to see that this new dynamics is given by

$$\dot{x} = \Big(\omega_1(x) + \frac{\beta_2}{\beta_1}\,\omega_2(x)\Big) f(x) + \sqrt{2\beta_1^{-1}}\,\eta. \qquad (22)$$

Compared with (18), the noise term in (22) is additive, and the dynamics still samples the desired stationary distribution $\varrho$. Note that (22) is very similar to the original overdamped Langevin dynamics (1); the only change is a scaling factor in front of the forcing term, which involves the auxiliary (inverse) temperature $\beta_2$ and the weights $\omega_j$, $j = 1, 2$. In practice, (22) can be implemented as an easy patch to existing molecular dynamics codes.

We refer to this infinite switching simulated tempering method as the IST method.
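The patch described above can be sketched in a few lines of Python: each step computes the configuration-dependent weights of form (9) and rescales the force as in (22), while the noise stays at the physical temperature. The helper names and the log-space stabilization of the weights are our own choices, not part of the original presentation.

```python
import numpy as np

def its_weights(V_x, betas, n_weights):
    """Weights omega_j(x) proportional to n_j exp(-beta_j V(x)),
    normalized to sum to one; computed in log space for stability."""
    log_w = np.log(np.asarray(n_weights, dtype=float)) \
          - np.asarray(betas, dtype=float) * V_x
    log_w -= log_w.max()  # subtract the max before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def infinite_switching_step(x, force, V, betas, n_weights, dt, rng):
    """One Euler-Maruyama step of (22): the force is rescaled by
    sum_j omega_j(x) beta_j / beta_1, while the additive noise stays at
    the physical inverse temperature beta_1 = betas[0]."""
    w = its_weights(V(x), betas, n_weights)
    scale = np.dot(w, betas) / betas[0]
    return x + scale * force(x) * dt \
             + np.sqrt(2.0 * dt / betas[0]) * rng.standard_normal()
```

Note that the same routine covers more than two temperatures: the weight array simply grows with the number of temperature levels.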

## V. GENERALIZATIONS

First, the idea in Sec. IV extends naturally to more than two temperatures, in which case the dynamics in the infinite switching limit is given by

$$\dot{x} = \Big(\sum_{k=1}^{N}\frac{\beta_k}{\beta_1}\,\omega_k(x)\Big) f(x) + \sqrt{2\beta_1^{-1}}\,\eta, \qquad (23)$$

where we define the weights as

$$\omega_j(x) = \frac{n_j\,e^{-\beta_j V(x)}}{\sum_{k=1}^{N} n_k\,e^{-\beta_k V(x)}}, \quad j = 1, \dots, N, \qquad (24)$$

with $N$ the number of temperatures used. Similarly, the dynamics (23) can be viewed as the overdamped dynamics with the effective potential given by (20), where the corresponding marginal equilibrium density becomes

$$\varrho(x) = \frac{\sum_{k=1}^{N} n_k\,e^{-\beta_k V(x)}}{\sum_{k=1}^{N} n_k Z_{\beta_k}}, \qquad (25)$$

as we have

$$e^{-\beta_1 U(x)} = \frac{\sum_{k=1}^{N} n_k\,e^{-\beta_k V(x)}}{\sum_{k=1}^{N} n_k Z_{\beta_k}} = \varrho(x). \qquad (26)$$

Thus the negative gradient of the effective potential $U(x)$ is exactly the forcing term in (23).

The infinite switching simulated tempering can also be generalized to the (underdamped) Langevin equation, rather than the overdamped Langevin dynamics (1); other thermostats can be used as well. Recall the Langevin equation

$$m\ddot{x} = f(x) - \gamma\,\dot{x} + \sqrt{2\gamma\beta^{-1}}\,\eta, \qquad (27)$$

where $m$ denotes the mass and $\gamma$ the friction coefficient. The generalization of (22) reads

$$m\ddot{x} = \Big(\sum_{k=1}^{N}\frac{\beta_k}{\beta_1}\,\omega_k(x)\Big) f(x) - \gamma\,\dot{x} + \sqrt{2\gamma\beta_1^{-1}}\,\eta. \qquad (28)$$

## VI. CONNECTION WITH INTEGRATED TEMPERING ENHANCED SAMPLING

The infinite switching limit of the simulated tempering sampling scheme is very closely related to the integrated tempering enhanced sampling (ITS) algorithm originally proposed in Ref. 7. The ITS algorithm introduces a temperature-biased effective potential energy

$$U_{\mathrm{eff}}(x) = -\beta_1^{-1}\ln\sum_{k=1}^{N} n_{\beta_k}\,e^{-\beta_k V(x)} \qquad (29)$$

and runs the molecular dynamics simulation on this surface. Here $n_{\beta_k}$ is a weighting factor for the temperature $\beta_k$. This is identical to what we have in (20), with the marginal equilibrium $\varrho(x)$ given by (25), up to an additive constant which is immaterial for a potential energy.

In conclusion, the ITS algorithm can be viewed as the infinite switching limit of the simulated tempering algorithm. As discussed in Sec. III, the sampling efficiency of the simulated tempering method increases as $\nu \to \infty$; as a corollary, the ITS algorithm samples more efficiently than the simulated tempering algorithm at any finite switching rate. This will be further demonstrated by the numerical examples in Sec. VII.

## VII. NUMERICAL EXAMPLES

### A. Simple high dimensional example

We first consider a system in $D$ dimensions moving on the following potential with $x = (x_0, \dots, x_{D-1})$:

where $\lambda_1, \lambda_2, \dots, \lambda_{D-1}$ are parameters controlling the stiffness of the harmonic potential in the $x_1, \dots, x_{D-1}$ directions. We take the physical inverse temperature $\beta_0 = 25$ and five artificial inverse temperatures $\beta_k = 25 \times 2^{-k}$, $k = 1, \dots, 5$. For each $\beta_k$, the weighting parameter $n_k$ is set to the inverse of the corresponding partition function, i.e., $Z_{\beta_k}^{-1}$, which varies with the dimension. When simulating with STMD, we use

$$\dot{x} = \frac{\beta(t)}{\beta_0}\,f(x) + \sqrt{2\beta_0^{-1}}\,\eta,$$

so that its infinite switching limit coincides with (23) [rather than with a multi-temperature version of (18) for the original overdamped dynamics (3)]. At a finite switching rate, we only consider switching attempts of $\beta(t)$ from some $\beta_k$ to the adjacent $\beta_{k-1}$ (if $k > 0$) or $\beta_{k+1}$ (if $k < 5$); in each attempt, we first decide whether to switch up or down with equal probability when $0 < k < 5$. The total simulation time is $T_{\mathrm{tot}} = 2.5 \times 10^6$ with time step $dt = 0.025$, so the total number of steps is $N_{\mathrm{tot}} = 10^8$. As mentioned previously, we take no burn-in period when estimating the averages of physical observables.

To compare the algorithms, we calculate the asymptotic variance of the observable *V*(** x**) using a batch estimation,

where $AV$ stands for the asymptotic variance and $WS$ for the window size of the batches. The results are shown in Figs. 1 and 2 for $D = 1$ and $D = 10$, respectively. We observe that, at finite switching rates, STMD at frequency $\nu = 1$ has a lower asymptotic variance than at $\nu = 0.1$, and the infinite switching limit has a lower asymptotic variance still. We also observe that, for this example, simulated tempering with $\nu = 1$ is already quite close to the infinite switching limit.
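The batch estimate can be sketched as follows; this is the standard batch-means construction (non-overlapping windows, window means, rescaled sample variance), which may differ in constants or conventions from the exact formula used in the text.

```python
import numpy as np

def batch_means_avar(samples, window_size):
    """Batch-means estimate of the asymptotic variance of a time average:
    split the trajectory into non-overlapping windows of length
    window_size, and scale the empirical variance of the window means
    by the window size."""
    n_batches = len(samples) // window_size
    trimmed = np.asarray(samples[: n_batches * window_size], dtype=float)
    means = trimmed.reshape(n_batches, window_size).mean(axis=1)
    return window_size * means.var(ddof=1)
```

For an uncorrelated sequence this estimate converges to the plain variance of the samples; for a correlated trajectory it grows with the correlation time, which is what makes it a useful efficiency diagnostic here.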

### B. Dimer in solvent example

Now, to test the performance of the algorithm on a more realistic example, we apply (28) to the dimer-in-solvent model considered in Ref. 10. This system consists of $N$ two-dimensional particles in a periodic box with side length $l$. All particles have the same mass $m$, and they interact with each other through the Weeks-Chandler-Andersen potential

$$V_{\mathrm{WCA}}(r) = 4\epsilon\Big[\Big(\frac{\sigma}{r}\Big)^{12} - \Big(\frac{\sigma}{r}\Big)^{6}\Big] + \epsilon$$

if $r \le r_{\mathrm{WCA}} = 2^{1/6}\sigma$ and $V_{\mathrm{WCA}}(r) = 0$ otherwise, except for one pair of particles which interact via the double-well potential

$$V_{\mathrm{DW}}(r) = h\Big[1 - \frac{(r - r_{\mathrm{WCA}} - \omega)^2}{\omega^2}\Big]^2.$$

We take $N = 16$, $l = 4.4$, $\sigma = 1$, $h = 1$, $\omega = 0.5$, and $\epsilon = 1$ in the simulation. The physical temperature is $T_0 = 0.2$, and the single artificial temperature is $T_1 = 1.0$. The total simulation time is $T_{\mathrm{tot}} = 1.0 \times 10^5$ with time step $dt = 0.001$, so the total number of steps is $N_{\mathrm{tot}} = 10^8$. The quantity of interest is the free energy associated with the distance between the pair of particles interacting via the double-well potential.
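For reference, the two interaction potentials of this model can be written down as short functions. The exact double-well expression (minima at $r_{\mathrm{WCA}}$ and $r_{\mathrm{WCA}} + 2\omega$, barrier height $h$) is a common convention for this dimer model and is our assumption here; the original reference may use a slightly different form.

```python
import numpy as np

R_WCA = 2.0 ** (1.0 / 6.0)  # WCA cutoff for sigma = 1

def v_wca(r, sigma=1.0, eps=1.0):
    """Weeks-Chandler-Andersen potential: Lennard-Jones truncated at its
    minimum and shifted up by eps, so it is purely repulsive and continuous."""
    r = np.asarray(r, dtype=float)
    sr6 = (sigma / r) ** 6
    v = 4.0 * eps * (sr6 ** 2 - sr6) + eps
    return np.where(r <= R_WCA * sigma, v, 0.0)

def v_dw(r, h=1.0, w=0.5, sigma=1.0):
    """Double-well bond potential for the tagged dimer pair, with minima
    at r = r_WCA (compact state) and r = r_WCA + 2w (extended state),
    separated by a barrier of height h (an assumed convention)."""
    z = (r - R_WCA * sigma - w) / w
    return h * (1.0 - z ** 2) ** 2
```

With $h = 1$ and $T_0 = 0.2$, the barrier is five times the physical temperature, which is what makes direct sampling of the compact-to-extended transition slow.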

In this numerical example, we again choose the weighting parameters to be the inverse partition functions, which are, however, not explicitly known or easily obtained *a priori* in this case. Therefore, we use the following method, similar to that of Ref. 7, to approximate these partition functions, and then use the estimates in the weighting factors $n_k$. First, a set of initial guesses $Z_k^{(0)}$ is chosen and used to run the simulation. Then, from a relatively short trajectory, we add up the corresponding weights to obtain the “proportion” of each temperature by normalizing their sum to one. Since this proportion would approach $1/2$ if the guess were accurate and the trajectory infinitely long, we adjust our guess accordingly. More specifically, if the proportions are $w_k$, $k = 0, 1$, the weighting factor is updated as

where $I$ is a small neighborhood of $1$ that accounts for the fluctuations due to the short trajectory used to estimate $w_k$. We iteratively update the partition function estimates $\{Z_k^{(l)}\}$, $l = 1, 2, \dots$, until the proportions $w_k$ are satisfactorily close to $1/2$. Here we take $I = [0.35, 1.5]$, the initial guess $(Z_0^{(0)}, Z_1^{(0)}) = (1, 10^8)$, and the number of iterations $l_{\max} = 10$, with a trajectory of $10^7$ steps in each iteration.
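One iteration of this adjustment can be sketched as follows. The precise update rule and normalization of Ref. 7 may differ; this is one natural realization of the rescaling described above, with the function name and interface being our own.

```python
import numpy as np

def update_partition_estimates(Z, proportions, I=(0.35, 1.5)):
    """One iteration of the adaptive adjustment sketched in the text:
    if the observed proportion of a temperature deviates from the target
    (here 1/2 for two temperatures) by more than the tolerance interval I
    allows, rescale its partition-function estimate by the observed ratio.
    An over-visited temperature gets a larger Z (hence a smaller weight
    n_k = 1/Z_k), pushing its proportion back toward the target."""
    Z = np.array(Z, dtype=float)
    target = 1.0 / len(Z)
    for k, wk in enumerate(proportions):
        ratio = wk / target
        if not (I[0] <= ratio <= I[1]):
            Z[k] *= ratio
    return Z / Z[0]  # fix the overall scale, which is arbitrary
```

The iteration is repeated with fresh short trajectories until all proportions fall inside the tolerance interval around the target.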

The performance of STMD with various switching rates is shown in Fig. 3. It can be seen that the performance of STMD improves as the switching rate increases from 0.25 to 25, and the asymptotic variance converges to that of the IST as $\nu \to \infty$. This provides strong numerical validation of our theoretical results. We also remark that, as this example is more complicated than the previous one, the switching rate needed by STMD to get close to the IST limit is much higher compared to the results in Fig. 2. It is expected that for a larger system, and when more temperatures are used, it will become more difficult for STMD to reach its infinite switching limit by increasing the switching rate. Thus it is more advantageous to take the limit explicitly and operate at the infinite switching rate by using, e.g., the IST algorithm.

## VIII. CONCLUSION

We justify that the sampling efficiency of the simulated tempering method increases with the switching rate of the temperature, using both theoretical analysis based on the large deviation principle for empirical distributions and numerical tests on two examples. This motivates taking the infinite switching limit of the simulated tempering dynamics, which recovers the integrated tempering enhanced sampling method under a natural reformulation. The limiting dynamics can be implemented as a patch to standard molecular dynamics by adding a scaling factor to the force term based on a weighted average of all temperatures involved. This leads to a practical scheme with higher sampling efficiency than standard simulated tempering. Throughout this work, we have assumed that the weight factors $n_k$ are given and have only considered the efficiency of the algorithm as the parameter $\nu$ varies. For practical use of simulated tempering algorithms, it is also essential to choose an appropriate set of weights for each temperature, or to design an adaptive algorithm to learn a set of optimal weights. A thorough understanding of these issues is beyond the scope of the current paper and is an important direction for future investigation.

## ACKNOWLEDGMENTS

H.G. is supported in part by the National Science Foundation of China via Grant No. 11622101. J.L. is supported in part by the National Science Foundation via Grant No. DMS-1454939. J.L. and H.G. would also like to thank Yiqin Gao, Zhiqiang Tan, Eric Vanden-Eijnden, and Zhennan Zhou for helpful discussions.

### APPENDIX: DERIVATION OF THE LARGE DEVIATION RATE FUNCTIONAL *I*^{ν}(*μ*)

To introduce the large deviation principle formally, let us introduce some definitions. Let $S$ be a Polish space and $P(S)$ the space of probability measures on $S$, equipped with the topology of weak convergence. Under this weak topology, $P(S)$ is itself a Polish space. Note that the empirical measure $\lambda_T^{\nu}$ is a $P(S)$-valued random variable.

The large deviation principle for empirical distribution can be stated as follows:^{20–22} A sequence of random probability measures {*γ*_{T}} is said to satisfy a large deviation principle with a rate function $I:P(S)\u2192[0,\u221e]$, if for all open sets $O\u2282P(S)$

for all closed sets $C\u2282P(S)$

and for all $M < \infty$, the level set $\{\mu: I(\mu) \le M\}$ is compact in $P(S)$. In particular, once we obtain the rate functional, (A1) and (A2) quantify how unlikely it is that the empirical distribution is far from the equilibrium when $T$ is large, which is what we used in this work to justify the infinite switching limit.

To find the rate functional $I^{\nu}$ for the simulated tempering overdamped dynamics with switching rate $\nu$, we divide the infinitesimal generator into two parts,

where

The rate functional thus has the additive structure (15), with the rate functionals $J_0$ and $J_1$ corresponding to $\mathcal{L}_{\mathrm{diff}}^{\nu}$ and $\mathcal{L}_{\mathrm{jump}}^{\nu}$, respectively.

To obtain an explicit formula for the rate functionals, we first consider $J_0$. Let $f(x,\beta) = \sqrt{\theta(x,\beta)}$; then, following Donsker and Varadhan,^{20} we have

Next we consider the rate functional $J_1$ corresponding to the jump process of the temperature with generator $\mathcal{L}_{\mathrm{jump}}^{\nu}$. From the definition of $\theta(x,\beta)$, we have

and therefore

Combined with the Cauchy-Schwarz inequality, we obtain that for any choice of non-negative $u(x,\beta)$,

where the equality is attained if and only if, for each $x \in \mathbb{R}^{3n}$ and each $\beta$, it holds that

In particular, this gives the choice of $u(x,\beta)$ that makes the equality hold in (A8).

Using the variational characterization of the large deviation rate functional as in Ref. 20, we have

Therefore, we arrive at the explicit formulas for the large deviation rate functionals. We remark that a rigorous proof of the rate functional derived from the above calculation can be found in Ref. 23.