Learning Koopman eigenfunctions of stochastic diffusions with optimal importance sampling and ISOKANN

For stochastic diffusion processes the dominant eigenfunctions of the corresponding Koopman operator contain important information about the slow-scale dynamics, that is, about the location and frequency of rare events. In this article, we reformulate the eigenproblem in terms of $\chi$-functions in the ISOKANN framework and discuss how optimal control and importance sampling allow for zero-variance sampling of these functions. We provide a new formulation of the ISOKANN algorithm allowing for a proof of convergence and incorporate the optimal control result to obtain an adaptive iterative algorithm alternating between importance sampling and $\chi$-function approximation. We demonstrate the usage of our proposed method in experiments, increasing the approximation accuracy by several orders of magnitude.


Introduction
Many real-world stochastic processes contain rare events, for example folding and binding events in molecular systems. The analysis of the frequency and mechanism of these events often takes operators associated with the process and their dominant invariant subspaces into account [1,6,9,11,14]. Usually this type of analysis leads to a kind of chicken-and-egg problem. In order to compute the dominant invariant subspace of the Koopman operator of the process, one has to somehow also "sample those events which are usually rare". However, in order to know how to generate a bias on the process for observing these events, one would need information from the dominant invariant subspace of the Koopman operator.
The key idea of this article is to take an iterative algorithm which approximates the dominant eigenfunctions of the operator and to use the intermediate approximations for sampling along the reaction paths and generating an optimal bias to observe the relevant events. The algorithm for approximating eigenfunctions is ISOKANN [11] and the bias is computed according to the theory of optimal importance sampling and the change of path measures [4].
Briefly speaking, ISOKANN can be thought of as an Arnoldi-like method using neural networks as function representations and replacing the subspace projections by a transformation suitable to its application on neural networks. It does not compute the eigenfunctions themselves but rather the so-called $\chi$-functions, which span an invariant subspace of the Koopman operator and can be used to reconstruct the eigenfunctions. Furthermore the $\chi$-functions themselves allow for an interpretation as reaction coordinates indicating the locations of rare events and their reaction paths [3,13].
Using this interpretation, the $\chi$-functions obtained during previous iterations can be used to adapt the sample locations for further iterations, e.g. by $\chi$-stratified sampling (Sec. 2.6), thus providing better global coverage and facilitating exploration. The theory of optimal importance sampling on the other hand allows us to exploit the information locally by decreasing the variance of the samples required to approximate the action of the Koopman operator. This in turn allows ISOKANN to arrive at results either more quickly or more precisely. It is due to this feedback loop, coupling local and global information, together with its representation of the $\chi$-functions as neural networks, that we expect ISOKANN to perform well even for complex high-dimensional systems. While these general ideas are far from being formalized yet, in this article we will try to construct the basic building blocks in a way amenable to their future analysis.
In the following, we start by recalling basic knowledge about the eigenfunctions of Koopman operators (Sec. 2.1) and provide an abstract formulation of the ISOKANN problem in terms of $\chi$-functions (Sec. 2.2) encoding the invariant subspaces of the Koopman operator. This abstract formulation is incomplete without the choice of an adequate transformation, which we discuss in Section 2.3 and follow up with an explicit choice leading to 1D-ISOKANN (Sec. 2.4), which corresponds to the classical ISOKANN algorithm [11]. After proving convergence of 1D-ISOKANN we show how to reconstruct the eigenfunctions of the Koopman operator from the $\chi$-functions in Section 2.5. We then conclude Section 2 by providing an algorithmic description in the form of Algorithm 1 and discussing the actual sampling and computation procedure (Sec. 2.6). Section 3 starts with an introduction to the theory of optimal importance sampling of classical random variables (Sec. 3.1) before recalling the result for the optimal sampling of path observables for diffusion processes with the Girsanov re-weighting in Section 3.2. After showing how to apply this result to obtain a zero-variance sampler for the evaluation of the Koopman operator (Sec. 3.3), we discuss how to integrate the result into the ISOKANN framework (Sec. 3.4). In the end (Sec. 3.5) we apply the 1D-ISOKANN algorithm to a one-dimensional double-well potential and observe the improvement in accuracy of the controlled versus the uncontrolled case.

ISOKANN Theory
Before introducing our key idea in the next chapter, we will in this chapter recall the basics of Koopman operator theory and summarize the ISOKANN method, while also complementing it with some new results. In particular we provide a new dimension-agnostic formulation (6), which naturally leads to a possible extension of ISOKANN to higher dimensions, show how the classical 1D-ISOKANN can be seen as a special case of this formulation, and prove convergence of the algorithm to a $\chi$-function (Theorem 1). After showing how to reconstruct the dominant eigenfunctions from the ISOKANN result (Proposition 1), we conclude this chapter with a discussion of the actual implementation of ISOKANN, suggesting a new adaptive sampling scheme.

The Koopman operator
Although our approach can be generalized to non-reversible stochastic processes (by shifting the focus from eigenfunctions to invariant subspaces), for simplicity we will restrict our explanations to the reversible case. More precisely, we will investigate potential-driven diffusion processes $X = (X_t)_{t \ge 0}$ of the form

$dX_t = b(X_t)\,dt + \sigma\,dB_t \qquad (1)$

taking values in the state space $X = \mathbb{R}^n$, with constant diffusion term $\sigma \in \mathbb{R}^{n \times n}$ and force field $b : X \to \mathbb{R}^n$ given by the gradient of a smooth potential $U : X \to \mathbb{R}$, $b = -\nabla U$. $B$ is an $n$-dimensional Brownian motion. The Koopman operator for lag time $T$, $K^T : L^\infty(X) \to L^\infty(X)$, applied to a function $f \in L^\infty(X)$ is defined by its pointwise evaluation via

$(K^T f)(x) = \mathbb{E}\left[ f(X_T) \mid X_0 = x \right], \qquad (2)$

i.e. the expectation value of $f$ at time $T$ when starting the system in $X_0 = x$. Recall that since the process is time-homogeneous, the Koopman operator depends only on the lag time and not on the start time $t \ge 0$. The eigenfunctions $v_i \in L^\infty(X)$ of $K^T$ satisfy for all lag times $T \ge 0$

$K^T v_i = \lambda_i(T)\, v_i, \qquad \lambda_i(T) = e^{q_i T}, \qquad (3)$

with time-dependent eigenvalues $\lambda_i(T)$, exponential in time with rates $q_i$ (which in turn are the eigenvalues of the corresponding infinitesimal generator of $K^T$). In the following we will refer to $v_1, \ldots, v_d$ as the $d$ dominant eigenfunctions and call $v_1 \equiv 1$ the trivial eigenfunction. When clear from the context or of no importance we will omit the lag time $T$ and simply speak of the Koopman operator $K$.
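For concreteness, the pointwise Monte-Carlo evaluation of $K^T f$ described above can be sketched as follows. This is a minimal numpy illustration of ours, not part of the method itself; the function names and the Ornstein-Uhlenbeck test case are our own choices.

```python
import numpy as np

def koopman_mc(f, x0, b, sigma, T, dt=1e-3, K=2000, seed=0):
    """Monte-Carlo estimate of (K^T f)(x0) = E[f(X_T) | X_0 = x0] using
    K Euler-Maruyama realizations of dX_t = b(X_t) dt + sigma dB_t."""
    rng = np.random.default_rng(seed)
    x = np.full(K, float(x0))                 # K copies of the start point
    for _ in range(int(T / dt)):
        x += b(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(K)
    return f(x).mean()

# Sanity check on an Ornstein-Uhlenbeck process (U(x) = x^2/2, so b = -x),
# where (K^T f)(x0) for f(y) = y is known exactly: x0 * exp(-T).
est = koopman_mc(lambda y: y, x0=1.0, b=lambda x: -x, sigma=1.0, T=0.5)
# est should lie close to the exact value 1.0 * exp(-0.5)
```

The same routine, with $f$ replaced by the current network $\chi_n$, is exactly the Koopman evaluation that ISOKANN performs per training point.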
The dominant eigenfunctions are of particular interest as they decay the slowest and hence dominate the long-time behavior of the system. The number $d$ of eigenfunctions of interest depends on the time scales of the system and is usually chosen up to a spectral gap.
There exist different approaches to estimate the eigenfunctions. Many depend on the discretization of the state space into cells, leading to a matrix representation of $K$. A classical method starts trajectories in each such cell and counts how many of them end up in a certain cell. This sample-driven method can be interpreted as an approximate Ulam/Galerkin discretization onto indicator functions of the cells [7]. The eigenfunctions of $K$ are then approximated by the eigenfunctions of its (dense) matrix approximation. Unfortunately this scheme breaks down in high dimensions as the number of cells in a structured grid increases exponentially.
One possible remedy to this problem is posed by the Square-Root-Approximation method (SQRA) [2]. It approximates the infinitesimal generator of the Koopman operator by a finite volume approximation where the volumes are implicitly defined by the Voronoi tessellation induced by some sample points.
Using only evaluations of the potential U at these samples it can be understood as a semi-parametric method resulting in a sparse matrix representation, which in turn can be used for the computation of the eigenfunctions.
All these classical approaches, however, depend on a discretization before solving the eigenproblem. As an alternative we will now summarize ISOKANN, a recent matrix-free approach learning a linear combination of the eigenfunctions by neural networks.

ISOKANN: Computing the dominant eigenspace
The ISOKANN algorithm [11] uses a nonparametric representation in the form of a neural network in order to learn the dominant invariant subspace by interleaving an Arnoldi-like power iteration with the approximation of the Koopman operator by Monte-Carlo simulations.
To this end it will be useful to reformulate the eigenproblem in terms of the so-called $\chi$-function, defined as a solution of

$S K^T \chi = \chi \qquad (5)$

for an appropriately chosen matrix $S : \mathbb{R}^d \to \mathbb{R}^d$, which will be specified in the next section. The core idea here is that the components $\chi_i$ of $\chi$ span an invariant subspace of $K$, or more specifically they consist of linear combinations of eigenfunctions of $K$. Furthermore, the action of $K$ on this space is then explicitly given by the matrix $S^{-1}$. This definition of $\chi$-functions not only allows us to reconstruct the eigendecomposition of $K$ (Proposition 1) but also allows for an interpretation as macrostates in the form of fuzzy memberships to $d$ different sets [12], and leads to a direct characterization of rare transitions in the form of their holding times, reaction rates, exit paths etc. [3,13].

Figure 3: Both components of the two-dimensional $\chi$-function, which are linear combinations of the eigenfunctions (Figure 2). In this case the application of $K$, which in general is linear in $\chi$, corresponds to a shift-scale (see Section 2.4).

Figure 4: Scatter plot depicting $v_1$ against $v_2$ and $\chi_1$ against $\chi_2$ on a uniform grid over $X$. PCCA+ constructs the linear map $S$ mapping $v$ onto the unit simplex $\chi$.
Formally, ISOKANN then approximates the $\chi$-function representing the dominant eigenspace by an iterative sequence of approximations $\chi_n : X \to \mathbb{R}^d$ satisfying the ISOKANN iteration

$\chi_{n+1} = S_n(K^T \chi_n), \qquad (6)$

with $S_n = S_n(K^T \chi_n)$ being a linear map depending on the previous iteration and constructed in such a way as to provide convergence to a desired form of $S$ in (5). Similar to the Arnoldi iteration, the iterative application of the Koopman operator leads to a decay of the eigenfunction components that is exponential in their eigenvalues. In the following section we will discuss how to construct $S_n$ so as to compensate this decay just so that the dominant components prevail whilst the non-dominant ones vanish. In the limit, $\chi_n$ spans the dominant subspace $\mathrm{span}\{v_1, \ldots, v_d\}$ (Theorem 1), with $v_i$ denoting the dominant eigenfunctions. This in turn allows us to learn the linear action of $K$ on that subspace, leading to the relation $K\chi = S^{-1}\chi$. Let us note that this presentation of ISOKANN differs from the original variant [11] in that the iteration in the form of (6) allows general $d$-dimensional valued $\chi$-functions. However, we will see that for the case of $d = 2$ this is equivalent to the original variant, which by slight abuse of notation we will refer to as 1D-ISOKANN.
Illustrative example To illustrate this better, let us consider the following example given by the SDE (1) with a double-well potential $U$. The double-well potential with two minima is shown in Figure 1; the two dominant eigenfunctions are given in Figure 2. In Figure 3, we show the resulting two $\chi$-functions, which are linear combinations of the eigenfunctions but allow us to grasp the slow time-scale dynamics better. In particular, the two $\chi$-functions can be interpreted as membership functions of the two wells of the potential. The exit paths are characterized by the gradients $-\nabla\chi$ and the holding probabilities or exit rates can be computed from the associated matrix $S$ in (5) (see [3,13]).
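As a baseline for such low-dimensional examples, the dominant spectrum can also be computed by brute force on a grid. The following sketch is our own illustration and assumes the hypothetical double-well potential $U(x) = (x^2 - 1)^2$ with $\sigma = 1$ (the concrete parameters of the example are not fixed here); it discretizes the infinitesimal generator by finite differences.

```python
import numpy as np

# Hypothetical double-well potential U(x) = (x^2 - 1)^2, sigma = 1 (stand-ins).
dU = lambda x: 4.0 * x * (x**2 - 1.0)

# Finite-difference discretization of the generator
# (L f)(x) = -U'(x) f'(x) + (1/2) f''(x), with reflecting boundaries.
n = 400
x = np.linspace(-2.5, 2.5, n)
h = x[1] - x[0]
a = 1.0 / (2.0 * h**2)
L = np.zeros((n, n))
for i in range(1, n - 1):
    L[i, i - 1] = a + dU(x[i]) / (2.0 * h)
    L[i, i + 1] = a - dU(x[i]) / (2.0 * h)
    L[i, i] = -2.0 * a
L[0, [0, 1]] = -2.0 * a, 2.0 * a       # mirror boundary on the left
L[-1, [-1, -2]] = -2.0 * a, 2.0 * a    # mirror boundary on the right

# Dominant spectrum: q_1 = 0 (trivial eigenfunction v_1 = 1), a slow rate q_2
# (well-to-well switching), then a spectral gap; lambda_i(T) = exp(q_i T).
evals, evecs = np.linalg.eig(L)
order = np.argsort(-evals.real)
q = evals.real[order]
v2 = evecs[:, order[1]].real           # its sign change marks the barrier
```

The second eigenfunction is approximately constant on each well with opposite signs, which is exactly the structure the $\chi$-functions recombine into membership functions.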

Choice of the transformation S
In the previous description we spoke of an "appropriately chosen" linear transformation $S$, or of a sequence of $S_n$. We will now explain the role that $S$ plays in ISOKANN and suggest a way of determining a suitable $S$.
In principle the map $S$ can be chosen arbitrarily such that (5) has a solution in $\chi$. However, to obtain a useful result and for the ISOKANN iterations to converge with the corresponding choice of $S_n$, we require some specific properties.
In order to understand the role of $S$, let us for now think about the classical Arnoldi method for finding the dominant eigenvectors of a matrix. In principle the Arnoldi method also takes the form of (6), where the application of $S_n$ corresponds to a Gram-Schmidt orthonormalization. Here the orthogonalization ensures that the leading eigenvectors are projected out from the subsequent ones, and the following normalization step then ensures that the eigenvectors do not decay over multiple iterations.
Note here that while the Gram-Schmidt orthonormalization on its own is a nonlinear procedure, the resulting action on a given set of input vectors can be expressed as a linear map, i.e. if $\omega$ is the orthonormalization procedure, for each input matrix $k$ we can find a linear map $\Omega(k)$ (depending on $k$) such that

$\omega(k) = \Omega(k)\, k. \qquad (10)$

It is in exactly this way that we understand $S$ as a linear map computed non-linearly from the data in Eq. (6).
For ISOKANN we want S to fulfill the same role: The goal of each linear transformation S n is to counteract the decay of the dominant eigenfunction components contained in χ n after application of the Koopman operator K.
However, we are interested in linear combinations of the eigenfunctions over a continuous space represented by a neural network. In this setting orthonormalization is a hard problem, involving integration over the whole state space.
On the other hand, orthonormality is a strong assumption which is not necessarily required. If we manage to choose $S$ such that it amplifies the first $d$ eigenfunctions so that they stay bounded away from zero, we will obtain a representation of the dominant subspace, which in turn allows for the reconstruction of the eigenfunctions (cf. Section 2.5).
We now motivate a heuristic approach, based on the PCCA+ algorithm [12], to construct a transformation $S$, and will prove its convergence in the 1D setting.
Let us briefly summarize the idea of the PCCA+ methodology. In the context of metastable systems, the state-space regions where the individual eigenfunctions become extremal are representative of the respective metastabilities of the system. In practice, when plotting the state space $X$ over the respective eigenfunction components $v_i$, the resulting set often resembles a simplex. PCCA+ can be understood as a method to identify this simplex structure and construct the linear transformation (in the space of eigenfunction components) mapping this set into the unit simplex (see also Figure 4), such that the image becomes "as big as possible" (specified by an optimization problem).
We can similarly imagine this picture for the $\chi$-functions: after sufficient time propagation (or power iterations), the $\chi$-functions are mainly composed of the dominant eigenfunctions. Plotting $X$ over the $\chi$ components thus will be close to the $X$ over $v$ plot above, modulo a linear transformation, i.e. a change of basis. We can thus apply PCCA+ to our intermediate $K\chi_n$ in order to find a transformation such that the resulting $\chi_{n+1}$ fills this unit simplex. This inhibits the exponential decay of the non-trivial eigenfunctions by maintaining a set of $d$ linearly independent components. Since the dominant eigenfunctions decay slower than the subsequent non-dominant ones, they will dominate the behavior and prevail for $n \to \infty$.
Even though preliminary results have shown that using PCCA+ (with minor modifications) to construct $S$ in higher dimensions works well, the focus of this paper is on the optimal control, so we reserve a more detailed report for future work and hence give a proof only for the simpler case of a single time-scale in the next section.

1D-ISOKANN with explicit PCCA+
In the case where one is interested merely in the first non-trivial eigenfunction, $v_2$, the PCCA+ solution can be computed explicitly, which in turn allows us to prove convergence of ISOKANN for $d = 2$ and to explicitly specify $S$.
Note that because of $\chi_1 + \chi_2 = 1$, the desired solution $\chi : X \to \mathbb{R}^2$ is fully determined by its first component:

$\chi = (\bar\chi,\, 1 - \bar\chi)^\top, \qquad \bar\chi := \chi_1. \qquad (11)$

This representation allows us to solve the two-dimensional ISOKANN problem with just one scalar function, approximated by a series of scalar neural networks $\bar\chi_n : X \to \mathbb{R}$. This scalar representation is the approach taken in [11] and in the following we will refer to it as 1D-ISOKANN.
Let us start by constructing the explicit PCCA+ solution. Recall that $v_1 \equiv 1$ and note that $v_2(X) = [\min v_2, \max v_2]$. Thus the image of $X$ forms a line segment, i.e. a 1-dimensional simplex, in the $v_1$-$v_2$-plane. PCCA+ then constructs the unique map $S$ which maps this simplex onto the unit simplex $\{(x, y) \mid x + y = 1,\ x, y \ge 0\}$ (see Figure 4).
To this end, let us introduce the shift-scale map $\mathcal{S}$ acting on bounded continuous functions $\kappa \in C(X)$,

$\mathcal{S}(\kappa) := \dfrac{\kappa - \min \kappa}{\max \kappa - \min \kappa}, \qquad (12)$

such that $\mathcal{S}(\kappa) : X \to [0, 1]$ maps surjectively onto the unit interval. Even though $\mathcal{S}$ (consisting of a shift and a scale) is only affine-linear in its argument, it can be seen as a linear map on the subspace spanned by $\{1, \kappa\}$, and it is indeed the action of the PCCA+ solution on a single component, i.e. there exists a matrix $S \in \mathbb{R}^{2 \times 2}$ such that it satisfies

$S \begin{pmatrix} \kappa \\ 1 - \kappa \end{pmatrix} = \begin{pmatrix} \mathcal{S}(\kappa) \\ 1 - \mathcal{S}(\kappa) \end{pmatrix}. \qquad (13)$

So $S$ determined by PCCA+ is indeed a linear map depending non-linearly on the input $\kappa$, just as in the case of the orthonormalization procedure (10), and its action on the first component is equivalently given by the affine-linear map $\mathcal{S}$. Using $\mathcal{S}$ to learn only the first component of the $\chi$-function, we can now formulate the following explicit iterative 1D-ISOKANN procedure:

Theorem 1. Let $\bar\chi$, $\mathcal{S}$ and $S$ be chosen as above in Eqs. (11) to (13). For generic $\bar\chi_0 : X \to \mathbb{R}$, i.e. containing components of $v_1$ and $v_2$, the 1D-ISOKANN iteration

$\bar\chi_{n+1} = \mathcal{S}(K \bar\chi_n) \qquad (14)$

converges to the $\chi$-function

$\bar\chi = \lim_{n \to \infty} \bar\chi_n = \alpha + \beta v_2 \qquad (15)$

for some $\alpha, \beta \in \mathbb{R}$, which in turn solves the ISOKANN problem

$\mathcal{S}(K \bar\chi) = \bar\chi. \qquad (16)$

Proof. Noting that $\mathcal{S}$ is merely a shift-scale, i.e. affine-linear, and that $K1 = 1$, one obtains that each iterate $\bar\chi_n$ is an affine-linear transformation of $K^n \bar\chi_0$. Looking at the eigendecomposition $\bar\chi_0 = \sum_{i=1}^{\infty} a_i v_i$, we have

$K^n \bar\chi_0 = \sum_{i=1}^{\infty} a_i \lambda_i^n v_i. \qquad (17)$

For large $n$, the contribution of the faster eigenfunctions $v_i$, $i > 2$, decays exponentially faster than the contribution of $v_1, v_2$. Hence the shift-scale is dominated by these slow eigenfunctions, and with (17) we have

$\bar\chi_n \longrightarrow \alpha + \beta v_2$

for some $\alpha, \beta \in \mathbb{R}$, which proves (15). Noting that $\bar\chi(X) = [0, 1]$, that $K$ acts on $\bar\chi$ as a shift-scale, and that $\mathcal{S}$ subsequently rescales $K\bar\chi$ back onto the interval $[0, 1]$, we have $\mathcal{S}(K\bar\chi) = \bar\chi$, which, given the implicit construction of $S$ from $\mathcal{S}$ in Eq. (13), shows the fixed-point result in (16).
To summarize, we have shown how the representation of ISOKANN for $d = 2$ as a scalar problem reproduces the classical 1D-ISOKANN procedure [11], where the role of the linear map $S$ (determined by PCCA+) is replaced by the explicitly given affine-linear map $\mathcal{S}$. This simpler representation allowed us to show convergence to a $\chi$-function and thus to solve the ISOKANN problem.
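The iteration (14) can be tried out directly on a small discretized stand-in for the Koopman operator. The toy three-state stochastic matrix below is our own illustration, not taken from the paper:

```python
import numpy as np

def shift_scale(k):
    """The explicit PCCA+ map of Eq. (12): affine rescaling onto [0, 1]."""
    return (k - k.min()) / (k.max() - k.min())

# Row-stochastic stand-in for a discretized Koopman matrix of a metastable
# three-state system (a hypothetical toy example).
K = np.array([[0.98, 0.02, 0.00],
              [0.01, 0.98, 0.01],
              [0.00, 0.02, 0.98]])

rng = np.random.default_rng(1)
chi = rng.random(3)                    # generic initial guess chi_0
for _ in range(2000):
    chi = shift_scale(K @ chi)         # the 1D-ISOKANN iteration (14)

# The limit is a fixed point of the iteration and an affine image of the
# second eigenvector v_2 = (1, 0, -1) of this K, as predicted by Theorem 1.
residual = np.abs(shift_scale(K @ chi) - chi).max()
```

Because the shift-scale enforces minimum 0 and maximum 1 in every step, the $v_2$ component can never die out, while all faster components decay, exactly the mechanism used in the proof above.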

Restoring the eigenfunctions
At the beginning of the article we set out to compute the eigenfunctions of $K$. Whereas the $\chi$-functions only span the corresponding invariant subspace, we now show how to restore the eigenfunctions from the solution $\chi$ of the ISOKANN problem $SK\chi = \chi$. This is equivalent to

$K\chi = S^{-1}\chi,$

which means that the action of $K$ on the subspace spanned by the components of $\chi$ is given by the matrix $S^{-1}$. By means of a basis transformation we can recover the eigenfunctions of $K$ from $S$ and $\chi$.
Proposition 1. Let $\chi = (\chi_i)_{i=1}^d$ be a column vector of $\chi_i$ component functions and $S$ be a full-rank matrix such that

$S K \chi = \chi, \qquad (21)$

and let

$S^{-1} = W \Lambda W^{-1} \qquad (22)$

be an eigendecomposition of $S^{-1}$. Then the components of $E := W^{-1} \chi$ are eigenfunctions of $K$ with the eigenvalues given by the diagonal of $\Lambda$.

Proof. By assumption we have $K\chi = S^{-1}\chi$. Inserting the eigendecomposition (22), multiplying with $W^{-1}$ from the left and noting that $K$ commutes with constant linear maps, we arrive at

$K(W^{-1}\chi) = W^{-1} S^{-1} \chi = \Lambda\, (W^{-1}\chi),$

which gives the desired result.
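Proposition 1 can be checked numerically on a toy system; the three-state matrix $K$ and the $\chi$-values below are illustrative stand-ins of ours:

```python
import numpy as np

K = np.array([[0.98, 0.02, 0.00],
              [0.01, 0.98, 0.01],
              [0.00, 0.02, 0.98]])          # toy Koopman matrix (hypothetical)

# chi evaluated on the three states (rows chi_1, chi_2), with chi_1 + chi_2 = 1.
C = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])

KC = C @ K.T                                 # pointwise action of K on chi
S = C @ np.linalg.pinv(KC)                   # solve S (K chi) = chi, cf. (21)

lam, W = np.linalg.eig(np.linalg.inv(S))     # S^{-1} = W Lambda W^{-1}, cf. (22)
E = np.linalg.inv(W) @ C                     # rows: recovered eigenfunctions

# Verify K E_i = lambda_i E_i row-wise.
err = np.abs(E @ K.T - np.diag(lam.real) @ E).max()
```

For this toy system the recovered eigenfunctions are (up to scaling) $\chi_1 + \chi_2 \equiv 1$ with eigenvalue 1 and $\chi_1 - \chi_2$ with the metastable eigenvalue.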

Computational procedure
With these theoretical considerations, let us now discuss how to apply the algorithm in practice, i.e. using neural networks as function approximators and Monte Carlo (MC) simulations for the Koopman evaluations.
To this end let us recall the main formula for the iterative update (6), now written as a learning target:

$\chi_{n+1}(x_m) \leftarrow S_n (K\chi_n)(x_m). \qquad (24)$

Here we replaced the equality by a $\leftarrow$, which indicates classical supervised learning (with the common mean squared error loss) along multiple (possibly random) training points $x_m$, $m = 1, \ldots, M$. For a procedural description see Algorithm 1. The main challenges posed by this iterative scheme consist of (a) a representation of the function(s) $\chi_n$ and (b) the evaluation of the right-hand side, i.e. the computation of $S_n$ and the evaluation of the Koopman operator. For (a), the representation of the $\chi_n$, we choose neural networks, as they promise good approximation properties in high dimensions and their differentiability will prove crucial for the following optimal control part. In general any feed-forward architecture should be suitable, and whilst convolutional networks could be especially suited due to the spatial structure of the state space $X$, in the example we will confine ourselves to a fully connected architecture for simplicity. In any case, the update step for $\chi_{n+1}$ consists of a classical supervised learning routine with the labeled data generated by evaluation of the current $\chi_n$. Because the $\chi_n$ are not expected to change much between iterations, it makes sense to initialize $\chi_{n+1}$ with the weights of $\chi_n$ so as to transfer the already learned structure and speed up the learning. Note here that whilst we talk about different networks $\chi_n$ for each iteration $n$ to emphasize the iterative nature of (24), in practice we can update a single instance of the network.
In this view, the learning procedure can be seen as iterative supervised batch learning, where the whole data batch $D_n$ is generated along a set of points $\{x_m\}$ using the current representation $\chi_n$. The update step itself can be performed using any stochastic optimizer, such as classical stochastic gradient descent or ADAM, to minimize the empirical $L^2$ error

$\min_{\chi_{n+1}} \sum_{(x_m, s_m) \in D_n} \left| \chi_{n+1}(x_m) - s_m \right|^2. \qquad (25)$

Considering (b), the evaluation of the right-hand side, let us start with the approximation of the Koopman operator. For a given training point $x_m$, we use its representation as an expectation value and approximate the action of the Koopman operator by a Monte-Carlo sum. Each simulation consists of starting $K > 0$ trajectories at the point $x_m$, propagating them according to the SDE (1) using an SDE integrator, such as the Euler-Maruyama scheme, for the lag time $T$, and storing their end points $y_{k,m}$, $k = 1, \ldots, K$. The action of the Koopman operator at $x_m$ is then approximated by the empirical average

$(K^T \chi_n)(x_m) \approx \kappa_m := \frac{1}{K} \sum_{k=1}^{K} \chi_n(y_{k,m}). \qquad (26)$

To summarize, for each $x_m$ we average the evaluation of $\chi_n$ at $K$ propagated positions obtained by SDE simulations.
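The supervised update (25) can be sketched with a least-squares fit over a fixed feature basis standing in for the neural network; the targets $s_m$ below are hypothetical stand-ins for $\mathcal{S}(K\chi_n)(x_m)$:

```python
import numpy as np

# One supervised update chi_{n+1} <- targets s_m, cf. (24)/(25), with least
# squares over a polynomial basis standing in for the neural-network training.
rng = np.random.default_rng(0)
x_m = rng.uniform(-2.0, 2.0, size=50)            # training points
s_m = 0.5 * (1.0 + np.tanh(2.0 * x_m))           # hypothetical targets in [0, 1]

features = np.vander(x_m, 12)                    # degree-11 polynomial basis
coef, *_ = np.linalg.lstsq(features, s_m, rcond=None)
chi_next = lambda x: np.vander(np.atleast_1d(x), 12) @ coef

mse = np.mean((chi_next(x_m) - s_m) ** 2)        # empirical L2 error of (25)
```

In the actual algorithm this one-shot fit is replaced by a few stochastic-optimizer steps on the network $\chi_{n+1}$, warm-started from the weights of $\chi_n$.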
What now remains is the application of the transformation $S_n$. In the case of 1D-ISOKANN the action of $S$ is determined by the shift-scale $\mathcal{S}$ from (12), which depends on the (global) extrema of the input function. In practice we thus use the empirical extrema over the observed data $\{\kappa_m\}$ to directly compute $s_m = \mathcal{S}(\kappa)_m$ without explicitly constructing $S_n$. In higher dimensions we compute the matrix $S_n$ using PCCA+ to find the transformation that maps the columns of the matrix $K = [\kappa_1 \ldots \kappa_M]$ into the unit simplex. Note that the use of the empirical extrema requires that the training points $x_m$ indeed cover the areas where $K\chi_n$ becomes (approximately) extremal.
This brings us to the choice of the $M$ training points $x_m$. In principle ISOKANN can be applied to find the $\chi$-functions of a system based on a fixed set of precomputed or assimilated trajectories (replacing the SDE integration). However, its iterative nature makes it especially useful in the synthetic-data regime where the trajectories are computed on-line, as it allows adapting the training points $x_m$, and as we will see the trajectory simulations too, to the information obtained so far.
Since the $\chi$-functions can be interpreted as reaction coordinates [3], we suggest "$\chi$-stratified" sampling of the $x_m$, i.e. such that $\chi(x_m)$ is approximately uniform in $[0, 1]$. In practice we achieve this by subsampling from the pool $P_n$ of start and end points of the previous simulations. We then draw $M$ stratified uniform samples $u_i \in [0, 1]$ by sampling uniformly from each of $M$ equally sized partitions of the interval $[0, 1]$. Finally we choose those $p_i \in P_n$ such that $\chi_n(p_i)$ is closest to one of the $u_i$. We furthermore retain those samples which were extremal in $\chi_n$ to facilitate a good approximation of the extrema by $\mathcal{S}$ or PCCA+. Heuristically speaking, we obtain samples $x_m$ which are uniform in $\chi$ and hence provide good coverage or "bridges" along the transition region, thus facilitating an efficient "flow of information" during the power iteration. Furthermore, the regions with a higher variation of $\chi$, i.e. those that are "harder to learn", will also obtain more samples, which in turn is beneficial for the training of the neural network itself.
Last but not least, the samples obtained this way follow the system's ergodic dynamics and will therefore approximate the stationary distribution (restricted to each level set of $\chi$). This allows us to evade the curse of dimensionality by restricting the sampling to physically meaningful samples along the reaction paths.
To summarize, $\chi$-stratified sampling allows us to sample uniformly along $\chi$ and stationary conditioned on it, without much additional cost and adapted to the learning process.
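A minimal sketch of the $\chi$-stratified subsampling step (the pool of points and the sigmoid stand-in for $\chi_n$ are our own illustration):

```python
import numpy as np

def chi_stratified_subsample(pool, chi_vals, M, rng):
    """Pick M points from `pool` whose chi-values are closest to stratified
    uniform draws u_i, one from each of the M bins partitioning [0, 1]."""
    u = (np.arange(M) + rng.random(M)) / M            # one uniform per bin
    idx = np.abs(chi_vals[None, :] - u[:, None]).argmin(axis=1)
    return pool[idx]

rng = np.random.default_rng(0)
pool = rng.normal(size=1000)                  # stand-in for start/end points P_n
chi_vals = 1.0 / (1.0 + np.exp(-3.0 * pool))  # stand-in for chi_n (a sigmoid)
xs = chi_stratified_subsample(pool, chi_vals, M=20, rng=rng)
```

The selected points have $\chi$-values close to the centers of the $M$ strata, i.e. they are approximately uniform along the reaction coordinate; in the full algorithm the extremal samples would additionally be retained.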
The main ISOKANN routine can be summarized by three loops (N power iterations, M training points, K trajectories) which can be loosely tied to the three main ingredients of ISOKANN: (1) The power iteration learning the dominant subspace.
(2) The neural network approximation of the next χ iterate.
(3) The Monte-Carlo simulation of the Koopman evaluation.
In the outermost loop (1) we perform the power iteration from $\chi_n$ to $\chi_{n+1}$. Whereas in theory we have convergence for $n \to \infty$, we chose to terminate after a fixed number of iterations $N$. This could be replaced by classical convergence criteria such as relative and absolute tolerances. Note that the rate of convergence, and hence the required number of power iterates $N$, depends on the eigenvalues of the Koopman operator, where a bigger spectral gap implies faster decay of the non-dominant spectrum and hence faster convergence.
When training the neural network (2) we loop over a batch of labeled training data at the $M$ training points $x_m$, which in turn (3) requires $K$ individual trajectory simulations. Both the number of training points $M$ as well as the number of trajectories per point $K$ depend on the step sizes chosen for the neural network optimizer. Since in practice the evaluation of the Koopman operator is rather expensive compared to the neural network update, it may be efficient to perform multiple update steps on the same batch of data before proceeding to the next iteration.
Note that the variance of the training data, scaling with $K^{-1}$, is particularly high for metastable systems due to the impact of rare transitions. Whereas above we proposed the use of $\chi$-stratified subsampling as a heuristic to deal with sampling in $X$ space, we will now address the problem of variance in the "$K\chi_n$-direction" using the techniques of optimal control and importance sampling.

Algorithm 1 ISOKANN
Input:
  N — the number of power iterations
  M — the number of x-samples
  K — the number of Koopman Monte-Carlo samples
  χ_0 — initial neural network
Output: χ_N approximates a χ-function (cf. Eq. (5))
 1: for n = 0 to N − 1 do
 2:   choose training points x_m, m = 1, …, M (e.g. χ-stratified)
 3:   for m = 1 to M do
 4:     for k = 1 to K do
 5:       simulate the SDE (1) from x_m over [0, T], store the end point y_{k,m}
 6:     end for
 7:     κ_m ← (1/K) Σ_k χ_n(y_{k,m})      ▷ Koopman Monte-Carlo estimate
 8:   end for
 9:   s ← S(κ)                            ▷ transformed target data
10:   train χ_{n+1} on the pairs (x_m, s_m)
11: end for

Optimal sampling of Koopman eigenfunctions and χ-functions
In this chapter we first recall importance sampling, before showing how the theory allows us to better sample eigenfunctions of the Koopman operator and ISOKANN $\chi$-functions. We conclude the chapter with a numerical example.

Importance Sampling for random variables
Importance sampling allows us to express the expectation value of an observable $f > 0$ with respect to some distribution $p$ through an expectation value with respect to some other distribution $q$ (with $p \ll q$) by the formula

$\mathbb{E}_p[f] = \mathbb{E}_q\!\left[ f \, \frac{dp}{dq} \right],$

where the observable $f$ is reweighted by the Radon-Nikodym derivative $\frac{dp}{dq}$. It is easy to see that by choosing $q^*$ such that $\frac{dp}{dq^*} = \frac{Z}{f}$ (i.e. $q^* = \frac{f}{Z}\, p$ with $Z = \mathbb{E}_p[f]$), we have

$f \, \frac{dp}{dq^*} = Z \quad \text{almost surely}.$

Now, since $Z$ is a constant, it can be computed with a single (reweighted) $f$-sample from $q^*$. We therefore refer to this importance sampler with sampling distribution $q^*$ as a zero-variance sampler, or optimal importance sampler. Note however that we needed to know the (a priori unknown) result $Z$ in order to define the optimal sampling distribution $q^*$.
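The zero-variance construction is easy to verify for a discrete random variable. In the following toy example of ours, every sample reweighted under $q^* = f p / Z$ equals $Z$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
vals = np.array([1.0, 2.0, 10.0])       # f on a three-point space
p = np.array([0.7, 0.2, 0.1])           # target distribution
Z = float(np.sum(vals * p))             # E_p[f] = 2.1, the sought quantity

def is_estimate(q, n):
    """Importance-sampling samples of E_p[f] under proposal q."""
    i = rng.choice(len(p), size=n, p=q)
    return vals[i] * p[i] / q[i]        # f reweighted by dp/dq

naive = is_estimate(p, 10_000)          # q = p: plain Monte Carlo
q_star = vals * p / Z                   # optimal proposal q* = f p / Z
optimal = is_estimate(q_star, 10_000)   # every sample equals Z up to rounding
```

The naive samples scatter widely around $Z$ (the rare value 10 dominates the variance), while the optimally reweighted samples are all constant, illustrating why a single draw from $q^*$ suffices.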

Optimal Importance Sampling for Diffusion Processes
Since we are working with diffusion processes, importance sampling is further complicated in that the measures are path measures and admit no probability density function. However, importance sampling can still be generalized to stochastic processes. Let us first consider the diffusion process of Eq. (1). In general, one is interested in computing the expectation of path-dependent quantities. Let us define the work along a trajectory over $[0, T]$ by the accumulation of a running cost $f$ and a terminal cost $g$ as

$W(X) := \int_0^T f(X_s)\, ds + g(X_T). \qquad (31)$

One then is interested in estimating expectation values of the form

$\psi(x) := \mathbb{E}\!\left[ e^{-W(X)} \mid X_0 = x \right]. \qquad (32)$

Girsanov's theorem builds the bridge from importance sampling to diffusion processes by allowing us to sample from another diffusion process. In particular, it allows us to compute the change of measure in terms of the Radon-Nikodym derivative. To this end let us introduce the controlled process

$dX_t^u = \left( b(X_t^u) + \sigma u(X_t^u, t) \right) dt + \sigma\, dB_t \qquad (33)$

with an admissible control term $u$ acting as an external forcing to the original dynamics. Note that with zero control $u = 0$ one recovers the original dynamics $X = X^{u=0}$. Let $\mathbb{P}$ denote the path measure induced by $X$ and $\mathbb{Q}$ the measure induced by $X^u$. According to Girsanov's theorem the change of measure from $\mathbb{Q}$ to $\mathbb{P}$ (analogous to $\frac{dp}{dq}$ above) along a given controlled trajectory $X^u$ is given by

$\frac{d\mathbb{P}}{d\mathbb{Q}}(X^u) = \exp\!\left( -\int_0^T u(X_s^u, s) \cdot dB_s - \frac{1}{2} \int_0^T |u(X_s^u, s)|^2\, ds \right), \qquad (34)$

which in turn provides an unbiased estimator of $\psi$ in terms of the controlled process:

$\psi(x) = \mathbb{E}\!\left[ e^{-W(X^u)} \, \frac{d\mathbb{P}}{d\mathbb{Q}}(X^u) \,\middle|\, X_0^u = x \right]. \qquad (35)$

Note that even though the expected value of this estimator is the same for any control $u$, its variance will vary. Analogous to the case above there exists an optimal measure corresponding to an optimal control $u^*$ for which the controlled estimator exhibits zero variance [5,10]:

Theorem 2. The optimal control $u^*$ is given by

$u^*(x, t) = \sigma^\top \nabla_x \log \psi_t(x), \qquad \psi_t(x) := \mathbb{E}\!\left[ e^{-\int_t^T f(X_s)\, ds - g(X_T)} \,\middle|\, X_t = x \right],$

and leads to the zero-variance estimator for $\psi$:

$\psi(x) = e^{-W(X^{u^*})} \, \frac{d\mathbb{P}}{d\mathbb{Q}}(X^{u^*}) \quad \text{almost surely}.$
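For an Ornstein-Uhlenbeck process with terminal cost $g(x) = x$ the value function $\psi_t$ is known in closed form, so Theorem 2 can be tested directly. The following sketch (our own construction, with $\sigma = 1$, $b(x) = -x$ and $f = 0$) simulates the controlled process and applies the Girsanov weight; the closed-form optimal control is $u^*(x, t) = -e^{-(T - t)}$, independent of $x$.

```python
import numpy as np

T = 1.0

def psi_estimate(x0, u, dt=1e-3, n_traj=5000, seed=0):
    """Samples of psi(x0) = E[exp(-g(X_T)) | X_0 = x0] with g(x) = x, drawn
    from the controlled process dX^u = (-X^u + u) dt + dB and reweighted
    with the Girsanov factor exp(-int u dB - 1/2 int u^2 dt)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, float(x0))
    log_w = np.zeros(n_traj)                     # log dP/dQ along each path
    for k in range(int(T / dt)):
        uu = u(x, k * dt)
        dB = np.sqrt(dt) * rng.standard_normal(n_traj)
        log_w += -uu * dB - 0.5 * uu**2 * dt
        x += (-x + uu) * dt + dB
    return np.exp(-x + log_w)                    # reweighted exp(-g(X_T))

# Exact value: X_T ~ N(x0 e^{-T}, (1 - e^{-2T}) / 2) for this OU process.
x0 = 1.0
mu, var = x0 * np.exp(-T), (1.0 - np.exp(-2.0 * T)) / 2.0
exact = np.exp(-mu + var / 2.0)

# psi_t(x) = exp(-x e^{-(T-t)} + c(t)), hence u*(x, t) = -exp(-(T - t)).
uncontrolled = psi_estimate(x0, lambda x, t: np.zeros_like(x))
controlled = psi_estimate(x0, lambda x, t: -np.exp(-(T - t)) * np.ones_like(x))
```

Both estimators are unbiased, but up to time-discretization error the controlled samples are all (nearly) identical, reproducing the zero-variance property of Theorem 2.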

Optimal sampling of eigenfunctions of the Koopman operator
We will now show how this optimal control theorem can be used to evaluate the Koopman operator.
Corollary 2.1. Let $h \in L^\infty(X)$ be a positive function. A single realization of the controlled process $X^{u^*}$ with control

$u^*(x, t) = \sigma^\top \nabla_x \log \left( K^{T-t} h \right)(x),$

starting in $X_0^{u^*} = x$, then gives the evaluation of $K^T h$ at that point $x$:

$(K^T h)(x) = h(X_T^{u^*}) \, \frac{d\mathbb{P}}{d\mathbb{Q}}(X^{u^*}) \quad \text{almost surely}.$

Proof. With $f(x) = 0$ and $g(x) = -\log h(x)$, Eq. (32) becomes

$\psi(x) = \mathbb{E}\!\left[ e^{\log h(X_T)} \,\middle|\, X_0 = x \right] = (K^T h)(x).$

Application of Theorem 2 then leads to the desired result.
This result shows us that in order to compute the optimal control for evaluating $Kh$ we need access to the derivatives of $Kh$. This conundrum is in line with the general optimal importance sampling result and comes as no surprise. In the next section we will argue how this result can still be of use for ISOKANN, where the convergence of the $\chi_n$ provides us with an approximate description which we use to compute the control.
Let us for now consider the case where the observable of interest $h$ is an eigenfunction of $K^T$, i.e. $h = v_i$ with eigenvalue $\lambda_i(T)$. In this case we can replace the action of the Koopman operator by its eigenvalue, which in turn cancels out after application of $\nabla \log$ and results in a time-independent control:

$u^*(x, t) = \sigma^\top \nabla_x \log\!\left( \lambda_i(T - t)\, v_i(x) \right) = \sigma^\top \frac{\nabla v_i(x)}{v_i(x)}.$

This simple example provides a good opportunity to get a feeling for how optimal importance sampling works. We can see that the control pushes the system in the direction of (relative) maximal ascent. If the system follows the forcing, its expected evaluation increases, but the path taken also has an increased probability, resulting in a decreasing Girsanov weight. Both increments happen on a commensurate "relative scale": the expectation value gets pushed by an amount relative to its current value ($\frac{\nabla v}{v}$), whereas the reweighting is adjusted relatively in magnitude due to the exponentiated integral (which may be seen as an infinite product over all time-points) in (34). In this way the increase of the observable is balanced against the decreasing weight exactly, so that no matter the path taken these always equalize and one obtains a zero-variance sampler.
Unfortunately, however, the control becomes singular whenever v_i(x) = 0. By the Perron-Frobenius theorem every non-dominant eigenfunction changes sign, i.e. crosses 0 at some point, so we have to find a way around this problem. We can alleviate it by shifting and (anticipating the form of S in (12)) also rescaling the eigenfunction.
Denote the shift-scaled eigenfunction by χ. Due to the linearity of K^T we obtain the corresponding propagated function, and thus, after application of Corollary 2.1, the control in terms of χ. In this case the control is time-dependent (which makes sense, as the relative contributions of the dominant eigenfunctions to the expectation value change over time). From λ_i(t) ≤ λ_i(0) = 1 we can conclude that u*(·, t) = γ σ ∇χ/χ with γ monotonically increasing and γ(1) = 1, i.e. the control points in the same direction as in the pure eigenfunction case, starting weaker and increasing until it hits full magnitude at t = T.
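Concretely, writing the shift-scaled eigenfunction as χ = α v_i + β with α > 0 (a sketch in our notation; the precise constants follow from the form of S in (12)), linearity of K^{T−t} together with K^{T−t}1 = 1 gives:

```latex
K^{T-t}\chi = \alpha\,\lambda_i(T-t)\, v_i + \beta,
\qquad
u^*(x,t) = \sigma \nabla \log\bigl(K^{T-t}\chi\bigr)(x)
         = \sigma\, \frac{\alpha\,\lambda_i(T-t)\,\nabla v_i(x)}
                         {\alpha\,\lambda_i(T-t)\, v_i(x) + \beta}.
```

At t = T this reduces to σ∇χ/χ (since λ_i(0) = 1), while for t < T the factor λ_i(T−t) ≤ 1 damps the magnitude, consistent with the monotone factor γ above.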
Note that the requirement for χ functions to satisfy 0 ≤ χ ≤ 1, which so far was merely motivated by their interpretation as macrostates, now also facilitates their optimally controlled importance sampling.

Application to ISOKANN
In the previous section we have shown how to obtain a zero-variance sampler for the Koopman operator in terms of the gradient of its solution, either for general observables h or for (shift-scaled) eigenfunctions v_i (χ). In either case the solution has to be known a priori in order to compute the control. We will now argue how to integrate this result into the ISOKANN procedure.
The main idea of using optimal importance sampling in ISOKANN is to use the intermediate results χ_n and S_n to compute a pseudo-optimal control that lowers the sampling variance. We therefore have to assume that using an approximation to the optimal control indeed leads to a variance reduction. While we do not know of a proof of this statement, it was shown that the objective of the associated optimal control problem is convex in the control [8], which leads us to conjecture that the variance should be well-behaved for approximately optimal controls as well.
Note that the importance sampler (35) is unbiased for any control. Thus, even if the above assumption does not hold, ISOKANN would still converge, albeit more slowly, as long as the variance does not become unbounded 6, under the usual conditions for stochastic gradient descent convergence (i.e. a decaying learning rate).
Let us recall the equation for the control (38). In order to compute the differential of K^{T−t} at χ_n we assume sufficient convergence of ISOKANN, together with (6), to approximate the action of K^T by (S^{n−1})^{−1}. Using the semigroup property of K and the matrix logarithm we can extend this to other lag times T − t to obtain the matrix approximation (48). Note that in the general d-dimensional case the expectation values K^T χ_n in the ISOKANN iterations are vector-valued. Optimal importance sampling, however, works only in the scalar case. Therefore we have to compute an individual control for sampling each component (K^T χ_n)_i individually 7.
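The lag-time interpolation can be sketched as follows (our notation, with a hypothetical shift-scale matrix S; we assume inv(S) has a positive spectrum so that the real matrix logarithm exists):

```python
import numpy as np

# Sketch: approximate K^T by inv(S) on span(chi) and extend to other lag times
# via the matrix logarithm, K^{T-t} ~ exp(((T - t)/T) * log(inv(S))).

def matfun(A, f):
    """Apply a scalar function f to a diagonalizable matrix via eigendecomposition."""
    lam, V = np.linalg.eig(A)
    return (V @ np.diag(f(lam.astype(complex))) @ np.linalg.inv(V)).real

def K_lag(S, t, T):
    """Matrix approximation of the action of K^{T-t} in the chi coordinates."""
    L = matfun(np.linalg.inv(S), np.log)      # matrix logarithm of inv(S)
    return matfun((T - t) / T * L, np.exp)    # matrix exponential at rescaled lag

S = np.array([[0.8, 0.2], [0.1, 0.9]])        # hypothetical shift-scale matrix
print(K_lag(S, 0.0, 1.0))                     # ~ inv(S): full lag T
print(K_lag(S, 1.0, 1.0))                     # ~ identity: lag 0
```

By construction the approximation respects the semigroup property: composing two half-lag matrices recovers the full-lag matrix inv(S).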
Thus, using the optimal control Corollary 2.1 together with the matrix approximation (48), we can compute the pseudo-optimal control for the i-th component of K^T χ_n explicitly. In the case of 1D-ISOKANN the action of the Koopman operator on χ_n converges to a shift-scale as in (43), and we can therefore estimate the parameters α, β and λ_2 from the extrema of Kχ_{n−1} in order to apply the explicit control for the shift-scaled eigenfunction (44). Now that we know how to compute the control, we can modify the algorithm (Line 5) by sampling the trajectories according to the controlled SDE (33). In order to compute the reweighting (34) we either have to save the trajectory and noise, or integrate it on the fly as an additional SDE component. Finally, for the Koopman Monte Carlo approximation (Line 6), we average over the χ evaluations at the endpoints of the K independent trajectories starting in x_m, weighted with their respective weights G_m. In this way (and under the above assumption) we obtain a feedback loop: a better approximation of the χ-functions results in a better approximation of the action of K, and hence in a better approximation of the optimal control. This pseudo-optimal control in turn decreases the sampling variance, which facilitates a better approximation of the power iterates, i.e. of the χ-functions.
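The modified sampling step can be sketched as follows (our notation, not the authors' implementation): Euler-Maruyama for the controlled overdamped dynamics in the double-well potential U(x) = (x² − 1)², with the log Girsanov weight integrated on the fly as an additional component and the weighted Koopman Monte Carlo average at the end. The sigmoid χ ansatz and the clipping threshold are hypothetical stand-ins for the current network χ_n, and for simplicity we use the time-independent control u = σ∇χ/χ.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, T, dt = 1.0, 1.0, 1e-3
n_steps = int(T / dt)

gradU = lambda x: 4.0 * x * (x**2 - 1.0)        # double well U(x) = (x^2 - 1)^2
chi = lambda x: 1.0 / (1.0 + np.exp(4.0 * x))   # hypothetical chi_n: 1 -> 0 switch
dchi = lambda x: -4.0 * chi(x) * (1.0 - chi(x)) # its derivative

def u_ctrl(x):
    # pseudo-optimal control u = sigma * grad(chi) / chi, clipped for stability
    return np.clip(sigma * dchi(x) / np.maximum(chi(x), 1e-6), -10.0, 10.0)

K = 20                       # trajectories per starting point
x = np.full(K, -1.0)         # start in the left well
logG = np.zeros(K)
for _ in range(n_steps):
    u = u_ctrl(x)
    dW = rng.normal(0.0, np.sqrt(dt), K)
    logG += -u * dW - 0.5 * u**2 * dt               # reweighting, on the fly
    x += (-gradU(x) + sigma * u) * dt + sigma * dW  # controlled SDE

est = np.mean(chi(x) * np.exp(logG))  # weighted Koopman Monte Carlo estimate
print(est)
```

Saving only the scalar logG per trajectory avoids storing the full trajectory and noise, which is the point of the additional SDE component.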
As a proof of concept we now illustrate the reduction of variance using the classic double-well potential.

Example: Controlled 1D-ISOKANN for the double well
Let us consider the controlled process as stated in (33) for the double-well potential (9), the simplest problem exhibiting metastable behavior and hence a challenging sampling variance. In our experiments we compare the training performance of the ISOKANN algorithm with and without the control (44).
We start with a randomly initialized fully connected network χ_0 : R → R with sigmoidal activation functions and 2 hidden layers, each of size 5 (i.e. with layer sizes 1×5×5×1). For each network generation n, we compute Monte Carlo approximations of the Koopman expectation at M = 30 positions. These are initially drawn uniformly from the interval [−2, 2] and subsequently obtained by χ-stratified subsampling as described in Section 2.6. From each starting position we then simulate K = 20 trajectories using the SROCK2 SDE integrator of strong order 1 with step-size ∆t = 0.001. The next generation χ_{n+1} is trained against these M training points by L = 500 stochastic gradient descent steps using the ADAM optimizer (with learning rate η = 0.001). We repeat this evaluation-training procedure (corresponding to a single power iteration) for a total of N = 50 iterations.
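The structure of this evaluation-training loop can be sketched compactly (a hypothetical simplification, not the authors' implementation: the network is replaced by a single sigmoid χ(x) = s(wx + b) trained by plain gradient descent instead of a 1×5×5×1 net with ADAM, trajectories use uncontrolled Euler-Maruyama instead of SROCK2, and starting points are drawn uniformly in every iteration instead of by χ-stratified subsampling):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, T, dt = 1.0, 0.5, 1e-3
M, K, N, L, eta = 30, 20, 10, 200, 0.2

gradU = lambda x: 4.0 * x * (x**2 - 1.0)  # double well U(x) = (x^2 - 1)^2
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

w, b = rng.normal(), rng.normal()          # parameters of chi(x) = sig(w x + b)
for n in range(N):                         # power iterations
    x0 = rng.uniform(-2.0, 2.0, M)
    X = np.repeat(x0, K)                   # K trajectories per starting point
    for _ in range(int(T / dt)):
        X += -gradU(X) * dt + sigma * rng.normal(0.0, np.sqrt(dt), M * K)
    y = sig(w * X + b).reshape(M, K).mean(axis=1)        # K^T chi estimates
    y = (y - y.min()) / max(y.max() - y.min(), 1e-12)    # empirical 1D shift-scale
    for _ in range(L):                     # train chi against the M targets
        p = sig(w * x0 + b)
        g = 2.0 * (p - y) * p * (1.0 - p) / M            # d MSE / d z
        w -= eta * np.sum(g * x0)
        b -= eta * np.sum(g)

loss = np.mean((sig(w * x0 + b) - y) ** 2)
print(w, b, loss)
```

Each outer pass corresponds to one power iteration: Koopman Monte Carlo evaluation, shift-scale, then gradient-descent training of χ against the rescaled targets.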
For each experiment we monitor the root of the training loss (26), i.e. the root mean squared error, and the mean standard deviation of the MC estimator (27) over the training phase of N = 50 iterations with L = 500 training steps each. Let us now compare the uncontrolled with the controlled experiment. Figure 5 shows the (square root of the) training loss together with the standard deviation of the Monte Carlo estimator for the two cases of study. In Figure 5a we observe that the uncontrolled system quickly (after 3 iterations) approaches its plateau at an error of about 10^{−1}, but afterwards the training loss does not decrease any further. This comes as no surprise, since the training data exhibit noise of the same magnitude, and we cannot expect the average loss to be lower than the noise in the data. Note, however, that even though the loss itself seems to have leveled off, this does not necessarily mean that the solution does not improve: whilst the empirical loss will necessarily remain at the level of the noise, the solution could still converge due to the inherent averaging of that noise in the SGD method.
Looking at the controlled experiment, we observe in Figure 5b that the loss behaves similarly to the uncontrolled experiment for the first 3 iterations, reaching a value of 10^{−1}. From there on, however, the loss decreases further, getting close to 10^{−3} after 50 power iterations without having hit a plateau. Notice that the training noise, in strong contrast to the uncontrolled case, decreases rapidly from the beginning of the training. It is furthermore interesting to see that the training loss seems to follow the noise level closely, indicating that the sampling variance is indeed of high importance. Let us finally look at Figure 6, which shows the learned χ functions after the application of the ISOKANN algorithm together with the evaluations of S K χ_{n−1}(x_m) at the M random locations x_m. The error bars represent the standard deviation of the individual Monte Carlo estimators, i.e. the noise in the training data. We see that both learned χ functions qualitatively match the expectation. The uncontrolled case, however, has problems reaching 0 resp. 1 at the boundaries, which can be understood as a result of the noise and the subsequent noisy estimation of the empirical shift-scale. Lastly, we notice that the Monte Carlo standard deviation for the Koopman evaluation at the χ-sampled positions is considerably lower for the controlled approach.

Conclusion
In this article we started by enhancing ISOKANN with new theoretical results that prove its strengths, namely a convergence proof (Thm. 1) and a method for reconstructing eigenfunctions from χ-functions (Prop. 1). We also proposed a new adaptive sampling strategy, called χ-stratified sampling, which complements ISOKANN well and deserves further investigation. By formulating ISOKANN in terms of the transformation S (6), we paved the way for higher-dimensional χ-functions while generalizing the original 1D-ISOKANN. However, whereas we argued for using PCCA+ for the construction of S for d > 1, a more detailed study and a proof of convergence for this case remain open for future work.
The second main contribution of this article is the introduction of importance sampling into ISOKANN. Whereas we know that the resulting estimator is unbiased, we argued only heuristically why the variance should not explode. A proof of convexity of the variance in the control, or even better, a proof of the convergence of the control in ISOKANN, is still missing. Note furthermore that the concept of optimal importance sampling may be useful for the iterative solution of Koopman evaluations in general (cf. Cor. 2.1).
An important next step would be to apply controlled ISOKANN to an actual molecular dynamics (MD) system in order to test how well the introduced techniques fare with the complexities of real-world problems. This, however, requires a way to run many trajectories with different starting locations at low overhead, as well as to inject the optimal control into the MD simulations. We hope that once these interfaces are implemented, ISOKANN will enhance the research of molecular systems.

Figure 1: Potential function U of the double well with two metastable regions separated by a potential barrier.

Figure 2: The two dominant eigenfunctions v_1, v_2 of the Koopman operator for the double-well potential.

Figure 5: Training performance over 50 power iterations / batches with 500 ADAM steps each. The blue line shows the training loss and the red line shows the standard deviation of the Monte Carlo samples in the training data.

Figure 6: The blue line shows the learned χ function at the end of training. The red dots show the training targets, i.e. the Koopman evaluations at the x_m, with the Monte Carlo standard deviation as error bars.