ABSTRACT
In this paper the connection between stochastic optimal control and reinforcement learning is investigated. Our main motivation is to apply importance sampling to the sampling of rare events, which can be reformulated as an optimal control problem. By using a parameterised approach the optimal control problem becomes a stochastic optimization problem, which still raises some open questions regarding how to tackle the scalability to high-dimensional problems and how to deal with the intrinsic metastability of the system. To explore new methods we link the optimal control problem to reinforcement learning since both share the same underlying framework, namely a Markov decision process (MDP). We show how the MDP can be formulated for the optimal control problem. In addition we discuss how the stochastic optimal control problem can be interpreted in the framework of reinforcement learning. At the end of the article we present the application of two different reinforcement learning algorithms to the optimal control problem and a comparison of the advantages and disadvantages of the two algorithms.
I. INTRODUCTION
Rare event sampling is an area of research with applications in many different fields, such as finance,1 molecular dynamics,2 and many more. Very often the reason for the occurrence of rare events is that the dynamical system of interest exhibits metastable behavior. Metastability means that the underlying process remains in certain regions of the state space for a very long time and only rarely changes to another region. This change is particularly important for accurate sampling of rare events. The average waiting time for the occurrence of the rare event is orders of magnitude longer than the timescale of the process itself. This behavior is typically observed for dynamical systems following Langevin dynamics and moving in a potential with multiple minima. Here the metastable regions correspond to local minima of the potential. The minima are separated by barriers and the transitions between these regions are of interest. In molecular dynamics, for example, these quantities of interest correspond to the macroscopic properties of the molecules under consideration. Furthermore, it can be shown that the time to cross the barriers scales exponentially with the height of the barrier.3 In terms of sampling it is observed that the variance of the Monte Carlo estimators associated with these rare transitions is often large. One idea to improve these estimators is the application of importance sampling, but other methods, such as splitting methods, have also been proposed. In this article we focus on importance sampling. For a detailed discussion of splitting methods see Ref. 4 and the references therein.
One of the main challenges in importance sampling is to find a good bias so that the reweighted expectation has a low variance. The theory shows that the bias that would lead to a zero variance estimator is related to the quantity one wants to sample; see, e.g., Refs. 5 and 6. Therefore, many different variational methods have been proposed to find a good bias.7 For importance sampling applications in stochastic differential equations it is well known that the optimal bias is actually given as the solution of a Hamilton-Jacobi-Bellman (HJB) equation, a nonlinear partial differential equation.8 Since the HJB equation is the main equation of stochastic optimal control, the importance sampling problem can be interpreted as a stochastic optimal control problem. A stochastic optimization approach to approximate the bias using a parametric representation of the control has been proposed in Ref. 9. In this optimization approach the cost functional is minimised over the weights of the parametric representation to find the best approximation of the control. Methods for solving the related Hamilton-Jacobi-Bellman equation using deep learning based strategies in high dimensions have been developed; see, e.g., Refs. 10–15. The approximation of control functions with tensor trains has been presented in Ref. 16.
Although many methods have been proposed to approach the sampling problem from an optimal control point of view the stochastic optimization formulation offers the possibility to make a connection with reinforcement learning. Reinforcement learning (RL) is one of the three basic machine learning paradigms and has shown impressive results in high-dimensional applications such as Go and others.17,18 The reinforcement learning literature is very rich and many interesting ideas such as model-free, data-driven methods and robust gradient estimation have been extensively explored. From a more abstract point of view optimal control and reinforcement learning are concerned with the development of methods for solving sequential decision problems19 and their connection has been explored to some extent in Ref. 20. A general formulation can be given as follows: an intelligent agent should take different actions in an environment to maximise a so-called cumulative reward associated with a predefined goal. Applications of this formulation can be, for instance, a cleaning robot moving in a complex space, playing various games21,22 or portfolio optimization.8 The environment in which the agent moves is typically given in the form of a Markov decision process (MDP). Solution methods are often motivated by dynamic programming. The main difference between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not require knowledge of an exact mathematical model of the MDP. In addition, reinforcement learning targets large MDPs where exact methods become infeasible. We make the link between the two fields in two ways. One way is to use the optimal control problem and formulate it as a Markov decision process which is the underlying theoretical framework of reinforcement learning. The other way is to formulate the reinforcement learning problem as a stochastic optimization problem. By comparing the resulting optimization problem with the optimization problem derived in Ref. 9 for the importance sampling stochastic optimal control problem, one can see that both agree. Although the link is known there are few papers that show how to explicitly establish this link and the different language in these fields makes it difficult to understand the overlap. For this reason we want to show this connection in more detail for our problem of interest namely the application of importance sampling. However, the underlying connection is much more general and can be used for other applications that use stochastic modeling. Moreover, making this connection has the advantage that ideas developed in one area can be transferred to the other.
The paper is structured as follows. In Sec. II we set the stage of rare event simulation, present the importance sampling problem and state its stochastic optimal control formulation. Section III is dedicated to the introduction of the reinforcement learning framework. In Secs. III A–III D we discuss Markov decision problems, which are the underlying theoretical framework of reinforcement learning. In Sec. III E we state the reinforcement learning optimization problem and in Sec. III F we recap the key ideas behind the main RL algorithms. Section IV is devoted to showing how optimal control and reinforcement learning are related. In Sec. IV A we show how the stochastic optimal control problem can be formulated as an MDP. In Sec. IV B we compare the two optimization formulations presented for both problems. In Sec. IV C we first discuss how a previously presented solution method in the framework of reinforcement learning can be understood as a model-based approach. We introduce the well-known model-free deterministic policy gradient (DPG) algorithms. In Sec. V we present an application of the presented methods to a small toy problem. The focus of this section is the approximation of the optimal control problem which we will discuss in more detail. We conclude the article with a summary of our results.
II. IMPORTANCE SAMPLING SOC PROBLEM
The main goal of this paper is to show how stochastic optimal control (SOC) and reinforcement learning (RL) are related. We introduce a SOC problem related to an importance sampling diffusion problem. However, the importance sampling SOC formulation can be easily adapted to general SOC problems and is not restricted to the importance sampling application. First we motivate the importance sampling problem and present a rather formal formulation. Then we show the relationship to the corresponding Hamilton-Jacobi-Bellman (HJB) equation. Finally we show how this problem can be formulated as a stochastic optimization problem. This section will be formulated in continuous time to show the relation to the HJB equation. In the following sections we will return to the discrete-time problem for convenience as this is the viewpoint used in most of the reinforcement learning literature.
A. The sampling problem
We usually consider a potential V which has many local minima and which induces metastable dynamics in the model under consideration. This metastability leads to a high variance of the Monte Carlo estimator of the path dependent quantity of interest and thus it becomes unreliable. To improve the properties of the estimator we will apply an importance sampling strategy. The general idea of such a strategy is to sample a different dynamical system by introducing a bias and later correcting this effect in the expectation.
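As a brief reminder of the mechanism in its most generic form (written here for an abstract observable f and probability measures P and Q, independent of the specific notation of this paper), importance sampling rests on the change-of-measure identity

$$\mathbb{E}_{P}[f(X)] \;=\; \mathbb{E}_{Q}\!\left[ f(X)\, \frac{\mathrm{d}P}{\mathrm{d}Q}(X) \right],$$

so that sampling under the biased measure Q and reweighting with the likelihood ratio dP/dQ leaves the expectation unchanged, while the variance of the reweighted estimator depends strongly on the choice of Q.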
B. Hamilton-Jacobi-Bellman equation
A priori one can compute the quantity that we originally wanted to estimate as a function of the initial position via, e.g., a finite difference method. However, such a problem becomes intractable in high-dimensional settings due to the curse of dimensionality. Furthermore, we know that the problem we are trying to solve is hard because we are trying to find the solution of a nonlinear PDE.
C. Stochastic optimization problem
Many different ideas have been proposed in the literature to solve this problem. One idea is to use a Galerkin projection of the control onto a space spanned by a finite set of ansatz functions and to optimise over the corresponding weights using a gradient descent method or a cross-entropy method.9,24 Another idea is to solve the deterministic control problem and use it to steer the dynamical system in the right direction.25 A more detailed discussion of solution methods can be found in Ref. 6.
According to the theory, sampling the quantity of interest under the optimal control would result in a zero variance estimator, so that in principle a single trajectory would suffice to determine the quantity of interest. However, due to discretization and numerical issues this is not achievable in the implementation. Still, with a good approximation of the optimal control the sampling effort can be massively reduced and the estimator converges faster with a smaller relative error. Furthermore, it can be shown that the relative error scales exponentially with the approximation error.26 This means that one should use the best possible approximation to find a good estimator of the quantity of interest. Because of this dependence, it is necessary to use methods that find a solution close to the optimum.
For mathematical completeness, let us briefly remark on other possible types of time horizons. In this article Tu is an a.s. finite stopping time with respect to the canonical filtration of the controlled process. In general we could consider
a finite horizon time Tu = Tend, leading to a deterministic stopping time,
a bounded stopping time,
or a general random stopping time.
III. INTRODUCTION TO REINFORCEMENT LEARNING
Before showing the connection between RL and SOC for completeness we would like to give a brief overview of a typical model for a reinforcement learning problem. First, we look at the underlying theoretical framework namely Markov decision processes. Then, we will discuss key concepts of RL theory, such as types of policies, value and Q-value functions, and their recursive relations provided by the Bellman equations. Finally, we will introduce the reinforcement learning problem from an optimization point of view and discuss two different formulations of the RL problem.
As stated by Sutton and Barto: “Reinforcement learning is learning what to do - how to map situations to actions - in order to maximise a numerical reward signal.”21 So from this very first definition we can already see what it takes to define a reinforcement learning problem. To learn what “to do,” we need someone to do something. This is usually called the agent. The agent experiences situations by interacting with an environment. This interaction is based on the actions the agent takes, the reward signals the agent receives from the environment and the next state the agent has moved to by following the controlled dynamics. Usually the interaction with the environment is considered to take place over some time interval (finite or infinite).
The goal of the agent is to reach a predefined goal that is part of the environment. The agent will reach the goal if it chooses the sequence of actions such that the sum of the reward signals received along a trajectory is maximised. Given the dynamics of the environment and a chosen goal, reinforcement learning assumes that there are reward functions that can lead the agent to success in the sense that the predefined goal is reached. However, there is in general no unique reward function that is guaranteed to lead to the predefined goal. Thus, the choice of the reward function can be flexible and depend on the task that the agent has to solve. Furthermore, it may have an influence on the learned behavior.
Environments with sparse reward functions are often difficult to handle, since the received reward signal is mostly uninformative. For example, this is often the case in games where the reward signal is 0 throughout the game and only changes to −1 in the event of a loss or +1 in the event of a win. However, this is not the only possible reward function that makes the agent learn how to play. If a person plays a game of chess with a teacher who comments on her play after every move, the person will receive a much richer reward signal, which will eventually lead to faster learning.
A. Markov decision processes
The theoretical framework of all reinforcement learning problems is a Markov decision process. A typical RL problem has the following elements:
the state space is the set of states, i.e., the set of all possible situations in the environment,
the action space is the set of actions the agent can choose at each state,
the set of decision epochs is the set of time steps corresponding to the times where the agent acts. Let us assume the set of decision epochs is discrete. In this case, it can either be finite, i.e., {0, 1, …, T} with T ∈ ℕ, or infinite, i.e., {0, 1, 2, …}.
- the (state-action) transition probability function provides the probability of transitioning between states after having chosen a certain action. The transition probability function depends on the state s where the agent is, the action a she chooses and the next state s′ the agent moves into, i.e.,
p(s′|s, a) = P(st+1 = s′ | st = s, at = a). (15)
The transition probability function given a state-action pair, p(⋅|s, a), is a conditional probability mass function. If the state space is continuous, p represents the (state-action) transition probability density. Let Γ be a Borel set of the state space; then the probability of transitioning into Γ conditional on being in the state s and having chosen action a is given by
P(st+1 ∈ Γ | st = s, at = a) = ∫Γ p(s′|s, a) ds′. (16)
For more details about continuous MDPs we recommend Ref. 28. For ease of notation, we will use the same symbol p to denote the transition probability function and the transition probability density throughout the article.
- and the reward function is the reward signal the agent will receive after being in state s and having taken action a, i.e.,
rt = r(st, at). (17)
Formally, the tuple consisting of the state space, the action space, the transition probability function and the reward function defines a Markov decision process. A more detailed introduction to MDPs can be found in Ref. 29.
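To make the abstract definition concrete, a minimal sketch of a finite MDP with sampling of the next state could look as follows; the class and the array layout are illustrative assumptions and not part of any specific library or of the code accompanying this paper.

```python
import numpy as np

class FiniteMDP:
    """Illustrative MDP with integer states and actions: p[s, a, s'] is the
    transition probability and r[s, a] the reward signal."""

    def __init__(self, p, r, seed=0):
        self.p = np.asarray(p)   # shape (n_states, n_actions, n_states)
        self.r = np.asarray(r)   # shape (n_states, n_actions)
        self.rng = np.random.default_rng(seed)

    def step(self, s, a):
        """Sample the next state s' ~ p(.|s, a) and return it with the reward r(s, a)."""
        s_next = self.rng.choice(self.p.shape[2], p=self.p[s, a])
        return s_next, self.r[s, a]
```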
B. Reward and state-action transition probabilities
If an action at is chosen in a state st at time t, two things happen. First, the agent receives a reward signal rt = r(st, at). Second, the agent transitions to the next state st+1 according to the transition probability function, st+1 ∼ p(⋅|st, at). In the literature one can also find reward functions that depend on the state, the action and the next state. In this case the order changes: first the agent moves to the next state, and second, the agent receives the reward signal. We will consider the first case throughout this paper. Note that both the (state-action) transition probability function and the reward function depend only on the current state of the agent and the action chosen in that state. This is sufficient to describe the dynamics, since we assume them to be Markovian. Recall that the Markov decision framework can be generalised to non-Markovian dynamics; see, e.g., Ref. 29.
If the dynamics of the environment is deterministic, the state transition probability function is replaced by the so-called environment transition function, denoted by h. In this case the next state is given by st+1 = h(st, at). By introducing in the transition function a dependence on a random disturbance ξt, one can treat both stochastic and deterministic dynamical systems, st+1 = h(st, at, ξt). To treat deterministic systems in this framework one just has to set ξt to zero; see, e.g., Ref. 30.
C. Policies
Policies are the most important part of reinforcement learning. A policy is a mapping that determines what action to take when the agent is in state st at time t. This is why they are sometimes called the agent’s brain. Policies can be either deterministic or probabilistic.
A deterministic policy is a function from the state space into the action space. Deterministic policies are usually denoted by μ, and the action for state st is given by at = μ(st). A stochastic policy is a conditional probability distribution over the action space. Stochastic policies are usually denoted by π, and the new action for state st can be computed by sampling from this conditional probability distribution. The action for st is given by at ∼ π(⋅|st). The deterministic policies can be seen as special cases of stochastic policies where the probability distributions over the action space are degenerate.
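As a minimal illustration (with a one-dimensional state and action space and a linear map as placeholder; none of this is taken from the paper's code), the two types of policies can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(s, theta):
    """Deterministic policy: the action is a function of the state, a = mu(s)."""
    return theta * s

def sample_pi(s, theta, sigma=0.5):
    """Stochastic (Gaussian) policy: a ~ pi(.|s) = N(theta * s, sigma^2)."""
    return rng.normal(loc=theta * s, scale=sigma)

# For sigma -> 0 the Gaussian policy degenerates to the deterministic policy mu.
```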
In the following sections we will show how the goal of reinforcement learning can be expressed in terms of finding policies that maximise the reward signal received at each time step.
D. Trajectories, return and value functions
Depending on the time horizon which is considered, the reinforcement learning literature distinguishes between infinite horizon problems and terminal problems. We are only going to discuss terminal problems here since the SOC problem belongs to this class. For infinite time horizon problems a discount factor has to be taken into account to make sure that the cumulative reward is finite. Details can be found in Ref. 31.
For a small, finite MDP, policy iteration (PI) strategies in general converge to the optimal policy.21 They interleave policy evaluation, where the value function of the current policy is estimated by iterating the Bellman equation, and policy improvement. A direct way to find the optimal actions is given by solving the Bellman equation. For any MDP, results about the existence of an optimal policy can be found in Ref. 29.
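For a small finite MDP with known transition probabilities and rewards, a generic textbook version of policy iteration can be sketched as follows; the discount factor γ used here is a standard ingredient of the discounted setting and is not needed in the undiscounted terminal setting considered later.

```python
import numpy as np

def policy_iteration(p, r, gamma=0.99, tol=1e-8):
    """p: (nS, nA, nS) transition probabilities, r: (nS, nA) rewards."""
    n_states, n_actions, _ = p.shape
    policy = np.zeros(n_states, dtype=int)
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy.
        while True:
            v_new = r[np.arange(n_states), policy] + gamma * (
                p[np.arange(n_states), policy] @ v)
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        # Policy improvement: act greedily with respect to the current value function.
        q = r + gamma * (p @ v)              # shape (nS, nA)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```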
E. RL as optimization problem
In general it is numerically infeasible to optimize over a function space or a space of probability distributions. Hence, one often considers a parametric representation of a policy and tries to optimize with respect to the chosen parameters.
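For instance, a deterministic policy can be represented by a small feed-forward network with parameters θ, over which one then optimizes. The following is a generic sketch; the architecture actually used in the experiments later in the paper is not specified here.

```python
import numpy as np

def init_params(rng, d_state=1, d_hidden=32, d_action=1):
    """Parameters theta of a one-hidden-layer network mu_theta: S -> A."""
    return {
        "W1": rng.normal(scale=0.1, size=(d_hidden, d_state)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(scale=0.1, size=(d_action, d_hidden)),
        "b2": np.zeros(d_action),
    }

def mu_theta(theta, s):
    """Deterministic parametric policy: a = mu_theta(s)."""
    h = np.tanh(theta["W1"] @ s + theta["b1"])
    return theta["W2"] @ h + theta["b2"]
```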
F. Brief summary of RL algorithms
Over the last few years many different algorithms have been developed to solve reinforcement learning problems. For a good but non-exhaustive review we refer to Ref. 32 and the references therein. Many of the algorithms share a general framework, which is summarized in Algorithm 1.
Algorithm 1. Main RL online model-free algorithm.
1: for trajectory = 1, 2, … do
2:   for timestep = 1, 2, …, T do
3:     Evaluate the dynamical system with the current policy π and calculate the reward.
4:   end for
5:   Optimize the policy.
6: end for
The main difference between the methods is how the policy optimization is computed, which depends on the underlying problem. The proposed methods can be distinguished by whether they are model-based or model-free and by whether the policy optimization is done via Q-learning, policy gradient or a combination of both.
Let us first discuss the difference between model-based and model-free approaches. An algorithm is said to be model-based if the transition function is explicitly known (tractable to evaluate) point-wise. In this case we can just take the transition function and sample s′ from p(⋅|s, a). The known transition function contains a lot of information about the underlying dynamics and is therefore very useful for possible solution methods. A method is called model-free when the transition function is not explicitly known. In this case the transition function cannot be used explicitly in the solution methods. The difference is related to the distinction between stochastic optimal control and reinforcement learning. In the case of stochastic optimal control the model is often known while reinforcement learning aims at a more general solution method that does not depend on the underlying dynamical system.
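Schematically, the distinction can be expressed as follows (function and variable names are placeholders, not taken from any specific library):

```python
import numpy as np

# Model-based: the transition function p(.|s, a) is explicitly known and can be
# evaluated or sampled point-wise.
def model_based_step(p, s, a, rng):
    return rng.choice(p.shape[2], p=p[s, a])   # s' ~ p(.|s, a)

# Model-free: only a black-box environment is available; the agent can interact
# with it, but the transition density itself cannot be evaluated.
def model_free_step(env, a):
    return env.step(a)                          # returns, e.g., (s', r, done)
```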
Let us briefly discuss the main underlying solution methods.
1. Q-learning
An extension of the Q-learning idea for continuous action spaces is only possible by considering a separate policy parameterisation leading to an actor-critic setting which will be discussed in detail in Sec. IV C 2. We refer to Ref. 37 and the references therein for a comprehensive study of Q-learning in continuous time.
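For orientation, the classical tabular Q-learning update for finite state and action spaces reads as follows (standard textbook form with learning rate α and discount factor γ; it is stated here only to contrast with the continuous action setting discussed above):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] towards the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```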
2. Policy gradient
The methods presented here are the two main general ideas behind many variants of RL algorithms and were presented very early in the reinforcement learning community. Over the years, many drawbacks have been identified and extensions and combinations of solution methods have been proposed. Since policy evaluation is quite expensive, off-policy methods have been developed, i.e., trajectories simulated under a different policy are used for the current optimization step. This can be done, for example, by using an importance sampling approach; see, e.g., Ref. 39. Methods that only use trajectories that have been sampled with the current policy are called on-policy (or online). A combination of Q-learning and policy gradient has been proposed to overcome the high variance of a pure policy gradient.40 These ideas have been further developed and methods such as TRPO39 and PPO41 have been proposed.
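The basic stochastic policy gradient underlying these methods is the score-function (REINFORCE) estimator, which in generic textbook form (with τ a trajectory sampled under πθ and R(τ) its return; this is not the specific estimator derived later in this paper) reads

$$\nabla_\theta J(\pi_\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],$$

and is approximated in practice by a Monte Carlo average over sampled trajectories.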
All of these algorithms have been developed for stochastic policies but methods for deterministic policies have also been derived. In the next section we will introduce the family of model-free deterministic policy gradient algorithms which provide a policy gradient without needing to know the model explicitly.
The selection presented here is far from complete but a more detailed discussion of RL is beyond the scope of this article. For a more detailed overview we refer to Refs. 21, 32, and 42. Most of the developed methods have a specific domain of application so one needs to carefully consider whether a method can be applied to a specific problem at hand.
IV. THE SOC PROBLEM AS RL FORMULATION
In this section we show how the importance sampling SOC problem can be formulated as a reinforcement learning problem. First we show how to define an MDP for the stochastic control problem. This allows us to construct an RL environment based on the time-discretised stochastic optimal control problem. Then we compare the optimization approaches for both problems and argue that both formulations have a large overlap. After that we discuss how a previously presented method for the SOC problem can be categorised as a reinforcement learning algorithm. Finally we present a family of RL algorithms designed to deal with environments such as our stochastic optimal control problem.
A. Importance sampling SOC problem as RL environment
With this we have defined an MDP for the importance sampling SOC problem and thus changed the viewpoint from the SOC formulation to the RL formulation. Next, let us have a brief look at the two optimization problems for reinforcement learning and stochastic optimal control.
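As an illustration of what such an environment can look like in code, the following gym-style sketch implements an Euler-Maruyama discretization of a controlled overdamped diffusion in a one-dimensional double well. The potential V(x) = (x² − 1)², the placeholder running-cost reward and all parameter values are illustrative assumptions and not the exact definitions used in this paper.

```python
import numpy as np

class DoubleWellEnv:
    """Illustrative RL environment for a controlled diffusion in a one-dimensional
    double well V(x) = (x^2 - 1)^2; the reward is a placeholder running cost of the
    type appearing in importance sampling SOC problems."""

    def __init__(self, beta=1.0, dt=0.01, s_init=-1.0, target=1.0, seed=0):
        self.beta, self.dt = beta, dt
        self.s_init, self.target = s_init, target
        self.rng = np.random.default_rng(seed)

    def grad_V(self, x):
        return 4.0 * x * (x**2 - 1.0)

    def reset(self):
        self.s = self.s_init
        return self.s

    def step(self, a):
        """Euler-Maruyama step of ds = (-grad V(s) + a) dt + sqrt(2/beta) dW."""
        noise = np.sqrt(2.0 * self.dt / self.beta) * self.rng.normal()
        self.s = self.s + (-self.grad_V(self.s) + a) * self.dt + noise
        reward = -(0.5 * a**2 + 1.0) * self.dt   # illustrative running cost
        done = self.s >= self.target             # trajectory ends in the target set
        return self.s, reward, done
```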
B. Comparison between the optimization approaches
Let us start by looking at the two optimization problems given in (25) and (13). Both problems are stochastic optimization problems. The RL optimization problem maximises the expected return and the importance sampling SOC problem minimises the time-discretized cost functional conditional on the initial position. The RL optimization problem is a bit more general as the forward time evolution is stated in a very general setting. The forward trajectories are determined by the probability transition density and the stochastic policy while for the stochastic optimal control problem the time evolution is explicitly given by a controlled SDE (29) for a chosen control. Furthermore, in the RL framework, technical conditions are rarely imposed on the policy. In contrast in the SOC optimization formulation it is known that the optimal control is deterministic and must satisfy some technical assumptions as shown above.
In both cases the expectation is taken over trajectories. For the SOC case the probability density over the trajectories can be given in the same way as in (27). Note that in this particular case the reward along a trajectory depends on the chosen policy.
C. Algorithms for SOC in a RL framework
The first method proposed to solve the importance sampling SOC problem9 was a pure gradient descent, but many other variants of this approach have been developed in the literature; see, e.g., Ref. 23 and the references therein. Nevertheless, most methods are based on the idea of gradient descent. We will derive the gradient descent method in the RL framework and show that it can be interpreted as a deterministic model-based policy gradient method. For the deterministic policy setting, methods have also been proposed in reinforcement learning, namely the model-free DPG family of algorithms and its subsequent variants. We will introduce the main idea of these methods, and in the next section we will present an application of both algorithms.
1. Model-based deterministic policy gradient
2. Model-free deterministic policy gradient
Further ideas behind the success of DQN algorithms,34 such as off-policy training with samples from a replay buffer or the use of separate target networks, have been implemented to provide more stable and robust learning. The replay buffer consists of a finite set of trajectory transitions (st, at, rt, st+1, d) that are stored online or offline. When the replay buffer is full, the oldest tuples are discarded. The transition records required in (41) are randomly sampled from the replay buffer. The motivation for using target networks is to reduce the correlations between the action values Qω(s, a) and the corresponding targets r + (1 − d)Qω(s′, μ(s)). A separate network is used for the targets, and its weights are slowly updated towards the weights of the original network. By forcing the target networks to update slowly, the stability of the algorithm is improved.
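Schematically, these two ingredients can be sketched as follows; the buffer size, field names and the Polyak factor ρ are illustrative and do not refer to the implementation used for the experiments below.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions (s, a, r, s_next, d); once full, the
    oldest tuples are discarded automatically."""

    def __init__(self, max_size=10**6):
        self.buffer = deque(maxlen=max_size)

    def store(self, s, a, r, s_next, d):
        self.buffer.append((s, a, r, s_next, d))

    def sample(self, batch_size):
        # Uniformly sample a batch of stored transitions.
        return random.sample(list(self.buffer), batch_size)

def soft_update(target_params, params, rho=0.995):
    """Slowly move the target network weights towards the current network weights."""
    for key in params:
        target_params[key] = rho * target_params[key] + (1.0 - rho) * params[key]
```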
Finally, the work of Ref. 48 addresses how to deal with a possible overestimation of the value function for continuous action space problems with ideas from double Q-learning. In addition, clipped noise is added to the target actions as a regularisation technique for deep value learning. The idea behind this is that similar actions should have similar action values. The resulting algorithm is the most recent member of the DPG family and is called twin delayed deep deterministic policy gradient (TD3).
In general, the DPG family of algorithms combines the two ideas of policy optimization and Q-learning. On the one hand it is a type of policy gradient algorithm because it uses a parametric representation of the policy and updates its parameters by a gradient ascent method. On the other hand the required gradient depends on the Q-value function, which needs to be approximated. Finally, let us highlight that the resulting DPG algorithms are model-free, i.e., they do not require knowledge of the model transition density (see Refs. 22 and 47 for further details).
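For reference, the deterministic policy gradient that these algorithms ascend has the standard form found in the DPG literature (with ρ^μ the state distribution induced by the current policy μθ and Qω the critic; this is the generic expression, not the specific equation derived earlier in this paper):

$$\nabla_\theta J(\mu_\theta) \;=\; \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\omega}(s, a)\big|_{a = \mu_\theta(s)} \right].$$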
V. EXAMPLES: ONE-DIMENSIONAL DOUBLE WELL POTENTIAL
In this section we compare the use of two different algorithms to solve the RL optimization problem for the importance sampling environment introduced in Sec. IV. We consider the two main possible approaches for environments with continuous states and actions where the desired optimal policy is deterministic. We restrict ourselves to deterministic policy methods because we know from the PDE connection of the SOC problem that the optimal solution is deterministic. The first is an online model-based policy gradient method and the second is an offline model-free actor-critic method.
We repeat all our experiments several times with different random seeds to ensure generalisability. Each experiment requires only one CPU core, and the maximum value of allocated memory is set to 1 GB unless otherwise stated.
A. Model-based deterministic REINFORCE
First we present the results of the deterministic model-based version of the REINFORCE algorithm for the one-dimensional environment described above. We consider an online implementation where the batch of trajectories is not reused after each gradient step. We summarise this method in Algorithm 2.
Algorithm 2. Model-based deterministic policy REINFORCE.
1: Initialize deterministic policy μθ.
2: Choose a batch size K, a gradient based optimization algorithm, a corresponding learning rate λ > 0, a time step size Δt and a stopping criterion.
3: repeat
4:   Simulate K trajectories by running the policy in the environment's dynamics.
5:   Estimate the policy gradient ∇θJ(μθ) from the K sampled trajectories.
6:   Update the parameters θ based on the optimization algorithm.
7: until stopping criterion is fulfilled.
For both the metastable and the non-metastable setting we consider two different batch sizes K ∈ {1, 10³} and use the Adam gradient based optimization algorithm49 with corresponding learning rates λ = 5 × 10⁻⁵ and λ = 5 × 10⁻⁴, respectively. The stopping criterion is a fixed number of gradient steps N = 10⁴, and the approximated policy is tested every 100 gradient updates.
FIG. 1. Trajectories following the null policy for the two settings of study. The actions chosen along the trajectories are null. The trajectories are sampled starting at sinit = −1 until they arrive in the target set. Left panel: snapshots of the metastable trajectory β = 4. Right panel: trajectory positions as a function of the time steps.
Figures 2 and 3 show the L2 empirical error as a function of the gradient updates and the policy approximation at different gradient steps for the two problem settings. In the non-metastable case we can see for the experiment with batch size K = 10³ that the policy approximation agrees well with the reference control already after a relatively small number of gradient steps. On the other hand, with only one trajectory one has to rely on a lower learning rate and therefore learning is much slower. In the more metastable scenario we can see that the policy approximation for K = 10³ is not as close to the HJB policy as in the less metastable case. One can see that after the same number of gradient steps the L2 error differs by more than an order of magnitude. Moreover, for the non-metastable setting with K = 10³ we need to increase the maximum value of allocated memory to 8 GB. For the metastable setting we end up allocating 4 GB for the one trajectory case K = 1 and around 100 GB for the batch case K = 10³.
FIG. 2. Left panel: estimation of L2(μθ) at each gradient step for the non-metastable setting β = 1. Right panel: approximated policy for different gradient updates for the batch of trajectories case (K = 10³).
FIG. 3. Left panel: estimation of L2(μθ) as a function of the gradient steps for the metastable setting β = 4. Right panel: approximated policy for different gradient updates.
B. Model-free deterministic policy gradient
Next we present the application of a model-free DPG method. In particular we implement the TD3 variant of the DDPG method introduced at the end of Sec. IV C 2. First, we use a replay buffer and separate, slowly updated target networks so that the learning of the Q-value function is stabilised. We consider two critic networks to avoid overestimation of the value function. The replay buffer helps to reduce the amount of data that needs to be generated for each gradient estimation step: trajectory transitions are cached and randomly sampled from the cache for the gradient estimation. The method is therefore considered an off-policy algorithm since it does not only use data generated with the current control to estimate the gradients of the actor and critic networks. We summarise this method in Algorithm 3.
Algorithm 3. Model-free deterministic policy gradient (TD3).
1: Initialize actor network μθ and critic networks Qω1 and Qω2.
2: Initialize corresponding target networks: θ′ ← θ, ω1′ ← ω1, ω2′ ← ω2, and choose ρp ∈ (0, 1).
3: Initialize the replay buffer.
4: Choose a batch size K, a gradient based optimization algorithm and corresponding learning rates λactor, λcritic > 0 for both optimization procedures, a time step size Δt and a stopping criterion.
5: Choose the standard deviation of the exploration noise σexpl and lower and upper action bounds alow, ahigh.
6: repeat
7:   Select the clipped action a = clip(μθ(s) + ε, alow, ahigh) with ε ∼ N(0, σexpl²) and step the environment dynamics forward.
8:   Observe next state s′, reward r, and done signal d and store the tuple (s, a, r, s′, d) in the replay buffer.
9:   if s′ is terminal then
10:    Reset trajectory.
11:  end if
12:  for j in range (update frequency) do
13:    Sample batch from replay buffer.
14:    Compute targets (clipped double Q-learning and policy smoothing).
15:    Estimate the critic gradients (mean squared error between Qωi(s, a) and the targets).
16:    Update the critic parameters ωi based on the optimization algorithm.
17:    if j mod policy delay frequency = 0 then
18:      Estimate the actor gradient ∇θJ(μθ).
19:      Update the actor parameters θ based on the optimization algorithm.
20:      Update target networks softly: θ′ ← ρpθ′ + (1 − ρp)θ, ωi′ ← ρpωi′ + (1 − ρp)ωi.
21:    end if
22:  end for
23: until stopping criterion is fulfilled.
We consider a replay buffer that can hold a maximum of 10⁶ trajectory transitions and we let the training start after 10³ time steps. Before the training phase starts, the actions are chosen completely at random; afterwards they follow the actor policy with some exploration noise. For the non-metastable case the standard deviation of the exploration noise is chosen to be σexpl = 1.0, and for the metastable case we choose σexpl = 2.0; in both cases the domain of the action space is restricted to a bounded interval [alow, ahigh] chosen separately for each case. The clipping function is formally defined as clip(⋅, c, d) ≔ max(min(⋅, d), c). The maximum number of time steps per trajectory is set to 10³. We consider a batch size of K = 10³ and use the Adam gradient based optimization algorithm with a learning rate of λactor = λcritic = 10⁻⁴ for both the actor and critic optimization procedures. The training takes place every ten time steps. During each training phase, the actor performs one gradient update for every two gradient updates performed by the critic, i.e., the critic performs ten gradient updates and the actor just five. The standard deviation of the target noise is chosen to be σtarget = 0.2 and the Polyak averaging factor is chosen to be ρp = 0.995. The actor model is tested every 100 trajectories.
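To make step 14 of Algorithm 3 concrete, the target computation (clipped double Q-learning with target policy smoothing) can be sketched as follows; the function signatures, the noise clipping value and the action bounds are illustrative assumptions, and the target is undiscounted in line with the terminal-time setting used here.

```python
import numpy as np

def td3_targets(batch, mu_target, q1_target, q2_target,
                sigma_target=0.2, noise_clip=0.5,
                a_low=-5.0, a_high=5.0, rng=None):
    """Compute the TD3 critic targets for a batch of transitions (s, a, r, s_next, d)."""
    rng = rng or np.random.default_rng()
    s, a, r, s_next, d = batch
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = np.clip(rng.normal(scale=sigma_target, size=np.shape(a)),
                    -noise_clip, noise_clip)
    a_next = np.clip(mu_target(s_next) + noise, a_low, a_high)
    # Clipped double Q-learning: take the minimum of the two target critics.
    q_next = np.minimum(q1_target(s_next, a_next), q2_target(s_next, a_next))
    # Undiscounted target, consistent with the terminal-time setting.
    return r + (1.0 - d) * q_next
```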
Figures 4 and 5 show the evolution of the L2 error as a function of the sampled trajectories and the policy approximation at the end of different trajectories for the two chosen settings. The L2 error is compared with the model-based approach with one trajectory (K = 1).
FIG. 4. Left panel: estimation of L2(μθ) as a function of the trajectories for the non-metastable setting β = 1. Right panel: approximated policy by the actor model after different trajectories.
FIG. 5. Left panel: estimation of L2(μθ) as a function of the trajectories for the metastable setting β = 4. Right panel: approximated policy by the actor model after different trajectories.
For the non-metastable setting we can see that the policy approximation agrees well with the reference policy. We observe that the learning is quite fast: after around 2000 trajectories the L2 error is already smaller than 10⁻¹, but after this point the error does not decrease any further. Moreover, we observe that the model-free method learns faster than the model-based method in terms of generated data, i.e., sampled trajectories. For the more metastable scenario we can see a similar pattern. Despite the metastability of the system the method learns quite fast and even manages to achieve a lower L2 error than the model-based approach. However, it seems that the metastability does affect the stability of the method. Unfortunately, this unstable behavior is also observed for other choices of the hyperparameters of the algorithm, where the L2 error can even blow up.
Finally in Fig. 6 we compare the L2 error as a function of the computation time for the two considered methods. We see that the TD3 method learns a decent control much faster than the model-based approach, especially in the metastable setting. However, we observe that at a certain point the L2-error stops decreasing. In contrast to that, for the model-based method learning is much slower but there is a steady decrease in the L2-error.
FIG. 6. Estimation of L2(μθ) as a function of the computation time for the two considered methods. Left panel: non-metastable setting β = 1. Right panel: metastable setting β = 4.
C. Discussion
Let us now compare the two different methods used above. First we focus on the ingredients needed for each application. The model-based deterministic approach requires the model to be known; without knowing the transition probability density this approach is not possible. For our importance sampling application with overdamped Langevin dynamics the transition probability density can be approximated after time discretization. However, we may be interested in general diffusion processes where this information is not given or cannot be trusted. Alternatively, the model could be learned, which is indeed a current area of research in model-based RL. On the other hand, the model-free alternative only requires knowledge of the reward function, which is always given for the importance sampling problem.
However, this model-free approach comes with its own difficulties. The method relies on a good approximation of the Q-value function, especially along the action axis, because this determines the direction of the gradient that the actor will follow. Figure 7 shows the approximated Q-value function and the advantage function on the discretized state-action grid for both settings. Note that the shape of the Q-value function has the following property: the difference between the Q-values for a given action along the state axis is orders of magnitude larger than the difference between the Q-values for a given state along the action axis. This difference may be the cause of the observed instabilities in the model-free TD3 method. This problem is not specific to our importance sampling application and has been addressed in the field of advantage learning; see, e.g., Ref. 50. Advantage learning is an alternative approach to Q-learning where the advantage function is learned instead of the Q-value function. For future work it may therefore be interesting to exploit dueling network architecture approaches51 where two separate estimators are maintained: one for the value function and one for the advantage function.
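Recall that, in standard notation (not tied to the specific equations of this paper), the advantage function isolates exactly this action-wise variation,

$$A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) - V^{\pi}(s),$$

and dueling architectures maintain separate estimators for V and A, recombining them into Q(s, a) = V(s) + A(s, a).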
FIG. 7. Critic models after the last trajectory for both settings. Left panels: approximated Q-value function Qω. Right panels: approximated optimal advantage function after action space discretization and resulting greedy policy (gray dashed).
Regarding the performance of both approaches, we have seen that the model-based method gets nearer to the optimal solution in the non-metastable setting. For highly metastable dynamics this approach suffers from long running times and a high variance of the gradient estimator.44 In our experimental analysis we observed a significant advantage of the TD3 algorithm over the deterministic REINFORCE algorithm in terms of learning a reasonable control faster. Our experiments suggest several reasons for this superiority. The TD3 algorithm performs a notably higher number of gradient steps per episode compared to the deterministic REINFORCE algorithm, which relies on complete trajectory sampling before each gradient update, making bootstrapping impractical. This difference in gradient steps forces the deterministic gradient method to allocate considerably more memory for each update due to the extended length of trajectories. To ensure a fair comparison we conducted experiments for the case K = 1, allowing more gradient steps per amount of generated data. The faster convergence observed in this scenario suggests that TD3 particularly benefits from the increased number of gradient steps, especially in handling metastable problems. Another critical aspect contributing to TD3's effectiveness is its reliance on an accurate Q-value function approximation. When the Q-value function is well approximated, TD3's gradient updates effectively guide the algorithm toward the optimal policy without requiring complete trajectory sampling, unlike the deterministic gradient estimator, which lacks this correction term. Furthermore, this advantage enables TD3 to pursue off-policy learning, enhancing its overall efficiency and adaptability. A third advantage of TD3 lies in its integration of exploration mechanisms, a feature absent in traditional gradient-based methods. By actively exploring the environment, TD3 efficiently uncovers novel and potentially rewarding state-action trajectories, resulting in more informed policy discovery.
VI. SUMMARY AND CONCLUSION
In this article we have shown that the stochastic optimization approach to importance sampling can be interpreted as a reinforcement learning problem. After presenting the importance sampling problem and a brief introduction to reinforcement learning, we have shown how to formulate an MDP for the corresponding stochastic control problem. The MDP is the basic framework for reinforcement learning, so by constructing the MDP we have established a first link. We then compared the optimization approaches given for both problems. The comparison has shown that the two optimization approaches are similar and that the optimization in the SOC case is a special case of the reinforcement learning formulation: in the SOC case the forward model of the controlled dynamical system is explicitly given, while the reinforcement learning formulation is more general. A third connection has been shown by a detailed discussion of the algorithms developed for the SOC case. Here we have shown that a gradient-based method already proposed in the stochastic optimal control literature can be interpreted as the deterministic policy version of the well-known REINFORCE algorithm, which turns out to be model-based. All in all, we have made three connections. We have also introduced ideas from reinforcement learning that can be applied to problems seeking optimal deterministic policies, namely DPG and its most popular variants DDPG and TD3. These algorithms are model-free policy gradient methods and can be applied to the importance sampling SOC problem. We have presented the application of both algorithms in a low-dimensional setting and discussed their possible advantages and disadvantages. By applying TD3 to the SOC problem we have clearly shown that the importance sampling SOC problem can be interpreted as a reinforcement learning problem.
The main advantage of this is that ideas from reinforcement learning can now be applied to the stochastic optimal control approach to solving the importance sampling problem. For example reinforcement learning has already addressed the question of how to deal with the variance of the gradient estimator. The actor-critic method has been developed to solve this problem and it has been shown that the method achieves this goal. Especially for problems where the time evolution of the dynamical system is strongly influenced by metastable behavior this can be very helpful to reduce the sampling effort. Furthermore, the issue of efficient data usage has been discussed in the reinforcement learning community and various offline methods have been proposed to solve this problem. There are many other interesting ideas that have been addressed by the reinforcement learning community. Thus, this link can be used to efficiently design good and robust algorithms for high-dimensional settings of the importance sampling application.
We think that a combination of our model-based gradient estimator with an actor-critic design could be very interesting for the development of algorithms with fast convergence. Another research direction for us is the application of importance sampling to high-dimensional problems such as molecular dynamics. There is already related work exploring these ideas (see, e.g., Ref. 52). However, a stable application to real molecules is still lacking in the literature and would be a very helpful area of application. Another interesting line of research is the combination of model-free and model-based methods. As we have seen in the experiments, with higher metastability the learning with TD3 becomes unstable at a certain point. One could use TD3 to compute a good starting point so that the metastability is reduced and then switch to model-based optimization, which seems to be much more stable. Similar ideas with pre-initialisation have been proposed in Ref. 44, where the optimization procedure is combined with an adapted version of the metadynamics algorithm.
ACKNOWLEDGMENTS
The authors would like to thank the HPC Service of ZEDAT, Freie Universität Berlin, for computing time.53 The research of J.Q. has been funded by the Einstein Foundation Berlin. The research of E.R.B. has been funded by Deutsche Forschungsgemeinschaft (DFG) through grant CRC 1114 “Scaling Cascades in Complex Systems,” Project No. 235221301, Project A05 “Probing scales in equilibrated systems by optimal nonequilibrium forcing.” Furthermore, we would like to thank the anonymous reviewer for his suggestions and improvements of the article and W. Quer for his linguistic improvements.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
J. Quer: Conceptualization (lead); Investigation (equal); Methodology (lead); Software (equal); Visualization (equal); Writing – original draft (lead); Writing – review & editing (equal). Enric Ribera Borrell: Conceptualization (supporting); Investigation (equal); Methodology (supporting); Software (equal); Visualization (equal); Writing – original draft (supporting); Writing – review & editing (equal).
DATA AVAILABILITY
The code used for the numerical examples is available on GitHub at www.github.com/riberaborrell/rl-sde-is.