Concise, accurate descriptions of physical systems through their conserved quantities abound in the natural sciences. In data science, however, current research often focuses on regression problems, without routinely incorporating additional assumptions about the system that generated the data. Here, we propose to explore a particular type of underlying structure in the data: Hamiltonian systems, where an “energy” is conserved. Given a collection of observations of such a Hamiltonian system over time, we extract phase space coordinates and a Hamiltonian function in these coordinates that acts as the generator of the system dynamics. The approach employs an autoencoder neural network component to estimate the transformation from observations to the phase space of a Hamiltonian system. An additional neural network component is used to approximate the Hamiltonian function on this constructed space, and the two components are trained jointly. As an alternative approach, we also demonstrate the use of Gaussian processes for the estimation of such a Hamiltonian. After two illustrative examples, we extract an underlying phase space as well as the generating Hamiltonian from a collection of movies of a pendulum. The approach is fully data-driven and does not assume a particular form of the Hamiltonian function.

Neural network-based methods for modeling dynamical systems are again becoming widely used, and methods that explicitly learn the physical laws underlying continuous observations in time constitute a growing subfield. Our work contributes to this thread of research by incorporating additional information into the learned model, namely, the knowledge that the data arise as observations of an underlying Hamiltonian system.

We use machine learning to extract models of systems whose dynamics conserve a particular quantity (the Hamiltonian). We train several neural networks to approximate the total energy function for a pendulum, in both its natural action-angle form and also as seen through several distorting observation functions of increasing complexity. A key component of the approach is the use of automatic differentiation of the neural network in formulating the loss function that is minimized during training.

Our method requires the first and second time derivatives of the observations across the regions of interest in state space or, alternatively, sufficient information (such as a sequence of delayed measurements) to estimate them. We include examples in which the observation function is nonlinear and high-dimensional.
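When only a sequence of delayed snapshots is available, the required derivatives can be estimated, for example, with central finite differences. The following minimal sketch is our own illustration (the sampling interval, test signal, and function names are hypothetical, not part of the pipeline described below):

```python
import numpy as np

def finite_difference_derivatives(x, dt):
    """Estimate first and second time derivatives of a uniformly sampled
    signal x(t_i) with central differences (interior points only).

    x  : array of shape (T, d), T snapshots of a d-dimensional observation
    dt : sampling interval
    """
    xdot = (x[2:] - x[:-2]) / (2.0 * dt)                # first derivative, O(dt^2)
    xddot = (x[2:] - 2.0 * x[1:-1] + x[:-2]) / dt**2    # second derivative, O(dt^2)
    return x[1:-1], xdot, xddot                         # aligned with the interior samples

# Example: three delayed measurements of an angle are enough to estimate
# q, qdot, and qddot at the middle time point.
t = np.linspace(0.0, 10.0, 1001)
q = 2.0 * np.sin(0.7 * t)[:, None]
q_mid, qdot, qddot = finite_difference_derivatives(q, t[1] - t[0])
```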

Current data science exploration of dynamics often focuses on regression or classification problems, without routinely incorporating additional assumptions about the nature of the system that generated the data. This has started to change recently, e.g., with the approach of Ref. 32 for extracting generic dynamical systems, where the variables and the possible expressions for the formulas are specified beforehand. In particular, treating the central object to be modeled as the discrete-time flow map $\Phi_\tau(x) = x + \int_0^\tau f(x(t))\,dt$ but learning $\Phi$ directly as a black box may result in qualitative differences from the true system.31 Using the associated differential equation $dx/dt = f(x)$ instead allows us to exploit established numerical integration schemes to help approximate the flow map; this can be accomplished with a neural network by structuring the loss function similarly to such classical numerical integration schemes. In addition to our older work along these lines,6,25–31 more recent work has revived this approach, focusing on treating the layers of a deep neural network as iterates of a dynamical system, where “learning” consists of discovering the right attractors.3,5,14

In particular, interest has focused on how residual networks11 and highway networks33 can be interpreted as iterative solvers8 or as iterated dynamical systems.4 In the latter paper (a NeurIPS 2018 best paper awardee), the authors choose not to unroll the iteration in time explicitly, but instead use continuous-time numerical integration. While the focus was on the dynamics-of-layers concept, time series learning was also performed.

The Koopman operator has also been employed in combination with neural networks to extract conservation laws and special group structures.12,15 Symmetries in relation to conserved quantities are a well-studied problem in physics.10,13,19,20 A recent thread of research consists of learning physics models from observation data,17 including modeling discrete-time data as observations of a continuous-time dynamical system.21,22

It is informative to study physical systems through their conserved quantities, such as total energy, which can be encoded in a Hamiltonian function.1,10 The measure-preserving property of Hamiltonian systems has recently been exploited for transport of densities in Markov-chain Monte-Carlo methods18 and variational autoencoders.2,24 For our purposes, a natural progression of the thread of computational modeling of physics from observations is to represent the Hamiltonian function directly.

Concurrently with this submission, two papers independently addressing similar issues appeared as preprints.9,34 In the first of these,9 a loss function very similar to parts of our Eq. (12) was used. The second paper focuses on the transformation of densities generated through the Hamiltonian. It is possible to draw some analogies between the second preprint and the old (non-Hamiltonian) work we mentioned above,6,25–31 as this newer work also used rollout with a time step templated on a classical numerical integration method (here, symplectic Euler and leapfrog). Both papers also use the pendulum as an example, emphasizing conditions where the system can be well approximated by a linear system: a thin annulus around the stable steady state at (q,p)=(0,0) where the trajectories are nearly circular.

The remainder of the paper is structured as follows:

  1. We derive data-driven approximations (through two approaches: Gaussian processes and neural networks) of a Hamiltonian function on a given phase space, from time series data. The Hamiltonian functions we consider do not need to be separable as a sum H(q,p)=T(p)+V(q), and in our illustrative example, we always work in the fully nonlinear regime of the pendulum.

  2. We build data-driven reconstructions of a phase space from (a) linear and (b) nonlinear, nonsymplectic transformations of the original Hamiltonian phase space. The reconstruction then leads to a symplectomorphic copy of the original Hamiltonian system.

  3. We construct a completely data-driven pipeline combining (a) the construction of an appropriate phase space and (b) the approximation of a Hamiltonian function on this new phase space, from nonlinear, high-dimensional observations (e.g., from movies/sequences of movie snapshots).

A Hamiltonian system on the Euclidean space $E = \mathbb{R}^{2n}$, $n \in \mathbb{N}$, is determined through a function $H: E \to \mathbb{R}$ that defines the equations

\dot{q}(t) = \partial H(q(t), p(t))/\partial p,
(1)
\dot{p}(t) = -\partial H(q(t), p(t))/\partial q,
(2)

where $(\,\dot{}\,) := d/dt$ and $q(t), p(t) \in \mathbb{R}^n$ are interpreted as “position” and “momentum” coordinates in the “phase space” $E$. In many mechanical systems, and in all examples we discuss in this paper, the interpretation of the coordinates $q, p$ is reflected in the dynamics through $\dot{q} = p$, i.e., $H(q,p) = \frac{1}{2}\|p\|^2 + h(q)$ for some function $h: \mathbb{R}^n \to \mathbb{R}$. In general, Eqs. (1) and (2) imply that the Hamiltonian is constant along trajectories $(q(t), p(t))$ because

\frac{d}{dt}H(q,p) = \frac{\partial H}{\partial q}(q,p)\,\dot{q} + \frac{\partial H}{\partial p}(q,p)\,\dot{p} = \frac{\partial H}{\partial q}\frac{\partial H}{\partial p} - \frac{\partial H}{\partial p}\frac{\partial H}{\partial q} = 0.
(3)

Equations (1) and (2) can be restated as a partial differential equation for $H$ at every point $(q,p) \in E$,

\underbrace{\begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}}_{\omega} \nabla H(q,p) - \nu(q,p) = 0,
(4)

where $I \in \mathbb{R}^{n\times n}$ is the identity matrix and $\nu$ is the vector field on $E$ [the left-hand side of Eqs. (1) and (2)], which only depends on the state $(q,p)$. The symplectic form on the given Euclidean space takes the form of the matrix $\omega$.

In Sec. III, we discuss how to approximate the function $H$ from given data points $D = \{(q_i, \dot{q}_i, \ddot{q}_i)\}_{i=1}^{N}$. This involves solving the partial differential equation (4) for $H$. Since these equations determine $H$ only up to an additive constant, we assume that we also know the value $H_0 = H(q_0, p_0)$ of $H$ at a single point $(q_0, p_0)$ in the phase space. This is not a major restriction for the approach because $H_0$ as well as $(q_0, p_0)$ can be chosen arbitrarily.

As an example, consider the case n=1 and the Hamiltonian

H(q,p) = \frac{p^2}{2} + (1 - \cos(q)).
(5)

This Hamiltonian forms the basis for the differential equation of the nonlinear pendulum, $\ddot{q} = -\sin(q)$, or, in first-order form, $\dot{q} = \partial H(q,p)/\partial p = p$ and $\dot{p} = -\partial H(q,p)/\partial q = -\sin(q)$. In this section, we numerically solve the partial differential equation (PDE) (4) by approximating the solution $H$ using two approaches: Gaussian processes23 (Sec. III A) and neural networks (Sec. III B).
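For concreteness, training data for this example can be generated by sampling phase-space points and evaluating the pendulum vector field at them. The sketch below is illustrative; the sampling ranges are chosen to match Fig. 1(b), and the function and variable names are our own:

```python
import numpy as np

def pendulum_vector_field(q, p):
    """Hamilton's equations for H(q, p) = p^2/2 + (1 - cos q)."""
    qdot = p               # dH/dp
    pdot = -np.sin(q)      # -dH/dq
    return qdot, pdot

# Sample phase-space points and attach the time derivatives that the
# regression methods below use as "gradient observations" of H.
rng = np.random.default_rng(0)
N = 625
q = rng.uniform(-2.0 * np.pi, 2.0 * np.pi, N)
p = rng.uniform(-1.0, 1.0, N)
qdot, pdot = pendulum_vector_field(q, p)
data = np.column_stack([q, p, qdot, pdot])   # rows (q_i, p_i, qdot_i, pdot_i)
```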

We model the solution $H$ as a Gaussian process $\hat{H}$ with a Gaussian covariance kernel,

k(x, x') = \exp\!\left(-\|x - x'\|^2/\epsilon^2\right),
(6)

where $x$ and $x'$ are points in the phase space, i.e., $x = (q,p)$, $x' = (q',p')$, and $\epsilon \in \mathbb{R}^+$ is the kernel bandwidth parameter (we chose $\epsilon = 2$ in this paper). Given a collection $X$ of $N$ points in the phase space, as well as the function values $H(X)$ at all points in $X$, the conditional expectation of the Gaussian process $\hat{H}$ at a new point $y$ is

E[\hat{H}(y)\,|\,X, H(X)] = k(y, X)^T k(X,X)^{-1} H(X),
(7)

where we write $k(X,X)_{i,j} := k(x_i, x_j)$ for the kernel matrix evaluated over all $x$ values in the given dataset $X$. In Eq. (7), the dimensions of the symbols are $y \in \mathbb{R}^{2n}$, $k(X,X) \in \mathbb{R}^{N\times N}$, $k(y,X) \in \mathbb{R}^{N}$, and $H(X) := (H(x_1), H(x_2), \dots, H(x_N)) \in \mathbb{R}^{N}$. All vectors are column vectors. Estimates of the solution $H$ to the PDE at new points depend on the value of $H$ over the entire dataset. We do not know the values of $H$, but differentiating Eq. (7) allows us to set up a system of equations to estimate $H$ at an arbitrary number $M$ of new points $y$ close to the points $x \in X$, by using the information about the derivatives of $H$ given by the time derivatives $\dot{q}$ in Eq. (1) and (the negative of) $\dot{p}$ in Eq. (2). Let $g(x_i) \in \mathbb{R}^{2n}$ be the gradient of $H$ at a point $x_i$ in the phase space, which we know from the data $(\dot{q}, \dot{p})$. Then,

\nabla_y \hat{H}(y)\big|_{y=x_i} \approx g(x_i), \qquad \nabla_y k(y, X)^T\big|_{y=x_i}\, k(X,X)^{-1} H(X) \approx g(x_i).
(8)

Together with an arbitrary pinning term at a point $x_0$, the list of known derivatives leads to a linear system of $2nN + 1$ equations, where we write

\nabla_x k(x_i, Y) := \nabla_x k(x, x')\big|_{x = x_i,\; x' \in Y}
(9)

for the derivative of the given kernel function $k$ with respect to its first argument, evaluated at a given point $x_i$ in the first argument and at all points in the dataset $Y$ in the second argument. For each $x_i$, we thus have a (column) vector of $M$ derivative evaluations, which is transposed and stacked into a large matrix to form the full linear system

\underbrace{\begin{pmatrix} \nabla_x k(x_1, Y)^T k(Y,Y)^{-1} \\ \nabla_x k(x_2, Y)^T k(Y,Y)^{-1} \\ \vdots \\ \nabla_x k(x_N, Y)^T k(Y,Y)^{-1} \\ k(x_0, Y)^T k(Y,Y)^{-1} \end{pmatrix}}_{\in\,\mathbb{R}^{(2nN+1)\times M}} \underbrace{H(Y)}_{\in\,\mathbb{R}^{M}} = \underbrace{\begin{pmatrix} g(X) \\ H_0 \end{pmatrix}}_{\in\,\mathbb{R}^{2nN+1}}.
(10)

We reiterate: $X$ is a dataset of $N$ points where we know the derivatives of $H$ through $g(x_i) = (\partial H/\partial q(x_i), \partial H/\partial p(x_i))^T = (-\dot{p}_i, \dot{q}_i)^T \in \mathbb{R}^{2n}$. We evaluate on a fine grid $Y$ of $M$ points (such that $k(Y,Y) \in \mathbb{R}^{M\times M}$ and $\nabla_x k(x_i, Y) \in \mathbb{R}^{2n\times M}$) and have the information $g(X) \in \mathbb{R}^{2nN}$ on a relatively small set of $N$ points called $X$ [black dots in Fig. 1(b)]. The derivative of the Gaussian process can be stated using the derivative of the kernel $k$ with respect to its first argument. The matrix inverse of $k(Y,Y)$ is approximated by the inverse of $(k(Y,Y) + \sigma^2 I)$, that is, through Tikhonov regularization with parameter $\sigma = 10^{-5}$, as is standard for Gaussian process regression. Solving this system of equations for $H(Y)$ yields the approximation of the solution to the PDE. Figure 1(b) shows the result, and Table I lists the mean squared error on the 625 training data points and on an independently drawn set of 200 validation points in the same domain, where no derivatives were available. See Ref. 22 for a more detailed discussion of the solution of PDEs with Gaussian processes.
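A compact illustration of how the linear system (10) can be assembled and solved is given below; it uses the kernel bandwidth and Tikhonov parameter quoted above, but it is a simplified sketch (with our own function names) rather than the exact implementation used for Fig. 1(b):

```python
import numpy as np

eps, sigma = 2.0, 1e-5          # kernel bandwidth and Tikhonov regularization

def k(A, B):
    """Gaussian kernel matrix k(A, B)_{ij} = exp(-||a_i - b_j||^2 / eps^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eps**2)

def grad_k(xi, Y):
    """Gradient of k(x, y_j) with respect to x, evaluated at x = xi; shape (2n, M)."""
    diff = xi[None, :] - Y                                    # (M, 2n)
    return (-2.0 / eps**2) * (diff * k(xi[None, :], Y).T).T   # (2n, M)

def solve_hamiltonian_gp(X, g, Y, x0, H0):
    """Assemble and solve the linear system (10) for H on the fine grid Y.
    Each row of g holds the gradient observation (-pdot_i, qdot_i) at X[i]."""
    KYY_inv = np.linalg.inv(k(Y, Y) + sigma**2 * np.eye(len(Y)))
    rows = [grad_k(xi, Y) @ KYY_inv for xi in X]              # N blocks of shape (2n, M)
    rows.append(k(x0[None, :], Y) @ KYY_inv)                  # pinning row at x0
    A = np.vstack(rows)                                       # (2nN + 1, M)
    rhs = np.concatenate([g.reshape(-1), [H0]])               # (2nN + 1,)
    H_Y, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return H_Y

# Usage with the pendulum data generated above; the pinning value H(0, 0) = 0.
X = data[:, :2]
g = np.column_stack([-data[:, 3], data[:, 2]])                # (dH/dq, dH/dp) = (-pdot, qdot)
Y = np.column_stack([u.ravel() for u in
                     np.meshgrid(np.linspace(-9, 9, 40), np.linspace(-1.5, 1.5, 20))])
H_on_Y = solve_hamiltonian_gp(X, g, Y, x0=np.zeros(2), H0=0.0)
```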

FIG. 1.

(a) Hamiltonian function of the pendulum system. (b) Approximation using Gaussian processes, solving Eq. (10) on a fine grid $Y \subset [-9,9]\times[-1.5,1.5]$, with information $g$ about the derivative of $H$ on a set of 625 randomly sampled points $X \subset [-2\pi,2\pi]\times[-1,1]$ (black points in the pink rectangle). (c) Hamiltonian learned with a neural network, with all loss terms included. Note that the learned function is only accurate where data sampling is sufficiently dense. For this experiment, we used the “softplus” activation function $\sigma_l(z) = \ln(e^z + 1)$. The black points in panel (b) show all training data for the Gaussian process. The points in panel (c) show only a subset of the training data used for the neural network (20 000 in total).

TABLE I.

Comparison of loss between training and held-out validation data. As the sizes of our generated datasets are always much larger than the numbers of trainable parameters, overfitting is not an issue, and the validation loss is always indistinguishable from the training loss. The first row includes the results from the Gaussian process approximation (a nonparametric method) using all training data for inference; only the hyperparameters $\epsilon$ and $\sigma$ needed to be adjusted. For the first row, the loss reported is the MSE of the Gaussian process regression; for the other rows, the loss is computed by either Eq. (12) or Eq. (14), as appropriate.

Section        No. of parameters   No. of training points   No. of validation points   Training loss   Validation loss
Section III A  Nonparametric       625                      200                        2.2 × 10⁻⁵      3.5 × 10⁻⁵
Section III B  337                 20 000                   200                        6.8 × 10⁻⁴      7.4 × 10⁻⁴
Section IV B   345                 20 000                   1000                       5.5 × 10⁻⁶      5.0 × 10⁻⁶
Section IV C   511                 19 200                   200                        1.4 × 10⁻⁴      1.8 × 10⁻⁴
Section IV D   755                 17 489                   200                        0.51            0.54

Another possibility for learning the form of H using data is to represent the function with an artificial neural network7 (ANN). We write

x_l = \sigma_l(x_{l-1} W_l + b_l), \quad l = 1, \dots, L+1,
(11)

where the activation function $\sigma_l$ is nonlinear (except where otherwise indicated, we used $\tanh$) for $l = 1, \dots, L$ (if $L \geq 1$) and the identity for $l = L+1$. The learnable parameters of this ANN are $\{(W_l, b_l)\}_{l=1,\dots,L+1}$, and we gather all such learnable parameters from the multiple layers that may be used in one experiment into a parameter vector $w$. If there are no hidden layers ($L = 0$), then we learn an affine transformation $x_1 = x_0 W_1 + b_1$. This format provides a surrogate function $\hat{H}(q,p) = x_{L+1}$, where the input $x_0$ is the row vector $[q, p]$. (Treating inputs as row vectors and using right-multiplication for weight matrices is convenient, as a whole batch of $N$ inputs can be presented as an $N$-by-2 array.) For all the experiments shown here, this network for computing $\hat{H}$ has two hidden layers of width 16.
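A minimal sketch of such a surrogate network in TensorFlow (the Keras wrapper is used here only for brevity; any TensorFlow model with trainable parameters $w$ would do) is shown below; with two hidden layers of width 16, it has 337 trainable parameters, matching the count reported for Sec. III B in Table I:

```python
import tensorflow as tf

# Surrogate Hamiltonian H_hat(q, p): input is the row vector [q, p],
# two hidden layers of width 16 (tanh), and a linear output layer.
H_hat = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="tanh"),
    tf.keras.layers.Dense(16, activation="tanh"),
    tf.keras.layers.Dense(1),
])
```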

Similarly, in the case(s) where we need to learn additional transformations $\hat{\theta}$ and $\hat{\theta}^{-1}$ (see Sec. IV), they are also represented by such networks.

We collect training data by sampling a number of initial conditions in the rectangle $(q,p) \in [-2\pi, 2\pi]\times[-6, 6]$ and then simulating short trajectories from each to their final $(q,p)$ points. For each of these, we additionally evaluate $(\dot{q}, \dot{p})$. Shuffling over simulations once per epoch and dividing this dataset into batches, we then perform batchwise stochastic gradient descent to learn the parameters $w$ using the Adam optimizer on the objective function defined below.

For this paper, all neural networks were constructed and trained using TensorFlow, and the gradients necessary for evaluating the Hamiltonian loss terms in Eq. (12) were computed using TensorFlow’s core automatic differentiation capabilities.

This objective function comprises a scalar function evaluated on each data 4-tuple $d = (q, p, \dot{q}, \dot{p})$ in the batch and then averaged over the batch. This scalar function is written as

f(q, p, \dot{q}, \dot{p}; w) = \sum_{k=1}^4 c_k f_k,
(12)
f_1 = \left\|\frac{\partial \hat{H}}{\partial p} - \dot{q}\right\|^2,\quad f_2 = \left\|\frac{\partial \hat{H}}{\partial q} + \dot{p}\right\|^2,\quad f_3 = \left\|\hat{H}(q_0, p_0) - H_0\right\|^2,\quad f_4 = \left\|\frac{\partial \hat{H}}{\partial q}\dot{q} + \frac{\partial \hat{H}}{\partial p}\dot{p}\right\|^2,
(13)

where the dependence on $w$ is through the learned Hamiltonian $\hat{H}$, the loss-term weights $c_k$ are chosen to emphasize certain properties of the problem thus posed, and the partial derivatives $\partial\hat{H}/\partial p$ and $\partial\hat{H}/\partial q$ are computed explicitly through automatic differentiation. Except for $c_2$, all $c_k$ values are set to either 1 or 0, depending on whether the associated loss term is to be included or excluded. Because of the square term in Eq. (5), we set $c_2$ arbitrarily to 10 if nonzero, so that the loss is not dominated by $f_1$. An alternative would be to set $c_1$ to 1/10.
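The following sketch indicates how the loss terms of Eq. (12) can be evaluated with TensorFlow's automatic differentiation; the variable names and the particular weight values are illustrative:

```python
import tensorflow as tf

c1, c2, c3, c4 = 1.0, 10.0, 1.0, 0.0   # loss weights as discussed above (f4 monitored only)

def hamiltonian_loss(H_hat, q, p, qdot, pdot, q0p0, H0):
    """Batch-averaged loss of Eq. (12); q, p, qdot, pdot are float tensors
    of shape (batch, 1), q0p0 is the pinning point of shape (1, 2)."""
    with tf.GradientTape() as tape:
        tape.watch([q, p])
        H = H_hat(tf.concat([q, p], axis=1))
    dHdq, dHdp = tape.gradient(H, [q, p])       # partial derivatives of H_hat
    f1 = tf.reduce_mean((dHdp - qdot) ** 2)     # matches dq/dt
    f2 = tf.reduce_mean((dHdq + pdot) ** 2)     # matches -dp/dt
    f3 = tf.reduce_mean((H_hat(q0p0) - H0) ** 2)            # pinning term
    f4 = tf.reduce_mean((dHdq * qdot + dHdp * pdot) ** 2)   # dH/dt along the data
    return c1 * f1 + c2 * f2 + c3 * f3 + c4 * f4
```

In a training step, this scalar is itself differentiated with respect to the network parameters $w$ by an outer GradientTape (or by a standard Keras training loop), which TensorFlow supports through nested tapes.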

Since Eqs. (1) and (2) together imply Eq. (3), any one of the three terms $f_1$, $f_2$, and $f_4$ could be dropped as redundant; we can therefore set $c_4$ to zero but monitor $f_4$ (i.e., $\|\dot{\hat{H}}\|^2$) as a useful sanity check on the accuracy of the learned solution. In Fig. 1(c), we show the results of this process with our default nonzero values for $c_k$.

As an ablation study, we explored the effect (not shown here) of removing the first, second, and fourth terms. By construction, the true time derivative $\dot{H}(q,p)$ is zero for all $(q,p)$. Note that this is only ever achieved to any degree in the central box in the figure, where data were densely sampled. Removing $f_4$ made no visible difference in how well $\dot{\hat{H}} \approx 0$ was satisfied, which was expected due to the redundancy in the set of Eqs. (1)–(3). However, removing either $f_1$ or $f_2$ gives poor results across the figure, despite the apparent redundancy of these terms with $f_4$. This might be due to an imbalance between the contributions of the $\dot{p}$ and $\dot{q}$ terms, which we attempted to compensate for with the unequal weighting values $c_1$ and $c_2$.

We now consider a set of observation functions $\theta: E \to \mathbb{R}^M$, $\theta = (y_1, \dots, y_M)$, with $M \geq \dim E = 2n$, such that $\theta$ is a diffeomorphism between the phase space $E$ and its image $\theta(E)$. In this setting, the notion of a symplectomorphism is important.1 In general, a symplectomorphism is a diffeomorphism that leaves the symplectic structure on a manifold invariant. In our setting, a symplectomorphism of $E = Q\times P$ maps to a deformed space $\hat{E} = \hat{Q}\times\hat{P}$ where the system dynamics in the new variables $\hat{q} \in \hat{Q}$, $\hat{p} \in \hat{P}$ is again Hamiltonian and conjugate to the original Hamiltonian dynamics. Not every diffeomorphism is a symplectomorphism, and we do not assume that $\theta$ is a symplectomorphism. A constant scaling of the coordinates is also not possible to distinguish from a scaling in the Hamiltonian function itself, so the recovered system will be a symplectomorphic copy of the original, scaled by an arbitrary constant.

In the setting of this section, we do not assume access to $E$, $H$, or the explicit form of $\theta$. Only a collection of points $\theta_i$ and time derivatives $\frac{d}{dt}\theta_i$ in the image $\theta(E)$ is available. We describe an approach to approximate a new map $\hat{\theta}^{-1}: \theta(E) \to \hat{E}$ into a symplectomorphic copy $\hat{E}$ of $E$ through an autoencoder,7 such that the transformed system in $\hat{E}$ is conjugate to the original Hamiltonian system in $E$. Upon convergence, and if we had access to $\theta$, the map $\hat{\theta}^{-1}\circ\theta =: S : E \to \hat{E}$ would approximate a symplectomorphism, and $\theta^{-1}\circ\hat{\theta} = S^{-1}$ would be its inverse. During the estimation of $\hat{\theta}^{-1}$, we simultaneously approximate the new Hamiltonian function $\hat{H}: \hat{E} \to \mathbb{R}$. Figure 2(a) visualizes the general approach, where only the information $(x,y)_i$ and $\frac{d}{dt}(x,y)_i$ is available to the procedure, while $\hat{\theta}^{-1}$ and a Hamiltonian $\hat{H}$ on $\hat{E}$ are constructed numerically.

FIG. 2.

(a) Illustration of the network structure. Starting with observations of the (unknown) variables $(x,y) \in \theta(E)$, the autoencoder is structured to map via $\hat{\theta}^{-1}$ to the Hamiltonian form $(\hat{q},\hat{p}) \in \hat{E}$, as well as to map back to the observations through $\hat{\theta}$. The complete process involves estimating a Hamiltonian for (the symplectomorphic copies of) $p, q$ during training of the autoencoder. (b) The transformation $\theta$ from the original space $q, p$ to the observed space $x, y$ for the linear example of Sec. IV B.

An interesting and important feature common to the next three examples is that one cannot expect to systematically recover the original $(q,p)$ values from the given observation data $(x,y)$. Only coordinates $(\hat{q},\hat{p})$ related to the original ones by a symplectic transformation can be recovered, which is enough to define a Hamiltonian. Once the coordinates $(\hat{q},\hat{p})$ are fixed, the Hamiltonian function in these coordinates is unique up to an additive constant.

Table I contains training and validation loss of the networks for all experiments. Additionally, we trained the network from Sec. IV D with only 331 images and did observe a significantly higher validation loss (not shown), consistent with overfitting.

The following loss function is used to train an autoencoder component together with a Hamiltonian function approximation network component:

f(\hat{q}, \hat{p}, \dot{\hat{q}}, \dot{\hat{p}}; w) = \sum_k c_k f_k,
f_1 = \left\|\frac{\partial \hat{H}}{\partial \hat{p}} - \dot{\hat{q}}\right\|^2,\quad f_2 = \left\|\frac{\partial \hat{H}}{\partial \hat{q}} + \dot{\hat{p}}\right\|^2,\quad f_3 = \left\|\hat{H}(\hat{\theta}^{-1}(x_0, y_0)) - H_0\right\|^2,
f_4 = \|\dot{\hat{H}}\|^2 = \left\|\frac{\partial \hat{H}}{\partial \hat{q}}\dot{\hat{q}} + \frac{\partial \hat{H}}{\partial \hat{p}}\dot{\hat{p}}\right\|^2,\quad f_5 = \left\|\hat{\theta}(\hat{\theta}^{-1}(x,y)) - [x,y]\right\|^2,\quad f_6 = \left(\det(D\hat{\theta}^{-1})\right)^{-2},
(14)

where the dependence on $w$ is through the learned Hamiltonian $\hat{H}$ and the learned transformations $\hat{\theta}$ and $\hat{\theta}^{-1}$, and the time derivatives in the space $\hat{E}$ are computed as $\dot{\hat{q}} = \dot{x}\,\frac{\partial\hat{q}}{\partial x} + \dot{y}\,\frac{\partial\hat{q}}{\partial y}$ and $\dot{\hat{p}} = \dot{x}\,\frac{\partial\hat{p}}{\partial x} + \dot{y}\,\frac{\partial\hat{p}}{\partial y}$, using the Jacobian $D\hat{\theta}^{-1} = \begin{pmatrix} \partial\hat{q}/\partial x & \partial\hat{q}/\partial y \\ \partial\hat{p}/\partial x & \partial\hat{p}/\partial y \end{pmatrix}$ of the transformation $\hat{\theta}^{-1}$ (computed pointwise with automatic differentiation). When we learn a (especially nonlinear) transformation $\hat{\theta}^{-1}$ in addition to the Hamiltonian $\hat{H}(\hat{q},\hat{p})$, including the $f_4$ term in the composite loss can have a detrimental effect on the learned transformation. There exists an easily encountered naive local minimum in which $\hat{\theta}^{-1}$ maps all of the sampled values from $\theta(E) \ni (x,y)$ to a single point in $\hat{E}$, and the Hamiltonian learned is merely the constant function at the pinning value, $\hat{H}(\hat{q},\hat{p}) = H_0$. In this state (or an approximation of it), $\partial\hat{H}/\partial\hat{q}$, $\partial\hat{H}/\partial\hat{p}$, and the elements of $D\hat{\theta}^{-1}$ are all zero, so the loss (with terms $f_1$, $f_2$, $f_3$, and optionally $f_4$) is zero (resp., small). A related failure mode is one in which the input in $\theta(E)$ is collapsed by $\hat{\theta}^{-1}$ to a line or curve in $\hat{E}$.

To alleviate both of these problems, we added the new loss component $f_6$; that is, we require that the learned transformation does not collapse the input. It is sufficient for the corresponding weighting factor $c_6$ to be a very small nonzero value (e.g., $10^{-6}$). The addition of $f_6$ to our loss helps us avoid falling early in training into the unrecoverable local minimum described above and also helps keep the scale of the transformed variables $\hat{q}$ and $\hat{p}$ macroscopic.
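A sketch of how the composite loss of Eq. (14) can be evaluated, with the Jacobian of the encoder obtained by automatic differentiation, is given below; the function enc stands for $\hat{\theta}^{-1}$, dec for $\hat{\theta}$, and the weighting values are illustrative:

```python
import tensorflow as tf

def phase_space_loss(H_hat, enc, dec, xy, xy_dot, xy0, H0,
                     c=(1.0, 10.0, 1.0, 0.0, 1.0, 1e-6)):
    """Loss of Eq. (14); xy and xy_dot are float tensors of shape (batch, 2)."""
    with tf.GradientTape() as tape:
        tape.watch(xy)
        qp_hat = enc(xy)                           # (batch, 2): [q_hat, p_hat]
    J = tape.batch_jacobian(qp_hat, xy)            # (batch, 2, 2): D theta_hat^{-1}
    qp_hat_dot = tf.einsum("bij,bj->bi", J, xy_dot)  # chain rule for the time derivatives

    with tf.GradientTape() as tape2:
        tape2.watch(qp_hat)
        H = H_hat(qp_hat)
    dH = tape2.gradient(H, qp_hat)                 # (batch, 2): [dH/dq_hat, dH/dp_hat]

    f1 = tf.reduce_mean((dH[:, 1] - qp_hat_dot[:, 0]) ** 2)
    f2 = tf.reduce_mean((dH[:, 0] + qp_hat_dot[:, 1]) ** 2)
    f3 = tf.reduce_mean((H_hat(enc(xy0)) - H0) ** 2)
    f4 = tf.reduce_mean(tf.reduce_sum(dH * qp_hat_dot, axis=1) ** 2)
    f5 = tf.reduce_mean(tf.reduce_sum((dec(enc(xy)) - xy) ** 2, axis=1))
    f6 = tf.reduce_mean(1.0 / tf.square(tf.linalg.det(J)))   # anti-collapse term
    return sum(ck * fk for ck, fk in zip(c, (f1, f2, f3, f4, f5, f6)))
```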

We generate data from a rectangular region $x \in [-1,1]$, $y \in [-1,1]$ and then transform the region linearly with $\theta^{-1}(x,y) = A^{-1}[x,y]^T = [q,p]^T$. The matrix $A^{-1}$ is the inverse of $A = R\Lambda$: a scaling followed by a rotation, where $\Lambda = \begin{pmatrix}\lambda_1 & 0 \\ 0 & \lambda_2\end{pmatrix}$, $\lambda_1 = 1$, $\lambda_2 = 64$, $R = \begin{pmatrix}\cos\rho & -\sin\rho \\ \sin\rho & \cos\rho\end{pmatrix}$, and $\rho = 5^\circ$. Our observation data $\theta(E)$ are thus given by $[x,y]^T = A[q,p]^T$. Using the true Hamiltonian $H(q,p) = p^2/2 + (1 - \cos q)$, we additionally compute true values for $dq/dt$ and $dp/dt$ and then use $A$ to propagate these to $x$ and $y$ via $\frac{dx}{dt} = \frac{\partial x}{\partial q}\frac{dq}{dt} + \frac{\partial x}{\partial p}\frac{dp}{dt}$ (and similarly for $y$), where the partial derivatives are computed analytically (here, they are just the elements of $A$ itself).
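The data generation for this example can be summarized as follows; this is a sketch with our own variable names, using the matrix parameters as stated above:

```python
import numpy as np

# Linear observation map of Sec. IV B: [x, y]^T = A [q, p]^T with A = R(rho) * Lambda.
rho = np.deg2rad(5.0)
Lam = np.diag([1.0, 64.0])
R = np.array([[np.cos(rho), -np.sin(rho)],
              [np.sin(rho),  np.cos(rho)]])
A = R @ Lam

rng = np.random.default_rng(1)
xy = rng.uniform(-1.0, 1.0, size=(2000, 2))               # observations (x, y)
qp = (np.linalg.inv(A) @ xy.T).T                          # hidden (q, p), used only for derivatives
qp_dot = np.column_stack([qp[:, 1], -np.sin(qp[:, 0])])   # (qdot, pdot) from the true H
xy_dot = (A @ qp_dot.T).T                                 # propagate the derivatives to (x, y)
```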

Our network is then presented with the observation data $x, y$ and their corresponding time derivatives. Its task is to learn $\hat{A}$ and $\hat{A}^{-1}$, which convert to and from variables $\hat{q}, \hat{p}$ (symplectomorphic to the original $q, p$), and a Hamiltonian $\hat{H}$ in this new space. When evaluating the loss, the time derivatives of $\hat{q}$ and $\hat{p}$ are likewise computed via automatic differentiation using the chain rule through the learned transformation, e.g., $\frac{d\hat{q}}{dt} = \frac{\partial\hat{q}}{\partial x}\frac{dx}{dt} + \frac{\partial\hat{q}}{\partial y}\frac{dy}{dt}$. Note also that $[\hat{q},\hat{p}]^T = \hat{A}^{-1}A[q,p]^T$, so if the original space $E$ could be found, $\hat{A}$ would satisfy $\hat{A}^{-1}A = I$. This cannot be expected given only the data in $\theta(E)$; we can only be sure that $\hat{A}^{-1}A$ approximates a symplectomorphism of the original $E$.

We could learn $\hat{\theta}^{-1}$ from a general class of nonlinear functions, such as a small $\tanh$ neural network, but here we simply learn $\hat{A}$ and $\hat{A}^{-1}$ as linear transformations (that is, we have a linear “neural network” with $L = 0$ in Eq. (11) and $b_1$ fixed at 0). As we include the reconstruction error of this autoencoder in our loss function, $\hat{A}^{-1}$ is constrained to be the inverse of $\hat{A}$ to a precision no worse than that of the $f_1$ and $f_2$ terms in Eq. (14), after all three are scaled by their corresponding $c_k$ values. In fact, for the linear case, the autoencoder's contribution to the loss is initially much lower than that of the Hamiltonian terms (see Fig. 3), but, as training proceeds and the $f_1$ and $f_2$ terms improve (at the initial expense of raising the autoencoder loss), larger reductions in loss become possible by optimizing $\hat{H}$ rather than $\hat{\theta}$, so $f_5$ decreases as quickly as (the larger of) $f_1$ or $f_2$. That is, the autoencoder portion of the loss quickly falls to a level where it no longer contributes appreciably to the total loss, given its weighting in the loss sum.

FIG. 3.

Learned $\hat{H}(\hat{q},\hat{p})$ after a linear transformation $\theta$. (a) (Left) The true function $H(q,p)$ is evaluated on a grid of $q, p$ values. (a) (Right) The learned function on $\hat{q}, \hat{p}$ is pulled back onto the original $q, p$ space. (b) Linear symplectomorphism $S = \hat{\theta}^{-1}\circ\theta$ between the original and the learned space.

We find that the learned symplectomorphism $S(q,p) = \hat{A}^{-1}A[q,p]^T$, whose $q$ component is depicted in Fig. 3, preserves $q$ unmixed with $p$ in one or the other of its two discovered coordinates. This is possible because both (a) $(q,p) \mapsto (p,-q)$ and (b) $(q,p) \mapsto (q, p + f(q))$ for any smooth function $f$ are symplectomorphisms. They are special because $H(q,p) = \hat{H}(\hat{q}(q,p), \hat{p}(q,p))$; i.e., they even preserve the Hamiltonian formulation. For the map (a), the transformation of the Hamiltonian can be seen from the following derivation:

\dot{\hat{q}} = \dot{p} = -\partial H/\partial q = \partial\hat{H}/\partial\hat{p},
(15)
\dot{\hat{p}} = -\dot{q} = -\partial H/\partial p = -\partial\hat{H}/\partial\hat{q},
(16)
\frac{\partial\hat{H}}{\partial q} = \frac{\partial\hat{H}}{\partial\hat{q}}\underbrace{\frac{\partial\hat{q}}{\partial q}}_{=0} + \frac{\partial\hat{H}}{\partial\hat{p}}\underbrace{\frac{\partial\hat{p}}{\partial q}}_{=-1} = -\frac{\partial\hat{H}}{\partial\hat{p}} = \frac{\partial H}{\partial q},
(17)
\frac{\partial\hat{H}}{\partial p} = \frac{\partial\hat{H}}{\partial\hat{q}}\underbrace{\frac{\partial\hat{q}}{\partial p}}_{=1} + \frac{\partial\hat{H}}{\partial\hat{p}}\underbrace{\frac{\partial\hat{p}}{\partial p}}_{=0} = \frac{\partial\hat{H}}{\partial\hat{q}} = \frac{\partial H}{\partial p}.
(18)

Here, the first equality in each of Eqs. (15) and (16) follows from the map, and the last equality follows from the requirement that $\hat{q}, \hat{p}$ obey Hamiltonian dynamics with respect to the new Hamiltonian $\hat{H}$. Equations (17) and (18) then show that the new Hamiltonian is the same as the old one (modulo an additive constant) when considered as a function of the old coordinates.

In addition to the linear $\theta$ of Sec. IV B, we show comparable results for a nonlinear transformation $\theta$ and learned $\hat{\theta}$. Specifically, we transform the data through $(x,y) = \theta(q,p)$, where

a = q/20, \quad b = p/10, \quad x = a + (b + a^2)^2, \quad y = b + a^2,
(19)

the inverse of which is given by $q = 20\,(x - y^2)$ and $p = 10\,(y - x^2 + 2xy^2 - y^4)$. We use the analytical Jacobian of this $\theta$ to compute the $\dot{x}$ and $\dot{y}$ needed as input to our network.
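A sketch of this transformation and of the propagation of the time derivatives through its analytical Jacobian (the function names are our own):

```python
import numpy as np

def theta(q, p):
    """Nonlinear observation map of Eq. (19)."""
    a, b = q / 20.0, p / 10.0
    y = b + a**2
    x = a + y**2
    return x, y

def theta_jacobian(q, p):
    """Analytical Jacobian d(x, y)/d(q, p), used to push (qdot, pdot) forward."""
    a = q / 20.0
    y = p / 10.0 + a**2
    dx_dq = 1.0 / 20.0 + 2.0 * y * (2.0 * a / 20.0)   # dx/dq = da/dq + 2y * dy/dq
    dx_dp = 2.0 * y / 10.0                            # dx/dp = 2y * dy/dp
    dy_dq = 2.0 * a / 20.0
    dy_dp = 1.0 / 10.0
    return np.array([[dx_dq, dx_dp], [dy_dq, dy_dp]])

# Push the phase-space vector field forward to observation space at one point:
q, p = 1.0, 0.5
qdot, pdot = p, -np.sin(q)
xdot, ydot = theta_jacobian(q, p) @ np.array([qdot, pdot])
```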

We proceed as before, except that we no longer restrict the form of the learned $\hat{\theta}$ and $\hat{\theta}^{-1}$ to linear transformations, but instead allow small multilayer perceptrons of a form similar to that used for $\hat{H}$.

The resulting induced symplectomorphism is again one that appears to preserve an approximately monotonically increasing or decreasing $q$ in either $\hat{q}$ or $\hat{p}$. This can be seen in Fig. 4.

FIG. 4.

Results after a nonlinear transformation $\theta$. (a) (Left) The true function $H(q,p)$, pushed forward to $\hat{q}, \hat{p}$. (b) The learned function on $\hat{q}, \hat{p}$. A grid is now taken in $\hat{E}$ and transformed through $S^{-1}$ for plotting $H$ at the corresponding $q, p$ points. As is also true for the linear case, the sign of the learned $\hat{H}$ may be flipped depending on whether we learn a (monotonically increasing or decreasing) representation of $q$ via $\hat{q}$ or $\hat{p}$. Bottom: Nonlinear symplectomorphism $S = \hat{\theta}^{-1}\circ\theta$ between the original and the learned space. Again, we find that $q$ is preserved (nearly) unmixed in the discovered space $\hat{E} \ni (\hat{q},\hat{p})$. For this experiment, the encoder $\hat{\theta}^{-1}$ and decoder $\hat{\theta}$ each have three hidden layers of width five.

As a further demonstration of the method, we use a graphical rendering of the moving pendulum example from before as the transformation $\theta$ from the intrinsic state $(q,p)$ to an image $x$, our high-dimensional observable. We use a symplectic, semi-implicit Euler method,

p(\tau) = p(0) + \tau\,\dot{p}\big(q(0), p(0)\big), \qquad q(\tau) = q(0) + \tau\,\dot{q}\big(q(0), p(\tau)\big),
(20)

to generate $q(t_i), p(t_i)$ trajectories for various initial conditions and then a simple graphical renderer to display these as images [see Fig. 5(a)]. When rendering our video frames, we drag a tail of decaying intensity behind the moving pendulum head so that information about both the position $q$ and the velocity $p$ is present in each rendered frame. This is done by iterating over each $q(t_i), p(t_i)$ trajectory and, for each $t_i$, (1) multiplying the entire current image by a cooling factor of 0.96, (2) adding a constant heating amount to the image in a circle of fixed radius centered at the current $(\cos(q), \sin(q))$ point, and (3) clipping the image per pixel to lie within $[0,1]$. Samples of the resulting images are shown in Fig. 5(a).
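A sketch of the integrator of Eq. (20) and of the rendering loop follows; the image size, disk radius, and heating amount are illustrative choices (only the cooling factor 0.96 is specified above):

```python
import numpy as np

def symplectic_euler(q0, p0, tau, n_steps):
    """Semi-implicit Euler of Eq. (20) for the pendulum: update p first, then q."""
    q, p = q0, p0
    traj = [(q, p)]
    for _ in range(n_steps):
        p = p + tau * (-np.sin(q))   # p(tau) = p(0) + tau * pdot(q(0), p(0))
        q = q + tau * p              # q(tau) = q(0) + tau * qdot(q(0), p(tau))
        traj.append((q, p))
    return np.array(traj)

def render_frames(traj, size=28, radius=2, cooling=0.96, heat=1.0):
    """Render each state as a monochrome image with a decaying tail behind the head."""
    img = np.zeros((size, size))
    ii, jj = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    frames = []
    for q, _ in traj:
        img *= cooling                                     # (1) cool the previous frame
        cx = (size - 1) * (1 + np.cos(q)) / 2              # head position depends on q only
        cy = (size - 1) * (1 + np.sin(q)) / 2
        img[(ii - cx) ** 2 + (jj - cy) ** 2 <= radius**2] += heat   # (2) heat a small disk
        img = np.clip(img, 0.0, 1.0)                       # (3) clip per pixel to [0, 1]
        frames.append(img.copy())
    return np.array(frames)

frames = render_frames(symplectic_euler(1.0, 0.0, 0.05, 200))
```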

FIG. 5.

(a) Principal component analysis (PCA) autoencoder reconstructions. The autoencoder is trained to reproduce PCA projections of the monochrome images on the left, where velocity ($p$) information is available in the length of the tail dragged behind the moving pendulum head. The column “Approx” shows these reconstructions passed back through the approximate inverse of the PCA projection. Note that only the magnitude of $p$ is preserved, which could be predicted from the fact that only the square of $p$ is required for computing $H$. (b) Learned Hamiltonian. For each of the images $x$ in our dataset, we have an associated known $(q,p)$ pair. Here, we plot these values, colored by the learned $\hat{H}(x)$. For this experiment, the encoder $\hat{\theta}^{-1}$ has two hidden layers of widths six and four, respectively, and the decoder $\hat{\theta}$ mirrors this structure.

Though we do not use $p(t_i)$ directly in this procedure, its value is observable in the length of the resulting tail dragged behind the moving pendulum head. We create trajectories long enough that the effect of the initial formation of the tail (during which its length is not necessarily a good indicator of $p$) is no longer visible and then use only the final two observations of each trajectory.

In order to make the approach agnostic to the data, we do not want to assume that the space $\hat{E}$ is periodic, so instead we use a four-dimensional phase space with elements $\hat{z} = [\hat{q}_1, \hat{q}_2, \hat{p}_1, \hat{p}_2] = [\hat{q}, \hat{p}]$ and consider the splitting into $(\hat{q}_1, \hat{q}_2)$ and $(\hat{p}_1, \hat{p}_2)$ during training. In the space of input images, the data manifold does not fill up four-dimensional space but a cylinder, which is mapped to the four-dimensional encoding layer by the autoencoder.

In addition, to simplify the learning problem, we learn $\hat{\theta}^{-1}$ as the combination of a projection onto the first 20 principal components of the training dataset followed by a dense autoencoder, reserving learning $\hat{\theta}^{-1}$ as an end-to-end convolutional autoencoder for future work. The encoding provides $\hat{z}$, and, as before, we learn $\hat{\theta}^{-1}$ in tandem with $\hat{H}$, where now the conditions of Eq. (14) are upgraded to their vector equivalents to accommodate $\hat{z}$.
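A sketch of this two-stage encoder (PCA projection followed by a small dense autoencoder, reusing the rendered frames from the sketch above) is given below; the PCA dimension and the layer widths follow the text and the caption of Fig. 5, while the remaining details are illustrative:

```python
import numpy as np
import tensorflow as tf

# PCA projection onto the leading 20 components, fit on the flattened training frames.
frames_flat = frames.reshape(len(frames), -1).astype("float32")
mean = frames_flat.mean(axis=0)
_, _, Vt = np.linalg.svd(frames_flat - mean, full_matrices=False)
P = Vt[:20].T                                     # (n_pixels, 20) projection matrix
pca_codes = (frames_flat - mean) @ P              # 20-dimensional inputs to the autoencoder

# Dense autoencoder producing the 4D encoding z_hat = [q_hat1, q_hat2, p_hat1, p_hat2].
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(6, activation="tanh"),
    tf.keras.layers.Dense(4, activation="tanh"),
    tf.keras.layers.Dense(4),                     # linear bottleneck: z_hat
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(4, activation="tanh"),
    tf.keras.layers.Dense(6, activation="tanh"),
    tf.keras.layers.Dense(20),                    # back to the 20 PCA coefficients
])
```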

In Sec. IV A, we added a loss term proportional to the reciprocal of the determinant of the transformation's Jacobian in order to avoid transformations that collapse the phase space. Here, this was not such an issue; of course, some collapsing of the high-dimensional representation is required. Instead, a common failure mode turned out to be learning a constant $\hat{H}$ function, which automatically satisfies the Hamiltonian requirements (a constant is trivially a conserved quantity). To avoid this, we considered several possible ways to promote a nonflat $\hat{H}$ function, ultimately settling on (a) adding a term that encourages the standard deviation of the $\hat{H}$ values over a batch to be nonzero and (b) minimizing not just the mean squared error in our $f_1$ and $f_2$ terms, but also the maximum squared error, to avoid trivial or bilevel $\hat{H}(q,p)$ functions. The spherical Gaussian prior used in training variational autoencoders34 also avoids, as a by-product, learning constant functions.
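One possible form of these two modifications is sketched below; the exact expressions we used are not given above, so the specific penalty shapes should be read as illustrative:

```python
import tensorflow as tf

def nonflat_penalties(H_values, res1, res2, c_std=1.0):
    """Two modifications discussed above: (a) keep the batch standard deviation of
    H_hat away from zero, and (b) penalize the worst-case f1/f2 residuals in
    addition to their means. The functional forms here are illustrative."""
    std_term = c_std / (tf.math.reduce_std(H_values) + 1e-6)       # (a) discourage a flat H_hat
    f1 = tf.reduce_mean(res1 ** 2) + tf.reduce_max(res1 ** 2)      # (b) mean + max squared error
    f2 = tf.reduce_mean(res2 ** 2) + tf.reduce_max(res2 ** 2)
    return std_term + f1 + f2
```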

The result, shown in Fig. 5(b), was a pulled-back $\hat{H}(q,p)$ function that at least in broad strokes resembles the truth and that satisfies $d\hat{H}/dt \approx 0$ (typically about $10^{-2}$).

We described an approach to approximate Hamiltonian systems from observation data. It is a completely data-driven pipeline to (a) construct an appropriate phase space and (b) approximate a Hamiltonian function on the new phase space, from nonlinear, possibly high-dimensional observations (here, movies).

When only transformations of the original Hamiltonian phase space can be observed, it is only possible to recover a symplectomorphic copy of the original phase space, with additional freedom in a constant scaling of the coordinates and an additive constant in the Hamiltonian function. If no additional information about the original space is available, this is a fundamental limitation and not particular to our approach. It is not even necessary that an “original phase space” exists at all, and so the resulting symplectic phase space is by no means unique. In fact, Darboux's theorem implies the opposite:35,36 there is no local obstruction to a symplectic structure, so all symplectic manifolds of dimension $2n$ are locally symplectomorphic to the standard flat space $\mathbb{R}^{2n}$. The choice of a single such space has to rely on other factors, such as, possibly, interpretability by humans or simplicity of the equations.

The approach may be extended to time-dependent Hamiltonian functions. This would allow us to cope with certain dissipative systems.16 An even broader extension may allow transformations to arbitrary normal forms as the “target vector field” and thus would not be constrained to Hamiltonian systems. In the general case, it will become important to explore whether the transformation we approximate remains bounded over our data or whether it starts showing signs of approaching a singularity, suggesting that the problem may not be solvable.

This work was funded by the US Army Research Office (ARO) through a Multidisciplinary University Research Initiative (MURI) and by the Defense Advanced Research Projects Agency (DARPA) through their Physics of Artificial Intelligence (PAI) program, and through agreement HR001118C0100 (I.M.).

1. A. M. Almeida, Hamiltonian Systems: Chaos and Quantization (Cambridge University Press, 1992).
2. A. L. Caterini, A. Doucet, and D. Sejdinovic, “Hamiltonian variational auto-encoder,” in Proceedings of the 32nd Conference on Neural Information Processing Systems (Curran Associates, Inc., 2018), p. 11.
3. B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert, “Multi-level residual networks from dynamical systems view,” in International Conference on Learning Representations (2018).
4. R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural ordinary differential equations,” in Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc., 2018), Vol. 31, pp. 6571–6583.
5. W. E, “A proposal on machine learning via dynamical systems,” Commun. Math. Stat. 5(1), 1–11 (2017).
6. R. González-García, R. Rico-Martínez, and I. G. Kevrekidis, “Identification of distributed parameter systems: A neural net based approach,” Comput. Chem. Eng. 22, S965–S968 (1998).
7. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
8. K. Greff, R. K. Srivastava, and J. Schmidhuber, “Highway and residual networks learn unrolled iterative estimation,” in Proceedings of the International Conference on Learning Representations (2017).
9. S. Greydanus, M. Dzamba, and J. Yosinski, “Hamiltonian neural networks,” e-print arXiv:1906.01563 (2019).
10. W. R. Hamilton, “On a general method in dynamics,” Philos. Trans. R. Soc. II 1834, 247–308.
11. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), Vol. 7, pp. 171–180.
12. E. Kaiser, J. Nathan Kutz, and S. L. Brunton, “Discovering conservation laws from data for control,” in 2018 IEEE Conference on Decision and Control (CDC) (IEEE, 2018).
13. M. Livio, “Why symmetry matters,” Nature 490(7421), 472–473 (2012).
14. Y. Lu, A. Zhong, Q. Li, and B. Dong, “Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations,” in Proceedings of the 35th International Conference on Machine Learning, edited by J. Dy and A. Krause (PMLR, Stockholm, Sweden, 2018), p. 10.
15. B. Lusch, J. Nathan Kutz, and S. L. Brunton, “Deep learning for universal linear embeddings of nonlinear dynamics,” Nat. Commun. 9(1), 4950 (2018).
16. K. T. McDonald, “Hamiltonian with z as the independent variable,” Technical Report (2015).
17. R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi, “Newtonian image understanding: Unfolding the dynamics of objects in static images,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, Las Vegas, NV, 2016), pp. 3521–3529.
18. R. Neal, “MCMC using Hamiltonian dynamics,” in Handbook of Markov Chain Monte Carlo, edited by S. Brooks, A. Gelman, G. Jones, and X.-L. Meng (Chapman and Hall/CRC, 2011).
19. E. Noether, “Invariant variation problems,” Transp. Theory Stat. Phys. 1(3), 186–207 (1971).
20. R. Kondor, Z. Lin, and S. Trivedi, “Clebsch-Gordan nets: A fully Fourier space spherical convolutional neural network,” in Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc., 2018), Vol. 31, pp. 10117–10126.
21. M. Raissi and G. E. Karniadakis, “Hidden physics models: Machine learning of nonlinear partial differential equations,” J. Comput. Phys. 357, 125–141 (2018).
22. M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Inferring solutions of differential equations using noisy multi-fidelity data,” J. Comput. Phys. 335, 736–746 (2017).
23. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) (The MIT Press, 2005).
24. D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proceedings of the 32nd International Conference on Machine Learning (PMLR, 2015), p. 9.
25. R. Rico-Martínez, R. A. Adomaitis, and I. G. Kevrekidis, “Noninvertibility in neural networks,” in Proceedings of the 1993 IEEE International Conference on Neural Networks (IEEE, 1993), pp. 382–386.
26. R. Rico-Martínez, R. A. Adomaitis, and I. G. Kevrekidis, “Noninvertibility in neural networks,” Comput. Chem. Eng. 24(11), 2417–2433 (2000).
27. R. Rico-Martínez, J. S. Anderson, and I. G. Kevrekidis, “Continuous-time nonlinear signal processing: A neural network based approach for gray box identification,” in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (IEEE, 1994).
28. R. Rico-Martínez and I. G. Kevrekidis, “Nonlinear system identification using neural networks: Dynamics and instabilities,” in Neural Networks for Chemical Engineers, edited by A. B. Bulsari (Elsevier, 1995), pp. 409–442.
29. R. Rico-Martínez, I. G. Kevrekidis, and R. A. Adomaitis, “Noninvertible dynamics in neural network models,” in Proceedings of the Twenty-Eighth Annual Conference on Information Sciences and Systems (Johns Hopkins University, 1994), pp. 965–969.
30. R. Rico-Martínez, I. G. Kevrekidis, M. C. Kube, and J. L. Hudson, “Discrete- vs continuous-time nonlinear signal processing: Attractors, transitions and parallel implementation issues,” in American Control Conference (IEEE, 1993), pp. 1475–1479.
31. R. Rico-Martínez, K. Krischer, I. G. Kevrekidis, M. C. Kube, and J. L. Hudson, “Discrete- vs continuous-time nonlinear signal processing of Cu electrodissolution data,” Chem. Eng. Commun. 118(1), 25–48 (1992).
32. M. Schmidt and H. Lipson, “Distilling free-form natural laws from experimental data,” Science 324(5923), 81–85 (2009).
33. R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” in Proceedings of the International Conference on Machine Learning (PMLR, 2015).
34. P. Toth, D. J. Rezende, A. Jaegle, S. Racanière, A. Botev, and I. Higgins, “Hamiltonian generative networks,” e-print arXiv:1909.13789 (2019).
35. J. M. Lee, Introduction to Smooth Manifolds, Graduate Texts in Mathematics (Springer, New York, 2012).
36. G. Darboux, “Sur le problème de Pfaff,” Bull. Sci. Math. Astron., 2e série, 6, 14–36 (Gauthier-Villars, 1882).