Concise, accurate descriptions of physical systems through their conserved quantities abound in the natural sciences. In data science, however, current research often focuses on regression problems, without routinely incorporating additional assumptions about the system that generated the data. Here, we propose to explore a particular type of underlying structure in the data: Hamiltonian systems, where an “energy” is conserved. Given a collection of observations of such a Hamiltonian system over time, we extract phase space coordinates and a Hamiltonian function of them that acts as the generator of the system dynamics. The approach employs an autoencoder neural network component to estimate the transformation from observations to the phase space of a Hamiltonian system. An additional neural network component is used to approximate the Hamiltonian function on this constructed space, and the two components are trained jointly. As an alternative approach, we also demonstrate the use of Gaussian processes for the estimation of such a Hamiltonian. After two illustrative examples, we extract an underlying phase space as well as the generating Hamiltonian from a collection of movies of a pendulum. The approach is fully data-driven and does not assume a particular form of the Hamiltonian function.

Neural network-based methods for modeling dynamical systems are again becoming widely used, and methods that explicitly learn the physical laws underlying continuous observations in time constitute a growing subfield. Our work contributes to this thread of research by incorporating additional information into the learned model, namely, the knowledge that the data arise as observations of an underlying Hamiltonian system.

We use machine learning to extract models of systems whose dynamics conserve a particular quantity (the Hamiltonian). We train several neural networks to approximate the total energy function for a pendulum, in both its natural action-angle form and also as seen through several distorting observation functions of increasing complexity. A key component of the approach is the use of automatic differentiation of the neural network in formulating the loss function that is minimized during training.

Our method requires data providing the first and second time derivatives of the observations across the regions of interest in state space or, alternatively, sufficient information (such as a sequence of delayed measurements) from which to estimate these. We include examples in which the observation function is nonlinear and high-dimensional.

## I. INTRODUCTION

Current data science exploration of dynamics often focuses on regression or classification problems, without routinely incorporating additional assumptions about the nature of the system that generated the data. This has started to change recently, with approaches that extract generic dynamical systems from data while specifying the variables and possible expressions for the formulas beforehand.^{32} In particular, treating the central object to be modeled as the discrete-time flow map $\Phi_\tau(x) = x + \int_0^\tau f(x(t))\,dt$ but learning $\Phi_\tau$ directly as a black box may result in qualitative differences from the true system.^{31} Using the associated differential equation $dx/dt = f(x)$ instead allows us to exploit established numerical integration schemes to help approximate the flow map; this can be accomplished with a neural network by structuring the loss function similarly to such classical numerical integration schemes. In addition to our older work along these lines,^{6,25–31} more recent work has revived this approach, focused on treating the layers of a deep neural network as iterates of a dynamical system, where “learning” consists of discovering the right attractors.^{3,5,14}

In particular, interest has focused on how residual networks^{11} and highway networks^{33} can be interpreted as iterative solvers^{8} or as iterated dynamical systems.^{4} In the latter paper (a NeurIPS 2018 best paper awardee), the authors choose not to unroll the iteration in time explicitly, but instead use continuous-time numerical integration. While the focus was on the dynamics-of-layers concept, time series learning was also performed.

The Koopman operator has also been employed in combination with neural networks to extract conservation laws and special group structures.^{12,15} Symmetries in relation to conserved quantities are a well-studied problem in physics.^{10,13,19,20} A recent thread of research consists of learning physics models from observation data,^{17} including modeling discrete-time data as observations of a continuous-time dynamical system.^{21,22}

It is informative to study physical systems through their conserved quantities, such as total energy, which can be encoded in a Hamiltonian function.^{1,10} The measure-preserving property of Hamiltonian systems has recently been exploited for transport of densities in Markov-chain Monte-Carlo methods^{18} and variational autoencoders.^{2,24} For our purposes, a natural progression of the thread of computational modeling of physics from observations is to represent the Hamiltonian function directly.

Concurrently with this submission, two papers independently addressing similar issues appeared as preprints.^{9,34} In the first of these,^{9} a loss function very similar to parts of our Eq. (12) was used. The second paper focuses on the transformation of densities generated through the Hamiltonian. It is possible to draw some analogies between the second preprint and the old (non-Hamiltonian) work we mentioned above,^{6,25–31} as this newer work also used rollout with a time step templated on a classical numerical integration method (here, symplectic Euler and leapfrog). Both papers also use the pendulum as an example, emphasizing conditions where the system can be well approximated by a linear system: a thin annulus around the stable steady state at $(q,p)=(0,0)$ where the trajectories are nearly circular.

The remainder of the paper is structured as follows:

We derive data-driven approximations (through two approaches: Gaussian processes and neural networks) of a Hamiltonian function on a given phase space, from time series data. The Hamiltonian functions we consider do not need to be separable as a sum $H(q,p)=T(p)+V(q)$, and in our illustrative example, we always work in the fully nonlinear regime of the pendulum.

We build data-driven reconstructions of a phase space from (a) linear and (b) nonlinear, nonsymplectic transformations of the original Hamiltonian phase space. The reconstruction then leads to a symplectomorphic copy of the original Hamiltonian system.

We construct a completely data-driven pipeline combining (a) the construction of an appropriate phase space and (b) the approximation of a Hamiltonian function on this new phase space, from nonlinear, high-dimensional observations (e.g., from movies/sequences of movie snapshots).

## II. GENERAL DESCRIPTION

A Hamiltonian system on Euclidean space $E = \mathbb{R}^{2n}$, $n \in \mathbb{N}$, is determined through a function $H: E \to \mathbb{R}$ that defines the equations

$$\dot{q} = \frac{\partial H}{\partial p}(q, p), \qquad (1)$$

$$\dot{p} = -\frac{\partial H}{\partial q}(q, p), \qquad (2)$$

where $\dot{(\cdot)} := d/dt$ and $q(t), p(t) \in \mathbb{R}^n$ are interpreted as “position” and “momentum” coordinates in the “phase space” $E$. In many mechanical systems, and in all examples we discuss in this paper, the interpretation of the coordinates $q, p$ is reflected in the dynamics through $\dot{q} = p$, i.e., $H(q,p) = \frac{1}{2}p^2 + h(q)$ for some function $h: \mathbb{R}^n \to \mathbb{R}$. In general, Eqs. (1) and (2) imply that the Hamiltonian is constant along trajectories $(q(t), p(t))$ because

$$\frac{d}{dt}H(q,p) = \frac{\partial H}{\partial q}\dot{q} + \frac{\partial H}{\partial p}\dot{p} = \frac{\partial H}{\partial q}\frac{\partial H}{\partial p} - \frac{\partial H}{\partial p}\frac{\partial H}{\partial q} = 0. \qquad (3)$$

Equations (1) and (2) can be restated as a partial differential equation for $H$ at every $(q,p) \in E$,

$$\nu(q,p) = \begin{bmatrix} 0 & I \\ -I & 0 \end{bmatrix} \nabla H(q,p) =: \omega\, \nabla H(q,p), \qquad (4)$$

where $I \in \mathbb{R}^{n \times n}$ is the identity matrix and $\nu$ is the vector field on $E$ [the left-hand side of Eqs. (1) and (2)], which only depends on the state $(q,p)$. The symplectic form on the given Euclidean space takes the form of the matrix $\omega$.

In Sec. III, we discuss how to approximate the function $H$ from given data points $D = \{(q_i, \dot{q}_i, \ddot{q}_i)\}_{i=1}^{N}$. This involves solving the partial differential equation (4) for $H$. Since these equations determine $H$ only up to an additive constant, we assume that we also know the value $H_0 = H(q_0, p_0)$ of $H$ at a single point $(q_0, p_0)$ in the phase space. This is not a major restriction for the approach because $H_0$ as well as $(q_0, p_0)$ can be chosen arbitrarily.

## III. EXAMPLE: THE NONLINEAR PENDULUM

As an example, consider the case $n = 1$ and the Hamiltonian

$$H(q,p) = \frac{1}{2}p^2 + (1 - \cos q). \qquad (5)$$

This Hamiltonian forms the basis for the differential equations of the nonlinear pendulum, $\ddot{q} = -\sin(q)$, or, in first-order form, $\dot{q} = \partial H(q,p)/\partial p = p$ and $\dot{p} = -\partial H(q,p)/\partial q = -\sin(q)$. In this section, we numerically solve the partial differential equation (PDE) (4) by approximating the solution $H$ using two approaches: Gaussian processes^{23} (Sec. III A) and neural networks (Sec. III B).
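Hamilton's equations and the conservation property of Eq. (3) can be checked numerically for this pendulum. The following NumPy sketch (function names and the sampling are ours, not part of the paper's pipeline) verifies that $\dot{H}$ vanishes along the vector field:

```python
import numpy as np

def H(q, p):
    # Pendulum Hamiltonian, Eq. (5)
    return 0.5 * p**2 + (1.0 - np.cos(q))

def vector_field(q, p):
    # Hamilton's equations (1)-(2): qdot = dH/dp, pdot = -dH/dq
    return p, -np.sin(q)

# Check Eq. (3): dH/dt = (dH/dq) qdot + (dH/dp) pdot = 0 along the flow.
rng = np.random.default_rng(0)
q = rng.uniform(-np.pi, np.pi, 100)
p = rng.uniform(-2.0, 2.0, 100)
qdot, pdot = vector_field(q, p)
H_q, H_p = np.sin(q), p          # analytic gradient of H
dH_dt = H_q * qdot + H_p * pdot
print(np.max(np.abs(dH_dt)))
```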

### A. Approximation using Gaussian processes

We model the solution $H$ as a Gaussian process $\hat{H}$ with a Gaussian covariance kernel,

$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\epsilon^2}\right), \qquad (6)$$

where $x$ and $x'$ are points in the phase space, i.e., $x = (q,p)$, $x' = (q',p')$, and $\epsilon \in \mathbb{R}^+$ is the kernel bandwidth parameter (we chose $\epsilon = 2$ in this paper). Given a collection $X$ of $N$ points in the phase space, as well as the function values $H(X)$ at all points in $X$, the conditional expectation of the Gaussian process $\hat{H}$ at a new point $y$ is

$$\hat{H}(y) = k(y, X)^T\, k(X, X')^{-1}\, H(X), \qquad (7)$$

where we write $k(X, X')_{i,j} := k(x_i, x_j)$ for the kernel matrix evaluated over all $x$ values in the given dataset $X$. In Eq. (7), the dimensions of the symbols are $y \in \mathbb{R}^{2n}$, $k(X,X') \in \mathbb{R}^{N \times N}$, $k(y,X) \in \mathbb{R}^N$, and $H(X) := (H(x_1), H(x_2), \dots, H(x_N)) \in \mathbb{R}^N$. All vectors are column vectors. Estimates of the solution $H$ to the PDE at new points depend on the value of $H$ over the entire dataset. We do not know the values of $H$, but differentiating Eq. (7) allows us to set up a system of equations to estimate $H$ at an arbitrary number $M$ of new points $y$ close to the points $x \in X$, by using the information about the derivatives of $H$ given by the time derivatives $\dot{q}$ in Eq. (1) and (the negative of) $\dot{p}$ in Eq. (2). Let $g(x_i) \in \mathbb{R}^{2n}$ be the gradient of $H$ at a point $x_i$ in the phase space, which we know from data $(\dot{q}, \dot{p})$. Then,

$$g(x_i) = \frac{\partial}{\partial x} k(x_i, Y)\, k(Y, Y')^{-1}\, H(Y). \qquad (8)$$

Together with an arbitrary pinning term at a point $x_0$, the list of known derivatives leads to a linear system of $2nN + 1$ equations, where we write

$$\frac{\partial}{\partial x} k(x_i, Y) := \left[\frac{\partial}{\partial x} k(x, y_1)\Big|_{x=x_i}, \dots, \frac{\partial}{\partial x} k(x, y_M)\Big|_{x=x_i}\right] \in \mathbb{R}^{2n \times M} \qquad (9)$$

for the derivative of the given kernel function $k$ with respect to its first argument, evaluated at a given point $x_i$ in the first argument and all points in the dataset $Y$ in the second argument. For each $x_i$, we thus have a (column) vector of $M$ derivative evaluations, which is transposed and stacked into a large matrix to form the full linear system

$$\begin{bmatrix} \frac{\partial}{\partial x} k(x_1, Y) \\ \vdots \\ \frac{\partial}{\partial x} k(x_N, Y) \\ k(x_0, Y)^T \end{bmatrix} k(Y, Y')^{-1}\, H(Y) = \begin{bmatrix} g(x_1) \\ \vdots \\ g(x_N) \\ H_0 \end{bmatrix}. \qquad (10)$$

We reiterate: $X$ is a dataset of $N$ points where we know the derivatives of $H$ through $g(x_i) = \left(\frac{\partial H}{\partial q}(x_i), \frac{\partial H}{\partial p}(x_i)\right)^T = (-\dot{p}_i, \dot{q}_i)^T \in \mathbb{R}^{2n}$. We evaluate on a fine grid $Y$ of $M$ points [such that $k(Y,Y') \in \mathbb{R}^{M \times M}$ and $\frac{\partial}{\partial x} k(x_i, Y) \in \mathbb{R}^{2n \times M}$] and have the information $g(X) \in \mathbb{R}^{2nN}$ on a relatively small set of $N$ points called $X$ [black dots in Fig. 1(b)]. The derivative of the Gaussian process can be stated using the derivative of the kernel $k$ with respect to its first argument. The matrix inverse of $k(Y,Y')$ is approximated by the inverse of $(k(Y,Y') + \sigma^2 I)$, that is, through Tikhonov regularization with parameter $\sigma = 10^{-5}$, as is standard for Gaussian process regression. Solving this system of equations for $H(Y)$ yields the approximation for the solution to the PDE. Figure 1(b) shows the result, and Table I lists the mean squared error on the 625 training data points and on an independently drawn set of 200 validation points in the same domain, where no derivatives were available. See Ref. 22 for a more detailed discussion of the solution of PDEs with Gaussian processes.
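The derivative-informed solve described above can be sketched compactly in NumPy. Variable names, the grid resolution, and the point counts below are our own choices, and we solve for the kernel weights $\alpha$ with $H(Y) = k(Y,Y')\,\alpha$ rather than for $H(Y)$ directly, which is an equivalent formulation:

```python
import numpy as np

eps, sigma = 2.0, 1e-5

def k(A, B):
    # Gaussian kernel matrix: k(x, x') = exp(-||x - x'||^2 / eps^2)
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-d2 / eps**2)

def dk_dx(xi, Y):
    # Derivative of k(x, y_j) w.r.t. the first argument, at x = xi; shape (2, M)
    diff = xi[None, :] - Y
    return (-2.0 / eps**2) * diff.T * k(xi[None, :], Y)[0]

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(40, 2))             # points with known gradients
g = np.stack([np.sin(X[:, 0]), X[:, 1]], 1)      # grad H = (sin q, p), pendulum
grid = np.linspace(-2, 2, 12)
Y = np.array([(a, b) for a in grid for b in grid])  # evaluation grid, M points
x0, H0 = np.zeros(2), 0.0                        # pinning point, H(0,0) = 0

# Stack the derivative rows and the pinning row; with alpha = k(Y,Y')^{-1} H(Y)
# the constraints become a linear system in alpha, solved in a least-squares sense.
rows = np.vstack([dk_dx(xi, Y) for xi in X] + [k(x0[None, :], Y)])
rhs = np.concatenate([g.ravel(), [H0]])
alpha = np.linalg.lstsq(rows, rhs, rcond=None)[0]
K_YY = k(Y, Y) + sigma**2 * np.eye(len(Y))       # Tikhonov-regularized kernel matrix
H_Y = K_YY @ alpha                               # recovered Hamiltonian values on Y

H_true = 0.5 * Y[:, 1]**2 + 1.0 - np.cos(Y[:, 0])
err = H_Y - H_true
err -= err.mean()                                # remove the free additive constant
print(np.sqrt((err**2).mean()))
```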

**TABLE I.** Number of parameters, dataset sizes, and mean squared training/validation losses for the experiments in this paper.

| | No. of parameters | No. of training points | No. of validation points | Training loss | Validation loss |
|---|---|---|---|---|---|
| Section III A | Nonparametric | 625 | 200 | 2.2 × 10^{−5} | 3.5 × 10^{−5} |
| Section III B | 337 | 20 000 | 200 | 6.8 × 10^{−4} | 7.4 × 10^{−4} |
| Section IV B | 345 | 20 000 | 1000 | 5.5 × 10^{−6} | 5.0 × 10^{−6} |
| Section IV C | 511 | 19 200 | 200 | 1.4 × 10^{−4} | 1.8 × 10^{−4} |
| Section IV D | 755 | 17 489 | 200 | 0.51 | 0.54 |


### B. Approximation using an artificial neural network

Another possibility for learning the form of $H$ from data is to represent the function with an artificial neural network^{7} (ANN). We write

$$x_l = \sigma_l\left(x_{l-1} W_l + b_l\right), \quad l = 1, \dots, L+1, \qquad (11)$$

where the activation function $\sigma_l$ is nonlinear for $l = 1, \dots, L$ if $L \geq 1$ (except where otherwise indicated, we used $\tanh$) and the identity for $l = L+1$. The learnable parameters of this ANN are $\{(W_l, b_l)\}_{l=1,\dots,L+1}$, and we gather all such learnable parameters from the multiple networks that may be used in one experiment into a parameter vector $w$. If there are no hidden layers ($L = 0$), then we learn an affine transformation $x_1 = x_0 \cdot W_1 + b_1$. This format provides a surrogate function $\hat{H}(q,p) = x_{L+1}$, where the input $x_0$ is the row vector $[q, p]$. (Treating inputs as row vectors and using right-multiplication for weight matrices is convenient, as a whole batch of $N$ inputs can be presented as an $N$-by-$2$ array.) For all the experiments shown here, this network for computing $\hat{H}$ has two hidden layers of width 16.
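A minimal NumPy sketch of this feedforward architecture (the paper uses TensorFlow; the initialization scheme and names here are our own):

```python
import numpy as np

def init_mlp(sizes, rng):
    # One (W, b) pair per layer; sizes = [2, 16, 16, 1] gives L = 2 hidden layers.
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def H_hat(params, x):
    # Eq. (11): tanh on the hidden layers, identity on the output layer.
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

rng = np.random.default_rng(0)
params = init_mlp([2, 16, 16, 1], rng)
batch = rng.uniform(-1, 1, size=(5, 2))   # rows are [q, p], as in the text
print(H_hat(params, batch).shape)         # one scalar H-hat value per row
```

With sizes $[2, 16, 16, 1]$, the parameter count is $2\cdot 16 + 16 + 16\cdot 16 + 16 + 16\cdot 1 + 1 = 337$, consistent with the count reported for Sec. III B in Table I.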

Similarly, when we need to learn additional transformations $\hat{\theta}$ and $\hat{\theta}^{-1}$ (see Sec. IV), they are also represented by such networks.

We collect training data by sampling a number of initial conditions in the rectangle $(q,p) \in [-2\pi, 2\pi] \times [-6, 6]$ and simulating a short trajectory from each, retaining the final $(q,p)$ point. For each of these, we additionally evaluate $(\dot{q}, \dot{p})$. Shuffling over simulations once per epoch and dividing this dataset into batches, we then perform batchwise stochastic gradient descent to learn the parameters $w$, using the Adam optimizer on the objective function defined below.

For this paper, all neural networks were constructed and trained using TensorFlow, and the gradients necessary for evaluating the Hamiltonian loss terms in Eq. (12) were computed using TensorFlow’s core automatic differentiation capabilities.

This objective function comprises a scalar function evaluated on each data 4-tuple $d = (q, p, \dot{q}, \dot{p})$ in the batch and then averaged over the batch. This scalar function is written as

$$f(d; w) = c_1 \underbrace{\left(\frac{\partial \hat{H}}{\partial p} - \dot{q}\right)^2}_{f_1} + c_2 \underbrace{\left(\frac{\partial \hat{H}}{\partial q} + \dot{p}\right)^2}_{f_2} + c_3 \underbrace{\left(\hat{H}(q_0, p_0) - H_0\right)^2}_{f_3} + c_4 \underbrace{\left(\frac{d\hat{H}}{dt}\right)^2}_{f_4}, \qquad (12)$$

where the dependence on $w$ is through the learned Hamiltonian $\hat{H}$, the loss-term weights $c_k$ are chosen to emphasize certain properties of the problem thus posed, and the partial derivatives $\partial \hat{H}/\partial p$ and $\partial \hat{H}/\partial q$ are computed explicitly through automatic differentiation. Except for $c_2$, all $c_k$ values are set to either $1$ or $0$, depending on whether the associated loss term is to be included or excluded. Because of the square term in Eq. (5), we set $c_2$ arbitrarily to $10$ if nonzero, so that the loss is not dominated by $f_1$. An alternative might be to set $c_1$ to $1/10$.

Since Eqs. (1) and (2) together imply (3), any one of the three terms $f_1$, $f_2$, and $f_4$ could be dropped as redundant; we can therefore set $c_4$ to zero but still monitor $\dot{\hat{H}}$ as a useful sanity check on the accuracy of the learned solution. In Fig. 1(c), we show the results of this process with our default nonzero values for $c_k$.
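For concreteness, the per-sample loss can be sketched as follows, with central finite differences standing in for the automatic differentiation used in the paper (function names and the specific weights are ours). Evaluated at the true Hamiltonian on consistent data, all four terms vanish up to discretization error:

```python
import numpy as np

def H_true(q, p):
    # Pendulum Hamiltonian, Eq. (5)
    return 0.5 * p**2 + (1.0 - np.cos(q))

def grad_H(H, q, p, h=1e-5):
    # Central finite differences standing in for automatic differentiation.
    dHdq = (H(q + h, p) - H(q - h, p)) / (2 * h)
    dHdp = (H(q, p + h) - H(q, p - h)) / (2 * h)
    return dHdq, dHdp

def loss(H, d, q0=0.0, p0=0.0, H0=0.0, c=(1.0, 10.0, 1.0, 1.0)):
    q, p, qdot, pdot = d
    dHdq, dHdp = grad_H(H, q, p)
    f1 = (dHdp - qdot)**2                  # residual of Eq. (1)
    f2 = (dHdq + pdot)**2                  # residual of Eq. (2)
    f3 = (H(q0, p0) - H0)**2               # pinning term
    f4 = (dHdq * qdot + dHdp * pdot)**2    # dH/dt, the redundant sanity term
    return c[0]*f1 + c[1]*f2 + c[2]*f3 + c[3]*f4

q, p = 1.2, -0.7
d = (q, p, p, -np.sin(q))                  # (q, p, qdot, pdot) consistent with Eq. (5)
print(loss(H_true, d))
```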

As an ablation study, we explored the effect (not shown here) of removing the first, second, and fourth terms. By construction, the true $\dot{H}(q,p)$ function is zero for all $(q,p)$. Note that this is only ever achieved to any degree in the central box in the figure, where data were densely sampled. Removing $f_4$ made no visible difference in the quality of our $\dot{\hat{H}} \approx 0$ approximation, which was expected due to the redundancy in the set of Eqs. (1)–(3). However, removing either $f_1$ or $f_2$ gives poor results across the figure, despite the apparent redundancy of these terms with $f_4$. This might be due to an imbalance between the contributions of the $\dot{p}$ and $\dot{q}$ terms, for which we attempted to compensate with the unequal weighting values $c_1$ and $c_2$.

## IV. ESTIMATING HAMILTONIAN STRUCTURE FROM OBSERVATIONS

We now consider a set of observation functions $\theta: E \to \mathbb{R}^M$, $\theta = (y_1, \dots, y_M)$, with $M \geq \dim E = 2n$, such that $\theta$ is a diffeomorphism between the phase space $E$ and its image $\theta(E)$. In this setting, the notion of a *symplectomorphism* is important.^{1} In general, a symplectomorphism is a diffeomorphism that leaves the symplectic structure on a manifold invariant. In our setting, a symplectomorphism of $E = Q \times P$ maps to a deformed space $\hat{E} = \hat{Q} \times \hat{P}$ where the system dynamics in the new variables $\hat{q} \in \hat{Q}$, $\hat{p} \in \hat{P}$ is again Hamiltonian and conjugate to the original Hamiltonian dynamics. Not every diffeomorphism is a symplectomorphism, and we **do not assume** that $\theta$ is a symplectomorphism. A constant scaling of the coordinates is also impossible to distinguish from a scaling of the Hamiltonian function itself, so the recovered system will be a symplectomorphic copy of the original, scaled by an arbitrary constant.

In the setting of this section, we do not assume access to $E$, $H$, or the explicit form of $\theta$. Only a collection of points $\theta_i$ and time derivatives $\frac{d}{dt}\theta_i$ in the image $\theta(E)$ is available. We describe an approach to approximate a new map $\hat{\theta}^{-1}: \theta(E) \to \hat{E}$ into a symplectomorphic copy of $E$ through an autoencoder,^{7} such that the transformed system in $\hat{E}$ is conjugate to the original Hamiltonian system in $E$. Upon convergence, and if we had access to $\theta$, the map $\hat{\theta}^{-1} \circ \theta \equiv S: E \to \hat{E}$ would approximate a symplectomorphism, and $\theta^{-1} \circ \hat{\theta} \equiv S^{-1}$ would be its inverse. During the estimation of $\hat{\theta}^{-1}$, we simultaneously approximate the new Hamiltonian function $\hat{H}: \hat{E} \to \mathbb{R}$. Figure 2(a) visualizes the general approach, where only the information $(x,y)_i$ and $\frac{d}{dt}(x,y)_i$ is available to the procedure, while $\hat{\theta}^{-1}$ and a Hamiltonian $\hat{H}$ on $\hat{E}$ are constructed numerically.

An interesting and important feature common to the next three examples is that one cannot expect to systematically recover the original $(q,p)$ values from the given observation data $(x,y)$. Only symplectomorphic copies $(\hat{q}, \hat{p})$ can be recovered, which are enough to define a Hamiltonian. Once the coordinates $(\hat{q}, \hat{p})$ are fixed, the Hamiltonian function *in these coordinates* is unique up to an additive constant.

### A. A composite loss function for the joint learning of a transformation and a Hamiltonian

The following loss function is used to train an autoencoder component together with a Hamiltonian function approximation network component:

$$f(d; w) = \sum_{k=1}^{6} c_k f_k, \qquad
\begin{aligned}
f_1 &= \left(\partial \hat{H}/\partial \hat{p} - \dot{\hat{q}}\right)^2, & f_2 &= \left(\partial \hat{H}/\partial \hat{q} + \dot{\hat{p}}\right)^2, \\
f_3 &= \left(\hat{H}(\hat{q}_0, \hat{p}_0) - \hat{H}_0\right)^2, & f_4 &= \left(d\hat{H}/dt\right)^2, \\
f_5 &= \left\|\hat{\theta}(\hat{\theta}^{-1}(x,y)) - (x,y)\right\|^2, & f_6 &= \left|\det D\hat{\theta}^{-1}\right|^{-1},
\end{aligned} \qquad (14)$$

where the dependence on $w$ is through the learned Hamiltonian $\hat{H}$ and the learned transformations $\hat{\theta}$ and $\hat{\theta}^{-1}$, and the time derivatives in the space $\hat{E}$ are computed as $\dot{\hat{q}} = \dot{x}\frac{\partial \hat{q}}{\partial x} + \dot{y}\frac{\partial \hat{q}}{\partial y}$ and $\dot{\hat{p}} = \dot{x}\frac{\partial \hat{p}}{\partial x} + \dot{y}\frac{\partial \hat{p}}{\partial y}$, using the Jacobian $D\hat{\theta}^{-1} = \begin{bmatrix} \partial \hat{q}/\partial x & \partial \hat{q}/\partial y \\ \partial \hat{p}/\partial x & \partial \hat{p}/\partial y \end{bmatrix}$ of the transformation $\hat{\theta}^{-1}$ (computed pointwise with automatic differentiation). When we learn a (especially nonlinear) transformation $\hat{\theta}^{-1}$ in addition to the Hamiltonian $\hat{H}(\hat{q}, \hat{p})$, including the $f_4$ term in the composite loss can have a detrimental effect on the learned transformation. There exists an easily encountered naïve local minimum in which $\hat{\theta}^{-1}$ maps all of the sampled values from $\theta(E) \ni (x,y)$ to a single point in $\hat{E}$, and the Hamiltonian learned is merely the constant function at the pinning value, $\hat{H}(\hat{q}, \hat{p}) = \hat{H}_0$. In this state (or an approximation of it), all of $\partial \hat{H}/\partial \hat{q}$, $\partial \hat{H}/\partial \hat{p}$, and the elements of $D\hat{\theta}^{-1}$ are zero, so the loss (with terms $f_1$, $f_2$, $f_3$, and optionally $f_4$) is zero (resp., small). A related failure is one in which the input in $\theta(E)$ is collapsed by $\hat{\theta}^{-1}$ to a line or curve in $\hat{E}$.

To alleviate both of these problems, we added a new loss component $f_6$, penalizing the reciprocal of the determinant of the Jacobian $D\hat{\theta}^{-1}$; that is, we require that the learned transformation does not collapse the input. It is sufficient for the corresponding weighting factor $c_6$ to be a very small nonzero value (e.g., $10^{-6}$). The addition of $f_6$ to our loss helps us avoid falling early in training into the unrecoverable local minimum described above and also helps keep the scale of the transformed variables $\hat{q}$ and $\hat{p}$ macroscopic.
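A sketch of such an anti-collapse penalty follows. The exact functional form of $f_6$ is not printed in this excerpt; we use $c_6 / |\det D\hat{\theta}^{-1}|$ per the description above, with finite differences in place of the automatic differentiation used in the paper:

```python
import numpy as np

def jacobian_det(transform, x, y, h=1e-5):
    # 2x2 Jacobian of (x, y) -> (q-hat, p-hat) by central finite differences.
    qx = (transform(x + h, y)[0] - transform(x - h, y)[0]) / (2 * h)
    qy = (transform(x, y + h)[0] - transform(x, y - h)[0]) / (2 * h)
    px = (transform(x + h, y)[1] - transform(x - h, y)[1]) / (2 * h)
    py = (transform(x, y + h)[1] - transform(x, y - h)[1]) / (2 * h)
    return qx * py - qy * px

def f6(transform, x, y, c6=1e-6):
    # Anti-collapse penalty: blows up as the transformation flattens phase space.
    return c6 / abs(jacobian_det(transform, x, y))

healthy = lambda x, y: (2 * x + y, y - x)        # Jacobian determinant 3
collapsing = lambda x, y: (0.01 * x, 0.01 * y)   # Jacobian determinant 1e-4
print(f6(healthy, 0.5, 0.5), f6(collapsing, 0.5, 0.5))
```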

### B. Example: Linear transformation of the pendulum

We generate data from a rectangular region $x \in [-1, 1]$, $y \in [-1, 1]$ and then transform the region linearly with $\theta^{-1}(x,y) = A^{-1}[x, y]^T = [q, p]^T$. The matrix $A^{-1}$ is the inverse of $A = R \cdot \Lambda$; a scaling is followed by a rotation, where $\Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$, $\lambda_1 = 1$, $\lambda_2 = 64$, $R = \begin{bmatrix} \cos\rho & -\sin\rho \\ \sin\rho & \cos\rho \end{bmatrix}$, and $\rho = 5°$. Our observation data $\theta(E)$ are thus given by $[x, y]^T = A \cdot [q, p]^T$. Using the true Hamiltonian $H(q,p) = p^2/2 + (1 - \cos q)$, we additionally compute true values for $dq/dt$ and $dp/dt$ and then use $A$ to propagate these to $x$ and $y$ via $\frac{dx}{dt} = \frac{\partial x}{\partial q}\frac{dq}{dt} + \frac{\partial x}{\partial p}\frac{dp}{dt}$ and similarly for $y$, where the partial derivatives are computed analytically (here, they are just the elements of $A$ itself).
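This data generation step can be sketched as follows (the sampling details are simplified relative to the paper; only the observation map and the derivative propagation are shown):

```python
import numpy as np

rho = np.deg2rad(5.0)
Lam = np.diag([1.0, 64.0])
R = np.array([[np.cos(rho), -np.sin(rho)],
              [np.sin(rho),  np.cos(rho)]])
A = R @ Lam                        # scaling followed by rotation
A_inv = np.linalg.inv(A)

def observe(q, p):
    # [x, y]^T = A [q, p]^T; the time derivatives propagate with the same matrix,
    # since the partial derivatives of the linear map are the elements of A.
    qdot, pdot = p, -np.sin(q)                 # pendulum dynamics
    x, y = A @ np.array([q, p])
    xdot, ydot = A @ np.array([qdot, pdot])
    return (x, y), (xdot, ydot)

(x, y), (xdot, ydot) = observe(0.8, -0.3)
q, p = A_inv @ np.array([x, y])    # theta^{-1} recovers the original state
print(q, p)
```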

Our network is then presented with the observation data $x, y$ and their corresponding time derivatives. Its task is to learn $\hat{A}$ and $\hat{A}^{-1}$, which convert to and from variables $\hat{q}, \hat{p}$ (symplectomorphic to the original $q, p$), and a Hamiltonian $\hat{H}$ in this new space. When evaluating the loss, the time derivatives of $\hat{q}$ and $\hat{p}$ are likewise computed via automatic differentiation using the chain rule through the learned transformation, e.g., as $\frac{d\hat{q}}{dt} = \frac{\partial \hat{q}}{\partial x}\frac{dx}{dt} + \frac{\partial \hat{q}}{\partial y}\frac{dy}{dt}$. Note also that $[\hat{q}, \hat{p}]^T = \hat{A}^{-1} \cdot A \cdot [q, p]^T$, so if the original space $E$ could be found, $\hat{A}$ would satisfy $\hat{A} \cdot A^{-1} = I$. This cannot be expected given only the data in $\theta(E)$; we can only be sure that $\hat{A}^{-1} \cdot A$ approximates a symplectomorphism of the original $E$.

We could learn $\hat{\theta}^{-1}$ from a general class of nonlinear functions, such as a small $\tanh$ neural network, but here, we simply learn $\hat{A}$ and $\hat{A}^{-1}$ as linear transformations [that is, we have a linear “neural network,” where $L = 0$ in Eq. (11) and $b_1$ is fixed at $0$]. As we include the reconstruction error of this autoencoder in our loss function, $\hat{A}^{-1}$ is constrained to be the inverse of $\hat{A}$ to a precision no worse than that of the $f_1$ and $f_2$ terms in Eq. (14), after all three are scaled by their corresponding $c_k$ values. In fact, for the linear case, the autoencoder's contribution to the loss is initially significantly lower than that of the Hamiltonian components (see Fig. 3), but, as training proceeds and the $f_1$ and $f_2$ terms improve (at the initial expense of raising the autoencoder loss), larger reductions in loss are possible by optimizing $\hat{H}$ rather than $\hat{\theta}$, so $f_5$ decreases as quickly as (the larger of) $f_1$ or $f_2$. That is, the autoencoder portion of the loss falls quickly to the level where it no longer contributes to the total loss, given its weighting in the loss sum.

We find that the learned symplectomorphism $S(q,p) = \hat{A}^{-1} \cdot A \cdot [q, p]^T$, depicted in its $q$ portion in Fig. 3, preserves $q$ unmixed with $p$ in one or the other of its two discovered coordinates. This is because both (a) $(q,p) \mapsto (p, -q)$ and (b) $(q,p) \mapsto (q, p + f(q))$ for any smooth function $f$ are symplectomorphisms. They are special because $H(q,p) = \hat{H}(\hat{q}(q,p), \hat{p}(q,p))$; i.e., they even preserve the Hamiltonian formulation. For the map (a), the transformation of the Hamiltonian can be seen from the following derivation:

$$\dot{\hat{q}} = \dot{p} = -\frac{\partial H}{\partial q} = \frac{\partial \hat{H}}{\partial \hat{p}}, \qquad (15)$$

$$\dot{\hat{p}} = -\dot{q} = -\frac{\partial H}{\partial p} = -\frac{\partial \hat{H}}{\partial \hat{q}}. \qquad (16)$$

Here, the first equality in each of (15) and (16) follows from the map, and the last equality follows from the requirement that $\hat{q}, \hat{p}$ follow Hamiltonian dynamics with respect to the new Hamiltonian $\hat{H}$. Integrating these relations then shows that the new Hamiltonian is the same as the old one (modulo an additive constant) when considered as a function of the old coordinates.

### C. Example: Nonlinear transformation of the pendulum

In addition to the linear $\theta$ of Sec. IV B, we show comparable results for a nonlinear transformation $\theta$ and learned $\hat{\theta}$. Specifically, we transform the data through $(x, y) = \theta(q, p)$, where

$$x = 20q + \left(10p + (20q)^2\right)^2, \qquad y = 10p + (20q)^2,$$

the inverse of which is given by $q = (x - y^2)/20$ and $p = (y - x^2 + 2xy^2 - y^4)/10$. We use the analytical Jacobian of this $\theta$ to compute the necessary $\dot{x}$ and $\dot{y}$ for input to our network.
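The forward map in the sketch below is our reconstruction, derived from the stated inverse (note that $y - x^2 + 2xy^2 - y^4 = y - (x - y^2)^2$); a round trip verifies that the two are consistent:

```python
import numpy as np

def theta(q, p):
    # Forward observation map, reconstructed from the stated inverse:
    # y = 10p + (20q)^2, then x = 20q + y^2.
    y = 10 * p + (20 * q)**2
    x = 20 * q + y**2
    return x, y

def theta_inv(x, y):
    # Inverse transformation as given in the text.
    q = (x - y**2) / 20
    p = (y - x**2 + 2 * x * y**2 - y**4) / 10
    return q, p

q, p = 0.3, -0.2
x, y = theta(q, p)
print(theta_inv(x, y))   # recovers approximately (0.3, -0.2)
```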

We proceed as before, except that we no longer restrict the form of the learned $\theta ^$ and $\theta ^\u22121$ to linear transformations, but instead allow small multilayer perceptrons of a form similar to that used for $H^$.

The resulting induced symplectomorphism is again one that appears to preserve an approximately monotonically increasing or decreasing $q$ in either $q^$ or $p^$. This can be seen in Fig. 4.

### D. Example: Constructing a Hamiltonian system from nonlinear, high-dimensional observations of *q*, *p*

As a further demonstration of the method, we use a graphical rendering of the moving pendulum example from before as the transformation $\theta$ from the intrinsic state $(q,p)$ to an image $x$ as our high-dimensional observable. We use a symplectic, semi-implicit Euler method,

$$p_{i+1} = p_i - \Delta t \sin(q_i), \qquad q_{i+1} = q_i + \Delta t\, p_{i+1},$$

to generate $q(t_i), p(t_i)$ trajectories for various initial conditions and then a simple graphical renderer to display these as images [see Fig. 5(a)]. When rendering our video frames, we drag a tail of decaying images behind the moving pendulum head so that information about both the position $q$ and the velocity $p$ is present in each rendered frame. This is done by iterating over each $q(t_i), p(t_i)$ trajectory and, for each $t_i$, (1) multiplying the entire current image by a cooling factor of about $0.96$, (2) adding a constant heating amount to the image in a circle of fixed radius centered around the current $(\cos q, \sin q)$ point, and (3) clipping the image per pixel to lie within $[0, 1]$. Samples of the resulting images are visible in Fig. 5(a).
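The integration and rendering steps can be sketched as follows; the image size, heating amount, and circle radius are our own choices, while the cooling factor and the three per-frame operations follow the text:

```python
import numpy as np

def simulate(q0, p0, dt=0.05, steps=400):
    # Symplectic (semi-implicit) Euler for the pendulum:
    # the p update uses the old q; the q update uses the new p.
    qs, ps = [q0], [p0]
    for _ in range(steps):
        p = ps[-1] - dt * np.sin(qs[-1])
        q = qs[-1] + dt * p
        qs.append(q); ps.append(p)
    return np.array(qs), np.array(ps)

def render(qs, size=32, cooling=0.96, heat=0.5, radius=2):
    # Drag a decaying tail behind the pendulum head so each frame encodes q and p.
    img = np.zeros((size, size))
    ii, jj = np.mgrid[0:size, 0:size]
    frames = []
    for q in qs:
        img = img * cooling                          # (1) cool the whole image
        cx = (size - 1) * (1 + np.cos(q)) / 2        # head at (cos q, sin q),
        cy = (size - 1) * (1 + np.sin(q)) / 2        # mapped into pixel coordinates
        mask = (ii - cy)**2 + (jj - cx)**2 <= radius**2
        img = np.clip(img + heat * mask, 0.0, 1.0)   # (2) heat a circle, (3) clip
        frames.append(img.copy())
    return np.array(frames)

qs, ps = simulate(0.5, 0.0)
H = 0.5 * ps**2 + 1 - np.cos(qs)   # bounded energy oscillation, no drift
frames = render(qs)
print(H.max() - H.min(), frames.shape)
```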

Though we do not use $p(t_i)$ directly in this procedure, its value is observable in the length of the tail dragged behind the moving pendulum head. We create trajectories long enough that the effect of the initial formation of the tail (during which its length is not necessarily a good indicator of $p$) is no longer visible and then use only the final two observations from each trajectory.

In order to make the approach agnostic to the data, we do not want to assume that the space $\hat{E}$ is periodic, so instead, we use a four-dimensional phase space with elements $\hat{z} = [\hat{q}_1, \hat{q}_2, \hat{p}_1, \hat{p}_2] = [\hat{q}, \hat{p}]$ and consider the splitting into $(\hat{q}_1, \hat{q}_2)$ and $(\hat{p}_1, \hat{p}_2)$ during training. In the space of input images, the manifold does not fill up the four-dimensional space but rather a cylinder, which is mapped to the four-dimensional encoding layer by the autoencoder.

In addition, to simplify the learning problem, we learn $\hat{\theta}^{-1}$ as the combination of a projection onto the first 20 principal components of the training dataset followed by a dense autoencoder, reserving the learning of $\hat{\theta}^{-1}$ as an end-to-end convolutional autoencoder for future work. The encoding provides $\hat{z}$ and, as before, we learn $\hat{\theta}^{-1}$ in tandem with $\hat{H}$, where the conditions of Eq. (14) are now upgraded to their vector equivalents to accommodate $\hat{z}$.
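The projection stage can be sketched with a plain SVD-based principal component analysis; the synthetic frames below merely stand in for the rendered pendulum movies, and the dense autoencoder that follows the projection is omitted:

```python
import numpy as np

def pca_project(frames, n_components=20):
    # Flatten the frames and project onto the leading principal components,
    # forming the first stage of the learned theta^{-1}.
    X = frames.reshape(len(frames), -1)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    comps = Vt[:n_components]              # (n_components, n_pixels)
    codes = (X - mean) @ comps.T           # low-dimensional inputs to the
    return codes, comps, mean              # dense autoencoder stage

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 300)
# Stand-in for rendered frames: smooth images parameterized by a single angle.
grid = np.linspace(0, 1, 16)
frames = np.array([np.outer(np.sin(a + 4 * grid), np.cos(a + 4 * grid)) for a in t])
codes, comps, mean = pca_project(frames)
recon = codes @ comps + mean               # reconstruction from the 20 components
print(codes.shape, np.abs(recon - frames.reshape(300, -1)).max())
```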

In Sec. IV A, we added a loss term proportional to the reciprocal of the determinant of the transformation's Jacobian in order to avoid transformations that collapse the phase space. Here, that was less of an issue: some collapsing of the high-dimensional representation is, of course, required. Instead, a common mode of failure turned out to be learning constant $\hat{H}$ functions, which automatically satisfy the Hamiltonian requirements (a constant is trivially a conserved quantity). To avoid this, we considered several possible ways to promote a nonflat $\hat{H}$ function, ultimately settling on (a) adding a term that encourages the standard deviation of the $\hat{H}$ values to be nonzero and (b) minimizing not just the mean squared error in our $f_1$ and $f_2$ terms, but also the maximum squared error, to avoid trivial or bilevel $\hat{H}(q,p)$ functions. The spherical Gaussian prior used in training variational autoencoders^{34} also avoids, as a by-product, learning constant functions.

The result, shown in Fig. 5(b), was a pulled-back $\hat{H}(q,p)$ function that, at least in broad strokes, resembles the truth and that satisfies $d\hat{H}/dt \approx 0$ (typically about $10^{-2}$).

## V. CONCLUSIONS

We described an approach to approximate Hamiltonian systems from observation data. It is a completely data-driven pipeline to (a) construct an appropriate phase space and (b) approximate a Hamiltonian function on the new phase space, from nonlinear, possibly high-dimensional observations (here, movies).

When only transformations of the original Hamiltonian phase space can be observed, it is only possible to recover a symplectomorphic copy of the original phase space, with additional freedom in a constant scaling of the coordinates and an additive constant in the Hamiltonian function. If no additional information about the original space is available, this is a fundamental limitation and not particular to our approach. It is not even necessary that an “original phase space” exists at all, and so, the resulting symplectic phase space is by no means unique. In fact, Darboux's theorem implies the opposite:^{35,36} there is no local obstruction to a symplectic structure, so all symplectic manifolds of dimension $2n$ are locally symplectomorphic to the standard flat space $\mathbb{R}^{2n}$. The choice of a single such space has to rely on other factors, such as, possibly, interpretability by humans or simplicity of the equations.

The approach may be extended to time-dependent Hamiltonian functions. This would allow us to cope with certain dissipative systems.^{16} An even broader extension may allow transformations to arbitrary normal forms as the “target vector field” and thus would not be constrained to Hamiltonian systems. In the general case, it will become important to explore whether the transformation we approximate remains bounded over our data or whether it starts showing signs of approaching a singularity, suggesting that the problem may not be solvable.

## ACKNOWLEDGMENTS

This work was funded by the US Army Research Office (ARO) through a Multidisciplinary University Research Initiative (MURI) and by the Defense Advanced Research Projects Agency (DARPA) through their Physics of Artificial Intelligence (PAI) program, and through agreement HR001118C0100 (I.M.).

## REFERENCES

*Proceedings of the 32nd Conference on Neural Information Processing Systems*(Curran Associates, Inc., 2018), p. 11.

*Advances in Neural Information Processing Systems*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc., 2018), Vol. 31, pp.

*Proceedings of the International Conference on Learning Representations*(2017).

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*(IEEE, 2016), Vol. 7, pp. 171–180.

*2018 IEEE Conference on Decision and Control (CDC)*(IEEE, 2018).

*Proceedings of the 35th International Conference on Machine Learning*, edited by J. Dy and A. Krause (PMLR, Stockholm, Sweden, 2018), p. 10.

*2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*(IEEE, Las Vegas, NV, 2016), pp. 3521–3529.

*Handbook of Markov Chain Monte Carlo*, edited by S. Brooks, A. Gelman, G. Jones, and X.-L. Meng (Chapman and Hall/CRC, 2011).

*Advances in Neural Information Processing Systems*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc., 2018), Vol. 31, pp. 10117–10126.

*Proceedings of the 32nd International Conference on Machine Learning*(PMLR, 2015), p. 9.

*Proceedings of the 1993 IEEE International Conference on Neural Networks*(The Institute of Electrical and Electronics Engineers, 1993), pp. 382–386.

*Proceedings of IEEE Workshop on Neural Networks for Signal Processing*(The Institute of Electrical and Electronics Engineers, 1994).

*Neural Networks for Chemical Engineers*, edited by A. B. Bulsari (Elsevier, 1995), pp. 409–442.

*Proceedings of the Twenty-Eighth Annual Conference on Information Sciences and Systems*(Johns Hopkins University, 1994), pp. 965–969.

*American Control Conference*(The Institute of Electrical and Electronics Engineers, 1993), pp. 1475–1479.

*Proceedings of the International Conference on Machine Learning*(PMLR, 2015).