Concise, accurate descriptions of physical systems through their conserved quantities abound in the natural sciences. In data science, however, current research often focuses on regression problems, without routinely incorporating additional assumptions about the system that generated the data. Here, we propose to explore a particular type of underlying structure in the data: Hamiltonian systems, where an “energy” is conserved. Given a collection of observations of such a Hamiltonian system over time, we extract phase space coordinates and a Hamiltonian function of them that acts as the generator of the system dynamics. The approach employs an autoencoder neural network component to estimate the transformation from observations to the phase space of a Hamiltonian system. An additional neural network component is used to approximate the Hamiltonian function on this constructed space, and the two components are trained jointly. As an alternative approach, we also demonstrate the use of Gaussian processes for the estimation of such a Hamiltonian. After two illustrative examples, we extract an underlying phase space as well as the generating Hamiltonian from a collection of movies of a pendulum. The approach is fully data-driven and does not assume a particular form of the Hamiltonian function.
Neural network-based methods for modeling dynamical systems are again becoming widely used, and methods that explicitly learn the physical laws underlying continuous observations in time constitute a growing subfield. Our work contributes to this thread of research by incorporating additional information into the learned model, namely, the knowledge that the data arise as observations of an underlying Hamiltonian system.
We use machine learning to extract models of systems whose dynamics conserve a particular quantity (the Hamiltonian). We train several neural networks to approximate the total energy function for a pendulum, in both its natural action-angle form and also as seen through several distorting observation functions of increasing complexity. A key component of the approach is the use of automatic differentiation of the neural network in formulating the loss function that is minimized during training.
Our method requires data evaluating the first and second time derivatives of observations across the regions of interest in state space or, alternatively, sufficient information (such as a sequence of delayed measurements) to estimate these. We include examples in which the observation function is nonlinear and high-dimensional.
I. INTRODUCTION
Current data science exploration of dynamics often focuses on regression or classification problems, without routinely incorporating additional assumptions about the nature of the system that generated the data. This has started to change recently, with approaches that extract generic dynamical systems from data,32 where the variables and candidate expressions for the governing formulas are specified beforehand. In particular, treating the central object to be modeled as the discrete-time flow map and learning it directly as a black box may result in qualitative differences from the true system.31 Using the associated differential equation instead allows us to exploit established numerical integration schemes to help approximate the flow map; with a neural network, this can be accomplished by structuring the loss function after such classical integration schemes. In addition to our older work along these lines,6,25–31 more recent work has revived this approach, focusing on treating the layers of a deep neural network as iterates of a dynamical system, where “learning” consists of discovering the right attractors.3,5,14
In particular, interest has focused on how residual networks11 and highway networks33 can be interpreted as iterative solvers8 or as iterated dynamical systems.4 In the latter paper (a NeurIPS 2018 best paper awardee), the authors chose not to unroll the iteration in time explicitly, but instead to use continuous-time numerical integration. While the focus there was on the dynamics-of-layers concept, time series learning was also performed.
The Koopman operator has also been employed in combination with neural networks to extract conservation laws and special group structures.12,15 The relation between symmetries and conserved quantities is a well-studied topic in physics.10,13,19,20 A recent thread of research consists of learning physics models from observation data,17 including modeling discrete-time data as observations of a continuous-time dynamical system.21,22
It is informative to study physical systems through their conserved quantities, such as total energy, which can be encoded in a Hamiltonian function.1,10 The measure-preserving property of Hamiltonian systems has recently been exploited for transport of densities in Markov-chain Monte-Carlo methods18 and variational autoencoders.2,24 For our purposes, a natural progression of the thread of computational modeling of physics from observations is to represent the Hamiltonian function directly.
Concurrently with this submission, two papers independently addressing similar issues appeared as preprints.9,34 In the first of these,9 a loss function very similar to parts of our Eq. (12) was used. The second paper focuses on the transformation of densities generated through the Hamiltonian. It is possible to draw some analogies between the second preprint and the older (non-Hamiltonian) work we mentioned above,6,25–31 as this newer work also uses rollout with a time step templated on a classical numerical integration method (here, symplectic Euler and leapfrog). Both papers also use the pendulum as an example, emphasizing conditions where the system can be well approximated by a linear system: a thin annulus around the stable steady state at the origin, where the trajectories are nearly circular.
The remainder of the paper is structured as follows:
We derive data-driven approximations (through two approaches: Gaussian processes and neural networks) of a Hamiltonian function on a given phase space, from time series data. The Hamiltonian functions we consider do not need to be separable as a sum $H(q,p) = T(p) + V(q)$, and in our illustrative example, we always work in the fully nonlinear regime of the pendulum.
We build data-driven reconstructions of a phase space from (a) linear and (b) nonlinear, nonsymplectic transformations of the original Hamiltonian phase space. The reconstruction then leads to a symplectomorphic copy of the original Hamiltonian system.
We construct a completely data-driven pipeline combining (a) the construction of an appropriate phase space and (b) the approximation of a Hamiltonian function on this new phase space, from nonlinear, high-dimensional observations (e.g., from movies/sequences of movie snapshots).
II. GENERAL DESCRIPTION
A Hamiltonian system on the Euclidean space $\mathbb{R}^{2n}$ is determined through a function $H:\mathbb{R}^{2n}\to\mathbb{R}$ that defines the equations

$$\dot q = \frac{\partial H}{\partial p}(q,p), \qquad (1)$$

$$\dot p = -\frac{\partial H}{\partial q}(q,p), \qquad (2)$$

where $q\in\mathbb{R}^n$ and $p\in\mathbb{R}^n$ are interpreted as “position” and “momentum” coordinates in the “phase space” $\mathbb{R}^{2n}$. In many mechanical systems, and in all examples we discuss in this paper, this interpretation of the coordinates is reflected in the dynamics through $\dot q = p$, i.e., $H(q,p)=\tfrac{1}{2}\,p^\top p + V(q)$ for some potential function $V$. In general, Eqs. (1) and (2) imply that the Hamiltonian is constant along trajectories because

$$\frac{dH}{dt} = \frac{\partial H}{\partial q}\,\dot q + \frac{\partial H}{\partial p}\,\dot p = \frac{\partial H}{\partial q}\,\frac{\partial H}{\partial p} - \frac{\partial H}{\partial p}\,\frac{\partial H}{\partial q} = 0. \qquad (3)$$

Equations (1) and (2) can be written compactly as

$$\dot x = f(x) = J\,\nabla H(x), \qquad J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}, \qquad (4)$$

where $I$ is the $n\times n$ identity matrix and $f=(\dot q,\dot p)$ is the vector field on $\mathbb{R}^{2n}$ [the left-hand side of Eqs. (1) and (2)], which only depends on the state $x=(q,p)$. The symplectic form on the given Euclidean space takes the form of the matrix $J$.
In Sec. III, we discuss how to approximate the function $H$ from given data points $(q_i, p_i, \dot q_i, \dot p_i)$. This involves solving the partial differential equation (4) for $H$. Since these equations determine $H$ only up to an additive constant, we assume that we also know the value $H_0 = H(q_0, p_0)$ at a single point $(q_0, p_0)$ in the phase space. This is not a major restriction for the approach because $(q_0, p_0)$ as well as $H_0$ can be chosen arbitrarily.
III. EXAMPLE: THE NONLINEAR PENDULUM
As an example, consider the case $n=1$ and the Hamiltonian

$$H(q,p) = \frac{1}{2}\,p^2 - \cos(q). \qquad (5)$$

This Hamiltonian forms the basis for the differential equations of the nonlinear pendulum, $\ddot q = -\sin(q)$, or, in first-order form, $\dot q = p$ and $\dot p = -\sin(q)$. In this section, we numerically solve partial differential equation (PDE) (4) by approximating the solution $H$ using two approaches: Gaussian processes23 (Sec. III A) and neural networks (Sec. III B).
A. Approximation using Gaussian processes
We model the solution as a Gaussian process with a Gaussian covariance kernel,

$$k(x,y) = \exp\!\left(-\frac{\|x-y\|^2}{\varepsilon^2}\right), \qquad (6)$$

where $x$ and $y$ are points in the phase space, i.e., $x=(q_x,p_x)$, $y=(q_y,p_y)$, and $\varepsilon$ is the kernel bandwidth parameter (we kept it fixed for the experiments in this paper). Given a collection $Y=\{y_j\}_{j=1}^M$ of points in the phase space, as well as the function values $H(Y)$ at all points in $Y$, the conditional expectation of the Gaussian process at a new point $x$ is

$$\mathbb{E}\!\left[H(x)\mid Y, H(Y)\right] = k(x,Y)\,k(Y,Y)^{-1}\,H(Y), \qquad (7)$$

where we write $k(Y,Y)$ for the kernel matrix evaluated over all values in the given dataset $Y$. In Eq. (7), the dimensions of the symbols are $k(x,Y)\in\mathbb{R}^{1\times M}$, $k(Y,Y)\in\mathbb{R}^{M\times M}$, $H(Y)\in\mathbb{R}^{M}$, and $H(x)\in\mathbb{R}$. All vectors are column vectors. Estimates of the solution to the PDE at new points depend on the value of $H$ over the entire dataset. We do not know the values of $H(Y)$, but differentiating Eq. (7) allows us to set up a system of equations to estimate $H$ at an arbitrary number of new points close to the points where derivative information is available, by using the information about the derivatives of $H$ given by the time derivatives $\dot q$ in Eq. (1) and (the negative of) $\dot p$ in Eq. (2). Let $\nabla H(x_i) = (-\dot p_i,\ \dot q_i)^\top$ be the gradient of $H$ at a point $x_i$ in the phase space, which we know from the data. Then,

$$\nabla H(x_i) = \nabla_1 k(x_i, Y)\,k(Y,Y)^{-1}\,H(Y). \qquad (8)$$
Together with an arbitrary pinning term $H(x_0)=H_0$ at a point $x_0$, the list of known derivatives leads to a linear system of equations, where we write

$$\nabla_1 k(x_i, Y) = \left.\frac{\partial k(x, y_j)}{\partial x}\right|_{x=x_i,\ j=1,\dots,M} \qquad (9)$$

for the derivative of the given kernel function with respect to its first argument, evaluated at a given point $x_i$ in the first argument and all points of the dataset $Y$ in the second argument. For each $x_i$, we thus have a (column) vector of derivative evaluations per component of $x$, which is transposed and stacked into a large matrix to form the full linear system

$$\begin{pmatrix} \nabla_1 k(x_1, Y) \\ \vdots \\ \nabla_1 k(x_N, Y) \\ k(x_0, Y) \end{pmatrix} k(Y,Y)^{-1}\,H(Y) = \begin{pmatrix} \nabla H(x_1) \\ \vdots \\ \nabla H(x_N) \\ H_0 \end{pmatrix}. \qquad (10)$$

We reiterate: $X=\{x_i\}_{i=1}^N$ is the dataset of points where we know the derivatives of $H$ through the time derivatives of the data. We evaluate $H$ on a fine grid $Y$ of $M$ points and have derivative information on a relatively small set of $N$ points $X$ [black dots in Fig. 1(b)]. The derivative of the Gaussian process can be stated using the derivative of the kernel with respect to its first argument. The matrix inverse of $k(Y,Y)$ is approximated by the inverse of $k(Y,Y)+\sigma I$, that is, through Tikhonov regularization with parameter $\sigma$, as is standard for Gaussian process regression. Solving this system of equations for $H(Y)$ yields the approximation for the solution to the PDE. Figure 1(b) shows the result, and Table I lists the mean squared error on the 625 training data points and an independently drawn set of 200 validation points in the same domain, where no derivatives were available. See Ref. 22 for a more detailed discussion of the solution of PDEs with Gaussian processes.
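For concreteness, a minimal NumPy sketch of this derivative-informed Gaussian process solve could look as follows. It is not the implementation used for the figures; the kernel bandwidth `eps`, the regularization `sig`, and the use of a least-squares solve are assumptions consistent with the description above.

```python
# Minimal sketch: estimate H on a grid Y from gradient data at points X,
# where grad H(x_i) = (-dp/dt, dq/dt) is known from the dynamics.
import numpy as np

def kernel(A, B, eps):
    """Gaussian kernel matrix k(A, B) for row-wise points A (N, 2) and B (M, 2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eps**2)

def grad1_kernel(x, Y, eps):
    """Gradient of k(x, .) with respect to its first argument x; shape (2, M)."""
    diff = x[None, :] - Y                                   # (M, 2)
    return (-2.0 / eps**2) * diff.T * kernel(x[None, :], Y, eps)

def solve_hamiltonian_gp(X, dqdt, dpdt, Y, x0, H0=0.0, eps=1.0, sig=1e-6):
    """Return the estimated values H(Y) on the evaluation grid Y."""
    K = kernel(Y, Y, eps) + sig * np.eye(len(Y))            # Tikhonov-regularized kernel
    Kinv = np.linalg.inv(K)
    rows, rhs = [], []
    for i, x in enumerate(X):                               # two equations per data point
        G = grad1_kernel(x, Y, eps) @ Kinv                  # (2, M)
        rows += [G[0], G[1]]
        rhs += [-dpdt[i], dqdt[i]]                          # dH/dq = -dp/dt, dH/dp = dq/dt
    rows.append((kernel(x0[None, :], Y, eps) @ Kinv)[0])    # pinning term H(x0) = H0
    rhs.append(H0)
    A, b = np.vstack(rows), np.array(rhs)
    H_Y, *_ = np.linalg.lstsq(A, b, rcond=None)             # solve for H(Y)
    return H_Y
```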
(a) Hamiltonian function of the pendulum system. (b) Approximation using Gaussian processes, solving Eq. (10) on a fine grid $Y$, with information about the derivative of $H$ on a set $X$ of randomly sampled points (black points in the pink rectangle). (c) Hamiltonian learned with a neural network, with all loss terms included. Note that the learned function is only accurate where data sampling is sufficiently dense. For this experiment, we used the “softplus” activation function $x \mapsto \ln(1+e^x)$. The black points shown in panel (b) show all training data for the Gaussian process. The points in panel (c) show only a subset of the training data used for the neural network (20 000 in total).
Comparison of loss between training and held-out validation data. As the sizes of our generated datasets are always much larger than the numbers of trainable parameters, overfitting is not an issue, and the validation loss is always indistinguishable from the training loss. The first row includes the results from the Gaussian process approximation (a nonparametric method) using all training data for inference. Only the hyperparameters ε and σ needed to be adjusted. For the first row, the loss reported is the MSE of the Gaussian process regression. For other rows, the loss is computed by either Eqs. (12) or (14) as appropriate.
| | No. of parameters | No. of training points | No. of validation points | Training loss | Validation loss |
|---|---|---|---|---|---|
| Section III A | Nonparametric | 625 | 200 | 2.2 × 10^−5 | 3.5 × 10^−5 |
| Section III B | 337 | 20 000 | 200 | 6.8 × 10^−4 | 7.4 × 10^−4 |
| Section IV B | 345 | 20 000 | 1000 | 5.5 × 10^−6 | 5.0 × 10^−6 |
| Section IV C | 511 | 19 200 | 200 | 1.4 × 10^−4 | 1.8 × 10^−4 |
| Section IV D | 755 | 17 489 | 200 | 0.51 | 0.54 |
B. Approximation using an artificial neural network
Another possibility for learning the form of $H$ from data is to represent the function with an artificial neural network7 (ANN). We write

$$y_{l+1} = \sigma_l\!\left(y_l W_l + b_l\right), \qquad l = 0, 1, \dots, L, \qquad (11)$$

where the activation function $\sigma_l$ is nonlinear for the hidden layers $l < L$ (except where otherwise indicated) and the identity for the output layer $l = L$. The learnable parameters of this ANN are the weight matrices $W_l$ and bias vectors $b_l$, and we gather all such learnable parameters from the multiple layers that may be used in one experiment into a parameter vector $\theta$. If there are no hidden layers ($L = 0$), then we learn an affine transformation $y_1 = y_0 W_0 + b_0$. This format provides a surrogate function $\hat H_\theta(q,p)$, where the input is the row vector $y_0 = (q, p)$. (Treating inputs as row vectors and using right-multiplication for weight matrices is convenient, as a whole batch of $N$ inputs can be presented as an $N$-by-$2$ array.) For all the experiments shown here, this network for computing $\hat H_\theta$ has two hidden layers of width 16.
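As a concrete illustration, a sketch of such a surrogate network in TensorFlow/Keras (the library used in this work) is shown below; the layer widths follow the text, while the tanh activation is a placeholder for the default nonlinearity, which is not reproduced above.

```python
# Sketch of the Hamiltonian surrogate of Eq. (11): a small fully connected
# network mapping a row vector (q, p) to a scalar H. The tanh activation here
# is a placeholder assumption, not a value prescribed by the text.
import tensorflow as tf

def make_hamiltonian_net(hidden=(16, 16), activation="tanh"):
    layers = [tf.keras.layers.Dense(w, activation=activation) for w in hidden]
    layers.append(tf.keras.layers.Dense(1, activation=None))  # identity output layer
    return tf.keras.Sequential(layers)

H_net = make_hamiltonian_net()
H_values = H_net(tf.constant([[0.0, 1.0]]))  # batch of row vectors (q, p) -> H
```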
Similarly, in the case(s) where we need to learn additional transformations to and from a new phase space (see Sec. IV), these transformations are also represented by such networks.
We collect training data by sampling a number of initial conditions in a rectangle in phase space and then simulating short trajectories from each to their final points $(q, p)$. For each of these, we additionally evaluate the time derivatives $(\dot q, \dot p)$. Shuffling over simulations once per epoch and dividing this dataset into batches, we then perform batchwise stochastic gradient descent to learn the parameters $\theta$ using an Adam optimizer on the objective function defined below.
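A minimal sketch of this data-collection step is given below; the domain bounds, integration horizon, and sample count are placeholders rather than the values used for the experiments.

```python
# Sketch: sample initial conditions in a rectangle, integrate short pendulum
# trajectories, and record (q, p, dq/dt, dp/dt) at the final points.
import numpy as np
from scipy.integrate import solve_ivp

def pendulum_rhs(t, z):
    q, p = z
    return [p, -np.sin(q)]                       # dq/dt = dH/dp, dp/dt = -dH/dq

def sample_dataset(n_traj=2000, t_short=0.1, rng=np.random.default_rng(0)):
    q0 = rng.uniform(-1.5, 1.5, n_traj)          # placeholder rectangle in phase space
    p0 = rng.uniform(-1.5, 1.5, n_traj)
    data = []
    for q, p in zip(q0, p0):
        sol = solve_ivp(pendulum_rhs, (0.0, t_short), [q, p], rtol=1e-8)
        qf, pf = sol.y[:, -1]
        data.append([qf, pf, pf, -np.sin(qf)])   # state and its time derivatives
    return np.array(data)                        # columns: q, p, dq/dt, dp/dt
```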
For this paper, all neural networks were constructed and trained using TensorFlow, and the gradients necessary for evaluating the Hamiltonian loss terms in Eq. (12) were computed using TensorFlow’s core automatic differentiation capabilities.
This objective function comprises a scalar function evaluated on each data 4-tuple $(q, p, \dot q, \dot p)$ in the batch and then averaged over the batch. This scalar function is written as

$$\ell_\theta(q,p,\dot q,\dot p) = \alpha_1\!\left(\dot q - \frac{\partial \hat H_\theta}{\partial p}\right)^{\!2} + \alpha_2\!\left(\dot p + \frac{\partial \hat H_\theta}{\partial q}\right)^{\!2} + \alpha_3\!\left(\hat H_\theta(q_0,p_0) - H_0\right)^{\!2} + \alpha_4\!\left(\frac{\partial \hat H_\theta}{\partial q}\,\dot q + \frac{\partial \hat H_\theta}{\partial p}\,\dot p\right)^{\!2}, \qquad (12)$$
where the dependence on $\theta$ is through the learned Hamiltonian $\hat H_\theta$, the loss-term weights $\alpha_1,\dots,\alpha_4$ are chosen to emphasize certain properties of the problem thus posed, and the partial derivatives of $\hat H_\theta$ with respect to $q$ and $p$ are computed explicitly through automatic differentiation. With one exception, all weights are set to either zero or one, depending on whether the associated loss term is to be excluded or included; because of the square term in Eq. (5), the remaining weight is set to a smaller value when nonzero, so that the loss is not dominated by its term.
Since Eqs. (1) and (2) together imply (3), any one of the three terms weighted by $\alpha_1$, $\alpha_2$, and $\alpha_4$ could be dropped as redundant; therefore, we can set $\alpha_4$ to zero but monitor the conservation residual as a useful sanity check on the accuracy of the learned solution. In Fig. 1(c), we show the results of this process with our default nonzero values for the weights.
As an ablation study, we explored the effect (not shown here) of removing the first, second, and fourth terms. By construction, the true value of $\dot H$ is zero for all states. Note that this is only ever achieved to any degree in the central box in the figure, where data were densely sampled. Removing the fourth (conservation) term made no visible difference in the quality of our approximation, which was expected due to the redundancy in the set of Eqs. (1)–(3). However, removing either the first or the second term gives poor results across the figure, despite the apparent redundancy of these terms with the fourth. This might be due to not balancing the contributions of the $\partial \hat H_\theta/\partial q$ and $\partial \hat H_\theta/\partial p$ terms, for which we attempted to compensate by unequal weighting values $\alpha_1$ and $\alpha_2$.
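A sketch of this per-batch loss, with the partial derivatives of the surrogate obtained by automatic differentiation, is given below; the default weight values are placeholders, not the values used for the figures.

```python
# Sketch of the Hamiltonian loss of Eq. (12), using tf.GradientTape for the
# partial derivatives of the surrogate H. Weight values are placeholders.
import tensorflow as tf

def hamiltonian_loss(H_net, batch, x0, H0=0.0, alpha=(1.0, 1.0, 1.0, 0.0)):
    # batch: tensor of shape (N, 4) with columns q, p, dq/dt, dp/dt
    # x0: tensor of shape (1, 2), the pinning point (q0, p0)
    q, p, qdot, pdot = [batch[:, i:i + 1] for i in range(4)]
    qp = tf.concat([q, p], axis=1)
    with tf.GradientTape() as tape:
        tape.watch(qp)
        H = H_net(qp)
    dH = tape.gradient(H, qp)                       # (N, 2): [dH/dq, dH/dp]
    dHdq, dHdp = dH[:, 0:1], dH[:, 1:2]
    l1 = tf.reduce_mean((qdot - dHdp) ** 2)         # residual of Eq. (1)
    l2 = tf.reduce_mean((pdot + dHdq) ** 2)         # residual of Eq. (2)
    l3 = (H_net(x0)[0, 0] - H0) ** 2                # pinning term
    l4 = tf.reduce_mean((dHdq * qdot + dHdp * pdot) ** 2)  # conservation, Eq. (3)
    a1, a2, a3, a4 = alpha
    return a1 * l1 + a2 * l2 + a3 * l3 + a4 * l4
```

Training then amounts to minimizing this quantity with an Adam optimizer over shuffled batches, as described above.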
IV. ESTIMATING HAMILTONIAN STRUCTURE FROM OBSERVATIONS
We now consider a set of observation functions $\omega_i:\mathbb{R}^{2n}\to\mathbb{R}$, $i=1,\dots,m$, with $m\ge 2n$, such that the combined map $\omega=(\omega_1,\dots,\omega_m)$ is a diffeomorphism between the phase space and its image. In this setting, the notion of a symplectomorphism is important.1 In general, a symplectomorphism is a diffeomorphism that leaves the symplectic structure on a manifold invariant. In our setting, a symplectomorphism of $\mathbb{R}^{2n}$ maps $(q,p)$ to a deformed space where the system dynamics in the new variables $(\tilde q,\tilde p)$ is again Hamiltonian and conjugate to the original Hamiltonian dynamics. Not every diffeomorphism is a symplectomorphism, and we do not assume that $\omega$ is a symplectomorphism. A constant scaling of the coordinates is also not possible to distinguish from a scaling of the Hamiltonian function itself, so the recovered system will be a symplectomorphic copy of the original, scaled by an arbitrary constant.
In the setting of this section, we do not assume access to $q$, $p$, or the explicit form of $\omega$. Only a collection of observation points $x=\omega(q,p)$ and their time derivatives $\dot x$ in the image of $\omega$ is available. We describe an approach to approximate a new map $\eta$ from the observations into a symplectomorphic copy $(\tilde q,\tilde p)$ of the original phase space through an autoencoder,7 such that the transformed system in $(\tilde q,\tilde p)$ is conjugate to the original Hamiltonian system in $(q,p)$. Upon convergence, and if we had access to $\omega$, the composition $\eta\circ\omega$ would approximate a symplectomorphism, and $\omega^{-1}\circ\eta^{-1}$ would be its inverse. During the estimation of $\eta$, we simultaneously approximate the new Hamiltonian function $\tilde H(\tilde q,\tilde p)$. Figure 2(a) visualizes the general approach, where only the information $x$ and $\dot x$ is available to the procedure, while $(\tilde q,\tilde p)$ and a Hamiltonian $\tilde H$ on them are constructed numerically.
(a) Illustration of the network structure. Starting with observations $x$ of the (unknown) variables $(q,p)$, the autoencoder is structured to map via the encoder $\eta$ to the Hamiltonian form $(\tilde q,\tilde p)$, as well as map back to the observations through the decoder $\eta^{-1}$. The complete process involves estimating a Hamiltonian $\tilde H$ for (the symplectomorphic copies of) $(q,p)$ during training of the autoencoder. (b) The transformation from the original space to the observed space for the linear example of Sec. IV B.
An interesting and important feature common to the next three examples is that one cannot expect to systematically recover the original values $(q,p)$ from the given observation data $x$. The original coordinates can be recovered only up to a symplectic transformation, which is enough to define a Hamiltonian. Once the coordinates are fixed, the Hamiltonian function in these coordinates is unique up to an additive constant.
A. A composite loss function for the joint learning of a transformation and a Hamiltonian
The following loss function is used to train an autoencoder component together with a Hamiltonian function approximation network component:

$$\ell_\theta(x,\dot x) = \alpha_1\!\left(\dot{\tilde q} - \frac{\partial \tilde H_\theta}{\partial \tilde p}\right)^{\!2} + \alpha_2\!\left(\dot{\tilde p} + \frac{\partial \tilde H_\theta}{\partial \tilde q}\right)^{\!2} + \alpha_3\!\left(\tilde H_\theta(\eta(x_0)) - H_0\right)^{\!2} + \alpha_4\!\left(\frac{\partial \tilde H_\theta}{\partial \tilde q}\,\dot{\tilde q} + \frac{\partial \tilde H_\theta}{\partial \tilde p}\,\dot{\tilde p}\right)^{\!2} + \alpha_5\,\big\|\eta^{-1}(\eta(x)) - x\big\|^2, \qquad (14)$$

where the dependence on $\theta$ is through the learned Hamiltonian $\tilde H_\theta$ and the learned transformations $\eta$ (encoder) and $\eta^{-1}$ (decoder), and the time derivatives in the $(\tilde q,\tilde p)$ space are computed as $(\dot{\tilde q},\dot{\tilde p}) = D\eta(x)\,\dot x$, using the Jacobian $D\eta$ of the transformation (computed pointwise with automatic differentiation). When we learn a (especially nonlinear) transformation $\eta$ in addition to the Hamiltonian $\tilde H_\theta$, these terms by themselves can have a detrimental effect on the learned transformation. There exists an easily encountered naïve local minimum in which $\eta$ maps all of the sampled values from the observation space to a single point in $(\tilde q,\tilde p)$, and the Hamiltonian learned is merely the constant function at the pinning value $H_0$. In this state (or an approximation of this state), $\dot{\tilde q}$, $\dot{\tilde p}$, and the elements of $D\eta$ are all zero, so the loss restricted to the terms weighted by $\alpha_1$, $\alpha_2$, $\alpha_3$, and optionally $\alpha_4$ is zero (resp., small). A related failure is one in which the input is collapsed by $\eta$ to a line or curve in $(\tilde q,\tilde p)$.
To alleviate both of these problems, we added a new loss component proportional to the reciprocal of the determinant of the Jacobian of the learned transformation, $1/\lvert\det D\eta(x)\rvert$. That is, we require that the learned transformation does not collapse the input. It is sufficient for the corresponding weighting factor to be a very small nonzero value. The addition of this term to our loss helps us avoid falling early in training into the unrecoverable local minimum described above and also helps keep the scale of the transformed variables $\tilde q$ and $\tilde p$ macroscopic.
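A sketch of such a composite loss, combining the Hamiltonian terms, the autoencoder reconstruction error, and the reciprocal-Jacobian-determinant penalty, is given below; the weights are placeholders, and `encoder`, `decoder`, and `H_net` stand for small Keras models of the kind sketched earlier.

```python
# Sketch of the composite loss of Eq. (14): time derivatives are pushed through
# the encoder Jacobian, and a small penalty on 1/|det J| discourages collapse.
import tensorflow as tf

def composite_loss(encoder, decoder, H_net, x, xdot, x0, H0=0.0,
                   w_rec=1.0, w_jac=1e-4, alpha=(1.0, 1.0, 1.0, 0.0)):
    # x, xdot: float tensors of shape (batch, d); x0: tensor of shape (1, d)
    with tf.GradientTape() as tape_enc:
        tape_enc.watch(x)
        z = encoder(x)                              # z = (q~, p~), shape (batch, 2)
    J = tape_enc.batch_jacobian(z, x)               # encoder Jacobian, (batch, 2, d)
    zdot = tf.einsum("bij,bj->bi", J, xdot)         # chain rule: z_dot = J x_dot
    qdot, pdot = zdot[:, 0:1], zdot[:, 1:2]

    with tf.GradientTape() as tape_H:
        tape_H.watch(z)
        H = H_net(z)
    dH = tape_H.gradient(H, z)
    dHdq, dHdp = dH[:, 0:1], dH[:, 1:2]

    a1, a2, a3, a4 = alpha
    loss = a1 * tf.reduce_mean((qdot - dHdp) ** 2)
    loss += a2 * tf.reduce_mean((pdot + dHdq) ** 2)
    loss += a3 * (H_net(encoder(x0))[0, 0] - H0) ** 2            # pinning
    loss += a4 * tf.reduce_mean((dHdq * qdot + dHdp * pdot) ** 2)  # conservation
    loss += w_rec * tf.reduce_mean((decoder(z) - x) ** 2)        # reconstruction
    if x.shape[-1] == 2:                                         # square Jacobian only
        loss += w_jac * tf.reduce_mean(1.0 / tf.abs(tf.linalg.det(J)))
    return loss
```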
B. Example: Linear transformation of the pendulum
We generate data $(q,p)$ from a rectangular region of the phase space and then transform the region linearly with a matrix $A$. The matrix $A$ is the inverse of a composition of a scaling followed by a rotation. Our observation data are thus given by $x = A\,(q,p)^\top$. Using the true Hamiltonian, we additionally compute true values for $\dot q$ and $\dot p$ and then use $A$ to propagate these to the observed time derivatives, $\dot x = A\,(\dot q,\dot p)^\top$, where the partial derivatives of the transformation are computed analytically (here, they are just the elements of $A$ itself).
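A sketch of this observation step is shown below; the scaling factors and rotation angle are made-up placeholders standing in for the values used in the paper, which are not reproduced here.

```python
# Sketch of the linear observation setup: x = A (q, p)^T, with the time
# derivatives propagated by the same matrix. All numbers are placeholders.
import numpy as np

theta = 0.5                                   # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([2.0, 0.5])                       # assumed scaling factors
A = np.linalg.inv(R @ S)                      # observation matrix (scaling, then rotation)

def observe(q, p, qdot, pdot):
    """Map states and their time derivatives to the observed variables."""
    x = A @ np.stack([q, p])                  # x = A (q, p)^T
    xdot = A @ np.stack([qdot, pdot])         # chain rule: the Jacobian is A itself
    return x, xdot
```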
Our network is then presented with the observation data $x$ and its corresponding time derivatives $\dot x$. Its task is to learn the encoder $\eta$ and decoder $\eta^{-1}$, which convert to and from variables $(\tilde q,\tilde p)$ (symplectomorphic to the original $(q,p)$) and a Hamiltonian $\tilde H$ in this new space. When evaluating the loss, the time derivatives of $\tilde q$ and $\tilde p$ are likewise computed via automatic differentiation using the chain rule through the learned transformation, e.g., $\dot{\tilde q} = \nabla_x\tilde q(x)\cdot\dot x$. Note also that $x = A\,(q,p)^\top$, so if the original space could be found, the encoder would satisfy $\eta(x) = A^{-1}x$. This cannot be expected given only the data in $x$; we can only expect that the recovered coordinates $(\tilde q,\tilde p)$ are related to the original $(q,p)$ by a symplectomorphism.
We could learn $\eta$ from a general class of nonlinear functions, as a small neural network, but here, we simply learn $\eta$ and $\eta^{-1}$ as linear transformations [that is, we have a linear “neural network” with no hidden layers in Eq. (11) and the activation fixed to the identity]. As we include the reconstruction error of this autoencoder in our loss function, $\eta^{-1}$ is constrained to be the inverse of $\eta$ to a precision no worse than that of the derivative-matching terms in Eq. (14), after all three are scaled by their corresponding weights. In fact, for the linear case, the autoencoder’s contribution to the loss is initially significantly lower than the Hamiltonian components (see Fig. 3), but, as training proceeds and the derivative-matching terms improve (at the initial expense of raising the autoencoder loss), larger reductions in the total loss are possible by optimizing the Hamiltonian terms rather than the reconstruction, so the reconstruction error subsequently decreases roughly as quickly as the larger of the two derivative-matching terms. That is, the autoencoder portion of the loss falls quickly to the level where it no longer contributes appreciably to the total loss, given its weighting in the loss sum.
Learned Hamiltonian after a linear transformation of the phase space. (a) (Left) The true function $H$ is evaluated on a grid of $(q,p)$ values. (a) (Right) The learned function $\tilde H$ on $(\tilde q,\tilde p)$ is pulled back onto the original space. (b) Linear symplectomorphism between the original and the learned space.
We find that the learned symplectomorphism, depicted in Fig. 3(b), keeps $q$ unmixed with $p$ in one or the other of its two discovered coordinates. This is because two simple families of maps, each of which leaves a function of $q$ alone as one of the new coordinates and is parametrized by an arbitrary smooth function, are symplectomorphisms. They are special because they even preserve the Hamiltonian formulation. For the first of these families, the transformation of the Hamiltonian can be seen from the derivation in Eqs. (15)–(20): the first equalities of Eqs. (15) and (16) follow from the form of the map, and the last equalities of Eqs. (15) and (16) follow from the requirement that the new variables follow Hamiltonian dynamics with respect to the new Hamiltonian $\tilde H$. Equations (18) and (20) then show that the new Hamiltonian is the same as the old one (modulo an additive constant) when considered as a function of the old coordinates.
C. Example: Nonlinear transformation of the pendulum
In addition to the linear transformation of Sec. IV B, we show comparable results for a nonlinear observation map and a nonlinear learned encoder $\eta$. Specifically, we transform the data $(q,p)$ through a nonlinear map $x=\omega(q,p)$ whose inverse is available in closed form. We use the analytical Jacobian of this map to compute the necessary time derivatives $\dot x$ for input to our network.
We proceed as before, except that we no longer restrict the form of the learned $\eta$ and $\eta^{-1}$ to linear transformations, but instead allow small multilayer perceptrons of a form similar to that used for $\tilde H_\theta$.
The resulting induced symplectomorphism is again one that appears to keep $q$, approximately monotonically increasing or decreasing, in one of the discovered coordinates $\tilde q$ or $\tilde p$. This can be seen in Fig. 4.
Results after a nonlinear transformation of the phase space. (a) (Left) The true function $H$, pushed forward to the observed space. (b) The learned function $\tilde H$ on $(\tilde q,\tilde p)$. A grid is now taken in the observed space and transformed through the learned encoder for plotting at corresponding points. As is also true for the linear case, the sign of the learned $\tilde H$ may be flipped depending on whether we learn a representation of $q$ that is monotonically increasing or decreasing. Bottom: Nonlinear symplectomorphism between the original and the learned space. Again, we find that $q$ is preserved (nearly) unmixed in the discovered space. For this experiment, the encoder and decoder each have three hidden layers of width five.
D. Example: Constructing a Hamiltonian system from nonlinear, high-dimensional observations of q, p
As a further demonstration of the method, we use a graphical rendering of the moving pendulum example from before as the transformation from the intrinsic state $(q,p)$ to an image as our high-dimensional observable. We use a symplectic semi-implicit Euler method,

$$p_{n+1} = p_n - \Delta t\,\sin(q_n), \qquad q_{n+1} = q_n + \Delta t\,p_{n+1},$$

to generate trajectories for various initial conditions and then a simple graphical renderer to display these as images [see Fig. 5(a)]. When rendering our video frames, we drag a tail of decaying images behind the moving pendulum head so that information about both position and velocity is present in each rendered frame. This is done by iterating over each trajectory and, for each step $n$, (1) multiplying the entire current image by a constant cooling factor, (2) adding a constant heating amount to the image in a circle of fixed radius centered around the current pendulum head position, and (3) clipping the image per pixel to a fixed intensity range. Samples of the resulting images are visible in Fig. 5(a).
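A minimal sketch of this data-generation step is shown below; the image size, time step, cooling and heating constants, and tail radius are placeholder choices, not the values used for the figures.

```python
# Sketch: symplectic semi-implicit Euler step for the pendulum and a simple
# renderer that drags a decaying "tail" behind the pendulum head.
import numpy as np

def semi_implicit_euler(q, p, dt=0.05):
    p_next = p - dt * np.sin(q)      # update momentum using the old position
    q_next = q + dt * p_next         # update position using the new momentum
    return q_next, p_next

def render_trajectory(q0, p0, n_steps=200, size=32, cool=0.9, heat=0.5, radius=2.5):
    img = np.zeros((size, size))
    yy, xx = np.mgrid[0:size, 0:size]
    q, p = q0, p0
    frames = []
    for _ in range(n_steps):
        q, p = semi_implicit_euler(q, p)
        cx = size / 2 + 0.4 * size * np.sin(q)   # pendulum head position in the image
        cy = size / 2 + 0.4 * size * np.cos(q)
        img *= cool                                                  # (1) cool the image
        img[(xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2] += heat  # (2) heat the head
        img = np.clip(img, 0.0, 1.0)                                 # (3) clip per pixel
        frames.append(img.copy())
    return frames
```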
(a) Principal component analysis (PCA) autoencoder reconstructions. The autoencoder is trained to reproduce PCA projections of the monochrome images on the left, where velocity information is available in the length of the tail dragged behind the moving pendulum head. We show in the column “Approx” these reconstructions passed back through the approximate inverse of the PCA projection. Note that only the magnitude of $p$ is preserved, which can be predicted by the fact that only the square of $p$ is required for computing $H$. (b) Learned Hamiltonian. For each of the images in our dataset, we have an associated known $(q,p)$ pair. Here, we plot these values, colored by the learned $\tilde H$. For this experiment, the encoder has two hidden layers of widths six and four, respectively, and the decoder mirrors this structure.
Though we do not use $p$ directly in this procedure, its value is observed in the length of the resulting tail dragged behind the moving pendulum head. We create trajectories long enough that the effect of the initial formation of the tail (during which its length is not necessarily a good indicator of $p$) is no longer visible and then use only the final two observations from each trajectory.
In order to make the approach agnostic to the data, we do not want to assume that the position space is periodic, so instead, we use a four-dimensional phase space with elements $(\tilde q_1,\tilde q_2,\tilde p_1,\tilde p_2)$ and consider the splitting into $\tilde q=(\tilde q_1,\tilde q_2)$ and $\tilde p=(\tilde p_1,\tilde p_2)$ during training. In the space of input images, the data manifold does not fill up four-dimensional space but a cylinder, which is mapped to the four-dimensional encoding layer by the autoencoder.
In addition, to simplify the learning problem, we learn the encoder as the combination of a projection onto the first 20 principal components of the training dataset followed by a dense autoencoder, reserving learning an end-to-end convolutional autoencoder for future work. The encoding provides $(\tilde q,\tilde p)$ and, as before, we learn $\tilde H$ in tandem with the transformation, where the conditions of Eq. (14) are now upgraded to their vector equivalents to accommodate the two-dimensional $\tilde q$ and $\tilde p$.
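A sketch of this two-stage encoder is given below; the layer widths follow the text, while the use of scikit-learn for the PCA step, the activation choices, and all other settings are placeholders.

```python
# Sketch: a fixed PCA projection onto 20 components, followed by a small dense
# autoencoder whose four-dimensional bottleneck is split into (q~, p~).
import tensorflow as tf
from sklearn.decomposition import PCA

def build_pipeline(images):                              # images: (N, H, W) array
    flat = images.reshape(len(images), -1)
    pca = PCA(n_components=20).fit(flat)                 # fixed linear preprocessing
    encoder = tf.keras.Sequential([
        tf.keras.layers.Dense(6, activation="tanh"),     # hidden widths from the text
        tf.keras.layers.Dense(4, activation="tanh"),
        tf.keras.layers.Dense(4, activation=None),       # bottleneck: (q~1, q~2, p~1, p~2)
    ])
    decoder = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation="tanh"),     # mirrored structure
        tf.keras.layers.Dense(6, activation="tanh"),
        tf.keras.layers.Dense(20, activation=None),      # back to the PCA coefficients
    ])
    return pca, encoder, decoder

# usage sketch: z = encoder(pca.transform(flat)); q_tilde, p_tilde = z[:, :2], z[:, 2:]
```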
In Sec. IV A, we added a loss term proportional to the reciprocal of the determinant of the transformation’s Jacobian in order to avoid transformations that collapsed the phase space. Here, this was not such an issue; of course, some collapsing of the high-dimensional representation is obviously required. Instead, a common mode of failure turned out to be learning constant functions, which automatically satisfy the Hamiltonian requirements (a constant is naturally a conserved quantity). To avoid this, we considered several possible ways to promote a nonflat function, ultimately settling on (a) adding a term that encourages the standard deviation of the learned Hamiltonian values over a batch to be nonzero and (b) minimizing not just the mean squared error in our derivative-matching terms, but also the maximum squared error, to avoid trivial or bilevel functions. The spherical Gaussian prior used in training variational autoencoders34 also avoids, as a by-product, learning constant functions.
The result, shown in Fig. 5(b), was a pulled-back Hamiltonian that, at least in broad strokes, resembles the truth and that satisfies the Hamiltonian conditions with a small residual.
V. CONCLUSIONS
We described an approach to approximate Hamiltonian systems from observation data. It is a completely data-driven pipeline to (a) construct an appropriate phase space and (b) approximate a Hamiltonian function on the new phase space, from nonlinear, possibly high-dimensional observations (here, movies).
When only transformations of the original Hamiltonian phase space can be observed, it is only possible to recover a symplectomorphic copy of the original phase space, with additional freedom in a constant scaling of the coordinates and an additive constant in the Hamiltonian function. If no additional information about the original space is available, this is a fundamental limitation and not particular to our approach. It is not even necessary that there is an “original phase space” at all, and so, the resulting symplectic phase space is by no means unique. In fact, Darboux's theorem implies the opposite.35,36 There is no local obstruction to a symplectic structure, so all symplectic manifolds of dimension $2n$ are locally symplectomorphic to the standard flat space $\mathbb{R}^{2n}$. The choice of a single such space has to rely on other factors, such as, possibly, interpretability by humans or simplicity of the equations.
The approach may be extended to time-dependent Hamiltonian functions. This would allow us to cope with certain dissipative systems.16 An even broader extension may allow transformations to arbitrary normal forms as the “target vector field” and thus would not be constrained to Hamiltonian systems. In the general case, it will become important to explore whether the transformation we approximate remains bounded over our data or whether it starts showing signs of approaching a singularity, suggesting that the problem may not be solvable.
ACKNOWLEDGMENTS
This work was funded by the US Army Research Office (ARO) through a Multidisciplinary University Research Initiative (MURI) and by the Defense Advanced Research Projects Agency (DARPA) through their Physics of Artificial Intelligence (PAI) program, and through agreement HR001118C0100 (I.M.).