A Gauss-Newton method for iterative optimization of memory kernels for generalized Langevin thermostats in coarse-grained molecular dynamics simulations

In molecular dynamics simulations, dynamically consistent coarse-grained (CG) models commonly use stochastic thermostats to model friction and fluctuations that are lost in a CG description. While Markovian, i.e., time-local, formulations of such thermostats allow for an accurate representation of diffusivities and long-time dynamics, a correct description of the dynamics on all time scales generally requires non-Markovian, i.e., non-time-local, thermostats. These thermostats typically take the form of a Generalized Langevin Equation (GLE) determined by a memory kernel. In this work, we use a Markovian embedded formulation of a position-independent GLE thermostat acting independently on each CG degree of freedom. Extracting the memory kernel of this CG model from atomistic reference data requires several approximations; the task is therefore best understood as an inverse problem. While our recently proposed approximate Newton scheme (IOMK) allows for the iterative optimization of a memory kernel, the Markovian embedding remained potentially error-prone and computationally expensive. In this work, we present a Gauss-Newton scheme (IOMK-GN), based on IOMK, which allows for a direct parameterization of a Markovian embedded model.


I. INTRODUCTION
The principles of statistical mechanics can be used to derive coarse-grained (CG) models for computer simulations of condensed-phase systems. Their formal derivation involves averaging over atomic degrees of freedom (DoFs) that are considered irrelevant to the problem under study, a process that leads to the definition of a many-body potential of mean force (MB-PMF).1,2,[9][10] Since CG molecular dynamics (CG-MD) simulations are based on Hamiltonian dynamics, they misrepresent the dynamic properties of the underlying fine-grained (FG) system.11,[13][14][15] It is well known that neglecting dissipative interactions results in a speed-up of dynamics,11,[16][17][18] and more generally in a mismatch of absolute and relative time scales for different molecular processes.19,20,[21][22][23][24][25][26][27][28][29] While in principle an exact CG equation of motion (EoM) can be derived using a projection operator formalism (Mori-Zwanzig theory),12,14,30 in practice, certain approximations and assumptions must be made to derive tractable models.[31][32][33][34] Therefore, parameterizing a CG model directly from FG reference data requires a good understanding of the relationship between the modeled and the exact CG EoM. Otherwise, the quality of the resulting CG model is difficult to predict.
While the optimization can be performed by brute-force sampling of the free parameters,20,38 efficient iterative methods based on physically motivated Jacobians can significantly reduce the computational costs.[40][41] In our previous work,[26][27][28][29] the Markovian embedding required fitting the memory kernel with a (potentially large) set of parameters in every iteration. In this work, we formulate a procedure, referred to as IOMK-GN, which allows us to bypass this fitting step by using a Gauss-Newton type algorithm to optimize the parameters of the auxiliary-variable GLE (aux-GLE) thermostat directly.
The remainder of this work is structured as follows: We summarize the non-Markovian GLE thermostat and its Markovian formulation in an extended phase space in Secs. II A and II B, respectively, and formulate their parameterization as an inverse problem in Sec. II C. In Secs. II D and II E, we describe the Gauss-Newton method and introduce the regularization strategies we apply to improve the convergence and stability of the proposed IOMK-GN method. In Sec. III, we summarize the iterative procedure. We test IOMK-GN on coarse-grained liquid ethanol and compare its performance with the original IOMK method in Sec. IV. Finally, we discuss the main results in Sec. V and conclude with a summary and outlook in Sec. VI.
Additional strategies applied for improved stability and convergence, tests of the proposed regularization strategies, and computational details on the FG and CG simulations are described in the Appendix.

A. The isotropic GLE thermostat
Consider a classical FG Hamiltonian reference system of n particles with positions and momenta r, p ∈ R^{3n}. In systematic coarse-graining, a mapping scheme defines the transformation r → R and p → P, where R, P ∈ R^{3N} are the positions and momenta of CG beads with N < n.7 We assume that the CG DoFs are defined by linear combinations of the FG DoFs, e.g., the mapping of a set of atoms onto its center of mass (CoM). For notational simplicity, we assume that all CG beads are equivalent and that the system is isotropic.
As a CG EoM, we consider the GLE

M dV_I(t)/dt = F_I(t) = F^C_I(R(t)) − ∫_0^t K(t − s) P_I(s) ds + δF^R_I(t),   (1)

where F_I(t) is the total force acting on the Ith CG bead at time t. The conservative force, given by the negative gradient of a CG potential, is denoted by F^C_I. The dissipative force is represented by the convolution of the thermostat's memory kernel K(t) with the CG bead's momentum P_I and is balanced by a random force denoted δF^R_I, with zero mean and autocorrelation

⟨δF^R_I(t) · δF^R_J(t′)⟩ = 3 δ_IJ M k_B T K(|t − t′|).   (2)

By imposing the fluctuation-dissipation theorem (FDT),22,35,42 i.e., Eq. 2, the dissipative and random forces together act as a thermostat, ensuring canonical sampling. In Eq. 2, M is the particle mass, T is the temperature, and k_B is the Boltzmann constant.

B. Markovian embedded GLE thermostat
Non-Markovian models of the type of Eq. 1 can be rendered Markovian in an extended phase space. For the Ith particle along one dimension d ∈ {x, y, z}, the Markovian EoM reads39

(Ṗ_{I,d}, ṡ_I)^T = (F^C_{I,d}, 0)^T − A (P_{I,d}, s_I)^T + B ξ(t).   (3)
Here, ξ is a vector of uncorrelated random numbers with zero mean and unit variance, s ∈ R^h is a set of h auxiliary momenta, which are coupled to the CG particle's momentum via the drift matrix A ∈ R^{(h+1)×(h+1)}, and B ∈ R^{(h+1)×(h+1)} is the diffusion matrix. To link A in Eq. 3 to K(t) in Eq. 1, we consider a drift matrix of the form

A = (  0      A_Ps^T )
    ( −A_Ps   A_ss   ),   (4)

with A_Ps ∈ R^h and A_ss ∈ R^{h×h}. By integrating out the auxiliary variables in Eq. 3, Eq. 1 is recovered with the relationship43

K(t) = A_Ps^T e^{−A_ss t} A_Ps.   (5)

Canonical sampling is ensured by the FDT, which can only be fulfilled if A + A^T is positive (semi-)definite.43 This can be easily enforced by constructing A as the sum of a positive diagonal matrix and a skew-symmetric matrix, i.e., A_kk ≥ 0 for all k and A_kl = −A_lk for all k ≠ l. Li et al.23,44 proposed to construct A_ss as a block-diagonal matrix with blocks of size two, such that Eq. 5 can be rewritten as a sum of damped oscillators (see Sec. S4 A in the supplementary material). We employed the same strategy in our previous work,15,[27][28][29] but this approach has the disadvantage that, with only 2h free parameters, a large part of A is not utilized. In principle, limiting the fitting function to a sum of damped oscillators is unnecessary, as Eq. 5 can be efficiently computed (see Sec. S3 in the supplementary material for some technical details). Fully exploiting A yields (h + 1)(h + 2)/2 − 1 free parameters, which can represent more complex memory kernels for a given number of auxiliary variables.
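As a concrete illustration, the construction of A as a positive diagonal plus a skew-symmetric part and the evaluation of Eq. 5 can be sketched as follows (a minimal numpy sketch; the function names, the eigendecomposition-based matrix exponential, and the sign convention for the coupling block are our own choices, not the reference implementation):

```python
import numpy as np

def build_drift_matrix(diag, skew_upper):
    """Assemble A = D + S with D a positive diagonal matrix and S
    skew-symmetric, so that A + A^T = 2D is positive semi-definite
    (the FDT condition quoted in the text)."""
    n = len(diag)
    A = np.diag(np.asarray(diag, dtype=float))
    iu = np.triu_indices(n, k=1)
    A[iu] += skew_upper
    A[iu[1], iu[0]] -= skew_upper
    return A

def memory_kernel(A, t_grid):
    """Evaluate K(t) = A_Ps^T exp(-A_ss t) A_Ps via an eigendecomposition
    of A_ss (assumed diagonalizable); the sign of the coupling block is
    absorbed into A_Ps."""
    A_Ps = A[0, 1:]
    A_ss = A[1:, 1:]
    lam, U = np.linalg.eig(A_ss)
    Uinv = np.linalg.inv(U)
    K = [A_Ps @ (U @ np.diag(np.exp(-lam * t)) @ Uinv) @ A_Ps for t in t_grid]
    return np.real(np.array(K))

# h = 2 auxiliary momenta; skew_upper fills the strict upper triangle row-wise
A = build_drift_matrix(diag=[0.0, 2.0, 3.0], skew_upper=[1.5, 0.5, 4.0])
t = np.linspace(0.0, 5.0, 200)
K = memory_kernel(A, t)  # K(0) = |A_Ps|^2; K decays for t > 0
```

Because A + A^T is positive semi-definite by construction, the kernel decays, and K(0) equals the squared norm of the coupling vector.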

C. The inverse problem
We can now define the inverse problem to be solved. First, note that a VACF, defined as C_VV(t) = ⟨V_I(0) · V_I(t)⟩/3, can always be associated with an integrated single-particle memory kernel,45

G(t) = ∫_0^t K(s) ds,   (6)

via the Volterra-type relation

C_VV(t) = C_VV(0) − ∫_0^t G(t − s) C_VV(s) ds.   (7)

It follows that C^tgt_VV(t) and G^tgt(t) can in principle be used interchangeably as target functions. Thus, given G^tgt(t), we aim to find a drift matrix A that solves the non-linear least-squares problem

min_A Σ_m [G_m(A) − G^tgt_m]²,   (8)

where we introduce G as the equidistantly sampled discrete representation of G(t), with the mth entry G_m = G(m∆t). In the following, we aim to solve Eq. 8 using the well-established Gauss-Newton method.46

D. The Gauss-Newton method

We first introduce the framework and notation in general terms to simplify the discussion in the following sections. For a general non-linear least-squares problem, consider the residual vector r(x) : R^N → R^M, with r(x) = f(x) − f^tgt representing the difference between a function f(x) of a set of parameters x and a target vector f^tgt. The optimal parameters are the solution to

min_x ||r(x)||².   (9)

This problem can be solved iteratively using the Gauss-Newton method by starting from some initial guess x_0 and successively solving

x_{i+1} = x_i − (J^T J)^{−1} J^T r_i,   (10)

where J is the Jacobian of f, given by

J_mn = ∂f_m/∂x_n.   (11)

For brevity of notation, J refers to the Jacobian at the respective iteration, and we define r_i ≡ r(x_i). Eq. 10 can be rewritten as

D ∆_{i+1} = b_i,   (12)

with ∆_{i+1} = x_{i+1} − x_i, D = J^T J, and b_i = −J^T r_i. Note that for N = M we can set D = J and b_i = −r_i and recover Newton's root-finding algorithm.
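A generic Gauss-Newton loop of this form can be sketched in a few lines (a minimal numpy sketch with a forward finite-difference Jacobian; the toy target, an integrated exponential, and all names are illustrative, not the production implementation):

```python
import numpy as np

def gauss_newton(f, x0, f_tgt, n_iter=30, fd_step=1e-6):
    """Plain Gauss-Newton iteration (Eqs. 9-12): repeatedly solve
    (J^T J) delta = -J^T r with a finite-difference Jacobian."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        r = f(x) - f_tgt                      # residual r(x_i)
        J = np.empty((r.size, x.size))
        for m in range(x.size):               # J_nm = d f_n / d x_m
            dx = np.zeros_like(x)
            dx[m] = fd_step
            J[:, m] = (f(x + dx) - f(x)) / fd_step
        x = x + np.linalg.solve(J.T @ J, -J.T @ r)
    return x

# toy target: amplitude and rate of an integrated exponential
t = np.linspace(0.0, 5.0, 50)
model = lambda x: x[0] * (1.0 - np.exp(-x[1] * t))
x_true = np.array([2.0, 1.5])
x_fit = gauss_newton(model, x0=[1.5, 1.2], f_tgt=model(x_true))
```

For a noise-free target in the model class and a reasonable initial guess, the loop converges to the generating parameters within a few iterations.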

E. Regularization
Problems of the form of Eq. 12 are ill-posed when D is not (numerically) invertible. However, even when D is invertible, if the eigenvalues of D are small, the resulting iterations may become unstable. This is commonly countered by approximating Eq. 12 with a regularization scheme47 of the form

(D + Λ) ∆_{i+1} = b_i,   (13)

stabilizing the procedure. The matrix Λ in Eq. 13 is a regularization matrix, which can be chosen such that D + Λ has sufficiently large positive eigenvalues. A common choice is Tikhonov regularization,47 where Λ = αI, with I the N × N identity matrix and α > 0. This regularization scheme damps contributions to ∆_{i+1} from eigenvectors with eigenvalues ≪ α while leaving contributions from eigenvectors with eigenvalues ≫ α unchanged. Eq. 13 can be rewritten as the linearized least-squares problem solved in each iteration,

min_∆ ||J∆ + r_i||² + α||∆||².   (14)

In this formulation, it is evident that choosing a large value for α adds a quadratic bias on the absolute step size, leading to more stable iterations while reducing the convergence rate. The iterative process can then get stuck at a point with a small gradient far from the global optimum.
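The damping behavior of Eq. 13 with Λ = αI can be illustrated on a deliberately ill-conditioned two-parameter example (numpy sketch; the numbers are purely illustrative):

```python
import numpy as np

def regularized_step(J, r, alpha):
    """Solve (J^T J + alpha*I) delta = -J^T r, i.e. Eq. 13 with the
    Tikhonov choice Lambda = alpha*I."""
    D = J.T @ J
    return np.linalg.solve(D + alpha * np.eye(D.shape[0]), -J.T @ r)

# one stiff and one nearly flat direction -> D has a tiny eigenvalue
J = np.diag([1.0, 1e-4])
r = np.array([1.0, 1.0])
step_plain = regularized_step(J, r, alpha=0.0)   # explodes along the flat direction
step_tikh = regularized_step(J, r, alpha=1e-2)   # flat direction damped, stiff one kept
```

The unregularized step is huge along the near-singular direction, while the Tikhonov step suppresses it and leaves the well-conditioned direction almost unchanged, exactly the eigenvalue-dependent damping described above.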
To simultaneously obtain stable iterations and fast convergence, we propose the adaptive regularization matrix given in Eq. 15. Formulating the least-squares problem for the ith iteration (Eq. 16) and combining Eq. 15 with the second term in Eq. 16 yields Eq. 17. To understand the difference between the two regularization schemes, we can consider three distinct scenarios: In the first case, both schemes are equivalent, as Eq. 15 recovers the Tikhonov matrix. In the second case, the adaptive scheme penalizes large updates of the nth parameter less strongly than the Tikhonov regularization, while in the third case, the penalty is increased.
Using the Tikhonov regularization, convergence usually slows down quickly after the first few iterations as ||b||² becomes small (see Sec. S4 in the supplementary material for an exemplary demonstration). With Eq. 15, the regularization is automatically reduced near the optimal solution, allowing for faster convergence without the need to adjust α. Also, in absolute terms, the change of larger parameters is less strongly damped. This means that Eq. 15 allows larger updates for large parameters, while small parameters are adjusted more carefully. We further compare both regularization schemes in Appendix C 1 and find a significantly improved performance with the adaptive scheme for 100 randomly generated memory kernels (for further elaboration, see Sec. S4 in the supplementary material). Note, however, that this adaptive regularization scheme should not be applied without additional adjustments if any parameter x_m can become zero, which we avoid by setting bounds on the parameter space (see Appendix B 3).

III. SUMMARY OF THE ITERATIVE PROCEDURE

A. Preparatory step
With the preceding discussion, we can formulate the full procedure for IOMK-GN, as summarized in the flowchart in Fig. 1. The preparatory step is equivalent in both IOMK-GN and IOMK. First, a FG reference trajectory is produced and mapped. From the mapped trajectory, the target VACF C^tgt_VV(t) and, using Eq. 7, the integrated target memory kernel G^tgt(t) are calculated. Additionally, a CG potential is derived to describe the conservative interactions in the CG system. Using this potential, a (Hamiltonian) CG-MD trajectory is produced to obtain C°_VV(t) and G°(t). Finally, an initial guess A_0 is made (see Appendix B 4 and Sec. S6 in the supplementary material). Then, for every iteration, a CG GLE simulation is carried out using A_i to obtain G_i(t) from C^i_VV(t) with Eq. 7. A_{i+1} is derived using either an IOMK or IOMK-GN update, as described below.

FIG. 1. Schematic summary of the IOMK and IOMK-GN methods. Each rectangle represents a quantity, each arrow represents an operation, and the diamond represents a decision. The procedure is generally divided into two parts: first, a preparation step in which all the properties needed for the procedure are calculated; second, an iteration step that is repeated until a convergence criterion is met or the predefined maximal number of iteration steps is reached.

B. IOMK update
For the IOMK update, we set f := G, f^tgt := G^tgt, and x := G̃, where G̃(t) denotes the integrated thermostat memory kernel. Note that this corresponds to a Newton scheme for root finding. The Jacobian is based on the linear approximation28

G(t) ≈ G°(t) + c(t) G̃(t),   (18)

where c is a function of time that does not depend on G̃. By estimating c with the memory kernel from the previous iteration, we arrive at

G(t) ≈ G°(t) + [G_i(t) − G°(t)] G̃(t)/G̃_i(t).   (19)

Rewriting Eq. 19 in vector notation, we obtain an approximate Jacobian as

J ≈ diag[(G_i − G°) ⊘ G̃_i],   (20)

where ⊘ denotes element-wise division, from which follows the Newton update

G̃_{i+1} = G̃_i + diag[G̃_i ⊘ (G_i − G°)] (G^tgt − G_i),   (21)

as described in our previous work.28 We briefly demonstrate and discuss this approximate linearity in Sec. S2 in the supplementary material. To obtain A_{i+1}, we fit the integrated form of Eq. 5, given by

G̃(t) = A_Ps^T A_ss^{−1} (I − e^{−A_ss t}) A_Ps,   (22)

to G̃_{i+1}(t). For this fit, we use the Gauss-Newton method as described in Sec. II D, with the adaptive regularization scheme at α = 1 and the additional strategies described in Appendix C.
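The Newton-type update above can be sketched compactly, with G_i the single-particle integrated kernel from iteration i, G° that of the Hamiltonian CG model, and G̃_i the integrated thermostat kernel (numpy sketch under the reconstruction that the slope c is estimated as c_i = (G_i − G°)/G̃_i; the zero-division guard is our own addition):

```python
import numpy as np

def iomk_update(G_tgt, G_i, G_0, Gt_i, eps=1e-12):
    """One Newton-type IOMK update of the integrated thermostat kernel Gt:
    the slope c_i = (G_i - G_0)/Gt_i estimated from the previous iteration
    is inverted to map the residual in G onto a change of Gt."""
    c_i = (G_i - G_0) / np.where(np.abs(Gt_i) > eps, Gt_i, eps)
    c_i = np.where(np.abs(c_i) > eps, c_i, eps)
    return Gt_i + (G_tgt - G_i) / c_i

# self-consistency check on an exactly linear toy model G = G_0 + c * Gt
t = np.linspace(0.1, 2.0, 20)
G_0, c = 0.2 * t, 1.0 + 0.5 * t
Gt_i, Gt_star = t, 2.0 * t
Gt_new = iomk_update(G_tgt=G_0 + c * Gt_star, G_i=G_0 + c * Gt_i,
                     G_0=G_0, Gt_i=Gt_i)
```

On a toy model in which G depends exactly linearly on G̃, a single update recovers the target kernel, which is why near-linearity translates into very fast convergence.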

C. IOMK-GN update
For the IOMK-GN update, we set f := G, f^tgt := G^tgt, and x := all distinct entries in A, which defines the Gauss-Newton update step. With these definitions, there is no analytical relation between f_i and x_i, and evaluating the Jacobian would involve at least one CG simulation per parameter in every iteration. To avoid this, we derive an approximate Jacobian from Eq. 19, which yields

J_nm ≈ [(G_{i,n} − G°_n)/G̃_{i,n}] ∂G̃_n/∂x_m |_{A_i},

where G_i is calculated from the previous iteration, G° is known from the preparatory step, G̃_i is calculated from A_i with Eq. 22, and ∂G̃_n/∂x_m |_{A_i} can be efficiently evaluated by finite differences (see Sec. S3 in the supplementary material). Eq. 19 also provides an efficient way to predict f_{i+1} without carrying out simulations. This allows us to optimize the regularization parameter α in every iteration following Appendix B 1, whereby we chose the maximal considered regularization parameter to be α_max = 1, using the adaptive regularization scheme. We additionally apply the strategies described in Appendices B 2 and B 3.
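The key point, that only the cheap analytic map from the drift-matrix parameters to G̃ is differentiated, can be sketched as follows (numpy sketch; the toy map and all names are illustrative, and the prefactor construction follows the reconstruction of Eq. 19 above):

```python
import numpy as np

def iomk_gn_jacobian(x, Gt_of_x, G_i, G_0, fd_step=1e-5):
    """Approximate IOMK-GN Jacobian: J_nm = c_i,n * dGt_n/dx_m, where only
    the analytic map x -> Gt (cf. Eq. 22) is differentiated by finite
    differences, so no CG simulation per parameter is needed."""
    Gt_i = Gt_of_x(x)
    c_i = (G_i - G_0) / Gt_i                 # slope estimate from Eq. 19
    J = np.empty((Gt_i.size, x.size))
    for m in range(x.size):
        dx = np.zeros_like(x)
        dx[m] = fd_step
        J[:, m] = (Gt_of_x(x + dx) - Gt_i) / fd_step
    return c_i[:, None] * J

# toy map with amplitude and rate; choosing G_i = 2*Gt makes c_i = 2 exactly
t = np.linspace(0.1, 3.0, 40)
Gt_of_x = lambda x: x[0] * (1.0 - np.exp(-x[1] * t))
x = np.array([1.5, 2.0])
J = iomk_gn_jacobian(x, Gt_of_x, G_i=2.0 * Gt_of_x(x), G_0=np.zeros_like(t))
```

In the toy setup the amplitude column of J must equal twice the analytic derivative of the map, which makes the construction easy to verify.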

IV. RESULTS
To test the developed IOMK-GN method, we coarse-grain liquid ethanol at 300 K using a single-site center-of-mass mapping scheme, where we derived the conservative forces using iterative Boltzmann inversion (IBI).4 Details on the atomistic and CG simulations and computational details of the coarse-graining procedure are described in Appendix A.
Following the procedure outlined in Sec. III, we applied the IOMK-GN method to optimize the drift matrix for liquid CG ethanol. Drawing on the observations discussed in Appendix C 1, we use the adaptive regularization scheme with α = 1, fit G^tgt(t) − G°(t) to obtain an initial guess A_0, and choose the dimensions of A to be 9 × 9.
Note that we could also use an a priori initial guess for A_0, as described in Appendix B 4 (and demonstrated in Sec. S6 in the supplementary material), but adding one fitting step before the first iteration is computationally cheap and thus a simple way to obtain a good initial guess.
In Fig. 2 a), we compare G_i(t) to the target G^tgt(t) for a few selected iterations. We find that on short time scales, the integrated single-particle memory kernel is well reproduced by the initial guess, while on time scales beyond ≈ 1 ps, deviations steadily increase. After 4 iterations, the integrated single-particle memory kernel visually matches the target.
In Figs. 2 b) and c), we show the VACF and its integral for the same iterations. We note that the overall form of the VACF is already well matched by the initial guess, while the integrated VACF reveals the accumulated error on larger time scales, indicating too-slow diffusion. Figs. 2 b) and c) show that small deviations on short time scales are still present after 4 iterations, and further improvement is achieved in successive iterations. Depending on the desired accuracy, the optimization can be considered converged after 4-6 iterations.
For comparison, we show the performance of the IOMK method in Figs. 2 d)-f). Here, we use the Gauss-Newton method for fitting, as described in Appendix C, in every iteration to extract the aux-GLE parameters. As found in our previous work,28 the IOMK method converges within 2-4 iterations.
As a quantitative measure, the insets in Figs. 2 a) and d) show the respective averaged squared residuals (based on the normalized representation of G^tgt(t), as discussed in Appendix B 2). After 4 iterations, both methods achieve a value smaller than 10^−4, and successive iterations range between 10^−6 and 10^−4 for both methods.
V. DISCUSSION

A. Main findings
Although IOMK-GN solves an inverse problem of higher complexity than the original IOMK method, its efficiency, i.e., the number of iterations and, thus, of independent CG simulations until convergence, is comparable. The main improvements over our previous work[26][27][28][29] and advantages over a naive GN method are:

1. The drift matrix for the aux-GLE thermostat is updated directly in each iteration using an approximation for the Jacobian. As a result, no additional fitting step is required for the Markovian embedding.
2. We use the full drift matrix (see Eq. 22) instead of 2 × 2 blocks in A_ss, as is the case for sums of damped oscillators. The larger number of free parameters allows for the modeling of more complex memory kernels while reducing the number of local minima.
3. The GN update is based on several numerical improvements, such as normalized integrated memory kernels and an adaptive regularization scheme, which increase stability on the one hand and improve convergence on the other.

B. Relation of IOMK and IOMK-GN to other methods
The IOMK method27,28 was originally inspired by the iterative method for memory reconstruction due to Jung et al.33 One goal, in which we succeeded, was to improve the convergence and thus minimize the computational cost by reducing the number of iterations by one to two orders of magnitude.27 Similarly, IOMK-GN was motivated by the work of Wang et al., who proposed using Gaussian process regression (GPR) to optimize aux-GLE parameterizations, which circumvents the problem of memory-kernel fitting.34 While the proposed approach is intriguing, the computational cost can be high due to the need for hundreds of CG simulations. Here, the IOMK-GN method solves an equivalent inverse problem, again with about two orders of magnitude fewer CG simulations.
The rapid convergence of both IOMK and IOMK-GN is based on the approximate but accurate Jacobian derived from Eq. 19, whereas the Jacobian of Jung et al.33 was originally based on the infinite-mass limit,35 which is often a strong approximation in molecular coarse-graining. The GPR method of Wang et al. is a derivative-free method that interpolates between sampled points in the parameter space. This has the advantage that no information about the relationship between the target function and the parameters must be available, which comes at the cost of a thorough sampling of the parameter space.
Thus, the strategies presented in this work represent a significant step towards simplifying the development of dynamically consistent CG models. Furthermore, we believe that similar ideas can be applied when parameterizing different dissipative thermostats. Thus, studying the effect of dissipative thermostats of different flavors on a broader range of dynamic properties could yield further relationships similar to Eq. 19.[50][51]

FIG. 2. Results for IOMK-GN (panels a)-c)) and IOMK (panels d)-f)). The top, middle, and bottom rows show the integrated single-particle memory kernel, the VACF, and the integrated VACF for a few selected iterations, compared to the respective target. The insets in a) and d) show the averaged squared residuals in the normalized representation.

In some studies using dissipative particle dynamics (DPD),52,53
a hybrid approach has been used, in which the distance dependence of the pair friction was either fixed a priori54 or determined by bottom-up techniques,38 while the remaining parameters were determined by brute-force sampling.38,54 While this is feasible in Markovian models as long as the number of target properties and free parameters is small, systematic derivative-based optimization approaches similar to IOMK-GN should be more effective for more complex systems, such as molecular mixtures and polymer blends.

VI. SUMMARY AND OUTLOOK
In this work, we developed and implemented an efficient Gauss-Newton method (IOMK-GN) to iteratively parameterize a Markovian-embedded GLE for CG models. We have shown that the approximate linearity between the single-particle integrated memory kernel and the integrated thermostat memory kernel in many-body particle-based CG simulations (Eq. 18) can be exploited to define a cost-effective way to estimate the Jacobian of the single-particle integrated memory kernel with respect to the drift-matrix parameters of the aux-GLE thermostat. In the context of systematic coarse-graining, IOMK-GN can be used to derive CG models that are dynamically consistent with atomistic models on all time scales within a few iterations. During the development of IOMK-GN, we solved several numerical challenges related to Markovian embedding by least-squares minimization, resulting in an efficient implementation that iteratively solves the Markovian embedding problem and thereby resolves all major numerical problems of using IOMK with Markovian-embedded GLEs. In terms of convergence and stability, the performance of IOMK-GN and IOMK is comparable, so choosing one over the other remains a matter of preference. The present results, combined with those of our previous work,27 indicate that IOMK and IOMK-GN are orders of magnitude more efficient than previously available methods33,34 when applied to equivalent tasks. The implemented methods are available on GitHub (https://github.com/vklip/d-cg).
In this work, we have focused on the technical aspects of solving an inverse problem using an EoM without rigorously justifying its form. Thus, we want to emphasize that the considered GLE is approximate, and limitations in its ability to accurately reproduce all aspects of an underlying FG system are to be expected.[56][57][58] While previous work suggests that quantitatively capturing all aspects of the joint dynamics is not always possible when using configuration-independent thermostats,28,59 some properties, such as the distinct van Hove function, can be represented quite accurately when the single-particle dynamics are well matched.28,29 With the improvements in numerical reliability and ease of use, IOMK and IOMK-GN can be more easily applied to complex systems, and the scope and limitations of the isotropic GLE thermostat in accurately representing a CG system can be explored more easily than before. The extension to mixtures is straightforward,29 and it should be possible to derive dynamically consistent CG models with bonded interactions and different bead types. In particular, when the long-time dynamics are governed by a cascade of processes relaxing on shorter time scales, as in polymer systems, it is conceivable that, by targeting processes with shorter relaxation times, IOMK or IOMK-GN can be used to derive CG models with bonded interactions that accurately represent the long-time dynamics informed only by short reference trajectories.
Of course, there are applications where a configuration-independent GLE thermostat formulation is insufficient to capture the relevant physics accurately. For example, the aux-GLE thermostat is not momentum conserving and thus should not be naively applied in non-equilibrium simulations.60 Dissipative particle dynamics (DPD)52,53 (and possibly non-Markovian DPD23,61) models are better suited for such applications.60 Some authors, after determining the functional form of the configuration-dependent friction,[50][51] optimized the absolute amplitude to match the diffusion coefficient or viscosity.38,54 This optimization is often done by brute-force sampling of parameters, which is feasible as long as the number of free parameters and target observables is small. As long as a sufficiently accurate approximate Jacobian can be found, systematic Newton or Gauss-Newton methods can be explored. This would allow one to tackle more complex problems, such as mixtures or, more generally, mapping schemes with multiple bead types.
Appendix A: Computational details

1. Atomistic ethanol simulations
The FG atomistic reference simulations of liquid ethanol were performed with LAMMPS62 using the OPLS/AA force field.63,64 A cutoff of 1.1 nm was applied for Lennard-Jones and real-space electrostatic interactions, while long-range electrostatics were treated by the particle-particle particle-mesh method. We used MolTemplate65 to generate an initial configuration of 2744 ethanol molecules. All FG simulations were carried out with a timestep of 0.5 fs. An energy-minimized structure was equilibrated under NVT conditions at 300 K with a Nosé-Hoover thermostat using a time constant of 50 fs. The simulation box size was then relaxed under NPT conditions at a pressure of 1 atm using the Nosé-Hoover barostat with a time constant of 500 fs. A tail correction was applied for pressure calculations. The average box length was evaluated from an additional NPT run of 200 ps and set to 6.42778 nm for all other simulations. A final NVT equilibration run was performed using a Langevin thermostat with a time constant of 500 fs to generate an equilibrated structure for the production run.
To calculate the center-of-mass (COM) radial distribution function (RDF), a 0.5 ns trajectory was generated using a Langevin thermostat with a time constant of 500 fs. The COM RDF was computed using the VOTCA package.66 In addition, ten snapshots were taken to be used as initial states for independent runs to generate reference dynamic data, i.e., the VACF. From these ten initial snapshots, independent NVT trajectories were generated using the Nosé-Hoover thermostat with a time constant of 2.5 ps and a length of 300 ps each. Frames were saved every 10 fs. From these runs, ten independent VACFs were calculated and averaged to serve as a reference (Fig. 2 b) and e)). The target integrated memory kernel G^tgt(t) was derived from this average via Eq. 7.

2. Coarse-grained conservative interactions
The RDF from the mapped FG trajectory was used as the target to generate a CG potential with the IBI method4 as implemented in VOTCA.66 Each IBI iteration was started from the same initial structure, a mapped equilibrated snapshot of the FG NVT system. We used tabulated potentials with a cutoff of 0.9 nm, a time step of 2 fs, and a Langevin thermostat with a time constant of 0.2 ps. In each iteration, velocities were initialized according to the Maxwell-Boltzmann distribution at 300 K. The system was equilibrated for 40 ps, after which every 100th frame of a 200 ps simulation was saved. The IBI method was run for 300 iterations. The COM RDFs (g(R)) of the FG and the final IBI model are compared in Fig. 3.

3. Coarse-grained simulations without a dissipative thermostat
We performed standard CG-MD simulations to evaluate G°(t), the integrated single-particle memory kernel of the Hamiltonian CG model. Velocities were initialized according to the Maxwell-Boltzmann distribution at 300 K. The system was equilibrated for 100 ps using a Langevin thermostat with a time constant of 0.2 ps. A Nosé-Hoover thermostat with a time constant of 4 ps was used for production. The trajectories were generated by 10 × 400 ps simulations, saving frames every 10 fs after the first 100 ps. By skipping the first 100 ps of each subtrajectory, we generate sufficiently independent trajectories to ensure proper convergence of the subsequently computed VACF (C°_VV(t)).
Appendix B: Numerical considerations

1. Update optimization by varying regularization
A common approach to improving the performance of iterative methods is to optimize the iteration step. For example, a line search along the ∆_{i+1} direction specified by the initial scheme can be used to find the parameters that locally minimize the residual along the update direction.46 We, however, choose an alternative approach, in which the regularization parameter α, and thus both the norm and the direction of the iteration step, are optimized simultaneously.
To this end, consider a cheap surrogate function f̃(x) ≈ f(x) that can be evaluated without carrying out simulations (cf. Eq. 19). Fixing the maximum value of α to α_max, we compute a Gauss-Newton step ∆_{i+1}(ε) for a set of scaling parameters 0 < ε ≤ 1, with α = εα_max, and choose the step that minimizes the predicted residual ||f̃(x_i + ∆_{i+1}(ε)) − f^tgt||². This is a computationally cheap way to prevent an a priori chosen regularization parameter from strongly slowing down convergence. We chose ε ≤ 1, such that the size of the update step can only be increased. While this may cause some iterations to increase the residuals, it prevents the procedure from getting stuck in regions where progress would be slow with stronger regularization. As long as α_max is chosen such that ε = 1 yields stable iterations, this strategy is expected to speed up convergence. We note, however, that ||r||² consequently does not necessarily decrease with every iteration, which can even be an advantage, as local minima of ||r||² can be overcome.
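This ε-scan can be sketched as follows (numpy sketch; the linearization f(x + ∆) ≈ f(x) + J∆ stands in for the cheap surrogate here, and the grid values and names are our own choices):

```python
import numpy as np

def optimized_step(J, r, alpha_max, eps_grid=(1.0, 0.3, 0.1, 0.03, 0.01)):
    """Scan alpha = eps*alpha_max and keep the step whose *predicted*
    residual ||r + J delta||^2 is smallest, so the regularization is only
    ever relaxed relative to alpha_max."""
    best_pred, best_delta = None, None
    for eps in eps_grid:
        D = J.T @ J + eps * alpha_max * np.eye(J.shape[1])
        delta = np.linalg.solve(D, -J.T @ r)
        pred = float(np.sum((r + J @ delta) ** 2))
        if best_pred is None or pred < best_pred:
            best_pred, best_delta = pred, delta
    return best_delta

# for a well-conditioned J the scan simply relaxes the regularization
delta = optimized_step(np.eye(2), np.array([1.0, 1.0]), alpha_max=1.0)
```

For a well-conditioned problem the smallest ε wins and the step approaches the unregularized Gauss-Newton step, while for ill-conditioned directions the predicted residual keeps the stronger damping in play.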

2. Normalization of memory kernels
To optimize A, a reasonable initial guess and appropriate bounds on the parameter space should be used. Good choices for the initial guess and the bounds are, however, system-specific and not directly transferable. To ensure that a single heuristic choice (described in Appendices B 4 and B 3) applies to a wide range of systems, we propose to normalize the target memory kernel. The optimization then yields a normalized drift matrix, which must be renormalized to parameterize the thermostat.
To understand the normalization process, consider a discretized memory kernel G ∈ R^M, equally spaced in time, such that t_max = M∆t is the time up to which the memory kernel is considered in the optimization process. We define a normalized time axis as t^norm_m = t_m/t_max. Similarly, the memory kernel is normalized to its maximum value G_max = max(G), defining the normalized memory kernel G^norm = G/G_max. The optimization is performed on the normalized data, and we extract the normalized drift matrix A^norm, from which A can be determined by rescaling back to physical units, i.e., A_ss = A^norm_ss/t_max and A_Ps = (G_max/t_max)^{1/2} A^norm_Ps.
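In code, the normalization and the subsequent renormalization might look as follows (numpy sketch; the rescaling relations A_ss → A_ss/t_max and A_Ps → √(G_max/t_max) A_Ps are inferred here from dimensional analysis of the integrated kernel, Eq. 22, and are an assumption of this sketch):

```python
import numpy as np

def normalize_target(t, G):
    """Map target data onto normalized axes: t/t_max and G/G_max."""
    return t / t[-1], G / np.max(G)

def renormalize_drift(A_norm, t_max, G_max):
    """Undo the normalization for an optimized drift matrix: rates in
    A_ss scale as 1/t_max, couplings A_Ps as sqrt(G_max/t_max), so that
    both the plateau and the time constants of G(t) are restored."""
    A = np.array(A_norm, dtype=float)
    A[1:, 1:] /= t_max
    A[0, 1:] *= np.sqrt(G_max / t_max)
    A[1:, 0] *= np.sqrt(G_max / t_max)
    return A

# h = 1 example: normalized rate 2 and coupling 1
A_norm = np.array([[0.0, 1.0], [-1.0, 2.0]])
A = renormalize_drift(A_norm, t_max=4.0, G_max=3.0)
# plateau A_Ps^2/A_ss: normalized 1/2 -> physical 3 * (1/2) = 1.5
plateau = A[0, 1] ** 2 / A[1, 1]
```

For the one-exponential example, the renormalized plateau is exactly G_max times the normalized plateau, and the physical time constant is t_max times the normalized one, which is the consistency the rescaling must satisfy.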

3. Bounds
The parameters making up the drift matrix A are restrained by the FDT, which, with Eq. 4, requires that (A_ss)_kk ≥ 0 for all k. Other than that, the parameters in the drift matrix only have to be in R\{0} to guarantee lim_{t→∞} G(t) > 0.
To increase numerical stability, we chose to restrict the parameter space further. For the off-diagonal parameters of A, changing the sign does not affect the memory kernel. Still, we want to prevent sign reversals, since an update that attempts to reduce an entry (A_i)_kl might otherwise result in |(A_{i+1})_kl| > (A_i)_kl. Consequently, we chose all parameters to be in R^+. Furthermore, when using the adaptive regularization scheme proposed in Sec. II E, the entries of the regularization matrix would tend to infinity as the respective parameter approaches zero. To prevent this, we limit the breadth of the parameter space by setting |(A^norm)_kl| ≥ 10^−5 for all k ≠ l and accordingly set the step size for evaluating finite differences to 10^−5. Finally, as t^norm ≤ 1, we limit (A^norm_ss)_kk ≥ 2 for all k. We do not set an upper bound for any parameter.
We enforce the bounds by setting a parameter to the minimal value whenever an update would otherwise fall below that value.
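This projection onto the bounds can be sketched as a simple clipping step (numpy sketch; the layout of the parameter vector, first the A_ss diagonal, then the positive off-diagonal couplings, is an assumed convention of this illustration):

```python
import numpy as np

def enforce_bounds(x, n_diag, min_diag=2.0, min_offdiag=1e-5):
    """Project a (normalized) parameter vector back into the allowed box:
    any entry an update pushed below its lower bound is reset to that
    bound, matching the clipping rule described in the text."""
    x = np.array(x, dtype=float)
    x[:n_diag] = np.maximum(x[:n_diag], min_diag)      # (A_ss^norm)_kk >= 2
    x[n_diag:] = np.maximum(x[n_diag:], min_offdiag)   # couplings >= 1e-5
    return x

x_new = enforce_bounds([1.0, 5.0, -0.1, 0.3], n_diag=2)
# -> [2.0, 5.0, 1e-5, 0.3]: two entries are clipped, the rest pass through
```

Because there is no upper bound, the projection only ever raises parameters to their minimal values.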

Initial guess
We can make a reasonable heuristic initial guess within the normalized parameter space described in Appendix B 2. Consider the case where the off-diagonal entries of A_ss are set to zero. Eq. 22 then yields the integral of a sum of h exponentials,67 where the time constant of the kth exponential is given by (A_ss)^{-1}_kk. We want the initial guess to plateau on the relevant time scale (t ≈ 1) and, at the same time, to be sufficiently well resolved by the time grid. Thus we require 10∆t/t_max ≤ (A_ss)^{-1}_kk ≤ 0.5, or equivalently 2 ≤ (A_ss)_kk ≤ t_max/(10∆t). More specifically, we choose the (A_ss)^{-1}_kk to be logarithmically equispaced within these bounds. The plateau value of the kth integrated exponential is given by (A_ss)^{-1}_kk (A_Ps)^2_k. As an initial guess, all exponentials should contribute equally to the total (long-time) friction. Thus, to achieve a total plateau value of ≈ 1, we set (A_Ps)_k = √((A_ss)_kk / h). Finally, the off-diagonal entries of A_ss should not be set to zero for the initial guess, as the corresponding entries in the Jacobian would otherwise vanish. We set the off-diagonals to (A_ss)_kl = (A_ss)_kk/(l − k) for all k < l.
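Putting the heuristic together, the initial guess can be sketched as below; the function name `initial_guess` and the representation of A_Ps as a vector are assumptions of this example, not part of the original text.

```python
import numpy as np

def initial_guess(h, t_max, dt):
    """Heuristic initial guess in the normalized parameter space:
    time constants (A_ss)_kk^{-1} logarithmically equispaced between
    10*dt/t_max and 0.5; (A_Ps)_k = sqrt((A_ss)_kk / h) so that each
    exponential contributes ~1/h to the total plateau value of ~1;
    off-diagonals (A_ss)_kl = (A_ss)_kk / (l - k) for k < l."""
    tau = np.logspace(np.log10(10 * dt / t_max), np.log10(0.5), h)
    A_ss = np.diag(1.0 / tau)
    for k in range(h):
        for l in range(k + 1, h):
            A_ss[k, l] = A_ss[k, k] / (l - k)
    A_Ps = np.sqrt(np.diag(A_ss) / h)
    return A_ss, A_Ps
```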
Appendix C: Gauss-Newton method for fitting memory kernels

How well a Gauss-Newton method performs can depend strongly on the specific application. Further strategies beyond those described in Secs. II E and B 1 may be necessary to improve reliability and convergence. As an intermediate step toward an optimized procedure for solving Eq. 8, we first solved the simpler problem of fitting a memory kernel G(t) using Eq. 22. To do so, we set f := G(A), f_tgt := G, and x := all distinct entries in A, and evaluate the Jacobian J by finite differences. With these definitions, applying the Gauss-Newton method is, in principle, straightforward. The strategies described in Appendices B 2 and B 3 can be applied to achieve stable and efficient convergence in both IOMK-GN and memory kernel fitting.
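A minimal sketch of the regularized Gauss-Newton iteration with a finite-difference Jacobian, for a generic residual function f; the Tikhonov variant (adding αI to J^T J) is shown, and the function name and signature are illustrative assumptions.

```python
import numpy as np

def gauss_newton(f, x0, f_tgt, alpha=1e-2, eps=1e-5, n_iter=100):
    """Iteratively update x so that f(x) approaches f_tgt.
    The Jacobian J is evaluated by forward finite differences with
    step eps; Tikhonov regularization adds alpha*I to D = J^T J
    before solving for the update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r = f(x) - f_tgt                        # residual vector
        J = np.empty((len(r), len(x)))
        for j in range(len(x)):                 # finite-difference Jacobian
            dx = np.zeros_like(x)
            dx[j] = eps
            J[:, j] = (f(x + dx) - f(x)) / eps
        D = J.T @ J
        x = x - np.linalg.solve(D + alpha * np.eye(len(x)), J.T @ r)
    return x
```

The adaptive regularization scheme of Sec. II E would replace the constant matrix αI by a parameter-dependent diagonal matrix Λ, which is where a non-invertible D + Λ can occasionally occur.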

Fitting exemplary integrated memory kernels using the Gauss-Newton method
As the Jacobian for fitting differs from the IOMK-GN Jacobian (compare Eq. 23) only by a time-dependent factor, we assume that we can infer a good choice for the regularization parameter α from fitting, and we compare the two regularization schemes for fitting as described in Appendix C.
To that end, we fitted Eq. 22 to 100 randomly generated memory kernels (see Appendix C 1 for details) and show the mean squared error after 100 iterations for the considered regularization schemes in Fig. 4. For the Tikhonov regularization, we found the best results for α = 0.01, while the adaptive regularization scheme performs best for α = 1. Independent of the choice of α, the adaptive scheme yields, on average, errors that are orders of magnitude smaller than the Tikhonov scheme.

FIG. 3. COM radial distribution function g(R) of ethanol from FG-MD and CG-MD using the IBI potential.
FIG. 4. Statistics of the logarithm of the squared residuals, averaged over all entries of the integrated memory kernel, for the Tikhonov and adaptive regularization schemes using different regularization parameters. The residuals are evaluated based on the normalized representation of the respective target. The box spans the distance between the first and third quartiles, with the median indicated by a black line. The whiskers indicate the largest and smallest data points found within 1.5 box lengths outside the box. All remaining points are considered outliers and are marked with small circles. For the sake of clarity, outliers or whiskers below −7.5, present only in the adaptive scheme with α = 10^0 and α = 10^{-1}, respectively, are not shown. The Tikhonov regularization with α = 10^{-5} has outliers up to ≈ 5. In the rare case that the fitting process failed due to a non-invertible D + Λ (only for α = 10^{-1} in the adaptive scheme), we set the error to 10.0 (1 on the logarithmic scale). The data were obtained by fitting Eq. 22 to 100 randomly generated integrated memory kernels.