The influence of microscopic force fields on the motion of Brownian particles plays a fundamental role in a broad range of fields, including soft matter, biophysics, and active matter. Often, the experimental calibration of these force fields relies on the analysis of the trajectories of the Brownian particles. However, such an analysis is not always straightforward, especially if the underlying force fields are non-conservative or time-varying, driving the system out of thermodynamic equilibrium. Here, we introduce a toolbox to calibrate microscopic force fields by analyzing the trajectories of a Brownian particle using machine learning, namely, recurrent neural networks. We demonstrate that this machine-learning approach outperforms standard methods when characterizing the force fields generated by harmonic potentials if the available data are limited. More importantly, it provides a tool to calibrate force fields in situations for which there are no standard methods, such as non-conservative and time-varying force fields. In order to make this method readily available for other users, we provide a Python software package named DeepCalib, which can be easily personalized and optimized for specific force fields and applications. This package is ideal to calibrate complex and non-standard force fields from short trajectories, for which advanced specific methods would need to be developed on a case-by-case basis.

## I. INTRODUCTION

Measuring microscopic force fields is of fundamental importance to understanding microscale systems. In experimental soft matter, biophysics, and active matter, microparticles are often used to probe force fields.^{1–4} This has been done, for example, to measure the elasticity of cells,^{5,6} inter-particle interactions,^{7–9} and non-equilibrium fluctuations.^{10–13} Accurate force calibration is also crucial to study molecular motors^{14} and microscopic heat engines.^{15–20} Sometimes the calibration of the force field needs to be done even in real time.^{21} Disentangling the deterministic force fields from the unavoidable Brownian noise in these systems requires care and has a direct impact on the quality of the experimental results.

If one has access to a large amount of data, the profile of a generic force field can be directly estimated by averaging the particle displacements at different positions and times (see, e.g., Refs. 22 and 23). However, there are many experimental situations where this is not feasible. As a consequence, several methods have been developed for the most common force fields, which have become standard in various research areas.^{1,4,24}

A particularly well-studied case is that of the force field $F_h(x) = -kx$ (where *k* is the stiffness and *x* the particle position with respect to the equilibrium) generated by a harmonic potential $U_h(x) = \frac{1}{2}kx^{2}$. This case is particularly interesting because it approximates the force field near any stable equilibrium, such as that experienced by microscopic particles held in optical, magnetic, or acoustic traps.^{1,4} The simplest approach to its calibration exploits the relation between the experimental probability distribution $\rho(x)$ and the potential, i.e., $U_h(x) = -k_BT\,\ln[N\rho(x)]$, where *N* is the normalization factor, $k_B$ is the Boltzmann constant, and *T* is the absolute temperature, from which the force can be derived as $F_h(x) = -\frac{\partial}{\partial x}U_h(x)$. Beyond this potential method, several additional methods are also available. These methods use the temporal information contained in the particle trajectory, extracted by calculating the autocorrelation function,^{1,4} the power spectral density,^{25} or the recently developed algorithm FORMA, a maximum likelihood estimator based on linear regression.^{26} All these methods work well with long trajectories with a sufficiently high sampling rate, while their performance declines when only short trajectories or low sampling rates are available.

Some of the methods used for the calibration of harmonic traps can be generalized to more complex force fields. For example, the potential method can, in principle, be used to characterize any conservative force field at thermodynamic equilibrium; however, the required amount of data grows exponentially for complex potential landscapes, because the probe particle must be given enough time to explore the entire configuration space. Standard methods for the calibration of even more complex force fields, such as non-conservative or time-varying force fields, are not readily available. The calibration becomes particularly complex when dealing with a limited amount of data, such as when real-time calibration is necessary.^{12} In fact, developing methods for the calibration of some specific examples of these force fields is a very active field of research.^{24,26–30}

In this article, we demonstrate numerically and experimentally that machine learning can efficiently calibrate the force field experienced by a Brownian particle. Specifically, we employ a recurrent neural network (RNN),^{31} because RNNs have been proven very successful at tasks requiring the analysis of time series, such as natural language recognition and translation,^{32–34} event prediction,^{35} and anomalous diffusion characterization.^{36} We demonstrate that this RNN-powered method outperforms standard calibration techniques when calibrating a harmonic potential using only a short trajectory. Then, we demonstrate that it can also be used to calibrate force fields for which standard calibration techniques do not exist, namely, bistable, non-conservative, and time-varying force fields. In order to make this approach readily available for other users, we provide a Python software package, called DeepCalib,^{37} which can be easily adapted to different force fields and, therefore, personalized and optimized for the needs of specific users and applications.

## II. RESULTS

Machine-learning-powered techniques have been particularly successful in data analysis, emerging as an ideal method to study systems for which only limited data are available or no standard approaches are available.^{38,39} In particular, artificial neural networks^{40,41} provide a powerful way to automatically extract information from data. They belong to the class of supervised machine-learning methods. Unlike standard algorithmic approaches that use explicit mathematical recipes in order to obtain the sought-after results, supervised machine-learning methods are trained with large data sets associated with the corresponding ground truth in order to determine the optimal processing to estimate this ground truth from the input data. The learning task is typically a classification (where the ground truth indicates to which class the input belongs, e.g., determining if an image contains a cat or a dog) or a regression (where the ground truth is the numerical value of a quantity, e.g., inferring a parameter from a physical experiment).

Neural networks are composed of artificial neurons connected by adjustable weights. These neurons are often arranged in layers. The neurons in a layer perform a nonlinear transformation of their inputs and feed their results to the neurons of the subsequent layer. The final layer returns an estimate of the ground truth corresponding to the original input. The training process consists of iteratively adjusting the weights of the neural network in order to decrease the difference between the output and the ground truth of the sample so that the network progressively learns to associate the input data with the correct ground truth. This is usually achieved by backpropagating the estimation error through the layers.^{42} Once the neural network is trained, it can be used to predict the features of data it has never seen before.

Neural networks have recently been shown to be a powerful tool for classification and parametrization of stochastic phenomena, e.g., to determine anomalous diffusion exponents^{36,43} (also recently done using random forests^{44}), the arrow of time,^{45} and the position of particles,^{46,47} as well as in microscopy^{48,49} and in the simulations of hydrodynamic interactions^{50} and optical forces.^{51}

This success of neural networks in analyzing experimental data motivated us to test their performance in reconstructing microscopic force fields. This has led us to develop a force-field calibration method based on the use of RNNs, which are especially well-suited to handle time series, because they process the input data sequence iteratively and, therefore, explicitly model its time evolution. We name this method and the corresponding software package DeepCalib.^{37} Given a force field characterized by a set of parameters (e.g., a harmonic force field characterized by its stiffness *k*), we train the RNN to infer these parameters from short trajectories of Brownian particles moving in such force fields. Specifically, DeepCalib analyzes an input trajectory of varying length using an RNN with three long short-term memory (LSTM) layers (with 1000, 250, and 50 nodes, respectively; see the discussion in the supplementary material and Fig. S1 for details) and outputs the estimated values of the force-field parameters. We choose LSTMs because their architecture manages to retain a combination of short-term as well as longer-term correlations without making the training procedure excessively unstable.^{31} Further, LSTMs have been shown to perform well on short stochastic time series.^{36} We have chosen the number of layers and of nodes within each layer to achieve a good complexity–performance trade-off for all tasks presented in this article (see the supplementary material for a more detailed discussion of this trade-off); however, these parameters can be easily changed in DeepCalib by the user in order to optimize the performance for specific applications.

For the training of the RNN, we use simulated trajectories, for which we know the ground-truth values of the force-field parameters, to iteratively adjust the weights in the nodes in the LSTM layers using the backpropagation training algorithm.^{42} The possibility of rapidly generating a large amount of data by simulation allows us to employ a relatively large RNN with $\sim 5\cdot 10^{6}$ parameters without overfitting and without the need of overwhelmingly long training time (a few hours on a GPU-enhanced laptop). In this way, we can keep the specific number of layers and their dimensions constant for all the different calibration tasks we present in this article, easing the comparison of the RNN performance across the different tasks. The details of the network size effect on performance can be found in the supplementary material and Fig. S2.
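As a rough consistency check on the quoted network size, the trainable parameters of three stacked LSTM layers can be counted by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector), a one-dimensional input per time step, and a single dense output neuron. The input and output sizes are illustrative assumptions and are not taken from the DeepCalib code.

```python
def lstm_parameters(n_inputs, n_units):
    """Trainable parameters of one standard LSTM layer.

    Each of the 4 gates has n_units * n_inputs input weights,
    n_units * n_units recurrent weights, and n_units biases.
    """
    return 4 * (n_units * (n_inputs + n_units) + n_units)


def stacked_lstm_parameters(n_inputs, layer_sizes, n_outputs=1):
    """Total parameters of stacked LSTM layers followed by a dense output layer."""
    total = 0
    for n_units in layer_sizes:
        total += lstm_parameters(n_inputs, n_units)
        n_inputs = n_units  # the next layer reads this layer's output
    total += n_outputs * (n_inputs + 1)  # dense weights plus biases
    return total


# Layer sizes from the text; input size 1 and output size 1 are assumptions.
print(stacked_lstm_parameters(1, [1000, 250, 50]))  # 5319251
```

Under these assumptions the count is about $5.3\cdot 10^{6}$, in line with the order of magnitude quoted above. Changing the assumed input size shifts the total only slightly, because the recurrent weights of the first layer dominate.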

Finally, we test the performance of the trained RNN on experimental trajectories of Brownian particles in force fields that we generate using a thermophoretic feedback trap^{52} (see experimental details below).

In Secs. II A–II F, we demonstrate that DeepCalib can be used to estimate a large variety of force fields from stochastic trajectories. We start by considering the paradigmatic case of a harmonic trap, showing that DeepCalib outperforms standard techniques for short trajectories. Then, we move to more complex scenarios: a double-well potential, a non-conservative force field, and a time-varying force field for which no simple general calibration method exists. We provide the source code of DeepCalib together with example files that reproduce all presented results.^{37} This code can be easily adapted to other force fields and, therefore, optimized for the needs of specific users and applications.

### A. Harmonic potential

In order to benchmark the performance of DeepCalib, we start by considering the simple case of a force field generated by a harmonic potential, for which many efficient standard calibration methods already exist. Harmonic traps are widely studied because they represent good approximations to more complex force-field profiles near their stable equilibria, and they are easy to realize experimentally and to analyze. A Brownian particle in a harmonic trap, in the overdamped limit, is described by the Langevin equation^{53}

$$\dot{x}(t) = -\frac{k}{\gamma}\,x(t) + \sqrt{\frac{2k_BT}{\gamma}}\,\xi(t),$$

where *γ* is the friction coefficient and $\xi(t)$ is uncorrelated Gaussian noise with unitary variance. An example of a simulated trajectory is shown in Fig. 1(a). To calibrate this force field, one needs to estimate the stiffness *k*.
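To make the model concrete, the following minimal sketch generates one trajectory of an overdamped particle in a harmonic trap, assuming the standard form $\dot{x} = -(k/\gamma)\,x + \sqrt{2k_BT/\gamma}\,\xi(t)$. It uses the exact Ornstein-Uhlenbeck update, which remains valid when the sampling step is comparable to $\gamma/k$. The numerical values of `KBT` and `GAMMA` are assumptions (room temperature, a sphere of about 200 nm diameter in water), and the integration scheme is our illustrative choice, not necessarily the one used in DeepCalib.

```python
import math
import random

KBT = 4.11    # thermal energy k_B*T in fN*um at room temperature (assumed)
GAMMA = 1.9   # friction coefficient in fN*s/um (assumed)


def simulate_harmonic(k, n_samples=1000, dt=0.01, gamma=GAMMA, kbt=KBT, rng=random):
    """Sample an overdamped Brownian trajectory in a harmonic trap.

    Exact Ornstein-Uhlenbeck update, with tau = gamma / k:
        x[n+1] = x[n] * exp(-dt/tau) + sqrt(kBT/k * (1 - exp(-2*dt/tau))) * N(0, 1)
    k is in fN/um, dt in s; the returned positions are in um.
    """
    tau = gamma / k
    decay = math.exp(-dt / tau)
    kick = math.sqrt(kbt / k * (1.0 - decay ** 2))
    x = rng.gauss(0.0, math.sqrt(kbt / k))  # start from the equilibrium distribution
    trajectory = []
    for _ in range(n_samples):
        trajectory.append(x)
        x = x * decay + kick * rng.gauss(0.0, 1.0)
    return trajectory


if __name__ == "__main__":
    # One 10-s trajectory (1000 samples, 10-ms time step) for k = 10 fN/um.
    trajectory = simulate_harmonic(10.0, rng=random.Random(0))
    print(len(trajectory), min(trajectory), max(trajectory))
```

With these assumed values, $k/\gamma \approx 53\,\mathrm{s}^{-1}$ for $k = 100\,\mathrm{fN}\,\mu\mathrm{m}^{-1}$, so a 10-ms time step is already a sizable fraction of the relaxation time of the stiffest traps considered.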

We train the RNN using simulated trajectories with different *k*. The friction coefficient *γ* is randomly varied by 5% around its nominal value in order for the RNN to gain tolerance against small fluctuations in the friction. Since we want to train the RNN to estimate accurately stiffness values that can vary over a few orders of magnitude (from 1 to $100\,\mathrm{fN}\,\mu\mathrm{m}^{-1}$), we draw the values of *k* from a distribution that is uniform in logarithmic scale (from $10^{-0.5}$ to $10^{3.5}\,\mathrm{fN}\,\mu\mathrm{m}^{-1}$). This is a challenging task because the range of *k* is very broad and the trajectory is very short [an example is the black portion of the trajectory in Fig. 1(a)]. Importantly, the training range of *k* is wider than the desired measurement range in order to ensure that the RNN is properly trained also for the expected edge cases. Overall, we train the RNN using $10^{7}$ trajectories, each lasting $10\,\mathrm{s}$ and sampled 1000 times (time step 10 ms). We continuously generate new trajectories (so that the RNN is never trained twice with the same trajectory, avoiding any risk of overtraining) and split them into batches of increasing size (from 32 to 2048, so that, at the beginning, the RNN optimization process can freely explore a large parameter space, and it is then progressively annealed toward an optimal parameter set^{54}). The training process is efficient and takes about four hours on a GPU-enhanced laptop (Intel Core i7 8750H, Nvidia GeForce GTX 1060). For further details on the model and the training, see also example 1a of the DeepCalib software package.^{37}
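The sampling of the training parameters described above can be sketched as follows. The text fixes the logarithmic range of *k*, the 5% variation of *γ*, and the first and last batch sizes; the uniform distribution of the friction around its nominal value, the nominal value itself, and the doubling rule for the batch sizes are assumptions made only for illustration.

```python
import random

GAMMA_NOMINAL = 1.9  # nominal friction coefficient in fN*s/um (assumed value)


def draw_training_parameters(rng, log10_k_min=-0.5, log10_k_max=3.5, gamma_spread=0.05):
    """Draw one (k, gamma) pair for a simulated training trajectory.

    k (in fN/um) is uniform in logarithmic scale; gamma is drawn uniformly
    within +/- gamma_spread of its nominal value (the uniform law is an assumption).
    """
    k = 10.0 ** rng.uniform(log10_k_min, log10_k_max)
    gamma = GAMMA_NOMINAL * (1.0 + rng.uniform(-gamma_spread, gamma_spread))
    return k, gamma


def batch_sizes(first=32, last=2048):
    """Increasing batch sizes from `first` to `last` (doubling is an assumption)."""
    sizes = []
    size = first
    while size <= last:
        sizes.append(size)
        size *= 2
    return sizes


if __name__ == "__main__":
    rng = random.Random(0)
    print(draw_training_parameters(rng))
    print(batch_sizes())
```

Drawing *k* uniformly in $\log_{10} k$ gives every decade of stiffness the same weight in the training set, which a uniform draw in linear scale would not do.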

The estimations done by DeepCalib are shown in Fig. 1(b) (orange distribution) in comparison with the ground truth (black dashed line), while the corresponding relative mean absolute error (MAE) is shown in Fig. 1(c) (orange dots). DeepCalib provides accurate results for the entire range of *k*, significantly improving its performance at larger *k*. This is expected, because the time-scales of the fluctuations of the particle position in the trap are inversely proportional to *k*, so that for larger *k* the 10-s trajectory is able to explore the trapping potential more efficiently. For very small values of *k*, one can observe a slight bend in the cloud of predicted values, which tends to slightly overestimate the stiffness. This is a common feature for neural-network-based regression and happens because we are considering values that are close to the smallest values the network has seen in its training.

The most commonly used methods to estimate the stiffness of a harmonic trap are the variance method, the autocorrelation method, the power spectrum analysis, and the recently developed FORMA.^{1,4,25,26} For the trajectories we are considering, it is extremely difficult to employ the power spectrum analysis, because the trajectories are too short to accurately estimate the power spectral density. We, therefore, compare DeepCalib to the other three methods. The variance method [Figs. 1(d)–(e)] determines *k* from the measurement of the variance of the particle position in the trap:

$$k = \frac{k_BT}{\langle x^{2}\rangle}.$$

The autocorrelation method [Figs. 1(f)–(g)] determines *k* by fitting the decorrelation curve of the particle position in the trap:

$$\langle x(t+s)\,x(t)\rangle = \frac{k_BT}{k}\,e^{-|s|/\tau},$$

where $s$ is the lag time and $\tau = \gamma/k$ is the characteristic time of the trap. In both cases, $\langle\cdot\rangle$ represents averaging over time. FORMA [Figs. 1(h)–(i)] determines *k* using a maximum likelihood estimator:

$$k = -\frac{\gamma}{\Delta t}\,\frac{\sum_i x_i\,(x_{i+1}-x_i)}{\sum_i x_i^{2}},$$

where $x_i$ is the $i$-th trajectory sample, and $\Delta t$ is the sampling time.
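The three standard estimators can be sketched in a few lines, assuming the usual textbook forms: equipartition, $k = k_BT/\langle x^{2}\rangle$, for the variance method; an exponential decay $e^{-s/\tau}$ with $\tau = \gamma/k$ for the autocorrelation; and a linear regression of the displacement on the position for FORMA. The autocorrelation estimator below is a single-lag shortcut rather than a fit of the whole decorrelation curve, and `synthetic_trajectory` is a hypothetical helper that only provides test data.

```python
import math
import random


def stiffness_variance(x, kbt):
    """Variance method: k = kBT / <(x - <x>)^2>."""
    mean = sum(x) / len(x)
    variance = sum((xi - mean) ** 2 for xi in x) / len(x)
    return kbt / variance


def stiffness_autocorrelation(x, dt, gamma, lag=1):
    """Single-lag shortcut: C(lag*dt) / C(0) = exp(-lag*dt/tau), k = gamma/tau."""
    mean = sum(x) / len(x)
    c0 = sum((xi - mean) ** 2 for xi in x) / len(x)
    pairs = zip(x[:-lag], x[lag:])
    c_lag = sum((a - mean) * (b - mean) for a, b in pairs) / (len(x) - lag)
    if c_lag <= 0.0:
        raise ValueError("samples are decorrelated at this lag")
    return -gamma * math.log(c_lag / c0) / (lag * dt)


def stiffness_forma(x, dt, gamma):
    """FORMA-style estimator: least-squares slope of the displacement on the position."""
    num = sum(xi * (xj - xi) for xi, xj in zip(x[:-1], x[1:]))
    den = sum(xi * xi for xi in x[:-1])
    return -gamma / dt * num / den


def synthetic_trajectory(k, n_samples, dt, gamma, kbt, seed=0):
    """Exact Ornstein-Uhlenbeck samples, used here only as test data."""
    rng = random.Random(seed)
    decay = math.exp(-k * dt / gamma)
    kick = math.sqrt(kbt / k * (1.0 - decay ** 2))
    x = rng.gauss(0.0, math.sqrt(kbt / k))
    samples = []
    for _ in range(n_samples):
        samples.append(x)
        x = x * decay + kick * rng.gauss(0.0, 1.0)
    return samples


if __name__ == "__main__":
    kbt, gamma, dt = 4.11, 1.9, 0.01  # fN*um, fN*s/um, s (assumed values)
    traj = synthetic_trajectory(10.0, 1000, dt, gamma, kbt)
    print(stiffness_variance(traj, kbt),
          stiffness_autocorrelation(traj, dt, gamma),
          stiffness_forma(traj, dt, gamma))
```

Because the FORMA-style estimator replaces the derivative with a finite difference, its expected value is $\gamma\,(1 - e^{-k\Delta t/\gamma})/\Delta t$, which falls below *k* as $\Delta t$ approaches $\gamma/k$.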

The estimations of *k* obtained with the variance method, the autocorrelation method, and FORMA are distributed as shown in Figs. 1(d) (blue density plot), 1(f) (green density plot), and 1(h) (cyan density plot), respectively. The autocorrelation method and FORMA provide slightly more accurate results than the variance method when *k* is small; however, they become less accurate when *k* is large, because individual data samples in the trajectory become excessively uncorrelated. This is expected because we are sampling the trajectory with a low frequency of just 100 Hz, which becomes comparable to the characteristic frequency of the trap (e.g., 53 Hz for $k = 100\,\mathrm{fN}\,\mu\mathrm{m}^{-1}$). The corresponding relative MAEs are shown in Figs. 1(e) (blue dots), 1(g) (green dots), and 1(i) (cyan dots), together with the comparison with DeepCalib's performance (orange dashed line). DeepCalib clearly outperforms the other methods for small *k* values, where the measurement is more challenging, and matches the performance of the best method for the simpler cases with larger stiffnesses.

### B. Experimental setup and initial experimental validation

So far, we have demonstrated how DeepCalib performs on simulated test data that are obtained similarly to the training data set. In order to test DeepCalib in a realistic situation, we now investigate the performance of the same RNN discussed in Sec. II A, trained on simulated data (Fig. 1), on experimental trajectories.

The experimental setup to obtain the trajectories consists of a feedback trapping system that enables us to generate a wide variety of force fields.^{52,55} We measure the Brownian motion of a single 200-nm diameter polystyrene particle (ThermoFisher Scientific, F8810) in an aqueous environment, confined by dynamic temperature fields to a circular region of a UV-lithographically fabricated nanostructure [Fig. 2(a)].^{55} The confinement in these temperature fields occurs as a result of thermophoretic drifts of the particle due to temperature-dependent solute–solvent interactions^{56,57} [red arrow, Fig. 2(a)]. The microscopic origin of these drifts is manifold and summarized in the thermodiffusion coefficient $D_T$. As is usually the case, $D_T$ has a positive sign in our experiment, which means that the corresponding objects move toward colder regions in the temperature landscape. A thermophoretic drift velocity $\vec{v}_T = -D_T\nabla T$ can be assigned to this directed motion, which is proportional to the temperature gradient $\nabla T$. The relative strength of the thermophoretic motion of particles in liquids is given by the ratio $S_T = D_T/D$, where $D$ is the diffusion coefficient; this ratio is also known as the Soret coefficient. Typical values for the Soret coefficient are in the range of 0.01 to $10\,\mathrm{K}^{-1}$.^{56} In the thermophoretic trapping setup, temperature gradients are generated by the conversion into heat of the optical energy of a focused 808-nm laser beam [Pegasus Lasersysteme, PL.MI.808.300, beam waist $\omega_0 \approx 500\,\mathrm{nm}$, Fig. 2(a)] positioned on the circumference of a circular hole with a diameter of $15\,\mu\mathrm{m}$ in an otherwise continuous chrome film (thickness 30 nm). Temperature differences between the rim and the trapping center are typically on the order of $\Delta T \approx 10\,\mathrm{K}$. The current position of the particle is obtained from the fluorescence it emits under homogeneous illumination with an excitation laser ($\lambda = 532\,\mathrm{nm}$, Pusch OptoTech), which is recorded via an EMCCD camera (Andor iXon 3) at a frequency of 100 Hz and evaluated in real time using custom-made software. The real-time positioning and intensity control of the heating laser, which is realized with an acousto-optic deflector (Brimrose, 2 DS-75-40-808), can be performed according to any protocol, allowing for the investigation of a wide variety of dynamic temperature fields.^{52} This technique is, thus, ideally suited for testing DeepCalib on experimental data obtained from a broad range of force fields.

In Sec. II B, we use the thermophoretic trap to generate a restoring force field corresponding to a harmonic potential. We record a 500-s trajectory ($5\times 10^{4}$ samples, time step 10 ms) and determine the “true” ground-truth *k* by the variance method using the full recorded trajectory [black dashed lines in Figs. 2(b)–2(e)]. We then test the performance of DeepCalib on 400 (partially overlapping) segments of this trajectory (1000 samples each); the resulting estimations are presented by the orange histogram in Fig. 2(b) (see also example 1b of the DeepCalib software package^{37}). The estimations obtained by the variance and autocorrelation methods are presented by the blue and green histograms in Figs. 2(c) and 2(d), respectively, and show that these methods present a bias toward larger and smaller values of *k*, respectively. Such biases can be explained by the short length of the trajectories. For the variance method, the trajectory is not long enough to explore the full potential well, leading to an underestimation of the variance and, thus, an overestimation of *k*. For the autocorrelation method, short trajectories exploring only the region near the equilibrium position lead to an overestimation of the correlation time in the trap and, thus, an underestimation of *k*. The estimations obtained by FORMA [Fig. 2(e)] show better results than the variance and autocorrelation methods. This is expected because FORMA performs best in this regime, when the time step is very small compared to the timescale of the trap ($\gamma/k$).^{26}

Although DeepCalib is trained with simulated trajectories, it determines the trap stiffness from experimental trajectories more accurately than the variance and autocorrelation methods and similarly to FORMA: DeepCalib estimations are both closer to the measured truth (lower bias) and less spread (higher precision). Therefore, thanks to its data-driven training process, the RNN manages to combine the insight provided by the variance and autocorrelation methods, while largely avoiding their pitfalls, resulting in a performance that matches FORMA. In addition, we also show that DeepCalib remains the most robust analysis method also in the presence of measurement errors (Fig. S3 in the supplementary material) or inhomogeneous diffusion coefficients (Fig. S4 in the supplementary material), even though it has not been trained with data that contemplate these scenarios. Detailed discussion of these scenarios is found in the supplementary material.

We remark that the RNN is expected to perform well for trap stiffnesses that lie in the range that is used for its training. For calibration of stronger harmonic traps (such as optical tweezers^{1}), it is sufficient to modify the range of *k* values used for the training. We demonstrate the use of DeepCalib with experimental data for colloids trapped by optical tweezers in Fig. S5 in the supplementary material.

### C. Double-well potential

Now that we have validated DeepCalib on the fundamental case of a harmonic trap, we move to the more complex case of a bistable potential. Bistable traps represent a model system to study several physical and biological phenomena, such as Kramers' transitions,^{58} Landauer's principle,^{12} and folding energies of nucleic acids.^{59} The simplest analytic form for a double-well potential is given by a quartic polynomial [solid line in Fig. 3(a)]:

$$U(x) = \Delta U\left[\left(\frac{x}{L}\right)^{2} - 1\right]^{2},$$

where $x = \pm L$ are the local minima and $\Delta U$ is the barrier height. This gives rise to a cubic force field [arrows in Fig. 3(a)]:

$$F(x) = -\frac{\partial U}{\partial x} = -\frac{4\,\Delta U}{L^{2}}\,x\left[\left(\frac{x}{L}\right)^{2} - 1\right],$$

clearly showing that the force vanishes at the potential minima $x = \pm L$ and at the local maximum *x* = 0. The parameters that characterize this double-well potential are the equilibrium distance *L* and the energy barrier height $\Delta U$.
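The stated properties of the double well are easy to verify numerically. The sketch below assumes the quartic form $U(x) = \Delta U\,[(x/L)^{2} - 1]^{2}$, which, up to an additive constant, is the only quartic whose stationary points are $x = 0$ and $x = \pm L$ with barrier height $\Delta U$. The function names are ours and purely illustrative.

```python
def double_well_potential(x, L, dU):
    """Quartic double well with minima at x = +/-L and barrier height dU at x = 0."""
    return dU * ((x / L) ** 2 - 1.0) ** 2


def double_well_force(x, L, dU):
    """Force F = -dU/dx of the potential above (cubic in x)."""
    return -4.0 * dU / L ** 2 * x * ((x / L) ** 2 - 1.0)


if __name__ == "__main__":
    L, dU = 2.0, 3.0  # illustrative values: L in um, dU in units of kBT
    for x in (-L, 0.0, L):
        print(x, double_well_potential(x, L, dU), double_well_force(x, L, dU))
```

With `dU` expressed in units of $k_BT$ and `L` in micrometers, the returned force is in units of $k_BT\,\mu\mathrm{m}^{-1}$.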

The RNN employed by DeepCalib is similar to that for the harmonic trap case, but has two outputs to estimate both *L* and $\Delta U$. We train this RNN on about $10^{7}$ trajectories that are simulated with $\Delta U$ ranging from 0.1 to 10 $k_BT$ (uniformly distributed in logarithmic scale) and *L* ranging from $1\,\mu\mathrm{m}$ to $3\,\mu\mathrm{m}$ (uniformly distributed in linear scale). Finally, we test its performance on $10^{4}$ simulated trajectories with 1000 samples (time step 50 ms). DeepCalib provides accurate estimations for both *L* [orange distribution, Fig. 3(b)] and $\Delta U$ [orange distribution, Fig. 3(c)] for a wide range of parameters (the ground truth is plotted by the black dashed lines). More details can be found in example 2a of the DeepCalib software package.^{37}

We now compare the performance of DeepCalib [Figs. 3(b)–3(c)] to standard methods [Figs. 3(d)–3(g)]. The standard methods to calibrate a double-well potential use the relation between equilibrium probability distribution and the potential energy,^{1} which is given by

$$\rho(x) = \frac{1}{N}\,e^{-U(x)/k_BT},$$

where the normalization factor $N = \int_{-\infty}^{\infty} e^{-U(x)/k_BT}\,dx$ is the partition function. We remark that other standard methods that employ the statistics of the transition times between the wells^{58} cannot be applied here because we analyze short trajectories featuring few transitions. Here, we use two concrete approaches. First, we perform a quartic fit to $\ln\rho(x)$ to determine the optimal values of *L* and $\Delta U$ [“potential method,”^{13} Figs. 3(d)–3(e)]. However, we observe that, for short trajectories, the potential method estimates $\Delta U$ with a strong bias. Thus, we employ a second method that is more accurate for shorter trajectories: As $\rho(x)$ displays two local maxima at $\pm L$ (potential minima) and a local minimum at the origin (potential barrier), we obtain *L* as the distance between the maximum of $\rho(x)$ and the origin, and $\Delta U$ from the ratio between the maximum probability and the probability at the origin, i.e., $\Delta U = k_BT\,\ln[\rho(\pm L)/\rho(0)]$ [“extrema method,”^{58} Figs. 3(f)–3(g)]. Although the extrema method provides much better estimations than the potential method, it achieves a significantly worse performance than DeepCalib because of the limited length of the trajectories. This is confirmed by the inspection of the relative MAE [Figs. 3(h)–3(i)]: The relative MAE of DeepCalib (orange dots) is much lower than that of the potential method (blue circles) and of the extrema method (green triangles) over the whole range of both *L* [Fig. 3(h)] and $\Delta U$ [Fig. 3(i)].

Finally, we test the performance of DeepCalib on experimental trajectories while using the same RNN employed for the analysis of the simulated data. The experimental data are acquired using the same thermophoretic setup employed for the harmonic trap [Fig. 2(a)], but imposing the force field of a double-well trap. We record a 1500-s trajectory (150 000 samples, time step 10 ms). A part of the experimental trajectory is shown in Fig. 4(a). Interestingly, the experimental potential is not exactly a quartic potential (a typical example of the “reality gap” between experiments and simulations,^{39} see Fig. 4(b)). The experimental potential obtained with the full extent of the trajectory is shown in Fig. 4(b). We determine the “true” ground-truth values for *L* and $\Delta U$ using the extrema method [green line, Fig. 4(b), and black dashed lines, Figs. 4(c)–4(h)]. This reality gap makes it particularly interesting to assess how the various methods perform, because DeepCalib is trained on the idealized quartic potential, and the potential method assumes a quartic potential in its analysis. We test the performance of DeepCalib on 900 (partially overlapping) segments of this trajectory [1000 samples each with time step 50 ms, highlighted black line in Fig. 4(a)], obtaining the estimations of *L* and $\Delta U$ represented by the orange histograms in Figs. 4(c) and 4(d), respectively (see also example 2b of the DeepCalib software package^{37}). The corresponding estimations for the potential method are provided by the blue histograms in Figs. 4(e)–4(f), and those for the extrema method by the green histograms in Figs. 4(g)–4(h). Also in this case, DeepCalib is more accurate and less biased than the standard methods. In particular, we highlight the fact that DeepCalib provides accurate estimations even though the experimental potential differs from the idealized double-well potential employed in the simulations used in its training. This demonstrates that the neural-network approach put forward by DeepCalib can efficiently bridge the reality gap between idealized simulations and actual experiments. We also highlight that the measurements of DeepCalib are robust against asymmetries in the double-well potential. In our tests, the measurement of the equilibrium distance remains almost unaffected even if the potential is strongly asymmetric (see Fig. S6 in the supplementary material), despite the network being trained only with symmetric potentials. We also show that DeepCalib can easily be retrained to measure two different potential barrier heights (see Fig. S7 in the supplementary material). A more detailed discussion about asymmetric double wells can be found in the supplementary material.

### D. Rotational force field

We now test DeepCalib in a non-equilibrium scenario created by a non-conservative rotational force field. Non-conservative force fields are widely used to investigate the non-equilibrium dynamics and thermodynamics of microscopic systems.^{60–63} We consider the rotational force field described by the following equation:

$$\mathbf{F}(\mathbf{r}) = -k\,\mathbf{r} + \gamma\Omega\,\hat{\mathbf{z}}\times\mathbf{r},$$

where **r** is the two-dimensional position in the *xy*-plane of the Brownian particle, which is subjected to a restoring force with stiffness *k* and a torque with rotational frequency Ω. An example of a rotational force field is shown in Fig. 5(a). This non-equilibrium system relaxes to a steady state, but its distribution is determined only by the restoring force and is independent of Ω [$\rho(x,y) \propto e^{-k(x^{2}+y^{2})/(2k_BT)}$]^{60}. Thus, unlike in the previous examples, it is impossible, even in principle, to use the steady-state probability distribution to calibrate this force field, regardless of the amount of available data. The available methods^{26,30,60,61} rely essentially on local drifts and, therefore, require high-frequency measurements (i.e., the measurement time step must be at least one order of magnitude smaller than the characteristic times associated with the motion of the Brownian particle in the force field, which in this case are $\tau_c = \gamma/k$ and $\tau_r = \Omega^{-1}$^{60,61}). To explore the potential of DeepCalib in challenging scenarios, we consider trajectories sampled with a relatively low frequency (20 Hz). Thus, we train DeepCalib on simulated two-dimensional trajectories with 1000 samples acquired with a time step of 50 ms. We use about $10^{7}$ trajectories that are simulated with *k* ranging from 6 to $150\,\mathrm{fN}\,\mu\mathrm{m}^{-1}$ (uniformly distributed in logarithmic scale) and $\gamma\Omega$ ranging from $-42$ to $42\,\mathrm{fN}\,\mu\mathrm{m}^{-1}$ (uniformly distributed in linear scale).
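A trajectory in such a rotational force field can be generated with a simple Euler-Maruyama scheme. The sketch below assumes the force $\mathbf{F} = -k\,\mathbf{r} + \gamma\Omega\,\hat{\mathbf{z}}\times\mathbf{r}$ (the sign convention for Ω is our assumption), the same illustrative values of `KBT` and `GAMMA` as for a particle of about 200 nm diameter in water, and a number of integration sub-steps per sampling interval chosen only to keep the scheme accurate. It is not the integrator used by DeepCalib.

```python
import math
import random

KBT = 4.11    # thermal energy k_B*T in fN*um at room temperature (assumed)
GAMMA = 1.9   # friction coefficient in fN*s/um (assumed)


def simulate_rotational(k, omega, n_samples=1000, dt=0.05, substeps=50, rng=random):
    """Euler-Maruyama integration of gamma * dr/dt = -k r + gamma*omega (z x r) + noise.

    k is in fN/um, omega in rad/s, and dt (in s) is the sampling time step; each
    sampling interval is integrated with `substeps` finer steps.
    Returns a list of (x, y) positions in um, starting from the origin.
    """
    h = dt / substeps
    sigma = math.sqrt(2.0 * KBT / GAMMA * h)  # noise amplitude per sub-step
    x = y = 0.0
    trajectory = []
    for _ in range(n_samples):
        trajectory.append((x, y))
        for _step in range(substeps):
            vx = -(k / GAMMA) * x - omega * y  # z x r = (-y, x)
            vy = -(k / GAMMA) * y + omega * x
            x += vx * h + sigma * rng.gauss(0.0, 1.0)
            y += vy * h + sigma * rng.gauss(0.0, 1.0)
    return trajectory


if __name__ == "__main__":
    trajectory = simulate_rotational(13.0, 5.0, rng=random.Random(0))
    print(len(trajectory), trajectory[-1])
```

In such a simulation the mean squared displacement from the center depends only on *k*, while the rotation appears only in time-lagged cross-correlations between *x* and *y*, which is why the steady-state distribution alone cannot reveal Ω.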

DeepCalib manages to estimate with good accuracy both *k* and Ω, as can be seen by comparing the orange distributions and the ground-truth values provided by the black dashed lines in Figs. 5(b) and 5(c), respectively [see also example 3a of the DeepCalib software package^{37}].

Since the time step is comparable to the characteristic time of the system, we expect the standard methods to fail.^{26,30} In fact, when we apply FORMA^{26} to calibrate this force field, we obtain much poorer estimations [blue distributions in Figs. 5(d) and 5(e)]. FORMA performs reasonably well for low *k* (longer characteristic times), but fails for higher values of *k* (shorter characteristic times), while it performs poorly over the whole range of Ω.

Finally, we test the performance of DeepCalib for an experimental rotational force field, generated using the thermophoretic setup [Fig. 2(a)]. We perform the test on 100 (partially overlapping) segments of the experimental trajectory (1000 seconds long), each with 1000 samples with a time step of 50 ms. The estimation of the force-field parameters is challenging because the 50-ms measurement time step is comparable to the force-field characteristic times ($\tau_c = 145\,\mathrm{ms}$, $\tau_r = 193\,\mathrm{ms}$). We determine the “true” ground-truth values of *k* and Ω [black dashed lines in Figs. 5(f)–(i)] with the FORMA-based estimations using the full length of the trajectory sampled more often (i.e., every 10 ms instead of every 50 ms), so that the sampling time is much shorter than $\tau_c$ and $\tau_r$. Once again, the estimations of *k* by DeepCalib [orange distribution, Fig. 5(f)] are more accurate than those by FORMA [blue distribution, Fig. 5(h)], which clearly deviate from the measured ground truth (black dashed lines). Likewise, the estimations of Ω by DeepCalib [orange distribution, Fig. 5(g)] are also closer to the measured ground truth (black dashed lines) than those by FORMA [blue distribution, Fig. 5(i)]. For further details, see also example 3b of the DeepCalib software package.^{37}

### E. Dynamical nonequilibrium trap

To further demonstrate the potential of DeepCalib, we set out to calibrate an even more challenging dynamical nonequilibrium system. We consider a Brownian particle subject to an alternating trapping potential that switches between a low stiffness $k_{\mathrm{low}}$ and a high stiffness $k_{\mathrm{high}}$ with a period *τ*. Figure 6(a) shows an example trajectory together with the corresponding stiffness protocol. There is no simple standard method for calibrating such a system, as one would have to combine techniques to detect the switching points (see, e.g., Ref. 64) with techniques to estimate stiffnesses (such as the variance and autocorrelation methods that we discussed for the harmonic trap) on shorter segments of the trajectory. However, it is quite difficult to estimate these parameters in most cases, as the exact switching point becomes very difficult to determine when the stiffness values are close [Fig. 6(a) features an example with a large difference between $k_{\mathrm{low}}$ and $k_{\mathrm{high}}$]. In addition, as the system is continuously kept in a nonequilibrium state, the variance and the autocorrelation methods cannot be used.

DeepCalib can be straightforwardly applied also to this case. We train DeepCalib on simulated trajectories with 1000 samples acquired with a time step of 100 ms. We train this RNN on about 10^{7} trajectories that are simulated with $k_{\mathrm{low}}$ and $k_{\mathrm{high}}$ ranging from 2 to 280 fN $\mu\mathrm{m}^{-1}$ (uniformly distributed on a logarithmic scale, with the condition that $k_{\mathrm{high}} > 2k_{\mathrm{low}}$) and *τ* ranging from 3 to 110 s (uniformly distributed on a logarithmic scale). We then test the trained RNN on $2\times10^{4}$ simulated trajectories, demonstrating that it is able to simultaneously and accurately estimate $k_{\mathrm{low}}$ [Fig. 6(b)], $k_{\mathrm{high}}$ [Fig. 6(c)], and *τ* [Fig. 6(d)] (see also example 4a of the DeepCalib software package^{37}). Of course, the accuracy in estimating the switching time depends on how different the stiffnesses of the traps are. We show this by plotting the MAE for the switching time as a function of the ratio $k_{\mathrm{high}}/k_{\mathrm{low}}$ in Fig. S8 of the supplementary material, where it is evident that traps with similar stiffnesses represent more challenging cases.
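To make the training-set generation concrete, the following Python sketch draws one parameter set and simulates one trajectory of the alternating trap. It is an illustration under stated assumptions rather than the simulation code of the DeepCalib package: the thermal energy `KB_T`, the friction coefficient `GAMMA`, the reading of *τ* as the full period (half at $k_{\mathrm{low}}$, half at $k_{\mathrm{high}}$), the rejection sampling used to enforce $k_{\mathrm{high}} > 2k_{\mathrm{low}}$, and the function names are all hypothetical choices. The integrator uses the exact Ornstein-Uhlenbeck update on sub-steps, which stays stable even when the sampling step is comparable to the trap relaxation time.

```python
import numpy as np

KB_T = 4.11e-21  # J, thermal energy near room temperature (assumed value)
GAMMA = 1.9e-8   # N s/m, friction coefficient (assumed, ~1 um sphere in water)


def sample_parameters(rng, k_range=(2.0, 280.0), tau_range=(3.0, 110.0)):
    """Draw (k_low, k_high, tau) log-uniformly; redraw until k_high > 2 k_low."""
    while True:
        k_low, k_high = np.exp(rng.uniform(np.log(k_range[0]), np.log(k_range[1]), size=2))
        if k_high > 2.0 * k_low:
            break
    tau = np.exp(rng.uniform(np.log(tau_range[0]), np.log(tau_range[1])))
    return k_low, k_high, tau  # fN/um, fN/um, s


def simulate_switching_trap(k_low, k_high, tau, n_samples=1000, dt=0.1,
                            substeps=20, rng=None):
    """Return positions (in m) sampled every dt for stiffnesses given in fN/um."""
    rng = np.random.default_rng() if rng is None else rng
    h = dt / substeps                      # internal integration step
    n_total = n_samples * substeps
    t = np.arange(n_total) * h
    # Assumed protocol: first half of each period at k_low, second half at k_high.
    k = np.where((t % tau) < tau / 2, k_low, k_high) * 1e-9  # fN/um -> N/m
    decay = np.exp(-k * h / GAMMA)
    noise_std = np.sqrt(KB_T / k * (1.0 - decay**2))
    kicks = rng.standard_normal(n_total)
    x = np.empty(n_total)
    x_prev = rng.normal(0.0, np.sqrt(KB_T / k[0]))  # start in equilibrium at k[0]
    for i in range(n_total):
        # Exact update of an overdamped particle in a harmonic trap over time h.
        x_prev = x_prev * decay[i] + noise_std[i] * kicks[i]
        x[i] = x_prev
    return x[substeps - 1::substeps]       # keep one sample every dt


rng = np.random.default_rng(0)
k_low, k_high, tau = sample_parameters(rng)
trajectory = simulate_switching_trap(k_low, k_high, tau, rng=rng)
print(trajectory.shape)  # (1000,)
```

Repeating the two calls in a loop yields labeled pairs (trajectory, parameters) of the kind a network could be trained on; generating the roughly 10^{7} trajectories quoted above would call for a vectorized version of the inner loop.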

Experimentally, we realize this protocol using a thermophoretic harmonic trap that alternates between two stiffnesses. We record an experimental trajectory of 10^{5} data samples with 10-ms time steps; a part of this trajectory is shown in Fig. 6(e). We then perform the test on 100 (partially overlapping) segments of the experimental trajectory, each with 1000 samples with a time step of 100 ms [black line, Fig. 6(e)]. The measured ground truth [black dashed lines in Figs. 6(f)–6(h)] for the stiffnesses of the experimental data is obtained from trajectories recorded at constant stiffnesses $k_{\mathrm{low}}$ and $k_{\mathrm{high}}$, while we know exactly the ground truth for *τ*, because we control the period of the experimental switching protocol. Using the same RNN trained for Figs. 6(b)–6(d), DeepCalib successfully estimates the parameters of the system $k_{\mathrm{low}}$ [Fig. 6(f)], $k_{\mathrm{high}}$ [Fig. 6(g)], and *τ* [Fig. 6(h)] from the experimental data (see also example 4b of the DeepCalib software package^{37}).

This latter example demonstrates that DeepCalib can be readily applied beyond simple equilibrium or steady-state dynamics to rather generic settings, for which standard techniques are not available and one would have to develop system-specific analysis methods.

### F. Robustness of DeepCalib

Neural-network-based methods often operate as black boxes, and it is, therefore, of great importance to properly characterize their robustness and validity in the specific scenarios where they are to be employed.^{39} By applying DeepCalib to experimental data in all tasks we have studied, we have already demonstrated its ability to bridge the reality gap and correctly calibrate experiments that may subtly differ from the simulations employed for the training. Here we further explore how common sources of alterations of the trajectories affect the performances of DeepCalib and how this compares to the other techniques.

First of all, we consider how the presence of measurement noise affects force calibration. For the harmonic trap case, we investigate how increasing levels of noise disrupt the performances of DeepCalib and of the other methods. Figure S3 in the supplementary material shows that DeepCalib is less affected by the presence of noise than the standard methods, even when the power of the signal equals that of the noise [signal-to-noise ratio (SNR) equal to 1]. Interestingly, these results are obtained with the very same RNN employed in Figs. 1 and 2, which is trained on trajectories without measurement noise. Even better results can be expected by re-training the RNN with trajectories with measurement noise.
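As an illustration of how such a noise test can be set up, the sketch below corrupts a trajectory with measurement noise at a prescribed SNR. It assumes additive white Gaussian noise and takes the signal power to be the variance of the trajectory; both choices, as well as the function name `add_measurement_noise`, are assumptions of this example and not necessarily the noise model used for Fig. S3.

```python
import numpy as np


def add_measurement_noise(positions, snr, rng=None):
    """Add white Gaussian noise whose variance is var(positions) / snr."""
    rng = np.random.default_rng() if rng is None else rng
    positions = np.asarray(positions, dtype=float)
    noise_std = np.sqrt(np.var(positions) / snr)
    return positions + rng.normal(0.0, noise_std, size=positions.shape)


rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)                       # stand-in for a trajectory
noisy = add_measurement_noise(clean, snr=1.0, rng=rng)  # noise power ~ signal power
```

Feeding `noisy` instead of `clean` to a calibration method, for a range of `snr` values, gives the kind of robustness curve discussed in the text.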

Then, we consider how inhomogeneities in the diffusion coefficient (i.e., spatial gradients in the friction coefficient^{65}) affect the performances of DeepCalib and of the other methods. Again, DeepCalib is more robust than the other methods, as shown in Fig. S4 in the supplementary material. Also in this case, we use the very same RNN employed in Figs. 1 and 2, which is trained on trajectories without diffusion gradients. Thus, we can expect even better results by re-training the RNN with simulated trajectories that account for the presence of diffusion gradients.

Another important source of variability is the length of trajectories from which we want to calibrate the force fields. In Secs. II A–II E, for the sake of simplicity, we have trained and tested DeepCalib on trajectories of the same length (always 1000 time steps). However, DeepCalib is capable of handling trajectories of different lengths. Figure S9 in the supplementary material shows how the performance of the RNN trained for the harmonic trap with trajectories containing 1000 measurements changes as a function of the length of the test trajectories from 500 to 2000 time steps. As can be expected, longer trajectories result in more accurate calibration. However, for 500-time-step trajectories (much shorter than those employed in the training), systematic biases arise, likely because the RNN is optimized to calibrate data of a certain typical length. These biases can be mitigated by training the RNN using trajectories with a distribution of lengths. However, we recommend training DeepCalib on trajectories of similar lengths to the actual trajectories the user is interested in characterizing (see the supplementary material for an extended discussion). If very long stationary trajectories are available, an efficient use of DeepCalib is to apply it on a sliding window along the trajectory and average its predictions, similarly to what was done for anomalous diffusion in Ref. 36.
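The sliding-window strategy mentioned above can be sketched as follows. The callable `predict` stands for any estimator that maps a fixed-length segment to one or more parameters (for instance, a trained network wrapped in a function); it is a hypothetical placeholder, as are the function name and the default window and step sizes.

```python
import numpy as np


def sliding_window_estimate(trajectory, predict, window=1000, step=250):
    """Average the output of `predict` over windows sliding along the trajectory.

    Returns the mean prediction, its standard deviation across windows,
    and the number of windows used.
    """
    x = np.asarray(trajectory)
    if len(x) < window:
        raise ValueError("trajectory shorter than one window")
    starts = range(0, len(x) - window + 1, step)
    predictions = np.array([predict(x[s:s + window]) for s in starts])
    return predictions.mean(axis=0), predictions.std(axis=0), len(predictions)


# Stand-in estimator (inverse variance of the segment) used only for illustration.
long_trajectory = np.random.default_rng(0).standard_normal(5000)
mean, spread, n_windows = sliding_window_estimate(
    long_trajectory, predict=lambda segment: 1.0 / np.var(segment)
)
print(n_windows)  # 17
```

Because overlapping windows share samples, their predictions are correlated, so the spread across windows should be read as an indication of variability rather than as a standard error.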

To further assess the ability of DeepCalib to address deviations in the test data from the data employed in the training, we have studied how the RNN trained for a symmetric double-well quartic potential (employed in Figs. 3 and 4) performs when applied to a trap in which the two wells have different depths. As shown by Fig. S6 in the supplementary material, the performance of DeepCalib in determining the distance between the two wells is essentially unaltered by this asymmetry.

Finally, we discuss the possible concern that for neural-network-based predictions one lacks a measure of the confidence of the calibration.^{39} This problem emerges because a neural network returns an answer even if asked a “trick question,” such as measuring a parameter that does not belong to the physical model under measurement. For example, one could employ a neural network trained to characterize the stiffness of a harmonic trap on data from a bistable potential. As another example, one could employ a neural network trained to determine the potential barrier height in a double-well trap on data from a harmonic trap. In both cases, the neural network will return a value, which is largely meaningless. Therefore, it is useful to be able to detect this meaninglessness. As a solution to this problem, it is possible to quantify the reliability of the neural network prediction by training an ensemble of neural networks and considering how scattered their predictions are. A high variance between predictions signals a low reliability. We test this method by training 40 different networks on data from a harmonic trap and contrasting the variance of their predictions between the case in which they are actually applied to a harmonic potential and the one in which they are “tricked” into estimating a (meaningless) spring constant for a double-well potential. As shown in Fig. S10 in the supplementary material, the RNNs applied to the wrong model display a much higher variance in their predictions. As a final remark, we also note that the standard methods (e.g., the variance method, the autocorrelation method, or FORMA) would suffer from the same issues, if applied in the same way.
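The ensemble check described above reduces to comparing the predictions of several independently trained models on the same trajectory. The sketch below shows only that bookkeeping step; the `models` are arbitrary callables standing in for trained networks, and the function name is hypothetical. Training the networks themselves is outside the scope of this example.

```python
import numpy as np


def ensemble_prediction(models, trajectory):
    """Return the mean and the standard deviation of the ensemble predictions."""
    predictions = np.array([model(trajectory) for model in models])
    return predictions.mean(axis=0), predictions.std(axis=0)


# Placeholder "models": in practice each would be a separately trained network.
placeholder_models = [lambda x, b=b: np.var(x) * (1.0 + b) for b in (-0.02, 0.0, 0.02)]
trajectory = np.random.default_rng(0).standard_normal(1000)
mean, spread = ensemble_prediction(placeholder_models, trajectory)
relative_spread = spread / abs(mean)  # large values would flag low reliability
```

What counts as a "large" spread has to be set by comparison with the spread obtained on data known to match the training model, as in the harmonic versus double-well comparison of Fig. S10.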

## III. DEEPCALIB SOFTWARE PACKAGE

We provide DeepCalib on GitHub as a Python open-source freeware software package, which can be readily personalized and optimized for the needs of specific users and applications.^{37} The user can easily adapt DeepCalib to the analysis of any force field by altering the stochastic differential equations describing the motion of the Brownian particle used for the simulation of the training datasets. This gives users the ability to train their own RNN in order to calibrate their specific force field with no prior machine learning knowledge. The trained RNN can also be saved to be used on other software platforms (e.g., MATLAB and LabVIEW). This opens the possibility to straightforwardly analyze any force field, even when no standard calibration techniques are available, greatly enhancing the range of microscopic systems that can be analyzed and studied.

## IV. CONCLUSION

We have introduced DeepCalib, a data-driven, neural-network approach for the calibration of microscopic force fields acting on a Brownian particle, and reported its performance. By benchmarking it on simple tasks, for which standard techniques are available, we have shown that it outperforms standard methods in challenging conditions involving short and/or low-frequency measurements. Then, we have demonstrated that it can be straightforwardly applied to non-equilibrium, unsteady force fields, for which no simple standard technique exists. We have also demonstrated that DeepCalib, while trained on simulated data, is able to generalize and successfully calibrate force fields from experimental data. Remarkably, even when the model of the force field used for the training did not perfectly match the experimental one, as in the case of the double-well trap, DeepCalib managed to extract the key features, such as the location of the potential minima and the barrier height, better than the standard methods. This demonstrates its capability to bridge the reality gap between the idealized simulation used for training and the experimental reality. We have also shown DeepCalib's robustness to measurement noise, inhomogeneities in diffusivity, variations of trajectory length, and differences between the models used to generate the training data and the properties of the testing data.

DeepCalib is, thus, a flexible method that can be used on a wide variety of calibration tasks. This can be clearly appreciated if one considers that there is no single standard technique that we could have used to address all the examples we have considered. Indeed, even for the scenarios that admit standard methods, we had to employ different tools for each case, whereas DeepCalib just needed different training sets and the minor modification of adjusting the number of outputs to match the number of the desired calibration parameters. Therefore, DeepCalib is ideal to calibrate complex and non-standard force fields from short trajectories, for which advanced specific methods would have to be developed on a case-by-case basis. Potential areas of application include the real-time calibration of bistable potentials used for information theory,^{12} improvement of the analysis of microscopic heat engines,^{18} and prediction of the free energies of biomolecules.^{66}

## SUPPLEMENTARY MATERIAL

See the supplementary material for a section with details about the neural network architecture, a section with a series of tests of the robustness of DeepCalib, a section discussing how to evaluate the robustness of the predictions provided by DeepCalib, and a section with the plots of experimental data used in the article.

## AUTHORS' CONTRIBUTIONS

All authors contributed equally to this manuscript. All authors reviewed the final manuscript.

## ACKNOWLEDGMENTS

The authors thank Harshith Bachimanchi and Martin Selin for critically revising the manuscript and the software. A.A. and G.V. acknowledge support from H2020 European Research Council (ERC) Starting Grant ComplexSwimmers (Grant No. 677511). F.C. acknowledges financial support by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the Collaborative Research Center TRR 102 “Polymers under multiple constraints: Restricted and controlled molecular order and mobility” (funded by the DFG, German Research Foundation, Project No. 189853844), and the ESF and the Free State of Saxony (Junior Research Group UniDyn, Project No. SAB 100382164).

## DATA AVAILABILITY

The data and software that support the findings of this study are openly available in GitHub.^{37}