First-principles molecular dynamics (FPMD) simulations based on density functional theory are becoming increasingly popular for the description of liquids. In view of the high computational cost of these simulations, the choice of an appropriate equilibration protocol is critical. We assess two methods of estimation of equilibration times using a large dataset of first-principles molecular dynamics simulations of water. The Gelman-Rubin potential scale reduction factor [A. Gelman and D. B. Rubin, Stat. Sci. **7**, 457 (1992)] and the marginal standard error rule heuristic proposed by White [Simulation **69**, 323 (1997)] are evaluated on a set of 32 independent 64-molecule simulations of 58 ps each, amounting to a combined cumulative time of 1.85 ns. The availability of multiple independent simulations also allows for an estimation of the variance of averaged quantities, both within MD runs and between runs. We analyze atomic trajectories, focusing on correlations of the Kohn-Sham energy, pair correlation functions, number of hydrogen bonds, and diffusion coefficient. The observed variability across samples provides a measure of the uncertainty associated with these quantities, thus facilitating meaningful comparisons of different approximations used in the simulations. We find that the computed diffusion coefficient and average number of hydrogen bonds are affected by a significant uncertainty in spite of the large size of the dataset used. A comparison with classical simulations using the TIP4P/2005 model confirms that the variability of the diffusivity is also observed after long equilibration times. Complete atomic trajectories and simulation output files are available online for further analysis.

## I. INTRODUCTION

The importance and ubiquitous nature of water has motivated a large number of numerical simulations of its properties over the past four decades, following the pioneering work of Stillinger and Rahman.^{1–3} The development of force fields of increasing accuracy has been an active field of research in the following years.^{4–9} First-principles simulations based on a quantum mechanical description of interatomic forces have become possible due to the development of Density Functional Theory (DFT)^{10–12} and *ab initio* molecular dynamics (MD),^{13} followed by large increases in available computing power. Simulations by several groups have allowed for an increasingly accurate assessment of the quality of DFT for the description of water during the past two decades. In view of the extensive literature addressing DFT simulations of water, we refer to the recent review by Gillan, Alfè, and Michaelides for a comprehensive overview of the performance of DFT for various forms of water, including clusters, ice, and the liquid phase.^{14}

Although DFT is a first-principles approach, which implies that no empirical parameters are used in the model, the choice of the exchange-correlation functional used in electronic structure calculations significantly affects the results. It was recognized early on that the local density approximation (LDA)^{15} is unable to describe hydrogen bonds in water with sufficient accuracy, yielding a binding energy significantly larger than experimental values. Various forms of generalized gradient approximations (GGAs) were developed during the 1990’s and used in first-principles simulations of water.^{16} A consensus was reached about a decade later about the performance of a popular GGA approximation [Perdew-Burke-Ernzerhof (PBE)],^{17} showing that its use results in an overstructured liquid and a strongly underestimated diffusion coefficient at room temperature.^{18} Several authors^{19–21} have observed that raising the simulation temperature to 400 K restores some of the structural and dynamical properties, making them comparable to experimental data at 300 K. While this way of “fixing” the PBE approximation is *ad hoc* and does not have any firm theoretical foundation, it is frequently used in first-principles simulations in order to obtain an approximately correct diffusive behavior at moderate cost.

A growing number of new exchange-correlation functionals have been recently proposed, including, e.g., hybrid density functionals,^{22–24} in which a fraction of the non-local Hartree-Fock exchange energy is included, or van der Waals functionals in which long-range dispersion forces are included.^{25–28} The importance of the quantum nature of protons has also been investigated.^{29} Recently, a combination of hybrid and van der Waals functionals^{21} was shown to yield an accurate description of liquid water at ambient conditions using a simulation temperature of 330 K. The increased simulation temperature was used to approximate nuclear quantum effects. In another recent study, the Strongly Constrained and Appropriately Normed (SCAN) “metaGGA” density functional^{30} was shown to provide an accurate description of liquid water using a similarly increased simulation temperature. Additionally, it has recently become possible to perform simulations of water using higher accuracy methods, such as quantum Monte Carlo^{31} and many-body perturbation theory.^{32}

Several comparative studies of exchange-correlation functionals for the description of liquid water have been published.^{21,33–36} Such comparisons are quite demanding since long simulations are needed in order to distinguish actual differences in the models from statistical fluctuations. It is currently accepted that a minimum of 10-20 ps of simulation is necessary in order to compare the properties of different approximations, although no systematic study of this convergence time is available for first-principles simulations. As more comparative studies are needed in order to evaluate the merits of new exchange-correlation functionals, it becomes critical to provide some quantitative measure of variability in the quantities representing structural and dynamical properties of water. Comparisons of different models are often based on comparisons of the atomic pair correlation functions measured during a single MD simulation. In particular, the oxygen-oxygen pair correlation function *g*_{OO}(*r*) is routinely used to compare simulation results and assess their ability to reproduce experimental data.^{37,38} In order to enable quantitative comparisons, the computation of the variance of a physical quantity is a necessary step. This requires a sufficient number of uncorrelated configurations. However, MD simulations provide configurations that show strong short-time correlations, reflecting the continuity of atomic trajectories. Even in cases where such correlation times can be estimated, long simulations (lasting many times the correlation time) must be performed in order to get variance estimates. This goal can be reached faster (in terms of the wall-clock time of simulations) if multiple independent simulations are used.

In this paper, we address the issue of quantitatively comparing first-principles simulations of liquids, focusing on the case of water. We first assess two methods of estimation of equilibration times using a dataset of first-principles molecular dynamics simulations of liquid water. The dataset (hereafter called the PBE400 dataset) consists of simulations of an ensemble of 32 independent samples of liquid water performed using density functional theory. We analyze the Kohn-Sham energy, number of hydrogen bonds, and pair correlation functions and compute the diffusion coefficient for each independent sample. The comparison of averaged physical properties between samples provides estimates of the variability of various quantities, establishing quantitative bounds for comparisons of different approximations.

## II. COMPUTATIONAL PROCEDURE

First-principles molecular dynamics simulations were performed on an ensemble of thirty-two independent 64-molecule water samples. Electronic structure calculations were carried out within the Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional^{17} and the plane-wave, pseudopotential formalism. Simulations were performed using the Qbox first-principles molecular dynamics code,^{39} with a plane-wave energy cutoff of 85 Ry and norm-conserving pseudopotentials generated according to the procedure of Hamann, Schlüter, and Chiang^{40} and Vanderbilt.^{41} We use Born-Oppenheimer dynamics and recompute the electronic ground state of the system at every molecular dynamics step. All simulations were run using a time step of 10 (a.u.) (0.2419 fs). The Bussi-Donadio-Parrinello (BDP) thermostat^{42} was used to maintain a temperature of 400 K. This choice of temperature was motivated by previous studies^{19,20} in which it was found that first-principles MD simulations of water performed at 400 K with the PBE exchange-correlation functional yield structural and dynamical properties similar to experimental data at ambient conditions. Although the PBE approximation is not the most accurate description of liquid water known today, we choose this level of approximation since (a) it is a popular way to obtain a reasonable description of water at low cost, e.g., in aqueous solutions and at interfaces and (b) the associated low cost of the simulation allows us to generate a much more complete dataset than would be possible with more accurate functionals (in particular hybrid functionals). It would clearly be of interest to repeat our simulations using other density functionals for comparison. This is the subject of ongoing investigations.

As is common practice, we simulate heavy water (D_{2}O) in order to allow for larger MD time steps.

### A. Sample preparation

Thirty-two 64-molecule D_{2}O samples were prepared using the following procedure. Starting from an empty cubic unit cell of size a = 23.46 (a.u.), corresponding to a density of 1.11 g/cm^{3}, D_{2}O molecules were inserted in the unit cell at random positions with random orientations. Insertions were rejected if any atom of the inserted molecule was located within a distance of 4.2 (a.u.) from any atom already present in the unit cell. The distance between atoms was computed taking into account periodic boundary conditions. Insertion attempts were repeated until the unit cell contained 64 molecules. This sample preparation method yields configurations that are far from equilibrium, having a Kohn-Sham energy of about 0.5 (a.u.) (∼4.9 kcal/mol per molecule) higher than a typical equilibrium configuration at *T* = 400 K. While it is customary to generate initial configurations for DFT simulations using equilibrated classical MD simulations based on force fields (e.g., TIP4P, TIP5P, or SPC/E), we have intentionally avoided this approach here in order to avoid any bias towards structures that may be favored by a particular force field.

Simulations were performed as simultaneous runs for the 32 independent samples. For reasons of convenience of organization of the computations, simulations were run as 120 successive runs of 2000 MD steps each, labeled md001 to md120. Each run has a duration of 0.48 ps, for a total simulation time of 58 ps for each sample. The random number generator used in the BDP thermostat was initialized using the current time returned by the C time() function as a seed. The starting times of the 32 runs were staggered to ensure that different seeds were used for different samples. In the first run (md001), no thermostat was used and the temperature showed a rapid rise to about 600 K. In the following runs, the thermostat was set to a temperature of 600 K and gradually lowered to the target of 400 K. Runs md002-md004 used a thermostat temperature of 600 K, runs md005-md007 used a thermostat temperature of 500 K, runs md008-md009 used a thermostat temperature of 450 K, and all subsequent runs used a thermostat temperature of 400 K. The BDP thermostat time constant *τ* described in Ref. 42 was set to 10 000 (a.u.). During the first 13 ps, corresponding to the initial equilibration phase, the electronic ground state was computed at each ionic step using 4 self-consistent (SCF) iterations. This number was then increased to 5 for increased accuracy during the rest of the simulation. A full description of atomic trajectories, Kohn-Sham energy, atomic positions, velocities, and forces was recorded for the entire simulation, including the initial equilibration phase. Since each of the 32 samples was independently started from random configurations, we can expect the initial configurations to be far from equilibrium. The first step in the analysis of the trajectories consists in determining when equilibrium is reached to ensure that further data analysis and computation of physical quantities is representative of equilibrium properties. In Sec. II B, we present two methods that are used to determine when a simulation has reached equilibrium.

### B. Equilibration period detection

The first method we consider stems from the observation that the determination of the equilibration period is analogous to the problem of “warm-up” period detection in the field of discrete event simulation. In that field, the widely used marginal standard error rule (MSER) heuristic of White^{43} provides a way to calculate an optimal truncation point for removing the warm-up period from a simulation.^{44} This heuristic is based on balancing the narrowing of the confidence interval that comes from removing data far from the mean and the broadening of the confidence interval that comes from having fewer data. Let $dj*$ be the optimal truncation point for sequence *j* of length *n*_{j}, *Y*_{ij} be some observable quantity at step *i*, and $\u0232j$(*a*, *b*) be the mean of the *j*th sequence truncated from step *a* to *b*. Then the optimal truncation point is

In order to reduce the computational cost of computing $dj*$ and to remove sensitivity to individual observations, $dj*$ is computed on block averages of five values (see the MSER-5 rule of Spratt^{45}). The value of $dj*$ is computed by searching over all *d*_{j} and plotting the intermediate function

The value of the optimal truncation point can be used to discard the initial phase of the simulation that corresponds to the equilibration phase.

The second approach we consider is the heuristic of Gelman and Rubin^{46} which is widely used to detect convergence in Markov Chain Monte Carlo (MCMC) simulations. This method is based on comparing the means and variances of a given observable quantity across multiple independent simulations. When multiple independent samples obtained from different starting points converge to the same distribution, the method considers the calculation converged. This method has also been used by Cooke and Schmidler for classical molecular dynamics simulations.^{47}

This heuristic computes a value called the potential scale reduction factor (PSRF), an estimate of the factor by which the deviation might be reduced if the simulation were run further to the infinite duration limit. The PSRF is driven to 1 by the convergence of the variance within a sample and between samples and the increasing number of total steps taken. First, we begin by discarding the equilibration period values as detected on the basis of the White MSER value (this differs from the empirical recommendation of Gelman and Rubin to remove the first half of the samples). Then, for each sequence of the remaining data, we compute

the variance

*B*between each of the*m*sequences,the average

*W*of the*m*variances within sequences,$\sigma ^=n\u22121nW+1nB$: an estimate of the variance as a weighted average of

*W*and*B*,$V^=\sigma ^2+Bmn$: the scale of an approximate Student

*t*distribution,$d=2V^2var(V^)$: the degrees of freedom, where $var(V^)$ is Eq. (4) in the original paper by Gelman and Rubin,

^{46}$PSRF=V^Wd+3d+1$ (using the updated formula in the work of Brooks and Gelman

^{48}).

We compute the PSRF using the CODA MCMC analysis package.^{49} The computed value of the PSRF is then represented graphically as a function of iteration number. We determine whether the PSRF is sufficiently close to 1 by visual inspection, following a criterion adopted by Cooke and Schmidler^{47} for MD simulations.

## III. RESULTS

In this section, we apply the two methods of equilibration time detection discussed above to the Kohn-Sham energy and to the number of hydrogen bonds. We then present an analysis of atomic pair correlation functions and of the diffusion coefficient. For each of these quantities, we compare data collected in separate samples and derive estimates of the variance of the data.

### A. Kohn-Sham energy

The Kohn-Sham energy cannot in itself be compared directly to an experimentally measured energy (since the Kohn-Sham energy functional is defined up to an additive constant). However, we use its time dependence to monitor the approach to equilibrium. First, we focus on block averages $E\xafKS(i),i=1,\u2026,120$ of the Kohn-Sham energy taken over the duration of each run, i.e.,

Figure 1 shows the time evolution of $E\xafKS(i)$ for all samples during the entire simulation. Based on this data, equilibrium appears to be reached after approximately 30 runs, or about 15 ps. The stabilization of the block-averaged Kohn-Sham energy provides a rough indication of the approach to equilibrium. A more detailed analysis is however necessary to confirm that equilibrium has been reached.

We have applied the analysis of White to the Kohn-Sham energy for each of the 32 independent simulations. The MSER values are shown in Fig. 2. At first, the MSER values are high, reflecting the large variance before equilibrium is reached. As more non-equilibrium samples are discarded, the MSER value decreases until it hits a minimum value and then increases again as equilibrium data points are discarded. The MSER minimum indicates that the equilibration period ends by about the 20th run (∼10 ps). We note that one sample (s0030) has a substantially higher MSER value than the other samples until around the 60th run, at which point the sample’s MSER value drops dramatically and becomes similar to the other samples’ values. A closer inspection reveals that, around the 60th run, this sample enters into a highly structured configuration with a lower Kohn-Sham energy and a higher number of hydrogen bonds. The presence of these structured configurations increases the variance, and thus, by discarding these configurations, the variance and the MSER value decrease and become similar to the values of samples that do not enter these structured configurations. The MSER analysis on independent samples shows a first example of detection of an outlier among simulations, which may be difficult to detect by observing the time dependence of the Kohn-Sham energy alone.

Figure 3 shows the analysis of Gelman and Rubin applied to the Kohn-Sham energy after discarding the initial runs md001-md030 (corresponding to a simulation time of ∼15 ps). Starting at run md031, the PSRF value is near 1.14. As each run continues to sample equilibrium, the independent runs converge rapidly to the same distribution, driving the value of the PSRF down towards 1. Using the criterion adopted by Cooke and Schmidler,^{47} we consider that equilibrium is reached as the PSRF becomes smaller than 1.1. This confirms our estimate that equilibrium was reached approximately after run md030 (∼15 ps).

### B. Hydrogen bonding

The number of hydrogen bonds is commonly used to characterize the structure of liquid water. In this section, we use the number of hydrogen bonds as a second measure for the determination of the equilibration period. We compute the number of hydrogen bonds per molecule using the definition of Luzar and Chandler^{50} in which two water molecules are considered hydrogen bonded if the inter-oxygen distance is less than 3.5 Å and the O–H⋯O angle is less than 30°. Several other definitions of hydrogen bonding have been proposed in the past and have been reviewed by Kumar *et al.*^{51}

Figure 4 shows the analysis of White applied to the number of hydrogen bonds computed at the end of each run. We note that the equilibration period is over by about the 20th run, in good agreement with the analysis applied to the Kohn-Sham energy. The equilibration period corresponds to the time needed to create a hydrogen bond network, which explains the good correlation between convergence in the number of hydrogen bonds and the Kohn-Sham energy. The computed MSER also reveals that the same sample (s0030) identified as an outlier in the Kohn-Sham energy analysis is also an outlier with respect to the number of hydrogen bonds.

Figure 5 shows the analysis of Gelman and Rubin applied to the number of hydrogen bonds. This analysis is also in good agreement with that of the Kohn-Sham energy, with convergence happening near the 30th run. The White MSER measure and the Gelman-Rubin heuristic give somewhat different estimates of the equilibration time. This may be partly due to the arbitrariness of the criterion used^{47} to analyze the Gelman-Rubin PSRF function. However, it suggests that using multiple diagnostics (such as the two presented here) would be advisable to reliably determine the duration of the equilibration period. When averaged over all samples and runs after the equilibration period, the mean number of hydrogen bonds is 3.57 with a standard deviation of 0.12. We note that even for a well converged mean and standard deviation value, the 95% confidence interval contains values separated by almost half a hydrogen bond. Future analysis of the effect of various approximations on the number of hydrogen bonds should take this variability into consideration.

### C. Pair correlation functions

The pair correlation functions *g*_{OO}(*r*), *g*_{OH}(*r*), and *g*_{HH}(*r*) are routinely used as indicators of the quality of the models used in simulations of liquid water. Computed pair correlation functions are typically compared to corresponding functions derived from experimental data, such as neutron or X-ray diffraction structure factors. This comparison is itself a delicate procedure, affected, e.g., by various assumptions about the treatment of experimental data.^{52} A consensus has been reached over the past decade that the use of the PBE exchange-correlation functional leads to overstructured pair correlation functions at ambient conditions, compared to experimental data.^{19–21} It was also observed that raising the simulation temperature by about 100 K (i.e., simulating at 400 K) yields a reasonable description of liquid water at ambient conditions.

Rather than giving a detailed comparison of our PBE results with experimental data, we focus here on describing the statistical variability of pair correlation functions that we observe in the PBE400 dataset. We have computed the correlation functions for each separate sample over the period corresponding to equilibrium. The pair correlation functions are computed by binning interatomic distance in bins of size 0.05 Å. Figure 6 shows the *g*_{OO}(*r*) correlation functions computed for all 32 samples including simulations in runs [31:120], corresponding to a simulation time of ∼44 ps for each sample. The figure also includes the experimental data of Soper^{37,52} and Skinner *et al.*^{38} as a guide to the eye. We emphasize that our goal is not to discuss the merits of the PBE approximation in reproducing experimental data. Figure 6 shows that different samples have visibly different pair correlation functions, with deviations of up to 0.2 near the first peak and near the first minimum. This variability across samples provides a measure of the deviations that should be considered significant (or insignificant) when comparing pair correlation functions obtained using different approximations (e.g., different exchange-correlation functionals). In particular, two computed *g*_{OO}(*r*) obtained from 44 ps simulations using different approximations cannot be considered significantly different if they fall within the range of the data shown in Fig. 6.

In order to provide a more detailed picture of the time evolution of pair correlation functions in the PBE400 dataset, we have computed the *g*_{OO}(*r*) of all 32 samples in time intervals of 5 ps, corresponding to runs [1:10], [11:20], etc. Figure 7 shows the evolution of the pair correlation functions over the entire set of simulations. As expected, the pair correlations are far from their equilibrium values during the first 5 ps of simulations ([1:10]). More surprisingly, the intervals [11:20] and [21:30] yield pair correlations that are visually comparable to equilibrium values although the behavior of the Kohn-Sham energy (Fig. 1) indicates that equilibrium has not yet been reached. After the md031 run, the pair correlation functions stabilize and are similar to their average values (Fig. 6). However, a number of outliers can be observed, in particular in the intervals [51:60], [71:80], and [91:100]. During these intervals, one of the samples exhibits a more structured *g*_{OO}(*r*) than all other samples. These anomalies are notable since the pair correlation functions of these samples show a significant deviation even when averaged over 5 ps.

A more detailed investigation reveals that these outliers correspond to the presence of highly “structured” configurations, consistent with the hypothesis of Shiratani and Sasai.^{53} We have computed the local structure index (LSI) defined by these authors. We consider the example of sample s0006 during runs md056 and md101. The corresponding *g*_{OO}(*r*) are shown in Fig. 8. The pair correlation function is highly structured in run md056, whereas it is close to the average in run md101 (both runs have a duration of 0.5 ps).

We have analyzed the corresponding structures by computing the distribution of neighboring oxygen atoms around a given oxygen atom, including the first five neighbors. Figure 9 shows the calculated distributions in runs md056 and md101, with the fifth neighbor distribution highlighted in red. Consistently with the analysis of Shiratani and Sasai, we find that the highly structured configurations in md056 are associated with an increased distance to the fifth oxygen neighbor. Further analysis of these configurations will be pursued and lie beyond the scope of the present paper. Similar configurations have been observed by Santra *et al.*^{54} in *ab initio* simulations of water including van der Waals interactions.

### D. Diffusion coefficient

We have computed the diffusion coefficient from the mean square displacement (MSD) of each oxygen atom for each sample separately, using the entire simulation data (runs [1:120]). When computing dynamical properties such as the diffusion coefficient, care must be taken to ensure that the use of a thermostat does not affect the results. The BDP thermostat used in our simulations was shown^{42} to have a negligible effect on the diffusion coefficient of water (compared to NVE simulations), when used with the same parameters as we use in this work. Figure 10 shows the MSD of oxygen for all samples during the entire simulation, as well as the average taken over all samples. The MSD shows large variations from sample to sample, with a wide range of slopes, indicating that a single simulation of 58 ps is insufficient to obtain an accurate value of the diffusion coefficient. The values of *D* are computed by fitting a linear function to the mean square displacement of the oxygen atoms during the runs [31:120] (step > 60 000, i.e., after 15 ps of equilibration). The resulting values of *D* range from 0.9 × 10^{−5} to 3.4 × 10^{−5} cm^{2}/s. Figure 11 shows the distribution of the values of *D* for all samples. The standard deviation is *σ*_{D} = 0.5 × 10^{−5} cm^{2}/s (i.e., about 30% of the mean).

The wide range of values obtained for *D* shows that the diffusion coefficient computed from a single simulation can be very unreliable. In particular, one sample (s0030) yields a value nearly twice as large as the average, while another sample (s0018) yields a value that is nearly half the average.

Combining all 32 simulations, we obtain the average value *D*_{avg} = 1.96 × 10^{−5} cm^{2}/s. The standard deviation of the mean is estimated as $\sigma Davg=\sigma D/32\u22430.1\xd710\u22125\u2009cm2$/s. Our final estimate of the diffusion coefficient is therefore *D* = (1.96 ± 0.1) × 10^{−5} cm^{2}/s. The uncertainty in the final value of *D* is still substantial in spite of the large amount of data used. This shows that small differences between diffusion coefficients computed in different conditions (e.g., in the presence of different ions in solution) should be interpreted with caution, in particular if they are obtained from a single simulation.

In view of the wide range of values obtained for the diffusion coefficient, it is important to establish whether the random initial configurations used to start the simulations may be the cause of this variation, i.e., whether some initial configurations could induce fast or slow diffusivity beyond the 15 ps equilibration period. We have verified that this is not the case by running independent classical MD simulations. We used the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) classical molecular dynamics code^{55} and the TIP4P/2005 force field.^{56} Starting from 32 64-molecule water samples initialized in the same way as in the PBE400 simulations, we ran 32 independent NVT simulations at T = 300 K for a duration of 1.4 ns. We used the BDP thermostat with the same relaxation time used in the PBE400 simulations. The recorded MSD is shown in Fig. 12 and shows a substantial variability of diffusive behavior. From Fig. 12, it can be seen that the slope of the MSD shows large fluctuations on the scale of the PBE400 simulation data collection period (44 ps). We conclude that a large variability in diffusivity is present in both FPMD and classical simulations when using samples of 64 molecules. A closer examination of the MSD of individual samples (see the supplementary material) reveals that, in both classical and FPMD simulations, the diffusivity shows abrupt changes between fast and slow diffusion, with small or vanishing diffusion persisting for long periods (>10 ps). The picture that emerges is that of a system that switches between a nearly frozen state and a liquid state. The dispersion in the observed values of the diffusion among different samples thus appears to result from the randomness of these transitions rather than from samples showing consistently fast or slow diffusivity. This confirms that it is essential to use multiple independent simulations in order to compute the diffusion coefficient of water using samples of 64 molecules. This variability of the diffusivity when using this sample size has likely been overlooked in the context of classical simulations, where much larger samples are typically used.

We have also verified that the variability of the oxygen pair correlation function observed in the PBE400 dataset is comparable to that observed with the TIP4P/2005 model and that it persists after long equilibration periods. Figure 13 shows the *g*_{OO}(*r*) pair correlation function of the 32 TIP4P/2005 samples measured at the beginning and at the end of the 1.4 ns simulation. The data were collected in both cases over a period of 44 ps. The variability in the pair correlation functions does not show a tendency to decrease after the 1.4 ns equilibration. We conclude that the variability observed in both the diffusivity and pair correlation functions is a persistent feature of simulations of water using 64 molecules, independent of the model used.

We note that the value of *D* obtained in our calculations appears to be comparable to the experimental value of 1.872 × 10^{−5} cm^{2}/s obtained by Mills^{57} for D_{2}O at *T* = 298 K, even though our simulations are performed at *T* = 400 K. However, the fact that the diffusion coefficient can be significantly affected by finite size effects—even when averaged over long periods^{18,58}—prevents us from making a more detailed comparison of our results with experimental data. These results suggest that a larger diffusion coefficient would be obtained with a larger system.

### E. Estimate of correlation times

The short-time correlation present in molecular dynamics simulations is known to lead to an underestimation of the variance of computed quantities.^{59} In order to determine an appropriate sampling interval for the computation of variances, we have evaluated the autocorrelation function of the block-averaged Kohn-Sham energy for all independent samples. We use the implementation of the autocorrelation function provided by the acf() function of the R statistics package.^{60} Figure 14 shows the autocorrelation of the block-averaged energy of sample s0000 computed over the range of runs [31:120]. Significant amplitudes of the autocorrelation can be identified by comparison with the standard error of the autocorrelation: the dashed horizontal lines show the values $\xb12/n$. Amplitudes outside of this range are considered significant. The presence of significant amplitudes for lag values of 1 and 2 in Fig. 14 indicates that the correlation time for sample s0000 is of the order of 2 runs, or approximately 1 ps. Interestingly, autocorrelation functions of other samples show a wide range of correlation times. An overview of the autocorrelation functions is given in Fig. 15 showing that different samples can have rather different autocorrelation functions, with decay times varying between 2 and 12 runs (∼1-6 ps).

## IV. CONCLUSIONS

We have analyzed a dataset of first-principles simulations of water consisting of multiple simulations of independent samples. Using independent runs provides a way to compute error estimates for various physical quantities by comparing variance between samples and variance during a simulation. This feature is used to estimate the equilibration time using the Gelman-Rubin statistic, providing a systematic and quantitative way of evaluating the equilibration process. Equilibration times derived from the MSER statistic of White yield similar conclusions. An analysis of the pair correlation functions, number of hydrogen bonds, and diffusion coefficient derived from the dataset shows that their variability across samples can be large and that an accurate determination of these quantities may require longer simulations and larger sample sizes than those are typically used. A comparison with classical simulations using the TIP4P/2005 model shows that the variability of the diffusivity and pair correlation functions is a persistent feature of simulations using 64 molecules. The observation of outliers in various computed quantities also strongly suggests the use of multiple independent runs to ensure reliable conclusions. The size of the PBE400 dataset allowed us to identify relatively rare events, such as the appearance of highly structured configurations leading to temporarily overstructured pair correlation functions, as previously identified by Shiratani and Sasai. The PBE400 dataset is available online including all trajectories and electronic structure information.^{61} Subsets of the full dataset are available and provide representative configurations of liquid water for future validation and verification of electronic structure calculations.

## SUPPLEMENTARY MATERIAL

See supplementary material for a detailed figure showing the mean square displacements for each sample separately.

## ACKNOWLEDGMENTS

We thank G. Galli for numerous discussions, A. Gaiduk for verification of the data, and D. Donadio for advice on LAMMPS simulations. This work was supported by the U.S. Department of Energy Office of Basic Energy Sciences under Grant No. DE-SC0008938 and by the MICCoM DOE Computational Materials Center. Computer time was provided by the DOE Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357.

## REFERENCES

The marginal standard error rule was originally called the marginal confidence rule by White;^{43} however, it has since come to be referred to as the MSER. See R. Pasupathy and B. Schmeiser, WSC ’10 Proceedings of the Winter Simulation Conference, pp. 184-197, Baltimore Maryland, (2010).