We present a new model for computing hydration free energies by 3D reference interaction site model (3D-RISM) that uses an appropriate initial state of the system (as suggested by Sergiievskyi *et al.*). The new adjustment to 3D-RISM theory significantly improves hydration free energy predictions for various classes of organic molecules at both ambient and non-ambient temperatures. An extensive benchmarking against experimental data shows that the accuracy of the model is comparable to (much more computationally expensive) molecular dynamics simulations. The calculations can be readily performed with a standard 3D-RISM algorithm. In our work, we used an open source package AmberTools; a script to automate the whole procedure is available on the web (https://github.com/MTS-Strathclyde/ISc).

## I. INTRODUCTION

Hydration free energy, Δ*G _{hyd}*, is one of the most important properties in solution chemistry. It provides information about the partitioning of a solute between gas and solution phases and is used in calculating solubility,

^{1–3}acid-base dissociation constant,

^{4}octanol-water partition coefficient,

^{5}and protein-ligand binding free energy,

^{6}amongst other properties. The hydration free energies of non-ionic organic solutes can be computed using molecular dynamics (MD) simulations with an accuracy approaching that of experiments.

^{7}However, for many purposes (in particular, for high-throughput screening over large digital libraries of compounds), the MD approach is not ideal as it requires large computational resources.

^{7–9}Additionally, the large number of existing MD methods and the variety of input parameters both for the model itself (molecular geometry, force-field, number of molecules in the simulation box, etc.) and for the simulation protocol (time step, MD integrator, thermostat, barostat, equilibration time, etc.) together with the need to use a parallel computer architecture make it difficult for non-specialists to perform these calculations.

The 3D Reference Interaction Site Model (3D-RISM) is a molecular theory closely related to classical density functional theory.^{10–13} It describes the local solvent density around a solute using 3D spatial solute-solvent distribution functions, which are obtained via iterative solution of a system of integral equations. Using 3D-RISM, one can compute a range of thermodynamic properties as well as structural information about the solvent in a relatively short time (on the order of a few minutes).^{14–17} In principle, it can be applied to arbitrary solvent compositions and temperatures.^{18–20,72–74} However, despite recent advances, such as the introduction of semi-empirical free energy functionals,^{15,21–24} the accuracy of theory-based 3D-RISM methods (i.e., those without empirical corrections) has remained low.^{25}

In this paper, we build upon recent theoretical work by Sergiievskyi *et al.*^{25} and propose a new correction for the 3D-RISM hypernetted-chain (3D-RISM/HNC) theory. To assess the strengths and limitations of the new model, we compare its predictions to experimental hydration free energy data for small organic molecules at both ambient and non-ambient temperatures. We show that the new term not only improves predictions at 298 K, but also makes the error practically temperature-independent for the range of temperatures between 278 K and 378 K, which was studied here.

## II. THEORY

An important part of 3D-RISM theory is the so called closure expression. It provides a connection between intramolecular potential *u*(**r**) and distribution functions,

Here, *h*(**r**) is the total correlation function, *c*(**r**) is the direct correlation function, and *B*(**r**) is a bridge function, defined using diagrammatic expansion.^{14,26–29} In practice, however, one often substitutes *B*(**r**) with various approximations.

A common choice is to set *B*(**r**) = 0 in the so called HNC approximation. However, this approach suffers from poor computational convergence.^{30,31} This problem can be reduced by the use of a partial series expansion of order n (PSE-n) of the HNC closure,^{30}

where Ξ(**r**) = − *βu*(**r**) + *h*(**r**) − *c*(**r**). Setting *n* = 1 results in the partial linearized closure proposed by Kovalenko and Hirata (KH),^{26,27} a remarkably numerically stable closure. *n* → ∞ gives the HNC closure and its convergence issues. PSE-3 achieves a good balance between the two: it has been found to have good convergence and it gives results that approximate HNC predictions well.^{30} Due to these properties, PSE-3 and other PSE-n closures have been extensively applied for a variety of charged and biomolecular systems.^{32–35}

For both HNC and its n-order expansions, the excess chemical potential of the solute at infinite dilution, *μ ^{ex}*, can be derived from the RISM solute-solvent correlation functions.

^{14,30,36–38}For the HNC closure, the functional is

^{14}

where *ρ*_{α} is the number density of solvent sites *α*. For PSE-n closure,^{30}

where Θ is a Heaviside step function. Unfortunately, Δ*G _{s}* data computed by 3D-RISM with the use of these expressions as $\Delta G s = \mu RISM ex $ have a large positive bias and standard deviation when compared to available experimental data.

^{25}

In a recent paper, Sergiievskyi *et al.* have shown that Δ*G _{s}* predictions by molecular theories like Molecular Density Functional Theory (MDFT) and 3D-RISM can be improved by introducing a correction to transform the results from the grand canonical to the isobaric-isotherm ensemble.

^{25}We follow the idea from that paper in our derivation of the proper correction for Δ

*G*calculated by 3D-RISM.

_{s}The main idea behind the correction is that the original 3D-RISM is defined for the grand canonical ensemble,^{28} meaning that it computes the solute excess chemical potential with regards to state 1 (Figure 1; state 1 is the initial state with *N* + Δ*N* water molecules and volume *V*). However, experiments and MD simulations compute the free energy difference between states 2 and 0 (Figure 1; state 2 is the final state with *N* water molecules and volume *V*, state 0 has *N* water molecules and *V* − Δ*V* volume).

A derivation of the initial state correction (ISc) for the MDFT case is given in Ref. 25. Following the main result of that paper, the Δ*G _{s}* between states 2 and 0, when computed by 3D-RISM, takes the following form:

where *ρ* is the density of the bulk solvent, Δ*V* stands for the partial molar volume, and $ c \u02c6 ( k = 0 ) $ is the value of the Fourier transformed direct correlation function of pure water at k = 0 (it can be obtained from 1D-RISM calculation and expressed using the pure solvent isothermal compressibility: $\rho c \u02c6 ( k = 0 ) =1\u22121/ \rho k T \chi T $).^{25} We will refer to this expression as the ISc.

After deriving Eq. (5), Sergiievskyi *et al.* argue that the Δ*G _{s}* has an extra contribution due to the change in the chemical potential of solvent (as a result of the increase in volume) equal to

*ρkT*Δ

*V*. They propose the following formula for computing the solvation free energy:

For simplicity, we will refer to this expression as ISc^{*}. This is the formula which was used in Ref. 25 to estimate the accuracy of the corrected 3D-RISM/HNC.

Equation (6) was originally developed for MDFT, not for 3D-RISM. However, the 3D-RISM is actually defined for the case of grand canonical ensemble^{14,28} meaning that the solvent potential is imposed by the outside reservoir; therefore, there should be no change in chemical potential of the solvent due to an increase in volume. Consequently, we argue that for 3D-RISM the ISc formula (Eq. (5)) has to be used to correct the solvation free energies.

## III. METHODS

To evaluate the performance of the ISc and the ISc^{*} models at 298 K, we have used a collection of experimental hydration free energies for small organic molecules compiled by Mobley *et al.*^{39} The original dataset contained 504 molecules, but we found that two molecules were duplicated (“3_methyl_but_1_ene”/“3_methylbut_1_ene” and “2_methyl_but_2_ene”/“2_methylbut_2_ene”). All compounds in the dataset are non-ionic, with an average molecular mass of 113 g/mol and an average logP of 1.7 (logP was estimated using the RdKit^{40} implementation of the Wildman-Crippen algorithm^{41}). In addition to the molecular structures and experimental data, the dataset also includes free energies obtained using the Bennett acceptance ratio method^{42} from MD simulations performed with the Generalized Amber Forcefield (GAFF),^{43} charges computed using the Austin Model 1 - Bond Charge Correction (AM1-BCC) method, and TIP3P water.^{45}

The dataset with hydration free energies at varied temperatures was kindly granted to us by Cramer.^{46,47} The dataset contains measurements of hydration free energy for a variety of small, non-ionic organic molecules in a broad temperature range and was used for parametrization of the SM6T and SM8T solvation models.^{46,47} For simplicity, we will refer to this dataset by the name of the first author of the publications: the Chamberlin dataset. The experimental hydration free energy data in both datasets used here are given as $\Delta G hyd exp =\u2212RTln c aq / c gas $, with concentrations in mol/l, which corresponds to the choice of standard states suggested by Ben-Naim.^{48,49} The raw version of the Chamberlin dataset contained measurements collected from various sources. To identify and eliminate possible systematic and random errors in this data, we have adopted a strategy similar to that used in the original publications (Refs. 46 and 47). We have carefully described our steps in the supplementary material.^{50} The final version of the dataset contains 272 molecules with 3053 hydration free energy data points measured at temperatures between 273 K and 373 K; many values were averaged over multiple experimental measurements. The average molecular mass of the molecules in the dataset is 107 g/mol, average logP is 1.6. Both Chamberlin and Mobley datasets, including the structures of the molecules that we used, are provided in the supplementary material.^{50}

The initial geometry guess for each molecule was created using the Openbabel software package.^{51,52} After that, for each molecule, we performed a conformational search with the OPLS_2005 force field^{53} using the mixed torsional/low-mode sampling method as implemented in Macromodel.^{54} For subsequent calculations, we used only the lowest energy conformation for each molecule. The data analysis and plotting were accomplished using the Scientific Python software stack.^{55–58} Molecular structures were visualized using RDKit and Marvin.^{40,59}

For all calculations, we used a modified version of the extended simple point charge water model (cSPC/e, see Ref. 60 for details) with oxygen partial charge of −0.8476, *σ _{O}* = 3.165 72 Å,

*σ*= 1.165 72 Å,

_{H}*ϵ*= 0.1553 kcal/mol,

_{O}*ϵ*= 0.015 53 kcal/mol. For all the solutes, Lennard-Jones parameters were taken from the GAFF

_{H}^{43}and partial charges were generated using the AM1-BCC scheme.

^{44}

The water susceptibility functions were generated using the DRISM program^{14,27,60} included in the AmberTools 13 package.^{61} The set of input parameters consisted of temperature, water density, and dielectric constants. The latter two parameters were obtained using interpolation functions provided in the Water Society manual^{62} (the relative uncertainty of the density is around 0.0001% and for the dielectric constant is 0.01%). The DRISM equations were solved with tolerance set to 1 × 10^{−12} and grid spacing to 0.025 Å.

All 3D-RISM calculations were performed using the rism3d.snglpnt program from AmberTools 13 package. The grid spacing was set to 0.3 Å buffer to 30 Å and tolerance to 1 × 10^{−10}. These parameters were found to converge even for rather big systems.^{23} All calculations were performed twice, with both HNC and PSE-3 bridge closures. The outputs obtained with different closures were not mixed, meaning that DRISM, 3D-RISM, and hydration free energy evaluations were done using formulas specific to a particular closure. Integral equations were solved numerically using a modified direct inversion in the iterative subspace (MDIIS) algorithm to speed up convergence.^{63} To automate the calculation procedure, we created a Python script, which we have made freely available at https://github.com/MTS-Strathclyde/ISc.

## IV. RESULTS AND DISCUSSION

Figure 2 shows the accuracies of four different models on the 298 K dataset composed by Mobley *et al.* Remarkably, HNC/ISc achieves an accuracy equivalent to the MD simulations which used the same force field and partial charges (the MD data were taken from Ref. 39). While the HNC/ISc^{*}’s standard error is not much bigger, it has a large positive bias, providing a justification for our approach to the correction of $ \mu RISM ex $.

Due to the above mentioned problems, calculations with HNC closure did not converge for 25 molecules (details are provided in the supplementary material).^{50} On the other hand, all the PSE-3/ISc calculations converged with an accuracy only slightly worse than that for HNC/ISc. This suggests that PSE-3/ISc may be a practical alternative to HNC/ISc calculations for many systems of chemical interest.

The results predicted by both HNC/ISc and PSE-3/ISc are similar to MD: the correlation coefficient is *r* ≈ 0.97 for both models (for comparison, the correlation of HNC/ISc and PSE-3/ISc models with experiment was 0.93 and 0.91, respectively). The difference between 3D-RISM and MD approaches mostly originates from the difference in the signs of their biases (Figure 2). This indicates that the force field quality might be one of the limiting factors for the further improvement of 3D-RISM hydration free energy quality. These issues might be addressed through the combinations of 3D-RISM with quantum chemistry^{64–69} or through the use of polarizable water models.^{70}

Careful examination of Figure 2 shows that ISc models perform somewhat worse than MD for hydrophilic compounds. This is most likely caused by shortcomings of the bridge closure. Essentially, the correction presented here only addresses errors due to excluded volume effects, the problems due to the exclusion of higher order bridge terms are still present. This issue might be addressed using more advanced bridges, of which the method proposed by Duh and Henderson is one such example.^{71} This is the subject for future work.

It should be noted that the functional form of the model presented here is similar to another 3D-RISM hydration free energy correction published by Palmer *et al.* in Ref. 21: $\Delta G s , U C = \mu RISM , KH ex +a\rho \Delta V+b$. Here, *a* = − 3.312 kcal/mol and *b* = 1.152 kcal/mol are fitted coefficients for water solvent. Rearranging the ISc model into similar form gives us *a* = − 3.673 kcal/mol and *b* = 0 kcal/mol for aqueous solvent at standard conditions. The correction by Palmer *et al.* gives similar accuracy for *ambient* conditions. However, ISc similarly to 3D-RISM can be applied to a system at arbitrary temperatures and pressures. Therefore, to provide additional insights into the validity of the ISc and ISc^{*} models, we performed benchmarks for each model at different temperatures.

To the best of our knowledge, there have been no reports investigating the performance of the 3D-RISM or a similar model at a wide range of temperatures for a large dataset of compounds from different chemical classes. Therefore, the benchmark provides a landmark for existing molecular theories of solvation.

Figure 3 shows how the root-mean-square error (RMSE) of the PSE-3/ISc, HNC/ISc, and HNC/ISc^{*} predictions varies with temperature for the Chamberlin dataset. Each bar is computed for all points falling into ±5 K interval. In cases where a molecule had more than one data point in a particular interval, we only took into account its average error when computing the interval’s RMSE. Not all intervals have the same number of molecules: there are 225 points between 293 and 303 K and only 130 points between 363 and 373 K. This is reflected by somewhat larger standard errors of RMSE towards higher temperatures where there are generally fewer points. We are only comparing the 3D-RISM models here because we were not able to find similar MD data for non-ambient temperatures.

At most temperatures, ISc models are almost twice as accurate as ISc^{*}. Additionally, ISc^{*} becomes less reliable towards higher temperatures, while the ISc errors remain essentially constant for all temperatures. This is another indication that supports the validity of the ISc over the ISc^{*} model.

The error of the ISc models at 298 K is slightly higher (about 1.5 kcal/mol) for the Chamberlin dataset than for Mobley dataset (3). This difference seems to originate from differences in the molecules as well as the sources of experimental data. It is hard to determine which data are more accurate as experimental uncertainties were not published with either dataset. As in the case of ambient temperature dataset, PSE-3/ISc results are only slightly worse than HNC/ISc ones, but have an advantage due to greater numerical stability: calculations with HNC bridge did not converge in 121 cases (more details are provided in the supplementary material^{50}).

To summarize, this letter describes a new, 3D-RISM theory based on the ISc. In contrast to the models that use semi-empirical corrections like the ones described in Refs. 15, 21, and 22, this model does not require additional training and/or extensive parameterisation. At the same time, the ISc models predict the hydration free energy for a wide set of organic molecules with a good accuracy both at ambient and non-ambient temperatures. In terms of accuracy, the model performs best when combined with HNC; however, much better numerical stability with only a slight decrease in accuracy can be achieved with the PSE-3 closure.

## Acknowledgments

We are grateful to Fumio Hirata for valuable discussions about the theoretical foundations of the 3D-RISM and to Volodymyr Sergiievskyi and Daniel Borgis for helpful discussions about the corrected MDFT theory. Results were obtained using the EPSRC funded ARCHIE-WeSt High Performance Computer (www.archie-west.ac.uk), EPSRC Grant No. EP/K000586/1. D.S.P. is grateful for funding from the European Commission through a Marie Curie Intra-European Fellowship within the 7th European Community Framework Programme (FP7-PEOPLE-2010-IEF) and to the University of Strathclyde for support through its Strategic Appointment and Investment Scheme.