In this work, a systematic protocol is proposed to automatically parametrize the non-polar part of implicit solvent models with polar and non-polar components. The proposed protocol utilizes either the classical Poisson model or the Kohn-Sham density functional theory based polarizable Poisson model for modeling polar solvation free energies. Four sets of radius parameters are combined with four sets of charge force fields to arrive at a total of 16 different parametrizations for the polar component. For the non-polar component, either the standard model of surface area, molecular volume, and van der Waals interactions or a model with atomic surface areas and molecular volume is employed. To automatically parametrize a non-polar model, we develop scoring and ranking algorithms to classify solute molecules. The their non-polar parametrization is obtained based on the assumption that similar molecules have similar parametrizations. A large database with 668 experimental data is collected and employed to validate the proposed protocol. The lowest leave-one-out root mean square (RMS) error for the database is 1.33 kcal/mol. Additionally, five subsets of the database, i.e., SAMPL0-SAMPL4, are employed to further demonstrate that the proposed protocol. The optimal RMS errors are 0.93, 2.82, 1.90, 0.78, and 1.03 kcal/mol, respectively, for SAMPL0, SAMPL1, SAMPL2, SAMPL3, and SAMPL4 test sets. The corresponding RMS errors for the polarizable Poisson model with the Amber Bondi radii are 0.93, 2.89, 1.90, 1.16, and 1.07 kcal/mol, respectively.

## I. INTRODUCTION

Solvation, in which separated solvent and solute molecules are combined to form a solution, is an elementary physical process in nature that provides a foundation for more complicated chemical and biological processes, such as ion channel permeation, protein-ligand binding, electron transfer, signal transduction, DNA specification, transcription, post-transcription modification, gene expression, and protein synthesis. Therefore, the understanding of solvation is a prerequisite for the quantitative study of other more complex chemical and biological processes and is of paramount importance to chemistry, physics, and biology.^{14,67,78} The most basic and reliable experimental observation of solvation is the solvation free energy, which measures the free energy change in the solvation process. As a result, one of the most important tasks in the solvation modeling and computation is the prediction of solvation free energies, which has recently drawn much attention in computational biophysics and chemistry.^{14,15,46,67,78,88}

A large variety of solvation models has been developed in the past few decades for solvation analysis and solvation free energy prediction. In general, these models can be classified into either knowledge based models or physics based models. Knowledge based models usually utilize a large amount of available data of solvation free energies to train statistical models for the solvation free energy prediction. One example of this type of models makes use of solvent accessible surface areas, solvent excluded surface areas, or calibrated atomic surface areas for solvation free energy predictions.^{85} Knowledge based models can provide relatively accurate predictions provided that enough experimental data are available. On the other hand, physics based solvation models are based on fundamental laws of physics. These models can be further classified into three major classes. One of them is explicit solvent models in which both solvent and solute molecules are described at the atomic or electronic level. This class of models is accurate in general, but is usually computationally expensive.^{45,49,72} Another class of models is integral equation based solvation theories,^{36,65,66} in which the solute molecule is still modeled at the atomic level, while the solvent is modeled by statistical mechanics, such as liquid density functional theory, Ornstein-Zernike equation with hypernetted-chain, Percus-Yevick equation, or other specified closure. Integral equation based solvation models reduce the number of degrees of freedom dramatically compared to explicit solvent models. A major feature of integral equation approaches is that these methods are able to provide a good approximation of solvent microstructures near the solvent-solute interface.^{28} The other class of the physics based solvation models is implicit solvent models,^{1,7,12,22,27,61,64,71,89} in which the solute molecule can be modeled at either the atomic or the quantum mechanical level, while the solvent is described as a dielectric continuum.

Compared to other methods, implicit solvent models have advantages of involving a small number of degrees of freedom, highly accurate modeling of strong and long-range electrostatic interactions that often dominate solvation phenomena, and convenient treatments of solvent-solute electronic polarization effects.^{12} In implicit solvent models, the solvation free energy is typically divided into polar and non-polar components. The polar solvation free energy is due to electrostatic interactions and can be modeled by a number of approaches, including Born dielectric sphere model, Kirkwood model,^{43} generalized Born (GB) model,^{2,16,19,30,31,41} polarizable continuum model (PCM),^{18,39,81} and Poisson-Boltzmann (PB) theory.^{29,37,62,70,73} Among these approaches, Born dielectric sphere model which treats the solute molecule as the dielectric sphere with a centrally located atomic charge is the simplest one. The Kirkwood model gives analytical expressions for electrostatic potentials in a spherical solute-solvent system with multiple charges inside the solute molecule.^{43} The GB model^{2,16,19,30,31,41} is able to deal with arbitrarily shaped molecules and account for the impact of each point charge by its effective distance (i.e., the Born radius) from the solvent-solute interface.^{13,80} The GB model is relatively faster, while depending on other methods, such as the PB theory, for its parametrization. The PCM describes solute electrons by using quantum mechanics so that solvation induced polarizations can be accounted for in an iterative procedure. A few different versions of PCM models have been reported, including dielectric PCM, integral equation formalism PCM, and the variational PCM models.^{18,39,81} PCM is relatively computationally expensive due to its quantum mechanical charge calculation while its accuracy is bounded by the PB type of treatment of electrostatic potential as well as the quality of solvent-solute interfaces. The PB theory can be derived from more fundamental Maxwell’s equations.^{27,37,62,70,73} It is more accurate than the GB model and computationally more efficient than the PCM. The PB can be easily coupled with the quantum mechanics for a more accurate description of the solute charge density.^{9,86,87} It has become one of the most popular solvation models due to its relatively low computational cost and high modeling accuracy for biomolecules.

The non-polar solvation component has been modeled by a number of terms. A popular approach, based on the scaled-particle theory (SPT) for non-polar solutes in aqueous solutions,^{63,76} is to use a solvent-accessible surface area (SASA) term.^{53,79} It was shown that a solvent-accessible volume (SAV) term is relevant in large length scale regimes.^{38,51} Recent studies indicate that SASA based solvation models may not describe van der Waals (vdW) interactions near solvent-solute interface.^{11,22,23,82} A combination of surface area, surface enclosed volume, and vdW potential has been shown to provide accurate non-polar solvation predictions.^{10,83}

The polar and non-polar components are typically decoupled in traditional implicit solvent models. Recently, new efforts have been made to couple polar and non-polar components in the literature.^{7,17,89} Differential geometry based solvation model makes use of the differential geometry theory of surfaces to dynamically couple polar and nonploar models by surface evolution, which is driven by free energy optimization.^{7–9,89} Coupled with an optimization procedure, this model was shown to deliver some of the best non-polar solvent free energy analysis.^{10} By applying constrained optimization to non-polar parameter selection, this model provides state-of-the-art solvation free energy fitting and cross validation results for a large number of solute molecules.^{83}

It is important to validate solvation models by experimental data. SAMPLx blind solvation prediction project, aimed at testing protein ligand binding models, also provides benchmark tests of the performance of the solvation models. Many different solvation approaches have been tested by the SAMPLx challenging molecules.^{21,25,26,34,35,55,56,58–60,68,69} Implicit solvent models are among the most competitive approaches in SAMPLx tests.

The objective of the present work is to develop an automatic protocol to parametrize implicit solvent models for the *blind* solvation free energy prediction. In our approach, we utilize PB theories for computing electrostatic solvation free energies. Both a point charge based PB model and a Kohn-Sham density functional theory (KSDFT) based polarizable PB model are developed in the present work. For the KSDFT based PB model, an iterative process is developed to take care of the solvent polarization and solute response.^{9} In our non-polar models, we employ either a collection of surface area, molecular volume and van der Waals interactions, or a combination of atomic surface areas and molecular volume. To determine parameters in our non-polar models, we assume that similar molecules admit the same set of non-polar parameters. In this work, we develop two parametrization algorithms. One of them is based on a functional group scoring, and the other is based on a nearest neighbor search. Both algorithms utilize machine learning methods to optimize parameters. A database contains the solvation free energies in aqueous solvent for 668 solute molecules which is collected from the literature^{4,57} to validate the proposed parametrization protocol.

The rest of this paper is organized as follows. In Section II, we present our solvation models. Both polar and non-polar models are discussed. Sections II B and II C present two non-polar parametrization algorithms, namely, functional group scoring and nearest neighbor approaches. An algorithm for unified blind solvation free energy prediction is developed in Section II D. The blind prediction of a large number of solvation free energies is carried out in Section III. The leave-one-out test of 668 molecules is employed to validate the proposed protocol. Finally, the accuracy and robustness of the proposed protocol is demonstrated by five SAMPL test sets. This paper ends with a conclusion.

## II. MODELS AND ALGORITHMS

### A. Solvation models

As an implicit approach, our solvation model has polar and non-polar components,

#### 1. Polar solvation models

For polar solvation modeling, we utilize either the standard Poisson model or a KSDFT based Poisson model (i.e., a polarizable Poisson model) for polar solvation free energy calculation. These two approaches are described below.

##### a. The Poisson model.

where *ϵ*(**r**) is the discontinuous dielectric constant and is chosen as *ϵ _{m}* = 1 in the solute domain Ω

_{m}and at

*ϵ*= 80 in the solvent domain Ω

_{s}_{s}. Here

*ϕ*is the electrostatic potential,

*N*is the total number of partial charges in the solute, and

_{m}*q*is the

_{i}*i*th charge at the position of

**r**

_{i}. Additionally,

*ϕ*

_{s}and

*ϕ*

_{m}are limiting values of the potential from solvent and solute domains, respectively. Here

**n**is the outer norm vector at point

**r**of the interface Γ. Finally, Ω and ∂Ω are computational domain and domain boundary, respectively.

We use the solvent excluded surface (SES) in the present work and it is generated by using our in-house Eulerian solvent excluded surface (ESES) software.^{50} ESES automatically embeds the SES into the Cartesian mesh, which is convenient for the finite difference type of Poisson solvers. The matched interface and boundary based PB (MIBPB) software package^{6,27,84} is used to solve Eq. (2) with appropriate charge force fields for {*q _{i}*} and radius sets in SESs.

The polar solvation free energy is then calculated as

where *ϕ*_{home}(**r**_{i}) is the solution of the Poisson equation with uniform dielectric constant *ϵ _{m}* over the whole domain Ω.

##### b. The KSDFT based polarizable Poisson model.

Instead of using a given partial charge force field, the KSDFT based polarizable Poisson model utilizes a DFT approach to compute the charge distribution in the solvent environment. As such, this approach partially takes into account for solvent polarization and solute response in an iterative procedure. The combination of the Poisson equation with SES and KSDFT has been reported in the literature.^{86,87} Recently, we have proposed a differential geometry based polarizable PB model that further allows solvent-solute interface to be optimized in the Poisson-Kohn-Sham (PKS) formalism.^{9} In the present work, we adopt SES in our PKS software for the sake of an accurate estimation of atomic surface areas via our ESES software package.^{50} Consequently, our MIBPB software is also employed for solving the Poisson equation with sharp interfaces.^{6,27,84}

In this approach, the Poisson model is^{9}

where *ρ*_{total}(**r**) = *qn*(**r**) − *qn _{n}*(

**r**) is the charge distribution in the solute domain, with

*q*being the unit charge of an electron,

*n*(

**r**) being the electron density, and $ n n ( r ) = \u2211 I Z I \delta ( r \u2212 R I ) $ being the nucleus density, in which

*Z*and

_{I}**R**

_{I}are the atomic number and position vector of nucleus

*I*, respectively.

where *m* is electron mass and *ħ*^{2} = *h*/2*π* with *h* being the Planck constant. Here *ψ _{i}* and

*E*are the

_{i}*i*th KS orbital and

*i*th eigenvalue, respectively, and $ U eff \u2250q \varphi RF + U eff 0 $ is the effective potential of the generalized Kohn-Sham equation with

*ϕ*

_{RF}=

*ϕ*−

*ϕ*

_{0}being the reaction field potential of the solute in the solvent environment, in which

*ϕ*and

*ϕ*

_{0}are the electrostatic potentials generated by the solute in the solvent and solute environments, respectively. Finally, $ U eff 0 $ is the effective Kohn-Sham potential in the vacuum environment. Note that the solution of Eq. (5) gives rise to the electron density $n ( r ) = \u2211 i | \psi i | 2 $, which, in turn, updates the charge distribution for Eq. (4). A self-consistent iterative process has been developed to solve Eqs. (4) and (5) to their convergence

^{9}starting from solving the KSDFT in vacuum as an initial guess to the coupled system.

For KSDFT calculations, we employ the SIESTA solver,^{75} which uses pseudopotential to eliminate the complicated effects of core electrons. In all calculations, the default double-*ζ* plus single polarization (DZP) basis is used. The MeshCutOff is set as 125 Rydberg and local density approximation (LDA) is used to approximate the exchange correlation potential. The solution method is set to be “diagon.”

In this formalism, the polar solvation free energy is computed by

#### 2. Non-polar solvation models

For the non-polar solvation component, we use either vdW term based model or an atomic surface area based model.

##### a. A vdW term based non-polar model.

In our earlier work, the non-polar solvation free energy was modeled by^{7–9,89}

where *γ* and *A* are, respectively, surface tension and area of the solute molecule. Additionally, *p* and *V* are, respectively, surface enclosed volume and hydrodynamic pressure difference. Finally, *ρ*_{0} is the solvent bulk density, and *U*^{vdW} is the vdW interaction potential, i.e., the Lennard-Jones (LJ) potential, the detailed implementation of vdW interaction potential can be found in the work,^{7,83}

where *N* is the total number of atomic types in a given solute molecule and *σ _{s}* is the probe radius. Here, ε

_{α}and

*σ*

_{α}are the vdW potential well depth and vdW radius of the

*α*th type of atoms, respectively, and

**r**

_{i}is position of the

*i*th atom in the

*α*type of atoms.

In Eq. (7) the integration is over solvent domain Ω_{s}. This non-polar solvation free energy model was shown to offer excellent predictions of experimental data for various non-polar molecules.^{10} In our recent work, we have demonstrated superb results for a large number of polar and non-polar molecules when the non-polar model in Eq. (7) is combined with a PB model for the polar solvation component.^{83}

In this non-polar solvation free energy model, the parameters *γ*, *p*, *σ _{s}*, and ε

_{α}need to be learned from a training set. However, one can set

*ρ*

_{0}= 1 because its effect can be absorbed to ε

_{α}.

The LJ potential offers a physical model for dealing with vdW interactions near the solvent-solute interface. However, a drawback of this non-polar term is that the LJ potential energy is nonlinear with respect to the probe radius (*σ _{s}*), which gives rise to a difficulty for a linear optimization process. Note that the non-polar energy is linear with respect to pressure and surface tension parameters.

##### b. An atomic surface area based nonp-polar model.

Mathematically, for a small probe radius, the vdW term for a given atom is proportional to the atomic surface area. Therefore, atomic surface area approach used by Kollman and co-workers^{85} should have a similar modeling effect as that of the LJ potential. Based on this observation, we propose the following non-polar solvation energy model:

where *N* is the total number of atomic types in a given solute molecule. Here, *γ*^{α} and Area^{α} are the surface tension and surface area of the *α*th type of atoms, respectively, and $ Area i \alpha $ is the surface area of the *i*th atom in the *α* type of atoms. In this non-polar solvation free energy model, the parameters *γ*_{α} and *p* need to be learned from a training set. Both the LJ and the atomic-surface-area based non-polar solvation models are validated in the present work.

In the above non-polar solvation models, the solvent excluded surface area, atomic solvent excluded surface areas, and molecular enclosed volume are calculated by our in-house software ESES.^{50}

### B. Functional group classification

#### 1. Functional group modeling

In our non-polar solvation model, we assume that each functional group of molecules admits the same set of optimal parameters. Similar approach has been successfully used in the literature,^{74} including ours.^{10,83} In this work, we further incorporate machine learning type of methods for the non-polar solvation free energy prediction. To this end, the whole dataset, which contains 668 molecules, is partitioned into training sets and testing sets. In the leave-one-out test case, in each step we only leave one molecule out as the test set, and all the other molecules are regarded as the training set. In a specific leave-SAMPLx-out study, all the SAMPLx molecules are left as the testing set, while the remaining molecules are treated as the training set. Furthermore, the training set in this scenario is also divided into two parts: (i) the mono-functional group molecules, in which molecules of a specific functional group are used to train a set of parameters for all similar molecules and (ii) the poly-functional groups molecules in which scoring weights are used to weight the contributions of parameters of various involved functional groups.

Table I lists twenty functional groups used in the training set. We denote {*T*_{1}, *T*_{2}, …, *T _{n}*} as a given set of

*n*molecules with a specific functional group from the above 20 groups. For a molecule indexed

*j*, the solvation free energy calculated from the polar solvation free energy model, atomic surface areas and molecular volume are listed as

where *j* = 1, 2, …, *n*.

In our approach, the solvation free energy is modeled as the sum of the polar and non-polar parts, thus for the *j*th molecule, the corresponding modeled solvation free energy can be expressed as

where **P**≐{*γ*^{1}, *γ*^{2}, …, *γ ^{N}*,

*p*} is the set of parameters to be optimized.

For a given set of molecules with the same mono-func tional group, we denoted $\Delta G ( P ) \u2250 G 1 ( P ) , G 2 ( P ) , \u2026 , G n ( P ) $, and the associated experimental solvation free energy is denoted as $\Delta G Exp \u2250 \Delta G 1 Exp , \Delta G 2 Exp , \u2026 , \Delta G n Exp $. Then the optimal parameter set can be obtained by solving the following Tikhonov regularized least square problem (also known as ridge regression) which has a closed form solution:

where ‖∗‖_{2} is the *L*_{2} norm of the quantity ∗. Here *λ* is the regularization parameter chosen to be a large number, such as 100, in the present work to ensure the dominance of the first term and avoid over-fitting through controlling the magnitude of ‖**P**‖_{2}.

Group . | Number . | Group . | Number . | Group . | Number . | Group . | Number . |
---|---|---|---|---|---|---|---|

Alkynyl | 8 | Alkenyl | 38 | Aldehyde group | 11 | Nitrile group | 5 |

Carboxyl | 7 | Ester group | 34 | Ketone | 23 | Amino | 35 |

Nitro | 9 | Alcoholic hydroxy | 33 | Phenolic hydroxyl | 16 | Ether | 22 |

Alkane | 38 | Aromatics | 33 | Nitrogen heterocyclic | 19 | Chlorinated hydrocarbon | 53 |

Nitrate | 5 | Amid | 7 | Thiol | 4 | Thioether | 5 |

Group . | Number . | Group . | Number . | Group . | Number . | Group . | Number . |
---|---|---|---|---|---|---|---|

Alkynyl | 8 | Alkenyl | 38 | Aldehyde group | 11 | Nitrile group | 5 |

Carboxyl | 7 | Ester group | 34 | Ketone | 23 | Amino | 35 |

Nitro | 9 | Alcoholic hydroxy | 33 | Phenolic hydroxyl | 16 | Ether | 22 |

Alkane | 38 | Aromatics | 33 | Nitrogen heterocyclic | 19 | Chlorinated hydrocarbon | 53 |

Nitrate | 5 | Amid | 7 | Thiol | 4 | Thioether | 5 |

#### 2. Functional group scoring

Most molecules in the SAMPL blind test sets involve poly-functional groups. In this case, we further employ the poly-functional group molecules in the training set for training the optimal relative scoring weights between different functional groups. According to the relative scoring weights, the scoring weights between all the functional groups can be obtained through a simple normalization procedure. We denote $ T \u0303 1 , T \u0303 2 , \u2026 , T \u0303 n \u2032 $ as a given set of poly-functional group molecules that has the same functional groups in the training set. The associated optimized parameter sets are **P**_{1}, **P**_{2}, …, **P**_{m}, where *m* is the number of functional groups in this set, and each **P**_{i}, *i* = 1, 2, …, *m* is learned through solving the regularized optimization problem given by Eq. (12). For the *j*th molecule in this poly-functional group set, we model its solvation free energy as

where $\Vert \omega \Vert 1 \u2250 \u2211 i = 1 m \omega i =1$, with *ω _{i}* being the scoring weight of

*i*th functional group.

The relative scoring weights for the *m* functional groups associated to this set of poly-functional groups molecules can be learned via solving the following constraint optimization problem,

subject to

and

where $\Delta G \u0303 ( \omega ) $ and $\Delta G \u0303 Exp $ represent, respectively, the predicted and experimental solvation free energies for this group of poly-functional group molecules. Since both the optimization object given by Eq. (14) and the constraint conditions in Eqs. (15) and (16) are convex with respect to the scoring weights ** ω**, the above constrained optimization can be easily solved via a convex optimization solver in the CVX software package.

^{32,33}

In the rest of this section, we provide an example to illustrate the procedure of functional group based approach for solvation free energy prediction. As shown in Fig. 1, the target molecule 2-chlorosyringaldehyde (Pubchem ID: 53479) contains four different functional groups. We need to find scoring weights for these functional groups. Note that in the above molecular searching scheme, pairwisely, the relative weights for each two functional groups can be determined by solving the constrained optimization problem in Eqs. (14) and (15) for molecules in the two corresponding functional groups. According to the pairwise relative weights, the functional group scoring in the target molecule can be achieved.

### C. Nearest neighbor search

The above functional group based model is tested to be able to provide very accurate solvation free energy prediction provided that for each test molecule, parameters for all of the involved mono-functional groups can be determined, and a set of molecules with the same ploy-functional groups can be found in the training set as well. However, for some complex molecules that contain functional groups beyond those listed in Table I, this method fails. For instance, considering a SAMPL1 test molecule, thifensulfuron, as shown in Fig. 2, it has a very complex structure and contains functional groups cannot be found in Table I. We therefore propose the following ranking algorithm.

#### 1. Atomic feature based molecular ranking

To deal with general and complex molecules, we propose another approach to rank molecules and develop a nearest neighbor approach for non-polar solvation free energy prediction. Our molecular ranking is based on atomic features. Our basic assumption is that if two solute molecules have similar atomic features, their chemical properties are also similar. Therefore, these molecules should share the same parameter set for the non-polar solvation free energy model. This approach is further described below.

Our method is based on the ansatz that molecules chemical properties are mainly determined by atomic features, including structural features and atomic electrostatic features. Among them, atomic structural features include atomic type, atom hybridization state, and bonding information. Atomic electrostatic features include atomic charge, atomic dipole, atomic quadrupole, and atomic electrostatic solvation free energy. Atomic features are selected based the criteria that a given atomic feature is retained if it can effectively discriminate the previously mentioned mono-functional groups listed in Table I. Table II lists all the atomic features that are used for molecule ranking.

Feature name . | Feature name . |
---|---|

Number of atoms | Number of heavy atoms |

Number of hydrogen atoms | Number of single bonds |

Number of double bonds | Number of triple bonds |

Number of aromatic bonds | Number of each type of atoms |

Number of sp^{1} carbon atoms | Number of sp^{2} carbon atoms |

Number of sp^{3} carbon atoms | Number of sp^{1} nitrogen atoms |

Number of sp^{2} nitrogen atoms | Number of sp^{3} nitrogen atoms |

Number of sp^{2} oxygen atoms | Number of sp^{3} oxygen atoms |

Number of sp^{2} sulfur atoms | Number of sp^{3} sulfur atoms |

Maximum of atomic reaction field energy | Minimum of atomic reaction field energy |

Maximum reaction field energy of each type of atoms | Minimum reaction field energy of each type of atoms |

Maximum of atomic reaction field energy | Average reaction field energy of each type of atoms |

Total absolute charge | Total charge of each type of atoms |

Total absolute charge of each type of atoms | Maximum charge of each type of atoms |

Minimum charge of each type of atoms | The variation of each type of atomic charges |

Maximum of atomic dipole | Maximum of each atomic dipole |

Minimum of each atomic dipole | Variation of atomic dipole |

Maximum quadrupole | Maximum quadrupole of each type of atoms |

Variation of atomic quadrupole | Variation of each type of atom’s quadrupole |

Feature name . | Feature name . |
---|---|

Number of atoms | Number of heavy atoms |

Number of hydrogen atoms | Number of single bonds |

Number of double bonds | Number of triple bonds |

Number of aromatic bonds | Number of each type of atoms |

Number of sp^{1} carbon atoms | Number of sp^{2} carbon atoms |

Number of sp^{3} carbon atoms | Number of sp^{1} nitrogen atoms |

Number of sp^{2} nitrogen atoms | Number of sp^{3} nitrogen atoms |

Number of sp^{2} oxygen atoms | Number of sp^{3} oxygen atoms |

Number of sp^{2} sulfur atoms | Number of sp^{3} sulfur atoms |

Maximum of atomic reaction field energy | Minimum of atomic reaction field energy |

Maximum reaction field energy of each type of atoms | Minimum reaction field energy of each type of atoms |

Maximum of atomic reaction field energy | Average reaction field energy of each type of atoms |

Total absolute charge | Total charge of each type of atoms |

Total absolute charge of each type of atoms | Maximum charge of each type of atoms |

Minimum charge of each type of atoms | The variation of each type of atomic charges |

Maximum of atomic dipole | Maximum of each atomic dipole |

Minimum of each atomic dipole | Variation of atomic dipole |

Maximum quadrupole | Maximum quadrupole of each type of atoms |

Variation of atomic quadrupole | Variation of each type of atom’s quadrupole |

We also construct atomic features by using the statistics variables, i.e., average, maximum, minimum, and variation of quantities in Table II. Finally, we rule out redundant features, where redundancy is due to the fact that some atomic features are 100% correlated with each other. In this case, only one of these highly correlated features is retained.

Atomic features are calculated by following methods.

Molecular structural information is parsed by the Open Babel software.

^{54}Atomic charge, atomic dipole, and atomic quadrupole are obtained via the distributed multipole analysis (DMA) method,

^{77}in which the charge density is originally computed by the density function theory with B3LYP and 6-31G basis in the Gaussian quantum chemistry software.^{3,20,47}Atomic electrostatic solvation energy is calculated by our in-house MIBPB software.

^{6,27,84}

With the above selected atomic features, intuitively, we measure the similarity of molecules by the Pearson correlation coefficient of atomic feature vectors. Specifically, we denote the features of two molecules as vector *X*≐{*x*_{1}, *x*_{2}, …, *x _{k}*} and vector

*Y*≐{

*y*

_{1},

*y*

_{2}, …,

*y*}, then their similarity is measured by

_{k} where $ x \u0304 \u2250 1 k \u2211 i = 1 k x i $ and $ y \u0304 \u2250 1 k \u2211 i = 1 k y i $, and *k* is the dimension of the feature space. The higher the correlation between two atomic feature vectors indicates the more similarity between molecules.

#### 2. Nearest neighbor solvation free energy prediction

By using the above nearest neighbor based nearest molecule searching procedure, for a given molecule, we obtain the similarity ranking of the remaining molecules with this specified one, as well as correlations between molecules. In the nearest neighbor prediction of the non-polar solvation free energy, we learn the parameters in the non-polar solvation free energy expression, Eq. (9), by a given number of nearest molecules found in the training set. The number of the nearest molecules used is based on the following principles:

All the molecules whose correlation with a given molecule is greater than 0.99 are used for the parameter learning for this given molecule;

If the above criterion yields less than 5 molecules, we use 5 nearest molecules for the parameter learning. Here the 5-molecule nearest neighbor method is found to provide the best leave-one-out prediction.

### D. Unified protocol for blind solvation free energy prediction

Models for the total solvation free energy are given in Eq. (1). We utilize a unified protocol for the prediction of solvation free energy. First, for the polar solvation free energy Δ*G*^{p}, we select either the Poisson model with a given point charge force field or the polarizable Poisson model. Second, for the calculation of the non-polar solvation free energy, we note that in general, the functional group scoring approach can deliver better blind predictions than the nearest neighbor approach. Therefore, in our non-polar energy prediction step, the functional group scoring method is used whenever it works. Otherwise, we utilize the nearest neighbor approach. Our solvation prediction protocol is illustrated in a flowchart shown in Fig. 3.

## III. RESULTS AND DISCUSSIONS

### A. Date processing and model validation

#### 1. Datasets and force fields

We consider a total of 668 molecules, and the structures of these molecules are downloaded from the Pubchem project. The experimental solvation free energies of these molecules are collected from the literature. This dataset contains molecules from the Sampl blind solvation prediction projects, ranging from SAMPL0 to SAMPL4;^{25,26,34,35} and the remaining molecules in our dataset are collected from the literature.^{4,48,85} Coincidentally, there is a considerable overlap of our database with Mobley’s solvation database, which is available from http://mobleylab.org/resources.html. More precisely, 589 molecules in our database are already covered by Mobley’s. The detailed information of our dataset is provided in the supplementary material.

For both the standard Poisson model and the KSDFT based polarizable Poisson model, we consider four types of atomic radii, namely, Amber 6, Amber bondi, Amber mbondi2,^{5} and ZAP9^{60} parametrizations. Additionally, for the standard Poisson model, three sets of charge assignments, namely, OpenEye-AM1-BCC v1 parameters,^{40} Gasteiger,^{24} and Mulliken,^{5} are tested. Our MIBPB solver^{6,27,84} is utilized to solve the Poisson interface problem to obtain electrostatic solvation energies. The probe radius is set to 1.4 Å in all PB calculations.

For the KSDFT based polarizable Poisson model, the Poisson interface problem is coupled with the generalized KSDFT in a self-consistent approach. The charge density used by Poisson interface problem is calculated in *ab initio* approach by the generalized KSDFT, here generalization due to the additional solute-solvent reaction field energy is included in the effective KS potential. The Poisson interface problem is employed for calculating the reaction field energy. This approach is modified from our earlier differential geometry based polarizable PB model.^{9} We iteratively couple the SIESTA^{75} and the MIBPB solver,^{6,27,84} in which a sharp solvent-solute interface is employed. The SIESTA software with additional reaction field energy is used for charge density calculation, and our in-house software is used for reaction field energy calculation. A uniform mesh size of 0.25 Å is used for solving the Poisson equation in both the standard Poisson and Polarizable Poisson model. In the polarizable Poisson model, the computed change densities are mapped to the uniform mesh with the conservation of the total charge.

For the molecules containing iodine atoms, the current level of DFT method used in this work, including the Gaussian software, cannot handle this element appropriately. Therefore, for a uniform comparison, we ignore molecules containing iodine atoms. There is a similar situation for bromine atoms — there is no appropriate pseudo-potential for this atom in the SIESTA software.^{75} Therefore, molecules involving bromine atoms are excluded in our KSDFT based polarizable PB calculations. Table III lists four molecules in SAMPLx that are not considered in some of our predictions.

Test set . | Molecule . |
---|---|

SAMPL0 | Benzyl bromide |

SAMPL1 | Bromacil |

SAMPL2 | 5-bromouracil |

5-iodouracil | |

SAMPL3 | None |

SAMPL4 | None |

Test set . | Molecule . |
---|---|

SAMPL0 | Benzyl bromide |

SAMPL1 | Bromacil |

SAMPL2 | 5-bromouracil |

5-iodouracil | |

SAMPL3 | None |

SAMPL4 | None |

For both the Poisson model and the polarizable Poisson model, the ESES software^{50} is used for solute molecular surface generation. The MIBPB software^{6,27,84} is used for solving the Poisson model. The polarizable Poisson model is solved by the self-consistent coupled MIBPB and SIESTA software.^{75}

#### 2. Atomic surface area and molecular volume calculation

In our non-polar solvation free energy model, both atomic surface areas and surface enclosed volume are required. In a recently developed Eulerian solvent excluded surface (ESES) package, a second-order accurate numerical scheme for surface area calculations and a third-order accurate numerical scheme for volume estimations have been developed.^{50} A weighted Voronoi diagram algorithm is implemented to partition a molecular surface area into atomic surface areas. These schemes have been intensively validated by a large number of test examples. In this work, both atomic surface areas and molecular volume are calculated directly by using our ESES software package.^{50}

#### 3. Validation of atomic surface area based non-polar model

In this work, instead of using the classical non-polar model that includes the van der Waals interaction between the solvent and solute non-polar interaction^{83} as shown in Eq. (7), we utilize the atomic surface area model given in Eq. (9) for the corresponding effects. In this part, we consider a set of numerical test to verify the validity of this treatment. Our numerical results indicate that the atomic surface area model provides results as good as those obtained by the van der Waals based non-polar model.

The van der Waals interaction Eq. (7) is modeled as a semi-continuous and semi-atomic potential. More specifically, we consider the 6-12 LJ potential for modeling the van der Waals interaction between the continuum solvent and atomistic solute molecules.^{83} The probe radius of the solvent in the LJ potential is set to be 1.4 Å, which is the same as that used in all other calculations. Table IV lists the root mean square (RMS) errors with four types of atomic radii and four charge parametrization methods for the prediction of SAMPL0’s solvation free energies. From these results, it is seen that for a given force field parametrization, the atomic surface area approach provides slightly more accurate non-polar solvation free energy prediction. As shown in our earlier work,^{83} the performance of the semi-continuous and semi-atomic LJ potential is sensitive to the choice of the probe radius, due to its nonlinear dependence on the radius, whereas the SES based atomic surface area approach is less sensitive to the probe radius. We conclude that in general the atomic surface area based non-polar model is at least as good as the LJ potential based one for modeling the non-polar solvation free energy. Therefore, the atomic surface area based non-polar model is employed in the rest of this paper.

Non-polar model . | Radius . | AM1-BCC . | Mulliken . | Gasteiger . | SIESTA . |
---|---|---|---|---|---|

Atomic surface area | Amber 6 | 1.30 | 1.27 | 1.23 | 0.99 |

Amber Bondi | 1.40 | 1.30 | 1.30 | 0.93 | |

Amber Mbondi2 | 1.41 | 1.34 | 1.33 | 1.10 | |

ZAP9 | 1.33 | 1.39 | 1.37 | 1.08 | |

van der Waals | Amber 6 | 1.42 | 1.31 | 1.34 | 0.93 |

Amber Bondi | 1.44 | 1.30 | 1.30 | 1.05 | |

Amber Mbondi2 | 1.45 | 1.38 | 1.30 | 1.18 | |

ZAP9 | 1.44 | 1.39 | 1.32 | 1.32 |

Non-polar model . | Radius . | AM1-BCC . | Mulliken . | Gasteiger . | SIESTA . |
---|---|---|---|---|---|

Atomic surface area | Amber 6 | 1.30 | 1.27 | 1.23 | 0.99 |

Amber Bondi | 1.40 | 1.30 | 1.30 | 0.93 | |

Amber Mbondi2 | 1.41 | 1.34 | 1.33 | 1.10 | |

ZAP9 | 1.33 | 1.39 | 1.37 | 1.08 | |

van der Waals | Amber 6 | 1.42 | 1.31 | 1.34 | 0.93 |

Amber Bondi | 1.44 | 1.30 | 1.30 | 1.05 | |

Amber Mbondi2 | 1.45 | 1.38 | 1.30 | 1.18 | |

ZAP9 | 1.44 | 1.39 | 1.32 | 1.32 |

### B. Solvation predictions

#### 1. Leave-one-out prediction of 668 molecules

We first examine the proposed nearest neighbor approach by the leave-one-out test of 668 molecules. In this examination, we select one molecule at a time and use all other molecules as the training set to predict the selected one’s solvation free energy. This process is applied systematically to all the molecules in the whole dataset of 668 molecules. Four different atomic radius sets are considered, together with four different charge force fields. Our results are illustrated in Fig. 4, and the associated values are listed in Table V. In general, all radius sets and charge force fields perform similarly well. The maximum RMS error is below 1.7 kcal/mol for all methods over all molecules. More specifically, Bondi and Mbondi radii offer the best overall results. For charge force field, AM1-BBC charges appear to provide the best predictions and their RMS errors are less than 1.5 kcal/mol for all the four radius sets. The optimal result obtained with AM1-BBC charges and ZAP9 radii has an RMS error of 1.33 kcal/mol. Figure 5 plots the optimal results on the leave-one-out test.

Radius . | AM1-BCC . | Mulliken . | Gasteiger . | SIESTA . |
---|---|---|---|---|

Amber 6 | 1.47 | 1.49 | 1.65 | 1.65 |

Amber Bondi | 1.34 | 1.48 | 1.66 | 1.40 |

Amber Mbondi2 | 1.33 | 1.49 | 1.68 | 1.42 |

ZAP9 | 1.33 | 1.39 | 1.70 | 1.53 |

Radius . | AM1-BCC . | Mulliken . | Gasteiger . | SIESTA . |
---|---|---|---|---|

Amber 6 | 1.47 | 1.49 | 1.65 | 1.65 |

Amber Bondi | 1.34 | 1.48 | 1.66 | 1.40 |

Amber Mbondi2 | 1.33 | 1.49 | 1.68 | 1.42 |

ZAP9 | 1.33 | 1.39 | 1.70 | 1.53 |

For a set of 643 molecules, which largely overlaps with present dataset, Mobley and Guthrie reported an RMS error of 1.51 kcal/mol.^{57} Our results indicate that the present nearest neighbor approach can achieve highly accurate predictions for the solvation free energies for all the atomic radius and charge force fields.

#### 2. SAMPLx blind predictions

In this section, we present results of the blind solvation free energy predictions based on the proposed protocol. All the SAMPL0-SAMPL4 challenges for solvation free energies are considered. We predict the solvation free energies with a leave-SAMPLx-out approach, in which the SAMPLx molecules and their experimental solvation free energies are regarded as unknown while information of other molecules is utilized to predict selected SAMPL test set, based on molecular formulas.

We first consider SAMPL0 test set, in which molecules are diverse. One molecule in this test set contains Br atom, for which our polarizable PB model does not an appropriate pseudo-potential. The structures of the SAMPL0 molecules can be found in the literature.^{60} Tables S1-S4 of the supplementary material present the solvation free energies for the SAMPL0 test set obtained from both experiments and blind predictions using different charges and atomic radii parametrizations. The left chart of Fig. 6 shows the plot of RMS errors for each charge and atomic radius combination. It is clear that the change in atomic radii has a minor influence on the accuracy of predictions, while the change in charge force fields has a much more significant influence. In other words, the solvation free energy prediction for this test set is more sensitive to the charge parametrization than the atomic radii parametrization. In general, our KSDFT based polarizable Poisson model provides better predictions than those of other charge force fields. It is noted that the SIASTA based polarizable Poisson model with Amber Bondi radius parameters delivers the best solvation free energy prediction. This result is depicted in the right chart of Fig. 6. The associated RMS error (0.93 kcal/mol) appears to be better than that in the literature^{60} (i.e., 1.71 kcal/mol for the *full* SAMPL0 test set of 17 molecules). The best prediction in the literature has an RMS error of 1.34 kcal/mol.^{42} Even though for the SIESTA based *ab initio* charge calculation, the molecule contains the Br atom is neglected, based on the general pattern that when any other force field used, the prediction for that omitted molecules is very well. We believe with proper pseudopotential for Br atom, our polarizable Poisson model can deliver an accurate prediction for this molecule also. Moreover, for this test set, all the other three charge parametrizations can provide a similar level of solvation free energy prediction.

We next consider the SAMPL1 challenge set for solvation predictions.^{34} This set contains not only a largest number of 63 molecules, which is the largest among all the Sampl test sets, but also many molecules with extremely complicated structures. The detailed description of this set is given in the literature.^{34} Most molecules in this set are druggable and very complex. The difficulty of this test set comes from two aspects: on one hand, the structures of molecules in this set are very complicated. On the other hand, the knowledge in the training set that can be used for predicting the solvation free energies of these molecules is rare or insufficient. In addition to the molecular complexity, the reported experimental solvation free energies also admit large uncertainty.^{34} Most earlier computational predictions report RMS errors of 3–4 kcal/mol when some extremely complex molecules are excluded. Some best prediction for the whole set has an RMS error of 2.45 kcal/mol.^{42} The best performance was shown to give an RMS error of 2.4 kcal/mol on a subset of the SAMPL1 test set that contains only 56 molecules.^{52} Tables S5-S8 of the supplementary material present the solvation free energies for the SAMPL1 test set obtained from both experiments and blind predictions using different charges and atomic radii. Figure 7 plots the present blind predictions. One molecule (bromacil) that contains a Br atom is considered by all the charge models except for the polarizable PB method. It is noted that two of present approaches, namely, the AM1-BCC semi-empirical charges and the *ab initio* charge, provide relatively accurate predictions for this test set. When Amber Mondi2 atomic radii together with the SIESTA charge calculation is applied, the optimal solvation free energy prediction is achieved with RMS error 2.82 kcal/mol. When AM1-BCC charges and the Amber Bondi radii are used, the prediction for the whole SAMPL1 set without a molecule that contains Br atom has an RMS error of 3.06 kcal/mol, and the RMS error is 3.07 kcal/mol for the same test with all molecules. For this test set, the large prediction RMS error is mainly due to the extremely large error in predicting solvation free energies for some complex molecules, for which the RMS error can be as large as 15 kcal/mol. This unreasonable prediction is due to the fact that inappropriate molecule force field parametrization yields unreasonable electrostatic solvation free energies, which in turn leads to erroneous solvation free energy prediction. Similar to the SAMPL0 test set, the prediction is more sensitive to the charge parametrization. Nevertheless, the atomic radius parametrization also plays a critical role in the prediction accuracy for this test set.

SAMPL2 is another difficult test set with almost the same level of difficulty as the SAMPL1 test set.^{44} Compared with SAMPL1 test set, it contains a few complex molecules, and most molecules in this test set are drug like ones as well. Contrary to molecules in SAMPL1 test set, this set has less uncertainty in the experimental solvation free energies. As shown in Tables S9-S12 of the supplementary material, the experimental solvation free energies of this test set distribute over a wide range. Using all-atom molecular dynamics simulations and multiple starting conformations for blind prediction, Klimovich and Mobley reported an RMS error of 2.82 kcal/mol over the whole set and 1.86 kcal/mol over all the molecules except several hydroxyl-rich compounds.^{44} Some best prediction has an RMS error of 1.59 kcal/mol.^{42} In the present work, the molecule containing an I atom (5-iodouracil) is excluded in all calculations. Additionally, 5-bromouracil has a Br atom which is excluded in the polarizable PB model. Tables S9-S12 of the supplementary material list the details of the present predictions. The RMS errors from various radius and charge force fields are given Fig. 8. Apparently, this test set has a strong force field dependence as well. The RMS errors vary very much from one charge force field to another. However, the performance of these predictions has a weak radius dependence. The best prediction, obtained from a combination of Amber MBondi2 radius parameters and AM1-BCC charge force field, or Amber Bondi radius together with modified SIESTA charge, both have the RMS error of 1.90 kcal/mol when the molecule with a Br atom is excluded in the prediction. When all molecules are included, the combination of Amber MBondi2 radius and AM1-BCC charge parametrization gives the optimal prediction of RMS error 1.96 kcal/mol as depicted in the right chart of Fig. 8. Compared to the prediction of SAMPL1 test set, these predictions are more accurate, due to two reasons. First, the molecules in this set are slightly simpler and experimental uncertainly is less severe for the deterministic prediction. Second, in our parametrization of the non-polar solvation free energy prediction, we have more similar molecules in training set, which enables us to obtain better non-polar solvation free energy parameters in the nearest neighbor based approach. In contrast, in the SAMPL1 prediction, when the nearest neighbor approach is applied, the nearest molecules selected from the training set differ much from SAMPL1 molecules.

SAMPL3 test set is a relatively easy one with 36 solute molecules.^{25} Its molecular structures are less versatile than the earlier test sets. Additionally, in this test set, there is no molecule that involves Br or I atom. One of difficulties in the prediction of this set is the lack of similar molecules in our database when all the SAMPL3 molecules are left out. The best prediction in the literature offers an RMS error of 1.29 kcal/mol.^{42} Our blind predictions and experimental results for SAMPL3 test set are given in Tables S13-S16 of the supplementary material. The RMS errors of our blind predictions for the SAMPL3 test set are plotted in Fig. 9. In this case, the accuracy of predictions also depends strongly on charge force fields and weakly on radius parameters. In general, Gasteiger charges are superb for this test set and their RMS errors are always smaller than 1 kcal/mol. Our best prediction, obtained from the combination of ZAP9 radius parameters and Gasteiger charge force field, has a small RMS error of 0.78 kcal/mol and is depicted in the right chart of Fig. 9. In general Gasteiger charge performs very poor in the solvation free energy prediction for previous tests on complex molecules, while we see that this charge parametrization method is superior for chlorinated hydrocarbon molecules in the present test set. Unfortunately, there is no uniformly optimal parametrization for all the molecules. This fact motivates us to seek a solvation free energy prediction that does not heavily depend on the force fields parametrization.

SAMPL4 test set is studied by using the proposed methods as well. A key feature of this test set is that its molecules are diversified. As shown in Tables S17-S20 of the supplementary material, this test set also involves a wide range of solvation free energies. However, the structures of these molecules are not as complex as those in both SAMPL1 and SAMPL2 test sets, which indicates a slightly easier task for solvation free energy prediction. This test set was studied by a number of researchers in the literature and the best prediction in the literature has the RMS error of 1.2 kcal/mol.^{59} Figure 10 illustrates the RMS errors of our predictions for a total of 16 charge and radius combinations. Compared with those in the literature, all of our predictions are of high quality. Our best result has the RMS error of 1.03 kcal/mol and the corresponding results are given in the right chart of Fig. 10.

Table VI lists the root mean square errors (RMSEs), mean errors (MEs), Pearson correlations (Rs), and P-values (Ps) of our blind predictions of solvation free energies for all five test sets using a total of 16 charge and radius implementations. Some comments are in order. First, all predictions are quite sensitive to charge force fields and relatively less dependent on radius parameter selection. For SAMPL0 test set, although predictions with the AM1-BCC charge set have similar correlations as those obtained with other charge force fields, their MEs are significantly larger. However, AM1-BCC charge force field does a relatively better job in SAMPL1 and SAMPL2. Additionally, it is difficult to identify a clear winner over all the test sets — some approaches perform better in one or two test sets, but do not do well in the rest test sets. This phenomenon highlights the difficulty of designing optimal models for solvation analysis. Moreover, the KSDFT based polarizable Poisson model is more often to provide better predictions over most charge force fields and radius parameters. This indicates a potential to develop better physical models by improving the quantum charge density calculation. Further, the Pearson correlation becomes an interesting measure in SAMPL3 test set. Results obtained with AM1-BCC and Mulliken charge sets have particularly low correlations, although their RMSEs are quite good. Results for SAMPL1 and SAMPL4 test sets are quite consistent in terms of their RMSE, ME, and correlations. Furthermore, our P-value analysis indicates that most results very reliable, except for cases when the AM1-BCC and Mulliken charge forces fields are applied to SAMPL3. However, all P-values are smaller than the standard cutoff, 0.05. Finally, we point out that the blind prediction results presented in the present work are the state-of-the-art compared to those in the literature.^{21,25,26,34,35,42,55,56,58–60,68,69}

. | . | AM1-BCC . | Mulliken . | Gasteiger . | SIESTA . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Test set . | Radius . | RMSE . | ME . | R . | P . | RMSE . | ME . | R . | P . | RMSE . | ME . | R . | P . | RMSE . | ME . | R . | P . |

SAMPL0 | Amber 6 | 1.30 (1.26) | 0.78 (0.76) | 0.96 (0.96) | 7.97 × 10^{−9} (1.67 × 10^{−9}) | 1.27 (1.25) | −0.18 (−0.12) | 0.95 (0.94) | 6.91 × 10^{−8} (1.78 × 10^{−8}) | 1.23 (1.20) | −0.16 (−0.13) | 0.98 (0.94) | 5.99 × 10^{−8} (1.45 × 10^{−8}) | 0.99 (NA) | 0.13 (NA) | 0.96 (NA) | 3.79 × 10^{−9} (NA) |

Amber Bondi | 1.40 (1.37) | 0.88 (0.86) | 0.95 (0.95) | 1.70 × 10^{−8} (3.75 × 10^{−9}) | 1.30 (1.27) | −0.26 (−0.20) | 0.94 (0.94) | 8.10 × 10^{−8} (2.14 × 10^{−8}) | 1.30 (1.27) | −0.27 (−0.24) | 0.94 (0.94) | 1.05 × 10^{−7} (2.64 × 10^{−8}) | 0.93 (NA) | −0.12 (NA) | 0.94 (NA) | 1.71 × 10^{−9} (NA) | |

Amber Mbondi2 | 1.41 (1.37) | 0.91 (0.88) | 0.95 (0.96) | 1.15 × 10^{−7} (2.48 × 10^{−8}) | 1.34 (1.32) | 0.27 (0.21) | 0.94 (0.94) | 1.05 × 10^{−7} (2.78 × 10^{−8}) | 1.33 (1.29) | −0.28 (−0.25) | 0.93 (0.94) | 1.28 × 10^{−7} (3.25 × 10^{−8}) | 1.10 (NA) | −0.17 (NA) | 0.95 (NA) | 9.51 × 10^{−9} (NA) | |

ZAP9 | 1.33 (1.29) | 0.44 (0.44) | 0.96 (0.96) | 4.09 × 10^{−9} (9.93 × 10^{−10}) | 1.39 (1.37) | 0.00 (0.05) | 0.94 (0.94) | 8.47 × 10^{−8} (2.10 × 10^{−8}) | 1.37 (1.33) | 0.11 (0.12) | 0.94 (0.94) | 5.96 × 10^{−8} (1.54 × 10^{−8}) | 1.08 (NA) | −0.21 (NA) | 0.96 (NA) | 4.89 × 10^{−9} (NA) | |

SAMPL1 | Amber 6 | 3.26 (3.27) | 0.83 (0.88) | 0.74 (0.74) | 5.23 × 10^{−12} (3.18 × 10^{−12}) | 4.74 (4.77) | −2.22 (−2.28) | 0.74 (0.75) | 3.87 × 10^{−12} (2.12 × 10^{−12}) | 4.92 (4.96) | −0.93 (−1.02) | 0.69 (0.69) | 6.40 × 10^{−10} (4.36 × 10^{−10}) | 2.92 (NA) | −0.52 (NA) | 0.81 (NA) | 1.93 × 10^{−15} (NA) |

Amber Bondi | 3.06 (3.07) | 0.96 (0.99) | 0.79 (0.79) | 2.61 × 10^{−14} (1.39 × 10^{−14}) | 4.65 (4.68) | −2.22 (−2.28) | 0.78 (0.78) | 5.81 × 10^{−14} (3.35 × 10^{−14}) | 5.52 (5.55) | −1.16 (−1.26) | 0.67 (0.68) | 1.81 × 10^{−9} (1.21 × 10^{−9}) | 2.89 (NA) | −1.15 (NA) | 0.88 (NA) | 9.76 × 10^{−21} (NA) | |

Amber Mbondi2 | 3.29 (3.30) | 1.17 (1.22) | 0.82 (0.82) | 2.40 × 10^{−16} (1.45 × 10^{−16}) | 5.39 (5.41) | −2.32 (−2.39) | 0.80 (0.80) | 7.07 × 10^{−15} (3.82 × 10^{−15}) | 4.76 (4.82) | −0.56 (−0.67) | 0.60 (0.60) | 2.64 × 10^{−7} (2.07 × 10^{−7}) | 2.82 (NA) | −1.05 (NA) | 0.87 (NA) | 3.72 × 10^{−20} (NA) | |

ZAP9 | 4.26 (4.35) | 1.26 (1.30) | 0.81 (0.81) | 2.52 × 10^{−15} (1.29 × 10^{−15}) | 6.16 (6.16) | −3.01 (−3.06) | 0.73 (0.73) | 1.28 × 10^{−11} (7.52 × 10^{−12}) | 5.33 (5.45) | −0.66 (−0.82) | 0.64 (0.64) | 2.61 × 10^{−8} (2.26 × 10^{−8}) | 3.76 (NA) | 0.31 (NA) | 0.82 (NA) | 5.14 × 10^{−16} (NA) | |

SAMPL2 | Amber 6 | 2.09 (2.11) | −0.59 (−0.65) | 0.95 (0.95) | 1.34 × 10^{−15} (3.90 × 10^{−15}) | 3.51 (3.59) | 1.56 (1.65) | 0.86 (0.88) | 2.93 × 10^{−9} (1.59 × 10^{−9}) | 4.78 (4.86) | 2.51 (2.65) | 0.77 (0.78) | 1.68 × 10^{−6} (1.01 × 10^{−6}) | 3.46 (NA) | 1.18 (NA) | 0.85 (NA) | 8.10 × 10^{−9} (NA) |

Amber Bondi | 1.95 (1.97) | −0.17 (−0.26) | 0.96 (0.96) | 9.32 × 10^{−16} (4.36 × 10^{−16}) | 3.38 (3.47) | 1.55 (1.69) | 0.88 (0.90) | 8.77 × 10^{−10} (4.92 × 10^{−10}) | 4.62 (4.72) | 2.45 (2.61) | 0.79 (0.80) | 6.77 × 10^{−7} (4.29 × 10^{−7}) | 1.90 (NA) | 0.87 (NA) | 0.96 (NA) | 1.55 × 10^{−16} (NA) | |

Amber Mbondi2 | 1.90 (1.96) | −0.36 (−0.46) | 0.96 (0.96) | 9.87 × 10^{−16} (5.99 × 10^{−16}) | 3.55 (3.66) | 2.46 (2.62) | 0.86 (0.89) | 7.96 × 10^{−7} (5.59 × 10^{−7}) | 4.65 (4.76) | 1.64 (1.79) | 0.78 (0.80) | 2.90 × 10^{−9} (1.91 × 10^{−9}) | 2.35 (NA) | 1.21 (NA) | 0.95 (NA) | 1.17 × 10^{−14} (NA) | |

ZAP9 | 2.05 (2.03) | −0.37 (−0.41) | 0.95 (0.95) | 1.18 × 10^{−14} (2.40 × 10^{−15}) | 3.19 (3.15) | 2.36 (2.37) | 0.89 (0.91) | 6.43 × 10^{−7} (2.06 × 10^{−7}) | 4.56 (4.51) | 1.52 (1.45) | 0.79 (0.81) | 1.50 × 10^{−10} (3.88 × 10^{−11}) | 1.93 (NA) | 1.13 (NA) | 0.97 (NA) | 2.91 × 10^{−17} (NA) | |

SAMPL3 | Amber 6 | 1.28 | 0.38 | 0.47 | 3.55 × 10^{−3} | 1.42 | 0.72 | 0.51 | 1.67 × 10^{−3} | 0.97 | −0.16 | 0.86 | 1.85 × 10^{−11} | 1.08 | 0.34 | 0.81 | 2.98 × 10^{−9} |

Amber Bondi | 1.47 | −0.56 | 0.47 | 3.56 × 10^{−3} | 1.58 | 0.85 | 0.51 | 1.65 × 10^{−3} | 0.82 | −0.09 | 0.86 | 1.89 × 10^{−11} | 1.16 | 0.45 | 0.81 | 2.95 × 10^{−9} | |

Amber Mbondi2 | 1.47 | −0.56 | 0.37 | 2.43 × 10^{−2} | 1.58 | 0.85 | 0.39 | 2.02 × 10^{−2} | 0.82 | −0.09 | 0.86 | 2.28 × 10^{−11} | 1.16 | 0.39 | 0.70 | 1.50 × 10^{−6} | |

ZAP9 | 1.55 | −0.68 | 0.33 | 4.93 × 10^{−2} | 1.28 | −0.37 | 0.59 | 1.55 × 10^{−4} | 0.78 | 0.31 | 0.89 | 2.98 × 10^{−13} | 1.33 | −0.59 | 0.50 | 1.84 × 10^{−3} | |

SAMPL4 | Amber 6 | 1.28 | −0.14 | 0.97 | 3.28 × 10^{−30} | 1.20 | 0.11 | 0.95 | 7.77 × 10^{−25} | 1.08 | 0.18 | 0.96 | 3.42 × 10^{−27} | 1.41 | 0.15 | 0.96 | 8.09 × 10^{−26} |

Amber Bondi | 1.12 | 0.06 | 0.96 | 3.53 × 10^{−27} | 1.41 | 0.31 | 0.94 | 5.08 × 10^{−22} | 1.10 | 0.33 | 0.96 | 2.95 × 10^{−27} | 1.07 | 0.10 | 0.96 | 1.58 × 10^{−27} | |

Amber Mbondi2 | 1.09 | 0.15 | 0.96 | 1.02 × 10^{−27} | 1.33 | 0.14 | 0.94 | 1.84 × 10^{−22} | 1.03 | −0.08 | 0.96 | 4.64 × 10^{−26} | 1.04 | 0.00 | 0.96 | 8.87 × 10^{−28} | |

ZAP9 | 1.12 | −0.01 | 0.96 | 2.30 × 10^{−27} | 1.12 | 0.07 | 0.96 | 2.7 × 10^{−26} | 1.09 | −0.04 | 0.96 | 1.41 × 10^{−26} | 1.32 | 0.36 | 0.95 | 1.20 × 10^{−23} |

. | . | AM1-BCC . | Mulliken . | Gasteiger . | SIESTA . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Test set . | Radius . | RMSE . | ME . | R . | P . | RMSE . | ME . | R . | P . | RMSE . | ME . | R . | P . | RMSE . | ME . | R . | P . |

SAMPL0 | Amber 6 | 1.30 (1.26) | 0.78 (0.76) | 0.96 (0.96) | 7.97 × 10^{−9} (1.67 × 10^{−9}) | 1.27 (1.25) | −0.18 (−0.12) | 0.95 (0.94) | 6.91 × 10^{−8} (1.78 × 10^{−8}) | 1.23 (1.20) | −0.16 (−0.13) | 0.98 (0.94) | 5.99 × 10^{−8} (1.45 × 10^{−8}) | 0.99 (NA) | 0.13 (NA) | 0.96 (NA) | 3.79 × 10^{−9} (NA) |

Amber Bondi | 1.40 (1.37) | 0.88 (0.86) | 0.95 (0.95) | 1.70 × 10^{−8} (3.75 × 10^{−9}) | 1.30 (1.27) | −0.26 (−0.20) | 0.94 (0.94) | 8.10 × 10^{−8} (2.14 × 10^{−8}) | 1.30 (1.27) | −0.27 (−0.24) | 0.94 (0.94) | 1.05 × 10^{−7} (2.64 × 10^{−8}) | 0.93 (NA) | −0.12 (NA) | 0.94 (NA) | 1.71 × 10^{−9} (NA) | |

Amber Mbondi2 | 1.41 (1.37) | 0.91 (0.88) | 0.95 (0.96) | 1.15 × 10^{−7} (2.48 × 10^{−8}) | 1.34 (1.32) | 0.27 (0.21) | 0.94 (0.94) | 1.05 × 10^{−7} (2.78 × 10^{−8}) | 1.33 (1.29) | −0.28 (−0.25) | 0.93 (0.94) | 1.28 × 10^{−7} (3.25 × 10^{−8}) | 1.10 (NA) | −0.17 (NA) | 0.95 (NA) | 9.51 × 10^{−9} (NA) | |

ZAP9 | 1.33 (1.29) | 0.44 (0.44) | 0.96 (0.96) | 4.09 × 10^{−9} (9.93 × 10^{−10}) | 1.39 (1.37) | 0.00 (0.05) | 0.94 (0.94) | 8.47 × 10^{−8} (2.10 × 10^{−8}) | 1.37 (1.33) | 0.11 (0.12) | 0.94 (0.94) | 5.96 × 10^{−8} (1.54 × 10^{−8}) | 1.08 (NA) | −0.21 (NA) | 0.96 (NA) | 4.89 × 10^{−9} (NA) | |

SAMPL1 | Amber 6 | 3.26 (3.27) | 0.83 (0.88) | 0.74 (0.74) | 5.23 × 10^{−12} (3.18 × 10^{−12}) | 4.74 (4.77) | −2.22 (−2.28) | 0.74 (0.75) | 3.87 × 10^{−12} (2.12 × 10^{−12}) | 4.92 (4.96) | −0.93 (−1.02) | 0.69 (0.69) | 6.40 × 10^{−10} (4.36 × 10^{−10}) | 2.92 (NA) | −0.52 (NA) | 0.81 (NA) | 1.93 × 10^{−15} (NA) |

Amber Bondi | 3.06 (3.07) | 0.96 (0.99) | 0.79 (0.79) | 2.61 × 10^{−14} (1.39 × 10^{−14}) | 4.65 (4.68) | −2.22 (−2.28) | 0.78 (0.78) | 5.81 × 10^{−14} (3.35 × 10^{−14}) | 5.52 (5.55) | −1.16 (−1.26) | 0.67 (0.68) | 1.81 × 10^{−9} (1.21 × 10^{−9}) | 2.89 (NA) | −1.15 (NA) | 0.88 (NA) | 9.76 × 10^{−21} (NA) | |

Amber Mbondi2 | 3.29 (3.30) | 1.17 (1.22) | 0.82 (0.82) | 2.40 × 10^{−16} (1.45 × 10^{−16}) | 5.39 (5.41) | −2.32 (−2.39) | 0.80 (0.80) | 7.07 × 10^{−15} (3.82 × 10^{−15}) | 4.76 (4.82) | −0.56 (−0.67) | 0.60 (0.60) | 2.64 × 10^{−7} (2.07 × 10^{−7}) | 2.82 (NA) | −1.05 (NA) | 0.87 (NA) | 3.72 × 10^{−20} (NA) | |

ZAP9 | 4.26 (4.35) | 1.26 (1.30) | 0.81 (0.81) | 2.52 × 10^{−15} (1.29 × 10^{−15}) | 6.16 (6.16) | −3.01 (−3.06) | 0.73 (0.73) | 1.28 × 10^{−11} (7.52 × 10^{−12}) | 5.33 (5.45) | −0.66 (−0.82) | 0.64 (0.64) | 2.61 × 10^{−8} (2.26 × 10^{−8}) | 3.76 (NA) | 0.31 (NA) | 0.82 (NA) | 5.14 × 10^{−16} (NA) | |

SAMPL2 | Amber 6 | 2.09 (2.11) | −0.59 (−0.65) | 0.95 (0.95) | 1.34 × 10^{−15} (3.90 × 10^{−15}) | 3.51 (3.59) | 1.56 (1.65) | 0.86 (0.88) | 2.93 × 10^{−9} (1.59 × 10^{−9}) | 4.78 (4.86) | 2.51 (2.65) | 0.77 (0.78) | 1.68 × 10^{−6} (1.01 × 10^{−6}) | 3.46 (NA) | 1.18 (NA) | 0.85 (NA) | 8.10 × 10^{−9} (NA) |

Amber Bondi | 1.95 (1.97) | −0.17 (−0.26) | 0.96 (0.96) | 9.32 × 10^{−16} (4.36 × 10^{−16}) | 3.38 (3.47) | 1.55 (1.69) | 0.88 (0.90) | 8.77 × 10^{−10} (4.92 × 10^{−10}) | 4.62 (4.72) | 2.45 (2.61) | 0.79 (0.80) | 6.77 × 10^{−7} (4.29 × 10^{−7}) | 1.90 (NA) | 0.87 (NA) | 0.96 (NA) | 1.55 × 10^{−16} (NA) | |

Amber Mbondi2 | 1.90 (1.96) | −0.36 (−0.46) | 0.96 (0.96) | 9.87 × 10^{−16} (5.99 × 10^{−16}) | 3.55 (3.66) | 2.46 (2.62) | 0.86 (0.89) | 7.96 × 10^{−7} (5.59 × 10^{−7}) | 4.65 (4.76) | 1.64 (1.79) | 0.78 (0.80) | 2.90 × 10^{−9} (1.91 × 10^{−9}) | 2.35 (NA) | 1.21 (NA) | 0.95 (NA) | 1.17 × 10^{−14} (NA) | |

ZAP9 | 2.05 (2.03) | −0.37 (−0.41) | 0.95 (0.95) | 1.18 × 10^{−14} (2.40 × 10^{−15}) | 3.19 (3.15) | 2.36 (2.37) | 0.89 (0.91) | 6.43 × 10^{−7} (2.06 × 10^{−7}) | 4.56 (4.51) | 1.52 (1.45) | 0.79 (0.81) | 1.50 × 10^{−10} (3.88 × 10^{−11}) | 1.93 (NA) | 1.13 (NA) | 0.97 (NA) | 2.91 × 10^{−17} (NA) | |

SAMPL3 | Amber 6 | 1.28 | 0.38 | 0.47 | 3.55 × 10^{−3} | 1.42 | 0.72 | 0.51 | 1.67 × 10^{−3} | 0.97 | −0.16 | 0.86 | 1.85 × 10^{−11} | 1.08 | 0.34 | 0.81 | 2.98 × 10^{−9} |

Amber Bondi | 1.47 | −0.56 | 0.47 | 3.56 × 10^{−3} | 1.58 | 0.85 | 0.51 | 1.65 × 10^{−3} | 0.82 | −0.09 | 0.86 | 1.89 × 10^{−11} | 1.16 | 0.45 | 0.81 | 2.95 × 10^{−9} | |

Amber Mbondi2 | 1.47 | −0.56 | 0.37 | 2.43 × 10^{−2} | 1.58 | 0.85 | 0.39 | 2.02 × 10^{−2} | 0.82 | −0.09 | 0.86 | 2.28 × 10^{−11} | 1.16 | 0.39 | 0.70 | 1.50 × 10^{−6} | |

ZAP9 | 1.55 | −0.68 | 0.33 | 4.93 × 10^{−2} | 1.28 | −0.37 | 0.59 | 1.55 × 10^{−4} | 0.78 | 0.31 | 0.89 | 2.98 × 10^{−13} | 1.33 | −0.59 | 0.50 | 1.84 × 10^{−3} | |

SAMPL4 | Amber 6 | 1.28 | −0.14 | 0.97 | 3.28 × 10^{−30} | 1.20 | 0.11 | 0.95 | 7.77 × 10^{−25} | 1.08 | 0.18 | 0.96 | 3.42 × 10^{−27} | 1.41 | 0.15 | 0.96 | 8.09 × 10^{−26} |

Amber Bondi | 1.12 | 0.06 | 0.96 | 3.53 × 10^{−27} | 1.41 | 0.31 | 0.94 | 5.08 × 10^{−22} | 1.10 | 0.33 | 0.96 | 2.95 × 10^{−27} | 1.07 | 0.10 | 0.96 | 1.58 × 10^{−27} | |

Amber Mbondi2 | 1.09 | 0.15 | 0.96 | 1.02 × 10^{−27} | 1.33 | 0.14 | 0.94 | 1.84 × 10^{−22} | 1.03 | −0.08 | 0.96 | 4.64 × 10^{−26} | 1.04 | 0.00 | 0.96 | 8.87 × 10^{−28} | |

ZAP9 | 1.12 | −0.01 | 0.96 | 2.30 × 10^{−27} | 1.12 | 0.07 | 0.96 | 2.7 × 10^{−26} | 1.09 | −0.04 | 0.96 | 1.41 × 10^{−26} | 1.32 | 0.36 | 0.95 | 1.20 × 10^{−23} |

## IV. CONCLUDING REMARKS

In this work, we propose a protocol to parametrize implicit solvent models for the blind prediction of solvation free energies. The proposed protocol parametrizes non-polar solvation model by scoring and ranking algorithms while utilizes either the Poisson model or a density functional theory (DFT) based polarizable Poisson model for polar solvation free energy calculations. For the parametrization of non-polar solvation models, we first utilized the assumption that molecules with the same functional group admit the same parametrization of the non-polar solvation energy functional. For complex poly-functional-group molecules, we develop a scoring procedure to determine the optimal relative weight of each functional group. For extremely complex molecules that fail the functional group scoring method, we further develop a molecule ranking algorithm to select an optimal set of nearest neighbor molecules for parameter training. We construct atomic features for the molecule ranking. Finally, we systematically integrate the above mentioned models and algorithms into a robust protocol for blind solvation free energy prediction.

In the present work, we considered an experimental database of 668 solvation molecules, see Table S21 in the supplementary material, the largest database ever constructed for solvation, to validate our approach. Among them, SAMPL0–SAMPL4 test sets are paid special attention. For the Poisson model or DFT based polarizable Poisson model, four sets of atomic radius parameters (i.e., Amber 6, Amber bondi, Amber mbondi, and ZAP9 radii) are combined with four sets of charge force fields (i.e., AM1-BCC, Mulliken, Gasteiger, and SIESTA DFT) to arrive at a total of 16 different implementations. The resulting polar solvation free energies are utilized in our parametrization for blind predictions. We first carry out the leave-one-out validation of the whole database. The AM1-BCC charge force field delivers a low root mean square (RMS) error of 1.33 kcal/mol, which is the lowest for such a large test database, to our best knowledge. We further conduct a series of leave-SAMPLx-out blind tests. On average, the AM1-BCC parametrization in the Poisson model and DFT based polarizable Poisson model performs better than other charge force fields, especially for predicting the solvation free energies of the complex molecules. The optimal RMS errors for SAMPL0-SAMPL4 are, respectively, 0.93, 2.82, 1.90, 0.78, and 1.03 kcal/mol.

From the solvation free energy predictions, particularly on SAMPL1 and SAMPL2 test sets, we conclude that atomic charge parametrization is extremely important for the present physical models, namely, the Poisson model or the KSDFT based polarizable Poisson model. Without an appropriate charge parametrization, the prediction errors can be amplified for molecules with complex structures. In general, for four charge parametrization methods, both the semi-empirical AM1-BCC charge and the *ab initio* charge calculation from the generalized KSDFT can provide relatively reliable charge assignments. For this reason, a solvation free energy prediction method that does not heavily depend on the molecule parametrization is under our consideration. Essential idea is that if one does not partition the solvation free energy into two isolated parts, the prediction errors in the polar solvation free energy will not propagate to the non-polar solvation free energy prediction. Alternative, our differential geometry based solvation model that dynamically couples polar and non-polar models^{7,89} might also provide a less sensitive approach.

This work can be improved in a number of ways. First, the classification of the database into functional groups is not unique. Future study will explore optimal molecules partition. Additionally, the selection and computation of atomic features need to be further investigated. It is possible to construct an optimal set of atomic features for solvation analysis and prediction. Further, in the current work, molecular ranking for nearest neighbor searching is not optimal yet. More sophisticated machine learning and/or deep learning algorithms can be developed for this purpose. Finally, a more versatile DFT solvers can be utilized to further improve our in-house polarizable Poisson model. These issues are under our consideration.

## SUPPLEMENTARY MATERIAL

See supplementary material for specific predictions of SAMPL0-SAMPL4 test sets for all force field parametrizations and molecular name, PubChem ID, and experimental solvation free energies of all 668 molecules. Additionally, more detailed datasets with coordinates, radii, and charges for 668 molecules can be downloaded from website http://weilab.math.msu.edu/MIBPB/.

## Acknowledgments

This work was supported in part by NSF Grant No. IIS-1302285 and MSU Center for Mathematical Molecular Biosciences Initiative. B.W. thanks Dr. Pengfei Li, Professor Ray Luo, Professor David Mobeley, and Professor Wei Yang for helpful discussions. G.-W.W. acknowledges Nathan Baker, Michael Gilson, and Weitao Yang for useful discussions.