The parameterization of torsional/dihedral angle potential energy terms is a crucial part of developing molecular mechanics force fields. Quantum mechanical (QM) methods are often used to provide samples of the potential energy surface (PES) for fitting the empirical parameters in these force field terms. To ensure that the sampled molecular configurations are thermodynamically feasible, constrained QM geometry optimizations are typically carried out, which relax the orthogonal degrees of freedom while fixing the target torsion angle(s) on a grid of values. However, the quality of results and computational cost are affected by various factors on a non-trivial PES, such as dependence on the chosen scan direction and the lack of efficient approaches to integrate results started from multiple initial guesses. In this paper, we propose a systematic and versatile workflow called TorsionDrive to generate energy-minimized structures on a grid of torsion constraints by means of a recursive wavefront propagation algorithm, which resolves the deficiencies of conventional scanning approaches and generates higher quality QM data for force field development. The capabilities of our method are presented for multi-dimensional scans and multiple initial guess structures, and an integration with the MolSSI QCArchive distributed computing ecosystem is described. The method is implemented in an open-source software package that is compatible with many QM software packages and energy minimization codes.
I. INTRODUCTION
The potential energy surface (PES) along the torsional dihedral angle degrees of freedom is a crucial part of model potentials for computer simulations of bio/organic molecules and polymers, including commonly used molecular mechanics force fields. “Proper” torsion angles (i.e., those involving four consecutively bonded atoms a—b—c—d and labeled as ϕabcd) can be highly flexible due to the periodic nature and relatively small range of the free energy profile (often less than 5 kcal mol−1), which leads to broadly diverse conformations and accessible barrier crossings in ambient temperature experiments and simulations. Because the torsional angle is a principal descriptor of molecular conformation, the torsional potential energy is important for determining the thermodynamic distribution of molecular conformations and kinetics of conformational changes. Therefore, accurate empirical potentials, or molecular mechanics (MM) force fields are needed to predict properties of interest such as biomolecular structure and function, receptor-ligand binding free energies, and timescales of protein folding.1–8
The four-body energy term for proper torsion in most force fields uses a periodic functional form of the dihedral angle ϕabcd represented as a truncated Fourier series, i.e.,
where the sum is over periodicity n and Nk ≤ 6. The potential parameters for barrier height and phase shift may be assigned from parameter libraries based on the chemical environment, or they may be specifically fitted (i.e., bespoke) for an individual torsion angle of a specific molecule. The non-covalent interaction between the terminal atoms of the torsion angle may also be modeled using pairwise Coulomb and Lennard-Jones interactions on atoms separated by exactly three bonds (i.e., “1–4 interactions”), which may be modified from conventional non-bonded terms using scaling factors or alternative parameter values.9 Because the 1–4 distance depends strongly (but not exclusively) on the torsion angle, it may be considered as another contribution to the torsional potential energy.
Proper torsions have characteristics of both valence (i.e., bonded) and nonbonded regimes because the total energy includes contributions from the quantum nature of covalent bonding such as resonance and conjugation, as well as non-covalent interactions such as electrostatic and steric effects on vicinal functional groups. As the torsion angle is varied in a molecule, several important properties of the molecule are affected including the electronic character of the central bond as well as steric and other nonbonded interactions between groups on opposite sides of the bond.10–12 Importantly, the torsion angle dependence of these properties can induce relaxations in the orthogonal degrees of freedom as the torsion angle is varied. Such relaxations include bond stretching that accompanies disruption of conjugation, the bending of angles to minimize steric hindrance, and changes in distance between nonbonded functional groups in order to avoid clashes or make intramolecular contacts.
Force fields must accurately account for torsion-induced structural relaxations in order to produce accurate free energy profiles; thus, the standard practice of generating quantum mechanical (QM) data for MM force field parameterization involves minimizing the QM potential energy with the torsion angle of interest constrained to various values, e.g., on a regularly spaced grid.13–16 The result of this calculation is a set of QM constrained optimized structures and energies that includes relaxation effects from orthogonal degrees of freedom, which can be used to develop more accurate torsion parameters in the context of other energy contributions in the force field. In addition, two or more dihedral angles can be varied independently on a multi-dimensional grid to sample the conformational space more broadly and/or to generate data for parameterizing torsion–torsion coupling (also called CMAP) energy terms used in some force fields.17,18
For relevant molecular systems, the feasible geometry optimization methods involve local energy minimization starting from an initial guess structure. The optimized structure and energy, as well as the probability of the optimization algorithm successfully converging to a minimum, both depend strongly on the initial guess. The straightforward approach to this problem involves carrying out a series of constrained minimizations where the constraint value is scanned along the grid, and each minimization is initiated from the optimized structure of the previous one.19 This calculation, which we term a “serial relaxed scan,” is a standard feature in several widely used quantum chemistry and geometry optimization codes.20–28
The serial relaxed scanning approach has some major drawbacks.29 For one, the resulting optimized structures are dependent on the chosen sequence of calculations, such as the direction of the one-dimensional scan. This is because a series of constrained minimizations often stays in the same qualitative local minimum as determined by the orthogonal degrees of freedom even if another local minimum with a lower energy is reachable by scanning in the opposite direction; the other local minimum is found only if the energy barrier vanishes, which is not guaranteed. The result thus has a risk of including structures with unnecessarily high potential energies, which are not appropriate for fitting force field parameters because they introduce a bias toward thermodynamically unlikely conformations. This problem becomes more serious for multi-dimensional scanning as a greater number of choices need to be made for the scanning direction, and the results may depend on the ordering of dimensions. Another drawback is the lack of an efficient way to use multiple initial guesses, such as those resulting from a conformer generation method;30,31 intuitively, it should be possible to perform one scan using several initial guesses and keep the lowest energy structure from each but at a lower total cost compared to running each scan independently.
In this manuscript, we describe a new systematic workflow for generating optimized geometries along grids of torsion constraints by wavefront propagation, which addresses the drawbacks of serial relaxed scanning described above. The method, called TorsionDrive, generates results that are independent of scan direction and naturally incorporates multiple initial guesses into a single grid of constrained minimized structures. A predecessor to the present method was used to scan the two-dimensional torsion angles of blocked amino acid dipeptides in Ref. 8, but due to limitations of implementation, the method was limited to 2D grids and was not easily applicable to other molecules. The present method can be applied to any molecule subject only to the limits of the underlying energy and gradient method. Furthermore, it is capable of driving an arbitrary number of torsions to generate N-dimensional grids of optimized structures and energies where N ≥ 1. The current workflow is implemented as a Python package that interfaces with energy minimization routines in a modular way, including the open-source geomeTRIC optimization package28 that uses externally obtained energy and gradient information, as well as “native” optimization methods implemented in many quantum chemistry codes. The method is released as an open source package32 and includes a number of useful features such as including energy upper limits, extra constraints, and limited scan ranges. In addition to the standalone operation mode, TorsionDrive is also implemented as a service in the MolSSI QCArchive ecosystem,33 and it is available to compute results for any implemented gradient method in QCArchive; this includes not only quantum chemistry methods but also some MM force fields and recently developed neural network potentials parameterized by machine learning methods.34
II. METHOD
The main idea of TorsionDrive can be described conceptually as scanning the torsion angles with wavefront propagation. The details are illustrated by walking through a complete scan procedure, shown in Fig. 1. Before the start of the scan, we specify the dihedral angles in the molecule of interest using quartets of atomic indices and the spacing (resolution) of the scan. A grid of constraint values is created, which has the same dimension as the number of dihedral angles provided. In the illustration, we perform a 1D scan with a 60° spacing. The data associated with each grid point are represented by a circle, which consist of one or more “optimization datasets” (i.e., the Cartesian coordinates of a constrained local minimum and corresponding QM energy and gradient). Importantly, each grid point is able to contain multiple constrained optimization datasets that may correspond to local minima with different energies and values of the orthogonal degrees of freedom.
Given the above input, TorsionDrive starts to iteratively fill and refine the data of all grid points using constrained energy minimizations, denoted by the arrows in Fig. 1. Each constrained optimization is specified by an initial molecular structure and a target set of dihedral angle constraints, and the result is an optimized structure that matches the constraints with the other degrees of freedom relaxed. Within this workflow, TorsionDrive specifies the optimizations and processes the output data, and the actual optimizations are carried out via interfaces to other software packages. Multiple constrained optimizations that are specified at the same step in the workflow can be carried out in parallel. In the standalone operation mode, TorsionDrive uses the Work Queue distributed computing framework35,36 to take advantage of parallel resources. When TorsionDrive is used as part of the QCArchive ecosystem,33 it works as an application programming interface (API) to specify the constrained optimization inputs while QCArchive is responsible for job management; this is described in detail in Sec. IV.
The steps of the example scan (i.e., rows in Fig. 1) are described in the following example. For clarity of presentation, it is necessary to define the basic procedures within a step and the separation between steps. At the start of a step, constrained optimization calculations are started based on the results of the previous step. Upon completion of these calculations, some grid points are set as “active points” as described below, and then, the step is concluded. The result of one step is independent of the order of completion of the individual optimizations within a step.
- Step 1:
An initial constrained optimization is performed starting from the user-provided initial geometry of the molecule, with constraints set equal to the closest dihedral grid point (0° in the example). After the optimization is completed, the optimization data (structure and energy) are assigned to the grid point, and it is set as an “active” point, denoted by the red color.
- Step 2:
New constrained optimizations are launched from each active point of Step 1 toward each of its neighboring points. The number of neighboring points is equal to 2× the dimension of the scan. In this example, there is one active point at 0° in step 1, and two constrained optimizations are started at the two neighboring points (−60° and 60°) in Step 2. The active points from the last step, which are used to launch the optimizations in the previous step, are colored orange. Upon completion of the two constrained optimizations, the two neighboring points gain their initial set of optimization data, and they are set as active points, colored in red.
- Step 3:
The two active points from Step 2 spawn new optimizations toward each of their neighbors. Two such constrained optimizations expand to the left and right, resulting in new active grid points at −120° and 120°. The other two constrained optimization are both targeted at the grid point at 0°; thus, the grid point at 0° gains two new sets of geometries and energies, which are potentially better (lower in energy), equal, or worse (higher in energy) compared to the existing data. To determine which data to keep, we compare the energy of each new result with the current lowest energy at this grid point. In this example, we assume that both new optimization results are equal to or higher than the energy obtained from the original optimization in Step 1. In such cases, the grid point is marked as inactive (blue).
- Step 4:
The two active points from Step 3, located at −120° and 120°, spawn four new constrained optimizations. Since the dihedral grid is periodic, the “leftward” optimization from −120° wraps around to the “right-most” grid point at 180°. The “rightward” optimization from +120° also targets this grid point. The result of the two new optimizations is compared, and the one with the lowest energy is assigned as the new data for the grid point at 180°, which is assigned as active (red) in the current step. In this example, the optimization from −120° to −60° results in an equal- or higher-energy geometry similar as before. However, the optimization from 120° to 60° results in a lower-energy geometry due to finding a lower-energy local minimum. To explore the potential energy surface around this new lowest-energy local minimum, the 60° grid point is set as an active point.
- Step 5:
The active points from Step 4 at 60° and 180° spawn four constrained optimizations. In this example, we assume that all four new optimizations result in equal or higher energy structures compared to stored data, so all four points are set to inactive.
- Step 6:
There are no active points from Step 5. The TorsionDrive procedure is complete, and the data for the lowest-energy structure at each grid point are compiled and saved. The data from other constrained optimizations at equal or higher energies are retained in the scratch space of the calculation but are not considered to be part of the final result.
To summarize, the TorsionDrive scan follows these rules: (a) Any grid point that gains its initial set of optimization data, or new optimization data with lower energy than its current lowest energy, is set to “active”. (b) All active points from the previous step spawns new constrained optimizations, starting from the lowest-energy structure, targeting all neighboring grid points. (c) If no active point is left, the scan converges.
The above example only illustrates a simple 1D scan. It should be noted that TorsionDrive supports dihedral scans of arbitrary dimensions, with the minimum cost scaling as by O(2d × Nd), where N is the number of grid points on each dimension and d is the number of dimensions. In addition, multiple initial geometries can be provided to improve the coverage of the PES. Over the course of applying this software package in ongoing research projects, we have also created additional features that we found useful, which are stated in Sec. III D.
III. RESULTS AND DISCUSSION
A. One-dimensional scanning example
A comparison between a 1D scan in TorsionDrive and a conventional serial relaxed scan is shown in Fig. 2. The dihedral angle to be scanned is indicated by the four highlighted atoms. In both cases, the calculation is initiated from a single structure with a dihedral angle of 0°, and scans were performed with a 15° resolution. Geometry optimizations were carried out using the geomeTRIC software,28 and energies and gradients were calculated using density functional theory (DFT), as implemented in the TeraChem software package.26,27 A restricted Kohn–Sham wavefunction and the B3LYP hybrid functional37 with the corresponding D3(BJ) empirical dispersion correction38–40 were used.
The serial relaxed scan is carried out in the +ϕ direction, and the result is clearly asymmetric as there are two regions in the plot around −90° and +90° where the energy rises gradually and then sharply drops making a sawtooth pattern. This occurs because as the torsion angle deviates from planar, both atoms of the central bond start to adopt pyramidal geometries. The energy barrier to pyramidal inversion causes the optimizations to yield increasingly high-energy structures and eventually breaks down causing the large energy drop. Although the shape of this particular PES might be due to the lack of multireference effects in the wavefunction,41 it is sufficient to illustrate the general tendency of serial relaxed scans to get stuck in local minima in the orthogonal degrees of freedom. By contrast, the energy profile generated by TorsionDrive is more symmetrical, as expected from the twofold symmetry of the molecule. The wavefront propagation procedure initiates constrained optimizations in both directions, and although the central atoms still adopt pyramidal geometries, both “branches” of the potential energy surface are treated equally. Moreover, the final energy profile generated by TorsionDrive has significantly lower energy barriers compared to the serial scan though the barrier is still quite high at around 30 kcal/mol.
To compare the quality of these data for force field fitting, we computed MM single point energies at the optimized structures using the recently developed Open Force Field “Parsley” small molecule force field version 1.1.0,42,43 which did not include this molecule in its training set. The results show that the highest-energy conformations in the sequential scans have QM−MM energy differences that are more than twice as large as the wavefront propagation scan (Fig. 2, red lines). These data would introduce unwanted biases during force field fitting as the parameters would tend to minimize the energy errors in the highly strained structures at the expense of accuracy in the lower-energy regions. Therefore, we think that the QM data from wavefront propagation can improve the “ingredients” for force field fitting and ultimately lead to more accurate parameters.
In terms of computational cost, the serial relaxed scan involved a total of 24 constrained optimizations (601 gradient evaluations), whereas the TorsionDrive calculation involved 19 wavefront propagation steps with a total of 91 constrained optimizations (2073 gradient evaluations). Although the total computational cost of TorsionDrive is higher, we note that the wall time to job completion may actually be faster if sufficient parallel resources are made available (for 1D scans, four parallel jobs is mostly sufficient).
B. Two-dimensional scanning example
Multi-dimensional torsion scans provide greater insight into the conformational degrees of freedom of many biologically relevant molecules. For example, the backbone torsion angles of proteins occur in (φ, ψ) pairs, and amino acid side chains and glycosidic bonds contain flexible chains with two or more connected torsions. When scanning the torsion angle in two or higher dimensions, the conventional serial scanning approach suffers from similar problems as in the 1D case, but the problems may be more serious. In addition to the two choices of scan direction for each dimension, the ordering of dimensions may also affect the results because only one torsion angle may be varied between contiguous calculations, while the others are held fixed. Molecular systems with two or more coupled torsions tend to exhibit a high degree of flexibility, which also increases the chances of multiple local minima that are easily missed by a sequential scan.
Figure 3 compares the results for glutamine dipeptide for the wavefront propagation scan with TorsionDrive and two sequential scans with different choices of dimensional ordering. The grid spacing, level of theory, and software used were the same as Sec. III A, and there are now 576 total points on the torsion grid due to the increased dimensionality. The scan is initiated from a single structure near the energy minimum where (φ, ψ) = (−83°, 62°). While the results appear similar in the low-energy region near the starting structure, there are major differences in the more distant regions of the energy profiles. Namely, the serial scan results include several high-energy “islands” in excess of +25 kcal/mol, whereas the TorsionDrive scan results have much lower energies in these regions. The sequential scan results also contain more sharp differences in the energy between adjacent grid points, for example, near (+90, +90)° in Fig. 3(c), whereas the TorsionDrive energy profile is smoother. Some of the most significant differences between the potential energy surfaces are highlighted by the starred regions, indicating that both sequential scans visited higher-energy local minima compared to the wavefront propagation scan. These results show that serial scanning poses an increased risk of providing incorrect or insufficient data compared to wavefront propagation scanning for parameterizing force fields in simulations.
In terms of computational cost, the sequential relaxed scans involved running 576 geometry optimizations and a total of 21 208/21 658 gradient evaluations, depending on the ordering of dimensions. The TorsionDrive calculation ran for 33 wavefront propagation steps, involving a total of 4953 energy minimizations and 166 714 gradient evaluations. The number of gradient evaluations in the TorsionDrive calculation is about 7.5 times greater than the sequential relaxed scans, but the wall time may be greatly reduced if parallel resources are available as each wavefront propagation step could launch up to 300 energy minimizations in parallel. In the ideal case that all calculations are able to run in parallel, the TorsionDrive calculation wall time would be equivalent to around 33 sequential geometry optimizations.
One can also take advantage of parallelism in other ways, such as by slightly modifying the sequential scanning approach to use the results of a 1D scan along one dimension to start an array of 1-D scans along the other dimension to create the 2D PES. In this case, the number of sequential geometry optimizations is reduced to as low as 26 (if one goes in both directions simultaneously). Figure S1 of the Supplementary material shows an example where a unidirectional 1D scan along φ is used to start an array of 24 1D scans along ψ. The shape of the PES and general locations of high-energy minima are largely consistent with Fig. 3(c), in line with expectations.
C. Multiple initial structures
The wavefront propagation procedure of TorsionDrive is naturally able to incorporate multiple starting conformations. The initial constrained optimizations are performed on all starting structures with the torsion angle constrained to the closest grid point. If more than one initial structure is mapped to the same grid point, the lowest energy optimized conformer is used to launch new constrained optimization for neighboring points. This feature is beneficial because a grid of torsion angles can be covered in a smaller number of wavefront propagation steps when starting from multiple structures, and it also provides a natural way to consistently include the lowest-energy local minimum from multiple initial guesses.
In many molecules, the potential energy surface includes coupling across multiple torsion angles due to intramolecular non-bonded interactions, with protein backbone torsion angles (φ, ψ) being a well-known example. Ethylene glycol is an example of a molecule with strong intramolecular interactions between the hydroxyl functional groups. Figure 4 compares the results of a 1D torsion drive started from one initial conformation (indicated with +) and multiple initial conformations (indicated with *) together with a 2D torsional PES. These calculations were performed within the QCArchive infrastructure that provides TorsionDrive calculations as a service, as described in Sec. IV B. Energies and gradients were calculated using the B3LYP-D3(BJ) functional and DZVP basis optimized for DFT,44 as implemented in the Psi4 software package,23 and optimizations were carried out using the geomeTRIC software.28
The results show that a 1D torsion scan started from a conformation where the hydroxyl groups are facing in opposite directions (4b upper) will fail to find the lowest energy conformer. However, when the 1D scan is started from multiple conformations with different starting values of the O–C–C–O and H–O–C–C torsion angles, the resulting scan includes some structures that contain intramolecular hydrogen bonding character and lower overall energies (4b lower). Most conformers found by this scan are lower in energy than the structures found in the other scan. The 2D scan is shown in Fig. 4(c) with the two 1D scans mapped onto the heat-map. While the H–O–C–C dihedral angle does not change a lot in the scan started with one initial conformation (indicated with +), the H–O–C–C dihedral angle of the scan started with multiple conformations (indicated with *) changes to follow the lowest energy path on the potential energy surface.
It is well known that intramolecular electrostatic contacts (IECs), such as intramolecular hydrogen bonds, are much stronger in the gas phase than in solvent.45,46 This can be attributed to the dielectric solvent’s attenuation of electric fields and competing hydrogen bonding effects from the solvent, and is consistent with the observation that non-polarizable force fields for condensed phase simulation tend to underestimate gas-phase hydrogen bond energies.47 Therefore, searching for the lowest-energy structure in systems with strong gas-phase IECs could cause undesirable biases when fitting parameters of non-polarizable force fields; polarizable force fields are not as susceptible to this problem.48 One approach to avoiding forming IECs in geometry optimizations is to modify the potential surface by adding artificial repulsive potentials between groups expected to interact electrostatically49,50 or by using an implicit solvent model. Another promising approach is to carry out all steps of QM data generation and force field fitting with an implicit solvent model.51 Current implicit solvent implementations make it difficult to use the same solvent model for MM and QM during force field fitting, though this effect appears to be minor; we look forward to seeing more advances in this area in the future.
D. Generalized dihedral scanning and restricted grid
TorsionDrive has several additional features for flexibility in setting scanning coordinates for different molecular systems. First, the scanned coordinate(s) are not required to be strict definitions of proper torsion angles as four atoms in three sequential covalent bonds; any dihedral angle defined by four atom indices can be scanned, such as generalized torsion angles defined by four non-consecutive atoms or improper torsion angles describing pyramidalization. Second, the scan range can be restricted to focus computations on regions of interest, in case certain ranges of the dihedral angle are not physically reasonable.
The usefulness of these non-conventional torsion degrees of freedom is further enhanced by TorsionDrive’s robustness in building a smooth PES even for difficult systems. As an example, we conducted dihedral scans on a molecular motor that works by rotating around its torsion angle, as shown in Fig. 5. The subject molecule is a crowded and strained alkene where rotation of central double-bond torsion has to overcome a considerable energy barrier.52 Between the two sides of the rotation barrier, there are also large structural differences such as ring pucker flips. In this example, rotation around the central double bond is characterized by a torsion angle defined by four non-consecutive atoms because using consecutive atoms to define the torsion angle would lead to poor projection of the barrier onto the scanned coordinates.
The performance of TorsionDrive and conventional serial scanning is compared in Fig. 5 where a generalized proper torsion and improper torsion are scanned over. The serial scan started from the (−30, 0) grid point, and the dimensions are ordered such that consecutive optimizations involved changing the improper torsion angle. A comparison of the two calculations shows that TorsionDrive and serial scanning produce markedly different potential surfaces, with TorsionDrive giving a superior result in terms of finding much lower energy conformations. The serial scan succeeded in locating the leftmost minimum near the start point of the scan (−30, 0) but failed to obtain the other two local minima found by TorsionDrive. Moreover, the serial scan always yielded equal or higher energy than TorsionDrive at each grid point and reached many structures in excess of 40 kcal/mol on the right side of the PES. The choice of dimensional ordering also affects the outcome of a serial scan as the serial scan crashed at the (180, 0) grid point when the opposite ordering was used, likely due to reaching an excessively strained structure.
In summary, we think that the superior performance of TorsionDrive is because it optimizes multiple structures at the same grid point from different propagating directions. This procedure makes the TorsionDrive result less sensitive to the sometimes unpredictable convergence of geometry optimization methods to different local minima depending on starting conformation. This procedure, optionally augmented by using multiple starting conformations, results in a higher quality PES and much higher chances of finding relevant local minima. In addition, TorsionDrive saves human effort and wall time in troubleshooting and restarting crashed calculations due to its robustness against the sometimes unpredictable convergence failure of geometry optimizations.
E. Data analysis and visualization
The TorsionDrive software package includes Python scripts to parse output files into formatted files containing energies, gradients, coordinates, and associated metadata. These file formats are designed to facilitate automated fitting of force field parameters, which we will describe elsewhere. Also provided are Jupyter Notebooks53 for interactive visualization of the resulting energy surfaces (powered by Plotly) and inspection of corresponding molecular structures (powered by NGLview54). Figure 6 shows a typical usage of the visualization notebooks. Upon execution of the code cells, a contour is plotted to visualize the energies at each grid point, and hovering the mouse pointer on the plot shows the dihedral and energy values of the nearest grid point. Left clicking on the plot displays the optimized structure of the nearest grid point. The displayed structure is interactive and can be rotated, translated, zoomed, etc. Such synchronized visualization of energies and structures allow the user to efficiently examine critical points and other features of the PES.
IV. SOFTWARE INFRASTRUCTURE
The described method naturally lends itself to several layers of interoperability and parallelization strategies, which are detailed below. The simplest invocation of TorsionDrive at the API layer takes in a list of dihedral angles to scan over, the granularity of each scan, and necessary information on the initial molecules (atomic symbols and Cartesian coordinates). At every step, TorsionDrive emits a series of dihedral angles to perform a constrained optimization as well as the starting geometry, which can be evaluated by many downstream programs. The next iteration is then started by supplying the TorsionDrive procedure with the Cartesian coordinates and final molecular energy of each constrained geometry optimization.
This API design abstracts away the details of which program is used to evaluate the constrained geometry optimization, allowing many different quantum chemistry, semi-empirical, force field, or machine learning (ML) potential programs to be used to generate the necessary values. This strategy is robust to new methodology and program development and avoids being “locked into” a particular software package. Several examples of this are using Python-based geometry optimizers that are agnostic to the backend program to evaluate these constrained geometry optimizations such as geomeTRIC, PyBerny,55 and PyOptKing.56 In addition, Python-based suites of tools that attempt to abstract back ends exist, such as the Atomic Simulation Environment (ASE)57 and QCEngine, allowing for many additional programs to be used with a simplified interface.
A. Task execution systems
On top of allowing flexibility in the evaluated program, this structure also provides integration with task execution system parallelization tools.58 Task execution systems are typically software programs that can acquire computational resources on supercomputers through standard resource programs (e.g., SLURM59) and automate the computation of tasks (constrained geometry optimizations) on these resources. Tasks are typically computed via the following procedure:
A central task scheduler is created on a head node.
The central task scheduler acquires compute nodes through the local resource scheduler.
The compute nodes are harnessed by spawning a “worker” daemon process, which can communicate tasks to and from the scheduler via the local intranet.
Tasks are shipped from the central scheduler to a worker process, and the results of the task are shipped back to the central task scheduler.
Task execution systems allow the TorsionDrive calculation to be parallelized not only across cores of a single node but also across computational nodes even if the underlying quantum chemistry program is not able to do so. There are many such task execution systems available such as Work Queue,35,36 Dask,60 Parsl,61 RADICAL Pilot,62 and Fireworks63 in the academic computing space, which provide this service and can be trivially integrated with TorsionDrive. At present, TorsionDrive supports Work Queue when running in the standalone operation mode and, through the QCArchive integration, supports several other task execution systems such as Dask, Parsl, RADICAL, and Fireworks. Figure 7(a) shows how TorsionDrive interacts with task execution systems when running in the standalone execution mode, i.e., outside of QCArchive.
B. QCArchive integration
The MolSSI QCArchive project is a platform for computing, storing, analyzing, and sharing quantum chemistry data. The QCArchive software infrastructure uses a client–server model; the server (QCFractal) is a permanent Python-based server, which stores quantum chemistry computations, runs “services” such as TorsionDrive to generate new quantum chemistry tasks, and provides an API to search and organize previous computations, and QCPortal is a Python-based API for interacting with the server suitable for Jupyter Notebooks.53
Fundamentally, QCFractal is a tool to compute a large number of quantum chemistry primitives such as an energy or gradient computation or procedures such as a geometry optimization with a variety of different community packages. Building on top of this core of primitives, it is easy to add workflows such as TorsionDrive to the software stack due to its API layers, which are agnostic to how the geometry optimization is evaluated. In addition, tools such as QCFractal allow many TorsionDrives to be evaluated concurrently on one or more physical resources such as a campus cluster or supercomputer to improve the possible parallelization of these computations further. A general workflow with QCFractal would be as follows:
A user submits one or more TorsionDrive computations to QCFractal from the QCPortal front-end client.
The QCFractal server uses TorsionDrive to generate new geometry optimizations to be computed.
The geometry optimizations are computed on one or more physical resources where a physical resource can be a single core to a large supercomputer through a task execution system.
Items 2–3 are iterated until convergence.
Figure 7(b) shows how the user is able to use TorsionDrive as a service within the QCArchive infrastructure. An example usage of the QCPortal client is show in Fig. 8. The object returned in this image has API-based access to every geometry optimization and gradient evaluation.
V. CONCLUSIONS
The reformulation of torsion angle scanning as wavefront propagation comes with several important benefits that we think are worth the increased computational cost. These benefits include improved symmetry of the potential surface, which is related to the calculation results being independent of any chosen scan direction or dimensional ordering. The resulting potential energy surfaces have fewer discontinuities compared to sequential scanning, and in typical cases, lower-energy structures and potential minima can often be found. Multiple initial guesses can be naturally incorporated, allowing the calculations to finish in reduced wall time, given sufficient computing resources. This procedure also has a reduced tendency to get trapped in high-energy local minima, resulting in improved performance and reliability for challenging systems and generalized choices of dihedral angles.
In terms of software, TorsionDrive is a flexible package that can utilize either Python-based geometry optimization codes or quantum chemistry packages with integrated geometry optimization routines. It can run either in standalone mode and take advantage of parallel resources using the Work Queue task execution system or as a service in the QCArchive ecosystem that features centralized management of data and computer resources. Overall, we believe that this component-based approach to software development allows TorsionDrive to be a flexible piece of middleware that can be harnessed by a large number of external programs and incorporated into existing software ecosystems in a straightforward manner. This approach also helps users by enabling consistent approaches to geometry optimizations or higher level workflows when using different quantum chemistry software packages that may differ in terms of the methods that are implemented in each.
SUPPLEMENTARY MATERIAL
The supplementary material contains input files for TorsionDrive calculations including quantum chemistry inputs and output molecular structures and associated energies for Fig. 2–5. Structures are provided as Cartesian coordinates in XYZ format.
DATA AVAILABILITY
The data that support the findings of this study are available within the article (and its supplementary material).
ACKNOWLEDGMENTS
Y.Q. was supported by a fellowship from the Open Force Field Consortium and ACS-PRF Award No. 58158-DNI6. D.G.A.S. was supported by U. S. National Science Foundation (NSF) Grant No. ACI-1547580 and from the Open Force Field Consortium. C.D.S. was supported by a fellowship from the Molecular Science Software Institute under NSF Grant No. ACI-1547580. M.F. acknowledges funding from the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R01GM61300 to Michael K. Gilson. H.J. was supported by a fellowship from the Open Force Field Consortium. L.P.W. acknowledges funding from the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R01GM132386. We acknowledge John Chodera, John Stoppelman, Alberto Gobbi, Joshua Horton, and Xinjun Hou for helpful discussions. We also thank the Molecular Sciences Software Institute (MolSSI) for its support of the Open Force Field Consortium and Initiative. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or the commercial partners of the Open Force Field Consortium.
The Open Force Field Consortium is composed of academic investigators from the Open Force Field Initiative and sponsoring Industry Partners collaborating to advance open force field science, toolkits, and standards for biomolecular drug discovery. The full list of funding partners is available online at https://openforcefield.org/consortium/.