We have implemented the Martini force field within Lawrence Livermore National Laboratory’s molecular dynamics program, ddcMD. The program is extended to a heterogeneous programming model so that it can exploit graphics processing unit (GPU) accelerators. In addition to the Martini force field being ported to the GPU, the entire integration step, including thermostat, barostat, and constraint solver, is ported as well, which speeds up the simulations by up to 278-fold using one GPU vs one central processing unit (CPU) core. A benchmark study is performed with several test cases, comparing ddcMD and GROMACS Martini simulations. The average performance of ddcMD for a protein–lipid simulation system of 136k particles achieves 1.04 µs/day on one NVIDIA V100 GPU and an aggregate of 6.19 µs/day on one Summit node with six GPUs. The GPU implementation in ddcMD offloads all computations to the GPU and requires only one CPU core per simulation to manage the inputs and outputs, freeing up the remaining CPU resources on the compute node for alternative tasks often required in complex simulation campaigns. The ddcMD code has been made open source and is available on GitHub at https://github.com/LLNL/ddcMD.
I. INTRODUCTION
Molecular dynamics (MD) simulations compute the positions and velocities of particles using Newton’s laws of motion. MD has been widely applied to simulate the dynamics of many biomolecules, which are essential to life, at the atomic scale. Popular biomolecular MD simulation codes include CHARMM,1 GROMACS,2 AMBER,3 NAMD,4 and OpenMM.5 Although the speed and accuracy of bio-MD simulations have improved significantly over the past few decades, the energy and force functions remain largely unchanged. These energy terms can be divided into two categories, bonded and non-bonded terms. Bonded terms compute interactions between particles connected by covalent bonds through the bond, angle, torsion, improper torsion, CMAP,6 and other terms. The van der Waals and electrostatic terms are the non-bonded forces, normally calculated explicitly for all particles within a cutoff distance. The collection of parameters for the functional terms is called a force field. CHARMM, AMBER, OPLS,7 and GROMOS8 are all-atom/united-atom force fields, while Martini,9 ELBA,10 and SIRAH11,12 are coarse-grained (CG) force fields in which multiple heavy atoms are merged into a single “bead”; both types of force fields are widely employed in biomolecular simulations.
MD simulations are computationally expensive when systems are large and long time scales are desired. To accelerate simulations, various parallel techniques have been applied to MD codes running on different hardware. The domain decomposition method is widely used in many MD codes to split large simulation systems into smaller subdomains. For a medium-sized simulation system of tens of thousands of atoms, these codes can produce more than 100 ns of atomistic MD trajectory per day on central processing units (CPUs). The graphics processing unit (GPU) has gained popularity in scientific computing for its massive threading capability.2,5,13,14 The performance of AMBER, NAMD, GROMACS, and OpenMM has dramatically improved on GPUs in recent years. However, the scalability of GPU codes for MD simulations is poor due to bandwidth bottlenecks between GPUs and CPUs and between nodes. The new generation of DOE supercomputers, Sierra and Summit,15 uses NVIDIA’s NVLink™ for CPU–GPU and GPU–GPU connections, which can improve scalability. The third category of hardware for MD simulations is the special-purpose processor, such as Anton, developed by D. E. Shaw Research. DESMOND, the MD software running on Anton, has been demonstrated to be approximately two orders of magnitude faster than CPU-based MD codes.16
The AMBER MD code has been ported to the GPU, starting from AMBER 12 by Götz and co-workers.13,17 AMBER is an all-atom code and force field, whereas Martini is a coarse-grained force field. Accordingly, AMBER contains a fast Fourier transform (FFT) in its particle mesh Ewald (PME) method, whereas standard Martini requires no FFT and instead handles long-range interactions with reaction field electrostatics.17 Concerning its GPU implementation, AMBER is one of the first MD codes to port the majority of the compute loop to the GPU. Notably, AMBER implements a “cluster-pair” based algorithm for calculating non-bonded interactions. This algorithm calculates many unused pair interactions, which are outside the interaction cutoff radius, but consequently allows for optimizations based on force symmetry of pairwise interactions. Such optimization is especially useful in computation-heavy all-atom force fields.
GROMACS, the community standard code for Martini simulations, has traditionally employed a GPU acceleration strategy that offloads only the most performance-critical MD operations to the GPU, leaving several other workloads on the CPU.18,19 This strategy requires frequent CPU–GPU memory transfers and thus careful orchestration and scheduling of computing workloads across both the CPU and the GPU. More recent GROMACS efforts have focused on selectively moving more of the computations to the GPU: bonded interactions were moved to the GPU in the GROMACS 2019 release, and constraints were ported in the 2020 release.20,21
ddcMD is an in-house MD simulation code developed at the Lawrence Livermore National Laboratory and was originally designed to run on homogeneous computing architectures by massively parallelizing the simulations with a flexible domain-decomposition scheme.22,23 It has been applied in various research areas, including biology, material science, and plasma physics.24–26 ddcMD has been recognized for its outstanding achievement in high-performance computing by winning the Gordon Bell Prize twice,25,27 which is a prestigious award from the ACM/IEEE Supercomputing Conference. With the advent and proliferation of heterogeneous supercomputing architectures, there was a need to implement GPU support for ddcMD to take advantage of these novel computing platforms. While several GPU-enabled MD codes already exist, none of them satisfied the need for a high-throughput code that both supported the Martini force field and minimized the CPU utilization. Therefore, this unique capability within ddcMD is developed to enable running Martini simulations within the context of an even larger multiscale simulation framework.24
The Martini force field is one of the most popular bio-molecular CG force fields,9 supporting a growing list of molecule types, including lipids,9 proteins,28 nucleotides,29,30 and polymers;31 it has been used for a large variety of applications.32,33 The Martini philosophy is to use extendable CG building blocks and a fixed set of standard interaction potentials to maximize applicability and transferability. Each building block or CG bead represents approximately four heavy atoms and their attached hydrogens, giving an approximate tenfold reduction in particle count compared to all-atom force fields. The reduction in particle count combined with the smoother interaction potentials, allowing for longer time steps (typically 20 fs) and resulting in faster overall dynamics, leads to 2–3 orders of magnitude speed-up compared to all-atom force fields.9,34 Early efforts to accelerate MD using GPUs focused mostly on speeding up the non-bonded calculations, the bulk of the computational cost for atomistic simulations. These acceleration kernels had little effect on CG force fields, where the particle density is lower, and typically the non-bonded cutoff distance shorter, resulting in significantly fewer non-bonded interactions. Therefore, to gain good computational performance for Martini, all parts of the computationally intensive kernels of ddcMD are implemented on the GPU.
II. METHOD
A. GPU implementation
1. General GPU code design in ddcMD
Several GPU programming frameworks, including OpenMP, RAJA,35 and Kokkos,36 were evaluated when we started to develop the GPU codes in ddcMD. Although they provide reasonable GPU performance and good code portability, they do not offer the developer the level of control that comes with hand optimizing kernels using the NVIDIA® CUDA® framework. For bio-MD application codes that need strong scaling performance, the CUDA framework is leveraged. Furthermore, the existing literature on GPU-based MD codes shows that optimized GPU implementations generally do not share the same loop structures as their CPU counterparts.13,17 In other words, the algorithms, memory access patterns, and FLOP counts of the corresponding CPU and GPU kernels are vastly different. Thus, using a for-loop-offload based model such as OpenMP or RAJA would simply not suffice, as the GPU MD code requires a different algorithmic design altogether.
One of the biggest bottlenecks for the GPU code is the limited CPU–GPU and GPU–GPU bandwidth. To avoid GPU-to-GPU data transfer altogether, the code is designed to run on a single GPU, which is sufficient for most Martini MD simulations since one V100 or P100 GPU can handle up to 8 × 106 particles (equivalent to ∼80 × 106 atoms in an all-atom simulation), as shown in Secs. III A–III D. To reduce the traffic between the CPU and the GPU, we ported the entire MD loop to the GPU. Data transfers between the CPU and the GPU are only necessary at the beginning and end of the simulation run for transferring input and output data. Only one CPU core is required since all of the compute-heavy work is done by the GPU. Thus, the current ddcMD GPU code is designed to run one MD simulation on one CPU core and one GPU. The best throughput, when many simulations are needed, is achieved by running many single-GPU simulations in parallel to saturate all GPUs on the supercomputer. The usage of CPUs has been minimized in ddcMD so that a workflow manager, such as MuMMI,24 can allocate more CPU resources for other types of calculations, simulations, and in situ analysis.
To reduce the data traffic between the CPU and the GPU, the entire MD loop (Fig. 1), including the non-bonded and bonded calculations, the Verlet neighbor table construction, the position restraint term, the Langevin thermostat,37 the Berendsen pressure coupling (barostat),38 the velocity Verlet integrator,39 and the RATTLE constraint solver,40 is off-loaded to the GPU.
2. Verlet neighbor table construction
The Verlet neighbor table maintains the lists of all particles within a given cutoff distance, plus some extra distance. Prior to every call to the neighbor table construction, a binning calculation is carried out by permuting and sorting the coordinate buffers into bin order so that particles in the same bin are contiguous in the GPU memory. The neighbor table construction kernel builds a per-particle neighbor list that is later used in the kernels of energy and force evaluation. The kernel allocates a four-thread team per particle for distance calculations, looping through the particles in the current bin and its 26 surrounding bins. The indices of particles within a cutoff distance of the current particle are then added to its neighbor list. Since each particle has a thread team of size 4, the neighbor table construction kernel requires atomics for the four threads to safely append neighboring particle indices to the current particle’s neighbor list. In the future, the atomic operations will be removed by using prefix sums and balloting primitives. Reduced precision optimization has also been explored for further speed-up. However, double precision is used throughout the GPU code in this study.
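As an illustration of this scheme, the sketch below outlines a bin-based neighbor-list build with a four-thread team per particle and an atomic append, in the spirit described above. All names (binStart, binCount, neighborBins, MAXNBR, etc.) are hypothetical, and periodic-boundary handling and the extra skin distance are omitted for brevity; this is not ddcMD’s actual kernel.

```cuda
#define TEAM 4       // threads cooperating on one particle
#define MAXNBR 128   // illustrative per-particle neighbor capacity

// Launch with nParticles * TEAM threads; nbrCount must be zeroed beforehand.
__global__ void buildNeighborList(const double3* __restrict__ pos,
                                  const int* __restrict__ binOfParticle,
                                  const int* __restrict__ binStart,     // first particle of each bin (bin-sorted order)
                                  const int* __restrict__ binCount,     // particles per bin
                                  const int* __restrict__ neighborBins, // 27 bin ids per bin (self + 26 neighbors)
                                  int nParticles, double rcut2,
                                  int* __restrict__ nbrList,            // [nParticles * MAXNBR]
                                  int* __restrict__ nbrCount)           // [nParticles]
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int i    = tid / TEAM;   // particle this team works on
    int lane = tid % TEAM;   // 0..3 within the team
    if (i >= nParticles) return;

    double3 ri = pos[i];
    int myBin  = binOfParticle[i];

    // Loop over the particle's own bin and its 26 surrounding bins.
    for (int b = 0; b < 27; ++b) {
        int bin   = neighborBins[27 * myBin + b];
        int first = binStart[bin];
        int count = binCount[bin];
        // The four team threads stride through the bin's particles.
        for (int k = lane; k < count; k += TEAM) {
            int j = first + k;
            if (j == i) continue;
            double dx = pos[j].x - ri.x;
            double dy = pos[j].y - ri.y;
            double dz = pos[j].z - ri.z;
            if (dx * dx + dy * dy + dz * dz < rcut2) {
                // Atomic append: the team's four threads share one counter.
                int slot = atomicAdd(&nbrCount[i], 1);
                if (slot < MAXNBR) nbrList[i * MAXNBR + slot] = j;
            }
        }
    }
}
```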
3. Non-bonded kernel
The crux of any classical MD simulation lies in the calculation of forces between each particle and its neighbors within a cutoff radius. Efficient calculation of these interactions is a ubiquitous component of any performant MD code. To support a wide array of MD potentials, the priority is to create a templated “pair-processing” CUDA kernel for handling any near-neighbor pair potential, such as the Lennard-Jones and reaction field electrostatics of the Martini force field.9 In classical biological force fields, these terms model the non-bonded interactions and consume the majority of the computing time, so optimizing the pair-processing kernel is crucial to the performance of the code. Since both the Lennard-Jones potential and the reaction field utilize the same neighbor table, the two kernels have been fused into one non-bonded kernel. The non-bonded kernel iterates over each particle’s neighbor list and calculates the energies and forces between each particle and its neighbors. We have not yet implemented PME in the GPU version of ddcMD, but as PME is often used with the polarizable version of Martini, adding PME GPU support is planned for future work.
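For orientation, the standard forms of these two pair potentials are sketched below; these are the conventional expressions used in reaction-field Martini setups, shown here as an assumed reference rather than quoted from the ddcMD source. With an infinite dielectric beyond the cutoff, $k_{\mathrm{rf}}$ reduces to $1/(2 r_c^3)$.

```latex
% Lennard-Jones plus reaction-field electrostatics for a pair within the
% cutoff r_c; \epsilon_{rf} is the dielectric constant beyond the cutoff.
\[
  V_{\mathrm{LJ}}(r_{ij}) = 4\varepsilon_{ij}
  \left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}
      - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right],
\]
\[
  V_{\mathrm{RF}}(r_{ij}) = \frac{q_i q_j}{4\pi\epsilon_0\,\epsilon_r}
  \left[\frac{1}{r_{ij}} + k_{\mathrm{rf}}\, r_{ij}^{2} - c_{\mathrm{rf}}\right],
  \qquad
  k_{\mathrm{rf}} = \frac{\epsilon_{\mathrm{rf}} - \epsilon_r}
                         {\left(2\epsilon_{\mathrm{rf}} + \epsilon_r\right) r_c^{3}},
  \qquad
  c_{\mathrm{rf}} = \frac{1}{r_c} + k_{\mathrm{rf}}\, r_c^{2}.
\]
```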
Several optimization techniques have been applied to speed up this kernel. The first technique is to improve the thread scheduling for better latency hiding. In theory, assigning one thread per particle to process the neighbors in the current particle’s neighbor list yields enough threads to fill the GPU. However, there is better memory locality within a particle’s neighbor list than between two particles’ neighbor lists. Thus, assigning multiple threads per particle results in better thread and cache locality. We tuned the thread-per-particle count and determined that Martini simulations work best with eight threads per particle. The second technique is to enforce coalesced memory access of the neighbor table. Each particle has a team of eight threads, which iterate over its neighbor list. Ensuring these threads always access contiguous pieces of memory results in better locality and fewer bank conflicts. The third technique is to rearrange output buffers for better write locality. The non-bonded kernel’s read-only buffers, including those containing particle positions, are stored in memory in a struct-of-arrays fashion. This physical arrangement enables coalesced reads from memory, which is essential to achieving good GPU read efficiency. In contrast, the calculated energies, forces, and virials are interleaved and stored to global memory in an array-of-structs fashion. Experimentation shows that this arrangement performs better for output arrays because it enables better write locality: unlike reads, each thread performs only a small, constant number of writes at the end of the kernel (outside the loop), so the advantage of writing to a single cache line outweighs the benefit of coalescing memory writes. The fourth technique is to use shuffle-sync based reductions in lieu of shared memory reductions.41 The reduction operations are performed for the current particle’s partial forces, virials, and energies across the eight-thread team. The shared memory reduction, while fast, requires an amount of shared memory proportional to the number of output buffers. In simulations with large output requirements, the kernel would eventually exceed the shared memory limit, forcing shared memory to be reused and, thus, serializing some reductions and losing some instruction-level parallelism. Switching to warp-level, shuffle-intrinsic-based reductions eliminates the need for shared memory.
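The fourth technique can be illustrated with a short sketch of an eight-lane shuffle-sync reduction. The kernel below is a stripped-down, hypothetical example (the buffer names and the surrounding loop structure are not ddcMD’s) that shows only the reduction idiom.

```cuda
// Reduce a partial sum across each 8-lane team within a warp using
// __shfl_down_sync with width 8; the result is valid on lane 0 of each team.
__device__ double teamReduce8(double v)
{
    // Full-warp mask: all 32 lanes must still be active when this is called.
    for (int offset = 4; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset, 8);
    return v;
}

__global__ void nonbondedReduceExample(const double* __restrict__ partialEnergy, // [nParticles * 8]
                                       double* __restrict__ energyPerParticle,   // [nParticles]
                                       int nParticles)
{
    const int THREADS_PER_PARTICLE = 8;
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int i    = tid / THREADS_PER_PARTICLE;
    int lane = tid % THREADS_PER_PARTICLE;

    // No early return: out-of-range teams contribute 0.0 so the whole warp
    // participates in the shuffle and the full mask stays valid.
    double e = (i < nParticles) ? partialEnergy[tid] : 0.0;
    e = teamReduce8(e);

    // One write per team: lane 0 stores the reduced value.
    if (lane == 0 && i < nParticles) energyPerParticle[i] = e;
}
```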
Many CPU-based MD codes utilize force symmetry when optimizing non-bonded interactions between particles. In other words, when calculating the force exerted on particle X by neighboring particle Y, it is not necessary to recalculate the force exerted on Y by X; the latter force is simply the negation of the former, as stated by Newton’s third law. Accordingly, this technique can cut the number of necessary non-bonded calculations in half. However, when implemented on the GPU, algorithms utilizing force symmetry often involve scattered memory accesses and heavy usage of atomics, both of which result in poor performance. Therefore, this technique is not implemented in the current ddcMD version. There are forms of the neighbor list algorithm that efficiently utilize force symmetry on the GPU, such as the cluster-pair algorithm used by AMBER. However, these modified neighbor table algorithms must calculate many more pair interactions than necessary, thus counteracting many of the computational benefits of exploiting force symmetry. Implementing and optimizing a cluster-pair algorithm to better understand these trade-offs within ddcMD is left for future work.
4. Bonded kernels
The Martini force field leverages several bonded energy potentials, including bond, angle, cosine-based angle, torsion angle, and improper terms. The torsion angle and improper terms share a dihedral kernel, while a separate kernel is created for each of the remaining bonded potentials. The bonded terms are computed in a different order on the CPU than on the GPU. In the CPU code, we first iterate over each residue and then calculate all energy potentials (bonds, angles, etc.) for that specific residue. In alignment with the GPU’s SPMD (single program multiple data) programming model, the CUDA implementation has separate kernels for handling each bonded energy potential. For example, all bond potentials (for all residues) are calculated at once, then all the angle potentials are calculated at once, and so on. The data structures for the bonded terms are nested structs in the CPU code. To improve the efficiency of memory accesses on the GPU, these data structures are serialized before being sent to the GPU. The CPU also constructs various prefix sum-based maps for indexing into the serialized arrays, which are also copied to the GPU, as illustrated in the sketch below. The data serialization only needs to be done once at the beginning of the simulation because the connectivity of the molecules remains the same throughout the MD simulation.
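The sketch below illustrates the serialization idea with hypothetical names: nested per-residue bond records are flattened into one array plus a prefix-sum offset table on the host, and a single bond kernel then processes all bonds in one launch. It is a minimal illustration, not ddcMD’s actual data layout.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

struct BondParam { int i, j; double r0, k; };    // flattened bond record

struct ResidueBondsHost {                        // nested CPU-side structure
    std::vector<BondParam> bonds;
};

// Flatten all residues' bonds and build a prefix-sum offset table so that
// residue r owns the range [offset[r], offset[r+1]) in the flat array.
void serializeBonds(const std::vector<ResidueBondsHost>& residues,
                    std::vector<BondParam>& flat, std::vector<int>& offset)
{
    offset.assign(residues.size() + 1, 0);
    for (size_t r = 0; r < residues.size(); ++r)
        offset[r + 1] = offset[r] + (int)residues[r].bonds.size();
    flat.resize(offset.back());
    for (size_t r = 0; r < residues.size(); ++r)
        std::copy(residues[r].bonds.begin(), residues[r].bonds.end(),
                  flat.begin() + offset[r]);
}

// One kernel handles every bond in the system in a single launch.
// atomicAdd on double requires compute capability >= 6.0 (P100/V100).
__global__ void bondForces(const BondParam* __restrict__ bonds, int nBonds,
                           const double3* __restrict__ pos,
                           double3* __restrict__ force)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nBonds) return;
    BondParam p = bonds[b];
    double dx = pos[p.j].x - pos[p.i].x;
    double dy = pos[p.j].y - pos[p.i].y;
    double dz = pos[p.j].z - pos[p.i].z;
    double r  = sqrt(dx * dx + dy * dy + dz * dz);
    double f  = -p.k * (r - p.r0) / r;           // harmonic bond: F_j = -k(r - r0) * d/r
    atomicAdd(&force[p.i].x, -f * dx); atomicAdd(&force[p.j].x, f * dx);
    atomicAdd(&force[p.i].y, -f * dy); atomicAdd(&force[p.j].y, f * dy);
    atomicAdd(&force[p.i].z, -f * dz); atomicAdd(&force[p.j].z, f * dz);
}
```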
5. Constraint kernels
Martini, like other biochemical force fields, employs constraints to enable the stable use of larger time steps. Other GPU codes, such as GROMACS, elect to keep this calculation on the CPU and parallelize it across threads, thus requiring significant memory movement between the CPU and the GPU. In keeping with the strategy of minimizing data movement between the host and device processors to maximize performance, ddcMD keeps this calculation entirely on the GPU.
The constraint algorithm is iterative, and the associated kernel requires repeated access to memory containing molecule- and constraint-specific constants. In the V100-specific implementation, these buffers are kept in GPU shared memory, which is unified with the L1 data cache, allowing for fast reads. Shared memory, however, is much slower on the Pascal architecture. Micro-benchmarking shows that the latencies for accessing constant and shared memory are both 27–28 cycles on V100 GPUs. On P100 GPUs, while the constant memory broadcast latency is 24 cycles, the shared memory hit latency is much more expensive at 82 cycles.42 Thus, a different implementation of the constraint kernel is maintained for P100 GPUs; these data are instead kept exclusively in constant memory. Furthermore, launch-time code generation and compilation with the NVRTC API are used to augment the P100 kernel’s performance. Our code generator manually unrolls loops and injects the constants directly into the source code. Embedding the constants in the code, as opposed to storing them in a constant memory array, eliminates the ADD instructions that would be necessary to calculate offsets into the array.
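A minimal sketch of launch-time code generation with NVRTC is shown below: a constraint constant is pasted into the kernel source as a literal before compilation, in the spirit described above. The kernel body, names, and values are illustrative placeholders, and loading and launching the resulting PTX with the driver API is omitted.

```cuda
// Link with -lnvrtc.
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    double constraintLength2 = 0.47 * 0.47;   // illustrative squared bond length

    // Generate the kernel source with the constant embedded as a literal,
    // avoiding a constant-memory array lookup (and its offset ADDs) at run time.
    std::string src =
        "extern \"C\" __global__ void scaleToConstraint(double* d2, int n) {\n"
        "  const double target2 = " + std::to_string(constraintLength2) + ";\n"
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (i < n) d2[i] = target2 / d2[i];  // placeholder constraint math\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "constraint.cu", 0, nullptr, nullptr);
    const char* opts[] = { "--gpu-architecture=compute_60" };   // Pascal target
    nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

    if (rc != NVRTC_SUCCESS) {
        size_t logSize; nvrtcGetProgramLogSize(prog, &logSize);
        std::vector<char> log(logSize); nvrtcGetProgramLog(prog, log.data());
        std::fprintf(stderr, "NVRTC error:\n%s\n", log.data());
        return 1;
    }

    // The PTX would next be loaded with the CUDA driver API (cuModuleLoadDataEx)
    // and launched; that part is omitted here for brevity.
    size_t ptxSize; nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize); nvrtcGetPTX(prog, ptx.data());
    std::printf("Generated %zu bytes of PTX with the constant baked in.\n", ptxSize);

    nvrtcDestroyProgram(&prog);
    return 0;
}
```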
6. Langevin thermostat kernel
Langevin dynamics is based on a stochastic differential equation that augments the interparticle interaction forces with frictional forces due to the solvent and random forces. The Langevin thermostat therefore relies on the generation of random, normally distributed numbers. The ddcMD CPU code uses a sampling-and-rejection strategy based on linear congruential generators to create pseudo-random numbers. While efficient on the CPU, this sampling-rejection strategy would be inefficient on a GPU due to branch divergence. Thus, NVIDIA’s cuRAND library, which provides highly optimized kernels for pseudo-random number generation, is used in ddcMD.
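The sketch below shows how the Gaussian kicks for a Langevin-type velocity update can be drawn with the cuRAND device API; the update formula and all names are illustrative assumptions rather than ddcMD’s actual integrator.

```cuda
#include <curand_kernel.h>

// One RNG state per particle, initialized once at the start of the run.
__global__ void initRng(curandState* states, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, /*subsequence=*/i, /*offset=*/0, &states[i]);
}

__global__ void langevinKick(double3* vel, const double* invMass,
                             curandState* states, int n,
                             double gamma, double kT, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState s = states[i];   // local copy avoids repeated global traffic

    // Standard deviation of the random kick for a simple Langevin-type update,
    // sqrt(2 * gamma * kT * dt / m); illustrative form only.
    double sigma = sqrt(2.0 * gamma * kT * invMass[i] * dt);

    vel[i].x += -gamma * vel[i].x * dt + sigma * curand_normal_double(&s);
    vel[i].y += -gamma * vel[i].y * dt + sigma * curand_normal_double(&s);
    vel[i].z += -gamma * vel[i].z * dt + sigma * curand_normal_double(&s);

    states[i] = s;               // write the advanced RNG state back
}
```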
7. Berendsen pressure coupling kernel
The pressure in ddcMD is computed at the molecular level; that is, only the inter-molecular virial terms are used for the pressure calculation. ddcMD supports both isotropic and semi-isotropic pressure coupling. Semi-isotropic pressure coupling is used in simulations involving membranes, while isotropic coupling is used in non-membrane simulations.
Given a molecular pressure tensor Pab, where a ∈ {x, y, z} and b ∈ {x, y, z}, a diagonal tensor Δ is defined with diagonal elements for the semi-isotropic pressure coupling,
A diagonal scaling tensor λ is defined with elements
where β is a compressibility, δ is the time step, and τ is a relaxation time scale. The finite time step evaluation of the box tensor h from t to t + δ is given by
The particle position r evolves as
The velocity at the half time step v(t + δ/2) is given by
where the constraint force is the solution of the constraint solver that ensures r(t + δ) satisfies the constraints.
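For orientation, a minimal sketch of a standard Berendsen-type semi-isotropic coupling and velocity half-step, written with the symbols used above, is given below; the exact ddcMD expressions (e.g., the molecular-pressure form and the constraint-force handling) may differ.

```latex
% Sketch of standard Berendsen-type semi-isotropic coupling; \beta is the
% compressibility, \delta the time step, \tau the relaxation time, h the box
% tensor, and P_0 the reference pressure.
\[
  \Delta_{xx} = \Delta_{yy} = \tfrac{1}{2}\!\left(P_{xx}+P_{yy}\right) - P_0,
  \qquad
  \Delta_{zz} = P_{zz} - P_0 ,
\]
\[
  \lambda_{aa} = 1 + \frac{\beta\,\delta}{3\,\tau}\,\Delta_{aa},
  \qquad a \in \{x,y,z\},
\]
\[
  h(t+\delta) = \lambda\, h(t), \qquad
  r(t+\delta) = \lambda \left[\, r(t) + \delta\, v(t+\delta/2) \right],
\]
\[
  v(t+\delta/2) = v(t) + \frac{\delta}{2m}\left[\, F(t) + F_{\mathrm{constraint}} \right].
\]
```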
B. Benchmark calculation setup
1. Test case setup
A series of test cases has been prepared using the insane Python tool for the benchmark study.43 The test cases are designed to exercise different aspects of the code. They include 18 water boxes ranging in size from 23 063 to 7 837 793 particles, 20 POPC lipid bilayers with system sizes ranging from 28 312 to 1 477 567 particles, and a protein–lipid system with a RAS protein44 embedded in the membrane, containing 136 104 particles (Fig. 2). The water boxes contain two different water residues, named “W” and “WF” (regular and antifreeze, respectively) in the Martini force field. Besides the water molecules, the lipid bilayers contain POPC molecules. The protein–lipid system consists of an asymmetric 8-component, 3081-lipid bilayer, a RAS protein of 186 amino acids, a GTP substrate with a coordinated magnesium ion, 1498 sodium ions, 1100 chloride ions, and 99 474 waters. Note that virtual sites have not yet been implemented in this GPU version of ddcMD. Therefore, the 8-component lipid system uses a recent (unpublished) modified version of cholesterol that is based on the original cholesterol model9 without the virtual sites but updates the cholesterol shape to be similar to the newer virtual-site model described by Melo and co-authors.45
All MD simulations are run using Martini with the new-rf parameter set46 at 310 K and 1 bar with semi-isotropic pressure coupling. A time step of 20 fs is used in all MD simulations except for the fidelity study. The non-bonded interaction is calculated using a Lennard-Jones potential with a cutoff radius of 11 Å. The electrostatic interaction is treated with the reaction field method.47 The relative dielectric constant is 15 within the 11 Å cutoff and infinite beyond it. The velocity Verlet algorithm is employed to integrate the Newtonian equations of motion.39 The RATTLE method is used in ddcMD for molecules with constraints,40 whereas GROMACS uses the LINCS algorithm for the constraint calculations.48

The Langevin dynamics thermostat37 and the Berendsen barostat38 are used in the ddcMD simulations, while the velocity-rescaling thermostat49 and the Parrinello–Rahman barostat50 are used in the GROMACS simulations. The Langevin thermostat in the ddcMD simulations and the velocity-rescaling thermostat in the GROMACS simulations use the same friction coefficient of 1 ps−1. The Berendsen barostat in the ddcMD simulations and the Parrinello–Rahman barostat in the GROMACS simulations use the same compressibility constant of 3.0 × 10−4 bar−1. In principle, the choice of thermostat/barostat should not matter for the calculation of equilibrium properties, though transport properties will be affected. We have compared ddcMD and GROMACS simulations for equilibrium properties and find no statistical differences. Historically, our ddcMD simulations have used the Langevin thermostat for canonical ensemble simulations, though other thermostats are available. We choose Langevin over velocity-rescaling methods for the following reasons: (1) the canonical ensemble is preserved; (2) the velocity distribution has the correct Gaussian form; and (3) parallelization does not require a global barrier and reduction. For the barostat, we use Berendsen’s equation of motion for the evolution of the volume/shape of the computational box; however, our implementation is likely different. A further difference is that we use the molecular pressure instead of the particle pressure, which allows us to avoid correcting the pressure for the constraint forces.

If a simulation system includes a lipid bilayer, position restraints are applied to the POPC lipids of the outer leaflet to limit large-scale bilayer undulations; these are weak (2 kJ mol−1 nm−2) harmonic potentials applied along the Z-direction to each lipid PO4 bead. The performance of ddcMD is measured by the elapsed time of each MD loop per particle per step, while GROMACS reports performance in ns/day. In this paper, the ddcMD elapsed times are converted to ns/day for consistency.
2. Machines used for benchmark
Several supercomputers, including Lassen, Sierra, Pascal, and Quartz at Lawrence Livermore National Laboratory and Summit at Oak Ridge National Laboratory, have been used for the benchmark study. Sierra and the smaller system, Lassen, utilize the same architecture. Each computing node on Lassen and Sierra contains four NVIDIA Tesla® V100 GPUs and two IBM® POWER9™ CPUs with a total of 44 cores. Summit uses the same CPUs and GPUs as Lassen and Sierra but with a different configuration; each computing node contains six NVIDIA Tesla V100 GPUs and two IBM POWER9 CPUs with a total of 42 cores. On Pascal, each computing node contains two NVIDIA Tesla P100 GPUs and two Intel® Xeon™ E5-2695 v4 CPUs with a total of 36 cores. Quartz is a CPU-only supercomputer, which is used for the benchmark study of the ddcMD CPU code for comparison; each of its computing nodes contains two Intel Xeon E5-2695 v4 CPUs. In this work, we do not compare the POWER9 and x86 CPU benchmarking results. The purpose is to compare the GPU implementation, which minimizes CPU utilization, against the CPU implementation and across different GPU architectures.
3. Profile tools
In developing and benchmarking this code, the NVIDIA profiling tool for GPU codes, nvprof, is heavily utilized. Specifically, the latencies of individual GPU kernels are queried, and hardware counter metrics, such as floating-point instruction counts, are accessed. Another NVIDIA tool for visualizing GPU activity, nvvp, is useful when optimizing GPU utilization. The GNU Compiler Collection (GCC) profiling tool, gprof, is used to profile the ddcMD CPU code. In addition, ddcMD outputs a comprehensive performance summary at the end of each simulation.
4. MD ecosystem and Python tools
ddcMD is designed to be a standalone program. Nevertheless, several tools have been developed to facilitate its use. ddcMD has been integrated into the in-house-developed MuMMI workflow manager, which allows for multiscale simulations of complex biological systems.24 A modified version of MDAnalysis51,52 can parse ddcMD trajectory files for analysis and is available on GitHub (https://github.com/XiaohuaZhangLLNL/mdanalysis). There is also an in-house-developed Python tool that converts GROMACS inputs to ddcMD inputs, available on GitHub (https://github.com/LLNL/ddcMDconverter). ddcMD is open source and available on GitHub (https://github.com/LLNL/ddcMD).
III. RESULTS AND DISCUSSION
A. Code fidelity study
GROMACS is the community standard for Martini simulations, so it is used for a comparative code fidelity study. The energy calculation of a single snapshot has been performed with both ddcMD and GROMACS (Table S1). Different energy terms, including bond, angle, Lennard-Jones, Coulomb, total potential energy, and total energy, are compared. The absolute errors for these energy terms are less than 3 × 10−5. The absolute errors for the pressure and pressure variant terms are less than 3 × 10−3. The remaining discrepancy between ddcMD and GROMACS is negligible and is due to differences in the physical constants used in the two codes; ddcMD consistently uses the 2014 CODATA values for the fundamental physical constants.53 Different terms have also been calculated for 10-ns trajectories obtained from ddcMD and GROMACS. The arithmetic means of different physical terms, including kinetic energy, potential energy, total energy, temperature, volume, and the dimensions of the simulation system, show no significant difference between ddcMD and GROMACS (Table S2). The differences between the mean values of these terms are all less than 1%. Furthermore, the standard deviation values reveal similar distributions of these terms for ddcMD and GROMACS.
The energy conservation of the MD simulations is assessed by simulating systems without thermostat coupling (NVE ensemble). The simulations are performed using different time steps, and the root mean square error (RMSE) of the total energy is calculated for each simulation. At a time step of 10−5 fs, the RMSE reaches the precision limit of the output (10−12). The logarithm of the RMSE values is linearly correlated with the logarithm of the time steps, with a correlation coefficient R2 of 0.9997, indicating excellent energy conservation during the MD simulations (Fig. 3).
B. GPU speed-up over CPU
The CPU and GPU versions of ddcMD are profiled on the Intel Xeon E5-2695 CPU and the NVIDIA V100 GPU, respectively. Three simulations with similar system sizes, a water box of 114 443 particles, a lipid bilayer of 112 205 particles, and a protein–lipid system of 136 104 particles, are selected for the profiling study (Fig. 4). Measured by the elapsed time for each MD step, on average, the GPU code is faster than the CPU code by factors of 276, 247, and 278 for the water box, lipid bilayer, and protein–lipid system, respectively. Note that the CPU performance is obtained from runs using a single CPU core. We divide the computationally intensive kernels into six components: neighbor table construction, non-bonded calculation, bonded calculation, integration, constraint calculation, and miscellaneous kernels. The non-bonded calculation includes the Lennard-Jones potential and reaction field electrostatics. The bonded calculation includes the bond, angle, dihedral, and improper terms. The integration includes the Verlet integration, the Langevin thermostat, and the Berendsen barostat. The constraint kernel is separated from the integration since it only occurs in the protein–lipid simulation.
Among the six computing components, the neighbor table construction achieves the most significant speed-up. It is 880, 1055, and 1052 times faster on the GPU than on the CPU for the water box, the lipid bilayer, and the protein–lipid system, respectively. The difference in speed-up of the neighbor table construction among the different systems likely originates from differences in the density of the simulation systems; the density of the lipid bilayer and the protein–lipid system is higher than that of the water box. The current GPU version of ddcMD is a pure double-precision code. However, high precision is not necessary for the neighbor table construction, and the use of lower precision there would not affect the accuracy of the simulations. The NVIDIA V100 GPU supports both single and half precision; replacing double precision with single or even half precision could increase the throughput of the calculations and speed up these kernels further.
The non-bonded calculation, bonded calculation, and integration have similar speed-ups ranging from 149 to 279 times, and the speed-ups for the different systems are consistent across these computing components. The speed-up of the constraint calculation is only 32 times, much less than those of the other computing components. The constraint GPU kernel performs relatively poorly because of the RATTLE constraint algorithm used in ddcMD. The RATTLE algorithm employs an iterative approach to compute the constrained coordinates, which exhibits a lower level of inherent parallelism, and the number of iterations needed for convergence varies, which is a disadvantage when deploying the calculations to the GPU. A non-iterative constraint algorithm is needed to take full advantage of the GPU hardware.
The percentages of different computing components on the GPU are calculated for the water box, the lipid bilayer, and the protein–lipid system (Fig. 5). As expected, the non-bonded calculation, including the Lennard-Jones potential and the reaction field electrostatics, takes the majority of the computing time. For the water box and the lipid bilayer, the non-bonded calculation takes two-thirds of the computing time. For the protein–lipid system, it takes almost half of the computing time. For all-atom simulations this is typically a much larger fraction. As shown in Sec. II A 2, the neighbor table construction has been accelerated by the GPU significantly. Thus, for all three different simulations, the neighbor table construction takes less than 10% of the computing time.
The bonded calculation takes a relatively small percentage of the computing time. Note that the water box simulation for the Martini force field has no bonded terms. However, the bonded calculation still takes about 3.6% of the computing time, which comes from the overhead of the bonded GPU kernel calls and the bonded calculation preparation in the code. The code could be optimized by skipping these GPU kernel calls when no bonded calculations are needed. However, since simulations without any bonded calculations are fairly rare, we choose to keep one streamlined code path instead of adding additional code to optimize for these specific cases.
Integration in the water box and lipid bilayer simulations takes about one-quarter of the computing time. Although the integration for the protein–lipid simulation accounts for only 18.3% of the computing time, almost the same amount of computing time is spent on the constraint calculations as on the integration. In the protein–lipid system, 29.4% of the lipids are cholesterols, which require constraint calculations, and the protein also has an extensive constraint network. To make matters worse, the speed-up for the constraint kernels is lower than for the rest of the computing components, as the constraint algorithm contains less inherent parallelism.
C. Scalability study
Due to bottlenecks stemming from CPU–GPU and GPU–GPU communications, running an MD simulation across GPUs or across nodes is often inefficient. Therefore, the GPU optimizations for ddcMD largely target single GPU simulations. Due to the resultant high throughput of single GPU ddcMD jobs, high levels of aggregated compute efficiency are achieved by running N-ddcMD jobs on N-GPU computing nodes. For the scalability study, only weak scaling of the code is investigated.
A series of water boxes with various numbers of particles is simulated and benchmarked on both NVIDIA V100 and P100 GPUs [Fig. 6(a)], the latter of which represents the previous generation of NVIDIA GPUs. Specifically, the benchmark calculations are performed on two different node types: a V100 GPU paired with an IBM POWER9 CPU and a P100 GPU paired with an Intel Xeon E5-2695 v4 CPU. In the Martini force field, only the non-bonded interactions are calculated for the water boxes. When the water box is small, with 23 063 particles, code performance on the V100 and P100 is similar. Through code profiling, we find that the calculations do not saturate the more powerful V100 for this small system, leaving the GPU idle for large chunks of time and resulting in poor performance. However, as the size of the water box increases, the performance on the V100 rapidly surpasses that on the P100. Once the number of particles exceeds 100 000, ddcMD achieves near-linear scaling and runs approximately 1.5 times faster on the V100. ddcMD’s performance gains on the V100 (Table S3) can be largely attributed to the V100’s improved peak TFLOPS, memory bandwidth, and cache bandwidth. Among the most notable of the V100’s hardware improvements is the reduced latency for accessing the L1 cache, which forms the basis of GPU shared memory. The ddcMD code uses shared memory extensively in the compute kernels, making the code especially amenable to this improved L1 cache.42 In addition, V100 GPUs contain more optimizations for the thread divergence, intra-block communication, and atomic operations utilized by ddcMD. Furthermore, both the V100 and P100 GPUs include 16 GB of HBM2 memory, meaning that the maximum number of particles that can be simulated is the same on both: about 8 × 106 particles per GPU.
Compared to the water box simulations, the POPC lipid bilayer simulations include bond, angle, and position restraint terms in addition to the non-bonded interaction. The performance curve for the lipid bilayer follows a trend similar to that of the water box [Fig. 6(b)]. When the number of particles exceeds 100 000, the V100 delivers about 1.7 times the performance of the P100, a slightly larger gain than for the water box simulations. For two simulations of similar size, the lipid bilayer with 249 281 particles and the water box with 200 933 particles, profiling shows differences in the two most time-consuming CUDA kernels, the non-bonded energy evaluation kernel and the neighbor table construction kernel. For the non-bonded energy evaluation kernel, the lipid bilayer simulations are 2.3 times faster on the V100 than on the P100, while the water box simulations are 2.0 times faster. For the neighbor table construction kernel, the speed-ups on the V100 over the P100 are 1.7 and 1.6 times for the lipid bilayer and water box simulations, respectively.
D. Comparison between ddcMD and GROMACS
The protein–lipid simulation is performed on a Summit node using both ddcMD and GROMACS version 2020.1.20 For ddcMD, the current GPU code is designed to always use one CPU core and one GPU per simulation. GROMACS distributes the calculations between the CPU and the GPU with an automated CPU–GPU load-balancing scheme; for optimal performance, a GROMACS simulation should be run with 1 GPU and 64 CPU threads on a node. A comparison of the timelines of the GPU kernels for ddcMD and GROMACS is shown in Fig. 7. Both ddcMD and GROMACS use one CUDA stream by default, so there is only one line for each program in this figure. Since ddcMD has ported the entire MD loop to the GPU, it utilizes 49 different CUDA kernels during the simulation. GROMACS has only moved the non-bonded calculations, including Lennard-Jones and the reaction field electrostatics, to the GPU and utilizes only seven CUDA kernels; other types of calculations, including the bonded kernels, integration, and constraints, remain on the CPU side. As shown in Fig. 7, the timeline for ddcMD has no significant gaps between CUDA kernels, while the GROMACS timeline has large gaps between the kernels. The gaps in the GROMACS timeline are mainly due to the synchronization between the CPU and the GPU when a CUDA kernel is waiting for the CPU calculations. Because it distributes the calculation across both the CPU and the GPU, GROMACS cannot fully exploit the GPU resources.
A Summit node has two sockets, each having one IBM POWER9 CPU and three NVIDIA V100 GPUs. Each CPU has 21 cores available for computation, and each core has four hardware threads. Thus, a total of 84 threads are available on one CPU. To investigate the throughput of simulations on a Summit node, one to six simulations are run on a single node using ddcMD and GROMACS (Fig. 8). If a job is running one simulation on one node, it is called a one-simulation job. If a job is running two simulations on one node, it is called a two-simulation job, and so on. All ddcMD simulations use one CPU core and one GPU. For the one-simulation job, the performance of ddcMD is measured at 1.04 µs/day. When more simulations are added to a node, the accumulated performance increases almost linearly. Accordingly, the accumulated performance is 6.19 µs/day for a six-simulation job.
Since many calculations in GROMACS are performed on the CPU, multithreading is required to achieve the best performance. Various numbers of threads and different combinations of options for the GROMACS simulations are explored, and only the best performance is reported in Fig. 8. In the end, 64 threads are used for the one-simulation job; 32 threads are used for each simulation of the two-simulation job; 40 threads are used for each simulation of the three-simulation and four-simulation jobs; and 28 threads are used for each simulation of the five-simulation and six-simulation jobs. The performance of GROMACS is 1.01 µs/day for the one-simulation job. The accumulated performance increases slowly as more simulations are added to a node. The odd-numbered simulation jobs (i.e., the three-simulation and five-simulation jobs) have the worse throughput: the accumulated performance of the three-simulation job is worse than that of the two-simulation job, and the five-simulation job is only slightly better than the four-simulation job. Since each Summit node has two sockets, the CPU cores in a node cannot be distributed evenly among an odd number of simulations, which puts the odd-numbered jobs at a disadvantage in accumulated performance. The accumulated performance plateaus at the six-simulation job with an accumulated performance of 3.25 µs/day.
Differences in design between ddcMD and GROMACS help explain why ddcMD outperforms GROMACS. Both have similar throughput for a one-simulation job. However, as more simulations are added to a node, the advantage of ddcMD’s 1 GPU/1 CPU core design becomes significant: for a six-simulation job, ddcMD has 1.9 times the throughput of GROMACS. In addition, ddcMD uses double-precision arithmetic, while GROMACS only supports single precision for its GPU version. Since communication between the GPU and the CPU is a bottleneck for MD simulations, ddcMD alleviates this traffic by moving the whole MD loop to the GPU; data are transferred between the CPU and the GPU only when input or output occurs during the simulation. GROMACS distributes the calculations across both the CPU and the GPU, which requires copying and synchronizing data between the CPU and the GPU at every time step. This frequent movement of data back and forth degrades performance when more GPUs of a node are occupied by separate simulations. Additionally, ddcMD uses only one CPU thread per job, leaving the other CPU resources available for other processes. For the six-simulation job, ddcMD uses 6 threads out of 168 and 6 GPUs with an accumulated performance of 6.19 µs/day, while GROMACS uses all 168 threads and 6 GPUs with an accumulated performance of 3.25 µs/day.
IV. CONCLUSION
The GPU version of the ddcMD program for the Martini force field has been implemented using NVIDIA’s CUDA framework. All computationally intensive calculations have been off-loaded to the GPU. A templatized generic pair-processing infrastructure has been developed so that it can be used to implement a diverse set of potential forms on GPUs. Particularly, the non-bonded calculation and the neighbor table construction, which rely on pair processing, account for most of the computing time. The infrastructure is carefully optimized to achieve better performance by using various techniques, including contiguous memory access, better latency hiding, and shuffle-sync based reductions.
The code fidelity study has shown high similarity between ddcMD and GROMACS in terms of the single snapshot calculation and MD trajectory. ddcMD maintains excellent energy conservation in the MD simulations. The speed-up of the MD step in ddcMD using one GPU vs one CPU is up to 278-fold, with the neighbor table construction achieving the best speed-up. Non-bonded calculation and associated neighbor table construction account for the majority of computing time. The code shows linear scaling when the number of particles in the simulation systems is greater than 100 000 with a maximum number of particles of 8 × 106 for both NVIDIA’s V100 and P100 GPUs. Compared to GROMACS, ddcMD has many more kernels running on the GPU and leaves no computationally expensive tasks on the CPU. The timeline of the GPU kernels for ddcMD reveals the high occupancy of the GPU, giving ddcMD (at double precision) a performance of 1.04 µs/day for a system of 136 104 particles on one GPU, which is faster than GROMACS (at only single precision). More importantly, ddcMD’s throughput on a single Summit node is almost twofold greater than GROMACS.
Although ddcMD already delivers high performance on the GPU, there is room for improvement. One optimization would be to reduce the precision of the neighbor table construction to single or half precision to speed up that calculation without sacrificing accuracy; currently, ddcMD uses double precision throughout the code. Another option is to use multiple CUDA streams to overlap kernel execution, as the code currently uses a single CUDA stream; this change may further improve throughput on the GPU. Additionally, a non-iterative constraint approach could be developed to better exploit the GPU architecture, as the current constraint kernel is implemented with an iterative algorithm that does not take full advantage of the parallelism of the GPU. Other considerations are needed for future code modifications. The current ddcMD is designed to run on one GPU, which imposes an upper limit of ∼8 × 106 particles per simulation due to the memory available on a GPU. For simulation systems larger than 8 × 106 particles, the code needs to be re-designed to run efficiently across GPUs or even across nodes. As ddcMD is strong-scaled across GPUs and the workloads of the GPU kernels consequently decrease, kernel launch time (relative to kernel run time) will become more pronounced. This will warrant exploration of the CUDA Graphs and Cooperative Groups APIs, which allow for the optimization and management of kernel launches.
SUPPLEMENTARY MATERIAL
See the supplementary material for the tables of comparison of a single snapshot between ddcMD and GROMACS, comparison of 10 ns trajectories from ddcMD and GROMACS, and comparison of a ddcMD simulation on NVIDIA P100 and V100 hardware.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding authors upon reasonable request.
ACKNOWLEDGMENTS
This work was supported, in part, by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health (NIH). We thank Livermore Computing (LC), Livermore Institutional Grand Challenge, Oak Ridge Leadership Computing Facility, and Advanced Scientific Computing Research (ASCR) Leadership Computing Challenge (ALCC) program for computing time. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344 (Release No. LLNL-JRNL-810343).