Integration schemes are implemented with a plane-wave basis in the context of real-time time-dependent density functional theory. Crank-Nicolson methods and three classes of explicit integration schemes are explored and assessed in terms of their accuracy and stability properties. Within the framework of plane-wave density functional theory, a graphene monolayer system is used to investigate the error, stability, and serial computational cost of these methods. The results indicate that Adams-Bashforth and Adams-Bashforth-Moulton methods of orders 4 and 5 outperform commonly used methods, including Crank-Nicolson and Runge-Kutta methods, in simulations where a relatively low error is desired. Parallel runtime scaling of the most competitive serial methods is presented, further demonstrating that the Adams-Bashforth and Adams-Bashforth-Moulton methods are efficient methods for propagating the time-dependent Kohn-Sham equations. Our integration schemes are implemented as an extension to the Quantum ESPRESSO code.

## I. INTRODUCTION

The ability to predict the dynamics of electrons in complex many-body systems, from small molecules to extended solids, opens the door for an array of important engineering applications and investigations into fundamental physics questions. One such approach to describe electron dynamics is time-dependent density functional theory (TDDFT). TDDFT has been used in a range of application areas, including spectroscopy,^{1,2} electron transport,^{3} electronic stopping,^{4} and predicting ground-state properties of systems.^{5–7} Investigations using TDDFT generally fall under two categories: linear-response TDDFT, in which the time-dependent fields applied are small enough that the density responds linearly, and non-linear response TDDFT, in which the applied fields are strong enough to cause charge emission and other non-linear effects.^{2} Linear-response TDDFT is often formulated in the frequency-domain and has been successful in describing electron dynamics in a variety of settings.^{7,8} Non-linear response TDDFT can be explored directly in the time-domain through what is called real-time time-dependent density functional theory (RT-TDDFT).^{9,10} The RT-TDDFT approach is the primary focus of this paper, and we specifically look at effective ways of propagating the equations of motion for the system, known as the time-dependent Kohn-Sham (TDKS) equations [Eq. (2)], in time.

In principle, the RT-TDDFT approach allows us to obtain the real-time dynamics of electrons in complicated many-body environments.^{9} However, many questions remain to be answered in regard to both basic theory and the applicability to systems in which electron correlation effects are difficult to describe via standard methods.^{7,11} Nonetheless, many researchers currently use or could benefit from RT-TDDFT simulations. The purpose of this work is therefore to explore numerical methods that can be used to propagate the TDKS equations and to characterize these methods based on their accuracy, stability, and computational cost.

The numerical methods we present here have been implemented in the setting of plane-wave density functional theory.^{12–15} Generally, plane-wave DFT is the preferred approach for solids because all areas of the unit cell are represented with equal accuracy, and the accuracy can be systematically improved by increasing the plane-wave cutoff energy.^{13,15} We employ plane-waves here primarily because they are the most commonly employed basis set for periodic boundary conditions, facilitating the coupling of many other techniques that have been developed and implemented in commonly employed plane-wave codes. The code we use, which we refer to as rt-tddft, is adapted from a code originally developed by Qian *et al.*,^{16} with a complete redesign using modern object-oriented Fortran, object-oriented design patterns, and is built using test-driven development practices to help ensure correct execution upon compilation. rt-tddft exists as an add-on package to the popular ground-state DFT code Quantum ESPRESSO.^{17,18} We have ensured that rt-tddft works with multiple **k**-points and have implemented a variety of integration schemes which are presented in detail below. Although we focus exclusively on plane-wave methods, the basic theory and numerical analysis presented in Sec. III are applicable to other spatial discretizations.

Within the framework of RT-TDDFT, two basic approaches exist for propagating the TDKS equations: propagator approaches and ordinary differential equation (ODE) methods, detailed in Sec. III. While propagator methods have been favored in some contexts,^{19} ODE methods have also been shown to work well for RT-TDDFT.^{3,10} In this paper, we focus exclusively on ODE methods. Certain classes of ODE methods discussed in Secs. III and IV have high orders of accuracy and very low computational cost per time step. While these methods come with a restriction on the time step for stability purposes, we find that they can be very efficient when a relatively high accuracy in the solution is desired. Although we do not explore propagator methods in this paper, we believe that it could be useful to implement such methods within the Quantum ESPRESSO framework. A recent study^{20} using the Octopus code studied the error and computational cost of a mix of propagator and ODE methods. Some of the methods studied in that context overlap with the methods studied in this work and exhibit similar relative performance to the methods studied here. However, we emphasize that the results of that paper will not necessarily transfer over to the present study because the Octopus code uses a real space grid to represent the wave functions. Real-space approaches use very different computational methods from plane-wave codes such as Quantum Espresso. Not only do plane-wave codes make extensive use of Fourier transforms between the real and reciprocal space to apply parts of the Hamiltonian to the wave functions in each representation but are also generally associated with Hamiltonians that are larger than other methods.

To assess the stability, error, and computational cost of these methods, we use a simple example system consisting of an isolated graphene monolayer surrounded by vacuum space. The system is excited with an electric field in the direction perpendicular to the layer at time *t* = 0. To assess the error and stability of each method, we compute the dipole moment of the monolayer in the direction of the excitation at different time steps. The computational cost is assessed by varying the size of the monolayer, while keeping the **k**-point density fixed (see Sec. IV for more details).

This paper is organized as follows: in Sec. II, we provide a theoretical background for TDDFT and the TDKS equations, as well as an overview of the two approaches to RT-TDDFT: propagator and ODE methods. In Sec. III, we present a variety of ODE methods that can be used for the TDKS equations, and we provide an explanation of the accuracy, stability, and computational requirements for each method. In Sec. IV, we present numerical results for these methods, showing the error and computational cost associated with each. Finally, we look at parallelization of these methods in Sec. V.

## II. THEORETICAL BACKGROUND

The Hohenberg-Kohn theorem^{21} (1964) and Kohn-Sham scheme^{22} (1965) provide a framework and practical approach to computing ground-state properties of many-body quantum mechanical systems. Although the ground-state theory was developed in the 1960s, a rigorous time-dependent formulation did not appear until 1984 in the work of Runge and Gross.^{23} Their work extends the Hohenberg-Kohn theorem to time-dependent densities and provides a time-dependent version of the Kohn-Sham equations, a set of single-particle Schrödinger-like equations [Eq. (1)]. By solving these equations, we can obtain the time-dependent density, and in principle, any observable quantities that can be expressed in terms of the density.^{7}

### A. Time-dependent Kohn-Sham equations

The Kohn-Sham approach to time-dependent systems requires the solution of the following single-particle equations:

Here the Hamiltonian *Ĥ* is a functional of the time-dependent density. In the real-space representation, these equations become

where *v*_{s}[*n*](*t*) is the single-particle potential

Here, *v*_{ext} is the external potential due to the ions, the second term on the right-hand side is the Hartree (electron-electron) potential, and *v*_{xc} is the time-dependent exchange-correlation potential, which involves electron-electron interactions not accounted for in the other terms of the Hamiltonian. In this paper, we work strictly with fixed ionic positions so that the external potential due to the ions is effectively fixed in time. Therefore, the only time-dependent contribution to the external potential comes from terms other than the ionic motion, such as time-dependent electric fields.

The density is constructed via a summation of the modulus square of the single particle wave functions

Strictly speaking, the exchange-correlation potential *v*_{xc} is a functional not only of the density but also of the initial many-body wave function.^{24} In addition, the exchange-correlation potential will in general have memory so that the density at times *τ* less than the current time *t* ≥ *τ* may be needed.^{11,23} For our purposes, we work under the *adiabatic approximation* so that the time-dependence of *v*_{xc} arises strictly from the instantaneous charge density.

We may view Eq. (2) as a set of non-linear Partial Differential Equations (PDEs) that can be integrated in time. Generally, we choose a spatial discretization scheme (e.g., plane-waves), effectively transforming Eq. (2) from a set of time-dependent PDEs to a set of coupled time-dependent ODEs. In doing so, we may use standard numerical integration schemes to propagate this system of ODEs in time. We refer to this approach as the ODE approach to the TDKS equations. This ODE approach can be contrasted to propagator approaches, which view the problem of integrating Eq. (2) in a different light.

### B. Propagator approach to TDDFT

In the propagator approach, we seek an operator that evolves the state *ψ* at time *t*_{0} to time *t*

Formally, we can find an expression for the propagator in terms of a Dyson series^{19,25}

Although the propagator provides an exact expression for evolving the state *ψ* from *t*_{0} to *t*, in practice, we must find a suitable approximation. Examples of approximate propagators include the exponential midpoint rule, the enforced time-reversal symmetry (ETRS) method, and Magnus expansion methods.^{7,19,20} In general, these methods require computing the exponential of a matrix *A* times some vector *v*, exp(*A*)*v*. Various methods exist for computing this product, and several methods have been implemented in the context of TDDFT.^{19}

Although propagator methods have been used in the context of real-time TDDFT, we do not explore these methods here. In general, high-order propagators require the computation of commutators of the Hamiltonian at different times, and the number of commutators required grows with the order of accuracy of the method.^{19} The increase in the number of commutators required results in an increase in computational cost at each time step to achieve higher accuracy. Recently, the use of commutator-free propagator methods has been studied in the context of real-time TDDFT,^{20} showing that the fourth order Magnus expansion method can have a lower cost when using the commutator-free implementation. Such methods still require a basic building block for the computation of matrix exponentials, which could become expensive for large matrix sizes such as those associated with plane-wave basis sets. As we will see below, certain classes of explicit methods can achieve higher orders of accuracy without increasing the computational cost per time step.

### C. ODE approach to TDDFT

While propagator approaches rely on approximate expressions for the propagator in Eq. (6), ODE methods focus on approximate solutions to the TDKS equations of Eq. (1). When discretizing Eq. (1), we may view the equations as a system of ODEs. We can then immediately apply standard methods such as Runge-Kutta (RK) and Crank-Nicolson, which have well-known accuracy and stability properties.

In a more general context, ODE methods require solving an equation of the type

where *y* is an *n*-dimensional vector, *t* is a scalar, and *f* is a linear or non-linear function. We consider discretizing *t* into equally spaced intervals Δ*t* and search for approximations to Eq. (7) which allow us to find the solution *y*(*n*Δ*t*), *n* = 1, 2, … at these discrete points.

ODE methods generally fall under two categories: implicit and explicit (see Fig. 1). In implicit schemes, the solution at the next time step *y*(*t* + Δ*t*) requires solving an equation involving both *y*(*t* + Δ*t*) and the current solution *y*(*t*), since *y*(*t* + Δ*t*) cannot be isolated in a straight-forward way. For non-linear functions *f*, this formally requires solving a non-linear algebraic equation at each time step, which in general requires an iterative solution process.^{26} Iterative solutions are usually very computationally demanding, and generally, one linearizes the equations at each time step so that only the solution of a linear system of equations is required.

Since the TDKS equations are non-linear, we restrict our investigation to linearized implicit methods that require the solution of a linear system of equations at each step. Usually the linear solve is the most computationally demanding component of the scheme. However, one advantage of implicit schemes is that they sometimes come with exceptional stability properties so that large time steps can be taken. Importantly, however, many implicit methods have a finite stability window and therefore limit the time step that can be taken. In these cases, explicit methods can be more desirable.

Unlike implicit methods, explicit methods allow us to isolate the solution *y*(*t* + Δ*t*) in a straight-forward manner and therefore do not require the solution of a linear or non-linear system of equations at each time step. Explicit methods can therefore have much lower computational cost per time step than implicit methods, but they generally come with a restriction on time step for stability.

The time step restrictions on ODE methods can be understood through a linear stability analysis. In the generic setting of Eq. (7), we consider only a linearized version of *f* so that

Solutions to this simplified “model” problem will remain stable only if the eigenvalues {*λ*_{i}} of *A* have zero or negative real parts. Generally, the time step must be chosen so that all values *λ*_{i}Δ*t* fall within the “stability window” of a given method.

A useful tool for visualizing stability is to plot *λ*Δ*t*. In general, the eigenvalues of *A* will have both real and imaginary parts, and it is possible to generate a “stability diagram” that shows the contour *λ*Δ*t* = 1 on the complex plane. Shading indicates stable regions inside the contour. Figure 2 provides an example, illustrating the stability of explicit Euler^{26} and Crank-Nicolson methods.

Here, we are interested in the stability of Eq. (1). For simplicity, we may consider the ground-state Hamiltonian, *H*_{0}, to perform the stability analysis. Since *H*_{0} is Hermitian by construction, it has an eigenvalue decomposition *H*_{0} = *Q*Λ*Q ^{*}* for unitary matrix

*Q*and diagonal matrix Λ which has all real entries. Using this, we consider a model problem

By defining *φ*(*t*) ≡ *Q ^{*}ψ*(

*t*), we may write

Since Λ has all real entries, *A* = −*i*Λ is purely imaginary. Thus, when considering ODE integration schemes, we must look for schemes which have a stability region that includes the imaginary axis. Provided that an estimate for max_{i}$|$*λ*_{i}$|$ exists, it is then possible to determine a maximum time step Δ*t*. It immediately follows that Explicit Euler will never be stable for the TDKS equations, while Crank-Nicolson remains stable for any choice of Δ*t* (see Fig. 2).

In addition to stability, we are concerned with the accuracy of ODE methods. When we refer to a method as having accuracy order *n*, we mean that the local error is *O*(Δ*t*^{n+1}) so that the error accumulated by propagating the solution *N* = *T*/Δ*t* steps to time *T* is *O*(Δ*t*^{n}). We refer to the latter as global error.

A recent study^{20} discussed symplectic methods for real-time TDDFT. The time-dependent Kohn-Sham equations under the adiabatic approximation form a symplectic Hamiltonian system. It is therefore possible, but not required, to use symplectic integration methods to propagate the time-dependent Kohn-Sham equations, with the primary advantage being that energy conservation is guaranteed for such methods. In the following, we do not restrict ourselves to the study of symplectic methods for two reasons. First, we do not see any problematic deviation of the total electronic energy within the non-symplectic methods presented below for tractable simulation times. Second, symplectic methods are generally implicit and therefore have higher computational cost than non-symplectic methods, which could limit the feasibility of using these methods for long simulation times.

## III. INTEGRATION SCHEMES

Here we introduce a variety of integration schemes for the time-dependent Kohn-Sham equations, Eq. (1). At each time *t*, we propagate a set of single-particle states {*ψ*_{i}(*t*)} and a Hamiltonian constructed with a set of states {*ϕ*_{i}(*t* + *α*Δ*t*)}, where *α* is a real number. To account for the most general case, we do not require the set {*ϕ*_{i}(*t* + *α*Δ*t*)} used to construct the Hamiltonian to be the same as the states {*ψ*_{i}(*t*)}. We can simplify this notation by constructing a matrix Ψ(*t*) which holds the various single-particle states

and a similar expression for matrix Φ(*t* + *α*Δ*t*) corresponding to the single-particle states {*ϕ*_{i}(*t* + *α*Δ*t*)}. A Hamiltonian constructed with these single-particle states will be labeled *H*[Φ(*t* + *α*Δ*t*)].

For each method, we present an estimate for the computational cost. In the most general case, the computational cost of method *M* will be written as

where *k*, *l*, *m*, and *n* are integers corresponding to the number of times each subtask is performed in one time step. From left to right, these subtasks are (LS) linear solve, (UH) update Hamiltonian, (*H*Ψ) matrix-matrix multiplication of *H* with Ψ, and (IO) file I/O requirements.

All methods require at least one update Hamiltonian call at each time step. This update computes the density in Eq. (4) and updates the potential terms in Eq. (3).

All methods also require computation of products *H*Ψ. Note that computing this product is usually much less expensive than the update Hamiltonian subtask. We denote the time for one *H*Ψ product as *T*_{HΨ}.

For implicit methods, a linear solve is required at each time step. We use the conjugate gradient solver presented by Qian,^{16} which requires the specification of a convergence threshold *γ* such that the iterative solver stops when

for all bands *i*, where expressions for matrix *A* and vector *b* are method-dependent. The linear solve is usually the most costly of all operations and also depends critically on the accuracy *γ* chosen. Although *T*_{LS} has a dependence on the choice *γ*, we neglect the dependence below, treating it as a fixed quantity. Also note that the linear solver uses products *H*Ψ as the basic building block of the method, but we consider these as separate from the *H*Ψ calls given by *T*_{HΨ}.

Finally, the file I/O term refers to any save/load operations of the wave functions that must be performed during a single iteration. For the systems we study below, these operations are negligible compared to *T*_{UH} and/or *T*_{LS} (for the 4 atom graphene system described below, we find *ϵ*_{IO} ≲ *T*_{UH}/1000). The negligible cost of the file I/O operations is an important reason for the efficiency of the Adams-Bashforth (AB) and Adams-Bashforth-Moulton (ABM) methods presented below. Moreover, since this quantity is so small, we do not include it in the cost analysis of these methods. Note, however, that for very large systems, the cost of file I/O operations can increase slightly. For example, we find that for graphene simulations of 128 carbon atoms parallelized over 32 processors, file I/O can account for up to 3% of the total simulation cost. Even in these cases, however, the total file I/O cost remains small relative to *T*_{LS} and *T*_{UH} and therefore should not affect the results of the cost analysis of these methods.

### A. Implicit methods

As mentioned in Sec. II C, implicit methods require the solution of a linear system of equations at each time step. Because Adams-Moulton (AM) methods of order 3 and higher come with a restriction on the time step, we focus only on Crank-Nicolson methods in what follows.

#### 1. Crank-Nicolson

Applied to our problem, the Crank-Nicolson method is

Here, we have introduced variables *s*_{1} and *s*_{2} to note that different versions of the Crank-Nicolson method exist. We mention three possibilities below. All three of the methods are stable for arbitrary time steps (cf. Fig. 2), but the order of accuracy *n* depends on the choice of *s*_{1} and *s*_{2}.

##### a. CN1.

The simplest choice for *s*_{1} and *s*_{2} in Eq. (11) is *s*_{1} = *s*_{2} = *t*. This leads to a globally first-order accurate method, i.e., *n* = 1. This method formally has exact charge conservation, as can be seen via the following equation:

where for CN1, *s* = *t*.

The computational cost of CN1 is

##### b. CN2.

An alternative to the CN1 method is to choose *s*_{1} = *s*_{2} = *t* + Δ*t*/2. Like CN1, this method formally has exact charge conservation [Eq. (12) with *s* = *t* + Δ*t*/2]. The advantage of this method is that the choice of *s*_{i} = *t* + Δ*t*/2 leads to a cancellation of the local *O*(Δ*t*^{2}) terms so that the method is globally second-order accurate, i.e., *n* = 2.

In practice, the Hamiltonian is not known at *s* = *t* + Δ*t*/2. Although it is in principle possible to obtain the Hamiltonian at this intermediate time through an iterative process where we guess *H*^{(k)}(*t* + Δ*t*/2) for *k* = 1, 2, … until $H(k+1)(t+\Delta t/2)\u2212H(k)(t+\Delta t/2)<\u03f5$, this is both difficult to implement and unnecessary.

Instead, we estimate Ψ(*t* + Δ*t*/2) via Taylor expansion

which can be derived by expanding Ψ(*t*) and Ψ(*t* − Δ*t*) around Ψ(*t* + Δ*t*/2). With this estimate for Ψ(*t* + Δ*t*/2), we can evaluate *H*(*t* + Δ*t*/2), which has a local error term of *O*(Δ*t*^{2}). Since *H*(*t* + Δ*t*/2) is multiplied by Δ*t* in Eq. (11), the overall local error introduced by this approximation is *O*(Δ*t*^{3}), which combines with the other error terms of the method.

The computational cost of CN2 is slightly higher than that of CN1, requiring one more update Hamiltonian call per time step. Note that the cost of computing an approximate Ψ(*t* + Δ*t*/2) is negligible.

##### c. AM2.

Another choice for Crank-Nicolson uses the Hamiltonian at the present time *s*_{2} = *t* and future time *s*_{1} = *t* + Δ*t*. We call this method AM2 since it is identical to the 2nd-order Adams-Moulton method. Like CN2, this method is also second-order accurate and requires an estimate (or iterative solution process) for *H*(*t* + Δ*t*). Although this method has a computational cost identical to CN2, it comes with the disadvantage that the Propagator is not unitary, as seen from the following equation:

Thus, this method does not conserve charge by default, so there is no advantage of using AM2 over CN2.

#### 2. Ultrasoft pseudopotentials

Pseudopotential methods are ubiquitous in DFT and TDDFT calculations because they allow us to replace the sharp −1/*r* behavior of the potential near the ion centers with a smooth function that mimics the influence of the ions.^{12,13,15} This allows for a coarser spatial discretization and/or lower plane-wave energy cutoff, significantly reducing computational cost without sacrificing accuracy.

Throughout Sec. III, we have implicitly assumed the wave functions to be orthonormal so that $\psi i|\psi j=\delta ij$. This assumption holds when the so-called norm-conserving pseudopotentials are used. In some cases, however, we may wish to relax the assumption of norm conservation to allow for more smoothness of the wave functions in the core region. This allows us to use a lower energy cutoff and can sometimes be advantageous. The work of Qian *et al.*^{16} explored the use of ultrasoft pseudopotentials in the context of the first-order Crank-Nicolson scheme. While we do not present details here, we note that using this approach the CN1 scheme becomes

where *S* is a Hermitian, positive definite matrix built to ensure that $\psi i|S|\psi j=\delta ij$. In our code development process, we have ensured that the CN1 scheme works with ultrasoft pseudopotentials, but we have not done the same with the explicit schemes presented below.

### B. Explicit methods

Unlike implicit methods, explicit methods do not rely on the solution of a linear system of equations at each time step. The advantage of this approach for the TDKS equations [Eq. (1)] is that only matrix-vector multiplications and update Hamiltonian calls are required to advance the solution, which are generally faster than linear solves and may scale well in parallel.

Below, we investigate three distinct classes of explicit methods: Runge-Kutta (RK), which involves function evaluations at times *τ* between *t* and *t* + Δ*t*; Adams-Bashforth (AB), which estimates the solution at *t* + Δ*t* by fitting a polynomial to the solutions at times *t* − *n*Δ*t*, …, *t* − Δ*t*, *t*; and Adams-Bashforth-Moulton (ABM), which involve an AB estimate $\Psi \u0303(t+\Delta t)$ and a corrector step that uses the knowledge of $\Psi \u0303(t+\Delta t)$. The ABM method can be thought of as a clever way to escape the linear solve required in Adams-Moulton (AM) methods.

The explicit methods we study have two potential drawbacks, compared to Crank-Nicolson and some propagator approaches. First, they are always associated with a restriction on time step for stability reasons. These restrictions are explored in more detail below. Second, these methods do not, in general, conserve the norm of the wave function. The implicit methods can also suffer from this deficiency—an example is the AM2 method presented above. The severity of the charge non-conservation of explicit methods is largely method-dependent. Here, we present a way to eliminate the charge non-conservation issue for any explicit method.

At each time step, we propagate the solution $\psi iM(t)$ of method *M* to time *t* + Δ*t* to obtain $\psi iM(t+\Delta t)$. The local error in $\psi iM(t+\Delta t)$ can be written as

where *ψ*_{i}(*t* + Δ*t*) is the exact solution, given the initial condition $\psi i(t)=\psi iM(t)$ at time *t*.

Because method *M* may not conserve charge exactly, we can consider normalizing $\psi iM(t+\Delta t)$ by its 2-norm. Given that $\psi iM(t)$ is properly normalized so that $\Vert \psi iM(t)\Vert 2=1$, it is simple to show that

Therefore, normalizing the wave function at each time step will not reduce the order of the local error. We may also question whether normalizing at each time step affects the stability of explicit methods. In the context of the model problem [Eq. (8)], it is clear that introducing the normalization will have no effect. The error term that arises from normalizing the wave function combines with the other local error terms and does not introduce any additional stability restrictions. In our results, we find that both stability and the order of error for each method are unaffected by normalization and that the error in charge is at machine precision. Without the normalization, a loss (or growth) of charge may eventually lead to highly inaccurate solutions. The normalization completely avoids this issue. We also point out that the normalization introduces a negligible additional cost in the simulations presented below. This cost remains negligible even in the case of large unit cells with distributed plane-wave coefficients. For example, we timed the normalization procedure in the AB5-AM5 method presented below for the 128-atom cell, running on 32 processors. In this case, the normalization time accounts for 0.2% of the total solver time (including Message Passing Interface (MPI) communication), showing that it is to a good approximation negligible.

#### 1. Runge-Kutta

Explicit Runge-Kutta methods provide an estimate for the solution at *t* + Δ*t* via function evaluations between *t* and *t* + Δ*t*. In the context of Eq. (7), RK methods can be written in the following form:

By using more function evaluations, i.e., increasing *s* in Eq. (17), we can increase the order of accuracy of the method. Below we explore different choices for *s* and the coefficients *a*_{ij}, *b*_{i}, and *c*_{i}, applied to the TDKS equations. In general, the choice of constants to obtain a particular order of error is not unique, and it is therefore possible to further optimize the choice of constants for certain types of solutions,^{27} e.g., oscillatory solutions such as the TDKS equations. We present an optimized method for RK5 below.

Previous work by Schleife^{10} showed that RK2 and RK4 methods can be effective at propagating the TDKS equations. For comparison, we use the RK2 and RK4 methods implemented by Schleife; we also look at third- and fifth-order methods in more detail. The details of each method are presented below.

The stability diagrams of the Runge-Kutta methods we implemented are shown in Fig. 3, and the stability restrictions on *λ*Δ*t* on the imaginary axis are presented in Table I.

. | RK2^{a}
. | RK3 . | RK4 . | RK5 . | AB2^{a}
. | AB3 . | AB4 . | AB5 . | AB2-AM2 . | AB2-AM3 . | AB3-AM4 . | AB5-AM5^{b}
. |
---|---|---|---|---|---|---|---|---|---|---|---|---|

ξ_{max} | 0.30 | 1.74 | 2.83 | 3.60 | 0.25 | 0.72 | 0.43 | 0.22 | 1.29 | 1.20 | 1.18 | 0.53 |

Δt | 0.58 | 3.35 | 5.47 | 6.97 | 0.48 | 1.39 | 0.83 | 0.43 | 2.49 | 2.32 | 2.28 | 1.02 |

. | RK2^{a}
. | RK3 . | RK4 . | RK5 . | AB2^{a}
. | AB3 . | AB4 . | AB5 . | AB2-AM2 . | AB2-AM3 . | AB3-AM4 . | AB5-AM5^{b}
. |
---|---|---|---|---|---|---|---|---|---|---|---|---|

ξ_{max} | 0.30 | 1.74 | 2.83 | 3.60 | 0.25 | 0.72 | 0.43 | 0.22 | 1.29 | 1.20 | 1.18 | 0.53 |

Δt | 0.58 | 3.35 | 5.47 | 6.97 | 0.48 | 1.39 | 0.83 | 0.43 | 2.49 | 2.32 | 2.28 | 1.02 |

^{a}

Not technically stable for imaginary *ξ*. We use the value Im[*ξ*] corresponding to Re[*ξ*] = −0.001.

^{b}

Although AB5-AM5 does intersect the imaginary axis, it does so slowly. We therefore use the same procedure as (a) for this method.

##### a. RK2.

The simplest of the Runge-Kutta methods we investigate is second-order accurate. RK2 technically does not have a stability window that overlaps the imaginary axis, so strictly speaking the method should be unstable for propagating the TDKS equations. In practice, this instability can be mild;^{26} however, we recommend against using this method because it could lead to instabilities with long enough run times. Importantly, the RK2 results below use a total simulation time that is short enough for RK2 to remain stable. As Fig. 3 shows, the stability region of RK2 approaches the imaginary axis; we find that for Im[*λ*Δ*t*] ≲ 0.58, which corresponds to −0.001 ≤ Re[*λ*Δ*t*] ≤ 0, the method appears to be stable (cf. Table I)

The computational cost of RK2 is

##### b. RK3.

The third-order accurate Runge-Kutta method (RK3) is given in Eq. (19). This method is stable for imaginary values of *λ*Δ*t*, and the stability region is significantly larger than that of RK2 (cf. Table I)

The computational cost of RK3 is

##### c. RK4.

We use a standard 4th-order Runge-Kutta method, which is also presented by Schleife.^{10} The stability on the imaginary axis is greater than that of RK3

The computational cost for RK4 is

##### d. RK5.

For RK5, we use a method developed by Anastassi and Simos,^{27} designed to minimize phase error of oscillatory PDEs. This method has a greater stability region on the imaginary axis than RK4. Note, however, that to obtain a globally 5th-order accurate RK method, we must use at least 6 function evaluations at each time step.^{28} This pattern of requiring *p* > *n* function evaluations for an order-*n* method persists for all orders *n* ≥ 5.^{29} It is therefore possible that to achieve a given level of error, higher order RK methods may be more computationally expensive than lower-order methods. Nonetheless, we show in Sec. IV that RK5 can be more efficient than RK4. The RK5 method is given in the following equation:

The computational cost of RK5 is

The second class of explicit methods we consider are Adams-Bashforth methods. Unlike Runge-Kutta methods, Adams-Bashforth methods rely on function evaluations at previous time steps and can be seen as constructing an interpolating polynomial to determine the solution at time *t* + Δ*t* using only times *t* − *n*Δ*t*, *n* = 0, 1, 2, ….^{30,31}

The primary advantage of the Adams-Bashforth methods is that the products *H*[Ψ(*t* − *n*Δ*t*)]Ψ(*t* − *n*Δ*t*) can be saved to a disk, and only one update of the Hamiltonian is required at each time step.

Since AB methods rely on products *H*Ψ from previous time steps, we must start with a different method. An *O*(Δ*t*^{n}) AB method must be started with another scheme that is also *O*(Δ*t*^{n}) to ensure that the overall error remains at order *n*. We use Runge-Kutta methods to run for the first *n* − 1 steps and then start the Adams-Bashforth updates.

In general, Adams-Bashforth methods have a narrower stability window on the imaginary axis than Runge-Kutta methods. In addition, the stability window diminishes as the order of accuracy of the interpolating polynomial increases (cf. Fig. 4). Below, we present results for AB2, AB3, and AB4 methods, but we do not include AB5 results because the narrow stability window limits the usefulness of the method to applications, where an extremely low error is required. Nonetheless, we present the AB5 method below because it is used as the predictor step for the AB5-AM5 method, presented in Subsection III B 2. Stability diagrams for AB methods are shown in Fig. 4.

#### 2. Adams-Bashforth

##### a. AB2.

The AB2 method is globally 2nd order accurate and requires saving only one past-time product *H*Ψ. Like RK2, the stability diagram of this method technically does not contain the imaginary axis, so strictly speaking it is not stable for the TDKS problems. However, we find that in practice this is not an issue, and we provide an estimate for a stable value of Im[*λ*Δ*t*] in Table I

##### b. AB3.

The AB3 method is globally 3rd-order accurate and requires two wave functions to be saved at each time step. The restriction on time step for this method is much less stringent than of AB2

##### c. AB4.

AB4 is globally 4th-order accurate; we find that this method is one of the most competitive in overall runtime for low error

##### d. AB5.

We present AB5 here because it is the predictor step in the AB5-AM5 method, described below. This method is globally 5th-order accurate but requires a very small time step for stability. In our findings, the time step required limits the usefulness of this method

All Adams-Bashforth methods require one update Hamiltonian call and the computation of one *H*Ψ product at each iteration. The products *H*Ψ at previous time steps are saved and loaded at each iteration, but the computational cost associated with these operations is negligible in an overall cost analysis. Therefore, the cost for any of the methods above is

The advantage of the AB methods is that the computational cost is the same for all orders of accuracy. Of course, each method of increasing order of accuracy has a smaller stability window than the last, and eventually, this becomes prohibitive.

#### 3. Adams-Bashforth-Moulton

Adams-Bashforth-Moulton (ABM) methods use an AB step to predict the solution at *t* + Δ*t* and then use an Adams-Moulton (AM) corrector step to obtain Ψ(*t* + Δ*t*). ABM methods therefore have a higher computational cost at each time step than AB methods, but they generally have a larger stability window on the imaginary axis. ABM methods offer an advantage over AB methods when the benefit of taking a larger time step outweighs the higher computational cost at each time step.

In comparison to AM methods, ABM methods do not require an expensive linear solve at each time step. Unlike the CN methods, third- and higher-order AM methods have a finite stability window on the imaginary axis. The stability window for AM methods of order-3 and higher is roughly the same as ABM methods of order-3 and higher. We have implemented and tested AM methods, but we do not present the results here because we find that they are much less efficient than the ABM methods.

Like RK and AB methods, ABM methods do not conserve charge exactly. However, this issue is again solved by normalizing the wave functions at each time step.

In describing these methods, we present only the AM corrector steps. The AB predictor steps are presented in Sec. III B 2. We will refer to an AB*n*-predicted solution at *t* + Δ*t* as $\Psi \u0303ABn(t+\Delta t)$, where *n* is the order of accuracy of the AB step used.

For an order-*n* ABM method, there are two choices for the predictor step: we can predict using an AB method of order *n* − 1 or order *n*. We therefore refer to each method as AB*m*-AM*n*, where *m* = *n* or *m* = *n* − 1. These two choices for *m* are possible because the order-*m* AB predictor step is used only to update the Hamiltonian to form a product $H[\Psi \u0303ABm(t+\Delta t)]\Psi \u0303ABm(t+\Delta t)+O(\Delta tm+1)$, where *O*(Δ*t*^{m+1}) is the local truncation error. In AB*m*-AB*n* schemes, this product is always multiplied by Δ*t* in the corrector step, therefore introducing a local error of order *O*(Δ*t*^{m+2}) in the final estimate for Ψ(*t* + Δ*t*). In order to match the *O*(Δ*t*^{n+1}) local error required for an *O*(Δ*t*^{n}) globally accurate ABM method, we can choose *m* = *n* or *m* = *n* − 1.

For TDDFT, the choice of an AB*n* vs. AB(*n* − 1) predictor step is not arbitrary. It turns out that the stability of AB*m*-AM*n* methods on the imaginary axis depends crucially on the choice of *m* and *n*.^{32} Ghrist^{32} finds that AB2-AM2, AB2-AM3, AB3-AM4, and AB5-AM5 have a finite stability region on the imaginary axis, but that AB1-AM2 (equivalent to RK2), AB3-AM3, AB4-AM4, and AB4-AM5 do not have a stability region on the imaginary axis. We therefore describe only the methods which are stable for our problem. The stability diagrams for these methods are shown in Fig. 5.

##### a. AB2-AM2.

The AB2-AM2 method is shown in Eq. (26). The method is globally 2nd-order accurate. In contrast to AB2 alone, this method does have a stability window that includes a portion of the imaginary axis

##### b. AB2-AM3.

The globally 3rd-order accurate AB2-AM3 scheme is shown in the following equation:

##### c. AB3-AM4.

The AB3-AM4 method is shown in Eq. (28). This method is globally 4th-order accurate and is started with 3 steps of RK4

##### d. AB5-AM5.

The AB5-AM5 scheme is shown in Eq. (29). This method is globally 5th-order accurate and must be started with 4 steps of RK5. Note that AB5-AM5 has a smaller stability window on the imaginary axis than the lower-order AB*m*-AM*n* methods shown

All AB*m*-AM*n* methods require two update Hamiltonian calls and two *Hψ* calls. While higher-order methods require more file I/O, this time is negligible in comparison with a typical update Hamiltonian call. Thus, all AB*m*-AM*n* methods presented above have roughly the same computational cost at each time step, shown in the following equation:

## IV. APPLICATION TO GRAPHENE MONOLAYER

To perform an error analysis and runtime analysis of the integration schemes described in Sec. III, we used a simple system consisting of an isolated graphene monolayer, implemented in the plane-wave DFT code Quantum ESPRESSO.^{17,18} For each method, we investigated three items: (1) scaling of error with respect to time step, (2) scaling of serial runtime with respect to system size at fixed **k**-point density, and (3) serial computational time required to obtain a given level of error for fixed simulation time *T*.

### A. Plane-wave DFT and TDDFT

We implement these integration schemes in the rt-tddft code. This code is an extension to the well-known Quantum ESPRESSO code^{17,18} and uses the functionality of the ground-state plane wave (PW) portion of Quantum ESPRESSO to propagate the wave functions in time. Qian *et al.*^{16} wrote the CN1 algorithm described in Sec. III A. We have greatly extended the functionality of the original code of Qian *et al.* in our design of rt-tddft by adding the CN2 method and all explicit algorithms presented in Sec. III. We also added functionality for multiple **k**-points and have completely redesigned the code to use modern object-oriented Fortran, object-oriented design patterns, and test-driven development practices. These features will enable the rt-tddft code to be easily extended to incorporate new features in future versions and releases.

Quantum ESPRESSO uses a plane-wave expansion^{12,13} of the eigenstates *ψ*_{i}(**r**, *t*) so that we may write

Here, Ω is the volume of the unit cell, **k** is a vector in the Brillouin zone of the reciprocal-space lattice, and **G** is a vector in reciprocal space.

Generally, when we run a plane-wave calculation, we specify a plane-wave energy cutoff *E*_{cut}. This specification puts a restriction on the maximum size of the **G** vectors, which in atomic units is^{15}

One advantage in specifying this energy cutoff is that it plays a direct and important role in the stability analysis. We may use *E*_{cut} as a direct estimate for the magnitude of the maximum eigenvalue of the ground-state Hamiltonian *H*_{0}. Thus, the question of what time step is required for stability of any ODE scheme can be answered simply through

Here, *ξ*_{max} is the maximum value of *λ*Δ*t* on the imaginary axis of any stability diagram shown above. In general, *ξ*_{max} is method-dependent—a list of values for different methods is shown in Table I. The cutoff energy is in Hartree units, and the 24.188 factor is to convert Δ*t* into attoseconds. The interpretation of this formula is simple enough: for larger cutoff energies, a lower time step must be taken for stability. Of course, methods with a larger value of *ξ*_{max} allow a larger time step to be taken, but the question of computational cost must be assessed in each of these cases.

### B. Choice of error metric

To investigate the error, we use the time-dependent dipole moment of the system in the *z*-direction perpendicular to the monolayer

The error metric *e* is then computed as

We refer to this error in Figs. 7 and 9 as the relative *L*_{2} error in the dipole moment. Here, $dzref$ is a reference solution computed using the RK5 method at a time step of 2^{−8} = 0.003 906 25 as.

For the graphene monolayer, we excite the system using a *z* polarized electric field perpendicular to the monolayer, and the error analysis uses only the *z* component, *d*_{z}(*t*). The *z*-component of the dipole moment for a 16-atom graphene monolayer is shown in Fig. 6. A **c** axis of 9.64 Å is included in the unit cell to introduce vacuum space that prevents the graphene layer from interacting with its periodic repeats.

The excitation used is an electric field perturbation of the system at *t* = 0: $E(t)=\alpha \delta (t)n^$, where $n^$ is taken to be $z\u0303$.^{6,16} This perturbation gives rise to a phase shift in the wave functions

Part of the motivation for using the dipole moment as the error metric is that it represents a physical quantity that is relevant for applications of the rt-tddft code, namely, computing optical absorption spectra. Because the dipole moment is computed from the time-dependent density in Eq. (34), we can expect other observable quantities that can be written in terms of the density to also exhibit error scaling that is similar to the results presented below.

### C. Results

To investigate the error of the integration schemes in Sec. III, we computed the dipole moment [Eq. (34)] for each method using a range of values for Δ*t* and a fixed total simulation time *T* = 1000 as. The error metric in Eq. (35) is plotted against the time step for different methods in Fig. 7.

We find that the error for each method scales as expected for small values of Δ*t*. Some methods do not show the expected scaling for values Δ*t* close to the stability limit since the expected *O*(Δ*t*^{n}) scaling only holds for $\Delta t\u2009\u2192\u20090$. In addition, we point out that for small enough Δ*t*, we expect rounding error to dominate.

The stability restrictions for explicit methods (tabulated in Table I) are shown in Fig. 7. A vertical dashed line indicates the maximum value of Δ*t* for which each method is stable. We indicate the expected scaling behavior between the maximum Δ*t* measured and the Δ*t* corresponding to each method’s stability limit with a dashed line. The error at the stability limit is estimated using the slope of the last two measured points, i.e., by computing the error at the critical Δ*t* through linear extrapolation of the slope from the previous two measurements.

Since our interest is primarily in large systems, we also study how runtime scales with system size. To do this, we set up 4-, 8-, 16-, 32-, and 64-atom graphene monolayer systems. For each, we chose **k** points in the Brillouin zone such that the product of the number of **k** points, *N*_{k}, with the number of atoms, *N*_{a}, is fixed: *N*_{k}*N*_{a} = const. We refer to this as a “fixed **k** point density.”

We may think of each **k** point calculation as roughly independent so that doubling the number of **k** points while holding everything else fixed will effectively double the computational time. On the other hand, if we were to decrease the system size by 1/2, we know that the number of **G** vectors [cf. Eq. (31)] will decrease by half^{15}

where Ω is the volume of the system. Therefore, doubling the number of atoms while keeping the **k** density fixed has two effects: (1) the size of the basis set (and therefore *H*) doubles and (2) the number of independent **k** point calculations decreases by 1/2.

We can expect that the impact of these two effects on computational runtime will depend largely on the integration scheme chosen, as well as the particular characteristics of the exchange-correlation potential used. For Crank-Nicolson, a linear solve is required at each iteration, whereas the explicit methods require only update Hamiltonian calls and *H*Ψ calls at each iteration. The computational time for the linear solve will depend largely on the sparsity of the Hamiltonian. In turn, the sparsity of the Hamiltonian depends on the choice of exchange-correlation potential.

In Fig. 8, we present the results of solver time vs. the number of atoms at fixed **k** density using a PBE pseudopotential.^{33} We can draw several conclusions from Fig. 8. First, we see that all methods scale roughly as *O*(*N*_{a}). This is because the cost associated with increasing the size of the basis set is roughly balanced by the time saved by decreasing the number of **k** points. Second, we see that the AB and ABM methods do not increase in computational cost as the order of error is increased. The independence of runtime on order of error was explained in Sec. III and is made possible by the fact that the file I/O operations used to save and load previous *H*Ψ calls are negligible compared to other computations performed at each time step. In addition, we see that AB has the smallest computational cost, followed by RK2 and all ABM methods, as is expected by the cost analyses given in Sec. III. Third, we see that the 1st- and 2nd-order Crank-Nicolson methods have roughly the same computational cost because the linear solve is the dominant cost overall, and the additional update Hamiltonian call required in CN2 does not constitute a large fraction of the total runtime. While the cost of the linear solve can be reduced by choosing a lower convergence tolerance, this will result in increased error in the wave functions. Finally, we see that the runtime of the Runge-Kutta methods depends significantly on the order of error chosen, due to the number of update Hamiltonian calls required per iteration.

By combining the results of Figs. 7 and 8, we computed the serial computational cost for a given level of error. In Fig. 9, we plot computational time required to achieve a given level of error. The computational time on the ordinate is the time required to run a 1000 as simulation for the 4-atom graphene monolayer system. We plot only the most competitive methods in Fig. 9. In general, lower-order methods are not competitive with the higher-order methods; in the context of Fig. 9, the low-order explicit methods have a very steep slope and therefore have much higher computational cost for a given level of error than the higher-order methods.

In Fig. 9, the solid lines correspond to computed values, while the dashed lines correspond to the predicted behavior at intermediate time steps. The explicit methods all become unstable at some level of error, and these values are explicitly shown in the plot. In addition, we note that near the stability limit for the explicit methods, the total runtime changes greatly with the level of error required. These changes in error scaling are a result of the fact that the explicit methods scale as *O*(Δ*t*^{n}) only in the limit $\Delta t\u2009\u2192\u20090$.

Several conclusions can be drawn from Fig. 9. First, we see that for low errors, the explicit methods are much faster than CN2. AB5-AM5 is the fastest method overall up to *e* = 10^{−3}, while AB4 and RK5 are not far behind. For errors 10^{−3} < *e* < 10^{−2}, the situation is more complicated. Although AB3-AM4 and RK4 appear to be the fastest, the steep slope of these methods is the reason for concern; in general, we cannot say that one method is the “best” in this region since the explicit methods become unstable at different levels of error in this region. To avoid this instability, we recommend using the AB5-AM5 method to achieve lower error with only slightly higher computational cost. Finally, in the range of errors *e* ≥ 10^{−2}, clearly, CN2 is the only acceptable choice. If we are willing to accept a significant level of error, we may choose to run CN2 with a large time step.

The results of Fig. 9 show that high-order explicit schemes can be much more efficient than the implicit Crank-Nicolson scheme. Although the AB5-AM5 method comes with a heavy restriction on the time step, the high order of the method, combined with the low computational cost per iteration, makes it the best choice for low errors. On the other hand, CN2 can be seen as a very robust method, valid for any time step and range of error. The particular choice of method will, in general, depend on the application. In cases where a very high plane-wave cutoff energy is required, and low accuracy is acceptable, CN2 may be the best choice. However, in cases where high accuracy is required, these high-order explicit schemes can be very effective.

## V. PARALLELIZATION

The results discussed in Sec. IV are entirely for serial calculations. In practice, we are more interested in systems with large numbers of atoms, and it is therefore very important to understand the runtime scaling of our code when we enable the parallel routines in Quantum ESPRESSO. Since the TDDFT code is essentially a wrapper around the ground-state PW code in Quantum ESPRESSO, the expensive operations such as the update Hamiltonian and *H*Ψ tasks are already parallelized.

The ground-state Quantum ESPRESSO code has several levels of parallelization that may be exploited,^{17,18} including parallelization over **k** points, parallelization through distribution of plane-wave coefficients, parallelization through linear algebra routines, and parallelization of the FFT.

To explore the runtime scaling of the parallelized code, we use a 128-atom graphene monolayer (with vacuum space) system with the same cutoff energy as in Sec. III (*E*_{cut} = 25 Ry). For such large systems, it is common to use only the Gamma point so that *N*_{k} = 1 and we may ignore parallelization of **k** points. The parallelization we assess here only involves distributing the plane-wave coefficients so that each processor works on vectors roughly of size *N*_{PW}/*n*_{proc}, where *N*_{PW} is the total number of plane-waves and *n*_{proc} is the number of processors. This parallelization is done using MPI; we do not use any OpenMP directives although they are available.

To understand the runtime scaling, we look only at the parallel scaling of methods in Fig. 9. For each method, we chose the Δ*t* that corresponds to a fixed *L*_{2} dipole error *e* = 10^{−4} [(cf. Eq. (35)]. We then run a TDDFT calculation for each method with a different *n*_{steps} = *T*/Δ*t*, where *T* is the total simulation runtime, which was chosen to be 1000 as.

The total computational runtime for a 1000 as simulation and *e* = 10^{−4} vs. the number of processors is shown in Fig. 10. A dashed line corresponding to ideal scaling is also shown. In the ideal case, a doubling of the number of processors decreases the total runtime by one-half. We see from Fig. 10 that both the explicit methods and the CN2 method scale roughly the same. We can understand the scaling by recognizing that the explicit methods and CN2 both rely on products *H*Ψ and update Hamiltonian calls as their basic ingredients. The CN2 method requires many more *H*Ψ calls during each linear solve step, but because the fundamental operations required are the same as in the explicit methods, we see the same scaling behavior in CN2 as in the explicit methods.

The scaling study presented in Fig. 10 uses only 32 processors. Our code allows one to use many more processors; however, the scaling for large numbers of processors becomes more difficult to assess. In Fig. 10, we have only parallelized by distributing plane-wave coefficients, while all FFT operations are done in serial. In moving to a larger number of processors, it becomes very important to use parallel FFT routines. In addition, we specifically disabled other parallelization levels in Quantum ESPRESSO that could speed up calculations with a larger number of processors. The study of runtime in our TDDFT code using different parallelization levels is beyond the scope of this paper. Fundamentally, however, the explicit methods presented above will all scale in roughly the same way when other parallelization levels are enabled. We should therefore expect the explicit methods described above to perform well with a larger number of processors, given that the appropriate parallelization levels are enabled.

## VI. CONCLUSIONS

A variety of explicit and implicit integration schemes have been analyzed and implemented with a plane-wave basis in the context of Time-Dependent Density Functional Theory (TDDFT). We employ a plane wave basis for compatibility with the many codes commonly used for periodic boundary conditions. We find that Adams-Bashforth and Adams-Bashforth-Moulton methods of high order outperform more commonly used Runge-Kutta and Crank-Nicolson methods when a moderately low error is desired. The Adams-Bashforth and Adams-Bashforth-Moulton are advantageous because the computational cost per time step is independent of the order of accuracy of the method, unlike Runge-Kutta and Crank-Nicolson methods, which have higher computational cost for higher orders of accuracy. Although the Adams-Bashforth and Adams-Bashforth-Moulton methods require smaller time steps for stability, the lower computational cost per iteration outweighs the cost of the time step restriction. Ultimately, we find that the AB5-AM5 method is the most efficient method when a moderately low error is desired.

In addition to serial runtime analysis, we presented a parallel scaling study of a 128-atom graphene monolayer plus vacuum for up to 32 processors. We find that for small numbers of processors, the explicit methods and Crank-Nicolson methods exhibit roughly the same scaling. The Adams-Bashforth and Adams-Bashforth-Moulton therefore remain the most cost-effective among methods run on a given number of processors. The code we developed imposes no restriction on the number of processors, and the Adams-Bashforth and Adams-Bashforth are also expected to be the most efficient when larger numbers of processors are used.

## ACKNOWLEDGMENTS

We thank Lexing Ying for his insights in numerical aspects dealt with during the code development. This work was supported in part by the US Army Research Laboratory, through the Army High Performance Computing Research Center, Cooperative Agreement No. W911NF-070027. Additional support was provided by NSF Grant No. DMR-1455050.