We recently developed an algorithm to compute response properties for the state-averaged complete active space self-consistent field method (SA-CASSCF) that capitalized on sparsity in the atomic orbital basis. Our original algorithm was limited to treating small to moderate-sized active spaces, but the recent development of graphical processing unit (GPU) based direct-configuration interaction algorithms provides an opportunity to extend this to large active spaces. We present here a direct-compatible version of the coupled perturbed equations, enabling us to compute response properties for systems treated with arbitrary active spaces (subject to available memory and computation time). This work demonstrates that the computationally demanding portions of the SA-CASSCF method can be formulated in terms of seven fundamental operations, including Coulomb and exchange matrix builds and their derivatives, as well as generalized one- and two-particle density matrix and *σ* vector constructions. As in our previous work, this algorithm exhibits low computational scaling and is accelerated by the use of GPUs, making possible optimizations and nonadiabatic dynamics on systems with *O*(1000) basis functions and *O*(100) atoms, respectively.

## I. INTRODUCTION

The state-averaged complete active space self-consistent field (SA-CASSCF) method is an important theoretical method for the treatment of molecular systems involving multiple contributing electronic configurations.^{1–8} State averaging is especially critical for photochemical applications because it enables the treatment of multiple electronic states on an equal footing and alleviates the well-known root-flipping problem associated with optimizing an excited state wavefunction.^{8–12} As such, SA-CASSCF is commonly employed in the simulation of photochemical reactions. We recently developed algorithms for the calculation of the SA-CASSCF energy, nuclear gradient, and nonadiabatic coupling (NAC) vector that exploit the intrinsic sparsity in the atomic orbital (AO) basis and capitalize on graphical processing unit (GPU) acceleration.^{13,14} We were able to reduce the formal *O*(N^{4}) scaling to effectively *O*(N^{2}), where N is the number of basis functions, without introducing any approximations. Our original algorithms were designed for small to medium-sized active spaces and involved explicitly building and diagonalizing the entire CAS configuration interaction (CI) Hamiltonian, which is the full CI Hamiltonian for a subset of active orbitals. Moreover, we worked in the basis of eigenstates of the CI Hamiltonian to solve the coupled perturbed multiconfiguration self-consistent field (CP-MCSCF) equations and form nuclear gradients and NAC vectors. Despite the robustness of these algorithms, the number of determinants scales factorially with the size of the active space, rendering them impractical for large active spaces due to memory requirements.

In order to treat large active spaces, direct-CI algorithms were developed to efficiently solve for a specified number of eigenvalues (energies) and eigenvectors of the CI Hamiltonian.^{15–19} These algorithms are efficient because they avoid the explicit construction of the CI Hamiltonian. Using the Slater-Condon rules and the resolution of the identity, direct techniques exploit the well-defined structure of the Hamiltonian matrix to reduce the scaling of each matrix-vector product required during the iterative solution to the CI problem from quadratic to linear with respect to the number of determinants. Although spin-adapted formulations of direct methods exist and reduce the dimensionality of the CI problem by a factor of two to three,^{20,21} determinant-based algorithms are often employed due to the convenient integer form of the coupling coefficients,^{22} which are required for computing the energy and response properties. In recent work,^{23} we implemented a vectorized determinantal direct-CI algorithm on GPUs based on the work of Knowles and Handy.^{15} Our implementation substantially reduced the cost of determining the lowest several energies and eigenvectors of the CI Hamiltonian, enabling the efficient treatment of very large active spaces using commonly available desktop hardware. With the addition of fast routines to compute the one- and two-particle density matrices, it was straightforward to connect our direct-CI and SA-CASSCF energy codes. However, further development was necessary in order to compute nuclear gradients and NAC vectors, as described in the present manuscript.
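The core idea of a direct method — recovering the lowest few eigenpairs from matrix-vector products alone, without ever forming the Hamiltonian — can be illustrated with a small matrix-free sketch. This is an illustration only: the model Hamiltonian, its dimensions, and the dense matvec below are stand-ins, whereas a real direct-CI code forms *σ* from active-space integrals and coupling coefficients (e.g., via the Knowles–Handy scheme).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n_det = 200  # stand-in for the number of determinants

# Model "CI Hamiltonian": diagonally dominant symmetric matrix (illustrative only)
R = rng.standard_normal((n_det, n_det))
H = np.diag(np.arange(n_det, dtype=float)) + 0.01 * (R + R.T)

def sigma(c):
    """sigma build: returns H @ c. A real direct-CI code assembles this product
    from active-space ERIs and coupling coefficients, never from an explicit H."""
    return H @ c

# Iterative eigensolver that only ever calls the matvec
H_op = LinearOperator((n_det, n_det), matvec=sigma, dtype=np.float64)
n_states = 3
evals, evecs = eigsh(H_op, k=n_states, which="SA")  # lowest n_states eigenpairs

# Agrees with dense diagonalization of the model Hamiltonian
assert np.allclose(evals, np.linalg.eigvalsh(H)[:n_states])
```

The point of the sketch is the interface: the solver needs only `sigma`, so the cost per iteration is that of one matrix-vector product rather than a full diagonalization.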

Our interest in adapting our SA-CASSCF code stems from the importance of nuclear gradients and nonadiabatic coupling vectors to chemical research. Nuclear gradients are broadly useful in electronic structure theory and are required for minimizations, transition state searches, and molecular dynamics. Nonadiabatic coupling (NAC) vectors are important for the study of photochemistry due to their use in dynamics simulations on multiple electronic states, where the Born-Oppenheimer approximation breaks down, as well as in conical intersection searches.^{24–27} Pulay and Schaefer, among others, developed the initial methods for computing nuclear gradients,^{28–32} and Yarkony and co-workers established that NAC vectors could be computed with essentially the same theory.^{33,34} More recent work has simplified the theory behind the evaluation of analytical gradients through the use of Lagrangians.^{35,36} We employ and further develop the Lagrangian perspective on gradient theory in the present work.

As previously noted, our original implementation of the SA-CASSCF analytical gradients and NAC vectors was performed in a basis of eigenvectors of the CI Hamiltonian.^{14} This necessitates determining all possible spin-adapted CI eigenvectors, which is only computationally tractable for small active spaces (where the CI Hamiltonian matrix can be stored in memory). To make our gradient code compatible with large active spaces, we reformulate the CP-MCSCF equations in a basis of determinants and design an algorithm that utilizes GPU-accelerated direct-CI operations. Past research has established some of the frameworks required for adapting our method to large active spaces.^{37,38} However, to the best of our knowledge, this is the first work that discusses a well-defined procedure for computing analytical gradients and NAC vectors for large molecules in an arbitrarily large determinant basis without approximating the electron repulsion integrals, and it is certainly the first to do so using GPUs. As in our previous work,^{13,14,23} we present timings demonstrating both the favorable efficiency and scaling of our algorithm. In addition, we report results from a minimum energy conical intersection (MECI) search on the Dronpa-2 protein using mixed quantum mechanics/molecular mechanics (QM/MM) and a large (>100 atoms) quantum region. We believe that this represents one of the largest conical intersection searches ever performed. The algorithms developed in this work will enable picosecond (and beyond) time scale excited state dynamics simulations on large molecular systems, including systems that require large active spaces.

## II. THEORY

### A. Definitions

Throughout this work, we employ the following notation. Unless otherwise noted, Θ, Λ, Ψ, Φ, ϓ, and Ξ refer to eigenstates of the CI Hamiltonian. Θ and Λ are the particular eigenstates of interest for determining response properties. Ψ, Φ, ϓ, and Ξ are arbitrary eigenstates within the state-averaged space. When an eigenstate index appears in a summation, one should assume that the summation runs over all states within the state-averaged space. A state is considered part of the state-averaged space when it has nonzero weight, and there are *N*_{s} such states. Determinants will be indexed with *I* and *J*, and there are *N*_{d} determinants total. Atomic orbitals (AOs) will be indexed with *μ*, *ν*, *ρ*, and *σ*. Arbitrary molecular orbitals (MOs) will be indexed with *p*, *q*, *r*, and *s*. The inactive subset of occupied MOs will be indexed with *i*, *j*, *k*, and *l*, the active subset of MOs will be indexed with *t*, *u*, *v*, *w*, and *x*, and the virtual MOs will be indexed with *a*, *b*, *c*, and *d*. There are *N*_{o} total molecular orbitals.

### B. Derivation of and solution to the coupled perturbed equations

The SA-CASSCF wavefunction is determined by variationally optimizing the following energy expression:^{39}

The state-averaged one-particle density matrix (OPDM), **γ**^{SA}, and two-particle density matrix (TPDM), **Γ**^{SA}, are each a weighted average of the corresponding state-specific OPDMs and TPDMs,

The weight for state Ψ is denoted as *ω*_{Ψ}, and these weights are typically positive and sum to unity.
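The state-averaged densities defined above are simple weighted sums, which the following minimal numpy sketch makes concrete. The state-specific matrices here are random symmetric stand-ins; only the weighted-average construction reflects the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_act, n_states = 4, 3
weights = np.full(n_states, 1.0 / n_states)  # typically positive, summing to unity

# Hypothetical state-specific OPDMs (symmetric stand-ins for gamma^Psi)
gammas = []
for _ in range(n_states):
    A = rng.standard_normal((n_act, n_act))
    gammas.append(0.5 * (A + A.T))

# State-averaged OPDM: weighted average of the state-specific matrices
gamma_sa = sum(w * g for w, g in zip(weights, gammas))

assert np.isclose(weights.sum(), 1.0)
assert np.allclose(gamma_sa, gamma_sa.T)  # symmetry is preserved by averaging
```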

The use of state averaging complicates the calculation of nuclear gradients and other response properties for a specific state because the energy of a single state is not variational with respect to the wavefunction parameters (the orbitals are optimized to minimize the state-averaged energy and not the energy of any individual state). However, by expressing the energy as a Lagrangian,^{36} it is possible to construct a variational expression for each electronic state,

where *κ*_{pq} and *c*_{ΨI} are orbital and CI parameters and $\bar{\kappa}_{pq}^{\Theta}$ and $\bar{c}_{\Psi I}^{\Theta}$ are the corresponding Lagrange multipliers that make Eq. (4) variational. One can determine these multipliers by requiring that the partial derivatives of the Lagrangian with respect to the orbital and CI parameters vanish. The resulting set of linear equations is known as the coupled perturbed multiconfiguration self-consistent field equations,^{29,36}

where $\mathbb{H}^{\mathbf{OO}}$ and $\mathbb{H}^{\mathbf{CC}}$ are the orbital-orbital and CI-CI components of the Hessian matrix, respectively. The blocks $\mathbb{H}^{\mathbf{OC}}$ and $\mathbb{H}^{\mathbf{CO}}$ are the orbital-CI coupling components of the Hessian matrix, and $\mathbb{H}^{\mathbf{OC}}=(\mathbb{H}^{\mathbf{CO}})^{T}$. The vector $\mathbf{g}_{o}^{\Theta}$ is the state-specific orbital gradient. To calculate the nonadiabatic coupling vector between states Θ and Λ, $\mathbf{g}_{o}^{\Theta}$ is replaced by the orbital interstate coupling vector, $\langle\Theta|\partial \hat{H}/\partial\kappa_{pq}|\Lambda\rangle$, but the Hessian is the same. In principle, solving the coupled perturbed equations is a straightforward exercise in numerical linear algebra. However, the Hessian can be extremely large and cannot be stored entirely in memory. As a result, it is necessary to solve the coupled perturbed equations iteratively, which requires only matrix-vector products between the Hessian and a trial vector. We will call the trial vector **z** and the product between the Hessian and a trial vector **η**, where

and

at convergence. In certain instances, we will refer to elements and subvectors of $\bar{\bm{\kappa}}^{\Theta}$, $\bar{\bm{c}}^{\Theta}$, and **z**. If two indices appear as a subscript, then we are referencing an element of the vector. When a single index appears as a subscript, we are referencing a subvector containing all elements with that subscript. In the case of $\bar{\bm{\kappa}}^{\Theta}$, we will periodically use the words "*core*," "*act*," "*virt*," and "*all*" as subscript index pairs to describe the subvector containing elements with these pair designations, expressed as a matrix. These will often appear as submatrices of an *N*_{o} × *N*_{o} matrix of orbital pairs.
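The iterative, matrix-free strategy described above — solving a linear system using only Hessian-vector products **η** = **Hz** — can be sketched with a minimal conjugate-gradient loop. Everything below is a stand-in (a random positive-definite model "Hessian" and right-hand side); only the interface, a solver driven purely by a matvec callback, mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150  # stand-in for the number of CP-MCSCF unknowns
A = rng.standard_normal((n, n))
hess = A @ A.T + n * np.eye(n)   # model SPD "Hessian" (illustrative only)
rhs = rng.standard_normal(n)     # stand-in for the gradient right-hand side

def eta(z):
    """Hessian-vector product. In the real algorithm this is assembled from
    J/K builds, sigma builds, and density-matrix builds, never a stored Hessian."""
    return hess @ z

def solve_cg(matvec, b, tol=1e-10, maxiter=500):
    """Minimal conjugate-gradient solver driven only by matvec calls."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

z = solve_cg(eta, rhs)
assert np.allclose(hess @ z, rhs, atol=1e-6)
```

(The production solver in the paper is DIIS-based with a preconditioner rather than plain CG; the sketch only illustrates the matvec-driven structure.)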

In our previous works,^{13,14} we showed that it is possible to calculate the orbital gradient as well as certain components of $\mathbb{H}^{\mathbf{OO}}$ (those that scale as $O(N_o^2)$ in memory) using a combination of **J** and **K** matrix builds

where **D** is a generalized density matrix. This was largely accomplished by constructing a subset of the electron repulsion integrals (ERIs)—more specifically the set of two-electron integrals involving two active orbitals with the form (*pq*|*tu*) or (*pt*|*qu*)—as well as a few Fock-like matrices. The remaining contributions to $\mathbb{H}^{\mathbf{OO}}$ were recovered using two **J** and two **K** matrix builds during each iteration of the solution to the coupled perturbed equations. This procedure was discussed in detail in Refs. 13 and 14 and will not be repeated here. However, for the sake of completeness, we provide the equations for the nonredundant portions of the orbital gradient

as well as the nonredundant terms of the state-averaged orbital Hessian

In Eqs. (16)–(21), the braced terms are not explicitly constructed but instead recovered using matrix-vector products during the iterative solution to the coupled perturbed equations. The Fock-like matrices, **F**^{core} and **F**^{act}(**γ**), and the subset of ERIs with two active indices can be computed and expressed in the following way:^{40,41}

where

and $C_{\mu p}$ are MO coefficients. Components of the gradient and Hessian with certain molecular orbital pairs (e.g., *ij*, *tu*, and *ab*) are guaranteed to be zero because the energy is invariant to orbital rotations within the core, active, and virtual subspaces. As a result, the Lagrange multipliers corresponding to these orbital index pairs are guaranteed to be zero and are not coupled to the other Lagrange multipliers. For this reason, we can and do omit these terms entirely from the coupled perturbed equations. Furthermore, $\mathbb{H}_{pqrs}^{\mathbf{OO}}=\mathbb{H}_{qpsr}^{\mathbf{OO}}=\mathbb{H}_{rspq}^{\mathbf{OO}}=\mathbb{H}_{srqp}^{\mathbf{OO}}$ and $g_{pq}^{\Theta}=-g_{qp}^{\Theta}$. As a result, there are linear dependencies in the coupled perturbed equations, which can be removed by including only the terms in which *p* < *q* and *r* < *s*. Importantly, the excluded terms are not zero and must be accounted for once the coupled perturbed equations are solved; namely, it is important to remember that $\bar{\kappa}_{pq}^{\Theta}=-\bar{\kappa}_{qp}^{\Theta}$. These omissions remain in effect even when we employ the general MO indices *p*, *q*, *r*, and *s* in various other equations in this paper. In addition, we want to emphasize that the procedure to handle the orbital contribution to the nuclear gradient and nonadiabatic coupling vector remains unchanged in our direct-compatible implementation.
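The bookkeeping just described — keeping only the *p* < *q* elements of an antisymmetric multiplier matrix and restoring the rest by sign — can be sketched in a few lines of numpy. The matrix below is a random antisymmetric stand-in; only the pack/unpack pattern reflects the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n_o = 6  # stand-in for the number of molecular orbitals

# Model antisymmetric multiplier matrix: kappa_pq = -kappa_qp (illustrative)
A = rng.standard_normal((n_o, n_o))
kappa = A - A.T

iu = np.triu_indices(n_o, k=1)  # unique pairs with p < q
packed = kappa[iu]              # only these enter the coupled perturbed equations

# Reconstruct the full matrix from the packed unique elements
full = np.zeros((n_o, n_o))
full[iu] = packed
full -= full.T                  # restores kappa_qp = -kappa_pq

assert np.allclose(full, kappa)
assert packed.size == n_o * (n_o - 1) // 2  # linear dependencies removed
```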

Our initial implementation of SA-CASSCF was designed for small to moderate-sized active spaces (6 electrons in 8 orbitals or smaller). We explicitly built and diagonalized the CI Hamiltonian matrix

Here, *E*_{core} is the core energy, which is essentially the energy of the closed molecular orbitals and introduces a rigid shift of the eigenvalues, and the elements $\gamma_{tu}^{IJ}$ and $\Gamma_{tuvw}^{IJ}$ are one- and two-particle coupling coefficients. For small to moderate-sized active spaces, the Hamiltonian is small, and this strategy is tractable. Moreover, diagonalizing the entire Hamiltonian produces a complete set of energies and CI eigenvectors for the given active space. This made it possible to build and store $\mathbb{H}^{\mathbf{OC}}$, $\mathbb{H}^{\mathbf{CO}}$, and $\mathbb{H}^{\mathbf{CC}}$, which are also quite small, using a basis of CI eigenvectors (instead of a basis of determinants). Unfortunately, the number of determinants increases combinatorially with the size of the active space. For large active spaces, it is often impossible to store the CI Hamiltonian, let alone diagonalize it completely.
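The combinatorial growth is easy to quantify for a determinant-based CAS, where the alpha and beta occupation strings are counted independently (for a spin singlet, half the electrons are alpha and half beta):

```python
from math import comb

def n_determinants(n_orb, n_alpha, n_beta):
    """Number of determinants in a CAS: product of independent
    alpha- and beta-string counts, C(n_orb, n_alpha) * C(n_orb, n_beta)."""
    return comb(n_orb, n_alpha) * comb(n_orb, n_beta)

# (6e,8o) singlet: 3 alpha + 3 beta electrons in 8 orbitals -- easily stored
assert n_determinants(8, 3, 3) == 3136

# (16e,16o) singlet: ~1.7e8 determinants -- explicit Hamiltonian storage is hopeless
assert n_determinants(16, 8, 8) == 165_636_900
```

Storing the explicit Hamiltonian would require the *square* of these counts in matrix elements, which is why direct techniques that need only the determinant-length vectors become mandatory.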

This problem is resolved by employing the direct-CI technique,^{15–17,19} which involves iteratively solving for the *N*_{s} lowest eigenvalue-eigenvector pairs of the CI Hamiltonian. In the direct-CI formalism, it is only necessary to explicitly construct the ERIs corresponding to the active space as well as matrix-vector products between the CI Hamiltonian and a trial vector, often called $\bm{\sigma}$,

In recent work,^{23} we developed a highly efficient direct-CI algorithm on the GPU, demonstrating that our atomic-orbital based CAS algorithm remains highly efficient for large active spaces. That work focused on the complete active space configuration interaction (CAS-CI) method. However, it is entirely compatible with SA-CASSCF, given an efficient algorithm for the construction of a generalized OPDM,

and generalized TPDM,

In Eqs. (31) and (32), *A* and *B* serve only to indicate that the bra and ket coefficient vectors, **c**, need not be the same. Unless otherwise noted, generalized OPDMs and TPDMs are always symmetrized, which entails that $\gamma_{tu}^{AB}=\gamma_{ut}^{AB}$ and $\Gamma_{tuvw}^{AB}=\Gamma_{utwv}^{AB}=\Gamma_{vwtu}^{AB}=\Gamma_{wvut}^{AB}$. The ordering of *A* and *B* makes no difference, assuming the CI coefficients are real. When we write a generalized OPDM or TPDM with a single superscript, it is to be assumed that *A* = *B*. In practice, the resolution of the identity is employed in Eqs. (29)–(32) to avoid explicit computation and storage of the two-particle coupling coefficients.^{19,42} The algorithm used to efficiently compute the OPDM and TPDM follows the one used to compute $\bm{\sigma}$.

Without diagonalizing the entire Hamiltonian, there is no way to construct $\mathbb{H}^{\mathbf{OC}}$, $\mathbb{H}^{\mathbf{CO}}$, and $\mathbb{H}^{\mathbf{CC}}$ in a basis of CI eigenvectors. Moreover, explicitly constructing these blocks quickly becomes intractable as the size of the active space increases. As such, it is necessary to develop a direct-compatible implementation to calculate gradients and other response properties for SA-CASSCF wavefunctions with large active spaces. In a basis of determinants and when equal weights are employed, $\mathbb{H}^{\mathbf{CC}}$ can be expressed as^{36}

The first term is a shifted variant of the CI Hamiltonian, and the second term projects out the contributions to $\mathbb{H}^{\mathbf{CC}}$ coming from states within the state-averaged space. Similarly, $\mathbb{H}^{\mathbf{OC}}$ and $\mathbb{H}^{\mathbf{CO}}$ can be expressed as^{36,43}

The matrix $\bm{T}_{pq}(\Psi, X)$ is a generalized orbital gradient matrix and is constructed with the symmetrized, generalized density matrices **γ**^{ΨX} and **Γ**^{ΨX}. The variable *X* is a placeholder for any possible vector; in the above equations, *X* appears as the determinant *I* and the CI eigenvector ϓ. Analogous to the case for $\mathbb{H}^{\mathbf{CC}}$, the second term in Eq. (34) can be thought of as a projector, which removes the contribution to the Hessian from each state within the state-averaged space. The nonredundant components of $\bm{T}_{pq}(\Psi, X)$ can be expressed as

We note that in the case when *X* = Ψ, **T** becomes the orbital gradient, $\bm{g}_{o}^{\Psi}$. For any other eigenvector of the CI Hamiltonian, **T** becomes an orbital interstate coupling vector, which replaces the orbital gradient in the calculation of nonadiabatic coupling vectors. All of these terms can be computed with **J** and **K** matrix builds in addition to OPDM and TPDM builds.

As previously noted, we do not explicitly construct any of the matrix blocks defined by Eqs. (33) and (34). Instead, we build products of each matrix block with trial vectors, much like a $\bm{\sigma}$ build, during each iteration of the solution to the coupled perturbed equations. We will begin by showing how to compute the product of $\mathbb{H}^{\mathbf{CC}}$ and a trial vector because this is the most straightforward matrix-vector product. The expression for $\mathbb{H}^{\mathbf{CC}}$ multiplied by the trial vector $\tilde{\bm{c}}_{\Phi}^{\Theta}$ is

The vector $\tilde{\bm{c}}_{\Phi}^{\Theta}$ is an estimate for $\bar{\bm{c}}_{\Phi}^{\Theta}$, the set of CI Lagrange multipliers belonging to state Φ. The first term is simply a $\bm{\sigma}$ build in which $\tilde{\bm{c}}_{\Phi}^{\Theta}$ is the trial vector and *E*_{core} is replaced by (*E*_{core} − *E*^{Ψ}) in Eq. (29). For the second term, the CI eigenvector for each state in the state-averaged space, **c**^{ϓ}, is weighted by the energy difference between states ϓ and Ψ as well as the overlap between $\tilde{\bm{c}}_{\Phi}^{\Theta}$ and **c**^{ϓ}. In practice, the second term in Eq. (38) is not required. The solution to the coupled perturbed equations is orthogonal to the state-averaged space, and there are no rotations within the Hessian between states inside and outside the state-averaged space. Consequently, it is guaranteed that all trial vectors remain orthogonal to the state-averaged space as long as the initial guess is orthogonal to the space and the preconditioner contains no rotations between states inside and outside the state-averaged space. This is discussed at length in Ref. 36, and we employ similar techniques in this work. As such,

and only $N_s$ $\bm{\sigma}$ builds need to be constructed per iteration, since $\mathbb{H}^{\mathbf{CC}}$ is block diagonal. It is also quite straightforward to compute $\mathbb{H}^{\mathbf{OC}}$ multiplied by a trial vector,

The generalized gradient $\mathbf{T}(\Psi, \tilde{\bm{c}}_{\Psi}^{\Theta})$, and by extension the generalized density matrices $\bm{\gamma}^{\Psi\bar{\bm{c}}_{\Psi}^{\Theta}}$ and $\bm{\Gamma}^{\Psi\bar{\bm{c}}_{\Psi}^{\Theta}}$, must be computed *N*_{s} times per iteration.
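The requirement that trial vectors stay orthogonal to the state-averaged space can be enforced with a simple projection, sketched below. The orthonormal vectors standing in for the CI eigenvectors are randomly generated; only the project-and-stay-orthogonal pattern reflects the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n_det, n_states = 100, 3

# Orthonormal stand-ins for the CI eigenvectors spanning the state-averaged space
V, _ = np.linalg.qr(rng.standard_normal((n_det, n_states)))

def project_out_sa_space(c):
    """Remove any component of a trial vector lying in the state-averaged space."""
    return c - V @ (V.T @ c)

trial = project_out_sa_space(rng.standard_normal(n_det))

# The projected trial vector has no overlap with any state-averaged eigenvector
assert np.allclose(V.T @ trial, 0.0)
# Projecting twice changes nothing (the operation is idempotent)
assert np.allclose(project_out_sa_space(trial), trial)
```

If the initial guess and every preconditioned update pass through such a projection, the overlap terms dropped from Eq. (38) never reappear during the iterations.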

The calculation of $\mathbb{H}^{\mathbf{CO}}$ multiplied by a trial vector is somewhat more complicated,

As with the CI Lagrange multipliers, $\tilde{\bm{\kappa}}^{\Theta}$ is an estimate of the vector of orbital Lagrange multipliers, $\bar{\bm{\kappa}}^{\Theta}$. It is not feasible to construct **T**(Ψ, *I*) because the number of determinants can become extremely large. However, by inserting Eqs. (35)–(37) into Eq. (42), it is evident that Eq. (42) can be rewritten in the following way:

where the variables α, $F_{tu}^{act}(\tilde{\mathbf{D}})$, $F_{tu}^{core}(\bar{\mathbf{D}})$, and (*tu*|*vw*)_{m} are defined as

and

By symmetrizing $\mathbf{F}^{act}(\tilde{\mathbf{D}})$, $\mathbf{F}^{core}(\bar{\mathbf{D}})$, and (*tu*|*vw*)_{m}, we can further simplify Eq. (43),

where

The terms α, $\mathbf{F}^{act}(\tilde{\mathbf{D}})$, $\mathbf{F}^{core}(\bar{\mathbf{D}})$, and (*tu*|*vw*)_{m} need only be constructed once per iteration. In addition, *N*_{s} modified $\bm{\sigma}$ builds and $N_s^2$ dot products between $\bm{T}(\Psi, ϓ)$ and $\bar{\bm{\kappa}}_{pq}$ are required per iteration. However, the matrices $\mathbf{T}(\Psi, ϓ)$ need only be computed once and stored in memory. This is tractable because there are $N_s(N_s+1)/2$ unique orbital interstate coupling matrices (or gradient matrices, if Ψ = ϓ), which is a relatively small number in most SA-CASSCF calculations. In addition, each matrix requires only $O(N_o^2)$ memory.

This procedure alone is valid as long as all states included within the state-averaged space have equal weight, which is not always desirable.^{44,45} Fortunately, modifying the procedure to treat a wavefunction in which the states are not equally weighted is quite straightforward. Stålring *et al.* derived a set of more general equations for the CI-CI and orbital-CI blocks of the Hessian,^{36}

Importantly, the weights here are not the same as in Eqs. (33) and (34), in which all terms are weighted by *ω*_{Ψ}. In our implementation, however, we do not alter Eqs. (33) and (34). Instead, we introduce an additional set of CI Lagrange multipliers (expressed in the basis of CI eigenvectors) for the pairs of states within the state-averaged space that have different weights. The new Lagrangian takes the following form:

By taking the derivative with respect to each of the orbital and two CI parameters, we recover an extended set of coupled perturbed equations

As written, Eqs. (56)–(58) have multiple linear dependencies because both the determinant and CI eigenvector blocks of the orbital-CI and CI-CI Hessian contain information regarding how the energy changes as a function of rotations within the state-averaged space. Eqs. (33) and (34) project out the components of the determinant Hessian that exist within the state-averaged space, and employing these equations in place of Eqs. (53) and (54) removes all linear dependencies. There is no CI-CI coupling between blocks expressed in a projected determinant basis and a CI eigenvector basis because they span orthogonal spaces. For this reason, $\mathbb{H}_{\Psi I,\Phi ϓ}^{\mathbf{CC}}=\mathbb{H}_{\Phi ϓ,\Psi I}^{\mathbf{CC}}=0$. With that in mind, the additional blocks of the Hessian corresponding to the states with nonzero weight can then be expressed in the basis of eigenstates of the CI Hamiltonian^{36,43}

It is worth noting that Eqs. (53) and (59) and Eqs. (54) and (60) are the same equations expressed in different bases (though the indices Ψ and Ξ would run over all eigenstates, as opposed to those within the state-averaged space, if we built the entire CI-CI and orbital-CI blocks in a basis of CI eigenvectors). It is tractable to treat this block in the basis of eigenstates because there can be at most $N_s(N_s+1)/2$ nonredundant Lagrange multipliers. In fact, all of the terms appearing in Eqs. (59) and (60) are already computed and stored to solve the coupled perturbed equations (see Eq. (50)). The advantage of using Eqs. (53) and (54) is that no additional CI Lagrange multipliers are introduced. Furthermore, there are fewer Hessian matrix singularities, which get introduced when the components of the state-averaged space are projected out of the determinant blocks. The disadvantage is that the coupled perturbed equations expressed in a determinant basis exhibit linear dependencies. By expressing Eqs. (53) and (54) in a basis of CI eigenvectors (as is done in Eqs. (59) and (60)), it is clear that $\bar{c}_{\Psi\Phi}^{\Theta}=-\bar{c}_{\Phi\Psi}^{\Theta}$, which is a linear dependency. This can be easily removed in a basis of CI eigenvectors, but the same is not true in a basis of determinants. Moreover, the CI-CI block of the Hessian is no longer block diagonal when states are not equally weighted. This increases the cost of computing the product of the Hessian and a trial vector and potentially diminishes the rate of convergence of the coupled perturbed equations, which become less diagonally dominant. Our strategy alleviates these problems, and we believe that its advantages outweigh its disadvantages. The number of additional Lagrange multipliers introduced is trivial compared to the size of the Hessian (at least for most tractable active spaces). Moreover, the Hessian is guaranteed to be singular in a basis of determinants, and the noise from the additional singularities can be easily removed during the course of the solution to the coupled perturbed equations, as the direction of each singularity is known.
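Because the direction of each singularity is known, the minimum-norm (least squares) solution of a singular symmetric system can be recovered by projecting those directions out of every vector during the iterations. The sketch below demonstrates the principle on a random singular model matrix with a known two-dimensional null space (all names and dimensions are stand-ins); the projected conjugate-gradient loop reproduces the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_null = 40, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
null = Q[:, -n_null:]                    # known singular directions
d = rng.uniform(1.0, 5.0, n - n_null)    # positive spectrum on the range
H = Q[:, :-n_null] @ np.diag(d) @ Q[:, :-n_null].T  # symmetric, singular

def proj(v):
    """Project the known singular directions out of a subspace vector."""
    return v - null @ (null.T @ v)

b = proj(rng.standard_normal(n))         # consistent right-hand side

# Conjugate gradients confined to the range of H by projecting every vector
x = np.zeros(n)
r = proj(b - H @ x)
p = r.copy()
rs = r @ r
for _ in range(500):
    Ap = proj(H @ p)
    alpha = rs / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-12:
        break
    p = r + (rs_new / rs) * p
    rs = rs_new

# Matches the minimum-norm least-squares solution
assert np.allclose(x, np.linalg.pinv(H) @ b, atol=1e-8)
```

The paper's solver uses DIIS with a projecting preconditioner rather than CG; the sketch only shows why projecting known singular directions suffices to pin down the least squares solution.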

Although the primary difficulty in solving the coupled perturbed equations is the efficient computation of matrix-vector products involving the Hessian and a trial vector, there is another difficulty worth mentioning. In a basis of determinants, the Hessian is typically indefinite and always singular. Consider Eqs. (59) and (60). If *ω*_{Ψ} = *ω*_{Φ}, then the matrix will be singular. In addition, there are linear dependencies in the Hessian, as *c*_{ΨΦ} = −*c*_{ΦΨ}, that introduce additional singularities. Similarly, if *ω*_{Ψ} < *ω*_{Φ} and *E*^{Ψ} < *E*^{Φ}, or vice versa, then the matrix could be indefinite. This occurs because states with a different spin multiplicity are often lower in energy than the states of a given spin multiplicity included within the state-averaged space. When working in a basis of CI eigenvectors, we can guarantee that the Hessian is positive definite by removing these terms from the coupled perturbed equations, as long as the state weights are nonincreasing within a given spin block. This is possible because the Lagrange multipliers for these terms are either 0 or can be determined from another Lagrange multiplier. This transformation is shown pictorially in Figure 1. Unfortunately, we cannot cleanly excise these terms from the coupled perturbed equations when working in a basis of determinants and are forced to solve an indefinite system of linear equations with infinitely many valid solutions. Of course, only the least squares solution will produce the correct gradient or nonadiabatic coupling vector. There are a variety of possible ways to obtain the least squares solution. We use direct inversion in the iterative subspace (DIIS) to solve the coupled perturbed equations and employ a preconditioner, **M**, that projects the singular directions out of each subspace vector. This preconditioner is block diagonal, with separate orbital and CI blocks. The orbital block consists of the partially constructed portion of $\mathbb{H}^{\mathbf{OO}}$ (the non-braced terms in Eqs. (16)–(21)) and will be called $\mathbf{M}^{\mathbf{O}}$. $\mathbf{M}^{\mathbf{O}}$ cannot be inverted analytically and sparsely and is instead inverted numerically using a preconditioned conjugate gradient routine during each iteration of the coupled perturbed equations. The CI block of the preconditioner can be written as

where

and

is the projector responsible for removing rotations between states inside and outside the state-averaged space. As demonstrated in prior work,^{36} $\mathbf{M}^{\mathbf{C}}$ can be inverted analytically in the following way:

Since $\bar{\mathbf{M}}^{\mathbf{C}}$ is diagonal, it is trivial to invert and store this matrix, and the CI eigenvectors for states Φ and ϓ are also readily available. The matrix **Q** defines the overlap between each CI eigenvector pair and $[\bar{\mathbf{M}}^{\mathbf{C}}]^{-1}$. It can be expressed as

There is no analytical expression for the inverse of Eq. (65), but this matrix is only $N_s^2$ in size, which is quite small. As such, one can perform a dense inversion of **Q** to determine **Q**^{−1}. The preconditioner enables us to omit certain components of $\mathbb{H}^{\mathbf{CC}}$ and $\mathbb{H}^{\mathbf{OC}}$, as previously mentioned.
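The structure exploited here — a trivially invertible diagonal part corrected by a low-rank term whose inverse requires only a small $N_s \times N_s$ dense inversion — is the same structure captured by the Woodbury matrix identity. The sketch below is a generic illustration of that structure with random stand-in matrices, not the paper's exact $\mathbf{M}^{\mathbf{C}}$:

```python
import numpy as np

rng = np.random.default_rng(6)
n_det, n_s = 200, 3

d = rng.uniform(1.0, 10.0, n_det)        # diagonal part: trivial to invert/store
U = rng.standard_normal((n_det, n_s))    # low-rank correction, rank n_s
C = np.eye(n_s)
M = np.diag(d) + U @ C @ U.T             # diagonal plus low-rank

# Woodbury: (D + U C U^T)^-1 = D^-1 - D^-1 U (C^-1 + U^T D^-1 U)^-1 U^T D^-1
Dinv_U = U / d[:, None]
Q = np.linalg.inv(C) + U.T @ Dinv_U      # small n_s x n_s matrix, densely invertible

def apply_Minv(v):
    """Apply M^-1 using only the diagonal inverse and a small dense solve."""
    w = v / d
    return w - Dinv_U @ np.linalg.solve(Q, U.T @ w)

v = rng.standard_normal(n_det)
assert np.allclose(apply_Minv(v), np.linalg.solve(M, v))
```

The full determinant-length matrix is never inverted or stored; only the diagonal and the small dense block are touched, which mirrors why the preconditioner above remains cheap even for very large determinant spaces.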

### C. Computation of the nuclear gradient and nonadiabatic coupling vector

Once the coupled perturbed equations are solved, the procedure to calculate the gradient and nonadiabatic coupling vector for a system is essentially identical to the procedure presented in our previous work.^{14} However, we repeat much of this work for the sake of completeness. The nuclear gradient can be expressed as

where *ξ* is an arbitrary electric field or nuclear perturbation and **γ**^{e}, **Γ**^{e}, and **X**^{e} are the relaxed (or effective) OPDM, TPDM, and CI Lagrangian, respectively. The effective OPDM is straightforward to compute and store, taking the form

where

The use of Eq. (69) to construct the CI correction to the OPDM is one of the few modifications to our prior algorithm, apart from all the changes to the coupled perturbed equations just discussed. Importantly, the second term in Eq. (69) only appears when unequal weights are employed. The effective TPDM can be expressed in a very similar way,

where

The use of Eq. (72) is the other modification required to make our algorithm direct-compatible. Again, the second term only appears in the case where unequal weights are utilized. Due to the size of the TPDM, these matrices are never explicitly computed. One can derive a factorization for the effective TPDM in terms of contributions to the effective OPDM as well as the complete eigenvalue decomposition of the active portion of the various TPDM contributions. This is accomplished by inserting the factorization of the TPDM into Eqs. (70)–(72),

where the first term can be further factorized by taking the eigenvalue decomposition of the all-active state-specific TPDM,

A similar factorization exists for the state-averaged TPDM and two-particle transition density matrices. The result is the following matrix factorization for the effective TPDM:

Here, the "*act*" and "*core*" superscripts generally denote that only terms with exclusively active or core indices are nonzero. There is one exception to this rule,

Clearly, *p* and *q* need not both be active indices for the terms in $\tilde{\bm{\gamma}}^{act}$ to be nonzero, but $\tilde{\bm{\gamma}}^{act}$ is constructed from a state-averaged matrix in which all nonzero matrix elements have active indices. In practice, the state-specific and CI portions of the all-active TPDM are combined in order to reduce the cost of the calculation, but we separate them in Eqs. (75)–(77) for clarity.
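The eigenvalue-decomposition factorization used for the all-active TPDM blocks can be sketched generically for any real symmetric matrix: decompose once, then rebuild (or contract) from eigenvalue-weighted rank-one outer products. The matrix below is a random symmetric stand-in for a TPDM block viewed as a matrix over compound indices (*tu*),(*vw*):

```python
import numpy as np

rng = np.random.default_rng(7)
n_act = 5
dim = n_act * n_act  # compound (tu) index for an all-active TPDM block

A = rng.standard_normal((dim, dim))
gamma2 = 0.5 * (A + A.T)  # symmetric stand-in for Gamma_{(tu),(vw)}

evals, evecs = np.linalg.eigh(gamma2)

# The matrix is exactly the sum of its eigenvalue-weighted rank-one terms
rebuilt = sum(w * np.outer(v, v) for w, v in zip(evals, evecs.T))
assert np.allclose(rebuilt, gamma2)
```

Contracting each rank-one term with integrals separately is what lets the effective TPDM be handled without ever materializing the full four-index object.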

With the effective OPDM and TPDM defined, the effective CI Lagrangian can be expressed as

The state-specific contribution to the effective CI Lagrangian $(\mathbf{X}^{\Theta})$ is trivial to recover,

All other elements are zero. The terms appearing in Eqs. (80) and (81) were previously computed to solve the coupled perturbed equations, and as a result, this contribution to the CI Lagrangian is easy to compute. The CI correction to the effective CI Lagrangian $(\bar{\mathbf{X}})$ is also quite straightforward to compute, exhibiting a very similar form to Eqs. (80) and (81),

Again, all other terms are zero. There is no such convenient expression for the computation of $\tilde{\mathbf{X}}$, but this term can be recovered with some additional effort,

where

At least three indices in Eqs. (85)–(88) must correspond to active orbitals in order for the elements to be nonzero, and the identity of each index that does not need to be active is denoted by the superscript (e.g., 3 indicates that the third index need not be active). To emphasize this point, the active orbital indices *t*, *u*, and *v* are used when possible. Although *p* usually designates a general MO index, *p* must correspond to an active MO in Eqs. (86)–(88) for the element to be nonzero. Note that *p* is a general MO index in Eq. (85). We retain this index, however, in order to have a clean expression for the orbital correction to the CI Lagrangian, where the first index in each factorized density matrix is a general MO index. Because the terms defined in Eqs. (85)–(88) must have three active indices to be nonzero, only ERIs with at most two general MO indices are required to recover the contribution to the CI Lagrangian arising from these terms, and these ERIs were previously built to solve the coupled perturbed equations. The remaining contribution to the orbital correction to the CI Lagrangian can then be recovered using a combination of **J** and **K** builds and matrix products, as shown in Eq. (84). The **J** and **K** builds are implicitly transformed into the MO basis for the sake of brevity, but these calculations are performed in the AO basis.

With the effective OPDM, TPDM, and CI Lagrangian defined, it is possible to compute the nuclear gradient for a specific electronic state. The one-electron contributions to the gradient are obtained by transforming the effective OPDM and CI Lagrangian into the AO basis and contracting with the derivative one-electron integrals and overlap integrals, respectively. As detailed in our previous work, one can recover the two-electron contribution to the gradient using derivative **J** and **K** matrices,

where **A** and **B** are arbitrary matrices. To accomplish this, one needs only to transform each matrix pair in Eqs. (75)–(77) into the AO basis and contract with the appropriate **J** and **K** derivative matrix. In Eqs. (75)–(77), a **J** gradient evaluation is required when *p*/*q* are paired and *r*/*s* are paired. Otherwise, a **K** gradient evaluation is required

where the superscript *T* indicates the matrix transpose. This is important because the orbital Lagrange multiplier matrix, $\bar{\bm{\kappa}}^{\Theta}$, is antisymmetric, much like the orbital gradient.
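The index-pairing rule can be illustrated with a toy contraction (assuming a random tensor `T` with the eight-fold permutational symmetry of real integrals as a stand-in for the derivative ERIs; all names here are hypothetical): the **J**-type and **K**-type terms differ only in which indices of the matrix pair are matched, and the integral symmetry already guarantees that the two orderings of each contraction coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Toy derivative-ERI stand-in with the permutational symmetry of real
# integrals: (pq|rs) = (qp|rs) = (pq|sr) = (rs|pq).
T = rng.standard_normal((n, n, n, n))
T = T + T.transpose(1, 0, 2, 3)
T = T + T.transpose(0, 1, 3, 2)
T = T + T.transpose(2, 3, 0, 1)

A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# J-type term: p/q paired on A, r/s paired on B (Coulomb-like).
j_grad = np.einsum('pq,rs,pqrs->', A, B, T)
# K-type term: p/s paired on A, q/r paired on B (exchange-like).
k_ab = np.einsum('ps,qr,pqrs->', A, B, T)
k_ba = np.einsum('ps,qr,pqrs->', B, A, T)

# The integral symmetry makes both orderings identical.
assert np.isclose(j_grad, np.einsum('pq,rs,pqrs->', B, A, T))
assert np.isclose(k_ab, k_ba)
```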

As demonstrated by Yarkony and co-workers^{33,34} and discussed in our previous work,^{14} the calculation of the nonadiabatic coupling vectors is essentially identical to that of the gradient, with a few minor changes. To determine the NAC vector for states Θ and Λ, the coupled perturbed equations become

The equation for the NAC vector is quite similar to that of the gradient and can be expressed as

The crescent above the transition density matrix denotes that this matrix is not symmetrized. The relaxed transition density matrices are contracted with symmetric derivative electron integrals; for this reason, any contribution to the effective transition density matrices can be symmetrized. The overlap-like matrix in the final term, however, is antisymmetric, so this transition density matrix cannot be symmetrized. In addition to the differences already noted, the effective density matrices and their corresponding factorization are different, with the state-specific density matrix replaced by a transition density matrix,

As noted in our prior work, $\bar{\bm{\Gamma}}$ is simply a weighted sum of transition density matrices. As a result, the factorization of $\bm{\Gamma}^{\Theta\Lambda}$ takes the same form as Eq. (77), as opposed to Eq. (75), and the two can be combined during the computation of the NAC vector, reducing the number of derivative **J** and **K** gradient evaluations required. Similarly, the equation for the contribution to the effective CI Lagrangian takes the form of Eqs. (82) and (83), as opposed to Eqs. (80) and (81), and can be combined with the calculation of the CI contribution to the effective CI Lagrangian. Otherwise, the gradient and NAC vector programs are exactly the same.

### D. Other implementation details

Unrelated to our direct-compatible algorithm, we also made some technical improvements to our previous algorithm. Both the **K** builds and **K** gradient builds are substantially more efficient when the input matrices are symmetric and identical, because these routines can capitalize on this structure to reduce the number of required operations. As a result, multiple variants of the **K** matrix and **K** gradient builds are available in our code. In the case of the **K** matrix build, there are two routines, one for symmetric and one for asymmetric input matrices; the symmetric routine is faster than the asymmetric routine by a factor of two. During the iterative solution of the coupled perturbed equations, we build a pair of matrices called **D**_{1} and **D**_{2}. In our original implementation,^{14} these matrices are not symmetric, but symmetrizing them substantially reduces the cost of the **K** build

Of course, a minor modification to our algorithm is necessary to ensure that the result is the same. This modification is present in lines 10b–10g of Algorithm 1. We also capitalize on the symmetric **K** build routine to reduce the cost of computing ERIs of the form (*pt|qt*). A good implementation will further take advantage of the fact that (*pq|tu*) = (*pq|ut*) and (*pt|qu*) = (*qu|pt*), reducing the number of **J** and **K** builds required to build the ERIs by roughly a factor of two. In the case of the **K** gradient build, there are three routines. The most general routine requires no information about the input matrices. The second requires the input matrices to be identical but not necessarily symmetric; it is twice as fast as the most general routine. The fastest routine requires that the input matrices be identical and symmetric and is faster than the most general routine by a factor of four. We can reformulate the matrix factorizations in Eqs. (75)–(77) in terms of identical matrix pairs by recognizing that

This simultaneously reduces the number of **K** gradient evaluations required and the cost of each evaluation. Eq. (99) is essentially a fourth **K** gradient routine, for the case where the input matrices are different but both are symmetric. This routine is twice as fast as the most general routine, improving our computation of the **K** gradient by roughly a factor of two. Note also that $\mathbf{K}(\mathbf{A},\mathbf{B})^{\xi}=\mathbf{K}(\mathbf{B},\mathbf{A})^{\xi}$, reducing the number of **K** gradient evaluations required. We can further reduce the number of **K** gradient calls by noticing that

where

Of course, Eqs. (100)–(102) are only relevant for the computation of nuclear gradients.
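The reformulation behind Eq. (99) rests on a polarization identity that holds for any bilinear **K**-gradient contraction: two evaluations with identical, symmetric inputs recover the sum of the two cross terms. A toy numerical check (with a random tensor standing in for the derivative integrals, so only bilinearity is assumed; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
T = rng.standard_normal((n, n, n, n))  # stand-in for the derivative ERIs

def k_grad(X, Y):
    """Toy bilinear K-gradient contraction: sum_{pqrs} T_pqrs X_ps Y_qr."""
    return np.einsum('ps,qr,pqrs->', X, Y, T)

A = rng.standard_normal((n, n)); A = A + A.T  # symmetric inputs
B = rng.standard_normal((n, n)); B = B + B.T

# Two identical-input (symmetric) evaluations recover both cross terms,
# which is the quantity needed when both index orderings appear in the
# factorization.
lhs = 0.5 * (k_grad(A + B, A + B) - k_grad(A - B, A - B))
assert np.isclose(lhs, k_grad(A, B) + k_grad(B, A))
```

Combined with the symmetry $\mathbf{K}(\mathbf{A},\mathbf{B})^{\xi}=\mathbf{K}(\mathbf{B},\mathbf{A})^{\xi}$ noted above, this sum is simply twice the single cross term.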

**Algorithm 1.** Computation of the SA-CASSCF nuclear gradient.

1: Obtain CI eigenvectors and the state energies^{a}

2: Construct the active-by-active subset of the ERIs: $(pq|tu)$ and $(pt|qu)$ ▶ Eqs. (24) and (25)

3: Construct the partially built Hessian and components of the preconditioner, **M**:

3a: Obtain $\bm{\gamma}^{\mathrm{SA}}$, $\bm{\Gamma}^{\mathrm{SA}}$, $\mathbf{F}^{\mathrm{core}}$, and $\mathbf{F}^{\mathrm{act}}(\bm{\gamma}^{\mathrm{SA}})$ ▶ Eqs. (2), (3), (22), and (23)

3b: Construct the orbital component of the preconditioner: $\mathbf{M}^{\mathbf{O}}$ ▶ non-braced parts of Eqs. (16)–(21)

3c: Construct the components of $\mathbf{M}^{\mathbf{C}}$ ▶ Eqs. (61)–(65)

3d: For all non-redundant pairs of eigenstates Ψ and Φ:

If Ψ = Φ then compute $\bm{\gamma}^{\Psi}$, $\bm{\Gamma}^{\Psi}$, $\mathbf{F}^{\mathrm{act}}(\bm{\gamma}^{\Psi})$, and $\mathbf{g}_o^{\Psi}$ ▶ Eqs. (13)–(15), (23), (31), and (32)

If Ψ ≠ Φ then compute $\bm{\gamma}^{\Psi\Phi}$, $\bm{\Gamma}^{\Psi\Phi}$, $\mathbf{F}^{\mathrm{act}}(\bm{\gamma}^{\Psi\Phi})$, and $\mathbf{T}(\Psi,\Phi)$ ▶ Eqs. (23), (31), (32), and (35)–(37)

4: Compute $\bm{\gamma}^{\Theta}$, $\bm{\Gamma}^{\Theta}$, and $\mathbf{F}^{\mathrm{act}}(\bm{\gamma}^{\Theta})$ ▶ Eqs. (23), (31), and (32)

5: Construct the orbital gradient: $\mathbf{g}_o^{\Theta}$ ▶ Eqs. (13)–(15)

6: Initialize incremental direct inversion of the iterative subspace (I-DIIS):

$\mathbf{r}^{(0)}=-\mathbf{M}^{-1}\mathbf{g}^{\Theta}$;^{b} $\mathbf{z}^{(0)}=-\mathbf{r}^{(0)}$; $\bm{\eta}^{(0)}=0$; $\bm{\eta}_{\mathrm{inc}}^{(0)}=0$; $k=1$

Add $\mathbf{z}^{(0)}$ and $\mathbf{r}^{(0)}$ to the I-DIIS subspace and residual vector lists, respectively.

7: loop

8: If $k>1$ then $\mathbf{z}^{(k)}$ = DIIS extrapolated vector else $\mathbf{z}^{(k)}=\mathbf{z}^{(k-1)}$

9: If $k \bmod N_{\mathrm{inc}} = 0$^{c} then $\mathbf{z}=\mathbf{z}^{(k)}$ else $\mathbf{z}=\mathbf{z}^{(k)}-\mathbf{z}^{(k-1)}$

10: Build $\mathbf{H}^{\mathbf{OO}}\mathbf{z}$ contribution to $\bm{\eta}$:

10a: $\bm{\eta}_{\mathrm{inc}}^{(k)}=\mathbf{M}^{\mathbf{O}}\mathbf{z}$^{d}

10b: Build $\mathbf{D}_1$ and $\mathbf{D}_2$: ▶ Eqs. (97) and (98)

10c: $\tilde{J}_{pq}=\sum_{\mu\nu}C_{\mu p}C_{\nu q}J_{\mu\nu}(\mathbf{D}_1)$; $\tilde{K}_{pq}=\sum_{\mu\nu}C_{\mu p}C_{\nu q}K_{\mu\nu}(\mathbf{D}_1)$

10d: $\bar{J}_{pq}=\sum_{\mu\nu}C_{\mu p}C_{\nu q}J_{\mu\nu}(\mathbf{D}_2)$; $\bar{K}_{pq}=\sum_{\mu\nu}C_{\mu p}C_{\nu q}K_{\mu\nu}(\mathbf{D}_2)$

10e: $\eta_{\mathrm{inc},it}^{(k)} \mathrel{+}= 2\sum_u(2\delta_{tu}-\gamma_{tu})(4\tilde{J}_{iu}-2\tilde{K}_{iu})$

10f: $\eta_{\mathrm{inc},ia}^{(k)} \mathrel{+}= 8\bar{J}_{ia}-4\bar{K}_{ia}$

10g: $\eta_{\mathrm{inc},ta}^{(k)} \mathrel{+}= 2\sum_u\gamma_{tu}(4\tilde{J}_{ua}-2\tilde{K}_{ua})$

11: Build $\mathbf{H}^{\mathbf{CC}}\mathbf{z}$ contribution to $\bm{\eta}$: ▶ Eq. (39)

11a: $\eta_{\mathrm{inc},\Psi I}^{(k)} \mathrel{+}= \sum_{\Phi}2\delta_{\Psi\Phi}\omega_{\Psi}\bigl[\sum_J(H_{IJ}-E_{\Psi}\delta_{IJ})z_{\Phi J}\bigr]$

12: Build $\mathbf{H}^{\mathbf{OC}}\mathbf{z}$ contribution to $\bm{\eta}$: ▶ Eq. (41)

12a: $\eta_{\mathrm{inc},pq}^{(k)} \mathrel{+}= \sum_{\Psi}2\omega_{\Psi}T_{pq}(\Psi,\mathbf{z}_{\Psi})$

13: Build $\mathbf{H}^{\mathbf{CO}}\mathbf{z}$ contribution to $\bm{\eta}$: ▶ Eq. (50)

13a: Construct $\tilde{\mathbf{D}}$ and $\bar{\mathbf{D}}$ ▶ Eqs. (48) and (49)

13b: Construct $\alpha$, $\mathbf{F}^{\mathrm{act}}(\tilde{\mathbf{D}})$, $\mathbf{F}^{\mathrm{core}}(\bar{\mathbf{D}})$, and $(tu|vw)^{m}$ ▶ Eqs. (44)–(47)

13c: Set up $\tilde{\mathbf{H}}$ in order to compute $\tilde{\bm{\sigma}}$ ▶ Eqs. (51) and (52)

13d: $\eta_{\mathrm{inc},\Psi I}^{(k)} \mathrel{+}= 2\omega_{\Psi}\bigl[4\alpha c_{I\Psi}+\tilde{\sigma}_{I\Psi}-\sum_{pq,\Upsilon}c_{I\Upsilon}T_{pq}(\Psi,\Upsilon)\tilde{\kappa}_{pq}\bigr]$

14: If $k \bmod N_{\mathrm{inc}} = 0$ then $\bm{\eta}^{(k)}=\bm{\eta}_{\mathrm{inc}}^{(k)}$ else $\bm{\eta}^{(k)}=\bm{\eta}_{\mathrm{inc}}^{(k)}+\bm{\eta}^{(k-1)}$

15: $\mathbf{r}^{(k)}=\bm{\eta}^{(k)}+\mathbf{g}_o^{\Theta}$

16: if $\|\mathbf{r}^{(k)}\|_2 <$ threshold then:

17: return $\mathbf{z}^{(k)}$

18: end if

19: $\mathbf{e}^{(k)}=\mathbf{M}^{-1}\mathbf{r}^{(k)}$ ▶ Eqs. (64) and (65)

20: $\mathbf{p}^{(k)}=\mathbf{z}^{(k)}-\mathbf{e}^{(k)}$

21: Add $\mathbf{p}^{(k)}$ and $\mathbf{e}^{(k)}$ to the DIIS subspace vector and residual vector sets, respectively

22: end loop

23: Build the state-specific density matrices: $\bm{\gamma}^{\Theta}$ and $\bm{\Gamma}^{\Theta}$

24: Build the components of $\bm{\gamma}^{e}$ and the factorized variant of $\bm{\Gamma}^{e}$ ▶ Eqs. (67)–(72) and (75)–(77)

25: Perform an eigenvalue decomposition of $\bm{\Gamma}^{\Theta,\mathrm{act}}$, $\bm{\Gamma}^{\mathrm{SA},\mathrm{act}}$, and $\bar{\bm{\Gamma}}^{\mathrm{act}}$ ▶ Eq. (74)

26: Construct the effective Lagrangian: $\mathbf{X}^{e}$ ▶ Eq. (79)

27: Calculate the one-body contribution to the gradient: ▶ Eq. (66)

28a: One-electron gradient: $\sum_{pq}\gamma_{pq}^{e}(p|\hat{h}|q)^{\xi}$

28b: Overlap gradient: $-\sum_{pq}X_{pq}^{e}(p|q)^{\xi}$

29: Recover the two-electron gradient: $\sum_{pqrs}\Gamma_{pqrs}^{e}(pq|rs)^{\xi}$ ▶ Eq. (91)

29a: loop over matrix products in the factorization of $\bm{\Gamma}^{e}=\bm{\Gamma}^{\Theta}+\tilde{\bm{\Gamma}}+\bar{\bm{\Gamma}}$: ▶ Eqs. (75)–(77)

29b: Set the first and second matrices in the product equal to $\mathbf{A}$ and $\mathbf{B}$, respectively

29c: if the indices $pq$ and $rs$ are paired in Eqs. (75)–(77) then:

29d: Do a **J** gradient calculation on $\mathbf{A}$ and $\mathbf{B}$ ▶ Eq. (89)

29e: If applicable, weight the contribution to the gradient by $\lambda^{\mathrm{act}}_{P}$

29f: else if the indices $ps$ and $qr$ are paired in Eqs. (75)–(77) then:

29g: if $\mathbf{A}=\mathbf{B}$:

29h: Do a **K** gradient calculation on $\mathbf{A}$ and $\mathbf{B}$ ▶ Eq. (90)

29i: else:

29j: $\mathbf{K}(\mathbf{A},\mathbf{B})^{\xi}=\frac{1}{2}\bigl[\mathbf{K}(\mathbf{A}+\mathbf{B},\mathbf{A}+\mathbf{B})^{\xi}-\mathbf{K}(\mathbf{A}-\mathbf{B},\mathbf{A}-\mathbf{B})^{\xi}\bigr]$ ▶ Eq. (99)

29k: end if

29l: end if

29m: end loop


^{a}

We only compute the CI eigenvectors and energies for states within the state-averaged space. This is also true in step 3d.

^{b}

**g**^{Θ} is the gradient of state Θ with respect to both orbital and CI parameters. The orbital portion of **g**^{Θ} is equal to the orbital gradient, $\mathbf{g}_o^{\Theta}$, and the CI portion of **g**^{Θ} is 0.

^{c}

*N*_{inc} is a user-defined variable that determines how often we need to build $\bm{\sigma}_{\mathrm{inc}}^{(k)}$ from the full trial vector, as opposed to an incremental update to the new trial vector. We use an incremental algorithm to increase the speed of the calculation, as each incremental update becomes sparser. It is necessary to periodically build $\bm{\sigma}_{\mathrm{inc}}^{(k)}$ from the full trial vector to avoid numerical instability. In our algorithm, we use *N*_{inc} = 8.

^{d}

The matrix $\mathbf{M}^{\mathbf{O}}$ is the partially constructed portion of $\mathbf{H}^{\mathbf{OO}}$. This is equivalent to what we referred to as the orbital-orbital part of **H**^{[2]} in Ref. 14.

## III. COMPUTATIONAL DETAILS

The extension of our AO-based SA-CASSCF gradient and nonadiabatic coupling program to treat large active spaces has been implemented within the TeraChem electronic structure package.^{46–48} Timings for this program were obtained on Intel Xeon E5-2643 CPUs clocked at 3.30 GHz with either NVIDIA GeForce GTX 970 or TITAN X (Pascal) GPUs. Unless otherwise noted, all calculations were performed on a single GPU in double precision. In Section IV, we will use the notation SA-*N*_{s}-CAS(*m*,*n*) to describe the details of the SA-CASSCF wavefunction, where *m* is the number of active electrons, and *n* is the number of active orbitals. Molecular geometries for microsolvated polyenes were generated using the Amber 12 simulation package^{49} by equilibrating the system using constant temperature molecular dynamics and minimizing the final frame. The general AMBER force field (GAFF)^{50,51} was used to parameterize the solute, and the force field parameters for methanol were taken from previous work.^{52} The structure for Dronpa-2 was taken from the protein data bank crystal structure 4UTS.^{53} Standard amino acid residues were parameterized using the ff99SBildn force field,^{54–56} while parameters for the chromophore were developed using a combination of GAFF and RESP charges from the R.E.D. Server.^{57–61} To determine the protonation state of the various titratable residues, we employed the PDB2PQR Server.^{62} The protein was solvated using the SPCFw flexible water model.^{63} Dronpa-2 was treated using QM/MM calculations,^{64–68} which were recently introduced in a developmental version of TeraChem. The MM portion of the calculation is carried out by the OpenMM molecular dynamics package,^{69–73} which is linked to TeraChem as a library using an API. 
We implemented the equations for the interaction between the QM and MM regions^{74,75} following the QM/MM implementation in Amber and interfaced OpenMM with TeraChem for the evaluation of QM, MM, and QM/MM terms in the energy and forces. Geometry optimizations and conical intersection searches were performed using the DL-FIND optimization package.^{76} We employed the gradient projection algorithm developed by Bearpark and Robb to search for conical intersections.^{26} The algorithm which we used to compute the gradients is presented as Algorithm 1. As in our previous work, the algorithm to compute the coupling vector is essentially identical to that used to compute the gradient, and the differences are highlighted in Algorithm 2.

**Algorithm 2.** Computation of the SA-CASSCF nonadiabatic coupling vector between states Θ and Λ (differences from Algorithm 1).

1-3: Same as Algorithm 1

4: Compute $\bm{\gamma}^{\Theta\Lambda}$, $\bm{\Gamma}^{\Theta\Lambda}$, and $\mathbf{F}^{\mathrm{act},\Theta\Lambda}$ ▶ Eqs. (23), (31), and (32)

5: Construct the orbital interstate coupling gradient: $\mathbf{T}(\Theta,\Lambda)$ ▶ Eqs. (35)–(37)

6-22: Same as Algorithm 1, but replace $\mathbf{g}_o^{\Theta}$ with $\mathbf{T}(\Theta,\Lambda)$

23: Build the transition density matrices: $\bm{\gamma}^{\Theta\Lambda}$, $\bm{\Gamma}^{\Theta\Lambda}$, and $\breve{\bm{\gamma}}^{\Theta\Lambda}$

24-26: Same as in Algorithm 1, but replace $\bm{\gamma}^{\Theta}$ and $\bm{\Gamma}^{\Theta}$ with $\bm{\gamma}^{\Theta\Lambda}$ and $\bm{\Gamma}^{\Theta\Lambda}$^{a}

27: Scale the effective density matrices and Lagrangian by $(E^{\Theta}-E^{\Lambda})^{-1}$

28-30: Calculate the nonadiabatic coupling vector: ▶ Eq. (94) (Same as lines 27-29 in Algorithm 1)

31: Antisymmetric overlap contribution: $\sum_{pq}\breve{\gamma}_{pq}^{\Theta\Lambda}\bigl[(p|q^{\xi})-(p^{\xi}|q)\bigr]$


## IV. RESULTS

We tested the efficiency of our direct-compatible, atomic orbital-based SA-CASSCF gradient program by applying it to a series of conjugated polyenes in droplets of up to 250 quantum methanol molecules. The microsolvated conjugated polyenes, octatetraene, decapentaene, and dodecahexaene, were treated at the SA-4-CAS(8,8)/6-31G, SA-5-CAS(10,10)/6-31G, and SA-5-CAS(12,12)/6-31G levels of theory, respectively. The highest excited state in each calculation corresponds to the first optically bright state, and all π and π^{*} molecular orbitals are included within the active space. A representative subset of these systems is displayed in Figure 2. In Figure 3, the scaling of computational effort with respect to the number of basis functions is reported for the ERIs, effective Lagrangian, **J** gradient, and **K** gradient. These terms are described in Eqs. (24), (25), (79), (89), and (90), respectively. In addition, we report scaling with respect to basis set size for the partial construction of the Hessian (labeled “Hessian”), the first iteration of the coupled perturbed equations (labeled “CP-MCSCF”), and the entire gradient calculation (assuming one iteration of the coupled perturbed equations is required, labeled “Total”). Typically, 10-20 iterations of the coupled perturbed equations are required to reach convergence. Even so, the construction of the ERIs remains the dominant cost, as can be seen in Figure 3 and will be discussed shortly. The partially constructed Hessian consists of the non-braced terms in Eqs. (16)–(21) as well as the orbital gradients, $\mathbf{T}(\Psi,\Psi)$, and interstate couplings, $\mathbf{T}(\Psi,\Upsilon)$, required in Eq. (50). The computationally expensive portion of each iteration of the coupled perturbed equations consists of a **J** and **K** matrix evaluation for each matrix defined in Eqs. (97) and (98) as well as the evaluation of Eqs. (39), (41), and (50).
The gradient exhibits approximately *O*($N_o^{2}$) scaling, with the Hessian and **K** gradient construction exhibiting the worst scaling. The scaling of the entire gradient calculation tends to decrease with the size of the active space because more time is spent computing the relatively weakly scaling ERIs. However, the scaling of the calculation depends significantly on the sparsity of the density matrices in the AO basis. The decapentaene series exhibits lower scaling than the dodecahexaene series, despite the use of a smaller active space. This occurs because the decapentaene systems happen to have a sparser representation in the AO basis, likely by chance, and most of the contributions to the gradient exhibit relatively weaker scaling. Otherwise, the scaling behavior reported here is essentially the same as that reported in our previous work,^{13,14} further demonstrating that the scaling remains consistent as the active space grows.

In contrast to the results presented in our previous work,^{14} the ERI builds are the most expensive operation in the computation of the gradient for systems treated with large active spaces. Although the ERI builds exhibit relatively weak scaling with system size, the number of ERIs required to solve the coupled perturbed equations increases quadratically with the number of active molecular orbitals. The size of the partially constructed Hessian and the number of **J** gradient calls also increase quadratically with the number of active orbitals, but these terms exhibit much lower prefactors than the ERIs. This additional cost is an unavoidable consequence of employing large active spaces and has motivated the development of semi-numerical^{77} techniques and density fitting^{78,79} approximations within CASSCF. Although it is possible that our algorithm could also benefit from these approximations, our SA-CASSCF energy and response property implementation is currently the fastest and lowest-scaling such algorithm and requires neither density-fitting approximations^{78,79} nor overly stringent convergence thresholds.^{77} Moreover, there is likely still sparsity that we are not exploiting, which would reduce the cost of recovering the contributions from the ERIs. Nonetheless, our current algorithm is efficient enough to perform calculations on systems with *O*(10) active orbitals, *O*(1000) basis functions, and *O*(10^{8}) determinants on a single GPU in hours, significantly advancing the scale of simulations possible at the SA-CASSCF level of theory.

In our previous work,^{23} we developed efficient routines to compute a *𝝈* vector, the matrix-vector product between the CI Hamiltonian matrix and a trial vector, which scaled linearly with the number of determinants. The construction of the OPDM and TPDM follows a very similar algorithm. As noted in Section II, the matrix-vector products between the orbital-CI and CI-CI blocks of the Hessian and a trial vector can be written in terms of *𝝈* vector builds and generalized OPDM and TPDM builds. The scaling of $\mathbf{H}^{\mathbf{CO}}\mathbf{z}$, $\mathbf{H}^{\mathbf{OC}}\mathbf{z}$, and $\mathbf{H}^{\mathbf{CC}}\mathbf{z}$ with respect to the number of determinants is displayed in Figure 4. These calculations were performed on hexadecaoctaene treated at the SA-2-CAS(16,12)/6-31G through SA-2-CAS(16,16)/6-31G levels of theory. For the SA-2-CAS(16,16)/6-31G calculation, all π and π^{*} molecular orbitals were included within the active space, and the smaller active spaces include the lowest π^{*} molecular orbitals in addition to all π orbitals. In accordance with our previous work, the construction of $\mathbf{H}^{\mathbf{CO}}\mathbf{z}$ and $\mathbf{H}^{\mathbf{CC}}\mathbf{z}$ exhibits approximately linear scaling with respect to the number of determinants. This is expected because the construction of these blocks is dominated by *𝝈* vector builds, as demonstrated in Eqs. (39) and (50). The construction of $\mathbf{H}^{\mathbf{OC}}\mathbf{z}$ is somewhat more expensive and exhibits slightly worse (but still nearly linear) scaling of *O*($N_d^{1.15}$), because computing a TPDM is somewhat more involved than computing a *𝝈* vector. Although the cost of each of these blocks increases nearly linearly with the number of determinants, the number of determinants itself increases factorially with the size of the active space, and memory requirements become an issue as the active space grows beyond 16 electrons in 16 orbitals, the largest active space we have employed.
This is an inherent limitation to all CASSCF codes and is the subject of continuing research. Despite this, our GPU-based algorithms substantially reduce the cost of performing SA-CASSCF energy and gradient calculations for both large and small active spaces.
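The factorial growth is easy to quantify: the CAS CI dimension is a product of binomial coefficients counting the independent α and β occupation strings over the active orbitals. A quick check (standard combinatorics, not code from our implementation) reproduces the CI dimensions quoted in this work:

```python
from math import comb

def n_cas_determinants(n_orb, n_alpha, n_beta):
    """Determinant count of a CAS CI space: independent alpha and beta
    occupation strings distributed over the active orbitals."""
    return comb(n_orb, n_alpha) * comb(n_orb, n_beta)

# CAS(16,16): 16 electrons (8 alpha + 8 beta) in 16 orbitals -> ~1.7e8
assert n_cas_determinants(16, 8, 8) == 165_636_900
# CAS(12,11), used for Dronpa-2 below: 213,444 CI parameters
assert n_cas_determinants(11, 6, 6) == 213_444
```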

To further demonstrate the efficiency and accuracy of our code, we apply our method to the Dronpa-2 protein solvated in a water droplet, shown in Figure 5. Dronpa-2 is a fast switching variant of the reversibly photoswitchable protein Dronpa,^{80} a homologue to the green fluorescent protein, with the M159T point mutation.^{81} Reversible photoswitchable proteins are fluorescent proteins that can be switched between optically bright and dark states numerous times, which typically involves isomerization of the chromophore, and are important due to their use in super resolution spectroscopy.^{82–86} Dronpa-2 has been the subject of a variety of experiments,^{81,87–90} which suggest that fast photoswitching is enabled by a significant reduction in steric resistance to chromophore twisting within the protein matrix. Due to its fast switching dynamics, it was also recently studied using SA-2-CAS(6,6)/3-21G with a very small quantum region (21 atoms).^{91}

Here we perform a conical intersection search on Dronpa-2 at the SA-2-CAS(12,11)/6-31G* level of theory with a significantly expanded quantum region. This active space has been used in previous studies on the closely related green fluorescent protein chromophore.^{92,93} Our quantum region consists of the chromophore (Cys62, Tyr63, and Gly64) as well as the side chains of Arg66, Arg91, Ser142, Glu144, His193, and Glu211 (where Glu211 is protonated). Furthermore, the backbone NH atoms from Asp65 and CO atoms from Phe61 in addition to two crystal water molecules were included in the quantum region, as displayed in Figure 5 embedded within the broader protein matrix. A total of 122 atoms are included in the quantum region for this calculation. Although this system has roughly half as many molecular orbitals as the microsolvated butadiene system from our previous paper,^{14} it has over 5000 times more CI parameters (213,444 total). The total number of parameters optimized during the coupled perturbed equations is 608,507, and as such, this is one of the largest conical intersection searches ever performed at the SA-CASSCF level of theory. Moreover, each step of the conical intersection search required on average 20 min on 8 NVIDIA GTX 970 GPUs in mixed precision,^{94} placing systems of this scale within the realm of possibility for nonadiabatic dynamics.

The MECI for Dronpa-2 is plotted in Figure 6. It is peaked or very slightly sloped, indicating that radiationless decay is likely to be quite fast once the chromophore twists. However, the fluorescence quantum yield for Dronpa-2 is 0.33,^{81} and there is expected to be a modest barrier to rotation. The gradient difference and coupling vectors correspond to resonance and hula-twisting motions, respectively. Of these, the hula-twisting motion is responsible for isomerization and thus determines the switching yield. Experimental results^{81,87–90} demonstrate that the bright-to-dark state switching quantum yield is very low (4.7 × 10^{−2})^{81} and that the branching ratio between the bright and dark states heavily favors the bright state. These results are consistent with the intersection topology shown in Figure 6, but excited-state dynamics simulations of Dronpa-2 are needed to estimate the branching ratio computationally. This will be the subject of future work.

## V. CONCLUSIONS

We have generalized our previous algorithm for computing response properties for a SA-CASSCF wavefunction in order to enable efficient treatment of large active spaces. This algorithm simultaneously exploits the intrinsic sparsity in the AO basis, efficient direct-CI routines, and GPU acceleration. Even for very large active spaces, we observe effective *O*(N^{2}) scaling, as opposed to the formal *O*(N^{4}) scaling. Although the number of determinants increases factorially with the size of the active space, the scaling of each block of the Hessian multiplied by a trial vector is nearly linear with respect to the number of determinants. As such, we were able to carry out one of the largest SA-CASSCF gradient and nonadiabatic coupling vector calculations in a short timeframe using a single GPU. Importantly, our work demonstrates that the computationally demanding portions of the SA-CASSCF method can be formulated, essentially entirely, in terms of seven fundamental operations, including **J** and **K** matrix builds and their derivatives, as well as generalized *𝝈*, OPDM, and TPDM builds. We believe that these are fundamental operations in electronic structure theory, much like BLAS and LAPACK routines are considered fundamental operations in scientific computing.
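To make the first two of these fundamental operations concrete, the Coulomb and exchange matrices contract the AO two-electron integrals (μν|λσ) with a (possibly generalized) density matrix. The dense NumPy sketch below is an illustrative O(N^{4}) reference only, under our own naming; the production code never forms the integral tensor explicitly and instead screens integrals in a direct fashion to reach the effective O(N^{2}) cost discussed above.

```python
import numpy as np

def coulomb_exchange(eri: np.ndarray, D: np.ndarray):
    """Form Coulomb (J) and exchange (K) matrices from a density matrix D
    and the AO two-electron integral tensor eri[mu, nu, lam, sig], stored
    in chemists' notation (mu nu | lam sig).

    J_{mu nu} = sum_{lam sig} (mu nu | lam sig) D_{lam sig}
    K_{mu nu} = sum_{lam sig} (mu lam | nu sig) D_{lam sig}
    """
    J = np.einsum('mnls,ls->mn', eri, D)
    K = np.einsum('mlns,ls->mn', eri, D)
    return J, K
```

For a real, permutationally symmetric integral tensor and a symmetric density matrix, both J and K come out symmetric; in the SA-CASSCF response equations the same contraction is reused with generalized (transition) density matrices in place of the state density.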

Although our methods represent an exciting opportunity to study a variety of photochemical reactions at an unprecedented scale, we are continuing to work on methods that address the primary shortcoming of SA-CASSCF: its limited treatment of dynamic electron correlation. This includes the development of tensor hypercontraction methods^{95–100} for multistate complete active space perturbation theory (MS-CASPT2)^{101,102} as well as methods that combine aspects of SA-CASSCF and density functional theory.^{103–106} In the meantime, our SA-CASSCF code can be applied to the wide variety of systems where dynamic correlation can be neglected (often due to a judicious choice of active space and state averaging^{107–109}).

As mentioned in several of our past papers,^{13,14,23} the primary motivation for this work is to apply this code to nonadiabatic dynamics simulations. We recently developed an interface between TeraChem and full multiple spawning (FMS) nonadiabatic dynamics,^{24,25} and applied this code to the study of the photoinduced ring opening of provitamin D_{3}.^{110} At the time, this was the largest simulation of its kind and predicted a nonadiabatic mechanism that was subsequently corroborated by experiment.^{111} With the development of this integral-direct compatible program, we can now simulate nonadiabatic dynamics on molecules with large active spaces (limited primarily by memory requirements for the CI vector). Although our method has not expanded the maximum tractable active space, we can perform calculations with large active spaces faster than previous codes, especially for large molecular systems. More specifically, our method allows us to routinely perform nonadiabatic dynamics on systems with several hundred to a thousand basis functions, corresponding to systems with *O*(100) quantum atoms, in active spaces with up to 12 electrons in 12 orbitals. Moreover, we can perform critical point searches on systems with *O*(1000) basis functions and with active spaces of up to 16 electrons in 16 orbitals. Molecules that might benefit from such treatment include DNA bases, photochromic molecular switches, and transition metal complexes, where multiple roots of different electronic character as well as the ability to treat degenerate states are important. Furthermore, our implementation of QM/MM has enabled the treatment of large protein systems. In the future, we intend to apply our code to study the dynamics of photoswitchable proteins.

## SUPPLEMENTARY MATERIAL

See supplementary material for coordinates of the molecules used in this article as well as input and output files with detailed energy information.

## ACKNOWLEDGMENTS

This work was supported by the National Science Foundation (Grant No. ACI-1450179). J.W.S. is grateful to the National Science Foundation for an NSF Graduate Fellowship. This work used the XStream computational resource supported by the NSF MRI Program (Grant No. ACI-1429830). T.J.M. is a cofounder of PetaChem, LLC.