Conical intersections control excited state reactivity, and thus, elucidating and predicting their geometric and energetic characteristics are crucial for understanding photochemistry. Locating these intersections requires accurate and efficient electronic structure methods. Unfortunately, the most accurate methods (e.g., multireference perturbation theories such as XMS-CASPT2) are computationally challenging for large molecules. The state-interaction state-averaged restricted ensemble referenced Kohn–Sham (SI-SA-REKS) method is a computationally efficient alternative. The application of SI-SA-REKS to photochemistry was previously hampered by a lack of analytical nuclear gradients and nonadiabatic coupling matrix elements. We have recently derived analytical energy derivatives for the SI-SA-REKS method and implemented the method effectively on graphical processing units. We demonstrate that our implementation gives the correct conical intersection topography and energetics for several examples. Furthermore, our implementation of SI-SA-REKS is computationally efficient, with observed sub-quadratic scaling as a function of molecular size. This demonstrates the promise of SI-SA-REKS for excited state dynamics of large molecular systems.

## I. INTRODUCTION

Conical intersections^{1} (CIs) are molecular geometries where two or more potential energy surfaces intersect and the nonadiabatic coupling between these states is non-vanishing. It has been well established that CIs are dominant players in nonadiabatic transitions, which transfer energy from electronic to vibrational degrees of freedom.^{2–5} Hence, locating and characterizing CIs are crucial for understanding reaction mechanisms in photochemistry. To accomplish this task, an electronic structure method should meet several requirements. Specifically, it must (1) efficiently evaluate analytical nuclear gradients and nonadiabatic coupling (NAC) vectors,^{1} (2) include both static and dynamic electron correlation,^{6–8} and (3) provide an unbiased description of ground and excited states.^{7,9–11} Given these restrictions, the most accurate methods for describing CIs are the multistate complete active space second-order perturbation theory (MS-CASPT2)^{12} and its “extended” variant (XMS-CASPT2).^{13,14} However, these methods are computationally demanding. This has limited their application to molecules with at most tens of atoms. Recent approaches using tensor hypercontraction^{15} and graphical processing units (GPUs) promise to extend this to hundreds of atoms^{16,17} but are still under development.

REKS (spin-restricted ensemble-referenced Kohn–Sham method) and its state-averaged (SA-REKS) and state-interaction (SI-SA-REKS) variants^{18–21} are ensemble density functional approaches that meet all the requirements listed above, but with the computational cost of mean-field methods. We recently reported^{22} a formalism for analytical energy derivatives of individual states in the SA-REKS and SI-SA-REKS (also denoted as SSR) methods. In this work, we detail and benchmark our graphical processing unit (GPU) implementation of analytical energy derivatives for individual states and NAC vectors. We exploit sparsity in the atomic orbital basis to reduce the computational scaling with respect to molecular size. The formal scaling of the gradient implementation presented here is O(N^{3}), where N is the number of basis functions. However, we present timings that demonstrate an observed scaling of approximately O(N^{1.76}), where the decreased effort is largely due to integral screening. We are able calculate the analytical gradients of individual state energies of molecules with hundreds of atoms and more than 2500 orbitals. Moreover, we perform minimal energy conical intersection (MECI) searches on the *trans*-penta-2,4-dieniminium cation (C_{5}H_{8}N^{+}, *trans*-PSB3)^{23} with SSR and (X)MS-CASPT2. Careful inspection around the MECI shows that SSR gives the correct topography and energetics of conical intersections. A comparison of the MECI structures also indicates that SSR MECIs mimic geometric and energetic features found using MS-CASPT2 but missing in state-averaged complete active space self-consistent field (SA-CASSCF) methods.^{24–27}

This paper is organized as follows: Sec. II recapitulates the essential formulas for the implementation of analytical nuclear gradients of individual states of SA-REKS and SSR. Section III details the algorithm used in our GPU implementation. Section IV provides the computational details of the performance tests and application examples. Section V presents the computational performance of the implementation and compares SSR, XMS-CASPT2, and SA-CASSCF for local minima and MECIs of the PSB3 molecule.

## II. THEORY

Throughout this work, we employ the following notation: Ψ and Φ refer to eigenstates of the Hamiltonian. AOs are indexed with *μ*, *ν*, *κ*, *γ*, and *τ*. MOs are indexed with *p*, *q*, *i*, and *j*. The active subset of MOs is indexed with *r* and *s*, referring to the lower and upper active orbitals of REKS(2,2). The coefficients defining the MOs in the AO basis are denoted as *C*_{μi}. Energies are differentiated with respect to an external perturbation (e.g., displacement of a nuclear coordinate or application of an external field) denoted as *λ*.

### A. REKS energy

We first briefly review the essential equations for the analytical gradient calculation of individual states in SA-REKS(2,2) and SI-SA-REKS(2,2).^{13,18,19,32,33} Both methods focus on strongly correlated systems with two fractionally occupied frontier orbitals accommodating two electrons. In SA-REKS(2,2), we focus on two individual states, the ground state, which is reminiscent of the perfectly spin-paired two-configurational wavefunction of generalized valence bond theory,^{28–30}

and the lowest excited open-shell singlet (OSS) state obtained by single excitation from Eq. (1),

The energies of these two states are expressed as weighted sums over the energies of several microstates *L* (Fig. 1), $EL=E[\rho L\alpha ,\rho L\beta ]$, depending on the pure-state (single determinant) spin densities $\rho L\sigma ,\u2009\sigma =\alpha ,\beta $, as follows:^{20,21}

Here, *r* and *s* label the fractionally occupied frontier orbitals and *n*_{r} and *n*_{s} are the respective fractional occupation numbers (FONs), where the occupation numbers are constrained such that *n*_{r} + *n*_{s} = 2. The occupation patterns of the individual microstates are defined in Fig. 1. The functions

with δ = 0.4, are used to interpolate between the weakly correlated and strongly correlated limiting cases.^{19–21} Note that we will often use the single argument form indicated in the last line of Eq. (5), incorporating the constraint on the sum of *n*_{r} and *n*_{s}.

Combining the two states leads to the SA-REKS energy functional (the subscript *SA* is added to the individual state energies in the SA-REKS functional),

which is minimized with respect to the orbitals and the FONs of the fractionally occupied orbitals in the REKS ground singlet state (GSS). An equiensemble is usually set in Eq. (6), i.e.,. *w*_{GSS} = *w*_{OSS} = 1/2, for a balanced treatment of the ground and excited states.

The SI-SA-REKS(2,2) method is an extension of SA-REKS(2,2) that allows the REKS and OSS states to interact. The ground ($E1SSR$) and excited ($E2SSR$) state energies in the SI-SA-REKS method (abbreviated as SSR henceforth for brevity) are given by

where *a*_{ij} are the elements of the eigenvectors of the 2 × 2 secular problem,

with the interstate coupling element Δ_{SA} defined as

where *W*_{rs} is the off-diagonal Lagrange multiplier between the fractionally occupied orbitals in the SA-REKS one-electron equations,

Here, the Lagrange multipliers, ϵ_{pq}, are defined as

where the one-electron Fock operators $F^q$ are defined as

where *f*_{q} = *n*_{q}/2 is the average occupation number of the *q*th spin orbital and the one-electron Fock operators $F^L\sigma $ of the individual microstates are given by

In Eqs. (12) and (13), σ = α, β is the electron spin, $\u0125$ is the one-electron core Hamiltonian, $Vxc,L\sigma (r)=\delta Exc,L/\delta \rho L\sigma (r)$ is the exchange–correlation (XC) potential of the respective microstate for the spin channel σ, and $nq,L\sigma $ are the (integer) occupation numbers of the orbitals *φ*_{q}(**r**) in the *L*th microstate, as defined in Fig. 1.

Several variations of the Lagrange multipliers defined in Eq. (11), including the form for the averaged state, will be used in the equation

for an individual state *X* = REKS/OSS,

and for a microstate L,

### B. Coupled-perturbed REKS equation

Since the molecular orbitals are not variationally optimized with respect to the individual state energies of SA-REKS or SSR, analytic gradients require the solution of orbital response from the coupled-perturbed (CP) REKS equations, which boil down to solving the Z-vector equation,

The matrix **A** is given as

where

Here, *E*_{L} are the energies of the respective microstates and *x* = *n*_{r}/2. The Coulombic/exchange–correlation functional contribution in Eq. (18) is given as follows:

where the symmetrized Hartree–XC kernel $f\u0303Hxc,L\sigma \sigma \u2032$ is defined as

In Eq. (24), the four terms are Coulomb, exact exchange, DFT exchange, and DFT correlation and *g*_{HF} and *g*_{DFT} are the parameters (or functions) defining the fraction of exact and DFT exchange used in a hybrid functional (see Text S1 of the supplementary material).

The right-hand side of Eq. (17), **X**, takes a different form for SA-REKS and SSR. For SA-REKS, it is defined as

for the REKS and OSS states, respectively, and the corresponding solutions of the Z-vector equation are denoted as $ZREKS$ and $ZOSS$. For SSR, **X** is given by

where

where the Coulombic/exchange–correlation term, $(rs|\u2009f\u0303Hxct(p\u2212q)|pq)$, is defined similarly to Eq. (22) for *t = r*, *s*,

and the scalar *R*_{sr} is given as

The corresponding solutions of the SSR Z-vector equation are then denoted as $ZSSRm$, where *m* = 1, 2 for the ground and excited states, respectively.

### C. SA-REKS gradient

Given the appropriate **Z** vector, the analytical gradient of an individual state *X* (*X* = REKS/OSS) in SA-REKS can be calculated with the following equation:

where

and the matrix $TL\sigma ,\lambda $ is the derivative $\u2202\u2032\u27e8p|F^L\sigma |q\u27e9/\u2202\lambda $ transformed to the AO basis set representation,

The primed differentiation symbol denotes that only the AO basis integrals are to be differentiated and not the MO coefficients. The superscript λ denotes differentiation with respect to λ, $Vxc,L\sigma (r)=\delta Exc,L/\delta \rho L\sigma (r)$ is the XC potential of the *L*th microstate, and

are the elements of the density matrix for the *L*th microstate. Here, *g*_{HF} and *g*_{DFT} have the same definition, as in Eq. (24) (more information available in the supplementary material, Text S1). The matrices $Q(1)X$ and $Q(2)X$ are defined as

where

### D. SSR gradient

The energies of the individual SSR states are given by Eq. (8) in terms of the energies of the diabatic states ($ESAREKS$ and $ESAOSS$) and the coupling element Δ_{SA}. Therefore, the analytical gradient of the individual states of SSR is given as

Using the results for SA-REKS gradients, this can be expressed as

In Eq. (38), $ZpqSSRm$ are the elements of the vector $ZSSRm$, which is the solution of the Z-vector equation given in Eq. (17) using the right-hand side **X** given by Eq. (26). The microstate weight coefficient for the *m*th SSR eigenstate, $CLSSRm$, is defined as

and the matrices $RL\sigma SSRm$, $Q(1)SSR1$, and $Q(2)SSR1$ are given by

where $RL,\nu \mu \sigma \Delta =nrnr,L\sigma \u2212nsns,L\sigma C\nu sCr\mu \u2020$ and $RL\sigma (m)$, $Q(1)(m)$, and $Q(2)(m)$ are obtained from Eqs. (31), (34), and (35), respectively, by replacing **Z**^{X} with $ZSSRm$. The matrices $Q(1)\Delta $ and $Q(2)\Delta $ are given by

### E. Nonadiabatic coupling vectors

The two vectors that define the branching plane of the conical intersection (CI) are the gradient difference vector (GDV),

and the derivative coupling vector (DCV),

For SSR, these two vectors can be calculated as

where $\u2207E1SSR$, $\u2207E2SSR$ and $\u2207ESAREKS$, $\u2207ESAOSS$ are the gradients of the adiabatic and diabatic states, respectively. Here, we need to explicitly calculate the gradient of the coupling element Δ_{SA},

where **Z**^{Δ} is the solution of Eq. (17) with −**X**_{Δ} from Eq. (27) on the right-hand side. The quantities $RL\sigma coup$, $Q(1)coup$, and $Q(2)coup$ are obtained by replacing ^{X}**Z** with **Z**^{Δ} in Eqs. (31), (34), and (35), respectively.

In practice, we do not explicitly calculate all the gradients $\u2207E1SSR$, $\u2207E2SSR$, $\u2207ESAREKS$, $\u2207ESAOSS$, and ∇Δ_{SA} to obtain the GDV and DCV. Instead, only $\u2207E1SSR$ and ∇Δ_{SA} are explicitly formed. Then, $\u2207E2SSR=2\u2207ESA-REKS\u2212\u2207E1SSR$. This is advantageous because $\u2207ESA-REKS$ can be obtained without solving any CP equations (see Text S2 of the supplementary material). $\u2207ESAREKS$ and $\u2207ESAOSS$ can then be solved from Eq. (37), and thus, the DCV can be obtained from Eq. (47).

## III. IMPLEMENTATION USING GPUS

The success of GPUs in accelerating electronic structure methods is largely attributed to highly efficient GPU algorithms for calculating two-electron integrals.^{31–36} The details of our GPU algorithms for the **J** and **K** matrices (Coulomb and exchange operators, respectively) and the DFT quadrature have been previously published.^{37–43} These algorithms were first implemented for ground state SCF methods such as HF and DFT, which are naturally formulated with these integrals. GPU acceleration was later extended to single reference excited-state methods, such as configuration interaction singles (CIS) and time-dependent density functional theory (TDDFT).^{44} More recently, the complete active space self-consistent field (CASSCF) method was reformulated in terms of **J** and **K** matrices,^{27} which greatly increased the applicability of CASSCF to large molecular systems. Here, we show that these same primitive building blocks are sufficient for REKS, CP-REKS, and SSR gradient evaluation.

The Coulomb- and exchange-like matrices elements are given by

where **D** are usually some type of density or transition density matrices. Highly effective sorting^{45–47} and screening^{48–56} algorithms are used in building these matrices to exploit the sparsity of both electron repulsion integrals and the density matrix. Primitive Gaussian basis function pairs are sorted by their Schwarz bounds,^{57} and negligible pairs smaller than a prescreening threshold are removed. Surviving pairs are transferred to the GPUs, where the ERIs are built and contracted with the density matrices in a direct SCF approach,^{58} without storage of intermediate quantities. Because many widely available GPUs have much more efficient single precision arithmetic (compared to double precision), mixed precision^{33,39} and dynamic precision^{59} approaches have been adopted to balance performance and accuracy. ERIs with density-weighted Schwarz bounds less than a threshold, *δ*_{spdp}, are calculated with single precision and are then accumulated into the **J** or **K** matrix with double precision.

For density functional calculations, we also need to evaluate the exchange–correlation energy *E*_{xc} and its functional derivatives in various orders. The single point energy evaluation of REKS, SA-REKS, and SSR involves building *E*_{xc} and the matrix elements of its first order functional derivative, *v*_{xc}, also known as the exchange-correlation potential, in the AO basis,^{60}

In practice, these integrals are evaluated with numerical quadrature.^{37,61} The most computationally demanding parts of the quadrature evaluation, including generating quadrature weights and assembling the integral from the values on grid points, are intrinsically data parallel and well-suited to evaluation on GPUs.^{43,62} Other operations, including the evaluation of the exchange–correlation functional at each grid point, are done on the CPU.

With the basic primitive operations defined in Eqs. (49)–(51), the (SI)-(SA)-REKS energy evaluation can be implemented efficiently through a self-consistent field (SCF) procedure similar to normal Hartree–Fock or DFT calculations. Many commonly used SCF algorithms can be easily adapted for REKS. Specifically, we implemented the REKS SCF with both the original Roothaan repeated diagonalization algorithm^{63} and Pulay’s direct inversion in the iterative subspace (DIIS),^{64} as shown in Algorithms 1 and 2. Mixed^{33,39} and dynamic^{59} precision schemes for accelerating **J** and **K** builds on GPUs during SCF iterations are also adapted here for REKS. There are two features that make the REKS SCF different from the normal SCF: (1) composing an effective Fock matrix **F**^{REKS} from the microstate Fock matrices $FL\sigma $, as shown in Algorithm 3, and (2) self-consistently updating the FON, which we perform using the Newton–Raphson method. Hence, each REKS SCF iteration contains two spin-restricted Fock-like builds for microstates 1 and 2 and two spin-unrestricted Fock-like builds for microstates 3 and 5, given the relationship between microstates 3 and 4 (or 5 and 6). In total, one REKS SCF iteration includes six **J** builds, six **K** builds, and six *v*_{xc} builds.

Obtain the initial guess of MO coefficients C_{μi} |

k = 0 |

do while e_{SCF} > threshold and k < max_iteration and k ≠ 0 |

Build density matrices |

$P\mu \nu 00=\u2211p=1NcoreC\mu pC\nu p;\u2003P\mu \nu 01=P\mu \nu 00+C\mu sC\nu sP\mu \nu 10=P\mu \nu 00+C\mu rC\nu r;\u2003P\mu \nu 11=P\mu \nu 00+C\mu rC\nu r+C\mu sC\nu s$ |

Assign the microstate densities $PL\alpha ,PL\beta $ |

$(P1\alpha ,P1\beta )=(P10,P10);\u2003(P2\alpha ,P2\beta )=(P01,P01)(P3\alpha ,P3\beta )=(P4\beta ,P4\alpha )=(P10,P01);\u2003(P5\alpha ,P5\beta )=(P6\beta ,P6\alpha )=(P11,P00)$ |

Build Fock matrix (AO) for each distinct microstate on GPUs, L = 1, 2, 3, 5 Eqs. (49) and (50) |

$FL\alpha =Hcore+J(PL\alpha +PL\beta )\u2212K(PL\alpha )+vxc\alpha FL\beta =Hcore+J(PL\alpha +PL\beta )\u2212K(PL\beta )+vxc\beta F4\alpha =F3\beta ;\u2009F4\beta =F3\alpha ;\u2009F6\alpha =F5\beta ;\u2009F6\beta =F5\alpha $ |

Compute microstate energies: $EL=12\u2211\mu \nu \u2211\sigma (H\mu \nu core+FL,\mu \nu \sigma )PL,\mu \nu \sigma $ |

Update occupation number n_{r}, n_{s} (x = n_{r}/2) with the Newton–Raphson method to satisfy equations |

$\u2202EREKS\u2202xx=xmin=\u2202EREKS\u2202x+\u22022EREKS\u2202x2(xmin\u2212x)\cdots =0\u2202EREKS\u2202xx=xmin=Err\xaf\u2212Ess\xaf+\u2202fint(x)\u2202x+\u22022fint(x)\u2202x2xmin\u2212xErs2+Er\xafs\xaf2\u2212Er\xafs2\u2212Ers\xaf2=0$ |

Update C_{L}: $C1=nr2;\u2009C1=ns2;\u2009C3=C4=\u2212C5=\u2212C6=\u221212fint(nr2)$ Eq. (3) |

Compute $EREKS=\u2211L=16CLREKSEL$ Eq. (3) |

Assemble F^{REKS} from $FL\alpha ,FL\beta $ in MO space with Algorithm 3 |

$eSCF=maxi\u2260j{FijREKS}$ |

Diagonalize F^{REKS}: $FREKSC\u0303=C\u0303\epsilon $ |

Update MO coefficients: |

$C(k+1)=C(k)C\u0303$ |

k = k + 1 |

Obtain the initial guess of MO coefficients C_{μi} |

k = 0 |

do while e_{SCF} > threshold and k < max_iteration and k ≠ 0 |

Build density matrices |

$P\mu \nu 00=\u2211p=1NcoreC\mu pC\nu p;\u2003P\mu \nu 01=P\mu \nu 00+C\mu sC\nu sP\mu \nu 10=P\mu \nu 00+C\mu rC\nu r;\u2003P\mu \nu 11=P\mu \nu 00+C\mu rC\nu r+C\mu sC\nu s$ |

Assign the microstate densities $PL\alpha ,PL\beta $ |

$(P1\alpha ,P1\beta )=(P10,P10);\u2003(P2\alpha ,P2\beta )=(P01,P01)(P3\alpha ,P3\beta )=(P4\beta ,P4\alpha )=(P10,P01);\u2003(P5\alpha ,P5\beta )=(P6\beta ,P6\alpha )=(P11,P00)$ |

Build Fock matrix (AO) for each distinct microstate on GPUs, L = 1, 2, 3, 5 Eqs. (49) and (50) |

$FL\alpha =Hcore+J(PL\alpha +PL\beta )\u2212K(PL\alpha )+vxc\alpha FL\beta =Hcore+J(PL\alpha +PL\beta )\u2212K(PL\beta )+vxc\beta F4\alpha =F3\beta ;\u2009F4\beta =F3\alpha ;\u2009F6\alpha =F5\beta ;\u2009F6\beta =F5\alpha $ |

Compute microstate energies: $EL=12\u2211\mu \nu \u2211\sigma (H\mu \nu core+FL,\mu \nu \sigma )PL,\mu \nu \sigma $ |

Update occupation number n_{r}, n_{s} (x = n_{r}/2) with the Newton–Raphson method to satisfy equations |

$\u2202EREKS\u2202xx=xmin=\u2202EREKS\u2202x+\u22022EREKS\u2202x2(xmin\u2212x)\cdots =0\u2202EREKS\u2202xx=xmin=Err\xaf\u2212Ess\xaf+\u2202fint(x)\u2202x+\u22022fint(x)\u2202x2xmin\u2212xErs2+Er\xafs\xaf2\u2212Er\xafs2\u2212Ers\xaf2=0$ |

Update C_{L}: $C1=nr2;\u2009C1=ns2;\u2009C3=C4=\u2212C5=\u2212C6=\u221212fint(nr2)$ Eq. (3) |

Compute $EREKS=\u2211L=16CLREKSEL$ Eq. (3) |

Assemble F^{REKS} from $FL\alpha ,FL\beta $ in MO space with Algorithm 3 |

$eSCF=maxi\u2260j{FijREKS}$ |

Diagonalize F^{REKS}: $FREKSC\u0303=C\u0303\epsilon $ |

Update MO coefficients: |

$C(k+1)=C(k)C\u0303$ |

k = k + 1 |

Obtain the initial guess of MO coefficients C_{μi} |

k = 0 |

loop while (e_{DIIS} < threshold or $k>max_DIIS_iteration$) and k ≠ 0 |

Build density matrices Algorithm1 |

Assign the values of microstate densities $PL\alpha ,PL\beta $ Algorithm1 |

Build Fock matrices (AO) for microstates Algorithm1 |

Compute microstate energies Algorithm1 |

Update occupation number n_{r}, n_{s} Algorithm1 |

Update C_{L} Algorithm1 |

Compute E^{REKS} Algorithm1 |

Assemble F^{REKS} in MO space Algorithm 3 |

Compute the DIIS error matrix: |

$D=diag{1\cdots 1\ufe38Ncore,fr,fs,0\cdots 0\ufe38Nvirt}$ |

$ek=XSC(FREKSD\u2212DFREKS)C\u2020SX$ |

$Bik=tr(eiTek)$, e_{DIIS} = ||e_{k}||_{∞} |

Transform F^{REKS} to AO space as $F\u0303(k)=SCFREKSC\u2020S$ and store $*notes below$ |

Solve the DIIS linear equation: |

$0\u2003\u22121\u2003\cdots \u2003\u22121\u22121\u2003B11\u2003\cdots \u2003B1k\vdots \u2003\vdots \u2003\ddots \u2003\vdots \u22121\u2003Bk1\u2003\cdots \u2003Bkk\u2212\lambda \xi 1\vdots \xi k=\u221210\vdots 0$ |

Construct the interpolated Fock matrix: $F\u0303int=\u2211ik\xi iF\u0303(i)$ |

Transform back to MO space: $Fint=C\u2020F\u0303intC$ $*notes below$ |

Diagonalize the interpolated Fock matrix: $FintC\u0303=C\u0303\epsilon $ |

Update MO coefficients: $C(k+1)=C(k)C\u0303$ |

k = k + 1 |

^{*} Further details about the transformation of F^{REKS} to AO are available in Text S4 of the supplementary material. |

Obtain the initial guess of MO coefficients C_{μi} |

k = 0 |

loop while (e_{DIIS} < threshold or $k>max_DIIS_iteration$) and k ≠ 0 |

Build density matrices Algorithm1 |

Assign the values of microstate densities $PL\alpha ,PL\beta $ Algorithm1 |

Build Fock matrices (AO) for microstates Algorithm1 |

Compute microstate energies Algorithm1 |

Update occupation number n_{r}, n_{s} Algorithm1 |

Update C_{L} Algorithm1 |

Compute E^{REKS} Algorithm1 |

Assemble F^{REKS} in MO space Algorithm 3 |

Compute the DIIS error matrix: |

$D=diag{1\cdots 1\ufe38Ncore,fr,fs,0\cdots 0\ufe38Nvirt}$ |

$ek=XSC(FREKSD\u2212DFREKS)C\u2020SX$ |

$Bik=tr(eiTek)$, e_{DIIS} = ||e_{k}||_{∞} |

Transform F^{REKS} to AO space as $F\u0303(k)=SCFREKSC\u2020S$ and store $*notes below$ |

Solve the DIIS linear equation: |

$0\u2003\u22121\u2003\cdots \u2003\u22121\u22121\u2003B11\u2003\cdots \u2003B1k\vdots \u2003\vdots \u2003\ddots \u2003\vdots \u22121\u2003Bk1\u2003\cdots \u2003Bkk\u2212\lambda \xi 1\vdots \xi k=\u221210\vdots 0$ |

Construct the interpolated Fock matrix: $F\u0303int=\u2211ik\xi iF\u0303(i)$ |

Transform back to MO space: $Fint=C\u2020F\u0303intC$ $*notes below$ |

Diagonalize the interpolated Fock matrix: $FintC\u0303=C\u0303\epsilon $ |

Update MO coefficients: $C(k+1)=C(k)C\u0303$ |

k = k + 1 |

^{*} Further details about the transformation of F^{REKS} to AO are available in Text S4 of the supplementary material. |

Compute weighting factor for each block: |

Diagonal blocks: |

$wL,c\sigma =0.5CL;\u2009wL,r\sigma =0.5CL\u22c5nL,r\sigma ;\u2009wL,s\sigma =0.5CL\u22c5nL,s\sigma $ |

Off-diagonal blocks: |

$wL,pq\sigma =0.5CL\u22c5(nL,p\sigma \u2212nL,q\sigma )$ |

Assemble the effective Fock operator F^{REKS} from $FL\alpha ,FL\beta $: |

$FpqREKS=\u2211LLmax\u2211\sigma =\alpha ,\beta wL,pq\sigma FL,pq\sigma $ |

Compute weighting factor for each block: |

Diagonal blocks: |

$wL,c\sigma =0.5CL;\u2009wL,r\sigma =0.5CL\u22c5nL,r\sigma ;\u2009wL,s\sigma =0.5CL\u22c5nL,s\sigma $ |

Off-diagonal blocks: |

$wL,pq\sigma =0.5CL\u22c5(nL,p\sigma \u2212nL,q\sigma )$ |

Assemble the effective Fock operator F^{REKS} from $FL\alpha ,FL\beta $: |

$FpqREKS=\u2211LLmax\u2211\sigma =\alpha ,\beta wL,pq\sigma FL,pq\sigma $ |

The second functional derivatives of the exchange–correlation functional in the MO basis are given by

This four-index tensor is needed in TDDFT^{65} (or TDDFT/TDA^{66}), CPKS (coupled-perturbed Kohn–Sham),^{67} and our CPREKS equations. Expanded analytical expressions of Eq. (52) in terms of *ρ*_{σ}, *γ*_{σ}, $\u22022\epsilon xc\u2202\rho \sigma \u2202\rho \sigma \u2032$, $\u22022\epsilon xc\u2202\gamma \sigma \sigma \u2032\u2202\gamma \sigma \u2033\sigma \u2034$, and $\u22022\epsilon xc\u2202\gamma \sigma \sigma \u2032\u2202\rho \sigma \u2032$ are needed for implementation and have been provided previously.^{68} In our GPU implementation,^{44} we compute this tensor in the contracted form in the AO basis,

where the second functional derivatives at quadrature grid points are evaluated for a reference state with density **P**^{α}, **P**^{β}. These derivatives are evaluated only once and saved in memory. The density matrices **D**^{α}, **D**^{β} are usually updated through iterations of certain iterative solvers such as Davidson^{69} for TDDFT or conjugate gradient (CG), for CPKS and CPREKS. In each iteration, the saved functional derivatives are contracted with **D**^{α}/**D**^{β} and the basis function values and then summed over the quadrature grid to form matrix elements. For CPREKS, we denote this integral for each microstate *L* as

The CPREKS equation [Eq. (17)] is a large scale, symmetric positive definite linear system, where the matrix **A** is of dimension *M*x*M*, where *M* = *O*(*n*_{occ}*n*_{virt}) and *n*_{occ}*/n*_{virt} are the numbers of occupied/virtual orbitals, respectively. Following standard practice, we use the iterative preconditioned conjugate gradient^{70} (PCG) method to avoid the $O(nocc3nvirt3)$ cost of solving Eq. (17) with direct inversion. In each iteration, we evaluate the inner product of **A** and trial vector **Y**, which eventually converges to **Z**. The whole procedure is shown in Algorithm 4. The most crucial part in PCG is the evaluation of the **AY** product, which can be efficiently implemented with the following expression:

where matrix **A** is partitioned into **A**^{1e} and **A**^{2e}, involving only the one-electron and two-electron integrals, respectively. In Eq. (57), the auxiliary matrix $\Lambda L\sigma \u2032$ is defined as

where auxiliary matrix **R** in Eq. (58) has the same form as ^{X}**R** defined in Eq. (31) but with **Z**^{X} replaced by trial vector **Y**. The pseudocode for calculating **AY** is shown in Algorithm 6. When MO integrals are needed, these are obtained by transformation of mean-field operators in the AO basis (e.g., J and K matrices for specified input density matrices). The cost for evaluating the above integrals can be significantly decreased by leveraging relationships between the spin densities of different microstates. First, the evaluation of microstates L = 4, 6 can be omitted because they have symmetric spin density with L = 3 and 5, respectively. Second, the cost of evaluating integrals for closed-shell microstates (L = 1, 2) is halved since the auxiliary density matrix, $RL\sigma $, is symmetric in spin, and $K\mu \nu (D)$ only needs to be evaluated once for these microstates. The cost for $ML,\mu \nu \sigma ,(Da,D\beta )$ [Eq. (54)] can also be reduced. On the one hand, at each quadrature grid point, because $PL\alpha =PL\beta $ for L = 1, 2, some of the second functional derivatives can be reused (e.g., $\u22022exc\u2202\rho \alpha \u2202\rho \alpha =\u22022exc\u2202\rho \beta \u2202\rho \beta $); on the other hand, the second functional derivatives only need to be contracted with $RL\sigma $ and are only evaluated (and stored for future use) in the first iteration of CPREKS. With these simplifications, the computation of one **AY** product for CPREKS takes roughly 3x/6x the time for an **AY** build in CP-UKS (CP-unrestricted Kohn–Sham) and CP-RKS (CP-restricted Kohn–Sham), respectively.

Build right-hand side: |

Build X_{GSS}, X_{OSS} Eq. (25) |

if SI-SA-REKS: |

Build X_{Δ} using Algorithm 5 Eq. (27) |

Build $XSSRm$ Eq. (26) |

Initialize the preconditioned conjugate gradient (PCG): |

$D\u22121=diag{1/dij};\u2009dij=Aij,ijle$ Eq. (18) |

$r(0)=X\u2212AZ(0);\u2009p(1)=z(0)=D\u22121r(0);\u2009k=1$ |

do while $r(k)22$ < Threshold |

if k > 1: |

$p(k)=z(k\u22121)+(r(k))Tz(k)(r(k\u22121))Tz(k\u22121)p(k\u22121)$ |

Construct Ap^{(k)} with Algorithm 6 |

$\mu =rz(k)(p(k))TAp(k)$ |

$Z(k)=Z(k\u22121)+\mu p(k)$ |

$r(k)=r(k\u22121)\u2212\mu Ap(k)$ |

z^{(k)} = D^{−1}r^{(k)} |

k = k + 1 |

Build right-hand side: |

Build X_{GSS}, X_{OSS} Eq. (25) |

if SI-SA-REKS: |

Build X_{Δ} using Algorithm 5 Eq. (27) |

Build $XSSRm$ Eq. (26) |

Initialize the preconditioned conjugate gradient (PCG): |

$D\u22121=diag{1/dij};\u2009dij=Aij,ijle$ Eq. (18) |

$r(0)=X\u2212AZ(0);\u2009p(1)=z(0)=D\u22121r(0);\u2009k=1$ |

do while $r(k)22$ < Threshold |

if k > 1: |

$p(k)=z(k\u22121)+(r(k))Tz(k)(r(k\u22121))Tz(k\u22121)p(k\u22121)$ |

Construct Ap^{(k)} with Algorithm 6 |

$\mu =rz(k)(p(k))TAp(k)$ |

$Z(k)=Z(k\u22121)+\mu p(k)$ |

$r(k)=r(k\u22121)\u2212\mu Ap(k)$ |

z^{(k)} = D^{−1}r^{(k)} |

k = k + 1 |

Many other two-electron integrals involved in the evaluation of the SSR gradient can be treated with very similar algorithms as the **AY** product (Algorithm 6). The evaluation of **Q**^{(2)} [Eq. (35)] is only different from Algorithm 6 in the final AO to MO transformation, so **Q**^{(2)} can be readily formed at the last step of CG iteration of CPREKS without extra integral evaluations. Algorithm 5 for evaluating **X**_{Δ} [Eq. (27)] is just a simplified version of Algorithm 6, with fewer ERI evaluations, and the evaluation of $Q(2)\Delta $ [Eq. (44)] only differs from **X**_{Δ} in the AO to MO transformation.

Build one-electron terms |

$X\Delta pq1e=\delta qs2nr\u03f5prSAr\u2212ns\u03f5prSAs+\delta qr2nr\u03f5psSAr\u2212ns\u03f5psSAs+G(1)Rsr\Omega pq$ |

$P\mu \nu (rs)=C\mu rC\nu s$ |

Construct half transformed ERIs: |

$J\mu \nu (P(rs))=(\mu \nu |rs)$, $K\mu \nu (P(rs))=(\mu r|\nu s)$ Eq. (49 ) |

$K\u0303(P(rs))=12(K(P(rs))+(K(P(rs)))T)$ Eq. (50) |

for each distinct microstate, L = 1, 2, 3, 5 |

Build auxiliary density matrices |

$PL,\mu \nu \Delta ,\sigma =nsns,L\sigma \u2212nrnr,L\sigma Cr\mu \u2020C\nu s$ |

Rescale the ERIs: |

$J\mu \nu (PL\Delta )=\u2211\sigma nsns,L\sigma \u2212nrnr,L\sigma J\mu \nu (P(rs))K\u0303\mu \nu (PL\Delta ,\sigma )=nsns,L\sigma \u2212nrnr,L\sigma K\u0303\mu \nu (P(rs))$ |

Construct the DFT integrals $ML,\mu \nu \sigma ,(PL\Delta ,\alpha ,PL\Delta ,\beta )$ Eq. (54) |

Let integral $IL,\mu \nu \sigma =J\mu \nu (P\Delta )\u2212K\u0303\mu \nu (P\Delta ,\sigma )+ML,\mu \nu \sigma ,(PL\Delta ,\alpha ,PL\Delta ,\beta )$ Eq. (24) |

Transform the integral into MO space and add to X_{pq} |

β_{L} = 1 for L = 1, 2; β_{L} = 2 for L = 3, 5; |

$X\Delta pq+=\beta LCL\u2211\sigma np,L\sigma \u2212nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu qIL,\mu \nu \sigma $ Eq. (27) |

Build one-electron terms |

$X\Delta pq1e=\delta qs2nr\u03f5prSAr\u2212ns\u03f5prSAs+\delta qr2nr\u03f5psSAr\u2212ns\u03f5psSAs+G(1)Rsr\Omega pq$ |

$P\mu \nu (rs)=C\mu rC\nu s$ |

Construct half transformed ERIs: |

$J\mu \nu (P(rs))=(\mu \nu |rs)$, $K\mu \nu (P(rs))=(\mu r|\nu s)$ Eq. (49 ) |

$K\u0303(P(rs))=12(K(P(rs))+(K(P(rs)))T)$ Eq. (50) |

for each distinct microstate, L = 1, 2, 3, 5 |

Build auxiliary density matrices |

$PL,\mu \nu \Delta ,\sigma =nsns,L\sigma \u2212nrnr,L\sigma Cr\mu \u2020C\nu s$ |

Rescale the ERIs: |

$J\mu \nu (PL\Delta )=\u2211\sigma nsns,L\sigma \u2212nrnr,L\sigma J\mu \nu (P(rs))K\u0303\mu \nu (PL\Delta ,\sigma )=nsns,L\sigma \u2212nrnr,L\sigma K\u0303\mu \nu (P(rs))$ |

Construct the DFT integrals $ML,\mu \nu \sigma ,(PL\Delta ,\alpha ,PL\Delta ,\beta )$ Eq. (54) |

Let integral $IL,\mu \nu \sigma =J\mu \nu (P\Delta )\u2212K\u0303\mu \nu (P\Delta ,\sigma )+ML,\mu \nu \sigma ,(PL\Delta ,\alpha ,PL\Delta ,\beta )$ Eq. (24) |

Transform the integral into MO space and add to X_{pq} |

β_{L} = 1 for L = 1, 2; β_{L} = 2 for L = 3, 5; |

$X\Delta pq+=\beta LCL\u2211\sigma np,L\sigma \u2212nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu qIL,\mu \nu \sigma $ Eq. (27) |

Build one-electron terms: $A1eYij$ Eq. (56) |

for distinct microstate, L = 1, 2, 3, 5 |

Build the auxiliary density matrix: Eq. (31) |

$RL,\mu \nu \sigma =\u2211i<jYij(ni,L\sigma \u2212nj,L\sigma )Ci\mu \u2020C\nu jRL,\mu \nu =RL,\mu \nu \alpha +RL,\mu \nu \beta $ |

Construct ERIs: |

$J\mu \nu (RL)$, $K\mu \nu (PL\sigma )$ Eqs. (49) and (50) |

$K\u0303\mu \nu (RL\sigma )=12(K\mu \nu (RL\sigma )+(K\mu \nu (RL\sigma ))T)$ |

Construct the DFT integrals $ML,\mu \nu \sigma ,(RL\alpha ,RL\beta )$ Eq. (54) |

Let integral $\Lambda L,\mu \nu \sigma =J\mu \nu (RL)\u2212K\u0303\mu \nu (RL\sigma )+ML,\mu \nu \sigma ,(RL\sigma ,RL\sigma )$ Eq. (58) |

Transform the integral into MO space and add to $A2eYpq$ |

β_{L} = 1 for L = 1, 2; β_{L} = 2 for L = 3, 5; |

$A2eYpq+=\beta LCL\u2211\sigma np,L\sigma \u2212nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu q\Lambda L,\mu \nu \sigma $ Eq. (57) |

Build one-electron terms: $A1eYij$ Eq. (56) |

for distinct microstate, L = 1, 2, 3, 5 |

Build the auxiliary density matrix: Eq. (31) |

$RL,\mu \nu \sigma =\u2211i<jYij(ni,L\sigma \u2212nj,L\sigma )Ci\mu \u2020C\nu jRL,\mu \nu =RL,\mu \nu \alpha +RL,\mu \nu \beta $ |

Construct ERIs: |

$J\mu \nu (RL)$, $K\mu \nu (PL\sigma )$ Eqs. (49) and (50) |

$K\u0303\mu \nu (RL\sigma )=12(K\mu \nu (RL\sigma )+(K\mu \nu (RL\sigma ))T)$ |

Construct the DFT integrals $ML,\mu \nu \sigma ,(RL\alpha ,RL\beta )$ Eq. (54) |

Let integral $\Lambda L,\mu \nu \sigma =J\mu \nu (RL)\u2212K\u0303\mu \nu (RL\sigma )+ML,\mu \nu \sigma ,(RL\sigma ,RL\sigma )$ Eq. (58) |

Transform the integral into MO space and add to $A2eYpq$ |

β_{L} = 1 for L = 1, 2; β_{L} = 2 for L = 3, 5; |

$A2eYpq+=\beta LCL\u2211\sigma np,L\sigma \u2212nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu q\Lambda L,\mu \nu \sigma $ Eq. (57) |

For the analytical gradient evaluation of the energy of an individual SSR state, the challenging part is the derivative two-electron integral evaluation. These involve the generalized derivative J and K matrices,

and the derivative of the exchange–correlation potential *v*_{xc},

This last integral can be efficiently implemented, provided that functional second derivatives and gradients of DFT grid weights are available. The GPU implementation of such integrals has already been realized in previous work on TDDFT.^{44} It is noteworthy that the TDDFT gradient evaluation involves the third order functional derivatives of the XC functional,^{67} which are not needed in SA-REKS or SSR. Algorithm 7 provides the pseudocode for calculating the gradient of an individual SSR state.

β_{L} = 1 for L = 1, 2; β_{L} = 2 for L = 3, 5; |

Calculate with the standard RKS gradient subroutine for microstate L = 1,2: |

$\u2202EL\u2202\lambda unrelaxed=\u2202\u2032EL\u2202\lambda \u221212\u2211i,jall(\epsilon ijLi+\epsilon ijLj)Sji\lambda $ |

Calculate $\u2202EL\u2202\lambda unrelaxed$ with the standard UKS gradient subroutine for microstate L = 3, 5 |

Solve the CPREKS equation and get Z vector $ZSSRm$ with Algorithm 4 Eq. (17) |

Assemble the unrelaxed microstate gradient: |

$\u2207E\lambda SSRm=\u2211L=1,2,3,5\beta L(CLSSRm+wGSSG(1)IL\u2211p<qZpqSSRm\Omega pq)\u2202EL\u2202\lambda unrelaxed$ Eqs. (30), (39), and (19)–(21) |

Calculate one-body relaxation contribution to the gradient: |

One-electron gradient: $\u2207E\lambda SSRm+=\u2211\mu \nu R\mu \nu SSRmh\mu \nu \lambda ,\u2009where\u2009R\mu \nu SSRm=\u2211L=16\u2211\sigma R\mu \nu \sigma ,LSSRm$ Eq. (40) |

Overlap gradient: |

Construct $Qpq(1)$, $Qpq(1)\Delta $ Eq. (34) |

Construct Q^{(2)}: |

Construct $\Lambda L,\mu \nu \sigma $ in Algorithm 6 for vector $ZSSRm$ |

Transform the integral into MO space: |

$Qpq(2)=12\u2211L=1,2,3,5\beta LCL\u2211\sigma np,L\sigma +nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu q\Lambda L,\mu \nu \sigma $ Eq. (35) |

Construct $Q(2)\Delta $: |

Take $IL,\mu \nu \sigma $ in Algorithm 5 |

Transform the integral into MO space: |

$Qpq(2)\Delta =\u221212\u2211L=1,2,3,5\beta LCL\u2211\sigma np,L\sigma +nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu qIL,\mu \nu \sigma $ |

$\u2207E\lambda SSRm+=tr(Q(1)SSRm+Q(2)SSRm)S\lambda $ Eq. (30) |

Calculate two-body relaxation contribution to the gradient: |

Construct the J gradient: $\u2207E\lambda SSRm+=\u2212\beta LCLJ(PL,RLSSRm)\lambda ,\u2009L=1,2,3,5$ Eq. (59) |

Construct the K gradient: Eq. (60) |

$\u2207E\lambda SSRm+=2\beta LCLK(PL\alpha ,RL\sigma SSRm)\lambda ,\u2009L=1,2$ |

$\u2207E\lambda SSRm+=\beta LCLK(PL\alpha ,RL\sigma SSRm)\lambda +K(PL\beta ,RL\beta SSRm)\lambda ,\u2009L=3,5$ |

Construct the DFT gradient: Eq. (61) |

$\u2207E\lambda SSRm+=\u22122\beta LCLvxc(PL\alpha ,RL\sigma SSRm)\lambda ,\u2009L=1,2$ |

$\u2207E\lambda SSRm+=\u2212\beta LCLvxc(PL\alpha ,RL\sigma SSRm)\lambda +vxc(PL\beta ,RL\beta SSRm)\lambda ,\u2009L=3,5$ |

β_{L} = 1 for L = 1, 2; β_{L} = 2 for L = 3, 5; |

Calculate with the standard RKS gradient subroutine for microstate L = 1,2: |

$\u2202EL\u2202\lambda unrelaxed=\u2202\u2032EL\u2202\lambda \u221212\u2211i,jall(\epsilon ijLi+\epsilon ijLj)Sji\lambda $ |

Calculate $\u2202EL\u2202\lambda unrelaxed$ with the standard UKS gradient subroutine for microstate L = 3, 5 |

Solve the CPREKS equation and get Z vector $ZSSRm$ with Algorithm 4 Eq. (17) |

Assemble the unrelaxed microstate gradient: |

$\u2207E\lambda SSRm=\u2211L=1,2,3,5\beta L(CLSSRm+wGSSG(1)IL\u2211p<qZpqSSRm\Omega pq)\u2202EL\u2202\lambda unrelaxed$ Eqs. (30), (39), and (19)–(21) |

Calculate one-body relaxation contribution to the gradient: |

One-electron gradient: $\u2207E\lambda SSRm+=\u2211\mu \nu R\mu \nu SSRmh\mu \nu \lambda ,\u2009where\u2009R\mu \nu SSRm=\u2211L=16\u2211\sigma R\mu \nu \sigma ,LSSRm$ Eq. (40) |

Overlap gradient: |

Construct $Qpq(1)$, $Qpq(1)\Delta $ Eq. (34) |

Construct Q^{(2)}: |

Construct $\Lambda L,\mu \nu \sigma $ in Algorithm 6 for vector $ZSSRm$ |

Transform the integral into MO space: |

$Qpq(2)=12\u2211L=1,2,3,5\beta LCL\u2211\sigma np,L\sigma +nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu q\Lambda L,\mu \nu \sigma $ Eq. (35) |

Construct $Q(2)\Delta $: |

Take $IL,\mu \nu \sigma $ in Algorithm 5 |

Transform the integral into MO space: |

$Qpq(2)\Delta =\u221212\u2211L=1,2,3,5\beta LCL\u2211\sigma np,L\sigma +nq,L\sigma \u2211\mu \nu Cp\mu \u2020C\nu qIL,\mu \nu \sigma $ |

$\u2207E\lambda SSRm+=tr(Q(1)SSRm+Q(2)SSRm)S\lambda $ Eq. (30) |

Calculate two-body relaxation contribution to the gradient: |

Construct the J gradient: $\u2207E\lambda SSRm+=\u2212\beta LCLJ(PL,RLSSRm)\lambda ,\u2009L=1,2,3,5$ Eq. (59) |

Construct the K gradient: Eq. (60) |

$\u2207E\lambda SSRm+=2\beta LCLK(PL\alpha ,RL\sigma SSRm)\lambda ,\u2009L=1,2$ |

$\u2207E\lambda SSRm+=\beta LCLK(PL\alpha ,RL\sigma SSRm)\lambda +K(PL\beta ,RL\beta SSRm)\lambda ,\u2009L=3,5$ |

Construct the DFT gradient: Eq. (61) |

$\u2207E\lambda SSRm+=\u22122\beta LCLvxc(PL\alpha ,RL\sigma SSRm)\lambda ,\u2009L=1,2$ |

$\u2207E\lambda SSRm+=\u2212\beta LCLvxc(PL\alpha ,RL\sigma SSRm)\lambda +vxc(PL\beta ,RL\beta SSRm)\lambda ,\u2009L=3,5$ |

## IV. COMPUTATIONAL DETAILS

Our SSR CP equation and the gradient algorithm have been implemented within the TeraChem^{33,39–41} electronic structure package. Timings for SSR have been obtained using Intel Xeon X5570 CPUs clocked at 2.93 GHz and NVIDIA GeForce GTX TITAN GPUs. For comparison, restricted time-dependent density functional theory (RTDDFT) calculations have been conducted with TeraChem using the same GPUs and CPUs, and XMS-CASPT2^{12} calculations have been conducted with the CPU-based electronic structure package BAGEL.^{71} Timings for BAGEL have been obtained using a single core of the faster Intel Xeon E5-2643 CPU clocked at 3.40 GHz. In order to test the efficiency of the CPREKS equation and SSR individual state gradients, we benchmark our implementation for a series of *trans*-PSB3 molecules microsolvated by up to 186 water molecules (see Fig. 2 and Fig. S1 and Tables S1–S12 of the supplementary material). Molecular geometries were obtained using the Amber 12^{72} simulation package by taking snapshots from classical molecular dynamics, using the general AMBER force field (GAFF)^{73,74} for the solute molecule and the TIP3P^{75} force field for surrounding water molecules. The ωPBEh^{76} exchange–correlation functional with 6-31G^{77} and 6-31G^{*}^{78} basis sets was used for the performance tests. The (2e,2o) active space is used for both SSR and XMS-CASPT2. For XMS-CASPT2 calculations, BAGEL automatically sets a group of core orbitals to be frozen (see Table S13 of the supplementary material), and density fitting^{79} with the svp-jkfit auxiliary basis set^{80} is applied.

We first assess the accuracy of SSR in locating and describing conical intersections by comparing the PBS3 central bond torsion minimum energy conical intersection^{1} (MECI_{CEN}) optimized using SSR, MS-CASPT2, and XMS-CASPT2.^{13} In all three methods, the orbitals are determined by state-averaging over the lowest two singlet states. We used a (2e,2o) active space for SSR and a (6e,6o) active space for (X)MS-CASPT2. All calculations use the cc-pVDZ^{81} basis set, with the cc-pvdz-jkfit auxiliary basis set^{82} for density fitting^{79} in (X)MS-CASPT2 calculations. The SSR conical intersection searches were performed using the Lagrange–Newton algorithm^{83} as implemented in the DL-FIND optimization package.^{84} Three different exchange–correlation functionals were used for SSR: ωPBEh,^{76} BH&HLYP,^{85} and Hartree–Fock exchange. The MS-CASPT2 and XMS-CASPT2 calculations were performed with BAGEL.^{86} The (X)MS-CASPT2 MECI geometries were taken from previous work.^{87} For BAGEL calculations, the same active space, basis set, and auxiliary basis set are used as in previous work,^{87} with an SS-SR contraction scheme and a level shift of 0.5 E_{h}. To compare the topography of the MECIs obtained from different methods, we investigated the branching plane spanned by the **g/h** vector.^{1} The **g/h** vectors are orthogonalized with the procedure proposed by Yarkony.^{88} The topography descriptors^{1} Δ_{gh}, *d*_{gh}, and (*s*_{x}, *s*_{y}) are then calculated based on the orthogonal $g\xaf/h\xaf$ vectors. The $g\xaf/h\xaf$ vectors can be further normalized as the “unscaled intersection adapted coordinates,”^{89} $x\u2032=g\xaf/g\xaf$, $y\u2032=h\xaf/h\xaf$. To directly compare the cone plots of the branching planes generated by different methods, the **x**′/**y**′ vectors are rotated within the **x**′–**y**′ plane to maximize their overlap with the reference $xref\u2032/yref\u2032$ vectors from XMS-CASPT2. The plots of the gap around the MECIs and the cone plots of the branching plane are generated with these rotated $x\xaf\u2032/y\xaf\u2032$ vectors. A detailed description of this rotation procedure is available in Text S3 of the supplementary material.

We further investigated the geometries and energetics of different PSB3 minima and MECIs. We started optimizations for these minima and MECIs from geometries previously obtained with SA-3-MS-CASPT2(6,6)/6-31G^{*}.^{90} Optimizations were performed using SSR(2,2)/cc-pVDZ and three different exchange–correlation functionals: ωPBEh, BH&HLYP, and Hartree–Fock exchange. Cartesian coordinates for all optimized structures are provided in Tables S14–S31 of the supplementary material .

## V. RESULTS AND DISCUSSION

In this section, we first compare the computational performance of the analytical gradient evaluation for SSR, XMS-CASPT2(2), and time-dependent density functional theory. Then, we decompose the computational cost of the SSR analytical gradient evaluation and inspect the impact of applying two acceleration strategies: mixed precision and ultra-sparse DFT quadrature. Finally, we assess the quality of SSR predicted conical intersections and other stationary structures of *trans*-PSB3 by comparison to XMS-CASPT2.

### A. Comparison of SSR, CASPT2, and TDDFT

The SSR method could be a computationally efficient alternative to high-level multi-reference methods to include both dynamic and static electron correlation in a balanced way. To test this, we compared the timings of SSR/ωPBEh vs XMS-CAS(2,2)PT2 and restricted closed-shell time-dependent density functional theory (RTDDFT), as shown in Fig. 3(a). The timings include the total runtime of the corresponding energy and gradient evaluation for all methods. Since analytical energy gradients are usually evaluated in the context of MD simulation or geometry optimization, a very good initial guess for the MO coefficients is usually available by extrapolation of a few previous steps.^{91,92} To mimic this behavior, we took the converged SA-CASSCF MOs, converged SSR MOs, and converged ground state Kohn–Sham MOs as the initial guess for XMS-CAS(2,2)PT2, SSR/ωPBEh, and RTDDFT, respectively.

The formal scaling of CASPT2 calculations with a constant number of occupation orbitals is O(N^{6}) with a traditional implementation^{93} and O(N^{4}) with the most efficient tensor-hypercontracted algorithm currently available.^{17} In Fig. 3(a), we see that XMS-CAS(2,2)PT2 has an observed empirical scaling for this test set of O(N^{3.83}). The reduced scaling is partly due to the use of density fitting^{79} and frozen core approximations in BAGEL. For *trans*-PSB3 + 30 waters (microsolvation sphere r = 3.75 Å, 104 atoms, 384 electrons, 460 orbitals), XMS-CAS(2,2)PT2 already takes about 9 h for a single gradient evaluation. XMS-CAS(2,2)PT2 calculations for the larger test cases were not possible on our machines due to large memory requirements [in addition to the long computation time, which can be inferred from Fig. 3(a)]. Our SSR implementation scales formally as O(N^{4}), but integral and density screening are expected to decrease this substantially. We observe an empirical scaling of O(N^{1.76}) for this test set, which is very close to the observed O(N^{1.69}) scaling of RTDDFT in TeraChem. From *trans*-PSB3 (14 atoms, 44 electrons, and 70 orbitals) to *trans*-PSB3 + 30 waters, the speedup of SSR vs XMS-CAS(2,2)PT2 grows rapidly from 3X to 56X. Assuming that we are doing MD simulation or geometry optimization with the tolerable speed of about 4 min/step (360 steps/day), the largest system that can be handled by XMS-CAS(2,2)PT2 is just *trans*-PSB3 (14 atoms), whereas for SSR, it is *trans*-PSB3 + 18 waters (68 atoms). The calculations in Fig. 3(a) used the 6-31G basis set because when larger basis sets (e.g., 6-31G^{*}) are used, XMS-CASPT2 cannot even deal with *trans*-PSB3 + 24 waters (86 atoms), resulting in too few data points to fit a reliable empirical scaling. The same comparison obtained with 6-31G^{*} is shown in Fig. S2 of the supplementary material.

To obtain a more comprehensive comparison between CPREKS and RTDDFT, we decomposed the timings for both methods, as shown in Fig. 3(b). The total run can be partitioned into three major parts: obtaining the energy, calculating orbital relaxation by solving coupled-perturbed equations, and evaluating gradients. For SSR, the energy is obtained from the REKS SCF procedure, while the TDDFT energy requires both the ground state KS SCF and Davidson iterations for the excitation energy. For the total calculation and each task, we plotted the ratio of SSR time to RTDDFT time as a function of system size. The ratio of all components tends to fluctuate around a constant, indicating that the computational cost of SSR relative to TDDFT does not deteriorate as the system size grows. As previously analyzed, each REKS SCF iteration involves two RKS (restricted Kohn–Sham) Fock builds and two UKS (unrestricted Kohn–Sham) builds, which is equivalent to about six RKS iterations. Each CPREKS iteration involves two RKS- and two UKS-like Fock-like builds, which is equivalent to about six CPKS iterations. These match the benchmarking results (black triangles and green triangles). Similar trends are observed for the gradient evaluations. The times for obtaining energies are similar because REKS requires only the REKS SCF, while TDDFT requires both the KS SCF and Davidson iterations. The ratio for the total time of solving the CP-equations is much larger than 6. This is because the CPREKS equations are harder to solve than the CPKS equation, requiring more iterations to converge to the same convergence threshold (a norm of residue 10^{−6} a.u.; see Table S32 of the supplementary material). In summary, the runtime of SSR is 4-6X that of RTDDFT, regardless of system size (given the same number of iterations are used to solve the CP equations).

### B. Decomposition of SSR runtime

The primary computational parts of the SSR gradient evaluation can be divided as follows: (1) solving the CPREKS equations and evaluating (2) the gradient of the Coulomb **J** matrix, (3) the gradient of the exchange **K** matrix, (4) the gradient of the XC potential, and (5) the overlap gradient. Figure 4 (left panel) shows the GPU + CPU time taken by each of these contributions. CPREKS is the most expensive part and takes 60%–75% of the total gradient evaluation time. Decomposition of the CPREKS timing is discussed below. The **K** gradient has the worst empirical scaling of O(N^{2.00}). This is similar to what was previously reported in the CASSCF gradient implementation,^{27} and the reason has also been discussed. The XC gradient has the best scaling of O(N^{1.60}), slightly better than the **J** gradient [O(N^{1.70})]. However, it has a significantly larger prefactor than **J** and **K** gradients. This is because the evaluation of functional quantities at each quadrature point is done on the CPU rather than on the GPU. The overlap gradient term refers to both the overlap derivatives and the terms contracted with it. The latter [$Q(2)X$ and $Q(2)\Delta $ matrices; see Eqs. (35) and (44)] is actually the computationally dominant part. As discussed in Sec. III, the evaluation of these quantities is very similar to the **AY** product for CPREKS, so the scalings are also similar.

Each CPREKS iteration can be decomposed into four computationally important parts: Coulomb matrix **J** construction, exchange matrix **K** construction, XC second functional derivative *f*_{xc} evaluation, and Linear algebra (LA), as shown in Fig. 4 (right panel). LA scales the worst because the dominant part is to construct the six auxiliary density matrices $RL\sigma $ (L = 1, 2, *σ* = *α*; L = 3, 5, *σ* = *α*, *β*). Each construction involves two matrix–matrix products, requiring O(N^{3}) operations. However, linear algebra has a much smaller prefactor compared with integral evaluation, so its cost only becomes significant when the system size is very large. For our benchmark systems, LA is only the second most expensive part even for the largest molecule with ∼2500 basis functions. For small molecules, the most expensive part is *f*_{xc} because of the large prefactor caused by the CPU evaluation of functional values at quadrature grid points. For large molecules, however, **K** is the bottleneck because of the scaling.

### C. Using mixed precision

In order to leverage the fast single precision arithmetic on GPUs, we used mixed precision (MP) for the calculation of all two-electron integrals related to both CPREKS and SSR gradient evaluations. Our REKS SCF is coded with dynamic precision^{59} using a single/double precision (SP/DP) threshold *δ*_{spdp}, which is dynamically adjusted according to the DIIS error. Upon convergence, the threshold *δ*_{spdp} is fixed and used throughout the CPREKS and gradient evaluation. We found that this mixed precision treatment improves the efficiency without hampering the accuracy and stability of SSR gradient calculation. Figure 5 compares the convergence behavior of CPREKS using mixed and double precision. A relatively strict convergence threshold of ||*r*||_{2} = 10^{−6} is used. In all cases, convergence behavior is identical for mixed and double precision integration until very small residual values well below the convergence threshold. The accuracy of the gradient evaluation is shown in Fig. 6. The maximum unsigned difference of all gradients entries calculated by MP and DP is plotted as a function of system size. This error stays well below 10^{−4} Hartree/Bohr for all systems, sufficient for geometry optimization and molecular dynamics. The speedup achieved by mixed precision is also shown in the same graph. The overall speedup for gradient evaluation is about 2.5X. This result is similar to previous reports for ground state HF/KS calculations.^{59} The XC gradient has the least MP speedup because a significant portion of the XC gradient evaluation is done on the CPU, which always uses double precision in our implementation. In contrast, the use of mixed precision leads to speedups of more than 3X for **J** and **K** gradients, which are almost entirely evaluated on the GPU.

It is worth noting that even though *δ*_{spdp} is fixed for gradient evaluation, successive iterations of CPREKS CG take less computation time than the first. As shown in Fig. 7, for all three systems displayed, the **J** and **K** build times decrease drastically as the CG iteration proceeds. This behavior is different from early TDDFT implementations, where all Davidson iterations take roughly the same amount of time. As CPREKS iterations proceed, the trial vector **Y** becomes sparser. **J** and **K** builds exploit the sparsity of **Y** by screening out small integrals. For **XC** integrals [Eq. (53)], however, the screening is always based only on the reference density (**P**^{α}, **P**^{β}) rather than the density to be contracted (**D**^{α}, **D**^{β}). Therefore, the screening is irrelevant to the sparsity of the **Y** matrix, and the computation time remains constant for the XC gradient. Recent work has reformulated TDDFT iterations to benefit from a similar sparsity by using unnormalized correction vectors in the Davidson iterations.^{94,95}

### D. Ultra-sparse quadrature grid

Given a molecule described with a certain number of basis functions, the cost of XC integrals is proportional to the density of the grid. Thus, it is desirable if coarse grids can already generate a moderately accurate result. Previous studies of TDDFT implementation on GPUs demonstrated that using an ultra-sparse grid for the TDDFT portion of computation can result in considerable time saving with little or no loss of accuracy.^{44,96,97} Here, we found the same phenomenon for SSR (Table I). For both molecules, the second SSR state gradients from the sparsest grid and the densest grid only differ in the order of 10^{−4} hartree/bohr, while the speedups are over 20X.

Grid . | Points . | Pts/atom^{b}
. | Time(s)^{a}
. | ∇E (Hartree/Bohr)
. |
---|---|---|---|---|

PSB3 (14 atoms) | ||||

0 | 14 062 | 1 004 | 10.37 | 0.047 414 |

1 | 39 366 | 2 811 | 16.88 | 0.047 719 |

2 | 87 942 | 6 281 | 27.10 | 0.047 634 |

3 | 161 543 | 11 538 | 40.78 | 0.047 622 |

4 | 414 161 | 29 582 | 93.18 | 0.047 639 |

5 | 1 060 891 | 75 777 | 222.74 | 0.047 636 |

PSB3 + 34 waters (116 atoms) | ||||

0 | 110 381 | 951 | 426.1 | −0.075 695 |

1 | 306 783 | 2 644 | 686.0 | −0.075 036 |

2 | 692 046 | 5 965 | 1 211.9 | −0.075 159 |

3 | 1 268 577 | 10 936 | 1 971.0 | −0.075 156 |

4 | 3 258 711 | 28 092 | 3 829.6 | −0.075 140 |

5 | 8 335 026 | 71 853 | 10 461.7 | −0.075 144 |

Grid . | Points . | Pts/atom^{b}
. | Time(s)^{a}
. | ∇E (Hartree/Bohr)
. |
---|---|---|---|---|

PSB3 (14 atoms) | ||||

0 | 14 062 | 1 004 | 10.37 | 0.047 414 |

1 | 39 366 | 2 811 | 16.88 | 0.047 719 |

2 | 87 942 | 6 281 | 27.10 | 0.047 634 |

3 | 161 543 | 11 538 | 40.78 | 0.047 622 |

4 | 414 161 | 29 582 | 93.18 | 0.047 639 |

5 | 1 060 891 | 75 777 | 222.74 | 0.047 636 |

PSB3 + 34 waters (116 atoms) | ||||

0 | 110 381 | 951 | 426.1 | −0.075 695 |

1 | 306 783 | 2 644 | 686.0 | −0.075 036 |

2 | 692 046 | 5 965 | 1 211.9 | −0.075 159 |

3 | 1 268 577 | 10 936 | 1 971.0 | −0.075 156 |

4 | 3 258 711 | 28 092 | 3 829.6 | −0.075 140 |

5 | 8 335 026 | 71 853 | 10 461.7 | −0.075 144 |

^{a}

SSR/ωPBEh gradient timings include the time for solving CPREKS equations and for gradient calculation.

^{b}

Number of points/atom refers to the pruned grid for TeraChem.

### E. Conical intersection of PSB3

To test the quality of SSR gradients and NAC vectors, we studied the potential energy surfaces of cationic PSB3 near the MECIs. The branching plane is defined with the commonly used vectors, the gradient difference, **g**, and the interstate coupling vector, **h**.^{1} These vectors are closely related to the GDV and DCV defined above: the **g** vector is the same as the GDV, while **h**_{IJ} = (*E*_{J} − *E*_{I})**H**_{IJ}. The differences in the shapes of the potential energy surfaces calculated with different methods are schematically shown in Figs. 8–10. We first present the energy gaps along the loop (radius 0.002 Å) within the branching plane spanned by $x\u2032\xaf/y\u2032\xaf$ (an orthonormalized form of **g/h)** in Fig. 8. The details about the transformation from **g/h** to $x\u2032\xaf/y\u2032\xaf$ are available in Sec. IV and Text S3 of the supplementary material. Similar plots have been previously presented for other methods.^{87,98} However, we observed some new phenomena here. The same method is used for both optimizing the MECI and plotting the loop for every method in Fig. 7. This is different from an earlier study^{98} where the SSR loop was computed around the MECI optimized with another method since direct optimization of MECIs with SSR was not available. When computed fully consistently, the SSR loop curves show little dependence on the XC functional used. All SSR curves are very smooth with an oscillation amplitude of around 0.002 kcal/mol, which is significantly smaller than MS-CASPT2 (0.06 kcal/mol) but slightly larger than the XMS-CASPT2 results (0.0004 kcal/mol).

We then compared the topology of the MECIs obtained with different methods characterized by $g\xaf/h\xaf$ vectors^{88} and the orthogonal version of **g**/**h** vectors. We used the topography parameters as suggested by Yarkony,^{89} including the descriptors for the sharpness and asymmetry of the intersection,

where *g, h* are the length of the orthogonal branching plane vectors $g\xaf/h\xaf$; the tilt parameters

where

and their length-scaled forms

The results are summarized in Table II. Based on the tilt parameters (*s*_{x}, *s*_{y}), XMS-CASPT2 predicts the MECI to be most peaked (a perfectly peaked intersection would have vanishing *s*_{x} and *s*_{y}). Both (X)MS-CASPT2 methods are more peaked than all three REKS methods tested. Previous studies^{7,90} showed that MS-CASPT2 predicts the PSB3 MECI to be more peaked than CASSCF. Based on Δ_{gh} values, the (X)MS-CASPT2 MECIs are more asymmetric than all the REKS methods. The orthonormal $x\xaf\u2032/y\xaf\u2032$ vectors for different methods are compared in Fig. 9.

. | Δgh
. | d_{gh}
. | s_{x}
. | s_{y}
. | s_{x}/g
. | s_{y}/g
. |
---|---|---|---|---|---|---|

SSR/BH&HLYP | −0.50 | 0.07 | 0.06 | −0.12 | 1.64 | −1.97 |

SSR/ωPBEh | 0.72 | 0.03 | −0.01 | 0.11 | −0.17 | 9.76 |

SSR/HF | 0.88 | 0.05 | −0.04 | −0.09 | −0.80 | −7.57 |

XMS-CASPT2 | 1.00 | 0.04 | 0.01 | −0.01 | 0.35 | −58.37 |

MS-CASPT2 | 1.00 | 0.05 | 0.05 | 0.00 | 0.98 | −11.84 |

. | Δgh
. | d_{gh}
. | s_{x}
. | s_{y}
. | s_{x}/g
. | s_{y}/g
. |
---|---|---|---|---|---|---|

SSR/BH&HLYP | −0.50 | 0.07 | 0.06 | −0.12 | 1.64 | −1.97 |

SSR/ωPBEh | 0.72 | 0.03 | −0.01 | 0.11 | −0.17 | 9.76 |

SSR/HF | 0.88 | 0.05 | −0.04 | −0.09 | −0.80 | −7.57 |

XMS-CASPT2 | 1.00 | 0.04 | 0.01 | −0.01 | 0.35 | −58.37 |

MS-CASPT2 | 1.00 | 0.05 | 0.05 | 0.00 | 0.98 | −11.84 |

Finally, we present the cone plots of the MECIs obtained with different methods in Fig. 10. The SSR surfaces are all smooth with the correct double cone topography for a MECI. The PESs of all methods are scanned around their respective MECIs. As reported in previous work of Shiozaki *et al.*, MS-CASPT2 surfaces calculated with BAGEL have visible large fluctuations, which are greatly alleviated in XMS-CASPT2. The branching planes computed by REKS methods are all very smooth with the correct double cone topography. Among the three XC functionals, the SSR/HF PES looks most similar to that of XMS-CASPT2 (SS-SR, BAGEL). This may be because of the formal similarities of SI-SA-REKS/HF and CASSCF.

### F. Energetics of stationary structures of PSB3

Equilibrium structures of *trans*-PSB3 in the ground and first singlet excited states were optimized using SSR/ωPBEh, SSR/BH&HLYP, and SSR/HF with the cc-pVDZ basis set. The optimized structures are depicted in Tables S14–S31 of the supplementary material (including Cartesian coordinates), and the relative energies are summarized in Table III together with SA3-MS-CASTP2(6,6) and SA3-CASSCF(6,6) reference results from previous work.^{90}

Structure . | SA3-MS-CASSCF . | SA3-MS-CASPT2 . | SSR/ωPBEh . | SSR/BH&HLYP . | SSR/HF . |
---|---|---|---|---|---|

trans-PSB3 | 0/113.2^{a} | 0/98.3^{a} | 0/95.4^{a} | 0/97.7^{a} | 0/114.2^{a} |

S1_{min_CenL} | 95.2 | 85.1 | 64.2 | 66.0 | 92.0 |

MECI_{Cen} | 66.1 | 60.3 | 66.7 | 68.7 | 58.1 |

MECI_{CN} | 79.1 | 87.8 | 90.6 | ||

MECI_{Ter} | 108.9 | 82.4 | 77.9 | ||

cis-PSB3 | 3.0/113.0^{a} | 3.0/97.2^{a} | 3.0/96.2^{a} | 3.3/98.9^{a} | 4.1/113.8^{a} |

Structure . | SA3-MS-CASSCF . | SA3-MS-CASPT2 . | SSR/ωPBEh . | SSR/BH&HLYP . | SSR/HF . |
---|---|---|---|---|---|

trans-PSB3 | 0/113.2^{a} | 0/98.3^{a} | 0/95.4^{a} | 0/97.7^{a} | 0/114.2^{a} |

S1_{min_CenL} | 95.2 | 85.1 | 64.2 | 66.0 | 92.0 |

MECI_{Cen} | 66.1 | 60.3 | 66.7 | 68.7 | 58.1 |

MECI_{CN} | 79.1 | 87.8 | 90.6 | ||

MECI_{Ter} | 108.9 | 82.4 | 77.9 | ||

cis-PSB3 | 3.0/113.0^{a} | 3.0/97.2^{a} | 3.0/96.2^{a} | 3.3/98.9^{a} | 4.1/113.8^{a} |

^{a}

Vertical excitation energy of the first excited state relative to S_{0}-*trans*.

We started all optimizations from previous MS-CASPT2-optimized structures and found the corresponding stationary structures for each method. The S_{1} minimum (S_{1min}) structures found with all SSR methods collapse to a structure with a single twist around the central C=C bond, which is similar to S_{1min_cen} of CASSCF, rather than the S_{1min_cenL} structure of MS-CASPT2. Nevertheless, we will show that SSR resembles MS-CASPT2 more than CASSCF for MECI calculations.

The three PSB3 S_{0}/S_{1} MECIs of PSB3, MECI_{cen}, MECI_{CN}, and MECI_{Ter}, correspond to structures that twist mainly around the central C=C bond, the C=N bond, and the terminal C=C bond, respectively (structures available in Tables S14–S31 of the supplementary material). As shown in Table III, all three SSR methods agree on the relative stability of these three structures, E(MECI_{cen}) < E (MECI_{Ter}) < E(MECI_{CN}). This also agrees with the result of MS-CASPT2, whereas CASSCF predicts that the least energetically favorable structure is MECI_{Ter.} More comparisons of the geometries further indicate that SSR goes beyond CASSCF and generates results that resemble MS-CASPT2. For the most stable MECI_{cen,} the length of the twisted central C=C is similar among all SSR methods and MS-CASPT2 (1.447 Å–1.462 Å), whereas CASSCF MECI_{cen} has a significantly longer bond length (1.485 Å). This structural difference is even more prominent in bond length alternation (BLA), as shown in Fig. 11. Here, we adopted the definition of the BLA coordinate from previous work,^{90}

where the atom labels are defined in Fig. 2. From *trans*-PSB3 to MECI_{cen}, the BLA is decreased in all cases. For MS-CASPT2, the change in the BLA, Δ_{BLA} = −0.038 Å. All SSR methods slightly underestimate the decrease, with Δ_{BLA} = −0.024 Å, −0.014 Å, and −0.009 Å using the HF, ωPBEh, and BH&HLYP functionals, respectively. In contrast, CASSCF significantly overestimates this decrease with Δ_{BLA} = −0.127 Å. Similar trends in BLA changes are also seen in MECI_{CN}.

Inspection of the dihedral angles also shows that SSR methods produce MECI structures more similar to MS-CASPT2 than CASSCF. The CASSCF structure only has a 90° twisting around C=N, and the –NH_{2} group is perfectly flat with sp^{2} hybridization. In contrast, the MECI_{CN} structures produced by all SSR methods and MS-CASPT2 feature two twists around the right C—C single bond and the C—N bond, respectively (Fig. 11). The –NH_{2} group also shows an improper torsion of about 128°–153°.

It is worth noting that for ground state minima, SSR results are sometimes closer to CASSCF. For instance, for S_{0}-*trans* and S_{0}-*cis*, SSR-HF and CASSCF have very similar structures and energetics, which is expected as no XC functional other than exact exchange is used in SSR-HF. Tables of the bond length and dihedrals of structures calculated by all methods are available in the supplementary material, Tables S33 and S34.

## VI. CONCLUSIONS

In this work, we demonstrated that by formulating the analytical gradients of individual state energies of SSR entirely in terms of traces of matrix products, exploiting the sparsity in the atomic orbital basis for two-electron integrals, utilizing mixed precision techniques to benefit from the fast single precision arithmetic, and performing DFT quadrature on ultra-sparse grids, analytic energy derivatives for the SSR method^{10,20,21,99} can be efficiently implemented on GPUs.

The formalism was tested by calculating analytic energy gradients of the *S*_{0} and *S*_{1} states of molecular clusters with a *trans*-PSB3 molecule embedded in water clusters of increasing size. The benchmark calculations are used to demonstrate the feasibility of the developed formalism. We achieve the same scaling as the single reference TDDFT method, but SSR is able to correctly describe the double cone topography of conical intersections. The O(N^{1.76}) scaling of SSR significantly outperforms the observed scaling of MS-CASPT2. We showed that our implementation can efficiently compute SSR gradients for systems with more than 5000 orbitals. Analysis of PSB3 MECIs shows that SSR produces geometric and energetic profiles similar to MS-CASPT2, especially when MS-CASPT2 and SA-CASSCF disagree.

As the SSR method affords mean-field computational cost while providing a MS-CASPT2 quality description of conical intersections, it can be applied to study excited state dynamics of large molecular systems. We have already successfully applied SSR nonadiabatic dynamics to investigate the photoisomerization mechanism of the retinal protonated Schiff base (RPSB) and its effects on the activation of channelrhodopsin 2.^{100} We also used *ab initio* multiple spawning with SSR to simulate the nonadiabatic dynamics of the ultrafast RPSB photoisomerization in bacteriorhodopsin and fully characterized the I fluorescent state.^{101} In the future, development of SSR with larger active spaces will enable excited state dynamics studies of even more molecular and biomolecular systems.

## SUPPLEMENTARY MATERIAL

See the supplementary material for the procedure for orthonormalization of **g**, **h** vectors; relation between $\u2207E1SSR$, $\u2207E2SSR$, and $\u2207ESA-REKS$; the details about the transformation of F^{REKS} to the AO basis; geometries of optimized PSB3 MECIs; geometries of optimized PSB3 S_{1} minima; geometries of optimized PSB3 S_{0} minima; geometries of solvated PSB3 used for performance testing; structures of the benchmark systems for XMS-CASPT2; SSR gradient performance test with 6-31G^{*} basis; the number of basis functions for benchmark systems; geometries of molecules for XMS-CASPT2 performance test; and bond lengths and dihedral angles for optimized PSB3 structures.

## ACKNOWLEDGMENTS

This work was supported by the Office of Naval Research (Grant Nos. N00014-18-1-2659 and N00014-18-1-2624).

## DATA AVAILABILITY

The data that support the findings of this study are available within the article and its supplementary material.

## REFERENCES

_{5}H

_{6}NH

_{2}

^{+}protonated Shiff base: An ab initio minimal model for retinal photoisomerization