The many-body simulation of quantum systems is an active field of research that involves several different methods targeting various computing platforms. Many commonly employed methods, particularly coupled cluster methods, have been adapted to leverage the latest advances in modern high-performance computing. Selected configuration interaction (sCI) methods have seen extensive usage and development in recent years. However, the development of sCI methods targeting massively parallel resources has been explored only in a few research works. Here, we present a parallel, distributed memory implementation of the adaptive sampling configuration interaction (ASCI) approach to sCI. In particular, we address the key concerns pertaining to the parallelization of the determinant search and selection, Hamiltonian formation, and the variational eigenvalue calculation for the ASCI method. Load balancing in the search step is achieved through the application of memory-efficient determinant constraints originally developed for the ASCI-PT2 method. The presented benchmarks demonstrate near optimal speedup for ASCI calculations of Cr2 (24e, 30o) with 10⁶, 10⁷, and 3 × 10⁸ variational determinants on up to 16 384 CPUs. To the best of the authors’ knowledge, this is the largest variational ASCI calculation to date.

The ability of ab initio quantum chemistry to solve challenging science problems strongly relies on its success in leveraging the latest advances in modern high-performance computing. As such, the development of massively parallel algorithms for quantum chemistry methods targeting large supercomputers has been of significant interest in recent years.1–6 In particular, many-body methods hold some of the highest potential for exploiting concurrency on modern machines (we refer the reader to Ref. 7 and references therein for a recent comprehensive review). There are several competitive methods in quantum chemistry used for many-body simulation, including density matrix renormalization group (DMRG),8,9 coupled cluster,10,11 auxiliary field quantum Monte Carlo,12–14 and several others.15,16 Understanding the pros and cons of these various methods is an active field of research,16 and almost all of them are being actively developed.

Recently, the development of selected configuration interaction (sCI) methods has seen a renewed interest, and there has been a push to adapt the approach to modern computing architectures. This interest has been, in large part, driven by recent advances in the core components of sCI algorithms,17–19 which have, in turn, extensively broadened the reach of their applications.16,20–46 In particular, some of the newer applications include simulation of vibrational/nuclear degrees of freedom,24,35 dynamical mean-field theory,27,28 time evolution,25,31,32,47 Hamiltonian compression,43 new algorithms for quantum computing and benchmarking,44,48,49 machine learning approaches,22,33,41 and large-scale simulation of quantum circuits.50,51 While there has been much interest in sCI, there are still a number of technical aspects that are open research problems within the realm of high-performance computing. The lack of parallel sCI implementations has been somewhat visible in a recent comparison of different algorithms for the simulation of the benzene molecule.16 Several sCI methods were considered but none of them used anywhere close to the number of CPU hours that the most accurate methods required. Although there has been at least one simulation of sCI using more than a billion determinants,39 simulations of this scale have not been widely performed. This leads to the obvious question of what the limits of sCI approaches are when used in conjunction with modern high-performance computing resources. In this work, we present a parallel, distributed memory implementation of the adaptive sampling configuration interaction (ASCI) method18 for sCI simulations.

The ASCI algorithm was developed with the idea of making approximations to various aspects of sCI algorithms. The original ASCI paper18 introduced approximations to the search algorithm that reduced the number of search terms while maintaining highly accurate results. Soon after, the heat-bath CI algorithm made further approximations to the search algorithm while still maintaining reasonable results in many systems.19 Initial results on Cr2 suggested that these sCI algorithms are competitive with the most accurate contemporary many-body methods, including DMRG. Many further developments have increased the efficiency of different parts of the algorithm, including the development of a fast perturbation step with a deterministic constraint approach (ASCI-PT2).23 In this paper, we turn the fast ASCI-PT2 algorithm into a massively parallel search step while also developing a scalable parallel diagonalization step. Together these make up the variational part of the sCI algorithm, and our aim is thus to advance the parallel capabilities of variational sCI.

While single-core and single-node approaches for sCI have been explored extensively with a focus on making use of modern computational tools,29 there has been much less focus on massive parallelism and the development of distributed memory algorithms. Most sCI implementations can make use of shared memory parallelism29,36,39 during the computationally heavy parts of the simulation. In particular, the leverage of multi-threaded29,52,53 and GPU-accelerated23,29,54 sorting techniques, which are the primary algorithmic motifs of the ASCI method, has shown great promise for modern computing architectures. One recent study examined a distributed memory implementation of the heat-bath configuration interaction (HCI) method, focusing in particular on the strong scaling of the second-order perturbative (PT2) corrections, which dominated their simulations. While this study exhibited excellent strong scaling for the PT2 step in HCI, the challenges associated with this development are somewhat unique to HCI and are not easily extended to other sCI algorithms. The ASCI-PT2 algorithm, which was first presented in 2018, uses triplet constraints to facilitate the division of parallel tasks.23 Some of the new ideas in the recent parallel HCI38 approach are used to overcome issues that are bottlenecks only for HCI PT2 approaches39 and are not directly applicable to ASCI-PT2 with triplet constraints. An even bigger issue, which we address in this paper, is the variational part of the sCI algorithm, which appears to be a parallel bottleneck for all the sCI approaches we are aware of. This is evident in the parallel HCI paper, as their method exhibited much less favorable strong scaling for the variational component of the sCI calculation. Thus, the primary focus of the following developments will be on the variational component of the ASCI algorithm, much of which can be applied to other sCI algorithms as well.

The remainder of this work will be organized as follows: In Sec. II A, we review the salient features of the ASCI method relevant to the development of parallel algorithms. Section II B presents a work partitioning and load-balancing scheme relevant to the development of scalable determinant selection on massively parallel architectures. Section II C addresses key considerations for the development of parallel algorithms for the solution of large-scale eigenvalue problems and the parallel construction of sparse Hamiltonian matrices within the sCI formalism. We examine the performance and scalability of the proposed methods for a challenging test case (Cr2) in Sec. III and discuss future research directions in Sec. IV.

The notation and common abbreviations used in this work are summarized in Table I.

TABLE I.

List of symbols used in this work.

Symbol    Explanation
ψk        The wave function in the current ASCI step
Ci        The coefficient of the ith determinant Di of ψk
Di        The ith determinant in ψk
D̃j        The jth determinant not in ψk
{Dsd}     The set of all single and double excitations connected to ψk
{Disd}    The set of all single and double excitations connected to the determinant Di
Ntdets    Number of determinants in the current wave function
Ncdets    Core space size used for pruning in ASCI
PE        Processing element (independent compute context)

The details and efficacy of the ASCI algorithm are presented elsewhere. The initial implementation can be found in Ref. 18, and many details and improvements can be found in Ref. 29. The general schematic for the ASCI algorithm is given in Algorithm 1. In general, the ASCI algorithm, like most other sCI algorithms, consists of three major steps:

  1. Hamiltonian construction and diagonalization (eigenvalue problem),

  2. Determinant selection (search), and

  3. Perturbative (PT2) correction to the energy (optional).

ALGORITHM 1.

ASCI algorithm.

1: Input: Start with a Hartree-Fock simulation
2: Output: Ground state energy of the system
3: Create a starting wave function, which can be a Hartree-Fock determinant
4: ASCI search (find important contributions not currently in the wave function)
5: Sort and select the top determinants from the ASCI search
6: Diagonalize the Hamiltonian in the selected determinant space
7: Return to step 3, but now use the ground state result of the diagonalization as the starting wave function. Repeat until the stopping criterion is reached
8: Run a final ASCI-PT2 step (to improve the energy further)

The first two steps are typically referred to as the variational steps for the sCI algorithm and will be the primary focus of this work. However, in the development of parallel algorithms for the ASCI search step (Sec. II B), we will extend a work partitioning scheme originally developed for the ASCI-PT2 method.23 For the variational steps, sCI algorithms primarily differ in their treatment of the determinant search, whereas once a set of determinants has been selected, the Hamiltonian construction and subsequent diagonalization steps are largely the same. As such, the parallel algorithms developed in this work for the latter can straightforwardly be applied to other sCI algorithms as well. However, the details of the search step, which must be carefully considered in the development of scalable parallel algorithms, are typically unique to each sCI algorithm, and, thus, not straightforwardly extended between methods.

With the possible exception of a few approaches such as the machine learning sCI methods,22,33,34,41 the vast majority of sCI algorithms use the following formula for generating a ranking value of a determinant not currently considered in the trial wave function, ψk:
Si = Σj Si(j),    Si(j) = HijCj/(Hii − Ek),    Ek = ⟨ψk|H|ψk⟩.    (1)
Here, H is the many-body molecular Hamiltonian, Hij is a matrix element of H between Slater determinants, Cj is the coefficient of Dj ∈ ψk, and Ek is the variational energy associated with ψk. Si is the metric (“score”) by which we determine which determinants to add to the sCI subspace, and Si(j) is the jth partial score describing the contribution of Dj to Si. The contracted index, j, runs over Dj ∈ ψk, whereas the free index, i, runs over D̃i ∉ ψk. In principle, i runs over the entire Hilbert space in the complement of the determinants in ψk. However, the elements Hij are sparse, and the maximally two-body nature of H for molecular Hamiltonians imposes D̃i ∈ {Dsd}, where {Dsd} is the set of determinants that are singly or doubly connected to a determinant in ψk. In practice, even considering all determinants in {Dsd} is not needed since there exists considerable numerical sparsity beyond that which is imposed solely by connections in the Hamiltonian. Thus, while evaluating these equations is theoretically straightforward, in practice it is important to identify the non-zero terms a priori because the large majority of terms are zero. In addition, it is important to only use and store the non-zero terms when needed, since the memory requirements for storing all the terms in these sums become extremely large when pushing to a large sCI calculation. Previous papers have reported the number of such connections for representative examples: in a previous ASCI study, a Cr2 example involved over 282 × 10⁹ connection terms, all of which were generated on a single core,23 and in the recent parallel HCI paper, the authors were able to simulate over 89.5 × 10⁹ connection terms. If all of these terms had to be stored in memory simultaneously, the memory requirements would make these approaches unfeasible.

Due to the immense cost of calculating Hij and Si, small differences in data structures and algorithmic design choices can yield large impacts on overall performance. Since its inception, the design approach of the ASCI algorithm has centered around the use of modern computing architectures. In particular, the ASCI method has adopted sorting as its primary algorithmic motif due to the favorable scaling and performance of sorting algorithms on contemporary CPU and GPU hardware.29 However, current developments of the ASCI algorithm have focused on shared memory computing environments. Almost all ASCI simulations that we are aware of have been performed on a single node with the exception of a small-scale MPI test with parallel ASCI-PT2.23 In Sec. II B, we examine the extension of the ASCI sorting motif to distributed memory compute environments and emergent challenges with load balancing on massively parallel architectures.

The general procedure for the ASCI determinant search is given in Algorithm 2. The search proceeds by considering single and double connections from a set of dominant configurations in the trial wave function.29 We will refer to these determinants as the “core” determinants in the following. The number of core determinants to consider is a tunable parameter in the ASCI algorithm and is typically achieved by selecting the Ncdets determinants in ψk with the largest coefficient magnitudes. In a naive approach, one determines a priori the {Dsd} associated with the core determinants, which we will refer to as the target determinants in the following, and subsequently calculates the contribution of each core determinant to each target configuration via Eq. (1). In many practical simulations, it is better to generate the possible target determinants as singles and doubles from each core configuration, rather than loop over all possible target configurations and then check for connections with the core determinants. The latter procedure generally performs excessive work as the majority of the core determinants only contribute to a few target determinants. In the shared memory ASCI algorithm, a single pass is taken over the core determinants, and the associated {Djsd} and Si(j) for D̃i{Djsd} are generated on the fly. If |Si(j)|>ϵsearch, where ϵsearch is a tunable threshold, it is appended to an array as a pair (D̃i,Si(j)), which we will refer to as an ASCI pair in the following. For molecular calculations, bit-strings are typically the data structure of choice for the representation and manipulation of many-body determinants as it allows for fast bit operations in, e.g., calculating matrix elements on modern computing architectures.17,29,36,38,55 As such, the ASCI pairs can be stored in a simple data structure consisting of the bit-string representing D̃i and the partial score Si(j). Once all pairs have been generated after passing over each core configuration, the array is sorted on the determinant bit-string, which, in turn, brings all partial scores associated with a particular target determinant contiguously in memory. With this new ordering, scores for individual target determinants can be straightforwardly assembled via Eq. (1). We refer to this procedure as “sort-and-accumulate” in the following.

ALGORITHM 2.

The ASCI search algorithm.

1: Input: Trial wave function determinants ψk and coefficients C; search configuration cutoff Ncdets; max wave function size Ntdets; ASCI pair threshold ϵsearch
2: Output: New ASCI determinants ψk+1
3: ψc ← the Ncdets-determinant subset of ψk with the largest coefficients
4: P ← [ ] ⊳ ASCI pairs
5: for Dj ∈ ψc do
6:   {Djsd} ← single and double connections from Dj
7:   for D̃i ∈ {Djsd} do
8:     Si(j) ← Eq. (1)
9:     if |Si(j)| > ϵsearch then
10:      P ← [P, (D̃i, Si(j))]
11:    end if
12:  end for
13: end for
14: P ← sort P on D̃i
15: P ← Σj Si(j) for each unique i ⊳ Each unique p ∈ P now contains a complete Si
16: P ← sort P on Si
17: return ψk+1 ← top-Ntdets determinants in P
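To make the sort-and-accumulate step (lines 14 and 15 of Algorithm 2) concrete, the following minimal C++ sketch performs the accumulation from an unsorted list of ASCI pairs. It assumes, purely for illustration, that a determinant fits in a single 64-bit word and that the pair type is a simple struct (AsciPair and accumulate_scores are illustrative names); production implementations use wider bit-strings and more elaborate data layouts.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative ASCI pair: determinant bit-string plus one partial score Si(j).
struct AsciPair {
  uint64_t det;    // bit-string representation of the target determinant
  double   score;  // partial score contribution Si(j)
};

// Sort pairs on the determinant bit-string and sum the partial scores that
// belong to the same target determinant [cf. Eq. (1)].
std::vector<AsciPair> accumulate_scores(std::vector<AsciPair> pairs) {
  std::sort(pairs.begin(), pairs.end(),
            [](const AsciPair& a, const AsciPair& b) { return a.det < b.det; });

  std::vector<AsciPair> scores;
  for (const auto& p : pairs) {
    if (!scores.empty() && scores.back().det == p.det)
      scores.back().score += p.score;   // same target determinant: accumulate
    else
      scores.push_back(p);              // new target determinant
  }
  return scores;
}

Because the sort groups all partial scores of a given target determinant contiguously in memory, the accumulation reduces to a single linear pass, which is the property that makes sorting the dominant algorithmic motif of ASCI.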

Even for relatively aggressive ϵsearch cutoffs, this approach can incur a large memory footprint for larger Ncdets. In the original ASCI method, a threshold for the maximum number of ASCI pairs to store in memory was set, and the pair list would be occasionally pruned to ensure memory compactness. This pruning procedure has been demonstrated to provide approximate scores of sufficient accuracy to successfully rank determinants for the sCI wave function.29 A similar approach could be taken for the calculation of the PT2 energy, EPT2, which is closely related to the scores of Eq. (1).20,23,39 However, for accurate calculations of EPT2, one cannot employ the aggressive cutoffs and pruning schemes employed to ensure memory compactness in the ASCI search. In the development of the ASCI-PT2 method,23 a solution was developed that allows for memory compactness while still leveraging the powerful nature of sorting algorithms on modern computing architectures. In this procedure, the set of target configurations is partitioned according to constraints identified by their highest occupied orbitals. In the ASCI-PT2 study, the target determinants were classified by triplet constraints, i.e., by their three highest occupied orbitals. These constraints generate non-overlapping subsets of determinants such that every possible determinant belongs to exactly one constraint class.
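As an illustration of the constraint classification, the sketch below extracts the triplet constraint of a determinant stored as a 64-bit occupation bit-string by isolating its three highest occupied orbitals. The single-word representation and the function name are illustrative assumptions, not the data layout of the ASCI-PT2 code; __builtin_clzll is a GCC/Clang intrinsic.

#include <cstdint>

// Return a bit-mask containing only the three highest set bits of the
// occupation bit-string `det`, i.e., the triplet constraint class to which
// any determinant with those three highest occupied orbitals belongs.
uint64_t triplet_constraint(uint64_t det) {
  uint64_t constraint = 0;
  for (int k = 0; k < 3 && det; ++k) {
    int hi = 63 - __builtin_clzll(det);  // index of the current highest set bit
    constraint |= (1ull << hi);          // add this orbital to the constraint
    det &= ~(1ull << hi);                // remove it and continue with the next one
  }
  return constraint;
}

Two determinants belong to the same constraint class exactly when this function returns the same mask for both, which is what guarantees the non-overlapping partition of the target determinants.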

At first glance, this approach appears similar to the naive approach for ASCI score calculation, as it might imply that one would need to know the entire {Dsd} associated with the core determinants to be able to partition the target determinants a priori. However, the major realization of ASCI-PT2 was that one can determine, directly from a given core configuration, which of its excited target determinants belong to a particular constraint class. As such, in a similar manner to the original ASCI search algorithm, one can make multiple passes over the core configurations, one for each constraint, and generate the associated target configurations on the fly. The power of this procedure is that the sort-and-accumulate steps can now be done at the constraint level, i.e., ASCI pairs can be generated and aggregated for each constraint, and, due to the non-overlapping partition generated by the individual constraints, all partial scores associated with the target determinants belonging to that class are guaranteed to be generated by the end of each pass over the core configurations. As the number of unique determinant scores is much smaller than the number of partial score contributions, this procedure exhibits much lower memory usage than the original ASCI search algorithm.

To proceed with the development of a parallel ASCI search algorithm, we consider a similar approach to ASCI-PT2, where the work associated with the generation of pairs for individual constraints is assigned to different processing elements (PEs) in the distributed memory environment. To ensure the scalability of the pair generation and subsequent sort-and-accumulate steps, we have developed the static load balancing procedure presented in Algorithm 3. Before any pairs are generated for the ASCI search, the number of determinants associated with each constraint is first computed to provide a rough estimate of its work. The number of contributions is then used as a heuristic to assign work to individual PEs. It is important to note that the number of contributions is not an exact measure of the work associated with a particular constraint because contributions are screened out according to ϵsearch. However, for the ASCI search, this procedure has been demonstrated to generate sufficiently balanced work (see Sec. III A). The primary challenge of this work distribution scheme stems from the large variance of work associated with any particular constraint, i.e., it is often the case that a few constraints yield a disproportionate number of target determinants. As such, at a particular level of constraint (e.g., triplets), there always exists an amount of indivisible work that the load balancing procedure cannot distribute over PEs. The work associated with a particular constraint can be further divided by considering additional constraints of higher order, e.g., triplets can be broken up into quadruplets and quadruplets into quintuplets. However, the number of constraints at a particular level grows polynomially with the number of single-particle orbitals, e.g., O(norb³) for triplets and O(norb⁴) for quadruplets. Since, for each constraint, there is an inner loop over Ncdets core configurations, it is highly desirable to limit the number of considered constraints. As such, it would be highly inefficient to simply consider all higher-level constraints if the work associated with a particular constraint level is deemed to exhibit excessive load imbalance. Instead, it is advantageous to only divide the constraints that are expected to exhibit excessive work into higher constraint levels rather than dividing all constraints. We will demonstrate the efficacy of this partitioning scheme in Sec. III A.

ALGORITHM 3.

Load balancing for parallel constraint ASCI search.

1: Input (global): Dominant core configurations ψc; max constraint level L
2: Output (local): Local constraints Cloc
3: W ← an array of size of the execution context (# PEs) ⊳ Initialize to zero
4: p ← index of this PE in the execution context
5: Cloc ← [ ]
6: Ct ← all unique triplet constraints
7: for C ∈ Ct do
8:   h ← 0
9:   for Dj ∈ ψc do
10:    s ← singly connected determinants to Dj satisfying C
11:    d ← doubly connected determinants to Dj satisfying C
12:    h ← h + |s| + |d|
13:  end for
14:  if h > 0 then
15:    q ← arg mini Wi
16:    Wq ← Wq + h
17:    if p = q then
18:      Cloc ← [Cloc, C]
19:    end if
20:  end if
21: end for
22: return Cloc
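A minimal C++ sketch of the greedy assignment at the heart of Algorithm 3 is given below, assuming the per-constraint work estimates have already been computed; the function and variable names are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Greedy static load balancing: assign each constraint to the PE that
// currently has the least estimated work (cf. Algorithm 3). `work_estimate[c]`
// is the precomputed number of single/double connections satisfying
// constraint c; the return value maps each constraint index to a PE rank.
std::vector<int> assign_constraints(const std::vector<std::size_t>& work_estimate,
                                    int n_pes) {
  std::vector<std::size_t> load(n_pes, 0);
  std::vector<int> owner(work_estimate.size(), -1);  // -1: constraint has no work
  for (std::size_t c = 0; c < work_estimate.size(); ++c) {
    if (work_estimate[c] == 0) continue;             // skip empty constraints
    auto min_it = std::min_element(load.begin(), load.end());
    int  pe     = static_cast<int>(min_it - load.begin());
    owner[c] = pe;
    *min_it += work_estimate[c];                     // update that PE's accumulated load
  }
  return owner;
}

Since every PE executes the same deterministic loop over globally known work estimates, rank p simply keeps the constraints with owner[c] == p, and no communication is required for the assignment itself.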

Given the locally constructed arrays of string–score pairs generated as a result of the constraint partitioning, the remaining aspect of the parallel ASCI search is to determine the dominant Ntdets determinants to keep for the next iteration. In principle, this can be achieved by performing a full sort of the pairs on the absolute value of their scores. For an array A with |A| = n, a full sort of A scales as O(n log n), which is generally unfavorable for the large arrays considered in this work (as would be generated by large values of Ncdets). For distributed memory implementations, this problem would be exacerbated by the communication overhead of existing distributed memory sort algorithms.52,56 For the vast majority of properties involving selected-CI wave functions, such as energies and density matrices, the order in which the determinant coefficients appear in the CI vector is irrelevant, and all computed quantities are invariant to arbitrary permutations C → PC. As such, the sorting problem can be replaced with the selection problem, which can be used to determine the largest (or smallest) k ≤ n values of an array with O(n) complexity.57,58 In addition, selection algorithms can be performed in parallel with nearly optimal speedup without having to communicate significant segments of A.59 Rather than obtaining an absolute ordering of its input data, a selection algorithm determines a pivot, ak ∈ A, such that |Ag| = k, where Ag = {a > ak | a ∈ A}. In cases where ak appears multiple times in A, this definition can be extended to require |Ag| < k ≤ |Ag ∪ Ae|, where Ae ⊂ A is the indexed subset containing the duplicate elements of ak. We outline a duplicate-aware, distributed memory extension of the quickselect algorithm, with expected O(n) runtime, in Algorithm 4. For the ASCI search problem, Algorithm 4 is used to determine the contribution pair with the Ntdets-largest score, ptdets. We may then determine the contribution pairs with scores larger than or equal to ptdets and subsequently gather them (collectively) to all participating PEs [via, e.g., MPI_Allgather(v)]. We examine the performance of this scheme, which introduces the only source of distributed memory communication in the presented ASCI search method, in Sec. III.

ALGORITHM 4.

Distributed memory quickselect.

1: Input: Local part AI of an array A with |A| = n; element index k ∈ [1, n]
2: Output: The element ak (global) such that |{a > ak | a ∈ A}| = k
3: ⊳ All operations are local unless otherwise specified.
4: rank ← PE rank
5: nI ← |AI|
6: N ← Allreduce(nI) ⊳ Collective
7: while N ≥ nmin do
8:   p ← select a random pivot index in [0, N)
9:   Determine the owner of ap and broadcast ap to the other PEs ⊳ Collective
10:  AIg, AIe, AIl ← partition(AI, ap)
11:  GI, EI, LI ← |AIg|, |AIe|, |AIl|
12:  {G, E, L} ← Allreduce({GI, EI, LI}) ⊳ Collective
13:  if k ≤ G then
14:    AI ← AIg
15:  else if k ≤ G + E then
16:    ak ← ap
17:    return ak
18:  else
19:    AI ← AIl
20:    k ← k − G − E
21:  end if
22:  nI ← |AI|
23:  N ← Allreduce(nI) ⊳ Collective
24: end while
25: Arem ← Gather(AI) ⊳ Collective (PE 0)
26: if rank = 0 then
27:   ak ← SerialQuickselect(Arem) ⊳ std::nth_element
28: end if
29: Broadcast ak ⊳ Collective
30: return ak
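Once Algorithm 4 has produced the pivot score, the selected determinants are replicated on all PEs as described above. The sketch below illustrates one way to do this with MPI_Allgather and MPI_Allgatherv, assuming for illustration that determinants fit in a single 64-bit word and that the score array already holds the magnitudes |Si|; all names are illustrative.

#include <mpi.h>
#include <cstdint>
#include <vector>

// After the distributed quickselect has produced the pivot score `s_pivot`
// (the Ntdets-largest score magnitude), keep the locally owned determinants
// whose scores are >= s_pivot and replicate the survivors on all PEs.
std::vector<uint64_t> gather_selected(const std::vector<uint64_t>& dets,
                                      const std::vector<double>&   scores,
                                      double s_pivot, MPI_Comm comm) {
  // Local filter against the globally determined pivot.
  std::vector<uint64_t> keep;
  for (std::size_t i = 0; i < dets.size(); ++i)
    if (scores[i] >= s_pivot) keep.push_back(dets[i]);

  int n_pes;  MPI_Comm_size(comm, &n_pes);
  int local_count = static_cast<int>(keep.size());

  // Exchange counts, build displacements, then gather the determinants.
  std::vector<int> counts(n_pes), displs(n_pes, 0);
  MPI_Allgather(&local_count, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);
  for (int p = 1; p < n_pes; ++p) displs[p] = displs[p - 1] + counts[p - 1];

  std::vector<uint64_t> selected(displs.back() + counts.back());
  MPI_Allgatherv(keep.data(), local_count, MPI_UINT64_T,
                 selected.data(), counts.data(), displs.data(), MPI_UINT64_T,
                 comm);
  return selected;
}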

After each search iteration of the ASCI procedure, we must obtain the ground state eigenpair of the many-body Hamiltonian projected onto the basis of newly selected determinants. The large basis dimensions (Ntdets) employed in accurate sCI applications preclude the use of direct eigensolvers [e.g., those implemented in dense linear algebra libraries such as (Sca)LAPACK60 and ELPA61,62] due to their O(N²) dense storage requirement and steep O(N³) scaling with problem dimension. As such, Krylov-subspace methods, such as Davidson,63 LOBPCG,64,65 and Lanczos,66 are typically employed. The development of efficient and scalable Krylov methods on distributed memory computers is challenging due to the existence of many synchronization points arising from the serial nature of subspace construction. Significant research effort has been afforded to the development of distributed memory Krylov eigenvalue methods across many scientific computing domains.67–76 In this work, due to the diagonally dominant nature of the Hamiltonian, we consider the parallel implementation of the Davidson method for sCI applications, although many of the same principles can be extended to other Krylov methods such as Lanczos or LOBPCG.

Given an algorithm to produce the action of the Hamiltonian onto a trial vector (σ formation), we present the general scheme for the distributed memory Davidson method in Algorithm 5. The algorithm presented is general to an arbitrary preconditioner, although we will always adopt the shifted Jacobi (diagonal) preconditioner, which has become standard in configuration interaction calculations.63 Unlike exact diagonalization configuration interaction methods (e.g., full/complete active space CI), where implicit σ formation exhibits efficient closed-form expressions,1,55 the vast majority of sCI methods require explicit construction of Hij to perform the necessary contractions. We refer the reader to Ref. 29 for a comprehensive discussion of efficient algorithms for the assembly of Hij for sCI wave functions. Although explicit storage of the Hamiltonian can be avoided via recalculation of the Hij at each Davidson iteration, the immense cost of matrix element construction renders this procedure prohibitive for practical applications. The sparsity of the Hamiltonian in the basis of determinants allows for its explicit storage using sparse matrix storage formats. In this work, we will use the distributed sparse matrix format depicted in Fig. 1, which has been employed for the development of parallel Krylov algorithms in many community software packages for sparse matrix algebra (e.g., Ref. 77). In this storage scheme, each PE owns a contiguous (not necessarily equal) row subset of the full sparse matrix divided into diagonal and off-diagonal blocks that are stored in standard sparse matrix formats. We will use the compressed sparse row (CSR) format for local matrix storage.78 The partitioning into diagonal and off-diagonal blocks will ultimately determine the communication pattern of the σ formation in the Davidson method, as will be apparent in Sec. II C. Of the several methods for matrix element construction discussed in Ref. 29, we consider the double-loop approach in this work. In the limit that each task only has a small portion of the Hamiltonian, the double-loop approach is quick at generating all the matrix elements. Further time savings could likely be obtained by considering other approaches for generating the matrix elements, such as residue arrays and dynamic bitmasking.29 However, the cost analysis must also consider inefficiencies due to load balancing. The double-loop approach is well balanced across tasks, and we consider it a reasonable starting point for massively parallel approaches to sCI.
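A minimal sketch of the double-loop construction of the locally owned rows of H is shown below. The determinant representation (a single 64-bit word) and the matrix-element callable are illustrative assumptions; in practice, the Slater-Condon evaluation and the wider bit-strings of Ref. 29 would be used.

#include <cstdint>
#include <vector>

// Build the locally owned rows [row_begin, row_end) of the sparse Hamiltonian
// in coordinate form. `h_elem(bra, ket)` is assumed to evaluate the matrix
// element via the Slater-Condon rules; determinants are illustrative 64-bit
// occupation bit-strings.
template <class MatrixElement>
void build_local_rows(const std::vector<uint64_t>& dets,
                      std::size_t row_begin, std::size_t row_end,
                      MatrixElement&& h_elem,
                      std::vector<std::size_t>& rows,
                      std::vector<std::size_t>& cols,
                      std::vector<double>&      vals) {
  for (std::size_t i = row_begin; i < row_end; ++i)    // locally owned bra determinants
    for (std::size_t j = 0; j < dets.size(); ++j) {    // all ket determinants
      // Determinant "distance": a double excitation changes at most four
      // occupied orbitals, so determinants differing by more bits cannot couple.
      int ndiff = __builtin_popcountll(dets[i] ^ dets[j]);
      if (ndiff <= 4) {
        rows.push_back(i);
        cols.push_back(j);
        vals.push_back(h_elem(dets[i], dets[j]));
      }
    }
}

Each PE executes this loop only over its own row block, which is why the construction requires no inter-PE communication and exhibits the near linear scaling reported in Sec. III B.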

ALGORITHM 5.

Parallel Davidson algorithm.

1: Input: Matrix-vector product operator OP; preconditioner operator K; max Krylov dimension k; initial guess v0; residual norm tolerance ϵ
2: Output: Eigenpair (E0, v) approximating the lowest eigenvalue of OP
3: ⊳ All vectors are distributed among PEs
4: w ← OP(v0)
5: h ← v0†w; h ← Allreduce(h); v ← w − v0h ⊳ Parallel Gram-Schmidt
6: W ← w; V ← [v0 v]
7: i ← 1
8: while i < k do
9:   W ← [W OP(v)]
10:  Hloc ← W†V ⊳ Local, all PEs
11:  Htot ← Reduce(Hloc) ⊳ Collective to PE 0
12:  Solve Htot C̃ = C̃ Ẽ ⊳ Local on PE 0
13:  Broadcast (Ẽ, C̃) ⊳ Collective
14:  R ← (W − Ẽ0 V) C̃0 ⊳ Local, all PEs
15:  rloc ← R†R ⊳ Local, all PEs
16:  r ← Allgather(rloc) ⊳ Collective
17:  if ‖r‖ ≤ ϵ then
18:    return (E0, v) ← (Ẽ0, V C̃0)
19:  else
20:    R ← K(R) ⊳ Preconditioned residual
21:    h ← V†R; h ← Allreduce(h); v ← R − Vh ⊳ Parallel Gram-Schmidt
22:    V ← [V v]
23:  end if
24:  i ← i + 1
25: end while
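The collective pattern underlying the parallel Gram-Schmidt and subspace-matrix steps of Algorithm 5 is a distributed inner product: a local partial dot product over the owned rows followed by a single MPI_Allreduce. A minimal sketch (illustrative names, real double-precision vectors) is given below.

#include <mpi.h>
#include <cstddef>
#include <vector>

// Distributed inner product <x, y> for vectors partitioned by rows across PEs:
// each PE forms the partial dot product over its locally owned rows, and a
// single MPI_Allreduce produces the global value on every PE.
double dist_dot(const std::vector<double>& x_local,
                const std::vector<double>& y_local, MPI_Comm comm) {
  double local = 0.0;
  for (std::size_t i = 0; i < x_local.size(); ++i) local += x_local[i] * y_local[i];

  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}

For example, the orthogonalization in line 5 of Algorithm 5 amounts to h = dist_dot(v0, w, comm) followed by the purely local update v = w − v0 h on each PE.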
FIG. 1.

Example distributed memory storage scheme for the Hamiltonian and Davidson trial vectors across four PEs. The diagonal (Hd, darker) and off-diagonal (Hod, lighter) blocks of H are stored separately on each PE as CSR matrices. For PEs 1 and 2, the off-diagonal blocks are stored as a single sparse matrix with elements on both sides of the diagonal block. Vectors (textured) are distributed according to the same row distribution as H.
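For concreteness, the storage layout of Fig. 1 can be represented per PE by a pair of CSR blocks plus the local row range, as in the following sketch. The struct and member names are illustrative, not the layout of any particular library.

#include <cstdint>
#include <vector>

// Plain CSR storage for one sparse block (values + column indices + row pointers).
struct CsrMatrix {
  std::vector<double>  values;
  std::vector<int64_t> colind;
  std::vector<int64_t> rowptr;   // size = local_rows + 1
};

// Per-PE storage corresponding to Fig. 1: a contiguous block of rows of H,
// split into a diagonal block (columns owned by this PE) and an off-diagonal
// block (all remaining columns).
struct DistributedHamiltonian {
  int64_t   row_start  = 0;      // first global row owned by this PE
  int64_t   local_rows = 0;      // number of locally owned rows
  CsrMatrix diag;                // Hd : couples locally owned vector elements
  CsrMatrix off_diag;            // Hod: couples remotely owned vector elements
};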

A naive approach for the distributed σ formation would replicate the basis vectors on each PE such that the local sparse matrix-vector (SpMV) product could be done without communication. However, in the context of a Krylov method such as Davidson, this would require gathering the locally computed elements of the SpMV to every PE at each iteration. For large problem sizes, the connectivity of the full gather operation would become a bottleneck. Due to the sparsity of H, the computation of individual elements of σ does not require knowledge of the entire basis vector; in fact, the number of elements required for the contraction of any row of H is (pessimistically) bounded by the number of single and double excitations allowable from the single-particle basis. As such, replication (or communication) of the entire vector is not generally required. To minimize data communication in the Davidson iterations, we employ the vector partitioning scheme depicted in Fig. 1. By using the sparsity pattern of the off-diagonal blocks of H, we can efficiently determine which remote vector elements need to be communicated between PEs as the union of the row sparsities, i.e.,
CI = ⋃i∈PE I Ri,    Ri = {j | Hijod ≠ 0}.    (2)
From this set, the owners of remote matrix elements can be looked up by using the row distribution of H. Another benefit of this precomputation is that it allows for the overlap of communication and computation.79 Through this distribution, the diagonal SpMV can be done without communication as the vector elements for the local SpMV reside on the same PE by construction. By using non-blocking MPI communication primitives, communication can be initiated before starting the diagonal SpMV and finalized only once the remote elements are required for the off-diagonal SpMV. The pseudo-code for this distributed SpMV employed in this work is given in Algorithm 6.
ALGORITHM 6.

MPI-based non-blocking parallel sparse matrix-vector product (SpMV).

1: Input: Matrix H and vector v in block distributed format (Fig. 1). (Optional) Precomputed communication data [Eq. (2)]
2: Output: w = Hv in conforming row-partitioned format
3: if communication data is needed then
4:   Compute communication data for H per Eq. (2)
5: end if
6: Post local receives for incoming data from remote PEs ⊳ MPI_Irecv
7: Pack elements of v required by remote processes ⊳ Local
8: Post local sends to satisfy remote receive requests ⊳ MPI_Isend
9: w ← Hd v ⊳ Local
10: Wait for remote receives to complete
11: vrem ← Unpack remote data
12: w ← w + Hod vrem
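The sketch below illustrates the communication/computation overlap of Algorithm 6 with non-blocking MPI. The CSR contraction is deliberately simplified, the off-diagonal column indices are assumed to address the packed buffer of received remote elements, self-communication is assumed to have been excluded from the communication lists, and all names are illustrative.

#include <mpi.h>
#include <cstddef>
#include <vector>

// Simple CSR y = A x (accumulate == false) or y += A x (accumulate == true).
void csr_spmv(const std::vector<double>& vals, const std::vector<int>& colind,
              const std::vector<int>& rowptr, const std::vector<double>& x,
              std::vector<double>& y, bool accumulate) {
  for (std::size_t i = 0; i + 1 < rowptr.size(); ++i) {
    double acc = accumulate ? y[i] : 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) acc += vals[k] * x[colind[k]];
    y[i] = acc;
  }
}

// Non-blocking distributed SpMV following Algorithm 6. `send_idx[p]` lists the
// local elements of v needed by PE p, `recv_counts[p]` is the number of
// elements expected from PE p, and w_local must be pre-sized to the number of
// locally owned rows.
void dist_spmv(const std::vector<double>& d_vals,  const std::vector<int>& d_colind,
               const std::vector<int>& d_rowptr,
               const std::vector<double>& od_vals, const std::vector<int>& od_colind,
               const std::vector<int>& od_rowptr,
               const std::vector<double>& v_local,
               const std::vector<std::vector<int>>& send_idx,
               const std::vector<int>& recv_counts,
               std::vector<double>& w_local, MPI_Comm comm) {
  int n_pes; MPI_Comm_size(comm, &n_pes);

  // 1. Post receives for the remote vector elements this PE will need.
  std::vector<int> recv_displs(n_pes + 1, 0);
  for (int p = 0; p < n_pes; ++p) recv_displs[p + 1] = recv_displs[p] + recv_counts[p];
  std::vector<double> v_remote(recv_displs[n_pes]);
  std::vector<MPI_Request> reqs;
  for (int p = 0; p < n_pes; ++p)
    if (recv_counts[p] > 0) {
      reqs.emplace_back();
      MPI_Irecv(v_remote.data() + recv_displs[p], recv_counts[p], MPI_DOUBLE,
                p, 0, comm, &reqs.back());
    }

  // 2. Pack and post sends for the local elements required by remote PEs.
  std::vector<std::vector<double>> send_bufs(n_pes);
  for (int p = 0; p < n_pes; ++p)
    if (!send_idx[p].empty()) {
      for (int idx : send_idx[p]) send_bufs[p].push_back(v_local[idx]);
      reqs.emplace_back();
      MPI_Isend(send_bufs[p].data(), static_cast<int>(send_bufs[p].size()),
                MPI_DOUBLE, p, 0, comm, &reqs.back());
    }

  // 3. Overlap: local (diagonal-block) SpMV while messages are in flight.
  csr_spmv(d_vals, d_colind, d_rowptr, v_local, w_local, /*accumulate=*/false);

  // 4. Finish communication, then apply the off-diagonal block to the packed
  //    remote elements and accumulate into w.
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
  csr_spmv(od_vals, od_colind, od_rowptr, v_remote, w_local, /*accumulate=*/true);
}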

In this section, we examine the strong scaling behavior of the proposed algorithms for Cr2/def-SVP80 in an active space of (24e, 30o). This test system is a challenging one that has been effectively solved to chemical accuracy only within the last ten years.9,18 Therefore, Cr2 is a good benchmark system for testing and developing modern methods. Molecular integrals were obtained by using the Molpro81,82 program, and all ASCI calculations were performed on the Haswell partition of the Cray XC40 Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). Each Cori-Haswell node consists of two 16-core Intel Xeon E5-2698 v3 CPUs with 128 GB DDR4 RAM. All jobs were run with 32 PEs (MPI ranks) per node. Energies for the Cr2 calculations presented in this work are given in Table II. All calculations were performed using ϵsearch = 10⁻¹⁰. Although not the primary focus of this work, we note that the variational energy determined at Ntdets = 3 × 10⁸ is only 0.3 mEh higher than the most accurate variational DMRG result for the same system in Ref. 9.

TABLE II.

ASCI Ground State energies for Cr2 (24e, 30o).

E0/Eh                 Ncdets = 100k     Ncdets = 300k
Ntdets = 10⁶          −2086.409 66      −2086.410 18
Ntdets = 10⁷          −2086.417 69      −2086.417 81
Ntdets = 3 × 10⁸      ⋯                 −2086.420 42
Reference DMRG9                         −2086.420 77

In the following, we present strong scaling results for the parallel ASCI methods proposed in this work.

In Fig. 2, we present the component and overall strong scaling of the ASCI search for triplet, quadruplet, quintuplet, and sextuplet determinant constraints (Sec. II B) for Cr2 with Ncdets = 300k and Ntdets = 10⁶, 10⁷. The calculation is dominated by the generation of ASCI pairs, whereas the distributed determinant selection via Algorithm 4 is relatively minor in comparison (<1 s). As such, the execution time of the ASCI search, at this scale, is largely independent of Ntdets. At the lowest determinant constraint level (triplets), the presented algorithm stagnates immediately, with no further performance improvement at large PE counts. Strong scaling is improved significantly by using larger determinant constraints. Overall, we see a 5–5.5× performance improvement in the strong scaling limit by moving from triplet to sextuplet determinant constraints. The reason for this improvement (or, conversely, why the triplet case exhibits suboptimal strong scaling) can be seen in Fig. 3, where we present a plot showing the effect of constraint size on the number of ASCI pairs generated per PE. At the triplet level, there is a single triplet that yields a disproportionate number of pair contributions. As the work can only be distributed at the level of the highest determinant constraint, this single triplet represents indivisible work. By breaking up the offending triplet into higher constraints, the work can be distributed. However, at each constraint level, the algorithm always reaches a point where a disproportionate amount of work is generated for a few particular constraints. As such, we expect that there will always be a strong scaling stagnation point for this procedure, but the inclusion of higher constraints (septuplets, octuplets, etc.) will continue to improve the strong scaling behavior at larger PE counts. However, due to the relatively low cost of the search relative to the Hamiltonian formation and Davidson iterations (see Sec. III B), we believe this level of optimization to be unnecessary at this time.

FIG. 2.

Strong scaling of ASCI search components with different determinant constraints (triplet, 3, red; quadruplet, 4, green; quintuplet, 5, blue; sextuplet, 6, orange) for Ncdets = 300k, and (a) Ntdets = 10⁶ and (b) Ntdets = 10⁷. Dominant components exhibit improved scaling with the addition of larger determinant constraints, while lesser components are largely unaffected.

FIG. 3.

Constraint pair count statistics for Cr2 with Ntdets = 10⁷, and (a) Ncdets = 100k and (b) Ncdets = 300k. Triplet (3, red shading), quadruplet (4, green shading), quintuplet (5, blue shading), and sextuplet (6, orange shading) constraints are considered. A smaller shaded volume indicates better work distribution in the constraint-partitioned ASCI search.


Figure 4 illustrates the strong scaling of the sCI Hamiltonian generation and subsequent Davidson iterations for Cr2 with Ncdets = 300k and Ntdets = 10⁶, 10⁷, and 3 × 10⁸. It is clear that the cost of Hamiltonian formation dominates the variational ASCI calculation at all Ntdets considered, being roughly 10× the cost of the Davidson iterations and 100× the cost of the ASCI search discussed in Sec. III A. Due to the significant computational cost of the largest test problem, there does not exist a single reference PE count, consistent across all calculations performed, from which to measure parallel performance. As such, we have chosen to characterize the strong scaling of this application in terms of the relative parallel speedup (RPS) achieved between adjacent data points (i.e., RPS128 = t64/t128 for the speedup between 64 and 128 PEs). As all of the scaling calculations were performed by the successive doubling of PE counts, a speedup value of 2 indicates ideal (linear) strong scaling, 1 indicates no speedup, and <1 indicates a performance degradation. Values slightly above 2 can be attributed to small variances in underlying execution times and are not indicative of super-linear strong scaling.

FIG. 4.

Strong scaling of the Ntdets = 10⁶, 10⁷, and 3 × 10⁸ ASCI calculations. Panel (a) illustrates the strong scaling of the Hamiltonian generation (H), Davidson diagonalization (D), and overall application performance (T) for each system, whereas panel (b) illustrates the corresponding relative parallel speedups. Panel (c) plots the ratio of the largest to the smallest number of non-zero matrix elements over the PEs, which explains the imbalance in the Davidson iterations.


The Hamiltonian formation exhibits near linear strong scaling (RPS > 1.96), whereas the Davidson iterations exhibit RPS values in the range of 0.98–1.6. The scaling of the former is expected since each PE performs nearly the same number of determinant distance checks in the double-loop method (Sec. II C), and there is no communication between PEs. The scaling of the Davidson iterations can be understood in terms of Fig. 4(c), where the source of the load imbalance can be attributed to the variance of non-zero matrix elements among PEs. This is because the ordering of the determinants in the ASCI wave function, as produced by the parallel ASCI search, is more or less random and thus does not guarantee a regular distribution of non-zero matrix elements. This degradation is most prevalent for the smallest problem (10⁶ determinants), where the relative lack of data to distribute leads to load imbalance in a manner similar to Sec. III A. This problem is well known in the field of distributed memory sparse linear algebra,56 as both the local work and overall communication volume of the parallel SpMV (Algorithm 6) are affected by irregular distributions of non-zero matrix elements. Improving the communication characteristics of the sCI SpMV will be the subject of future work by the authors. Figure 4 also shows the performance of the overall ASCI application for the different problem sizes. As the calculation is dominated by the Hamiltonian construction, we see similar parallel speedups for the total application, particularly for the larger problems. Overall, the ASCI calculation maintains RPS > 1.7 for the smallest problem (10⁶ determinants) and RPS > 1.9 for the largest problem of 3 × 10⁸ determinants out to 16 384 PEs. To the best of the authors’ knowledge, this is the largest variational ASCI calculation to date.

In this work, we have proposed and demonstrated a parallel, distributed memory extension of the variational ASCI method. We addressed the key concerns for the dominant steps of the ASCI algorithm, including the development of a parallel algorithm for the determinant selection, a parallel algorithm and sparse storage scheme for the sCI Hamiltonian, and a parallel algorithm for the variational eigenvalue calculation via the Davidson method. The parallel search algorithm was enabled by previous work in the context of ASCI-PT2, where the introduction of the determinant constraints offered a convenient and robust mechanism to distribute work across PEs. This method was extended to include higher-order determinant constraints than were considered for the ASCI-PT2 method, and it was coupled to a load balancer to ensure parallel scaling on large compute resources. We have demonstrated the efficacy and performance of the proposed algorithms on a challenging test system (Cr2) on up to 16 384 cores. The use of higher-order determinant constraints yielded a 5–5.5× performance improvement in the ASCI search over triplet constraints in the strong scaling limit. The overall ASCI calculation was demonstrated to achieve near linear parallel speedups (RPS between 1.7 and 1.9). In addition, we have demonstrated the stability and continued convergence of the ASCI method into the regime of hundreds of millions of determinants, which, to the best of the authors’ knowledge, is the largest variational ASCI calculation to date. Although the developments for the parallel search algorithm are ASCI specific, the developments pertaining to distributed Hamiltonian construction and Davidson diagonalization are directly applicable to other sCI methods as well. These developments indicate a promising future for massively parallel sCI applications in the years to come.

While these results are promising, there remain open areas for exploration, which will be addressed by the authors in future work. In this work, the Hamiltonian construction dominates the overall ASCI application by at least an order of magnitude. The application of more advanced Hamiltonian matrix element techniques29 for the local sparse matrix construction holds high potential for further improvements, although it is unclear whether such developments would introduce load imbalance. The interplay between efficient Hamiltonian construction and load imbalance warrants further exploration. In addition, the strong scaling stagnation of the Davidson iterations can be attributed primarily to the massive load imbalance in the non-zero element distribution across PEs. While reordering techniques for SpMV optimization exist,83,84 their considerable cost and poor strong scaling preclude their practical application in the present work. The development of scalable sparse matrix reordering techniques for the CI problem would be a particularly fruitful area of exploration given the emergence of massively parallel sCI implementations. Finally, leveraging bit operations in both the search and Hamiltonian construction steps of the ASCI algorithm would be particularly advantageous on current and future GPU architectures, and possibly on other hardware accelerators to come. As such, the authors will build upon initial results29 for GPU applications of the ASCI method and develop GPU-accelerated extensions of the parallel algorithms proposed in this work.

D.B.W.-Y. and W.A.d.J. were supported by the Center for Scalable Predictive methods for Excitations and Correlated (SPEC) phenomena, which is funded by the U.S. Department of Energy (DoE), Office of Science, Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences and Biosciences as part of the Computational Chemical Sciences (CCS) program at Lawrence Berkeley National Laboratory under FWP No. 12553. N.M.T. acknowledges the support from NASA Ames Research Center. C.M.-Z. acknowledges the financial support from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation program, Grant Agreement No. 692670 “FIRSTORM.” This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231 using NERSC Award No. BES-ERCAP-M3196. N.M.T. acknowledges funding from the NASA ARMD Transformational Tools and Technology (TTT) Project. Some calculations were performed as part of the XSEDE computational Project No. TG-MCA93S030 on Bridges-2 at the Pittsburgh supercomputer center.

The authors have no conflicts to disclose.

David B. Williams-Young: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Norm M. Tubman: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Carlos Mejuto-Zaera: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). Wibe A. de Jong: Conceptualization (equal); Formal analysis (equal); Funding acquisition (lead); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal).

The data that support the findings of this study are available within the article.

1.
K. D.
Vogiatzis
,
D.
Ma
,
J.
Olsen
,
L.
Gagliardi
, and
W. A.
de Jong
, “
Pushing configuration-interaction to the limit: Towards massively parallel MCSCF calculations
,”
J. Chem. Phys.
147
,
184111
(
2017
).
2.
C. L.
Janssen
and
I. M.
Nielsen
,
Parallel Computing in Quantum Chemistry
(
2008
).
3.
W. A.
De Jong
,
E.
Bylaska
,
N.
Govind
,
C. L.
Janssen
,
K.
Kowalski
,
T.
Müller
,
I. M. B.
Nielsen
,
H. J. J.
van Dam
,
V.
Veryazov
, and
R.
Lindh
, “
Utilizing high performance computing for chemistry: Parallel computational chemistry
,”
Phys. Chem. Chem. Phys.
12
,
6896
6920
(
2010
).
4.
M. S.
Gordon
and
T. L.
Windus
, “
Editorial: Modern architectures and their impact on electronic structure theory
,”
Chem. Rev.
120
,
9015
9020
(
2020
).
5.
D.
Kothe
,
S.
Lee
, and
I.
Qualters
, “
Exascale computing in the United States
,”
Comput. Scie. Eng.
21
,
17
29
(
2019
).
6.
F.
Alexander
,
A.
Almgren
,
J.
Bell
,
A.
Bhattacharjee
,
J.
Chen
,
P.
Colella
,
D.
Daniel
,
J.
DeSlippe
,
L.
Diachin
,
E.
Draeger
,
A.
Dubey
,
T.
Dunning
,
T.
Evans
,
I.
Foster
,
M.
Francois
,
T.
Germann
,
M.
Gordon
,
S.
Habib
,
M.
Halappanavar
,
S.
Hamilton
,
W.
Hart
,
Z.
Huang
,
A.
Hungerford
,
D.
Kasen
,
P. R. C.
Kent
,
T.
Kolev
,
D. B.
Kothe
,
A.
Kronfeld
,
Y.
Luo
,
P.
Mackenzie
,
D.
McCallen
,
B.
Messer
,
S.
Mniszewski
,
C.
Oehmen
,
A.
Perazzo
,
D.
Perez
,
D.
Richards
,
W. J.
Rider
,
R.
Rieben
,
K.
Roche
,
A.
Siegel
,
M.
Sprague
,
C.
Steefel
,
R.
Stevens
,
M.
Syamlal
,
M.
Taylor
,
J.
Turner
,
J.-L.
Vay
,
A. F.
Voter
,
T. L.
Windus
, and
K.
Yelick
, “
Exascale applications: Skin in the game
,”
Philos. Trans. R. Soc., A
378
,
20190056
(
2020
).
7.
J. A.
Calvin
,
C.
Peng
,
V.
Rishi
,
A.
Kumar
, and
E. F.
Valeev
, “
Many-body quantum chemistry on massively parallel computers
,”
Chem. Rev.
121
,
1203
1231
(
2021
).
8.
U.
Schollwöck
, “
The density-matrix renormalization group
,”
Rev. Mod. Phys.
77
,
259
315
(
2005
).
9.
R.
Olivares-Amaya
,
W.
Hu
,
N.
Nakatani
,
S.
Sharma
,
J.
Yang
, and
G. K.-L.
Chan
, “
The ab-initio density matrix renormalization group in practice
,”
J. Chem. Phys.
142
,
034102
(
2015
).
10.
R. J.
Bartlett
and
M.
Musiał
, “
Coupled-cluster theory in quantum chemistry
,”
Rev. Mod. Phys.
79
,
291
352
(
2007
).
11.
E.
Solomonik
,
D.
Matthews
,
J. R.
Hammond
,
J. F.
Stanton
, and
J.
Demmel
, “
A massively parallel tensor contraction framework for coupled-cluster computations
,”
Journal of Parallel and Distributed Computing
74
,
3176
3190
(
2014
) part of Special Issue: Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
12.
G. H.
Booth
,
A. J. W.
Thom
, and
A.
Alavi
, “
Fermion Monte Carlo without fixed nodes: A game of life, death, and annihilation in slater determinant space
,”
J. Chem. Phys.
131
,
054106
(
2009
).
13.
C.-C.
Chang
,
B. M.
Rubenstein
, and
M. A.
Morales
, “
Auxiliary-field-based trial wave functions in quantum Monte Carlo calculations
,”
Phys. Rev. B
94
,
235144
(
2016
).
14.
M.
Motta
and
S.
Zhang
, “
Ab initio computations of molecular systems by the auxiliary-field quantum Monte Carlo method
,”
Wiley Interdiscip. Rev.: Comput. Mol. Sci.
8
,
e1364
(
2018
).
15.
A.
Szabo
and
N.
Ostlund
,
Modern Quantum Chemistry
(
Dover
,
1982
).
16. J. J. Eriksen, T. A. Anderson, J. E. Deustua, K. Ghanem, D. Hait, M. R. Hoffmann, S. Lee, D. S. Levine, I. Magoulas, J. Shen, N. M. Tubman, K. B. Whaley, E. Xu, Y. Yao, N. Zhang, A. Alavi, G. K.-L. Chan, M. Head-Gordon, W. Liu, P. Piecuch, S. Sharma, S. L. Ten-no, C. J. Umrigar, and J. Gauss, “The ground state electronic energy of benzene,” J. Phys. Chem. Lett. 11, 8922–8929 (2020).
17. F. A. Evangelista, “Adaptive multiconfigurational wave functions,” J. Chem. Phys. 140, 124114 (2014).
18. N. M. Tubman, J. Lee, T. Y. Takeshita, M. Head-Gordon, and K. B. Whaley, “A deterministic alternative to the full configuration interaction quantum Monte Carlo method,” J. Chem. Phys. 145, 044112 (2016).
19. A. A. Holmes, N. M. Tubman, and C. J. Umrigar, “Heat-bath configuration interaction: An efficient selected configuration interaction algorithm inspired by heat-bath sampling,” J. Chem. Theory Comput. 12, 3674–3680 (2016).
20. Y. Garniron, A. Scemama, P.-F. Loos, and M. Caffarel, “Hybrid stochastic-deterministic calculation of the second-order perturbative contribution of multireference perturbation theory,” J. Chem. Phys. 147, 034101 (2017).
21. P. M. Zimmerman, “Incremental full configuration interaction,” J. Chem. Phys. 146, 104102 (2017).
22. J. P. Coe, “Machine learning configuration interaction,” J. Chem. Theory Comput. 14, 5739–5749 (2018).
23. N. M. Tubman, D. S. Levine, D. Hait, M. Head-Gordon, and K. B. Whaley, “An efficient deterministic perturbation theory for selected configuration interaction methods,” arXiv:1808.02049 (2018).
24. E. Lesko, M. Ardiansyah, and K. R. Brorsen, “Vibrational adaptive sampling configuration interaction,” J. Chem. Phys. 151, 164103 (2019).
25. J. B. Schriber and F. A. Evangelista, “Time dependent adaptive configuration interaction applied to attosecond charge migration,” J. Chem. Phys. 151, 171102 (2019).
26. J. W. Park, “Second-order orbital optimization with large active spaces using adaptive sampling configuration interaction (ASCI) and its application to molecular geometry optimization,” J. Chem. Theory Comput. 17, 1522–1534 (2021).
27. C. Mejuto-Zaera, N. M. Tubman, and K. B. Whaley, “Dynamical mean field theory simulations with the adaptive sampling configuration interaction method,” Phys. Rev. B 100, 125165 (2019).
28. C. Mejuto-Zaera, L. Zepeda-Núñez, M. Lindsey, N. Tubman, B. Whaley, and L. Lin, “Efficient hybridization fitting for dynamical mean-field theory via semi-definite relaxation,” Phys. Rev. B 101, 035143 (2020).
29. N. M. Tubman, C. D. Freeman, D. S. Levine, D. Hait, M. Head-Gordon, and K. B. Whaley, “Modern approaches to exact diagonalization and selected configuration interaction with the adaptive sampling CI method,” J. Chem. Theory Comput. 16, 2139–2159 (2020).
30. D. S. Levine, D. Hait, N. M. Tubman, S. Lehtola, K. B. Whaley, and M. Head-Gordon, “CASSCF with extremely large active spaces using the adaptive sampling configuration interaction method,” J. Chem. Theory Comput. 16, 2340–2354 (2020).
31. V. Kremenetski, T. Hogg, S. Hadfield, S. J. Cotton, and N. M. Tubman, “Quantum alternating operator ansatz (QAOA) phase diagrams and applications for quantum chemistry,” arXiv:2108.13056 [quant-ph] (2021).
32. V. Kremenetski, C. Mejuto-Zaera, S. J. Cotton, and N. M. Tubman, “Simulation of adiabatic quantum computing for molecular ground states,” J. Chem. Phys. 155, 234106 (2021).
33. S. D. Pineda Flores, “Chembot: A machine learning approach to selective configuration interaction,” J. Chem. Theory Comput. 17, 4028–4038 (2021).
34. J. J. Goings, H. Hu, C. Yang, and X. Li, “Reinforcement learning configuration interaction,” J. Chem. Theory Comput. 17, 5482–5491 (2021).
35. A. U. Bhatty and K. R. Brorsen, “An alternative formulation of vibrational heat-bath configuration interaction,” Mol. Phys. 119, e1936250 (2021).
36. E. Epifanovsky, A. T. B. Gilbert, X. Feng, J. Lee, Y. Mao, N. Mardirossian, P. Pokhilko, A. F. White, M. P. Coons, A. L. Dempwolff, Z. Gan, D. Hait, P. R. Horn, L. D. Jacobson, I. Kaliman, J. Kussmann, A. W. Lange, K. U. Lao, D. S. Levine, J. Liu, S. C. McKenzie, A. F. Morrison, K. D. Nanda, F. Plasser, D. R. Rehn, M. L. Vidal, Z.-Q. You, Y. Zhu, B. Alam, B. J. Albrecht, A. Aldossary, E. Alguire, J. H. Andersen, V. Athavale, D. Barton, K. Begam, A. Behn, N. Bellonzi, Y. A. Bernard, E. J. Berquist, H. G. A. Burton, A. Carreras, K. Carter-Fenk, R. Chakraborty, A. D. Chien, K. D. Closser, V. Cofer-Shabica, S. Dasgupta, M. de Wergifosse, J. Deng, M. Diedenhofen, H. Do, S. Ehlert, P.-T. Fang, S. Fatehi, Q. Feng, T. Friedhoff, J. Gayvert, Q. Ge, G. Gidofalvi, M. Goldey, J. Gomes, C. E. González-Espinoza, S. Gulania, A. O. Gunina, M. W. D. Hanson-Heine, P. H. P. Harbach, A. Hauser, M. F. Herbst, M. Hernández Vera, M. Hodecker, Z. C. Holden, S. Houck, X. Huang, K. Hui, B. C. Huynh, M. Ivanov, Á. Jász, H. Ji, H. Jiang, B. Kaduk, S. Kähler, K. Khistyaev, J. Kim, G. Kis, P. Klunzinger, Z. Koczor-Benda, J. H. Koh, D. Kosenkov, L. Koulias, T. Kowalczyk, C. M. Krauter, K. Kue, A. Kunitsa, T. Kus, I. Ladjánszki, A. Landau, K. V. Lawler, D. Lefrancois, S. Lehtola, R. R. Li, Y.-P. Li, J. Liang, M. Liebenthal, H.-H. Lin, Y.-S. Lin, F. Liu, K.-Y. Liu, M. Loipersberger, A. Luenser, A. Manjanath, P. Manohar, E. Mansoor, S. F. Manzer, S.-P. Mao, A. V. Marenich, T. Markovich, S. Mason, S. A. Maurer, P. F. McLaughlin, M. F. S. J. Menger, J.-M. Mewes, S. A. Mewes, P. Morgante, J. W. Mullinax, K. J. Oosterbaan, G. Paran, A. C. Paul, S. K. Paul, F. Pavošević, Z. Pei, S. Prager, E. I. Proynov, Á. Rák, E. Ramos-Cordoba, B. Rana, A. E. Rask, A. Rettig, R. M. Richard, F. Rob, E. Rossomme, T. Scheele, M. Scheurer, M. Schneider, N. Sergueev, S. M. Sharada, W. Skomorowski, D. W. Small, C. J. Stein, Y.-C. Su, E. J. Sundstrom, Z. Tao, J. Thirman, G. J. Tornai, T. Tsuchimochi, N. M. Tubman, S. P. Veccham, O. Vydrov, J. Wenzel, J. Witte, A. Yamada, K. Yao, S. Yeganeh, S. R. Yost, A. Zech, I. Y. Zhang, X. Zhang, Y. Zhang, D. Zuev, A. Aspuru-Guzik, A. T. Bell, N. A. Besley, K. B. Bravaya, B. R. Brooks, D. Casanova, J.-D. Chai, S. Coriani, C. J. Cramer, G. Cserey, A. E. DePrince, R. A. DiStasio, A. Dreuw, B. D. Dunietz, T. R. Furlani, W. A. Goddard, S. Hammes-Schiffer, T. Head-Gordon, W. J. Hehre, C.-P. Hsu, T.-C. Jagau, Y. Jung, A. Klamt, J. Kong, D. S. Lambrecht, W. Liang, N. J. Mayhall, C. W. McCurdy, J. B. Neaton, C. Ochsenfeld, J. A. Parkhill, R. Peverati, V. A. Rassolov, Y. Shao, L. V. Slipchenko, T. Stauch, R. P. Steele, J. E. Subotnik, A. J. W. Thom, A. Tkatchenko, D. G. Truhlar, T. Van Voorhis, T. A. Wesolowski, K. B. Whaley, H. L. Woodcock, P. M. Zimmerman, S. Faraji, P. M. W. Gill, M. Head-Gordon, J. M. Herbert, and A. I. Krylov, “Software for the frontiers of quantum chemistry: An overview of developments in the Q-Chem 5 package,” J. Chem. Phys. 155, 084801 (2021).
37. J. W. Park, “Near-exact CASSCF-level geometry optimization with a large active space using adaptive sampling configuration interaction self-consistent field corrected with second-order perturbation theory (ASCI-SCF-PT2),” J. Chem. Theory Comput. 17, 4092–4104 (2021).
38. D.-K. Dang, J. A. Kammeraad, and P. M. Zimmerman, “Advances in parallel heat bath configuration interaction,” J. Phys. Chem. A 127, 400–411 (2023).
39. J. Li, M. Otten, A. A. Holmes, S. Sharma, and C. J. Umrigar, “Fast semistochastic heat-bath configuration interaction,” J. Chem. Phys. 149, 214110 (2018).
40. C. Mejuto-Zaera, G. Weng, M. Romanova, S. J. Cotton, K. B. Whaley, N. M. Tubman, and V. Vlček, “Are multi-quasiparticle interactions important in molecular ionization?,” J. Chem. Phys. 154, 121101 (2021).
41. B. Herzog, B. Casier, S. Lebègue, and D. Rocca, “Solving the Schrödinger equation in the configuration space with generative machine learning,” J. Chem. Theory Comput. 19, 2484–2490 (2023).
42. J. E. T. Smith, J. Lee, and S. Sharma, “Near-exact nuclear gradients of complete active space self-consistent field wave functions,” J. Chem. Phys. 157, 094104 (2022).
43. D. B. Chamaki, S. Hadfield, K. Klymko, B. O’Gorman, and N. M. Tubman, “Self-consistent quantum iteratively sparsified Hamiltonian method (SQuISH): A new algorithm for efficient Hamiltonian simulation and compression,” arXiv:2211.16522 [quant-ph] (2022).
44. D. Yoffe, A. Natan, and A. Makmal, “A qubit-efficient variational selected configuration-interaction method,” arXiv:2302.06691 [quant-ph] (2023).
45. J. P. Coe, “Analytic gradients for selected configuration interaction,” J. Chem. Theory Comput. 19, 874–886 (2023).
46. X. Wang and S. Sharma, “Relativistic semistochastic heat-bath configuration interaction,” J. Chem. Theory Comput. 19, 848–855 (2023).
47. K. Klymko, C. Mejuto-Zaera, S. J. Cotton, F. Wudarski, M. Urbanek, D. Hait, M. Head-Gordon, K. B. Whaley, J. Moussa, N. Wiebe, W. A. de Jong, and N. M. Tubman, “Real-time evolution for ultracompact Hamiltonian eigenstates on quantum hardware,” PRX Quantum 3, 020323 (2022).
48. N. M. Tubman, C. Mejuto-Zaera, J. M. Epstein, D. Hait, D. S. Levine, W. Huggins, Z. Jiang, J. R. McClean, R. Babbush, M. Head-Gordon, and K. B. Whaley, “Postponing the orthogonality catastrophe: Efficient state preparation for electronic structure simulations on quantum devices,” arXiv:1809.05523 [quant-ph] (2018).
49. K. Kanno, M. Kohda, R. Imai, S. Koh, K. Mitarai, W. Mizukami, and Y. O. Nakagawa, “Quantum-selected configuration interaction: Classical diagonalization of Hamiltonians in subspaces selected by quantum computers,” arXiv:2302.11320 [quant-ph] (2023).
50. J. W. Mullinax and N. M. Tubman, “Large-scale sparse wavefunction circuit simulator for applications with the variational quantum eigensolver,” arXiv:2301.05726 [quant-ph] (2023).
51. M. R. Hirsbrunner, D. Chamaki, J. W. Mullinax, and N. M. Tubman, “Beyond MP2 initialization for unitary coupled cluster quantum circuits,” arXiv:2301.05666 [quant-ph] (2023).
52. E. Solomonik and L. V. Kalé, “Highly scalable parallel sorting,” in 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS) (IEEE, 2010), pp. 1–12.
53. M. Axtmann, S. Witt, D. Ferizovic, and P. Sanders, “In-place parallel super scalar samplesort (IPS4o),” arXiv:1705.02257 [cs.DC] (2017).
54. N. Bell and J. Hoberock, “Thrust: A productivity-oriented library for CUDA,” in GPU Computing Gems Jade Edition (Elsevier, 2011), pp. 359–371.
55. J. S. Spencer, N. S. Blunt, S. Choi, J. Etrych, M.-A. Filip, W. M. C. Foulkes, R. S. T. Franklin, W. J. Handley, F. D. Malone, V. A. Neufeld, R. Di Remigio, T. W. Rogers, C. J. C. Scott, J. J. Shepherd, W. A. Vigor, J. Weston, R. Xu, and A. J. W. Thom, “The HANDE-QMC project: Open-source stochastic quantum chemistry from the ground state up,” J. Chem. Theory Comput. 15, 1728–1742 (2019).
56. M. Axtmann, T. Bingmann, P. Sanders, and C. Schulz, “Practical massively parallel sorting,” in Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’15 (Association for Computing Machinery, New York, NY, 2015), pp. 13–23.
57. M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, “Time bounds for selection,” J. Comput. Syst. Sci. 7, 448–461 (1973).
58. R. W. Floyd and R. L. Rivest, “Expected time bounds for selection,” Commun. ACM 18, 165–172 (1975).
59. S. G. Akl, “An optimal algorithm for parallel selection,” Inf. Process. Lett. 19, 47–50 (1984).
60. J. Choi, J. Demmel, I. S. Dhillon, J. J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. W. Walker, and R. C. Whaley, “ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance,” in Applied Parallel Computing: Computations in Physics, Chemistry and Engineering Science, Second International Workshop, PARA ’95, Lyngby, Denmark, August 21–24, 1995, Proceedings, Lecture Notes in Computer Science Vol. 1041, edited by J. J. Dongarra, K. Madsen, and J. Wasniewski (Springer, 1995), pp. 95–106.
61. ELPA, 2023.
62. P. Kus, A. Marek, S. S. Koecher, H. H. Kowalski, C. Carbogno, C. Scheurer, K. Reuter, M. Scheffler, and H. Lederer, “Optimizations of the eigensolvers in the ELPA library,” Parallel Comput. 85, 167–177 (2019).
63. E. R. Davidson, “The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices,” J. Comput. Phys. 17, 87–94 (1975).
64. A. V. Knyazev, “Preconditioned eigensolvers—An oxymoron,” Electron. Trans. Numer. Anal. 7, 104–123 (1998).
65. A. V. Knyazev, “Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method,” SIAM J. Sci. Comput. 23, 517–541 (2001).
66. J. K. Cullum and R. A. Willoughby, Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Theory Vol. I (SIAM, 2002).
67. Y. Saad, “Krylov subspace methods on supercomputers,” SIAM J. Sci. Stat. Comput. 10, 1200–1232 (1989).
68. A. Stathopoulos and C. Fischer, “Reducing synchronization on the parallel Davidson method for the large, sparse, eigenvalue problem,” in Supercomputing ’93: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing (IEEE, 1993), pp. 172–180.
69. L. Borges and S. Oliveira, “A parallel Davidson-type algorithm for several eigenvalues,” J. Comput. Phys. 144, 727–748 (1998).
70. M. Nool and A. van der Ploeg, “A parallel Jacobi–Davidson-type method for solving large generalized eigenvalue problems in magnetohydrodynamics,” SIAM J. Sci. Comput. 22, 95–112 (2000).
71. E. Romero and J. E. Roman, “A parallel implementation of the Davidson method for generalized eigenproblems,” Adv. Parallel Comput. 19, 133–140 (2010).
72. E. Romero and J. E. Roman, “A parallel implementation of Davidson methods for large-scale eigenvalue problems in SLEPc,” ACM Trans. Math. Software 40, 13 (2014).
73. J. A. Duersch, M. Shao, C. Yang, and M. Gu, “A robust and efficient implementation of LOBPCG,” SIAM J. Sci. Comput. 40, C655–C676 (2018).
74. R. Van Beeumen, G. D. Kahanamoku-Meyer, N. Y. Yao, and C. Yang, “A scalable matrix-free iterative eigensolver for studying many-body localization,” in Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia2020 (Association for Computing Machinery, New York, NY, 2020), pp. 179–187.
75. R. Van Beeumen, K. Z. Ibrahim, G. D. Kahanamoku-Meyer, N. Y. Yao, and C. Yang, “Enhancing scalability of a matrix-free eigensolver for studying many-body localization,” Int. J. High Perform. Comput. Appl. 36, 307–319 (2022).
76. M. Hoemmen, “Communication-avoiding Krylov subspace methods,” Ph.D. thesis (University of California, Berkeley, 2010).
77. S. Balay, S. Abhyankar, M. F. Adams, S. Benson, J. Brown, P. Brune, K. Buschelman, E. Constantinescu, L. Dalcin, A. Dener, V. Eijkhout, J. Faibussowitsch, W. D. Gropp, V. Hapla, T. Isaac, P. Jolivet, D. Karpeev, D. Kaushik, M. G. Knepley, F. Kong, S. Kruger, D. A. May, L. C. McInnes, R. T. Mills, L. Mitchell, T. Munson, J. E. Roman, K. Rupp, P. Sanan, J. Sarich, B. F. Smith, S. Zampini, H. Zhang, H. Zhang, and J. Zhang, PETSc/TAO Users Manual, Technical Report No. ANL-21/39, Revision 3.18 (Argonne National Laboratory, 2022).
78. M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak, and V. Sarkar, “A survey of sparse matrix-vector multiplication performance on large matrices,” arXiv:1608.00636 [cs.PF] (2016).
79. R. H. Bisseling, “Sparse matrix–vector multiplication,” in Parallel Scientific Computation: A Structured Approach Using BSP (Oxford University Press, 2020).
80. F. Weigend and R. Ahlrichs, “Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy,” Phys. Chem. Chem. Phys. 7, 3297 (2005).
81. H.-J. Werner, P. J. Knowles, G. Knizia, F. R. Manby, and M. Schütz, “Molpro: A general-purpose quantum chemistry program package,” Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2, 242–253 (2012).
82. H.-J. Werner, P. J. Knowles, F. R. Manby, J. A. Black, K. Doll, A. Heßelmann, D. Kats, A. Köhn, T. Korona, D. A. Kreplin, Q. Ma, T. F. Miller, A. Mitrushchenkov, K. A. Peterson, I. Polyak, G. Rauhut, and M. Sibaev, “The Molpro quantum chemistry package,” J. Chem. Phys. 152, 144107 (2020).
83. G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM J. Sci. Comput. 20, 359–392 (1998).
84. E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” in Proceedings of the 1969 24th National Conference, ACM ’69 (Association for Computing Machinery, New York, NY, 1969), pp. 157–172.