The many-body simulation of quantum systems is an active field of research that involves several different methods targeting various computing platforms. Many commonly employed methods, particularly coupled cluster methods, have been adapted to leverage the latest advances in modern high-performance computing. Selected configuration interaction (sCI) methods have seen extensive usage and development in recent years; however, the development of sCI methods targeting massively parallel resources has been explored in only a few research works. Here, we present a parallel, distributed memory implementation of the adaptive sampling configuration interaction (ASCI) approach for sCI. In particular, we address the key concerns pertaining to the parallelization of the determinant search and selection, Hamiltonian formation, and the variational eigenvalue calculation for the ASCI method. Load balancing in the search step is achieved through the application of memory-efficient determinant constraints originally developed for the ASCI-PT2 method. The presented benchmarks demonstrate near-optimal speedup for ASCI calculations of Cr_{2} (24e, 30o) with 10^{6}, 10^{7}, and 3 × 10^{8} variational determinants on up to 16 384 CPUs. To the best of the authors' knowledge, this is the largest variational ASCI calculation to date.

## I. INTRODUCTION

The ability of *ab initio* quantum chemistry to solve challenging science problems strongly relies on its success in leveraging the latest advances in modern high-performance computing. As such, the development of massively parallel algorithms for quantum chemistry methods targeting large supercomputers has been of significant interest in recent years.^{1–6} In particular, many-body methods hold some of the highest potential for exploiting concurrency on modern machines (we refer the reader to Ref. 7 and references therein for a recent comprehensive review). There are several competitive methods in quantum chemistry used for many-body simulation, including density matrix renormalization group (DMRG),^{8,9} coupled cluster,^{10,11} auxiliary field quantum Monte Carlo,^{12–14} and several others.^{15,16} It is an active field of research to understand the pros and cons of these various methods,^{16} and almost all of them are being actively developed.

Recently, the development of selected configuration interaction (sCI) methods has seen a renewed interest, and there has been a push to adapt the approach to modern computing architectures. This interest has been, in large part, driven by recent advances in the core components of sCI algorithms,^{17–19} which have, in turn, extensively broadened the reach of their applications.^{16,20–46} In particular, some of the newer applications include simulation of vibrational/nuclear degrees of freedom,^{24,35} dynamical mean-field theory,^{27,28} time evolution,^{25,31,32,47} Hamiltonian compression,^{43} new algorithms for quantum computing and benchmarking,^{44,48,49} machine learning approaches,^{22,33,41} and large-scale simulation of quantum circuits.^{50,51} While there has been much interest in sCI, there are still a number of technical aspects that remain open research problems within the realm of high-performance computing. The lack of parallel sCI implementations was evident in a recent comparison of different algorithms for the simulation of the benzene molecule:^{16} several sCI methods were considered, but none of them used anywhere close to the number of CPU hours that the most accurate methods required. Although there has been at least one simulation of sCI using more than a billion determinants,^{39} simulations of this scale have not been widely performed. This leads to the obvious question of what the limits of sCI approaches are when used in conjunction with modern high-performance computing resources. In this work, we present a parallel, distributed memory implementation of the adaptive sampling configuration interaction (ASCI) method^{18} for sCI simulations.

The ASCI algorithm was developed with the idea of making approximations to various aspects of sCI algorithms. The original ASCI paper^{18} included approximations to the search algorithm that reduced the number of search terms while maintaining highly accurate results. Soon after, the heat-bath CI algorithm made further approximations to the search algorithm while still maintaining reasonable results in many systems.^{19} Initial results on Cr_{2} suggested that these sCI algorithms are competitive with the most accurate contemporary many-body methods, including DMRG. Many further developments increased the efficiency of different parts of the algorithm, including the development of a fast perturbation step with a deterministic constraint approach (ASCI-PT2).^{23} In this paper, we look to turn the fast ASCI-PT2 algorithm into a massively parallel search step while also developing a scalable parallel diagonalization step. These make up the variational part of the sCI algorithm, and thus we hope to advance the parallel capabilities of variational sCI.

While single-core and single-node approaches for sCI have been explored extensively with a focus on making use of modern computational tools,^{29} there has been much less focus on massive parallelism and the development of distributed memory algorithms. Most sCI implementations can make use of shared memory parallelism^{29,36,39} during the computationally heavy parts of the simulation. In particular, the use of multi-threaded^{29,52,53} and GPU-accelerated^{23,29,54} sorting techniques, which form the primary algorithmic motif of the ASCI method, has shown great promise on modern computing architectures. One recent study^{38} examined a distributed memory implementation of the heat-bath configuration interaction (HCI) method, particularly focusing on the details pertaining to the strong scaling of the second-order perturbative corrections (PT2), which dominated their simulation. While this study was able to exhibit excellent strong scaling for the PT2 step in HCI, the challenges associated with this development are somewhat unique to HCI and are not easily extended to other sCI algorithms. The ASCI-PT2 algorithm, which was first presented in 2018, uses triplet constraints to facilitate the division of parallel tasks.^{23} Some of the new ideas in the recent parallel HCI^{38} approach are used to overcome issues that are bottlenecks only for HCI PT2 approaches^{39} and are not directly applicable to ASCI-PT2 with triplet constraints. An even bigger issue that we address in this paper is the variational part of the sCI algorithm, which appears to be a major parallel bottleneck for all the sCI approaches we are aware of. This is evident in the parallel HCI paper, as their method exhibited much less favorable strong scaling for the variational component of the sCI calculation.
Thus, the primary focus of the following developments will be on the variational component of the ASCI algorithm, of which a fair amount can be applied to other sCI algorithms as well.

The remainder of this work will be organized as follows: In Sec. II A, we review the salient features of the ASCI method relevant to the development of parallel algorithms. Section II B presents a work partitioning and load-balancing scheme relevant to the development of scalable determinant selection on massively parallel architectures. Section II C addresses key considerations for the development of parallel algorithms for the solution of large-scale eigenvalue problems and the parallel construction of sparse Hamiltonian matrices within the sCI formalism. We examine the performance and scalability of the proposed methods for a challenging test case (Cr_{2}) in Sec. III and discuss future research directions in Sec. IV.

## II. METHODS

The notation and common abbreviations used in this work are summarized in Table I.

| Symbol | Explanation |
|---|---|
| ψ_{k} | The wave function in the current ASCI step |
| C_{i} | The coefficient of the ith determinant D_{i} of ψ_{k} |
| D_{i} | The ith determinant in ψ_{k} |
| $\tilde{D}_j$ | The jth determinant not in ψ_{k} |
| {D^{sd}} | The set of all single and double excitations connected to ψ_{k} |
| $\{D_i^{sd}\}$ | The set of all single and double excitations connected to the determinant D_{i} |
| N_{tdets} | Number of determinants in the current wave function |
| N_{cdets} | Core space size used for pruning in ASCI |
| PE | Processing element (independent compute context) |


### A. Adaptive sampling configuration interaction

The details and efficacy of the ASCI algorithm are presented elsewhere. The initial implementation can be found in Ref. 18, and many details and improvements can be found in Ref. 29. The general schematic for the ASCI algorithm is given in Algorithm 1. In general, the ASCI algorithm, like most other sCI algorithms, consists of three major steps:

1. Hamiltonian construction and diagonalization (eigenvalue problem),
2. determinant selection (search), and
3. perturbative (PT2) correction to the energy (optional).

1: Input: Start with a Hartree-Fock simulation
2: Output: Ground state energy of the system
3: Create a starting wave function, which can be a Hartree-Fock determinant
4: ASCI search (find important contributions not currently in the wave function)
5: Sort and select the top determinants from the ASCI search
6: Diagonalize the Hamiltonian in the selected determinant space
7: Return to step 3, but now use the ground state result of the diagonalization as the starting wave function; repeat until the stopping criterion is reached
8: Run a final ASCI-PT2 step (to improve the energy further)


The first two steps are typically referred to as the variational steps for the sCI algorithm and will be the primary focus of this work. However, in the development of parallel algorithms for the ASCI search step (Sec. II B), we will extend a work partitioning scheme originally developed for the ASCI-PT2 method.^{23} For the variational steps, sCI algorithms primarily differ in their treatment of the determinant search, whereas once a set of determinants has been selected, the Hamiltonian construction and subsequent diagonalization steps are largely the same. As such, the parallel algorithms developed in this work for the latter can straightforwardly be applied to other sCI algorithms as well. However, the details of the search step, which must be carefully considered in the development of scalable parallel algorithms, are typically unique to each sCI algorithm, and, thus, not straightforwardly extended between methods.

With a few exceptions,^{22,33,34,41} the vast majority of sCI algorithms use the following formula to generate a ranking value for a determinant not currently included in the trial wave function *ψ*_{k}:

$$S_i = \sum_j S_i^{(j)}, \qquad S_i^{(j)} = \frac{H_{ij}\,C_j}{E_k - H_{ii}}. \tag{1}$$

Here, *H* is the many-body molecular Hamiltonian, *H*_{ij} is a matrix element of *H* between Slater determinants, *C*_{j} is the coefficient for *D*_{j} ∈ *ψ*_{k}, and *E*_{k} is the variational energy associated with *ψ*_{k}. *S*_{i} is the metric ("score") by which we will determine which determinants to add to the sCI subspace, and $S_i^{(j)}$ is the *j*th partial score that describes the contribution of *D*_{j} to *S*_{i}. The contracted index, *j*, runs over *D*_{j} ∈ *ψ*_{k}, whereas the free index, *i*, runs over $\tilde{D}_i \notin \psi_k$. In principle, *i* runs over the entire Hilbert space in the complement of the determinants in *ψ*_{k}. However, the elements *H*_{ij} are sparse, and the maximally two-body nature of *H* for molecular Hamiltonians imposes $\tilde{D}_i \in \{D^{sd}\}$, where {*D*^{sd}} is the set of determinants that are singly or doubly connected to a determinant in *ψ*_{k}. In practice, not even all determinants in {*D*^{sd}} need be considered, since there exists considerable numerical sparsity beyond that imposed solely by connections in the Hamiltonian. Thus, while evaluating Eq. (1) is theoretically straightforward, in practice it is important to identify the non-zero terms *a priori*, because the large majority of terms are zero. In addition, it is important to use and store only the non-zero terms, since the memory required to hold all terms in these sums becomes extremely large when pushing to a large sCI calculation. Previous papers have reported the number of such connections for various examples: a previous ASCI study of a Cr_{2} example considered over 282 × 10^{9} connection terms, all generated on a single core,^{23} and in the recent parallel HCI paper, the authors were able to simulate over 89.5 × 10^{9} connection terms. If all of these terms had to be stored in memory simultaneously, the memory requirements would make these approaches unfeasible.

Due to the immense cost of calculating *H*_{ij} and *S*_{i}, small differences in data structures and algorithmic design choices can yield large impacts on overall performance. Since its inception, the design approach of the ASCI algorithm has centered around the use of modern computing architectures. In particular, the ASCI method has adopted sorting as its primary algorithmic motif due to the favorable scaling and performance of sorting algorithms on contemporary CPU and GPU hardware.^{29} However, development of the ASCI algorithm has thus far focused on shared memory computing environments: almost all ASCI simulations that we are aware of have been performed on a single node, with the exception of a small-scale MPI test with parallel ASCI-PT2.^{23} In Sec. II B, we examine the extension of the ASCI sorting motif to distributed memory compute environments and the emergent challenges with load balancing on massively parallel architectures.

### B. Parallel determinant search

The general procedure for the ASCI determinant search is given in Algorithm 2. The search proceeds by considering single and double connections from a set of dominant configurations in the trial wave function,^{29} which we will refer to as the "core" determinants in the following. The number of core determinants to consider, *N*_{cdets}, is a tunable parameter of the ASCI algorithm; the core space is typically formed by selecting the *N*_{cdets} determinants in *ψ*_{k} with the largest coefficient magnitudes. In a naive approach, one determines *a priori* the {*D*^{sd}} associated with the core determinants, which we will refer to as the *target* determinants in the following, and subsequently calculates the contribution of each core determinant to each target configuration via Eq. (1). In many practical simulations, it is better to generate the possible target determinants as singles and doubles from each core configuration, rather than to loop over all possible target configurations and then check for connections with the core determinants. The latter procedure generally performs excessive work, as the majority of the core determinants contribute to only a few target determinants. In the shared memory ASCI algorithm, a single pass is taken over the core determinants, and the associated $\{D_j^{sd}\}$ and $S_i^{(j)}$ for $\tilde{D}_i \in \{D_j^{sd}\}$ are generated on the fly. If $|S_i^{(j)}| > \epsilon_{\mathrm{search}}$, where *ϵ*_{search} is a tunable threshold, the contribution is appended to an array as a pair $(\tilde{D}_i, S_i^{(j)})$, which we will refer to as an ASCI pair in the following. For molecular calculations, bit-strings are typically the data structure of choice for the representation and manipulation of many-body determinants, as they allow for fast bit operations in, e.g., calculating matrix elements on modern computing architectures.^{17,29,36,38,55} As such, the ASCI pairs can be stored in a simple data structure consisting of the bit-string representing $\tilde{D}_i$ and the partial score $S_i^{(j)}$.
Once all pairs have been generated after passing over each core configuration, the array is sorted on the determinant bit-string, which, in turn, brings all partial scores associated with a particular target determinant contiguously in memory. With this new ordering, scores for individual target determinants can be straightforwardly assembled via Eq. (1). We refer to this procedure as "sort-and-accumulate" in the following.
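The sort-and-accumulate step just described is compact enough to sketch directly. The following is a minimal Python sketch (the function name and the representation of ASCI pairs as `(bit-string, partial score)` tuples are our own illustration, not taken from the ASCI codebase):

```python
from itertools import groupby
from operator import itemgetter

def sort_and_accumulate(pairs):
    """Sort ASCI pairs on the determinant bit-string, then sum the
    partial scores S_i^(j) of each unique target determinant into
    its full score S_i."""
    # Sorting on the bit-string brings all partial scores of a given
    # target determinant contiguously together.
    pairs = sorted(pairs, key=itemgetter(0))
    return {det: sum(score for _, score in group)
            for det, group in groupby(pairs, key=itemgetter(0))}
```

In the constraint-partitioned algorithm described below, the same operation is applied independently to the pair array of each constraint class, which is what keeps the intermediate arrays small.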

1: Input: Trial wave function determinants ψ_{k} and coefficients C; search configuration cutoff N_{cdets}; max wave function size N_{tdets}; ASCI pair threshold ϵ_{search}
2: Output: New ASCI determinants ψ_{k+1}
3: ψ_{c} ← the N_{cdets}-subset of ψ_{k} with largest coefficients
4: P ← [ ] ⊳ ASCI pairs
5: for D_{j} ∈ ψ_{c} do
6:   $\{D_j^{sd}\}$ ← single and double connections from D_{j}
7:   for $\tilde{D}_i \in \{D_j^{sd}\}$ do
8:     $S_i^{(j)}$ ← Eq. (1)
9:     if $|S_i^{(j)}| > \epsilon_{\mathrm{search}}$ then
10:      $P \leftarrow [P, (\tilde{D}_i, S_i^{(j)})]$
11:    end if
12:  end for
13: end for
14: P ← sort P on $\tilde{D}_i$
15: $P \leftarrow \sum_j S_i^{(j)}$ for each unique i ⊳ each unique p ∈ P now contains a complete S_{i}
16: P ← sort P on S_{i}
17: return ψ_{k+1} ← top-N_{tdets} determinants in P


Even for relatively aggressive *ϵ*_{search} cutoffs, this approach can constitute a large memory footprint for larger *N*_{cdets}. In the original ASCI method, a threshold for the maximum number of ASCI pairs to store in memory was set, and the pair list would be occasionally pruned to ensure memory compactness. This pruning procedure has been demonstrated to provide approximate scores of sufficient accuracy to successfully rank determinants for the sCI wave function.^{29} A similar approach could be taken for the calculation of the PT2 energy, *E*_{PT2}, which is closely related to the scores of Eq. (1).^{20,23,39} However, for accurate calculations of *E*_{PT2}, one cannot employ the aggressive cutoffs and pruning schemes employed to ensure memory compactness in the ASCI search. In the development of the ASCI-PT2 method,^{23} a solution was developed that allows for memory compactness while still leveraging the powerful nature of sorting algorithms on modern computing architectures. In this procedure, the set of target configurations is partitioned according to a constraint identified by their largest occupied orbitals. In the ASCI-PT2 study, the target determinants were classified by triplet constraints, i.e., by their three highest-occupied orbitals. These constraints generate non-overlapping subsets of determinants such that every possible determinant belongs to exactly one constraint class.

At first glance, this approach appears similar to the naive approach for ASCI score calculation, as it might imply that one needs to know the entire {*D*^{sd}} associated with the core determinants in order to partition the target determinants *a priori*. However, the major realization of ASCI-PT2 was that one can determine, for a particular core configuration, which of the target determinants arising from its excitations belong to a particular constraint class. As such, in a similar manner to the original ASCI search algorithm, one can make multiple passes over the core configurations, one for each constraint, and generate the associated target configurations on the fly. The power of this procedure is that the sort-and-accumulate steps can now be done at the constraint level, i.e., ASCI pairs can be generated and aggregated for each constraint and, due to the non-overlapping partition generated by the individual constraints, all partial scores associated with the target determinants belonging to that class are guaranteed to be generated by the end of each pass over the core configurations. As the number of unique determinant scores is much smaller than the number of partial score contributions, this procedure exhibits much lower memory usage than the original ASCI search algorithm.
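The constraint classification itself is inexpensive: for a bit-string determinant, the triplet constraint is simply its three highest set bits. A minimal sketch, assuming single-word integer bit-strings and our own (hypothetical) helper names:

```python
def triplet_constraint(det):
    """Return the triplet constraint of a bit-string determinant,
    i.e., the indices of its three highest occupied orbitals."""
    orbs = []
    while det and len(orbs) < 3:
        hi = det.bit_length() - 1   # highest set bit = highest occupied orbital
        orbs.append(hi)
        det &= ~(1 << hi)           # clear it and continue with the next highest
    return tuple(orbs)

def partition_by_constraint(dets):
    """Group determinants into their (non-overlapping) constraint classes."""
    classes = {}
    for d in dets:
        classes.setdefault(triplet_constraint(d), []).append(d)
    return classes
```

Because every determinant has exactly one triple of highest occupied orbitals, the resulting classes are disjoint and cover all target determinants, which is the property that guarantees complete scores at the end of each constraint pass.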

To proceed with the development of a parallel ASCI search algorithm, we consider a similar approach to ASCI-PT2, where the work associated with the generation of pairs for individual constraints is assigned to different processing elements (PEs) in the distributed memory environment. To ensure the scalability of the pair generation and subsequent sort-and-accumulate steps, we have developed the static load balancing procedure presented in Algorithm 3. Before any pairs are generated for the ASCI search, the number of determinants associated with each constraint is precomputed on the fly to give a rough estimate of the work per constraint. This contribution count is then used as a heuristic to assign work to individual PEs. It is important to note that the number of contributions is not an exact measure of the work associated with a particular constraint, because contributions are screened out according to *ϵ*_{search}. However, for the ASCI search, this procedure has been demonstrated to generate sufficiently balanced work (see Sec. III). The primary challenge of this work distribution scheme stems from the large variance of the work associated with any particular constraint, i.e., it is often the case that a few constraints yield a disproportionate number of target determinants. As such, at a particular level of constraint (e.g., triplets), there always exists an amount of indivisible work that the load balancing procedure cannot distribute over PEs. The work associated with a particular constraint can be further divided by considering additional constraints of higher order, e.g., triplets can be broken up into quadruplets and quadruplets into quintuplets. However, the number of constraints at a particular level grows polynomially with the number of single-particle orbitals, e.g., $O(n_{orb}^3)$ for triplets and $O(n_{orb}^4)$ for quadruplets.
Since, for each constraint, there is an inner loop over the *N*_{cdets} core configurations, it is highly desirable to limit the number of considered constraints. As such, it would be highly inefficient to simply promote all constraints to a higher level whenever the work at a particular constraint level exhibits excessive load imbalance. Instead, it is advantageous to divide only those constraints expected to carry excessive work into higher constraint levels. We will demonstrate the efficacy of this partitioning scheme in Sec. III.
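The static assignment at the heart of Algorithm 3 amounts to greedy bin packing: each constraint, weighted by its estimated number of contributions, goes to the currently least-loaded PE. A minimal serial sketch (our own function name; in the actual algorithm, every PE runs this loop redundantly and keeps only the constraints assigned to itself):

```python
def assign_constraints(work, n_pes):
    """Greedy static load balancing: `work` maps each constraint to its
    estimated number of contributions; constraints with no work are
    dropped, and the rest go to the least-loaded PE."""
    loads = [0] * n_pes
    owner = {}
    for constraint, h in work.items():
        if h == 0:
            continue                 # no target determinants under this constraint
        q = min(range(n_pes), key=loads.__getitem__)   # least-loaded PE
        owner[constraint] = q
        loads[q] += h
    return owner, loads
```

When a single constraint dominates `work`, no assignment can balance the loads, which is exactly the situation that motivates subdividing it into higher-order (quadruplet or quintuplet) constraints.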

1: Input (global): Dominant core configurations ψ_{c}; max constraint level L
2: Output (local): Local constraints C_{loc}
3: W ← an array of the size of the execution context (# PEs) ⊳ initialize to zero
4: p ← index of this PE in the execution context
5: C_{loc} ← [ ]
6: C_{t} ← all unique triplet constraints
7: for C ∈ C_{t} do
8:   h ← 0
9:   for D_{j} ∈ ψ_{c} do
10:    s ← singly connected determinants to D_{j} satisfying C
11:    d ← doubly connected determinants to D_{j} satisfying C
12:    h ← h + |s| + |d|
13:  end for
14:  if h > 0 then
15:    q ← arg min_{i} W_{i}
16:    W_{q} ← W_{q} + h ⊳ update the work tally of the assigned PE
17:    if p = q then
18:      C_{loc} ← [C_{loc}, C]
19:    end if
20:  end if
21: end for
22: return C_{loc}


Given the locally constructed arrays of string–score pairs generated as a result of the constraint partitioning, the remaining aspect of the parallel ASCI search is to determine the dominant *N*_{tdets} determinants to keep for the next iteration. In principle, this can be achieved by performing a full sort of the pairs on the absolute value of their scores. For an array *A* such that |*A*| = *n*, full sorting of *A* scales *O*(*n* log *n*), which is generally unfavorable for the large sizes of arrays considered in this work (as would be generated by large values of *N*_{cdets}). For distributed memory implementations, this problem would be exacerbated due to the communication overhead of existing distributed memory sort algorithms.^{52,56} For the vast majority of properties involving selected-CI wave functions, such as energies and density matrices, the *order* in which the determinant coefficients appear in the CI vector is irrelevant and all computed quantities will be invariant to arbitrary permutations *C* ← *PC*. As such, the sorting problem can be replaced with the *selection* problem, which can be used to determine the largest (or smallest) *k* ≤ *n* values of an array with *O*(*n*) complexity.^{57,58} In addition, selection algorithms can be performed in parallel with nearly optimal speedup without having to communicate significant segments of *A*.^{59} Rather than obtaining an absolute ordering of its input data, selection algorithms are designed to determine a *pivot*, *a*_{k} ∈ *A*, such that |*A*^{g}| = *k*, where *A*^{g} = {*a* > *a*_{k} | *a* ∈ *A*}. In cases where *a*_{k} appears multiple times in *A*, this definition can be extended to indicate |*A*^{g}| < *k* ≤ |*A*^{g} ∪ *A*^{e}|, where *A*^{e} ⊂ *A* is the indexed subset containing the duplicate elements of *a*_{k}. We outline a duplicate-aware, distributed memory extension of the quickselect algorithm, with expected *O*(*n*) runtime, in Algorithm 4. 
For the ASCI search problem, Algorithm 4 is used to determine the contribution pair with the *N*_{tdets}-largest score, *p*_{tdets}. We may then determine the contribution pairs with scores larger than or equal to that of *p*_{tdets} and subsequently gather them (collectively) to all participating PEs [via, e.g., MPI_Allgather(v)]. We examine the performance of this scheme, which introduces the only source of distributed memory communication in the presented ASCI search method, in Sec. III.
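A serial, duplicate-aware analogue of Algorithm 4 is compact; the following Python sketch uses the same greater/equal/less partitioning, and the distributed version simply replaces the local partition counts with Allreduce sums:

```python
import random

def kth_largest(A, k):
    """Return a_k, the k-th largest element of A (1-indexed), by
    repeatedly partitioning around a random pivot (quickselect)."""
    while True:
        pivot = random.choice(A)
        greater = [a for a in A if a > pivot]       # A^g
        n_equal = sum(1 for a in A if a == pivot)   # |A^e|
        if k <= len(greater):
            A = greater                  # a_k lies among the strictly larger elements
        elif k <= len(greater) + n_equal:
            return pivot                 # duplicate-aware: the pivot itself is a_k
        else:
            k -= len(greater) + n_equal
            A = [a for a in A if a < pivot]          # A^l
```

The expected runtime is *O*(*n*), in contrast to the *O*(*n* log *n*) of a full sort, and only the pivot and partition sizes, never large segments of the array, need to be exchanged in the distributed setting.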

1: Input: Local part A_{I} of an array A with |A| = n; element index k ∈ [1, n]
2: Output: The element a_{k} (global) such that |{a > a_{k} | a ∈ A}| = k
3: ⊳ All operations are local unless otherwise specified
4: rank ← PE rank
5: n_{I} ← |A_{I}|
6: N ← Allreduce(n_{I}) ⊳ collective
7: while N ≥ n_{min} do
8:   p ← select a random pivot index in [0, N)
9:   Determine the owner of a_{p} and broadcast a_{p} to the other PEs ⊳ collective
10:  $A_I^g, A_I^e, A_I^l \leftarrow \mathrm{partition}(A_I, a_p)$
11:  $G_I, E_I, L_I \leftarrow |A_I^g|, |A_I^e|, |A_I^l|$
12:  {G, E, L} ← Allreduce({G_{I}, E_{I}, L_{I}}) ⊳ collective
13:  if k ≤ G then
14:    $A_I \leftarrow A_I^g$
15:  else if k ≤ G + E then
16:    a_{k} ← a_{p}
17:    return a_{k}
18:  else
19:    $A_I \leftarrow A_I^l$
20:    k ← k − G − E
21:  end if
22:  n_{I} ← |A_{I}|
23:  N ← Allreduce(n_{I}) ⊳ collective
24: end while
25: A_{rem} ← Gather(A_{I}) ⊳ collective (PE 0)
26: if rank = 0 then
27:   a_{k} ← SerialQuickselect(A_{rem}) ⊳ std::nth_element
28: end if
29: Broadcast a_{k} ⊳ collective
30: return a_{k}


### C. Parallel eigensolver

After each search iteration of the ASCI procedure, we must obtain the ground state eigenpair of the many-body Hamiltonian projected onto the basis of newly selected determinants. The large basis dimensions (*N*_{tdets}) employed in accurate sCI applications preclude the use of direct eigensolvers [e.g., those implemented in dense linear algebra libraries such as (Sca)LAPACK^{60} and ELPA^{61,62}] due to their *O*(*N*^{2}) dense storage requirement and steep *O*(*N*^{3}) scaling with problem dimension. As such, Krylov-subspace methods, such as Davidson,^{63} LOBPCG,^{64,65} and Lanczos,^{66} are typically employed. The development of efficient and scalable Krylov methods on distributed memory computers is challenging due to the existence of many synchronization points arising from the serial nature of subspace construction. Significant research effort has been afforded to the development of distributed memory Krylov eigenvalue methods across many scientific computing domains.^{67–76} In this work, due to the diagonally dominant nature of the Hamiltonian, we consider the parallel implementation of the Davidson method for sCI applications, although many of the same principles can be extended to other Krylov methods such as Lanczos or LOBPCG.

Given an algorithm to produce the action of the Hamiltonian onto a trial vector (*σ* formation), we present the general scheme for the distributed memory Davidson method in Algorithm 5. The algorithm presented is general to an arbitrary preconditioner, although we will always adopt the shifted Jacobi (diagonal) preconditioner, which has become standard in configuration interaction calculations.^{63} Unlike exact diagonalization configuration interaction methods (e.g., full/complete active space CI), where implicit *σ* formation exhibits efficient closed-form expressions,^{1,55} the vast majority of sCI methods require explicit construction of *H*_{ij} to perform the necessary contractions. We refer the reader to Ref. 29 for a comprehensive discussion of efficient algorithms for the assembly of *H*_{ij} for sCI wave functions. Although explicit storage of the Hamiltonian can be avoided via recalculation of the *H*_{ij} at each Davidson iteration, the immense cost of matrix element construction renders this procedure prohibitive for practical applications. The sparsity of the Hamiltonian in the basis of determinants allows for its explicit storage using sparse matrix storage formats. In this work, we will use the distributed sparse matrix format depicted in Fig. 1, which has been employed for the development of parallel Krylov algorithms in many community software packages for sparse matrix algebra (e.g., Ref. 77). In this storage scheme, each PE owns a contiguous (not necessarily equal) row subset of the full sparse matrix divided into diagonal and off-diagonal blocks that are stored in standard sparse matrix formats. We will use the compressed sparse row (CSR) format for local matrix storage.^{78} The partitioning into diagonal and off-diagonal blocks will ultimately determine the communication pattern of the *σ* formation in the Davidson method, as will be apparent in Sec. II C. Of the several methods for matrix element construction discussed in Ref. 29, we consider the double-loop approach in this work. In the limit that each task holds only a small portion of the Hamiltonian, the double-loop approach generates all of the matrix elements quickly. Further time savings could likely be obtained by adopting other approaches to matrix element generation, such as residue arrays and dynamic bitmasking,^{29} although any such cost analysis must also account for inefficiencies due to load imbalance. The double-loop approach is naturally well balanced across tasks, and we therefore consider it a reasonable starting point for massively parallel approaches to sCI.
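To make the double-loop idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation): determinants are represented as bitstrings, the excitation degree between two determinants is half the popcount of their XOR, and a pair contributes a non-zero matrix element only when it differs by at most a double excitation (Slater–Condon rules). The `matrix_element` callable is a placeholder for the actual Slater–Condon evaluation.

```python
def excitation_degree(det_i: int, det_j: int) -> int:
    """Number of orbital substitutions relating two determinant bitstrings."""
    return bin(det_i ^ det_j).count("1") // 2

def local_hamiltonian_rows(local_dets, all_dets, matrix_element):
    """(row, col, value) triplets for the rows owned by this PE.

    The outer loop runs over the locally owned determinants (rows), the inner
    loop over the full determinant list (columns); only pairs within a double
    excitation survive, which is what makes H sparse.
    """
    entries = []
    for i, di in enumerate(local_dets):          # rows owned by this PE
        for j, dj in enumerate(all_dets):        # full determinant list
            if excitation_degree(di, dj) <= 2:   # non-zero by Slater-Condon
                entries.append((i, j, matrix_element(di, dj)))
    return entries

# Toy usage: 4 spin orbitals, 2 electrons, a dummy matrix element.  All pairs
# of 2-electron determinants differ by at most a double here, so every pair
# is retained.
dets = [0b0011, 0b0101, 0b1010, 0b1100]
nz = local_hamiltonian_rows(dets[:2], dets, lambda a, b: 1.0)
```

Because each PE simply scans its own row block against the full list, the per-PE work is nearly uniform, which is the load-balancing property noted above.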

**Algorithm 5.** The distributed memory Davidson method.

1: Input: Matrix-vector product operator OP; preconditioner operator K; max Krylov dimension k; initial guess v_{0}; residual norm tolerance ϵ
2: Output: Eigenpair (E_{0}, v) approximating the lowest eigenvalue of OP
3: ⊳ All vectors are distributed among PEs
4: w ← OP(v_{0})
5: h ← v_{0}^{†}w; h ← Allreduce(h); v ← w − v_{0}h ⊳ Parallel Gram-Schmidt
6: W ← w; V ← [v_{0} v]
7: i ← 1
8: while i < k do
9:  W ← [W OP(v)]
10:  H_{loc} ← W^{†}V ⊳ Local, all PEs
11:  H_{tot} ← Reduce(H_{loc}) ⊳ Collective to PE 0
12:  Solve H_{tot}C̃ = C̃Ẽ ⊳ Local on PE 0
13:  Broadcast (Ẽ, C̃) ⊳ Collective
14:  R ← (W − Ẽ_{0}V)C̃_{0} ⊳ Local, all PEs
15:  r_{loc} ← R^{†}R ⊳ Local, all PEs
16:  ‖r‖ ← Allgather(r_{loc}) ⊳ Collective
17:  if ‖r‖ ≤ ϵ then
18:   return (E_{0}, v) ← (Ẽ_{0}, VC̃_{0})
19:  else
20:   R ← K(R) ⊳ Preconditioned residual
21:   h ← V^{†}R; h ← Allreduce(h); v ← R − Vh ⊳ Parallel Gram-Schmidt
22:   V ← [V v]
23:  end if
24:  i ← i + 1
25: end while
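The shifted Jacobi preconditioner applied in line 20 of the algorithm above admits a very simple, communication-free sketch (illustrative only; the guard value `eps` is our choice, not from the text): each element of the residual is scaled by the inverse of (H_{ii} − E_{0}), and since every PE owns the diagonal elements of its own row block along with its slice of the residual, no data exchange is needed.

```python
def jacobi_precondition(residual, h_diag, e0, eps=1e-12):
    """Apply K = (diag(H) - E0*I)^{-1} to a residual slice, elementwise.

    `residual` and `h_diag` are the locally owned slices; small pivots are
    guarded against to avoid division by a near-zero shift.
    """
    out = []
    for r_i, h_ii in zip(residual, h_diag):
        denom = h_ii - e0
        if abs(denom) < eps:          # avoid dividing by a tiny shift
            denom = eps
        out.append(r_i / denom)
    return out

# Toy usage with a made-up Ritz value E0 = -1.0.
r = jacobi_precondition([0.5, -0.2], [1.0, 3.0], e0=-1.0)
# r == [0.25, -0.05]
```

The diagonal dominance of the CI Hamiltonian noted earlier is what makes this inexpensive preconditioner effective in practice.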


A naive approach to *σ* formation would replicate the basis vectors on each PE such that the local sparse matrix-vector (SpMV) product could be performed without communication. However, in the context of a Krylov method such as Davidson, this would require gathering the locally computed elements of the SpMV to *every* PE at each iteration. For large problem sizes, the connectivity of the full gather operation would become a bottleneck. Due to the sparsity of *H*, the computation of individual elements of *σ* does not require knowledge of the entire basis vector; in fact, the number of elements required for the contraction of *any* row of *H* is (pessimistically) bounded by the number of single and double excitations allowable from the single-particle basis. As such, replication (or communication) of the entire vector is not generally required. To minimize data communication in the Davidson iterations, we employ the vector partitioning scheme depicted in Fig. 1. By using the sparsity pattern of the off-diagonal blocks of *H*, we can efficiently determine which remote vector elements need to be communicated between PEs as the union of the row sparsities of *H* [Eq. (2)]. Another benefit of this precomputation is that it allows for the overlap of communication and computation.^{79} Through this distribution, the diagonal SpMV can be performed without communication, as the vector elements for the local SpMV reside on the same PE by construction. By using non-blocking MPI communication primitives, communication can be initiated before starting the diagonal SpMV and finalized only once the remote elements are required for the off-diagonal SpMV. The pseudo-code for the distributed SpMV employed in this work is given in Algorithm 6.
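The communication precomputation described above can be sketched as follows (an illustrative sketch under our own naming assumptions, not the authors' code): the set of remote vector elements a PE must receive is the union of the column indices appearing in its off-diagonal block, and grouping those columns by owning PE yields one receive list (and, symmetrically, one send list) per peer.

```python
def remote_elements(offdiag_cols_per_row, row_ranges, my_pe):
    """Map peer PE -> sorted global column indices needed from that PE.

    `offdiag_cols_per_row[i]` holds the global column indices of the
    off-diagonal block in local row i; `row_ranges[p] = (start, end)` is the
    contiguous row block owned by PE p (which also owns those vector elements).
    """
    needed = set()
    for cols in offdiag_cols_per_row:     # union of the row sparsities
        needed.update(cols)
    by_owner = {}
    for col in sorted(needed):
        for p, (lo, hi) in enumerate(row_ranges):
            if lo <= col < hi and p != my_pe:
                by_owner.setdefault(p, []).append(col)
    return by_owner

# Toy example: 3 PEs each owning 4 rows; PE 0's off-diagonal block touches
# columns {5, 9} in row 0 and {5, 10} in row 2.
comm = remote_elements([[5, 9], [], [5, 10], []],
                       [(0, 4), (4, 8), (8, 12)], my_pe=0)
# comm == {1: [5], 2: [9, 10]}
```

In an MPI implementation these per-peer lists would seed the non-blocking receives and sends (lines 7–9 of Algorithm 6), computed once and reused across Davidson iterations.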

**Algorithm 6.** The distributed sparse matrix-vector (SpMV) product.

1: Input: Matrix H and vector v in block distributed format (Fig. 1); (optional) precomputed communication data [Eq. (2)]
2: Output: w = Hv in conforming row partitioned format
3:
4: if communication data needed then
5:  Compute communication data for H per Eq. (2)
6: end if
7: Post local receives for incoming data from remote PEs ⊳ MPI_Irecv
8: Pack elements of v required by remote processes ⊳ Local
9: Post local sends to satisfy remote receive requests ⊳ MPI_Isend
10: w ← H^{d}v ⊳ Local
11: Wait for remote receives to complete
12: v_{rem} ← Unpack remote data
13: w ← w + H^{od}v_{rem}


## III. RESULTS AND DISCUSSION

In this section, we examine the strong scaling behavior of the proposed algorithms for Cr_{2}/def-SVP^{80} in an active space of (24e, 30o). This test system is among the challenging systems that have been effectively solved to chemical accuracy, but only within the last ten years.^{9,18} Therefore, Cr_{2} is a good benchmark system for testing and developing modern methods. Molecular integrals were obtained using the Molpro^{81,82} program, and all ASCI calculations were performed on the Haswell partition of the Cray XC40 Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). Each Cori-Haswell node consists of two 16-core Intel Xeon E5-2698 v3 CPUs with 128 GB of DDR4 RAM. All jobs were run with 32 PEs (MPI ranks) per node. Energies for the Cr_{2} calculations presented in this work are given in Table II. All calculations were performed using *ϵ*_{search} = 10^{−10}. Although not the primary focus of this work, we note that the variational energy determined at *N*_{tdets} = 3 × 10^{8} is only 0.3 mEh higher than the most accurate variational DMRG result for the same system in Ref. 9.

**TABLE II.** Variational ASCI ground state energies for Cr_{2} (24e, 30o) as a function of *N*_{tdets} and *N*_{cdets}.

| E_{0}/Eh | N_{cdets} = 100k | N_{cdets} = 300k |
|---|---|---|
| N_{tdets} = 10^{6} | −2086.409 66 | −2086.410 18 |
| N_{tdets} = 10^{7} | −2086.417 69 | −2086.417 81 |
| N_{tdets} = 3 × 10^{8} | ⋯ | −2086.420 42 |
| Reference DMRG^{9} | −2086.420 77 | |


In the following, we present strong scaling results for the parallel ASCI methods proposed in this work.

### A. ASCI search performance

In Fig. 2, we present the component and overall strong scaling of the ASCI search for triplet, quadruplet, quintuplet, and sextuplet determinant constraints (Sec. II B) for Cr_{2} with *N*_{cdets} = 300*k* and *N*_{tdets} = 10^{6}, 10^{7}. The calculation is dominated by the generation of ASCI pairs, whereas the distributed determinant selection via Algorithm 4 is relatively minor in comparison ($<$1 s). As such, the execution time of the ASCI search, at this scale, is largely independent of *N*_{tdets}. At the lowest determinant constraint level (triplets), the presented algorithm stagnates immediately, with no further performance improvement at large PE counts. Strong scaling improves significantly with larger determinant constraints. Overall, we see a 5–5.5× performance improvement in the strong scaling limit by moving from triplet to sextuplet determinant constraints. The reason for this improvement (or, conversely, why the triplet case exhibits suboptimal strong scaling) can be seen in Fig. 3, which shows the effect of constraint size on the number of ASCI pairs generated per PE. At the triplet level, a single triplet yields a disproportionate number of pair contributions. Because work can only be distributed at the level of the highest determinant constraint, this single triplet represents indivisible work. By breaking up the offending triplet into higher constraints, the work can be distributed. However, at each constraint level, the algorithm eventually reaches a point where a disproportionate amount of work is generated for a few particular constraints. As such, we expect that there will always be a strong scaling stagnation point for this procedure, but the inclusion of higher constraints (septuplets, octuplets, etc.) will continue to improve the strong scaling behavior at larger PE counts. However, given the relatively low cost of the search compared to the Hamiltonian formation and Davidson iterations (see Sec. III B), we believe this level of optimization to be unnecessary at this time.
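A minimal sketch of how determinant constraints partition the search work (hedged: the names and the label convention below are ours for illustration, not the authors' code): a triplet constraint fixes the three highest occupied orbitals of a determinant, so grouping candidate determinants by that label yields independent work units that can be assigned to PEs; quadruplet and higher constraints refine the grouping, subdividing an over-heavy triplet's work.

```python
def constraint_label(det: int, order: int = 3):
    """Tuple of the `order` highest set-bit (occupied-orbital) positions."""
    bits = [i for i in range(det.bit_length()) if (det >> i) & 1]
    return tuple(sorted(bits, reverse=True)[:order])

def group_by_constraint(dets, order=3):
    """Bucket determinants into independent work units, one per constraint."""
    groups = {}
    for d in dets:
        groups.setdefault(constraint_label(d, order), []).append(d)
    return groups

# Two determinants that share a triplet constraint but not a quadruplet one:
# raising the constraint order splits their (shared, indivisible) work unit.
a, b = 0b111010, 0b111001
assert constraint_label(a) == constraint_label(b) == (5, 4, 3)
assert constraint_label(a, 4) != constraint_label(b, 4)
```

This mirrors the behavior observed in Fig. 3: a single heavy triplet bucket is indivisible work, while re-bucketing at higher order distributes it across more PEs.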

### B. Hamiltonian and eigensolver performance

Figure 4 illustrates the strong scaling of the sCI Hamiltonian generation and subsequent Davidson iterations for Cr_{2} with *N*_{cdets} = 300*k* and *N*_{tdets} = 10^{6}, 10^{7}, and 3 × 10^{8}. It is clear that the cost of Hamiltonian formation dominates the variational ASCI calculation at all *N*_{tdets} considered; it is roughly 10× the cost of the Davidson iterations and 100× the cost of the ASCI search discussed in Sec. III A. Due to the significant computational cost of the largest test problem, there is no single point of reference (PE count) for measuring parallel performance that is consistent across all of the calculations performed. As such, we have chosen to characterize the strong scaling of this application in terms of the relative parallel speedup (RPS) achieved between adjacent data points (i.e., RPS_{128} = *t*_{64}/*t*_{128} for the speedup between 64 and 128 PEs). As all of the scaling calculations were performed by successive doubling of PE counts, a speedup value of 2 indicates ideal (linear) strong scaling, 1 indicates no speedup, and $<$1 indicates a performance degradation. Values slightly above 2 can be attributed to small variances in the underlying execution times and are not indicative of super-linear strong scaling.
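The RPS metric defined above reduces to a simple ratio of adjacent wall times; the sketch below (with made-up timings, purely for illustration) shows the computation for a run of doubled PE counts.

```python
def relative_speedups(times_by_pes):
    """Relative parallel speedup between adjacent data points.

    `times_by_pes` is a list of (pe_count, wall_time) pairs sorted by
    pe_count; RPS at a given count is t_prev / t_curr, so with doubled PE
    counts 2.0 is ideal strong scaling, 1.0 is no speedup, and < 1.0 is a
    performance degradation.
    """
    out = {}
    for (_, t_prev), (p_curr, t_curr) in zip(times_by_pes, times_by_pes[1:]):
        out[p_curr] = t_prev / t_curr
    return out

# Hypothetical timings (seconds) at 64, 128, and 256 PEs.
rps = relative_speedups([(64, 100.0), (128, 51.0), (256, 30.0)])
# rps[128] is near-ideal (~1.96); rps[256] shows the onset of stagnation (1.7)
```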

The Hamiltonian formation exhibits near linear strong scaling (RPS $>$ 1.96), whereas the Davidson iterations exhibit RPS values in the range of 0.98–1.6. The scaling of the former is expected since each PE performs nearly the same number of determinant distance checks in the double-loop method (Sec. II C), and there is no communication between PEs. The scaling of the Davidson iterations can be understood in terms of Fig. 4(c), where the source of the load imbalance can be attributed to the variance in the number of non-zero matrix elements among PEs. This is because the ordering of the determinants in the ASCI wave function, as produced by the parallel ASCI search, is more or less random and thus cannot guarantee a regular distribution of non-zero matrix elements. This degradation is most prevalent for the smallest problem (10^{6} determinants), as the effective lack of data distribution leads to load imbalance in a manner similar to Sec. III A. This problem is well known in the field of distributed memory sparse linear algebra,^{56} as both the local work and overall communication volume of the parallel SpMV (Algorithm 6) are affected by irregular distributions of non-zero matrix elements. Improving the communication characteristics of the sCI SpMV will be the subject of future work by the authors. Figure 4 also shows the performance of the overall ASCI application for the different problem sizes. As the calculation is dominated by the Hamiltonian construction, we see similar parallel speedups for the total application, particularly for the larger problems. Overall, the ASCI calculation maintains RPS $>$ 1.7 for the smallest problem (10^{6} determinants) and RPS $>$ 1.9 for the largest problem (3 × 10^{8} determinants) out to 16 384 PEs. To the best of the authors’ knowledge, this is the largest variational ASCI calculation to date.

## IV. CONCLUSIONS

In this work, we have proposed and demonstrated a parallel, distributed memory extension of the variational ASCI method. We addressed the key concerns for the dominant steps of the ASCI algorithm, including the development of a parallel algorithm for the determinant selection, a parallel algorithm and sparse storage scheme for the sCI Hamiltonian, and a parallel algorithm for the variational eigenvalue calculation via the Davidson method. The parallel search algorithm was enabled by previous work in the context of ASCI-PT2, where the introduction of determinant constraints offered a convenient and robust mechanism to distribute work across PEs. This method was extended to include higher-order determinant constraints than were considered for the ASCI-PT2 method, and it was coupled to a load balancer to ensure parallel scaling on large compute resources. We have demonstrated the efficacy and performance of the proposed algorithms on a challenging test system (Cr_{2}) on up to 16 384 cores. The use of higher-order determinant constraints yielded a 5–5.5× performance improvement in the ASCI search relative to triplet constraints in the strong scaling limit. The overall ASCI calculation was demonstrated to achieve near linear parallel speedups (RPS between 1.7 and 1.9). In addition, we have demonstrated the stability and continued convergence of the ASCI method into the regime of hundreds of millions of determinants, which, to the best of the authors’ knowledge, is the largest variational ASCI calculation to date. Although the developments for the parallel search algorithm are ASCI specific, the developments pertaining to distributed Hamiltonian construction and Davidson diagonalization are directly applicable to other sCI methods as well. These developments indicate a promising future for massively parallel sCI applications in the years to come.

While these results are promising, there remain open areas for exploration, which will be addressed by the authors in future work. In this work, the Hamiltonian construction dominates the overall ASCI application by at least an order of magnitude. The application of more advanced Hamiltonian matrix element techniques^{29} for the local sparse matrix construction holds high potential for further improvements, although it is unclear whether such developments would introduce load imbalance. The interplay between efficient Hamiltonian construction and load imbalance warrants further exploration. In addition, the strong scaling stagnation of the Davidson iterations can be attributed primarily to the massive load imbalance in the non-zero element distribution across PEs. While reordering techniques for SpMV optimization exist,^{83,84} their considerable cost and poor strong scaling preclude their practical application in the present work. The development of scalable sparse matrix reordering techniques for the CI problem would be a particularly fruitful area of exploration, given the emergence of massively parallel implementations of sCI. Finally, leveraging bit operations in both the search and Hamiltonian construction steps of the ASCI algorithm would be particularly advantageous on current and future GPU architectures, and possibly other hardware accelerators to come. As such, the authors will build upon initial results^{29} for GPU applications of the ASCI method and develop GPU-accelerated extensions of the parallel algorithms proposed in this work.

## ACKNOWLEDGMENTS

D.B.W.-Y. and W.A.d.J. were supported by the Center for Scalable Predictive methods for Excitations and Correlated phenomena (SPEC), which is funded by the U.S. Department of Energy (DoE), Office of Science, Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences and Biosciences as part of the Computational Chemical Sciences (CCS) program at Lawrence Berkeley National Laboratory under FWP No. 12553. N.M.T. acknowledges the support from NASA Ames Research Center. C.M.-Z. acknowledges the financial support from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation program, Grant Agreement No. 692670 “FIRSTORM.” This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231 using NERSC Award No. BES-ERCAP-M3196. N.M.T. acknowledges funding from the NASA ARMD Transformational Tools and Technology (TTT) Project. Some calculations were performed as part of the XSEDE computational Project No. TG-MCA93S030 on Bridges-2 at the Pittsburgh Supercomputing Center.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**David B. Williams-Young**: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Norm M. Tubman**: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Carlos Mejuto-Zaera**: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal). **Wibe A. de Jong**: Conceptualization (equal); Formal analysis (equal); Funding acquisition (lead); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available within the article.

## REFERENCES
