Tensor algebra operations such as contractions in computational chemistry consume a significant fraction of the computing time on large-scale computing platforms. The widespread use of tensor contractions between large multi-dimensional tensors in describing electronic structure theory has motivated the development of multiple tensor algebra frameworks targeting heterogeneous computing platforms. In this paper, we present Tensor Algebra for Many-body Methods (TAMM), a framework for productive and performance-portable development of scalable computational chemistry methods. TAMM decouples the specification of the computation from the execution of these operations on available high-performance computing systems. With this design choice, the scientific application developers (domain scientists) can focus on the algorithmic requirements using the tensor algebra interface provided by TAMM, whereas high-performance computing developers can direct their attention to various optimizations on the underlying constructs, such as efficient data distribution, optimized scheduling algorithms, and efficient use of intra-node resources (e.g., graphics processing units). The modular structure of TAMM allows it to support different hardware architectures and incorporate new algorithmic advances. We describe the TAMM framework and our approach to the sustainable development of scalable ground- and excited-state electronic structure methods. We present case studies highlighting the ease of use, including the performance and productivity gains compared to other frameworks.
I. INTRODUCTION
Enabling highly-scalable computational environments that abstract and automate the development of complex tensor algebra operations is critical in order to advance computational chemistry toward more complex and accurate formulations capable of taking advantage of emerging exascale computing architectures and also serve as a foundation for a sustainable and portable electronic structure software stack. In particular, tensor contractions (TCs) are a universal language used in many areas of quantum mechanics to encode equations describing collective phenomena in many-body quantum systems encountered in quantum field theory, nuclear structure theory, material sciences, and quantum chemistry. Typically, contractions between multi-dimensional tensors stem from the discretization procedures used to represent the Schrödinger equation in a finite-dimensional algebraic form. An important area where tensor contractions play a crucial role is electronic structure theory, where tensors describe basic interactions and parameters defining wave function expansions.
One of the most critical applications of TCs is the coupled-cluster (CC) formalism.1–8 In the CC theory, the form of the complex TCs used to represent non-linear equations describing the correlated behavior of electrons in molecular systems also reflects the fundamental feature of the CC formalism referred to as the size-extensivity or proper scaling of the energy with the number of particles. For this reason, the CC formalism is one of the most accurate computational models used these days in computational chemistry.
The CC theory has assumed a central role in high-accuracy computational chemistry and still attracts much attention in theoretical developments and numerical implementations. In this effort, high-performance computing (HPC) and the possibility of performing TCs in parallel play an essential role in addressing steep polynomial scaling as a function of system size and extending the CC formalism to realistic chemical problems described by thousands of basis set functions. To understand the scale of the problem, canonical CC formulations such as the ubiquitous CCSD(T) approach9 (CC with iterative single and double excitations with non-iterative corrections due to triple excitations) involve contractions between four-dimensional tensors in which each dimension can span thousands of entries. To enable efficient CC calculations on parallel platforms, several specialized tensor libraries have been developed over the last decade.
In the last few decades, significant effort has been expended toward enabling CC simulations for very large chemical systems to extend the applicability of the CC formalism further. In the reduced-scaling formulations, commonly referred to as the local CC methods,10–18 one takes advantage of the local character of correlation effects to effectively reduce the number of parameters and the overall cost of CC iterations, allowing for calculations of systems described by 10 000–40 000 basis set functions. This paper discusses a new tensor library, Tensor Algebra for Many-body Methods (TAMM), which provides a flexible environment for expressing and implementing TCs for canonical and local CC approximations. The development of TAMM functionalities for the ground-state formulations [canonical CCSD and CCSD(T) methods and local CC formulations] is supported by the NWChemEx project,19 whereas excited-state formulations [CC Green’s function (CCGF), Equation-of-Motion (EOM) CC, and time-dependent CC (TDCC)] are supported by the SPEC project.20
As HPC systems continue to evolve to include different types of accelerators, diverse memory hierarchy configurations, and varying intra- and inter-node topologies, there is a need to decouple the development of electronic structure methods from their optimizations for various platforms. TAMM enables the compact specification of various electronic structure methods that heavily rely on tensor operations while allowing independent yet incremental development of optimization strategies to map these methods to current and emerging heterogeneous platforms.
The rest of the paper is organized as follows. Section II briefly discusses other tensor algebra frameworks used for developing computational chemistry applications. Section III describes the details of our tensor algebra interface and the underlying constructs that are used to efficiently distribute tensor data and schedule and execute tensor operations on modern HPC platforms. Later in Sec. III C, we provide a feature comparison with other distributed tensor algebra frameworks. Section IV showcases multiple CC methods implemented using TAMM and performance results obtained using the OLCF Summit Supercomputer.21 Section V demonstrates the performance of TAMM in comparison to other distributed tensor algebra frameworks.
II. REVIEW OF EXISTING INFRASTRUCTURE
Tensor-based scientific applications have been widely used in different domains, from scientific simulation to more recent machine learning (ML)-based scientific applications. Over the years, program synthesis and code generation have become the go-to solutions for such applications. The Tensor Contraction Engine (TCE),22 which is used in the NWChem computational chemistry software package,23 has been one such successful solution for automatically generating parallel code for various molecular orbital (MO) basis CC methods in Fortran 77. In later work, the TCE project24 added support for optimizations on tensor expression factorization, optimized code generation for various hardware, and space-time trade-offs in order to improve and also implement more complex CC methods.
In a separate effort, the FLAME project25 provided formal support for describing linear algebra operations over matrices, along with optimized implementations of these kernels for distributed memory systems. Later, various studies on optimizing tensor algebra26–28 built on the FLAME framework.
The Cyclops Tensor Framework (CTF)29 was developed to provide more efficient parallel kernel implementations of tensor operations. The framework focused on reducing the required communication in parallel executions of CC-based methods by using a dense triangular tensor representation. Solomonik and Hoefler extended CTF for general sparse computations.30 Manzer et al. demonstrated the benefits of exploiting block sparsity in electronic structure calculations.31 Neither approach includes notational support for the general block-structured nature of the sparsity that naturally occurs in electronic structure methods.
Epifanovsky et al.32 developed an object-oriented C++ library called libtensor for efficient execution of post-Hartree–Fock electronic structure methods using a blocked representation of large-size dense tensors. In later work,33 they optimized various operations using efficient memory management techniques that are thread-friendly and NUMA-aware. Unlike TAMM, libtensor does not offer a distributed infrastructure for tensor algebra operations and is restricted to shared memory systems.
Orca is a general quantum chemistry program that includes implementations of reduced-scaling methods. To achieve speedups, various approximations are employed, for example, density fitting, RIJ-COSX, or a local approach for NEVPT2 or CC methods. The C++ code employs message passing interface (MPI)-based parallelization schemes, and recently, in a pilot study, a scheme for generating 3- and 4-index integrals via accelerators was implemented.34 Orca provides a wide range of computational chemistry methods, whereas TAMM provides a general distributed tensor algebra framework to enable the productive development of scalable computational chemistry applications.
The ExaTensor35 library uses a domain-specific virtual processor concept to allow performance-portable programming and execution of tensor operations on modern graphics processing unit (GPU) architectures. While the library is mostly focused on general GPU computation and the portability of such operations to different systems, its developers demonstrated its effectiveness on numerical tensor algebra workloads that mainly focus on quantum many-body computations on HPC platforms. The main focus of the ExaTensor library is to utilize GPUs for efficient tensor contractions. TAMM, on the other hand, provides a large set of supported tensor operations and utility routines. TAMM also provides special constructs allowing users to add new operations on tensors that are needed for the development of complete computational chemistry applications.
The TiledArray framework36 is actively being developed for scalable tensor operations that support various computational chemistry methods.37 It makes use of the Multiresolution Adaptive Numerical Environment for Scientific Simulation (MADNESS) parallel runtime,38 which employs a high-level software environment for increasing both programmer productivity and code scalability in massively parallel systems. TiledArray employs a hierarchical view of sparsity coupled with explicit user-written loop nests to perform specialized operations over sparse tensors. TiledArray has several similarities with TAMM while also differing in several key aspects, which are explained later in Sec. III C.
DISTAL39 is another recently developed distributed tensor algebra compiler that allows expressing computation on tensors as tensor algebra expressions. DISTAL lets users describe how the data and computation of a tensor algebra expression map onto a target machine, allowing separation of the specifications of data distribution and computation distributions. DISTAL lets users specialize computation to the way that data are already laid out or easily transform data between distributed layouts to match the computation. TAMM, on the other hand, offers generality by decomposing tensor algebra expressions into a series of distributed matrix multiplication and transposition operations. In a more recent work, the authors also introduced SpDISTAL,40 which adds support for sparse data structures and allows users to define generic sparse tensors. SpDISTAL then generates code for executing operations over these tensors. Both DISTAL and TAMM have several similarities in the aspects of storage, computation, and scheduling. However, they also differ in certain key aspects, which are further elaborated in Sec. III C.
III. TAMM FRAMEWORK
This section provides a detailed explanation of our tensor algebra framework. Figure 1 shows the conceptual overview of our framework. TAMM provides a tensor algebra interface that allows users to describe tensor operations in a familiar mathematical notation, while underneath it employs efficient data distribution/management schemes as well as efficient operation execution on both central processing unit (CPU)-based and GPU-based hardware. Our framework leverages efficient libraries such as HPTT41 for tensor index permutations on CPUs, LibreTT42 for tensor index permutations on GPUs, and Global Arrays,43 a partitioned global address space (PGAS) programming model for efficient data distribution.
A. Tensor algebra operations in TAMM
TAMM provides a flexible infrastructure for describing tensor objects using the general notion of an index space (simply an index range) that is used to describe the size of each dimension. TAMM also employs tiling capabilities, where users can define arbitrary or fixed size tiling over index spaces to construct tiled index spaces that represent a blocked representation of a tensor. This allows tensors to be constructed as a set of smaller blocks of data indexed by the Cartesian product of the number of tiles on each dimension. This notion of tiling enables efficient data distribution and efficient tensor operations (such as tensor contractions) that can better leverage the underlying hardware by utilizing the available cache system as well as the execution modules (i.e., GPUs and CPUs). TAMM’s flexible tiling capabilities provide crucial support for load balancing and for making effective trade-offs between data communication and computation. These optimizations will be detailed in Subsection III B, where we describe the data communication and execution of tensor operations.
Figure 2 shows an example of index space, tiled index space, and tensor construction in the TAMM framework. Lines 2–4 show the construction of IndexSpace objects that represent a range of indices using a range constructor. Our framework also allows users to describe application-specific annotations over the subsections of these spaces. Lines 4–7 show an example of string-based annotation over subsections of the construction range. This allows users to easily access different portions of an index space, thereby enabling access to slices of a tensor and constructing new tensors using slices of the original index space. Similarly, users can encode spin information related to each dimension over the input tensors, allowing our run-time to allocate these tensors using a block-sparse representation. For the time being, we only support spin symmetry for representing the block sparsity where the zero-blocks are not stored. Users can define their own non-zero check functionality to be used during allocation and tensor operation execution for storage and computational benefits. Additional capabilities, such as restricted summations or point group and permutational symmetries, are not supported in the tensor specification. Point group symmetry is less concerning since TAMM targets large systems that often do not have such symmetry. However, developers can explicitly code these features using TAMM to construct the desired tensors. This is the case for methods such as CCSD(T), discussed in Sec. IV A 2. TAMM also supports specialized tensor construction, called lambda tensor, where the user can provide a C++ lambda expression that specifies how a block of the tensor is computed, as shown in Line 19. This construct can be used to dynamically generate blocks of a tensor as needed. Note that these tensors are not stored in memory; they are read-only objects computed on the fly to be consumed in various tensor operations described later in this section. Similarly, TAMM also provides a special construct called view tensor to describe access to an existing tensor using C++ lambda expressions. The main use of view tensors is to define tensors of different shapes that can be used as a reference tensor for any operation as well as apply possible sparsity mapping/constraints on a dense tensor. Similar to lambda tensors, a view tensor does not allocate any additional storage but rather provides specialized access to the data available in the referenced tensor. We provide details of a use case for view tensors when discussing the DLPNO-CCSD method in Sec. IV B 2.
Source code example for IndexSpace, TiledIndexSpace, and Tensor construction using TAMM.
Lines 10–12 show the construction of TiledIndexSpace objects, which represent tiled index spaces used in tensor construction to give tensors a blocked structure. Tiling can be applied with a fixed size (i.e., line 10) or with arbitrary tile sizes that fully cover the index space (i.e., line 11). Finally, lines 15–17 in Fig. 2 show the construction of Tensor objects using tiled index spaces. Each TiledIndexSpace object used in the tensor constructor represents a dimension in the constructed tensor. In this example, tensor A is a 30 × 20 matrix with eight blocks: tM consists of two tiles of sizes 10 and 20, while tK consists of four tiles of size 5. While these lines construct a tensor object, the tensor is collectively allocated by all participating compute nodes in a subsequent operation using a scheduler.
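As an illustration, the sketch below mirrors this construction pattern (an index space of 30 indices, one of 20 indices with named sub-sections, fixed and arbitrary tilings, and a 30 × 20 tensor). The constructor forms follow TAMM's general style as described above, but the exact signatures and the header path should be treated as assumptions rather than a verbatim reproduction of Fig. 2.

```cpp
// Illustrative sketch of IndexSpace, TiledIndexSpace, and Tensor construction
// as described in the text; signatures and header path are assumptions.
#include "tamm/tamm.hpp"
using namespace tamm;

IndexSpace isM{range(0, 30)};                   // 30 indices
IndexSpace isK{range(0, 20),                    // 20 indices with string-annotated
               {{"first", {range(0, 10)}},      // sub-sections
                {"second", {range(10, 20)}}}};

TiledIndexSpace tM{isM, {10, 20}};              // two tiles of sizes 10 and 20
TiledIndexSpace tK{isK, 5};                     // four fixed-size tiles of size 5

Tensor<double> A{tM, tK};                       // 30 x 20 tensor, 2 x 4 = 8 blocks
```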
Another important concept in constructing tensor operations is index labels, which allow specifying tensor operations in familiar mathematical notation and provide slicing capabilities over tensors by using the string-based subsections of the full index space. Labels are associated with tiled index spaces and used in the tensor operation syntax to describe the computation. Depending on the index spaces that the tensors are constructed on, users can specify string-based sub-spaces to define operations on different slices of the underlying allocated tensor object. Figure 3 shows examples of TiledIndexLabel construction using TiledIndexSpace objects. While lines 1–6 construct the labels over the full spaces, line 8 shows the label creation for the first portion of the tK index space (see the construction in Fig. 2, line 4). These sub-spaces can then be used for specifying operations over sliced portions of the full tensor, as long as the index labels are from a subset of the original index space.
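A minimal sketch of label creation in the spirit of Fig. 3 is shown below; the accessor names (labels<N> and label) follow TAMM's general conventions but are assumptions here, and the tM/tK objects refer to the tiled index spaces sketched above.

```cpp
// Sketch of TiledIndexLabel creation over full spaces and over a named
// sub-section (accessor names assumed).
auto [m1, m2] = tM.labels<2>("all");   // labels spanning all of tM
auto [k1, k2] = tK.labels<2>("all");   // labels spanning all of tK
auto k_first  = tK.label("first");     // label restricted to the "first" slice of tK
```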
TAMM supports several tensor operations: tensor set, tensor addition, subtraction, scale, trace, transpose, general tensor contractions, inner and outer products, and reduction. TAMM supports different tensor data types and several mathematically valid operations between tensors of different data types (e.g., tensor contraction between a complex and real tensor). Figure 4 gives the grammar for allowed tensor operations’ syntax in the TAMM framework. Each tensor operation syntax rule (⟨tensor-op⟩) is composed of a left-hand side (lhs, ⟨op-lhs⟩) and a right-hand side (rhs, ⟨rhs⟩) operation. While lhs can only be a labeled tensor construct (⟨label-tensor⟩), rhs can be of different types that result in different tensor operations, as follows:
an alpha value [A(i,l) = alpha], which corresponds to a tensor set operation that assigns the given alpha value to all elements of the tensor;
a labeled tensor [A(i,l) += alpha * D(l,i)], which corresponds to a tensor addition operation (with respect to the label permutation on tensor D) in Eq. (7) in Fig. 4;
a contraction of two labeled tensors [C(i,a) += alpha * A(i,l) * B(l,a)], which updates the lhs with the tensor contraction result in Eq. (8) in Fig. 4.
Several additional operations on tensors are provided as utility routines in TAMM. Element-wise operations such as square, log, inverse, etc. are provided. Additional useful operations on tensors such as min, max, norm, etc. are also provided, including high-level routines to read from and write tensors to disk using parallel file I/O. All these operations are defined using a parallel block_for construct provided by TAMM, allowing users to easily define custom element- and block-wise operations on a tensor. The block_for construct allows user-defined operations (using C++ lambda expressions) that are executed in parallel on distributed tensors.
Similar to tensor allocation, all other tensor operations that use and modify the distributed tensor data have to be performed via a scheduler that has the necessary information about the execution context of the operation. Execution context includes required information about the distribution scheme of the tensor data and the execution medium for the operations through distribution, memory manager, and process group constructs. Note that the tensor operations that are provided as utility routines do not use the scheduler but still require the execution context to perform the corresponding operation. TAMM employs a modular construction of these concepts to allow users to implement various distribution schemes as well as different distributed memory frameworks. Figure 5 shows the construction of a Scheduler object using the execution context and the execution of defined tensor operations. After creating a scheduler, users can directly queue tensor operations such as tensor allocation (line 12), tensor set and add operations (lines 13 and 14), and tensor contractions (line 15). Finally, all queued operations are executed using the execution context on a distributed system. The tensor operations, as well as the operations over the index spaces, are formally described in our previous work.44 The syntax for expressing operations shown in lines 13–15 also indicates the productivity benefits that can be obtained by using TAMM. The operations expressed in these three lines are executed in parallel on CPUs, GPUs, or both on any distributed computing platform. Manually writing parallel code for these operations would lead to a significant increase in the number of source lines of code (SLOC). Extending such manual development of parallel code to a real application with a large number of such operations would only lead to a significant increase (orders of magnitude) in the SLOC count and development time, which would also make future improvements to such code infeasible.
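The following sketch captures the scheduler usage pattern described above, in the style of Fig. 5: a scheduler is built from an execution context, operations are queued, and everything is executed collectively. The ProcGroup factory call and the ExecutionContext constructor arguments are assumptions; the operation syntax follows the grammar of Sec. III A.

```cpp
// Sketch of scheduler construction and queued tensor operations
// (constructor arguments and factory names are assumptions).
ProcGroup        pg = ProcGroup::create_world_coll();                  // all ranks
ExecutionContext ec{pg, DistributionKind::nw, MemoryManagerKind::ga};  // GA-backed
Scheduler        sch{ec};

Tensor<double> A{tM, tK}, B{tK, tM}, C{tM, tM};
auto [i, j] = tM.labels<2>("all");
auto l      = tK.label("all");

sch.allocate(A, B, C)                  // collective allocation
   (A(i, l) = 2.0)                     // set operation
   (B(l, i) += 0.5 * A(i, l))          // addition with index permutation
   (C(i, j) += A(i, l) * B(l, j))      // tensor contraction
   .execute();                         // runs on CPU and/or GPU resources
```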
This section summarizes the tensor algebra interface as an embedded domain-specific language (eDSL) in the TAMM framework. By implementing an eDSL, we were able to separate concerns in developing scientific applications. While the high-level tensor abstractions allowed domain scientists to implement their algorithms using a representation close to the mathematical formulation, they also allowed framework developers to explore various optimization schemes on the underlying constructs (e.g., different data distribution schemes, operation execution on accelerators, and the use of different PGAS systems), which we detail in Subsection III B.
B. Tensor distribution and operation execution in TAMM
TAMM leverages various state-of-the-art frameworks and libraries to achieve a scalable, performance-portable implementation of tensor algebra operations on exascale supercomputing platforms through efficient data distribution and intra-node execution of tensor operation kernels on CPUs and GPUs. The default memory manager for tensor data distribution in TAMM is based on the Global Arrays framework, a PGAS programming model that provides a shared-memory-like programming abstraction on distributed memory platforms. Global Arrays provides performance, scalability, and user productivity in TAMM by managing the inter-node memory and communication for tensors. A TAMM tensor is essentially a global array with a certain distribution scheme. We have implemented three tensor distribution schemes in TAMM. The first scheme computes an effective processor grid for a given number of processes. A dense tensor is then mapped onto the processor grid. The second scheme is a simple round-robin distribution that allocates equal-sized blocks in a round-robin fashion, where the block size is determined by the size of the largest block in the tensor. This distribution over-allocates memory and ignores sparsity. The third scheme allocates the tensor blocks in a round-robin fashion while taking block sparsity into account. By only allocating the non-zero blocks of the tensor, it minimizes the memory footprint of the overall computation, allowing larger problems to be mapped to the available resources.
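To make the difference between the two round-robin schemes concrete, the sketch below shows one way such block-to-process mappings could be computed; it is an illustration of the idea, not TAMM's internal code, and the is_nonzero input stands in for whatever sparsity information (e.g., spin-symmetry attributes) is available.

```cpp
// Illustrative sketch (not TAMM internals) of the two round-robin
// block-distribution schemes described above.
#include <cstddef>
#include <cstdint>
#include <vector>

// Simple round-robin: every block, zero or not, is assigned a slot.
int64_t round_robin_owner(int64_t block_id, int64_t nproc) {
  return block_id % nproc;
}

// Sparsity-aware round-robin: only non-zero blocks are assigned and stored;
// blocks marked zero are never allocated (owner = -1).
std::vector<int64_t> sparse_round_robin(const std::vector<bool>& is_nonzero,
                                        int64_t nproc) {
  std::vector<int64_t> owner(is_nonzero.size(), -1);
  int64_t next = 0;
  for (std::size_t b = 0; b < is_nonzero.size(); ++b) {
    if (is_nonzero[b]) owner[b] = next++ % nproc;
  }
  return owner;
}
```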
TAMM uses the “Single Program Multiple Data (SPMD)” model for distributed computation. In this programming abstraction, each node has its own portion of the tensors available locally as well as access to the remote portions via communication over the network. As a result, operations on whole tensors can result in access to remote portions of the tensor, with implied communication. More importantly, many operations (e.g., tensor contractions and additions) are collective, as they involve the distributed tensors as a whole. While the tensor algebra interface follows a sequential ordering of tensor operations, we also aim to relieve application developers of the burden of reasoning about distribution while writing a scientific application. To avoid possible issues with operations on distributed tensors, TAMM is designed to represent tensors in terms of handles and requires tensor operations to be declared explicitly and executed using a scheduler. Hence, any assignment performed on tensor objects is a shallow copy as opposed to a deep copy, as a deep copy implies communication (message passing between nodes to perform the copy operation).
The computational chemistry methods are implemented as a sequence of operations over distributed tensors. Through this design, we were able to separate the specification of the operations from their implementations, allowing method developers to mainly focus on the computational chemistry algorithms while kernel developers can focus on the optimization of individual tensor operations. The execution of all tensor operations is managed by a scheduler. The TAMM scheduler employs a data flow analysis over the queued tensor operations to find the dependencies between each operation in order to load balance the computation and communication requirements of the overall execution. Using a levelized execution scheme, the scheduler is able to limit the global synchronizations to achieve load-balanced and communication-efficient schedules for the execution of all tensor operations. The dependency analysis over the high-level operations is based on a macro-operation graph. When two or more operations share the same tensor object and one of these operations updates the shared object, the operations are marked as conflicting operations that cannot be executed in parallel. This operation graph is used to construct a batch of operations that can be executed in parallel, minimizing the total number of global synchronizations required by the computation. The operations in these batches are then executed in an SPMD fashion. For instance, the canonical CCSD implementation in TAMM has 10 levels of operation batches that sum to over 125 tensor operations.
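The conflict test and levelized batching described above can be sketched as follows. This is a simplified illustration of the scheduling idea under the stated conflict rule (a shared tensor with at least one writer), not TAMM's actual scheduler code; the OpInfo structure and integer tensor ids are assumptions for the sketch.

```cpp
// Illustrative sketch of conflict detection and levelized scheduling.
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

struct OpInfo {
  std::set<int> reads;   // ids of tensors read by the operation
  std::set<int> writes;  // ids of tensors written by the operation
};

// Two ops conflict when they share a tensor and at least one writes it.
bool conflicts(const OpInfo& a, const OpInfo& b) {
  auto intersects = [](const std::set<int>& x, const std::set<int>& y) {
    return std::any_of(x.begin(), x.end(),
                       [&](int t) { return y.count(t) > 0; });
  };
  return intersects(a.writes, b.reads) || intersects(a.reads, b.writes) ||
         intersects(a.writes, b.writes);
}

// Place each op in the earliest level after all conflicting predecessors;
// ops in the same level can execute without a global synchronization.
std::vector<int> levelize(const std::vector<OpInfo>& ops) {
  std::vector<int> level(ops.size(), 0);
  for (std::size_t i = 0; i < ops.size(); ++i)
    for (std::size_t j = 0; j < i; ++j)
      if (conflicts(ops[i], ops[j]))
        level[i] = std::max(level[i], level[j] + 1);
  return level;
}
```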
While TAMM hides the burden of choosing the best-performing schedule from the users through load-balanced scheduler execution, it allows users to control various aspects of the computation, such as data distribution, parallelization strategies, operation ordering, and execution medium (e.g., CPUs or GPUs). To achieve this, TAMM uses a modular structure to describe constraints imposed by the users to automate the construction of an execution plan for efficient execution of the computation. This allows users to incrementally optimize their code with minimal programming effort while keeping it readable, as opposed to a code generator-based solution. With these controls over the execution and distribution of the tensor data, users can choose from different optimizations at different granularities. For example, the user can increase data locality by replicating frequently used small tensors on each distributed node or by choosing from different distribution schemes for efficient tensor operation execution on various node topologies. Such an optimization can be implemented as a new data distribution scheme (e.g., SUMMA45) by extending the Distribution class. By simply using this distribution, users can enforce a specific distribution on tensors, which can optimize the data communication required for specific operations. Another abstraction for optimization is the actual computation of the tensor operations. Users can define new operations by extending the Op class, which can then be scheduled for execution. While TAMM supports the most common operations defined on tensors (e.g., addition and contraction), it also implements Scan and Map operations that can be used to define various element/block-wise operations using lambda functions. Additionally, as a lower-level abstraction, users can also describe new kernel-level executions specific to a new architecture or accelerator [e.g., Double Precision General Matrix Multiply (DGEMM) or the GPU abstraction]. While TAMM supports the main GPUs (Nvidia, AMD, and Intel) available in current and upcoming HPC systems, users can also choose to implement new kernel-level abstractions for different hardware stacks [e.g., Field Programmable Gate Arrays (FPGAs) and Machine Learning/Deep Learning (ML/DL) accelerators].
To achieve highly optimized execution of tensor operations on a given platform, TAMM is designed to allow the use of multiple external libraries that provide optimized kernels for various operations. In addition to leveraging vendor-provided linear algebra libraries that are highly tuned for both CPUs and GPUs, TAMM also uses the following libraries: the HPTT41 library for optimized tensor index permutations on CPUs; LibreTT42 for efficient index permutations on Nvidia, AMD, and Intel GPUs; the BLIS46 library for an efficient BLAS implementation on CPUs; and finally, TensorGen47 for generating optimized tensor contraction GPU kernels specialized for CC methods in computational chemistry.
While TAMM tries to hide the execution details from the end-user by employing high-level tensor algebra expressions, users can specify the medium on which they want the operations to be executed. Users can specify a particular operation to be executed on the CPU or GPU. All other execution-specific details like parallelization, CPU-to-GPU communication, and execution mechanisms are handled automatically by the TAMM infrastructure. While TAMM does not explicitly rely on OpenMP, it can leverage OpenMP for important operations like tensor transposition and GEMM via highly optimized external libraries such as HPTT and vendor BLAS. Setting the OMP_NUM_THREADS variable before running any TAMM-based application would enable OpenMP parallelization.
C. Comparison with other tensor algebra frameworks
In this section, we compare the functionality provided by TAMM with that provided by other distributed tensor algebra frameworks: TiledArray, CTF, DISTAL, and ExaTensor. Table I summarizes the key features and differences between these frameworks. While all these frameworks provide similar features in how tensors and tensor operations can be represented, TAMM is designed to be an extendable framework and hence differs from the others in the following key aspects:
Indexing and slicing: Indexing and slicing capabilities in any tensor algebra framework are crucial for defining tensor operations. Indexing plays a key role in constructing tensor operations (see Fig. 5). Slicing is predominantly used in computational chemistry methods, allowing operations on isolated parts of tensors. Users can choose to write tensor operations over slices of the tensors instead of having to first copy the required portions for such operations. While all the tensor algebra frameworks being compared provide an indexing mechanism, they lack support for using slicing in tensor operations. TA and CTF use a string-based solution for indexing but lack the capability to support sliced indexing. On the other hand, DISTAL and ExaTensor use object-based constructs to create and use labels in the computation. While DISTAL does not allow any slicing, ExaTensor lets users use pre-defined sub-ranges for slicing. TAMM provides generic indexing and slicing capabilities through various index space operations, allowing users to define a subset of indices to create object-based labels (see Fig. 3) as well as string-based labels to construct tensor operations. Users can make use of slicing either through object-based labels over sub-ranges, by creating new slices on the fly, or through string-based labels to define operations over full tensors.
Sparsity: Sparsity is widely used in several computational chemistry applications. All of the tensor algebra frameworks discussed in this section support some notion of sparsity for both data and computation. While DISTAL does not have direct support for sparsity, later work by the same authors, namely, SpDISTAL, incorporates different sparse data structures as well as specifications to describe the storage and computation of sparse tensors. SpDISTAL supports a generic specification for describing the dense/compressed format for each individual mode as well as pre-defined sparse data representations (e.g., CSR and CSC) for describing such tensors. Operations over these tensors are executed by generating custom kernels. In contrast, more application-oriented tensor libraries, such as TA, CTF, and ExaTensor, allow block-level sparsity. TA allows users to define the non-zero and zero tiles that constitute the tensors, whereas CTF uses pre-defined symmetry-based sparsity over the tensors. TAMM, on the other hand, incorporates sparsity by providing attributes over the tiled index spaces. By default, TAMM implements spin-symmetry-related attributes to allocate block-sparse tensors depending on the values of the corresponding attributes on the tiles of the index space. Internally, TAMM analyzes the tensor blocks to determine whether the corresponding block is non-zero in order to optimize storage and computation on block-sparse tensors.
GPU Support: It is crucial to effectively leverage the available resources on modern heterogeneous HPC platforms. TAMM currently enables tensor algebra operations (mainly tensor contractions) on multiple GPU architectures (AMD, Nvidia, and Intel). At the time of this writing, the remaining frameworks are limited to Nvidia GPUs (except ExaTensor, which has AMD GPU support). The unified GPU infrastructure design allows TAMM to be easily extended to add support for a new hardware accelerator. Support for the upcoming GPU architectures with performance portability was also explored using SYCL, a domain-specific, heterogeneous programming language. Using SYCL and oneAPI Math Kernel Library (oneMKL) interfaces,48 perturbative triples contribution from the CCSD(T) formalism was demonstrated with large-scale simulations on different GPU architectures.49
Extendable backends: TAMM currently uses Global Arrays (GAs), which is built on top of MPI, for managing all aspects of data distribution and communication. TAMM also has an experimental UPC++50 backend that can alternatively be used instead of GA. TiledArray (TA) uses the MADNESS38 parallel runtime as the backend; DISTAL uses the Legion51 programming model and runtime system for distributing the data and computing; CTF and ExaTensor use MPI as the underlying parallel programming model. TAMM is designed to be easily extendable by implementing the methods in the process group and memory manager classes to support other parallel programming models. To the best of our knowledge, the remaining frameworks are not easily extendable to add support for additional parallel programming models.
Distribution: A TAMM tensor is allocated with a distribution type specified via the execution context. The distribution choices that are supported can be extended via the distribution class. DISTAL allows the specification of tensor distributions using a notation that allows the user to specify a process grid that the dimensions of a tensor can be mapped onto. Tensor distributions in TA and CTF are based on several parallel matrix multiplication algorithms (SUMMA, 2.5D, and 3D).26 TA additionally gives users the option of specifying a process map onto which tiles are distributed. However, CTF cannot be easily extended to allow other distributions.
Scheduling and Execution: DISTAL allows users to specialize computation to the way that data are already laid out or easily transform data between distributed layouts to match the computation, allowing fully customizable execution strategies for each operation. Several parallel matrix multiplication algorithms from the literature26 are expressible as schedules in DISTAL, which is currently not possible with any other framework in comparison. TA and CTF automatically execute tensor operations by dynamically redistributing data across operations if needed, but do not provide any way to customize the operation scheduling process. TAMM allows deferring execution for a collection of tensor operations. When the user explicitly calls the execute function, the TAMM scheduler analyzes the dependencies between the various operations and schedules them appropriately in order to load balance the computation and communication requirements of the overall execution across the collection of operations. However, TAMM does not provide any further customizations to the scheduling process. The trade-off between these frameworks is that TAMM, TA, and CTF fully automate the distribution process, while users must explicitly provide a schedule to distribute their computations with DISTAL.
Feature comparison with other distributed tensor algebra frameworks.
| Frameworks | Indexing | Slicing | GPU support | Programming model |
|---|---|---|---|---|
| TAMM | Object- and string-based | Generic slicing | AMD, Intel, Nvidia | GA (extendable) |
| TA | String-based | No support | Nvidia | MADNESS |
| DISTAL | Object-based | No support | Nvidia | Legion |
| CTF | String-based | No support | Nvidia | MPI |
| ExaTensor | Object-based | Sub-range slicing | AMD, Nvidia | MPI |
In this section, we have provided a detailed explanation of the TAMM framework. TAMM leverages a modular infrastructure to enable the implementation of various optimizations on different levels of computation, from data distribution to execution schemes on different hardware. This design allowed us to implement different memory managers, distribution schemes, and work distribution over different process groups without any major changes to the user-facing tensor algebra interface. Finally, we provided a feature-based comparison between TAMM and other distributed tensor algebra frameworks, namely, TiledArray, CTF, DISTAL, and ExaTensor.
IV. CC TAMM IMPLEMENTATIONS
In this section, we present case studies where TAMM was used to implement various scalable coupled-cluster (CC) methods for the latest HPC systems. TAMM’s primary contribution is not just faster-performing versions of these methods, important as they are, but also the ability to productively develop and explore new algorithms and apply those improvements to all existing and new applications implemented using TAMM. This includes improvements in intra-node execution (single core, OpenMP multicore, GPUs, etc.), data distribution strategies (e.g., replication, process group-based distribution), and parallel execution (compute partitioning and communication scheduling algorithms). It also allows the concurrent development of optimized equations, parallel algorithms, and optimized intra-node kernels by different teams through clearly defined interfaces. The variety of methods presented and their performance are evidence that this approach accelerates the development and exploration of CC methods.
A. Canonical methods
1. Coupled cluster singles doubles (CCSD)
As benchmark systems for testing the performance of TAMM implementations of the Cholesky-based CCSD,52 we used two molecular systems previously used in studying the efficiency of the TCE implementations of the CC methods in NWChem.53,54 The first benchmark system is the model of Bacteriochlorophyll (BChl) MgC36N4O6H38,55,56 which plays an important role in understanding the mechanism of photosynthesis. In this process, light is harvested by antenna systems and further funneled to BChl molecules, which initiate primary charge separation. The second benchmark system considered here is the β-carotene molecule, whose doubly excited states and triplet electronic states have recently been intensively studied in the context of singlet fission processes in carotenoid aggregates.57–59 The β-carotene molecule consists of 96 atoms, while the BChl model contains 85 atoms, including a central magnesium atom. We use the cc-pVDZ basis for both systems and evaluate their performance on OLCF Summit.21 Each Summit node contains two IBM POWER9 processors, each with 22 cores, and 512 GB of CPU memory. Each node is also equipped with six Nvidia Volta GPUs, each with 16 GB of memory, for a total of 96 GB of GPU memory per node.
Table II shows the CCSD performance compared to NWChem on 200 nodes of OLCF Summit. We measure the performance of the TAMM implementations against the TCE implementations in NWChem. The time per CCSD iteration is given in seconds. NWChem has a CPU-only implementation and uses 42 CPU cores on each node of Summit. TAMM-based Cholesky-CCSD uses only six MPI ranks per node, where each MPI rank is mapped to a single GPU. For these two molecular systems, we observe a 9–15× speedup with the TAMM-based Cholesky-CCSD implementation compared to the TCE CCSD method in NWChem. The CPU implementation of the CCSD tensor operations in NWChem comprises 11 314 source lines of code, whereas the Cholesky-CCSD implementation expressed using the TAMM framework is only 236 source lines of code. Since these 236 lines represent the computation at a high level, they express both CPU and GPU computation. On the other hand, adding GPU capabilities to the NWChem CCSD code would significantly increase the SLOC count and development time, which is why a GPU implementation of CCSD has not been attempted to date in NWChem. This clearly demonstrates the productivity benefits of expressing such computations in TAMM. Our implementation of the CCSD equations52 is expressed similarly to the example in Fig. 5. CCSD is an example of how TAMM can be used to productively create an effective and efficient implementation of certain classes of computational chemistry methods. The remaining computational chemistry methods discussed in this paper are also implemented along similar lines.
TAMM performance compared to NWChem on 200 nodes of OLCF Summit. Time per CCSD iteration is given.
| Molecule | No. of atoms | No. of occupied orbitals | No. of basis functions | NWChem60 time (s) | TAMM time (s) |
|---|---|---|---|---|---|
| BChl | 85 | 171 | 852 | 1202 | 81 |
| β-carotene | 96 | 148 | 840 | 801 | 65 |
2. Coupled cluster triples
Table III shows the TAMM-based NWChemEx triples correction (T) calculation52 performance compared to NWChem on 512 nodes of the OLCF Summit. The time is given in seconds. NWChem has a GPU implementation of the triples correction and uses all 6 GPUs and 42 CPU cores on each node. The TAMM-based triples correction uses only six MPI ranks per node, where each MPI rank is mapped to a single GPU. For the BChl and β-carotene molecules, we observe a speedup of ∼6–18× for the TAMM implementation compared to the TCE implementation in NWChem. The finer details of the TAMM-based triples correction implementation are detailed in Ref. 62.
NWChemEx-TAMM performance compared to NWChem for the perturbative correction in the CCSD(T) method on 512 nodes of the OLCF Summit.
| Molecule | No. of atoms | No. of occupied orbitals | No. of basis functions | NWChem60 time (s) | TAMM time (s) |
|---|---|---|---|---|---|
| BChl | 85 | 171 | 852 | 18 285 | 1791 |
| β-carotene | 96 | 148 | 840 | 6 646 | 1164 |
B. New methods
1. Equation-of-motion coupled cluster formalism
Similar to the CCSD method, the numerical scaling of the EOM-CCSD approach is dominated by $n_o^2 n_u^4$ terms, where $n_o$ represents the number of occupied orbitals and $n_u$ represents the number of unoccupied orbitals. The major computational complexity in the EOMCC method arises from tensor contractions involving various operators. The method involves two types of excitation operators: the CC operator T and the EOM operator R, in addition to the one-body and two-body components of the Hamiltonian operator. The EOMCCSD approach is a non-Hermitian eigenvalue problem, and because of the dimensionality of the problem, it is solved iteratively using a multi-root solver. Therefore, there is an additional cost for constructing and working with the iterative subspace. In terms of implementation, however, the same TAMM format used for writing various tensor contraction equations, as demonstrated while discussing Eq. (23), was followed for the EOMCC method.
The EOMCCSD method is usually employed in studies of excited states dominated by single excitations. It is also worth mentioning that linear-response (LR) CC methods with singles and doubles lead to the same excited-state energies as EOMCCSD. However, in contrast to the EOMCCSD formalism, the LR-CCSD excitations are identified with the poles of the frequency-dependent linear-response CC amplitudes.
Table IV shows the TAMM EOMCCSD calculation performance compared to NWChem on 200 nodes of the OLCF Summit. The time per EOMCCSD iteration is given in seconds. NWChem has a CPU-only implementation and uses 42 CPU cores on each of the 200 nodes of Summit. Just as with the previous calculations, the TAMM-based EOMCCSD uses six MPI ranks per node, where each MPI rank is mapped to a single GPU. The current TAMM implementation of the EOMCCSD approach has not been optimized at the same equation level as the CD-CCSD implementation. For example, in contrast to the CD-CCSD formalism, the EOMCCSD implementation fully represents the two-electron integrals and uses a non-spin-explicit representation of the operators in the equations. While we plan to return to this implementation and incorporate the optimized (spin-explicit) equations, we already observe a significant speed-up over NWChem with this initial, unoptimized implementation. CD-CCSD and EOMCCSD are different algorithms solving different problems, and the timing for an EOMCCSD iteration is the sum of all of the steps in an iteration, which tends to be more computationally demanding than the Jacobi step in CD-CCSD. Nonetheless, one can see from Table IV that the TAMM EOMCCSD code is 2–3 times faster for the β-carotene and BChl molecules.
TAMM performance compared to NWChem on 200 nodes of the OLCF Summit. Time per EOMCCSD iteration is given.
| Molecule | No. of atoms | No. of occupied orbitals | No. of basis functions | NWChem60 time (s) | TAMM time (s) |
|---|---|---|---|---|---|
| BChl | 85 | 171 | 852 | 2030 | 715 |
| β-carotene | 96 | 148 | 840 | 1170 | 540 |
2. DLPNO CCSD(T)
Differentiate pairs with respect to their energy contributions ɛij by sequential pre-screenings. Identify ij-pairs that (a) are negligible and can be immediately dropped, (b) can be evaluated with a lower-level model (MP2), and (c) must be evaluated at the CC level.
Find an optimal representation of the virtual space for each ij-pair in which the derived tensors (amplitudes, residuals, or integrals) are dense.
Find the optimal factorization of terms in amplitude equations.
Perform only those tensor contractions that lead to non-zero results (including integral transformation and amplitude equation evaluation).
In our work, we assume that a larger number of nodes will be utilized, so more memory and compute capacity are available. In that case, we can afford tighter thresholds, leading to larger domains and PNO spaces and hence higher precision in the recovery of the correlation energy. For the implementation of the DLPNO formulations in TAMM, we employed a code generator implemented in Python that transforms the canonical equations by applying transformation rules for the various spaces (e.g., PNO and PAO). Using a code generator allowed us to systematically convert equations while automatically applying an operation cost minimization algorithm for single-term optimization. We anticipate that such code generators operating on the high-level equations will enable trying different transformations and automatically choosing the best-performing one. While this implementation is in the early stages, we were able to validate the correctness of the generated code using our infrastructure to directly compare the results with their canonical counterparts.
In the local version, the perturbative correction (T) described in Sec. IV A 2 is evaluated using the same equations with some differences.81 While the converged T1 and T2 amplitudes are obtained in their PNO spaces, the virtual indices of the triples terms are represented in triples natural orbitals (TNOs), which are computed from triples density matrices obtained from three pair density matrices, $D^{ijk} = \frac{1}{3}(D^{ij} + D^{ik} + D^{jk})$. In order to perform contractions of integrals and amplitudes in Eqs. (29) and (31), we need to involve transformation matrices between the PNO and TNO spaces, $S^{ij;klm}$, which are computed in the same way as in Eq. (41), with $d^{kl}$ replaced by $d^{klm}$, transforming PAO orbitals to the TNO space. Similar to the DLPNO CCSD implementation, we leveraged the TAMM framework’s dense tensor infrastructure to implement the perturbative correction with block-sparse computation. Using the PNO representation of the amplitudes from the CCSD implementation, we implemented the DLPNO formulation of the canonical equations.
For the development of DLPNO-based methods in NWChemEx, we utilized the specialized view tensor construction to avoid redundant storage of the transformation tensors, especially for the transformations required for the occupied pair indices. Figure 6 shows an example use case specific to the DLPNO CCSD equations. In this example, we use a view tensor over the large Sijkl transformation tensors. Depending on the pair indices used within the equations, various kinds of transformations are required. In this example, the constructed Sijki tensor (line 17) takes values from Sijkl if the first index of the first pair matches the second index of the second pair; otherwise, the corresponding blocks are treated as zero. In this case, there was no need to translate the blocks or update the tensor since the tensor was used as read-only. However, users can provide special constraints on both the translation of the blocks and the access to these blocks. By using this feature, we were able to reduce the memory footprint of the DLPNO methods and add constraints that leverage sparsity within the computation. Since TAMM allows the mixed usage of specialized tensors with other tensor types in all supported operations, the representation and execution patterns of the DLPNO CCSD equations in TAMM did not have to change in such scenarios.
View tensor construction for the Sijkl transformation tensors in the DLPNO CCSD equations.
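The access logic such a view could carry is sketched below as a pair of plain C++ lambdas; this is a hypothetical illustration of the mapping described above (a view block maps onto a stored Sijkl block only when the index constraint holds), not TAMM's exact view-tensor API, and the BlockId alias is introduced only for the sketch.

```cpp
// Hypothetical sketch of view-tensor access logic for the Sijki use case.
#include <cstddef>
#include <vector>

using BlockId = std::vector<std::size_t>;

// Translate a requested (i, j, k, i) block of the view to a block of Sijkl.
auto map_to_stored = [](const BlockId& view_id) -> BlockId {
  return {view_id[0], view_id[1], view_id[2], view_id[0]};
};

// A view block is non-zero only if the first index of the first pair equals
// the second index of the second pair; other blocks are never materialized.
auto is_nonzero = [](const BlockId& view_id, const BlockId& stored_id) {
  return stored_id[3] == view_id[0];
};
```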
3. Time-dependent coupled-cluster method
In addition to the stationary and frequency-dependent formulations of CC theory, we have made significant progress in developing a time-dependent CC method for simulating real-time dynamics.
The TDCC method has been studied in the context of the CC linear-response theory,68,69,82,83 x-ray spectroscopy and Green’s function theory,85–89 nuclear physics,90–92 condensed matter physics,93 and quantum dynamics of molecular systems in external fields.94–99 These studies have also initiated an intensive effort toward understanding many aspects of the TDCC formalism, including addressing fundamental problems such as the form of the action functional, the form of the time-dependent molecular orbital basis, the physical interpretation of time-dependent orbitals, various-rank approximations of the cluster operator, and the numerical stability of time integration algorithms. One of the milestone achievements in developing a time-dependent CC formalism was Arponen’s action functional for the bi-variational coupled cluster formalism93 and the following analysis by Kvaal considering time-varying orbitals.95
We have propagated the time-dependent CC amplitudes using a first-order Adams–Moulton method, which is an implicit time propagator. This algorithm requires copying and swapping of the real and imaginary parts of the time-dependent amplitudes in each micro-iteration. TAMM does not directly support swapping the real and imaginary parts of each element of a complex tensor. However, in this case, we used the block_for construct of TAMM to perform the swap in parallel. A specialized C++ lambda expression, as shown in Fig. 7, is provided to block_for, which simply specifies that the real and imaginary parts of each element in a given block need to be swapped and copied to (or updated in) the corresponding element of the destination tensor block.
Parallel element-wise swap and copy operation between complex tensors.
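A hedged sketch of this swap-and-copy lambda is shown below. The block_for pattern follows the description above, while the block-access calls (block_size, get, put) and the src, dst, and ec objects are assumptions standing in for the actual tensors and execution context; exact signatures may differ from TAMM's.

```cpp
// Sketch of the parallel element-wise swap-and-copy between complex tensors.
auto swap_re_im = [&](const IndexVector& blockid) {
  const std::size_t n = src.block_size(blockid);
  std::vector<std::complex<double>> buf(n);
  src.get(blockid, buf);                           // fetch the source block
  for (auto& v : buf) v = {v.imag(), v.real()};    // swap real and imaginary parts
  dst.put(blockid, buf);                           // write into the destination block
};
block_for(ec, src(), swap_re_im);                  // executed in parallel over all blocks
```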
The TDCC method performs time propagation for a sufficiently long duration to obtain well-resolved spectra. However, obtaining computing time for such an extended period in a single run is unlikely when using shared computing resources. In addition, it is highly beneficial to monitor simulations in order to adjust and optimize various parameters between different runs of the same simulation. The ability to checkpoint and restart simulations hence plays a significant role in the efficient utilization of computing resources. The TAMM library provides parallel file I/O operations on tensors, which helps with checkpointing and restarting the TDCC calculations. We simply use the TAMM interface to write and read tensor data to and from disk, as shown in Fig. 8. The underlying TAMM infrastructure handles the parallelization aspects of these I/O operations.
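The checkpoint/restart pattern can be sketched as below; the routine names write_to_disk and read_from_disk reflect the tensor I/O interface described here, while the t2 tensor handle and the file name are assumptions for the illustration.

```cpp
// Sketch of checkpointing and restarting TDCC amplitudes via parallel tensor I/O.
write_to_disk(t2, "tdcc.t2_amplitudes");   // collective parallel write of the distributed tensor
// ... on a subsequent restart ...
read_from_disk(t2, "tdcc.t2_amplitudes");  // collective parallel read back into the same layout
```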
The TAMM infrastructure has been utilized to implement the time evolution of the T(τ) operator. We have considered singles (S) and doubles (D) excitation approximations in our implementation. Our implementation uses Cholesky-decomposed two-body electron repulsion matrix elements,100,101 and we exploit spin-explicit formalism to evaluate tensor contractions of matrix elements of various operators.
4. Coupled cluster Green’s function
We refer to the list of all (ω, p) pairs in a given level as CCGF tasks, each of which can be executed independently. Each task solves the CCGF singles and doubles equations until convergence for a given (ω, p) pair. All the p orbitals for a given ω are divided across process groups that are set up at the beginning of each level. Currently, the user is responsible for explicitly creating MPI process groups and using them to construct the TAMM process group and execution context objects, which are, in turn, used for executing each task. Each TAMM process group now contains a subset of the resources (processes) provided to the CCGF calculation. In our CCGF implementation, the size of a process group for computing each task is determined automatically for a given problem size (i.e., number of occupied and virtual orbitals) and the resources (total number of nodes, processes, and GPUs per node) provided for that run. All process groups are created to be the same size. We are actively working on providing abstractions in TAMM for expressing computation to be divided across process groups without the user having to create them manually using MPI. Once the execution context objects representing different process groups are created, all CCGF tasks are distributed across the available process groups and executed using different execution contexts.
The resulting tensors from each task are stored on disk using the parallel file I/O routines provided by TAMM for writing and reading distributed tensors, as shown in Fig. 8. This enables restarting the CCGF calculation at any point. If the set of tasks in a given level is not completed, the subsequent CCGF run upon restart creates process groups only for the remaining tasks and executes them. Another TAMM feature used in CCGF is batched parallel file I/O. Several tensors per task are written to disk as part of the CCGF calculation, and the number of tensors written grows with the CCGF problem size and the number of frequencies (ω) to be solved. Even though each tensor is read/written using parallel file I/O, the operation uses only a small subset of the available nodes/processes for a single tensor, because the entire set of resources used for the CCGF calculation is not required to read/write even large tensors from/to disk. This leads to resource under-utilization: most of the resources are idle while a subset performs the parallel file I/O operation on one distributed tensor at a time. To address this issue, TAMM provides high-level batched tensor I/O routines that can read or write several hundred tensors concurrently by automatically dividing the total available resources into smaller process groups. Since the tensors to be read/written are not necessarily of the same size, variable-sized process groups are created dynamically based on the size of each tensor in a given list. Each of these process groups reads/writes one tensor using parallel file I/O within the group, and several tensors are read/written concurrently by different process groups, leading to significantly better resource utilization and an overall reduction in the total time spent in file I/O. The user expresses such an operation in TAMM as shown in Fig. 9; it is expressed exactly the same way as reading or writing a single tensor (e.g., Fig. 8), except that a list of tensor handles and corresponding file names is provided. Reference 109 describes the highly scalable CCGF implementation developed using TAMM in detail, along with its performance and scalability analysis on the OLCF Summit supercomputer. The implementation is also publicly available.110
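The batched interface described above might be used as in the sketch below. The *_group routine names and argument order are our assumptions for the batched I/O feature (Fig. 9 shows the actual call), and x1, x2, x3, and ec are assumed to be existing TAMM tensors and an execution context.

```cpp
// Sketch of batched tensor I/O: a list of tensors and matching file names is
// handed to a single call, and TAMM internally partitions the available
// processes into groups that write/read different tensors concurrently.
std::vector<tamm::Tensor<double>> tensors{x1, x2, x3};
std::vector<std::string>          files{"x1.ckpt", "x2.ckpt", "x3.ckpt"};

tamm::write_to_disk_group(ec, tensors, files);  // concurrent parallel writes
tamm::read_from_disk_group(ec, tensors, files); // concurrent parallel reads
```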
This section has provided an overview of some important CC methods and discussed key aspects of their implementation using TAMM. While frameworks such as TiledArray, CTF, DISTAL, and ExaTensor provide similar capabilities for expressing the tensor operations in CC methods, TAMM differs from them as explained in Sec. III C. To the best of our knowledge, these are the first complete, scalable, distributed implementations of the discussed CC methods that can run on different GPU architectures while also using different parallel programming models.
V. PERFORMANCE COMPARISON WITH OTHER TENSOR ALGEBRA FRAMEWORKS
We have provided feature comparisons with other distributed tensor algebra frameworks in Sec. III C, where key library features such as tensor-construction and tensor-operation specification, hardware support, and the underlying parallel programming model backends (for distributed data management) are compared. This section details the performance comparison with two other distributed tensor algebra frameworks, namely, TiledArray (TA) and the Cyclops Tensor Framework (CTF). We did not include ExaTensor since, at the time of this writing, its development page111 states that the library has pending performance issues as well as numerous problems with existing MPI-3 implementations.
All our experiments were performed on the GPU partition of the National Energy Research Scientific Computing Center (NERSC) Perlmutter supercomputer. Each node in this partition has a 64-core AMD EPYC 7763 CPU, 256 GB of DDR4 DRAM, and four NVIDIA A100 GPUs, each with 40 GB of HBM2e memory. Each GPU node has four NICs, and the GPU nodes are connected by the Slingshot 11 interconnect. The TAMM experiments were configured with four MPI ranks per node and one MPI rank per GPU. The TiledArray experiments were configured with four MPI ranks per node (one MPI rank per GPU) and two threads per rank, the TiledArray developers’ recommended configuration for Perlmutter. The CTF implementation performed best with either 8 or 12 MPI ranks per node (2 or 3 MPI ranks per GPU), depending on the problem size and the number of nodes used; for a given dimension size (N) and node count, we ran CTF with both 8 and 12 MPI ranks per node and report the better of the two timings. All codes were compiled with GCC 11.2, CUDA 11.7, and cray-mpich 8.1.
The performance results for the TAMM, TiledArray, and CTF implementations of Eq. (57) are shown in Fig. 10 for N = 400, 500, 600, and 800 on different Perlmutter node counts. As the number of nodes increases for a given problem size (N), i.e., under strong scaling, TAMM shows consistent performance improvements, whereas TA and CTF exhibit some oscillating behavior. The TAMM implementation strong-scales consistently for a given problem size (N) and provides competitive performance compared to TA and CTF, especially for large problem sizes and node counts.
FIG. 10. Scaling of the 4D tensor contraction with TAMM and comparison with two other implementations.
Table V shows the minimum number of Perlmutter nodes required to run each of the three implementations of the tensor contraction in Eq. (57) for large problem sizes (N = 600 and N = 800). As N increases, the minimum number of nodes required to run the benchmark with TA and CTF grows much faster than with TAMM. At the time of this writing, CTF crashed with segmentation faults on fewer than 100 nodes for N = 600 and fewer than 270 nodes for N = 800. Similarly, TA required a minimum of 40 nodes for N = 600 and 200 nodes for N = 800, whereas TAMM required only 25 and 80 nodes, respectively.
TABLE V. Minimum number of Perlmutter nodes required to run the dense 4D tensor contraction shown in Eq. (57).
| Dim size (N) | TAMM | TA | CTF |
|---|---|---|---|
| 600 | 25 | 40 | 100 |
| 800 | 80 | 200 | 270 |
For comparison with COSMA, we used the distributed matrix-multiply miniapp provided by COSMA.113 The size of each matrix dimension in the COSMA benchmark was chosen as N² so that each matrix has the same overall size [i.e., N⁴ elements] as each tensor in Eq. (57). Table VI shows the values of N considered in the comparison. At the time of this writing, we observe that COSMA, like the TiledArray and CTF implementations, also requires a minimum number of nodes that is significantly higher than what the corresponding TAMM implementation needs for a given problem size (N). TAMM performance is nearly identical to COSMA for N = 400 and about 10% better than COSMA for N = 500. For larger problem sizes, N = 600 and N = 800, the COSMA implementation ran out of memory on 300 and 400 nodes, respectively. The DISTAL39 work also provides a performance comparison with COSMA; the DISTAL authors report that COSMA generally achieved about 10% higher performance than DISTAL on up to 256 nodes of the Lassen supercomputer. Since the DISTAL code is not publicly available at the time of this writing, a direct performance comparison with DISTAL is currently not possible.
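For concreteness, the size matching described above amounts to the following identity (a restatement of the sentence above, not additional data):

\[
\underbrace{N^{2} \times N^{2}}_{\text{COSMA matrix}} \;=\; N^{4} \;=\; \underbrace{N \times N \times N \times N}_{\text{4D tensors in Eq. (57)}}.
\]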
CCSD benchmark: To further investigate performance on a real application, we implemented our Cholesky-CCSD equations as tensor expressions in TA and CTF. We ran calculations on the Ubiquitin-DGRTL74,84 molecule with the aug-cc-pVDZ basis (O = 146, V = 1096, 7810 Cholesky vectors) on up to 350 nodes of Perlmutter using the best-performing configurations for TA and CTF. For a single CCSD iteration on 240 nodes, TAMM was 80% faster than CTF, while TA ran out of memory. On 350 nodes, TAMM and CTF showed nearly identical timings, and TAMM was 2.5× faster than TA.
TABLE VI. Performance comparison with COSMA.
| Dim size (N) | Node count | TAMM | COSMA |
|---|---|---|---|
| 300 | 50 | 6 | 5 |
| 400 | 100 | 9 | 9 |
| 500 | 100 | 28 | OOMᵃ |
| 500 | 200 | 17 | 19 |
| 600 | 300 | 33 | OOMᵃ |
| 600 | 400 | 28 | OOMᵃ |
ᵃDid not run to completion due to out-of-memory errors.
VI. CONCLUSION
We have introduced the Tensor Algebra for Many-body Methods (TAMM) framework, which enables scalable and performance-portable implementations of important computational chemistry methods on modern large-scale heterogeneous high-performance computing systems. We described the TAMM framework in detail, introducing a tensor algebra interface that provides a high-level representation of tensor algebra operations as an embedded domain-specific language. This interface enables the separation of concerns between scientific application development and high-performance computing development: the domain scientist or scientific application developer can focus on method development rather than performance optimization details, whereas HPC developers focus on the underlying algorithms and optimizations. We then presented the modular infrastructure that allows different optimizations of tensor data distribution, execution, and scheduling of tensor operations for efficient execution on modern heterogeneous HPC platforms, and we compared the features of TAMM against several other distributed tensor algebra frameworks. Through various case studies, we showcased the performance and productivity benefits of using TAMM to implement complete ground- and excited-state electronic structure methods expressed as operations on tensors. Finally, we evaluated the performance of TAMM against other distributed tensor algebra frameworks and demonstrated its competitiveness and scalability for large problem sizes and node counts on NERSC Perlmutter.
ACKNOWLEDGMENTS
This research was supported by the Exascale Computing Project (No. 17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the Center for Scalable Predictive Methods for Excitations and Correlated Phenomena (SPEC), which is funded by the U.S. Department of Energy (DoE), Office of Science, Office of Basic Energy Sciences, Division of Chemical Sciences, Geosciences and Biosciences as part of the Computational Chemical Sciences (CCS) program at Pacific Northwest National Laboratory (PNNL) under Grant No. FWP 70942. PNNL is a multi-program national laboratory operated by Battelle Memorial Institute for the U.S. DoE under Contract No. DE-AC06-76RLO-1830. N.P.B. also acknowledges support from the Laboratory Directed Research and Development (LDRD) Program at PNNL. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory, the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract No. DE-AC02-06CH11357, and the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract No. DE-AC05-00OR22725.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Erdal Mutlu: Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Ajay Panyala: Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Nitin Gawande: Software (supporting). Abhishek Bagusetty: Software (supporting). Jeffrey Glabe: Writing – review & editing (supporting). Jinsung Kim: Software (supporting). Karol Kowalski: Methodology (equal); Writing – original draft (equal); Writing – review & editing (equal). Nicholas P. Bauman: Methodology (supporting); Software (supporting); Writing – review & editing (supporting). Bo Peng: Methodology (supporting); Writing – review & editing (supporting). Himadri Pathak: Methodology (supporting); Software (supporting); Writing – review & editing (supporting). Jiri Brabec: Methodology (supporting); Software (supporting); Writing – original draft (supporting); Writing – review & editing (supporting). Sriram Krishnamoorthy: Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.