Spherical harmonics provide a smooth, orthogonal, and symmetry-adapted basis to expand functions on a sphere, and they are used routinely in physical and theoretical chemistry as well as in many other fields of science and technology, from geology and atmospheric sciences to signal processing and computer graphics. More recently, they have become a key component of rotationally equivariant models in geometric machine learning, including applications to atomic-scale modeling of molecules and materials. We present an elegant and efficient algorithm for the evaluation of the real-valued spherical harmonics. Our construction features many of the desirable properties of existing schemes and allows us to compute Cartesian derivatives in a numerically stable and computationally efficient manner. To facilitate usage, we implement this algorithm in sphericart, a fast C++ library that also provides C bindings, a Python API, and a PyTorch implementation that includes a GPU kernel.
I. INTRODUCTION
The spherical harmonics are basis functions for the irreducible representations of the SO(3) group,1 which makes them a key tool for understanding physical phenomena that exhibit rotational symmetries and for designing practical algorithms to model them on a computer. Examples include the distribution of charge in atoms,2 the behavior of gravitational3 and magnetic4,5 fields, and the propagation of light6 and sound7 in the atmosphere. Moreover, they provide a complete set of smooth, orthogonal functions defined on the surface of a sphere, and as such they are widely used in many fields, including computer graphics,8,9 quantum10–13 and physical14–16 chemistry, and signal processing.17,18 More recently, they have become an essential tool in geometric deep learning,19,20 as a structural descriptor needed in SO(3)-, O(3)-, and E(3)-equivariant machine learning models21–24 and, more specifically, in the construction of symmetry-adapted descriptors of atomic structures in chemical machine learning.25–29 Derivatives of spherical harmonics are also frequently needed in these applications, e.g., to compute the forces acting on a configuration of atoms with an E(3)-invariant machine learning model.30 Likewise, spherical harmonic gradients have been used in computer graphics for mid-range illumination, irradiance volumes, and optimization methods.9
II. ANALYTICAL EXPRESSIONS
In contrast, our algorithm aims at calculating the scaled spherical harmonics $\tilde{Y}_l^m(\mathbf{r}) = r^l\, Y_l^m(\hat{\mathbf{r}})$, where $\hat{\mathbf{r}} = \mathbf{r}/r$. These are the so-called solid harmonics, which are simple homogeneous polynomials of the Cartesian coordinates. This choice avoids the need to normalize $\mathbf{r}$ and leads to simple and stable iterations to compute $\tilde{Y}_l^m$, as well as to very compact expressions for their derivatives with respect to the Cartesian coordinates, which re-use the same factors needed to evaluate $\tilde{Y}_l^m$ itself. In most applications that require spherical harmonics in Cartesian coordinates, the radial direction is dealt with by a separate expansion [cf. Eq. (1)], and the $r^l$ factor that is included in the scaled $\tilde{Y}_l^m$ can be compensated for at little to no additional cost by a corresponding $r^{-l}$ factor in the radial term. If, instead, spherical harmonics are needed in their conventional version, they can be recovered easily from the radially scaled (or solid) kind, as illustrated in the sketch below.
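For l = 1, for example, the correspondence can be written out explicitly. The following NumPy sketch uses the standard real-harmonics constant $\sqrt{3/4\pi}$ and the common $(y, z, x)$ ordering for $m = -1, 0, 1$; dividing the solid harmonics by $r^l$ is equivalent to evaluating the same polynomials at $\hat{\mathbf{r}}$.

```python
import numpy as np

# Real solid harmonics for l = 1: homogeneous degree-1 polynomials in
# (x, y, z), ordered as m = -1, 0, 1 -> (y, z, x).
def scaled_sh_l1(xyz):
    c = np.sqrt(3.0 / (4.0 * np.pi))
    x, y, z = xyz
    return c * np.array([y, z, x])

xyz = np.array([0.3, -1.2, 0.8])
r = np.linalg.norm(xyz)
tilde_Y = scaled_sh_l1(xyz)          # scaled (solid) harmonics
Y = tilde_Y / r**1                   # conventional spherical harmonics
# ... which is the same as evaluating the polynomials at the unit vector:
assert np.allclose(Y, scaled_sh_l1(xyz / r))
```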
III. COMPUTER IMPLEMENTATION AND BENCHMARKING
Most computational applications require the evaluation of all the spherical harmonics, and possibly their derivatives, for the values of l up to a maximum degree lmax. In many cases, the spherical harmonics have to be computed for many points simultaneously, e.g., for the interatomic separation vectors of all neighbors of a selected atom.
Even though we recommend using the scaled spherical harmonics in applications, letting the accompanying radial terms compensate for the scaling, we also provide an implementation of “normalized” spherical harmonics, obtained by simply evaluating $\tilde{Y}_l^m(\mathbf{r}/r)$ and applying the chain rule to the derivatives. These additional operations typically result in an overhead of 5%–10% for the calculation of the spherical harmonics and their derivatives.
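To spell out that chain rule in a minimal sketch (l = 1 only, with hand-written harmonics rather than the library's routines): the Jacobian of the normalization $\mathbf{u} = \mathbf{r}/r$ is $\partial u_j/\partial x_i = (\delta_{ij} - u_i u_j)/r$, and multiplying it into the gradients of the scaled harmonics evaluated at $\mathbf{u}$ yields the Cartesian derivatives of the normalized harmonics.

```python
import numpy as np

c = np.sqrt(3.0 / (4.0 * np.pi))

def sh_l1(u):
    # Conventional l = 1 real harmonics, for a unit vector u.
    return c * np.array([u[1], u[2], u[0]])

def grad_u_sh_l1(u):
    # Gradients of the scaled l = 1 harmonics w.r.t. their argument;
    # constant, because the l = 1 solid harmonics are linear polynomials.
    return c * np.array([[0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0],
                         [1.0, 0.0, 0.0]])  # (harmonic, coordinate)

xyz = np.array([0.3, -1.2, 0.8])
r = np.linalg.norm(xyz)
u = xyz / r
jac = (np.eye(3) - np.outer(u, u)) / r      # d u / d xyz
grad_xyz = grad_u_sh_l1(u) @ jac            # chain rule

# Finite-difference check of the analytical derivatives.
eps = 1e-6
fd = np.stack([
    (sh_l1((xyz + eps * np.eye(3)[i]) / np.linalg.norm(xyz + eps * np.eye(3)[i]))
     - sh_l1(u)) / eps
    for i in range(3)
], axis=1)
assert np.allclose(grad_xyz, fd, atol=1e-5)
```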
Our general routines take a list of Cartesian coordinates and evaluate the spherical harmonics and, optionally, their derivatives and/or second derivatives with respect to the inputs. We implement the core in C++, using templates to exploit compile-time knowledge of the maximum angular momentum, of whether derivatives are needed, and so on. On top of this C++ API, we define a pure C API covering the typical use cases. Since almost all programming languages have a way to call C functions, this makes our code usable from most languages used in a scientific context. We also provide a Python package with a high-level interface to the C library.
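A minimal usage sketch of the Python interface follows. The class and method names here are assumptions for illustration, and the package documentation should be consulted for the exact API.

```python
import numpy as np
import sphericart

# Hypothetical usage (names assumed; see the sphericart docs): evaluate all
# real spherical harmonics up to l_max, optionally with Cartesian gradients,
# for a batch of 3D points.
xyz = np.random.rand(100, 3)                        # 100 points
calculator = sphericart.SphericalHarmonics(l_max=8)
sph = calculator.compute(xyz)                       # one value per (l, m) pair
sph, dsph = calculator.compute_with_gradients(xyz)  # dsph: d(sph)/d(xyz)
```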
Furthermore, we use the C++ API to provide an implementation compatible with PyTorch,36 with a custom backward function that computes gradients from the derivatives evaluated in the forward pass. The code is compatible with TorchScript, allowing models to be used without a Python interpreter. Finally, we implement a GPU-accelerated version of the PyTorch code for NVIDIA GPUs, using the CUDA language. Table I compares the features of sphericart to those of other spherical harmonics implementations.41–50
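A corresponding PyTorch sketch (module and class names again assumed for illustration) shows how the custom backward integrates with autograd:

```python
import torch
import sphericart.torch

# Hypothetical usage (names assumed): the forward pass evaluates the
# harmonics and stores their derivatives; backward() re-uses them to fill
# xyz.grad, without finite differences or autograd tracing of the recursion.
device = "cuda" if torch.cuda.is_available() else "cpu"
xyz = torch.rand(100, 3, device=device, requires_grad=True)
calculator = sphericart.torch.SphericalHarmonics(l_max=8)
sph = calculator.compute(xyz)
sph.sum().backward()   # xyz.grad now holds the summed Cartesian gradients
```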
Table I. Features of sphericart compared to other spherical harmonics implementations.

| Package | Input | Output | Derivatives | Hardcoded |
|---|---|---|---|---|
| librascal41 | Polar | Real | Yes* | No |
| QUIP/turbosoap42 | Polar | Real | Yes* | No |
| PACE43 | Cartesian | Complex | Yes | No |
| e3nn38 | Cartesian | Real | Yes | Yes |
| ACE.jl44 | Polar | Complex | Yes* | No |
| Dscribe45 | Cartesian | Real | Yes | Yes |
| libcint/PySCF46 | Cartesian | Real | No | Yes |
| SciPy47 | Polar | Complex | No | No |
| gsl48 | Polar | Real | Yes | No |
| Boost49 | Polar | Complex | No | No |
| Google50 | Polar | Real | No | Yes/No |
| sphericart | Cartesian | Real | Yes | Yes/No |
A. CPU implementation details
We apply a number of trivial (and a few less obvious) optimizations. For example, we pre-compute the constant factors that enter Eq. (3) to minimize the number of operations in the inner loop of the iterative algorithm discussed above. To further accelerate the evaluation of low-l Cartesian harmonics, we use a computer algebra system to derive hard-coded expressions that evaluate the full $\tilde{Y}_l^m$ and their derivatives with the smallest possible number of multiplications; the sketch below illustrates the idea. As presented in Table II and as noted in previous implementations,24,33 there is considerable scope for optimization by using such ad hoc expressions. However, the speedup is less remarkable when including the calculation of the derivatives through Eq. (10), which re-uses quantities that have already been computed when evaluating $\tilde{Y}_l^m$.
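As an illustration of what such expressions look like (a hand-written sketch using the standard real-harmonics prefactors, not the library's generated code), the scaled harmonics up to l = 2 reduce to a handful of multiplications, and the monomials they share, e.g., x², y², and z², can be re-used by the derivatives:

```python
import numpy as np

# Hard-coded real scaled (solid) harmonics up to l = 2, ordered by l and
# then by m = -l ... l, with standard normalization constants.
C0 = 0.5 * np.sqrt(1.0 / np.pi)
C1 = np.sqrt(3.0 / (4.0 * np.pi))
C2A = 0.5 * np.sqrt(15.0 / np.pi)
C2B = 0.25 * np.sqrt(5.0 / np.pi)
C2C = 0.25 * np.sqrt(15.0 / np.pi)

def scaled_sh_up_to_l2(x, y, z):
    x2, y2, z2 = x * x, y * y, z * z     # shared monomials
    return np.array([
        C0,                              # l = 0
        C1 * y, C1 * z, C1 * x,          # l = 1
        C2A * x * y,                     # l = 2, m = -2
        C2A * y * z,                     # l = 2, m = -1
        C2B * (2.0 * z2 - x2 - y2),      # l = 2, m = 0
        C2A * x * z,                     # l = 2, m = 1
        C2C * (x2 - y2),                 # l = 2, m = 2
    ])
```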
Table II. Timings for the general-purpose and hard-coded CPU implementations, evaluating values only ($\tilde{Y}$) or values and derivatives ($\tilde{Y}$ & $\partial\tilde{Y}$).

| | General-purpose, $\tilde{Y}$ | General-purpose, $\tilde{Y}$ & $\partial\tilde{Y}$ | Hard-coded, $\tilde{Y}$ | Hard-coded, $\tilde{Y}$ & $\partial\tilde{Y}$ |
|---|---|---|---|---|
| lmax = 1 | 3.49 | 11.0 | 1.33 | 7.85 |
| lmax = 2 | 7.16 | 21.6 | 4.21 | 16.4 |
| lmax = 3 | 13.3 | 36.1 | 7.26 | 28.4 |
| lmax = 4 | 20.3 | 55.3 | 11.8 | 46.3 |
| lmax = 5 | 29.8 | 81.9 | 16.3 | 68.3 |
| lmax = 6 | 42.8 | 121 | 22.2 | 95.2 |
As a compromise between the convenience of a function that works for arbitrary lmax and the efficiency of optimized expressions, we provide an interface for the hard-coded implementation up to lmax = 6. By default, our implementation applies the hard-coded version for small l and then switches to the general expression, making use of the alternative recursion (8) to avoid the computation of the Legendre polynomials for small l. All our functions can be applied to many 3D points at once, and they are trivially parallelized over the sample index using OpenMP37 (a timing sketch is given below). Despite the simplicity of our design, we achieve good parallel scaling, particularly when using a relatively large lmax so that there is a substantial amount of computation for each thread, as illustrated in Fig. 1. The curves show some degree of irregular behavior, suggesting that there might be room for further optimization. As presented in Table III, for large lmax and large numbers of data points, there is a substantial performance gain in using single-precision floating-point arithmetic. For smaller amounts of computation, however, the advantage is very small, and in some corner cases (lmax = 8, nsamples = 10 000, and nthreads = 16), the single-precision version can even be slower than the one using 64-bit floating-point values, which further suggests that it may be possible to optimize memory-access patterns to improve parallel performance.
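The thread count of the OpenMP-parallelized loop can be controlled through the standard OMP_NUM_THREADS environment variable. The following sketch (re-using the assumed Python API from the earlier example) shows the kind of batched timing measurement behind Table III:

```python
import os
os.environ["OMP_NUM_THREADS"] = "16"  # set before the library is loaded

import time
import numpy as np
import sphericart  # class and method names assumed, as above

# Time the batched evaluation of values and gradients for 10 000 points.
xyz = np.random.rand(10_000, 3)
calculator = sphericart.SphericalHarmonics(l_max=16)
calculator.compute_with_gradients(xyz)          # warm-up call
start = time.perf_counter()
n_repeats = 100
for _ in range(n_repeats):
    calculator.compute_with_gradients(xyz)
print((time.perf_counter() - start) / n_repeats, "s per batch")
```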
Table III. Timings of the CPU implementation with single- and double-precision arithmetic and 1 or 16 OpenMP threads, for 10 000 points, evaluating values only ($\tilde{Y}$) or values and derivatives ($\tilde{Y}$ & $\partial\tilde{Y}$).

| | 1 thread, $\tilde{Y}$ | 1 thread, $\tilde{Y}$ & $\partial\tilde{Y}$ | 16 threads, $\tilde{Y}$ | 16 threads, $\tilde{Y}$ & $\partial\tilde{Y}$ |
|---|---|---|---|---|
| Single precision | | | | |
| lmax = 1 | 1.08 | 2.00 | 0.572 | 0.680 |
| lmax = 2 | 3.74 | 12.4 | 0.564 | 1.10 |
| lmax = 4 | 9.47 | 35.8 | 0.922 | 2.58 |
| lmax = 8 | 56.4 | 169 | 17.5 | 54.2 |
| lmax = 16 | 240 | 1093 | 28.9 | 126 |
| lmax = 32 | 1128 | 3101 | 109 | 366 |
| Double precision | | | | |
| lmax = 1 | 1.06 | 6.47 | 0.451 | 0.573 |
| lmax = 2 | 4.02 | 15.9 | 0.578 | 1.28 |
| lmax = 4 | 11.6 | 47.3 | 1.00 | 3.73 |
| lmax = 8 | 57.0 | 280 | 5.15 | 21.7 |
| lmax = 16 | 252 | 1602 | 18.3 | 164 |
| lmax = 32 | 1385 | 4651 | 148 | 440 |
B. CUDA implementation details
We adapt the implementation to a custom CUDA kernel, which allows efficient execution on GPUs. As in the CPU implementation, we use hard-coded spherical harmonics and derivatives for low l, but only up to a reduced lmax = 3 to lower shared-memory requirements, and we use the general expression for the remaining terms. We parallelize the computation with a two-dimensional thread block of 16 × 8 threads and a grid of (nsamples/16) blocks. The first dimension of the thread block parallelizes over samples, while the first thread in the second dimension is responsible for performing the computation. The remaining threads perform coalesced writing of the temporary buffers to global memory. We store the intermediate work buffers for the spherical harmonics and derivatives in fast shared memory, which defaults to 48 KB on most NVIDIA cards. Memory accesses are designed such that each sample-dimension thread is mapped contiguously to a shared-memory bank, eliminating possible bank conflicts. The CUDA wrapper automatically re-allocates the necessary amount of shared memory beyond the default 48 KB on the first compute call or, if the required allocation is too large, attempts to reduce the number of threads launched in the CUDA kernel. As a consequence, for accurate timings, we call the forward step once to perform the initialization and then measure timings thereafter (see the sketch below). We note that using the recursion (7) would require storing temporary values for all l ≤ lmax, so that the default 48 KB shared-memory allocation would already be filled for lmax ≈ 8 when computing derivatives in 64-bit floating-point format. Using the alternative recursion (8) allows us to evaluate one l channel at a time, so that values up to lmax ≈ 30 can be computed without adjusting the shared-memory allocation or reducing the number of sample-dimension threads. For GPUs that support more than 48 KB of shared memory per block, for example, the A100 (164 KB), the recursion (8) allows us to evaluate values up to lmax ≈ 82.
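In practice, this initialization pattern translates into a warm-up call before timing, as in the following PyTorch sketch (class names assumed, as in the earlier examples), which uses CUDA events to time the device-side execution:

```python
import torch
import sphericart.torch  # names assumed; see the library documentation

# The warm-up call triggers the shared-memory (re)allocation described
# above; subsequent calls are timed with CUDA events and explicit syncs.
xyz = torch.rand(100_000, 3, device="cuda")
calculator = sphericart.torch.SphericalHarmonics(l_max=16)
calculator.compute(xyz)                      # initialization / warm-up
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
n_repeats = 100
for _ in range(n_repeats):
    calculator.compute(xyz)
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end) / n_repeats, "ms per batch")
```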
Table IV presents timings for GPU-accelerated computations on an A100 SXM4 (80 GB) card. In general, single-precision floating-point arithmetic takes roughly half the computation time of double-precision arithmetic. We note that the GPU per-sample timings decrease significantly as the number of samples increases, as the GPU is not fully saturated for nsamples = 10 000: for nsamples = 100 000, the per-sample timings are lower by a factor of 1.5–4. Although the parallel CPU implementation is faster than the GPU for lmax ≤ 4 when using nsamples = 10 000, this trend reverses for higher values of lmax. For example, the 64-bit floating-point computations with nsamples = 10 000 and lmax = 16 show that the GPU outperforms the CPU by a factor of 8.
Table IV. Per-sample timings of the GPU implementation on an A100 SXM4 (80 GB) for batches of 10 000 and 100 000 points, evaluating values only ($\tilde{Y}$) or values and derivatives ($\tilde{Y}$ & $\partial\tilde{Y}$).

| | 10k points, $\tilde{Y}$ | 10k points, $\tilde{Y}$ & $\partial\tilde{Y}$ | 100k points, $\tilde{Y}$ | 100k points, $\tilde{Y}$ & $\partial\tilde{Y}$ |
|---|---|---|---|---|
| Single precision | | | | |
| lmax = 1 | 2.1 | 2.4 | 0.3 | 0.4 |
| lmax = 2 | 2.2 | 2.4 | 0.3 | 0.4 |
| lmax = 4 | 2.4 | 3.1 | 0.5 | 0.8 |
| lmax = 8 | 3.1 | 4.8 | 1.0 | 3.3 |
| lmax = 16 | 5.6 | 15.1 | 2.8 | 10.9 |
| lmax = 32 | 15.5 | 43.6 | 9.6 | 33.5 |
| Double precision | | | | |
| lmax = 1 | 2.2 | 2.6 | 0.4 | 0.6 |
| lmax = 2 | 2.3 | 2.9 | 0.4 | 0.8 |
| lmax = 4 | 2.7 | 4.0 | 0.7 | 1.7 |
| lmax = 8 | 4.2 | 8.5 | 1.7 | 4.6 |
| lmax = 16 | 9.3 | 22.2 | 5.2 | 16.0 |
| lmax = 32 | 28.8 | 75.0 | 20.6 | 65.1 |
We note in closing that, even though precise timings depend on the hardware and the maximum angular momentum considered, sphericart is between 10 and 40 times faster than the spherical harmonics implementation in e3nn,38 a widely used library for equivariant neural networks. We provide wrappers that use the same conventions as e3nn, to simplify integration of our implementation in existing models and to accelerate computational frameworks that are limited by the evaluation of $Y_l^m$.
IV. CONCLUSIONS AND PERSPECTIVES
Spherical harmonics are ubiquitous in the computational sciences, and their efficient calculation has become even more important in light of the widespread adoption of equivariant machine-learning models in chemistry. Reformulating the calculation of spherical harmonics in a scaled form that corresponds to real-valued polynomials of the Cartesian coordinates provides simple expressions for their recursive evaluation. Derivatives can be obtained with little overhead, re-using the same factors that enter the definition of the scaled $\tilde{Y}_l^m$, and without numerical instabilities. The conventional form of the spherical harmonics can be recovered easily by computing $\tilde{Y}_l^m$ at $(x/r, y/r, z/r)$ and applying the corresponding correction to the derivatives. In many applications, this normalization may be unnecessary, as the scaling can be incorporated (explicitly or through regression) into additional radial terms. We provide an efficient implementation as a C++ library, complete with a C API as well as Python and PyTorch bindings, that tackles the most common use case in scientific computing and geometric machine learning, i.e., the evaluation of real-valued spherical harmonics for all angular momentum channels up to a given cutoff lmax, possibly for many 3D points at once. The function call is parallelized over the sample direction, and it uses more efficient hard-coded expressions for the low-l terms. Future efforts will focus on extending the software library to different programming languages and frameworks, as well as on further optimization for new hardware platforms and improved parallelism.
ACKNOWLEDGMENTS
The authors would like to thank Kevin Kazuki Huguenin-Dumittan for useful discussions, Philip Loche for help with the sphericart library, and Prashanth Kanduri for help with the continuous integration framework. M.C. acknowledges funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 101001890-FIAMMA). G.F. acknowledges support from the Platform for Advanced Scientific Computing (PASC). M.C. and F.B. acknowledge support from the NCCR MARVEL, funded by the Swiss National Science Foundation (Grant No. 182892).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Filippo Bigi and Michele Ceriotti conceived the project and performed analytical derivations. All authors contributed to software development and the writing of the paper.
Filippo Bigi: Conceptualization (equal); Formal analysis (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Guillaume Fraux: Software (equal); Writing – review & editing (supporting). Nicholas J. Browning: Software (equal); Writing – review & editing (supporting). Michele Ceriotti: Conceptualization (equal); Formal analysis (equal); Methodology (equal); Resources (lead); Software (equal); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The sphericart library can be freely downloaded under the Apache License version 2.0 from the public git repository https://github.com/lab-cosmo/sphericart. A Python package that provides a convenient class to compute spherical harmonics and derivatives from a NumPy39 or a PyTorch36 array of 3D coordinates is available on PyPI at https://pypi.org/project/sphericart/. The timing data can be generated using the benchmarking code that is included in the software distribution.