PySCF is a Python-based general-purpose electronic structure platform that supports first-principles simulations of molecules and solids as well as accelerates the development of new methodology and complex computational workflows. This paper explains the design and philosophy behind PySCF that enables it to meet these twin objectives. With several case studies, we show how users can easily implement their own methods using PySCF as a development environment. We then summarize the capabilities of PySCF for molecular and solid-state simulations. Finally, we describe the growing ecosystem of projects that use PySCF across the domains of quantum chemistry, materials science, machine learning, and quantum information science.
This article describes the current status of the Python Simulations of Chemistry Framework, also known as PySCF, as of version 1.7.1. The PySCF project was originally started in 2014 by Sun, then in the group of Chan, in the context of developing a tool to enable ab initio quantum embedding calculations. However, it rapidly outgrew its rather specialized roots to become a general purpose development platform for quantum simulations and electronic structure theory. The early history of PySCF is recounted in Ref. 1. Now, PySCF is a production ready tool that implements many of the most commonly used methods in molecular quantum chemistry and solid-state electronic structure. Since its inception, PySCF has been a free and open-source package hosted on Github2 and is now also available through pip,3 conda,4 and a number of other distribution platforms. It has a userbase numbering in the hundreds and over 60 code contributors. Beyond chemistry and materials science, it has also found use in the areas of data science,5,6 machine learning,7–13 and quantum computing14–16 in both academia and industry. To mark its transition from a code developed by a single group to a broader community effort, the leadership of PySCF was expanded in 2019 to a board of directors.17
While the fields of quantum chemistry and solid-state electronic structure are rich with excellent software,18–25 the development of PySCF is guided by some unique principles in the order of priority as follows:
PySCF should be more than a computational tool; it should be a development platform. We aim for users to be empowered to modify the code, implement their own methods without the assistance of the original developers, and incorporate parts of the code in a modular fashion into their own projects.
Unlike many packages that focus on either molecular chemistry or materials science applications, PySCF should support both equally, allowing calculations on molecules and materials to be carried out in the same numerical framework and with the same theoretical approximations.
PySCF should enable users outside of the chemical sciences (such as workers in machine learning and quantum information theory) to carry out quantum chemistry simulations.
In the remainder of this article, we elaborate on these guiding principles of PySCF, describing how they have impacted the program design and implementation and how they can be used to implement a new functionality in new projects. We provide a brief summary of the implemented methods and conclude with an overview of the PySCF ecosystem in different areas of science.
II. THE DESIGN PHILOSOPHY BEHIND PySCF
All quantum simulation workflows naturally require some level of programming and customization. This may arise in simple tasks, such as scanning a potential energy surface, tabulating results, or automating input generation, or in more advanced use cases that include more substantial programming, such as with complex data processing, incorporating logic into the computational workflow, or when embedding customized algorithms into the computation. In either case, the ability to program with and extend one’s simulation software greatly empowers the user. PySCF is designed to serve as a basic program library that can facilitate custom computational tasks and workflows as well as form the starting point for the development of new algorithms.
To enable this, PySCF is constructed as a library of modular components with a loosely coupled structure. The modules provide easily reusable functions with (where possible) simple implementations, and hooks are provided within the code to enable extensibility. Optimized and competitive performance is, as much as possible, separated out into a small number of lower level components that do not need to be touched by the user. We elaborate on these design choices below.
A. Reusable functions for individual suboperations
It is becoming common practice to provide a Python scripting interface for input and simulation control. However, PySCF goes beyond this by providing a rich set of Python APIs (Application Programming Interfaces) not only for the simulation models but also for many of the individual sub-operations that compose the algorithms. For example, after input parsing, a mean-field Hartree–Fock (HF) or density functional theory (DFT) algorithm comprises many steps, including integral generation, guess initialization, assembling components of the Fock matrix and diagonalizing, and accelerating iterations to self-consistent convergence. All of these suboperations are exposed as PySCF APIs, enabling one to rebuild or modify the self-consistent algorithm at will. Similarly, APIs are exposed for other essential components of electronic structure algorithms, such as integral transformations, density fitting, Hamiltonian manipulation, various many-electron and Green’s functions solvers, computation of derivatives, relativistic corrections, and so forth, in essence across all the functionality of PySCF. The package provides a large number of examples to demonstrate how these APIs can be used in customized calculations or methodology development.
With at most some simple initialization statements, the PySCF APIs can be executed at any place and in any order within a code without side-effects. This means that when implementing or extending the code, the user does not need to retain information on the program state and can focus on the physical theory of interest. For instance, using the above example, one can call the function to build a Fock matrix from a given density matrix anywhere in the code, regardless of whether the density matrix in question is related to a larger simulation. From a programming design perspective, this is because within PySCF, no implicit global variables are used and functions are implemented free of side effects (or with minimal side effects) in a largely functional programming style. The PySCF function APIs generally follow the Numpy/Scipy API style. In this convention, the input arguments are simple Python built-in datatypes or Numpy arrays, avoiding the need to understand complex objects and structures.
B. Simple implementations
Python is among the simplest of the widely used programming languages and is the main implementation language in PySCF. Apart from a few performance critical functions, over 90% of PySCF is written in Python, with dependencies on only a small number of common external Python libraries (Numpy, Scipy, and h5py).
Implementation language does not hide organizational complexity, however, and structural simplicity in PySCF is achieved via additional design choices. In particular, PySCF uses a mixed object oriented/functional paradigm: complex simulation data (e.g., data on the molecular geometry or cell parameters) and simulation models (e.g., whether a mean-field calculation is a HF or DFT one) are organized in an object oriented style, while individual function implementations follow a functional programming paradigm. Deep object inheritance is rarely used. Unlike packages where external input configuration files are used to control a simulation, the simulation parameters are simply held in the member variables of the simulation model object.
Where possible, PySCF provides multiple implementations of the same algorithm with the same API: one is designed to be easy to read and simple to modify, and another is for optimized performance. For example, the full configuration interaction module contains both a slower but simpler implementation and heavily optimized implementations, specialized for specific Hamiltonian symmetries and spin types. The optimized algorithms have components that are written in C. This dual level of implementation mimics the Python convention of having modules in both pure Python and C with the same API (such as the profile and cProfile modules of the Python standard library). It also reflects the PySCF development cycle, where often a simple reference Python implementation is first produced before being further optimized.
C. Easily modified runtime functionality
In customized simulations, it is often necessary to modify the underlying functionality of a package. This can be complicated in a compiled program due to the need to consider detailed types and compilation dependencies across modules. In contrast, many parts of PySCF are easy to modify due to the design of PySCF as well as the dynamic runtime resolution of methods and “duck typing” of Python. Generally speaking, one can modify the functionality in one part of the code without needing to worry about breaking other parts of the package. For example, one can modify the HF module with a custom Hamiltonian without considering whether it will work in a DFT calculation; the program will continue to run so long as the computational task involves HF and post-HF methods. Furthermore, Python “monkey patching” (replacing functionality at runtime) means that core PySCF routines can be overwritten without even touching the code base of the library.
D. Competitive performance
In many simulations, performance is still the critical consideration. This is typically the reason for implementing code in compiled languages such as Fortran or C/C++. In PySCF, the performance gap between Python and compiled languages is partly removed by a heavy reliance on Numpy and Scipy, which provide Python APIs to optimized algorithms written in compiled languages. Additional optimization is achieved in PySCF with custom C implementations where necessary. Performance critical spots, which occur primarily in the integral and tensor operations, are implemented in C and heavily optimized. The use of additional C libraries also allows us to achieve thread-level parallelism via OpenMP, bypassing Python’s intrinsic multithreading limitations. Since a simulation can often spend over 99% of its runtime in the C libraries, the overhead due to the remaining Python code is negligible. The combination of Python with C libraries ensures that PySCF achieves leading performance in many simulations.
III. A COMMON FRAMEWORK FOR MOLECULES AND CRYSTALLINE MATERIALS
Electronic structure packages typically focus on either molecular or materials simulations and are thus built around numerical approximations adapted to either case. A central goal of PySCF is to enable molecules and materials to be simulated with common numerical approximations and theoretical models. Originally, PySCF started as a Gaussian atomic orbital (AO) molecular code and was subsequently extended to enable simulations in a crystalline Gaussian basis. Much of the seemingly new functionality required in a crystalline materials simulation is, in fact, analogous to the functionality in a molecular implementation, such as
Using a Bloch basis. In PySCF, we use a crystalline Gaussian AO basis, which is analogous to a symmetry adapted molecular AO basis.
Exploiting translational symmetry by enforcing momentum conservation. This is analogous to handling molecular point group symmetries.
Handling complex numbers, given that the matrix elements between Bloch functions are generally complex. This is analogous to the requirements of a molecular calculation with complex orbitals.
Other modifications are unique to the crystalline material setting, including
Techniques to handle divergences associated with the long-ranged nature of the Coulomb interaction, since the classical electron–electron, electron–nuclear, and nuclear–nuclear interactions are separately divergent. In PySCF, this is handled via the density fitting integral routines (see below) and by evaluating certain contributions using Ewald summation techniques.
Numerical techniques special to periodic functions, such as the fast Fourier transform (FFT), as well as approximations tailored to plane-wave implementations, such as certain pseudopotentials. PySCF supports mixed crystalline Gaussian and plane-wave expressions using analytic integrals as well as FFT on grids.
Techniques to accelerate convergence to the thermodynamic limit. In PySCF, such corrections are implemented at the mean-field level by modifying the treatment of the exchange energy, which is the leading finite-size correction.
Additional crystal lattice symmetries. Currently, PySCF contains only experimental support for additional lattice symmetries.
In PySCF, we identify the three-index density fitted integrals as the central computational intermediate that allows us to largely unify molecular and crystalline implementations. This is because
three-center “density-fitted” Gaussian integrals are key to fast implementations;
the use of the FFT to evaluate the potential of a pair density of AO functions, which is needed in fast DFT implementations with pseudopotentials,26 is formally equivalent to density fitting with plane-waves;
the density fitted integrals can be adjusted to remove the Coulomb divergences in materials;27 and
three-index Coulomb intermediates are sufficiently compact that they can be computed even in the crystalline setting.
PySCF provides a unified density fitting API for both molecules and crystalline materials. In molecules, the auxiliary basis is assumed to be Gaussian AOs, while in the periodic setting, different types of auxiliary bases are provided, including plane-wave functions [in the FFTDF (Fast Fourier Transform Density Fitting) module], crystalline Gaussian AOs [in the GDF (Gaussian Density Fitting) module], and mixed plane-wave-Gaussian functions [in the MDF (Gaussian and Planewave Mixed Density Fitting) module].28 Different auxiliary bases are provided in periodic calculations as they are suited to different AO basis sets: FFTDF is efficient for smooth AO functions when used with pseudopotentials, GDF is more efficient for compact AO functions, and MDF allows a high accuracy treatment of the Coulomb problem, regardless of the compactness of the underlying atomic orbital basis.
Using the above ideas, the general program structure, implementation, and simulation workflow for molecular and materials calculations become very similar. Figure 1 shows an example of the computational workflow adopted in PySCF for performing molecular and periodic post-HF calculations. The same driver functions can be used to carry out generic operations such as solving the HF equations or coupled cluster amplitude equations. However, the implementations of methods for molecular and crystalline systems bifurcate when evaluating k-point dependent quantities, such as the three-center density-fitted integrals, Hamiltonians, and wavefunctions. Nevertheless, if only a single k-point is considered (and especially at the Γ point where all integrals are real), most molecular modules can be used to perform calculations in crystals without modification (see Sec. V).
IV. DEVELOPING WITH PySCF: CASE STUDIES
In this section, we walk through some case studies that illustrate how the functionality of PySCF can be modified and extended. We focus on cases that might be encountered by the average user who does not want to modify the source code but wishes to assemble different existing PySCF APIs to implement a new functionality.
A. Case study: Modifying the Hamiltonian
In PySCF, simulation models (i.e., different wavefunction approximations) are always implemented such that they can be used independently of any specific Hamiltonian with up to two-body interactions. Consequently, the Hamiltonian under study can be easily customized by the user, which is useful for studying model problems or, for example, when trying to interface to different numerical basis approximations. Figure 2 shows several different ways to define one-electron and two-electron interactions in the Hamiltonian followed by subsequent ground and excited state calculations with the custom Hamiltonian. Note that if a method is not compatible with or well defined using the customized interactions, for instance, in the case of solvation corrections, PySCF will raise a Python runtime error in the place where the requisite operations are ill-defined.
B. Case study: Optimizing orbitals of arbitrary methods
The PySCF MCSCF module provides a general purpose quasi-second order orbital optimization algorithm within orbital subspaces (e.g., active spaces) as well as over the complete orbital space. In particular, it is not limited to the built-in CASCI, CASSCF, and multi-reference correlation solvers but allows orbital optimization of any method that provides energies and one- and two-particle density matrices. For this reason, PySCF is often used to carry out active space orbital optimization for DMRG (density matrix renormalization group), selected configuration interaction, and full configuration interaction quantum Monte Carlo wavefunctions via its native interfaces to Block29 (DMRG), CheMPS230 (DMRG), Dice31–33 (selected CI), Arrow31,33,34 (selected CI), and NECI35 [FCIQMC (full configuration interaction quantum Monte Carlo)].
In addition, it is easy for the user to use the MCSCF module to optimize orbitals in electronic structure methods for which the orbital optimization API is not natively implemented. For example, although orbital-optimized MP236 is not explicitly provided in PySCF, a simple version of it can easily be performed using a short script, as shown in Fig. 3. Without any modifications, the orbital optimization will use a quasi-second order algorithm. We see that the user only needs to write a simple wrapper to provide two functions, namely, make_rdm12, which computes the one- and two-particle density matrices, and kernel, which computes the total energy.
C. Case study: Implementing an embedding model
As a more advanced example of customization using PySCF, we now illustrate how a simple script with standard APIs enables PySCF to carry out geometry optimization for a wavefunction in Hartree–Fock (WFT-in-HF) embedding model, as shown in Fig. 4 with a configuration interaction with single and double excitations (CISD) solver. Given the Hamiltonian of a system, expressed in terms of the Hamiltonians of a fragment and its environment,
we define an embedding Hamiltonian for the fragment in the presence of the atoms in the environment as
Geometry optimization can then be carried out with the approximate nuclear gradients of the embedding problem,
where the fragment wavefunctions Ψfrag,HF and Ψfrag,CI are obtained from the embedding Hamiltonian Hemb. The code snippet in Fig. 4 demonstrates the kind of rapid prototyping that can be carried out using PySCF APIs. In particular, this demonstration combines the APIs for ab initio energy evaluation, analytical nuclear gradient computation, computing the HF potential for an arbitrary density matrix, Hamiltonian customization, and customizing the nuclear gradient solver in a geometry optimization.
V. SUMMARY OF EXISTING METHODS AND RECENT ADDITIONS
A. Hartree–Fock and density functional theory methods
The starting point for many electronic structure simulations is a self-consistent field (SCF) calculation. PySCF implements Hartree–Fock (HF) and density functional theory (DFT) with a variety of Slater determinant references, including restricted closed-shell, restricted open-shell, unrestricted, and generalized (noncollinear spin) references,37,38 for both molecular and crystalline (k-point) calculations. Through an interface to the Libxc39 and XCFun40 libraries, PySCF also supports a wide range of predefined exchange–correlation (XC) functionals, including the local density approximation (LDA), generalized gradient approximations (GGAs), hybrids, meta-GGAs, nonlocal correlation functionals (VV1041), and range-separated hybrid (RSH) functionals. In addition to the predefined XC functionals, the user can also create customized functionals in a DFT calculation, as shown in Fig. 5.
Because PySCF uses a Gaussian AO representation, the SCF computation is usually dominated by Gaussian integral evaluation. Through the efficient Gaussian integral engine Libcint,42 the molecular SCF module can be used with more than 10 000 basis functions on a symmetric multiprocessing (SMP) machine, without resorting to any integral approximations such as screening. Further speed-up can be achieved through Gaussian density fitting, and the pseudo-spectral approach (SGX) is implemented to speed up the evaluation of exchange in large systems.43–45
In crystalline systems, HF and DFT calculations can be carried out either at a single point in the Brillouin zone or with a k-point mesh. The cost of the crystalline SCF calculation depends on the nature of the crystalline Gaussian basis and the associated density fitting. PySCF supports Goedecker–Teter–Hutter (GTH) pseudopotentials46 that can be used with the associated basis sets (developed by the CP2K group).26,47 Pseudopotential DFT calculations are typically most efficiently done using plane-wave density fitting (FFTDF). Alternatively, all-electron calculations can be performed with standard basis sets, and the presence of sharp densities means that Gaussian density fitting performs better. Gaussian density fitting is also the algorithm of choice for calculations with HF exchange. Figure 6 shows an example of the silicon band structures computed using a GTH-LDA pseudopotential with FFTDF and in an all-electron calculation using GDF.
B. Many-body methods
Starting from a SCF HF or DFT wavefunction, various many-body methods are available in PySCF, including Møller–Plesset second-order perturbation theory (MP2), multi-reference perturbation theory (MRPT),48,49 configuration interaction (CI),50–53 coupled cluster (CC),54–63 multi-configuration self-consistent field (MCSCF),64,65 algebraic diagrammatic construction (ADC),66–70 and G0W071–74 methods. The majority of these capabilities are available for both molecules and crystalline materials.
1. Molecular implementations
The PySCF CI module implements solvers for configuration interaction with single and double excitations (CISD) and a general full configuration interaction (FCI) solver that can treat fermion, boson, and coupled fermion–boson Hamiltonians. The FCI solver is heavily optimized for its multithreaded performance and can efficiently handle active spaces with up to 18 electrons in 18 orbitals.
The CC module implements coupled cluster theory with single and double excitations (CCSD)56,61 and with the perturbative triples correction [CCSD(T)].57 Λ-equation solvers are implemented to compute one- and two-particle density matrices as well as the analytic nuclear gradients for the CCSD and CCSD(T) methods.55,58,59 PySCF also implements various flavors of equation-of-motion CCSD to compute electron affinities (EAs), ionization potentials (IPs), neutral excitation energies (EEs), and spin–flip (SF) excitation energies.54,60,62,63 Experimental support for beyond doubles corrections to IP and EA via IP-EOM-CCSD*75,76 and EA-EOM-CCSD* is also available. For very large basis sets, PySCF provides an efficient AO-driven pathway that allows calculations with more than 1500 basis functions. An example of this is shown in Fig. 7, where the largest CCSD(T) calculation contains 50 electrons and 1500 basis functions.77
Second- and third-order algebraic diagrammatic construction (ADC) methods are also available in PySCF for the calculation of molecular electron affinities and ionization potentials66–70 [EA/IP-ADC(n), n = 2, 3]. These have a lower cost than EA/IP-EOM-CCSD. The advantage of the ADC methods over EOM-CCSD is that their amplitude equations can be solved in one iteration and the eigenvalue problem is Hermitian, which lowers the cost of computing the EA/IP energies and transition intensities.
The MCSCF module provides complete active space configuration interaction (CASCI) and complete active space self-consistent field (CASSCF)64,65 methods for multi-reference problems. As discussed in Sec. IV B, the module also provides a general second-order orbital optimizer78 that can optimize the orbitals of external methods, with native interfaces for the orbital optimization of the density matrix renormalization group (DMRG),29,30 full configuration interaction quantum Monte Carlo (FCIQMC),35,79 and selected configuration interaction wavefunctions.31,32 Starting from a CASCI or CASSCF wavefunction, PySCF also implements the strongly contracted second-order n-electron valence perturbation theory48,49 (SC-NEVPT2) in the MRPT module to include additional dynamic correlation. Together with external active-space solvers, this enables one to treat relatively large active spaces for such calculations, as illustrated in Fig. 8.
2. Crystalline implementations
As discussed in Sec. III, the PySCF implementations of many-body methods for crystalline systems closely parallel their molecular implementations. In fact, all molecular modules can be used to carry out calculations in solids at the Γ-point, and many modules (those supporting complex integrals) can be used at any other single k-point. Such single k-point calculations only require the appropriate periodic integrals to be supplied to the many-body solver (Fig. 9). For those modules that support complex integrals, twist averaging can then be performed to sample the Brillouin zone. To use savings from k-point symmetries, an additional summation over momentum conserving k-point contributions needs to be explicitly implemented. Such implementations are provided for MP2, CCSD, CCSD(T), IP/EA-EOM-CCSD27 and EE-EOM-CCSD,80 and G0W0. For example, Fig. 10 shows the MP2 correlation energy and the CIS excitation energy of MgO, calculated using periodic density-fitted implementations; the largest system shown, with a 7 × 7 × 7 k-point mesh, correlates 5488 valence electrons in 9261 orbitals. Furthermore, Fig. 11 shows some examples of periodic correlated calculations on NiO carried out using the G0W0 and CCSD methods.
In Table I, we provide rough estimates of the sizes of problems that can be tackled using PySCF for each electronic structure method. Figures 7, 8, 10, and 11 illustrate some real-world examples of calculations performed using PySCF. Note that the size of system that can be treated is a function of the computational resources available; the estimates given above assume relatively standard and modest computational resources, e.g., a node of a cluster, or a few dozen cores. For more details of the runtime environment and program settings for similar performance benchmarks, we refer readers to the benchmark page of the PySCF website (www.PySCF.org). The implementation and performance of PySCF on massively parallel architectures is discussed in Sec. V H.
For molecular calculations using mean-field methods, PySCF can treat systems with more than 10 000 AO basis functions without difficulty. Figure 15 shows the time of building the Fock matrix for a large water cluster with more than 12 000 basis functions. With the integral screening threshold set to 10−13 a.u., it takes only around 7 h on one computer node with 32 CPU cores. Applying MPI (Message Passing Interface) parallelization further reduces the Fock-build time (see Sec. V H). For periodic boundary calculations at the DFT level using pure XC functionals, even larger systems can be treated using pseudopotentials and a multi-grid implementation. Table II presents an example of such a calculation, where for the largest system considered ([H2O]512 with more than 25 000 basis functions), the Fock-build time is about an hour or less on a single node.
To demonstrate the efficiency of the many-body method implementations, in Tables III and IV, we show timing data of exemplary CCSD and FCI calculations. It is clear that systems with more than 1500 basis functions can be easily treated at the CCSD level and that the FCI implementation in PySCF is very efficient. In a similar way, the estimated performance for other many-body methods implemented in PySCF is listed in Table I.
At the mean-field level, the current PySCF program can compute various nonrelativistic and four-component relativistic molecular properties. These include NMR shielding and spin–spin coupling tensors,84–89 electronic g-tensors,90–93 nuclear spin-rotation constants and rotational g-tensors,94,95 hyperfine coupling (HFC) tensors,96,97 electron spin-rotation (ESR) tensors,98,99 magnetizability tensors,94,100,101 zero-field splitting (ZFS) tensors,102–104 as well as static and dynamic polarizability and hyper-polarizability tensors. The contributions from spin–orbit coupling and spin–spin coupling can also be calculated and included in the g-tensors, HFC tensors, ZFS tensors, and ESR tensors. In magnetic property calculations, an approximate gauge-origin invariance is ensured for NMR shielding, g-tensors, and magnetizability tensors via the use of gauge including atomic orbitals.94,100,101,105,106
Electric field gradients (EFGs) and Mössbauer parameters107–109 can be computed using either the mean-field electron density or the correlated density obtained from non-relativistic Hamiltonians, spin-free exact-two-component (X2C) relativistic Hamiltonians,110–114 or four-component methods in both molecules and crystals.
Finally, analytic nuclear gradients for the molecular ground state are available at the mean-field level and for many of the electron correlation methods such as MP2, CCSD, CISD, CASCI, and CASSCF (see Table I). The CASCI gradient implementation supports the use of external solvers, such as DMRG, and provides gradients for such methods. PySCF also implements the analytical gradients of time-dependent density functional theory (TDDFT) with or without the Tamm–Dancoff approximation (TDA) for excited state geometry optimization. The spin-free X2C relativistic Hamiltonian, frozen core approximations, solvent effects, and molecular mechanics (MM) environments can be combined with any of the nuclear gradient methods. Vibrational frequency and thermochemical analysis can also be performed using the analytical Hessians from mean-field level calculations or numerical Hessians of methods based on the numerical differentiation of analytical gradients.
E. Orbital localization
PySCF provides two kinds of orbital localizations in the LO module. The first kind localizes orbitals based on the atomic character of the basis functions and can generate intrinsic atomic orbitals (IAOs),115 natural atomic orbitals (NAOs),116 and meta-Löwdin orbitals.117 These AO-based local orbitals can be used to carry out a reliable population analysis in arbitrary basis sets.
The second kind optimizes a cost function to produce localized orbitals. PySCF implements Boys localization,118 Edmiston–Ruedenberg localization,119 and Pipek–Mezey localization.120 Starting from the IAOs, one can also use orbital localization based on the Pipek–Mezey procedure to construct the intrinsic bond orbitals (IBOs).115 A similar method can also be used to construct localized intrinsic valence virtual orbitals that can be used to assign the core-excited states.121 The optimization in these localization routines takes advantage of the second order co-iterative augmented Hessian (CIAH) algorithm122 for rapid convergence.
For crystalline calculations with k-point sampling, PySCF also provides maximally localized Wannier functions (MLWFs) via a native interface to the Wannier90 program.123 Different types of orbitals are available as initial guesses for the MLWFs, including the atomic orbitals provided by Wannier90, meta-Löwdin orbitals,117 and localized orbitals from the selected columns of density matrix (SCDM) method.124,125 Figure 12 illustrates the IBOs and MLWFs of diamond computed by PySCF.
F. QM/MM and solvent
PySCF incorporates two continuum solvation models, namely, the conductor-like screening model126 (COSMO) and the polarizable continuum model using the integral equation formalism127,128 (IEF-PCM). Both of them are implemented efficiently via a domain decomposition (dd) approach129–133 and are compatible with most of the electronic structure methods in PySCF. Furthermore, besides equilibrium solvation where the solvent polarization is governed by the static electric susceptibility, non-equilibrium solvation can also be treated within the framework of TDDFT in order to describe fast solvent response with respect to abrupt changes in the solute charge density. As an example, in Ref. 134, the COSMO method was used to mimic the protein environment of nitrogenase in electronic structure calculations for the P-cluster (Fig. 13). For excited states generated by TDA, the polarizable embedding model135 can also be used through an interface to the external library CPPE.135,136
Currently, PySCF provides some limited functionality for performing QM/MM calculations by adding classical point charges to the QM region. The implementation supports all molecular electronic structure methods by decorating the underlying SCF methods. In addition, MM charges can be used together with the X2C method and implicit solvent treatments.
G. Relativistic treatments
PySCF provides several ways to include relativistic effects. In the framework of scalar Hamiltonians, spin-free X2C theory,137 scalar effective core potentials138 (ECPs), and relativistic pseudo-potentials can all be used for all methods in calculations of the energy, nuclear gradients, and nuclear Hessians. At the next level of relativistic approximations, PySCF provides spin–orbit ECP integrals and one-body and two-body spin–orbit interactions from the Breit–Pauli Hamiltonian and X2C Hamiltonian for the spin–orbit coupling effects.139 Two-component Hamiltonians with the X2C one-electron approximation and four-component Dirac–Coulomb, Dirac–Coulomb–Gaunt, and Dirac–Coulomb–Breit Hamiltonians are all supported in mean-field molecular calculations.
H. MPI implementations
In PySCF, distributed parallelism with MPI is implemented via an extension to the PySCF main library known as MPI4PySCF. The current MPI extension supports the most common methods in quantum chemistry and crystalline material computations. Table V lists the available MPI-parallel alternatives to the default serial (OpenMP) implementations. The MPI-enabled modules implement almost identical APIs to the serial ones, allowing the same script to be used for serial jobs and MPI-parallel jobs (Fig. 14). The efficiency of the MPI implementation is demonstrated in Fig. 15, which shows the wall time and speed-up of Fock builds for a system with 12 288 AOs with up to 64 MPI processes, each with 32 OpenMP threads.
To retain the simplicity of the PySCF package structure, we use a server–client mechanism to execute the MPI parallel code. In particular, we use MPI to start the Python interpreter as a daemon that receives both the functions and data on remote nodes. When a parallel session is activated, the master process sends the functions and data to the daemons. The function object is decoded remotely and then executed. For example, when building the Fock matrix in the PySCF MPI implementation, the Fock-build function running on the master process first sends itself to the Python interpreters running on the clients. After the function is decoded on the clients, input variables (such as the density matrix) are distributed by the master process through MPI. Each client evaluates a subset of the four-center two-electron integrals (with load balancing performed among the clients) and constructs a partial Fock matrix, similarly to the Fock-build functions in other MPI implementations. After sending the partial Fock matrices back to the master process, the client suspends itself until it receives the next function. The master process assembles the Fock matrices and then moves on to the next part of the code. The above strategy is quite different from the traditional MPI programs that hard-code the MPI functionality into the code and initiate the MPI parallel context at the beginning of the program. This PySCF design brings the important benefit of being able to switch on and off MPI parallelism freely in the program without the need to be aware of the MPI-parallel context (see Ref. 1 for a more detailed discussion of PySCF MPI mode innovations).
VI. THE PySCF SIMULATION ECOSYSTEM
PySCF is widely used as a development tool, and many groups have developed and made available their own projects that either interface to PySCF or can be used in a tightly coupled manner to access a greater functionality. We provide a few examples of the growing PySCF ecosystem below, which we separate into use cases: (1) external projects to which PySCF provides and maintains a native interface and (2) external projects that build on PySCF.
A. External projects with native interfaces
PySCF currently maintains a few native interfaces to external projects, including
geomeTRIC140 and pyberny.141 These two libraries provide the capability to perform geometry optimization, and interfaces to them are provided in the PySCF GEOMOPT module. As shown in Fig. 4, given a method that provides energies and nuclear gradients, the geometry optimization module generates an object that can then be used by these external optimization libraries.
DMRG, SHCI, and FCIQMC programs (Block,29 CheMPS2,30 Dice,31–33 Arrow,31,33,34 and NECI35). These interfaces closely follow the conventions of PySCF’s FCI module. As such, they can be used to replace the FCI solver in MCSCF methods (CASCI and CASSCF) to study large active space multi-reference problems.
Libxc39 and XCFun.40 These two libraries are tightly integrated into the PySCF code. While the PySCF DFT module allows the user to customize exchange–correlation (XC) functionals by linearly combining different functionals, the individual XC functionals and their derivatives are evaluated within these libraries.
TBLIS.144–146 The tensor contraction library TBLIS offers a similar functionality to the numpy.einsum function while delivering substantial speed-ups. Unlike the BLAS-based “transpose-GEMM-transpose” scheme that involves a high memory footprint due to the transposed tensor intermediates, TBLIS achieves optimal tensor contraction performance without such memory overhead. The TBLIS interface in PySCF provides an einsum function that implements the numpy.einsum API but with the TBLIS library as the contraction back-end.
CPPE.135,136 This library provides a polarizable embedding solvent model and can be integrated into PySCF calculations for ground-state mean-field and post-SCF methods. In addition, an interface to TDA is currently supported for excited-state calculations.
B. External projects that build on PySCF
There are many examples in the literature of quantum chemistry and electronic structure simulation packages that build on PySCF. The list below is by no means exhaustive but gives an idea of the range of projects using PySCF today.
Quantum Monte Carlo. Several quantum Monte Carlo programs, such as QMCPACK,147 pyQMC,148 QWalk,149 and HANDE,150 support reading wavefunctions and/or Hamiltonians generated by PySCF. In the case of pyQMC, PySCF is integrated as a dependent module.
Quantum embedding packages. Many flavors of quantum embedding, including density matrix embedding and dynamical mean-field theory, have been implemented on top of PySCF. Examples of such packages include QSoME,151–153 pDMET,154,155 PyDMFET,156 Potato,157,158 and OpenQEMIST,16 which all use PySCF to manipulate wavefunctions and embedding Hamiltonians and to provide many-electron solvers.
General quantum chemistry. PySCF can be found as a component of tools developed for many different kinds of calculations, including localized active space self-consistent field (LASSCF),154 multiconfiguration pair-density functional theory (MC-PDFT),159 and state-averaged CASSCF energy and analytical gradient evaluation (all these use the PySCF MCSCF module to optimize multi-reference wavefunctions), as well as for localized orbital construction via the Pywannier90 library.155 The PyMBE package,160 which implements the many-body expanded full CI method,161–164 utilizes PySCF to perform all the underlying electronic structure calculations. Green’s function methods such as the second-order Green’s function theory (GF2) and the self-consistent GW approximation have been explored using PySCF as the underlying ab initio infrastructure.165 In the linear scaling program LSQC,166,167 PySCF is used to generate reference wavefunctions and integrals for the cluster-in-molecule local correlation method. The APDFT (alchemical perturbation density functional theory) program168,169 interfaces to PySCF for QM calculations. In the PySCF-NAO project,170 large-scale ground-state and excited-state methods are implemented based on additional support for numerical atomic orbitals, which has been integrated into an active branch of PySCF. The PyFLOSIC package171 evaluates self-interaction corrections with the Fermi–Löwdin orbitals in conjunction with the PySCF DFT module. Furthermore, PySCF FCI capabilities are used in the molsturm package172 for the development of Coulomb Sturmian basis functions, and PySCF post-HF methods appear in VeloxChem173 and adcc174 for spectroscopic and excited-state simulations.
VII. BEYOND ELECTRONIC STRUCTURE
A. PySCF in the materials genome initiative and machine learning
As discussed in Sec. I, one of our objectives when developing PySCF was to create a tool that could be used by non-specialist researchers in other fields. With the integration of machine learning techniques into molecular and materials simulations, we find that PySCF is being used in many applications in conjunction with machine learning. For example, the flexibility of the PySCF DFT module has allowed it to be used to test exchange–correlation functionals generated by machine-learning protocols in several projects7,8 and has been integrated into other machine learning workflows.9,10 PySCF can be used as a large-scale computational engine for quantum chemistry data generation.5,6 In the context of machine learning of wavefunctions, PySCF has also been used as the starting point to develop neural network based approaches for SCF initial guesses,11 for the learning of HF orbitals by the DeepMind team,12 and for Hamiltonian integrals used by fermionic neural nets in NetKet.13
B. PySCF in quantum information science
Another area where PySCF has been rapidly adopted as a development tool is in the area of quantum information science and quantum computing. This is likely because Python is the de facto standard programming language in the quantum computing community. For example, PySCF is one of the standard prerequisites to carry out molecular simulations in the OpenFermion14 library, the QisKit-Aqua15 library, and the OpenQEMIST16 package. Through PySCF’s GitHub page, we see a rapidly increasing number of quantum information projects that include PySCF as a program dependency.
After five years of development, the PySCF project can probably now be considered to be a feature complete and mature tool. Although no single package can be optimal for all tasks, we believe that PySCF to a large extent meets its original development criteria of forming a library that is useful not only in simulations but also in enabling the customization and development of new electronic structure methods.
With the recent release of version 1.7, the current year marks the end of development of the version 1 branch of PySCF. As we look toward PySCF version 2, we expect to build additional innovations, for example, in the areas of faster electronic structure methods for very large systems, further support and integration for machine learning and quantum computing applications, better integration of high-performance computing libraries and more parallel implementations, and perhaps even forays into dynamics and classical simulations. Beyond feature development, we will expand our efforts in documentation and in quality assurance and testing. We expect the directions of implementation to continue to be guided by and organically grow out of the established PySCF ecosystem. However, regardless of the scientific directions and methods implemented within PySCF, the guiding philosophy described in this article will continue to lie at the heart of PySCF’s development. We believe that these guiding principles will help ensure that PySCF remains a powerful and useful tool in the community for many years to come.
As a large package, the development of PySCF has been supported by different sources. Support from the U.S. National Science Foundation via Award No. 1931258 (T.C.B., G.K.-L.C., and L.K.W.) is acknowledged to integrate high-performance parallel infrastructure and faster mean-field methods into PySCF. Support from the U.S. National Science Foundation via Award No. 1657286 (G.K.-L.C.) and Award No. 1848369 (T.C.B.) is acknowledged for various aspects of the development of many-electron wavefunction methods with periodic boundary conditions. Support for integrating PySCF into quantum computing platforms was provided partially by the Department of Energy via Award No. 19374 (G.K.-L.C). The Simons Foundation is gratefully acknowledged for providing additional support for the continued maintenance and development of PySCF. The Flatiron Institute is a division of the Simons Foundation. M.B. acknowledges support from the Departemento de Educación of the Basque Government through a Ph.D. grant as well as from Euskampus and the DIPC at the initial stages of his work. J.C. was supported by the Center for Molecular Magnetic Quantum Materials (M2QM), an Energy Frontier Research Center funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences under Award No. DE-SC0019330. J.J.E. acknowledges financial support from the Alexander von Humboldt Foundation and the Independent Research Fund Denmark. M.R.H. and H.Q.P. were partially supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Division of Chemical Sciences, Geosciences and Biosciences under Award No. DE-FG02-17ER16362 while working in the group of Laura Gagliardi at the University of Minnesota. P.K. acknowledges financial support from the Fellows Gipuzkoa program of the Gipuzkoako Foru Aldundia through the FEDER funding scheme of the European Union. S.L. was supported by the Academy of Finland (Suomen Akatemia) through Project No. 311149. A.P. thanks the Swiss NSF for the support provided through the Early Postdoc Mobility program (Project No. P2ELP2_175281). H.F.S. acknowledges the financial support from the European Union via Marie Skłodowska-Curie Grant Agreement No. 754388 and LMUexcellent within the German Excellence Initiative (Grant No. ZUK22). S.B. and J.E.T.S. gratefully acknowledge support from a fellowship through The Molecular Sciences Software Institute under NSF Grant No. ACI-1547580. S.S. acknowledges support of the NSF (Grant No. CHE-1800584). S.U. acknowledges the support of the NSF (Grant No. CHE-1762337). J.M.Y acknowledges support of the National Science Foundation Graduate Research Fellowship Program. N.S.B. acknowledges funding and support from St. John’s College, Cambridge. G.H.B. gratefully acknowledges the funding received from the European Research Council under Grant Agreement No. 759063.
The data that support the findings of this study are available within the article and/or from the corresponding author upon reasonable request.