Gaussian Approximation Potentials: theory, software implementation and application examples

Gaussian Approximation Potentials are a class of Machine Learned Interatomic Potentials routinely used to model materials and molecular systems on the atomic scale. The software implementation provides the means for both fitting models using ab initio data and using the resulting potentials in atomic simulations. Details of the GAP theory, algorithms and software are presented, together with detailed usage examples to help new and existing users. We review some recent developments to the GAP framework, including MPI parallelisation of the fitting code enabling its use on thousands of CPU cores and compression of descriptors to eliminate the poor scaling with the number of different chemical elements.


I. INTRODUCTION
Machine Learned Interatomic Potentials (MLIPs) have revolutionised atomic simulations by offering predictive and computationally inexpensive force fields [1][2][3][4].While their implementations differ, these models approximate the Potential Energy Surface (PES) of atomic systems based on a database of atomic configurations with corresponding high-accuracy properties calculated with ab initio quantum mechanical methods.
Among the numerous related methods and their software implementations [1,2,[5][6][7][8][9], the Gaussian Approximation Potential (GAP) approach follows a Bayesian approach, which allows the formulation of prior knowledge about the atomic system and interactions as hyperparameters, as well as uncertainty estimation.A further advantage of using Gaussian Process Regression (GPR) is that fitting the model is a convex optimisation problem [10] equivalent to the solution of a linear system, therefore many problems associated with minimising the loss function of neural networks [11] do not occur.
Similarly to other MLIP frameworks, the GAP package can utilise reference atomic databases produced with arbitrary ab initio methods and software packages.Total energies and derivative quantities (forces and stresses) are used to fit the PES, although models for local atomic properties, such as nuclear magnetic resonance (NMR) shieldings or Hirshfeld volumes may also be generated using our software.Recently, GAP was used to accelerate ab initio molecular dynamics simulations within the CASTEP [22] package, utilising an adaptive scheme that produces an evolving and improving GAP model during the dynamics [23].
In this paper, we present the current status of the GAP framework, discussing the particular adaptation of the sparse GPR that enables a performant implementation suitable to fit energetic properties of atomic systems.GAP, as implemented within the Quantum and Interatomic Potentials (QUIP) [24] software package, is introduced, emphasising the flexibility and extensibility of the code.We also discuss recent developments, such as parallelisation and compression of descriptors, making connections between the theory and practical usage.Finally, we present some examples, which are intended to illustrate a selection of features enabled by the GAP package.
By documenting implementation details for the available options, this paper is not only intended for practitioners fitting GAP models, but also for those developing other MLIP frameworks.

II. THEORY
The GAP framework utilises sparse GPR which is customised to fit PESs as well as local properties of atomic systems.A detailed introduction to GPR can be found elsewhere [10,25] and background on the GAP framework is presented in the review article by Deringer et al. [26].
Here we revise only the formulae necessary for discussing the software implementation.
A central assumption in fitting the PES of atomic systems is that the total quantum mechanical energy may be decomposed into local contributions ε which depend on descriptors x: where N d is the number of descriptors of type d.Descriptors may be the arguments to two-body energy terms, based on the interatomic distance, optionally augmented by the symmetrised atomic coordinations, or three-body terms, based on the bond angle and the symmetrised bond distances, optionally including the coordination of the central atom.The greatest advantage of the nonlinear regression techniques enabled by machine learning methods is the ability to parameterise the highly complex many-body energy terms.The descriptors forming the arguments to these functions may be n-body terms, based on the interatomic distances within a cluster of n atoms, or flexible many-body terms, based on the Smooth Overlap of Atomic Positions (SOAP) [27] descriptor, the bispectrum [3,28] or the Behler-Parrinello symmetry functions [1].Finally, GAP implements customised descriptors, to represent molecules, dimers and trimers.
In GAP, each energy term ε d is written as an independent sparse Gaussian Process, in the form where M d is the number of sparse or representative points of descriptor d, k d is the kernel, covariance or similarity function and c m are the fitting coefficients.The coefficients in (2) are fitted using a database of atomic configurations, where corresponding total energies and derivative quantities, such as forces and virial stresses, have been determined using ab initio quantum mechanical calculations.The target properties, denoted by y, of the fitting procedure are therefore the sum of local energy contributions in the form of total energies, or the sum of the partial derivatives of local energy terms in the form of forces or virial stresses.The differentiation operator with respect to a Cartesian coordinate r iα is propagated through the kernel function, resulting in partial derivatives The sparse GPR adapted to our case [26] becomes The kernel or covariance matrices K M M and K N M contain the function values k(x m , x m ′ ) and k(x n , x m ) respectively, where m and m ′ denote sparse points and n denote descriptors of the database configurations.In case of K N M , kernel functions may be the derivative values, −∇ x k(x n , x m ) ∂xn ∂riα , if the corresponding target observation in y is a force quantity.The diagonal Σ matrix contains the regularisation strength parameters (σ energy , σ force and σ virial , which may be specified individually), encoding the prior assumption regarding the accuracy of target values.Finally, the operator L is a shorthand for the summation that accumulates the local terms composing each target value in y.
Foster et al. have shown [29] that solving equation ( 4) directly can lead to numerically unstable results, in which uncertainties in the input lead to disproportionate errors in the output.Instead, we first define where the Cholesky decomposition of K M M results in the upper triangular matrix While K M M is positive semidefinite, depending on the database configurations and descriptor types, the sparse points may be highly correlated, leading to an ill-conditioned K M M matrix, preventing the Cholesky decomposition we use to obtain U M M .To regularise the sparse covariance matrix K M M , we add a small constant to the diagonal, informally called the jitter, which is typically 8-10 orders of magnitude less than the elements of K M M .The jitter has a similar effect on the resulting sparse model as the noise hyperparameter in a full GPR.As both the aleatoric and epistemic uncertainty is controlled by Σ, the error in the local energy term introduced by the use of jitter is a small broadening, which is expected to be on the order of the square root of the jitter.
Padding the vector of target properties y by an M -long vector of zeros, we rewrite eq. ( 4) as the solution of the least-squares problem leading to the solution in the form of A numerically stable solution can be obtained by first carrying out a QR factorisation of A = QR where Q is orthogonal, namely, it is formed by orthonormal column vectors, while R is an upper triangular matrix.Substituting the factorised form of A into eq.( 8) results in as Q T Q = I.

III. IMPLEMENTATION
We develop and maintain a collection of software tools called QUIP [30] to carry out molecular dynamics simulations.Part of this suite is an implementation of GAP, including the gap_fit program, implementing the sparse GPR.The majority of QUIP is written in modern Fortran, utilising many object-oriented features, although not fully exploiting the most recent Fortran standards due to lack of compiler support at the time of the original development of the code, which started in 2005.QUIP features a Python interface, quippy [31], allowing access to various functionalities and casting all atomic potentials as Atomic Simulation Environment (ASE) calculators [32].There also exist generic C and LAMMPS-specific C++ interfaces, that allow GAP models to be used in external simulation packages.The source code can be found on GitHub [24].

A. The GAP submodule
Similarly to other MLIPs, the main components of GAP are the calculation of descriptors, mapping Cartesian coordinates into invariant representations, and a regression method, in this case GPR.

Representations
The descriptors.f95file implements the mapping from the Cartesian coordinates of the atoms to invariant representations.In the package we provide a number of descriptors, but it is straightforward to implement new ones.While the user fitting GAP models does not have to interact with the source code, in the following we give an overview of what is necessary to implement or adapt a descriptor.
A set of standardised interfaces are used for each descriptor, which are overloaded, therefore adding new descriptors is straightforward and does not require any further modifications elsewhere.The initialise interface interprets the parameters of the descriptor, which are provided by the user as key-value pairs, and stores these in a descriptor object.The query function cutoff returns the spatial cutoff of a descriptor, whereas finalise resets the descriptor object and deallocates all storage.The descriptor_sizes function takes an Atoms object and determines how many descriptors and partial derivatives will be calculated based on the cutoff and the connectivity of the particles.Finally, the calc function returns descriptors calculated based on an Atoms and a descriptor object, and optionally, their partial derivatives with respect to atomic coordinates in a generic container object.All of this functionality is exposed in quippy, ensuring interoperability with ASE.

Regression
GPR is implemented in the file gp_predict.f95, with some fitting-specific features in gap_fit_module.f95and sparse point selection in clustering.f95.For a pair of descriptors x and x ′ of dimensions D, whose distance r is defined as we have implemented the squared exponential kernel and the piece-wise polynomial kernel with compact support where j = ⌊ D 2 ⌋ + 2. The SOAP descriptors should be used with the dot-product or, more generally, polynomial kernel The hyperparameters, such as θ or ζ, and the coefficients c are stored in a Fortran object which is used by the function gp_predict to evaluate (2) as well as the partial derivatives with respect to descriptor components and variances predicted from GPR.
During the training procedure, all descriptors x and their partial derivatives are calculated and stored.Pointers are used to denote which descriptors and derivatives contribute to target properties, thereby avoiding the need to store repetitive information.The kernel matrices K N M used in the fitting procedure in (4) are not calculated explicitly, only the accumulated terms in LK N M corresponding to quantum mechanical observable quantities.

B. GPR using gap fit
Finding the coefficients used in GAP models can be accomplished using the gap_fit command line program, where parameters are set as arguments or a configuration file.The input to the fitting procedure consists of the fitting data, model definitions, and additional options.The database of atomic configurations, together with the quantum mechanical properties are read in from an extended XYZ [33] file.The extended XYZ contains information of the lattice, atomic numbers and Cartesian coordinates, and may provide the total energy, forces and virial stress, or any combination of these for each individual configuration.
Each individual atomic configuration may optionally have a type specification, given within the extended XYZ file using the config type keyword.The configuration type may be used for fine-grain control, such as selecting a specific number of sparse points from those configurations.
The particulars of the model are provided using the gap argument as a form of colon-separated list of descriptor definitions.These include hyperparameters, such as the spatial cutoff or the desired number of sparse points per descriptor.Other hyperparameters, such as the regularisation, can be provided either in the command line, or specifically for each individual frame within the extended XYZ file, allowing fine-grained control and the use of inhomogeneous accuracy of target quantities.Additional options can be used to adjust the processing, tune technical parameters, or enable additional features like more verbose output.Details about the currently available arguments can be found later in section V, or running gap fit --help for up-to-date information.

Program structure
After initialisation and reading of the input, the extended XYZ frames, where each frame consists of an atomic structure, are parsed for the number of target properties and descriptors.Storage for the descriptor arrays are allocated accordingly, and descriptors are calculated during a second pass over the atomic configurations.
For each set of N descriptors, M ≪ N points are chosen as a representative, or sparse, set.Options include random selection, k-means clustering, a uniform grid spanning the range of descriptors and CUR decomposition [34].It is also possible to provide the sparse points via text files.As the sparse points need to form a linearly independent covariance matrix, duplicates within a given tolerance are removed and only considered once.This may result in fewer sparse points used than specified by the user, particularly if the atomic environments in the database are highly correlated.
With the specified sparse points, the covariance matrices K M N and K M M are calculated, and matrix A constructed.The coefficients are determined via QR decomposition using (Sca)LAPACK [35,36] routines.
The memory requirement for the gap_fit program depends on the atomic structures, the number of target properties and sparse points, and descriptor definitions.In particular, the two main data components with significant memory requirements are the descriptors and FIG.1: MPI gap fit workflow.If run in serial, the tasks are not distributed and n is used instead of n * .their partial derivatives and the kernel matrix LK N M .The memory associated with storing descriptor derivatives scales linearly with the number of atomic environments and the dimensionality of the descriptor, as well as the number of neighbours within the spatial cutoff.Efforts directed at developing compact descriptors, using compression techniques, therefore significantly reduce the memory requirements of the fitting procedure, as well as the computational complexity of evaluating the descriptors.The size of the kernel matrix scales linearly with the number of target values and the total number of sparse points.
Recently, we have implemented domain decomposition in gap_fit that aims to utilise massively parallel computer architectures [37].The implementation relies on Message Passing Interface (MPI) and is illustrated on Fig. 1.
After determining the number of atoms and, consequently, the number of target data values in each configuration of the database, configurations are assigned to individual MPI tasks.Descriptors are computed locally, and the covariance matrix K M N is constructed in a distributed fashion.This approach allows the memory requirements of the program to be distributed over many computational nodes, therefore larger databases can be easily fitted without the need of specialised, highmemory servers.The linear algebra step makes use of the ScaLAPACK library, carrying out the QR decomposition and subsequent back-substitution steps in parallel, thereby reducing the computational time.We demonstrate the benefits of the parallel fitting approach by studying two fitting problems that would require significant amounts of wall-time and memory using a single node.One of the training databases consists of the highentropy alloy configurations by Byggmastar et al. [38], and the other is a data set of silicon carbide configu-FIG.2: Speedup (main panel) and memory (inset) requirements for fitting a high-entropy alloy (HEA) and silicon carbide (SiC) training set using a varying number of nodes.
rations from Ref. [37].The dependence of computational speedup and total memory use is presented on Fig. 2, showing that we have achieved excellent parallel performance, and we can utilise the aggregate memory of multiple computational nodes, thereby largely eliminating any limitations.We note that the dependence of the memory usage on the number of cores is due to overheads associated with repeated data allocations that are private to a process.Given the typical amount of memory, on the order of hundred GBs, available on commodity computing nodes, this is not likely to be a significant barrier to large-scale parallel fitting calculations.For more information, we refer the user to our prior work [37], where we explored hybrid OpenMP-MPI parallel strategies that optimise overall memory use and runtime.

Sparse point selection
The program gap fit implements several methods for the sparse point selection, controlled by the sparse method command line argument.Within the definition of each descriptor, the n sparse argument controls the number of sparse points.Alternatively, config type n sparse allows the user to specify a given number of sparse points from labelled configurations, to ensure adequate representation.Given none, no sparsification is applied, apart from the removal of duplicate descriptors, and all points are selected.The points may be chosen directly with either the file or index file option, in conjunction with the sparse file argument to specify the filename containing the descriptors or the indices of descriptors.The indices are 1-based and refer to descriptors as calculated in the database file.
Recommended choices are uniform for a distance 2b descriptor and cur points for a soap descriptor.For completeness, we list all currently implemented options in the Appendix.

C. Descriptors
The choice of invariant representation of atomic environments has a profound effect on the quality of the resulting interatomic potential.QUIP, being a test bed for methodological developments of MLIPs, implements numerous descriptors, of which some frequently used ones are presented in this section.
In the gap fit program, descriptors are specified in the gap command line argument, using the syntax gap={descriptor1 key=value ... : descriptor2 key=value ...}.The descriptor definitions are, by default, treated as templates by gap fit, and after parsing the database configurations, each descriptor is expanded with element (chemical species) information.If add species=F is added to the descriptor definition, the descriptor is not modified in this step.
Most descriptors implement the cutoff keyword, specifying the spatial cutoff within atomic connectivities are considered.The cutoff transition width keyword provides a smooth transition ensuring a numerically well behaved characteristic when the descriptor is used to build an interatomic potential.

Descriptors based on interatomic distances
Based on the idea of the cluster expansion of the total energy (14) the n-body terms may be fitted using GPR or other regression methods.The cluster of n atoms, representing the input variable of each term, is well defined by a monotonic function, which could be just the identity, of interatomic distances r = [r 12 , r 13 , . ..].
In case of the two-body descriptor, GAP implements a polynomial transformation that generates a descriptor from the pair distance r in the form of [r p1 , r p2 , . ..],where {p i } n i=1 are a set of exponents.When using this descriptor in conjunction with a dot-product kernel, the generalised form of a pair potential may be recovered.However, for three-or higher body energy terms the descriptor formed as a list of interatomic distances is not invariant with respect to the permutation of indices of the same elements within the cluster, therefore cannot be directly used for regression.Permutational invariance is achieved by symmetrising, then normalising, the kernel: where P represents the permutation of the order of atoms, and k ′′ is used in the GPR.
In GAP, we implemented the distance nb descriptor, where the body order is defined using the order keyword.The compact cluster keyword specifies the topology of the cluster.If compact cluster=T, each atom in an atomic configuration is considered as a central atom, and clusters are formed with its n−1 neighbours that are within the spatial cutoff.With the compact cluster=F, all possible graphs where the graph edge has a distance less than the cutoff are formed, allowing, for example, linear chains.
The special cases corresponding to two-and threebody terms are implemented as distance 2b and angle 3b, respectively.In case of angle 3b, the trimer of atoms formed by the central atom i and its two neighbours j, k is represented by the invariant descriptors As the interatomic distances are used in the kernel directly in these descriptor classes, the implementation of a smoothly varying spatial cutoff must be implemented in the kernel function.We modify the kernel by multiplying it by a cutoff function which smoothly interpolates between zero, where any of the interatomic distances are greater than the spatial cutoff, and one.The elementary cutoff function where r cut is the spatial cutoff and d is a transition width, is evaluated for each pairwise distance.The final cutoff function is obtained as a product of elementary cutoff functions, ensuring that each energy term vanishes smoothly.
GAP also allows further adjustment of the tail behaviour of the two-body descriptor, in order to approximate the polynomial decay of some interaction types.This is achieved by further multiplying the cutoff function by erf(αr) r q , where α is a range parameter and q is an exponent.

SOAP descriptors
The SOAP descriptors were proposed a decade ago [28] as invariant descriptors of atomic environments, and have been used successfully to develop interatomic potentials [12,13,38,39], clustering [40] as well as other machine learning tasks, such as the ShiftML model used to predict Nuclear Magnetic Resonance chemical shifts [41].For details on the theory, the review by Musil et al. [27] may be consulted.Here we only repeat what is necessary to explain the implementation options.
To construct the SOAP descriptor, the atomic environment is first written as a neighbourhood density function, where atoms of element α are represented by Gaussians: where the sum over i includes all neighbouring atoms with position vector r i , z i is the corresponding atomic number and σ is a length scale hyperparameter.The element density ρ α is expanded in a basis set consisting of the products of orthonormal radial basis functions g n and the spherical harmonics Y lm , resulting in the coefficients c α nlm .In the following, we often refer to grouping of certain indices in the basis expansions as channels, a term borrowed from signal processing.Therefore, the element indices α are the element channels and the radial basis indices n are the radial channels.The spherical harmonics indices l, together with the corresponding m indices, form the angular channels.
An invariant kernel or similarity function of two atomic environments is obtained by calculating the overlap of the densities, which has to be integrated over all rotations: For the choice of ν = 2, the SOAP kernel can be analytically evaluated in the form of a dot-product kernel due to the properties of the Wigner D-matrices representing the rotational transformation of the coefficients.
To ensure that k(ρ, ρ) = 1 for any ρ, we normalise the kernel as The dimension of p scales as O(n 2 max l max S 2 ) where n max , l max and S are the number of radial basis functions, spherical harmonics and elements, respectively.Apart from the original implementation of SOAP, a more efficient variant introduced by Caro [42] is available as soap turbo.This descriptor is further described in Sec.III C 4 and a comparison with regular SOAP is provided in Sec.IV B.
Increasing n max and l max improves the resolution of the basis set expansion, and are therefore convergence parameters of the SOAP kernel.Optimal values are strongly dependent of the typical number of neighbours and the Gaussian broadening parameter σ.In many applications, the user has the choice to adjust n max and l max to achieve the desired balance of accuracy and computational speed.However, the length of p has a quadratic dependence on the number of elements, thereby the computational cost of both the components of p and the k(ρ, ρ ′ ) as a dot product are impacted.Various strategies to reduce this scaling have been proposed, which are discussed below.

SOAP Compression
The O(l max n 2 max S 2 ) scaling of the number of descriptor components in SOAP is often limiting as it makes studying chemically diverse systems, such as multi-component alloys or proteins, very computationally demanding.A widely used approach [2,[43][44][45][46] to reduce this scaling is to embed the elements (and optionally radial) channels into a fixed K-dimensional space as ) and then form a compressed descriptor as (or To achieve good performance for K < S [47] the embedding weights are typically optimised during fitting and, following Willatt et al., they are interpretable as encoding similarity between different elements via the alchemical kernel [43] This idea was extended by Darby et al. [48], where it was shown that it is sufficient to couple the embedding channels to themselves only, rather than taking a full tensor product across the embedded index k, thus making the scaling linear in K, rather than quadratic.Two flavours of these tensor-reduced descriptors were proposed.The first is motivated by considering fitting a linear model as where the a are the model coefficients and the element and radial indices have been grouped together.For each value of l the matrix of coefficients a l can be approximated using symmetric eigen-decomposition as This decomposition is exact for K = n max S with w the eigenvectors of a l and is systematically improvable with random w.Substituting this approximation into Eq.( 26) results in where pk l are the new features.As the approximation in Eq. ( 27) is systematic, any function that can be fit as a linear function of p αβ nn ′ l can also be fit using a linear function of p k l .An alternative and complementary view motivated by using random mixing weights w k (αn) is to "sketch" the power spectrum as so that where we have used the fact that E[w k i w k j ] = σ 2 δ ij if the w k i are symmetric random variables with zero mean and that u and w are independent.This is a form of tensorsketching [49] and is also systematic with the expected error in approximating the kernel decreasing as K − 1 2 .The different flavours of element-embedding and tensor-reduction listed above are all accessible via specifying various combinations of R mix, Z mix, sym mix and K, which specify how the initial channels should be mixed, and the coupling keyword which specifies how the resulting channels should be coupled together.Note that optimisation of the embedding weights w is not available in gap fit with normallly distributed random weights used instead.Please see the keyword glossary for more details.
An alternative compression strategy proposed by Darby et al. [50] simply involves summing over one (or more) of the α, β, n or n ′ indices in Eq. (23).It is efficient to perform this summation at the level of the density expansion coefficients c α nl where it can also be most easily understood; summing over a radial index n is equivalent to projecting the 3D density onto the surface of the unit sphere whilst summing over the element index α corresponds to forming the total, element-agnostic density.The power-spectrum is a generalised 3-body descriptor where each term in the following sum corresponds to a triangle formed by the central atom and the neighbour atoms i and j.As such, the various possible effects of this compression scheme on an individual 3-body (correlation order 2) term in this summation can be visualised as in Fig. 3.The different options are labelled according to the element-sensitive ν S and radially-sensitive ν R correlation orders where each summation over an element (or radial) index lowers the respective correlation order by one, e.g., FIG. 3: Different SOAP compression strategies.Neighbour atoms around the central atom (black) may be represented as element-agnostic (grey) or element-specific (red or blue).To eliminate the radial dependence, neighbours may be projected on the unit sphere (dashed circle) around the central atom.Reprinted from Darby et al [50] with permission under the CC BY 4.0 license.
Finally, it is also possible to achieve compression through the experimental Z map keyword, which allows the user to group different elements together; equivalent to element embedding with w k α = 1 if element α is in group k or 0 if it is not.As two densities are coupled, two distinct sets of groups may be specified if desired.Please see the keyword glossary for more details.

soap turbo descriptors
The soap turbo descriptor is a variant of SOAP optimised for computational efficiency.A detailed account of this descriptor has been given in Ref. [42].Here, we briefly describe its main features while giving a more indepth description of the features that have been introduced since the publication of the original paper, namely multispecies support and compression.A comparison between the soap and soap turbo implementations of SOAP is given in Sec.IV B.
The representation of the atomic density field in the local neighborhood, that is, within a cutoff sphere of radius r cut of atom i, is carried out in an explicitly separable form of radial and angular channels.Therefore the expansion coefficients can also be split into components that depend exclusively on the radial index n or angular indices l, m: where b i,j n are the radial expansion coefficients, a i,j lm are the angular expansion coefficients and j runs over all neighbours of i within the cutoff sphere.A number of smoothing and scaling functions are introduced, to make the width of the atom-centered smooth functions depend on the distance from the center of the SOAP sphere.The main implementation differences between soap and soap turbo are the use of smoother polynomial radial basis functions and several numerical tricks that allow to express the radial and angular expansion coefficients as recursive series.There are also differences in how multispecies support and compression are handled, described below.
Support in soap turbo for multiple chemical elements is provided by augmenting the radial basis set via a direct sum: where s runs over the number of elements.The only advantage of this approach compared to the regular SOAP multielement support is that each element can be represented with a different radial basis set, including a different number of radial basis functions.One instance where this feature may be useful is when one of the elements can be represented with fewer radial basis functions than another, reducing the dimensionality of the descriptor and thus its computational cost.In principle, this approach also allows for using different cutoff radii for different elements within the same descriptor although, in practice, the GAP interface currently restricts the cutoff to be the same for all elements.The angular basis, on the other hand, is the same for all elements.
Compression is also supported by soap turbo through three different approaches.The first compression scheme is a heuristic recipe, that we refer to as "trivial", which retains only the SOAP elements that run over n = 1 for single element and the first radial component of the element-specific basis for multiple elements: where the tilde indicates the compressed descriptor and N 1 r is the number of radial basis functions for the first element, and the direct sum continues until the last element in the descriptor has been considered.The Kronecker deltas indicate that only components with n = 1, n = N 1 r +1, etc., are retained.Trivial compression affords of the order of a factor of 5 in dimensionality reduction without significant loss in accuracy for most production GAP models that we have fitted so far.
The second compression scheme in soap turbo provides a quasiequivalence with the radial-and elementsensitive correlation orders offered in regular SOAP compression introduced in Section III C 3. Numerical comparisons of these compression recipes are given in the local property example in Sec.IV.
The third compression scheme has no predefined recipe.Instead, the user can provide an arbitrary linear transformation (via an input text file) that projects the SOAP vector from its original N -dimensional space to a reduced M -dimensional space, where M < N : In all cases, the descriptor is renormalized after compression.
Finally, we remark that due to the overlap properties of the polynomial radial basis sets and related instabilities in the numerical approach employed to construct the orthonormal radial basis used to construct soap turbo descriptors, there is a practical limitation of n max ≈ 10.For most practical purposes (e.g., in constructing accurate GAP force fields), there is no need to increase the size of the radial basis set beyond ≈ 8 basis functions.

IV. PRACTICAL EXAMPLES A. Si interatomic potential
Among the first successful applications of GAP was a general-purpose interatomic potential for silicon [39].We have used the database of atomic configurations to train a series of GAP models to demonstrate the effect of the most crucial descriptor and kernel hyperparameter choices on the performance and computational cost of the resulting potential.The extended XYZ file, containing the database and shared in the supplementary material of Ref. [39], was randomly split into a training and a test set, containing 80% and 20% of the original configurations, respectively.We list the parameters used in the gap fit command line in Table I with a detailed explanation for each.
While keeping all other parameters constant, we individually varied the SOAP parameters n max , l max , r cutoff and σ, as well as the polynomial kernel exponent ζ and the number of sparse points M .Based on the original Si GAP model, the parameters, n max = 6, l max = 6, r cutoff = 5 Å, σ = 0.5 Å, ζ = 4 and M = 8000 were used.
We have evaluated the interaction energies, forces and virial stresses of all atomic configurations in the test set with the resulting models, and calculated the Root Mean Square Error (RMSE) with respect to the ab initio energies.
To illustrate how the choice of the parameters affects the computational cost of each model, we have also determined the average calculation time per atom, using a desktop computer utilising a single core of an Intel® Core™ i5-9600K CPU at 3.70 GHz.
Trends are presented in Fig. 4, generally showing that more complex models, i.e., those with higher l max , n max , ζ and M values, are more accurate.Thanks to the regularisation term in the GPR, higher complexity does not result in overfitting.However, the computational cost of models with higher l max , n max and sparse points is increased due to the larger number of terms.Even though the polynomial kernel at higher orders results in more terms, these are not calculated explicitly, therefore the computational cost remains approximately constant at different ζ values.
Increasing the spatial cutoff of the atomic neighbourhood environment results in more accurate models up to 5 Å, as further neighbours may influence the local energy function.However, at higher cutoff values the quality of the model deteriorates, which may be regarded as a sign of underfitting, when the available data is not sufficient to determine the dependence of the local energy terms on further neighbours.
The cutoff radius of SOAP or other descriptors may also be chosen by considering the force constant matrix of the atomic system, using a criterion on the spatial decay of the elements [12].
The smoothness parameter σ has a strong influence on the accuracy of the model, which is related to how the neighbouring atoms are represented.Narrow Gaussians lead to fewer similar kernel values between two atomic environments, resulting in overfitting, whereas with wide Gaussians the resolution of the representation is lower.
The kernel regularisation hyperparameters provide control on the accuracy and smoothness of the resulting model.Lower values bias the potential to fit the training data more accurately, but may result in overfitting.For a given spatial cutoff in the descriptors, the kernel regularisation on the forces may be derived from the de- cay of the force constant matrix or by quantifying the force uncertainty from ab initio calculations [12].Finding appropriate figures for the kernel regularisation of energy and virial stress values may require cross-validation, but a typical target energy error in condensed systems is 1 meV/atom, therefore this is often a suitable starting figure.The choices of kernel regularisation hyperparameters is discussed extensively in Ref. [12].
To illustrate the effect of varying the kernel regularisation hyperparameters, we fitted a series of silicon GAP models using the same parameters, listed in Table I, and same training database as previously, but varied the hyperparameters corresponding to energy (σ energy ), force (σ force ) and virial stress (σ virial ) independently.We have utilised the test set to predict energy, force and virial stress values using the resulting models, and evaluated RMSE figures for each model and quantity.
These results are shown in Fig. 5, highlighting how different choices of regularisation hyperparameters affect the quality of the fit.For the RMSE of the predicted energies, there is an optimal value of σ energy , while the RMSE of the predicted forces and virial stresses decrease monotonically when increasing σ energy .Similar opposing changes in the errors of predicted quantities may be observed when σ force and σ virial are varied.We attribute these trends to the epistemic uncertainty of our models which is due to our assumptions, such as the locality of the descriptor or the body-order representation.These assumptions limit the simultaneous accuracy the model may achieve in energies, forces and virial stresses, and biasing the fit towards reproducing a particular quantity causes a deterioration in others.
These results are intended to provide a guide to fitting GAP models and their adaptation is almost certainly necessary when fitting potentials representing different atomic systems.While an exhaustive hyperparameter search is not always feasible, the recently parallelised gap fit allows rapid creation and subsequent evaluation of models.While the local energies predicted by GAPs in cohesive energy models are not physical observables, there are different local atomic properties with physical significance amenable to direct learning within the GAP framework.SOAP descriptors are particularly suited for this task.Examples of such models that we have trained in the past include adsorption energies [51], effective Hirshfeld volumes [52] and core-electron binding energies (CEBEs) [53].Here, we revisit the CEBE database of Ref. [53] and use it as a test bed for the performance of SOAP-based local property models as a function of different convergence parameters.
First, let us briefly provide the context for the usefulness of CEBEs in materials science.X-ray photoelectron spectroscopy (XPS) uses monochromatic (fixed energy) X-ray light to excite the deep-lying core electrons in materials and molecules.In oxygen-and carbon-containing compounds these are the 1s states.When an X-ray photon with sufficient energy is absorbed by one such core electron the latter becomes photoejected, in such a way that its kinetic energy can be measured by a detector.Since the energy of the incident photons is fixed, the difference between the measured kinetic energy and the incident energy equals the CEBE.This CEBE is characteris-tic of the chemical environment around the atom whose core electron was excited, and so an XPS experiment provides a spectrum whose characteristic peaks give insight into the atomic structure of the material or molecule being probed.Because the core electron is strongly localized around the nucleus and only feels the influence of the immediate surrounding medium, XPS is particularly well suited to learning with local atomic descriptors such as SOAP, as we showed in Ref. [53].One of the results presented in that paper is a database of GW -level CEBEs for a subset of the CHO-containing molecules in the QM9 dataset [54], a large dataset of small stable organic molecules, containing up to 9 non-hydrogen atoms.
Ref. [53] presented learning curves for C1s and O1s CEBE models trained from the QM9-GW data using soap turbo descriptors without sparsification.Here we take a more detailed look at the effect of different technical parameters on the quality of the fit: cutoff radius for SOAP neighbours, sparsification scheme, soap vs soap turbo and the effect of different compression recipes on the results.We start out by splitting the QM9-GW CEBE database of Ref. [53] into a training set (80 % of the structures) and a test set (20 % of the structures).
The gap fit local property feature relies on the user providing a per-atom local property array.In this case, an ASE-format XYZ file is provided with a list  The fifth column contains the 1s CEBE for the C and O atoms, given in eV in this example.Since H atoms do not have a core, CEBEs for these atoms are not available, and we pad the array with zeros.A mask is provided to let gap fit know these are to be ignored during the fit.
In addition to the database of atomic structures and observables (CEBEs here), one needs to provide the name of the local property as specified in the database (local property parameter name="GW CEBE" here), the default regularization parameter (default local property sigma=0.01here, in the same units as the local property), and possibly offsets for the properties to be learned (local property0={C:290.816456:O:537.946208:H:0}here).In our case, the 0 offset is computed as the average CEBE of C1s and O1s core electrons separately.It is important to provide these offsets so that the ML model only needs to fit the (smooth) differences in the local property values, and not the absolute numbers, which are significantly harder to learn.Otherwise, the specification of the atomic descriptors is done in the same way as for a regular GAP model.
All the results for this example are summarized in Fig. 6.In panel (a) we show a comparison of models for C1s CEBEs fitted using a soap turbo descriptor with varying cutoff radii and varying number of sparse points; the cutoff radius is the single most important hyperparameter in SOAP-based models.Clearly, the accuracy of the models can be systematically increased by increasing the cutoff.However, the number of sparse points limits the expressivity of the model, and models with less sparse points will not benefit from further increasing the cutoff beyond a certain point.E.g., with 50 sparse points a 3 Å model performs equal to a 7 Å model.We observe that the statistical variation in the model performance increases with the cutoff, due to the corresponding increase in the size of configuration space covered by the descriptor (the error bars are given as the standard deviation computed over 10 different models obtained from 10 different randomly chosen sparse sets).
In Fig. 6(b) we show a comparison of C1s and O1s models fitted using soap and soap turbo descriptors of the same dimensionality (CHO-sensitive descriptors with 8 radial basis functions per element and up to 8th degree spherical harmonics, the same basis set as used for all the calculations in Fig. 6).soap turbo models perform slightly better than soap models, except for the C1s models with maximum number (1000) of sparse points, where they perform equally.If Fig. 6(c) we assess the effect of using random sparse point selection vs using CUR decomposition to select the sparse set descriptors, as well as the possible effect of us-ing descriptor compression (soap turbo's "trivial" compression recipe) on the results.For this numerical test, the performance of all models is essentially the same for most practical purposes.
Finally, in Fig. 6(d) we perform a thorough numerical test of different compression recipes on the accuracy of the QM9-GW models.The i j labeling convention refers to the ν R ≡ i and ν S ≡ j soap sensitivity parameters discussed above, and their quasiequivalent recipes for soap turbo.Additionally, soap turbo can use the "trivial" compression scheme as detailed above.In addition to the root-mean-squared error (RMSE), the graph shows the compression ratio computed as the number of dimensions of the compressed descriptor divided by those of the full descriptor.Unsurprisingly, the errors increase with the compression ratio, with most compression recipes providing better performance for soap turbo, except for 2 1, 1 2 and 1 1, where soap performs slightly better.That is, the advantage of using soap turbo increases with the degree of compression, whereas soap performs equally or slightly better than soap turbo at low compression ratios.The "trivial" compression recipe is only available for soap turbo and provides arguably the best compromise between accuracy and compression ratio (a factor of 4.3 vs uncompressed SOAP) among the tested schemes, at least for this particular test with the QM9-GW database.
Although not shown here, the relative advantage of soap turbo vs soap increases when looking at the O1s models, likely because the QM9-GW training database we used contains significantly more C1s data (11.5kentries) than O1s data (1.5k entries).This indicates better generalization and data efficiency for soap turbo, although it is important to note that the performance of each descriptor is dataset specific.

C. High-Entropy Alloy
The Mo-Nb-Ta-V-W quinary high-entropy alloy studied by Byggmästar et al. [38] is a complex, multicomponent system.To illustrate the generation of a GAP model, we provide the parameters, complete with explanations and comments, that were used to test and benchmark the MPI-ScaLAPACK implementation of the gap fit program [37].Here we highlight a recent feature which conveniently allows the fitting parameters to be stored in a file, rather than provided as command line arguments.
The fitting parameters are entered in the file config, provided in the supplementary information [55], in a key=value format, with commentary on each provided in Table II.The fitting database is openly available as the supplementary material of the article by Byggmästar et al. [38], which may be downloaded from the Fairdata repository [56].
The fitting procedure can then be carried out by running the command gap fit config file=config with Many sparse points are needed due to the high dimensionality of the descriptor.
Sparse points are chosen using the CUR method.
the database file db HEA reduced.xyzand the configuration file config in the same directory.For the parallel implementation, the command line mpirun -np 2 gap fit config file=config executes the process on two computational cores.Hybrid OpenMP-MPI execution is possible, for which the number of threads may be adjusted by setting the environment variable as export OMP NUM THREADS=4.For the specific queuing system available to the user, the documentation should be consulted.
The resulting GAP model is stored in the gp HEA.xml file and a set of text files according to the naming pattern gp HEA.xml.sparseX.GAP *.The interatomic potential may be accessed as a Calculator in ASE as from quippy.potentialimport Potential p = Potential(param_filename="gp_HEA.xml")... a.calc = p where the Python variable a indicates an ASE Atoms object.
Massively parallel simulations with GAP models are possible with LAMMPS.To use the high-entropy alloy potential, the following lines should be added to the LAMMPS input file: pair_style quip pair_coeff * * gp_HEA.xml"" 42 41 73 23 74 where the LAMMPS atom types 1, 2, 3, 4, and 5 are mapped to Mo, Nb, Ta, V and W, respectively.

V. CONCLUSION AND OUTLOOK
We have reviewed the GAP framework from an implementation point of view, highlighting how a generic sparse GPR formalism is adapted for the prediction of interatomic potentials and related quantities.An overview of the software package, coding practices and recent developments was provided, together with usage examples and detailed explanations of adjustable parameters, with references to the theory.The QUIP and GAP suite remains under maintenance and in active development by the authors, and will serve as a test bed for exploring ideas in the field of MLIPs.With interfaces to Python and major simulation packages, GAP serves as a useful tool for computational modelling.
Future developments will include the inclusion of more descriptors, such as Atomic Cluster Expansion (ACE) [2], rigorous means for uncertainty quantification, utilising modern computing architectures such as GPUs, and the implementation of more robust and efficient solvers for the fitting procedure.With the availability and popularity of more modern programming languages such as Python or Julia, Fortran may appear as an outdated choice.However, its interoperability with MPI and related libraries such as ScaLAPACK has proved to be an advantage in utilising the more traditional, but still highly prevalent, high performance computing facilities consisting of networked servers.As supercomputers and programming skills change, the current framework might prove to be too restrictive, but the practical insight documented here and the software will remain valuable for future endeavours.mpi print all If true, each MPI processes will print its output.Otherwise, only the first process does (default).

Arguments of the GAP string
The following keywords are to be specified for each descriptor within the gap command line argument.index file Reads indices of sparse points from the file given by sparse file and selects those from the deduplicated data.
file Reads sparse points from the file given by sparse file.
random Selects n sparse random descriptors with the same probability.
uniform Computes a histogram of the data with n sparse bins and returns a data point from each bin.This option is only suitable for low-dimensional descriptors.
kmeans The k-means clustering algorithm is performed on all descriptors to generate n sparse clusters, of which the descriptors closest to the cluster means are selected as sparse points.
fuzzy A fuzzy version of k-means clustering [57] is used to generate n sparse clusters.
cluster A k-medoid clustering based on the full covariance matrix of descriptors is performed, resulting in n sparse clusters.The medoid points are selected as sparse points.
pivot The n sparse "pivot" indices the full covariance matrix are found, and used as the sparse points.
covariance Greedy data point selection based on the sparse covariance matrix, to minimise the GPR variance of all datapoints.
cur points A CUR decomposition, based on the datapoints, is carried out to find the most representative n sparse points.
cur covariance A CUR decomposition, based on the full covariance matrix, is carried out to find the most representative n sparse points.

Descriptors
The GAP module implements over 30 descriptors, most of which being experimental or unsupported.In the following, we include those that are commonly used by practitioners and supported by the GAP developers.l max l max (spherical harmonics basis band limit) for soap-type descriptors.
n max n max (number of radial basis functions) for soaptype descriptors.
atom gaussian width (atom sigma) Width of atomic Gaussian functions for soap-type descriptors.
central weight Weight of central atom in environment.
central reference all species Place a Gaussian reference for all atom species densities.By default (F) only consider when neighbour is the same species as centre.
average Whether to calculate averaged SOAP -one descriptor per atoms object.If false (default), atomic SOAP is returned.
diagonal radial Only return the n 1 = n 2 elements of the power spectrum.
normalise (normalize) Normalise descriptor, so magnitude is 1.In this case the kernel of two equivalent environments is 1.
basis error exponent 10 −basis error exponent is the max difference between the target and the expanded function.
n Z How many different types of central atoms to consider.
n species Number of species for the descriptor.
species Z Atomic number of species.
xml version Version of GAP the XML potential file was created.
Z Atomic number of central atom, 0 is the wild-card or Atomic numbers to be considered for central atom, must be a list.sym mix Specifies whether a single set of mixing weights is used or whether two sets are used.If mix=T, tensor-decomposition is enabled.If sym mix=F tensorsketching is used.
K Integer specifying how many mixed channels to create.For example, R mix=T Z mix=T K=5 will create 5 mixed channels whereas R mix=F n max=6 Z mix=T K=5 will result in K*n max = 30 channels.
coupling If coupling=T full tensor-product coupling is applied across the resulting channels after mixing, whereas if coupling=F element-wise coupling is applied instead.The only exception to this rule occurs for Z mix=T R mix=F (or similarly for Z mix=F R mix=T) with coupling=F.Here, element-wise coupling is applied across the mixed-element channels but tensor-product coupling is applied across the unmixed radial channels, resulting in p k nn ′ l (or similarly p αβ kl ).mix shift Integer specifying the shift to the default seed that is used for the random number generator used to generate the mixing weights.
nu S species sensitive correlation order.Allowed values are 0, 1 and 2 (default).n species Number of species for the descriptor.l max l max (spherical harmonics basis band limit) for soap-type descriptors.

Z map
nf Sets the rate of decay of the atomic density in the region between soft and hard cutoffs.
radial enhancement Integer index (0, 1 or 2) that simulates the effect of modulating the radial overlap integral with the radial distance raised to this number.
basis Options: poly3 or poly3gauss chooses a 3rd and higher degree polynomial radial basis set and augments it with a Gaussian at the origin, respectively.
compress file Optional user-provided file specifying the compression recipe.
compress mode Optionally provides a predefined compression recipe.
central index 1-based index of central atom species Z in the species array.
alpha max Radial basis resolution for each species.
atom sigma r Width of atomic Gaussian functions for soap-type descriptors in the radial direction.
atom sigma r scaling Scaling rate of radial sigma: scaled as a function of neighbour distance.
atom sigma t Width of atomic Gaussian functions for soap-type descriptors in the angular direction.
atom sigma t scaling Scaling rate of angular sigma: scaled as a function of neighbour distance.
amplitude scaling Scaling rate of amplitude: scaled as an inverse function of neighbour distance.
central weight Weight of central atom in environment.
species Z Atomic number of species, including the central atom.

FIG. 4 :
FIG. 4: Performance (blue squares) and computational cost (red circles) of different GAP models.The performance is quantified by the Root Mean Square Error (RMSE) of the predicted energy, evaluated on a test set.Comparisons with respect to changes in (a) the radial resolution n max of SOAP; (b) the angular resolution l max of SOAP; (c) the spatial cutoff r cutoff of SOAP; (d) the power of the polynomial kernel ζ; (e) the width σ of Gaussians representing the atoms in SOAP; (f) the number of sparse points M .

FIG. 5 :
FIG. 5: Performance of GAP models as the function of regularisation hyperparameters.The RMSE of the predicted energies (blue squares), forces (red triangles) and virial stresses (blue circles) are shown as the energy (panel a), force (panel b) and virial (panel c) regularisation hyperparameters are varied independently.

FIG. 6 :
FIG. 6: (a) RMSEs for a soap turbo C1s model as a function of SOAP cutoff and sparse set size with random selection.(b) Comparison between soap and soap turbo C1s and O1s models with random sparse set size (fixed cutoff of 5 Å).(c) Effect of random and CUR sparsification strategies, for a soap turbo-based C1s model (fixed cutoff of 5 Å); the effect of adding compression (soap turbo's "trivial" recipe) is also tested.(d) Effect of different compression recipes on soap and soap turbo C1s model performance (fixed cutoff of 5 Å); the right axis gives the compression ratio of the descriptor as the number of dimensions of the full descriptor (2700) divided by the number of dimensions of the compressed descriptor.In all cases (a-d), error bars are estimated from the standard deviation of the RMSEs calculated among 10 different models with random sparse set selection.
distance 2b arguments cutoff Cutoff for distance 2b-type descriptors.cutoff transition width Transition width of cutoff for distance 2b-type descriptors.
soap compression arguments Z mix Mix the element channels together if present.R mix Mix the radial channels together if present.

TABLE I :
Command line parameters of gap fit used to fit a GAP model for silicon.

TABLE II :
Contents of file config used to fit a GAP model for the Mo-Nb-Ta-V-W quinary high-entropy alloy.
If T, these are interpreted as per-atom errors, and the variance will be scaled according to the number of atoms in the configuration.If F, they are treated as absolute errors and no scaling is performed.NOTE: values specified on a per-configuration basis (see kernel regularisation parameter name) are always absolute, not per-atom.Random seed.openmp chunk size Chunk size in OpenMP scheduling.do ip timing To enable or not the timing of the interatomic potential.
energy scale (delta) Set the typical scale of the function being fitted, or the specific energy term if using multiple descriptors.It is equivalent to the standard deviation of the Gaussian Process in the probabilistic view, and typically this would be set to the standard deviation (i.e.root mean square) of the function that is approximated with the Gaussian Process.
Options for sparse point selection none No sparsification, selects all datapoints.
Z1 Atom type #1 in bond.Any atom type if missing.Z2 Atom type #2 in bond.Any atom type if missing.resid name Name of an integer property in the atoms object giving the residue identifier of the molecule to which the atom belongs.only intra Only calculate bonds, i.e. intramolecular pairs with equal residue identifiers only inter Only apply to non-bonded atom pairs, i.e. inter molecular pairs with different residue identifiers.n exponents Number of exponents.
soap arguments cutoff Cutoff distance.cutofftransition width Transition width of cutoff function.cutoffdexp Cutoff decay exponent.cutoffscale Cutoff decay scale.cutoffrate Inverse cutoff decay rate.
23mmas separate groups within a density.A colon separates the two densities if present.Otherwise the groups are taken to be equal.Z map= {1, 3, 222324 } has a separate channel for H and Li but treats Ti, V and Cr as identical Z map = {1, 3, 22, 23, 24 : 1, 3, 22 23 24} has a separate channel for each element in the first density.In the second density there is a separate channel for H and Li but Ti, V and Cr are identical soap turbo arguments rcut hard Hard cutoff distance.rcut soft Soft cutoff distance.