Accurate and explainable artificial-intelligence (AI) models are promising tools for accelerating the discovery of new materials. Recently, symbolic regression has become an increasingly popular tool for explainable AI because it yields models that are relatively simple analytical descriptions of target properties. Due to its deterministic nature, the sure-independence screening and sparsifying operator (SISSO) method is a particularly promising approach for this application. Here, we describe the new advancements of the SISSO algorithm, as implemented in SISSO++, a C++ code with Python bindings. We introduce a new representation of the mathematical expressions found by SISSO. This is a first step toward introducing “grammar” rules into the feature creation step. Importantly, by introducing a controlled nonlinear optimization to the feature creation step, we expand the range of possible descriptors found by the methodology. Finally, we introduce refinements to the solver algorithms for both regression and classification, which drastically increase the reliability and efficiency of SISSO. For all these improvements to the basic SISSO algorithm, we not only illustrate their potential impact but also fully detail how they operate both mathematically and computationally.

## I. INTRODUCTION

Data-centric and artificial-intelligence (AI) approaches are becoming vital tools for describing physical and chemical properties and processes. The key advantage of AI is its ability to find correlations between different sets of properties without the need to know which ones are important before the analysis. Because of this, AI has become increasingly popular for materials discovery applications with uses in areas such as thermal transport properties,^{1,2} catalysis,^{3} and quantum materials.^{4} Despite the success of these methodologies, creating explainable and physically relevant AI models remains an open challenge in the field.^{5–7}

One prevalent set of methods for explainable AI is symbolic regression.^{8–11} Symbolic regression algorithms identify optimal nonlinear, analytic expressions for a given target property from a set of input features, i.e., the *primary features*, that are related to the target.^{12} Originally, (stochastic) genetic-programming-based approaches were and still are used to find these expressions,^{12–15} but recently, a more diverse set of solvers have been developed.^{16–21} The sure-independence screening and sparsifying operator (SISSO) approach combines symbolic regression with compressed sensing^{22–25} to provide a deterministic way of finding these analytic expressions. This approach has been used to describe numerous properties, including phase stability,^{23,26,27} catalysis,^{28} and glass transition temperatures.^{29} It has also been used in a multi-task^{22} and hierarchical fashion.^{25}

In this paper, we introduce the new concepts implemented in the recently released SISSO++ code^{24} and detail their implementation. SISSO++ is a new, modular implementation of SISSO that provides a more user-friendly interface to run SISSO with several methodological updates. Before detailing the updated methodology, we summarize the main aspects of the SISSO algorithm and define some of the terminology we use below. The SISSO approach starts with a collection of *primary features* and mathematical unary and binary operators (e.g., addition, multiplication, *n*^{th} root, and logarithms). The first step is the *feature-creation* step, where a pool of *generated features* is built by exhaustively applying the set of mathematical operators onto the primary features. This procedure is iterated by applying the set of operators onto the previously generated features. The number of iterations in this feature-creation step is called the *rung*. The generated features, projected onto the training dataset, build the so-called sensing matrix **D**, with one column for each generated feature and one row for each data point.

The subsequent step is the *descriptor identification* step, i.e., compressed sensing is used to identify the best *D*-dimensional linear model by performing an *ℓ*_{0}-regularized optimization on a subspace $S$ of all generated features. $S$ is selected using sure-independence screening^{30} (SIS), with a suitable projection score, depending on the loss function used to evaluate the models. For a regression problem (see further below for the discussion of classification problems), i.e., the prediction of a continuous property **P** (an array with as many components as the number of training points), one solves argmin_{c}‖**Dc** − **P**‖_{2} + *λ*‖**c**‖_{0}, where ‖**c**‖_{0} is the number of nonzero components in **c**, which we call the dimensionality *D* of the model **Dc**. The hyperparameter *λ* is fixed by cross-validation, i.e., by minimizing the validation error over the values of *D*. Within the SISSO algorithm, the SIS step works iteratively by ranking all generated features according to their Pearson correlation values to the target property and adding only the most correlated features to $S$. At the next iteration, SIS ranks and adds to $S$ the features that Pearson correlate the most with the *residual* of the previous step, **Δ**_{D−1}, i.e., the difference between **P**, and the estimates predicted by the (*D* − 1)-dimensional model: **Δ**_{D−1} = **Dc**_{D−1} − **P**. The SIS step is followed by the sparsifying operator (SO) step, where linear-regression models are trained on all subsets of features in $S$, i.e., on all single features, then all pairs, then all triplets, etc. This yields a set of models of dimensions *D* = 1, 2, 3, etc. For each dimension, models are ranked by training error and the best model is selected. The model with the lowest validation error across dimensions is the final SISSO model. In practice, models at increasing dimension are trained until the validation error starts increasing or reaches a plateau. 
In summary, the result of the SISSO analysis is a *D*-dimensional descriptor, which is a vector with components from $S$. For a regression problem, the SISSO model is the scalar product of the identified descriptor with the vector of linear coefficients resulting from the *ℓ*_{0}-regularized linear regression.
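The interplay of the SIS and SO steps described above can be sketched in a few lines. The following is a minimal, illustrative Python implementation for regression (not the SISSO++ API); `n_sis` plays the role of the SIS subspace size added per iteration, and the SO step is the brute-force ℓ0 search over all subsets of the selected subspace:

```python
import numpy as np
from itertools import combinations

def sisso_regression(D, P, n_sis, max_dim):
    """Toy SISSO solver: alternate SIS (rank features by squared Pearson
    correlation with the current residual) and SO (exhaustive least squares
    over all dim-sized subsets of the selected subspace S)."""
    S, residual, models = [], P.copy(), {}
    for dim in range(1, max_dim + 1):
        # SIS: squared Pearson correlation of each feature with the residual
        Dc = D - D.mean(axis=0)
        rc = residual - residual.mean()
        score = (Dc.T @ rc) ** 2 / ((Dc ** 2).sum(axis=0) * (rc ** 2).sum())
        S += [j for j in np.argsort(-score) if j not in S][:n_sis]
        # SO: l0-regularized search = brute force over all dim-sized subsets
        best = None
        for subset in combinations(S, dim):
            X = np.column_stack([D[:, list(subset)], np.ones(len(P))])
            c, *_ = np.linalg.lstsq(X, P, rcond=None)
            rmse = np.linalg.norm(X @ c - P)
            if best is None or rmse < best[0]:
                best = (rmse, subset, X @ c - P)
        models[dim] = (best[0], best[1])
        residual = best[2]  # Δ_D = Dc_D − P, used by the next SIS pass
    return models
```

The exhaustive subset search makes the exponential cost of the ℓ0 problem explicit; it is only tractable because SIS keeps the subspace small.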

For a classification problem, the SISSO model is given as a set of hyperplanes that divide the data points into classes, which are described by the scalar product of the identified descriptor with a set of coefficients found by linear support vector machines (SVMs). The loss function for classification is based on a convex-hull algorithm, where the convex hulls for each set of points with the same property label, i.e., the same class, are evaluated, and the algorithm tries to minimize the number of data points inside more than one convex hull, i.e., the number of points inside the overlap region. The SIS step ranks features by how well an individual feature separates the remaining unclassified data points by first minimizing the number of misclassified points and then minimizing the hypervolume of the overlap region where they are located. When a feature leaves no misclassified points, the tiebreaker maximizes the minimum separation distance between points in different classes. The residual here is the set of all misclassified points from the (*D* − 1)-dimensional descriptor. The SO uses a convex-hull-based loss function, i.e., it minimizes the number of points inside the overlap region of a set of *D*-dimensional convex hulls.

This paper is organized as follows. We separately describe the recent updates to the SISSO methodology for the *feature creation* (Sec. II) and *descriptor identification* steps (Sec. III). The most important advancement of the code is expressing the features as binary expression trees (Subsection II A), instead of strings, allowing us to recursively define all aspects of the generated features from the primary features. With this implementation choice, we are able to keep track of the units for each generated feature, as well as an initial representation of its domain and range. This allows for the creation of grammatically correct expressions, in terms of consistency of the physical units, and the control of numerical issues generated by features going out of their physically meaningful range. In terms of the feature-creation step, we also discuss the implementation of *parametric SISSO* (Subsection II B), which introduces the flexibility of nonlinear parameters together with the operators that are optimized against a loss function based on the compressed-sensing-based feature-selection metrics. This procedure was used to describe the thermal conductivity of a material in a recent publication.^{31} For the descriptor-identification step, we cover two components: an improved classification algorithm and the multi-residual approach. For classification problems, we generalized the algorithm to work for any problem to an arbitrary dimension and explicitly include a model identification via linear SVM (Subsection III A). The multi-residual approach (Subsection III B), which was previously used in Ref. 25, introduces further flexibility for the identification of models with more than one dimension. Here, we provide an in-depth discussion of its machinery.

## II. FEATURE CREATION

### A. Binary-expression-tree representation of features

The biggest advancement to the implementation of SISSO in SISSO++, compared to the original implementation in Ref. 23, is its modified representation of the features as binary expression trees, instead of strings. While the choice of feature representation does not affect the methodology and could be used in any implementation, it does impact the performance and readability of the code base. This representation is illustrated in Fig. 1 and easily allows for all aspects of the generated features to be recursively calculated on the fly from the data stored in the primary features. For certain applications, it is also possible to store the data of higher-rung features to reduce the overall computational cost of the calculations. The individual features are addressed by the root node of the binary expression tree and stored in the code as a shared_ptr from the C++ standard library. This representation reduces the overall memory footprint of each calculation as the individual features only need to be created once and only copies of shared pointers need to be stored for each new expression. The remainder of this section will be used to describe the various aspects of the new representation, including a description of the units and range of the features, as well as how it is used to generate the feature space.
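As an illustration of this design (a Python sketch for exposition, not the actual C++ classes of SISSO++), a generated feature is addressed by the root node of its tree, and both its values and its printable expression are computed recursively from shared primary-feature leaves:

```python
import numpy as np

class PrimaryFeature:
    """Leaf of the binary expression tree: a name plus its training data."""
    def __init__(self, name, data):
        self.name, self.data = name, np.asarray(data, dtype=float)
    def value(self):
        return self.data
    def expr(self):
        return self.name

class OpNode:
    """Internal node; children are shared references (cf. shared_ptr), so a
    sub-expression is stored once no matter how many trees reuse it."""
    OPS = {"add": (np.add, "({} + {})"),
           "mul": (np.multiply, "({} * {})"),
           "sq":  (np.square, "({})^2")}
    def __init__(self, op, *children):
        self.op, self.children = op, children
    def value(self):  # evaluated on the fly, recursively, from the leaves
        return self.OPS[self.op][0](*[c.value() for c in self.children])
    def expr(self):
        return self.OPS[self.op][1].format(*[c.expr() for c in self.children])
```

For example, `OpNode("sq", OpNode("mul", t, w))` represents $(t\omega)^2$; evaluating it touches the stored data only at the two leaves, mirroring how SISSO++ computes higher-rung features from primary-feature data.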

#### 1. Units

An important upgrade in SISSO++ is its generalized and exact treatment of units for the expressions. In physics, dimensional analysis is an important tool when generating physically meaningful expressions, and it is necessary to include it when using symbolic regression for scientific problems. We introduced this into SISSO++ by determining the units for each new expression from the primary features and explicitly checking that each new expression is physically possible. Within the code, the Units are implemented as dictionaries with the key representing the base unit and the value representing the exponent for each key, e.g., m/s^{2} would be stored as {m: 1, s: −2}. Functions exist to transform the Units to and from strings to more easily represent the information. We then implemented multiplication, division, and power operators for these specialized dictionaries, allowing the units of the generated features to be derived recursively following the rules in Table I. An important caveat is that the current implementation cannot convert between two units for the same physical quantity, e.g., between nanometers, picometers, and Bohr radii for length.
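A minimal sketch of such unit dictionaries (illustrative Python, not the actual SISSO++ `Units` class) shows how the recursive rules and the dimensional-analysis checks below operate:

```python
from collections import Counter

def unit_mul(a, b):
    """Multiply two unit dicts, e.g. {m: 1, s: -1} * {s: 1} -> {m: 1}."""
    out = Counter(a)
    out.update(b)                       # exponents add under multiplication
    return {k: v for k, v in out.items() if v != 0}

def unit_div(a, b):
    """Divide unit dicts: exponents of the denominator are negated."""
    return unit_mul(a, {k: -v for k, v in b.items()})

def unit_pow(a, p):
    """Raise a unit dict to a power: exponents are scaled by p."""
    return {k: v * p for k, v in a.items()}

def check_add(a, b):
    """Addition/subtraction is only allowed for identical units."""
    return a == b

def check_transcendental(a):
    """sin, cos, exp, and log require a unitless argument."""
    return a == {}
```

For example, dividing a velocity `{m: 1, s: -1}` by a time `{s: 1}` yields an acceleration `{m: 1, s: -2}`, while `check_add` rejects adding a length to a time.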

| Operation | Resulting unit |
|---|---|
| $A + B$ | unit(A) |
| $A - B$ | unit(A) |
| $A * B$ | unit(A) * unit(B) |
| $A / B$ | unit(A)/unit(B) |
| $\vert A - B \vert$ | unit(A) |
| $\vert A \vert$ | unit(A) |
| $\sin(A)$ | Unitless |
| $\cos(A)$ | Unitless |
| $\exp(A)$ | Unitless |
| $\exp(-A)$ | Unitless |
| $\log(A)$ | Unitless |
| $A^{-1}$ | unit(A)$^{-1}$ |
| $A^{2}$ | unit(A)$^{2}$ |
| $A^{3}$ | unit(A)$^{3}$ |
| $A^{6}$ | unit(A)$^{6}$ |
| $\sqrt{A}$ | unit(A)$^{1/2}$ |
| $\sqrt[3]{A}$ | unit(A)$^{1/3}$ |


Using this implementation of units, a minimal treatment of dimensional analysis can be performed in the code. The dimensional analysis focuses on whether the units are consistent within each expression and for the final model. This check is also used to determine the units of the fitted constants in the final linear models, which can take arbitrary units and can therefore perform any unit conversion natively. The restrictions used to reject expressions by dimensional analysis are outlined in Table II and can be summarized as follows: addition and subtraction are only allowed to act on features with the same units, and all transcendental operations must act on a unitless quantity. With these two restrictions in place, only physically relevant features can be found, and the choice of units should no longer affect which features are selected. If one wants to revert to the previous behavior without these restrictions, this can be achieved by providing all primary features without associated units.

| Operation | Unit restriction |
|---|---|
| $A + B$ | unit(A) == unit(B) |
| $A - B$ | unit(A) == unit(B) |
| $\vert A - B \vert$ | unit(A) == unit(B) |
| $\sin(A)$ | unit(A) == ∅ |
| $\cos(A)$ | unit(A) == ∅ |
| $\exp(A)$ | unit(A) == ∅ |
| $\exp(-A)$ | unit(A) == ∅ |
| $\log(A)$ | unit(A) == ∅ |


#### 2. Range

Another important advancement of the feature-creation step is the introduction of ranges for the primary features, which act as a domain for future operations during feature creation. One of the challenges associated with symbolic regression, especially with smaller datasets, is that the selected expressions can sometimes contain discontinuities that are outside of the training data, but still within the relevant input space for a given problem. For example, this can lead to an expression taking the logarithm of a negative number, resulting in an undefined prediction. SISSO++ solves this problem by including an option for describing the range of a primary feature in standard mathematical notation, e.g., $[0, \infty)$, and then using it to calculate the range of every generated formula containing that primary feature, following the rules specified in Table III. In the code, the ranges are referenced as the Domain because the range of a feature of rung *n* − 1 is the domain of a possible expression of rung *n* that uses that feature. While all ranges in Table III assume inclusive endpoints, the implementation can handle both exclusive endpoints and a list of values explicitly excluded from the range, e.g., point discontinuities inside the primary features themselves.

| Operation | Resulting range |
|---|---|
| $A + B$ | $[\min(A) + \min(B),\ \max(A) + \max(B)]$ |
| $A - B$ | $[\min(A) - \max(B),\ \max(A) - \min(B)]$ |
| $A * B$ | $[\min(S),\ \max(S)]$, with $S = \{\min(A)\min(B),\ \min(A)\max(B),\ \max(A)\min(B),\ \max(A)\max(B)\}$ |
| $A / B$ | $\mathrm{Range}(A) * \mathrm{Range}(B^{-1})$ |
| $\sin(A)$ | $[-1, 1]$ |
| $\cos(A)$ | $[-1, 1]$ |
| $\exp(A)$ | $[\exp(\min(A)),\ \exp(\max(A))]$ |
| $\exp(-A)$ | $[\exp(-\max(A)),\ \exp(-\min(A))]$ |
| $\log(A)$ | $[\log(\min(A)),\ \log(\max(A))]$ |
| $A^{-1}$ | $(-\infty, 0) \cup (0, \infty)$ if $0 \in \mathrm{Range}(A)$, $\min(A) \neq 0$, and $\max(A) \neq 0$; $(0, \infty)$ if $0 \in \mathrm{Range}(A)$ and $\min(A) = 0$; $(-\infty, 0)$ if $0 \in \mathrm{Range}(A)$ and $\max(A) = 0$; otherwise $[\max(A)^{-1},\ \min(A)^{-1}]$ |
| $A^{3}$ | $[\min(A)^{3},\ \max(A)^{3}]$ |
| $\sqrt{A}$ | $[\sqrt{\min(A)},\ \sqrt{\max(A)}]$ |
| $\sqrt[3]{A}$ | $[\sqrt[3]{\min(A)},\ \sqrt[3]{\max(A)}]$ |
| $\vert A \vert$ | $[\max(0, \min(A)),\ \max(\max(A), -\min(A))]$ |
| $A^{2}$ | $[\max(0, \min(A))^{2},\ \max(\max(A), -\min(A))^{2}]$ |
| $A^{6}$ | $[\max(0, \min(A))^{6},\ \max(\max(A), -\min(A))^{6}]$ |
| $\vert A - B \vert$ | $[\max(0, \min(A) - \max(B)),\ \max(\max(A) - \min(B),\ \max(B) - \min(A))]$ |


Table IV lists the cases where the range of a feature is used to prevent a new expression from being generated. In all cases, this prevents operations that would not be mathematically defined, such as taking the square root of a negative number. In cases where the range of values for a primary feature is not defined, these checks are not performed, and all operations are assumed to be safe.
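The range propagation of Table III is ordinary interval arithmetic. A minimal Python sketch for a few representative operators (illustrative only, assuming closed, finite endpoints stored as `(lo, hi)` tuples) is:

```python
import math

def range_add(a, b):
    """[min(A) + min(B), max(A) + max(B)]"""
    return (a[0] + b[0], a[1] + b[1])

def range_mul(a, b):
    """Extrema over the four endpoint products."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def range_exp(a):
    """exp is monotonically increasing, so endpoints map directly."""
    return (math.exp(a[0]), math.exp(a[1]))

def range_sq(a):
    """[max(0, min(A))^2, max(max(A), -min(A))^2], as in Table III."""
    lo, hi = max(0.0, a[0]), max(a[1], -a[0])
    return (lo ** 2, hi ** 2)

def sqrt_is_valid(a):
    """Reject the square root when the operand's range extends below zero."""
    return a[0] >= 0
```

A validity check such as `sqrt_is_valid` is what rejects an expression outright, while the `range_*` rules propagate the interval of a rung-(*n* − 1) feature into the domain check of a rung-*n* candidate.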

### B. Parametric SISSO

The second extension of the feature-creation step is the parameterization scheme introduced in Ref. 31. For a general operator, $\hat{h}(\mathbf{x}) \in \hat{H}$, with a set of scale and bias parameters, $\hat{P}$, the parameterization scheme updates the operator to be

$$\hat{h}(\mathbf{x}) \rightarrow \hat{h}(\alpha \mathbf{x} + \beta),$$

where *α* is the scale parameter, *β* is the bias term, and **x** is a vector containing all input data. For binary operators, the scale and bias parameters for both input features can be set, leading to

$$\hat{h}(\mathbf{x}_0, \mathbf{x}_1) \rightarrow \hat{h}(\alpha_0 \mathbf{x}_0 + \beta_0,\ \alpha_1 \mathbf{x}_1 + \beta_1).$$

Not all of these parameters are independent. By factoring out $\alpha_0$ and $\alpha_1$ from the expression and defining $\beta_0' = \beta_0 / \alpha_0$ and $\beta_1' = \beta_1 / \alpha_1$, we can rewrite this expression to be

$$\hat{h}\left(\alpha_0 \left(\mathbf{x}_0 + \beta_0'\right),\ \alpha_1 \left(\mathbf{x}_1 + \beta_1'\right)\right),$$

with the $\alpha_0$ and $\alpha_1$ terms getting set by either the linear model or an operator further up the tree; the redundant parameters are therefore fixed, as listed in Table V. For the log operator, *α* is always set to ±1 to also avoid these linear dependencies, as

$$\log\left(\pm \alpha_{\mathrm{unit}} A + \beta\right) = \log\left(\alpha_{\mathrm{unit}}\right) + \log\left(\pm A + \beta / \alpha_{\mathrm{unit}}\right),$$

where $\alpha_{\mathrm{unit}}$ is the unit conversion factor; the constant $\log(\alpha_{\mathrm{unit}})$ is absorbed by the bias term of the linear model, so that only the sign of *α* remains free. Table V defines all the free parameters for each operation in parametric SISSO.

| Operation | Parameterized expression | Fixed parameters |
|---|---|---|
| $A + B$ | $A + \alpha_1 B$ | β₀ = 0; β₁ = 0; α₀ = 1 |
| $A - B$ | $A - \alpha_1 B$ | β₀ = 0; β₁ = 0; α₀ = 1 |
| $A * B$ | $(A + \beta_0) * (B + \beta_1)$ | α₀ = 1; α₁ = 1 |
| $A / B$ | $(A + \beta_0)/(B + \beta_1)$ | α₀ = 1; α₁ = 1 |
| $\vert A - B \vert$ | $\vert A - \alpha_1 B + \beta_1 \vert$ | β₀ = 0; α₀ = 1 |
| $\vert A \vert$ | $\vert A + \beta \vert$ | α = 1 |
| $\sin(A)$ | $\sin(\alpha A + \beta)$ | |
| $\cos(A)$ | $\cos(\alpha A + \beta)$ | |
| $\exp(A)$ | $\exp(\alpha A)$ | β = 0; α > 0 |
| $\exp(-A)$ | $\exp(-\alpha A)$ | β = 0; α > 0 |
| $\log(A)$ | $\log(\alpha A + \beta)$ | α = ±1 |
| $A^{-1}$ | $(A + \beta)^{-1}$ | α = 1 |
| $A^{2}$ | $(A + \beta)^{2}$ | α = 1 |
| $A^{3}$ | $(A + \beta)^{3}$ | α = 1 |
| $A^{6}$ | $(A + \beta)^{6}$ | α = 1 |
| $\sqrt{A}$ | $\sqrt{\alpha A + \beta}$ | α = ±1 |
| $\sqrt[3]{A}$ | $\sqrt[3]{A + \beta}$ | α = 1 |


This parameterization scheme is created at the root node and is recursively defined for all other operators in the binary expression tree at any level below the root. As a result, we introduce a new hyperparameter, the parameterization level, *P*_{l}, to specify the maximum number of levels that can be included within the parameters. For example, if *P*_{l} = 1, then only the root node and its associated parameters will be optimized, but if *P*_{l} = 2, then the parameters associated with the root node and its children will be optimized. This is best illustrated in Fig. 2, where a parameterized-sine operator (blue node) is acting on a parameterized square expression (left graph/green and orange nodes), $\left(x + \beta_0^{\mathrm{sq}}\right)^2$, and an unparameterized cube expression (right graph/red and orange nodes), $x^3$. In both cases, when *P*_{l} = 1, only the scale and shift parameters associated with the sine operator are optimized, with all other previously found or not included parameters fixed to their current values; the resulting expressions are shown in the lower row of Fig. 2. However, if *P*_{l} = 2, then the parameters for the square and cube operations (e.g., $\beta_0^{\mathrm{sq}}$) are added to those of the sine operator, and the entire expression is optimized. The resulting equations in this case are shown in the upper row of Fig. 2.

We use the Cauchy loss function^{32} as the objective for the optimization,

$$\mathcal{L} = \sum_{i=1}^{n_{\mathrm{samp}}} \frac{c^2}{2} \ln\left[1 + \left(\frac{P_i - \hat{P}_i}{c}\right)^2\right],$$

where **P** is the property vector, $\hat{P}_i$ is the model estimate for sample *i*, *c* is a scaling factor set to 0.5 for all calculations, and *n*_{samp} is the number of samples. We use the Cauchy loss function over the mean square error to make the nonlinear optimization more robust against outliers in the dataset. Because this loss function is not scale or bias invariant, additional external parameters *α*_{ext} and *β*_{ext} are introduced to account for these effects. For the case of multi-task SISSO,^{22} each task has its own external bias and scale parameters to account for the individual linear-regression solutions. As an example, for the features illustrated in Fig. 2 (*P*_{l} = 2), the functions that are optimized would be

$$\alpha_{\mathrm{ext}} \sin\left(\alpha \left(x + \beta_0^{\mathrm{sq}}\right)^2 + \beta\right) + \beta_{\mathrm{ext}} \quad \mathrm{and} \quad \alpha_{\mathrm{ext}} \sin\left(\alpha x^3 + \beta\right) + \beta_{\mathrm{ext}}.$$

The initial guesses set all internal *α* and *β* terms to 1.0 and 0.0, respectively, and *α*_{ext} and *β*_{ext} are set to the solution of the least-squares regression problem for each task. In some cases, *β* can be set to a nonzero value if leaving it at zero would include values outside the domain of the operator. In these cases, *β* is set to $-\min\left(\mathrm{sign}(\alpha)\,\mathbf{x}\right) + 10^{-10}$.

Each optimization follows a two- or three-step process. First, a local optimization is performed to find the local minimum associated with the initial parameters. Once at a local minimum, an optional global optimization is performed to find any minima better than the one initially found. For these first two steps, the parameters are optimized to a relative tolerance of 10^{−3} and 10^{−2}, respectively, with a maximum of five thousand function evaluations for each step. Finally, a more accurate local optimization is done to a relative tolerance of 10^{−6} to find the best parameter set; for this final optimization, ten thousand function evaluations are allowed. Additionally, for both the initial and global optimization steps, the parameters are bounded to the range between −100 and 100 to improve the efficiency of the optimization, but this restriction is removed for the final optimization. For all local optimizations, the subplex algorithm,^{33} a faster and more robust variant of the Nelder–Mead simplex method,^{34} is used. The improved stochastic ranking evolution strategy algorithm^{35} is used for all global optimizations. Once optimized, only the internal *α* and *β* parameters are stored in $\hat{P}$. This procedure is repeated for all operators specified to be parameterized by the user, so it can increase the cost of feature creation considerably.
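To make the nested structure of this objective concrete, the following Python sketch fits the internal bias *β* of a parameterized square operator, with the external scale and bias obtained by least squares at every trial point. It is illustrative only: SISSO++ uses the subplex and ISRES algorithms, whereas this sketch substitutes a simple coarse-then-fine grid scan so that the Cauchy-loss objective itself stays in focus:

```python
import numpy as np

def cauchy_loss(residual, c=0.5):
    """Cauchy loss: grows only logarithmically with the residual, so large
    outliers are down-weighted relative to the mean-square error."""
    return np.sum(c ** 2 / 2 * np.log1p((residual / c) ** 2))

def fit_shift(x, y, op, lo=-5.0, hi=5.0):
    """Optimize the internal bias beta of op(x + beta); alpha_ext and
    beta_ext come from a least-squares fit at every candidate beta."""
    def objective(b):
        feat = op(x + b)
        X = np.column_stack([feat, np.ones_like(x)])  # [alpha_ext, beta_ext]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return cauchy_loss(y - X @ coef)
    # coarse scan of the bounded range, then a fine scan around the best point
    coarse = min(np.linspace(lo, hi, 1001), key=objective)
    fine = np.linspace(coarse - 0.02, coarse + 0.02, 401)
    return min(fine, key=objective)
```

For synthetic data generated as $y = 3(x + 0.7)^2 - 1$, this recovers *β* ≈ 0.7, with the external least-squares fit absorbing the scale 3 and bias −1.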

Figure 3 illustrates the power of the new parameterization scheme. For both toy problems represented by analytic Lorentzian and sine functions with some white noise, the nonparametric version of SISSO cannot find accurate models for the equations as it cannot address the nonlinearities properly. By using this new parameterization scheme, SISSO is now able to accurately find the models, as shown in Figs. 3(c) and 3(d). However, it is important to note that the more powerful featurization comes at the cost of a significantly increased time to generate the feature space as the parameterization becomes the bottleneck for the calculations. Additionally, there can be cases where the parameterization scheme does not find an optimal solution because there is too much noise or the optimal parameters are too far away from the initial guesses, as shown in Figs. 3(e) and 3(f).

### C. Building the complete feature set

With all the new aspects of the feature representation in place, SISSO++ has a fully parallelized feature-set construction that uses a combination of threads and Message Passing Interface (MPI) ranks for efficiency. The basic process of creating new features is illustrated in Fig. 4, where each new rung builds on the binary expression trees of the previous rung by applying a new operation on top of them. In this step, the operators are separated into parameterized and nonparameterized versions to allow for the optional use of the parametric SISSO concepts. Throughout this process, checks are done to ensure that the units are correct and the domains for each new operation are respected. For example, this is why $\sin(t)$ and $\sin(\omega)$ are not included in the rung-1 features, as taking the sine of a quantity with units is physically meaningless. If the sine operation were replaced by an algebraic operation or included the parameterization scheme, then two additional rung-1 features would be created with the unary operations acting on each primary feature. Finally, the code checks for invalid values, e.g., NaN or inf, and performs some basic simplifications, e.g., features such as $\frac{t\omega}{\omega}$ are rejected.
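A stripped-down sketch of one rung of this construction (illustrative Python with units stored as exponent dictionaries, as in Subsection II A 1; not the parallel SISSO++ implementation) shows how the unit checks prune the candidate pool:

```python
import numpy as np

def next_rung(features):
    """One feature-creation iteration: apply a unary and a binary operator
    to the existing (name, data, unit) triples, enforcing the unit rules."""
    new = []
    for name, data, unit in features:
        if unit == {}:  # sin requires a unitless argument, cf. sin(t), sin(w)
            new.append((f"sin({name})", np.sin(data), {}))
        # squaring is always allowed; exponents double
        new.append((f"({name})^2", data ** 2, {k: 2 * v for k, v in unit.items()}))
    for n1, d1, u1 in features:
        for n2, d2, u2 in features:
            if n1 < n2 and u1 == u2:  # addition requires identical units
                new.append((f"({n1} + {n2})", d1 + d2, u1))
    return new

t = ("t", np.array([1.0, 2.0]), {"s": 1})    # a time
w = ("w", np.array([0.5, 0.2]), {"s": -1})   # a frequency
r = ("r", np.array([0.1, 0.3]), {})          # a unitless ratio
rung1 = next_rung([t, w, r])
```

Only `sin(r)` survives the transcendental check, and no sums are generated because no two primary features share units; a full implementation would also apply the range checks and simplification rules discussed above.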

## III. DESCRIPTOR IDENTIFICATION

### A. Linear programming implementation for classification problems

To determine whether a data point lies inside the convex hull of a class, SISSO++ solves the linear-programming feasibility problem

$$\min_{\boldsymbol{\alpha}} 0 \quad \mathrm{subject\ to} \quad \sum_{i \in I} \alpha_i x_i = x_j, \quad \sum_{i \in I} \alpha_i = 1, \quad \alpha_i \geq 0,$$

where *x*_{i} is the *i*^{th} point inside the set of all points of a class, *I*; *α*_{i} is the coefficient for *x*_{i}; and *x*_{j} is the point to check. The above problem is feasible if and only if *x*_{j} lies inside the convex hull of the set of points, *I*, representing a class in the problem. Here, we are optimizing a zero objective function because the actual solution to this optimization does not matter; rather, it is important whether such a solution can be found, i.e., whether the constraints can be fulfilled. The feasibility linear-programming problem is defined using the CLP library.^{36} The classifier then tries to minimize the number of points that are inside more than one of the convex hulls, with the region of feature space that contains the overlap between any of the convex hulls defined as the overlap region. Once the number of points in the overlap region is determined, a linear SVM model is calculated for the best candidates and used as the new tie-breaking procedure: the first tiebreaker is the number of points misclassified by the SVM model, and the second is the margin distance. The SVM model is calculated using libsvm.^{37}
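The convex-hull membership test can be sketched with any LP solver. The following illustrative Python version uses SciPy's `linprog` in place of the CLP library used by SISSO++:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(class_points, x_j):
    """Feasibility LP: is x_j a convex combination of the class points?
    Seek alpha >= 0 with sum_i alpha_i x_i = x_j and sum_i alpha_i = 1;
    the zero objective means only feasibility matters."""
    pts = np.asarray(class_points, dtype=float)
    n = len(pts)
    A_eq = np.vstack([pts.T, np.ones(n)])  # hull constraints + normalization
    b_eq = np.append(np.asarray(x_j, dtype=float), 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.status == 0  # status 0: a feasible solution was found
```

Counting, for every point, in how many class hulls it lies then gives the size of the overlap region that the classification loss minimizes.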

Figure 5 demonstrates the capabilities of the new classification algorithm on a toy problem where a spherical point cloud is separated into four classes by the planes *x*_{0} + *x*_{1} + *x*_{2} = 0 and *x*_{2} = 0. A three-dimensional model was chosen to demonstrate that the new method works beyond two dimensions while remaining easy to visualize, but in principle, descriptors of arbitrary dimension can be found. In this example, we randomly sample a Gaussian distribution with a standard deviation of 0.5 centered at the origin to generate one thousand samples for three features, *x*_{0}, *x*_{1}, and *x*_{2}, which are then copied and relabeled as *x*_{3}, *x*_{4}, and *x*_{5}. The samples are then separated into four classes based on whether $x_0 + x_1 + x_2$ is greater than (light blue and black) or less than (red and yellow) zero and whether *x*_{2} > 0 (red and light blue) or *x*_{2} < 0 (black and yellow). To better highlight the separation of the four classes, an artificial margin area is created for *x*_{0}, *x*_{1}, and *x*_{2} by replacing all points where $\vert x_0 + x_1 + x_2 \vert < 0.6$ or $\vert x_2 \vert < 0.45$ with another random point away from the margin, but still within the same class. To focus on the new solver, we do not perform the feature-creation step of SISSO for this problem; however, the rung-two feature $x_0 + x_1 + x_2$ together with *x*_{2} would be able to completely separate the classes in two dimensions. Using the updated algorithm, SISSO can now easily identify that the set $\{x_0, x_1, x_2\}$ is a better classifier than the set $\{x_3, x_4, x_5\}$, which all previous implementations would fail to do, as neither set leaves any points in the overlap region. More importantly, SISSO now provides the *D*-dimensional dividing planes found by linear SVM for all pairs of classes, automatically creating an actual classifier.

### B. Multiple residuals

The second advancement to the descriptor-identification step of the SISSO algorithm is the introduction of a *multiple-residuals* approach to select the features for models with a dimension higher than one. As outlined in the Introduction, in the original SISSO algorithm,^{38} the residual of the previously found model, $\Delta_{D-1}^{0}$, i.e., the difference between the vector storing the values of the property for each sample, **P**, and the estimates predicted by the (*D* − 1)-dimensional model $\left(\Delta_{D-1}^{0} = \mathbf{D}\mathbf{c}_{D-1} - \mathbf{P}\right)$, is used to calculate the projection score of the candidate features during the SIS step for the best *D*-dimensional model, $s_j^0 = R^2\left(\Delta_{D-1}^{0}, \mathbf{d}_j\right)$. Here, *R* is the Pearson correlation coefficient, representing a regression problem, and *j* corresponds to each expression generated during the feature-creation step of SISSO. In SISSO++,^{24} we extend the residual definition and use the best *r* residuals to calculate the projection score: $s_j = \max\left(s_j^0, s_j^1, \ldots, s_j^{r-1}\right)$. The multiple-residual concept generalizes the descriptor-identification step of SISSO by using information from an ensemble of models to determine which features to add to the selected subspace. The ensemble is created by using the *r* best models from the $\mathbf{D}\mathbf{c}_{D-1}$ identification step instead of only the top one. The value of *r* for a calculation is set, as any hyperparameter, via cross-validation.
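In a Python sketch (illustrative only, not the SISSO++ internals), the multi-residual projection score is simply a maximum over per-residual squared correlations:

```python
import numpy as np

def projection_scores(D, residuals):
    """s_j = max(s_j^0, ..., s_j^{r-1}), where s_j^k is the squared Pearson
    correlation of feature column j with the residual of the k-th best model."""
    Dc = D - D.mean(axis=0)
    Dn = Dc / np.linalg.norm(Dc, axis=0)        # center and normalize columns
    per_residual = []
    for delta in residuals:                     # one residual per kept model
        dc = delta - delta.mean()
        dc = dc / np.linalg.norm(dc)
        per_residual.append((Dn.T @ dc) ** 2)   # R^2 with this residual
    return np.max(per_residual, axis=0)         # best score over all residuals
```

With `r = 1` this reduces to the original SIS ranking; with larger `r`, a feature needs to correlate well with the residual of only one of the kept models to be selected.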

This process is illustrated in Fig. 6 for a three-dimensional problem space (three training samples) defined by **e**_{1}, **e**_{2}, and **e**_{3} with five candidate features **F**_{1}, …, **F**_{5} to describe the property **P**. In principle, there can be an arbitrary number of feature vectors, but for clarity, we only show five. If only a single residual is used, then the selected two-dimensional model will comprise **F**_{1} and **F**_{4}, as **F**_{3} is never selected because it has the smallest projection score from the residual of the model found using **F**_{1}. However, because **F**_{2} has a component along **e**_{2}, a better two-dimensional model consisting of a linear combination of **F**_{2} and **F**_{3} exists, despite **F**_{2} not being the feature most correlated with **P**. When going to higher-dimensional feature spaces, it becomes more likely that feature vectors similarly correlated with the property contain orthogonal information, hence the need for using multiple residuals in SISSO.

In order to demonstrate the effect of learning over multiple residuals and to estimate the optimal number of residuals and the size of the SIS subspace (*n*_{sis}), in Fig. 7 we plot the training root-mean-square error (RMSE) of the two-dimensional models for the function *y* = 5.5 + 0.4158*d*_{1} − 0.0974*d*_{2} + *δ*, where $d_1 = x_0^2 x_1^3$, $d_2 = x_2^3$, and *δ* is Gaussian white noise drawn from a distribution with a standard deviation of 0.05. For this problem, the best one-dimensional descriptor is $1/x_3^3$, with *x*_{3} explicitly set to $\left(y + \Delta\right)^{-1/3}$, where Δ is Gaussian white noise with a standard deviation of 20.0. This primary feature was explicitly added in order to ensure that the best one-dimensional model would contain neither *d*_{1} nor *d*_{2} in this synthetic problem. Because of this, when using a single residual, the SIS subspace size has to be increased to over 400 before *y* can be reproduced by SISSO. However, upon increasing the number of residuals to 50, SISSO can find which features are most correlated with the residual of *d*_{1}, and it immediately recovers *y*. In a recent paper published by some of us, we further demonstrate this approach's effectiveness for learning models of the bulk modulus of cubic perovskites.^{25}
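A minimal sketch of this synthetic data set can be written with the standard library alone. The coefficients and noise levels are those given in the text; the uniform sampling range [1, 5] for *x*_{0}, *x*_{1}, *x*_{2} is an assumption for illustration (the actual sampling behind Fig. 7 is not specified here). Since *y* + Δ can be negative, the cube root is taken in a sign-preserving way so that $1/x_3^3$ reproduces *y* + Δ exactly.

```python
import math
import random

random.seed(0)
n = 1000
x0 = [random.uniform(1.0, 5.0) for _ in range(n)]
x1 = [random.uniform(1.0, 5.0) for _ in range(n)]
x2 = [random.uniform(1.0, 5.0) for _ in range(n)]
delta = [random.gauss(0.0, 0.05) for _ in range(n)]   # noise on the property
Delta = [random.gauss(0.0, 20.0) for _ in range(n)]   # noise hidden inside x3

d1 = [a ** 2 * b ** 3 for a, b in zip(x0, x1)]
d2 = [c ** 3 for c in x2]
y  = [5.5 + 0.4158 * a - 0.0974 * b + e for a, b, e in zip(d1, d2, delta)]

# x3 = (y + Delta)^(-1/3), sign-preserving so negative values are handled:
x3 = [math.copysign(abs(yi + Di) ** (-1.0 / 3.0), yi + Di)
      for yi, Di in zip(y, Delta)]
# The planted one-dimensional descriptor 1/x3^3 then equals y + Delta.
f = [1.0 / (xi ** 3) for xi in x3]
```

Because $1/x_3^3 = y + \Delta$ by construction, this descriptor dominates the first SIS screening even though it carries no information about *d*_{1} or *d*_{2}, which is exactly what makes the single-residual search struggle.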

## IV. CONCLUSIONS

In this paper, we described recently developed improvements to the SISSO method and their implementation in the SISSO++ code, detailing both their mathematical and computational aspects; together, they constitute a large leap forward in the expressivity of the SISSO method. Utilizing these features provides greater flexibility and control over the expressions found by SISSO and is a first step toward introducing "grammatical" rules into SISSO and symbolic regression. In particular, concepts such as the units and ranges of the formulas could be extended to prune the search space of possible expressions for the final models. We have also described the implementation of *parametric SISSO*, which considerably widens the range of possible expressions found by SISSO. Finally, we discussed two improvements related to the SISSO solver, i.e., a linear-programming implementation for classification problems and the multiple-residuals technique, both providing extended flexibility in the descriptors and models found by SISSO.

## ACKNOWLEDGMENTS

TARP thanks Christian Carbogno for valuable discussions related to the parametric SISSO scheme and for proofreading those parts of the manuscript. TARP thanks Lucas Foppa for discussions related to the multiple-residuals approach and for proofreading those parts of the manuscript. This work was funded by the NOMAD Center of Excellence (European Union’s Horizon 2020 Research and Innovation Program, Grant Agreement No. 951786), the ERC Advanced Grant TEC1p (European Research Council, Grant Agreement No. 740233), BigMax (the Max Planck Society’s Research Network on Big-Data-Driven Materials Science), and the project FAIRmat (FAIR Data Infrastructure for Condensed-Matter Physics and the Chemical Physics of Solids, German Research Foundation, Project No. 460197019). TARP acknowledges the Alexander von Humboldt (AvH) Foundation for their support through the AvH Postdoctoral Fellowship Program.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

TARP implemented all methods and performed all calculations. TARP conceived all methods with assistance from LMG. MS and LMG supervised the project. All authors wrote the manuscript.

**Thomas A. R. Purcell**: Conceptualization (equal); Data curation (lead); Formal analysis (equal); Funding acquisition (equal); Investigation (lead); Methodology (equal); Software (lead); Writing – original draft (equal); Writing – review & editing (equal). **Matthias Scheffler**: Conceptualization (supporting); Funding acquisition (equal); Supervision (supporting); Writing – original draft (equal); Writing – review & editing (equal). **Luca M. Ghiringhelli**: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

The data that support the findings of this study are available in FigShare at http://dx.doi.org/10.6084/m9.figshare.23813889. The code used here can be found on GitLab: https://gitlab.com/sissopp_developers/sissopp.

## REFERENCES

*Learning Symbolic Physics with Graph Networks*

*SymbolicGPT: A Generative Transformer Model for Symbolic Regression*