Accurate and explainable artificial-intelligence (AI) models are promising tools for accelerating the discovery of new materials. Recently, symbolic regression has become an increasingly popular tool for explainable AI because it yields models that are relatively simple analytical descriptions of target properties. Due to its deterministic nature, the sure-independence screening and sparsifying operator (SISSO) method is a particularly promising approach for this application. Here, we describe the new advancements of the SISSO algorithm, as implemented into SISSO++, a C++ code with Python bindings. We introduce a new representation of the mathematical expressions found by SISSO. This is a first step toward introducing “grammar” rules into the feature creation step. Importantly, by introducing a controlled nonlinear optimization to the feature creation step, we expand the range of possible descriptors found by the methodology. Finally, we introduce refinements to the solver algorithms for both regression and classification, which drastically increase the reliability and efficiency of SISSO. For all these improvements to the basic SISSO algorithm, we not only illustrate their potential impact but also fully detail how they operate both mathematically and computationally.
I. INTRODUCTION
Data-centric and artificial-intelligence (AI) approaches are becoming a vital tool for describing physical and chemical properties and processes. The key advantage of AI is its ability to find correlations between different sets of properties without the need to know which ones are important before the analysis. Because of this, AI has become increasingly popular for materials discovery applications with uses in areas such as thermal transport properties,1,2 catalysis,3 and quantum materials.4 Despite the success of these methodologies, creating explainable and physically relevant AI models remains an open challenge in the field.5–7
One prevalent set of methods for explainable AI is symbolic regression.8–11 Symbolic regression algorithms identify optimal nonlinear, analytic expressions for a given target property from a set of input features, i.e., the primary features, that are related to the target.12 Originally, (stochastic) genetic-programming-based approaches were and still are used to find these expressions,12–15 but recently, a more diverse set of solvers have been developed.16–21 The sure-independence screening and sparsifying operator (SISSO) approach combines symbolic regression with compressed sensing22–25 to provide a deterministic way of finding these analytic expressions. This approach has been used to describe numerous properties, including phase stability,23,26,27 catalysis,28 and glass transition temperatures.29 It has also been used in a multi-task22 and hierarchical fashion.25
In this paper, we introduce the new concepts implemented in the recently released SISSO++ code24 and detail their implementation. SISSO++ is a new, modular implementation of SISSO that provides a more user-friendly interface to run SISSO with several methodological updates. Before detailing the updated methodology, we summarize the main aspects of the SISSO algorithm and define some of the terminologies we use below. The SISSO approach starts with a collection of primary features and mathematical unary and binary operators (e.g., addition, multiplication, nth root, and logarithms). The first step is the feature-creation step, where a pool of generated features is built by exhaustively applying the set of mathematical operators onto the primary features. The algorithm is iteratively repeated by applying the set of operators onto the previously generated features. The number of iterations in this feature-creation step is called the rung. The generated features, projected onto the training dataset, build the so-called sensing matrix D, with one column for each generated feature and one row for each data point.
The subsequent step is the descriptor-identification step, in which compressed sensing is used to identify the best D-dimensional linear model by performing an ℓ0-regularized optimization on a subspace of all generated features. This subspace is selected using sure-independence screening30 (SIS) with a suitable projection score, which depends on the loss function used to evaluate the models. For a regression problem (see further below for the discussion of classification problems), i.e., the prediction of a continuous property P (an array with as many components as the number of training points), one solves $\operatorname{argmin}_{\mathbf{c}} \|\mathbf{D}\mathbf{c} - \mathbf{P}\|_2 + \lambda \|\mathbf{c}\|_0$, where ‖c‖0 is the number of nonzero components in c, which we call the dimensionality D of the model Dc. The hyperparameter λ is fixed by cross-validation, i.e., by minimizing the validation error over the values of D. Within the SISSO algorithm, the SIS step works iteratively: it first ranks all generated features according to their Pearson correlation with the target property and adds only the most correlated features to the selected subspace. At each subsequent iteration, SIS ranks the features by their Pearson correlation with the residual of the previous step, $\Delta_{D-1}$, i.e., the difference between the estimates predicted by the (D − 1)-dimensional model and P, $\Delta_{D-1} = \mathbf{D}\mathbf{c}_{D-1} - \mathbf{P}$, and adds the most correlated ones to the selected subspace. The SIS step is followed by the sparsifying-operator (SO) step, in which linear-regression models are trained on all subsets of features in the selected subspace, i.e., on all single features, then all pairs, then all triplets, etc. This yields a set of models of dimensions D = 1, 2, 3, etc. For each dimension, the models are ranked by training error and the best model is selected. The model with the lowest validation error across dimensions is the final SISSO model. In practice, models of increasing dimension are trained until the validation error starts increasing or reaches a plateau. In summary, the result of the SISSO analysis is a D-dimensional descriptor, i.e., a vector whose components are features from the selected subspace. For a regression problem, the SISSO model is the scalar product of the identified descriptor with the vector of linear coefficients resulting from the ℓ0-regularized linear regression.
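As a minimal illustration of the SIS ranking described above, the following C++ sketch (illustrative code, not the SISSO++ API; the data layout and function names are assumptions) ranks the candidate features by the absolute Pearson correlation of their training-data values with the current residual and keeps the top n_sis of them.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Pearson correlation between a candidate feature column and the residual.
double pearson(const std::vector<double>& f, const std::vector<double>& r) {
    const double n = static_cast<double>(f.size());
    const double mf = std::accumulate(f.begin(), f.end(), 0.0) / n;
    const double mr = std::accumulate(r.begin(), r.end(), 0.0) / n;
    double cov = 0.0, vf = 0.0, vr = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        cov += (f[i] - mf) * (r[i] - mr);
        vf += (f[i] - mf) * (f[i] - mf);
        vr += (r[i] - mr) * (r[i] - mr);
    }
    return cov / std::sqrt(vf * vr);
}

// One SIS iteration: return the indices of the n_sis candidate features most
// correlated (in absolute value) with the current residual (or with the
// property itself in the first iteration).
std::vector<std::size_t> sis_select(const std::vector<std::vector<double>>& features,
                                    const std::vector<double>& residual,
                                    std::size_t n_sis) {
    std::vector<std::size_t> idx(features.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
        return std::fabs(pearson(features[a], residual)) >
               std::fabs(pearson(features[b], residual));
    });
    idx.resize(std::min<std::size_t>(n_sis, idx.size()));
    return idx;
}
```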
For a classification problem, the SISSO model is given as a set of hyperplanes that divide the data points into classes; these hyperplanes are described by the scalar product of the identified descriptor with a set of coefficients found by linear support vector machines (SVMs). The loss function for classification is based on a convex-hull algorithm: the convex hull of each set of points with the same property label, i.e., the same class, is evaluated, and the algorithm minimizes the number of data points that lie inside more than one convex hull, i.e., the number of points inside the overlap region. The SIS step ranks features by how well an individual feature separates the remaining unclassified data points, first minimizing the number of misclassified points and then minimizing the hypervolume of the overlap region in which they lie. When a feature leaves no misclassified points, the tiebreaker maximizes the minimum separation distance between points in different classes. The residual here is the set of all misclassified points of the (D − 1)-dimensional descriptor. The SO step uses a convex-hull-based loss function, i.e., it minimizes the number of points inside the overlap region of a set of D-dimensional convex hulls.
This paper is organized as follows. We separately describe the recent updates to the SISSO methodology for the feature-creation (Sec. II) and descriptor-identification (Sec. III) steps. The most important advancement of the code is expressing the features as binary expression trees (Subsection II A) instead of strings, which allows us to define all aspects of a generated feature recursively from the primary features. With this implementation choice, we are able to keep track of the units of each generated feature, as well as an initial representation of its domain and range. This allows for the creation of grammatically correct expressions, in terms of consistency of the physical units, and for the control of numerical issues generated by features going out of their physically meaningful range. For the feature-creation step, we also discuss the implementation of parametric SISSO (Subsection II B), which introduces nonlinear parameters into the operators; these parameters are optimized against a loss function based on the compressed-sensing feature-selection metrics. This procedure was used to describe the thermal conductivity of a material in a recent publication.31 For the descriptor-identification step, we cover two components: an improved classification algorithm and the multi-residual approach. For classification problems, we generalized the algorithm to work for problems of arbitrary dimension and explicitly include model identification via linear SVM (Subsection III A). The multi-residual approach (Subsection III B), which was previously used in Ref. 25, introduces further flexibility for the identification of models with more than one dimension. Here, we provide an in-depth discussion of its machinery.
II. FEATURE CREATION
A. Binary-expression-tree representation of features
The biggest advancement to the implementation of SISSO in SISSO++, compared to the original implementation in Ref. 23, is its modified representation of the features as binary expression trees, instead of strings. While the choice of feature representation does not affect the methodology and could be used in any implementation, it does impact the performance and readability of the code base. This representation is illustrated in Fig. 1 and easily allows for all aspects of the generated features to be recursively calculated on the fly from the data stored in the primary features. For certain applications, it is also possible to store the data of higher-rung features to reduce the overall computational cost of the calculations. The individual features are addressed by the root node of the binary expression tree and stored in the code as a shared_ptr from the C++ standard library. This representation reduces the overall memory footprint of each calculation as the individual features only need to be created once and only copies of shared pointers need to be stored for each new expression. The remainder of this section will be used to describe the various aspects of the new representation, including a description of the units and range of the features, as well as how it is used to generate the feature space.
A demonstration of the new representation of the features in the SISSO++ code. The feature is stored as the root of the tree (represented by the thick border), the primary features are the leaves, and the rung corresponds to the height of the tree, i.e., the longest path between a leaf and the root. The unit, range, expression, and values need to be stored only for the primary features; for all generated features, they are defined recursively.
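The sketch below shows a minimal, hypothetical version of such a binary-expression-tree node; the actual SISSO++ classes also store units, ranges, and parameters, so this only illustrates the recursive structure described above, with the rung as the tree height and the values evaluated from the leaves.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of a binary-expression-tree node; the actual SISSO++
// class hierarchy is more elaborate (units, ranges, parameters, etc.).
struct FeatureNode {
    std::string op;                            // e.g., "add", "sin", or "" for a primary feature
    std::vector<double> values;                // stored only for primary features
    std::shared_ptr<FeatureNode> left, right;  // children; right is null for unary operators

    // Rung = height of the tree (primary features are rung 0).
    int rung() const {
        const int l = left ? left->rung() : -1;
        const int r = right ? right->rung() : -1;
        return 1 + std::max(l, r);
    }

    // Evaluate the feature for training sample i, recursively from the leaves.
    double eval(std::size_t i) const {
        if (!left) return values[i];  // primary feature: read stored data
        if (op == "add") return left->eval(i) + right->eval(i);
        if (op == "mult") return left->eval(i) * right->eval(i);
        if (op == "sin") return std::sin(left->eval(i));
        throw std::runtime_error("unknown operator: " + op);
    }
};
```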
1. Units
An important upgrade in SISSO++ is its generalized and exact treatment of units for the expressions. In physics, dimensional analysis is an important tool for generating physically meaningful expressions, and it is necessary to include it when using symbolic regression for scientific problems. We introduced this into SISSO++ by determining the units of each new expression from the primary features and explicitly checking that a new expression is physically possible. Within the code, the Units are implemented as dictionaries, with the key representing the base unit and the value representing the exponent for each key; e.g., m/s² would be stored as {m: 1, s: −2}. Functions exist to transform the Units to and from strings to represent the information more easily. We then implemented multiplication, division, and power operators for these specialized dictionaries, allowing the units of the generated features to be derived recursively following the rules in Table I. An important caveat is that the current implementation cannot convert between two units of the same physical quantity, e.g., between nanometers, picometers, and Bohr radii for length.
How the units are calculated for each operation.
Operation | Resulting unit
---|---
A + B | Unit(A)
A − B | Unit(A)
A ∗ B | Unit(A) ∗ Unit(B)
A/B | Unit(A)/Unit(B)
abs(A) | Unit(A)
abs(A − B) | Unit(A)
exp(A) | Unitless
exp(−A) | Unitless
ln(A) | Unitless
sin(A) | Unitless
cos(A) | Unitless
1/A | Unit(A)^−1
A^2 | Unit(A)^2
A^3 | Unit(A)^3
A^6 | Unit(A)^6
sqrt(A) | Unit(A)^(1/2)
cbrt(A) | Unit(A)^(1/3)
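A minimal sketch of how such unit dictionaries can be composed is given below; the container choice and helper names are illustrative and not the SISSO++ implementation.

```cpp
#include <cmath>
#include <map>
#include <string>

// Minimal sketch of a unit as a map from base-unit symbol to exponent,
// e.g., m/s^2 -> {{"m", 1.0}, {"s", -2.0}}.
using Unit = std::map<std::string, double>;

Unit operator*(const Unit& a, const Unit& b) {
    Unit out = a;
    for (const auto& [base, exponent] : b) {
        out[base] += exponent;
        if (std::fabs(out[base]) < 1e-12) out.erase(base);  // drop cancelled base units
    }
    return out;
}

Unit pow(const Unit& a, double p) {
    Unit out;
    for (const auto& [base, exponent] : a) out[base] = exponent * p;
    return out;
}

Unit operator/(const Unit& a, const Unit& b) { return a * pow(b, -1.0); }
```

For example, dividing a velocity unit {m: 1, s: −1} by a time unit {s: 1} yields {m: 1, s: −2}, in line with the recursive rules of Table I.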
Using this implementation of units, a minimal treatment of dimensional analysis can be performed in the code. The dimensional analysis focuses on whether the units are consistent within each expression and for the final model. This check is also used to determine the units of the fitted constants in the final linear models, which can take arbitrary units and can therefore perform any unit conversion natively. The restrictions used to reject expressions by dimensional analysis are outlined in Table II and can be summarized as follows: addition and subtraction are only allowed to act on features with the same units, and all transcendental operations must act on a unitless quantity. With these two restrictions in place, only physically relevant features can be found, and the choice of units should no longer affect which features are selected. If one wants to revert to the previous behavior without these restrictions, this can be achieved by providing all primary features without associated units.
Restrictions for each unit; if an operation is not listed, there are no restrictions.
Operation | Unit restriction
---|---
A + B | Unit(A) == Unit(B)
A − B | Unit(A) == Unit(B)
abs(A − B) | Unit(A) == Unit(B)
exp(A) | Unit(A) == ∅
exp(−A) | Unit(A) == ∅
ln(A) | Unit(A) == ∅
sin(A) | Unit(A) == ∅
cos(A) | Unit(A) == ∅
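The two rules summarized above translate into a simple validity check of the following kind (a sketch with illustrative operator names, not the SISSO++ identifiers):

```cpp
#include <map>
#include <set>
#include <string>

using Unit = std::map<std::string, double>;  // as in the sketch above

// Addition/subtraction require matching units; transcendental operators
// require a unitless argument. All other operators are unrestricted.
bool is_dimensionally_valid(const std::string& op, const Unit& a, const Unit& b = {}) {
    static const std::set<std::string> needs_equal{"add", "sub", "abs_diff"};
    static const std::set<std::string> needs_unitless{"exp", "neg_exp", "log", "sin", "cos"};
    if (needs_equal.count(op)) return a == b;        // unit(A) == unit(B)
    if (needs_unitless.count(op)) return a.empty();  // unit(A) == ∅
    return true;
}
```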
2. Range
Another important advancement of the feature-creation step is the introduction of ranges for the primary features, which act as a domain for future operations during feature creation. One of the challenges associated with symbolic regression, especially with smaller datasets, is that the selected expressions can contain discontinuities that lie outside of the training data but still within the relevant input space for a given problem. For example, this can lead to an expression taking the logarithm of a negative number, resulting in an undefined prediction. SISSO++ solves this problem by including an option for describing the range of a primary feature in standard mathematical interval notation and then using it to calculate the range of every generated expression containing that primary feature, following the rules specified in Table III. In the code, the ranges are referenced as the Domain because the range of a feature of rung n − 1 is the domain of a possible expression of rung n that uses that feature. While all ranges in Table III assume inclusive endpoints, the implementation can handle both exclusive endpoints and a list of values explicitly excluded from the range, e.g., point discontinuities inside the primary features themselves.
How the range for each operation is calculated.
Operation | Resulting range
---|---
A + B | [min(A) + min(B), max(A) + max(B)]
A − B | [min(A) − max(B), max(A) − min(B)]
A ∗ B | [min(C), max(C)], with C = {min(A)min(B), min(A)max(B), max(A)min(B), max(A)max(B)}
A/B | Range(A ∗ (1/B))
1/A | if 0 ∈ Range(A) and min(A) != 0 and max(A) != 0: (−∞, 0) ∪ (0, ∞); else if 0 ∈ Range(A) and min(A) == 0: (0, ∞); else if 0 ∈ Range(A) and max(A) == 0: (−∞, 0); else: [1/max(A), 1/min(A)]
Table IV lists the cases where the range of a feature is used to prevent a new expression from being generated. In all cases, this prevents an operation from occurring where the mathematical operation would not be defined, such as taking the square root of a negative number. In cases where the range of values of a primary feature is not defined, these checks are not performed and the original assumption that all operations are safe is used.
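A minimal sketch of how ranges can be propagated and used to reject expressions is shown below; it handles closed intervals only, whereas the SISSO++ Domain also tracks open endpoints and explicitly excluded values, and the operator names are illustrative.

```cpp
#include <algorithm>
#include <initializer_list>
#include <string>

// Closed interval [lo, hi]; a simplified stand-in for the Domain class.
struct Interval {
    double lo, hi;
};

Interval operator+(Interval a, Interval b) { return {a.lo + b.lo, a.hi + b.hi}; }
Interval operator-(Interval a, Interval b) { return {a.lo - b.hi, a.hi - b.lo}; }

Interval operator*(Interval a, Interval b) {
    // The product range is bounded by the products of the endpoints.
    std::initializer_list<double> c{a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi};
    return {std::min(c), std::max(c)};
}

// Reject an expression if the argument's range leaves the operator's domain.
bool is_domain_valid(const std::string& op, const Interval& a) {
    if (op == "sqrt") return a.lo >= 0.0;                   // no square roots of negative numbers
    if (op == "log") return a.lo > 0.0;                     // logarithm needs strictly positive input
    if (op == "inv") return !(a.lo <= 0.0 && a.hi >= 0.0);  // 1/A undefined if 0 is in the range
    return true;
}
```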
B. Parametric SISSO
The free parameters and updated equations for each operation used in the updated feature creation step of SISSO. For each operation, parameter restrictions are shown in the third column.
Operation | Parameterized expression | Fixed parameters
---|---|---
A + B | A + α1B | β0 = 0; β1 = 0; α0 = 1
A − B | A − α1B | β0 = 0; β1 = 0; α0 = 1
A ∗ B | | α0 = 1; α1 = 1
A/B | | α0 = 1; α1 = 1
 | | β0 = 0; α0 = 1
 | | α = 1
 | | β = 0; α > 0
 | | β = 0; α > 0
 | | α = ±1
 | | α = 1
 | | α = 1
 | | α = 1
 | | α = 1
 | | α = ±1
 | | α = 1
In parametric SISSO, the selected operators are augmented with scale and shift parameters, e.g., a unary operation op(A) becomes op(αA + β), and these parameters are optimized against a loss function based on the compressed-sensing feature-selection metrics (see the table above for the parameterized form and restrictions of each operation). This parameterization scheme is created at the root node and is recursively defined for all other operators in the binary expression tree below the root. As a result, we introduce a new hyperparameter, the parameterization level, Pl, which specifies how many levels of the tree, counted from the root, have their parameters optimized. For example, if Pl = 1, only the parameters associated with the root node are optimized, whereas if Pl = 2, the parameters associated with the root node and its children are optimized. This is best illustrated in Fig. 2, where a parameterized sine operator (blue node) acts on a parameterized square expression (left graph/green and orange nodes) and on an unparameterized cube expression (right graph/red and orange nodes). In both cases, when Pl = 1, only the scale and shift parameters associated with the sine operator are optimized, with all previously found or not-yet-included parameters fixed to their current values; the resulting expressions are shown in the lower row of Fig. 2. If Pl = 2, however, the parameters of the square and cube operations are added to those of the sine operator, and the entire expression is optimized. The resulting equations for this case are shown in the upper row of Fig. 2.
A graphical representation of the effect of the parameterization level for the case where a new parameterized operator is added to a feature that was (left) or was not (right) previously parameterized. If Pl = 1, then only the parameters associated with the root node (blue, sine operator) are optimized. If the non-root operations were previously parameterized, those parameters remain fixed. If Pl = 2, then the parameters associated with the child nodes are added to the set to be optimized, regardless of whether the child node previously had optimized parameters.
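To make the role of the parameterization level concrete, the sketch below evaluates the left-hand expression of Fig. 2, assuming the generic form op(αA + β) for each parameterized node, and collects the parameters that the optimizer may vary for a given Pl; the functional form and member names are assumptions made for illustration.

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch of the parameterization level P_l for the expression
// sin(alpha0 * (alpha1 * A + beta1)^2 + beta0). With P_l = 1 only the root
// parameters {alpha0, beta0} are free; with P_l = 2 the child parameters
// {alpha1, beta1} are optimized as well.
struct ParamSineOfSquare {
    double alpha0 = 1.0, beta0 = 0.0;  // parameters of the root (sine) node
    double alpha1 = 1.0, beta1 = 0.0;  // parameters of the child (square) node

    double eval(double A) const {
        const double inner = alpha1 * A + beta1;
        return std::sin(alpha0 * inner * inner + beta0);
    }

    // Collect the parameters that the optimizer is allowed to vary.
    std::vector<double*> free_parameters(int param_level) {
        std::vector<double*> p{&alpha0, &beta0};
        if (param_level >= 2) {
            p.push_back(&alpha1);
            p.push_back(&beta1);
        }
        return p;
    }
};
```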
Each optimization follows the two- or three-step process outlined here. First, a local optimization is performed to find the local minimum associated with the initial parameters. Once at a local minimum, an optional global optimization is performed to find any minima better than the one initially found. For these first two steps, the parameters are optimized to a relative tolerance of 10−3 and 10−2, respectively, with a maximum of five thousand function evaluations for each step. Finally, a more accurate local optimization is performed to a relative tolerance of 10−6 to find the best parameter set; for this final optimization, ten thousand function evaluations are allowed. Additionally, for both the initial and global optimization steps, the parameters are bounded between −100 and 100 to improve the efficiency of the optimization, but this restriction is removed for the final optimization. For all local optimizations, the subplex algorithm,33 a faster and more robust variant of the Nelder–Mead simplex method,34 is used; the improved stochastic ranking evolution strategy algorithm35 is used for all global optimizations. Once optimized, only the internal α and β parameters are stored with the feature. This procedure is repeated for all operators specified to be parameterized by the user, so it can increase the cost of feature creation considerably.
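The following sketch expresses this three-stage procedure with the NLopt library, which provides both the subplex (LN_SBPLX) and ISRES (GN_ISRES) algorithms; the objective below is a placeholder, and whether SISSO++ calls NLopt in exactly this way is an assumption of the sketch.

```cpp
#include <nlopt.hpp>
#include <vector>

// Placeholder objective: in practice this would evaluate the feature's
// projection-based loss for the parameter vector p.
double loss(const std::vector<double>& p, std::vector<double>& /*grad*/, void* /*data*/) {
    double s = 0.0;
    for (double v : p) s += (v - 1.0) * (v - 1.0);  // dummy quadratic with minimum at p_i = 1
    return s;
}

std::vector<double> optimize_parameters(std::vector<double> p) {
    const unsigned n = static_cast<unsigned>(p.size());
    double minf = 0.0;

    // 1. Coarse local optimization (subplex), bounded to [-100, 100].
    nlopt::opt local1(nlopt::LN_SBPLX, n);
    local1.set_min_objective(loss, nullptr);
    local1.set_lower_bounds(std::vector<double>(n, -100.0));
    local1.set_upper_bounds(std::vector<double>(n, 100.0));
    local1.set_ftol_rel(1e-3);
    local1.set_maxeval(5000);
    local1.optimize(p, minf);

    // 2. Optional global search (ISRES) within the same bounds.
    nlopt::opt global(nlopt::GN_ISRES, n);
    global.set_min_objective(loss, nullptr);
    global.set_lower_bounds(std::vector<double>(n, -100.0));
    global.set_upper_bounds(std::vector<double>(n, 100.0));
    global.set_ftol_rel(1e-2);
    global.set_maxeval(5000);
    global.optimize(p, minf);

    // 3. Final unbounded local refinement to a tighter tolerance.
    nlopt::opt local2(nlopt::LN_SBPLX, n);
    local2.set_min_objective(loss, nullptr);
    local2.set_ftol_rel(1e-6);
    local2.set_maxeval(10000);
    local2.optimize(p, minf);
    return p;
}
```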
Figure 3 illustrates the power of the new parameterization scheme. For both toy problems represented by analytic Lorentzian and sine functions with some white noise, the nonparametric version of SISSO cannot find accurate models for the equations as it cannot address the nonlinearities properly. By using this new parameterization scheme, SISSO is now able to accurately find the models, as shown in Figs. 3(c) and 3(d). However, it is important to note that the more powerful featurization comes at the cost of a significantly increased time to generate the feature space as the parameterization becomes the bottleneck for the calculations. Additionally, there can be cases where the parameterization scheme does not find an optimal solution because there is too much noise or the optimal parameters are too far away from the initial guesses, as shown in Figs. 3(e) and 3(f).
A comparison of the expressions found by nonparametric [(a), (c), and (e)] and parametric [(b), (d), and (f)] SISSO for a Lorentzian [(a), (b), (e), and (f)] and a sine [(c) and (d)] function. Blue dots represent the training data, and the red line represents the expression found by SISSO. The parameterization scheme finds either the correct [(b) and (d)] or a better (f) model than the nonparametric approach, even when high noise or a bad initial guess of the parameters leads to a nonoptimal solution.
C. Building the complete feature set
With all the new aspects of the feature representation in place, SISSO++ provides a fully parallelized feature-set construction that uses a combination of threads and Message Passing Interface (MPI) ranks for efficiency. The basic process of creating new features is illustrated in Fig. 4, where each new rung is built by applying a new operation on top of the existing binary expression trees of the previous rung. In this step, the operators are separated into parameterized and non-parameterized versions to allow for the optional use of the parametric SISSO concepts. Throughout this process, checks are performed to ensure that the units are correct and that the domain of each new operation is respected. For example, this is why sin(ω) and sin(t) are not included in the rung 1 features, as taking the sine of a quantity carrying units is physically meaningless. If the sine operation were replaced by an algebraic operation or included the parameterization scheme, then two additional rung 1 features would be created, with the unary operation acting on each primary feature. Finally, the code checks for invalid values, e.g., NaN or inf, and performs basic simplification checks, rejecting trivially redundant features.
Illustration of how the feature space of SISSO is created. In this example, the user selects two primary features, ω (purple) and t (orange), and three operators, sin (blue), multiplication (green), and division (red). SISSO then builds up a more complicated expression space by applying the operations to the existing features, increasing the height of the binary expression trees. Throughout this process, the units and ranges of each of the operations are respected.
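Schematically, one rung of this construction can be written as the loop below, where valid() stands in for the combined unit, range, NaN/inf, and simplification checks; the actual SISSO++ routine additionally distributes the work over MPI ranks and threads and handles the parameterized operators, so this is only an outline.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct Expr;  // forward declaration
using ExprPtr = std::shared_ptr<Expr>;
struct Expr {        // minimal expression node for this sketch
    std::string op;  // empty for a primary feature
    ExprPtr left, right;
};

// Build the next rung by applying every operator to the existing features,
// keeping only the expressions that pass all checks encapsulated in valid().
std::vector<ExprPtr> next_rung(
    const std::vector<ExprPtr>& current,
    const std::vector<std::string>& unary_ops,
    const std::vector<std::string>& binary_ops,
    const std::function<bool(const std::string&, ExprPtr, ExprPtr)>& valid) {
    std::vector<ExprPtr> generated;
    for (const auto& a : current) {
        for (const auto& op : unary_ops)
            if (valid(op, a, nullptr))
                generated.push_back(std::make_shared<Expr>(Expr{op, a, nullptr}));
        for (const auto& b : current)
            for (const auto& op : binary_ops)
                if (valid(op, a, b))
                    generated.push_back(std::make_shared<Expr>(Expr{op, a, b}));
    }
    return generated;
}
```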
III. DESCRIPTOR IDENTIFICATION
A. Linear programming implementation for classification problems
Figure 5 demonstrates the capabilities of the new classification algorithm on a toy problem where a spherical point cloud is separated into four classes by the planes x0 + x1 + x2 = 0 and x2 = 0. A three-dimensional model was chosen to demonstrate that the new method works beyond two dimensions while remaining easy to visualize; in principle, descriptors of arbitrary dimension can be found. In this example, we randomly sample a Gaussian distribution with a standard deviation of 0.5 centered at the origin to generate one thousand samples for three features, x0, x1, and x2, which are then copied and relabeled as x3, x4, and x5. The samples are separated into four classes based on whether x0 + x1 + x2 is greater than (light blue and black) or less than (red and yellow) zero and whether x2 > 0 (red and light blue) or x2 < 0 (black and yellow). To better highlight the separation of the four classes, an artificial margin is created for x0, x1, and x2 by replacing every point that falls within a margin of either dividing plane with another random point away from the margin, but still within the same class. To focus on the new solver, we do not perform the feature-creation step of SISSO for this problem; however, the rung-two feature x0 + x1 + x2 together with x2 would be able to completely separate the classes in two dimensions. Using the updated algorithm, SISSO can now easily identify that the set {x0, x1, x2} is a better classifier than the set {x3, x4, x5}, which all previous implementations would fail to do, as neither set leaves any points in the overlap region. More importantly, SISSO now provides the D-dimensional dividing planes found by linear SVM for all pairs of classes, creating an actual classifier automatically.
A demonstration of the classification algorithm. A set of 1000 points of three features, x0, x1, and x2, is sampled from a Gaussian distribution with a standard deviation of 0.5 centered at the origin. The set is separated into four classes (red, black, yellow, and light blue circles) by the planes x0 + x1 + x2 = 0 and x2 = 0 (green surfaces). (a) All points that fall within a margin of either plane are moved to a random point in the same class outside the margin region; (b) the original data, stored as x3, x4, and x5. The updated classification algorithm correctly determines that x0, x1, and x2 form the superior classifier, whereas with the original definition, based only on the convex overlap region, the two sets would be considered equally good for three and more dimensions. The projections are shown with an elevation angle of 7.5° and an azimuthal angle of 145°.
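For reference, a standard linear-programming feasibility test can decide whether a data point lies inside the convex hull of another class, which is the quantity that the overlap-based loss counts; whether SISSO++ uses exactly this formulation is not spelled out in this section, so it is given here only as an illustration of the linear-programming component.

```latex
% A point x lies inside the convex hull of the points {v_1, ..., v_m} belonging
% to another class if and only if the following linear program is feasible:
\begin{equation}
  \exists\, \lambda \in \mathbb{R}^{m}: \quad
  \sum_{i=1}^{m} \lambda_i v_i = x, \qquad
  \sum_{i=1}^{m} \lambda_i = 1, \qquad
  \lambda_i \geq 0 \;\; \forall i .
\end{equation}
% Counting the points for which this feasibility problem has a solution gives
% the number of points in the overlap region used by the classification loss.
```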
B. Multiple residuals
The second advancement to the descriptor-identification step of the SISSO algorithm is the introduction of a multiple-residuals approach to select the features for models with a dimension higher than one. As outlined in the Introduction, in the original SISSO algorithm,38 the residual of the previously found model, $\Delta_{D-1}$, i.e., the difference between the vector storing the values of the property for each sample, P, and the estimates predicted by the (D − 1)-dimensional model, is used to calculate the projection score of the candidate features during the SIS step for the best D-dimensional model, $s_j = \left|R\!\left(\Delta_{D-1}, \mathbf{d}_j\right)\right|$. Here, R is the Pearson correlation coefficient, representing a regression problem, and j runs over the expressions generated during the feature-creation step of SISSO. In SISSO++,24 we extend this definition and use the best r residuals to calculate the projection score of each candidate feature. The multiple-residual concept generalizes the descriptor-identification step of SISSO by using information from an ensemble of models to determine which features to add to the selected subspace. The ensemble is created by using the r best (D − 1)-dimensional models instead of only the top one. The value of r is set, like any other hyperparameter, via cross-validation.
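A sketch of how such a multi-residual projection score could be computed is given below; aggregating the correlations with a maximum over the r residuals, as well as the function names, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Multi-residual projection score: here the score of a candidate feature is
// taken as its largest absolute correlation with any of the r residuals kept
// from the (D-1)-dimensional models; `corr` stands for the Pearson
// correlation used in the SIS step.
double projection_score(
    const std::vector<double>& feature,
    const std::vector<std::vector<double>>& residuals,  // one residual per kept model
    const std::function<double(const std::vector<double>&, const std::vector<double>&)>& corr) {
    double score = 0.0;
    for (const auto& res : residuals)
        score = std::max(score, std::fabs(corr(feature, res)));
    return score;
}
```

With r = 1, this reduces to the single-residual score defined above.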
This process is illustrated in Fig. 6 for a three-dimensional problem space (three training samples) defined by e1, e2, and e3, with five candidate features, F1, …, F5, to describe the property P. In principle, there can be an arbitrary number of feature vectors, but for clarity, we show only five. If only a single residual is used, then the selected two-dimensional model will comprise F1 and F4, as F3 is never selected: it has the smallest projection score against the residual of the model found using F1. However, because F2 has a component along e2, a better two-dimensional model consisting of a linear combination of F2 and F3 exists, despite F2 not being the feature most correlated with P. When going to higher-dimensional feature spaces, it becomes more likely that feature vectors similarly correlated with the property contain orthogonal information, hence the need for using multiple residuals in SISSO.
An illustration of how tracking multiple residuals can improve the performance of SISSO. A three-dimensional problem space (three training samples) defined by e1, e2, and e3, with five feature vectors, F1 (gray), F2 (blue), F3 (red), F4 (brown), and F5 (turquoise), for a property vector P (purple). For this example, the size of the SIS subspace, nsis, is two. The residuals for the best (Δ1, gray) and second-best (blue) one-dimensional models are shown as dashed lines. Note that both residuals are plotted twice: once connecting the tips of the arrows representing the vectors whose difference defines the residual, and once rigidly translated to stem from the origin. If the number of residuals, nres, is one, then only Δ1 is used [(d) and (e)] and F3 will not be selected in the second SIS step, meaning that the best two-dimensional model is not found. However, if nres = 2 [(b) and (c)], then F3 is selected and the best two-dimensional model, a combination of F2 and F3, can be found in the second ℓ0 step.
In order to demonstrate the effect of learning over multiple residuals and to obtain an estimate of the optimal number of residuals and of the size of the SIS subspace (nsis), we plot in Fig. 7 the training root-mean-square error (RMSE) of the two-dimensional models for the function y = 5.5 + 0.4158d1 − 0.0974d2 + δ, where δ is a Gaussian white noise term drawn from a distribution with a standard deviation of 0.05. For this problem, the best one-dimensional descriptor involves the primary feature x3, which is explicitly constructed with an added Gaussian white noise term Δ with a standard deviation of 20.0; this primary feature was added to ensure that the best one-dimensional model would not contain d1 or d2 in this synthetic problem. Because of this, when using a single residual, the SIS subspace size has to be increased to over 400 before y can be reproduced by SISSO. However, by increasing the number of residuals to 50, SISSO can find the features most correlated with the residual of d1 and it immediately finds y. In a recent paper published by some of us, we further demonstrate the effectiveness of this approach for learning models of the bulk modulus of cubic perovskites.25
The training RMSE of a two-dimensional model against the size of the SIS subspace for y = 5.5 + 0.4158d1 − 0.0974d2 + δ, where δ is a Gaussian white noise term drawn from a distribution with a standard deviation of 0.05. The dark blue solid line is the error when learning with one residual, the light blue dashed line with 25 residuals, and the red solid line with 50 residuals.
IV. CONCLUSIONS
In this paper, we described recently developed improvements to the SISSO method and their implementation in the SISSO++ code, in terms of both their mathematical and computational details; together they constitute a large leap forward in the expressivity of the SISSO method. These features provide greater flexibility and control over the expressions found by SISSO and are a first step toward introducing "grammatical" rules into SISSO and symbolic regression. In particular, concepts such as the units and ranges of the formulas could be extended to prune the search space of possible expressions for the final models. We have also described the implementation of parametric SISSO, which considerably expands the range of possible expressions found by SISSO. Finally, we discussed two improvements related to the SISSO solver, i.e., a linear-programming implementation for classification problems and the multiple-residuals technique, both providing extended flexibility in the descriptors and models found by SISSO.
ACKNOWLEDGMENTS
TARP thanks Christian Carbogno for valuable discussions related to the parametric SISSO scheme and for proofreading those parts of the manuscript. TARP thanks Lucas Foppa for discussions related to the multi-residual approach and for proofreading those parts of the manuscript. This work was funded by the NOMAD Center of Excellence (European Union's Horizon 2020 Research and Innovation Program, Grant Agreement No. 951786), the ERC Advanced Grant TEC1p (European Research Council, Grant Agreement No. 740233), BigMax (the Max Planck Society's Research Network on Big-Data-Driven Materials Science), and the project FAIRmat (FAIR Data Infrastructure for Condensed-Matter Physics and the Chemical Physics of Solids, German Research Foundation, Project No. 460197019). TARP acknowledges the Alexander von Humboldt (AvH) Foundation for its support through the AvH Postdoctoral Fellowship Program.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
TARP implemented all methods and performed all calculations. TARP ideated all methods with assistance from LMG. MS and LMG supervised the project. All authors wrote the manuscript.
Thomas A. R. Purcell: Conceptualization (equal); Data curation (lead); Formal analysis (equal); Funding acquisition (equal); Investigation (lead); Methodology (equal); Software (lead); Writing – original draft (equal); Writing – review & editing (equal). Matthias Scheffler: Conceptualization (supporting); Funding acquisition (equal); Supervision (supporting); Writing – original draft (equal); Writing – review & editing (equal). Luca M. Ghiringhelli: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available in FigShare at http://dx.doi.org/10.6084/m9.figshare.23813889. The code used here can be found on GitLab: https://gitlab.com/sissopp_developers/sissopp.