With big datasets and highly efficient algorithms becoming increasingly available for many problem sets, rapid advancements and recent breakthroughs in the field of machine learning encourage more and more scientific fields to make use of such computational data analysis approaches. Still, for many research problems, the amount of data available for training a machine learning (ML) model is very limited. An important strategy to combat the problems arising from data sparsity is feature elimination—a method that aims at reducing the dimensionality of an input feature space. Most such strategies exclusively focus on analyzing pairwise correlations, or they eliminate features based on their relation to a selected output label or by optimizing performance measures of a certain ML model. However, those strategies do not necessarily remove redundant information from datasets and cannot be applied to certain situations, e.g., to unsupervised learning models. Neither of these limitations applies to the network-based, correlation-driven redundancy elimination (NETCORE) algorithm introduced here, which reduces the size of a feature vector by considering both redundancy and elimination efficiency. The NETCORE algorithm is model-independent, does not require an output label, and is applicable to all kinds of correlation topographies within a dataset. Thus, this algorithm has the potential to be a highly beneficial preprocessing tool for various machine learning pipelines.

Whether for self-driving cars, virtual assistants, person-tailored advertisements, or deepfake processing of videos, machine learning (ML) has revolutionized various areas of our daily life.1–3 Moreover, in many scientific areas including medicine,4–6 materials research,7,8 and the natural sciences,9,10 major breakthroughs were achieved by means of ML: for instance, a convolutional neural network trained on clinical images was able to detect skin cancer with the precision of an experienced dermatologist,11 and a complex deep learning model could predict the conformation of protein sequences with extraordinary accuracy.12

To a large extent, the success of such ML approaches depends on the size and the quality of the database available for generating the models:13 when samples are characterized by a vector containing quantitative measures of different properties (feature vector), a certain density of data points is needed to sufficiently cover the relevant region of the feature space. Here, an important but counterintuitive issue needs to be considered: even though it provides more information about each sample, a larger feature vector can reduce the prediction accuracy of an algorithm.14 This phenomenon is known as the “curse of dimensionality;” it often originates from a low data density in the multi-dimensional feature space, on the one hand, and from distance concentration, on the other hand (i.e., the observation that pairwise distances of sample points in a high-dimensional feature space tend to converge to the same value).15

There are, of course, strategies to mitigate the problems associated with this curse of dimensionality, e.g., increasing the sample size or reducing the dimensionality of the feature space. However, the former approach can be very costly and time consuming—especially for scientific problems that depend on experimental data acquisition. Thus, reducing the dimensionality of the dataset by selecting a subset of features while discarding all others is often the more feasible approach.16,17 Feature selection techniques can be broadly subdivided into three categories: wrappers, embedded methods, and filters.18,19 Wrapper methods use prediction results obtained with a specific ML model as a score to evaluate the usability of a given feature set.20 Basic model performance measures are, then, repeatedly assessed to identify an optimal feature set following a greedy search approach.21,22 One very common example is the sequential feature selection;23 here, features are iteratively added (forward selection) or removed (backward selection) to establish a feature subset in a greedy fashion. In each iteration, the best feature to be added or to be removed is chosen based on the cross-validation score of a given ML model. However, since a full ML algorithm has to be executed several times, wrapper methods usually come with high computational costs and long runtimes.24 Moreover, the optimality reached is model-specific (and not necessarily transferrable to other models) and this search for an optimum can even severely increase the overfitting of the optimized model.25 

Instead of eliminating the features “outside” the ML model, embedded methods directly integrate the feature selection into the learning process.26,27 One of the most popular examples for embedded feature selection is the random forest classifier, which selects an individual feature to split the dataset in each recursive step of the tree growth process.28,29 Other commonly used embedded feature elimination approaches are the LASSO30 (L1 penalty) and ridge31 (L2 penalty) regression for constructing a linear model. In these two methods, feature weights are, on purpose, reduced to zero (or almost zero), which basically corresponds to an elimination of those features. However, those methods usually do not analyze a (putative) redundancy within the features and the outcome of such ranking-based eliminations is only valid for the particular model that was used to conduct the feature elimination process.

Last, filter approaches analyze the intrinsic properties of a dataset and conduct feature selection independent of any ML model.32 Popular filter methods employ statistical quantifications to assess the impact a feature has on a given output label.33,34 Even though the most popular feature selection approaches are used for supervised learning, i.e., labeled data, filter methods can also be applied to analyze the feature space of unlabeled data. Here, since no class or prediction label can be used to guide the search for important information, the feature elimination has to be performed by solely evaluating the intrinsic properties of the dataset, such as feature dependence,35,36 the entropy of distances between data points,37 or the Laplacian score.38 Other examples of filter techniques to reduce the dimensionality of unlabeled data are principal component analyses (PCAs),39 factor analyses,40 or projection pursuit.41,42 However, rather than actually selecting some features while discarding others, those approaches perform feature transformations. Thus, even though a PCA, for instance, is easy to use, the interpretability of the selected principal components is usually rather low.43,44

Here, we propose an interpretable, model- and output-independent feature elimination algorithm to reduce the dimensionality of a feature space by considering both feature redundancy and elimination efficiency. The proposed network-based, correlation-driven redundancy elimination (NETCORE) algorithm translates the dataset into a correlation network, which is, then, analyzed by conducting an iterative, three-step decision procedure. With this approach, the algorithm selects a subset of features that represent the full feature space on the basis of a (freely selectable) correlation threshold while taking into account the multi-connectivity of a feature to its neighbors in the correlation network. Furthermore, we demonstrate the applicability of this algorithm to different molecular datasets and analyze the influence of varying the correlation threshold. Finally, we show that the algorithm is suitable to reduce the dimensionality of a dataset such that the prediction accuracy of a ML model is considerably improved.

For all molecules analyzed in this study, a 3D conformer representation was obtained from PubChem.45 This 3D structure was, then, imported into MarvinSketch (version 21.17.0, 2021, Chemaxon, http://www.chemaxon.com), and nine physicochemical properties were determined using the Calculator Plugins (version 22.3, 2022) of MarvinSketch (Table I). For the vast majority of parameters, default settings were used; only some minor adjustments regarding pH-ranges and analysis step sizes were made.

TABLE I.

Physicochemical properties determined for the analyzed molecules.

Feature | Definition | Calculation parameters
Rotatable bond count | Number of bonds in a molecule that allow for a conformational change of the molecule geometry through rotation around the respective bond46 | Decimal places: 2, aromatization method: general, and single fragment mode: no
van der Waals volume | Volume occupied by a molecule, i.e., space impenetrable for other molecules with thermal energies at ordinary temperatures47 | Energy unit: kcal/mol, decimal places: 2, set MMFF94 optimization: no, set projection optimization: no, calculate the lowest energy conformer: if molecule is in 2D, and optimization limit: normal
Molecular weight | Molecular mass calculated from standard atomic weights48,49 | Recognize formula in pseudo-labels: yes, use D/T symbols for deuterium/tritium: yes, and single fragment mode: no
Charge | The total charge of a molecule at pH 7.4 calculated from the weighted sum of its microspecies (excluding the natural form); weights are assigned according to the microspecies distribution at the given pH50 | Decimal places: 2, show charge distribution: yes, pH step size: 0.1, keep explicit hydrogens: no, and consider tautomerization/resonance: yes
Dipole moment | Net molecular polarity specified as the electron density that is unequally distributed between the atoms of the molecule51 | ⋯
Partition coefficient | Logarithm (logP) of the concentration ratio of a chemical dissolved in two different phases (here, octanol and water), where both concentrations are at equilibrium;52 in the case of multiple microspecies of a molecule, the non-ionic one was considered | Method: consensus, Cl concentration (mol/dm3): 0.1, Na+/K+ concentration (mol/dm3): 0.1, and consider tautomerization/resonance: no
Aromatic ring count | Number of aromatic rings in a molecule calculated from the smallest set of smallest aromatic rings (SSSAR)53 | Decimal places: 2, aromatization method: general, and single fragment mode: no
Hydrogen bond acceptor sites (average at pH 7.4) | Sum of lone electron pairs in a molecule, which are available for establishing a hydrogen bond54 | Decimal places: 2, donor: yes, acceptor: yes, exclude sulfur atoms from acceptors: yes, exclude halogens from acceptors: yes, show microspecies data by pH: yes, pH lower limit: 7, pH upper limit: 8, pH step size: 0.1, and display major microspecies: no
Hydrogen bond donor sites (average at pH 7.4) | Sum of hydrogen atoms connected to atoms in the molecule that have hydrogen donor properties | Decimal places: 2, donor: yes, acceptor: yes, exclude sulfur atoms from acceptors: yes, exclude halogens from acceptors: yes, show microspecies data by pH: yes, pH lower limit: 7, pH upper limit: 8, pH step size: 0.1, and display major microspecies: no

Based on the database we created as described above, a feature vector was generated, which quantitatively describes the physicochemical properties of the set of molecules studied here [Fig. 1(a)]. Then, a network-based feature elimination was performed to reduce the dimensionality of the feature space [Figs. 1(b)–1(f)]. More precisely, our goal was to create a new feature vector that contains features that are as distinct as possible from any other of the included features but still represent the eliminated features from the initial dataset in an optimal manner. Therefore, features from the input space that are strongly correlated with others are identified and removed as those features contain—at least to a certain extent—redundant information. If not stated otherwise, all implementations were conducted in Python (Python Software Foundation; Python Language Reference, version 3.9.12; http://www.python.org).55 Here, we made use of the following Python packages: pandas (v1.4.2),56,57 NetworkX (v2.7.1),58 NumPy (v1.22.3),59 Seaborn (v0.11.2),60 and Matplotlib (v3.5.1).61

FIG. 1.

Schematic overview of the different steps of the network-based feature elimination process. First, the selected physicochemical properties of molecules are translated into a feature vector (a) and a correlation matrix representing the correlation strength of the features is calculated (b). Using a predefined correlation threshold, this correlation matrix is converted into a network; here, each node represents a feature and edges denote correlation strengths that exceed the threshold value (c). The network is, then, reduced in a stepwise manner by identifying the feature of highest centrality, adding this feature to the (initially empty) new feature vector, and eliminating it together with all neighbors that are directly connected. To identify the features of highest centrality, the following steps are iteratively conducted: first, the feature of highest degree (i.e., the highest number of connected neighbors) is identified (d). In case of parity, the same analysis is conducted with those networks that would remain if the respective node and its connected neighbors were to be removed from the network (e). Finally, if still no unambiguous decision is possible, the mean correlation strength of the node to its direct neighbors is calculated and the feature with the highest average correlation strength is chosen (f).


Step 1: Correlation matrix

First, the feature space was analyzed by creating a Pearson’s correlation matrix [Fig. 1(b)]. Here, a pairwise comparison of all n physicochemical properties (=features) was made by calculating the linear correlation strength between feature x and feature y as described by the Pearson’s correlation coefficient r [see Eq. (1)],

\[ r = \frac{\sum_{i}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i}\left(x_i - \bar{x}\right)^{2}}\,\sqrt{\sum_{i}\left(y_i - \bar{y}\right)^{2}}}, \qquad (1) \]

where $x_i$ and $y_i$ denote the values of features $x$ and $y$ for sample $i$, and $\bar{x}$ and $\bar{y}$ are the corresponding means over all samples.

Importantly, features for which no correlation coefficients can be calculated are eliminated in this step without being added to the reduced feature vector. This mainly affects two types of features: features that are characterized by non-numerical inputs, and features that are represented by the exact same value for all samples. Whereas, for the latter, it is desirable to eliminate such features directly (as they contain no valuable information), non-numerical features might be important. Hence, datasets containing such non-numerical features should be preprocessed by performing a basic label encoding (as, for instance, provided by pandas56,57 and sklearn62 libraries) to convert non-numerical inputs to numerical ones; then, those features can be considered in the subsequent stages of the NETCORE algorithm.
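As an illustration of this first step, the following minimal Python sketch (not part of the published implementation; the file name is hypothetical) computes the Pearson correlation matrix with pandas after removing features that cannot be correlated:

```python
# Minimal sketch of step 1 (assumption: the molecular features are stored in a
# CSV file with one row per molecule and one column per feature).
import pandas as pd

df = pd.read_csv("antibiotics_features.csv")  # hypothetical input file

# Non-numerical columns cannot enter the correlation analysis directly; they
# would need label encoding first (see the remark above).
numeric = df.select_dtypes(include="number")

# Features with the exact same value for all samples yield undefined correlation
# coefficients and are dropped here without being added to the reduced vector.
numeric = numeric.loc[:, numeric.nunique() > 1]

# Pairwise Pearson correlation matrix of all remaining features.
corr = numeric.corr(method="pearson")
```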

Step 2: Network generation

Based on the obtained n × n correlation matrix calculated in the previous step, a correlation network was created [Fig. 1(c)]. This correlation network consists of n nodes, and each node represents one physicochemical property (feature). Then, edges connecting the nodes were added only if the absolute value of the correlation coefficient between the two respective properties equals or exceeds a defined threshold t (if not stated otherwise, t = 0.6 was selected). Last, each added edge was assigned a weight corresponding to the correlation coefficient of the connected feature pair.

In this step, the user can adjust the algorithm by changing the correlation threshold t, i.e., the correlation strength needed to consider two features to be redundant. Here, since t is used to filter absolute values of correlation coefficients, only threshold values between 0 and 1 are reasonable. A value of 1 would yield a correlation network in which edges are only present between fully correlated features. Since, in the further procedure of the algorithm, features can only be eliminated when they are connected to another feature via an edge (see steps 3 and 4), selecting a threshold of 1 entails the elimination of fully redundant features only. In contrast, a correlation threshold of 0 leads to a fully connected correlation network; i.e., each feature is connected to every other feature. Hence, as per the criteria described in steps 3 and 4, only one feature would be selected to represent the whole dataset. Thus, choosing the right correlation threshold is not trivial: smaller values for t lead to a stronger reduction of the feature vector, whereas larger values for t enforce that a stronger redundancy is required for elimination. As a rule of thumb, correlation coefficients below 0.35 are generally considered to represent weak correlations, moderate correlations are indicated by coefficients between 0.36 and 0.67, and strong or very strong correlations lead to correlation coefficients above 0.67 or 0.9, respectively.63 As a default, we here chose a correlation threshold of 0.6, which represents a moderate to strong correlation.
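A correlation network of this kind can be generated, for example, with NetworkX; the following sketch (our own illustration, not the published implementation) uses the correlation matrix `corr` from the previous step and the default threshold t = 0.6:

```python
# Minimal sketch of step 2: build a weighted correlation network in which edges
# connect features whose absolute correlation reaches the threshold t.
import itertools
import networkx as nx

t = 0.6                                    # user-defined correlation threshold
G = nx.Graph()
G.add_nodes_from(corr.columns)             # one node per feature

for f1, f2 in itertools.combinations(corr.columns, 2):
    r = corr.loc[f1, f2]
    if abs(r) >= t:                        # redundancy criterion
        G.add_edge(f1, f2, weight=abs(r))  # weight = correlation strength
```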

Step 3: Elimination of isolated nodes

After generating the correlation network, a first analysis step is performed to identify all independent features, i.e., nodes that have no edge to any other node. Such isolated nodes represent features that exhibit no pairwise correlation to any other feature exceeding the previously defined correlation threshold. In other words, based on the correlation threshold chosen, features are considered to be independent when they are not sufficiently represented by any other feature in the dataset. As those features contain information that is not redundant, such features are identified first and directly added to the new, reduced feature vector. Importantly, this step is conducted during each iteration of the NETCORE algorithm—more specifically, before each iteration of the centrality analysis described in step 4.
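In code, this step reduces to collecting the isolated nodes of the current network, e.g. (again a sketch rather than the authors' implementation):

```python
# Minimal sketch of step 3: isolated nodes are independent features and are
# moved directly into the (initially empty) reduced feature vector.
import networkx as nx

selected = []                          # the new, reduced feature vector
isolated = list(nx.isolates(G))        # nodes without any edge
selected.extend(isolated)
G.remove_nodes_from(isolated)
```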

Step 4: Centrality analysis

To finally identify a reduced feature set that accurately represents the whole dataset without containing redundant information, the previously generated network was iteratively reduced by repeatedly analyzing the degree centrality of the nodes. Different from previously reported approaches that used eigenvector centrality64 to calculate importance weights based on the ability of a feature to discriminate between classes,65 the degree centrality criterion used here iteratively identifies and resolves local “clusters” of features containing redundant information and this process does not require any class labels. Another difference is that the degree centrality is directly derived from the graph without any spectral analysis. To this end, the number of direct neighbors a node has in the interconnected correlation network is determined [Fig. 1(d)]. If a unique node of highest degree (i.e., a node with the highest number of connected neighbors) could be identified, the corresponding feature was added as the first entry to an initially empty vector (which, thus, became an n × 1 vector with n = 1). At the same time, this node was removed from the network together with all of its directly connected neighbors (this whole procedure will hereafter be referred to as “fixing a parameter”). If such a unique node of highest degree could, indeed, be identified, the centrality analysis iteration was already complete, and a second iteration could be initiated with the remaining network. In the case of parity of multiple nodes with the same degree, however, the result of fixing each of those “candidate nodes” was individually considered and the respective remaining networks were analyzed in more detail: now, the degree of all nodes remaining in the network (after eliminating the candidate node and its direct neighbors) was determined [Fig. 1(e)]. Then, in order to maintain a network for further analysis that was as well-connected as possible and to avoid an unfavorable segmentation of the network into disjoint subnetworks, the particular candidate node leading to a network that contains the node of highest degree in all candidate networks was chosen to be fixed next. Finally, if no clear decision was possible after this step either, the mean correlation strength of each candidate node to all of its direct neighbors was calculated [Fig. 1(f)]. From those candidate nodes, the node showing the highest mean correlation strength was identified, added to the reduced feature vector, and eliminated from the network together with its direct neighbors. However, in some cases where a unique node could still not be identified, the feature that was evaluated first (by random choice) was added to the feature vector. This was, for instance, the case for a connected pair of nodes that was otherwise isolated in the network (and, thus, had no further neighbors as required for an algorithm-based selection). Overall, step 3 and step 4 were repeated until all nodes in the network were successfully eliminated; at this point, a maximally reduced feature vector was established. This iterative approach ensures that, even though the network is changed when a node and its neighbors are eliminated, the next node is again chosen based on optimal centrality.
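The following sketch summarizes one possible implementation of this iterative three-criteria procedure; it is our simplified reading of the description above (using the `G` and `selected` objects from the previous sketches), not the published NETCORE code:

```python
# Minimal sketch of step 4: iterative, degree-centrality-based reduction of the
# correlation network following the three decision criteria described above.
import networkx as nx


def mean_neighbor_correlation(graph, node):
    """Criterion 3: mean correlation strength of a node to its direct neighbors."""
    weights = [graph[node][nbr]["weight"] for nbr in graph.neighbors(node)]
    return sum(weights) / len(weights) if weights else 0.0


while G.number_of_nodes() > 0:
    # Step 3, repeated in every iteration: fix all isolated nodes directly.
    isolated = list(nx.isolates(G))
    selected.extend(isolated)
    G.remove_nodes_from(isolated)
    if G.number_of_nodes() == 0:
        break

    # Criterion 1: node(s) of highest degree.
    degrees = dict(G.degree())
    max_degree = max(degrees.values())
    candidates = [n for n, d in degrees.items() if d == max_degree]

    if len(candidates) > 1:
        # Criterion 2: simulate fixing each candidate and keep those candidates
        # whose remaining network contains the overall node of highest degree.
        remaining_max = {}
        for cand in candidates:
            removed = {cand, *G.neighbors(cand)}
            rest = G.subgraph(set(G.nodes) - removed)
            remaining_max[cand] = max(dict(rest.degree()).values(), default=0)
        best = max(remaining_max.values())
        candidates = [c for c in candidates if remaining_max[c] == best]

    if len(candidates) > 1:
        # Criterion 3: highest mean correlation strength to the direct neighbors.
        candidates.sort(key=lambda c: mean_neighbor_correlation(G, c), reverse=True)

    chosen = candidates[0]                 # falls back to the first candidate
    selected.append(chosen)                # "fixing" the chosen feature
    G.remove_nodes_from([chosen, *G.neighbors(chosen)])

# `selected` now contains the maximally reduced feature vector.
```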

To test the success of the NETCORE algorithm regarding the elimination of correlating features, we calculated the variance inflation factor (VIF)—a common metric to quantify the severity of occurring multicollinearity.66 In brief, the VIF describes how much the standard error of the regression coefficient of a predictor variable (feature) in a linear regression model is increased due to multicollinearity. In particular, the VIF of a certain feature i can be obtained according to the following equation:

\[ \mathrm{VIF}_i = \frac{1}{1 - R_i^{2}}, \qquad (2) \]

where $R_i^{2}$ denotes the $R^{2}$-value obtained by regressing feature $i$ based on all remaining features.
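For reference, the VIF can be computed, for instance, with the statsmodels package; the sketch below (an assumption about tooling, not necessarily what was used in this study) evaluates Eq. (2) for every feature of a pandas DataFrame `features`:

```python
# Minimal sketch: variance inflation factors for all columns of `features`.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(features)     # add an intercept column for the regressions
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],       # skip the intercept column itself
    name="VIF",
)
print(vif.sort_values(ascending=False))
```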

In this study, two supervised machine learning models, i.e., a random forest (RF) classifier and a k nearest neighbors (KNN) classifier, were employed as described by Rickert et al.67 All ML algorithms were implemented using Python (version 3.9.12) with the NumPy extension for data handling (version 1.22.3)59 as well as the machine learning toolbox scikit-learn (version 1.0.2).62 In brief, for both classifiers, data were preprocessed with a Minmax feature scaling [$x_{\mathrm{scaled}} = (x - \mathrm{min}_{\mathrm{feature}})/(\mathrm{max}_{\mathrm{feature}} - \mathrm{min}_{\mathrm{feature}})$]. The RF classifier was used with n = 100 independent trees without setting a maximal tree depth; the Gini impurity was used to measure the quality of a split. For the KNN classifier, n = 5 neighbors were taken into account for each classification; neighbor weights were assigned as the inverse of their distance to the point of interest; a brute force algorithm was used to find the nearest neighbors; and the Minkowski distance was used as a distance metric. The accuracy of the algorithm was evaluated using a repeated (n = 10) stratified k-fold (k = 5) cross-validation with no predefined random state.
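A scikit-learn setup along these lines could look as follows (a sketch assuming a feature matrix `X` and binary labels `y`; hyperparameters follow the values stated above):

```python
# Minimal sketch of the two classifiers and the cross-validation scheme.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10)   # no fixed random state

rf = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=100, max_depth=None, criterion="gini"),
)
knn = make_pipeline(
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance",
                         algorithm="brute", metric="minkowski"),
)

rf_scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")
knn_scores = cross_val_score(knn, X, y, cv=cv, scoring="accuracy")
print(f"RF: {rf_scores.mean():.3f}, KNN: {knn_scores.mean():.3f}")
```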

The NETCORE algorithm developed here aims at reducing the dimensionality of molecular datasets by identifying and removing redundant information. Since such a dimensionality reduction is especially important for the analysis of experimental datasets—where the number of samples is usually very limited—we first test the algorithm on a dataset created for an experimental drug loading study using different antibiotics (see the supplementary material, Secs. S1 and S2). This rather small dataset contains 14 commonly used antibiotics, each of which is characterized by a feature vector summarizing nine physicochemical properties of the molecules [Fig. 2(a)]. Since the NETCORE algorithm conducts an output-independent analysis, no further label is required.

FIG. 2.

Application of the network-based feature elimination strategy to a sample dataset generated from 14 antibiotics. The initial feature vector comprises nine physicochemical properties of the antibiotics (a), which exhibit different levels of correlation to each other as quantified by Pearson’s correlation coefficients (b). Based on a correlation threshold of 0.6, a correlation network is generated (c) and iteratively reduced. First, four isolated nodes are eliminated and the degree of all remaining nodes is determined (d). As four nodes (“vdWv,” “MolW,” “HyBAS,” and “HyBDS”) share the highest degree, the four networks that would remain after fixing each of the candidate nodes are simulated (e). As no individual, simulated network comprising an overall node of highest degree can be identified, the mean correlation of each candidate node to its connected neighbors is calculated (f), “MolW” is fixed, and the remaining node “PartCo” is added to the new feature vector (g).


As described in the section titled Methods, the available database is first translated into a correlation matrix based on the Pearson’s correlation coefficient r [Fig. 2(b)]. Here, correlation coefficients of r = 1 denote maximal direct correlation, whereas correlation coefficients of r = −1 denote maximal inverse correlation. From the correlation coefficients calculated for the antibiotics dataset, a first overview of the relations between the distinct features can be obtained. For instance, strong correlations (r > 0.8) are present between the rotatable bond count, the van der Waals volume, the molecular weight, and the number of hydrogen bond acceptor sites. In contrast, the dipole moment and the aromatic ring count show no strong correlations to any of the other features.

By transforming the correlation matrix into a correlation network, a much better graphical representation of those relations can be achieved. To do so, a node is created for each of the features and edges are added if the correlation coefficient between two features equals or exceeds a predefined correlation threshold. Selecting this hyperparameter, of course, is not always trivial as it fundamentally determines what correlation strength is regarded as sufficient to consider the information provided by two features to be redundant. For our first analysis, we chose a correlation threshold of |r| ≥ 0.6, which represents a moderate correlation and leads to a well-connected correlation network as depicted in Fig. 2(c).

From the generated network, features showing no connection to any other feature can directly be identified as non-redundant. Indeed, for the analyzed antibiotics dataset, the features “charge,” “dipole moment,” and “aromatic ring count” are not sufficiently represented by any of the other features. Consequently, those features need to be added to the initially empty new feature vector and any subsequent analysis will be performed on the interconnected network only [Fig. 2(d)]. To identify the next feature to be added to the new feature vector, we search for the node of highest degree centrality; to do so, we first determine the degree of all nodes (this parameter is defined as the number of neighbors connected to a particular node). However, for the network under investigation, there is no unique node of highest degree; instead, four out of six nodes (i.e., those representing the “van der Waals volume,” the “molecular weight,” the “hydrogen bond acceptor sites,” and the “hydrogen bond donor sites”) all exhibit a degree of four. When applying the second sorting criterion, the situation does not improve: for this very small network, removing one of the four candidate nodes would always lead to a network comprising one individual node only [see Fig. 2(e) for a simulation of the respective outcome]. Hence, the maximal degree of the nodes of all possibly remaining networks equals 0, which still does not allow for selecting the next feature in an unambiguous manner.

Since, in this particular case, neither the first nor the second selection criterion can provide an unambiguous answer, we next determine the average correlation coefficients of the candidate nodes with respect to their directly connected neighbors [Fig. 2(f)]. Now, the algorithm finds that the feature “molecular weight” has the highest average correlation strength among the candidate nodes. Consequently, the feature “molecular weight” is added to the new feature vector and all features directly connected to “molecular weight” (i.e., “rotatable bond count,” “van der Waals volume,” “hydrogen bond acceptor sites,” and “hydrogen bond donor sites”) are removed from the network. Having eliminated those four nodes, the network finally only comprises one last node (“partition coefficient”). Similar to how the isolated nodes from the initial network were handled, also this single remaining node has to be added to the new feature vector (and, thus, can be removed from the network). After this step, the remaining correlation network is empty; i.e., all nodes were either added to the new feature vector or eliminated because another (redundant) node was added to the new feature vector. Once this state is reached, the feature elimination procedure is completed and the final, reduced feature vector is established. For the antibiotics dataset analyzed thus far, the reduced feature vector comprises five features: “charge,” “dipole moment,” “aromatic ring count,” “molecular weight,” and “partition coefficient” [Fig. 2(g)]. In other words, the four features “rotatable bond count,” “van der Waals volume,” “hydrogen bond acceptor sites,” and “hydrogen bond donor sites” were eliminated as the information they carry was determined to be redundant to those provided by the selected features.

To assess the success of the NETCORE algorithm with regard to reducing the redundancy of the feature set, we next calculate the variance inflation factor (VIF), a regression-based descriptor that quantifies the multicollinearity of a feature to a set of other features. A VIF of 1 represents the absence of multicollinearity, whereas increasing values indicate increasing levels of multicollinearity. Even though there is no universal cutoff value for this problem, there is a rule of thumb:68 the features contained in a feature vector are sufficiently uncorrelated when no feature exhibits a VIF > 5–10. When we analyze VIF values of the full initial feature vector (Fig. 3, blue bars), only the dipole moment comes with a fairly acceptable VIF of ∼9; all other features show high multicollinearity represented by VIFs of >10 (up to values on the order of a few thousand). In contrast, after applying the NETCORE algorithm (Fig. 3, cyan bars), we obtain a completely different picture: now, the five features included in the new feature vector show low-to-moderate multicollinearity between each other. This confirms that, indeed, a set of uncorrelated features was selected by the NETCORE algorithm. Importantly, the four eliminated features exhibit high multicollinearity to the features contained in the new, reduced feature vector. This underscores that the eliminated features are sufficiently represented by the selected features of the reduced feature vector. In other words, two targeted properties were, indeed, achieved: first, a feature vector was successfully created that contains as little redundancy as possible; second, this reduced feature vector still adequately represents the eliminated information.

FIG. 3.

Variance inflation factor (VIF) of all analyzed features before and after reducing the feature vector. The VIF determined for all features of the initial feature vector (blue bars) is compared to the VIF of all features in the new, reduced feature vector (cyan bars, left side). Additionally, the VIF of the eliminated features was calculated based on the new, reduced feature vector (cyan bars, right side). The red zone denotes the threshold between moderate and strong multicollinearity as defined by Craney and Surles.68


Having shown that the NETCORE algorithm is, indeed, capable of condensing an initial feature vector to a set of rather uncorrelated features, we next have a closer look at the three individual decision criteria applied by the algorithm. To this end, we first generate networks with different grades of interconnectivity by adjusting the correlation threshold t to either t = 0.8 or to t = 0.3; with this approach, a sparsely and a strongly connected network, respectively, are obtained [see Figs. 4(a) and 4(b)].

FIG. 4.

Illustration of the three decision criteria employed by the network-based feature elimination algorithm. Based on the previously introduced correlation matrix [Fig. 2(b)], two correlation networks were generated by applying a correlation threshold of t = 0.8 (a) and t = 0.3 (b), respectively. The sparsely connected network (a) illustrates the importance of decision criterion 3 (highest mean correlation), whereas the densely connected network (b) demonstrates the application of decision criterion 1 (highest degree). Additionally, a third correlation network was artificially created (c) to illustrate a situation where decision criterion 2 (highest degree in the remaining networks) is required. Here, when eliminating candidate node D, three individual nodes remain (d), whereas three connected nodes remain when candidate node E is eliminated (e).


For the sparsely connected network [Fig. 4(a)], all isolated nodes can be directly added to the new feature vector. In this particular scenario, however, the important step is the analysis of the four connected nodes: here, the algorithm detects a cluster of correlating nodes and selects the one with the highest mean correlation strength to the other three connected nodes (criterion 3; for details, see the section titled Methods). This decision strategy prevents a random feature from being picked; instead, the feature that best represents the eliminated nodes is selected and added to the new feature vector.

In contrast, for the strongly interconnected network [Fig. 4(b)], the key elimination criterion is a different one: here, to ensure that the most efficient reduction of the network is achieved, the node with the highest degree, i.e., the node with the highest number of connected neighbors in the network, is identified (criterion 1). In fact, there are three candidate nodes for this step (i.e., “vdWv,” “MolW,” and “Chr”) that all exhibit a degree of 7. In other words, adding one of those features to the new feature vector would allow for removing eight features (the candidate node and its directly connected neighbors) from the network. This degree-based selection ensures that as many features as possible are represented by the chosen feature. However, to unambiguously decide which of the identified candidate nodes to pick, the next two selection criteria still need to be applied: in each case, adding one of those candidate nodes to the new feature vector would lead to a network comprising one node only (criterion 2); thus, the mean correlation strength needs to be calculated (criterion 3). When doing so, “MolW” shows the highest mean correlation strength among those three candidates, which is why this particular feature is added to the feature vector. Then, in a last step, only one node remains, which concludes the selection process.

So far, we have mainly seen the importance (and effect) of decisions made based on criteria 1 (highest degree) and 3 (highest mean correlation). However, to demonstrate the significance of criterion 2 (highest degree in the remaining network), we next analyze the artificially generated network depicted in Fig. 4(c). In this network, there are two nodes that share the highest degree (=3). However, fixing one of these nodes entails a completely different remaining correlation network: when adding node “D” to the new feature vector, three individual nodes (“A,” “F,” and “G”) are left [Fig. 4(d)]. Since those nodes do not have any connection to other nodes, it would be mandatory to include the corresponding features into the new feature vector. Hence, when choosing node “D” to be fixed, the final feature vector inevitably contains four features to properly represent all initial features. This, however, can be avoided by instead fixing node “E” [Fig. 4(e)]: by doing so, no isolated nodes remain. As the three remaining nodes (“A,” “B,” and “C”) are connected to each other, they can all be represented by adding node “A” to the new feature vector. With this choice, the final, reduced feature vector consists of two features only (compared to four entries that would have been obtained if node “D” had been fixed). Hence, analyzing the networks that would remain after eliminating a candidate node helps as it avoids an unfavorable segmentation of the network into disjoint subnetworks or isolated nodes; this procedure entails a more efficient reduction of the feature space, as more features can still be represented by their connected nodes.

Having analyzed the working principle of the NETCORE algorithm, we now return to the analysis of molecular datasets. Until now, we have applied the algorithm to reduce the feature vector of a dataset representing the molecular properties of antibiotics. In a next step, we analyze datasets obtained for three additional molecular classes, i.e., fluorophores, antioxidants, and vitamins (see the supplementary material, Secs. S1 and S2), as well as a combination of all four. Consistent with the analysis performed at the beginning of this article, we select a correlation threshold of t = 0.6 to create the correlation network. The initial feature vector is identical for all datasets.

When applying the NETCORE algorithm, we observe that the new, reduced feature vector differs for all four molecular classes (Fig. 5). Compared to what we obtained for the antibiotics, the new feature vector obtained for fluorophores includes one additional feature, namely the “hydrogen bond donor sites.” This indicates that, for the fluorophores, this very feature was not sufficiently represented by the other five features that were included into the new feature vector. In contrast, for vitamins, only two features were left after feature elimination. This implies that, for this particular class of molecules, there was a strong correlation among the initial features, which allows those vitamin molecules to be represented by two features only. The new feature vector created for the antioxidants contains five features again. However, different from the result obtained for the antibiotics, here, the features “molecular weight” and “dipole moment” are replaced by “van der Waals volume” and “hydrogen bond acceptor sites.” Finally, when pooling all four datasets, the new feature vector obtained for this mixed set of molecules is identical to that obtained for the antibiotics dataset. However, this result is reasonable: those five features are the ones that occurred most frequently in the reduced feature vectors of the individual molecular classes. Overall, the analysis conducted with the four different molecular datasets nicely demonstrates how the NETCORE algorithm successfully adapts to the molecular peculiarities of the analyzed datasets and generates database-specific new (=reduced) feature vectors.

FIG. 5.

Reduced feature vectors as obtained for small datasets representing different molecular classes. In addition to the previously analyzed dataset generated from antibiotics, datasets describing fluorophores, vitamins, and antioxidants and a larger, pooled dataset containing all four molecular classes were analyzed with the same network-based algorithm. All molecules were characterized by the same initial feature vector, which contains quantitative descriptors of the same set of nine physicochemical characteristics.


Feature elimination (as performed here by the proposed algorithm) is a common step involved in the preprocessing of datasets for machine learning (ML). As the NETCORE algorithm does not require an output label, it can be used on datasets for the application of both unsupervised and supervised learning methods. Hence, in a next set of trials, we test the impact of the achieved feature reduction on the classification accuracy of two common machine learning methods—a random forest classifier (RF) and a k nearest neighbors classifier (KNN). The dataset used for these tests was obtained from Wu et al.69 and contains binary labels of binding results for 1513 (putative) inhibitors of human β-secretase 1; here, each inhibitor molecule is characterized by 590 physicochemical features. When applying the NETCORE strategy to this dataset, the feature vector can be gradually reduced by adjusting the correlation threshold [Fig. 6(a)]: starting with a correlation threshold of 1.0 (here, only fully correlating features and those features that cannot be correlated are excluded), the number of features in the new feature vector continuously decreases until only 61 features are left when a correlation threshold of 0.5 is chosen.

FIG. 6.

Number of features in the new feature vector and accuracy of ML models trained with feature vectors of different sizes. The network-based feature elimination algorithm was applied to a dataset comprising 590 features (full set). By decreasing the correlation threshold t, the number of features in the new feature vector is reduced (a). The new, reduced feature vectors are then used to train and test an RF classifier and a KNN classifier; both classifiers were tasked to predict the binary label of a molecule (b). The displayed values denote the mean accuracy obtained from a repeated (n = 10) stratified k-fold (k = 5) cross-validation. Error bars represent the standard error of the mean (as determined from those 50 total runs); those error bars have similar sizes for prediction results obtained with the initial dataset and reduced feature vectors obtained with different correlation thresholds (but are smaller than the symbol size). The mean VIF of features of the initial dataset (containing only features that can be correlated; for details, see the section titled “Requirements regarding data representation”) is compared to the mean VIF of the features included in reduced feature vectors that were generated by executing the NETCORE algorithm with a correlation threshold of either t = 0.6 or t = 0.7 (c).


When using this molecular dataset for training and testing the RF and the KNN classifier with the aim of predicting the binary binding label of each sample based on the provided feature vector, we observe a change in the accuracy of both classifiers [Fig. 6(b)]. The RF model mainly profits from removing fully redundant information (and those features that cannot be correlated): when applying a feature elimination based on a correlation threshold of 1.0 (by which the size of the feature vector is reduced from 592 to 340 features), the prediction accuracy increases from ∼90% to slightly over 99%. For the KNN (which is a classifier that is known to perform badly when challenged with data of high dimensionality), an almost linear increase in accuracy is observed as the feature vector becomes smaller. However, once a correlation threshold of 0.7 is reached (which is commonly used as a threshold indicating a sufficiently strong level of correlation), a maximum in the accuracy is obtained. In other words, applying the NETCORE algorithm can significantly improve the performance of either ML classifier. Last, we assess the impact the feature elimination has on the multicollinearity within the feature vectors [Fig. 6(c)]. To this end, the mean VIF of the features included in the initial dataset (which has already been cleared of any features that cannot be correlated) is calculated and compared to the mean VIF of feature vectors created by applying the NETCORE algorithm. Here, two different correlation thresholds are tested, namely t = 0.6 (similar to the VIF evaluation performed for the antibiotics dataset) and t = 0.7 (which delivered the best results for the ML analysis). Consistent with the results presented above for the small antibiotics dataset, the NETCORE algorithm also achieves an immense decrease in the average VIF of the features included in the feature vector for this big inhibitor dataset, i.e., by four orders of magnitude [Fig. 6(c)]. The obtained results nicely demonstrate that the NETCORE algorithm can be easily scaled to datasets that contain both higher numbers of samples and higher feature dimensionalities. With increasing size of the dataset, the only limiting factor might become the runtime. The runtimes required to analyze the small molecular datasets studied here are in the range of several milliseconds only (see the supplementary material, Sec. S3). With increasing size of the dataset, this runtime increases, of course. However, even for the big inhibitor dataset, the full NETCORE algorithm is executed within ∼3 s (when running the NETCORE script on a MacBook Pro 2017 equipped with a 3.1 GHz Dual-Core Intel Core i5 processor), which we consider very reasonable for such a feature elimination task.

In a next step, we compare the performance of the NETCORE algorithm to two basic correlation-based feature selection approaches (for details regarding the methods used here, please refer to the supplementary material, Sec. S5). When analyzing the big inhibitor (BACE) dataset with a random correlation-based elimination method (applying a correlation threshold of t = 0.6), a reduced feature vector containing 74 features is obtained. For all eliminated features, we then calculate the maximal correlation strength these features have to a feature from the reduced feature vector (from now on, we refer to this value as “representation strength”). We observe that this random elimination strategy based on a correlation matrix alone is not able to provide a reduced feature vector that sufficiently represents all eliminated features according to the predefined correlation threshold (some features are only represented with a correlation of 0.25 even though a correlation threshold of t = 0.6 was applied; the corresponding data are shown in the supplementary material, Sec. S6). This problem mainly arises from the fact that a feature that was previously chosen to represent an eliminated feature can afterward be eliminated itself. In contrast, the NETCORE algorithm creates a reduced feature vector that represents all eliminated features with representation strengths that exceed the predefined correlation threshold.

To make sure not to drop features that are needed to represent others, it is possible to analyze only the upper triangle of the correlation matrix. When applying such an “upper triangle” method to the BACE dataset, a feature vector is obtained that, indeed, properly represents all eliminated features. However, the reduced feature vector generated this way contains 91 features, whereas the NETCORE algorithm is able to reduce the original feature vector to only 84 features (when applying the same correlation threshold of t = 0.6). At this point, we would like to mention again that the primary goal of NETCORE is to identify the smallest possible feature vector; then, for this minimal number of features, the representation strength is optimized. Hence, even though the upper triangle method can identify a suitable feature vector, its selection result is sub-optimal in terms of elimination efficiency. Most likely, this occurs since a feature can only be eliminated if it is redundant to a feature that is described by a column of the correlation matrix located to the left of the feature to be eliminated; this entails two major complications. First, the feature elimination process is subject to a certain bias: features that are located in the “beginning” of the correlation matrix tend to stay in the feature vector, whereas features that are located in “later” columns are likely to be discarded. Second, the reduced feature vector obtained from the upper triangle method strongly depends on the order by which the features appear in the correlation matrix. Thus, certain constellations in the correlation matrix also allow only for certain eliminations and this limits the identification of an “optimal” reduced feature vector.
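For comparison, a generic implementation of such an “upper triangle” filter (our own sketch, not necessarily the exact variant used here) illustrates this order dependence: a feature is dropped as soon as it correlates above t with any feature located further to the left in the correlation matrix.

```python
# Minimal sketch of the "upper triangle" baseline; `corr` is a pandas
# correlation matrix and t the correlation threshold.
import numpy as np

t = 0.6
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= t).any()]
reduced = [col for col in corr.columns if col not in to_drop]
```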

When applying these two correlation-based feature elimination strategies to the five small molecular datasets we already discussed in this paper, the outcome is sub-optimal as well (the reduced feature vectors created by different algorithms are depicted in the supplementary material, Fig. S2). For all the five small molecular datasets, the random feature selection based on the full correlation matrix delivers clearly unsatisfactory results (Fig. S2, “random”): the representation strength of all eliminated features is considerably lower than those of the feature sets identified by the NETCORE algorithm—for several features, this value is even far below the initially selected correlation threshold (Fig. S2). As mentioned above, this problem mainly arises from the fact that a feature that was previously chosen to represent an eliminated feature can afterward be eliminated itself. A particularly pronounced example of this effect is observed for the “vitamins” dataset. Here, only one feature is kept in the reduced feature vector, but five of the eliminated features are not sufficiently represented: the feature “rotatable bonds,” for instance, only exhibits a representation strength of 0.04. In contrast, when using the NETCORE algorithm, all features are sufficiently represented by the reduced feature vector—for all molecular datasets tested.

Unlike the random feature selection algorithm, the "upper triangle" approach does create a reduced feature vector that sufficiently represents all eliminated features (based on the predefined correlation threshold of t = 0.6). However, even though the resulting reduced feature vector contains the same number of features as the one created by NETCORE, the representation strengths of the eliminated features are lower when the "upper triangle" approach is used. As mentioned above, the elimination is here strongly influenced by the order of the features in the correlation matrix. Hence, the feature "rotatable bonds," for instance, is kept in the reduced feature vectors of all five molecular classes simply because it is the first feature to be analyzed and thus has no preceding feature to which it could be considered redundant. For the pooled dataset, this leads to a sub-optimal outcome: the feature vectors obtained with the NETCORE algorithm and the upper triangle method are very similar (four out of five features are identical). However, whereas the upper triangle method includes the rotatable bond count in the feature vector (for the reason described above), the NETCORE algorithm instead chooses the more central "molecular weight" feature. With this small but important difference in choice, the mean representation strength of the eliminated features improves from 0.75 to 0.83.

In conclusion, the NETCORE algorithm outperforms the two basic correlation-based feature elimination strategies, as it reduces the feature space more efficiently while optimizing the representation strength of the eliminated features. Additionally, whereas the runtime the NETCORE algorithm requires to analyze a big dataset is slightly longer than that of the other two approaches, for small datasets, NETCORE is the fastest of the three tested methods (see the supplementary material, Fig. S3).

Finally, we briefly compare the NETCORE algorithm to a variance threshold filter, a common method for the dimensionality reduction of unlabeled data. The idea of this filter is to remove all features with a low variance, as those features are assumed to be irrelevant (for details, see the supplementary material, Sec. S5). Since the goal of this algorithm is not to perform a correlation-based redundancy elimination, judging it by the representation strengths it achieves would not be a fair comparison. Instead, we compare the size of the reduced feature vector obtained with this variance threshold filter to the size of the feature vector generated by the NETCORE algorithm. When applied to the five small molecular datasets, this variance-based algorithm eliminates only very few features (two features each for the vitamins and antioxidants datasets and one feature for the fluorophores dataset) or is unable to eliminate any feature at all (for the antibiotics and the pooled datasets). Hence, we conclude that, for those particular datasets, applying a variance threshold filter is not ideal. When applied to the big inhibitor dataset, variance-based filtering reduces the feature vector to 200 features; however, no considerable improvement of the KNN accuracy is achieved. Thus, for this particular combination of dataset and ML model, the NETCORE algorithm returns better results than the variance threshold filter.
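
As a point of reference, the following sketch applies such a variance threshold filter with scikit-learn; the threshold value and the toy feature matrix are illustrative and not taken from the datasets analyzed here.

```python
# Illustrative sketch of a variance threshold filter (an unsupervised baseline).
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [180.2, 3, 1.0],
    [151.2, 1, 1.0],
    [194.2, 0, 1.0],   # the third column is constant and will be removed
])

selector = VarianceThreshold(threshold=0.1)  # drop features with variance <= 0.1
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of the kept features
print(X_reduced.shape)
```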

As described in the section titled Methods, the NETCORE algorithm analyzes a dataset based on the correlation matrix generated from the entries of the feature vectors. One prerequisite for this step is that the features can actually be correlated, i.e., that a correlation coefficient can be calculated. For this to be possible, the features have to be characterized by numerical data. In turn, features described by other datatypes cannot be analyzed by the NETCORE algorithm: since the correlation matrix ignores features with non-numerical entries (e.g., strings), those features are excluded from any further analysis. Importantly, such features are not included into the new, reduced feature vector but simply ignored in all steps following the creation of the correlation matrix. Hence, (partially) non-numerical datasets require a preprocessing step, which can be achieved by simple label encoding as provided, for instance, by the pandas56,57 and sklearn62 libraries.
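
A minimal sketch of such a preprocessing step is given below, assuming a hypothetical DataFrame with one string-valued column; both routes use only standard pandas and scikit-learn calls.

```python
# Illustrative sketch: label-encode a non-numerical (string) feature so that it can
# enter the correlation matrix instead of being ignored.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "molecular_weight": [180.2, 151.2, 194.2],
    "solvent": ["water", "ethanol", "water"],   # hypothetical non-numerical feature
})

# Route 1: pandas categorical codes
df["solvent_encoded"] = df["solvent"].astype("category").cat.codes

# Route 2: scikit-learn's LabelEncoder
df["solvent_encoded_sk"] = LabelEncoder().fit_transform(df["solvent"])

# Only numerical columns contribute to the correlation matrix
print(df.select_dtypes(include="number").corr())
```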

So far, we have described how the NETCORE algorithm handles different situations and how it can help increase the accuracy of supervised machine learning models. In a last step, we discuss how certain modules of the algorithm could be modified to adapt it to special requirements of other datasets or problem statements. First, the correlation metric used to establish the correlation matrix can be altered: the Pearson correlation coefficient, which is used in our proposed version of the algorithm, can easily be replaced by, e.g., the Spearman or the Kendall tau correlation. The Pearson correlation analyzes linear correlations between two sequences of numbers according to Eq. (1) (see the section titled Methods). As it is simple and fast [its computational complexity is O(n)70], it is often a good choice to start with. However, its sensitivity toward outliers should be considered; for noisy data, Spearman's coefficient can be a better choice. In principle, Spearman's coefficient is obtained with the same equation as Pearson's coefficient; however, the raw values are now replaced by their ranks in the analyzed sequence of values. This strategy renders Spearman's coefficient more robust toward outliers71 but comes with an increased computational cost, as the additional sorting step increases the running time required to establish the correlation matrix to O(n log n).70 Both Pearson's and Spearman's coefficients define correlation as the proportion of the variability of one feature that is explained by the variability of another feature. A different approach is followed by the Kendall correlation: instead of analyzing the variability of the features, it assesses the probability that the values of two analyzed features follow the same ranking. Kendall's coefficient is defined according to Eq. (3), where n denotes the number of values available for each feature and S+ and S− represent the number of pairs with concordant or discordant ranking in the sequence, respectively,

\tau = \frac{S_{+} - S_{-}}{n(n-1)/2}. \qquad (3)

According to previous studies, Kendall's coefficient is more robust and slightly more efficient than Spearman's coefficient.72
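
In practice, exchanging the metric only changes how the correlation matrix is built; the sketch below, with an illustrative feature table, uses the corresponding options of the pandas corr method.

```python
# Illustrative sketch: building the feature correlation matrix with different metrics.
import pandas as pd

features_df = pd.DataFrame({
    "molecular_weight": [180.2, 151.2, 194.2, 206.3],
    "heavy_atom_count": [13, 11, 14, 15],
    "rotatable_bonds":  [3, 1, 0, 4],
})

corr_pearson  = features_df.corr(method="pearson")   # linear correlation
corr_spearman = features_df.corr(method="spearman")  # rank-based, more robust to outliers
corr_kendall  = features_df.corr(method="kendall")   # concordant vs. discordant pairs

print(corr_kendall.round(2))
```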

In addition to the type of correlation coefficient used, the sorting criteria can be adjusted. Whereas criterion 1 (i.e., searching for the node of highest degree) and criterion 3 (i.e., selecting the node with the highest mean correlation strength) are essential for identifying the node of highest centrality, criterion 2 offers room for modifications. As described in the section titled Methods, criterion 2 analyzes the networks that would remain after a candidate node was added to the new feature vector. So far, the node of highest degree is searched in those remaining networks, and among all possible candidate nodes, the one leading to the network containing this node of highest degree is chosen. This selection criterion, however, may be changed: instead of selecting the network containing the overall node of highest degree, one could choose the network whose minimal node degree is highest or the network with the highest mean degree (determined by averaging the degrees of all remaining nodes); a sketch of these alternative scores is given below. Of course, with either of these adjustments, it would be necessary to re-evaluate the efficiency of the feature elimination process as described above.
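
The sketch below, assuming the remaining feature network is available as a networkx graph, shows how the three candidate scores for criterion 2 could be computed; the function name and the toy graph are illustrative and not taken from the NETCORE implementation.

```python
# Illustrative sketch of alternative "criterion 2" scores for a remaining feature network.
import networkx as nx

def criterion2_scores(remaining_network: nx.Graph) -> dict:
    degrees = [deg for _, deg in remaining_network.degree()]
    if not degrees:
        return {"max_degree": 0, "min_degree": 0, "mean_degree": 0.0}
    return {
        "max_degree": max(degrees),                    # choice used so far
        "min_degree": min(degrees),                    # "highest minimal degree" variant
        "mean_degree": sum(degrees) / len(degrees),    # "highest mean degree" variant
    }

# Example: score the network that would remain after one candidate feature is selected.
G = nx.Graph([("mol_weight", "heavy_atoms"), ("heavy_atoms", "volume"), ("logP", "tpsa")])
candidate = "mol_weight"
remaining = G.subgraph([node for node in G if node != candidate])
print(criterion2_scores(remaining))
```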

In this study, we proposed the NETCORE algorithm, which constructs a network of features by analyzing correlations among them and then removes features such that the remaining ones represent the eliminated features very well. Importantly, the NETCORE algorithm is fully model-independent and does not require any output label, which enables it to preprocess datasets for both unsupervised and supervised learning approaches. The NETCORE algorithm is scalable to any size of dataset regarding both the number of samples and features. Here, we focused on the application of such a feature elimination algorithm to molecular datasets; nevertheless, it may be applied to any other dataset: eliminating redundant information to reduce the dimensionality of the feature space can be beneficial for all kinds of data that are analyzed via computational methods.

In addition to providing a fast and straightforward method for reducing the dimensionality of a feature space, the algorithm can help gain deeper insights into the analyzed data: the new feature vector provides a ranking of the remaining features according to their centrality in the correlation network. Furthermore, the decision-making process of the NETCORE algorithm can be traced very easily. As a result, redundant information is not only eliminated from the feature vector but can also be uncovered and interpreted. Additionally, the algorithm can be combined with common importance metrics (such as the Gini importance) to provide an importance ranking of the feature entries of the vector while removing redundant information that would otherwise have negatively impacted the result. Thus, the NETCORE strategy proposed here has the potential to be a highly beneficial tool for various machine learning pipelines.

The supplementary material contains the five small molecular datasets tested in this study (with the corresponding PubChem references and calculated correlation matrices). Furthermore, additional methods and results regarding other tested feature elimination strategies and runtimes are provided.

The authors have no conflicts to disclose.

C.A.R., M.H., and O.L. designed the study. M.H. and C.A.R. developed the algorithm and analyzed data. The manuscript was written by O.L., C.A.R., and M.H. All authors have given approval to the final version of the manuscript.

Carolin A. Rickert: Conceptualization (equal); Methodology (equal); Software (lead); Supervision (supporting); Validation (lead); Visualization (lead); Writing – original draft (lead). Manuel Henkel: Conceptualization (equal); Methodology (equal); Software (equal); Writing – original draft (supporting). Oliver Lieleg: Conceptualization (equal); Funding acquisition (lead); Project administration (lead); Supervision (lead); Writing – review & editing (lead).

The data that support the findings of this study are openly available on GitHub at https://github.com/CarolinRi/NETCORE (Ref. 73) as well as within the article and its supplementary material.

1. C. Tian, L. Fei, W. Zheng, Y. Xu, W. Zuo, and C.-W. Lin, "Deep learning on image denoising: An overview," Neural Networks 131, 251–275 (2020).
2. S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. S. Iyengar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Comput. Surv. 51(5), 1–36 (2018).
3. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature 521(7553), 436–444 (2015).
4. J. G. Richens, C. M. Lee, and S. Johri, "Improving the accuracy of medical diagnosis with causal machine learning," Nat. Commun. 11(1), 3923 (2020).
5. Y. Fu, Y. Lei, T. Wang, W. J. Curran, T. Liu, and X. Yang, "Deep learning in medical image registration: A review," Phys. Med. Biol. 65(20), 20TR01 (2020).
6. J. He, S. L. Baxter, J. Xu, J. Xu, X. Zhou, and K. Zhang, "The practical implementation of artificial intelligence technologies in medicine," Nat. Med. 25(1), 30–36 (2019).
7. K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, "Machine learning for molecular and materials science," Nature 559(7715), 547–555 (2018).
8. K. Guo, Z. Yang, C.-H. Yu, and M. J. Buehler, "Artificial intelligence and machine learning in design of mechanical materials," Mater. Horiz. 8(4), 1153–1172 (2021).
9. A. F. de Almeida, R. Moreira, and T. Rodrigues, "Synthetic organic chemistry driven by artificial intelligence," Nat. Rev. Chem. 3(10), 589–604 (2019).
10. N. Brown, P. Ertl, R. Lewis, T. Luksch, D. Reker, and N. Schneider, "Artificial intelligence in chemistry and drug design," J. Comput.-Aided Mol. Des. 34(7), 709–715 (2020).
11. A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature 542(7639), 115–118 (2017).
12. J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, and A. Potapenko, "Highly accurate protein structure prediction with AlphaFold," Nature 596(7873), 583–589 (2021).
13. A. Jain, H. Patel, L. Nagalapatti, N. Gupta, S. Mehta, S. Guttula, S. Mujumdar, S. Afzal, R. Sharma Mittal, and V. Munigala, "Overview and importance of data quality for machine learning tasks," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Association for Computing Machinery, 2020), pp. 3561–3562.
14. H. S. Obaid, S. A. Dheyab, and S. S. Sabry, "The impact of data pre-processing techniques and dimensionality reduction on the accuracy of machine learning," in 2019 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON) (IEEE, 2019), pp. 279–283.
15. C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the surprising behavior of distance metrics in high dimensional space," in International Conference on Database Theory (Springer, 2001), pp. 420–434.
16. H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining (Springer Science & Business Media, 2012).
17. H. Liu and H. Motoda, Computational Methods of Feature Selection (CRC Press, 2007).
18. S. Beniwal and J. Arora, "Classification and feature selection techniques in data mining," Int. J. Eng. Res. Sci. Technol. 1(6), 1–6 (2012).
19. U. Stańczyk, "Feature evaluation by filter, wrapper, and embedded approaches," in Feature Selection for Data and Pattern Recognition (Springer, 2015), pp. 29–44.
20. N. El Aboudi and L. Benhlima, "Review on wrapper feature selection approaches," in 2016 International Conference on Engineering and MIS (ICEMIS) (IEEE, 2016), pp. 1–5.
21. R. Zebari, A. Abdulazeez, D. Zeebaree, D. Zebari, and J. Saeed, "A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction," J. Appl. Sci. Technol. Trends 1(2), 56–70 (2020).
22. X. Huang, L. Wu, and Y. Ye, "A review on dimensionality reduction techniques," Int. J. Pattern Recognit. Artif. Intell. 33(10), 1950017 (2019).
23. T. Rückstieß, C. Osendorfer, and P. v. d. Smagt, "Sequential feature selection for classification," in Australasian Joint Conference on Artificial Intelligence (Springer, 2011), pp. 132–141.
24. Z. M. Hira and D. F. Gillies, "A review of feature selection and feature extraction methods applied on microarray data," Adv. Bioinf. 2015, 198363.
25. J. Loughrey and P. Cunningham, "Overfitting in wrapper-based feature subset selection: The harder you try the worse it gets," in International Conference on Innovative Techniques and Applications of Artificial Intelligence (Springer, 2004), pp. 33–43.
26. S. Wang, J. Tang, and H. Liu, "Embedded unsupervised feature selection," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, 2015), Vol. 29.
27. S. Maldonado and J. López, "Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification," Appl. Soft Comput. 67, 94–105 (2018).
28. A. Parmar, R. Katariya, and V. Patel, "A review on random forest: An ensemble classifier," in International Conference on Intelligent Data Communication Technologies and Internet of Things (Springer, 2018), pp. 758–763.
29. K. Fawagreh, M. M. Gaber, and E. Elyan, "Random forests: From early developments to recent advancements," Syst. Sci. Control Eng. 2(1), 602–609 (2014).
30. J. Ranstam and J. A. Cook, "LASSO regression," J. Br. Surg. 105(10), 1348 (2018).
31. G. C. McDonald, "Ridge regression," Wiley Interdiscip. Rev.: Comput. Stat. 1(1), 93–100 (2009).
32. N. Sánchez-Maroño, A. Alonso-Betanzos, and M. Tombilla-Sanromán, "Filter methods for feature selection—A comparative study," in International Conference on Intelligent Data Engineering and Automated Learning (Springer, 2007), pp. 178–187.
33. J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Comput. Appl. 24(1), 175–186 (2014).
34. P. A. Estévez, M. Tesmer, C. A. Perez, and J. M. Zurada, "Normalized mutual information feature selection," IEEE Trans. Neural Networks 20(2), 189–201 (2009).
35. L. Talavera, "Dependency-based feature selection for clustering symbolic data," Intell. Data Anal. 4(1), 19–28 (2000).
36. P. Barbiero, G. Squillero, and A. Tonda, "Predictable features elimination: An unsupervised approach to feature selection," in International Conference on Machine Learning, Optimization, and Data Science (Springer, 2021), pp. 399–412.
37. M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature selection for clustering-a filter solution," in 2002 IEEE International Conference on Data Mining, Proceedings 2002 (IEEE, 2002), pp. 115–122.
38. X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in Advances in Neural Information Processing Systems (NIPS, 2005), Vol. 18.
39. H. Abdi and L. J. Williams, "Principal component analysis," Wiley Interdiscip. Rev.: Comput. Stat. 2(4), 433–459 (2010).
40. D. Salas-Gonzalez, J. Górriz, J. Ramírez, I. Illán, M. López, F. Segovia, R. Chaves, P. Padilla, C. Puntonet, and the Alzheimer's Disease Neuroimaging Initiative, "Feature selection using factor analysis for Alzheimer's diagnosis using 18F-FDG PET images," Med. Phys. 37(11), 6084–6095 (2010).
41. L. Jimenez and D. A. Landgrebe, "Projection pursuit in high dimensional data reduction: Initial conditions, feature selection and the assumption of normality," in 1995 IEEE International Conference on Systems, Man and Cybernetics. Intelligent Systems for the 21st Century (IEEE, 1995), Vol. 1, pp. 401–406.
42. L. O. Jimenez and D. A. Landgrebe, "Hyperspectral data analysis and supervised feature reduction via projection pursuit," IEEE Trans. Geosci. Remote Sens. 37(6), 2653–2667 (1999).
43. N. Mazlum and A. Ö. S. Mazlum, "Interpretation of water quality data by principal components analysis," Turk. J. Eng. Environ. Sci. 23(1), 19–26 (1999).
44. H. Zou and L. Xue, "A selective overview of sparse principal component analysis," Proc. IEEE 106(8), 1311–1320 (2018).
45. E. E. Bolton, J. Chen, S. Kim, L. Han, S. He, W. Shi, V. Simonyan, Y. Sun, P. A. Thiessen, and J. Wang, "PubChem3D: A new resource for scientists," J. Cheminf. 3(1), 32 (2011).
46. D. F. Veber, S. R. Johnson, H.-Y. Cheng, B. R. Smith, K. W. Ward, and K. D. Kopple, "Molecular properties that influence the oral bioavailability of drug candidates," J. Med. Chem. 45(12), 2615–2623 (2002).
47. A. Bondi, "van der Waals volumes and radii," J. Phys. Chem. 68(3), 441–451 (1964).
49. J. Meija, T. B. Coplen, M. Berglund, W. A. Brand, P. De Bièvre, M. Gröning, N. E. Holden, J. Irrgeher, R. D. Loss, T. Walczyk, and T. Prohaska, "Atomic weights of the elements 2013 (IUPAC Technical Report)," Pure Appl. Chem. 88(3), 265–291 (2016).
50. Chemaxon, "Isoelectric point plugin," https://docs.chemaxon.com/display/docs/isoelectric-point-plugin.md (accessed 25 February 2022).
51. Chemaxon, "Dipole moment calculation plugin," https://docs.chemaxon.com/display/docs/dipole-moment-calculation-plugin.md (accessed 25 February 2022).
52. P. M. Schlosser, B. A. Asgharian, and M. Medinsky, "1.04 Inhalation Exposure and Absorption of Toxicants," Compr. Toxicol. 1, 75–109 (2010).
53. Chemaxon, "Topology analysis," https://chemaxon.com/webinar/topology-analysis (accessed 23 February 2022).
54. Chemaxon, "Hydrogen bond donor acceptor plugin," https://docs.chemaxon.com/display/docs/hydrogen-bond-donor-acceptor-plugin.md (accessed 25 February 2022).
55. G. Van Rossum and F. L. Drake, Python 3 Reference Manual (CreateSpace, 2009).
56. W. McKinney, "Data structures for statistical computing in Python," in Proceedings of the Ninth Python in Science Conference, Austin, TX, 28 June–3 July 2010, edited by S. van der Walt and K. J. Millman, Vol. 445, pp. 51–56.
57. J. Reback, W. McKinney, J. Van Den Bossche, T. Augspurger, P. Cloud, A. Klein, S. Hawkins, M. Roeschke, J. Tratner, and C. She (2020), "pandas-dev/pandas: Pandas 1.0.5," Zenodo.
58. A. Hagberg, P. Swart, and D. S. Chult, Exploring Network Structure, Dynamics, and Function Using NetworkX (Los Alamos National Lab., LANL, Los Alamos, NM, USA, 2008).
59. C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, and N. J. Smith, "Array programming with NumPy," Nature 585(7825), 357–362 (2020).
60. M. Waskom, O. Botvinnik, M. Gelbart, J. Ostblom, P. Hobson, S. Lukauskas, D. C. Gemperline, T. Augspurger, Y. Halchenko, and J. Warmenhoven, "Seaborn: Statistical data visualization," Astrophysics Source Code Library, 2020, ascl:2012.015.
61. J. D. Hunter, "Matplotlib: A 2D graphics environment," IEEE Ann. Hist. Comput. 9(3), 90–95 (2007).
62. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res. 12, 2825–2830 (2011).
63. R. Taylor, "Interpretation of the correlation coefficient: A basic review," J. Diagn. Med. Sonography 6(1), 35–39 (1990).
64. B. Ruhnau, "Eigenvector-centrality—A node-centrality?," Soc. Networks 22(4), 357–365 (2000).
65. G. Roffo and S. Melzi, "Ranking to learn," in International Workshop on New Frontiers in Mining Complex Patterns (Springer, 2016), pp. 19–35.
66. C. G. Thompson, R. S. Kim, A. M. Aloe, and B. J. Becker, "Extracting the variance inflation factor and other multicollinearity diagnostics from typical regression results," Basic Appl. Soc. Psychol. 39(2), 81–90 (2017).
67. C. A. Rickert, E. N. Hayta, D. M. Selle, I. Kouroudis, M. Harth, A. Gagliardi, and O. Lieleg, "Machine learning approach to analyze the surface properties of biological materials," ACS Biomater. Sci. Eng. 7(9), 4614–4625 (2021).
68. T. A. Craney and J. G. Surles, "Model-dependent variance inflation factor cutoff values," Qual. Eng. 14(3), 391–403 (2002).
69. Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, "MoleculeNet: A benchmark for molecular machine learning," Chem. Sci. 9(2), 513–530 (2018).
70. P. A. Jaskowiak, R. J. Campello, T. F. Covoes, and E. R. Hruschka, "A comparative study on the use of correlation coefficients for redundant feature elimination," in 2010 Eleventh Brazilian Symposium on Neural Networks (IEEE, 2010), pp. 13–18.
71. J. C. F. De Winter, S. D. Gosling, and J. Potter, "Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data," Psychol. Methods 21(3), 273 (2016).
72. C. Croux and C. Dehon, "Influence functions of the Spearman and Kendall correlation measures," Stat. Methods Appl. 19(4), 497–515 (2010).
73. C. A. Rickert, M. Henkel, and O. Lieleg (2022), "carolinri/NETCORE: Version 1.0.1," GitHub/Zenodo.
