13 Feature selection

13.1 Introduction to feature selection

Feature selection is the process of removing uninformative features from your model. These are features that are not helping your model to make better predictions. In other words, uninformative features are adding “noise” to your model, rather than “signal”.

There are a few reasons you might want to add feature selection to your workflow:

Model accuracy can often be improved by removing uninformative features.
Models are generally easier to interpret when they include fewer features.
When you have fewer features, models will take less time to train and it may cost less to gather and store the data that is required to train them.

Potential benefits of feature selection:

Higher accuracy
Greater interpretability
Faster training
Lower costs

As I mentioned at the start of the book, there are many valid methods for feature selection, including human intuition, domain knowledge, and data exploration. In this chapter, we’re going to do feature selection using automated methods that we can include in our Pipeline.

Feature selection methods:

Human intuition
Domain knowledge
Data exploration
Automated methods

There are three types of automated methods that we’ll cover in this chapter: intrinsic methods, filter methods, and wrapper methods.

Methods for automated feature selection:

Intrinsic methods
Filter methods
Wrapper methods

For the purposes of simplicity and training speed, we’ll use our logistic regression Pipeline as the starting point for the next few chapters. However, everything you’re learning could also be applied to the random forest Pipeline or the VotingClassifier Pipeline.

13.2 Intrinsic methods: L1 regularization

An intrinsic feature selection method is one in which feature selection happens automatically as part of the model building process. These are also known as implicit methods or embedded methods.

What are intrinsic methods?

Feature selection happens automatically during model building
Also called: implicit methods, embedded methods

We’ve actually already used an intrinsic feature selection method in the book. Recall that in chapter 10, we tuned our logistic regression Pipeline using a grid search. Here were the best parameters.

grid.best_params_

{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': True,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}

Notice the C and penalty parameters of logistic regression. The L1 penalty is the type of regularization that was used, and the C value indicates the amount of regularization.

LogisticRegression tuning parameters:

penalty: Type of regularization
C: Amount of regularization

In general, regularization shrinks model coefficients in order to minimize overfitting to the training data and improve the model’s ability to generalize to new data.

One notable aspect of L1 regularization in particular is that as the amount of regularization increases, some coefficients will be shrunk all the way to zero, which means they will be excluded from the model. In other words, L1 regularization automatically does feature selection.

How does L1 regularization do feature selection?

Regularization shrinks model coefficients to help the model to generalize
L1 regularization shrinks some coefficients to zero, which removes those features

To see an example of this, let’s take a look at the coefficients of the best model found by grid search, which is stored in the best_estimator_ attribute. Notice that the second coefficient is zero, which means that the L1 regularization caused that feature to be removed from the model.

grid.best_estimator_.named_steps['logisticregression'].coef_

array([[ 0.56431161,  0.        , -0.08767203, ...,  0.01408723,
        -0.43713268, -0.46358519]])

By checking the shape of the inner array, we can see that there are 3671 features.

grid.best_estimator_.named_steps['logisticregression'].coef_[0].shape

(3671,)

We can then check how many of the coefficients are zero. It turns out that 3103 coefficients were set to zero, which means that L1 regularization removed those features, leaving only 568 of the features.

Note that as the amount of regularization increases, more coefficients will be shrunk to zero and thus more features will be removed from the model. In the case of logistic regression, you increase the amount of regularization by decreasing the value of C.

sum(grid.best_estimator_.named_steps['logisticregression'].coef_[0] == 0)

Let’s compare this to our logistic regression Pipeline that was not tuned by grid search. You can see that it uses L2 regularization.

pipe.named_steps['logisticregression'].get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 1,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

Although L2 regularization does shrink coefficients, we can confirm that it does not shrink them all the way to zero, and thus it does not perform feature selection.

Keep in mind that although L1 regularization produced a better performing model in this situation, that will not always be the case. It’s a good idea to always try both types of regularization and see which one works better.

sum(pipe.named_steps['logisticregression'].coef_[0] == 0)

To wrap up this section, let’s talk about some advantages and disadvantages of intrinsic feature selection methods:

The main advantages are speed and simplicity: Since feature selection is implictly performed during model fitting, no additional feature selection process needs to be added to the workflow, which tends to save a lot of computational time.
The main disadvantage is that it’s model-dependent: The model that is best for your particular problem may not perform intrinsic feature selection.

Advantages and disadvantages of intrinsic methods:

Advantages:
- No added computation
- No added steps
Disadvantages:
- Model-dependent

13.3 Filter methods: Statistical test-based scoring

The next type of feature selection method we’ll cover is filter methods.

A filter method starts by scoring every single feature to quantify its potential relationship with the target column. Then, the features are ranked by their scores, and only the top scoring features are provided to the model. Thus, they’re called filter methods because they filter out what they believe to be the least informative features and then pass on the more informative features to the model.

As you’ll see in this section, filter methods vary in terms of the processes they use to score the features.

How filter methods work:

Each feature is scored by its relationship to the target
Top scoring features (most informative features) are provided to the model

Our starting point for this section will be the logistic regression Pipeline that has not been tuned by grid search. The reason for this is because we want to tune all of the Pipeline steps simultaneously, rather than tuning the transformers and model first and then adding feature selection.

In other words, the presence of a feature selection process may alter the optimal parameters for the transformers and the model, and thus we need to tune all three steps at once. Right now it’s a two-step Pipeline, but there will be three steps once we add feature selection to the Pipeline.

pipe

Let’s cross-validate this Pipeline to generate a “baseline” accuracy that we want to improve upon, which is 0.811.

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8114619295712762

Pipeline accuracy scores:

Grid search (VC): 0.834
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (VC): 0.818
Baseline (LR): 0.811
Baseline (RF): 0.811

The first filter method we’ll use is SelectPercentile. SelectPercentile scores features using univariate statistical tests:

You specify a statistical test, and it uses that test to score each feature independently.
Then, it passes on to the model a certain percentage (that you specify) of the best scoring features.

Thus, the assumption behind SelectPercentile is that a statistical test can assess the strength of the relationship between a feature and the target, and that if a feature appears to be independent of the target, then it is uninformative for the purpose of classification.

How SelectPercentile works:

Scores each feature using the statistical test you specify
Passes to the model the percentage of features you specify

Let’s see how SelectPercentile works. After importing SelectPercentile and chi2 from the feature_selection module, we’ll create an instance of SelectPercentile called “selection”.

First, we pass it the statistical test. In this case we’re using chi2, but other tests are available in scikit-learn.

Then, we pass it the percentile. We’re arbitrarily using 50 to keep 50% of the features, but this is a parameter you should tune. And to be clear, lower values for this parameter keep fewer features, so for example a value of 10 would only keep 10% percent of the features.

from sklearn.feature_selection import SelectPercentile, chi2
selection = SelectPercentile(chi2, percentile=50)

Next, we create a Pipeline called “fs_pipe” in which feature selection is after the ColumnTransformer but before the model. Thus, it will perform feature selection on the transformed features, not the original features.

fs_pipe = make_pipeline(ct, selection, logreg)
fs_pipe

Because we’ve included feature selection within the Pipeline, we can continue to cross-validate the entire process to see the impact of feature selection on model accuracy. When we run cross-validation on the new Pipeline, the score has improved to 0.819.

cross_val_score(fs_pipe, X, y, cv=5, scoring='accuracy').mean()

0.8193019898311469

Pipeline accuracy scores:

Grid search (VC): 0.834
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (LR with SelectPercentile): 0.819
Baseline (VC): 0.818
Baseline (LR): 0.811
Baseline (RF): 0.811

It’s worth noting that there’s an alternative to SelectPercentile called SelectKBest. SelectKBest is nearly identical, except that you specify a number of features to keep rather than a percentage.

SelectPercentile vs SelectKBest:

SelectPercentile: Specify percentage of features to keep
SelectKBest: Specify number of features to keep

13.4 Filter methods: Model-based scoring

The other filter method we’ll use is called SelectFromModel. Whereas SelectPercentile scores features using a statistical test, SelectFromModel actually uses a model to score features:

First, you specify a model to use only for feature selection. That model is fit on all of the features. And the coef_ or feature_importances_ attribute of the model is used as the scores.
Then, it passes on to your prediction model all of the features that score above a certain threshold (that you specify).

How SelectFromModel works:

Scores each feature using the model you specify

Model is fit on all features
Coefficients or feature importances are used as scores

Passes to the prediction model features that score above a threshold you specify

Thus for a model to be used by SelectFromModel, it has to calculate either coefficients or feature importances. Models that can be used by SelectFromModel include logistic regression, linear SVC, and tree-based models.

Models that can be used by SelectFromModel:

Logistic regression
Linear SVC
Tree-based models
Any other model with coefficients or feature importances

To be clear, SelectFromModel is a filter method (not an intrinsic method) because it’s filtering which features are passed to your separate prediction model.

Let’s see how all of this fits together. We’re going to start by using logistic regression for feature selection. We’ll create a new instance of logistic regression called “logreg_selection” that’s only going to be used for feature selection. It’s completely separate from the logistic regression model we’re using to make predictions.

logreg_selection = LogisticRegression(solver='liblinear', penalty='l1',
                                      random_state=1)

Then, we’ll import SelectFromModel from the feature_selection module and create an instance called “selection”.

First, we pass it the model we’re using for selection. Second, we pass it a threshold. This can be the mean or median of the scores, though you can optionally include a scaling factor (such as 1.5 times mean). All features above this threshold will be passed to the prediction model, thus setting a higher threshold means fewer features will be kept.

from sklearn.feature_selection import SelectFromModel
selection = SelectFromModel(logreg_selection, threshold='mean')

Then, we’ll update fs_pipe to use the new feature selection object. Notice that logistic regression appears twice: One instance is being used only for feature selection, and the other instance is being used only for prediction.

fs_pipe = make_pipeline(ct, selection, logreg)
fs_pipe

When we cross-validate the updated Pipeline, the score has improved again, to 0.826.

cross_val_score(fs_pipe, X, y, cv=5, scoring='accuracy').mean()

0.8260121775155358

Pipeline accuracy scores:

Grid search (VC): 0.834
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (LR with SelectFromModel LR): 0.826
Baseline (LR with SelectPercentile): 0.819
Baseline (VC): 0.818
Baseline (LR): 0.811
Baseline (RF): 0.811

Now let’s try using a tree-based model with SelectFromModel. We’ll use ExtraTreesClassifier, which is an ensemble of decision trees similar to random forests. After importing it from the ensemble module, we’ll create an instance to use for feature selection called “et_selection”.

from sklearn.ensemble import ExtraTreesClassifier
et_selection = ExtraTreesClassifier(n_estimators=100, random_state=1)

Then, we’ll update both the feature selection object and the Pipeline. Notice that ExtraTreesClassifier has replaced logistic regression as the second step in the Pipeline.

selection = SelectFromModel(et_selection, threshold='mean')
fs_pipe = make_pipeline(ct, selection, logreg)
fs_pipe

When we cross-validate the updated Pipeline, the resulting score is 0.815, which is not quite as good.

cross_val_score(fs_pipe, X, y, cv=5, scoring='accuracy').mean()

0.8148013307388112

Pipeline accuracy scores:

Grid search (VC): 0.834
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (LR with SelectFromModel LR): 0.826
Baseline (LR with SelectPercentile): 0.819
Baseline (VC): 0.818
Baseline (LR with SelectFromModel ET): 0.815
Baseline (LR): 0.811
Baseline (RF): 0.811

As I mentioned earlier in this chapter, it’s important to tune the feature selection parameters, the transformer parameters, and the model parameters all at the same time. We’ll do this using a grid search.

To start, we’ll make a copy of our params dictionary called “fs_params”. We want to add a new entry in order to tune the threshold parameter of SelectFromModel. For the dictionary key, we specify the step name, which is selectfrommodel (all lowercase), followed by two underscores, followed by the parameter name. For the values, we’ll pass a list of mean, 1.5 times the mean, and negative infinity, which means don’t remove any features.

fs_params = params.copy()
fs_params['selectfrommodel__threshold'] = ['mean', '1.5*mean', -np.inf]

Let’s review the fs_params dictionary to see everything we’re tuning.

fs_params

{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10],
 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True],
 'selectfrommodel__threshold': ['mean', '1.5*mean', -inf]}

We’ll create a new instance of GridSearchCV called fs_grid, and make sure to pass it the fs_pipe and fs_params objects. Then we’ll run the grid search.

fs_grid = GridSearchCV(fs_pipe, fs_params, cv=5, scoring='accuracy',
                       n_jobs=-1)
fs_grid.fit(X, y)

This search results in a score of 0.832, which is one of our best scores so far.

fs_grid.best_score_

0.8316301550436258

Pipeline accuracy scores:

Grid search (VC): 0.834
Grid search (LR with SelectFromModel ET): 0.832
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (LR with SelectFromModel LR): 0.826
Baseline (LR with SelectPercentile): 0.819
Baseline (VC): 0.818
Baseline (LR with SelectFromModel ET): 0.815
Baseline (LR): 0.811
Baseline (RF): 0.811

By examining the parameters, we can see that increasing the SelectFromModel threshold and thus keeping fewer features helped the model to perform better.

fs_grid.best_params_

{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': False,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1',
 'selectfrommodel__threshold': '1.5*mean'}

13.5 Filter methods: Summary

To wrap up this section, let’s talk about some advantages and disadvantages of filter methods.

The main advantage is that filter methods tend to run very quickly, though it’s worth noting that some statistical tests used with SelectPercentile and ensemble methods used with SelectFromModel can run quite slowly.

The main disadvantage is that there’s a disconnect between how features are being scored and their predictive value. In other words, the chi2 scores or coefficient values or feature importance scores are not a perfect measure of whether a particular feature will help a model make more accurate predictions. Thus, it’s entirely possible for informative features to receive low scores and be removed from a model, and for uninformative features to receive high scores and be kept in a model. One particular case of note is that the feature importance scores generated by tree-based models will be artificially low for any features which are highly correlated, which may result in important features being removed.

The other disadvantage of filter methods is that scores are calculated only once. This ignores the fact that as you remove certain features, the importance of other features may change. This drawback will be addressed by wrapper methods, which we’ll discuss in the next lesson.

Advantages and disadvantages of filter methods:

Advantages:
- Runs quickly (usually)
Disadvantages:
- Scores are not always correlated with predictive value
- Scores are calculated only once

13.6 Wrapper methods: Recursive feature elimination

The final type of feature selection we’ll cover is wrapper methods.

In contrast to the filter methods we’ve seen, in which features are scored only once, wrapper methods perform an iterative search in which features are scored multiple times. More specifically, a wrapper method evaluates a subset of features and then uses the results of that evaluation to help it decide which subset to evaluate next, repeating this process until some stopping criteria is met.

Filter methods vs wrapper methods:

Filter methods: Features are scored once
Wrapper methods: Features are scored multiple times

The wrapper method we’ll use in this section is Recursive Feature Elimination, or RFE. The way RFE starts is the same as SelectFromModel:

You specify a model to use only for feature selection.
That model is fit on all of the features.
The coefficients or feature importances of the model are used as scores.

However, this is the point at which SelectFromModel and RFE diverge:

SelectFromModel would now pass to your prediction model all of the features that score above a certain threshold.
RFE, on the other hand, removes the single worst scoring feature, refits the feature selection model, and recalculates the feature scores. It repeats this process, recursively eliminating one feature at a time, until it reaches the number of features that you specify. Those remaining features are the ones that will be passed to the prediction model.

How RFE works:

Scores each feature using the model you specify

Model is fit on all features
Coefficients or feature importances are used as scores

Removes the single worst scoring feature
Repeats steps 1 and 2 until it reaches the number of features you specify
Passes the remaining features to the prediction model

In other words, SelectFromModel will always just score your features a single time, whereas RFE will score your features potentially hundreds or thousands of times, depending on how many features you want to eliminate. That is obviously more computationally expensive, though it may better capture the relationships between features.

SelectFromModel vs RFE:

SelectFromModel: Scores your features a single time
RFE: Scores your features many times
- More computationally expensive
- May better capture the relationships between features

Let’s try using RFE. We start by importing it from the feature_selection module, and then create an instance called “selection”. We’re actually going to reuse logreg_selection as our feature selection model.

We’re also going to specify a step size of 10. By default, RFE will remove 1 feature at a time, and will stop once it has eliminated half of the features. Since there are about 1500 features, the default settings would require about 750 model fits. By setting a step size of 10, RFE will remove 10 features at a time, which reduces the amount of computation by a factor of 10.

from sklearn.feature_selection import RFE
selection = RFE(logreg_selection, step=10)

We’ll update the fs_pipe object to use the new feature selection object.

fs_pipe = make_pipeline(ct, selection, logreg)
fs_pipe

When we cross-validate it, we see that its score is 0.814, which is barely better than our baseline of 0.811.

cross_val_score(fs_pipe, X, y, cv=5, scoring='accuracy').mean()

0.8136965664427847

Pipeline accuracy scores:

Grid search (VC): 0.834
Grid search (LR with SelectFromModel ET): 0.832
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (LR with SelectFromModel LR): 0.826
Baseline (LR with SelectPercentile): 0.819
Baseline (VC): 0.818
Baseline (LR with SelectFromModel ET): 0.815
Baseline (LR with RFE LR): 0.814
Baseline (LR): 0.811
Baseline (RF): 0.811

Of course, it’s possible that the accuracy would improve if we tuned the number of features kept by RFE, or if we tried different models with RFE.

It’s hard to talk about the advantages and disadvantages of wrapper methods in general because of the diversity of wrapper methods. Instead, let’s wrap up this section by talking about the advantages and disadvantages of RFE specifically:

The main advantage of RFE is that it recalculates feature scores as features are removed. This is beneficial because as features are removed, the importance of other features may change, which RFE takes into account (whereas filter methods do not).
However, RFE has the same disadvantage as filter methods, in that there’s a disconnect between how features are being scored and their predictive value. In other words, informative features might be removed and uninformative features might be kept by RFE.
Another disadvantage of RFE is that it’s computationally expensive, especially if you’re removing a lot of features.
A final disadvantage of RFE is that it uses a “greedy” approach to feature selection, which means that it takes whatever action seems best at the time, even if a different action might ultimately lead to better results at the end of the process. There are non-greedy approaches to feature selection, though none of them are currently available in scikit-learn.

Advantages and disadvantages of RFE:

Advantages:
- Captures the relationships between features
Disadvantages:
- Scores are not always correlated with predictive value
- Computationally expensive
- Does not look ahead when removing features (“greedy” approach)

13.7 Q&A: How do I see which features were selected?

Recall that the fs_pipe object has three steps: a ColumnTransformer, a feature selector, and a logistic regression model.

fs_pipe.named_steps.keys()

dict_keys(['columntransformer', 'rfe', 'logisticregression'])

We can use slicing to select the ColumnTransformer step, and then we can run fit_transform to see that it outputs 1518 feature columns.

fs_pipe[0].fit_transform(X)

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
    with 7328 stored elements in Compressed Sparse Row format>

If we select the first two steps and then run fit_transform, we can see that the feature selection step reduced the number of feature columns from 1518 to 759. Note that we have to pass both X and y to fit_transform since the feature selection process requires knowledge of the target values.

fs_pipe[0:2].fit_transform(X, y)

<891x759 sparse matrix of type '<class 'numpy.float64'>'
    with 6008 stored elements in Compressed Sparse Row format>

If we then select the feature selection step and run the get_support method, it returns a boolean array which includes a True for every feature which was kept and a False for every feature which was removed.

fs_pipe[1].get_support()

array([ True,  True,  True, ...,  True,  True,  True])

As you can see, that array has 1518 elements, and 759 of those elements are True.

len(fs_pipe[1].get_support())

fs_pipe[1].get_support().sum()

Ideally, you would be able to use this array to filter the list of all features down to a list of the selected features. However, this is not a straightforward process because the get_feature_names method of ColumnTransformer only works if all of the underlying transformers have a get_feature_names method, and that is not the case here.

fs_pipe[0].get_feature_names()

AttributeError: Transformer pipeline (type Pipeline) does not provide get_feature_names.

As such, you would have to inspect the transformers one-by-one to figure out the 1518 column names, as shown previously in lesson 8.4, and then you could filter that list down to the 759 selected features using the boolean array.

Note that starting in scikit-learn version 1.1, the get_feature_names_out method will work on this ColumnTransformer, since the get_feature_names_out method will be available for all transformers.

13.8 Q&A: Are the selected features the “most important” features?

After using any feature selection procedure, it’s hard to say whether the selected feature set is truly the best set of features, or whether there might be another feature set that performs just as well.

For example, one high-performing feature set might include feature A, while another high-performing feature set might exclude feature A but include features B and C which are highly correlated with A. This is especially likely any time the number of features is much greater than the number of samples.

As such, feature selection is not an optimal tool for determining feature importance.

Feature selection vs feature importance:

Multiple sets of features may perform similarly
Especially likely if there are many more features than samples (“p >> n”)
Thus, feature selection does not necessarily determine feature importance

13.9 Q&A: Is it okay for feature selection to remove one-hot encoded categories?

The feature selection process examines each feature independently, which means that it does not know that there are groups of feature columns that originated from the same feature. As a result, feature selection might remove some of the columns that resulted from one-hot encoding a column, and keep others.

However, I don’t see this as necessarily being problematic. Each one-hot encoded column can be thought of as independent from all others, since it merely represents the presence or absence of a particular value for a categorical column. If the presence or absence of “Embarked from C” has a relationship with the target but the presence or absence of “Embarked from Q” does not, then I would agree with one of those columns being removed while the other remains.

This is similar to how I think of the features output by CountVectorizer: Some text features have a relationship with the target and should be kept, while others do not have a relationship with the target and should be removed.

Feature selection of one-hot encoded categories:

Feature selection examines each feature column independently (regardless of its “origin”)
Each one-hot encoded column is conceptually independent from the others
Thus, it’s acceptable for feature selection to ignore the origin of each column when removing features