14 Feature standardization

14.1 Standardizing numerical features

Some Machine Learning models benefit from a process called feature standardization. That's because the objective function of some models assumes that all features are centered around zero and have a variance of the same order of magnitude. If that assumption is incorrect, a given feature might dominate the objective function and the model won't be able to learn from all of the features, reducing its performance.

In this chapter, we'll experiment with standardizing our features to see if that improves our model performance.

We'll start with the most common approach, which is to use StandardScaler and only standardize features that were originally numerical. We'll import it from the preprocessing module and create an instance called "scaler" using the default parameters. For each feature, it will subtract the mean and divide by the standard deviation, which centers the data around zero and scales it.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
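As a quick standalone illustration (not part of the book's pipeline), here's what StandardScaler does to a single toy feature whose values are far from zero:

import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature: after scaling, it is centered around zero with unit variance
X_toy = np.array([[100.0], [200.0], [300.0]])
StandardScaler().fit_transform(X_toy)
# array([[-1.22474487],
#        [ 0.        ],
#        [ 1.22474487]])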
Here’s a reminder of our existing ColumnTransformer. Note that our numerical features are Age, Fare, and Parch.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
To scale our numerical features, we’ll make a Pipeline of imputation and scaling called “imp_scaler”.
imp_scaler = make_pipeline(imp, scaler)
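To build intuition for what this two-step Pipeline does, here's a small self-contained sketch (using fresh SimpleImputer and StandardScaler instances rather than the book's imp and scaler objects): the imputer fills the missing value with the column mean, and the scaler then centers and scales the result.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

demo_pipe = make_pipeline(SimpleImputer(), StandardScaler())
# the NaN is imputed with the column mean (20.0) before scaling
demo_pipe.fit_transform(np.array([[10.0], [20.0], [30.0], [np.nan]]))
# array([[-1.41421356],
#        [ 0.        ],
#        [ 1.41421356],
#        [ 0.        ]])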
Then we’ll replace imp with imp_scaler in our ColumnTransformer, and apply it to Parch as well, which was previously a passthrough column.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_scaler, ['Age', 'Fare', 'Parch']))
Finally, we’ll create a Pipeline called “scaler_pipe” using the updated ColumnTransformer.
scaler_pipe = make_pipeline(ct, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline-1', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('pipeline-2', Pipeline(steps=[('simpleimputer', SimpleImputer()), ('standardscaler', StandardScaler())]), ['Age', 'Fare', 'Parch'])])), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))])
The cross-validated accuracy of this Pipeline is 0.810, which is nearly the same as our baseline accuracy.
cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8103383340656581
That might be surprising, because regularized linear models often benefit from feature standardization. However, our particular logistic regression solver (liblinear) happens to be robust to unscaled data, and thus there was no benefit in this case.
The takeaway here is that you shouldn’t always assume that standardization of numerical features is necessary.
14.2 Standardizing all features
In the previous lesson, we standardized all of the numerical features. An alternative approach is to standardize all features after transformation, even if they were not originally numerical. That’s what we’ll try in this lesson.
Our strategy will be to add standardization as the second step in the Pipeline, in between the ColumnTransformer and the model. However, our ColumnTransformer outputs a sparse matrix, and StandardScaler would destroy the sparseness by centering the data, likely resulting in a memory issue.
Thus, we're going to use an alternative scaler called MaxAbsScaler. We'll import it from the preprocessing module and create an instance called "scaler". MaxAbsScaler divides each feature by its maximum absolute value, which scales each feature to the range -1 to 1. Zeros are never changed, and thus sparsity is preserved.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
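Here's a small self-contained sketch (not from the book) showing both properties on a toy sparse matrix: zeros stay zero, and a 0/1 column like the output of OneHotEncoder is left unchanged, while the numerical column is rescaled by its maximum absolute value.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# column 1 is a 0/1 indicator, column 2 is numerical with a maximum of 80
X_sparse = csr_matrix(np.array([[1.0,  0.0],
                                [0.0, 40.0],
                                [1.0, 80.0]]))
# the result is still a sparse matrix; toarray() is only for display
MaxAbsScaler().fit_transform(X_sparse).toarray()
# array([[1. , 0. ],
#        [0. , 0.5],
#        [1. , 1. ]])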
First, we’ll reset our ColumnTransformer so that it doesn’t include the imp_scaler Pipeline.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
Then, we’ll update the scaler_pipe object to use MaxAbsScaler as the second step.
scaler_pipe = make_pipeline(ct, scaler, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('maxabsscaler', MaxAbsScaler()), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))])
When we cross-validate it, the accuracy is 0.811, which is exactly the same as our baseline accuracy.
I'm not surprised by this result, because MaxAbsScaler has no effect on the columns output by OneHotEncoder and only a tiny effect on the columns output by CountVectorizer. Thus our approach in this lesson mostly affects the numerical columns, which is essentially what we did in the previous lesson.
cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114556525014123
Although we didn’t see any benefits from standardization, there are certainly cases in which it will help. If you do try out standardization, my suggestion is to try out both of the approaches that we used in this chapter and use whichever one works better.
14.3 Q&A: How do I see what scaling was applied to each feature?
If you’re interested in seeing what scaling was applied to each feature, you can fit the Pipeline and then examine the scale_ attribute of the maxabsscaler step.
For example, the last three entries in the array correspond to the scaling for Age, Fare, and Parch. These are simply the maximum values of Age, Fare, and Parch in X. As a reminder, this is the scaling that will be applied to the features in X_new when making predictions.
scaler_pipe.fit(X, y)
scaler_pipe.named_steps['maxabsscaler'].scale_
array([ 1. , 1. , 1. , ..., 80. , 512.3292, 6. ])
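If you want to verify those values, one option (assuming X is the pandas DataFrame used throughout the chapter) is to compare them against the column maxima directly:

# hypothetical sanity check: these should match the last three entries of scale_
X[['Age', 'Fare', 'Parch']].max()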
And as you might expect, the scale_ attribute is also available when using StandardScaler.
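For example, if you refit the StandardScaler version of the Pipeline from the first lesson, you can reach the fitted StandardScaler through the ColumnTransformer. Treat this as a sketch rather than the book's code; the 'pipeline-2' name is taken from the Pipeline output shown earlier in the chapter.

# assumes scaler_pipe is the fitted StandardScaler Pipeline from lesson 14.1
fitted_ct = scaler_pipe.named_steps['columntransformer']
std_scaler = fitted_ct.named_transformers_['pipeline-2'].named_steps['standardscaler']
std_scaler.mean_   # per-feature means subtracted from Age, Fare, and Parch
std_scaler.scale_  # per-feature standard deviations used for division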
14.4 Q&A: How do I turn off feature standardization within a grid search?
Although grid search is usually just used to tune parameter values, you can actually use grid search to turn on and off particular Pipeline steps. Thus you could use a grid search to decide whether or not feature standardization should be included.
To demonstrate this, let’s create a small dictionary called scaler_params.
The first entry tunes the C parameter of logistic regression, just like we’ve done before. The dictionary key is the step name, then two underscores, then the parameter name. The dictionary values are the possible values for that parameter.
The second entry is different: Rather than tuning the parameter of a Pipeline step, we’re tuning the Pipeline step itself. In this case, the dictionary key is simply the step name assigned by make_pipeline. The possible values are “passthrough”, which means skip this Pipeline step, or a MaxAbsScaler instance, which means keep MaxAbsScaler in the Pipeline.
scaler_params = {}
scaler_params['logisticregression__C'] = [0.1, 1, 10]
scaler_params['maxabsscaler'] = ['passthrough', MaxAbsScaler()]
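As a hypothetical extension (not shown in the book), the list of candidate values could include more than one scaler, as long as each candidate can handle sparse input. For instance, StandardScaler with centering turned off preserves sparsity:

from sklearn.preprocessing import StandardScaler

# with_mean=False skips the centering step, so the sparse matrix stays sparse
scaler_params['maxabsscaler'] = ['passthrough', MaxAbsScaler(),
                                 StandardScaler(with_mean=False)]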
We’ll create scaler_grid using the scaler_pipe and scaler_params objects, and then run the grid search as normal.
scaler_grid = GridSearchCV(scaler_pipe, scaler_params, cv=5, scoring='accuracy', n_jobs=-1)
scaler_grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('maxabsscaler', MaxAbsScaler()), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'logisticregression__C': [0.1, 1, 10], 'maxabsscaler': ['passthrough', MaxAbsScaler()]}, scoring='accuracy')
As you can see in the results, each C value was tried once with the MaxAbsScaler and once without.
pd.DataFrame(scaler_grid.cv_results_)
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logisticregression__C | param_maxabsscaler | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 0.013027 | 0.002237 | 0.005064 | 0.002104 | 0.1 | passthrough | {'logisticregression__C': 0.1, 'maxabsscaler':... | 0.798883 | 0.803371 | 0.764045 | 0.775281 | 0.803371 | 0.788990 | 0.016258 | 5 |
1 | 0.013508 | 0.003359 | 0.003475 | 0.000406 | 0.1 | MaxAbsScaler() | {'logisticregression__C': 0.1, 'maxabsscaler':... | 0.815642 | 0.803371 | 0.786517 | 0.752809 | 0.786517 | 0.788971 | 0.021159 | 6 |
2 | 0.014037 | 0.000829 | 0.003667 | 0.000580 | 1 | passthrough | {'logisticregression__C': 1, 'maxabsscaler': '... | 0.798883 | 0.825843 | 0.803371 | 0.786517 | 0.842697 | 0.811462 | 0.020141 | 2 |
3 | 0.013643 | 0.002607 | 0.003899 | 0.001107 | 1 | MaxAbsScaler() | {'logisticregression__C': 1, 'maxabsscaler': M... | 0.804469 | 0.825843 | 0.814607 | 0.775281 | 0.837079 | 0.811456 | 0.021123 | 3 |
4 | 0.013897 | 0.004783 | 0.004927 | 0.002253 | 10 | passthrough | {'logisticregression__C': 10, 'maxabsscaler': ... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 4 |
5 | 0.013281 | 0.002391 | 0.005218 | 0.002654 | 10 | MaxAbsScaler() | {'logisticregression__C': 10, 'maxabsscaler': ... | 0.815642 | 0.797753 | 0.825843 | 0.797753 | 0.859551 | 0.819308 | 0.022825 | 1 |
When using grid search with this limited set of parameters, the best results came from using a C value of 10 and including the MaxAbsScaler step.
scaler_grid.best_params_
{'logisticregression__C': 10, 'maxabsscaler': MaxAbsScaler()}
14.5 Q&A: Which models benefit from standardization?
Feature standardization tends to be useful any time a model's calculations involve distances between samples in feature space, such as K-Nearest Neighbors and Support Vector Machines.
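To see why, consider a hypothetical pair of samples with two features on very different scales: without standardization, the larger-scale feature dominates the distance calculation.

import numpy as np

a = np.array([1.0, 1000.0])
b = np.array([2.0, 1100.0])

# unscaled: the Euclidean distance is driven almost entirely by the second feature
np.linalg.norm(a - b)                              # about 100.0

# after dividing each feature by a (toy) standard deviation of [1, 100],
# both features contribute equally to the distance
np.linalg.norm((a - b) / np.array([1.0, 100.0]))   # about 1.41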
Feature standardization also tends to be useful for any models that incorporate regularization, such as a linear regression or logistic regression model with an L1 or L2 penalty, though we saw earlier in the chapter that this doesn’t apply to all solvers.
Notably, feature standardization will not benefit tree-based models such as random forests, because the splits a tree learns depend only on the ordering of feature values, not their scale.