14  Feature standardization

14.1 Standardizing numerical features

Some Machine Learning models benefit from a process called feature standardization. That’s because the objective functions of some models assume that all features are centered around zero and have variances of the same order of magnitude. If that assumption is violated, a given feature might dominate the objective function and the model won’t be able to learn from all of the features, which reduces its performance.

Why is feature standardization useful?

  • Some models assume that features are centered around zero and have similar variances
  • Those models may perform poorly if that assumption is incorrect

In this chapter, we’ll experiment with standardizing our features to see if that improves our model performance.

We’ll start with the most common approach, which is to use StandardScaler and only standardize the features that were originally numerical. We’ll import it from the preprocessing module and create an instance called “scaler” using the default parameters. For each feature, it will subtract the mean and divide by the standard deviation, which centers each feature around zero and gives it unit variance.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
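
To see what that means in practice, here’s a minimal sketch (using a made-up array rather than our dataset) of what StandardScaler does to a single feature:

import numpy as np
demo = np.array([[10.0], [20.0], [30.0], [40.0]])
# subtract the mean (25) and divide by the standard deviation (about 11.18)
StandardScaler().fit_transform(demo)
array([[-1.34164079],
       [-0.4472136 ],
       [ 0.4472136 ],
       [ 1.34164079]])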

Here’s a reminder of our existing ColumnTransformer. Note that our numerical features are Age, Fare, and Parch.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))

To scale our numerical features, we’ll make a Pipeline of imputation and scaling called “imp_scaler”.

imp_scaler = make_pipeline(imp, scaler)
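
As a quick illustration (with a tiny made-up DataFrame rather than our dataset), imp_scaler fills in missing values first and then standardizes the result. Recall that imp is a SimpleImputer with the default strategy, so it fills in the mean:

import pandas as pd
demo = pd.DataFrame({'Age': [20.0, 30.0, None, 40.0]})
# the missing value is imputed with the mean (30), then all values are standardized
imp_scaler.fit_transform(demo)
array([[-1.41421356],
       [ 0.        ],
       [ 0.        ],
       [ 1.41421356]])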

Then we’ll replace imp with imp_scaler in our ColumnTransformer, and apply it to Parch as well, which was previously a passthrough column.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_scaler, ['Age', 'Fare', 'Parch']))

Finally, we’ll create a Pipeline called “scaler_pipe” from the updated ColumnTransformer and our logistic regression model.

scaler_pipe = make_pipeline(ct, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['Age', 'Fare', 'Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

The cross-validated accuracy of this Pipeline is 0.810, which is nearly the same as our baseline accuracy.

cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8103383340656581

Pipeline accuracy scores:

  • Grid search (VC): 0.834
  • Grid search (LR with SelectFromModel ET): 0.832
  • Grid search (RF): 0.829
  • Grid search (LR): 0.828
  • Baseline (VC): 0.818
  • Baseline (LR): 0.811
  • Baseline (RF): 0.811
  • Baseline (LR with numerical features standardized): 0.810

That might be surprising, because regularized linear models often benefit from feature standardization. However, our particular logistic regression solver (liblinear) happens to be robust to unscaled data, and thus there was no benefit in this case.

The takeaway here is that you shouldn’t always assume that standardization of numerical features is necessary.

Why didn’t feature standardization help?

  • Regularized linear models often benefit from standardization
  • However, the liblinear solver is robust to unscaled data

14.2 Standardizing all features

In the previous lesson, we standardized all of the numerical features. An alternative approach is to standardize all features after transformation, even if they were not originally numerical. That’s what we’ll try in this lesson.

Our strategy will be to add standardization as the second step in the Pipeline, in between the ColumnTransformer and the model. However, our ColumnTransformer outputs a sparse matrix, and StandardScaler would destroy the sparseness by centering the data, likely resulting in a memory issue.

Why not use StandardScaler?

  • Our ColumnTransformer outputs a sparse matrix
  • Centering would cause memory issues by creating a dense matrix

Thus, we’re going to use an alternative scaler called MaxAbsScaler. We’ll import it from the preprocessing module and create an instance called “scaler”. MaxAbsScaler divides each feature by its maximum absolute value, which scales each feature to the range -1 to 1. Zeros are never changed, and thus sparsity is preserved.

from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
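
Here’s a minimal sketch (with a small made-up sparse matrix, not our actual feature matrix) showing that MaxAbsScaler divides each column by its maximum absolute value while leaving the zeros untouched:

from scipy.sparse import csr_matrix
demo = csr_matrix([[0, 4.0], [2.0, 0], [1.0, 8.0]])
# each column is divided by its maximum absolute value (2 and 8), and zeros stay zero
MaxAbsScaler().fit_transform(demo).toarray()
array([[0. , 0.5],
       [1. , 0. ],
       [0.5, 1. ]])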

First, we’ll reset our ColumnTransformer so that it doesn’t include the imp_scaler Pipeline.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
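
If you’d like to confirm for yourself that this ColumnTransformer outputs a sparse matrix, here’s one quick check (a sketch, not part of the original workflow):

from scipy.sparse import issparse
# should be True, since the OneHotEncoder and CountVectorizer columns are mostly zeros
issparse(ct.fit_transform(X))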

Then, we’ll update the scaler_pipe object to use MaxAbsScaler as the second step.

scaler_pipe = make_pipeline(ct, scaler, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('maxabsscaler', MaxAbsScaler()),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

When we cross-validate it, the accuracy is 0.811, which is exactly the same as our baseline accuracy.

I’m not surprised at this result. MaxAbsScaler has no effect on the columns output by OneHotEncoder and only a tiny effect on the columns output by CountVectorizer, and thus the approach in this lesson mostly just scales the numerical columns, which is essentially what we did in the previous lesson.

cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114556525014123

Pipeline accuracy scores:

  • Grid search (VC): 0.834
  • Grid search (LR with SelectFromModel ET): 0.832
  • Grid search (RF): 0.829
  • Grid search (LR): 0.828
  • Baseline (VC): 0.818
  • Baseline (LR): 0.811
  • Baseline (LR with all features standardized): 0.811
  • Baseline (RF): 0.811
  • Baseline (LR with numerical features standardized): 0.810

Although we didn’t see any benefits from standardization, there are certainly cases in which it will help. If you do try standardization, my suggestion is to try both of the approaches we used in this chapter and keep whichever one works better.

14.3 Q&A: How do I see what scaling was applied to each feature?

If you’re interested in seeing what scaling was applied to each feature, you can fit the Pipeline and then examine the scale_ attribute of the maxabsscaler step.

For example, the last three entries in the array correspond to the scaling for Age, Fare, and Parch. These are simply the maximum values of Age, Fare, and Parch in X. As a reminder, this is the scaling that will be applied to the features in X_new when making predictions.

scaler_pipe.fit(X, y)
scaler_pipe.named_steps['maxabsscaler'].scale_
array([  1.    ,   1.    ,   1.    , ...,  80.    , 512.3292,   6.    ])

And as you might expect, the scale_ attribute is also available when using StandardScaler.
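
For example, if you refit the Pipeline from the first lesson in this chapter (in which the StandardScaler is nested inside the ColumnTransformer), you could drill down to it like this. This is just a sketch, using the step and transformer names from the Pipeline output shown earlier:

# assumes the 14.1 version of scaler_pipe has been fit on X and y
fitted_ct = scaler_pipe.named_steps['columntransformer']
fitted_scaler = fitted_ct.named_transformers_['pipeline-2'].named_steps['standardscaler']
fitted_scaler.mean_       # the means subtracted from Age, Fare, and Parch
fitted_scaler.scale_      # the standard deviations that Age, Fare, and Parch are divided by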

14.4 Q&A: Which models benefit from standardization?

Feature standardization tends to be useful any time a model considers the distance between samples, such as K-Nearest Neighbors and Support Vector Machines.

Feature standardization also tends to be useful for any models that incorporate regularization, such as a linear regression or logistic regression model with an L1 or L2 penalty, though we saw earlier in the chapter that this doesn’t apply to all solvers.

Notably, feature standardization will not benefit any tree-based models such as random forests.
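
For instance, here’s a minimal sketch (using a synthetic dataset and K-Nearest Neighbors, neither of which is part of our workflow) of how you might place standardization before a distance-based model:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
# create a small synthetic classification dataset for illustration
X_demo, y_demo = make_classification(n_samples=500, random_state=1)
# scale the features, then let KNN compute distances on the scaled values
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
cross_val_score(knn_pipe, X_demo, y_demo, cv=5, scoring='accuracy').mean()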

When is feature standardization likely to be useful?

  • Useful:
    • Distance-based models (KNN, SVM)
    • Regularized models (linear or logistic regression with L1/L2)
  • Not useful:
    • Tree-based models (random forests)