14 Feature standardization

14.1 Standardizing numerical features

Some Machine Learning models benefit from a process called feature standardization. That's because the objective function of some models assumes that all features are centered around zero and have a variance of the same order of magnitude. If that assumption is incorrect, a given feature might dominate the objective function and the model won't be able to learn from all of the features, reducing its performance.

In this chapter, we'll experiment with standardizing our features to see if that improves our model performance.

We'll start with the most common approach, which is to use StandardScaler and only standardize features that were originally numerical. We'll import it from the preprocessing module and create an instance called "scaler" using the default parameters. For each feature, it will subtract the mean and divide by the standard deviation, which centers the data around zero and scales it.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
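As a quick standalone illustration (not part of the book's pipeline), here's what StandardScaler does to a single toy feature whose values are far from zero:

import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature: after scaling, it is centered around zero with unit variance
X_toy = np.array([[100.0], [200.0], [300.0]])
StandardScaler().fit_transform(X_toy)
# array([[-1.22474487],
#        [ 0.        ],
#        [ 1.22474487]])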
Here’s a reminder of our existing ColumnTransformer. Note that our numerical features are Age, Fare, and Parch.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
To scale our numerical features, we’ll make a Pipeline of imputation and scaling called “imp_scaler”.
imp_scaler = make_pipeline(imp, scaler)
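To build intuition for what this two-step Pipeline does, here's a small self-contained sketch (using fresh SimpleImputer and StandardScaler instances rather than the book's imp and scaler objects): the imputer fills the missing value with the column mean, and the scaler then centers and scales the result.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

demo_pipe = make_pipeline(SimpleImputer(), StandardScaler())
# the NaN is imputed with the column mean (20.0) before scaling
demo_pipe.fit_transform(np.array([[10.0], [20.0], [30.0], [np.nan]]))
# array([[-1.41421356],
#        [ 0.        ],
#        [ 1.41421356],
#        [ 0.        ]])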
Then we’ll replace imp with imp_scaler in our ColumnTransformer, and apply it to Parch as well, which was previously a passthrough column.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_scaler, ['Age', 'Fare', 'Parch']))
Finally, we’ll create a Pipeline called “scaler_pipe” using the updated ColumnTransformer.
scaler_pipe = make_pipeline(ct, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline-1', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('pipeline-2', Pipeline(steps=[('simpleimputer', SimpleImputer()), ('standardscaler', StandardScaler())]), ['Age', 'Fare', 'Parch'])])), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))])
The cross-validated accuracy of this Pipeline is 0.810, which is nearly the same as our baseline accuracy.
cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8103383340656581
That might be surprising, because regularized linear models often benefit from feature standardization. However, our particular logistic regression solver (liblinear) happens to be robust to unscaled data, and thus there was no benefit in this case.
The takeaway here is that you shouldn’t always assume that standardization of numerical features is necessary.
14.2 Standardizing all features
In the previous lesson, we standardized all of the numerical features. An alternative approach is to standardize all features after transformation, even if they were not originally numerical. That’s what we’ll try in this lesson.
Our strategy will be to add standardization as the second step in the Pipeline, in between the ColumnTransformer and the model. However, our ColumnTransformer outputs a sparse matrix, and StandardScaler would destroy the sparseness by centering the data, likely resulting in a memory issue.
Thus, we're going to use an alternative scaler called MaxAbsScaler. We'll import it from the preprocessing module and create an instance called "scaler". MaxAbsScaler divides each feature by its maximum absolute value, which scales each feature to the range -1 to 1. Zeros are never changed, and thus sparsity is preserved.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
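Here's a small self-contained sketch (not from the book) showing both properties on a toy sparse matrix: zeros stay zero, and a 0/1 column like the output of OneHotEncoder is left unchanged, while the numerical column is rescaled by its maximum absolute value.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# column 1 is a 0/1 indicator, column 2 is numerical with a maximum of 80
X_sparse = csr_matrix(np.array([[1.0,  0.0],
                                [0.0, 40.0],
                                [1.0, 80.0]]))
# the result is still a sparse matrix; toarray() is only for display
MaxAbsScaler().fit_transform(X_sparse).toarray()
# array([[1. , 0. ],
#        [0. , 0.5],
#        [1. , 1. ]])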
First, we’ll reset our ColumnTransformer so that it doesn’t include the imp_scaler Pipeline.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
Then, we’ll update the scaler_pipe object to use MaxAbsScaler as the second step.
scaler_pipe = make_pipeline(ct, scaler, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('maxabsscaler', MaxAbsScaler()), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))])
When we cross-validate it, the accuracy is 0.811, which is exactly the same as our baseline accuracy.
I'm not surprised by this result, because MaxAbsScaler has no effect on the columns output by OneHotEncoder and only a tiny effect on the columns output by CountVectorizer. Thus our approach in this lesson mostly affects the numerical columns, which is essentially what we did in the previous lesson.
cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114556525014123
Although we didn’t see any benefits from standardization, there are certainly cases in which it will help. If you do try out standardization, my suggestion is to try out both of the approaches that we used in this chapter and use whichever one works better.
14.3 Q&A: How do I see what scaling was applied to each feature?
If you’re interested in seeing what scaling was applied to each feature, you can fit the Pipeline and then examine the scale_ attribute of the maxabsscaler step.
For example, the last three entries in the array correspond to the scaling for Age, Fare, and Parch. These are simply the maximum values of Age, Fare, and Parch in X. As a reminder, this is the scaling that will be applied to the features in X_new when making predictions.
scaler_pipe.fit(X, y)
scaler_pipe.named_steps['maxabsscaler'].scale_
array([ 1. , 1. , 1. , ..., 80. , 512.3292, 6. ])
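If you want to verify those values, one option (assuming X is the pandas DataFrame used throughout the chapter) is to compare them against the column maxima directly:

# hypothetical sanity check: these should match the last three entries of scale_
X[['Age', 'Fare', 'Parch']].max()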
And as you might expect, the scale_ attribute is also available when using StandardScaler.
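For example, if you refit the StandardScaler version of the Pipeline from the first lesson, you can reach the fitted StandardScaler through the ColumnTransformer. Treat this as a sketch rather than the book's code; the 'pipeline-2' name is taken from the Pipeline output shown earlier in the chapter.

# assumes scaler_pipe is the fitted StandardScaler Pipeline from lesson 14.1
fitted_ct = scaler_pipe.named_steps['columntransformer']
std_scaler = fitted_ct.named_transformers_['pipeline-2'].named_steps['standardscaler']
std_scaler.mean_   # per-feature means subtracted from Age, Fare, and Parch
std_scaler.scale_  # per-feature standard deviations used for division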
14.4 Q&A: How do I turn off feature standardization within a grid search?
Although grid search is usually just used to tune parameter values, you can actually use grid search to turn on and off particular Pipeline steps. Thus you could use a grid search to decide whether or not feature standardization should be included.
To demonstrate this, let’s create a small dictionary called scaler_params.
The first entry tunes the C parameter of logistic regression, just like we’ve done before. The dictionary key is the step name, then two underscores, then the parameter name. The dictionary values are the possible values for that parameter.
The second entry is different: Rather than tuning the parameter of a Pipeline step, we’re tuning the Pipeline step itself. In this case, the dictionary key is simply the step name assigned by make_pipeline. The possible values are “passthrough”, which means skip this Pipeline step, or a MaxAbsScaler instance, which means keep MaxAbsScaler in the Pipeline.
scaler_params = {}
scaler_params['logisticregression__C'] = [0.1, 1, 10]
scaler_params['maxabsscaler'] = ['passthrough', MaxAbsScaler()]
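As a hypothetical extension (not shown in the book), the list of candidate values could include more than one scaler, as long as each candidate can handle sparse input. For instance, StandardScaler with centering turned off preserves sparsity:

from sklearn.preprocessing import StandardScaler

# with_mean=False skips the centering step, so the sparse matrix stays sparse
scaler_params['maxabsscaler'] = ['passthrough', MaxAbsScaler(),
                                 StandardScaler(with_mean=False)]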
We’ll create scaler_grid using the scaler_pipe and scaler_params objects, and then run the grid search as normal.
scaler_grid = GridSearchCV(scaler_pipe, scaler_params, cv=5, scoring='accuracy', n_jobs=-1)
scaler_grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('maxabsscaler', MaxAbsScaler()), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'logisticregression__C': [0.1, 1, 10], 'maxabsscaler': ['passthrough', MaxAbsScaler()]}, scoring='accuracy')
As you can see in the results, each C value was tried once with the MaxAbsScaler and once without.
pd.DataFrame(scaler_grid.cv_results_)
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logisticregression__C | param_maxabsscaler | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 0.013027 | 0.002237 | 0.005064 | 0.002104 | 0.1 | passthrough | {'logisticregression__C': 0.1, 'maxabsscaler':... | 0.798883 | 0.803371 | 0.764045 | 0.775281 | 0.803371 | 0.788990 | 0.016258 | 5 |
1 | 0.013508 | 0.003359 | 0.003475 | 0.000406 | 0.1 | MaxAbsScaler() | {'logisticregression__C': 0.1, 'maxabsscaler':... | 0.815642 | 0.803371 | 0.786517 | 0.752809 | 0.786517 | 0.788971 | 0.021159 | 6 |
2 | 0.014037 | 0.000829 | 0.003667 | 0.000580 | 1 | passthrough | {'logisticregression__C': 1, 'maxabsscaler': '... | 0.798883 | 0.825843 | 0.803371 | 0.786517 | 0.842697 | 0.811462 | 0.020141 | 2 |
3 | 0.013643 | 0.002607 | 0.003899 | 0.001107 | 1 | MaxAbsScaler() | {'logisticregression__C': 1, 'maxabsscaler': M... | 0.804469 | 0.825843 | 0.814607 | 0.775281 | 0.837079 | 0.811456 | 0.021123 | 3 |
4 | 0.013897 | 0.004783 | 0.004927 | 0.002253 | 10 | passthrough | {'logisticregression__C': 10, 'maxabsscaler': ... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 4 |
5 | 0.013281 | 0.002391 | 0.005218 | 0.002654 | 10 | MaxAbsScaler() | {'logisticregression__C': 10, 'maxabsscaler': ... | 0.815642 | 0.797753 | 0.825843 | 0.797753 | 0.859551 | 0.819308 | 0.022825 | 1 |
When using grid search with this limited set of parameters, the best results came from using a C value of 10 and including the MaxAbsScaler step.
scaler_grid.best_params_
{'logisticregression__C': 10, 'maxabsscaler': MaxAbsScaler()}
14.5 Q&A: Which models benefit from standardization?
Feature standardization tends to be useful any time a model's calculations involve distances between samples in feature space, such as K-Nearest Neighbors and Support Vector Machines.
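To see why, consider a hypothetical pair of samples with two features on very different scales: without standardization, the larger-scale feature dominates the distance calculation.

import numpy as np

a = np.array([1.0, 1000.0])
b = np.array([2.0, 1100.0])

# unscaled: the Euclidean distance is driven almost entirely by the second feature
np.linalg.norm(a - b)                              # about 100.0

# after dividing each feature by a (toy) standard deviation of [1, 100],
# both features contribute equally to the distance
np.linalg.norm((a - b) / np.array([1.0, 100.0]))   # about 1.41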
Feature standardization also tends to be useful for any models that incorporate regularization, such as a linear regression or logistic regression model with an L1 or L2 penalty, though we saw earlier in the chapter that this doesn’t apply to all solvers.
Notably, feature standardization will not benefit tree-based models such as random forests, because the splits a tree learns depend only on the ordering of feature values, not their scale.