14 Feature standardization
14.1 Standardizing numerical features
Some Machine Learning models benefit from a process called feature standardization. That’s because the objective function of those models assumes that all features are centered around zero and have variances of the same order of magnitude. If that assumption is incorrect, a given feature might dominate the objective function, and the model won’t be able to learn from all of the features, thus reducing its performance.
In this chapter, we’ll experiment with standardizing our features to see if that improves our model’s performance.
We’ll start with the most common approach, which is to use StandardScaler and only standardize the features that were originally numerical. We’ll import it from the preprocessing module and create an instance called scaler using the default parameters. For each feature, it will subtract the mean and divide by the standard deviation, which centers the data around zero and scales it to unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
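To see concretely what subtracting the mean and dividing by the standard deviation does, here’s a minimal sketch on a made-up toy column (illustrative only, not data from this chapter):

```python
# Toy illustration of StandardScaler (hypothetical data, not from our dataset)
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [2.0], [3.0], [4.0]])
demo_scaler = StandardScaler().fit(toy)
print(demo_scaler.mean_, demo_scaler.scale_)      # the mean and standard deviation it learned
print(demo_scaler.transform(toy).ravel())         # standardized values, centered around zero
print(((toy - toy.mean()) / toy.std()).ravel())   # manual (x - mean) / std gives the same result
```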
Here’s a reminder of our existing ColumnTransformer. Note that our numerical features are Age, Fare, and Parch.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
To scale our numerical features, we’ll make a Pipeline of imputation and scaling called imp_scaler.
imp_scaler = make_pipeline(imp, scaler)
Then, we’ll replace imp with imp_scaler in our ColumnTransformer and apply it to Parch as well, which was previously a passthrough column.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_scaler, ['Age', 'Fare', 'Parch']))
Finally, we’ll create a Pipeline called scaler_pipe using the updated ColumnTransformer.
scaler_pipe = make_pipeline(ct, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline-1',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer',
CountVectorizer(), 'Name'),
('pipeline-2',
Pipeline(steps=[('simpleimputer',
SimpleImputer()),
('standardscaler',
StandardScaler())]),
['Age', 'Fare', 'Parch'])])),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
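Before cross-validating, you can optionally take a quick look at what the updated ColumnTransformer produces. This is just a sanity check, assuming X is the training DataFrame used throughout the book:

```python
# Optional check: shape of the transformed feature matrix (assumes X is the training DataFrame)
transformed = ct.fit_transform(X)
print(transformed.shape)  # one row per sample in X; columns come from one-hot encoding, Name tokens, and the 3 scaled numerical features
```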
The cross-validated accuracy of this Pipeline is 0.810, which is nearly the same as our baseline accuracy.
cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8103383340656581
That might be surprising, because regularized linear models often benefit from feature standardization. However, our particular logistic regression solver (liblinear) happens to be robust to unscaled data, and thus there was no benefit in this case.
The takeaway here is that you shouldn’t always assume that standardization of numerical features is necessary.
14.2 Standardizing all features
In the previous lesson, we standardized the numerical features only. An alternative approach is to standardize all features after transformation, even if they were not originally numerical. That’s what we’ll try in this lesson.
Our strategy will be to add standardization as the second step in the Pipeline, in between the ColumnTransformer and the model. However, our ColumnTransformer outputs a sparse matrix, and StandardScaler would destroy the sparseness by centering the data, likely resulting in a memory issue.
Thus, we’re going to use an alternative scaler called MaxAbsScaler. We’ll import it from the preprocessing module and create an instance called scaler. MaxAbsScaler divides each feature by its maximum absolute value, which scales each feature to the range -1 to 1. Zeros are never changed, and thus sparsity is preserved.
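Here’s a minimal sketch of that behavior on a tiny sparse matrix (made-up toy data, not from our dataset):

```python
# Toy illustration: MaxAbsScaler preserves sparsity (hypothetical data)
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

sparse_toy = csr_matrix([[0.0,  10.0],
                         [5.0,   0.0],
                         [0.0, -20.0]])
scaled = MaxAbsScaler().fit_transform(sparse_toy)
print(scaled.toarray())  # each column divided by its max absolute value (5 and 20); zeros stay zero
```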
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
First, we’ll reset our ColumnTransformer so that it doesn’t include the imp_scaler Pipeline.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
Then, we’ll update the scaler_pipe object to use MaxAbsScaler as the second step.
scaler_pipe = make_pipeline(ct, scaler, logreg)
scaler_pipe
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer',
CountVectorizer(), 'Name'),
('simpleimputer',
SimpleImputer(),
['Age', 'Fare']),
('passthrough', 'passthrough',
['Parch'])])),
('maxabsscaler', MaxAbsScaler()),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
When we cross-validate it, the accuracy is 0.811, which is exactly the same as our baseline accuracy.
cross_val_score(scaler_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114556525014123
I’m not surprised at this result, because MaxAbsScaler has no effect on the columns output by OneHotEncoder and only a tiny effect on the columns output by CountVectorizer. As such, our approach in this lesson is mostly just affecting the numerical columns, which is what we did in the previous lesson.
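To see why, here’s a tiny sketch (toy data, not from our dataset) showing that MaxAbsScaler leaves 0/1 one-hot encoded columns untouched, since their maximum absolute value is already 1:

```python
# Toy illustration: MaxAbsScaler has no effect on 0/1 one-hot encoded columns
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

onehot_toy = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 0.0]])
print(MaxAbsScaler().fit_transform(onehot_toy))  # identical to the input
```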
Although we didn’t see any benefits from standardization, there are certainly cases in which it will help. If you do want to experiment with standardization, my suggestion is to try out both of the approaches that we used in this chapter and then choose whichever one works better.
14.3 Q&A: How do I see what scaling was applied to each feature?
If you’re interested in seeing what scaling was applied to each feature, you can fit the Pipeline and then examine the scale_ attribute of the maxabsscaler step.
For example, the last three entries in the array correspond to the scaling for Age, Fare, and Parch. These are simply the maximum values of Age, Fare, and Parch in X. As a reminder, this is the scaling that will be applied to the features in X_new when making predictions.
scaler_pipe.fit(X, y)
scaler_pipe.named_steps['maxabsscaler'].scale_
array([ 1. , 1. , 1. , ..., 80. , 512.3292, 6. ])
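If you want to verify those last three entries yourself, you can compare them against the columns in X (a quick check, assuming X is the training DataFrame used throughout the book):

```python
# The last three scale_ entries should match the maximum absolute values of these columns
X[['Age', 'Fare', 'Parch']].abs().max()
```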
As you might expect, the scale_ attribute is also available when using StandardScaler.
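For example, if you refit the Pipeline from the first lesson of this chapter (the one that used imp_scaler inside the ColumnTransformer), you could dig out the fitted StandardScaler as sketched below. This assumes that version of scaler_pipe has been fit, and that make_column_transformer auto-named the imp_scaler Pipeline 'pipeline-2', as shown in its output earlier:

```python
# Sketch: inspecting the fitted StandardScaler from the section 14.1 Pipeline
# (assumes that version of scaler_pipe has been fit on X and y)
ct_step = scaler_pipe.named_steps['columntransformer']
std_scaler = ct_step.named_transformers_['pipeline-2'].named_steps['standardscaler']
print(std_scaler.mean_)   # per-feature means that get subtracted
print(std_scaler.scale_)  # per-feature standard deviations that get divided by
```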
14.4 Q&A: How do I turn off feature standardization within a grid search?
Although grid search is usually just used to tune parameter values, you can actually use it to turn particular Pipeline steps on and off. Thus, you could use a grid search to decide whether or not feature standardization should be included.
To demonstrate this, let’s create a small dictionary called scaler_params:
- The first entry tunes the C parameter of logistic regression. The dictionary key is the step name, then two underscores, then the parameter name. The dictionary values are the possible values for that parameter.
- The second entry is different: Rather than tuning the parameter of a Pipeline step, we’re tuning the Pipeline step itself. In this case, the dictionary key is simply the step name assigned by make_pipeline. The possible values are 'passthrough', which means skip this Pipeline step, or a MaxAbsScaler instance, which means keep MaxAbsScaler in the Pipeline.
scaler_params = {}
scaler_params['logisticregression__C'] = [0.1, 1, 10]
scaler_params['maxabsscaler'] = ['passthrough', MaxAbsScaler()]
We’ll create scaler_grid using the scaler_pipe and scaler_params objects, and then run the grid search as normal.
scaler_grid = GridSearchCV(scaler_pipe, scaler_params, cv=5,
                           scoring='accuracy', n_jobs=-1)
scaler_grid.fit(X, y)
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Fare']),
('passthrough',
'passthrough',
['Parch'])])),
('maxabsscaler', MaxAbsScaler()),
('logisticregression',
LogisticRegression(random_state=1,
solver='liblinear'))]),
n_jobs=-1,
param_grid={'logisticregression__C': [0.1, 1, 10],
'maxabsscaler': ['passthrough', MaxAbsScaler()]},
scoring='accuracy')
As you can see in the results, each C value was tried once with MaxAbsScaler and once without.
results = (pd.DataFrame(scaler_grid.cv_results_)
             .filter(regex='param_|mean_test|rank'))
results.columns = results.columns.str.split('__').str[-1]
results
|   | C | param_maxabsscaler | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 0 | 0.1 | passthrough | 0.788990 | 5 |
| 1 | 0.1 | MaxAbsScaler() | 0.788971 | 6 |
| 2 | 1 | passthrough | 0.811462 | 2 |
| 3 | 1 | MaxAbsScaler() | 0.811456 | 3 |
| 4 | 10 | passthrough | 0.809234 | 4 |
| 5 | 10 | MaxAbsScaler() | 0.819308 | 1 |
With this limited set of parameters, the best results came from using a C value of 10 and including the MaxAbsScaler step.
scaler_grid.best_params_
{'logisticregression__C': 10, 'maxabsscaler': MaxAbsScaler()}
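You can also check the corresponding accuracy with the best_score_ attribute, which should match the rank-1 mean_test_score in the results table above (about 0.819):

```python
scaler_grid.best_score_  # mean cross-validated accuracy of the best parameter combination
```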
14.5 Q&A: Which models benefit from standardization?
Feature standardization tends to be useful with any model that relies on the distances between samples, which are computed across features, such as K-Nearest Neighbors and Support Vector Machines.
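As a minimal sketch (not code from this chapter), this is how you might pair a scaler with a distance-based model. It assumes a purely numerical feature matrix X_num and target y, unlike the mixed-type X used above:

```python
# Sketch: scaling before a distance-based model (hypothetical X_num and y)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
# knn_pipe.fit(X_num, y) would scale the features before computing neighbor distances
```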
It also tends to be useful with any models that incorporate regularization, such as linear or logistic regression with an L1 or L2 penalty, though we saw earlier in the chapter that this doesn’t apply to all solvers.
Notably, feature standardization will not benefit any tree-based models such as random forests, since tree splits depend only on the ordering of values within each feature, which scaling doesn’t change.