In this chapter, we’re going to take a deep dive into how to efficiently tune our Pipeline
for maximum accuracy.
Let’s return to the topic of model evaluation.
As you might recall, we used cross-validation way back in chapter 2 to evaluate our most basic model. Since that chapter, we’ve been adding many more features without re-running cross-validation. That’s because any model evaluation procedure is highly unreliable with only 10 rows of data, so it would have been misleading to run cross-validation and compare the results. But now that we’re using the full dataset, cross-validation can once again be used.
We start by importing the cross_val_score function from the model_selection module. To cross_val_score, we can actually pass our entire Pipeline, along with X and y. Setting cv to 5 has been the default for cross_val_score since version 0.22, but I like to include it anyway for clarity. When we run it, cross_val_score outputs a mean accuracy of 0.811, which we’ll use as the baseline accuracy against which our future Pipelines can be compared.
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
Let’s talk about what actually happens “under the hood” when we run the cross_val_score function on a Pipeline:

1. cross_val_score splits the data into 5 folds. 4 out of 5 folds (meaning 80% of the data) are set aside for training, and the remaining fold (meaning 20% of the data) is set aside for testing.
2. The Pipeline’s fit method is run on the training portion. Thus the transformations specified in the ColumnTransformer are performed on the training portion, and the transformed training data is used to fit the model.
3. The Pipeline’s predict method is run on the testing portion. Thus the transformations learned during step 2 are applied to the testing portion, the transformed testing data is passed to the fitted model, and the model makes predictions.
4. The accuracy of those predictions is calculated, and the process repeats with a different fold held out each time. cross_val_score thus outputs 5 accuracy scores, and we take the mean of those scores.

One thing you might have noticed is that cross_val_score splits the data in step 1 before performing the transformations in steps 2 and 3. As a result, the imputation values for Age and Fare and the vocabulary for CountVectorizer are all computed 5 different times. Each time, these values are computed using the training set only, and then applied to both the training and testing sets.
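To make those steps concrete, here’s a simplified sketch of what cross_val_score is doing internally. This is not the actual implementation (which handles scoring, parallelism, and many other details), but it follows the same logic, assuming X is a DataFrame and y is a Series as in our case:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_pipe = clone(pipe)                                   # fresh, unfitted copy of the Pipeline
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])       # transformers and model learn from the training folds only
    predictions = fold_pipe.predict(X.iloc[test_idx])         # learned transformations applied to the testing fold
    scores.append(np.mean(predictions == y.iloc[test_idx]))   # accuracy for this fold
np.mean(scores)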
Alternatively, you could imagine performing all of the transformations first, and then splitting the data. This would be much faster, since the imputation values and the vocabulary would be computed only once on the full dataset.
So why does cross_val_score
split the data first? Because splitting the data before performing the transformations prevents data leakage, whereas performing the transformations on the full dataset before splitting the data would cause data leakage, since information about the testing set would be “leaked” into the model training process.
As we discussed in the previous chapter, this is one way that scikit-learn helps to shield you from data leakage.
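If you’re curious what the leaky approach would look like in code, here’s a sketch, shown only as a cautionary example and using the existing pipe and logreg objects (leaky_X is a name introduced just for this illustration):

# Don't do this: the transformations are learned from the full dataset before splitting,
# so the imputation values and vocabulary are influenced by rows that later serve as testing folds
leaky_X = pipe.named_steps['columntransformer'].fit_transform(X)
cross_val_score(logreg, leaky_X, y, cv=5, scoring='accuracy').mean()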
Now that we’ve calculated the baseline accuracy for our Pipeline
, the next step is to tune the hyperparameters for both the model and the transformers. Recall that we’ve been using the default parameters for most objects in the Pipeline
, and so tuning those parameters is likely to result in a more accurate model.
Before proceeding, let me briefly explain some terminology. In the field of statistics, “hyperparameters” are values that you set, whereas “parameters” are values learned from the data by the estimator during the fitting process.
For example, the C
value of logistic regression is called a hyperparameter because it’s something you can set and optimize, whereas the coefficients of a logistic regression model are called parameters because they’re learned from the data.
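As a tiny illustration of that distinction, using the logreg object we created in an earlier chapter:

logreg.C      # a hyperparameter: a value we set (here it's still the default of 1.0)
# coefficients are parameters: they only exist after fitting, for example:
# pipe.fit(X, y).named_steps['logisticregression'].coef_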
In this book, I’m generally going to follow scikit-learn’s conventions as I understand them: I’ll usually say “parameters” (rather than “hyperparameters”) for the values that you set, whether on a standalone model or transformer or on a Pipeline containing a model and transformers. Examples of parameters include the C and random_state values passed to the LogisticRegression class, and the strategy value passed to the SimpleImputer class.

With that being said, we’re going to use a scikit-learn class called GridSearchCV to perform the hyperparameter tuning. In a grid search, you define which values you want to try for each parameter, and it cross-validates every possible combination of those values.

We can actually use GridSearchCV to tune the entire Pipeline at once, including both the model and the transformers. This has two huge benefits over just tuning a model: the model parameters and transformer parameters are searched in combination, so we can find the best overall settings, and the transformations are performed within each fold of cross-validation by GridSearchCV, which prevents data leakage. Keep in mind that if we had instead done the data transformations in pandas, we would have missed out on both of these benefits.
In this lesson, we’re going to tune the model, and then in the next lesson, we’ll also tune the transformers.
For the LogisticRegression model, we’re going to tune two parameters:

- penalty, which is the type of regularization. For this parameter, the default value is 'l2', and we’re going to try the values 'l1' and 'l2'. And just to be clear, the first character of each of those values is a lowercase “L”.
- C, which is the amount of regularization. For this parameter, the default value is 1, and we’re going to try the values 0.1, 1, and 10.

Deciding which parameters to tune and what values to try requires both research and experience, and unfortunately, it’s different for every type of model.
In order to tune a Pipeline
with GridSearchCV
, we need to get the names of the Pipeline
steps from the named_steps
attribute. We’ll tune the logisticregression
step in this lesson, and we’ll tune the columntransformer
step in the next lesson.
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
To use GridSearchCV
, we need to create a dictionary in which each entry represents a parameter and the values we want to try for that parameter. We’ll start by creating an empty dictionary called params
, and then we’ll add the two entries.
For each dictionary entry, the key is the Pipeline
step name, followed by two underscores, followed by the parameter name. Thus the key for the first entry is 'logisticregression__penalty'
, and the key for the second entry is 'logisticregression__C'
.
Using two underscores is what allows GridSearchCV
to distinguish between the step name and the parameter name. Using a single underscore would be ambiguous, since a step name or parameter name can have an underscore within it.
The value for each dictionary entry is a list of the values you want to try for that parameter. Thus the value for the first entry is a list of 'l1'
and 'l2'
, and the value for the second entry is a list of 0.1, 1, and 10.
After adding the two entries, we’ll print out the params
dictionary just to make sure that it looks correct.
params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10]}
Now that we’ve created the parameter dictionary, we can set up the grid search. We import the GridSearchCV
class from the model_selection
module.
from sklearn.model_selection import GridSearchCV
Next, we create an instance of GridSearchCV
called grid
. We pass it the Pipeline
, the parameter dictionary, the number of folds, and the evaluation metric.
Finally, we run the grid search by fitting the grid object with X
and y
. Because our scikit-learn configuration is set to display diagrams, we see a diagram of the Pipeline
now that the grid search is complete.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))]), param_grid={'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer()
['Age', 'Fare']
SimpleImputer()
['Parch']
passthrough
LogisticRegression(random_state=1, solver='liblinear')
The results of the grid search are stored in an attribute called cv_results_
, which we’ll convert to a DataFrame. We’ll use a filter to only keep the columns we need, and rename the parameter columns to make them easier to read.
There are 6 rows because it ran cross-validation 6 times, which is every possible combination of the 2 values of penalty
and the 3 values of C
that we specified.
results = (pd.DataFrame(grid.cv_results_)
           .filter(regex='param_|mean_test|rank'))
results.columns = results.columns.str.split('__').str[-1]
results
 | C | penalty | mean_test_score | rank_test_score |
---|---|---|---|---|
0 | 0.1 | l1 | 0.783385 | 6 |
1 | 0.1 | l2 | 0.788990 | 5 |
2 | 1 | l1 | 0.814814 | 2 |
3 | 1 | l2 | 0.811462 | 3 |
4 | 10 | l1 | 0.818166 | 1 |
5 | 10 | l2 | 0.809234 | 4 |
Notice the rank_test_score column. We’ll use the DataFrame’s sort_values
method to sort the rows by that column in ascending order.
By examining the mean_test_score column, we can see that the best parameter combination resulted in a cross-validated accuracy of 0.818, which is higher than our baseline accuracy of 0.811.
We can see that the best accuracy occurred when C
was 10 and penalty
was 'l1'
, neither of which was the default value for that parameter.
results.sort_values('rank_test_score')
 | C | penalty | mean_test_score | rank_test_score |
---|---|---|---|---|
4 | 10 | l1 | 0.818166 | 1 |
2 | 1 | l1 | 0.814814 | 2 |
3 | 1 | l2 | 0.811462 | 3 |
5 | 10 | l2 | 0.809234 | 4 |
1 | 0.1 | l2 | 0.788990 | 5 |
0 | 0.1 | l1 | 0.783385 | 6 |
In the previous lesson, we built a grid search for tuning model parameters and found that the best accuracy occurred when C
was 10 and penalty
was 'l1'
. In this lesson, we’re going to expand the search to also include transformer parameters.
When expanding the search, you might first think that we should set C
to 10 and penalty
to 'l1'
, and then only search the transformer parameters, since that would be the most computationally efficient approach.
However, the better approach is actually to consider all of the values for C
and penalty
in combination with all of the transformer parameters. That’s because we’re searching for the best combination of all parameters, and since each parameter can influence what is optimal for the other parameters, the best combination might use a C
value other than 10 or a penalty
value other than 'l1'
.
All of that is to say that we’re going to expand the existing params
dictionary to include transformer parameters. And to include transformer parameters, we first need to figure out the transformer names.
From the previous lesson, you might recall that the first step in the Pipeline
is named columntransformer
(all lowercase). We’ll access that step using the named_steps
attribute, which then allows us to examine the named_transformers_
attribute of the ColumnTransformer
.
As a side note, named_transformers_
ends with an underscore because it’s set during the fit step, whereas named_steps
does not end with an underscore because it’s set when the Pipeline
instance is created.
Anyway, we can now see the transformer names. We’re going to tune a single parameter from three of the transformers. Normally I might tune more parameters, but for the sake of brevity I’m only going to tune three.
pipe.named_steps['columntransformer'].named_transformers_
{'pipeline': Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
'countvectorizer': CountVectorizer(),
'simpleimputer': SimpleImputer(),
'passthrough': 'passthrough'}
The first parameter we’re going to tune is the drop
parameter of OneHotEncoder
, which was added to scikit-learn in version 0.21 and which I discussed in lesson 3.6.
To add it to the params
dictionary, we specify the Pipeline
step name, which is columntransformer
. Then we specify the transformer name, which is pipeline
. Then we specify the step name of the inner Pipeline, which is onehotencoder
. Finally we specify the parameter name, which is drop
. All of these components are separated by two underscores.
The parameter values we’re going to try are None
and 'first'
. None
is the default, and it means don’t drop any columns, whereas 'first'
means drop the first column of each feature after encoding.
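If it helps to see the effect, here’s a small sketch on a made-up column (not our dataset) showing what drop='first' does. It uses sparse=False, the parameter name in the scikit-learn version used in this book, just so the result prints as a regular array:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Sex': ['male', 'female', 'male']})          # hypothetical toy column
OneHotEncoder(sparse=False).fit_transform(toy)                   # two columns: female, male
OneHotEncoder(drop='first', sparse=False).fit_transform(toy)     # first category (female) dropped, one column remains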
params['columntransformer__pipeline__onehotencoder__drop'] = [None, 'first']
If you’re ever unsure how to specify a parameter for a grid search, you can see all of the Pipeline
’s parameters by using the get_params
method followed by the keys
method. I’m converting the output to a list for easier readability. This list is also useful if you prefer to copy and paste the parameter names rather than typing them.
As you can see, there are many transformer and model parameters that we’re not tuning, many of which could be useful to tune given enough time and computational resources.
list(pipe.get_params().keys())
['memory',
'steps',
'verbose',
'columntransformer',
'logisticregression',
'columntransformer__n_jobs',
'columntransformer__remainder',
'columntransformer__sparse_threshold',
'columntransformer__transformer_weights',
'columntransformer__transformers',
'columntransformer__verbose',
'columntransformer__pipeline',
'columntransformer__countvectorizer',
'columntransformer__simpleimputer',
'columntransformer__passthrough',
'columntransformer__pipeline__memory',
'columntransformer__pipeline__steps',
'columntransformer__pipeline__verbose',
'columntransformer__pipeline__simpleimputer',
'columntransformer__pipeline__onehotencoder',
'columntransformer__pipeline__simpleimputer__add_indicator',
'columntransformer__pipeline__simpleimputer__copy',
'columntransformer__pipeline__simpleimputer__fill_value',
'columntransformer__pipeline__simpleimputer__missing_values',
'columntransformer__pipeline__simpleimputer__strategy',
'columntransformer__pipeline__simpleimputer__verbose',
'columntransformer__pipeline__onehotencoder__categories',
'columntransformer__pipeline__onehotencoder__drop',
'columntransformer__pipeline__onehotencoder__dtype',
'columntransformer__pipeline__onehotencoder__handle_unknown',
'columntransformer__pipeline__onehotencoder__sparse',
'columntransformer__countvectorizer__analyzer',
'columntransformer__countvectorizer__binary',
'columntransformer__countvectorizer__decode_error',
'columntransformer__countvectorizer__dtype',
'columntransformer__countvectorizer__encoding',
'columntransformer__countvectorizer__input',
'columntransformer__countvectorizer__lowercase',
'columntransformer__countvectorizer__max_df',
'columntransformer__countvectorizer__max_features',
'columntransformer__countvectorizer__min_df',
'columntransformer__countvectorizer__ngram_range',
'columntransformer__countvectorizer__preprocessor',
'columntransformer__countvectorizer__stop_words',
'columntransformer__countvectorizer__strip_accents',
'columntransformer__countvectorizer__token_pattern',
'columntransformer__countvectorizer__tokenizer',
'columntransformer__countvectorizer__vocabulary',
'columntransformer__simpleimputer__add_indicator',
'columntransformer__simpleimputer__copy',
'columntransformer__simpleimputer__fill_value',
'columntransformer__simpleimputer__missing_values',
'columntransformer__simpleimputer__strategy',
'columntransformer__simpleimputer__verbose',
'logisticregression__C',
'logisticregression__class_weight',
'logisticregression__dual',
'logisticregression__fit_intercept',
'logisticregression__intercept_scaling',
'logisticregression__l1_ratio',
'logisticregression__max_iter',
'logisticregression__multi_class',
'logisticregression__n_jobs',
'logisticregression__penalty',
'logisticregression__random_state',
'logisticregression__solver',
'logisticregression__tol',
'logisticregression__verbose',
'logisticregression__warm_start']
Moving along, the second parameter we’re going to tune is the ngram_range
parameter of CountVectorizer
.
Again, we specify the Pipeline
step name, then the transformer name, and then the parameter name. Note that these three components are separated by double underscores, but there’s just a single underscore within ngram_range
because that’s part of the parameter name.
The parameter values we’re going to try are the tuples (1, 1)
and (1, 2)
. (1, 1)
is the default, and it creates a single feature from each word. (1, 2)
creates features from both single words, known as unigrams, and word pairs, known as bigrams.
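Here’s a small sketch on a made-up string (not one of our passenger names) showing the difference: with ngram_range=(1, 2), the vocabulary includes both the individual words and the adjacent word pairs.

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(['Smith, Mr. John'])               # hypothetical name-like string
vect.get_feature_names()                    # ['john', 'mr', 'mr john', 'smith', 'smith mr']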
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
The third parameter we’re going to tune is the add_indicator
parameter of SimpleImputer
, which was added to scikit-learn in version 0.21 and which I discussed in lesson 7.4.
Once again, we specify the Pipeline
step name, then the transformer name, and then the parameter name.
The parameter values we’re going to try are False
and True
. False
is the default, and it does not add a missing indicator column, whereas True
does add a missing indicator column.
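As a quick sketch of what the indicator looks like, here’s add_indicator applied to a made-up numeric column with one missing value (the second output column is the missing indicator):

import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[20.0], [np.nan], [40.0]])     # hypothetical column with one missing value
SimpleImputer(add_indicator=True).fit_transform(toy)
# array([[20., 0.],
#        [30., 1.],
#        [40., 0.]])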
params['columntransformer__simpleimputer__add_indicator'] = [False, True]
Before running the grid search, we’ll print out the params
dictionary. By multiplying 2 by 3 by 2 by 2 by 2, we can calculate that there are now 48 parameter combinations, and thus the grid search will take about 8 times longer than the previous search.
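If you’d rather not do that multiplication by hand, a quick sketch like this counts the combinations directly from the params dictionary:

from itertools import product

len(list(product(*params.values())))    # 2 * 3 * 2 * 2 * 2 = 48 combinations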
As an aside, if we had used the Pipeline
and ColumnTransformer
classes instead of the make_pipeline
and make_column_transformer
functions, we could have customized the step names and transformer names, which would have made these parameter specifications a bit easier to read and write. You can watch lessons 4.9 and 4.10 for a review of that topic.
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10],
'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
Anyway, next we’ll recreate the grid
object with the new params
dictionary, and then we’ll run the grid search.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer()
['Age', 'Fare']
SimpleImputer()
['Parch']
passthrough
LogisticRegression(random_state=1, solver='liblinear')
Now that the search is complete, we’ll convert the search results into a DataFrame and sort it by the rank_test_score column.
As you can see from the mean_test_score column, the best accuracy of 0.828 is an improvement over the previous grid search, which had an accuracy of 0.818. Keep in mind that your exact results may differ based on your scikit-learn version along with other factors. However, there’s no randomness involved when you set cv
to an integer, and so your results will be the same every time you run this grid search.
results = (pd.DataFrame(grid.cv_results_)
           .filter(regex='param_|mean_test|rank'))
results.columns = results.columns.str.split('__').str[-1]
results.sort_values('rank_test_score')
 | ngram_range | drop | add_indicator | C | penalty | mean_test_score | rank_test_score |
---|---|---|---|---|---|---|---|
34 | (1, 2) | None | True | 10 | l1 | 0.828253 | 1 |
28 | (1, 2) | None | False | 10 | l1 | 0.824889 | 2 |
40 | (1, 2) | first | False | 10 | l1 | 0.824889 | 2 |
46 | (1, 2) | first | True | 10 | l1 | 0.822648 | 4 |
16 | (1, 1) | first | False | 10 | l1 | 0.820407 | 5 |
22 | (1, 1) | first | True | 10 | l1 | 0.819296 | 6 |
4 | (1, 1) | None | False | 10 | l1 | 0.818166 | 7 |
10 | (1, 1) | None | True | 10 | l1 | 0.817061 | 8 |
20 | (1, 1) | first | True | 1 | l1 | 0.814820 | 9 |
2 | (1, 1) | None | False | 1 | l1 | 0.814814 | 10 |
44 | (1, 2) | first | True | 1 | l1 | 0.813703 | 11 |
47 | (1, 2) | first | True | 10 | l2 | 0.812598 | 12 |
8 | (1, 1) | None | True | 1 | l1 | 0.812579 | 13 |
38 | (1, 2) | first | False | 1 | l1 | 0.812579 | 14 |
14 | (1, 1) | first | False | 1 | l1 | 0.812579 | 14 |
26 | (1, 2) | None | False | 1 | l1 | 0.812567 | 16 |
11 | (1, 1) | None | True | 10 | l2 | 0.811481 | 17 |
21 | (1, 1) | first | True | 1 | l2 | 0.811468 | 18 |
3 | (1, 1) | None | False | 1 | l2 | 0.811462 | 19 |
23 | (1, 1) | first | True | 10 | l2 | 0.810363 | 20 |
9 | (1, 1) | None | True | 1 | l2 | 0.810345 | 21 |
15 | (1, 1) | first | False | 1 | l2 | 0.810332 | 22 |
32 | (1, 2) | None | True | 1 | l1 | 0.810332 | 22 |
17 | (1, 1) | first | False | 10 | l2 | 0.809234 | 24 |
35 | (1, 2) | None | True | 10 | l2 | 0.809234 | 24 |
5 | (1, 1) | None | False | 10 | l2 | 0.809234 | 24 |
29 | (1, 2) | None | False | 10 | l2 | 0.808104 | 27 |
45 | (1, 2) | first | True | 1 | l2 | 0.808097 | 28 |
41 | (1, 2) | first | False | 10 | l2 | 0.806980 | 29 |
39 | (1, 2) | first | False | 1 | l2 | 0.805844 | 30 |
27 | (1, 2) | None | False | 1 | l2 | 0.805844 | 30 |
33 | (1, 2) | None | True | 1 | l2 | 0.804739 | 32 |
31 | (1, 2) | None | True | 0.1 | l2 | 0.793491 | 33 |
7 | (1, 1) | None | True | 0.1 | l2 | 0.793484 | 34 |
19 | (1, 1) | first | True | 0.1 | l2 | 0.791243 | 35 |
43 | (1, 2) | first | True | 0.1 | l2 | 0.790114 | 36 |
37 | (1, 2) | first | False | 0.1 | l2 | 0.789003 | 37 |
25 | (1, 2) | None | False | 0.1 | l2 | 0.788996 | 38 |
1 | (1, 1) | None | False | 0.1 | l2 | 0.788990 | 39 |
13 | (1, 1) | first | False | 0.1 | l2 | 0.787885 | 40 |
0 | (1, 1) | None | False | 0.1 | l1 | 0.783385 | 41 |
30 | (1, 2) | None | True | 0.1 | l1 | 0.783385 | 41 |
24 | (1, 2) | None | False | 0.1 | l1 | 0.783385 | 41 |
6 | (1, 1) | None | True | 0.1 | l1 | 0.783385 | 41 |
36 | (1, 2) | first | False | 0.1 | l1 | 0.777785 | 45 |
42 | (1, 2) | first | True | 0.1 | l1 | 0.777785 | 45 |
12 | (1, 1) | first | False | 0.1 | l1 | 0.777785 | 45 |
18 | (1, 1) | first | True | 0.1 | l1 | 0.777785 | 45 |
Rather than always examining the results
DataFrame, we can actually just access the single best score and the set of parameters that resulted in that score via attributes of the grid
object.
It’s worth noting that only the drop
parameter is using its default value, whereas the other four parameters are not using their default values.
grid.best_score_
0.828253091456908
grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': None,
'columntransformer__simpleimputer__add_indicator': True,
'logisticregression__C': 10,
'logisticregression__penalty': 'l1'}
It’s hard to say whether this truly is the best set of parameters, because some of the differences in accuracy between parameter combinations may be due to chance, based on which samples happened to appear in each fold. That’s just a limitation of basic cross-validation, and so all we can say with confidence is that this is a good combination of parameters.
Now that we’ve tuned both the model parameters and the transformer parameters, we want to use those parameters with the Pipeline
when making predictions.
GridSearchCV
actually makes this very easy. After locating the best set of parameters, it automatically refits the Pipeline
on X
and y
using the best set of parameters, and it stores that fitted Pipeline
as an attribute called best_estimator_
. And as you can see, that attribute is indeed a Pipeline
object.
type(grid.best_estimator_)
sklearn.pipeline.Pipeline
If we print out the best_estimator_
attribute and click on the components, we can see that the parameters of this Pipeline
match the best parameter set we located in the previous lesson.
grid.best_estimator_
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'), ('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('logisticregression', LogisticRegression(C=10, penalty='l1', random_state=1, solver='liblinear'))])
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'), ('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer(ngram_range=(1, 2))
['Age', 'Fare']
SimpleImputer(add_indicator=True)
['Parch']
passthrough
LogisticRegression(C=10, penalty='l1', random_state=1, solver='liblinear')
In order to make predictions using this Pipeline
, all we have to do is run the grid
object’s predict
method, which calls the predict
method of the best_estimator_
, and pass it X_new
.
grid.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
I just want to emphasize that this Pipeline
, with the best set of parameters, was automatically refit to the entire dataset. You always train your model on the entire dataset, meaning all samples for which you know the target value, before using it to make predictions on new data; otherwise, you’re throwing away valuable training data.
After completing a grid search, you may want to save the Pipeline
with the best set of parameters so that you can use it to make predictions later.
As we saw in the previous lesson, the Pipeline
with the best set of parameters is stored as an attribute of the GridSearchCV
object called best_estimator_
, so this is the object that we want to save.
type(grid.best_estimator_)
sklearn.pipeline.Pipeline
You can save a Pipeline
to a file using pickle, which is part of the Python standard library.
import pickle
We’ll use pickle’s dump
method to save the Pipeline
to a file called “pipe.pickle”.
with open('pipe.pickle', 'wb') as f:
    pickle.dump(grid.best_estimator_, f)
Then we can use pickle’s load
method to load the Pipeline
from the file into an object called pipe_from_pickle
.
with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)
pipe_from_pickle
is identical to grid.best_estimator_
, and so when we use pipe_from_pickle
to make predictions, these predictions are identical to the predictions made by the grid
object.
pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
One alternative to pickle is joblib, which is usually more efficient than pickle for scikit-learn objects. Although it’s not part of the Python standard library, joblib has been a dependency of scikit-learn since version 0.21.
import joblib
Just like pickle, you use joblib’s dump
method to save the Pipeline
to a file, which we’ll call “pipe.joblib”.
joblib.dump(grid.best_estimator_, 'pipe.joblib')
['pipe.joblib']
Then, we’ll use the load
method to load the Pipeline
from the file into an object called pipe_from_joblib
.
pipe_from_joblib = joblib.load('pipe.joblib')
Finally, we’ll use pipe_from_joblib
to make predictions.
pipe_from_joblib.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
To be clear, pickle and joblib are not limited to Pipeline
s and can be used with other scikit-learn objects, such as a standalone model object that is not inside a Pipeline
.
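For example, here’s a sketch that saves just the fitted LogisticRegression step from our tuned Pipeline (logreg_from_joblib is a name introduced only for this illustration). Keep in mind that if you load the model by itself, you’d be responsible for transforming new data into the same feature columns before calling predict:

joblib.dump(grid.best_estimator_.named_steps['logisticregression'], 'logreg.joblib')
logreg_from_joblib = joblib.load('logreg.joblib')     # standalone fitted model, no transformers attached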
There are a couple of warnings to keep in mind when working with pickle and joblib files: first, only load a file if you trust its source, since both formats can execute arbitrary code when the file is loaded; and second, a saved object may not load correctly (or may behave unexpectedly) under a different version of scikit-learn or Python than the one used to save it.
Finally, it’s worth mentioning that there are alternatives to pickle and joblib such as ONNX and PMML. These formats don’t capture the full model object, but instead save a representation that can be used to make predictions. One major benefit of these formats is that they are neither environment-specific nor architecture-specific.
Let’s recreate the GridSearchCV
object, but this time we’ll add the verbose
parameter and set it to 1. When we run the search, this parameter will cause two changes to the output:
- Before the search begins, a message is printed showing how many fits will be performed. In this case, 5 folds for each of 48 candidates means the Pipeline will be fit 240 times.
- The elapsed time is printed as the search progresses.

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', verbose=1)
grid.fit(X, y)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 0.5s
[Parallel(n_jobs=1)]: Done 199 tasks | elapsed: 2.4s
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy', verbose=1)
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer()
['Age', 'Fare']
SimpleImputer()
['Parch']
passthrough
LogisticRegression(random_state=1, solver='liblinear')
Now, let’s also add the n_jobs
parameter, set it to -1, and re-run the grid search. This instructs scikit-learn to use parallel processing with all of your CPUs to perform the search. If your machine has multiple processors, this will generally be faster, though in this case it took about the same amount of time.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', verbose=1,
                    n_jobs=-1)
grid.fit(X, y)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.8s
[Parallel(n_jobs=-1)]: Done 225 out of 240 | elapsed: 2.3s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 2.3s finished
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy', verbose=1)
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer()
['Age', 'Fare']
SimpleImputer()
['Parch']
passthrough
LogisticRegression(random_state=1, solver='liblinear')
If you find it useful to know how long a search takes, but verbose mode is a bit too verbose for you, another option is to remove the verbose
parameter and instead prefix the second line with %time
. This is known as an IPython line magic, and it will work as long as you’re using the Jupyter notebook or the IPython interpreter.
All this command does is tell you how long a particular line of code took to run. The number to focus on is the wall time.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
%time grid.fit(X, y)
CPU times: user 208 ms, sys: 7.68 ms, total: 216 ms
Wall time: 690 ms
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer()
['Age', 'Fare']
SimpleImputer()
['Parch']
passthrough
LogisticRegression(random_state=1, solver='liblinear')
My general recommendation is to set n_jobs
to -1 any time you’re running a grid search, which is what I’ll do for the rest of the book. However, it’s still a good idea to use %time
or verbose mode to confirm that parallel processing is actually reducing the search time on your particular machine.
When you provide a set of parameter values to GridSearchCV
, it will cross-validate every possible combination of those parameters. For example, we know that with this set of parameters, cross-validation will run 48 times.
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10],
'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
Let’s say that we wanted to try additional C
values for logistic regression. I’ll make a copy of the params
dictionary called more_params
, and then modify the C
parameter in this dictionary to have 6 possible values instead of 3.
more_params = params.copy()
more_params['logisticregression__C'] = [0.01, 0.1, 1, 10, 100, 1000]
Since there are twice as many C
values, we know that a grid search will take twice as long, meaning it will run cross-validation 96 times. But what if that grid search takes more time than we have available?
An alternative method we can use is called randomized search, which is implemented in the RandomizedSearchCV
class. We’ll import it from the model_selection
module and then create an instance.
The API is very similar to GridSearchCV
, except that you also specify the number of times it should run using the n_iter
parameter. In this case, we’ll set the number of iterations to be 10.
Each time it runs, it will pick out a set of parameters at random and cross-validate that parameter set. In other words, it does the same thing as GridSearchCV
, except that it picks out random combinations of parameters from the parameter dictionary rather than trying every single combination. Because there’s an element of randomness, we’ll also set the random_state
parameter to 1 for reproducibility.
We’ll use the fit
method to run the search, and because it will only try 10 combinations instead of 96 combinations, it will run about 10 times faster than a grid search would.
from sklearn.model_selection import RandomizedSearchCV
rand = RandomizedSearchCV(pipe, more_params, cv=5, scoring='accuracy',
                          n_iter=10, random_state=1, n_jobs=-1)
rand.fit(X, y)
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Far... n_jobs=-1, param_distributions={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000], 'logisticregression__penalty': ['l1', 'l2']}, random_state=1, scoring='accuracy')
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer()
['Age', 'Fare']
SimpleImputer()
['Parch']
passthrough
LogisticRegression(random_state=1, solver='liblinear')
By printing out the results of the search, you can see that it ran 10 times using random combinations of all of those parameters.
results = (pd.DataFrame(rand.cv_results_)
           .filter(regex='param_|mean_test|rank'))
results.columns = results.columns.str.split('__').str[-1]
results
 | penalty | C | add_indicator | drop | ngram_range | mean_test_score | rank_test_score |
---|---|---|---|---|---|---|---|
0 | l1 | 1 | True | first | (1, 1) | 0.814820 | 3 |
1 | l2 | 10 | False | first | (1, 1) | 0.809234 | 7 |
2 | l1 | 1000 | True | first | (1, 1) | 0.811437 | 5 |
3 | l2 | 1000 | False | None | (1, 2) | 0.810345 | 6 |
4 | l1 | 10 | False | first | (1, 2) | 0.824889 | 2 |
5 | l1 | 0.1 | False | first | (1, 2) | 0.777785 | 9 |
6 | l2 | 1 | True | None | (1, 2) | 0.804739 | 8 |
7 | l1 | 100 | True | first | (1, 1) | 0.813684 | 4 |
8 | l1 | 100 | False | first | (1, 2) | 0.827129 | 1 |
9 | l2 | 0.01 | True | first | (1, 2) | 0.744184 | 10 |
You might be surprised to know that the best score it found, 0.827, is almost as high as the best score found by our grid search earlier in the chapter, which was 0.828. That being said, we did try additional C
values in our randomized search, so the comparison isn’t entirely fair.
rand.best_score_
0.8271294959512898
Here’s the set of parameters that produced that score.
rand.best_params_
{'logisticregression__penalty': 'l1',
'logisticregression__C': 100,
'columntransformer__simpleimputer__add_indicator': False,
'columntransformer__pipeline__onehotencoder__drop': 'first',
'columntransformer__countvectorizer__ngram_range': (1, 2)}
There are four things I especially like about using a randomized search instead of a grid search:
First, randomized search will usually find the best result (or almost the best result) in far less time than grid search, which is what we saw above. This is because there are often a lot of parameter combinations that will produce similar results to one another.
Second, it’s easier to control the computational budget of a randomized search. You can test how long a small number of searches takes, and then if you have a certain amount of time available for a search, you can simply choose the number of iterations that can be completed within that time period.
Third, randomized search gives you the freedom to tune many more model and transformer parameters without worrying that it will take forever. You can try out a ton of different parameters for a short amount of time, and then narrow down which parameters to focus on based on what seems to be working. (We’ll see this in practice in the next chapter.)
Fourth, randomized search will sometimes produce even better results than grid search because you can try a finer grid. For example, let’s say you were tuning a parameter that allowed continuous values from 0 to 1. If you were using a grid search, you might try the values 0, 0.5, and 1. But if you were using randomized search, you might try the values 0, 0.01, 0.02, and so on. It may turn out that the best value for this parameter is around 0.3, and randomized search could help you to find that out, whereas this grid search would have no chance of finding that out.
If you do need to create a fine grid of numbers for a randomized search, one useful function is NumPy’s linspace
. For example, this code specifies that I want 101 equally spaced values, starting with 0 and ending with 1.
import numpy as np
np.linspace(0, 1, 101)
array([0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54,
0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65,
0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76,
0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87,
0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98,
0.99, 1. ])
Another similar function is NumPy’s logspace
. This code specifies that I want 6 values, from 10 to the negative 2nd power through 10 to the 3rd power.
np.logspace(-2, 3, 6)
array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03])
If you’re comfortable using the SciPy library, you can instead specify continuous parameters for a randomized search using SciPy distributions. However, I find it much easier to just use NumPy’s linspace
and logspace
functions.
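For reference, here’s a sketch of what the SciPy approach might look like, using a uniform distribution for C (dist_params and rand_dist are names introduced only for this illustration; you would then fit rand_dist as usual):

from scipy.stats import uniform

dist_params = more_params.copy()                                   # hypothetical copy just for this sketch
dist_params['logisticregression__C'] = uniform(loc=0, scale=1000)  # sample C anywhere between 0 and 1000
rand_dist = RandomizedSearchCV(pipe, dist_params, cv=5, scoring='accuracy',
                               n_iter=10, random_state=1, n_jobs=-1)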
When you’re building and tuning a modeling Pipeline
, it’s natural to wonder how you’ll know when you’re done. In other words, how good of a model is “good enough”? There are three ways that I tend to think about this question.
The first way is to ask the question: What is the minimum accuracy that we need to achieve for our model to be considered useful? In most cases, you want your model to at least outperform null accuracy, which is the accuracy you could achieve by always predicting the most frequent class.
To calculate the null accuracy for our training data, we use the value_counts
method on y
, and set normalize
to True
in order to display the counts as a percentage. From the results, we can see that class 0 is the most frequent class, and about 61.6% of the y
values are class 0.
y.value_counts(normalize=True)
0 0.616162
1 0.383838
Name: Survived, dtype: float64
Thus the null accuracy for this problem is 61.6%, since an uninformed model, also known as the null model, could achieve that accuracy simply by predicting class 0 in all cases. In other words, this is the accuracy level that we want to outperform, otherwise the model is not providing any value. Thankfully, all of our Pipeline
s are outperforming null accuracy by a considerable amount.
The second way to think about this question is to ask: What is the maximum accuracy we could eventually reach? For most real problems, it’s impossible to know how accurate your model could be if you did enough tuning and tried enough models. It’s also impossible to know how accurate your model could be if you gathered more samples or more features. The main exception to this is if you’re working on a well-studied research problem, because in that case there may be a state-of-the-art benchmark that everyone is trying to surpass.
Thus in most practical circumstances, you don’t set a target accuracy. Instead, you work to improve the model until you run out of time, money, or ideas.
The pipe
object is our Pipeline
that hasn’t been tuned by grid search. Recall that you can examine an individual Pipeline
step by using the named_steps
attribute. In this case, we’ll select the first step, which is our ColumnTransformer
.
pipe.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer()
['Age', 'Fare']
SimpleImputer()
['Parch']
passthrough
By passing X
to its fit_transform
method, we can see that the ColumnTransformer
outputs 1518 feature columns. As we saw in lesson 8.4, all except 9 of those features were created from the Name column by CountVectorizer
.
pipe.named_steps['columntransformer'].fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
The cross-validated accuracy of this Pipeline
is 0.811, which we’ve been calling the baseline accuracy against which other Pipeline
s can be compared.
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
Similarly, we can select the ColumnTransformer
from our Pipeline
that was tuned by grid search. Notice that the ngram_range
for CountVectorizer
is (1, 2)
, meaning CountVectorizer
will create features from both unigrams and bigrams in the Name column.
grid.best_estimator_.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'), ('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer(ngram_range=(1, 2))
['Age', 'Fare']
SimpleImputer(add_indicator=True)
['Parch']
passthrough
By using fit_transform
, we can see that this ColumnTransformer
outputs 3671 feature columns. Again, all except 9 of those features were created from the Name column.
grid.best_estimator_.named_steps['columntransformer'].fit_transform(X)
<891x3671 sparse matrix of type '<class 'numpy.float64'>'
with 10191 stored elements in Compressed Sparse Row format>
The cross-validated accuracy of this Pipeline
is 0.828.
grid.best_score_
0.828253091456908
Finally, let’s compare these two Pipeline
s to a Pipeline
that doesn’t include the Name column at all. First, we’ll create a ColumnTransformer
called no_name_ct
that excludes Name.
no_name_ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
As you can see, this ColumnTransformer
only outputs 9 feature columns.
no_name_ct.fit_transform(X).shape
(891, 9)
Then, we’ll add no_name_ct
to a Pipeline
called no_name_pipe
and cross-validate it. The accuracy is 0.783, which is significantly lower than the Pipeline
s that included the Name column. To be fair, this Pipeline
hasn’t been tuned, though honestly there is no hyperparameter tuning we could do to make it perform as well as the Pipeline
s that included the Name column.
no_name_pipe = make_pipeline(no_name_ct, logreg)
cross_val_score(no_name_pipe, X, y, cv=5, scoring='accuracy').mean()
0.7833908731404181
Here are some conclusions that we can draw from this experiment: Including the Name column in the Pipeline significantly increased the cross-validated accuracy, which means that adding those thousands of feature columns did not result in overfitting. Instead, it tells us that the Name column contains more predictive signal than noise with respect to the target.

It’s worth noting that there is additional tuning we could do to CountVectorizer to reduce the number of features it creates. However, there’s no way to know whether that would increase or decrease the Pipeline’s accuracy without actually trying it.
Recall that once a grid search is complete, GridSearchCV
automatically refits the Pipeline
on X
and y
and stores it as an attribute called best_estimator_
. Therefore, we can access the model coefficients by first selecting the logisticregression
step and then selecting the coef_
attribute.
grid.best_estimator_.named_steps['logisticregression'].coef_
array([[ 0.56431161, 0. , -0.08767203, ..., 0.01408723,
-0.43713268, -0.46358519]])
Ideally, we would also be able to get the names of the features that correspond to these coefficients by running the get_feature_names
method on the ColumnTransformer
step. However, get_feature_names
only works if all of the underlying transformers have a get_feature_names
method, and that is not the case here.
grid.best_estimator_.named_steps['columntransformer'].get_feature_names()
AttributeError: Transformer pipeline does not provide get_feature_names
Instead, as we saw previously in lesson 8.4, you would have to inspect the transformers one-by-one in order to determine the feature names.
grid.best_estimator_.named_steps['columntransformer'].transformers_
[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'),
('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])]
Note that starting in scikit-learn version 1.1, the get_feature_names_out
method should work on this ColumnTransformer
, since the get_feature_names_out
method will be available for all transformers.
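Here’s a sketch of what that would look like, assuming you’re running scikit-learn 1.1 or later (it won’t work in the version used in this book; feature_names is a name introduced only for this illustration):

feature_names = grid.best_estimator_.named_steps['columntransformer'].get_feature_names_out()
len(feature_names)    # expected to match the 3671 output columns we saw earlier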
When we perform a grid search, we’re trying to find the parameters that maximize the cross-validation score on a dataset. Thus, we’re using the same data to accomplish two separate goals:
- Choosing the best parameters for the Pipeline, which are stored in the best_params_ attribute.
- Estimating the likely performance of the Pipeline on new data when using those parameters, which is stored in the best_score_ attribute.

grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': None,
'columntransformer__simpleimputer__add_indicator': True,
'logisticregression__C': 10,
'logisticregression__penalty': 'l1'}
grid.best_score_
0.828253091456908
Using the same data for these two separate goals actually biases the Pipeline
to this dataset and can result in overly optimistic scores.
If your main objective is to choose the best parameters, then this process is totally fine. You’ll just have to accept that its actual performance on new data may be lower than the performance estimated by grid search.
But if you also need a realistic estimate of the Pipeline
’s performance on new data, then there’s an alternative process you can use, which I’ll walk you through in this lesson.
To start, we’ll import the train_test_split
function from the model_selection
module, and use it to split the data into training and testing sets, with 75% of the data as training and 25% of the data as testing. Note that I set the stratify
parameter to y
so that the class proportions will be approximately equal in the training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1,
                                                    stratify=y)
Next, we’ll create a new GridSearchCV
object called training_grid
. When we run the grid search, we’ll only pass it the training set so that the tuning process only takes the training set into account.
training_grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy',
                             n_jobs=-1)
training_grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
Here are the best parameters found by grid search on the training set.
training_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': 'first',
'columntransformer__simpleimputer__add_indicator': False,
'logisticregression__C': 10,
'logisticregression__penalty': 'l2'}
We’re not actually interested in the best score found during the grid search. Instead, we’re going to use the best parameters found by the grid search to make predictions for the testing set, and then evaluate the accuracy of those predictions. We can do this by passing the testing set to the training_grid
’s score
method.
The accuracy it outputs is 0.816, which is a more realistic estimate of how the Pipeline will perform on new data, since the testing set is brand new data that the Pipeline has never seen. However, it’s still just a single estimate from a single train/test split, so there’s no way to know how precise this value is.
training_grid.score(X_test, y_test)
0.8161434977578476
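As an aside (this is my own suggestion, not part of the process outlined above), one way to get a rough sense of that imprecision is to repeat the split and grid search with a few different random_state values and see how much the test-set accuracy varies:

# illustrative only: repeat the split and search with different seeds
for seed in [0, 1, 2]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                               random_state=seed, stratify=y)
    g = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
    g.fit(X_tr, y_tr)
    print(seed, g.score(X_te, y_te))

If the printed accuracies vary widely, any single test-set score should be treated with extra caution.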
Now that we’ve found the best parameters for the Pipeline
and estimated its likely performance on new data, our final step is to actually make predictions on new data. Before making predictions, it’s critical that we train the Pipeline
on all of our data, meaning the entirety of X
and y
, otherwise we’re throwing away valuable data.
In other words, we can’t simply use the training_grid
’s predict
method since it was only refit on X_train
and y_train
. Instead, we need to save the Pipeline
with the best parameters, which we’ll call best_pipe
, and fit it to X
and y
.
best_pipe = training_grid.best_estimator_
best_pipe.fit(X, y)
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder(drop='first'))]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('logisticregression', LogisticRegression(C=10, random_state=1, solver='liblinear'))])
Now we can make predictions on new data.
best_pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
If you decide that you’re going to follow the process that I’ve just outlined, then there are two guidelines that are important to follow:
First, you should only use the testing set for evaluating Pipeline
performance one time. If you keep tuning the Pipeline
again and again, each time checking its performance on the testing set, you’re essentially tuning the Pipeline
to the particulars of the testing set. At that point, it no longer functions as an independent data source and thus its performance estimates will become less reliable.
Second, it’s important that you have enough data overall in order for the training and testing sets to both be sufficiently large once the dataset has been split:

1. If the training set is too small, then the grid search won’t have enough data to find the parameters that are truly best for the Pipeline.
2. If the testing set is too small, then it won’t provide a reliable estimate of the Pipeline’s performance.

Both of these situations would defeat the purpose of splitting the dataset, and thus this approach is best when you have a large enough dataset. Unfortunately, it’s difficult to say in the abstract how much data is “enough”, since that depends on the particulars of the dataset and the problem.
Earlier in this chapter, we tuned the regularization parameters of logistic regression. In this lesson, I’ll briefly explain what regularization actually is.
Regularization is a process that constrains the size of a model’s coefficients in order to minimize overfitting. Overfitting is when your model fits too closely to patterns in the training data, which causes your model not to perform well when it makes predictions on new data.
Regularization minimizes overfitting by reducing the variance of the model. Thus if you believe a model is too complex, regularization will reduce the error due to variance more than it increases the error due to bias, resulting in a model that is more likely to generalize to new data.
In simpler terms, regularization makes your model a bit less flexible so that it’s more likely to follow the true patterns in the data and less likely to follow the noise. Regularization is especially useful when you have outliers in the training data, because regularization decreases the influence that outliers have on the model.
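To make that concrete, here is a minimal sketch (assuming the pipe, X, and y objects from earlier in the chapter) that compares a strongly regularized and a weakly regularized logistic regression; with a smaller value of C, the coefficients are pulled closer to zero.

import numpy as np

# for LogisticRegression, smaller C means stronger regularization
for C in [0.1, 10]:
    pipe.set_params(logisticregression__C=C)
    pipe.fit(X, y)
    coefs = pipe.named_steps['logisticregression'].coef_
    print(C, np.abs(coefs).mean())

You should see a smaller mean absolute coefficient value for C=0.1 than for C=10, which is regularization constraining the size of the coefficients.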