In this chapter, we’re going to take a deep dive into how to efficiently tune our Pipeline for maximum accuracy.
Let’s first return to the topic of model evaluation. As you might recall, we used cross-validation back in chapter 2 to evaluate our most basic model. Since that chapter, we’ve added many more features without re-running cross-validation. That’s because any model evaluation procedure is highly unreliable with only 10 rows of data, and thus it would have been misleading to run cross-validation and compare the results.
Now that we’re using the full dataset, cross-validation can once again be used. Here’s how we’ll use it to evaluate our Pipeline:
- We’ll import the cross_val_score function from the model_selection module.
- Rather than passing it just a model, we can actually pass cross_val_score the entire Pipeline.
- We’ll also pass it X and y, along with the number of folds and the evaluation metric. (cv=5 has been the default for cross_val_score since version 0.22, but I like to include it anyway for clarity.)

When we run it, cross_val_score outputs a mean accuracy of 0.811, which we’ll use as the baseline accuracy against which our future Pipelines can be compared.
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
Let’s talk about what actually happens “under the hood” when we run the cross_val_score function on a Pipeline:
1. cross_val_score splits the data into 5 folds. 4 out of 5 folds (meaning 80% of the data) are set aside for training, and the remaining fold (meaning 20% of the data) is set aside for testing.
2. The Pipeline’s fit method is run on the training portion. Thus the transformations specified in the ColumnTransformer are performed on the training portion, and the transformed training data is used to fit the model.
3. The Pipeline’s predict method is run on the testing portion. Thus the transformations learned during step 2 are applied to the testing portion, and the fitted model makes predictions on the transformed testing data.
4. Steps 2 and 3 are repeated for each of the 5 folds, cross_val_score outputs 5 accuracy scores, and we take the mean of those scores.

One thing you might have noticed is that cross_val_score splits the data in step 1 before performing the transformations in steps 2 and 3. As a result, the imputation values for Age and Fare and the vocabulary for CountVectorizer are all computed 5 different times. Each time, these values are computed using the training set only, and then applied to both the training and testing sets.
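Here’s a rough sketch of my own (not the book’s code) that mimics this procedure by hand; it assumes StratifiedKFold is what generates the 5 folds, which is cross_val_score’s default behavior for classification problems:

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # step 2: transformers and model are fit on the training folds only
    pipe.fit(X.iloc[train_idx], y.iloc[train_idx])
    # step 3: the transformations learned above are applied to the testing fold
    predictions = pipe.predict(X.iloc[test_idx])
    scores.append(accuracy_score(y.iloc[test_idx], predictions))
sum(scores) / len(scores)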
Alternatively, you could imagine performing all of the transformations first, and then splitting the data. This would be much faster, since the imputation values and the vocabulary would be computed only once on the full dataset.
So why does cross_val_score split the data first? Because splitting the data before performing the transformations prevents data leakage, whereas performing the transformations on the full dataset before splitting the data would cause data leakage, since information about the testing set would be “leaked” into the model training process.
As we discussed in the previous chapter, this is one way that scikit-learn helps to shield you from data leakage.
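For contrast, here’s a sketch of the leaky alternative described above (shown only for illustration, not something you should actually do): the ColumnTransformer learns its imputation values and vocabulary from all of the rows, and only the model itself is cross-validated.

# data leakage: transformations are learned from the full dataset before splitting
X_leaky = pipe.named_steps['columntransformer'].fit_transform(X)
cross_val_score(pipe.named_steps['logisticregression'], X_leaky, y,
                cv=5, scoring='accuracy').mean()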
Now that we’ve calculated the baseline accuracy for our Pipeline, the next step is to tune the hyperparameters for both the model and the transformers. Recall that we’ve been using the default parameters for most objects in the Pipeline, and so tuning those parameters is likely to result in a more accurate model.
Before proceeding, I’ll briefly explain some terminology. In the field of statistics, “hyperparameters” are values that you set, whereas “parameters” are values learned from the data by the estimator during the fitting process.
For example, the C value of logistic regression is called a hyperparameter because it’s something you can set and optimize, whereas the coefficients of a logistic regression model are called parameters because they’re learned from the data.
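As a tiny illustration (using made-up toy data, not our dataset), C is chosen before fitting, while the coefficients only exist after fitting:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

lr = LogisticRegression(C=1.0, solver='liblinear')  # C is a hyperparameter that we set
lr.fit(X_toy, y_toy)
lr.coef_, lr.intercept_                             # the coefficients are parameters learned from the data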
In this book, I’m generally going to follow scikit-learn’s conventions as I understand them:
- scikit-learn uses the term “estimator” for any object that learns from data, whether that’s a model or a Pipeline containing a model and transformers.
- scikit-learn uses the term “parameters” (rather than “hyperparameters”) for the values that you set, such as the C and random_state values passed to the LogisticRegression class, and the strategy value passed to the SimpleImputer class.

With that being said, we’re going to use a scikit-learn class called GridSearchCV to perform the hyperparameter tuning. In a grid search, you define which values you want to try for each parameter, and it cross-validates every possible combination of those values.
We can actually use GridSearchCV to tune the entire Pipeline at once, including both the model and the transformers. This has two huge benefits over just tuning a model:
- We can search for the best combination of model parameters and transformer parameters at the same time, since each one can affect which values are optimal for the others.
- The transformations remain inside the cross-validation process, so the tuning is performed without data leakage, entirely within GridSearchCV.

Keep in mind that if we had instead done the data transformations in pandas, we would have missed out on both of these benefits.
In this lesson, we’re going to tune the model, and then in the next lesson, we’ll also tune the transformers.
For the LogisticRegression model, we’re going to tune two parameters:
- penalty, which is the type of regularization. For this parameter, the default value is 'l2', and we’re going to try the values 'l1' and 'l2'. (To be clear, the first character of each of those values is a lowercase “L”.)
- C, which controls the amount of regularization. For this parameter, the default value is 1, and we’re going to try the values 0.1, 1, and 10. (Smaller values of C specify stronger regularization.)

Deciding which parameters to tune and what values to try requires both research and experience, and unfortunately, it’s different for every type of model.
In order to tune a Pipeline with GridSearchCV, we need to get the names of the Pipeline steps from the named_steps attribute. We’ll tune the logisticregression step in this lesson, and we’ll tune the columntransformer step in the next lesson.
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
To use GridSearchCV, we need to create a dictionary in which each entry represents a parameter and the values we want to try for that parameter:
- Each key is the Pipeline step name, followed by two underscores, followed by the parameter name. Thus the key for the first entry is 'logisticregression__penalty', and the key for the second entry is 'logisticregression__C'. Using two underscores is what allows GridSearchCV to distinguish between the step name and the parameter name. Using a single underscore would be ambiguous, since a step name or parameter name can have an underscore within it.
- Each value is a list of the values we want to try for that parameter. Thus the value for the first entry is a list of 'l1' and 'l2', and the value for the second entry is a list of 0.1, 1, and 10.

We’ll create an empty dictionary called params, add these two entries, and then print it out just to make sure that it looks correct.
params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10]}
Now that we’ve created the parameter dictionary, we can set up the grid search. We’ll import the GridSearchCV class from the model_selection module.
from sklearn.model_selection import GridSearchCV
Next, we’ll create an instance of GridSearchCV called grid. We’ll pass it the Pipeline, the parameter dictionary, the number of folds, and the evaluation metric.
Finally, we’ll run the grid search by fitting the grid object to X and y. Because our scikit-learn configuration is set to display diagrams, it outputs a diagram of the Pipeline once the grid search is complete.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Fare']),
('passthrough',
'passthrough',
['Parch'])])),
('logisticregression',
LogisticRegression(random_state=1,
solver='liblinear'))]),
param_grid={'logisticregression__C': [0.1, 1, 10],
'logisticregression__penalty': ['l1', 'l2']},
scoring='accuracy')
The results of the grid search are stored in an attribute called cv_results_, which we’ll convert to a DataFrame. We’ll use a filter to only keep the columns we need, and rename the parameter columns to make them easier to read.
The DataFrame contains 6 rows because cross-validation ran 6 times, which is every possible combination of the 2 values of penalty and the 3 values of C that we specified.
results = (pd.DataFrame(grid.cv_results_)
.filter(regex='param_|mean_test|rank'))
results.columns = results.columns.str.split('__').str[-1]
results
|   | C | penalty | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 0 | 0.1 | l1 | 0.783385 | 6 |
| 1 | 0.1 | l2 | 0.788990 | 5 |
| 2 | 1 | l1 | 0.814814 | 2 |
| 3 | 1 | l2 | 0.811462 | 3 |
| 4 | 10 | l1 | 0.818166 | 1 |
| 5 | 10 | l2 | 0.809234 | 4 |
Notice the rank_test_score column. We’ll use the DataFrame’s sort_values method to sort the rows by that column in ascending order.
By examining the mean_test_score column, we can see that the best parameter combination resulted in a cross-validated accuracy of 0.818, which is higher than our baseline accuracy of 0.811.
We also see that the best accuracy occurred when C was 10 and penalty was 'l1', neither of which was the default value for that parameter.
results.sort_values('rank_test_score')
|   | C | penalty | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 4 | 10 | l1 | 0.818166 | 1 |
| 2 | 1 | l1 | 0.814814 | 2 |
| 3 | 1 | l2 | 0.811462 | 3 |
| 5 | 10 | l2 | 0.809234 | 4 |
| 1 | 0.1 | l2 | 0.788990 | 5 |
| 0 | 0.1 | l1 | 0.783385 | 6 |
In the previous lesson, we built a grid search for tuning model parameters and found that the best accuracy occurred when C was 10 and penalty was 'l1'. In this lesson, we’re going to expand the search to also include transformer parameters.
When expanding the search, you might first think that we should set C to 10 and penalty to 'l1', and then only search the transformer parameters, since that would be the most computationally efficient approach.
However, the better approach is actually to consider all of the values for C and penalty in combination with all of the transformer parameters. That’s because we’re searching for the best combination of all parameters, and since each parameter can influence what is optimal for the other parameters, the best combination might use a C value other than 10 or a penalty value other than 'l1'.
All of that is to say that we’re going to expand the existing params dictionary to include transformer parameters. But to include transformer parameters, we first need to figure out the transformer names.
From the previous lesson, you might recall that the first step in the Pipeline is named columntransformer. We’ll access that step using the named_steps attribute, which then allows us to examine the named_transformers_ attribute of the ColumnTransformer.
As a side note, named_transformers_ ends with an underscore because it’s set during the fit step, whereas named_steps does not end with an underscore because it’s set when the Pipeline instance is created.
Anyway, we can now see the transformer names. We’re going to tune a single parameter from three of the transformers. Normally I might tune more parameters, but for the sake of brevity I’m only going to tune three.
pipe.named_steps['columntransformer'].named_transformers_
{'pipeline': Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
'countvectorizer': CountVectorizer(),
'simpleimputer': SimpleImputer(),
'passthrough': 'passthrough'}
The first parameter we’re going to tune is the drop parameter of OneHotEncoder, which was added to scikit-learn in version 0.21 and which I discussed in lesson 3.6.
The parameter values we’re going to try are None and 'first'. None is the default, and it means don’t drop any columns, whereas 'first' means drop the first column of each feature after encoding.
To add it to the params dictionary, we specify the Pipeline step name, which is columntransformer. Then we specify the transformer name, which is pipeline. Then we specify the step name of the inner Pipeline, which is onehotencoder. Finally we specify the parameter name, which is drop. All of these components are separated by two underscores.
params['columntransformer__pipeline__onehotencoder__drop'] = [None,
'first']
If you’re ever unsure how to specify a parameter for a grid search, you can see all of the Pipeline’s parameters by using the get_params method followed by the keys method. I’m converting the output to a list for easier readability. This list is also useful if you prefer to copy and paste the parameter names rather than typing them.
As you can see, there are many transformer and model parameters that we’re not tuning, many of which could be useful to tune given enough time and computational resources.
list(pipe.get_params().keys())
['memory',
'steps',
'verbose',
'columntransformer',
'logisticregression',
'columntransformer__n_jobs',
'columntransformer__remainder',
'columntransformer__sparse_threshold',
'columntransformer__transformer_weights',
'columntransformer__transformers',
'columntransformer__verbose',
'columntransformer__pipeline',
'columntransformer__countvectorizer',
'columntransformer__simpleimputer',
'columntransformer__passthrough',
'columntransformer__pipeline__memory',
'columntransformer__pipeline__steps',
'columntransformer__pipeline__verbose',
'columntransformer__pipeline__simpleimputer',
'columntransformer__pipeline__onehotencoder',
'columntransformer__pipeline__simpleimputer__add_indicator',
'columntransformer__pipeline__simpleimputer__copy',
'columntransformer__pipeline__simpleimputer__fill_value',
'columntransformer__pipeline__simpleimputer__missing_values',
'columntransformer__pipeline__simpleimputer__strategy',
'columntransformer__pipeline__simpleimputer__verbose',
'columntransformer__pipeline__onehotencoder__categories',
'columntransformer__pipeline__onehotencoder__drop',
'columntransformer__pipeline__onehotencoder__dtype',
'columntransformer__pipeline__onehotencoder__handle_unknown',
'columntransformer__pipeline__onehotencoder__sparse',
'columntransformer__countvectorizer__analyzer',
'columntransformer__countvectorizer__binary',
'columntransformer__countvectorizer__decode_error',
'columntransformer__countvectorizer__dtype',
'columntransformer__countvectorizer__encoding',
'columntransformer__countvectorizer__input',
'columntransformer__countvectorizer__lowercase',
'columntransformer__countvectorizer__max_df',
'columntransformer__countvectorizer__max_features',
'columntransformer__countvectorizer__min_df',
'columntransformer__countvectorizer__ngram_range',
'columntransformer__countvectorizer__preprocessor',
'columntransformer__countvectorizer__stop_words',
'columntransformer__countvectorizer__strip_accents',
'columntransformer__countvectorizer__token_pattern',
'columntransformer__countvectorizer__tokenizer',
'columntransformer__countvectorizer__vocabulary',
'columntransformer__simpleimputer__add_indicator',
'columntransformer__simpleimputer__copy',
'columntransformer__simpleimputer__fill_value',
'columntransformer__simpleimputer__missing_values',
'columntransformer__simpleimputer__strategy',
'columntransformer__simpleimputer__verbose',
'logisticregression__C',
'logisticregression__class_weight',
'logisticregression__dual',
'logisticregression__fit_intercept',
'logisticregression__intercept_scaling',
'logisticregression__l1_ratio',
'logisticregression__max_iter',
'logisticregression__multi_class',
'logisticregression__n_jobs',
'logisticregression__penalty',
'logisticregression__random_state',
'logisticregression__solver',
'logisticregression__tol',
'logisticregression__verbose',
'logisticregression__warm_start']
Moving along, the second parameter we’re going to tune is the ngram_range parameter of CountVectorizer.
The parameter values we’re going to try are the tuples (1, 1) and (1, 2). (1, 1) is the default, and it creates a single feature from each word. (1, 2) creates features from both single words, known as unigrams, and word pairs, known as bigrams.
Again, we specify the Pipeline step name, then the transformer name, and then the parameter name. Note that these three components are separated by double underscores, but there’s just a single underscore within ngram_range because that’s part of the parameter name.
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1),
(1, 2)]
The third parameter we’re going to tune is the add_indicator parameter of SimpleImputer, which was added to scikit-learn in version 0.21 and which I discussed in lesson 7.4.
The parameter values we’re going to try are False and True. False is the default, and it does not add a missing indicator column, whereas True does add a missing indicator column.
Once again, we specify the Pipeline step name, then the transformer name, and then the parameter name.
params['columntransformer__simpleimputer__add_indicator'] = [False, True]
Before running the grid search, we’ll print out the params dictionary. By multiplying 2 × 3 × 2 × 2 × 2, we can calculate that there are now 48 parameter combinations, and thus the grid search will take about 8 times longer than the previous search.
As an aside, if we had used the Pipeline and ColumnTransformer classes instead of the make_pipeline and make_column_transformer functions, we could have customized the step names and transformer names, which would have made these parameter specifications a bit easier to read and write. You can read lessons 4.9 and 4.10 for a review of that topic.
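For illustration, here’s a sketch of what that might look like (reusing the imp_ohe and logreg objects created earlier in the book, and choosing short hypothetical names like 'ct', 'vect', and 'clf' that aren’t used anywhere else):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer

ct = ColumnTransformer([('imp_ohe', imp_ohe, ['Embarked', 'Sex']),
                        ('vect', CountVectorizer(), 'Name'),
                        ('imp', SimpleImputer(), ['Age', 'Fare']),
                        ('pass', 'passthrough', ['Parch'])])
named_pipe = Pipeline([('ct', ct), ('clf', logreg)])

# the same parameters could then be specified with shorter keys, such as:
# 'clf__C', 'ct__vect__ngram_range', and 'ct__imp__add_indicator'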
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10],
'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
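As a quick check of that arithmetic, we can compute the number of combinations directly from the dictionary (a one-liner of my own, not from the book):

import numpy as np
np.prod([len(values) for values in params.values()])  # 2 * 3 * 2 * 2 * 2 = 48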
Next we’ll recreate the grid object with the new params dictionary, and then we’ll run the grid search.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Fare']),
(...
LogisticRegression(random_state=1,
solver='liblinear'))]),
param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
1),
(1,
2)],
'columntransformer__pipeline__onehotencoder__drop': [None,
'first'],
'columntransformer__simpleimputer__add_indicator': [False,
True],
'logisticregression__C': [0.1, 1, 10],
'logisticregression__penalty': ['l1', 'l2']},
scoring='accuracy')
Just like last time, we’ll convert the search results into a DataFrame and sort it by the rank_test_score column.
As you can see from the mean_test_score column, the best accuracy of 0.828 is an improvement over the previous grid search, which had an accuracy of 0.818. Keep in mind that your exact results may differ based on your scikit-learn version along with other factors. However, there’s no randomness involved when you set cv to an integer, and so your results will be the same every time you run this particular grid search.
results = (pd.DataFrame(grid.cv_results_)
.filter(regex='param_|mean_test|rank'))
results.columns = results.columns.str.split('__').str[-1]
results.sort_values('rank_test_score')
|   | ngram_range | drop | add_indicator | C | penalty | mean_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|
| 34 | (1, 2) | None | True | 10 | l1 | 0.828253 | 1 |
| 28 | (1, 2) | None | False | 10 | l1 | 0.824889 | 2 |
| 40 | (1, 2) | first | False | 10 | l1 | 0.824889 | 2 |
| 46 | (1, 2) | first | True | 10 | l1 | 0.822648 | 4 |
| 16 | (1, 1) | first | False | 10 | l1 | 0.820407 | 5 |
| 22 | (1, 1) | first | True | 10 | l1 | 0.819296 | 6 |
| 4 | (1, 1) | None | False | 10 | l1 | 0.818166 | 7 |
| 10 | (1, 1) | None | True | 10 | l1 | 0.817061 | 8 |
| 20 | (1, 1) | first | True | 1 | l1 | 0.814820 | 9 |
| 2 | (1, 1) | None | False | 1 | l1 | 0.814814 | 10 |
| 44 | (1, 2) | first | True | 1 | l1 | 0.813703 | 11 |
| 47 | (1, 2) | first | True | 10 | l2 | 0.812598 | 12 |
| 8 | (1, 1) | None | True | 1 | l1 | 0.812579 | 13 |
| 38 | (1, 2) | first | False | 1 | l1 | 0.812579 | 14 |
| 14 | (1, 1) | first | False | 1 | l1 | 0.812579 | 14 |
| 26 | (1, 2) | None | False | 1 | l1 | 0.812567 | 16 |
| 11 | (1, 1) | None | True | 10 | l2 | 0.811481 | 17 |
| 21 | (1, 1) | first | True | 1 | l2 | 0.811468 | 18 |
| 3 | (1, 1) | None | False | 1 | l2 | 0.811462 | 19 |
| 23 | (1, 1) | first | True | 10 | l2 | 0.810363 | 20 |
| 9 | (1, 1) | None | True | 1 | l2 | 0.810345 | 21 |
| 15 | (1, 1) | first | False | 1 | l2 | 0.810332 | 22 |
| 32 | (1, 2) | None | True | 1 | l1 | 0.810332 | 22 |
| 17 | (1, 1) | first | False | 10 | l2 | 0.809234 | 24 |
| 35 | (1, 2) | None | True | 10 | l2 | 0.809234 | 24 |
| 5 | (1, 1) | None | False | 10 | l2 | 0.809234 | 24 |
| 29 | (1, 2) | None | False | 10 | l2 | 0.808104 | 27 |
| 45 | (1, 2) | first | True | 1 | l2 | 0.808097 | 28 |
| 41 | (1, 2) | first | False | 10 | l2 | 0.806980 | 29 |
| 39 | (1, 2) | first | False | 1 | l2 | 0.805844 | 30 |
| 27 | (1, 2) | None | False | 1 | l2 | 0.805844 | 30 |
| 33 | (1, 2) | None | True | 1 | l2 | 0.804739 | 32 |
| 31 | (1, 2) | None | True | 0.1 | l2 | 0.793491 | 33 |
| 7 | (1, 1) | None | True | 0.1 | l2 | 0.793484 | 34 |
| 19 | (1, 1) | first | True | 0.1 | l2 | 0.791243 | 35 |
| 43 | (1, 2) | first | True | 0.1 | l2 | 0.790114 | 36 |
| 37 | (1, 2) | first | False | 0.1 | l2 | 0.789003 | 37 |
| 25 | (1, 2) | None | False | 0.1 | l2 | 0.788996 | 38 |
| 1 | (1, 1) | None | False | 0.1 | l2 | 0.788990 | 39 |
| 13 | (1, 1) | first | False | 0.1 | l2 | 0.787885 | 40 |
| 0 | (1, 1) | None | False | 0.1 | l1 | 0.783385 | 41 |
| 30 | (1, 2) | None | True | 0.1 | l1 | 0.783385 | 41 |
| 24 | (1, 2) | None | False | 0.1 | l1 | 0.783385 | 41 |
| 6 | (1, 1) | None | True | 0.1 | l1 | 0.783385 | 41 |
| 36 | (1, 2) | first | False | 0.1 | l1 | 0.777785 | 45 |
| 42 | (1, 2) | first | True | 0.1 | l1 | 0.777785 | 45 |
| 12 | (1, 1) | first | False | 0.1 | l1 | 0.777785 | 45 |
| 18 | (1, 1) | first | True | 0.1 | l1 | 0.777785 | 45 |
Rather than always examining the results DataFrame, we can actually just access the single best score and the set of parameters that resulted in that score via attributes of the grid object.
It’s worth noting that only the drop parameter is using its default value, whereas the other four parameters are not using their default values.
grid.best_score_
0.828253091456908
grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': None,
'columntransformer__simpleimputer__add_indicator': True,
'logisticregression__C': 10,
'logisticregression__penalty': 'l1'}
It’s hard to say whether this truly is the best set of parameters, because some of the differences in accuracy between parameter combinations may be due to chance, based on which samples happened to appear in each fold. That’s just a limitation of basic cross-validation, and so all we can say with confidence is that this is a good combination of parameters.
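One rough way to gauge that chance variability (my own sketch, not from the book) is to look at the standard deviation of the 5 fold scores for each combination, which GridSearchCV stores in cv_results_ as std_test_score:

import pandas as pd

# mean and standard deviation of the fold accuracies for the top combinations
(pd.DataFrame(grid.cv_results_)
   .sort_values('rank_test_score')
   [['mean_test_score', 'std_test_score']]
   .head())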
Now that we’ve tuned both the model parameters and the transformer parameters, we want to use those parameters with the Pipeline when making predictions.
GridSearchCV actually makes this very easy. After locating the best set of parameters, it automatically refits the Pipeline on X and y using the best set of parameters, and it stores that fitted Pipeline as an attribute called best_estimator_. As you can see, that attribute is indeed a Pipeline object.
type(grid.best_estimator_)
sklearn.pipeline.Pipeline
If we printed out the best_estimator_ attribute and clicked on the components, we would see that the parameters of this Pipeline match the best parameter set we located in the previous lesson.
grid.best_estimator_
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer',
CountVectorizer(ngram_range=(1,
2)),
'Name'),
('simpleimputer',
SimpleImputer(add_indicator=True),
['Age', 'Fare']),
('passthrough', 'passthrough',
['Parch'])])),
('logisticregression',
LogisticRegression(C=10, penalty='l1', random_state=1,
solver='liblinear'))])
In order to make predictions using this Pipeline, all we have to do is run the grid object’s predict method, which calls the predict method of the best_estimator_, and pass it X_new.
grid.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
I want to emphasize that this Pipeline, with the best set of parameters, was automatically refit to the entire dataset. You always train your model on the entire dataset, meaning all samples for which you know the target value, before using it to make predictions on new data. Otherwise, you would be throwing away valuable training data.
After completing a grid search, you may want to save the Pipeline with the best set of parameters so that you can use it to make predictions later.
As we saw in the previous lesson, the Pipeline with the best set of parameters is stored as an attribute of the GridSearchCV object called best_estimator_, so this is the object that we want to save.
type(grid.best_estimator_)
sklearn.pipeline.Pipeline
You can save a Pipeline to a file using pickle, which is part of the Python standard library.
import pickle
We’ll use pickle’s dump function to save the Pipeline to a file called “pipe.pickle”.
with open('pipe.pickle', 'wb') as f:
    pickle.dump(grid.best_estimator_, f)
Then we can use pickle’s load function to load the Pipeline from the file into an object called pipe_from_pickle.
with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)
pipe_from_pickle is identical to grid.best_estimator_, and so when we use pipe_from_pickle to make predictions, these predictions are identical to the predictions made by the grid object.
pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
One alternative to pickle is joblib, which is usually more efficient than pickle for scikit-learn objects. Although it’s not part of the Python standard library, joblib has been a dependency of scikit-learn since version 0.21.
import joblib
Just like pickle, you use joblib’s dump function to save the Pipeline to a file, which we’ll call “pipe.joblib”.
joblib.dump(grid.best_estimator_, 'pipe.joblib')
['pipe.joblib']
Then, we’ll use the load method to load the Pipeline from the file into an object called pipe_from_joblib.
pipe_from_joblib = joblib.load('pipe.joblib')
Finally, we’ll use pipe_from_joblib to make predictions.
pipe_from_joblib.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
To be clear, pickle and joblib are not limited to Pipelines and can be used with other scikit-learn objects, such as a standalone model object that is not inside a Pipeline.
There are a couple of warnings to keep in mind when working with pickle and joblib objects: only load files that come from a source you trust, since loading a pickled file can execute arbitrary code, and load them into an environment that uses the same versions of Python and scikit-learn that were used when saving, since these files are environment-specific.
Finally, it’s worth mentioning that there are alternatives to pickle and joblib such as ONNX and PMML. These formats don’t capture the full model object, but instead save a representation that can be used to make predictions. One major benefit of these formats is that they are neither environment-specific nor architecture-specific.
Let’s recreate the GridSearchCV object, but this time we’ll add the verbose parameter and set it to 1. When we run the search, this parameter will result in two changes to the output:
- Before the search starts, it will print out how many fits are about to take place. Since there are 48 parameter combinations and 5 folds, the Pipeline will be fit 240 times.
- It will print out the elapsed time as the search progresses.

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', verbose=1)
grid.fit(X, y)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 0.5s
[Parallel(n_jobs=1)]: Done 199 tasks | elapsed: 2.4s
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Fare']),
(...
LogisticRegression(random_state=1,
solver='liblinear'))]),
param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
1),
(1,
2)],
'columntransformer__pipeline__onehotencoder__drop': [None,
'first'],
'columntransformer__simpleimputer__add_indicator': [False,
True],
'logisticregression__C': [0.1, 1, 10],
'logisticregression__penalty': ['l1', 'l2']},
scoring='accuracy', verbose=1)
Now, let’s also add the n_jobs parameter, set it to -1, and re-run the grid search. This instructs scikit-learn to use parallel processing with all of your CPUs to perform the search. If your machine has multiple processors, this will generally be faster, though in this case it took about the same amount of time.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', verbose=1,
n_jobs=-1)
grid.fit(X, y)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.7s
[Parallel(n_jobs=-1)]: Done 225 out of 240 | elapsed: 2.3s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 2.3s finished
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Fare']),
(...
LogisticRegression(random_state=1,
solver='liblinear'))]),
n_jobs=-1,
param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
1),
(1,
2)],
'columntransformer__pipeline__onehotencoder__drop': [None,
'first'],
'columntransformer__simpleimputer__add_indicator': [False,
True],
'logisticregression__C': [0.1, 1, 10],
'logisticregression__penalty': ['l1', 'l2']},
scoring='accuracy', verbose=1)
If you find it useful to know how long a search takes, but verbose mode is a bit too verbose for you, another option is to remove the verbose parameter and instead prefix the second line with %time. This is known as an IPython line magic, and it will work as long as you’re using the Jupyter Notebook or the IPython interpreter.
All this command does is tell you how long a particular line of code took to run. The number to focus on is the wall time.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
%time grid.fit(X, y)
CPU times: user 238 ms, sys: 8.73 ms, total: 247 ms
Wall time: 699 ms
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Fare']),
(...
LogisticRegression(random_state=1,
solver='liblinear'))]),
n_jobs=-1,
param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
1),
(1,
2)],
'columntransformer__pipeline__onehotencoder__drop': [None,
'first'],
'columntransformer__simpleimputer__add_indicator': [False,
True],
'logisticregression__C': [0.1, 1, 10],
'logisticregression__penalty': ['l1', 'l2']},
scoring='accuracy')
My general recommendation is to set n_jobs to -1 any time you’re running a grid search, which is what I’ll do for the rest of the book. However, it’s still a good idea to use %time or verbose mode to confirm that parallel processing is actually reducing the search time on your particular machine.
When you provide a set of parameter values to GridSearchCV, it will cross-validate every possible combination of those parameters. For example, we know that with this set of parameters, cross-validation will run 48 times.
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10],
'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
Let’s say that we wanted to try additional C values for logistic regression. I’ll make a copy of the params dictionary called more_params, and then modify the C parameter in this dictionary to have 6 possible values instead of 3.
more_params = params.copy()
more_params['logisticregression__C'] = [0.01, 0.1, 1, 10, 100, 1000]
Since there are twice as many C values, we know that a grid search will take twice as long, meaning it will run cross-validation 96 times. But what if that grid search takes more time than we have available?
An alternative method we can use is called randomized search, which is implemented in the RandomizedSearchCV class. We’ll import it from the model_selection module and then create an instance.
The API is very similar to GridSearchCV, except that you also specify the number of times it should run using the n_iter parameter. In this case, we’ll set the number of iterations to be 10.
Each time it runs, it will pick out a set of parameters at random and cross-validate that parameter set. In other words, it does the same thing as GridSearchCV, except that it picks out random combinations of parameters from the parameter dictionary rather than trying every single combination. Because there’s an element of randomness, we’ll also set the random_state parameter to 1 for reproducibility.
We’ll use the fit method to run the search, and because it will only try 10 combinations instead of 96 combinations, it will run about 10 times faster than a grid search would.
from sklearn.model_selection import RandomizedSearchCV
rand = RandomizedSearchCV(pipe, more_params, cv=5, scoring='accuracy',
n_iter=10, random_state=1, n_jobs=-1)
rand.fit(X, y)
RandomizedSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Far...
n_jobs=-1,
param_distributions={'columntransformer__countvectorizer__ngram_range': [(1,
1),
(1,
2)],
'columntransformer__pipeline__onehotencoder__drop': [None,
'first'],
'columntransformer__simpleimputer__add_indicator': [False,
True],
'logisticregression__C': [0.01, 0.1, 1,
10, 100,
1000],
'logisticregression__penalty': ['l1',
'l2']},
random_state=1, scoring='accuracy')
By printing out the results of the search, you can see that it ran 10 times using random combinations of all of those parameters.
results = (pd.DataFrame(rand.cv_results_)
.filter(regex='param_|mean_test|rank'))
results.columns = results.columns.str.split('__').str[-1]
results
|   | penalty | C | add_indicator | drop | ngram_range | mean_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|
| 0 | l1 | 1 | True | first | (1, 1) | 0.814820 | 3 |
| 1 | l2 | 10 | False | first | (1, 1) | 0.809234 | 7 |
| 2 | l1 | 1000 | True | first | (1, 1) | 0.811437 | 5 |
| 3 | l2 | 1000 | False | None | (1, 2) | 0.810345 | 6 |
| 4 | l1 | 10 | False | first | (1, 2) | 0.824889 | 2 |
| 5 | l1 | 0.1 | False | first | (1, 2) | 0.777785 | 9 |
| 6 | l2 | 1 | True | None | (1, 2) | 0.804739 | 8 |
| 7 | l1 | 100 | True | first | (1, 1) | 0.813684 | 4 |
| 8 | l1 | 100 | False | first | (1, 2) | 0.827129 | 1 |
| 9 | l2 | 0.01 | True | first | (1, 2) | 0.744184 | 10 |
You might be surprised to know that the best score it found, 0.827, is almost as high as the best score found by our grid search earlier in the chapter, which was 0.828. That being said, we did try additional C values in our randomized search, so the comparison isn’t entirely fair.
rand.best_score_
0.8271294959512898
Here’s the set of parameters that produced that score.
rand.best_params_
{'logisticregression__penalty': 'l1',
'logisticregression__C': 100,
'columntransformer__simpleimputer__add_indicator': False,
'columntransformer__pipeline__onehotencoder__drop': 'first',
'columntransformer__countvectorizer__ngram_range': (1, 2)}
There are a few things I especially like about using a randomized search instead of a grid search: you control exactly how many parameter combinations get cross-validated (and thus how long the search takes), you can add more parameters and values without increasing the search time, and it will often find a combination of parameters that’s nearly as good as the grid search’s best combination in far less time.
If you do need to create a fine grid of numbers for a randomized search, one useful function is NumPy’s linspace. For example, this code specifies that I want 101 equally spaced values, starting with 0 and ending with 1.
import numpy as np
np.linspace(0, 1, 101)
array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54,
0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65,
0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76,
0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87,
0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98,
0.99, 1. ])
Another similar function is NumPy’s logspace. This code specifies that I want 6 values, from 10 to the negative 2nd power through 10 to the 3rd power.
np.logspace(-2, 3, 6)
array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03])
If you’re comfortable using the SciPy library, you can instead specify continuous parameters for a randomized search using SciPy distributions. However, I find it much easier to just use NumPy’s linspace and logspace functions.
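For example, here’s a sketch of my own (assuming a SciPy version that includes scipy.stats.loguniform) in which C is sampled from a continuous log-uniform distribution rather than chosen from a fixed list:

from scipy.stats import loguniform

dist_params = more_params.copy()
dist_params['logisticregression__C'] = loguniform(0.01, 1000)  # sample C between 0.01 and 1000

rand_dist = RandomizedSearchCV(pipe, dist_params, cv=5, scoring='accuracy',
                               n_iter=10, random_state=1, n_jobs=-1)
rand_dist.fit(X, y)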
When you’re building and tuning a modeling Pipeline, it’s natural to wonder how you’ll know when you’re done. In other words, how good of a model is “good enough”? There are three ways that I tend to think about this question.
The first way is to ask the question: What is the minimum accuracy that we need to achieve for our model to be considered useful? In most cases, you want your model to at least outperform null accuracy, which is the accuracy you could achieve by always predicting the most frequent class.
To calculate the null accuracy for our training data, we use the value_counts method on y, and set normalize to True in order to display the counts as a percentage. From the results, we can see that class 0 is the most frequent class, and about 61.6% of the y values are class 0.
y.value_counts(normalize=True)
0    0.616162
1    0.383838
Name: Survived, dtype: float64
Thus the null accuracy for this problem is 61.6%, since an uninformed model, also known as the null model, could achieve that accuracy simply by predicting class 0 in all cases. In other words, this is the accuracy level that we want to outperform, otherwise the model is not providing any value. Thankfully, all of our Pipelines are outperforming null accuracy by a considerable amount.
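As a cross-check (my own sketch, not from the book), scikit-learn’s DummyClassifier with the 'most_frequent' strategy implements exactly this null model, and cross-validating it should produce roughly the same 61.6% accuracy:

from sklearn.dummy import DummyClassifier

null_model = DummyClassifier(strategy='most_frequent')  # always predicts class 0
cross_val_score(null_model, X, y, cv=5, scoring='accuracy').mean()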
The second way to think about this question is to ask: What is the maximum accuracy we could eventually reach? For most real problems, it’s impossible to know how accurate your model could be if you did enough tuning and tried enough models. It’s also impossible to know how accurate your model could be if you gathered more samples or more features. The main exception to this is if you’re working on a well-studied research problem, because in that case there may be a state-of-the-art benchmark that everyone is trying to surpass.
Thus in most practical circumstances, you don’t set a target accuracy. Instead, you work to improve the model until you run out of time, money, or ideas.
The pipe object is our Pipeline that hasn’t been tuned by grid search. Recall that you can examine an individual Pipeline step by using the named_steps attribute. In this case, we’ll select the first step, which is our ColumnTransformer.
pipe.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer', CountVectorizer(), 'Name'),
('simpleimputer', SimpleImputer(),
['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])])
By passing X to its fit_transform method, we can see that the ColumnTransformer outputs 1518 feature columns. As we saw in lesson 8.4, all except 9 of these features were created from the Name column by CountVectorizer.
pipe.named_steps['columntransformer'].fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
The cross-validated accuracy of this Pipeline is 0.811, which we’ve been calling the baseline accuracy against which other Pipelines can be compared.
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
Similarly, we can select the ColumnTransformer from our Pipeline that was tuned by grid search. If we clicked on CountVectorizer in the diagram, we would see that ngram_range is (1, 2), meaning CountVectorizer will create features from both unigrams and bigrams in the Name column.
grid.best_estimator_.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer',
CountVectorizer(ngram_range=(1, 2)), 'Name'),
('simpleimputer',
SimpleImputer(add_indicator=True),
['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])])
By passing X to its fit_transform method, we can see that this ColumnTransformer outputs 3671 feature columns. Again, all except 9 of these features were created from the Name column.
grid.best_estimator_.named_steps['columntransformer'].fit_transform(X)
<891x3671 sparse matrix of type '<class 'numpy.float64'>'
with 10191 stored elements in Compressed Sparse Row format>
The cross-validated accuracy of this Pipeline is 0.828.
grid.best_score_
0.828253091456908
Finally, let’s compare these two Pipelines to a Pipeline that doesn’t include the Name column. First, we’ll create a ColumnTransformer called no_name_ct that excludes Name.
no_name_ct = make_column_transformer(
(imp_ohe, ['Embarked', 'Sex']),
(imp, ['Age', 'Fare']),
('passthrough', ['Parch']))
As you can see, this ColumnTransformer only outputs 9 feature columns.
no_name_ct.fit_transform(X).shape
(891, 9)
Then, we’ll add no_name_ct to a Pipeline called no_name_pipe and cross-validate it. The accuracy is 0.783, which is significantly lower than the Pipelines that included the Name column.
To be fair, this Pipeline hasn’t been tuned, though honestly there is no hyperparameter tuning we could do to make it perform as well as the Pipelines that included the Name column.
no_name_pipe = make_pipeline(no_name_ct, logreg)
cross_val_score(no_name_pipe, X, y, cv=5, scoring='accuracy').mean()
0.7833908731404181
Here are some conclusions that we can draw from this experiment:
- Including the Name column in the Pipeline significantly increased the cross-validated accuracy, which means that adding those thousands of feature columns did not result in overfitting. Instead, it tells us that the Name column contains more predictive signal than noise with respect to the target.

It’s worth noting that there is additional tuning we could do to CountVectorizer to reduce the number of features it creates. However, there’s no way to know whether that would increase or decrease the Pipeline’s accuracy without actually trying it.
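For example (a sketch of my own, not from the book), CountVectorizer’s min_df parameter drops any term that appears in fewer than the specified number of documents, and it could be added to the same grid search:

# 1 is the default; 2 would drop any term that appears in only one name
params['columntransformer__countvectorizer__min_df'] = [1, 2]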
Recall that once a grid search is complete, GridSearchCV automatically refits the Pipeline on X and y and stores it as an attribute called best_estimator_. Therefore, we can access the model coefficients by first selecting the logisticregression step and then selecting the coef_ attribute.
grid.best_estimator_.named_steps['logisticregression'].coef_
array([[ 0.56431161, 0. , -0.08767203, ..., 0.01408723,
-0.43713268, -0.46358519]])
Ideally, we would also be able to get the names of the features that correspond to these coefficients by running the get_feature_names method on the ColumnTransformer step. However, get_feature_names only works if all of the underlying transformers have a get_feature_names method, and that is not the case here.
grid.best_estimator_.named_steps['columntransformer'].get_feature_names()
AttributeError: Transformer pipeline does not provide get_feature_names
Instead, as we saw previously in lesson 8.4, you would have to inspect the transformers one-by-one in order to determine the feature names.
grid.best_estimator_.named_steps['columntransformer'].transformers_
[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'),
('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])]
Note that starting in scikit-learn version 1.1, the get_feature_names_out method should work on this ColumnTransformer, since the get_feature_names_out method will be available for all transformers.
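Here’s a sketch of how that could be used (it assumes scikit-learn 1.1 or later, so it won’t run with the version used in this book) to pair each coefficient with its feature name:

ct = grid.best_estimator_.named_steps['columntransformer']
lr = grid.best_estimator_.named_steps['logisticregression']
feature_names = ct.get_feature_names_out()   # requires scikit-learn 1.1 or later
coefficients = lr.coef_[0]
# show the 10 largest coefficients along with their feature names
sorted(zip(coefficients, feature_names), reverse=True)[:10]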
When we perform a grid search, we’re trying to find the parameters that maximize the cross-validation score on a dataset. Thus, we’re using the same data to accomplish two separate goals:
- Choosing the best parameters for the Pipeline, which are stored in the best_params_ attribute.
- Estimating how well the Pipeline will perform on new data when using those parameters, which is stored in the best_score_ attribute.

grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': None,
'columntransformer__simpleimputer__add_indicator': True,
'logisticregression__C': 10,
'logisticregression__penalty': 'l1'}
grid.best_score_
0.828253091456908
Using the same data for these two separate goals actually biases the Pipeline to this dataset and can result in overly optimistic scores.
If your main objective is to choose the best parameters, then this process is totally fine. You’ll just have to accept that its actual performance on new data may be lower than the performance estimated by grid search.
But if you also need a realistic estimate of the Pipeline’s performance on new data, then there’s an alternative process you can use, which I’ll walk you through in this lesson.
To start, we’ll import the train_test_split function from the model_selection module and use it to split the data, with 75% of the data as training and 25% of the data as testing. Note that I set the stratify parameter to y so that the class proportions will be approximately equal in the training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=1,
stratify=y)
Next, we’ll create a new GridSearchCV object called training_grid. When we run the grid search, we’ll only pass it the training set so that the tuning process only takes the training set into account.
training_grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy',
n_jobs=-1)
training_grid.fit(X_train, y_train)
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder())]),
['Embarked',
'Sex']),
('countvectorizer',
CountVectorizer(),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age',
'Fare']),
(...
LogisticRegression(random_state=1,
solver='liblinear'))]),
n_jobs=-1,
param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
1),
(1,
2)],
'columntransformer__pipeline__onehotencoder__drop': [None,
'first'],
'columntransformer__simpleimputer__add_indicator': [False,
True],
'logisticregression__C': [0.1, 1, 10],
'logisticregression__penalty': ['l1', 'l2']},
scoring='accuracy')
Here are the best parameters found by grid search on the training set.
training_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': 'first',
'columntransformer__simpleimputer__add_indicator': False,
'logisticregression__C': 10,
'logisticregression__penalty': 'l2'}
We’re not actually interested in the best score found during the grid search. Instead, we’re going to use the best parameters found by the grid search to make predictions for the testing set, and then evaluate the accuracy of those predictions. We can do this by passing the testing set to the training_grid’s score method.
The accuracy it outputs is 0.816, which is a more realistic estimate of how the Pipeline will perform on new data, since the testing set is brand new data that the Pipeline has never seen. However, it’s still just a single realization of this model, and so it’s impossible to know how precise this value is.
training_grid.score(X_test, y_test)
0.8161434977578476
Now that we’ve found the best parameters for the Pipeline and estimated its likely performance on new data, our final step is to actually make predictions on new data. Before making predictions, it’s critical that we train the Pipeline on all of our data, meaning the entirety of X and y, otherwise we’re throwing away valuable data.
In other words, we can’t simply use the training_grid’s predict method since it was only refit on X_train and y_train. Instead, we need to save the Pipeline with the best parameters, which we’ll call best_pipe, and fit it to X and y.
best_pipe = training_grid.best_estimator_
best_pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('onehotencoder',
OneHotEncoder(drop='first'))]),
['Embarked', 'Sex']),
('countvectorizer',
CountVectorizer(ngram_range=(1,
2)),
'Name'),
('simpleimputer',
SimpleImputer(),
['Age', 'Fare']),
('passthrough', 'passthrough',
['Parch'])])),
('logisticregression',
LogisticRegression(C=10, random_state=1, solver='liblinear'))])
Now we can make predictions on new data.
best_pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
If you decide that you’re going to follow the process that I’ve just outlined, then there are two important guidelines to follow:
First, you should only use the testing set for evaluating Pipeline performance one time. If you keep tuning the Pipeline again and again, each time checking its performance on the testing set, you’re essentially tuning the Pipeline to the particulars of the testing set. At that point, it no longer functions as an independent data source and thus its performance estimates will become less reliable.
Second, it’s important that you have enough data in order for the training and testing sets to both be sufficiently large once the dataset has been split:
- If the training set is too small, the grid search won’t have enough data to reliably identify the best parameters.
- If the testing set is too small, it won’t provide a reliable estimate of Pipeline performance.

Both of these situations would defeat the purpose of splitting the dataset, and thus this approach is best when you have a large enough dataset. Unfortunately, it’s difficult to say in the abstract how much data is “enough”, since that depends on the particulars of the dataset and the problem.
Earlier in this chapter, we tuned the regularization parameters of logistic regression. In this lesson, I’ll briefly explain what regularization actually is.
Regularization is a process that constrains the size of a model’s coefficients in order to minimize overfitting. Overfitting is when your model fits too closely to patterns in the training data, which causes your model not to perform well when it makes predictions on new data.
Regularization minimizes overfitting by reducing the variance of the model. Thus if you believe a model is too complex, regularization will reduce the error due to variance more than it increases the error due to bias, resulting in a model that is more likely to generalize to new data.
In simpler terms, regularization makes your model a bit less flexible so that it’s more likely to follow the true patterns in the data and less likely to follow the noise. Regularization is especially useful when you have outliers in the training data, because regularization decreases the influence that outliers have on the model.
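To see regularization in action, here’s a small experiment of my own (not from the book): refit the untuned Pipeline with three different values of C, and the magnitude of the largest logistic regression coefficient should shrink as C gets smaller (meaning as regularization gets stronger):

import numpy as np
from sklearn.base import clone

for C in [100, 1, 0.01]:
    # clone resets the Pipeline to an unfitted state before changing C
    p = clone(pipe).set_params(logisticregression__C=C)
    p.fit(X, y)
    largest_coef = np.abs(p.named_steps['logisticregression'].coef_).max()
    print(C, largest_coef)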