In this chapter, we’re going to take a deep dive into how to efficiently tune our Pipeline for maximum accuracy.
Let’s return to the topic of model evaluation.
As you might recall, we used cross-validation way back in chapter 2 to evaluate our most basic model. Since that chapter, we’ve been adding many more features without re-running cross-validation. That’s because any model evaluation procedure is highly unreliable with only 10 rows of data, so it would have been misleading to run cross-validation and compare the results. But now that we’re using the full dataset, cross-validation can once again be used.
When we run it, cross_val_score outputs a mean accuracy of 0.811, which we’ll use as the baseline accuracy against which our future Pipelines can be compared.
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
Let’s talk about what actually happens “under the hood” when we run the cross_val_score function on a Pipeline:

1. Split the data into 5 folds, setting aside one fold as the testing set and using the rest as the training set.
2. Fit the transformers on the training set.
3. Apply the transformations to both the training and testing sets.
4. Fit the model on the transformed training set and evaluate it on the transformed testing set.
5. Repeat these steps until each fold has been used as the testing set, and then report the 5 accuracy scores.
One thing you might have noticed is that cross_val_score splits the data in step 1 before performing the transformations in steps 2 and 3. As a result, the imputation values for Age and Fare and the vocabulary for CountVectorizer are all computed 5 different times. Each time, these values are computed using the training set only, and then applied to both the training and testing sets.
Alternatively, you could imagine performing all of the transformations first, and then splitting the data. This would be much faster, since the imputation values and the vocabulary would be computed only once on the full dataset.
So why does cross_val_score split the data first? Because splitting the data before performing the transformations prevents data leakage, whereas performing the transformations on the full dataset before splitting the data would cause data leakage, since information about the testing set would be “leaked” into the model training process.
As we discussed in the previous chapter, this is one way that scikit-learn helps to shield you from data leakage.
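To make this concrete, here’s a minimal sketch of what cross_val_score is doing internally, assuming pipe, X, and y are defined as before. (For a classification problem, cross_val_score uses stratified folds without shuffling by default, which is what we replicate here.)

from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_pipe = clone(pipe)  # fresh, unfitted copy of the Pipeline
    # the transformers and model are fit on the training fold only...
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])
    # ...and the fitted Pipeline is then scored on the testing fold
    scores.append(fold_pipe.score(X.iloc[test_idx], y.iloc[test_idx]))
sum(scores) / len(scores)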
Now that we’ve calculated the baseline accuracy for our Pipeline, the next step is to tune the hyperparameters for both the model and the transformers. Recall that we’ve been using the default parameters for most objects in the Pipeline, and so tuning those parameters is likely to result in a more accurate model.
Before proceeding, let me briefly explain some terminology. In the field of statistics, “hyperparameters” are values that you set, whereas “parameters” are values learned from the data by the estimator during the fitting process.
For example, the C value of logistic regression is called a hyperparameter because it’s something you can set and optimize, whereas the coefficients of a logistic regression model are called parameters because they’re learned from the data.
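As a quick illustration, here’s a sketch that fits logistic regression on just the numeric Parch column (so that it can be fit without our Pipeline):

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, solver='liblinear')  # C is a hyperparameter: you set it
model.fit(X[['Parch']], y)
model.coef_  # the coefficients are parameters: learned from the data during fitting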
In this book, I’m generally going to follow scikit-learn’s conventions as I understand them: I’ll mostly use the term “parameters”, since that’s the term scikit-learn uses for the values you set when creating an estimator, and I’ll only say “hyperparameters” when I need to emphasize the distinction from values learned during fitting.
With that being said, we’re going to use a scikit-learn class called GridSearchCV to perform the hyperparameter tuning. In a grid search, you define which values you want to try for each parameter, and it cross-validates every possible combination of those values.
We can actually use GridSearchCV to tune the entire Pipeline at once, including both the model and the transformers. This has two huge benefits over just tuning a model: we can search for the best combination of transformer and model parameters, since the optimal value for one parameter may depend on the values of the others, and all of the transformations are performed within each fold of cross-validation, which protects the search from data leakage.
Keep in mind that if we had instead done the data transformations in pandas, we would have missed out on both of these benefits.
In this lesson, we’re going to tune the model, and then in the next lesson, we’ll also tune the transformers.
For the logistic regression model, we’re going to tune two parameters: penalty, which specifies the type of regularization to use, and C, which is the inverse of the regularization strength.
Deciding which parameters to tune and what values to try requires both research and experience, and unfortunately, it’s different for every type of model.
In order to tune a Pipeline with GridSearchCV, we need to get the names of the Pipeline steps from the named_steps attribute. We’ll tune the “logisticregression” step in this lesson, and we’ll tune the “columntransformer” step in the next lesson.
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
To use GridSearchCV, we need to create a dictionary in which each entry represents a parameter and the values we want to try for that parameter. We’ll start by creating an empty dictionary called params, and then we’ll add the two entries.
For each dictionary entry, the key is the Pipeline step name, followed by two underscores, followed by the parameter name. Thus the key for the first entry is “logisticregression__penalty”, and the key for the second entry is “logisticregression__C”.
Using two underscores is what allows GridSearchCV to distinguish between the step name and the parameter name. Using a single underscore would be ambiguous, since a step name or parameter name can have an underscore within it.
The value for each dictionary entry is a list of the values you want to try for that parameter. Thus the value for the first entry is a list of “l1” and “l2”, and the value for the second entry is a list of 0.1, 1, and 10.
After adding the two entries, we’ll print out the params dictionary just to make sure that it looks correct.
params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10]}
Now that we’ve created the parameter dictionary, we can set up the grid search. We import the GridSearchCV class from the model_selection module.
from sklearn.model_selection import GridSearchCV
Next, we create an instance of GridSearchCV called grid. We pass it the Pipeline, the parameter dictionary, the number of folds, and the evaluation metric.
Finally, we run the grid search by fitting the grid object with X and y. Because our scikit-learn configuration is set to display diagrams, we see a diagram of the Pipeline now that the grid search is complete.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('logisticregression', LogisticRegression(random_state=1, solver='liblinear'))]), param_grid={'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
The results of the grid search are stored in an attribute called cv_results_, which we’ll convert to a DataFrame for easier viewing.
There are 6 rows because it ran cross-validation 6 times, which is every possible combination of the 2 values of penalty and the 3 values of C that we specified.
results = pd.DataFrame(grid.cv_results_)
results
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logisticregression__C | param_logisticregression__penalty | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.006248 | 0.000092 | 0.002339 | 0.000045 | 0.1 | l1 | {'logisticregression__C': 0.1, 'logisticregres... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 6 |
1 | 0.006437 | 0.000056 | 0.002359 | 0.000054 | 0.1 | l2 | {'logisticregression__C': 0.1, 'logisticregres... | 0.798883 | 0.803371 | 0.764045 | 0.775281 | 0.803371 | 0.788990 | 0.016258 | 5 |
2 | 0.007092 | 0.000159 | 0.002399 | 0.000006 | 1 | l1 | {'logisticregression__C': 1, 'logisticregressi... | 0.815642 | 0.820225 | 0.797753 | 0.792135 | 0.848315 | 0.814814 | 0.019787 | 2 |
3 | 0.006887 | 0.000091 | 0.002386 | 0.000013 | 1 | l2 | {'logisticregression__C': 1, 'logisticregressi... | 0.798883 | 0.825843 | 0.803371 | 0.786517 | 0.842697 | 0.811462 | 0.020141 | 3 |
4 | 0.010375 | 0.001186 | 0.002408 | 0.000008 | 10 | l1 | {'logisticregression__C': 10, 'logisticregress... | 0.832402 | 0.808989 | 0.808989 | 0.786517 | 0.853933 | 0.818166 | 0.023031 | 1 |
5 | 0.007407 | 0.000158 | 0.002385 | 0.000018 | 10 | l2 | {'logisticregression__C': 10, 'logisticregress... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 4 |
Notice the rank_test_score column, which is the last column in the DataFrame. We’ll use the DataFrame’s sort_values method to sort the rows by that column in ascending order.
By examining the mean_test_score column, we can see that the best parameter combination resulted in a cross-validated accuracy of 0.818, which is higher than our baseline accuracy of 0.811.
Scrolling to the left, we can see that the best accuracy occurred when C was 10 and penalty was l1, neither of which was the default value for that parameter.
results.sort_values('rank_test_score')
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logisticregression__C | param_logisticregression__penalty | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0.010375 | 0.001186 | 0.002408 | 0.000008 | 10 | l1 | {'logisticregression__C': 10, 'logisticregress... | 0.832402 | 0.808989 | 0.808989 | 0.786517 | 0.853933 | 0.818166 | 0.023031 | 1 |
2 | 0.007092 | 0.000159 | 0.002399 | 0.000006 | 1 | l1 | {'logisticregression__C': 1, 'logisticregressi... | 0.815642 | 0.820225 | 0.797753 | 0.792135 | 0.848315 | 0.814814 | 0.019787 | 2 |
3 | 0.006887 | 0.000091 | 0.002386 | 0.000013 | 1 | l2 | {'logisticregression__C': 1, 'logisticregressi... | 0.798883 | 0.825843 | 0.803371 | 0.786517 | 0.842697 | 0.811462 | 0.020141 | 3 |
5 | 0.007407 | 0.000158 | 0.002385 | 0.000018 | 10 | l2 | {'logisticregression__C': 10, 'logisticregress... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 4 |
1 | 0.006437 | 0.000056 | 0.002359 | 0.000054 | 0.1 | l2 | {'logisticregression__C': 0.1, 'logisticregres... | 0.798883 | 0.803371 | 0.764045 | 0.775281 | 0.803371 | 0.788990 | 0.016258 | 5 |
0 | 0.006248 | 0.000092 | 0.002339 | 0.000045 | 0.1 | l1 | {'logisticregression__C': 0.1, 'logisticregres... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 6 |
In the previous lesson, we built a grid search for tuning model parameters and found that the best accuracy occurred when C was 10 and penalty was l1. In this lesson, we’re going to expand the search to also include transformer parameters.
When expanding the search, you might first think that we should set C to 10 and penalty to l1, and then only search the transformer parameters, since that would be the most computationally efficient approach.
However, the better approach is actually to consider all of the values for C and penalty in combination with all of the transformer parameters. That’s because we’re searching for the best combination of all parameters, and since each parameter can influence what is optimal for the other parameters, the best combination might use a C value other than 10 or a penalty value other than l1.
All of that is to say that we’re going to expand the existing params dictionary to include transformer parameters. And to include transformer parameters, we first need to figure out the transformer names.
From the previous lesson, you might recall that the first step in the Pipeline is named columntransformer (all lowercase). We’ll access that step using the named_steps attribute, which then allows us to examine the named_transformers_ attribute of the ColumnTransformer.
As a side note, named_transformers_ ends with an underscore because it’s set during the fit step, whereas named_steps does not end with an underscore because it’s set when the Pipeline instance is created.
Anyway, we can now see the transformer names. We’re going to tune a single parameter from three of the transformers. Normally I might tune more parameters, but for the sake of brevity I’m only going to tune three.
pipe.named_steps['columntransformer'].named_transformers_
{'pipeline': Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
'countvectorizer': CountVectorizer(),
'simpleimputer': SimpleImputer(),
'passthrough': 'passthrough'}
The first parameter we’re going to tune is the “drop” parameter of OneHotEncoder, which was added to scikit-learn in version 0.21 and which I discussed in lesson 3.6.
To add it to the params dictionary, we specify the Pipeline step name, which is “columntransformer”. Then we specify the transformer name, which is “pipeline”. Then we specify the step name of the inner Pipeline, which is “onehotencoder”. Finally we specify the parameter name, which is “drop”. All of these components are separated by two underscores.
The parameter values we’re going to try are “None” and “first”. None is the default, and it means don’t drop any columns, whereas first means drop the first column of each feature after encoding.
params['columntransformer__pipeline__onehotencoder__drop'] = [None, 'first']
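To see concretely what the drop parameter does, here’s a small standalone demo (separate from our Pipeline) using a single made-up categorical column:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})
# drop=None (the default) creates one column per category
OneHotEncoder(sparse=False).fit_transform(demo)
# array([[0., 1.], [1., 0.], [0., 1.]])
# drop='first' drops the first category's column ('female' here)
OneHotEncoder(drop='first', sparse=False).fit_transform(demo)
# array([[1.], [0.], [1.]])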
If you’re ever unsure how to specify a parameter for a grid search, you can see all of the Pipeline’s parameters by using the get_params method followed by the keys method. I’m converting the output to a list for easier readability. This list is also useful if you prefer to copy and paste the parameter names rather than typing them.
As you can see, there are many transformer and model parameters that we’re not tuning, many of which could be useful to tune given enough time and computational resources.
list(pipe.get_params().keys())
['memory',
'steps',
'verbose',
'columntransformer',
'logisticregression',
'columntransformer__n_jobs',
'columntransformer__remainder',
'columntransformer__sparse_threshold',
'columntransformer__transformer_weights',
'columntransformer__transformers',
'columntransformer__verbose',
'columntransformer__pipeline',
'columntransformer__countvectorizer',
'columntransformer__simpleimputer',
'columntransformer__passthrough',
'columntransformer__pipeline__memory',
'columntransformer__pipeline__steps',
'columntransformer__pipeline__verbose',
'columntransformer__pipeline__simpleimputer',
'columntransformer__pipeline__onehotencoder',
'columntransformer__pipeline__simpleimputer__add_indicator',
'columntransformer__pipeline__simpleimputer__copy',
'columntransformer__pipeline__simpleimputer__fill_value',
'columntransformer__pipeline__simpleimputer__missing_values',
'columntransformer__pipeline__simpleimputer__strategy',
'columntransformer__pipeline__simpleimputer__verbose',
'columntransformer__pipeline__onehotencoder__categories',
'columntransformer__pipeline__onehotencoder__drop',
'columntransformer__pipeline__onehotencoder__dtype',
'columntransformer__pipeline__onehotencoder__handle_unknown',
'columntransformer__pipeline__onehotencoder__sparse',
'columntransformer__countvectorizer__analyzer',
'columntransformer__countvectorizer__binary',
'columntransformer__countvectorizer__decode_error',
'columntransformer__countvectorizer__dtype',
'columntransformer__countvectorizer__encoding',
'columntransformer__countvectorizer__input',
'columntransformer__countvectorizer__lowercase',
'columntransformer__countvectorizer__max_df',
'columntransformer__countvectorizer__max_features',
'columntransformer__countvectorizer__min_df',
'columntransformer__countvectorizer__ngram_range',
'columntransformer__countvectorizer__preprocessor',
'columntransformer__countvectorizer__stop_words',
'columntransformer__countvectorizer__strip_accents',
'columntransformer__countvectorizer__token_pattern',
'columntransformer__countvectorizer__tokenizer',
'columntransformer__countvectorizer__vocabulary',
'columntransformer__simpleimputer__add_indicator',
'columntransformer__simpleimputer__copy',
'columntransformer__simpleimputer__fill_value',
'columntransformer__simpleimputer__missing_values',
'columntransformer__simpleimputer__strategy',
'columntransformer__simpleimputer__verbose',
'logisticregression__C',
'logisticregression__class_weight',
'logisticregression__dual',
'logisticregression__fit_intercept',
'logisticregression__intercept_scaling',
'logisticregression__l1_ratio',
'logisticregression__max_iter',
'logisticregression__multi_class',
'logisticregression__n_jobs',
'logisticregression__penalty',
'logisticregression__random_state',
'logisticregression__solver',
'logisticregression__tol',
'logisticregression__verbose',
'logisticregression__warm_start']
Moving along, the second parameter we’re going to tune is the “ngram_range” parameter of CountVectorizer.
Again, we specify the Pipeline step name, then the transformer name, and then the parameter name. Note that these three components are separated by double underscores, but there’s just a single underscore within ngram_range because that’s part of the parameter name.
The parameter values we’re going to try are the tuples (1, 1) and (1, 2). (1, 1) is the default, and it creates a single feature from each word. (1, 2) creates features from both single words, known as unigrams, and word pairs, known as bigrams.
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
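Here’s a small standalone demo of how ngram_range changes the features that CountVectorizer creates, using a single sample name:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(['Braund, Mr. Owen Harris'])
vect.get_feature_names()
# ['braund', 'braund mr', 'harris', 'mr', 'mr owen', 'owen', 'owen harris']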
The third parameter we’re going to tune is the “add_indicator” parameter of SimpleImputer, which was added to scikit-learn in version 0.21 and which I discussed in lesson 7.4.
Once again, we specify the Pipeline step name, then the transformer name, and then the parameter name.
The parameter values we’re going to try are “False” and “True”. False is the default, and it does not add a missing indicator column, whereas True does add a missing indicator column.
params['columntransformer__simpleimputer__add_indicator'] = [False, True]
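Here’s a small standalone demo of what add_indicator does, using a tiny array with one missing value:

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(add_indicator=True)
# mean imputation fills in 30, and the extra column marks what was missing
imp.fit_transform([[20.0], [np.nan], [40.0]])
# array([[20., 0.], [30., 1.], [40., 0.]])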
Before running the grid search, we’ll print out the params dictionary. By multiplying 2 by 3 by 2 by 2 by 2, we can calculate that there are now 48 parameter combinations, and thus the grid search will take about 8 times longer than the previous search.
As an aside, if we had used the Pipeline and ColumnTransformer classes instead of the make_pipeline and make_column_transformer functions, we could have customized the step names and transformer names, which would have made these parameter specifications a bit easier to read and write. You can watch lessons 4.9 and 4.10 for a review of that topic.
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10],
'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
Anyway, next we’ll recreate the grid object with the new params dictionary, and then we’ll run the grid search.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
Now that the search is complete, we’ll convert the search results into a DataFrame and sort it by the rank_test_score column.
As you can see from the mean_test_score column, the best accuracy of 0.828 is an improvement over the previous grid search, which had an accuracy of 0.818. Keep in mind that your exact results may differ based on your scikit-learn version along with other factors. However, there’s no randomness involved when you set cv to an integer, and so your results will be the same every time you run this grid search.
results = pd.DataFrame(grid.cv_results_)
results.sort_values('rank_test_score')
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_columntransformer__countvectorizer__ngram_range | param_columntransformer__pipeline__onehotencoder__drop | param_columntransformer__simpleimputer__add_indicator | param_logisticregression__C | param_logisticregression__penalty | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 0.013390 | 0.000783 | 0.002698 | 0.000020 | (1, 2) | None | True | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.854749 | 0.820225 | 0.825843 | 0.780899 | 0.859551 | 0.828253 | 0.028264 | 1 |
28 | 0.013452 | 0.001096 | 0.002653 | 0.000021 | (1, 2) | None | False | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.849162 | 0.820225 | 0.814607 | 0.780899 | 0.859551 | 0.824889 | 0.027760 | 2 |
40 | 0.017350 | 0.001327 | 0.002691 | 0.000016 | (1, 2) | first | False | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.849162 | 0.825843 | 0.814607 | 0.780899 | 0.853933 | 0.824889 | 0.026361 | 2 |
46 | 0.017452 | 0.002011 | 0.002740 | 0.000021 | (1, 2) | first | True | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.843575 | 0.820225 | 0.814607 | 0.780899 | 0.853933 | 0.822648 | 0.025417 | 4 |
16 | 0.014869 | 0.000915 | 0.004289 | 0.002258 | (1, 1) | first | False | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.837989 | 0.803371 | 0.825843 | 0.786517 | 0.848315 | 0.820407 | 0.022611 | 5 |
22 | 0.012250 | 0.001199 | 0.002482 | 0.000023 | (1, 1) | first | True | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.826816 | 0.808989 | 0.814607 | 0.786517 | 0.859551 | 0.819296 | 0.023999 | 6 |
4 | 0.010636 | 0.001193 | 0.002491 | 0.000045 | (1, 1) | None | False | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.832402 | 0.808989 | 0.808989 | 0.786517 | 0.853933 | 0.818166 | 0.023031 | 7 |
10 | 0.009969 | 0.000523 | 0.002513 | 0.000014 | (1, 1) | None | True | 10 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.815642 | 0.808989 | 0.820225 | 0.786517 | 0.853933 | 0.817061 | 0.021770 | 8 |
20 | 0.007983 | 0.000461 | 0.002600 | 0.000072 | (1, 1) | first | True | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.810056 | 0.820225 | 0.797753 | 0.792135 | 0.853933 | 0.814820 | 0.021852 | 9 |
2 | 0.011787 | 0.008637 | 0.002481 | 0.000022 | (1, 1) | None | False | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.815642 | 0.820225 | 0.797753 | 0.792135 | 0.848315 | 0.814814 | 0.019787 | 10 |
44 | 0.010098 | 0.000371 | 0.002742 | 0.000050 | (1, 2) | first | True | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.804469 | 0.820225 | 0.797753 | 0.792135 | 0.853933 | 0.813703 | 0.022207 | 11 |
47 | 0.010278 | 0.000185 | 0.002679 | 0.000006 | (1, 2) | first | True | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.820225 | 0.820225 | 0.780899 | 0.853933 | 0.812598 | 0.026265 | 12 |
8 | 0.007782 | 0.000613 | 0.002519 | 0.000013 | (1, 1) | None | True | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.804469 | 0.820225 | 0.786517 | 0.792135 | 0.859551 | 0.812579 | 0.026183 | 13 |
38 | 0.009798 | 0.000299 | 0.002631 | 0.000014 | (1, 2) | first | False | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.804469 | 0.820225 | 0.797753 | 0.792135 | 0.848315 | 0.812579 | 0.020194 | 14 |
14 | 0.008625 | 0.000364 | 0.003364 | 0.000394 | (1, 1) | first | False | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.804469 | 0.820225 | 0.797753 | 0.792135 | 0.848315 | 0.812579 | 0.020194 | 14 |
26 | 0.009654 | 0.000203 | 0.002621 | 0.000008 | (1, 2) | None | False | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.815642 | 0.820225 | 0.786517 | 0.792135 | 0.848315 | 0.812567 | 0.022100 | 16 |
11 | 0.007627 | 0.000085 | 0.002493 | 0.000017 | (1, 1) | None | True | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.782123 | 0.803371 | 0.808989 | 0.792135 | 0.870787 | 0.811481 | 0.031065 | 17 |
21 | 0.007036 | 0.000065 | 0.002451 | 0.000020 | (1, 1) | first | True | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.793296 | 0.820225 | 0.803371 | 0.786517 | 0.853933 | 0.811468 | 0.024076 | 18 |
3 | 0.007053 | 0.000083 | 0.002475 | 0.000059 | (1, 1) | None | False | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.798883 | 0.825843 | 0.803371 | 0.786517 | 0.842697 | 0.811462 | 0.020141 | 19 |
23 | 0.007419 | 0.000092 | 0.002446 | 0.000020 | (1, 1) | first | True | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.776536 | 0.803371 | 0.808989 | 0.792135 | 0.870787 | 0.810363 | 0.032182 | 20 |
9 | 0.007227 | 0.000080 | 0.002494 | 0.000009 | (1, 1) | None | True | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.793296 | 0.825843 | 0.797753 | 0.786517 | 0.848315 | 0.810345 | 0.023233 | 21 |
15 | 0.007868 | 0.000300 | 0.002760 | 0.000241 | (1, 1) | first | False | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.804469 | 0.820225 | 0.803371 | 0.786517 | 0.837079 | 0.810332 | 0.017107 | 22 |
32 | 0.009945 | 0.000132 | 0.002670 | 0.000006 | (1, 2) | None | True | 1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.804469 | 0.820225 | 0.780899 | 0.792135 | 0.853933 | 0.810332 | 0.025419 | 22 |
17 | 0.008091 | 0.000432 | 0.002767 | 0.000232 | (1, 1) | first | False | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 24 |
35 | 0.010225 | 0.000123 | 0.002676 | 0.000017 | (1, 2) | None | True | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.782123 | 0.820225 | 0.814607 | 0.780899 | 0.848315 | 0.809234 | 0.025357 | 24 |
5 | 0.007588 | 0.000143 | 0.002450 | 0.000012 | (1, 1) | None | False | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 24 |
29 | 0.010094 | 0.000113 | 0.002614 | 0.000014 | (1, 2) | None | False | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.814607 | 0.820225 | 0.780899 | 0.837079 | 0.808104 | 0.020904 | 27 |
45 | 0.009662 | 0.000085 | 0.002686 | 0.000014 | (1, 2) | first | True | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.793296 | 0.814607 | 0.797753 | 0.786517 | 0.848315 | 0.808097 | 0.022143 | 28 |
41 | 0.010140 | 0.000076 | 0.002633 | 0.000008 | (1, 2) | first | False | 10 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.814607 | 0.820225 | 0.780899 | 0.831461 | 0.806980 | 0.019414 | 29 |
39 | 0.009537 | 0.000114 | 0.002621 | 0.000010 | (1, 2) | first | False | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.798883 | 0.808989 | 0.797753 | 0.786517 | 0.837079 | 0.805844 | 0.017164 | 30 |
27 | 0.009532 | 0.000076 | 0.002620 | 0.000013 | (1, 2) | None | False | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.798883 | 0.814607 | 0.792135 | 0.786517 | 0.837079 | 0.805844 | 0.018234 | 30 |
33 | 0.009623 | 0.000093 | 0.002661 | 0.000012 | (1, 2) | None | True | 1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.782123 | 0.814607 | 0.792135 | 0.786517 | 0.848315 | 0.804739 | 0.024489 | 32 |
31 | 0.009042 | 0.000162 | 0.002697 | 0.000043 | (1, 2) | None | True | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.793296 | 0.803371 | 0.769663 | 0.786517 | 0.814607 | 0.793491 | 0.015231 | 33 |
7 | 0.006694 | 0.000080 | 0.002536 | 0.000061 | (1, 1) | None | True | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.798883 | 0.803371 | 0.764045 | 0.786517 | 0.814607 | 0.793484 | 0.017253 | 34 |
19 | 0.007016 | 0.000145 | 0.002575 | 0.000145 | (1, 1) | first | True | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.793296 | 0.803371 | 0.764045 | 0.780899 | 0.814607 | 0.791243 | 0.017572 | 35 |
43 | 0.010422 | 0.002550 | 0.002856 | 0.000219 | (1, 2) | first | True | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.798883 | 0.797753 | 0.764045 | 0.780899 | 0.808989 | 0.790114 | 0.015849 | 36 |
37 | 0.008877 | 0.000029 | 0.002641 | 0.000018 | (1, 2) | first | False | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.803371 | 0.764045 | 0.780899 | 0.808989 | 0.789003 | 0.016100 | 37 |
25 | 0.008901 | 0.000061 | 0.002608 | 0.000011 | (1, 2) | None | False | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.793296 | 0.803371 | 0.764045 | 0.775281 | 0.808989 | 0.788996 | 0.016944 | 38 |
1 | 0.006655 | 0.000039 | 0.002452 | 0.000011 | (1, 1) | None | False | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.798883 | 0.803371 | 0.764045 | 0.775281 | 0.803371 | 0.788990 | 0.016258 | 39 |
13 | 0.007833 | 0.000708 | 0.003192 | 0.000419 | (1, 1) | first | False | 0.1 | l2 | {'columntransformer__countvectorizer__ngram_ra... | 0.782123 | 0.803371 | 0.764045 | 0.780899 | 0.808989 | 0.787885 | 0.016343 | 40 |
0 | 0.006412 | 0.000095 | 0.002483 | 0.000073 | (1, 1) | None | False | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 41 |
30 | 0.011053 | 0.003787 | 0.002923 | 0.000274 | (1, 2) | None | True | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 41 |
24 | 0.008545 | 0.000088 | 0.002621 | 0.000013 | (1, 2) | None | False | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 41 |
6 | 0.006543 | 0.000118 | 0.002510 | 0.000011 | (1, 1) | None | True | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 41 |
36 | 0.008752 | 0.000158 | 0.002635 | 0.000023 | (1, 2) | first | False | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.770950 | 0.797753 | 0.769663 | 0.758427 | 0.792135 | 0.777785 | 0.014779 | 45 |
42 | 0.008697 | 0.000146 | 0.002683 | 0.000007 | (1, 2) | first | True | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.770950 | 0.797753 | 0.769663 | 0.758427 | 0.792135 | 0.777785 | 0.014779 | 45 |
12 | 0.009742 | 0.004551 | 0.006101 | 0.005669 | (1, 1) | first | False | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.770950 | 0.797753 | 0.769663 | 0.758427 | 0.792135 | 0.777785 | 0.014779 | 45 |
18 | 0.006817 | 0.000245 | 0.002593 | 0.000095 | (1, 1) | first | True | 0.1 | l1 | {'columntransformer__countvectorizer__ngram_ra... | 0.770950 | 0.797753 | 0.769663 | 0.758427 | 0.792135 | 0.777785 | 0.014779 | 45 |
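As an aside, here’s a sketch of why these results are reproducible: when you pass an integer cv for a classification problem, scikit-learn uses a StratifiedKFold splitter without shuffling, so the folds are identical on every run. In other words, cv=5 is effectively the same as:

from sklearn.model_selection import StratifiedKFold
# stratified folds with no shuffling, so repeated runs give identical scores
GridSearchCV(pipe, params, cv=StratifiedKFold(n_splits=5, shuffle=False), scoring='accuracy')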
Rather than always examining the results DataFrame, we can actually just access the single best score and the set of parameters that resulted in that score via attributes of the grid object.
It’s worth noting that only the drop parameter is using its default value, whereas the other four parameters are not using their default values.
grid.best_score_
0.828253091456908
grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': None,
'columntransformer__simpleimputer__add_indicator': True,
'logisticregression__C': 10,
'logisticregression__penalty': 'l1'}
It’s hard to say whether this truly is the best set of parameters, because some of the differences in accuracy between parameter combinations may be due to chance, based on which samples happened to appear in each fold. That’s just a limitation of basic cross-validation, and so all we can say with confidence is that this is a good combination of parameters.
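One rough way to gauge that uncertainty is to compare the gaps between the top mean_test_score values to the fold-to-fold variability in std_test_score, which here is much larger than those gaps:

# the top combinations differ by less than their standard deviation across folds
results.sort_values('rank_test_score')[['mean_test_score', 'std_test_score']].head()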
Now that we’ve tuned both the model parameters and the transformer parameters, we want to use those parameters with the Pipeline when making predictions.
GridSearchCV actually makes this very easy. After locating the best set of parameters, it automatically refits the Pipeline on X and y using the best set of parameters, and it stores that fitted Pipeline as an attribute called best_estimator_. And as you can see, that attribute is indeed a Pipeline object.
type(grid.best_estimator_)
sklearn.pipeline.Pipeline
If we print out the best_estimator_ attribute and click on the components, we can see that the parameters of this Pipeline match the best parameter set we located in the previous lesson.
grid.best_estimator_
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'), ('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('logisticregression', LogisticRegression(C=10, penalty='l1', random_state=1, solver='liblinear'))])
In order to make predictions using this Pipeline, all we have to do is run the grid object’s predict method, which calls the predict method of the best_estimator_, and pass it X_new.
grid.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
I just want to emphasize that this Pipeline, with the best set of parameters, was automatically refit to the entire dataset. You should always train your model on the entire dataset, meaning all samples for which you know the target value, before using it to make predictions on new data; otherwise, you’re throwing away valuable training data.
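Roughly speaking, here’s a sketch of what that automatic refitting (controlled by the refit parameter of GridSearchCV, which defaults to True) is doing behind the scenes:

from sklearn.base import clone
# apply the best parameters to a fresh copy of the Pipeline...
best_pipe = clone(pipe).set_params(**grid.best_params_)
# ...and fit it on ALL labeled samples; this is what becomes best_estimator_
best_pipe.fit(X, y)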
After completing a grid search, you may want to save the Pipeline with the best set of parameters so that you can use it to make predictions later.
As we saw in the previous lesson, the Pipeline with the best set of parameters is stored as an attribute of the GridSearchCV object called best_estimator_, so this is the object that we want to save.
type(grid.best_estimator_)
sklearn.pipeline.Pipeline
You can save a Pipeline to a file using pickle, which is part of the Python standard library.
import pickle
We’ll use pickle’s dump function to save the Pipeline to a file called “pipe.pickle”.
with open('pipe.pickle', 'wb') as f:
    pickle.dump(grid.best_estimator_, f)
Then we can use pickle’s load function to load the Pipeline from the pipe.pickle file into an object called pipe_from_pickle.
with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)
pipe_from_pickle is identical to grid.best_estimator_, and so when we use pipe_from_pickle to make predictions, these predictions are identical to the predictions made by the grid object.
pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
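If you’d like to confirm that programmatically rather than by eye, here’s a quick check:

import numpy as np
# True if the loaded Pipeline's predictions match the grid's predictions
np.array_equal(pipe_from_pickle.predict(X_new), grid.predict(X_new))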
One alternative to pickle is joblib, which is usually more efficient than pickle for scikit-learn objects. Although it’s not part of the Python standard library, joblib has been a dependency of scikit-learn since version 0.21.
import joblib
Just like pickle, you use joblib’s dump function to save the Pipeline to a file, which we’ll call “pipe.joblib”.
joblib.dump(grid.best_estimator_, 'pipe.joblib')
['pipe.joblib']
Then, we’ll use the load function to load the Pipeline from the file into an object called pipe_from_joblib.
pipe_from_joblib = joblib.load('pipe.joblib')
Finally, we’ll use pipe_from_joblib to make predictions.
pipe_from_joblib.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
To be clear, pickle and joblib are not limited to Pipelines and can be used with other scikit-learn objects, such as a standalone model object that is not inside a Pipeline.
There are a couple warnings to keep in mind when working with pickle and joblib objects: first, a file saved with one version of Python or scikit-learn may not load properly (or may behave differently) under a different version, so you should load it into an environment with matching versions; second, loading a pickle or joblib file can execute arbitrary code, so you should only load files that come from a source you trust.
Finally, it’s worth mentioning that there are alternatives to pickle and joblib such as ONNX and PMML. These formats don’t capture the full model object, but instead save a representation that can be used to make predictions. One major benefit of these formats is that they are neither environment-specific nor architecture-specific.
Let’s recreate the GridSearchCV object, but this time we’ll add the verbose parameter and set it to 1. When we run the search, this parameter will cause two changes to the output: it will print the number of folds, candidates, and total fits before the search begins, and it will print progress messages (including the elapsed time) while the search runs.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', verbose=1)
grid.fit(X, y)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=1)]: Done 49 tasks | elapsed: 0.5s
[Parallel(n_jobs=1)]: Done 199 tasks | elapsed: 2.4s
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy', verbose=1)
Now, let’s also add the n_jobs parameter, set it to -1, and re-run the grid search. This instructs scikit-learn to use parallel processing with all of your CPUs to perform the search. If your machine has multiple processors, this will generally be faster, and in this case it was about twice as fast.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
grid.fit(X, y)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.9s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 2.4s finished
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy', verbose=1)
If you find it useful to know how long a search takes, but verbose mode is a bit too verbose for you, another option is to remove the verbose parameter and instead prefix the second line with “%time”. This is known as an IPython line magic, and it will work as long as you’re using the Jupyter notebook or the IPython interpreter.
All this command does is tell you how long a particular line of code took to run. The number to focus on is the wall time.
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
%time grid.fit(X, y)
CPU times: user 203 ms, sys: 8.41 ms, total: 211 ms
Wall time: 711 ms
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
My general recommendation is to set n_jobs to -1 any time you’re running a grid search, which is what I’ll do for the rest of the book. However, it’s still a good idea to use %time or verbose mode to confirm that parallel processing is actually reducing the search time on your particular machine.
When you provide a set of parameter values to GridSearchCV, it will cross-validate every possible combination of those parameters. For example, we know that with this set of parameters, cross-validation will run 48 times.
params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10],
'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
Let’s say that we wanted to try additional C values for logistic regression. I’ll make a copy of the params dictionary called more_params, and then modify the C parameter in this dictionary to have 6 possible values instead of 3.
more_params = params.copy()
more_params['logisticregression__C'] = [0.01, 0.1, 1, 10, 100, 1000]
Since there are twice as many C values, we know that a grid search will take twice as long, meaning it will run cross-validation 96 times. But what if that grid search takes more time than we have available?
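You can verify that count by multiplying together the number of values for each parameter:

# the number of cross-validation runs for a full grid search is the
# product of the number of values for each parameter
n_combos = 1
for values in more_params.values():
    n_combos *= len(values)
n_combos  # 96 combinations, and with cv=5 that means 480 Pipeline fits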
An alternative method we can use is called randomized search, which is implemented in the RandomizedSearchCV class. We’ll import it from the model_selection module and then create an instance.
The API is very similar to GridSearchCV, except that you also specify the number of times it should run using the n_iter parameter. In this case, we’ll set the number of iterations to be 10.
Each time it runs, it will pick out a set of parameters at random and cross-validate that parameter set. In other words, it does the same thing as GridSearchCV, except that it picks out random combinations of parameters from the parameter dictionary rather than trying every single combination. Because there’s an element of randomness, we’ll also set the random_state parameter to 1 for reproducibility.
We’ll use the fit method to run the search, and because it will only try 10 combinations instead of 96 combinations, it will run about 10 times faster than a grid search would.
from sklearn.model_selection import RandomizedSearchCV
rand = RandomizedSearchCV(pipe, more_params, cv=5, scoring='accuracy', n_iter=10,
                          random_state=1, n_jobs=-1)
rand.fit(X, y)
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Far... n_jobs=-1, param_distributions={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000], 'logisticregression__penalty': ['l1', 'l2']}, random_state=1, scoring='accuracy')
By printing out the results of the search, you can see that it ran 10 times using random combinations of all of those parameters.
pd.DataFrame(rand.cv_results_)
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_logisticregression__penalty | param_logisticregression__C | param_columntransformer__simpleimputer__add_indicator | param_columntransformer__pipeline__onehotencoder__drop | param_columntransformer__countvectorizer__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.009594 | 0.001083 | 0.003577 | 0.000871 | l1 | 1 | True | first | (1, 1) | {'logisticregression__penalty': 'l1', 'logisti... | 0.810056 | 0.820225 | 0.797753 | 0.792135 | 0.853933 | 0.814820 | 0.021852 | 3 |
1 | 0.008598 | 0.000376 | 0.002928 | 0.000269 | l2 | 10 | False | first | (1, 1) | {'logisticregression__penalty': 'l2', 'logisti... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 7 |
2 | 0.010807 | 0.002507 | 0.002814 | 0.000038 | l1 | 1000 | True | first | (1, 1) | {'logisticregression__penalty': 'l1', 'logisti... | 0.821229 | 0.814607 | 0.808989 | 0.780899 | 0.831461 | 0.811437 | 0.017003 | 5 |
3 | 0.015629 | 0.002074 | 0.004148 | 0.001652 | l2 | 1000 | False | None | (1, 2) | {'logisticregression__penalty': 'l2', 'logisti... | 0.793296 | 0.797753 | 0.837079 | 0.780899 | 0.842697 | 0.810345 | 0.024810 | 6 |
4 | 0.023627 | 0.003475 | 0.004504 | 0.001646 | l1 | 10 | False | first | (1, 2) | {'logisticregression__penalty': 'l1', 'logisti... | 0.849162 | 0.825843 | 0.814607 | 0.780899 | 0.853933 | 0.824889 | 0.026361 | 2 |
5 | 0.015803 | 0.003058 | 0.003993 | 0.001029 | l1 | 0.1 | False | first | (1, 2) | {'logisticregression__penalty': 'l1', 'logisti... | 0.770950 | 0.797753 | 0.769663 | 0.758427 | 0.792135 | 0.777785 | 0.014779 | 9 |
6 | 0.014956 | 0.002243 | 0.004105 | 0.001644 | l2 | 1 | True | None | (1, 2) | {'logisticregression__penalty': 'l2', 'logisti... | 0.782123 | 0.814607 | 0.792135 | 0.786517 | 0.848315 | 0.804739 | 0.024489 | 8 |
7 | 0.015089 | 0.004356 | 0.003858 | 0.001471 | l1 | 100 | True | first | (1, 1) | {'logisticregression__penalty': 'l1', 'logisti... | 0.821229 | 0.814607 | 0.814607 | 0.786517 | 0.831461 | 0.813684 | 0.014918 | 4 |
8 | 0.016757 | 0.001239 | 0.004519 | 0.001777 | l1 | 100 | False | first | (1, 2) | {'logisticregression__penalty': 'l1', 'logisti... | 0.854749 | 0.820225 | 0.814607 | 0.780899 | 0.865169 | 0.827129 | 0.030171 | 1 |
9 | 0.014579 | 0.002268 | 0.003042 | 0.000092 | l2 | 0.01 | True | first | (1, 2) | {'logisticregression__penalty': 'l2', 'logisti... | 0.675978 | 0.786517 | 0.724719 | 0.758427 | 0.775281 | 0.744184 | 0.039982 | 10 |
You might be surprised to know that the best score it found, 0.827, is almost as high as the best score found by our grid search earlier in the chapter, which was 0.828. That being said, we did try additional C values in our randomized search, so the comparison isn’t entirely fair.
rand.best_score_
0.8271294959512898
Here’s the set of parameters that produced that score.
rand.best_params_
{'logisticregression__penalty': 'l1',
'logisticregression__C': 100,
'columntransformer__simpleimputer__add_indicator': False,
'columntransformer__pipeline__onehotencoder__drop': 'first',
'columntransformer__countvectorizer__ngram_range': (1, 2)}
There are four things I especially like about using a randomized search instead of a grid search:
Number 1. Because there are often a lot of parameter combinations that will produce similar results, randomized search will usually find the best result (or almost the best result) in far less time than grid search, which is what we saw above.
Number 2. It’s easier to control the computational budget of a randomized search. You can test how long a small number of searches takes, and then if you have a certain amount of time available for a search, you can simply choose the number of iterations that can be completed within that time period.
Number 3. Randomized search gives you the freedom to tune many more model and transformer parameters without worrying that it will take forever. You can try out a ton of different parameters for a short amount of time, and then narrow down which parameters to focus on based on what seems to be working. We’ll see this in practice in the next chapter.
Number 4. Randomized search will sometimes produce even better results than grid search because you can try a finer grid. For example, let’s say you were tuning a parameter that allowed continuous values from 0 to 1. If you were using a grid search, you might try the values 0, 0.5, and 1. But if you were using randomized search, you might try the values 0, 0.01, 0.02, and so on. It may turn out that the best value for this parameter is around 0.3, and randomized search could help you to find that out, whereas this grid search would have no chance of finding that out.
If you do need to create a fine grid of numbers for a randomized search, one useful function is NumPy’s linspace. For example, this code specifies that I want 101 equally spaced values, starting with 0 and ending with 1.
import numpy as np
np.linspace(0, 1, 101)
array([0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54,
0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65,
0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76,
0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87,
0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98,
0.99, 1. ])
Another similar function is NumPy’s logspace. This code specifies that I want 6 logarithmically spaced values, from 10 to the negative 2nd power through 10 to the 3rd power.
np.logspace(-2, 3, 6)
array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03])
If you’re comfortable using the SciPy library, you can instead specify continuous parameters for a randomized search using SciPy distributions. However, I find it much easier to just use NumPy’s linspace and logspace functions.
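That said, for illustration, here's a minimal sketch of what a SciPy-based search might look like. It assumes SciPy 1.4 or later (which provides the loguniform distribution), and the param_dist dictionary and n_iter value are hypothetical choices, not values from the chapter:
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
# sample C log-uniformly between 0.01 and 1000, rather than from a fixed grid
param_dist = {'logisticregression__C': loguniform(0.01, 1000),
              'logisticregression__penalty': ['l1', 'l2']}
rand_scipy = RandomizedSearchCV(pipe, param_dist, n_iter=10, cv=5,
                                scoring='accuracy', random_state=1)
rand_scipy.fit(X, y)
The advantage of a continuous distribution is that every iteration can draw a brand new C value, rather than being limited to the values you enumerated ahead of time.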
When you’re building and tuning a modeling Pipeline, it’s natural to wonder how you’ll know when you’re done. In other words, how good of a model is “good enough”? There are three ways that I tend to think about this question.
The first way is to ask the question: What is the minimum accuracy that we need to achieve for our model to be considered useful? In most cases, you want your model to at least outperform null accuracy, which is the accuracy you could achieve by always predicting the most frequent class.
To calculate the null accuracy for our training data, we use the value_counts method on y, and set normalize to True in order to display the counts as a percentage. From the results, we can see that class 0 is the most frequent class, and about 61.6% of the y values are class 0.
y.value_counts(normalize=True)
0 0.616162
1 0.383838
Name: Survived, dtype: float64
Thus the null accuracy for this problem is 61.6%, since an uninformed model, also known as the null model, could achieve that accuracy simply by predicting class 0 in all cases. In other words, this is the accuracy level that we want to outperform, otherwise the model is not providing any value. Thankfully, all of our Pipelines are outperforming null accuracy by a considerable amount.
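As an aside, scikit-learn can also compute the null accuracy for you via its DummyClassifier class, which (by design) ignores the feature values entirely. This is just an alternative sketch, not the chapter's approach:
from sklearn.dummy import DummyClassifier
# a "model" that always predicts the most frequent class
null_model = DummyClassifier(strategy='most_frequent')
cross_val_score(null_model, X, y, cv=5, scoring='accuracy').mean()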
The second way to think about this question is to ask: What is the maximum accuracy we could eventually reach? For most real problems, it’s impossible to know how accurate your model could be if you did enough tuning and tried enough models. It’s also impossible to know how accurate your model could be if you gathered more samples or more features. The main exception to this is if you’re working on a well-studied research problem, because in that case there may be a state-of-the-art benchmark that everyone is trying to surpass.
Thus in most practical circumstances, you don’t set a target accuracy. Instead, you work to improve the model until you run out of time, money, or ideas.
Next, let’s run an experiment to measure how much value the Name column adds to our model. We’ll start with the pipe object, which is our Pipeline that hasn’t been tuned by grid search. Recall that you can examine an individual Pipeline step by using the named_steps attribute. In this case, we’ll select the first step, which is our ColumnTransformer.
pipe.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
By passing X to its fit_transform method, we can see that the ColumnTransformer outputs 1518 feature columns. As we saw in lesson 8.4, all except 9 of those features were created from the Name column by CountVectorizer.
pipe.named_steps['columntransformer'].fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
The cross-validated accuracy of this Pipeline is 0.811, which we’ve been calling the baseline accuracy against which other Pipelines can be compared.
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
Similarly, we can select the ColumnTransformer from our Pipeline that was tuned by grid search. Notice that the ngram_range for CountVectorizer is (1, 2), meaning CountVectorizer will create features from both unigrams and bigrams in the Name column.
grid.best_estimator_.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'), ('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])
By using fit_transform, we can see that this ColumnTransformer outputs 3671 feature columns. Again, all except 9 of those features were created from the Name column.
grid.best_estimator_.named_steps['columntransformer'].fit_transform(X)
<891x3671 sparse matrix of type '<class 'numpy.float64'>'
with 10191 stored elements in Compressed Sparse Row format>
The cross-validated accuracy of this Pipeline is 0.828.
grid.best_score_
0.828253091456908
Finally, let’s compare these two Pipelines to a Pipeline that doesn’t include the Name column at all. First, we’ll create a ColumnTransformer called “no_name_ct” that excludes Name.
no_name_ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
As you can see, this ColumnTransformer only outputs 9 feature columns.
no_name_ct.fit_transform(X).shape
(891, 9)
Then, we’ll add no_name_ct to a Pipeline called “no_name_pipe” and cross-validate it. The accuracy is 0.783, which is substantially lower than the Pipelines that included the Name column. To be fair, this Pipeline hasn’t been tuned, but no amount of hyperparameter tuning would likely close a gap that large.
no_name_pipe = make_pipeline(no_name_ct, logreg)
cross_val_score(no_name_pipe, X, y, cv=5, scoring='accuracy').mean()
0.7833908731404181
Here are some conclusions that we can draw from this experiment:
- The Name column contains a lot of predictive signal, since removing it dropped the cross-validated accuracy from 0.811 to 0.783.
- Tuning the entire Pipeline, including the CountVectorizer features created from Name, raised the accuracy further, from 0.811 to 0.828.
It’s worth noting that there is additional tuning we could do to CountVectorizer to reduce the number of features it creates. However, there’s no way to know whether that would increase or decrease the Pipeline’s accuracy without actually trying it.
Recall that once a grid search is complete, GridSearchCV automatically refits the Pipeline on X and y and stores it as an attribute called best_estimator_. Therefore, we can access the model coefficients by first selecting the logistic regression step and then selecting the coef_ attribute.
grid.best_estimator_.named_steps['logisticregression'].coef_
array([[ 0.56431161, 0. , -0.08767203, ..., 0.01408723,
-0.43713268, -0.46358519]])
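Since the ColumnTransformer outputs 3671 feature columns, there should be exactly one coefficient per feature, which we can confirm by checking the shape of the coefficient array:
grid.best_estimator_.named_steps['logisticregression'].coef_.shape
(1, 3671)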
Ideally, we would also be able to get the names of the features that correspond to these coefficients by running the get_feature_names method on the ColumnTransformer step. However, get_feature_names only works if all of the underlying transformers have a get_feature_names method, and that is not the case here.
grid.best_estimator_.named_steps['columntransformer'].get_feature_names()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[53], line 1
----> 1 grid.best_estimator_.named_steps['columntransformer'].get_feature_names()

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py:371, in ColumnTransformer.get_feature_names(self)
    369         continue
    370     if not hasattr(trans, 'get_feature_names'):
--> 371         raise AttributeError("Transformer %s (type %s) does not "
    372                              "provide get_feature_names."
    373                              % (str(name), type(trans).__name__))
    374     feature_names.extend([name + "__" + f for f in
    375                          trans.get_feature_names()])
    376 return feature_names

AttributeError: Transformer pipeline (type Pipeline) does not provide get_feature_names.
Instead, as we saw previously in lesson 8.4, you would have to inspect the transformers one-by-one in order to determine the feature names.
grid.best_estimator_.named_steps['columntransformer'].transformers_
[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'),
('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])]
Note that starting in scikit-learn version 1.1, the get_feature_names_out method should work on this ColumnTransformer, since the get_feature_names_out method will be available for all transformers.
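For example, on scikit-learn 1.1 or later, a sketch like this should return one name per feature column (hypothetical code, since the environment used in this chapter is running an older version):
ct = grid.best_estimator_.named_steps['columntransformer']
# returns an array of 3671 feature names, each prefixed by its transformer name
ct.get_feature_names_out()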
When we perform a grid search, we’re trying to find the parameters that maximize the cross-validation score on a dataset. Thus, we’re using the same data to accomplish two separate goals:
- choosing the best parameters for the Pipeline, and
- estimating how well the Pipeline will perform on new data.
grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': None,
'columntransformer__simpleimputer__add_indicator': True,
'logisticregression__C': 10,
'logisticregression__penalty': 'l1'}
grid.best_score_
0.828253091456908
Using the same data for these two separate goals actually biases the Pipeline to this dataset and can result in overly optimistic scores.
If your main objective is to choose the best parameters, then this process is totally fine. You’ll just have to accept that its actual performance on new data may be lower than the performance estimated by grid search.
But if you also need a realistic estimate of the Pipeline’s performance on new data, then there’s an alternative process you can use, which I’ll walk you through in this lesson.
To start, we’ll import the train_test_split function from the model_selection module, and use it to split the data into training and testing sets, with 75% of the data as training and 25% of the data as testing. Note that I set the stratify parameter to y so that the class proportions will be approximately equal in the training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1, stratify=y)
Next, we’ll create a new GridSearchCV object called training_grid. When we run the grid search, we’ll only pass it the training set so that the tuning process only takes the training set into account.
training_grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
training_grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... LogisticRegression(random_state=1, solver='liblinear'))]), n_jobs=-1, param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'logisticregression__C': [0.1, 1, 10], 'logisticregression__penalty': ['l1', 'l2']}, scoring='accuracy')
Here are the best parameters found by grid search on the training set.
training_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': 'first',
'columntransformer__simpleimputer__add_indicator': False,
'logisticregression__C': 10,
'logisticregression__penalty': 'l2'}
We’re not actually interested in the best score found during the grid search. Instead, we’re going to use the best parameters found by the grid search to make predictions for the testing set, and then evaluate the accuracy of those predictions. We can do this by passing the testing set to the training_grid’s score method.
The accuracy it outputs is 0.816, which is a more realistic estimate of how the Pipeline will perform on new data, since the testing set is data that the Pipeline has never seen. However, it’s still an estimate from a single train/test split, so there’s no way to know how precise it is.
training_grid.score(X_test, y_test)
0.8161434977578476
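If you wanted a rough sense of that precision, one option (beyond what this chapter covers) is to repeat the entire split, search, and score process with several different random splits and look at the spread of the resulting scores. A sketch, with the caveat that it multiplies the total computation time:
# repeat the split-search-score process with three different splits
test_scores = []
for seed in [0, 1, 2]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=seed, stratify=y)
    temp_grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
    temp_grid.fit(X_tr, y_tr)
    test_scores.append(temp_grid.score(X_te, y_te))
test_scores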
Now that we’ve found the best parameters for the Pipeline and estimated its likely performance on new data, our final step is to actually make predictions on new data. Before making predictions, it’s critical that we train the Pipeline on all of our data, meaning the entirety of X and y, otherwise we’re throwing away valuable data.
In other words, we can’t simply use the training_grid’s predict method since it was only refit on X_train and y_train. Instead, we need to save the Pipeline with the best parameters, which we’ll call “best_pipe”, and fit it to X and y.
best_pipe = training_grid.best_estimator_
best_pipe.fit(X, y)
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder(drop='first'))]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('logisticregression', LogisticRegression(C=10, random_state=1, solver='liblinear'))])
Now we can make predictions on new data.
best_pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
If you decide that you’re going to follow the process that I’ve just outlined, then there are two guidelines that are important to follow.
First, you should only use the testing set for evaluating Pipeline performance one time. If you keep tuning the Pipeline again and again, each time checking its performance on the testing set, you’re essentially tuning the Pipeline to the particulars of the testing set. At that point, it no longer functions as an independent data source and thus its performance estimates will become less reliable.
Second, it’s important that you have enough data overall in order for the training and testing sets to both be sufficiently large once the dataset has been split:
- If the training set is too small, the grid search won’t have enough data to reliably identify the best parameters or fit a good model.
- If the testing set is too small, it won’t provide a reliable estimate of the Pipeline’s performance on new data.
Both of these situations would defeat the purpose of splitting the dataset, and thus this approach is best when you have a large enough dataset. Unfortunately, it’s difficult to say in the abstract how much data is “enough”, since that depends on the particulars of the dataset and the problem.
Earlier in this chapter, we tuned the regularization parameters of logistic regression. In this lesson, I’ll briefly explain what regularization actually is.
Regularization is a process that constrains the size of a model’s coefficients in order to minimize overfitting. Overfitting is when your model fits too closely to patterns in the training data, which causes your model not to perform well when it makes predictions on new data.
Regularization minimizes overfitting by reducing the variance of the model. Thus if you believe a model is too complex, regularization will reduce the error due to variance more than it increases the error due to bias, resulting in a model that is more likely to generalize to new data.
In simpler terms, regularization makes your model a bit less flexible so that it’s more likely to follow the true patterns in the data and less likely to follow the noise. Regularization is especially useful when you have outliers in the training data, because regularization decreases the influence that outliers have on the model.
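To make the effect of regularization concrete, here's a small self-contained sketch (using synthetic data rather than the chapter's dataset) showing that decreasing C, which increases the regularization strength of logistic regression, shrinks the size of the model's coefficients:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1)
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, solver='liblinear', random_state=1)
    model.fit(X_demo, y_demo)
    # the mean absolute coefficient grows as C increases (less regularization)
    print(C, np.abs(model.coef_).mean())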