10  Evaluating and tuning a Pipeline

10.1 Evaluating a Pipeline with cross-validation

In this chapter, we’re going to take a deep dive into how to efficiently tune our Pipeline for maximum accuracy.

Let’s return to the topic of model evaluation.

As you might recall, we used cross-validation way back in chapter 2 to evaluate our most basic model. Since that chapter, we’ve been adding many more features without re-running cross-validation. That’s because any model evaluation procedure is highly unreliable with only 10 rows of data, so it would have been misleading to run cross-validation and compare the results. But now that we’re using the full dataset, cross-validation can once again be used.

  • First, we’ll import the cross_val_score function from the model_selection module.
  • Instead of passing a model to cross_val_score, we can actually pass our entire Pipeline.
  • We also pass it X and y.
  • And then we specify the number of cross-validation folds. Using 5 or 10 folds has been shown to be a reasonable default choice, and so we’ll choose 5 in order to minimize the computation. 5 folds has actually been the default for cross_val_score since version 0.22, but I like to include it anyway for clarity.
  • Finally, we’ll specify the evaluation metric of classification accuracy.

When we run it, cross_val_score outputs a mean accuracy of 0.811, which we’ll use as the baseline accuracy against which our future Pipelines can be compared.

from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762

Pipeline accuracy scores:

  • Baseline (no tuning): 0.811

Let’s talk about what actually happens “under the hood” when we run the cross_val_score function on a Pipeline:

  • In step 1, cross_val_score splits the data into 5 folds. 4 out of 5 folds (meaning 80% of the data) are set aside for training, and the remaining fold (meaning 20% of the data) is set aside for testing.
  • In step 2, the Pipeline’s fit method is run on the training portion. Thus the transformations specified in the ColumnTransformer are performed on the training portion, and the transformed training data is used to fit the model.
  • In step 3, the Pipeline’s predict method is run on the testing portion. Thus the transformations learned during step 2 are applied to the testing portion, the transformed testing data is passed to the fitted model, and the model makes predictions.
  • Finally, in step 4, the accuracy of those predictions is calculated.
  • Steps 2 through 4 are then repeated 4 more times, and each time a different fold is set aside as the testing portion.
  • cross_val_score thus outputs 5 accuracy scores, and we take the mean of those scores.

Steps of 5-fold cross-validation on a Pipeline:

  1. Split data into 5 folds (A, B, C, D, E)
  • ABCD is training set
  • E is testing set
  2. Pipeline is fit on training set
  • ABCD is transformed
  • Model is fit on transformed data
  3. Pipeline makes predictions on testing set
  • E is transformed (using step 2 transformations)
  • Model makes predictions on transformed data
  4. Calculate accuracy of those predictions
  5. Repeat steps 2 through 4 four more times, with a different fold as the testing set each time
  6. Calculate the mean of the 5 scores
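
To make these steps concrete, here’s a minimal sketch of roughly what cross_val_score does internally when given a classifier. It assumes pipe, X, and y are defined as in the previous chapters, and it glosses over many details of the real implementation:

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# cross_val_score uses stratified sampling to create the folds for a classifier
kf = StratifiedKFold(n_splits=5)
scores = []
for train_idx, test_idx in kf.split(X, y):
    fold_pipe = clone(pipe)                              # fresh, unfitted copy for each fold
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])  # transformers learn from the training folds only
    preds = fold_pipe.predict(X.iloc[test_idx])          # learned transformations are applied to the testing fold
    scores.append(accuracy_score(y.iloc[test_idx], preds))
np.mean(scores)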

One thing you might have noticed is that cross_val_score splits the data in step 1 before performing the transformations in steps 2 and 3. As a result, the imputation values for Age and Fare and the vocabulary for CountVectorizer are all computed 5 different times. Each time, these values are computed using the training set only, and then applied to both the training and testing sets.

Alternatively, you could imagine performing all of the transformations first, and then splitting the data. This would be much faster, since the imputation values and the vocabulary would be computed only once on the full dataset.

So why does cross_val_score split the data first? Because splitting the data before performing the transformations prevents data leakage. If the transformations were instead performed on the full dataset before splitting, information about the testing set would be “leaked” into the model training process, as the short sketch after the list below illustrates.

As we discussed in the previous chapter, this is one way that scikit-learn helps to shield you from data leakage.

Why does cross_val_score split the data first?

  • Proper cross-validation:
    • Data is split (step 1) before transformations (steps 2 and 3)
    • Imputation values and vocabulary are computed using training set only
    • Prevents data leakage
  • Improper cross-validation:
    • Transformations are performed before data is split
    • Imputation values and vocabulary are computed using full dataset
    • Causes data leakage
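
To make the contrast concrete, here’s a minimal sketch of the improper approach, shown only to illustrate the leakage (don’t actually do this). It pulls the transformer and model out of our existing Pipeline using the step names we’ll see later in this chapter:

# improper cross-validation, for illustration only
ct = pipe.named_steps['columntransformer']
model = pipe.named_steps['logisticregression']
X_transformed = ct.fit_transform(X)  # imputation values and vocabulary are computed on ALL rows
cross_val_score(model, X_transformed, y, cv=5, scoring='accuracy').mean()

Because the testing folds influenced the transformations, the resulting score would be optimistically biased.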

10.3 Tuning the model

In this lesson, we’re going to tune the model, and then in the next lesson, we’ll also tune the transformers.

For the logistic regression model, we’re going to tune two parameters:

  • The first parameter is “penalty”, which is the type of regularization. For this parameter, the default value is “l2”, and we’re going to try the values “l1” and “l2”. And just to be clear, the first character of each of those values is a lowercase “L”.
  • The second parameter is “C”, which is the amount of regularization. For this parameter, the default value is 1, and we’re going to try the values 0.1, 1, and 10.

LogisticRegression tuning parameters:

  • penalty: Type of regularization
    • ‘l1’
    • ‘l2’ (default)
  • C: Amount of regularization
    • 0.1
    • 1 (default)
    • 10

Deciding which parameters to tune and what values to try requires both research and experience, and unfortunately, it’s different for every type of model.

In order to tune a Pipeline with GridSearchCV, we need to get the names of the Pipeline steps from the named_steps attribute. We’ll tune the “logisticregression” step in this lesson, and we’ll tune the “columntransformer” step in the next lesson.

pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])

To use GridSearchCV, we need to create a dictionary in which each entry represents a parameter and the values we want to try for that parameter. We’ll start by creating an empty dictionary called params, and then we’ll add the two entries.

For each dictionary entry, the key is the Pipeline step name, followed by two underscores, followed by the parameter name. Thus the key for the first entry is “logisticregression__penalty”, and the key for the second entry is “logisticregression__C”.

Using two underscores is what allows GridSearchCV to distinguish between the step name and the parameter name. Using a single underscore would be ambiguous, since a step name or parameter name can have an underscore within it.

The value for each dictionary entry is a list of the values you want to try for that parameter. Thus the value for the first entry is a list of “l1” and “l2”, and the value for the second entry is a list of 0.1, 1, and 10.

Parameter dictionary for GridSearchCV:

  • Key: step__parameter
    • ’logisticregression__penalty’
    • ’logisticregression__C’
  • Value: List of values to try
    • [‘l1’, ‘l2’]
    • [0.1, 1, 10]

After adding the two entries, we’ll print out the params dictionary just to make sure that it looks correct.

params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]
params
{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10]}

Now that we’ve created the parameter dictionary, we can set up the grid search. We import the GridSearchCV class from the model_selection module.

from sklearn.model_selection import GridSearchCV

Next, we create an instance of GridSearchCV called grid. We pass it the Pipeline, the parameter dictionary, the number of folds, and the evaluation metric.

Finally, we run the grid search by fitting the grid object with X and y. Because our scikit-learn configuration is set to display diagrams, we see a diagram of the Pipeline now that the grid search is complete.

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder())]),
                                                                         ['Embarked',
                                                                          'Sex']),
                                                                        ('countvectorizer',
                                                                         CountVectorizer(),
                                                                         'Name'),
                                                                        ('simpleimputer',
                                                                         SimpleImputer(),
                                                                         ['Age',
                                                                          'Fare']),
                                                                        ('passthrough',
                                                                         'passthrough',
                                                                         ['Parch'])])),
                                       ('logisticregression',
                                        LogisticRegression(random_state=1,
                                                           solver='liblinear'))]),
             param_grid={'logisticregression__C': [0.1, 1, 10],
                         'logisticregression__penalty': ['l1', 'l2']},
             scoring='accuracy')

The results of the grid search are stored in an attribute called cv_results_, which we’ll convert to a DataFrame for easier viewing.

There are 6 rows because cross-validation was run 6 times, once for every possible combination of the 2 values of penalty and the 3 values of C that we specified.

results = pd.DataFrame(grid.cv_results_)
results
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.006248 0.000092 0.002339 0.000045 0.1 l1 {'logisticregression__C': 0.1, 'logisticregres... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 6
1 0.006437 0.000056 0.002359 0.000054 0.1 l2 {'logisticregression__C': 0.1, 'logisticregres... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 5
2 0.007092 0.000159 0.002399 0.000006 1 l1 {'logisticregression__C': 1, 'logisticregressi... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 2
3 0.006887 0.000091 0.002386 0.000013 1 l2 {'logisticregression__C': 1, 'logisticregressi... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 3
4 0.010375 0.001186 0.002408 0.000008 10 l1 {'logisticregression__C': 10, 'logisticregress... 0.832402 0.808989 0.808989 0.786517 0.853933 0.818166 0.023031 1
5 0.007407 0.000158 0.002385 0.000018 10 l2 {'logisticregression__C': 10, 'logisticregress... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 4

Notice the rank_test_score column, which is the last column in the DataFrame. We’ll use the DataFrame’s sort_values method to sort the rows by that column in ascending order.

By examining the mean_test_score column, we can see that the best parameter combination resulted in a cross-validated accuracy of 0.818, which is higher than our baseline accuracy of 0.811.

Scrolling to the left, we can see that the best accuracy occurred when C was 10 and penalty was l1, neither of which was the default value for that parameter.

results.sort_values('rank_test_score')
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
4 0.010375 0.001186 0.002408 0.000008 10 l1 {'logisticregression__C': 10, 'logisticregress... 0.832402 0.808989 0.808989 0.786517 0.853933 0.818166 0.023031 1
2 0.007092 0.000159 0.002399 0.000006 1 l1 {'logisticregression__C': 1, 'logisticregressi... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 2
3 0.006887 0.000091 0.002386 0.000013 1 l2 {'logisticregression__C': 1, 'logisticregressi... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 3
5 0.007407 0.000158 0.002385 0.000018 10 l2 {'logisticregression__C': 10, 'logisticregress... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 4
1 0.006437 0.000056 0.002359 0.000054 0.1 l2 {'logisticregression__C': 0.1, 'logisticregres... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 5
0 0.006248 0.000092 0.002339 0.000045 0.1 l1 {'logisticregression__C': 0.1, 'logisticregres... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 6

Pipeline accuracy scores:

  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811

10.4 Tuning the transformers

In the previous lesson, we built a grid search for tuning model parameters and found that the best accuracy occurred when C was 10 and penalty was l1. In this lesson, we’re going to expand the search to also include transformer parameters.

When expanding the search, you might first think that we should set C to 10 and penalty to l1, and then only search the transformer parameters, since that would be the most computationally efficient approach.

However, the better approach is actually to consider all of the values for C and penalty in combination with all of the transformer parameters. That’s because we’re searching for the best combination of all parameters, and since each parameter can influence what is optimal for the other parameters, the best combination might use a C value other than 10 or a penalty value other than l1.

Options for expanding the grid search:

  • Initial idea: Set C=10 and penalty=‘l1’, then only search transformer parameters
  • Better approach: Search for best combination of C, penalty, and transformer parameters

All of that is to say that we’re going to expand the existing params dictionary to include transformer parameters. And to include transformer parameters, we first need to figure out the transformer names.

From the previous lesson, you might recall that the first step in the Pipeline is named columntransformer (all lowercase). We’ll access that step using the named_steps attribute, which then allows us to examine the named_transformers_ attribute of the ColumnTransformer.

As a side note, named_transformers_ ends with an underscore because it’s set during the fit step, whereas named_steps does not end with an underscore because it’s set when the Pipeline instance is created.

Anyway, we can now see the transformer names. We’re going to tune a single parameter from each of three transformers. Normally I might tune more parameters, but for the sake of brevity I’ll limit it to these three.

pipe.named_steps['columntransformer'].named_transformers_
{'pipeline': Pipeline(steps=[('simpleimputer',
                  SimpleImputer(fill_value='missing', strategy='constant')),
                 ('onehotencoder', OneHotEncoder())]),
 'countvectorizer': CountVectorizer(),
 'simpleimputer': SimpleImputer(),
 'passthrough': 'passthrough'}

The first parameter we’re going to tune is the “drop” parameter of OneHotEncoder, which was added to scikit-learn in version 0.21 and which I discussed earlier in the book.

To add it to the params dictionary, we specify the Pipeline step name, which is “columntransformer”. Then we specify the transformer name, which is “pipeline”. Then we specify the step name of the inner Pipeline, which is “onehotencoder”. Finally we specify the parameter name, which is “drop”. All of these components are separated by two underscores.

The parameter values we’re going to try are “None” and “first”. None is the default, and it means don’t drop any columns, whereas first means drop the first column of each feature after encoding.

OneHotEncoder tuning parameter:

  • drop: Method for dropping a column of each feature
    • None (default)
    • ‘first’
params['columntransformer__pipeline__onehotencoder__drop'] = [None, 'first']
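
If you’d like to see the effect of drop for yourself, here’s a toy example that is separate from our Pipeline (the demo DataFrame is made up for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})
OneHotEncoder(sparse=False).fit_transform(demo)
# two columns (female, male): [[0, 1], [1, 0], [0, 1]]
OneHotEncoder(drop='first', sparse=False).fit_transform(demo)
# the first category's column is dropped: [[1], [0], [1]]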

If you’re ever unsure how to specify a parameter for a grid search, you can see all of the Pipeline’s parameters by using the get_params method followed by the keys method. I’m converting the output to a list for easier readability. This list is also useful if you prefer to copy and paste the parameter names rather than typing them.

As you can see, there are many transformer and model parameters that we’re not tuning, many of which could be useful to tune given enough time and computational resources.

list(pipe.get_params().keys())
['memory',
 'steps',
 'verbose',
 'columntransformer',
 'logisticregression',
 'columntransformer__n_jobs',
 'columntransformer__remainder',
 'columntransformer__sparse_threshold',
 'columntransformer__transformer_weights',
 'columntransformer__transformers',
 'columntransformer__verbose',
 'columntransformer__pipeline',
 'columntransformer__countvectorizer',
 'columntransformer__simpleimputer',
 'columntransformer__passthrough',
 'columntransformer__pipeline__memory',
 'columntransformer__pipeline__steps',
 'columntransformer__pipeline__verbose',
 'columntransformer__pipeline__simpleimputer',
 'columntransformer__pipeline__onehotencoder',
 'columntransformer__pipeline__simpleimputer__add_indicator',
 'columntransformer__pipeline__simpleimputer__copy',
 'columntransformer__pipeline__simpleimputer__fill_value',
 'columntransformer__pipeline__simpleimputer__missing_values',
 'columntransformer__pipeline__simpleimputer__strategy',
 'columntransformer__pipeline__simpleimputer__verbose',
 'columntransformer__pipeline__onehotencoder__categories',
 'columntransformer__pipeline__onehotencoder__drop',
 'columntransformer__pipeline__onehotencoder__dtype',
 'columntransformer__pipeline__onehotencoder__handle_unknown',
 'columntransformer__pipeline__onehotencoder__sparse',
 'columntransformer__countvectorizer__analyzer',
 'columntransformer__countvectorizer__binary',
 'columntransformer__countvectorizer__decode_error',
 'columntransformer__countvectorizer__dtype',
 'columntransformer__countvectorizer__encoding',
 'columntransformer__countvectorizer__input',
 'columntransformer__countvectorizer__lowercase',
 'columntransformer__countvectorizer__max_df',
 'columntransformer__countvectorizer__max_features',
 'columntransformer__countvectorizer__min_df',
 'columntransformer__countvectorizer__ngram_range',
 'columntransformer__countvectorizer__preprocessor',
 'columntransformer__countvectorizer__stop_words',
 'columntransformer__countvectorizer__strip_accents',
 'columntransformer__countvectorizer__token_pattern',
 'columntransformer__countvectorizer__tokenizer',
 'columntransformer__countvectorizer__vocabulary',
 'columntransformer__simpleimputer__add_indicator',
 'columntransformer__simpleimputer__copy',
 'columntransformer__simpleimputer__fill_value',
 'columntransformer__simpleimputer__missing_values',
 'columntransformer__simpleimputer__strategy',
 'columntransformer__simpleimputer__verbose',
 'logisticregression__C',
 'logisticregression__class_weight',
 'logisticregression__dual',
 'logisticregression__fit_intercept',
 'logisticregression__intercept_scaling',
 'logisticregression__l1_ratio',
 'logisticregression__max_iter',
 'logisticregression__multi_class',
 'logisticregression__n_jobs',
 'logisticregression__penalty',
 'logisticregression__random_state',
 'logisticregression__solver',
 'logisticregression__tol',
 'logisticregression__verbose',
 'logisticregression__warm_start']

Moving along, the second parameter we’re going to tune is the “ngram_range” parameter of CountVectorizer.

Again, we specify the Pipeline step name, then the transformer name, and then the parameter name. Note that these three components are separated by double underscores, but there’s just a single underscore within ngram_range because that’s part of the parameter name.

The parameter values we’re going to try are the tuples (1, 1) and (1, 2). (1, 1) is the default, and it creates a single feature from each word. (1, 2) creates features from both single words, known as unigrams, and word pairs, known as bigrams.

CountVectorizer tuning parameter:

  • ngram_range: Selection of word n-grams to be extracted as features
    • (1, 1) (default)
    • (1, 2)
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
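
If you’re unsure what n-grams look like, here’s a toy example that is separate from our Pipeline, using a single sample Name value:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['Braund, Mr. Owen Harris']
CountVectorizer(ngram_range=(1, 1)).fit(docs).get_feature_names()
# ['braund', 'harris', 'mr', 'owen']
CountVectorizer(ngram_range=(1, 2)).fit(docs).get_feature_names()
# ['braund', 'braund mr', 'harris', 'mr', 'mr owen', 'owen', 'owen harris']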

The third parameter we’re going to tune is the “add_indicator” parameter of SimpleImputer, which was added to scikit-learn in version 0.21 and which I discussed earlier in the book.

Once again, we specify the Pipeline step name, then the transformer name, and then the parameter name.

The parameter values we’re going to try are “False” and “True”. False is the default, and it does not add a missing indicator column, whereas True does add a missing indicator column.

SimpleImputer tuning parameter:

  • add_indicator: Option to add a missing indicator column
    • False (default)
    • True
params['columntransformer__simpleimputer__add_indicator'] = [False, True]
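
Here’s a toy example of add_indicator that is separate from our Pipeline, using a single made-up column with one missing value:

import numpy as np
from sklearn.impute import SimpleImputer

demo = np.array([[10.0], [np.nan], [30.0]])
SimpleImputer().fit_transform(demo)
# [[10.], [20.], [30.]]  (the missing value is replaced by the mean)
SimpleImputer(add_indicator=True).fit_transform(demo)
# [[10., 0.], [20., 1.], [30., 0.]]  (the second column flags which values were missing)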

Before running the grid search, we’ll print out the params dictionary. By multiplying 2 by 3 by 2 by 2 by 2, we can calculate that there are now 48 parameter combinations, and thus the grid search will take about 8 times longer than the previous search.
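
If the longer search becomes a bottleneck for you, GridSearchCV accepts an n_jobs parameter that runs the fits in parallel across your CPU cores. We won’t use it below, but the change would be a single argument:

# optional: parallelize the search across all CPU cores (the scores are unchanged)
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)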

As an aside, if we had used the Pipeline and ColumnTransformer classes instead of the make_pipeline and make_column_transformer functions, we could have customized the step names and transformer names, which would have made these parameter specifications a bit easier to read and write. You can revisit the earlier lessons on those classes if you need a refresher on that topic.

params
{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10],
 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True]}

Anyway, next we’ll recreate the grid object with the new params dictionary, and then we’ll run the grid search.

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder())]),
                                                                         ['Embarked',
                                                                          'Sex']),
                                                                        ('countvectorizer',
                                                                         CountVectorizer(),
                                                                         'Name'),
                                                                        ('simpleimputer',
                                                                         SimpleImputer(),
                                                                         ['Age',
                                                                          'Fare']),
                                                                        (...
                                        LogisticRegression(random_state=1,
                                                           solver='liblinear'))]),
             param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
                                                                              1),
                                                                             (1,
                                                                              2)],
                         'columntransformer__pipeline__onehotencoder__drop': [None,
                                                                              'first'],
                         'columntransformer__simpleimputer__add_indicator': [False,
                                                                             True],
                         'logisticregression__C': [0.1, 1, 10],
                         'logisticregression__penalty': ['l1', 'l2']},
             scoring='accuracy')

Now that the search is complete, we’ll convert the search results into a DataFrame and sort it by the rank_test_score column.

As you can see from the mean_test_score column, the best accuracy of 0.828 is an improvement over the previous grid search, which had an accuracy of 0.818. Keep in mind that your exact results may differ based on your scikit-learn version along with other factors. However, there’s no randomness involved when you set cv to an integer, and so your results will be the same every time you run this grid search.

results = pd.DataFrame(grid.cv_results_)
results.sort_values('rank_test_score')
mean_fit_time std_fit_time mean_score_time std_score_time param_columntransformer__countvectorizer__ngram_range param_columntransformer__pipeline__onehotencoder__drop param_columntransformer__simpleimputer__add_indicator param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
34 0.013390 0.000783 0.002698 0.000020 (1, 2) None True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.854749 0.820225 0.825843 0.780899 0.859551 0.828253 0.028264 1
28 0.013452 0.001096 0.002653 0.000021 (1, 2) None False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.849162 0.820225 0.814607 0.780899 0.859551 0.824889 0.027760 2
40 0.017350 0.001327 0.002691 0.000016 (1, 2) first False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.849162 0.825843 0.814607 0.780899 0.853933 0.824889 0.026361 2
46 0.017452 0.002011 0.002740 0.000021 (1, 2) first True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.843575 0.820225 0.814607 0.780899 0.853933 0.822648 0.025417 4
16 0.014869 0.000915 0.004289 0.002258 (1, 1) first False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.837989 0.803371 0.825843 0.786517 0.848315 0.820407 0.022611 5
22 0.012250 0.001199 0.002482 0.000023 (1, 1) first True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.826816 0.808989 0.814607 0.786517 0.859551 0.819296 0.023999 6
4 0.010636 0.001193 0.002491 0.000045 (1, 1) None False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.832402 0.808989 0.808989 0.786517 0.853933 0.818166 0.023031 7
10 0.009969 0.000523 0.002513 0.000014 (1, 1) None True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.815642 0.808989 0.820225 0.786517 0.853933 0.817061 0.021770 8
20 0.007983 0.000461 0.002600 0.000072 (1, 1) first True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.810056 0.820225 0.797753 0.792135 0.853933 0.814820 0.021852 9
2 0.011787 0.008637 0.002481 0.000022 (1, 1) None False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 10
44 0.010098 0.000371 0.002742 0.000050 (1, 2) first True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.853933 0.813703 0.022207 11
47 0.010278 0.000185 0.002679 0.000006 (1, 2) first True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.820225 0.820225 0.780899 0.853933 0.812598 0.026265 12
8 0.007782 0.000613 0.002519 0.000013 (1, 1) None True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.786517 0.792135 0.859551 0.812579 0.026183 13
38 0.009798 0.000299 0.002631 0.000014 (1, 2) first False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.848315 0.812579 0.020194 14
14 0.008625 0.000364 0.003364 0.000394 (1, 1) first False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.848315 0.812579 0.020194 14
26 0.009654 0.000203 0.002621 0.000008 (1, 2) None False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.815642 0.820225 0.786517 0.792135 0.848315 0.812567 0.022100 16
11 0.007627 0.000085 0.002493 0.000017 (1, 1) None True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.792135 0.870787 0.811481 0.031065 17
21 0.007036 0.000065 0.002451 0.000020 (1, 1) first True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.820225 0.803371 0.786517 0.853933 0.811468 0.024076 18
3 0.007053 0.000083 0.002475 0.000059 (1, 1) None False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 19
23 0.007419 0.000092 0.002446 0.000020 (1, 1) first True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.776536 0.803371 0.808989 0.792135 0.870787 0.810363 0.032182 20
9 0.007227 0.000080 0.002494 0.000009 (1, 1) None True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.825843 0.797753 0.786517 0.848315 0.810345 0.023233 21
15 0.007868 0.000300 0.002760 0.000241 (1, 1) first False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.803371 0.786517 0.837079 0.810332 0.017107 22
32 0.009945 0.000132 0.002670 0.000006 (1, 2) None True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.780899 0.792135 0.853933 0.810332 0.025419 22
17 0.008091 0.000432 0.002767 0.000232 (1, 1) first False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 24
35 0.010225 0.000123 0.002676 0.000017 (1, 2) None True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.820225 0.814607 0.780899 0.848315 0.809234 0.025357 24
5 0.007588 0.000143 0.002450 0.000012 (1, 1) None False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 24
29 0.010094 0.000113 0.002614 0.000014 (1, 2) None False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.814607 0.820225 0.780899 0.837079 0.808104 0.020904 27
45 0.009662 0.000085 0.002686 0.000014 (1, 2) first True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.814607 0.797753 0.786517 0.848315 0.808097 0.022143 28
41 0.010140 0.000076 0.002633 0.000008 (1, 2) first False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.814607 0.820225 0.780899 0.831461 0.806980 0.019414 29
39 0.009537 0.000114 0.002621 0.000010 (1, 2) first False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.808989 0.797753 0.786517 0.837079 0.805844 0.017164 30
27 0.009532 0.000076 0.002620 0.000013 (1, 2) None False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.814607 0.792135 0.786517 0.837079 0.805844 0.018234 30
33 0.009623 0.000093 0.002661 0.000012 (1, 2) None True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.814607 0.792135 0.786517 0.848315 0.804739 0.024489 32
31 0.009042 0.000162 0.002697 0.000043 (1, 2) None True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.769663 0.786517 0.814607 0.793491 0.015231 33
7 0.006694 0.000080 0.002536 0.000061 (1, 1) None True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.803371 0.764045 0.786517 0.814607 0.793484 0.017253 34
19 0.007016 0.000145 0.002575 0.000145 (1, 1) first True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.764045 0.780899 0.814607 0.791243 0.017572 35
43 0.010422 0.002550 0.002856 0.000219 (1, 2) first True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.797753 0.764045 0.780899 0.808989 0.790114 0.015849 36
37 0.008877 0.000029 0.002641 0.000018 (1, 2) first False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.764045 0.780899 0.808989 0.789003 0.016100 37
25 0.008901 0.000061 0.002608 0.000011 (1, 2) None False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.764045 0.775281 0.808989 0.788996 0.016944 38
1 0.006655 0.000039 0.002452 0.000011 (1, 1) None False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 39
13 0.007833 0.000708 0.003192 0.000419 (1, 1) first False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.764045 0.780899 0.808989 0.787885 0.016343 40
0 0.006412 0.000095 0.002483 0.000073 (1, 1) None False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
30 0.011053 0.003787 0.002923 0.000274 (1, 2) None True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
24 0.008545 0.000088 0.002621 0.000013 (1, 2) None False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
6 0.006543 0.000118 0.002510 0.000011 (1, 1) None True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
36 0.008752 0.000158 0.002635 0.000023 (1, 2) first False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
42 0.008697 0.000146 0.002683 0.000007 (1, 2) first True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
12 0.009742 0.004551 0.006101 0.005669 (1, 1) first False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
18 0.006817 0.000245 0.002593 0.000095 (1, 1) first True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811

Rather than always examining the results DataFrame, we can actually just access the single best score and the set of parameters that resulted in that score via attributes of the grid object.

It’s worth noting that only the drop parameter is using its default value, whereas the other four parameters are not using their default values.

grid.best_score_
0.828253091456908
grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': True,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}

It’s hard to say whether this truly is the best set of parameters, because some of the differences in accuracy between parameter combinations may be due to chance, based on which samples happened to appear in each fold. That’s just a limitation of basic cross-validation, and so all we can say with confidence is that this is a good combination of parameters.

10.5 Using the best Pipeline to make predictions

Now that we’ve tuned both the model parameters and the transformer parameters, we want to use those parameters with the Pipeline when making predictions.

GridSearchCV actually makes this very easy. After locating the best set of parameters, it automatically refits the Pipeline on X and y using those parameters, and it stores the fitted Pipeline in an attribute called best_estimator_. And as you can see, that attribute is indeed a Pipeline object.

type(grid.best_estimator_)
sklearn.pipeline.Pipeline

If we print out the best_estimator_ attribute and click on the components, we can see that the parameters of this Pipeline match the best parameter set we located in the previous lesson.

grid.best_estimator_
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(ngram_range=(1,
                                                                               2)),
                                                  'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(add_indicator=True),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(C=10, penalty='l1', random_state=1,
                                    solver='liblinear'))])

In order to make predictions using this Pipeline, all we have to do is run the grid object’s predict method, which calls the predict method of the best_estimator_, and pass it X_new.

grid.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

I just want to emphasize that this Pipeline, with the best set of parameters, was automatically refit to the entire dataset. You should always train your model on the entire dataset, meaning all samples for which you know the target value, before using it to make predictions on new data. Otherwise, you’re throwing away valuable training data.

10.6 Q&A: How do I save the best Pipeline for future use?

After completing a grid search, you may want to save the Pipeline with the best set of parameters so that you can use it to make predictions later.

As we saw in the previous lesson, the Pipeline with the best set of parameters is stored as an attribute of the GridSearchCV object called best_estimator_, so this is the object that we want to save.

type(grid.best_estimator_)
sklearn.pipeline.Pipeline

You can save a Pipeline to a file using pickle, which is part of the Python standard library.

import pickle

We’ll use pickle’s dump function to save the Pipeline to a file called “pipe.pickle”.

with open('pipe.pickle', 'wb') as f:
    pickle.dump(grid.best_estimator_, f)

Then we can use pickle’s load function to load the Pipeline from the pipe.pickle file into an object called pipe_from_pickle.

with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)

pipe_from_pickle is identical to grid.best_estimator_, and so when we use pipe_from_pickle to make predictions, these predictions are identical to the predictions made by the grid object.

pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

One alternative to pickle is joblib, which is usually more efficient than pickle for scikit-learn objects. Although it’s not part of the Python standard library, joblib has been a dependency of scikit-learn since version 0.21.

import joblib

Just like pickle, you use joblib’s dump function to save the Pipeline to a file, which we’ll call “pipe.joblib”.

joblib.dump(grid.best_estimator_, 'pipe.joblib')
['pipe.joblib']

Then, we’ll use the load function to load the Pipeline from the file into an object called pipe_from_joblib.

pipe_from_joblib = joblib.load('pipe.joblib')

Finally, we’ll use pipe_from_joblib to make predictions.

pipe_from_joblib.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

To be clear, pickle and joblib are not limited to Pipelines and can be used with other scikit-learn objects, such as a standalone model object that is not inside a Pipeline.

There are a couple warnings to keep in mind when working with pickle and joblib objects:

  • First, the objects may be version-specific and architecture-specific. As such, you should only load them into an identical environment, meaning the same versions of scikit-learn and its dependencies, running on the same computing architecture. (A lightweight safeguard is sketched after the list below.)
  • Second, these objects can be poisoned with malicious code, and so you should only load objects from a trusted source.

Warnings for pickle and joblib objects:

  • May be version-specific and architecture-specific
  • Can be poisoned with malicious code
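
One lightweight safeguard against the version problem, which is my own suggestion rather than anything built into scikit-learn, is to save the library version alongside the Pipeline and check it at load time:

import sklearn
import joblib

# bundle the Pipeline with the version that produced it
joblib.dump({'pipeline': grid.best_estimator_,
             'sklearn_version': sklearn.__version__}, 'pipe_with_version.joblib')

bundle = joblib.load('pipe_with_version.joblib')
if bundle['sklearn_version'] != sklearn.__version__:
    print('Warning: Pipeline was saved with scikit-learn', bundle['sklearn_version'])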

Finally, it’s worth mentioning that there are alternatives to pickle and joblib such as ONNX and PMML. These formats don’t capture the full model object, but instead save a representation that can be used to make predictions. One major benefit of these formats is that they are neither environment-specific nor architecture-specific.

Alternatives to pickle and joblib:

  • Examples: ONNX, PMML
  • Save a model representation for making predictions
  • Work across environments and architectures

10.9 Q&A: What’s the target accuracy we are trying to achieve?

When you’re building and tuning a modeling Pipeline, it’s natural to wonder how you’ll know when you’re done. In other words, how good of a model is “good enough”? There are three ways that I tend to think about this question.

When is a model “good enough”?

  • Useful model: Outperforms null accuracy
  • Best possible model: Usually impossible to know the theoretical maximum accuracy
  • Practical model: Continue improving until you run out of resources

The first way is to ask the question: What is the minimum accuracy that we need to achieve for our model to be considered useful? In most cases, you want your model to at least outperform null accuracy, which is the accuracy you could achieve by always predicting the most frequent class.

To calculate the null accuracy for our training data, we use the value_counts method on y, and set normalize to True in order to display the counts as proportions. From the results, we can see that class 0 is the most frequent class, and about 61.6% of the y values are class 0.

y.value_counts(normalize=True)
0    0.616162
1    0.383838
Name: Survived, dtype: float64

Thus the null accuracy for this problem is 61.6%, since an uninformed model, also known as the null model, could achieve that accuracy simply by predicting class 0 in all cases. In other words, this is the accuracy level that we want to outperform, otherwise the model is not providing any value. Thankfully, all of our Pipelines are outperforming null accuracy by a considerable amount.
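
As an aside, scikit-learn can also express the null model as an estimator via the DummyClassifier class, which makes it easy to cross-validate. This isn’t required here, since the value_counts calculation above tells us the same thing, but it’s a useful sanity check:

from sklearn.dummy import DummyClassifier

# always predicts the most frequent class, ignoring the features
null_model = DummyClassifier(strategy='most_frequent')
cross_val_score(null_model, X, y, cv=5, scoring='accuracy').mean()
# roughly 0.616, matching the calculation above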

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Null model: 0.616

The second way to think about this question is to ask: What is the maximum accuracy we could eventually reach? For most real problems, it’s impossible to know how accurate your model could be if you did enough tuning and tried enough models. It’s also impossible to know how accurate your model could be if you gathered more samples or more features. The main exception to this is if you’re working on a well-studied research problem, because in that case there may be a state-of-the-art benchmark that everyone is trying to surpass.

Thus in most practical circumstances, you don’t set a target accuracy. Instead, you work to improve the model until you run out of time, money, or ideas.

10.10 Q&A: Is it okay that our model includes thousands of features?

The pipe object is our Pipeline that hasn’t been tuned by grid search. Recall that you can examine an individual Pipeline step by using the named_steps attribute. In this case, we’ll select the first step, which is our ColumnTransformer.

pipe.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder())]),
                                 ['Embarked', 'Sex']),
                                ('countvectorizer', CountVectorizer(), 'Name'),
                                ('simpleimputer', SimpleImputer(),
                                 ['Age', 'Fare']),
                                ('passthrough', 'passthrough', ['Parch'])])

By passing X to its fit_transform method, we can see that the ColumnTransformer outputs 1518 feature columns. As we saw in an earlier lesson, all except 9 of those features were created from the Name column by CountVectorizer.

pipe.named_steps['columntransformer'].fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
    with 7328 stored elements in Compressed Sparse Row format>
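
If you want to verify where those columns come from, you can fit the ColumnTransformer on its own and check the size of the CountVectorizer’s vocabulary. The count in the comment is inferred from the numbers above (1518 total minus 9 from the other transformers):

ct = pipe.named_steps['columntransformer'].fit(X)
len(ct.named_transformers_['countvectorizer'].get_feature_names())
# 1509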

The cross-validated accuracy of this Pipeline is 0.811, which we’ve been calling the baseline accuracy against which other Pipelines can be compared.

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Null model: 0.616

Similarly, we can select the ColumnTransformer from our Pipeline that was tuned by grid search. Notice that the ngram_range for CountVectorizer is (1, 2), meaning CountVectorizer will create features from both unigrams and bigrams in the Name column.

grid.best_estimator_.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder())]),
                                 ['Embarked', 'Sex']),
                                ('countvectorizer',
                                 CountVectorizer(ngram_range=(1, 2)), 'Name'),
                                ('simpleimputer',
                                 SimpleImputer(add_indicator=True),
                                 ['Age', 'Fare']),
                                ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer(ngram_range=(1, 2))
['Age', 'Fare']
SimpleImputer(add_indicator=True)
['Parch']
passthrough

By using fit_transform, we can see that this ColumnTransformer outputs 3671 feature columns. Again, all except 9 of those features were created from the Name column.

grid.best_estimator_.named_steps['columntransformer'].fit_transform(X)
<891x3671 sparse matrix of type '<class 'numpy.float64'>'
    with 10191 stored elements in Compressed Sparse Row format>

The cross-validated accuracy of this Pipeline is 0.828.

grid.best_score_
0.828253091456908

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Null model: 0.616

Finally, let’s compare these two Pipelines to a Pipeline that doesn’t include the Name column at all. First, we’ll create a ColumnTransformer called “no_name_ct” that excludes Name.

no_name_ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))

As you can see, this ColumnTransformer only outputs 9 feature columns.

no_name_ct.fit_transform(X).shape
(891, 9)

Then, we’ll add no_name_ct to a Pipeline called “no_name_pipe” and cross-validate it. The accuracy is 0.783, which is significantly lower than that of the Pipelines that included the Name column. To be fair, this Pipeline hasn’t been tuned, though it’s unlikely that any amount of hyperparameter tuning would make it perform as well as the Pipelines that included the Name column.

no_name_pipe = make_pipeline(no_name_ct, logreg)
cross_val_score(no_name_pipe, X, y, cv=5, scoring='accuracy').mean()
0.7833908731404181

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Baseline excluding Name (no tuning): 0.783
  • Null model: 0.616

Here are some conclusions that we can draw from this experiment:

  • First, including the Name column in the Pipeline significantly increased the cross-validated accuracy, which means that adding those thousands of feature columns did not result in overfitting. Instead, it tells us that the Name column contains more predictive signal than noise with respect to the target.
  • More generally, this experiment tells us that having more features than samples does not necessarily result in overfitting.

What did we learn?

  • Name column contains more predictive signal than noise
  • More features than samples does not necessarily result in overfitting

It’s worth noting that there is additional tuning we could do to CountVectorizer to reduce the number of features it creates. However, there’s no way to know whether that would increase or decrease the Pipeline’s accuracy without actually trying it.
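
For example, here’s a sketch (untested on this dataset) of how you might search over CountVectorizer’s min_df parameter, which discards any term that appears in fewer than min_df rows; the names “params_vocab” and “grid_vocab” and the candidate values are just illustrative, and all other Pipeline parameters stay at their current values:

params_vocab = {'columntransformer__countvectorizer__min_df': [1, 2, 3]}
grid_vocab = GridSearchCV(pipe, params_vocab, cv=5, scoring='accuracy', n_jobs=-1)
grid_vocab.fit(X, y)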

10.11 Q&A: How do I examine the coefficients of a Pipeline?

Recall that once a grid search is complete, GridSearchCV automatically refits the Pipeline on X and y and stores it as an attribute called best_estimator_. Therefore, we can access the model coefficients by first selecting the logistic regression step and then selecting the coef_ attribute.

grid.best_estimator_.named_steps['logisticregression'].coef_
array([[ 0.56431161,  0.        , -0.08767203, ...,  0.01408723,
        -0.43713268, -0.46358519]])
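
Because the best parameters include the 'l1' penalty, many of these coefficients will be exactly zero. If you’re curious how many features the model actually kept, here’s a quick check (the exact count will depend on your data and parameters):

coefs = grid.best_estimator_.named_steps['logisticregression'].coef_[0]
(coefs != 0).sum()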

Ideally, we would also be able to get the names of the features that correspond to these coefficients by running the get_feature_names method on the ColumnTransformer step. However, get_feature_names only works if all of the underlying transformers have a get_feature_names method, and that is not the case here: the inner Pipeline of SimpleImputer and OneHotEncoder doesn’t provide one, as the traceback below confirms.

grid.best_estimator_.named_steps['columntransformer'].get_feature_names()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[53], line 1
----> 1 grid.best_estimator_.named_steps['columntransformer'].get_feature_names()

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py:371, in ColumnTransformer.get_feature_names(self)
    369         continue
    370     if not hasattr(trans, 'get_feature_names'):
--> 371         raise AttributeError("Transformer %s (type %s) does not "
    372                              "provide get_feature_names."
    373                              % (str(name), type(trans).__name__))
    374     feature_names.extend([name + "__" + f for f in
    375                           trans.get_feature_names()])
    376 return feature_names

AttributeError: Transformer pipeline (type Pipeline) does not provide get_feature_names.

Instead, as we saw previously, you would have to inspect the transformers one-by-one in order to determine the feature names.

grid.best_estimator_.named_steps['columntransformer'].transformers_
[('pipeline',
  Pipeline(steps=[('simpleimputer',
                   SimpleImputer(fill_value='missing', strategy='constant')),
                  ('onehotencoder', OneHotEncoder())]),
  ['Embarked', 'Sex']),
 ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'),
 ('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']),
 ('passthrough', 'passthrough', ['Parch'])]
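
For example, here’s a sketch of how you could assemble the feature names by hand. It assumes that the missing value indicator was added only for Age (since Fare has no missing values in this dataset), and the variable names are our own:

ct = grid.best_estimator_.named_steps['columntransformer']
ohe = ct.named_transformers_['pipeline'].named_steps['onehotencoder']
vect = ct.named_transformers_['countvectorizer']
feature_names = (list(ohe.get_feature_names()) +   # Embarked and Sex dummies
                 list(vect.get_feature_names()) +  # Name unigrams and bigrams
                 ['Age', 'Fare', 'Age_missing'] +  # imputed columns plus indicator
                 ['Parch'])                        # passthrough column

If those assumptions hold, the length of feature_names will match the number of coefficients, allowing you to pair them up.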

Note that starting in scikit-learn version 1.1, the get_feature_names_out method should work on this ColumnTransformer, since the get_feature_names_out method will be available for all transformers.
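
On those newer versions, the equivalent call would look like this (not run here, since this notebook uses an older version):

grid.best_estimator_.named_steps['columntransformer'].get_feature_names_out()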

10.12 Q&A: Should I split the dataset before tuning the Pipeline?

When we perform a grid search, we’re trying to find the parameters that maximize the cross-validation score on a dataset. Thus, we’re using the same data to accomplish two separate goals:

  • First, to choose the best parameters for the Pipeline, which are stored in the best_params_ attribute.
  • Second, to estimate the future performance of the Pipeline on new data when using these parameters, which is stored in the best_score_ attribute.

Goals of a grid search:

  • Choose the best parameters for the Pipeline
  • Estimate its performance on new data when using these parameters

grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': True,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}

grid.best_score_
0.828253091456908

Using the same data for these two separate goals biases the performance estimate toward this particular dataset, and can result in an overly optimistic score.

If your main objective is to choose the best parameters, then this process is totally fine. You’ll just have to accept that its actual performance on new data may be lower than the performance estimated by grid search.

But if you also need a realistic estimate of the Pipeline’s performance on new data, then there’s an alternative process you can use, which I’ll walk you through in this lesson.

Is it okay to use the same data for both goals?

  • Yes: If your main objective is to choose the best parameters
  • No: If you need a realistic estimate of performance on new data

To start, we’ll import the train_test_split function from the model_selection module, and use it to split the data into training and testing sets, with 75% of the data as training and 25% as testing. Note that I set the stratify parameter to y so that the class proportions will be approximately equal in the training and testing sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1, stratify=y)
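
If you’d like to confirm that the stratification worked, you can compare the proportion of survivors in the two sets, which should be nearly identical:

y_train.mean(), y_test.mean()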

Next, we’ll create a new GridSearchCV object called training_grid. When we run the grid search, we’ll only pass it the training set so that the tuning process only takes the training set into account.

training_grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
training_grid.fit(X_train, y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder())]),
                                                                         ['Embarked',
                                                                          'Sex']),
                                                                        ('countvectorizer',
                                                                         CountVectorizer(),
                                                                         'Name'),
                                                                        ('simpleimputer',
                                                                         SimpleImputer(),
                                                                         ['Age',
                                                                          'Fare']),
                                                                        (...
                                        LogisticRegression(random_state=1,
                                                           solver='liblinear'))]),
             n_jobs=-1,
             param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
                                                                              1),
                                                                             (1,
                                                                              2)],
                         'columntransformer__pipeline__onehotencoder__drop': [None,
                                                                              'first'],
                         'columntransformer__simpleimputer__add_indicator': [False,
                                                                             True],
                         'logisticregression__C': [0.1, 1, 10],
                         'logisticregression__penalty': ['l1', 'l2']},
             scoring='accuracy')

Here are the best parameters found by grid search on the training set.

training_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': 'first',
 'columntransformer__simpleimputer__add_indicator': False,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l2'}

We’re not actually interested in the best score found during the grid search. Instead, we’re going to use the best parameters found by the grid search to make predictions for the testing set, and then evaluate the accuracy of those predictions. We can do this by passing the testing set to the training_grid’s score method.

The accuracy it outputs is 0.816, which is a more realistic estimate of how the Pipeline will perform on new data, since the testing set is brand new data that the Pipeline has never seen. Keep in mind, however, that it’s a single score from a single train/test split, so there’s no way to know how precise an estimate it is.

training_grid.score(X_test, y_test)
0.8161434977578476

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Grid search (estimate for new data): 0.816
  • Baseline (no tuning): 0.811
  • Baseline excluding Name (no tuning): 0.783
  • Null model: 0.616

Now that we’ve found the best parameters for the Pipeline and estimated its likely performance on new data, our final step is to actually make predictions on new data. Before making predictions, it’s critical that we train the Pipeline on all of our data, meaning the entirety of X and y; otherwise, we’d be throwing away valuable training data.

In other words, we can’t simply use the training_grid’s predict method since it was only refit on X_train and y_train. Instead, we need to save the Pipeline with the best parameters, which we’ll call “best_pipe”, and fit it to X and y.

best_pipe = training_grid.best_estimator_
best_pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(ngram_range=(1,
                                                                               2)),
                                                  'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(C=10, random_state=1, solver='liblinear'))])

Now we can make predictions on new data.

best_pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

If you decide that you’re going to follow the process that I’ve just outlined, then there are two guidelines that are important to follow.

First, you should only use the testing set for evaluating Pipeline performance one time. If you keep tuning the Pipeline again and again, each time checking its performance on the testing set, you’re essentially tuning the Pipeline to the particulars of the testing set. At that point, it no longer functions as an independent data source and thus its performance estimates will become less reliable.

Second, it’s important that you have enough data overall in order for the training and testing sets to both be sufficiently large once the dataset has been split:

  • If the training set is too small, then the grid search won’t have enough data to find the optimal tuning parameters.
  • If the testing set is too small, then it won’t be able to provide a reliable estimate of Pipeline performance.

Both of these situations would defeat the purpose of splitting the dataset, and thus this approach is best when you have a large enough dataset. Unfortunately, it’s difficult to say in the abstract how much data is “enough”, since that depends on the particulars of the dataset and the problem.

Guidelines for using this process:

  • Only use the testing set once:
    • If used multiple times, performance estimates will become less reliable
  • You must have enough data:
    • If training set is too small, grid search won’t find the optimal parameters
    • If testing set is too small, it won’t provide a reliable performance estimate

10.13 Q&A: What is regularization?

Earlier in this chapter, we tuned the regularization parameters of logistic regression. In this lesson, I’ll briefly explain what regularization actually is.

Regularization is a process that constrains the size of a model’s coefficients in order to minimize overfitting. Overfitting is when your model fits too closely to patterns in the training data, which causes your model not to perform well when it makes predictions on new data.

Regularization minimizes overfitting by reducing the variance of the model. Thus if your model is too complex, regularization can reduce the error due to variance by more than it increases the error due to bias, resulting in a model that is more likely to generalize to new data.

In simpler terms, regularization makes your model a bit less flexible so that it’s more likely to follow the true patterns in the data and less likely to follow the noise. Regularization is especially useful when you have outliers in the training data, because regularization decreases the influence that outliers have on the model.
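
To see this coefficient shrinkage in action, here’s a minimal sketch using a synthetic dataset (not our Titanic Pipeline). In scikit-learn, smaller values of C mean stronger regularization, so the average coefficient size should shrink as C decreases:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1)
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, solver='liblinear').fit(X_demo, y_demo)
    print(C, np.abs(lr.coef_).mean())  # mean coefficient magnitude grows with C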

Brief explanation of regularization:

  • Constrains the size of model coefficients to minimize overfitting
  • Reduces the variance of an overly complex model to help the model generalize
  • Decreases model flexibility so that it follows the true patterns in the data