10  Evaluating and tuning a Pipeline

10.1 Evaluating a Pipeline with cross-validation

In this chapter, we’re going to take a deep dive into how to efficiently tune our Pipeline for maximum accuracy.

Let’s return to the topic of model evaluation.

As you might recall, we used cross-validation way back in chapter 2 to evaluate our most basic model. Since that chapter, we’ve been adding many more features without re-running cross-validation. That’s because any model evaluation procedure is highly unreliable with only 10 rows of data, so it would have been misleading to run cross-validation and compare the results. But now that we’re using the full dataset, cross-validation can once again be used.

  • First, we’ll import the cross_val_score function from the model_selection module.
  • Instead of passing a model to cross_val_score, we can actually pass our entire Pipeline.
  • We also pass it X and y.
  • And then we specify the number of cross-validation folds. Using 5 or 10 folds has been shown to be a reasonable default choice, and so we’ll choose 5 in order to minimize the computation. 5 folds has actually been the default for cross_val_score since version 0.22, but I like to include it anyway for clarity.
  • Finally, we’ll specify the evaluation metric of classification accuracy.

When we run it, cross_val_score outputs a mean accuracy of 0.811, which we’ll use as the baseline accuracy against which our future Pipelines can be compared.

from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762

Pipeline accuracy scores:

  • Baseline (no tuning): 0.811

Let’s talk about what actually happens “under the hood” when we run the cross_val_score function on a Pipeline:

  • In step 1, cross_val_score splits the data into 5 folds. 4 out of 5 folds (meaning 80% of the data) are set aside for training, and the remaining fold (meaning 20% of the data) is set aside for testing.
  • In step 2, the Pipeline’s fit method is run on the training portion. Thus the transformations specified in the ColumnTransformer are performed on the training portion, and the transformed training data is used to fit the model.
  • In step 3, the Pipeline’s predict method is run on the testing portion. Thus the transformations learned during step 2 are applied to the testing portion, the transformed testing data is passed to the fitted model, and the model makes predictions.
  • Finally, in step 4, the accuracy of those predictions is calculated.
  • Steps 2 through 4 are then repeated 4 more times, and each time a different fold is set aside as the testing portion.
  • cross_val_score thus outputs 5 accuracy scores, and we take the mean of those scores.

Steps of 5-fold cross-validation on a Pipeline:

  1. Split data into 5 folds (A, B, C, D, E)
  • ABCD is training set
  • E is testing set
  2. Pipeline is fit on training set
  • ABCD is transformed
  • Model is fit on transformed data
  3. Pipeline makes predictions on testing set
  • E is transformed (using step 2 transformations)
  • Model makes predictions on transformed data
  4. Calculate accuracy of those predictions
  5. Repeat steps 2 through 4 four more times, with a different fold as the testing set each time
  6. Calculate the mean of the 5 scores
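
To make these steps concrete, here’s a minimal sketch of roughly what cross_val_score does internally when given a classifier. It assumes pipe, X, and y are defined as in the previous chapters, and it glosses over many details of the real implementation:

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# cross_val_score uses stratified sampling to create the folds for a classifier
kf = StratifiedKFold(n_splits=5)
scores = []
for train_idx, test_idx in kf.split(X, y):
    fold_pipe = clone(pipe)                              # fresh, unfitted copy for each fold
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])  # transformers learn from the training folds only
    preds = fold_pipe.predict(X.iloc[test_idx])          # learned transformations are applied to the testing fold
    scores.append(accuracy_score(y.iloc[test_idx], preds))
np.mean(scores)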

One thing you might have noticed is that cross_val_score splits the data in step 1 before performing the transformations in steps 2 and 3. As a result, the imputation values for Age and Fare and the vocabulary for CountVectorizer are all computed 5 different times. Each time, these values are computed using the training set only, and then applied to both the training and testing sets.

Alternatively, you could imagine performing all of the transformations first, and then splitting the data. This would be much faster, since the imputation values and the vocabulary would be computed only once on the full dataset.

So why does cross_val_score split the data first? Because splitting the data before performing the transformations prevents data leakage. If the transformations were instead performed on the full dataset before splitting, information about the testing set would be “leaked” into the model training process, as the short sketch after the list below illustrates.

As we discussed in the previous chapter, this is one way that scikit-learn helps to shield you from data leakage.

Why does cross_val_score split the data first?

  • Proper cross-validation:
    • Data is split (step 1) before transformations (steps 2 and 3)
    • Imputation values and vocabulary are computed using training set only
    • Prevents data leakage
  • Improper cross-validation:
    • Transformations are performed before data is split
    • Imputation values and vocabulary are computed using full dataset
    • Causes data leakage
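
To make the contrast concrete, here’s a minimal sketch of the improper approach, shown only to illustrate the leakage (don’t actually do this). It pulls the transformer and model out of our existing Pipeline using the step names we’ll see later in this chapter:

# improper cross-validation, for illustration only
ct = pipe.named_steps['columntransformer']
model = pipe.named_steps['logisticregression']
X_transformed = ct.fit_transform(X)  # imputation values and vocabulary are computed on ALL rows
cross_val_score(model, X_transformed, y, cv=5, scoring='accuracy').mean()

Because the testing folds influenced the transformations, the resulting score would be optimistically biased.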

10.3 Tuning the model

In this lesson, we’re going to tune the model, and then in the next lesson, we’ll also tune the transformers.

For the logistic regression model, we’re going to tune two parameters:

  • The first parameter is “penalty”, which is the type of regularization. For this parameter, the default value is “l2”, and we’re going to try the values “l1” and “l2”. And just to be clear, the first character of each of those values is a lowercase “L”.
  • The second parameter is “C”, which is the amount of regularization. For this parameter, the default value is 1, and we’re going to try the values 0.1, 1, and 10.

LogisticRegression tuning parameters:

  • penalty: Type of regularization
    • ‘l1’
    • ‘l2’ (default)
  • C: Amount of regularization
    • 0.1
    • 1 (default)
    • 10

Deciding which parameters to tune and what values to try requires both research and experience, and unfortunately, it’s different for every type of model.

In order to tune a Pipeline with GridSearchCV, we need to get the names of the Pipeline steps from the named_steps attribute. We’ll tune the “logisticregression” step in this lesson, and we’ll tune the “columntransformer” step in the next lesson.

pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])

To use GridSearchCV, we need to create a dictionary in which each entry represents a parameter and the values we want to try for that parameter. We’ll start by creating an empty dictionary called params, and then we’ll add the two entries.

For each dictionary entry, the key is the Pipeline step name, followed by two underscores, followed by the parameter name. Thus the key for the first entry is “logisticregression__penalty”, and the key for the second entry is “logisticregression__C”.

Using two underscores is what allows GridSearchCV to distinguish between the step name and the parameter name. Using a single underscore would be ambiguous, since a step name or parameter name can have an underscore within it.

The value for each dictionary entry is a list of the values you want to try for that parameter. Thus the value for the first entry is a list of “l1” and “l2”, and the value for the second entry is a list of 0.1, 1, and 10.

Parameter dictionary for GridSearchCV:

  • Key: step__parameter
    • ’logisticregression__penalty’
    • ’logisticregression__C’
  • Value: List of values to try
    • [‘l1’, ‘l2’]
    • [0.1, 1, 10]

After adding the two entries, we’ll print out the params dictionary just to make sure that it looks correct.

params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]
params
{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10]}

Now that we’ve created the parameter dictionary, we can set up the grid search. We import the GridSearchCV class from the model_selection module.

from sklearn.model_selection import GridSearchCV

Next, we create an instance of GridSearchCV called grid. We pass it the Pipeline, the parameter dictionary, the number of folds, and the evaluation metric.

Finally, we run the grid search by fitting the grid object with X and y. Because our scikit-learn configuration is set to display diagrams, we see a diagram of the Pipeline now that the grid search is complete.

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder())]),
                                                                         ['Embarked',
                                                                          'Sex']),
                                                                        ('countvectorizer',
                                                                         CountVectorizer(),
                                                                         'Name'),
                                                                        ('simpleimputer',
                                                                         SimpleImputer(),
                                                                         ['Age',
                                                                          'Fare']),
                                                                        ('passthrough',
                                                                         'passthrough',
                                                                         ['Parch'])])),
                                       ('logisticregression',
                                        LogisticRegression(random_state=1,
                                                           solver='liblinear'))]),
             param_grid={'logisticregression__C': [0.1, 1, 10],
                         'logisticregression__penalty': ['l1', 'l2']},
             scoring='accuracy')

The results of the grid search are stored in an attribute called cv_results_, which we’ll convert to a DataFrame for easier viewing.

There are 6 rows because cross-validation was run 6 times, once for every possible combination of the 2 values of penalty and the 3 values of C that we specified.

results = pd.DataFrame(grid.cv_results_)
results
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.006248 0.000092 0.002339 0.000045 0.1 l1 {'logisticregression__C': 0.1, 'logisticregres... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 6
1 0.006437 0.000056 0.002359 0.000054 0.1 l2 {'logisticregression__C': 0.1, 'logisticregres... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 5
2 0.007092 0.000159 0.002399 0.000006 1 l1 {'logisticregression__C': 1, 'logisticregressi... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 2
3 0.006887 0.000091 0.002386 0.000013 1 l2 {'logisticregression__C': 1, 'logisticregressi... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 3
4 0.010375 0.001186 0.002408 0.000008 10 l1 {'logisticregression__C': 10, 'logisticregress... 0.832402 0.808989 0.808989 0.786517 0.853933 0.818166 0.023031 1
5 0.007407 0.000158 0.002385 0.000018 10 l2 {'logisticregression__C': 10, 'logisticregress... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 4

Notice the rank_test_score column, which is the last column in the DataFrame. We’ll use the DataFrame’s sort_values method to sort the rows by that column in ascending order.

By examining the mean_test_score column, we can see that the best parameter combination resulted in a cross-validated accuracy of 0.818, which is higher than our baseline accuracy of 0.811.

Scrolling to the left, we can see that the best accuracy occurred when C was 10 and penalty was l1, neither of which was the default value for that parameter.

results.sort_values('rank_test_score')
mean_fit_time std_fit_time mean_score_time std_score_time param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
4 0.010375 0.001186 0.002408 0.000008 10 l1 {'logisticregression__C': 10, 'logisticregress... 0.832402 0.808989 0.808989 0.786517 0.853933 0.818166 0.023031 1
2 0.007092 0.000159 0.002399 0.000006 1 l1 {'logisticregression__C': 1, 'logisticregressi... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 2
3 0.006887 0.000091 0.002386 0.000013 1 l2 {'logisticregression__C': 1, 'logisticregressi... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 3
5 0.007407 0.000158 0.002385 0.000018 10 l2 {'logisticregression__C': 10, 'logisticregress... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 4
1 0.006437 0.000056 0.002359 0.000054 0.1 l2 {'logisticregression__C': 0.1, 'logisticregres... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 5
0 0.006248 0.000092 0.002339 0.000045 0.1 l1 {'logisticregression__C': 0.1, 'logisticregres... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 6

Pipeline accuracy scores:

  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811

10.4 Tuning the transformers

In the previous lesson, we built a grid search for tuning model parameters and found that the best accuracy occurred when C was 10 and penalty was l1. In this lesson, we’re going to expand the search to also include transformer parameters.

When expanding the search, you might first think that we should set C to 10 and penalty to l1, and then only search the transformer parameters, since that would be the most computationally efficient approach.

However, the better approach is actually to consider all of the values for C and penalty in combination with all of the transformer parameters. That’s because we’re searching for the best combination of all parameters, and since each parameter can influence what is optimal for the other parameters, the best combination might use a C value other than 10 or a penalty value other than l1.

Options for expanding the grid search:

  • Initial idea: Set C=10 and penalty=‘l1’, then only search transformer parameters
  • Better approach: Search for best combination of C, penalty, and transformer parameters

All of that is to say that we’re going to expand the existing params dictionary to include transformer parameters. And to include transformer parameters, we first need to figure out the transformer names.

From the previous lesson, you might recall that the first step in the Pipeline is named columntransformer (all lowercase). We’ll access that step using the named_steps attribute, which then allows us to examine the named_transformers_ attribute of the ColumnTransformer.

As a side note, named_transformers_ ends with an underscore because it’s set during the fit step, whereas named_steps does not end with an underscore because it’s set when the Pipeline instance is created.

Anyway, we can now see the transformer names. We’re going to tune a single parameter from each of three transformers. Normally I might tune more parameters, but for the sake of brevity I’ll limit it to these three.

pipe.named_steps['columntransformer'].named_transformers_
{'pipeline': Pipeline(steps=[('simpleimputer',
                  SimpleImputer(fill_value='missing', strategy='constant')),
                 ('onehotencoder', OneHotEncoder())]),
 'countvectorizer': CountVectorizer(),
 'simpleimputer': SimpleImputer(),
 'passthrough': 'passthrough'}

The first parameter we’re going to tune is the “drop” parameter of OneHotEncoder, which was added to scikit-learn in version 0.21 and which I discussed earlier in the book.

To add it to the params dictionary, we specify the Pipeline step name, which is “columntransformer”. Then we specify the transformer name, which is “pipeline”. Then we specify the step name of the inner Pipeline, which is “onehotencoder”. Finally we specify the parameter name, which is “drop”. All of these components are separated by two underscores.

The parameter values we’re going to try are “None” and “first”. None is the default, and it means don’t drop any columns, whereas first means drop the first column of each feature after encoding.

OneHotEncoder tuning parameter:

  • drop: Method for dropping a column of each feature
    • None (default)
    • ‘first’
params['columntransformer__pipeline__onehotencoder__drop'] = [None, 'first']
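
If you’d like to see the effect of drop for yourself, here’s a toy example that is separate from our Pipeline (the demo DataFrame is made up for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})
OneHotEncoder(sparse=False).fit_transform(demo)
# two columns (female, male): [[0, 1], [1, 0], [0, 1]]
OneHotEncoder(drop='first', sparse=False).fit_transform(demo)
# the first category's column is dropped: [[1], [0], [1]]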

If you’re ever unsure how to specify a parameter for a grid search, you can see all of the Pipeline’s parameters by using the get_params method followed by the keys method. I’m converting the output to a list for easier readability. This list is also useful if you prefer to copy and paste the parameter names rather than typing them.

As you can see, there are many transformer and model parameters that we’re not tuning, many of which could be useful to tune given enough time and computational resources.

list(pipe.get_params().keys())
['memory',
 'steps',
 'verbose',
 'columntransformer',
 'logisticregression',
 'columntransformer__n_jobs',
 'columntransformer__remainder',
 'columntransformer__sparse_threshold',
 'columntransformer__transformer_weights',
 'columntransformer__transformers',
 'columntransformer__verbose',
 'columntransformer__pipeline',
 'columntransformer__countvectorizer',
 'columntransformer__simpleimputer',
 'columntransformer__passthrough',
 'columntransformer__pipeline__memory',
 'columntransformer__pipeline__steps',
 'columntransformer__pipeline__verbose',
 'columntransformer__pipeline__simpleimputer',
 'columntransformer__pipeline__onehotencoder',
 'columntransformer__pipeline__simpleimputer__add_indicator',
 'columntransformer__pipeline__simpleimputer__copy',
 'columntransformer__pipeline__simpleimputer__fill_value',
 'columntransformer__pipeline__simpleimputer__missing_values',
 'columntransformer__pipeline__simpleimputer__strategy',
 'columntransformer__pipeline__simpleimputer__verbose',
 'columntransformer__pipeline__onehotencoder__categories',
 'columntransformer__pipeline__onehotencoder__drop',
 'columntransformer__pipeline__onehotencoder__dtype',
 'columntransformer__pipeline__onehotencoder__handle_unknown',
 'columntransformer__pipeline__onehotencoder__sparse',
 'columntransformer__countvectorizer__analyzer',
 'columntransformer__countvectorizer__binary',
 'columntransformer__countvectorizer__decode_error',
 'columntransformer__countvectorizer__dtype',
 'columntransformer__countvectorizer__encoding',
 'columntransformer__countvectorizer__input',
 'columntransformer__countvectorizer__lowercase',
 'columntransformer__countvectorizer__max_df',
 'columntransformer__countvectorizer__max_features',
 'columntransformer__countvectorizer__min_df',
 'columntransformer__countvectorizer__ngram_range',
 'columntransformer__countvectorizer__preprocessor',
 'columntransformer__countvectorizer__stop_words',
 'columntransformer__countvectorizer__strip_accents',
 'columntransformer__countvectorizer__token_pattern',
 'columntransformer__countvectorizer__tokenizer',
 'columntransformer__countvectorizer__vocabulary',
 'columntransformer__simpleimputer__add_indicator',
 'columntransformer__simpleimputer__copy',
 'columntransformer__simpleimputer__fill_value',
 'columntransformer__simpleimputer__missing_values',
 'columntransformer__simpleimputer__strategy',
 'columntransformer__simpleimputer__verbose',
 'logisticregression__C',
 'logisticregression__class_weight',
 'logisticregression__dual',
 'logisticregression__fit_intercept',
 'logisticregression__intercept_scaling',
 'logisticregression__l1_ratio',
 'logisticregression__max_iter',
 'logisticregression__multi_class',
 'logisticregression__n_jobs',
 'logisticregression__penalty',
 'logisticregression__random_state',
 'logisticregression__solver',
 'logisticregression__tol',
 'logisticregression__verbose',
 'logisticregression__warm_start']

Moving along, the second parameter we’re going to tune is the “ngram_range” parameter of CountVectorizer.

Again, we specify the Pipeline step name, then the transformer name, and then the parameter name. Note that these three components are separated by double underscores, but there’s just a single underscore within ngram_range because that’s part of the parameter name.

The parameter values we’re going to try are the tuples (1, 1) and (1, 2). (1, 1) is the default, and it creates a single feature from each word. (1, 2) creates features from both single words, known as unigrams, and word pairs, known as bigrams.

CountVectorizer tuning parameter:

  • ngram_range: Selection of word n-grams to be extracted as features
    • (1, 1) (default)
    • (1, 2)
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
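
If you’re unsure what n-grams look like, here’s a toy example that is separate from our Pipeline, using a single sample Name value:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['Braund, Mr. Owen Harris']
CountVectorizer(ngram_range=(1, 1)).fit(docs).get_feature_names()
# ['braund', 'harris', 'mr', 'owen']
CountVectorizer(ngram_range=(1, 2)).fit(docs).get_feature_names()
# ['braund', 'braund mr', 'harris', 'mr', 'mr owen', 'owen', 'owen harris']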

The third parameter we’re going to tune is the “add_indicator” parameter of SimpleImputer, which was added to scikit-learn in version 0.21 and which I discussed earlier in the book.

Once again, we specify the Pipeline step name, then the transformer name, and then the parameter name.

The parameter values we’re going to try are “False” and “True”. False is the default, and it does not add a missing indicator column, whereas True does add a missing indicator column.

SimpleImputer tuning parameter:

  • add_indicator: Option to add a missing indicator column
    • False (default)
    • True
params['columntransformer__simpleimputer__add_indicator'] = [False, True]
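
Here’s a toy example of add_indicator that is separate from our Pipeline, using a single made-up column with one missing value:

import numpy as np
from sklearn.impute import SimpleImputer

demo = np.array([[10.0], [np.nan], [30.0]])
SimpleImputer().fit_transform(demo)
# [[10.], [20.], [30.]]  (the missing value is replaced by the mean)
SimpleImputer(add_indicator=True).fit_transform(demo)
# [[10., 0.], [20., 1.], [30., 0.]]  (the second column flags which values were missing)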

Before running the grid search, we’ll print out the params dictionary. By multiplying 2 by 3 by 2 by 2 by 2, we can calculate that there are now 48 parameter combinations, and thus the grid search will take about 8 times longer than the previous search.
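
If the longer search becomes a bottleneck for you, GridSearchCV accepts an n_jobs parameter that runs the fits in parallel across your CPU cores. We won’t use it below, but the change would be a single argument:

# optional: parallelize the search across all CPU cores (the scores are unchanged)
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)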

As an aside, if we had used the Pipeline and ColumnTransformer classes instead of the make_pipeline and make_column_transformer functions, we could have customized the step names and transformer names, which would have made these parameter specifications a bit easier to read and write. You can revisit the earlier lessons on those classes if you need a refresher on that topic.

params
{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10],
 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True]}

Anyway, next we’ll recreate the grid object with the new params dictionary, and then we’ll run the grid search.

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder())]),
                                                                         ['Embarked',
                                                                          'Sex']),
                                                                        ('countvectorizer',
                                                                         CountVectorizer(),
                                                                         'Name'),
                                                                        ('simpleimputer',
                                                                         SimpleImputer(),
                                                                         ['Age',
                                                                          'Fare']),
                                                                        (...
                                        LogisticRegression(random_state=1,
                                                           solver='liblinear'))]),
             param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
                                                                              1),
                                                                             (1,
                                                                              2)],
                         'columntransformer__pipeline__onehotencoder__drop': [None,
                                                                              'first'],
                         'columntransformer__simpleimputer__add_indicator': [False,
                                                                             True],
                         'logisticregression__C': [0.1, 1, 10],
                         'logisticregression__penalty': ['l1', 'l2']},
             scoring='accuracy')

Now that the search is complete, we’ll convert the search results into a DataFrame and sort it by the rank_test_score column.

As you can see from the mean_test_score column, the best accuracy of 0.828 is an improvement over the previous grid search, which had an accuracy of 0.818. Keep in mind that your exact results may differ based on your scikit-learn version along with other factors. However, there’s no randomness involved when you set cv to an integer, and so your results will be the same every time you run this grid search.

results = pd.DataFrame(grid.cv_results_)
results.sort_values('rank_test_score')
mean_fit_time std_fit_time mean_score_time std_score_time param_columntransformer__countvectorizer__ngram_range param_columntransformer__pipeline__onehotencoder__drop param_columntransformer__simpleimputer__add_indicator param_logisticregression__C param_logisticregression__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
34 0.013390 0.000783 0.002698 0.000020 (1, 2) None True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.854749 0.820225 0.825843 0.780899 0.859551 0.828253 0.028264 1
28 0.013452 0.001096 0.002653 0.000021 (1, 2) None False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.849162 0.820225 0.814607 0.780899 0.859551 0.824889 0.027760 2
40 0.017350 0.001327 0.002691 0.000016 (1, 2) first False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.849162 0.825843 0.814607 0.780899 0.853933 0.824889 0.026361 2
46 0.017452 0.002011 0.002740 0.000021 (1, 2) first True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.843575 0.820225 0.814607 0.780899 0.853933 0.822648 0.025417 4
16 0.014869 0.000915 0.004289 0.002258 (1, 1) first False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.837989 0.803371 0.825843 0.786517 0.848315 0.820407 0.022611 5
22 0.012250 0.001199 0.002482 0.000023 (1, 1) first True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.826816 0.808989 0.814607 0.786517 0.859551 0.819296 0.023999 6
4 0.010636 0.001193 0.002491 0.000045 (1, 1) None False 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.832402 0.808989 0.808989 0.786517 0.853933 0.818166 0.023031 7
10 0.009969 0.000523 0.002513 0.000014 (1, 1) None True 10 l1 {'columntransformer__countvectorizer__ngram_ra... 0.815642 0.808989 0.820225 0.786517 0.853933 0.817061 0.021770 8
20 0.007983 0.000461 0.002600 0.000072 (1, 1) first True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.810056 0.820225 0.797753 0.792135 0.853933 0.814820 0.021852 9
2 0.011787 0.008637 0.002481 0.000022 (1, 1) None False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.815642 0.820225 0.797753 0.792135 0.848315 0.814814 0.019787 10
44 0.010098 0.000371 0.002742 0.000050 (1, 2) first True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.853933 0.813703 0.022207 11
47 0.010278 0.000185 0.002679 0.000006 (1, 2) first True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.820225 0.820225 0.780899 0.853933 0.812598 0.026265 12
8 0.007782 0.000613 0.002519 0.000013 (1, 1) None True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.786517 0.792135 0.859551 0.812579 0.026183 13
38 0.009798 0.000299 0.002631 0.000014 (1, 2) first False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.848315 0.812579 0.020194 14
14 0.008625 0.000364 0.003364 0.000394 (1, 1) first False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.797753 0.792135 0.848315 0.812579 0.020194 14
26 0.009654 0.000203 0.002621 0.000008 (1, 2) None False 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.815642 0.820225 0.786517 0.792135 0.848315 0.812567 0.022100 16
11 0.007627 0.000085 0.002493 0.000017 (1, 1) None True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.792135 0.870787 0.811481 0.031065 17
21 0.007036 0.000065 0.002451 0.000020 (1, 1) first True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.820225 0.803371 0.786517 0.853933 0.811468 0.024076 18
3 0.007053 0.000083 0.002475 0.000059 (1, 1) None False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.825843 0.803371 0.786517 0.842697 0.811462 0.020141 19
23 0.007419 0.000092 0.002446 0.000020 (1, 1) first True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.776536 0.803371 0.808989 0.792135 0.870787 0.810363 0.032182 20
9 0.007227 0.000080 0.002494 0.000009 (1, 1) None True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.825843 0.797753 0.786517 0.848315 0.810345 0.023233 21
15 0.007868 0.000300 0.002760 0.000241 (1, 1) first False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.803371 0.786517 0.837079 0.810332 0.017107 22
32 0.009945 0.000132 0.002670 0.000006 (1, 2) None True 1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.804469 0.820225 0.780899 0.792135 0.853933 0.810332 0.025419 22
17 0.008091 0.000432 0.002767 0.000232 (1, 1) first False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 24
35 0.010225 0.000123 0.002676 0.000017 (1, 2) None True 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.820225 0.814607 0.780899 0.848315 0.809234 0.025357 24
5 0.007588 0.000143 0.002450 0.000012 (1, 1) None False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.808989 0.797753 0.853933 0.809234 0.024080 24
29 0.010094 0.000113 0.002614 0.000014 (1, 2) None False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.814607 0.820225 0.780899 0.837079 0.808104 0.020904 27
45 0.009662 0.000085 0.002686 0.000014 (1, 2) first True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.814607 0.797753 0.786517 0.848315 0.808097 0.022143 28
41 0.010140 0.000076 0.002633 0.000008 (1, 2) first False 10 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.814607 0.820225 0.780899 0.831461 0.806980 0.019414 29
39 0.009537 0.000114 0.002621 0.000010 (1, 2) first False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.808989 0.797753 0.786517 0.837079 0.805844 0.017164 30
27 0.009532 0.000076 0.002620 0.000013 (1, 2) None False 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.814607 0.792135 0.786517 0.837079 0.805844 0.018234 30
33 0.009623 0.000093 0.002661 0.000012 (1, 2) None True 1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.814607 0.792135 0.786517 0.848315 0.804739 0.024489 32
31 0.009042 0.000162 0.002697 0.000043 (1, 2) None True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.769663 0.786517 0.814607 0.793491 0.015231 33
7 0.006694 0.000080 0.002536 0.000061 (1, 1) None True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.803371 0.764045 0.786517 0.814607 0.793484 0.017253 34
19 0.007016 0.000145 0.002575 0.000145 (1, 1) first True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.764045 0.780899 0.814607 0.791243 0.017572 35
43 0.010422 0.002550 0.002856 0.000219 (1, 2) first True 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.797753 0.764045 0.780899 0.808989 0.790114 0.015849 36
37 0.008877 0.000029 0.002641 0.000018 (1, 2) first False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.764045 0.780899 0.808989 0.789003 0.016100 37
25 0.008901 0.000061 0.002608 0.000011 (1, 2) None False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.793296 0.803371 0.764045 0.775281 0.808989 0.788996 0.016944 38
1 0.006655 0.000039 0.002452 0.000011 (1, 1) None False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.798883 0.803371 0.764045 0.775281 0.803371 0.788990 0.016258 39
13 0.007833 0.000708 0.003192 0.000419 (1, 1) first False 0.1 l2 {'columntransformer__countvectorizer__ngram_ra... 0.782123 0.803371 0.764045 0.780899 0.808989 0.787885 0.016343 40
0 0.006412 0.000095 0.002483 0.000073 (1, 1) None False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
30 0.011053 0.003787 0.002923 0.000274 (1, 2) None True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
24 0.008545 0.000088 0.002621 0.000013 (1, 2) None False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
6 0.006543 0.000118 0.002510 0.000011 (1, 1) None True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.787709 0.803371 0.769663 0.758427 0.797753 0.783385 0.016946 41
36 0.008752 0.000158 0.002635 0.000023 (1, 2) first False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
42 0.008697 0.000146 0.002683 0.000007 (1, 2) first True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
12 0.009742 0.004551 0.006101 0.005669 (1, 1) first False 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45
18 0.006817 0.000245 0.002593 0.000095 (1, 1) first True 0.1 l1 {'columntransformer__countvectorizer__ngram_ra... 0.770950 0.797753 0.769663 0.758427 0.792135 0.777785 0.014779 45

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811

Rather than always examining the results DataFrame, we can actually just access the single best score and the set of parameters that resulted in that score via attributes of the grid object.

It’s worth noting that only the drop parameter is using its default value, whereas the other four parameters are not using their default values.

grid.best_score_
0.828253091456908
grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': True,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}

It’s hard to say whether this truly is the best set of parameters, because some of the differences in accuracy between parameter combinations may be due to chance, based on which samples happened to appear in each fold. That’s just a limitation of basic cross-validation, and so all we can say with confidence is that this is a good combination of parameters.

10.5 Using the best Pipeline to make predictions

Now that we’ve tuned both the model parameters and the transformer parameters, we want to use those parameters with the Pipeline when making predictions.

GridSearchCV actually makes this very easy. After locating the best set of parameters, it automatically refits the Pipeline on X and y using those parameters, and it stores the fitted Pipeline in an attribute called best_estimator_. And as you can see, that attribute is indeed a Pipeline object.

type(grid.best_estimator_)
sklearn.pipeline.Pipeline

If we print out the best_estimator_ attribute and click on the components, we can see that the parameters of this Pipeline match the best parameter set we located in the previous lesson.

grid.best_estimator_
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(ngram_range=(1,
                                                                               2)),
                                                  'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(add_indicator=True),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(C=10, penalty='l1', random_state=1,
                                    solver='liblinear'))])

In order to make predictions using this Pipeline, all we have to do is run the grid object’s predict method, which calls the predict method of the best_estimator_, and pass it X_new.

grid.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

I just want to emphasize that this Pipeline, with the best set of parameters, was automatically refit to the entire dataset. You should always train your model on the entire dataset, meaning all samples for which you know the target value, before using it to make predictions on new data. Otherwise, you’re throwing away valuable training data.

10.6 Q&A: How do I save the best Pipeline for future use?

After completing a grid search, you may want to save the Pipeline with the best set of parameters so that you can use it to make predictions later.

As we saw in the previous lesson, the Pipeline with the best set of parameters is stored as an attribute of the GridSearchCV object called best_estimator_, so this is the object that we want to save.

type(grid.best_estimator_)
sklearn.pipeline.Pipeline

You can save a Pipeline to a file using pickle, which is part of the Python standard library.

import pickle

We’ll use pickle’s dump function to save the Pipeline to a file called “pipe.pickle”.

with open('pipe.pickle', 'wb') as f:
    pickle.dump(grid.best_estimator_, f)

Then we can use pickle’s load function to load the Pipeline from the pipe.pickle file into an object called pipe_from_pickle.

with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)

pipe_from_pickle is identical to grid.best_estimator_, and so when we use pipe_from_pickle to make predictions, these predictions are identical to the predictions made by the grid object.

pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

One alternative to pickle is joblib, which is usually more efficient than pickle for scikit-learn objects. Although it’s not part of the Python standard library, joblib has been a dependency of scikit-learn since version 0.21.

import joblib

Just like pickle, you use joblib’s dump function to save the Pipeline to a file, which we’ll call “pipe.joblib”.

joblib.dump(grid.best_estimator_, 'pipe.joblib')
['pipe.joblib']

Then, we’ll use the load function to load the Pipeline from the file into an object called pipe_from_joblib.

pipe_from_joblib = joblib.load('pipe.joblib')

Finally, we’ll use pipe_from_joblib to make predictions.

pipe_from_joblib.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

To be clear, pickle and joblib are not limited to Pipelines and can be used with other scikit-learn objects, such as a standalone model object that is not inside a Pipeline.

There are a couple warnings to keep in mind when working with pickle and joblib objects:

  • First, the objects may be version-specific and architecture-specific. As such, you should only load them into an identical environment, meaning the same versions of scikit-learn and its dependencies, running on the same computing architecture. (A lightweight safeguard is sketched after the list below.)
  • Second, these objects can be poisoned with malicious code, and so you should only load objects from a trusted source.

Warnings for pickle and joblib objects:

  • May be version-specific and architecture-specific
  • Can be poisoned with malicious code
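
One lightweight safeguard against the version problem, which is my own suggestion rather than anything built into scikit-learn, is to save the library version alongside the Pipeline and check it at load time:

import sklearn
import joblib

# bundle the Pipeline with the version that produced it
joblib.dump({'pipeline': grid.best_estimator_,
             'sklearn_version': sklearn.__version__}, 'pipe_with_version.joblib')

bundle = joblib.load('pipe_with_version.joblib')
if bundle['sklearn_version'] != sklearn.__version__:
    print('Warning: Pipeline was saved with scikit-learn', bundle['sklearn_version'])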

Finally, it’s worth mentioning that there are alternatives to pickle and joblib such as ONNX and PMML. These formats don’t capture the full model object, but instead save a representation that can be used to make predictions. One major benefit of these formats is that they are neither environment-specific nor architecture-specific.

Alternatives to pickle and joblib:

  • Examples: ONNX, PMML
  • Save a model representation for making predictions
  • Work across environments and architectures

10.9 Q&A: What’s the target accuracy we are trying to achieve?

When you’re building and tuning a modeling Pipeline, it’s natural to wonder how you’ll know when you’re done. In other words, how good of a model is “good enough”? There are three ways that I tend to think about this question.

When is a model “good enough”?

  • Useful model: Outperforms null accuracy
  • Best possible model: Usually impossible to know the theoretical maximum accuracy
  • Practical model: Continue improving until you run out of resources

The first way is to ask the question: What is the minimum accuracy that we need to achieve for our model to be considered useful? In most cases, you want your model to at least outperform null accuracy, which is the accuracy you could achieve by always predicting the most frequent class.

To calculate the null accuracy for our training data, we use the value_counts method on y, and set normalize to True in order to display the counts as proportions. From the results, we can see that class 0 is the most frequent class, and about 61.6% of the y values are class 0.

y.value_counts(normalize=True)
0    0.616162
1    0.383838
Name: Survived, dtype: float64

Thus the null accuracy for this problem is 61.6%, since an uninformed model, also known as the null model, could achieve that accuracy simply by predicting class 0 in all cases. In other words, this is the accuracy level that we want to outperform, otherwise the model is not providing any value. Thankfully, all of our Pipelines are outperforming null accuracy by a considerable amount.
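
As an aside, scikit-learn can also express the null model as an estimator via the DummyClassifier class, which makes it easy to cross-validate. This isn’t required here, since the value_counts calculation above tells us the same thing, but it’s a useful sanity check:

from sklearn.dummy import DummyClassifier

# always predicts the most frequent class, ignoring the features
null_model = DummyClassifier(strategy='most_frequent')
cross_val_score(null_model, X, y, cv=5, scoring='accuracy').mean()
# roughly 0.616, matching the calculation above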

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Null model: 0.616

The second way to think about this question is to ask: What is the maximum accuracy we could eventually reach? For most real problems, it’s impossible to know how accurate your model could be if you did enough tuning and tried enough models. It’s also impossible to know how accurate your model could be if you gathered more samples or more features. The main exception to this is if you’re working on a well-studied research problem, because in that case there may be a state-of-the-art benchmark that everyone is trying to surpass.

Thus in most practical circumstances, you don’t set a target accuracy. Instead, you work to improve the model until you run out of time, money, or ideas.

10.10 Q&A: Is it okay that our model includes thousands of features?

The pipe object is our Pipeline that hasn’t been tuned by grid search. Recall that you can examine an individual Pipeline step by using the named_steps attribute. In this case, we’ll select the first step, which is our ColumnTransformer.

pipe.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder())]),
                                 ['Embarked', 'Sex']),
                                ('countvectorizer', CountVectorizer(), 'Name'),
                                ('simpleimputer', SimpleImputer(),
                                 ['Age', 'Fare']),
                                ('passthrough', 'passthrough', ['Parch'])])

By passing X to its fit_transform method, we can see that the ColumnTransformer outputs 1518 feature columns. As we saw in an earlier lesson, all except 9 of those features were created from the Name column by CountVectorizer.

pipe.named_steps['columntransformer'].fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
    with 7328 stored elements in Compressed Sparse Row format>
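
If you want to verify where those columns come from, you can fit the ColumnTransformer on its own and check the size of the CountVectorizer’s vocabulary. The count in the comment is inferred from the numbers above (1518 total minus 9 from the other transformers):

ct = pipe.named_steps['columntransformer'].fit(X)
len(ct.named_transformers_['countvectorizer'].get_feature_names())
# 1509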

The cross-validated accuracy of this Pipeline is 0.811, which we’ve been calling the baseline accuracy against which other Pipelines can be compared.

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Null model: 0.616

Similarly, we can select the ColumnTransformer from our Pipeline that was tuned by grid search. Notice that the ngram_range for CountVectorizer is (1, 2), meaning CountVectorizer will create features from both unigrams and bigrams in the Name column.

grid.best_estimator_.named_steps['columntransformer']
ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder())]),
                                 ['Embarked', 'Sex']),
                                ('countvectorizer',
                                 CountVectorizer(ngram_range=(1, 2)), 'Name'),
                                ('simpleimputer',
                                 SimpleImputer(add_indicator=True),
                                 ['Age', 'Fare']),
                                ('passthrough', 'passthrough', ['Parch'])])
['Embarked', 'Sex']
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder()
Name
CountVectorizer(ngram_range=(1, 2))
['Age', 'Fare']
SimpleImputer(add_indicator=True)
['Parch']
passthrough

By using fit_transform, we can see that this ColumnTransformer outputs 3671 feature columns. Again, all except 9 of those features were created from the Name column.

grid.best_estimator_.named_steps['columntransformer'].fit_transform(X)
<891x3671 sparse matrix of type '<class 'numpy.float64'>'
    with 10191 stored elements in Compressed Sparse Row format>

The cross-validated accuracy of this Pipeline is 0.828.

grid.best_score_
0.828253091456908

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Null model: 0.616

Finally, let’s compare these two Pipelines to a Pipeline that doesn’t include the Name column at all. First, we’ll create a ColumnTransformer called “no_name_ct” that excludes Name.

no_name_ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))

As you can see, this ColumnTransformer only outputs 9 feature columns.

no_name_ct.fit_transform(X).shape
(891, 9)

Then, we’ll add no_name_ct to a Pipeline called “no_name_pipe” and cross-validate it. The accuracy is 0.783, which is significantly lower than that of the Pipelines that included the Name column. To be fair, this Pipeline hasn’t been tuned, though it’s unlikely that any amount of hyperparameter tuning would make it perform as well as the Pipelines that included the Name column.

no_name_pipe = make_pipeline(no_name_ct, logreg)
cross_val_score(no_name_pipe, X, y, cv=5, scoring='accuracy').mean()
0.7833908731404181

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Baseline (no tuning): 0.811
  • Baseline excluding Name (no tuning): 0.783
  • Null model: 0.616

Here are some conclusions that we can draw from this experiment:

  • First, including the Name column in the Pipeline significantly increased the cross-validated accuracy, which means that adding those thousands of feature columns did not result in overfitting. Instead, it tells us that the Name column contains more predictive signal than noise with respect to the target.
  • More generally, this experiment tells us that having more features than samples does not necessarily result in overfitting.

What did we learn?

  • Name column contains more predictive signal than noise
  • More features than samples does not necessarily result in overfitting

It’s worth noting that there is additional tuning we could do to CountVectorizer to reduce the number of features it creates. However, there’s no way to know whether that would increase or decrease the Pipeline’s accuracy without actually trying it.
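
For example, here’s a sketch (untested on this dataset) of how you might search over CountVectorizer’s min_df parameter, which discards any term that appears in fewer than min_df rows; the names “params_vocab” and “grid_vocab” and the candidate values are just illustrative, and all other Pipeline parameters stay at their current values:

params_vocab = {'columntransformer__countvectorizer__min_df': [1, 2, 3]}
grid_vocab = GridSearchCV(pipe, params_vocab, cv=5, scoring='accuracy', n_jobs=-1)
grid_vocab.fit(X, y)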

10.11 Q&A: How do I examine the coefficients of a Pipeline?

Recall that once a grid search is complete, GridSearchCV automatically refits the Pipeline on X and y and stores it as an attribute called best_estimator_. Therefore, we can access the model coefficients by first selecting the logistic regression step and then selecting the coef_ attribute.

grid.best_estimator_.named_steps['logisticregression'].coef_
array([[ 0.56431161,  0.        , -0.08767203, ...,  0.01408723,
        -0.43713268, -0.46358519]])
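
Because the best parameters include the 'l1' penalty, many of these coefficients will be exactly zero. If you’re curious how many features the model actually kept, here’s a quick check (the exact count will depend on your data and parameters):

coefs = grid.best_estimator_.named_steps['logisticregression'].coef_[0]
(coefs != 0).sum()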

Ideally, we would also be able to get the names of the features that correspond to these coefficients by running the get_feature_names method on the ColumnTransformer step. However, get_feature_names only works if all of the underlying transformers have a get_feature_names method, and that is not the case here: the inner Pipeline of SimpleImputer and OneHotEncoder doesn’t provide one, as the traceback below confirms.

grid.best_estimator_.named_steps['columntransformer'].get_feature_names()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[53], line 1
----> 1 grid.best_estimator_.named_steps['columntransformer'].get_feature_names()

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py:371, in ColumnTransformer.get_feature_names(self)
    369         continue
    370     if not hasattr(trans, 'get_feature_names'):
--> 371         raise AttributeError("Transformer %s (type %s) does not "
    372                              "provide get_feature_names."
    373                              % (str(name), type(trans).__name__))
    374     feature_names.extend([name + "__" + f for f in
    375                           trans.get_feature_names()])
    376 return feature_names

AttributeError: Transformer pipeline (type Pipeline) does not provide get_feature_names.

Instead, as we saw previously, you would have to inspect the transformers one-by-one in order to determine the feature names.

grid.best_estimator_.named_steps['columntransformer'].transformers_
[('pipeline',
  Pipeline(steps=[('simpleimputer',
                   SimpleImputer(fill_value='missing', strategy='constant')),
                  ('onehotencoder', OneHotEncoder())]),
  ['Embarked', 'Sex']),
 ('countvectorizer', CountVectorizer(ngram_range=(1, 2)), 'Name'),
 ('simpleimputer', SimpleImputer(add_indicator=True), ['Age', 'Fare']),
 ('passthrough', 'passthrough', ['Parch'])]
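
For example, here’s a sketch of how you could assemble the feature names by hand. It assumes that the missing value indicator was added only for Age (since Fare has no missing values in this dataset), and the variable names are our own:

ct = grid.best_estimator_.named_steps['columntransformer']
ohe = ct.named_transformers_['pipeline'].named_steps['onehotencoder']
vect = ct.named_transformers_['countvectorizer']
feature_names = (list(ohe.get_feature_names()) +   # Embarked and Sex dummies
                 list(vect.get_feature_names()) +  # Name unigrams and bigrams
                 ['Age', 'Fare', 'Age_missing'] +  # imputed columns plus indicator
                 ['Parch'])                        # passthrough column

If those assumptions hold, the length of feature_names will match the number of coefficients, allowing you to pair them up.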

Note that starting in scikit-learn version 1.1, the get_feature_names_out method should work on this ColumnTransformer, since the get_feature_names_out method will be available for all transformers.
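
On those newer versions, the equivalent call would look like this (not run here, since this notebook uses an older version):

grid.best_estimator_.named_steps['columntransformer'].get_feature_names_out()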

10.12 Q&A: Should I split the dataset before tuning the Pipeline?

When we perform a grid search, we’re trying to find the parameters that maximize the cross-validation score on a dataset. Thus, we’re using the same data to accomplish two separate goals:

  • First, to choose the best parameters for the Pipeline, which are stored in the best_params_ attribute.
  • Second, to estimate the future performance of the Pipeline on new data when using these parameters, which is stored in the best_score_ attribute.

Goals of a grid search:

  • Choose the best parameters for the Pipeline
  • Estimate its performance on new data when using these parameters

grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': True,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}

grid.best_score_
0.828253091456908

Using the same data for these two separate goals biases the performance estimate toward this particular dataset, and can result in an overly optimistic score.

If your main objective is to choose the best parameters, then this process is totally fine. You’ll just have to accept that its actual performance on new data may be lower than the performance estimated by grid search.

But if you also need a realistic estimate of the Pipeline’s performance on new data, then there’s an alternative process you can use, which I’ll walk you through in this lesson.

Is it okay to use the same data for both goals?

  • Yes: If your main objective is to choose the best parameters
  • No: If you need a realistic estimate of performance on new data

To start, we’ll import the train_test_split function from the model_selection module, and use it to split the data into training and testing sets, with 75% of the data as training and 25% as testing. Note that I set the stratify parameter to y so that the class proportions will be approximately equal in the training and testing sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1, stratify=y)
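
If you’d like to confirm that the stratification worked, you can compare the proportion of survivors in the two sets, which should be nearly identical:

y_train.mean(), y_test.mean()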

Next, we’ll create a new GridSearchCV object called training_grid. When we run the grid search, we’ll only pass it the training set so that the tuning process only takes the training set into account.

training_grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy', n_jobs=-1)
training_grid.fit(X_train, y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder())]),
                                                                         ['Embarked',
                                                                          'Sex']),
                                                                        ('countvectorizer',
                                                                         CountVectorizer(),
                                                                         'Name'),
                                                                        ('simpleimputer',
                                                                         SimpleImputer(),
                                                                         ['Age',
                                                                          'Fare']),
                                                                        (...
                                        LogisticRegression(random_state=1,
                                                           solver='liblinear'))]),
             n_jobs=-1,
             param_grid={'columntransformer__countvectorizer__ngram_range': [(1,
                                                                              1),
                                                                             (1,
                                                                              2)],
                         'columntransformer__pipeline__onehotencoder__drop': [None,
                                                                              'first'],
                         'columntransformer__simpleimputer__add_indicator': [False,
                                                                             True],
                         'logisticregression__C': [0.1, 1, 10],
                         'logisticregression__penalty': ['l1', 'l2']},
             scoring='accuracy')

Here are the best parameters found by grid search on the training set.

training_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': 'first',
 'columntransformer__simpleimputer__add_indicator': False,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l2'}

We’re not actually interested in the best score found during the grid search. Instead, we’re going to use the best parameters found by the grid search to make predictions for the testing set, and then evaluate the accuracy of those predictions. We can do this by passing the testing set to the training_grid’s score method.

The accuracy it outputs is 0.816, which is a more realistic estimate of how the Pipeline will perform on new data, since the testing set is brand new data that the Pipeline has never seen. Keep in mind, however, that it’s a single score from a single train/test split, so there’s no way to know how precise an estimate it is.

training_grid.score(X_test, y_test)
0.8161434977578476

Pipeline accuracy scores:

  • Grid search (5 parameters): 0.828
  • Randomized search (more C values): 0.827
  • Grid search (2 parameters): 0.818
  • Grid search (estimate for new data): 0.816
  • Baseline (no tuning): 0.811
  • Baseline excluding Name (no tuning): 0.783
  • Null model: 0.616

Now that we’ve found the best parameters for the Pipeline and estimated its likely performance on new data, our final step is to actually make predictions on new data. Before making predictions, it’s critical that we train the Pipeline on all of our data, meaning the entirety of X and y; otherwise, we’d be throwing away valuable training data.

In other words, we can’t simply use the training_grid’s predict method since it was only refit on X_train and y_train. Instead, we need to save the Pipeline with the best parameters, which we’ll call “best_pipe”, and fit it to X and y.

best_pipe = training_grid.best_estimator_
best_pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(ngram_range=(1,
                                                                               2)),
                                                  'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(C=10, random_state=1, solver='liblinear'))])

Now we can make predictions on new data.

best_pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

If you decide that you’re going to follow the process that I’ve just outlined, then there are two guidelines that are important to follow.

First, you should only use the testing set for evaluating Pipeline performance one time. If you keep tuning the Pipeline again and again, each time checking its performance on the testing set, you’re essentially tuning the Pipeline to the particulars of the testing set. At that point, it no longer functions as an independent data source and thus its performance estimates will become less reliable.

Second, it’s important that you have enough data overall in order for the training and testing sets to both be sufficiently large once the dataset has been split:

  • If the training set is too small, then the grid search won’t have enough data to find the optimal tuning parameters.
  • If the testing set is too small, then it won’t be able to provide a reliable estimate of Pipeline performance.

Both of these situations would defeat the purpose of splitting the dataset, and thus this approach is best when you have a large enough dataset. Unfortunately, it’s difficult to say in the abstract how much data is “enough”, since that depends on the particulars of the dataset and the problem.

Guidelines for using this process:

  • Only use the testing set once:
    • If used multiple times, performance estimates will become less reliable
  • You must have enough data:
    • If training set is too small, grid search won’t find the optimal parameters
    • If testing set is too small, it won’t provide a reliable performance estimate

10.13 Q&A: What is regularization?

Earlier in this chapter, we tuned the regularization parameters of logistic regression. In this lesson, I’ll briefly explain what regularization actually is.

Regularization is a process that constrains the size of a model’s coefficients in order to minimize overfitting. Overfitting is when your model fits too closely to patterns in the training data, which causes your model not to perform well when it makes predictions on new data.

Regularization minimizes overfitting by reducing the variance of the model. Thus if your model is too complex, regularization can reduce the error due to variance by more than it increases the error due to bias, resulting in a model that is more likely to generalize to new data.

In simpler terms, regularization makes your model a bit less flexible so that it’s more likely to follow the true patterns in the data and less likely to follow the noise. Regularization is especially useful when you have outliers in the training data, because regularization decreases the influence that outliers have on the model.
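
To see this coefficient shrinkage in action, here’s a minimal sketch using a synthetic dataset (not our Titanic Pipeline). In scikit-learn, smaller values of C mean stronger regularization, so the average coefficient size should shrink as C decreases:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1)
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, solver='liblinear').fit(X_demo, y_demo)
    print(C, np.abs(lr.coef_).mean())  # mean coefficient magnitude grows with C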

Brief explanation of regularization:

  • Constrains the size of model coefficients to minimize overfitting
  • Reduces the variance of an overly complex model to help the model generalize
  • Decreases model flexibility so that it follows the true patterns in the data