11 Comparing linear and non-linear models
11.1 Trying a random forest model
So far, the only model we’ve used in this book is logistic regression. But what if you wanted to try a different model?
One great thing about the scikit-learn API is that once you’ve built a workflow, you can easily swap in a different model, usually without making any other changes to your workflow. This flexibility is a huge benefit, since it’s not possible to know ahead of time which model is going to work best for a given problem and dataset, an idea that is often associated with the “no free lunch” theorem.
In this chapter, we’re going to try out the random forest model, which is one of the most well-known models in machine learning. Whereas logistic regression is a linear model, a random forest is a non-linear model based on decision trees. These two types of models have different overall properties, so it may turn out that one type is better suited to this particular problem.
We start out by importing the RandomForestClassifier class from the ensemble module, and creating an instance called rf. Because there’s randomness involved in a random forest, we’ll set a random state for reproducibility. And because building a random forest can be computationally expensive, it has its own n_jobs parameter (just like grid search and randomized search), which we’ll set to -1 to enable parallel processing.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=1, n_jobs=-1)
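To make the role of random_state concrete, here is a minimal sketch. It uses a small synthetic dataset from make_classification as a stand-in (an assumption, since the chapter's X and y are not recreated here), and the names X_demo, y_demo, rf_a, and rf_b are made up for the illustration. Two forests created with the same random_state should make the same predictions, even when trained in parallel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the chapter's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Two separate forests that share the same random_state
rf_a = RandomForestClassifier(n_estimators=25, random_state=1, n_jobs=-1)
rf_b = RandomForestClassifier(n_estimators=25, random_state=1, n_jobs=-1)
rf_a.fit(X_demo, y_demo)
rf_b.fit(X_demo, y_demo)

# With identical seeds, the two forests should agree on every prediction
same = np.array_equal(rf_a.predict(X_demo), rf_b.predict(X_demo))
print(same)
```

If you change one of the two random_state values, the forests are built from different bootstrap samples and feature subsets, so their predictions are no longer guaranteed to match.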
We’ll create a new Pipeline object called rf_pipe that uses random forests instead of logistic regression.
rf_pipe = make_pipeline(ct, rf)
rf_pipe
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('randomforestclassifier', RandomForestClassifier(n_jobs=-1, random_state=1))])
And we can cross-validate it to generate a baseline accuracy, which is 0.811. This accuracy is nearly identical to the baseline accuracy of our logistic regression Pipeline, but it’s likely that we can improve it through hyperparameter tuning.
cross_val_score(rf_pipe, X, y, cv=5, scoring='accuracy').mean()
0.811436821291821
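The "swap in a different model" idea can be sketched in a few lines. This example is only an illustration under stated assumptions: it uses synthetic data from make_classification and a StandardScaler in place of the chapter's ColumnTransformer, and the names X_demo, y_demo, models, and demo_scores are hypothetical. The point is that only the final Pipeline step changes between the two runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the chapter's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The two models to compare; everything else in the workflow stays the same
models = {
    'LR': LogisticRegression(solver='liblinear', random_state=1),
    'RF': RandomForestClassifier(n_estimators=50, random_state=1),
}

demo_scores = {}
for name, model in models.items():
    # Same preprocessing step, different final estimator
    demo_pipe = make_pipeline(StandardScaler(), model)
    demo_scores[name] = cross_val_score(demo_pipe, X_demo, y_demo,
                                        cv=5, scoring='accuracy').mean()

print(demo_scores)
```

The scores you see on synthetic data say nothing about which model is better for the chapter's dataset. The sketch only shows that the cross-validation code does not need to change when the model does.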
As an aside, I’ve simplified the Pipeline accuracy scores table to only include the most important scores from the previous chapter. As you might guess, “LR” stands for logistic regression and “RF” stands for random forests. And going forward, I’ll always use the term “baseline” in this table to describe a Pipeline that has not undergone any hyperparameter tuning via grid search or randomized search.
11.2 Tuning random forests with randomized search
When tuning random forests, we’ll try tuning the same parameters for the transformers as before, but different parameters for the model. It’s still important to tune the transformations and the model at the same time, because it may turn out that the best data transformations for a random forest model are different than the best data transformations for a logistic regression model.
Rather than typing a parameters dictionary from scratch, we can start by creating a copy of the params dictionary called rf_params.
rf_params = params.copy()
rf_params
{'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': [0.1, 1, 10],
'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
Then, we’ll delete the entries from the rf_params dictionary that only apply to logistic regression.
del rf_params['logisticregression__penalty']
del rf_params['logisticregression__C']
rf_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
Alternatively, we could have created the rf_params dictionary using a dictionary comprehension that only keeps the entries in params that start with the letters ‘col’.
rf_params = {k:v for k, v in params.items() if k.startswith('col')}
rf_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
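One detail worth knowing about both approaches is that they create shallow copies. The following sketch uses a cut-down dictionary with hypothetical names (params_demo and rf_params_demo) to show what that means in practice: deleting or reassigning keys in the copy leaves the original untouched, but modifying a shared list in place affects both dictionaries.

```python
# A cut-down stand-in for the params dictionary (names are hypothetical)
params_demo = {
    'logisticregression__C': [0.1, 1, 10],
    'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
}
ngram_key = 'columntransformer__countvectorizer__ngram_range'

rf_params_demo = params_demo.copy()

# Deleting a key from the copy does not remove it from the original
del rf_params_demo['logisticregression__C']
print('logisticregression__C' in params_demo)  # True

# The copy is shallow, so the list objects are shared:
# an in-place change through the copy is visible in the original
rf_params_demo[ngram_key].append((2, 2))
print(params_demo[ngram_key])  # [(1, 1), (1, 2), (2, 2)]

# Assigning a new list to a key only affects the copy
rf_params_demo[ngram_key] = [(1, 1)]
print(params_demo[ngram_key])  # [(1, 1), (1, 2), (2, 2)]
```

Since this chapter only deletes keys and assigns new lists, the shallow copy is sufficient. If you planned to modify the lists in place, copy.deepcopy from the standard library would be the safer choice.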
Now we’re ready to add tuning parameters for the RandomForestClassifier. Random forests have a lot of parameters you can tune, which can make for a computationally expensive grid search if you try to tune all of them. This is compounded by the fact that a random forest is slower to train than logistic regression.
When you’re not quite sure which parameters to tune or which values to try for those parameters, I would suggest a two-step approach, which is what we’ll use in this chapter:
- First, use a randomized search with a variety of parameters and values. This allows you to test out a lot of different combinations while still controlling the computational budget. Examine the results of the search to look for trends of what’s working and what’s not.
- Second, use a grid search with a more optimized set of parameters and values, based on what you learned from the randomized search.
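The two steps above can be sketched on a small scale as follows. This is a simplified illustration, not the chapter's actual search: it assumes synthetic data, tunes a bare RandomForestClassifier rather than a Pipeline, and narrows the grid mechanically to the values seen in the top 3 randomized results. In practice you would examine the results yourself and might also extend the range, as we do later in this chapter. All names ending in _demo, plus forest, wide_params, top, and narrow_params, are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (an assumption, not the chapter's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(random_state=1)

# Step 1: randomized search over a deliberately wide set of values,
# with n_iter controlling the computational budget
wide_params = {'n_estimators': [10, 30, 50], 'min_samples_leaf': [1, 2, 3, 5]}
demo_rand = RandomizedSearchCV(forest, wide_params, n_iter=6, cv=3,
                               scoring='accuracy', random_state=1)
demo_rand.fit(X_demo, y_demo)

# Examine the best-ranked candidates to see which values are working
top = pd.DataFrame(demo_rand.cv_results_).sort_values('rank_test_score').head(3)

# Step 2: grid search restricted to the values that appear in the top candidates
narrow_params = {
    'n_estimators': sorted({p['n_estimators'] for p in top['params']}),
    'min_samples_leaf': sorted({p['min_samples_leaf'] for p in top['params']}),
}
demo_grid = GridSearchCV(forest, narrow_params, cv=3, scoring='accuracy')
demo_grid.fit(X_demo, y_demo)

print(narrow_params)
print(demo_rand.best_score_, demo_grid.best_score_)
```

Because the narrowed grid still contains a best-scoring combination from the randomized search, and both searches use the same folds and the same random_state, the grid search score here cannot be lower than the randomized search score.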
We’ll start by trying out four parameters from the RandomForestClassifier, which I selected based on research and experience.
First, we’ll confirm that the Pipeline step name is randomforestclassifier (all lowercase).
rf_pipe.named_steps.keys()
dict_keys(['columntransformer', 'randomforestclassifier'])
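As a quick illustration of why the step name matters, here is a sketch using a small stand-in Pipeline (demo_pipe is a hypothetical name, and StandardScaler stands in for the chapter's ColumnTransformer). The make_pipeline function names each step after the lowercased class name, and that name followed by two underscores is the prefix for every tunable parameter of that step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in Pipeline: StandardScaler replaces the chapter's ColumnTransformer
demo_pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1))

# Step names are the lowercased class names
print(list(demo_pipe.named_steps.keys()))
# ['standardscaler', 'randomforestclassifier']

# Every parameter of the model is addressable as <step name>__<parameter name>
tunable = [k for k in demo_pipe.get_params() if k.startswith('randomforestclassifier__')]
print('randomforestclassifier__n_estimators' in tunable)  # True

# The same naming convention works with set_params
demo_pipe.set_params(randomforestclassifier__n_estimators=300)
print(demo_pipe.named_steps['randomforestclassifier'].n_estimators)  # 300
```

Listing the keys from get_params is a handy way to check the spelling of a parameter name before adding it to a search dictionary.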
Then we’ll add the four parameters I selected and some reasonable values for those parameters to the rf_params dictionary:
- n_estimators is the number of decision trees in the random forest.
- min_samples_leaf is a way to control overfitting, just like regularization is used to control overfitting in a logistic regression model.
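- max_features is the number of features considered when looking for the best split, and bootstrap determines whether each tree is built from a bootstrap sample of the training data or from the entire training set.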
rf_params['randomforestclassifier__n_estimators'] = [100, 300, 500, 700]
rf_params['randomforestclassifier__min_samples_leaf'] = [1, 2, 3]
rf_params['randomforestclassifier__max_features'] = ['sqrt', None]
rf_params['randomforestclassifier__bootstrap'] = [True, False]
rf_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True],
'randomforestclassifier__n_estimators': [100, 300, 500, 700],
'randomforestclassifier__min_samples_leaf': [1, 2, 3],
'randomforestclassifier__max_features': ['sqrt', None],
'randomforestclassifier__bootstrap': [True, False]}
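To see why a randomized search is attractive here, it helps to count how many combinations a full grid search of this dictionary would require. The sketch below uses scikit-learn's ParameterGrid class to do the counting. The dictionary is retyped under the hypothetical name rf_params_demo so that the example runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same contents as rf_params above, retyped so the example is self-contained
rf_params_demo = {
    'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
    'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'columntransformer__simpleimputer__add_indicator': [False, True],
    'randomforestclassifier__n_estimators': [100, 300, 500, 700],
    'randomforestclassifier__min_samples_leaf': [1, 2, 3],
    'randomforestclassifier__max_features': ['sqrt', None],
    'randomforestclassifier__bootstrap': [True, False],
}

# Number of distinct combinations: 2 * 2 * 2 * 4 * 3 * 2 * 2
n_combinations = len(ParameterGrid(rf_params_demo))
print(n_combinations)  # 384

# With 5-fold cross-validation, a full grid search would fit this many models
print(n_combinations * 5)  # 1920

# A randomized search with 100 iterations fits far fewer
print(100 * 5)  # 500
```

In other words, 100 iterations cover roughly a quarter of the 384 possible combinations.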
Finally, we’ll create an instance of RandomizedSearchCV called rf_rand, making sure to use the rf_pipe and rf_params objects, and we’ll run 100 iterations of the randomized search.
Notice that I’ve added a warning to this cell to indicate that it takes a few minutes to run on my local machine, though it may run faster or slower on your machine.
# WARNING: EXTENDED RUNTIME
rf_rand = RandomizedSearchCV(rf_pipe, rf_params, cv=5, scoring='accuracy',
                             n_iter=100, random_state=1, n_jobs=-1)
%time rf_rand.fit(X, y)
CPU times: user 2.79 s, sys: 331 ms, total: 3.13 s
Wall time: 1min 33s
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Far... 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'randomforestclassifier__bootstrap': [True, False], 'randomforestclassifier__max_features': ['sqrt', None], 'randomforestclassifier__min_samples_leaf': [1, 2, 3], 'randomforestclassifier__n_estimators': [100, 300, 500, 700]}, random_state=1, scoring='accuracy')
We see that the best score found during the randomized search is 0.825, which is better than our baseline score of 0.811, but not quite as good as the 0.828 score of our best logistic regression Pipeline.
rf_rand.best_score_
0.8249262444291003
11.3 Further tuning with grid search
Let’s now examine the results from the randomized search to look for trends, focusing on the top 20 results. We’ll convert the results to a DataFrame and then sort by rank_test_score. Keep in mind that it’s hard to draw definitive conclusions from a randomized search, so we’re only looking for obvious trends.
results = pd.DataFrame(rf_rand.cv_results_)
results.sort_values('rank_test_score').head(20)
index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_randomforestclassifier__n_estimators | param_randomforestclassifier__min_samples_leaf | param_randomforestclassifier__max_features | param_randomforestclassifier__bootstrap | param_columntransformer__simpleimputer__add_indicator | param_columntransformer__pipeline__onehotencoder__drop | param_columntransformer__countvectorizer__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | 1.758950 | 0.102511 | 0.091779 | 0.033565 | 700 | 3 | None | True | False | first | (1, 1) | {'randomforestclassifier__n_estimators': 700, ... | 0.815642 | 0.808989 | 0.853933 | 0.803371 | 0.842697 | 0.824926 | 0.019809 | 1 |
70 | 2.013018 | 0.084309 | 0.086011 | 0.023684 | 700 | 3 | None | True | False | None | (1, 1) | {'randomforestclassifier__n_estimators': 700, ... | 0.810056 | 0.808989 | 0.853933 | 0.803371 | 0.837079 | 0.822685 | 0.019513 | 2 |
31 | 1.887772 | 0.086216 | 0.064653 | 0.011458 | 500 | 3 | None | True | True | first | (1, 2) | {'randomforestclassifier__n_estimators': 500, ... | 0.810056 | 0.814607 | 0.848315 | 0.797753 | 0.842697 | 0.822685 | 0.019513 | 2 |
45 | 1.037857 | 0.158690 | 0.056631 | 0.028906 | 300 | 2 | None | True | False | None | (1, 1) | {'randomforestclassifier__n_estimators': 300, ... | 0.815642 | 0.808989 | 0.842697 | 0.808989 | 0.837079 | 0.822679 | 0.014370 | 4 |
33 | 0.858704 | 0.094488 | 0.060416 | 0.020129 | 300 | 3 | None | True | False | first | (1, 1) | {'randomforestclassifier__n_estimators': 300, ... | 0.810056 | 0.808989 | 0.848315 | 0.803371 | 0.837079 | 0.821562 | 0.017764 | 5 |
54 | 1.010711 | 0.062747 | 0.053485 | 0.011058 | 300 | 2 | None | True | True | None | (1, 1) | {'randomforestclassifier__n_estimators': 300, ... | 0.821229 | 0.808989 | 0.842697 | 0.803371 | 0.831461 | 0.821549 | 0.014379 | 6 |
68 | 2.148946 | 0.087918 | 0.076372 | 0.018386 | 500 | 3 | None | True | False | None | (1, 2) | {'randomforestclassifier__n_estimators': 500, ... | 0.804469 | 0.808989 | 0.848315 | 0.803371 | 0.837079 | 0.820444 | 0.018609 | 7 |
81 | 0.413764 | 0.086766 | 0.029865 | 0.010336 | 100 | 3 | None | True | True | first | (1, 2) | {'randomforestclassifier__n_estimators': 100, ... | 0.815642 | 0.803371 | 0.842697 | 0.803371 | 0.837079 | 0.820432 | 0.016601 | 8 |
15 | 0.844169 | 0.142691 | 0.048393 | 0.007232 | 300 | 2 | None | True | True | first | (1, 1) | {'randomforestclassifier__n_estimators': 300, ... | 0.810056 | 0.808989 | 0.842697 | 0.797753 | 0.837079 | 0.819315 | 0.017433 | 9 |
94 | 1.856896 | 0.045917 | 0.057204 | 0.015763 | 500 | 2 | None | True | True | first | (1, 1) | {'randomforestclassifier__n_estimators': 500, ... | 0.815642 | 0.808989 | 0.837079 | 0.797753 | 0.837079 | 0.819308 | 0.015596 | 10 |
98 | 0.261669 | 0.037321 | 0.044620 | 0.026675 | 100 | 2 | None | True | True | first | (1, 1) | {'randomforestclassifier__n_estimators': 100, ... | 0.815642 | 0.808989 | 0.837079 | 0.808989 | 0.825843 | 0.819308 | 0.010816 | 10 |
63 | 3.068088 | 0.164047 | 0.104920 | 0.052729 | 700 | 1 | None | True | True | first | (1, 1) | {'randomforestclassifier__n_estimators': 700, ... | 0.837989 | 0.814607 | 0.814607 | 0.780899 | 0.848315 | 0.819283 | 0.023280 | 12 |
18 | 2.645641 | 0.444149 | 0.072046 | 0.010605 | 700 | 2 | None | True | True | first | (1, 1) | {'randomforestclassifier__n_estimators': 700, ... | 0.810056 | 0.808989 | 0.831461 | 0.797753 | 0.842697 | 0.818191 | 0.016402 | 13 |
57 | 1.524880 | 0.188370 | 0.092194 | 0.019741 | 500 | 2 | None | True | False | None | (1, 1) | {'randomforestclassifier__n_estimators': 500, ... | 0.815642 | 0.808989 | 0.837079 | 0.792135 | 0.837079 | 0.818185 | 0.017225 | 14 |
12 | 0.437379 | 0.084064 | 0.026701 | 0.007412 | 100 | 3 | None | True | True | None | (1, 2) | {'randomforestclassifier__n_estimators': 100, ... | 0.815642 | 0.803371 | 0.831461 | 0.803371 | 0.837079 | 0.818185 | 0.013990 | 14 |
72 | 1.300662 | 0.091567 | 0.054674 | 0.013393 | 300 | 1 | None | True | False | None | (1, 1) | {'randomforestclassifier__n_estimators': 300, ... | 0.821229 | 0.808989 | 0.820225 | 0.797753 | 0.842697 | 0.818178 | 0.014942 | 16 |
2 | 0.748695 | 0.197776 | 0.049966 | 0.010642 | 500 | 1 | sqrt | False | False | first | (1, 1) | {'randomforestclassifier__n_estimators': 500, ... | 0.837989 | 0.831461 | 0.820225 | 0.752809 | 0.848315 | 0.818160 | 0.033925 | 17 |
41 | 2.489124 | 0.073481 | 0.078912 | 0.004515 | 700 | 2 | None | True | True | None | (1, 1) | {'randomforestclassifier__n_estimators': 700, ... | 0.810056 | 0.808989 | 0.831461 | 0.797753 | 0.837079 | 0.817067 | 0.014799 | 18 |
10 | 2.137177 | 0.137258 | 0.062515 | 0.015032 | 500 | 2 | None | True | False | first | (1, 2) | {'randomforestclassifier__n_estimators': 500, ... | 0.815642 | 0.808989 | 0.837079 | 0.780899 | 0.842697 | 0.817061 | 0.022058 | 19 |
8 | 2.289808 | 0.061330 | 0.077428 | 0.016585 | 500 | 2 | None | True | True | None | (1, 2) | {'randomforestclassifier__n_estimators': 500, ... | 0.815642 | 0.814607 | 0.831461 | 0.786517 | 0.837079 | 0.817061 | 0.017601 | 19 |
Starting with n_estimators, we see that higher numbers are performing better, which is typical for n_estimators. It seems unlikely that 100 will produce the best result, so we’ll exclude that value from our grid search. And since the current best result is at 700, it seems useful to add a value of 900 to our grid search, in case increasing it further is even better.
Do keep in mind that increasing n_estimators also increases the time needed to train the model. You could consider just setting a single large value for n_estimators rather than searching through multiple values, since larger values will generally produce better results up to a certain point, but I prefer to tune this value when computational resources allow for it.
The next parameter to examine is min_samples_leaf. Similar to n_estimators, the lowest value of 1 seems unlikely to produce the best result, so we’ll remove it. The current best result is 3, so we’ll also try the values 4 and 5 in the grid search.
For max_features, it’s clear that None is performing better, so we’re no longer going to try sqrt.
For bootstrap, it’s clear that True is performing better, so we’re no longer going to try False.
And finally, there aren’t any clear trends for the transformer parameters, so we’ll leave those as-is.
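If you'd like a rough numerical check on trends like these, one option is to tally how often each value appears among the top results. The sketch below does this for n_estimators and min_samples_leaf, with the 20 values transcribed by hand from the table above in rank order. The DataFrame name top20 is hypothetical, and with the real results object you would select the corresponding param_ columns instead.

```python
import pandas as pd

# Values transcribed from the top 20 rows of the table above, in rank order
top20 = pd.DataFrame({
    'n_estimators': [700, 700, 500, 300, 300, 300, 500, 100, 300, 500,
                     100, 700, 700, 500, 100, 300, 500, 700, 500, 500],
    'min_samples_leaf': [3, 3, 3, 2, 3, 2, 3, 3, 2, 2,
                         2, 1, 2, 2, 3, 1, 1, 2, 2, 2],
})

# How often does each value appear among the top 20 results?
n_est_counts = top20['n_estimators'].value_counts().sort_index()
leaf_counts = top20['min_samples_leaf'].value_counts().sort_index()

print(n_est_counts.to_dict())  # {100: 3, 300: 5, 500: 7, 700: 5}
print(leaf_counts.to_dict())   # {1: 3, 2: 10, 3: 7}
```

Treat these counts as a rough guide only. They ignore the rank order within the top 20, and they depend on which combinations the randomized search happened to sample.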
Here are the updated values we’re going to try. For max_features and bootstrap, notice that we can pass a list containing a single value, which ensures that the parameter is always set to that value during the search.
rf_params['randomforestclassifier__n_estimators'] = [300, 500, 700, 900]
rf_params['randomforestclassifier__min_samples_leaf'] = [2, 3, 4, 5]
rf_params['randomforestclassifier__max_features'] = [None]
rf_params['randomforestclassifier__bootstrap'] = [True]
rf_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True],
'randomforestclassifier__n_estimators': [300, 500, 700, 900],
'randomforestclassifier__min_samples_leaf': [2, 3, 4, 5],
'randomforestclassifier__max_features': [None],
'randomforestclassifier__bootstrap': [True]}
At this point, you could continue to run additional randomized searches in order to study the trends further, but we’re just going to move on to grid search.
We’ll create an instance of GridSearchCV called rf_grid, making sure to use the rf_pipe and rf_params objects, and then run the search.
# WARNING: EXTENDED RUNTIME
rf_grid = GridSearchCV(rf_pipe, rf_params, cv=5, scoring='accuracy', n_jobs=-1)
%time rf_grid.fit(X, y)
CPU times: user 2.1 s, sys: 221 ms, total: 2.32 s
Wall time: 3min 3s
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'randomforestclassifier__bootstrap': [True], 'randomforestclassifier__max_features': [None], 'randomforestclassifier__min_samples_leaf': [2, 3, 4, 5], 'randomforestclassifier__n_estimators': [300, 500, 700, 900]}, scoring='accuracy')
The best score from our grid search is 0.829, which is just a tiny bit higher than the 0.828 score of our best logistic regression Pipeline.
rf_grid.best_score_
0.8294143493817087
These are the best parameters that were found during the grid search. Again, it’s hard to say whether this is truly the best set of parameters, but we at least know that it’s a good set of parameters.
rf_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 1),
'columntransformer__pipeline__onehotencoder__drop': 'first',
'columntransformer__simpleimputer__add_indicator': True,
'randomforestclassifier__bootstrap': True,
'randomforestclassifier__max_features': None,
'randomforestclassifier__min_samples_leaf': 4,
'randomforestclassifier__n_estimators': 300}
11.4 Q&A: How do I tune two models with a single grid search?
So far, we’ve set up two separate Pipelines called pipe and rf_pipe that each end in a different model. That made it easy to grid search each Pipeline with its own set of relevant parameters. However, you can actually tune two different models using a single grid search if you like.
To do this, the first step is to create a new Pipeline using the Pipeline class instead of the make_pipeline function. The reason we’re doing this is so that we can provide custom names for the steps. In this case, we’ll call the Pipeline object both_pipe, and we’ll call the step names “preprocessor” and “classifier”. We’ll set the classifier to be logistic regression, though this is just a placeholder as you’ll see in a minute.
both_pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
Next, we’ll create a new parameter dictionary called params1. For the simplicity of this example, we’re only going to tune one parameter from the preprocessor step and two parameters from the classifier step.
Additionally, we’re going to add one more entry to the dictionary to indicate that the classifier we want to use with this parameter set is logistic regression. Notice that this is a logistic regression object, not a string, and also notice that we put it in brackets to make it a list. This might seem strange, but it will make more sense in a minute.
params1 = {}
params1['preprocessor__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
params1['classifier__penalty'] = ['l1', 'l2']
params1['classifier__C'] = [0.1, 1, 10]
params1['classifier'] = [logreg]
params1
{'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'classifier__penalty': ['l1', 'l2'],
'classifier__C': [0.1, 1, 10],
'classifier': [LogisticRegression(random_state=1, solver='liblinear')]}
Next, we’ll create another parameter dictionary called params2. Again, we’ll tune one parameter from the preprocessor step and two parameters from the classifier step. You’ll notice that the classifier parameters are random forest parameters, not logistic regression parameters.
Just like above, we’ll add one more entry to the dictionary to indicate that the classifier we want to use with this parameter set is random forests. During the grid search, this will override the logistic regression classifier we specified when creating the Pipeline.
params2 = {}
params2['preprocessor__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
params2['classifier__n_estimators'] = [300, 500]
params2['classifier__min_samples_leaf'] = [3, 4]
params2['classifier'] = [rf]
params2
{'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'classifier__n_estimators': [300, 500],
'classifier__min_samples_leaf': [3, 4],
'classifier': [RandomForestClassifier(n_jobs=-1, random_state=1)]}
Next, we’ll create a list called both_params that includes both of these parameter sets.
both_params = [params1, params2]
both_params
[{'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'classifier__penalty': ['l1', 'l2'],
'classifier__C': [0.1, 1, 10],
'classifier': [LogisticRegression(random_state=1, solver='liblinear')]},
{'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'classifier__n_estimators': [300, 500],
'classifier__min_samples_leaf': [3, 4],
'classifier': [RandomForestClassifier(n_jobs=-1, random_state=1)]}]
Finally, we’ll create an instance of GridSearchCV called both_grid, making sure to pass it the both_pipe and both_params objects.
When we run the search, here’s what will happen:
- First, it will try every combination of parameters from params1, which is 2 times 2 times 3 times 1, or 12 combinations.
- Then, it will try every combination of parameters from params2, which is 2 times 2 times 2 times 1, or 8 combinations.
Thus, the search will evaluate a total of 20 parameter combinations, each of which is cross-validated with 5 folds.
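You can verify this arithmetic without running a search by passing the dictionaries to scikit-learn's ParameterGrid class, which accepts either a single dictionary or a list of dictionaries. In this sketch the dictionaries are retyped with a _demo suffix (hypothetical names) so that the example is self-contained.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# Stand-ins for the logreg and rf objects used in this chapter
logreg_demo = LogisticRegression(solver='liblinear', random_state=1)
rf_demo = RandomForestClassifier(random_state=1, n_jobs=-1)

params1_demo = {
    'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': [0.1, 1, 10],
    'classifier': [logreg_demo],
}
params2_demo = {
    'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__n_estimators': [300, 500],
    'classifier__min_samples_leaf': [3, 4],
    'classifier': [rf_demo],
}

# Each dictionary is expanded separately, and the results are concatenated
n_lr = len(ParameterGrid(params1_demo))
n_rf = len(ParameterGrid(params2_demo))
n_both = len(ParameterGrid([params1_demo, params2_demo]))
print(n_lr, n_rf, n_both)  # 12 8 20
```

Note that parameters from one dictionary are never combined with parameters from the other, which is why the total is 12 plus 8 rather than 12 times 8.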
both_grid = GridSearchCV(both_pipe, both_params, cv=5, scoring='accuracy',
                         n_jobs=-1)
both_grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('pass... solver='liblinear')], 'classifier__C': [0.1, 1, 10], 'classifier__penalty': ['l1', 'l2'], 'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)]}, {'classifier': [RandomForestClassifier(n_jobs=-1, random_state=1)], 'classifier__min_samples_leaf': [3, 4], 'classifier__n_estimators': [300, 500], 'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)]}], scoring='accuracy')
Let’s take a look at the results. You can see that it ran 12 times with a logistic regression model and 8 times with a random forest model. Also note that when the logistic regression model runs, the random forest-related parameters are listed as NaN, and vice versa when the random forest model runs.
pd.DataFrame(both_grid.cv_results_)
index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier | param_classifier__C | param_classifier__penalty | param_preprocessor__countvectorizer__ngram_range | param_classifier__min_samples_leaf | param_classifier__n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.007347 | 0.000113 | 0.002645 | 0.000035 | LogisticRegression(C=10, penalty='l1', random_... | 0.1 | l1 | (1, 1) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 12 |
1 | 0.010414 | 0.000730 | 0.003242 | 0.000054 | LogisticRegression(C=10, penalty='l1', random_... | 0.1 | l1 | (1, 2) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 12 |
2 | 0.010512 | 0.003404 | 0.003052 | 0.000255 | LogisticRegression(C=10, penalty='l1', random_... | 0.1 | l2 | (1, 1) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.798883 | 0.803371 | 0.764045 | 0.775281 | 0.803371 | 0.788990 | 0.016258 | 10 |
3 | 0.012982 | 0.001550 | 0.005367 | 0.001251 | LogisticRegression(C=10, penalty='l1', random_... | 0.1 | l2 | (1, 2) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.793296 | 0.803371 | 0.764045 | 0.775281 | 0.808989 | 0.788996 | 0.016944 | 9 |
4 | 0.012608 | 0.002167 | 0.004043 | 0.001136 | LogisticRegression(C=10, penalty='l1', random_... | 1 | l1 | (1, 1) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.815642 | 0.820225 | 0.797753 | 0.792135 | 0.848315 | 0.814814 | 0.019787 | 3 |
5 | 0.016644 | 0.000577 | 0.004067 | 0.000812 | LogisticRegression(C=10, penalty='l1', random_... | 1 | l1 | (1, 2) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.815642 | 0.820225 | 0.786517 | 0.792135 | 0.848315 | 0.812567 | 0.022100 | 4 |
6 | 0.011687 | 0.002482 | 0.003360 | 0.000297 | LogisticRegression(C=10, penalty='l1', random_... | 1 | l2 | (1, 1) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.798883 | 0.825843 | 0.803371 | 0.786517 | 0.842697 | 0.811462 | 0.020141 | 5 |
7 | 0.015418 | 0.002135 | 0.005152 | 0.001715 | LogisticRegression(C=10, penalty='l1', random_... | 1 | l2 | (1, 2) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.798883 | 0.814607 | 0.792135 | 0.786517 | 0.837079 | 0.805844 | 0.018234 | 8 |
8 | 0.019986 | 0.003582 | 0.005547 | 0.001773 | LogisticRegression(C=10, penalty='l1', random_... | 10 | l1 | (1, 1) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.832402 | 0.808989 | 0.808989 | 0.786517 | 0.853933 | 0.818166 | 0.023031 | 2 |
9 | 0.022375 | 0.006685 | 0.005572 | 0.002336 | LogisticRegression(C=10, penalty='l1', random_... | 10 | l1 | (1, 2) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.849162 | 0.820225 | 0.814607 | 0.780899 | 0.859551 | 0.824889 | 0.027760 | 1 |
10 | 0.013795 | 0.002522 | 0.004653 | 0.001127 | LogisticRegression(C=10, penalty='l1', random_... | 10 | l2 | (1, 1) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 6 |
11 | 0.016559 | 0.002225 | 0.004013 | 0.000424 | LogisticRegression(C=10, penalty='l1', random_... | 10 | l2 | (1, 2) | NaN | NaN | {'classifier': LogisticRegression(C=10, penalt... | 0.787709 | 0.814607 | 0.820225 | 0.780899 | 0.837079 | 0.808104 | 0.020904 | 7 |
12 | 0.329072 | 0.039890 | 0.046886 | 0.009094 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 1) | 3 | 300 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.793296 | 0.803371 | 0.780899 | 0.747191 | 0.792135 | 0.783378 | 0.019444 | 14 |
13 | 0.315229 | 0.023925 | 0.047965 | 0.009272 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 2) | 3 | 300 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.765363 | 0.713483 | 0.702247 | 0.685393 | 0.752809 | 0.723859 | 0.030381 | 16 |
14 | 0.534220 | 0.080665 | 0.061349 | 0.009893 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 1) | 3 | 500 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.793296 | 0.808989 | 0.780899 | 0.752809 | 0.786517 | 0.784502 | 0.018431 | 11 |
15 | 0.623413 | 0.099512 | 0.055768 | 0.004650 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 2) | 3 | 500 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.754190 | 0.752809 | 0.713483 | 0.696629 | 0.775281 | 0.738478 | 0.028923 | 15 |
16 | 0.319832 | 0.048049 | 0.042681 | 0.002448 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 1) | 4 | 300 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.692737 | 0.702247 | 0.735955 | 0.646067 | 0.730337 | 0.701469 | 0.032152 | 17 |
17 | 0.362235 | 0.086649 | 0.048045 | 0.012798 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 2) | 4 | 300 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.614525 | 0.617978 | 0.617978 | 0.617978 | 0.612360 | 0.616163 | 0.002325 | 19 |
18 | 0.504366 | 0.026996 | 0.048679 | 0.005673 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 1) | 4 | 500 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.659218 | 0.741573 | 0.696629 | 0.685393 | 0.651685 | 0.686900 | 0.031914 | 18 |
19 | 0.649702 | 0.034483 | 0.043766 | 0.010341 | RandomForestClassifier(n_jobs=-1, random_state=1) | NaN | NaN | (1, 2) | 4 | 500 | {'classifier': RandomForestClassifier(n_jobs=-... | 0.614525 | 0.617978 | 0.617978 | 0.617978 | 0.612360 | 0.616163 | 0.002325 | 19 |
As usual, the best_score_ and best_params_ attributes are still available.
both_grid.best_score_
0.8248885820099178
both_grid.best_params_
{'classifier': LogisticRegression(C=10, penalty='l1', random_state=1, solver='liblinear'),
'classifier__C': 10,
'classifier__penalty': 'l1',
'preprocessor__countvectorizer__ngram_range': (1, 2)}
Here are two neat extensions to what we’ve done in this section that you could try on your own.
First, since the two models have separate parameter dictionaries, you could theoretically tune different preprocessing parameters for each model. For example, you could tune different CountVectorizer parameters for logistic regression and random forests.
Taking it one step further, you could actually create two different preprocessor objects and tune them using the same grid search, just like we tuned two different models using the same grid search. That would allow you, for example, to use different encoders when preparing data for your logistic regression and random forest models.
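The second extension might look something like the sketch below, in which the 'preprocessor' step is swapped in the same way as the 'classifier' step. It assumes a small made-up DataFrame, and the pairing of OneHotEncoder with logistic regression and OrdinalEncoder with random forests is only an illustration, not a recommendation. The names toy_X, toy_y, ct_onehot, ct_ordinal, toy_pipe, toy_params, and toy_grid are all hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up data with two categorical columns and one numeric column
toy_X = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male'] * 3,
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S', 'Q', 'S'] * 3,
    'Fare': [7.0, 71.0, 8.0, 53.0, 51.0, 8.5, 30.0, 13.0] * 3,
})
toy_y = [0, 1, 1, 0, 1, 0, 1, 0] * 3

cat_cols = ['Sex', 'Embarked']

# Two preprocessors that differ only in how they encode the categorical columns
ct_onehot = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('passthrough', ['Fare']))
ct_ordinal = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_cols),
    ('passthrough', ['Fare']))

logreg = LogisticRegression(solver='liblinear', random_state=1)
rf = RandomForestClassifier(random_state=1)

toy_pipe = Pipeline([('preprocessor', ct_onehot), ('classifier', logreg)])

# Each dictionary pairs one preprocessor with one model
toy_params = [
    {'preprocessor': [ct_onehot], 'classifier': [logreg],
     'classifier__C': [0.1, 1]},
    {'preprocessor': [ct_ordinal], 'classifier': [rf],
     'classifier__n_estimators': [10, 20]},
]

toy_grid = GridSearchCV(toy_pipe, toy_params, cv=3, scoring='accuracy')
toy_grid.fit(toy_X, toy_y)

# The best pipeline contains whichever preprocessor and model won together
print(toy_grid.best_estimator_.named_steps['preprocessor'])
print(toy_grid.best_estimator_.named_steps['classifier'])
```

If you also wanted to try every preprocessor with every model, you could list both preprocessors in each dictionary instead of pairing them one-to-one.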
11.5 Q&A: How do I tune two models with a single randomized search?
Starting in scikit-learn version 0.22, RandomizedSearchCV can search multiple parameter dictionaries. This allows you to run a randomized search across multiple models in exactly the same way that we ran a grid search across multiple models.
Here’s an example in which we pass both_pipe and both_params to RandomizedSearchCV and run it for 10 iterations.
both_rand = RandomizedSearchCV(both_pipe, both_params, cv=5, scoring='accuracy',
                               n_iter=10, random_state=1, n_jobs=-1)
both_rand.fit(X, y)
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']),... 'classifier__C': [0.1, 1, 10], 'classifier__penalty': ['l1', 'l2'], 'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)]}, {'classifier': [RandomForestClassifier(n_jobs=-1, random_state=1)], 'classifier__min_samples_leaf': [3, 4], 'classifier__n_estimators': [300, 500], 'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)]}], random_state=1, scoring='accuracy')
If you examine the results below, you’ll find that logistic regression is chosen more often (in 7 of the 10 iterations) because we defined more parameter combinations for logistic regression than for random forests: 12 versus 8. This behavior may change in a future version, such that each model is equally likely to be chosen, regardless of the number of parameter combinations.
pd.DataFrame(both_rand.cv_results_)
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_preprocessor__countvectorizer__ngram_range | param_classifier__penalty | param_classifier__C | param_classifier | param_classifier__n_estimators | param_classifier__min_samples_leaf | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.011173 | 0.001856 | 0.003149 | 0.000129 | (1, 2) | l2 | 0.1 | LogisticRegression(C=1, penalty='l1', random_s... | NaN | NaN | {'preprocessor__countvectorizer__ngram_range':... | 0.793296 | 0.803371 | 0.764045 | 0.775281 | 0.808989 | 0.788996 | 0.016944 | 5 |
1 | 0.324651 | 0.029089 | 0.040264 | 0.011326 | (1, 1) | NaN | NaN | RandomForestClassifier(n_jobs=-1, random_state=1) | 300 | 4 | {'preprocessor__countvectorizer__ngram_range':... | 0.692737 | 0.702247 | 0.735955 | 0.646067 | 0.730337 | 0.701469 | 0.032152 | 9 |
2 | 0.012903 | 0.002507 | 0.003355 | 0.000525 | (1, 1) | l2 | 1 | LogisticRegression(C=1, penalty='l1', random_s... | NaN | NaN | {'preprocessor__countvectorizer__ngram_range':... | 0.798883 | 0.825843 | 0.803371 | 0.786517 | 0.842697 | 0.811462 | 0.020141 | 2 |
3 | 0.014181 | 0.005096 | 0.004045 | 0.001645 | (1, 1) | l2 | 10 | LogisticRegression(C=1, penalty='l1', random_s... | NaN | NaN | {'preprocessor__countvectorizer__ngram_range':... | 0.782123 | 0.803371 | 0.808989 | 0.797753 | 0.853933 | 0.809234 | 0.024080 | 3 |
4 | 0.008655 | 0.000337 | 0.003142 | 0.000069 | (1, 1) | l2 | 0.1 | LogisticRegression(C=1, penalty='l1', random_s... | NaN | NaN | {'preprocessor__countvectorizer__ngram_range':... | 0.798883 | 0.803371 | 0.764045 | 0.775281 | 0.803371 | 0.788990 | 0.016258 | 6 |
5 | 0.506709 | 0.093686 | 0.040965 | 0.009688 | (1, 1) | NaN | NaN | RandomForestClassifier(n_jobs=-1, random_state=1) | 500 | 3 | {'preprocessor__countvectorizer__ngram_range':... | 0.793296 | 0.808989 | 0.780899 | 0.752809 | 0.786517 | 0.784502 | 0.018431 | 7 |
6 | 0.010139 | 0.000722 | 0.003664 | 0.000909 | (1, 1) | l1 | 1 | LogisticRegression(C=1, penalty='l1', random_s... | NaN | NaN | {'preprocessor__countvectorizer__ngram_range':... | 0.815642 | 0.820225 | 0.797753 | 0.792135 | 0.848315 | 0.814814 | 0.019787 | 1 |
7 | 0.273520 | 0.015257 | 0.033608 | 0.002356 | (1, 2) | NaN | NaN | RandomForestClassifier(n_jobs=-1, random_state=1) | 300 | 4 | {'preprocessor__countvectorizer__ngram_range':... | 0.614525 | 0.617978 | 0.617978 | 0.617978 | 0.612360 | 0.616163 | 0.002325 | 10 |
8 | 0.019968 | 0.010435 | 0.010277 | 0.009121 | (1, 2) | l2 | 1 | LogisticRegression(C=1, penalty='l1', random_s... | NaN | NaN | {'preprocessor__countvectorizer__ngram_range':... | 0.798883 | 0.814607 | 0.792135 | 0.786517 | 0.837079 | 0.805844 | 0.018234 | 4 |
9 | 0.049322 | 0.012308 | 0.019117 | 0.010222 | (1, 2) | l1 | 0.1 | LogisticRegression(C=1, penalty='l1', random_s... | NaN | NaN | {'preprocessor__countvectorizer__ngram_range':... | 0.787709 | 0.803371 | 0.769663 | 0.758427 | 0.797753 | 0.783385 | 0.016946 | 8 |
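The imbalance between the two models can be checked without fitting anything, since ParameterGrid and ParameterSampler only enumerate and draw parameter combinations. The following is a small sketch under a few assumptions: sketch_params is a hypothetical name, and its grids are copied from the parameter lists visible in the RandomizedSearchCV output above. The 'classifier' entry of the first dictionary is filled in by analogy with the second, because that part of the output is truncated.

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, ParameterSampler

logreg = LogisticRegression(solver='liblinear', random_state=1)
rf = RandomForestClassifier(random_state=1, n_jobs=-1)

# Hypothetical reconstruction of the two parameter dictionaries
sketch_params = [
    {'classifier': [logreg],
     'classifier__C': [0.1, 1, 10],
     'classifier__penalty': ['l1', 'l2'],
     'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)]},
    {'classifier': [rf],
     'classifier__min_samples_leaf': [3, 4],
     'classifier__n_estimators': [300, 500],
     'preprocessor__countvectorizer__ngram_range': [(1, 1), (1, 2)]},
]

# Number of combinations defined by each dictionary: 3*2*2 and 2*2*2
grid_sizes = [len(ParameterGrid(d)) for d in sketch_params]
print(grid_sizes)

# Draw 10 candidates (no models are fit) and count the draws per model
sampled = list(ParameterSampler(sketch_params, n_iter=10, random_state=1))
model_counts = Counter(type(p['classifier']).__name__ for p in sampled)
print(model_counts)
```

When every parameter is given as a list, the candidates are drawn without replacement from the combined set of combinations, so the model with more combinations tends to show up more often.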