11  Comparing linear and non-linear models

11.1 Trying a random forest model

So far, the only model we’ve used in this book is logistic regression. But what if you wanted to try a different model?

One great thing about the scikit-learn API is that once you’ve built a workflow, you can easily swap in a different model, usually without making any other changes to your workflow. This is a huge benefit of scikit-learn, because it’s not possible to know ahead of time which model will work best for a given problem and dataset. This idea is formalized by the “no free lunch” theorem, which says that no single model is the best choice for every problem.
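To see why swapping models is so easy, here’s a minimal sketch (using scikit-learn’s built-in iris dataset rather than this book’s data): because every estimator shares the same fit/predict interface, the identical evaluation code works for any model you drop in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# toy dataset for illustration only (not this book's dataset)
X_demo, y_demo = load_iris(return_X_y=True)

# the same cross-validation code runs unchanged for either model
for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=1)]:
    score = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    print(type(model).__name__, round(score, 3))
```

The loop body never mentions which model it’s evaluating, which is exactly what makes swapping models into an existing workflow nearly effortless.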

In this chapter, we’re going to try out the random forest model, which is one of the most well-known models in Machine Learning. Whereas logistic regression is a linear model, a random forest is a non-linear model based on decision trees. These two types of models have different overall properties, so it may turn out that one type is better suited to this particular problem.

Random forest model:

  • Non-linear model
  • Based on decision trees
  • Different properties from logistic regression
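To illustrate this linear vs. non-linear distinction, here’s a quick sketch on a toy dataset with a non-linear class boundary (scikit-learn’s make_moons, not this book’s dataset). On data like this, a tree-based model can usually capture the boundary that a linear model can’t.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# toy dataset with two interleaving half-moon classes (a non-linear boundary)
X_toy, y_toy = make_moons(n_samples=500, noise=0.25, random_state=1)

# linear model: limited to a straight-line decision boundary
lr_score = cross_val_score(LogisticRegression(), X_toy, y_toy,
                           cv=5, scoring='accuracy').mean()

# non-linear model: ensemble of decision trees
rf_score = cross_val_score(RandomForestClassifier(random_state=1), X_toy, y_toy,
                           cv=5, scoring='accuracy').mean()

print('Logistic regression:', round(lr_score, 3))
print('Random forest:      ', round(rf_score, 3))
```

Keep in mind this is a contrived example: on real datasets (including the one in this book), a non-linear model won’t necessarily outperform a linear one, which is why we try both.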

We start out by importing the RandomForestClassifier class from the ensemble module, and creating an instance called rf. Because there’s randomness involved in a random forest, we’ll set a random state for reproducibility. And because building a random forest can be computationally expensive, it has its own n_jobs parameter (just like grid search and randomized search), which we’ll set to -1 to enable parallel processing.

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=1, n_jobs=-1)

We’ll create a new Pipeline object called rf_pipe that uses random forests instead of logistic regression.

rf_pipe = make_pipeline(ct, rf)
rf_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(n_jobs=-1, random_state=1))])

And we can cross-validate it to generate a baseline accuracy, which is 0.811. This accuracy is nearly identical to the baseline accuracy of our logistic regression Pipeline, but it’s likely that we can improve it through hyperparameter tuning.

cross_val_score(rf_pipe, X, y, cv=5, scoring='accuracy').mean()
0.811436821291821

Pipeline accuracy scores:

  • Grid search (LR): 0.828
  • Baseline (LR): 0.811
  • Baseline (RF): 0.811

As an aside, I’ve simplified the Pipeline accuracy scores table to only include the most important scores from the previous chapter. As you might guess, “LR” stands for logistic regression and “RF” stands for random forests. And going forward, I’ll always use the term “baseline” in this table to describe a Pipeline that has not undergone any hyperparameter tuning via grid search or randomized search.