12 Ensembling multiple models – Master Machine Learning with scikit-learn

12.1 Introduction to ensembling

So far, we’ve been trying to choose between two different models, namely logistic regression and random forests. However, you can actually use a process called ensembling to combine multiple models. The goal of ensembling is to produce a combined model, known as an ensemble, that is more accurate than any of the individual models.

The process for ensembling is simple:

For a regression problem, you calculate the average of the predictions made by the individual regressors and use that as your prediction.
For a classification problem, you can either average the predicted probabilities output by the classifiers, or you can let the classifiers vote on which class to predict. We’ll see examples of this below.

How to create an ensemble:

Regression: Average the predictions
Classification: Average the predicted probabilities, or let the classifiers vote on the class

The idea behind ensembling is that if you have a collection of individually imperfect models, the “one-off” errors made by each model are probably not going to be made by the rest of the models. Thus, the errors will be discarded (or at least reduced) when ensembling the models. Another way of saying this is that ensembling produces better predictions because the ensemble has a lower variance than any of the individual models.

Why does ensembling work?

“One-off” errors made by each model will be discarded when ensembling
Ensemble has a lower variance than any individual model

In this chapter, we’ll ensemble our classification models two different ways, and then we’ll tune the ensemble to try to achieve even better performance.

12.2 Ensembling logistic regression and random forests

In this lesson, we’re going to ensemble logistic regression and random forests. Because their predictions are generated using completely different processes, they’re likely to make different types of errors and thus they’re good candidates for ensembling.

Let’s see a reminder of their cross-validation scores. Both the logistic regression Pipeline and the random forest Pipeline have an accuracy of 0.811, so our goal with ensembling is to increase this score.

logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8114619295712762

rf = RandomForestClassifier(random_state=1, n_jobs=-1)
rf_pipe = make_pipeline(ct, rf)
cross_val_score(rf_pipe, X, y, cv=5, scoring='accuracy').mean()

0.811436821291821

Pipeline accuracy scores:

Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (LR): 0.811
Baseline (RF): 0.811

We’ll create the ensemble using the VotingClassifier class, which we’ll import from the ensemble module. We’ll create an instance called “vc” and pass it a list of tuples, in which the first element of the tuple is a name and the second element is a classifier object.

The options for the voting parameter are ‘soft’, in which predicted probabilities are averaged, and ‘hard’, in which only class predictions are taken into account. We’ll try soft voting first. Also, we’ll set n_jobs to -1 to enable parallel processing.

from sklearn.ensemble import VotingClassifier
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft',
                      n_jobs=-1)

Voting options for VotingClassifier:

soft: Average the predicted probabilities
hard: Majority vote using class predictions

Then, we’ll create a new Pipeline called “vc_pipe” in which the VotingClassifier is the second step instead of a model.

vc_pipe = make_pipeline(ct, vc)
vc_pipe

12.3 Combining predicted probabilities

Let’s examine how the VotingClassifier makes predictions when using soft voting. We’ll do this by examining the predicted probabilities output by the logistic regression-based Pipeline and the random forest-based Pipeline for the first 3 samples in X_new.

As a reminder, the left column is the predicted probability of class 0 (for each sample), and the right column is the predicted probability of class 1.

pipe.fit(X, y)
pipe.predict_proba(X_new)[:3]

array([[0.88916549, 0.11083451],
       [0.14200691, 0.85799309],
       [0.9190551 , 0.0809449 ]])

rf_pipe.fit(X, y)
rf_pipe.predict_proba(X_new)[:3]

array([[0.99, 0.01],
       [0.24, 0.76],
       [0.97, 0.03]])

Now let’s use the VotingClassifier to output predicted probabilities. It uses the fit and predict_proba methods, just like any other classifier.

If you examine its predicted probabilities, you will see that it’s simply averaging the two sets of probabilities from logistic regression and random forests. For example, the average of 0.14 and 0.24 is 0.19.

vc_pipe.fit(X, y)
vc_pipe.predict_proba(X_new)[:3]

array([[0.93958275, 0.06041725],
       [0.19100345, 0.80899655],
       [0.94452755, 0.05547245]])

In order to make class predictions for X_new, you use the predict method, which simply chooses whichever class has the higher predicted probability. In this case, it predicted 0, 1, and 0, because those classes had the higher predicted probability.

vc_pipe.predict(X_new)[:3]

array([0, 1, 0])

In the 3 cases we just examined, logistic regression and random forests agreed on the class predictions. Let’s now examine a case in which the two models disagreed. One example of this is sample 80.

As you can see, logistic regression predicted class 0 but without much confidence.

pipe.predict_proba(X_new)[80]

array([0.51799634, 0.48200366])

Random forests predicted class 1 with more confidence.

rf_pipe.predict_proba(X_new)[80]

array([0.29, 0.71])

When VotingClassifier averages the predicted probabilities for this sample, the class 1 value is higher, thus it will predict class 1.

vc_pipe.predict_proba(X_new)[80]

array([0.40399817, 0.59600183])

Let’s move on to cross-validation to see how the VotingClassifier Pipeline with soft voting performs. Its score is 0.818, which is better than either model alone.

cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()

0.8181846713953927

Pipeline accuracy scores:

Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (VC soft voting): 0.818
Baseline (LR): 0.811
Baseline (RF): 0.811

12.4 Combining class predictions

Now let’s modify the VotingClassifier to use hard voting, which means that it ignores predicted probabilities and just takes a majority vote based on class predictions.

vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='hard',
                      n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)

When we cross-validate the Pipeline with hard voting, it performs a bit better, scoring 0.820.

cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()

0.8204255853367648

Pipeline accuracy scores:

Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (VC hard voting): 0.820
Baseline (VC soft voting): 0.818
Baseline (LR): 0.811
Baseline (RF): 0.811

However, this result is actually misleading:

When you use hard voting with VotingClassifier, and there’s a tie, it always chooses the lowest numbered class. In other words, every time the two models disagreed, it chose class 0.
This means that hard voting is performing better than soft voting purely by chance. If the tiebreaking algorithm instead chose the highest numbered class, hard voting would be performing worse than soft voting.

Why is this result misleading?

In the case of a tie, hard voting always chooses class 0
Thus hard voting is performing better than soft voting by chance

12.5 Choosing a voting strategy

The previous lesson brings up the obvious question: When should you use soft voting, and when should you use hard voting?

Soft voting is preferred if you have an even number of models in the ensemble, and especially if you only have two models.
Soft voting is preferred if all of your models output well-calibrated predicted probabilities, whereas hard voting is preferred otherwise.
Hard voting will always work, whereas soft voting will only work if all of the models in the ensemble include the predict_proba method. For example, LinearSVC and Perceptron don’t include the predict_proba method.

Soft voting vs hard voting:

Soft voting:
- Preferred if you have an even number of models (especially two)
- Preferred if all models are well-calibrated
- Only works if all models have the predict_proba method
Hard voting:
- Preferred if some models are not well-calibrated
- Does not require the predict_proba method

Ultimately, you can just try both soft and hard voting and see which works better, keeping in mind that hard voting results can be misleading if you have an even number of classifiers.

Because we’re using an even number of classifiers, we’ll change back to soft voting.

vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft',
                      n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)

12.6 Tuning an ensemble with grid search

Like with any model, we can tune the VotingClassifier’s hyperparameters using a grid search to try to improve its accuracy. Keep in mind that the best parameters for the VotingClassifier Pipeline might be different than the parameters for either model when they were tuned separately.

We’ll start by creating a vc_params dictionary that only includes the ColumnTransformer parameters.

vc_params = {k:v for k, v in params.items() if k.startswith('col')}
vc_params

{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True]}

We can look at the named_steps attribute of vc_pipe to confirm that the second step name is votingclassifier (all lowercase).

vc_pipe.named_steps.keys()

dict_keys(['columntransformer', 'votingclassifier'])

And you may recall that we assigned the names clf1 and clf2 to our two models within the VotingClassifier.

vc_pipe.named_steps['votingclassifier'].named_estimators

{'clf1': LogisticRegression(random_state=1, solver='liblinear'),
 'clf2': RandomForestClassifier(n_jobs=-1, random_state=1)}

Knowing those names, we can now add some model parameters to vc_params. For simplicity and speed, we’ll just tune a smaller selection of parameters and values.

vc_params['votingclassifier__clf1__penalty'] = ['l1', 'l2']
vc_params['votingclassifier__clf1__C'] = [1, 10]
vc_params['votingclassifier__clf2__n_estimators'] = [100, 300]
vc_params['votingclassifier__clf2__min_samples_leaf'] = [2, 3]
vc_params

{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True],
 'votingclassifier__clf1__penalty': ['l1', 'l2'],
 'votingclassifier__clf1__C': [1, 10],
 'votingclassifier__clf2__n_estimators': [100, 300],
 'votingclassifier__clf2__min_samples_leaf': [2, 3]}

Finally, we’ll create a GridSearchCV object called vc_grid, making sure to use the vc_pipe and vc_params objects, and then run the search.

vc_grid = GridSearchCV(vc_pipe, vc_params, cv=5, scoring='accuracy',
                       n_jobs=-1)
vc_grid.fit(X, y)

You can see that the best score has improved significantly, to 0.834.

vc_grid.best_score_

0.833864791915134

Pipeline accuracy scores:

Grid search (VC soft voting): 0.834
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (VC hard voting): 0.820
Baseline (VC soft voting): 0.818
Baseline (LR): 0.811
Baseline (RF): 0.811

Here’s the best set of parameters that it found.

vc_grid.best_params_

{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': True,
 'votingclassifier__clf1__C': 10,
 'votingclassifier__clf1__penalty': 'l1',
 'votingclassifier__clf2__min_samples_leaf': 3,
 'votingclassifier__clf2__n_estimators': 100}

And finally, using the tuned grid to make predictions for new data is the same as always.

vc_grid.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

12.7 Q&A: When should I use ensembling?

Ensembling generally improves the performance of your model, and so it’s a useful technique any time that performance is your highest priority. Keep in mind, however, that ensembling adds more complexity to your process, and the ensemble is also less interpretable than a single model.

Should you ensemble?

Advantages:
- Improves model performance
Disadvantages:
- Adds more complexity
- Decreases interpretability

If you do decide to use ensembling, my advice is to include at least 3 models in the ensemble. It’s important that all models you include are performing reasonably well on their own. And as mentioned before, it’s ideal if they generate their predictions using different processes.

Advice for ensembling:

Include at least 3 models
Models should be performing well on their own
Ideal if they generate predictions using different processes

12.8 Q&A: How do I apply different weights to the models in an ensemble?

By default, each model within an ensemble is given equal weight. However, you can try weighting certain models more than others to give them more “voting power” when determining the predicted class labels or predicted probabilities.

For example, we could give the logistic regression model double the voting power of the random forest model by setting the weights parameter of the VotingClassifier.

vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft',
                      weights=[2, 1], n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)

Let’s see how that affects the predicted probabilities. Once again, here are the predicted probabilities output by the logistic regression and random forest models for the first 3 samples in X_new.

pipe.predict_proba(X_new)[:3]

array([[0.88916549, 0.11083451],
       [0.14200691, 0.85799309],
       [0.9190551 , 0.0809449 ]])

rf_pipe.predict_proba(X_new)[:3]

array([[0.99, 0.01],
       [0.24, 0.76],
       [0.97, 0.03]])

And here are the predicted probabilities output by the VotingClassifier. As you can see, the predicted probabilities are closer to the ones output by logistic regression because we gave that model twice the weight. For example, 0.94 is closer to 0.92 than it is to 0.97.

vc_pipe.fit(X, y)
vc_pipe.predict_proba(X_new)[:3]

array([[0.92277699, 0.07722301],
       [0.17467127, 0.82532873],
       [0.93603673, 0.06396327]])

You can confirm whether the weights are helping or hurting the ensemble by using cross-validation. In this case, the score is 0.816, which is slightly worse than our baseline VotingClassifier.

cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()

0.8159437574540205

Pipeline accuracy scores:

Grid search (VC soft voting): 0.834
Grid search (RF): 0.829
Grid search (LR): 0.828
Baseline (VC hard voting): 0.820
Baseline (VC soft voting): 0.818
Baseline (VC soft voting with LR weighted): 0.816
Baseline (LR): 0.811
Baseline (RF): 0.811

You can also search for the optimal weights using a grid search. Here’s how you might add that to the vc_params dictionary.

vc_params['votingclassifier__weights'] = [(1, 1), (2, 1), (1, 2)]