So far, we’ve been trying to choose between two different models, namely logistic regression and random forests. However, you can actually use a process called ensembling to combine multiple models. The goal of ensembling is to produce a combined model, known as an ensemble, that is more accurate than any of the individual models.
The process for ensembling is simple: train multiple models on the same data, and then combine their predictions, either by averaging their predicted probabilities or by taking a majority vote of their class predictions.
The idea behind ensembling is that if you have a collection of individually imperfect models, the “one-off” errors made by each model are probably not going to be made by the rest of the models. Thus, the errors will be discarded (or at least reduced) when ensembling the models. Another way of saying this is that ensembling produces better predictions because the ensemble has a lower variance than any of the individual models.
In this chapter, we’ll ensemble our classification models two different ways, and then we’ll tune the ensemble to try to achieve even better performance.
In this lesson, we’re going to ensemble logistic regression and random forests. Because their predictions are generated using completely different processes, they’re likely to make different types of errors and thus they’re good candidates for ensembling.
Let’s see a reminder of their cross-validation scores. Both the logistic regression Pipeline and the random forest Pipeline have an accuracy of 0.811, so our goal with ensembling is to increase this score.
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
rf = RandomForestClassifier(random_state=1, n_jobs=-1)
rf_pipe = make_pipeline(ct, rf)
cross_val_score(rf_pipe, X, y, cv=5, scoring='accuracy').mean()
0.811436821291821
We’ll create the ensemble using the VotingClassifier class, which we’ll import from the ensemble module. We’ll create an instance called “vc” and pass it a list of tuples, in which the first element of the tuple is a name and the second element is a classifier object.
The options for the voting parameter are ‘soft’, in which predicted probabilities are averaged, and ‘hard’, in which only class predictions are taken into account. We’ll try soft voting first. Also, we’ll set n_jobs to -1 to enable parallel processing.
from sklearn.ensemble import VotingClassifier
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft', n_jobs=-1)
Then, we’ll create a new Pipeline called “vc_pipe” in which the VotingClassifier is the second step instead of a model.
vc_pipe = make_pipeline(ct, vc)
vc_pipe
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), ('passthrough', 'passthrough', ['Parch'])])), ('votingclassifier', VotingClassifier(estimators=[('clf1', LogisticRegression(random_state=1, solver='liblinear')), ('clf2', RandomForestClassifier(n_jobs=-1, random_state=1))], n_jobs=-1, voting='soft'))])
Let’s examine how the VotingClassifier makes predictions when using soft voting. We’ll do this by examining the predicted probabilities output by the logistic regression-based Pipeline and the random forest-based Pipeline for the first 3 samples in X_new.
As a reminder, the left column is the predicted probability of class 0 (for each sample), and the right column is the predicted probability of class 1.
pipe.fit(X, y)
pipe.predict_proba(X_new)[:3]
array([[0.88916549, 0.11083451],
[0.14200691, 0.85799309],
[0.9190551 , 0.0809449 ]])
rf_pipe.fit(X, y)
rf_pipe.predict_proba(X_new)[:3]
array([[0.99, 0.01],
[0.24, 0.76],
[0.97, 0.03]])
Now let’s use the VotingClassifier to output predicted probabilities. It uses the fit and predict_proba methods, just like any other classifier.
If you examine its predicted probabilities, you will see that it’s simply averaging the two sets of probabilities from logistic regression and random forests. For example, the average of 0.14 and 0.24 is 0.19.
vc_pipe.fit(X, y)
vc_pipe.predict_proba(X_new)[:3]
array([[0.93958275, 0.06041725],
[0.19100345, 0.80899655],
[0.94452755, 0.05547245]])
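If you want to verify the averaging yourself, here's a minimal check, assuming the fitted pipe, rf_pipe, and vc_pipe objects from above are still available:
import numpy as np

# manually average the two models' predicted probabilities (equal weights)
manual_avg = (pipe.predict_proba(X_new)[:3] + rf_pipe.predict_proba(X_new)[:3]) / 2

# this should match the soft-voting output (up to floating-point precision)
np.allclose(manual_avg, vc_pipe.predict_proba(X_new)[:3])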
In order to make class predictions for X_new, you use the predict method, which simply chooses whichever class has the higher predicted probability. In this case, it predicted 0, 1, and 0, because those classes had the higher predicted probability.
vc_pipe.predict(X_new)[:3]
array([0, 1, 0])
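Equivalently, because soft voting predicts whichever class has the higher averaged probability, taking the argmax across the columns of predict_proba reproduces these class predictions. A quick check, keeping in mind that the class labels here happen to be 0 and 1:
# column 0 holds the probability of class 0 and column 1 the probability of
# class 1, so the index of the larger value is the predicted class
vc_pipe.predict_proba(X_new)[:3].argmax(axis=1)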
In the 3 cases we just examined, logistic regression and random forests agreed on the class predictions. Let’s now examine a case in which the two models disagreed. One example of this is sample 80.
As you can see, logistic regression predicted class 0 but without much confidence.
pipe.predict_proba(X_new)[80]
array([0.51799634, 0.48200366])
Random forests predicted class 1 with more confidence.
rf_pipe.predict_proba(X_new)[80]
array([0.29, 0.71])
When VotingClassifier averages the predicted probabilities for this sample, the class 1 value is higher, thus it will predict class 1.
vc_pipe.predict_proba(X_new)[80]
array([0.40399817, 0.59600183])
Let’s move on to cross-validation to see how the VotingClassifier Pipeline with soft voting performs. Its score is 0.818, which is better than either model alone.
cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8181846713953927
Now let’s modify the VotingClassifier to use hard voting, which means that it ignores predicted probabilities and just takes a majority vote based on class predictions.
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='hard', n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)
When we cross-validate the Pipeline with hard voting, it performs a bit better, scoring 0.820.
cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8204255853367648
However, this result is actually misleading: because the ensemble contains only two classifiers, hard voting produces a tie whenever the two models disagree, and those ties have to be broken arbitrarily rather than by weighing each model's confidence in its prediction.
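To see how often that matters, you could count the samples in X_new where the two fitted Pipelines disagree; each disagreement is a 1-to-1 tie under hard voting. A quick sketch using the pipe and rf_pipe objects from earlier:
# every disagreement between the two models becomes a tie that hard voting
# has to break without considering either model's confidence
(pipe.predict(X_new) != rf_pipe.predict(X_new)).sum()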
The previous lesson brings up the obvious question: When should you use soft voting, and when should you use hard voting?
Ultimately, you can just try both soft and hard voting and see which works better, keeping in mind that hard voting results can be misleading if you have an even number of classifiers.
Because we’re using an even number of classifiers, we’ll change back to soft voting.
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft', n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)
Like with any model, we can tune the VotingClassifier’s hyperparameters using a grid search to try to improve its accuracy. Keep in mind that the best parameters for the VotingClassifier Pipeline might be different from the best parameters found when tuning each model’s Pipeline separately.
We’ll start by creating a vc_params dictionary that only includes the ColumnTransformer parameters.
vc_params = {k: v for k, v in params.items() if k.startswith('col')}
vc_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True]}
We can look at the named_steps attribute of vc_pipe to confirm that the second step name is votingclassifier (all lowercase).
vc_pipe.named_steps.keys()
dict_keys(['columntransformer', 'votingclassifier'])
And you may recall that we assigned the names clf1 and clf2 to our two models within the VotingClassifier.
vc_pipe.named_steps['votingclassifier'].named_estimators
{'clf1': LogisticRegression(random_state=1, solver='liblinear'),
'clf2': RandomForestClassifier(n_jobs=-1, random_state=1)}
Knowing those names, we can now add some model parameters to vc_params. For simplicity and speed, we’ll just tune a smaller selection of parameters and values.
vc_params['votingclassifier__clf1__penalty'] = ['l1', 'l2']
vc_params['votingclassifier__clf1__C'] = [1, 10]
vc_params['votingclassifier__clf2__n_estimators'] = [100, 300]
vc_params['votingclassifier__clf2__min_samples_leaf'] = [2, 3]
vc_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
'columntransformer__simpleimputer__add_indicator': [False, True],
'votingclassifier__clf1__penalty': ['l1', 'l2'],
'votingclassifier__clf1__C': [1, 10],
'votingclassifier__clf2__n_estimators': [100, 300],
'votingclassifier__clf2__min_samples_leaf': [2, 3]}
Finally, we’ll create a GridSearchCV object called vc_grid, making sure to use the vc_pipe and vc_params objects, and then run the search.
vc_grid = GridSearchCV(vc_pipe, vc_params, cv=5, scoring='accuracy', n_jobs=-1)
vc_grid.fit(X, y)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('pipeline', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='constant')), ('onehotencoder', OneHotEncoder())]), ['Embarked', 'Sex']), ('countvectorizer', CountVectorizer(), 'Name'), ('simpleimputer', SimpleImputer(), ['Age', 'Fare']), (... param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)], 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'], 'columntransformer__simpleimputer__add_indicator': [False, True], 'votingclassifier__clf1__C': [1, 10], 'votingclassifier__clf1__penalty': ['l1', 'l2'], 'votingclassifier__clf2__min_samples_leaf': [2, 3], 'votingclassifier__clf2__n_estimators': [100, 300]}, scoring='accuracy')
You can see that the best score has improved significantly, to 0.834.
vc_grid.best_score_
0.833864791915134
Here’s the best set of parameters that it found.
vc_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
'columntransformer__pipeline__onehotencoder__drop': None,
'columntransformer__simpleimputer__add_indicator': True,
'votingclassifier__clf1__C': 10,
'votingclassifier__clf1__penalty': 'l1',
'votingclassifier__clf2__min_samples_leaf': 3,
'votingclassifier__clf2__n_estimators': 100}
And finally, using the tuned grid to make predictions for new data is the same as always.
vc_grid.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
Ensembling generally improves the performance of your model, and so it’s a useful technique any time that performance is your highest priority. Keep in mind, however, that ensembling adds more complexity to your process, and the ensemble is also less interpretable than a single model.
If you do decide to use ensembling, my advice is to include at least 3 models in the ensemble. It’s important that all models you include are performing reasonably well on their own. And as mentioned before, it’s ideal if they generate their predictions using different processes.
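If you did want to experiment with a larger ensemble, here's a hedged sketch of what a three-model version could look like. The KNeighborsClassifier is just a hypothetical third model chosen for illustration; in practice you'd pick a third model that performs reasonably well on its own.
from sklearn.neighbors import KNeighborsClassifier

# illustrative third model (not tuned here); any reasonably accurate classifier
# that generates its predictions in a different way would be a sensible choice
knn = KNeighborsClassifier()

# soft voting now averages three sets of predicted probabilities
vc3 = VotingClassifier([('clf1', logreg), ('clf2', rf), ('clf3', knn)],
                       voting='soft', n_jobs=-1)
vc3_pipe = make_pipeline(ct, vc3)
cross_val_score(vc3_pipe, X, y, cv=5, scoring='accuracy').mean()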
By default, each model within an ensemble is given equal weight. However, you can try weighting certain models more than others to give them more “voting power” when determining the predicted class labels or predicted probabilities.
For example, we could give the logistic regression model double the voting power of the random forest model by setting the weights parameter of the VotingClassifier.
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft',
                      weights=[2, 1], n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)
Let’s see how that affects the predicted probabilities. Once again, here are the predicted probabilities output by the logistic regression and random forest models for the first 3 samples in X_new.
pipe.predict_proba(X_new)[:3]
array([[0.88916549, 0.11083451],
[0.14200691, 0.85799309],
[0.9190551 , 0.0809449 ]])
rf_pipe.predict_proba(X_new)[:3]
array([[0.99, 0.01],
[0.24, 0.76],
[0.97, 0.03]])
And here are the predicted probabilities output by the VotingClassifier. As you can see, the predicted probabilities are closer to the ones output by logistic regression because we gave that model twice the weight. For example, 0.94 is closer to 0.92 than it is to 0.97.
vc_pipe.fit(X, y)
vc_pipe.predict_proba(X_new)[:3]
array([[0.92277699, 0.07722301],
[0.17467127, 0.82532873],
[0.93603673, 0.06396327]])
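As a quick check, you can reproduce this output by computing the weighted average directly; a minimal sketch assuming the fitted pipe, rf_pipe, and weighted vc_pipe from above:
import numpy as np

# weighted average: logistic regression counts twice, random forests once
weighted_avg = (2 * pipe.predict_proba(X_new)[:3] +
                1 * rf_pipe.predict_proba(X_new)[:3]) / 3

# this should match the weighted VotingClassifier's output
np.allclose(weighted_avg, vc_pipe.predict_proba(X_new)[:3])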
You can confirm whether the weights are helping or hurting the ensemble by using cross-validation. In this case, the score is 0.816, which is slightly worse than our baseline VotingClassifier.
cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8159437574540205
You can also search for the optimal weights using a grid search. Here’s how you might add that to the vc_params dictionary.
vc_params['votingclassifier__weights'] = [(1, 1), (2, 1), (1, 2)]
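If you then re-ran the grid search with this expanded vc_params dictionary (a sketch using the same objects defined above), GridSearchCV would also evaluate each weight combination and report the winner in best_params_:
# re-run the search with the weights added to the parameter grid
vc_grid = GridSearchCV(vc_pipe, vc_params, cv=5, scoring='accuracy', n_jobs=-1)
vc_grid.fit(X, y)
vc_grid.best_params_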