So far, we’ve been trying to choose between two different models, namely logistic regression and random forests. However, you can actually use a process called ensembling to combine multiple models. The goal of ensembling is to produce a combined model, known as an ensemble, that is more accurate than any of the individual models.
The process for ensembling is simple: build multiple models using the same training data, have each model make predictions, and then combine those predictions (for example, by averaging predicted probabilities or taking a majority vote).
The idea behind ensembling is that if you have a collection of individually imperfect models, the “one-off” errors made by each model are probably not going to be made by the rest of the models. Thus, the errors will be discarded (or at least reduced) when ensembling the models. Another way to say this is that ensembling produces better predictions because the ensemble has a lower variance than any of the individual models.
In this chapter, we’ll ensemble our classification models two different ways, and then we’ll tune the ensemble to try to achieve even better performance.
In this lesson, we’re going to ensemble logistic regression and random forests. Because their predictions are generated using completely different processes, they’re likely to make different types of errors and thus they’re good candidates for ensembling.
Let’s see a reminder of their cross-validation scores. Both the logistic regression and random forest Pipelines have an accuracy of 0.811, so our goal with ensembling is to increase this score.
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8114619295712762
rf = RandomForestClassifier(random_state=1, n_jobs=-1)
rf_pipe = make_pipeline(ct, rf)
cross_val_score(rf_pipe, X, y, cv=5, scoring='accuracy').mean()
0.811436821291821
We’ll create the ensemble using the VotingClassifier class, which we’ll import from the ensemble module. We’ll create an instance called vc and pass it a list of tuples, in which the first element of the tuple is a name and the second element is a classifier object.
The options for the voting parameter are 'soft', in which predicted probabilities are averaged, and 'hard', in which only class predictions are taken into account. We’ll try soft voting first. Also, we’ll set n_jobs to -1 to enable parallel processing.
from sklearn.ensemble import VotingClassifier
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft',
n_jobs=-1)

Then, we’ll create a new Pipeline called vc_pipe in which the VotingClassifier is the second step instead of a model.
vc_pipe = make_pipeline(ct, vc)
vc_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('votingclassifier',
                 VotingClassifier(estimators=[('clf1',
                                               LogisticRegression(random_state=1,
                                                                  solver='liblinear')),
                                              ('clf2',
                                               RandomForestClassifier(n_jobs=-1,
                                                                      random_state=1))],
                                  n_jobs=-1, voting='soft'))])
Let’s examine how the VotingClassifier makes predictions when using soft voting. We’ll do this by examining the predicted probabilities output by the logistic regression and random forest Pipelines for the first 3 samples in X_new.
As a reminder, the left column is the predicted probability of class 0 (for each sample), and the right column is the predicted probability of class 1.
pipe.fit(X, y)
pipe.predict_proba(X_new)[:3]
array([[0.88916549, 0.11083451],
       [0.14200691, 0.85799309],
       [0.9190551 , 0.0809449 ]])
rf_pipe.fit(X, y)
rf_pipe.predict_proba(X_new)[:3]
array([[0.99, 0.01],
       [0.24, 0.76],
       [0.97, 0.03]])
Now we’ll use the VotingClassifier to output predicted probabilities. It uses the fit and predict_proba methods, just like any other classifier.
If you examine its predicted probabilities, you’ll see that it’s simply averaging the two sets of probabilities from logistic regression and random forests. For example, the 0.19 in the left column of the second sample is the average of 0.14 (from logistic regression) and 0.24 (from random forests).
vc_pipe.fit(X, y)
vc_pipe.predict_proba(X_new)[:3]
array([[0.93958275, 0.06041725],
       [0.19100345, 0.80899655],
       [0.94452755, 0.05547245]])
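As a quick check (not part of the original output), you can average the two Pipelines’ predicted probabilities yourself with NumPy and confirm that the result matches the soft voting output above.

import numpy as np

# With equal weights, soft voting is just the element-wise mean of the two
# models' predicted probabilities for each sample
np.mean([pipe.predict_proba(X_new)[:3], rf_pipe.predict_proba(X_new)[:3]], axis=0)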
To make class predictions for X_new, you use the predict method, which simply chooses whichever class has the higher predicted probability. In this case, it predicted 0, 1, and 0 for the first three samples.
vc_pipe.predict(X_new)[:3]
array([0, 1, 0])
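Equivalently (again, a check rather than output from the original), taking the argmax of each row of averaged probabilities reproduces these class predictions.

import numpy as np

# The predicted class is the column with the higher averaged probability
np.argmax(vc_pipe.predict_proba(X_new)[:3], axis=1)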
In the 3 cases we just examined, logistic regression and random forests agreed on the class predictions. Let’s now examine a case in which the two models disagreed.
For sample 80, logistic regression predicted class 0 but without much confidence.
pipe.predict_proba(X_new)[80]
array([0.51799634, 0.48200366])
Random forests predicted class 1 with more confidence.
rf_pipe.predict_proba(X_new)[80]
array([0.29, 0.71])
When the VotingClassifier averages the predicted probabilities for this sample, the class 1 value is higher than the class 0 value, so it will predict class 1.
vc_pipe.predict_proba(X_new)[80]
array([0.40399817, 0.59600183])
Let’s move on to cross-validation to see how the VotingClassifier Pipeline with soft voting performs. Its score is 0.818, which is better than either model alone.
cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8181846713953927
Now let’s modify the VotingClassifier to use hard voting, which means that it ignores predicted probabilities and just takes a majority vote based on class predictions.
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='hard',
n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)

When we cross-validate the Pipeline with hard voting, it performs a bit better, scoring 0.820.

cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8204255853367648
However, this result is actually misleading. Because we only have two models in the ensemble, there can be ties in the voting, and when there’s a tie with hard voting, the VotingClassifier always chooses the lowest numbered class. In other words, every time the two models disagreed, it chose class 0.

The previous lesson brings up the obvious question: When should you use soft voting, and when should you use hard voting? Soft voting takes each model’s confidence into account, but it only works if every model in the ensemble has a predict_proba method. For example, LinearSVC and Perceptron don’t include the predict_proba method. Ultimately, you can just try both soft and hard voting and see which works better, keeping in mind that hard voting results can be misleading if you have an even number of classifiers.
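If you want to see this tie-breaking behavior directly, here’s a minimal check (not shown in the original) using sample 80, where the two models disagree.

# Fit the hard-voting Pipeline, then compare the three predictions for sample 80:
# logistic regression predicts 0, random forests predicts 1, and the 1-1 tie
# resolves to class 0 (the lowest numbered class)
vc_pipe.fit(X, y)
pipe.predict(X_new)[80], rf_pipe.predict(X_new)[80], vc_pipe.predict(X_new)[80]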
Because we’re using an even number of classifiers, we’ll change back to soft voting.
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft',
n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)

As with any model, we can tune the VotingClassifier’s hyperparameters using a grid search to try to improve its accuracy. Keep in mind that the best parameters for the VotingClassifier Pipeline might be different from the parameters for either model when they were tuned separately.
We’ll start by creating a vc_params dictionary that only includes the ColumnTransformer parameters.
vc_params = {k:v for k, v in params.items() if k.startswith('col')}
vc_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True]}
We can look at the named_steps attribute of vc_pipe to confirm that the second step name is votingclassifier.
vc_pipe.named_steps.keys()
dict_keys(['columntransformer', 'votingclassifier'])
And you may recall that we assigned the names clf1 and clf2 to our two models within the VotingClassifier.
vc_pipe.named_steps['votingclassifier'].named_estimators
{'clf1': LogisticRegression(random_state=1, solver='liblinear'),
 'clf2': RandomForestClassifier(n_jobs=-1, random_state=1)}
Knowing those names, we can now add some model parameters to vc_params. For simplicity and speed, we’ll just tune a smaller selection of parameters and values.
vc_params['votingclassifier__clf1__penalty'] = ['l1', 'l2']
vc_params['votingclassifier__clf1__C'] = [1, 10]
vc_params['votingclassifier__clf2__n_estimators'] = [100, 300]
vc_params['votingclassifier__clf2__min_samples_leaf'] = [2, 3]
vc_params
{'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True],
 'votingclassifier__clf1__penalty': ['l1', 'l2'],
 'votingclassifier__clf1__C': [1, 10],
 'votingclassifier__clf2__n_estimators': [100, 300],
 'votingclassifier__clf2__min_samples_leaf': [2, 3]}
Finally, we’ll create a GridSearchCV object called vc_grid, making sure to use the vc_pipe and vc_params objects, and then run the search.
vc_grid = GridSearchCV(vc_pipe, vc_params, cv=5, scoring='accuracy',
n_jobs=-1)
vc_grid.fit(X, y)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder())]),
                                                                         ['Embarked', 'Sex']),
                                                                        ('countvectorizer',
                                                                         CountVectorizer(), 'Name'),
                                                                        ('simpleimputer',
                                                                         SimpleImputer(),
                                                                         ['Age', 'Fare']),
                                                                        (...
             param_grid={'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
                         'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
                         'columntransformer__simpleimputer__add_indicator': [False, True],
                         'votingclassifier__clf1__C': [1, 10],
                         'votingclassifier__clf1__penalty': ['l1', 'l2'],
                         'votingclassifier__clf2__min_samples_leaf': [2, 3],
                         'votingclassifier__clf2__n_estimators': [100, 300]},
             scoring='accuracy')
You can see that the best score has improved significantly, to 0.834.
vc_grid.best_score_
0.833864791915134
Here’s the best set of parameters that it found.
vc_grid.best_params_
{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': True,
 'votingclassifier__clf1__C': 10,
 'votingclassifier__clf1__penalty': 'l1',
 'votingclassifier__clf2__min_samples_leaf': 3,
 'votingclassifier__clf2__n_estimators': 100}
Finally, using the tuned grid to make predictions for new data is the same as always.
vc_grid.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
Ensembling generally improves the performance of your model, and so it’s a useful technique any time that performance is your highest priority. Keep in mind, however, that ensembling adds more complexity to your process, and the ensemble is also less interpretable than a single model.
If you do decide to use ensembling, my advice is to include at least 3 models in the ensemble (when possible). It’s important that all models you include are performing reasonably well on their own. And as mentioned before, it’s ideal if they generate their predictions using different processes.
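For example, here’s a sketch of a three-model ensemble. The third model (a KNeighborsClassifier) is purely illustrative and isn’t part of this chapter; any reasonably accurate classifier that generates its predictions differently would work.

from sklearn.neighbors import KNeighborsClassifier

# A third model avoids ties with hard voting and adds diversity to the ensemble
knn = KNeighborsClassifier()
vc3 = VotingClassifier([('clf1', logreg), ('clf2', rf), ('clf3', knn)],
                       voting='soft', n_jobs=-1)
vc3_pipe = make_pipeline(ct, vc3)
cross_val_score(vc3_pipe, X, y, cv=5, scoring='accuracy').mean()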
By default, each model within an ensemble is given equal weight. However, you can try weighting certain models more than others to give them more “voting power” when determining the predicted class labels or predicted probabilities.
For example, we could give the logistic regression model double the voting power of the random forest model by setting the weights parameter of the VotingClassifier.
vc = VotingClassifier([('clf1', logreg), ('clf2', rf)], voting='soft',
weights=[2, 1], n_jobs=-1)
vc_pipe = make_pipeline(ct, vc)

Let’s see how that affects the predicted probabilities. Once again, here are the predicted probabilities output by the logistic regression and random forest Pipelines for the first 3 samples in X_new.
pipe.predict_proba(X_new)[:3]
array([[0.88916549, 0.11083451],
       [0.14200691, 0.85799309],
       [0.9190551 , 0.0809449 ]])
rf_pipe.predict_proba(X_new)[:3]
array([[0.99, 0.01],
       [0.24, 0.76],
       [0.97, 0.03]])
And here are the predicted probabilities output by the VotingClassifier. As you can see, the predicted probabilities are closer to the ones output by logistic regression because we gave that model twice the weight. For example, the 0.17 in the left column of the second sample is closer to 0.14 (from logistic regression) than it is to 0.24 (from random forests).
vc_pipe.fit(X, y)
vc_pipe.predict_proba(X_new)[:3]
array([[0.92277699, 0.07722301],
       [0.17467127, 0.82532873],
       [0.93603673, 0.06396327]])
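You can reproduce these numbers yourself (a check, not original output) with a weighted average, since weights of 2 and 1 mean the ensemble probability is (2 * logreg + 1 * rf) / 3.

import numpy as np

# Weighted average of the two models' probabilities, matching weights=[2, 1] above
np.average([pipe.predict_proba(X_new)[:3], rf_pipe.predict_proba(X_new)[:3]],
           axis=0, weights=[2, 1])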
You can confirm whether the weights are helping or hurting the ensemble by using cross-validation. In this case, the score is 0.816, which is slightly worse than our baseline VotingClassifier.
cross_val_score(vc_pipe, X, y, cv=5, scoring='accuracy').mean()
0.8159437574540205
You can also search for the optimal weights using a grid search. Here’s how you might add that to the vc_params dictionary.
vc_params['votingclassifier__weights'] = [(1, 1), (2, 1), (1, 2)]
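From there, the search itself works the same as before. Here’s a sketch (assuming the objects defined above) of rerunning the grid search with the weights included; its best score and parameters may differ from the earlier run.

# Rerun the grid search with the expanded vc_params (now including the weights)
vc_grid = GridSearchCV(vc_pipe, vc_params, cv=5, scoring='accuracy', n_jobs=-1)
vc_grid.fit(X, y)
vc_grid.best_score_, vc_grid.best_params_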