11  Comparing linear and non-linear models

11.1 Trying a random forest model

So far, the only model we’ve used in this book is logistic regression. But what if you wanted to try a different model?

One great thing about the scikit-learn API is that once you’ve built a workflow, you can easily swap in a different model, usually without making any other changes to your workflow. This is a huge benefit of scikit-learn, because it’s not possible to know ahead of time which model will work best for a given problem and dataset. This idea is formalized by the “no free lunch” theorem, which says that no single model is the best choice for every problem.
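To see why swapping models is so easy, here’s a minimal sketch (using scikit-learn’s built-in iris dataset rather than this book’s data): because every estimator shares the same fit/predict interface, the identical evaluation code works for any model you drop in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# toy dataset for illustration only (not this book's dataset)
X_demo, y_demo = load_iris(return_X_y=True)

# the same cross-validation code runs unchanged for either model
for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=1)]:
    score = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    print(type(model).__name__, round(score, 3))
```

The loop body never mentions which model it’s evaluating, which is exactly what makes swapping models into an existing workflow nearly effortless.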

In this chapter, we’re going to try out the random forest model, which is one of the most well-known models in Machine Learning. Whereas logistic regression is a linear model, a random forest is a non-linear model based on decision trees. These two types of models have different overall properties, so it may turn out that one type is better suited to this particular problem.

Random forest model:

  • Non-linear model
  • Based on decision trees
  • Different properties from logistic regression
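To illustrate this linear vs. non-linear distinction, here’s a quick sketch on a toy dataset with a non-linear class boundary (scikit-learn’s make_moons, not this book’s dataset). On data like this, a tree-based model can usually capture the boundary that a linear model can’t.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# toy dataset with two interleaving half-moon classes (a non-linear boundary)
X_toy, y_toy = make_moons(n_samples=500, noise=0.25, random_state=1)

# linear model: limited to a straight-line decision boundary
lr_score = cross_val_score(LogisticRegression(), X_toy, y_toy,
                           cv=5, scoring='accuracy').mean()

# non-linear model: ensemble of decision trees
rf_score = cross_val_score(RandomForestClassifier(random_state=1), X_toy, y_toy,
                           cv=5, scoring='accuracy').mean()

print('Logistic regression:', round(lr_score, 3))
print('Random forest:      ', round(rf_score, 3))
```

Keep in mind this is a contrived example: on real datasets (including the one in this book), a non-linear model won’t necessarily outperform a linear one, which is why we try both.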

We start out by importing the RandomForestClassifier class from the ensemble module, and creating an instance called rf. Because there’s randomness involved in a random forest, we’ll set a random state for reproducibility. And because building a random forest can be computationally expensive, it has its own n_jobs parameter (just like grid search and randomized search), which we’ll set to -1 to enable parallel processing.

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=1, n_jobs=-1)

We’ll create a new Pipeline object called rf_pipe that uses random forests instead of logistic regression.

rf_pipe = make_pipeline(ct, rf)
rf_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(n_jobs=-1, random_state=1))])

And we can cross-validate it to generate a baseline accuracy, which is 0.811. This accuracy is nearly identical to the baseline accuracy of our logistic regression Pipeline, but it’s likely that we can improve it through hyperparameter tuning.

cross_val_score(rf_pipe, X, y, cv=5, scoring='accuracy').mean()
0.811436821291821

Pipeline accuracy scores:

  • Grid search (LR): 0.828
  • Baseline (LR): 0.811
  • Baseline (RF): 0.811

As an aside, I’ve simplified the Pipeline accuracy scores table to only include the most important scores from the previous chapter. As you might guess, “LR” stands for logistic regression and “RF” stands for random forests. And going forward, I’ll always use the term “baseline” in this table to describe a Pipeline that has not undergone any hyperparameter tuning via grid search or randomized search.