16 Workflow review #3

16.1 Recap of our workflow

In this chapter, we’re going to do one final review of the core workflow that we built throughout this book, including all of the features that we developed in the previous chapter.

We begin by importing pandas and NumPy, the four transformer classes we’re using, one modeling class, and two composition functions.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
Next, we create a list of the eight columns we’re going to select from our data.
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age', 'Cabin', 'SibSp']
Then, we read in all of our training data and use it to define our X and y.
df = pd.read_csv('http://bit.ly/MLtrain')
X = df[cols]
y = df['Survived']
And we read in all of the new data and use it to define X_new.
df_new = pd.read_csv('http://bit.ly/MLnewdata')
X_new = df_new[cols]
We create five instances of our transformers, namely two different instances of SimpleImputer, two different instances of OneHotEncoder, and one instance of CountVectorizer.
imp = SimpleImputer()
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
vect = CountVectorizer()
We define two custom functions that will be used for feature engineering.
def first_letter(df):
    return pd.DataFrame(df).apply(lambda x: x.str.slice(0, 1))

def sum_cols(df):
    return np.array(df).sum(axis=1).reshape(-1, 1)
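If you want to confirm what these functions return, you can try them out on a tiny DataFrame (this demo data is made up):

demo = pd.DataFrame({'Cabin': ['C85', 'E46'], 'SibSp': [1, 0], 'Parch': [0, 2]})
first_letter(demo[['Cabin']])          # first letters: 'C' and 'E'
sum_cols(demo[['SibSp', 'Parch']])     # row sums: [[1], [2]]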
We convert two NumPy functions and our two custom functions into transformers.
ceiling = FunctionTransformer(np.ceil)
clip = FunctionTransformer(np.clip, kw_args={'a_min': 5, 'a_max': 60})
letter = FunctionTransformer(first_letter)
total = FunctionTransformer(sum_cols)
We create four Pipelines that combine the various transformers.
imp_ohe = make_pipeline(imp_constant, ohe)
imp_ceiling = make_pipeline(imp, ceiling)
imp_clip = make_pipeline(imp, clip)
letter_imp_ohe = make_pipeline(letter, imp_constant, ohe_ignore)
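To make the chaining concrete, letter_imp_ohe will extract the first letter of Cabin, fill in missing values with the string 'missing', and then one-hot encode the letters. You can run it on its own if you’d like to see the result (fitting it here isn’t required, since the full Pipeline will refit it later):

letter_imp_ohe.fit_transform(X[['Cabin']])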
And then we build the ColumnTransformer, which imputes and one-hot encodes Embarked and Sex, vectorizes Name, imputes and takes the ceiling of Fare, imputes and clips Age, imputes and one-hot encodes the first letter of Cabin, and sums SibSp and Parch into a single feature.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_ceiling, ['Fare']),
    (imp_clip, ['Age']),
    (letter_imp_ohe, ['Cabin']),
    (total, ['SibSp', 'Parch']))
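Although it’s not a required step, you can sanity-check the ColumnTransformer on its own before modeling. The exact number of output columns depends on your data, mainly the size of the vocabulary that CountVectorizer learns from Name:

ct.fit_transform(X).shape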
We also create an instance of logistic regression.
logreg = LogisticRegression(solver='liblinear', random_state=1)
We create a two-step modeling Pipeline and fit the Pipeline to X and y.
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer', ColumnTransformer(...)),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
Finally, we use the fitted Pipeline to make predictions on X_new.
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
Keep in mind that there are many other steps you can incorporate into this workflow in order to potentially improve performance, including hyperparameter tuning, trying a different model, ensembling, feature selection, feature standardization, and additional feature engineering.
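For instance, here’s a minimal hyperparameter tuning sketch using a grid search (the parameter values are just illustrative, not tuned recommendations):

from sklearn.model_selection import GridSearchCV
params = {'logisticregression__C': [0.1, 1, 10],
          'logisticregression__penalty': ['l1', 'l2']}
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
grid.best_params_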
16.2 What’s the role of pandas?
If we can do all of our data transformations in scikit-learn, then you might be left wondering: What’s the role of pandas? In my mind, pandas still plays at least three important roles.
First is data exploration and visualization. A deep understanding of your dataset will help you with many steps of the Machine Learning workflow, especially selecting which features to use and deciding how you might want to transform your features.
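For example, a few quick pandas commands can reveal a lot about this dataset:

df['Survived'].value_counts(normalize=True)    # class balance
df.groupby('Sex')['Survived'].mean()           # survival rate by sex
df['Age'].describe()                           # distribution of Age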
Second is testing out data transformations for Machine Learning. If I’m thinking about building a custom transformer in scikit-learn, I first create it using pandas to make sure that it works.
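For example, before writing the first_letter function, I could have prototyped the transformation in pandas to confirm that it extracts what I expect:

df['Cabin'].str.slice(0, 1).head()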
Finally, if your goal is anything other than Machine Learning, then all of your data transformations should still be executed using pandas.
The bottom line is that pandas still has a huge role in the data science workflow. However, if your goal is Machine Learning, then it’s best to shift as much of your workflow as possible to scikit-learn.