4 Improving your workflow with ColumnTransformer and Pipeline
4.1 Preprocessing features with ColumnTransformer
In the last chapter, our goal was to include two numeric features and two categorical features in our model. We saw how to numerically encode the categorical features using OneHotEncoder, but we lacked an efficient process for stacking those encoded features next to the numeric features, and we lacked an efficient way to apply this same preprocessing to our new data.

In this chapter, we’re going to solve both of those problems using the ColumnTransformer and Pipeline classes:

- ColumnTransformer will make it easy to apply different preprocessing steps to different columns.
- Pipeline will make it easy to apply the same workflow to training data and new data.
To start, we’ll create a Python list of the four columns we’ve been working with, and use that to create our X object.

cols = ['Parch', 'Fare', 'Embarked', 'Sex']
X = df[cols]
X
| | Parch | Fare | Embarked | Sex |
|---|---|---|---|---|
| 0 | 0 | 7.2500 | S | male |
| 1 | 0 | 71.2833 | C | female |
| 2 | 0 | 7.9250 | S | female |
| 3 | 0 | 53.1000 | S | female |
| 4 | 0 | 8.0500 | S | male |
| 5 | 0 | 8.4583 | Q | male |
| 6 | 0 | 51.8625 | S | male |
| 7 | 1 | 21.0750 | S | male |
| 8 | 2 | 11.1333 | S | female |
| 9 | 0 | 30.0708 | C | female |
We’re still going to be one-hot encoding the Embarked and Sex columns, so we’ll create an instance of OneHotEncoder. We’re using the default options for OneHotEncoder, which means it will output a sparse matrix, but that’s fine because we’re not going to examine the output directly.
ohe = OneHotEncoder()
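That said, if you ever do want to inspect the encoding directly, you can ask the encoder for a dense array instead. A hedged note on versions: the relevant parameter is named sparse before scikit-learn 1.2 and sparse_output from 1.2 onward, so adjust this sketch for your version.

# Dense output sketch: use sparse_output=False on scikit-learn 1.2+
ohe_dense = OneHotEncoder(sparse=False)
ohe_dense.fit_transform(X[['Embarked', 'Sex']])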
Now it’s time to create our first ColumnTransformer, which will take care of any data transformations that we specify. We’ll start by importing the make_column_transformer function from the compose module.

In general, you use make_column_transformer by passing it one or more tuples, and each tuple should have two elements:

- The first element is a transformer.
- The second element is a list of columns to which that transformer should be applied. Note that in most cases, this element should be a list even if you are only specifying a single column.
In our case, we’ll pass it a single tuple in which the first element is our OneHotEncoder object and the second element is a list of the two columns we want to one-hot encode.

After all tuples, we’ll set the remainder parameter to 'drop', which means that all columns which are not explicitly mentioned in the ColumnTransformer should be dropped. 'drop' is actually the default value for remainder, but I’m including it here just for clarity.

Note that I could have defined the ColumnTransformer on a single line, but I prefer breaking the lines in this way for readability.

When we run this code, the make_column_transformer function returns a ColumnTransformer object, which we’ll save as ct.
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='drop')
Next, we’ll perform the transformation by passing X, which is our four-column DataFrame, to the fit_transform method of the ct object. It outputs a 10 by 5 array that represents the one-hot encoding of the Embarked and Sex columns. The first three columns represent Embarked and the other two columns represent Sex, and they’re in that order because that’s the order in which they were listed in the ColumnTransformer.

Note that even though the Parch and Fare columns are part of X, they’re excluded from the output array because we told the ColumnTransformer to drop all unspecified columns.
ct.fit_transform(X)
array([[0., 0., 1., 0., 1.],
[1., 0., 0., 1., 0.],
[0., 0., 1., 1., 0.],
[0., 0., 1., 1., 0.],
[0., 0., 1., 0., 1.],
[0., 1., 0., 0., 1.],
[0., 0., 1., 0., 1.],
[0., 0., 1., 0., 1.],
[0., 0., 1., 1., 0.],
[1., 0., 0., 1., 0.]])
This is nice, but our actual goal was to create a matrix that includes the Parch and Fare columns alongside the encoded versions of Embarked and Sex. To accomplish that, we’ll simply change the value of remainder from 'drop' to 'passthrough'. This means that all columns which are not mentioned in the ColumnTransformer should be passed through to the output unmodified. In other words, include the Parch and Fare columns in the output, but don’t transform them in any way.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')
When we run the fit_transform method this time, it outputs a 10 by 7 array. The first five columns represent the encoded Embarked and Sex columns, and the sixth and seventh columns are the Parch and Fare columns. The column order is based on the order in which the columns were listed in the ColumnTransformer, followed by any passthrough columns.
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
We were able to figure out on our own what each column represents, but you can also use the ColumnTransformer’s get_feature_names method to confirm the meanings of these 7 features. The x0 simply means feature 0 that was passed to the OneHotEncoder, and the x1 means feature 1.
ct.get_feature_names()
['onehotencoder__x0_C',
'onehotencoder__x0_Q',
'onehotencoder__x0_S',
'onehotencoder__x1_female',
'onehotencoder__x1_male',
'Parch',
'Fare']
Before we move on, I have two quick asides about the get_feature_names method:

- First, the get_feature_names method didn’t work with passthrough columns prior to scikit-learn version 0.23, so you’ll get an error if you run the code with previous versions.
- Second, the get_feature_names method has been replaced with a similar method called get_feature_names_out starting in scikit-learn 1.0, as sketched below.
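If you’re running scikit-learn 1.0 or later, the equivalent call is sketched below. A hedge on the output: on newer versions, the names are built from the actual column names rather than x0/x1, and the passthrough columns may carry a remainder__ prefix, so expect names along the lines of onehotencoder__Embarked_C and remainder__Parch.

# scikit-learn 1.0+ only: get_feature_names_out replaces get_feature_names
ct.get_feature_names_out()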
To wrap up this lesson, I want to show you one other way to specify this same ColumnTransformer.

As I mentioned before, make_column_transformer accepts tuples, and the first element of each tuple is usually a transformer object (like our ohe object). However, the first element of the tuple can also be the special string 'drop' or 'passthrough', which tells the ColumnTransformer to drop or pass through specific columns.

So, we’re going to add a second tuple in which the transformer is the string 'passthrough', and we want to apply this passthrough transformer to the columns Parch and Fare. This ColumnTransformer will do the exact same thing as the previous one, but I actually prefer this notation any time I have a small number of passthrough columns, since it reminds me of which columns I’m passing through.

It’s still important to remember that the default value for the remainder parameter is 'drop', which means that any unspecified columns will be dropped, though we don’t have any unspecified columns in this case.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
We’ll run the fit_transform method one more time, and you can see that it outputs the same 7 columns as before. And to be clear, this is the feature matrix that we will pass to our model.
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
4.2 Chaining steps with Pipeline
In the previous lesson, we accomplished our first goal, which was to apply different preprocessing to different columns using ColumnTransformer. In this lesson, we’re moving on to our second goal, which is to apply the same workflow to training data and new data using the Pipeline class.

A Pipeline is used to chain together sequential steps. In this case, we want to chain together two steps, namely data preprocessing followed by model building.

We’ll start by importing the make_pipeline function. Then, we can create a Pipeline instance by passing it two objects: our ColumnTransformer instance for data preprocessing, and our LogisticRegression instance for model building. We’ll save it as an object called pipe, which is a 2-step Pipeline.
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)
You might remember that back in Chapter 2, we used cross-validation to evaluate our model when it only included the Parch and Fare features. Now that we’ve added the Embarked and Sex features, it would normally make sense to cross-validate the updated model to see whether adding those features made our model better or worse. And in fact, you can (and should) cross-validate an entire Pipeline, as sketched below.

However, any model evaluation procedure is highly unreliable with only 10 rows of data, and so any change in the cross-validated accuracy would be misleading. Thus we’re going to skip the cross-validation step for the moment, though we’ll return to it in a later chapter once we’re using the full dataset.
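For reference, here’s what cross-validating the entire Pipeline would look like. This is just a sketch, since with 10 rows the resulting scores wouldn’t mean anything.

from sklearn.model_selection import cross_val_score

# Cross-validate the whole Pipeline: the ColumnTransformer is re-fit on each
# training fold, so no information leaks from the validation fold
cross_val_score(pipe, X, y, cv=3, scoring='accuracy').mean()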
Since we’re skipping cross-validation, our next step is just to run the fit method on the Pipeline, and pass it X and y. Here’s what happens when we fit the Pipeline:

- First, it runs the ColumnTransformer step, meaning that it takes X, which is a 4-column DataFrame that contains both numbers and strings, and transforms it into the 7-column feature matrix that only includes numbers.
- Second, it runs the LogisticRegression step, meaning that the model is fit to this 7-column feature matrix. In other words, it learns the relationship between those 7 features and the y values.
Note that when you fit a Pipeline, it will actually print out the steps. You can see that step 1 is a ColumnTransformer that includes a OneHotEncoder and a passthrough transformer, and step 2 is a LogisticRegression model.
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
In case it helps you to understand the Pipeline better, I’m going to show you what happens “under the hood” when you fit this Pipeline. To be clear, you should not actually write the following code; rather, it is just for teaching purposes.

First, X is transformed by the ColumnTransformer into X_t, which stands for “X transformed”. Second, the LogisticRegression model is fit on X_t and y.
X_t = ct.fit_transform(X)
logreg.fit(X_t, y)
LogisticRegression(random_state=1, solver='liblinear')
And as you would expect, X has the shape 10 by 4, and X_t has the shape 10 by 7.
print(X.shape)
print(X_t.shape)
(10, 4)
(10, 7)
4.3 Using the Pipeline to make predictions
Now that we’ve fit our Pipeline, we want to use it to make predictions on new data.

The first step is to update the X_new DataFrame so that it contains the same columns as X. Recall that the cols object contains the names of our four columns, and so we can use it to select those four columns from the df_new DataFrame.
X_new = df_new[cols]
X_new
| | Parch | Fare | Embarked | Sex |
|---|---|---|---|---|
| 0 | 0 | 7.8292 | Q | male |
| 1 | 0 | 7.0000 | S | female |
| 2 | 0 | 9.6875 | Q | male |
| 3 | 0 | 8.6625 | S | male |
| 4 | 1 | 12.2875 | S | female |
| 5 | 0 | 9.2250 | S | male |
| 6 | 0 | 7.6292 | Q | female |
| 7 | 1 | 29.0000 | S | male |
| 8 | 0 | 7.2292 | C | female |
| 9 | 0 | 24.1500 | S | male |
Now, we can pass X_new to the Pipeline’s predict method to make predictions for these ten samples. When we run it, the Pipeline applies the same transformations to X_new that it applied to X, and the transformed version of X_new is passed to the fitted logistic regression model so that it can make predictions.

In other words, the Pipeline enabled us to accomplish our second goal, which is to apply the same workflow to training data and new data.

As a reminder, we can’t evaluate the accuracy of these ten predictions because we don’t know the true target values for X_new.
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
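One related option is to ask for predicted probabilities instead of class predictions. The Pipeline exposes the predict_proba method of its final step, so the call below works as a sketch; I’m not showing the output, since the exact probabilities depend on the fitted model.

# Each row contains the predicted probabilities of class 0 and class 1
pipe.predict_proba(X_new)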
Just like before, I’m going to show you what happens “under the hood” when you make predictions using this Pipeline. Again, you should not actually write the following code; rather, it is just for teaching purposes.

First, X_new is transformed by the ColumnTransformer into X_new_t, which stands for “X_new transformed”. Second, the fitted LogisticRegression model makes predictions for the samples in X_new_t.
X_new_t = ct.transform(X_new)
logreg.predict(X_new_t)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
And as you would expect, X_new has the shape 10 by 4, and X_new_t has the shape 10 by 7.
print(X_new.shape)
print(X_new_t.shape)
(10, 4)
(10, 7)
One important point I want to highlight is that the Pipeline’s predict method called the ColumnTransformer’s transform method, not its fit_transform method. Why would that be?

Recall that the fit step is when a transformer learns something, and the transform step is when it uses what it learned to do the transformation. Thus you fit on X to learn an encoding, and you transform on X and X_new to apply that encoding.

This is critically important. Our logistic regression model was fit on 7 columns, and so it learned 7 coefficients. To make predictions, you need to pass 7 columns to the predict method, and those 7 columns need to mean the same thing as the 7 columns you used when fitting the model. Thus, the predict method only runs transform so that the exact same encoding will be applied to the training data and the new data.
It’s okay if you’re still a bit fuzzy on the difference between fit and transform, because the Pipeline object will just do the right thing for you when you run fit or predict. However, understanding the difference will ultimately help you to go further with scikit-learn.
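To make the fit versus transform distinction concrete, here’s a minimal standalone sketch using OneHotEncoder by itself; the demo variable name is mine, not part of our workflow.

from sklearn.preprocessing import OneHotEncoder

demo = OneHotEncoder()
demo.fit(X[['Embarked']])            # fit: learns the categories C, Q, S
demo.categories_                     # [array(['C', 'Q', 'S'], dtype=object)]
demo.transform(X_new[['Embarked']])  # transform: applies that same encoding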
4.4 Q&A: How do I drop some columns and pass through others?
Currently we only have 4 columns in X, namely Parch, Fare, Embarked, and Sex. But imagine that we had many more columns, and we wanted to drop a few columns and pass through the rest. How would we do that efficiently?

We can use the special string 'drop' to tell the ColumnTransformer which columns to drop, and also tell it to pass through all remaining columns. So in this example, we’re one-hot encoding Embarked and Sex, which creates 5 columns, dropping Fare, and passing through Parch, which adds 1 more column.

We could use this same pattern to drop a few columns and pass through hundreds of columns without having to list the passthrough columns one-by-one.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder='passthrough')
ct.fit_transform(X)
array([[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 1., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 0., 1., 0., 1., 1.],
[0., 0., 1., 1., 0., 2.],
[1., 0., 0., 1., 0., 0.]])
Conversely, we might want to pass through a few columns and drop the rest. We can use the special string 'passthrough' to tell the ColumnTransformer which columns to pass through, and also tell it to drop all remaining columns. So in this example, we’re one-hot encoding Embarked and Sex, which creates 5 columns, passing through Parch, which adds 1 more column, and dropping Fare.

Again, we can use this pattern to pass through a few columns and drop hundreds of columns without listing them all.

Finally, just a reminder that 'drop' is the default value for remainder, so you aren’t actually required to specify it here.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch']),
    remainder='drop')
ct.fit_transform(X)
array([[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 1., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 0., 1., 0., 1., 1.],
[0., 0., 1., 1., 0., 2.],
[1., 0., 0., 1., 0., 0.]])
4.5 Q&A: How do I transform the unspecified columns?
We know how to drop or pass through the unspecified columns in a ColumnTransformer, but let’s pretend we wanted to apply a transformation to all of the unspecified columns. This is actually simple to do by passing a transformer to the remainder parameter.

For example, we might want to scale all of the unspecified columns. One option is MaxAbsScaler, which divides each feature by its maximum absolute value and thus scales it to the range negative 1 to positive 1. We’ll import it from the preprocessing module and then create an instance.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
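To make the scaling concrete, here’s a tiny standalone sketch (the demo array is mine, not part of our workflow): with a maximum absolute value of 2, the values 0, 1, and 2 map to 0, 0.5, and 1, which is exactly what will happen to the Parch column below.

import numpy as np

# MaxAbsScaler divides each feature by its maximum absolute value
demo = np.array([[0.], [1.], [2.]])
MaxAbsScaler().fit_transform(demo)  # array([[0. ], [0.5], [1. ]])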
Then, we can pass the scaler to the remainder parameter.

When we run the fit_transform method, you can see that the first 5 columns were created from Embarked and Sex, and the sixth column is the scaled version of the Parch column.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder=scaler)
ct.fit_transform(X)
array([[0. , 0. , 1. , 0. , 1. , 0. ],
[1. , 0. , 0. , 1. , 0. , 0. ],
[0. , 0. , 1. , 1. , 0. , 0. ],
[0. , 0. , 1. , 1. , 0. , 0. ],
[0. , 0. , 1. , 0. , 1. , 0. ],
[0. , 1. , 0. , 0. , 1. , 0. ],
[0. , 0. , 1. , 0. , 1. , 0. ],
[0. , 0. , 1. , 0. , 1. , 0.5],
[0. , 0. , 1. , 1. , 0. , 1. ],
[1. , 0. , 0. , 1. , 0. , 0. ]])
4.6 Q&A: How do I select columns from a NumPy array?
Throughout this book, we’ve been using a pandas DataFrame as our input. But what if your input data was a NumPy array instead? Let’s see how that affects our workflow.
We’ll start by converting the X and X_new DataFrames into NumPy arrays called X_array and X_new_array.

X_array = X.to_numpy()
X_new_array = X_new.to_numpy()
Here’s what X_array looks like.
X_array
array([[0, 7.25, 'S', 'male'],
[0, 71.2833, 'C', 'female'],
[0, 7.925, 'S', 'female'],
[0, 53.1, 'S', 'female'],
[0, 8.05, 'S', 'male'],
[0, 8.4583, 'Q', 'male'],
[0, 51.8625, 'S', 'male'],
[1, 21.075, 'S', 'male'],
[2, 11.1333, 'S', 'female'],
[0, 30.0708, 'C', 'female']], dtype=object)
If this was our input data, and we wanted to use a ColumnTransformer, we wouldn’t be able to specify the columns by name because the columns of a NumPy array don’t have names. However, we do have a couple of other options.

First, we could specify the columns by integer position. Embarked and Sex are columns 2 and 3, so in this example, we’re one-hot encoding Embarked and Sex and passing through the remainder. Note that we’re passing X_array, not X, to the fit_transform method.
ct = make_column_transformer(
    (ohe, [2, 3]),
    remainder='passthrough')
ct.fit_transform(X_array)
array([[0.0, 0.0, 1.0, 0.0, 1.0, 0, 7.25],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 71.2833],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 7.925],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 53.1],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 8.05],
[0.0, 1.0, 0.0, 0.0, 1.0, 0, 8.4583],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 51.8625],
[0.0, 0.0, 1.0, 0.0, 1.0, 1, 21.075],
[0.0, 0.0, 1.0, 1.0, 0.0, 2, 11.1333],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 30.0708]], dtype=object)
Another option is to specify the columns using slices, which is useful for large ranges of columns next to one another. In this case, we’re selecting columns 2 through 3 for one-hot encoding, and passing through the remainder. Remember that Python slices are inclusive of the starting value, which is 2 in this case, and exclusive of the ending value, which is 4 in this case.
ct = make_column_transformer(
    (ohe, slice(2, 4)),
    remainder='passthrough')
ct.fit_transform(X_array)
array([[0.0, 0.0, 1.0, 0.0, 1.0, 0, 7.25],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 71.2833],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 7.925],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 53.1],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 8.05],
[0.0, 1.0, 0.0, 0.0, 1.0, 0, 8.4583],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 51.8625],
[0.0, 0.0, 1.0, 0.0, 1.0, 1, 21.075],
[0.0, 0.0, 1.0, 1.0, 0.0, 2, 11.1333],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 30.0708]], dtype=object)
One final option is to specify the columns using a boolean mask. Normally you would create the mask using some sort of condition, but in this case I’m just writing out a mask to select columns 2 and 3 for one-hot encoding, and passing through the remainder.
ct = make_column_transformer(
    (ohe, [False, False, True, True]),
    remainder='passthrough')
ct.fit_transform(X_array)
array([[0.0, 0.0, 1.0, 0.0, 1.0, 0, 7.25],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 71.2833],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 7.925],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 53.1],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 8.05],
[0.0, 1.0, 0.0, 0.0, 1.0, 0, 8.4583],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 51.8625],
[0.0, 0.0, 1.0, 0.0, 1.0, 1, 21.075],
[0.0, 0.0, 1.0, 1.0, 0.0, 2, 11.1333],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 30.0708]], dtype=object)
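As mentioned above, you would normally build the mask from a condition rather than writing it out by hand. Here’s one hedged sketch that recreates the same mask by checking which columns of the object array hold strings:

# Build the mask programmatically: True for columns whose values are strings
mask = [isinstance(X_array[0, i], str) for i in range(X_array.shape[1])]
mask  # [False, False, True, True]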
So those are our three options for selecting columns in a ColumnTransformer when your input source is a NumPy array.

Other than that, the rest of our workflow remains the same. We’ll just update the Pipeline to use our new ColumnTransformer.
pipe = make_pipeline(ct, logreg)
Then we can fit the Pipeline with X_array and y, and make predictions for X_new_array.
pipe.fit(X_array, y)
pipe.predict(X_new_array)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
4.7 Q&A: How do I select columns by data type?
So far in this book, we’ve been selecting columns one-by-one. But let’s say that we had many more columns, and we simply wanted to one-hot encode all object columns and pass through all numeric columns without listing all of them out. How would we do that?
The easiest way to do this is with the make_column_selector function, which was new in scikit-learn version 0.22.
from sklearn.compose import make_column_selector
We’re going to create two column selectors called select_object and select_number. To do this, we just set the dtype_include parameter to the data type we want to include, and it outputs a callable.
select_object = make_column_selector(dtype_include=object)
select_number = make_column_selector(dtype_include='number')
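Although you’ll usually hand these callables straight to make_column_transformer, you can also call one yourself on a DataFrame to check which columns it would select:

# A column selector is a callable that accepts a DataFrame
# and returns the names of the matching columns
select_object(X)  # ['Embarked', 'Sex']
select_number(X)  # ['Parch', 'Fare']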
Then, we pass the callables to make_column_transformer instead of the column names, and the callables select the columns for us.

When we run fit_transform, you can see that once again, the object columns have been one-hot encoded and the numeric columns have been passed through.
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', select_number))
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
One slight variation of this is that you can tell make_column_selector to exclude rather than include a specific data type. In this example, we’re using the dtype_exclude parameter to create a column selector that excludes the object data type.
exclude_object = make_column_selector(dtype_exclude=object)
This time, we’ll tell the ColumnTransformer to one-hot encode all object columns and pass through all non-object columns, which has the same effect as before.
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', exclude_object))
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
There are also other data type options you can use, such as the datetime data type or the pandas category data type.
select_datetime = make_column_selector(dtype_include='datetime')
select_category = make_column_selector(dtype_include='category')
Finally, it’s worth noting that you can also pass a list of multiple data types to make_column_selector.
select_multiple = make_column_selector(dtype_include=[object, 'category'])
4.8 Q&A: How do I select columns by column name pattern?
Let’s say that we had a lot of columns, and all of the columns that we wanted to select for a particular transformation had the same pattern in their names. For example, maybe all of those columns started with the same word.
Once again, we can use the make_column_selector function, which allows us to select columns by regular expression pattern. Here’s a silly example in which we select columns whose names include the capital letter E or S.
select_ES = make_column_selector(pattern='E|S')
When we run the fit_transform method, Embarked and Sex have been one-hot encoded, and the remaining columns have been passed through.
ct = make_column_transformer(
    (ohe, select_ES),
    remainder='passthrough')
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
Again, this is only useful if your column names follow a particular pattern and you know how to write regular expressions.
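A more realistic pattern targets a shared prefix. For example, assuming you had engineered hypothetical columns named Cabin_letter and Cabin_number, you could select both with a single anchored pattern:

# Select all columns whose names start with 'Cabin_' (hypothetical columns)
select_cabin = make_column_selector(pattern='^Cabin_')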
4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?
So far in this book, we’ve been creating ColumnTransformers using the make_column_transformer function. In this lesson, I’ll show you how to use the ColumnTransformer class and then compare it to make_column_transformer so that you can decide which one you want to use.
To start, we’ll import the ColumnTransformer class from the compose module, and then we’ll create an instance.

When creating an instance, the first difference you might notice is that the tuples have three elements rather than two. The first element of each tuple is a name of your choosing that you are required to assign to the transformer.

In this case, the first tuple is our one-hot encoding of Embarked and Sex, and we’re assigning it the name “OHE” (all caps). The second tuple is our special passthrough transformer for Parch and Fare, and we’re assigning it the name “pass”. We can see these names when we print out the ColumnTransformer.

You might also notice that the tuples are in a list, which is a requirement of the ColumnTransformer class.
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('OHE', ohe, ['Embarked', 'Sex']),
     ('pass', 'passthrough', ['Parch', 'Fare'])])
ct
ColumnTransformer(transformers=[('OHE', OneHotEncoder(), ['Embarked', 'Sex']),
('pass', 'passthrough', ['Parch', 'Fare'])])
Now let’s create the same ColumnTransformer using the make_column_transformer function. When using make_column_transformer, we don’t define names for the transformers. Instead, each transformer is assigned a default name, which is the lowercase version of the transformer’s class name.

As you can see when we print it out, the one-hot encoder is assigned the name “onehotencoder” (all lowercase), and the passthrough transformer is assigned the name “passthrough”.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
ct
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])
All of that being said, which one should you use?
I prefer make_column_transformer, because I find the code both easier to read and easier to write, so that’s what I’ll use in this book. I usually don’t mind the default transformer names, and in fact I like that I don’t have to come up with a name for each transformer.

However, there are times when defining names for the transformers is useful. Custom names can be clearer if you’re performing a grid search of transformer parameters, or if you’re using the same type of transformer multiple times in the same ColumnTransformer instance. We’ll see examples of this later in the book.
One final note is that the ColumnTransformer class enables transformer weights, meaning you can emphasize the output of some transformers more than others. The specific use case for this is not yet clear to me, but if you do decide to use transformer weights, then you can’t use the make_column_transformer function and you must use the ColumnTransformer class.
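For completeness, here’s a hedged sketch of what transformer weights look like: you pass a dictionary that maps your transformer names to weights, and each transformer’s output is multiplied by its weight.

# Multiply the output of the 'OHE' transformer by 2 before it reaches the model
ct_weighted = ColumnTransformer(
    [('OHE', ohe, ['Embarked', 'Sex']),
     ('pass', 'passthrough', ['Parch', 'Fare'])],
    transformer_weights={'OHE': 2, 'pass': 1})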
4.10 Q&A: Should I use Pipeline or make_pipeline?
So far in this book, we’ve been creating Pipelines using the make_pipeline function. In this lesson, I’ll show you how to use the Pipeline class and then compare it to make_pipeline so that you can decide which one you want to use.
To start, we’ll import the Pipeline class from the pipeline module, and then we’ll create an instance.

When creating an instance, the main difference you might notice is that we’re passing in a list of tuples to the Pipeline constructor. Each tuple has two elements, in which the first element is the name you’re assigning to the Pipeline step, and the second element is the model or transformer you’re including in the Pipeline.

In this case, the first tuple is our preprocessing step using ColumnTransformer, and we’re assigning it the name “preprocessor”. The second tuple is our model building step using logistic regression, and we’re assigning it the name “classifier”. We can see these names when we print out the Pipeline.
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
pipe
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('classifier',
LogisticRegression(random_state=1, solver='liblinear'))])
We can also see the step names by accessing the named_steps attribute of the Pipeline and running the keys method.
pipe.named_steps.keys()
dict_keys(['preprocessor', 'classifier'])
Now let’s create the same Pipeline using the make_pipeline function. When using make_pipeline, we don’t define names for the steps. Instead, each step is assigned a default name, which is the lowercase version of the step’s class name.

As you can see when we print it out, the first step is assigned the name “columntransformer” (all lowercase), and the second step is assigned the name “logisticregression” (all lowercase).
pipe = make_pipeline(ct, logreg)
pipe
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
Again, we can also see the step names using the named_steps attribute.
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
All of that being said, which one should you use?
I prefer make_pipeline, because I find the code both easier to read and easier to write, so that’s what I’ll use in this book. I usually don’t mind the default step names, and in fact I like that I don’t have to come up with a name for each step.

However, custom step names can be useful for clarity, especially if you’re performing a grid search of a Pipeline. We’ll see many examples of this later in the book, but here’s a quick preview.
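This preview is just a hedged sketch, assuming the custom-named Pipeline from above: grid search parameters are written as the step name, two underscores, and the parameter name, and the C values here are placeholders.

from sklearn.model_selection import GridSearchCV

# With custom step names, tuning the model's C parameter reads 'classifier__C';
# with make_pipeline, the same parameter would be 'logisticregression__C'
custom_pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
params = {'classifier__C': [0.1, 1, 10]}
grid = GridSearchCV(custom_pipe, params, cv=3, scoring='accuracy')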
4.11 Q&A: How do I examine the steps of a Pipeline?
Sometimes you might want to examine the steps of a fitted Pipeline
so that you can understand what’s happening within each step. In this lesson, I’ll show you how to do it.
We’ll start by fitting the Pipeline, which prints out the two steps.
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
As I mentioned in the previous lesson, make_pipeline assigned a name to each step, which is the lowercase version of the step’s class name. In this case, our step names are “columntransformer” and “logisticregression”.
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
To examine an individual step, you select the named_steps attribute and pass the step name in brackets. Note that if we had assigned custom step names such as “preprocessor” and “classifier”, we would be using those here instead.
pipe.named_steps['columntransformer']
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])
pipe.named_steps['logisticregression']
LogisticRegression(random_state=1, solver='liblinear')
Once you’ve accessed a step, you can examine its attributes or run its methods. For example, we can run the get_feature_names method from the “columntransformer” step to learn the names of each feature. As a reminder, the x0 means feature 0 that was passed to the OneHotEncoder, and the x1 means feature 1.
pipe.named_steps['columntransformer'].get_feature_names()
['onehotencoder__x0_C',
'onehotencoder__x0_Q',
'onehotencoder__x0_S',
'onehotencoder__x1_female',
'onehotencoder__x1_male',
'Parch',
'Fare']
We can also see the coefficient values of the 7 features by examining the coef_ attribute of the “logisticregression” step. These coefficients are listed in the same order as the features, though the intercept is stored in a separate attribute (sketched below).

By finding the 4 positive coefficients, you can determine that embarking at port C, being female, and having a higher Parch and Fare are all associated with a greater likelihood of survival. Note that these are just associations the model learned from 10 rows of training data. They are not necessarily statistically significant associations, and in fact scikit-learn does not provide p-values.
pipe.named_steps['logisticregression'].coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
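And as noted above, the intercept lives in its own attribute. I won’t show the value, since it depends on the fitted model.

# The model's intercept is stored separately from the feature coefficients
pipe.named_steps['logisticregression'].intercept_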
Finally, it’s worth noting that there are three other ways that you can examine the steps of a Pipeline:

- First, you can use named_steps with periods.
- Second, you can exclude the named_steps attribute entirely.
- And third, you can reference the step by position rather than by name.
pipe.named_steps.logisticregression.coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
pipe['logisticregression'].coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
pipe[1].coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
Personally, I like the initial bracket notation because I think it’s the most readable, even though it’s the most typing. However, using named_steps with the periods seems to be the only option that supports autocompleting both the step name and the attribute, which is a nice benefit.