4 Improving your workflow with ColumnTransformer and Pipeline
4.1 Preprocessing features with ColumnTransformer
In the last chapter, our goal was to include two numeric features and two categorical features in our model. We saw how to numerically encode the categorical features using OneHotEncoder, but we lacked an efficient process for stacking those encoded features next to the numeric features, and we lacked an efficient way to apply this same preprocessing to our new data.

In this chapter, we’re going to solve both of those problems using the ColumnTransformer and Pipeline classes:

- ColumnTransformer will make it easy to apply different preprocessing steps to different columns.
- Pipeline will make it easy to apply the same workflow to training data and new data.
To start, we’ll create a Python list of the four columns we’ve been working with, and use that to create our X object.

cols = ['Parch', 'Fare', 'Embarked', 'Sex']
X = df[cols]
X
| | Parch | Fare | Embarked | Sex |
|---|---|---|---|---|
| 0 | 0 | 7.2500 | S | male |
| 1 | 0 | 71.2833 | C | female |
| 2 | 0 | 7.9250 | S | female |
| 3 | 0 | 53.1000 | S | female |
| 4 | 0 | 8.0500 | S | male |
| 5 | 0 | 8.4583 | Q | male |
| 6 | 0 | 51.8625 | S | male |
| 7 | 1 | 21.0750 | S | male |
| 8 | 2 | 11.1333 | S | female |
| 9 | 0 | 30.0708 | C | female |
We’re still going to be one-hot encoding the Embarked and Sex columns, so we’ll create an instance of OneHotEncoder. We’re using the default options for OneHotEncoder, which means it will output a sparse matrix, but that’s fine because we’re not going to examine the output directly.
ohe = OneHotEncoder()
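That said, if you ever do want to inspect the encoding directly, you can ask the encoder for a dense array instead. A hedged note on versions: the relevant parameter is named sparse before scikit-learn 1.2 and sparse_output from 1.2 onward, so adjust this sketch for your version.

# Dense output sketch: use sparse_output=False on scikit-learn 1.2+
ohe_dense = OneHotEncoder(sparse=False)
ohe_dense.fit_transform(X[['Embarked', 'Sex']])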
Now it’s time to create our first ColumnTransformer, which will take care of any data transformations that we specify. We’ll start by importing the make_column_transformer function from the compose module.

In general, you use make_column_transformer by passing it one or more tuples, and each tuple should have two elements:

- The first element is a transformer.
- The second element is a list of columns to which that transformer should be applied. Note that in most cases, this element should be a list even if you are only specifying a single column.
In our case, we’ll pass it a single tuple in which the first element is our OneHotEncoder object and the second element is a list of the two columns we want to one-hot encode.

After all tuples, we’ll set the remainder parameter to 'drop', which means that all columns which are not explicitly mentioned in the ColumnTransformer should be dropped. 'drop' is actually the default value for remainder, but I’m including it here just for clarity.

Note that I could have defined the ColumnTransformer on a single line, but I prefer breaking the lines in this way for readability.

When we run this code, the make_column_transformer function returns a ColumnTransformer object, which we’ll save as ct.
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='drop')
Next, we’ll perform the transformation by passing X, which is our four-column DataFrame, to the fit_transform method of the ct object. It outputs a 10 by 5 array that represents the one-hot encoding of the Embarked and Sex columns. The first three columns represent Embarked and the other two columns represent Sex, and they’re in that order because that’s the order in which they were listed in the ColumnTransformer.

Note that even though the Parch and Fare columns are part of X, they’re excluded from the output array because we told the ColumnTransformer to drop all unspecified columns.
ct.fit_transform(X)
array([[0., 0., 1., 0., 1.],
[1., 0., 0., 1., 0.],
[0., 0., 1., 1., 0.],
[0., 0., 1., 1., 0.],
[0., 0., 1., 0., 1.],
[0., 1., 0., 0., 1.],
[0., 0., 1., 0., 1.],
[0., 0., 1., 0., 1.],
[0., 0., 1., 1., 0.],
[1., 0., 0., 1., 0.]])
This is nice, but our actual goal was to create a matrix that includes the Parch and Fare columns alongside the encoded versions of Embarked and Sex. To accomplish that, we’ll simply change the value of remainder from 'drop' to 'passthrough'. This means that all columns which are not mentioned in the ColumnTransformer should be passed through to the output unmodified. In other words, include the Parch and Fare columns in the output, but don’t transform them in any way.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')
When we run the fit_transform method this time, it outputs a 10 by 7 array. The first five columns represent the encoded Embarked and Sex columns, and the sixth and seventh columns are the Parch and Fare columns. The column order is based on the order in which the columns were listed in the ColumnTransformer, followed by any passthrough columns.
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
We were able to figure out on our own what each column represents, but you can also use the ColumnTransformer’s get_feature_names method to confirm the meanings of these 7 features. The x0 simply means feature 0 that was passed to the OneHotEncoder, and the x1 means feature 1.
ct.get_feature_names()
['onehotencoder__x0_C',
'onehotencoder__x0_Q',
'onehotencoder__x0_S',
'onehotencoder__x1_female',
'onehotencoder__x1_male',
'Parch',
'Fare']
Before we move on, I have two quick asides about the get_feature_names method:

- First, the get_feature_names method didn’t work with passthrough columns prior to scikit-learn version 0.23, so you’ll get an error if you run the code with previous versions.
- Second, the get_feature_names method has been replaced with a similar method called get_feature_names_out starting in scikit-learn 1.0, as sketched below.
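If you’re running scikit-learn 1.0 or later, the equivalent call is sketched below. A hedge on the output: on newer versions, the names are built from the actual column names rather than x0/x1, and the passthrough columns may carry a remainder__ prefix, so expect names along the lines of onehotencoder__Embarked_C and remainder__Parch.

# scikit-learn 1.0+ only: get_feature_names_out replaces get_feature_names
ct.get_feature_names_out()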
To wrap up this lesson, I want to show you one other way to specify this same ColumnTransformer.

As I mentioned before, make_column_transformer accepts tuples, and the first element of each tuple is usually a transformer object (like our ohe object). However, the first element of the tuple can also be the special string 'drop' or 'passthrough', which tells the ColumnTransformer to drop or pass through specific columns.

So, we’re going to add a second tuple in which the transformer is the string 'passthrough', and we want to apply this passthrough transformer to the columns Parch and Fare. This ColumnTransformer will do the exact same thing as the previous one, but I actually prefer this notation any time I have a small number of passthrough columns, since it reminds me of which columns I’m passing through.

It’s still important to remember that the default value for the remainder parameter is 'drop', which means that any unspecified columns will be dropped, though we don’t have any unspecified columns in this case.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
We’ll run the fit_transform method one more time, and you can see that it outputs the same 7 columns as before. And to be clear, this is the feature matrix that we will pass to our model.
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
4.2 Chaining steps with Pipeline
In the previous lesson, we accomplished our first goal, which was to apply different preprocessing to different columns using ColumnTransformer. In this lesson, we’re moving on to our second goal, which is to apply the same workflow to training data and new data using the Pipeline class.

A Pipeline is used to chain together sequential steps. In this case, we want to chain together two steps, namely data preprocessing followed by model building.

We’ll start by importing the make_pipeline function. Then, we can create a Pipeline instance by passing it two objects: our ColumnTransformer instance for data preprocessing, and our LogisticRegression instance for model building. We’ll save it as an object called pipe, which is a 2-step Pipeline.
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ct, logreg)
You might remember that back in Chapter 2, we used cross-validation to evaluate our model when it only included the Parch and Fare features. Now that we’ve added the Embarked and Sex features, it would normally make sense to cross-validate the updated model to see whether adding those features made our model better or worse. And in fact, you can (and should) cross-validate an entire Pipeline, as sketched below.

However, any model evaluation procedure is highly unreliable with only 10 rows of data, and so any change in the cross-validated accuracy would be misleading. Thus we’re going to skip the cross-validation step for the moment, though we’ll return to it in a later chapter once we’re using the full dataset.
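For reference, here’s what cross-validating the entire Pipeline would look like. This is just a sketch, since with 10 rows the resulting scores wouldn’t mean anything.

from sklearn.model_selection import cross_val_score

# Cross-validate the whole Pipeline: the ColumnTransformer is re-fit on each
# training fold, so no information leaks from the validation fold
cross_val_score(pipe, X, y, cv=3, scoring='accuracy').mean()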
Since we’re skipping cross-validation, our next step is just to run the fit method on the Pipeline, and pass it X and y. Here’s what happens when we fit the Pipeline:

- First, it runs the ColumnTransformer step, meaning that it takes X, which is a 4-column DataFrame that contains both numbers and strings, and transforms it into the 7-column feature matrix that only includes numbers.
- Second, it runs the LogisticRegression step, meaning that the model is fit to this 7-column feature matrix. In other words, it learns the relationship between those 7 features and the y values.
Note that when you fit a Pipeline, it will actually print out the steps. You can see that step 1 is a ColumnTransformer that includes a OneHotEncoder and a passthrough transformer, and step 2 is a LogisticRegression model.
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
In case it helps you to understand the Pipeline better, I’m going to show you what happens “under the hood” when you fit this Pipeline. To be clear, you should not actually write the following code; rather, it is just for teaching purposes.

First, X is transformed by the ColumnTransformer into X_t, which stands for “X transformed”. Second, the LogisticRegression model is fit on X_t and y.
X_t = ct.fit_transform(X)
logreg.fit(X_t, y)
LogisticRegression(random_state=1, solver='liblinear')
And as you would expect, X has the shape 10 by 4, and X_t has the shape 10 by 7.
print(X.shape)
print(X_t.shape)
(10, 4)
(10, 7)
4.3 Using the Pipeline to make predictions
Now that we’ve fit our Pipeline, we want to use it to make predictions on new data.

The first step is to update the X_new DataFrame so that it contains the same columns as X. Recall that the cols object contains the names of our four columns, and so we can use it to select those four columns from the df_new DataFrame.
X_new = df_new[cols]
X_new
| | Parch | Fare | Embarked | Sex |
|---|---|---|---|---|
| 0 | 0 | 7.8292 | Q | male |
| 1 | 0 | 7.0000 | S | female |
| 2 | 0 | 9.6875 | Q | male |
| 3 | 0 | 8.6625 | S | male |
| 4 | 1 | 12.2875 | S | female |
| 5 | 0 | 9.2250 | S | male |
| 6 | 0 | 7.6292 | Q | female |
| 7 | 1 | 29.0000 | S | male |
| 8 | 0 | 7.2292 | C | female |
| 9 | 0 | 24.1500 | S | male |
Now, we can pass X_new to the Pipeline’s predict method to make predictions for these ten samples. When we run it, the Pipeline applies the same transformations to X_new that it applied to X, and the transformed version of X_new is passed to the fitted logistic regression model so that it can make predictions.

In other words, the Pipeline enabled us to accomplish our second goal, which is to apply the same workflow to training data and new data.

As a reminder, we can’t evaluate the accuracy of these ten predictions because we don’t know the true target values for X_new.
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
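One related option is to ask for predicted probabilities instead of class predictions. The Pipeline exposes the predict_proba method of its final step, so the call below works as a sketch; I’m not showing the output, since the exact probabilities depend on the fitted model.

# Each row contains the predicted probabilities of class 0 and class 1
pipe.predict_proba(X_new)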
Just like before, I’m going to show you what happens “under the hood” when you make predictions using this Pipeline. Again, you should not actually write the following code; rather, it is just for teaching purposes.

First, X_new is transformed by the ColumnTransformer into X_new_t, which stands for “X_new transformed”. Second, the fitted LogisticRegression model makes predictions for the samples in X_new_t.
X_new_t = ct.transform(X_new)
logreg.predict(X_new_t)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
And as you would expect, X_new has the shape 10 by 4, and X_new_t has the shape 10 by 7.
print(X_new.shape)
print(X_new_t.shape)
(10, 4)
(10, 7)
One important point I want to highlight is that the Pipeline’s predict method called the ColumnTransformer’s transform method, not its fit_transform method. Why would that be?

Recall that the fit step is when a transformer learns something, and the transform step is when it uses what it learned to do the transformation. Thus you fit on X to learn an encoding, and you transform on X and X_new to apply that encoding.

This is critically important. Our logistic regression model was fit on 7 columns, and so it learned 7 coefficients. To make predictions, you need to pass 7 columns to the predict method, and those 7 columns need to mean the same thing as the 7 columns you used when fitting the model. Thus, the predict method only runs transform so that the exact same encoding will be applied to the training data and the new data.
It’s okay if you’re still a bit fuzzy on the difference between fit and transform, because the Pipeline object will just do the right thing for you when you run fit or predict. However, understanding the difference will ultimately help you to go further with scikit-learn.
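To make the fit versus transform distinction concrete, here’s a minimal standalone sketch using OneHotEncoder by itself; the demo variable name is mine, not part of our workflow.

from sklearn.preprocessing import OneHotEncoder

demo = OneHotEncoder()
demo.fit(X[['Embarked']])            # fit: learns the categories C, Q, S
demo.categories_                     # [array(['C', 'Q', 'S'], dtype=object)]
demo.transform(X_new[['Embarked']])  # transform: applies that same encoding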
4.4 Q&A: How do I drop some columns and pass through others?
Currently we only have 4 columns in X, namely Parch, Fare, Embarked, and Sex. But imagine that we had many more columns, and we wanted to drop a few columns and pass through the rest. How would we do that efficiently?

We can use the special string 'drop' to tell the ColumnTransformer which columns to drop, and also tell it to pass through all remaining columns. So in this example, we’re one-hot encoding Embarked and Sex, which creates 5 columns, dropping Fare, and passing through Parch, which adds 1 more column.

We could use this same pattern to drop a few columns and pass through hundreds of columns without having to list the passthrough columns one-by-one.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder='passthrough')
ct.fit_transform(X)
array([[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 1., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 0., 1., 0., 1., 1.],
[0., 0., 1., 1., 0., 2.],
[1., 0., 0., 1., 0., 0.]])
Conversely, we might want to pass through a few columns and drop the rest. We can use the special string 'passthrough' to tell the ColumnTransformer which columns to pass through, and also tell it to drop all remaining columns. So in this example, we’re one-hot encoding Embarked and Sex, which creates 5 columns, passing through Parch, which adds 1 more column, and dropping Fare.

Again, we can use this pattern to pass through a few columns and drop hundreds of columns without listing them all.

Finally, just a reminder that 'drop' is the default value for remainder, so you aren’t actually required to specify it here.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch']),
    remainder='drop')
ct.fit_transform(X)
array([[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 1., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
[0., 0., 1., 0., 1., 1.],
[0., 0., 1., 1., 0., 2.],
[1., 0., 0., 1., 0., 0.]])
4.5 Q&A: How do I transform the unspecified columns?
We know how to drop or pass through the unspecified columns in a ColumnTransformer, but let’s pretend we wanted to apply a transformation to all of the unspecified columns. This is actually simple to do by passing a transformer to the remainder parameter.

For example, we might want to scale all of the unspecified columns. One option is MaxAbsScaler, which divides each feature by its maximum absolute value and thus scales it to the range negative 1 to positive 1. We’ll import it from the preprocessing module and then create an instance.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
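To make the scaling concrete, here’s a tiny standalone sketch (the demo array is mine, not part of our workflow): with a maximum absolute value of 2, the values 0, 1, and 2 map to 0, 0.5, and 1, which is exactly what will happen to the Parch column below.

import numpy as np

# MaxAbsScaler divides each feature by its maximum absolute value
demo = np.array([[0.], [1.], [2.]])
MaxAbsScaler().fit_transform(demo)  # array([[0. ], [0.5], [1. ]])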
Then, we can pass the scaler to the remainder parameter.

When we run the fit_transform method, you can see that the first 5 columns were created from Embarked and Sex, and the sixth column is the scaled version of the Parch column.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('drop', ['Fare']),
    remainder=scaler)
ct.fit_transform(X)
array([[0. , 0. , 1. , 0. , 1. , 0. ],
[1. , 0. , 0. , 1. , 0. , 0. ],
[0. , 0. , 1. , 1. , 0. , 0. ],
[0. , 0. , 1. , 1. , 0. , 0. ],
[0. , 0. , 1. , 0. , 1. , 0. ],
[0. , 1. , 0. , 0. , 1. , 0. ],
[0. , 0. , 1. , 0. , 1. , 0. ],
[0. , 0. , 1. , 0. , 1. , 0.5],
[0. , 0. , 1. , 1. , 0. , 1. ],
[1. , 0. , 0. , 1. , 0. , 0. ]])
4.6 Q&A: How do I select columns from a NumPy array?
Throughout this book, we’ve been using a pandas DataFrame as our input. But what if your input data was a NumPy array instead? Let’s see how that affects our workflow.
We’ll start by converting the X and X_new DataFrames into NumPy arrays called X_array and X_new_array.

X_array = X.to_numpy()
X_new_array = X_new.to_numpy()
Here’s what X_array looks like.
X_array
array([[0, 7.25, 'S', 'male'],
[0, 71.2833, 'C', 'female'],
[0, 7.925, 'S', 'female'],
[0, 53.1, 'S', 'female'],
[0, 8.05, 'S', 'male'],
[0, 8.4583, 'Q', 'male'],
[0, 51.8625, 'S', 'male'],
[1, 21.075, 'S', 'male'],
[2, 11.1333, 'S', 'female'],
[0, 30.0708, 'C', 'female']], dtype=object)
If this was our input data, and we wanted to use a ColumnTransformer, we wouldn’t be able to specify the columns by name because the columns of a NumPy array don’t have names. However, we do have a couple of other options.

First, we could specify the columns by integer position. Embarked and Sex are columns 2 and 3, so in this example, we’re one-hot encoding Embarked and Sex and passing through the remainder. Note that we’re passing X_array, not X, to the fit_transform method.
ct = make_column_transformer(
    (ohe, [2, 3]),
    remainder='passthrough')
ct.fit_transform(X_array)
array([[0.0, 0.0, 1.0, 0.0, 1.0, 0, 7.25],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 71.2833],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 7.925],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 53.1],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 8.05],
[0.0, 1.0, 0.0, 0.0, 1.0, 0, 8.4583],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 51.8625],
[0.0, 0.0, 1.0, 0.0, 1.0, 1, 21.075],
[0.0, 0.0, 1.0, 1.0, 0.0, 2, 11.1333],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 30.0708]], dtype=object)
Another option is to specify the columns using slices, which is useful for large ranges of columns next to one another. In this case, we’re selecting columns 2 through 3 for one-hot encoding, and passing through the remainder. Remember that Python slices are inclusive of the starting value, which is 2 in this case, and exclusive of the ending value, which is 4 in this case.
ct = make_column_transformer(
    (ohe, slice(2, 4)),
    remainder='passthrough')
ct.fit_transform(X_array)
array([[0.0, 0.0, 1.0, 0.0, 1.0, 0, 7.25],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 71.2833],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 7.925],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 53.1],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 8.05],
[0.0, 1.0, 0.0, 0.0, 1.0, 0, 8.4583],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 51.8625],
[0.0, 0.0, 1.0, 0.0, 1.0, 1, 21.075],
[0.0, 0.0, 1.0, 1.0, 0.0, 2, 11.1333],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 30.0708]], dtype=object)
One final option is to specify the columns using a boolean mask. Normally you would create the mask using some sort of condition, but in this case I’m just writing out a mask to select columns 2 and 3 for one-hot encoding, and passing through the remainder.
ct = make_column_transformer(
    (ohe, [False, False, True, True]),
    remainder='passthrough')
ct.fit_transform(X_array)
array([[0.0, 0.0, 1.0, 0.0, 1.0, 0, 7.25],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 71.2833],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 7.925],
[0.0, 0.0, 1.0, 1.0, 0.0, 0, 53.1],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 8.05],
[0.0, 1.0, 0.0, 0.0, 1.0, 0, 8.4583],
[0.0, 0.0, 1.0, 0.0, 1.0, 0, 51.8625],
[0.0, 0.0, 1.0, 0.0, 1.0, 1, 21.075],
[0.0, 0.0, 1.0, 1.0, 0.0, 2, 11.1333],
[1.0, 0.0, 0.0, 1.0, 0.0, 0, 30.0708]], dtype=object)
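As mentioned above, you would normally build the mask from a condition rather than writing it out by hand. Here’s one hedged sketch that recreates the same mask by checking which columns of the object array hold strings:

# Build the mask programmatically: True for columns whose values are strings
mask = [isinstance(X_array[0, i], str) for i in range(X_array.shape[1])]
mask  # [False, False, True, True]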
So those are our three options for selecting columns in a ColumnTransformer when your input source is a NumPy array.

Other than that, the rest of our workflow remains the same. We’ll just update the Pipeline to use our new ColumnTransformer.
pipe = make_pipeline(ct, logreg)
Then we can fit the Pipeline with X_array and y, and make predictions for X_new_array.
pipe.fit(X_array, y)
pipe.predict(X_new_array)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
4.7 Q&A: How do I select columns by data type?
So far in this book, we’ve been selecting columns one-by-one. But let’s say that we had many more columns, and we simply wanted to one-hot encode all object columns and pass through all numeric columns without listing all of them out. How would we do that?
The easiest way to do this is with the make_column_selector function, which was new in scikit-learn version 0.22.
from sklearn.compose import make_column_selector
We’re going to create two column selectors called select_object and select_number. To do this, we just set the dtype_include parameter to the data type we want to include, and it outputs a callable.
select_object = make_column_selector(dtype_include=object)
select_number = make_column_selector(dtype_include='number')
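Although you’ll usually hand these callables straight to make_column_transformer, you can also call one yourself on a DataFrame to check which columns it would select:

# A column selector is a callable that accepts a DataFrame
# and returns the names of the matching columns
select_object(X)  # ['Embarked', 'Sex']
select_number(X)  # ['Parch', 'Fare']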
Then, we pass the callables to make_column_transformer instead of the column names, and the callables select the columns for us.

When we run fit_transform, you can see that once again, the object columns have been one-hot encoded and the numeric columns have been passed through.
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', select_number))
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
One slight variation of this is that you can tell make_column_selector to exclude rather than include a specific data type. In this example, we’re using the dtype_exclude parameter to create a column selector that excludes the object data type.
exclude_object = make_column_selector(dtype_exclude=object)
This time, we’ll tell the ColumnTransformer to one-hot encode all object columns and pass through all non-object columns, which has the same effect as before.
ct = make_column_transformer(
    (ohe, select_object),
    ('passthrough', exclude_object))
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
There are also other data type options you can use, such as the datetime data type or the pandas category data type.
select_datetime = make_column_selector(dtype_include='datetime')
select_category = make_column_selector(dtype_include='category')
Finally, it’s worth noting that you can also pass a list of multiple data types to make_column_selector.
select_multiple = make_column_selector(dtype_include=[object, 'category'])
4.8 Q&A: How do I select columns by column name pattern?
Let’s say that we had a lot of columns, and all of the columns that we wanted to select for a particular transformation had the same pattern in their names. For example, maybe all of those columns started with the same word.
Once again, we can use the make_column_selector function, which allows us to select columns by regular expression pattern. Here’s a silly example in which we select columns whose names include the capital letter E or S.
select_ES = make_column_selector(pattern='E|S')
When we run the fit_transform method, Embarked and Sex have been one-hot encoded, and the remaining columns have been passed through.
ct = make_column_transformer(
    (ohe, select_ES),
    remainder='passthrough')
ct.fit_transform(X)
array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],
[ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],
[ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],
[ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],
[ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],
[ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],
[ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],
[ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],
[ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],
[ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])
Again, this is only useful if your column names follow a particular pattern and you know how to write regular expressions.
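A more realistic pattern targets a shared prefix. For example, assuming you had engineered hypothetical columns named Cabin_letter and Cabin_number, you could select both with a single anchored pattern:

# Select all columns whose names start with 'Cabin_' (hypothetical columns)
select_cabin = make_column_selector(pattern='^Cabin_')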
4.9 Q&A: Should I use ColumnTransformer or make_column_transformer?
So far in this book, we’ve been creating ColumnTransformers using the make_column_transformer function. In this lesson, I’ll show you how to use the ColumnTransformer class and then compare it to make_column_transformer so that you can decide which one you want to use.
To start, we’ll import the ColumnTransformer class from the compose module, and then we’ll create an instance.

When creating an instance, the first difference you might notice is that the tuples have three elements rather than two. The first element of each tuple is a name of your choosing that you are required to assign to the transformer.

In this case, the first tuple is our one-hot encoding of Embarked and Sex, and we’re assigning it the name “OHE” (all caps). The second tuple is our special passthrough transformer for Parch and Fare, and we’re assigning it the name “pass”. We can see these names when we print out the ColumnTransformer.

You might also notice that the tuples are in a list, which is a requirement of the ColumnTransformer class.
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('OHE', ohe, ['Embarked', 'Sex']),
     ('pass', 'passthrough', ['Parch', 'Fare'])])
ct
ColumnTransformer(transformers=[('OHE', OneHotEncoder(), ['Embarked', 'Sex']),
('pass', 'passthrough', ['Parch', 'Fare'])])
Now let’s create the same ColumnTransformer using the make_column_transformer function. When using make_column_transformer, we don’t define names for the transformers. Instead, each transformer is assigned a default name, which is the lowercase version of the transformer’s class name.

As you can see when we print it out, the one-hot encoder is assigned the name “onehotencoder” (all lowercase), and the passthrough transformer is assigned the name “passthrough”.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    ('passthrough', ['Parch', 'Fare']))
ct
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])
All of that being said, which one should you use?
I prefer make_column_transformer, because I find the code both easier to read and easier to write, so that’s what I’ll use in this book. I usually don’t mind the default transformer names, and in fact I like that I don’t have to come up with a name for each transformer.

However, there are times when defining names for the transformers is useful. Custom names can be clearer if you’re performing a grid search of transformer parameters, or if you’re using the same type of transformer multiple times in the same ColumnTransformer instance. We’ll see examples of this later in the book.
One final note is that the ColumnTransformer class enables transformer weights, meaning you can emphasize the output of some transformers more than others. The specific use case for this is not yet clear to me, but if you do decide to use transformer weights, then you can’t use the make_column_transformer function and you must use the ColumnTransformer class.
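For completeness, here’s a hedged sketch of what transformer weights look like: you pass a dictionary that maps your transformer names to weights, and each transformer’s output is multiplied by its weight.

# Multiply the output of the 'OHE' transformer by 2 before it reaches the model
ct_weighted = ColumnTransformer(
    [('OHE', ohe, ['Embarked', 'Sex']),
     ('pass', 'passthrough', ['Parch', 'Fare'])],
    transformer_weights={'OHE': 2, 'pass': 1})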
4.10 Q&A: Should I use Pipeline or make_pipeline?
So far in this book, we’ve been creating Pipelines using the make_pipeline function. In this lesson, I’ll show you how to use the Pipeline class and then compare it to make_pipeline so that you can decide which one you want to use.
To start, we’ll import the Pipeline class from the pipeline module, and then we’ll create an instance.

When creating an instance, the main difference you might notice is that we’re passing in a list of tuples to the Pipeline constructor. Each tuple has two elements, in which the first element is the name you’re assigning to the Pipeline step, and the second element is the model or transformer you’re including in the Pipeline.

In this case, the first tuple is our preprocessing step using ColumnTransformer, and we’re assigning it the name “preprocessor”. The second tuple is our model building step using logistic regression, and we’re assigning it the name “classifier”. We can see these names when we print out the Pipeline.
from sklearn.pipeline import Pipeline
pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
pipe
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('classifier',
LogisticRegression(random_state=1, solver='liblinear'))])
We can also see the step names by accessing the named_steps attribute of the Pipeline and running the keys method.
pipe.named_steps.keys()
dict_keys(['preprocessor', 'classifier'])
Now let’s create the same Pipeline using the make_pipeline function. When using make_pipeline, we don’t define names for the steps. Instead, each step is assigned a default name, which is the lowercase version of the step’s class name.

As you can see when we print it out, the first step is assigned the name “columntransformer” (all lowercase), and the second step is assigned the name “logisticregression” (all lowercase).
pipe = make_pipeline(ct, logreg)
pipe
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
Again, we can also see the step names using the named_steps attribute.
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
All of that being said, which one should you use?
I prefer make_pipeline, because I find the code both easier to read and easier to write, so that’s what I’ll use in this book. I usually don’t mind the default step names, and in fact I like that I don’t have to come up with a name for each step.

However, custom step names can be useful for clarity, especially if you’re performing a grid search of a Pipeline. We’ll see many examples of this later in the book, but here’s a quick preview.
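This preview is just a hedged sketch, assuming the custom-named Pipeline from above: grid search parameters are written as the step name, two underscores, and the parameter name, and the C values here are placeholders.

from sklearn.model_selection import GridSearchCV

# With custom step names, tuning the model's C parameter reads 'classifier__C';
# with make_pipeline, the same parameter would be 'logisticregression__C'
custom_pipe = Pipeline([('preprocessor', ct), ('classifier', logreg)])
params = {'classifier__C': [0.1, 1, 10]}
grid = GridSearchCV(custom_pipe, params, cv=3, scoring='accuracy')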
4.11 Q&A: How do I examine the steps of a Pipeline?
Sometimes you might want to examine the steps of a fitted Pipeline
so that you can understand what’s happening within each step. In this lesson, I’ll show you how to do it.
We’ll start by fitting the Pipeline, which prints out the two steps.
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('onehotencoder',
OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])),
('logisticregression',
LogisticRegression(random_state=1, solver='liblinear'))])
As I mentioned in the previous lesson, make_pipeline assigned a name to each step, which is the lowercase version of the step’s class name. In this case, our step names are “columntransformer” and “logisticregression”.
pipe.named_steps.keys()
dict_keys(['columntransformer', 'logisticregression'])
To examine an individual step, you select the named_steps attribute and pass the step name in brackets. Note that if we had assigned custom step names such as “preprocessor” and “classifier”, we would be using those here instead.
pipe.named_steps['columntransformer']
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
['Embarked', 'Sex']),
('passthrough', 'passthrough',
['Parch', 'Fare'])])
pipe.named_steps['logisticregression']
LogisticRegression(random_state=1, solver='liblinear')
Once you’ve accessed a step, you can examine its attributes or run its methods. For example, we can run the get_feature_names method from the “columntransformer” step to learn the names of each feature. As a reminder, the x0 means feature 0 that was passed to the OneHotEncoder, and the x1 means feature 1.
pipe.named_steps['columntransformer'].get_feature_names()
['onehotencoder__x0_C',
'onehotencoder__x0_Q',
'onehotencoder__x0_S',
'onehotencoder__x1_female',
'onehotencoder__x1_male',
'Parch',
'Fare']
We can also see the coefficient values of the 7 features by examining the coef_ attribute of the “logisticregression” step. These coefficients are listed in the same order as the features, though the intercept is stored in a separate attribute (sketched below).

By finding the 4 positive coefficients, you can determine that embarking at port C, being female, and having a higher Parch and Fare are all associated with a greater likelihood of survival. Note that these are just associations the model learned from 10 rows of training data. They are not necessarily statistically significant associations, and in fact scikit-learn does not provide p-values.
pipe.named_steps['logisticregression'].coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
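And as noted above, the intercept lives in its own attribute. I won’t show the value, since it depends on the fitted model.

# The model's intercept is stored separately from the feature coefficients
pipe.named_steps['logisticregression'].intercept_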
Finally, it’s worth noting that there are three other ways that you can examine the steps of a Pipeline:

- First, you can use named_steps with periods.
- Second, you can exclude the named_steps attribute entirely.
- And third, you can reference the step by position rather than by name.
pipe.named_steps.logisticregression.coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
pipe['logisticregression'].coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
pipe[1].coef_
array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,
0.20056557, 0.01597307]])
Personally, I like the initial bracket notation because I think it’s the most readable, even though it’s the most typing. However, using named_steps with the periods seems to be the only option that supports autocompleting both the step name and the attribute, which is a nice benefit.