15  Feature engineering with custom transformers

15.1 Why not use pandas for feature engineering?

Let’s say that you need to create some custom features for your model. This is usually because you believe that your model could learn more from a particular feature if the feature was represented in a different way or combined with another feature.

Often, feature engineering is done using pandas on the original dataset, and then the updated dataset is passed to scikit-learn. However, you can actually do feature engineering within scikit-learn using custom transformers, and in this chapter I’ll show you how.

It’s a bit more work to do feature engineering within scikit-learn, but it provides considerable benefits. All of your data transformations can be included in a Pipeline, which means they can be tuned using a grid search, they can be applied to new data without any extra work, and, when done correctly, there’s no possibility of data leakage.

Options for feature engineering:

  • pandas: Create features on original dataset, pass updated dataset to scikit-learn
  • scikit-learn: Create features using custom transformers
    • Requires more work
    • All transformations can be included in a Pipeline

15.2 Transformer 1: Rounding numerical values

To start, we’re going to redefine df as a 10-row dataset. This will make it much easier for you to see how custom transformers work.

df = pd.read_csv('http://bit.ly/MLtrain', nrows=10)
df
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

First, let’s pretend we believe Fare would be a better feature if we rounded it up to the next integer. We can do this using NumPy’s built-in ceil function, which is short for ceiling. In the output below, notice that each value in Fare has been rounded up.

In general, NumPy functions are a good choice for custom transformations because both scikit-learn and pandas use NumPy under the hood.

Also notice that we’ll pass the ceil function a 2D object, namely a one-column DataFrame, and it will return a 2D object, again a one-column DataFrame. This will turn out to be a useful characteristic for custom transformations, as you’ll see later in this chapter.

np.ceil(df[['Fare']])
Fare
0 8.0
1 72.0
2 8.0
3 54.0
4 9.0
5 9.0
6 52.0
7 22.0
8 12.0
9 31.0

In order to do this transformation within scikit-learn, we need to convert the ceil function into a scikit-learn transformer using the FunctionTransformer class, which we’ll import from the preprocessing module.

We simply pass the ceil function to FunctionTransformer, and it returns a transformer object, which we’ll call “ceiling”. Note that if you’re using a version of scikit-learn prior to 0.22, you should also include the argument validate=False any time you’re using FunctionTransformer.

from sklearn.preprocessing import FunctionTransformer
ceiling = FunctionTransformer(np.ceil)
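
If you’re using one of those older versions, the equivalent code would be:

ceiling = FunctionTransformer(np.ceil, validate=False)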

Because ceiling is a transformer, you can pass Fare to its fit_transform method, which performs the same transformation as before. This is the simplest example of feature engineering within scikit-learn.

ceiling.fit_transform(df[['Fare']])
Fare
0 8.0
1 72.0
2 8.0
3 54.0
4 9.0
5 9.0
6 52.0
7 22.0
8 12.0
9 31.0

Like any transformer, ceiling can be included in a ColumnTransformer. For the moment, we’ll create a ColumnTransformer instance that only includes the ceiling transformer, though we’ll add more transformers throughout the chapter.

Finally, we’ll pass df to its fit_transform method, which confirms that it works. As we’ve seen throughout this book, ColumnTransformer always outputs a NumPy array or a sparse matrix, and in this case it outputs a NumPy array.

ct = make_column_transformer(
    (ceiling, ['Fare']))
ct.fit_transform(df)
array([[ 8.],
       [72.],
       [ 8.],
       [54.],
       [ 9.],
       [ 9.],
       [52.],
       [22.],
       [12.],
       [31.]])

15.3 Transformer 2: Clipping numerical values

Let’s look at the 10-row dataset again.

df
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

For our second transformation, let’s pretend we believe Age would be a better feature if we limited the values to the range 5 to 60, such that all values below 5 are rounded up to 5, and all values above 60 are rounded down to 60. We can do this using NumPy’s built-in clip function. Note that it has two required arguments, a_min and a_max, which define the limits.

In this case, the only value that changed was the row with index 7, in which a 2 became a 5.

np.clip(df[['Age']], a_min=5, a_max=60)
Age
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 5.0
8 27.0
9 14.0

We’ll convert NumPy’s clip function to a transformer called “clip”, though this time we need to pass the a_min and a_max arguments to the FunctionTransformer’s kw_args parameter.

clip = FunctionTransformer(np.clip, kw_args={'a_min':5, 'a_max':60})

We’ll check that the clip transformer works by running the fit_transform method and passing it the Age column.

clip.fit_transform(df[['Age']])
Age
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 5.0
8 27.0
9 14.0

Finally, we’ll add the clip transformer to the ColumnTransformer, and confirm that the ColumnTransformer still works as expected, which it does.

ct = make_column_transformer(
    (ceiling, ['Fare']),
    (clip, ['Age']))
ct.fit_transform(df)
array([[ 8., 22.],
       [72., 38.],
       [ 8., 26.],
       [54., 35.],
       [ 9., 35.],
       [ 9., nan],
       [52., 54.],
       [22.,  5.],
       [12., 27.],
       [31., 14.]])

15.4 Transformer 3: Extracting string values

Let’s look at the dataset again.

df
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

For our third transformation, let’s assume that the first letter of Cabin indicates the deck the passenger was staying on, which we believe might be predictive. Our first thought might be to use the Series string slice method to extract the first character.

This works, but notice that the output is a 1D object, namely a pandas Series. This is problematic because a function (once converted into a transformer) must return 2D output in order to be used in a ColumnTransformer.

df['Cabin'].str.slice(0, 1)
0    NaN
1      C
2    NaN
3      C
4    NaN
5    NaN
6      E
7    NaN
8    NaN
9    NaN
Name: Cabin, dtype: object

To resolve this, we’ll use the DataFrame apply method with a lambda. It still extracts the first character, but notice that it now returns 2D output. Also notice that it accepts 2D input, which means that it will be able to operate on multiple columns if we like.

df[['Cabin']].apply(lambda x: x.str.slice(0, 1))
Cabin
0 NaN
1 C
2 NaN
3 C
4 NaN
5 NaN
6 E
7 NaN
8 NaN
9 NaN

Since this doesn’t already exist as a standalone function, our next step is to convert this operation into a custom function called “first_letter”. This function would work, but notice that it requires the input to be a pandas DataFrame, since apply is a DataFrame method. That may be problematic since ColumnTransformers accept both DataFrames and NumPy arrays as input. As such, it’s better to write a function that will work regardless of whether the input is a DataFrame or a NumPy array.

def first_letter(df):
    return df.apply(lambda x: x.str.slice(0, 1))

We can revise the function to work with both DataFrames and NumPy arrays by converting the input to a DataFrame explicitly before using the apply method.

def first_letter(df):
    return pd.DataFrame(df).apply(lambda x: x.str.slice(0, 1))

We’ll check that the function works. Again, notice that it accepts 2D input and returns 2D output, which is what we want.

first_letter(df[['Cabin']])
Cabin
0 NaN
1 C
2 NaN
3 C
4 NaN
5 NaN
6 E
7 NaN
8 NaN
9 NaN

Then, we’ll convert the function to a transformer called “letter”, and check that it works.

letter = FunctionTransformer(first_letter)
letter.fit_transform(df[['Cabin']])
Cabin
0 NaN
1 C
2 NaN
3 C
4 NaN
5 NaN
6 E
7 NaN
8 NaN
9 NaN

Finally, we’ll add the letter transformer to the ColumnTransformer, and check that it works, which it does.

ct = make_column_transformer(
    (ceiling, ['Fare']),
    (clip, ['Age']),
    (letter, ['Cabin']))
ct.fit_transform(df)
array([[8.0, 22.0, nan],
       [72.0, 38.0, 'C'],
       [8.0, 26.0, nan],
       [54.0, 35.0, 'C'],
       [9.0, 35.0, nan],
       [9.0, nan, nan],
       [52.0, 54.0, 'E'],
       [22.0, 5.0, nan],
       [12.0, 27.0, nan],
       [31.0, 14.0, nan]], dtype=object)

15.5 Rules for transformer functions

We’ve been talking about input and output shapes, so let me summarize the rules for functions that will be used in a ColumnTransformer.

First, your function isn’t required to accept 2D input, but it’s better if it does. Accepting 2D input enables your function (once converted into a transformer) to operate on multiple columns in the ColumnTransformer.

Second, your function is required to return 2D output in order to be used in a ColumnTransformer. Thus if your function returns a pandas object, it should return a DataFrame (not a Series). And if your function returns a 1D array, you should reshape the array to be 2D before returning it. We’ll see an example of this in the next lesson.

Input and output of transformer functions:

  • Input:
    • 1D is allowed
    • 2D is preferred: Enables it to accept multiple columns
  • Output:
    • 2D is required
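
To make these rules concrete, here’s a quick sketch (with hypothetical function names) showing how you might check a candidate function’s output before wrapping it in a FunctionTransformer:

def returns_1d(df):
    return pd.DataFrame(df).sum(axis=1)             # pandas Series
def returns_2d(df):
    return np.array(df).sum(axis=1).reshape(-1, 1)  # NumPy array

returns_1d(df[['SibSp', 'Parch']]).ndim  # 1, so not usable in a ColumnTransformer
returns_2d(df[['SibSp', 'Parch']]).ndim  # 2, so usable in a ColumnTransformer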

15.6 Transformer 4: Combining two features

Let’s look at the dataset one more time.

df
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

For our fourth transformation, let’s say we believe that a passenger’s total number of family members aboard, meaning SibSp plus Parch, is more predictive than either feature individually.

To create this feature, we can use the DataFrame sum method with axis=1, which sums across the columns. However, this outputs a 1D object, which will not work with a ColumnTransformer.

df[['SibSp', 'Parch']].sum(axis=1)
0    1
1    1
2    0
3    1
4    0
5    0
6    0
7    4
8    2
9    1
dtype: int64

To resolve this issue, we can use NumPy’s reshape method. Since it’s a NumPy method, we need to first convert the DataFrame into a NumPy array, then use NumPy’s sum method, then chain the reshape method on the end to convert the output into a 2D object. If you’re not familiar with the reshape method, this notation specifies that the second dimension should be 1 and the first dimension should be inferred.

np.array(df[['SibSp', 'Parch']]).sum(axis=1).reshape(-1, 1)
array([[1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [4],
       [2],
       [1]])

Next, we’ll convert this operation into a function called sum_cols, and this function will work regardless of whether the input is a DataFrame or a NumPy array.

def sum_cols(df):
    return np.array(df).sum(axis=1).reshape(-1, 1)

We’ll check that the function works.

sum_cols(df[['SibSp', 'Parch']])
array([[1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [4],
       [2],
       [1]])

Since the function works, we’ll convert the function to a transformer called “total”, and check that it also works.

total = FunctionTransformer(sum_cols)
total.fit_transform(df[['SibSp', 'Parch']])
array([[1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [4],
       [2],
       [1]])

Finally, we’ll add the total transformer to the ColumnTransformer, specify that it should be applied to the SibSp and Parch columns, and then check that it all works together, which it does.

ct = make_column_transformer(
    (ceiling, ['Fare']),
    (clip, ['Age']),
    (letter, ['Cabin']),
    (total, ['SibSp', 'Parch']))
ct.fit_transform(df)
array([[8.0, 22.0, nan, 1],
       [72.0, 38.0, 'C', 1],
       [8.0, 26.0, nan, 0],
       [54.0, 35.0, 'C', 1],
       [9.0, 35.0, nan, 0],
       [9.0, nan, nan, 0],
       [52.0, 54.0, 'E', 0],
       [22.0, 5.0, nan, 4],
       [12.0, 27.0, nan, 2],
       [31.0, 14.0, nan, 1]], dtype=object)

15.7 Revising the transformers

At this point, we’ve built four custom transformers and tested them out on 10 rows. Now, we want to apply them to our entire dataset, along with all of our other transformations.

Before we start, let’s review our original ColumnTransformer.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))

Let’s compare that with the ColumnTransformer with our custom transformations.

ct = make_column_transformer(
    (ceiling, ['Fare']),
    (clip, ['Age']),
    (letter, ['Cabin']),
    (total, ['SibSp', 'Parch']))

In order to combine the custom transformations with our original transformations, there are a few things we’ll need to handle:

  • First, Cabin and SibSp weren’t used in our original ColumnTransformer.
  • Second, Fare and Age both have missing values.
  • Third, the first letter of Cabin is obviously non-numeric, and it has missing values.

Issues to handle when updating the ColumnTransformer:

  1. Cabin and SibSp weren’t originally included
  2. Fare and Age have missing values
  3. Cabin is non-numeric and has missing values

To start, we’ll define df as the entire dataset.

df = pd.read_csv('http://bit.ly/MLtrain')

Then, we’ll add Cabin and SibSp to the list of columns, and update X and X_new to include these columns. That solves issue number one.

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age', 'Cabin', 'SibSp']
X = df[cols]
X_new = df_new[cols]

Next, let’s handle issue number two, which is the missing values in Fare and Age. Remember how we did imputation before one-hot encoding for Embarked and Sex? We’ll do something similar for Fare and Age:

  • For Fare, we’ll create a Pipeline called “imp_ceiling” that does imputation before ceiling.
  • For Age, we’ll create a Pipeline called “imp_clip” that does imputation before clipping.

imp_ceiling = make_pipeline(imp, ceiling)
imp_clip = make_pipeline(imp, clip)

Finally, we’ll tackle the most complicated issue, which is issue number three. Let’s slice the first letter of Cabin using the string slice method and then take the value_counts of the result to see why:

  • First, it contains missing values, so imputation will be required.
  • Second, letters are strings, so one-hot encoding will be required.

X['Cabin'].str.slice(0, 1).value_counts(dropna=False)
NaN    687
C       59
B       47
D       33
E       32
A       15
F       13
G        4
T        1
Name: Cabin, dtype: int64

In addition, notice that the G and T categories are quite rare, which can cause problems with cross-validation. Why would this be?

For any rare category, it’s possible for all values of that category to show up in the same testing fold during cross-validation. If that happens, the rare category won’t be learned by the OneHotEncoder during the fit step, and will be treated as an unknown category during the transform step. By default, the OneHotEncoder will error when it encounters an unknown category, and thus cross-validation will also throw an error.

Why are rare categories problematic for cross-validation?

  • Rare category values may all show up in the same testing fold
  • The rare category won’t be learned during fit and will be treated as an unknown category
  • OneHotEncoder will error when it encounters an unknown category

Although the problem is complicated, the solution is simple: set the handle_unknown parameter to 'ignore', which we learned about earlier in the book.

ohe_ignore = OneHotEncoder(handle_unknown='ignore')
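
To see the effect of this parameter, here’s a toy demonstration (using a hypothetical ohe_demo object, separate from our main workflow). The encoder learns three categories during the fit step, and then encodes an unseen category as a row of all zeros during the transform step.

ohe_demo = OneHotEncoder(handle_unknown='ignore')
ohe_demo.fit(pd.DataFrame({'Deck': ['A', 'B', 'C']}))
ohe_demo.transform(pd.DataFrame({'Deck': ['B', 'T']})).toarray()
array([[0., 1., 0.],
       [0., 0., 0.]])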

To resolve all of these issues with the Cabin column, we’ll create a three-step Pipeline. Step 1 is the letter transformer, which extracts the first letter. Step 2 is imp_constant, which imputes the constant value “missing”. Step 3 is ohe_ignore, which one-hot encodes the results.

letter_imp_ohe = make_pipeline(letter, imp_constant, ohe_ignore)

Now we’re finally ready to update our primary ColumnTransformer:

  • Embarked, Sex, and Name are transformed exactly as they were previously.
  • For Fare, we’ll use imp_ceiling instead of imp.
  • For Age, we’ll use imp_clip instead of imp.
  • For Cabin, we’ll use letter_imp_ohe.
  • For SibSp and Parch, we’ll use total.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_ceiling, ['Fare']),
    (imp_clip, ['Age']),
    (letter_imp_ohe, ['Cabin']),
    (total, ['SibSp', 'Parch']))

We’ll check that it works by passing X to the fit_transform method.

ct.fit_transform(X)
<891x1527 sparse matrix of type '<class 'numpy.float64'>'
    with 8360 stored elements in Compressed Sparse Row format>

Then we’ll update the Pipeline to use the new ColumnTransformer.

pipe = make_pipeline(ct, logreg)
pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('functiontransformer',
                                                                   FunctionTransformer(func=<ufunc 'ceil'>))]),
                                                  ['Fare']),
                                                 ('pipeline-3',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('functiontransformer',
                                                                   FunctionTransformer(func=<function clip at 0x...>,
                                                                                       kw_args={'a_max': 60,
                                                                                                'a_min': 5}))]),
                                                  ['Age']),
                                                 ('pipeline-4',
                                                  Pipeline(steps=[('functiontransformer',
                                                                   FunctionTransformer(func=<function first_letter at 0x...>)),
                                                                  ('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Cabin']),
                                                 ('functiontransformer',
                                                  FunctionTransformer(func=<function sum_cols at 0x...>),
                                                  ['SibSp', 'Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

When we cross-validate the Pipeline, the accuracy is 0.827, which is higher than our baseline accuracy of 0.811. And it’s very likely that its accuracy could be further improved through hyperparameter tuning.

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
0.8271483271608812

Pipeline accuracy scores:

  • Grid search (VC): 0.834
  • Grid search (LR with SelectFromModel ET): 0.832
  • Grid search (RF): 0.829
  • Grid search (LR): 0.828
  • Baseline (LR with more features): 0.827
  • Baseline (VC): 0.818
  • Baseline (LR): 0.811
  • Baseline (RF): 0.811
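
As an aside, here’s a hedged sketch of how the clip bounds could be tuned as part of a grid search. The parameter name below follows the naming pattern that make_pipeline and make_column_transformer generate, but you should confirm the exact names in your environment using pipe.get_params().keys().

from sklearn.model_selection import GridSearchCV
params = {'columntransformer__pipeline-3__functiontransformer__kw_args':
          [{'a_min': 5, 'a_max': 60}, {'a_min': 0, 'a_max': 100}]}
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)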

Finally, we’ll fit the Pipeline to X and y and use it to make predictions for X_new.

pipe.fit(X, y)
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

15.8 Q&A: How do I fix incorrect data types within a Pipeline?

Let’s say that you have the following DataFrame.

demo = pd.DataFrame({'A': ['10', '20', '30'],
                     'B': ['40', '50', '60'],
                     'C': [70, 80, 90],
                     'D': ['x', 'y', 'z']})
demo
A B C D
0 10 40 70 x
1 20 50 80 y
2 30 60 90 z

It may look like the first three columns are all integers, but columns A and B are actually object columns because the numbers are being stored as strings, which is a common problem in datasets.

demo.dtypes
A    object
B    object
C     int64
D    object
dtype: object

These data types need to be fixed in order for a scikit-learn model to understand them. Within pandas, you can do this using the DataFrame method astype.

demo[['A', 'B']].astype('int')
A B
0 10 40
1 20 50
2 30 60

As you can see, this changes the data types of columns A and B to integer.

demo[['A', 'B']].astype('int').dtypes
A    int64
B    int64
dtype: object

To incorporate this into a scikit-learn Pipeline, we would first define a custom function called “make_integer” that converts DataFrame columns to integers.

def make_integer(df):
    return pd.DataFrame(df).astype('int')

Then, we can convert the function to a transformer called “integer”, and check that it works.

integer = FunctionTransformer(make_integer)
integer.fit_transform(demo[['A', 'B']])
A B
0 10 40
1 20 50
2 30 60

And you can see that the data types are now integer.

integer.fit_transform(demo[['A', 'B']]).dtypes
A    int64
B    int64
dtype: object

What I’ve shown so far is a reasonable strategy, and it will work in many cases. However, let’s modify the demo DataFrame slightly.

demo.loc[2, 'B'] = ''

As you can see, the value 60 has been replaced by an empty string.

demo
A B C D
0 10 40 70 x
1 20 50 80 y
2 30 90 z

If we try to use our “integer” transformer on the revised DataFrame, it will error because the astype method doesn’t know how to handle an empty string.

integer.fit_transform(demo[['A', 'B']])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[49], line 1
----> 1 integer.fit_transform(demo[['A', 'B']])

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/base.py:690, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    686 # non-optimized default implementation; override when a better
    687 # method is possible for a given clustering algorithm
    688 if y is None:
    689     # fit method of arity 1 (unsupervised transformation)
--> 690     return self.fit(X, **fit_params).transform(X)
    691 else:
    692     # fit method of arity 2 (supervised transformation)
    693     return self.fit(X, y, **fit_params).transform(X)

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/preprocessing/_function_transformer.py:147, in FunctionTransformer.transform(self, X)
    134 def transform(self, X):
    135     """Transform X using the forward function.
    136 
    137     Parameters
   (...)
    145         Transformed input.
    146     """
--> 147     return self._transform(X, func=self.func, kw_args=self.kw_args)

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/preprocessing/_function_transformer.py:171, in FunctionTransformer._transform(self, X, func, kw_args)
    168 if func is None:
    169     func = _identity
--> 171 return func(X, **(kw_args if kw_args else {}))

Cell In[44], line 2, in make_integer(df)
      1 def make_integer(df):
----> 2     return pd.DataFrame(df).astype('int')

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/pandas/core/generic.py:5877, in NDFrame.astype(self, dtype, copy, errors)
   5870     results = [
   5871         self.iloc[:, i].astype(dtype, copy=copy)
   5872         for i in range(len(self.columns))
   5873     ]
   5875 else:
   5876     # else, only a single dtype is given
-> 5877     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5878     return self._constructor(new_data).__finalize__(self, method="astype")
   5880 # GH 33113: handle empty frame or series

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/pandas/core/internals/managers.py:631, in BlockManager.astype(self, dtype, copy, errors)
    628 def astype(
    629     self, dtype, copy: bool = False, errors: str = "raise"
    630 ) -> "BlockManager":
--> 631     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/pandas/core/internals/managers.py:427, in BlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    425         applied = b.apply(f, **kwargs)
    426     else:
--> 427         applied = getattr(b, f)(**kwargs)
    428 except (TypeError, NotImplementedError):
    429     if not ignore_failures:

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/pandas/core/internals/blocks.py:673, in Block.astype(self, dtype, copy, errors)
    671 vals1d = values.ravel()
    672 try:
--> 673     values = astype_nansafe(vals1d, dtype, copy=True)
    674 except (ValueError, TypeError):
    675     # e.g. astype_nansafe can fail on object-dtype of strings
    676     #  trying to convert to float
    677     if errors == "raise":

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1074, in astype_nansafe(arr, dtype, copy, skipna)
   1070 elif is_object_dtype(arr):
   1071 
   1072     # work around NumPy brokenness, #1987
   1073     if np.issubdtype(dtype.type, np.integer):
-> 1074         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
   1076     # if we have a datetime/timedelta array of objects
   1077     # then coerce to a proper dtype and recall astype_nansafe
   1079     elif is_datetime64_dtype(dtype):

File pandas/_libs/lib.pyx:619, in pandas._libs.lib.astype_intsafe()

ValueError: invalid literal for int() with base 10: ''

An alternative strategy is to use the pandas function to_numeric and apply it to each of the columns. As you can see, to_numeric replaced the empty string with NaN.

demo[['A', 'B']].apply(pd.to_numeric)
A B
0 10 40.0
1 20 50.0
2 30 NaN

Column A is now an integer column, and column B is now a float column since integer columns don’t currently support NumPy’s NaN value.

demo[['A', 'B']].apply(pd.to_numeric).dtypes
A      int64
B    float64
dtype: object

If we wanted to use this in the Pipeline, we would create a custom function called “make_number”.

def make_number(df):
    return pd.DataFrame(df).apply(pd.to_numeric)

Then we would convert the function to a transformer called “number”, and check that it works.

number = FunctionTransformer(make_number)
number.fit_transform(demo[['A', 'B']])
A B
0 10 40.0
1 20 50.0
2 30 NaN

You could use either the integer or number transformers in a ColumnTransformer, and both would allow you to fix data types within a Pipeline. However, there are some advantages to using the number transformer, since the to_numeric function includes a few different options for error handling, and it will handle the conversion of both integers and floating point numbers.
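
For example, here’s a hypothetical variant of make_number that passes errors='coerce' to the to_numeric function, which converts any unparseable value (not just an empty string) to NaN rather than raising an error:

def make_number_coerce(df):
    return pd.DataFrame(df).apply(pd.to_numeric, errors='coerce')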

15.9 Q&A: How do I create features from datetime data?

To demonstrate how to create date-based features, I’m going to read a tiny dataset of reported UFO sightings into a DataFrame.

ufo = pd.read_csv('http://bit.ly/ufosample', parse_dates=['Date'])
ufo
Date City State Shape
0 2023-08-04 Lexington SC Cylinder
1 2023-08-04 Alpharetta GA Orb
2 2023-08-05 Baltimore MD Light
3 2023-08-05 Helena MT Rectangle
4 2023-08-06 Wilmington NC Fireball
5 2023-08-06 Brooklyn NY Star

Because we parsed the Date column during file reading, it uses the special datetime data type.

ufo.dtypes
Date     datetime64[ns]
City             object
State            object
Shape            object
dtype: object

Because of the datetime data type, we can access various properties of the Date column using the dt accessor. For example, we can easily access the day of the month using the “day” attribute.

ufo['Date'].dt.day
0    4
1    4
2    5
3    5
4    6
5    6
Name: Date, dtype: int64

Let’s say that you wanted to use day of month as a feature. The first step would be to create a custom function like we’ve done previously, and we’ll call it “day_of_month”.

def day_of_month(df):
    return df.apply(lambda x: x.dt.day)

We can see that this function works. However, it’s not quite optimal because it assumes that the data structure being passed in is a DataFrame with the datetime data type.

day_of_month(ufo[['Date']])
Date
0 4
1 4
2 5
3 5
4 6
5 6

Let’s make the function more robust by enabling it to handle NumPy arrays and by doing the datetime conversion within the function.

Here’s one option. As you can see, it converts the object to a DataFrame, and then it converts each column to datetime format before accessing the day attribute.

def day_of_month(df):
    return pd.DataFrame(df).apply(lambda x: pd.to_datetime(x).dt.day)

Another option is to convert all columns to datetime format within the DataFrame constructor rather than during the apply method.

def day_of_month(df):
    return pd.DataFrame(df, dtype=np.datetime64).apply(lambda x: x.dt.day)

To check that it works, we’ll read in the UFO dataset again, but this time we’ll leave the Date column as an object column.

ufo = pd.read_csv('http://bit.ly/ufosample')
ufo.dtypes
Date     object
City     object
State    object
Shape    object
dtype: object

We can see that the day_of_month function works, even though Date is an object column.

day_of_month(ufo[['Date']])
Date
0 4
1 4
2 5
3 5
4 6
5 6

Finally, we’ll convert the day_of_month function into a transformer called “day”, and check that it works as well.

day = FunctionTransformer(day_of_month)
day.fit_transform(ufo[['Date']])
Date
0 4
1 4
2 5
3 5
4 6
5 6
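
If you wanted to create several date-based features at once, one possible extension (using a hypothetical date_parts function, and assuming a single datetime-like input column) is to return a multi-column DataFrame, which still satisfies the 2D output rule:

def date_parts(df):
    dates = pd.to_datetime(pd.DataFrame(df).iloc[:, 0])
    return pd.DataFrame({'day': dates.dt.day,
                         'month': dates.dt.month,
                         'dayofweek': dates.dt.dayofweek})
FunctionTransformer(date_parts).fit_transform(ufo[['Date']])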

15.10 Q&A: How do I create feature interactions?

When you believe there’s an interaction between two or more features, one common technique is to create “interaction terms” or “feature interactions” that your model can learn from. This is generally done by multiplying the values of each pair of features, and using those as new features.

Creating interaction features is useful when the combined impact of a pair of features is different from the impact of the features when considered independently. For example, let’s pretend that features A and B each have a small positive impact on the target, but when combined, they have a much larger positive impact on the target than you would expect. In that case, it would be useful to create the interaction feature of A times B.

When are feature interactions useful?

  • When the combined impact of features is different from their independent impacts
  • Example:
    • A and B (individually) each have a small positive impact
    • A and B (combined) has a larger positive impact than expected

Let’s see how we can create feature interactions in scikit-learn. We’ll assume that we’ve decided to create interactions between Fare, SibSp, and Parch. Here are the first three rows and last three rows of each of those features.

X[['Fare', 'SibSp', 'Parch']].to_numpy()
array([[ 7.25  ,  1.    ,  0.    ],
       [71.2833,  1.    ,  0.    ],
       [ 7.925 ,  0.    ,  0.    ],
       ...,
       [23.45  ,  1.    ,  2.    ],
       [30.    ,  0.    ,  0.    ],
       [ 7.75  ,  0.    ,  0.    ]])

Our first step is to import the PolynomialFeatures class from the preprocessing module, and then create an instance called “poly”. We’ll set the include_bias parameter to False to avoid creating a column of ones in the result, and we’ll set the interaction_only parameter to True to avoid creating the square of each feature.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(include_bias=False, interaction_only=True)

When we run the fit_transform method and pass it those three columns, it outputs six columns:

  • The first three columns of the output are the original three columns: Fare, SibSp, and Parch.
  • The next three columns are our interaction terms: Fare times SibSp, Fare times Parch, and SibSp times Parch.

poly.fit_transform(X[['Fare', 'SibSp', 'Parch']])
array([[ 7.25  ,  1.    ,  0.    ,  7.25  ,  0.    ,  0.    ],
       [71.2833,  1.    ,  0.    , 71.2833,  0.    ,  0.    ],
       [ 7.925 ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ],
       ...,
       [23.45  ,  1.    ,  2.    , 23.45  , 46.9   ,  2.    ],
       [30.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ],
       [ 7.75  ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ]])

Output columns:

  1. Fare
  2. SibSp
  3. Parch
  4. Fare * SibSp
  5. Fare * Parch
  6. SibSp * Parch

If we wanted to include these feature interactions in our model, we would simply include “poly” as one of the transformers in our ColumnTransformer.
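
Here’s a sketch of what that might look like (this is not the code from our final Pipeline). Because Fare contains missing values, we impute before creating the interactions, mirroring the imp_ceiling pattern from earlier; note that ColumnTransformer allows a column to be used by more than one transformer.

imp_poly = make_pipeline(imp, poly)
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_ceiling, ['Fare']),
    (imp_clip, ['Age']),
    (letter_imp_ohe, ['Cabin']),
    (total, ['SibSp', 'Parch']),
    (imp_poly, ['Fare', 'SibSp', 'Parch']))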

One obvious question is: How should you choose which feature interactions to create?

  • Ideally, you would use expert knowledge of the system in question to guide your decision of which interactions to create.
  • If that’s not available, then you can explore the data to decide which interactions to create.
  • Finally, if you have a small number of features, then you could simply create all possible interactions and then use feature selection to remove the interactions which are not useful.

With a large number of features, it’s simply not practical to create all possible interactions, plus you run an increased risk of false positive feature interactions. These are interactions which appear to have a relationship with the target, but that relationship is actually just occurring due to random chance.

How to choose feature interactions:

  • Use expert knowledge
  • Explore the data
  • Create all possible interactions
    • Not practical with a large number of features
    • Increases risk of false positives

Another point worth noting is that tree-based models can learn feature interactions on their own through recursive splitting. That means that if you’re using a tree-based model as your prediction model, then you don’t need to manually create feature interactions.

Finally, it’s worth noting that even though linear models can’t explicitly learn feature interactions, they can sometimes replace the information supplied by the interaction terms, in which case the interaction terms are unnecessary. As such, you should always evaluate the model with interaction terms against the model without interaction terms, and only include the interaction terms if they’re improving the model’s performance.

When are feature interactions not useful?

  • Tree-based models can learn feature interactions on their own
  • Linear models can sometimes replace the information supplied by interaction terms
  • Conclusion: Evaluate the model with and without interaction terms

15.11 Q&A: How do I save a Pipeline with custom transformers?

If you save a Pipeline using pickle or joblib, and the Pipeline includes custom transformers, then the saved Pipeline can only be loaded into a new environment if the functions it depends upon are defined in the new environment.

For example, let’s import pickle and use it to save our current Pipeline.

import pickle
with open('pipe.pickle', 'wb') as f:
    pickle.dump(pipe, f)

Let’s pretend that we’re in a brand new environment and we wanted to make predictions for X_new using our saved Pipeline. Because the Pipeline includes custom transformers which use the first_letter and sum_cols functions, those two functions need to be defined in the new environment.

def first_letter(df):
    return pd.DataFrame(df).apply(lambda x: x.str.slice(0, 1))
def sum_cols(df):
    return np.array(df).sum(axis=1).reshape(-1, 1)

And because those functions depend on pandas and NumPy, pandas and NumPy also need to be imported in the new environment.

import pandas as pd
import numpy as np

Now we can load our saved Pipeline into the “pipe_from_pickle” object.

with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)

We also need to create the X_new object in our environment.

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age', 'Cabin', 'SibSp']
df_new = pd.read_csv('http://bit.ly/MLnewdata')
X_new = df_new[cols]

Finally, we can make predictions using the saved Pipeline.

pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

If that process seems too burdensome, you can simplify it by using a Python library called cloudpickle, which extends the functionality of pickle so that user-defined functions can be saved along with the Pipeline.

All you have to do is install cloudpickle using pip or conda, import it, and then save the Pipeline using cloudpickle instead of pickle. Notice that the cloudpickle code is exactly the same as the pickle code, except that you use the dump function from cloudpickle instead of from pickle.

import cloudpickle
with open('pipe.pickle', 'wb') as f:
    cloudpickle.dump(pipe, f)

Then, in your new environment, you’ll be able to load the saved Pipeline using pickle and use it to make predictions without having to define the custom functions in that environment.

with open('pipe.pickle', 'rb') as f:
    pipe_from_pickle = pickle.load(f)
pipe_from_pickle.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

15.12 Q&A: Can FunctionTransformer be used with any transformation?

FunctionTransformer should only be used with stateless transformations. A transformation is considered stateless if it doesn’t learn any information during the fit step.

For example, all of our custom transformations in this chapter were stateless: rounding up to the next integer, limiting values to a range, extracting the first letter, and adding two columns. They’re considered stateless because they didn’t learn anything about the training data that later needed to be applied to testing data. In other words, they work exactly the same on the testing data regardless of what the training data looked like.

Stateless transformations:

  • ceiling: Rounding up to the next integer
  • clip: Limiting values to a range
  • letter: Extracting the first letter
  • total: Adding two columns

This is in contrast to stateful transformations, which do learn information from the fit step that needs to be applied to both training and testing data. We’ve seen many stateful transformations in this book:

  • OneHotEncoder learns the categories from the training data, and those same categories need to be applied to the testing data.
  • CountVectorizer learns the vocabulary from the training data, and that vocabulary needs to be used when building the document-term matrix for the testing data.
  • SimpleImputer learns the value to impute from the training data, and that value is applied to the testing data.
  • And as we saw in chapter 14, MaxAbsScaler learns the scale of each feature from the training data, and that scaling is applied to the testing data.

FunctionTransformer should never be used to implement stateful transformations. Depending on the situation, you would either run into an error or silently cause data leakage. Instead, you would need to write your own class in order to create a proper stateful transformer; a minimal sketch appears after the list below.

Stateful transformations:

  • OneHotEncoder: fit learns the categories
  • CountVectorizer: fit learns the vocabulary
  • SimpleImputer: fit learns the value to impute
  • MaxAbsScaler: fit learns the scale of each feature
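
For reference, here’s a minimal sketch of a stateful transformer written as a class. It reimplements most-frequent imputation (which SimpleImputer already provides) purely for illustration: fit learns state from the training data and stores it in an attribute with a trailing underscore, and transform applies that state to any dataset.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # learn the most frequent value of each column from the training data
        self.fill_values_ = pd.DataFrame(X).mode().iloc[0]
        return self
    def transform(self, X):
        # fill missing values using what was learned during fit
        return pd.DataFrame(X).fillna(self.fill_values_)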