In this chapter, we’re going to deal with an issue that is common in real datasets, namely missing values.
A missing value is simply a value that does not exist in the dataset. It might be missing because that value purposefully wasn’t collected for that sample, or it might be missing due to an error in the data collection process.
Common sources of missing values:
Value purposefully wasn’t collected
Error in the data collection process
Let’s see an example in the Titanic dataset. We want to use Age as a feature, but note that it has a missing value, encoded as “NaN”. This stands for “not a number”, and it’s how missing values are typically encoded in a pandas DataFrame.
df
   Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5         0       3                                   Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
6         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46        S
7         0       3                     Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
8         1       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2            347742  11.1333   NaN        S
9         1       2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1      0            237736  30.0708   NaN        C
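If you want to confirm which columns contain missing values, pandas can count the NaNs for you. Here's a minimal sketch (the exact counts depend on how many rows of the dataset you're working with):

# count the number of missing values in each column
df.isna().sum()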
When we use the phrase “missing values”, we’re talking about NaNs only. The phrase does not refer to categories or words in new data that weren’t seen during training. For example, if our new data contained the value “Z” in the Embarked column, that would not be called a “missing value”; rather, it would be called “an unknown category” or “a category you didn’t see during training”.
Missing values vs unknown categories:
Missing value: Value encoded as “NaN”
Unknown category: Category not seen in the training data
To start our exploration of this topic, let’s see what happens if we try to add the Age column to our model.
First, we’ll add Age to the cols list and use that to modify the X DataFrame. Again, notice the missing value in Age.
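Here's a sketch of that step. The cols list comes from the previous chapter, so the columns other than Age are assumptions based on the X output shown later in this lesson:

# feature columns from the previous chapter, plus Age
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
X = df[cols]
X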
Then, we’ll update the Pipeline to include the modified ColumnTransformer.
pipe = make_pipeline(ct, logreg)
Finally, we’ll attempt to fit the Pipeline, but it throws an error due to the presence of a missing value.
pipe.fit(X, y)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
scikit-learn models generally don’t accept data with missing values, with the exception of histogram-based gradient boosting trees. As such, we’ll need to figure out a way to handle the missing value if we want to include Age in our model.
7.2 Three ways to handle missing values
Let’s talk about three different ways we can handle missing values in our dataset.
The first way is to drop any rows from the DataFrame that have missing values. We can use the dropna method in pandas to do this, and you’ll notice that row 5 is now gone. Note that you would also need to drop the same row from y.
X.dropna()
   Parch     Fare Embarked     Sex                                               Name   Age
0      0   7.2500        S    male                            Braund, Mr. Owen Harris  22.0
1      0  71.2833        C  female  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0
2      0   7.9250        S  female                             Heikkinen, Miss. Laina  26.0
3      0  53.1000        S  female       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0
4      0   8.0500        S    male                           Allen, Mr. William Henry  35.0
6      0  51.8625        S    male                            McCarthy, Mr. Timothy J  54.0
7      1  21.0750        S    male                     Palsson, Master. Gosta Leonard   2.0
8      2  11.1333        S  female  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  27.0
9      0  30.0708        C  female                Nasser, Mrs. Nicholas (Adele Achem)  14.0
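As noted above, the same row would also need to be dropped from y so that X and y stay aligned. One way to do that (a sketch, assuming y is the Survived Series created in an earlier chapter) is to build a boolean mask and apply it to both:

# keep only the rows with no missing values, in both X and y
mask = X.notna().all(axis='columns')
X_subset = X[mask]
y_subset = y[mask]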
However, there are three main problems with this approach:
First, if a high proportion of your rows have missing values, then this approach is impractical because it will discard too much of your training data.
Second, if there’s a useful pattern in the “missingness”, then dropping the rows will obscure this pattern.
Third, this takes care of the training data, but it doesn’t help you deal with any missing values that might appear in new data.
Approach 1: Drop rows with missing values
May discard too much training data
May obscure a pattern in the “missingness”
Doesn’t help you with new data
A second option for handling missing values is to drop any columns that have missing values. Again, we can use the dropna method to do this if we set the axis parameter to “columns”. You can see that row 5 is back, but the Age column is now gone.
X.dropna(axis='columns')
   Parch     Fare Embarked     Sex                                               Name
0      0   7.2500        S    male                            Braund, Mr. Owen Harris
1      0  71.2833        C  female  Cumings, Mrs. John Bradley (Florence Briggs Th...
2      0   7.9250        S  female                             Heikkinen, Miss. Laina
3      0  53.1000        S  female       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4      0   8.0500        S    male                           Allen, Mr. William Henry
5      0   8.4583        Q    male                                   Moran, Mr. James
6      0  51.8625        S    male                            McCarthy, Mr. Timothy J
7      1  21.0750        S    male                     Palsson, Master. Gosta Leonard
8      2  11.1333        S  female  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9      0  30.0708        C  female                Nasser, Mrs. Nicholas (Adele Achem)
The main problem with this approach is that you’re discarding a feature that may be useful to your model.
Approach 2: Drop columns with missing values
May discard useful features
A third option for handling missing values is to impute missing values. Imputation means that you’re filling in missing values based on what you know from the non-missing data. Before proceeding with imputation, it’s important to carefully consider its costs and benefits:
The benefit is that you’re able to keep more samples and features in your dataset, which may help to improve your model.
The cost is that you’re inserting values that may not match the true, unknown values, which may make your model less reliable.
Approach 3: Impute missing values
Benefit: Keeps more samples and features
Cost: Imputed values may not match the true values
When making this decision, here are some useful factors to consider:
How important are the particular samples that imputation would allow you to keep?
How important are the particular features that imputation would allow you to keep?
What percentage of the values in a feature would need to be imputed?
Are there other samples or features, without missing values, that are highly correlated with the ones that have missing values, such that the model wouldn’t be hurt by dropping the incomplete ones?
Is the missingness random, or is there a useful pattern in the missingness that would be lost if those samples or features were dropped?
Factors to consider before imputing:
How important are the samples?
How important are the features?
What percentage of values would need to be imputed?
Are there other samples or features that contain the same information?
Is the missingness random?
Imputation is a complex topic, and ultimately I can’t tell you whether you should impute missing values in your particular situation. Instead, I’m going to show you how to do imputation in scikit-learn, and you can decide whether or not to do it. As well, I’ll provide some advice for imputation in lesson 7.6.
7.3 Missing value imputation
In this lesson, we’re going to perform missing value imputation on the Age column so that we can include it in our model. There are three different imputers that we can use in scikit-learn, but we’re going to start with SimpleImputer and I’ll show you the others later in the chapter.
First, we’ll import SimpleImputer from the impute module and create an instance called imp.
from sklearn.impute import SimpleImputer
imp = SimpleImputer()
Then, we’ll pass the Age column to the fit_transform method to perform the imputation. Note that SimpleImputer, like most transformers, requires 2-dimensional input, which is why there are double brackets around Age.
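Here's the call being described (the printed array is omitted):

# double brackets select Age as a one-column DataFrame (2-dimensional input)
imp.fit_transform(X[['Age']])
# the result is a 10x1 array in which the missing value in row 5 has been
# replaced by the mean of the other 9 ages, approximately 28.11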
By default, SimpleImputer fills any missing values with the mean of the non-missing values, which is 28.11 in this case. It also supports other imputation strategies, namely the median value, the most frequent value, and a user-defined value. Note that all of these strategies can be applied to numeric features, but only the most frequent and user-defined strategies can be applied to categorical features.
Simple imputation strategies:
Mean value
Median value
Most frequent value
User-defined value
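If you want a strategy other than the default mean, you select it with the strategy parameter when creating the imputer. A brief sketch:

# median of the non-missing values
imp_median = SimpleImputer(strategy='median')

# most frequent non-missing value (also works for categorical features)
imp_frequent = SimpleImputer(strategy='most_frequent')

# user-defined value (also works for categorical features)
imp_constant = SimpleImputer(strategy='constant', fill_value=0)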
In order to confirm what value was imputed, we can examine the statistics_ attribute, which was learned from the data during the fit step.
imp.statistics_
array([28.11111111])
Now that we know how SimpleImputer works, we’ll update the ColumnTransformer to include imputation of the Age column. As a reminder, brackets are required around Age because SimpleImputer expects 2-dimensional input, whereas brackets are not allowed around Name because CountVectorizer expects 1-dimensional input.
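Here's a sketch of the updated ColumnTransformer. The ohe and vect objects (a OneHotEncoder and a CountVectorizer) are assumed to have been created in the previous chapter:

from sklearn.compose import make_column_transformer

# Parch and Fare are passed through unmodified
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')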
Now we’ll run the fit_transform method, and you can see that there are now 48 columns in the feature matrix, whereas in the last chapter there were 47.
ct.fit_transform(X)
<10x48 sparse matrix of type '<class 'numpy.float64'>'
with 88 stored elements in Compressed Sparse Row format>
Next, we’ll update the Pipeline to include the revised ColumnTransformer, and fit it on X and y. There was no error this time because the one missing value in X was imputed.
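The code is the same pattern we used before:

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)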
Next, we’ll update the X_new DataFrame to use the same columns as X.
X_new = df_new[cols]
X_new
   Parch     Fare Embarked     Sex                                          Name   Age
0      0   7.8292        Q    male                              Kelly, Mr. James  34.5
1      0   7.0000        S  female              Wilkes, Mrs. James (Ellen Needs)  47.0
2      0   9.6875        Q    male                     Myles, Mr. Thomas Francis  62.0
3      0   8.6625        S    male                              Wirz, Mr. Albert  27.0
4      1  12.2875        S  female  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22.0
5      0   9.2250        S    male                    Svensson, Mr. Johan Cervin  14.0
6      0   7.6292        Q  female                          Connolly, Miss. Kate  30.0
7      1  29.0000        S    male                  Caldwell, Mr. Albert Francis  26.0
8      0   7.2292        C  female     Abrahim, Mrs. Joseph (Sophie Halaut Easu)  18.0
9      0  24.1500        S    male                       Davies, Mr. John Samuel  21.0
Finally, we’ll use the fitted Pipeline to make predictions for X_new.
pipe.predict(X_new)
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
With respect to imputation, it’s worth talking about exactly what happens during the predict step. Since X_new didn’t have any missing values in the Age column, nothing was imputed during prediction.
But let’s pretend that X_new did have a missing value in Age. If that were the case, the imputed value would have been the mean of Age in X, which is 28.11, not the mean of Age in X_new. This is critically important: a transformer is only allowed to learn from the training data, and it then applies what it learned to both the training data and the new data.
What would have been imputed for Age in X_new?
Imputed value would be the mean of Age in X, not the mean of Age in X_new
Transformer is only allowed to learn from the training data
As we’ve already seen in this book, the OneHotEncoder learns its categories from the training data, the CountVectorizer learns its vocabulary from the training data, and similarly, the SimpleImputer learns its imputation value from the training data. This is one of the main reasons why we run fit_transform on the training data but transform only on the new data.
What do transformers learn from the training data?
OneHotEncoder: Learns categories
CountVectorizer: Learns vocabulary
SimpleImputer: Learns imputation value
If you’re struggling with this concept, here’s another way of looking at it that might be helpful to you. Pretend for a second that X_new only contained a single sample, and that sample had a missing Age value. If you passed that X_new to the predict method, it’s clear that scikit-learn has to look to the training data for the imputation value, since there would be no other Age values in X_new for it to examine.
7.4 Using “missingness” as a feature
When imputing missing values, it’s also possible to use “missingness” as a feature of its own.
Starting in scikit-learn version 0.21, we can set the add_indicator parameter to True when creating a SimpleImputer.
imp_indicator = SimpleImputer(add_indicator=True)
Then when we use fit_transform, a separate column known as a “missing indicator” is included in the output matrix, and it indicates the presence of missing values.
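Here's what that looks like (a sketch; the printed matrix is omitted):

# the output has two columns: the imputed Age values, plus an indicator
# column that is 1 where Age was missing and 0 elsewhere
imp_indicator.fit_transform(X[['Age']])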
Adding a missing indicator is useful when the data is not missing at random, since there might be a relationship between the “missingness” and the target value.
Why add a missing indicator?
Useful when the data is not missing at random
Can encode the relationship between “missingness” and the target value
For example, if Age was missing for some samples because older passengers declined to give their ages, and older passengers are more likely to have survived, then there is a relationship between Age being missing and the likelihood of survival. Thus the missingness itself can be a useful feature, and we can include that feature in the model using a missing indicator.
We’re not going to modify our Pipeline to include a missing indicator at this time, but we’ll return to this concept later in the book.
7.5 Q&A: How do I perform multivariate imputation?
So far in this chapter, we’ve used SimpleImputer for imputation. SimpleImputer does univariate imputation, which means that it only looks at the feature being imputed when deciding what values to impute. Thus when imputing missing values for Age, SimpleImputer only considers the values in the Age column.
However, there’s another type of imputation called multivariate imputation. The intuition of multivariate imputation is that it can be useful to take other features into account when deciding what value to impute.
For example, maybe a high Parch and a low Fare are common for kids. Thus if Age is missing for a row that has a high Parch and a low Fare, you should impute a low Age rather than the mean of Age, which is what SimpleImputer would do.
Multivariate imputation is available in scikit-learn via the IterativeImputer and KNNImputer classes, and in this lesson I’ll explain how both of them work.
Types of imputation:
Univariate imputation: Only examines the feature being imputed
SimpleImputer
Multivariate imputation: Takes multiple features into account
IterativeImputer
KNNImputer
IterativeImputer was introduced in scikit-learn version 0.21. It’s considered experimental, which means that the API and predictions may change in future versions. As such, scikit-learn will only allow you to import the IterativeImputer class from the impute module if you first import the enable_iterative_imputer function from the experimental module. This is scikit-learn’s way of requiring you to acknowledge that you’re using experimental functionality.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
Then, we’ll create an instance of IterativeImputer called imp_iterative, and pass three columns from X to its fit_transform method: Parch, Fare, and Age. Parch and Fare don’t have any missing values, and Age has 1 missing value. You can see that 24.24 was imputed for the missing value of Age.
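Here's the code being described (the full printed matrix is omitted):

imp_iterative = IterativeImputer()
# Parch and Fare have no missing values; the one missing Age value
# is filled with a value predicted from Parch and Fare (about 24.24)
imp_iterative.fit_transform(X[['Parch', 'Fare', 'Age']])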
For the 9 rows in which Age is not missing, scikit-learn trains a regression model in which Parch and Fare are the features and Age is the target.
For the 1 row in which Age is missing, scikit-learn passes the Parch and Fare values to the trained model. The model makes a prediction for Age, and that value is used for imputation.
In summary, IterativeImputer turned this into a regression problem with 2 features, 9 samples of training data, and 1 sample of new data.
How IterativeImputer works:
Age not missing: Train a regression model to predict Age using Parch and Fare
Age missing: Predict Age using trained model
There are four things I want to comment on about IterativeImputer:
First, it only works with numerical features. This is unlike SimpleImputer, which also works with categorical features.
Second, you have to decide which features to pass to IterativeImputer. My advice is to use features that you believe are highly correlated with one another.
Third, you are allowed to pass it multiple features with missing values. In other words, Parch, Fare, and Age could all have missing values. IterativeImputer will just do three different regression problems, and each column will take a turn being the target.
Fourth, you can actually choose which regression model IterativeImputer uses for the regression problem. By default, it uses Bayesian ridge regression.
Notes about IterativeImputer:
Only works with numerical features
You have to decide which features to include
You can include multiple features with missing values
You can choose the regression model
Anyway, if you decided you wanted to include it in the ColumnTransformer, you would just specify that Parch, Fare, and Age should be transformed by imp_iterative, and thus there would no longer be any passthrough columns.
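A sketch of what that would look like, again assuming the ohe and vect objects from the previous chapter:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_iterative, ['Parch', 'Fare', 'Age']))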
Our other option for multivariate imputation is KNNImputer, which was introduced in scikit-learn version 0.22 but is not considered experimental.
To use it, we’ll import it from the impute module, create an instance called imp_knn, and then pass the same three columns from X to the fit_transform method. This time, you’ll see that 30.5 was imputed for the missing value of Age.
from sklearn.impute import KNNImputer
imp_knn = KNNImputer(n_neighbors=2)
imp_knn.fit_transform(X[['Parch', 'Fare', 'Age']])
First, KNNImputer finds the row in which Age is missing.
Because I set the n_neighbors parameter to 2, scikit-learn finds the 2 nearest rows in which Age is not missing, where “nearest” is measured by how close their Parch and Fare values are to those of this row. In this case, the third and fifth rows are the two nearest rows.
Finally, scikit-learn calculates the mean of the Age values from those 2 rows. In this case, it takes the mean of 26 and 35 which is 30.5, and that value is used as the imputation value.
How KNNImputer works:
Find the row in which Age is missing
Find the n_neighbors “nearest” rows in which Age is not missing
Calculate the mean of Age from the “nearest” rows
You might be wondering how you should choose the value for n_neighbors. As you’ll see later in the book, transformer hyperparameters like this should be chosen through a tuning process in the same way that you would tune the hyperparameters for a model.
7.6 Q&A: What are the best practices for missing value imputation?
Missing value imputation may be considered dubious from a statistics point of view, but if your primary goal is prediction, then imputation has been shown experimentally to work well under certain conditions.
In this lesson, I’m going to summarize what I’ve learned from research papers about the best practices for effective missing value imputation.
Types of missing data:
Missing Completely At Random (MCAR):
No relationship between missingness and underlying data
Example: Booking agent forgot to gather Age
Missing Not At Random (MNAR):
Relationship between missingness and underlying data
Example: Older passengers declined to give their Age
Missing due to a structural deficiency:
Data omitted for a specific purpose
Example: Staff members did not pay a Fare
The best practices actually differ depending upon the type of missing data you have, and so I’m going to start with the first type of missing data, which is known as Missing Completely At Random or “MCAR”. Missing data is designated as MCAR if there’s no relationship between the missingness and the underlying data.
An example of this would be if Age was missing because the booking agent forgot to gather that information from a few of the passengers. In other words, the data is just as likely to be missing for an older person as a younger person, and thus missingness is not meaningfully related to Age.
If you determine that your data is MCAR, here’s what the research indicates:
If you have a small dataset, then IterativeImputer tends to be more effective than mean imputation.
If you have a large dataset, IterativeImputer and mean imputation tend to work equally well, though IterativeImputer will have a much higher computational cost.
And there’s no benefit to adding a missing indicator since the missingness is random.
Advice for MCAR imputation:
Small dataset: IterativeImputer is more effective than mean imputation
Large dataset: IterativeImputer and mean imputation work equally well
No benefit to adding a missing indicator
The next type of missing data is known as Missing Not At Random or “MNAR”. Missing data is designated as MNAR if there is a relationship between the missingness and the underlying data, which is commonly the case in real-world datasets.
I mentioned an example of this in a previous lesson, namely if Age was missing because some of the older passengers declined to give their ages. In other words, there is a relationship between missingness and Age.
If your data is MNAR, here is what the research indicates:
Mean imputation is more effective than IterativeImputer because IterativeImputer actually obscures the pattern that the model might otherwise be able to learn.
It’s important that you add a missing indicator so that the model can learn from the pattern of missingness.
And finally, it’s recommended to use a powerful, non-linear model, since mean imputation tends not to work as well in combination with a linear model.
Advice for MNAR imputation:
Mean imputation is more effective than IterativeImputer
Add a missing indicator
Use a powerful, non-linear model
The final type of missing data I’ll mention is data that’s missing due to a structural deficiency. This means that a value is missing because it was omitted for a specific purpose. An example of this would be if the passenger list included staff members, and their values for the Fare column were listed as missing to indicate that they did not pay a fare.
In the case of a structural deficiency, my advice is to impute the most logical and reasonable user-defined value, and also add a missing indicator. For the example I mentioned, it would make the most sense to insert a Fare value of 0 for all staff members.
Advice for structural deficiency imputation:
Impute a logical and reasonable user-defined value
Add a missing indicator
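For the Fare example, a minimal sketch of that imputer would be:

# fill missing Fare values with 0 and add a missing indicator column
imp_fare = SimpleImputer(strategy='constant', fill_value=0, add_indicator=True)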
All of these examples should make clear that if you’re going to impute missing values, it’s important to thoroughly understand your data before choosing an imputation strategy.
Finally, it’s worth reiterating that histogram-based gradient boosting trees in scikit-learn have built-in support for missing values. In other words, you can pass this type of model data with missing values, and it will handle them internally. This is significant because it has a lower computational cost than complex imputation strategies like IterativeImputer, and yet it can perform as well as or better than imputation across a variety of missing value scenarios.
Advantages of histogram-based gradient boosting trees:
Built-in support for missing values
Lower computational cost than IterativeImputer
Performs well across many missing value scenarios
Therefore, if you have a large dataset with a lot of missing values, it’s worth trying a histogram-based gradient boosting tree as your prediction model (with no imputation step) and comparing its performance with any other model that does require an imputation step.
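A minimal sketch of that approach, restricted to the numeric columns (in recent scikit-learn versions, HistGradientBoostingClassifier accepts NaNs directly, but the categorical and text columns would still need their own encoding, for example within a ColumnTransformer):

from sklearn.ensemble import HistGradientBoostingClassifier

# fit on numeric columns only, with no imputation step;
# the NaN in Age is handled internally by the model
hgb = HistGradientBoostingClassifier()
hgb.fit(X[['Parch', 'Fare', 'Age']], y)
hgb.predict(X_new[['Parch', 'Fare', 'Age']])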
7.7 Q&A: What’s the difference between ColumnTransformer and FeatureUnion?
As we saw earlier in this chapter, you can add a missing indicator to the output of SimpleImputer by setting the add_indicator parameter to True.
To create just the indicator column (the right column of that output matrix) on its own, we can use the MissingIndicator class. We’ll import the class from the impute module and create an instance called indicator.
from sklearn.impute import MissingIndicator
indicator = MissingIndicator()
When we pass Age to the fit_transform method, it outputs False and True instead of 0 and 1, but otherwise the results are identical to the right column of the matrix.
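Here's the call (the printed array is omitted):

# outputs a 10x1 array of booleans: True where Age is missing, False elsewhere
indicator.fit_transform(X[['Age']])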
The final step in recreating our original matrix is to stack these two columns side-by-side, which we can do by using the FeatureUnion class. A FeatureUnion applies multiple transformations to a single input column and stacks the results side-by-side.
We’ll import the make_union function from the pipeline module, and then create an instance called union that contains both the imp and indicator objects.
from sklearn.pipeline import make_union
union = make_union(imp, indicator)
When we pass Age to the union’s fit_transform method, it runs both the imp and indicator transformations and stacks the results side-by-side. Note that False and True are converted to 0 and 1 when included in a numeric array, and thus we’ve recreated our original matrix.
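Here's the call (the printed matrix is omitted):

# column 1: imputed Age values; column 2: missing indicator (0 or 1)
union.fit_transform(X[['Age']])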
Alternatively, we could have achieved the same results using a ColumnTransformer. Specifically, we could have passed the Age column to two separate transformations.
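A sketch of that alternative, taking advantage of the fact that a ColumnTransformer allows the same input column to be listed for more than one transformer:

from sklearn.compose import make_column_transformer

# Age is passed to both the imputer and the missing indicator
ct_age = make_column_transformer(
    (imp, ['Age']),
    (indicator, ['Age']))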
Stepping back, here’s a quick comparison between FeatureUnion and ColumnTransformer:
A FeatureUnion works on a single input column, and applies multiple different transformations to that one column in parallel.
A ColumnTransformer works on multiple input columns, and applies a different transformation to each input column in parallel.
FeatureUnion vs ColumnTransformer:
FeatureUnion:
Single input column
Applies multiple different transformations to that column in parallel
ColumnTransformer:
Multiple input columns
Applies a different transformation to each column in parallel
You can see that ColumnTransformer is far more flexible than FeatureUnion, and so my recommendation is that you use ColumnTransformer to do all of your transformations. For the rare case in which you need to apply multiple different transformations in parallel to the same column, you can use either a FeatureUnion or a ColumnTransformer, depending upon which solution makes more sense to you.