In this chapter, we’re going to deal with an issue that is common in real datasets, namely missing values.
A missing value is simply a value that does not exist in the dataset. It might be missing because that value purposefully wasn’t collected for that sample, or it might be missing due to an error in the data collection process.
Common sources of missing values:
Value purposefully wasn’t collected
Error in the data collection process
Let’s see an example in the Titanic dataset. We want to use Age as a feature, but note that it has a missing value, encoded as “NaN”. This stands for “not a number”, and it’s how missing values are typically encoded in a pandas DataFrame.
df
   Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5         0       3                                   Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
6         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46        S
7         0       3                     Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
8         1       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2            347742  11.1333   NaN        S
9         1       2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1      0            237736  30.0708   NaN        C
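If you want to confirm which columns contain missing values, pandas can count the NaNs for you. Here's a minimal sketch (the exact counts depend on how many rows of the dataset you're working with):

# count the number of missing values in each column
df.isna().sum()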
When we use the phrase “missing values”, we’re talking about NaNs only. The phrase does not refer to categories or words in new data that weren’t seen during training. For example, if our new data contained the value “Z” in the Embarked column, that would not be called a “missing value”; rather, it would be called “an unknown category” or “a category you didn’t see during training”.
Missing values vs unknown categories:
Missing value: Value encoded as “NaN”
Unknown category: Category not seen in the training data
To start our exploration of this topic, let’s see what happens if we try to add the Age column to our model.
First, we’ll add Age to the cols list and use that to modify the X DataFrame. Again, notice the missing value in Age.
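Here's a sketch of that step. The cols list comes from the previous chapter, so the columns other than Age are assumptions based on the X output shown later in this lesson:

# feature columns from the previous chapter, plus Age
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
X = df[cols]
X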
Then, we’ll update the Pipeline to include the modified ColumnTransformer.
pipe = make_pipeline(ct, logreg)
Finally, we’ll attempt to fit the Pipeline, but it throws an error due to the presence of a missing value.
pipe.fit(X, y)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
scikit-learn models generally don’t accept data with missing values, with the exception of histogram-based gradient boosting trees. As such, we’ll need to figure out a way to handle the missing value if we want to include Age in our model.
7.2 Three ways to handle missing values
Let’s talk about three different ways we can handle missing values in our dataset.
The first way is to drop any rows from the DataFrame that have missing values. We can use the dropna method in pandas to do this, and you’ll notice that row 5 is now gone. Note that you would also need to drop the same row from y.
X.dropna()
   Parch     Fare Embarked     Sex                                               Name   Age
0      0   7.2500        S    male                            Braund, Mr. Owen Harris  22.0
1      0  71.2833        C  female  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0
2      0   7.9250        S  female                             Heikkinen, Miss. Laina  26.0
3      0  53.1000        S  female       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0
4      0   8.0500        S    male                           Allen, Mr. William Henry  35.0
6      0  51.8625        S    male                            McCarthy, Mr. Timothy J  54.0
7      1  21.0750        S    male                     Palsson, Master. Gosta Leonard   2.0
8      2  11.1333        S  female  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  27.0
9      0  30.0708        C  female                Nasser, Mrs. Nicholas (Adele Achem)  14.0
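As noted above, the same row would also need to be dropped from y so that X and y stay aligned. One way to do that (a sketch, assuming y is the Survived Series created in an earlier chapter) is to build a boolean mask and apply it to both:

# keep only the rows with no missing values, in both X and y
mask = X.notna().all(axis='columns')
X_subset = X[mask]
y_subset = y[mask]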
However, there are three main problems with this approach:
First, if a high proportion of your rows have missing values, then this approach is impractical because it will discard too much of your training data.
Second, if there’s a useful pattern in the “missingness”, then dropping the rows will obscure this pattern.
Third, this takes care of the training data, but it doesn’t help you deal with any missing values that might appear in new data.
Approach 1: Drop rows with missing values
May discard too much training data
May obscure a pattern in the “missingness”
Doesn’t help you with new data
A second option for handling missing values is to drop any columns that have missing values. Again, we can use the dropna method to do this if we set the axis parameter to “columns”. You can see that row 5 is back, but the Age column is now gone.
X.dropna(axis='columns')
   Parch     Fare Embarked     Sex                                               Name
0      0   7.2500        S    male                            Braund, Mr. Owen Harris
1      0  71.2833        C  female  Cumings, Mrs. John Bradley (Florence Briggs Th...
2      0   7.9250        S  female                             Heikkinen, Miss. Laina
3      0  53.1000        S  female       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4      0   8.0500        S    male                           Allen, Mr. William Henry
5      0   8.4583        Q    male                                   Moran, Mr. James
6      0  51.8625        S    male                            McCarthy, Mr. Timothy J
7      1  21.0750        S    male                     Palsson, Master. Gosta Leonard
8      2  11.1333        S  female  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9      0  30.0708        C  female                Nasser, Mrs. Nicholas (Adele Achem)
The main problem with this approach is that you’re discarding a feature that may be useful to your model.
Approach 2: Drop columns with missing values
May discard useful features
A third option for handling missing values is to impute missing values. Imputation means that you’re filling in missing values based on what you know from the non-missing data. Before proceeding with imputation, it’s important to carefully consider its costs and benefits:
The benefit is that you’re able to keep more samples and features in your dataset, which may help to improve your model.
The cost is that you’re inserting values that may not match the true, unknown values, which may make your model less reliable.
Approach 3: Impute missing values
Benefit: Keeps more samples and features
Cost: Imputed values may not match the true values
When making this decision, here are some useful factors to consider:
How important are the particular samples that imputation would allow you to keep?
How important are the particular features that imputation would allow you to keep?
What percentage of the values in a feature would need to be imputed?
Are there other samples or features, without missing values, that are highly correlated with the ones that have missing values, such that the model wouldn’t be hurt by dropping the incomplete ones?
Is the missingness random, or is there a useful pattern in the missingness that would be lost if those samples or features were dropped?
Factors to consider before imputing:
How important are the samples?
How important are the features?
What percentage of values would need to be imputed?
Are there other samples or features that contain the same information?
Is the missingness random?
Imputation is a complex topic, and ultimately I can’t tell you whether you should impute missing values in your particular situation. Instead, I’m going to show you how to do imputation in scikit-learn, and you can decide whether or not to do it. As well, I’ll provide some advice for imputation in lesson 7.6.
7.3 Missing value imputation
In this lesson, we’re going to perform missing value imputation on the Age column so that we can include it in our model. There are three different imputers that we can use in scikit-learn, but we’re going to start with SimpleImputer and I’ll show you the others later in the chapter.
First, we’ll import SimpleImputer from the impute module and create an instance called imp.
from sklearn.impute import SimpleImputer
imp = SimpleImputer()
Then, we’ll pass the Age column to the fit_transform method to perform the imputation. Note that SimpleImputer, like most transformers, requires 2-dimensional input, which is why there are double brackets around Age.
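Here's the call being described (the printed array is omitted):

# double brackets select Age as a one-column DataFrame (2-dimensional input)
imp.fit_transform(X[['Age']])
# the result is a 10x1 array in which the missing value in row 5 has been
# replaced by the mean of the other 9 ages, approximately 28.11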
By default, SimpleImputer fills any missing values with the mean of the non-missing values, which is 28.11 in this case. It also supports other imputation strategies, namely the median value, the most frequent value, and a user-defined value. Note that all of these strategies can be applied to numeric features, but only the most frequent and user-defined strategies can be applied to categorical features.
Simple imputation strategies:
Mean value
Median value
Most frequent value
User-defined value
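If you want a strategy other than the default mean, you select it with the strategy parameter when creating the imputer. A brief sketch:

# median of the non-missing values
imp_median = SimpleImputer(strategy='median')

# most frequent non-missing value (also works for categorical features)
imp_frequent = SimpleImputer(strategy='most_frequent')

# user-defined value (also works for categorical features)
imp_constant = SimpleImputer(strategy='constant', fill_value=0)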
In order to confirm what value was imputed, we can examine the statistics_ attribute, which was learned from the data during the fit step.
imp.statistics_
array([28.11111111])
Now that we know how SimpleImputer works, we’ll update the ColumnTransformer to include imputation of the Age column. As a reminder, brackets are required around Age because SimpleImputer expects 2-dimensional input, whereas brackets are not allowed around Name because CountVectorizer expects 1-dimensional input.
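Here's a sketch of the updated ColumnTransformer. The ohe and vect objects (a OneHotEncoder and a CountVectorizer) are assumed to have been created in the previous chapter:

from sklearn.compose import make_column_transformer

# Parch and Fare are passed through unmodified
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')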
Now we’ll run the fit_transform method, and you can see that there are now 48 columns in the feature matrix, whereas in the last chapter there were 47.
ct.fit_transform(X)
<10x48 sparse matrix of type '<class 'numpy.float64'>'
with 88 stored elements in Compressed Sparse Row format>
Next, we’ll update the Pipeline to include the revised ColumnTransformer, and fit it on X and y. There was no error this time because the one missing value in X was imputed.
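The code is the same pattern we used before:

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)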
Next, we’ll update the X_new DataFrame to use the same columns as X.
X_new = df_new[cols]
X_new
   Parch     Fare Embarked     Sex                                          Name   Age
0      0   7.8292        Q    male                              Kelly, Mr. James  34.5
1      0   7.0000        S  female              Wilkes, Mrs. James (Ellen Needs)  47.0
2      0   9.6875        Q    male                     Myles, Mr. Thomas Francis  62.0
3      0   8.6625        S    male                              Wirz, Mr. Albert  27.0
4      1  12.2875        S  female  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22.0
5      0   9.2250        S    male                    Svensson, Mr. Johan Cervin  14.0
6      0   7.6292        Q  female                          Connolly, Miss. Kate  30.0
7      1  29.0000        S    male                  Caldwell, Mr. Albert Francis  26.0
8      0   7.2292        C  female     Abrahim, Mrs. Joseph (Sophie Halaut Easu)  18.0
9      0  24.1500        S    male                       Davies, Mr. John Samuel  21.0
Finally, we’ll use the fitted Pipeline to make predictions for X_new.
pipe.predict(X_new)
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
With respect to imputation, it’s worth talking about exactly what happens during the predict step. Since X_new didn’t have any missing values in the Age column, nothing was imputed during prediction.
But let’s pretend that X_new did have a missing value in Age. If that were the case, the imputed value would have been the mean of Age in X, which is 28.11, not the mean of Age in X_new. This is critically important: a transformer is only allowed to learn from the training data, and it then applies what it learned to both the training data and the new data.
What would have been imputed for Age in X_new?
Imputed value would be the mean of Age in X, not the mean of Age in X_new
Transformer is only allowed to learn from the training data
As we’ve already seen in this book, the OneHotEncoder learns its categories from the training data, the CountVectorizer learns its vocabulary from the training data, and similarly, the SimpleImputer learns its imputation value from the training data. This is one of the main reasons why we run fit_transform on the training data but transform only on the new data.
What do transformers learn from the training data?
OneHotEncoder: Learns categories
CountVectorizer: Learns vocabulary
SimpleImputer: Learns imputation value
If you’re struggling with this concept, here’s another way of looking at it that might be helpful to you. Pretend for a second that X_new only contained a single sample, and that sample had a missing Age value. If you passed that X_new to the predict method, it’s clear that scikit-learn has to look to the training data for the imputation value, since there would be no other Age values in X_new for it to examine.
7.4 Using “missingness” as a feature
When imputing missing values, it’s also possible to use “missingness” as a feature of its own.
Starting in scikit-learn version 0.21, we can set the add_indicator parameter to True when creating a SimpleImputer.
imp_indicator = SimpleImputer(add_indicator=True)
Then when we use fit_transform, a separate column known as a “missing indicator” is included in the output matrix, and it indicates the presence of missing values.
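Here's what that looks like (a sketch; the printed matrix is omitted):

# the output has two columns: the imputed Age values, plus an indicator
# column that is 1 where Age was missing and 0 elsewhere
imp_indicator.fit_transform(X[['Age']])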
Adding a missing indicator is useful when the data is not missing at random, since there might be a relationship between the “missingness” and the target value.
Why add a missing indicator?
Useful when the data is not missing at random
Can encode the relationship between “missingness” and the target value
For example, if Age was missing for some samples because older passengers declined to give their ages, and older passengers are more likely to have survived, then there is a relationship between Age being missing and the likelihood of survival. Thus the missingness itself can be a useful feature, and we can include that feature in the model using a missing indicator.
We’re not going to modify our Pipeline to include a missing indicator at this time, but we’ll return to this concept later in the book.
7.5 Q&A: How do I perform multivariate imputation?
So far in this chapter, we’ve used SimpleImputer for imputation. SimpleImputer does univariate imputation, which means that it only looks at the feature being imputed when deciding what values to impute. Thus when imputing missing values for Age, SimpleImputer only considers the values in the Age column.
However, there’s another type of imputation called multivariate imputation. The intuition of multivariate imputation is that it can be useful to take other features into account when deciding what value to impute.
For example, maybe a high Parch and a low Fare are common for kids. Thus if Age is missing for a row that has a high Parch and a low Fare, you should impute a low Age rather than the mean of Age, which is what SimpleImputer would do.
Multivariate imputation is available in scikit-learn via the IterativeImputer and KNNImputer classes, and in this lesson I’ll explain how both of them work.
Types of imputation:
Univariate imputation: Only examines the feature being imputed
SimpleImputer
Multivariate imputation: Takes multiple features into account
IterativeImputer
KNNImputer
IterativeImputer was introduced in scikit-learn version 0.21. It’s considered experimental, which means that the API and predictions may change in future versions. As such, scikit-learn will only allow you to import the IterativeImputer class from the impute module if you first import the enable_iterative_imputer function from the experimental module. This is scikit-learn’s way of requiring you to acknowledge that you’re using experimental functionality.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
Then, we’ll create an instance of IterativeImputer called imp_iterative, and pass three columns from X to its fit_transform method: Parch, Fare, and Age. Parch and Fare don’t have any missing values, and Age has 1 missing value. You can see that 24.24 was imputed for the missing value of Age.
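Here's the code being described (the full printed matrix is omitted):

imp_iterative = IterativeImputer()
# Parch and Fare have no missing values; the one missing Age value
# is filled with a value predicted from Parch and Fare (about 24.24)
imp_iterative.fit_transform(X[['Parch', 'Fare', 'Age']])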
For the 9 rows in which Age is not missing, scikit-learn trains a regression model in which Parch and Fare are the features and Age is the target.
For the 1 row in which Age is missing, scikit-learn passes the Parch and Fare values to the trained model. The model makes a prediction for Age, and that value is used for imputation.
In summary, IterativeImputer turned this into a regression problem with 2 features, 9 samples of training data, and 1 sample of new data.
How IterativeImputer works:
Age not missing: Train a regression model to predict Age using Parch and Fare
Age missing: Predict Age using trained model
There are four things I want to comment on about IterativeImputer:
First, it only works with numerical features. This is unlike SimpleImputer, which also works with categorical features.
Second, you have to decide which features to pass to IterativeImputer. My advice is to use features that you believe are highly correlated with one another.
Third, you are allowed to pass it multiple features with missing values. In other words, Parch, Fare, and Age could all have missing values. IterativeImputer will just do three different regression problems, and each column will take a turn being the target.
Fourth, you can actually choose which regression model IterativeImputer uses for the regression problem. By default, it uses Bayesian ridge regression.
Notes about IterativeImputer:
Only works with numerical features
You have to decide which features to include
You can include multiple features with missing values
You can choose the regression model
Anyway, if you decided you wanted to include it in the ColumnTransformer, you would just specify that Parch, Fare, and Age should be transformed by imp_iterative, and thus there would no longer be any passthrough columns.
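A sketch of what that would look like, again assuming the ohe and vect objects from the previous chapter:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_iterative, ['Parch', 'Fare', 'Age']))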
Our other option for multivariate imputation is KNNImputer, which was introduced in scikit-learn version 0.22 but is not considered experimental.
To use it, we’ll import it from the impute module, create an instance called imp_knn, and then pass the same three columns from X to the fit_transform method. This time, you’ll see that 30.5 was imputed for the missing value of Age.
from sklearn.impute import KNNImputer
imp_knn = KNNImputer(n_neighbors=2)
imp_knn.fit_transform(X[['Parch', 'Fare', 'Age']])
First, KNNImputer finds the row in which Age is missing.
Because I set the n_neighbors parameter to 2, scikit-learn finds the 2 nearest rows in which Age is not missing, where “nearest” is measured by how close their Parch and Fare values are to those of this row. In this case, the third and fifth rows are the two nearest rows.
Finally, scikit-learn calculates the mean of the Age values from those 2 rows. In this case, it takes the mean of 26 and 35 which is 30.5, and that value is used as the imputation value.
How KNNImputer works:
Find the row in which Age is missing
Find the n_neighbors “nearest” rows in which Age is not missing
Calculate the mean of Age from the “nearest” rows
You might be wondering how you should choose the value for n_neighbors. As you’ll see later in the book, transformer hyperparameters like this should be chosen through a tuning process in the same way that you would tune the hyperparameters for a model.
7.6 Q&A: What are the best practices for missing value imputation?
Missing value imputation may be considered dubious from a statistics point of view, but if your primary goal is prediction, then imputation has been shown experimentally to work well under certain conditions.
In this lesson, I’m going to summarize what I’ve learned from research papers about the best practices for effective missing value imputation.
Types of missing data:
Missing Completely At Random (MCAR):
No relationship between missingness and underlying data
Example: Booking agent forgot to gather Age
Missing Not At Random (MNAR):
Relationship between missingness and underlying data
Example: Older passengers declined to give their Age
Missing due to a structural deficiency:
Data omitted for a specific purpose
Example: Staff members did not pay a Fare
The best practices actually differ depending upon the type of missing data you have, and so I’m going to start with the first type of missing data, which is known as Missing Completely At Random or “MCAR”. Missing data is designated as MCAR if there’s no relationship between the missingness and the underlying data.
An example of this would be if Age was missing because the booking agent forgot to gather that information from a few of the passengers. In other words, the data is just as likely to be missing for an older person as a younger person, and thus missingness is not meaningfully related to Age.
If you determine that your data is MCAR, here’s what the research indicates:
If you have a small dataset, then IterativeImputer tends to be more effective than mean imputation.
If you have a large dataset, IterativeImputer and mean imputation tend to work equally well, though IterativeImputer will have a much higher computational cost.
And there’s no benefit to adding a missing indicator since the missingness is random.
Advice for MCAR imputation:
Small dataset: IterativeImputer is more effective than mean imputation
Large dataset: IterativeImputer and mean imputation work equally well
No benefit to adding a missing indicator
The next type of missing data is known as Missing Not At Random or “MNAR”. Missing data is designated as MNAR if there is a relationship between the missingness and the underlying data, which is commonly the case in real-world datasets.
I mentioned an example of this in a previous lesson, namely if Age was missing because some of the older passengers declined to give their ages. In other words, there is a relationship between missingness and Age.
If your data is MNAR, here is what the research indicates:
Mean imputation is more effective than IterativeImputer because IterativeImputer actually obscures the pattern that the model might otherwise be able to learn.
It’s important that you add a missing indicator so that the model can learn from the pattern of missingness.
And finally, it’s recommended to use a powerful, non-linear model, since mean imputation tends not to work as well in combination with a linear model.
Advice for MNAR imputation:
Mean imputation is more effective than IterativeImputer
Add a missing indicator
Use a powerful, non-linear model
The final type of missing data I’ll mention is data that’s missing due to a structural deficiency. This means that a value is missing because it was omitted for a specific purpose. An example of this would be if the passenger list included staff members, and their values for the Fare column were listed as missing to indicate that they did not pay a fare.
In the case of a structural deficiency, my advice is to impute the most logical and reasonable user-defined value, and also add a missing indicator. For the example I mentioned, it would make the most sense to insert a Fare value of 0 for all staff members.
Advice for structural deficiency imputation:
Impute a logical and reasonable user-defined value
Add a missing indicator
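For the Fare example, a minimal sketch of that imputer would be:

# fill missing Fare values with 0 and add a missing indicator column
imp_fare = SimpleImputer(strategy='constant', fill_value=0, add_indicator=True)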
All of these examples should make clear that if you’re going to impute missing values, it’s important to thoroughly understand your data before choosing an imputation strategy.
Finally, it’s worth reiterating that histogram-based gradient boosting trees in scikit-learn have built-in support for missing values. In other words, you can pass this type of model data with missing values, and it will handle them internally. This is significant because it has a lower computational cost than complex imputation strategies like IterativeImputer, and yet it can perform as well as or better than imputation across a variety of missing value scenarios.
Advantages of histogram-based gradient boosting trees:
Built-in support for missing values
Lower computational cost than IterativeImputer
Performs well across many missing value scenarios
Therefore, if you have a large dataset with a lot of missing values, it’s worth trying a histogram-based gradient boosting tree as your prediction model (with no imputation step) and comparing its performance with any other model that does require an imputation step.
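A minimal sketch of that approach, restricted to the numeric columns (in recent scikit-learn versions, HistGradientBoostingClassifier accepts NaNs directly, but the categorical and text columns would still need their own encoding, for example within a ColumnTransformer):

from sklearn.ensemble import HistGradientBoostingClassifier

# fit on numeric columns only, with no imputation step;
# the NaN in Age is handled internally by the model
hgb = HistGradientBoostingClassifier()
hgb.fit(X[['Parch', 'Fare', 'Age']], y)
hgb.predict(X_new[['Parch', 'Fare', 'Age']])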
7.7 Q&A: What’s the difference between ColumnTransformer and FeatureUnion?
As we saw earlier in this chapter, you can add a missing indicator to the output of SimpleImputer by setting the add_indicator parameter to True.
To create just the indicator column (the right column of that output matrix) on its own, we can use the MissingIndicator class. We’ll import the class from the impute module and create an instance called indicator.
from sklearn.impute import MissingIndicator
indicator = MissingIndicator()
When we pass Age to the fit_transform method, it outputs False and True instead of 0 and 1, but otherwise the results are identical to the right column of the matrix.
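Here's the call (the printed array is omitted):

# outputs a 10x1 array of booleans: True where Age is missing, False elsewhere
indicator.fit_transform(X[['Age']])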
The final step in recreating our original matrix is to stack these two columns side-by-side, which we can do by using the FeatureUnion class. A FeatureUnion applies multiple transformations to a single input column and stacks the results side-by-side.
We’ll import the make_union function from the pipeline module, and then create an instance called union that contains both the imp and indicator objects.
from sklearn.pipeline import make_union
union = make_union(imp, indicator)
When we pass Age to the union’s fit_transform method, it runs both the imp and indicator transformations and stacks the results side-by-side. Note that False and True are converted to 0 and 1 when included in a numeric array, and thus we’ve recreated our original matrix.
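Here's the call (the printed matrix is omitted):

# column 1: imputed Age values; column 2: missing indicator (0 or 1)
union.fit_transform(X[['Age']])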
Alternatively, we could have achieved the same results using a ColumnTransformer. Specifically, we could have passed the Age column to two separate transformations.
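A sketch of that alternative, taking advantage of the fact that a ColumnTransformer allows the same input column to be listed for more than one transformer:

from sklearn.compose import make_column_transformer

# Age is passed to both the imputer and the missing indicator
ct_age = make_column_transformer(
    (imp, ['Age']),
    (indicator, ['Age']))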
Stepping back, here’s a quick comparison between FeatureUnion and ColumnTransformer:
A FeatureUnion works on a single input column, and applies multiple different transformations to that one column in parallel.
A ColumnTransformer works on multiple input columns, and applies a different transformation to each input column in parallel.
FeatureUnion vs ColumnTransformer:
FeatureUnion:
Single input column
Applies multiple different transformations to that column in parallel
ColumnTransformer:
Multiple input columns
Applies a different transformation to each column in parallel
You can see that ColumnTransformer is far more flexible than FeatureUnion, and so my recommendation is that you use ColumnTransformer to do all of your transformations. For the rare case in which you need to apply multiple different transformations in parallel to the same column, you can use either a FeatureUnion or a ColumnTransformer, depending upon which solution makes more sense to you.