Let’s talk about categorical features. There are two types of categorical features that we’ve covered in the book: nominal features, which have unordered categories, and ordinal features, which have ordered categories.
So far, here’s the advice that I’ve given for encoding nominal and ordinal features:

- For a nominal feature, use OneHotEncoder, and it will output one column for each category.
- For an ordinal feature, use OrdinalEncoder, and it will output a single column using the category ordering that you define.

Let’s do a quick recap as to why OneHotEncoder is the preferred approach for a nominal feature, using Embarked as an example.
Embarked has three categories, so OneHotEncoder would output 3 features. From each of the three features, the model can learn the relationship between the target value and whether or not a given passenger embarked at that port. For example, the model might learn from the first feature that passengers who embarked at C have a higher survival rate than passengers who didn’t embark at C.
If you were to instead use OrdinalEncoder with Embarked, it would output 1 feature. This is problematic because it would imply an ordering of the categories that doesn’t inherently exist. For example, if passengers who embarked at C and S had high survival rates, and passengers who embarked at Q had low survival rates, there’s no way for a linear model to learn this relationship if Embarked is encoded as a single feature.
df['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
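To make the difference concrete, here’s a minimal pure-Python sketch (not the scikit-learn implementation) of what each encoder produces for the three Embarked categories:

```python
# Hypothetical toy sketch of the two encodings for Embarked (not sklearn code).
categories = ['C', 'Q', 'S']  # alphabetical, as the encoders would learn them

def one_hot(value):
    # One column per category: 1 in the matching column, 0 elsewhere.
    return [1 if value == cat else 0 for cat in categories]

def ordinal(value):
    # A single column holding the category's integer position.
    return categories.index(value)

print(one_hot('C'))   # [1, 0, 0]
print(one_hot('S'))   # [0, 0, 1]
print(ordinal('C'))   # 0
print(ordinal('S'))   # 2
```

Notice that the single ordinal column imposes C < Q < S, an ordering that doesn’t actually mean anything for ports of embarkation.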
In this chapter, we’re going to explore whether this advice still holds when you have high-cardinality categorical features, which are categorical features with lots of unique values.
We’ll use a new dataset for this chapter: US census data from 1994. We’ll read the dataset into a new DataFrame called census.

census = pd.read_csv('http://bit.ly/censusdataset')
We’re only going to use the categorical features, which we can explore by using the DataFrame describe method.

census.describe(include='object')
|  | workclass | education | marital-status | occupation | relationship | race | sex | native-country | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 2 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43832 | 37155 |
From the row labeled “unique”, you can see that education, occupation, and native-country all have more than 10 unique values. There’s no hard-and-fast rule for what counts as a high-cardinality feature, but all three of these could be considered high-cardinality since they have a lot of unique values.
You can’t tell from this display, but these 8 features are all nominal, with the exception of education, which does have a logical ordering. However, we’re going to treat education as nominal for this experiment.
The column labeled “class” is actually our target. This column indicates whether the person has an income of more or less than fifty thousand dollars a year.
We can view the class proportions by normalizing the output of value_counts.

census['class'].value_counts(normalize=True)
<=50K 0.760718
>50K 0.239282
Name: class, dtype: float64
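These proportions follow directly from the counts in the describe output (37,155 of the 48,842 samples are <=50K), and the majority-class proportion doubles as the baseline accuracy that any model should beat. A quick check:

```python
# Class counts taken from the describe output above.
total = 48842
below_50k = 37155          # freq of the top class, <=50K
above_50k = total - below_50k

print(round(below_50k / total, 6))  # 0.760718
print(round(above_50k / total, 6))  # 0.239282
```

A model that always predicts <=50K would therefore be about 76% accurate, which is the number to keep in mind when we look at cross-validated accuracies later.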
When defining our X DataFrame, which I’m calling census_X, we’re only going to use the 8 categorical columns, which I’ve listed out manually. And we’ll use class as our y Series, which I’m calling census_y.

census_cols = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']
census_X = census[census_cols]
census_y = census['class']
We’re going to be testing the effectiveness of both OneHotEncoder and OrdinalEncoder with these 8 features. For this experiment, we would normally just create instances using the default arguments.

ohe = OneHotEncoder()
oe = OrdinalEncoder()
Notice that we created an instance of OrdinalEncoder without defining the category ordering. This is because we’re treating all of the features as nominal, and as such there is no logical ordering.

As a result, OrdinalEncoder would simply learn the categories for each feature in alphabetical order, which we can confirm by fitting the OrdinalEncoder and checking the categories_ attribute.
oe.fit(census_X).categories_
[array([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
dtype=object),
array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
' HS-grad', ' Masters', ' Preschool', ' Prof-school',
' Some-college'], dtype=object),
array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
' Married-spouse-absent', ' Never-married', ' Separated',
' Widowed'], dtype=object),
array([' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair',
' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners',
' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support',
' Transport-moving'], dtype=object),
array([' Husband', ' Not-in-family', ' Other-relative', ' Own-child',
' Unmarried', ' Wife'], dtype=object),
array([' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other',
' White'], dtype=object),
array([' Female', ' Male'], dtype=object),
array([' ?', ' Cambodia', ' Canada', ' China', ' Columbia', ' Cuba',
' Dominican-Republic', ' Ecuador', ' El-Salvador', ' England',
' France', ' Germany', ' Greece', ' Guatemala', ' Haiti',
' Holand-Netherlands', ' Honduras', ' Hong', ' Hungary', ' India',
' Iran', ' Ireland', ' Italy', ' Jamaica', ' Japan', ' Laos',
' Mexico', ' Nicaragua', ' Outlying-US(Guam-USVI-etc)', ' Peru',
' Philippines', ' Poland', ' Portugal', ' Puerto-Rico',
' Scotland', ' South', ' Taiwan', ' Thailand', ' Trinadad&Tobago',
' United-States', ' Vietnam', ' Yugoslavia'], dtype=object)]
That being said, we’re actually going to run into a problem with encoding due to our highest-cardinality feature, native-country. Let’s take a look at it and see why.
census_X['native-country'].value_counts()
United-States 43832
Mexico 951
? 857
Philippines 295
Germany 206
Puerto-Rico 184
Canada 182
El-Salvador 155
India 151
Cuba 138
England 127
China 122
South 115
Jamaica 106
Italy 105
Dominican-Republic 103
Japan 92
Guatemala 88
Poland 87
Vietnam 86
Columbia 85
Haiti 75
Portugal 67
Taiwan 65
Iran 59
Nicaragua 49
Greece 49
Peru 46
Ecuador 45
France 38
Ireland 37
Thailand 30
Hong 30
Cambodia 28
Trinadad&Tobago 27
Laos 23
Yugoslavia 23
Outlying-US(Guam-USVI-etc) 23
Scotland 21
Honduras 20
Hungary 19
Holand-Netherlands 1
Name: native-country, dtype: int64
You can see that one of the categories appears only once in the dataset. As we talked about in lesson 15.7, rare category values can cause problems with cross-validation.
In this case, it will definitely create a problem, because that sample is guaranteed to appear in the test fold but not the training fold during one of the runs of cross-validation. That will cause an error for both OneHotEncoder and OrdinalEncoder.
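Here’s a minimal pure-Python sketch of the problem (hypothetical data, not the actual fold split): an encoder “fitted” on the training fold has no integer code for a category it never saw.

```python
# The lone ' Holand-Netherlands' sample lands in some test fold, so the
# encoder fitted on the training fold has never seen that category.
train = ['United-States', 'Mexico', 'Germany']
test = ['United-States', 'Holand-Netherlands']

# "Fit": learn an integer code per category, as OrdinalEncoder does.
mapping = {cat: i for i, cat in enumerate(sorted(set(train)))}

# "Transform": an unseen category has no entry, which is the error case.
for value in test:
    if value in mapping:
        print(value, '->', mapping[value])
    else:
        print(value, '-> unseen category (this is where the error occurs)')
```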
In the case of OneHotEncoder, the solution is simply to set the handle_unknown parameter to 'ignore'.
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
Starting in scikit-learn version 0.24, OrdinalEncoder has a similar handle_unknown parameter that could be used for this situation.
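If you’re on scikit-learn 0.24 or later, that option looks something like the sketch below (a toy example, not our census pipeline): setting handle_unknown='use_encoded_value' together with unknown_value assigns a sentinel code to any category not seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Requires scikit-learn >= 0.24. Unseen categories get the sentinel -1
# instead of raising an error during transform.
oe_unknown = OrdinalEncoder(handle_unknown='use_encoded_value',
                            unknown_value=-1)
oe_unknown.fit(np.array([['S'], ['C'], ['Q']]))

# 'C' was seen during fit (code 0.0); 'X' was not (code -1.0).
print(oe_unknown.transform(np.array([['C'], ['X']])))
```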
But for now, the best solution is to define the categories in advance for each feature using a list comprehension. The list comprehension iterates through the feature columns, extracts the unique values from each column, and stores the result in a list called cats.

cats = [census_X[col].unique() for col in census_cols]
cats
[array([' Private', ' Local-gov', ' ?', ' Self-emp-not-inc',
' Federal-gov', ' State-gov', ' Self-emp-inc', ' Without-pay',
' Never-worked'], dtype=object),
array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
' Preschool'], dtype=object),
array([' Never-married', ' Married-civ-spouse', ' Widowed', ' Divorced',
' Separated', ' Married-spouse-absent', ' Married-AF-spouse'],
dtype=object),
array([' Machine-op-inspct', ' Farming-fishing', ' Protective-serv', ' ?',
' Other-service', ' Prof-specialty', ' Craft-repair',
' Adm-clerical', ' Exec-managerial', ' Tech-support', ' Sales',
' Priv-house-serv', ' Transport-moving', ' Handlers-cleaners',
' Armed-Forces'], dtype=object),
array([' Own-child', ' Husband', ' Not-in-family', ' Unmarried', ' Wife',
' Other-relative'], dtype=object),
array([' Black', ' White', ' Asian-Pac-Islander', ' Other',
' Amer-Indian-Eskimo'], dtype=object),
array([' Male', ' Female'], dtype=object),
array([' United-States', ' ?', ' Peru', ' Guatemala', ' Mexico',
' Dominican-Republic', ' Ireland', ' Germany', ' Philippines',
' Thailand', ' Haiti', ' El-Salvador', ' Puerto-Rico', ' Vietnam',
' South', ' Columbia', ' Japan', ' India', ' Cambodia', ' Poland',
' Laos', ' England', ' Cuba', ' Taiwan', ' Italy', ' Canada',
' Portugal', ' China', ' Nicaragua', ' Honduras', ' Iran',
' Scotland', ' Jamaica', ' Ecuador', ' Yugoslavia', ' Hungary',
' Hong', ' Greece', ' Trinadad&Tobago',
' Outlying-US(Guam-USVI-etc)', ' France', ' Holand-Netherlands'],
dtype=object)]
Then, we can pass the cats list to the categories parameter of OrdinalEncoder when creating an instance. This solves our problem, since no unknown categories will ever appear during cross-validation.

oe_cats = OrdinalEncoder(categories=cats)
Now that we’ve set up our OneHotEncoder, called ohe_ignore, and our OrdinalEncoder, called oe_cats, let’s see what happens when we pass census_X to fit_transform and then check the shape.

As expected, the OneHotEncoder creates a lot of columns due to the high-cardinality features, whereas the OrdinalEncoder creates only one column for each of the eight features.
ohe_ignore.fit_transform(census_X).shape
(48842, 102)
oe_cats.fit_transform(census_X).shape
(48842, 8)
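The 102 columns line up exactly with the “unique” row of the describe output: one column per category, summed across the eight features. A quick arithmetic check:

```python
# Unique category counts per feature, from the describe output above.
unique_counts = {'workclass': 9, 'education': 16, 'marital-status': 7,
                 'occupation': 15, 'relationship': 6, 'race': 5,
                 'sex': 2, 'native-country': 42}

print(sum(unique_counts.values()))  # 102, matching OneHotEncoder's output
print(len(unique_counts))           # 8, matching OrdinalEncoder's output
```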
Now let’s actually test the advice that I’ve given, which is that OneHotEncoder should be used for nominal features, to see if this advice still holds for high-cardinality features.

The simplest method for doing this is to create two Pipelines. One of them uses OneHotEncoder and the other uses OrdinalEncoder, and both end in a logistic regression model.

ohe_logreg = make_pipeline(ohe_ignore, logreg)
oe_logreg = make_pipeline(oe_cats, logreg)
We’ll cross-validate each Pipeline using all features and then compare the accuracies. We’ll also time the operations to see if there are significant differences.

%time cross_val_score(ohe_logreg, census_X, census_y, cv=5, \
                      scoring='accuracy').mean()
CPU times: user 580 ms, sys: 5.33 ms, total: 586 ms
Wall time: 588 ms
0.8329920571424309
%time cross_val_score(oe_logreg, census_X, census_y, cv=5, \
                      scoring='accuracy').mean()
CPU times: user 506 ms, sys: 3.26 ms, total: 509 ms
Wall time: 509 ms
0.7547398152859307
The two Pipelines take around the same amount of time to run, but the accuracy of the OneHotEncoder Pipeline is 0.833, which is significantly better than the 0.755 accuracy of the OrdinalEncoder Pipeline. This would suggest that at least for a linear model like logistic regression, OneHotEncoder should be used for nominal features, even when the features have high cardinality.
Let’s now do the same comparison as the previous lesson, except this time we’ll use random forests, a non-linear tree-based model that we used in chapter 11.

First, we’ll create two more Pipelines. One uses OneHotEncoder and the other uses OrdinalEncoder, and both end in a random forest model.

ohe_rf = make_pipeline(ohe_ignore, rf)
oe_rf = make_pipeline(oe_cats, rf)
Then, we’ll cross-validate each Pipeline using all features.

%time cross_val_score(ohe_rf, census_X, census_y, cv=5, \
                      scoring='accuracy').mean()
CPU times: user 38.3 s, sys: 336 ms, total: 38.6 s
Wall time: 5.75 s
0.8260513856992514
%time cross_val_score(oe_rf, census_X, census_y, cv=5, \
                      scoring='accuracy').mean()
CPU times: user 6.17 s, sys: 299 ms, total: 6.47 s
Wall time: 1.47 s
0.8245362761024548
We can see that the accuracies are about the same for the OneHotEncoder Pipeline (0.826) and the OrdinalEncoder Pipeline (0.825), even though we were using the OrdinalEncoder on nominal features, which would normally be considered improper.

How can this be? Well, because of how decision trees recursively split features, the random forest model can approximately learn the relationships present in categorical features even when they’re encoded as single columns with OrdinalEncoder.
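To see the intuition, here’s a hypothetical pure-Python sketch: with three categories encoded as 0, 1, and 2, two nested threshold splits (exactly what a decision tree performs) can isolate any single category, including the “middle” one that a single linear coefficient can’t separate.

```python
# Ordinal codes for a nominal feature, e.g. C=0, Q=1, S=2.
def predict(x):
    # First split: is the code at most 0.5?
    if x <= 0.5:
        return 'group_C'    # isolates category 0
    # Second split on the remaining codes: at most 1.5?
    elif x <= 1.5:
        return 'group_Q'    # isolates category 1 (the "middle" one)
    else:
        return 'group_S'    # isolates category 2

print([predict(code) for code in [0, 1, 2]])
# ['group_C', 'group_Q', 'group_S']
```

A real random forest learns such thresholds from the data rather than having them hand-coded, but the mechanism is the same, which is why the arbitrary ordering hurts it much less than it hurts a linear model.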
It’s also worth noting that the OrdinalEncoder Pipeline is significantly faster than the OneHotEncoder Pipeline due to the much smaller feature set created by the OrdinalEncoder.
One final variation that we can try is to use the OneHotEncoder for all features except for education. And since education is actually an ordinal feature, we can use the OrdinalEncoder with it and define the category ordering.

Here are the education categories.

census_X['education'].unique()
array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
' Preschool'], dtype=object)
We’ll manually define the category ordering, from “Preschool” through “Doctorate”, and then create an instance of OrdinalEncoder using that ordering.

cats = [[' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th',
         ' 10th', ' 11th', ' 12th', ' HS-grad', ' Some-college',
         ' Assoc-voc', ' Assoc-acdm', ' Bachelors', ' Masters',
         ' Prof-school', ' Doctorate']]
oe_cats = OrdinalEncoder(categories=cats)
Then we’ll create a ColumnTransformer that applies the OrdinalEncoder to education, and applies the OneHotEncoder to all other features.

ct = make_column_transformer(
    (oe_cats, ['education']),
    remainder=ohe_ignore)
When we pass census_X to fit_transform, it creates 87 feature columns, compared to the 102 columns that were created when we only used the OneHotEncoder.
ct.fit_transform(census_X).shape
(48842, 87)
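The 87 columns also check out arithmetically: education’s 16 one-hot columns are replaced by a single ordinal column.

```python
# Column counts from the earlier shape check and the describe output.
all_one_hot = 102          # OneHotEncoder on all 8 features
education_categories = 16  # one-hot columns that education used to occupy

print(all_one_hot - education_categories + 1)  # 87
```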
Finally, we’ll create two Pipelines. Both of them start with the same ColumnTransformer, but one ends with logistic regression while the other ends with random forests.

oe_ohe_logreg = make_pipeline(ct, logreg)
oe_ohe_rf = make_pipeline(ct, rf)
When we cross-validate the first Pipeline, the accuracy is 0.832, which is nearly the same as the 0.833 achieved by the logistic regression Pipeline that used OneHotEncoder for all features.

%time cross_val_score(oe_ohe_logreg, census_X, census_y, cv=5, \
                      scoring='accuracy').mean()
CPU times: user 611 ms, sys: 6.72 ms, total: 618 ms
Wall time: 618 ms
0.8315588308601922
When we cross-validate the second Pipeline, the accuracy is 0.825, which is nearly the same as the 0.826 achieved by the random forest Pipeline that used OneHotEncoder for all features.

%time cross_val_score(oe_ohe_rf, census_X, census_y, cv=5, \
                      scoring='accuracy').mean()
CPU times: user 38.5 s, sys: 316 ms, total: 38.8 s
Wall time: 5.73 s
0.8251300537921482
In summary, encoding the education feature with OrdinalEncoder and the seven other features with OneHotEncoder performed basically the same as encoding all eight features with OneHotEncoder. However, it’s certainly possible that the OrdinalEncoder could provide a benefit under other circumstances.
Let’s summarize what we’ve learned in this chapter:
- If you’re using a linear model, use OneHotEncoder for nominal features, regardless of whether the features have high cardinality.
- If you’re using a non-linear model with nominal features, you can try using OneHotEncoder, and you can try using OrdinalEncoder without defining the category ordering, and then see which option performs better.
- If you’re using a non-linear model with ordinal features, you can try using OneHotEncoder, and you can try using OrdinalEncoder while defining the category ordering, and then see which option performs better.

In all cases, keep in mind that if the features have high cardinality, OrdinalEncoder is likely to run significantly faster than OneHotEncoder, which may or may not matter in your particular case.