17  High-cardinality categorical features

17.1 Recap of nominal and ordinal features

Let’s talk about categorical features. There are two types of categorical features that we’ve covered in the book:

  • A nominal feature has categories that are unordered, such as Embarked and Sex.
  • An ordinal feature has categories with an inherent logical ordering, such as Pclass.

Types of categorical features:

  • Nominal: Unordered categories
    • Embarked
    • Sex
  • Ordinal: Ordered categories
    • Pclass

So far, here’s the advice that I’ve given for encoding nominal and ordinal features:

  • For a nominal feature, you should use OneHotEncoder, and it will output one column for each category.
  • For an ordinal feature that is already encoded numerically, you should leave it as-is.
  • And for an ordinal feature that is encoded as strings, you should use OrdinalEncoder, and it will output a single column using the category ordering that you define.

Advice for encoding categorical data:

  • Nominal feature: Use OneHotEncoder
  • Ordinal feature stored as numbers: Leave as-is
  • Ordinal feature stored as strings: Use OrdinalEncoder

Let’s do a quick recap of why OneHotEncoder is the preferred approach for a nominal feature, using Embarked as an example.

Embarked has three categories, so OneHotEncoder would output 3 features. From each of the three features, the model can learn the relationship between the target value and whether or not a given passenger embarked at that port. For example, the model might learn from the first feature that passengers who embarked at C have a higher survival rate than passengers who didn’t embark at C.

If you were to instead use OrdinalEncoder with Embarked, it would output 1 feature. This is problematic because it would imply an ordering of the categories that doesn’t inherently exist. For example, if passengers who embarked at C and S had high survival rates, and passengers who embarked at Q had low survival rates, there’s no way for a linear model to learn this relationship if Embarked is encoded as a single feature.

Why use OneHotEncoder for Embarked?

  • OneHotEncoder:
    • Outputs 3 features
    • Model can learn the relationship between each feature and the target value
  • OrdinalEncoder:
    • Outputs 1 feature
    • Implies an ordering that doesn’t inherently exist
    • Linear model can’t necessarily learn the relationships in the data

For reference, here are the category counts for Embarked (from the Titanic DataFrame we’ve been calling df):

df['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
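
To make that point concrete, here’s a toy sketch (not part of the book’s code): if we encode C=0, Q=1, S=2 and only the middle category has a different outcome, a logistic regression given that single column can’t do better than predicting the majority class.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical encoding: C=0, Q=1, S=2; C and S "survived", Q did not
X_toy = np.array([[0], [1], [2]] * 50)
y_toy = (X_toy.ravel() != 1).astype(int)

# A single coefficient can only draw one threshold on this column, so the
# model can't single out the middle category; accuracy is stuck near 0.67
print(LogisticRegression().fit(X_toy, y_toy).score(X_toy, y_toy))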

In this chapter, we’re going to explore whether this advice still holds when you have high-cardinality categorical features, which are categorical features with lots of unique values.

17.2 Preparing the census dataset

We’ll use a new dataset for this chapter: US census data from 1994.

We’ll read the dataset into a new DataFrame called census.

census = pd.read_csv('http://bit.ly/censusdataset')

We’re only going to use the categorical features, which we can explore by using the DataFrame describe method.

census.describe(include='object')
        workclass education      marital-status      occupation relationship   race    sex native-country  class
count       48842     48842               48842           48842        48842  48842  48842          48842  48842
unique          9        16                   7              15            6      5      2             42      2
top       Private   HS-grad  Married-civ-spouse  Prof-specialty      Husband  White   Male  United-States  <=50K
freq        33906     15784               22379            6172        19716  41762  32650          43832  37155

From the row labeled “unique”, you can see that education, occupation, and native-country all have more than 10 unique values. There’s no hard-and-fast rule for what counts as a high-cardinality feature, but all three of these could be considered high-cardinality since they have a lot of unique values.
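
If you’d rather see those cardinalities directly, the nunique method reports the number of unique values per column (a quick check that isn’t part of the book’s code; note that it includes the class column as well):

# Number of unique values in each categorical column of census
census.select_dtypes(include='object').nunique().sort_values(ascending=False)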

You can’t tell from this display, but seven of these eight features are nominal; education is the exception, since it does have a logical ordering. However, we’re going to treat education as nominal for this experiment.

Categorical features in census dataset:

  • High-cardinality (3 of 8): education, occupation, native-country
  • Nominal (7 of 8): All except education

The column labeled “class” is actually our target. It indicates whether a person’s income is above or below $50,000 per year.

We can view the class proportions by normalizing the output of value_counts.

census['class'].value_counts(normalize=True)
 <=50K    0.760718
 >50K     0.239282
Name: class, dtype: float64

When defining our X DataFrame, which I’m calling census_X, we’re only going to use the 8 categorical columns, which I’ve listed out manually. And we’ll use “class” as our y Series, which I’m calling census_y.

census_cols = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']
census_X = census[census_cols]
census_y = census['class']

17.3 Setting up the encoders

We’re going to be testing the effectiveness of both OneHotEncoder and OrdinalEncoder with these 8 features. For this experiment, we would normally just create instances using the default arguments.

ohe = OneHotEncoder()
oe = OrdinalEncoder()

Notice that we created an instance of OrdinalEncoder without defining the category ordering. This is because we’re treating all of the features as nominal, and as such there is no logical ordering.

As a result, OrdinalEncoder would simply learn the categories for each feature in alphabetical order, which we can confirm by fitting the OrdinalEncoder and checking the categories_ attribute. (Notice in the output below that every category string begins with a space; that’s simply how the values are stored in this CSV file, so we’ll leave them as-is.)

oe.fit(census_X).categories_
[array([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
        ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
       dtype=object),
 array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object),
 array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
        ' Married-spouse-absent', ' Never-married', ' Separated',
        ' Widowed'], dtype=object),
 array([' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair',
        ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners',
        ' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
        ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support',
        ' Transport-moving'], dtype=object),
 array([' Husband', ' Not-in-family', ' Other-relative', ' Own-child',
        ' Unmarried', ' Wife'], dtype=object),
 array([' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other',
        ' White'], dtype=object),
 array([' Female', ' Male'], dtype=object),
 array([' ?', ' Cambodia', ' Canada', ' China', ' Columbia', ' Cuba',
        ' Dominican-Republic', ' Ecuador', ' El-Salvador', ' England',
        ' France', ' Germany', ' Greece', ' Guatemala', ' Haiti',
        ' Holand-Netherlands', ' Honduras', ' Hong', ' Hungary', ' India',
        ' Iran', ' Ireland', ' Italy', ' Jamaica', ' Japan', ' Laos',
        ' Mexico', ' Nicaragua', ' Outlying-US(Guam-USVI-etc)', ' Peru',
        ' Philippines', ' Poland', ' Portugal', ' Puerto-Rico',
        ' Scotland', ' South', ' Taiwan', ' Thailand', ' Trinadad&Tobago',
        ' United-States', ' Vietnam', ' Yugoslavia'], dtype=object)]

That being said, we’re actually going to run into a problem with encoding due to our highest-cardinality feature, native-country. Let’s take a look at it and see why.

census_X['native-country'].value_counts()
 United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Nicaragua                        49
 Greece                           49
 Peru                             46
 Ecuador                          45
 France                           38
 Ireland                          37
 Thailand                         30
 Hong                             30
 Cambodia                         28
 Trinadad&Tobago                  27
 Outlying-US(Guam-USVI-etc)       23
 Yugoslavia                       23
 Laos                             23
 Scotland                         21
 Honduras                         20
 Hungary                          19
 Holand-Netherlands                1
Name: native-country, dtype: int64

You can see that one of the categories appears only once in the dataset. As we’ve discussed earlier in the book, rare category values can cause problems with cross-validation.

In this case, it will definitely create a problem: during one of the cross-validation iterations, that sample is guaranteed to appear in the testing fold but not in the training folds. That will cause an error for both OneHotEncoder and OrdinalEncoder, since each would encounter a category at transform time that it never saw during fitting.

Resolving the cross-validation error caused by rare categories:

  • OneHotEncoder: Set handle_unknown='ignore'
  • OrdinalEncoder (before 0.24): Define the categories in advance
  • OrdinalEncoder (starting in 0.24): Set handle_unknown='use_encoded_value'

In the case of OneHotEncoder, the solution is simply to set the handle_unknown parameter to “ignore”.

ohe_ignore = OneHotEncoder(handle_unknown='ignore')

Starting in scikit-learn version 0.24, OrdinalEncoder will have a similar handle_unknown parameter (used together with an unknown_value parameter) that could be used for this situation.
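
If you happen to be running scikit-learn 0.24 or later, that alternative looks roughly like this (a sketch; unknown_value must be supplied alongside handle_unknown='use_encoded_value'):

# Requires scikit-learn >= 0.24: encode any unseen category as -1
oe_unknown = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)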

But for now, the best solution is to define the categories in advance for each feature using a list comprehension. The list comprehension iterates through the feature columns, extracts the unique values from each column, and stores the result in a list called “cats”.

cats = [census_X[col].unique() for col in census_cols]
cats
[array([' Private', ' Local-gov', ' ?', ' Self-emp-not-inc',
        ' Federal-gov', ' State-gov', ' Self-emp-inc', ' Without-pay',
        ' Never-worked'], dtype=object),
 array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
        ' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
        ' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
        ' Preschool'], dtype=object),
 array([' Never-married', ' Married-civ-spouse', ' Widowed', ' Divorced',
        ' Separated', ' Married-spouse-absent', ' Married-AF-spouse'],
       dtype=object),
 array([' Machine-op-inspct', ' Farming-fishing', ' Protective-serv', ' ?',
        ' Other-service', ' Prof-specialty', ' Craft-repair',
        ' Adm-clerical', ' Exec-managerial', ' Tech-support', ' Sales',
        ' Priv-house-serv', ' Transport-moving', ' Handlers-cleaners',
        ' Armed-Forces'], dtype=object),
 array([' Own-child', ' Husband', ' Not-in-family', ' Unmarried', ' Wife',
        ' Other-relative'], dtype=object),
 array([' Black', ' White', ' Asian-Pac-Islander', ' Other',
        ' Amer-Indian-Eskimo'], dtype=object),
 array([' Male', ' Female'], dtype=object),
 array([' United-States', ' ?', ' Peru', ' Guatemala', ' Mexico',
        ' Dominican-Republic', ' Ireland', ' Germany', ' Philippines',
        ' Thailand', ' Haiti', ' El-Salvador', ' Puerto-Rico', ' Vietnam',
        ' South', ' Columbia', ' Japan', ' India', ' Cambodia', ' Poland',
        ' Laos', ' England', ' Cuba', ' Taiwan', ' Italy', ' Canada',
        ' Portugal', ' China', ' Nicaragua', ' Honduras', ' Iran',
        ' Scotland', ' Jamaica', ' Ecuador', ' Yugoslavia', ' Hungary',
        ' Hong', ' Greece', ' Trinadad&Tobago',
        ' Outlying-US(Guam-USVI-etc)', ' France', ' Holand-Netherlands'],
       dtype=object)]

Then, we can pass the “cats” list to the categories parameter of OrdinalEncoder when creating an instance. This solves our problem since no unknown categories will ever appear during cross-validation.

oe_cats = OrdinalEncoder(categories=cats)

17.4 Encoding nominal features for a linear model

Now that we’ve set up our OneHotEncoder, called “ohe_ignore”, and our OrdinalEncoder, called “oe_cats”, let’s see what happens when we pass census_X to fit_transform and then check the shape.

As expected, the OneHotEncoder creates a lot of columns due to the high-cardinality features (102 columns, which is the total number of unique categories across the eight features), whereas the OrdinalEncoder creates only one column for each of the eight features.

ohe_ignore.fit_transform(census_X).shape
(48842, 102)
oe_cats.fit_transform(census_X).shape
(48842, 8)

Now let’s actually test the advice that I’ve given, which is that OneHotEncoder should be used for nominal features, to see if this advice still holds for high-cardinality features.

The simplest method for doing this is to create two Pipelines. One of them uses OneHotEncoder and the other uses OrdinalEncoder, and both end in a logistic regression model.

ohe_logreg = make_pipeline(ohe_ignore, logreg)
oe_logreg = make_pipeline(oe_cats, logreg)
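
A quick note on the model objects: logreg is assumed to be the LogisticRegression instance created earlier in the book. If you’re running this chapter on its own, a stand-in like this would work (the exact parameters used earlier may differ):

from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the logreg object from an earlier chapter
logreg = LogisticRegression(solver='liblinear', random_state=1)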

We’ll cross-validate each Pipeline using all features and then compare the accuracies. We’ll also time the operations to see if there are significant differences.

%time cross_val_score(ohe_logreg, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 578 ms, sys: 5.76 ms, total: 584 ms
Wall time: 587 ms
0.8329920571424309
%time cross_val_score(oe_logreg, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 498 ms, sys: 3.71 ms, total: 501 ms
Wall time: 501 ms
0.7547398152859307

The two Pipelines take around the same amount of time to run, but the accuracy of the OneHotEncoder Pipeline is 0.833, which is significantly better than the 0.755 accuracy of the OrdinalEncoder Pipeline. This would suggest that at least for a linear model like logistic regression, OneHotEncoder should be used for nominal features, even when the features have high cardinality.

Pipeline accuracy when encoding nominal features:

  • Linear model:
    • OneHotEncoder: 0.833
    • OrdinalEncoder: 0.755

17.5 Encoding nominal features for a non-linear model

Let’s now do the same comparison as in the previous lesson, except this time we’ll use a random forest, which is a tree-based non-linear model that we used in chapter 11.

First, we’ll create two more Pipelines, one using OneHotEncoder and the other using OrdinalEncoder, and both ending in a random forest model.

ohe_rf = make_pipeline(ohe_ignore, rf)
oe_rf = make_pipeline(oe_cats, rf)
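
As with logreg, rf is assumed to be the random forest instance created back in chapter 11. If you need to recreate it here, a stand-in such as this would work (again, the original parameters may differ):

from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the rf object from chapter 11
rf = RandomForestClassifier(random_state=1, n_jobs=-1)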

Then, we’ll cross-validate each Pipeline using all features.

%time cross_val_score(ohe_rf, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 39.7 s, sys: 302 ms, total: 40 s
Wall time: 5.82 s
0.8260513856992514
%time cross_val_score(oe_rf, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 6.38 s, sys: 302 ms, total: 6.68 s
Wall time: 1.54 s
0.8245362761024548

We can see that the accuracies are about the same for the OneHotEncoder Pipeline (0.826) and the OrdinalEncoder Pipeline (0.825), even though we were using the OrdinalEncoder on nominal features, which would normally be considered improper.

How can this be? Well, because of how decision trees recursively split features, the random forest model can approximately learn the relationships present in categorical features even when they’re encoded as single columns with OrdinalEncoder: a tree can split on the same encoded column at multiple thresholds, which lets it isolate individual categories (or groups of categories) despite the arbitrary numeric ordering.
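
As a small illustration of that idea (a toy example, not from the census data), a single decision tree can isolate the “middle” category of an ordinally encoded column by splitting on it twice:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Same toy setup as before: three categories encoded as 0, 1, 2, and only
# the middle category has a different outcome
X_toy = np.array([[0], [1], [2]] * 50)
y_toy = (X_toy.ravel() != 1).astype(int)

# Two splits (x <= 0.5 and x <= 1.5) are enough to separate the classes,
# so the tree scores 1.0 despite the arbitrary ordering
print(DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy).score(X_toy, y_toy))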

It’s also worth noting that the OrdinalEncoder Pipeline is significantly faster than the OneHotEncoder Pipeline, which is due to the much smaller feature set created by the OrdinalEncoder.

Pipeline accuracy when encoding nominal features:

  • Linear model:
    • OneHotEncoder: 0.833
    • OrdinalEncoder: 0.755
  • Non-linear model:
    • OneHotEncoder: 0.826
    • OrdinalEncoder: 0.825

17.6 Combining the encodings

One final variation that we can try is to use the OneHotEncoder for all features except for education. And since education is actually an ordinal feature, we can use the OrdinalEncoder with it and define the category ordering.

Here are the education categories.

census_X['education'].unique()
array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
       ' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
       ' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
       ' Preschool'], dtype=object)

We’ll manually define the category ordering, from “Preschool” through “Doctorate”, and then create an instance of OrdinalEncoder using that ordering.

cats = [[' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th',
         ' 11th', ' 12th', ' HS-grad', ' Some-college', ' Assoc-voc',
         ' Assoc-acdm', ' Bachelors', ' Masters', ' Prof-school', ' Doctorate']]
oe_cats = OrdinalEncoder(categories=cats)

Then we’ll create a ColumnTransformer that applies the OrdinalEncoder to education, and applies the OneHotEncoder to all other features.

ct = make_column_transformer(
    (oe_cats, ['education']),
    remainder=ohe_ignore)

When we pass census_X to the ColumnTransformer’s fit_transform method, it creates 87 feature columns, compared to the 102 columns that were created when we only used the OneHotEncoder. That makes sense: the 16 one-hot columns for education have been replaced by a single ordinally encoded column, and 102 - 16 + 1 = 87.

ct.fit_transform(census_X).shape
(48842, 87)

Finally, we’ll create two Pipelines. Both of them start with the same ColumnTransformer, but one ends with logistic regression while the other ends with random forests.

oe_ohe_logreg = make_pipeline(ct, logreg)
oe_ohe_rf = make_pipeline(ct, rf)

When we cross-validate the first Pipeline, the accuracy is 0.832, which is nearly the same as the 0.833 achieved by the logistic regression Pipeline that used OneHotEncoding for all features.

%time cross_val_score(oe_ohe_logreg, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 659 ms, sys: 8.62 ms, total: 668 ms
Wall time: 668 ms
0.8315588308601922

When we cross-validate the second Pipeline, the accuracy is 0.825, which is nearly the same as the 0.826 achieved by the random forest Pipeline that used OneHotEncoding for all features.

%time cross_val_score(oe_ohe_rf, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 40.8 s, sys: 309 ms, total: 41.1 s
Wall time: 5.98 s
0.8251300537921482

In summary, encoding the education feature with OrdinalEncoder and the seven other features with OneHotEncoder performed basically the same as encoding all eight features with OneHotEncoder. However, it’s certainly possible that the OrdinalEncoder could provide a benefit under other circumstances.

Pipeline accuracy when encoding nominal features:

  • Linear model:
    • OneHotEncoder: 0.833
    • OrdinalEncoder: 0.755
    • OneHotEncoder (7 features) + OrdinalEncoder (education): 0.832
  • Non-linear model:
    • OneHotEncoder: 0.826
    • OrdinalEncoder: 0.825
    • OneHotEncoder (7 features) + OrdinalEncoder (education): 0.825

17.7 Best practices for encoding

Let’s summarize what we’ve learned in this chapter.

If you have nominal features, and you’re using a linear model, you should definitely use OneHotEncoder, regardless of whether the features have high cardinality.

If you have nominal features, and you’re using a non-linear model, you can try using OneHotEncoder, and you can try using OrdinalEncoder without defining the category ordering, and then see which option performs better.

If you have ordinal features, regardless of the type of model, you can try using OneHotEncoder, and you can try using OrdinalEncoder while defining the category ordering, and then see which option performs better.

In all cases, keep in mind that if the features have high cardinality, OrdinalEncoder is likely to run significantly faster than OneHotEncoder, which may or may not matter in your particular case.
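
If you’d rather let cross-validation make the choice for you, one option (a sketch that reuses the objects defined in this chapter; it uses Pipeline with named steps rather than make_pipeline so that the encoder step can be swapped) is to search over the encoder itself:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OrdinalEncoder

# Recreate a full-width OrdinalEncoder, since oe_cats was redefined above
# to cover only the education column
oe_all = OrdinalEncoder(categories=[census_X[col].unique() for col in census_cols])

# Let cross-validation choose between one-hot and ordinal encoding
pipe = Pipeline([('encoder', ohe_ignore), ('model', rf)])
grid = GridSearchCV(pipe, {'encoder': [ohe_ignore, oe_all]}, cv=5, scoring='accuracy')
grid.fit(census_X, census_y)
print(grid.best_params_['encoder'])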

Summary of best practices:

  • Nominal features, linear model:
    • OneHotEncoder
  • Nominal features, non-linear model:
    • OneHotEncoder
    • OrdinalEncoder
      • Don’t define category ordering
      • Much faster than OneHotEncoder (if features have high cardinality)
  • Ordinal features:
    • OneHotEncoder
    • OrdinalEncoder
      • Define category ordering
      • Much faster than OneHotEncoder (if features have high cardinality)