17 High-cardinality categorical features
17.1 Recap of nominal and ordinal features
Let’s talk about categorical features. There are two types of categorical features that we’ve covered in the book:
- Nominal features have categories that are unordered, such as Embarked and Sex.
- Ordinal features have categories with an inherent logical ordering, such as Pclass.
So far, here’s the advice that I’ve given you for encoding nominal and ordinal features:
- For a nominal feature, you should use OneHotEncoder, and it will output one column for each category.
- For an ordinal feature that is already encoded numerically, you should leave it as-is.
- For an ordinal feature that is encoded as strings, you should use OrdinalEncoder, and it will output a single column using the category ordering that you define.
Let’s do a quick recap of why OneHotEncoder is the preferred approach for a nominal feature, using Embarked as an example.
Embarked has 3 categories, so OneHotEncoder would output 3 features. From each of the 3 features, the model can learn the relationship between the target value and whether or not a given passenger embarked at that port. For example, the model might learn from the first feature that passengers who embarked at C have a higher survival rate than passengers who didn’t embark at C.
If you were to instead use OrdinalEncoder with Embarked, it would output 1 feature. This is problematic because it would imply an ordering of the categories that doesn’t inherently exist. For example, if passengers who embarked at C and S had high survival rates, and passengers who embarked at Q had low survival rates, there would be no way for a linear model to learn this relationship.
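To make this concrete, here’s a minimal sketch of both encoders applied to a toy stand-in for Embarked (the data below is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# a toy stand-in for the Titanic Embarked column
demo = pd.DataFrame({'Embarked': ['C', 'S', 'Q', 'S']})

# OneHotEncoder: one column per category (C, Q, S), with a single 1 per row
ohe_demo = OneHotEncoder(sparse=False)  # use sparse_output=False in 1.2+
print(ohe_demo.fit_transform(demo))

# OrdinalEncoder: a single column of codes (C=0, Q=1, S=2),
# which implies an ordering that doesn't actually exist
print(OrdinalEncoder().fit_transform(demo))
```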
In this chapter, we’re going to explore whether this advice still holds when you have high-cardinality categorical features, which are categorical features with lots of unique values.
17.2 Preparing the census dataset
We’ll use a new dataset for this chapter: US census data from 1994.
We’ll read the dataset into a new DataFrame called census.

census = pd.read_csv('http://bit.ly/censusdataset')
We’re only going to use the categorical features, which we can explore by using the DataFrame describe method.
census.describe(include='object')

| | workclass | education | marital-status | occupation | relationship | race | sex | native-country | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 2 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43832 | 37155 |
From the row labeled “unique”, you can see that education, occupation, and native-country all have more than 10 unique values. There’s no hard-and-fast rule for what counts as a high-cardinality feature, but all 3 of these could be considered high-cardinality since they have a lot of unique values.
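If you’d rather list each column’s cardinality directly, a quick pandas one-liner works (a sketch, assuming the census DataFrame from above):

```python
# number of unique values per categorical column, highest first
print(census.select_dtypes('object').nunique().sort_values(ascending=False))
```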
You can’t tell from this display, but these 8 features are all nominal except for education, which does have a logical ordering. However, we’re going to treat education as nominal for this experiment.
The column labeled “class” is actually our target. This column indicates whether the person makes more than $50,000 a year.
We can view the class proportions by normalizing the output of value_counts.
census['class'].value_counts(normalize=True)

<=50K    0.760718
>50K     0.239282
Name: class, dtype: float64
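As an aside, the larger of these two proportions is also the “null accuracy”: the accuracy a model could achieve by always predicting the majority class. It’s a useful baseline for the cross-validation scores later in this chapter.

```python
# null accuracy: always predict the majority class (<=50K)
print(census['class'].value_counts(normalize=True).max())  # about 0.761
```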
When defining our X DataFrame, which I’m calling census_X, we’re only going to use the 8 categorical columns, which I’ve listed out manually. And we’ll use class as our y Series, which I’m calling census_y.
census_cols = ['workclass', 'education', 'marital-status', 'occupation',
'relationship', 'race', 'sex', 'native-country']
census_X = census[census_cols]
census_y = census['class']

17.3 Setting up the encoders
We’re going to test the effectiveness of both OneHotEncoder and OrdinalEncoder with these 8 features. Normally, we would just create instances using the default arguments.
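(If you’re jumping straight into this chapter, note that the encoders, models, and pipeline utilities used below come from earlier chapters. Here’s a minimal setup sketch; the book’s actual logreg and rf objects may be configured differently.)

```python
# assumed setup from earlier chapters (a sketch; exact arguments may differ)
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression(solver='liblinear', random_state=1)
rf = RandomForestClassifier(random_state=1, n_jobs=-1)
```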
ohe = OneHotEncoder()
oe = OrdinalEncoder()

Notice that we created an instance of OrdinalEncoder without defining the category ordering. This is because we’re treating all of the features as nominal, and nominal features have no logical ordering.
As a result, OrdinalEncoder would simply learn the categories for each feature in alphabetical order, which we can confirm by fitting the OrdinalEncoder and checking the categories_ attribute.
oe.fit(census_X).categories_

[array([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
dtype=object),
array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
' HS-grad', ' Masters', ' Preschool', ' Prof-school',
' Some-college'], dtype=object),
array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
' Married-spouse-absent', ' Never-married', ' Separated',
' Widowed'], dtype=object),
array([' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair',
' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners',
' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support',
' Transport-moving'], dtype=object),
array([' Husband', ' Not-in-family', ' Other-relative', ' Own-child',
' Unmarried', ' Wife'], dtype=object),
array([' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other',
' White'], dtype=object),
array([' Female', ' Male'], dtype=object),
array([' ?', ' Cambodia', ' Canada', ' China', ' Columbia', ' Cuba',
' Dominican-Republic', ' Ecuador', ' El-Salvador', ' England',
' France', ' Germany', ' Greece', ' Guatemala', ' Haiti',
' Holand-Netherlands', ' Honduras', ' Hong', ' Hungary', ' India',
' Iran', ' Ireland', ' Italy', ' Jamaica', ' Japan', ' Laos',
' Mexico', ' Nicaragua', ' Outlying-US(Guam-USVI-etc)', ' Peru',
' Philippines', ' Poland', ' Portugal', ' Puerto-Rico',
' Scotland', ' South', ' Taiwan', ' Thailand', ' Trinadad&Tobago',
' United-States', ' Vietnam', ' Yugoslavia'], dtype=object)]
That being said, we’re actually going to run into a problem with encoding due to our highest cardinality feature, native-country. Let’s take a look and see why.
census_X['native-country'].value_counts()

United-States 43832
Mexico 951
? 857
Philippines 295
Germany 206
Puerto-Rico 184
Canada 182
El-Salvador 155
India 151
Cuba 138
England 127
China 122
South 115
Jamaica 106
Italy 105
Dominican-Republic 103
Japan 92
Guatemala 88
Poland 87
Vietnam 86
Columbia 85
Haiti 75
Portugal 67
Taiwan 65
Iran 59
Nicaragua 49
Greece 49
Peru 46
Ecuador 45
France 38
Ireland 37
Hong 30
Thailand 30
Cambodia 28
Trinadad&Tobago 27
Yugoslavia 23
Outlying-US(Guam-USVI-etc) 23
Laos 23
Scotland 21
Honduras 20
Hungary 19
Holand-Netherlands 1
Name: native-country, dtype: int64
You can see that one of the categories appears only once in the dataset. As we talked about in lesson 15.7, rare category values can cause problems with cross-validation.
In this case, it will definitely create a problem: during one run of cross-validation, that single sample is guaranteed to appear in the testing fold but not the training fold, so the encoder will encounter a category during transformation that it never saw during fitting. That will cause an error for both OneHotEncoder and OrdinalEncoder.
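Here’s a minimal sketch of that failure mode, using an invented toy column rather than the census data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# 'Netherlands' appears in the testing data but not the training data
train = pd.DataFrame({'country': ['US', 'Mexico', 'US']})
test = pd.DataFrame({'country': ['Netherlands']})

oe_demo = OrdinalEncoder().fit(train)
try:
    oe_demo.transform(test)
except ValueError as e:
    print(e)  # Found unknown categories ['Netherlands'] in column 0 ...
```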
In the case of OneHotEncoder, the solution is simply to set the handle_unknown parameter to 'ignore', which tells it to encode an unknown category as all zeros rather than raising an error.
ohe_ignore = OneHotEncoder(handle_unknown='ignore')

Starting in scikit-learn version 0.24, OrdinalEncoder has a similar handle_unknown parameter that could be used for this situation.
But for now, the best solution is to define the categories in advance for each feature using a list comprehension. The list comprehension iterates through the feature columns, extracts the unique values from each column, and stores the result in a list called cats.
cats = [census_X[col].unique() for col in census_cols]
cats

[array([' Private', ' Local-gov', ' ?', ' Self-emp-not-inc',
' Federal-gov', ' State-gov', ' Self-emp-inc', ' Without-pay',
' Never-worked'], dtype=object),
array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
' Preschool'], dtype=object),
array([' Never-married', ' Married-civ-spouse', ' Widowed', ' Divorced',
' Separated', ' Married-spouse-absent', ' Married-AF-spouse'],
dtype=object),
array([' Machine-op-inspct', ' Farming-fishing', ' Protective-serv', ' ?',
' Other-service', ' Prof-specialty', ' Craft-repair',
' Adm-clerical', ' Exec-managerial', ' Tech-support', ' Sales',
' Priv-house-serv', ' Transport-moving', ' Handlers-cleaners',
' Armed-Forces'], dtype=object),
array([' Own-child', ' Husband', ' Not-in-family', ' Unmarried', ' Wife',
' Other-relative'], dtype=object),
array([' Black', ' White', ' Asian-Pac-Islander', ' Other',
' Amer-Indian-Eskimo'], dtype=object),
array([' Male', ' Female'], dtype=object),
array([' United-States', ' ?', ' Peru', ' Guatemala', ' Mexico',
' Dominican-Republic', ' Ireland', ' Germany', ' Philippines',
' Thailand', ' Haiti', ' El-Salvador', ' Puerto-Rico', ' Vietnam',
' South', ' Columbia', ' Japan', ' India', ' Cambodia', ' Poland',
' Laos', ' England', ' Cuba', ' Taiwan', ' Italy', ' Canada',
' Portugal', ' China', ' Nicaragua', ' Honduras', ' Iran',
' Scotland', ' Jamaica', ' Ecuador', ' Yugoslavia', ' Hungary',
' Hong', ' Greece', ' Trinadad&Tobago',
' Outlying-US(Guam-USVI-etc)', ' France', ' Holand-Netherlands'],
dtype=object)]
Then, we can pass the cats list to the categories parameter of OrdinalEncoder when creating an instance. This solves our problem since no unknown categories will ever appear during cross-validation.
oe_cats = OrdinalEncoder(categories=cats)

17.4 Encoding nominal features for a linear model
Now that we’ve set up our OneHotEncoder, called ohe_ignore, and our OrdinalEncoder, called oe_cats, let’s see what happens when we pass census_X to fit_transform and then check the shape.
As expected, the OneHotEncoder creates a lot of columns due to the high-cardinality features: 102 in total, which is the sum of the unique value counts across the 8 features (9 + 16 + 7 + 15 + 6 + 5 + 2 + 42). The OrdinalEncoder, in contrast, creates only one column for each of the 8 features.
ohe_ignore.fit_transform(census_X).shape

(48842, 102)

oe_cats.fit_transform(census_X).shape

(48842, 8)
Now let’s actually test the advice that I’ve given, which is that OneHotEncoder should be used for nominal features, to see if this advice still holds for high-cardinality features.
The simplest method for doing this is to create two Pipelines. One of them uses OneHotEncoder and the other uses OrdinalEncoder, and both end in a logistic regression model.
ohe_logreg = make_pipeline(ohe_ignore, logreg)
oe_logreg = make_pipeline(oe_cats, logreg)

We’ll cross-validate each Pipeline using all features and then compare the accuracies. We’ll also time the operations to see if there are significant differences.
%time cross_val_score(ohe_logreg, census_X, census_y, cv=5, \
    scoring='accuracy').mean()

CPU times: user 570 ms, sys: 5.18 ms, total: 575 ms
Wall time: 578 ms
0.8329920571424309
%time cross_val_score(oe_logreg, census_X, census_y, cv=5, \
    scoring='accuracy').mean()

CPU times: user 495 ms, sys: 3.67 ms, total: 499 ms
Wall time: 499 ms
0.7547398152859307
The two Pipelines take around the same amount of time to run, but the accuracy of the OneHotEncoder Pipeline is 0.833, which is significantly better than the 0.755 accuracy of the OrdinalEncoder Pipeline. This would suggest that at least for a linear model like logistic regression, OneHotEncoder should be used for nominal features, even when the features have high cardinality.
17.5 Encoding nominal features for a non-linear model
Let’s now do the same comparison as the previous lesson, except this time we’ll use random forests, a tree-based non-linear model that we used in chapter 11.
First, we’ll create two more Pipelines. One uses OneHotEncoder and the other uses OrdinalEncoder, and both end in a random forest model.
ohe_rf = make_pipeline(ohe_ignore, rf)
oe_rf = make_pipeline(oe_cats, rf)

Then, we’ll cross-validate each Pipeline using all features.
%time cross_val_score(ohe_rf, census_X, census_y, cv=5, \
    scoring='accuracy').mean()

CPU times: user 38.2 s, sys: 292 ms, total: 38.4 s
Wall time: 6min 7s
0.8260513856992514
%time cross_val_score(oe_rf, census_X, census_y, cv=5, \
    scoring='accuracy').mean()

CPU times: user 6.16 s, sys: 304 ms, total: 6.46 s
Wall time: 1.53 s
0.8245362761024548
We can see that the accuracies are about the same for the OneHotEncoder Pipeline (0.826) and the OrdinalEncoder Pipeline (0.825), even though we were using the OrdinalEncoder on nominal features, which would normally be considered improper.
How can this be? Well, because of how decision trees recursively split features, the random forest model can approximately learn the relationships present in categorical features even when they’re encoded as single columns with OrdinalEncoder.
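To build that intuition, here’s a toy sketch (invented data) of a decision tree isolating a single “middle” category from one ordinal-encoded column using two threshold splits:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# category codes: A=0, B=1, C=2; only category B is positive
X = np.array([[0], [1], [2], [0], [1], [2]])
y = np.array([0, 1, 0, 0, 1, 0])

tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print(export_text(tree, feature_names=['cat_code']))
# the tree splits at 0.5 and 1.5 to carve out code 1 (category B),
# even though the numeric ordering of the codes is arbitrary
```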
It’s also worth noting that the OrdinalEncoder Pipeline is significantly faster than the OneHotEncoder Pipeline due to the much smaller feature set created by the OrdinalEncoder.
17.6 Combining the encodings
One final variation that we can try is to use the OneHotEncoder for all features except for education. And since education is actually an ordinal feature, we can use the OrdinalEncoder with it and define the category ordering.
Here are the education categories.
census_X['education'].unique()

array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
' Preschool'], dtype=object)
We’ll manually define the category ordering, from “Preschool” through “Doctorate”, and then create an instance of OrdinalEncoder using that ordering.
cats = [[' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th',
' 10th', ' 11th', ' 12th', ' HS-grad', ' Some-college',
' Assoc-voc', ' Assoc-acdm', ' Bachelors', ' Masters',
' Prof-school', ' Doctorate']]
oe_cats = OrdinalEncoder(categories=cats)

Then we’ll create a ColumnTransformer that applies the OrdinalEncoder to education, and applies the OneHotEncoder to all other features.
ct = make_column_transformer(
(oe_cats, ['education']),
    remainder=ohe_ignore)

When we pass census_X to the fit_transform method, it creates 87 feature columns, compared to the 102 columns that were created when we only used the OneHotEncoder. That makes sense: education’s 16 one-hot columns are replaced by a single ordinal column, and 102 - 16 + 1 = 87.
ct.fit_transform(census_X).shape

(48842, 87)
Finally, we’ll create two Pipelines. Both of them start with the same ColumnTransformer, but one ends with logistic regression while the other ends with random forests.
oe_ohe_logreg = make_pipeline(ct, logreg)
oe_ohe_rf = make_pipeline(ct, rf)

When we cross-validate the first Pipeline, the accuracy is 0.832, which is nearly the same as the 0.833 achieved by the logistic regression Pipeline that used OneHotEncoder for all features.
%time cross_val_score(oe_ohe_logreg, census_X, census_y, cv=5, \
    scoring='accuracy').mean()

CPU times: user 624 ms, sys: 7.42 ms, total: 631 ms
Wall time: 631 ms
0.8315588308601922
When we cross-validate the second Pipeline, the accuracy is 0.825, which is nearly the same as the 0.826 achieved by the random forest Pipeline that used OneHotEncoder for all features.
%time cross_val_score(oe_ohe_rf, census_X, census_y, cv=5, \
    scoring='accuracy').mean()

CPU times: user 37.9 s, sys: 282 ms, total: 38.2 s
Wall time: 12min 12s
0.8251300537921482
In summary, encoding the education feature with OrdinalEncoder and the 7 other features with OneHotEncoder performed basically the same as encoding all 8 features with OneHotEncoder. However, it’s certainly possible that the OrdinalEncoder could provide a benefit under other circumstances.
17.7 Best practices for encoding
Let’s summarize what we’ve learned in this chapter:
- If you have nominal features and are using a linear model, you should definitely use OneHotEncoder, regardless of whether the features have high cardinality.
- If you have nominal features and are using a non-linear model, you can try OneHotEncoder and you can try OrdinalEncoder without defining the category ordering, and then see which option performs better.
- If you have ordinal features and are using either type of model, you can try OneHotEncoder and you can try OrdinalEncoder while defining the category ordering, and then see which option performs better.
In all cases, keep in mind that if the features have high cardinality, OrdinalEncoder is likely to run significantly faster than OneHotEncoder, which may or may not matter in your particular case.
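As a closing illustration, here’s a sketch of that “try both and compare” workflow. This isn’t code from the chapter; it assumes the census_X, census_y, census_cols, and rf objects defined above, and it rebuilds the category lists since oe_cats was redefined in lesson 17.6:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# category lists covering all 8 features (as in lesson 17.3)
all_cats = [census_X[col].unique() for col in census_cols]

for enc in [OneHotEncoder(handle_unknown='ignore'),
            OrdinalEncoder(categories=all_cats)]:
    pipe = make_pipeline(enc, rf)
    score = cross_val_score(pipe, census_X, census_y, cv=5,
                            scoring='accuracy').mean()
    print(type(enc).__name__, round(score, 3))
```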