df['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
Let’s talk about categorical features. There are two types of categorical features that we’ve covered in the book: nominal features, which have no inherent ordering, and ordinal features, which do.
So far, the advice that I’ve given is to use OneHotEncoder for nominal features, and OrdinalEncoder (with an explicitly defined category ordering) for ordinal features.
Let’s do a quick recap as to why OneHotEncoder is the preferred approach for a nominal feature, using Embarked as an example.
Embarked has three categories, so OneHotEncoder would output 3 features. From each of the three features, the model can learn the relationship between the target value and whether or not a given passenger embarked at that port. For example, the model might learn from the first feature that passengers who embarked at C have a higher survival rate than passengers who didn’t embark at C.
If you were to instead use OrdinalEncoder with Embarked, it would output 1 feature. This is problematic because it would imply an ordering of the categories that doesn’t inherently exist. For example, if passengers who embarked at C and S had high survival rates, and passengers who embarked at Q had low survival rates, there’s no way for a linear model to learn this relationship if Embarked is encoded as a single feature.
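To make that concrete, here’s a quick sketch (using a made-up five-row Embarked column, separate from the book’s data) of what each encoder outputs:

```python
# Toy sketch: one-hot vs. ordinal encoding of a nominal feature.
# The values mirror the three Embarked categories (S, C, Q).
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'C']})

# OneHotEncoder outputs one column per category (3 columns here)
ohe_demo = OneHotEncoder().fit_transform(toy).toarray()

# OrdinalEncoder outputs a single column of integer codes (C=0, Q=1, S=2)
oe_demo = OrdinalEncoder().fit_transform(toy)

print(ohe_demo.shape, oe_demo.shape)  # (5, 3) (5, 1)
```

The single ordinal column forces the model to treat the categories as if C < Q < S, which is the ordering problem described above.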
In this chapter, we’re going to explore whether this advice still holds when you have high-cardinality categorical features, which are categorical features with lots of unique values.
We’ll use a new dataset for this chapter, namely a dataset of US census data from 1994.
We’ll read the dataset into a new DataFrame called census.
census = pd.read_csv('http://bit.ly/censusdataset')
We’re only going to use the categorical features, which we can explore by using the DataFrame describe method.
census.describe(include='object')
|        | workclass | education | marital-status | occupation | relationship | race | sex | native-country | class |
|--------|-----------|-----------|----------------|------------|--------------|------|-----|----------------|-------|
| count  | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 2 |
| top    | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq   | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43832 | 37155 |
From the row labeled “unique”, you can see that education, occupation, and native-country all have more than 10 unique values. There’s no hard-and-fast rule for what counts as a high-cardinality feature, but all three of these could be considered high-cardinality since they have a lot of unique values.
You can’t tell from this display, but these 8 features are all nominal, with the exception of education, which does have a logical ordering. However, we’re going to treat education as nominal for this experiment.
The column labeled “class” is actually our target. This column indicates whether the person has an income of more or less than fifty thousand dollars a year.
We can view the class proportions by normalizing the output of value_counts.
census['class'].value_counts(normalize=True)
<=50K 0.760718
>50K 0.239282
Name: class, dtype: float64
When defining our X DataFrame, which I’m calling census_X, we’re only going to use the 8 categorical columns, which I’ve listed out manually. And we’ll use “class” as our y Series, which I’m calling census_y.
census_cols = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']
census_X = census[census_cols]
census_y = census['class']
We’re going to be testing the effectiveness of both OneHotEncoder and OrdinalEncoder with these 8 features. For this experiment, we would normally just create instances using the default arguments.
ohe = OneHotEncoder()
oe = OrdinalEncoder()
Notice that we created an instance of OrdinalEncoder without defining the category ordering. This is because we’re treating all of the features as nominal, and as such there is no logical ordering.
As a result, OrdinalEncoder would simply learn the categories for each feature in alphabetical order, which we can confirm by fitting the OrdinalEncoder and checking the categories_ attribute.
oe.fit(census_X).categories_
[array([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
dtype=object),
array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
' HS-grad', ' Masters', ' Preschool', ' Prof-school',
' Some-college'], dtype=object),
array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
' Married-spouse-absent', ' Never-married', ' Separated',
' Widowed'], dtype=object),
array([' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair',
' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners',
' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support',
' Transport-moving'], dtype=object),
array([' Husband', ' Not-in-family', ' Other-relative', ' Own-child',
' Unmarried', ' Wife'], dtype=object),
array([' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other',
' White'], dtype=object),
array([' Female', ' Male'], dtype=object),
array([' ?', ' Cambodia', ' Canada', ' China', ' Columbia', ' Cuba',
' Dominican-Republic', ' Ecuador', ' El-Salvador', ' England',
' France', ' Germany', ' Greece', ' Guatemala', ' Haiti',
' Holand-Netherlands', ' Honduras', ' Hong', ' Hungary', ' India',
' Iran', ' Ireland', ' Italy', ' Jamaica', ' Japan', ' Laos',
' Mexico', ' Nicaragua', ' Outlying-US(Guam-USVI-etc)', ' Peru',
' Philippines', ' Poland', ' Portugal', ' Puerto-Rico',
' Scotland', ' South', ' Taiwan', ' Thailand', ' Trinadad&Tobago',
' United-States', ' Vietnam', ' Yugoslavia'], dtype=object)]
That being said, we’re actually going to run into a problem with encoding due to our highest cardinality feature, native-country. Let’s take a look at it and see why.
census_X['native-country'].value_counts()
United-States 43832
Mexico 951
? 857
Philippines 295
Germany 206
Puerto-Rico 184
Canada 182
El-Salvador 155
India 151
Cuba 138
England 127
China 122
South 115
Jamaica 106
Italy 105
Dominican-Republic 103
Japan 92
Guatemala 88
Poland 87
Vietnam 86
Columbia 85
Haiti 75
Portugal 67
Taiwan 65
Iran 59
Nicaragua 49
Greece 49
Peru 46
Ecuador 45
France 38
Ireland 37
Thailand 30
Hong 30
Cambodia 28
Trinadad&Tobago 27
Outlying-US(Guam-USVI-etc) 23
Yugoslavia 23
Laos 23
Scotland 21
Honduras 20
Hungary 19
Holand-Netherlands 1
Name: native-country, dtype: int64
You can see that one of the categories appears only once in the dataset. As we talked about in lesson 15.7, rare category values can cause problems with cross-validation.
In this case, it will definitely create a problem, because that sample is guaranteed to appear in the test fold but not a training fold during one of the runs of cross-validation. That will cause an error for both OneHotEncoder and OrdinalEncoder.
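Here’s a toy sketch of that failure mode (the data is made up, but the mechanics match what happens inside cross-validation when a rare category lands only in the test fold):

```python
# Toy sketch: an unseen category makes transform raise a ValueError
# when handle_unknown is left at its default of 'error'.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'country': ['United-States', 'Mexico', 'United-States']})
test = pd.DataFrame({'country': ['Holand-Netherlands']})  # not seen during fit

ohe = OneHotEncoder()  # handle_unknown='error' is the default
ohe.fit(train)
try:
    ohe.transform(test)
except ValueError as e:
    print('transform failed:', e)
```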
In the case of OneHotEncoder, the solution is simply to set the handle_unknown parameter to “ignore”.
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
Starting in scikit-learn version 0.24, OrdinalEncoder will have a similar handle_unknown parameter that could be used for this situation.
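For reference, here’s a sketch of that newer API (it requires scikit-learn 0.24 or later, so it won’t run with the version used in this book): you choose a sentinel value, and any unseen category gets encoded as that value instead of raising an error.

```python
# Sketch of the scikit-learn 0.24+ API: unknown categories are encoded
# as the sentinel passed via unknown_value instead of raising an error.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'country': ['United-States', 'Mexico']})
test = pd.DataFrame({'country': ['Holand-Netherlands']})  # not seen during fit

oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train)
print(oe.transform(test))  # [[-1.]]
```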
But for now, the best solution is to define the categories in advance for each feature using a list comprehension. The list comprehension iterates through the feature columns, extracts the unique values from each column, and stores the result in a list called “cats”.
cats = [census_X[col].unique() for col in census_cols]
cats
[array([' Private', ' Local-gov', ' ?', ' Self-emp-not-inc',
' Federal-gov', ' State-gov', ' Self-emp-inc', ' Without-pay',
' Never-worked'], dtype=object),
array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
' Preschool'], dtype=object),
array([' Never-married', ' Married-civ-spouse', ' Widowed', ' Divorced',
' Separated', ' Married-spouse-absent', ' Married-AF-spouse'],
dtype=object),
array([' Machine-op-inspct', ' Farming-fishing', ' Protective-serv', ' ?',
' Other-service', ' Prof-specialty', ' Craft-repair',
' Adm-clerical', ' Exec-managerial', ' Tech-support', ' Sales',
' Priv-house-serv', ' Transport-moving', ' Handlers-cleaners',
' Armed-Forces'], dtype=object),
array([' Own-child', ' Husband', ' Not-in-family', ' Unmarried', ' Wife',
' Other-relative'], dtype=object),
array([' Black', ' White', ' Asian-Pac-Islander', ' Other',
' Amer-Indian-Eskimo'], dtype=object),
array([' Male', ' Female'], dtype=object),
array([' United-States', ' ?', ' Peru', ' Guatemala', ' Mexico',
' Dominican-Republic', ' Ireland', ' Germany', ' Philippines',
' Thailand', ' Haiti', ' El-Salvador', ' Puerto-Rico', ' Vietnam',
' South', ' Columbia', ' Japan', ' India', ' Cambodia', ' Poland',
' Laos', ' England', ' Cuba', ' Taiwan', ' Italy', ' Canada',
' Portugal', ' China', ' Nicaragua', ' Honduras', ' Iran',
' Scotland', ' Jamaica', ' Ecuador', ' Yugoslavia', ' Hungary',
' Hong', ' Greece', ' Trinadad&Tobago',
' Outlying-US(Guam-USVI-etc)', ' France', ' Holand-Netherlands'],
dtype=object)]
Then, we can pass the “cats” list to the categories parameter of OrdinalEncoder when creating an instance. This solves our problem since no unknown categories will ever appear during cross-validation.
oe_cats = OrdinalEncoder(categories=cats)
Now that we’ve set up our OneHotEncoder, called “ohe_ignore”, and our OrdinalEncoder, called “oe_cats”, let’s see what happens when we pass census_X to fit_transform and then check the shape.
As expected, the OneHotEncoder creates a lot of columns due to the high-cardinality features, whereas the OrdinalEncoder creates only one column for each of the eight features.
ohe_ignore.fit_transform(census_X).shape
(48842, 102)
oe_cats.fit_transform(census_X).shape
(48842, 8)
Now let’s actually test the advice that I’ve given, which is that OneHotEncoder should be used for nominal features, to see if this advice still holds for high-cardinality features.
The simplest method for doing this is to create two Pipelines. One of them uses OneHotEncoder and the other uses OrdinalEncoder, and both end in a logistic regression model.
ohe_logreg = make_pipeline(ohe_ignore, logreg)
oe_logreg = make_pipeline(oe_cats, logreg)
We’ll cross-validate each Pipeline using all features and then compare the accuracies. We’ll also time the operations to see if there are significant differences.
%time cross_val_score(ohe_logreg, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 578 ms, sys: 5.76 ms, total: 584 ms
Wall time: 587 ms
0.8329920571424309
%time cross_val_score(oe_logreg, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 498 ms, sys: 3.71 ms, total: 501 ms
Wall time: 501 ms
0.7547398152859307
The two Pipelines take around the same amount of time to run, but the accuracy of the OneHotEncoder Pipeline is 0.833, which is significantly better than the 0.755 accuracy of the OrdinalEncoder Pipeline. This would suggest that at least for a linear model like logistic regression, OneHotEncoder should be used for nominal features, even when the features have high cardinality.
Let’s now do the same comparison as the previous lesson, except this time we’ll use a random forest, which is a tree-based non-linear model that we used in chapter 11.
First, we’ll create two more Pipelines, one using OneHotEncoder and the other using OrdinalEncoder, and both ending in a random forest model.
ohe_rf = make_pipeline(ohe_ignore, rf)
oe_rf = make_pipeline(oe_cats, rf)
Then, we’ll cross-validate each Pipeline using all features.
%time cross_val_score(ohe_rf, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 39.7 s, sys: 302 ms, total: 40 s
Wall time: 5.82 s
0.8260513856992514
%time cross_val_score(oe_rf, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 6.38 s, sys: 302 ms, total: 6.68 s
Wall time: 1.54 s
0.8245362761024548
We can see that the accuracies are about the same for the OneHotEncoder Pipeline (0.826) and the OrdinalEncoder Pipeline (0.825), even though we were using the OrdinalEncoder on nominal features, which would normally be considered improper.
How can this be? Well, because of how decision trees recursively split features, the random forest model can approximately learn the relationships present in categorical features even when they’re encoded as single columns with OrdinalEncoder.
It’s also worth noting that the OrdinalEncoder Pipeline is significantly faster than the OneHotEncoder Pipeline, which is due to the much smaller feature set created by the OrdinalEncoder.
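Here’s a minimal sketch of that idea (toy data, not the census features): a tree can carve out the middle category of a single ordinally encoded column with two threshold splits, even though the target isn’t monotonic in the integer codes, which is exactly what a linear model can’t do.

```python
# Toy sketch: a decision tree learns a non-monotonic relationship
# from an ordinally encoded feature by splitting at 0.5 and 1.5.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0], [1], [2]] * 50)   # three categories encoded as 0, 1, 2
y = (X.ravel() == 1).astype(int)     # only the "middle" category is positive

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))  # 1.0
```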
One final variation that we can try is to use the OneHotEncoder for all features except for education. And since education is actually an ordinal feature, we can use the OrdinalEncoder with it and define the category ordering.
Here are the education categories.
census_X['education'].unique()
array([' 11th', ' HS-grad', ' Assoc-acdm', ' Some-college', ' 10th',
' Prof-school', ' 7th-8th', ' Bachelors', ' Masters', ' Doctorate',
' 5th-6th', ' Assoc-voc', ' 9th', ' 12th', ' 1st-4th',
' Preschool'], dtype=object)
We’ll manually define the category ordering, from “Preschool” through “Doctorate”, and then create an instance of OrdinalEncoder using that ordering.
cats = [[' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th',
         ' 11th', ' 12th', ' HS-grad', ' Some-college', ' Assoc-voc',
         ' Assoc-acdm', ' Bachelors', ' Masters', ' Prof-school', ' Doctorate']]
oe_cats = OrdinalEncoder(categories=cats)
Then we’ll create a ColumnTransformer that applies the OrdinalEncoder to education, and applies the OneHotEncoder to all other features.
ct = make_column_transformer(
    (oe_cats, ['education']),
    remainder=ohe_ignore)
When we pass census_X to the ColumnTransformer’s fit_transform method, it creates 87 feature columns, compared to the 102 columns that were created when we only used the OneHotEncoder.
ct.fit_transform(census_X).shape
(48842, 87)
Finally, we’ll create two Pipelines. Both of them start with the same ColumnTransformer, but one ends with logistic regression while the other ends with random forests.
oe_ohe_logreg = make_pipeline(ct, logreg)
oe_ohe_rf = make_pipeline(ct, rf)
When we cross-validate the first Pipeline, the accuracy is 0.832, which is nearly the same as the 0.833 achieved by the logistic regression Pipeline that used OneHotEncoding for all features.
%time cross_val_score(oe_ohe_logreg, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 659 ms, sys: 8.62 ms, total: 668 ms
Wall time: 668 ms
0.8315588308601922
When we cross-validate the second Pipeline, the accuracy is 0.825, which is nearly the same as the 0.826 achieved by the random forest Pipeline that used OneHotEncoding for all features.
%time cross_val_score(oe_ohe_rf, census_X, census_y, cv=5, scoring='accuracy').mean()
CPU times: user 40.8 s, sys: 309 ms, total: 41.1 s
Wall time: 5.98 s
0.8251300537921482
In summary, encoding the education feature with OrdinalEncoder and the seven other features with OneHotEncoder performed basically the same as encoding all eight features with OneHotEncoder. However, it’s certainly possible that the OrdinalEncoder could provide a benefit under other circumstances.
Let’s summarize what we’ve learned in this chapter.
If you have nominal features, and you’re using a linear model, you should definitely use OneHotEncoder, regardless of whether the features have high cardinality.
If you have nominal features, and you’re using a non-linear model, you can try using OneHotEncoder, and you can try using OrdinalEncoder without defining the category ordering, and then see which option performs better.
If you have ordinal features, regardless of the type of model, you can try using OneHotEncoder, and you can try using OrdinalEncoder while defining the category ordering, and then see which option performs better.
In all cases, keep in mind that if the features have high cardinality, OrdinalEncoder is likely to run significantly faster than OneHotEncoder, which may or may not matter in your particular case.