In this chapter, we’re going to focus on one of the most important data preprocessing steps, which is the encoding of categorical features.
Let’s take a look at our Titanic DataFrame. In the last chapter, the only features we used were Parch and Fare. In this chapter, we want to add Embarked and Sex as additional features, in case they improve our model.
df

   Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5         0       3                                    Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
6         0       1                             McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46        S
7         0       3                      Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
8         1       3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2            347742  11.1333   NaN        S
9         1       2                 Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1      0            237736  30.0708   NaN        C
As a reminder, Parch is the number of parents or children aboard with that passenger, and Fare is the amount the passenger paid. Our first new feature, Embarked, is the port that each passenger embarked from, and the possible values are C, Q, or S. Our other new feature, Sex, is simply male or female.
Currently selected features:
Parch: Number of parents or children aboard with that passenger
Fare: Amount the passenger paid
Embarked: Port the passenger embarked from
Sex: Male or Female
Both Embarked and Sex are known as unordered categorical features because there are distinct categories and there’s no inherent logical ordering to the categories. This type of data is also known as nominal data.
Unordered categorical data:
Contains distinct categories
No inherent logical ordering to the categories
Also called “nominal data”
All scikit-learn models expect features to be numeric, and so Embarked and Sex can’t actually be passed directly to a model. Instead, we’re going to encode them using a process called one-hot encoding, also known as dummy encoding.
Let’s look at the code for one-hot encoding. First, we import the OneHotEncoder class from the preprocessing module. Then, we create an instance of it and set sparse to False.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
By default, OneHotEncoder will output a sparse matrix, which is the most efficient and performant data structure for this type of data. By setting sparse to False, it will instead output a dense matrix, which is just the normal way of representing a matrix. This representation will allow us to examine the output so that we can understand the encoding scheme.
Matrix representations:
Sparse: More efficient and performant
Dense: More readable
Next, we’ll encode the Embarked column by passing it to the fit_transform method of the OneHotEncoder. We’ll talk about the fit_transform method in the next lesson, but for now I just want to highlight that we’ll use double brackets around Embarked to pass it as a single-column DataFrame instead of using single brackets to pass it as a Series.
This is important because OneHotEncoder expects to receive a two-dimensional object (such as a DataFrame) since a one-dimensional object is considered ambiguous. A one-dimensional Series could be interpreted either as a single feature or a single sample, whereas our two-dimensional DataFrame signals to scikit-learn that this is indeed a single feature.
Why use double brackets?
Single brackets:
Outputs a Series
Could be interpreted as a single feature or a single sample
Double brackets:
Outputs a single-column DataFrame
Interpreted as a single feature
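Here’s a sketch of that call and the array it returns; the exact values correspond to the 10-row DataFrame shown at the start of the chapter.
ohe.fit_transform(df[['Embarked']])
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])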
Running the fit_transform method outputs this 10 by 3 array. This is the encoded version of the Embarked column, and it is exactly what we will pass to the model instead of the strings C, Q, and S.
Let’s talk about how we interpret this output. There are 3 columns because there were 3 unique values in Embarked. Each row contains a single 1, and the rest of the values in the row are 0. 100 means “C”, 010 means “Q”, and 001 means “S”, which you can confirm by comparing it to the Embarked column in our DataFrame.
As an aside, this is called one-hot encoding because in each row there is one “hot” level, meaning one non-zero level.
This is also the same output you would get by using the get_dummies function in pandas, though we’ll talk later in the book why it’s best to do all of your preprocessing in scikit-learn instead of pandas.
Output of OneHotEncoder:
One column for each unique value
One non-zero value in each row:
1, 0, 0 means “C”
0, 1, 0 means “Q”
0, 0, 1 means “S”
Let’s now look at the categories attribute of the OneHotEncoder. You can think of it as the column header for our 10 by 3 array. In other words, the categories attribute tells you that the first column represents C, the second column represents Q, and the third column represents S. Because the categories are always in alphabetical order from left to right, I didn’t actually need to examine the categories attribute to know how to interpret the array.
As an aside, you’ll notice a lot of attributes in scikit-learn end in an underscore. This is scikit-learn’s convention for any attribute that is learned or estimated from the data during the fit step.
ohe.categories_
[array(['C', 'Q', 'S'], dtype=object)]
We’ve now seen how the OneHotEncoder encodes the Embarked feature. But why is this a reasonable way to encode a categorical feature?
You can think of it this way: OneHotEncoder creates a feature from each level so that the model can learn the relationship between each level and the target value. In this case, the model can learn the relationship between the target value of Survived and whether or not a passenger embarked at a given port.
For example, the model might learn from the first feature that passengers who embarked at C have a higher survival rate than passengers who didn’t embark at C. This is similar to how a model might learn from a numeric feature like Fare that passengers with a higher Fare have a higher survival rate than passengers with a lower Fare.
Why use one-hot encoding?
Model can learn the relationship between each level and the target value
Example: Model might learn that “C” passengers have a higher survival rate than “not C” passengers
At this point, you might be wondering whether we could have instead encoded Embarked as a single numeric feature with the values 0, 1, and 2 representing C, Q, and S. The answer is yes, we could, but it’s generally not a good idea with unordered categories because it would imply an ordering that doesn’t inherently exist.
To see why it’s not a good idea, let’s pretend that passengers who embarked at C and S had high survival rates, while passengers who embarked at Q had low survival rates. A linear model like logistic regression would have no way to learn that relationship if Embarked were encoded as a single feature: a single feature gets a single coefficient, but capturing this pattern would require both a negative coefficient to represent the impact of Q with respect to C and a positive coefficient to represent the impact of S with respect to Q.
Why not encode as a single feature?
Pretend:
C: high survival rate
Q: low survival rate
S: high survival rate
Single feature would need two coefficients:
Negative coefficient for impact of Q (with respect to C)
Positive coefficient for impact of S (with respect to Q)
In summary, encoding Embarked as a single feature would prohibit a linear model from learning a non-linear relationship in the data, which is why encoding it as multiple features is generally the better choice.
Let’s discuss the fit_transform method, since that’s the method we used with OneHotEncoder to encode the Embarked feature.
OneHotEncoder is known as a transformer, meaning its role is to perform data transformations. Transformers usually have a “fit” method and always have a “transform” method. The fit method is when the transformer learns something, and the transform method is when it uses what it learned to do the data transformation.
Generic transformer methods:
fit: Transformer learns something
transform: Transformer uses what it learned to do the data transformation
Using OneHotEncoder as an example, the fit method is when it learns the categories from the data in alphabetical order, and the transform method is when it creates the feature matrix using those categories.
OneHotEncoder methods:
fit: Learn the categories
transform: Create the feature matrix using those categories
The fit_transform method, which is what we used above, just combines those two steps into a single method call. You can actually do those steps as two separate calls of fit then transform, but the single method call of fit_transform is better because it’s more computationally efficient and also more readable in my opinion.
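In other words, these two sketches produce the same result (using the ohe and df objects from above):
# separate calls: learn the categories, then create the feature matrix
ohe.fit(df[['Embarked']])
ohe.transform(df[['Embarked']])

# combined call: performs both steps at once
ohe.fit_transform(df[['Embarked']])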
3.3 One-hot encoding of multiple features
We saw how to use OneHotEncoder to encode the Embarked column, but we actually need to encode both Embarked and Sex. Thankfully, OneHotEncoder can be applied to multiple features at once.
To do this, we simply pass a two-column DataFrame to the fit_transform method, whereas previously we had passed a one-column DataFrame. It outputs 5 columns, in which the first 3 columns represent Embarked and the last 2 columns represent Sex.
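As a sketch, here’s what that call looks like, along with the categories attribute; the values correspond to the same 10-row DataFrame as before.
ohe.fit_transform(df[['Embarked', 'Sex']])
array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])

ohe.categories_
[array(['C', 'Q', 'S'], dtype=object), array(['female', 'male'], dtype=object)]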
Looking at the categories attribute, we first see the 3 categories that were learned from Embarked in alphabetical order, and then we see the 2 categories that were learned from Sex in alphabetical order. Thus, we know that a 10 in the last two columns means “female”, and a 01 in the last two columns means “male”.
So for example, the first sample in the output array is 00101, which means they embarked from S and they are male. The second sample in the array is 10010, which means they embarked from C and they are female. And so on.
Decoding the output array:
First three columns:
1, 0, 0 means “C”
0, 1, 0 means “Q”
0, 0, 1 means “S”
Last two columns:
1, 0 means “female”
0, 1 means “male”
Example:
0, 0, 1, 0, 1 means “S, male”
1, 0, 0, 1, 0 means “C, female”
Recall that our goal in this chapter was to numerically encode Embarked and Sex so we could include them in our model along with Parch and Fare. How might we do that?
One idea would be to manually stack Parch and Fare side-by-side with the 5 columns output by OneHotEncoder, and then train the model using all 7 columns. However, we would need to repeat the same exact process of encoding and stacking with the new data, since if you train a model with 7 features, you need the same 7 features in the new data in order to make predictions.
How to manually add Embarked and Sex to the model:
Stack Parch and Fare side-by-side with OneHotEncoder output
Repeat the same process with new data
This process would work, but it’s less than ideal, since repeating the same steps twice is both inefficient and error-prone. Not only that, but the complexity of this process will continue to increase as you preprocess additional features.
Problems with a manual approach:
Repeating steps is inefficient and error-prone
Complexity will increase
In the next chapter, I’ll introduce you to the ColumnTransformer and Pipeline classes. We’ll use these two classes to accomplish our goal of adding Embarked and Sex to our model, but we’ll do it in a way that is both reliable and efficient.
3.4 Q&A: When should I use transform instead of fit_transform?
Earlier in this chapter, we used the fit_transform method of OneHotEncoder to encode two categorical features. In this lesson, I’ll show you when it’s appropriate to just use the transform method instead of fit_transform. The example below will use OneHotEncoder, but the principles I’m teaching here apply the same way to all transformers.
We’ll start by creating a DataFrame of training data with just 1 categorical feature.
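The exact contents aren’t critical; here’s one way to build it, matching the demo_train DataFrame shown in the next lesson (the letter column contains A, B, C, and B).
import pandas as pd
demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'B']})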
Recall that fit_transform is really 2 steps. During the first step, which is fit, the OneHotEncoder learns the 3 categories. During the second step, which is transform, the OneHotEncoder creates the feature matrix using those categories. It outputs a 4 by 3 array, since there are 4 samples and 3 categories.
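Here’s a sketch of that call, using a fresh OneHotEncoder with the same sparse=False setting as before:
ohe = OneHotEncoder(sparse=False)
# learns the categories A, B, C and creates a 4 by 3 feature matrix
ohe.fit_transform(demo_train)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])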
Example of fit_transform on training data:
fit: Learn 3 categories (A, B, C)
transform: Create feature matrix with 3 columns
Now, we’ll create a DataFrame of testing data. It contains the same feature, but that feature includes one less category.
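For instance, the testing data might look like this; the letter values (A, C, A) are consistent with the two-column output shown below.
demo_test = pd.DataFrame({'letter': ['A', 'C', 'A']})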
What would happen if we ran fit_transform on the testing data?
ohe.fit_transform(demo_test)
array([[1., 0.],
       [0., 1.],
       [1., 0.]])
The output array only includes two columns, because the testing data only included two categories. The first column represents the A category, and the second column represents the C category.
This is problematic, because if we trained a model using the 3-column feature matrix, and then tried to make predictions on the 2-column feature matrix, it would error due to a shape mismatch. That makes sense because if you train a model such as logistic regression using 3 features, it will learn 3 coefficients, and it expects to use all 3 of those coefficients when making predictions.
Example of fit_transform on testing data:
fit: Learn 2 categories (A, C)
transform: Create feature matrix with 2 columns
The solution is to run fit_transform on the training data, and only run transform on the testing data. Let’s take a look at the output arrays.
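Here’s the sketch; the outputs correspond to the demo_train and demo_test DataFrames defined above.
ohe.fit_transform(demo_train)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

ohe.transform(demo_test)
array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])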
Notice that the categories are represented the same way in both arrays: the first column represents A, the second column represents B, and the third column represents C.
This happened because we only ran the fit method once, on the training data, and the fit method is when the OneHotEncoder learns the categories.
Then we ran the transform method twice, both on the training data and the testing data. Because we didn’t run the fit method on the testing data, the categories learned from the training data were applied to both the training and testing data. This is critically important because it means that both our training and testing feature matrices have 3 columns, and those 3 columns mean the same thing.
Correct process:
Run fit_transform on training data:
fit: Learn 3 categories (A, B, C)
transform: Create feature matrix with 3 columns
Run transform on testing data:
transform: Create feature matrix with 3 columns
In summary, when using any transformer, you will always use the fit_transform method on the training data and only the transform method on the testing data.
3.5 Q&A: What happens if the testing data includes a new category?
In the previous lesson, we created this example DataFrame of training data.
demo_train
  letter
0      A
1      B
2      C
3      B
When we passed that DataFrame to fit_transform, the output array included three columns.
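Now suppose the testing data includes a category, D, that was never seen during training. Here’s one DataFrame that would reproduce the error shown below; the name demo_test_unknown is simply the one used in the transform call that follows.
demo_test_unknown = pd.DataFrame({'letter': ['A', 'C', 'D']})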
If you pass this new DataFrame to the transform method, it will throw an error because it doesn’t know how to represent the D category. It only knows how to represent the A, B, and C categories because those are the ones that were seen by the OneHotEncoder during the fit step.
ohe.transform(demo_test_unknown)
ValueError: Found unknown categories ['D'] in column 0 during transform
There are two possible solutions to this problem.
The first solution is to specify the categories manually to the OneHotEncoder when creating an instance.
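As a sketch, assuming we know that the full set of possible categories is A through D:
ohe = OneHotEncoder(sparse=False, categories=[['A', 'B', 'C', 'D']])
ohe.fit_transform(demo_train)
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])

ohe.transform(demo_test_unknown)
array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
The D column is all zeros in the training data, but it’s available whenever D appears in new data.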
However, specifying the categories manually is only a useful solution if you know all possible categories that might ever appear in your data, and in the real world, you don’t always know the full set of categories ahead of time.
For example, there might be rare categories that aren’t present in your set of samples, or new categories might be added in the future. If one of your categorical features was medical billing codes, you could imagine that new billing codes are added over time.
Why you might not know all possible categories:
Rare categories aren’t present in your set of samples
New categories are added later
If you don’t know all possible categories, then the solution is to set the handle_unknown parameter of the OneHotEncoder to ignore, which overrides the default value of error.
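Here’s a sketch of that setup; note that we re-fit on the training data after changing the parameter.
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit_transform(demo_train)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])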
Now, when you use the transform method on the testing data, the third sample is encoded as all zeros because D is an unknown category.
ohe.transform(demo_test_unknown)
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])
Although this might seem strange, this is actually quite a reasonable approach since you don’t have any information from the training data about the relationship between the D category and the target value.
One limitation of this approach, however, is that all unknown categories will be encoded the same way, which means that an E value in the testing data would also be encoded as all zeros.
Here’s my overall advice:
When starting a project, keep the handle_unknown parameter set to its default value of error so that you will know if you are encountering new categories in your testing data.
If you do find that you’re encountering new categories, but you can determine the full set of categories through research, then define the categories manually when creating the OneHotEncoder instance.
If you can’t determine the full set of categories, then set the handle_unknown parameter to ignore. However, you should retrain your model as soon as possible using data that includes those new categories.
Advice for OneHotEncoder:
Start with handle_unknown set to ‘error’
If possible, specify the categories manually
If necessary, set handle_unknown to ‘ignore’ and then retrain your model
3.6 Q&A: Should I drop one of the one-hot encoded categories?
Here’s the example training data that we’ve used in the past few lessons.
demo_train
  letter
0      A
1      B
2      C
3      B
And here’s the default one-hot encoding of this DataFrame.
When one-hot encoding, it’s somewhat common to drop the first column of the output array because it contains redundant information and because it avoids collinearity between features.
You can drop the first column:
Contains redundant information
Avoids collinearity between features
If you want to drop the first column, you can set the OneHotEncoder’s drop parameter to first, though this option only exists in scikit-learn version 0.21 and later. When you run the fit_transform, you can see that the output array contains 1 less column. However, the new encoding retains the same information, since each category is still represented by a unique code: 00 means A, 10 means B, and 01 means C.
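As a sketch, using the same demo_train DataFrame as before:
ohe = OneHotEncoder(sparse=False, drop='first')
ohe.fit_transform(demo_train)
array([[0., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.]])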
Decoding the output array (after dropping the first column):
0, 0 means “A”
1, 0 means “B”
0, 1 means “C”
Dropping the first column will work regardless of the number of categories, but you’re only ever allowed to drop a single column. And it doesn’t actually matter which column you drop, though the convention is to drop the first column.
You’ve now seen that you can drop the first column, but the question is, should you drop the first column? Here’s my advice.
If you know that perfectly collinear features will cause problems, such as when feeding the resulting data into a neural network or an unregularized regression, then it’s a good idea to drop the first column. However, for most scikit-learn models, perfectly collinear features will not cause any problems, and thus dropping the first column will not benefit the model.
There are also some significant downsides to dropping the first column that you need to be aware of.
Number one, dropping the first column is incompatible with ignoring unknown categories, which is the handle_unknown=‘ignore’ option that we saw in the previous lesson, since the dropped category and unknown categories would both be encoded as all zeros. You are allowed to do this starting in scikit-learn 1.0, but I still don’t recommend it.
Number two, dropping the first column can introduce bias into the model if you standardize your features, such as with StandardScaler, or if you use a regularized model, such as logistic regression, since the dropped category will be exempt from standardization and regularization.
Should you drop the first column?
Advantages:
Useful if perfectly collinear features will cause problems (does not apply to most models)
Disadvantages:
Incompatible with handle_unknown=‘ignore’
Introduces bias if you standardize features or use a regularized model
In summary, I recommend that you drop the first column only if you know that perfectly collinear features will cause problems, otherwise I don’t recommend dropping the first column.
3.7 Q&A: How do I encode an ordinal feature?
Throughout this chapter, we used one-hot encoding to encode unordered categorical features, also known as nominal data. But how should you encode categorical features with an inherent logical ordering, also known as ordinal data? That’s the subject of this lesson.
Types of categorical data:
Unordered (nominal data)
Ordered (ordinal data)
Let’s take a look at our Titanic DataFrame.
Pclass, which stands for passenger class, is an ordinal feature. Although it’s already numeric, the numbers 1, 2, and 3 represent the categories 1st class, 2nd class, and 3rd class. It’s considered ordinal data because there is a logical ordering to the categories.
Our intuition is that there may be a relationship between Pclass values increasing and survival rate decreasing, because passengers in the lower-numbered classes may have gotten priority access to lifeboats.
df

   Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5         0       3                                    Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
6         0       1                             McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46        S
7         0       3                      Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
8         1       3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2            347742  11.1333   NaN        S
9         1       2                 Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1      0            237736  30.0708   NaN        C
Thus if we were going to include Pclass in the model, we would keep the existing numeric encoding so that the model can learn the relationship between Pclass and Survived with a single feature. You could use one-hot encoding with Pclass instead, but the model wouldn’t be able to learn that relationship as effectively because that information would be spread out across three features.
Options for encoding Pclass:
Ordinal encoding: Creates one feature
One-hot encoding: Creates three features
Let’s create an example DataFrame to see how to handle ordinal features that are stored as strings.
In this DataFrame, we have two ordinal features, Class and Size.
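The exact rows aren’t important; here’s one construction that is consistent with the encoded output shown below.
df_ordinal = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                           'Size': ['S', 'S', 'L', 'XL']})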
If you have ordinal data, you should use the OrdinalEncoder class to do the encoding. First, you import it from the preprocessing module. Then, you create an instance of OrdinalEncoder, and when you do so, you define the logical order of the categories.
We pass a list of lists to the categories parameter, in which the first inner list is the categories for the Class feature, and the second inner list is the categories for the Size feature. I put the two lists in that order because that is the order in which I’ll be passing the features to the fit_transform method.
One important note is that I included the M category for Size even though it wasn’t present in this DataFrame because I knew that it would occur in the dataset at some point.
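Here’s what that setup looks like as a sketch, with the logical order spelled out for each feature:
from sklearn.preprocessing import OrdinalEncoder
# one list of ordered categories per feature, in the order the features will be passed
oe = OrdinalEncoder(categories=[['first', 'second', 'third'],
                                ['S', 'M', 'L', 'XL']])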
Next, we pass the DataFrame to the OrdinalEncoder’s fit_transform method in order to do the encoding. You’ll notice that each input feature became a single feature in the output array.
oe.fit_transform(df_ordinal)
array([[2., 0.],
       [0., 0.],
       [1., 2.],
       [2., 3.]])
For the Class feature, first was encoded as 0, second was encoded as 1, and third was encoded as 2.
For the Size feature, S was encoded as 0, L was encoded as 2, and XL was encoded as 3. And if M appears in the data at some point, it will be encoded as 1.
Again, we encoded each input feature as a single column so that the model can learn the relationship between the target and an increase or decrease in each feature.
Decoding the output array:
First column:
0 means “first”
1 means “second”
2 means “third”
Second column:
0 means “S”
1 means “M”
2 means “L”
3 means “XL”
Example:
2, 0 means “third, S”
Let’s briefly contrast this with the output you would get if you used OneHotEncoder with these same two features.
OneHotEncoder would create 7 columns in the output array, since Class has 3 categories and Size has 4 possible categories (counting the M category we specified). These 7 columns contain the same information as the 2 columns output by OrdinalEncoder, but the model would have a comparatively harder time learning from the 7 columns since the information is expressed in a less compact form.
If you have an ordinal feature that’s already encoded numerically, then leave it as-is.
If you have an ordinal feature that’s stored as strings, then encode it using OrdinalEncoder.
If you have a nominal feature, then encode it using OneHotEncoder.
Advice for encoding categorical data:
Ordinal feature stored as numbers: Leave as-is
Ordinal feature stored as strings: Use OrdinalEncoder
Nominal feature: Use OneHotEncoder
In chapter 17, we’ll explore this topic further and see if there are cases in which you should diverge from this advice.
One final note about OrdinalEncoder is that unlike OneHotEncoder, it does not allow for new categories in the testing data that were not seen during training. However, that functionality is available beginning in scikit-learn version 0.24 using a handle_unknown parameter.
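For example, in version 0.24 and later you can tell OrdinalEncoder to encode any unknown category as a value of your choosing; the -1 below is just an illustrative choice, not a required one.
oe = OrdinalEncoder(categories=[['first', 'second', 'third'],
                                ['S', 'M', 'L', 'XL']],
                    handle_unknown='use_encoded_value', unknown_value=-1)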
3.8 Q&A: What’s the difference between OrdinalEncoder and LabelEncoder?
There are many similarities between the OrdinalEncoder and LabelEncoder classes, so in this lesson I’ll explain how they’re different and why you should be using OrdinalEncoder, not LabelEncoder.
The first main difference is that OrdinalEncoder allows you to define the order of the categories, whereas LabelEncoder does not. LabelEncoder simply uses the alphabetical order of the values you pass to it to determine which value to encode as 0, which value to encode as 1, and so on.
The second main difference is that OrdinalEncoder can be used to encode multiple features at once, whereas LabelEncoder can only encode one column of data at once.
                                    OrdinalEncoder  LabelEncoder
Can you define the category order?  Yes             No
Can you encode multiple features?   Yes             No
Because of these differences, OrdinalEncoder is much better suited than LabelEncoder for encoding ordinal features. And in fact, LabelEncoder is only intended for the encoding of class labels, hence its name.
You might be asking why LabelEncoder even exists, given its limitations. There are two reasons.
First, in older versions of scikit-learn, some classification models were not able to handle string-based labels. LabelEncoder was used to encode those strings as integers so that they could be passed to the model. That limitation was removed a few years ago, and so all scikit-learn classifiers can now handle string-based labels. Therefore, you should never need to use LabelEncoder for encoding your class labels.
Second, also in older versions of scikit-learn, OneHotEncoder did not accept strings as input. Thus, if you had categorical data stored as strings, you actually had to use LabelEncoder to encode the strings as integers before passing them to the OneHotEncoder. Again, that limitation was removed a few years ago, and so you can pass string-based categorical data directly to OneHotEncoder.
Outdated uses for LabelEncoder:
Encoding string-based labels for some classifiers
Encoding string-based features for OneHotEncoder
Because of this legacy from older versions of scikit-learn, many people are familiar with LabelEncoder and thus use it to encode features. However, the best practice is to use OrdinalEncoder to encode ordinal features. In fact, it’s rare that you will ever need to use LabelEncoder, which is why I’m not using it in this book.
3.9 Q&A: Should I encode numeric features as ordinal features?
Normally, when you have a continuous numeric feature such as Fare, you pass that feature directly to your Machine Learning model.
df[['Fare']]
      Fare
0   7.2500
1  71.2833
2   7.9250
3  53.1000
4   8.0500
5   8.4583
6  51.8625
7  21.0750
8  11.1333
9  30.0708
However, one strategy that is sometimes used with numeric features is to “discretize” or “bin” them into categorical features. In scikit-learn, we can do this using KBinsDiscretizer.
from sklearn.preprocessing import KBinsDiscretizer
When creating an instance of KBinsDiscretizer, you define the number of bins, the binning strategy, and the method used for encoding the result.
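For example, an instance with 3 bins might look like the sketch below. The ordinal encoding matches the output described next, while the quantile binning strategy is just one reasonable choice, not necessarily the only option.
# 3 bins, integer-coded result, bin edges chosen by quantiles
kb = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
kb.fit_transform(df[['Fare']])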
When we pass the Fare feature to the fit_transform method, every sample is assigned to bin 0, 1, or 2, because we specified 3 bins. The smallest Fare values are assigned to bin 0, the largest values are assigned to bin 2, and the values in between are assigned to bin 1. Thus, we’ve taken a continuous numeric feature and encoded it as an ordinal feature, and this ordinal feature could be passed to the model in place of the numeric feature.
The obvious follow-up question is: Should we discretize our numeric features? Theoretically, discretization can benefit linear models by helping them to learn non-linear trends. However, my general recommendation is to not use discretization, for three main reasons.
First, discretization removes all nuance from the data, which makes it harder for a model to learn the actual trends that are present in the data.
Second, discretization reduces the variation in the data, which makes it easier to find trends that don’t actually exist.
Third, any possible benefits of discretization are highly dependent on the parameters used with KBinsDiscretizer. Making those decisions by hand creates a risk of overfitting the training data, and making those decisions during a tuning process adds both complexity and processing time, and so neither of those options is particularly attractive to me.