6  Encoding text data

6.1 Vectorizing text

So far in this book, we’ve only focused on numerical and categorical features. In this chapter, we’ll learn how to create features from unstructured text data.

Let’s take another look at our Titanic DataFrame.

We want to include the Name column in our model in case it contains predictive information about the likelihood of survival. For example, a passenger’s last name might be predictive if they’re part of an important family, and certain titles might also be predictive.

The Name column can’t be passed directly to the model because it’s not numeric. So in this lesson, we’ll learn how to encode it numerically, and in the next lesson we’ll add it to the ColumnTransformer and Pipeline.

df
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

One idea for encoding would be to simply one-hot encode the Name column. However, OneHotEncoder would treat each full name as its own category. This is unlikely to be useful, since we don’t expect to see any full name repeated more than once in the training or new data.

Instead, we want to consider each word in a passenger’s name independently, so that we can learn different things from each part of the name. The CountVectorizer class was built for this purpose, so that’s what we’ll use. In brief, CountVectorizer converts text into a matrix of token counts, and you’ll see exactly what that means in a moment.

Ideas for encoding the Name column:

  • OneHotEncoder: Each full name is treated as a category (not recommended)
  • CountVectorizer: Each word in a name is treated independently (recommended)

To start, we import CountVectorizer from the feature_extraction module and create an instance called vect.

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

Then we’ll pass the Name column to the fit_transform method of vect. Notice that we’re using single brackets around Name to select it as a Series, because CountVectorizer expects 1-dimensional input. This is unlike OneHotEncoder and most other transformers, which expect 2-dimensional input.

The fit_transform method outputs what is called a document-term matrix, which we save as dtm. When we print it out, you can see that it’s a sparse matrix containing 10 rows and 40 columns. There’s one row for each of the 10 names, and one column for each of the 40 features it created.

dtm = vect.fit_transform(df['Name'])
dtm
<10x40 sparse matrix of type '<class 'numpy.int64'>'
    with 46 stored elements in Compressed Sparse Row format>

CountVectorizer vs other transformers:

  • CountVectorizer: 1-dimensional input (Series)
  • Other transformers: 2-dimensional input (DataFrame)
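
If you’re ever unsure which type of selection you’re using, the ndim attribute distinguishes the two. Here’s a quick check (ndim is a standard pandas attribute):

print(df['Name'].ndim)    # 1: single brackets select a Series
print(df[['Name']].ndim)  # 2: double brackets select a DataFrame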

Let’s examine the names of the 40 feature columns it created by running the get_feature_names method. As an aside, this will be replaced by the get_feature_names_out method starting in scikit-learn 1.0.

These are the 40 unique words that were found in the Name column after lowercasing the words, removing all punctuation, and excluding words that were only 1 character long.

Note that the features are sorted alphabetically, which is the same convention used by the OneHotEncoder when learning category names.

print(vect.get_feature_names())
['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']

Default settings for CountVectorizer:

  • Convert all words to lowercase
  • Remove all punctuation
  • Exclude one-character words
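
If you’d like to see exactly how these default settings apply to a particular string, one option is the build_analyzer method, which returns the preprocessing and tokenization function that CountVectorizer uses internally. Here’s a quick sketch using one of the names from our data:

analyzer = vect.build_analyzer()
analyzer('McCarthy, Mr. Timothy J')  # ['mccarthy', 'mr', 'timothy']

Notice that the words were lowercased, the punctuation was stripped, and the one-character word “J” was excluded.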

Let’s quickly summarize what we know about the document-term matrix it created. There are 10 rows and 40 columns: each row represents a row from the training data, and each column represents a word. The rows are known as “documents” and the feature names are known as “terms”, which is why it’s called a document-term matrix. Finally, the whole matrix is stored in a sparse format.

About the document-term matrix:

  • 10 rows and 40 columns
  • Rows represent rows from training data, columns represent words
  • Rows are “documents”, feature names are “terms”
  • Sparse matrix

We want to examine the document-term matrix to better understand it. Unlike OneHotEncoder, CountVectorizer does not have a sparse=False argument that we can set in order to view the dense representation. Instead, we’ll use the toarray method to make it dense and then convert that into a DataFrame, using the feature names as the column headings.

How to examine a document-term matrix:

  1. Use toarray method to make it dense
  2. Convert dense matrix into a DataFrame
  3. Use feature names as column headings

pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
achem adele allen berg bradley braund briggs cumings elisabeth florence ... nasser nicholas oscar owen palsson peel thayer timothy vilhelmina william
0 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
1 0 0 0 0 1 0 1 1 0 1 ... 0 0 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 1 0
9 1 1 0 0 0 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 0

10 rows × 40 columns

What can we learn by examining this? If we compared the Name column to the document-term matrix, we would see that in each row, CountVectorizer counted how many times each word appeared.

Let’s take the first row as an example. Using the head method, we can see that the name in the first row of training data was “Braund, Mr. Owen Harris”. Thus the first row of the document-term matrix contains 4 ones (under “braund”, “mr”, “owen”, and “harris”), and the other 36 entries are all zeros. Unfortunately, you can’t see the entries in the middle without modifying the display options for pandas.

df.head(1)
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 NaN S
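
If you do want to see every column of the document-term matrix, one option is to change the pandas display settings before printing. A minimal sketch (display.max_columns is a standard pandas option):

pd.set_option('display.max_columns', None)  # show all columns instead of truncating with '...'
pd.reset_option('display.max_columns')      # restore the default afterwards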

This encoding is known as the “Bag of Words” representation of text data. It’s known as a “bag” because this encoding doesn’t capture the word ordering present in each name, rather it only captures the count of how many times a word appears in each name.

“Bag of Words” representation:

  • Ignores word order
  • Only counts how many times a word appears
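
To make the order-invariance concrete, here’s a tiny sketch using a fresh CountVectorizer and two made-up documents. The same words in a different order produce identical rows:

vect_demo = CountVectorizer()
vect_demo.fit_transform(['Owen Harris', 'Harris Owen']).toarray()
# array([[1, 1],
#        [1, 1]])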

This representation is the feature matrix that will get passed to the model. Just like the OneHotEncoder created a 10 by 3 matrix from Embarked, CountVectorizer created this 10 by 40 matrix from Name.

And from each of the 40 features, the model can learn the relationship between the target value and how many times that word appeared in each passenger’s name. Thus if a particular word is predictive of survival, such as “Braund”, the model will be able to learn that from the matrix.

6.2 Including text data in the model

Now that we know how to encode text data, we’re ready to include it in our model.

We’ll start by updating the cols list to include the Name column, and then use cols to update the X DataFrame.

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X
Parch Fare Embarked Sex Name
0 0 7.2500 S male Braund, Mr. Owen Harris
1 0 71.2833 C female Cumings, Mrs. John Bradley (Florence Briggs Th...
2 0 7.9250 S female Heikkinen, Miss. Laina
3 0 53.1000 S female Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 0 8.0500 S male Allen, Mr. William Henry
5 0 8.4583 Q male Moran, Mr. James
6 0 51.8625 S male McCarthy, Mr. Timothy J
7 1 21.0750 S male Palsson, Master. Gosta Leonard
8 2 11.1333 S female Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 0 30.0708 C female Nasser, Mrs. Nicholas (Adele Achem)

Next, we’ll update the ColumnTransformer by adding one more tuple to specify that the CountVectorizer should be applied to Name.

Note that there are no brackets around “Name”. This is not because there’s only one column being passed to CountVectorizer. Instead, it’s because CountVectorizer expects 1-dimensional input, and brackets would signal 2-dimensional input to the ColumnTransformer.

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    ('passthrough', ['Parch', 'Fare']))

Next, we’ll use the fit_transform method to try out the transformation. The output contains 47 columns.

ct.fit_transform(X)
<10x47 sparse matrix of type '<class 'numpy.float64'>'
    with 78 stored elements in Compressed Sparse Row format>

We can confirm the meaning of these 47 columns with get_feature_names: 3 columns for Embarked, 2 columns for Sex, 40 columns for Name, 1 column for Parch, and 1 column for Fare. Again, the features appear in this order because that’s the order in which they were passed to the ColumnTransformer.

ct.get_feature_names() 
['onehotencoder__x0_C',
 'onehotencoder__x0_Q',
 'onehotencoder__x0_S',
 'onehotencoder__x1_female',
 'onehotencoder__x1_male',
 'countvectorizer__achem',
 'countvectorizer__adele',
 'countvectorizer__allen',
 'countvectorizer__berg',
 'countvectorizer__bradley',
 'countvectorizer__braund',
 'countvectorizer__briggs',
 'countvectorizer__cumings',
 'countvectorizer__elisabeth',
 'countvectorizer__florence',
 'countvectorizer__futrelle',
 'countvectorizer__gosta',
 'countvectorizer__harris',
 'countvectorizer__heath',
 'countvectorizer__heikkinen',
 'countvectorizer__henry',
 'countvectorizer__jacques',
 'countvectorizer__james',
 'countvectorizer__john',
 'countvectorizer__johnson',
 'countvectorizer__laina',
 'countvectorizer__leonard',
 'countvectorizer__lily',
 'countvectorizer__master',
 'countvectorizer__may',
 'countvectorizer__mccarthy',
 'countvectorizer__miss',
 'countvectorizer__moran',
 'countvectorizer__mr',
 'countvectorizer__mrs',
 'countvectorizer__nasser',
 'countvectorizer__nicholas',
 'countvectorizer__oscar',
 'countvectorizer__owen',
 'countvectorizer__palsson',
 'countvectorizer__peel',
 'countvectorizer__thayer',
 'countvectorizer__timothy',
 'countvectorizer__vilhelmina',
 'countvectorizer__william',
 'Parch',
 'Fare']

Output columns:

  • Columns 1-3: Embarked
  • Columns 4-5: Sex
  • Columns 6-45: Name
  • Column 46: Parch
  • Column 47: Fare

Now we’ll update the Pipeline to use the modified ColumnTransformer.

pipe = make_pipeline(ct, logreg)

Then we can fit the Pipeline, which outputs a summary of its steps.

pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch', 'Fare'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
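
As a quick sanity check, we can confirm that the logistic regression model learned one coefficient for each of the 47 transformed features. This sketch assumes the step name “logisticregression” that make_pipeline generated, which is visible in the output above:

pipe.named_steps['logisticregression'].coef_.shape  # (1, 47): one coefficient per column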

Our last step before prediction is to update X_new to include the Name column.

X_new = df_new[cols]

And finally, we’ll use the fitted Pipeline to make predictions for X_new.

pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

We’ve now accomplished our goal, which is to include the Name column in our model.

6.3 Q&A: Why is the document-term matrix stored as a sparse matrix?

Just like OneHotEncoder, CountVectorizer outputs a sparse matrix by default. To explore why it does this, let’s create a Python list of two short text documents and call it text.

text = ['Machine Learning is fun', 'I am learning Machine Learning']

We’ll pass those documents to the fit_transform method of CountVectorizer, make it dense with the toarray method, and then convert it to a DataFrame. As you can see, the word “I” was ignored because it’s only one character, and there’s a 2 under “learning” because the word “learning” appears twice in the second document.

pd.DataFrame(vect.fit_transform(text).toarray(), columns=vect.get_feature_names())
am fun is learning machine
0 0 1 1 1 1
1 1 0 0 2 1

Now let’s use CountVectorizer on the same text to output a sparse matrix instead, which is the default representation. The matrix is 2 rows by 5 columns, and there are 7 stored elements, meaning 7 non-zero values in the matrix.

dtm = vect.fit_transform(text)
dtm
<2x5 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>

We can actually see those 7 elements by printing the sparse matrix. It turns out that a sparse matrix only stores the positions of the non-zero values and the values at those positions. In contrast, a dense matrix stores every value, whether or not it’s zero.

print(dtm)
  (0, 4)    1
  (0, 3)    1
  (0, 2)    1
  (0, 1)    1
  (1, 4)    1
  (1, 3)    2
  (1, 0)    1

As you might imagine, most elements in a typical document-term matrix are zero. This is because a collection of documents tends to have a large number of unique words, whereas any given document in that collection only contains a small fraction of those words.

When most elements in a matrix are zero, a sparse representation requires far less storage space than a dense representation, and computations on it can be much faster. This is the same reason that OneHotEncoder outputs a sparse matrix, since its output also tends to be mostly zeros.

Preferred matrix representation:

  • Most elements are zero: Sparse matrix
  • Most elements are non-zero: Dense matrix
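
To see the storage difference concretely, you can compare the bytes used by the dense array with the bytes used by the three internal arrays of the CSR matrix (data, indices, and indptr are standard scipy attributes). Keep in mind that for a tiny matrix like this one, the sparse overhead can actually exceed the dense size; the savings only appear when the matrix is large and mostly zeros:

dense = dtm.toarray()
print(dense.nbytes)  # every element is stored, zero or not
print(dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes)  # only the non-zero structure is stored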

6.4 Q&A: What happens if the testing data includes new words?

In the previous lesson, we created two short text documents.

text
['Machine Learning is fun', 'I am learning Machine Learning']

Let’s pretend that those documents were our training data. If we passed those documents to the fit_transform method of CountVectorizer, these are the 5 features that would be learned.

dtm = vect.fit_transform(text)
vect.get_feature_names()
['am', 'fun', 'is', 'learning', 'machine']

Now, let’s create another short text document to act as our testing data. It includes two words, “is” and “fun”, that were in the training data, and two words, “data” and “science”, that were not in the training data.

text_new = ['Data Science is FUN!']

As we’ve discussed throughout this book, your testing data needs to have the same columns as your training data. In other words, we need to create a document-term matrix from the testing data that has the same 5 columns as the training data.

To do this, we’ll pass the testing data to the transform method, and it will build a matrix using the features it learned during the fit step.

vect.transform(text_new).toarray()
array([[0, 1, 1, 0, 0]])

In other words, the vectorizer learned its vocabulary from the training data, and it uses that same vocabulary when creating the document-term matrix for the testing data.

CountVectorizer methods:

  • fit: Learn the vocabulary
  • transform: Create the document-term matrix using that vocabulary

As you can see by comparing the output to the feature names, the vectorizer only counted the words “fun” and “is” in the testing data. The words “data” and “science” were ignored because they were not seen in the training data.

Ignoring unknown words actually makes intuitive sense, because if a word wasn’t seen during training, then you don’t know anything about the relationship between that word and the target variable. This is similar to setting the OneHotEncoder’s handle_unknown parameter to “ignore”, since that ignores unknown categories encountered during the transform step by encoding them as all zeros.
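
Here’s a quick sketch of that analogous OneHotEncoder behavior, fitting on the three Embarked categories and then transforming an unseen category (the category “Z” is made up for illustration):

from sklearn.preprocessing import OneHotEncoder
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
ohe_ignore.fit([['C'], ['Q'], ['S']])
ohe_ignore.transform([['Z']]).toarray()  # array([[0., 0., 0.]]): unknown category becomes all zeros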

6.5 Q&A: How do I vectorize multiple columns of text?

Let’s take a look at the Name and Ticket columns from the Titanic DataFrame. Even though some of the Ticket values are entirely numeric, they’re actually all stored as strings.

df[['Name', 'Ticket']]
Name Ticket
0 Braund, Mr. Owen Harris A/5 21171
1 Cumings, Mrs. John Bradley (Florence Briggs Th... PC 17599
2 Heikkinen, Miss. Laina STON/O2. 3101282
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 113803
4 Allen, Mr. William Henry 373450
5 Moran, Mr. James 330877
6 McCarthy, Mr. Timothy J 17463
7 Palsson, Master. Gosta Leonard 349909
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 347742
9 Nasser, Mrs. Nicholas (Adele Achem) 237736

If we wanted to apply CountVectorizer to both columns so that we could include both in our model, how would we do it?

First, let’s try applying CountVectorizer separately. Vectorizing Name creates a 10 by 40 matrix, and vectorizing Ticket creates a 10 by 13 matrix.

vect.fit_transform(df['Name'])
<10x40 sparse matrix of type '<class 'numpy.int64'>'
    with 46 stored elements in Compressed Sparse Row format>
vect.fit_transform(df['Ticket'])
<10x13 sparse matrix of type '<class 'numpy.int64'>'
    with 13 stored elements in Compressed Sparse Row format>

What we want is to stack these matrices side-by-side as a 10 by 53 matrix.

One idea would be to pass both columns as a DataFrame to CountVectorizer, which is how we would transform multiple columns with OneHotEncoder, for example. However, you’ll see that the output is not what we had hoped for. Because CountVectorizer expects 1-dimensional input and iterating over a DataFrame yields its column names, it treated the two strings “Name” and “Ticket” as the documents, producing a 2 by 2 matrix.

vect.fit_transform(df[['Name', 'Ticket']])
<2x2 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>

To actually get the result we’re looking for, we have to pass each column separately to make_column_transformer. And you’ll see that it does indeed output a 10 by 53 matrix.

To be clear, it’s not problematic to use the same CountVectorizer object twice: the ColumnTransformer clones each transformer before fitting it, so two separate vocabularies will still be learned.

ct = make_column_transformer(
    (vect, 'Name'),
    (vect, 'Ticket'))
ct.fit_transform(df)
<10x53 sparse matrix of type '<class 'numpy.int64'>'
    with 59 stored elements in Compressed Sparse Row format>

Recall that make_column_transformer assigns names to all transformers. Normally the assigned name would be “countvectorizer” in all lowercase, but it can’t give both transformers the same name. Instead, it appends a number to each, as you can see in the output below.

ct
ColumnTransformer(transformers=[('countvectorizer-1', CountVectorizer(),
                                 'Name'),
                                ('countvectorizer-2', CountVectorizer(),
                                 'Ticket')])

You can also see these names by running the keys method on the named_transformers_ attribute.

ct.named_transformers_.keys()
dict_keys(['countvectorizer-1', 'countvectorizer-2', 'remainder'])
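
The named_transformers_ attribute also gives us access to the fitted vectorizers, so we can confirm that two separate vocabularies were learned (vocabulary_ is the fitted vocabulary attribute):

len(ct.named_transformers_['countvectorizer-1'].vocabulary_)  # 40 words from Name
len(ct.named_transformers_['countvectorizer-2'].vocabulary_)  # 13 tokens from Ticket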

If you wanted to avoid these names, you could instead create the ColumnTransformer using the ColumnTransformer class, since that requires you to assign a custom name to each transformer.
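
For example, here’s a minimal sketch using the ColumnTransformer class directly; the names “name_vect” and “ticket_vect” are our own choices:

from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
    ('name_vect', vect, 'Name'),
    ('ticket_vect', vect, 'Ticket')])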

6.6 Q&A: Should I one-hot encode or vectorize categorical features?

Let’s say you have categorical features which only contain one word, such as Embarked and Sex. We’ve been using OneHotEncoder to encode them, but should we use CountVectorizer instead? Let’s try it out and see what happens.

df[['Embarked', 'Sex']]
Embarked Sex
0 S male
1 C female
2 S female
3 S female
4 S male
5 Q male
6 S male
7 S male
8 S female
9 C female

If we use CountVectorizer on the Sex column, it produces the exact same output as the OneHotEncoder.

vect.fit_transform(df['Sex']).toarray()
array([[0, 1],
       [1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [1, 0],
       [1, 0]])

But as we saw in the previous lesson, CountVectorizer won’t do what you expect if you try to encode multiple columns at once, since it expects 1-dimensional input.

vect.fit_transform(df[['Embarked', 'Sex']]).toarray()
array([[1, 0],
       [0, 1]])

In addition, the default settings for CountVectorizer don’t allow for one-character tokens, so you would have to modify those settings if you wanted to use it with Embarked.

vect.fit_transform(df['Embarked']).toarray()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[35], line 1
----> 1 vect.fit_transform(df['Embarked']).toarray()

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:1198, in CountVectorizer.fit_transform(self, raw_documents, y)
   1195 min_df = self.min_df
   1196 max_features = self.max_features
-> 1198 vocabulary, X = self._count_vocab(raw_documents,
   1199                                   self.fixed_vocabulary_)
   1201 if self.binary:
   1202     X.data.fill(1)

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:1129, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)
   1127     vocabulary = dict(vocabulary)
   1128     if not vocabulary:
-> 1129         raise ValueError("empty vocabulary; perhaps the documents only"
   1130                          " contain stop words")
   1132 if indptr[-1] > np.iinfo(np.int32).max:  # = 2**31 - 1
   1133     if _IS_32BIT:

ValueError: empty vocabulary; perhaps the documents only contain stop words
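
If you did want to vectorize Embarked anyway, one workaround is to relax the default token_pattern so that one-character tokens are kept. A minimal sketch (the default pattern requires two or more word characters, while this version requires only one):

vect_short = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_short.fit_transform(df['Embarked']).toarray()  # 10x3 matrix with columns for 'c', 'q', and 's'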

Finally, OneHotEncoder lets you decide how you want to handle categories that weren’t seen during training (using the handle_unknown parameter), whereas CountVectorizer doesn’t provide that option and will always ignore words that it didn’t see during training.

In summary, OneHotEncoder is the better encoding mechanism for any data that you would consider categorical, since it can encode multiple columns at once, it allows one-character category names by default, and it provides more options for handling unknown categories.

Advantages of OneHotEncoder for categorical data:

  • Encodes multiple columns at once
  • Allows one-character category names
  • Gives more options for handling unknown categories