So far in this book, we’ve only focused on numerical and categorical features. In this chapter, we’ll learn how to create features from unstructured text data.
Let’s take another look at our Titanic DataFrame.
We want to include the Name column in our model in case it contains predictive information about the likelihood of survival. For example, a passenger's last name might be predictive of survival if they belonged to an important family, and certain titles might also be predictive.
The Name column can’t be passed directly to the model because it’s not numeric. So in this lesson, we’ll learn how to encode it numerically, and in the next lesson we’ll add it to the ColumnTransformer and Pipeline.
df
   Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5         0       3                                   Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
6         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46        S
7         0       3                     Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
8         1       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2            347742  11.1333   NaN        S
9         1       2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1      0            237736  30.0708   NaN        C
One idea for encoding would be to simply one-hot encode the Name column. However, OneHotEncoder would treat each full name as its own category. This is unlikely to be useful, since we don’t expect to see any full name repeated more than once in the training or new data.
Instead, we want to consider each word in their name independently, so that we can learn different things from each part of their name. The CountVectorizer class was built for this purpose, so that’s what we’ll use. In brief, CountVectorizer converts text into a matrix of token counts, and you’ll see exactly what that means in a few minutes.
Ideas for encoding the Name column:
OneHotEncoder: Each full name is treated as a category (not recommended)
CountVectorizer: Each word in a name is treated independently (recommended)
To start, we import CountVectorizer from the feature_extraction.text module and create an instance called vect.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
Then we'll pass the Name column to the fit_transform method of vect. Notice that we're using single brackets around Name to pass it as a Series, since CountVectorizer expects 1-dimensional input. This is unlike OneHotEncoder and most other transformers, which expect 2-dimensional input.
The fit_transform method outputs what is called a document-term matrix, which we save as dtm. When we print it out, you can see that it’s a sparse matrix containing 10 rows and 40 columns. There’s one row for each of the 10 names, and one column for each of the 40 features it created.
dtm = vect.fit_transform(df['Name'])
dtm
<10x40 sparse matrix of type '<class 'numpy.int64'>'
with 46 stored elements in Compressed Sparse Row format>
CountVectorizer vs other transformers:
CountVectorizer: 1-dimensional input (Series)
Other transformers: 2-dimensional input (DataFrame)
Let’s examine the names of the 40 feature columns it created by running the get_feature_names method. As an aside, this will be replaced by the get_feature_names_out method starting in scikit-learn 1.0.
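A sketch of that call, along with the output it should produce for our 10-row sample (the exact list depends on the full names, including the parts truncated in the DataFrame display above):
vect.get_feature_names()
['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs',
 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris',
 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson',
 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss',
 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen',
 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']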
These are the 40 unique words that were found in the Name column after lowercasing the words, removing all punctuation, and excluding words that were only 1 character long.
Note that the features are sorted alphabetically, which is the same convention used by the OneHotEncoder when learning category names.
Let’s quickly summarize what we know about the document-term matrix it created:
There are 10 rows and 40 columns.
Each row represents a row from the training data, and each column represents a word.
The rows are known as documents, and the feature names are known as terms, which is why it’s called a document-term matrix.
And it’s stored as a sparse matrix.
About the document-term matrix:
10 rows and 40 columns
Rows represent rows from training data, columns represent words
Rows are “documents”, feature names are “terms”
Sparse matrix
We want to examine the document-term matrix to better understand it. Unlike OneHotEncoder, CountVectorizer does not have a sparse=False argument that we can set in order to view the dense representation. Instead, we’ll use the toarray method to make it dense and then convert that into a DataFrame, using the feature names as the column headings.
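Here's what that conversion might look like, assuming pandas has been imported as pd:
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())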
What can we learn by examining this? If we compared the Name column to the document-term matrix, we would see that in each row, CountVectorizer counted how many times each word appeared.
Let’s take the first row as an example. Using the head method, we can see that the name in the first row of training data was “Braund, Mr. Owen Harris”. Thus the first row of the document-term matrix contains 4 ones (under “braund”, “mr”, “owen”, and “harris”), and the other 36 entries are all zeros. Unfortunately, you can’t see the entries in the middle without modifying the display options for pandas.
df.head(1)
   Survived  Pclass                     Name   Sex   Age  SibSp  Parch     Ticket  Fare Cabin Embarked
0         0       3  Braund, Mr. Owen Harris  male  22.0      1      0  A/5 21171  7.25   NaN        S
This encoding is known as the “Bag of Words” representation of text data. It’s known as a “bag” because this encoding doesn’t capture the word ordering present in each name, rather it only captures the count of how many times a word appears in each name.
“Bag of Words” representation:
Ignores word order
Only counts how many times a word appears
This representation is the feature matrix that will get passed to the model. Just like the OneHotEncoder created a 10 by 3 matrix from Embarked, CountVectorizer created this 10 by 40 matrix from Name.
And from each of the 40 features, the model can learn the relationship between the target value and how many times that word appeared in each passenger’s name. Thus if a particular word is predictive of survival, such as “Braund”, the model will be able to learn that from the matrix.
6.2 Including text data in the model
Now that we know how to encode text data, we’re ready to include it in our model.
We’ll start by updating the cols list to include the Name column, and then use cols to update the X DataFrame.
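A sketch of those two updates; the first four column names here are an assumption based on the feature counts discussed below, since the original cols list was built in earlier chapters:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]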
Next, we’ll update the ColumnTransformer by adding one more tuple to specify that the CountVectorizer should be applied to Name.
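Here's roughly what the updated ColumnTransformer might look like, assuming ohe is the OneHotEncoder instance created in an earlier chapter:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')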
Note that there are no brackets around “Name”. This is not because there’s only one column being passed to CountVectorizer. Instead, it’s because CountVectorizer expects 1-dimensional input, and brackets would signal 2-dimensional input to the ColumnTransformer.
Next, we’ll use the fit_transform method to try out the transformation. The output contains 47 columns.
ct.fit_transform(X)
<10x47 sparse matrix of type '<class 'numpy.float64'>'
with 78 stored elements in Compressed Sparse Row format>
We can confirm the meaning of these 47 columns with get_feature_names: 3 columns for Embarked, 2 columns for Sex, 40 columns for Name, 1 column for Parch, and 1 column for Fare. Again, the features are in this order because that's the order in which they were passed to the ColumnTransformer.
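Since X now includes the Name column, the Pipeline needs to be refit before we can make predictions. A minimal sketch, assuming (as in earlier chapters) that pipe chains ct with a classification model and y is the Survived column:
pipe.fit(X, y)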
Our last step before prediction is to update X_new to include the Name column.
X_new = df_new[cols]
And finally, we’ll use the fitted Pipeline to make predictions for X_new.
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
We’ve now accomplished our goal, which is to include the Name column in our model.
6.3 Q&A: Why is the document-term matrix stored as a sparse matrix?
Just like OneHotEncoder, CountVectorizer outputs a sparse matrix by default. To explore why it does this, let’s create a Python list of two short text documents and call it text.
text = ['Machine Learning is fun', 'I am learning Machine Learning']
We’ll pass those documents to the fit_transform method of CountVectorizer, make it dense with the toarray method, and then convert it to a DataFrame. As you can see, the word “I” was ignored because it’s only one character, and there’s a 2 under “learning” because the word “learning” appears twice in the second document.
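A sketch of that cell and the output it should produce, assuming pandas has been imported as pd:
pd.DataFrame(vect.fit_transform(text).toarray(),
             columns=vect.get_feature_names())
   am  fun  is  learning  machine
0   0    1   1         1        1
1   1    0   0         2        1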
Now let’s use CountVectorizer on the same text to output a sparse matrix instead, which is the default representation. The matrix is 2 rows by 5 columns, and there are 7 stored elements, meaning 7 non-zero values in the matrix.
dtm = vect.fit_transform(text)
dtm
<2x5 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
We can actually see those 7 elements by printing the sparse matrix. It turns out that a sparse matrix only stores the positions of the non-zero values and the values at those positions. In contrast, a dense matrix stores every value, whether or not it's zero.
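For example, printing dtm should display something like the following, where each line pairs a (row, column) position with the count stored there:
print(dtm)
  (0, 1)  1
  (0, 2)  1
  (0, 3)  1
  (0, 4)  1
  (1, 0)  1
  (1, 3)  2
  (1, 4)  1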
As you might imagine, most elements in a typical document-term matrix are zero. This is because a collection of documents tends to have a large number of unique words, whereas any given document in that collection only contains a small fraction of those words.
When most elements in a matrix are zero, a sparse representation requires far less storage space than a dense representation and is also more performant. This is the same reason that OneHotEncoder outputs a sparse matrix, since its output also tends to be mostly zeros.
Preferred matrix representation:
Most elements are zero: Sparse matrix
Most elements are non-zero: Dense matrix
6.4 Q&A: What happens if the testing data includes new words?
In the previous lesson, we created two short text documents.
text
['Machine Learning is fun', 'I am learning Machine Learning']
Let’s pretend that those documents were our training data. If we passed those documents to the fit_transform method of CountVectorizer, these are the 5 features that would be learned.
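To illustrate, refitting vect on the training documents and asking for the feature names should give:
vect.fit_transform(text)
vect.get_feature_names()
['am', 'fun', 'is', 'learning', 'machine']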
Now, let’s create another short text document to act as our testing data. It includes two words, “is” and “fun”, that were in the training data, and two words, “data” and “science”, that were not in the training data.
text_new = ['Data Science is FUN!']
As we’ve discussed throughout this book, your testing data needs to have the same columns as your training data. In other words, we need to create a document-term matrix from the testing data that has the same 5 columns as the training data.
To do this, we’ll pass the testing data to the transform method, and it will build a matrix using the features it learned during the fit step.
vect.transform(text_new).toarray()
array([[0, 1, 1, 0, 0]])
In other words, the vectorizer learned its vocabulary from the training data, and it uses that same vocabulary when creating the document-term matrix for the testing data.
CountVectorizer methods:
fit: Learn the vocabulary
transform: Create the document-term matrix using that vocabulary
As you can see by comparing the output to the feature names, the vectorizer only counted the words "fun" and "is" in the testing data. The words "data" and "science" were ignored because they were not seen in the training data.
Ignoring unknown words actually makes intuitive sense, because if a word wasn’t seen during training, then you don’t know anything about the relationship between that word and the target variable. This is similar to setting the OneHotEncoder’s handle_unknown parameter to “ignore”, since that ignores unknown categories encountered during the transform step by encoding them as all zeros.
6.5 Q&A: How do I vectorize multiple columns of text?
Let’s take a look at the Name and Ticket columns from the Titanic DataFrame. Even though some of the Ticket values are entirely numeric, they’re actually all stored as strings.
df[['Name', 'Ticket']]
                                                Name            Ticket
0                            Braund, Mr. Owen Harris         A/5 21171
1  Cumings, Mrs. John Bradley (Florence Briggs Th...          PC 17599
2                             Heikkinen, Miss. Laina  STON/O2. 3101282
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)            113803
4                           Allen, Mr. William Henry            373450
5                                   Moran, Mr. James            330877
6                            McCarthy, Mr. Timothy J             17463
7                     Palsson, Master. Gosta Leonard            349909
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)            347742
9                Nasser, Mrs. Nicholas (Adele Achem)            237736
If we wanted to apply CountVectorizer to both columns so that we could include both in our model, how would we do it?
First, let’s try applying CountVectorizer separately. Vectorizing Name creates a 10 by 40 matrix, and vectorizing Ticket creates a 10 by 13 matrix.
vect.fit_transform(df['Name'])
<10x40 sparse matrix of type '<class 'numpy.int64'>'
with 46 stored elements in Compressed Sparse Row format>
vect.fit_transform(df['Ticket'])
<10x13 sparse matrix of type '<class 'numpy.int64'>'
with 13 stored elements in Compressed Sparse Row format>
What we want is to stack these matrices side-by-side as a 10 by 53 matrix.
One idea would be to pass both columns as a DataFrame to CountVectorizer, which is how we would transform multiple columns with OneHotEncoder, for example. However, you’ll see that the output is not what we had hoped for. This is because CountVectorizer expects 1-dimensional input, and we passed it 2-dimensional input instead.
vect.fit_transform(df[['Name', 'Ticket']])
<2x2 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
To actually get the result we’re looking for, we have to pass each column separately to make_column_transformer. And you’ll see that it does indeed output a 10 by 53 matrix.
To be clear, it’s not problematic to use the same CountVectorizer object twice, since it will still learn two separate vocabularies.
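A sketch of that setup, reusing the vect instance for both columns:
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (vect, 'Name'),
    (vect, 'Ticket'))
ct.fit_transform(df)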
<10x53 sparse matrix of type '<class 'numpy.int64'>'
with 59 stored elements in Compressed Sparse Row format>
Recall that make_column_transformer assigns names to all transformers. Normally the assigned name would be "countvectorizer" in all lowercase, but it can't give both transformers the same name. Instead, it appends numbers to the end, resulting in names like "countvectorizer-1" and "countvectorizer-2".
If you wanted to avoid these names, you could instead build the transformer with the ColumnTransformer class, since it requires you to assign a custom name to each transformer.
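For example, here's a hypothetical version with custom names ("name" and "ticket" are arbitrary labels chosen for this sketch):
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('name', vect, 'Name'),
     ('ticket', vect, 'Ticket')])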
6.6 Q&A: Should I one-hot encode or vectorize categorical features?
Let’s say you have categorical features which only contain one word, such as Embarked and Sex. We’ve been using OneHotEncoder to encode them, but should we use CountVectorizer instead? Let’s try it out and see what happens.
df[['Embarked', 'Sex']]
  Embarked     Sex
0        S    male
1        C  female
2        S  female
3        S  female
4        S    male
5        Q    male
6        S    male
7        S    male
8        S  female
9        C  female
If we use CountVectorizer on the Sex column, it produces the exact same output as the OneHotEncoder.
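For example, fitting vect on Sex should produce two alphabetically sorted features ("female" and "male") and the familiar one-hot pattern:
vect.fit_transform(df['Sex']).toarray()
array([[0, 1],
       [1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [1, 0],
       [1, 0]])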
But as we saw in the previous lesson, CountVectorizer won’t do what you expect if you try to encode multiple columns at once, since it expects 1-dimensional input.
In addition, the default settings for CountVectorizer don’t allow for one-character tokens, so you would have to modify those settings if you wanted to use it with Embarked.
vect.fit_transform(df['Embarked']).toarray()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[35], line 1
----> 1 vect.fit_transform(df['Embarked']).toarray()

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/feature_extraction/text.py in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)

ValueError: empty vocabulary; perhaps the documents only contain stop words
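If you did want to vectorize Embarked anyway, one workaround (shown here as an illustration, not as an approach from the original text) is to relax the default token_pattern, which only matches tokens of two or more characters; vect_single is a hypothetical name for this sketch:
vect_single = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_single.fit_transform(df['Embarked']).toarray()
array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0]])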
Finally, OneHotEncoder lets you decide how you want to handle categories that weren’t seen during training (using the handle_unknown parameter), whereas CountVectorizer doesn’t provide that option and will always ignore words that it didn’t see during training.
In summary, OneHotEncoder is the better encoding mechanism for any data that you would consider categorical, since it can encode multiple columns at once, it allows one-character category names by default, and it provides more options for handling unknown categories.
Advantages of OneHotEncoder for categorical data:
Encodes multiple columns at once
Allows one-character category names
Gives more options for handling unknown categories