Up to now, we’ve only been working with the first 10 rows of the Titanic dataset to make it easy to examine the input and output of each workflow step. In this chapter, we’ll begin using the full Titanic dataset. This will create a few new problems that are common with real datasets, and we’ll figure out how to handle those problems appropriately.
We’ll start by reading the training data into df and reading the new data into df_new, overwriting the existing objects. When examining the shapes, you can see that df_new has one less column than df because it doesn’t contain the target column of Survived.
df = pd.read_csv('http://bit.ly/MLtrain')
df.shape
(891, 11)
df_new = pd.read_csv('http://bit.ly/MLnewdata')
df_new.shape
(418, 10)
We’ll check for missing values in these two DataFrames by chaining together the isna and sum methods. The results tell us how many missing values are present in each column.
This reveals two problems we’ll have to handle that weren’t present in our 10-row datasets. First, Embarked contains missing values in df, and second, Fare contains a missing value in df_new. We’ll spend the rest of this chapter addressing those two problems.
Note that we don’t have to worry about missing values in Cabin because we’re not yet using that as a feature, and our existing workflow already accounts for the missing values in Age.
df.isna().sum()
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
df_new.isna().sum()
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
In this lesson, we’re going to figure out how to handle the missing values in the Embarked column.
We’ll start with a reminder of the six feature columns we’re using.
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
We’ll redefine X and y to use the full dataset rather than the 10-row dataset.
X = df[cols]
y = df['Survived']
And here’s a reminder of the ColumnTransformer we created in the previous chapter.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    ('passthrough', ['Parch', 'Fare']))
Normally we would pass X to the fit_transform method, but in this case it will error because the Embarked column contains missing values. Our solution will be to impute missing values for Embarked before one-hot encoding it.
ct.fit_transform(X)
ValueError: Input contains NaN
As an aside, OneHotEncoder automatically handles missing values by treating them as a new category starting in scikit-learn version 0.24. I’m using version 0.23, and so I’ll be writing the code to manually treat missing values as a new category. Even if you’re using version 0.24 or later, I still recommend following my code because what I’m about to teach you will enable you to solve other similar problems that scikit-learn does not automatically handle.
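In case you’d like to see that automatic behavior, here’s a minimal sketch, assuming scikit-learn 0.24 or later (it will not run on 0.23): NaN is simply learned as its own category during the fit step. The ohe_auto name is just for illustration.
# a sketch, assuming scikit-learn 0.24+: no imputation needed before encoding
ohe_auto = OneHotEncoder()
ohe_auto.fit_transform(X[['Embarked']])
ohe_auto.categories_
# e.g. [array(['C', 'Q', 'S', nan], dtype=object)]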
As I was saying, our solution is to impute missing values for Embarked and then one-hot encode it.
The first step of this solution is to create a new instance of SimpleImputer, which we’ll call imp_constant. For categorical features, you can either impute the most frequent value or a constant user-defined value. We’ll choose the latter by setting the strategy parameter to 'constant', and the constant value we’ll impute is the string 'missing'.
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
Next, we’ll create a two-step Pipeline that only contains transformers. The first step is imputation using our new imputer, and the second step is one-hot encoding. We’ll call this Pipeline imp_ohe to remind us of the two steps it contains.
imp_ohe = make_pipeline(imp_constant, ohe)
We can test out the imp_ohe Pipeline by passing the Embarked column to its fit_transform method. It outputs four columns because missing values are essentially being treated as a fourth category in addition to C, Q, and S.
imp_ohe.fit_transform(X[['Embarked']])
<891x4 sparse matrix of type '<class 'numpy.float64'>'
with 891 stored elements in Compressed Sparse Row format>
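If you’d like to inspect the encoded values directly, one option (shown here as a sketch, and fine at this dataset’s size) is to densify the sparse output:
# a sketch: convert the sparse output to a dense array for inspection
imp_ohe.fit_transform(X[['Embarked']]).toarray()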
We can confirm this by accessing the second step of the Pipeline, which is the OneHotEncoder, and then examining the categories_ attribute.
imp_ohe[1].categories_
[array(['C', 'Q', 'S', 'missing'], dtype=object)]
In case it helps you to understand imp_ohe better, I’m going to show you what happens “under the hood” when you fit_transform this Pipeline. To be clear, you should not actually write the following code; rather, it’s just for teaching purposes.
First, the imp_constant object imputes a string value of “missing” for any missing values in the Embarked column. Then, the output of the imputer is one-hot encoded by the ohe object, which outputs four columns.
ohe.fit_transform(imp_constant.fit_transform(X[['Embarked']]))
<891x4 sparse matrix of type '<class 'numpy.float64'>'
with 891 stored elements in Compressed Sparse Row format>
Now that we’ve created a transformer-only Pipeline to handle the missing values in Embarked, we’ll simply replace the ohe transformer in our ColumnTransformer with the imp_ohe Pipeline.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    ('passthrough', ['Parch', 'Fare']))
There are two things I want to note about the imp_ohe Pipeline:

- A Pipeline that ends in a model can’t be included in a ColumnTransformer, but imp_ohe is eligible because all of its steps are transformers.
- We’re applying imp_ohe to the Sex column as well as Embarked. There are no missing values in the Sex column, so the imputation step won’t affect it, and it will simply get passed along to the one-hot encoding step.

By replacing ohe with imp_ohe, we have now solved the problem of missing values in the Embarked column. Thus, we can pass X to the ColumnTransformer’s fit_transform method, and it will not throw an error.
As an aside, the output matrix is now much wider than before because the Name column of X contains a large number of unique words.
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
Now that we’ve solved our first problem, we’re going to move on to the second problem, which is the missing values in the Fare column. Recall that Fare has missing values in X_new but not in X, and thus our modeling Pipeline would error when making predictions for X_new if we don’t account for these missing values.
Our solution to this problem is to impute missing values for Fare. The ColumnTransformer already contains an imputer that does mean imputation, so we’ll apply the existing imputer to the Fare column, whereas previously Fare was a passthrough column. This is actually all that is required to solve our problem.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))
Now, we’ll pass X to the fit_transform method of the ColumnTransformer. It will output the same number of columns as it did before, since Fare just moved from a passthrough column to a transformed column.
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
To be clear, the Fare column does not have any missing values in X, and thus the imputer did not impute any values for Fare during the fit_transform. However, it did learn the mean of Fare in X, which is the imputation value that will be applied to the Fare column of X_new during prediction.
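If you want to verify what was learned, here’s a sketch: a fitted SimpleImputer stores its learned values in the statistics_ attribute, which we can reach through the fitted ColumnTransformer (the printed means below are approximate).
# a sketch for inspecting the learned imputation values (one per column: Age, Fare)
ct.named_transformers_['simpleimputer'].statistics_
# e.g. array([29.699, 32.204])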
Next, we’ll update our modeling Pipeline to include the revised ColumnTransformer, and fit it on X and y. You can see from the output that there’s now a transformer Pipeline within the ColumnTransformer, which is within the modeling Pipeline.
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer', SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
Finally, we’ll redefine X_new to use the full dataset, and then use the fitted Pipeline to make predictions for X_new. We know that we’ve solved our second problem because the Pipeline did not throw any errors during the predict step.
X_new = df_new[cols]
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
When we pass X to the ColumnTransformer’s fit_transform method, it outputs a matrix with 1518 columns. How can we find out the names of these columns?
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
Earlier in the book, we used the get_feature_names method for this purpose, which, as I mentioned previously, has been replaced by get_feature_names_out starting in scikit-learn 1.0. However, get_feature_names will only work if all of the underlying transformers have a get_feature_names method. In this case, it errors because neither Pipeline transformers nor SimpleImputer transformers have a get_feature_names method.
ct.get_feature_names()
AttributeError: Transformer pipeline does not provide get_feature_names
The good news is that starting in scikit-learn 1.1, the get_feature_names_out method is available for all transformers, which means that retrieving the feature names will no longer error.
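As a sketch of what that future usage would look like (assuming scikit-learn 1.1 or later):
# a sketch, assuming scikit-learn 1.1+
ct.get_feature_names_out()
# returns an array of all 1518 feature names, each prefixed with its transformer's name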
In the meantime, our only solution for figuring out the column names is to inspect the transformers one-by-one.
When we print out the transformers_ attribute, we can see that there are 4 transformers.
ct.transformers_
[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer', CountVectorizer(), 'Name'),
('simpleimputer', SimpleImputer(), ['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])]
The first transformer is a Pipeline of SimpleImputer and OneHotEncoder. OneHotEncoder has a get_feature_names method, which we can access by selecting the pipeline transformer and then its onehotencoder step. get_feature_names outputs 6 features, which we know are the first 6 features in the matrix because this is the first transformer in the ColumnTransformer.
(ct.named_transformers_['pipeline']
   .named_steps['onehotencoder']
   .get_feature_names())
array(['x0_C', 'x0_Q', 'x0_S', 'x0_missing', 'x1_female', 'x1_male'],
dtype=object)
The second transformer is a CountVectorizer. It also has a get_feature_names method, which we can access by selecting the countvectorizer transformer. We could print out all of the feature names, but instead we’ll pass it to the len function, which indicates that the next 1509 features in the matrix came from CountVectorizer.
len(ct.named_transformers_['countvectorizer'].get_feature_names())
1509
The third transformer is a SimpleImputer, which doesn’t change the number of columns since we’re not adding a missing indicator, so we know that the next two features in the matrix are Age and Fare.
The fourth transformer is a passthrough transformer, which also doesn’t change the number of columns, so we know that the final feature in the matrix is Parch.
We’ve now accounted for all 1518 features: 6 from the Pipeline transformer, 1509 from the CountVectorizer, 2 from the SimpleImputer, and 1 from the passthrough transformer.
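As a quick sanity check (just arithmetic), the counts do add up to the width of the matrix:
6 + 1509 + 2 + 1
1518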
Earlier in this chapter, since the Embarked column contained missing values and needed one-hot encoding, we created a two-step Pipeline called imp_ohe. The first step of this Pipeline is imputation of a constant value, and the second step is one-hot encoding.
imp_ohe = make_pipeline(imp_constant, ohe)
We included the imp_ohe Pipeline in the ColumnTransformer, and applied it to both the Embarked and Sex columns. Here’s what it would look like if the ColumnTransformer only contained the imp_ohe Pipeline.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']))
When you run the fit_transform method, Embarked turns into 4 columns and Sex turns into 2 columns, and the results are stacked side-by-side.
ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
[1., 0., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
...,
[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 1.]])
Because the Sex column didn’t contain any missing values and only needed one-hot encoding, we could have achieved the exact same results by applying imp_ohe just to Embarked and then separately applying ohe to Sex.
ct = make_column_transformer(
    (imp_ohe, ['Embarked']),
    (ohe, ['Sex']))
The fit_transform does indeed output the same results as above, though I personally prefer the first ColumnTransformer.
ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
[1., 0., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
...,
[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 1.]])
One common question is whether you can avoid using the imp_ohe Pipeline entirely by making a ColumnTransformer like this instead, in which the imputation of a constant value is applied to Embarked, and one-hot encoding is applied to both Embarked and Sex.
ct = make_column_transformer(
    (imp_constant, ['Embarked']),
    (ohe, ['Embarked', 'Sex']))
The answer is no, you cannot. The fit_transform method throws an error because Embarked contains missing values, and the ohe transformer is not able to handle missing values.
ct.fit_transform(X)
ValueError: Input contains NaN
The key insight here is that there’s no interaction between the transformers of a ColumnTransformer. In other words, there’s no flow of data from one transformer to the next, meaning the output of the imp_constant transformer does not become the input to the ohe transformer. Thus, the ohe transformer is operating on the original Embarked column, not a transformed Embarked column in which missing values have been imputed.
If that’s confusing, it might be useful to recall the key differences between a Pipeline and a ColumnTransformer:

- In a Pipeline, the output of one step becomes the input to the next step. This is precisely why we created the imp_ohe Pipeline: We needed the output of the imputer to become the input to the one-hot encoder.
- A ColumnTransformer does not have steps. Instead, it has transformers that operate in parallel, and the output of each transformer is stacked beside the other transformer outputs.

When imputing missing values for a categorical feature, you can either impute the most frequent value or a constant user-defined value. In this lesson, I’m going to discuss how you might choose between these two strategies.
Imputing a constant value essentially treats the missing values as a new category, which I believe is the better choice regardless of whether the values are missing at random or not at random. Imputing a constant value is especially important if the majority of values are missing, since imputing the most frequent value in that case would more than double the size of the category that was imputed, which would be quite misleading to the model.
That being said, imputing the most frequent value is much more acceptable when you have only a small number of missing values for a given feature, since the imputation won’t have much of an impact on the model anyway.
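For reference, here’s what the most-frequent alternative would look like (imp_frequent is just an illustrative name):
# a sketch of the alternative imputation strategy for categorical features
imp_frequent = SimpleImputer(strategy='most_frequent')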
It’s important to note that if you impute a constant value for a feature, and that feature has missing values in the new data but not the training data, then you’ll need to set the OneHotEncoder’s handle_unknown parameter to 'ignore'. That’s because the missing values category won’t be learned during the OneHotEncoder’s fit step, and thus unknown values seen during the transform step need to be ignored in order to avoid an error.
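Here’s a minimal sketch of that setup (the ohe_ignore and imp_ohe_ignore names are just for illustration):
# a sketch: ignore categories at transform time that weren't learned during fit
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
imp_ohe_ignore = make_pipeline(imp_constant, ohe_ignore)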
The alternative here is just to impute the most frequent value for that feature, in which case you can leave the handle_unknown parameter set to its default value of 'error'.
Here’s the Pipeline that we’ve built throughout the book. The strategy I’ve used throughout is to include all data transformations within a single ColumnTransformer, including any missing value imputation, and then use that ColumnTransformer as the first step in a two-step Pipeline.
pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer', SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
However, an alternative approach would be to create a three-step Pipeline in which the first step is missing value imputation, the second step includes all other data transformations, and the third step is the model. Let’s try it out to see if this is a better approach.
This would be the first ColumnTransformer, which only does missing value imputation. Constant value imputation is applied to Embarked, mean imputation is applied to Age and Fare, and the other columns are passed through because they don’t contain any missing values in the training or new data. It would be the first step in the Pipeline.
ct1 = make_column_transformer(
    (imp_constant, ['Embarked']),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Sex', 'Name', 'Parch']))
This would be the second ColumnTransformer, which handles all other data transformations. It would be the second step in the Pipeline, and thus it would operate on the output of the first ColumnTransformer. However, a ColumnTransformer outputs a NumPy array, not a DataFrame, and thus in this ColumnTransformer we would have to reference the columns by position instead of by name.

We know the order of the columns from the first ColumnTransformer, and thus Embarked and Sex would be columns 0 and 3 and Name would be column 4. Embarked and Sex are one-hot encoded, Name is vectorized, and the other columns are passed through.
ct2 = make_column_transformer(
    (ohe, [0, 3]),
    (vect, 4),
    ('passthrough', [1, 2, 5]))
Now that we’ve created the ColumnTransformers, we can include them in a three-step Pipeline and fit the Pipeline to X and y.
pipe = make_pipeline(ct1, ct2, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer-1',
                 ColumnTransformer(transformers=[('simpleimputer-1',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant'),
                                                  ['Embarked']),
                                                 ('simpleimputer-2', SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Sex', 'Name', 'Parch'])])),
                ('columntransformer-2',
                 ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                                  [0, 3]),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 4),
                                                 ('passthrough', 'passthrough',
                                                  [1, 2, 5])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
Finally, we can use this three-step Pipeline to make predictions, and it does indeed make the same predictions as our original two-step Pipeline.
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
Using a three-step Pipeline like this is certainly a valid approach. However, I find the original two-step Pipeline easier to write and to read, and thus I prefer the two-step approach.
The rules for Pipelines are that all steps other than the final step must be a transformer, and the final step can be a model or a transformer.
If a Pipeline ends in a model, such as our pipe object, you can use the Pipeline’s fit and predict methods (a code sketch follows these two lists):

- When you call the fit method, all steps before the final one run fit_transform, and the final step runs fit.
- When you call the predict method, all steps before the final one run transform, and the final step runs predict.
If a Pipeline ends in a transformer, such as our imp_ohe object, you generally use the Pipeline’s fit_transform and transform methods, but you can also use the fit method:

- When you call the fit_transform method, all steps run fit_transform.
- When you call the transform method, all steps run transform.
- When you call the fit method, all steps before the final one run fit_transform, and the final step runs fit.
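As promised, here’s a rough sketch (for teaching purposes only, like the earlier “under the hood” example) of what our two-step pipe, built from ct and logreg, is doing when you call fit and predict:
# for illustration only -- pipe runs these steps for you
logreg.fit(ct.fit_transform(X), y)       # roughly what pipe.fit(X, y) does
logreg.predict(ct.transform(X_new))      # roughly what pipe.predict(X_new) does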
Although this is a lot of information to take in, developing this level of understanding will definitely make it easier for you to test and debug your future Pipelines.