Up to now, we’ve only been working with the first 10 rows of the Titanic dataset to make it easy to examine the input and output of each workflow step. In this chapter, we’ll begin using the full Titanic dataset. This will create a few new problems that are common with real datasets, and we’ll figure out how to handle those problems appropriately.
We’ll start by reading the training data into df and reading the new data into df_new, overwriting the existing objects.
When examining the shapes, you can see that df_new has one fewer column than df because it doesn’t contain the target column, Survived.
df = pd.read_csv('http://bit.ly/MLtrain')
df.shape
(891, 11)
df_new = pd.read_csv('http://bit.ly/MLnewdata')
df_new.shape
(418, 10)
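As a quick check (not part of the original workflow), a set difference confirms that Survived is the only column missing from df_new:

set(df.columns) - set(df_new.columns)
{'Survived'}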
We’ll check for missing values in these two DataFrames by chaining together the isna and sum methods. The results, shown below, tell us how many missing values are present in each column.

This reveals two problems we’ll have to handle that weren’t present in our 10-row datasets:

- df has 2 missing values in the Embarked column, which we’re using as a feature.
- df_new has 1 missing value in the Fare column, which we’re also using as a feature.

We’ll spend the rest of this chapter addressing these two problems. Note that we don’t have to worry about the missing values in Cabin because we’re not yet using it as a feature, and our existing workflow already accounts for the missing values in Age.
df.isna().sum()
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
df_new.isna().sum()
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
In this lesson, we’re going to figure out how to handle the missing values in the Embarked column.
We’ll start with a reminder of the six feature columns we’re using.
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
We’ll redefine X and y to use the full dataset rather than the 10-row dataset.
X = df[cols]
y = df['Survived']
Here’s a reminder of the ColumnTransformer we created in the previous chapter.
ct = make_column_transformer(
(ohe, ['Embarked', 'Sex']),
(vect, 'Name'),
(imp, ['Age']),
('passthrough', ['Parch', 'Fare']))
Normally we would pass X to the fit_transform method, but in this case it errors because the Embarked column contains missing values.
ct.fit_transform(X)
ValueError: Input contains NaN
Our solution will be to impute missing values for Embarked before one-hot encoding it.
Starting in scikit-learn version 0.24, OneHotEncoder automatically handles missing values by treating them as a new category. I’m using version 0.23, and so I’ll be writing the code to manually treat missing values as a new category. Even if you’re using version 0.24 or later, I still recommend following my code because what I’ll teach you will enable you to solve similar problems that scikit-learn does not automatically handle.
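For example, if you happen to be on 0.24 or later, this line should run without any imputation, since the missing values in Embarked simply become their own category (on earlier versions, it raises the NaN error shown above):

ohe.fit_transform(X[['Embarked']])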
The first step of this solution is to create a new instance of SimpleImputer, which we’ll call imp_constant. For categorical features, you can either impute the most frequent value or a constant user-defined value. We’ll choose the latter by setting the strategy parameter to 'constant', and the constant value we’ll impute is the string 'missing'.
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
Next, we’ll create a two-step Pipeline that only contains transformers. The first step is imputation using our new imputer, and the second step is one-hot encoding. We’ll call this Pipeline imp_ohe to remind us of the two steps it contains.
imp_ohe = make_pipeline(imp_constant, ohe)
We can test out the imp_ohe Pipeline by passing the Embarked column to its fit_transform method. It outputs four columns because missing values are essentially being treated as a fourth category in addition to C, Q, and S.
imp_ohe.fit_transform(X[['Embarked']])
<891x4 sparse matrix of type '<class 'numpy.float64'>'
with 891 stored elements in Compressed Sparse Row format>
We can confirm this by accessing the second step of the Pipeline, which is the OneHotEncoder, and then examining the categories_ attribute.
imp_ohe[1].categories_
[array(['C', 'Q', 'S', 'missing'], dtype=object)]
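Equivalently, since make_pipeline names each step after its lowercased class name, you can select the OneHotEncoder step by name rather than by position:

imp_ohe.named_steps['onehotencoder'].categories_
[array(['C', 'Q', 'S', 'missing'], dtype=object)]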
In case it helps you to understand imp_ohe better, I’m going to show you what happens “under the hood” when you fit_transform this Pipeline. To be clear, you should not actually write the following code; rather, it’s just for teaching purposes:
- First, the imp_constant object imputes a string value of “missing” for any missing values in the Embarked column.
- Then, the imputed column becomes the input to the ohe object, which outputs four columns.

ohe.fit_transform(imp_constant.fit_transform(X[['Embarked']]))
<891x4 sparse matrix of type '<class 'numpy.float64'>'
with 891 stored elements in Compressed Sparse Row format>
Now that we’ve created a transformer-only Pipeline to handle the missing values in Embarked, we’ll simply replace the ohe transformer in our ColumnTransformer with the imp_ohe Pipeline.
ct = make_column_transformer(
(imp_ohe, ['Embarked', 'Sex']),
(vect, 'Name'),
(imp, ['Age']),
('passthrough', ['Parch', 'Fare']))
There are two things I want to note about the imp_ohe Pipeline:

- A Pipeline that ends in a model can’t be used inside a ColumnTransformer, but imp_ohe is eligible because all of its steps are transformers.
- We applied imp_ohe to the Sex column as well as Embarked. There are no missing values in the Sex column, so the imputation step won’t affect it, and it will simply get passed along to the one-hot encoding step.

By replacing ohe with imp_ohe, we have now solved the problem of missing values in the Embarked column. Thus, we can pass X to the ColumnTransformer’s fit_transform method, and it no longer throws an error.
As an aside, the output matrix is now much wider than before because the Name column of X contains a large number of unique words.
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
Now that we’ve solved our first problem, we’re going to move on to the second problem, which is the missing values in the Fare column. Recall that Fare has missing values in X_new but not in X, and thus our modeling Pipeline would error when making predictions for X_new if we don’t account for these missing values.
Our solution to this problem is to impute missing values for Fare. The ColumnTransformer already contains an imputer (imp) that does mean imputation on the Age column, so we’ll just apply it to Fare as well. This is actually all that’s required to solve our problem.
ct = make_column_transformer(
(imp_ohe, ['Embarked', 'Sex']),
(vect, 'Name'),
(imp, ['Age', 'Fare']),
('passthrough', ['Parch']))
Now, we’ll pass X to the fit_transform method of the ColumnTransformer. It will output the same number of columns as it did before, since Fare just moved from a passthrough column to a transformed column.
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
To be clear, the Fare column does not have any missing values in X, thus the imputer did not impute any values for Fare during the fit_transform. However, it did learn the mean of Fare in X, which is the imputation value that will be applied to Fare in X_new during prediction.
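If you want to verify this, the fitted SimpleImputer exposes what it learned through its statistics_ attribute. This quick check (not part of the original workflow) should show the means of Age and Fare in X, roughly 29.7 and 32.2:

ct.named_transformers_['simpleimputer'].statistics_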
Next, we’ll update our modeling Pipeline to include the revised ColumnTransformer, and fit it on X and y. You can see from the output that there’s now a transformer Pipeline within the ColumnTransformer, which is within the modeling Pipeline.
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
Finally, we’ll redefine X_new to use the full dataset, and then use the fitted Pipeline to make predictions for X_new. We know that we’ve solved our second problem because the Pipeline did not throw any errors during the predict step.
X_new = df_new[cols]
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
When we pass X to the ColumnTransformer’s fit_transform method, it outputs a matrix with 1518 columns. How can we find out the names of these columns?
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
Earlier in the book, we used the get_feature_names method for this purpose, which, as I mentioned previously, has been replaced by get_feature_names_out starting in scikit-learn 1.0. However, get_feature_names will only work if all of the underlying transformers have a get_feature_names method. In this case, it errors because neither Pipeline transformers nor SimpleImputer transformers have a get_feature_names method.
ct.get_feature_names()
AttributeError: Transformer pipeline does not provide get_feature_names
The good news is that starting in scikit-learn 1.1, the get_feature_names_out method is available for all transformers, which means that retrieving the feature names will no longer error.
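For reference, if you are on scikit-learn 1.1 or later, retrieving all of the names becomes a one-liner (not run here, since I’m using an older version):

ct.get_feature_names_out()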
In the meantime, our only solution for figuring out the column names is to inspect the transformers one-by-one.
When we print out the transformers_ attribute, we can see that there are 4 transformers.
ct.transformers_
[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer', CountVectorizer(), 'Name'),
('simpleimputer', SimpleImputer(), ['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])]
The first transformer is a Pipeline of SimpleImputer and OneHotEncoder. OneHotEncoder has a get_feature_names method, which we can access by selecting the pipeline transformer and then its onehotencoder step. get_feature_names outputs 6 features, which we know are the first 6 features in the matrix because this is the first transformer in the ColumnTransformer.
(ct.named_transformers_['pipeline']
.named_steps['onehotencoder']
.get_feature_names())
array(['x0_C', 'x0_Q', 'x0_S', 'x0_missing', 'x1_female', 'x1_male'],
dtype=object)
The second transformer is a CountVectorizer. It also has a get_feature_names method, which we can access by selecting the countvectorizer transformer. We could print out all of the feature names, but instead we’ll pass it to the len function, which indicates that the next 1509 features in the matrix came from CountVectorizer.
len(ct.named_transformers_['countvectorizer'].get_feature_names())
1509
The third transformer is a SimpleImputer, which doesn’t change the number of columns since we’re not adding a missing indicator, so we know that the next two features in the matrix are Age and Fare.
The fourth transformer is a passthrough transformer, which also doesn’t change the number of columns, so we know that the final feature in the matrix is Parch.
We’ve now accounted for all 1518 features: 6 from the Pipeline transformer, 1509 from the CountVectorizer, 2 from the SimpleImputer, and 1 from the passthrough transformer.
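As an aside, here’s a sketch of the contrast mentioned for the third transformer: if we had set SimpleImputer’s add_indicator parameter to True, it would append a binary missing-indicator column for each feature that contained missing values during the fit, which would change the column count:

imp_indicator = SimpleImputer(add_indicator=True)
imp_indicator.fit_transform(X[['Age', 'Fare']]).shape

The shape would be (891, 3): Age, Fare, plus an indicator column for Age only, since Fare has no missing values in X.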
Earlier in this chapter, since the Embarked column contained missing values and needed one-hot encoding, we created a two-step Pipeline called imp_ohe. The first step of this Pipeline is imputation of a constant value, and the second step is one-hot encoding.
imp_ohe = make_pipeline(imp_constant, ohe)
We included the imp_ohe Pipeline in the ColumnTransformer, and applied it to both the Embarked and Sex columns. Here’s what it would look like if the ColumnTransformer only contained the imp_ohe Pipeline.
ct = make_column_transformer(
(imp_ohe, ['Embarked', 'Sex']))
When you run the fit_transform method, Embarked turns into 4 columns and Sex turns into 2 columns, and the results are stacked side-by-side.
ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
[1., 0., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
...,
[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 1.]])
Because the Sex column didn’t contain any missing values and only needed one-hot encoding, we could have achieved the exact same result by applying imp_ohe to Embarked and separately applying ohe to Sex.
ct = make_column_transformer(
(imp_ohe, ['Embarked']),
(ohe, ['Sex']))
The fit_transform does indeed output the same results as above, though I personally prefer the first ColumnTransformer.
ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
[1., 0., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
...,
[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 1.]])
One common question is whether you can avoid using the imp_ohe Pipeline entirely by making a ColumnTransformer like this instead, in which the imputation of a constant value is applied to Embarked, and one-hot encoding is applied to both Embarked and Sex.
ct = make_column_transformer(
(imp_constant, ['Embarked']),
(ohe, ['Embarked', 'Sex']))
The answer is no, you cannot. The fit_transform method throws an error because Embarked contains missing values, and the ohe transformer is not able to handle missing values.
ct.fit_transform(X)
ValueError: Input contains NaN
The key insight here is that there’s no interaction between the transformers in a ColumnTransformer. In other words, there’s no flow of data from one transformer to the next, meaning the output of the imp_constant transformer does not become the input to the ohe transformer. Thus, the ohe transformer is operating on the original Embarked column, not a transformed Embarked column in which missing values have been imputed.
If that’s confusing, it might be useful to recall the key differences between a Pipeline and a ColumnTransformer:
- In a Pipeline, the output of one step becomes the input to the next step. This is precisely why we created the imp_ohe Pipeline: we needed the output of the imputer to become the input to the one-hot encoder.
- A ColumnTransformer does not have steps. Instead, it has transformers that operate in parallel, and the output of each transformer is stacked beside the other transformer outputs.

When imputing missing values for a categorical feature, you can either impute the most frequent value or a constant user-defined value. In this lesson, I’m going to discuss how you might choose between these two strategies.
Imputing a constant value essentially treats the missing values as a new category, which I believe is the better choice regardless of whether the values are missing at random or not at random. Imputing a constant value is especially important if the majority of values are missing, since imputing the most frequent value in that case would more than double the size of the category that was imputed, which would be quite misleading to the model.
That being said, imputing the most frequent value is much more acceptable when you have only a small number of missing values for a given feature, since the imputation won’t have much of an impact on the model anyway.
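For reference, here’s what the most-frequent strategy would look like for Embarked (a sketch; it’s not the approach we’re using):

imp_frequent = SimpleImputer(strategy='most_frequent')
imp_frequent.fit_transform(X[['Embarked']])

This would fill the two missing values with 'S', since Southampton is the most frequent port of embarkation.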
It’s important to note that if you impute a constant value for a categorical feature, and that feature has missing values in the new data but not the training data, then you’ll need to set the OneHotEncoder’s handle_unknown parameter to 'ignore'. That’s because the missing values category won’t be learned during the OneHotEncoder’s fit step, and thus unknown values seen during the transform step need to be ignored in order to avoid an error.
The alternative here is to impute the most frequent value for that feature, in which case you can leave the handle_unknown parameter set to its default value of 'error'.
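Here’s a sketch of how you would modify imp_ohe for the first scenario (the object names are just for illustration):

ohe_ignore = OneHotEncoder(handle_unknown='ignore')
imp_ohe_ignore = make_pipeline(imp_constant, ohe_ignore)

With handle_unknown='ignore', any category seen only during the transform step (such as 'missing') is encoded as a row of all zeros rather than raising an error.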
Here’s the Pipeline that we’ve built throughout the book. The strategy I’ve used so far is to include all data transformations within a single ColumnTransformer, including any missing value imputation, and then use that ColumnTransformer as the first step in a two-step Pipeline.
pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
However, an alternative approach would be to create a three-step Pipeline in which the first step is missing value imputation, the second step includes all other data transformations, and the third step is the model. Let’s try it out to see if this is a better approach.
This would be the first ColumnTransformer, which only does missing value imputation. Constant value imputation is applied to Embarked, mean imputation is applied to Age and Fare, and the other columns are passed through because they don’t contain any missing values in the training or new data. It would be the first step in the Pipeline.
ct1 = make_column_transformer(
(imp_constant, ['Embarked']),
(imp, ['Age', 'Fare']),
('passthrough', ['Sex', 'Name', 'Parch']))
This would be the second ColumnTransformer, which handles all other data transformations. It would be the second step in the Pipeline, and thus it would operate on the output of the first ColumnTransformer. However, a ColumnTransformer outputs a NumPy array, not a DataFrame, and thus in this ColumnTransformer we would have to reference the columns by position instead of by name.
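As an aside, if you want to verify the column order that comes out of ct1, one option (not shown in the original code) is to fit_transform it and inspect a few rows:

ct1.fit_transform(X)[:3]

The columns appear in the order Embarked, Age, Fare, Sex, Name, Parch, since a ColumnTransformer stacks its transformer outputs from left to right.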
We know the order of the columns from the first ColumnTransformer, and thus Embarked and Sex would be columns 0 and 3 and Name would be column 4. Embarked and Sex are one-hot encoded, Name is vectorized, and the other columns are passed through.
ct2 = make_column_transformer(
(ohe, [0, 3]),
(vect, 4),
('passthrough', [1, 2, 5]))
Now that we’ve created the ColumnTransformers, we can include them in a three-step Pipeline and fit the Pipeline to X and y.
pipe = make_pipeline(ct1, ct2, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer-1',
                 ColumnTransformer(transformers=[('simpleimputer-1',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant'),
                                                  ['Embarked']),
                                                 ('simpleimputer-2',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Sex', 'Name', 'Parch'])])),
                ('columntransformer-2',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(), [0, 3]),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 4),
                                                 ('passthrough', 'passthrough',
                                                  [1, 2, 5])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
Finally, we can use this three-step Pipeline to make predictions, and it does indeed make the same predictions as our original two-step Pipeline.
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
Using a three-step Pipeline like this is certainly a valid approach. However, I find the original two-step Pipeline easier to write and to read, and thus I prefer the two-step approach.
The rules for Pipelines are that all steps other than the final step must be a transformer, and the final step can be a model or a transformer.
If a Pipeline ends in a model, such as our pipe object, you can use the Pipeline’s fit and predict methods:
- When you call the fit method, all steps before the final one run fit_transform, and the final step runs fit.
- When you call the predict method, all steps before the final one run transform, and the final step runs predict.

If a Pipeline ends in a transformer, such as our imp_ohe object, you generally use the Pipeline’s fit_transform and transform methods, but you can also use the fit method:

- When you call the fit_transform method, all steps run fit_transform.
- When you call the transform method, all steps run transform.
- When you call the fit method, all steps before the final one run fit_transform, and the final step runs fit.

Although this is a lot of information to take in, developing this level of understanding will definitely make it easier for you to test and debug your future Pipelines.
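To see these rules in action, here are two equivalent ways to run our transformer-only imp_ohe Pipeline (illustrative code, using the objects defined earlier in the chapter):

out1 = imp_ohe.fit_transform(X[['Embarked', 'Sex']])  # all steps run fit_transform

imp_ohe.fit(X[['Embarked', 'Sex']])               # first step runs fit_transform, final step runs fit
out2 = imp_ohe.transform(X[['Embarked', 'Sex']])  # all steps run transform

Both out1 and out2 contain the same 891x6 sparse matrix of one-hot encoded columns.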