8  Fixing common workflow problems

8.1 Two new problems

Up to now, we’ve only been working with the first 10 rows of the Titanic dataset to make it easy to examine the input and output of each workflow step. In this chapter, we’ll begin using the full Titanic dataset. This will create a few new problems that are common with real datasets, and we’ll figure out how to handle those problems appropriately.

We’ll start by reading the training data into df and reading the new data into df_new, overwriting the existing objects.

When examining the shapes, you can see that df_new has one fewer column than df because it doesn’t contain the target column, Survived.

df = pd.read_csv('http://bit.ly/MLtrain')
df.shape
(891, 11)
df_new = pd.read_csv('http://bit.ly/MLnewdata')
df_new.shape
(418, 10)

We’ll check for missing values in these two DataFrames by chaining together the isna and sum methods. The results tell us how many missing values are present in each column.

This reveals two problems we’ll have to handle that weren’t present in our 10-row datasets. First, Embarked contains missing values in df, and second, Fare contains a missing value in df_new. We’ll spend the rest of this chapter addressing those two problems.

Note that we don’t have to worry about missing values in Cabin because we’re not yet using that as a feature, and our existing workflow already accounts for the missing values in Age.

df.isna().sum()
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
df_new.isna().sum()
Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64

Features with missing values:

  • Problematic:
    • Embarked: Missing values in df
    • Fare: Missing value in df_new
  • Not problematic:
    • Cabin: Not currently using
    • Age: Already being imputed

8.2 Problem 1: Missing values in a categorical feature

In this lesson, we’re going to figure out how to handle the missing values in the Embarked column.

We’ll start with a reminder of the six feature columns we’re using.

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

We’ll redefine X and y to use the full dataset rather than the 10-row dataset.

X = df[cols]
y = df['Survived']

And here’s a reminder of the ColumnTransformer we created in the previous chapter.

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    ('passthrough', ['Parch', 'Fare']))

Normally we would pass X to the fit_transform method, but in this case it will error because the Embarked column contains missing values. Our solution will be to impute missing values for Embarked before one-hot encoding it.

ct.fit_transform(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ct.fit_transform(X)

   (...)

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/utils/validation.py:104, in _assert_all_finite(X, allow_nan, msg_dtype)
    102 elif X.dtype == np.dtype('object') and not allow_nan:
    103     if _object_dtype_isnan(X).any():
--> 104         raise ValueError("Input contains NaN")

ValueError: Input contains NaN

As an aside, starting in scikit-learn version 0.24, OneHotEncoder will automatically handle missing values by treating them as a new category. I’m using version 0.23, so I’ll write the code to manually treat missing values as a new category. Even if you’re using version 0.24 or later, I still recommend following my code, because what I’m about to teach you will enable you to solve other similar problems that scikit-learn does not automatically handle.

How OneHotEncoder handles missing values:

  • Before version 0.24: Errors if the input contains missing values
  • Starting in version 0.24: Treats missing values as a new category
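
If you happen to be using version 0.24 or later, a quick sketch like the following should work without any imputation step (ohe_024 is a hypothetical name, and this is just for illustration, not the approach we’ll use):

# Sketch only: requires scikit-learn 0.24 or later, in which
# OneHotEncoder treats NaN as a category of its own
ohe_024 = OneHotEncoder()
ohe_024.fit_transform(X[['Embarked']])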

As I was saying, our solution is to impute missing values for Embarked and then one-hot encode it.

The first step of this solution is to create a new instance of SimpleImputer, which we’ll call imp_constant. For categorical features, you can either impute the most frequent value or a constant user-defined value. We’ll choose the latter by setting the strategy parameter to constant, and the constant value we’ll impute is the string “missing”.

Imputation strategies for categorical features:

  • Most frequent value
  • User-defined value

imp_constant = SimpleImputer(strategy='constant', fill_value='missing')

Next, we’ll create a two-step Pipeline that only contains transformers. The first step is imputation using our new imputer, and the second step is one-hot encoding. We’ll call this Pipeline imp_ohe to remind us of the two steps it contains.

imp_ohe = make_pipeline(imp_constant, ohe)

We can test out the imp_ohe Pipeline by passing the Embarked column to its fit_transform method. It outputs four columns because missing values are essentially being treated as a fourth category in addition to C, Q, and S.

imp_ohe.fit_transform(X[['Embarked']])
<891x4 sparse matrix of type '<class 'numpy.float64'>'
    with 891 stored elements in Compressed Sparse Row format>

We can confirm this by accessing the second step of the Pipeline, which is the OneHotEncoder, and then examining the categories_ attribute.

imp_ohe[1].categories_
[array(['C', 'Q', 'S', 'missing'], dtype=object)]
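
If you’d rather see the actual values than the sparse matrix summary, you can densify a few rows of the output just for inspection (a sketch; avoid doing this on very wide matrices, since dense arrays can use a lot of memory):

imp_ohe.fit_transform(X[['Embarked']]).toarray()[:3]
array([[0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.]])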

In case it helps you to understand the imp_ohe Pipeline better, I’m going to show you what happens “under the hood” when you fit_transform this Pipeline. To be clear, you should not actually write the following code; rather, it’s just for teaching purposes.

First, the imp_constant object imputes a string value of “missing” for any missing values in the Embarked column. Then, the output of the imputer is one-hot encoded by the ohe object, which outputs four columns.

ohe.fit_transform(imp_constant.fit_transform(X[['Embarked']]))
<891x4 sparse matrix of type '<class 'numpy.float64'>'
    with 891 stored elements in Compressed Sparse Row format>

Now that we’ve created a transformer-only Pipeline to handle the missing values in Embarked, we’ll simply replace the ohe transformer in our ColumnTransformer with the imp_ohe Pipeline.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    ('passthrough', ['Parch', 'Fare']))

There are two things I want to note about the imp_ohe Pipeline:

  • First, you’re only allowed to include transformer objects in a ColumnTransformer, but the imp_ohe Pipeline is eligible because all of its steps are transformers.
  • Second, it’s completely fine to apply the imp_ohe Pipeline to the Sex column as well as Embarked. There are no missing values in the Sex column, so the imputation step won’t affect it, and it will simply get passed along to the one-hot encoding step.

Notes about the imp_ohe Pipeline:

  • Treated like a transformer because all of its steps are transformers
  • Imputation step won’t affect the Sex column
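
If you want to confirm the second point yourself, here’s a quick sketch: fitting imp_ohe on the Sex column alone learns only the two original categories, which tells us that the constant value was never actually imputed.

imp_ohe.fit(X[['Sex']])
imp_ohe[1].categories_
[array(['female', 'male'], dtype=object)]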

By replacing ohe with imp_ohe, we have now solved the problem of missing values in the Embarked column. Thus, we can pass X to the ColumnTransformer’s fit_transform method, and it will not throw an error.

As an aside, the output matrix is now much wider than before because the Name column of X contains a large number of unique words.

ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
    with 7328 stored elements in Compressed Sparse Row format>

8.3 Problem 2: Missing values in the new data

Now that we’ve solved our first problem, we’re going to move on to the second problem: the missing value in the Fare column. Recall that Fare has a missing value in X_new but not in X, and thus our modeling Pipeline would error when making predictions for X_new if we didn’t account for it.

Our solution to this problem is to impute missing values for Fare. The ColumnTransformer already contains an imputer that does mean imputation, so we’ll apply the existing imputer to the Fare column, whereas previously Fare was a passthrough column. This is actually all that is required to solve our problem.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Parch']))

Now, we’ll pass X to the fit_transform method of the ColumnTransformer. It will output the same number of columns as it did before, since Fare just moved from a passthrough column to a transformed column.

ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
    with 7328 stored elements in Compressed Sparse Row format>

To be clear, the Fare column does not have any missing values in X, thus the imputer did not impute any values for Fare during the fit_transform. However, it did learn the mean of Fare in X, which is the imputation value that will be applied to the Fare column of X_new during prediction.

What will be imputed for Fare?

  • X: No missing Fare values, thus no imputation of Fare
  • X_new: Missing Fare value, thus impute the mean of Fare in X during prediction
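
If you’re curious about the exact values the imputer learned, here’s a sketch of how you could inspect them after the fit_transform. The statistics_ attribute of a fitted SimpleImputer stores one learned value per column:

# Learned means of Age and Fare in X (roughly 29.7 and 32.2)
ct.named_transformers_['simpleimputer'].statistics_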

Next, we’ll update our modeling Pipeline to include the revised ColumnTransformer, and fit it on X and y. You can see from the output that there’s now a transformer Pipeline within the ColumnTransformer, which is within the modeling Pipeline.

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

Finally, we’ll redefine X_new to use the full dataset, and then use the fitted Pipeline to make predictions for X_new. We know that we’ve solved our second problem because the Pipeline did not throw any errors during the predict step.

X_new = df_new[cols]
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

8.4 Q&A: How do I see the feature names output by the ColumnTransformer?

When we pass X to the ColumnTransformer’s fit_transform method, it outputs a matrix with 1518 columns. How can we find out the names of these columns?

ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
    with 7328 stored elements in Compressed Sparse Row format>

Earlier in the book, we used the get_feature_names method for this purpose, which, as I mentioned previously, will be replaced by get_feature_names_out starting in scikit-learn 1.0. However, get_feature_names will only work if all of the underlying transformers have a get_feature_names method. In this case, it errors because neither Pipeline transformers nor SimpleImputer transformers have a get_feature_names method.

ct.get_feature_names()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[22], line 1
----> 1 ct.get_feature_names()

File /opt/miniconda3/envs/mlbook/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py:371, in ColumnTransformer.get_feature_names(self)
    369         continue
    370     if not hasattr(trans, 'get_feature_names'):
--> 371         raise AttributeError("Transformer %s (type %s) does not "
    372                              "provide get_feature_names."
    373                              % (str(name), type(trans).__name__))
    374     feature_names.extend([name + "__" + f for f in
    375                           trans.get_feature_names()])
    376 return feature_names

AttributeError: Transformer pipeline (type Pipeline) does not provide get_feature_names.

The good news is that starting in scikit-learn 1.1, the get_feature_names_out method will be available for all transformers, which means that retrieving the feature names will no longer error.

Changes to get_feature_names:

  • Starting in version 1.0: get_feature_names replaced with get_feature_names_out
  • Starting in version 1.1: get_feature_names_out available for all transformers
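
In other words, if you’re using scikit-learn 1.1 or later, a single method call should return all 1518 feature names, including the ones that come from the inner Pipeline (a sketch, assuming version 1.1 or later):

# Sketch only: requires scikit-learn 1.1 or later
ct.get_feature_names_out()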

In the meantime, our only solution for figuring out the column names is to inspect the transformers one by one.

When we print out the transformers_ attribute, we can see that there are 4 transformers.

ct.transformers_
[('pipeline',
  Pipeline(steps=[('simpleimputer',
                   SimpleImputer(fill_value='missing', strategy='constant')),
                  ('onehotencoder', OneHotEncoder())]),
  ['Embarked', 'Sex']),
 ('countvectorizer', CountVectorizer(), 'Name'),
 ('simpleimputer', SimpleImputer(), ['Age', 'Fare']),
 ('passthrough', 'passthrough', ['Parch'])]

The first transformer is a Pipeline of SimpleImputer and OneHotEncoder. OneHotEncoder has a get_feature_names method, which we can access by selecting the “pipeline” transformer and then its “onehotencoder” step. get_feature_names outputs 6 features, which we know are the first 6 features in the matrix because this is the first transformer in the ColumnTransformer.

ct.named_transformers_['pipeline'].named_steps['onehotencoder'].get_feature_names()
array(['x0_C', 'x0_Q', 'x0_S', 'x0_missing', 'x1_female', 'x1_male'],
      dtype=object)

The second transformer is a CountVectorizer. It also has a get_feature_names method, which we can access by selecting the “countvectorizer” transformer. We could print out all of the feature names, but instead we’ll pass it to the len function, which indicates that the next 1509 features in the matrix came from CountVectorizer.

len(ct.named_transformers_['countvectorizer'].get_feature_names())
1509

The third transformer is a SimpleImputer, which doesn’t change the number of columns since we’re not adding a missing indicator, so we know that the next two features in the matrix are Age and Fare.

The fourth transformer is a passthrough transformer, which also doesn’t change the number of columns, so we know that the final feature in the matrix is Parch.

We’ve now accounted for all 1518 features: 6 from the Pipeline transformer, 1509 from the CountVectorizer, 2 from the SimpleImputer, and 1 from the passthrough transformer.

Features output by each transformer:

  • Pipeline: 6 features (Embarked and Sex)
  • CountVectorizer: 1509 features (Name)
  • SimpleImputer: 2 features (Age and Fare)
  • passthrough: 1 feature (Parch)
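
To tie this together, here’s a sketch of how you could manually assemble the full list of 1518 feature names from the methods above. The last three names are written by hand, since SimpleImputer and passthrough don’t rename their columns:

pipeline_names = list(ct.named_transformers_['pipeline']
                      .named_steps['onehotencoder'].get_feature_names())
vect_names = list(ct.named_transformers_['countvectorizer'].get_feature_names())
all_names = pipeline_names + vect_names + ['Age', 'Fare', 'Parch']
len(all_names)
1518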

8.5 Q&A: Why did we create a Pipeline inside of the ColumnTransformer?

Earlier in this chapter, since the Embarked column contained missing values and needed one-hot encoding, we created a two-step Pipeline called imp_ohe. The first step of this Pipeline is imputation of a constant value, and the second step is one-hot encoding.

imp_ohe = make_pipeline(imp_constant, ohe)

We included the imp_ohe Pipeline in the ColumnTransformer, and applied it to both the Embarked and Sex columns. Here’s what it would look like if the ColumnTransformer only contained the imp_ohe Pipeline.

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']))

When you run the fit_transform method, Embarked turns into 4 columns and Sex turns into 2 columns, and the results are stacked side-by-side.

ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0.],
       ...,
       [0., 0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.]])

Because the Sex column didn’t contain any missing values and only needed one-hot encoding, we could have achieved the exact same results by applying imp_ohe just to Embarked and then separately applying ohe to Sex.

ct = make_column_transformer(
    (imp_ohe, ['Embarked']),
    (ohe, ['Sex']))

The fit_transform does indeed output the same results as above, though I personally prefer the first ColumnTransformer.

ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0.],
       ...,
       [0., 0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.]])
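
If you want to verify that equivalence programmatically, here’s a sketch; ct_a and ct_b are hypothetical names created just for this comparison:

import numpy as np
ct_a = make_column_transformer((imp_ohe, ['Embarked', 'Sex']))
ct_b = make_column_transformer((imp_ohe, ['Embarked']), (ohe, ['Sex']))
np.allclose(ct_a.fit_transform(X), ct_b.fit_transform(X))
True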

One common question is whether you can avoid using the imp_ohe Pipeline entirely by making a ColumnTransformer like this instead, in which the imputation of a constant value is applied to Embarked, and one-hot encoding is applied to both Embarked and Sex.

ct = make_column_transformer(
    (imp_constant, ['Embarked']),
    (ohe, ['Embarked', 'Sex']))

The answer is no, you cannot. The fit_transform method throws an error because Embarked contains missing values, and the ohe transformer is not able to handle missing values.

ct.fit_transform(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[32], line 1
----> 1 ct.fit_transform(X)

   (...)

ValueError: Input contains NaN

The key insight here is that there’s no interaction between the transformers of a ColumnTransformer. In other words, there’s no flow of data from one transformer to the next, meaning the output of the imp_constant transformer does not become the input to the ohe transformer. Thus, the ohe transformer is operating on the original Embarked column, not a transformed Embarked column in which missing values have been imputed.

If that’s confusing, it might be useful to recall the key differences between a Pipeline and a ColumnTransformer:

  • In a Pipeline, the output of one step becomes the input to the next step. This is precisely why we created the imp_ohe Pipeline: We needed the output of the imputer to become the input to the one-hot encoder.
  • In contrast, a ColumnTransformer does not have steps. Instead, it has transformers that operate in parallel, and the output of each transformer is stacked beside the other transformer outputs.

Pipeline vs ColumnTransformer:

  • Pipeline:
    • Output of one step becomes the input to the next step
    • imp_ohe: Output of imp_constant becomes the input to ohe
  • ColumnTransformer:
    • Transformers operate in parallel
    • ct: Output of each transformer is stacked beside the other transformer outputs

8.6 Q&A: Which imputation strategy should I use with categorical features?

When imputing missing values for a categorical feature, you can either impute the most frequent value or a constant user-defined value. In this lesson, I’m going to discuss how you might choose between these two strategies.

Imputing a constant value essentially treats the missing values as a new category, which I believe is the better choice regardless of whether the values are missing at random or not at random. Imputing a constant value is especially important if the majority of values are missing: in that case, imputing the most frequent value would more than double the size of the imputed category, which would be quite misleading to the model.

That being said, imputing the most frequent value is much more acceptable when you have only a small number of missing values for a given feature, since the imputation won’t have much of an impact on the model anyway.

Imputation strategies for categorical features:

  • Constant user-defined value:
    • Treats missing values as a new category (recommended)
    • Important if the majority of values are missing
  • Most frequent value:
    • Acceptable if only a small number of values are missing
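
For completeness, here’s a sketch of what the most frequent value strategy would look like for Embarked (imp_frequent is a hypothetical name, since this isn’t the approach we chose):

# The two missing Embarked values would become 'S', the most frequent port
imp_frequent = SimpleImputer(strategy='most_frequent')
imp_frequent.fit_transform(X[['Embarked']])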

It’s important to note that if you impute a constant value for a feature, and that feature has missing values in the new data but not the training data, then you’ll need to set the OneHotEncoder’s handle_unknown parameter to “ignore”. That’s because the missing values category won’t be learned during the OneHotEncoder’s fit step, and thus unknown values seen during the transform step need to be ignored in order to avoid an error.

The alternative here is just to impute the most frequent value for that feature, in which case you can leave the handle_unknown parameter set to its default value of “error”.

Possible problem with imputing a constant value:

  • Condition: The feature only has missing values in the new data
  • Solution: Set handle_unknown to ‘ignore’ for the OneHotEncoder
  • Alternative: Impute the most frequent value, and leave handle_unknown set to ‘error’
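
Here’s a sketch of what the first option would look like; ohe_ignore and imp_ohe_ignore are hypothetical names, since this isn’t a change our current dataset requires:

# Categories not seen during fit (such as 'missing') will be encoded
# as all zeros during transform, instead of raising an error
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
imp_ohe_ignore = make_pipeline(imp_constant, ohe_ignore)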

8.7 Q&A: Should I impute missing values before all other transformations?

Here’s the Pipeline that we’ve built over the course of the book. My strategy has been to include all data transformations (including any missing value imputation) within a single ColumnTransformer, and then use that ColumnTransformer as the first step in a two-step Pipeline.

pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

However, an alternative approach would be to create a three-step Pipeline in which the first step is missing value imputation, the second step includes all other data transformations, and the third step is the model. Let’s try it out to see if this is a better approach.

Impute missing values as a first step?

  • Current Pipeline:
    • Step 1: All data transformations
    • Step 2: Model
  • Alternative Pipeline:
    • Step 1: Missing value imputation
    • Step 2: All other data transformations
    • Step 3: Model

This would be the first ColumnTransformer, which only does missing value imputation. Constant value imputation is applied to Embarked, mean imputation is applied to Age and Fare, and the other columns are passed through because they don’t contain any missing values in the training or new data. It would be the first step in the Pipeline.

ct1 = make_column_transformer(
    (imp_constant, ['Embarked']),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Sex', 'Name', 'Parch']))

This would be the second ColumnTransformer, which handles all other data transformations. It would be the second step in the Pipeline, and thus it would operate on the output of the first ColumnTransformer. However, a ColumnTransformer outputs a NumPy array, not a DataFrame, and thus in this ColumnTransformer we would have to reference the columns by position instead of by name.

We know the order of the columns output by the first ColumnTransformer: Embarked would be column 0, Sex would be column 3, and Name would be column 4. Embarked and Sex are one-hot encoded, Name is vectorized, and the other columns are passed through.

ct2 = make_column_transformer(
    (ohe, [0, 3]),
    (vect, 4),
    ('passthrough', [1, 2, 5]))
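
If you ever lose track of the column order, a quick sketch like this lets you sanity-check the output of the first ColumnTransformer before writing the second one:

# Columns are output in transformer order:
# Embarked, Age, Fare, Sex, Name, Parch
ct1.fit_transform(X)[:2]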

Now that we’ve created the ColumnTransformers, we can include them in a three-step Pipeline and fit the Pipeline to X and y.

pipe = make_pipeline(ct1, ct2, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer-1',
                 ColumnTransformer(transformers=[('simpleimputer-1',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant'),
                                                  ['Embarked']),
                                                 ('simpleimputer-2',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Sex', 'Name', 'Parch'])])),
                ('columntransformer-2',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(), [0, 3]),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 4),
                                                 ('passthrough', 'passthrough',
                                                  [1, 2, 5])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])

Finally, we can use this three-step Pipeline to make predictions, and it does indeed make the same predictions as our original two-step Pipeline.

pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

Using a three-step Pipeline like this is certainly a valid approach. However, I find the original two-step Pipeline easier to write and to read, and thus I prefer the two-step approach.

8.8 Q&A: What methods can I use with a Pipeline?

The rules for Pipelines are that all Pipeline steps other than the final step must be a transformer, and the final step can be a model or a transformer.

Rules for Pipeline steps:

  • All steps other than the final step must be a transformer
  • Final step can be a model or a transformer

If a Pipeline ends in a model, such as our “pipe” object, you can use the Pipeline’s fit and predict methods:

  • If you run the fit method, all steps before the final one run fit_transform, and the final step runs fit.
  • If you run the predict method, all steps before the final one run transform, and the final step runs predict.

Pipeline ends in a model:

  • fit:
    • All steps before the final step run fit_transform
    • Final step runs fit
  • predict:
    • All steps before the final step run transform
    • Final step runs predict
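
Similar to the “under the hood” teaching example from earlier in this chapter, here’s a sketch of what those two methods are doing, assuming ct and pipe refer to the two-step versions we built in section 8.3 (again, you wouldn’t actually write this code):

# Roughly equivalent to pipe.fit(X, y): the ColumnTransformer runs
# fit_transform, and the model runs fit on its output
logreg.fit(ct.fit_transform(X), y)
# Roughly equivalent to pipe.predict(X_new): the ColumnTransformer runs
# transform, and the model runs predict on its output
logreg.predict(ct.transform(X_new))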

If a Pipeline ends in a transformer, such as our “imp_ohe” object, you generally use the Pipeline’s fit_transform and transform methods, but you can also use the fit method:

  • If you run the fit_transform method, all steps run fit_transform.
  • If you run the transform method, all steps run transform.
  • If you run the fit method, all steps before the final one run fit_transform, and the final step runs fit.

Pipeline ends in a transformer:

  • fit_transform:
    • All steps run fit_transform
  • transform:
    • All steps run transform
  • fit:
    • All steps before the final step run fit_transform
    • Final step runs fit
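
And here’s a sketch using our imp_ohe object, assuming X and X_new are defined as above:

# fit_transform learns the categories from X and encodes X
imp_ohe.fit_transform(X[['Embarked']])
# transform encodes X_new using the categories learned from X
imp_ohe.transform(X_new[['Embarked']])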

Although this is a lot of information to take in, developing this level of understanding will definitely make it easier for you to test and debug your future Pipelines.