Up to now, we’ve only been working with the first 10 rows of the Titanic dataset to make it easy to examine the input and output of each workflow step. In this chapter, we’ll begin using the full Titanic dataset. This will create a few new problems that are common with real datasets, and we’ll figure out how to handle those problems appropriately.
We’ll start by reading the training data into df and reading the new data into df_new, overwriting the existing objects.
When examining the shapes, you can see that df_new has one less column than df because it doesn’t contain the target column of Survived.
df = pd.read_csv('http://bit.ly/MLtrain')
df.shape
(891, 11)
df_new = pd.read_csv('http://bit.ly/MLnewdata')
df_new.shape
(418, 10)
We’ll check for missing values in these two DataFrames by chaining together the isna and sum methods. The results tell us how many missing values are present in each column.
This reveals two problems we’ll have to handle that weren’t present in our 10-row datasets. First, Embarked contains missing values in df, and second, Fare contains a missing value in df_new. We’ll spend the rest of this chapter addressing those two problems.
Note that we don’t have to worry about missing values in Cabin because we’re not yet using that as a feature, and our existing workflow already accounts for the missing values in Age.
df.isna().sum()
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
df_new.isna().sum()
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
In this lesson, we’re going to figure out how to handle the missing values in the Embarked column.
We’ll start with a reminder of the six feature columns we’re using.
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
We’ll redefine X and y to use the full dataset rather than the 10-row dataset.
X = df[cols]
y = df['Survived']
And here’s a reminder of the ColumnTransformer we created in the previous chapter.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    ('passthrough', ['Parch', 'Fare']))
Normally we would pass X to the fit_transform method, but in this case it will error because the Embarked column contains missing values. Our solution will be to impute missing values for Embarked before one-hot encoding it.
ct.fit_transform(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ct.fit_transform(X)
...
ValueError: Input contains NaN
As an aside, OneHotEncoder will automatically handle missing values by treating them as a new category starting in scikit-learn version 0.24. I’m using version 0.23, and so I’ll be writing the code to manually treat missing values as a new category. Even if you’re using version 0.24 or later, I still recommend following my code because what I’m about to teach you will enable you to solve other similar problems that scikit-learn does not automatically handle.
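In case you’d like to see that automatic behavior for yourself, here’s a minimal sketch, assuming scikit-learn 0.24 or later (ohe_auto is a hypothetical name that we won’t use elsewhere):
# requires scikit-learn 0.24 or later, in which OneHotEncoder
# treats missing values as their own category
ohe_auto = OneHotEncoder()
ohe_auto.fit(X[['Embarked']])
ohe_auto.categories_  # NaN should appear as a category alongside C, Q, and S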
As I was saying, our solution is to impute missing values for Embarked and then one-hot encode it.
The first step of this solution is to create a new instance of SimpleImputer, which we’ll call imp_constant. For categorical features, you can either impute the most frequent value or a constant user-defined value. We’ll choose the latter by setting the strategy parameter to constant, and the constant value we’ll impute is the string “missing”.
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
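For comparison, here’s what the other strategy would look like (imp_frequent is a hypothetical name that we won’t use in this chapter):
# imputes the most frequent value instead of a constant value
imp_frequent = SimpleImputer(strategy='most_frequent')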
Next, we’ll create a two-step Pipeline that only contains transformers. The first step is imputation using our new imputer, and the second step is one-hot encoding. We’ll call this Pipeline imp_ohe to remind us of the two steps it contains.
imp_ohe = make_pipeline(imp_constant, ohe)
We can test out the imp_ohe Pipeline by passing the Embarked column to its fit_transform method. It outputs four columns because missing values are essentially being treated as a fourth category in addition to C, Q, and S.
imp_ohe.fit_transform(X[['Embarked']])
<891x4 sparse matrix of type '<class 'numpy.float64'>'
with 891 stored elements in Compressed Sparse Row format>
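If you’d like to inspect the encoded values, one option is to convert the sparse output to a dense array, which is reasonable here because the matrix is small:
# each row contains a single 1 marking that passenger's category
imp_ohe.fit_transform(X[['Embarked']]).toarray()[:5]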
We can confirm this by accessing the second step of the Pipeline, which is the OneHotEncoder, and then examining the categories_ attribute.
imp_ohe[1].categories_
[array(['C', 'Q', 'S', 'missing'], dtype=object)]
In case it helps you to understand the imp_ohe Pipeline better, I’m going to show you what happens “under the hood” when you fit_transform this Pipeline. To be clear, you should not actually write the following code; rather, it’s just for teaching purposes.
First, the imp_constant object imputes a string value of “missing” for any missing values in the Embarked column. Then, the output of the imputer is one-hot encoded by the ohe object, which outputs four columns.
ohe.fit_transform(imp_constant.fit_transform(X[['Embarked']]))
<891x4 sparse matrix of type '<class 'numpy.float64'>'
with 891 stored elements in Compressed Sparse Row format>
Now that we’ve created a transformer-only Pipeline to handle the missing values in Embarked, we’ll simply replace the ohe transformer in our ColumnTransformer with the imp_ohe Pipeline.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    ('passthrough', ['Parch', 'Fare']))
By replacing ohe with imp_ohe, we have now solved the problem of missing values in the Embarked column. Thus, we can pass X to the ColumnTransformer’s fit_transform method, and it will not throw an error.
As an aside, the output matrix is now much wider than before because the Name column of X contains a large number of unique words.
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
Now that we’ve solved our first problem, we’re going to move on to the second problem, which is the missing values in the Fare column. Recall that Fare has missing values in X_new but not in X, and thus our modeling Pipeline would error when making predictions for X_new if we don’t account for these missing values.
Our solution to this problem is to impute missing values for Fare. The ColumnTransformer already contains an imputer that does mean imputation, so we’ll apply the existing imputer to the Fare column, whereas previously Fare was a passthrough column. This is actually all that is required to solve our problem.
= make_column_transformer(
ct 'Embarked', 'Sex']),
(imp_ohe, ['Name'),
(vect, 'Age', 'Fare']),
(imp, ['passthrough', ['Parch'])) (
Now, we’ll pass X to the fit_transform method of the ColumnTransformer. It will output the same number of columns as it did before, since Fare just moved from a passthrough column to a transformed column.
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
To be clear, the Fare column does not have any missing values in X, thus the imputer did not impute any values for Fare during the fit_transform. However, it did learn the mean of Fare in X, which is the imputation value that will be applied to the Fare column of X_new during prediction.
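If you’d like to confirm what the imputer learned, here’s a sketch that inspects its statistics_ attribute via the fitted ColumnTransformer (the values in the comment are approximate):
# should contain the learned means of Age and Fare in X
# (roughly 29.7 and 32.2)
ct.named_transformers_['simpleimputer'].statistics_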
Next, we’ll update our modeling Pipeline to include the revised ColumnTransformer, and fit it on X and y. You can see from the fitted Pipeline’s output that there’s now a transformer Pipeline within the ColumnTransformer, which is within the modeling Pipeline.
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
Finally, we’ll redefine X_new to use the full dataset, and then use the fitted Pipeline to make predictions for X_new. We know that we’ve solved our second problem because the Pipeline did not throw any errors during the predict step.
X_new = df_new[cols]
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
When we pass X to the ColumnTransformer’s fit_transform method, it outputs a matrix with 1518 columns. How can we find out the names of these columns?
ct.fit_transform(X)
<891x1518 sparse matrix of type '<class 'numpy.float64'>'
with 7328 stored elements in Compressed Sparse Row format>
Earlier in the book, we used the get_feature_names method for this purpose, which, as I mentioned previously, will be replaced by get_feature_names_out starting in scikit-learn 1.0. However, get_feature_names will only work if all of the underlying transformers have a get_feature_names method. In this case, it errors because neither Pipeline transformers nor SimpleImputer transformers have a get_feature_names method.
ct.get_feature_names()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[22], line 1
----> 1 ct.get_feature_names()
...
AttributeError: Transformer pipeline (type Pipeline) does not provide get_feature_names.
The good news is that starting in scikit-learn 1.1, the get_feature_names_out method will be available for all transformers, which means that retrieving the feature names will no longer error.
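Here’s a minimal sketch of what that will look like, assuming you’re running scikit-learn 1.1 or later (it won’t run in the version I’m using):
# should return an array of all 1518 feature names
ct.get_feature_names_out()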
In the meantime, our only solution for figuring out the column names is to inspect the transformers one-by-one.
When we print out the transformers_ attribute, we can see that there are 4 transformers.
ct.transformers_
[('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing', strategy='constant')),
('onehotencoder', OneHotEncoder())]),
['Embarked', 'Sex']),
('countvectorizer', CountVectorizer(), 'Name'),
('simpleimputer', SimpleImputer(), ['Age', 'Fare']),
('passthrough', 'passthrough', ['Parch'])]
The first transformer is a Pipeline of SimpleImputer and OneHotEncoder. OneHotEncoder has a get_feature_names method, which we can access by selecting the “pipeline” transformer and then its “onehotencoder” step. get_feature_names outputs 6 features, which we know are the first 6 features in the matrix because this is the first transformer in the ColumnTransformer.
ct.named_transformers_['pipeline'].named_steps['onehotencoder'].get_feature_names()
array(['x0_C', 'x0_Q', 'x0_S', 'x0_missing', 'x1_female', 'x1_male'],
dtype=object)
The second transformer is a CountVectorizer. It also has a get_feature_names method, which we can access by selecting the “countvectorizer” transformer. We could print out all of the feature names, but instead we’ll pass it to the len function, which indicates that the next 1509 features in the matrix came from CountVectorizer.
len(ct.named_transformers_['countvectorizer'].get_feature_names())
1509
The third transformer is a SimpleImputer, which doesn’t change the number of columns since we’re not adding a missing indicator, so we know that the next two features in the matrix are Age and Fare.
The fourth transformer is a passthrough transformer, which also doesn’t change the number of columns, so we know that the final feature in the matrix is Parch.
We’ve now accounted for all 1518 features: 6 from the Pipeline transformer, 1509 from the CountVectorizer, 2 from the SimpleImputer, and 1 from the passthrough transformer.
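If you’d like to verify that accounting programmatically, here’s a sketch that uses the fitted ColumnTransformer from above:
# 6 one-hot encoded features plus 1509 vectorized features plus
# 2 imputed features plus 1 passthrough feature should equal 1518
n_ohe = len(ct.named_transformers_['pipeline']
              .named_steps['onehotencoder'].get_feature_names())
n_vect = len(ct.named_transformers_['countvectorizer'].get_feature_names())
n_ohe + n_vect + 2 + 1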
Earlier in this chapter, since the Embarked column contained missing values and needed one-hot encoding, we created a two-step Pipeline called imp_ohe. The first step of this Pipeline is imputation of a constant value, and the second step is one-hot encoding.
imp_ohe = make_pipeline(imp_constant, ohe)
We included the imp_ohe Pipeline in the ColumnTransformer, and applied it to both the Embarked and Sex columns. Here’s what it would look like if the ColumnTransformer only contained the imp_ohe Pipeline.
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']))
When you run the fit_transform method, Embarked turns into 4 columns and Sex turns into 2 columns, and the results are stacked side-by-side.
ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
[1., 0., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
...,
[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 1.]])
Because the Sex column didn’t contain any missing values and only needed one-hot encoding, we could have achieved the exact same results by applying imp_ohe just to Embarked and then separately applying ohe to Sex.
ct = make_column_transformer(
    (imp_ohe, ['Embarked']),
    (ohe, ['Sex']))
The fit_transform does indeed output the same results as above, though I personally prefer the first ColumnTransformer.
ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
[1., 0., 0., 0., 1., 0.],
[0., 0., 1., 0., 1., 0.],
...,
[0., 0., 1., 0., 1., 0.],
[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 1.]])
One common question is whether you can avoid using the imp_ohe Pipeline entirely by making a ColumnTransformer like this instead, in which the imputation of a constant value is applied to Embarked, and one-hot encoding is applied to both Embarked and Sex.
ct = make_column_transformer(
    (imp_constant, ['Embarked']),
    (ohe, ['Embarked', 'Sex']))
The answer is no, you cannot. The fit_transform method throws an error because Embarked contains missing values, and the ohe transformer is not able to handle missing values.
ct.fit_transform(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[32], line 1
----> 1 ct.fit_transform(X)
...
ValueError: Input contains NaN
The key insight here is that there’s no interaction between the transformers of a ColumnTransformer. In other words, there’s no flow of data from one transformer to the next, meaning the output of the imp_constant transformer does not become the input to the ohe transformer. Thus, the ohe transformer is operating on the original Embarked column, not a transformed Embarked column in which missing values have been imputed.
If that’s confusing, it might be useful to recall the key differences between a Pipeline and a ColumnTransformer: a Pipeline operates sequentially, meaning the output of each step becomes the input to the next step, whereas a ColumnTransformer operates in parallel, meaning each transformer receives its columns from the original DataFrame and the outputs are stacked side-by-side.
When imputing missing values for a categorical feature, you can either impute the most frequent value or a constant user-defined value. In this lesson, I’m going to discuss how you might choose between these two strategies.
Imputing a constant value essentially treats the missing values as a new category, which I believe is the better choice regardless of whether the values are missing at random or not at random. Imputing a constant value is especially important if the majority of values are missing, since imputing the most frequent value in that case would more than double the size of the category that was imputed, which would be quite misleading to the model.
That being said, imputing the most frequent value is much more acceptable when you have only a small number of missing values for a given feature, since the imputation won’t have much of an impact on the model anyway.
It’s important to note that if you impute a constant value for a feature, and that feature has missing values in the new data but not the training data, then you’ll need to set the OneHotEncoder’s handle_unknown parameter to “ignore”. That’s because the missing values category won’t be learned during the OneHotEncoder’s fit step, and thus unknown values seen during the transform step need to be ignored in order to avoid an error.
The alternative here is just to impute the most frequent value for that feature, in which case you can leave the handle_unknown parameter set to its default value of “error”.
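Here’s a sketch of the first option, in which an encoder with handle_unknown set to “ignore” replaces ohe in the two-step Pipeline (ohe_ignore and imp_ohe_ignore are hypothetical names):
# unknown categories seen during transform will be encoded as all zeros
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
imp_ohe_ignore = make_pipeline(imp_constant, ohe_ignore)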
Here’s the Pipeline that we’ve built throughout the book. The strategy I’ve used throughout is to include all data transformations within a single ColumnTransformer, including any missing value imputation, and then use that ColumnTransformer as the first step in a two-step Pipeline.
pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 'Name'),
                                                 ('simpleimputer',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Parch'])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
However, an alternative approach would be to create a three-step Pipeline in which the first step is missing value imputation, the second step includes all other data transformations, and the third step is the model. Let’s try it out to see whether this is a better approach.
This would be the first ColumnTransformer, which only does missing value imputation. Constant value imputation is applied to Embarked, mean imputation is applied to Age and Fare, and the other columns are passed through because they don’t contain any missing values in the training or new data. It would be the first step in the Pipeline.
ct1 = make_column_transformer(
    (imp_constant, ['Embarked']),
    (imp, ['Age', 'Fare']),
    ('passthrough', ['Sex', 'Name', 'Parch']))
This would be the second ColumnTransformer, which handles all other data transformations. It would be the second step in the Pipeline, and thus it would operate on the output of the first ColumnTransformer. However, a ColumnTransformer outputs a NumPy array, not a DataFrame, and thus in this ColumnTransformer we would have to reference the columns by position instead of by name.
We know the order of the columns from the first ColumnTransformer, and thus Embarked and Sex would be columns 0 and 3 and Name would be column 4. Embarked and Sex are one-hot encoded, Name is vectorized, and the other columns are passed through.
ct2 = make_column_transformer(
    (ohe, [0, 3]),
    (vect, 4),
    ('passthrough', [1, 2, 5]))
Now that we’ve created the ColumnTransformers, we can include them in a three-step Pipeline and fit the Pipeline to X and y.
pipe = make_pipeline(ct1, ct2, logreg)
pipe.fit(X, y)
Pipeline(steps=[('columntransformer-1',
                 ColumnTransformer(transformers=[('simpleimputer-1',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant'),
                                                  ['Embarked']),
                                                 ('simpleimputer-2',
                                                  SimpleImputer(),
                                                  ['Age', 'Fare']),
                                                 ('passthrough', 'passthrough',
                                                  ['Sex', 'Name', 'Parch'])])),
                ('columntransformer-2',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(), [0, 3]),
                                                 ('countvectorizer',
                                                  CountVectorizer(), 4),
                                                 ('passthrough', 'passthrough',
                                                  [1, 2, 5])])),
                ('logisticregression',
                 LogisticRegression(random_state=1, solver='liblinear'))])
Finally, we can use this three-step Pipeline to make predictions, and it does indeed make the same predictions as our original two-step Pipeline.
pipe.predict(X_new)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1])
Using a three-step Pipeline like this is certainly a valid approach. However, I find the original two-step Pipeline easier to write and to read, and thus I prefer the two-step approach.
The rules for Pipelines are that all Pipeline steps other than the final step must be a transformer, and the final step can be a model or a transformer.
If a Pipeline ends in a model, such as our “pipe” object, you use the Pipeline’s fit and predict methods. If a Pipeline ends in a transformer, such as our “imp_ohe” object, you generally use the Pipeline’s fit_transform and transform methods, though you can also use the fit method. Both patterns are sketched below.
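Here’s a minimal sketch of both patterns, using objects we’ve already created:
# a Pipeline ending in a model: fit on training data, predict on new data
pipe.fit(X, y)
pipe.predict(X_new)

# a Pipeline ending in a transformer: fit_transform on training data,
# then transform on new data
imp_ohe.fit_transform(X[['Embarked', 'Sex']])
imp_ohe.transform(X_new[['Embarked', 'Sex']])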
Although this is a lot of information to take in, developing this level of understanding will definitely make it easier for you to test and debug your future Pipelines.