1  Introduction

1.1 Book overview

Hello and welcome to “Master Machine Learning with scikit-learn”. In this book, you’re going to learn how to build an effective Machine Learning workflow using the latest scikit-learn techniques so that you can solve almost any supervised Machine Learning problem.

If you have just a bit of experience with scikit-learn, you’ve probably spent most of your time building models using artificially clean training data. But in this book, you’ll learn how to build models using more complex datasets. We’ll cover:

  • How to handle common scenarios such as missing values, text data, categorical data, and class imbalance
  • How to build a reusable workflow that starts with a pandas DataFrame and ends with a trained scikit-learn model
  • How to integrate feature engineering, selection, and standardization into your workflow
  • How to avoid data leakage so that you can correctly estimate model performance
  • How to tune your entire workflow for maximum performance

…and so much more!

High-level topics:

  • Handling missing values, text data, categorical data, and class imbalance
  • Building a reusable workflow
  • Feature engineering, selection, and standardization
  • Avoiding data leakage
  • Tuning your entire workflow

Throughout the book, you’ll learn the best practices for Machine Learning and how to apply those practices to your own Machine Learning problems.

By the end of this book, you’ll be more confident when tackling new Machine Learning problems because you’ll understand what steps you need to take, why you need to take them, and how to correctly execute those steps using scikit-learn.

You’ll know what problems you might run into, and you’ll know exactly how to solve them.

And because you’re learning a better way to work in scikit-learn, your code will be easier to write and to read, and you’ll get better Machine Learning results faster than before!

How you will benefit from this book:

  • Knowledge of best practices
  • Confidence when tackling new ML problems
  • Ability to anticipate and solve problems
  • Improved code quality
  • Better, faster results

1.2 scikit-learn vs Deep Learning

If you want to solve a Machine Learning problem using Python, I always recommend starting with scikit-learn. Here’s why:

  • It provides a consistent interface to a huge number of Machine Learning models
  • It offers many options and tuning parameters but uses sensible defaults
  • It includes a rich set of functionality to support the entire Machine Learning workflow
  • It has exceptional documentation
  • And there is an active community of researchers and developers who continue to improve and support the library

Benefits of scikit-learn:

  • Consistent interface to many models
  • Many tuning parameters (but sensible defaults)
  • Workflow-related functionality
  • Exceptional documentation
  • Active community support
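
To make the first benefit concrete, every scikit-learn model is trained and used with the same methods. Here’s a minimal sketch (the dataset is synthetic and exists purely for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# create a small synthetic dataset (for illustration only)
X, y = make_classification(n_samples=100, n_features=4, random_state=1)

# two very different models, but the same fit/predict interface
for model in [LogisticRegression(max_iter=1000), DecisionTreeClassifier()]:
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:3]))

Because the interface is consistent, code you write for one model usually works with any other model after a one-line change.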

In fact, scikit-learn was the most popular Machine Learning tool in Kaggle’s most recent “State of Data Science and Machine Learning” report, with more than 80% of data scientists reporting that they use it.

For many Machine Learning problems, scikit-learn will be the only library you need. However, there are some specialized problems for which a deep learning library such as TensorFlow, PyTorch, or Keras will provide superior results. That being said, deep learning does have some significant drawbacks:

  • Deep learning requires more computational resources
  • Deep learning libraries have a steeper learning curve
  • And deep learning models are less interpretable than traditional Machine Learning models

Drawbacks of deep learning:

  • More computational resources
  • Steeper learning curve
  • Less interpretable models

In other words, I only recommend using deep learning if you already know that you need it to solve your particular problem. But for the majority of Machine Learning problems, you are likely to get similar results with scikit-learn, and you’ll get those results faster and with far less effort.

1.3 Prerequisite skills

This is an intermediate-level book about scikit-learn, though it also includes many advanced topics.

You’re ready for this book if you can use scikit-learn to solve simple classification or regression problems, including loading a dataset, defining the features and target, training and evaluating a model, and making predictions with new data.

scikit-learn prerequisites:

  • Loading a dataset
  • Defining the features and target
  • Training and evaluating a model
  • Making predictions with new data
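
If you’d like to check yourself against those prerequisites, here’s a minimal sketch of that basic workflow. (It uses scikit-learn’s built-in iris dataset purely for illustration; it’s not one of the datasets from this book.)

# load a built-in dataset and define the features (X) and target (y)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# evaluate a model using 5-fold cross-validation
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=5).mean())

# train the model on all of the data, then make predictions for new samples
model.fit(X, y)
print(model.predict(X[:2]))

If each of those steps looks familiar, you’re ready to proceed.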

If you’re brand new to scikit-learn, I recommend first taking my free introductory course, Introduction to Machine Learning with scikit-learn. Once you’ve completed lessons 1 through 7, you’ll know the Machine Learning and scikit-learn fundamentals that you’ll need for this book.

New to scikit-learn?

If you’ve used scikit-learn before but you just need a refresher, there’s no need to take my introductory course because I’ll be reviewing the Machine Learning workflow in the next chapter.

Finally, I should note that we will perform a few basic pandas operations in the book, including reading a CSV file and selecting columns from a DataFrame. However, it’s not a problem at all if you’re new to pandas because I will explain that code as we go.
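
As a preview, here’s the style of pandas code we’ll use. (The file and column names below are hypothetical placeholders.)

import pandas as pd

# read a CSV file into a DataFrame
df = pd.read_csv('train.csv')

# select a single column (returns a Series)
y = df['label']

# select multiple columns (returns a DataFrame)
X = df[['feature_1', 'feature_2']]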

1.4 Setup and software versions

To follow along with this book, you’ll need to have access to scikit-learn and pandas, which are both open source Python libraries. The easiest way to install scikit-learn and pandas (and their dependencies) is to download the Anaconda distribution of Python.

For other installation options, visit the scikit-learn and pandas websites.

How to install scikit-learn and pandas:

  • Easiest option: download the Anaconda distribution of Python
  • Other options: visit the scikit-learn and pandas websites

Because we’ll be using some of the latest scikit-learn functionality, it’s important that you are running a modern version of scikit-learn. To check your version, just run this code in your Python environment. Note that there are two underscores before and after the word “version”.

import sklearn
sklearn.__version__
'0.23.2'

Note that this book uses version 0.23.2. To follow along with the book, it’s important that you are using at least version 0.20.2. A few of the scikit-learn features I’ll use were released in versions 0.21, 0.22, or 0.23, but none of those features are essential to the core content of the book. Throughout the book, I’ll specifically mention any features that require scikit-learn 0.21 or above.

scikit-learn version:

  • Book version: 0.23.2
  • Minimum version: 0.20.2
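
If you need more detail about your setup, recent versions of scikit-learn also include a helper that prints your Python version along with the versions of scikit-learn’s key dependencies, which can be useful when troubleshooting installation issues:

import sklearn

# print the versions of Python, scikit-learn, and its dependencies
sklearn.show_versions()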

You are also welcome to use a later version of scikit-learn, such as 0.24, 1.0, or beyond. Throughout the book, I’ll mention any changes to scikit-learn in those versions that are relevant to the book.

Regarding the pandas library, this book uses version 1.2.4, but any version should work just fine.

import pandas
pandas.__version__
'1.2.4'

One final option for following along with the book, especially if you aren’t able to install Python packages on your local machine, is to use Google Colab. Colab provides you with an interface similar to the Jupyter Notebook. It’s free and runs entirely in your browser, though it does require that you have a Google account.

Using Google Colab with the book:

  • Similar to the Jupyter Notebook
  • Runs in your browser
  • Free (but requires a Google account)

1.5 Book outline

Let’s talk about what we’re going to cover in each chapter.

  • In chapter 1, which is this chapter, we’re getting you ready for the book.
  • In chapter 2, we’ll walk through the basic Machine Learning workflow, from loading a dataset to building a model to making predictions.
  • In chapter 3, we’ll focus on one of the most important data preprocessing steps, which is the encoding of categorical features.
  • In chapter 4, we’ll see how to use ColumnTransformer and Pipeline to make our workflow more powerful and efficient.
  • In chapter 5, we’ll review the workflow we’ve built so far to make sure you understand the key concepts before we add more complexity.
  • In chapter 6, we’ll learn how to create features from unstructured text data.
  • In chapter 7, we’ll discuss missing values and explore a few different ways to handle them.
  • In chapter 8, we’ll see what problems arise when we expand the size of our dataset, and then we’ll figure out how to handle those problems.
  • In chapter 9, we’ll review our workflow again and discuss how it’s helping us to prevent data leakage.
  • In chapter 10, we’ll take a deep dive into how to efficiently tune our Pipeline for maximum performance.
  • In chapter 11, we’ll try out a non-linear model called “random forests” and figure out how to tune it without overextending our computing resources.
  • In chapter 12, we’ll learn how to ensemble our different models in two different ways and how to tune the ensemble for even better performance.
  • In chapter 13, we’ll discuss the benefits of feature selection and then try out a handful of different automated methods for selecting features.
  • In chapter 14, we’ll experiment with standardizing our features to see if that improves our model performance.
  • In chapter 15, we’ll create a variety of new features within our Pipeline and discuss why you might want to do all of your feature engineering using scikit-learn rather than pandas.
  • In chapter 16, we’ll do one final review of the workflow that we created throughout the book.
  • In chapter 17, we’ll experiment with different ways of handling categorical features with lots of unique values.
  • In chapter 18, we’ll thoroughly explore the problem of class imbalance and the techniques you can use to address it.
  • In chapter 19, we’ll walk through my complete workflow for handling class imbalance so that you can see a demonstration of the best practices.
  • And finally, in chapter 20, we’ll discuss how you can keep learning and improving your skills on your own.

Chapters:

  1. Introduction
  2. Review of the Machine Learning workflow
  3. Encoding categorical features
  4. Improving your workflow with ColumnTransformer and Pipeline
  5. Workflow review #1
  6. Encoding text data
  7. Handling missing values
  8. Fixing common workflow problems
  9. Workflow review #2
  10. Evaluating and tuning a Pipeline
  11. Comparing linear and non-linear models
  12. Ensembling multiple models
  13. Feature selection
  14. Feature standardization
  15. Feature engineering with custom transformers
  16. Workflow review #3
  17. High-cardinality categorical features
  18. Class imbalance
  19. Class imbalance walkthrough
  20. Going further

I recommend reading the chapters in order, because each chapter builds on the material from previous chapters.

Also, you might have noticed that most of the chapters end with a series of lessons marked “Q&A”. These lessons answer common questions that may have come to mind while reading the rest of the lessons in that chapter. Thus, the Q&A lessons should help you to understand the core material from each chapter in greater depth.

Lesson types:

  • Core lessons
  • Q&A lessons

Finally, I wanted to mention that this book won’t be focusing on high-level algorithm selection, such as whether you should use a logistic regression model or a random forests model for your particular problem. That’s because I’ve found that the workflow is more important and will have a greater impact on your overall Machine Learning results than your ability to pick between algorithms.

In fact, once you’ve mastered the workflow, you can iterate through different algorithms quickly even if you don’t deeply understand them. And even if you did understand all of the algorithms, it’s hard to know in advance which one will work best for a given problem, which is why it’s so important to build a reusable workflow that enables you to switch between algorithms easily.

The bottom line is that understanding the algorithms is still useful, but the workflow is even more important, and so that’s the focus of this book.

Why not focus on algorithms?

  • Workflow will have a greater impact on your results
  • Reusable workflow enables you to try many different algorithms
  • Hard to know (in advance) which algorithm will work best
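
As a sketch of what that reusability looks like (this is illustrative code, not the workflow we’ll build in the book), once your preprocessing steps live in a Pipeline, trying a different algorithm only requires changing the final step:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# a built-in dataset, used here for illustration only
X, y = load_wine(return_X_y=True)

# the preprocessing step stays the same; only the model changes
models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=1)]
for model in models:
    pipe = make_pipeline(StandardScaler(), model)
    print(type(model).__name__, cross_val_score(pipe, X, y, cv=5).mean())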

1.6 Datasets

We’ll be using three different datasets in the book:

  • The first is the famous Titanic dataset.
  • The second is a dataset of US census data.
  • And the third is a dataset of mammography scans used to detect the presence of cancer.

Datasets:

  • Titanic
  • US census
  • Mammography scans

I chose these datasets for three main reasons.

First, I wanted to use small-to-medium-sized datasets that could be read via URL so that you aren’t required to download any files for this book, which is more convenient for everyone.
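
Here’s a sketch of what that looks like in practice. (The URL below is a placeholder, not the actual link used in the book.)

import pandas as pd

# read a dataset directly from a URL, with no local download required
# (placeholder URL, for illustration only)
df = pd.read_csv('https://example.com/titanic_train.csv')
df.head()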

Second, using smaller datasets means that computationally intense operations such as grid search won’t take many hours to run. This will save you a lot of time, especially if you’re working through the book without the latest and greatest hardware.

Third, I believe you’ll actually understand the lessons better because we’re not using huge datasets. From years of teaching experience, I’ve found that you will never master complex processes until you understand the simpler building blocks in great detail.

In this book, we’re going to be examining the inputs and outputs from different workflow steps in detail, and that’s not practical when you have thousands of features or hundreds of thousands of samples.

In fact, we’re going to be spending the next few chapters with just 10 rows of data, which might sound crazy, but is actually the ideal way to truly understand how the different scikit-learn components work together.

Why use smaller datasets?

  • Easier and faster access to files
  • Reduced computational time
  • Greater understanding of the lessons

The great news is that once you master the workflow you’ll learn in this book, you’ll be able to handle datasets of almost any size without changing a single step. In other words, the knowledge you gain here will 100% transfer to far more complex datasets, and yet you will still understand in detail what is actually going on inside your code.

1.7 Meet the author

Finally, before we wrap up this chapter, I just want to introduce myself. My name is Kevin Markham and I’m the author of this book. I’m the founder of Data School, an online school for learning Data Science with Python.

I’ve been teaching Data Science in the classroom and online for more than 10 years, and I’m passionate about teaching Data Science to people who are new to the field, regardless of their educational and professional backgrounds.

Personally, I live in Asheville, North Carolina with my wife and son, and I have a degree in Computer Engineering from Vanderbilt University.

About me:

  • Founder of Data School
  • Teaching Data Science for 10+ years
  • Passionate about teaching people who are new to Data Science
  • Lives in Asheville, North Carolina
  • Degree in Computer Engineering