Hello and welcome to “Master Machine Learning with scikit-learn”. In this book, you’re going to learn how to build an effective Machine Learning workflow using the latest scikit-learn techniques so that you can solve almost any supervised Machine Learning problem.
If you have just a bit of experience with scikit-learn, you’ve probably spent most of your time building models using artificially clean training data. But in this book, you’ll learn how to build models using more complex datasets. We’ll cover:
…and so much more!
Throughout the book, you’ll learn the best practices for proper Machine Learning and how to apply those practices to your own Machine Learning problems.
By the end of this book, you’ll be more confident when tackling new Machine Learning problems because you’ll understand what steps you need to take, why you need to take them, and how to correctly execute those steps using scikit-learn.
You’ll know what problems you might run into, and you’ll know exactly how to solve them.
And because you’re learning a better way to work in scikit-learn, your code will be easier to write and to read, and you’ll get better Machine Learning results faster than before!
If you want to solve any Machine Learning problem using Python, then I always recommend starting with scikit-learn. Here’s why:
In fact, scikit-learn was the most popular Machine Learning tool in Kaggle’s most recent “State of Machine Learning” report, with more than 80% of data scientists using it.
For many Machine Learning problems, scikit-learn will be the only library you need. However, there are some specialized problems for which a deep learning library such as TensorFlow, PyTorch, or Keras will provide superior results. That being said, deep learning does have some significant drawbacks:
In other words, I only recommend using deep learning if you already know that you need it to solve your particular problem. But for the majority of Machine Learning problems, you are likely to get similar results using scikit-learn, and you will get those results much faster and more easily.
This is an intermediate level book about scikit-learn, though it also includes many advanced topics.
You’re ready for this book if you can use scikit-learn to solve simple classification or regression problems, including loading a dataset, defining the features and target, training and evaluating a model, and making predictions with new data.
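As a rough self-check, the prerequisite steps listed above might look something like the following minimal sketch. The choice of scikit-learn's built-in iris dataset, the logistic regression model, and the sample measurements in `X_new` are all assumptions made purely for illustration; they are not the datasets or models used in this book.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load a dataset and define the features (X) and target (y).
# Assumption: iris is used only because it ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Choose a model and evaluate it with 5-fold cross-validation.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())

# Train on all of the data, then predict for one new (made-up) sample.
model.fit(X, y)
X_new = [[5.0, 3.4, 1.5, 0.2]]
prediction = model.predict(X_new)
print(prediction)
```

If every line here looks familiar, you likely have the background this book assumes.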
If you’re brand new to scikit-learn, I recommend first taking my free introductory course, Introduction to Machine Learning with scikit-learn. Once you’ve completed lessons 1 through 7, you’ll know the Machine Learning and scikit-learn fundamentals that you’ll need for this book.
If you’ve used scikit-learn before but you just need a refresher, there’s no need to take my introductory course because I’ll be reviewing the Machine Learning workflow in the next chapter.
Finally, I should note that we will perform a few basic pandas operations in the book, including reading a CSV file and selecting columns from a DataFrame. However, it’s not a problem at all if you’re new to pandas because I will explain that code as we go.
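The two pandas operations mentioned above can be sketched in a few lines. The CSV text, the column names, and the values below are invented for illustration only, and the data is read from an in-memory string so that the example runs without any files.

```python
import io

import pandas as pd

# Hypothetical CSV content (made-up columns and values).
csv_text = """height_cm,weight_kg,label
170,65,0
182,90,1
158,52,0
"""

# Read the CSV data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Select multiple columns (returns a DataFrame) by passing a list of names.
X = df[["height_cm", "weight_kg"]]

# Select a single column (returns a Series) by passing one name.
y = df["label"]

print(X.shape, y.shape)
```

Note that `pd.read_csv` also accepts a file path or a URL in place of the in-memory buffer used here.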
To follow along with this book, you’ll need to have access to scikit-learn and pandas, which are both open source Python libraries. The easiest way to install scikit-learn and pandas (and their dependencies) is to download the Anaconda distribution of Python.
For other installation options, visit the scikit-learn and pandas websites.
Because we’ll be using some of the latest scikit-learn functionality, it’s important that you are running a modern version of scikit-learn. To check your version, just open your code editor and run this code. Note that there are two underscores before and after the word “version”.
import sklearn
sklearn.__version__
'0.23.2'
Note that this book uses version 0.23.2. To follow along with the book, it’s important that you are using at least version 0.20.2. Some of the scikit-learn features that I’ll use in the book were released in 0.21, 0.22, or 0.23, but none of those features are essential to the core content of the book. Throughout the book, I’ll specifically mention any features that require scikit-learn 0.21 or above.
You are also welcome to use a later version of scikit-learn, such as 0.24, 1.0, or beyond. Throughout the book, I’ll mention any changes to scikit-learn in those versions that are relevant to the book.
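If you would rather check the minimum version programmatically, here is one possible sketch. The helper names `version_tuple` and `meets_minimum` are hypothetical (they are not part of scikit-learn), and the parsing is deliberately simple: it reads only the leading major, minor, and patch numbers and ignores suffixes such as `rc1` or `.dev0`.

```python
import re


def version_tuple(version):
    """Turn a version string such as '0.23.2' into (0, 23, 2)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch number (as in '1.0') is treated as 0.
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


def meets_minimum(installed, minimum="0.20.2"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    import sklearn
    print(sklearn.__version__, meets_minimum(sklearn.__version__))
except ImportError:
    print("scikit-learn is not installed")
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where `'0.9'` would incorrectly sort after `'0.23'`.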
Regarding the pandas library, this book uses version 1.2.4, but any version should work just fine.
import pandas
pandas.__version__
'1.2.4'
One final option for following along with the book, especially if you aren’t able to install Python packages on your local machine, is to use Google Colab. Colab provides you with an interface similar to the Jupyter Notebook. It’s free and runs entirely in your browser, though it does require that you have a Google account.
Let’s talk about what we’re going to cover in each chapter.
I recommend reading the chapters in order, because each chapter builds on the material from previous chapters.
Also, you might have noticed that most of the chapters end with a series of lessons marked “Q&A”. These lessons answer common questions that may have come to mind while reading the rest of the lessons in that chapter. Thus, the Q&A lessons should help you to understand the core material from each chapter in greater depth.
Finally, I wanted to mention that this book won’t be focusing on high-level algorithm selection, such as whether you should use a logistic regression model or a random forest model for your particular problem. That’s because I’ve found that the workflow is more important and will have a greater impact on your overall Machine Learning results than your ability to pick between algorithms.
In fact, once you’ve mastered the workflow, you can iterate through different algorithms quickly even if you don’t deeply understand them. And even if you did understand all of the algorithms, it’s hard to know in advance which one will work best for a given problem, which is why it’s so important to build a reusable workflow that enables you to switch between algorithms easily.
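To make the idea of switching between algorithms concrete, here is a minimal sketch in which the same preprocessing and evaluation code is reused and only the final model changes. The iris dataset, the `StandardScaler` step, and the two candidate models are assumptions chosen for illustration; this is not the workflow built later in the book.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small built-in dataset stands in for your own data.
X, y = load_iris(return_X_y=True)

# Candidate algorithms to try; add or remove entries freely.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

results = {}
for name, model in candidates.items():
    # The preprocessing and evaluation steps stay identical;
    # only the final step of the pipeline is swapped out.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()

for name, accuracy in results.items():
    print(name, round(accuracy, 3))
```

Because the loop treats every model the same way, trying a third algorithm is a one-line change to the `candidates` dictionary.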
The bottom line is that understanding the algorithms is still useful, but the workflow is even more important, and so that’s the focus of this book.
We’ll be using three different datasets in the book:
I chose these datasets for three main reasons.
First, I wanted to use small-to-medium size datasets that could be read via URL so that you aren’t required to download any files for this book, which is more convenient for everyone.
Second, using smaller datasets means that computationally intense operations such as grid search won’t take many hours to run. This will save you a lot of time, especially if you’re working through the book without the latest and greatest hardware.
Third, I believe you’ll actually understand the lessons better because we’re not using huge datasets. From years of teaching experience, I’ve found that you will never master complex processes until you understand the simpler building blocks in great detail.
In this book, we’re going to be examining the inputs and outputs from different workflow steps in detail, and that’s not practical when you have thousands of features or hundreds of thousands of samples.
In fact, we’re going to be spending the next few chapters with just 10 rows of data, which might sound crazy, but is actually the ideal way to truly understand how the different scikit-learn components work together.
The great news is that once you master the workflow you’ll learn in this book, you’ll be able to handle datasets of almost any size without changing a single step. In other words, the knowledge you gain here will 100% transfer to far more complex datasets, and yet you will still understand in detail what is actually going on inside your code.