Master Machine Learning with scikit-learn

About this book

This is a practical guide to help you transition from Machine Learning beginner to skilled Machine Learning practitioner.

Throughout the book, you’ll learn the best practices for proper Machine Learning and how to apply those practices to your own Machine Learning problems.

By the end of this book, you’ll be more confident when tackling new Machine Learning problems because you’ll understand what steps you need to take, why you need to take them, and how to correctly execute those steps using scikit-learn.

You’ll know what problems you might run into, and you’ll know exactly how to solve them.

And because you’re learning a better way to work in scikit-learn, your code will be easier to write and to read, and you’ll get better Machine Learning results faster than before!

About the author

Hi! My name is Kevin Markham. I’m the founder of Data School, an online school for learning Data Science with Python.

I’ve been teaching Data Science in the classroom and online for 10+ years, and I’m passionate about teaching Data Science to people who are new to the field, regardless of their educational and professional backgrounds.

This book was adapted from my 7.5-hour video course of the same name, and is a compilation of nearly everything I know about effective Machine Learning with scikit-learn!

Personally, I live in Asheville, North Carolina, with my wife and son, and I have a degree in Computer Engineering from Vanderbilt University.

Prerequisite skills

This is an intermediate-level book about scikit-learn, though it also includes many advanced topics.

You’re ready for this book if you can use scikit-learn to solve simple classification or regression problems, including loading a dataset, defining the features and target, training and evaluating a model, and making predictions with new data.
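As a quick self-check, here's a minimal sketch of that prerequisite workflow, using scikit-learn's built-in iris dataset as a stand-in for your own data (it's not code from the book, just a readiness check):

    # minimal sketch of the prerequisite workflow, using the iris dataset
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load a dataset and define the features (X) and target (y)
    X, y = load_iris(return_X_y=True)

    # split the data so the model can be evaluated on unseen samples
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # train a model and evaluate its accuracy
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))

    # make predictions with new data
    print(model.predict(X_test[:5]))

If each of those steps looks familiar, you have the background you need.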

If you’re brand new to scikit-learn, I recommend first taking my free introductory course, Introduction to Machine Learning with scikit-learn. Once you’ve completed lessons 1 through 7, you’ll know the Machine Learning and scikit-learn fundamentals that you’ll need for this book.

If you’ve used scikit-learn before but you just need a refresher, there’s no need to take my introductory course because I’ll be reviewing the Machine Learning workflow in chapter 2.

Book outline

In chapter 1, I’ll give you an overview of the book and help you to get set up. Then in chapter 2, we’ll move on to a review of the Machine Learning workflow in order to establish a foundation for the rest of the book.

In chapters 3 through 9, we’ll explore how to handle common issues such as categorical features, text data, and missing values, and also how to integrate those steps into an efficient workflow. Then in chapter 10, we’ll cover how to properly evaluate and tune your entire workflow for maximum performance.

In chapters 11 through 16, we’ll walk through a variety of advanced techniques that can help to further improve your model’s performance, including ensembling, feature selection, feature standardization, and feature engineering. In chapters 17 through 19, we’ll dive deep into two common issues you’ll run into during real-world Machine Learning, namely high-cardinality categorical features and class imbalance.

Finally, in chapter 20, I’ll end the book with my advice for how you can continue to make progress with your Machine Learning education and skill development!

Topics covered

  • Review of the basic Machine Learning workflow
  • Encoding categorical features
  • Encoding text data
  • Handling missing values
  • Preparing complex datasets
  • Creating an efficient workflow for preprocessing and model building
  • Tuning your workflow for maximum performance
  • Avoiding data leakage
  • Proper model evaluation
  • Automatic feature selection
  • Feature standardization
  • Feature engineering using custom transformers
  • Linear and non-linear models
  • Model ensembling
  • Model persistence
  • Handling high-cardinality categorical features
  • Handling class imbalance

Functions and classes covered

  • Workflow composition: Pipeline, ColumnTransformer, make_pipeline, make_column_transformer, make_column_selector, make_union
  • Categorical encoding: OneHotEncoder, OrdinalEncoder
  • Numerical encoding: KBinsDiscretizer
  • Text encoding: CountVectorizer
  • Missing value imputation: SimpleImputer, KNNImputer, IterativeImputer, MissingIndicator
  • Model building: LogisticRegression, RandomForestClassifier, ExtraTreesClassifier
  • Model ensembling: VotingClassifier
  • Model selection: StratifiedKFold, cross_val_score, train_test_split
  • Model evaluation: accuracy_score, classification_report, confusion_matrix, roc_auc_score, average_precision_score, plot_confusion_matrix, plot_roc_curve, plot_precision_recall_curve
  • Hyperparameter tuning: GridSearchCV, RandomizedSearchCV
  • Feature selection: RFE, SelectPercentile, SelectFromModel, chi2
  • Feature standardization: StandardScaler, MaxAbsScaler
  • Feature engineering: FunctionTransformer, PolynomialFeatures
  • Configuration: set_config
  • Model persistence: joblib, pickle, cloudpickle (these are external libraries)
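To give you a feel for how several of these pieces fit together, here's a minimal sketch that chains a column transformer and a pipeline and then evaluates the entire workflow with cross-validation. The toy DataFrame and its column names are made up purely for illustration, and this is only a preview of the style of code you'll see in the book:

    # minimal sketch: preprocessing and model building composed into one Pipeline
    # (the toy DataFrame and its column names are invented for illustration)
    import pandas as pd
    from sklearn.compose import make_column_transformer
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    df = pd.DataFrame({
        'color': ['red', 'blue', 'red', 'green', 'blue', 'green'] * 5,
        'size':  [1.0, 2.0, None, 4.0, 5.0, 3.0] * 5,
        'label': [0, 1, 0, 1, 1, 0] * 5,
    })
    X = df[['color', 'size']]
    y = df['label']

    # encode the categorical column and impute the missing numerical value
    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ['color']),
        (SimpleImputer(), ['size']),
    )

    # chain the preprocessing and the model, then evaluate the whole workflow
    pipe = make_pipeline(preprocessor, LogisticRegression())
    print(cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean())

Composing your workflow this way keeps all preprocessing inside the cross-validation loop, which is one of the keys to avoiding data leakage.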

How to support this book

If you appreciate reading this book for free and want to give back, here are a few options for how you can help:

  • Purchase a physical or digital copy of the book (coming soon!)
  • Purchase one of my online courses
  • Tell a friend about the book
  • Share the book on social media

Thank you so much! 🙏

License

The content of this book is protected by copyright, and may not be copied or reproduced without the author’s permission.