Repository: chiphuyen/dmls-book
Branch: main
Commit: f09bd03dab64
Files: 6
Total size: 74.8 KB
Directory structure:
gitextract_1je7gjfp/
├── .gitignore
├── README.md
├── basic-ml-review.md
├── mlops-tools.md
├── resources.md
└── summary.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.DS_Store
================================================
FILE: README.md
================================================
# Designing Machine Learning Systems (Chip Huyen 2022)
Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they're data dependent, with data varying wildly from one use case to the next. In this book, you'll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.
The book has been translated into 10+ languages including: [Japanese](https://www.oreilly.co.jp/books/9784814400409/), [Korean](https://www.hanbit.co.kr/media/books/book_view.html?p_code=B1811121220), [Vietnamese](https://www.facebook.com/nhanampublishing/posts/pfbid0j4qygGh4nTLbDKDfyyPbkVXtWQa32DijbPfBt2tXmhrUGAZ8wJUoEcGkoe8pt5gcl), [traditional Chinese](https://www.gotop.com.tw/books/BookDetails.aspx?Types=v&bn=A738), [simplified Chinese - mainland China](https://oreillymedia.com.cn/index.php?func=book&isbn=978-7-5198-8628-8), [simplified Chinese - Taiwan](https://fe.suning.com/bigimages/12429674436.html), [Portuguese](https://www.amazon.com/PROJETANDO-SISTEMAS-MACHINE-LEARNING-Ravaglia/dp/8550819670), [Spanish](https://www.amazon.com/Dise%C3%B1o-sistemas-Machine-Learning-aplicaciones/dp/8426736955/), [Russian](https://www.chitai-gorod.ru/product/proektirovanie-sistem-mashinnogo-obucheniya-2990668?srsltid=AfmBOoqOV5AmFVNedKvq0DHvEnQEgNYqiubU-57CxODNy3sxnn5WnfGz), [Polish](https://helion.pl/ksiazki/jak-projektowac-systemy-uczenia-maszynowego-iteracyjne-tworzenie-aplikacji-gotowych-do-pracy-chip-huyen,jakpsu.htm#format/d), [Serbian](https://www.mikroknjiga.rs/masinsko-ucenje-projektovanje-sistema/46926), [Turkish](https://www.kitapyurdu.com/kitap/makine-ogrenmesi-sistemleri-tasarlamak/680393.html?srsltid=AfmBOoqxuD51dOaVe8a42WH6RU_Fyh9nuN2DWNMIuqVlEjwYBHsRoPEU), [Greek](https://www.public.gr/product/books/greek-books/computer-science/program-instruction-manuals/sxediasi-sustimaton-mixanikis-mathisis/2004325), and [Thai](https://www.lazada.co.th/products/designing-machine-learning-systems-i4258019199-s16857887502.html).
The book is available on:
- [Amazon](https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969)
- [O'Reilly](https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/)
- [Kindle](https://www.amazon.com/Designing-Machine-Learning-Systems-Huyen-ebook-dp-B0B1LGL2SR/dp/B0B1LGL2SR/r)
- [Amazon India](http://amazon.in/Designing-Machine-Learning-Systems-Production-Ready/dp/9355422679)
and most places where technical books are sold.
## Repo structure
This book focuses on the key design decisions when developing and deploying machine learning systems. This is NOT a tutorial book, so it doesn't have a lot of code snippets. Accordingly, you won't find code examples in this repo, but you will find:
- [Table of contents](ToC.pdf)
- [Chapter summaries](summary.md)
- [MLOps tools](mlops-tools.md)
- [Resources](resources.md)
- [A very short review of basic ML concepts](basic-ml-review.md)
## Contributions
You're welcome to create issues or submit pull requests. Your feedback is much appreciated!
## Who This Book Is For
This book is for anyone who wants to leverage ML to solve real-world problems. ML in this book refers to both deep learning and classical algorithms, with a leaning toward ML systems at scale, such as those seen at medium to large enterprises and fast-growing startups. Systems at a smaller scale tend to be less complex and might benefit less from the comprehensive approach laid out in this book.
Because my background is engineering, the language of this book is geared toward engineers, including ML engineers, data scientists, data engineers, ML platform engineers, and engineering managers.
You might be able to relate to one of the following scenarios:
1. You have been given a business problem and a lot of raw data. You want to engineer this data and choose the right metrics to solve this problem.
2. Your initial models perform well in offline experiments and you want to deploy them.
3. You have little feedback on how your models are performing after your models are deployed, and you want to figure out a way to quickly detect, debug, and address any issue your models might run into in production.
4. The process of developing, evaluating, deploying, and updating models for your team has been mostly manual, slow, and error-prone. You want to automate and improve this process.
5. Each ML use case in your organization has been deployed using its own workflow, and you want to lay down the foundation (e.g., model store, feature store, monitoring tools) that can be shared and reused across use cases.
6. You’re worried that there might be biases in your ML systems and you want to make your systems responsible!
You can also benefit from the book if you belong to one of the following groups:
- Tool developers who want to identify underserved areas in ML production and figure out how to position your tools in the ecosystem.
- Individuals looking for ML-related roles in the industry.
- Technical and business leaders who are considering adopting ML solutions to improve your products and/or business processes. Readers without strong technical backgrounds might benefit the most from Chapters 1, 2, and 11.
## Review
- _"This is, simply, the very best book you can read about how to build, deploy, and scale machine learning models at a company for maximum impact. Chip is a masterful teacher, and the breadth and depth of her knowledge is unparalleled."_ - Josh Wills, Software Engineer at WeaveGrid and former Director of Data Engineering, Slack
- _"There is so much information one needs to know to be an effective machine learning engineer. It's hard to cut through the chaff to get the most relevant information, but Chip has done that admirably with this book. If you are serious about ML in production, and care about how to design and implement ML systems end to end, this book is essential."_ - Laurence Moroney, AI and ML Lead, Google
- _"One of the best resources that focuses on the first principles behind designing ML systems for production. A must-read to navigate the ephemeral landscape of tooling and platform options."_ - Goku Mohandas, Founder of [Made With ML](https://github.com/GokuMohandas/MadeWithML)
See what people are saying about the book on Twitter [@designmlsys](https://twitter.com/designmlsys/likes)!
---
Chip Huyen, *Designing Machine Learning Systems*. O'Reilly Media, 2022.
```bibtex
@book{dmlsbook2022,
  address = {USA},
  author = {Chip Huyen},
  isbn = {978-1801819312},
  publisher = {O'Reilly Media},
  title = {{Designing Machine Learning Systems}},
  year = {2022}
}
```
================================================
FILE: basic-ml-review.md
================================================
## Basic ML Review
> **_NOTE:_** This is a quick refresh of some key concepts touched on in the book. This is not meant to be an introduction to machine learning. For readers who want an introduction to ML, I recommend the following resources:
> 1. [Lecture notes] [Stanford CS 231n](https://cs231n.github.io/): deep learning focused, beginner-friendly.
> 2. [Book] Kevin P Murphy's [Machine Learning: A Probabilistic Perspective](https://probml.github.io/pml-book/book1.html): foundational, comprehensive, though a bit intense.
A model is a function that transforms inputs into outputs, which can then be used to make predictions. For example, a binary text classification model might take sentences as inputs and output values between 0 and 1. You can use these output values to make predictions, such as if the value is less than 0.5, output the `NEGATIVE` class, and if the value is greater than or equal to 0.5, output the `POSITIVE` class.
In traditional programming, functions are given and outputs are calculated from given inputs. For example, your function f(x) might be given as: `f(x) = 2x + 3`.
Given x = 1, the output will be `f(1) = 2 * 1 + 3 = 5`. Given x = 3, the output will be `f(3) = 2 * 3 + 3 = 9`.
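The contrast above can be sketched in a few lines of Python (the function names here are just for illustration):

```python
# Traditional programming: the function is given, hand-written.
def f(x):
    return 2 * x + 3

# Turning a binary classifier's score in [0, 1] into a class label,
# using the 0.5 threshold described above.
def to_label(score):
    return "POSITIVE" if score >= 0.5 else "NEGATIVE"

print(f(1))           # 5
print(f(3))           # 9
print(to_label(0.3))  # NEGATIVE
print(to_label(0.8))  # POSITIVE
```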
In supervised ML, the inputs and outputs are given, which are called data, and the function is derived from data. Given x as input and y as output, you want to learn a function f such that applying f on x will produce y. However, ML isn’t powerful enough to derive arbitrary functions from data yet, so you still need to specify the form that you think the function should take[^1]. It can be a linear function, a decision tree, or a feedforward neural network with two hidden layers, each with 768 neurons[^2].
For example, given a dataset with only two examples (x = 1, y = 5) and (x = 3, y = 9), you might specify that the function is a linear function, which means that it takes the form f(x) = wx + b. Then you learn the values of w and b to fit this dataset. Because w and b are learned during the training process, they are called parameters.
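With only two data points and a linear form, w and b are fully determined and can be solved for directly. A minimal sketch:

```python
# Two training examples: (x, y) pairs from the text.
points = [(1, 5), (3, 9)]

# With exactly two points and f(x) = w*x + b, the parameters follow
# from the two equations w*1 + b = 5 and w*3 + b = 9.
(x1, y1), (x2, y2) = points
w = (y2 - y1) / (x2 - x1)  # slope
b = y1 - w * x1            # intercept

print(w, b)  # 2.0 3.0
```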
For each type of model, there are many possible values for the parameters. You need an objective function to evaluate how good a given set of parameters is for a dataset, and a procedure to derive the set of parameters best suited for the given data according to that objective, known as a learning procedure.
> **_SIDEBAR:_** Some readers might wonder if the above paragraph about parameters still applies to non-parametric models such as K-means clustering and decision trees. Being non-parametric doesn’t mean that models don’t use parameters. In a parametric model, the number of parameters is fixed with respect to the sample size. In a nonparametric model, the effective number of parameters can grow with the sample size. So the complexity of the function underlying a neural network remains the same even if the amount of data grows. But the complexity of the function underlying a decision tree grows as its number of nodes grows.
When talking about model selection, most people think about selecting a function form. However, choosing the right objective function and a learning procedure is extremely important in finding a good set of parameters for your model.
### Objective Function
The **objective function**, also known as the loss function, is highly dependent on the model type and whether the labels are available. If the labels aren’t available, as in the case of unsupervised learning, the objective functions depend on the data points themselves. For example, for k-means clustering, the objective function is the variance within data points in the same cluster (so the objective is to put data points into clusters so that the within-cluster variance is minimized). But unsupervised learning is much less commonly used in production.
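To make the k-means objective concrete, here's a minimal sketch that computes the within-cluster variance (sum of squared distances to each cluster's centroid) for made-up 1-D clusters; the function name and data are just for illustration:

```python
# Sum of squared distances from each point to its cluster's centroid.
# This is the quantity k-means tries to minimize (often called inertia).
def within_cluster_variance(clusters):
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((p - centroid) ** 2 for p in points)
    return total

# Two toy 1-D clusters: a tight one and a spread-out one.
print(within_cluster_variance([[1.0, 1.2, 0.8], [10.0, 14.0]]))  # ~8.08
```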
Most algorithms you’ll encounter in production are supervised, or some form of weakly supervised or semi-supervised, as mentioned in the section **[Handling the Lack of Labels](https://learning.oreilly.com/library/view/designing-machine-learning/9781098107956/ch04.html)** in Chapter 4. Given a set of parameter values, you calculate the outputs from the given inputs and compare the function’s predicted outputs (y') to the actual outputs (y). Objective functions evaluate how good a set of parameter values is by measuring the distance between the set of y' and the set of y.
To make this concrete, let’s go back to the example above where we have only 2 data points: (x = 1, y = 5) and (x = 3, y = 9). We want to find w and b such that `f(x) = wx + b` best fits this data. Given the parameter values w = 3 and b = 4, we get the predicted outputs 7 and 13, as shown in Table A-1. The objective function measures the distance between the predicted outputs (7, 13) and the actual outputs (5, 9).
| Input | Predicted output made by f(x \| w=3, b=4) = 3x + 4 | Actual output |
|---|---|---|
| x = 1 | 3 * 1 + 4 = 7 | 5 |
| x = 3 | 3 * 3 + 4 = 13 | 9 |

Table A-1: Predicted outputs when w = 3 and b = 4
There are many types of distance metrics you can use to derive your objective functions. When the outputs are scalars (numbers), two common metrics are Root Mean Squared Error and Mean Absolute Error as shown in Table A-2.
| Objective function | How to calculate | Distance metric |
|---|---|---|
| Root Mean Squared Error (RMSE) | $\sqrt{\sum\limits_{i=1}^n \frac{(y_i' - y_i)^2}{n}}$ | Euclidean |
| Mean Absolute Error (MAE) | $\sum\limits_{i=1}^n \frac{\lvert y_i' - y_i \rvert}{n}$ | Manhattan |

Table A-2: Two common objective functions for scalar outputs
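The two metrics in Table A-2, applied to the predictions from Table A-1 (predicted outputs 7 and 13, actual outputs 5 and 9), can be computed as follows; this is a minimal sketch, and the function names are just for illustration:

```python
import math

def rmse(y_pred, y_true):
    """Root Mean Squared Error: Euclidean distance, averaged."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)

def mae(y_pred, y_true):
    """Mean Absolute Error: Manhattan distance, averaged."""
    n = len(y_true)
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n

print(rmse([7, 13], [5, 9]))  # sqrt((4 + 16) / 2) ≈ 3.162
print(mae([7, 13], [5, 9]))   # (2 + 4) / 2 = 3.0
```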
However, many types of models don’t output just one number given an input, but output a distribution. For example, if your task has three classes: [cat, dog, chicken], your model might output an array of how likely it is that your input belongs to each class. So the predicted output might look like [0.1, 0.5, 0.4], which means the input has a 10% chance of being a cat, a 50% chance of being a dog, and a 40% chance of being a chicken. The actual label for this example is chicken, so the actual output is [0, 0, 1]. We want to measure the distance between the predicted outputs that take the form [0.1, 0.5, 0.4] and the actual outputs that take the form [0, 0, 1]. In this case, the common objective function is cross entropy and its variations.
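For the three-class example above, cross entropy with a one-hot label reduces to the negative log of the probability assigned to the true class. A minimal sketch:

```python
import math

# Cross entropy between a predicted distribution and a one-hot label.
# With a one-hot label, only the true class's term survives.
def cross_entropy(y_pred, y_true):
    return -sum(t * math.log(p) for p, t in zip(y_pred, y_true) if t > 0)

# The [cat, dog, chicken] example: the true class (chicken) got 0.4.
print(cross_entropy([0.1, 0.5, 0.4], [0, 0, 1]))  # -log(0.4) ≈ 0.916
```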
You can modify the objective function to enforce your model to learn a set of parameters with certain properties. As discussed in the section Class Imbalance in Chapter 4, you can modify the objective function to encourage your model to focus on examples of rare classes or examples that are difficult to learn. You can also add regularizers such as L1 and L2 to your loss function to encourage your model to choose parameters of smaller values.
Each objective function defines a landscape of loss values over the set of possible values your parameters can take. This landscape is known as the loss surface of a given objective function. A small change to your objective function can give you a very different loss surface, which, in turn, gives you a very different function for your model.
Understanding the possible parameters given by different objective functions can help you choose the objective function that is best suited for your needs. However, this understanding tends to require advanced linear algebra, so it’s common for ML engineers to use popular objective functions that are known to give decent performance for their problems without giving them much thought.
While developing your model, if time permits, you should experiment with different objective functions to see how your model’s behaviors change, both globally on all your data or with respect to different slices of your data. You might be surprised.
### Learning Procedure
Learning procedures, the procedures that help your model find the set of parameters that minimize a given objective function for a given set of data, are diverse[^3]. In some cases, the parameters can be calculated exactly. For example, in the case of linear functions, the values of w and b can be calculated from the averages and variances of x and y. In most cases, however, the values of parameters can’t be calculated exactly and have to be approximated, usually via an iterative procedure. For example, K-means clustering uses an iterative procedure called the expectation–maximization algorithm.
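For the linear case, the exact solution can be computed from the means and (co)variances of x and y: w = cov(x, y) / var(x) and b = mean(y) − w · mean(x), the ordinary least squares closed form. A minimal sketch on the two-point dataset from earlier:

```python
# Closed-form fit of f(x) = w*x + b under mean squared error:
# w = cov(x, y) / var(x), b = mean(y) - w * mean(x).
xs, ys = [1, 3], [5, 9]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
var = sum((x - mx) ** 2 for x in xs)
w = cov / var
b = my - w * mx

print(w, b)  # 2.0 3.0
```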
The most popular family of iterative procedures today is undoubtedly **gradient descent**. The loss of a model at a given training step is given by the objective function. The gradient of the objective function with respect to a parameter tells us how much that parameter contributes to the loss. In other words, the negative of the gradient is the direction that lowers the loss from the current value the most. The idea is to subtract that gradient value from that parameter, hoping that this will make the parameter contribute less to the loss, and eventually drive the loss toward 0.
Subtracting the raw gradient values from parameters doesn’t work very well. Transforming the gradient values first (such as multiplying the gradient value by 0.003) and then subtracting the transformed values from the parameters helps models converge much faster. The function that determines how to update a parameter given a gradient value is called an update algorithm, or an **optimizer**. Common optimizers include Momentum, Adam, and RMSProp.
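Putting these ideas together, here's a minimal gradient descent sketch for the two-point linear example with mean squared error as the objective; the learning rate and step count are arbitrary choices for illustration:

```python
# Fit f(x) = w*x + b on the two-point dataset by gradient descent
# on mean squared error.
xs, ys = [1, 3], [5, 9]
w, b, lr = 0.0, 0.0, 0.1  # lr scales the gradient before subtraction

for _ in range(1000):
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    # The plain update: subtract the scaled gradients.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # close to 2.0 and 3.0
```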
Good optimizers can both speed up your model training process and help your model converge to a better set of parameters. However, even though optimizers help your model find the set of parameters that minimize a given objective function for a given set of data, the set of parameters that minimize the loss on your training data isn’t always the best set for you, as you also want parameters that will perform well on the data your model will encounter in production[^4]. While developing ML models, especially gradient descent-based models, it’s often helpful to experiment with different types of optimizers. In the section AutoML, we’ll discuss how to use ML to find the best optimizer for your model.
## Notes
[^1]:
In the section AutoML in Chapter 6, we cover how to use algorithms to automatically choose a function form from a predefined set of possible forms.
[^2]:
If you don’t know what these terms mean, you should still be able to understand approximately 80% of this book. However, I recommend that you take an introductory course to Machine Learning or read an introductory book on Machine Learning in your free time.
[^3]:
The subfield that studies different learning procedures is called optimization and it’s a large, complex, and fascinating field. Readers interested in learning more can refer to the book [Algorithms for Optimization](https://algorithmsbook.com/optimization/) (Kochenderfer and Wheeler, 2019).
[^4]:
In technical terms, you want optimizers that can generalize to unseen data.
================================================
FILE: mlops-tools.md
================================================
# MLOps Tools
I avoided discussing tools in the book because tooling is ephemeral -- all software will eventually become legacy software. However, I did mention a few tools in my book when they were useful to illustrate a point. Several early readers told me that the book helped them discover useful tools and asked me to recommend more, so here is my attempt.
In this doc, I’ll start with long lists of tools other people have compiled. Then we’ll go over a short list of open-source tools that I think are pretty cool. To keep the list short, I exclude some tools that are already well known in the data science community. Feel free to open an issue or submit a PR to recommend a new tool. Thank you!
We also have a channel ([#tools-watch](https://discord.gg/Mw77HPrgjF)) on our MLOps Discord to discuss interesting tools.
* [Long lists of MLOps tools](#long-lists-of-mlops-tools)
* [Short lists of MLOps open-source tools](#short-lists-of-mlops-open-source-tools)
* [Pandas alternatives](#pandas-alternatives)
* [Data and features](#data-and-features)
* [Interpretability and fairness](#interpretability-and-fairness)
* [Model development and evaluation](#model-development-and-evaluation)
* [Use case specific frameworks](#use-case-specific-frameworks)
* [Dev environment](#dev-environment)
* [DevOps for MLOps](#devops-for-mlops)
# Long lists of MLOps tools
I get it -- there are hundreds of MLOps tools out there. There are already multiple long lists of MLOps tools, each with hundreds of entries, so I won’t create another one here. However, if you want a long list of tools, here are some of the more popular resources.
>> **Notes**
>> * Many of these lists were created by vendor-sponsored communities. There’s nothing wrong with taking sponsorships -- communities have operational expenses too -- and being able to get sponsorships is impressive for a community. However, sometimes there might be conflicts of interest between vendor sponsorships and vendor-neutral reviews.
>> * Many of these lists divide tools into categories and evaluate them by features. However, the landscape is still evolving quickly, so categories are not yet well-defined. At the same time, best practices are still converging, so it can be hard for companies to delineate what features they need.
1. [MLOps Community’s Learn](https://mlops.community/learn/): currently, it consists of 3 categories (Feature Store, Monitoring, Metadata Storage & Management -- which includes Model Store) with one in progress (Deploy). You can add 2 tools to compare them side-by-side.
2. [TwiML Solutions Guide](https://twimlai.com/solutions/): Tools are grouped into categories with features supported and where they can be deployed.
3. [StackShare](https://stackshare.io/): You can browse different devtools, see what companies are using them, see jobs that mention them, check out alternatives, even read reviews. You can also see the tech stacks of popular tech companies. You can also compare tools side by side. In my opinion, StackShare is what many MLOps solution guides aspire to become, but StackShare is general for all tech stacks, not just MLOps.
4. [AI Infrastructure Landscape](https://ai-infrastructure.org/ai-infrastructure-landscape/): a membership-based foundation that focuses on AI infrastructure (more general than [LFAI & Data](https://lfaidata.foundation/), which focuses on open-source). They do a lot of comprehensive research and analysis on MLOps tooling.
5. Matt Turck’s annual [Machine Learning, AI and Data (MAD) Landscape](https://mattturck.com/data2021/): with high-level analysis from an investor’s perspective.
6. Leigh Marie Braswell’s [Startup Opportunities in Machine Learning Infrastructure](https://leighmariebraswell.substack.com/p/startup-opportunities-in-machine): analysis of the gaps in MLOps landscape from another investor’s perspective.
7. Yours truly’s [MLOps Landscape](https://huyenchip.com/2020/12/30/mlops-v2.html). It comes with a [Google Sheets list](https://docs.google.com/spreadsheets/d/1i8BzE4puGQ3dmQueu4LQCcwaqrulgK1Vb-xeFwhy6gY/edit#gid=0) of 284 tools (last updated in December 2020) and interactive visualization. I stopped maintaining the list because it takes a ton of time and it’s not like no one else is doing it.
# Short lists of MLOps open-source tools
I star cool open source tools whenever I come across them. You can check out the full list [here](https://github.com/chiphuyen?tab=stars).
**Disclaimer**: this is not supposed to be a comprehensive list.
## Pandas alternatives
Tabular data structures like the pandas DataFrame are great for data manipulation. However, pandas is [slow](https://stackoverflow.com/search?q=%5Bpandas%5D+slow), [quirky](https://github.com/chiphuyen/just-pandas-things), and doesn’t leverage GPUs. Naturally, there have been many projects aiming to fix these problems. Here are a few of them.
* [modin](https://github.com/modin-project/modin): a drop-in replacement of pandas (using `import modin.pandas as pd` instead of `import pandas as pd`). [Ponder](https://ponder.io/) is working on the enterprise version built on top of it.
* [dask](https://github.com/dask/dask) and [cuDF](https://github.com/rapidsai/cudf): both projects were at some point maintained by the RAPIDS AI team at NVIDIA, until a core developer of dask left NVIDIA to start [Coiled](https://coiled.io/), which provides the enterprise version built on top of dask.
* [Polars](https://github.com/pola-rs/polars/): built with Rust on top of Apache Arrow, Polars promises to be fast!
H2O did a fun [database-like ops benchmark](https://h2oai.github.io/db-benchmark/) (though I hope they’ll consider another color scheme for their graphs).

## Data and features
* Online analytical databases (e.g. if you want to join predictions with user feedback to monitor your model performance in real-time): [ClickHouse](https://github.com/ClickHouse/ClickHouse), [Druid](https://github.com/apache/druid).
* Stream processing: [ksql](https://github.com/confluentinc/ksql), [faust](https://github.com/robinhood/faust), [materialize](https://github.com/MaterializeInc/materialize), [Redpanda](https://github.com/redpanda-data/redpanda) (on WASM)
* Visualization: [D3](https://github.com/d3/d3), [Superset](https://github.com/apache/superset), [Facets](https://github.com/PAIR-code/facets), [redash](https://github.com/getredash/redash), [visdom](https://github.com/fossasia/visdom), [plotly](https://github.com/plotly/plotly.py), [Altair](https://github.com/altair-viz/altair), [pandas-profiling](https://github.com/ydataai/pandas-profiling), [lux](https://github.com/lux-org/lux) (only for DataFrames), [bokeh](https://github.com/bokeh/bokeh)
* Data validation: [Great Expectations](https://github.com/great-expectations/great_expectations), [deepchecks](https://github.com/deepchecks/deepchecks), [pandera](https://github.com/pandera-dev/pandera)
* Labeling: [Snorkel](https://github.com/snorkel-team/snorkel), [Label Studio](https://github.com/heartexlabs/label-studio), [doccano](https://github.com/doccano/doccano)
* Data versioning: [DVC](https://github.com/iterative/dvc), [Dolt](https://github.com/dolthub/dolt), [pachyderm](https://github.com/pachyderm/pachyderm)
* Data hosting for unstructured data: [Hub](https://github.com/activeloopai/Hub)
* Metadata stores (for feature discovery): Lyft’s [Amundsen](https://github.com/amundsen-io/amundsen) and LinkedIn’s [DataHub](https://github.com/datahub-project/datahub).
* Feature stores: [FEAST](https://github.com/feast-dev/feast) (OSS). Several commercial feature stores are built on top of FEAST.
* Data pipelines: [ploomber](https://github.com/ploomber/ploomber), [hamilton](https://github.com/stitchfix/hamilton), [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular)
* The whole [arrow](https://github.com/apache/arrow) and [flight](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) ecosystem is growing fast!
## Interpretability and fairness
* Interpretability: [SHAP](https://github.com/slundberg/shap), [Lime](https://github.com/marcotcr/lime), [Interpret](https://github.com/interpretml/interpret), [lit](https://github.com/PAIR-code/lit) (for NLP), [captum](https://github.com/pytorch/captum) (PyTorch), [timeshap](https://github.com/feedzai/timeshap), [AIX360](https://github.com/Trusted-AI/AIX360)
* Fairness: [AIF360](https://github.com/Trusted-AI/AIF360)
## Model development and evaluation
* Experiment tracking: [MLflow](https://github.com/mlflow/mlflow), [aim](https://github.com/aimhubio/aim). Most of the tools in this category aren’t open source since the hard part is to host and visualize artifacts.
* Model optimization: [TVM](https://github.com/apache/tvm), [TensorRT](https://developer.nvidia.com/tensorrt), [Triton](https://github.com/openai/triton), [hummingbird](https://github.com/microsoft/hummingbird), [composer](https://github.com/mosaicml/composer)
* Distributed training: [DeepSpeed](https://github.com/microsoft/DeepSpeed) (super cool), [accelerate](https://github.com/huggingface/accelerate)
* Federated learning: [PySyft](https://github.com/OpenMined/PySyft), [FedML](https://github.com/FedML-AI/FedML), [FATE](https://github.com/FederatedAI/FATE), [TensorFlow Federated](https://www.tensorflow.org/federated)
* Evaluation: [checklist](https://github.com/marcotcr/checklist) (NLP), [reclist](https://github.com/jacopotagliabue/reclist) (recommender systems)
* Online experiments (e.g. A/B testing): [growthbook](https://github.com/growthbook/growthbook), [Ax](https://github.com/facebook/Ax)
## Use case specific frameworks
Just me nerding out on those cool use cases -- don’t mind me.
* Neural recommender systems / CTR: [DLRM](https://github.com/facebookresearch/dlrm), [DeepCTR](https://github.com/shenweichen/DeepCTR), [tensorflow-DeepFM](https://github.com/ChenglongChen/tensorflow-DeepFM), [Transformers4Rec](https://github.com/NVIDIA-Merlin/Transformers4Rec)
* Conversational AI: [rasa](https://github.com/RasaHQ/rasa), [NeMo](https://github.com/NVIDIA/NeMo)
* Similarity search: [annoy](https://github.com/spotify/annoy), [Faiss](https://github.com/facebookresearch/faiss), [Milvus](https://github.com/milvus-io/milvus)
* Deep fakes: [faceswap](https://github.com/deepfakes/faceswap), [deepface](https://github.com/serengil/deepface)
* Time-lagged conversion modeling: [convoys](https://github.com/better/convoys)
* Churn prediction: [WTTE-RNN](https://github.com/ragulpr/wtte-rnn)
* Survival analysis: [lifelines](https://github.com/CamDavidsonPilon/lifelines)
## Dev environment
* CLI tools: [fzf](https://github.com/junegunn/fzf) (fuzzy search), [lipgloss](https://github.com/charmbracelet/lipgloss)
* IDEs: just use VSCode -- its notebook is pretty good too.
* Dependency management: [Poetry](https://github.com/python-poetry/poetry)
* Config management: [Hydra](https://github.com/facebookresearch/hydra), [gin-config](https://github.com/google/gin-config)
* Documentation: [docusaurus](https://github.com/facebook/docusaurus)
* Debugging in Kubernetes: [k9s](https://github.com/derailed/k9s)
* Virtual whiteboard: [excalidraw](https://github.com/excalidraw/excalidraw)
## DevOps for MLOps
* CI/CD: [earthly](https://github.com/earthly/earthly)
* Monitoring: [Sentry](https://github.com/getsentry/sentry), [Prometheus](https://github.com/prometheus/prometheus), [vector](https://github.com/vectordotdev/vector), [M3](https://github.com/m3db/m3)
* Dashboards: [Grafana](https://github.com/grafana/grafana), [Metabase](https://github.com/metabase/metabase)
* General DevOps: [Chaos Monkey](https://github.com/Netflix/chaosmonkey), [k6](https://github.com/grafana/k6)
================================================
FILE: resources.md
================================================
# Resources
The resources here are meant for further exploration of topics already covered in the book. Some of them were excluded from the book to avoid distracting the readers from the key points, as the book already includes a substantial amount of links and references.
* [Chapter 1. Overview of Machine Learning Systems](#chapter-1-overview-of-machine-learning-systems)
* [Chapter 2. Introduction to Machine Learning Systems Design](#chapter-2-introduction-to-machine-learning-systems-design)
* [Chapter 3. Data Engineering Fundamentals](#chapter-3-data-engineering-fundamentals)
* [Streaming systems](#streaming-systems)
* [Chapter 4. Training Data](#chapter-4-training-data)
* [Chapter 5. Feature Engineering](#chapter-5-feature-engineering)
* [Chapter 6. Model Development and Offline Evaluation](#chapter-6-model-development-and-offline-evaluation)
* [Training, debugging, and testing ML code](#training-debugging-and-testing-ml-code)
* [Model evaluation](#model-evaluation)
* [Chapter 7. Model Deployment and Prediction Service](#chapter-7-model-deployment-and-prediction-service)
* [Chapter 8. Data Distribution Shifts and Monitoring](#chapter-8-data-distribution-shifts-and-monitoring)
* [Chapter 9. Continual Learning and Test in Production](#chapter-9-continual-learning-and-test-in-production)
* [Contextual bandits](#contextual-bandits)
* [Chapter 10. Infrastructure and Tooling for MLOps](#chapter-10-infrastructure-and-tooling-for-mlops)
* [Chapter 11. The Human Side of Machine Learning](#chapter-11-the-human-side-of-machine-learning)
## Chapter 1. Overview of Machine Learning Systems
To learn to design ML systems, it’s helpful to read case studies to see how actual teams deal with different deployment requirements and constraints. Many companies — Airbnb, Lyft, Uber, and Netflix, to name a few — run excellent tech blogs where they share their experience using ML to improve their products and/or processes.
1. [Using Machine Learning to Predict Value of Homes On Airbnb](https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d) (Robert Chang, Airbnb Engineering & Data Science, 2017)
In this detailed and well-written blog post, Chang describes how Airbnb used machine learning to predict an important business metric: the value of homes on Airbnb. It walks you through the entire workflow: feature engineering, model selection, prototyping, and moving prototypes to production. It's complete with lessons learned, tools used, and code snippets.
2. [Using Machine Learning to Improve Streaming Quality at Netflix](https://medium.com/netflix-techblog/using-machine-learning-to-improve-streaming-quality-at-netflix-9651263ef09f) (Chaitanya Ekanadham, Netflix Technology Blog, 2018)
As of 2018, Netflix streams to over 117M members worldwide, half of whom live outside the US. This blog post describes some of their technical challenges and how they use machine learning to overcome them, including predicting network quality, detecting device anomalies, and allocating resources for predictive caching.
3. [150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com](https://blog.kevinhu.me/2021/04/25/25-Paper-Reading-Booking.com-Experiences/bernardi2019.pdf) (Bernardi et al., KDD, 2019)
As of 2019, Booking.com has around 150 machine learning models in production. These models solve a wide range of prediction problems (e.g. predicting users’ travel preferences and how many people they travel with) and optimization problems (e.g. optimizing the background images and reviews to show for each user). Adrian Colyer gave a good summary of the six lessons learned [here](https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/):
1. Machine learned models deliver strong business value.
2. Model performance is not the same as business performance.
3. Be clear about the problem you’re trying to solve.
4. Prediction serving latency matters.
5. Get early feedback on model quality.
6. Test the business impact of your models using randomized controlled trials.
4. [How we grew from 0 to 4 million women on our fashion app, with a vertical machine learning approach](https://medium.com/hackernoon/how-we-grew-from-0-to-4-million-women-on-our-fashion-app-with-a-vertical-machine-learning-approach-f8b7fc0a89d7) (Gabriel Aldamiz, HackerNoon, 2018)
To offer automated outfit advice, Chicisimo tried to quantify people's fashion taste using machine learning. Due to the ambiguous nature of the task, the biggest challenges were framing the problem and collecting the data for it, both of which the article addresses. It also covers the problem that every consumer app struggles with: user retention.
5. [Machine Learning-Powered Search Ranking of Airbnb Experiences](https://medium.com/airbnb-engineering/machine-learning-powered-search-ranking-of-airbnb-experiences-110b4b1a0789) (Mihajlo Grbovic, Airbnb Engineering & Data Science, 2019)
This article walks you step by step through a canonical example of the ranking and recommendation problem. The four main steps are system design, personalization, online scoring, and business aspects. The article explains which features to use, how to collect data and label it, why they chose Gradient Boosted Decision Tree, which testing metrics to use, what heuristics to take into account while ranking results, and how to do A/B testing during deployment. Another wonderful thing about this post is that it also covers personalization to rank results differently for different users.
6. [From shallow to deep learning in fraud](https://eng.lyft.com/from-shallow-to-deep-learning-in-fraud-9dafcbcef743) (Hao Yi Ong, Lyft Engineering, 2018)
Fraud detection is one of the earliest industry use cases of machine learning. This article explores the evolution of the fraud detection algorithms used at Lyft. At first, an algorithm as simple as logistic regression with engineered features was enough to catch most fraud cases. Its simplicity allowed the team to understand the importance of different features. Later, as fraud techniques became more sophisticated, more complex models were required. The article explores the tradeoffs between complexity and interpretability, and between performance and ease of deployment.
7. [Space, Time and Groceries](https://tech.instacart.com/space-time-and-groceries-a315925acf3a) (Jeremy Stanley, Tech at Instacart, 2017)
Instacart uses machine learning to solve the task of path optimization: how to most efficiently assign tasks for multiple shoppers and find the optimal paths for them. The article explains the entire process of system design, from framing the problem, collecting data, algorithm and metric selection, topped with a tutorial for beautiful visualization.
8. [Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning](https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/) (Brad Neuberg, Dropbox Engineering, 2017)
An application as simple as a document scanner has two distinct components: optical character recognition and word detection. Each requires its own production pipeline, and the end-to-end system requires additional steps for training and tuning. This article also goes into detail about the team’s effort to collect data, including building their own data annotation platform.
9. [Spotify’s Discover Weekly: How machine learning finds your new music](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe) (Sophia Ciocca, 2017)
To create Discover Weekly, there are three main types of recommendation models that Spotify employs:
* **Collaborative Filtering** models (i.e. the ones that Last.fm originally used), which work by analyzing your behavior and others’ behavior.
* **Natural Language Processing** (NLP) models, which work by analyzing text.
* **Audio** models, which work by analyzing the raw audio tracks themselves.
10. [Smart Compose: Using Neural Networks to Help Write Emails](https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html) (Yonghui Wu, Google AI Blog 2018)
“_Since Smart Compose provides predictions on a per-keystroke basis, it must respond ideally within **100ms** for the user not to notice any delays. Balancing model complexity and inference speed was a critical issue_.”
## Chapter 2. Introduction to Machine Learning Systems Design
* [Rules of Machine Learning](https://developers.google.com/machine-learning/guides/rules-of-ml) (Martin Zinkevich)
* [Things I wish we had known before we started our first Machine Learning project](https://medium.com/infinity-aka-aseem/things-we-wish-we-had-known-before-we-started-our-first-machine-learning-project-336d1d6f2184) (Aseem Bansal, towards-infinity 2018)
* [Data Science Project Quick-Start](https://eugeneyan.com/writing/project-quick-start/) (Eugene Yan, 2022)
* [https://github.com/chiphuyen/machine-learning-systems-design](https://github.com/chiphuyen/machine-learning-systems-design): A much earlier, much less organized version of this book.
* [Deploying Machine Learning Models: A Checklist](https://twolodzko.github.io/ml-checklist) (a short checklist for ML systems design)
## Chapter 3. Data Engineering Fundamentals
* [A Beginner’s Guide to Data Engineering](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7) (Robert Chang 2018)
* [Designing Data-Intensive Applications](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/) (Martin Kleppmann, O’Reilly, 2017)
* [Emerging Architectures for Modern Data Infrastructure](https://future.a16z.com/emerging-architectures-modern-data-infrastructure/) (Bornstein et al, a16z 2022)
* [Reverse ETL — A Primer](https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb) (Astasia Myers 2021)
* [Uber’s Big Data Platform: 100+ Petabytes with Minute Latency](https://eng.uber.com/uber-big-data-platform/) (Reza Shiftehfar, Uber Engineering blog 2018)
* [How DoorDash is Scaling its Data Platform to Delight Customers and Meet our Growing Demand](https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/) (Sudhir Tonse 2020)
### Streaming systems
* [The Log: What every software engineer should know about real-time data's unifying abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying) (Jay Kreps, LinkedIn / Confluent, 2013): Jay mentioned in a [tweet](https://twitter.com/jaykreps/status/1408159236794765314) that he wrote the post to see if there was enough interest in streaming for his team to start a company around it. The post must have been popular, because his team spun out of LinkedIn to become Confluent.
* [The Many Meanings of Event-Driven Architecture](https://www.youtube.com/watch?v=STKCRSUsyP0) (Martin Fowler, GOTO 2017): Martin Fowler is a great speaker. His talk made clear many of the complexities of event-driven architecture.
* [Stream Processing Hard Problems – Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda) (Kartik Paramasivam, LinkedIn Engineering 2016)
* [Open Problems in Stream Processing: A Call To Action](https://docs.google.com/presentation/d/1YtTEnOax5MDA8DazDa1ad-sP4zzM58KQK4HNAcxoONA/edit#slide=id.p) (Tyler Akidau, DEBS 2019): Tyler used to lead Dataflow at Google until he joined Snowflake in Jan 2020 to start Snowflake’s streaming team. His talk laid out key challenges of stream processing.
* [The Four Innovation Phases of Netflix's Trillions Scale Real-time Data Infrastructure](https://zhenzhongxu.com/the-four-innovation-phases-of-netflixs-trillions-scale-real-time-data-infrastructure-2370938d7f01) (Zhenzhong Xu, 2022): How Netflix transitioned from a batch system to a streaming system.
## Chapter 4. Training Data
* [Rejection sampling](https://en.wikipedia.org/wiki/Rejection_sampling)
* [The MIDAS Touch: Mixed Data Sampling Regression Models](https://escholarship.org/uc/item/9mf223rs) (Ghysels et al., 2004)
* [An Overview of Weak Supervision](https://www.snorkel.org/blog/weak-supervision) (Ratner et al., 2018)
## Chapter 5. Feature Engineering
* [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/) (Christoph Molnar, 2022): An amazingly detailed introduction to interpretability
## Chapter 6. Model Development and Offline Evaluation
### Training, debugging, and testing ML code
* [How to unit test machine learning code](https://medium.com/@keeper6928/how-to-unit-test-machine-learning-code-57cf6fd81765) (Chase Roberts, 2017)
* [A Recipe for Training Neural Networks](http://karpathy.github.io/2019/04/25/recipe/) (Andrej Karpathy, 2019)
* [Top 6 errors novice machine learning engineers make](https://medium.com/ai%C2%B3-theory-practice-business/top-6-errors-novice-machine-learning-engineers-make-e82273d394db) (Christopher Dossman, AI³ | Theory, Practice, Business 2017)
* [Testing and Debugging in Machine Learning](https://developers.google.com/machine-learning/testing-debugging) course (Google)
* [What did you wish you knew before deploying your first ML model?](https://twitter.com/chipro/status/1348265019012743169) (I asked this question on Twitter and got some interesting responses)
* [Techniques for Training Large Neural Networks](https://openai.com/blog/techniques-for-training-large-neural-networks/) (OpenAI 2022)
* [A survey of model compression and acceleration for deep neural networks](https://arxiv.org/abs/1710.09282) (Cheng et al., IEEE Signal Processing Magazine 2017)
* [Towards Federated Learning at Scale: System Design](https://arxiv.org/abs/1902.01046) (Bonawitz et al, 2019)
### Model evaluation
* [Effective testing for machine learning systems](https://www.jeremyjordan.me/testing-ml/) (Jeremy Jordan, 2020)
* [On Calibration of Modern Neural Networks](https://arxiv.org/abs/1706.04599) (Guo et al., 2017)
* [Calibration for Netflix recommendation systems](https://dl.acm.org/doi/10.1145/3240323.3240372) (Harald Steck, 2018)
* [Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://aclanthology.org/2020.acl-main.442/) (Ribeiro et al., ACL 2020)
* [TextBugger: Generating Adversarial Text Against Real-world Applications](https://arxiv.org/abs/1812.05271) (Li et al., 2018)
* [Uncertainty Sets for Image Classifiers using Conformal Prediction](https://arxiv.org/abs/2009.14193) (Angelopoulos et al., 2020)
## Chapter 7. Model Deployment and Prediction Service
## Chapter 8. Data Distribution Shifts and Monitoring
* [Beyond Incremental Processing: Tracking Concept Drift](https://www.aaai.org/Papers/AAAI/1986/AAAI86-084.pdf) (Jeffrey C. Schlimmer and Richard H. Granger, Jr., 1986). Concept drift isn’t something new!
* [Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift](https://arxiv.org/abs/1810.11953) (Rabanser et al., 2019)
* [Out-of-Distribution Generalization via Risk Extrapolation (REx)](http://proceedings.mlr.press/v139/krueger21a.html) (Krueger et al., 2020)
* [Domain Adaptation under Target and Conditional Shift](https://proceedings.mlr.press/v28/zhang13d.html) (Zhang et al., 2013)
* [A Review of Domain Adaptation without Target Labels](https://ieeexplore.ieee.org/abstract/document/8861136) (Kouw et al., 2019)
* [On Learning Invariant Representations for Domain Adaptation](http://proceedings.mlr.press/v97/zhao19a.html) (Zhao et al., 2019)
* [How to deal with the seasonality of a market?](https://eng.lyft.com/how-to-deal-with-the-seasonality-of-a-market-584cc94d6b75) (Marguerite Graveleau, Lyft Engineering 2019)
* [Invariant Risk Minimization](https://arxiv.org/abs/1907.02893) (Arjovsky et al., 2019)
* [Causality for Machine Learning](https://arxiv.org/abs/1911.10500) (Bernhard Schölkopf, 2019)
## Chapter 9. Continual Learning and Test in Production
* [Application deployment and testing strategies](https://cloud.google.com/solutions/application-deployment-and-testing-strategies) (Google)
* [MLOps: Continuous delivery and automation pipelines in machine learning](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning) (Google)
* [Automated Canary Analysis at Netflix with Kayenta](https://netflixtechblog.com/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69) (Michael Graff and Chris Sanden, Netflix Technology Blog 2018)
### Contextual bandits
* [A/B testing — Is there a better way? An exploration of multi-armed bandits](https://towardsdatascience.com/a-b-testing-is-there-a-better-way-an-exploration-of-multi-armed-bandits-98ca927b357d) (Greg Rafferty, Towards Data Science 2020)
* [Deep Bayesian Bandits: Exploring in Online Personalized Recommendations](https://arxiv.org/abs/2008.00727) (Guo et al. 2020)
* [Active Learning and Contextual Bandits](http://www.machinedlearnings.com/2012/02/active-learning-and-contextual-bandits.html) (Paul Mineiro, 2012)
## Chapter 10. Infrastructure and Tooling for MLOps
* [Introduction to Microservices, Docker, and Kubernetes](https://www.youtube.com/watch?v=1xo-0gCVhTU): a good one-hour video introduction to Docker and Kubernetes.
* [How Microsoft plans efficient workloads with DevOps](https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/release-flow)
* [Airbnb’s BigHead](https://vimeo.com/274801958)
* [Uber’s Michelangelo](https://eng.uber.com/michelangelo-machine-learning-platform/)
## Chapter 11. The Human Side of Machine Learning
* [Weapons of Math Destruction](https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815) (Cathy O’Neil, Crown Books 2016)
* [NIST Special Publication 1270](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf): Towards a Standard for Identifying and Managing Bias in Artificial Intelligence
* ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) [publications](https://facctconference.org/)
* [Trustworthy ML](https://www.trustworthyml.org/resources)’s recommended list of resources and fundamental papers to researchers and practitioners who want to learn more about trustworthy ML
* Sara Hooker’s awesome slide deck on [ML Beyond Accuracy: Fairness, Security, Governance](https://docs.google.com/presentation/d/1cshMKKSX24L0RL7LNzyOkZNQHD7N-Zyff8iffrLIVYM/edit?usp=sharing) (2022)
* Timnit Gebru and Emily Denton’s [tutorials](https://sites.google.com/view/fatecv-tutorial/schedule) on Fairness, Accountability, Transparency, and Ethics (2020)
================================================
FILE: summary.md
================================================
# Chapter Summaries
These are the summaries of each chapter taken from the book. Some of the summaries might not make sense to readers without having first read the originating chapters, but I hope that they will give you a sense of what the book is about.
* [Chapter 1. Overview of Machine Learning Systems](#chapter-1-overview-of-machine-learning-systems)
* [Chapter 2. Introduction to Machine Learning Systems Design](#chapter-2-introduction-to-machine-learning-systems-design)
* [Chapter 3. Data Engineering Fundamentals](#chapter-3-data-engineering-fundamentals)
* [Chapter 4. Training Data](#chapter-4-training-data)
* [Chapter 5. Feature Engineering](#chapter-5-feature-engineering)
* [Chapter 6. Model Development and Offline Evaluation](#chapter-6-model-development-and-offline-evaluation)
* [Chapter 7. Model Deployment and Prediction Service](#chapter-7-model-deployment-and-prediction-service)
* [Chapter 8. Data Distribution Shifts and Monitoring](#chapter-8-data-distribution-shifts-and-monitoring)
* [Chapter 9. Continual Learning and Test in Production](#chapter-9-continual-learning-and-test-in-production)
* [Chapter 10. Infrastructure and Tooling for MLOps](#chapter-10-infrastructure-and-tooling-for-mlops)
* [Chapter 11. The Human Side of Machine Learning](#chapter-11-the-human-side-of-machine-learning)
## Chapter 1. Overview of Machine Learning Systems
This opening chapter aimed to give readers an understanding of what it takes to bring ML into the real world. We started with a tour of the wide range of ML use cases in production today. While most people are familiar with ML in consumer-facing applications, the majority of ML use cases are for enterprise. We also discussed when ML solutions are appropriate. Even though ML can solve many problems very well, it can’t solve all problems, and it’s certainly not appropriate for all of them. However, for problems that ML can’t solve entirely, it’s possible that ML can be one part of the solution.
This chapter also highlighted the differences between ML in research and ML in production. The differences include the stakeholder involvement, computational priority, the properties of data used, the gravity of fairness issues, and the requirements for interpretability. This section is the most helpful to those coming to ML production from academia. We also discussed how ML systems differ from traditional software systems, which motivated the need for this book.
ML systems are complex, consisting of many different components. Data scientists and ML engineers working with ML systems in production will likely find that focusing only on the ML algorithms part is far from enough. It’s important to know about other aspects of the system, including the data stack, deployment, monitoring, maintenance, infrastructure, etc. This book takes a system approach to developing ML systems, which means that we’ll consider all components of a system holistically instead of just looking at ML algorithms. We’ll go into detail about what this holistic approach means in the next chapter.
## Chapter 2. Introduction to Machine Learning Systems Design
I hope that this chapter has given you an introduction to ML systems design and the considerations we need to take into account when designing an ML system.
Every project must start with why this project needs to happen, and ML projects are no exception. We started the chapter with an assumption that most businesses don’t care about ML metrics unless they can move business metrics. Therefore, if an ML system is built for a business, it must be motivated by business objectives, which need to be translated into ML objectives to guide the development of ML models.
Before building an ML system, we need to understand the requirements that the system needs to meet to be considered a good system. The exact requirements vary from use case to use case, and in this chapter, we focused on the four most general requirements: reliability, scalability, maintainability, and adaptability. Techniques to satisfy each of these requirements will be covered throughout the book.
Building an ML system isn’t a one-off task but an iterative process. In this chapter, we discussed the iterative process for developing an ML system that meets the requirements above.
We ended the chapter on a philosophical discussion of the role of data in ML systems. There are still many people who believe that having intelligent algorithms will eventually trump having a large amount of data. However, the success of systems including [AlexNet](https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html), [BERT](https://arxiv.org/abs/1810.04805), and [GPT](https://openai.com/blog/better-language-models/) showed that the progress of ML in the last decade relies on having access to a large amount of data. Regardless of whether data can overpower intelligent design, no one can deny the importance of data in ML. A nontrivial part of this book will be devoted to shedding light on various data questions.
Complex ML systems are made up of simpler building blocks. Now that we’ve covered the high-level overview of an ML system in production, we’ll zoom in to its building blocks in the following chapters, starting with the fundamentals of data engineering in the next chapter. If any of the challenges mentioned in this chapter seem abstract to you, I hope that specific examples in the following chapters will make them more concrete.
## Chapter 3. Data Engineering Fundamentals
This chapter is built on the foundations established in Chapter 2 around the importance of data in developing ML systems. In this chapter, we learned it’s important to choose the right format to store our data to make it easier to use the data in the future. We discussed different data formats and the pros and cons of row-major versus column-major formats as well as text versus binary formats.
We continued to cover three major data models: relational, document, and graph. Even though the relational model is the most well known given the popularity of SQL, all three models are widely used today, and each is good for a certain set of tasks.
When talking about the relational model compared to the document model, many people think of the former as structured and the latter as unstructured. The division between structured and unstructured data is quite fluid—the main question is who has to shoulder the responsibility of assuming the structure of data. Structured data means that the code that writes the data has to assume the structure. Unstructured data means that the code that reads the data has to assume the structure.
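The who-assumes-the-structure distinction can be made concrete with a toy Python sketch (the schema, field names, and helper functions here are all made up for illustration, not from the book):

```python
import json

# Schema-on-write: the writer enforces a structure before storing
# (as a relational table would). Writing fails if the record doesn't fit.
SCHEMA = {"user_id": int, "amount": float}

def write_structured(record: dict) -> str:
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"record violates schema at field {field!r}")
    return json.dumps(record)

# Schema-on-read: anything is written as-is (as with a document store or
# raw logs); the reader has to assume the structure and handle missing fields.
def read_unstructured(raw: str) -> float:
    record = json.loads(raw)
    return float(record.get("amount", 0.0))

row = write_structured({"user_id": 1, "amount": 9.99})
print(read_unstructured(row))            # 9.99
print(read_unstructured('{"foo": 1}'))   # 0.0 -- reader falls back on a default
```

In the first case the writing code carries the structural assumptions; in the second, every reader does.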
We continued the chapter with data storage engines and processing. We studied databases optimized for two distinct types of data processing: transactional processing and analytical processing. We studied data storage engines and processing together because traditionally storage is coupled with processing: transactional databases for transactional processing and analytical databases for analytical processing. However, in recent years, many vendors have worked on decoupling storage and processing. Today, we have transactional databases that can handle analytical queries and analytical databases that can handle transactional queries.
When discussing data formats, data models, data storage engines, and processing, data is assumed to be within a process. However, while working in production, you’ll likely work with multiple processes, and you’ll likely need to transfer data between them. We discussed three modes of data passing. The simplest mode is passing through databases. The most popular mode of data passing for processes is data passing through services. In this mode, a process is exposed as a service to which another process can send requests for data. This mode of data passing is tightly coupled with microservice architectures, where each component of an application is set up as a service.
A mode of data passing that has become increasingly popular over the last decade is data passing through a real-time transport like Apache Kafka and RabbitMQ. This mode of data passing is somewhere between passing through databases and passing through services: it allows for asynchronous data passing with reasonably low latency.
As data in real-time transports has different properties from data in databases, it requires different processing techniques, as discussed in the “Batch Processing Versus Stream Processing” section. Data in databases is often processed in batch jobs and produces static features, whereas data in real-time transports is often processed using stream computation engines and produces dynamic features. Some people argue that batch processing is a special case of stream processing, and stream computation engines can be used to unify both processing pipelines.
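As a toy contrast between the two (a sketch in plain Python, not tied to any particular engine's API, with made-up feature names): a static feature is recomputed by a batch job over a snapshot of the data, while a dynamic feature is refreshed on every incoming event.

```python
from collections import deque

# Static feature: computed by a batch job over a database snapshot,
# e.g. recomputed once a day.
def batch_avg_order_value(orders):
    return sum(orders) / len(orders)

# Dynamic feature: updated per event by a stream computation,
# here a simple sliding-window average over the last `window` events.
class SlidingAverage:
    def __init__(self, window):
        self.events = deque(maxlen=window)

    def update(self, value):
        self.events.append(value)
        return sum(self.events) / len(self.events)

stream = SlidingAverage(window=3)
for v in [1.0, 3.0, 5.0, 7.0]:
    latest = stream.update(v)  # feature value refreshes with every event
```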
## Chapter 4. Training Data
Training data still forms the foundation of modern ML algorithms. No matter how clever your algorithms might be, if your training data is bad, your algorithms won’t be able to perform well. It’s worth it to invest time and effort to curate and create training data that will enable your algorithms to learn something meaningful.
In this chapter, we’ve discussed the multiple steps to create training data. We first covered different sampling methods, both nonprobability sampling and random sampling, that can help us sample the right data for our problem.
Most ML algorithms in use today are supervised ML algorithms, so obtaining labels is an integral part of creating training data. Many tasks, such as delivery time estimation or recommender systems, have natural labels. Natural labels are usually delayed, and the time it takes from when a prediction is served until when the feedback on it is provided is the feedback loop length. Tasks with natural labels are fairly common in the industry, which might mean that companies prefer to start on tasks that have natural labels over tasks without natural labels.
For tasks that don’t have natural labels, companies tend to rely on human annotators to annotate their data. However, hand labeling comes with many drawbacks. For example, hand labels can be expensive and slow. To combat the lack of hand labels, we discussed alternatives including weak supervision, semi-supervision, transfer learning, and active learning.
ML algorithms work well when the data distribution is balanced, and not so well when the classes are heavily imbalanced. Unfortunately, class imbalance is the norm in the real world. We discussed why class imbalance makes it hard for ML algorithms to learn, and covered different techniques to handle it, from choosing the right metrics, to resampling data, to modifying the loss function to encourage the model to pay attention to certain samples.
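One loss-modification technique of the kind mentioned above, class-weighted cross-entropy, can be sketched in a few lines of NumPy (the function name and inverse-frequency weighting scheme here are illustrative, not the book's specific recipe):

```python
import numpy as np

# Class-weighted binary cross-entropy: each class is weighted by its
# inverse frequency, so errors on the rare class cost more.
def weighted_log_loss(y_true, y_pred, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    pos_frac = y_true.mean()                       # fraction of positive labels
    w_pos, w_neg = 1.0 / pos_frac, 1.0 / (1.0 - pos_frac)
    loss = -(w_pos * y_true * np.log(y_pred)
             + w_neg * (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()

# With a 1:3 imbalance, mispredicting the rare positive is penalized more
# heavily than mispredicting a common negative.
```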
We ended the chapter with a discussion on data augmentation techniques that can be used to improve a model’s performance and generalization for both computer vision and NLP tasks.
## Chapter 5. Feature Engineering
Because the success of today’s ML systems still depends on their features, it’s important for organizations interested in using ML in production to invest time and effort into feature engineering.
How to engineer good features is a complex question with no foolproof answers. The best way to learn is through experience: trying out different features and observing how they affect your models’ performance. It’s also possible to learn from experts. I find it extremely useful to read about how the winning teams of Kaggle competitions engineer their features to learn more about their techniques and the considerations they went through.
Feature engineering often involves subject matter expertise, and subject matter experts might not always be engineers, so it’s important to design your workflow in a way that allows nonengineers to contribute to the process.
Here is a summary of best practices for feature engineering:
* Split data by time into train/valid/test splits instead of doing it randomly.
* If you oversample your data, do it after splitting.
* Scale and normalize your data after splitting to avoid data leakage.
* Use statistics from only the train split, instead of the entire data, to scale your features and handle missing values.
* Understand how your data is generated, collected, and processed. Involve domain experts if possible.
* Keep track of your data’s lineage.
* Understand feature importance to your model.
* Use features that generalize well.
* Remove no longer useful features from your models.
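The first few practices above can be sketched in a handful of lines (a minimal NumPy illustration with hypothetical helper names, assuming one timestamp per row): split by time, then compute scaling statistics from the train split only and apply them everywhere.

```python
import numpy as np

# Split by time: the oldest train_frac of rows become the train split,
# the most recent rows become the test split.
def time_split(X, timestamps, train_frac=0.8):
    order = np.argsort(timestamps)          # oldest rows first
    cut = int(len(order) * train_frac)
    return X[order[:cut]], X[order[cut:]]

# Fit scaling statistics on the train split only, to avoid leakage.
def fit_scaler(X_train):
    return X_train.mean(axis=0), X_train.std(axis=0) + 1e-8

def transform(X, mean, std):
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
ts = np.arange(100)                          # one timestamp per row
X_tr, X_te = time_split(X, ts)
mean, std = fit_scaler(X_tr)                 # statistics from train split only
X_tr_s, X_te_s = transform(X_tr, mean, std), transform(X_te, mean, std)
```

The key point is that `fit_scaler` never sees the test split; the same idea applies to imputing missing values and to oversampling, which should also happen after the split.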
With a set of good features, we’ll move to the next part of the workflow: training ML models. Before we move on, I just want to reiterate that moving to modeling doesn’t mean we’re done with handling data or feature engineering. We are never done with data and features. In most real-world ML projects, the process of collecting data and feature engineering goes on as long as your models are in production. We need to use new, incoming data to continually improve models, which we’ll cover in Chapter 9.
## Chapter 6. Model Development and Offline Evaluation
In this chapter, we’ve covered the ML algorithm part of ML systems, which many ML practitioners consider to be the most fun part of an ML project lifecycle. With the initial models, we can bring to life (in the form of predictions) all our hard work in data and feature engineering, and can finally evaluate our hypothesis (e.g., we can predict the outputs given the inputs).
We started with how to select ML models best suited for our tasks. Instead of going into pros and cons of each individual model architecture—which is a fool’s errand given the growing pools of existing models—the chapter outlined the aspects you need to consider to make an informed decision on which model is best for your objectives, constraints, and requirements.
We then continued to cover different aspects of model development. We covered not only individual models but also ensembles of models, a technique widely used in competitions and leaderboard-style research.
During the model development phase, you might experiment with many different models. Intensive tracking and versioning of your many experiments are generally agreed to be important, but many ML engineers still skip them because they can feel like a chore. Therefore, having tools and appropriate infrastructure to automate the tracking and versioning process is essential. We’ll cover tools and infrastructure for ML production in Chapter 10.
As models today are getting bigger and consuming more data, distributed training is becoming an essential skill for ML model developers, and we discussed techniques for parallelism including data parallelism, model parallelism, and pipeline parallelism. Making your models work on a large distributed system, like the one that runs models with hundreds of millions, if not billions, of parameters, can be challenging and require specialized system engineering expertise.
We ended the chapter with how to evaluate your models to pick the best one to deploy. Evaluation metrics don’t mean much unless you have a baseline to compare them to, and we covered different types of baselines you might want to consider for evaluation. We also covered a range of evaluation techniques necessary to sanity check your models before further evaluating your models in a production environment.
Often, no matter how good your offline evaluation of a model is, you still can’t be sure of your model’s performance in production until that model has been deployed. In the next chapter, we’ll go over how to deploy a model.
## Chapter 7. Model Deployment and Prediction Service
Congratulations, you’ve finished possibly one of the most technical chapters in this book! The chapter is technical because deploying ML models is an engineering challenge, not an ML challenge.
We’ve discussed different ways to deploy a model, comparing online prediction with batch prediction, and ML on the edge with ML on the cloud. Each way has its own challenges. Online prediction makes your model more responsive to users’ changing preferences, but you have to worry about inference latency. Batch prediction is a workaround for when your models take too long to generate predictions, but it makes your model less flexible.
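The trade-off can be sketched with a hypothetical `predict` function standing in for an expensive model call: batch prediction precomputes results into a lookup table on a schedule, while online prediction runs the model at request time:

```python
def predict(user_features):
    """Stand-in for an expensive model call."""
    return sum(user_features) * 0.1

# Batch prediction: precompute predictions for all known users on a schedule;
# serving is then a fast table lookup, but predictions can go stale.
users = {"u1": [1.0, 2.0], "u2": [3.0, 4.0]}
precomputed = {uid: predict(feats) for uid, feats in users.items()}

def serve_batch(uid):
    return precomputed[uid]

# Online prediction: run the model per request (fresh, but latency-bound).
def serve_online(uid):
    return predict(users[uid])

print(serve_batch("u1") == serve_online("u1"))  # True
```

The two agree here because the features haven't changed; once they do, only the online path reflects the change until the next batch job runs.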
Similarly, doing inference on the cloud is easy to set up, but network latency and cloud costs can make it impractical. Doing inference on the edge requires edge devices with sufficient compute power, memory, and battery.
However, I believe that most of these challenges are due to the limitations of the hardware that ML models run on. As hardware becomes more powerful and optimized for ML, I believe that ML systems will transition to making online prediction on-device.
I used to think that an ML project is done once the model is deployed, and I hope this chapter has made clear that I was seriously mistaken. Moving a model from the development environment to the production environment creates a whole host of new problems. The first is how to keep that model performing well in production. In the next chapter, we'll discuss how our models might fail in production, and how to continually monitor them to detect and address issues as fast as possible.
## Chapter 8. Data Distribution Shifts and Monitoring
This might have been the most challenging chapter for me to write in this book. The reason is that despite the importance of understanding how and why ML systems fail in production, the literature surrounding it is limited. We usually think of research preceding production, but this is an area of ML where research is still trying to catch up with production.
To understand failures of ML systems, we differentiated between two types of failures: software systems failures (failures that also happen to non-ML systems) and ML-specific failures. Even though the majority of ML failures today are non-ML-specific, as tooling and infrastructure around MLOps matures, this might change.
We discussed three major causes of ML-specific failures: production data differing from training data, edge cases, and degenerate feedback loops. The first two causes are related to data, whereas the last is related to system design, because it happens when a system's outputs influence that same system's inputs.
We zeroed in on one failure mode that has gathered much attention in recent years: data distribution shifts. We looked into three types of shift: covariate shift, label shift, and concept drift. Even though studying distribution shifts is a growing subfield of ML research, the research community hasn't yet settled on a standard narrative: different papers call the same phenomena by different names, and many studies still assume that we know in advance how the distribution will shift, or that we have labels for data from both the source and target distributions. In reality, we don't know what future data will look like, and obtaining labels for new data can be costly, slow, or simply infeasible.
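One common way to flag a covariate shift in a single feature is a two-sample test comparing the feature's training distribution against its recent production distribution. A minimal sketch of the two-sample Kolmogorov–Smirnov statistic, with hypothetical data:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

train_feature = [0.1 * i for i in range(100)]   # source distribution
no_shift      = [0.1 * i for i in range(100)]   # same distribution
shifted       = [0.1 * i + 5.0 for i in range(100)]  # covariate shift

print(ks_statistic(train_feature, no_shift))       # 0.0
print(ks_statistic(train_feature, shifted) > 0.4)  # True: likely shift
```

In practice you would also compute a p-value, be wary of false alarms when running many such tests, and note that one-dimensional tests don't cover shifts in high-dimensional inputs.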
To be able to detect shifts, we need to monitor our deployed systems. Monitoring is an important set of practices for any software engineering system in production, not just ML, and it’s an area of ML where we should learn as much as we can from the DevOps world.
Monitoring is all about metrics. We discussed the different metrics we need to monitor: operational metrics (those that should be monitored for any software system, such as latency, throughput, and CPU utilization) and ML-specific metrics. ML-specific monitoring can be applied to accuracy-related metrics, predictions, features, and/or raw inputs.
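Operational metrics are usually summarized over a sliding window, and for latency the tail percentiles matter far more than the mean. A hypothetical nearest-rank percentile sketch:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (p in (0, 100]) of a list of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies (ms) from a monitoring window
window = [12, 11, 13, 12, 250, 12, 14, 11, 13, 12]

print(percentile(window, 50))  # 12  -- the median looks healthy
print(percentile(window, 99))  # 250 -- the tail reveals a slow request
```

This is why monitoring dashboards typically plot p90/p95/p99 latency rather than averages: a single slow request vanishes in the mean but dominates the tail.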
Monitoring is hard because even if it’s cheap to compute metrics, understanding metrics isn’t straightforward. It’s easy to build dashboards to show graphs, but it’s much more difficult to understand what a graph means, whether it shows signs of drift, and, if there’s drift, whether it’s caused by an underlying data distribution change or by errors in the pipeline. An understanding of statistics might be required to make sense of the numbers and graphs.
Detecting a model's performance degradation in production is only the first step. The next step is adapting our systems to changing environments, which we'll discuss in the next chapter.
## Chapter 9. Continual Learning and Test in Production
This chapter touched on a topic that I believe is among the most exciting yet underexplored: how to continually update your models in production to adapt them to changing data distributions. We discussed the four stages a company might go through in modernizing its infrastructure for continual learning, from manual, train-from-scratch retraining to automated, stateful continual learning.
We then examined the question that haunts ML engineers at companies of all shapes and sizes, “How often _should_ I update my models?” by urging them to consider the value of data freshness to their models and the trade-offs between model iteration and data iteration.
Similar to online prediction, discussed in Chapter 7, continual learning requires a mature streaming infrastructure. The training part of continual learning can be done in batch, but the online evaluation part requires streaming. Many engineers worry that streaming is hard and costly. That was true a few years ago, but streaming technologies have matured significantly since, and more and more companies are providing solutions that make the move to streaming easier, including Spark Streaming, Snowflake Streaming, Materialize, Decodable, Vectorize, etc.
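The online evaluation part boils down to maintaining metrics over a stream of (prediction, label) events rather than over a static test set. A hypothetical sketch with made-up events:

```python
def streaming_accuracy(events):
    """Running accuracy after each (prediction, label) event in a stream."""
    correct = total = 0
    history = []
    for pred, label in events:
        total += 1
        correct += (pred == label)
        history.append(correct / total)  # accuracy so far, after this event
    return history

# Hypothetical stream of (model prediction, true label) pairs
stream = [(1, 1), (0, 1), (1, 1), (1, 1)]
print(streaming_accuracy(stream))  # running accuracy, ending at 0.75
```

A real streaming system would compute this over windows with watermarking to handle late-arriving labels, which is where engines like the ones listed above come in.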
Continual learning is a problem specific to ML, but it largely requires an infrastructural solution. To be able to speed up the iteration cycle and detect failures in new model updates quickly, we need to set up our infrastructure in the right way. This requires the data science/ML team and the platform team to work together. We’ll discuss infrastructure for ML in the next chapter.
## Chapter 10. Infrastructure and Tooling for MLOps
If you’ve stayed with me until now, I hope you agree that bringing ML models to production is an infrastructural problem. To enable data scientists to develop and deploy ML models, it’s crucial to have the right tools and infrastructure set up.
In this chapter, we covered the different layers of infrastructure needed for ML systems. We started with the storage and compute layer, which provides vital resources for any engineering project, like an ML project, that requires intensive data and compute. The storage and compute layer is heavily commoditized: most companies pay cloud services for exactly the storage and compute they use instead of setting up their own data centers. However, while cloud providers make it easy for a company to get started, their costs can become prohibitive as the company grows, and more and more large companies are looking into repatriating from the cloud to private data centers.
We then continued on to discuss the development environment where data scientists write code and interact with the production environment. Because the dev environment is where engineers spend most of their time, improvements in the dev environment translate directly into improvements in productivity. One of the first things a company can do to improve the dev environment is to standardize the dev environment for data scientists and ML engineers working on the same team. We discussed in this chapter why standardization is recommended and how to do so.
We then discussed an infrastructural topic whose relevance to data scientists has been debated heavily in the last few years: resource management. Resource management is important to data science workflows, but the question is whether data scientists should be expected to handle it. In this section, we traced the evolution of resource management tools from cron to schedulers to orchestrators. We also discussed why ML workflows are different from other software engineering workflows and why they need their own workflow management tools. We compared various workflow management tools such as Airflow, Argo, and Metaflow.
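At their core, the orchestrators mentioned above (Airflow, Argo, Metaflow) execute a DAG of tasks in dependency order. A hypothetical sketch of that core idea using Kahn's topological-sort algorithm, with a made-up ML workflow:

```python
from collections import deque

def run_order(deps):
    """deps: task -> list of upstream tasks that must finish first."""
    indegree = {task: len(upstream) for task, upstream in deps.items()}
    downstream = {task: [] for task in deps}
    for task, upstream in deps.items():
        for dep in upstream:
            downstream[dep].append(task)
    ready = deque(sorted(t for t, n in indegree.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle in workflow")
    return order

# Hypothetical ML workflow: featurize depends on ingest, and so on
workflow = {
    "ingest": [],
    "featurize": ["ingest"],
    "train": ["featurize"],
    "evaluate": ["train"],
}
print(run_order(workflow))  # ['ingest', 'featurize', 'train', 'evaluate']
```

Real orchestrators layer retries, scheduling, distributed execution, and observability on top of this ordering logic.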
The ML platform team is one that has emerged recently as ML adoption matures. Since it's an emerging concept, there is still disagreement on what an ML platform should consist of. We chose to focus on the three sets of tools essential to most ML platforms: deployment, model store, and feature store. We skipped monitoring, since it was already covered in Chapter 8.
When working on infrastructure, a question constantly haunts engineering managers and CTOs alike: build or buy? We ended this chapter with a few discussion points that I hope can provide you or your team with sufficient context to make those difficult decisions.
## Chapter 11. The Human Side of Machine Learning
Despite the technical nature of ML solutions, designing ML systems can't be confined to the technical domain. ML systems are developed by humans, used by humans, and leave their mark on society. In this chapter, we deviated from the technical theme of the previous eight chapters to focus on the human side of ML.
We first focused on how the probabilistic, mostly correct, and high-latency nature of ML systems can affect user experience. The probabilistic nature can lead to inconsistent user experiences, which can cause frustration: "Hey, I just saw this option right here, and now I can't find it anywhere." The mostly correct nature of an ML system can render it useless if users can't easily correct its wrong predictions. To counter this, you might want to show users multiple "most correct" predictions for the same input, in the hope that at least one of them will be right.
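Surfacing several "most correct" predictions usually means returning the model's top-k classes by predicted probability. A hypothetical sketch with made-up class probabilities:

```python
def top_k(probabilities, k):
    """Return the k labels the model is most confident about, best first."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:k]]

# Hypothetical class probabilities from a classifier
probs = {"jazz": 0.08, "blues": 0.34, "rock": 0.41, "pop": 0.17}
print(top_k(probs, 3))  # ['rock', 'blues', 'pop']
```

Showing three candidates instead of one raises the chance that the user's intended answer is on screen, at the cost of a busier interface.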
Building an ML system often requires multiple skill sets, and an organization might wonder how to distribute them: involve different teams with different skill sets, or expect one team (e.g., data scientists) to have all of them. We explored the pros and cons of both approaches. The main con of the first approach is communication overhead. The main con of the second is that it's difficult to hire data scientists who can own the development of an ML system end-to-end, and even those who can might not be happy doing it. However, the second approach can work if these end-to-end data scientists are given sufficient tools and infrastructure, which was the focus of Chapter 10.
We ended the chapter with what I believe to be the most important topic of this book: responsible AI. Responsible AI is no longer just an abstraction, but an essential practice in today’s ML industry that merits urgent actions. Incorporating ethics principles into your modeling and organizational practices will not only help you distinguish yourself as a professional and cutting-edge data scientist and ML engineer but also help your organization gain trust from your customers and users. It will also help your organization obtain a competitive edge in the market as more and more customers and users emphasize their need for responsible AI products and services.
It is important not to treat responsible AI as merely a box-checking activity undertaken to meet compliance requirements. The framework proposed in this chapter will help you meet your organization's compliance requirements, but it is no replacement for critical thinking about whether a product or service should be built in the first place.