Machine learning (ML) models have enormous potential, but realizing it requires careful monitoring and evaluation. Without sound evaluation methods and proper metrics, an ML model can degrade so subtly that by the time it begins making inaccurate predictions, it is hard to deduce why. Take this example from 2009, when a camera's face recognition algorithm struggled to register darker skin. This could have happened for any number of reasons, but the usual suspects (data drift, model bias, and missed inaccuracies) were likely involved. Flaws that seem small during development can produce glaring errors in production, errors that are often obvious to an end user.
Model monitoring protects against the inevitable drift of ML models, and model evaluation is central to properly assessing a model's performance. What are the general steps of model evaluation? That depends on the type of model being assessed.
The foundation of model evaluation methods is knowing how to measure model performance. Typically, evaluation uses a set of "ground truth" data, which is either an annotated data set or live feedback from users. From there, the model is measured differently depending on its type. To illustrate these differences, we'll examine two common types of models:

- A classification model, such as a spam detector that sorts incoming email into "spam" or "not spam."
- A regression model, which predicts a continuous numeric value rather than a category.
Both types of models are challenged by real-world data because they, like all models, rely on past examples. ML models must process a stream of ever-changing data, and when incoming data is unfamiliar, a model can only guess based on what it already knows. This is why ML monitoring is challenging: a model that performs well today may not perform well tomorrow. Establishing model monitoring best practices and choosing the right model evaluation metrics is therefore paramount to success.
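To make this concrete, here is a minimal sketch of one common monitoring pattern (an illustrative example, not a prescribed standard): periodically score the model on a fresh window of labeled data and alert when accuracy falls too far below the baseline measured at deployment. The window data, baseline value, and 5% tolerance are all made-up assumptions.

```python
# Minimal monitoring sketch: compare accuracy on recent data to a baseline.
# The baseline, window, and tolerance below are illustrative assumptions.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def check_for_degradation(baseline_acc, y_true_window, y_pred_window, tolerance=0.05):
    """Flag the model if accuracy on a fresh window falls more than
    `tolerance` below the accuracy measured at deployment time."""
    window_acc = accuracy(y_true_window, y_pred_window)
    degraded = window_acc < baseline_acc - tolerance
    return window_acc, degraded

# Example: baseline accuracy of 0.92 at deployment, 0.60 on this toy window.
window_acc, degraded = check_for_degradation(0.92, [1, 0, 1, 1, 0], [1, 0, 0, 0, 0])
if degraded:
    print(f"Alert: accuracy dropped to {window_acc:.2f}; investigate for drift.")
```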
Properly evaluating classification models relies on understanding a table known as a confusion matrix. This is a four-quadrant table with the following associated categories:

- True positive (TP): the model correctly predicts the positive class.
- False positive (FP): the model incorrectly predicts the positive class.
- True negative (TN): the model correctly predicts the negative class.
- False negative (FN): the model incorrectly predicts the negative class.
A classification model assigns each input to one of a set of predetermined categories. In the spam detector example, there are only two options for sorting incoming email: spam (1) or not spam (0). When the model correctly labels an email as spam or not spam, it produces a true positive or true negative, respectively. Likewise, when the model incorrectly labels an email as spam or not spam, it produces a false positive or false negative, respectively.
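To make this concrete, here is a minimal sketch that builds a confusion matrix for a toy version of the spam example using scikit-learn; the labels and predictions are made-up illustrative data.

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions for a spam detector: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")  # TP=3  FP=1  TN=3  FN=1
```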
Now that we've established how a confusion matrix is organized, let's examine a few example functions, each computed from the four counts above:

- Accuracy = (TP + TN) / (TP + TN + FP + FN): the overall fraction of correct predictions.
- Precision = TP / (TP + FP): of everything the model flagged as positive, how much actually was.
- Recall (true positive rate) = TP / (TP + FN): of everything that was actually positive, how much the model caught.
Each of these formulas carries a different level of significance depending on the kind of model you're developing. For example, monitoring the true positive rate is highly important in fraud detection, where failing to catch a fraudulent transaction is costly. Despite the varying priority each formula may hold, these formulas, among others, provide a comprehensive picture of how your model is working.
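Continuing the toy spam example, here is a minimal sketch that computes these metrics directly from the confusion-matrix counts; the counts are the illustrative values from above, not output from a real model.

```python
# Counts from the toy confusion matrix above.
tp, fp, tn, fn = 3, 1, 3, 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # overall fraction correct
precision = tp / (tp + fp)                    # flagged emails that were truly spam
recall    = tp / (tp + fn)                    # true positive rate: spam actually caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.75 precision=0.75 recall=0.75
```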
Regression models play a critical role in statistical analysis. Their formulas are more involved than those for classification models, so we'll take a surface-level look at a few data points commonly calculated for regression models and the purpose each serves:

- Mean squared error (MSE): the average of the squared differences between predicted and actual values; it penalizes large errors heavily.
- Root mean squared error (RMSE): the square root of MSE, which puts the error back in the same units as the target.
- Mean absolute error (MAE): the average absolute difference between predicted and actual values; less sensitive to outliers than MSE.
- R-squared (R²): the proportion of variance in the target that the model explains.
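As a quick illustration, here is a minimal sketch computing these four metrics with scikit-learn's built-in functions; the target values and predictions are made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy ground-truth values and model predictions (illustrative data only).
y_true = np.array([200, 150, 320, 275])
y_pred = np.array([210, 140, 300, 280])

mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                          # back in the units of the target
mae  = mean_absolute_error(y_true, y_pred)
r2   = r2_score(y_true, y_pred)

print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R^2={r2:.3f}")
# MSE=156.25 RMSE=12.50 MAE=11.25 R^2=0.964
```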
The formulas and metrics above are just the beginning of properly assessing model performance. Model monitoring must be maintained throughout the MLOps lifecycle, including after the model is deployed to end users, to keep models performing well.