
Machine learning (ML) models have boundless potential, but realizing it requires careful monitoring and evaluation. Without good model evaluation methods or proper metrics, ML models can degrade so subtly that by the time a model begins making inaccurate predictions, it can be hard to deduce why. Take this example from 2009, when a camera’s face recognition algorithm struggled to register darker skin. This could have happened for any number of reasons, but it’s likely that the usual suspects (data drift, model bias, and missed inaccuracies) were involved. Flaws in ML models can look small during development but cause larger errors in production, where they may be obvious to an end user.

Model monitoring protects against the inevitable drift of ML models, and model evaluation is the part of monitoring that tells you how well a model is actually performing. What are the general steps of model evaluation? That depends on the type of model being assessed.

The foundation of model evaluation methods is knowing how to measure model performance. Typically, evaluation uses a set of “ground truth” data, which is either an annotated data set or live feedback from users. From there, the model is measured differently depending on its type. To illustrate these differences, we’ll examine two common types of models:

- **Classification**: labels inputs as belonging to one or more categories (ex: spam email sorting)
- **Regression**: predicts a continuous numeric value from its inputs (ex: weather forecasting)

Both examples above are challenged when presented with real-world data, because they, like all models, rely on past examples. ML models must process a stream of ever-changing data, and if any incoming data is unfamiliar to the model, it can only guess based on what it already knows. This is why ML monitoring is challenging: a model that performs well today may not perform well tomorrow. Thus, establishing model monitoring best practices and choosing the right model evaluation metrics is paramount to success.

Properly evaluating classification models relies on understanding a table known as a *confusion matrix*. This is a four-quadrant table with the following associated categories:

- True positive (TP)
- True negative (TN)
- False positive (FP)
- False negative (FN)

A classification model determines the nature of an input based upon predetermined categories. If we use the spam detector example, there are only two options for sorting incoming emails: spam (1) or not spam (0). When the model correctly sorts something as spam — or not spam — it is issuing a *true positive or negative*, respectively. Similarly, when the model incorrectly identifies something as spam or not spam, it issues a *false positive or negative*, respectively.
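The four categories above can be tallied directly from paired labels. The following is a minimal sketch, assuming binary labels where 1 means spam and 0 means not spam; the function name `confusion_counts` and the sample labels are hypothetical and only serve as illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP, and FN for a binary classifier.

    y_true holds the ground-truth labels, y_pred the model's predictions.
    """
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Model said "spam": correct if the email really was spam.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Model said "not spam": a miss if the email really was spam.
            counts["FN" if actual == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}


# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(y_true, y_pred))
# → {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

In this made-up sample, one spam email slipped through (a false negative) and one legitimate email was flagged (a false positive).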

Now that we’ve established how a confusion matrix is organized, let’s examine a few metrics calculated from it:

- True Positive Rate: Also known as recall, this shows the percentage of positives that your model classified correctly out of all positives in the data set:

$$\text{TPR} = \frac{\text{TP}}{\text{TP + FN}}$$

- False Positive Rate: The foil of the above formula, it shows the percentage of actual negatives that your model incorrectly classified as positives. The formula is similar to true positive rate’s formula:

$$\text{FPR} = \frac{\text{FP}}{\text{FP + TN}}$$

- Accuracy: A model accuracy formula gives the fraction of all predictions that were classified correctly. Keep in mind that with this formula, imbalanced datasets can heavily skew the results. For example, if your spam detector is better at registering true positives (i.e. accurately categorizing spam as spam), then a dataset of mostly spam will pad its results. The formula is as follows:

$$\text{ACC} = \frac{\text{TP+TN}}{\text{TP+FP+TN+FN}}$$
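The three formulas above translate into a few lines of code. This is a minimal sketch rather than a reference implementation: the function name `classification_rates` and the counts passed to it are hypothetical. The example is deliberately imbalanced (95 spam emails, 5 legitimate ones) and assumes a model that labels everything as spam, to show how accuracy can look healthy while the false positive rate is as bad as it can get.

```python
def classification_rates(tp, tn, fp, fn):
    """Compute TPR, FPR, and accuracy from confusion-matrix counts.

    Returns NaN for a rate whose denominator is zero (for example,
    TPR when the data contains no actual positives).
    """
    nan = float("nan")
    total = tp + tn + fp + fn
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else nan,  # TP / (TP + FN)
        "FPR": fp / (fp + tn) if (fp + tn) else nan,  # FP / (FP + TN)
        "ACC": (tp + tn) / total if total else nan,   # (TP + TN) / all
    }


# Hypothetical counts: 95 spam and 5 legitimate emails,
# with a model that flags every single email as spam.
rates = classification_rates(tp=95, tn=0, fp=5, fn=0)
print(rates)
# → {'TPR': 1.0, 'FPR': 1.0, 'ACC': 0.95}
```

Accuracy comes out at 95% even though every legitimate email was misclassified, which is why it helps to read accuracy alongside the true and false positive rates.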

Each of these formulas carries a different level of significance depending on the kind of model you’re developing. For example, monitoring the true positive rate is highly important in fraud detection, where a missed positive is a fraudulent transaction that goes through. Whatever priority each formula holds for your use case, these formulas, among others, together give a much fuller picture of how your model is working than any one of them alone.

Regression models play a critical role in statistical analysis. Their evaluation formulas are more involved than those for classification models, so we’ll just take a surface-level look at a few metrics calculated for regression models, and what purpose each one serves:

- **Coefficient of Determination (R-squared)**: this is, statistically, how well your model aligns with real-world data, expressed as the share of the variation in the observed values that the model explains. If the model is a perfect fit, this value would be 1.
- **Mean squared error (MSE)**: the average of the squared differences between the model’s predictions and the observed data. Squaring penalizes large errors more heavily than small ones. In a perfect world, this value is 0, but that will essentially never be the case.
- **Mean absolute error (MAE)**: the average absolute difference between the model’s predictions and the observed data. What makes it “absolute” is that it only measures absolute values, meaning whether an error is positive or negative is disregarded.
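To make the three regression metrics concrete, here is a minimal sketch that computes them with the standard textbook definitions. The function name `regression_metrics` and the temperature values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R-squared for paired observations and predictions."""
    n = len(y_true)
    errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

    mse = sum(e ** 2 for e in errors) / n   # average squared error
    mae = sum(abs(e) for e in errors) / n   # average absolute error

    # R-squared = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)
    # R-squared is undefined when every observed value is identical.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")

    return {"MSE": mse, "MAE": mae, "R2": r2}


# Hypothetical observed vs. forecast temperatures
observed = [20.0, 22.0, 19.0, 25.0]
forecast = [21.0, 21.0, 20.0, 24.0]

metrics = regression_metrics(observed, forecast)
print(metrics["MSE"], metrics["MAE"], round(metrics["R2"], 3))
# → 1.0 1.0 0.81
```

Every forecast here is off by exactly one degree, so MSE and MAE are both 1.0, while R-squared lands at roughly 0.81 because most of the variation in the observed temperatures is still captured.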

The above formulas and models are just the beginning of properly assessing model performance. Model monitoring is important to maintain throughout the MLOps lifecycle, including after the model is deployed for end users, to ensure high-performing models.