Which functions are used for model evaluation?


Machine learning (ML) models have enormous potential, but realizing it requires careful monitoring and evaluation. Without sound evaluation methods and appropriate metrics, an ML model can degrade so subtly that by the time it begins making inaccurate predictions, it is hard to deduce why. Take this example from 2009, when a camera’s face recognition algorithm struggled to register darker skin. Model degradation can happen for any number of reasons, but it’s likely that the usual suspects (data drift, model bias, and missed inaccuracies) were involved. Flaws in ML models can look small during development but cause larger errors after deployment, errors that may seem obvious to an end user.

Model monitoring protects against the inevitable drift of ML models, and model evaluation is the part of monitoring that measures how well your model is actually performing. What are the general steps of model evaluation? That depends on the type of model being assessed.

What is model evaluation?

Model evaluation involves assessing an ML model’s performance using specific metrics and functions tailored to its type. This typically requires a “ground truth” dataset, such as annotated samples or real-world user feedback. Different evaluation methods and metrics come into play depending on the model type — such as classification versus regression.

Model evaluation techniques

The foundation of model evaluation is knowing how to measure model performance effectively. Typically, this process starts with a set of “ground truth” data, such as an annotated dataset or live user feedback. From there, the appropriate evaluation function is applied based on the type of model being assessed. To illustrate these differences, let’s explore two common types of models.

Classification: Assigns inputs to specific categories (e.g., spam email detection)
Regression: Predicts a continuous numeric value from its inputs (e.g., weather forecasting)

Both types face challenges when dealing with real-world data, as they rely heavily on past examples. ML models must process a stream of ever-changing data, and if any incoming data is unfamiliar to the model, it can only guess based on what it already knows. This is why ML monitoring is challenging: a model that performs well today may not perform well tomorrow. Thus, establishing model monitoring best practices and choosing the right model evaluation metrics is paramount to success.

Model evaluation techniques for classification models

Properly evaluating classification models relies on understanding a table known as a confusion matrix. For a model that chooses between two categories, this is a four-quadrant table with the following associated categories:

True positive (TP)
True negative (TN)
False positive (FP)
False negative (FN)

A classification model determines the nature of an input based upon predetermined categories. If we use the spam detector example, there are only two options for sorting incoming emails: spam (1) or not spam (0). When the model correctly sorts something as spam — or not spam — it is issuing a true positive or negative, respectively. Similarly, when the model incorrectly identifies something as spam or not spam, it issues a false positive or negative, respectively.
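To make the four categories concrete, here is a minimal sketch in plain Python that tallies them for the spam example. The function name `confusion_counts` and the eight sample labels are made up for illustration; they are not taken from any particular library or data set.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, and FN for binary labels (1 = spam, 0 = not spam)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # spam correctly flagged as spam
        elif predicted == 0 and actual == 0:
            tn += 1  # legitimate email correctly let through
        elif predicted == 1 and actual == 0:
            fp += 1  # legitimate email wrongly flagged as spam
        else:
            fn += 1  # spam wrongly let through
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical ground truth and model output for eight emails
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

The four counts always add up to the number of predictions, which is a quick sanity check when building the table by hand.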

Now that we’ve established how a confusion matrix is organized, let’s examine a few example functions:

True Positive Rate: Also known as recall, this shows the percentage of positives that your model classified correctly out of all positives in the data set:
$$\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

False Positive Rate: The counterpart of the true positive rate, this shows the percentage of negatives that your model incorrectly classified as positives out of all negatives in the data set. The formula has the same form:
$$\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$$

Accuracy: The share of all predictions that the model classified correctly. Keep in mind that imbalanced datasets can heavily skew this metric. For example, if your spam detector is better at registering true positives (i.e., accurately categorizing spam as spam) than true negatives, then a dataset of mostly spam will inflate its accuracy. The formula is as follows:
$$\text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}}$$
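The three formulas translate directly into code. The sketch below uses hypothetical function names and a made-up, heavily imbalanced example: 95 spam emails, 5 legitimate ones, and a model that labels everything as spam. Returning 0.0 when a denominator is zero is a convention chosen here for simplicity, not a standard.

```python
def true_positive_rate(tp, fn):
    """TPR (recall) = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + FP + TN + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for a model that flags every email as spam:
# all 95 spam emails are caught, all 5 legitimate emails are misflagged.
tp, fn, fp, tn = 95, 0, 5, 0

print(true_positive_rate(tp, fn))   # 1.0
print(false_positive_rate(fp, tn))  # 1.0
print(accuracy(tp, tn, fp, fn))     # 0.95
```

Accuracy comes out at 0.95 even though the model misclassifies every legitimate email, which is why accuracy is usually read alongside the true and false positive rates.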

Each of these formulas carries a different level of significance depending on the kind of model you’re developing. For example, monitoring the true positive rate is highly important in fraud detection, where a missed case of fraud (a false negative) is costly. Despite their varying priority, these formulas, among others, give you a fuller picture of how your model is working.

Model evaluation techniques for regression models

Regression models play a critical role in statistical analysis. Their evaluation formulas are more involved than those for classification models, so we’ll just take a surface-level look at a few metrics calculated for regression models and the purpose each one serves:

Coefficient of Determination (R-squared): The proportion of the variance in the observed data that your model explains, which indicates how well the model aligns with real-world data. If the model is a perfect fit, this value is 1.
Mean squared error (MSE): The average of the squared differences between the model’s predictions and the observed data. In a perfect world this value is 0, but that will essentially never be the case.
Mean absolute error (MAE): The average absolute difference between the model’s predictions and the observed data. What makes it “absolute” is that only the size of each error is measured; whether the error is positive or negative is disregarded.
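As a rough illustration, the three regression metrics can be computed in a few lines of plain Python using their standard textbook definitions. The function names and the four temperature readings below are hypothetical. Note that `r_squared` as written assumes the observed values are not all identical, since that would make its denominator zero.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences; the sign of each error is ignored."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed and forecast temperatures
observed = [20.0, 22.0, 19.0, 25.0]
predicted = [21.0, 21.0, 20.0, 23.0]

print(mean_squared_error(observed, predicted))   # 1.75
print(mean_absolute_error(observed, predicted))  # 1.25
print(round(r_squared(observed, predicted), 3))  # 0.667
```

Because MSE squares each error, the single 2-degree miss contributes more to it than the three 1-degree misses combined, while MAE weights every degree of error equally.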

Why is AI model evaluation important?

Effective machine learning model evaluation is not a one-time task but a continuous process that evolves over time with changing environments, shifting user behavior, and evolving data patterns. Without ongoing ML model evaluation, organizations risk degradation in model performance, leading to inaccurate predictions and potential operational failures.

Key reasons for continuous evaluation include:

Maintaining reliability
Adapting to data drift
Detecting and mitigating bias
Improving decision-making

Through consistent ML model evaluation, organizations can safeguard the performance of their machine learning models, ensuring they remain reliable, fair, and aligned with business goals.
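One simple way to picture continuous evaluation is to score the model on successive batches of labeled predictions and flag any batch that falls too far below a baseline. The sketch below is only an illustration under stated assumptions: the function names, the window size, the baseline, and the tolerance are all hypothetical, and a production monitoring setup would track far more than accuracy.

```python
def accuracy_by_window(y_true, y_pred, window_size):
    """Return accuracy for each consecutive, non-overlapping window of predictions."""
    scores = []
    for start in range(0, len(y_true), window_size):
        actual = y_true[start:start + window_size]
        predicted = y_pred[start:start + window_size]
        correct = sum(1 for a, p in zip(actual, predicted) if a == p)
        scores.append(correct / len(actual))
    return scores


def degraded_windows(scores, baseline, tolerance):
    """Return the indexes of windows whose accuracy is more than `tolerance` below `baseline`."""
    return [i for i, score in enumerate(scores) if baseline - score > tolerance]


# Hypothetical stream of twelve labeled predictions, oldest first
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1]

scores = accuracy_by_window(y_true, y_pred, window_size=4)
print(scores)  # [1.0, 0.75, 0.25]
print(degraded_windows(scores, baseline=1.0, tolerance=0.3))  # [2]
```

Here the model starts out perfect, slips slightly in the second window, and falls well below the baseline in the third, which is the point at which an alert would fire.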

Stay on top of your ML model’s performance

The formulas and models discussed above are the foundation for effectively assessing performance. Consistent model evaluation in machine learning ensures your models deliver accurate and reliable results over time. To achieve this, model monitoring must be continuous throughout the entire MLOps lifecycle—even after deployment—to protect high-performing models and adapt to evolving real-world data.

The Fiddler AI Observability platform helps ensure your machine learning models stay accurate, reliable, and high-performing. Streamline ML model evaluation, catch issues early, and optimize performance. Explore Fiddler today!