ML Model Monitoring Best Practices

The machine learning (ML) market continues to show impressive growth, expanding from $21.17 billion in 2022 to an anticipated $209.91 billion by 2029. The 2022 Gartner CIO and Technology Executive Survey points the same way: 48% of CIOs said they have deployed, or plan to deploy, AI and machine learning technologies within their organizations.

As business leaders and MLOps teams build a promising, ML-based future, they need to consider the importance of model monitoring. By tracking a model’s performance in production, ML teams create a feedback loop to foster better prediction accuracy and long-term value.

The frustrating thing about model degradation is its subtle nature. A minor change in a model's performance can result in major problems for the business. With the right tools and best practices, however, model monitoring can catch any issues before they escalate.

But before getting to model monitoring techniques and best practices for monitoring ML models in production, let’s first lay some groundwork on the power and consequences of ML.

Why is ML model monitoring important?

ML now supports a wide range of processes in the business world and in our personal lives. These range from simple tasks, like recommending movies on a streaming service, to life-altering judgments, such as health care risk predictions.

As ML rises to higher levels of influence, model monitoring becomes an inescapable responsibility. Decision-makers cannot build trust in ML without accuracy, but accuracy can fade over time as circumstances and data change. The COVID-19 pandemic is an amplified illustration of this. Machine learning models, trained on normal human behavior, couldn’t handle the drastic shifts.

MIT Technology Review covered a few instances of this in action. For example, a company that supplied condiments and sauces to retailers in India faced challenges with its predictive algorithms after the pandemic triggered a spike in bulk orders. Because the company relied on the system's sales estimates for stock reordering, the result was a mismatch between what was predicted to sell and what actually sold. The model was never trained to deal with such an unanticipated spike, so it couldn’t keep up with real demand.

Simply put, if an ML system faces something unexpected, problems will arise. This is why ML model monitoring is so important. If the world remained constant and unchanging, then a “set it and forget it” model might work, but that’s simply not the case.

Why is model monitoring needed after deploying the model into production?

In the ML world, think of deployment as the beginning rather than the end of a project. While your team may achieve desired results in development, this rarely means a model will continue performing as expected in production. Common obstacles during production can include:

Sudden or unexpected data changes
Differences in processing production data versus training data
Difficulty explaining model predictions to stakeholders
Attacks on the model by adversaries
Inadequate tracking processes to compare newer model versions against in-production models
Lack of clear model ownership among DevOps, engineers, and other team members

With proper model monitoring best practices in place, you can combat these challenges, leading to:

Early detection of problems within your system and model before they create negative impact
Model transparency and the ability to explain and report results to stakeholders and decision makers
A clear route to continuous maintenance and improvement of the model

So, how do you monitor an ML model in production to achieve these results? Through the right model monitoring metrics, tools, and strict adherence to best practices.

How do you monitor ML model performance in production?

How your team monitors ML model performance will depend on your unique goals and challenges faced in production. Broadly speaking, monitoring involves four steps:

Gathering model performance metrics
Continuously tracking these metrics
Detecting any latent or developing issues
Notifying the person who can address said issues
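
The four steps above can be sketched as a small rolling-window check. This is a minimal illustration rather than a production design: the `MetricMonitor` class, the `notify` helper, the 0.90 threshold, and the daily accuracy values are all hypothetical.

```python
from collections import deque
from statistics import mean


class MetricMonitor:
    """Tracks a rolling window of one metric and flags drops below a threshold."""

    def __init__(self, name, threshold, window=5):
        self.name = name
        self.threshold = threshold
        self.values = deque(maxlen=window)

    def record(self, value):
        # Steps 1 and 2: gather the metric and keep tracking it over time.
        self.values.append(value)

    def check(self):
        # Step 3: report an issue once the rolling mean falls below the threshold.
        if not self.values:
            return None
        rolling = mean(self.values)
        if rolling < self.threshold:
            return f"{self.name} rolling mean {rolling:.3f} is below {self.threshold}"
        return None


def notify(message, outbox):
    # Step 4: stand-in for paging or emailing the person who owns the model.
    outbox.append(message)


# Hypothetical daily accuracy values that slowly degrade.
monitor = MetricMonitor("accuracy", threshold=0.90, window=3)
alerts = []
for daily_accuracy in [0.95, 0.94, 0.91, 0.88, 0.85]:
    monitor.record(daily_accuracy)
    issue = monitor.check()
    if issue:
        notify(issue, alerts)
```

Averaging over a window rather than alerting on a single reading is one way to avoid paging someone for a one-off dip. The window size and threshold would need tuning for each model.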

For full-scope model monitoring, you’ll need to focus on the model as well as the incoming data and resulting predictions. Each will hold valuable insight into the health of your ML model. Let’s break each one down further and explain what you should look for.

1. The Model: How do you monitor a model?

Some key areas to focus on for monitoring the model itself include model drift and adversarial attacks.

Model drift

Even the most accurate model will decay over time as incoming data shifts away from the model’s training data. This is called model drift, which is further broken down into “concept drift” and “data drift” (more on data drift a little later).

Concept drift occurs when the relationship between a model’s inputs (sometimes known as “features”) and outcomes (or “labels”) changes. The evolution of spam email and its detection is a classic example, and multiple studies have examined concept drift in the fight against these emails. Filtering out spam has become more complicated as spammers craft new, realistic-looking emails that defy prior attempts.

This creates concept drift in spam detection models trained on less sophisticated spam. The concept of what spam looks like has changed, or “drifted,” so the relationship between an email’s features and whether it is actually spam no longer matches what the model learned. If the model isn’t updated, it will likely degrade.

Concept drift can also happen suddenly, as when the COVID-19 pandemic shifted human behavior and business decisions worldwide in early 2020. It may occur gradually, or you might experience a one-off occurrence influenced by an exceptional event before your model adjusts. To better detect and manage model drift, your team can:

Set model monitoring alerts for your model’s prediction metrics to understand exactly when your model starts delivering untrustworthy outcomes.
Track data drift so you can tell whether a drop in performance stems from shifting input data or from degradation of the model itself.
Use statistical tests for detection, such as the Kolmogorov-Smirnov test or Kullback-Leibler divergence.
Retrain deployed models as reality and objectives change.
Consider automating and scheduling retraining at regular intervals if influencing realities change frequently.
Don’t be afraid to go back to the drawing board to create a new model if retraining efforts are unsuccessful.
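
As a rough illustration of the statistical tests in the list above, the following sketch computes a two-sample Kolmogorov-Smirnov statistic and a Kullback-Leibler divergence using only the standard library. The sample data, bin edges, smoothing constant, and function names are assumptions made for the example, and the KS threshold uses the usual large-sample approximation rather than an exact p-value.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    largest_gap = 0.0
    for x in set(ref_sorted) | set(cur_sorted):
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the gap size that is significant at level alpha."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def bin_shares(values, edges, smoothing=1e-6):
    """Share of values per bin. Edges must be sorted; values beyond them land in open-ended end bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values) + smoothing * len(counts)
    # Smoothing keeps empty bins from producing log(0) or a division by zero.
    return [(c + smoothing) / total for c in counts]


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical feature values: production is the training sample shifted by 50.
training = list(range(100))
production = [x + 50 for x in training]

ks = ks_statistic(training, production)
drift_detected = ks > ks_critical_value(len(training), len(production))

edges = [25, 50, 75, 100, 125]  # hypothetical bin edges
kl = kl_divergence(bin_shares(training, edges), bin_shares(production, edges))
```

KL divergence is asymmetric, so it matters which distribution is treated as the reference, and it has no built-in significance threshold, so an alerting level has to be chosen empirically. In practice a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value, is likely a better choice than a hand-rolled KS statistic.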

Adversarial attacks

A recent study on adversarial attacks on medical machine learning examines the vulnerabilities of machine learning and the important discussion of how to address them: “These vulnerabilities allow a small, carefully designed change in how inputs are presented to a system to completely alter its output, causing it to confidently arrive at manifestly wrong conclusions.”

These adversarial attacks can take various forms, like:

Poisoning - An attacker purposefully feeds a model contaminated data that biases it, like how the worst parts of Twitter trained a chatbot in 2016 to be misogynistic and racist.

Evasion - This is when an attacker uses a model’s flaws to produce the attacker’s desired inaccuracies. For example, a model may fail to identify a sticker purposefully placed over a sign and thereby incorrectly identify the sign.

Unfortunately, another recent survey by Microsoft found that nearly 90% of the organizations they interviewed did not have “the right tools in place to secure their ML systems.” To ensure your own models are on the defense against these attacks, consider:

Keeping everything surrounding your model and data confidential to prevent attackers from gaining insider knowledge.
Testing different examples of adversarial attacks on your models.
Treating adversarial events like outliers, since they likely won’t follow a pattern, and ensuring a human reviews these events before using the predictions.
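
One way to test adversarial examples against a model is to perturb an input slightly and check whether the prediction flips. The sketch below does this for a toy linear classifier, where the worst-case bounded perturbation can be computed directly from the signs of the weights. The weights, bias, input, and epsilon values are hypothetical, and real models generally need dedicated attack tooling rather than this closed-form shortcut.

```python
def predict(weights, bias, features):
    """Toy linear classifier: returns 1 when the weighted sum is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def worst_case_perturbation(weights, features, epsilon, current_label):
    """Move every feature by at most epsilon in the direction that pushes the score away from current_label."""
    direction = -1 if current_label == 1 else 1
    return [x + direction * epsilon * sign(w) for w, x in zip(weights, features)]


def is_robust(weights, bias, features, epsilon):
    """True if no perturbation of size epsilon per feature changes the prediction."""
    original = predict(weights, bias, features)
    attacked = worst_case_perturbation(weights, features, epsilon, original)
    return predict(weights, bias, attacked) == original


# Hypothetical model and input.
weights, bias = [2.0, -1.0], -0.5
sample = [0.5, 0.2]

survives_small_attack = is_robust(weights, bias, sample, epsilon=0.05)
survives_large_attack = is_robust(weights, bias, sample, epsilon=0.2)
```

Here the prediction survives the small perturbation but flips under the larger one. Inputs that flip this easily are reasonable candidates for the human review step described above.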

Equally important to the model monitoring equation is incoming data, which we’ll take a look at next.

2. The Data: What is data model monitoring?

The old adage of “garbage in, garbage out” has never been truer than when it comes to ML model input data. The three biggest data issues that will inevitably impact predictions are:

Data drift - Without consistent retraining and remodeling, models cannot adjust to a constantly changing world. A significant shift in the pattern of production data relative to training data is called “data drift.” To detect data drift, you can use the same statistical tests mentioned above for model drift. Additionally, consider an ML model monitoring platform to achieve faster time-to-value with fewer errors in detection and resolution.

Poor quality data - Altered source data schema, corruption, and data loss are all ways your input can lose quality. Ensure you have proper data quality checks that cover the whole gamut of issues, like missing values, syntax or formatting errors, and the general integrity of the data you are using.

Outliers - It’s impossible to account for all of the anomalies that will influence your model production. These one-off events are called “outliers,” and they can cause isolated performance issues if left undetected. This is where a model monitoring platform pays off, as it greatly simplifies detecting these deviations.
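
The data quality and outlier checks described above might look something like the following sketch. The schema, field names, sample values, and z-score cutoff of 3 are assumptions for illustration. A z-score rule is only one of several possible outlier tests, and it assumes the training values are not all identical.

```python
from statistics import mean, stdev

# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"age": int, "country": str, "income": float}


def validate_record(record, schema):
    """Return a list of data quality problems found in one input record."""
    problems = []
    for field, expected_type in schema.items():
        if record.get(field) is None:
            problems.append(f"missing value: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type: {field}")
    for field in record:
        if field not in schema:
            # An unexpected field can signal an altered source schema.
            problems.append(f"unexpected field: {field}")
    return problems


def flag_outliers(training_values, production_values, max_z=3.0):
    """Return production values more than max_z training standard deviations from the training mean."""
    center, spread = mean(training_values), stdev(training_values)
    return [v for v in production_values if abs(v - center) / spread > max_z]


# Hypothetical incoming records.
clean = {"age": 34, "country": "IN", "income": 52000.0}
missing = {"age": None, "country": "IN", "income": 41000.0}
altered = {"age": "34", "country": "IN", "income": 38000.0, "zip": "110001"}

report = [validate_record(r, EXPECTED_SCHEMA) for r in (clean, missing, altered)]

# Hypothetical feature values seen in training versus production.
outliers = flag_outliers([10, 12, 11, 13, 12, 11, 10, 12], [11, 12, 40, 10])
```

Records with a non-empty problem list can be quarantined before they reach the model, and flagged outliers can be routed to a person for review instead of being scored automatically.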

Lastly, a big part of model monitoring is assessing your model’s predictions.

3. The Predictions: How do you assess model performance using predictions?

A model’s predictions are its purpose, and you can use various machine learning and model monitoring metrics to assess this output. The right metrics depend on the type of model you have. Two of the more popular model types are classification and regression models.

Classification models

Classification involves grouping data into predefined categories. These types of models use input training data to assess the probability that subsequent data falls into one of the predefined categories. Some of the best model performance evaluation metrics to use for classification models include:

Accuracy, precision, and recall scores
F1 score
ROC-AUC score
Confusion matrix
Log loss
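
Most of the metrics in this list follow directly from the confusion matrix counts. The sketch below derives accuracy, precision, recall, and F1 for a binary classifier, plus log loss from predicted probabilities. The labels and predictions are made up for illustration, the function names are assumptions, and ROC-AUC is left out to keep the example short.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical ground truth and predictions from a production sample.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

With these sample labels there are three true positives, one false positive, one false negative, and three true negatives, so all four metrics come out to 0.75. Tracking them separately still matters, because precision and recall usually diverge on imbalanced data.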

Regression models

Regression models determine how independent variables relate to target or dependent variables. They estimate a numerical value, whereas classification models assign an observation to the category it belongs to. For example, a regression model could be used to predict the relationship between a particular marketing campaign and its effect on sales. Metrics you should be using to monitor regression models include:

Root mean square error (RMSE)
Mean absolute error (MAE)
Mean absolute percentage error (MAPE)
Weighted mean absolute percentage error (WMAPE)
Coefficient of determination or R²
Adjusted R²
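
For reference, the regression metrics listed above can be computed as in the sketch below. The sample values and the `regression_metrics` name are assumptions. MAPE is undefined when an actual value is zero, and adjusted R² requires more observations than features plus one, so this version assumes both conditions hold.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Common regression error metrics; assumes no actual value is zero."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]

    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs_errors) / n,
        # MAPE averages the per-observation percentage errors.
        "mape": 100 * sum(a / abs(t) for a, t in zip(abs_errors, y_true)) / n,
        # WMAPE weights errors by the size of the actual values.
        "wmape": 100 * sum(abs_errors) / sum(abs(t) for t in y_true),
        "r2": r2,
        "adjusted_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }


# Hypothetical actual sales versus predicted sales.
actual = [100, 200, 300, 400, 500]
predicted = [110, 190, 310, 390, 510]
scores = regression_metrics(actual, predicted, n_features=1)
```

Every prediction in this sample is off by exactly 10, so RMSE and MAE are both 10, while MAPE (about 4.57%) is higher than WMAPE (about 3.33%) because the same absolute error weighs more heavily on the smaller actual values.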

One final note…

The best practices we’ve mentioned here are only a starting point. As ML and AI continue to evolve, these algorithms and best practices will likely change as well. In the meantime, you can use first-generation model monitoring frameworks to your advantage to help you deploy responsible AI.

Ultimately, continuous and dedicated monitoring, paired with explainable AI, as part of a comprehensive AI Observability platform is essential to ensure that the MLOps lifecycle runs smoothly, production is successful, and your bottom line is protected.
