ML model monitoring best practices

With increasing reliance on AI across industries, ML model monitoring is quickly becoming a must-have component for supporting the ongoing success of ML implementations. But how do you operationalize model monitoring? How do you choose the right tools for your use case? And how do you ensure your solution is aligned with your organization’s goals?

Let’s take a deeper look at model monitoring best practices…

Introduction to Model Monitoring

As recently as 10 years ago, ML models were purpose-built to solve very narrowly defined problems. Now, models are applied to increasingly complex, critical use cases, which require continuous monitoring after deployment to help ensure accuracy and algorithmic fairness, and to alert ML teams to any performance issues.

Yet there’s a lingering tendency among MLOps teams to over-emphasize model training while neglecting post-deployment monitoring. That’s a flawed stance to take: model performance inevitably decays in production because real-world input data tends to diverge from the training data and from the assumptions made during training. This kind of model drift can be difficult to recognize even as it begins to directly affect the business, potentially impacting the bottom line, eroding customer retention, and damaging the organization’s reputation and brand.

You don’t have to dig very deep to see the underlying value of model monitoring:

  • Early detection of model issues = decreased disruption and increased team productivity
  • Understanding changes in model behavior = vastly reduced resolution times, deeper root cause analysis, and more efficient use of resources
  • Reduced time to resolve issues = preventing potential losses to both revenue and reputation

Understanding Model Drift

Depending on context, you may hear drift described as model drift, feature drift, data drift, or concept drift. They’re all variations of the same underlying phenomenon: once models are in production, the statistical distribution of features drifts over time, diverging from their distribution in the original training data and gradually violating the assumptions used in training. It’s not that the model itself is changing; it’s the relationship between the output data and input data that’s diverging, distorting recommendations and eroding accuracy.

But why would a rigorously developed model, especially one with a track record for accuracy in production, be susceptible to drift? For one thing, real-world inputs don’t care about the data used to train the model, or how accurate it’s been up to now. Small shifts in the statistical distribution of input features, in the order of feature importance, or in the interdependencies among features can all amplify output errors in non-linear ways. Drift can also simply be a reflection of actual changes in the system the data describes, like shopping habits that vary with economic cycles or the seasons. The bottom line is that model drift is a natural artifact of a noisy and dynamic world and one big reason why model monitoring exists in the first place.

Model Monitoring Tools, Techniques, and Practices

Luckily, there’s no shortage of off-the-shelf tools and options for model monitoring. Many of the tools are open source, well-understood, and straightforward to deploy. But the serious challenge lies in integration — making all the tools in your MLOps lifecycle work together to avoid siloing information. And so far, there exists no established “recipe” for identifying the right set of tools to support your situation, or for configuring them appropriately for your use case. In fact, the investment of time and resources required to build your own solution, or to simply support it in production, often drives buyers to select managed services over an in-house build.

In some sense, any monitoring solution, whether it’s looking for model bias and fairness or detecting outliers, boils down to some form of accuracy measurement. The exact approach you take is heavily influenced by whether you have access to baseline data or some “ground truth” to compare with model outputs.

When we have access to ground truth labeling, it’s a simple matter to compare results in production and calculate accuracy. That’s the preferred approach, but there are workarounds that allow us to infer accuracy by other means.
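As a rough illustration of the ground-truth case, the sketch below tracks accuracy over a sliding window of recent predictions and flags when it falls below a threshold. The class name RollingAccuracy, the window size, and the threshold are all hypothetical choices made for this example, not part of any particular monitoring tool.

```python
from collections import deque


class RollingAccuracy:
    """Accuracy over the most recent `window` labeled predictions.

    Hypothetical helper for illustration only; the default window and
    threshold are arbitrary and would need tuning for a real use case.
    """

    def __init__(self, window=100, threshold=0.9):
        # deque(maxlen=...) silently drops the oldest outcome once full.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def update(self, predicted, actual):
        # Record whether the prediction matched the ground truth label.
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        # No labeled predictions yet, so accuracy is undefined.
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        # True only when accuracy is known and strictly below the threshold.
        acc = self.accuracy()
        return acc is not None and acc < self.threshold
```

In use, update() would be called each time a ground truth label arrives for an earlier prediction, and is_degraded() checked on whatever schedule the alerting setup calls for. A windowed measure is used here so that recent degradation is not masked by a long history of correct predictions.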

Monitoring in the absence of ground truth

Some use cases make ground truth available organically; recommendation systems and search engines, for example, naturally get access to user feedback that serves as ground truth. But in many real-world scenarios, we do not have real-time access to ground truth, and may not get it for several days or weeks, well after the decision has been made (a loan approved, say, or an applicant rejected).

In the absence of ground truth, some common approaches and workarounds for ensuring model performance involve monitoring:

  • Target variables and input features
  • Changes in variable distribution
  • Shifts in feature importance

Read our whitepaper on model monitoring best practices to learn how to monitor model performance without ground truth labels, determine model bias or unfairness, and understand the role of explainable AI within a model performance management framework.
