Expanding from $21.17 billion in 2022 to an anticipated $209.91 billion by 2029, the machine learning (ML) market continues to show impressive growth. This is further evidenced by the 2022 Gartner CIO and Technology Executive Survey, in which 48% of CIOs said they have already deployed, or plan to deploy, AI and machine learning technologies within their organizations.
As business leaders and MLOps teams build a promising, ML-based future, they need to consider the importance of model monitoring. By tracking a model’s performance in production, ML teams create a feedback loop to foster better prediction accuracy and long-term value.
The frustrating thing about model degradation is its subtle nature. A minor change in a model's performance can result in major problems for the business. With the right tools and best practices, however, model monitoring can catch any issues before they escalate.
But before getting to model monitoring techniques and best practices for monitoring ML models in production, let’s first lay some groundwork on the power and consequences of ML.
Nowadays, ML is quite advanced and supports a wide range of processes in both the business world and our personal lives. These range from simple tasks, like recommending movies on a streaming service, to life-altering judgments, such as those involved in health care risk predictions.
As ML rises to higher levels of influence, model monitoring becomes an inescapable responsibility. Decision-makers cannot build trust in ML without accuracy, but accuracy can fade over time as circumstances and data change. The COVID-19 pandemic is an amplified illustration of this: machine learning models trained on normal human behavior couldn’t handle the drastic shifts.
MIT Technology Review covered a few instances of this in action. For example, a company that supplied condiments and sauces to retailers in India faced challenges with their predictive algorithms due to a spike in bulk orders resulting from the pandemic. The company's reliance on the system's sales estimates for stock reordering resulted in a mismatch between what was predicted to sell and what really did. Because the model was never trained to deal with such an unanticipated spike, it couldn’t keep up with the reality of demand.
Simply put, if an ML system faces something unexpected, problems will arise. This is why ML model monitoring is so important. If the world remained constant and unchanging, then a “set it and forget it” model might work, but that’s simply not the case.
In the ML world, think of deployment as the beginning rather than the end of a project. While your team may achieve desired results in development, this rarely means a model will continue performing as expected in production. Common obstacles during production can include:
- Model drift, as live data and real-world relationships shift away from training conditions
- Data quality issues introduced by upstream pipelines and changing data sources
- Adversarial attacks that deliberately manipulate inputs
With proper model monitoring best practices in place, you can combat these challenges, leading to:
- Sustained prediction accuracy and long-term business value
- Earlier detection of issues, before they escalate into costly failures
- Greater trust in the models informing your decisions
So, how do you monitor an ML model in production to achieve these results? Through the right model monitoring metrics, tools, and strict adherence to best practices.
How your team monitors ML model performance will depend on your unique goals and the challenges you face in production. Broadly speaking, monitoring involves four steps:
1. Choose the metrics that best reflect model health for your use case
2. Establish baselines and alert thresholds from training and validation performance
3. Track those metrics continuously against live production traffic
4. Investigate alerts, then retrain, roll back, or repair the data as needed
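To make those steps concrete, here’s a minimal sketch of the tracking-and-alerting loop in plain Python. The baseline value, window metric, and alert margin are hypothetical stand-ins; in practice they would come from your own validation runs and serving logs.

```python
# Hypothetical baseline measured on validation data at deployment time.
BASELINE_ACCURACY = 0.92
ALERT_MARGIN = 0.05  # alert if live accuracy drops more than 5 points

def window_accuracy(predictions, labels):
    """Step 3: compute the chosen metric over a recent window of traffic."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def check_model_health(predictions, labels):
    """Steps 2 and 4: compare the live metric to the baseline threshold."""
    acc = window_accuracy(predictions, labels)
    if BASELINE_ACCURACY - acc > ALERT_MARGIN:
        # In a real system this would page the team or kick off a
        # retraining pipeline rather than print a message.
        print(f"ALERT: window accuracy {acc:.2f} vs baseline {BASELINE_ACCURACY:.2f}")
    return acc

# Toy window of predictions and (delayed) ground-truth labels.
preds  = [1, 0, 1, 1, 0, 1, 0, 0]
labels = [1, 0, 0, 0, 0, 1, 1, 0]
check_model_health(preds, labels)
```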
For full-scope model monitoring, you’ll need to focus on the model as well as the incoming data and resulting predictions. Each will hold valuable insight into the health of your ML model. Let’s break each one down further and explain what you should look for.
Some key areas to focus on for monitoring the model itself include model drift and adversarial attacks.
Even the most accurate model will decay over time as incoming data shifts away from the model’s training data. This is called model drift, which is further broken down into “concept drift” and “data drift” (more on data drift a little later).
Concept drift occurs when the relationship between a model’s inputs (sometimes known as “features”) and outcomes (or “labels”) changes. The evolution of spam email and spam detection is a classic example, and multiple studies have examined concept drift in the fight against these emails. Filtering out spam has become more complicated as spammers craft new, realistic-looking emails that evade earlier filters.
This creates concept drift in spam detection models trained on less sophisticated spam. The concept of what it means to be spam has changed, or “drifted”: the relationship between the model’s input features and the correct label is no longer what it was at training time. If the model isn’t updated, its performance will likely degrade.
Concept drift can also happen suddenly, as when the COVID-19 pandemic shifted human behavior and business decisions worldwide in early 2020. It may occur gradually, or as a one-off blip around an exceptional event before conditions return to normal. To better detect and manage drift, your team can:
- Track performance metrics and input distributions over time, comparing live data against the training baseline (see the drift-detection sketch below)
- Set alerts that fire on statistically significant shifts
- Retrain the model regularly on fresh, representative data
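Here is what that first step might look like in code: a minimal drift-detection sketch, assuming SciPy is available, that compares a single feature’s training distribution against a recent production window with a two-sample Kolmogorov-Smirnov test. The synthetic data and significance threshold are purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values captured at training time vs. a recent production window.
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
production_feature = rng.normal(loc=0.6, scale=1.2, size=1_000)  # shifted

# The KS statistic measures the largest gap between the two empirical
# distributions; a small p-value suggests the feature has drifted.
statistic, p_value = ks_2samp(training_feature, production_feature)

if p_value < 0.01:  # illustrative significance threshold
    print(f"Possible drift: KS={statistic:.3f}, p={p_value:.1e}")
else:
    print("No significant distribution shift detected.")
```

In practice you would run a check like this per feature, on a schedule, and feed the results into your alerting system.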
A recent study on adversarial attacks on medical machine learning highlights the vulnerabilities of machine learning systems and the growing debate over what to do about them: “These vulnerabilities allow a small, carefully designed change in how inputs are presented to a system to completely alter its output, causing it to confidently arrive at manifestly wrong conclusions.”
These adversarial attacks can take various forms, like:
- Poisoning attacks, which corrupt the training data itself
- Evasion attacks, which perturb inputs at inference time to force incorrect predictions
- Model extraction (or stealing), which reconstructs a proprietary model through repeated queries
Unfortunately, another recent survey by Microsoft found that nearly 90% of the organizations they interviewed did not have “the right tools in place to secure their ML systems.” To ensure your own models can defend against these attacks, consider:
- Adversarial training, where perturbed examples (like the one sketched below) are folded back into the training set
- Validating and sanitizing inputs before they reach the model
- Restricting and monitoring access to model endpoints to make probing harder
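To illustrate how small an evasion perturbation can be, here is a sketch of the fast gradient sign method (FGSM) in PyTorch, applied to a toy, untrained classifier. The model, input, and epsilon are all illustrative stand-ins; the point is that nudging each input value slightly in the direction that increases the loss can be enough to change a prediction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy binary classifier standing in for a production model.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 10, requires_grad=True)  # a single input example
y = torch.tensor([1])                       # its true label

# FGSM: compute the gradient of the loss w.r.t. the *input*, then
# step each input value in the direction that increases the loss.
loss = loss_fn(model(x), y)
loss.backward()
epsilon = 0.25  # perturbation budget (illustrative)
x_adv = (x + epsilon * x.grad.sign()).detach()

with torch.no_grad():
    print("original prediction:   ", model(x).argmax(dim=1).item())
    print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```

Adversarial training, the first defense listed above, amounts to mixing examples like `x_adv` back into the training set with their correct labels.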
Equally important to the model monitoring equation is incoming data, which we’ll take a look at next.
The old adage of “garbage in, garbage out” has never been truer than with the data fed into ML models. The three biggest data issues that will inevitably impact predictions are:
- Data drift, where the statistical properties of production inputs shift away from the training data
- Data quality problems, such as missing values, schema changes, or broken upstream pipelines
- Outliers: extreme or anomalous values the model never saw during training
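As a starting point for catching the second and third issues, here’s a sketch of simple batch-level checks with pandas. The column names, expected schema, and valid ranges are hypothetical stand-ins for your own data contract.

```python
import pandas as pd

# Hypothetical incoming batch of features for scoring.
batch = pd.DataFrame({
    "age": [34, 51, None, 29, 240],          # a missing value and an outlier
    "income": [52_000, 61_000, 58_000, -5, 49_000],
})

EXPECTED_COLUMNS = {"age", "income"}
VALID_RANGES = {"age": (0, 120), "income": (0, 10_000_000)}

# Schema check: did an upstream pipeline drop or rename a column?
missing_cols = EXPECTED_COLUMNS - set(batch.columns)
if missing_cols:
    print(f"Schema violation, missing columns: {missing_cols}")

# Missing-value check per column.
print("null counts:\n", batch.isna().sum())

# Range/outlier check against expected bounds.
for col, (lo, hi) in VALID_RANGES.items():
    bad = batch[(batch[col] < lo) | (batch[col] > hi)]
    if not bad.empty:
        print(f"{col}: {len(bad)} value(s) outside [{lo}, {hi}]")
```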
Lastly, a big part of model monitoring is assessing your model’s predictions.
Predictions are the whole purpose of a model, and you can use various machine learning and model monitoring metrics to assess this output. Different metrics apply depending on the type of model you have; two of the most common types are classification and regression models.
Classification models group data into predefined categories. These models use input training data to assess the probability that subsequent data falls into one of those categories. Some of the best model performance evaluation metrics for classification models include:
- Accuracy: the proportion of all predictions the model gets right
- Precision and recall: how many predicted positives are correct, and how many actual positives are caught
- F1 score: the harmonic mean of precision and recall
- AUC-ROC: how well the model separates the classes across decision thresholds
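Here’s a brief sketch of computing these with scikit-learn on illustrative labels and predictions; in production, the ground-truth labels typically arrive with some delay.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Illustrative ground-truth labels, hard predictions, and scores.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # P(class = 1)

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("auc-roc:  ", roc_auc_score(y_true, y_score))  # uses scores, not labels
```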
Alongside classification, regression models determine how independent variables relate to a target or dependent variable. A regression model estimates a numerical value, whereas a classification model assigns an observation to the category it belongs to. For example, a regression model could predict the relationship between a particular marketing campaign and its effect on sales. Metrics you should be using to monitor regression models include:
- Mean absolute error (MAE): the average absolute difference between predicted and actual values
- Mean squared error (MSE) and root mean squared error (RMSE): error measures that penalize large misses more heavily
- R² (coefficient of determination): how much of the variance in the target the model explains
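And a matching sketch for regression metrics with scikit-learn, again on illustrative numbers (RMSE is simply the square root of MSE):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual vs. predicted sales figures.
y_true = np.array([120.0, 98.0, 143.0, 110.0, 87.0])
y_pred = np.array([115.0, 105.0, 150.0, 100.0, 90.0])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_true, y_pred)

print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}  R²: {r2:.3f}")
```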
The best practices we’ve mentioned here are only a starting point. As ML and AI continue to evolve, these algorithms and best practices will likely change as well. In the meantime, you can use first-generation model monitoring frameworks to your advantage, helping you deploy responsible AI.
Ultimately, continuous and dedicated monitoring — paired with explainable AI — as part of a comprehensive AI Observability platform is essential to ensure that the MLOps lifecycle runs smoothly, production is successful, and your bottom line is protected.