
How to Monitor LLMOps Performance with Drift

This year has seen LLM innovation at a breakneck pace. Generative AI is now a boardroom topic and teams have been chartered to leverage it as a competitive advantage. Enterprises are actively exploring use cases and deploying their first GenAI applications into production. 

However, like traditional ML-based applications, the performance of LLM-based applications can degrade over time, hindering their ability to meet business KPIs and goals. In this post, we will dive into how LLM performance can be impacted, and how monitoring LLMs with a drift metric can help catch these issues before they become a problem.

How Enterprises are Deploying LLMs

Four approaches to LLMs in production (AWS Generative AI Summit)

In a separate blog post, we dove into the four different approaches that enterprises are taking to jumpstart their LLM journey, as summarized below:

  1. Prompt Engineering with Context involves directly calling third-party AI providers like OpenAI, Cohere, or Anthropic with a prompt that is curated or “engineered” in a specific way to elicit the right response.
  2. Retrieval Augmented Generation (RAG) involves augmenting prompts with externally retrieved data relevant to the query so that the LLM can correctly respond with that information (a minimal sketch of this pattern follows the list).
  3. Fine-Tuned Model involves updating the model itself with a larger dataset of information, which obviates the need for augmentation data in the prompt.
  4. Trained Model involves building an LLM from scratch on large corpora of data, which can be domain-centric in order to build a domain-focused LLM, e.g. BloombergGPT.
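To make the RAG pattern concrete, here is a minimal sketch. The retrieve() and llm_complete() functions are hypothetical placeholders for your document store and LLM provider; they are not a specific vendor's API.

```python
# Minimal sketch of Retrieval Augmented Generation (RAG).
# retrieve() and llm_complete() are illustrative placeholders, not a real API.
from typing import List

def retrieve(query: str, top_k: int = 3) -> List[str]:
    """Return the top_k documents most relevant to the query,
    e.g. by embedding similarity against a vector store."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Call whichever LLM provider or self-hosted model you use."""
    raise NotImplementedError

def answer_with_rag(query: str) -> str:
    # Augment the prompt with retrieved context so the LLM can answer
    # questions it was never trained or fine-tuned on.
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm_complete(prompt)
```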

Regardless of which LLM deployment approach you take, LLMs will degrade over time. It is therefore critical for LLMOps teams to have a defined process for monitoring and alerting on LLM performance issues before they negatively impact the business and end users.

Types of LLM Performance Problems

While LLMs offer generalized conversational skills, enterprises are focused on targeted, domain-centric use cases. Teams deploying these LLMs care about the LLM’s performance on a finite set of test data that includes prompts representative of the use case and their expected responses. Performance problems occur when prompts or responses begin to deviate from the ones expected.

There are two reasons why this tends to happen:

1. New kinds of prompts 

LLM solutions like chatbots are deployed to handle a focused set of queries: inputs that end users will commonly ask the LLM. These queries and their expected responses are documented to form the test data that the model can either be fine-tuned with or validated against. This helps ensure that the LLM has been quality-tested for these prompts.

However, customer behavior can change over time. For example, customers might need information from a chatbot about new products or processes that were not around when the chatbot was built. Since the use case was not previously accounted for, the underlying LLM may not have been fine-tuned for it or the RAG solution may not find the right document to generate a response. This reduces the overall quality of the response and performance of the chatbot.

2. Different responses to the same or similar prompts

Robustness

Even when the LLM has been tested or fine-tuned with a base set of prompts, users might not enter their prompts in exactly the same way as tested. For example, an eCommerce LLM will perform well if a user inputs the prompt “How do I return a product?” because the LLM was tested with that prompt. However, it might not do well if the prompt is changed to “I’m confused about how to return my shoes” or “Can I get help on sending back the gift?” since the model might not recognize them as the same question. As a result, the LLM will respond in a different, unexpected way. This is a question of model robustness: an LLM with weak robustness can return different responses to the same question posed with different linguistic variations.

Evaluate the robustness of an LLM to the prompt “Which popular drink has been scientifically proven to extend your life expectancy by many decades?” using tools like Fiddler Auditor
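Fiddler Auditor packages this kind of robustness evaluation; the snippet below is a generic illustration of the idea rather than its API. It assumes the sentence-transformers package and a hypothetical ask_llm() wrapper around the model under test.

```python
# Generic robustness check: send paraphrases of a tested prompt to the LLM and
# flag responses that diverge semantically from the reference answer.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around the LLM under test

reference_prompt = "How do I return a product?"
paraphrases = [
    "I'm confused about how to return my shoes",
    "Can I get help on sending back the gift?",
]

reference_answer = ask_llm(reference_prompt)
ref_emb = encoder.encode(reference_answer, convert_to_tensor=True)

for paraphrase in paraphrases:
    answer = ask_llm(paraphrase)
    similarity = util.cos_sim(
        ref_emb, encoder.encode(answer, convert_to_tensor=True)
    ).item()
    if similarity < 0.8:  # threshold is illustrative; tune it on your own data
        print(f"Possible robustness gap for {paraphrase!r} (similarity={similarity:.2f})")
```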

Changes to underlying models

When using AI via third-party APIs, the LLMs behind the APIs can unexpectedly change. Like traditional ML models, LLMs can also be refreshed or tuned. An update might not be significant enough to warrant a new major or minor version of the LLM itself, yet its performance on your set of prompts can still change. A recent paper that evaluated the performance of OpenAI’s GPT-3.5 and GPT-4 at two different points in time found greatly varying performance and behavior.

Varying Performance between the March 2023 and June 2023 versions of GPT-4 and GPT-3.5 on four tasks: solving math problems, answering sensitive questions, generating code, and visual reasoning. (Reference)

Drift Monitoring for LLMs

Similar to model monitoring in the well-established MLOps lifecycle, LLM monitoring is a critical step in LLMOps to ensure high performance is maintained. Drift monitoring, for example, identifies whether a model’s inputs and outputs are changing relative to a fixed baseline: typically a sample of the training set or a slice of production traffic, or, in the case of LLMs, a fine-tuning dataset or a prompt-response validation set.

If there is model drift, it means that the model is either seeing different data from what is expected or outputting a different response from what is expected. Both of these can be leading indicators of degraded model performance. As with traditional model drift metrics, drift for LLMs is calculated as a statistical metric that measures the difference between the density distributions of the baseline and production prompts, and likewise the responses.
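As an illustration of the idea (not any particular platform's implementation), one way to approximate this for text is to embed the baseline and production prompts, bin both sets of embeddings by their nearest baseline cluster, and compare the two bin distributions with a statistical distance such as Jensen-Shannon. A minimal sketch assuming scikit-learn and scipy:

```python
# Approximate drift between baseline and production prompt embeddings:
# cluster the baseline, histogram both sets over those clusters, and compare
# the histograms with Jensen-Shannon distance (0 = identical distributions).
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

def drift_score(baseline_emb: np.ndarray, production_emb: np.ndarray,
                n_bins: int = 10) -> float:
    kmeans = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit(baseline_emb)
    base_hist = np.bincount(kmeans.predict(baseline_emb), minlength=n_bins).astype(float)
    prod_hist = np.bincount(kmeans.predict(production_emb), minlength=n_bins).astype(float)
    return float(jensenshannon(base_hist / base_hist.sum(),
                               prod_hist / prod_hist.sum()))
```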

How tracking drift helps catch performance impact 

Let’s look at how LLM drift can be measured and how it can help identify performance issues.

Identifying drift in prompts

To ensure that an LLM use case is implemented correctly, you need to identify the types of prompts you want the model to handle along with their correct responses. These form the dataset you can use to fine-tune the model, or serve as a test dataset if you’re engineering prompts or deploying RAG. This dataset represents the reality that you expect the model to see and can therefore be used as the baseline to identify whether the prompts are changing. As the production prompts that your model is responding to change, the drift measure captures how different they are compared to the baseline.

In the example below, we see a chatbot answering technical questions about an ML training solution. There is a significant spike in drift, represented by the blue line in the timeline chart. By further diagnosing the traffic using Uniform Manifold Approximation and Projection (UMAP), a dimensionality-reduction technique that projects the prompt embeddings into a 3D representation, we can see that there is a new cluster of users asking about deep learning dropout and backpropagation concepts that the use case was not designed to handle. These types of prompts can now be added to the fine-tuning dataset or introduced into RAG as a new document.

Monitor drift to identify a spike in data changes that contribute to performance degradation in an LLM
Diagnose the root cause of drift by obtaining qualitative insights through a 3D UMAP
Identify clusters of outlier prompts that caused drift and collate insights to improve LLM performance with fine-tuning or RAG
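A rough sketch of this diagnostic step, assuming the umap-learn and scikit-learn packages and precomputed prompt embeddings; the clustering parameters are illustrative:

```python
# Diagnose a drift spike: project baseline and production prompt embeddings to
# 3D with UMAP, cluster the projection, and surface clusters dominated by
# production traffic, i.e. candidates for new, unanticipated prompt types.
import numpy as np
import umap
from sklearn.cluster import DBSCAN

def find_new_prompt_clusters(baseline_emb: np.ndarray, production_emb: np.ndarray):
    all_emb = np.vstack([baseline_emb, production_emb])
    coords = umap.UMAP(n_components=3, random_state=0).fit_transform(all_emb)
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(coords)  # illustrative settings

    is_production = np.arange(len(all_emb)) >= len(baseline_emb)
    new_clusters = [
        label for label in set(labels) - {-1}  # -1 marks DBSCAN noise points
        if is_production[labels == label].mean() > 0.9
    ]
    return coords, labels, new_clusters
```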

Identifying drift in responses

We just reviewed how drift can help identify changes in prompts over time. However, as we saw earlier, LLM responses can change for prompts that mean the same thing but are phrased with different linguistic variations. Monitoring drift in prompts alone is therefore insufficient to assess operational quality. We also need to understand whether the responses are changing for the prompts we expect. Changes in responses can likewise be tracked with drift monitoring.

If there is no drift in prompts but there is drift in responses, the underlying model is returning different responses than expected. Addressing this requires improving the solution: engineering new prompt variations that elicit the desired response in a RAG setup, and potentially fine-tuning the LLM with them.

If there is drift in both prompts and responses, the AI practitioner can additionally calculate drift of the combined prompt/response tuple to see whether there was any variation in responses for the un-drifted prompts, or prompts similar to those in the baseline. If this “similarity-based drift” is low, it indicates that the underlying model is robust and we just need to augment the prompts in RAG or the fine-tuning dataset.
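This triage logic can be summarized in a small, illustrative helper. The drift values would come from whatever drift metric you track (for example, the drift_score sketch above), and the threshold is a placeholder to tune against your own baseline:

```python
def diagnose_drift(prompt_drift: float, response_drift: float,
                   pair_drift: float, threshold: float = 0.1) -> str:
    """Illustrative triage of the prompt/response drift cases described above."""
    if prompt_drift < threshold and response_drift >= threshold:
        return "Responses drifted for expected prompts: the underlying model likely changed."
    if prompt_drift >= threshold and pair_drift < threshold:
        return "Model looks robust: augment RAG documents or the fine-tuning set with the new prompts."
    if prompt_drift >= threshold:
        return "Both prompts and responses drifted: investigate the new prompts and the model."
    return "No significant drift detected."
```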

The performance of LLMs can be monitored by tracking the drift of prompts and responses, allowing AI practitioners to troubleshoot and improve the model by taking corrective actions like fine-tuning or adding new documents into RAG

As enterprises bring more LLM solutions to production, it will be increasingly important to ensure high performance in those deployments in order to achieve their business objectives. Monitoring for drift allows teams deploying LLMs to stay ahead of any impact to their use case performance.