Generative AI models and applications are being rapidly adopted across industries to augment human capacity on a wide range of tasks [1]. Given the furious pace of Large Language Model (LLM) rollouts, it’s evident that companies feel pressure to adopt generative AI or risk being disrupted and left behind. However, organizations face hurdles in deploying generative AI at scale given the nascent state of available AI tooling.
In this blog, we’ll explore the key risks of today’s generative AI stack, focusing on the need for model monitoring, explainability, and bias detection when building and deploying generative AI applications. We then discuss why model monitoring is the missing link for completing the generative AI tech stack, and peek into the future of generative AI.
The Generative AI Stack
In the emerging generative AI stack, companies are building applications by developing their own models, invoking third-party generative AI via APIs, or leveraging open-source models that have been fine-tuned to their needs. In all these cases, the three major cloud platforms typically power the underlying AI capabilities. Andreessen Horowitz recently noted that the emerging generative AI tech stack [1] includes the following:
- Generative AI-powered applications (e.g., Jasper and Copilot)
- Two paradigms of foundation models
  a. Closed-source proprietary models (e.g., GPT-4)
  b. Open-source models (e.g., Stable Diffusion)
- Model hubs to share and host foundation models
There are, however, several risks and concerns with generative AI [3-5]: inaccuracies, costs, lack of interpretability, bias and discrimination, privacy implications, lack of model robustness [6], fake and misleading content, and copyright implications, to name just a few. Questions therefore remain about how to safely deploy this technology at scale, and that’s where model monitoring, explainability, bias detection, and privacy protection become paramount. Organizations must ensure that these models are continuously monitored, that usage of generative AI apps is tracked, and that users understand the reasoning behind these models.
Monitoring - The Key to Enterprise Adoption of Generative AI
Model monitoring involves the ongoing evaluation of the performance of a generative AI model. This includes tracking model performance over time, identifying and diagnosing issues, and making necessary adjustments to improve performance. For example, when generative AI models are leveraged for applications such as helping financial advisors [7] search wealth management content or detecting fraudulent accounts [8] on community platforms, it’s crucial to monitor the performance of the underlying models and detect issues in a timely fashion, considering the significant financial stakes involved. Likewise, when we deploy generative AI-powered applications with huge social impact, such as tutoring students [9] and serving as a visual assistant [10] for people who are blind or have low vision, we need to monitor the models to make sure that they remain reliable and trustworthy over time.
Monitoring involves staying on top of these different aspects of the model’s operational behavior:
Generative AI models are plagued by inaccurate content, a problem that affects all modalities of data. LLMs came under full public scrutiny as Google and Microsoft geared up to launch their AI-assisted experiences. Google Bard gave an incorrect answer in a widely viewed ad [11], while Microsoft Bing made a lot of basic errors [12]. Midjourney, Stable Diffusion, and DALL-E 2 all generate flawed human fingers and teeth [13]. Monitoring how accurate end users find these outputs helps teams keep track of problematic predictions, which can then be used to fine-tune the model with additional data or to justify switching to a different model.
Model performance degrades over time, in a process known as model drift [15], which results in models losing their predictive power. In contrast to predictive AI models, measuring the performance of generative AI models is harder since the notion of “correct” response(s) is often ill-defined. Even in cases where model performance can be measured, it may not be possible to do so in (near) real-time. Instead, we can monitor changes in the distribution of inputs over time and treat such changes as indicators of performance degradation since the model may not perform as expected under such distribution shifts. Since generative AI uses unstructured data and typically represents inputs as high-dimensional embedding vectors, monitoring these embeddings can show when the data is shifting and can help determine when the model may need to be updated. For example, we show how drift in OpenAI embeddings [16] (indicated in the blue line) changes when the distribution of news topics changes over time, as shown in the image below.
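To make the idea of monitoring embeddings concrete, here is a minimal sketch. It assumes embeddings arrive as plain lists of floats and scores drift as the cosine distance between the mean embedding of a baseline window and that of a production window. The metric, the toy vectors, and the `DRIFT_THRESHOLD` value are illustrative assumptions, not the method used in the cited work.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty batch of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one embedding")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline: Sequence[Sequence[float]],
                    production: Sequence[Sequence[float]]) -> float:
    """Drift score between two windows of embeddings (assumed metric)."""
    return cosine_distance(mean_vector(baseline), mean_vector(production))


# Toy 3-dimensional embeddings; real embeddings have far more dimensions.
baseline = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
same_topic = [[0.95, 0.05, 0.0], [1.0, 0.0, 0.0]]
new_topic = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

DRIFT_THRESHOLD = 0.2  # hypothetical alerting threshold

for name, window in [("same_topic", same_topic), ("new_topic", new_topic)]:
    score = embedding_drift(baseline, window)
    status = "ALERT" if score > DRIFT_THRESHOLD else "ok"
    print(f"{name}: drift={score:.3f} {status}")
```

Comparing window means is deliberately simple. It can miss shifts that change the spread of the data without moving its center, so a production setup would likely use a richer distributional comparison.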
OpenAI, Cohere, and Anthropic have enabled easy access to their generative AI models via APIs. However, costs can add up quickly. For example, 750K words of generated text costs $30 on OpenAI’s GPT-4 [17], while the same can cost about $11 on Cohere [18] (assuming 6 characters per word). Teams need not only to stay on top of these expenses but also to assess which AI-as-an-API provider gives them the better bang for the buck. Tracking costs alongside performance metrics gives a better grasp of ROI and potential cost savings.
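The arithmetic behind this kind of cost tracking can be sketched as follows. The provider names, the flat per-token rates, and the 0.75 words-per-token ratio are hypothetical assumptions chosen only so the totals echo the figures quoted above. They are not taken from any provider's price list, and real pricing often differs for prompt and completion tokens.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Pricing:
    usd_per_1k_tokens: float       # assumed flat rate (hypothetical)
    words_per_token: float = 0.75  # rough rule of thumb, an assumption


def estimate_cost_usd(words: int, pricing: Pricing) -> float:
    """Convert a word count to tokens, then to dollars."""
    tokens = words / pricing.words_per_token
    return tokens / 1000 * pricing.usd_per_1k_tokens


def cost_by_provider(usage: Iterable[Tuple[str, int]],
                     price_list: Dict[str, Pricing]) -> Dict[str, float]:
    """Sum estimated spend per provider from (provider, words) records."""
    totals: Dict[str, float] = {}
    for provider, words in usage:
        cost = estimate_cost_usd(words, price_list[provider])
        totals[provider] = totals.get(provider, 0.0) + cost
    return totals


# Hypothetical providers and rates, for illustration only.
PRICE_LIST = {
    "provider_a": Pricing(usd_per_1k_tokens=0.03),
    "provider_b": Pricing(usd_per_1k_tokens=0.011),
}

# Each record is (provider, words generated) from an application log.
usage_log = [
    ("provider_a", 500_000),
    ("provider_b", 750_000),
    ("provider_a", 250_000),
]

totals = cost_by_provider(usage_log, PRICE_LIST)
for provider, usd in sorted(totals.items()):
    print(f"{provider}: ${usd:.2f}")
```

Logging spend per provider next to quality metrics is what makes a bang-for-the-buck comparison possible.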
Prompts are the most common way end users interact with generative AI models. Users typically cycle through multiple iterations of a prompt to refine the output before reaching the final version. This iterative process is a hurdle and has even spawned the growing field of Prompt Engineering [19]. Some prompts contain too little information to generate the desired output, while others simply produce subpar results. In either case, users will be dissatisfied. These instances need to be captured from user feedback and collated to understand when model quality is frustrating users, so that the findings can be used to fine-tune or change the model.
Generative AI models, especially LLMs like GPT-3 with 175B parameters, can require intensive compute to run inference. For example, Stable Diffusion inference benchmarking shows a latency of well over 5 seconds [20] for 512 x 512 resolution images, even on state-of-the-art GPUs. Once network delays are taken into account, round-trip latencies for each API call can grow quickly. As customers typically expect to engage with generative AI applications in real time, anything over a few seconds can hurt user experience. Latency tracking can serve as an early warning system that keeps model availability issues from impacting business metrics.
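As a rough illustration of latency tracking, the sketch below times calls, computes a nearest-rank percentile over the recorded latencies, and flags a breach of a latency budget. The 3-second budget, the choice of the 95th percentile, and the sample values are hypothetical assumptions rather than recommendations.

```python
import math
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def timed_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return (result, elapsed seconds) for latency logging."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


LATENCY_BUDGET_S = 3.0  # hypothetical budget for "a few seconds"


def breaches_budget(latencies: Sequence[float],
                    pct: float = 95.0,
                    budget_s: float = LATENCY_BUDGET_S) -> bool:
    """True when the chosen percentile exceeds the latency budget."""
    return percentile(latencies, pct) > budget_s


# Made-up round-trip latencies in seconds, with one slow outlier.
latencies_s = [0.8, 1.1, 0.9, 1.0, 6.2, 1.2, 0.7, 1.3, 0.9, 1.0]

print("p50:", percentile(latencies_s, 50))
print("p95:", percentile(latencies_s, 95))
print("alert:", breaches_budget(latencies_s))
```

Tracking a high percentile rather than the mean keeps a handful of slow calls from hiding behind many fast ones.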
Self-supervised training on large corpora of information can lead a model to inadvertently learn unsafe content and then share it with users. OpenAI, for one, has dedicated resources to putting safety guardrails [21] in place for its models and has even shared a System Card [22] that outlines the safety challenges that were explored. As guidelines for safeguards like these are still evolving, not all generative AI models have them in place.
Therefore, models might generate unsafe content, whether prompted or unprompted. Inaccurate or unsafe outputs can have serious consequences, especially in critical use cases where they could cause harm, such as giving incorrect or misleading medical information or encouraging self-harm. Safety must be closely monitored using user feedback on the model’s objectionable outputs as well as signals about users’ objectionable inputs, so that these models can be replaced or additional constraints can be imposed, if necessary.
As new versions of generative AI models are released, application teams should evaluate their effectiveness before transitioning their business completely to the new version. Performance, tone, prompt engineering, and quality may vary between versions, so teams should run A/B tests and evaluate shadow traffic to ensure they’re using the right version. Tracking metrics across versions gives teams the information and context to make this decision.
Explainable AI Meets Generative AI
Model bias [23] and output transparency are lingering concerns for all ML models, and they are exacerbated by the large datasets and complexity of generative AI models. After the initial furor over LLMs not providing sources for their answers, Bing’s recent update to its chat language model often cites its sources to be more transparent with users.
Explainability, the degree to which a human can understand the decision-making process of an ML model, was originally applied to simpler ML models to good effect and can be extended to these complex models. This is particularly important in applications where the model's output has significant consequences, such as medical diagnoses or lending decisions. For instance, if a medical support tool were to use an LLM to diagnose a disease, medical practitioners would need to understand how the model arrived at a particular diagnosis in order to trust its output.
However, explainable AI for these complicated model architectures is still a topic of active research. Techniques like Chain of Thought Prompting [24] are a promising direction for jointly obtaining model output and associated explanations.
Another approach could be to build a surrogate interpretable model (e.g., decision tree-based model) based on the inputs and outputs of the opaque LLM, and use the surrogate model for explaining the predictions made by the opaque LLM. Even though this explanation might not be of the highest fidelity, the directional guidance would still serve teams better than no guidance at all.
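The surrogate idea can be sketched in a few lines. Everything named here is hypothetical: `opaque_model` is a stand-in for calls to a real LLM, `FEATURES` is a made-up keyword list, and the surrogate is a depth-1 decision tree (a decision stump) chosen for brevity. The fidelity score is the fraction of examples on which the surrogate agrees with the opaque model.

```python
from collections import Counter
from typing import Dict, List, Tuple


def opaque_model(text: str) -> str:
    """Stand-in for an opaque LLM classifier (hypothetical)."""
    lowered = text.lower()
    return "negative" if ("refund" in lowered or "broken" in lowered) else "positive"


FEATURES = ["refund", "broken", "great", "shipping"]  # hypothetical keywords


def featurize(text: str) -> Dict[str, bool]:
    """Interpretable features: does each keyword appear in the text?"""
    lowered = text.lower()
    return {word: word in lowered for word in FEATURES}


def fit_stump(rows: List[Dict[str, bool]],
              labels: List[str]) -> Tuple[str, str, str, float]:
    """Pick the single keyword split that best reproduces the labels.

    Returns (feature, label_if_present, label_if_absent, fidelity).
    """
    best = None
    for feature in FEATURES:
        present = [y for r, y in zip(rows, labels) if r[feature]]
        absent = [y for r, y in zip(rows, labels) if not r[feature]]
        if not present or not absent:
            continue  # this keyword does not split the data
        label_present = Counter(present).most_common(1)[0][0]
        label_absent = Counter(absent).most_common(1)[0][0]
        agree = sum(
            y == (label_present if r[feature] else label_absent)
            for r, y in zip(rows, labels)
        )
        fidelity = agree / len(labels)
        if best is None or fidelity > best[3]:
            best = (feature, label_present, label_absent, fidelity)
    if best is None:
        raise ValueError("no feature splits the data")
    return best


texts = [
    "I want a refund now",
    "Item arrived broken, refund please",
    "Great product, fast shipping",
    "Shipping was slow but it works",
    "Great value",
    "The screen is broken",
]
rows = [featurize(t) for t in texts]
labels = [opaque_model(t) for t in texts]  # outputs of the opaque model

feature, if_present, if_absent, fidelity = fit_stump(rows, labels)
print(f"if '{feature}' in text: {if_present}, else: {if_absent}")
print(f"fidelity to opaque model: {fidelity:.2f}")
```

On this toy data the stump agrees with the opaque model on five of six examples, which illustrates the trade-off described above: a readable rule with imperfect fidelity.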
There’s also recent work on NLP models that predict outputs together with associated rationales. For instance, researchers have studied whether the generated rationales are faithful to the model predictions for T5 models [25]. When the model provides explanations in addition to the prediction, we need to vet the explanations and make sure that the model isn't using faulty (but convincing-to-humans) reasoning to arrive at a wrong conclusion. There's recent work on achieving both model robustness and explainability using the notion of machine-checkable concepts [26]. The work on rationales discussed above is also relevant in this context.
Finally, in adversarial settings wherein the model is intentionally designed to deceive the user, there’s work showing that post-hoc explanations could be misleading [27] or could be fooled via adversarial attacks [28] in the case of predictive AI models. As generative AI models tend to be more complex and opaque, we shouldn't be surprised by the presence of similar attacks.
Generative AI models can also incorporate biases [3-5] from the large corpora of data they’re trained on. As these datasets are often heavily skewed towards a small number [29] of ethnicities, cultures, demographic groups, and languages, the resulting generative AI model could be biased, and end up producing inaccurate results [30] for other cultures. Such biases can show up in blatant or subtle ways [31]. In the recipe example below, surely there are dishes from other cultures that these ingredients can make.
More broadly, large language models and other generative AI models have been shown to exhibit common gender stereotypes [32], biases associating a religious group with violence [33], sexual objectification bias [31], and possibly several other types of biases that have not yet been discovered. Hence, it’s crucial to identify and mitigate any biases that may be present in a generative AI model before deploying it. Such bias detection and mitigation involves several steps, including but not limited to the following: understanding the sources from which the training dataset was obtained and how it was curated; ensuring that the training dataset is representative and of good quality across different demographic groups; evaluating biases in the pre-trained word embedding models; and evaluating how model performance varies for different demographic groups. Even after the model is deployed, it’s important to continue monitoring for biases and to take corrective action as needed.
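One of these steps, evaluating how model performance varies across demographic groups, can be sketched as below. The group names and records are made up, and the boolean "acceptable" flag is assumed to come from some human or automated judgment of each output, since correctness for generative models is often ill-defined.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of acceptable outputs per group from (group, acceptable) records."""
    totals: Dict[str, int] = defaultdict(int)
    acceptable: Dict[str, int] = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        acceptable[group] += int(ok)
    return {group: acceptable[group] / totals[group] for group in totals}


def max_gap(per_group: Dict[str, float]) -> float:
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical evaluation records: (group, was the output judged acceptable?)
records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

per_group = accuracy_by_group(records)
print(per_group)
print("gap:", max_gap(per_group))
```

Running the same comparison on production traffic at regular intervals is one way to keep monitoring for bias after deployment.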
Completing the Generative AI Stack
The generative AI stack, like the MLOps stack, therefore needs to have model monitoring to monitor, understand, and safeguard deployment of these models.
Model monitoring connects to AI applications, model hubs, or hosted models to continuously monitor their inputs and outputs in order to gain insights from metrics, provide model explanations [34] to end users and developers building applications on top of these models, and detect potential biases in the data being fed to these models.
Future of Generative AI
Generative AI is still in its early stages. If the rapid advances over the past two years are any indicator, this year is shaping up to be even bigger for generative AI as it goes multi-modal. Google has already released MusicLM [35], which gives anyone the ability to generate music from text, while GPT-4 can now be prompted with images.
Accelerated adoption of generative AI can, however, only happen as tooling matures. The generative AI Ops or LLMOps workflow needs to advance in training, tuning, deploying, monitoring, and explaining so that fine-tuning, deployment, and inference challenges are addressed. These changes will come quickly. For example, Google AI recently introduced Muse [36], which uses a masked generative transformer model instead of pixel-space diffusion or autoregressive models to create visuals, speeding up run times by 10x compared to Imagen with a smaller footprint of 900 million parameters.
With the right tools in place, 2023 will kick off the industrialization of generative AI and set the pace for its future adoption.
References
[1] Matt Bornstein, Guido Appenzeller, Martin Casado, Who Owns the Generative AI Platform? | Andreessen Horowitz, January 2023
[2] Nazneen Rajani, The Wild West of NLP Modeling, Evaluation and Documentation, Keynote at EMNLP 2022
[3] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell, On the Dangers of Stochastic Parrots: Can Language Models be Too Big? FAccT 2021
[4] Rishi Bommasani et al., On the Opportunities and Risks of Foundation Models, Stanford Center for Research on Foundation Models (CRFM) Report, 2021
[5] Mary Reagan, Not All Rainbows and Sunshine: The Darker Side of ChatGPT, Towards Data Science, January 2023
[6] Amal Iyer, Expect The Unexpected: The Importance of Model Robustness | Fiddler AI Blog, February 2023
[7] https://openai.com/customer-stories/morgan-stanley, March 2023
[8] https://openai.com/customer-stories/stripe, March 2023
[9] Sal Khan, Harnessing GPT-4 so that all students benefit. A nonprofit approach for equal access, Khan Academy Blog, March 2023
[10] Introducing Our Virtual Volunteer Tool for People who are Blind or Have Low Vision, Powered by OpenAI’s GPT-4, Be My Eyes Blog, March 2023
[11] Martin Coulter, Greg Bensinger, Alphabet shares dive after Google AI chatbot Bard flubs answer in ad | Reuters, February 2023
[12] Tom Warren, Microsoft’s Bing AI, like Google’s, also made dumb mistakes during first demo - The Verge, February 2023
[13] Pranav Dixit, Why Are AI-Generated Hands So Messed Up?, BuzzFeed News, January 2023
[14] Cherlynn Low, Google is opening up access to its Bard AI chatbot today | Engadget, March 2023
[15] Amy Hodler, Drift in Machine Learning: How to Identify Issues Before You Have a Problem | Fiddler AI Blog, January 2022
[16] Bashir Rastegarpanah, Monitoring Natural Language Processing Models -- Monitoring OpenAI text embeddings using Fiddler, Fiddler AI Blog, February 2023
[20] Eole Cervenka, All You Need Is One GPU: Inference Benchmark for Stable Diffusion, Lambda Labs Blog, October 2022
[21] Steve Mollman, OpenAI CEO Sam Altman warns that other A.I. developers working on ChatGPT-like tools won’t put on safety limits—and the clock is ticking | Fortune, March 2023
[22] GPT-4 System Card | OpenAI, March 2023
[23] Mary Reagan, AI Explained: Understanding Bias and Fairness in AI Systems | Fiddler AI Blog, March 2021
[24] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou, Chain of Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022
[25] Sarah Wiegreffe, Ana Marasović, Noah A. Smith, Measuring Association Between Labels and Free-Text Rationales, EMNLP 2021
[26] Vedant Nanda, Till Speicher, John P. Dickerson, Krishna P. Gummadi, Muhammad Bilal Zafar, Unifying Model Explainability and Robustness via Machine-Checkable Concepts, arXiv, July 2020
[27] Himabindu Lakkaraju, Osbert Bastani, “How do I fool you?”: Manipulating User Trust via Misleading Black Box Explanations, AIES 2020
[28] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, Himabindu Lakkaraju, Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods, AIES 2020
[29] Ricardo Baeza-Yates, Language models fail to say what they mean or mean what they say | VentureBeat, March 2022
[30] Jenka, AI and the American Smile. How AI misrepresents culture through a facial expression, Medium, March 2023
[31] Robert Wolfe, Yiwei Yang, Bill Howe, Aylin Caliskan, Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias, arXiv, December 2022
[32] Li Lucy, David Bamman, Gender and Representation Bias in GPT-3 Generated Stories, ACL Workshop on Narrative Understanding, 2021
[33] Andrew Myers, Rooting Out Anti-Muslim Bias in Popular Language Model GPT-3, Stanford HAI News, 2021
[34] Ankur Taly and Aalok Shanbhag, Counterfactual Explanations vs. Attribution based Explanations | Fiddler AI Blog, March 2020
[35] Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank, MusicLM: Generating Music From Text, arXiv:2301.11325, January 2023 (https://google-research.github.io/seanet/musiclm/examples/)
[36] Daniel Dominguez, Google AI Unveils Muse, a New Text-to-Image Transformer Model, InfoQ, January 2023