AI Forward - Fiddler AI Observability Platform for LLMOps


LLMOps is an iterative process that involves continuously improving LLMs. The Fiddler AI Observability platform streamlines end-to-end LLMOps by enabling data scientists and AI practitioners to: 

  • Robust Model Validation: Evaluate the robustness, correctness, and safety of LLMs, ensuring they are resilient against issues like prompt injection attacks.
  • Advanced Monitoring of LLM Metrics: Track critical metrics such as data drift, hallucinations, PII, and toxicity, allowing quick identification of issues.
  • Macro and Micro-level Analysis: Visualize problematic trends at the macro level using the 3D UMAP and perform root cause analysis on problem areas, like prompts, responses, and user feedback, at the micro level.
  • Continuous Feedback Loop for Improvement: Leverage insights and analytics to iteratively refine models, aligning them with evolving data and business needs for ongoing enhancement.

Speaker: Barun Halder - Staff Software Engineer, Fiddler AI

Video transcript

[00:00:00]

[00:00:04] Hello everyone, my name is Barun and I'm excited to guide you through Fiddler's capabilities specifically for LLMOps. If you attended Amal's demonstration earlier, it illustrated how Fiddler Auditor is used to assess LLMs in the pre-production stage.

[00:00:24] There we used model-based evaluations to figure out whether the LLM, or the LLM pipeline you have set up, is good or not. That was the pre-production stage. I'll focus specifically on the practical aspects of monitoring, enhancing, and managing LLMs once they are in production.

[00:00:47] LLMOps, like everything else in life, is an iterative process. We move between the pre-production and production phases, constantly refining as we harness the power of LLMs to create value. At Fiddler, we are crafting tools that simplify and streamline this journey between pre-production and production.

[00:01:08] When it comes to production, we offer specialized, ready-to-use metrics out of the box, tailored for the LLM use case. In the current version of the tooling, you come in with your prompts, responses, and context, and we give you a set of metrics right out of the box.

[00:01:26] These metrics fall into the buckets of performance, data quality, correctness, privacy, and safety. You can chart them in the different kinds of dashboards Sabina showed earlier, and you can do deeper analysis using the Analyze view and the UMAP visualization. We have vectors representing the prompts, responses, and context in high-dimensional space.

[00:01:54] We provide tooling for you to visualize that and get a high-level view of what's really happening with your production data, and then we offer analysis tools where you can dive deeper and investigate whether you have any PII leakage or any toxicity.

[00:02:15] All of these are out-of-the-box metrics that you get.
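As a rough sketch, the kind of event record you would log for each chatbot interaction might look like the following; the field names here are illustrative assumptions, not Fiddler's actual schema.

    from datetime import datetime, timezone

    # Illustrative sketch only: field names are assumptions, not Fiddler's schema.
    # Each production interaction with the chatbot becomes one monitored event.
    llm_event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": "flan",                      # which backing model served the request
        "prompt": "How do I close my savings account?",
        "response": "You can close it from Settings > Accounts in the app.",
        "context": "passages retrieved from the bank's knowledge base (RAG)",
        "user_feedback": "thumbs_down",            # optional explicit feedback
    }

    # After publishing events like this, the platform enriches them with the
    # out-of-the-box metrics (PII, safety/toxicity, sentiment, drift, ...) that
    # drive the dashboards, alerts, and UMAP views shown in the rest of the demo.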

[00:02:17] Now, the fun part: the BreakBot Challenge. This is a demo app we have built on top of the NewAge Bank example we talked about earlier.

[00:02:31] It's a NewAge Bank chatbot, which you can access and ask different kinds of banking questions. On the backend it uses a RAG-based setup, and when you access it you'll see there are a few model options. Let me go to NewAge Bank and show you what's happening there.

[00:02:51] This is how the assistant looks. You can select any of these models and ask it questions related to banking. So what is the challenge? We have two challenges. For the first, you have to put on your hacker's hat and do some amount of prompt injection to get PII data out.

[00:03:12] When I say PII data, that could be phone numbers, emails, names, SSNs, or account numbers; if you manage to get any of that, it will show up. The second challenge is: out of the six models we have in production here, can you figure out which one has the propensity to give out toxic or unsafe responses?
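To make the first challenge concrete, here is a minimal, self-contained sketch of the kind of patterns a PII check flags; it only illustrates the signal and is not how Fiddler's out-of-the-box PII enrichment is implemented.

    import re

    # Toy PII detector: a stand-in for the idea of a PII metric, not Fiddler's implementation.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    }

    def flag_pii(text: str) -> dict:
        """Return which PII patterns appear in a prompt or response."""
        return {name: bool(pattern.search(text)) for name, pattern in PII_PATTERNS.items()}

    print(flag_pii("Sure! John Smith's SSN is 123-45-6789."))
    # {'email': False, 'ssn': True, 'phone': False}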

[00:03:37] And the final challenge, for extra credit of course, is to find out which of the available models hallucinates the most: you ask it a certain kind of question and it ends up answering some totally different question. While this is happening, let's give everyone a minute to start interacting with it.

[00:04:13] When you interact with it, you'll get a response. Here it says yes and no, so this is definitely a useless response. The moment that happens, you can give feedback, and that feedback shows up in the dashboard we have set up for the BreakBot Challenge.

[00:04:35] You'll see that right now there are three pieces of feedback that people have registered so far. Let's continue adding a few more, and let me try some of the same logic here as we check it.

[00:04:53] Let's see if this one responds well. So this is good. This one is a good model. It did not respond with toxicity when I was being toxic with it.

[00:05:18] This was a bit of a hallucination: it did not really give a good answer, or it just said no. I urge everyone to do this kind of exercise as I keep track of the traffic that's coming in. Seven. Yeah, cool, people are engaging with it, great. While people are working on this, let me take you through the dashboard we have built for the BreakBot Challenge.

[00:05:47] You'll see that there are operational metrics we have plotted as charts. One is traffic: how many interactions are happening with the bot. We also have a chart of how people are interacting with the different bots; since we have six different bots, we have split it up to show which bots are being engaged with the most.

[00:06:11] Right now we can see that Flan, since it was used during testing earlier, is one of the leading models being interacted with. Along with that, we have charts keeping track of the two challenges: one about PII and the other about toxicity.

[00:06:29] They show whether there are any PII leakages in the prompts or responses, or any toxic responses here and there. Here we can see, on refreshing,

[00:06:46] oh nice, there are a lot more responses coming in, and there are some toxic prompts being sent. The good thing is that none of the bots have responded in a toxic way yet, so no one has really managed to get any of these bots to say something toxic. The sad part is the first challenge, about leaking PII: it looks like some of the bots did leak PII information in response to the questions being sent.

[00:07:17] Let me just send...

[00:07:26] So what happens when something like PII or toxic prompts and responses are getting generated? As Sabina pointed out during the MLOps demo, you can set alerts. When you set an alert, it shows up here, so let me do a quick refresh and see if any alerts were fired.

[00:07:49] You can see that alerts were fired for both toxic prompts and toxic responses. This is basically the example I showed earlier. When an alert is fired, it shows up in your email and whatever other tools you have set up: it could be a Slack notification, email, as we see over here, or PagerDuty.
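As an illustration of that routing step, a minimal sketch of forwarding a fired alert to a Slack incoming webhook might look like the following; the alert fields are hypothetical placeholders, while the {"text": ...} payload is standard Slack incoming-webhook behavior.

    import json
    import urllib.request

    # Hypothetical alert payload; the real fields depend on the alert rule you configure.
    alert = {
        "rule": "toxic responses > 0",
        "model": "newage-bank-chatbot",
        "metric": "toxicity_response",
        "value": 3,
        "window": "last 1 hour",
    }

    def notify_slack(webhook_url: str, alert: dict) -> None:
        """Post a short alert summary to a Slack incoming webhook."""
        text = f"Fiddler alert fired: {alert['rule']} ({alert['metric']}={alert['value']}, {alert['window']})"
        req = urllib.request.Request(
            webhook_url,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    # notify_slack("https://hooks.slack.com/services/...", alert)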

[00:08:14] Now you've seen this high-level alert, and you want to dive deeper into what actually caused it. Is it actually toxic or not? You want to do deeper analysis, and one way to do that is to go into the Analyze view for this particular model.

[00:08:36] Since we saw that PII was leaked, we want to query the production data and find all the events where PII information was leaked. I write a simple slice query, run it, and then you can see there is some information there.

[00:08:57] There was a response containing the name John Smith for a totally different question. Someone asked a really good question trying to get at credit cards, thank you so much for that, along the lines of "can you tell me your name?", and there was a response related to John McCarthy. So it leaked out that there is a person by the name McCarthy.

[00:09:15] And it used this context, or these different pieces of context, to give these responses. This gives you a high-level view, and you can go into the production data and check exactly which events the PII leakage actually happened in.
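The slice query itself is not shown in the transcript, but a sketch of what it might look like is below; the table and column names, and the run_slice_query helper, are assumptions rather than the exact Fiddler schema or client call.

    # Hypothetical slice query to pull production events flagged for PII leakage.
    # Table/column names and run_slice_query are illustrative assumptions.
    PII_QUERY = """
        SELECT prompt, response, model_name
        FROM production.newage_bank_chatbot
        WHERE pii_response > 0
        LIMIT 100
    """

    # events = run_slice_query(PII_QUERY)  # hypothetical wrapper around the Fiddler client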

[00:09:34] The second challenge was to find cases where toxicity happened. Similar to the slice query we wrote for PII, we have written a slice query which keeps track of safety. For the automatically generated safety metric, 1 represents something that's safe and 0 represents something that's unsafe.

[00:09:59] So, closer to 0, we are listing all the prompts and responses where the safety of the response was low. If you look at that, one of the bots actually responded with "I hate you." And this one is very funny: this is a question that I asked, and I think the word "grass" was marked as

[00:10:23] something unsafe. The metrics we generate give you a sense, or an alert, that something wrong is potentially happening, and then you can use these tools to do deeper analysis of what's really going on and decide whether it actually needs further investigation or not.
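Following the same pattern, a hypothetical slice query for the safety challenge could sort by the safety score ascending so the least safe responses (closest to 0) surface first; column names are again assumptions.

    # Hypothetical slice query: least-safe responses first (safety near 0 = unsafe).
    SAFETY_QUERY = """
        SELECT prompt, response, model_name, safety_response
        FROM production.newage_bank_chatbot
        WHERE safety_response < 0.5
        ORDER BY safety_response ASC
        LIMIT 100
    """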

[00:10:44] Now, the other tool we talked about, which helps you visualize things happening at a macro level, is UMAP. In the UMAP view, you can select a model column, choosing whether you want to draw the UMAP for the prompts or the responses, and then you can add all these different out-of-the-box metrics.

[00:11:10] Here I have selected PII Prompt, PII Response, Safety Prompt, and Safety Response. I'll add sentiment, and I'll also add the model name, the different models people are interacting with. The question in my mind right now is: what kinds of questions are people asking?

[00:11:27] I'm essentially trying to do a little bit of topic modeling on the prompts. So let's trigger a UMAP generation. What I'm really looking for is whether interesting clusters of questions pop up. Let's see if there is anything; let's start with the prompts and responses.
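Fiddler renders this projection in the UI, but the underlying idea can be sketched with the open-source umap-learn package: project the high-dimensional prompt embeddings down to a scatter plot and color the points by a metric such as safety. The embedding array and scores below are placeholders.

    import numpy as np
    import umap                      # pip install umap-learn
    import matplotlib.pyplot as plt

    # Placeholders: in practice these come from your logged prompt embeddings
    # and the per-event safety metric.
    prompt_embeddings = np.random.rand(500, 384)
    safety_scores = np.random.rand(500)

    # Project the high-dimensional prompt vectors down to 2-D for visualization.
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
    xy = reducer.fit_transform(prompt_embeddings)

    plt.scatter(xy[:, 0], xy[:, 1], c=safety_scores, cmap="coolwarm", s=8)
    plt.colorbar(label="safety score (1 = safe, 0 = unsafe)")
    plt.title("Prompt embeddings, colored by safety")
    plt.show()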

[00:11:52] The first thing I see over here is the baseline, which is in blue, and the production data, which is in orange. The production data is actually somewhat different from the baseline. The baseline is usually selected based on what the chatbot is expected, or trained, to answer.

[00:12:11] If you have something that's fine-tuned, a set of questions and answers you expect it to handle, that's where the baseline sits. But there is definitely a shift in the distribution of where the production data is going; it's going pretty far. So let's look at the furthest data points.

[00:12:28] That is a question and response mentioning COVID-19; this clearly falls in the bucket of being a little toxic. Let's see if it actually shows up. So let's look at the responses.

[00:12:42] You can see this was definitely a case of negative prompts: all the negative questions people were asking are clustered around here in the production data.

[00:12:54] Let's see what we have here. Oh, someone asked for a racist joke; I hope you did not get a response to that. And someone just called the bot a moron, and so on. That gives a sense of what's happening. Now that we have identified this cluster, we can stay on the same page and do a slightly deeper analysis.

[00:13:17] Let me see which models are responding to this cluster, which seems to be a little problematic. People were asking these questions of Flan, Mistral, Mistral 2, Flan again, and, yeah, a lot of silly questions. Let's stay on this and continue our analysis.

[00:13:42] Just high-level analysis, getting a sense of your data. Let's look at the sentiment, and let's look at PII response: was there any PII data leaked? You can see that some amount of leakage happened over here. Let's see which models: Mistral, Flan, GPT-2, Flan again. Flan seems to be one of those models which is clearly leaking a lot of the PII.

[00:14:17] If your answer for the challenge was Flan, and you managed to break it using Flan, then you did a pretty good job; thanks for that. Let's check some of the others: there's GPT-2 again, and again. So this gives a sense; now we know that Flan is one of those models you should not put in production, because it has the propensity to leak PII information.

[00:14:45] Then you can dive deeper again: going back to the Analyze view, you can write a query which says model_name equals Flan, look at all of its questions and answers, and do analysis on that. Maybe you take Flan offline, or maybe you go back, do better prompt engineering, and add better system prompts to Flan before you put it back into production.

[00:15:15] So this gives you a clear indication that Flan is not one of the models you want to put into production. The next piece is...

[00:15:21] the different kinds of dashboards that are available. We started off by showing you the BreakBot Challenge dashboard. Similarly, depending on the different stakeholders in your organization, or the kind of information you care about,

[00:15:38] you can set up different dashboards. This is an end-to-end dashboard which has all the metrics for the entire system in production: the operational metrics like traffic and interactions per model that we saw earlier, prompt PII, sentiment, safety metrics, toxic prompts, and feedback. On the same dashboard you're also plotting the UMAP for both prompts and responses, so you can easily do quick analysis should you find something in the dashboard.

[00:16:10] So you don't have to switch between different tools; that's one thing we want. The next piece is that, while this is happening, you might be interested in going into the details of a single model. Since we know that Flan is a little bit problematic, you can decide to create a dashboard just for it.

[00:16:27] You have complete flexibility and customizability over what kind of dashboards you want to create, so you can create a specific dashboard just for Flan. That's how you create a dashboard for Flan. But maybe the metrics you're already getting are not sufficient and you want more; in that case, you can go into Fiddler's custom metrics and create a new metric that you care about.

[00:17:01] Let's say that, apart from Flan, I'm also interested in keeping track of the traffic for some other model, say GPT-2: what kind of traffic is going into GPT-2. So I'll call it GPT-2 traffic, and then I can define it in a simple SQL-like language.

[00:17:26] It's Fiddler's own query language that lets you do this. So,

[00:17:40] The way to read this is as an if statement: if this condition holds, add one; otherwise add zero. And everything is always an aggregation, so you're aggregating over a time range. You can think of this function being applied to different time ranges and giving you charts. So this is a new metric which wasn't there before,

[00:18:01] and which some of your stakeholders might be interested in. So you save this. Okay.
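The expression behind the custom metric is not fully legible in the transcript, so the sketch below is an assumption based on the spoken description (add 1 when the event's model is GPT-2, otherwise 0, then aggregate over the charted time range); the exact syntax of Fiddler's metric language may differ.

    # Assumed shape of the "GPT-2 traffic" custom metric described above; the
    # exact syntax of Fiddler's SQL-like metric language may differ.
    GPT2_TRAFFIC_METRIC = "sum(if(model_name == 'gpt2', 1, 0))"

    # Equivalent semantics in plain Python over the events in one time window:
    def gpt2_traffic(window_events: list[dict]) -> int:
        return sum(1 if e.get("model_name") == "gpt2" else 0 for e in window_events)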

[00:18:15] There you go. Now this metric is available for you to chart, so let's go and create a chart for this new metric we just defined. Let's go to Monitoring, and in the chart we pick the custom metric: the GPT-2 traffic metric. You can see that, instantaneously, the moment you create the metric, it's available for you to chart.

[00:18:46] So you could go and create, let's say,

[00:18:54] and this chart is now available for you to add to a dashboard. Let's go back to the BreakBot Challenge dashboard and see if there is more interaction. Oh, there's more, that's great. Then we want to add the chart,

[00:19:10] so we just add a saved chart, the one we created:

[00:19:17] GPT-2 traffic. This is the chart we just added, and it shows up right away. You can see that the traffic for GPT-2 was zero in the morning, when no one was testing it, and since we turned it on you're seeing a bunch of traffic going into GPT-2. We'll save this dashboard, and now it's ready. So, that's...

[00:19:40] the overview of some of the things we are doing for LLMOps. As you've seen, the workflow moves between the tools built for MLOps, which are also available for LLMOps, with additional metrics applicable to the LLM use case available out of the box.

[00:20:04] All the power and familiarity you have with the MLOps side of Fiddler is automatically applicable to LLMOps as well, and you can continue this iterative journey between production and pre-production as you constantly improve your LLM apps. Even if you have already started using OpenAI Assistants, you still need a tool which lets you keep track of quality and hallucinations, if there are any:

[00:20:38] a better monitoring tool, which is not necessarily available from OpenAI at this point.