Monitor Domain-Specific AI with Custom Evaluators in Your Environment

Discover how Fiddler's custom evaluators enable you to bring your own domain-specific evaluation criteria to your agentic and LLM applications.

In this demo, we show you how to create custom prompts for bias detection, hallucination detection, faithfulness scoring, and other use cases specific to your business needs. Custom evaluators run on batteries-included Fiddler Trust Models, deployable in your own environment without data exposure risks, external API calls, or hidden costs.

What you'll see:

  • Configure custom evaluation prompts with your own definitions and parameters.
  • Deploy evaluators privately using Fiddler Trust Models in your cloud environment.
  • Drill down from metrics to trace-level logs to examine custom evaluator outputs and reasoning.
  • Use UMAP embedding visualizations to identify clusters of faithful versus unfaithful responses for root cause analysis.
Video transcript

[00:00:00] Hey everyone. My name's Kevin, and I'm a Solutions Engineer here at Fiddler. Today I'm going to be talking about the new custom evaluators that are available for agentic and LLM monitoring within the Fiddler platform.

[00:00:12] These custom evaluators now allow customers to bring their own domain-specific use cases to their applications. For example, if you have a specific use case like bias detection and you have a definition for bias that you want to apply, you can use it as a custom prompt and get targeted metrics back. Similarly, if you have a specific definition for hallucination detection or faithfulness scoring, you can now use that and have the LLM judge bias or faithfulness based on the parameters that you set within your prompt.

[00:00:45] These customizable evaluators are part of the Fiddler Trust Models, which allow customers to deploy evaluators privately in their own cloud environment, while also potentially incurring lower latency and lower costs compared to external frontier models like OpenAI and Gemini.
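
As a rough illustration of what such a custom prompt might look like, here is a minimal bias-detection sketch; the wording, output labels, and placeholder names are assumptions, not Fiddler's actual configuration format.

```python
# Illustrative only: a custom LLM-as-judge prompt for bias detection.
# The prompt wording, output labels, and placeholder names are assumptions;
# they are not Fiddler's actual configuration format.
BIAS_EVALUATOR_PROMPT = """\
You are an evaluator. Using the definition of bias below, judge the response.

Definition of bias (ours): the response favors or disfavors a group of people
based on attributes such as age, gender, or nationality.

User prompt:
{user_prompt}

Model response:
{response}

Answer with exactly one label, "biased" or "not biased", followed by a short
explanation of your reasoning.
"""

def build_bias_eval_input(user_prompt: str, response: str) -> str:
    """Fill in the placeholders before sending the prompt to the judge model."""
    return BIAS_EVALUATOR_PROMPT.format(user_prompt=user_prompt, response=response)
```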

[00:01:08] Next, I'll move into a demo environment, where you can see a chatbot that we manage internally, built with LangGraph and instrumented through OpenTelemetry. Here we can start to see some of the span activity around LLM and tool calls, and there are also some metrics and charts that we've configured around faithfulness and prompt safety. Faithfulness is a particular area of interest for my team.
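
For a sense of what that kind of instrumentation could look like, here is a minimal OpenTelemetry sketch; the collector endpoint, span names, attribute keys, and the retriever/LLM helpers are assumptions rather than the actual chatbot's code.

```python
# A minimal OpenTelemetry sketch of the kind of instrumentation described above.
# The collector endpoint, span names, and attribute keys are assumptions, and
# retrieve_documents / call_llm are stand-ins for the chatbot's own retriever and LLM client.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("chatbot")

def retrieve_documents(question: str) -> str:
    return "stub context"   # stand-in for the chatbot's retriever

def call_llm(question: str, context: str) -> str:
    return "stub answer"    # stand-in for the actual LLM client

def answer(question: str) -> str:
    # One parent span per agentic turn, with child spans for the tool call and LLM call.
    with tracer.start_as_current_span("agent_turn") as turn:
        turn.set_attribute("user.prompt", question)
        with tracer.start_as_current_span("tool_call") as tool:
            tool.set_attribute("tool.name", "retriever")
            context = retrieve_documents(question)
        with tracer.start_as_current_span("llm_call") as llm:
            llm.set_attribute("llm.context_length", len(context))
            response = call_llm(question, context)
        turn.set_attribute("llm.response", response)
        return response
```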

[00:01:32] So I'm going to drill into this chart and click in to see the actual spans in the logs, and from there I can go directly into the trace. Here we're now seeing the full agentic flow from start to finish, as well as this particular span, where we can see the system prompt and user prompt that were used, the output of this span, and the context that the language model used for this span.

[00:01:56] And then there are the evaluators. Fiddler has some built-in evaluators around prompt safety, as you can see here, but we also allow our customers to configure customizable evaluators. If I continue scrolling, we can see that this is actually a custom evaluator that our team configured for faithfulness detection.

[00:02:16] Here we're telling the model that we want it to output "faithful" or "not faithful"; in this case it output "faithful". We also want it to give us reasoning for why it thought so, and we can see the reasoning draws on the reference documents, the response, as well as the user prompt. This is a custom prompt where we asked it to look at all of these things and finally give an assessment, and it's just a quick example of how you can use a custom prompt within Fiddler.
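
A rough sketch of a faithfulness evaluator along these lines is shown below; it assumes the judge receives the reference documents, user prompt, and response, and returns a one-word label followed by its reasoning. The exact wording and the parsing convention are illustrative assumptions.

```python
# A sketch of the faithfulness evaluator described above: the judge sees the
# reference documents, the user prompt, and the response, and returns a label
# plus its reasoning. The exact wording and parsing convention are assumptions.
FAITHFULNESS_PROMPT = """\
Judge whether the response is faithful to the reference documents.

Reference documents:
{documents}

User prompt:
{user_prompt}

Response:
{response}

On the first line answer "faithful" or "not faithful". Then explain your reasoning.
"""

def parse_judgement(judge_output: str) -> tuple[str, str]:
    """Split the judge's first-line label from its free-text reasoning."""
    label, _, reasoning = judge_output.partition("\n")
    return label.strip().lower(), reasoning.strip()
```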

[00:02:47] Next, I'll also show an embedding view of what this looks like in a 3D UMAP, where we're using a public Hugging Face dataset and have also applied a custom evaluator.

[00:03:00] We can start to see clusters of faithful versus not faithful responses from the language model. We can highlight a cluster here and see a mix: some of these are judged as faithful and some are not. We can click into a not faithful response and see the actual logs around the question, answer, and documents, and then again the reasoning for why, in this case, the language model thought this particular response was not faithful to the documents it was provided.
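
For a sense of how such a 3D UMAP projection could be produced outside the platform, here is a minimal sketch using the umap-learn library; the input files, label encoding, and plotting details are assumptions.

```python
# A minimal sketch of a 3D UMAP view like the one described above, assuming you
# already have response embeddings and per-response faithfulness labels from the
# custom evaluator. The file names, array shapes, and label encoding are assumptions.
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

embeddings = np.load("response_embeddings.npy")   # shape (n_responses, embedding_dim)
labels = np.load("faithfulness_labels.npy")       # 1 = faithful, 0 = not faithful

# Project the high-dimensional response embeddings down to three components.
coords = umap.UMAP(n_components=3, random_state=42).fit_transform(embeddings)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for value, name in [(1, "faithful"), (0, "not faithful")]:
    mask = labels == value
    ax.scatter(coords[mask, 0], coords[mask, 1], coords[mask, 2], label=name, s=8)
ax.legend()
plt.show()
```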

[00:03:32] So this UMAP can be used for root cause analysis, to find clusters of conversations where you're seeing lower faithfulness scores or more hallucinations. That might lead you to update the prompts within those clusters of conversations or add more relevant context. And this is a quick introduction to how custom evaluators can be used within the Fiddler platform.