
Top 5 Questions on LLMOps from our Generative AI Meets Responsible AI Summit

In case you missed it, we recently held our Generative AI Meets Responsible AI Summit, featuring great speakers and fascinating conversations! We've put together the top LLMOps questions asked by our attendees, along with the responses from our expert speakers below.

1. How should teams prioritize what to monitor when setting up observability for LLMOps systems?

Fiddler’s Director of Data Science, Josh Rubin, answered: “Model monitoring in the LLM and generative AI era has to do with monitoring a combination of the embedding vectors of prompts and responses, along with any metadata that can be used to define performance metrics and analytics over problematic regions. In this context, metadata can provide application-specific clues about how well a model performed in a particular scenario. Examples include a user interaction such as like/dislike or clicked/didn’t click, or a classification by a secondary model, such as a toxicity or appropriateness rating. The embeddings then provide a semantic index and a similarity metric for the other metrics.”
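To make that concrete, here is a minimal sketch of this kind of embedding-plus-metadata logging. It is illustrative only, not Fiddler's implementation; the sentence-transformers model, record schema, and feedback fields are all assumptions.

```python
# Embed each prompt/response pair and store it alongside feedback metadata,
# so that low-performing regions can later be found by embedding similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
event_log = []  # in practice this would be a monitoring store

def log_llm_event(prompt, response, metadata):
    """Record one LLM interaction with its embeddings and metadata."""
    event_log.append({
        "prompt_emb": embedder.encode(prompt),
        "response_emb": embedder.encode(response),
        "metadata": metadata,  # e.g. {"user_liked": False, "toxicity": 0.02}
    })

def similar_failures(query_prompt, top_k=5):
    """Find logged events semantically close to a known-bad prompt."""
    q = embedder.encode(query_prompt)
    bad = [e for e in event_log if not e["metadata"].get("user_liked", True)]
    scored = sorted(
        bad,
        key=lambda e: -np.dot(q, e["prompt_emb"])
        / (np.linalg.norm(q) * np.linalg.norm(e["prompt_emb"])),
    )
    return scored[:top_k]
```

In production, the log would live in a monitoring store and the similarity search would be backed by a vector index rather than a linear scan.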

2. What evaluation metrics that go beyond traditional performance would you recommend for language systems?

Jasper’s Director of AI, Saad Ansari, answered: “Certainly! One of my favorite areas is discussing evaluation metrics for language systems. We are all familiar with common technical metrics like BLEU and ROUGE, as well as benchmarks like BIG-bench. However, from a customer's perspective, the most important metrics are those that align with their needs and expectations. It's crucial to identify what would be most useful for them and convert those into measurable aspects of the generated content. This concept can be summarized as 'form follows function'.

For instance, at Jasper, BIG-bench didn't cover some aspects important to our customers, so we developed tailored metrics to measure their success criteria. While I can't dive too deep into details, let's say a marketing customer wants more LinkedIn likes on their blog posts. We could analyze content factors that correlate with higher LinkedIn engagement, such as semantic complexity, sentence structure, topic selection, tone, length, humor, etc. By conducting experiments to correlate success metrics with content metrics, we can more consistently produce content that meets those criteria. So, as a rule of thumb, remember that 'form follows function' applies even to metrics.”
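As a toy illustration of that "form follows function" workflow, the sketch below computes a few simple content features and checks how each correlates with a hypothetical engagement signal. The features, posts, and like counts are all made up for illustration.

```python
import numpy as np

def content_features(text):
    """Turn a post into a few crude, measurable content metrics."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return {
        "length_words": len(words),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

posts = [
    "Short, punchy take on AI tooling.",
    "A longer reflection on AI tooling. It covers several themes. Some ramble.",
    "Five quick tips for better prompts.",
    "An exhaustive, winding essay about prompting. It meanders through many "
    "loosely related ideas before arriving anywhere useful.",
]
likes = np.array([120, 40, 95, 15])  # hypothetical engagement labels

for name in ["length_words", "avg_sentence_len", "avg_word_len"]:
    feats = np.array([content_features(p)[name] for p in posts], dtype=float)
    r = np.corrcoef(feats, likes)[0, 1]  # Pearson correlation with likes
    print(f"{name}: r = {r:+.2f}")
```

A real pipeline would use richer features (tone, humor, semantic complexity) and far more data, but the shape of the experiment is the same: measure content properties, then correlate them with the customer's success metric.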

3. What about the adversarial robustness of LLMs? What is the current state here?

Fiddler’s Staff Data Scientist, Amal Iyer, answered: “Adversarial robustness is the ability of a machine learning model, including large language models (LLMs), to maintain its performance when subjected to adversarial inputs or attacks. These attacks typically involve small, carefully crafted perturbations to the input data with the intent of causing the model to produce incorrect or unexpected outputs.

The current state of adversarial robustness in LLMs is an active area of research. While LLMs have shown impressive performance in various natural language processing tasks, they remain vulnerable to adversarial attacks. Researchers have demonstrated that slight modifications to input text can cause LLMs to produce incorrect, biased, or nonsensical responses. Moreover, adversarial attacks on LLMs can exploit their lack of common sense or exploit biases present in the training data.

Efforts to improve the adversarial robustness of LLMs focus on techniques like adversarial training, where the model is trained with adversarial examples in addition to the original dataset. This aims to enhance the model's ability to recognize and resist adversarial inputs. Other approaches include developing methods to detect and filter adversarial inputs before they reach the model, or creating models with inherent defenses against such attacks.

Despite the ongoing research and advancements, achieving full adversarial robustness in LLMs remains a challenge. As LLMs continue to play a significant role in various applications, ensuring their security and model robustness against adversarial attacks is crucial for maintaining trust in these systems.”
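For a feel of what a basic robustness check can look like, here is a simple sketch: it applies crude character-swap perturbations and measures how often a model's prediction flips. The perturbation style and the stand-in predictor are assumptions for illustration; real adversarial attacks are far more carefully crafted.

```python
import random

def perturb(text, rate=0.1):
    """Swap adjacent characters at random: a crude adversarial perturbation."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def flip_rate(predict, prompts, n_trials=50):
    """Fraction of perturbed inputs whose prediction differs from the original."""
    flips = total = 0
    for p in prompts:
        baseline = predict(p)
        for _ in range(n_trials):
            flips += predict(perturb(p)) != baseline
            total += 1
    return flips / total

# Stand-in for a real model call: a brittle keyword classifier keeps the
# sketch self-contained. Swapping letters inside "good" fools it.
dummy_predict = lambda text: "positive" if "good" in text else "negative"
print(flip_rate(dummy_predict, ["This product is good and works well."]))
```

Adversarial training, as described above, would then fold such perturbed examples back into the training set so the model learns to resist them.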

4. Can you describe the difference between fine-tuning and prompt engineering for LLMs?

Amit Prakash, the CEO of ThoughtSpot, had this to say: “Large language models (LLMs) possess a unique emergent property that enables them to learn in context. This is often referred to as 'prompt engineering,' in which new information is provided in the prompt, allowing the model to perform reasoning based on that knowledge. This can be particularly useful for incorporating specific institutional knowledge, such as company-specific terminology or data sources.

On the other hand, fine-tuning is the process of adapting a pre-trained model, which has already learned from a vast amount of general data, to a specific problem by adjusting its weights or adding extra layers. This approach reduces the required training data and cost while producing a more capable model tailored to the task at hand.

In our case, we utilize a combination of both prompt engineering and fine-tuning. However, prompt engineering currently seems to offer more potential than fine-tuning.”
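To sketch the contrast in code: the first half below injects institutional knowledge into the prompt, while the second (commented out) gestures at weight-level fine-tuning. The glossary, template, and model names are hypothetical, and the fine-tuning lines assume the Hugging Face transformers stack.

```python
# --- Prompt engineering: put institutional knowledge in the prompt ---
COMPANY_GLOSSARY = {"ARR": "annual recurring revenue", "QTD": "quarter to date"}

def build_prompt(question: str) -> str:
    """Prepend company-specific context so the model can reason with it."""
    glossary = "\n".join(f"{k}: {v}" for k, v in COMPANY_GLOSSARY.items())
    return (
        "Answer the question using this company glossary:\n"
        f"{glossary}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("What was QTD ARR growth?"))

# --- Fine-tuning: change the model's weights instead (sketch only) ---
# Left commented so the file runs without a GPU or downloaded weights.
# from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# trainer = Trainer(model=model, args=TrainingArguments("out"),
#                   train_dataset=task_specific_dataset)
# trainer.train()
```

The prompt-engineering path needs no training run at all, which is part of why it often wins on iteration speed; fine-tuning pays an upfront training cost in exchange for baking knowledge into the weights.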

5. Do you believe the democratization of LLMs is on the horizon? LLMs like GPT-4 may be overkill in an enterprise setting, but an LLM built for a particular application and trained on task-specific data is more useful. These kinds of models do not have to be very large.

Google’s Dr. Ali Arsanjani replied, “I would absolutely agree with that supposition. As an example, an ensemble is made up of multiple smaller models, and each smaller model is cheaper and more focused. By employing this strategy for foundation models, you can selectively determine which foundation model, or fine-tuned or prompt-tuned model, to use based on the task at hand. These smaller models could serve as 'training wheels' for the industry when adopting large language models for enterprise applications. This way, we can mitigate the risks associated with large language models while still taking advantage of their creative capabilities in more constrained settings.”

Lavender AI’s Casey Corvino added, “I'm just really excited to see how people build with the democratization of these large language models. Anyone with a computer can now build really cool AI applications.”
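As a toy version of that routing idea, the sketch below classifies an incoming request and dispatches it to a smaller, task-specific model. The model registry and keyword heuristic are hypothetical placeholders; a production router might use a classifier or another LLM to pick the target.

```python
# Hypothetical registry of smaller, specialised models.
TASK_MODELS = {
    "summarize": "small-summarizer-v1",
    "sql": "fine-tuned-sql-gen",
    "default": "general-chat-model",
}

def route(request: str) -> str:
    """Dispatch a request to a smaller, task-specific model."""
    text = request.lower()
    if "summarize" in text or "tl;dr" in text:
        return TASK_MODELS["summarize"]
    if "sql" in text or "query" in text:
        return TASK_MODELS["sql"]
    return TASK_MODELS["default"]

print(route("Summarize this incident report"))       # small-summarizer-v1
print(route("Write a SQL query for top customers"))  # fine-tuned-sql-gen
```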

BONUS Question for the nerds out there:

Mary, will we get a recording of these sessions?

This was a very popular question, and the answer is yes! You can check out all the recordings here.

———

*Note: Both the questions and answers have been edited for clarity.*