Season 1 | Episode 19

The Agentic Gap With Jeff Dalton

‍

In this episode of AI Explained, we are joined by Jeff Dalton, Head of AI and Chief Scientist at Valence.

Jeff has spent two decades at the intersection of research and industry, from building early conversational search benchmarks at Carnegie Mellon and Microsoft to leading the AI behind Nadia, Valence's purpose-built enterprise coaching assistant. He discusses the fundamentals of agentic system design that still hold from classical AI theory, why evaluation has to come before the prompt, how he approaches memory as a first-class object in coaching systems, and the defense-in-depth approach to guardrails that keeps complex agents safe across diverse enterprise deployments.

About the Guest

Jeff Dalton is Head of AI and Chief Scientist at Valence, the company behind Nadia, the AI coach that nearly 100 of the Fortune 500 use. Jeff has more than 100 published research papers on search, information retrieval, and natural language understanding, and at Valence, his work focuses on best-in-class context and memory for AI coaching. Jeff is a Turing Fellow, and prior to joining Valence, he was at Google, where he developed language understanding capabilities for Google Assistant and built next-generation knowledge graphs for Google Search.

Transcript

[00:00:00]

Introductions

[00:00:06] Josh Rubin: So today we have, uh, a really special guest, um, Jeff Dalton, who's the Head of AI and, and Chief Scientist at Valence. He's also a professor of informatics at the University of Edinburgh. he's a world expert in, in conversational search. And has been working on this for, uh, at least a decade, um, in a lot of different contexts. Uh, he's leading, the development work behind, uh, Nadia at Valence, which is a chat assistant that's intended for coaching and improving people within enterprises and their enterprise specific contexts. so, uh, with that, welcome, welcome to Jeff. You wanna wanna pop up Jeff? Hi. So I, I thought I would start out by asking you a little bit about, you know, sort of your history you've been working on, like conversational search and conversation in the search and AI space for a long time. You know, obviously the last 15 years have been like, radical in the amount of change year to year.

[00:01:06] Jeff Dalton: Indeed.

[00:01:06] Josh Rubin: I thought you might want to a little bit about kind of what you focused on over that period at various places that you've worked, um, you know, and, and how dramatic these changes are and, and what you've seen happen, how that's changed the landscape.

[00:01:18] Jeff Dalton: Yeah. Thank you Josh. Um, so a little bit about me and my background. Um, I, I actually started my career in search. So after I graduated, I worked at a startup search engine back then. This was almost 20 years ago now. So it tells you how old I am, but, um, then I went and I, and I studied it in a, in research.

[00:01:33] Jeff Dalton: Um, and so my, my background straddles that intersection of research and industry. That, and I kind of bounced back and forth to making sure that we ha we're working on important things and we're, we're at the cutting edge of the frontier and we're also, uh, making that out and get, getting that to people.

[00:01:49] Jeff Dalton: Um, so when I did my PhD was in search, we were working on knowledge graphs. How do we make knowledge graphs successful to search engines? How do we make that really useful for people? Then when I, I got that, I went and worked for Google, surprisingly enough, I worked on health search there at Google, and what we really realized was that we.

[00:02:03] Jeff Dalton: The search box wasn't enough. We had to shift from, this is a conversation with the state about the user. So we worked to moving over to working on the Google Assistant, um, having conversational systems that, um, could start to do that. But they were really basic, right? We could, we, we could, you know, turn the lights off and on, play some songs, but what what we really needed was a health kind of virtual assistant that was conversational.

[00:02:24] Jeff Dalton: So that was, that was motivating it my work. But going, how that relates to search was we wanted to have that conversational assistant be able to do work on hard research problems. So research your health, need research complex information needs. So that's one of what I'm going back and kind of have built a research lab around conversational search and assistance.

[00:02:42] Jeff Dalton: So we worked on building evaluation benchmarks 'cause they didn't exist. Um, so, uh, for example, I built some of the first evaluation benchmarks for conversational search with partners at Microsoft and Carnegie Mellon. Uh, the TREC, conversational assistance track way back in 20 2018, 2019. Uh, ancient history now in these days.

[00:03:00] Jeff Dalton: Um. What we started to realize was at the time it was really hard with the language and, and the evaluation tools that we had, kind of traditional NLP pipelines. And we really noticed a shift with GPT. Two things that we thought were going to take, like be hard, um, was uh, suddenly became really easy and we solved within six months or a year.

[00:03:19] Jeff Dalton: So we continually up far some of the, those really hard benchmarks. Hard hand curated human data is still really fundamental. And so we continue to make actually really difficult, challenging benchmarks for that still hold up today even with state-of-the-art, uh, large language models. And now currently what the research is going, moving into personalization, um, tool use and agents and deep research agents, um, is kind of some of the kind of frontier of what some of what we're working on in the lab.

[00:03:46] Josh Rubin: Hmm. That's great. Great, great intro. I think before we dive into agents, one, I, I'm curious, just, uh, you know, some things have so dramatically changed that it might, uh, sort of beg the question of whether we just throw the old stuff out or, uh, you know, or there's also a philosophy that actually the old stuff really sort of. Methodologies, measurement techniques kind of undergird, um, best practice today. I'm kind of curious where you come down on that. Splits like some stuff must be, uh, easier. Uh, but you know, at least in my experience, I've found that, you know, simple is often still king and, uh, you know, uh, best practices don't, don't change a lot.

[00:04:31] Josh Rubin: Do you have a different, a different perspective on that?

Old vs. New: What's Changed in AI

[00:04:34] Jeff Dalton: I, I think I'll break that up into two pieces. First, it's just like a little bit of background, like on, on what I, what I think about it, and then what that, what also how, how we make that practical. Um, so stepping back for a minute, um, I, I've been at, we, there's a whole subfield of. Agents and agent systems.

[00:04:52] Jeff Dalton: It's been around since the 19, 1970s. Right. It's really fundamental work. You know, we, everyone, you know, uses the word agent, but they often mean different things and the word becomes very blurry. So if we go back for, I want to really be real precise there. It's like, if you go back to the textbook, Peter Norvig textbook, um, and others kind of AI 1 0 1, uh, we have an, we have an agent that can plan, it can take actions, it can observe the environment.

[00:05:17] Jeff Dalton: Maintain state, it has a representation of the world, state, and it can change over time. So those are kind of what the fundamentals of an agent are, and those haven't changed all since for 50 years. Really, fundamentally what's changed is how we implement them, which, which has changed is like the state representation, the fact that they, they're now operating in language rather than in small, kind of almost like finite state trees, right?

[00:05:40] Jeff Dalton: So that's kind of, that's, we still need to have those fundamentals. We need to know that, that background in order to know where this 'cause those patterns. That we had with early systems are gonna continue. Those are gonna recur. So we need to know our history. We need to know the fundamentals in computer science so that when we start moving to multi-agent systems, there's a whole, we have lots of algorithms for this.

[00:06:01] Jeff Dalton: We know there's a whole subfield of computer science that we need to make sure that we are using that history and use the learnings that we have from it. Um, that's, so that's part of part one. Part two is what does it actually mean? What, where, what, what hasn't changed? And what hasn't changed for me is the fact that we still have to have humans in the loop.

[00:06:21] Jeff Dalton: We still have to have humans looking at the data. I always tell my grad students, look at the data. Look at the data. Look at the data. Now I tell my agents the same thing. Look at the data. Stop. Look at the data. Go line by line. Have you looked at it yourself? How, how are things working? Um, these agents, these systems are very complex.

[00:06:38] Jeff Dalton: So you have to actually look at fundamental behavior. Can you do the task that you're trying to ask the agent? Um, how hard is it? Um, so trying to do some of those things yourself, starting really simple. Um, and also just having good human evals. So LLM as a judge is great. Everyone will is using them. It provides scale.

[00:06:57] Jeff Dalton: But have you just, have you opened up the file? Have you, have you labeled some data yourself? How do you, how can you do it? So those are just really simple fundamentals that aren't going away, that are just becoming even more important as we, as we move forward with the complexity of these systems.

[00:07:12] Josh Rubin: Yeah. Yeah. I, I think it's, it's, it's really easy to sort of, um,

[00:07:16] Jeff Dalton: Yeah.

[00:07:17] Josh Rubin: I don't know, un understate the importance of like, look at 10 examples with your own eyes. You know when people, you run tons of data, thousands of examples, but you know, what you learn with your human brain on, you know, 10 examples kind of gets you to 90% confidence if you can kinda spot what's going on. Oftentimes you can, uh, save a lot of effort, uh, by just applying your, your primate brain to a few examples that we've all been so. In, in our, our, you know, domain of expertise that evolution and, uh, our lived experience has, has provided us,

[00:07:56] Jeff Dalton: One of the, one of the common mistakes people really often make in grad students make, and I've made in the past is that we, we think, oh, I need lots of data. I need huge, huge amounts of data. I'm gonna run a massive experiment with, with thousands or tens of thousands of things. And you're like, well, yeah, sometimes you need that.

[00:08:11] Jeff Dalton: Um, but sometimes, like, start small, start simple. Find, go find the, the cases that are gonna be hard. Find the cases that are gonna cause errors and make sure that those are the ones that you're looking at first. And that effect, you're almost like, it's like a test first data, first hard case, first kind of philosophy.

[00:08:29] Jeff Dalton: Um, but start starting small and start simple that so that you can look at it and understand the behavior.

[00:08:34] Josh Rubin: Yeah. A hundred, a hundred percent. I, that's, I think those are, those are wise words to everybody out there.

[00:08:40] Jeff Dalton: Hard, hard lessons.

[00:08:41] Josh Rubin: Do you wanna talk a little bit about, um, I don't know how we think about agents today. I think you spoke a little sort of abstractly about like, sort of the definition of the agent, but you know, I think that there's, you know, people build things like, um, you know, there's, you know, uh, you know, we have RAG, we have MCP that's connecting everything together.

[00:09:00] Josh Rubin: We have different ways of managing state. I don't know, is there, how do you sort of. Characterize the landscape of technologies that people are sort of using right now to kind of construct agents? It does. I mean, it does go beyond the model. The models are changing in lots of important ways also. Um, but maybe you could paint a little picture around the, the, the agentic landscape and maybe how, I mean, I think that different guests that I've had have framed this slightly different ways.

[00:09:29] Josh Rubin: I'm sort of curious about how you think about the, technologies are involved and what parts are responsible for what and what changes have. Um, made any of this possible with large language models.

[00:09:42] Jeff Dalton: Yeah.

The Agentic Landscape & LLM Evolution

[00:09:43] Jeff Dalton: So, uh, let's step back just for a minute to talk about also where we've come from and, and how things have evolved.

[00:09:49] Josh Rubin: please.

[00:09:49] Jeff Dalton: And, and things are changing just so rapidly as well. Um, but let's stepping back a little bit, um, we've, this things really started with deep learning, right? So starting with kind of non generative systems, things that classify kind of deep neural networks, and those were super successful.

[00:10:04] Jeff Dalton: Um, and one with data at scale. And then we started moving into the, the GPT. We started moving into the generative AI world. Just the simple pre-trained foundation models that can generate language really well. Um, and that those were extremely powerful. And then we added, started, added instruction tuning that, so now they can follow instructions.

[00:10:23] Jeff Dalton: And then we, then we realize, okay, but that's not enough. So now we're, we, we're going to need to add tools so they can call and use calculators and they can, and then we now have part, as part of our post training mechanism. After the model's been pre-trained, we can now, we can now learn to use these tools at the right time and at the right place.

[00:10:39] Jeff Dalton: These models are now, um, they would sometimes call it distributed models. So there are architectures there that are like mixture of experts, and so under the hood, you're getting routed to an internal expert that can then work on answering, answering your question. So that's just a, there's a lot there that's been evolving and the found and the core foundation layer, um, and those actually starting to get a little bit blurry because some of those kind of core elements of like, oh, it can call tools now.

[00:11:03] Jeff Dalton: Oh, it, it can remember, it can remember this. These things, we can add a, some additional adapter layer that has some additional memory. Um, we can have specialized components. Those starting to make, that those underlying foundation models start to feel a little bit closer to agents, even though they're not quite there.

[00:11:18] Jeff Dalton: Um, and now with the, as we've seen with the latest models, like things like the, the next step was the reasoning models, but starting with a o series of models, right? So they now can have like reflective self-talk where they're talking to themselves effectively, not necessarily reasoning per se, but kind of reflective talk back and forth with each other.

[00:11:37] Jeff Dalton: Um, and that's gonna continue to evolve. Um, I, I think we're going to continue to see some. Really significant advances over the next, over the next year or two on that, particularly with how they incorporate memory and other SY kind of system components. I think the shift that we're seeing is, is that we go from that foundation model to kind of first agents to now we're moving to agentic systems.

[00:11:59] Jeff Dalton: So, so now we're, so we're in the land of Claude Code, which is really not just single agent, but it's a multi-agent system with multiple different kind of components that are, that are um, that are, that work together to help user accomplish complex tasks. And that's also what we're building at Valence.

[00:12:14] Jeff Dalton: When we're building an AI coach. We're building an agentic system that is helping people for coaching.

[00:12:21] Josh Rubin: I think that this point about, um, when I was just thinking of the, I'm just responding kind of the cuff to like mixture of experts as like a, you know, what is the, you know. What's the distinction between the model and the scaffolding, right? Like by the time you're talking about models that have kind of, I don't know, internal disaggregation or sparsity to them. Uh, I think it's a really interesting time when there are, there is, um, orchestration happening at lots of different layers kind of within the model. And then, you know, every people are talking about harnesses and talking about, um, talking about orchestration and. Um, and then interconnecting these things.

[00:13:06] Josh Rubin: Uh, you know, I don't know that, well, it's all to say it's evolving quickly. Um, um, what do I want to, what else do I wanna say here? Um, let's talk, I don't know if you wanna jump into talking about, um, Nadia and what you're doing at Valence and why that's a particularly interesting problem given, like where you're coming from.

[00:13:31] Jeff Dalton: That'd be great.

[00:13:32] Josh Rubin: You want to jump into that? Who Nadia is.

[00:13:36] Jeff Dalton: So, um, I'm Head of AI at, at Valence we're building an AI, an AI coach. And what really is exciting and kind of, kind of some of the key differences there is that we do, it is an assistant, it can help people. Um, but the goal of what we're trying to build with. With Nadia is that we're trying to build a coach and the, the goal of the coach is to improve your performance over time.

[00:13:58] Jeff Dalton: Fun. And that's fundamentally different from what an assistant is, which is just trying to help you draft, draft an email or, or, or do something that do a simple task. Our goal is, is transformation for you, your organization, your team, so that you're improving your performance, working with you, that's helping you be a better person that's augmenting you.

[00:14:15] Jeff Dalton: That's giving you what you need to do to, to be able to improve, um, so that the objective function there is not just did we complete your task, but did we help you learn and grow through that process? And so, and that's fundamentally different. So we're optimizing for value over time. Like how is this getting for you over time?

[00:14:33] Jeff Dalton: It is not even getting to know you. Is it personalized for you? And or are we also introducing productive friction? So are we stopping you from writing that email and saying like, oh, maybe that shouldn't be an email. Maybe you should go talk to that person. Or maybe this should, maybe you should, uh, do something else, right?

[00:14:48] Jeff Dalton: So by the way, you've been, you've sent other emails in the past and they haven't been successful. And also we, you've been talking in new way this history. You, you have communication issues. Let's make sure that we work on those. So that's what we're, we're building. The future of a AI coach that can work with you, that helps you in your flow of work and be able, helps you kinda get things done, but also helps you grow and improve for you and your team.

[00:15:11] Jeff Dalton: So kinda moving towards just, uh, an AI chat bot to actually, it knows your world. It knows it's connected to your organization, it has your organizational values. Let's learn and grow, grow with you over time. It knows your personality, your profile, how you work, and how you work with others so that it can really kind of coach you and, and do a better job than, um, because it's highly specialized for you and for your environment.

[00:15:37] Josh Rubin: Yeah, I hear a lot of things in there that I think are, uh, potentially technically challenging. Like I think the of personalization, you know, to you, to your org, I think the time horizon. How are you successful in the long term when you know the, the, uh, like you said, the objective function is. maybe harder to quantify when it's not directly observable in kind of little chunks.

[00:16:01] Josh Rubin: This I could see something like LLM as a judge struggling a little bit to provide a meaningful metric on sort of, um, you know, conversation turns or, uh, or even task, you know, uh, trace level kind of task completion. Um, do you, so, so tell, maybe you could say a little bit more about sort of. How you think about that problem and what differentiates it from, um, you know, what we can do with, you know, a clot or a chat GPT with a little bit of prompting, you know, a, a clever system prompt and some, some, you know, sort of simplistic memory mechanism. Where, where, where, where, where's the meat of this?

[00:16:47] Jeff Dalton: So, so there's a few different pillars we have here. So first, I think it's purpose-built to be a coach. So it's built, we have human coaches on staff, we have human coaching knowledge and best practices are encoded within it. Um, so you're, you're getting a kind of leading edge kind of coaching abilities kind of, that are grounded in, in fundamental principles that are, are there to facilitate growth and judgment and better actions over time.

[00:17:13] Jeff Dalton: Um, then, uh, what we really have is like memory is a really first class object for us, right? So if we're going to learn and grow with you over time, we have to make sure that we are. Are we remembering the right things? We're looking at the past conversations. We're analyzing patterns in those. Uh, we're then making sure that we have the, the memories of that.

[00:17:32] Jeff Dalton: You're, we're talking about the same, you're talking about the same people. So we know your people. We know that it's not just Josh, we know it's this Josh, right. That is there. So we've done kind of co reference resolution and understanding, and we give you control of that. You can see it, you can edit the people.

[00:17:45] Jeff Dalton: Oh, that's the wrong Josh. So you have control over that. So those are all built into the product. That's 'cause part of the product experience. Uh, end to end, not just in the models in terms of what, what we, what we have for context. It's, it's about transparency and control for the user to see, this is what we know about you, this is what we're learning, these are the, the growth areas that we have for you.

[00:18:04] Jeff Dalton: What, which ones do you wanna work on now? And giving you those choice choices in that process. Um, what we call, um. What makes kind of Nadia a little bit element there is what we call like the intelligence layer. So that's, you know, the intelligence to be able to take those past histories and be able to have a coaching lens or what we think are, the patterns are going to be for the hypotheses that we have for the future, um, but also that coaching intelligence at the, at the right time in the flow of the conversation.

[00:18:32] Jeff Dalton: When do we pull in memory? Have you mentioned Josh? Oh, by the way, we're gonna pull in, go pull in the connections and pull in the relevant kind of retrieval parts, um, of the conversation that are relevant to you right now. Um, and be able to have that, and have that also be, we know that, you know, we know this information about Josh.

[00:18:48] Jeff Dalton: This is the way he works, this is the way I work. This is, these are the failure modes that, that we could be experiencing. And so having that integrated into the kind of coaching knowledge as a kind of a fundamental pillar or how we work and how these people and the team should work together. As well as obviously for Nadia, it's about, um, it's it's foreign organization, so it's operating within the organizational constraints.

[00:19:10] Jeff Dalton: It's operating with a safety that the organization has. We have a, we can talk about it, but one of the fundamental things that we have is our guardrails and a safety system. So we, we want the coaching that we have to be safe and tailored for, for you and for your organization. And I can feel free to pick wherever you see where you want to go next.

[00:19:29] Josh Rubin: Oh, yeah. Um, I, I, I think I wanna go to measurement. I, I, you know, I think when you, you, you, you, uh, raised my eyebrows when you were talking about sort of objective, objective functions, right? When you are building something that, you know, uh, it's easy to pull a, you know, a chatbot off the shelf now, uh, it's remarkably easy to build something that has some basic utility. you know, much like the conversa, the previous conversation about like not jumping to analyzing 10,000 data points when you can get a whole lot out of 10 of them. Um. You also don't wanna build a whole lot of apparatus, a whole lot of scaffolding when the, know, the delta there in performance is, uh, you know, doesn't, doesn't support the investment, right?

[00:20:16] Josh Rubin: You, you wanna spend your energy where it's, it's most impactful. And I know, you know, from a Fiddler perspective, you know, we think a lot about how we support our, um, you know, our, our enterprise customers with, um, you know, our, our kind of. Three pillars are kind of production, Observability and guard, railing and evals experiments, experience.

[00:20:39] Josh Rubin: Uh, you know, we put a lot of sort of, uh, emphasis behind measurement as an important way of, know, identifying how well variant B is from variant A. Can you justify building out something more sophisticated? Also, um, uh. Where, where are the hotspots? Where is, what are, what are common failure modes? Maybe you wanna talk a little bit about, like, I, I guess I would really love to hear from you about, um, you know, what your perspective is on measurements and what kinds of, whether it's evals or human labeling or, uh. You know, any of those things. How do, how do you determine, you know, where the leverage is in improving your system? I guess it all goes back to like, what are the, you know, what's the objective function you're trying to optimize? So maybe, maybe you start there like, uh, what are some of your objective functions?

Measurement & Objective Functions

[00:21:35] Jeff Dalton: Well, I think I can, we'll talk about, I'll talk about objective functions, maybe a second may. Maybe first I can tell you a little bit about how we very consciously have built our system to be eval eval first, right? And be evaluatable. Um. So, for example, if people start out, if they're building, if they're building an agent system and they're building kind of, maybe they have some different prompts for different parts of the system, uh, you might start out with just some very simple prompts like, you know, do this or a natural language, and that's great.

[00:22:03] Jeff Dalton: Um, and that, that, that's a first version. And that's where we started with the first version of Nadia, for example. But what we found was how do we trace it? How do we understand the behavior? What's, what's going on? And so we, we actually, actually, our approach is actually much closer to code, so it's closer to the prompts as code where they're really kind of semi-structured objects.

[00:22:22] Jeff Dalton: Um, and so that allows us to say like, we are, we are taking an approach here. We are in grow coaching mode so that we know, and here's the set of instructions that we know that we we're in grow. And it has clearly defined steps and phases that were, that we can walk through so we can inspect those. So we take kinda a structured approach to the conversation so it's not just open-ended.

[00:22:41] Jeff Dalton: We have kind of playbooks and kind of coaching approaches and playbooks that we're, that we're executing, and because we're executing kind of these kind of structured coaching programs that we can switch, switch on and, and transform over the kind of on the fly, what that allows us to do is to basically see like, oh, we're executing instructions one and three and grow, and so when something goes wrong, we can trace that back to like, oh, we need to go modify these lines of the prompt, right to, in order to make things, to be able to make things work.

[00:23:09] Jeff Dalton: So there's, that's an element of like determinism. We can say like, should this be in G or should it be in R? So we can then evaluate offline with humans on label data sets like. Coaching is, is flowing as expected. So when we go to a model upgrade, we can say, ah, okay, the model is still transitioning at the right times, or, oh, something is stuck.

[00:23:28] Jeff Dalton: You get stuck in G and we're not progressing. There's something broken here. So we have internal thinking and traceability from the model to be able to see what it's, what is actually, we have our own proprietary kind of thinking model that we have to be able to for those structured programs. And I think that's really, that's really kind of fundamental to make sure that you have those traces.

[00:23:46] Jeff Dalton: So I'm really a fan of like, what, what you're doing with, with your work there. Um, I do wanna step back a little bit for talking about some of the objective function. Uh, what for us is, is a little bit more difficult. So some of the things, if you're drafting an email, that's fine, you can, you can kind of measure that.

[00:24:00] Jeff Dalton: Um, but kind of best practice and kind of science-based approaches for evaluating the systems is based on rubrics. So grading things on with clearly defined a measurement benchmark. So we just say. Coach, you can grade a coach, you can grade a coaching conversation with rubric. This goes back to kind of teaching first principles as a professor, right?

[00:24:18] Jeff Dalton: We have grading rubrics for when we graded grade papers. When we grade exams, we start with a rubric first. So for example, for our, we have our, our kind of rubric pillars. I can share a little bit about them.

Evaluating Coaching Quality with Rubrics

[00:24:29] Jeff Dalton: Um. We look at things like coaching, pro coaching as an explicit pillar. How good is, how good was the coaching in this conversation across different elements of coaching?

[00:24:38] Jeff Dalton: Did we facilitate, um, facilitate change? Did we, was it person forced? Did we let, um, did we, uh. They were new things from coaching about them, about the world and about themselves. For example, as one of those elements we also have, how is the flow of the conversation? Right. Did was it, did it proceed as expected?

[00:24:58] Jeff Dalton: And of course kind of user satisfaction, intent, right? It's kind of, kind of cross those di dimensions and we can, you can obviously dig into what those are, but those are kind. Some of those elements and kind of pillars of being able to evaluate a really subjective system. And that's where it takes kind of trained, calibrated humans.

[00:25:14] Jeff Dalton: We spend a lot of time, um, calibrating our coaches so that when on those rubrics and aligning people so that we're getting clear signal. 'cause the worst part is again, um, what's worse than having no evaluation, having bad evaluation, right. Because then you're, you're measuring the wrong thing and you're moving in the wrong direction, and you don't know if you're actually improving.

[00:25:36] Jeff Dalton: That's a really common failure case. I see. Um, and experienced, um, personally as well.

[00:25:43] Josh Rubin: Yeah. Um, yeah. That's super, super interesting. Do, how do you think, so I, one, one thing that, that we think about a lot with our customers is, um. I dunno, I guess I call it sort of, um, aggregate and anecdote. Um, so, you know, things like aggregate metrics, you know, you can measure across you, you can sample, you can measure across, you can get a Gen A, a general tendency for performance along some parameter that you care about.

[00:26:13] Josh Rubin: You, you measure it in some way. You define some sort of, um, you know. Figure of merit, some sort of, some sort of metric that is, uh, or some set of metrics that you, you want to, uh, observe over time. And you certainly want to know if there are significant deviations over time, um, that that means something is awry.

[00:26:32] Josh Rubin: Usually something has changed in the world. People are asking some sort of new question that corresponds to some new event. Probably you guys. up having some experiences where, I mean, I, I could imagine, you know, world events or local events could push your models to have conversations in regimes that they were never. specifically, uh, tested in the, the other end. Let me, you one. Well, the anecdote is the other piece. I, I, I am sorry. I don't mean to overload your, uh, the, the, the stack with, with thoughts, but it's also like, I think the other piece that I think about is like, how do you catch the, the dangerous rare corner cases?

[00:27:09] Josh Rubin: And I think that piece is a little tougher from sort of sampling and, you know, I wouldn't quite know how to, you need some sort of probably a, a mechanism to identify those before you can. Give them to your high quality human labeler. I'm a huge fan of the, um, getting, getting humans to actually label stuff.

[00:27:26] Josh Rubin: I love the idea of calibrating your coaches on, on real human expertise. Um, yeah, so, so I'm not sure quite what the question was in there, except that I, I was gonna try to, I was getting, try to press you a little bit on, um, how you get at the sort of tricky corner cases that may not show up so much in a, in a random sample.

[00:27:47] Josh Rubin: Right. Those might be high stakes corner cases, right?

[00:27:51] Jeff Dalton: Yeah, well, I, I guess we'll, we'll start a little bit about some of what, maybe I'll share a few anecdotes, um, for example, and kind of where things, where things are challenging. So, you know, safety is fundamental. We have guardrail systems that, that we, that we build. Um, and then we're constantly onboarding new clients, um, and new clients are joining every, every day.

[00:28:09] Jeff Dalton: Um, and um, one of those clients is a large legal firm. Um, and guess what? We have legal guardrail. Let's not give legal advice. But when people then start putting in, uploading their docs and like, help me coach, coach about this, about this document, um, are triggered our legal guardrails saying, I can't suddenly, I can't, um, give you legal advice.

[00:28:26] Jeff Dalton: Um, and that, that was like, we quickly had, we saw that in our logs and we're like, okay, now we have to, we have to go back and quickly revisit. What, what the definition of legal advice means and what that, what that means for this client. Um, so making sure that things are both safe in the general, but also customized there to be able to address some of those, not just educators, but real domain shifts in terms of where things are being deployed and being able to see those, see those issues.

[00:28:51] Jeff Dalton: So more for medical advice when we roll out to a health company, um, we have the, we have the same, we have the same issues, right? So we see those, we can anticipate some of them, but then it hit, we hit real users and the real users are, are messy. And real use is messy. And that's part of the, the fun kind of challenge that keeps us, um, that keeps us excited to be able to make these assistants kind of scalable for people over time.

[00:29:12] Jeff Dalton: But another, you mentioned, for example, um, we're a coach and so, um, you can imagine when there was a, were some elections recently, and then we saw when, after those elections people came in, they're like, I'm depressed. And or on one side or the other, um, you have reactions to those. Um, and like how do we handle that in a way that's tactful, thoughtful, um, that doesn't just block on a political guardrail, but that still allows us to like, respond to your emotions while still maintaining our neutrality.

[00:29:43] Jeff Dalton: Um, and kinda in a very. Um, in a very careful way. So that's some, some anecdotes from the front lines, um, from, from seeing sort, seeing some challenging situations there. Uh, the other kind of your kind of second question there is how do you find those cases? How do you see it? Um, and one of the things, one thing is there is that, that's where again, we all having LLMs.

[00:30:05] Jeff Dalton: Kind of run, run as judges looking for those kind of edge cases, anomalies, where things were working. Um, we sample sample failures, so we over sample cases where we have low what, what, what that LLM thinks is a low quality conversation and then kind of prioritize those for anonymized review. Um, and, and kind of sampling in privacy, preserving ways.

[00:30:32] Josh Rubin: Yeah, that's, that's great. That's really interesting. Um, you know, I think, uh, you touched on something that I think is, is I think, relevant for both of us, which is the, the challenges associated with, you know, the general versus the specific, right? Like if you are able to roll out a guardrail model that is specific to constraint. You know, uh, for legal, legal customers or for healthcare, um, know, even, even there, you're, the domain is reasonably broad that it's not straightforward. I think one of the, one of the things that we a lot of time thinking about and uh, is, is sort of a major challenge for, for, you know, my team that's developing sort of fairly general purpose guardrail models is how do you deal with the, um, you know, uh, dramatic differences between domains. and you know, I think there's a really interesting set of trade-offs around, you know, I think for probably many of our listeners here, right? Like, where is the right place to draw the balance for their particular use case? I mean, not everyone is a startup who gets to, um, uh, you know, focus on, know.

[00:31:46] Josh Rubin: Uh, one kind of, uh, you know, uh, coaching as a focus, you know, I mean, I, I suspect we have, um, you know, audience members who are, know, overseeing different applications across, you know, some financial services organization and, um, you know, it. Probably not a good use of resources or practical to totally dial in customization for every single one of those use cases down.

[00:32:14] Josh Rubin: You know, like how much can you repurpose from off the shelf where there's some, you know, expertise that's, uh, you know, baked and doing, doing a pretty good job at a thing versus, um, some of the benefits you get out of the, the very specific, uh, purpose-built thing. Um, I think, uh, you know, one of the things that I think sounds. Appealing about, about your role is you do really get to focus on this particular use case and, um, generate some sort of bespoke, uh, solutions to the kinds of problems that you have. How, maybe, maybe, I don't know, do, do you have any sort of advice you would give to the kind of spectrum of people out there operating in different orgs about, you know, when it's a good idea to go kind of whole hog into investing in, um, you know. Bespoke parts of an agentic system versus taking something that's much more off the shelf in a common design pattern.

[00:33:13] Jeff Dalton: Yeah, I mean, so I'm, I'm, I'm gonna start, I have a couple thoughts there. One of, one of the, the first challenges, do you have actually a clear. Requirements. Do you have, do you have a clear spec for what things need to do? Do you have some sample conversations, some sample use cases people maybe might rush to like, oh, I'm gonna make a prompt.

[00:33:29] Jeff Dalton: Like, whoa, whoa, whoa, stop. Before you write the prompt, what are, are you trying to do? Do you have some data that those five, 10 examples at least kind of for that to be able to know, at least have a sanity check if things are, if things are working. So if you, if you, again, if you can't measure it, how do you know if it's working or not?

[00:33:46] Jeff Dalton: I mean, so you need to have some data. For that to be able to have some samples. Now the big issue there is that the data, data is different when you actually deploy it, and then you actually have, but these are actually kind of standard known issues that we go, if you go back to machine learning, we have model drift and there's kind of just playbooks on how you, on how you basically make sure that your models are performing.

[00:34:04] Jeff Dalton: And that's, that's kind of standard practice you can look at. Some of the, the, some of the literature from, from Google and others who have been deploying ML models at scale across, and many of your listeners are doing the same thing. So it's making sure that you're running those same playbooks and running it with a, with the agentic system so that that's something that hasn't changed very much.

[00:34:22] Jeff Dalton: We just have to make sure that we actually do it in these cases. Um, I will say that for something like safety or guardrails, you might start with something off the shelf to begin with and see how, how it works. Um. We actually, we started with like, what, what was the simplest possible version that we could have as a first version as well?

[00:34:42] Jeff Dalton: And built it, built some of those, our ourselves. Um, and we continue to build those ourselves. 'cause some of that's our, our business value and it really matters to us. So it that, that's as a, as a safety feature, right? For, for our users. So if it really matters to you, then you should own it and build it. If it, if it's just something that you need to have, that's just, you know, that, that's just there.

[00:35:01] Jeff Dalton: Then go buy it and you know, take something that's kind of mo off the shelf. But I think from experience, what I would've seen is quickly you hit issues of, it doesn't quite work the way I want it to, then I have to adapt it. And if it's a big, complex framework that's been built around it, then it becomes, it's a lot harder to change, um, and be able to get that behavior, um, that you're looking for.

[00:35:22] Josh Rubin: Gotcha. Yeah, points. I think that, um, you know, even within our users, you know, we, while we try to provide some set of stuff that mostly, you know, works out of the box to provide some sort of general coverage, you know, safety, basic safety coverage and Observability, you know, some general purpose guardrails, um, you know. That's fine for, for some applications there are certain fairly standard things that we see that, um, could be improved, but are kind of at some reasonable level of quality with the, with the basic stuff. But, you know, one of the nice things about, you know, our product and the team I get to work with here is that, you know, we have a whole, you know, customer success team that's really good at helping. Our users, um, sort of adapt our product to the particular applications that the customers bring. So some of that is, you know, uh, custom metrics, you know, being able to throw, uh, you know, uh, uh, calculations of interest from a business perspective up onto a dashboard or, um, you know, customizable LLM as a judge that runs it.

[00:36:29] Josh Rubin: Automation at scale so that you know the specific thing that is important to your, your use case. At least you have a, you know. The, the model's al not always the best, but, um, you know, aggregate you have a good proxy for how it performance might be changing with time and stuff like that. Um, so, uh.

[00:36:47] Jeff Dalton: We certainly, and I, I really appreciate that, and I think that's fundamental to making these systems successful. Um, and like when, and for example, when we roll out Nadia to a new customer, um, there's a lot of customization that goes into into that. So making sure that we're curating the right knowledge that Nadia knows about you and your organization and how you work and your terminology and your phrasing.

[00:37:07] Jeff Dalton: Um, having and having the pilot so we can see that. And then like what other, what customization that needs to be done next. Do you need a custom workflow for, for feedback? Do you need that custom? Um. Uh, the kind of custom role play, for example, for, in terms of like what you wanna role play with your coach and how you approach role play.

[00:37:26] Jeff Dalton: Um, so those are kind of key things that we wanna make sure that for, for example, for a coach that we're, we're clear in adding significant value. And I think that's one thing why a specialized coaching solution, like, like, like Nadia and really kind of beats co-pilots and, uh, other kind of more generic solutions.

[00:37:44] Josh Rubin: Yeah. Yeah, totally. I can totally imagine that. Um, how do you think about, how do you think about privacy? I mean, you're bringing up this con concept of like, you know, all these private conversations with various kinds of role play, and presumably there's, you know. Some sort of, uh, org level structure, org level customization that tries to, has some interplay between private conversations and, you know, uh, org level oversight

[00:38:17] Jeff Dalton: Yes.

[00:38:18] Josh Rubin: this is.

Privacy, Trust & Memory Systems

[00:38:19] Jeff Dalton: So, so I think our all, I mean, organizations can provide their customer guard guardrails and safety in terms of what people can talk about. They can seed it with, with organizational knowledge that's important for the company that's there for, for that's there. Um, they can, they can have some org tool calls for, to look up other parts of knowledge to be able to interact with for the key things.

[00:38:37] Jeff Dalton: But then I really kind of fundamental first principle for us is that your chats are your chats. Privacy, they, no one gets to see those, right? That they're your, they're yours. And that's a fundamental building block of coaching is, is trust. And that you have to be able to trust the coach. 'cause you're, you have to be, you're gonna be emotionally vulnerable.

[00:38:52] Jeff Dalton: You're gonna, you're gonna say, I'm struggling with this and this is hard and I'm having these people problems. And that level of, of trust and transparency is really fundamental to growth, um, in terms of being able to make things possible. So that, that's really just a fundamental pillar. That we have kind of, that we've built, kind of baked into the product from, from the beginning.

[00:39:09] Jeff Dalton: Um, and also control and transparency. So it's like you can go and you can delete chats, you can delete insights that the coaches learned about you, and you can, you, you have full control over that and not your organization. Right. Um, the also the ability, for example, for you to share very clearly with other people on your team.

[00:39:26] Jeff Dalton: So one of the, one the key things is like if we want to move from just coaching for me to coaching. To a team, and we're moving. In general, these agents are gonna start working together. They're gonna have to share information somehow with each other. So for example, um, in NAIA we allow you to share information about yourself publicly, and you control your kind of public profile of yourself.

[00:39:45] Jeff Dalton: So your work profile. So you might, you might be willing to share your personality profile. You might be willing to share information about some of your calendar. Share information about how you worked, how to work best with you, that then another Nadia can kind of come and read and have access to, and be able to see that information and, and be able to then use that to be able to customize the, the coaching.

[00:40:05] Jeff Dalton: So that kind of fundamental, um, transparency around what's being shared, how it's being shared in a way that you control very clearly is, is really important. And I, I think will continue to be important, um, for the en entire industry. Um, when we, as, as these systems become more complex and we, we are increasingly sharing information.

[00:40:25] Josh Rubin: Yeah, I, I think, um. Um, on the side, I've tinkered around with memory systems a little bit, and I, you know, I think the, the idea at least a few times has crossed my mind that we're likely entering a world just where, um, we have age agentic assistance that are working for us, that have access to a variety of privileged knowledge about our own experiences and about our family is they're useful to us when we can share the specifics of our lives in a personal way with them, you know. but we're also already using them to make reservations on the phone to restaurants and you know, if we're not already letting our agents talk to other people's agents, you know, it's gonna be next week. Right. Um, I think it's a really interesting question about how we federate knowledge. Um. In a way that's, uh, sort of, uh, deterministically enforceable versus just, um, sort of fall into the model. If you, if you follow what I'm trying to say,

[00:41:31] Jeff Dalton: Yeah.

[00:41:32] Josh Rubin: I don't think we be, we've begun to solve that problem yet. Um.

[00:41:36] Jeff Dalton: And, um, it reminds me, I think we're, we're going to need to do some research here. It's gonna take, it's gonna take a little bit of evolution. There's gonna be a, a future of, this is my knowledge and how do I share this and make this available with not just with multiple agents in ways that have clear, um, clear scope for where and how they can be used and how they can be shared.

[00:41:57] Jeff Dalton: Um, and if they're remembered or whether they're, whether, whether they have to be forgotten. So you, oh, you can use this to make that reservation, but then don't remember that I, you know, hate Thai food other otherwise or something. That fact, right? I actually love Thai food. Um, but the. Whatever that preference, whatever that user's preference is, right.

[00:42:12] Jeff Dalton: Um, the, if you, if you look at some of these standards are starting to emerge, it reminds me there was some work from Tim Burners Lee, um, you know, founder of the world or web, kind of looking at, oh, privacy is gonna be fundamental. We need it, we are going to need some new foundational, um, layers and standards that are being built for people's knowledge and personal knowledge bases.

[00:42:33] Jeff Dalton: And that's gonna be increasingly important when we're moving into. Agents being, having those, those knowledge stores, being able to take them out to be able to control them. I wanna share them with a different agent as well. How do, how do we make those portable? There's no, there's very, there's not clear standards for that today, and we, and it's something that's gonna be increasingly important, but I don't have a good solution for you today.

[00:42:54] Jeff Dalton: But it's, it's really, it's a really important problem. But I also, a

[00:42:57] Josh Rubin: one,

[00:42:58] Jeff Dalton: lot of smart people working on it.

[00:42:59] Josh Rubin: one thing really interesting that you raised that's worth highlighting is the degree to which, um. It sounds like Nadia is a sort of a synthesis, some sort of hybrid of, um, being large language model and being some deterministic scaffolding around tasks and, um, sort of, structured explorations that are designed by experts.

[00:43:25] Josh Rubin: Right. I think, I think, um. One pitfall is maybe sort of giving too much freedom to the task pursuing agent in every scenario. Do you think that providing sort of I'm gonna call it semi deterministic structure, right? Giving, uh, guiding the language model with specific prompts that follow a schedule that, that sort of, I, I don't know if there's a good name for, um, for that.

[00:43:56] Josh Rubin: Uh. Sort of paradigm of, know, deliberately trying to guide the model, you know, along something that's a little bit more of a deterministic state machine. But I, it seems to me that,

[00:44:09] Jeff Dalton: I could,

[00:44:10] Josh Rubin: yeah,

Deterministic vs. Flexible Agent Design

[00:44:11] Jeff Dalton: I can dive in a little bit there. There's a. There's a long history in dialogue systems like SIGDial has been around a long time. We've learned a lot about how conversational systems work. Um, and, and your past listeners have, have heard people talking about, uh, those some classic dialogue systems.

[00:44:25] Jeff Dalton: We had finite state machines. We have dialogue trees. If you've ever, you know, called a phone, you've used a dialogue tree. So you can go step by step. We also then learned, we moved the dialogue systems evolved to frame based models, um, which moved to like slots and values. And you're like, I just need to make sure that these things get filled.

[00:44:41] Jeff Dalton: I don't care the order you do them in. So now that gave the system the flexibility to ask multiple, get multiple pieces of information in the same turn, for example, from the user. Um. We're continuing to move. Um, and now it's, it's largely like we, now we're building an artifact with the user. And so we have to, basically, we are, we're not just, it's not just, um, there's a clear outcome that we're trying to achieve, whether it's a coaching plan or whether it's that, that a user is committed to, or whether there's a, an email that's getting written.

[00:45:09] Jeff Dalton: There's an actual, um, outcome that's being drafted with the agent in the loop. Um, and, and you can measure progress and, and what's complete and what's not complete, um, against that. And so you can still have a, an LLM kind of following structured process

[00:45:23] Josh Rubin: Mm-hmm.

[00:45:23] Jeff Dalton: to know that. And if I think about how we, how we think about Nadia, we started really, really simple.

[00:45:30] Jeff Dalton: We started with a single, uh, uh, we didn't start with a complex multi-agent system. We started with a kinda a single threaded agent that's, that was kind of working that would then had different kind of workflows with code to be able to pull in things like different parts of memory at the right time. Um, and that's kind of controlled deterministically with code and different kind of prompt gates and models that we have.

[00:45:51] Jeff Dalton: So. We can clearly see and, and inspect, um, what that, what that workflow is working, whether it's working, how it's working, and be able to make changes. And now it's pretty easy to change code. Um, so if we want to evolve that system, we can, we can change those rules and make 'em more flexible and, and have them be in code and where it makes sense, um, pull things out into subagents where, where they're appropriate.

[00:46:13] Josh Rubin: I, I love this, Jeff. I think that, um, so it sounds like you come down somewhere with, with respect to kind of federating memory, it sounds like you see a lot of utility in this kind of deterministic state machine management of uh, state.

[00:46:31] Jeff Dalton: Because you don't need all memory at all, everything at all times, right? Um, we now have large language models that have pretty huge context windows, massive context windows, millions of tokens. We can fit books, entire books inside, inside of a context window. So it's about how, how, what information do we need to have all the time?

[00:46:48] Jeff Dalton: What information do we need to have some of the time and be able to have relevance? And so again, I have a search background, so I always think in the, in the concept of like what. Piece of information was relevant for right now. And if I give you this information, are you using it to materially change the output in, in the, in the system?

[00:47:07] Jeff Dalton: Um, so there's obviously citations and, and elements that we have then we, that we know, we cited different information, that we use different pieces of information. And you can track that and see and, okay, if we always had this information here, it was never used by in any of the outputs. So do we need it?

[00:47:23] Jeff Dalton: No, we can prune that from our context windows and just, and put something else in that's valuable instead that then see an experiment with, um, different types of context and different types of, whether it's retrieval or whether it's more deterministic, um, kind of elements of rules that are sometimes always there.

[00:47:39] Jeff Dalton: Is it, sometimes we have to then factor up, okay, well this is never used, but we still have to have it in these three cases. Let's move it out. Um, so. You know, for example, do we need to trigger memory on every single for for every single system in every single turn? No. We have to be very selective and be relevant.

[00:47:56] Jeff Dalton: It's like, this is what we think we need. Oh, you mentioned a person, so we, let's go look up that person's name. Oh, you mentioned a project. Let's go make sure that we've like looked up when you've mentioned that project recently. And so it's about having the systems and intelligence layer to be able to be able to coordinate those, that memory, those memory systems at the right time.

[00:48:14] Josh Rubin: Nice. That's a great answer. Um, so I've finally just discovered where the q and A resides with which button I have to press to bring it up. I, apparently I'm not very good at this. Um, so let's, let's, now that we have. Have I have some questions. Uh, maybe we can quickly, you know, spend five minutes or so kind of addressing, uh, them and then, uh, at least a couple of them.

Q&A

[00:48:34] Josh Rubin: So, uh, we have one question that is, what kind of evaluations are needed in production environments, um, to assess if the agent is going rogue or stuck in a loop. What kind of monitoring is necessary to address and trace its memory, context and decisions?

[00:48:54] Jeff Dalton: Well, I think, I think you probably have some, some experience there, so I'd love to have some of your thoughts there In terms of, I'll share a little bit of, kind of reiterate a little bit of what I said earlier. It's like we have our own. Custom internal thinking structures that we have, so we know what the agent thinks it's trying to do, and we can trace, we can trace those so when it, we can see it getting stuck, oh, it's stuck in G again, shoot, we have to go fix that.

[00:49:13] Jeff Dalton: There's some, there's some rule that it, that there is a flag that's not getting set to allow it to progress the way we expect it to. So if you have the, if you have those reasoning traces, whether it's thought traces of the model, you can actually see and, and that's inspectable and making sure that it's inspectable, but also making sure that it's structured.

[00:49:31] Jeff Dalton: So that, so in, in structured ways so that it's not just free text. Um, that also is a kind of a, a fundamental principle, I think to be able to make that, uh, debug. But I'm also curious, you, you probably have some additional thoughts there, Josh.

[00:49:42] Josh Rubin: A lot of thoughts. Uh, it's, it's a little bit of a tricky question, so I think, you know, in production, so. I'm, I'm, I, I assume that the question at some level is about, um, so in some ways guard railing is harder than Observability because, or sort of production Observability, know, uh, guard railing is essentially trying to stop an errant process in, in flow.

[00:50:06] Josh Rubin: Uh, you know, the Observability, you have time to use smart, smart models and sort of chew on the dataset. Um, stuff doesn't have to happen quite in real time and still gives you sort of debug ability into what the failure modes are. Um, so, you know, I think that there are, you know, for egregious errors, these are things that an LLM as a judge, you know, capable of ingesting a lot of data, some entire conversation, trace some sort of trace level. Um, uh. You know, uh, conversation telemetry, you can, you can capture, um, and these LLMs are generally quite good at identifying, um, you know, or classifying common failure modes. Hopefully this isn't happening too commonly. Um, but you, you also have other proxies, right? You can look at things that raise guardrail errors.

[00:51:05] Josh Rubin: You can look at. know, ideally we get user feedback sometimes when something's gone awry, and that can sometimes be your sort of entry point to investigating a certain mode. You know, sometimes it's like the, the human feedback is the hint at like a, a vein of gold, defer some, um, infrequent but important failure case.

[00:51:25] Josh Rubin: Uh, you know, and so, um, you know, there's a number of techniques. I don't know that all of them we are fielding today, but in our near term roadmap, we should have coverage for, you know. Looking at, uh, classifying failure mode types by doing some sort of, um, you know, unstructured clustering using embeddings of, um, you know, SubT traces and exploring. Small but common kind of error classes or, um, you know, or, or surfacing them with, with evaluator models that, um, you know, are tasked with failure modes in an, an offline capacity. That was a little ranty. I, I don't know if I had a totally clear answer. There's kind of a lot of stuff there.

[00:52:08] Jeff Dalton: Couple quick thoughts there. I, I think the fir some of the, the first one there is again, having some of that structure so you can actually see those, see those elements and see where, where, what the traceability of that is, is fundamental. Um, for something like safety and guardrails, it's, it's not, there's not one solution.

[00:52:21] Jeff Dalton: There's like, it's actually defense in depth. There's, it's a whole layer. So there's foundation, model guard, guardrails, then there's, there's. Before we even get to input, there's, there's like regexes and patterns and things to make sure that the input that's going doesn't have prompt injection attacks and, and PII.

[00:52:37] Jeff Dalton: And so before that, even before it even hits an LLM, we're met, we're testing those things. Then further on top of that we have input guardrails that are maybe light LLMs, and after that we have output, um, guardrails as well that are additional kind of LLM driven with. And all of those have different operating points of precision and recall.

[00:52:54] Jeff Dalton: But, and you're, you're really putting that safety system together, overall end to end and making sure that you're measuring, um, each of the component wise steps on, uh, as input and output. But also is this, does that overall aggregate behavior kind of lead to and align with like what the target operating point is that you're looking for?

[00:53:12] Jeff Dalton: And sometimes it very different. A guardrail for something like, um, for legal is gonna be very different from one that says just. Um, that has legal implications gonna be very different from one that has, um, kind of just, um, maybe a date, a policy, a policy violation.

[00:53:26] Josh Rubin: I think that's very well put. I think that, you know, the fact that we have magic now in language models doesn't absolve us from system level thinking about the components and. You know, whatever the equivalent now of, you know, unit tests, like, sort of, uh, you know, instrumenting each of the known steps with, um, ways of measuring the things that can go wrong.

[00:53:46] Josh Rubin: I mean, this goes back to our, uh. You know, previous topic of, you know, out of the box versus customizability, I think, you know, something that is, uh, has risk associated with it is a high value use case. I think you almost necessarily have to think about the details of the components, um, and build, build a strategy for capturing and mitigating the kinds of things that might be expected to go wrong in, in that piece.

[00:54:11] Jeff Dalton: Good agent systems have, follow a lot of the same principles as good software. And so, um, so if you're building a good, a good agent system, then you, you also need to know, you need to know some of the best principles of how, what, what good software engineering is and best practices. I spent a lot of time at IBM working on logging and tracing in an enterprise application.

[00:54:29] Jeff Dalton: And if you want to debug some of the really complex web sphere application, you better have, you better have really detailed traces, um, so that, that you can, so that you can understand that behavior. And we need the same thing in agent systems.

[00:54:42] Josh Rubin: Um, so I, I just glancing across the other questions that we have here. I think we've mostly addressed, uh, the topics that were raised. Um, how should we incorporate measurement or evaluation tools and customized agents? I think that's what we're talking about. Um, would we need custom measurements? I think sometimes depends on risk, depends on what you're trying to do

[00:55:04] Jeff Dalton: Yes. Need custom measurements. And for something like coaching, right? We, we can, we can, and you know, we can measure process quality, outcome, quality, and whether it's, you know, if that, if a session was successful. But going back to like the research question, how do we measure a journey? How do we measures, measure a user's journey over time?

[00:55:20] Jeff Dalton: Is there, is, is it, is there a coherent plan? Is user making progress and there is trust being strengthened? Um, all of those different elements of. Part of what keeps me, uh, keeps me here like working on how do you measure a longitudinally, uh, kind of change. Um, and, and that's so that, that is a form of very, very custom, highly specific, and not just conversation level metric, but a system level metric of what we're trying to accomplish in the user's world.

[00:55:46] Josh Rubin: That's great. great. I think we wrap here. This has been a really, really stimulating conversation. Uh, yeah. Uh, a lot of fun. Thank you very much, Jeff, for sitting and talking to me about, about this stuff.

[00:55:58] Jeff Dalton: been a pleasure and it's been great fun.

[00:56:00] Josh Rubin: You've heard about Valence and Nadia, check them out. Um, and we're Fiddler AI. Uh, you know, we'd love to work with you on getting your instrumentation just right on whatever collection of complex agentic and predictive workflows you have. Um, so thanks everybody for tuning in and, uh, you know, have a great day.

[00:56:20] Jeff Dalton: Have a great day. Thank you very much.

[00:56:21]