Key Takeaways
- AI agents fail differently than models do, and traditional monitoring wasn't built for that.
- Bad evaluation is worse than no evaluation: the wrong metric moves teams confidently in the wrong direction.
- Measurability is a design decision; agents built without structure can't be debugged after the fact.
- Guardrails are a layered system, not a switch, and each layer needs to be measured independently.
- The teams that handle production failures well aren't the ones with better models; they're the ones who see domain shifts before they become incidents.
A model produces a bad output. An agent makes a decision, takes an action, calls a tool, and hands it off to another agent.
By the time something surfaces as wrong, the origin of the failure is three steps back. It is buried in a reasoning chain no dashboard was built to follow.
The monitoring instincts that worked for predictive models (tracking drift, flagging anomalies, watching performance metrics) don't transfer cleanly to agentic systems. Failures in these systems are emergent, probabilistic, and distributed across multiple components. The gap between "something went wrong" and "here's why and what to change" is much harder to close than most teams expect.
Closing that gap requires decisions that start long before an agent reaches users.
Bad Evaluation is Worse Than No Evaluation
The instinct when standing up a new AI system is to reach for whatever evaluation tooling is available and start generating metrics. The problem is that a poorly designed eval doesn't just fail to surface issues. It actively misleads.
Teams optimize for the metric they have, not the behavior they want. The system moves confidently in the wrong direction, and nothing in the data signals otherwise.
The alternative starts with rubrics. Before looking at a single output:
- Define what good looks like across the dimensions that matter for your system
- Write down what success means at the conversation level, the session level, and the user journey
- Calibrate the humans who will apply those definitions until their judgments align
Only then does evaluation data become trustworthy enough to act on.
In a recent AI Explained, Jeff Dalton, Head of AI at Valence, said it plainly: bad evaluation is worse than no evaluation. The approach is to build rubrics first, calibrate human evaluators before scoring any sessions, and use LLM-as-a-judge to amplify that signal at scale, not replace it.
The sequencing matters. But rubric design alone isn't enough; calibration is the step that determines whether anything else in the evaluation pipeline is worth trusting.
This is especially hard for systems optimizing for longer-term outcomes, like coaching or behavioral change. The metrics become more subjective and the feedback loops get longer.
Grading coaching conversations across multiple dimensions (was change facilitated, was the conversation personalized, did the user learn something new) requires metrics no off-the-shelf eval tool provides. They have to be defined, written down, and calibrated across the humans doing the grading.
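One way to make those definitions concrete is to write the rubric down as data rather than folklore. The sketch below is hypothetical: the dimension names come from the examples above, but the questions, score anchors, and 1-5 scale are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RubricDimension:
    name: str
    question: str            # what the grader is actually asked
    anchors: dict[int, str]  # what each score means, in writing

# Hypothetical rubric for the coaching dimensions named above (1-5 scale).
COACHING_RUBRIC = [
    RubricDimension(
        name="change_facilitation",
        question="Did the conversation move the user toward a concrete next step?",
        anchors={1: "No movement; generic advice only.",
                 3: "Some direction, but no commitment from the user.",
                 5: "User articulated a specific, self-chosen action."},
    ),
    RubricDimension(
        name="personalization",
        question="Was the conversation grounded in this user's stated context?",
        anchors={1: "Could have been sent to anyone.",
                 3: "References the user's situation superficially.",
                 5: "Advice depends on details the user actually shared."},
    ),
    RubricDimension(
        name="learning",
        question="Did the user encounter an idea or framing new to them?",
        anchors={1: "Restates what the user already said.",
                 3: "Familiar material, usefully organized.",
                 5: "User explicitly acknowledges a new insight."},
    ),
]

def score_report(scores: dict[str, int]) -> dict[str, int]:
    """Keep per-dimension scores; a single average would hide which dimension failed."""
    missing = {d.name for d in COACHING_RUBRIC} - scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return scores
```

Keeping the anchors in writing is what makes calibration possible: two graders disagreeing about a session can point at the exact anchor they read differently.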
Measurability is a Design Decision, Not a Retrofit
A common mistake is treating inspectability as something to bolt on after the system is built. By then it is usually too late.
If the system was not designed to produce traceable behavior, no monitoring layer applied afterward will make it fully debuggable.
Agents built on loosely defined natural language prompts produce behavior that is difficult to trace to a specific cause. When something goes wrong, the failure is visible but its origin is not.
The team can see that the agent got stuck; they cannot see why, or which instruction to change.
The approach that works is treating prompts as structured, inspectable objects with clearly defined phases, steps, and transition conditions. When a system is built this way:
- A failure in production maps back to a specific part of the workflow
- Teams can identify the problem, make a targeted change, and verify it worked
- Prompts are treated like code: structured objects with clearly defined steps that can be inspected, modified, and tested in isolation
- When something breaks, the trace points to a specific instruction, not a black box
Structured design is not a constraint on what an agent can do. It is what makes it possible to run agents in production without flying blind.
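The phases-and-transitions idea above can be sketched as a small state machine. Everything here (the phase names, signals, and intake workflow) is a hypothetical example, not a description of any particular product's prompt format:

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    instruction: str             # the prompt text scoped to this phase only
    transitions: dict[str, str]  # observed signal -> next phase

# A hypothetical coaching-intake workflow, expressed as explicit phases
# rather than one loosely worded mega-prompt.
WORKFLOW = {
    "greet": Phase("greet",
        "Greet the user and ask what they want to work on.",
        {"goal_stated": "clarify", "off_topic": "greet"}),
    "clarify": Phase("clarify",
        "Ask one question to make the goal specific and measurable.",
        {"goal_specific": "plan", "still_vague": "clarify"}),
    "plan": Phase("plan",
        "Propose a next step and confirm the user accepts it.",
        {"accepted": "done"}),
}

def step(phase: str, signal: str) -> str:
    """Advance the workflow; an unknown signal is a traceable failure, not a mystery."""
    nxt = WORKFLOW[phase].transitions.get(signal)
    if nxt is None:
        raise ValueError(f"phase {phase!r} has no transition for signal {signal!r}")
    return nxt

# A production trace becomes a sequence of (phase, signal) pairs:
trace = [("greet", "goal_stated"), ("clarify", "still_vague"), ("clarify", "goal_specific")]
phase = "greet"
for p, s in trace:
    phase = step(p, s)
# A stuck agent now maps to the exact phase and signal where it stalled.
```

The payoff is testability: each phase's instruction can be modified and regression-tested in isolation, which is what "prompts treated like code" means in practice.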
Guardrails Follow the Same Logic
Guardrails are often treated as a setting: on or off, blocking or allowing. In production, they are more complicated.
Effective guardrails are a layered system:
- Before input reaches an LLM, pattern-matching rules screen for prompt injection and PII
- Input guardrails evaluate the prompt itself as a second layer
- Output guardrails check the model's response as a third
Each operates at a different point in the inference path with different precision-recall tradeoffs. Each needs to be measured and tuned on its own terms. Whether the system is operating safely and within policy emerges from how those layers work together, not from any one of them.
What makes this hard in practice is domain shift. A system configured for a general enterprise context will encounter edge cases the moment it lands in a specific vertical. When a large legal firm onboarded onto an AI coaching platform, employees began uploading documents and asking the assistant to help them work through the content. Guardrails built for general enterprise use flagged the interactions as legal advice requests and blocked them.
The guardrail was technically correct and operationally wrong for that customer.
The difference between catching it quickly and finding out through user complaints weeks later comes down to one thing: whether the monitoring infrastructure was in place to surface the failure at all.
The same dynamic appeared when users came in after a major election cycle expressing distress. A blunt political content guardrail would have blocked the conversation entirely. What was actually needed was a response that acknowledged the emotion while maintaining neutrality. A rigid rule couldn't handle that, and the distinction only became visible through careful log review and human assessment.
Conclusion
None of this is novel. Good agent systems follow the same principles as good software. Clear specifications. Traceable behavior. Layered defenses.
And the discipline to look closely at the data, reading a few real examples with your own eyes, before trusting what the aggregate numbers say.
This post draws on a candid conversation with Jeff Dalton, Head of AI and Chief Scientist at Valence, from Fiddler's AI Explained AMA series. Watch the full session on demand.

