The Rise of the Agentic Application
Over the past two years, AI-powered applications have evolved from relying on traditional machine learning models to leveraging the transformative capabilities of large language models (LLMs). Today, we’re witnessing the emergence of a new architectural paradigm, where foundation model intelligence is modularized into composable components and orchestrated into structured or probabilistic workflows.
This shift is redefining the nature of applications themselves. What were once deterministic, logic-driven systems are now becoming probabilistic and adaptive, completing the transition to truly AI-native applications. At the heart of this transformation is the rise of the agent, a composable unit of intelligence capable of autonomous reasoning and action.

Enterprises are now in a race to harness agentic systems, aiming to unlock increasingly complex, dynamic use cases that traditional architectures simply couldn’t support. This new frontier marks a critical inflection point in the enterprise AI journey.
What Is an Agent?
While an agent can be defined in several ways, Anthropic captures the idea of an "agentic system" well by differentiating between workflows and agents:
"Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."
How Agents Can Fail
AI agents are inherently complex — and that complexity is only increasing as systems evolve from single-agent architectures to multi-agent ecosystems. While these agents promise greater autonomy and intelligence, they also introduce new layers of operational risk. Failures can occur in seemingly simple ways, yet remain difficult to diagnose due to the opaque and dynamic nature of agentic behavior.
Common Failure Modes in Enterprise AI Agent Systems:
- Data Access & Retrieval Issues
  - Retrieval errors due to datastore outages or misconfigurations
  - Incorrect results from indexing errors or faulty retrieval logic
  - Incomplete data coverage leading to knowledge gaps
- Prompting & Contextual Failures
  - Insufficient or inaccurate context provided to the model
  - Misleading or suboptimal prompt construction affecting performance
- Model-Level Errors
  - Hallucinations or factually incorrect responses from the LLM
  - Generation of offensive, inappropriate, or brand-unsafe content
- Security & Privacy Risks
  - Leakage of sensitive or private enterprise data
  - Successful jailbreaks or prompt injection attacks
- Tooling & Execution Gaps
  - Agents failing to invoke required tools due to access issues
  - Inability to act due to missing or incomplete tool input data
What to Monitor in Agents
Given the wide range of failure modes in agentic systems, enterprises need a comprehensive set of metrics to gain full visibility into the operational behavior of AI-driven applications. Robust monitoring isn't just a best practice; it's essential for ensuring reliability, safety, and trust at scale.
Below are a few key metrics to consider when instrumenting agent-based applications:
- Safety & Trustworthiness
  - Multi-Dimensional Safety: Detection of outputs that are illegal, hateful, harassing, racist, sexist, violent, sexual, harmful, unethical, or indicative of jailbreaking attempts
  - Toxicity: Measurement of harmful or aggressive language
  - Profanity: Detection of explicit or inappropriate language
  - Sentiment: Analysis of emotional tone, with thresholds for acceptable response ranges
  - Banned Keyword Detection: Identification of restricted terms or phrases
  - Personally Identifiable Information (PII): Monitoring for leakage of sensitive data
- Output Quality
  - Faithfulness: Ensuring responses remain grounded in the provided data or source context
  - Conciseness: Assessing whether outputs are unnecessarily verbose or meandering
  - Coherence: Evaluating logical flow and structural consistency
  - Answer Relevance: Measuring alignment with the original prompt or user query
By implementing and tracking these metrics, enterprises can not only detect issues in real time but also continuously improve agent performance and user trust across critical workflows.
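To make a few of these metrics concrete, here is a minimal sketch of a per-response scoring function covering some of the simpler checks: banned keywords, PII patterns, and a crude conciseness proxy. The keyword list, regex patterns, and word-count threshold are hypothetical placeholders you would replace with your own policies; metrics such as faithfulness or answer relevance typically require an LLM-based evaluator rather than heuristics like these.

```python
import re

# Hypothetical policy inputs -- tune these to your own requirements.
BANNED_KEYWORDS = {"internal-only", "do not distribute"}
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like pattern
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]
MAX_WORDS = 300  # crude conciseness proxy

def score_response(response: str) -> dict:
    """Return a simple per-response metric record for logging or alerting."""
    words = response.split()
    return {
        "banned_keyword_hit": any(k in response.lower() for k in BANNED_KEYWORDS),
        "pii_detected": any(p.search(response) for p in PII_PATTERNS),
        "word_count": len(words),
        "too_verbose": len(words) > MAX_WORDS,
    }

if __name__ == "__main__":
    print(score_response("Contact jane.doe@example.com for the internal-only report."))
```

A record like this can be attached to each logged response, giving downstream dashboards and alerts a consistent shape to aggregate over.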
A “Control Plane” for Agentic Applications
As applications shift from deterministic to probabilistic behavior, real-time monitoring becomes mission-critical. The very same metrics used for post-hoc analysis of agent logs can now serve a more proactive role, dynamically influencing or halting agent execution based on detected risks.
This evolution paves the way for a real-time control plane: a system that not only observes but actively governs agent behavior. By integrating intelligent guardrails, enterprises can reroute application logic or block unsafe outputs in the moment, minimizing risk and maintaining trust.
For example, OpenAI’s safety guidelines below recommend implementing guardrails across multiple dimensions. These safeguards aren't just defensive — they're foundational to building resilient, enterprise-ready AI systems.
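As a minimal illustration of the control-plane idea, the sketch below maps guardrail scores to a runtime decision: allow the response, reroute it (for example, to a safer model or a human review queue), or block it outright. The metric names and thresholds are hypothetical placeholders chosen for illustration, not part of any specific vendor's guidance.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    REROUTE = "reroute"  # e.g., hand off to a safer model or a human review queue
    BLOCK = "block"

# Hypothetical policy: per-metric thresholds chosen for illustration only.
POLICY = {
    "toxicity":  {"reroute": 0.4, "block": 0.8},
    "pii_risk":  {"reroute": 0.2, "block": 0.5},
    "jailbreak": {"reroute": 0.3, "block": 0.6},
}

def decide(scores: dict) -> Action:
    """Map guardrail scores (0.0-1.0) to a control-plane action."""
    decision = Action.ALLOW
    for metric, thresholds in POLICY.items():
        score = scores.get(metric, 0.0)
        if score >= thresholds["block"]:
            return Action.BLOCK        # any hard violation blocks immediately
        if score >= thresholds["reroute"]:
            decision = Action.REROUTE  # softer violations downgrade the path
    return decision

print(decide({"toxicity": 0.1, "pii_risk": 0.35, "jailbreak": 0.05}))  # Action.REROUTE
```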

How to Monitor an Agentic System
1. Set Up Guardrails
Depending on the criticality of business priorities, such as brand safety, data privacy, or response accuracy, organizations should select guardrails that evaluate and score relevant inputs and outputs within the system. These guardrails enable dynamic control to modify agent execution in real time when risk thresholds are exceeded.
Below is an architectural approach recommended by OpenAI’s “A practical guide to building agents,” illustrating how to embed such safeguards directly into the application workflow.
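As a rough sketch of that pattern (a generic illustration, not the guide's own implementation), the snippet below wraps an agent call with an input guardrail before the model is invoked and an output guardrail before the response is returned. `score_input`, `score_output`, `call_agent`, and the threshold are hypothetical stand-ins for your own guardrail models and agent runtime.

```python
RISK_THRESHOLD = 0.7  # illustrative threshold

def score_input(text: str) -> float:
    # Placeholder: in practice this might call a prompt-injection or PII classifier.
    return 0.0

def score_output(text: str) -> float:
    # Placeholder: in practice this might call toxicity or faithfulness evaluators.
    return 0.0

def call_agent(user_input: str) -> str:
    # Placeholder for the actual LLM/agent invocation.
    return "agent response"

def guarded_agent(user_input: str) -> str:
    if score_input(user_input) >= RISK_THRESHOLD:
        return "Request declined by input guardrail."
    response = call_agent(user_input)
    if score_output(response) >= RISK_THRESHOLD:
        return "Response withheld by output guardrail."
    return response
```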

2. Gather Agent Logs
Effective monitoring of agent behavior begins with detailed operational logs that capture inputs, outputs, and critical metadata, such as latency, model version, agent ID, and more, at span-level granularity. OpenTelemetry has greatly simplified this process by standardizing how AI telemetry data is structured across spans, traces, sessions, applications, and other key artifacts.
Depending on your deployment architecture, agent logs can be ingested from various sources. For example, logs from AWS Bedrock Agents may reside in a log store like Amazon S3, while systems like LangGraph provide direct access to structured telemetry from agent orchestration pipelines.
This structured approach to logging forms the foundation for robust observability, enabling real-time debugging, quality assurance, and risk management in enterprise-grade agentic applications.
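As a small sketch of span-level instrumentation, the example below uses the OpenTelemetry Python SDK with a console exporter so it runs standalone; in production you would point an OTLP exporter at your collector or observability backend. The attribute names here are illustrative only; the OpenTelemetry GenAI semantic conventions define standard equivalents you would likely prefer.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for demonstration purposes.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-app")

def answer_question(question: str) -> str:
    # One span per agent step, enriched with metadata for later debugging.
    with tracer.start_as_current_span("agent.answer_question") as span:
        span.set_attribute("agent.id", "support-agent-1")   # illustrative attribute names
        span.set_attribute("llm.model", "example-model")
        span.set_attribute("input.length", len(question))
        response = "placeholder response"  # stand-in for the real LLM call
        span.set_attribute("output.length", len(response))
        return response

answer_question("How do I reset my password?")
```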
3. Monitor App-Level Metrics
Once logs are collected and processed, it's important to define a focused set of high-level metrics to track at the application level. This helps avoid the complexity and overhead of monitoring dozens of metrics at the span level for every agent operation, an approach that can quickly become unmanageable as the number of deployed agents scales.
By identifying and prioritizing a few key performance and risk indicators, organizations can create streamlined dashboards and set up targeted alerts. This ensures teams stay ahead of emerging issues without being overwhelmed by noise, enabling proactive management of agent behavior across the application landscape.
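As a rough sketch of this roll-up, assuming span-level logs have already been flattened into a table, the example below computes a handful of app-level indicators per agent with pandas. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical span-level log records already enriched with metadata.
spans = pd.DataFrame([
    {"agent_id": "support-agent-1", "latency_ms": 820,  "guardrail_triggered": False, "error": False},
    {"agent_id": "support-agent-1", "latency_ms": 2450, "guardrail_triggered": True,  "error": False},
    {"agent_id": "billing-agent-2", "latency_ms": 640,  "guardrail_triggered": False, "error": True},
])

# Roll span-level data up into a few app-level indicators per agent.
app_metrics = spans.groupby("agent_id").agg(
    requests=("latency_ms", "size"),
    p95_latency_ms=("latency_ms", lambda s: s.quantile(0.95)),
    guardrail_trigger_rate=("guardrail_triggered", "mean"),
    error_rate=("error", "mean"),
)
print(app_metrics)
```

A table like this is usually enough to drive a dashboard and a small set of alerts without exposing every span-level metric.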

4. Identify Issues
When an alert is triggered, the complexity of agentic systems can make root cause analysis both difficult and time-consuming. With numerous moving parts, pinpointing the source of failure requires careful investigation:
- Is a single agent causing issues within a multi-agent workflow?
- Is the problem isolated to one user or a specific interaction?
- Is a particular span consistently underperforming or failing?
- Are LLM responses breaking down across certain prompts?
- Have vector retrieval latencies spiked for a specific span?

Finding the proverbial needle in the log haystack demands a systematic drill-down, from the triggered metric, through the agent hierarchy, down to individual sessions and spans. This level of traceability is only possible when logs are enriched with high-quality, structured metadata, enabling a fast and efficient debugging experience.
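As a sketch of that drill-down, again assuming spans are available as a table, the example below filters enriched logs to surface the offending spans behind an alert. Column names and scores are illustrative.

```python
import pandas as pd

# Hypothetical enriched span logs.
spans = pd.DataFrame([
    {"session_id": "s-101", "agent_id": "planner", "span": "vector_search", "latency_ms": 3400, "faithfulness": 0.92},
    {"session_id": "s-101", "agent_id": "writer",  "span": "llm_generate",  "latency_ms": 900,  "faithfulness": 0.41},
    {"session_id": "s-102", "agent_id": "planner", "span": "vector_search", "latency_ms": 450,  "faithfulness": 0.95},
])

# Drill down from an app-level alert (low faithfulness) to the offending spans.
suspects = spans[spans["faithfulness"] < 0.5]
print(suspects[["session_id", "agent_id", "span", "faithfulness"]])

# Check whether retrieval latency has spiked for a specific span type.
retrieval = spans[spans["span"] == "vector_search"]
print(retrieval.groupby("agent_id")["latency_ms"].describe())
```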
5. Resolve Issues
Identifying the root cause is only the first step; resolving the issue requires targeted action. Depending on the failure mode, this may involve updating the prompt context, switching to a different LLM, refreshing vector data stores, refining retrieval logic, or other corrective changes.
It also presents a valuable opportunity to optimize your safety and quality controls. You can fine-tune guardrail thresholds to minimize false positives or introduce new guardrails to proactively catch similar issues in the future. This continuous feedback loop is essential for maintaining robust, reliable, and scalable agent-driven systems.
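As a minimal sketch of that feedback loop, the example below tunes a single guardrail threshold against a handful of human-reviewed incidents, choosing the highest threshold that still catches every confirmed violation. The scores and verdicts are made-up examples.

```python
# Each record pairs a guardrail score with a human verdict from incident triage.
reviewed = [
    (0.91, True), (0.85, True), (0.62, False), (0.58, False), (0.40, False),
]

def false_positive_rate(threshold: float) -> float:
    flagged_benign = sum(1 for score, is_violation in reviewed
                         if score >= threshold and not is_violation)
    benign = sum(1 for _, is_violation in reviewed if not is_violation)
    return flagged_benign / benign if benign else 0.0

def recall(threshold: float) -> float:
    caught = sum(1 for score, is_violation in reviewed
                 if score >= threshold and is_violation)
    violations = sum(1 for _, is_violation in reviewed if is_violation)
    return caught / violations if violations else 1.0

# Choose the highest threshold that still catches every confirmed violation.
candidates = [t / 100 for t in range(50, 100, 5)]
best = max(t for t in candidates if recall(t) == 1.0)
print(best, false_positive_rate(best))
```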
As AI agents become more capable and more complex, effective monitoring isn't optional; it's foundational. By instrumenting the right metrics, deploying real-time guardrails, and embracing a control-plane mindset, you can ensure your agents are not only powerful, but also safe, reliable, and aligned with your business goals. The future of AI is agentic, and it needs observability built in from day one.