Key Takeaways
- Distributed traces that capture an agent's full decision path (thought, action, tool call, reflection) are the single most valuable telemetry signal for production agents.
- Traditional metrics like CPU, memory, and latency remain necessary but insufficient; agent-specific span attributes such as tool call success rate, token consumption per step, and evaluation scores fill the observability blind spot.
- OpenTelemetry semantic conventions for AI workloads are stabilizing, and adopting them now prevents costly re-instrumentation later.
- Telemetry without evaluation is just data collection; connecting spans to automated quality scoring is what turns observability into control.
The Problem with Traditional Telemetry for AI Agents
Traditional application performance monitoring collects three signal types: metrics, logs, and traces. For deterministic services, this combination is sufficient. A request enters, follows a predictable code path, and returns a response. When something breaks, a spike in error rate or latency points you to the root cause.
AI agents break this model. The same input can produce different reasoning chains, tool call sequences, and final outputs. A multi-step agent processing a customer query might call three tools in one execution and five in another, selecting different reasoning paths each time. Non-determinism is the default, not the exception. Even leading AI labs describe monitoring their own agents for misalignment and unexpected behavior [6].
This distinction matters because observability and control are not the same thing. Collecting telemetry tells you something happened. Understanding whether the outcome was correct requires a different layer entirely.
Consider a concrete failure scenario. Nielsen deployed a multi-agent system called Ask Nielsen to process complex market research queries. In a system like this, a parent agent delegates subtasks to specialized sub-agents. If one sub-agent hallucinates a market statistic, the parent agent propagates that incorrect answer to the user. Traditional metrics show healthy latency, normal throughput, and zero errors. Every service-level indicator reads green while the system returns wrong answers.
This is the core problem: traditional APM was designed for systems where incorrect behavior correlates with infrastructure failure. In agentic systems, the infrastructure can be perfectly healthy while the outputs are fundamentally unreliable. Agentic Observability addresses this by going beyond infrastructure health to evaluate what production agents are actually doing. We cannot give agents more decision-making authority until we can verify what decisions they are making.
Distributed Traces Are the Primary Signal
In agentic systems, a single user request fans out into multiple LLM calls, tool invocations, and sub-agent delegations. Traces capture this full agentic hierarchy as a tree of causally connected spans. Each span represents one discrete operation: a thought step, an action decision, a tool execution, or a reflection on results.
What makes traces the primary signal is causal structure. Logs capture individual events but lose the relationships between them. We can see that a tool was called and that an LLM generated a response, but we cannot see that the LLM generated the response because the first tool call returned an empty result, triggering a fallback path. Traces preserve that chain of causation across every step of agent execution.
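To make the causal structure concrete, here is a minimal sketch of how parent links let us recover the path that led to any span. The `Span` class, the span names, and the example trace are hypothetical; a real trace would carry the same parent relationship in its span context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str


def causal_chain(spans: list[Span], span_id: str) -> list[str]:
    """Return span names from the root down to the given span."""
    by_id = {s.span_id: s for s in spans}
    chain: list[str] = []
    current = by_id.get(span_id)
    while current is not None:
        chain.append(current.name)
        # Follow the parent link; the root has no parent, which ends the walk.
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


# Hypothetical trace: an empty tool result sends the agent down a fallback path.
trace_spans = [
    Span("a", None, "handle_request"),
    Span("b", "a", "tool_call:search"),
    Span("c", "a", "fallback_path"),
    Span("d", "c", "llm_generate"),
]

print(" -> ".join(causal_chain(trace_spans, "d")))
```

A flat log would show `tool_call:search` and `llm_generate` as unrelated events. The parent links show that the LLM call ran inside the fallback path.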
Each span in an agent trace should record specific attributes: the input prompt, the output completion, the model used, token counts (prompt and completion), latency, tool call parameters and return values, and any evaluation scores attached at capture time. This structured representation turns an opaque agent execution into an inspectable decision path.
Here is what a well-instrumented agent tool call looks like using OpenTelemetry semantic conventions [1]:
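As a minimal sketch, the attributes on such a tool-call span could be assembled as follows. The `gen_ai.*` names reflect our reading of the GenAI conventions [1], which are still experimental and may change. The `tool.success` and `tool.duration_ms` keys, the tool name `search_orders`, and all values are illustrative assumptions.

```python
import json


def tool_call_attributes(tool_name, call_id, arguments, result, success, duration_ms):
    """Build a flat attribute dict for one tool-execution span."""
    return {
        # Names taken from the GenAI conventions as we understand them today.
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
        "gen_ai.tool.call.id": call_id,
        # Custom keys (assumptions). Span attribute values must be primitives,
        # so structured inputs and outputs are serialized to JSON strings.
        "tool.input": json.dumps(arguments, sort_keys=True),
        "tool.output": json.dumps(result, sort_keys=True),
        "tool.success": success,
        "tool.duration_ms": duration_ms,
    }


attrs = tool_call_attributes(
    tool_name="search_orders",          # hypothetical tool
    call_id="call_001",                 # hypothetical call identifier
    arguments={"customer_id": "c-42"},
    result={"orders": []},
    success=True,
    duration_ms=182.5,
)

for key, value in attrs.items():
    print(f"{key} = {value!r}")
```

With the OpenTelemetry Python API, a dict like this can be attached to an active span in one call through `span.set_attributes(attrs)`.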
The OpenTelemetry GenAI semantic conventions define standard attribute names for LLM-specific telemetry. These conventions are currently experimental but represent the emerging standard for agent instrumentation. Adopting them now means our telemetry will be interoperable across tools and vendors as the ecosystem matures.
Agent-Specific Metrics That Traditional APM Misses
Traces provide the raw data. Metrics derived from those traces provide the system-wide view that tells you whether your agents are healthy across the full agentic hierarchy.
Five categories of agent-specific metrics matter beyond traditional latency and error rate:
- Tool call success rate. The percentage of tool invocations that return valid, usable results. At 500K traces per day, even a 2% tool failure rate means at least 10,000 failed tool invocations daily if every trace makes at least one tool call, and more when traces make several. Many of these failures are silent: the agent retries internally and eventually succeeds, masking the problem from top-level metrics.
- Token consumption per step. Tracks cost at the granularity of individual reasoning steps, not just per request. A sudden spike in tokens per step often indicates the agent is stuck in a reasoning loop or generating unnecessarily verbose intermediate outputs. This metric is the fastest signal for cost drift.
- Evaluation scores per span. Faithfulness, groundedness, and relevance scores attached to each LLM completion at capture time. These transform traces from a record of what happened into a record of how well it happened. Without per-span evaluation, you are collecting expensive telemetry with no quality signal.
- Decision path depth. The number of reasoning steps an agent takes before producing a final output. Unusually deep paths suggest the agent is struggling with the task. Shallow paths on complex queries may indicate the agent is taking shortcuts. Tracking the distribution of path depth over time reveals behavioral drift.
- Guardrail trigger rate. How often pre-LLM or post-execution guardrails intercept inputs or outputs. A rising trigger rate on a stable workload signals that model behavior is shifting. A falling rate after a model update may mean the guardrail triggers need recalibration, not that the model improved.
These metrics aggregate from span-level telemetry to provide a view across the entire system. They answer questions that traditional APM cannot: not just whether the system is running, but whether it is running correctly.
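The aggregation from spans to the five metrics can be sketched in a few lines. The span records below are hypothetical dicts standing in for exported spans, and the field names (`step_type`, `tokens`, `tool_ok`, `eval_score`, `guardrail_triggered`) are assumptions chosen for illustration. The guardrail rate is computed per trace here, and the sketch assumes at least one tool call and one LLM step are present.

```python
from collections import Counter
from statistics import mean

# Hypothetical exported spans from two traces.
spans = [
    {"trace_id": "t1", "step_type": "thought", "tokens": 120, "eval_score": 0.9},
    {"trace_id": "t1", "step_type": "tool_call", "tool_ok": True},
    {"trace_id": "t1", "step_type": "reflection", "tokens": 80, "eval_score": 0.8},
    {"trace_id": "t2", "step_type": "thought", "tokens": 200, "eval_score": 0.7},
    {"trace_id": "t2", "step_type": "tool_call", "tool_ok": False},
    {"trace_id": "t2", "step_type": "tool_call", "tool_ok": True},
    {"trace_id": "t2", "step_type": "thought", "tokens": 100, "eval_score": 0.4,
     "guardrail_triggered": True},
]


def agent_metrics(spans):
    """Aggregate span-level records into system-wide agent metrics."""
    tool_calls = [s for s in spans if s["step_type"] == "tool_call"]
    llm_steps = [s for s in spans if s["step_type"] != "tool_call"]
    steps_per_trace = Counter(s["trace_id"] for s in spans)
    guardrail_hits = sum(1 for s in spans if s.get("guardrail_triggered"))
    return {
        "tool_call_success_rate": mean(1.0 if s["tool_ok"] else 0.0 for s in tool_calls),
        "mean_tokens_per_step": mean(s["tokens"] for s in llm_steps),
        "mean_eval_score": mean(s["eval_score"] for s in llm_steps),
        "mean_path_depth": mean(steps_per_trace.values()),
        "guardrail_triggers_per_trace": guardrail_hits / len(steps_per_trace),
    }


metrics = agent_metrics(spans)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy data the request in trace `t2` ultimately succeeds, yet the tool call success rate still drops to two in three, which is the silent-retry effect described in the first bullet.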
Semantic Conventions for Agent Telemetry
Without semantic conventions, every team invents its own span attribute names. One team uses llm.model, another uses gen_ai.model_name, a third uses model_id. This prevents cross-team correlation, breaks shared dashboards, and makes vendor migration painful.
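During a migration, one stopgap is to rename ad hoc attribute keys to a shared name before spans reach dashboards. The sketch below maps the three variants mentioned above onto `gen_ai.request.model`. The alias table and the function name `normalize_attributes` are assumptions for illustration, not part of any standard.

```python
# Ad hoc attribute names mapped onto one shared name (illustrative table).
ALIASES = {
    "llm.model": "gen_ai.request.model",
    "gen_ai.model_name": "gen_ai.request.model",
    "model_id": "gen_ai.request.model",
}


def normalize_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes with known aliases renamed."""
    return {ALIASES.get(key, key): value for key, value in attributes.items()}


team_a = {"llm.model": "gpt-4o", "agent.name": "planning_agent"}
team_b = {"model_id": "gpt-4o", "agent.name": "research_agent"}

print(normalize_attributes(team_a))
print(normalize_attributes(team_b))
```

Renaming after the fact keeps dashboards working, but it does not remove the need to fix the instrumentation at the source.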
The OpenTelemetry GenAI semantic conventions define standard attributes for LLM calls: model name, token counts, prompt and completion content, and system identifiers. For agents specifically, the conventions are still evolving. The following attributes represent the practical minimum to standardize now, drawn from the emerging OTel conventions, the OWASP Agent Observability Standard [3], and evolving OTel best practices [4]:
- agent.name: identifies which agent in a multi-agent system produced this span
- agent.step_type: classifies the span as thought, action, tool_call, or reflection
- tool.name, tool.input, tool.output: captures the full tool interaction
- eval.score, eval.metric_name: attaches quality evaluation to each span
Here is how to set these attributes in Python using the OpenTelemetry SDK:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Export spans in batches to an OTLP collector over gRPC.
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.instrumentation")
# One span per agent step: identify the agent, the step type, the model, and the score.
with tracer.start_as_current_span("agent.reflection") as span:
    span.set_attribute("agent.name", "planning_agent")
    span.set_attribute("agent.step_type", "reflection")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("eval.metric_name", "faithfulness")
    span.set_attribute("eval.score", 0.87)
Teams that adopt these conventions now will have interoperable telemetry when the conventions stabilize. Teams that wait will face re-instrumentation across every service that emits agent telemetry.
What to Watch For: Common Telemetry Blind Spots
Even well-instrumented agent systems can produce misleading telemetry. These are the failure modes we see most often in production deployments.
- Silent sub-agent failures. A parent agent retries a failed sub-agent call internally and eventually succeeds. Top-level metrics show a successful request with slightly elevated latency. The sub-agent failure never surfaces unless we instrument at every level of the agentic hierarchy, not just the entry point. The fix: every sub-agent and tool call gets its own span with success and failure attributes.
- Token cost drift. Model provider pricing changes can shift cost per trace significantly between billing cycles. If we track token consumption only at the request level, we miss which specific reasoning steps are driving cost increases. Track token consumption per span and alert on step-level anomalies.
- Evaluation coverage shortfalls. Teams instrument traces comprehensively but never attach evaluation scores. The result is 500K traces per day flowing into storage with no quality signal. This is expensive data collection, not observability. If we are not scoring spans, we are not monitoring agent quality.
- Sampling that hides edge cases. Aggressive head-based sampling drops traces uniformly, which means rare but critical failure traces are discarded at the same rate as routine successes. Use tail-based sampling that retains traces with errors, high latency, or guardrail triggers. The traces we most need to investigate are rare, so uniform sampling can leave few or none of them to examine.
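The tail-based sampling rule from the last bullet can be sketched as a decision made once a trace is complete. The span fields (`status`, `duration_ms`, `guardrail_triggered`), the 5-second threshold, and the 5% baseline rate are all assumptions for illustration; in production this decision usually lives in the collection pipeline rather than in application code.

```python
import random


def keep_trace(spans, latency_threshold_ms=5_000, baseline_rate=0.05, rng=random.random):
    """Decide, after a trace has completed, whether to retain it."""
    has_error = any(s.get("status") == "error" for s in spans)
    guardrail_hit = any(s.get("guardrail_triggered", False) for s in spans)
    # The slowest span is used as the trace latency; in a well-formed
    # trace that is the root span, which encloses all of its children.
    slowest_ms = max((s.get("duration_ms", 0) for s in spans), default=0)
    if has_error or guardrail_hit or slowest_ms > latency_threshold_ms:
        return True  # always keep the traces worth investigating
    return rng() < baseline_rate  # keep a small share of routine traces


# Hypothetical trace: the request succeeds, but only after a sub-agent retry.
silent_failure = [
    {"name": "handle_request", "status": "ok", "duration_ms": 900},
    {"name": "sub_agent.lookup", "status": "error", "duration_ms": 300},
    {"name": "sub_agent.lookup", "status": "ok", "duration_ms": 350},
]
routine = [{"name": "handle_request", "status": "ok", "duration_ms": 400}]

print(keep_trace(silent_failure))
print(keep_trace(routine, rng=lambda: 0.99))
```

The first trace is retained even though its root span reports success, because the failed sub-agent span is visible to the sampler. This only works if every sub-agent call has its own span, as the first bullet recommends.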
From Telemetry to Trust: Connecting Traces to Evaluation
Collecting 500K traces per day without scoring them is expensive storage, not observability. The value of telemetry is only realized when it feeds automated evaluation [5] that tells you whether each agent response was faithful, grounded, and relevant.
The challenge is where evaluation runs. Calling an external API for every trace adds latency to the pipeline. It also introduces data exposure risk by sending production traces to a third-party service. Beyond these operational concerns, there is a direct financial cost. At 500K traces per day, enterprises can incur approximately $260K annually in external API evaluation costs alone. This per-query financial cost is the Evaluation Trust Tax, and it scales linearly with trace volume.
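To see what the figures above imply per trace and why the cost scales linearly, here is a short back-of-the-envelope calculation. It assumes 365 days of steady volume and one external evaluation per trace, neither of which is stated in the text, so the per-trace figure is only an implied average.

```python
TRACES_PER_DAY = 500_000    # volume quoted above
ANNUAL_COST_USD = 260_000   # annual estimate quoted above
DAYS_PER_YEAR = 365         # assumption: steady volume all year

traces_per_year = TRACES_PER_DAY * DAYS_PER_YEAR
implied_cost_per_trace = ANNUAL_COST_USD / traces_per_year


def annual_cost(traces_per_day, cost_per_trace=implied_cost_per_trace):
    """Linear cost model: annual cost grows in proportion to trace volume."""
    return traces_per_day * DAYS_PER_YEAR * cost_per_trace


print(traces_per_year)                    # 182500000
print(round(implied_cost_per_trace, 5))   # 0.00142
print(round(annual_cost(1_000_000)))      # 520000
```

At roughly 0.14 cents per trace the unit cost looks negligible, which is why it is easy to miss until volume doubles and the annual bill doubles with it.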
Fiddler Centor Models take a different approach. They are batteries-included, in-environment evaluation models that require no external LLM calls. No data leaves the customer's environment. No external API is called. There is no per-evaluation cost. Evaluation runs at the span level with response times under 100ms.
In practice, the workflow looks like this. A trace is captured via OpenTelemetry and exported to Fiddler's OTLP endpoint. Each span is scored by Centor Models for faithfulness, groundedness, and relevance. Scores feed Continuous Monitoring dashboards that surface quality trends across the agentic hierarchy.
When a score violates a defined threshold, Enforceable Policy actions trigger automatically. These can be alerts, blocks, or escalations depending on severity.
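The threshold-to-action step can be illustrated with a generic sketch. This is not Fiddler's API: the `Action` enum, the `POLICY` table, and the threshold values are hypothetical, chosen only to show how severity could map a span score to an alert, an escalation, or a block.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"
    ESCALATE = "escalate"
    BLOCK = "block"


# Hypothetical policy: (upper bound on score, action), most severe first.
POLICY = [
    (0.3, Action.BLOCK),
    (0.5, Action.ESCALATE),
    (0.7, Action.ALERT),
]


def decide(score: float) -> Action:
    """Return the action for the first threshold the score falls below."""
    for upper_bound, action in POLICY:
        if score < upper_bound:
            return action
    return Action.ALLOW


for score in (0.87, 0.62, 0.41, 0.15):
    print(score, decide(score).value)
```

Ordering the table from most to least severe means a very low score is blocked outright instead of merely raising an alert.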
Nielsen's production deployment of Ask Nielsen demonstrates this pipeline end to end. OpenTelemetry traces from their multi-agent system feed Fiddler for real-time evaluation and root-cause analysis, turning raw telemetry into a continuous quality signal across every agent, tool call, and reasoning step.
Conclusion
The telemetry hierarchy for production agents is clear: distributed traces that capture the full decision path are the primary signal. Agent-specific metrics derived from those traces (tool call success rate, token consumption per step, evaluation scores, decision path depth, guardrail trigger rate) provide the system-wide view. Logs fill in supporting context but cannot replace either.
The real frontier is not collecting more telemetry. It is making telemetry evaluable by default. Every span should carry a quality score at capture time, turning passive data collection into an active quality signal.
As OpenTelemetry's GenAI semantic conventions stabilize, teams that adopt them now will have interoperable, vendor-neutral agent observability from the start. Teams that wait will pay the re-instrumentation cost later.
A concrete next step: audit your current agent instrumentation against the semantic conventions outlined in this article. Check whether every span carries the attributes needed for cross-team correlation and automated evaluation. The conventions are the foundation; evaluation on top of those conventions is what turns telemetry into trust.
See how Fiddler AI Observability connects OpenTelemetry to continuous evaluation.
References
[1] OpenTelemetry Authors, "Semantic Conventions for Generative AI Systems," OpenTelemetry, 2024. [Online]. Available: https://opentelemetry.io/docs/specs/semconv/gen-ai/
[2] OpenTelemetry Authors, "Agent Deployment Pattern," OpenTelemetry, 2024. [Online]. Available: https://opentelemetry.io/docs/collector/deploy/agent/
[3] OWASP Foundation, "AI Agent Observability Standards," OWASP GenAI, 2025. [Online]. Available: https://genai.owasp.org/resource/ai-agent-observability-standards/
[4] OpenTelemetry Authors, "AI Agent Observability - Evolving Standards and Best Practices," OpenTelemetry Blog, 2025. [Online]. Available: https://opentelemetry.io/blog/2025/ai-agent-observability/
[5] Anthropic, "Demystifying Evals for AI Agents," Anthropic Engineering, 2025. [Online]. Available: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
[6] OpenAI, "How We Monitor Internal Coding Agents for Misalignment," OpenAI, 2025. [Online]. Available: https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/
