Published on

May 28, 2026

Last Edited

July 1, 2026

OpenTelemetry for AI Observability: What It Covers and Where It Stops

A practitioner's guide to using OpenTelemetry as your AI telemetry foundation, and understanding where purpose-built evaluation takes over.

Fiddler Team

Table of Contents

Key Takeaways

OpenTelemetry is the right telemetry standard for AI observability because it provides vendor-neutral distributed tracing, metrics, and logs across LLM calls, agent workflows, and traditional infrastructure.
The GenAI semantic conventions now standardize how teams capture model attributes, token usage, and latency, but they do not cover output evaluation, safety scoring, or content quality assessment.
Production AI observability requires OpenTelemetry as the data plane paired with a purpose-built evaluation layer that scores outputs for faithfulness, toxicity, and policy compliance without relying on external API calls.

The Problem in Production

A financial services team deploys an agentic workflow to automate research summaries for portfolio managers. The system chains a retrieval agent, a summarization LLM, and a compliance-check agent. The infrastructure team instruments it with their existing APM stack. Within a week, they can see latency percentiles and error rates. They cannot see anything else that matters.

They cannot trace why the retrieval agent chose one data source over another. They cannot determine whether the summarization model hallucinated a statistic. They cannot verify that the compliance agent actually enforced their internal policy on forward-looking statements. The APM dashboards show green. The compliance team finds three violations in a single morning.

This is the instrumentation blind spot. Traditional observability captures infrastructure signals. It misses the decision lineage of AI workloads: prompt construction, model selection, retrieval quality, and output evaluation. These are the signals that determine whether an agent observability strategy is producing trustworthy results.

The fragmentation compounds the problem. Teams end up with separate dashboards for infrastructure metrics, proprietary LLM traces from one vendor, and evaluation results from yet another tool. No single trace connects a user request through the full agentic hierarchy (the complete decision tree of agent calls, tool invocations, and sub-agent outputs). When something goes wrong, root-cause analysis requires correlating timestamps across three systems manually. The challenge of LLM observability extends well beyond what traditional APM was designed to handle.

The real question is whether a single telemetry standard can unify these signals into one trace.

Why OpenTelemetry Solves AI Telemetry Fragmentation

OpenTelemetry solves the fragmentation problem at the telemetry layer. It provides a vendor-neutral instrumentation standard that lets teams capture traces, metrics, and logs from AI workloads using the same framework they already use for traditional infrastructure. The evolving standards around AI agent observability are making this increasingly practical.

Three properties make it the right foundation for LLM observability.

Vendor neutrality: Instrument once, export to any backend. This matters for AI because the observability vendor landscape is shifting rapidly. Teams evaluating platforms today do not want to reinstrument when they switch backends in six months. OpenTelemetry decouples instrumentation from the backend entirely.
Distributed tracing across the agentic hierarchy: OpenTelemetry traces can capture the full call graph from a user request through an orchestrator, sub-agents, tool calls, and LLM invocations. Each step becomes a span with structured attributes. For a multi-agent system processing 500K traces per day, this means every decision in the agentic hierarchy is recorded in a format that any OTel-compatible backend can ingest [3].
Unified pipeline: The OTel Collector [1] handles metrics, traces, and logs in a single pipeline. This enables correlation between AI-specific signals and traditional infrastructure signals. A latency spike in an LLM call can be traced back to a network partition or a rate limit from the model provider, all within the same trace context.

Auto-instrumentation libraries accelerate adoption. Projects like OpenLIT and Traceloop, along with framework-native support in CrewAI and LangGraph, provide an instrumentation library approach that lets teams add OpenTelemetry to AI applications with minimal code changes. The following example shows a basic OTel-instrumented LLM call with GenAI semantic attributes:

from opentelemetry import trace
from opentelemetry.semconv.ai import SpanAttributes

tracer = trace.get_tracer("ai-research-agent")

with tracer.start_as_current_span("llm.chat") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.request.max_tokens", 2048)
    span.set_attribute("gen_ai.request.temperature", 0.2)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=2048,
        temperature=0.2,
    )

    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
    span.set_attribute("gen_ai.response.finish_reasons", [response.choices[0].finish_reason])

What the GenAI Semantic Conventions Standardize

The OpenTelemetry GenAI SIG has defined semantic conventions [2] that standardize how teams record AI-specific telemetry. Recent work on GenAI observability has advanced these conventions further. The key attributes include:

gen_ai.system: The AI provider (e.g., openai, anthropic, aws.bedrock)
gen_ai.request.model: The specific model invoked
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens: Token consumption per call
gen_ai.response.finish_reasons: Why the model stopped generating (e.g., stop, length, tool_calls)

These conventions enable cross-vendor comparison. Switching from one model provider to another does not require reinstrumentation. Attribute names, trace structure, and dashboards remain consistent across providers.

The conventions are currently in beta, with some attributes still experimental. Teams should expect evolution as the GenAI SIG incorporates feedback from production deployments. The foundation is stable enough to build on today.

Where OpenTelemetry Stops

OpenTelemetry captures what happened. It does not assess whether what happened was good. This is the fundamental boundary between telemetry and evaluation, and it is where most production AI observability architectures fall short.

Five specific areas sit above what OpenTelemetry covers:

Output evaluation: Was the response faithful to the retrieved context? Was it hallucinated? OpenTelemetry has no concept of faithfulness or groundedness scoring. A span can record that an LLM returned 1,200 tokens in 850ms. It cannot record that those tokens contradicted the source documents.
Safety and policy compliance: Was the response toxic? Did it expose PII? OpenTelemetry traces can capture inputs and outputs when configured to do so. Scoring those outputs for safety, toxicity, or regulatory compliance requires a separate evaluation layer that understands content semantics.
Content quality at the span level: OpenTelemetry can tell you an LLM call completed successfully. It cannot tell you the output was irrelevant to the user's question. Relevance, coherence, and completeness are evaluation judgments, not telemetry signals.
Guardrail enforcement: OpenTelemetry is passive instrumentation. It records but does not intercept, redact, or block. Pre-LLM guardrails that intercept inputs before they reach the model and post-execution guardrails that inspect outputs before they are returned require active middleware. Telemetry alone cannot enforce policy.
Cost attribution beyond tokens: Token counts are necessary but not sufficient for enterprise cost management. OpenTelemetry traces do not surface the Evaluation Trust Tax: the per-query cost of calling external LLM judges for evaluation. For example, Enterprises running 500K traces per day can incur approximately $260K annually in these costs alone. These figures vary by model, deployment size, and traffic volume, but they represent a real operational cost that telemetry does not surface.

The architectural implication is clear. OpenTelemetry is the data plane. It captures structured telemetry across the full agentic hierarchy. Production AI observability also requires a control plane: a layer that ingests that telemetry and applies evaluation, scoring, and enforcement on top of it. Teams weighing the build vs. buy decision for this layer should consider what it takes to maintain evaluation infrastructure at scale.

The Fiddler AI Observability and Security platform ingests OpenTelemetry telemetry via its OTel integration and layers batteries-included Trust Models on top for in-environment evaluation. Trust Models run inside the customer's deployment, scoring outputs for faithfulness, toxicity, and policy compliance with under 100ms response time, no external API calls, and no per-evaluation cost.

Four Production Pitfalls When Instrumenting AI with OpenTelemetry

Deploying OpenTelemetry for AI workloads introduces failure modes that do not exist in traditional microservice tracing. Four patterns appear consistently in production.

Context loss in multi-turn conversations: OpenTelemetry trace context propagation assumes request-response patterns. Multi-turn agent conversations can lose parent context if sessions are not explicitly managed. When a user interacts with an agent across multiple turns, each turn may start a new trace unless the application explicitly propagates a session-level context. This makes it impossible to reconstruct the full conversation flow from traces alone.
Prompt and completion capture creates data governance risk: Capturing full prompts and completions in OpenTelemetry spans is powerful for debugging. It also creates data governance exposure. Prompts may contain customer data, PII, or proprietary information. Before enabling content capture, teams must implement sampling, redaction, and retention policies. The configuration is straightforward but the implications are significant:

# Enables full prompt/completion capture in OTel GenAI instrumentation.
# WARNING: Review data governance policies before enabling in production.
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true

# Recommended: pair with a collector processor that redacts sensitive fields
# before export to your observability backend.

‍‍Span explosion in agentic hierarchies: A single user request to a multi-agent system can generate hundreds of spans. An orchestrator calls three sub-agents; each sub-agent makes multiple tool calls and LLM invocations; each invocation generates its own child spans. Without tail-based sampling or span limits, this overwhelms backends and budgets. We recommend starting with head-based sampling at 10% for high-volume endpoints and using tail-based sampling to capture 100% of error traces.‍
Evaluation latency versus trace latency: If evaluation runs synchronously in the request path, it adds latency to every response. If evaluation runs asynchronously, the trace closes before evaluation results are available. Teams must architect for this tradeoff explicitly. In-environment evaluation with under 100ms response time makes synchronous scoring feasible at scale. External API-based evaluation does not.

What Becomes Possible When OpenTelemetry and Evaluation Work Together

With OpenTelemetry as the telemetry foundation and purpose-built evaluation layered on top, a new architectural pattern becomes available. Recent observability research [4] confirms this convergence is accelerating.

Unified trace from request to evaluation: A single trace connects the user request through the agent decision tree through every evaluation result. This gives teams full visibility from request through evaluation to root cause. When a healthcare AI assistant produces an inaccurate summary, the team traces from the user query through retrieval, generation, and the evaluation score that flagged the output, all within a single span hierarchy. Agentic Observability platforms make this possible by connecting span-level telemetry to end-to-end observability across the full lifecycle.
Cross-framework portability: Teams can swap agent frameworks without losing observability coverage [5]. Moving from LangGraph to CrewAI to a custom orchestrator requires changing instrumentation libraries, not rearchitecting the observability pipeline. The telemetry standard stays the same.
Enterprise governance without data exposure: With structured telemetry and in-environment evaluation, teams can satisfy audit requirements without shipping prompt data to external APIs. This is a prerequisite for regulated industries. Financial services firms operating under SR 11-7 and healthcare organizations subject to HIPAA need evaluation results that never leave their environment.
What remains unsolved: Real-time evaluation of multi-modal outputs (images, audio, video) has no standardized telemetry convention. Cost attribution across compound AI systems, where a single user request triggers multiple models and tools with different pricing, remains fragmented. Standardized conventions for embedding evaluation results directly into OpenTelemetry spans are still in early discussion within the GenAI SIG. These are active areas of development that will shape the next generation of AI observability.

The Telemetry Standard Meets the Evaluation Layer

Return to the financial services team from the opening. With OpenTelemetry capturing structured telemetry across their agentic hierarchy and purpose-built evaluation scoring every output, they can trace every agent decision, identify the exact span where a compliance violation originated, and verify that guardrails enforced their policies. The three violations their compliance team found manually are now flagged automatically, in real time, within the same trace.

The convergence of OpenTelemetry's GenAI semantic conventions with purpose-built evaluation platforms is creating a new architectural standard for production AI. Teams that adopt this pattern now will be positioned to scale agent complexity without sacrificing visibility or governance. The telemetry standard is here. The question is what you build on top of it.

To see how Fiddler connects OpenTelemetry traces to in-environment evaluation, request a demo.

References

[1] OpenTelemetry, "OpenTelemetry Collector," OpenTelemetry Documentation. [Online]. Available: https://opentelemetry.io/docs/collector/

[2] OpenTelemetry, "Semantic Conventions for Generative AI Systems," OpenTelemetry Specification. [Online]. Available: https://opentelemetry.io/docs/specs/semconv/gen-ai/

[3] Cisco Outshift, "AI Observability in Multi-Agent Systems Using OpenTelemetry," Cisco Outshift Blog, 2025. [Online]. Available: https://outshift.cisco.com/blog/ai-ml/ai-observability-multi-agent-systems-opentelemetry

[4] M. Shridhar et al., "AI Observability for Developer Productivity Tools," arXiv preprint arXiv:2604.17092, 2025. [Online]. Available: https://arxiv.org/pdf/2604.17092

[5] Microsoft, "Azure AI Foundry: Advancing OpenTelemetry and Delivering Unified Multi-Agent Observability," Microsoft Tech Community, 2025. [Online]. Available: https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/azure-ai-foundry-advancing-opentelemetry-and-delivering-unified-multi-agent-obse/4456039

Frequently Asked Questions

Should AI Observability use OpenTelemetry?

Yes. OpenTelemetry provides the vendor-neutral telemetry layer that AI workloads need for distributed tracing, metrics, and logs. Production systems also require a purpose-built evaluation layer for output quality, safety, and policy compliance.

What are OpenTelemetry GenAI semantic conventions?

They are standardized attribute definitions for AI telemetry signals, including model name, token usage, latency, and finish reasons. They enable consistent instrumentation across LLM providers so teams can switch models without reinstrumenting.

Can OpenTelemetry track multi-agent AI workflows?

OpenTelemetry's distributed tracing captures the full call graph across orchestrators, sub-agents, and tool calls. Teams should manage trace context explicitly for multi-turn conversations to avoid context loss across session boundaries.

What are the limitations of OpenTelemetry for AI Observability?

OpenTelemetry captures what happened but does not evaluate whether outputs were accurate, safe, or policy-compliant. Output evaluation, guardrail enforcement, and content quality scoring require a separate evaluation layer above the telemetry standard.