Key Takeaways
- Harness engineering is the discipline of building everything around an AI agent except the model itself. Most implementations neglect the observability layer that tells you whether the harness is working.
- Current harness approaches split into feedforward controls (what the agent can do) and feedback controls (what happened after it acted). Observability is the feedback layer most teams skip.
- Evaluating harness quality requires standardized telemetry across the agentic hierarchy, not ad hoc log inspection. Without it, silent failures compound.
The Production Failure Mode No Alert Catches
A financial services team deploys a customer-facing agent. It passes every pre-deployment eval. Completion quality scores clear the bar. The team ships.
Two weeks later, a routine manual review reveals a 30% drop in response quality. No alert fired. No dashboard turned red. The harness had no continuous evaluation pipeline, so the degradation was invisible until a human happened to look.
This is the core failure mode in agent harness engineering today.
Coding agents expose the same pattern. Teams invest in sandboxed execution environments, file permission controls, and shell access restrictions, all of which are feedforward controls. But few instrument the feedback side: whether the agent's code changes are correct, whether its reasoning about the codebase is grounded, whether it is making the same class of mistakes repeatedly across thousands of daily interactions. The harness knows what the coding agent is allowed to do. It rarely knows whether what the agent is doing is actually working.
The formula is simple: Agent = Model + Harness. The model is a single component. The harness is everything else. Tool orchestration, context management, memory, constraints, hooks, and the feedback loops that keep it all honest. Harness engineering is the discipline of building, configuring, and maintaining that execution environment so the agent behaves correctly under real-world conditions.
Most teams invest heavily in the action side of the harness. They define tool permissions. They build context assembly pipelines. They add schema validation and pre-action hooks. This is the feedforward layer; it steers the agent before it acts.
The feedback layer is where things break down. Martin Fowler and Birgitta Böckeler describe this split in their analysis of harness engineering [1] for coding agents: feedforward controls prevent known failure modes; feedback controls evaluate what happened after the agent acted. Fowler raises a question that most teams cannot answer: if your harness sensors never fire, is that because quality is high, or because detection is inadequate?
In most production harnesses, feedback controls are binary. Tests pass or fail. Linters flag or do not flag. There is no continuous signal. No quality scoring across agent populations. No drift detection over time. The observation side of the harness is either absent or stitched together from ad hoc logging. The challenge of monitoring agents in production is well documented, yet most teams still treat it as an afterthought.
What a Harness Actually Contains
Harness engineering is distinct from two adjacent disciplines. Prompt engineering optimizes what you say to the model. Context engineering optimizes what the model knows at inference time. Harness engineering optimizes the entire execution environment the model operates in. Context is one input to the harness. It is not the harness itself.
A production harness has two sides.
The action side (feedforward controls) includes four component categories:
- Tool definitions and permissions: What external systems the agent can access, what operations it can perform, and what it is explicitly blocked from doing.
- Context management: Compaction strategies, progressive disclosure of information, memory systems, and retrieval pipelines that determine what the model sees at each turn.
- Architectural constraints: Schema validation, typed interfaces, output format enforcement, and structural linters that catch malformed agent actions before execution. The risk of architectural drift [2] increases as agents operate over longer horizons.
- Hooks and enforcement: Pre-action interceptors that block known-bad operations and post-action validators that confirm the agent's output meets structural requirements.
The observation side (feedback controls) includes three component categories:
- Evaluation: Scoring agent outputs against quality dimensions. Faithfulness, groundedness, relevance, and safety scoring using LLM-as-a-Judge patterns or purpose-built evaluation models.
- Observability: Trace-level telemetry across the full agentic hierarchy. Application, session, agent, trace, and span. This is the structured record of what the agent did, when, and with what inputs and outputs.
- Continuous monitoring: Alerting on quality degradation trends, anomaly detection across agent populations, and drift measurement against established baselines.
Addy Osmani provides a comprehensive synthesis of agent harness [3] components that maps closely to this taxonomy. The Fiddler four-stage walkthrough details the feedback-side components in practice, and guidance on building agentic workflows with safety and accuracy shows how evaluation fits into the development lifecycle.
The imbalance in practice is consistent. Teams build sophisticated feedforward controls. They write detailed tool schemas. They implement multi-layer context assembly. Then they rely on ad hoc log inspection for feedback. Grep through CloudWatch. Spot-check individual traces. Hope someone notices when quality degrades.
The action side gets engineered. The feedback side gets improvised.
What The Feedback Loop Requires
Feedforward controls handle known failure modes. If you know an agent should never execute a DELETE on a production database, you write a constraint. If you know the context window has a token limit, you build a compaction strategy.
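A constraint like the DELETE rule above can be sketched as a pre-action hook. This is a minimal illustration, not a reference implementation: the `ToolCall` shape, the `pre_action_hook` name, and the choice to also block DROP and TRUNCATE are assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical rule: destructive statements that must never reach production.
BLOCKED_SQL = re.compile(r"^\s*(DELETE|DROP|TRUNCATE)\b", re.IGNORECASE)


@dataclass
class ToolCall:
    tool: str         # name of the tool the agent wants to invoke
    environment: str  # e.g. "production" or "staging"
    statement: str    # SQL text the agent proposes to run


def pre_action_hook(call: ToolCall) -> tuple[bool, str]:
    """Return (allowed, reason) before the tool call is executed."""
    if call.tool == "sql" and call.environment == "production":
        if BLOCKED_SQL.match(call.statement):
            return False, "destructive statement blocked in production"
    return True, "ok"


blocked = pre_action_hook(ToolCall("sql", "production", "DELETE FROM accounts"))
allowed = pre_action_hook(ToolCall("sql", "production", "SELECT * FROM accounts"))
print(blocked)  # (False, 'destructive statement blocked in production')
print(allowed)  # (True, 'ok')
```

The hook only covers failure modes someone thought to write down, which is exactly the limit of feedforward controls.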
Observability catches the failures you did not predict. It answers the question that feedforward controls cannot: is this agent performing well right now, across the full population of requests it is handling?
Building harness observability requires four layers.
- Standardized telemetry across the agentic hierarchy. Every agent action produces a structured trace with spans for context assembly, model inference, tool execution, and evaluation. These traces roll up from span to trace to agent to session to application. Without this hierarchy, you are debugging individual requests instead of understanding system behavior.
- Evaluation at trace level. Each agent turn receives a quality score on dimensions that matter for the use case: faithfulness and groundedness for retrieval-augmented agents, safety scoring for customer-facing deployments. Anthropic's deep dive into agent evaluation [4] demonstrates how these scoring patterns operate across thousands of daily agent interactions. In a harness, every trace is scored; evaluation is not limited to a sample.
- Drift detection across agent populations. Current behavior is compared against established baselines. A faithfulness score that was 0.92 last week and is 0.84 this week is a signal, even if 0.84 still passes the absolute threshold.
- Alerting on quality degradation before users notice. Threshold alerts on individual metrics. Trend alerts on rolling averages. Anomaly detection on distribution shifts.
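The drift and alerting layers above can be sketched with a baseline comparison over a rolling window. The function name `check_quality`, the `floor` and `max_drop` parameters, and the daily scores are assumptions chosen to mirror the 0.92 to 0.84 example. The sketch covers threshold and trend alerts only, not distribution-shift anomaly detection.

```python
from statistics import mean


def check_quality(scores, baseline, floor=0.80, max_drop=0.05, window=7):
    """Return (alerts, rolling_average) for the most recent `window` scores.

    scores   : chronological per-period mean scores for one metric
    baseline : established reference value for the same metric
    floor    : absolute pass threshold (threshold alert)
    max_drop : tolerated gap between baseline and rolling average (drift alert)
    """
    recent = scores[-window:]
    rolling = mean(recent)
    alerts = []
    if recent[-1] < floor:
        alerts.append("threshold")
    if baseline - rolling > max_drop:
        alerts.append("drift")
    return alerts, round(rolling, 3)


# Hypothetical daily faithfulness means for the current week.
this_week = [0.86, 0.85, 0.84, 0.84, 0.83, 0.84, 0.82]
alerts, rolling = check_quality(this_week, baseline=0.92)
print(alerts, rolling)  # ['drift'] 0.84
```

Every daily score here still clears the absolute floor, so a threshold-only alert would stay silent. The drift check fires because the rolling average sits 0.08 below the baseline.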
Fiddler Centor Models deliver low-latency, cost-effective evaluations for task-specific and complex use cases. They run entirely in-environment. No external API call is made. No data leaves the customer's deployment. Evaluation happens at under 80ms response time. The Evals Trust Tax is eliminated because the evaluation model is local, purpose-built, and does not incur per-query API charges.
Here is what harness-level telemetry instrumentation looks like in practice:
from agent_telemetry import Tracer, SpanKind

# session_id, memory_store, agent, trust_model, and assemble_context are
# assumed to be provided by the surrounding application.
tracer = Tracer(service="customer-support-agent")

with tracer.start_span("agent_turn", kind=SpanKind.AGENT) as span:
    span.set_attribute("session.id", session_id)
    span.set_attribute("agent.model", "claude-4-sonnet")

    # Feedforward: context assembly
    context = assemble_context(session_id, memory_store)
    span.set_attribute("context.token_count", len(context.tokens))

    # Agent action
    response = agent.run(context)

    # Feedback: in-environment evaluation, recorded on the same span
    eval_result = trust_model.evaluate(
        input=context,
        output=response,
        metrics=["faithfulness", "groundedness", "safety"],
    )
    span.set_attribute("eval.faithfulness", eval_result.faithfulness)
    span.set_attribute("eval.groundedness", eval_result.groundedness)
    span.set_attribute("eval.safety_pass", eval_result.safety_pass)

The pattern is not complex. The discipline is instrumenting every agent turn, not just the ones you are debugging.
Four Places The Feedback Layer Breaks Down
Four failure modes appear consistently in teams building harness observability.
- Silent evaluation drift: Your evaluation model itself can degrade. If you only evaluate agent output but never evaluate the evaluator, you build false confidence. Monitor evaluator agreement rates against human judgment on a rolling sample. A quarterly calibration check is not sufficient for agents handling thousands of daily interactions.
- Over-instrumentation tax: Capturing every span attribute across every trace creates storage and compute overhead that undermines the system it is meant to support. Be selective. Instrument the decision points: tool selection, context retrieval, response generation. Skip internal bookkeeping operations that do not influence agent behavior.
- Telemetry without enforcement: Collecting traces without connecting them to policy responses is monitoring theater. Traces that flag quality drops should trigger automated responses. Throttle autonomy. Escalate to human review. Activate guardrails. Observability without action is just expensive logging. Understanding agent security threat models helps teams connect observability signals to the right enforcement actions.
- Metric collapse in multi-agent systems: When an orchestrator agent delegates to sub-agents, aggregate metrics mask individual agent failures. A system-level faithfulness score of 0.88 can hide a sub-agent scoring 0.62. Use the agentic hierarchy to preserve per-agent visibility. Span-level telemetry rolls up to aggregate insights, but the per-agent detail must remain accessible.
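The metric-collapse case can be made concrete with a small grouping step. The agent names, the scores, and the 0.80 floor below are hypothetical, chosen so that the system-level mean matches the 0.88 and 0.62 figures in the last bullet.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical span records: (agent name, faithfulness score).
spans = [
    ("retriever", 0.97),
    ("planner", 0.96),
    ("summarizer", 0.97),
    ("billing_lookup", 0.62),
]


def per_agent_means(records):
    """Group span-level scores by agent and average each group."""
    grouped = defaultdict(list)
    for agent, score in records:
        grouped[agent].append(score)
    return {agent: round(mean(vals), 2) for agent, vals in grouped.items()}


system_score = round(mean(score for _, score in spans), 2)
by_agent = per_agent_means(spans)
flagged = [agent for agent, score in by_agent.items() if score < 0.80]

print(system_score)  # 0.88
print(flagged)       # ['billing_lookup']
```

The aggregate looks healthy while one sub-agent is well below the floor, which is why the per-agent breakdown has to survive the rollup.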
From Reactive Debugging To Proactive Governance
With standardized telemetry and continuous evaluation built into the harness, teams can quantify harness effectiveness for the first time. Instead of asking whether the harness is working, they can measure it. Evaluation pass rates. Drift velocity. Quality percentiles across agent populations.
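As a rough sketch of how these three measures might be computed from per-trace scores: the definitions below are assumptions, since the text does not fix them. In particular, "drift velocity" is taken here to mean the average change in the mean score per period.

```python
from statistics import quantiles


def pass_rate(scores, threshold):
    """Fraction of traces whose score meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def drift_velocity(period_means):
    """Average change in the mean score per period (negative = degrading)."""
    return (period_means[-1] - period_means[0]) / (len(period_means) - 1)


def quality_deciles(scores):
    """Nine cut points splitting the score distribution into tenths."""
    return quantiles(scores, n=10)


# Hypothetical per-trace scores and weekly means.
trace_scores = [0.90, 0.70, 0.85, 0.95]
weekly_means = [0.92, 0.90, 0.88, 0.84]

print(pass_rate(trace_scores, threshold=0.80))  # 0.75
print(drift_velocity(weekly_means))             # about -0.027 per week
print(quality_deciles(trace_scores)[0])         # lowest decile cut point
```

Low percentiles are usually more informative than the mean here, because they show how the worst-served requests are doing.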
This enables a shift from reactive debugging to proactive governance. Instead of analyzing failures after users report them, the team acts on a signal: the agent's faithfulness score dropped 8% this week, so investigate before users notice. That is a fundamentally different operational posture.
The shift also changes team dynamics. Platform teams can set quality baselines and alerting thresholds. Application teams can iterate on harness configurations and see the impact in real time. Governance teams can audit agent behavior from structured telemetry rather than request-level log review.
What remains unsolved: harness coverage metrics. Software testing has code coverage and mutation testing. Harness engineering has no equivalent. We cannot yet measure what percentage of agent failure modes are covered by the current harness. We do not know if the feedforward controls handle 30% of possible failures or 90%. This is the next frontier for the discipline. A recent evaluation survey [5] confirms that agent evaluation itself remains an evolving area with significant open questions.
Why One Side Is Never Enough
Return to the financial services agent. With harness observability in place, the 30% quality degradation would have triggered an alert within hours, not weeks. The team would have traced the issue to a specific span, identified the root cause, and deployed a fix before customers noticed.
The harness engineering equation needs an update. Agent = Model + Harness still holds. But the harness itself has two sides: Action Layer + Observation Layer. Building one without the other leaves you with a well-constructed agent you cannot verify.
The teams that treat observability as a first-class harness component will be the ones that can safely increase agent autonomy. The rest will keep discovering problems the way that financial services team did: late, manually, and expensively.
Harness-level observability tells you how one agent is performing. A control plane unifies that signal across every harness, every team, and every deployment, making the diversity of your agent infrastructure governable from a unified platform.
A harness gets your agent running. A control plane governs what it does at scale.
References
[1] M. Fowler and B. Böckeler, "Harness Engineering for Coding Agent Users," martinfowler.com. [Online]. Available: https://martinfowler.com/articles/harness-engineering.html
[2] R. Lopopolo, "Harness Engineering: Leveraging Codex in an Agent-First World," OpenAI. [Online]. Available: https://openai.com/index/harness-engineering/
[3] A. Osmani, "Agent Harness Engineering," addyosmani.com. [Online]. Available: https://addyosmani.com/blog/agent-harness-engineering/
[4] Anthropic, "Demystifying Evals for AI Agents," Anthropic Engineering Blog. [Online]. Available: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
[5] Y. Liu et al., "A Survey on LLM-Based Agents: Evaluation," arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2503.16416
