How we build evals for Deep Agents

LangChain Blog

Effective agent evaluation isn't about accumulating test cases. After building and operating Deep Agents across production systems like Fleet and Open SWE, the pattern is clear: every eval you add creates pressure on your agent's behavior. Add an eval for efficient file reading, and you'll tweak prompts and tool descriptions until it passes. Keep that eval, and it continues shaping behavior over time. The problem is that most teams treat eval count as a proxy for quality, chasing high scores on benchmark suites that don't reflect production behavior.

The alternative is behavior-focused curation. Start by cataloging what actually matters in your system: retrieving content across multiple files, composing five-plus tool calls in sequence, handling context overflow gracefully. Then build or adapt evals that measure those specific capabilities in verifiable ways. This means pulling selectively from benchmarks like Terminal Bench 2.0 or BFCL rather than running them wholesale, writing targeted unit tests for isolated behaviors, and most importantly, converting production failures into evals.
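The catalog-then-cover loop above can be sketched as a small registry. Everything here (the `Capability` dataclass, the benchmark subset strings) is a hypothetical illustration, not a real API:

```python
from dataclasses import dataclass, field

@dataclass
class Capability:
    """One behavior that matters in production, and where its evals come from."""
    name: str
    description: str
    eval_sources: list = field(default_factory=list)  # benchmark subsets, unit tests, prod failures

# Hypothetical catalog entries mirroring the examples in this post
CATALOG = [
    Capability(
        name="multi_file_retrieval",
        description="Retrieve and synthesize content across multiple files",
        eval_sources=["terminal_bench_2.0:file_tasks", "prod_failure:fleet_trace_0412"],
    ),
    Capability(
        name="tool_composition",
        description="Compose five-plus tool calls in sequence",
        eval_sources=["bfcl:multi_turn_subset", "unit:test_sequential_tools"],
    ),
]

def evals_for(capability_name: str) -> list:
    """Return the eval sources registered for a capability, or an empty list."""
    for cap in CATALOG:
        if cap.name == capability_name:
            return cap.eval_sources
    return []
```

The point of the registry is selectivity: a benchmark contributes only the subsets that map to a named capability, never wholesale.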

Dogfooding generates the highest-signal eval data. When Open SWE generates bug-fix PRs across different codebases, every traced interaction that fails becomes a candidate eval. The key infrastructure piece is comprehensive tracing to a shared project where anyone can analyze failure modes and assess whether an eval still matters. This creates shared ownership and prevents eval debt from accumulating.

Taxonomy matters more than provenance. Grouping evals by source (internal vs external benchmarks) tells you nothing about what they measure. Instead, tag by capability: file_operations for tool pagination and parallel invocation, retrieval for multi-hop document synthesis, tool_use for state tracking across turns, memory for context persistence, conversation for clarifying vague requests. This middle view between aggregate scores and individual runs shows where models actually differ in production-relevant ways.
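The capability-tag "middle view" could be computed like this; the result record shape is an assumption for illustration:

```python
from collections import defaultdict

def pass_rate_by_capability(results: list[dict]) -> dict[str, float]:
    """Aggregate eval results by capability tag rather than by source.

    Each result is assumed to look like:
        {"eval_id": "...", "capability": "file_operations", "passed": True}
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r["capability"]] += 1
        passes[r["capability"]] += r["passed"]  # bools count as 0/1
    return {cap: passes[cap] / totals[cap] for cap in totals}
```

Comparing two models through this view surfaces, say, a retrieval gap that an aggregate score would average away.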

For metrics, correctness comes first. If a model can't reliably complete your tasks, efficiency is irrelevant. How you measure correctness varies by eval type: custom assertions for structural requirements like parallelized tool calls, exact matching against benchmark ground truth, and LLM-as-judge for semantic checks like whether the agent persisted the right information to memory. Only after multiple models clear the correctness bar do efficiency metrics matter.
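As one example of a custom structural assertion, here is a sketch that checks whether an agent batched its file reads in parallel, assuming a hypothetical trace format (a list of turns, each turn a list of tool-call names):

```python
def made_parallel_reads(trace: list[list[str]], min_parallel: int = 2) -> bool:
    """Pass if any single turn issued `min_parallel` or more read_file calls.

    Sequential reads spread one call per turn fail this check, even if
    the final answer was correct -- the structure itself is the requirement.
    """
    for turn in trace:
        reads = [call for call in turn if call.startswith("read_file")]
        if len(reads) >= min_parallel:
            return True
    return False
```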

Efficiency metrics should capture real production costs. Step ratio (observed steps over ideal steps) and tool call ratio reveal unnecessary work. Latency ratio accounts for model round trips, provider latency, wrong turns, and tool execution time. Solve rate combines these: expected steps divided by observed latency, zeroed out if the task fails. This single metric captures both accuracy and speed. Model selection then becomes straightforward: filter by correctness on the tasks you care about, then compare efficiency tradeoffs.
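The ratios above reduce to a few lines. These helpers follow the definitions in this section; the signatures themselves are illustrative:

```python
def step_ratio(observed_steps: int, ideal_steps: int) -> float:
    """Observed over ideal: values well above 1.0 signal unnecessary work."""
    return observed_steps / ideal_steps

def solve_rate(expected_steps: int, observed_latency_s: float, succeeded: bool) -> float:
    """Expected steps divided by observed latency, zeroed out on failure.

    A failed task scores 0.0 regardless of speed, so the metric folds
    correctness and efficiency into one number.
    """
    if not succeeded or observed_latency_s <= 0:
        return 0.0
    return expected_steps / observed_latency_s
```

A model that solves a 5-step task in 10 seconds scores 0.5; one that solves it in 25 seconds scores 0.2; one that fails scores 0.0 no matter how fast it gave up.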

The operational pattern is continuous: run evals on model updates, analyze traces for failure modes, update eval coverage, repeat. Separate SDK unit tests (prompt passthrough, interrupt config, subagent routing) from model capability evals since any model passes those. Including them in scoring adds noise, not signal.
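Keeping SDK unit tests out of the capability score might look like this, with the `kind` tag being a hypothetical convention:

```python
def capability_score(results: list[dict]) -> float:
    """Mean pass rate over capability evals only.

    SDK unit tests (prompt passthrough, interrupt config, subagent routing)
    are excluded: every model passes them, so including them inflates every
    score equally and adds noise.
    """
    scored = [r for r in results if r["kind"] == "capability"]
    if not scored:
        return 0.0
    return sum(r["passed"] for r in scored) / len(scored)
```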

Running many models across many evals gets expensive fast. Targeted evals save money while improving agents, because you're optimizing for behaviors that matter in production rather than chasing benchmark leaderboards. The discipline lies in resisting the urge to add tests blindly. Each eval should have a docstring explaining exactly what agent capability it measures and why that capability matters. If you can't articulate that, the eval probably doesn't belong in your suite.
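The docstring discipline can even be enforced mechanically. This sketch assumes a hypothetical convention where every eval docstring carries a `Capability:` line and a `Why it matters:` line:

```python
import inspect

REQUIRED_SECTIONS = ("Capability:", "Why it matters:")

def has_eval_docstring(eval_fn) -> bool:
    """True if the eval's docstring names both its capability and its justification."""
    doc = inspect.getdoc(eval_fn) or ""
    return all(section in doc for section in REQUIRED_SECTIONS)

def eval_context_overflow_recovery():
    """Capability: memory -- graceful handling of context overflow.

    Why it matters: long-running production sessions routinely exceed the
    context window, and agents that silently drop earlier instructions
    fail in ways users can't see.
    """
```

A check like this can run in CI, so an eval with no articulated purpose never lands in the suite in the first place.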