Human judgment in the agent improvement loop

The most consequential insight from teams shipping production LLM agents isn't about model selection or prompt engineering—it's that automated evaluations calibrated to human judgment scale better than manual review. This matters because agent behavior is fundamentally unpredictable until runtime, making tight production feedback loops the only reliable path to quality.

Consider a SQL-generating agent for financial traders. The agent needs two layers of tacit knowledge: trading conventions (what "today's exposure" means in context) and database practicalities (which tables are stale, which query patterns fail). Neither lives in documentation. Both require domain experts to surface and encode. The question is how to extract that expertise efficiently.

The naive approach is having experts review every agent output. This doesn't scale and creates a bottleneck that kills iteration velocity. The better pattern: experts spend time upfront calibrating automated evaluators, then those evaluators run continuously against production traffic. A compliance officer might review 50 agent responses to establish what "meets risk standards" means, then encode those criteria into automated checks. The agent can now validate thousands of queries per day against that judgment.
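The calibration step above can be sketched in code. This is a minimal, hypothetical example: the criteria names, the stale-table list, and the thresholds are illustrative stand-ins for whatever a compliance officer would actually encode after reviewing those 50 responses, not a real risk policy.

```python
# Hypothetical sketch: expert-calibrated review criteria encoded as automated checks.
# FORBIDDEN_TABLES and the individual rules are illustrative, not a real policy.

FORBIDDEN_TABLES = {"positions_raw", "pnl_staging"}  # tables experts flagged as stale

def meets_risk_standards(sql: str) -> tuple[bool, list[str]]:
    """Return (passed, reasons) for a generated query, mirroring expert criteria."""
    reasons = []
    lowered = sql.lower()
    if "select *" in lowered:
        reasons.append("unbounded column selection")
    for table in FORBIDDEN_TABLES:
        if table in lowered:
            reasons.append(f"references stale table: {table}")
    if "limit" not in lowered:
        reasons.append("missing row limit")
    return (not reasons, reasons)
```

Once the criteria live in code like this, they can run against every production query, and the `reasons` list gives experts something concrete to audit when recalibrating.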

This shifts human time from repetitive review to higher-leverage activities: designing workflow constraints, configuring tool boundaries, and structuring retrievable context. For workflow design, the tradeoff is LLM flexibility versus deterministic control. Letting the model autonomously sequence actions reduces latency and tokens, but regulatory requirements often demand hard-coded validation steps. The SQL agent might freely generate queries but must pass automated compliance checks before returning results. Those checks encode expert judgment once, then run everywhere.
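The generate-freely-but-gate-deterministically pattern might look like the sketch below. `generate_query`, `run_query`, and the check names are hypothetical stand-ins for the agent's LLM call, database client, and compliance suite.

```python
# Hypothetical sketch: the LLM generates SQL freely, but a hard-coded validation
# step gates execution. generate_query and run_query are injected stand-ins.

def answer(question: str, generate_query, run_query, compliance_checks) -> dict:
    """Generate SQL with the model, then apply deterministic compliance gates."""
    sql = generate_query(question)  # the model has full freedom here
    failures = [name for name, check in compliance_checks if not check(sql)]
    if failures:                    # deterministic gate: no model discretion
        return {"status": "blocked", "failed_checks": failures}
    return {"status": "ok", "rows": run_query(sql)}
```

The point of the design is that the gate is ordinary code in the workflow, not a prompt instruction, so regulators can audit it and it cannot be talked out of its rules.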

Tool design presents a similar tension. A generic execute_sql tool maximizes capability but increases risk. Parameterized query templates are safer but constrain what the agent can do. The right choice depends on your risk tolerance and the actual failure modes you observe in production. You won't know which approach works until you run evaluations against real user queries and measure both task success rate and violation frequency.
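The template end of that spectrum can be sketched as follows. The template names, SQL, and parameter fields are invented for illustration; the design point is that the agent selects and fills templates rather than writing raw SQL.

```python
# Hypothetical sketch: parameterized query templates as the constrained
# alternative to a generic execute_sql tool. Templates here are illustrative.

TEMPLATES = {
    "exposure_by_desk": "SELECT desk, SUM(notional) FROM trades "
                        "WHERE trade_date = :date GROUP BY desk",
    "top_positions": "SELECT symbol, qty FROM positions "
                     "WHERE book = :book ORDER BY qty DESC LIMIT :n",
}

def run_template(name: str, params: dict) -> str:
    """The agent picks a template and supplies parameters; it never writes SQL."""
    if name not in TEMPLATES:
        raise ValueError(f"unknown template: {name}")
    sql = TEMPLATES[name]
    # Real code would bind params through the DB driver; this check just
    # illustrates the reduced surface area the agent can touch.
    missing = [p for p in ("date", "book", "n") if f":{p}" in sql and p not in params]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return sql
```

A generic `execute_sql` tool would replace all of this with one free-form string argument, which is exactly the tradeoff: more capability, more ways to fail.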

Context engineering—deciding what information the agent accesses and when—has evolved significantly. Early agents crammed everything into system prompts. Modern patterns like Anthropic's Skills standard let agents fetch curated documentation, examples, and domain rules at runtime. This scales context without bloating prompts, but requires upfront work organizing knowledge for retrieval. For the SQL agent, this means not just documenting the schema but capturing tribal knowledge about data quality, common query mistakes, and interpretation conventions.
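A toy version of runtime context retrieval is sketched below. The knowledge entries are invented examples of the tribal knowledge described above, and the keyword scoring is a deliberate simplification; a real system would use embeddings or a Skills-style index.

```python
# Hypothetical sketch of runtime context retrieval: instead of packing every
# convention into the system prompt, the agent fetches relevant notes on demand.
# These entries are illustrative tribal knowledge, not a real schema.

KNOWLEDGE = {
    "exposure": "Convention: 'today's exposure' means net notional "
                "as of the last EOD snapshot.",
    "positions_raw": "Data quality: positions_raw lags by one day; "
                     "prefer positions_clean.",
    "joins": "Common mistake: joining trades to positions on symbol alone "
             "double-counts multi-book symbols.",
}

def retrieve_context(question: str, k: int = 2) -> list[str]:
    """Naive keyword overlap scoring; a stand-in for proper retrieval."""
    words = question.lower().split()
    scored = [(sum(w in note.lower() for w in words), note)
              for note in KNOWLEDGE.values()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for score, note in scored[:k] if score > 0]
```

Whatever the retrieval mechanism, the upfront work is the same: experts must write these notes down in retrievable units before the agent can fetch them.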

The practical implementation loop looks like this: ship a minimal agent to production or a production-like environment immediately, instrument every step to collect traces, and use that data to identify where the agent fails or produces low-confidence outputs. Then—and this is critical—have domain experts review a sample of those failure cases to understand the gap between agent behavior and desired outcomes. Use those insights to either add automated guardrails, refine tool configurations, or enrich retrievable context.
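The instrument-then-sample part of that loop can be sketched minimally. The record fields, confidence floor, and sample size are illustrative choices, not a prescribed schema.

```python
# Hypothetical sketch of loop instrumentation: record a trace per agent step,
# then sample failed or low-confidence steps for expert review.

import random

traces: list[dict] = []

def record(run_id: str, step: str, ok: bool, confidence: float) -> None:
    """Append one step-level trace record."""
    traces.append({"run_id": run_id, "step": step,
                   "ok": ok, "confidence": confidence})

def review_sample(n: int = 10, confidence_floor: float = 0.7) -> list[dict]:
    """Surface failures and low-confidence steps, not every output."""
    candidates = [t for t in traces
                  if not t["ok"] or t["confidence"] < confidence_floor]
    return random.sample(candidates, min(n, len(candidates)))
```

This is what makes expert review affordable: they see a curated sample of the interesting cases rather than the full firehose of production traffic.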

The key metric isn't just task completion rate. Track confidence scores, tool selection accuracy, validation failure rates, and latency at each workflow stage. For the SQL agent, you'd measure query correctness, compliance pass rate, and whether the agent retrieves relevant schema documentation before generating queries. These metrics tell you where to invest expert time next.
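Aggregating those per-stage signals is straightforward once traces exist; a sketch follows, with illustrative field names (`stage`, `ok`, `latency_ms`) rather than any particular tracing schema.

```python
# Hypothetical sketch: roll step-level traces up into per-stage pass rates and
# latency, so expert time goes to the weakest stage. Field names illustrative.

from collections import defaultdict

def stage_metrics(traces: list[dict]) -> dict[str, dict[str, float]]:
    """Per-stage pass rate and mean latency from step records."""
    by_stage = defaultdict(list)
    for t in traces:
        by_stage[t["stage"]].append(t)
    return {
        stage: {
            "pass_rate": sum(t["ok"] for t in steps) / len(steps),
            "mean_latency_ms": sum(t["latency_ms"] for t in steps) / len(steps),
        }
        for stage, steps in by_stage.items()
    }
```

A dashboard over numbers like these answers the "where to invest next" question directly: the stage with the worst pass rate is where the next round of expert calibration should go.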

What doesn't work: waiting to ship until the agent is "perfect," relying on synthetic evals that don't reflect actual user behavior, or asking experts to review outputs without giving them a systematic way to encode their judgment into the system. The teams that succeed treat human expertise as a configuration input, not a continuous operational requirement. They build tight loops where automated evals surface edge cases, experts calibrate the system's understanding of quality, and the agent improves without human review time scaling linearly alongside it.