Better-Harness: A Recipe for Harness Hill-Climbing with Evals
The promise of automated harness optimization is compelling: point an agent at your eval suite, let it iteratively refine prompts and tools, watch scores climb. The reality is messier. Without deliberate safeguards, you end up with a harness that aces your test set but faceplants in production—classic overfitting, now with more tokens burned.
Better-Harness tackles this by treating harness optimization like model training, complete with holdout sets, behavioral tagging, and manual review gates. The core insight is sound: evals are training data for agents, and the same discipline we apply to dataset curation should apply here. But the devil lives in the implementation details, and this is where the framework gets interesting.
The eval sourcing strategy combines three channels: hand-curated examples (high signal, low volume), production trace mining (high volume, variable quality), and external datasets (broad coverage, needs heavy curation). The critical move is tagging everything by behavioral category—tool selection, multi-step reasoning, error recovery. This enables targeted experiments and meaningful holdout splits. Without tags, you're flying blind, unable to diagnose whether a harness change improved tool usage at the cost of reasoning quality.
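The tagging scheme can be sketched in a few lines. The record fields and tag names below are illustrative assumptions, not Better-Harness's actual schema; the point is that every eval carries its source channel and one or more behavioral tags, so experiments can slice by category:

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical eval record; field names are illustrative, not from Better-Harness.
@dataclass
class EvalCase:
    case_id: str
    source: str  # "curated", "production", or "external"
    tags: list = field(default_factory=list)  # behavioral categories
    prompt: str = ""
    expected: str = ""

def group_by_tag(cases):
    """Index eval cases by behavioral tag so experiments can target one category."""
    index = defaultdict(list)
    for case in cases:
        for tag in case.tags:
            index[tag].append(case)
    return dict(index)

cases = [
    EvalCase("e1", "curated", ["tool_selection"]),
    EvalCase("e2", "production", ["multi_step_reasoning", "tool_selection"]),
    EvalCase("e3", "external", ["error_recovery"]),
]
index = group_by_tag(cases)
```

Note that a case can carry multiple tags (e2 above), which is exactly what lets you diagnose trade-offs like "improved tool usage at the cost of reasoning quality" after a harness change.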
The optimization loop itself mirrors gradient descent: run baseline, diagnose failures from traces, propose a single targeted change, validate against both new passes and existing regressions. The one-change-at-a-time constraint is crucial. When you update prompts and tools simultaneously, you lose causal attribution. Did the score jump because the new instruction clarified tool usage, or because you changed the tool's output format? You can't tell, and you can't reliably reproduce the win.
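The validation gate at the end of each iteration reduces to a simple invariant: a candidate change is accepted only if it produces at least one new pass and zero regressions. A minimal sketch, assuming pass/fail results keyed by case ID (the runner that produces these results is left out):

```python
# Hypothetical validation gate for the one-change-at-a-time loop.
# baseline_results / candidate_results map case_id -> pass (True) or fail (False).

def validate_change(baseline_results, candidate_results):
    """Accept a change only if it fixes at least one failure and breaks nothing."""
    new_passes = [cid for cid, ok in candidate_results.items()
                  if ok and not baseline_results[cid]]
    regressions = [cid for cid, ok in candidate_results.items()
                   if not ok and baseline_results[cid]]
    accepted = len(new_passes) > 0 and len(regressions) == 0
    return accepted, new_passes, regressions

baseline = {"e1": True, "e2": False, "e3": True}
candidate = {"e1": True, "e2": True, "e3": True}
accepted, gained, lost = validate_change(baseline, candidate)
```

Because only one harness change is in flight per iteration, an accepted result attributes the gain to that change alone; with batched changes, the same gate would tell you *that* the score moved but not *why*.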
Human review acts as the second line of defense against overfitting. Automated validation catches score regressions, but it misses subtler failure modes: instructions that overfit to optimization set quirks, prompts bloated with edge-case handling that wastes tokens without improving generalization. This is where production experience matters. A senior engineer reviewing proposed changes can spot the difference between "this instruction clarifies ambiguous tool behavior" and "this instruction hardcodes a workaround for three specific eval cases."
The holdout set design deserves scrutiny. Splitting by behavioral category makes sense, but how you sample matters enormously. If your optimization set skews toward simple single-tool cases and your holdout includes complex multi-step scenarios, you're not measuring generalization—you're measuring distribution shift. The framework recommends stratified sampling, which is correct, but underspecifies how to handle category imbalance. If 60 percent of your evals are tool selection and 5 percent are error recovery, naive splitting leaves you with tiny holdout sets for rare behaviors.
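One way to patch the underspecified imbalance handling is a per-category floor: split each behavioral category separately, but guarantee a minimum number of holdout cases per category so rare behaviors like error recovery stay measurable. The sketch below is an assumption about how such a split could work, not the framework's prescribed method; the fraction and floor are tuning knobs:

```python
import random

def stratified_split(cases_by_tag, holdout_frac=0.2, min_holdout=5, seed=0):
    """Split each behavioral category separately, with a floor of min_holdout
    holdout cases per category so rare behaviors stay measurable."""
    rng = random.Random(seed)
    optimization, holdout = [], []
    for tag, cases in cases_by_tag.items():
        shuffled = cases[:]
        rng.shuffle(shuffled)
        k = max(min_holdout, int(len(shuffled) * holdout_frac))
        k = min(k, len(shuffled) - 1)  # keep at least one case for optimization
        holdout.extend(shuffled[:k])
        optimization.extend(shuffled[k:])
    return optimization, holdout

# Skewed distribution from the text: 60 tool-selection cases, 5 error-recovery.
by_tag = {
    "tool_selection": [f"ts_{i}" for i in range(60)],
    "error_recovery": [f"er_{i}" for i in range(5)],
}
opt_set, hold_set = stratified_split(by_tag)
```

With a naive 20 percent split, the 5-case error-recovery category would land a single holdout example; the floor instead reserves most of the rare category for holdout, which trades optimization signal on that behavior for the ability to measure generalization on it at all.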
The practical results—18 percent improvement on optimization sets, 12 percent on holdout—suggest the approach works, but those numbers hide important context. What's the baseline score? If you're starting at 40 percent pass rate, a 12-point gain is transformative. If you're starting at 85 percent, you're chasing diminishing returns. The cost implications also matter. Each optimization iteration burns tokens on trace analysis, change proposals, and validation runs. At scale, with expensive models and large eval suites, this adds up fast. The recommendation to use representative sampling is pragmatic, but it introduces another tuning problem: how small can you make the optimization set before the signal degrades?
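Representative sampling for cost control can reuse the same tags: shrink the optimization set to a fixed budget while preserving the per-category mix, so the cheaper subset still exercises every behavior. This is a hedged sketch of one plausible scheme, with made-up sizes and a budget parameter that the source leaves as an open tuning problem:

```python
import random

def representative_sample(cases_by_tag, budget, seed=0):
    """Draw a budget-sized subset that preserves the per-category mix,
    never dropping a category to zero."""
    rng = random.Random(seed)
    total = sum(len(v) for v in cases_by_tag.values())
    sample = []
    for tag, cases in cases_by_tag.items():
        # Proportional share of the budget, with a floor of one case per category.
        k = max(1, round(budget * len(cases) / total))
        sample.extend(rng.sample(cases, min(k, len(cases))))
    return sample

# Illustrative sizes: a 130-case optimization set sampled down to 26 runs.
by_tag = {
    "tool_selection": [f"ts{i}" for i in range(120)],
    "error_recovery": [f"er{i}" for i in range(10)],
}
subset = representative_sample(by_tag, budget=26)
```

How small the budget can go before the signal degrades remains empirical: per-category pass rates computed over two or three cases are too noisy to drive accept/reject decisions, which argues for a floor well above one in practice.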
The framework's emphasis on behavioral tagging and holdout discipline is its strongest contribution. Too many teams treat harness optimization as prompt tweaking without structure, iterating on vibes instead of metrics. Better-Harness formalizes the feedback loop and forces you to confront overfitting head-on. But it's not a turnkey solution. You still need to curate quality evals, design meaningful behavioral categories, and maintain human oversight. The automation handles the mechanical work of proposing and validating changes. The strategic work—deciding what behaviors matter, how to measure them, when to ship—remains firmly in human hands.