How Arize Skills Improved RAG Recall from 39% to 75% in 8 Hours

Arize AI Blog

You can't optimize what you don't measure systematically. The Arize Skills experiment demonstrates something practical about RAG development: closing the evaluation loop programmatically beats manual iteration by orders of magnitude. The core pattern here is worth unpacking because it generalizes beyond this specific toolchain.

The technical setup is straightforward. A LangGraph agent with two nodes (retrieve via OpenSearch kNN, generate via GPT-4o-mini) runs against a legal document corpus. Arize Skills provides Claude Code with programmatic access to create experiments and read evaluation results. The agent modifies its own code based on structured feedback from Recall@1, Recall@5, and Recall@10 metrics, commits changes, re-indexes when necessary, and repeats until hitting the target threshold.
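A minimal sketch of that two-node pipeline, with stubs standing in for the OpenSearch kNN query and the GPT-4o-mini call (all names and state fields here are illustrative, not the experiment's actual code):

```python
from typing import TypedDict

class RAGState(TypedDict):
    query: str
    docs: list[str]
    answer: str

def retrieve(state: RAGState) -> RAGState:
    # In the real agent this is an OpenSearch kNN query against the
    # active index alias; a canned result stands in here.
    state["docs"] = [f"doc matching: {state['query']}"]
    return state

def generate(state: RAGState) -> RAGState:
    # In the real agent this calls GPT-4o-mini with the retrieved context.
    state["answer"] = f"answer from {len(state['docs'])} retrieved chunk(s)"
    return state

def run_pipeline(query: str) -> RAGState:
    # Two nodes wired in sequence: retrieve -> generate.
    state: RAGState = {"query": query, "docs": [], "answer": ""}
    return generate(retrieve(state))
```

The point of keeping the graph this small is that the agent's edits land in the retrieval node and indexing code, not the orchestration.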

What makes this work is the blue/green index pattern. RAG iteration is destructive by default because changing chunk size or embedding strategy requires full re-indexing. The solution: create versioned indices (self_ralph_v1, v2, v3...), run experiments against each, and atomically swap the ralphton alias only when performance improves. Previous indices stay available for instant rollback. Over 17 iterations, 11 index versions were created. Several performed worse than baseline and were never promoted. This is critical infrastructure for safe autonomous iteration.
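The promote-only-on-improvement decision can be modeled in a few lines. In OpenSearch the swap itself is a single atomic `_aliases` update; this hypothetical sketch captures only the bookkeeping around it:

```python
from __future__ import annotations

class BlueGreenRegistry:
    """Versioned indices with a serving alias that moves only when a
    candidate beats the incumbent. Old indices are never deleted, so
    rollback is just re-pointing the alias."""

    def __init__(self, alias: str) -> None:
        self.alias = alias
        self.recall: dict[str, float] = {}  # index name -> Recall@5
        self.active: str | None = None      # index the alias points at

    def evaluate_and_maybe_promote(self, index: str, recall_at_5: float) -> bool:
        self.recall[index] = recall_at_5
        incumbent = self.recall.get(self.active, -1.0) if self.active else -1.0
        if recall_at_5 > incumbent:
            self.active = index   # atomic alias swap in the real system
            return True
        return False              # candidate kept for analysis, never serves
```

Against the numbers in this post: a v1 at 0.39 promotes, a v2 at 0.52 promotes, and a regressing v3 at 0.48 stays on disk but never serves traffic.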

The metric progression tells you where the leverage actually is. Baseline Recall@5 was 39 percent. Changing chunk size from 1000 to 400 tokens jumped it to 52 percent, a 13 point gain. This happened in iteration two and became the foundation for everything after. For legal documents, clause-level chunks beat paragraph-level chunks decisively. RRF and BM25 weight tuning added 4 points. HyDE as both kNN and BM25 signal added another 2 points, with BM25 surprisingly outperforming the embedding signal for precise legal terminology. Multi-query expansion contributed 5 points. The final jump from 63 to 75 percent came from two-stage GPT-4o reranking.
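For reference, the RRF weight tuning mentioned above operates on weighted Reciprocal Rank Fusion, which scores each document as the sum of w_i / (k + rank_i) across signals. A generic sketch (k=60 is the common default constant, not necessarily what the experiment used):

```python
from __future__ import annotations

def rrf_fuse(rankings: list[list[str]],
             weights: list[float] | None = None,
             k: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion over several ranked lists
    (e.g. kNN, BM25, HyDE signals): score(d) = sum_i w_i / (k + rank_i(d))."""
    weights = weights or [1.0] * len(rankings)
    scores: dict[str, float] = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Raising a signal's weight shifts the fused order toward that signal's ranking, which is exactly the knob the 4-point RRF/BM25 tuning step turned.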

What didn't work matters as much. Cross-encoder reranking with ms-marco added only 1 point with significant latency cost and was abandoned. Adding more than six RRF signals degraded performance through signal dilution. These negative results were discovered in hours instead of weeks because the evaluation loop ran continuously without human intervention.

The Arize Skills integration is doing something specific that existing observability tools don't: it makes evaluation results directly consumable by code-generating agents through structured commands. This isn't tracing or logging; it's programmatic access to experiment comparisons across iterations. The key difference from manual evaluation is consistency. Every experiment used identical recall calculations against the same dataset, so results were directly comparable even as the codebase changed underneath. No evaluation drift.
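That kind of consistency is easy to get when the metric is a fixed pure function applied to a frozen dataset, iteration after iteration. A hypothetical Recall@k helper of the sort involved:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant document set found in the top-k
    retrieved IDs. Same function, same dataset, every iteration."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)
```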

The CLAUDE.md directive structure is the other critical piece. After every story completion: run an experiment, analyze failures, generate new stories if Recall@5 is below 80 percent, and continue until the target is met. The PRD grows dynamically based on eval results. This means the agent doesn't stop when it runs out of predefined work; it generates new optimization candidates from failure analysis.
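The control flow those directives describe reduces to an outer loop that keeps generating work while the metric is below target. Sketched with stub callables (the names and the 25-iteration cap are illustrative, not from the experiment):

```python
def optimization_loop(run_experiment, generate_stories,
                      target: float = 0.80, max_iters: int = 25) -> list[float]:
    """Run experiments until Recall@5 hits the target. Below target,
    failure analysis feeds new stories into the backlog, so the PRD
    grows dynamically instead of being fixed up front."""
    history = []
    for _ in range(max_iters):
        recall_at_5, failures = run_experiment()  # structured eval result
        history.append(recall_at_5)
        if recall_at_5 >= target:
            break
        generate_stories(failures)  # new optimization candidates
    return history
```

The stopping condition is the whole design: the loop terminates on the metric, not on an exhausted task list.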

The practical limitation is that this pattern requires a well-defined metric target and a dataset that actually represents production queries. Recall@5 is measurable and unambiguous. If your RAG quality problem is more about answer correctness or hallucination, you need LLM-as-judge evals, which introduce their own consistency problems across iterations. The blue/green pattern also assumes you can afford to maintain multiple index versions simultaneously, which has storage and compute costs.

The switching cost to replicate this is moderate. You need structured evaluation that's callable from code, version-controlled indices with atomic alias swapping, and an agent framework that can modify its own implementation. The specific tools matter less than the pattern: tight feedback loops between code changes and quantitative metrics, with safe rollback mechanisms.

What this really demonstrates is that RAG optimization is a search problem over a high-dimensional configuration space, and manual search is pathologically inefficient. Chunk size, embedding model, retrieval signals, reranking strategy, and query expansion all interact. Trying combinations manually takes weeks. An autonomous loop that commits changes, measures impact, and iterates based on structured feedback compresses that timeline to hours.