Prompt Learning: How We Made Claude Code 20% Better Without Changing the Model
Prompt Learning treats your agent's instruction file as a trainable artifact. The core insight is simple: if you're running coding agents in production, you're already generating labeled data every time someone accepts or rejects a change. That signal is sitting in your git history, and most teams ignore it.
The technique works by analyzing commit patterns and failure modes to identify what your agent consistently gets wrong, then codifying corrections as natural language rules in your agent's instruction file. For Claude Code, that's CLAUDE.md. For Cursor, it's .cursorrules. The file format doesn't matter—what matters is systematically deriving instructions from observed behavior rather than guessing what might help.
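To make this concrete, here is what a few derived rules in a CLAUDE.md might look like. The rules below are hypothetical illustrations of the format, not taken from any real project:

```markdown
## Conventions the agent repeatedly missed

- Use the project's `ApiClient` wrapper instead of calling `requests` directly.
- Target the v2 billing API; v1 endpoints are deprecated.
- Run migrations with `make migrate`, never raw `manage.py migrate`.
```

Each bullet corresponds to one observed failure mode, phrased as an imperative the agent can apply without extra context.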
Laurie Voss's results on SWE-Bench Lite show this isn't marginal. A 5 percentage point improvement on cross-repo tasks (40% to 45%) and nearly 11 percentage points on Django-specific problems represents real productivity gains. More striking: GPT-4.1 with optimized prompts nearly matched Sonnet 4.5's baseline performance. That's a model generation gap closed with better instructions, which has obvious cost implications if you're running thousands of agent sessions monthly.
The approach is straightforward to implement. Start with your git log. Filter for commits that came from agent suggestions—most teams tag these or can identify them by commit message patterns. Categorize rejections and revisions. Common failure modes emerge quickly: the agent ignores project-specific conventions, makes assumptions about API versions, or applies patterns that worked in training data but don't fit your codebase. Each failure mode becomes a candidate instruction.
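The mining step can be sketched in a few lines. This assumes agent commits carry an `[agent]` prefix, full rejections use git's default `Revert "…"` message, and human follow-ups use `fixup!` commits; all three conventions are assumptions for illustration, not prescribed by the article:

```python
import re
from collections import Counter

# Hypothetical commit log: (message, files_touched)
COMMITS = [
    ('[agent] Add user export endpoint', ['api/views.py']),
    ('Revert "[agent] Add user export endpoint"', ['api/views.py']),
    ('[agent] Fix datetime handling in reports', ['reports/models.py']),
    ('fixup! [agent] Fix datetime handling in reports', ['reports/models.py']),
    ('[agent] Refactor settings loader', ['config/loader.py']),
]

def categorize(commits):
    """Bucket agent commits by what happened to them afterwards."""
    agent = [m for m, _ in commits if m.startswith('[agent]')]
    reverted = {re.match(r'Revert "(.*)"', m).group(1)
                for m, _ in commits if m.startswith('Revert "')}
    revised = {m.removeprefix('fixup! ')
               for m, _ in commits if m.startswith('fixup! ')}
    stats = Counter()
    for msg in agent:
        if msg in reverted:
            stats['rejected'] += 1   # change was fully backed out
        elif msg in revised:
            stats['revised'] += 1    # change needed human follow-up
        else:
            stats['accepted'] += 1
    return stats

print(categorize(COMMITS))
# → Counter({'rejected': 1, 'revised': 1, 'accepted': 1})
```

In practice you would feed this from `git log --pretty=format:%s` and then inspect the rejected and revised buckets by hand to name the failure modes.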
The key is specificity. Generic instructions like "follow best practices" don't work. Instructions like "always use timezone-aware datetime objects in Django models" or "prefer dataclasses over NamedTuple for data structures with more than three fields" do. You're encoding the implicit knowledge that experienced engineers on your team already have but that the agent can't infer from context alone.
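As a concrete illustration of what the timezone rule above encodes (a sketch, not from the article):

```python
from datetime import datetime, timezone

# What "always use timezone-aware datetime objects" forbids: a naive
# datetime with no tzinfo, which breaks comparisons against aware values
# and triggers warnings in Django when USE_TZ=True.
naive = datetime.utcnow()

# What the rule requires instead: an aware datetime.
aware = datetime.now(timezone.utc)

print(naive.tzinfo)  # → None
print(aware.tzinfo)  # → UTC
```

A rule at this granularity is checkable in review and unambiguous to the agent, which is exactly what "follow best practices" is not.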
This differs fundamentally from few-shot prompting or RAG. Few-shot examples show the agent what good looks like but don't explain why. RAG retrieves relevant context but doesn't provide decision rules. Prompt Learning generates explicit heuristics that apply across sessions. The agent doesn't need to see your Django models in context if it already knows your timezone handling rules.
The tradeoff is maintenance burden. As your codebase evolves, some instructions become stale or contradictory. You need a process to review and prune the instruction file, ideally tied to your normal code review cycle. Instructions should have attribution and dates so you can deprecate them when conventions change.
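One lightweight way to make pruning reviewable is to date each rule and flag old ones automatically. The date-comment convention and the one-year threshold here are assumptions for illustration:

```python
from datetime import date

# Hypothetical instruction entries: (rule, ISO date added)
INSTRUCTIONS = [
    ("always use timezone-aware datetime objects in Django models", "2024-01-10"),
    ("prefer dataclasses over NamedTuple for data structures", "2023-03-02"),
]

STALE_AFTER_DAYS = 365  # flag anything older than a year for review

def stale(instructions, today):
    """Return rules whose added-date is past the review threshold."""
    return [rule for rule, added in instructions
            if (today - date.fromisoformat(added)).days > STALE_AFTER_DAYS]

print(stale(INSTRUCTIONS, date(2024, 6, 1)))
# → ['prefer dataclasses over NamedTuple for data structures']
```

Flagged rules go into the normal review cycle rather than being deleted automatically, since a stale date does not always mean a stale convention.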
The technique also requires enough agent usage to generate signal. If your team runs ten agent sessions a week, you don't have enough data. If you're running hundreds, the pattern recognition becomes tractable. The technique scales better for teams with established codebases and conventions than for greenfield projects where patterns haven't stabilized.
The practical implication for platform teams: before upgrading to the next model version or investing in fine-tuning infrastructure, audit your instruction files. Most teams are running default configurations or hand-written rules that haven't been updated in months. Systematic prompt optimization from production data is lower effort and lower risk than model changes, and the results suggest it can deliver comparable gains for domain-specific tasks.