How My Agents Self-Heal in Production
The most interesting part of this self-healing pipeline isn't the coding agent that writes fixes—it's the statistical gating layer that prevents it from hallucinating solutions to problems that don't exist. Most production LLM systems fail not because they can't generate code, but because they generate it for the wrong reasons.
The core innovation here is treating post-deployment monitoring as a two-stage filter. First, a Poisson test flags error signatures that spike beyond baseline noise. Second, a triage agent validates whether the code change actually caused the error before invoking the expensive fix attempt. That second gate matters more than it might seem.
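The first gate can be sketched in a few lines. This is a minimal illustration, not the system's actual implementation: it scales a baseline error count down to the window length to get a Poisson rate, then computes the probability of seeing at least the observed count by chance. All the numbers in the usage example are hypothetical.

```python
import math

def poisson_spike_pvalue(baseline_count: int, baseline_minutes: float,
                         window_count: int, window_minutes: float = 60.0) -> float:
    """P(X >= window_count) under a Poisson rate scaled from the baseline."""
    lam = baseline_count * (window_minutes / baseline_minutes)
    # Survival function: 1 - CDF(window_count - 1), computed from the PMF.
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i)
              for i in range(window_count))
    return max(0.0, 1.0 - cdf)

# Hypothetical numbers: 70 errors over a 7-day baseline, 12 in the last hour.
p_spike = poisson_spike_pvalue(70, 7 * 24 * 60, window_count=12)
# A single error in the same hour is well within baseline noise.
p_quiet = poisson_spike_pvalue(70, 7 * 24 * 60, window_count=1)
```

A signature only proceeds to triage when the p-value falls below some alpha; the alpha itself is a tuning knob that trades alert volume against missed regressions.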
Without it, you're feeding every error spike directly into a coding agent that's architecturally incentivized to find something to fix. Third-party API outage? The agent will still try to "fix" your retry logic. Correlated failures from a traffic spike? It'll propose rate limiting changes. The triage layer forces causal attribution before action, which is the difference between self-healing and self-sabotage.
The Poisson approach is clever but fragile. It assumes error independence, which breaks during cascading failures or shared infrastructure issues. The 7-day baseline vs 60-minute window comparison also introduces temporal bias—weekend traffic patterns look nothing like weekday deploys, and a 60-minute window might not capture errors with longer trigger paths. For systems with high traffic variance, you'd want stratified sampling by time-of-day and day-of-week, or switch to a rolling percentile comparison instead of raw counts.
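One way to get the stratification described above, sketched here as an assumption about how you might structure it rather than anything from the original system: bucket historical window counts by (weekday, hour), and flag a window only when it exceeds a high percentile of its own bucket's history.

```python
from collections import defaultdict
from datetime import datetime

class StratifiedBaseline:
    """Compare a Monday-09:00 window against prior Monday-09:00 windows,
    not against a global 7-day average."""

    def __init__(self):
        self.buckets = defaultdict(list)  # (weekday, hour) -> past counts

    def record(self, ts: datetime, count: int):
        self.buckets[(ts.weekday(), ts.hour)].append(count)

    def percentile_threshold(self, ts: datetime, q: float = 0.95) -> float:
        history = sorted(self.buckets[(ts.weekday(), ts.hour)])
        if not history:
            return float("inf")  # no history yet: never flag
        idx = min(len(history) - 1, int(q * len(history)))
        return history[idx]

    def is_spike(self, ts: datetime, count: int, q: float = 0.95) -> bool:
        return count > self.percentile_threshold(ts, q)
```

The cost of stratification is that each bucket accumulates history 168x slower than a pooled baseline, so early deploys fall back to "never flag" until the relevant bucket fills in.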
The error normalization step—regex replacing UUIDs, timestamps, and numeric strings before truncating to 200 characters—is doing a lot of heavy lifting. This is where most implementations break. Too aggressive and you collapse distinct errors into false duplicates. Too conservative and you fragment the same error across dozens of signatures, diluting your statistical power. The 200-character truncation is particularly suspect for stack traces where the meaningful context often lives deeper in the trace.
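A minimal sketch of that normalization step, with regexes and substitution order chosen by me rather than taken from the original: volatile tokens (UUIDs, timestamps, then bare numbers, in that order, since the generic number pattern would otherwise mangle the first two) are replaced before truncating and hashing into a signature.

```python
import hashlib
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)
TIMESTAMP_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
NUMBER_RE = re.compile(r"\b\d+\b")

def error_signature(message: str, max_len: int = 200) -> str:
    """Collapse volatile tokens, truncate, and hash into a stable key."""
    normalized = UUID_RE.sub("<uuid>", message)       # before NUMBER_RE:
    normalized = TIMESTAMP_RE.sub("<ts>", normalized)  # both contain digits
    normalized = NUMBER_RE.sub("<n>", normalized)
    return hashlib.sha256(normalized[:max_len].encode()).hexdigest()[:16]
```

The fragility the paragraph describes lives entirely in these three patterns: every volatile token you forget (request IDs, hostnames, memory addresses) fragments one error into many signatures and weakens the Poisson test.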
The proposed embedding-based clustering is the right direction but introduces new problems. Distance thresholds for "same error" vs "new error" are dataset-dependent and drift over time as your codebase evolves. You'd need continuous recalibration, probably through manual labeling of a sample set after each deploy. The alternative—using a small model to classify errors—shifts the problem from threshold tuning to prompt engineering and model drift, which isn't obviously better.
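The threshold problem can be made concrete with a sketch. The `toy_embed` function below is a stand-in for a real embedding model (hashed character trigrams, an assumption purely for self-containment); the part that matters is `assign_cluster`, where a single hard-coded cosine-distance threshold decides "same error" vs "new error" and is exactly the number that drifts as the codebase evolves.

```python
import math

def toy_embed(text: str, dim: int = 256) -> list:
    # Stand-in for a real embedding model: hashed character trigrams,
    # L2-normalized so dot product equals cosine similarity.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def assign_cluster(embedding: list, centroids: list, threshold: float = 0.3):
    """Return the index of the nearest centroid within the cosine-distance
    threshold, or None to signal a genuinely new error signature."""
    best, best_dist = None, threshold
    for i, centroid in enumerate(centroids):
        dist = 1.0 - sum(a * b for a, b in zip(embedding, centroid))
        if dist < best_dist:
            best, best_dist = i, dist
    return best
```

Every `None` return creates a new cluster, so an overly tight threshold reproduces the fragmentation problem from the regex approach, just with a different dial.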
What's missing from this system is any notion of blast radius or user impact. An error that spikes 10x but affects 0.01% of requests gets the same treatment as one that breaks 50% of traffic. Production systems need severity-weighted gating, not just statistical significance. You want p-value AND impact score before kicking off a fix attempt.
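The compound gate is simple to state in code. The specific thresholds below are illustrative assumptions, not values from the original system:

```python
def should_investigate(p_value: float, affected_fraction: float,
                       alpha: float = 0.001, min_impact: float = 0.005) -> bool:
    """Gate on statistical significance AND user impact: a spike must be
    both unlikely under the baseline and affect enough traffic to matter."""
    return p_value < alpha and affected_fraction >= min_impact
```

The `affected_fraction` input is the hard part in practice: it requires joining error counts against total request volume for the same endpoint and window, which is a second data pipeline the statistical test alone doesn't need.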
The fix-forward vs rollback decision is currently manual, which is the right call. Automated rollbacks based on error rates are notoriously prone to false positives—you roll back a deploy that fixed a critical bug because it temporarily increased logging verbosity, or because a downstream service happened to degrade in the same 60-minute window. The confidence threshold for automated rollback needs to be much higher than for automated investigation.
The cost model isn't spelled out here, but it matters. Every triage agent call and Open SWE invocation burns tokens and engineer review time. If your false positive rate on the Poisson test is high, you're paying for a lot of triage investigations that go nowhere. The economics only work if the triage agent has a high precision rate—probably north of 80%—at filtering out non-causal errors. Without that, you're just automating noise.
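A back-of-envelope version of that economics argument, with every parameter an assumption you'd replace with your own numbers: each spike pays for a triage call, everything the triage passes pays for a fix attempt plus engineer review, and only true positives pay anything back.

```python
def net_weekly_value(spikes_per_week: int, pass_rate: float, precision: float,
                     triage_cost: float, fix_and_review_cost: float,
                     value_per_real_fix: float) -> float:
    """Expected weekly value of the pipeline under assumed unit costs.
    precision = fraction of triage-passed spikes that are truly causal."""
    passed = spikes_per_week * pass_rate
    true_fixes = passed * precision
    total_cost = (spikes_per_week * triage_cost
                  + passed * fix_and_review_cost)
    return true_fixes * value_per_real_fix - total_cost

# Same hypothetical workload, two precision levels: the sign flips.
good = net_weekly_value(50, 0.3, 0.8, triage_cost=2,
                        fix_and_review_cost=60, value_per_real_fix=300)
bad = net_weekly_value(50, 0.3, 0.2, triage_cost=2,
                       fix_and_review_cost=60, value_per_real_fix=300)
```

Plugging in your own cost estimates gives a break-even precision, which is a more defensible target than a round number like 80%.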
For teams considering this pattern, start with the triage layer before adding the coding agent. Run it in shadow mode for a month, logging what it would have flagged and whether the causal attribution was correct. Measure precision and recall against your actual incident log. Only add the auto-fix step once you trust the detection pipeline; otherwise you're just shipping faster bugs.
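The shadow-mode measurement reduces to a set comparison. One way to sketch it, assuming you can join the two systems on some shared incident identifier (the IDs below are hypothetical):

```python
def shadow_mode_metrics(flagged: set, incidents: set) -> tuple:
    """Precision and recall of shadow-mode triage flags against the
    incident log, joined on a shared incident identifier."""
    true_positives = len(flagged & incidents)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(incidents) if incidents else 0.0
    return precision, recall

# Hypothetical month of shadow mode: 3 flags, 3 logged incidents, 2 overlap.
p, r = shadow_mode_metrics({"INC-101", "INC-102", "INC-107"},
                           {"INC-102", "INC-107", "INC-110"})
```

Low recall here is as informative as low precision: it means real incidents never produced a spike the Poisson gate could see, which is a normalization or windowing problem, not a triage problem.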