SRE Weekly Issue #509
The incident review debate has taken an interesting turn with the arrival of LLMs capable of drafting postmortems. The temptation is obvious: feed your logs and Slack threads into GPT-4, get a polished document back, and move on. But this misses the entire point of why we write incident reviews in the first place.
Incident reviews are fundamentally socio-technical artifacts. The value isn't in the document itself but in the cognitive work that happens when engineers reconstruct what occurred, debate contributing factors, and collectively decide what matters enough to fix. When an LLM generates your postmortem, you've outsourced the learning. The team that responds to the next incident won't have internalized the lessons because they never wrestled with the questions. You end up with a prettier document that nobody remembers writing and that fewer people bother to read.
This connects directly to the concept of reliability debt, which deserves more attention than technical debt in our planning conversations. Technical debt is code you'll refactor later. Reliability debt is the accumulated risk from all those "we'll fix it later" decisions: the monitoring gap you noticed but didn't instrument, the runbook that's 80% complete, the failover procedure you've never actually tested in production. Unlike technical debt, reliability debt has a nasty habit of collecting interest during incidents when you're least equipped to pay it down.
The aviation industry offers a useful contrast here. Commercial pilots train extensively on failure scenarios not because failures are common but precisely because they're rare. When US Airways Flight 1549 lost both engines over the Hudson River, Captain Sullenberger had seconds to make decisions, but those decisions drew on thousands of hours of simulator training for scenarios he'd never actually experienced. SRE teams often do the opposite: we respond to novel incidents with ad-hoc procedures and then write postmortems promising to improve our runbooks. The better approach inverts this: run regular chaos engineering experiments and game days that stress-test both your systems and your team's decision-making under pressure.
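A game day doesn't need heavyweight tooling to start. As a minimal sketch (the `chaos_wrap` helper and its parameters are hypothetical, not from any particular chaos framework), you can wrap a dependency call so it randomly fails or slows down, then observe whether alerts fire and runbooks hold up:

```python
import random
import time


def chaos_wrap(fn, error_rate=0.2, max_delay_s=0.05, seed=None):
    """Wrap a callable so it randomly fails or adds latency,
    simulating the dependency failures a game day rehearses."""
    rng = random.Random(seed)

    def wrapped(*args, **kwargs):
        if rng.random() < error_rate:
            # Injected fault: the kind of error path teams rarely exercise.
            raise TimeoutError("injected fault: dependency timed out")
        time.sleep(rng.uniform(0, max_delay_s))
        return fn(*args, **kwargs)

    return wrapped


# During the drill, point the service at the wrapped dependency and
# watch how on-call responds when half of the lookups fail.
flaky_lookup = chaos_wrap(lambda user_id: {"id": user_id},
                          error_rate=0.5, seed=42)
```

The point isn't the wrapper itself but what it exposes: whether your retries, timeouts, and paging thresholds behave the way your runbook claims they do.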
The rise of AI-generated code introduces a new category of reliability debt that's harder to spot. When LLMs write SQL queries or generate API calls, they often produce syntactically correct code that passes tests but exhibits subtle performance characteristics or error handling gaps that only surface under production load. This mirrors what happened when ORMs became popular: developers stopped thinking about query plans and index usage because the framework abstracted it away. The difference is that ORMs were deterministic and debuggable. LLM-generated code has the added complexity of being non-deterministic in its creation, making it harder to establish patterns for what to review carefully.
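To make the query-plan point concrete, here is a small SQLite sketch (the table and index names are made up for illustration). Both queries below are syntactically correct and return the right rows; only the plan reveals that the leading-wildcard `LIKE`, a pattern LLMs plausibly emit, defeats the index and forces a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_orders_email ON orders (email)")
conn.executemany("INSERT INTO orders (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

# Correct results, but the leading wildcard cannot use the index:
# the plan shows a scan of the whole table.
scan_plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM orders WHERE email LIKE '%@example.com'"
).fetchall()

# An equality predicate on the same column lets SQLite seek
# through idx_orders_email instead.
seek_plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM orders WHERE email = 'user1@example.com'"
).fetchall()
```

Tests that only check result sets will pass either way; it takes a reviewer (or a plan check) to catch the difference before it surfaces as latency under production load.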
Recent analysis of 470 codebases shows that AI-generated code produces bugs at rates comparable to those of human-written code, but the distribution differs. LLMs are particularly prone to generating code with inadequate error handling and edge case coverage, exactly the kind of defects that manifest as production incidents rather than test failures.
Finally, status pages remain underutilized as reliability tools. The best status pages don't just report what's broken; they set explicit expectations about recovery time and scope. If you don't tell customers what level of reliability to expect, they'll assume five nines and hold you to it regardless of what you actually promised in the SLA.