SRE Weekly Issue #510

SRE Weekly Blog

The most interesting insight from this week's collection isn't about any single tool or technique. It's the fundamental mismatch between traditional SRE metrics and the systems we're actually running in 2025.

Take ML workloads. The standard four golden signals and uptime-based error budgets assume binary failure modes: your service is up or it's down, requests succeed or they fail. But ML systems don't break that way. A recommendation model doesn't return 500s when it degrades. It just starts serving slightly worse recommendations. Your API latency looks fine, your error rate is zero, and meanwhile your model is quietly costing the business millions because training data drift pushed accuracy from 94% to 87% over three months.

This demands different instrumentation. You need error budgets on model performance metrics like precision, recall, and prediction latency distributions. You need alerting on data freshness because a model trained on stale data is functionally broken even if it returns 200 OK. The article on ML error budgets gets this right: track accuracy degradation as a budget burn just like you'd track error rate spikes. Set thresholds for acceptable drift in feature distributions. Treat fairness metrics as reliability signals, not nice-to-haves.
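As a concrete illustration of the budget-burn idea, here's a minimal sketch in Python. The thresholds, sample values, and function names are all illustrative assumptions, not anything from the article: the point is just that accuracy shortfall can be accumulated and spent exactly like an availability error budget.

```python
# Minimal sketch: treat accuracy degradation as error-budget burn.
# SLO_ACCURACY, BUDGET, and the sample data are illustrative assumptions.

SLO_ACCURACY = 0.92   # model accuracy must stay at or above this target
BUDGET = 0.02         # total tolerated shortfall, summed over the window

def budget_burn(accuracy_samples, slo=SLO_ACCURACY):
    """Sum how far each evaluation fell below the SLO."""
    return sum(max(0.0, slo - a) for a in accuracy_samples)

def budget_remaining(accuracy_samples, budget=BUDGET):
    return budget - budget_burn(accuracy_samples)

# Weekly offline evaluations drifting from 94% to 87%, as in the
# scenario above: no single bad week, but the budget still burns down.
weekly_accuracy = [0.94, 0.93, 0.92, 0.91, 0.90, 0.89, 0.88, 0.87]
if budget_remaining(weekly_accuracy) < 0:
    print("error budget exhausted: freeze model changes, investigate drift")
```

The same shape works for data-freshness or feature-drift signals: define a threshold, integrate the violation over a window, and alert when the accumulated burn crosses the budget rather than on any single sample.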

The budget allocation piece connects to this. Organizations fund incident response heavily because outages are visible. The site goes down, executives notice, money flows to "make sure this never happens again." But gradual degradation doesn't trigger that response. Neither does preventing incidents that never occur. You can't show a graph of the outage that didn't happen because you invested in chaos engineering six months ago.

This explains why so many platforms are over-instrumented for reactive firefighting and under-invested in the unglamorous work that actually prevents fires. You've got five incident management tools and a 24/7 war room, but your deployment pipeline still has manual steps and your load testing is aspirational. The fix isn't technical; it's organizational: make prevention visible. Track near-misses. Measure time-to-detect for issues caught in staging. Quantify the cost of incidents prevented by better testing or gradual rollouts.
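"Make prevention visible" can be as simple as turning staging catches into the same kind of numbers an outage produces. A hypothetical sketch, where every field name and the per-incident cost figure are invented for illustration:

```python
# Hypothetical sketch: prevention metrics from staging catches.
# All record fields and the incident-cost figure are assumptions.

from statistics import mean

# Issues caught before production, with time-to-detect in minutes
staging_catches = [
    {"id": "cfg-drift-41", "detect_minutes": 12},
    {"id": "bad-migration-7", "detect_minutes": 45},
    {"id": "oom-regression-3", "detect_minutes": 30},
]

AVG_INCIDENT_COST = 25_000  # assumed cost of one production incident, USD

near_miss_count = len(staging_catches)
mean_time_to_detect = mean(c["detect_minutes"] for c in staging_catches)
prevented_cost = near_miss_count * AVG_INCIDENT_COST

print(f"near-misses caught in staging: {near_miss_count}")
print(f"mean time-to-detect (staging): {mean_time_to_detect:.0f} min")
print(f"estimated incident cost avoided: ${prevented_cost:,}")
```

It's a crude model, but a dashboard built from it gives the "outage that didn't happen" a graph executives can actually see.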

The Netflix database migration story shows what good prevention looks like at scale. Hundreds of databases, each with different schemas and traffic patterns. They didn't just script the migration and hope. They built a self-service platform with automated validation, rollback capabilities, and extensive testing. That's prevention as infrastructure. The upfront cost is real, but it's amortized across every migration and every team that doesn't have to reinvent the process.

Cloudflare's graceful restart mechanism for Rust services is another example. Yes, socket passing for zero-downtime restarts has existed since the nginx days, but implementing it correctly in Rust with proper ownership semantics and error handling is non-trivial. The value is in making the safe path the easy path. When restarting a service doesn't risk dropping connections, you restart more often, which means you deploy more often, which means smaller changes and faster iteration.
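The core trick behind this family of graceful restarts is that a listening file descriptor can be handed to a new process, so the kernel's accept queue never goes away between "old" and "new". A sketch in Python for brevity (Cloudflare's implementation is in Rust, and a real restart would exec the new binary; here we only show a second socket object adopting the same listening fd):

```python
# Sketch of the fd-inheritance trick behind zero-downtime restarts.
# A real handoff would exec() a new binary; this simulates the adoption
# step within one process.

import os
import socket

# "Old" process: bind and listen
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))
listener.listen(128)
listener.set_inheritable(True)   # fd would survive exec() in a real handoff

# "New" process: re-adopt the inherited fd instead of re-binding
inherited_fd = os.dup(listener.fileno())
new_listener = socket.socket(fileno=inherited_fd)

# Same bound address; the kernel accept queue was never torn down,
# so connections arriving mid-restart are not dropped.
assert new_listener.getsockname() == listener.getsockname()
```

Doing the same in Rust means deciding which process owns the fd at every step, which is exactly where the ownership semantics the article mentions earn their keep.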

The Kubernetes event retention problem is telling. Ninety seconds of history by default. That's barely enough to debug an issue while it's happening, let alone do any meaningful postmortem. The real ask isn't longer retention, it's point-in-time state reconstruction. You need to answer "what did the cluster think the world looked like at 3:47 AM?", not just "what events fired around then." That requires either snapshotting cluster state periodically or building event sourcing into your control plane. Neither is cheap, but both beat flying blind.
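The event-sourcing option reduces to a simple idea: keep an append-only change log and replay it up to the moment you care about. A minimal sketch, with event shapes invented for illustration (the real Kubernetes API objects are far richer):

```python
# Minimal event-sourcing sketch: reconstruct point-in-time state by
# replaying a change log. Event shapes are illustrative assumptions,
# not the real Kubernetes API.

def state_at(events, t):
    """Replay (timestamp, op, key, value) events up to time t."""
    state = {}
    for ts, op, key, value in sorted(events, key=lambda e: e[0]):
        if ts > t:
            break
        if op == "upsert":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

log = [
    (100, "upsert", "pod/web-1", {"phase": "Running"}),
    (200, "upsert", "pod/web-2", {"phase": "Pending"}),
    (250, "delete", "pod/web-1", None),
    (300, "upsert", "pod/web-2", {"phase": "Running"}),
]

# "What did the cluster think the world looked like at t=260?"
print(state_at(log, 260))   # web-1 is gone; web-2 is still Pending
```

Periodic snapshots are the usual optimization on top of this: checkpoint full state every so often, then replay only the events since the nearest snapshot instead of the whole log.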