Cluster Autoscaler Evolution - Kuba Tużnik, Google & Jack Francis, Microsoft
Cluster Autoscaler has been scaling production Kubernetes clusters since 2017, and if you've worked with it at scale, you've probably hit its rough edges. The core logic is sound—watch for pending pods, provision nodes, consolidate underutilized capacity—but eight years of feature accretion has left the codebase in a state that's becoming genuinely difficult to maintain. Google and Microsoft engineers are now planning significant architectural changes, and if you're running autoscaling workloads, you should understand what's driving this and what might break.
The fundamental challenge is that Cluster Autoscaler started as a relatively simple controller and evolved into something far more complex without corresponding architectural evolution. The original design made reasonable assumptions for 2017: cluster sizes were smaller, cloud provider APIs were simpler, and the interaction between scheduling constraints and node provisioning was less sophisticated. Today, you're dealing with spot instance interruptions, multiple node pools with different instance types, pod topology spread constraints, and DaemonSets that affect node utilization calculations. All of this logic has been bolted onto a foundation that wasn't designed for it.
One concrete pain point is the cloud provider abstraction layer. Each cloud provider implements its own interface for node group management, but the abstraction has leaked badly over time. Provider-specific logic has crept into the core autoscaler code, making it harder to reason about behavior and nearly impossible to test thoroughly without access to actual cloud APIs. This matters because subtle differences in how providers handle node registration timing or capacity reservations can cause the autoscaler to make suboptimal decisions, like provisioning too many nodes because it doesn't realize some are still coming online.
The simulation engine that predicts whether adding a node will actually schedule pending pods has also become a bottleneck. It needs to run scheduler predicates against hypothetical nodes, but as scheduling logic has gotten more sophisticated—think pod affinity, topology constraints, volume binding—the simulation has struggled to keep up. You end up with situations where the autoscaler provisions a node that the actual scheduler then refuses to use, wasting both time and money.
The planned refactoring aims to create cleaner boundaries between core autoscaling logic and provider-specific implementations, improve the simulation accuracy by better tracking scheduler state, and make the codebase more testable. This isn't a rewrite—the proven algorithms for scale-up and scale-down decisions are staying—but the internal structure is changing significantly.
For practitioners, this means a few things. First, if you're running a forked or heavily customized version of Cluster Autoscaler, you'll have migration work ahead. The provider interface is changing, so custom cloud integrations will need updates. Second, there may be behavioral changes in edge cases, particularly around timing and how quickly the autoscaler reacts to capacity changes. You should plan to test thoroughly in staging environments before upgrading production clusters.
Third, and most importantly, this is a reminder that even mature, widely-deployed infrastructure components accumulate technical debt. Cluster Autoscaler works well enough that many teams treat it as solved infrastructure, but the maintainability issues are real. If you're building similar controllers, invest in clean abstractions early, especially at cloud provider boundaries. The cost of refactoring later is substantial, and in the meantime, you're shipping bugs that stem from architectural problems rather than logic errors.
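One practical form that early investment takes: keep decision logic as pure functions with no cloud SDK calls, clocks, or globals, so it can be exercised with ordinary table-driven tests. The function below is illustrative (the threshold semantics and local-storage rule are simplified stand-ins, not Cluster Autoscaler's actual scale-down policy), but it shows the style that stays cheap to test as features accrete.

```go
package main

import "fmt"

// shouldScaleDown is the kind of decision worth keeping pure: given
// observed utilization and a threshold, decide whether a node is a
// removal candidate. Rules here are simplified for illustration.
func shouldScaleDown(cpuUtil, memUtil, threshold float64, hasLocalStorage bool) bool {
	if hasLocalStorage {
		return false // evicting pods with local storage would lose data
	}
	return cpuUtil < threshold && memUtil < threshold
}

func main() {
	// Table-driven checks: trivial to extend when a new rule lands,
	// with no cloud provider in the loop.
	cases := []struct {
		cpu, mem float64
		local    bool
		want     bool
	}{
		{0.2, 0.3, false, true},  // underutilized -> remove
		{0.2, 0.8, false, false}, // memory still busy -> keep
		{0.1, 0.1, true, false},  // local storage -> keep
	}
	for _, c := range cases {
		got := shouldScaleDown(c.cpu, c.mem, 0.5, c.local)
		fmt.Printf("cpu=%.1f mem=%.1f local=%v -> remove: %v (want %v)\n",
			c.cpu, c.mem, c.local, got, c.want)
	}
}
```

The refactoring cost described above is largely the cost of untangling logic like this from provider calls it was never supposed to depend on.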