Running the Runners: How CNCF Powers GitHub Actions at Scale - Koray Oksay, Kubermatic
The Cloud Native Computing Foundation runs CI/CD for hundreds of projects, and it has built something worth examining: a multi-tenant GitHub Actions runner platform that handles the messy reality of shared infrastructure at scale. The architecture choices here reveal practical lessons about cost control, isolation boundaries, and where Kubernetes actually helps versus where it just adds complexity.
The core problem is straightforward. GitHub-hosted runners are convenient but expensive at scale, especially for compute-heavy workloads like container builds and integration tests. Self-hosted runners are cheaper but create operational overhead. CNCF needed something in between: shared infrastructure that could serve multiple projects without letting one noisy neighbor starve others, while keeping costs predictable.
Their solution runs on Oracle Cloud Infrastructure using donated credits, which immediately shapes the architecture. When you're optimizing for donated resources rather than pure technical elegance, different tradeoffs make sense. The platform scales runners on demand rather than maintaining a standing pool, which matters more when you're cost-constrained than when you're latency-sensitive. Cold start overhead becomes acceptable when the alternative is paying for idle capacity across dozens of projects with unpredictable CI patterns.
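The standing-pool-versus-on-demand tradeoff can be made concrete with a back-of-the-envelope cost model. All prices, pool sizes, and utilization figures below are illustrative assumptions, not CNCF's actual numbers:

```python
# Hypothetical cost model: a warm standing pool pays for idle hours,
# while scale-on-demand pays only for job hours plus cold-start overhead.
HOURLY_RATE = 0.10          # assumed cost per runner-hour
POOL_SIZE = 20              # standing runners kept warm around the clock
JOB_HOURS_PER_DAY = 40      # actual CI compute consumed per day
COLD_START_OVERHEAD = 0.05  # assume 5% extra runtime paid for provisioning

standing_pool_cost = POOL_SIZE * 24 * HOURLY_RATE
on_demand_cost = JOB_HOURS_PER_DAY * (1 + COLD_START_OVERHEAD) * HOURLY_RATE

print(f"standing pool:   ${standing_pool_cost:.2f}/day")   # $48.00/day
print(f"scale-on-demand: ${on_demand_cost:.2f}/day")       # $4.20/day
```

With bursty, low-duty-cycle CI traffic, the idle-capacity cost dominates, which is exactly why a cold-start penalty is an easy price to pay here.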
The isolation model is particularly interesting. Each runner spins up in its own ephemeral environment, executes the workflow, and gets torn down. This isn't just security theater—it prevents state leakage between jobs and makes capacity planning simpler since you're not trying to bin-pack heterogeneous workloads onto long-lived hosts. The downside is that every job pays the startup cost, but for typical CI workloads that run for minutes, a few extra seconds of provisioning time is negligible compared to the operational simplicity gained.
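The provision-run-teardown lifecycle described above can be sketched in a few lines. The `Runner` class and its methods are hypothetical stand-ins for whatever provisioning API the platform actually uses; the point is the shape of the control flow, with teardown guaranteed even when a job fails:

```python
# Minimal sketch of an ephemeral-runner lifecycle: one fresh environment
# per job, destroyed afterward so no state survives between jobs.
class Runner:
    def __init__(self, job_id: str):
        self.job_id = job_id
        self.alive = False

    def provision(self) -> None:
        self.alive = True        # stand-in for booting a fresh VM or pod

    def run_job(self) -> bool:
        return True              # placeholder for executing the workflow

    def teardown(self) -> None:
        self.alive = False       # destroy the environment and all its state


def execute(job_id: str) -> bool:
    runner = Runner(job_id)
    runner.provision()
    try:
        return runner.run_job()
    finally:
        runner.teardown()        # runs even if run_job raised

print(execute("build-1234"))     # every job gets its own short-lived runner
```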
Integration with GitHub Actions happens through the standard self-hosted runner registration flow, but at scale this creates coordination challenges. You need to handle runner registration tokens that expire, manage the lifecycle of runner processes that can crash mid-job, and deal with GitHub's API rate limits when you're spinning up dozens of runners simultaneously. The platform likely uses runner groups to segment projects and control access, though the specifics of how they map CNCF project boundaries to GitHub's permission model would be illuminating.
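The token-expiry and rate-limit coordination might look something like the following sketch. The `fetch_token` and `register` callables are hypothetical placeholders for the real API calls; the pattern is to fetch a fresh token on every attempt (since registration tokens expire) and back off exponentially when the API pushes back:

```python
import time


class RateLimited(Exception):
    """Raised when the API asks the client to slow down (e.g. an HTTP 429)."""


def register_with_retry(fetch_token, register, max_attempts=5, initial_delay=1.0):
    """Register a runner, refreshing the short-lived token on each attempt."""
    delay = initial_delay
    for _ in range(max_attempts):
        token = fetch_token()        # always fetch fresh: tokens expire
        try:
            return register(token)
        except RateLimited:
            time.sleep(delay)        # exponential backoff between attempts
            delay *= 2
    raise RuntimeError("runner registration failed after retries")
```

When dozens of runners register simultaneously, jittering `initial_delay` per runner would also help avoid synchronized retry storms, though that detail is omitted here.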
Cost optimization in this context means more than just picking cheap instance types. It requires understanding the actual resource profiles of different project workloads. A project doing Go binary compilation has different CPU and memory patterns than one building container images or running browser-based integration tests. The platform presumably collects metrics on actual resource utilization to right-size instance selection, though there's always tension between offering flexibility and maintaining operational simplicity.
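Right-sizing from observed utilization reduces to a simple selection problem: map each project's measured resource profile to the smallest shape that fits it. The shapes and numbers below are assumptions for illustration, not OCI's or CNCF's actual catalog:

```python
# Illustrative right-sizing: pick the smallest instance shape that covers
# a project's observed p95 CPU and memory usage.
# (name, cpu cores, memory GiB), ordered smallest to largest
SHAPES = [("small", 2, 8), ("medium", 4, 16), ("large", 8, 32)]


def pick_shape(p95_cpu: float, p95_mem_gib: float) -> str:
    for name, cpu, mem in SHAPES:
        if p95_cpu <= cpu and p95_mem_gib <= mem:
            return name
    return SHAPES[-1][0]  # nothing fits cleanly: fall back to the largest

# A CPU-bound Go compile vs. a memory-heavy browser test suite:
print(pick_shape(3.5, 6))    # medium
print(pick_shape(1.5, 20))   # large
```

Using a high percentile rather than the mean matters: CI workloads spike, and sizing to the average is how jobs get OOM-killed mid-build.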
The security posture matters because these runners have access to project secrets and can push releases. Network isolation, secrets management through GitHub's encrypted secrets, and audit logging become critical. Running on Kubernetes provides some of these primitives, but also introduces attack surface through the cluster API and potential privilege escalation paths if pod security policies aren't configured correctly.
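One concrete form the hardening above can take is an admission-style check that rejects runner pod specs with obvious escalation paths. The field names mirror Kubernetes' container `securityContext`, but the checker itself is a simplified illustration, not the platform's actual policy:

```python
# Sketch of a pod-hardening check for CI runner pods: refuse privileged
# containers, require privilege escalation to be explicitly disabled,
# and require a non-root user.
def is_hardened(pod_spec: dict) -> bool:
    sc = pod_spec.get("securityContext", {})
    return (
        not sc.get("privileged", False)
        and not sc.get("allowPrivilegeEscalation", True)  # must be explicitly False
        and sc.get("runAsNonRoot", False)
    )

print(is_hardened({"securityContext": {
    "privileged": False,
    "allowPrivilegeEscalation": False,
    "runAsNonRoot": True,
}}))                                                       # True
print(is_hardened({"securityContext": {"privileged": True}}))  # False
```

Note the defaults: a spec that omits `allowPrivilegeEscalation` fails the check, which is the safe direction to fail in when the pods in question hold release credentials.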
For platform engineers running similar systems, the transferable lesson is about choosing your abstraction level. Kubernetes makes sense here because it handles scheduling, resource limits, and provides a consistent API across cloud providers. But you could accomplish similar isolation with VMs or even containers on bare metal. The decision should hinge on your team's operational expertise and whether you need the portability Kubernetes provides, not just following the cloud native playbook because it's fashionable.