Keynote: Building AI in the Open: How the Cloud Native Community Sha...
Moderated by: Mario Fahlandt
The shift from prompt-driven LLMs to agent-based runtimes isn't just a conceptual evolution—it fundamentally changes the infrastructure requirements for running AI in production. When you're serving simple request-response inference, you can get away with stateless pods behind a load balancer, treating GPU workloads like slightly exotic HTTP services. Agent-based systems break that model entirely.
Agents maintain state across multiple inference calls, coordinate between different models, and execute multi-step workflows that might involve retrieval, reasoning, tool use, and validation. This means your infrastructure needs to handle long-lived sessions, manage state persistence across potential pod restarts, and route related requests to the same backend instance. The standard Kubernetes service mesh patterns don't map cleanly here. You're looking at something closer to actor model systems or stateful stream processing than traditional web serving.
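Routing related requests to the same backend instance is usually done with consistent hashing on a session identifier. The sketch below is a minimal illustration of that idea, not any specific service-mesh feature; the ring structure, pod names, and virtual-node count are all assumptions for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing maps agent session IDs to backend instances so that every
// request in a long-lived session lands on the pod holding its state.
type hashRing struct {
	points   []uint32          // sorted hash points on the ring
	backends map[uint32]string // hash point -> backend name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newHashRing places each backend at several virtual points so load
// stays balanced even with a small number of backends.
func newHashRing(backends []string, vnodes int) *hashRing {
	r := &hashRing{backends: make(map[uint32]string)}
	for _, b := range backends {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", b, v))
			r.points = append(r.points, p)
			r.backends[p] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick returns the backend for a session ID; the same ID always maps
// to the same backend until ring membership changes.
func (r *hashRing) Pick(sessionID string) string {
	p := hash32(sessionID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= p })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.backends[r.points[i]]
}

func main() {
	ring := newHashRing([]string{"agent-pod-0", "agent-pod-1", "agent-pod-2"}, 64)
	// Repeated requests for one session stick to one pod.
	fmt.Println(ring.Pick("session-42") == ring.Pick("session-42"))
}
```

The key property for agent workloads is that adding or removing a pod only remaps the sessions nearest that pod on the ring, rather than reshuffling everything the way plain modulo hashing would.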
The distributed inference challenge gets more interesting when you consider model sharding and pipeline parallelism. A single agent workflow might hit a 70B parameter model that's sharded across eight A100s, then call out to a smaller specialized model for tool use, then route back for final synthesis. Your scheduler needs topology awareness—not just "find me GPU capacity" but "find me GPUs with the right interconnect bandwidth and locality to the state store." This is where the cloud-native ecosystem has real work to do.
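What "topology awareness" means in practice is a placement decision that filters on shard capacity and then scores on interconnect bandwidth and locality to the state store, rather than only on free GPU count. The following is a toy scoring function under assumed field names and weights; it is not the Kubernetes scheduler API.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuNode describes one candidate placement for a sharded model.
// Fields and weights here are illustrative assumptions.
type gpuNode struct {
	name         string
	freeGPUs     int
	nvlinkGBps   int  // intra-node GPU interconnect bandwidth
	sameZoneAsKV bool // co-located with the session state store?
}

// score implements "not just capacity": nodes that cannot hold all
// shards are filtered out; surviving nodes are ranked by interconnect
// bandwidth plus a locality bonus toward the state store.
func score(n gpuNode, shardsNeeded int) int {
	if n.freeGPUs < shardsNeeded {
		return -1 // hard filter: the sharded model cannot fit
	}
	s := n.nvlinkGBps
	if n.sameZoneAsKV {
		s += 500 // weight locality to the state store heavily
	}
	return s
}

func pickNode(nodes []gpuNode, shardsNeeded int) (string, bool) {
	sort.Slice(nodes, func(i, j int) bool {
		return score(nodes[i], shardsNeeded) > score(nodes[j], shardsNeeded)
	})
	if len(nodes) == 0 || score(nodes[0], shardsNeeded) < 0 {
		return "", false
	}
	return nodes[0].name, true
}

func main() {
	nodes := []gpuNode{
		{"node-a", 8, 600, false},
		{"node-b", 8, 900, true},
		{"node-c", 4, 900, true}, // too few free GPUs for an 8-way shard
	}
	name, ok := pickNode(nodes, 8)
	fmt.Println(name, ok) // node-b true
}
```

A real implementation would live in a scheduler plugin and pull these signals from node labels and device-plugin reports, but the filter-then-score shape is the same.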
Current CNCF projects weren't designed with these access patterns in mind. Kubernetes batch scheduling works for training jobs with clear start and end times. It's less suited for inference workloads that need sub-second scheduling decisions and can't tolerate cold starts. The GPU operator helps with device management but doesn't understand model placement or KV cache sharing between requests. Service meshes add latency that matters when you're chaining multiple inference calls in a workflow.
What's needed is workload-aware scheduling that understands inference-specific constraints. Think about KV cache affinity—if you've already processed a 50k token context for an agent session, you want subsequent requests hitting the same GPU where that cache lives. Evicting and reloading that state costs real money in compute time. Similarly, batching strategies for agent workloads differ from simple inference serving. You can't just batch arbitrary requests together when each one is part of a stateful conversation with different context windows.
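KV cache affinity can be sketched as a routing table from session to worker: a cache hit reuses the warm GPU, and only a miss falls back to the least-loaded worker, which then becomes the cache owner. This is a minimal sketch under assumed names, not the behavior of any particular inference server.

```go
package main

import "fmt"

// kvRouter sends agent requests to the worker that already holds the
// session's KV cache; re-prefilling a 50k-token context on another
// GPU would burn real compute time.
type kvRouter struct {
	cacheLoc map[string]string // sessionID -> worker holding its cache
	load     map[string]int    // worker -> in-flight requests
}

func newKVRouter(workers []string) *kvRouter {
	r := &kvRouter{cacheLoc: map[string]string{}, load: map[string]int{}}
	for _, w := range workers {
		r.load[w] = 0
	}
	return r
}

// Route prefers cache affinity; a miss picks the least-loaded worker
// and records it as the session's new cache location.
func (r *kvRouter) Route(sessionID string) string {
	if w, ok := r.cacheLoc[sessionID]; ok {
		r.load[w]++
		return w // cache hit: reuse the warm KV cache
	}
	var best string
	for w, l := range r.load {
		if best == "" || l < r.load[best] {
			best = w
			_ = l
		}
	}
	r.cacheLoc[sessionID] = best
	r.load[best]++
	return best
}

func main() {
	router := newKVRouter([]string{"gpu-0", "gpu-1"})
	first := router.Route("sess-1")
	fmt.Println(router.Route("sess-1") == first) // sticky after first route
}
```

The same table is also where a batching layer would have to look: requests sharing a worker and overlapping context are batching candidates, while arbitrary cross-session batching would mix incompatible KV states.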
The open source angle matters here because vendor-provided AI platforms inevitably optimize for their own model formats and runtime characteristics. OpenAI's infrastructure works great for OpenAI models. What we need are abstractions that work across Llama, Mistral, Command R, and whatever comes next month. The CNCF ecosystem should provide the plumbing—scheduling primitives, state management, observability hooks—without dictating the model architecture.
For practitioners, the immediate question is whether to wait for these patterns to mature or start building custom solutions now. If you're running agent-based systems in production today, you're probably already working around standard Kubernetes patterns with custom operators and admission controllers. Contributing those patterns back upstream would accelerate the ecosystem evolution everyone needs. The alternative is fragmentation where every organization rebuilds the same infrastructure differently.