Instrument zero‑code observability for LLMs and agents on Kubernetes

Grafana Labs Blog

The OpenLIT Operator tackles a real pain point for teams running AI workloads on Kubernetes: keeping instrumentation current across a sprawling stack of LLM providers, vector databases, and agent frameworks without touching application code or rebuilding images. For SREs managing production AI services, this matters because your observability layer shouldn't lag behind your model upgrades.

The operator works through Kubernetes admission webhooks. When a pod matches your AutoInstrumentation resource's label selector, the operator injects an init container that configures OpenTelemetry instrumentation at runtime. This happens before your application container starts, so by the time your LangChain agent or vector DB client initializes, the instrumentation is already wired into the Python runtime. The telemetry flows to any OTLP-compatible backend—Grafana Cloud's managed gateway, a self-hosted collector, or your existing observability stack.
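As a rough illustration of what the webhook does to a matched pod, the mutated spec follows the common OpenTelemetry init-container pattern: an init container copies the instrumentation packages into a shared volume, and environment variables point the Python runtime at them. Field names below are assumptions based on that pattern, not the operator's exact output; inspect a mutated pod with `kubectl get pod -o yaml` to see what it actually injects.

```yaml
# Sketch of an injected pod spec (names illustrative, not the operator's literal output)
spec:
  initContainers:
    - name: openlit-instrumentation        # injected by the admission webhook
      image: openlit/autoinstrumentation-python   # image name assumed
      command: ["cp", "-r", "/autoinstrumentation/.", "/otel-auto/"]
      volumeMounts:
        - name: otel-auto
          mountPath: /otel-auto
  containers:
    - name: app                            # your unmodified application container
      env:
        - name: PYTHONPATH                 # makes the injected packages importable first
          value: /otel-auto
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: https://otlp-gateway.example.com   # placeholder endpoint
      volumeMounts:
        - name: otel-auto
          mountPath: /otel-auto
  volumes:
    - name: otel-auto
      emptyDir: {}
```

The key point is that the application image is untouched: the instrumentation arrives via a shared `emptyDir` volume and environment variables, both of which disappear if you delete the AutoInstrumentation resource and restart the pod.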

The key architectural decision here is injection versus baked-in instrumentation. Baking OpenTelemetry libraries into your container images gives you explicit control but creates operational friction. Every time you update a model provider SDK or switch from LangChain to CrewAI, you're rebuilding images and managing dependency conflicts between instrumentation libraries and your AI frameworks. The operator shifts that burden to the platform layer. You define instrumentation policy once in the AutoInstrumentation CRD, and pods pick it up automatically on restart.

This approach has real tradeoffs. On the upside, you get consistent telemetry across heterogeneous workloads without coordinating changes across multiple application teams. Token usage, latency percentiles, and agent step sequences appear in Grafana Cloud's AI dashboards immediately after pod restart. The operator supports major providers—OpenAI, Anthropic, AWS Bedrock—and frameworks like LlamaIndex, Haystack, and DSPy out of the box. Because it's OpenTelemetry-native, you're not locked into OpenLIT's collector; you can route traces to Tempo and metrics to Prometheus through standard OTLP exporters.

The downside is reduced visibility into what's actually happening inside the init container. When instrumentation breaks—say, a new version of the OpenAI SDK changes internal method signatures—debugging requires understanding both the operator's injection logic and the OpenTelemetry auto-instrumentation hooks for Python. You also lose fine-grained control over sampling decisions and custom span attributes that you'd have with manual instrumentation. For high-throughput inference endpoints, the operator's default sampling strategy might not match your cost-performance requirements.

From a practical standpoint, the setup is straightforward. After installing the operator via Helm, you create an AutoInstrumentation resource specifying your OTLP endpoint and authorization header. The critical configuration is the selector: use label selectors that match your AI workloads but exclude infrastructure pods, so you don't pay instrumentation overhead where it adds no value. Resource attributes such as deployment.environment and service.namespace become dimensions in your metrics, so choose values that align with your existing cardinality budgets in Prometheus.
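A minimal resource might look like the sketch below. The apiVersion, field names, and secret reference are illustrative assumptions; consult the operator's CRD reference for the actual schema before copying any of this.

```yaml
# Sketch of an AutoInstrumentation resource (schema assumed, verify against the CRD)
apiVersion: openlit.io/v1alpha1
kind: AutoInstrumentation
metadata:
  name: ai-workloads
  namespace: agents
spec:
  selector:
    matchLabels:
      observability: openlit      # label only your AI pods, not infrastructure
  otlp:
    endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp   # example Grafana Cloud gateway
    headers:
      authorization:
        secretKeyRef:             # keep the token out of the manifest
          name: grafana-cloud-otlp
          key: authorization
  resource:
    attributes:
      deployment.environment: production   # becomes a metric dimension; mind cardinality
      service.namespace: agents
```

Pods matching the selector pick up the configuration on their next restart, so rolling a deployment is enough to enable or disable instrumentation.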

For teams already running OpenTelemetry collectors in-cluster, you can point the operator at your existing collector instead of Grafana Cloud's gateway. This keeps telemetry routing consistent with your non-AI workloads and lets you apply existing sampling and filtering rules. The operator's collector is optional—it's just a convenience if you want OpenLIT-specific processing before data hits your backend.
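Routing to an in-cluster collector is just a matter of changing the endpoint to the collector's service address; no auth header is needed inside the cluster. Again, the field names are assumptions carried over from the sketch above:

```yaml
# Variant: send telemetry to an existing in-cluster collector (schema assumed)
spec:
  otlp:
    endpoint: http://otel-collector.observability.svc.cluster.local:4318  # standard OTLP/HTTP port
```

From there, your collector's existing sampling, filtering, and routing pipelines apply to the AI telemetry exactly as they do for everything else.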

The real question is whether zero-code instrumentation fits your operational model. If you're iterating rapidly on agent architectures or evaluating multiple LLM providers, the operator reduces friction significantly. If you need deep customization of trace semantics or run latency-sensitive inference at scale, manual instrumentation gives you more control. For most platform teams supporting AI developers, the operator hits a sweet spot: comprehensive coverage with minimal maintenance overhead.