Inside Adobe's OpenTelemetry pipeline: simplicity at scale
Adobe's OpenTelemetry deployment reveals a critical architectural decision that most ML platform teams will eventually face: whether to optimize for consolidation or operational simplicity when scaling observability infrastructure. Their choice—running thousands of collectors per signal type rather than pursuing aggressive consolidation—offers concrete lessons for teams instrumenting production LLM systems where telemetry volume and organizational complexity intersect.
The architecture is deliberately decentralized. Adobe runs separate collector fleets for metrics, traces, and logs rather than multiplexing signals through unified collectors. This contradicts the common pattern of deploying a single collector per host or cluster that handles all telemetry types. The tradeoff is straightforward: more infrastructure to manage in exchange for blast-radius isolation and simpler troubleshooting. When your trace collector fails, your metrics pipeline continues unaffected. For LLM systems generating high-cardinality span data from complex agent workflows, this isolation matters: a single misbehaving RAG pipeline flooding trace volume won't destabilize your token usage metrics or model latency monitoring.
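A minimal sketch of what signal-separated collector configs might look like. This is illustrative, not Adobe's actual setup; the endpoints and backend addresses are hypothetical:

```yaml
# traces-collector.yaml -- handles only the traces pipeline;
# a failure here leaves the separate metrics fleet untouched.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 5s
exporters:
  otlp:
    endpoint: traces-backend.internal:4317  # hypothetical backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
---
# metrics-collector.yaml -- an independent deployment with its own
# lifecycle, resource limits, and failure domain.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  prometheusremotewrite:
    endpoint: https://metrics-backend.internal/api/v1/write  # hypothetical
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Because each file describes one signal, a bad config push or a crash loop in the traces fleet cannot take metrics down with it.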
The "thousands of collectors" scale point deserves scrutiny. Most organizations start with centralized gateway collectors that aggregate telemetry before forwarding to backends. Adobe's approach suggests they're running collectors closer to workloads, likely as sidecars or Kubernetes DaemonSets. This topology reduces network hops and provides better failure isolation, but it means managing collector configuration, versioning, and resource allocation at significant scale. For ML teams, this pattern works well when you need fine-grained control over sampling decisions per model or service: your GPT-4 endpoints might warrant different trace sampling rates than your embedding services.
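Under a sidecar or DaemonSet topology, per-service sampling differences reduce to per-deployment configuration rather than conditional logic in a shared gateway. A sketch, with hypothetical service names and rates:

```yaml
# sidecar-chat.yaml -- hypothetical GPT-4 chat endpoint: each request
# is expensive and worth inspecting, so keep 25% of traces.
processors:
  probabilistic_sampler:
    sampling_percentage: 25
---
# sidecar-embeddings.yaml -- high-volume, low-cost embedding calls:
# a 1% sample is usually enough to characterize latency and errors.
processors:
  probabilistic_sampler:
    sampling_percentage: 1
```

The cost is exactly what the paragraph above describes: you now own config rollout and version skew across thousands of these small deployments.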
The acquisition-driven heterogeneity is the real architectural constraint here. Adobe isn't operating a greenfield environment where they can mandate uniform instrumentation. Some product groups run independent observability stacks. This mirrors the reality of many ML platform teams: legacy model serving infrastructure running Prometheus, newer LLM services on vendor observability platforms, and experimental agent frameworks with custom telemetry. Adobe's central team provides infrastructure but doesn't force consolidation, which is pragmatic when the cost of migration exceeds the benefit.
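Federation without forced migration can be as simple as a collector that scrapes an existing Prometheus-instrumented fleet and forwards OTLP to the central pipeline, leaving the legacy stack untouched. A sketch, with hypothetical targets and endpoints:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: legacy-model-serving  # existing Prometheus-instrumented fleet
          scrape_interval: 30s
          static_configs:
            - targets: ['model-server.internal:9090']  # hypothetical target
exporters:
  otlp:
    endpoint: central-gateway.internal:4317  # hypothetical central pipeline
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

The legacy team keeps its instrumentation and dashboards; the central team gains visibility at the cost of running one more small collector.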
For LLM systems specifically, this architecture pattern accommodates the telemetry diversity you'll encounter. Inference latency metrics, token streaming throughput, prompt caching hit rates, retrieval quality scores, and agent step traces all have different cardinality, volume, and retention requirements. Running specialized collector pipelines per signal type lets you apply appropriate sampling, aggregation, and routing without complex conditional logic in a monolithic collector configuration.
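As a sketch of what per-signal specialization buys you: a traces-only collector can drop noisy health-check spans and tail-sample for errors and slow agent steps, with no risk of those rules interfering with metrics or logs. The attribute names and thresholds below are illustrative assumptions:

```yaml
processors:
  # Drop spans from health/readiness probes before they reach the backend.
  filter/probes:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
  # Keep every trace containing an error or an unusually slow agent step.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-agent-steps
        type: latency
        latency:
          threshold_ms: 5000
```

In a monolithic multi-signal collector, the same result requires routing and conditional logic that grows harder to reason about with every signal added.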
The operational simplicity emphasis is telling. Adobe could theoretically reduce collector count through consolidation, but they've chosen not to. This suggests their pain points are in telemetry quality and pipeline reliability, not infrastructure cost. For teams evaluating OpenTelemetry adoption, this is a useful data point: the collector's resource footprint is typically negligible compared to the cost of blind spots in production LLM behavior. A few hundred additional collector instances are cheap insurance against losing visibility into hallucination rates or context window utilization during an incident.
The key takeaway isn't to copy Adobe's exact topology, but to recognize that observability architecture should match organizational reality rather than theoretical ideals. If your ML platform serves multiple teams with different maturity levels and tooling preferences, designing for federation and isolation may be more pragmatic than pursuing a unified observability monoculture.