Inside Adobe's OpenTelemetry pipeline: simplicity at scale

OpenTelemetry Blog

Adobe runs one of the larger OpenTelemetry Collector deployments you'll find, with thousands of collector instances per signal type. What makes their setup worth examining isn't the scale alone, but how they've deliberately optimized for operational simplicity in an organizationally complex environment.

The architecture reflects Adobe's post-acquisition reality. The central observability team provides telemetry infrastructure for most of the company, but several large product groups maintain their own observability stacks. This fragmentation is common in companies that have grown through acquisition, where forcing consolidation often costs more than the inefficiency of running parallel systems. The central team's collector pipeline has to coexist with these independent implementations while still providing a reliable default path.

Adobe deploys collectors per signal type rather than running unified collectors that handle traces, metrics, and logs together. This is a deliberate tradeoff. Unified collectors reduce operational overhead when you have fewer instances, but at Adobe's scale, signal-specific deployments provide clearer failure domains and simpler troubleshooting. When your metrics pipeline has an issue, it doesn't cascade into trace collection. The blast radius stays contained.
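In Collector terms, a signal-specific split means each fleet runs a configuration whose `service.pipelines` section wires up exactly one signal type. The sketch below is illustrative only, not Adobe's actual configuration:

```yaml
# Traces fleet: a config that only defines a traces pipeline.
# A separate metrics fleet would run an otherwise identical config
# with a `metrics:` pipeline instead (and likewise for logs).
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Because each fleet's config declares only one pipeline, a bad rollout or misbehaving processor in the metrics fleet has no path into the trace-collecting processes at all.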

The emphasis on simplicity shows up in how they approach the collector configuration itself. At thousands of instances, complex processor chains become maintenance nightmares. Every additional processor is another thing that can break, another component that needs version compatibility testing, another knob that operators need to understand during incidents. Adobe's pipeline likely keeps the processing logic minimal in the collectors themselves, pushing transformation work either upstream to the SDKs or downstream to the backends.
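A deliberately minimal pipeline of this kind might contain little more than a memory limiter and a batcher, both standard Collector processors. This is a hypothetical sketch of what "keeping the collector boring" can look like; the endpoint is a placeholder, not a real Adobe address:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Protects the collector from OOM under load spikes.
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  # Batches telemetry before export to reduce outbound requests.
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp:
    endpoint: backend.example.internal:4317  # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

With only two stateless processors in the chain, there is very little per-instance configuration to drift, and incident debugging reduces to checking receive, batch, and export.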

This approach contradicts the common pattern of using collectors as heavy transformation engines. Many organizations load up their collectors with tail sampling, metric aggregation, and complex routing logic. That works fine at smaller scales, but creates operational complexity that doesn't scale linearly. When you're managing thousands of instances, you want each one to be as stateless and simple as possible. Complex stateful processing belongs in dedicated services, not distributed across your entire collector fleet.
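To make the contrast concrete: tail sampling is the canonical example of stateful processing, because the processor must buffer every span of a trace until a sampling decision can be made. If you need it, it belongs in a small dedicated gateway tier rather than spread across thousands of edge collectors. A hedged sketch of such a gateway's processor section, using the contrib `tail_sampling` processor with illustrative policy values:

```yaml
# Dedicated sampling-gateway tier only (not the general collector fleet).
processors:
  tail_sampling:
    decision_wait: 10s        # how long to buffer spans before deciding
    num_traces: 50000         # max traces held in memory at once
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

Concentrating this state in one tier means the memory- and routing-sensitive parts of the pipeline can be scaled and debugged independently of the simple, stateless edge collectors.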

The organizational split between central and product-specific teams also influences pipeline design. The central team can't mandate that every product group use their infrastructure, so the pipeline needs to be compelling enough that teams choose it over building their own. This means reliability and simplicity matter more than feature richness. A collector pipeline that's down or requires constant configuration tweaking will drive teams to build alternatives.

For teams considering similar deployments, Adobe's setup suggests a few principles. First, signal-specific collectors make sense at high scale even though they increase the number of components. Second, keeping collector logic simple and pushing complexity elsewhere pays dividends in reduced operational overhead. Third, in organizations with multiple observability teams, the central platform needs to compete on reliability and ease of use, not just mandate adoption.

The real lesson is that architectural decisions at this scale are driven more by operational constraints than technical capabilities. The OpenTelemetry Collector can do sophisticated processing, but that doesn't mean you should use those features everywhere. Sometimes the best architecture is the boring one that's easy to operate.