How Mastodon Runs OpenTelemetry Collectors in Production
Mastodon's OpenTelemetry deployment offers a surprisingly instructive case study for teams running observability infrastructure in resource-constrained environments. The federated social network operates across thousands of independent instances, many running on modest hardware, which forced architectural decisions that larger teams with dedicated observability budgets rarely confront.
The core challenge Mastodon faced was implementing distributed tracing without imposing significant overhead on instance operators who might be running on a single VPS with 2GB RAM. Their solution centers on a lightweight Collector deployment pattern that prioritizes sampling and batching over comprehensive trace capture. Rather than recording a trace for every request, they configured the Ruby SDK to sample at 1% for most traffic, ramping to 10% only for specific error conditions or slow transactions exceeding 500ms. This adaptive sampling keeps data volume manageable while still capturing the tail-latency issues that actually matter in a federated architecture, where cross-instance requests can cascade unpredictably.
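The two-tier sampling decision described above can be sketched in plain Ruby. This is an illustrative class, not Mastodon's actual code, and a real integration would hook into the OpenTelemetry SDK's sampler interface rather than a standalone method:

```ruby
# Adaptive sampling sketch: 1% baseline, boosted to 10% for errors
# or requests slower than 500ms. Class name and thresholds follow
# the figures in the text; the structure is hypothetical.
class AdaptiveSampler
  BASE_RATE    = 0.01  # 1% of ordinary traffic
  BOOSTED_RATE = 0.10  # 10% for errors / slow requests
  SLOW_MS      = 500

  def initialize(rng: Random.new)
    @rng = rng
  end

  # Pick the applicable rate for a request outcome.
  def rate_for(error:, duration_ms:)
    (error || duration_ms > SLOW_MS) ? BOOSTED_RATE : BASE_RATE
  end

  # Roll the dice against that rate.
  def sample?(error:, duration_ms:)
    @rng.rand < rate_for(error: error, duration_ms: duration_ms)
  end
end
```

Note that deciding based on duration or error status implies the decision is deferred until the outcome is known, rather than made at span start.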
The Collector configuration itself reveals practical tradeoffs most documentation glosses over. Mastodon runs Collectors in agent mode directly on each instance rather than deploying gateway Collectors, which would require additional infrastructure that small operators can't justify. Each agent Collector uses the batch processor with a 10-second timeout and 512-span batch size, tuned specifically to balance memory usage against the network overhead of frequent exports. They disabled the memory_limiter processor after finding it caused more problems than it solved in low-memory environments, instead relying on the SDK's built-in backpressure mechanisms when the Collector falls behind.
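A minimal agent-mode pipeline with the batch settings described above might look like the following. The pipeline wiring is a sketch; only the batch processor values come from the text:

```yaml
# Sketch of an agent Collector pipeline (receiver/exporter wiring
# is illustrative; batch settings match the figures above).
processors:
  batch:
    timeout: 10s          # flush at most every 10 seconds
    send_batch_size: 512  # or as soon as 512 spans accumulate
  # deliberately no memory_limiter -- backpressure is left to the SDK

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```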
For exporters, they chose OTLP over gRPC rather than OTLP over HTTP, despite HTTP being more firewall-friendly, because gRPC's connection reuse reduced CPU overhead by roughly 15% in their testing. Compression is set to gzip rather than the newer zstd because their target backends don't yet universally support zstd, highlighting how real production decisions often lag behind what's technically optimal.
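In Collector configuration terms, the exporter choice described above reduces to a few lines. The endpoint is hypothetical; the compression key is the point:

```yaml
# OTLP/gRPC exporter sketch; endpoint is a placeholder.
exporters:
  otlp:
    endpoint: collector-backend.example:4317
    compression: gzip  # not zstd -- target backends don't all support it yet
```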
The instrumentation strategy is equally pragmatic. They instrument at the application boundary—HTTP requests, database queries, and Redis operations—but explicitly avoid tracing internal method calls. This keeps span counts reasonable and makes traces actually readable when debugging federation issues. For database queries, they truncate SQL statements to 256 characters in the span attributes to prevent sensitive data leakage and control cardinality, a detail that matters significantly when traces flow through multiple independent operators' infrastructure.
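The truncation rule described above is simple enough to express directly. This helper is hypothetical (the article doesn't show Mastodon's implementation), but it captures the 256-character cap on the statement attribute:

```ruby
# Hypothetical helper: cap SQL statements at 256 characters before
# attaching them as span attributes, to limit data leakage and
# attribute cardinality.
MAX_STATEMENT_LENGTH = 256

def truncated_statement(sql)
  return sql if sql.length <= MAX_STATEMENT_LENGTH
  # Reserve three characters for the ellipsis marker.
  sql[0, MAX_STATEMENT_LENGTH - 3] + '...'
end
```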
One particularly relevant pattern is how they handle trace context propagation across federated instances. Since ActivityPub requests cross organizational boundaries, they inject trace context into HTTP headers but configure the SDK to start new traces if the incoming context looks malformed or suspiciously old, preventing bad trace data from one instance from polluting another's observability.
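A sketch of the malformed-context guard might validate the W3C `traceparent` header before trusting it. The method and regex below are illustrative, not Mastodon's code; checking whether a context is "suspiciously old" would need an out-of-band timestamp and is omitted here:

```ruby
# A well-formed W3C traceparent is
# "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>".
TRACEPARENT   = /\A00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}\z/
ALL_ZERO_ID   = '0' * 32

# Returns true only when the incoming header is safe to continue;
# otherwise the SDK should start a fresh trace.
def usable_remote_context?(traceparent)
  return false unless traceparent&.match?(TRACEPARENT)
  trace_id = traceparent.split('-')[1]
  trace_id != ALL_ZERO_ID  # an all-zero trace id is invalid per the spec
end
```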
The resource requirements they documented are concrete: the Collector typically uses 50-80MB RSS with occasional spikes to 120MB during traffic bursts, and CPU usage stays under 2% on a modern x86_64 core. These numbers provide actual capacity planning data that most OpenTelemetry documentation lacks. For teams evaluating whether they can afford distributed tracing in constrained environments, Mastodon's deployment proves it's feasible if you're willing to make intelligent tradeoffs on sampling and processing complexity.