Beyond the good first issue: how to make your contributions sustainable
The OpenTelemetry instrumentation gap most ML teams hit isn't about finding good first issues. It's about instrumenting LLM systems in ways that actually matter for production operations, and the current state of OTel support for ML workloads leaves significant blind spots.
Standard OTel instrumentation captures HTTP latency, database queries, and service dependencies well. But LLM inference has fundamentally different performance characteristics that don't map cleanly to traditional observability primitives. Time to first token matters more than total request duration for streaming responses. Token throughput varies wildly based on prompt length and model state. Context window utilization directly impacts both cost and quality but isn't captured by default spans.
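To make the gap concrete, here is a minimal sketch of measuring time to first token and token throughput over a streaming response. The attribute names (llm.time_to_first_token_s and friends) are our own, not part of any OTel convention, and token_iter stands in for whatever streaming client you use; the point is that these values have to be computed by hand before they can land on a span.

```python
import time

def measure_stream(token_iter):
    """Consume a token stream, recording time-to-first-token and throughput.

    token_iter is any iterable yielding tokens (a hypothetical stand-in for
    a streaming LLM client). Returns a dict suitable for span attributes.
    """
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            # Latency until the first token arrived, not total duration.
            ttft = time.monotonic() - start
        count += 1
    total = time.monotonic() - start
    return {
        "llm.time_to_first_token_s": ttft,
        "llm.total_duration_s": total,
        "llm.output_tokens": count,
        "llm.tokens_per_second": count / total if total > 0 else 0.0,
    }
```

In practice you would call span.set_attribute for each entry once the stream completes, so the generation span carries both latency shapes rather than just wall-clock duration.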
The semantic conventions for GenAI, currently in experimental status as of OTel 1.24, attempt to address this with attributes like gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.response.finish_reasons. In practice, these conventions help standardize what you're measuring but don't solve the harder problem of what you should be measuring. A span that reports 2000 input tokens and 500 output tokens tells you nothing about whether those tokens represented effective context usage or wasteful prompt bloat.
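A sketch of what this looks like in practice: the gen_ai.* keys below follow the experimental convention names quoted above, while llm.context_utilization is a derived, non-standard attribute we add to illustrate the kind of signal (effective context usage) the conventions don't prescribe.

```python
def genai_attributes(model, input_tokens, output_tokens, finish_reasons,
                     context_window):
    """Build span attributes using the experimental gen_ai.* conventions,
    plus one derived signal the conventions leave to you."""
    return {
        # Standardized (experimental) convention attributes.
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
        # Non-standard, derived: fraction of the context window the
        # prompt consumed. High values hint at prompt bloat.
        "llm.context_utilization": input_tokens / context_window,
    }
```

The raw counts alone can't distinguish effective context from bloat; a derived ratio like this at least makes over-stuffed prompts visible on a dashboard.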
The real instrumentation challenge emerges with RAG pipelines and multi-step agent workflows. You need to correlate retrieval quality with generation quality across span boundaries. Was the hallucination caused by poor retrieval results, or did the model ignore good context? OTel's trace structure can capture the parent-child relationships between retrieval and generation spans, but it has no native concept of semantic relationships like "this generation span used context from these three retrieval spans with these relevance scores."
Teams end up encoding this information in span attributes or events, which works but creates vendor lock-in through custom attribute schemas. One team might log retrieved_doc_ids as a comma-separated string attribute, another as separate events, another as links to external storage. The experimental semantic conventions don't prescribe patterns for these relationships, leaving each implementation to solve it differently.
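As one example of such an ad-hoc scheme, the sketch below encodes the retrieval-to-generation relationship as a JSON-valued span attribute. The attribute name rag.context_sources and the record shape are inventions for illustration; this is exactly the kind of custom schema the conventions currently leave each team to define on its own.

```python
import json

def link_retrievals(generation_attrs, retrievals):
    """Record which retrieval spans fed a generation span, with relevance
    scores, as a single JSON span attribute.

    retrievals: list of dicts with span_id, doc_id, and score keys
    (a hypothetical shape; each team defines its own).
    """
    generation_attrs["rag.context_sources"] = json.dumps([
        {"span_id": r["span_id"], "doc_id": r["doc_id"], "score": r["score"]}
        for r in retrievals
    ])
    return generation_attrs
```

Span links with attributes would be a more structured alternative, but until a convention prescribes one pattern, a backend can't query any of these encodings uniformly.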
The cost implications are non-trivial. Detailed span attributes for every token-level operation in a high-throughput LLM service generate massive telemetry volumes. At 10k requests per second with 20 spans per request and 30 attributes per span, you're looking at 200k spans per second carrying 6M attribute values per second. Even with tail sampling, the collector becomes a bottleneck. Most teams end up with aggressive head-based sampling that drops 95-99% of traces, which defeats the purpose when you're trying to debug rare hallucination patterns or quality regressions.
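The back-of-envelope arithmetic is worth writing down, since it's what drives sampling decisions. The bytes_per_attr default below is a rough assumption, not a measured figure:

```python
def telemetry_volume(rps, spans_per_request, attrs_per_span,
                     bytes_per_attr=64):
    """Rough telemetry volume estimate for a traced service.

    bytes_per_attr=64 is an assumed average serialized size per
    attribute; measure your own payloads before trusting the MB figure.
    """
    spans_per_sec = rps * spans_per_request
    attrs_per_sec = spans_per_sec * attrs_per_span
    return {
        "spans_per_sec": spans_per_sec,
        "attrs_per_sec": attrs_per_sec,
        "approx_mb_per_sec": attrs_per_sec * bytes_per_attr / 1e6,
    }
```

Plugging in the numbers from the scenario above (10k rps, 20 spans, 30 attributes) yields 200k spans and 6M attribute values per second, which makes clear why head-based sampling becomes the default even when it hurts debuggability.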
The practical path forward involves selective instrumentation. Instrument model invocations and retrieval operations with full detail, but treat intermediate processing steps as internal span events rather than separate spans. Use exemplar traces that link high-cardinality metrics to specific trace IDs rather than trying to capture everything. Log prompt hashes instead of full prompt text to reduce payload size while maintaining debuggability.
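The prompt-hash idea from the last point is simple enough to sketch directly. The function name and truncation length are our choices, not any standard; the design goal is a stable identifier that groups traces by prompt without shipping (or leaking) the full text:

```python
import hashlib

def prompt_fingerprint(prompt: str, length: int = 16) -> str:
    """Short, stable fingerprint of a prompt for use as a span attribute.

    Identical prompts always map to the same value, so you can group and
    compare traces by prompt; truncating the SHA-256 digest keeps the
    attribute small while collisions stay negligible at this scale.
    """
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:length]
```

When a problematic fingerprint shows up in traces, the full prompt can be looked up out-of-band (in application logs or a prompt store) rather than carried in every span.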
Contributing to OTel's GenAI semantic conventions is valuable, but the harder work is establishing patterns for what's actually measurable and actionable in production LLM systems versus what generates telemetry noise. The community needs fewer good first issues and more production war stories about what instrumentation actually helped teams catch problems before users did.