Building Better Go Systems with Logs, Context, and Profiling | Big Tent S3E8
Go's observability story has matured significantly, but building properly instrumented systems still requires intentional design choices from the start. The key isn't just adding logs or traces—it's understanding how context propagation, structured logging, and continuous profiling work together, and where each approach actually helps.
Start with structured logging using slog rather than treating it as an afterthought. The stdlib's slog package gives you structured output without third-party dependencies, which matters when you're debugging production issues and need consistent log parsing. The real value shows up when you're correlating logs with traces—having trace IDs in structured fields means you can jump from a span to relevant logs without grep archaeology.
Context propagation is where Go developers consistently trip up. Context needs to flow through every function that might need to emit telemetry or make downstream calls. The mistake isn't forgetting to pass context—it's passing the wrong context. If you create a new background context in a handler because you want a longer timeout, you've just severed the trace. Use context.WithTimeout on the existing context instead. Similarly, when spawning goroutines for background work, decide explicitly whether they should inherit the parent trace or start fresh. There's no universal right answer, but the decision needs to be conscious.
For tracing, the practical question is whether to instrument at the library level or push it to application code. Libraries that make network calls should accept context and propagate it, but they shouldn't make opinionated decisions about span creation. Let the caller decide what constitutes a meaningful span. OpenTelemetry's auto-instrumentation for common libraries handles the HTTP and gRPC cases reasonably well, but custom internal services need explicit instrumentation. Focus on instrumenting at service boundaries and anywhere work could add meaningful latency: database queries, cache lookups, external API calls.
Continuous profiling with pprof should run in production from day one, not just when you have a performance problem. The overhead is negligible—typically under one percent CPU—and having baseline profiles means you can compare before and after when issues arise. Enable the default profiles: CPU, heap, goroutines, and block. The goroutine profile catches leaks early; the block profile shows you where lock contention actually happens rather than where you think it happens. Set the block profile rate to something reasonable, like one sampled event per 10,000 nanoseconds of blocked time, to avoid overhead while still catching real contention.
eBPF-based profiling adds another dimension by capturing kernel-level behavior without application changes. This matters for Go because the runtime's interaction with the kernel—syscalls, network I/O, scheduler behavior—often drives production performance issues that pprof alone won't show. eBPF profiles reveal when you're spending cycles in kernel space or when context switching is killing throughput. The tradeoff is deployment complexity and kernel version requirements, but for platform teams running standardized infrastructure, that's manageable.
The instrumentation decision that matters most: instrument at the boundaries where requests enter and leave your service, then work inward only where you have actual questions. Over-instrumentation creates noise and overhead. Under-instrumentation means flying blind. The right balance comes from starting minimal and adding instrumentation when you need to answer specific questions about system behavior. Observability isn't about capturing everything—it's about capturing what you need to understand and fix production issues quickly.