Observability in Go: Where to start and what matters most

Grafana Blog

When your Go service starts behaving badly in production, you need observability. But the observability landscape has become cluttered with overlapping tools and buzzwords that make it hard to know where to actually start. The progression matters more than the tooling: logs first, metrics second, tracing when you go distributed, and profiling only when you have concrete performance problems.

Logs remain the most practical entry point because they require almost no setup. Dump to stdout, ship to Loki or similar, and you're operational. What's less obvious is that logs can be parsed into metrics after the fact. At Grafana Labs, we scan logs for panic stack traces and convert them into alertable metrics. This pattern works for any structured event in your logs—rate limiting hits, authentication failures, circuit breaker trips. You're essentially getting metrics for free by mining your existing log stream, which matters when you're early and don't know what metrics you'll need yet.

The jump to explicit metrics makes sense when you need precise aggregations or when log volume becomes expensive. But don't overthink the instrumentation: Prometheus client library histograms for latency, counters for events, gauges for current state. The real value is in cardinality discipline—label your metrics carefully, because high-cardinality labels (user IDs, request IDs) will blow up your storage costs and degrade query performance fast.

Distributed tracing is where things get interesting and expensive. The reality is that tracing only pays off once you have multiple services talking to each other. A monolith with 20 or 30 functions doesn't need OpenTelemetry spans—structured logs with correlation IDs will get you there. But once you have frontend, backend, database, cache, and message queue all in the mix, tracing becomes the only practical way to understand request flow and attribute latency. The setup cost is real though. You need context propagation across service boundaries, sampling strategies to control volume, and a backend like Tempo that can handle the write throughput. Start with head-based sampling at 1 percent and only increase it when you're missing critical traces.

Profiling with pprof is powerful but often misused. The biggest mistake is reaching for CPU profiling when your service isn't actually CPU-bound. If your process is sitting at 0.1 CPU utilization but requests are slow, CPU profiling will show you nothing useful—you're waiting on I/O, not burning cycles. Check your CPU saturation first. When you do have CPU problems in Go, look at memory allocation profiles before pure CPU profiles. Go's garbage collector means allocation patterns drive CPU consumption more than you'd expect. The alloc_space profile will show you where you're churning memory, which is usually the real bottleneck.

The emerging piece is eBPF for kernel-level observability. When application-level instrumentation isn't enough—maybe you're debugging network latency, file system behavior, or syscall patterns—eBPF lets you instrument the kernel without modifying your application. Libraries like cilium/ebpf make this accessible from Go, but it's still advanced territory. You need root access, kernel version awareness, and a solid understanding of what you're measuring. It's not day-one observability, but it's increasingly necessary for diagnosing issues that live below your application layer.

The pattern that works: start with structured logs and standard library tooling, add metrics when you need aggregation, introduce tracing only when you go multi-service, profile when you have confirmed resource saturation, and reach for eBPF when application instrumentation hits its limits. Each layer has real costs in complexity and runtime overhead. Add them deliberately, not preemptively.