Observability in Go: Where to start and what matters most
When you're building Go services, observability tooling should follow your actual problems rather than some imaginary maturity model. Start with logs because they're trivial to implement and surprisingly powerful. Then add metrics, tracing, and profiling only when you have specific questions those tools can answer.
Logs are the entry point because they require almost zero setup. Use the standard library (log/slog gives you structured output for free), write to stdout, and you're done. What makes logs particularly useful in Go is how naturally they capture panic stack traces. At Grafana Labs, they parse logs to count panics and turn those counts into alertable metrics. This is dead simple: a regex looking for panic signatures, feeding a counter, with alerts on anomalies. You get observability without instrumenting anything beyond basic structured logging.
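The panic-counting idea can be sketched in a few lines. This is a minimal illustration of the pattern, not Grafana's actual pipeline: a regex over the log stream feeding a counter, with the `countPanics` helper and the sample log lines invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// panicPattern matches the first line the Go runtime prints when a
// goroutine panics. No instrumentation lives in the service itself;
// the metric is derived entirely from its log stream.
var panicPattern = regexp.MustCompile(`^panic: `)

// countPanics scans log lines and counts panic signatures.
func countPanics(lines []string) int {
	count := 0
	for _, line := range lines {
		if panicPattern.MatchString(line) {
			count++
		}
	}
	return count
}

func main() {
	logs := []string{
		`level=info msg="request handled" status=200`,
		`panic: runtime error: invalid memory address or nil pointer dereference`,
		`level=info msg="request handled" status=200`,
	}
	fmt.Println(countPanics(logs)) // 1
}
```

In a real setup this logic would live in your log pipeline (a Loki recording rule or similar) rather than in application code, so you can change what you count without redeploying anything.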
The progression from logs to metrics should be driven by cardinality concerns. If you're logging every request with user IDs and getting millions of unique log lines, that's when you extract specific dimensions into proper metrics. But don't prematurely optimize here. Loki can handle surprisingly high log volumes, and deriving metrics from logs means you can iterate on what you're measuring without redeploying code.
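The cardinality trade-off is easiest to see in code. A hypothetical sketch: instead of a metric labeled by user ID (unbounded, one series per user), bucket requests by status class, which collapses millions of potential series into a handful. The `statusClass` helper is invented for illustration; real services would feed this label into a Prometheus counter.

```go
package main

import "fmt"

// statusClass maps an HTTP status code to a low-cardinality label:
// 200 and 201 both become "2xx", 500 and 503 both become "5xx".
func statusClass(code int) string {
	return fmt.Sprintf("%dxx", code/100)
}

func main() {
	// Simulated request outcomes; in practice these arrive per request.
	counts := map[string]int{}
	for _, code := range []int{200, 201, 404, 500, 503} {
		counts[statusClass(code)]++
	}
	fmt.Println(counts["2xx"], counts["4xx"], counts["5xx"]) // 2 1 2
}
```

The user ID can still live in the log line for debugging individual requests; it just never becomes a metric dimension.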
Tracing becomes necessary when you have multiple services and need to understand request flow across boundaries. The key insight is that tracing's value comes from the parent-child relationships and explicit timing data. A single service with 30 lines of code probably doesn't need tracing, but the moment you have frontend calling backend calling database, tracing is how you identify which hop is slow. The context package is critical here because it carries trace IDs across service boundaries. Without proper context propagation, your traces fragment and lose their diagnostic power.
The common mistake with tracing is treating it as a performance optimization tool when it's actually a topology and timing visualization tool. If your service is slow, tracing shows you where time is spent, but it won't tell you why that database query is slow. That's a different problem requiring different tools.
Profiling with pprof should be your last resort, not your first instinct. The biggest gotcha is running CPU profiles when you don't have a CPU problem. If your service is using 0.1 cores and running slowly, CPU profiling is useless because the slowness is from waiting on IO, not computation. Check your actual CPU utilization first. When you do have CPU problems in Go, memory allocation is usually the culprit. Go's garbage collector means allocation pressure shows up as CPU time, so look at memory profiles even when diagnosing CPU issues.
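Heap profiles are available programmatically through runtime/pprof, which is the same data net/http/pprof serves at /debug/pprof/heap. A minimal sketch, with the `heapProfile` helper invented for the example:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// heapProfile returns the current heap profile as raw bytes — the same
// payload you would fetch from /debug/pprof/heap on a live service.
func heapProfile() ([]byte, error) {
	runtime.GC() // flush recent allocations into the heap profile
	var buf bytes.Buffer
	if err := pprof.Lookup("heap").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Allocate something so the profile has content to attribute.
	sink := make([][]byte, 0, 100)
	for i := 0; i < 100; i++ {
		sink = append(sink, make([]byte, 1024))
	}
	p, err := heapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(sink), len(p) > 0) // 100 true
}
```

In production the usual route is a blank import of net/http/pprof plus an HTTP server, then `go tool pprof` against the heap endpoint, looking at allocation counts as well as live bytes when the symptom is CPU time spent in the garbage collector.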
Go's error handling creates an interesting observability challenge. Errors are plain values, often constructed from bare strings rather than typed exceptions, which makes it harder to aggregate errors into metric labels. The workaround is either defining custom error types from the start or using sentinel errors you can check with errors.Is. This requires more discipline than exception-based languages but gives you better control over error context.
eBPF enters the picture when application-level instrumentation isn't enough. If you need visibility into kernel-level behavior, network stack performance, or syscall patterns, eBPF programs can observe without modifying your application. But this is genuinely advanced territory. Most Go services never need eBPF because application-level observability covers 95% of real problems.
The practical takeaway is to resist the urge to instrument everything upfront. Add observability incrementally as you encounter actual blind spots. Logs first, metrics when cardinality matters, tracing when you have multiple services, profiling when you have confirmed resource problems, and eBPF when nothing else reaches deep enough.