Analyzing round trip query latency

Datadog Blog

When you're staring at slow database queries, the instinct is to blame the database. But total query latency from the application's perspective includes three distinct components: time spent in application code preparing the query, network round trip time, and actual database execution time. Treating these as a single number leads to optimization theater: tuning indexes when the real problem is connection pooling, or chasing phantom network issues when your ORM is doing something stupid.

Datadog's approach here is straightforward but genuinely useful: they correlate APM traces with database monitoring telemetry to break down that total latency. On the APM side, you see when your application code initiated the query and when it received the response. On the database side, you see when the query actually started executing and when it finished. The delta between these timestamps tells you where time is actually being spent.
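The arithmetic behind that decomposition is just two subtractions. Here's a minimal sketch; the field names are illustrative, not Datadog's actual span schema:

```python
from dataclasses import dataclass

@dataclass
class QueryTimings:
    """Hypothetical timestamps (seconds) stitched together from both sides."""
    app_sent: float      # APM: application handed the query to the driver
    db_started: float    # DBM: database began executing the statement
    db_finished: float   # DBM: database finished executing
    app_received: float  # APM: application received the full response

    def decompose(self) -> dict:
        total = self.app_received - self.app_sent
        db_exec = self.db_finished - self.db_started
        # Everything that isn't execution: pool waits, wire time, TLS, proxies.
        overhead = total - db_exec
        return {k: round(v, 3) for k, v in
                {"total": total, "db_exec": db_exec, "overhead": overhead}.items()}

t = QueryTimings(app_sent=0.0, db_started=0.210, db_finished=0.260, app_received=0.500)
print(t.decompose())  # {'total': 0.5, 'db_exec': 0.05, 'overhead': 0.45}
```

The point is that neither side alone can compute `overhead`; you need timestamps from both the APM trace and the database telemetry on a correlated clock.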

The practical value becomes obvious when you consider common failure modes. Say you're seeing 500ms query latency on what should be a simple SELECT. If database execution time is 450ms, you've got a database problem—missing index, table scan, lock contention, whatever. But if database execution is 50ms and you're seeing 450ms of overhead, you're looking at something completely different. Maybe your connection pool is exhausted and requests are queuing. Maybe you're opening a new connection for every query instead of reusing them. Maybe there's actual network latency because someone decided to deploy the application in us-east-1 and the database in eu-west-1.
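That triage can be stated as a crude heuristic. This is not a Datadog feature, just a sketch of the decision the decomposition lets you make:

```python
def diagnose(total_ms: float, db_exec_ms: float) -> str:
    # Illustrative threshold: if execution dominates total latency, it's a
    # database problem; otherwise look at the client, pool, or network.
    overhead_ms = total_ms - db_exec_ms
    if db_exec_ms >= overhead_ms:
        return "database-side: check indexes, table scans, lock contention"
    return "client/network-side: check pooling, connection reuse, topology"

print(diagnose(500, 450))  # database-side: check indexes, table scans, lock contention
print(diagnose(500, 50))   # client/network-side: check pooling, connection reuse, topology
```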

The network component is particularly slippery because it's not just wire time. It includes time waiting for a connection from the pool, SSL/TLS handshake overhead if you're establishing new connections, and any proxy or sidecar latency if you're running something like Envoy or a database proxy in front of your actual database. I've seen cases where 80% of "network" time was actually Kubernetes DNS resolution happening on every query because the connection string used a service name and the client library was being clever about failover.
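That DNS failure mode is cheap to check for yourself. A standard-library sketch; the hostname and port are placeholders for your own connection string:

```python
import socket
import time

def time_dns_ms(hostname: str, port: int = 5432) -> float:
    """Measure just the name resolution a client performs before each new
    connection when nothing caches the lookup."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, port)
    return (time.perf_counter() - start) * 1000

# If this number is a meaningful fraction of your "network" time, the fix is
# connection reuse or DNS caching, not anything on the database itself.
print(f"{time_dns_ms('localhost'):.2f} ms")
```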

The application-side component catches the stuff that never makes it to database logs. Inefficient query building, N+1 queries that show up as many small database executions but massive application-side overhead, or ORM lazy loading that triggers cascading queries. When you see high application overhead relative to database execution, start looking at your data access patterns before you touch database configuration.
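The N+1 pattern is worth seeing concretely. A self-contained sketch with sqlite3 standing in for the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# N+1: one query per user. Each execution is trivially fast on the database
# side, but every one pays a full round trip of application/network overhead.
# With 1,000 users this is 1,001 round trips instead of 2.
users = [row[0] for row in conn.execute("SELECT id FROM users")]
per_user = {uid: conn.execute(
    "SELECT total FROM orders WHERE user_id = ? ORDER BY id", (uid,)).fetchall()
    for uid in users}

# Batched alternative: one round trip fetches the same data.
batched = conn.execute(
    "SELECT user_id, total FROM orders ORDER BY user_id, id").fetchall()
print(batched)  # [(1, 10.0), (1, 5.0), (2, 7.5)]
```

On a latency breakdown, the N+1 version shows up exactly as the article describes: tiny per-query database execution times with application-side overhead dwarfing them in aggregate.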

This decomposition matters most when you're operating at scale where small inefficiencies compound. A 10ms connection pool wait doesn't matter for a query that runs once per user session. It matters a lot when you're doing 10,000 queries per second and that 10ms represents 100 concurrent connections sitting idle waiting for pool slots.
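That back-of-envelope figure is just Little's law: average concurrency equals arrival rate times average wait time.

```python
def waiting_requests(qps: float, wait_seconds: float) -> float:
    # Little's law: the average number of requests stuck waiting equals
    # arrival rate multiplied by the average time each one spends waiting.
    return qps * wait_seconds

print(waiting_requests(10_000, 0.010))  # 100.0
```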

The correlation requires instrumentation on both sides, which means you need Datadog's APM agent in your application and their database monitoring running against your database. For Postgres and MySQL this is relatively straightforward. The APM side uses trace context propagation to tag database spans, and the database monitoring side matches these up with actual query execution from pg_stat_statements or performance_schema.
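If you want to eyeball the database-side half yourself before wiring up the correlation, the raw material on Postgres lives in pg_stat_statements. Column names below are from Postgres 13+ (older versions use mean_time rather than mean_exec_time):

```python
# Per-statement execution stats, the server-side source the article mentions,
# expressed as a plain query you could run in psql.
PG_SLOWEST_STATEMENTS = """
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"""
```

These numbers cover execution only; everything the article files under application and network overhead is invisible here, which is exactly why the cross-correlation with APM is the interesting part.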

The real test is whether this changes your troubleshooting workflow. If you're already exporting detailed metrics and can correlate them yourself, maybe not. But if you're currently guessing whether to optimize queries or fix connection handling, having this decomposition in a single view eliminates a lot of diagnostic overhead.