Monitor ClickHouse query performance with Datadog Database Monitoring
Datadog's Database Monitoring support for ClickHouse addresses a real gap for teams running analytical workloads at scale. If you're operating ClickHouse clusters and currently stitching together system table queries, custom exporters, and log parsing to understand query performance, this integration consolidates that visibility into a single pane.
The integration pulls three distinct data types from ClickHouse. Aggregated query metrics give you the statistical view: p95 latency, execution counts, bytes scanned, and memory consumption broken down by normalized query patterns. This is sourced from system.query_log and system.query_thread_log tables, which ClickHouse populates after each query completes. The normalization matters here because ClickHouse queries often have highly variable literals in WHERE clauses or INSERT statements. Without normalization, you'd see thousands of unique queries when you really have a dozen patterns with different parameter values.
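The idea behind normalization can be sketched with a few regex substitutions. This is an illustrative approximation, not Datadog's actual algorithm, which handles far more SQL shapes:

```python
import re

def normalize_query(sql: str) -> str:
    """Collapse literals so structurally identical queries share one pattern.
    Illustrative sketch only -- real normalizers are far more thorough."""
    out = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)             # string literals -> ?
    out = re.sub(r"\b\d+(?:\.\d+)?\b", "?", out)             # numeric literals -> ?
    out = re.sub(r"\(\s*\?(?:\s*,\s*\?)+\s*\)", "(?)", out)  # collapse IN/VALUES lists
    return re.sub(r"\s+", " ", out).strip()

q1 = "SELECT count() FROM hits WHERE user_id = 42 AND date = '2024-01-01'"
q2 = "SELECT count() FROM hits WHERE user_id = 7  AND date = '2024-03-15'"
assert normalize_query(q1) == normalize_query(q2)
```

With literals collapsed, thousands of raw query strings fold into a handful of patterns, and per-pattern aggregates like p95 latency become meaningful.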
Query samples capture individual execution details for completed queries. This is where you dig into why a specific query ran slowly: which tables it touched, how many rows it examined versus returned, whether it hit the query cache, and the actual execution time breakdown across query stages. For ClickHouse specifically, seeing the ratio of rows read to rows returned is critical since inefficient queries often scan entire partitions when they should be pruning aggressively.
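The read-versus-returned ratio is easy to compute from the `read_rows` and `result_rows` columns that `system.query_log` records per query. A minimal sketch, using hypothetical sample rows shaped like that table:

```python
def scan_efficiency(read_rows: int, result_rows: int) -> float:
    """Rows examined per row returned; large values suggest missing
    partition pruning or an unselective primary key."""
    return read_rows / max(result_rows, 1)  # guard against zero-row results

# Hypothetical samples shaped like system.query_log rows.
samples = [
    {"query": "... WHERE date = today()", "read_rows": 1_000_000,     "result_rows": 500_000},
    {"query": "... WHERE user_id = 42",   "read_rows": 9_800_000_000, "result_rows": 120},
]
worst = max(samples, key=lambda s: scan_efficiency(s["read_rows"], s["result_rows"]))
```

A ratio near 1 means the query read roughly what it returned; a ratio in the millions, as in the second sample, is the signature of a full-partition scan.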
Active query visibility shows what's running right now. This is essential during incidents when you need to identify runaway queries consuming cluster resources. ClickHouse does support KILL QUERY and limits like max_execution_time, but a runaway analytical scan can burn through memory and CPU before those guards trigger, so catching expensive queries early still matters. The active query view pulls from system.processes and refreshes frequently enough to catch queries that might only run for a few seconds.
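The triage logic this enables can be sketched against rows shaped like `system.processes` (which exposes `query_id`, `elapsed`, `memory_usage`, and `query`, among other columns). The thresholds and mock snapshot below are illustrative, not part of the integration:

```python
def runaway_queries(processes, min_elapsed_s=30.0, min_memory_bytes=2 * 1024**3):
    """Flag in-flight queries worth inspecting (or terminating with KILL QUERY).
    `processes` is a list of dicts shaped like rows from system.processes."""
    return [
        p for p in processes
        if p["elapsed"] >= min_elapsed_s or p["memory_usage"] >= min_memory_bytes
    ]

# Mock snapshot; in practice these rows come from polling system.processes.
snapshot = [
    {"query_id": "a1", "elapsed": 0.4,  "memory_usage": 50_000_000,    "query": "SELECT ..."},
    {"query_id": "b2", "elapsed": 95.0, "memory_usage": 6_400_000_000, "query": "SELECT ... GROUP BY ..."},
]
suspects = runaway_queries(snapshot)
```

Because the view is a poll, not a stream, sub-second queries can slip between snapshots; the value is in catching the long-running outliers.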
The practical value depends on your current monitoring setup. If you're already exporting ClickHouse metrics to Prometheus and have Grafana dashboards built around system table queries, the incremental benefit is consolidation and reduced maintenance burden. You're trading custom exporters and dashboard JSON for a managed integration. The query normalization and automatic tagging by database, user, and cluster are conveniences but not game-changers.
Where this becomes more compelling is correlation with application-level traces. If you're using Datadog APM to trace requests through your application stack, seeing which ClickHouse queries those requests triggered and their performance characteristics closes the observability loop. You can move from "this API endpoint is slow" to "this endpoint triggers a ClickHouse query that's scanning 10 billion rows because the partition key isn't being used" without context switching between tools.
One limitation worth noting: this relies on ClickHouse's query log tables being enabled and retained long enough for the Datadog Agent to collect them. If you're running lean on disk or expire system log tables aggressively, you'll need to loosen the retention on system.query_log in the server configuration (for example, the TTL on the query_log table). The Agent polls these tables periodically rather than streaming changes, so there's inherent lag before the data appears in Datadog.
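Retention for system.query_log is controlled in the server's config.xml under the `query_log` section. A minimal sketch, with illustrative values; check the settings available in your ClickHouse version before copying:

```xml
<!-- config.xml: keep system.query_log around long enough for the
     agent's polling interval; values here are illustrative only. -->
<query_log>
    <database>system</database>
    <table>query_log</table>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    <ttl>event_date + INTERVAL 14 DAY DELETE</ttl>
</query_log>
```

The TTL expression is an ordinary table TTL, so anything from hours to weeks works; the trade-off is disk consumed by the log table versus how far back collection can reach after an Agent outage.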
For teams already invested in Datadog for infrastructure and application monitoring, adding ClickHouse database monitoring makes sense. For teams using open-source stacks or those with heavily customized ClickHouse monitoring, the value proposition is weaker unless the unified interface and correlation capabilities justify the additional cost.