How Hex Builds AI Agents: Making Agents Reason Like Human Data Analysts | Izzy Miller, AI Engineer

LangChain YouTube

Hex's production data agents expose a fundamental verification problem that most LLM evaluation frameworks aren't designed to handle. When you're generating SQL queries or Python transformations against real customer databases, syntactic correctness is table stakes. The hard part is validating whether the agent understood the analytical intent correctly—whether it joined the right tables, applied the correct aggregation logic, or caught a subtle data quality issue that would silently corrupt downstream analysis.

This is categorically different from code generation benchmarks like HumanEval, where you can run unit tests and get a binary pass/fail. A data agent can produce perfectly valid SQL that returns results, but those results might be analytically meaningless because the agent misinterpreted business logic or made incorrect assumptions about data semantics. Hex is dealing with this at scale: their multi-agent system operates with nearly 100K tokens' worth of tool definitions, and they're seeing systematic failure modes that standard evals completely miss.
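One way to make this concrete is to grade queries by what they return rather than what they look like. The sketch below (not Hex's actual harness; the schema, column names, and queries are invented for illustration) shows how a candidate query can be syntactically valid yet semantically wrong, which text-level or execution-success checks would miss:

```python
import sqlite3

def results_match(conn, candidate_sql, reference_sql):
    """Compare result *sets*, not SQL strings: two differently written
    queries count as equivalent only if they return the same rows."""
    cand = conn.execute(candidate_sql).fetchall()
    ref = conn.execute(reference_sql).fetchall()
    # Order-insensitive comparison; SQL rarely guarantees row order.
    return sorted(cand) == sorted(ref)

# Toy schema (illustrative only): a refund flag the agent must not ignore.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, refunded INTEGER);
    INSERT INTO orders VALUES (1, 100.0, 0), (2, 50.0, 1), (3, 25.0, 0);
""")

# Both queries run without error; only one captures the business logic.
reference = "SELECT SUM(amount) FROM orders WHERE refunded = 0"  # net revenue
candidate = "SELECT SUM(amount) FROM orders"  # valid SQL, wrong semantics
```

Here `results_match(conn, candidate, reference)` is `False` even though the candidate executes cleanly, which is exactly the failure mode a pass/fail execution check cannot surface.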

The architecture choices matter more than most teams realize. Hex initially explored off-the-shelf orchestration frameworks but ended up building custom orchestration because existing tools couldn't handle the token budget and context management requirements of their tool catalog. When you're working with that many tools, prompt engineering alone won't save you—you need intelligent tool selection, dynamic context pruning, and careful state management to avoid context collapse. They're also dealing with ephemeral query patterns where agents need to reason about intermediate results without polluting the main analytical workflow, which creates tradeoffs between agent autonomy and user control.

The eval problem gets worse at longer time horizons. Hex built an internal eval harness that goes beyond single-turn question answering. They're running what they call a "90-day simulation"—essentially testing whether agents can maintain analytical coherence across extended workflows that mirror real user behavior. Current models, including Claude Opus 4.6 and the latest GPT variants, systematically fail these long-horizon evals. One particularly revealing test involves a 900 percent quota calculation scenario that every production model gets wrong because it requires chaining multiple analytical steps while maintaining semantic consistency about what "quota" means in different contexts.

Most public eval sets are optimized for publishing papers, not for catching the failure modes you'll see in production. They tend to be either too narrow (single SQL query generation) or too synthetic (fabricated business scenarios that don't reflect real data complexity). Hex's approach is to build domain-specific eval harnesses that test analytical reasoning patterns their users actually exhibit: multi-step transformations, handling of ambiguous requirements, recovery from partial failures, and the ability to validate results against business logic constraints.
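The last item, validating results against business-logic constraints, is worth a sketch because it is the check most benchmarks skip entirely. The invariants below are invented examples (active users cannot exceed total users; a rate must lie in [0, 1]), not Hex's actual rules; the point is that they test domain semantics, not schema validity:

```python
def validate_business_logic(rows: list[dict]) -> list[str]:
    """Check analytical output against domain invariants. A result can
    be schema-valid and well-formatted yet still violate these."""
    violations = []
    for row in rows:
        if row["active_users"] > row["total_users"]:
            violations.append(f"{row['month']}: active_users exceeds total_users")
        if not (0 <= row["churn_rate"] <= 1):
            violations.append(f"{row['month']}: churn_rate outside [0, 1]")
    return violations

# Illustrative agent output: the second row is plausible-looking nonsense.
rows = [
    {"month": "2024-01", "total_users": 1000, "active_users": 800, "churn_rate": 0.05},
    {"month": "2024-02", "total_users": 1000, "active_users": 1200, "churn_rate": 0.04},
]
violations = validate_business_logic(rows)
```

An eval harness that asserts `violations == []` on every agent result catches the "confident, well-formatted, analytically broken" failure mode directly.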

The practical implication for teams building data agents is that you can't rely on benchmark scores to predict production performance. You need eval infrastructure that tests the specific reasoning patterns your domain requires, and you need to accept that current models will fail in ways that aren't immediately obvious from their API responses. The agent might return confident results with perfect formatting, but the underlying analytical logic could be fundamentally broken. Until we have better tools for semantic verification of analytical reasoning, the only reliable approach is extensive domain-specific testing with real user workflows, not synthetic benchmarks.