How Kensho built a multi-agent framework with LangGraph to solve trusted financial data retrieval
Kensho's Grounding framework demonstrates what production multi-agent systems actually require beyond the demos. The architecture is straightforward: a LangGraph router receives natural language queries about financial data, decomposes them into sub-queries, dispatches them to specialized Data Retrieval Agents (DRAs) owned by different domain teams, then aggregates the responses. What matters here isn't the pattern itself (map-reduce over specialized agents is obvious) but the three operational components they had to build to make it work at scale.
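The decompose/dispatch/aggregate flow can be sketched in plain Python, independent of LangGraph specifics. Everything here is illustrative: the keyword-based `decompose` stub stands in for an LLM-driven decomposition step, and the agent names are assumptions, not Kensho's.

```python
# Minimal, library-free sketch of the router's map-reduce flow.
# In Kensho's system the router is a LangGraph graph and decomposition
# is LLM-driven; this stub routes on keywords purely for illustration.
from concurrent.futures import ThreadPoolExecutor

def decompose(query: str) -> list[dict]:
    # Stand-in for LLM query decomposition: pick DRAs per sub-topic.
    subs = []
    if "price" in query:
        subs.append({"agent": "equity_prices", "sub_query": query})
    if "earnings" in query:
        subs.append({"agent": "earnings_transcripts", "sub_query": query})
    return subs

# Hypothetical DRAs; each would wrap its own data APIs in production.
DRAS = {
    "equity_prices": lambda q: {"agent": "equity_prices", "data": [101.2, 102.8]},
    "earnings_transcripts": lambda q: {"agent": "earnings_transcripts", "data": "Q3 call..."},
}

def route(query: str) -> dict:
    subs = decompose(query)
    # Fan out to each selected DRA in parallel, then aggregate (map-reduce).
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: DRAS[s["agent"]](s["sub_query"]), subs))
    return {"query": query, "responses": results}
```

The fan-out is embarrassingly parallel, which is why the pattern is attractive; the operational difficulty lives in the protocol, observability, and evaluation layers discussed below.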
First, the custom protocol. This is the unsexy infrastructure work that determines whether a multi-agent system survives contact with production. Financial data comes in wildly different formats: structured time series for equity prices, semi-structured documents for earnings transcripts, unstructured analyst notes. Without a standardized response format, every consuming agent needs custom parsing logic for every data source, creating an N×M integration problem. Kensho's protocol enforces a common envelope around all responses—both the data payload and metadata like provenance, timestamps, and confidence scores. This is critical for financial services where audit trails matter, but it's equally important for debugging. When an aggregated response is wrong, you need to trace back through which DRA returned what, and a consistent protocol makes that tractable.
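A common envelope like the one described might look like the following sketch. The field names (`provenance`, `confidence`, and so on) are assumptions based on the metadata the writeup mentions, not Kensho's actual schema.

```python
# Hedged sketch of a standardized DRA response envelope.
# Field names are assumptions inferred from the described metadata.
import time
from dataclasses import dataclass, field, asdict
from typing import Any

@dataclass
class DRAResponse:
    agent: str              # which DRA produced this response
    payload: Any            # time series, document text, etc.
    provenance: list[str]   # upstream data sources consulted (audit trail)
    confidence: float       # agent's self-reported confidence, 0-1
    timestamp: float = field(default_factory=time.time)

    def validate(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")

# Any consumer can now parse any DRA's output identically,
# collapsing the NxM integration problem to N + M.
resp = DRAResponse(
    agent="equity_prices",
    payload={"ticker": "AAPL", "closes": [189.7, 191.2]},
    provenance=["vendor.equities.daily"],
    confidence=0.97,
)
resp.validate()
record = asdict(resp)  # uniform dict for logging and audit trails
```

The point of the envelope is that consumers depend on the schema, not on any individual DRA, so adding a new data source costs one adapter rather than one integration per consumer.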
The observability piece is where most multi-agent systems fail quietly. LangGraph provides native tracing, which Kensho relies on for end-to-end visibility across the router and all DRAs. But tracing alone doesn't tell you why a query failed or returned stale data. Their custom protocol attaches deliberate metadata at each hop: which agent was invoked, what sub-query it received, which data sources it hit, and the response latency. This isn't just for post-mortem debugging. In production, you need real-time alerts when routing accuracy degrades or when specific DRAs start timing out. Without embedded observability from day one, you're flying blind when query patterns shift or data sources change schema.
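Per-hop instrumentation of this kind can be sketched as a decorator that records the metadata the text lists. The record structure and the in-memory `TRACE` sink are assumptions; in production the records would feed a tracing backend and alerting pipeline.

```python
# Sketch of per-hop observability; the record schema is an assumption,
# not Kensho's. TRACE stands in for a real tracing/alerting backend.
import time
from functools import wraps

TRACE: list[dict] = []

def traced(agent_name: str):
    """Wrap a DRA call so every hop emits a structured metadata record."""
    def deco(fn):
        @wraps(fn)
        def wrapper(sub_query: str) -> dict:
            start = time.perf_counter()
            result = fn(sub_query)
            TRACE.append({
                "agent": agent_name,
                "sub_query": sub_query,
                "sources": result.get("sources", []),
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return wrapper
    return deco

# Hypothetical DRA used only to exercise the decorator.
@traced("equity_prices")
def equity_prices_dra(sub_query: str) -> dict:
    return {"data": [101.2], "sources": ["prices.daily"]}

equity_prices_dra("AAPL close")
# TRACE now holds one hop record, ready for latency or timeout alerting.
```

Because every hop emits the same record shape, alerts like "this DRA's p95 latency doubled" become simple aggregations rather than bespoke log parsing.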
The evaluation framework is the most interesting part because it acknowledges that single-metric eval doesn't work for multi-stage systems. Kensho measures three things separately: routing accuracy (did the router select the right DRAs for this query), tool-calling correctness (did each DRA invoke the right data APIs), and answer completeness (does the aggregated response actually satisfy the original query). They distinguish between exact-match success—correct agents, expected responses—and partial success where routing was right but responses varied. This granularity matters because failure modes are different at each stage. A routing error means the wrong DRAs were invoked entirely. A tool-calling error means the right DRA used the wrong API or parameters. A completeness error means all the pieces were right but aggregation failed.
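The three-stage scoring, and the exact-versus-partial distinction, can be expressed as a small evaluation function. The field names and the substring-based completeness check are assumptions for illustration; in practice completeness would likely be judged by an LLM or human raters rather than keyword matching.

```python
# Sketch of stage-wise evaluation; field names and the partial-success
# rule are assumptions, and the substring completeness check is a stub
# for what would be an LLM-judged or human-rated comparison.

def evaluate(case: dict, actual: dict) -> dict:
    # Stage 1: routing accuracy - did the router pick the right DRAs?
    routing_ok = set(actual["agents"]) == set(case["expected_agents"])
    # Stage 2: tool-calling correctness - did each DRA hit the right API?
    tools_ok = all(
        actual["tool_calls"].get(a) == case["expected_tools"].get(a)
        for a in case["expected_agents"]
    )
    # Stage 3: answer completeness - does the aggregate satisfy the query?
    complete = all(fact in actual["answer"] for fact in case["required_facts"])
    return {
        "routing": routing_ok,
        "tool_calling": tools_ok,
        "completeness": complete,
        "exact": routing_ok and tools_ok and complete,      # every stage correct
        "partial": routing_ok and not (tools_ok and complete),  # routing right, rest varied
    }

case = {
    "expected_agents": ["equity_prices"],
    "expected_tools": {"equity_prices": "get_daily_closes"},
    "required_facts": ["close"],
}
actual = {
    "agents": ["equity_prices"],
    "tool_calls": {"equity_prices": "get_daily_closes"},
    "answer": "AAPL's latest close was 191.2",
}
scores = evaluate(case, actual)
```

Scoring each stage separately is what lets you attribute a regression to routing, tool use, or aggregation instead of watching a single end-to-end number drift.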
The tradeoffs here are real. Standardized protocols add latency—every DRA must format responses consistently rather than returning raw data. The router introduces a single point of failure and adds network hops. Multi-stage evaluation is expensive to run continuously. But the alternative is worse: without these components, debugging becomes archeological work, adding new data sources requires custom integration for every consumer, and you have no systematic way to measure whether the system is degrading.
What's missing from their writeup is cost and latency specifics. How much overhead does the router add versus direct DRA calls? What's the p95 latency for multi-DRA queries? How many LLM calls does query decomposition and aggregation require, and what does that cost at scale? These numbers determine whether this architecture makes sense for high-throughput applications or only works for lower-volume analyst workflows.
The broader lesson is that production multi-agent systems are distributed systems problems first, LLM problems second. The hard parts aren't prompt engineering or model selection—they're protocol design, observability infrastructure, and evaluation frameworks that surface failure modes before users do.