Managing Memory in AI Agents: Beyond the Context Window
Once your agent moves past toy demos and starts handling real production workloads, context management becomes the bottleneck. Not latency, not cost—context. You can throw a bigger model at the problem, but even with 200k token windows, a long-running agent analyzing production traces or debugging across multiple services will hit the wall fast.
The failure mode is insidious. Your agent errors out because context exceeded the limit. The offending span stays in the session. You retry, hit the same limit, now with additional overhead. You're not making progress; you're looping.
The naive fix is aggressive truncation: cap every input at 100k characters, slice individual rows to 100 chars. This trades one failure for another. Cut too hard and the agent loses continuity. Ask a follow-up question and it responds like the first exchange never happened. The data is gone, but the conversation window is still open. Users notice immediately.
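The naive approach described above is easy to sketch, which is part of why it's tempting. A minimal illustration (function names and limits are hypothetical, matching the numbers in the paragraph):

```python
def naive_truncate(text: str, max_chars: int = 100_000) -> str:
    # Hard tail cut: anything past the cap is gone for good,
    # with no way for the agent to get it back later.
    return text[:max_chars]

def naive_slice_rows(rows: list[str], max_row_chars: int = 100) -> list[str]:
    # Same problem per row: the tail of every row is silently dropped.
    return [row[:max_row_chars] for row in rows]
```

The cut is irreversible: once the data is sliced away, no follow-up question can recover it.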
Middle truncation with retrieval beats head or tail cuts. Keep meaningful chunks from the start and end of large blobs—enough signal to understand what kind of data this is and where it resolves. Assign each truncated blob an ID and store the full version server-side. Give the agent a retrieval tool. Most of the time, the preview is sufficient. When it's not, the agent can fetch what it needs without carrying everything in context from the start.
The file system pattern from tools like Cursor translates directly here. When Cursor reads 10 files to find a bug, it doesn't dump everything into context. It holds references, reads previews, runs grep, jumps to specific lines. The files live on disk. Apply this to trace data: store large JSON server-side, pass the agent a preview and a stable json_id handle. The preview gives enough structure to reason about next steps. The ID lets the agent query targeted slices using jq for structured transforms or grep for regex search. This shifts the model from "hold everything" to "know how to retrieve."
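The preview-plus-handle idea can be sketched like this, using a pure-Python dotted-path query in place of jq to keep the example self-contained (store, function names, and path syntax are all assumptions; a production version might shell out to jq for full expressiveness):

```python
import json

JSON_STORE: dict[str, object] = {}  # stand-in for server-side storage

def register_json(json_id: str, payload: str) -> str:
    """Store the full JSON server-side; return a compact preview for context."""
    data = json.loads(payload)
    JSON_STORE[json_id] = data
    # Preview: top-level keys and value types, not the values themselves.
    if isinstance(data, dict):
        shape = {key: type(value).__name__ for key, value in data.items()}
    else:
        shape = f"list[{len(data)}]"
    return f"json_id={json_id} shape={shape}"

def query_json(json_id: str, path: str):
    """Fetch a targeted slice by dotted path, e.g. 'spans.0.duration_ms'."""
    node = JSON_STORE[json_id]
    for part in path.split("."):
        node = node[int(part)] if part.isdigit() else node[part]
    return node
```

The preview gives the model the shape of the data; the handle lets it drill into exactly the slice it needs.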
Simple hygiene matters more than you'd expect. Don't send duplicate messages—if a tool is called twice with identical inputs, keep only the last result. Don't resend the system prompt on every iteration; it's already in context after the first turn. Prune tool results after they've been incorporated into the plan. The information is reflected in what the agent did next; the raw response is dead weight.
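The duplicate-result rule above can be sketched as a single pass over the message list, assuming a simple dict-based message schema with `role`, `name`, and `input` fields (the schema is hypothetical; adapt the keys to your framework):

```python
def dedupe_tool_results(messages: list[dict]) -> list[dict]:
    """Keep only the LAST result for each (tool, input) pair; drop earlier dupes."""
    seen: set[tuple] = set()
    kept: list[dict] = []
    for msg in reversed(messages):  # walk backwards so the last result wins
        if msg.get("role") == "tool":
            key = (msg["name"], msg["input"])
            if key in seen:
                continue  # an identical later call already supplied this result
            seen.add(key)
        kept.append(msg)
    kept.reverse()
    return kept
```

The same backwards-walk shape works for pruning incorporated tool results: once a result's key is marked as absorbed into the plan, drop it on the next pass.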
Long-running conversations require session-based evaluation. Context management bugs are hard to reproduce in fresh conversations. Something works fine on turn 3, breaks on turn 11. Extend your eval framework to preload conversations with 10+ turns, then test on turn 11. This catches regressions before they reach users and validates that your truncation strategy doesn't orphan information the user expects the agent to remember.
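A session-preloading eval can be as simple as building synthetic history and asserting on the next turn. A sketch, assuming `run_agent(history, prompt) -> str` is your agent's entry point and the message schema is a plain list of role/content dicts (both are assumptions):

```python
def build_session(n_turns: int) -> list[dict]:
    """Preload a synthetic multi-turn history (hypothetical message schema)."""
    history = []
    for i in range(n_turns):
        history.append({"role": "user", "content": f"question {i}"})
        history.append({"role": "assistant", "content": f"answer {i}"})
    return history

def eval_turn_11_recall(run_agent) -> None:
    """Turn 11 must still see information from early turns."""
    history = build_session(10)
    reply = run_agent(history, "What did I ask in question 2?")
    # If truncation orphaned the early turns, this assertion fails.
    assert "question 2" in reply
```

Running this against each candidate truncation strategy turns "works on turn 3, breaks on turn 11" from a user report into a failing test.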
Sub-agents are the cleanest architectural solution for high-volume tasks. A search task might involve reading dozens of web pages, running multiple queries, scraping content. Almost none of that intermediate data needs to live in the main conversation thread. Spin up a sub-agent, let it churn through the work, return only the synthesized result. The main agent never sees the noise.
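The isolation boundary is the whole point: the sub-agent's scratchpad never touches the main thread. A sketch, where `llm_call(messages) -> str` and `search_tool(query) -> list[str]` are stand-ins for your model client and search backend (both assumptions, not a real API):

```python
def search_with_subagent(query: str, llm_call, search_tool) -> str:
    """Run a noisy search task in a private context; return only the synthesis."""
    # The sub-agent's scratchpad: raw pages, snippets, dead ends. Never shared.
    scratch = [{"role": "system", "content": "You are a research sub-agent."}]
    for page in search_tool(query):
        scratch.append({"role": "tool", "content": page})
    scratch.append({"role": "user",
                    "content": f"Synthesize an answer to: {query}"})
    # Only this one string crosses back into the main conversation thread.
    return llm_call(scratch)
```

Dozens of pages of intermediate data live and die inside `scratch`; the main agent's context grows by one message.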
One thing that didn't work: using an LLM to summarize and compress message history. The compression itself consumes tokens and adds latency. The summarization introduces drift—details the agent needs later get lost in the compression step. Retrieval-based approaches with structured previews consistently outperformed summarization in both accuracy and cost.
The key insight is architectural: stop treating context as a buffer you pack until it's full. Treat it as working memory. Store data externally, pass references and previews, build retrieval tools the agent can invoke when it needs depth. Test under realistic session length. The agents that work in production aren't the ones with the biggest context windows—they're the ones that know how to manage what they hold.