Continual learning for AI agents
Most teams building AI agents default to model retraining when they want systems to improve over time. This is expensive, slow, and often unnecessary. In practice, continual learning for production agents operates across three distinct layers, each with different optimization characteristics and practical tradeoffs.
The bottom layer is model weights. This is what most ML engineers think of as continual learning: running SFT (supervised fine-tuning) or GRPO on new data to update parameters. The problem is catastrophic forgetting: when you fine-tune on new tasks or user interactions, performance on previously learned capabilities degrades. This remains an open research problem, and while you could in principle maintain per-user LoRA adapters, almost no one does this in practice because the operational complexity is prohibitive. Model-level updates typically happen at the agent level, not per-tenant, and require significant compute and careful dataset curation.
The middle layer is the harness: the code, system prompts, and tool definitions that wrap the model and drive agent behavior. This is where things get more practical. Recent work like Meta-Harness demonstrates end-to-end optimization of harness code by running the agent over evaluation tasks, collecting execution traces, and using a coding agent to analyze failures and propose code changes. This approach is far more tractable than model retraining because you're operating in the discrete space of code rather than continuous parameter space. The feedback loop is faster, changes are interpretable, and you can version control everything. The downside is that harness optimization still typically happens at the agent level, not per-user, because maintaining divergent code paths per tenant creates maintenance nightmares.
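The loop described above (run evals, collect traces, analyze failures, patch the harness) can be sketched in a few lines. Everything here is a toy stand-in, not Meta-Harness's actual API: the "harness" is just a system prompt, `run_eval` fakes the evaluation, and `propose_patch` stands in for the coding agent that would propose real code changes.

```python
# Toy sketch of a harness-optimization loop. In this sketch the harness is a
# system prompt and a task "passes" if the prompt contains its required rule;
# a real system would run the agent and patch code and tool definitions.

def run_eval(harness: str, task: dict) -> dict:
    """Stand-in evaluation: produce a trace recording pass/fail and why."""
    passed = task["required_rule"] in harness
    return {"task": task["name"], "passed": passed, "missing": task["required_rule"]}

def propose_patch(failures: list[dict]) -> str:
    """Stand-in for a coding agent: turn failure traces into a harness addition."""
    return "\n".join(f"- Always {f['missing']}" for f in failures)

def optimize_harness(harness: str, tasks: list[dict], max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        traces = [run_eval(harness, t) for t in tasks]
        failures = [t for t in traces if not t["passed"]]
        if not failures:
            break  # every eval task passes; stop iterating
        harness += "\n" + propose_patch(failures)  # apply the proposed change
    return harness

tasks = [
    {"name": "citations", "required_rule": "cite sources"},
    {"name": "brevity", "required_rule": "answer concisely"},
]
improved = optimize_harness("You are a helpful analyst.", tasks)
```

Because each "patch" is a textual diff against the harness, the whole history is reviewable and version-controllable, which is exactly the property that makes this layer more tractable than weight updates.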
The top layer is configurable context: instructions, skills, and memory that sit outside the harness and configure it at runtime. This is where most production systems should focus their continual learning efforts. Context can be updated at multiple granularities simultaneously—agent-level defaults, org-level policies, and user-level preferences. Systems like Hex Context Studio and Decagon Duet operate here, and for good reason: the iteration speed is high, the blast radius of changes is contained, and you can personalize without fragmenting your codebase.
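Runtime resolution across those granularities might look like the sketch below. The layer dictionaries and keys are illustrative assumptions, not any particular system's schema; the point is that one code path serves every tenant, with personalization living in data.

```python
# Sketch of context resolution across the three granularities: agent-level
# defaults, org-level policies, user-level preferences. Later layers override
# earlier ones, so personalization never forks the codebase.

AGENT_DEFAULTS = {"tone": "neutral", "max_steps": 10, "cite_sources": False}
ORG_POLICIES = {"acme": {"cite_sources": True}}
USER_PREFS = {"alice": {"tone": "casual"}}

def resolve_context(org_id: str, user_id: str) -> dict:
    """Merge layers in precedence order: agent < org < user."""
    ctx = dict(AGENT_DEFAULTS)
    ctx.update(ORG_POLICIES.get(org_id, {}))
    ctx.update(USER_PREFS.get(user_id, {}))
    return ctx

ctx = resolve_context("acme", "alice")
# ctx == {"tone": "casual", "max_steps": 10, "cite_sources": True}
```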
Context updates can happen in two modes. Offline learning runs over recent traces in batch jobs to extract patterns and update memory—OpenClaw calls this "dreaming." Online learning happens in the hot path as the agent executes, either explicitly when users say "remember this" or implicitly based on harness instructions. The tradeoff is latency versus freshness. Offline learning adds no runtime overhead but lags behind user interactions. Online learning is immediate but adds token overhead and potential failure modes to every request.
The key architectural decision is explicitness. Do users control memory updates, or does the agent decide autonomously? Explicit control is safer and more predictable but requires user effort. Implicit learning feels magical when it works but can surprise users with unwanted persistence or privacy concerns.
All three layers depend on high-quality execution traces. You need complete visibility into what the agent did, what tools it called, what context it used, and what the outcome was. Without traces, you're flying blind regardless of which layer you're optimizing. This is why trace collection infrastructure is table stakes for any serious agent deployment.
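A trace record needs to capture at least the four things listed above. A minimal sketch, with field names that are illustrative rather than any standard schema:

```python
# Minimal trace record: what the agent did, which tools it called, what
# context it used, and the outcome. Serializable so every layer's learning
# jobs can consume the same records.

from dataclasses import dataclass, field, asdict
import json
import time

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result_summary: str

@dataclass
class Trace:
    trace_id: str
    user_id: str
    steps: list[str]            # what the agent did, in order
    tool_calls: list[ToolCall]  # what tools it called, with args and results
    context_keys: list[str]     # which memory/instructions were injected
    outcome: str                # e.g. "success", "failure", "escalated"
    ts: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

trace = Trace(
    trace_id="t-001",
    user_id="alice",
    steps=["planned query", "ran tool", "summarized result"],
    tool_calls=[ToolCall("sql", {"query": "SELECT 1"}, "1 row")],
    context_keys=["user_prefs", "org_policies"],
    outcome="success",
)
```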
The practical recommendation for most teams: start at the context layer. Build infrastructure for per-user and per-org memory that can be updated both online and offline. Only move to harness optimization once you've exhausted context-layer improvements and have clear evidence that code changes would help. Touch model weights last, and only if you have the ML infrastructure and dataset quality to do it properly. Most teams never need to get there.