50 Articles

Autonomous RAG optimization using Claude Code with Arize evaluation tooling achieved 39% → 75% Recall@5 in 8 hours by closing the feedback loop between code changes and retrieval metrics, discovering that chunk size, signal selection, and LLM-based reranking are the highest-leverage levers for RAG performance. The key technical pattern is using structured evaluation results to dynamically generate and prioritize improvements, with blue/green index deployments ensuring safe iteration.
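
A minimal sketch of the retrieval metric driving that loop: Recall@5 is the fraction of gold-relevant chunks appearing in the top five retrieved results, averaged over an eval set. The function names and the eval-set record shape here are illustrative, not taken from the article:

```python
def recall_at_k(retrieved, relevant, k=5):
    # Fraction of gold-relevant chunk IDs that appear in the top-k retrieved list.
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in relevant if doc_id in retrieved[:k])
    return hits / len(relevant)

def evaluate_index(eval_set):
    # Mean Recall@5 over records of the form
    # {"retrieved": [ordered chunk IDs], "relevant": {gold chunk IDs}}.
    scores = [recall_at_k(ex["retrieved"], ex["relevant"]) for ex in eval_set]
    return sum(scores) / len(scores)
```

Re-running `evaluate_index` after each change to chunk size, signal selection, or reranking gives the structured feedback signal the agent iterates against; the blue/green index only gets promoted when the candidate's score beats the live one.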

Arize AI Blog

The article presents a maturity model for LLM evaluation practices organized around a consistent 'evaluation harness' architecture (inputs → execution → actions), progressing from GUI-based ad-hoc evals through AI-assisted workflows to fully autonomous monitor-triggered agents. The key insight is that production teams should build evaluation infrastructure incrementally as a continuous operational system rather than one-off notebooks, with the same underlying platform supporting all maturity levels.
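
The inputs → execution → actions harness shape can be sketched as a small loop. The `EvalResult` record, the 0.5 pass threshold, and the `on_fail` hook are hypothetical stand-ins for whatever a real platform provides; the point is that the same skeleton serves a GUI-triggered ad-hoc run and a monitor-triggered autonomous one:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    example_id: str
    passed: bool
    score: float

def run_harness(dataset, task_fn, scorer, on_fail):
    # inputs (dataset) -> execution (task_fn + scorer) -> actions (on_fail hook)
    results = []
    for ex in dataset:
        output = task_fn(ex["input"])
        score = scorer(output, ex["expected"])
        result = EvalResult(ex["id"], score >= 0.5, score)
        if not result.passed:
            on_fail(result)  # action: alert, open a ticket, kick off a fix agent
        results.append(result)
    return results
```

Maturity then becomes a question of who invokes `run_harness` (a human, an AI assistant, or a production monitor), not a different architecture.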

Arize AI Blog

As enterprises scale from single agents to dozens per employee, the critical gap isn't access control policies—it's runtime visibility and enforcement to catch silent agent failures (confident wrong outputs, behavioral drift, memory corruption) before they cascade through multi-agent pipelines and damage organizational trust in AI. Organizations need observability-driven sandboxing that traces every agent action and enforces policy in real-time, not post-hoc compliance reviews.

Arize AI Blog

Managing context in long-running LLM agents requires intelligent data handling beyond simple truncation: middle truncation with ID-based retrieval, server-side storage with preview-based references (like a file system), deduplication, and sub-agents for isolated high-volume tasks. The key insight is shifting from 'hold everything in context' to 'know how to retrieve what you need,' combined with session-based evaluation testing to catch context management regressions.
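
A toy illustration of middle truncation with ID-based retrieval: the head and tail of an oversized blob stay in context, while the middle is stashed server-side under an ID the agent can fetch later. The `blob-` ID scheme and the character budgets are invented for the example:

```python
store = {}  # stands in for server-side storage keyed by reference ID

def truncate_middle(text, max_chars=200, keep=80):
    # Keep head and tail in context; stash the middle under an ID so the
    # agent can retrieve it on demand instead of losing it outright.
    if len(text) <= max_chars:
        return text
    ref_id = f"blob-{len(store)}"
    store[ref_id] = text[keep:-keep]
    return text[:keep] + f" [truncated; retrieve with id={ref_id}] " + text[-keep:]

def retrieve(ref_id):
    # Tool the agent calls when the truncated middle turns out to matter.
    return store[ref_id]
```

This is the "know how to retrieve what you need" shift in miniature: the context holds a preview plus a handle, and a tool call restores the rest.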

Arize AI Blog

Deep Agents Deploy provides an open-source, model-agnostic alternative to proprietary agent platforms by bundling orchestration, memory management, and multi-protocol endpoints (MCP, A2A, Agent Protocol) into a single deployment command, with the critical differentiator being that agent memory remains owned and queryable by the user rather than locked behind a vendor API. For production LLM teams, this addresses a fundamental lock-in risk: while switching LLM providers is relatively easy, losing access to accumulated agent memory creates severe operational and business continuity problems.

LangChain Blog

Better-Harness is a systematic framework for iteratively improving LLM agent behavior by treating evals as training signals for harness optimization, with critical safeguards (holdout sets, human review, behavioral tagging) to prevent overfitting and ensure production generalization. The approach combines automated harness updates (prompt refinements, tool modifications) with structured eval sourcing from hand-curation, production traces, and external datasets to create a compound feedback loop that mirrors classical ML training rigor.
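
The holdout safeguard mirrors a classic train/test split: sourced evals are partitioned so harness updates are driven by one subset while a held-out subset only measures generalization. A minimal sketch, with arbitrary split ratio and seed:

```python
import random

def split_evals(examples, holdout_frac=0.2, seed=42):
    # Train split drives prompt/tool updates; holdout split is never used to
    # pick changes, only to check they generalize (guarding against overfitting).
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]
```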

LangChain Blog

A LangChain engineer built an automated post-deployment self-healing system that detects regressions using Poisson statistical testing on normalized error signatures, triages causality via an LLM agent to filter false positives, and automatically opens PRs via a coding agent, eliminating manual investigation loops for production bugs. The key innovation is the triage layer, which validates causal links between code changes and errors before invoking expensive fix attempts, reducing hallucination-driven false fixes.
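
The summary does not spell out the exact statistical test, so this sketch assumes a simple Poisson tail-probability check: each normalized error signature has a historical per-hour rate, and a post-deploy window is flagged when the observed count would be implausible under that rate:

```python
import math

def poisson_sf(k, lam):
    # P(X >= k) for X ~ Poisson(lam), computed as 1 - CDF(k - 1).
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def is_regression(baseline_rate_per_hour, window_hours, observed_errors, alpha=0.001):
    # Flag a signature when the observed count in the window is extremely
    # unlikely under the historical Poisson rate.
    lam = baseline_rate_per_hour * window_hours
    return poisson_sf(observed_errors, lam) < alpha
```

Flagged signatures would then go to the LLM triage step, which decides whether the recent code change plausibly caused them before a fix agent is invoked.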

LangChain Blog

Open models (GLM-5, MiniMax M2.7) now achieve parity with frontier models on core agentic tasks (tool use, file operations, instruction following) at 8-10x lower cost and 4x lower latency, making them production-viable alternatives when evaluated on standardized benchmarks. The article provides concrete eval methodology, performance metrics across seven task categories, and practical integration patterns for teams building agent systems.

LangChain Blog

LangChain and MongoDB have integrated vector search, persistent agent state management, natural-language query generation, and end-to-end observability into a single platform, allowing teams to build production agents without standing up parallel infrastructure for retrieval, memory, and monitoring. This reduces operational complexity by consolidating agent backends, checkpoint storage, and observability into existing MongoDB deployments rather than requiring separate vector databases, state stores, and analytics systems.

LangChain Blog

Agent evaluation requires a systematic approach starting with manual trace review and clear success criteria, then layering in capability vs. regression evals at the appropriate level (trace-level first), with 60-80% of effort spent on root cause analysis before automation. The core insight is that infrastructure issues and ambiguous success criteria masquerade as agent failures, so teams must separate signal from noise through domain expert ownership and structured error taxonomy before building eval infrastructure.

LangChain Blog

Kensho built Grounding, a LangGraph-based multi-agent framework that routes natural language queries to specialized Data Retrieval Agents across fragmented financial datasets, then aggregates responses—demonstrating that production multi-agent systems require three critical components: embedded observability/tracing for debugging, multi-stage evaluation metrics (routing accuracy, data quality, completeness), and standardized protocols for consistent agent communication at scale.

LangChain Blog

OpenTelemetry Profiles has reached public alpha, establishing a unified industry standard for continuous production profiling alongside traces, metrics, and logs—enabling teams to capture low-overhead performance data for production troubleshooting and cost optimization without vendor lock-in. This standardization addresses a long-standing gap in observability infrastructure by providing a common protocol where format fragmentation (JFR, pprof) previously existed.

OpenTelemetry Blog

This talk provides a comprehensive framework for the full lifecycle of production AI agents, covering evaluation loops with LLM-as-a-judge metrics, context engineering optimization, tool hardening, and observability/governance practices. The key technical takeaway is that reliable agent production requires systematic evaluation frameworks, token/cost optimization through context compaction, failure handling patterns (circuit breakers), and continuous monitoring—not just building working demos.
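
The circuit-breaker failure-handling pattern mentioned above can be sketched as a small wrapper around tool calls; the threshold, cooldown, and half-open behavior are illustrative defaults, not the talk's implementation:

```python
import time

class CircuitBreaker:
    # Opens after `threshold` consecutive failures; rejects calls until
    # `cooldown` seconds pass, then allows one trial call (half-open).
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: tool temporarily disabled")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Wrapping flaky tools this way stops an agent from burning tokens retrying a dead dependency and gives monitoring a clean "circuit open" signal to alert on.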

Arize AI Youtube

Prompt learning is an automated optimization technique that uses evaluation feedback and meta-prompting to iteratively improve LLM system prompts by running evals against benchmark datasets, extracting English-language failure insights, and feeding those into a meta-prompt to generate updated instructions. This creates a closed-loop system for prompt optimization that can be applied to coding agents and other LLM applications, with practical implementation shown using SWE-Bench, Arize AX, and Phoenix.
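
The closed loop can be sketched abstractly; `eval_fn`, `critique_fn`, and `rewrite_fn` are placeholders for the benchmark run, the English-language failure-insight extraction, and the meta-prompt call, and the toy stand-ins below are invented for the example:

```python
def prompt_learning_step(prompt, eval_fn, critique_fn, rewrite_fn):
    # One iteration: run evals, turn failures into English-language insights,
    # then let a meta-prompt rewrite the instructions using those insights.
    failures = eval_fn(prompt)
    if not failures:
        return prompt  # nothing to learn from this round
    insights = [critique_fn(f) for f in failures]
    return rewrite_fn(prompt, insights)

# Toy stand-ins: a real system runs benchmark tasks and calls an LLM here.
def toy_eval(prompt):
    return [] if "cite sources" in prompt else ["answer lacked citations"]

def toy_critique(failure):
    return f"insight: {failure}"

def toy_rewrite(prompt, insights):
    return prompt + "\nAlways cite sources."
```

Iterating `prompt_learning_step` until `eval_fn` returns no failures (or scores plateau) is the whole loop; the article's version runs it against SWE-Bench with Arize AX and Phoenix supplying the eval results.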

Arize AI Youtube

Context window management for AI agents requires strategic pruning and retrieval techniques—middle truncation, deduplication, memory systems, and sub-agent decomposition—rather than naive context stuffing, as the volume of traces, tool outputs, and conversation history quickly exceeds token limits and degrades agent performance. Teams must choose between lossy compression strategies (truncation, pruning) and retrieval-augmented approaches based on their agent's task characteristics and error tolerance.

Arize AI Youtube

Prompt Learning is a systematic technique that optimizes LLM agent instructions by analyzing git history and failure data to generate better prompts, achieving 5-20% relative performance improvements on coding tasks without model changes or fine-tuning. This approach is directly applicable across multiple coding agents (Claude Code, Cursor, Cline, Windsurf) and demonstrates that prompt optimization from production failure patterns can be a high-ROI alternative to model upgrades.

Arize AI Youtube

Hex's production data agents reveal that verification and evaluation at scale require domain-specific harness design, custom orchestration for ~100K tokens of tools, and long-horizon simulation evals, not standard benchmarks, to catch failure modes that current models systematically exhibit. Data agents are fundamentally harder to verify than code agents because correctness requires semantic validation of analytical reasoning, not just syntax.

LangChain Youtube

Arize AX now offers native integration with NVIDIA NIM, enabling enterprises to connect self-hosted NIM inference endpoints directly to Arize's platform for unified monitoring, evaluation, and experimentation without custom configuration. This integration closes the observability gap for on-premises model deployments and enables continuous improvement loops through production data evaluation, human-in-the-loop curation, and fine-tuning workflows.

Arize AI Blog

LangSmith Fleet now integrates Arcade.dev's MCP gateway, providing agents with secure, centralized access to 7,500+ pre-optimized tools through a single endpoint while handling per-user authorization and credential management—eliminating the integration tax of managing individual tool connections and API quirks. Arcade's agent-specific tool design (narrowed schemas, LLM-optimized descriptions, consistent patterns) addresses the core problem that REST APIs designed for human developers create hallucination and token waste when called by LLMs operating from natural language context.

LangChain Blog

LangSmith's Polly AI assistant automates trace analysis and debugging workflows by contextually analyzing execution logs and experiment data and by suggesting prompt improvements, reducing manual navigation overhead in LLM observability. For teams running LLM systems in production, this represents a meaningful productivity improvement in the debugging/iteration cycle, though it's primarily a UX enhancement rather than a fundamental observability capability.

LangChain Youtube

Modern AI agents decompose into three modular components—model, runtime, and harness—and Nvidia/LangChain have released open-source alternatives (Nemotron 3, OpenShell, DeepAgents) that replicate proprietary agent architectures, enabling teams to build and customize agents without vendor lock-in. This matters for production LLMOps because it provides a reference architecture and tooling for understanding agent internals, debugging behavior, and maintaining control over the full stack.

LangChain Youtube

This preview of the Interrupt 2026 conference focuses on moving AI agents from proof-of-concept to enterprise production, featuring talks from Lyft, Apple, LinkedIn, and others on evaluation systems, low-code agent platforms, and production-scale infrastructure. The key technical themes are building robust evals tied to product policies, dynamic graph construction at scale, and closing feedback loops between failed traces and engineering teams.

LangChain Blog