Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA

Datadog Blog

Operating agentic AI systems in production introduces observability challenges that traditional APM tools weren't designed to handle. Unlike stateless API calls, agents make sequential decisions, invoke multiple tools, and maintain context across multi-step workflows. When an agent fails to complete a task or produces incorrect results, you need visibility into its reasoning chain, not just HTTP status codes and latency percentiles.

NTT DATA's implementation with Amazon Bedrock AgentCore and Datadog LLM Observability demonstrates a practical approach to this problem. AgentCore handles the orchestration layer—managing tool invocation, state persistence, and the control flow between LLM calls. It's essentially a workflow engine optimized for agentic patterns, with built-in retry logic and guardrails. The key architectural decision here is separating agent execution from observability, rather than building custom logging into each agent.
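To make that separation concrete, here's a minimal sketch of the pattern: an orchestration loop with retry logic where observability is injected as a pluggable observer rather than hardcoded into each agent. All names here are hypothetical illustrations, not the AgentCore or Datadog SDK.

```python
# Sketch of execution/observability separation: the orchestrator emits
# events to an injected observer instead of logging inline. Hypothetical
# names throughout -- this is not the AgentCore API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepEvent:
    tool: str
    attempt: int
    ok: bool
    output: str

@dataclass
class Orchestrator:
    observer: Callable[[StepEvent], None]  # observability plugs in here
    max_retries: int = 2

    def run_tool(self, name: str, fn: Callable[[], str]) -> str:
        # Built-in retry logic: each attempt, success or failure, is
        # reported to the observer before the orchestrator decides what
        # to do next.
        for attempt in range(1, self.max_retries + 2):
            try:
                out = fn()
                self.observer(StepEvent(name, attempt, True, out))
                return out
            except Exception as exc:
                self.observer(StepEvent(name, attempt, False, str(exc)))
                if attempt > self.max_retries:
                    raise

events = []
orch = Orchestrator(observer=events.append)
orch.run_tool("search_docs", lambda: "3 documents found")
print(events[0].ok)  # True
```

Because the observer is just a callable, swapping console logging for a tracing backend doesn't touch any agent code, which is the point of keeping the two layers separate.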

The observability piece is where this gets interesting for SREs. Datadog LLM Observability doesn't just capture request/response pairs. It traces the entire agent execution graph: which tools were considered, why specific paths were chosen, token consumption at each step, and the quality of intermediate outputs. This matters because agent failures rarely present as simple errors. An agent might successfully complete its workflow while producing subtly incorrect results, or it might get stuck in loops that burn through your token budget without raising any obvious exception.
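The shape of that telemetry is worth visualizing. A single agent run produces a tree of spans, each recording what happened and what it cost; the sketch below mirrors the concepts described above (agent, LLM, tool, and retrieval steps with per-step token counts), not any specific SDK's schema.

```python
# Illustrative span tree for one agent run. The structure and numbers are
# made up; the point is that token cost attaches to every node, so cost
# and behavior can be rolled up per step.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                  # "agent", "llm", "tool", ...
    input_tokens: int = 0
    output_tokens: int = 0
    children: list = field(default_factory=list)

    def total_tokens(self) -> int:
        return (self.input_tokens + self.output_tokens
                + sum(c.total_tokens() for c in self.children))

trace = Span("support_agent", "agent", children=[
    Span("plan", "llm", input_tokens=420, output_tokens=85),
    Span("search_docs", "tool", children=[
        Span("embed_query", "llm", input_tokens=12),
    ]),
    Span("draft_reply", "llm", input_tokens=900, output_tokens=240),
])

print(trace.total_tokens())  # 1657
```

With this structure, questions like "which step burned the tokens?" or "did the agent loop on the same tool?" become simple tree queries instead of log archaeology.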

Consider a customer support agent that retrieves documentation, analyzes the user's issue, and generates a response. Traditional metrics show you latency and throughput. LLM observability shows you that the agent is consistently retrieving irrelevant documentation because your embedding model doesn't handle domain-specific terminology well, or that it's making unnecessary tool calls because the prompt doesn't clearly specify when to stop searching. These are operational issues that directly impact your LLM API costs and user experience, but they're invisible to conventional monitoring.
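A check like the retrieval-relevance problem above can be expressed as a simple pipeline rule. The sketch below uses naive word overlap as a stand-in for a real relevance metric (an embedding-based score in practice); the function names and threshold are illustrative.

```python
# Flag retrieval results that score poorly against the query. Word overlap
# is a toy proxy here -- production systems would use embedding similarity
# or an LLM judge -- but the monitoring shape is the same.
def relevance(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def flag_irrelevant(query: str, docs: list, threshold: float = 0.25) -> list:
    return [doc for doc in docs if relevance(query, doc) < threshold]

docs = [
    "reset your account password from the login page",
    "quarterly revenue grew in the enterprise segment",
]
print(flag_irrelevant("how do I reset my password", docs))
# → ['quarterly revenue grew in the enterprise segment']
```

Run continuously over retrieval spans, a rising flag rate points at the embedding model or the corpus long before users complain about answer quality.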

The evaluation component is equally critical. Datadog's approach lets you define evaluators—essentially automated checks that score agent outputs against criteria like relevance, factual accuracy, or adherence to brand guidelines. These run continuously in production, not just during development. You can set up alerts when evaluation scores drop below thresholds, similar to how you'd alert on error rates or p99 latency. This shifts LLM quality from a pre-deployment concern to an ongoing operational metric.
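The evaluator-plus-alert pattern can be sketched in a few lines. The criteria below (a length check and a banned-phrase check) are toy examples, and the scoring and windowing logic is an assumption about how one might wire this up, not Datadog's implementation.

```python
# Toy evaluators scored per response, plus a windowed threshold alert that
# treats quality like an error rate. All criteria and numbers are illustrative.
from statistics import mean

def length_evaluator(response: str) -> float:
    # Penalize terse, unhelpful replies; saturates at 20 words.
    return min(len(response.split()) / 20.0, 1.0)

def brand_evaluator(response: str) -> float:
    banned = {"guarantee", "refund immediately"}
    return 0.0 if any(b in response.lower() for b in banned) else 1.0

def score(response: str) -> float:
    return mean(e(response) for e in (length_evaluator, brand_evaluator))

def should_alert(scores: list, threshold: float = 0.7, window: int = 5) -> bool:
    # Alert only once a full window of recent scores averages below threshold,
    # the same shape as alerting on error rate or p99 latency.
    recent = scores[-window:]
    return len(recent) == window and mean(recent) < threshold

recent_scores = [score("Sorry, no.") for _ in range(5)]
print(should_alert(recent_scores))  # True
```

The windowing matters: individual LLM outputs are noisy, so alerting on a single bad score produces pager fatigue, while a rolling average catches genuine regressions.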

From a cost perspective, token-level tracing becomes essential at scale. AgentCore workflows can easily consume 10-50x more tokens than single LLM calls due to tool use and multi-step reasoning. Without detailed breakdowns showing which agent components are driving costs, optimization is guesswork. The integration surfaces this data alongside traditional infrastructure metrics, so you can correlate token consumption with actual business outcomes.
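Cost attribution from token-level traces reduces to a rollup over spans. In this sketch the component names, token counts, and per-1K prices are all placeholders (not real Bedrock pricing); the point is that per-span token data makes the breakdown a one-pass aggregation.

```python
# Roll per-span token counts up by agent component to see what drives spend.
# Prices and spans are placeholder data, not real rates or real traces.
from collections import defaultdict

PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # assumed, illustrative

spans = [
    {"component": "planner",   "input": 4200, "output": 850},
    {"component": "retriever", "input": 1200, "output": 0},
    {"component": "writer",    "input": 9000, "output": 2400},
]

def cost_by_component(spans: list) -> dict:
    totals = defaultdict(float)
    for s in spans:
        totals[s["component"]] += (
            s["input"] / 1000 * PRICE_PER_1K["input"]
            + s["output"] / 1000 * PRICE_PER_1K["output"]
        )
    return dict(totals)

print(cost_by_component(spans))
```

With the breakdown in hand, optimization stops being guesswork: if the writer component dominates, prompt trimming or a cheaper model for drafting has a measurable payoff, which is exactly the correlation with business outcomes the integration is after.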

The practical takeaway for platform teams: if you're running agentic workflows in production, you need observability that understands agent semantics. Bolting LLM tracing onto existing APM tools as an afterthought leaves critical gaps. The AgentCore and Datadog pairing shows what purpose-built tooling looks like—execution infrastructure that exposes the right hooks, paired with observability that captures agent-specific telemetry. Whether you use these specific tools or not, the pattern is worth replicating: treat agent observability as a first-class requirement, not a logging problem.