AI Operations · Technical

Tracing AI Agent Runs: Request IDs, Spans and Replayable Execution

Amestris — Boutique AI & Technology Consultancy

Agent failures are multi-layer failures. A user sees "it did not work", but the real cause can be a tool timeout, a permission filter, a bad retry, a prompt regression, or a routing change. Without end-to-end tracing, teams guess. With tracing, teams debug.

Tracing an agent run means you can answer three questions quickly:

  • What happened? The sequence of steps, tool calls and decisions.
  • Why did it happen? The versioned configuration and policy that drove choices.
  • Can we reproduce it? Safe replay of the run inputs and outcomes.

Define a stable ID model for agent runs

Start with identifiers that you can use across logs, traces and dashboards:

  • request_id. Unique per inbound request.
  • run_id. Unique per agent execution (can differ from request_id if you fork or retry).
  • step_id. A monotonic step counter for the run.
  • tool_call_id. Unique per tool invocation.

These IDs become the backbone for reliability metrics and incident analysis (see agent reliability metrics).
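
A minimal sketch of one way to carry these IDs through a run, assuming a Python codebase and the standard-library uuid module; the class and method names are illustrative, not a prescribed schema:

    from dataclasses import dataclass, field
    from uuid import uuid4

    @dataclass
    class RunContext:
        # request_id: one per inbound request; run_id: one per agent execution
        # (a retry or fork mints a new run_id under the same request_id).
        request_id: str
        run_id: str = field(default_factory=lambda: uuid4().hex)
        _step: int = 0  # monotonic step counter for this run

        def next_step_id(self) -> int:
            """Return the next step_id for this run."""
            self._step += 1
            return self._step

        def new_tool_call_id(self) -> str:
            """Mint a unique ID for each tool invocation."""
            return uuid4().hex

Propagate the same context object through every log line and span so the IDs line up across systems.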

Represent each step as a span

If you use distributed tracing, map the run to spans. A practical span taxonomy:

  • agent.plan (planning step, route selection).
  • agent.retrieve (RAG retrieval, ranking, filtering).
  • agent.model (LLM call, tokens, latency).
  • agent.tool (tool call, inputs, outputs, errors).
  • agent.validate (schema validation, policy checks).

Use consistent attribute names so you can aggregate across tools and routes (see telemetry schema and observability).
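
As an illustrative sketch using the OpenTelemetry Python SDK and the RunContext idea above, a tool call step could be wrapped in an agent.tool span like this; the attribute names and the invoke_tool helper are assumptions, not a standard:

    from opentelemetry import trace

    tracer = trace.get_tracer("agent")

    def call_tool(ctx, tool_name, args):
        # One span per tool invocation, tagged with the run's IDs so it can
        # be joined against logs, dashboards and decision records.
        with tracer.start_as_current_span("agent.tool") as span:
            span.set_attribute("run_id", ctx.run_id)
            span.set_attribute("step_id", ctx.next_step_id())
            span.set_attribute("tool.name", tool_name)
            try:
                result = invoke_tool(tool_name, args)  # hypothetical tool runner
                span.set_attribute("tool.status", "ok")
                return result
            except Exception as exc:
                span.set_attribute("tool.status", "error")
                span.record_exception(exc)
                raise

The same pattern applies to agent.plan, agent.retrieve, agent.model and agent.validate; only the span name and attributes change.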

Attach version metadata to every run

Most incidents are caused by change. Make change visible by recording version metadata on each run:

  • prompt_version. From your prompt registry (see prompt registries).
  • policy_version. Which policy module or ruleset applied.
  • tool_contract_version. For tools and schemas (see tool contracts).
  • route_config_version. Routing and fallback config (see configuration drift).

This is what turns debugging into "compare before and after" instead of guesswork.
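
One way to make this concrete is to resolve every version identifier once at the start of a run and attach the result to each span and log line. A sketch, assuming registry objects in your own codebase (the object and method names below are placeholders):

    def version_metadata(prompt_registry, policy_engine, tool_registry, router):
        # Resolve versions once per run so every span and log line carries them.
        return {
            "prompt_version": prompt_registry.current_version(),
            "policy_version": policy_engine.current_version(),
            "tool_contract_version": tool_registry.contract_version(),
            "route_config_version": router.config_version(),
        }

Diffing these four fields between a good run and a bad run is usually the fastest path to the change that caused an incident.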

Capture the minimum data needed for safe replay

Replay is powerful, but it can also create privacy and security risk. Store only what you need:

  • Normalised user input (with PII redacted where required).
  • Tool call arguments and tool results (with sensitive fields masked).
  • Retrieval IDs and snippets, not entire documents, where feasible.
  • Model route and parameter settings (temperature, max tokens).

Apply data minimisation and retention policies (see data minimisation and retention and deletion).
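
A minimal sketch of assembling a replay record with masking applied before anything is persisted; the redact_pii helper, the sensitive field list and the field names are assumptions about your own codebase:

    SENSITIVE_FIELDS = {"email", "phone", "account_number"}

    def mask_sensitive(payload: dict) -> dict:
        # Replace sensitive values rather than dropping keys, so replay
        # preserves the shape of the original call.
        return {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in payload.items()}

    def build_replay_record(ctx, user_input, tool_calls, retrieval_ids, model_params):
        return {
            "run_id": ctx.run_id,
            "input": redact_pii(user_input),   # hypothetical PII redactor
            "tool_calls": [
                {
                    "name": call["name"],
                    "args": mask_sensitive(call["args"]),
                    "result": mask_sensitive(call["result"]),
                }
                for call in tool_calls
            ],
            "retrieval_ids": retrieval_ids,    # IDs and snippets, not full documents
            "model_params": model_params,      # route, temperature, max tokens
        }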

Link traces to decision logs and audits

Traces tell you what happened. Decision logs tell you why it was allowed. For agentic systems, you often need both:

  • Record approvals and policy blocks (see approvals).
  • Record key routing decisions and fallbacks (see fallbacks).
  • Keep audit trails for sensitive tool actions (see decision logging).
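
Linking the two is mostly a matter of writing the same IDs into both stores. A sketch of a decision-log entry keyed by run_id and step_id, with illustrative field names:

    def log_decision(decision_store, ctx, step_id, decision, reason, policy_version):
        # Keyed by run_id + step_id so it joins cleanly against the trace.
        decision_store.append({
            "run_id": ctx.run_id,
            "step_id": step_id,
            "decision": decision,            # e.g. "approved", "blocked", "fallback"
            "reason": reason,                # policy rule or approver reference
            "policy_version": policy_version,
        })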

Use tracing as an engineering feedback loop

Tracing is not only for incidents. It also drives improvement:

  • Find slow steps and reduce latency.
  • Find repetitive tool calls and improve caching or planning.
  • Find common validation failures and improve tool contracts.
  • Correlate customer complaints to specific versions and routes.
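
As a sketch, assuming exported spans are available as dictionaries with a name and a duration, a simple aggregation is enough to surface the slowest step types:

    from collections import defaultdict
    from statistics import quantiles

    def p95_by_span_name(spans):
        # spans: iterable of {"name": str, "duration_ms": float}
        by_name = defaultdict(list)
        for span in spans:
            by_name[span["name"]].append(span["duration_ms"])
        return {
            name: quantiles(durations, n=20)[18]   # 95th percentile
            for name, durations in by_name.items()
            if len(durations) >= 20                # skip thin samples
        }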

If you want agents to be operationally real, not just demos, tracing is non-negotiable. It is the difference between "the model was weird" and "step 7 timed out after a policy update".

Quick answers

What does this article cover?

How to trace and debug agent runs with stable IDs, spans for each step, and safe replay to reproduce incidents.

Who is this for?

Teams operating AI agents in production who need a reliable way to diagnose failures across prompts, tools, retrieval and routing.

If this topic is relevant to an initiative you are considering, Amestris can provide independent advice or architecture support. Contact hello@amestris.com.au.