Agent reliability is not a single number. An agent can be great at reasoning yet fail because of tool timeouts, permission errors, bad retries, or missing guardrails. The best teams split reliability into measurable components, then attach clear SLOs and runbooks.
Start with a task-level definition of success
Define success at the task boundary, not the model boundary. A task success definition should include:
- Completion criteria. What the user expects to be true when the task is done.
- Policy constraints. What must not happen (data leakage, unsafe actions, unapproved tools).
- Time and cost bounds. How long and how expensive completion is allowed to be.
This lets you report a simple primary metric: task success rate (success / attempts) segmented by intent, channel, tenant and route.
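A minimal sketch of that metric, assuming task attempts are already logged as structured records; the field names and the time/cost bounds below are illustrative, not a required schema:

```python
# Task-level success: completed, within policy, and within time/cost bounds,
# then aggregated as success / attempts per segment.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class TaskAttempt:
    intent: str
    channel: str
    tenant: str
    route: str
    completed: bool          # completion criteria met
    policy_violation: bool   # something that must not happen, happened
    duration_s: float
    cost_usd: float

def is_success(a: TaskAttempt, max_duration_s: float = 120.0, max_cost_usd: float = 0.50) -> bool:
    """Success = completed, within policy, and within time/cost bounds."""
    return (
        a.completed
        and not a.policy_violation
        and a.duration_s <= max_duration_s
        and a.cost_usd <= max_cost_usd
    )

def success_rate_by(attempts: list[TaskAttempt], key: str) -> dict[str, float]:
    """Task success rate (success / attempts), segmented by one dimension."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [successes, attempts]
    for a in attempts:
        bucket = totals[getattr(a, key)]
        bucket[1] += 1
        if is_success(a):
            bucket[0] += 1
    return {segment: s / n for segment, (s, n) in totals.items()}
```

Calling `success_rate_by(attempts, "intent")` and repeating for channel, tenant, and route gives you the segmented view with one pass per dimension.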
Measure tool failures separately from model failures
Tooling is where most production reliability issues appear. Track tool calls as first-class events and classify failures:
- Availability. Timeouts, 5xx, DNS and dependency outages.
- AuthN/AuthZ. 401/403, expired tokens, missing scopes.
- Contract. Validation errors, schema mismatches, bad parameters.
- Rate limits. 429s, quota exhaustion, backoff misconfiguration.
From this you can compute tool success rate by tool and endpoint, plus tool latency percentiles and retry effectiveness.
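One way to wire this up, assuming each tool call is logged with a status code, an error tag, and a latency; the classification rules and field names are illustrative and should be mapped to your own telemetry:

```python
# Classify tool-call failures into the buckets above and compute
# tool success rate, latency percentiles, and retry effectiveness.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class ToolCall:
    tool: str
    endpoint: str
    status: int | None           # HTTP status, None for transport-level failures
    error: str | None            # e.g. "timeout", "dns", "schema_mismatch"
    latency_ms: float
    retry_of: str | None = None  # id of the call this one retried, if any

def classify(call: ToolCall) -> str:
    if call.error in ("timeout", "dns") or (call.status and call.status >= 500):
        return "availability"
    if call.status in (401, 403):
        return "auth"
    if call.status == 429:
        return "rate_limit"
    if call.error in ("validation", "schema_mismatch", "bad_parameters"):
        return "contract"
    return "ok" if call.status and 200 <= call.status < 300 else "other"

def tool_metrics(calls: list[ToolCall]) -> dict:
    ok = [c for c in calls if classify(c) == "ok"]
    retries = [c for c in calls if c.retry_of is not None]
    latencies = [c.latency_ms for c in calls]
    cuts = quantiles(latencies, n=100) if len(latencies) > 1 else None
    return {
        "success_rate": len(ok) / len(calls) if calls else 0.0,
        "p50_latency_ms": cuts[49] if cuts else None,
        "p95_latency_ms": cuts[94] if cuts else None,
        # Retry effectiveness: share of retries that ended in success.
        "retry_effectiveness": (
            sum(1 for c in retries if classify(c) == "ok") / len(retries)
        ) if retries else None,
    }
```

Grouping calls by `(tool, endpoint)` before passing them to `tool_metrics` gives the per-tool and per-endpoint breakdown.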
Track recovery and safe outcomes
A reliable agent does not just avoid failure; it recovers. Add metrics that capture whether the system returns to a safe path:
- Recovery rate. Failures that end in a safe completion (retry, fallback route, or handoff).
- Mean attempts to success. How many tool/model steps the agent needs to finish.
- Human escalation rate. How often you need a person to complete or approve.
If you support approvals, treat them as part of the reliability story (see agent approvals and safe tooling).
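A sketch of those three metrics, assuming each task records an outcome label, the number of steps it took, and whether it hit a failure along the way; the outcome labels are illustrative:

```python
# Recovery metrics over per-task outcomes: did failures end on a safe path,
# how many steps did success take, and how often did a human finish the job?
from dataclasses import dataclass

@dataclass
class TaskOutcome:
    outcome: str       # "success" | "recovered" | "escalated" | "failed"
    steps: int         # tool/model steps taken
    had_failure: bool  # at least one tool/model failure occurred mid-task

def recovery_metrics(tasks: list[TaskOutcome]) -> dict:
    hit_failure = [t for t in tasks if t.had_failure]
    finished = [t for t in tasks if t.outcome in ("success", "recovered")]
    return {
        # Recovery rate: tasks that hit a failure but still ended on a safe
        # path (retry/fallback counted as "recovered", handoff as "escalated").
        "recovery_rate": (
            sum(1 for t in hit_failure if t.outcome in ("recovered", "escalated"))
            / len(hit_failure)
        ) if hit_failure else None,
        # Mean attempts to success: average steps among tasks that finished.
        "mean_attempts_to_success": (
            sum(t.steps for t in finished) / len(finished)
        ) if finished else None,
        # Human escalation rate: share of all tasks that needed a person.
        "human_escalation_rate": (
            sum(1 for t in tasks if t.outcome == "escalated") / len(tasks)
        ) if tasks else None,
    }
```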
Include safety and policy signals
Most teams track success and latency but miss policy health. Add:
- Refusal rate. By intent and tenant, to catch overly strict policies.
- Policy block rate. How often tool actions are prevented and why.
- False positive review rate. Human overrides of blocks, to tune guardrails.
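A sketch of those signals over a simple event log, assuming refusals, policy blocks, and human overrides are logged as events; in practice you would segment each rate by intent and tenant as described above:

```python
# Policy-health signals from an event log. Event type names are illustrative.
from collections import Counter

def policy_signals(events: list[dict]) -> dict:
    """events: dicts with a "type" of "request", "refusal", "policy_block",
    or "block_override" (plus intent/tenant fields for segmentation)."""
    counts = Counter(e["type"] for e in events)
    requests = counts["request"]
    blocks = counts["policy_block"]
    return {
        "refusal_rate": counts["refusal"] / requests if requests else None,
        "policy_block_rate": blocks / requests if requests else None,
        # Human overrides of blocks approximate how often guardrails fired
        # on actions that were actually fine.
        "false_positive_review_rate": counts["block_override"] / blocks if blocks else None,
    }
```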
Turn metrics into SLOs and runbooks
Metrics become operational when you set targets and define responses. A simple pattern:
- SLO. For example, 99% task success for a critical workflow, measured daily.
- Error budget. Define how much failure is acceptable before a freeze.
- Runbook. Clear steps for diagnosing tool outages, routing failures and regressions.
Pair this with alerting designed for humans (see alerting and runbooks) and incident response playbooks (see incident response).
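A sketch of the SLO and error-budget check, assuming a daily rollup of task successes and attempts; the 99% target mirrors the example above, and the freeze rule is illustrative:

```python
# Daily SLO check: did we meet the target, and how much error budget is spent?
def error_budget_status(successes: int, attempts: int, slo_target: float = 0.99) -> dict:
    if attempts == 0:
        return {"slo_met": True, "budget_spent": 0.0, "freeze_recommended": False}
    success_rate = successes / attempts
    budget = 1.0 - slo_target              # allowed failure fraction, e.g. 1%
    spent = (1.0 - success_rate) / budget  # >= 1.0 means the budget is exhausted
    return {
        "success_rate": success_rate,
        "slo_met": success_rate >= slo_target,
        "budget_spent": spent,
        # One common response: freeze risky changes once the budget is spent
        # and work the runbook for the failing component.
        "freeze_recommended": spent >= 1.0,
    }

# Example: 995 successes out of 1,000 attempts meets a 99% SLO
# and spends half of the daily error budget.
print(error_budget_status(successes=995, attempts=1000))
```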
When you can explain why the agent failed and what recovered it, you are no longer guessing. You are operating a system.