AI Operations · Technical

An Error Taxonomy for AI Systems: Classifying Failures to Improve Reliability

Amestris — Boutique AI & Technology Consultancy

When an AI feature fails, the default diagnosis is often "the model is wrong". That framing slows remediation because the real issue may be retrieval, permissions, tool integration, or policy settings. A shared error taxonomy helps teams triage faster and invest in the right fixes.

A practical taxonomy

Use categories that map to system layers and owners (a minimal code sketch follows the list):

  • Intent and UX mismatch. The user asked something ambiguous; the UI did not confirm intent or show constraints.
  • Context failures. Missing policy, wrong prompt template, or stale instructions (see prompt change control).
  • Retrieval failures. No relevant sources, stale content, or poor chunking (see common RAG failures).
  • Grounding failures. The model answers without evidence or misstates retrieved facts (see citations and grounding).
  • Tool failures. Wrong arguments, partial tool results, timeouts, or API errors (see safe tooling).
  • Policy failures. Over-refusal, under-refusal, or policy gaps (see policy layering).
  • Security failures. Prompt injection, data leakage, or unsafe actions (see prompt injection defence).
  • Cost and latency failures. Token explosions, cascading retries, or slow stages (see incident response).
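
To make the categories machine-usable, some teams encode the taxonomy as a shared enumeration that logging, dashboards, and triage tooling all reference. The Python sketch below is one minimal way to do that; the owner labels are illustrative assumptions, not a prescribed team structure.

```python
from enum import Enum


class FailureCategory(str, Enum):
    """Shared failure categories, mirroring the taxonomy above."""
    INTENT_UX = "intent_ux_mismatch"
    CONTEXT = "context_failure"
    RETRIEVAL = "retrieval_failure"
    GROUNDING = "grounding_failure"
    TOOL = "tool_failure"
    POLICY = "policy_failure"
    SECURITY = "security_failure"
    COST_LATENCY = "cost_latency_failure"


# Illustrative owner mapping; adjust to your own organisation.
CATEGORY_OWNERS = {
    FailureCategory.INTENT_UX: "product",
    FailureCategory.CONTEXT: "prompt platform",
    FailureCategory.RETRIEVAL: "search/data",
    FailureCategory.GROUNDING: "ml",
    FailureCategory.TOOL: "integrations",
    FailureCategory.POLICY: "trust and safety",
    FailureCategory.SECURITY: "security",
    FailureCategory.COST_LATENCY: "sre",
}
```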

Map categories to signals and owners

A taxonomy is only useful if each category has measurable signals and a clear owner. Pair categories with operational metrics: escalation rate, groundedness checks, tool error rate, retrieval coverage, and latency. Then define targets and alerting (see SLO playbooks and observability).
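
As a sketch of what that pairing could look like in configuration, the snippet below wires categories to a metric, an owner, and an alert threshold. The metric names, owners, and threshold values are placeholders to adapt to your own observability stack, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategorySignal:
    """A measurable signal, its owner, and an alert threshold for one category."""
    category: str
    metric: str
    owner: str
    alert_threshold: float  # whether this is a floor or a ceiling depends on the metric


# Illustrative wiring; names and values are placeholders.
SIGNALS = [
    CategorySignal("retrieval_failure", "retrieval_coverage", "search/data", 0.80),
    CategorySignal("grounding_failure", "groundedness_check_pass_rate", "ml", 0.95),
    CategorySignal("tool_failure", "tool_error_rate", "integrations", 0.02),
    CategorySignal("policy_failure", "escalation_rate", "trust and safety", 0.05),
    CategorySignal("cost_latency_failure", "p95_latency_seconds", "sre", 8.0),
]
```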

Use the taxonomy to drive runbooks

Runbooks should reference the taxonomy directly. For example, a retrieval failure runbook includes ingestion checks and index freshness, while a policy failure runbook focuses on control changes and approvals. During stabilisation, teams can decide whether to pause changes or ship targeted fixes (see change freeze).
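
A small routing table keeps that link between category and runbook explicit, as in the hypothetical sketch below; the runbook paths are placeholders for your own documentation.

```python
# Illustrative routing from failure category to runbook.
RUNBOOKS = {
    "retrieval_failure": "runbooks/retrieval.md",  # ingestion checks, index freshness
    "policy_failure": "runbooks/policy.md",        # control changes, approvals
    "tool_failure": "runbooks/tooling.md",         # argument validation, timeouts, retries
    "security_failure": "runbooks/security.md",    # injection triage, data exposure review
}


def runbook_for(category: str) -> str:
    """Return the runbook for a classified incident, or a default triage guide."""
    return RUNBOOKS.get(category, "runbooks/general-triage.md")
```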

Collect the right evidence

Many failures cannot be diagnosed after the fact because the evidence was never captured. Capture the model and provider, prompt version, retrieved sources, tool results, and relevant policy versions. Balance that evidence with privacy and retention constraints (see retention and deletion).
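
One way to make that capture concrete is a structured evidence record created at request time and classified later during triage. The field names below are illustrative assumptions, not a required schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IncidentEvidence:
    """Minimum evidence captured per failure; field names are illustrative."""
    request_id: str
    occurred_at: datetime
    model_provider: str           # provider plus model identifier
    prompt_version: str           # version from prompt change control
    policy_versions: list[str]    # policies in force at the time
    retrieved_sources: list[str]  # source IDs or URIs, not full content
    tool_results: list[dict]      # tool name, argument hash, status, latency
    category: str | None = None   # assigned during triage, not at capture time
    notes: str = ""


def new_evidence(request_id: str, model_provider: str, prompt_version: str) -> IncidentEvidence:
    """Create an evidence record at request time; triage fills in the category later."""
    return IncidentEvidence(
        request_id=request_id,
        occurred_at=datetime.now(timezone.utc),
        model_provider=model_provider,
        prompt_version=prompt_version,
        policy_versions=[],
        retrieved_sources=[],
        tool_results=[],
    )
```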

Bring it into governance and audits

Governance improves when failures are classified consistently. Incident trends and error categories become part of evidence packs and audit readiness (see compliance audits and governance artefacts).
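
Once incidents carry a category, producing trend evidence is a simple aggregation. The sketch below counts classified incidents per category, using made-up data, as a starting point for an evidence pack.

```python
from collections import Counter


def category_trend(incidents: list[dict]) -> Counter:
    """Count classified incidents per category, e.g. for a quarterly evidence pack.

    Each incident dict is assumed to carry a 'category' key set during triage.
    """
    return Counter(i.get("category", "unclassified") for i in incidents)


# Usage sketch with made-up data:
incidents = [
    {"id": "inc-1", "category": "retrieval_failure"},
    {"id": "inc-2", "category": "retrieval_failure"},
    {"id": "inc-3", "category": "policy_failure"},
]
print(category_trend(incidents))  # Counter({'retrieval_failure': 2, 'policy_failure': 1})
```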

The outcome is not perfect classification. The outcome is faster diagnosis, clearer ownership, and reliability improvements that compound over time.

Quick answers

What does this article cover?

A practical way to classify AI failures so teams can diagnose issues faster and improve reliability systematically.

Who is this for?

Engineering, SRE and product teams operating AI systems in production environments.

If this topic is relevant to an initiative you are considering, Amestris can provide independent advice or architecture support. Contact hello@amestris.com.au.