AI systems fail in more ways than traditional services. Providers degrade, retrieval returns stale sources, prompt updates shift refusal behaviour, and tool calls can amplify incidents. Chaos engineering game days make those failures rehearsed rather than surprising.
What to test in AI game days
Start with scenarios that map to your highest-risk dependencies (a fault-injection sketch follows this list):
- Provider outage or throttling. High latency, increased error rates, partial outages.
- Routing failures. Fallback models are misconfigured or violate residency constraints (see routing and failover).
- RAG degradation. Index stale, retriever returns nothing, or ranking collapses (see common RAG failures).
- Tool failures. Downstream API timeouts, schema mismatches, idempotency conflicts (see tool reliability).
- Cost explosions. Retry loops, context growth, runaway tool loops (see cost anomaly detection).
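One way to rehearse the provider scenarios is to inject faults in front of the real SDK call during the game day. A minimal sketch, assuming a hypothetical call_provider() entry point; the error rate, latency range, and exception type are illustrative, not a real SDK:

```python
import random
import time

class ProviderError(Exception):
    """Injected failure standing in for a provider 5xx/429."""

def call_provider(prompt: str) -> str:
    # Stand-in for the real provider SDK call.
    return f"response to: {prompt}"

def chaotic_call(prompt: str, error_rate: float = 0.2,
                 extra_latency_s: float = 2.0) -> str:
    """Wrap the provider call with injected errors and added latency."""
    if random.random() < error_rate:
        raise ProviderError("injected 503: provider unavailable")
    time.sleep(random.uniform(0.0, extra_latency_s))  # simulated slowdown
    return call_provider(prompt)
```

Routing game-day traffic through chaotic_call instead of the raw SDK lets you observe whether your retries, timeouts, and fallbacks behave as designed.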
Define the safe degradation paths
A game day is not about creating chaos for its own sake. It is about verifying that the system degrades safely:
- Low-cost mode. Smaller context budgets, no reranking, fewer tool calls.
- Read-only mode. Disable action-taking tools and return guidance instead (see safe tooling).
- Evidence-first mode. Require citations and refuse unsupported claims (see citations and grounding).
- Human fallback. Escalate high-stakes cases rather than guessing (see human-in-the-loop).
Feature flags are the fastest operational lever for these transitions (see feature flags).
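One way to wire this up is a single flag that selects a mode's parameters in one place, so flipping it changes behaviour everywhere at once. A minimal sketch, assuming a hypothetical get_flag() lookup; the mode names and budgets are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode:
    max_context_tokens: int
    rerank: bool
    action_tools: bool
    require_citations: bool

# Illustrative parameters for each degradation path.
MODES = {
    "normal":         Mode(8000, True,  True,  False),
    "low_cost":       Mode(2000, False, True,  False),  # smaller budget, no reranking
    "read_only":      Mode(8000, True,  False, False),  # action tools disabled
    "evidence_first": Mode(8000, True,  True,  True),   # refuse unsupported claims
}

def get_flag(name: str, default: str) -> str:
    return default  # stand-in for a real feature-flag service lookup

def current_mode() -> Mode:
    return MODES[get_flag("degradation_mode", "normal")]
```

Because the transition is a flag flip rather than a deploy, it is also reversible in seconds once the incident clears.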
Instrument what you need to diagnose
Game days expose telemetry gaps. For each scenario, ensure you can answer:
- Which model/provider and prompt version was used?
- What routing rules and policy checks were applied?
- What sources were retrieved and cited?
- What tools were invoked and what side effects occurred?
This is where structured telemetry and decision logging pay off (see telemetry schema and decision logging).
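One way to guarantee those answers is to emit a single structured record per request. A sketch, assuming hypothetical field names; align them with your own telemetry schema:

```python
import json
import time
import uuid

def log_decision(model: str, prompt_version: str, route: str,
                 policy_checks: list[str], sources: list[str],
                 tool_calls: list[dict]) -> None:
    """Emit one structured record answering the four questions above."""
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model,                  # which model/provider handled it
        "prompt_version": prompt_version,
        "route": route,                  # routing rule that selected the model
        "policy_checks": policy_checks,  # policies evaluated on the request
        "sources": sources,              # documents retrieved and cited
        "tool_calls": tool_calls,        # tool names plus recorded side effects
    }
    print(json.dumps(record))  # in practice, ship to your log pipeline
```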
Use SLOs to decide when to stop the line
Game days should verify your stop-the-line rules: when quality or safety signals degrade, do you pause releases and stabilise? Use SLOs and error budgets to pace change (see SLO playbooks and change freeze).
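A minimal sketch of such a gate, assuming a 99% quality SLO over graded answers; the target and signal are illustrative, not a standard:

```python
def budget_remaining(slo_target: float, observed_good_ratio: float) -> float:
    """Fraction of error budget left; <= 0 means the budget is spent."""
    allowed_bad = 1.0 - slo_target
    actual_bad = 1.0 - observed_good_ratio
    return 1.0 - (actual_bad / allowed_bad) if allowed_bad else 0.0

def should_freeze_changes(observed_good_ratio: float,
                          slo_target: float = 0.99) -> bool:
    return budget_remaining(slo_target, observed_good_ratio) <= 0.0

# Example: 98.5% of graded answers met the bar against a 99% SLO,
# so 1.5% bad vs. 1% allowed -> budget spent -> freeze releases.
assert should_freeze_changes(0.985)
```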
Turn game day findings into durable controls
Every game day should produce improvements:
- New runbook steps and alert thresholds (see alerting and runbooks).
- New synthetic checks and golden queries (see synthetic monitoring; a check sketch follows this list).
- Fixes to routing, caching, or tool contracts.
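A sketch of such a golden-query check, assuming a hypothetical answer() entry point; the query and expectations are illustrative:

```python
GOLDEN_QUERIES = [
    {"q": "What is our refund window?",
     "must_contain": "30 days", "must_cite": True},
]

def answer(query: str) -> dict:
    # Stand-in: call the real system, return text plus citations.
    return {"text": "Refunds are accepted within 30 days.",
            "citations": ["policy.md"]}

def run_golden_checks() -> list[str]:
    failures = []
    for case in GOLDEN_QUERIES:
        result = answer(case["q"])
        if case["must_contain"] not in result["text"]:
            failures.append(f"missing expected content: {case['q']}")
        if case["must_cite"] and not result["citations"]:
            failures.append(f"no citations: {case['q']}")
    return failures  # schedule this and alert if non-empty
```

Running these on a schedule keeps the regressions a game day uncovered from quietly returning.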
Practised recovery is one of the strongest trust signals you can build into an AI capability.