Many AI regressions are not caused by the model. They are caused by drift: prompt templates changed without review, routing rules adjusted in production, or policy packs that have diverged across environments. Drift turns AI systems into moving targets and makes incidents harder to diagnose.
Define what counts as configuration
In LLM systems, configuration includes (a minimal schema is sketched after this list):
- Prompt template versions and safety prompts.
- Policy packs and output scanning thresholds.
- Routing rules and fallback models (see routing).
- Tool enablement and schemas (see tool authorisation).
- Retrieval configuration and source allowlists for RAG.
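To make the list concrete, here is a minimal sketch of that configuration surface as a single typed record. The class and field names (`LLMConfig`, `prompt_template_version`, and so on) are illustrative, not a prescribed schema:

```python
# A minimal sketch of the configuration surface as one typed record.
# All class and field names are illustrative, not a prescribed schema.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen so snapshots are immutable and comparable
class LLMConfig:
    prompt_template_version: str                 # e.g. "support-v14"
    safety_prompt_version: str                   # system/safety prompt revision
    policy_pack_version: str                     # output-scanning rules and thresholds
    routing_rules_version: str                   # primary route and fallback models
    enabled_tools: tuple[str, ...]               # tool allowlist, with schema versions
    retrieval_source_allowlist: tuple[str, ...]  # approved RAG sources

config = LLMConfig(
    prompt_template_version="support-v14",
    safety_prompt_version="safety-v3",
    policy_pack_version="policy-2024-06",
    routing_rules_version="routes-v9",
    enabled_tools=("search@2", "refund@1"),
    retrieval_source_allowlist=("docs.example.com",),
)
```

Freezing the record means two snapshots compare with `==` and hash deterministically, which the detection sketch later in this page relies on.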
Use registries as systems of record
Registries reduce drift by making the "official" versions explicit (a minimal registry is sketched after this list):
- Prompt registry. Version prompts and policies, and record release history (see prompt registry).
- Model registry. Track model routes, policy constraints, and deployment gates (see model registry).
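As a sketch of the system-of-record idea, the minimal in-memory registry below answers one question: what is the official version of an artifact in a given environment? The `Registry` class and its methods are assumptions for illustration, not the API of any particular registry product:

```python
# A minimal in-memory registry sketch: one place answers
# "what is the official version of this artifact in this environment?"
class Registry:
    def __init__(self) -> None:
        # (artifact name, environment) -> approved version
        self._records: dict[tuple[str, str], str] = {}

    def publish(self, name: str, env: str, version: str) -> None:
        """Record a version as the official one for an environment."""
        self._records[(name, env)] = version

    def official_version(self, name: str, env: str) -> str:
        """Return the single source of truth; fail loudly if unregistered."""
        try:
            return self._records[(name, env)]
        except KeyError:
            raise LookupError(f"no approved version of {name} in {env}") from None

registry = Registry()
registry.publish("prompt:support", "prod", "support-v14")
assert registry.official_version("prompt:support", "prod") == "support-v14"
```

The point of the single lookup is that every other system, including drift detection below, compares against this one answer rather than against each other.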
Detect drift continuously
Drift detection is a comparison problem: what is running versus what should be running. Practical patterns (the core comparison is sketched after this list):
- Config snapshots. Periodically snapshot runtime configuration and compare to registry state.
- Decision logging. Record applied versions per request and alert on unknown versions (see decision logging).
- Synthetic checks. Run golden queries and alert when behaviour shifts unexpectedly (see synthetic monitoring).
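The comparison itself fits in a few lines, assuming config snapshots arrive as plain dictionaries of artifact name to version; the field names and values below are illustrative:

```python
# A sketch of the comparison at the heart of drift detection: hash the
# running state for a cheap equality check, then diff keys to localise it.
import hashlib
import json

def fingerprint(config: dict) -> str:
    """Stable hash of a snapshot; sorted keys make it deterministic."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_drift(runtime: dict, registry: dict) -> list[str]:
    """Return the keys where the running config diverges from the registry."""
    return [key for key in registry if runtime.get(key) != registry[key]]

registry_state = {"prompt": "support-v14", "policy": "policy-2024-06"}
runtime_state = {"prompt": "support-v15", "policy": "policy-2024-06"}

# Cheap periodic check: compare fingerprints, then localise the drift.
if fingerprint(runtime_state) != fingerprint(registry_state):
    print(f"drift detected in: {detect_drift(runtime_state, registry_state)}")
    # -> drift detected in: ['prompt']
```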
Make drift prevention part of delivery
Prevention is cheaper than detection. Use these controls (a release-ring sketch follows the list):
- Feature flags and release rings for risky changes (see feature flags).
- Approvals for high-risk controls and tool changes (see approvals).
- Regression suites for prompt changes (see prompt regression testing).
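As one example of the release-ring pattern, the sketch below gates a risky change by ring membership so it reaches internal users before everyone else; the ring names and ordering are assumptions, not a standard:

```python
# A sketch of release rings: a risky change is enabled per ring, so it
# reaches internal users before the general population.
RINGS = {"internal": 0, "beta": 1, "general": 2}

def change_enabled(change_ring: str, user_ring: str) -> bool:
    """A change enabled at ring N is visible to ring N and anything inside it."""
    return RINGS[user_ring] <= RINGS[change_ring]

# A new prompt version enabled only up to the beta ring:
assert change_enabled("beta", "internal") is True
assert change_enabled("beta", "general") is False
```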
Respond to drift like an incident
When drift is detected, treat it as an incident: freeze risky changes, restore known-good versions, and document what changed and why (see incident response and change freeze).
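A minimal sketch of that response path, reusing the hypothetical `Registry` from earlier; the freeze here is only recorded, on the assumption that deploy tooling enforces it:

```python
# Treat drift as an incident: freeze further changes to the artifact,
# then restore the known-good version from the registry.
frozen: set[tuple[str, str]] = set()

def respond_to_drift(registry: "Registry", name: str, env: str, running: str) -> str:
    """Freeze the artifact and return the version to redeploy."""
    frozen.add((name, env))  # recorded so deploy tooling can honour the freeze
    known_good = registry.official_version(name, env)
    if running != known_good:
        # document what changed and why before restoring
        print(f"incident: {name}/{env} ran {running}; restoring {known_good}")
    return known_good
```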
Stable AI systems are not built by avoiding change. They are built by controlling change and making drift visible.