Most AI safety incidents are regressions. A small prompt change increases disclosure. A routing change increases refusals. A tool contract change allows unsafe arguments. A retrieval tweak surfaces sensitive content. The fix is not more meetings. The fix is a safety evaluation suite that runs like tests.
Define the safety behaviours you must preserve
Safety is not a generic checkbox. Start by naming the behaviours that matter for your product:
- Policy compliance. Refuse disallowed intents consistently.
- Data boundaries. Do not leak secrets, prompts, or sensitive content.
- Safe tool use. Do not call tools outside approvals and allowlists.
- Grounding discipline. Do not claim facts without evidence for RAG answers.
This aligns to a guardrail taxonomy so you can map tests to controls (see guardrail taxonomy).
Build three layers of safety tests
A practical suite usually has three layers:
- Policy tests. A dataset of allowed, disallowed, and ambiguous requests scored by a rubric (see rubrics).
- Adversarial fixtures. Prompt injection cases and social engineering attempts (see prompt injection test suites).
- End-to-end scenarios. Agent journeys with tool outcomes and approvals (see simulated users).
Most teams start with policy tests and injection fixtures, then add end-to-end scenarios for their highest-risk workflows.
Cover tool misuse explicitly
If you have tools, your safety suite must include tool misuse cases. Examples:
- Unsafe parameters ("delete all", "export everything").
- Permission boundary probing (tenant IDs, group IDs, project IDs).
- Retries that amplify side effects.
Tool safety is easier when contracts are strict and errors are machine-readable (see tool contracts and tool error handling).
Include RAG-specific safety checks
RAG adds unique risks: untrusted content, stale policies, and misattributed citations. Add checks for:
- Answerability. Abstain when evidence is weak (see answerability gates).
- Citations. Citations must support claims (see citation audits).
- Isolation. Tenants and ACLs are enforced (see tenant isolation).
These checks prevent the most damaging trust failures in enterprise assistants.
Turn the suite into release gates
Safety tests are only useful if they block unsafe releases. A lightweight gating approach:
- Run safety suites on every prompt, policy, tool and retrieval change.
- Store results with version metadata (see prompt registries).
- Define thresholds for pass rates and block regressions.
This is part of an AI testing pyramid that treats safety as an engineering discipline (see testing pyramid).
Keep the suite representative over time
Suites drift. Add a refresh cadence:
- Convert incidents into permanent tests.
- Add new intents that appear in production.
- Review refusal and disclosure cases when policies change.
Evaluation drift is a known failure mode (see evaluation drift).
When safety tests run continuously and gate change, you stop relying on hope. You build a system that stays safe as it evolves.