Red teaming an agent is different from red teaming a chatbot. The moment you add tools, the system can do real work in real systems. That is where failures become incidents: data exposure, unauthorised changes, and costly loops.
A practical red team focuses on realistic scenarios that combine model behaviour, tool contracts, permissions and user workflows. The goal is not to trick the model with clever prompts. It is to find paths that a real adversary or careless user could trigger.
Start with an attack surface map
Before writing tests, list the surfaces:
- Inputs. User prompts, file uploads, and RAG sources.
- Tools. APIs the agent can call and what they can mutate.
- Data boundaries. Tenants, ACLs, and classification levels.
- Outputs. UI, logs, and tool results returned to the user.
This map drives scenario design. It also highlights missing controls such as tool allowlists and approvals (see safe tooling and approvals).
Scenario set 1: prompt injection and instruction override
Injection is not only about "ignore previous instructions". In agent systems, it often targets tool use and data disclosure. Test:
- Documents that attempt to override tool policies or request hidden prompts.
- Instructions that encourage unsafe retries or bypass approvals.
- Content that asks the model to reveal sensitive configuration.
Pair these tests with connector hardening and provenance controls for RAG sources (see connector hardening and prompt confidentiality).
Scenario set 2: tool abuse and privilege escalation
Agents fail dangerously when tools are too powerful or contracts are ambiguous. Test:
- Parameter smuggling. Can a benign request be turned into a risky tool call?
- Scope confusion. Can the agent access another tenant or project by changing IDs?
- Write amplification. Can it create many changes quickly (loops, retries, fan-out)?
Tool contracts and validation are the best first-line defense (see tool contracts and tool authorisation).
Scenario set 3: data exfiltration via summarisation
Even when direct disclosure is blocked, agents can leak through "helpful summaries". Test:
- Requests to summarise private documents with sensitive fields.
- Requests to generate examples that reproduce secrets or PII.
- Requests that combine multiple sources to reconstruct restricted information.
Use PII scanning and redaction on both inputs and outputs (see PII redaction pipelines).
Scenario set 4: safety regressions after change
Many incidents are regressions: a prompt change, a tool change, or a routing change breaks guardrails. Add regression tests for:
- Refusal behaviour for disallowed intents.
- Approval gating for sensitive actions.
- Grounding and citation requirements for knowledge answers.
Track outcomes per version so you can tie incidents to changes (see prompt registries and configuration drift).
Make red teaming operational, not occasional
Red teaming should not be a one-off workshop. Make it part of delivery:
- Maintain a small scenario suite that runs in CI (see testing pyramid).
- Add production monitoring for tool errors, loops and policy blocks (see reliability metrics).
- Feed incidents back into scenarios and datasets (see incident response).
When you red team agents with realistic scenarios, you stop debating hypothetical risks and start fixing concrete failure paths.