Prompt injection is the most common real-world attack on LLM applications: you don’t exploit the model, you simply feed it instructions disguised as data. In enterprise systems, that data comes from customer messages, tickets, emails, PDFs, wiki pages, and even tool outputs—exactly the surfaces you want to automate.
It is also different from a jailbreak. A jailbreak is a user trying to persuade the model; prompt injection is an attacker hiding directives inside content so the model follows them while doing something else (summarising a document, answering a question, filing a case, issuing a refund).
Treat prompts as an interface boundary. Combine zero-trust AI integration thinking with safe tool exposure: validate what enters the model, and strictly constrain what the model can do.
Where injection enters
Injection typically arrives through three paths. First is direct user input: instructions embedded in “notes”, “attachments”, or “copied from a doc” text. Second is retrieval: a malicious or compromised document in your knowledge base can instruct the model to ignore policies, leak secrets, or call tools. Third is tool output: if you pass tool responses back into the model without sanitisation, an upstream system can become an injection source.
Agents amplify the impact because they combine language with actions. A single successful injection can trigger a chain: call a tool, pull more data, and then exfiltrate it in the final response. This is why “the model should ignore it” is not a control—controls must be enforced outside the model.
Design a layered defence
There is no single “magic prompt” that blocks injection. You need multiple layers that fail independently. A useful mental model is: sanitise inputs, separate instructions from data, constrain capabilities, and observe behavior.
Separation matters most: keep system instructions short and stable, and present retrieved documents as quoted, untrusted context with explicit boundaries. If you need dynamic policy, implement it as code (or a policy engine), not as prose in the prompt.
- Validate and canonicalise untrusted inputs. Normalise encodings, strip invisible characters, and detect instruction-like blocks before they reach the prompt.
- Fence retrieved content. Wrap documents as “untrusted context”, prefer citations, and avoid copying large blobs that can smuggle directives.
- Enforce retrieval permissions. Apply ACL checks at query time and keep tenant boundaries strict so one customer’s documents can’t influence another’s answers.
- Gate tool calls with policy. Enforce allowlists, parameter schemas, and preconditions (identity verified, amount within limit). Reject ambiguous calls rather than asking the model to “try again”.
- Constrain outputs and parsing. Use typed response formats for tool plans, run strict parsers, and fall back to safe refusal or human escalation when output doesn’t validate.
- Keep secrets out of context. Never place API keys or privileged URLs in prompts; use scoped credentials in the tool layer instead.
Make it operational
Make injection defense measurable. Log prompts, retrieval IDs, tool decisions and refusal reasons with enough context to debug, but with sensitive fields redacted. Add detectors that flag abnormal tool invocation, suspicious instruction markers, and PII leakage. Wire those alerts into runbooks that tell operators when to switch to conservative prompts, disable tools, or route to a human.
When you do find a successful injection, treat it like any other security incident: identify the source (user, document, tool), quarantine it, and add the exact attack prompt to your regression suite. The fastest improvements come from turning real attempts into repeatable tests.
Test like an attacker. Maintain a red-team suite—see Red Teaming LLM Applications—and run it in CI and staging. Prompt injection is not solved once; it is managed continuously as new content sources, tools and models are introduced.