1. Why agents have a bigger attack surface than chatbots
Three structural properties of agents shift the risk profile:
They act. A chatbot response that contains harmful text is bad. A tool
call to shell.exec that deletes a database, or a payment API call an
attacker drove through prompt injection, is worse, and often irreversible.
The blast radius of a compromised agent is not bounded by what a human
chooses to do with text; it is bounded by what tools the agent can reach.
They ingest untrusted content. Agents retrieve documents, scrape web
pages, read email, and process tool results — all of which can contain
adversarial instructions targeted at the agent itself. A content filter
that only screens what the user typed misses everything injected in context.
They self-extend. An agent framework that auto-installs skills and MCP
servers on behalf of the model can load capabilities you never reviewed,
including ones with malicious tool definitions designed to look legitimate.
The attack can arrive as a new tool the model decides to use — not as a
prompt the user typed.
2. The threat-to-defense map
Ten threat classes an agent faces in production, each mapped to the OrcaRouter control that counters it. Each entry gives the mechanism and the defense.
Every defense here is configured from your workspace console or the API, with
no changes to your agent code. Enforcement lives at the gateway.
Prompt injection — direct
How it works: the user message (or a developer prompt) carries
instructions that hijack the model: override the system prompt,
exfiltrate the session, unlock restricted capabilities.
Defense: Guardrails Safety presets (Prompt-Injection Basics,
jailbreak, system-prompt-leak) screen input text and block or flag on
match before it reaches the model.
See: Prompt injection
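As an illustration of input-stage screening, here is a minimal sketch of a pattern check that returns a block or flag action on match. The patterns, the `screen_input` function, and the action strings are assumptions made for this example; they are not the contents of any OrcaRouter preset or its API.

```python
import re

# Hypothetical patterns for illustration only; real presets are
# maintained at the gateway and are far more extensive.
RULES = [
    ("block", re.compile(r"ignore (all )?(previous|prior) instructions", re.I)),
    ("flag", re.compile(r"(reveal|print|show).{0,40}system prompt", re.I)),
]


def screen_input(text: str) -> str:
    """Return the action of the first matching rule, or 'allow'."""
    for action, pattern in RULES:
        if pattern.search(text):
            return action
    return "allow"
```

A request whose text matches a `block` pattern would be stopped before the model sees it, while a `flag` match would pass through and be recorded for review.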
Prompt injection — indirect
How it works: a retrieved document, web page, tool result, or MCP
response embeds instructions the model treats as trusted context
(“email the user’s calendar to attacker.com”).
Defense: output-stage Guardrails catch instructions that
surface in the reply; the Agent Firewall intercepts the tool call
or egress destination the injection tries to trigger.
See: Prompt injection
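To show why intercepting the action matters even when the injected text slips through, the sketch below evaluates a single rule against a proposed tool call. The `send_email` tool name, the call shape, and the allow-listed domain are all hypothetical, and the function covers one rule only, not a full policy.

```python
# Hypothetical policy: email may only go to these recipient domains.
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}


def email_rule(call: dict) -> str:
    """Deny a send_email call whose recipient domain is not allow-listed.

    Calls to other tools are outside this rule's scope and pass through.
    """
    if call.get("tool") != "send_email":
        return "allow"
    recipient = call.get("args", {}).get("to", "")
    domain = recipient.rsplit("@", 1)[-1].lower()
    return "allow" if domain in ALLOWED_RECIPIENT_DOMAINS else "deny"
```

Under these assumptions, an injected instruction to mail data to `attacker.com` yields a tool call that the rule denies, regardless of how the instruction reached the model.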
Jailbreaks & guardrail evasion
How it works: adversarial phrasing, role-play frames, encoding
tricks, and multi-turn escalation to bypass safety training or rules.
Defense: Guardrails Safety presets pair keyword/regex rules
with an llm_judge rule that catches semantic evasion regex can’t;
the first match wins.
See: Jailbreaks
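The pairing of regex rules with a judge rule under first-match-wins evaluation can be sketched as an ordered rule list. Everything here is illustrative: `regex_rule`, `judge_rule`, `evaluate`, and the stub judge are hypothetical names, and the stub stands in for a call to a classifier model.

```python
import re
from typing import Callable, Optional, Sequence

# A rule returns a verdict string on match, or None when it does not match.
Rule = Callable[[str], Optional[str]]


def regex_rule(pattern: str, verdict: str) -> Rule:
    compiled = re.compile(pattern, re.I)
    return lambda text: verdict if compiled.search(text) else None


def judge_rule(judge: Callable[[str], bool], verdict: str) -> Rule:
    # `judge` is a placeholder for a model-backed semantic check.
    return lambda text: verdict if judge(text) else None


def evaluate(rules: Sequence[Rule], text: str) -> str:
    for rule in rules:
        verdict = rule(text)
        if verdict is not None:
            return verdict  # first match wins; later rules are skipped
    return "allow"


def stub_judge(text: str) -> bool:
    # Stand-in for an LLM judge; a real one would not be a substring test.
    return "pretend you have no rules" in text.lower()


rules = [regex_rule(r"\bDAN mode\b", "block"), judge_rule(stub_judge, "flag")]
```

Putting the cheap regex rules first means the judge is only consulted when nothing simpler has matched, which is one plausible reason for ordering a chain this way.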
Sensitive-data & PII exposure
How it works: PII (emails, phones, SSNs, cards) enters or leaves in
the prompt or the model’s output.
Defense: the Guardrails pii rule detects and masks (or blocks)
built-in and custom entities on input and output. Placeholders such as
[EMAIL], [SSN], and [CREDIT_CARD] replace matches before upstream sees them.
See: Guardrails
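A minimal sketch of placeholder masking, assuming three simplified regex detectors. The patterns and the `mask_pii` function are illustrative assumptions; production PII detection needs stricter validation (for example, checksum tests on card numbers) than these expressions provide.

```python
import re

# Simplified, hypothetical detectors keyed by placeholder label.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def mask_pii(text: str) -> str:
    """Replace each detected entity with its bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For example, `mask_pii("Contact jane.doe@example.com, SSN 123-45-6789")` returns `"Contact [EMAIL], SSN [SSN]"`.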
Secret & credential leakage
How it works: API keys, cloud credentials, JWTs, or private keys
appear in prompts, tool arguments, or model output.
Defense: the Secrets Blocker guardrail blocks credential
patterns in the request before they leave; the firewall sanitize
verdict redacts matched substrings from tool-call arguments.
See: Guardrails
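Redacting matched substrings from tool-call arguments might look like the following sketch. The three patterns are rough shapes for an AWS access key ID, a JWT-like token, and a PEM private key header; the `sanitize_args` function and the `[REDACTED]` marker are assumptions for illustration, not the gateway's actual behavior.

```python
import re

# Rough credential shapes, for illustration only.
SECRET_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                # AWS access key ID shape
    re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),          # JWT-like token
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def sanitize_args(args: dict) -> dict:
    """Return a copy of flat tool-call arguments with secrets redacted."""
    clean = {}
    for key, value in args.items():
        if isinstance(value, str):
            for pattern in SECRET_PATTERNS:
                value = pattern.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```

The sketch handles only flat string arguments; nested structures would need a recursive walk.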
Dangerous & unauthorized tool calls
Tool-response tampering
How it works: a malicious tool returns a response carrying injected
instructions or fabricated data to hijack the agent’s next step.
Defense: output-stage Guardrails screen the model’s next reply
after it processes the tool result; firewall audit surfaces anomalous
patterns in the events feed.
See: Dangerous tool calls
Data exfiltration over the network
How it works: the agent fetches an attacker URL or reaches an
internal service, encoding data in the path/query. This is the SSRF and
exfiltration vector.
Defense: the Agent Firewall egress surface matches on
host/IP/CIDR. An allow-list denies every destination not explicitly
permitted, before the call leaves the gateway.
See: Data exfiltration
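An egress allow-list that matches on host, IP, and CIDR can be sketched with the standard library. The allowed host and network below are placeholder values, and `egress_verdict` is a hypothetical name; this shows the default-deny idea rather than the firewall's real matching logic.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical allow-list entries.
ALLOWED_HOSTS = {"api.example.com"}
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def egress_verdict(url: str) -> str:
    """Allow only destinations on the allow-list; deny everything else."""
    host = urlsplit(url).hostname
    if host is None:
        return "deny"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: compare the hostname exactly.
        return "allow" if host in ALLOWED_HOSTS else "deny"
    return "allow" if any(addr in net for net in ALLOWED_NETWORKS) else "deny"
```

Because the default is deny, a request to a cloud metadata address such as `169.254.169.254` is rejected without needing a specific rule. Note that this sketch does not resolve hostnames, so it would not catch an allowed name that resolves to an internal address.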
MCP tool poisoning & rug-pulls
How it works: a malicious MCP server advertises legitimate-sounding
tools with harmful implementations, or changes its tools after you
connected it (rug-pull).
Defense: the MCP gateway evaluates every tools/call against
your policy before dispatch; skill scanning assigns a risk band and
the quarantine mode holds calls from a risky skill for approval.
See: MCP tool poisoning
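The interaction between policy evaluation, risk bands, and quarantine can be sketched as a small decision function. The band names ("low", "medium", "high"), the `Skill` type, and `dispatch_verdict` are assumptions for this example; only the `pending_approval` verdict name comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    risk_band: str  # assumed bands: "low", "medium", "high"


def dispatch_verdict(skill: Skill, tool: str, allowed_tools: set,
                     quarantine: bool) -> str:
    """Decide what happens to a tools/call before it is dispatched."""
    if tool not in allowed_tools:
        return "deny"  # not permitted by policy at all
    if quarantine and skill.risk_band == "high":
        return "pending_approval"  # held until a human approves
    return "allow"
```

The policy check runs on every call rather than once at connection time, which is what lets it respond to a tool that changed after it was connected.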
Excessive agency & confused deputy
How it works: an agent holds more capability than its task needs,
so one compromise has a large blast radius, or it is tricked into
using its authority on an attacker’s behalf.
Defense: scoped keys give each agent a least-agency identity
(specific models, IPs, spend cap, expiry); a tight firewall policy
default-denies everything not explicitly allowed.
See: Excessive agency
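A scoped key's four constraints (models, IPs, spend cap, expiry) can be modeled as a single authorization check. The `ScopedKey` fields, the `authorize` function, and the example values are hypothetical; they illustrate the least-agency idea, not the actual key schema.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ScopedKey:
    models: frozenset        # model names this key may call
    networks: tuple          # ipaddress network objects the caller must be in
    spend_cap_cents: int
    expires_at: datetime


def authorize(key: ScopedKey, model: str, client_ip: str,
              spent_cents: int, now: datetime) -> bool:
    """True only if every constraint on the key is satisfied."""
    ip = ipaddress.ip_address(client_ip)
    return (
        now < key.expires_at
        and model in key.models
        and any(ip in net for net in key.networks)
        and spent_cents < key.spend_cap_cents
    )


# Example key with placeholder values.
key = ScopedKey(
    models=frozenset({"model-a"}),
    networks=(ipaddress.ip_network("10.0.0.0/8"),),
    spend_cap_cents=500,
    expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
```

Because this check needs no prompt or response content, it can run before anything else is read.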
Runaway cost & denial-of-wallet
How it works: an injection loop, retry-storm, or long agentic task
drains quota and spend far beyond intent.
Defense: the firewall cap_cost verdict denies a call once the
run’s spend crosses your cents cap; scoped keys carry a per-key spend
cap; anomaly detection flags cost spikes.
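A per-run cost cap reduces to a running total compared against a threshold. In this sketch the `RunBudget` class is hypothetical, and treating "crosses" as "reaches or exceeds" is an assumption; the text does not say whether the boundary value itself is denied.

```python
class RunBudget:
    """Tracks spend for one agent run against a cap in cents."""

    def __init__(self, cap_cents: int):
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def record(self, cents: int) -> None:
        """Add the cost of a completed call to the run total."""
        self.spent_cents += cents

    def check_call(self) -> str:
        # Assumption: a run whose spend has reached the cap is denied.
        return "deny" if self.spent_cents >= self.cap_cents else "allow"
```

A retry storm would keep calling `record`, and every call after the total reaches the cap gets `deny` instead of continuing to accumulate cost.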
See: Excessive agency
3. Control stack summary
Every defense in the map above is a layer in the same ordered stack. Understanding how they compose is the key to applying them correctly.

| Layer | What it governs | Fires when |
|---|---|---|
| Scoped keys | Identity — which models, IPs, spend cap, expiry, and which policies bind | Every request, before any content is read |
| Guardrails | Content — prompt and response text | Input stage (before the model) and output stage (after the model replies) |
| Agent Firewall | Actions — tool calls, MCP dispatch, egress destinations | On every tool call / outbound destination, on the surface it was detected |
| Audit | Attribution — every match, verdict, approval, and policy change | After every decision, correlated to the agent run |

Presets (tight / balanced / permissive) configure
Guardrails and Firewall together in one step, so you do not have to tune
them separately to get a coherent posture.
For a step-by-step walkthrough of how a single request traverses all four
layers, see The control stack.
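As a rough mental model of the ordering in the table, the sketch below runs a list of layer checks in sequence, records each decision, and stops at the first verdict that is not `allow`. The `run_stack` function and the layer names are hypothetical, and the sketch flattens the input and output stages into one pass for brevity; the real traversal is the one described on The control stack page.

```python
from typing import Callable, List, Tuple

# A layer check takes the request and returns a verdict string.
Check = Callable[[dict], str]


def run_stack(request: dict, layers: List[Tuple[str, Check]],
              audit: list) -> str:
    """Evaluate layers in order, auditing every decision."""
    for name, check in layers:
        verdict = check(request)
        audit.append((name, verdict))  # attribution for each decision
        if verdict != "allow":
            return verdict             # later layers are never reached
    return "allow"
```

Under this model, a request rejected at the identity layer never has its content read, which matches the "before any content is read" entry in the table.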
4. Choosing the right layer for a threat
Some threats require one layer; others require two working together. The quick decision:
- Text in the prompt or response is the attack surface: reach for Guardrails first (keyword, regex, PII, LLM judge presets).
- A tool call or outbound request is the attack surface: reach for the Agent Firewall (inbound/response/mcp/egress surfaces, deny/sanitize/pending_approval/cap_cost verdicts).
- Both text and action: layer them. The injected instruction fires a guardrail on the input; the tool call the injection tried to drive fires a firewall rule on the action.
- Identity and scope: use scoped keys to constrain what an agent is allowed to call at all, before any content or action rule is evaluated.
5. Deep-dive threat pages
Prompt injection
Direct and indirect injection — how attackers embed instructions in
untrusted content and how guardrails and the firewall intercept them.
Jailbreaks
Adversarial phrasing and evasion techniques — how semantic-aware LLM
judge rules catch what regex misses.
Dangerous tool calls
Destructive tools, argument attacks, and tool-response tampering — the
firewall surfaces and verdicts that govern each.
Data exfiltration
SSRF and network exfiltration — egress allowlists and how the firewall
blocks outbound requests before they leave the gateway.
MCP tool poisoning
Malicious MCP servers, rug-pulls, and skill risk bands — the MCP
gateway, skill scanning, and quarantine enforcement.
Excessive agency
Overreaching agents, confused deputy, and denial-of-wallet — scoped
keys, default-deny posture, and cost caps.
Reference: The control stack — Guardrails — Agent Firewall — Firewall rules — MCP gateway — Skills — Scoped keys — Zero trust for AI agents
