The AI agent threat model

A chatbot produces text and a human reads it. An AI agent reads untrusted web pages, executes tool calls, reaches internal services, and installs capabilities it found at runtime — often with no human in the loop at all. That difference in surface area is the difference between a text-moderation problem and a full attack-surface problem. This page catalogues the threat classes your agent faces and maps each one to the OrcaRouter control that counters it. It is the hub for the Threats section; each row links to a deep-dive page. For the controls themselves, see The control stack and Securing AI agents with OrcaRouter.

1. Why agents have a bigger attack surface than chatbots

Three structural properties of agents shift the risk profile: They act. A chatbot response that contains harmful text is bad. A tool call to shell.exec that deletes a database, or a payment API call an attacker drove through prompt injection, is worse — and often irreversible. The blast radius of a compromised agent is not bounded by what a human chooses to do with text; it is bounded by what tools the agent can reach. They ingest untrusted content. Agents retrieve documents, scrape web pages, read email, and process tool results — all of which can contain adversarial instructions targeted at the agent itself. A content filter that only screens what the user typed misses everything injected in context. They self-extend. An agent framework that auto-installs skills and MCP servers on behalf of the model can load capabilities you never reviewed, including ones with malicious tool definitions designed to look legitimate. The attack can arrive as a new tool the model decides to use — not as a prompt the user typed.

2. The threat-to-defense map

Ten threat classes an agent faces in production, each mapped to the OrcaRouter control that counters it. Expand any threat for the mechanism and the defense.

Every defense here is configured from your workspace console or the API — no changes to your agent code. Enforcement lives at the gateway.

Prompt injection — direct

How it works: the user message (or a developer prompt) carries instructions that hijack the model — override the system prompt, exfiltrate the session, unlock restricted capabilities.Defense: Guardrails Safety presets (Prompt-Injection Basics, jailbreak, system-prompt-leak) screen input text and block or flag on match before it reaches the model. Prompt injection →

Prompt injection — indirect

How it works: a retrieved document, web page, tool result, or MCP response embeds instructions the model treats as trusted context (“email the user’s calendar to attacker.com”).Defense: output-stage Guardrails catch instructions that surface in the reply; the Agent Firewall intercepts the tool call or egress destination the injection tries to trigger. Prompt injection →

Jailbreaks & guardrail evasion

How it works: adversarial phrasing, role-play frames, encoding tricks, and multi-turn escalation to bypass safety training or rules.Defense: Guardrails Safety presets pair keyword/regex rules with an llm_judge rule that catches semantic evasion regex can’t — first match wins. Jailbreaks →

Sensitive-data & PII exposure

How it works: PII (emails, phones, SSNs, cards) enters or leaves in the prompt or the model’s output.Defense: the Guardrails pii rule detects and masks (or blocks) built-in and custom entities on input and output — [EMAIL], [SSN], [CREDIT_CARD] replace matches before upstream sees them. Guardrails →

Secret & credential leakage

How it works: API keys, cloud credentials, JWTs, or private keys appear in prompts, tool arguments, or model output.Defense: the Secrets Blocker guardrail blocks credential patterns in the request before they leave; the firewall sanitize verdict redacts matched substrings from tool-call arguments. Guardrails →

Dangerous & unauthorized tool calls

How it works: the agent calls destructive tools (shell.exec, db.delete), tools it should never have, or a legitimate tool with dangerous arguments.Defense: the Agent Firewall matches on tool-name globs, argument clauses, and surfaces — deny blocks, sanitize strips bad arguments, pending_approval holds for a human. Dangerous tool calls →

Tool-response tampering

How it works: a malicious tool returns a response carrying injected instructions or fabricated data to hijack the agent’s next step.Defense: output-stage Guardrails screen the model’s next reply after it processes the tool result; firewall audit surfaces anomalous patterns in the events feed. Dangerous tool calls →

Data exfiltration over the network

How it works: the agent fetches an attacker URL or reaches an internal service, encoding data in the path/query. The SSRF and exfiltration vector.Defense: the Agent Firewall egress surface matches on host/IP/CIDR — an allow-list denies every destination not explicitly permitted, before the call leaves the gateway. Data exfiltration →

MCP tool poisoning & rug-pulls

How it works: a malicious MCP server advertises legitimate-sounding tools with harmful implementations, or changes its tools after you connected it (rug-pull).Defense: the MCP gateway evaluates every tools/call against your policy before dispatch; skill scanning assigns a risk band and the quarantine mode holds calls from a risky skill for approval. MCP tool poisoning →

Excessive agency & confused deputy

How it works: an agent holds more capability than its task needs, so one compromise has a large blast radius — or it is tricked into using its authority on an attacker’s behalf.Defense: scoped keys give each agent least-agency identity (specific models, IPs, spend cap, expiry); a tight firewall policy default-denies everything not explicitly allowed. Excessive agency →

Runaway cost & denial-of-wallet

How it works: an injection loop, retry-storm, or long agentic task drains quota and spend far beyond intent.Defense: the firewall cap_cost verdict denies a call once the run’s spend crosses your cents cap; scoped keys carry a per-key spend cap; anomaly detection flags cost spikes. Excessive agency →

3. Control stack summary

Every defense in the table above is a layer in the same ordered stack. Understanding how they compose is the key to applying them correctly.

Layer	What it governs	Fires when
Scoped keys	Identity — which models, IPs, spend cap, expiry, and which policies bind	Every request, before any content is read
Guardrails	Content — prompt and response text	Input stage (before the model) and output stage (after the model replies)
Agent Firewall	Actions — tool calls, MCP dispatch, egress destinations	On every tool call / outbound destination, on the surface it was detected
Audit	Attribution — every match, verdict, approval, and policy change	After every decision, correlated to the agent run

The layers are independent and additive — a request passes through all four. Autonomy levels (tight / balanced / permissive) configure Guardrails and Firewall together in one step, so you do not have to tune them separately to get a coherent posture. For a step-by-step walkthrough of how a single request traverses all four layers, see The control stack.

4. Choosing the right layer for a threat

Some threats require one layer; others require two working together. The quick decision:

Text in the prompt or response is the attack surface — reach for Guardrails first (keyword, regex, PII, LLM judge presets).
A tool call or outbound request is the attack surface — reach for the Agent Firewall (inbound/response/mcp/egress surfaces, deny/sanitize/ pending_approval/cap_cost verdicts).
Both text and action — layer them. The injected instruction fires a guardrail on the input; the tool call the injection tried to drive fires a firewall rule on the action.
Identity and scope — use scoped keys to constrain what an agent is allowed to call at all, before any content or action rule is evaluated.

See Guardrails vs. Firewall for a deeper comparison.

5. Deep-dive threat pages

Prompt injection

Direct and indirect injection — how attackers embed instructions in untrusted content and how guardrails and the firewall intercept them.

Jailbreaks

Adversarial phrasing and evasion techniques — how semantic-aware LLM judge rules catch what regex misses.

Dangerous tool calls

Destructive tools, argument attacks, and tool-response tampering — the firewall surfaces and verdicts that govern each.

Data exfiltration

SSRF and network exfiltration — egress allowlists and how the firewall blocks outbound requests before they leave the gateway.

MCP tool poisoning

Malicious MCP servers, rug-pulls, and skill risk bands — the MCP gateway, skill scanning, and quarantine enforcement.

Excessive agency

Overreaching agents, confused deputy, and denial-of-wallet — scoped keys, default-deny posture, and cost caps.

Reference: The control stack — Guardrails — Agent Firewall — Firewall rules — MCP gateway — Skills — Scoped keys — Zero trust for AI agents

​1. Why agents have a bigger attack surface than chatbots

​2. The threat-to-defense map

​3. Control stack summary

​4. Choosing the right layer for a threat

​5. Deep-dive threat pages

Prompt injection

Jailbreaks

Dangerous tool calls

Data exfiltration

MCP tool poisoning

Excessive agency

1. Why agents have a bigger attack surface than chatbots

2. The threat-to-defense map

3. Control stack summary

4. Choosing the right layer for a threat

5. Deep-dive threat pages