An agent that has more capability than its task needs is a liability waiting to be exploited. If an attacker steals its key, tricks it with an injected instruction, or compromises one dependency, everything that key can do is in the attacker’s hands. This is the excessive agency problem, and it compounds with a closely related pattern called the confused deputy, in which the agent is not compromised directly but is tricked into using its legitimate authority on an attacker’s behalf. Both problems share one root cause: the key the agent holds is more powerful than the task it performs requires. The defense is least agency: give each agent exactly the capability its task requires, and nothing more.
This page is about the gateway controls that limit blast radius. The upstream threat-model context — why agents are high-value targets and how injection works — is in Threat model. For the matching control that governs dangerous individual tool calls, see Dangerous tool calls.

1. What makes an agent excessively capable

When every agent in a workspace shares one key, or when a key is issued once and never revisited, capability drifts upward:
  • Unrestricted models — the agent can call any model in the workspace, including expensive or highly capable ones it never needs.
  • No spend ceiling — a runaway loop, a triggered injection, or a billing attack can exhaust the workspace balance before you notice.
  • No expiry — a key issued during a sprint is still valid a year later, long after the agent it was minted for is retired.
  • No IP constraint — the credential works from anywhere, so a leaked key has no geographic limit.
  • No tool allow-list — the agent can invoke any tool, even ones unrelated to its function.
Any one of these alone widens the blast radius. Combined, they let a single compromised agent use everything the workspace offers: call the most powerful model, spend the full balance, reach every tool.
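The checklist above can be turned into a quick audit. The following is a minimal sketch, not an OrcaRouter tool: it assumes a key is available as a plain dictionary using the field names from the key reference, and it assumes that an empty model_limits list means "no model restriction" (the documented sentinels are 0 for credit_limit_usd, -1 for expired_time, and an empty allow_ips list). The audit_key helper and the sample values are hypothetical.

```python
from typing import Any


def audit_key(key: dict[str, Any]) -> list[str]:
    """Return one finding for each limit the key leaves wide open."""
    findings: list[str] = []
    if not key.get("model_limits"):            # assumption: empty = any model
        findings.append("unrestricted models")
    if key.get("credit_limit_usd", 0) == 0:    # 0 means unlimited
        findings.append("no spend ceiling")
    if key.get("expired_time", -1) == -1:      # -1 means the key never expires
        findings.append("no expiry")
    if not key.get("allow_ips"):               # empty means no IP restriction
        findings.append("no IP constraint")
    if not key.get("firewall_policy_id"):      # falls back to workspace default
        findings.append("no explicit tool allow-list")
    return findings


# A shared, never-revisited key versus a key scoped to one agent.
shared_key = {"model_limits": [], "credit_limit_usd": 0,
              "expired_time": -1, "allow_ips": []}
scoped_key = {"model_limits": ["openai/gpt-4o-mini"], "credit_limit_usd": 5.00,
              "expired_time": 1767225600, "allow_ips": ["203.0.113.0/24"],
              "firewall_policy_id": "fp_example"}

for name, key in (("shared", shared_key), ("scoped", scoped_key)):
    print(name, audit_key(key))
```

Running a check like this over every key in a workspace is one way to spot capability drift before an attacker does.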

2. The confused deputy pattern

The confused deputy is a specialisation of excessive agency. The agent is not hijacked; it is convinced. A prompt-injection payload in a retrieved web page, document, or tool result tells the agent to take an action it is legitimately authorized to perform — move money, delete a record, send a message — on the attacker’s behalf. The agent acts. It was authorized to do exactly that. The authorization check passes. The damage is done. Defense requires two things working together:
  1. Narrow scope — the agent cannot be tricked into doing something its task never intended, because it is not authorized to do it at all.
  2. Human approval for irreversible actions — even within authorized scope, a high-stakes call requires a human to confirm before it executes.
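To make the two defenses concrete, here is a toy model of the pattern. It is an illustration only: the handle function, the tool names, and the verdict strings are hypothetical and do not describe the gateway's implementation.

```python
def handle(scope: set[str], tool: str, gated: set[str],
           approved_by_human: bool = False) -> str:
    """Decide what happens to a tool call requested by the agent."""
    if tool not in scope:
        return "denied"      # narrow scope: the agent was never authorized
    if tool in gated and not approved_by_human:
        return "held"        # irreversible action waits for a human
    return "executed"


injected_tool = "payments.send"   # the action an injected instruction asks for

broad_scope = {"ticket.read", "db.query", "payments.send"}
narrow_scope = {"ticket.read", "db.query"}

# Broad scope, no approval gate: the authorization check passes and the
# attacker's request runs with the agent's legitimate authority.
print(handle(broad_scope, injected_tool, gated=set()))
# Narrow scope: the same injected request fails authorization outright.
print(handle(narrow_scope, injected_tool, gated=set()))
# Broad scope with an approval gate: the call is held for a reviewer.
print(handle(broad_scope, injected_tool, gated={"payments.send"}))
```

The first case is the confused deputy: nothing was bypassed, so nothing looked wrong. The other two cases show why scope and approval are complementary rather than interchangeable.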

3. Defense in depth: the four layers

OrcaRouter enforces least agency across four independent controls that compose on a single API key. None requires a code change to your agent.

Layer 1 — Scoped key (identity + hard limits)

Every agent should have its own API key. The key carries hard limits that the gateway enforces regardless of what the agent requests:
Field | What it restricts
model_limits | The exact set of models this key may call. A request for any other model is rejected before it leaves the gateway.
allow_ips | Requests from any address not on this list are rejected at the auth layer. Empty means no IP restriction.
credit_limit_usd | Lifetime spend cap in USD. 0 means unlimited. The gateway enforces this against cumulative spend on the key.
expired_time | Absolute expiry timestamp. -1 means the key never expires. Set this to the agent’s deployment lifecycle.
environment | A label (prod, staging, dev) for organizing keys and filtering audit logs.
These limits are enforced at the key level — before any policy, before any model call. They are the outermost blast-radius boundary.
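As a mental model of how these limits combine, the sketch below checks a request against a key's hard limits. It is a simplified illustration, not the gateway's code: the ScopedKey class, the spent_usd counter, the check order, the Unix-seconds encoding of expired_time, and the treatment of an empty model_limits list as unrestricted are all assumptions.

```python
import ipaddress
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScopedKey:
    model_limits: list[str]
    allow_ips: list[str] = field(default_factory=list)  # CIDRs; empty = no restriction
    credit_limit_usd: float = 0.0   # 0 = unlimited
    expired_time: int = -1          # -1 = never expires; else Unix seconds (assumed)
    spent_usd: float = 0.0          # hypothetical cumulative spend on the key


def check_request(key: ScopedKey, model: str, client_ip: str,
                  now: Optional[float] = None) -> str:
    """Apply the key's hard limits before any policy or model call."""
    now = time.time() if now is None else now
    if key.expired_time != -1 and now >= key.expired_time:
        return "rejected: key expired"
    if key.allow_ips:
        addr = ipaddress.ip_address(client_ip)
        if not any(addr in ipaddress.ip_network(cidr) for cidr in key.allow_ips):
            return "rejected: ip not allowed"
    if key.model_limits and model not in key.model_limits:
        return "rejected: model not allowed"
    if key.credit_limit_usd != 0 and key.spent_usd >= key.credit_limit_usd:
        return "rejected: credit limit reached"
    return "allowed"


key = ScopedKey(model_limits=["openai/gpt-4o-mini"], allow_ips=["10.0.0.0/24"],
                credit_limit_usd=5.0, expired_time=2_000)
print(check_request(key, "openai/gpt-4o-mini", "10.0.0.7", now=1_000))
print(check_request(key, "openai/gpt-4o", "10.0.0.7", now=1_000))
```

Each check is independent of what the agent asks for, which is the point: a successful injection can change the request, but not the limits it is measured against.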

Layer 2 — Firewall policy (tool allow-list)

Attach a firewall policy to the key via firewall_policy_id. The policy governs every tool call that key issues:
  • Write rules that allow the tool names the agent legitimately uses (tool-name globs are supported — e.g. db.query*).
  • Set the policy’s default_verdict to deny so anything not explicitly listed is blocked.
  • Add argument predicates to restrict even the allowed tools — e.g. allow db.query only when the database argument matches a specific schema.
A key with no firewall attachment falls back to the workspace default policy. For agents with narrow tool needs, an explicit attachment with a tight policy is always preferable to relying on the workspace default. See Firewall rules for the full matching language.
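The three ingredients above (tool-name globs, a deny default, argument predicates) can be sketched as a small evaluator. This is an illustrative model only: the policy dictionary shape, the evaluate function, the equality-based predicate, and the database name tickets_ro are assumptions, and the real matching language is described in Firewall rules. Glob matching here uses Python's fnmatch, which may differ in detail from the gateway's matcher.

```python
from fnmatch import fnmatchcase
from typing import Any

# Hypothetical policy: allow two tool namespaces, deny everything else.
policy = {
    "default_verdict": "deny",
    "rules": [
        # Allowed only when the database argument equals the expected value.
        {"tool": "db.query*", "verdict": "allow", "args": {"database": "tickets_ro"}},
        {"tool": "ticket.read*", "verdict": "allow"},
    ],
}


def evaluate(policy: dict[str, Any], tool: str, args: dict[str, Any]) -> str:
    """Return the verdict of the first matching rule, else the default."""
    for rule in policy["rules"]:
        if not fnmatchcase(tool, rule["tool"]):
            continue
        required = rule.get("args", {})
        if all(args.get(name) == value for name, value in required.items()):
            return rule["verdict"]
    return policy["default_verdict"]


print(evaluate(policy, "db.query", {"database": "tickets_ro"}))
print(evaluate(policy, "db.query", {"database": "billing"}))
print(evaluate(policy, "email.send", {"to": "attacker@example.com"}))
```

With a deny default, a tool the policy author never thought about is blocked automatically, so new tools added to the workspace do not silently widen an existing agent's reach.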

Layer 3 — Human approval for high-stakes actions (pending_approval)

For irreversible or high-value tool calls — a payment dispatch, a record delete, an email send — add a pending_approval rule. The flow:
  1. The agent issues the tool call. The firewall holds it and returns a “held” response carrying an approval id. The call does not reach the tool.
  2. A reviewer approves or rejects out-of-band — from the console (Developer+) or via an HMAC-signed webhook to your own approval system.
  3. Your agent polls the approval id. Once approved, it re-submits the original call with a single-use X-OrcaRouter-Firewall-Approval header. The gateway passes it through exactly once.
The confused deputy is stopped here even when scope is valid: a human confirms the action is intentional before it executes.
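The hold, review, and single-use re-submit steps can be simulated in memory as follows. This sketch does not call the OrcaRouter API: the ApprovalGate class and its method names are hypothetical, and it assumes the value carried in the X-OrcaRouter-Firewall-Approval header is the approval id itself, which the text above does not specify.

```python
import secrets
from typing import Any, Optional


class ApprovalGate:
    """In-memory stand-in for the firewall's pending_approval flow."""

    def __init__(self) -> None:
        self._entries: dict[str, dict[str, Any]] = {}

    def submit(self, call: dict[str, Any],
               approval_id: Optional[str] = None) -> dict[str, Any]:
        if approval_id is None:
            # Step 1: hold the call and hand back an approval id.
            new_id = secrets.token_hex(8)
            self._entries[new_id] = {"call": call, "status": "pending"}
            return {"status": "held", "approval_id": new_id}
        # Step 3: a re-submission passes once, and only for the same call.
        entry = self._entries.get(approval_id)
        if entry and entry["status"] == "approved" and entry["call"] == call:
            entry["status"] = "consumed"
            return {"status": "passed"}
        return {"status": "denied"}

    def review(self, approval_id: str, approve: bool) -> None:
        # Step 2: a human decides out-of-band.
        self._entries[approval_id]["status"] = "approved" if approve else "rejected"

    def poll(self, approval_id: str) -> str:
        return self._entries[approval_id]["status"]


gate = ApprovalGate()
call = {"tool": "payments.send", "args": {"amount_usd": 120}}
held = gate.submit(call)
gate.review(held["approval_id"], approve=True)
first = gate.submit(call, approval_id=held["approval_id"])
second = gate.submit(call, approval_id=held["approval_id"])
print(held["status"], first["status"], second["status"])
```

Binding the approval to the exact call and consuming it on use means an approved action cannot be replayed or swapped for a different one after the reviewer has looked at it.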

Layer 4 — Per-run cost cap (cap_cost)

A cap_cost rule denies any tool call once the agent run’s accumulated spend exceeds a per-rule ceiling (in cents). This is the circuit-breaker for:
  • Runaway loops triggered by injection.
  • Billing attacks that drive spend before any human notices.
  • Accidental recursion in multi-step plans.
cap_cost operates at the run level, not the key lifetime level — so it resets per agent invocation, and a single misbehaving run cannot exhaust the key’s credit_limit_usd ceiling.
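The run-level behavior can be pictured with a small accumulator. The sketch below is illustrative: the RunCostCap class, the per-call cost of 30 cents, and the ceiling of 100 cents are made-up values, and "exceeds" is modeled as strictly greater than the ceiling.

```python
class RunCostCap:
    """Tracks one agent run's spend against a per-rule ceiling in cents."""

    def __init__(self, ceiling_cents: int) -> None:
        self.ceiling_cents = ceiling_cents
        self.spent_cents = 0

    def verdict(self) -> str:
        # Deny once accumulated spend has gone past the ceiling.
        return "deny" if self.spent_cents > self.ceiling_cents else "allow"

    def record(self, cost_cents: int) -> None:
        self.spent_cents += cost_cents


def simulate_runaway_loop(cap: RunCostCap, cost_per_call_cents: int) -> int:
    """Keep issuing tool calls until the cap denies one; return calls made."""
    calls = 0
    while cap.verdict() == "allow":
        cap.record(cost_per_call_cents)
        calls += 1
    return calls


run_one = RunCostCap(ceiling_cents=100)
print(simulate_runaway_loop(run_one, cost_per_call_cents=30))

# A new run starts from zero: the cap is per run, not per key lifetime.
run_two = RunCostCap(ceiling_cents=100)
print(run_two.verdict())
```

In this model the loop is cut off a few calls in, and the next invocation starts with a clean budget, while the key's lifetime credit_limit_usd continues to accumulate separately.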

4. A well-scoped agent key — example

An agent that summarizes customer tickets using gpt-4o-mini and queries a read-only replica should look like this:
  • model_limits: ["openai/gpt-4o-mini"] — cannot escalate to a more capable or expensive model.
  • allow_ips: the worker pool’s egress CIDR — the key is inert everywhere else.
  • credit_limit_usd: a lifetime ceiling sized to the task’s expected total cost over the key’s life, with some headroom, e.g. 5.00. The cap is cumulative, so it does not reset weekly.
  • expired_time: the end of the sprint or deployment period — the key self-expires without manual cleanup.
  • environment: "prod" — appears in log filters and anomaly views.
  • guardrail_id: a guardrail scoped to this agent’s data sensitivity (PII masking, no secrets in output).
  • firewall_policy_id: a policy that allow-lists db.query* and ticket.read* only, default verdict deny.
When this agent is tricked into exfiltrating data via an injected instruction, the blast radius is: one model, one IP range, one tool namespace, one cost ceiling. The rest of the workspace is unaffected.
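Written out as data, the key described above might look like the following. This is a sketch of the field values only, not a documented request body: the build_ticket_summarizer_key helper is hypothetical, the CIDR is a documentation placeholder, the two ids are placeholders, and encoding expired_time as Unix seconds is an assumption.

```python
from datetime import datetime, timezone
from typing import Any


def build_ticket_summarizer_key(deploy_end: datetime) -> dict[str, Any]:
    """Assemble the scoped-key fields for the ticket-summarizer agent."""
    return {
        "model_limits": ["openai/gpt-4o-mini"],
        "allow_ips": ["203.0.113.0/24"],               # placeholder egress CIDR
        "credit_limit_usd": 5.00,                      # lifetime cap in USD
        "expired_time": int(deploy_end.timestamp()),   # assumed Unix seconds
        "environment": "prod",
        "guardrail_id": "<guardrail-id>",              # placeholder
        "firewall_policy_id": "<policy-id>",           # placeholder
    }


key_fields = build_ticket_summarizer_key(datetime(2026, 1, 1, tzinfo=timezone.utc))
for name, value in key_fields.items():
    print(f"{name}: {value}")
```

Keeping the definition in code or configuration makes it easy to review a key's scope in the same change that deploys the agent.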
is_firewall_gateway marks a key as a gateway-scoped token for the MCP-dispatch and evaluate-hook routes. Only create these for agents that drive the firewall programmatically — never for general inference traffic. A gateway key on the inference path exposes routes that broad-purpose keys should never reach. Enabling is_firewall_gateway requires Admin+.

5. Roles required

Action | Minimum role
Read a key, firewall policy, or guardrail match | Member
Read firewall events, runs, or traces | Developer
Create or edit keys, firewall policies, rules | Developer
Approve a held tool call from the console | Developer
Enable is_firewall_gateway on a key | Admin
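The "minimum role" wording implies an ordering in which a higher role can do everything a lower one can. A minimal sketch of that check follows, assuming only the three roles named in the table and the order Member, Developer, Admin; the Role enum, the action names, and the can helper are hypothetical.

```python
from enum import IntEnum


class Role(IntEnum):
    MEMBER = 1
    DEVELOPER = 2
    ADMIN = 3


# Hypothetical action names mapped to the minimum roles in the table.
MINIMUM_ROLE = {
    "read_key_or_policy": Role.MEMBER,
    "read_firewall_events": Role.DEVELOPER,
    "edit_keys_or_policies": Role.DEVELOPER,
    "approve_held_call": Role.DEVELOPER,
    "enable_is_firewall_gateway": Role.ADMIN,
}


def can(role: Role, action: str) -> bool:
    """True when the role meets or exceeds the action's minimum role."""
    return role >= MINIMUM_ROLE[action]


print(can(Role.DEVELOPER, "approve_held_call"))
print(can(Role.DEVELOPER, "enable_is_firewall_gateway"))
```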

6. Relationship to other threats

Excessive agency is the enabler for almost every other agent threat:
  • Dangerous tool calls — a key with a tight tool allow-list cannot be forced to call a tool it does not list, even if injection succeeds.
  • Prompt injection — scope limits the damage injection can do; approval gates block the irreversible actions injection tries to trigger.
  • Threat model — the full attack surface map, showing where excessive agency sits relative to other vectors.
Least agency does not prevent injection. It shrinks what injection can achieve.

Scoped keys & policies

The full key-fields reference, resolution order, and the workspace boundary model.

Firewall

Policy authoring, verdicts, HITL approval flow, and the full API reference.
Least agency (one narrow key per agent, a tight tool allow-list, a spend cap, and human approval for irreversible actions) is the primary defense against excessive agency in LLM agents and against the confused deputy pattern.