Agent Firewall overview

An AI agent doesn’t just generate text — it acts. It calls shell.exec, queries a database, fetches a URL, dispatches a tool through an MCP server. Each of those is a real-world action that prompt-level guardrails can’t see. The agent firewall is the action-layer plane that governs them: a workspace-scoped, named policy the gateway evaluates on every tool call, before it reaches the tool. This page is the hub for the Firewall section — the policy model, the verdicts, the surfaces, and how a policy attaches to a key. Each spoke goes deeper, and the full engine reference lives in Firewall and Firewall Rules.

1. What the agent firewall does

You author a policy once in your workspace console, attach an API key to it (or set one as the workspace default), and from then on every tool call that key issues is checked against the policy at the gateway. No redeploy, no agent-code change — your agent keeps issuing tool calls exactly as before, and editing the policy takes effect on the next call for every key attached to it. A policy is an ordered list of rules. Each rule decides which tool calls it applies to and what to do about them. The engine walks the rules in priority order, first match wins, and falls back to the policy’s default verdict if nothing matches.
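The evaluation order described above can be sketched in a few lines of Python. This is an illustration only, not the gateway's implementation: the `Rule` dataclass, its `priority` field, the assumption that a lower number is evaluated first, and the `evaluate` function are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Rule:
    # Hypothetical shape; only tool_name_glob and verdict appear in these docs.
    priority: int          # assumption: lower number = evaluated first
    tool_name_glob: str
    verdict: str


def evaluate(rules: list[Rule], default_verdict: str, tool_name: str) -> str:
    """Walk rules in priority order; the first matching rule decides."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if fnmatchcase(tool_name, rule.tool_name_glob):
            return rule.verdict
    # Nothing matched: fall back to the policy's default verdict.
    return default_verdict


rules = [
    Rule(priority=10, tool_name_glob="*.exec", verdict="deny"),
    Rule(priority=20, tool_name_glob="shell.*", verdict="allow"),
]
print(evaluate(rules, "audit", "shell.exec"))  # deny (first match wins)
print(evaluate(rules, "audit", "db.query"))    # audit (default verdict)
```

Because the first match wins, a broad allow rule placed ahead of a narrow deny rule would shadow it, so rule order is part of the policy's meaning.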

Detection happens at the gateway, on first use — not at install time. A tool, MCP server, or skill an agent self-installs is caught the first time its call crosses the gateway. That’s the one choke point that sees every provider, every agent, and every tool call regardless of how the capability got there.

2. A concrete example

Suppose you want to block destructive shell commands but let everything else through under audit. In the console you create a policy with default_verdict = audit and one rule:

{
  "label": "block rm -rf",
  "tool_name_glob": "*.exec",
  "args_match_json": "{\"clauses\":[{\"path\":\"$.command\",\"op\":\"regex\",\"value\":\"rm -rf|drop table\"}]}",
  "verdict": "deny"
}

args_match_json is a JSON-encoded string, which the gateway validates against the clause schema on save. Within each clause, path is a JSONPath into the call’s arguments and op is one of eq, contains, regex, in, cidr_match, gt, lt.

Attach a key to the policy by setting firewall_policy_id on the key. Now when an agent emits shell.exec with command: "rm -rf /", the gateway returns HTTP 400 with error code firewall_blocked and a reason naming the tool, and the call never reaches the shell. Every other tool call is allowed and recorded for review.
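Because args_match_json is a string that itself contains JSON, hand-escaping the inner quotes is easy to get wrong. A minimal sketch of building the same rule programmatically, assuming you assemble the rule in Python before pasting or submitting it (the variable names are illustrative):

```python
import json

# Clause list as a normal Python structure (same content as the rule above).
clauses = {
    "clauses": [
        {"path": "$.command", "op": "regex", "value": "rm -rf|drop table"}
    ]
}

rule = {
    "label": "block rm -rf",
    "tool_name_glob": "*.exec",
    # args_match_json is itself a string, so the clauses are encoded first.
    "args_match_json": json.dumps(clauses, separators=(",", ":")),
    "verdict": "deny",
}

# Encoding the whole rule escapes the inner quotes automatically.
print(json.dumps(rule, indent=2))
```

Decoding goes the other way in two steps: parse the rule, then parse the args_match_json string it contains.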

Roll a new policy out under shadow mode first: it evaluates and logs exactly as it would in production, but every enforcing verdict is downgraded to audit and the reason is prefixed [shadow] would …. Watch the events feed, then turn shadow off to enforce.

3. Policy, rules, and resolution

A policy is workspace-scoped and named. It carries enabled and is_default flags, a default_verdict (allow / audit / deny, defaulting to audit), and a shadow_mode flag. A rule is one check inside a policy; see Create a policy and Rule schema. For any tool call, the gateway resolves the active policy in this order:

Key attachment — the calling key’s firewall_policy_id, when that policy exists and is enabled.
Workspace default — otherwise the enabled is_default policy.

A disabled attached firewall policy falls back to the workspace default — this differs from guardrails, where a disabled attachment resolves to none. The off switch on a key’s firewall is “use the default,” not “no enforcement.” See Manage policies.

When no policy resolves at all, behavior depends on the workspace firewall_observe_mode setting: with observe mode on, the call is allowed but logged as a coverage gap (it surfaces in Discovered Tools); with it off, the call is allowed silently.
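The resolution order in this section can be sketched as follows. It is a model of the documented behavior under stated assumptions, not gateway code: the `Policy` dataclass, `resolve_policy`, `unresolved_outcome`, and the outcome strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    id: str
    enabled: bool = True
    is_default: bool = False


def resolve_policy(policies: dict[str, Policy],
                   key_policy_id: Optional[str]) -> Optional[Policy]:
    """Return the active policy for a call, or None when nothing resolves."""
    # 1. Key attachment: used only if the policy exists and is enabled.
    attached = policies.get(key_policy_id) if key_policy_id else None
    if attached is not None and attached.enabled:
        return attached
    # 2. Workspace default: the enabled is_default policy, if any.
    for policy in policies.values():
        if policy.is_default and policy.enabled:
            return policy
    return None


def unresolved_outcome(observe_mode: bool) -> str:
    # With no policy the call is allowed either way; observe mode only
    # controls whether it is logged as a coverage gap.
    return "allow+log_coverage_gap" if observe_mode else "allow"


policies = {
    "p1": Policy("p1", enabled=False),
    "p2": Policy("p2", is_default=True),
}
print(resolve_policy(policies, "p1").id)  # p2: disabled attachment falls back
```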

4. Verdicts

A rule — or the default — produces one of:

Verdict	What it does
`allow`	Let through, logged.
`audit`	Allow + record for review. The usual default.
`deny`	Block. HTTP 400 `firewall_blocked` on inbound; tool error on MCP.
`sanitize`	Redact matched substrings from the tool arguments and forward.
`pending_approval`	Hold for a human; HTTP 400 `firewall_approval_pending`.
`cap_cost`	Deny once the run’s spend crosses a per-rule cap.

A sanitize verdict redacts call arguments only — never the content a tool returns. On the inbound surface, where there are no call-time args yet, sanitize escalates to a block. See Verdicts and Sanitize responses.
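To make the argument-only scope of sanitize concrete, here is a small sketch of redacting matched substrings before forwarding. The mask text `[REDACTED]`, the function name, and the choice to touch only top-level string values are assumptions for illustration; the gateway's actual redaction format is not specified on this page.

```python
import re


def sanitize_args(args: dict, pattern: str, mask: str = "[REDACTED]") -> dict:
    """Return a copy of args with regex matches in string values masked.

    Assumptions: the mask text and the top-level-strings-only traversal are
    illustrative, not the gateway's documented behavior.
    """
    regex = re.compile(pattern)
    return {
        key: regex.sub(mask, value) if isinstance(value, str) else value
        for key, value in args.items()
    }


call_args = {"query": "email alice@example.com the report", "limit": 5}
print(sanitize_args(call_args, r"[\w.]+@[\w.]+"))
```

The original arguments are left untouched and a redacted copy is forwarded, which mirrors the "redact and forward" wording in the verdict table.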

5. The four enforcement surfaces

Every tool call is evaluated on exactly one surface. Pin a rule to a surface with the stage field, or leave stage empty to apply the rule on all four.

inbound

The tools an agent advertises to the model on the request. Block a dangerous tool before the model can even choose it.

response

The tool_calls the model emits in its reply.

mcp

A tools/call dispatched through the MCP gateway. See MCP servers.

egress

An outbound host / IP / CIDR a tool reaches — the SSRF and data-exfiltration surface.
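The stage check can be sketched in a few lines. The four surface names come from this page; the helper name and the treatment of both an empty string and a missing value as "all surfaces" are assumptions.

```python
from typing import Optional

STAGES = ("inbound", "response", "mcp", "egress")


def rule_applies(rule_stage: Optional[str], call_stage: str) -> bool:
    """An empty stage applies to every surface; otherwise it must match."""
    if call_stage not in STAGES:
        raise ValueError(f"unknown surface: {call_stage}")
    return not rule_stage or rule_stage == call_stage


print(rule_applies("", "mcp"))         # True: unpinned rule
print(rule_applies("egress", "mcp"))   # False: pinned elsewhere
```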

6. Matching

Rules express which tool calls they catch with a small, deterministic vocabulary that’s safe on the hot path:

Tool & skill name globs

A case-sensitive glob on the tool name (shell.*, *.delete), optionally AND-ed with a glob on the owning skill. See Glob syntax and Tool allow-listing.
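As a rough model of name matching, Python's `fnmatch.fnmatchcase` gives case-sensitive glob matching. The gateway's glob dialect may differ in detail (see Glob syntax), and the `name_matches` helper and its parameters are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional


def name_matches(tool: str, skill: str, tool_glob: str,
                 skill_glob: Optional[str] = None) -> bool:
    """Case-sensitive tool glob, AND-ed with a skill glob when one is set."""
    if not fnmatchcase(tool, tool_glob):
        return False
    return skill_glob is None or fnmatchcase(skill, skill_glob)


print(name_matches("shell.exec", "devops", "shell.*"))             # True
print(name_matches("Shell.exec", "devops", "shell.*"))             # False
print(name_matches("files.delete", "storage", "*.delete", "db*"))  # False
```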

Argument clauses

JSONPath predicates over the call’s arguments with operators eq, contains, regex, in, cidr_match, gt, lt — the difference between “block shell.exec” and “block it only when the command is rm -rf.” Clauses fail closed (the rule, not the request). See Validate arguments.
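A minimal sketch of how the seven operators could be evaluated against a call's arguments. It supports only simple dotted paths such as `$.a.b`, not full JSONPath, and it treats missing paths and type errors as "no match"; both choices, along with the function names, are assumptions rather than documented gateway behavior.

```python
import ipaddress
import re
from typing import Any

_MISSING = object()


def lookup(args: dict, path: str) -> Any:
    """Resolve a simple dotted path such as $.a.b (no wildcards or indexes)."""
    node: Any = args
    for part in path.removeprefix("$").strip(".").split("."):
        if not part:
            continue
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def clause_matches(args: dict, clause: dict) -> bool:
    actual, op, want = lookup(args, clause["path"]), clause["op"], clause["value"]
    if actual is _MISSING:
        return False  # assumption: a missing path never matches
    try:
        if op == "eq":
            return actual == want
        if op == "contains":
            return want in actual
        if op == "regex":
            return re.search(want, actual) is not None
        if op == "in":
            return actual in want
        if op == "cidr_match":
            return ipaddress.ip_address(actual) in ipaddress.ip_network(want)
        if op == "gt":
            return actual > want
        if op == "lt":
            return actual < want
    except (TypeError, ValueError, re.error):
        return False  # assumption: type or parse errors count as no match
    raise ValueError(f"unknown op: {op}")


args = {"command": "rm -rf /", "target": {"host": "10.0.0.5"}, "rows": 500}
print(clause_matches(args, {"path": "$.command", "op": "regex",
                            "value": "rm -rf|drop table"}))  # True
```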

Egress lists

Host / IP / CIDR allow and deny lists on the egress surface. You can author your own deny rule for cloud-metadata or RFC-1918 ranges. See Egress control.
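The check behind a deny list for private and metadata ranges can be sketched with the standard `ipaddress` module. The CIDRs are the RFC 1918 private ranges plus 169.254.169.254, the link-local metadata address used by major cloud providers; the `DENY_CIDRS` and `egress_denied` names are hypothetical, and how you express the list in an actual rule is covered in Egress control.

```python
import ipaddress

# RFC 1918 private ranges plus the link-local cloud metadata address.
DENY_CIDRS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
                 "169.254.169.254/32")
]


def egress_denied(ip: str) -> bool:
    """True when the destination IP falls inside any denied range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DENY_CIDRS)


print(egress_denied("169.254.169.254"))  # True
print(egress_denied("8.8.8.8"))          # False
```

Note that a check like this applies to resolved IP addresses; a hostname that resolves into a private range needs to be resolved before it can be compared against a CIDR list.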

Sequences & cost caps

A sequence rule matches an ordered chain of calls across a window (enforced reactively); a cap_cost rule denies once an agent run’s accumulated spend crosses a cents ceiling. See Sequence rules and Cap cost.
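Both ideas can be modeled briefly. In this sketch the window is measured in seconds back from the most recent call and "crosses" is read as strictly greater than the ceiling; those readings, and the function names, are assumptions and may not match the engine exactly.

```python
from fnmatch import fnmatchcase


def sequence_matched(calls, chain, window_s):
    """True when the globs in `chain` match, in order, within the window.

    calls: chronological list of (timestamp_seconds, tool_name) tuples.
    Assumption: the window is measured in seconds back from the latest call.
    """
    if not calls:
        return False
    latest = calls[-1][0]
    recent = iter(name for ts, name in calls if latest - ts <= window_s)
    # Each glob consumes the iterator up to its first match, which enforces order.
    return all(any(fnmatchcase(name, glob) for name in recent) for glob in chain)


def over_cap(spent_cents, cap_cents):
    # Assumption: "crosses" means strictly greater than the ceiling.
    return spent_cents > cap_cents


calls = [(0, "db.read"), (5, "files.list"), (9, "http.export")]
print(sequence_matched(calls, ["*.read", "*.export"], window_s=60))  # True
print(over_cap(spent_cents=120, cap_cents=100))                      # True
```

The sequence check only fires once the last call in the chain has been seen, which is consistent with sequence rules being enforced reactively.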

7. Human approval, autonomy, and anomalies

Human-in-the-loop. A pending_approval verdict holds the call and returns its approval id. A reviewer resolves it in the console (Developer+) or via an HMAC-signed webhook callback; the agent polls and re-submits with a single-use X-OrcaRouter-Firewall-Approval header. See Approvals and Approval webhook.
Autonomy levels. One switch sets your whole posture. tight: default-deny, deny destructive shell, deny fetch-shaped tools (http_fetch, web_search, fetch_url, request; the SSRF vector), and enforce PII Shield and Secrets Blocker. balanced: default audit, deny destructive shell, PII Shield audit-only. permissive: observe only. Each level writes real, editable policy and guardrail rows, with one-click undo from the audit snapshot.
Anomaly detection. Beyond static rules, the firewall scores tool use against a learned hour-of-week baseline (14-day) and flags rate/cost spikes, retry_loop, and novel_path on a Member-readable feed you can snooze for up to 7 days.
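For the webhook path, the general shape of signing and verifying an HMAC-signed callback looks like the sketch below. The hash (SHA-256), the hex encoding, and the body fields `approval_id` and `decision` are all assumptions for illustration; the actual signing scheme and payload are defined in Approval webhook.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, not confirmed here)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_callback(secret: bytes, body: bytes, signature: str) -> bool:
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(sign(secret, body), signature)


secret = b"example-shared-secret"
# Hypothetical payload fields, shown only to have something to sign.
body = json.dumps({"approval_id": "apr_123", "decision": "approve"}).encode()
signature = sign(secret, body)

print(verify_callback(secret, body, signature))         # True
print(verify_callback(secret, body + b" ", signature))  # False
```

Signing the raw bytes rather than a re-serialized object matters: any change to whitespace or key order alters the digest.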

8. Where the firewall fits

The firewall is the action-layer peer of two adjacent planes:

Plane	Governs	When to reach for it
Guardrails	Prompt & response text	PII, secrets, jailbreaks, injection intent
Agent firewall	Tool actions	Which tools, MCP calls, hosts, and cost
Compliance	Evidence & frameworks	SOC 2 / HIPAA / EU AI Act readiness

Both content and action planes can apply to a single request, and an autonomy level configures them together. See Guardrails vs. Firewall and the control stack.

9. Attaching and connecting

A policy binds to a key via firewall_policy_id (configured in the console; the binding lives on the key in the gateway). A tool call reaches the engine in one of two ways. Both require a firewall-gateway-scoped key (is_firewall_gateway = true); a regular relay key gets 403 on these routes:

MCP gateway — point your MCP client at the unified ANY /api/v1/firewall/mcp endpoint; every tools/call is evaluated inline. See MCP servers.
Evaluate hook — call POST /api/v1/firewall/evaluate (or /evaluate_plan for a multi-step plan) from your own agent loop before dispatching, and act on the verdict.
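A sketch of the evaluate-hook pattern from an agent loop, using only the standard library. The endpoint path comes from this page; the base URL, the Bearer authorization scheme, and the request body fields (`tool_name`, `arguments`) are assumptions, so check the API reference for the real schema before relying on them.

```python
import json
import urllib.request


def build_evaluate_request(base_url: str, api_key: str,
                           tool_name: str, arguments: dict) -> urllib.request.Request:
    # Hypothetical body fields; the real request schema is not shown on this page.
    body = json.dumps({"tool_name": tool_name, "arguments": arguments}).encode()
    return urllib.request.Request(
        url=f"{base_url}/api/v1/firewall/evaluate",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def should_dispatch(verdict: str) -> bool:
    # allow and audit let the call through unchanged; this sketch
    # conservatively holds back every other verdict.
    return verdict in ("allow", "audit")


req = build_evaluate_request("https://gateway.example", "fw-key",
                             "shell.exec", {"command": "ls"})
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` and reading the verdict out of the response is left out, because the response shape is not described here. The key must be firewall-gateway-scoped, or the route returns 403.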

All console configuration runs under your session via /api/workspace/firewall/*. Reads of policies, settings, discovered tools, the read-only autonomy simulator, and the anomaly feed are open to every workspace member; the dry-run Test sandbox, the Events / Runs log, and all writes require Developer+. See Gateway keys and Test rules.

10. Threats this plane addresses

Dangerous tool calls

Deny destructive shell, DB drops, and risky verbs by glob + args.

Data exfiltration

Egress lists and read-then-export sequence rules.

MCP tool poisoning

Per-call evaluation on the mcp surface before dispatch.

Excessive agency

Approvals, cost caps, and default-deny posture.

Next steps

Create a policy

Author your first policy and rule.

Verdicts

What each verdict does on the wire.

Events log

Read what the firewall decided and why.
