You wrote a firewall rule — a deny on shell.exec, an egress allow-list, an argument clause that only fires on rm -rf — and now you want to know it does exactly what you think before it changes a single production tool call. The firewall gives you three non-destructive ways to test firewall rules, each answering a different question:

Dry-run one call

The Test sandbox feeds one synthetic tool call through the real engine and returns the verdict — nothing dispatched, nothing logged. Developer+.

Replay a posture

Simulate replays an autonomy level against your recent traffic and counts how many calls it would block. Member-readable.

Run against live traffic

Shadow mode evaluates a whole policy on real calls but downgrades every enforcing verdict to audit. Zero blast radius.
All three are configured through the console (or the /api/workspace/firewall/* management routes, which authenticate with your session / access token, not a relay sk-orca-… key). Your agent’s /v1/* relay calls never change while you test.

1. Test firewall rules with the dry-run Test sandbox

The Test sandbox is the tightest loop: hand it a single synthetic tool call and it runs the real evaluation engine — full policy resolution, rules walked in priority order, first-match-wins — then returns the verdict, the rule that produced it, and the human-readable reason. The call is a dry run: nothing is dispatched to any tool, and nothing is written to the events feed or the Discovered-tools inventory. It answers one question precisely: given this exact tool name and these arguments, what does my policy decide — and which rule decides it?
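To make "rules walked in priority order, first-match-wins" concrete, here is a minimal local model of that walk. It is a sketch for intuition only, not the firewall's engine: the `Rule` shape, the assumption that a lower `priority` number is checked first, the second "audit all shell" rule, and the "policy default" reason string are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    # Hypothetical rule shape for illustration; not the product's schema.
    label: str
    priority: int  # assumption: lower number is checked first
    tool_name: str
    verdict: str
    reason: str
    arg_clause: Optional[Callable[[dict], bool]] = None  # None matches any arguments

    def matches(self, tool_name: str, args: dict) -> bool:
        if tool_name != self.tool_name:
            return False
        return self.arg_clause is None or self.arg_clause(args)


def evaluate(rules: list, default_verdict: str, tool_name: str, args: dict) -> dict:
    """Walk rules in priority order; the first matching rule decides."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(tool_name, args):
            return {"verdict": rule.verdict, "rule_label": rule.label, "reason": rule.reason}
    # No rule matched: the call falls through to the policy default.
    return {"verdict": default_verdict, "rule_label": "", "reason": "policy default"}


rules = [
    Rule("block destructive shell", 10, "shell.exec", "deny",
         "destructive shell command",
         arg_clause=lambda a: "rm -rf" in a.get("command", "")),
    Rule("audit all shell", 20, "shell.exec", "audit", "shell usage"),
]

dangerous = evaluate(rules, "allow", "shell.exec", {"command": "rm -rf /data"})
innocent = evaluate(rules, "allow", "shell.exec", {"command": "ls -la"})
unknown = evaluate(rules, "allow", "http.fetch", {})
```

In this model the dangerous command stops at the first rule, the innocent one skips it and is caught by the broader second rule, and a tool no rule names falls to the default with an empty `rule_label`.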
The Test sandbox is Developer+. It can preview against an unsaved draft policy by id and the response surfaces the matched policy name and rule label, so it sits closer to a write-surface preview than a plain read — unlike Simulate and the other read views, which are open to every member.

One concrete dry run

Say you’ve added a rule that should deny shell.exec only when the command contains rm -rf. You want to confirm two things in one sitting: the dangerous command is denied, and an innocent one still passes.
1

Test the dangerous call

In Security → Firewall, open the Test tab, pick the response surface, enter tool name shell.exec and arguments {"command": "rm -rf /data"}, and run. The response names the verdict and the matched rule:
{
  "verdict": "deny",
  "policy_name": "prod-agents",
  "rule_label": "block destructive shell",
  "reason": "destructive shell command",
  "gap": false,
  "shadow_mode": false
}
2

Test the innocent call

Run it again with {"command": "ls -la"}. The argument clause no longer matches, so the call falls through to the policy default: you should see allow or audit and an empty rule_label. If rm -rf denies and ls -la doesn’t, your argument clause is scoped correctly.
3

Preview a draft before you attach it

Pass a policy_id to evaluate against a specific draft policy instead of the one your traffic currently resolves — so you can prove a new policy is right before you attach a key to it or promote it to workspace default.
Read gap in the response. gap: true means a policy resolved but no rule inside it matched the call (and the workspace is in observe mode) — the tool slipped through every rule and fell to the default. That’s a coverage hole to close before you ship, not a verdict to trust.
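The two-call check and the gap reading above can be scripted once you have both responses in hand. The sketch below only inspects response dictionaries using the field names shown in the sample response; the helper name `check_pair` and the sample `innocent` payload are hypothetical.

```python
def check_pair(dangerous: dict, innocent: dict) -> dict:
    """Review a dangerous/innocent pair of Test sandbox responses."""
    errors = []
    if dangerous.get("verdict") != "deny":
        errors.append(f"dangerous call was not denied: {dangerous.get('verdict')!r}")
    if innocent.get("verdict") not in ("allow", "audit"):
        errors.append(f"innocent call was not passed: {innocent.get('verdict')!r}")
    if innocent.get("rule_label"):
        errors.append(f"innocent call matched rule {innocent['rule_label']!r}")
    # gap: true is reported separately: it marks a coverage hole, not a failure
    # of the rule under test.
    pairs = (("dangerous", dangerous), ("innocent", innocent))
    gaps = [name for name, resp in pairs if resp.get("gap")]
    return {"errors": errors, "gaps": gaps}


# The dangerous response mirrors the sample above; the innocent one is a
# hypothetical payload for a call that fell through to the default.
dangerous = {
    "verdict": "deny",
    "policy_name": "prod-agents",
    "rule_label": "block destructive shell",
    "reason": "destructive shell command",
    "gap": False,
    "shadow_mode": False,
}
innocent = {
    "verdict": "audit",
    "policy_name": "prod-agents",
    "rule_label": "",
    "reason": "",
    "gap": True,
    "shadow_mode": False,
}

result = check_pair(dangerous, innocent)
```

An empty `errors` list means the argument clause is scoped as intended; any name in `gaps` is a call that no rule covered and is worth a rule of its own before you ship.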
The Test sandbox uses the same surfaces as live evaluation — inbound, response, mcp, egress (default inbound) — so test each rule on the surface it’s pinned to. On inbound there are no call-time arguments, so a sanitize rule escalates to a block there exactly as it would in production; see stages for why surface matters.

2. Simulate an autonomy level before you apply it

The Test sandbox checks one call. Simulate answers the posture-level question: if I switched this whole workspace to a stricter autonomy level, how much of my recent traffic would it block? Simulate replays a candidate level’s deny rules against your trailing firewall events and returns the would-be impact — tool names and counts only, never arguments. It is read-only and Member-readable, so anyone on the team can preview the blast radius of tight before a Developer commits to it.
  • tight — default-deny, deny destructive shell, deny fetch-shaped tools (the SSRF vector), PII Shield + Secrets Blocker enforced. Simulate shows how much of your real traffic this floor would catch.
  • balanced — default audit, deny destructive shell, PII Shield in audit-only (flags PII). The recommended starting posture.
  • permissive — observe only; nothing enforced.
Simulate changes nothing — it’s a what-if over past events. Applying an autonomy level (a Developer+ write) materializes real, editable autonomy_* policy and guardrail rows, with one-click undo from the audit snapshot. Preview with Simulate, then apply when the count looks right.
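A small sketch of driving Simulate from a script: it builds the query URL for a level and totals the result. The route and the three level names come from this page; the base URL, the helper names, and above all the response shape (a plain mapping of tool name to would-be-blocked count) are assumptions, so adapt the parsing to what the route actually returns.

```python
from urllib.parse import urlencode

LEVELS = ("tight", "balanced", "permissive")  # levels named on this page


def simulate_url(base_url: str, level: str) -> str:
    """Build the Simulate URL for one autonomy level."""
    if level not in LEVELS:
        raise ValueError(f"unknown autonomy level: {level!r}")
    query = urlencode({"level": level})
    return f"{base_url.rstrip('/')}/api/workspace/firewall/simulate?{query}"


def summarize(would_block: dict, top_n: int = 3) -> tuple:
    """Return (total would-be-blocked calls, the top_n tools by count)."""
    total = sum(would_block.values())
    top = sorted(would_block.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return total, top


# Hypothetical host and hypothetical counts, for illustration only.
url = simulate_url("https://console.example.com", "tight")
total, top = summarize({"shell.exec": 12, "http.fetch": 40, "fs.write": 3})
```

Because Simulate reports tool names and counts only, a summary like this is about as much as a script can derive; investigating a specific blocked call still means going to the Test sandbox or shadow mode.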

3. Shadow mode: test against live traffic with no blast radius

The Test sandbox and Simulate are offline previews. Shadow mode is the live one: a per-policy flag that evaluates the policy on real agent traffic, walks every rule, picks a verdict — then downgrades every enforcing verdict (deny, sanitize, pending_approval) to audit and prefixes the reason [shadow] would …. The call always goes through; nothing is blocked, redacted, or held. That makes the events feed read like a production run with enforcement turned off. Filter for [shadow] and you have a complete list of every call the policy is about to start blocking — before it blocks one.
| Test method | Runs against | Question it answers |
| --- | --- | --- |
| Test sandbox | One synthetic call | “What verdict does this exact call get, and which rule decides?” |
| Simulate | Recent events | “How many calls would a stricter autonomy level block?” |
| Shadow mode | Live traffic | “What would this policy block across real production traffic?” |
Shadow mode is the deepest of the three: full live coverage with zero blast radius. It has its own page, Roll out a firewall policy with shadow mode, which walks through the toggle, the [shadow] would … reasons, and the flip to enforce.
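The shadow downgrade and the [shadow] filter described in this section can be modeled in a few lines. This is an illustrative sketch, not the engine: the set of enforcing verdicts comes from the text above, while the event dictionaries (`tool_name`, `reason`) are a hypothetical shape for events you have exported from the feed. The sample reasons keep the elided "[shadow] would …" text verbatim rather than guessing what follows it.

```python
from collections import Counter

# Verdicts that shadow mode downgrades, as listed in this section.
ENFORCING = {"deny", "sanitize", "pending_approval"}


def effective_verdict(verdict: str, shadow_mode: bool) -> str:
    """Under shadow mode, enforcing verdicts become audit; others pass through."""
    if shadow_mode and verdict in ENFORCING:
        return "audit"
    return verdict


def shadow_hits(events: list) -> Counter:
    """Count, per tool, the events whose reason carries the [shadow] prefix."""
    return Counter(
        e["tool_name"] for e in events
        if e.get("reason", "").startswith("[shadow]")
    )


# Hypothetical exported events.
events = [
    {"tool_name": "shell.exec", "reason": "[shadow] would …"},
    {"tool_name": "shell.exec", "reason": "[shadow] would …"},
    {"tool_name": "fs.read", "reason": "policy default"},
]
hits = shadow_hits(events)
```

The tally is the list to review before flipping shadow off: every tool in it is one the policy will start enforcing against.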

4. A practical testing order

The three tools compose into one safe-rollout path — cheapest check first, widest coverage last:
1

Dry-run the rules you just wrote

Use Test to confirm each new rule fires on the calls it should and passes the ones it shouldn’t — including the negative cases. Fast, Developer+, nothing persisted.
2

Gauge the posture (optional)

If you’re reaching for an autonomy level rather than hand-written rules, Simulate the level and read the would-be-blocked count against real traffic before applying it.
3

Shadow against live traffic

Turn on shadow mode and let a representative window of real calls flow. Read the [shadow] would … events; tighten any rule that surfaces a false positive — still in shadow, zero blast radius.
4

Enforce

When the feed fires on what you expect and nothing you don’t, flip shadow off. The next call enforces for real.
Testing previews the policy, not governed skills. A skill in block or quarantine mode still enforces even under a shadowed policy — the skill’s review disposition wins. Shadowing a policy was never a request to un-quarantine a skill.

5. API reference

These management routes use your session / access token and are workspace-scoped:
| Method & path | Role | Purpose |
| --- | --- | --- |
| POST /api/workspace/firewall/test | Developer+ | Dry-run one synthetic tool call against the resolved (or a draft policy_id) policy. Returns verdict, policy_name, rule_label, reason, gap, shadow_mode. Nothing dispatched or logged. |
| GET /api/workspace/firewall/simulate?level= | Member | Replay an autonomy level against recent events; returns would-be-blocked counts. |
| GET /api/workspace/firewall/policies/:id | Member | Read a policy’s current shadow_mode flag. |
| PUT /api/workspace/firewall/policies | Developer+ | Toggle shadow_mode on the policy. |
The Test body takes surface (default inbound), a required tool_name, optional args_json, and an optional policy_id to override resolution.
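As a rough sketch, the Test body can be assembled and wrapped in a request like this. The field names, the default surface, and the route come from this page. Everything else is an assumption to verify against your workspace: the base URL, sending the access token as a `Bearer` header, and encoding `args_json` as a JSON string rather than a nested object. The code only constructs the request; it does not send it.

```python
import json
import urllib.request
from typing import Optional

SURFACES = ("inbound", "response", "mcp", "egress")  # surfaces named on this page


def build_test_body(tool_name: str, args: Optional[dict] = None,
                    surface: str = "inbound",
                    policy_id: Optional[str] = None) -> dict:
    """Assemble the body for POST /api/workspace/firewall/test."""
    if not tool_name:
        raise ValueError("tool_name is required")
    if surface not in SURFACES:
        raise ValueError(f"unknown surface: {surface!r}")
    body = {"surface": surface, "tool_name": tool_name}
    if args is not None:
        body["args_json"] = json.dumps(args)  # assumption: JSON-encoded string
    if policy_id is not None:
        body["policy_id"] = policy_id  # overrides policy resolution
    return body


def build_test_request(base_url: str, token: str, body: dict) -> urllib.request.Request:
    """Wrap the body in a POST request (constructed here, not sent)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/workspace/firewall/test",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumption: bearer-style token
        },
        method="POST",
    )


body = build_test_body("shell.exec", {"command": "rm -rf /data"}, surface="response")
req = build_test_request("https://console.example.com", "ACCESS_TOKEN", body)
```

Passing the request to `urllib.request.urlopen(req)` would send it; keeping construction separate from sending makes the body easy to inspect and unit-test first.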

Where to go next

Shadow mode

The live-traffic rollout: [shadow] would …, the events filter, and the flip to enforce.

Validate arguments

Scope a rule to specific arguments: the clauses the Test sandbox lets you verify with rm -rf vs ls -la.

Verdicts

What allow / audit / deny / sanitize / pending_approval / cap_cost each do when a test stops being a test.

Events log

Where shadowed verdicts land — filter, drill into runs and matched rules.
For the rule-matching grammar these tests exercise, see the full firewall rules reference; for where testing fits the broader model, see enforcement modes.