The control stack: keys → guardrails → firewall → audit

OrcaRouter applies four controls (scoped keys, guardrails, the Agent Firewall, and audit) to every request in a fixed order, across the six stages described below. Each control is independent, workspace-scoped, and attaches to an API key with no code change. This page walks one request through every stage in sequence, then covers resolution order and fail-open/fail-closed behavior. For a broader introduction see Securing AI agents with OrcaRouter.

1. Layer 1 — Scoped API key

The key is the first gate. Before any content is inspected or any model is called, the gateway resolves the calling key and decides whether the request is even permitted. What the key carries:

model_limits — the set of models the key may call. A request for a model outside the list is rejected immediately.
allow_ips — optional IP allowlist. A request from an unlisted source is rejected.
credit_limit_usd — a hard spend cap. A request that would push the key over the cap is rejected.
expiry — a hard expiry date. Expired keys are rejected.
environment — a tag (production, staging, dev, …) to organize and identify the key by deployment environment.
guardrail_id — the guardrail policy that binds to this key (see Layer 2 and Layer 5).
firewall_policy_id — the firewall policy that binds to this key (see Layer 4).
is_firewall_gateway — flags the key as a firewall-gateway-scoped token; required for the evaluate and MCP gateway routes.

A request that fails key validation never reaches a model — and is never metered. Where to configure: Console → API Keys, or the token API. Requires Developer+ to create or edit; is_firewall_gateway and plaintext key reads require Admin. For the full key model see Scope, keys, policies, and workspaces.
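The checks a key carries can be pictured as a single gate function. The following is a minimal sketch, not OrcaRouter's implementation: the field names come from the list above, but the types, the `spent_usd` counter, the check order, the rejection strings, and the treatment of an empty `model_limits` as "unrestricted" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class ScopedKey:
    # Field names mirror the list above; types and defaults are assumptions.
    model_limits: set[str] = field(default_factory=set)  # assumed: empty = unrestricted
    allow_ips: list[str] = field(default_factory=list)   # assumed: empty = no IP restriction
    credit_limit_usd: Optional[float] = None              # None = no cap
    spent_usd: float = 0.0                                # hypothetical running total
    expiry: Optional[datetime] = None                     # None = never expires


def validate_key(key, model, source_ip, estimated_cost_usd, now=None):
    """Return None if the request may proceed, else a (hypothetical) rejection reason."""
    now = now or datetime.now(timezone.utc)
    if key.expiry is not None and now >= key.expiry:
        return "key_expired"
    if key.allow_ips and not any(
        ip_address(source_ip) in ip_network(entry, strict=False)
        for entry in key.allow_ips
    ):
        return "ip_not_allowed"
    if key.model_limits and model not in key.model_limits:
        return "model_not_allowed"
    if (
        key.credit_limit_usd is not None
        and key.spent_usd + estimated_cost_usd > key.credit_limit_usd
    ):
        return "credit_limit_exceeded"
    return None
```

Every rejection returns before any content inspection or model call, which is why a request that fails here is never metered.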

2. Layer 2 — Input guardrails

Once the key is validated, the gateway runs the bound guardrail’s input-stage rules against the request text — before any upstream model call. What it sees: The caller’s messages as submitted. (A prompt injected from a registry prompt is appended later, in the routing stage; input rules do not see it.) Rule types available: keyword, regex, pii, max_chars, external, llm_judge, grounding. Actions a rule can produce:

Action	What happens
`block`	Request is rejected — HTTP 400 `guardrail_blocked`. No quota is charged. Marked skip-retry.
`mask`	Match is redacted (e.g. `jane@acme.com` → `[EMAIL]`). The sanitized text continues to the model.
`flag`	Match is recorded; traffic is unchanged.

A block at this stage means the model is never called. Cost: zero. The caller sees a structured error naming the guardrail and the rule that fired. Where to configure: Console → Guardrails, or the guardrail API. Requires Developer+ to create or modify. See Guardrails for the full rule reference.
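To make the three actions concrete, here is a small sketch of how block, mask, and flag could behave over a list of regex rules. It is illustrative only: the `Rule` shape, the `GuardrailBlocked` exception, the evaluation order, and the sample patterns are hypothetical, and real pii or llm_judge rules are more involved than a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    pattern: str                    # a regex standing in for keyword/regex/pii rules
    action: str                     # "block", "mask", or "flag"
    placeholder: str = "[REDACTED]"


class GuardrailBlocked(Exception):
    """Stands in for the HTTP 400 guardrail_blocked response."""

    def __init__(self, rule_name):
        super().__init__(f"guardrail_blocked: {rule_name}")
        self.rule_name = rule_name


def apply_input_rules(text, rules):
    """Return (possibly masked text, names of rules that matched)."""
    matches = []
    for rule in rules:
        if not re.search(rule.pattern, text):
            continue
        matches.append(rule.name)
        if rule.action == "block":
            raise GuardrailBlocked(rule.name)   # the model is never called
        if rule.action == "mask":
            text = re.sub(rule.pattern, rule.placeholder, text)
        # "flag": the match is recorded, the text is left unchanged
    return text, matches


rules = [
    Rule("email", r"[\w.+-]+@[\w-]+\.[\w.]+", "mask", "[EMAIL]"),
    Rule("codename", r"(?i)project x", "flag"),
    Rule("api-secret", r"sk-[A-Za-z0-9]{8,}", "block"),
]
clean, matched = apply_input_rules("Contact jane@acme.com about Project X", rules)
# clean   == "Contact [EMAIL] about Project X"
# matched == ["email", "codename"]
```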

3. Layer 3 — Model runs

If the key is valid and input guardrails pass, the gateway forwards the request to the upstream model. This is the only layer with no configurable enforcement — it is simply the model doing its job. The firewall operates on the actions the model produces (Layer 3 → Layer 4 below), not on the model itself. Routing, fallbacks, and load balancing happen here transparently.

4. Layer 4 — Agent Firewall (tool calls and egress)

After the model responds — or inline, as tool calls are emitted — the Agent Firewall judges every action the model requests. Four enforcement surfaces:

Surface	What the firewall sees
`inbound`	Tool definitions the agent advertises to the model. Block a dangerous tool before the model can choose it.
`response`	`tool_calls` the model emits in its reply.
`mcp`	A `tools/call` dispatched through the Firewall MCP gateway or the evaluate hook.
`egress`	An outbound network destination (host / IP / CIDR) reported by a tool — the SSRF and data-exfiltration surface.

Verdicts a rule (or the default) can produce:

Verdict	What it does
`allow`	Call proceeds. Logged.
`audit`	Call proceeds; recorded for review. The default `default_verdict`.
`deny`	Call blocked — HTTP 400 `firewall_blocked` on the `inbound` surface; a tool error on `mcp`.
`sanitize`	Matched substrings are redacted from tool arguments; the cleaned call proceeds. On `inbound` (no arguments yet), escalates to deny.
`pending_approval`	Call is held; an out-of-band reviewer approves or rejects; the agent re-submits with a single-use approval token.
`cap_cost`	Deny once the agent run’s accumulated spend exceeds a per-rule cents cap.

A deny on the inbound surface costs no model tokens — the block fires before the upstream call. A pending_approval hold returns HTTP 400 firewall_approval_pending with an approval id the client polls. Where to configure: Console → Firewall, or the firewall API. Requires Developer+ to create or modify policies and rules. See Firewall and Firewall rules for the full rule language.
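The interaction between rules, the default verdict, and the inbound escalation can be sketched as follows. This is a simplified illustration under stated assumptions: the `FirewallRule` shape and first-match-wins ordering are hypothetical, argument clauses and cost caps are omitted, and only the verdict names, surface names, the `audit` default, and the sanitize-to-deny escalation on `inbound` come from the text above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class FirewallRule:
    tool_glob: str      # e.g. "shell.*"
    verdict: str        # allow / audit / deny / sanitize / pending_approval / cap_cost
    surfaces: tuple = ("inbound", "response", "mcp", "egress")


def judge(surface, tool_name, rules, default_verdict="audit"):
    """Pick a verdict for one action; first matching rule wins (an assumption)."""
    verdict = default_verdict
    for rule in rules:
        if surface in rule.surfaces and fnmatchcase(tool_name, rule.tool_glob):
            verdict = rule.verdict
            break
    # On inbound there are no arguments to redact yet, so sanitize escalates to deny.
    if surface == "inbound" and verdict == "sanitize":
        verdict = "deny"
    return verdict


rules = [
    FirewallRule("shell.*", "deny"),
    FirewallRule("http.*", "sanitize"),
]
```

With these rules, `judge("response", "http.get", rules)` yields `sanitize`, the same tool on `inbound` yields `deny`, and an unmatched tool such as `files.read` falls through to the `audit` default.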

5. Layer 5 — Output guardrails

After the model responds (and after any tool-call cycle completes), the gateway runs the bound guardrail’s output-stage rules against the response text before it reaches the caller. The same rule types and actions apply as in Layer 2. An output block returns HTTP 400 guardrail_blocked and refunds the pre-consumed quota — the caller pays nothing.

Streaming and output masking. A block action is enforced on both streaming and non-streaming responses — on a stream, the scanner cuts mid-flight and emits a replacement. A mask action on output currently applies only to non-streaming responses; on a streaming response the original chunk passes through unmasked. Verify your stage/stream combination in the guardrail sandbox before depending on it.
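The mid-flight cut on a stream can be pictured with a generator that scans the accumulated text as chunks arrive. This is a sketch of the general technique rather than the gateway's scanner: the function name, the `is_blocked` predicate, and the replacement string are hypothetical.

```python
def scan_stream(chunks, is_blocked, replacement="[response blocked]"):
    """Yield chunks until the accumulated text trips a block rule, then cut."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        if is_blocked(seen):
            yield replacement   # emitted in place of the offending chunk
            return              # nothing after the cut reaches the caller
        yield chunk


out = list(scan_stream(["hello ", "sec", "ret ", "tail"], lambda s: "secret" in s))
# out == ["hello ", "sec", "[response blocked]"]
```

In this simplified version, chunks sent before the rule fires have already left, which is one reason to verify the exact stage/stream combination in the sandbox.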

6. Layer 6 — Audit

Every match, verdict, and approval decision is written to the audit trail, correlated to the agent run and session that caused it. This is not a separate enforcement step — it runs in parallel with Layers 2–5 — but it is the layer that makes the others accountable. What is logged:

Guardrail matches: rule type, action, stage, detail string, and (if Log raw content is enabled) the matched substring.
Firewall events: surface, tool name, verdict, matched rule, reason code, risk factors, and the run/session the call belongs to.
Approval decisions: who approved or rejected, when, and whether the underlying rule changed between hold and decision.
Policy changes: every create, update, delete, and autonomy-level change writes a versioned audit row.

Where to review: Console → Guardrails → Matches; Console → Firewall → Events, Runs & Sessions, Audit. The guardrail Matches feed is open to any workspace Member; the firewall Events and Runs & Sessions feeds require Developer+.

7. Summary table

Layer	What it controls	What it sees	Outcome on a hit	Where to configure
1. Scoped key	Identity, model access, spend, IP, expiry	The request’s auth token	HTTP 4xx before anything runs; no metering	Console → API Keys (Developer+)
2. Input guardrails	Request text content	Caller’s messages	Block (HTTP 400 `guardrail_blocked`, no charge), mask, or flag	Console → Guardrails (Developer+)
3. Model	—	—	—	Routing / channel config
4. Agent Firewall	Tool calls, MCP dispatch, egress	Tool name, arguments, destination	allow / audit / deny / sanitize / pending_approval / cap_cost	Console → Firewall (Developer+)
5. Output guardrails	Response text content	Model’s reply	Block (HTTP 400, quota refunded), mask, or flag	Console → Guardrails (Developer+)
6. Audit	Attribution and trail	All of the above	Immutable log entry	Console → Matches (Member) / Events & Runs (Developer+)

8. Policy resolution order

For any request, the active guardrail and firewall policy are resolved independently:

Key attachment — if the key carries an explicit guardrail_id or firewall_policy_id, that policy binds (when it exists and is enabled).
Workspace default — if the key has no attachment, the workspace’s enabled is_default guardrail or policy applies.
Neither — no enforcement. The request is byte-identical to a workspace that never enabled the feature.

The two planes differ when an attached policy is disabled: a disabled guardrail attachment turns guardrails off for that key (no fallback), while a disabled firewall attachment falls back to the workspace default firewall policy. At most one guardrail and one firewall policy per workspace can be the default. Promoting a new default demotes the old one in the same transaction.
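The resolution order, including the asymmetry for disabled attachments, can be expressed as one small function. This is a minimal sketch, assuming a hypothetical `Policy` record and an in-memory dictionary of policies per plane; treating a missing attachment the same way as a disabled one is also an assumption, since the text above only specifies the disabled case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    id: str
    enabled: bool = True
    is_default: bool = False


def resolve(plane: str, attached_id: Optional[str], policies: dict) -> Optional[Policy]:
    """plane is "guardrail" or "firewall"; policies maps id -> Policy for that plane."""
    default = next(
        (p for p in policies.values() if p.is_default and p.enabled), None
    )
    if attached_id is None:
        return default              # workspace default, or None for no enforcement
    attached = policies.get(attached_id)
    if attached is not None and attached.enabled:
        return attached             # an enabled key attachment always binds
    # Attachment is disabled (or missing, handled identically here by assumption):
    # the firewall falls back to the workspace default, guardrails turn off.
    return default if plane == "firewall" else None


policies = {
    "p1": Policy("p1", enabled=False),
    "p2": Policy("p2", is_default=True),
}
```

Given this data, a key attached to the disabled `p1` resolves to no policy on the guardrail plane and to `p2` on the firewall plane.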

9. Fail-open and fail-closed

Two behaviors, applied to different cases.

Fail-open (transient errors): If policy resolution hits a transient error (a database hiccup, a network blip on the way to an external vendor), the gateway degrades to no enforcement rather than taking traffic down. Safety degrades; availability is preserved. Configure fail_open: false on external or llm_judge rules when a missed check is unacceptable for your policy.

Fail-closed (ambiguous cases): Where not enforcing would defeat the rule, the engine fails closed: an egress report with an unresolvable destination is denied; an unreachable approval store holds the call rather than passing it; a skill whose ownership cannot be resolved is blocked. Availability is preserved on the happy path; safety is not silently skipped on the edge cases that matter.
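The effect of the fail_open setting on a rule that calls out to a vendor can be sketched as a wrapper around the check. This is illustrative only: `run_external_rule`, the `check` callable, and the `"pass"`/`"block"` return values are hypothetical, and mapping a missed check to a block when fail_open is false is an assumption about the outcome rather than documented behavior.

```python
def run_external_rule(check, text, fail_open=True):
    """check(text) returns True on a match and may raise on a transient error."""
    try:
        return "block" if check(text) else "pass"
    except (TimeoutError, ConnectionError):
        # fail_open=True: skip the check and keep traffic flowing.
        # fail_open=False: treat the missed check as a block (assumed outcome).
        return "pass" if fail_open else "block"


def flaky_vendor(text):
    # Simulates a vendor that cannot be reached.
    raise TimeoutError("vendor timeout")
```

With the default, `run_external_rule(flaky_vendor, "hi")` lets the request through; with `fail_open=False` the same outage results in a block.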

See Enforcement modes for the full decision tree, and How OrcaRouter inspects requests for the relay-path mechanics.

10. Deep dives

Guardrails

Full rule reference — types, actions, PII entities, LLM judge, grounding, and the test sandbox.

Firewall

Policy model, verdicts, surfaces, shadow mode, HITL approval, and anomaly detection.

Firewall rules

The rule matching language — tool globs, argument clauses, egress lists, and sanitizers.

Guardrails vs. Firewall

Which layer catches which threat — and when you need both.

Scope, keys, and policies

The full key model: what a key carries and how policies resolve.

Enforcement modes

Fail-open vs. fail-closed — the full decision tree.

Every call through OrcaRouter passes all four enforcement layers in order — key validation, input screening, firewall judgment, output screening — with a full audit trail written across all of them.