rm -rf / the model echoes into a shell tool, a UNION SELECT it emits for a SQL runner to execute. A content policy that only
thinks about PII or secrets misses all four. The Agent preset
category exists for exactly this shape — deterministic regex rules that
block the request or response before a downstream tool ever acts on
it.
This is a focused landing for the agentic use case. For the complete
guardrail engine — every rule type, field, stage, and route — see the
Guardrails reference.
1. Why agent guardrails are a distinct surface
A guardrail screens content: the text in the request and the text in the response. For an agent, that text becomes an action: the URL gets fetched, the markdown gets rendered, the shell line gets run, the SQL gets executed. So the same block / mask engine you use for PII does
double duty here — it stops a payload at the gateway before the agent’s
tool layer can turn it into a side effect.
The Agent category ships four presets, each one regex rule with
action block, split across the two stages:
URL Filter — input, block
Blocks any http(s) URL on the request. Use it for agent flows
where outbound URLs must be allowlisted rather than open. The seeded
pattern matches any URL; edit the regex to permit specific domains.
Markdown Image Block — output, block
Blocks markdown image embeds (![alt](url)) in the model’s
response. Defends against image-rendering exfiltration on clients
that auto-load remote images, a classic data-leak channel where a
rendered image URL smuggles data out.
Tool Call Shell Block — input, block
Blocks obvious shell-injection patterns in the request (rm -rf /,
curl … | sh, wget … | bash, sudo escalation). Use it for
agent flows that may forward user input into a shell tool.
SQL Injection in Output — output, block
Blocks model responses that carry classic SQL-injection payloads
(UNION SELECT, OR 1=1, DROP TABLE, comment terminators).
Defense-in-depth for tools that auto-execute SQL the model produced.
Two presets screen input, two screen output. URL Filter and Tool Call
Shell Block fire on the request — before the model runs, before any
quota is metered. Markdown Image Block and SQL Injection in Output fire on
the response — after the model answers, before the content reaches
your client or its tool layer. Knowing which stage a risk lives on is the
whole game; see Input stage and
Output stage.
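The stage split above can be sketched offline in a few lines of Python. This is a minimal illustration, not the gateway's implementation: the patterns below are hypothetical stand-ins for the seeded regexes (which this page does not list), the Rule class and first_block helper are invented names, and Python's re module is used in place of RE2. The patterns avoid lookarounds and backreferences so they stay within the RE2 feature set, and the SQL pattern leaves out comment terminators for brevity.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    stage: str  # "input" (request) or "output" (response)
    pattern: re.Pattern


# ASSUMPTION: illustrative patterns only, not the seeded preset regexes.
RULES = [
    Rule("URL Filter", "input",
         re.compile(r"https?://\S+", re.IGNORECASE)),
    Rule("Markdown Image Block", "output",
         re.compile(r"!\[[^\]]*\]\([^)]*\)")),
    Rule("Tool Call Shell Block", "input",
         re.compile(
             r"rm\s+-rf\s+/|\b(?:curl|wget)\b[^|\n]*\|\s*(?:sh|bash)\b|\bsudo\s",
             re.IGNORECASE)),
    Rule("SQL Injection in Output", "output",
         re.compile(
             r"\bUNION\s+SELECT\b|\bOR\s+1\s*=\s*1\b|\bDROP\s+TABLE\b",
             re.IGNORECASE)),
]


def first_block(text: str, stage: str) -> Optional[str]:
    """Return the name of the first rule at `stage` that matches, else None."""
    for rule in RULES:
        if rule.stage == stage and rule.pattern.search(text):
            return rule.name
    return None
```

Note that a rule only fires at its own stage: in this sketch a bare URL in a response passes, because URL Filter is an input-stage rule, while the same URL wrapped in a markdown image is caught at the output stage.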
2. Apply an agent guardrail in the console
Every step here is a console action on the hosted gateway under your own session. Creating and editing guardrails requires Developer+ in the workspace. Only the final /v1/* call uses an sk-orca-... relay
key — the guardrail itself is configured entirely in the console.
Open the template
In the console, open Guardrails, click the New guardrail
split-button, and pick a preset from the Agent template category,
e.g. Markdown Image Block. It seeds the single regex block
rule at the right stage.
Name and save
Give it a name (≤ 64 chars), e.g. agent-rails, and save. A preset is
a seed, not a lock: add the other three Agent rules or edit the regex
freely afterward (see §4).
Test it in the sandbox
Open the Test tab inside the editor, paste a sample, pick the
matching stage, and run the current policy locally — no upstream
call, no quota (see §3).
Attach a key
Edit an API key and pick agent-rails from the Guardrail dropdown
(sets guardrail_id on the key), or mark it the workspace
default. See Attach to a key
and Account default.
3. Prove it before you attach
Prove the rule fires before any key points at it. Open the Test tab, pick the output stage, and paste a response that an attacker-poisoned page might have coaxed the model into emitting.
4. Compose and tune the rules
The four presets are seeds. The common move is to combine them into one agent-rails guardrail and tighten each regex to your stack:
Allowlist URLs
Start from URL Filter, then edit the regex so it blocks every
URL except your sanctioned domains: invert the match to an
allowlist instead of a blanket block.
Author your own detectors
Add a regex rule for any payload shape your tools care about.
Patterns use RE2 syntax: linear-time matching, no
backreferences. Patterns compile once and cache across requests.
5. What a block looks like
Every Agent preset uses the block action. A blocked request returns HTTP 400 with error code guardrail_blocked and a message naming the
guardrail and the rule that fired.
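On the client side, a block can be told apart from other 400 responses by checking the error code. The sketch below is a hedged example rather than the gateway's documented schema: it assumes the JSON body nests the code and message under an "error" object, which this page does not confirm, and guardrail_block_message is a hypothetical helper name. Adjust the field lookups to the response your deployment actually returns.

```python
from typing import Any, Optional


def guardrail_block_message(status: int, body: Any) -> Optional[str]:
    """Return the block message if this looks like a guardrail block, else None."""
    if status != 400 or not isinstance(body, dict):
        return None
    # ASSUMPTION: error details live under an "error" object in the body.
    error = body.get("error")
    if not isinstance(error, dict):
        return None
    if error.get("code") != "guardrail_blocked":
        return None
    return str(error.get("message", ""))
```

Because the Agent rules are deterministic regexes, resending the same content unchanged should hit the same rule again, so a caller would typically surface the message instead of retrying.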
6. Guardrails are content; the firewall is tool calls
Agent guardrails are a strong first layer, but they reason about strings, not tool semantics. They block a shell line in the content, but they do not understand that the model emitted a structured tool_call
to a destructive tool, or that an outbound request is heading to a
metadata IP.
That tool-call layer is the Firewall: it evaluates
the model’s emitted tool_calls, MCP tools/call, and outbound egress
with verdicts like allow / audit / deny / pending_approval. The
two compose — guardrails screen the text, the firewall governs the
action.
Firewall
Govern the model’s emitted tool calls, MCP calls, and egress with
allow / audit / deny / approval verdicts.
Guardrails vs. Firewall
When to reach for a content guardrail vs. a tool-call firewall — and
how to run both.
Securing AI agents
The full agent control stack: content, tools, MCP, and egress.
Excessive agency
The threat these rails address — an agent that does more than it
should.
7. See what fired
Every rule that fires records a match (rule type, action, stage, and a detail string), surfaced in the workspace Matches feed. The matched substring itself is recorded only when Log raw content is on; it is off by default. Group and filter the feed by guardrail, rule type, and action to watch your agent-rule hit rate and tune false positives. See Matches feed, Logging & privacy, and Tune false positives.
8. Where to go next
Output-stage rules
How response screening works for Markdown Image Block and SQL
Injection in Output.
Regex detectors
Author your own RE2 patterns to extend the Agent rules.
Data exfiltration
The exfil channel Markdown Image Block closes.
Dangerous tool calls
Why a content rail alone isn’t enough — pair it with the firewall.
