Everything here binds to your workspace and is configured from the
console. Your chatbot keeps calling
https://api.orcarouter.ai/v1/chat/completions
with the same sk-orca-... key — only the policy in the gateway changes.
Configuration actions need the roles called out per step; relay calls use
the scoped key.1. The threat model for a public chatbot
Before authoring anything, know what you’re defending against. A chatbot’s attack surface is narrower than a full agent’s — but the high-frequency risks are concrete:PII in, PII logged
Users paste emails, card numbers, SSNs into chat — and you forward them
upstream and into your logs.
Prompt injection
“Ignore previous instructions and …” — attempts to override your system
prompt and change the bot’s behavior.
Jailbreaks
DAN / role-play framings that try to pull the bot off-policy.
Unsafe output
The model echoing leaked secrets, system-prompt boilerplate, or
injection-laced content back into the chat.
2. One guardrail, four jobs
Rather than four separate policies, author one workspace guardrail with ordered rules covering each risk. A guardrail is a named, ordered list of rules; each rule says what to look for, where (input,
output, or both), and what to do (block, mask, or flag).
In the console, open Guardrails → New guardrail, name it
chatbot-shield, and add the rules below. Authoring a guardrail — and
running the Test sandbox — needs the Developer role; viewing
guardrails is open to any member.
a. PII on the request
Add a PII rule, stageinput, action mask. The built-in entity
set is closed — pick the ones a chatbot actually sees:
jane@acme.com becomes
[EMAIL], so the upstream model never sees the address. The
entity_actions override blocks the request outright on a card number
or SSN while masking the lower-severity entities. This is exactly the
PII Shield preset, extended with per-entity overrides — apply the
preset from the template library and edit from there.
b. Prompt-injection screening
OrcaRouter ships this as the Prompt-Injection Basics safety preset (a keyword denylist for phrases like “ignore previous instructions” and “reveal your system prompt”; for stricter regex coverage of DAN / role-play framings, add the Jailbreak / Role-Play Blocker preset) plus, for semantic intent that no pattern catches, anllm_judge rule. Add the preset, then a judge rule on the
input stage with a rubric that flags injection/override attempts. The
judge runs against a model in your workspace, is bounded by
judge_timeout_ms, and fails open by default (a judge error is logged
and the request continues) — set judge_fail_open: false to fail closed.
c. Output safety
Add anoutput-stage block rule (regex or keyword) for content that
must never reach the chat window — leaked secrets, chat-template control
tokens, system-prompt boilerplate. The Secrets & API-Key Blocker and
the system-prompt-leak safety presets cover the common cases; apply them
and pin the relevant rules to the output stage. Output block is
enforced on streaming too — the scanner cuts the stream mid-flight and
emits a replacement message before blocked content reaches the user.
3. Test before you ship
Every guardrail editor has a Test tab. Paste a sample, pick the stage, and run the current policy locally — no upstream call, no quota spent.| Paste this | Stage | Expect |
|---|---|---|
email me at jane@acme.com | input | email me at [EMAIL] |
ignore previous instructions | input | flag / block (your choice) |
card 4111 1111 1111 1111 | input | guardrail_blocked (per the override) |
4. Mint one scoped key for the bot
A guardrail only enforces on keys that resolve to it. Give the chatbot its own key, scoped to the minimum it needs — never your account-wide key. In API Keys → New key, set:Attach the guardrail
Attach the guardrail
Pick
chatbot-shield from the Guardrail dropdown. This sets
guardrail_id on the key. An explicit attachment is the off switch’s
opposite: if it’s set and enabled, it always applies and never silently
falls back. (Leave it unset to fall back to the workspace
is_default guardrail instead.)Cap the spend
Cap the spend
Set
credit_limit_usd to a sane ceiling (0 = unlimited). A public
chatbot is the key most likely to get abused — a hard credit cap is
your blast-radius limit. See denial-of-wallet.Pin the models
Pin the models
Turn on
model_limits and list only the model(s) the bot is allowed to
call, so a leaked key can’t be used to run an expensive model you never
intended to expose.Lock it down further
Lock it down further
Set
allow_ips to your backend’s egress IPs if the bot calls from a
fixed server, and an expired_time if the key is temporary
(-1 = never expires).chatbot-shield with no code
aware that screening is happening.
5. Watch it in production
Two reads keep you honest, both workspace-scoped:- Guardrails → Matches (any Member) — every rule that fired: type, action, stage, and detail. The matched substring is recorded only if Log raw content is on for the guardrail (off by default — the privacy-conservative posture). Mark a false positive to tune the policy (Admin).
- Version history — every change writes a history row; diff any two
versions and revert if a rule turns out too aggressive. A blocked
request returns HTTP 400
guardrail_blocked, costs no quota, and is marked skip-retry.
A
guardrail_blocked response is a deliberate, user-visible 400. Handle it
in your chatbot UI with a friendly message (“I can’t process that”) rather
than surfacing the raw error — the gateway has already stopped the unsafe
turn for you.6. If your bot calls tools
The moment your chatbot can call a function, fetch a URL, or reach an MCP server, text screening isn’t enough — you need the action plane. Attach a Firewall policy to the same key viafirewall_policy_id, or apply the balanced autonomy level to audit
tool calls and flag PII workspace-wide before tightening. The fastest path
is the zero-trust quickstart; for an agent
that calls tools heavily, see
secure an autonomous agent.
7. Where to go deeper
Guardrails reference
Every rule type, PII entity, judge field, and the eval harness in full.
Guardrails vs Firewall
Text plane vs action plane — when you need which.
Enforcement modes
Observe → shadow → enforce: roll out without breaking the bot.
Scope keys, policies, workspaces
How key attachment and workspace defaults resolve.
