Skip to main content
A customer-facing chatbot takes untrusted input from the public and sends it to a model. That makes it the highest-exposure surface you run: users paste in PII you don’t want stored upstream, attackers try to override your system prompt, and the model can echo secrets or unsafe content back into the chat window. This recipe wires up the four controls that secure an AI chatbot end-to-end — a PII guardrail on the request, prompt-injection screening, output safety, and a single tightly-scoped key — all in the console, with zero change to your chatbot code.
Everything here binds to your workspace and is configured from the console. Your chatbot keeps calling https://api.orcarouter.ai/v1/chat/completions with the same sk-orca-... key — only the policy in the gateway changes. Configuration actions need the roles called out per step; relay calls use the scoped key.

1. The threat model for a public chatbot

Before authoring anything, know what you’re defending against. A chatbot’s attack surface is narrower than a full agent’s — but the high-frequency risks are concrete:

PII in, PII logged

Users paste emails, card numbers, SSNs into chat — and you forward them upstream and into your logs.

Prompt injection

“Ignore previous instructions and …” — attempts to override your system prompt and change the bot’s behavior.

Jailbreaks

DAN / role-play framings that try to pull the bot off-policy.

Unsafe output

The model echoing leaked secrets, system-prompt boilerplate, or injection-laced content back into the chat.
A plain chatbot has no tool calls, so this recipe leans on Guardrails — the text plane — rather than the Firewall. If your bot does call tools, layer the Firewall on top (see §6).

2. One guardrail, four jobs

Rather than four separate policies, author one workspace guardrail with ordered rules covering each risk. A guardrail is a named, ordered list of rules; each rule says what to look for, where (input, output, or both), and what to do (block, mask, or flag). In the console, open Guardrails → New guardrail, name it chatbot-shield, and add the rules below. Authoring a guardrail — and running the Test sandbox — needs the Developer role; viewing guardrails is open to any member.

a. PII on the request

Add a PII rule, stage input, action mask. The built-in entity set is closed — pick the ones a chatbot actually sees:
{
  "type": "pii",
  "stage": "input",
  "action": "mask",
  "entities": ["email", "phone", "credit_card", "ssn", "ip"],
  "entity_actions": { "credit_card": "block", "ssn": "block" }
}
A mask replaces each match with a typed tag — jane@acme.com becomes [EMAIL], so the upstream model never sees the address. The entity_actions override blocks the request outright on a card number or SSN while masking the lower-severity entities. This is exactly the PII Shield preset, extended with per-entity overrides — apply the preset from the template library and edit from there.
Input-stage PII masking is live today — it rewrites the request before the model sees it. Live masking of the streamed response is on the roadmap. To redact PII from what the bot says back, use an output block rule (enforced on streaming and non-streaming) or run the bot non-streaming where output masking applies. Prove your exact stage/stream combination in the Test tab first.

b. Prompt-injection screening

OrcaRouter ships this as the Prompt-Injection Basics safety preset (a keyword denylist for phrases like “ignore previous instructions” and “reveal your system prompt”; for stricter regex coverage of DAN / role-play framings, add the Jailbreak / Role-Play Blocker preset) plus, for semantic intent that no pattern catches, an llm_judge rule. Add the preset, then a judge rule on the input stage with a rubric that flags injection/override attempts. The judge runs against a model in your workspace, is bounded by judge_timeout_ms, and fails open by default (a judge error is logged and the request continues) — set judge_fail_open: false to fail closed.
Start the injection rules at flag, watch the Matches feed for a day on real traffic, then promote to block once you’ve confirmed they fire on attacks and not on legitimate questions. See enforcement modes.

c. Output safety

Add an output-stage block rule (regex or keyword) for content that must never reach the chat window — leaked secrets, chat-template control tokens, system-prompt boilerplate. The Secrets & API-Key Blocker and the system-prompt-leak safety presets cover the common cases; apply them and pin the relevant rules to the output stage. Output block is enforced on streaming too — the scanner cuts the stream mid-flight and emits a replacement message before blocked content reaches the user.

3. Test before you ship

Every guardrail editor has a Test tab. Paste a sample, pick the stage, and run the current policy locally — no upstream call, no quota spent.
Paste thisStageExpect
email me at jane@acme.cominputemail me at [EMAIL]
ignore previous instructionsinputflag / block (your choice)
card 4111 1111 1111 1111inputguardrail_blocked (per the override)
For adversarial coverage, the Eval tab runs the policy against bundled red-team corpora (or your own JSONL) and reports how it scored — tune the judge rubric until it catches known attacks without flagging benign chat.

4. Mint one scoped key for the bot

A guardrail only enforces on keys that resolve to it. Give the chatbot its own key, scoped to the minimum it needs — never your account-wide key. In API Keys → New key, set:
Pick chatbot-shield from the Guardrail dropdown. This sets guardrail_id on the key. An explicit attachment is the off switch’s opposite: if it’s set and enabled, it always applies and never silently falls back. (Leave it unset to fall back to the workspace is_default guardrail instead.)
Set credit_limit_usd to a sane ceiling (0 = unlimited). A public chatbot is the key most likely to get abused — a hard credit cap is your blast-radius limit. See denial-of-wallet.
Turn on model_limits and list only the model(s) the bot is allowed to call, so a leaked key can’t be used to run an expensive model you never intended to expose.
Set allow_ips to your backend’s egress IPs if the bot calls from a fixed server, and an expired_time if the key is temporary (-1 = never expires).
The key is masked on display after creation — copy it once. Your chatbot backend now sends every user turn through chatbot-shield with no code aware that screening is happening.

5. Watch it in production

Two reads keep you honest, both workspace-scoped:
  • Guardrails → Matches (any Member) — every rule that fired: type, action, stage, and detail. The matched substring is recorded only if Log raw content is on for the guardrail (off by default — the privacy-conservative posture). Mark a false positive to tune the policy (Admin).
  • Version history — every change writes a history row; diff any two versions and revert if a rule turns out too aggressive. A blocked request returns HTTP 400 guardrail_blocked, costs no quota, and is marked skip-retry.
A guardrail_blocked response is a deliberate, user-visible 400. Handle it in your chatbot UI with a friendly message (“I can’t process that”) rather than surfacing the raw error — the gateway has already stopped the unsafe turn for you.

6. If your bot calls tools

The moment your chatbot can call a function, fetch a URL, or reach an MCP server, text screening isn’t enough — you need the action plane. Attach a Firewall policy to the same key via firewall_policy_id, or apply the balanced autonomy level to audit tool calls and flag PII workspace-wide before tightening. The fastest path is the zero-trust quickstart; for an agent that calls tools heavily, see secure an autonomous agent.

7. Where to go deeper

Guardrails reference

Every rule type, PII entity, judge field, and the eval harness in full.

Guardrails vs Firewall

Text plane vs action plane — when you need which.

Enforcement modes

Observe → shadow → enforce: roll out without breaking the bot.

Scope keys, policies, workspaces

How key attachment and workspace defaults resolve.