1. Prompt injection protection in three layers
No single check stops every injection. OrcaRouter gives you three complementary layers you can stack on one guardrail:Prompt-Injection Basics
A safety preset — a keyword rule that flags the classic
jailbreak phrases (“ignore previous instructions”, “reveal your
system prompt”) for review, without blocking. Deterministic, no
model call.
LLM-judge intent rule
An
llm_judge rule that asks a model in your workspace “is this an
attempt to override the system instructions?” — catching paraphrased
and obfuscated injection no fixed keyword list can. Bills a small
judge sub-line.Spotlight untrusted text
The
spotlight action wraps matched untrusted input in delimiters
(e.g. ⟦UNTRUSTED⟧…⟦/UNTRUSTED⟧) and tells the model to treat the
region as data, never instructions — the strongest defense for
indirect injection from retrieved or tool-returned content. Use
spotlight_whole to wrap the entire input.Why flag-then-judge. A keyword denylist is fast and free but brittle
— attackers reword around it. A judge is robust but costs a sub-call.
Run the preset to see what hits your traffic, then add the judge to
catch the rewordings. Both rules live on one guardrail and run on the
same request.
2. Start with the Prompt-Injection Basics preset
Every step here is a console action on the hosted gateway under your own session. Creating and editing guardrails requires Developer+ in the workspace. Only the final/v1/* call uses an sk-orca-... relay
key.
Open the template
In the console, open Guardrails, click the New guardrail
split-button, and pick Prompt-Injection Basics from the
Safety template category. It seeds a single
keyword rule on the
input stage with the flag action.Name and save
Name it (≤ 64 chars), e.g.
prompt-injection, and save. A preset is
a seed, not a lock — add or remove phrases freely afterward.Test it
Open the Test tab, paste a sample at the
input stage, and run
the policy locally — no upstream call, no quota (see
§4).Attach a key
Edit an API key and pick
prompt-injection from the Guardrail
dropdown (sets guardrail_id on the key), or mark it the workspace
default. See Attach to a key
and Account default.3. Catch what keywords miss — add an llm_judge rule
Keyword matching only catches the phrases you listed. Add anllm_judge
rule to the same guardrail to catch the intent behind a reworded
attack. Open the guardrail, Add rule, choose LLM judge, and
configure:
judge_model
judge_model
A model or router alias your workspace can already call. The judge
call routes through your channels, so its tokens bill and attribute
like any other call — as a judge sub-line.
judge_format
judge_format
One of
yes_no, score, or category. For an injection check,
yes_no is the natural fit (the console pre-selects it). With
score, set judge_threshold; with category, list the denied
judge_categories.judge_timeout_ms and judge_fail_open
judge_timeout_ms and judge_fail_open
judge_timeout_ms bounds the call (0 → engine default). With
judge_fail_open true (default) a judge error is recorded and the
request continues; set it false to treat an error or timeout as a
block where a missed check is unacceptable.4. Test before you attach
Prove the guardrail does what you expect before any key points at it. Open the Test tab inside the editor, paste an injection sample, pick theinput stage, and run:
5. See what fired
Every rule that fires records a match — rule type, action, stage, and a detail string — surfaced in the workspace Matches feed. While the guardrail is in flag mode, this feed is the value: it shows you how often injection phrases hit your traffic and what they look like, so you can decide whether to enforce.6. Stack it with stricter siblings
Prompt-Injection Basics is the gentle, flag-only starting point. The Safety template category ships stricter siblings you can layer on the same guardrail when you’re ready to block:| Preset | Action | Catches |
|---|---|---|
| Prompt-Injection Basics | flag | Classic phrases — the watch layer. |
| Jailbreak / Role-Play Blocker | block | DAN / developer-mode / “act as” patterns. |
| Jailbreak v2 Regex | block | Newer modes + invisible Unicode tag-byte smuggling. |
7. Guardrails screen text; the firewall governs actions
A guardrail stops the injected instruction from reaching the model. But a successful injection’s goal is usually to make an agent do something — call a dangerous tool, exfiltrate data, hit an internal host. That blast radius is the Firewall’s job: it evaluates the model’s emitted tool calls and candeny, sanitize
arguments, or require approval. Run both for defense in depth.
Prompt injection (threat)
The full threat model and where each control sits.
Jailbreaks
The persona-bypass cousin of injection.
Dangerous tool calls
What an injection tries to make an agent do — and how the firewall stops it.
Securing AI agents
The baseline control stack for agentic workloads.
llm_judge
field reference, versioning, and routes — read the
Guardrails reference.