Everything here is read-only or sandboxed: no user-facing block, no
production traffic affected. (Keyword, regex, and PII rules run entirely
locally; an llm_judge rule still calls its configured model, so an eval
over a judge policy does make that call.) The point is to break things
before launch, on your terms.

1. How to red team an AI agent before launch
A pre-launch red team answers three questions, and OrcaRouter has one tool for each:
- Does my guardrail catch attacks? Run the guardrail’s Eval harness against bundled adversarial corpora and read back precision / recall / F1.
- What would my firewall break? Turn on shadow mode and watch which real tool calls would be denied, without denying any of them yet.
- Is a tighter posture safe? Simulate an autonomy level to preview exactly what it would change against your traffic before you apply it.
2. Score your guardrail against adversarial corpora
The fastest way to know whether a content policy survives contact with an attacker is to throw a corpus of known attacks at it and read the score. The guardrail editor’s Eval tab does exactly that: it replays every sample in a corpus through your current policy and compares the verdict against each sample’s expected outcome. The replay runs locally against your rules, never against live traffic. OrcaRouter ships bundled red-team corpora so you don’t have to source your own. Among them:

| Corpus | What it is |
|---|---|
| advbench_harmful_behaviors | The canonical adversarial-suffix target set: every row is an unsafe request a guardrail should block. |
| anthropic_hh_redteam | Real multi-turn human red-team transcripts against an assistant. |
| deepset_prompt_injections | Labelled prompt-injection vs benign requests: a precision/recall baseline for an input-stage block. |
| databricks_dolly_benign | A pure benign baseline: an over-strict policy should block none of these. |

An eval run, for example against the deepset_prompt_injections corpus, reports:
- TP / FP / FN / TN — true/false positives and negatives, where a “false positive” includes catching an attack with the wrong action class (e.g. masking when you expected a block).
- Precision / Recall / F1 — the headline numbers. Low recall means attacks slip through; low precision means you’re blocking benign traffic.
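
As a worked example of how the headline numbers follow from the four counts, here is a minimal sketch. It is plain arithmetic, not OrcaRouter's implementation: the function name `score_eval`, the verdict strings, and the sample data are all hypothetical. It assumes, as described above, that catching an attack with the wrong action class counts as a false positive.

```python
def score_eval(samples):
    """Score (expected, actual) verdict pairs.

    Assumption: "allow" means the sample passed untouched; any other
    string ("block", "mask", ...) is an enforcing action class.
    """
    tp = fp = fn = tn = 0
    for expected, actual in samples:
        if actual == "allow":
            if expected == "allow":
                tn += 1          # benign sample left alone
            else:
                fn += 1          # attack slipped through
        elif actual == expected:
            tp += 1              # attack caught with the expected action
        else:
            fp += 1              # benign sample hit, or wrong action class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical corpus results: (expected verdict, actual verdict).
samples = (
    [("block", "block")] * 6    # attacks blocked as expected
    + [("block", "mask")] * 1   # attack caught, but with the wrong action
    + [("block", "allow")] * 2  # attacks missed
    + [("allow", "allow")] * 8  # benign requests passed
    + [("allow", "block")] * 1  # benign request blocked
)
result = score_eval(samples)
print(result)
```

With these made-up counts (TP 6, FP 2, FN 2, TN 8), precision, recall, and F1 all come out to 0.75. Note that under this convention the wrong-action sample lowers precision rather than recall.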
Where prompt-injection defense lives. The bundled Prompt-Injection
Basics preset is a keyword rule on the flag action — it surfaces
common jailbreak phrases for review without blocking the user. For
semantic injection intent that no keyword list captures, add an
llm_judge rule and red-team it the same way: eval it against
deepset_prompt_injections and anthropic_hh_redteam and read the F1.
See the guardrail reference.

3. Shadow-mode the firewall against real traffic
A guardrail eval tests text against a fixed corpus. Your firewall, by contrast, needs to be tested against the messy reality of what your agent actually does, and the safest way to do that before launch is shadow mode. Shadow mode is a per-policy flag that makes the firewall evaluate and log every tool call exactly as it would in production, but downgrade every enforcing verdict to audit. A deny becomes an audit row whose
reason is prefixed [shadow] would …. Nothing is blocked. Nothing
breaks. But the Events feed now shows you the precise list of calls
your policy would have rejected.
This is the firewall red team: author your strictest intended policy,
flip shadow mode on, run your agent through a realistic launch rehearsal,
then read the [shadow] would … events.
Author the policy, then shadow it
Exercise the agent like it's launch day
Run your real agent flows against the gateway with a key attached to
the shadowed policy. Every tool call — inbound, response, MCP
dispatch, egress — is evaluated and logged.
Read the would-block list
Open Firewall → Events (Developer+) and filter for the
[shadow] would … reasons. Each one is a call your policy would have denied in
production. Confirm every entry is a call you want denied — and that
nothing legitimate is on the list.
Flip shadow off to go live
Once the would-block list is clean, turn shadow mode off. The very
next matching call is enforced for real — no other change.
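
The "read the would-block list" step can be sketched as a small triage script over exported events. This is an illustration only: the event shape (a dict with `tool` and `reason` keys), the tool names, and the function `triage_shadow_events` are hypothetical, and this recipe does not describe an export format. The only detail taken from the text above is that shadowed denials carry a reason starting with `[shadow] would`.

```python
SHADOW_PREFIX = "[shadow] would"  # reason prefix described in this recipe


def triage_shadow_events(events, intended_denials):
    """Split shadow events into the would-block list and the surprises.

    events: iterable of dicts with hypothetical "tool" and "reason" keys.
    intended_denials: tool names you actually want denied in production.
    """
    would_block = [e for e in events
                   if e["reason"].startswith(SHADOW_PREFIX)]
    # Any would-blocked tool you did not plan to deny needs a rule change.
    unexpected = sorted({e["tool"] for e in would_block}
                        - set(intended_denials))
    return would_block, unexpected


# Hypothetical events from a launch rehearsal; the text after the prefix
# is left elided because its exact wording is not specified here.
events = [
    {"tool": "shell.exec", "reason": "[shadow] would …"},
    {"tool": "calendar.read", "reason": "[shadow] would …"},
    {"tool": "search.web", "reason": ""},
]
would_block, unexpected = triage_shadow_events(events, {"shell.exec"})
print(len(would_block), unexpected)
```

In this made-up run, two calls would have been blocked and `calendar.read` is the surprise: a legitimate call on the list, so the policy needs softening before shadow mode comes off.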
4. Simulate a tighter posture before you commit
The third red-team move is the cheapest: before you apply a stricter autonomy level, simulate it. The simulator previews what applying tight (or any level) would
change against your workspace’s recent traffic — how many calls would
flip to deny — without writing a single policy row.
Simulate tight before launch: if the preview shows a wall of
would-be denials on calls your agent depends on, you have rules to soften
before go-live, not an incident after it.
Simulate is preview-only — it never mutates your policies. Applying an
autonomy level is a separate, Developer+ action, and it’s one
transaction with one-click undo if the live result still surprises you.
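
To make the idea of a preview concrete, here is a minimal sketch of what "how many calls would flip to deny" means. It does not call or model the OrcaRouter simulator: the call records, the `preview_flips` helper, and the `tight_candidate` rule are hypothetical stand-ins chosen for illustration. The one property it mirrors is that the preview only reads recent traffic and writes nothing.

```python
from collections import Counter


def preview_flips(recent_calls, candidate_verdict):
    """Count calls, per tool, that are not denied today but would be
    denied under the candidate posture. Reads only; mutates nothing."""
    return Counter(
        call["tool"]
        for call in recent_calls
        if call["verdict"] != "deny" and candidate_verdict(call) == "deny"
    )


def tight_candidate(call):
    # Hypothetical stand-in for a stricter level: deny two risky tools,
    # leave every other verdict unchanged.
    if call["tool"] in {"shell.exec", "fs.write"}:
        return "deny"
    return call["verdict"]


# Hypothetical recent traffic with the verdicts currently in effect.
recent_calls = [
    {"tool": "fs.write", "verdict": "allow"},
    {"tool": "fs.write", "verdict": "allow"},
    {"tool": "shell.exec", "verdict": "deny"},   # already denied: no flip
    {"tool": "search.web", "verdict": "allow"},
]
flips = preview_flips(recent_calls, tight_candidate)
print(sum(flips.values()), dict(flips))
```

Here two `fs.write` calls would flip to deny. If the agent depends on that tool, that is a rule to soften before applying the level.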
5. The pre-launch red-team checklist
Put the three passes together, add a coverage check, and you have a launch gate:

| Pass | Tool | Green when |
|---|---|---|
| Content policy | Guardrail Eval vs attack + benign corpora | High recall on attacks, no blocks on benign |
| Action policy | Firewall shadow mode vs rehearsal traffic | Every [shadow] would … is intended |
| Coverage | Observe mode + Discovered tools | No surprising tool sits in a coverage gap |
| Posture | Simulate the target autonomy level | Preview matches what you expect |

https://api.orcarouter.ai/v1/... exactly as before.
6. Next steps
- Enforcement modes: Observe → shadow → enforce, the safe rollout this recipe rehearses.
- The Secure Agents baseline: What each autonomy level sets, and how simulate previews it.
- Prompt injection: The threat your guardrail eval is scoring against.
- Go live: The production cutover after the red team passes.
