Red-team your agent before launch

The day you put an agent in front of users is the worst day to find out a jailbreak walks straight through your content policy, or that a tool you forgot to govern fires on the first run. A pre-launch red team turns those surprises into a number you can read before you ship — and OrcaRouter gives you three ways to produce it, all without touching your agent code or sending a single live request you didn’t mean to. This recipe is the dry-run pass: measure a policy against known attacks, shadow it against your own traffic, and simulate a tighter posture before you commit to it.

Everything here is read-only or sandboxed — no user-facing block, no production traffic affected. (Keyword, regex, and PII rules run entirely locally; an llm_judge rule still calls its configured model, so an eval over a judge policy does make that call.) The point is to break things before launch, on your terms.

1. How to red team an AI agent before launch

A pre-launch red team answers three questions, and OrcaRouter has one tool for each:

Does my guardrail catch attacks?

Run the guardrail’s Eval harness against bundled adversarial corpora and read back precision / recall / F1.

What would my firewall break?

Turn on shadow mode and watch which real tool calls would be denied — without denying any of them yet.

Is a tighter posture safe?

Simulate an autonomy level to preview exactly what it would change against your traffic before you apply it.

The first tests your Guardrails (the text plane); the second and third test your Firewall (the action plane). A real launch checklist runs all three.

2. Score your guardrail against adversarial corpora

The fastest way to know whether a content policy survives contact with an attacker is to throw a corpus of known attacks at it and read the score. The guardrail editor’s Eval tab does exactly that: it replays every sample in a corpus through your current policy and compares the verdict against each sample’s expected outcome — replaying the corpus locally against your rules, never against live traffic. OrcaRouter ships bundled red-team corpora so you don’t have to source your own. Among them:

Corpus	What it is
`advbench_harmful_behaviors`	The canonical adversarial-suffix target set — every row is an unsafe request a guardrail should block.
`anthropic_hh_redteam`	Real multi-turn human red-team transcripts against an assistant.
`deepset_prompt_injections`	Labelled prompt-injection vs benign requests — a precision/recall baseline for an input-stage block.
`databricks_dolly_benign`	A pure benign baseline: an over-strict policy should block none of these.

Always pair an attack corpus with a benign one. A policy that blocks 100% of attacks but also blocks databricks_dolly_benign isn’t safe — it’s unusable. The benign run is your false-positive budget.

Run an eval against the bundled deepset_prompt_injections corpus:

curl https://api.orcarouter.ai/api/guardrail/123/eval \
  -H "Authorization: Bearer <your-session-token>" \
  -H "X-Workspace-Id: <workspace-id>" \
  -H "Content-Type: application/json" \
  -d '{ "corpus_name": "deepset_prompt_injections" }'

The /api/guardrail/* routes use your console session / access token, not an sk-orca-... relay key — and they’re workspace-scoped via X-Workspace-Id. In practice you’ll run this from the Eval tab in the console; the curl is here to show the shape. Running an eval is open to any Member.

The run reports the detection metrics computed against expected actions:

TP / FP / FN / TN — true/false positives and negatives, where a “false positive” includes catching an attack with the wrong action class (e.g. masking when you expected a block).
Precision / Recall / F1 — the headline numbers. Low recall means attacks slip through; low precision means you’re blocking benign traffic.

Open the run to inspect the failures sample by sample, tune the rule or the judge rubric, and re-run until the score holds. Custom corpora work the same way — upload your own JSONL (Developer+) to test against the exact attack shapes your product faces.

Where prompt-injection defense lives. The bundled Prompt-Injection Basics preset is a keyword rule on the flag action — it surfaces common jailbreak phrases for review without blocking the user. For semantic injection intent that no keyword list captures, add an llm_judge rule and red-team it the same way: eval it against deepset_prompt_injections and anthropic_hh_redteam and read the F1. See the guardrail reference.

3. Shadow-mode the firewall against real traffic

A guardrail eval tests text against a fixed corpus. Your firewall, by contrast, needs to be tested against the messy reality of what your agent actually does — and the safest way to do that before launch is shadow mode. Shadow mode is a per-policy flag that makes the firewall evaluate and log every tool call exactly as it would in production, but downgrade every enforcing verdict to audit. A deny becomes an audit row whose reason is prefixed [shadow] would …. Nothing is blocked. Nothing breaks. But the Events feed now shows you the precise list of calls your policy would have rejected. This is the firewall red team: author your strictest intended policy, flip shadow mode on, run your agent through a realistic launch rehearsal, then read the [shadow] would … events.

Author the policy, then shadow it

Build your enforcing policy in the console (Developer+) — for a launch dry-run, set default_verdict to audit and add the deny rules you intend to ship. Toggle shadow mode on. The whole policy now logs without enforcing.

Exercise the agent like it's launch day

Run your real agent flows against the gateway with a key attached to the shadowed policy. Every tool call — inbound, response, MCP dispatch, egress — is evaluated and logged.

Read the would-block list

Open Firewall → Events (Developer+) and filter for the [shadow] would … reasons. Each one is a call your policy would have denied in production. Confirm every entry is a call you want denied — and that nothing legitimate is on the list.

Flip shadow off to go live

Once the would-block list is clean, turn shadow mode off. The very next matching call is enforced for real — no other change.

Pair shadow mode with observe mode (a workspace setting) for coverage, not just correctness. Observe mode logs every tool call that resolves to no policy as a gap, populating the Discovered tools view — so you catch the tool you forgot to write a rule for, not just the rules you got wrong. See enforcement modes.

4. Simulate a tighter posture before you commit

The third red-team move is the cheapest: before you apply a stricter autonomy level, simulate it. The simulator previews what applying tight (or any level) would change against your workspace’s recent traffic — how many calls would flip to deny — without writing a single policy row.

curl "https://api.orcarouter.ai/api/workspace/firewall/simulate?level=tight" \
  -H "Authorization: Bearer <your-session-token>" \
  -H "X-Workspace-Id: <workspace-id>"

Reading the simulator is open to any Member. Use it to answer “is my agent ready for tight?” before launch: if the preview shows a wall of would-be denials on calls your agent depends on, you have rules to soften before go-live, not an incident after it.

Simulate is preview-only — it never mutates your policies. Applying an autonomy level is a separate, Developer+ action, and it’s one transaction with one-click undo if the live result still surprises you.

5. The pre-launch red-team checklist

Put the three passes together and you have a launch gate:

Pass	Tool	Green when
Content policy	Guardrail Eval vs attack + benign corpora	High recall on attacks, no blocks on benign
Action policy	Firewall shadow mode vs rehearsal traffic	Every `[shadow] would …` is intended
Coverage	Observe mode + Discovered tools	No surprising tool sits in a coverage gap
Posture	Simulate the target autonomy level	Preview matches what you expect

Run all four green, then enforce: flip shadow mode off and apply your autonomy level. Because every binding lives on the key in the gateway, the move from dry-run to live is a config change, not a deploy — your agent keeps calling https://api.orcarouter.ai/v1/... exactly as before.

Output-stage masking and live response scanning are still maturing — an eval run proves a rule’s logic in the sandbox, but confirm your specific stage and streaming combination against the guardrail notes before you depend on it in production.

6. Next steps

Enforcement modes

Observe → shadow → enforce, the safe rollout this recipe rehearses.

The Secure Agents baseline

What each autonomy level sets — and how simulate previews it.

Prompt injection

The threat your guardrail eval is scoring against.

Go live

The production cutover after the red team passes.

For the full engines behind each pass, see the Guardrails and Firewall references, and the related threats: jailbreaks and dangerous-tool-calls.

​1. How to red team an AI agent before launch

Does my guardrail catch attacks?

What would my firewall break?

Is a tighter posture safe?

​2. Score your guardrail against adversarial corpora

​3. Shadow-mode the firewall against real traffic

​4. Simulate a tighter posture before you commit

​5. The pre-launch red-team checklist

​6. Next steps

Enforcement modes

The Secure Agents baseline

Prompt injection

Go live

1. How to red team an AI agent before launch

2. Score your guardrail against adversarial corpora

3. Shadow-mode the firewall against real traffic

4. Simulate a tighter posture before you commit

5. The pre-launch red-team checklist

6. Next steps