https://api.orcarouter.ai/v1/...; only the keys and policies in the
gateway change. For the underlying attack anatomy, read
Prompt injection and
Dangerous tool calls; this page
is the response.
The roles each step needs are called out inline. Reading the guardrail
Matches feed is open to any Member; the firewall Events, Runs, and
trace views need Developer+, as do revoking a key, applying an
autonomy posture, and editing a policy; marking a guardrail match a
false positive needs Admin.
1. The AI security incident response loop
Three phases, run in order. Don’t skip straight to hardening: contain first so the attacker loses access while you investigate.
Contain
Revoke the compromised key so the attacker can’t make another call.
Mint a fresh, tightly-scoped replacement.
Scope
Read the firewall Events / Runs and guardrail Matches feeds
to see exactly what the key did and what fired.
Harden
Tighten the autonomy posture and add the rule that would have caught
it, so the same attack can’t recur.
2. Contain — revoke the key
The first move is to cut off access. A leaked sk-orca-... key keeps
working until you revoke it, so do this before anything else.
In the console, open API Keys, find the compromised key (it’s masked
on display — match it by name, environment, or last-used), and delete
it (Developer role). Deletion is immediate: the very next request on
that key is rejected at the gateway.
Then mint a replacement, scoped to the minimum the workload needs — never
your account-wide key. In API Keys → New key (Developer role):
Cap the blast radius on the new key
Set credit_limit_usd to a sane ceiling (0 = unlimited) so a future
leak can’t drain quota, allow_ips to your backend’s egress IPs if
the caller runs from a fixed server, and expired_time for anything
temporary (-1 = never expires). Use model_limits (with
model_limits_enabled) to fence the key to only the models it needs.
Attach your policies to the new key
Pick your hardened guardrail from the Guardrail dropdown (sets
guardrail_id) and your firewall policy from the Firewall policy
dropdown (sets firewall_policy_id). Both bindings live on the key in
the gateway, so the new key is governed from its first call. Copy the
plaintext once; it’s masked everywhere after creation.
3. Scope — read the Events and Matches feeds
Now find out what the key actually did. The gateway already recorded every tool call and every rule that fired, scoped to the workspace, with no extra instrumentation.

| Feed | Granularity | Role | What it answers |
|---|---|---|---|
| Firewall → Events | per tool call | Developer+ | Every evaluation — verdict, surface, tool, args, the run it belongs to. |
| Firewall → Runs | rolled up | Developer+ | “What did this agent session actually do” — verdict mix, distinct tools and models. |
| Guardrails → Matches | per rule hit | Member | Every guardrail rule that fired — type, action, stage, detail. |
In the Events feed, look at the deny and audit verdicts to see what was
blocked versus what slipped through under an observe-only posture.
Cross-check Guardrails → Matches for the same window. If a
Prompt-Injection Basics rule flagged the request — phrases like
“ignore previous instructions” or “reveal your system prompt” — it
lands here with the rule type and stage.
The Matches feed records the matched substring only when Log raw
content is on for that guardrail — it’s off by default (the
privacy-conservative posture). With it off you still see that a rule
fired and its detail meta-string, just not the literal text. Turn it on
per guardrail when you need the substring for triage; the setting is
non-retroactive.
If a match turns out to be a false positive, mark it as one
(POST /api/guardrail/match/:id/mark-fp, Admin) so it stops skewing
your signal while you tune.
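To make the blocked-versus-slipped-through comparison concrete, here is a minimal sketch that tallies an exported list of firewall events by tool. The export shape, the key names (`run`, `tool`, `verdict`), the `allow` verdict, and the tool names are all assumptions for illustration; only the `deny` and `audit` verdicts come from this page.

```python
from collections import Counter

# Hypothetical export of Firewall -> Events rows; key names and tool names
# are assumptions, not the gateway's documented schema.
events = [
    {"run": "run-1", "tool": "shell.exec", "verdict": "deny"},
    {"run": "run-1", "tool": "http.fetch", "verdict": "audit"},
    {"run": "run-1", "tool": "http.fetch", "verdict": "audit"},
    {"run": "run-2", "tool": "fs.read", "verdict": "allow"},
]


def summarize_events(rows):
    """Count tool calls that were blocked (deny) versus only logged (audit)."""
    blocked = Counter(r["tool"] for r in rows if r["verdict"] == "deny")
    observed = Counter(r["tool"] for r in rows if r["verdict"] == "audit")
    return {"blocked": dict(blocked), "slipped_through": dict(observed)}


summary = summarize_events(events)
print(summary)
```

Any tool that shows up under `slipped_through` ran under an observe-only posture and is a candidate for a new enforcing rule in the Harden phase.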
4. Harden — close the gap
Containment stops this attacker; hardening stops the next one. Two moves: tighten the workspace posture immediately, then add the specific rule that would have caught what you just saw.
Fast path — raise the autonomy level
If the incident exposed an agent that was running too open, flip the whole workspace posture in one transaction. In Firewall → Posture, apply the tight autonomy level
(Developer role). In one move this sets default-deny, denies
destructive shell, denies the fetch-shaped SSRF tool names, and enforces
the PII Shield and Secrets & API-Key Blocker guardrails. Every change is one
transaction with one-click undo from the audit snapshot, so you can
roll straight back if it’s too strict.
Precise path — add the rule that would have caught it
For prompt injection specifically, OrcaRouter ships a Prompt-Injection Basics preset (category safety): a keyword rule that flags common injection phrases for review without blocking the user. Start there to get signal, then escalate. Its stricter sibling, the Jailbreak / Role-Play Blocker, blocks the same class with a regex.
In Guardrails → New guardrail (Developer role; the Test sandbox runs
candidate rules inline, and llm_judge makes a paid model call, so it’s
Developer+ too), apply the Prompt-Injection Basics preset, then add an
llm_judge rule to catch the obfuscated injections a keyword list misses.
Set judge_fail_open: false to treat a judge error or timeout as a
block when a missed check is unacceptable. Prove the whole policy in
the Test tab and against an Eval corpus before attaching it to a key.
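The fail-open versus fail-closed choice is easier to reason about in code. The sketch below is not the gateway's implementation; it only illustrates the semantics described above, with a hypothetical `judge` callable standing in for the paid model call. Only the `judge_fail_open` name comes from this page.

```python
from typing import Callable


def judge_verdict(judge: Callable[[str], bool], text: str,
                  judge_fail_open: bool) -> str:
    """Return "block" or "pass", deciding what a judge failure means."""
    try:
        flagged = judge(text)  # True means the judge considers the text an injection
    except Exception:
        # The judge errored or timed out: fail open lets the request through,
        # fail closed treats the missed check as a block.
        return "pass" if judge_fail_open else "block"
    return "block" if flagged else "pass"


def timing_out_judge(text: str) -> bool:
    """Hypothetical judge that always times out."""
    raise TimeoutError("judge timed out")


print(judge_verdict(timing_out_judge, "hello", judge_fail_open=True))
print(judge_verdict(timing_out_judge, "hello", judge_fail_open=False))
```

With fail-closed behavior, a judge outage blocks legitimate traffic too, so it suits workloads where a missed check costs more than a rejected request.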
Roll the new rule out safely
Don’t enforce a fresh rule blind on live traffic. For the firewall, set shadow_mode: true on the policy; every enforcing verdict is downgraded
to audit and logged as [shadow] would …, so you watch it fire on the
Events feed before it changes any traffic. For guardrails, set a new
rule’s action to flag first, watch the Matches feed, then promote
it to block or mask. See
enforcement modes for the full
observe → shadow → enforce path.
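As a rough mental model of the firewall's shadow mode, the sketch below downgrades an enforcing verdict to audit and keeps the original verdict so it can be logged. It is an illustration, not the gateway's code: the function name and the assumption that `deny` is the only enforcing verdict are hypothetical.

```python
from typing import Optional, Tuple

# Assumption: "deny" is treated as the only enforcing verdict in this sketch.
ENFORCING_VERDICTS = {"deny"}


def apply_shadow_mode(verdict: str, shadow_mode: bool) -> Tuple[str, Optional[str]]:
    """Return (effective verdict, verdict that would have applied, if downgraded)."""
    if shadow_mode and verdict in ENFORCING_VERDICTS:
        # Traffic is untouched; the would-be verdict is kept for the log entry.
        return "audit", verdict
    return verdict, None


print(apply_shadow_mode("deny", shadow_mode=True))
print(apply_shadow_mode("deny", shadow_mode=False))
```

Once the downgraded verdicts on the Events feed match what you expect, turning shadow mode off lets the same rule enforce.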
5. Verify the fix
Confirm the loop is closed before you call it resolved.
Replay the attack in the sandbox
Paste the malicious prompt into the guardrail Test tab at the
input stage and confirm the verdict is now a block (or flag). For a
tool-call incident, dry-run the offending call in Firewall → Test
(Developer+) and confirm the verdict is deny. Neither sandbox sends
anything upstream or persists anything.
Confirm the old key is dead
Send a request on the revoked key and confirm it’s rejected. A
guardrail block returns HTTP 400 guardrail_blocked; a denied tool call
returns HTTP 400 firewall_blocked. A block costs no quota (input-stage
blocks fire before metering; output blocks refund the pre-consumed
quota) and is marked skip-retry.
Snapshot the timeline
Every guardrail change writes a version-history row you can diff
and revert. Firewall changes are captured in the audit trail, and
an autonomy-level apply carries a one-click undo snapshot. Together
with the workspace audit log, that’s your incident record — who
changed what, when, and what the posture was before and after.
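When you script the "Confirm the old key is dead" check, it helps to tell a policy block apart from an outright rejection. The sketch below classifies a response from its status code and parsed JSON body. The `{"error": {"code": ...}}` body shape is an assumption, and the page does not state which status a revoked key returns, so anything that is neither a success nor a named block is simply reported as rejected.

```python
def classify_gateway_response(status: int, body: dict) -> str:
    """Classify a response as accepted, a named policy block, or rejected."""
    error = body.get("error")
    # Assumption: the error code sits at body["error"]["code"].
    code = error.get("code", "") if isinstance(error, dict) else ""
    if status == 400 and code in ("guardrail_blocked", "firewall_blocked"):
        return code
    if 200 <= status < 300:
        return "accepted"
    return "rejected"


# A revoked key should never come back as "accepted".
print(classify_gateway_response(401, {"error": {"code": "invalid_key"}}))
print(classify_gateway_response(400, {"error": {"code": "guardrail_blocked"}}))
```

The `401` and `invalid_key` values in the usage lines are placeholders, not documented gateway responses.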
6. Runbook at a glance
| Phase | Action | Where | Role |
|---|---|---|---|
| Contain | Delete the leaked key | API Keys | Developer+ |
| Contain | Mint a scoped replacement | API Keys → New key | Developer+ |
| Scope | Read tool calls + verdicts | Firewall → Events / Runs | Developer+ |
| Scope | Read rules that fired | Guardrails → Matches | Member |
| Harden | Raise the posture | Firewall → Posture (tight) | Developer+ |
| Harden | Add the catching rule | Guardrails / Firewall | Developer+ |
| Verify | Replay in the sandbox | Test tabs | Developer+ |
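The "Mint a scoped replacement" row can be sanity-checked before you create the key. The sketch below is a hypothetical pre-flight check over a plain dictionary; the field names come from the key settings in section 2, but their value types, the default for a missing `credit_limit_usd`, and the helper itself are assumptions.

```python
from typing import List


def check_key_scope(spec: dict, temporary: bool = False) -> List[str]:
    """Return warnings for settings that leave a new key broader than needed."""
    warnings = []
    if spec.get("credit_limit_usd", 0) == 0:
        warnings.append("credit_limit_usd is 0 (unlimited)")
    if not spec.get("allow_ips"):
        warnings.append("allow_ips is empty")
    if not (spec.get("model_limits_enabled") and spec.get("model_limits")):
        warnings.append("model_limits is not enforced")
    if not spec.get("guardrail_id"):
        warnings.append("no guardrail_id attached")
    if not spec.get("firewall_policy_id"):
        warnings.append("no firewall_policy_id attached")
    if temporary and spec.get("expired_time", -1) == -1:
        warnings.append("temporary key never expires (expired_time is -1)")
    return warnings


# Illustrative values only; IDs, IPs, and model names are placeholders.
scoped = {
    "credit_limit_usd": 25,
    "allow_ips": ["203.0.113.10"],
    "expired_time": -1,
    "model_limits_enabled": True,
    "model_limits": ["example-model"],
    "guardrail_id": 12,
    "firewall_policy_id": 7,
}
print(check_key_scope(scoped))
print(check_key_scope({"credit_limit_usd": 0}))
```

An empty `allow_ips` is only a problem when the caller runs from a fixed server, so treat that warning as a prompt to decide rather than a hard failure.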
7. Where to go next
Go-live checklist
The pre-production hardening pass — scope keys and lock posture before
you ship.
Prompt injection
The attack this runbook responds to, end to end.
Enforcement modes
Observe → shadow → enforce — roll a new rule out without breaking
traffic.
Stop exfiltration
Lock outbound destinations if the incident touched the network.
