1. AI agent security FAQ — start here
A 30-second map of which control answers which question:

| You’re asking about… | The plane | Read |
|---|---|---|
| Text in prompts or responses (PII, secrets, jailbreaks) | Guardrails | Guardrails |
| Tool calls, MCP, egress, skills | Firewall | Firewall |
| Which one fired on a 400 | Either | Why was it blocked? |
2. Guardrails — content screening
What happens if no guardrail resolves on a request?
Resolution order: guardrail_id on the key (if it
exists and is enabled) → otherwise the workspace is_default
guardrail → otherwise no enforcement. A disabled explicit
attachment is the off switch: it does not fall back to the
default. With nothing resolved, the request is byte-identical to a
workspace that never enabled the feature.
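The resolution order above can be sketched as a small function. This is an illustration only, not the product's implementation: the `Guardrail` class and `resolve_guardrail` are hypothetical names, and the requirement that the default guardrail also be enabled is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Guardrail:
    id: int
    enabled: bool = True
    is_default: bool = False


def resolve_guardrail(key_guardrail_id: Optional[int],
                      guardrails: List[Guardrail]) -> Optional[Guardrail]:
    """Return the guardrail that applies, or None for no enforcement."""
    by_id = {g.id: g for g in guardrails}
    attached = by_id.get(key_guardrail_id) if key_guardrail_id is not None else None
    if attached is not None:
        # A disabled explicit attachment is the off switch: no fallback.
        return attached if attached.enabled else None
    # No usable explicit attachment: fall back to the workspace default.
    # Assumption: a disabled default does not enforce either.
    for g in guardrails:
        if g.is_default and g.enabled:
            return g
    return None
```

Calling it with a key that points at a disabled guardrail returns `None` even when a workspace default exists, which mirrors the "off switch" behavior described above.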
Does a blocked request cost me quota?
A block action returns 400 guardrail_blocked and costs no
quota: an input-stage block fires before metering; an output-stage
block refunds the pre-consumed quota. It’s also marked skip-retry:
re-running the identical prompt just blocks again.
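On the client side, the skip-retry behavior suggests excluding this error from any retry loop. A minimal sketch, assuming the error code is available to the caller as a string; `should_retry` and the set of retryable statuses are illustrative choices, not part of the documented API.

```python
from typing import Optional

# Assumption: a typical set of transient HTTP statuses worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def should_retry(status: int, error_code: Optional[str]) -> bool:
    """Decide whether re-sending the identical request could succeed."""
    if status == 400 and error_code == "guardrail_blocked":
        # The same prompt would hit the same rule and block again.
        return False
    return status in RETRYABLE_STATUS
```

To get past a block, change the prompt or the guardrail rather than retrying.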
What rule types and actions are there?
Rule types: keyword, regex, pii, max_chars, external,
llm_judge, grounding. Actions: block (reject), mask (redact and
forward), flag (log only, no traffic change). Stages: input,
output, both. See Guardrails for each.
Which PII entities are detected, and what does a mask look like?
Built-in entities: email, phone, credit_card, ssn,
ip, iban, mac_address, jwt, aws_access_key, api_key_openai,
bitcoin_address, plus regional types (jp_mynumber, kr_rrn,
cn_resident_id). A mask action renders a typed tag:
jane@acme.com → [EMAIL], an SSN → [SSN]. You can layer up to
25 custom regex entities per rule (with an optional Luhn checksum)
and override the action per entity via entity_actions.
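To make typed-tag masking concrete, here is a rough sketch of the idea. The regexes are simplified stand-ins rather than the product's detectors, the `[CREDIT_CARD]` tag name is assumed by analogy with `[EMAIL]` and `[SSN]`, and the Luhn check is applied to card numbers purely to show how a checksum filters false positives.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(candidate: str) -> bool:
    """Luhn checksum over the digits of candidate."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask(text: str) -> str:
    """Replace detected entities with typed tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    # Only mask card-shaped numbers that pass the checksum.
    return CARD.sub(
        lambda m: "[CREDIT_CARD]" if luhn_ok(m.group()) else m.group(), text
    )
```

A digit run that fails the checksum is left untouched, which is the point of pairing a pattern with Luhn.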
Is output masking enforced on streaming responses?
What does the LLM judge cost?
keyword / regex / pii / max_chars rules do no model call and
bill nothing. An llm_judge rule runs a semantic check through a
workspace model (bounded by judge_timeout_ms, fail-open by
default) and is billed as a separate judge sub-line. A grounding
rule scores answer faithfulness against the request’s retrieved sources
(threshold default 0.7) the same way.
Can I see what a rule actually matched?
Matches are readable via GET /api/guardrail/match (Member). Each
row records rule type, action, stage, and a detail string, plus the
matched substring only if “Log raw content” is on for that
guardrail (off by default, the privacy-conservative posture). Wrong
block? Mark it a false positive
(POST /api/guardrail/match/:id/mark-fp, Admin).
Do you scan dependencies for known CVEs?
block / mask /
flag actions you author directly. Connect a scanner under
Integrations to drive it.
3. Firewall — agent actions
How does the firewall differ from guardrails on resolution?
firewall_policy_id / guardrail_id) and share the workspace-default
fallback. See
Guardrails vs Firewall.
What are the verdicts and surfaces?
Verdicts: allow, audit, deny, sanitize, pending_approval,
cap_cost. default_verdict is allow / audit / deny (audit by
default). Surfaces: inbound (advertised tools), response
(model-emitted tool_calls), mcp (a tools/call), egress
(outbound host/IP/CIDR). The
verdict glossary decodes each.
Does `sanitize` clean up what a tool returns?
No. The sanitize verdict redacts
matched substrings from the tool-call arguments only, never the
content a tool returns. On the inbound surface (no call-time args
yet) sanitize escalates to a deny.
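The two cases can be sketched as follows. This is a simplified illustration: `apply_sanitize` is a hypothetical helper, the rule is modeled as a single regex, and the `[REDACTED]` placeholder is an assumption about what the redaction looks like.

```python
import re
from typing import Optional, Tuple


def apply_sanitize(surface: str, arguments: str,
                   pattern: str) -> Tuple[str, Optional[str]]:
    """Return (effective verdict, arguments to forward)."""
    if surface == "inbound":
        # No call-time arguments exist yet, so there is nothing to redact.
        return ("deny", None)
    # Redact matches in the tool-call arguments; tool results are untouched.
    return ("sanitize", re.sub(pattern, "[REDACTED]", arguments))
```

Screening what a tool returns is outside this verdict's scope in either case.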
What do the autonomy levels do?
autonomy_*
rows:
• balanced (recommended start): default audit, deny destructive shell, PII Shield in audit-only (flags PII).
• tight: default-deny, deny destructive shell, deny SSRF-shaped fetch tools, PII Shield + Secrets Blocker enforced.
• permissive: observe only.

One-click undo restores the prior state from the audit snapshot the apply wrote. It’s a single step: undo is unavailable once a later apply (or a manual policy edit) has superseded that snapshot. See Enforcement modes.
Does the SSRF preset block private IPs and cloud metadata?
The tight autonomy SSRF preset denies the common
fetch-shaped tool names (http_fetch, web_search, fetch_url,
request). To deny by destination (RFC-1918 ranges, cloud-metadata
IPs, specific CIDRs), author your own egress-surface host/CIDR deny
rule. No preset ships CIDR rules for you. See
Egress & data exfiltration.
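The logic of a destination-based deny rule can be sketched with Python's standard `ipaddress` module. This shows the matching idea only, not the rule syntax the firewall accepts: `DENY_CIDRS` and `egress_verdict` are hypothetical names, and 169.254.169.254 is used as the commonly cited cloud-metadata address.

```python
import ipaddress

DENY_CIDRS = [ipaddress.ip_network(cidr) for cidr in (
    "10.0.0.0/8",          # RFC 1918
    "172.16.0.0/12",       # RFC 1918
    "192.168.0.0/16",      # RFC 1918
    "169.254.169.254/32",  # common cloud-metadata endpoint
)]


def egress_verdict(dest_ip: str, default_verdict: str = "audit") -> str:
    """Deny destinations inside any listed CIDR, else use the default."""
    addr = ipaddress.ip_address(dest_ip)
    if any(addr in net for net in DENY_CIDRS):
        return "deny"
    return default_verdict
```

Matching on the resolved IP rather than the tool name is what closes the gap the name-based preset leaves open.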
How do I roll out a policy without breaking traffic?
audit, prefixing the reason
[shadow] would …. Watch the Events and Runs views, then turn
shadow off to enforce. Workspace-level observe mode
(firewall_observe_mode) is the complementary discovery dial — it logs
uncovered calls as gaps in Discovered Tools.
How does human approval (HITL) work?
A pending_approval verdict returns 400 firewall_approval_pending
with an approval id. A reviewer resolves it from the console
(Developer+) or via an HMAC webhook callback
(POST /api/v1/firewall/approvals/:id/callback). The agent polls
GET /api/v1/firewall/approvals/:id and re-submits the original call
with a single-use X-OrcaRouter-Firewall-Approval header. See
Dangerous tool calls.
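The agent-side half of this flow might look like the sketch below. The source does not document the response body or the header value, so the status strings (`"pending"`, `"approved"`), the `get_status` callable that wraps the GET request, and the value placed in the header are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict


def wait_for_approval(get_status: Callable[[str], str], approval_id: str,
                      interval_s: float = 2.0, max_polls: int = 30) -> bool:
    """Poll until the approval is resolved; True only if it was approved.

    get_status is expected to wrap GET /api/v1/firewall/approvals/:id
    and return a status string (hypothetical values, see above).
    """
    for _ in range(max_polls):
        status = get_status(approval_id)
        if status == "approved":
            return True
        if status != "pending":
            return False  # rejected, expired, or otherwise resolved
        time.sleep(interval_s)
    return False  # gave up while still pending


def resubmit_headers(approval_value: str) -> Dict[str, str]:
    """Headers for re-sending the original call after approval."""
    # The header is single-use: build it fresh for each approved call.
    return {"X-OrcaRouter-Firewall-Approval": approval_value}
```

Injecting `get_status` keeps the polling logic testable without a network and leaves the HTTP client choice to the caller.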
What is anomaly detection looking for?
retry_loop and novel_path (a tool-to-tool
transition never seen before). The feed is Member-readable; snooze an
anomaly for up to 7 days. See
Excessive agency.
4. MCP, keys & gateway access
How are MCP servers governed?
name, endpoint, auth_mode of
none/bearer/oauth/basic, encrypted credentials) and the MCP
gateway evaluates every tools/call on the mcp surface before
dispatch. Health is tracked (ok/degraded/down); probe it with
POST /api/workspace/firewall/mcp_servers/:id/probe. A probe also
baselines the server’s advertised tool schema — later drift flips its
schema status from verified to changed (the “rug-pull” signal), and
you either re-baseline (approve) or quarantine the server. So
governance is per-call evaluation plus schema-integrity tracking and
skill risk-bands. See Firewall MCP and
MCP tool poisoning.
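One way to picture the schema-integrity check is a fingerprint over the advertised tool list. The sketch below assumes a SHA-256 hash of canonical JSON; the actual baselining mechanism is not described in this FAQ, and `schema_fingerprint` and `schema_status` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Dict, List


def schema_fingerprint(tools: List[Dict[str, Any]]) -> str:
    """Hash the tool list in a canonical form (stable order and key order)."""
    canonical = json.dumps(
        sorted(tools, key=lambda t: t["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def schema_status(baseline: str, advertised: List[Dict[str, Any]]) -> str:
    """Compare the currently advertised tools with the stored baseline."""
    return "verified" if schema_fingerprint(advertised) == baseline else "changed"
```

Canonicalizing before hashing means a server that merely reorders its tools stays `verified`, while an edited description or parameter flips the status to `changed`.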
What happens to a risky or auto-detected skill?
allow / quarantine / block. A
quarantined skill is held for approval; auto-detected skills stay
quarantined until a human reviews them. The mode rides on top of the
rule verdict.
Which key fields lock down an agent?
model_limits (+ model_limits_enabled), allow_ips,
credit_limit_usd (0 = unlimited), expired_time (-1 = never),
environment, guardrail_id, firewall_policy_id, and
is_firewall_gateway. Combine them for least agency — see
Scope, keys & policies.
Keys are masked on display.
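The two sentinel values are easy to misread, so here is a small sketch of how they could be interpreted. `KeyLimits` and `key_usable` are hypothetical names, and treating `expired_time` as a Unix timestamp in seconds is an assumption; only the meanings of `0` and `-1` come from the text above.

```python
from dataclasses import dataclass


@dataclass
class KeyLimits:
    credit_limit_usd: float = 0  # 0 means unlimited
    expired_time: int = -1       # -1 means never; else assumed Unix seconds


def key_usable(limits: KeyLimits, spent_usd: float, now: int) -> bool:
    """True if the key is neither expired nor over its credit limit."""
    if limits.expired_time != -1 and now >= limits.expired_time:
        return False
    if limits.credit_limit_usd != 0 and spent_usd >= limits.credit_limit_usd:
        return False
    return True
```

For least agency, an agent key would normally set both fields to real limits instead of leaving the sentinels in place.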
Why am I getting 403 on `/api/v1/firewall/*`?
POST /evaluate, POST /evaluate_plan,
ANY /mcp) require a key with is_firewall_gateway=true — a
dedicated firewall-gateway-scoped token, not your sk-orca-… relay
key. Minting one and reading its plaintext is Admin+.
What's the difference between configuring and calling?
/v1/* relay traffic uses an
sk-orca-… key; only the /api/v1/firewall/* gateway hooks use the
firewall-gateway-scoped token.
5. Compliance, residency & data
Which frameworks are covered?
/api/compliance/*.
Why is install/report gated?
POST /api/compliance/packs/:key/install) materializes real
guardrails + firewall policies you can then edit.
Are the compliance reports verifiable?
GET /api/public/compliance/pubkey), verify a
report (POST /api/public/compliance/verify), or hand an auditor a
share link (GET /api/public/compliance/share/:token). Exports are
CSV / JSON / PDF.
What does data residency actually pin?
us, eu,
uk, ap, cn, global), settable via PUT /api/compliance/residency
(Admin); a cross-region read is withheld. It is not geo-pinning of
your inference data. See
Shared responsibility.
How long are logs kept, and how do I get data erased?
