You’ve read a control page and have one question left before you ship. This is the AI agent security FAQ: the cross-cutting questions that span the whole Zero-Trust section, answered in one place, each linking to the reference for depth. If you’re brand new to the section, start at Securing AI agents and the control stack; this page assumes you know there are two enforcement planes, Guardrails (prompt/response text) and the Firewall (agent actions), and that you just need the edges nailed down.

1. AI agent security FAQ: start here

A 30-second map of which control answers which question:
| You’re asking about… | The plane | Read |
| --- | --- | --- |
| Text in prompts or responses (PII, secrets, jailbreaks) | Guardrails | Guardrails |
| Tool calls, MCP, egress, skills | Firewall | Firewall |
| Which one fired on a 400 | Either | Why was it blocked? |
Every security block on the hosted gateway is HTTP 400 with a machine-readable code. Read the code first — it forks you to the right feed. The full table lives in Error codes.
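As a minimal client-side sketch of "read the code first", the helper below forks a block to the feed worth opening. It assumes you have already extracted the machine-readable code from the error body as a string (the body shape is not shown on this page), and it assumes codes share a `guardrail_` or `firewall_` prefix, which holds for the two codes named here (`guardrail_blocked`, `firewall_approval_pending`) but should be checked against Error codes. The function name `route_block` is hypothetical.

```python
def route_block(status: int, code: str) -> str:
    """Return which feed to open for a security block.

    Prefix matching is an assumption based on the two codes named on this
    page; the Error codes table is the authoritative list.
    """
    if status != 400:
        return "not_a_security_block"
    if code.startswith("guardrail_"):
        return "guardrails"   # open the guardrail Matches feed
    if code.startswith("firewall_"):
        return "firewall"     # open the firewall Events / Runs views
    return "unknown"          # some other 400: look it up in Error codes
```

Because these codes are skip-retry, a caller would typically surface the result to a human rather than loop on the same request.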

2. Guardrails — content screening

Nothing. Resolution is: explicit guardrail_id on the key (if it exists and is enabled) → otherwise the workspace is_default guardrail → otherwise no enforcement. A disabled explicit attachment is the off switch — it does not fall back to the default. With nothing resolved, the request is byte-identical to a workspace that never enabled the feature.
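The resolution order above can be sketched as a small function. This is an illustration of the described behavior, not the gateway's code: the `Guardrail` dataclass and `resolve_guardrail` are hypothetical names, and two details are assumptions on my part, namely that a `guardrail_id` pointing at a missing guardrail falls through to the default, and that a disabled default resolves to nothing.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Guardrail:
    id: str
    enabled: bool = True
    is_default: bool = False


def resolve_guardrail(
    key_guardrail_id: Optional[str],
    workspace_guardrails: Dict[str, Guardrail],
) -> Optional[Guardrail]:
    """Return the guardrail that applies to a key, or None for no enforcement."""
    explicit = (
        workspace_guardrails.get(key_guardrail_id) if key_guardrail_id else None
    )
    if explicit is not None:
        # A disabled explicit attachment is the off switch: no fallback.
        return explicit if explicit.enabled else None
    # No explicit attachment (or, by assumption, a dangling id):
    # fall back to the workspace default, if one exists and is enabled.
    for guardrail in workspace_guardrails.values():
        if guardrail.is_default and guardrail.enabled:
            return guardrail
    return None
```

For example, a key attached to a disabled guardrail resolves to `None` even when the workspace has an enabled default, which is the case that most often surprises people.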
No. A block action returns 400 guardrail_blocked and costs no quota — an input-stage block fires before metering; an output-stage block refunds the pre-consumed quota. It’s also marked skip-retry: re-running the identical prompt just blocks again.
Rule types: keyword, regex, pii, max_chars, external, llm_judge, grounding. Actions: block (reject), mask (redact and forward), flag (log only, no traffic change). Stages: input, output, both. See Guardrails for each.
Built-in entities include email, phone, credit_card, ssn, ip, iban, mac_address, jwt, aws_access_key, api_key_openai, bitcoin_address, plus regional types (jp_mynumber, kr_rrn, cn_resident_id). A mask action renders a typed tag: jane@acme.com → [EMAIL], an SSN → [SSN]. You can layer up to 25 custom regex entities per rule (with an optional Luhn checksum) and override the action per entity via entity_actions.
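To make the typed-tag masking and the optional Luhn check concrete, here is a rough sketch. The two regexes are deliberately simple stand-ins, not the product's detectors, and `mask` and `luhn_ok` are hypothetical helper names; only the `[EMAIL]` and `[SSN]` tag forms come from the text above.

```python
import re

# Illustrative patterns only; real detectors are stricter than these.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask(text: str) -> str:
    """Replace each match with a typed tag such as [EMAIL] or [SSN]."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text


def luhn_ok(candidate: str) -> bool:
    """Luhn checksum: filters out digit runs that only look like card numbers."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A checksum like this is why a custom regex entity for card-like numbers can avoid flagging arbitrary 16-digit identifiers: the pattern finds candidates, and the checksum discards most of the false ones.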
Output block is enforced both ways — non-streaming responses are screened before they return, and a streaming scanner cuts the stream mid-flight. Output mask is currently non-streaming only; on a streaming response the chunk passes through unmasked (in-band stream rewriting is on the roadmap). Input-stage masking — sanitizing the request before the model sees it — is live regardless. The PII Shield preset masks at the input stage today.
keyword / regex / pii / max_chars rules do no model call and bill nothing. An llm_judge rule runs a semantic check through a workspace model (bounded by judge_timeout_ms, fail-open by default) and is billed as a separate judge sub-line. A grounding rule scores answer faithfulness against the request’s retrieved sources (threshold default 0.7) the same way.
Open the Matches feed (GET /api/guardrail/match, Member). Each row records rule type, action, stage, and a detail string — and the matched substring only if “Log raw content” is on for that guardrail (off by default, the privacy-conservative posture). Wrong block? Mark it a false positive (POST /api/guardrail/match/:id/mark-fp, Admin).
A guardrail can decorate a prompt with a code-security advisory (e.g. a CVE/SBOM note on a referenced package) without blocking or masking the text. This is an annotation layer that augments the request rather than rejecting it — distinct from the block / mask / flag actions you author directly. Connect a scanner under Integrations to drive it.

3. Firewall — agent actions

One key difference: a disabled attached firewall policy falls back to the workspace default, whereas a disabled attached guardrail resolves to none. Otherwise both attach via the key (firewall_policy_id / guardrail_id) and share the workspace-default fallback. See Guardrails vs Firewall.
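A sketch of the firewall side of that difference, under the assumption that a policy can be represented as a dict with an `enabled` flag (the real schema is not shown here, and `resolve_firewall_policy` is a hypothetical name):

```python
from typing import Optional


def resolve_firewall_policy(
    attached: Optional[dict],
    workspace_default: Optional[dict],
) -> Optional[dict]:
    """Pick the firewall policy for a key.

    Unlike a guardrail, a disabled attached policy does NOT switch
    enforcement off; resolution falls back to the workspace default.
    """
    if attached is not None and attached.get("enabled"):
        return attached
    return workspace_default
```

The practical consequence is that disabling a key's attached firewall policy does not leave that key ungoverned while a workspace default exists.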
Verdicts: allow, audit, deny, sanitize, pending_approval, cap_cost. default_verdict is allow / audit / deny (audit by default). Surfaces: inbound (advertised tools), response (model-emitted tool_calls), mcp (a tools/call), egress (outbound host/IP/CIDR). The verdict glossary decodes each.
No — and this is the common misconception. A sanitize verdict redacts matched substrings from the tool-call arguments only, never the content a tool returns. On the inbound surface (no call-time args yet) sanitize escalates to a deny.
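The scope of sanitize can be illustrated with a short sketch. It assumes tool-call arguments arrive as a flat dict of strings and that redaction substitutes a placeholder; the `[REDACTED]` text, the `apply_sanitize` name, and the return shape are all hypothetical.

```python
import re
from typing import Optional, Tuple


def apply_sanitize(
    surface: str,
    arguments: dict,
    pattern: str,
    placeholder: str = "[REDACTED]",  # assumed placeholder text
) -> Tuple[str, Optional[dict]]:
    """Return (effective_verdict, sanitized_arguments).

    Only call-time arguments are rewritten. Tool results never pass
    through this function, mirroring the behavior described above.
    """
    if surface == "inbound":
        # No call-time arguments exist yet, so sanitize escalates to deny.
        return "deny", None
    matcher = re.compile(pattern)
    cleaned = {
        name: matcher.sub(placeholder, value) if isinstance(value, str) else value
        for name, value in arguments.items()
    }
    return "sanitize", cleaned
```

If you need to screen what a tool returns, that content has to be covered by another control; sanitize will not touch it.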
One switch sets your whole posture, writing real editable autonomy_* rows:
balanced (recommended start) — default audit, deny destructive shell, PII Shield in audit-only (flags PII).
tight — default-deny, deny destructive shell, deny SSRF-shaped fetch tools, PII Shield + Secrets Blocker enforced.
permissive — observe only.
One-click undo restores the prior state from the audit snapshot the apply wrote. It’s a single step — undo is unavailable once a later apply (or a manual policy edit) has superseded that snapshot. See Enforcement modes.
Not by preset. The tight autonomy SSRF preset denies the common fetch-shaped tool names (http_fetch, web_search, fetch_url, request). To deny by destination — RFC-1918 ranges, cloud-metadata IPs, specific CIDRs — author your own egress-surface host/CIDR deny rule. No preset ships CIDR rules for you. See Egress & data exfiltration.
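To show what a destination-based deny has to match, the sketch below checks an address against the RFC-1918 ranges and the link-local metadata address used by the major clouds (169.254.169.254). It illustrates the matching logic only; the rule schema you author in the console is not shown on this page, and `egress_denied` is a hypothetical name.

```python
import ipaddress

# RFC-1918 private ranges plus the common cloud-metadata address.
DENY_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "169.254.169.254/32",
    )
]


def egress_denied(ip: str) -> bool:
    """True if the destination IP falls inside any denied network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in DENY_NETWORKS)
```

Note that this matches resolved IP addresses. A hostname that resolves into one of these ranges is only caught if the check runs after resolution, which is worth confirming in Egress & data exfiltration.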
Turn on shadow mode (per-policy): the policy evaluates and logs but downgrades every enforcing verdict to audit, prefixing the reason [shadow] would …. Watch the Events and Runs views, then turn shadow off to enforce. Workspace-level observe mode (firewall_observe_mode) is the complementary discovery dial — it logs uncovered calls as gaps in Discovered Tools.
A pending_approval verdict returns 400 firewall_approval_pending with an approval id. A reviewer resolves it from the console (Developer+) or via an HMAC webhook callback (POST /api/v1/firewall/approvals/:id/callback). The agent polls GET /api/v1/firewall/approvals/:id and re-submits the original call with a single-use X-OrcaRouter-Firewall-Approval header. See Dangerous tool calls.
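The agent side of that flow might look like the sketch below. Several things are assumptions, not documented behavior: the poll response is treated as JSON with a `status` field taking values like `approved`, `denied`, or `expired`, and the header value is treated as an opaque string you obtain from the approval. The HTTP call is injected as `get_status` so the sketch stays transport-agnostic; in practice it would wrap `GET /api/v1/firewall/approvals/:id`.

```python
import time
from typing import Callable, Dict


def wait_for_approval(
    approval_id: str,
    get_status: Callable[[str], dict],  # wraps GET /api/v1/firewall/approvals/:id
    poll_seconds: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the approval is resolved; return the final response body."""
    for _ in range(max_polls):
        body = get_status(approval_id)
        state = body.get("status")  # field name and values are assumptions
        if state == "approved":
            return body
        if state in ("denied", "expired"):
            raise PermissionError(f"approval {approval_id} was {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"approval {approval_id} still pending")


def approval_headers(approval_value: str) -> Dict[str, str]:
    """Header to attach when re-submitting the original call (single use)."""
    return {"X-OrcaRouter-Firewall-Approval": approval_value}
```

Since the header is single-use, a re-submission that fails for an unrelated reason would presumably need a fresh approval rather than a blind retry.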
Rate/cost spikes scored against a learned hour-of-week baseline (14-day), plus retry_loop and novel_path (a tool-to-tool transition never seen before). The feed is Member-readable; snooze an anomaly for up to 7 days. See Excessive agency.
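Two of those ideas are easy to sketch: bucketing a timestamp into an hour-of-week slot, and spotting a tool-to-tool transition that has never been seen. How the product actually scores spikes is not described here, so this covers only the bucketing and the novelty check. The Monday-start, 0 to 167 indexing and both function names are assumptions.

```python
from datetime import datetime
from typing import List, Set, Tuple


def hour_of_week(ts: datetime) -> int:
    """Map a timestamp to one of 168 buckets (0 = Monday 00:00)."""
    return ts.weekday() * 24 + ts.hour


def novel_transitions(
    seen: Set[Tuple[str, str]],
    calls: List[str],
) -> List[Tuple[str, str]]:
    """Return consecutive tool-to-tool transitions absent from the baseline."""
    pairs = zip(calls, calls[1:])
    return [pair for pair in pairs if pair not in seen]
```

Comparing a reading against the same hour-of-week bucket is what keeps a routine Monday-morning burst from looking like an anomaly next to a quiet Sunday night.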

4. MCP, keys & gateway access

Register a server (name, endpoint, auth_mode of none/bearer/oauth/basic, encrypted credentials) and the MCP gateway evaluates every tools/call on the mcp surface before dispatch. Health is tracked (ok/degraded/down); probe it with POST /api/workspace/firewall/mcp_servers/:id/probe. A probe also baselines the server’s advertised tool schema — later drift flips its schema status from verified to changed (the “rug-pull” signal), and you either re-baseline (approve) or quarantine the server. So governance is per-call evaluation plus schema-integrity tracking and skill risk-bands. See Firewall MCP and MCP tool poisoning.
Each skill is scanned into a risk band with an enforcement mode of allow / quarantine / block. A quarantined skill is held for approval; auto-detected skills stay quarantined until a human reviews them. The mode rides on top of the rule verdict.
model_limits (+ model_limits_enabled), allow_ips, credit_limit_usd (0 = unlimited), expired_time (-1 = never), environment, guardrail_id, firewall_policy_id, and is_firewall_gateway. Combine them for least agency — see Scope, keys & policies. Keys are masked on display.
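The two sentinel values are worth handling explicitly in any client-side tooling that reads key settings. The sketch below assumes `expired_time` is a Unix timestamp in seconds when it is not -1, which this page does not state; `key_expired` and `credit_remaining` are hypothetical helper names.

```python
import time
from typing import Optional


def key_expired(expired_time: int, now: Optional[float] = None) -> bool:
    """-1 means the key never expires; otherwise compare to the clock.

    Treating expired_time as Unix seconds is an assumption.
    """
    if expired_time == -1:
        return False
    current = time.time() if now is None else now
    return current >= expired_time


def credit_remaining(credit_limit_usd: float, spent_usd: float) -> Optional[float]:
    """0 means unlimited (returned as None); otherwise the remaining budget."""
    if credit_limit_usd == 0:
        return None
    return max(credit_limit_usd - spent_usd, 0.0)
```

Treating 0 and -1 as ordinary numbers is an easy mistake: a naive `spent >= limit` check would reject every request on an unlimited key.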
Those gateway routes (POST /evaluate, POST /evaluate_plan, ANY /mcp) require a key with is_firewall_gateway=true — a dedicated firewall-gateway-scoped token, not your sk-orca-… relay key. Minting one and reading its plaintext is Admin+.
Configuration runs in the console — guardrails, firewall policies, MCP servers, and compliance are managed under your session/access token (UserAuth), and every write is role-gated (Developer+ for policy and guardrail writes). Only your /v1/* relay traffic uses an sk-orca-… key; only the /api/v1/firewall/* gateway hooks use the firewall-gateway-scoped token.

5. Compliance, residency & data

The catalog includes SOC 2, HIPAA, GDPR, UK GDPR, the EU AI Act, ISO 27001, ISO 42001, the NIST AI RMF, PCI DSS, CCPA, GLBA, the OWASP Top 10 for LLM Applications (as a control mapping), plus regional profiles (PIPL, APPI, PIPA, LGPD, PIPEDA, DPDP, Australia’s APPs, Singapore PDPA, DORA, and several US state laws). Browse the catalog, packs, and readiness — all Member, free — at /api/compliance/*.
Browsing is free; installing a pack, generating a report, going live, and setting residency require workspace Admin and a paid plan (server-gated). Installing a pack (POST /api/compliance/packs/:key/install) materializes real guardrails + firewall policies you can then edit.
Yes. A report is Ed25519-signed + SHA-256 and publicly verifiable: fetch the public key (GET /api/public/compliance/pubkey), verify a report (POST /api/public/compliance/verify), or hand an auditor a share link (GET /api/public/compliance/share/:token). Exports are CSV / JSON / PDF.
It’s the region of the compliance report artifact (us, eu, uk, ap, cn, global), settable via PUT /api/compliance/residency (Admin); a cross-region read is withheld. It is not geo-pinning of your inference data. See Shared responsibility.
Request-log retention defaults to 30 days and is server-clamped to a hard max of 180 days. An account deletion is held for a grace window (default 30 days) before an irreversible PII scrub runs; that scrub cascade-purges the Mongo request-log payloads, guardrail matches, and firewall events attributed to you. Archiving a workspace cascade-purges the same three collections for that workspace. See PII exposure.
A 400 from a security control is not a bug in your prompt. It’s a policy doing its job. Don’t retry — these codes are skip-retry. Trace the rule, then decide whether to fix the call or relax the policy: Why was it blocked?.

6. Still stuck?

Error codes

Every block, hold, and rejection the gateway can return.

Why was it blocked?

Read the code, open the right feed, find the exact rule.

Guardrail API

Routes, roles, and payloads for content policies.

Firewall API

Console and gateway routes for action governance.

Compliance API

Catalog, install, report, and residency endpoints.

Glossary

Every term used across the Zero-Trust docs.
For the threats these controls stop, start at the threat model. For a clean baseline, follow Secure Agents baseline.