AI agent security glossary
A quick-reference index of every term used across the Zero Trust documentation. Each definition is scoped to what you, as a developer on the hosted gateway, can observe and configure. Terms link to their home pages for full detail.
Identity & scope
| Term | Definition |
|---|---|
| Workspace | The top-level tenant boundary. All keys, guardrails, firewall policies, and audit events belong to one workspace; nothing crosses tenant boundaries. See Scope, keys & policies. |
| API key (scoped key) | A bearer token your agent presents on every call. Carries its own model allow-list, IP restrictions, spend cap, expiry, and the exact guardrail + firewall policy that applies to it. See Scope, keys & policies. |
| model_limits | The set of models (or model globs) a key is allowed to call. Requests for a model outside the list are rejected before any upstream call. |
| allow_ips | An IP or CIDR allowlist on the key. Requests originating from an address outside the list are rejected at authentication. |
| credit_limit_usd (spend cap) | A hard spend ceiling on the key, in USD. Once the key’s accumulated usage reaches the cap, further requests are rejected. Useful for bounding runaway agent loops. |
| Environment tag | A free-form label (e.g. production, staging) attached to a key to organize and identify it by deployment environment. |
| is_firewall_gateway | A flag that scopes a key for the Firewall gateway routes (/api/v1/firewall/*): the MCP dispatch and evaluate-hook endpoints. A regular key gets 403 on those routes. |
| Least agency | The principle of giving an agent only the models, spend, IPs, and policies it actually needs — no more. Implemented by combining model_limits, allow_ips, credit_limit_usd, and a restrictive firewall policy on the same key. See Scope, keys & policies. |
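
As a rough illustration of how the key-level limits above combine, the sketch below checks a request against model_limits, allow_ips, and credit_limit_usd. It is not the gateway's implementation: the ScopedKey class, the used_usd counter, the returned reason strings, the model names, and the use of shell-style fnmatch patterns for model globs are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network


@dataclass
class ScopedKey:
    # Field names mirror the glossary terms; the class itself is hypothetical.
    model_limits: list[str] = field(default_factory=list)
    allow_ips: list[str] = field(default_factory=list)
    credit_limit_usd: float = 0.0
    used_usd: float = 0.0  # hypothetical running total of the key's usage


def check_request(key: ScopedKey, model: str, source_ip: str) -> str:
    """Return "ok" or an illustrative rejection reason."""
    addr = ip_address(source_ip)
    # allow_ips: the caller must fall inside at least one IP or CIDR entry.
    if not any(addr in ip_network(cidr, strict=False) for cidr in key.allow_ips):
        return "ip_not_allowed"
    # model_limits: the model must match an entry (fnmatch globs are an assumption).
    if not any(fnmatchcase(model, pattern) for pattern in key.model_limits):
        return "model_not_allowed"
    # credit_limit_usd: reject once accumulated usage reaches the cap.
    if key.used_usd >= key.credit_limit_usd:
        return "spend_cap_reached"
    return "ok"


key = ScopedKey(
    model_limits=["example-model-*"],
    allow_ips=["10.0.0.0/8"],
    credit_limit_usd=5.0,
    used_usd=1.0,
)
print(check_request(key, "example-model-small", "10.1.2.3"))
```

Putting all three limits on one key is what the least-agency entry describes: a compromised or looping agent is bounded by the narrowest of them.
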
Guardrails
| Term | Definition |
|---|---|
| Guardrail | A named, workspace-scoped content policy — an ordered list of rules the gateway runs against request input and model output. Attach it to a key (or set it as the workspace default) once; every bound call is screened with no redeploy. |
| Rule | One check inside a guardrail: a type (what to detect), a stage (where to look), and an action (what to do). Rules run in order. |
| Stage | input (the caller’s request), output (the model’s response), or both. A rule fires only at its declared stage. |
| Action | What a guardrail rule does on a match: block — reject the request (HTTP 400); mask — redact the match and let the call through; flag — log only, no traffic change; annotate — attach a note (e.g. a CVE/SBOM finding) without changing traffic; spotlight — wrap matched untrusted text in delimiters so the model treats it as data, not instructions (a prompt-injection defense). |
| guardrail_blocked | The error code returned when a guardrail rule fires a block action. Returns HTTP 400. The request costs no quota: input-stage blocks fire before metering; output-stage blocks refund pre-consumed quota. |
| PII Shield | A pii-type rule that detects built-in sensitive entity types (email, phone, SSN, credit card, IP, and more) and masks them with typed tags. (The pii rule type also supports per-entity block when you author your own.) The canonical starting point for data-loss prevention. Secrets and credentials are covered by the separate Secrets Blocker preset. |
| Prompt-injection guardrail | A safety rule that detects attempts by untrusted content (web pages, tool results) to hijack the agent’s instructions. Ships as the Prompt-Injection Basics preset in the Safety template category. |
| Sensitive-word filter | A keyword-type rule that matches a literal term list, case-insensitively. The simplest denylist. |
| LLM judge | An llm_judge-type rule that runs a semantic check (toxicity, off-topic, jailbreak intent) against a model in your workspace. Use for fuzzy policies no regex can capture. Tokens billed as a judge sub-line. |
| Contextual grounding | A grounding-type rule that scores the model’s answer against the RAG sources on the request and flags or blocks answers that aren’t faithful to them. |
| Log raw content | A per-guardrail toggle — off by default (privacy-conservative). When off, the Matches feed records that a rule fired but not the matched substring. Turn on per guardrail when you need the actual string for triage. |
| Matches feed | The workspace-wide record of every rule that fired: rule type, action, stage, detail string, and (when Log raw content is on) the matched substring. Filterable by guardrail, rule type, and action. |
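
To make the rule, stage, and action entries above concrete, here is a minimal sketch of ordered evaluation for keyword-type rules only. The KeywordRule class, the apply_rules function, and the [MASKED] placeholder are hypothetical names chosen for the example; the real gateway's rule schema and mask tags may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class KeywordRule:
    terms: list[str]
    stage: str   # "input", "output", or "both"
    action: str  # "block", "mask", or "flag"


def apply_rules(rules: list[KeywordRule], text: str, stage: str):
    """Run rules in order; return (text, blocked, fired_actions)."""
    fired = []
    for rule in rules:
        # A rule fires only at its declared stage.
        if rule.stage not in (stage, "both") or not rule.terms:
            continue
        # Literal term list, matched case-insensitively.
        pattern = re.compile("|".join(re.escape(t) for t in rule.terms), re.IGNORECASE)
        if not pattern.search(text):
            continue
        fired.append(rule.action)
        if rule.action == "block":
            return text, True, fired  # reject; later rules are not evaluated
        if rule.action == "mask":
            text = pattern.sub("[MASKED]", text)  # placeholder tag is an assumption
        # "flag" only records the match and leaves the text unchanged
    return text, False, fired


rules = [
    KeywordRule(["secret"], "input", "mask"),
    KeywordRule(["forbidden"], "both", "block"),
]
print(apply_rules(rules, "My SECRET note", "input"))
```

The sketch stops at the first block because a blocked request has nothing left to screen; mask and flag let evaluation continue to later rules.
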
Agent Firewall
| Term | Definition |
|---|---|
| Firewall policy | A named, workspace-scoped set of ordered rules that the gateway evaluates on every tool call. Attach once to a key or set as the workspace default; no agent-code change required. |
| Verdict | The outcome a rule (or the default) produces for a tool call. One of allow, audit, deny, sanitize, pending_approval, or cap_cost. |
| Default verdict | The verdict applied when no rule in the policy matches the tool call. Defaults to audit — allow everything and record it — until you’re ready to enforce. |
| Enforcement surface | The point in the request lifecycle where the firewall sees a call: inbound (tool definitions the agent advertises), response (tool calls the model emits), mcp (a tools/call through the MCP gateway), or egress (an outbound destination reported by a tool). See Firewall. |
| Tool allow-list (glob) | A tool_name_glob on a rule — a small case-sensitive grammar (shell.*, *.exec, *) that matches a tool name or family. First-match-wins against the ordered rule list. |
| Argument validation | args_match clauses on a rule — eq, contains, regex, in, cidr_match, gt, lt operators over JSONPath fields in the tool’s arguments. The difference between “block shell.exec” and “block shell.exec only when the command is rm -rf.” |
| Sanitize | A sanitize verdict that redacts matched substrings (secrets, PII) from tool arguments and forwards the cleaned call, rather than blocking the whole action. Escalates to a block on the inbound surface. |
| Egress control | An egress-surface rule with a host/CIDR allow or deny list — the primary defense against SSRF and data exfiltration. The tight autonomy level also denies the common fetch-shaped tools (http_fetch, fetch_url, web_search, request). |
| cap_cost | A verdict that denies tool calls once the agent run’s accumulated spend (in cents) exceeds a per-rule ceiling. A circuit-breaker for runaway agent loops; authored as a rule and resolves to allow or deny in events based on accumulated spend. |
| Sequence rule | A rule with a sequence block that matches an ordered multi-step chain of tool calls within a time window (e.g. bulk-read → export → egress). Enforced reactively by an async matcher; surfaces on the events feed. |
| firewall_blocked | The error code on a denied tool call. Returns HTTP 400 on inbound; a tool error on mcp. Marked skip-retry. |
| Approval / HITL (pending_approval) | A pending_approval verdict holds a tool call for human review. The agent receives a held response with an approval id, a reviewer approves or rejects out of band, and the agent re-submits with a single-use approval token. The HTTP error code while held is firewall_approval_pending. |
| Anomaly detection | Statistical layer above static rules. Scores per-tool activity against a 14-day hour-of-week baseline and flags spikes, retry loops, and novel tool-transition paths on a reviewable feed. |
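
The first-match-wins behaviour of tool_name_glob and args_match, together with the audit default verdict, can be sketched as follows. This is an illustration under stated assumptions, not the gateway's evaluator: the dict-based rule shape, the "field" key, and the evaluate function are hypothetical; a flat key lookup stands in for JSONPath; only four of the listed operators are shown; and Python's fnmatchcase is used as a case-sensitive approximation of the glob grammar.

```python
import re
from fnmatch import fnmatchcase


def clause_matches(clause: dict, args: dict) -> bool:
    """Evaluate one args_match clause against the tool's arguments."""
    value = args.get(clause["field"])  # flat lookup stands in for JSONPath
    op, expected = clause["op"], clause["value"]
    if value is None:
        return False
    if op == "eq":
        return value == expected
    if op == "contains":
        return isinstance(value, str) and expected in value
    if op == "regex":
        return isinstance(value, str) and re.search(expected, value) is not None
    if op == "in":
        return value in expected
    return False  # operators not sketched here never match


def evaluate(rules: list[dict], tool: str, args: dict, default: str = "audit") -> str:
    """Return the verdict of the first rule whose glob and clauses all match."""
    for rule in rules:
        if not fnmatchcase(tool, rule["tool_name_glob"]):  # case-sensitive glob
            continue
        if all(clause_matches(c, args) for c in rule.get("args_match", [])):
            return rule["verdict"]
    return default  # no rule matched: fall back to the default verdict


rules = [
    {
        "tool_name_glob": "shell.exec",
        "args_match": [{"field": "command", "op": "contains", "value": "rm -rf"}],
        "verdict": "deny",
    },
    {"tool_name_glob": "shell.*", "verdict": "allow"},
]
print(evaluate(rules, "shell.exec", {"command": "rm -rf /tmp/x"}))
print(evaluate(rules, "shell.exec", {"command": "ls"}))
```

Because the list is ordered, the narrow deny rule has to sit above the broad shell.* allow rule; swapped, the allow rule would win for every shell call.
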
Postures
| Term | Definition |
|---|---|
| Observe mode | A workspace-level setting. When on and no policy is attached to a key, tool calls are allowed but logged as coverage gaps, populating the Discovered-tools view. |
| Shadow mode | A flag on a policy. The policy evaluates and logs exactly as it would in production, but every enforcing verdict is downgraded to audit (reason prefixed [shadow] would …). Safe-rollout switch. |
| Enforce | The default state when shadow mode is off and a policy is attached. Verdicts take effect — deny blocks, sanitize redacts, pending_approval holds. |
| Autonomy level | A single switch (tight / balanced / permissive) that atomically replaces the workspace’s Firewall and Guardrails posture in one transaction with one-click undo. See Enforcement modes and Secure Agents baseline. |
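
The difference between shadow mode and enforce can be pictured as a small post-processing step on each verdict. The sketch below is hypothetical: the apply_shadow function, the set of verdicts treated as enforcing (taken from the Enforce entry), and everything in the reason string after the "[shadow] would" prefix are assumptions for illustration.

```python
# Verdicts the Enforce entry describes as taking effect.
ENFORCING = {"deny", "sanitize", "pending_approval"}


def apply_shadow(verdict: str, reason: str, shadow: bool) -> tuple[str, str]:
    """Downgrade enforcing verdicts to audit when the policy is in shadow mode."""
    if shadow and verdict in ENFORCING:
        # Only the "[shadow] would" prefix comes from the glossary;
        # the rest of the reason format is an assumption.
        return "audit", f"[shadow] would {verdict}: {reason}"
    return verdict, reason


print(apply_shadow("deny", "matched rule 3", shadow=True))
print(apply_shadow("deny", "matched rule 3", shadow=False))
```

Non-enforcing verdicts such as allow pass through unchanged in both modes, which is why a shadow-mode log shows what enforcement would change and nothing else.
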
MCP & skills
| Term | Definition |
|---|---|
| MCP server | A Model Context Protocol server registered in your workspace and exposed through the Firewall MCP gateway (api.orcarouter.ai/api/v1/firewall/mcp). Every tools/call it receives is evaluated inline. See Firewall MCP. |
| tools/call | The MCP protocol message that dispatches a tool to an MCP server. The firewall evaluates it on the mcp surface before forwarding. |
| Rug-pull | A supply-chain attack where an MCP server changes or expands its tool definitions after you approved it. OrcaRouter catches it two ways: the gateway baselines each server’s advertised tool schema on first use and fails closed on drift — a server whose tool definitions change from the approved baseline is held (changed → re-approve or quarantine) instead of served; and every MCP tools/call is firewall-evaluated on the mcp surface, so an unexpected tool is denied at call time regardless. See Rug-pull defense and Schema-drift states. |
| Skill | A capability bundle (one or more tools from one or more MCP servers) that the gateway scans for risk on registration. Each skill gets a risk band and an enforcement mode (allow, quarantine, block) that rides on top of policy-level verdicts. |
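
One way to picture the schema baseline described under Rug-pull is a fingerprint of each server's advertised tool definitions, recorded on first use and compared on every later listing. The sketch below assumes a SHA-256 hash over canonical JSON; the DriftGuard class, its method names, and the "unchanged" state label are hypothetical ("changed" is the state named in the glossary), and the gateway's actual baselining may work differently.

```python
import hashlib
import json


def schema_fingerprint(tools: list[dict]) -> str:
    """Hash a server's advertised tool definitions in a canonical form."""
    canonical = json.dumps(
        sorted(tools, key=lambda t: t["name"]),  # ignore listing order
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DriftGuard:
    def __init__(self) -> None:
        self._baselines: dict[str, str] = {}

    def check(self, server: str, tools: list[dict]) -> str:
        """Return "unchanged" or "changed"; first use records the baseline."""
        fingerprint = schema_fingerprint(tools)
        baseline = self._baselines.setdefault(server, fingerprint)
        return "unchanged" if fingerprint == baseline else "changed"

    def reapprove(self, server: str, tools: list[dict]) -> None:
        """Accept the current definitions as the new approved baseline."""
        self._baselines[server] = schema_fingerprint(tools)


guard = DriftGuard()
approved = [{"name": "read_file", "description": "Read a file"}]
print(guard.check("docs-server", approved))
expanded = approved + [{"name": "send_email", "description": "Send mail"}]
print(guard.check("docs-server", expanded))
```

A caller that fails closed would refuse to serve any server whose check returns "changed" until a reviewer re-approves or quarantines it.
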
Compliance & data
| Term | Definition |
|---|---|
| Compliance pack | A pre-built guardrail + firewall policy bundle for a regulatory profile (GDPR, PCI, HIPAA, financial data). Apply once from the template library; rules are editable after application. |
| Signed compliance report | A workspace-level attestation report signed with Ed25519. The signature is publicly verifiable — anyone with the public key can confirm the report has not been tampered with. |
| Data residency | The region recorded for your compliance evidence. Signed compliance reports are stamped and stored by region (us, eu, uk, ap, cn, global), and a report is only served under a matching declared region. Set it in compliance settings. |
| Right to erasure | On a workspace deletion or explicit erasure request, OrcaRouter grants a 30-day grace period, then scrubs PII from logs and audit records for that workspace. |
| Audit event | An immutable record written after every create, update, delete, and enforcement decision — policy changes, rule edits, approval resolutions, guardrail saves. Secret values and rule blobs are never written to the audit log. |
Threats (one-liners)
| Threat | What it is |
|---|---|
| Prompt injection | An attacker embeds instructions in content the agent ingests (direct: in the user’s message; indirect: in a web page, document, or tool result) to hijack the agent’s behavior. |
| Jailbreak | A crafted prompt that attempts to bypass a model’s safety training, typically by framing the request as roleplay, hypothetical, or a system override. |
| Excessive agency / confused deputy | An agent granted broader permissions than its task requires, making it trivially exploitable by injected instructions — the key mitigation is least agency. |
| Data exfiltration | An agent (or injected instruction) steering tool calls or outbound requests to leak sensitive data to an attacker-controlled endpoint. Mitigated by egress control rules. |
| Denial-of-wallet | A runaway or adversarially triggered agent that generates unbounded upstream model spend. Mitigated by credit_limit_usd on the key and cap_cost rules in the firewall policy. |
For the full picture of how these controls compose, see Securing AI agents with OrcaRouter.
