AI agent security glossary

A quick-reference index of every term used across the Zero Trust documentation. Each definition is scoped to what you, as a developer on the hosted gateway, can observe and configure. Terms link to their home pages for full detail.

Identity & scope

| Term | Definition |
| --- | --- |
| Workspace | The top-level tenant boundary. All keys, guardrails, firewall policies, and audit events belong to one workspace; nothing crosses tenant boundaries. See Scope, keys & policies. |
| API key (scoped key) | A bearer token your agent presents on every call. Carries its own model allow-list, IP restrictions, spend cap, expiry, and the exact guardrail + firewall policy that applies to it. See Scope, keys & policies. |
| model_limits | The set of models (or model globs) a key is allowed to call. Requests for a model outside the list are rejected before any upstream call. |
| allow_ips | An IP or CIDR allowlist on the key. Requests originating from an address outside the list are rejected at authentication. |
| credit_limit_usd (spend cap) | A hard spend ceiling on the key, in USD. Once the key’s accumulated usage reaches the cap, further requests are rejected. Useful for bounding runaway agent loops. |
| Environment tag | A free-form label (e.g. production, staging) attached to a key to organize and identify it by deployment environment. |
| is_firewall_gateway | A flag that scopes a key for the Firewall gateway routes (/api/v1/firewall/*): the MCP dispatch and evaluate-hook endpoints. A regular key gets 403 on those routes. |
| Least agency | The principle of giving an agent only the models, spend, IPs, and policies it actually needs, and no more. Implemented by combining model_limits, allow_ips, credit_limit_usd, and a restrictive firewall policy on the same key. See Scope, keys & policies. |
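
The key-level checks described in this section can be pictured with a small local model. This is a minimal sketch under stated assumptions, not the gateway's implementation: the `ScopedKey` class, the `check_request` function, the order of the checks, the rejection strings, and the use of Python's `fnmatchcase` for model globs are all hypothetical. Only the field names `model_limits`, `allow_ips`, and `credit_limit_usd` come from the glossary.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network


@dataclass
class ScopedKey:
    """Hypothetical local model of a scoped key; field names follow the glossary."""
    model_limits: list[str]    # model names or globs the key may call
    allow_ips: list[str]       # single IPs or CIDR ranges
    credit_limit_usd: float    # hard spend ceiling in USD
    used_usd: float = 0.0      # accumulated usage (assumed to be tracked per key)


def check_request(key: ScopedKey, model: str, source_ip: str) -> str:
    """Return "ok" or an illustrative rejection reason (not real error codes)."""
    addr = ip_address(source_ip)
    # IP allowlist: the glossary says this is enforced at authentication.
    if not any(addr in ip_network(cidr, strict=False) for cidr in key.allow_ips):
        return "ip_not_allowed"
    # Model allow-list: rejected before any upstream call.
    if not any(fnmatchcase(model, pattern) for pattern in key.model_limits):
        return "model_not_allowed"
    # Spend cap: once usage reaches the cap, further requests are rejected.
    if key.used_usd >= key.credit_limit_usd:
        return "spend_cap_reached"
    return "ok"


# Example with made-up model names and documentation-range IP addresses.
key = ScopedKey(model_limits=["demo-*"], allow_ips=["10.0.0.0/8"], credit_limit_usd=5.0)
print(check_request(key, "demo-small", "10.1.2.3"))
```

Putting all three limits on one key is what the Least agency entry describes: a request has to pass every check, so a leaked or hijacked key is bounded on each axis at once.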

Guardrails

| Term | Definition |
| --- | --- |
| Guardrail | A named, workspace-scoped content policy: an ordered list of rules the gateway runs against request input and model output. Attach it to a key (or set it as the workspace default) once; every bound call is screened with no redeploy. |
| Rule | One check inside a guardrail: a type (what to detect), a stage (where to look), and an action (what to do). Rules run in order. |
| Stage | One of input (the caller’s request), output (the model’s response), or both. A rule fires only at its declared stage. |
| Action | What a guardrail rule does on a match: block (reject the request with HTTP 400); mask (redact the match and let the call through); flag (log only, no traffic change); annotate (attach a note, e.g. a CVE/SBOM finding, without changing traffic); spotlight (wrap matched untrusted text in delimiters so the model treats it as data, not instructions; a prompt-injection defense). |
| guardrail_blocked | The error code returned when a guardrail rule fires a block action. Returns HTTP 400. The request costs no quota: input-stage blocks fire before metering, and output-stage blocks refund pre-consumed quota. |
| PII Shield | A pii-type rule that detects built-in sensitive entity types (email, phone, SSN, credit card, IP, and more) and masks them with typed tags. (The pii rule type also supports per-entity block when you author your own.) The canonical starting point for data-loss prevention. Secrets and credentials are covered by the separate Secrets Blocker preset. |
| Prompt-injection guardrail | A safety rule that detects attempts by untrusted content (web pages, tool results) to hijack the agent’s instructions. Ships as the Prompt-Injection Basics preset in the Safety template category. |
| Sensitive-word filter | A keyword-type rule that matches a literal term list, case-insensitively. The simplest denylist. |
| LLM judge | An llm_judge-type rule that runs a semantic check (toxicity, off-topic, jailbreak intent) against a model in your workspace. Use for fuzzy policies no regex can capture. Tokens are billed as a judge sub-line. |
| Contextual grounding | A grounding-type rule that scores the model’s answer against the RAG sources on the request and flags or blocks answers that aren’t faithful to them. |
| Log raw content | A per-guardrail toggle, off by default (privacy-conservative). When off, the Matches feed records that a rule fired but not the matched substring. Turn it on per guardrail when you need the actual string for triage. |
| Matches feed | The workspace-wide record of every rule that fired: rule type, action, stage, detail string, and (when Log raw content is on) the matched substring. Filterable by guardrail, rule type, and action. |
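
To make the rule, stage, and action vocabulary concrete, here is a minimal sketch of an ordered rule list run against one piece of text. It is an illustration under assumptions, not the gateway's engine: the `Rule` class, `apply_guardrail`, the `GuardrailBlocked` exception, the use of a regex in place of a rule type, and the `[EMAIL]` tag format are all hypothetical, and only three of the five actions (block, mask, flag) are modelled.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str              # a regex standing in for the rule "type" (what to detect)
    stage: str                # "input", "output", or "both"
    action: str               # "block", "mask", or "flag"
    tag: str = "[REDACTED]"   # replacement used by mask; the tag format is assumed


class GuardrailBlocked(Exception):
    """Stands in for the guardrail_blocked error (HTTP 400 at the gateway)."""


def apply_guardrail(rules: list[Rule], text: str, stage: str) -> tuple[str, list[str]]:
    """Run rules in order at one stage; return (possibly masked text, flagged patterns)."""
    flagged: list[str] = []
    for rule in rules:
        if rule.stage not in (stage, "both"):
            continue  # a rule fires only at its declared stage
        if not re.search(rule.pattern, text, flags=re.IGNORECASE):
            continue
        if rule.action == "block":
            raise GuardrailBlocked(rule.pattern)  # reject the whole request
        if rule.action == "mask":
            text = re.sub(rule.pattern, rule.tag, text, flags=re.IGNORECASE)
        elif rule.action == "flag":
            flagged.append(rule.pattern)  # log only, traffic unchanged
    return text, flagged


# Illustrative rules: mask emails everywhere, flag a keyword, block a phrase on input.
rules = [
    Rule(r"[\w.+-]+@[\w-]+\.[\w.]+", "both", "mask", "[EMAIL]"),
    Rule(r"internal-codename", "input", "flag"),
    Rule(r"drop\s+table", "input", "block"),
]
print(apply_guardrail(rules, "Mail bob@example.com about internal-codename", "input"))
```

Because rules run in order, a mask placed before a flag changes the text the later rule sees, which is one reason rule order matters when authoring a guardrail.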

Agent Firewall

| Term | Definition |
| --- | --- |
| Firewall policy | A named, workspace-scoped set of ordered rules that the gateway evaluates on every tool call. Attach once to a key or set as the workspace default; no agent-code change required. |
| Verdict | The outcome a rule (or the default) produces for a tool call. One of allow, audit, deny, sanitize, pending_approval, or cap_cost. |
| Default verdict | The verdict applied when no rule in the policy matches the tool call. Defaults to audit (allow everything and record it) until you’re ready to enforce. |
| Enforcement surface | The point in the request lifecycle where the firewall sees a call: inbound (tool definitions the agent advertises), response (tool calls the model emits), mcp (a tools/call through the MCP gateway), or egress (an outbound destination reported by a tool). See Firewall. |
| Tool allow-list (glob) | A tool_name_glob on a rule: a small case-sensitive grammar (shell.*, *.exec, *) that matches a tool name or family. Rules are evaluated first-match-wins against the ordered rule list. |
| Argument validation | args_match clauses on a rule: eq, contains, regex, in, cidr_match, gt, and lt operators over JSONPath fields in the tool’s arguments. The difference between “block shell.exec” and “block shell.exec only when the command is rm -rf.” |
| Sanitize | A sanitize verdict that redacts matched substrings (secrets, PII) from tool arguments and forwards the cleaned call, rather than blocking the whole action. Escalates to a block on the inbound surface. |
| Egress control | An egress-surface rule with a host/CIDR allow or deny list. The primary defense against SSRF and data exfiltration. The tight autonomy level also denies the common fetch-shaped tools (http_fetch, fetch_url, web_search, request). |
| cap_cost | A verdict that denies tool calls once the agent run’s accumulated spend (in cents) exceeds a per-rule ceiling. A circuit breaker for runaway agent loops. It is authored as a rule and resolves to allow or deny in events, based on accumulated spend. |
| Sequence rule | A rule with a sequence block that matches an ordered multi-step chain of tool calls within a time window (e.g. bulk-read → export → egress). Enforced reactively by an async matcher; surfaces on the events feed. |
| firewall_blocked | The error code on a denied tool call. Returns HTTP 400 on inbound and a tool error on mcp. Marked skip-retry. |
| Approval / HITL (pending_approval) | A pending_approval verdict holds a tool call for human review. The agent receives a held response with an approval id, a reviewer approves or rejects out of band, and the agent re-submits with a single-use approval token. The HTTP error code while held is firewall_approval_pending. |
| Anomaly detection | A statistical layer above static rules. Scores per-tool activity against a 14-day hour-of-week baseline and flags spikes, retry loops, and novel tool-transition paths on a reviewable feed. |
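
First-match-wins evaluation with a tool glob, argument clauses, and a default verdict can be sketched as follows. This is a simplified illustration, not the gateway's evaluator: the rule dictionary keys other than `tool_name_glob` and `args_match` (namely `path`, `op`, `value`, `verdict`) are hypothetical, a dotted path stands in for real JSONPath, Python's `fnmatchcase` approximates the glob grammar (it accepts more syntax than the three forms listed above), and treating a missing field as a non-match is an assumption.

```python
import re
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network

# Operator names come from the glossary; the Python semantics are assumed.
OPERATORS = {
    "eq": lambda value, arg: value == arg,
    "contains": lambda value, arg: arg in value,
    "regex": lambda value, arg: re.search(arg, value) is not None,
    "in": lambda value, arg: value in arg,
    "cidr_match": lambda value, arg: ip_address(value) in ip_network(arg),
    "gt": lambda value, arg: value > arg,
    "lt": lambda value, arg: value < arg,
}


def lookup(args: dict, path: str):
    """Resolve a dotted path such as "$.command" (a simplified stand-in for JSONPath)."""
    node = args
    for part in path.removeprefix("$.").split("."):
        node = node[part]
    return node


def evaluate(rules: list[dict], tool: str, args: dict, default: str = "audit") -> str:
    """Return the verdict of the first matching rule, else the default verdict."""
    for rule in rules:
        if not fnmatchcase(tool, rule["tool_name_glob"]):  # case-sensitive glob
            continue
        clauses = rule.get("args_match", [])
        try:
            matched = all(
                OPERATORS[c["op"]](lookup(args, c["path"]), c["value"]) for c in clauses
            )
        except (KeyError, TypeError, ValueError):
            matched = False  # missing or mistyped field: this rule does not match
        if matched:
            return rule["verdict"]
    return default


# "Block shell.exec only when the command is rm -rf", then allow other shell tools.
policy = [
    {"tool_name_glob": "shell.*",
     "args_match": [{"path": "$.command", "op": "contains", "value": "rm -rf"}],
     "verdict": "deny"},
    {"tool_name_glob": "shell.*", "verdict": "allow"},
]
print(evaluate(policy, "shell.exec", {"command": "rm -rf /tmp/x"}))
```

Ordering carries the policy here: the narrow deny sits above the broad allow, and anything that matches neither rule falls through to the default verdict.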

Postures

| Term | Definition |
| --- | --- |
| Observe mode | A workspace-level setting. When on and no policy is attached to a key, tool calls are allowed but logged as coverage gaps, populating the Discovered-tools view. |
| Shadow mode | A flag on a policy. The policy evaluates and logs exactly as it would in production, but every enforcing verdict is downgraded to audit (reason prefixed [shadow] would …). This is the safe-rollout switch. |
| Enforce | The default state when shadow mode is off and a policy is attached. Verdicts take effect: deny blocks, sanitize redacts, pending_approval holds. |
| Autonomy level | A single switch (tight / balanced / permissive) that atomically replaces the workspace’s Firewall and Guardrails posture in one transaction with one-click undo. See Enforcement modes and Secure Agents baseline. |
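
The shadow-mode downgrade can be pictured as a thin wrapper around a policy's verdict. This is a hedged sketch only: the function name `effective_verdict` is hypothetical, the set of "enforcing" verdicts is an assumption drawn from the Enforce entry, and the text that follows the `[shadow] would` prefix is a placeholder, since the documentation elides the exact wording.

```python
# Assumed set of enforcing verdicts, taken from the Enforce entry above.
ENFORCING = {"deny", "sanitize", "pending_approval"}


def effective_verdict(verdict: str, reason: str, shadow: bool) -> tuple[str, str]:
    """Return the (verdict, reason) pair that takes effect for a tool call."""
    if shadow and verdict in ENFORCING:
        # Downgrade to audit; everything after "[shadow] would" is a placeholder.
        return "audit", f"[shadow] would {verdict}: {reason}"
    return verdict, reason


print(effective_verdict("deny", "matched shell.* rule", shadow=True))
print(effective_verdict("deny", "matched shell.* rule", shadow=False))
```

Because the policy still evaluates and logs in shadow mode, the events feed shows what would have been blocked before the policy starts enforcing.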

MCP & skills

| Term | Definition |
| --- | --- |
| MCP server | A Model Context Protocol server registered in your workspace and exposed through the Firewall MCP gateway (api.orcarouter.ai/api/v1/firewall/mcp). Every tools/call it receives is evaluated inline. See Firewall MCP. |
| tools/call | The MCP protocol message that dispatches a tool to an MCP server. The firewall evaluates it on the mcp surface before forwarding. |
| Rug-pull | A supply-chain attack where an MCP server changes or expands its tool definitions after you approved it. OrcaRouter catches it two ways. First, the gateway baselines each server’s advertised tool schema on first use and fails closed on drift: a server whose tool definitions change from the approved baseline is held (changed → re-approve or quarantine) instead of served. Second, every MCP tools/call is firewall-evaluated on the mcp surface, so an unexpected tool is denied at call time regardless. See Rug-pull defense and Schema-drift states. |
| Skill | A capability bundle (one or more tools from one or more MCP servers) that the gateway scans for risk on registration. Each skill gets a risk band and an enforcement mode (allow, quarantine, block) that rides on top of policy-level verdicts. |
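
The baseline-and-fail-closed idea behind the rug-pull defense can be sketched with a schema fingerprint. This is an illustration under assumptions, not OrcaRouter's mechanism: the `DriftGate` class, its method names, the "served" and "held" return values, and the choice of SHA-256 over canonical JSON are all hypothetical.

```python
import hashlib
import json


def schema_fingerprint(tools: list[dict]) -> str:
    """Hash a server's advertised tool definitions in a canonical JSON form."""
    canonical = json.dumps(
        sorted(tools, key=lambda t: t["name"]),  # tool order must not matter
        sort_keys=True,                          # key order must not matter
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DriftGate:
    """Records a baseline per server on first use and fails closed on drift."""

    def __init__(self) -> None:
        self.baselines: dict[str, str] = {}

    def check(self, server: str, tools: list[dict]) -> str:
        fingerprint = schema_fingerprint(tools)
        baseline = self.baselines.setdefault(server, fingerprint)  # first use sets it
        return "served" if baseline == fingerprint else "held"

    def reapprove(self, server: str, tools: list[dict]) -> None:
        """A reviewer accepts the changed schema as the new baseline."""
        self.baselines[server] = schema_fingerprint(tools)


gate = DriftGate()
v1 = [{"name": "read_file", "inputSchema": {"type": "object"}}]
v2 = v1 + [{"name": "delete_file", "inputSchema": {"type": "object"}}]
print(gate.check("files", v1), gate.check("files", v2))
```

Hashing a canonical form means cosmetic reordering does not trigger a hold, while any added, removed, or altered tool definition does.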

Compliance & data

| Term | Definition |
| --- | --- |
| Compliance pack | A pre-built guardrail + firewall policy bundle for a regulatory profile (GDPR, PCI, HIPAA, financial data). Apply once from the template library; rules are editable after application. |
| Signed compliance report | A workspace-level attestation report signed with Ed25519. The signature is publicly verifiable: anyone with the public key can confirm the report has not been tampered with. |
| Data residency | The region recorded for your compliance evidence. Signed compliance reports are stamped and stored by region (us, eu, uk, ap, cn, global), and a report is only served under a matching declared region. Set it in compliance settings. |
| Right to erasure | On a workspace deletion or explicit erasure request, OrcaRouter grants a 30-day grace period, then scrubs PII from logs and audit records for that workspace. |
| Audit event | An immutable record written after every create, update, delete, and enforcement decision, including policy changes, rule edits, approval resolutions, and guardrail saves. Secret values and rule blobs are never written to the audit log. |

Threats (one-liners)

| Threat | What it is |
| --- | --- |
| Prompt injection | An attacker embeds instructions in content the agent ingests (direct: in the user’s message; indirect: in a web page, document, or tool result) to hijack the agent’s behavior. |
| Jailbreak | A crafted prompt that attempts to bypass a model’s safety training, typically by framing the request as roleplay, hypothetical, or a system override. |
| Excessive agency / confused deputy | An agent granted broader permissions than its task requires, making it trivially exploitable by injected instructions. The key mitigation is least agency. |
| Data exfiltration | An agent (or injected instruction) steering tool calls or outbound requests to leak sensitive data to an attacker-controlled endpoint. Mitigated by egress control rules. |
| Denial-of-wallet | A runaway or adversarially triggered agent that generates unbounded upstream model spend. Mitigated by credit_limit_usd on the key and cap_cost rules in the firewall policy. |

For the full picture of how these controls compose, see Securing AI agents with OrcaRouter.