Concepts glossary - OrcaRouter

AI agent security glossary

A quick-reference index of every term used across the Zero Trust documentation. Each definition is scoped to what you, as a developer on the hosted gateway, can observe and configure. Terms link to their home pages for full detail.

Identity & scope

Term	Definition
Workspace	The top-level tenant boundary. All keys, guardrails, firewall policies, and audit events belong to one workspace; nothing crosses tenant boundaries. See Scope, keys & policies.
API key (scoped key)	A bearer token your agent presents on every call. Carries its own model allow-list, IP restrictions, spend cap, expiry, and the exact guardrail + firewall policy that applies to it. See Scope, keys & policies.
`model_limits`	The set of models (or model globs) a key is allowed to call. Requests for a model outside the list are rejected before any upstream call.
`allow_ips`	An IP or CIDR allowlist on the key. Requests originating from an address outside the list are rejected at authentication.
`credit_limit_usd` (spend cap)	A hard spend ceiling on the key, in USD. Once the key’s accumulated usage reaches the cap, further requests are rejected. Useful for bounding runaway agent loops.
Environment tag	A free-form label (e.g. `production`, `staging`) attached to a key to organize and identify it by deployment environment.
`is_firewall_gateway`	A flag that scopes a key for the Firewall gateway routes (`/api/v1/firewall/*`) — the MCP dispatch and evaluate-hook endpoints. A regular key gets `403` on those routes.
Least agency	The principle of giving an agent only the models, spend, IPs, and policies it actually needs — no more. Implemented by combining `model_limits`, `allow_ips`, `credit_limit_usd`, and a restrictive firewall policy on the same key. See Scope, keys & policies.
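
The key-level restrictions above can be pictured as a chain of checks that each request must pass. The following is a minimal sketch of those semantics, not the gateway's implementation: the `ScopedKey` class, the `check_request` helper, the `used_usd` field, the check order, and the model names in the usage note are all hypothetical. It also assumes an empty `allow_ips` list means no IP restriction, and uses Python's `fnmatchcase` as a stand-in for the model glob syntax.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network


@dataclass
class ScopedKey:
    # Field names mirror the glossary terms; the class itself is hypothetical.
    model_limits: list[str] = field(default_factory=list)  # model names or globs
    allow_ips: list[str] = field(default_factory=list)     # IPs or CIDR ranges
    credit_limit_usd: float = 0.0                           # hard spend ceiling
    used_usd: float = 0.0                                   # accumulated usage (assumed field)


def check_request(key: ScopedKey, model: str, client_ip: str) -> str:
    """Return "ok" or the name of the first restriction that rejects the call."""
    addr = ip_address(client_ip)
    # Assumption: an empty allow_ips list means "no IP restriction".
    # strict=False lets a bare IP parse as a single-address network.
    if key.allow_ips and not any(
        addr in ip_network(entry, strict=False) for entry in key.allow_ips
    ):
        return "allow_ips"
    # The model must match at least one entry, exact name or glob.
    if not any(fnmatchcase(model, pattern) for pattern in key.model_limits):
        return "model_limits"
    # Reaching the cap rejects further requests.
    if key.used_usd >= key.credit_limit_usd:
        return "credit_limit_usd"
    return "ok"
```

For example, a key built with `model_limits=["acme-large-*"]`, `allow_ips=["10.0.0.0/8"]`, and `credit_limit_usd=5.0` would pass a call to `acme-large-v2` from `10.1.2.3` until its accumulated usage reaches 5.0 USD.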

Guardrails

Term	Definition
Guardrail	A named, workspace-scoped content policy — an ordered list of rules the gateway runs against request input and model output. Attach it to a key (or set it as the workspace default) once; every bound call is screened with no redeploy.
Rule	One check inside a guardrail: a type (what to detect), a stage (where to look), and an action (what to do). Rules run in order.
Stage	`input` (the caller’s request), `output` (the model’s response), or `both`. A rule fires only at its declared stage.
Action	What a guardrail rule does on a match: `block` — reject the request (HTTP 400); `mask` — redact the match and let the call through; `flag` — log only, no traffic change; `annotate` — attach a note (e.g. a CVE/SBOM finding) without changing traffic; `spotlight` — wrap matched untrusted text in delimiters so the model treats it as data, not instructions (a prompt-injection defense).
`guardrail_blocked`	The error code returned when a guardrail rule fires a `block` action. Returns HTTP 400. The request costs no quota — input-stage blocks fire before metering; output-stage blocks refund pre-consumed quota.
PII Shield	A `pii`-type rule that detects built-in sensitive entity types (email, phone, SSN, credit card, IP, and more) and masks them with typed tags. The `pii` rule type also supports a per-entity `block` action when you author your own rules. The canonical starting point for data-loss prevention. Secrets and credentials are covered by the separate Secrets Blocker preset.
Prompt-injection guardrail	A safety rule that detects attempts by untrusted content (web pages, tool results) to hijack the agent’s instructions. Ships as the Prompt-Injection Basics preset in the Safety template category.
Sensitive-word filter	A `keyword`-type rule that matches a literal term list, case-insensitively. The simplest denylist.
LLM judge	An `llm_judge`-type rule that runs a semantic check (toxicity, off-topic, jailbreak intent) against a model in your workspace. Use it for fuzzy policies no regex can capture. The judge's tokens are billed as a judge sub-line.
Contextual grounding	A `grounding`-type rule that scores the model’s answer against the RAG sources on the request and flags or blocks answers that aren’t faithful to them.
Log raw content	A per-guardrail toggle — off by default (privacy-conservative). When off, the Matches feed records that a rule fired but not the matched substring. Turn on per guardrail when you need the actual string for triage.
Matches feed	The workspace-wide record of every rule that fired: rule type, action, stage, detail string, and (when Log raw content is on) the matched substring. Filterable by guardrail, rule type, and action.
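
The rule model above (type, stage, action, evaluated in order) can be illustrated with a small evaluator. This is a sketch under stated assumptions rather than the gateway's logic: the `Rule` class, `apply_guardrail`, and `GuardrailBlocked` are hypothetical names, a regular expression stands in for each rule type's detector, and only the `block`, `mask`, and `flag` actions are covered.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str             # regex standing in for a rule type's detector
    stage: str               # "input", "output", or "both"
    action: str              # "block", "mask", or "flag"
    tag: str = "[REDACTED]"  # hypothetical replacement text used by "mask"


class GuardrailBlocked(Exception):
    """Stands in for the guardrail_blocked error (HTTP 400 at the gateway)."""


def apply_guardrail(rules: list[Rule], text: str, stage: str) -> tuple[str, list[str]]:
    """Run rules in order; return the (possibly masked) text and flagged patterns."""
    flagged: list[str] = []
    for rule in rules:
        if rule.stage not in (stage, "both"):
            continue  # a rule fires only at its declared stage
        if not re.search(rule.pattern, text, flags=re.IGNORECASE):
            continue
        if rule.action == "block":
            raise GuardrailBlocked(rule.pattern)  # reject the request
        if rule.action == "mask":
            # Redact every match and let the call through.
            text = re.sub(rule.pattern, lambda _m: rule.tag, text, flags=re.IGNORECASE)
        elif rule.action == "flag":
            flagged.append(rule.pattern)  # log only, traffic unchanged
    return text, flagged
```

Because rules run in order, a `mask` rule placed before a `block` rule changes the text the later rule sees, so ordering is part of the policy in this sketch.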

Agent Firewall

Term	Definition
Firewall policy	A named, workspace-scoped set of ordered rules that the gateway evaluates on every tool call. Attach once to a key or set as the workspace default; no agent-code change required.
Verdict	The outcome a rule (or the default) produces for a tool call. One of `allow`, `audit`, `deny`, `sanitize`, `pending_approval`, or `cap_cost`.
Default verdict	The verdict applied when no rule in the policy matches the tool call. Defaults to `audit` — allow everything and record it — until you’re ready to enforce.
Enforcement surface	The point in the request lifecycle where the firewall sees a call: `inbound` (tool definitions the agent advertises), `response` (tool calls the model emits), `mcp` (a `tools/call` through the MCP gateway), or `egress` (an outbound destination reported by a tool). See Firewall.
Tool allow-list (glob)	A `tool_name_glob` on a rule: a small case-sensitive grammar (`shell.*`, `*.exec`, `*`) that matches a tool name or family. Rules are evaluated first-match-wins against the ordered rule list.
Argument validation	`args_match` clauses on a rule — `eq`, `contains`, `regex`, `in`, `cidr_match`, `gt`, `lt` operators over JSONPath fields in the tool’s arguments. The difference between “block `shell.exec`” and “block `shell.exec` only when the command is `rm -rf`.”
Sanitize	A `sanitize` verdict that redacts matched substrings (secrets, PII) from tool arguments and forwards the cleaned call, rather than blocking the whole action. Escalates to a block on the `inbound` surface.
Egress control	An `egress`-surface rule with a host/CIDR allow or deny list — the primary defense against SSRF and data exfiltration. The `tight` autonomy level also denies the common fetch-shaped tools (`http_fetch`, `fetch_url`, `web_search`, `request`).
`cap_cost`	A verdict that denies tool calls once the agent run’s accumulated spend (in cents) exceeds a per-rule ceiling. A circuit-breaker for runaway agent loops. It is authored as a rule verdict, and each evaluation appears in events as `allow` or `deny`, depending on accumulated spend.
Sequence rule	A rule with a `sequence` block that matches an ordered multi-step chain of tool calls within a time window (e.g. bulk-read → export → egress). Enforced reactively by an async matcher; surfaces on the events feed.
`firewall_blocked`	The error code on a denied tool call. Returns HTTP 400 on `inbound`; a tool error on `mcp`. Marked skip-retry.
Approval / HITL (`pending_approval`)	A `pending_approval` verdict holds a tool call for human review. The agent receives a held response with an approval id, a reviewer approves or rejects out of band, and the agent re-submits with a single-use approval token. The HTTP error code while held is `firewall_approval_pending`.
Anomaly detection	A statistical layer above static rules. It scores per-tool activity against a 14-day hour-of-week baseline and flags spikes, retry loops, and novel tool-transition paths on a reviewable feed.
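
First-match-wins evaluation with argument validation and a default verdict can be sketched as follows. The rule shape (dicts with `tool_name_glob`, `args_match`, `verdict`, and clause keys `field`, `op`, `value`) is an assumption made for illustration, a dotted-path lookup stands in for JSONPath, Python's `fnmatchcase` approximates the glob grammar (it accepts more syntax than the grammar described above), and only three of the listed operators are implemented.

```python
import re
from fnmatch import fnmatchcase

# Three of the listed operators; the rest are omitted from this sketch.
OPERATORS = {
    "eq": lambda value, expected: value == expected,
    "contains": lambda value, expected: expected in str(value),
    "regex": lambda value, expected: re.search(expected, str(value)) is not None,
}


def lookup(args: dict, path: str):
    """Resolve a dotted path such as "options.cwd"; a simplified stand-in for JSONPath."""
    node = args
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def evaluate(rules: list[dict], tool: str, args: dict, default: str = "audit") -> str:
    """Return the verdict of the first matching rule, else the default verdict."""
    for rule in rules:
        # Case-sensitive glob match on the tool name.
        if not fnmatchcase(tool, rule["tool_name_glob"]):
            continue
        # Every args_match clause must hold; a rule with no clauses matches on name alone.
        clauses = rule.get("args_match", [])
        if all(OPERATORS[c["op"]](lookup(args, c["field"]), c["value"]) for c in clauses):
            return rule["verdict"]
    return default


rules = [
    {
        "tool_name_glob": "shell.exec",
        "args_match": [{"field": "command", "op": "contains", "value": "rm -rf"}],
        "verdict": "deny",
    },
    {"tool_name_glob": "shell.*", "verdict": "allow"},
]
```

With these two rules, `shell.exec` is denied only when its `command` argument contains `rm -rf`; other `shell.` tools are allowed, and anything else falls through to the `audit` default. Placing the narrower rule first matters, because the first match decides the verdict.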

Postures

Term	Definition
Observe mode	A workspace-level setting. When on and no policy is attached to a key, tool calls are allowed but logged as coverage gaps, populating the Discovered-tools view.
Shadow mode	A flag on a policy. The policy evaluates and logs exactly as it would in production, but every enforcing verdict is downgraded to `audit` (reason prefixed `[shadow] would …`). Safe-rollout switch.
Enforce	The default state when shadow mode is off and a policy is attached. Verdicts take effect — `deny` blocks, `sanitize` redacts, `pending_approval` holds.
Autonomy level	A single switch (`tight` / `balanced` / `permissive`) that atomically replaces the workspace’s Firewall and Guardrails posture in one transaction with one-click undo. See Enforcement modes and Secure Agents baseline.
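
The difference between shadow mode and enforce comes down to one downgrade step applied after policy evaluation. A minimal sketch, assuming that `deny`, `sanitize`, and `pending_approval` are the enforcing verdicts (the set is inferred from the Enforce entry, and `effective_verdict` is a hypothetical helper):

```python
from typing import Optional

# Assumed set of enforcing verdicts, taken from the Enforce entry above.
ENFORCING = {"deny", "sanitize", "pending_approval"}


def effective_verdict(verdict: str, shadow: bool) -> tuple[str, Optional[str]]:
    """Return the verdict that takes effect, plus the verdict it replaced, if any."""
    if shadow and verdict in ENFORCING:
        # Shadow mode: record what would have happened, but only audit.
        return "audit", verdict
    return verdict, None
```

The second element of the result is what a log line could use to report the verdict that would have applied.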

MCP & skills

Term	Definition
MCP server	A Model Context Protocol server registered in your workspace and exposed through the Firewall MCP gateway (`api.orcarouter.ai/api/v1/firewall/mcp`). Every `tools/call` it receives is evaluated inline. See Firewall MCP.
`tools/call`	The MCP protocol message that dispatches a tool to an MCP server. The firewall evaluates it on the `mcp` surface before forwarding.
Rug-pull	A supply-chain attack where an MCP server changes or expands its tool definitions after you approved it. OrcaRouter catches it two ways: the gateway baselines each server’s advertised tool schema on first use and fails closed on drift — a server whose tool definitions change from the approved baseline is held (`changed` → re-approve or quarantine) instead of served; and every MCP `tools/call` is firewall-evaluated on the `mcp` surface, so an unexpected tool is denied at call time regardless. See Rug-pull defense and Schema-drift states.
Skill	A capability bundle (one or more tools from one or more MCP servers) that the gateway scans for risk on registration. Each skill gets a risk band and an enforcement mode (`allow`, `quarantine`, `block`) that rides on top of policy-level verdicts.
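
Baselining a server's advertised tool schema and failing closed on drift can be illustrated with a fingerprint comparison. This is a sketch of the idea only: how OrcaRouter actually computes and stores baselines is not described here, the helper names and the `baselined` and `unchanged` state names are hypothetical (only `changed` comes from the Rug-pull entry), and hashing canonical JSON with SHA-256 is one possible choice.

```python
import hashlib
import json


def schema_fingerprint(tools: list[dict]) -> str:
    """Hash tool definitions in canonical form so key order and list order do not matter."""
    canonical = json.dumps(
        sorted(tools, key=lambda tool: tool["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_server(baselines: dict[str, str], server: str, tools: list[dict]) -> str:
    """Record a baseline on first use; report "changed" on any later drift."""
    fingerprint = schema_fingerprint(tools)
    if server not in baselines:
        baselines[server] = fingerprint
        return "baselined"  # hypothetical state name
    # Any difference from the approved baseline means the server should be held, not served.
    return "unchanged" if baselines[server] == fingerprint else "changed"
```

A caller would treat `changed` as a reason to hold the server for re-approval or quarantine, which matches the fail-closed behavior described in the Rug-pull entry.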

Compliance & data

Term	Definition
Compliance pack	A pre-built guardrail + firewall policy bundle for a regulatory profile (GDPR, PCI, HIPAA, financial data). Apply once from the template library; rules are editable after application.
Signed compliance report	A workspace-level attestation report signed with Ed25519. The signature is publicly verifiable — anyone with the public key can confirm the report has not been tampered with.
Data residency	The region recorded for your compliance evidence. Signed compliance reports are stamped and stored by region (`us`, `eu`, `uk`, `ap`, `cn`, `global`), and a report is only served under a matching declared region. Set it in compliance settings.
Right to erasure	On a workspace deletion or explicit erasure request, OrcaRouter grants a 30-day grace period, then scrubs PII from logs and audit records for that workspace.
Audit event	An immutable record written after every create, update, delete, and enforcement decision — policy changes, rule edits, approval resolutions, guardrail saves. Secret values and rule blobs are never written to the audit log.

Threats (one-liners)

Threat	What it is
Prompt injection	An attacker embeds instructions in content the agent ingests (direct: in the user’s message; indirect: in a web page, document, or tool result) to hijack the agent’s behavior.
Jailbreak	A crafted prompt that attempts to bypass a model’s safety training, typically by framing the request as roleplay, hypothetical, or a system override.
Excessive agency / confused deputy	An agent granted broader permissions than its task requires, making it trivially exploitable by injected instructions — the key mitigation is least agency.
Data exfiltration	An agent (or injected instruction) steering tool calls or outbound requests to leak sensitive data to an attacker-controlled endpoint. Mitigated by egress control rules.
Denial-of-wallet	A runaway or adversarially triggered agent that generates unbounded upstream model spend. Mitigated by `credit_limit_usd` on the key and `cap_cost` rules in the firewall policy.

For the full picture of how these controls compose, see Securing AI agents with OrcaRouter.
