tools/call
at dispatch time against your live policy. Each registered server’s
advertised tool set is baselined on first probe and re-checked for drift
— if the tool schema changes from the approved baseline, the server fails
closed until an admin re-approves or quarantines it. And the
Skills layer assigns every installed capability
a risk band and an enforcement mode — quarantining anything risky or
unreviewed until a human signs off. A server can’t earn a free pass by
behaving for the first hundred calls.
1. Why MCP rug pull protection needs per-call evaluation
A connect-time review answers one question once: is this server safe to list? It can’t answer the question that actually matters at runtime: is this specific call, with these specific arguments, safe right now? OrcaRouter answers the second question. Every tools/call that crosses the
gateway is evaluated on the mcp surface before it’s dispatched to the
real server, with the tool name and arguments in hand. The verdict is
computed fresh each time, so the moment a tool starts doing something your
policy forbids — exfiltrating a secret in an argument, reaching a host
that’s denied, calling a capability you never approved — the call is
stopped, regardless of how the same tool behaved a minute ago.
Per-call evaluation governs the behavior of each call (argument
contents, destinations, the owning skill’s risk), so it catches a rug pull
even when the tool keeps an identical signature and only its behavior
turns hostile. Schema-drift detection (§3 below) is the complementary layer:
it catches the case where the server’s advertised tool set itself changes.
Both run.
mcp surface:
allow / audit
Forwarded to the server. audit logs the call; allow stays quiet.
sanitize
Forwarded with the tool-call arguments redacted first (it never
rewrites what the server returns).
deny
Returned to the model as a tool error (firewall deny: …) so the
agent can adapt instead of crashing.
pending_approval
The call is held for a human to resolve before it can run.
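The verdict handling above can be sketched as a small dispatcher. This is a minimal illustration, not OrcaRouter's implementation: the `Outcome` type, the `redact` helper, the `secret_keys` parameter, and the `[REDACTED]` marker are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    forwarded: bool              # True if the call reaches the real server
    arguments: Optional[dict]    # arguments as forwarded (None if not forwarded)
    audit_logged: bool           # set only by the audit verdict in this sketch
    tool_error: Optional[str]    # error text handed back to the model, if any
    held: bool                   # True if the call waits for a human


def redact(arguments, secret_keys):
    """Hypothetical redaction: mask values of keys flagged as sensitive."""
    return {k: ("[REDACTED]" if k in secret_keys else v) for k, v in arguments.items()}


def apply_verdict(verdict, arguments, secret_keys=frozenset(), reason=""):
    """Map a per-call verdict to what the gateway does with the call."""
    if verdict in ("allow", "audit"):
        # Both forward unchanged; only audit logs the call.
        return Outcome(True, arguments, verdict == "audit", None, False)
    if verdict == "sanitize":
        # Only the outgoing arguments are redacted, never the server's reply.
        return Outcome(True, redact(arguments, secret_keys), False, None, False)
    if verdict == "deny":
        return Outcome(False, None, False, f"firewall deny: {reason}", False)
    if verdict == "pending_approval":
        return Outcome(False, None, False, None, True)
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Because the function takes the verdict and arguments of one call and nothing else, the same tool can be forwarded on one call and denied on the next.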
2. Skill risk-band quarantine
The second half of rug-pull defense covers the supply chain: the skills, plugins, and bring-your-own MCP servers an agent installs. Each one is registered as a workspace-scoped record, scanned by a deterministic risk engine, and assigned a risk band (low / medium / high /
critical) plus an enforcement mode:
| Mode | Effect at runtime |
|---|---|
| allow | Rule verdicts decide; the skill adds nothing. |
| quarantine | Anything short of a deny is escalated to pending_approval; tools run only after a human approves. |
| block | The skill’s tools are denied outright. |
An auto-detected skill is marked auto_detected and quarantined until reviewed:
even if it scanned clean, it doesn’t run on its own authority. And a skill’s
mode only ever ratchets tighter on re-scan: a block or quarantine you
set is never silently relaxed when a manifest is re-presented.
See Firewall: Skills for the full scanner,
scoring weights, and trust signals.
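A minimal sketch of how a skill's mode could ride on top of the rule verdict, and how a re-scan could only tighten the mode. The function names and the strictness ordering table are assumptions made for illustration; only the three modes and their effects come from the table above.

```python
# Hypothetical ordering: a higher number means a stricter mode.
MODE_STRICTNESS = {"allow": 0, "quarantine": 1, "block": 2}


def effective_verdict(rule_verdict, skill_mode):
    """Combine the per-call rule verdict with the owning skill's mode."""
    if skill_mode == "block":
        return "deny"                      # the skill's tools are denied outright
    if skill_mode == "quarantine":
        # Anything short of a deny is escalated to a human hold.
        return "deny" if rule_verdict == "deny" else "pending_approval"
    if skill_mode == "allow":
        return rule_verdict                # the skill adds nothing
    raise ValueError(f"unknown skill mode: {skill_mode!r}")


def ratchet(current_mode, rescanned_mode):
    """On re-scan, keep whichever mode is stricter; never relax silently."""
    return max(current_mode, rescanned_mode, key=MODE_STRICTNESS.__getitem__)
```

For example, `ratchet("block", "allow")` keeps `block`, so re-presenting a cleaned-up manifest cannot undo a reviewer's decision in this model.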
3. Tool-schema drift detection
The classic rug pull is a registered server that changes what it advertises: it adds a tool, alters a tool’s input schema, or swaps a description. OrcaRouter baselines each registered server’s advertised tool set on a successful probe and watches it for drift.
Baseline on first probe
The first successful probe records a canonical hash of the server’s
tools (trust-on-first-use under a discovery posture; under an
enforcing posture an unbaselined server is held pending until an
admin approves its initial tool set).
Drift fails closed
On a later probe, if the canonical tool set no longer matches the
approved baseline, the server is marked changed and stops being
served: the gateway won’t dispatch its tools until you decide.
Approve or quarantine
Re-approve to re-baseline to the new schema, or quarantine the server.
A quarantined server is also disabled and only an explicit approve
restores service; a plain edit can’t re-enable it.
Audited
The first detection of drift from an approved baseline writes a
workspace audit entry, so the change is on the record.
Each server is in one of five states: unknown (never baselined),
verified (matches baseline), changed (drifted, held), pending
(unbaselined under enforcing), or quarantined. This layer catches the
rug pull that moves the schema;
per-call evaluation (§1) catches the one that keeps an identical signature
and only changes behavior.
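The baseline-and-drift cycle can be sketched as follows. This is an illustration under stated assumptions, not the gateway's actual algorithm: the canonicalization (sorted tools, sorted keys, SHA-256), the state dictionary layout, and every function name are hypothetical, and the sketch assumes a changed server stays held until an explicit approve.

```python
import hashlib
import json


def canonical_tool_hash(tools):
    """Hash a tool list so that tool order and key order do not matter."""
    ordered = sorted(tools, key=lambda t: t["name"])
    payload = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def new_server():
    return {"status": "unknown", "baseline": None}


def probe(state, tools, enforcing):
    """Return the server state after a successful probe of its tool list."""
    if state["status"] in ("quarantined", "changed"):
        return dict(state)                       # held until an admin decides
    digest = canonical_tool_hash(tools)
    if state["baseline"] is None:
        if enforcing:
            return {"status": "pending", "baseline": None}
        return {"status": "verified", "baseline": digest}   # trust on first use
    if digest == state["baseline"]:
        return {"status": "verified", "baseline": digest}
    return {"status": "changed", "baseline": state["baseline"]}  # fail closed


def approve(state, tools):
    """Explicit admin approval re-baselines to the current tool set."""
    return {"status": "verified", "baseline": canonical_tool_hash(tools)}


def is_served(state):
    return state["status"] == "verified"
```

Hashing a canonical form means a cosmetic reordering of the advertised list does not trip the check, while an added tool, an altered input schema, or a swapped description does.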
4. One concrete example
Suppose a community MCP server notes advertises a harmless
notes.search tool. You list it, review it, and it works. A week later the
server is compromised and notes.search starts attaching an exfiltration
argument that POSTs your context to an attacker host.
A connect-time-only gateway would forward it, because the tool name and schema
look unchanged. OrcaRouter evaluates the call against your rules on the mcp
surface, which match on tool_name_glob and args_match.
The args_match operators are eq, contains, regex, in, cidr_match,
gt, and lt; cidr_match tests an IP-valued argument against a CIDR. To
bound where a tool may reach by host/CIDR, use the
egress destination list instead of an
argument clause.
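To make the operators concrete, here is a rough sketch of how a rule with a tool-name glob and argument clauses might be matched against one call. The rule dictionary layout (`arg`, `op`, `value` keys), the `callback_url` argument, and both function names are hypothetical; only the operator names, `tool_name_glob`, and `args_match` come from the text above, and the real matching semantics may differ.

```python
import fnmatch
import ipaddress
import re

_MISSING = object()  # sentinel for an argument the call did not supply


def clause_matches(op, actual, expected):
    """Evaluate one args_match clause against one argument value."""
    if actual is _MISSING:
        return False
    if op == "eq":
        return actual == expected
    if op == "contains":
        return isinstance(actual, str) and expected in actual
    if op == "regex":
        return isinstance(actual, str) and re.search(expected, actual) is not None
    if op == "in":
        return actual in expected
    if op == "cidr_match":
        try:
            return ipaddress.ip_address(actual) in ipaddress.ip_network(expected, strict=False)
        except ValueError:
            return False                     # not an IP-valued argument
    if op in ("gt", "lt"):
        if isinstance(actual, bool) or not isinstance(actual, (int, float)):
            return False
        return actual > expected if op == "gt" else actual < expected
    raise ValueError(f"unknown operator: {op!r}")


def rule_matches(rule, tool_name, arguments):
    """A rule matches when the tool name fits the glob and every clause holds."""
    if not fnmatch.fnmatchcase(tool_name, rule["tool_name_glob"]):
        return False
    return all(
        clause_matches(c["op"], arguments.get(c["arg"], _MISSING), c["value"])
        for c in rule["args_match"]
    )


# Hypothetical deny rule for the notes example: any URL-valued callback argument.
rule = {
    "tool_name_glob": "notes.*",
    "args_match": [{"arg": "callback_url", "op": "regex", "value": r"^https?://"}],
    "verdict": "deny",
}
```

With this rule, an ordinary `notes.search` call does not match, while the compromised variant that attaches a URL argument does, even though the tool name and schema are unchanged.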
At dispatch the engine returns deny, and instead of forwarding the call
the gateway hands the agent an MCP tool-result error (a normal result
flagged as an error, not a transport failure) so the model can adapt.
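A sketch of what such a result could look like on the wire, following the MCP convention that a failed tool call is a normal JSON-RPC result whose `isError` flag is set. The function name and the reason text are hypothetical, and the exact payload the gateway emits is an assumption.

```python
import json


def deny_tool_result(request_id, reason):
    """Build a JSON-RPC response carrying an MCP tool result flagged as an error."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        # A "result", not a JSON-RPC "error": the transport succeeded,
        # only the tool call itself was refused.
        "result": {
            "content": [{"type": "text", "text": f"firewall deny: {reason}"}],
            "isError": True,
        },
    }


response = deny_tool_result(7, "argument matched a deny rule")
print(json.dumps(response, indent=2))
```

Because the refusal arrives as tool output, the model sees the text and can pick another approach instead of the agent loop failing on a transport exception.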
5. How it fits together
Per-call evaluation vs. skill quarantine — which catches what?
Per-call evaluation catches a trusted tool turning malicious —
same name, new behavior in the arguments or destination. Skill
quarantine catches a new or unreviewed capability appearing at all
— an auto-detected install, a re-scanned manifest that newly degrades.
A rug pull can take either shape, so both run: the skill’s mode rides
on top of the per-call rule verdict.
Does this baseline the server's schema?
Yes — see §3. Each registered server’s advertised tool set is
baselined on first probe and re-checked for drift; a drifted server
fails closed until you re-approve or quarantine it. That’s
complementary to per-call evaluation, which also catches a tool that
keeps an identical signature and only changes its behavior.
Where do held calls go?
A pending_approval verdict holds the call for a human to resolve in
the console (Developer+) or via an HMAC approval callback. See
enforcement modes for how
holds and approvals are surfaced to an agent.
6. Configuring it
Every step below is a console / management action authenticated with your session or access token, not the sk-orca-… relay key. Only /v1/*
relay traffic uses the relay key.
Register your MCP servers behind the gateway
Connect each server so its tools are
advertised under one audited endpoint. Registration is Developer+.
Set a default verdict and rules on the mcp surface
Author rules with
tool_name_glob and args_match so risky calls
resolve to deny, sanitize, or pending_approval. See the
Firewall rule reference.
Review quarantined skills
Anything auto-detected sits in quarantine until a reviewer
(Developer+) approves it. Read the band and findings first.
Roll out in shadow, then enforce
Use enforcement modes to run
new rules in shadow, watch the audit events,
and flip to enforcing once the verdicts look right.
Reads (settings, policies, discovered tools, anomalies) are open to any
Member; every write is Developer+. Reading a firewall-gateway
key’s plaintext is Developer+.
Related
Firewall: MCP Servers
The full MCP gateway reference — registration, probing, dispatch.
Firewall: Skills
Scanner passes, risk scoring, and the quarantine derivation.
MCP tool poisoning
The threat model rug-pull defense exists to counter.
Egress limits
Author host/CIDR deny rules to bound where tools may reach.
Trust checklist
The end-to-end checklist for trusting an MCP server.
Guardrails vs. Firewall
When content policy applies and when the firewall does.
