MCP tool poisoning & rug-pulls

A third-party MCP server or installed skill is a supply-chain dependency. Two failure modes stand out:

Poisoning: the server was malicious from day one. Its manifest looked benign; the dangerous behavior was in the tool implementation, not the declared scopes.
Rug-pull: you trusted it, then it changed. The server's operator quietly added a new tool, or a community registry entry was hijacked and updated to call home.

Both threats share a root cause: after you said “I trust this server,” your agents keep calling its tools — even new or modified ones — with no further review.

1. How MCP tool poisoning reaches your agents

Every tools/call your agent issues travels through the MCP server’s declared tool set. A poisoned or rug-pulled server exploits that trust in a few ways:

Vector	What happens
Undeclared tool	A new tool appears in `tools/list` that the server’s manifest never declared. Your agent finds it and calls it.
Hijacked registry entry	A community registry listing is taken over; the endpoint now points to an attacker-controlled server.
Credential harvesting	The server’s tool implementation sends collected inputs to an external host.
Prompt-injection via tool result	A tool returns attacker-controlled text that redirects the agent’s next action.

2. OrcaRouter’s defenses

2.1 Every `tools/call` is firewall-evaluated before it runs

MCP servers connect to your agents through the Firewall MCP gateway at /api/v1/firewall/mcp. The gateway does not forward a tool call until the firewall engine has evaluated it against your policy, so your allow-list is the source of truth, not the server's tool manifest. If a rug-pull adds shell.exec and your policy has no rule permitting it, the verdict is deny and the call never leaves the gateway. The model receives a tool error (firewall deny: …) and can react; the attacker-added tool is dead on arrival.

Verdicts the engine can return:

Verdict	Effect
`allow` / `audit`	Call forwarded; `audit` additionally logs arguments.
`sanitize`	Arguments rewritten before forwarding.
`deny`	Call blocked; model receives a tool error.
`pending_approval`	Call held; a human must approve before it proceeds.
`cap_cost`	Cost cap enforced; call blocked if it would exceed it.
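
The default-deny behavior described above can be sketched in a few lines. This is an illustration only, not OrcaRouter's engine: the `Rule` class, the `evaluate` function, and first-match rule ordering are assumptions made for the example; only the verdict names and the `tool_name_glob` field come from this page.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape: a tool-name glob mapped to a verdict."""
    tool_name_glob: str
    verdict: str  # "allow", "audit", "sanitize", "deny", "pending_approval", "cap_cost"


def evaluate(tool_name: str, rules: list) -> str:
    """Return the verdict of the first matching rule (ordering is an assumption).

    A tool that no rule mentions is denied, so a tool added by a
    rug-pull is blocked without any policy change.
    """
    for rule in rules:
        if fnmatchcase(tool_name, rule.tool_name_glob):
            return rule.verdict
    return "deny"


# Example policy: the tool names are made up for illustration.
policy = [
    Rule("github.create_issue", "pending_approval"),
    Rule("github.*", "audit"),
]
```

With this policy, `evaluate("github.list_issues", policy)` yields `audit`, while an attacker-added `shell.exec` matches nothing and falls through to `deny`.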

2.2 The server’s tool schema is baselined — drift fails closed

The most direct rug-pull defense runs before any call. On first contact the gateway records a canonical hash of the server’s advertised tool set — every tool’s name, description, and input schema (a trust-on-first-use baseline). On every later probe it re-hashes the live tools and compares:

unchanged → verified; tools are served normally.
drifted (a tool added, removed, or a definition changed) → the server’s schema status flips to changed and the gateway fails closed: its tools stop being served until an admin re-baselines (approves the new schema) or quarantines the server.

So a server that turns malicious mid-session — silently re-defining an approved tool or adding a new one — is caught at the schema layer, before call-time evaluation even applies. See Schema-drift states.
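
A trust-on-first-use baseline of this kind can be sketched as follows. The exact canonicalization the gateway uses is not documented here, so the choices below (SHA-256 over sorted, compact JSON) and the function names `canonical_hash` and `schema_status` are assumptions. The `name`, `description`, and `inputSchema` keys follow the tool objects returned by MCP `tools/list`.

```python
import hashlib
import json
from typing import Optional


def canonical_hash(tools: list) -> str:
    """Hash every tool's name, description and input schema.

    Tools are sorted by name and keys are sorted, so the hash does not
    depend on the order in which the server lists its tools.
    """
    normalized = sorted(
        (
            {
                "name": tool["name"],
                "description": tool.get("description", ""),
                "inputSchema": tool.get("inputSchema", {}),
            }
            for tool in tools
        ),
        key=lambda tool: tool["name"],
    )
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def schema_status(baseline: Optional[str], live_tools: list) -> tuple:
    """Return (status, baseline).

    First contact records the baseline; later probes compare against it.
    A mismatch yields "changed" and leaves the baseline untouched, so the
    server stays flagged until someone explicitly re-baselines.
    """
    live = canonical_hash(live_tools)
    if baseline is None:
        return "verified", live
    if live == baseline:
        return "verified", baseline
    return "changed", baseline
```

Because the description is part of the hash, a server that keeps a tool's name and schema but rewrites its description (a common prompt-injection vehicle) also registers as drift.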

2.3 Auto-detected capabilities are quarantined until reviewed

When an agent self-installs a capability — or a rug-pull adds new tools that weren’t present when you registered the server — the Firewall auto-detects the new capability off the hot path, synthesizes a manifest, scans it, and assigns a risk band and enforcement mode. Crucially, auto-detected capabilities are always quarantined regardless of scan result: they are held in pending_approval until a human reviews them. This is how rug-pulls are contained. An operator can’t quietly add a new tool and have your agents start using it — those calls are held until you inspect and approve the new capability.

2.4 Skill scanning assigns a risk band and enforcement mode

Every installable capability — whether you registered it or the Firewall auto-detected it — is passed through the skill scanner. The scanner runs deterministic passes over the manifest and declared scopes:

prompt_injection — manifest text that attempts to hijack instructions.
tool_creep — tools the manifest uses but never declared.
network_egress — HTTP(S) hosts outside the approved network scopes.
fs_write_unsafe — write-mode filesystem access outside /tmp.

Findings roll up to a risk band (low / medium / high / critical) and an enforcement mode:

Mode	What happens at runtime
`allow`	Skill imposes nothing of its own; your policy rules decide.
`quarantine`	Any non-deny verdict escalates to `pending_approval`. A human must approve each tool call.
`block`	Force `deny` on all of this skill’s tools, regardless of policy rules.

A high-band skill is quarantined automatically; critical is blocked. A single error finding (e.g., tool_creep for an undeclared shell.exec) is enough to block a skill even when its numeric score looks low. The mode only ever ratchets tighter — approving a skill never relaxes a block set by a fresh scan.
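
The roll-up from scan result to enforcement mode, and the tighten-only ratchet, might look like the sketch below. The function names are hypothetical, and mapping the low and medium bands to `allow` is an assumption; the page only states the outcomes for high, critical, error findings, and auto-detected capabilities.

```python
# Strictness order used by the ratchet: higher is tighter.
STRICTNESS = {"allow": 0, "quarantine": 1, "block": 2}


def mode_for(band: str, has_error_finding: bool, auto_detected: bool = False) -> str:
    """Derive an enforcement mode from a scan result.

    - a critical band or any error-severity finding blocks the skill
    - a high band, or an auto-detected capability, is quarantined
    - low and medium bands map to "allow" (assumption, see lead-in)
    """
    if has_error_finding or band == "critical":
        return "block"
    if band == "high" or auto_detected:
        return "quarantine"
    return "allow"


def ratchet(current_mode: str, scanned_mode: str) -> str:
    """Keep whichever mode is stricter; a mode never relaxes."""
    return max(current_mode, scanned_mode, key=STRICTNESS.__getitem__)
```

For example, a low-band skill with a single `tool_creep` error finding is still blocked, and `ratchet("block", "allow")` stays at `block`.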

2.5 Credentials are stored encrypted

Server auth secrets are encrypted at rest with a workspace secrets key and injected by the gateway at dispatch time. They never reach the model, the agent, or the call arguments. A compromised server can’t exfiltrate your API keys by reading its own auth_json.

Third-party MCP server vetting checklist

Before registering an external MCP server:

Verify the publisher’s identity — who controls the endpoint URL?
Read the source or changelog; look for new tools added after the initial release.
Check whether the skill scan returns any tool_creep or prompt_injection findings on registration.
Scope a firewall rule with tool_name_glob: <server>.* to audit or pending_approval until you have a call history.
Review the network_egress findings: does the manifest claim it only needs one domain but the tool descriptions mention others?
Re-probe the server after any upstream version bump (POST /api/workspace/firewall/mcp_servers/:id/probe) to surface new tools.

3. What to do after a suspected rug-pull

Disable the server immediately — a disabled server is dropped from the runtime registry and its credentials are never decrypted. Use PUT /api/workspace/firewall/mcp_servers with "enabled": false.
Re-probe to surface changes — POST /api/workspace/firewall/mcp_servers/:id/probe runs tools/list and returns any new tools that appeared since your last probe.
Rescan the skill record — POST /api/workspace/firewall/skills/:id/rescan re-runs the scanner against the updated manifest. If the verdict degrades to flagged or blocked, the Firewall emits an event in your feed.
Review the pending_approval queue: any calls held since the rug-pull are in the queue. Inspect and deny them rather than bulk-approving.
Audit the call log — check the Firewall event trail for calls that went through before you detected the change.
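
The first three steps can be scripted. The sketch below only builds the HTTP requests for the endpoints named above; it does not send them. The base URL, the bearer-token header, and the `id` field in the disable payload are assumptions, since this page documents only the paths and the `"enabled": false` field.

```python
import json
from urllib.request import Request

BASE_URL = "https://orcarouter.example"  # placeholder host (assumption)


def _request(method: str, path: str, token: str, body=None) -> Request:
    """Build, but do not send, an authenticated JSON request."""
    headers = {"Authorization": f"Bearer {token}"}  # auth scheme is an assumption
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def disable_server(server_id: str, token: str) -> Request:
    # How the server is identified in the body is an assumption ("id").
    body = {"id": server_id, "enabled": False}
    return _request("PUT", "/api/workspace/firewall/mcp_servers", token, body)


def probe_server(server_id: str, token: str) -> Request:
    return _request("POST", f"/api/workspace/firewall/mcp_servers/{server_id}/probe", token)


def rescan_skill(skill_id: str, token: str) -> Request:
    return _request("POST", f"/api/workspace/firewall/skills/{skill_id}/rescan", token)
```

Each returned object can be passed to `urllib.request.urlopen` once the placeholders are replaced with real values. Disabling first matters: it stops credential decryption before the probe and rescan tell you what changed.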

4. Pairing skill scanning with firewall rules

Skill scanning and firewall rules complement each other, and their results compose:

A rule with tool_name_glob: community.* set to pending_approval ensures you review every call from a community-sourced server, regardless of risk band.
A quarantined skill overrides an allow rule — even if your policy permits http.fetch broadly, a quarantined skill that owns it still holds the call.
Use skill_name_glob in a rule to scope tighter policies to untrusted servers without affecting your first-party integrations.
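
To make the composition concrete, here is a small sketch of how a rule verdict and a skill's enforcement mode could combine. The `ScopedRule` class, the `decide` function, first-match ordering, and the skill and tool names are all hypothetical; only `tool_name_glob`, `skill_name_glob`, the verdict names, and the mode semantics come from this page.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ScopedRule:
    """Hypothetical rule: both globs must match for the rule to apply."""
    verdict: str
    tool_name_glob: str = "*"
    skill_name_glob: str = "*"

    def matches(self, skill: str, tool: str) -> bool:
        return fnmatchcase(skill, self.skill_name_glob) and fnmatchcase(
            tool, self.tool_name_glob
        )


def decide(skill: str, tool: str, skill_mode: str, rules: list) -> str:
    """Combine the policy verdict with the owning skill's enforcement mode."""
    # Policy first: first matching rule wins, no match means deny.
    verdict = next((r.verdict for r in rules if r.matches(skill, tool)), "deny")
    # Then the skill's mode can only tighten the result.
    if skill_mode == "block":
        return "deny"
    if skill_mode == "quarantine" and verdict != "deny":
        return "pending_approval"
    return verdict


rules = [
    ScopedRule("pending_approval", skill_name_glob="community-*"),
    ScopedRule("allow", tool_name_glob="http.fetch"),
]
```

Here a community skill calling `http.fetch` is held for approval by the first rule, a first-party skill in `allow` mode passes, and the same first-party skill is held as soon as it is quarantined.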

See Firewall: MCP Servers for the full gateway model and Firewall: Skills for the scanner and enforcement-mode reference.

5. Related threats

Dangerous tool calls: rules for blocking destructive or irreversible tool actions regardless of source.
Data exfiltration: egress rules that restrict where tool calls may send data.
Threat model: the full attack surface OrcaRouter is designed to defend.

