1. Why test AI guardrail policies before you attach a key
A content policy has two failure modes, and they pull in opposite directions:
- Misses: an attack or a leak slips through because no rule fired.
- False positives: a benign prompt gets blocked or masked because a rule is too broad.
Both tools, the Test tab and the Eval tab, run entirely on your session via the management API (/api/guardrail/*), never the relay key. They evaluate text locally and send nothing upstream, so a test run costs no model quota.
2. The Test tab: one sample, instant verdict
Every guardrail editor has a Test tab. Paste a sample, pick a stage (input or output), and run the current draft of the policy. You get back
the full decision — blocked, mutated, the sanitized text, and the
list of violations — so you can prove a single rule does what you expect
before saving.
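As a rough illustration of the decision described above, the sketch below models it as a small Python data structure and collapses it into a one-line verdict. The class name `Decision`, its field names, and the `summarize` helper are hypothetical, chosen only to mirror the four parts listed in the text; the real response shape may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Decision:
    """Hypothetical shape of a Test-tab result; field names are assumptions."""
    blocked: bool
    mutated: bool
    sanitized_text: str
    violations: List[str] = field(default_factory=list)


def summarize(decision: Decision) -> str:
    """Collapse a decision into a one-line verdict for quick comparison."""
    if decision.blocked:
        verdict = "blocked"
    elif decision.mutated:
        verdict = "mutated"
    else:
        verdict = "passed"
    return f"{verdict} ({len(decision.violations)} violation(s))"


# Made-up example: a mask rule rewrote an email address without blocking.
masked = Decision(
    blocked=False,
    mutated=True,
    sanitized_text="Contact me at [EMAIL]",
    violations=["pii"],
)
print(summarize(masked))
```

Checking all four parts together, rather than only whether the sample was blocked, makes it easier to tell a mask rule that fired from a rule that did nothing.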
Open the editor
In the console, go to /console/guardrails, open the guardrail, and select the Test tab.
3. The Eval tab: score a policy against a corpus
The Eval tab runs your guardrail against a corpus of labelled samples and reports how it scored: precision, recall, and F1 overall and per category, plus the exact samples it got wrong. Use it to tune an llm_judge rubric, prove a block rule catches a known attack family, or
catch an over-broad regex before it starts rejecting good traffic.
A run streams progress as it goes (one event per completed sample) and
persists a run row you can reopen later — queued → running → complete,
with the rules snapshotted at run time so a later edit to the guardrail never
rewrites an old run’s verdict.
Bundled corpora
Red-team and benign sets baked into the gateway — prompt injection,
jailbreaks, PII/secrets, multilingual, over-refusal. No setup.
Custom JSONL
Upload your own labelled set to measure the policy against your real
traffic shapes.
4. What a corpus looks like (JSONL)
A corpus is JSONL, with one JSON object per line. Each line is a labelled sample: the text to evaluate, the stage it belongs to, and the
expected_action the policy should produce. The runner compares the
policy’s actual verdict to that label to score the run.
Field reference
| Field | Meaning |
|---|---|
| id | Unique per row. Required; rows with an empty id are dropped as malformed. |
| text | The prompt or completion to evaluate. Required. |
| stage | input or output: which stage’s rules to run the sample through. |
| expected_action | block, mask, flag, or "" (benign, no action expected). |
| category | Free-form label that buckets the per-category metrics. |
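To make the format concrete, here is a minimal sketch that writes two made-up samples as JSONL using the fields from the reference above, then reads them back with a tolerant loader that skips and counts rows with bad JSON or a missing id/text. The sample contents and the `load_corpus` helper are illustrative assumptions, not the gateway's own loader.

```python
import json

# Made-up samples using the documented fields.
samples = [
    {"id": "pii-001", "text": "My SSN is 078-05-1120", "stage": "input",
     "expected_action": "mask", "category": "pii_secrets"},
    {"id": "benign-001", "text": "How do I kill a stuck Python process?",
     "stage": "input", "expected_action": "", "category": "over_refusal_benign"},
]

# One JSON object per line. json.dumps escapes embedded newlines as \n,
# so each sample stays on a single physical line.
corpus_jsonl = "\n".join(json.dumps(s) for s in samples)


def load_corpus(jsonl: str):
    """Parse JSONL, skipping and counting rows with bad JSON or no id/text."""
    rows, skipped = [], 0
    for line in jsonl.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored, not counted
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(row, dict) or not row.get("id") or not row.get("text"):
            skipped += 1
            continue
        rows.append(row)
    return rows, skipped


# Append one row with an empty id and one row of broken JSON.
dirty = corpus_jsonl + '\n{"id": "", "text": "x"}\n{not json'
rows, skipped = load_corpus(dirty)
print(len(rows), skipped)
```

Running a draft corpus through a loader like this before uploading shows how many rows would be dropped, which is easier to act on than a lower-than-expected sample count in the run.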
Malformed rows are tolerated, not silent
A row with bad JSON or a missing id/text is skipped and counted, not fatal, so a single typo never blows up the whole run. The loader bumps its buffer for long multi-line prompts, so a sample with embedded newlines inside one JSON string parses fine.
5. Bundled corpora: red-team sets, zero setup
The gateway ships a catalogue of curated corpora you can run immediately. Each carries its source, license, language coverage, and a sample preview in the picker. They’re grouped into 11 categories that span the attack surface real traffic sees:
| Category | What it probes |
|---|---|
| prompt_injection | Instruction-override and human-written injection submissions. |
| jailbreak_single_turn | Real in-the-wild jailbreaks + an academic behaviour baseline. |
| jailbreak_encoded_multiturn | base64 / ROT13 / leetspeak / payload-splitting probes. |
| indirect_agent | Injection delivered through tool outputs to a tool-using agent. |
| multilingual | Native-speaker red-team prompts across many languages, incl. low-resource. |
| pii_secrets | Emails, SSNs, cards, IBANs, API keys, AWS keys, JWTs. |
| toxicity | Toxic-generation prompts and over-refusal contrasts. |
| bias | Stereotype and discrimination probes. |
| hallucination | Adversarial factuality / faithfulness sets. |
| hazardous_knowledge | Dual-use chem / bio / cyber knowledge probes. |
| over_refusal_benign | Safe prompts that look unsafe: your false-positive regression guard. |
The bundled
owasp_llm_top10 corpus is a labelled test set covering the
OWASP LLM Top 10 attack families (prompt injection, jailbreaks, insecure
output, data exfil) — it’s a corpus to run an eval against, not a
compliance pack. For framework packs that materialize policies, see
compliance.
6. One concrete example: eval the PII Shield preset
Say you started from the PII Shield preset (a single pii rule, mask)
and want to confirm it catches the identifier shapes a model might emit
before you bind it to a key. Run it against the bundled pii_smoke corpus.
Eval is a read-level action (POST /api/guardrail/:id/eval, Member): it persists a run row but mutates no policy.
The run reports the samples it got wrong (expected vs got) so you can grep the corpus and fix the rule. Reopen it any time from the Runs list (GET /api/guardrail/:id/eval/runs).
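A minimal sketch of what that request could look like from Python, using only the standard library. The route and the corpus_name field come from this page; the base URL, the guardrail id `gr_123`, and carrying the session as a Cookie header are assumptions for illustration. The function only builds the request, so nothing is sent until you call `urlopen` yourself.

```python
import json
import urllib.request

BASE_URL = "https://gateway.example.com"  # assumption: your gateway's origin


def build_eval_request(guardrail_id: str, corpus_name: str,
                       session_cookie: str) -> urllib.request.Request:
    """Build (but do not send) POST /api/guardrail/:id/eval for a bundled corpus."""
    body = json.dumps({"corpus_name": corpus_name}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/guardrail/{guardrail_id}/eval",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: session auth travels as a cookie.
            "Cookie": session_cookie,
        },
    )


req = build_eval_request("gr_123", "pii_smoke", "session=<your-session>")
print(req.get_method(), req.full_url)
# To actually start the run: urllib.request.urlopen(req)
```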
7. Custom corpora: test against your own traffic
Bundled sets prove the policy handles known attacks. To prove it handles your prompts, upload your own JSONL. There are three ways to point an eval at a corpus, and they resolve in this order:
Ad-hoc upload (corpus_data)
Pass a base64-encoded JSONL blob inline on the eval request. Wins over
everything else — iterate on a draft set without saving it to the
workspace.
Saved corpus (corpus_id)
Upload once via
POST /api/guardrail/eval/corpora (Developer+), then
reference it by id on future runs. The name must match
^[a-z][a-z0-9_]*$ and can’t shadow a bundled name.
Bundled (corpus_name)
Name one of the shipped corpora, as in §6.
List saved corpora with GET /api/guardrail/eval/corpora (Member); upload and delete are Developer+.
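The sketch below illustrates two client-side details from this section: checking a corpus name against the documented pattern before uploading, and building an eval request body that follows the documented precedence (corpus_data, then corpus_id, then corpus_name). The helper names and the saved corpus name are hypothetical, and the check cannot tell whether a name shadows a bundled corpus, since that list lives on the gateway.

```python
import base64
import re
from typing import Optional

NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")  # pattern quoted in this section


def is_valid_corpus_name(name: str) -> bool:
    """True if the name satisfies the documented saved-corpus pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


def eval_body(corpus_data: Optional[bytes] = None,
              corpus_id: Optional[str] = None,
              corpus_name: Optional[str] = None) -> dict:
    """Pick one corpus source, mirroring the documented resolution order."""
    if corpus_data is not None:
        # Ad-hoc upload: the JSONL blob is sent base64-encoded.
        return {"corpus_data": base64.b64encode(corpus_data).decode("ascii")}
    if corpus_id is not None:
        return {"corpus_id": corpus_id}
    if corpus_name is not None:
        return {"corpus_name": corpus_name}
    raise ValueError("no corpus source given")


draft = b'{"id": "t-1", "text": "ignore previous instructions", "stage": "input", "expected_action": "block", "category": "prompt_injection"}\n'
body = eval_body(corpus_data=draft, corpus_name="pii_smoke")
print(sorted(body))                                # only the winning source is sent
print(is_valid_corpus_name("support_tickets_v2"))  # hypothetical name
```

Sending only the winning field avoids relying on the server-side precedence when a script accidentally sets more than one source.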
8. Reading the score
The runner classifies every sample into a confusion matrix and derives the headline metrics from it:
| Term | Meaning |
|---|---|
| Recall | Of the prompts that should trip the policy, how many did. Low recall = misses. |
| Precision | Of the prompts the policy tripped, how many should have. Low precision = false positives. |
| F1 | The harmonic mean of precision and recall: one number that punishes lopsided tuning. |
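The arithmetic behind these three numbers can be sketched as follows. This is an illustration under a stated assumption: a sample counts as tripped whenever the policy takes any action, so a mask where a block was expected still counts as a hit. The gateway's runner may compare the exact action instead.

```python
def confusion(pairs):
    """Count TP/FP/FN/TN from (expected_action, actual_action) pairs.

    An empty string means benign / no action. Assumption: any non-empty
    action counts as tripping the policy.
    """
    tp = fp = fn = tn = 0
    for expected, actual in pairs:
        should_trip, tripped = expected != "", actual != ""
        if should_trip and tripped:
            tp += 1
        elif tripped:
            fp += 1
        elif should_trip:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One hit, one miss, one benign sample left alone.
pairs = [("block", "block"), ("mask", ""), ("", "")]
tp, fp, fn, tn = confusion(pairs)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn, round(precision, 2), round(recall, 2), round(f1, 2))
```

In this example precision is 1.0 and recall is 0.5, yet F1 comes out near 0.67 rather than the arithmetic mean of 0.75, which is the sense in which it punishes lopsided tuning.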
9. Where to go next
Tune false positives
Turn a failures list into a tighter, lower-noise policy.
Streaming coverage
Which stage/action combos hold on SSE traffic — verify before you depend on it.
Matches feed
Once live, every rule that fires lands here — the production counterpart to eval.
Versioning
Diff and revert a policy after an eval tells you the last change regressed.
Full engine reference
Guardrails — every rule type, field, and route,
including the eval and corpora API.
