Cost-control guardrails

A runaway prompt is a runaway bill. An agent that pastes a 400 KB transcript into context, a retry loop that keeps re-sending the same swollen request, a model that streams a 50,000-character wall of text: each one bills tokens you never meant to spend. The cost preset category puts a hard ceiling in front of those requests so the gateway stops them before they reach the upstream model and the meter. This is a focused landing page for the cost-control use case. For the full guardrail engine (every rule type, field, and route), see the Guardrails reference.

1. The LLM cost guardrail use case

The lever is one built-in rule type: max_chars. It caps the character count of the text at a stage. No model call, no network hop — a deterministic length check that runs on the request before metering, or on the response after the model returns. Two shapes, picked by the rule’s action:

Block oversized requests

On an input-stage max_chars rule with action block, any prompt over the limit is rejected with HTTP 400 guardrail_blocked. A blocked request costs no quota, because the block fires before usage is metered.

Clamp oversized responses

On a max_chars rule with action mask, the text is truncated to the limit instead of rejected — the caller still gets a usable answer, just bounded. Useful on the response stage to cap egress.
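The two shapes can be sketched in a few lines. This is an illustrative model of the behavior described above, not the gateway's implementation; the Verdict type and the apply_max_chars function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool  # False means the text was rejected outright
    text: str      # the text passed along (possibly truncated)


def apply_max_chars(text: str, max_chars: int, action: str) -> Verdict:
    """Hypothetical model of a max_chars rule: block rejects, mask truncates."""
    # len() on a Python str counts code points, not bytes.
    if len(text) <= max_chars:
        return Verdict(True, text)
    if action == "block":
        return Verdict(False, "")
    if action == "mask":
        # Slicing a str also works on code points, so no character is split.
        return Verdict(True, text[:max_chars])
    raise ValueError(f"unsupported action: {action!r}")


print(apply_max_chars("hello world", 5, "block").allowed)  # False
print(apply_max_chars("hello world", 5, "mask").text)      # hello
```

Text at or under the limit passes through unchanged under either action.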

The cap counts characters, not tokens, and the count is rune-aware: 日本語 is three characters, not nine bytes. The shipped token-oriented preset translates a token budget into a character ceiling at roughly 4 characters per token (200,000 characters for about 50K tokens); tighten the rule’s max_chars field directly for a stricter budget.
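To make the character-versus-byte distinction and the budget translation concrete, here is a small sketch. The 4 characters per token ratio is an assumption inferred from the Token Cost Cap preset (200,000 chars for about 50K tokens); real tokenizers vary by model and language, and the function names are hypothetical.

```python
# Assumed ratio, inferred from the preset: 200,000 chars ~ 50K tokens.
CHARS_PER_TOKEN = 4


def rune_count(text: str) -> int:
    """Number of Unicode code points, which is what a rune-aware cap counts."""
    return len(text)


def byte_count(text: str) -> int:
    """Number of UTF-8 bytes, shown only for contrast."""
    return len(text.encode("utf-8"))


def chars_for_token_budget(tokens: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Translate a token budget into an approximate max_chars value."""
    return tokens * chars_per_token


print(rune_count("日本語"), byte_count("日本語"))  # 3 9
print(chars_for_token_budget(50_000))              # 200000
```

For a stricter budget, pass a smaller token count or a lower ratio and use the result as the rule's max_chars value.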

2. The shipped cost presets

Open the New guardrail split-button in the console and pick from the cost template category. Three presets seed a single max_chars rule each:

Preset	Stage · action	Cap
Prompt-Size Cap	input · block	50,000 chars
Token Cost Cap (prompt)	input · block	200,000 chars (~50K tokens)
Response Size Cap	output · block	32,000 chars

Each preset is a seed, not a lock — apply it, then edit the max_chars value, stage, or action to fit your budget. Authoring and editing guardrails requires Developer+ in the workspace.

The Response Size Cap is an output-stage cap. To clamp a long answer rather than reject it, switch its action to mask — the gateway trims the response to the limit and the user still gets a truncated-but-usable reply instead of an error.

3. Author your own cap

A cost rule is the simplest rule in the engine — a stage, an action, and an integer. To cap requests at 20,000 characters and reject anything larger:

{
  "type": "max_chars",
  "stage": "input",
  "action": "block",
  "max_chars": 20000
}

Add it to any guardrail in the console. max_chars must be a positive integer; the validator rejects 0 or negative values.
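If you generate rules from a script, a local pre-check can catch a bad value before you paste it into the console. The sketch below is a hypothetical client-side check, not the gateway's validator: it assumes the only stages are input and output and the only actions for this rule are block and mask, which are the values this page mentions.

```python
from typing import List

ALLOWED_STAGES = {"input", "output"}  # stages named on this page
ALLOWED_ACTIONS = {"block", "mask"}   # actions named on this page


def validate_max_chars_rule(rule: dict) -> List[str]:
    """Return a list of problems; an empty list means the rule looks well-formed."""
    errors = []
    if rule.get("type") != "max_chars":
        errors.append("type must be 'max_chars'")
    if rule.get("stage") not in ALLOWED_STAGES:
        errors.append("stage must be 'input' or 'output'")
    if rule.get("action") not in ALLOWED_ACTIONS:
        errors.append("action must be 'block' or 'mask'")
    limit = rule.get("max_chars")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        errors.append("max_chars must be a positive integer")
    return errors


rule = {"type": "max_chars", "stage": "input", "action": "block", "max_chars": 20000}
print(validate_max_chars_rule(rule))                      # []
print(validate_max_chars_rule({**rule, "max_chars": 0}))  # ['max_chars must be a positive integer']
```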

4. Test before you attach

Prove the cap fires where you expect before any key points at it. Open the Test tab inside the guardrail editor, paste a sample, pick the input stage, and run the current policy locally — no upstream call, no quota. An over-limit sample returns a blocked verdict; an under-limit sample passes through untouched. For a clamp rule, the sandbox shows the truncated rendered text, so you can confirm the cap lands on a rune boundary before depending on it.

5. Attach the cap to a key

A cost guardrail resolves exactly like any other — attach it to an API key, or set it as the workspace default. Every step here is a console action under your own session.

Save the guardrail

Create or open a guardrail in the console, add a max_chars rule (or apply a cost preset), and save.

Attach a key

Edit an API key and pick the guardrail from the Guardrail dropdown (sets guardrail_id on the key), or mark the guardrail the workspace default. See Attach to a key and Account default.

Send a request

Using that key, call OrcaRouter exactly as before — no new headers, no SDK change:

curl https://api.orcarouter.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-orca-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "...a very long prompt..."}
    ]
  }'

If the prompt is over the cap, the call returns HTTP 400 guardrail_blocked and nothing is billed.

6. What a blocked request costs

An input-stage cap is the cheapest guardrail to enforce: it runs before usage is metered, so an oversized prompt is rejected at zero quota cost.

Does a blocked oversized request cost quota?

No. An input-stage block fires before metering. An output-stage block refunds the pre-consumed quota after the response is rejected. Either way the caller pays no quota, gets HTTP 400 guardrail_blocked, and the request is marked skip-retry — re-running the same oversized prompt would just block again. See the guardrail_blocked error.
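On the client side, the skip-retry behavior translates into a simple rule: never retry a guardrail_blocked response. The helper below is a sketch of one possible retry policy; the set of retryable statuses is an assumption about a typical client, and where the guardrail_blocked code appears in the error body is not specified here, so the function takes it as an already-extracted argument.

```python
from typing import Optional

# Assumed client policy: retry only on rate limits and transient server errors.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def should_retry(status: int, error_code: Optional[str]) -> bool:
    """Decide whether re-sending the same request could succeed."""
    if error_code == "guardrail_blocked":
        # The same oversized prompt would just block again.
        return False
    return status in RETRYABLE_STATUSES


print(should_retry(400, "guardrail_blocked"))  # False
print(should_retry(503, None))                 # True
```

A blocked request needs a smaller prompt or a higher cap, not another attempt.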

Is the response cap enforced on streaming?

A max_chars block on the output stage is enforced both ways: on a non-streaming response the answer is screened before it returns, and on a streaming response a scanner cuts the stream mid-flight once the buffer crosses the cap. A mask (clamp) on output currently applies to non-streaming responses only. See Streaming coverage.
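The streaming case can be pictured as a running counter over the chunks. This is a minimal sketch of the idea, assuming the chunk that crosses the cap is dropped and the stream ends with an error; the gateway's actual cut-off behavior may differ, and cap_stream and StreamCapExceeded are hypothetical names.

```python
from typing import Iterable, Iterator


class StreamCapExceeded(Exception):
    """Raised when the streamed text grows past the cap."""


def cap_stream(chunks: Iterable[str], max_chars: int) -> Iterator[str]:
    """Yield chunks until the cumulative character count crosses max_chars."""
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        if seen > max_chars:
            # Assumption: the crossing chunk is not delivered.
            raise StreamCapExceeded(f"stream exceeded {max_chars} characters")
        yield chunk


delivered = []
try:
    for piece in cap_stream(["ab", "cd", "ef"], max_chars=5):
        delivered.append(piece)
except StreamCapExceeded:
    pass
print(delivered)  # ['ab', 'cd']
```

Because the check runs per chunk, a caller may already have received text before the cut, which is why a block on a stream ends it mid-flight instead of rejecting it up front.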

Does a cost rule show the matched text in the feed?

No. A max_chars rule has no substring concept, so the Matches feed records that the cap fired (its type, action, and stage) but never a matched substring, even with Log raw content on. You get the signal that the rule fired without re-capturing the oversized payload.

7. Where this fits

A max_chars cap is a blunt cost lever — a hard ceiling, not a per-key spend budget. To cap dollars rather than characters, set credit_limit_usd on the API key itself (0 = unlimited), which the gateway enforces independently of any guardrail. The two stack: the key budget bounds total spend, the cost guardrail bounds the size of any single request or response.
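How the two limits stack can be sketched as two independent checks. This is a conceptual model only: the order of the checks, the spent_usd input, and the "key_budget_exhausted" label are assumptions made for illustration, while credit_limit_usd (with 0 meaning unlimited) and guardrail_blocked come from the behavior described above.

```python
def admit(prompt_chars: int, max_chars: int,
          spent_usd: float, credit_limit_usd: float) -> str:
    """Hypothetical admission check combining a key budget and a size cap."""
    # Key budget: bounds total spend; 0 means unlimited.
    if credit_limit_usd != 0 and spent_usd >= credit_limit_usd:
        return "key_budget_exhausted"  # hypothetical label
    # Cost guardrail: bounds the size of this single request.
    if prompt_chars > max_chars:
        return "guardrail_blocked"
    return "ok"


print(admit(30_000, 20_000, spent_usd=1.0, credit_limit_usd=0))      # guardrail_blocked
print(admit(10_000, 20_000, spent_usd=50.0, credit_limit_usd=50.0))  # key_budget_exhausted
print(admit(10_000, 20_000, spent_usd=1.0, credit_limit_usd=50.0))   # ok
```

Either limit alone leaves a gap: a size cap does not stop many small requests from adding up, and a spend budget does not stop one oversized request.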

A cost guardrail screens content size, not the model choice or the routing decision. It rejects an oversized prompt regardless of which model serves it. To govern an agent’s tool calls — deny destructive actions or hold them for approval — use the Firewall, which decides on the tool-call surface (allow / deny / pending_approval), not the content surface.

8. Where to go next

Input-stage rules

How request screening runs before the upstream call and before metering.

Output-stage rules

Screening and clamping the model’s response, streaming and not.

The guardrail_blocked error

The HTTP 400 shape, the no-quota guarantee, and skip-retry.

Test & eval

Prove a cap against a corpus before you attach a key.

Cost caps bound size. To bound content such as PII, secrets, and unsafe prompts, start with the Guardrails overview, or read the Guardrails reference for the complete engine.