Guardrails
Overview
Guardrails are configurable validation rules that run on every request passing through the gateway. Unlike the LLM Firewall which focuses on blocking malicious prompts, guardrails enforce your organization’s content policies — both on inputs (before the LLM sees them) and outputs (before the response reaches your users).
Each rule can either block the request (return an error) or flag it (log the violation but let it through). Rules are evaluated in priority order, and blocking rules short-circuit on the first failure.
Guardrails require the Pro plan (the `has_guardrails` feature flag).
Rule Types
max_length
Validates that the text does not exceed a character limit.
- Stage: input, output, or all
- Action: block or flag
- Config:
```json
{ "max_chars": 5000 }
```
Use this to prevent excessively long prompts from burning tokens, or excessively long responses from being returned to users.
keyword_block
Blocks text containing any of the configured terms. Uses case-insensitive matching — no regex patterns are compiled from user input.
- Stage: input, output, or all
- Action: block or flag
- Config:
```json
{ "terms": ["competitor_name", "internal_codename", "banned phrase"] }
```
Use this to prevent leaks of internal terminology, block competitor mentions, or enforce brand guidelines.
pii_detect
Detects personally identifiable information using the same regex patterns as the gateway’s built-in PII scrubber.
- Stage: input, output, or all
- Action: block or flag
- Available patterns: `email`, `cpf`, `ssn`, `credit_card`, `phone`, `api_key`
- Config:
```json
{ "patterns": ["email", "cpf", "credit_card", "phone"] }
```
Only the patterns listed in the config are checked. Unknown pattern names are silently skipped.
json_schema
Validates that the response is valid JSON conforming to a provided JSON Schema. Useful for structured output enforcement.
- Stage: output only
- Action: block or flag
- Config:
```json
{
  "schema": {
    "type": "object",
    "required": ["answer", "confidence"],
    "properties": {
      "answer": { "type": "string" },
      "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
    }
  }
}
```
The text is first parsed as JSON, then validated against the schema. If the response is not valid JSON at all, the rule fails.
toxicity
Currently a no-op pass-through. The pre-migration implementation called the local ONNX Prompt-Guard model synchronously. With the firewall now LLM-backed via the BackendRouter, the per-call sync interface no longer fits — calls into the router are async. Wiring async into the sync Validator trait (or splitting the trait) is deferred to a follow-up; the firewall itself remains the primary safety gate.
custom_llm
Coming soon. Evaluate text against a custom LLM prompt for domain-specific policies.
- Stage: output only (slow validator — requires LLM call)
Evaluation Stages
Rules are assigned to one of three stages:
| Stage | When it runs | Use case |
|---|---|---|
| input | Before the request is sent to the LLM provider | Block bad prompts, detect PII in user input |
| output | After the LLM responds, before returning to the user | Validate response format, detect PII leaks, check toxicity |
| all | Both input and output | Rules that apply to both directions (e.g., keyword blocking) |
Important: Slow rule types (toxicity, json_schema, custom_llm) can only run on the output stage. This constraint is enforced at the database level.
Block vs Flag
- Block: The request is rejected with a `400 GuardrailBlocked` error. The reason is included in the response. For input rules, the request never reaches the LLM. For output rules, the response is discarded.
- Flag: The violation is logged to ClickHouse (the `guardrail_events` table) and visible in the dashboard, but the request/response proceeds normally. Use this for monitoring before enforcing.
For streaming responses, output guardrails run asynchronously after the stream completes. Since the response is already sent, blocking rules behave as flags — the violation is logged and flagged but cannot be retracted.
Priority
Rules are evaluated in ascending priority order (lower number = runs first). If two rules have the same priority, evaluation order is not guaranteed. Use priority to ensure critical rules (like PII detection) run before less important ones (like keyword blocking).
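The ordering and short-circuit behavior described above can be sketched in a few lines. Field names (`priority`, `action`, `check`) are illustrative:

```python
# Sketch of priority-ordered rule evaluation with short-circuit on block.
def evaluate(rules: list[dict], text: str) -> list[dict]:
    """Run rules in ascending priority. A failing 'block' rule stops
    evaluation; failing 'flag' rules are recorded and evaluation continues."""
    events = []
    for rule in sorted(rules, key=lambda r: r["priority"]):
        passed, reason = rule["check"](text)
        if passed:
            continue
        events.append({"rule": rule["name"], "action": rule["action"], "reason": reason})
        if rule["action"] == "block":
            break  # blocking rules short-circuit on the first failure
    return events

rules = [
    {"name": "keywords", "priority": 20, "action": "flag",
     "check": lambda t: ("codename" not in t.lower(), "blocked term")},
    {"name": "pii", "priority": 10, "action": "block",
     "check": lambda t: ("@" not in t, "email detected")},
]

# "pii" (priority 10) runs before "keywords" (priority 20); since it blocks,
# "keywords" is never evaluated for this text.
events = evaluate(rules, "mail me at a@b.com about the codename")
```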
Guardrail Events
Every rule evaluation that results in a block or flag is logged to ClickHouse. Each event includes:
- Request ID
- Organization ID
- Rule ID and type
- Evaluation stage (input/output)
- Action taken (block/flag)
- Reason for failure
- Text preview (first 200 characters)
View events in the dashboard under Guardrails > Events.
Managing Rules
Creating a Rule
- Go to Guardrails in the dashboard.
- Click Create Rule.
- Configure:
- Name — descriptive label for the rule
- Type — select from the available rule types
- Stage — input, output, or all
- Action — block or flag
- Priority — evaluation order (lower runs first)
- Config — type-specific JSON configuration
- Click Save. The rule is active immediately.
Editing a Rule
Click the rule in the dashboard, modify the fields, and save. Changes take effect after the Redis cache TTL expires (typically within seconds).
Disabling a Rule
Toggle the Active switch to disable a rule without deleting it. Disabled rules are not evaluated.
Deleting a Rule
Only organization owners can delete guardrail rules. This action is permanent.
API Behavior
When a guardrail blocks a request, the gateway returns:
```json
{
  "error": {
    "message": "Text length 8500 exceeds maximum of 5000 characters",
    "type": "guardrail_blocked",
    "code": 400
  }
}
```
The `message` field contains the specific reason from the validator. Your application should handle `guardrail_blocked` errors and present a user-friendly message.
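A client-side handler might look like the sketch below. The success-path response shape (OpenAI-style `choices`) is an assumption for illustration; only the error shape is documented above:

```python
# Sketch of client-side handling for guardrail_blocked errors.
def handle_response(status: int, body: dict) -> str:
    err = body.get("error")
    if err and err.get("type") == "guardrail_blocked":
        # Log err["message"] internally; show end users a friendly message
        # rather than the raw validator reason.
        return "Sorry, that request was blocked by our content policy."
    if status != 200:
        raise RuntimeError(f"Gateway error {status}: {body}")
    # Assumed success shape -- adjust to your gateway's actual response.
    return body["choices"][0]["message"]["content"]

blocked = {
    "error": {
        "message": "Text length 8500 exceeds maximum of 5000 characters",
        "type": "guardrail_blocked",
        "code": 400,
    }
}
handle_response(400, blocked)  # returns the friendly message, not the raw reason
```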
Guardrails vs Firewall
| | LLM Firewall | Guardrails |
|---|---|---|
| Purpose | Block malicious/unsafe prompts | Enforce org-specific content policies |
| Scope | Input only | Input and/or output |
| Configuration | Global threshold | Per-rule, per-org |
| Rule types | LLM-backed safety classifier | 6 configurable validators |
| Action | Always blocks | Block or flag |
| Plan | All plans | Pro only |
Both systems run independently. A request must pass the firewall first, then guardrails are evaluated.