Guardrails

Overview

Guardrails are configurable validation rules that run on every request passing through the gateway. Unlike the LLM Firewall which focuses on blocking malicious prompts, guardrails enforce your organization’s content policies — both on inputs (before the LLM sees them) and outputs (before the response reaches your users).

Each rule can either block the request (return an error) or flag it (log the violation but let it through). Rules are evaluated in priority order, and blocking rules short-circuit on the first failure.

Guardrails require the Pro plan (has_guardrails feature flag).

Rule Types

max_length

Validates that the text does not exceed a character limit.

  • Stage: input, output, or all
  • Action: block or flag
  • Config:
{ "max_chars": 5000 }

Use this to prevent excessively long prompts from burning tokens or excessively long responses from being returned to users.

keyword_block

Blocks text containing any of the configured terms. Uses case-insensitive matching — no regex patterns are compiled from user input.

  • Stage: input, output, or all
  • Action: block or flag
  • Config:
{ "terms": ["competitor_name", "internal_codename", "banned phrase"] }

Use this to prevent leaks of internal terminology, block competitor mentions, or enforce brand guidelines.

pii_detect

Detects personally identifiable information using the same regex patterns as the gateway’s built-in PII scrubber.

  • Stage: input, output, or all
  • Action: block or flag
  • Available patterns: email, cpf, ssn, credit_card, phone, api_key
  • Config:
{ "patterns": ["email", "cpf", "credit_card", "phone"] }

Only the patterns listed in the config are checked. Unknown pattern names are silently skipped.
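The gateway's actual regex patterns are not published here; the sketch below uses simplified stand-ins for email and phone purely to show the lookup-and-skip behavior:

```python
import re

# Simplified stand-in patterns — NOT the gateway's real ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d ().-]{7,}\d"),
}

def detect_pii(text: str, patterns: list[str]) -> list[str]:
    """Return the names of configured patterns that matched.
    Unknown pattern names are silently skipped, as documented."""
    return [name for name in patterns
            if name in PATTERNS and PATTERNS[name].search(text)]
```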

json_schema

Validates that the response is valid JSON conforming to a provided JSON Schema. Useful for structured output enforcement.

  • Stage: output only
  • Action: block or flag
  • Config:
{
  "schema": {
    "type": "object",
    "required": ["answer", "confidence"],
    "properties": {
      "answer": { "type": "string" },
      "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
    }
  }
}

The text is first parsed as JSON, then validated against the schema. If the response is not valid JSON at all, the rule fails.
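For reference, a stdlib-only approximation of this parse-then-validate flow against the example schema (a real deployment would use a full JSON Schema validator rather than hand-rolled checks):

```python
import json

def check_json_schema(text: str) -> tuple[bool, str]:
    """Parse text as JSON, then apply checks equivalent to the
    example schema: required fields, types, and the [0, 1] range."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        return False, f"Response is not valid JSON: {e}"
    if not isinstance(data, dict):
        return False, "Response is not a JSON object"
    for key in ("answer", "confidence"):
        if key not in data:
            return False, f"Missing required field: {key}"
    if not isinstance(data["answer"], str):
        return False, "Field 'answer' must be a string"
    c = data["confidence"]
    if not isinstance(c, (int, float)) or isinstance(c, bool) or not 0 <= c <= 1:
        return False, "Field 'confidence' must be a number in [0, 1]"
    return True, "ok"
```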

toxicity

Currently a no-op pass-through. The pre-migration implementation called the local ONNX Prompt-Guard model synchronously. With the firewall now LLM-backed via the BackendRouter, the per-call sync interface no longer fits — calls into the router are async. Wiring async into the sync Validator trait (or splitting the trait) is deferred to a follow-up; the firewall itself remains the primary safety gate.

custom_llm

Coming soon. Evaluate text against a custom LLM prompt for domain-specific policies.

  • Stage: output only (slow validator — requires LLM call)

Evaluation Stages

Rules are assigned to one of three stages:

| Stage  | When it runs                                          | Use case                                                     |
|--------|-------------------------------------------------------|--------------------------------------------------------------|
| input  | Before the request is sent to the LLM provider        | Block bad prompts, detect PII in user input                  |
| output | After the LLM responds, before returning to the user  | Validate response format, detect PII leaks, check toxicity   |
| all    | Both input and output                                 | Rules that apply to both directions (e.g., keyword blocking) |

Important: Slow rule types (toxicity, json_schema, custom_llm) can only run on the output stage. This constraint is enforced at the database level.

Block vs Flag

  • Block: The request is rejected with a 400 GuardrailBlocked error. The reason is included in the response. For input rules, the request never reaches the LLM. For output rules, the response is discarded.
  • Flag: The violation is logged to ClickHouse (guardrail_events table) and visible in the dashboard, but the request/response proceeds normally. Use this for monitoring before enforcing.

For streaming responses, output guardrails run asynchronously after the stream completes. Since the response has already been sent, blocking rules are effectively downgraded to flags: the violation is logged, but the response cannot be retracted.

Priority

Rules are evaluated in ascending priority order (lower number = runs first). If two rules have the same priority, evaluation order is not guaranteed. Use priority to ensure critical rules (like PII detection) run before less important ones (like keyword blocking).
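The evaluation loop can be sketched as follows (the `Rule` shape and field names are hypothetical, mirroring the dashboard form rather than the gateway's internal types):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    priority: int                   # lower number runs first
    action: str                     # "block" or "flag"
    check: Callable[[str], bool]    # True = the text passed this rule

def evaluate(rules: list[Rule], text: str) -> tuple[bool, list[str]]:
    """Run rules in ascending priority order.
    Returns (allowed, flagged_rule_names)."""
    flagged = []
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.check(text):
            continue
        if rule.action == "block":
            return False, flagged   # short-circuit on the first blocking failure
        flagged.append(rule.name)   # flag: record the violation, keep going
    return True, flagged
```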

Guardrail Events

Every rule evaluation that results in a block or flag is logged to ClickHouse. Each event includes:

  • Request ID
  • Organization ID
  • Rule ID and type
  • Evaluation stage (input/output)
  • Action taken (block/flag)
  • Reason for failure
  • Text preview (first 200 characters)

View events in the dashboard under Guardrails > Events.

Managing Rules

Creating a Rule

  1. Go to Guardrails in the dashboard.
  2. Click Create Rule.
  3. Configure:
    • Name — descriptive label for the rule
    • Type — select from the available rule types
    • Stage — input, output, or all
    • Action — block or flag
    • Priority — evaluation order (lower runs first)
    • Config — type-specific JSON configuration
  4. Click Save. The rule is active immediately.

Editing a Rule

Click the rule in the dashboard, modify the fields, and save. Changes take effect after the Redis cache TTL expires (typically within seconds).

Disabling a Rule

Toggle the Active switch to disable a rule without deleting it. Disabled rules are not evaluated.

Deleting a Rule

Only organization owners can delete guardrail rules. This action is permanent.

API Behavior

When a guardrail blocks a request, the gateway returns:

{
  "error": {
    "message": "Text length 8500 exceeds maximum of 5000 characters",
    "type": "guardrail_blocked",
    "code": 400
  }
}

The message field contains the specific reason from the validator. Your application should handle guardrail_blocked errors and present a user-friendly message.
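A minimal sketch of such handling, given the error body shape above (the fallback wording here is this example's choice, not prescribed by the gateway):

```python
import json

def handle_gateway_error(body: str) -> str:
    """Map a gateway error body to a user-facing message.
    Guardrail blocks get a friendly message instead of the raw
    validator reason; other errors pass their message through."""
    err = json.loads(body).get("error", {})
    if err.get("type") == "guardrail_blocked":
        return "Sorry, that request was blocked by our content policy."
    return err.get("message", "Unexpected error")
```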

Guardrails vs Firewall

|               | LLM Firewall                   | Guardrails                           |
|---------------|--------------------------------|--------------------------------------|
| Purpose       | Block malicious/unsafe prompts | Enforce org-specific content policies|
| Scope         | Input only                     | Input and/or output                  |
| Configuration | Global threshold               | Per-rule, per-org                    |
| Rule types    | LLM-backed safety classifier   | 6 configurable validators            |
| Action        | Always blocks                  | Block or flag                        |
| Plan          | All plans                      | Pro only                             |

Both systems run independently. A request must pass the firewall first, then guardrails are evaluated.