Guardrails

Overview

Guardrails are configurable validation rules that run on every request passing through the gateway. Unlike the LLM Firewall which focuses on blocking malicious prompts, guardrails enforce your organization’s content policies — both on inputs (before the LLM sees them) and outputs (before the response reaches your users).

Each rule can either block the request (return an error) or flag it (log the violation but let it through). Rules are evaluated in priority order, and blocking rules short-circuit on the first failure.

Guardrails require the Pro plan (has_guardrails feature flag).

Rule Types

max_length

Validates that the text does not exceed a character limit.

  • Stage: input, output, or all
  • Action: block or flag
  • Config:
{ "max_chars": 5000 }

Use this to prevent excessively long prompts from burning tokens or excessively long responses from being returned to users.

keyword_block

Blocks text containing any of the configured terms. Uses case-insensitive matching — no regex patterns are compiled from user input.

  • Stage: input, output, or all
  • Action: block or flag
  • Config:
{ "terms": ["competitor_name", "internal_codename", "banned phrase"] }

Use this to prevent leaks of internal terminology, block competitor mentions, or enforce brand guidelines.

pii_detect

Detects personally identifiable information using the same regex patterns as the gateway’s built-in PII scrubber.

  • Stage: input, output, or all
  • Action: block or flag
  • Available patterns: email, cpf, ssn, credit_card, phone, api_key
  • Config:
{ "patterns": ["email", "cpf", "credit_card", "phone"] }

Only the patterns listed in the config are checked. Unknown pattern names are silently skipped.
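The gateway's actual regex patterns are not published here; the sketch below uses simplified stand-ins for email and phone purely to show the lookup-and-skip behavior:

```python
import re

# Simplified stand-in patterns — NOT the gateway's real ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d ().-]{7,}\d"),
}

def detect_pii(text: str, patterns: list[str]) -> list[str]:
    """Return the names of configured patterns that matched.
    Unknown pattern names are silently skipped, as documented."""
    return [name for name in patterns
            if name in PATTERNS and PATTERNS[name].search(text)]
```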

json_schema

Validates that the response is valid JSON conforming to a provided JSON Schema. Useful for structured output enforcement.

  • Stage: output only
  • Action: block or flag
  • Config:
{
  "schema": {
    "type": "object",
    "required": ["answer", "confidence"],
    "properties": {
      "answer": { "type": "string" },
      "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
    }
  }
}

The text is first parsed as JSON, then validated against the schema. If the response is not valid JSON at all, the rule fails.
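For reference, a stdlib-only approximation of this parse-then-validate flow against the example schema (a real deployment would use a full JSON Schema validator rather than hand-rolled checks):

```python
import json

def check_json_schema(text: str) -> tuple[bool, str]:
    """Parse text as JSON, then apply checks equivalent to the
    example schema: required fields, types, and the [0, 1] range."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        return False, f"Response is not valid JSON: {e}"
    if not isinstance(data, dict):
        return False, "Response is not a JSON object"
    for key in ("answer", "confidence"):
        if key not in data:
            return False, f"Missing required field: {key}"
    if not isinstance(data["answer"], str):
        return False, "Field 'answer' must be a string"
    c = data["confidence"]
    if not isinstance(c, (int, float)) or isinstance(c, bool) or not 0 <= c <= 1:
        return False, "Field 'confidence' must be a number in [0, 1]"
    return True, "ok"
```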

toxicity

Currently a no-op pass-through. The pre-migration implementation called the local ONNX Prompt-Guard model synchronously. With the firewall now LLM-backed via the BackendRouter, the per-call sync interface no longer fits — calls into the router are async. Wiring async into the sync Validator trait (or splitting the trait) is deferred to a follow-up; the firewall itself remains the primary safety gate.

custom_llm

Coming soon. Evaluate text against a custom LLM prompt for domain-specific policies.

  • Stage: output only (slow validator — requires LLM call)

Evaluation Stages

Rules are assigned to one of three stages:

| Stage  | When it runs                                          | Use case                                                     |
|--------|-------------------------------------------------------|--------------------------------------------------------------|
| input  | Before the request is sent to the LLM provider        | Block bad prompts, detect PII in user input                  |
| output | After the LLM responds, before returning to the user  | Validate response format, detect PII leaks, check toxicity   |
| all    | Both input and output                                 | Rules that apply to both directions (e.g., keyword blocking) |

Important: Slow rule types (toxicity, json_schema, custom_llm) can only run on the output stage. This constraint is enforced at the database level.

Block vs Flag

  • Block: The request is rejected with a 400 GuardrailBlocked error. The reason is included in the response. For input rules, the request never reaches the LLM. For output rules, the response is discarded.
  • Flag: The violation is logged to ClickHouse (guardrail_events table) and visible in the dashboard, but the request/response proceeds normally. Use this for monitoring before enforcing.

For streaming responses, output guardrails run asynchronously after the stream completes. Since the response has already been sent, blocking rules are effectively downgraded to flags: the violation is logged, but the response cannot be retracted.

Priority

Rules are evaluated in ascending priority order (lower number = runs first). If two rules have the same priority, evaluation order is not guaranteed. Use priority to ensure critical rules (like PII detection) run before less important ones (like keyword blocking).
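The evaluation loop can be sketched as follows (the `Rule` shape and field names are hypothetical, mirroring the dashboard form rather than the gateway's internal types):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    priority: int                   # lower number runs first
    action: str                     # "block" or "flag"
    check: Callable[[str], bool]    # True = the text passed this rule

def evaluate(rules: list[Rule], text: str) -> tuple[bool, list[str]]:
    """Run rules in ascending priority order.
    Returns (allowed, flagged_rule_names)."""
    flagged = []
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.check(text):
            continue
        if rule.action == "block":
            return False, flagged   # short-circuit on the first blocking failure
        flagged.append(rule.name)   # flag: record the violation, keep going
    return True, flagged
```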

Guardrail Events

Every rule evaluation that results in a block or flag is logged to ClickHouse. Each event includes:

  • Request ID
  • Organization ID
  • Rule ID and type
  • Evaluation stage (input/output)
  • Action taken (block/flag)
  • Reason for failure
  • Text preview (first 200 characters)

View events in the dashboard under Guardrails > Events.

Managing Rules

Creating a Rule

  1. Go to Guardrails in the dashboard.
  2. Click Create Rule.
  3. Configure:
    • Name — descriptive label for the rule
    • Type — select from the available rule types
    • Stage — input, output, or all
    • Action — block or flag
    • Priority — evaluation order (lower runs first)
    • Config — type-specific JSON configuration
  4. Click Save. The rule is active immediately.

Editing a Rule

Click the rule in the dashboard, modify the fields, and save. Changes take effect after the Redis cache TTL expires (typically within seconds).

Disabling a Rule

Toggle the Active switch to disable a rule without deleting it. Disabled rules are not evaluated.

Deleting a Rule

Only organization owners can delete guardrail rules. This action is permanent.

API Behavior

When a guardrail blocks a request, the gateway returns:

{
  "error": {
    "message": "Text length 8500 exceeds maximum of 5000 characters",
    "type": "guardrail_blocked",
    "code": 400
  }
}

The message field contains the specific reason from the validator. Your application should handle guardrail_blocked errors and present a user-friendly message.
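A minimal sketch of such handling, given the error body shape above (the fallback wording here is this example's choice, not prescribed by the gateway):

```python
import json

def handle_gateway_error(body: str) -> str:
    """Map a gateway error body to a user-facing message.
    Guardrail blocks get a friendly message instead of the raw
    validator reason; other errors pass their message through."""
    err = json.loads(body).get("error", {})
    if err.get("type") == "guardrail_blocked":
        return "Sorry, that request was blocked by our content policy."
    return err.get("message", "Unexpected error")
```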

Guardrails vs Firewall

|               | LLM Firewall                   | Guardrails                           |
|---------------|--------------------------------|--------------------------------------|
| Purpose       | Block malicious/unsafe prompts | Enforce org-specific content policies|
| Scope         | Input only                     | Input and/or output                  |
| Configuration | Global threshold               | Per-rule, per-org                    |
| Rule types    | LLM-backed safety classifier   | 6 configurable validators            |
| Action        | Always blocks                  | Block or flag                        |
| Plan          | All plans                      | Pro only                             |

Both systems run independently. A request must pass the firewall first, then guardrails are evaluated.