Evaluations
Floopy evaluates every AI response across multiple quality dimensions, detects regressions automatically, and lets you run controlled evaluations against datasets before deploying changes. The system has three capabilities:
- Automated Scoring — Every response is scored on relevance, coherence, helpfulness, and safety using a blend of heuristic and LLM-based evaluation.
- Regression Alerts — Continuous monitoring compares recent scores against a 7-day baseline and fires alerts when quality drops.
- Dataset Evaluation Runs — Test prompt versions and compare models side-by-side using your own datasets with budget controls.
Scoring System
Built-in Dimensions
Every response is scored on four built-in dimensions, each producing a value from 0 to 100:
| Dimension | Description |
|---|---|
| Relevance | How closely the response addresses the request. Measures token overlap between prompt and response. |
| Coherence | Response structure quality — sentence count, length, formatting (paragraphs, lists, code blocks, headers). |
| Helpfulness | Response thoroughness relative to prompt complexity. Rewards appropriate length, code blocks when code is requested, structured lists when lists are requested. |
| Safety | Absence of harmful content patterns. Starts high (95) and deducts for detected harmful phrases. Scores at least 85 when refusal signals are present. |
Additionally, Floopy computes:
| Dimension | Description |
|---|---|
| Cost Efficiency | How cost-effective the request was relative to the cheapest option in your org over the last 24 hours. Range 0–100. |
| Composite Score | Weighted average of all dimensions (built-in + custom). This is the primary metric used for regression detection and routing optimization. |
How Scoring Works
Each response is scored using a blend of two methods:
- Heuristic scoring (default weight: 60%) — Fast, deterministic rules applied locally.
- LLM scoring (default weight: 40%) — An LLM evaluates the response and produces scores. Falls back to heuristic-only if the LLM is unavailable.
The final score per dimension is:
```
score = (heuristic_score × 0.6) + (llm_score × 0.4)
```
Scores are clamped to the 0–100 range.
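For illustration, the per-dimension blend might look like the following sketch, using the default weights above (the function name and fallback shape are hypothetical):

```ts
// Sketch of the per-dimension 60/40 blend (hypothetical function name).
function blendDimensionScore(heuristicScore: number, llmScore: number | null): number {
  // Fall back to heuristic-only when the LLM scorer is unavailable.
  const raw = llmScore === null ? heuristicScore : heuristicScore * 0.6 + llmScore * 0.4;
  // Clamp to the 0–100 range.
  return Math.min(100, Math.max(0, raw));
}
```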
Weight Presets
The composite score uses dimension weights to compute a single quality number. Floopy provides four presets:
| Preset | Relevance | Coherence | Helpfulness | Safety | Cost Efficiency |
|---|---|---|---|---|---|
| Balanced (default) | 0.25 | 0.10 | 0.30 | 0.15 | 0.20 |
| Quality First | 0.30 | 0.15 | 0.35 | 0.15 | 0.05 |
| Cost Optimized | 0.15 | 0.05 | 0.20 | 0.10 | 0.50 |
| Safety Critical | 0.15 | 0.05 | 0.15 | 0.50 | 0.15 |
You can also define custom weights per organization.
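As a rough sketch, computing a composite from the Balanced preset could look like this (the weight values come from the table above; the renormalization for missing dimensions is an assumption, not documented behavior):

```ts
// Hypothetical composite computation using the Balanced preset weights.
const balancedWeights: Record<string, number> = {
  relevance: 0.25,
  coherence: 0.1,
  helpfulness: 0.3,
  safety: 0.15,
  cost_efficiency: 0.2,
};

function compositeScore(
  scores: Record<string, number>,
  weights: Record<string, number> = balancedWeights,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [dimension, weight] of Object.entries(weights)) {
    if (dimension in scores) {
      weighted += scores[dimension] * weight;
      totalWeight += weight;
    }
  }
  // Renormalize so that skipped dimensions (e.g. unavailable custom scores)
  // don't drag the composite down (an assumption, not documented behavior).
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```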
Custom Dimensions
Organizations can define additional scoring dimensions beyond the four built-in ones. Each custom dimension has:
- Name — The dimension label (e.g., “brand_voice”, “technical_accuracy”).
- Evaluation prompt — A template prompt with `{request}` and `{response}` placeholders. The LLM evaluates against this prompt and returns a score from 0–100.
- Weight — How much this dimension contributes to the composite score.
- Active flag — Only active dimensions are evaluated.
Custom dimension scores are stored alongside built-in scores and appear in the dashboard charts and breakdown views.
Configuration: Custom dimensions are managed via the organization settings in the dashboard. They are cached for 5 minutes, so changes take effect shortly after saving.
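To make the placeholder mechanics concrete, here is a minimal sketch of how such a template might be filled in before the LLM call (the template text and helper are hypothetical; Floopy performs this step internally):

```ts
// Hypothetical custom dimension template and the substitution step.
const brandVoicePrompt = `Rate the following response for brand voice on a 0-100 scale.
Reply with only the number.

Request: {request}
Response: {response}`;

// Replaces the first occurrence of each placeholder (enough for this sketch).
function renderEvaluationPrompt(template: string, request: string, response: string): string {
  return template.replace("{request}", request).replace("{response}", response);
}
```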
Regression Alerts
Floopy continuously monitors your composite score for sudden drops and fires alerts when quality regresses.
Detection Logic
The regression detector runs on a schedule and compares two time windows:
| Parameter | Value |
|---|---|
| Current window | Last 1 hour |
| Baseline window | Previous 7 days (excluding the last hour) |
| Minimum sample size | 50 requests in both windows |
| Regression threshold | > 15% drop from baseline average |
| Deduplication window | 4 hours (same org won’t get duplicate alerts) |
Severity Levels
The severity is assigned based on how large the drop is:
| Drop Percentage | Severity |
|---|---|
| > 40% | Critical |
| > 25% | High |
| >= 15% | Medium |
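Putting the two tables together, the detection logic amounts to something like this sketch (function and field names are hypothetical):

```ts
// Hypothetical regression check using the thresholds documented above.
type Severity = "critical" | "high" | "medium";

function checkRegression(
  currentAvg: number,
  baselineAvg: number,
  currentCount: number,
  baselineCount: number,
) {
  const MIN_SAMPLES = 50;
  if (currentCount < MIN_SAMPLES || baselineCount < MIN_SAMPLES) return null; // not enough traffic

  // Percentage drop of the last hour versus the 7-day baseline.
  const dropPct = ((baselineAvg - currentAvg) / baselineAvg) * 100;
  if (dropPct <= 15) return null; // below the regression threshold

  const severity: Severity = dropPct > 40 ? "critical" : dropPct > 25 ? "high" : "medium";
  return { type: "quality_regression", severity, dropPct, currentAvg, baselineAvg };
}
```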
Alert Pipeline
When a regression is detected:
- An alert is created in the security alerts system with type `quality_regression`.
- Details include: current average, historical average, drop percentage, and sample size.
- If webhooks are configured for `security_alert` events, a notification is delivered to your endpoints.
- The alert appears in the dashboard under Security Alerts.
Tuning
Regression detection requires sufficient traffic volume (50+ scored requests per hour). For low-traffic organizations, alerts will not fire until the sample size threshold is met in both time windows.
If you receive too many alerts, check whether a legitimate change in your prompts or models is causing the score shift. The 4-hour deduplication window prevents alert fatigue from a single ongoing regression.
Dashboard
The evaluations dashboard at Settings > Evaluations provides analytics over your scored traffic.
Score Time Series
A line chart showing all scoring dimensions over time. Select a time range:
| Range | Chart Granularity |
|---|---|
| Last 1 hour | 5-minute buckets |
| Last 6 hours | 5-minute buckets |
| Last 24 hours | 1-hour buckets |
| Last 7 days | 1-day buckets |
| Last 30 days | 1-day buckets |
Each line represents a dimension (composite, relevance, coherence, helpfulness, safety). Custom dimensions appear as dashed lines.
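For reference, the range-to-granularity mapping above could be expressed as a simple lookup (a sketch mirroring the table, not an API):

```ts
// Sketch mirroring the granularity table (not an API).
const bucketForRange: Record<string, string> = {
  "1h": "5m",
  "6h": "5m",
  "24h": "1h",
  "7d": "1d",
  "30d": "1d",
};
```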
Breakdown Filters
Filter scores by three dimensions to identify underperforming configurations:
- By Model — Compare score averages across different models.
- By Prompt Version — See how different prompt IDs perform.
- By API Key — Identify which keys produce higher or lower quality.
Filters persist in URL search parameters, so you can bookmark or share specific views.
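For example, a bookmarked breakdown view might look like `/settings/evaluations?range=7d&groupBy=model` (these parameter names are illustrative, not a documented contract).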
The breakdown view shows a side-by-side bar chart comparing scores across the selected grouping, plus a comparison table with sample counts.
Top & Bottom Requests
Tables showing the highest and lowest scoring requests for the selected time range. Each row includes:
- Prompt snippet (first 200 characters)
- Model used
- Composite score and individual dimension scores
- Timestamp
Click any row to navigate to the full request detail view. Use the count selector to show the top/bottom 10, 25, or 50 requests.
Custom Dimension Scores
If your organization has custom dimensions configured, they appear automatically:
- As dashed lines in the time series chart.
- As additional bars in the breakdown chart.
- Organizations with no custom dimensions see no changes — the views pass through cleanly.
Dataset Evaluation Runs
Run controlled evaluations against your datasets to test prompt versions and compare models before deploying to production.
Creating a Run
From the evaluations page, click New Evaluation Run and configure:
| Field | Description |
|---|---|
| Dataset | Select from your existing datasets. |
| Prompt Version | (Optional) Select a specific prompt to use. |
| Models | Select up to 3 models to evaluate. One run is created per model. |
| Budget Limit | Maximum spend in dollars. The run stops if this limit is exceeded. |
An estimated cost is shown before starting, calculated from:
```
estimated_cost = dataset_rows × avg_tokens (500) × model_price_per_token
```
assuming a 60/40 input/output token split.
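A minimal sketch of that estimate, assuming per-token prices are known for the chosen model (the function and price parameters are hypothetical):

```ts
// Hypothetical pre-run estimate: 500 average tokens per row, 60/40 input/output split.
function estimateRunCostUsd(
  datasetRows: number,
  inputPricePerToken: number, // USD per input token for the chosen model (placeholder)
  outputPricePerToken: number, // USD per output token (placeholder)
  avgTokensPerRow = 500,
): number {
  const inputTokens = datasetRows * avgTokensPerRow * 0.6;
  const outputTokens = datasetRows * avgTokensPerRow * 0.4;
  return inputTokens * inputPricePerToken + outputTokens * outputPricePerToken;
}
```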
Run Lifecycle
Each evaluation run progresses through these states:
| State | Description |
|---|---|
| `pending` | Run created, waiting to start processing. |
| `running` | Actively processing dataset rows through the gateway pipeline. |
| `completed` | All rows processed and results aggregated. |
| `failed` | An error occurred or the budget limit was exceeded. |
| `cancelled` | Manually cancelled by a user. |
During execution, the runner:
- Fetches dataset rows from the database.
- For each row, sends the request through the normal gateway pipeline with the specified model.
- Each response is scored by the feedback worker (same scoring as production traffic).
- Updates progress every 10 rows.
- Checks for cancellation between rows.
Budget Enforcement
If a budget limit is set in the run configuration:
- Cumulative cost is tracked after each row completes.
- If the cumulative cost exceeds the budget, the run stops immediately.
- The run status is set to `failed` with the reason recorded in the results summary.
- Evaluation run costs are tracked separately from production traffic.
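Putting the lifecycle and budget rules together, the row loop behaves roughly like this sketch (every helper here is a hypothetical stand-in for Floopy internals, not a public API):

```ts
// Hypothetical stand-ins for Floopy internals (not a public API).
declare function fetchDatasetRows(runId: string): Promise<unknown[]>;
declare function isCancelled(runId: string): Promise<boolean>;
declare function sendThroughGateway(row: unknown): Promise<{ costCents: number }>;
declare function setStatus(runId: string, status: string, reason?: string): Promise<void>;
declare function updateProgress(runId: string, completed: number): Promise<void>;

async function runEvaluation(run: { id: string; budgetLimitCents?: number }): Promise<void> {
  const rows = await fetchDatasetRows(run.id);
  let completed = 0;
  let costCents = 0;

  for (const row of rows) {
    // Cancellation is checked between rows.
    if (await isCancelled(run.id)) return setStatus(run.id, "cancelled");

    // Each row goes through the normal gateway pipeline; the feedback
    // worker scores the response, same as production traffic.
    const result = await sendThroughGateway(row);
    costCents += result.costCents;
    completed += 1;

    // Progress updates every 10 rows.
    if (completed % 10 === 0) await updateProgress(run.id, completed);

    // Budget enforcement: stop immediately once the limit is exceeded.
    if (run.budgetLimitCents !== undefined && costCents > run.budgetLimitCents) {
      return setStatus(run.id, "failed", "budget_exceeded");
    }
  }
  await setStatus(run.id, "completed");
}
```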
Viewing Results
Each completed run produces a results summary with aggregated scores:
| Metric | Description |
|---|---|
| Mean | Average score across all evaluated rows. |
| P50 | Median score. |
| P95 | 95th percentile score. |
| Min | Lowest score in the run. |
| Max | Highest score in the run. |
These metrics are computed per dimension (relevance, coherence, helpfulness, safety, composite).
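These are standard aggregates; a sketch of how they could be computed from a non-empty list of scores (nearest-rank percentiles are an assumption about the exact method):

```ts
// Sketch of the per-dimension summary statistics (assumes a non-empty score list).
function summarize(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const percentile = (p: number) =>
    sorted[Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))];
  return {
    mean: sorted.reduce((sum, s) => sum + s, 0) / sorted.length,
    p50: percentile(50),
    p95: percentile(95),
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}
```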
The results page also shows:
- Progress bar for running evaluations (completed rows / total rows).
- Cost summary with total tokens and cost per model.
- Cancel button for runs that are still in progress.
Comparing Results
When multiple models are evaluated on the same dataset:
- Model comparison — Side-by-side score table showing each model’s mean scores per dimension. The best score per dimension is highlighted in green, the worst in red.
- Version comparison — Compare the same model across different prompt versions from related runs on the same dataset.
- Statistical significance warning — When sample size is below 30, a warning indicates that results may not be statistically significant.
API Reference
All evaluation endpoints require a valid API key with admin permissions in the `Authorization: Bearer <key>` header.
Create Evaluation Run
```
POST /v1/evaluations
```

```bash
curl -X POST https://api.floopy.ai/v1/evaluations \
  -H "Authorization: Bearer $FLOOPY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
    "model": "gpt-4o",
    "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
    "config": { "budget_limit_cents": 5000 }
  }'
```

```js
const response = await fetch("https://api.floopy.ai/v1/evaluations", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FLOOPY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    dataset_id: "d1234567-89ab-cdef-0123-456789abcdef",
    model: "gpt-4o",
    prompt_id: "p1234567-89ab-cdef-0123-456789abcdef",
    config: { budget_limit_cents: 5000 },
  }),
});
const run = await response.json();
```

```python
import os

import requests

response = requests.post(
    "https://api.floopy.ai/v1/evaluations",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
    json={
        "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
        "model": "gpt-4o",
        "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
        "config": {"budget_limit_cents": 5000},
    },
)
run = response.json()
```

Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| `dataset_id` | string (UUID) | Yes | ID of the dataset to evaluate against. |
| `model` | string | Yes | Model identifier (e.g., `gpt-4o`, `claude-sonnet-4-20250514`). |
| `prompt_id` | string (UUID) | No | Prompt version to use. If omitted, uses the raw dataset prompts. |
| `config` | object | No | Run configuration. |
| `config.budget_limit_cents` | integer | No | Maximum budget in cents. The run stops if exceeded. |
Response: 201 Created
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "status": "pending", "total_rows": 150, "completed_rows": 0, "created_at": "2026-04-10T14:30:00.000Z"}Get Evaluation Run
```
GET /v1/evaluations/:id
```

```bash
curl https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef \
  -H "Authorization: Bearer $FLOOPY_API_KEY"
```

```js
const response = await fetch(
  "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef",
  { headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` } },
);
const run = await response.json();
```

```python
import os

import requests

response = requests.get(
    "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
)
run = response.json()
```

Response: 200 OK
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "organization_id": "550e8400-e29b-41d4-a716-446655440000", "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef", "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef", "model": "gpt-4o", "status": "completed", "config": { "budget_limit_cents": 5000 }, "total_rows": 150, "completed_rows": 150, "results_summary": { "count": 150, "relevance": { "mean": 78.5 }, "coherence": { "mean": 82.1 }, "helpfulness": { "mean": 75.3 }, "safety": { "mean": 94.8 }, "composite_score": { "mean": 81.2, "p50": 83.0, "p95": 95.0, "min": 42.0, "max": 99.0 } }, "created_at": "2026-04-10T14:30:00.000Z", "completed_at": "2026-04-10T14:45:12.000Z"}Get Evaluation Results
```
GET /v1/evaluations/:id/results
```

Returns the aggregated results summary for a completed run. The response format matches the `results_summary` object above.
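Since runs complete asynchronously, a common pattern is to poll the Get Evaluation Run endpoint until the status is terminal. A minimal sketch (the polling interval is arbitrary):

```ts
// Minimal polling sketch using the Get Evaluation Run endpoint above.
async function waitForRun(runId: string, intervalMs = 5000) {
  for (;;) {
    const res = await fetch(`https://api.floopy.ai/v1/evaluations/${runId}`, {
      headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` },
    });
    const run = await res.json();
    // pending/running are the only non-terminal states.
    if (["completed", "failed", "cancelled"].includes(run.status)) return run;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```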
Cancel Evaluation Run
```
POST /v1/evaluations/:id/cancel
```

```bash
curl -X POST https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel \
  -H "Authorization: Bearer $FLOOPY_API_KEY"
```

```js
await fetch(
  "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` },
  },
);
```

```python
import os

import requests

requests.post(
    "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
)
```

Response: 200 OK
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "status": "cancelled"}Only runs in pending or running state can be cancelled.
Troubleshooting
No scores appearing
- Verify that traffic is flowing through the gateway (check the request logs page).
- Scoring is automatic — no configuration needed for built-in dimensions.
- Scores appear after the feedback worker processes batches (up to 30 seconds delay).
Regression alerts not firing
- Ensure you have at least 50 scored requests in the last hour AND in the 7-day baseline.
- Low-traffic organizations may not reach the sample size threshold.
- Duplicate alerts are suppressed for 4 hours — check if a recent alert already exists.
Custom dimensions not scoring
- Confirm your custom dimensions are marked as Active in organization settings.
- Custom dimension evaluation uses an external LLM. If the LLM is rate-limited or unavailable, custom scores are skipped and the composite score is computed without them.
- Changes to custom dimensions are cached for 5 minutes.
Evaluation run stuck in “running”
- Check if the gateway is healthy and processing requests.
- Large datasets may take time — monitor the progress bar (completed rows / total rows).
- If the run appears stuck, cancel it and create a new one.
Evaluation run failed with budget error
- The run exceeded the configured budget limit before completing all rows.
- Increase the budget limit or reduce the dataset size.
- Check the results summary for partial results from rows that completed before the budget was exceeded.
Comparison shows statistical significance warning
- Results are based on fewer than 30 samples.
- Run a larger dataset or wait for more production traffic before drawing conclusions.