Evaluations
Floopy evaluates every AI response across multiple quality dimensions, detects regressions automatically, and lets you run controlled evaluations against datasets before deploying changes. The system has three capabilities:
- Automated Scoring — Every response is scored on relevance, coherence, helpfulness, and safety using a blend of heuristic and LLM-based evaluation.
- Regression Alerts — Continuous monitoring compares recent scores against a 7-day baseline and fires alerts when quality drops.
- Dataset Evaluation Runs — Test prompt versions and compare models side-by-side using your own datasets with budget controls.
Scoring System
Built-in Dimensions
Every response is scored on four built-in dimensions, each producing a value from 0 to 100:
| Dimension | Description |
|---|---|
| Relevance | How closely the response addresses the request. Measures token overlap between prompt and response. |
| Coherence | Response structure quality — sentence count, length, formatting (paragraphs, lists, code blocks, headers). |
| Helpfulness | Response thoroughness relative to prompt complexity. Rewards appropriate length, code blocks when code is requested, structured lists when lists are requested. |
| Safety | Absence of harmful content patterns. Starts high (95) and deducts for detected harmful phrases. Scores at least 85 when refusal signals are present. |
Additionally, Floopy computes:
| Dimension | Description |
|---|---|
| Cost Efficiency | How cost-effective the request was relative to the cheapest option in your org over the last 24 hours. Range 0–100. |
| Composite Score | Weighted average of all dimensions (built-in + custom). This is the primary metric used for regression detection and routing optimization. |
How Scoring Works
Each response is scored using a blend of two methods:
- Heuristic scoring (default weight: 60%) — Fast, deterministic rules applied locally.
- LLM scoring (default weight: 40%) — An LLM evaluates the response and produces scores. Falls back to heuristic-only if the LLM is unavailable.
The final score per dimension is:
```
score = (heuristic_score × 0.6) + (llm_score × 0.4)
```

Scores are clamped to the 0–100 range.
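As a rough illustration, the sketch below blends a toy heuristic relevance score with an LLM score using the default 60/40 weights. The heuristic itself and the function names are assumptions for illustration, not Floopy's actual implementation.

```python
def heuristic_relevance(prompt: str, response: str) -> float:
    """Toy token-overlap heuristic scaled to 0-100 (an assumption, not Floopy's exact rules)."""
    prompt_tokens = set(prompt.lower().split())
    response_tokens = set(response.lower().split())
    if not prompt_tokens:
        return 0.0
    return 100 * len(prompt_tokens & response_tokens) / len(prompt_tokens)


def blended_score(heuristic_score: float, llm_score: float | None,
                  heuristic_weight: float = 0.6, llm_weight: float = 0.4) -> float:
    """Combine both methods; fall back to heuristic-only when no LLM score is available."""
    if llm_score is None:
        score = heuristic_score
    else:
        score = heuristic_score * heuristic_weight + llm_score * llm_weight
    return max(0.0, min(100.0, score))  # clamp to the 0-100 range


print(blended_score(heuristic_relevance("Explain Python decorators",
                                        "A decorator wraps a function to extend it."), 85.0))
```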
Weight Presets
The composite score uses dimension weights to compute a single quality number. Floopy provides four presets:
| Preset | Relevance | Coherence | Helpfulness | Safety | Cost Efficiency |
|---|---|---|---|---|---|
| Balanced (default) | 0.25 | 0.10 | 0.30 | 0.15 | 0.20 |
| Quality First | 0.30 | 0.15 | 0.35 | 0.15 | 0.05 |
| Cost Optimized | 0.15 | 0.05 | 0.20 | 0.10 | 0.50 |
| Safety Critical | 0.15 | 0.05 | 0.15 | 0.50 | 0.15 |
You can also define custom weights per organization.
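For illustration, the sketch below computes a composite score from per-dimension scores using the Balanced preset weights from the table. How Floopy handles dimensions that are missing a score is not documented here, so normalizing by the total weight of the dimensions present is an assumption.

```python
# Balanced preset weights from the table above.
BALANCED_WEIGHTS = {
    "relevance": 0.25,
    "coherence": 0.10,
    "helpfulness": 0.30,
    "safety": 0.15,
    "cost_efficiency": 0.20,
}


def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the dimensions that have both a score and a weight."""
    present = [d for d in scores if d in weights]
    total_weight = sum(weights[d] for d in present)
    if total_weight == 0:
        return 0.0
    return sum(scores[d] * weights[d] for d in present) / total_weight


scores = {"relevance": 78, "coherence": 82, "helpfulness": 75,
          "safety": 95, "cost_efficiency": 60}
print(round(composite_score(scores, BALANCED_WEIGHTS), 1))
```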
Custom Dimensions
Organizations can define additional scoring dimensions beyond the four built-in ones. Each custom dimension has:
- Name — The dimension label (e.g., “brand_voice”, “technical_accuracy”).
- Evaluation prompt — A template prompt with `{request}` and `{response}` placeholders. The LLM evaluates against this prompt and returns a score from 0–100.
- Weight — How much this dimension contributes to the composite score.
- Active flag — Only active dimensions are evaluated.
Custom dimension scores are stored alongside built-in scores and appear in the dashboard charts and breakdown views.
Configuration: Custom dimensions are managed via the organization settings in the dashboard. They are cached for 5 minutes, so changes take effect shortly after saving.
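As an example, a custom dimension definition might look like the following sketch. Dimensions are configured through the dashboard rather than in code, so the field layout below is purely illustrative; it simply mirrors the list above.

```python
# Hypothetical custom dimension, mirroring the fields described above.
brand_voice_dimension = {
    "name": "brand_voice",
    "evaluation_prompt": (
        "Rate from 0 to 100 how well the response matches our friendly, "
        "concise brand voice. Respond with only the number.\n\n"
        "Request: {request}\n\nResponse: {response}"
    ),
    "weight": 0.10,   # contribution to the composite score
    "active": True,   # only active dimensions are evaluated
}
```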
Regression Alerts
Floopy continuously monitors your composite score for sudden drops and fires alerts when quality regresses.
Detection Logic
The regression detector runs on a schedule and compares two time windows:
| Parameter | Value |
|---|---|
| Current window | Last 1 hour |
| Baseline window | Previous 7 days (excluding the last hour) |
| Minimum sample size | 50 requests in both windows |
| Regression threshold | > 15% drop from baseline average |
| Deduplication window | 4 hours (same org won’t get duplicate alerts) |
Severity Levels
The severity is assigned based on how large the drop is:
| Drop Percentage | Severity |
|---|---|
| > 40% | Critical |
| > 25% | High |
| >= 15% | Medium |
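The detection and severity rules can be summarized in a short sketch. The thresholds match the tables above; the surrounding structure and names are illustrative, not Floopy's actual code.

```python
def detect_regression(current_scores: list[float], baseline_scores: list[float]) -> str | None:
    """Return a severity level if composite quality regressed, else None."""
    MIN_SAMPLES = 50
    if len(current_scores) < MIN_SAMPLES or len(baseline_scores) < MIN_SAMPLES:
        return None  # not enough traffic in one of the windows

    current_avg = sum(current_scores) / len(current_scores)
    baseline_avg = sum(baseline_scores) / len(baseline_scores)
    if baseline_avg == 0:
        return None

    drop_pct = (baseline_avg - current_avg) / baseline_avg * 100
    if drop_pct > 40:
        return "Critical"
    if drop_pct > 25:
        return "High"
    if drop_pct >= 15:
        return "Medium"
    return None
```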
Alert Pipeline
When a regression is detected:
- An alert is created in the security alerts system with type `quality_regression`.
- Details include: current average, historical average, drop percentage, and sample size.
- If webhooks are configured for `security_alert` events, a notification is delivered to your endpoints.
- The alert appears in the dashboard under Security Alerts.
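For reference, a webhook notification for a quality regression might carry a payload along these lines. The field names are illustrative only; consult the webhooks documentation for the exact schema.

```json
{
  "type": "quality_regression",
  "severity": "high",
  "details": {
    "current_average": 62.4,
    "historical_average": 84.1,
    "drop_percentage": 25.8,
    "sample_size": 212
  }
}
```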
Tuning
Regression detection requires sufficient traffic volume (50+ scored requests per hour). For low-traffic organizations, alerts will not fire until the sample size threshold is met in both time windows.
If you receive too many alerts, check whether a legitimate change in your prompts or models is causing the score shift. The 4-hour deduplication window prevents alert fatigue from a single ongoing regression.
Dashboard
The evaluations dashboard at Settings > Evaluations provides analytics over your scored traffic.
Score Time Series
A line chart showing all scoring dimensions over time. Select a time range:
| Range | Chart Granularity |
|---|---|
| Last 1 hour | 5-minute buckets |
| Last 6 hours | 5-minute buckets |
| Last 24 hours | 1-hour buckets |
| Last 7 days | 1-day buckets |
| Last 30 days | 1-day buckets |
Each line represents a dimension (composite, relevance, coherence, helpfulness, safety). Custom dimensions appear as dashed lines.
Breakdown Filters
Break down scores along three attributes to identify underperforming configurations:
- By Model — Compare score averages across different models.
- By Prompt Version — See how different prompt IDs perform.
- By API Key — Identify which keys produce higher or lower quality.
Filters persist in URL search parameters, so you can bookmark or share specific views.
The breakdown view shows a side-by-side bar chart comparing scores across the selected grouping, plus a comparison table with sample counts.
Top & Bottom Requests
Tables showing the highest and lowest scoring requests for the selected time range. Each row includes:
- Prompt snippet (first 200 characters)
- Model used
- Composite score and individual dimension scores
- Timestamp
Click any row to navigate to the full request detail view. Use the count selector to show the top/bottom 10, 25, or 50 requests.
Custom Dimension Scores
If your organization has custom dimensions configured, they appear automatically:
- As dashed lines in the time series chart.
- As additional bars in the breakdown chart.
- Organizations without custom dimensions see the standard views unchanged.
Dataset Evaluation Runs
Run controlled evaluations against your datasets to test prompt versions and compare models before deploying to production.
Creating a Run
From the evaluations page, click New Evaluation Run and configure:
| Field | Description |
|---|---|
| Dataset | Select from your existing datasets. |
| Prompt Version | (Optional) Select a specific prompt to use. |
| Models | Select up to 3 models to evaluate. One run is created per model. |
| Budget Limit | Maximum spend in dollars. The run stops if this limit is exceeded. |
An estimated cost is shown before starting, calculated from:
```
estimated_cost = dataset_rows × avg_tokens (500) × model_price_per_token
```

with a 60/40 input/output token split assumption.
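The sketch below reproduces this estimate with the 60/40 split made explicit. The per-token prices in the example are placeholders, not real model prices.

```python
AVG_TOKENS_PER_ROW = 500
INPUT_SHARE, OUTPUT_SHARE = 0.6, 0.4  # 60/40 input/output split assumption


def estimate_cost(dataset_rows: int, input_price_per_token: float,
                  output_price_per_token: float) -> float:
    """Rough pre-run cost estimate in dollars."""
    tokens = dataset_rows * AVG_TOKENS_PER_ROW
    return (tokens * INPUT_SHARE * input_price_per_token
            + tokens * OUTPUT_SHARE * output_price_per_token)


# 150 rows at hypothetical per-token prices
print(f"${estimate_cost(150, 2.5e-6, 10e-6):.2f}")
```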
Run Lifecycle
Each evaluation run progresses through these states:
| State | Description |
|---|---|
| `pending` | Run created, waiting to start processing. |
| `running` | Actively processing dataset rows through the gateway pipeline. |
| `completed` | All rows processed and results aggregated. |
| `failed` | An error occurred or the budget limit was exceeded. |
| `cancelled` | Manually cancelled by a user. |
During execution, the runner:
- Fetches dataset rows from the database.
- For each row, sends the request through the normal gateway pipeline with the specified model.
- Each response is scored by the feedback worker (same scoring as production traffic).
- Progress updates every 10 rows.
- Checks for cancellation between each row.
Budget Enforcement
If a budget limit is set in the run configuration:
- Cumulative cost is tracked after each row completes.
- If the cumulative cost exceeds the budget, the run stops immediately.
- The run status is set to `failed` with the reason recorded in the results summary.
- Evaluation run costs are tracked separately from production traffic.
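A simplified sketch of the execution steps and budget enforcement described above is shown below. The gateway call is stubbed out, cancellation checks are omitted for brevity, and the names and return shape are assumptions.

```python
def run_evaluation(rows, send_row, budget_limit_cents=None, progress_every=10):
    """Process dataset rows; `send_row` sends one row through the gateway and returns its cost in cents."""
    total_cost_cents = 0.0
    completed = 0
    for row in rows:
        total_cost_cents += send_row(row)  # response is scored the same way as production traffic
        completed += 1
        if budget_limit_cents is not None and total_cost_cents > budget_limit_cents:
            return {"status": "failed", "reason": "budget limit exceeded",
                    "completed_rows": completed, "cost_cents": total_cost_cents}
        if completed % progress_every == 0:  # progress updates every 10 rows
            print(f"progress: {completed}/{len(rows)} rows")
    return {"status": "completed", "completed_rows": completed, "cost_cents": total_cost_cents}


# Toy usage: 30 rows at a fixed 3 cents each against a 50-cent budget stops early as "failed".
print(run_evaluation(rows=list(range(30)), send_row=lambda r: 3.0, budget_limit_cents=50))
```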
Viewing Results
Each completed run produces a results summary with aggregated scores:
| Metric | Description |
|---|---|
| Mean | Average score across all evaluated rows. |
| P50 | Median score. |
| P95 | 95th percentile score. |
| Min | Lowest score in the run. |
| Max | Highest score in the run. |
These metrics are computed per dimension (relevance, coherence, helpfulness, safety, composite).
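For clarity, these statistics can be reproduced with a few lines of code. The exact percentile convention Floopy uses is not specified, so a nearest-rank percentile is assumed here.

```python
def summarize(scores: list[float]) -> dict[str, float]:
    """Mean, p50, p95, min, and max for one dimension's scores."""
    ordered = sorted(scores)
    n = len(ordered)

    def percentile(p: float) -> float:
        # nearest-rank percentile on the sorted scores
        idx = min(n - 1, max(0, round(p / 100 * (n - 1))))
        return ordered[idx]

    return {
        "mean": sum(ordered) / n,
        "p50": percentile(50),
        "p95": percentile(95),
        "min": ordered[0],
        "max": ordered[-1],
    }


print(summarize([42.0, 75.0, 83.0, 90.0, 99.0]))
```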
The results page also shows:
- Progress bar for running evaluations (completed rows / total rows).
- Cost summary with total tokens and cost per model.
- Cancel button for runs that are still in progress.
Comparing Results
When multiple models are evaluated on the same dataset:
- Model comparison — Side-by-side score table showing each model’s mean scores per dimension. The best score per dimension is highlighted in green, the worst in red.
- Version comparison — Compare the same model across different prompt versions from related runs on the same dataset.
- Statistical significance warning — When sample size is below 30, a warning indicates that results may not be statistically significant.
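The sketch below mimics the comparison table's small-sample warning; the run data is made up for illustration.

```python
SAMPLE_WARNING_THRESHOLD = 30

run_summaries = {
    "gpt-4o": {"samples": 150, "composite_mean": 81.2},
    "claude-sonnet-4-20250514": {"samples": 25, "composite_mean": 84.6},
}

for model, summary in run_summaries.items():
    warning = ""
    if summary["samples"] < SAMPLE_WARNING_THRESHOLD:
        warning = "  (fewer than 30 samples; may not be statistically significant)"
    print(f"{model}: mean composite {summary['composite_mean']}{warning}")
```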
API Reference
All evaluation endpoints require a valid API key with admin permissions in the `Authorization: Bearer <key>` header.
Create Evaluation Run
`POST /v1/evaluations`

```bash
curl -X POST https://api.floopy.ai/v1/evaluations \
  -H "Authorization: Bearer $FLOOPY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
    "model": "gpt-4o",
    "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
    "config": { "budget_limit_cents": 5000 }
  }'
```

```javascript
const response = await fetch("https://api.floopy.ai/v1/evaluations", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FLOOPY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    dataset_id: "d1234567-89ab-cdef-0123-456789abcdef",
    model: "gpt-4o",
    prompt_id: "p1234567-89ab-cdef-0123-456789abcdef",
    config: { budget_limit_cents: 5000 },
  }),
});
const run = await response.json();
```

```python
import requests
import os

response = requests.post(
    "https://api.floopy.ai/v1/evaluations",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
    json={
        "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
        "model": "gpt-4o",
        "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
        "config": {"budget_limit_cents": 5000},
    },
)
run = response.json()
```

Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| `dataset_id` | string (UUID) | Yes | ID of the dataset to evaluate against. |
| `model` | string | Yes | Model identifier (e.g., `gpt-4o`, `claude-sonnet-4-20250514`). |
| `prompt_id` | string (UUID) | No | Prompt version to use. If omitted, uses the raw dataset prompts. |
| `config` | object | No | Run configuration. |
| `config.budget_limit_cents` | integer | No | Maximum budget in cents. Run stops if exceeded. |
Response: 201 Created
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "status": "pending", "total_rows": 150, "completed_rows": 0, "created_at": "2026-04-10T14:30:00.000Z"}Get Evaluation Run
`GET /v1/evaluations/:id`

```bash
curl https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef \
  -H "Authorization: Bearer $FLOOPY_API_KEY"
```

```javascript
const response = await fetch(
  "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef",
  { headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` } }
);
const run = await response.json();
```

```python
response = requests.get(
    "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
)
run = response.json()
```

Response: 200 OK
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "organization_id": "550e8400-e29b-41d4-a716-446655440000", "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef", "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef", "model": "gpt-4o", "status": "completed", "config": { "budget_limit_cents": 5000 }, "total_rows": 150, "completed_rows": 150, "results_summary": { "count": 150, "relevance": { "mean": 78.5 }, "coherence": { "mean": 82.1 }, "helpfulness": { "mean": 75.3 }, "safety": { "mean": 94.8 }, "composite_score": { "mean": 81.2, "p50": 83.0, "p95": 95.0, "min": 42.0, "max": 99.0 } }, "created_at": "2026-04-10T14:30:00.000Z", "completed_at": "2026-04-10T14:45:12.000Z"}Get Evaluation Results
`GET /v1/evaluations/:id/results`

Returns the aggregated results summary for a completed run. The response format matches the `results_summary` object above.
Cancel Evaluation Run
`POST /v1/evaluations/:id/cancel`

```bash
curl -X POST https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel \
  -H "Authorization: Bearer $FLOOPY_API_KEY"
```

```javascript
await fetch(
  "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` },
  }
);
```

```python
requests.post(
    "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
)
```

Response: 200 OK
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "status": "cancelled"}Only runs in pending or running state can be cancelled.
Troubleshooting
No scores appearing
- Verify that traffic is flowing through the gateway (check the request logs page).
- Scoring is automatic — no configuration needed for built-in dimensions.
- Scores appear after the feedback worker processes batches (up to 30 seconds delay).
Regression alerts not firing
- Ensure you have at least 50 scored requests in the last hour AND in the 7-day baseline.
- Low-traffic organizations may not reach the sample size threshold.
- Duplicate alerts are suppressed for 4 hours — check if a recent alert already exists.
Custom dimensions not scoring
- Confirm your custom dimensions are marked as Active in organization settings.
- Custom dimension evaluation uses an external LLM. If the LLM is rate-limited or unavailable, custom scores are skipped and the composite score is computed without them.
- Changes to custom dimensions are cached for 5 minutes.
Evaluation run stuck in “running”
- Check if the gateway is healthy and processing requests.
- Large datasets may take time — monitor the progress bar (completed rows / total rows).
- If the run appears stuck, cancel it and create a new one.
Evaluation run failed with budget error
- The run exceeded the configured budget limit before completing all rows.
- Increase the budget limit or reduce the dataset size.
- Check the results summary for partial results from rows that completed before the budget was exceeded.
Comparison shows statistical significance warning
- Results are based on fewer than 30 samples.
- Run a larger dataset or wait for more production traffic before drawing conclusions.