Evaluations
Floopy evaluates every AI response across multiple quality dimensions, detects regressions automatically, and lets you run controlled evaluations against datasets before deploying changes. The system has three capabilities:
- Automated Scoring — Every response is scored on relevance, coherence, helpfulness, and safety using a blend of heuristic and LLM-based evaluation.
- Regression Alerts — Continuous monitoring compares recent scores against a 7-day baseline and fires alerts when quality drops.
- Dataset Evaluation Runs — Test prompt versions and compare models side-by-side using your own datasets with budget controls.
Scoring System
Built-in Dimensions
Every response is scored on four built-in dimensions, each producing a value from 0 to 100:
| Dimension | Description |
|---|---|
| Relevance | How closely the response addresses the request. Measures token overlap between prompt and response. |
| Coherence | Response structure quality — sentence count, length, formatting (paragraphs, lists, code blocks, headers). |
| Helpfulness | Response thoroughness relative to prompt complexity. Rewards appropriate length, code blocks when code is requested, structured lists when lists are requested. |
| Safety | Absence of harmful content patterns. Starts high (95) and deducts for detected harmful phrases. Scores at least 85 when refusal signals are present. |
Additionally, Floopy computes:
| Dimension | Description |
|---|---|
| Cost Efficiency | How cost-effective the request was relative to the cheapest option in your org over the last 24 hours. Range 0–100. |
| Composite Score | Weighted average of all dimensions (built-in + custom). This is the primary metric used for regression detection and routing optimization. |
How Scoring Works
Each response is scored using a blend of two methods:
- Heuristic scoring (default weight: 60%) — Fast, deterministic rules applied locally.
- LLM scoring (default weight: 40%) — An LLM evaluates the response and produces scores. Falls back to heuristic-only if the LLM is unavailable.
The final score per dimension is:
```
score = (heuristic_score × 0.6) + (llm_score × 0.4)
```
Scores are clamped to the 0–100 range.
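For illustration, the per-dimension blend might look like the following sketch, using the default weights above (the function name and fallback shape are hypothetical):

```ts
// Sketch of the per-dimension 60/40 blend (hypothetical function name).
function blendDimensionScore(heuristicScore: number, llmScore: number | null): number {
  // Fall back to heuristic-only when the LLM scorer is unavailable.
  const raw = llmScore === null ? heuristicScore : heuristicScore * 0.6 + llmScore * 0.4;
  // Clamp to the 0–100 range.
  return Math.min(100, Math.max(0, raw));
}
```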
Weight Presets
The composite score uses dimension weights to compute a single quality number. Floopy provides four presets:
| Preset | Relevance | Coherence | Helpfulness | Safety | Cost Efficiency |
|---|---|---|---|---|---|
| Balanced (default) | 0.25 | 0.10 | 0.30 | 0.15 | 0.20 |
| Quality First | 0.30 | 0.15 | 0.35 | 0.15 | 0.05 |
| Cost Optimized | 0.15 | 0.05 | 0.20 | 0.10 | 0.50 |
| Safety Critical | 0.15 | 0.05 | 0.15 | 0.50 | 0.15 |
You can also define custom weights per organization.
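As a rough sketch, computing a composite from the Balanced preset could look like this (the weight values come from the table above; the renormalization for missing dimensions is an assumption, not documented behavior):

```ts
// Hypothetical composite computation using the Balanced preset weights.
const balancedWeights: Record<string, number> = {
  relevance: 0.25,
  coherence: 0.1,
  helpfulness: 0.3,
  safety: 0.15,
  cost_efficiency: 0.2,
};

function compositeScore(
  scores: Record<string, number>,
  weights: Record<string, number> = balancedWeights,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [dimension, weight] of Object.entries(weights)) {
    if (dimension in scores) {
      weighted += scores[dimension] * weight;
      totalWeight += weight;
    }
  }
  // Renormalize so that skipped dimensions (e.g. unavailable custom scores)
  // don't drag the composite down (an assumption, not documented behavior).
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```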
Custom Dimensions
Organizations can define additional scoring dimensions beyond the four built-in ones. Each custom dimension has:
- Name — The dimension label (e.g., “brand_voice”, “technical_accuracy”).
- Evaluation prompt — A template prompt with `{request}` and `{response}` placeholders. The LLM evaluates against this prompt and returns a score from 0–100.
- Weight — How much this dimension contributes to the composite score.
- Active flag — Only active dimensions are evaluated.
Custom dimension scores are stored alongside built-in scores and appear in the dashboard charts and breakdown views.
Configuration: Custom dimensions are managed via the organization settings in the dashboard. They are cached for 5 minutes, so changes take effect shortly after saving.
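To make the placeholder mechanics concrete, here is a minimal sketch of how such a template might be filled in before the LLM call (the template text and helper are hypothetical; Floopy performs this step internally):

```ts
// Hypothetical custom dimension template and the substitution step.
const brandVoicePrompt = `Rate the following response for brand voice on a 0-100 scale.
Reply with only the number.

Request: {request}
Response: {response}`;

// Replaces the first occurrence of each placeholder (enough for this sketch).
function renderEvaluationPrompt(template: string, request: string, response: string): string {
  return template.replace("{request}", request).replace("{response}", response);
}
```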
Regression Alerts
Floopy continuously monitors your composite score for sudden drops and fires alerts when quality regresses.
Detection Logic
The regression detector runs on a schedule and compares two time windows:
| Parameter | Value |
|---|---|
| Current window | Last 1 hour |
| Baseline window | Previous 7 days (excluding the last hour) |
| Minimum sample size | 50 requests in both windows |
| Regression threshold | > 15% drop from baseline average |
| Deduplication window | 4 hours (same org won’t get duplicate alerts) |
Severity Levels
The severity is assigned based on how large the drop is:
| Drop Percentage | Severity |
|---|---|
| > 40% | Critical |
| > 25% | High |
| >= 15% | Medium |
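Putting the two tables together, the detection logic amounts to something like this sketch (function and field names are hypothetical):

```ts
// Hypothetical regression check using the thresholds documented above.
type Severity = "critical" | "high" | "medium";

function checkRegression(
  currentAvg: number,
  baselineAvg: number,
  currentCount: number,
  baselineCount: number,
) {
  const MIN_SAMPLES = 50;
  if (currentCount < MIN_SAMPLES || baselineCount < MIN_SAMPLES) return null; // not enough traffic

  // Percentage drop of the last hour versus the 7-day baseline.
  const dropPct = ((baselineAvg - currentAvg) / baselineAvg) * 100;
  if (dropPct <= 15) return null; // below the regression threshold

  const severity: Severity = dropPct > 40 ? "critical" : dropPct > 25 ? "high" : "medium";
  return { type: "quality_regression", severity, dropPct, currentAvg, baselineAvg };
}
```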
Alert Pipeline
When a regression is detected:
- An alert is created in the security alerts system with type `quality_regression`.
- Details include: current average, historical average, drop percentage, and sample size.
- If webhooks are configured for `security_alert` events, a notification is delivered to your endpoints.
- The alert appears in the dashboard under Security Alerts.
Tuning
Regression detection requires sufficient traffic volume (50+ scored requests per hour). For low-traffic organizations, alerts will not fire until the sample size threshold is met in both time windows.
If you receive too many alerts, check whether a legitimate change in your prompts or models is causing the score shift. The 4-hour deduplication window prevents alert fatigue from a single ongoing regression.
Dashboard
The evaluations dashboard at Settings > Evaluations provides analytics over your scored traffic.
Score Time Series
A line chart showing all scoring dimensions over time. Select a time range:
| Range | Chart Granularity |
|---|---|
| Last 1 hour | 5-minute buckets |
| Last 6 hours | 5-minute buckets |
| Last 24 hours | 1-hour buckets |
| Last 7 days | 1-day buckets |
| Last 30 days | 1-day buckets |
Each line represents a dimension (composite, relevance, coherence, helpfulness, safety). Custom dimensions appear as dashed lines.
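For reference, the range-to-granularity mapping above could be expressed as a simple lookup (a sketch mirroring the table, not an API):

```ts
// Sketch mirroring the granularity table (not an API).
const bucketForRange: Record<string, string> = {
  "1h": "5m",
  "6h": "5m",
  "24h": "1h",
  "7d": "1d",
  "30d": "1d",
};
```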
Breakdown Filters
Filter scores by three dimensions to identify underperforming configurations:
- By Model — Compare score averages across different models.
- By Prompt Version — See how different prompt IDs perform.
- By API Key — Identify which keys produce higher or lower quality.
Filters persist in URL search parameters, so you can bookmark or share specific views.
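For example, a bookmarked breakdown view might look like `/settings/evaluations?range=7d&groupBy=model` (these parameter names are illustrative, not a documented contract).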
The breakdown view shows a side-by-side bar chart comparing scores across the selected grouping, plus a comparison table with sample counts.
Top & Bottom Requests
Tables showing the highest and lowest scoring requests for the selected time range. Each row includes:
- Prompt snippet (first 200 characters)
- Model used
- Composite score and individual dimension scores
- Timestamp
Click any row to navigate to the full request detail view. Use the count selector to show the top/bottom 10, 25, or 50 requests.
Custom Dimension Scores
If your organization has custom dimensions configured, they appear automatically:
- As dashed lines in the time series chart.
- As additional bars in the breakdown chart.
- Organizations with no custom dimensions see no changes — the views pass through cleanly.
Dataset Evaluation Runs
Run controlled evaluations against your datasets to test prompt versions and compare models before deploying to production.
Creating a Run
From the evaluations page, click New Evaluation Run and configure:
| Field | Description |
|---|---|
| Dataset | Select from your existing datasets. |
| Prompt Version | (Optional) Select a specific prompt to use. |
| Models | Select up to 3 models to evaluate. One run is created per model. |
| Budget Limit | Maximum spend in dollars. The run stops if this limit is exceeded. |
An estimated cost is shown before starting, calculated from:
```
estimated_cost = dataset_rows × avg_tokens (500) × model_price_per_token
```
assuming a 60/40 input/output token split.
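A minimal sketch of that estimate, assuming per-token prices are known for the chosen model (the function and price parameters are hypothetical):

```ts
// Hypothetical pre-run estimate: 500 average tokens per row, 60/40 input/output split.
function estimateRunCostUsd(
  datasetRows: number,
  inputPricePerToken: number, // USD per input token for the chosen model (placeholder)
  outputPricePerToken: number, // USD per output token (placeholder)
  avgTokensPerRow = 500,
): number {
  const inputTokens = datasetRows * avgTokensPerRow * 0.6;
  const outputTokens = datasetRows * avgTokensPerRow * 0.4;
  return inputTokens * inputPricePerToken + outputTokens * outputPricePerToken;
}
```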
Run Lifecycle
Each evaluation run progresses through these states:
| State | Description |
|---|---|
| `pending` | Run created, waiting to start processing. |
| `running` | Actively processing dataset rows through the gateway pipeline. |
| `completed` | All rows processed and results aggregated. |
| `failed` | An error occurred or the budget limit was exceeded. |
| `cancelled` | Manually cancelled by a user. |
During execution, the runner:
- Fetches dataset rows from the database.
- For each row, sends the request through the normal gateway pipeline with the specified model.
- Each response is scored by the feedback worker (same scoring as production traffic).
- Updates progress every 10 rows.
- Checks for cancellation between rows.
Budget Enforcement
If a budget limit is set in the run configuration:
- Cumulative cost is tracked after each row completes.
- If the cumulative cost exceeds the budget, the run stops immediately.
- The run status is set to `failed` with the reason recorded in the results summary.
- Evaluation run costs are tracked separately from production traffic.
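Putting the lifecycle and budget rules together, the row loop behaves roughly like this sketch (every helper here is a hypothetical stand-in for Floopy internals, not a public API):

```ts
// Hypothetical stand-ins for Floopy internals (not a public API).
declare function fetchDatasetRows(runId: string): Promise<unknown[]>;
declare function isCancelled(runId: string): Promise<boolean>;
declare function sendThroughGateway(row: unknown): Promise<{ costCents: number }>;
declare function setStatus(runId: string, status: string, reason?: string): Promise<void>;
declare function updateProgress(runId: string, completed: number): Promise<void>;

async function runEvaluation(run: { id: string; budgetLimitCents?: number }): Promise<void> {
  const rows = await fetchDatasetRows(run.id);
  let completed = 0;
  let costCents = 0;

  for (const row of rows) {
    // Cancellation is checked between rows.
    if (await isCancelled(run.id)) return setStatus(run.id, "cancelled");

    // Each row goes through the normal gateway pipeline; the feedback
    // worker scores the response, same as production traffic.
    const result = await sendThroughGateway(row);
    costCents += result.costCents;
    completed += 1;

    // Progress updates every 10 rows.
    if (completed % 10 === 0) await updateProgress(run.id, completed);

    // Budget enforcement: stop immediately once the limit is exceeded.
    if (run.budgetLimitCents !== undefined && costCents > run.budgetLimitCents) {
      return setStatus(run.id, "failed", "budget_exceeded");
    }
  }
  await setStatus(run.id, "completed");
}
```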
Viewing Results
Each completed run produces a results summary with aggregated scores:
| Metric | Description |
|---|---|
| Mean | Average score across all evaluated rows. |
| P50 | Median score. |
| P95 | 95th percentile score. |
| Min | Lowest score in the run. |
| Max | Highest score in the run. |
These metrics are computed per dimension (relevance, coherence, helpfulness, safety, composite).
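These are standard aggregates; a sketch of how they could be computed from a non-empty list of scores (nearest-rank percentiles are an assumption about the exact method):

```ts
// Sketch of the per-dimension summary statistics (assumes a non-empty score list).
function summarize(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const percentile = (p: number) =>
    sorted[Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))];
  return {
    mean: sorted.reduce((sum, s) => sum + s, 0) / sorted.length,
    p50: percentile(50),
    p95: percentile(95),
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}
```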
The results page also shows:
- Progress bar for running evaluations (completed rows / total rows).
- Cost summary with total tokens and cost per model.
- Cancel button for runs that are still in progress.
Comparing Results
When multiple models are evaluated on the same dataset:
- Model comparison — Side-by-side score table showing each model’s mean scores per dimension. The best score per dimension is highlighted in green, the worst in red.
- Version comparison — Compare the same model across different prompt versions from related runs on the same dataset.
- Statistical significance warning — When sample size is below 30, a warning indicates that results may not be statistically significant.
API Reference
All evaluation endpoints require a valid API key with admin permissions in the `Authorization: Bearer <key>` header.
Create Evaluation Run
```
POST /v1/evaluations
```

```bash
curl -X POST https://api.floopy.ai/v1/evaluations \
  -H "Authorization: Bearer $FLOOPY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
    "model": "gpt-4o",
    "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
    "config": { "budget_limit_cents": 5000 }
  }'
```

```js
const response = await fetch("https://api.floopy.ai/v1/evaluations", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FLOOPY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    dataset_id: "d1234567-89ab-cdef-0123-456789abcdef",
    model: "gpt-4o",
    prompt_id: "p1234567-89ab-cdef-0123-456789abcdef",
    config: { budget_limit_cents: 5000 },
  }),
});
const run = await response.json();
```

```python
import os

import requests

response = requests.post(
    "https://api.floopy.ai/v1/evaluations",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
    json={
        "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
        "model": "gpt-4o",
        "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
        "config": {"budget_limit_cents": 5000},
    },
)
run = response.json()
```

Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| `dataset_id` | string (UUID) | Yes | ID of the dataset to evaluate against. |
| `model` | string | Yes | Model identifier (e.g., `gpt-4o`, `claude-sonnet-4-20250514`). |
| `prompt_id` | string (UUID) | No | Prompt version to use. If omitted, uses the raw dataset prompts. |
| `config` | object | No | Run configuration. |
| `config.budget_limit_cents` | integer | No | Maximum budget in cents. The run stops if exceeded. |
Response: 201 Created
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "status": "pending", "total_rows": 150, "completed_rows": 0, "created_at": "2026-04-10T14:30:00.000Z"}Get Evaluation Run
```
GET /v1/evaluations/:id
```

```bash
curl https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef \
  -H "Authorization: Bearer $FLOOPY_API_KEY"
```

```js
const response = await fetch(
  "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef",
  { headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` } },
);
const run = await response.json();
```

```python
import os

import requests

response = requests.get(
    "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
)
run = response.json()
```

Response: 200 OK
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "organization_id": "550e8400-e29b-41d4-a716-446655440000", "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef", "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef", "model": "gpt-4o", "status": "completed", "config": { "budget_limit_cents": 5000 }, "total_rows": 150, "completed_rows": 150, "results_summary": { "count": 150, "relevance": { "mean": 78.5 }, "coherence": { "mean": 82.1 }, "helpfulness": { "mean": 75.3 }, "safety": { "mean": 94.8 }, "composite_score": { "mean": 81.2, "p50": 83.0, "p95": 95.0, "min": 42.0, "max": 99.0 } }, "created_at": "2026-04-10T14:30:00.000Z", "completed_at": "2026-04-10T14:45:12.000Z"}Get Evaluation Results
```
GET /v1/evaluations/:id/results
```

Returns the aggregated results summary for a completed run. The response format matches the `results_summary` object above.
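Since runs complete asynchronously, a common pattern is to poll the Get Evaluation Run endpoint until the status is terminal. A minimal sketch (the polling interval is arbitrary):

```ts
// Minimal polling sketch using the Get Evaluation Run endpoint above.
async function waitForRun(runId: string, intervalMs = 5000) {
  for (;;) {
    const res = await fetch(`https://api.floopy.ai/v1/evaluations/${runId}`, {
      headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` },
    });
    const run = await res.json();
    // pending/running are the only non-terminal states.
    if (["completed", "failed", "cancelled"].includes(run.status)) return run;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```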
Cancel Evaluation Run
```
POST /v1/evaluations/:id/cancel
```

```bash
curl -X POST https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel \
  -H "Authorization: Bearer $FLOOPY_API_KEY"
```

```js
await fetch(
  "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.FLOOPY_API_KEY}` },
  },
);
```

```python
import os

import requests

requests.post(
    "https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
)
```

Response: 200 OK
{ "id": "r1234567-89ab-cdef-0123-456789abcdef", "status": "cancelled"}Only runs in pending or running state can be cancelled.
Troubleshooting
No scores appearing
- Verify that traffic is flowing through the gateway (check the request logs page).
- Scoring is automatic — no configuration needed for built-in dimensions.
- Scores appear after the feedback worker processes batches (up to 30 seconds delay).
Regression alerts not firing
- Ensure you have at least 50 scored requests in the last hour AND in the 7-day baseline.
- Low-traffic organizations may not reach the sample size threshold.
- Duplicate alerts are suppressed for 4 hours — check if a recent alert already exists.
Custom dimensions not scoring
- Confirm your custom dimensions are marked as Active in organization settings.
- Custom dimension evaluation uses an external LLM. If the LLM is rate-limited or unavailable, custom scores are skipped and the composite score is computed without them.
- Changes to custom dimensions are cached for 5 minutes.
Evaluation run stuck in “running”
- Check if the gateway is healthy and processing requests.
- Large datasets may take time — monitor the progress bar (completed rows / total rows).
- If the run appears stuck, cancel it and create a new one.
Evaluation run failed with budget error
- The run exceeded the configured budget limit before completing all rows.
- Increase the budget limit or reduce the dataset size.
- Check the results summary for partial results from rows that completed before the budget was exceeded.
Comparison shows statistical significance warning
- Results are based on fewer than 30 samples.
- Run a larger dataset or wait for more production traffic before drawing conclusions.