Evaluations

Floopy evaluates every AI response across multiple quality dimensions, detects regressions automatically, and lets you run controlled evaluations against datasets before deploying changes. The system has three capabilities:

  1. Automated Scoring — Every response is scored on relevance, coherence, helpfulness, and safety using a blend of heuristic and LLM-based evaluation.
  2. Regression Alerts — Continuous monitoring compares recent scores against a 7-day baseline and fires alerts when quality drops.
  3. Dataset Evaluation Runs — Test prompt versions and compare models side-by-side using your own datasets with budget controls.

Scoring System

Built-in Dimensions

Every response is scored on four built-in dimensions, each producing a value from 0 to 100:

| Dimension | Description |
| --- | --- |
| Relevance | How closely the response addresses the request. Measures token overlap between prompt and response. |
| Coherence | Response structure quality: sentence count, length, formatting (paragraphs, lists, code blocks, headers). |
| Helpfulness | Response thoroughness relative to prompt complexity. Rewards appropriate length, code blocks when code is requested, structured lists when lists are requested. |
| Safety | Absence of harmful content patterns. Starts high (95) and deducts for detected harmful phrases. Scores at least 85 when refusal signals are present. |

Additionally, Floopy computes:

| Dimension | Description |
| --- | --- |
| Cost Efficiency | How cost-effective the request was relative to the cheapest option in your org over the last 24 hours. Range 0–100. |
| Composite Score | Weighted average of all dimensions (built-in + custom). This is the primary metric used for regression detection and routing optimization. |

How Scoring Works

Each response is scored using a blend of two methods:

  • Heuristic scoring (default weight: 60%) — Fast, deterministic rules applied locally.
  • LLM scoring (default weight: 40%) — An LLM evaluates the response and produces scores. Falls back to heuristic-only if the LLM is unavailable.

The final score per dimension is:

score = (heuristic_score × 0.6) + (llm_score × 0.4)

Scores are clamped to the 0–100 range.
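As a concrete illustration, the per-dimension blend and clamp can be sketched in Python. This is an illustrative sketch, not Floopy's actual implementation; the function name and fallback handling are assumptions based on the description above.

```python
from typing import Optional

def blend_score(heuristic_score: float, llm_score: Optional[float],
                heuristic_weight: float = 0.6, llm_weight: float = 0.4) -> float:
    """Blend the heuristic and LLM scores for one dimension, then clamp to 0-100."""
    if llm_score is None:
        # Fall back to heuristic-only scoring when the LLM judge is unavailable.
        blended = heuristic_score
    else:
        blended = heuristic_score * heuristic_weight + llm_score * llm_weight
    return max(0.0, min(100.0, blended))
```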

Weight Presets

The composite score uses dimension weights to compute a single quality number. Floopy provides four presets:

| Preset | Relevance | Coherence | Helpfulness | Safety | Cost Efficiency |
| --- | --- | --- | --- | --- | --- |
| Balanced (default) | 0.25 | 0.10 | 0.30 | 0.15 | 0.20 |
| Quality First | 0.30 | 0.15 | 0.35 | 0.15 | 0.05 |
| Cost Optimized | 0.15 | 0.05 | 0.20 | 0.10 | 0.50 |
| Safety Critical | 0.15 | 0.05 | 0.15 | 0.50 | 0.15 |

You can also define custom weights per organization.
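For example, a composite score under the Balanced preset can be computed as a weighted average along these lines. This is a sketch; the renormalization over missing dimensions is an assumption, not documented behavior.

```python
# "Balanced" preset weights from the table above.
BALANCED = {
    "relevance": 0.25,
    "coherence": 0.10,
    "helpfulness": 0.30,
    "safety": 0.15,
    "cost_efficiency": 0.20,
}

def composite_score(scores: dict, weights: dict = BALANCED) -> float:
    """Weighted average over the dimensions present in `scores`.

    Renormalizing when some dimensions are missing is an assumption here;
    Floopy's exact handling of skipped dimensions is not specified.
    """
    total_weight = sum(w for dim, w in weights.items() if dim in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(scores[dim] * w for dim, w in weights.items() if dim in scores)
    return weighted_sum / total_weight
```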

Custom Dimensions

Organizations can define additional scoring dimensions beyond the four built-in ones. Each custom dimension has:

  • Name — The dimension label (e.g., “brand_voice”, “technical_accuracy”).
  • Evaluation prompt — A template prompt with {request} and {response} placeholders. The LLM evaluates against this prompt and returns a score from 0–100.
  • Weight — How much this dimension contributes to the composite score.
  • Active flag — Only active dimensions are evaluated.

Custom dimension scores are stored alongside built-in scores and appear in the dashboard charts and breakdown views.

Configuration: Custom dimensions are managed via the organization settings in the dashboard. They are cached for 5 minutes, so changes take effect shortly after saving.
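A minimal sketch of how the {request} and {response} placeholders in an evaluation prompt might be substituted before it is sent to the LLM judge. The helper `build_eval_prompt` and the template text are hypothetical.

```python
# Hypothetical helper: fill a custom dimension's evaluation prompt template.
def build_eval_prompt(template: str, request: str, response: str) -> str:
    """Substitute the {request} and {response} placeholders."""
    return template.replace("{request}", request).replace("{response}", response)

# Example template for a hypothetical "brand_voice" dimension.
template = (
    "Rate the following response for brand_voice on a 0-100 scale.\n"
    "Request: {request}\nResponse: {response}\nReturn only the number."
)
prompt = build_eval_prompt(template, "Explain our refund policy.", "Refunds are issued within 14 days.")
```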


Regression Alerts

Floopy continuously monitors your composite score for sudden drops and fires alerts when quality regresses.

Detection Logic

The regression detector runs on a schedule and compares two time windows:

| Parameter | Value |
| --- | --- |
| Current window | Last 1 hour |
| Baseline window | Previous 7 days (excluding the last hour) |
| Minimum sample size | 50 requests in both windows |
| Regression threshold | > 15% drop from baseline average |
| Deduplication window | 4 hours (same org won’t get duplicate alerts) |

Severity Levels

The severity is assigned based on how large the drop is:

| Drop Percentage | Severity |
| --- | --- |
| > 40% | Critical |
| > 25% | High |
| >= 15% | Medium |
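The detection and severity rules above can be sketched as follows. This is an illustrative function, not Floopy's actual code; the sample-size guard and percentage math follow the parameters in the tables.

```python
MIN_SAMPLES = 50  # required in both the current and baseline windows

def detect_regression(current_avg: float, baseline_avg: float,
                      current_n: int, baseline_n: int):
    """Return a severity string, or None when no alert should fire."""
    if current_n < MIN_SAMPLES or baseline_n < MIN_SAMPLES or baseline_avg <= 0:
        return None
    drop_pct = (baseline_avg - current_avg) / baseline_avg * 100
    if drop_pct > 40:
        return "critical"
    if drop_pct > 25:
        return "high"
    if drop_pct >= 15:
        return "medium"
    return None
```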

Alert Pipeline

When a regression is detected:

  1. An alert is created in the security alerts system with type quality_regression.
  2. Details include: current average, historical average, drop percentage, and sample size.
  3. If webhooks are configured for security_alert events, a notification is delivered to your endpoints.
  4. The alert appears in the dashboard under Security Alerts.

Tuning

Regression detection requires sufficient traffic volume (50+ scored requests per hour). For low-traffic organizations, alerts will not fire until the sample size threshold is met in both time windows.

If you receive too many alerts, check whether a legitimate change in your prompts or models is causing the score shift. The 4-hour deduplication window prevents alert fatigue from a single ongoing regression.


Dashboard

The evaluations dashboard at Settings > Evaluations provides analytics over your scored traffic.

Score Time Series

A line chart showing all scoring dimensions over time. Select a time range:

| Range | Chart Granularity |
| --- | --- |
| Last 1 hour | 5-minute buckets |
| Last 6 hours | 5-minute buckets |
| Last 24 hours | 1-hour buckets |
| Last 7 days | 1-day buckets |
| Last 30 days | 1-day buckets |

Each line represents a dimension (composite, relevance, coherence, helpfulness, safety). Custom dimensions appear as dashed lines.

Breakdown Filters

Filter scores by three dimensions to identify underperforming configurations:

  • By Model — Compare score averages across different models.
  • By Prompt Version — See how different prompt IDs perform.
  • By API Key — Identify which keys produce higher or lower quality.

Filters persist in URL search parameters, so you can bookmark or share specific views.

The breakdown view shows a side-by-side bar chart comparing scores across the selected grouping, plus a comparison table with sample counts.

Top & Bottom Requests

Tables showing the highest and lowest scoring requests for the selected time range. Each row includes:

  • Prompt snippet (first 200 characters)
  • Model used
  • Composite score and individual dimension scores
  • Timestamp

Click any row to navigate to the full request detail view. Use the count selector to show the top/bottom 10, 25, or 50 requests.

Custom Dimension Scores

If your organization has custom dimensions configured, they appear automatically:

  • As dashed lines in the time series chart.
  • As additional bars in the breakdown chart.

Organizations with no custom dimensions see no changes; the views pass through cleanly.

Dataset Evaluation Runs

Run controlled evaluations against your datasets to test prompt versions and compare models before deploying to production.

Creating a Run

From the evaluations page, click New Evaluation Run and configure:

| Field | Description |
| --- | --- |
| Dataset | Select from your existing datasets. |
| Prompt Version | (Optional) Select a specific prompt to use. |
| Models | Select up to 3 models to evaluate. One run is created per model. |
| Budget Limit | Maximum spend in dollars. The run stops if this limit is exceeded. |

An estimated cost is shown before starting, calculated from:

estimated_cost = dataset_rows × avg_tokens (500) × model_price_per_token

with a 60/40 input/output token split assumption.
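In code, the estimate works out to roughly the following. This sketch assumes per-token prices for input and output; `estimate_cost` and its parameters are hypothetical names, not part of the Floopy API.

```python
AVG_TOKENS_PER_ROW = 500  # the fixed per-row token assumption from the formula above

def estimate_cost(dataset_rows: int, input_price_per_token: float,
                  output_price_per_token: float) -> float:
    """Estimated run cost in dollars, using a 60/40 input/output token split."""
    input_tokens = dataset_rows * AVG_TOKENS_PER_ROW * 0.6
    output_tokens = dataset_rows * AVG_TOKENS_PER_ROW * 0.4
    return input_tokens * input_price_per_token + output_tokens * output_price_per_token
```

For a 150-row dataset at $2.50 per million input tokens and $10 per million output tokens, this yields about $0.41.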

Run Lifecycle

Each evaluation run progresses through these states:

| State | Description |
| --- | --- |
| pending | Run created, waiting to start processing. |
| running | Actively processing dataset rows through the gateway pipeline. |
| completed | All rows processed and results aggregated. |
| failed | An error occurred or the budget limit was exceeded. |
| cancelled | Manually cancelled by a user. |

During execution, the runner:

  1. Fetches dataset rows from the database.
  2. For each row, sends the request through the normal gateway pipeline with the specified model.
  3. Each response is scored by the feedback worker (same scoring as production traffic).
  4. Progress updates every 10 rows.
  5. Checks for cancellation between each row.

Budget Enforcement

If a budget limit is set in the run configuration:

  • Cumulative cost is tracked after each row completes.
  • If the cumulative cost exceeds the budget, the run stops immediately.
  • The run status is set to failed with the reason recorded in the results summary.
  • Evaluation run costs are tracked separately from production traffic.
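The runner loop with budget enforcement can be sketched roughly as below. `run_evaluation`, the gateway callback, and the result-dict shape are all hypothetical; the real runner also reports progress every 10 rows and checks for cancellation between rows.

```python
def run_evaluation(rows, send_through_gateway, budget_limit_cents=None):
    """Process rows in order; stop immediately if the budget is exceeded."""
    total_cost_cents = 0.0
    completed = 0
    for row in rows:
        result = send_through_gateway(row)  # scored like production traffic
        total_cost_cents += result["cost_cents"]
        completed += 1
        if budget_limit_cents is not None and total_cost_cents > budget_limit_cents:
            # Budget exceeded: mark the run failed, keeping partial results.
            return {"status": "failed", "reason": "budget_exceeded",
                    "completed_rows": completed, "cost_cents": total_cost_cents}
    return {"status": "completed", "completed_rows": completed,
            "cost_cents": total_cost_cents}
```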

Viewing Results

Each completed run produces a results summary with aggregated scores:

| Metric | Description |
| --- | --- |
| Mean | Average score across all evaluated rows. |
| P50 | Median score. |
| P95 | 95th percentile score. |
| Min | Lowest score in the run. |
| Max | Highest score in the run. |

These metrics are computed per dimension (relevance, coherence, helpfulness, safety, composite).
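A sketch of how these aggregates could be computed for one dimension. The exact percentile convention Floopy uses is not specified; a simple nearest-rank method is assumed here.

```python
import statistics

def summarize(scores: list) -> dict:
    """Aggregate a list of per-row scores into the results-summary metrics."""
    s = sorted(scores)

    def pct(p: float) -> float:
        # Nearest-rank percentile; the exact convention is an assumption.
        idx = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
        return s[idx]

    return {
        "mean": statistics.fmean(s),
        "p50": statistics.median(s),
        "p95": pct(95),
        "min": s[0],
        "max": s[-1],
    }
```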

The results page also shows:

  • Progress bar for running evaluations (completed rows / total rows).
  • Cost summary with total tokens and cost per model.
  • Cancel button for runs that are still in progress.

Comparing Results

When multiple models are evaluated on the same dataset:

  • Model comparison — Side-by-side score table showing each model’s mean scores per dimension. The best score per dimension is highlighted in green, the worst in red.
  • Version comparison — Compare the same model across different prompt versions from related runs on the same dataset.
  • Statistical significance warning — When sample size is below 30, a warning indicates that results may not be statistically significant.

API Reference

All evaluation endpoints require a valid API key with admin permissions in the Authorization: Bearer <key> header.

Create Evaluation Run

POST /v1/evaluations
curl -X POST https://api.floopy.ai/v1/evaluations \
  -H "Authorization: Bearer $FLOOPY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
    "model": "gpt-4o",
    "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
    "config": {
      "budget_limit_cents": 5000
    }
  }'

Request Body:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| dataset_id | string (UUID) | Yes | ID of the dataset to evaluate against. |
| model | string | Yes | Model identifier (e.g., gpt-4o, claude-sonnet-4-20250514). |
| prompt_id | string (UUID) | No | Prompt version to use. If omitted, uses the raw dataset prompts. |
| config | object | No | Run configuration. |
| config.budget_limit_cents | integer | No | Maximum budget in cents. Run stops if exceeded. |

Response: 201 Created

{
  "id": "r1234567-89ab-cdef-0123-456789abcdef",
  "status": "pending",
  "total_rows": 150,
  "completed_rows": 0,
  "created_at": "2026-04-10T14:30:00.000Z"
}

Get Evaluation Run

GET /v1/evaluations/:id
curl https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef \
  -H "Authorization: Bearer $FLOOPY_API_KEY"

Response: 200 OK

{
  "id": "r1234567-89ab-cdef-0123-456789abcdef",
  "organization_id": "550e8400-e29b-41d4-a716-446655440000",
  "dataset_id": "d1234567-89ab-cdef-0123-456789abcdef",
  "prompt_id": "p1234567-89ab-cdef-0123-456789abcdef",
  "model": "gpt-4o",
  "status": "completed",
  "config": { "budget_limit_cents": 5000 },
  "total_rows": 150,
  "completed_rows": 150,
  "results_summary": {
    "count": 150,
    "relevance": { "mean": 78.5 },
    "coherence": { "mean": 82.1 },
    "helpfulness": { "mean": 75.3 },
    "safety": { "mean": 94.8 },
    "composite_score": {
      "mean": 81.2,
      "p50": 83.0,
      "p95": 95.0,
      "min": 42.0,
      "max": 99.0
    }
  },
  "created_at": "2026-04-10T14:30:00.000Z",
  "completed_at": "2026-04-10T14:45:12.000Z"
}

Get Evaluation Results

GET /v1/evaluations/:id/results

Returns the aggregated results summary for a completed run. Response format matches the results_summary object above.

Cancel Evaluation Run

POST /v1/evaluations/:id/cancel
curl -X POST https://api.floopy.ai/v1/evaluations/r1234567-89ab-cdef-0123-456789abcdef/cancel \
  -H "Authorization: Bearer $FLOOPY_API_KEY"

Response: 200 OK

{
  "id": "r1234567-89ab-cdef-0123-456789abcdef",
  "status": "cancelled"
}

Only runs in pending or running state can be cancelled.


Troubleshooting

No scores appearing

  • Verify that traffic is flowing through the gateway (check the request logs page).
  • Scoring is automatic — no configuration needed for built-in dimensions.
  • Scores appear after the feedback worker processes batches (up to 30 seconds delay).

Regression alerts not firing

  • Ensure you have at least 50 scored requests in the last hour AND in the 7-day baseline.
  • Low-traffic organizations may not reach the sample size threshold.
  • Duplicate alerts are suppressed for 4 hours — check if a recent alert already exists.

Custom dimensions not scoring

  • Confirm your custom dimensions are marked as Active in organization settings.
  • Custom dimension evaluation uses an external LLM. If the LLM is rate-limited or unavailable, custom scores are skipped and the composite score is computed without them.
  • Changes to custom dimensions are cached for 5 minutes.

Evaluation run stuck in “running”

  • Check if the gateway is healthy and processing requests.
  • Large datasets may take time — monitor the progress bar (completed rows / total rows).
  • If the run appears stuck, cancel it and create a new one.

Evaluation run failed with budget error

  • The run exceeded the configured budget limit before completing all rows.
  • Increase the budget limit or reduce the dataset size.
  • Check the results summary for partial results from rows that completed before the budget was exceeded.

Comparison shows statistical significance warning

  • Results are based on fewer than 30 samples.
  • Run a larger dataset or wait for more production traffic before drawing conclusions.