Experiments
Overview
Experiments let you systematically evaluate different models and prompts against a test dataset. Instead of guessing which model or prompt works best, you can run a structured comparison and get scored results across multiple quality dimensions.
This is especially useful when deciding between providers, testing a new prompt version, or validating that a cost optimization does not degrade quality.
Creating an Experiment
To set up an experiment:
- Go to Experiments in the dashboard and click Create Experiment.
- Select a test dataset — a collection of input prompts with optional expected outputs.
- Choose the variants to compare. Each variant is a combination of model, provider, and prompt.
- Select a scoring preset or customize the scoring dimensions.
- Run the experiment.
Floopy sends each test input to every variant, collects the responses, and scores them automatically.
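The fan-out described above can be pictured with a minimal sketch. It is not Floopy's implementation: `run_experiment`, `call_model`, and `score` are hypothetical names, and the two callables stand in for whatever sends a request to a variant and scores the response.

```python
from typing import Callable

# A variant is a (model, provider, prompt) combination, as described above.
Variant = tuple[str, str, str]


def run_experiment(
    inputs: list[str],
    variants: list[Variant],
    call_model: Callable[[Variant, str], str],  # hypothetical: sends one input to one variant
    score: Callable[[str, str], float],         # hypothetical: scores a response for an input
) -> dict[Variant, list[float]]:
    """Send every test input to every variant and collect one score per response."""
    results: dict[Variant, list[float]] = {variant: [] for variant in variants}
    for variant in variants:
        for text in inputs:
            response = call_model(variant, text)
            results[variant].append(score(text, response))
    return results
```

Each variant ends up with one score per test input, which is the shape a per-variant comparison needs.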
Scoring Dimensions
Each response is scored on multiple dimensions:
- Relevance — how well the response addresses the input.
- Coherence — logical consistency and readability.
- Helpfulness — whether the response is actionable and useful.
- Safety — absence of harmful, biased, or inappropriate content.
- Cost efficiency — token usage and cost relative to response quality.
Scores are normalized to a 0-100 scale for easy comparison across variants.
Scoring Presets
Presets configure how dimensions are weighted in the overall score:
| Preset | Focus |
|---|---|
| Balanced | Equal weight across all dimensions. Good starting point. |
| Quality First | Prioritizes relevance, coherence, and helpfulness over cost. |
| Cost Optimized | Prioritizes cost efficiency while maintaining minimum quality thresholds. |
| Safety Critical | Heavily weights the safety dimension. Use for regulated or sensitive applications. |
You can also define custom weights if none of the presets match your needs.
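To make the idea of weighting concrete, here is a minimal sketch of a weighted overall score, assuming it is a weighted average of the per-dimension scores. The dimension keys, the weight values, and the `overall_score` function are illustrative assumptions, not Floopy's actual preset definitions, and the minimum quality thresholds mentioned for Cost Optimized are not modelled.

```python
# Dimension keys are assumed names for the five dimensions listed earlier.
DIMENSIONS = ("relevance", "coherence", "helpfulness", "safety", "cost_efficiency")

# Illustrative weights only; the real preset weights are not documented here.
BALANCED = {dim: 1.0 for dim in DIMENSIONS}
SAFETY_HEAVY = {**BALANCED, "safety": 4.0}


def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 0-100 dimension scores; the result stays on the 0-100 scale."""
    total_weight = sum(weights[dim] for dim in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * weights[dim] for dim in DIMENSIONS) / total_weight


example = {
    "relevance": 80.0,
    "coherence": 70.0,
    "helpfulness": 90.0,
    "safety": 100.0,
    "cost_efficiency": 60.0,
}
print(overall_score(example, BALANCED))      # 80.0
print(overall_score(example, SAFETY_HEAVY))  # 87.5
```

Dividing by the total weight keeps the result comparable across presets regardless of how the weights are scaled.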
Reading Results
The results page shows a comparison table with each variant’s scores broken down by dimension. You can sort by any dimension or the overall weighted score to find the best performer.
Click into a variant to see individual responses alongside the test inputs, so you can qualitatively review the output in addition to the automated scores.
Regression Alerts
Enable regression alerts to get notified when prompt quality drops. Floopy compares experiment results against a baseline and flags significant declines in any scoring dimension. This is useful for catching quality regressions after prompt edits or model updates.
Alerts are delivered through dashboard notifications and can be configured per experiment.
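A baseline comparison of this kind can be sketched as follows. How Floopy decides that a decline is "significant" is not documented here, so the fixed `min_drop` threshold of 5 points and the `find_regressions` name are assumptions made purely for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    min_drop: float = 5.0,  # assumed threshold in points on the 0-100 scale
) -> dict[str, float]:
    """Return {dimension: drop} for every dimension that fell by at least min_drop."""
    regressions: dict[str, float] = {}
    for dimension, baseline_score in baseline.items():
        if dimension not in current:
            continue  # nothing to compare against
        drop = baseline_score - current[dimension]
        if drop >= min_drop:
            regressions[dimension] = drop
    return regressions
```

For example, a baseline of `{"relevance": 85, "safety": 98}` against current scores of `{"relevance": 78, "safety": 97}` would flag only relevance, with a drop of 7 points.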
The dashboard journey
Floopy’s dashboard at app.floopy.ai/routing/experiments is the recommended path for running routing experiments end to end. There are four screens.
List view (/routing/experiments)
The list view opens on a filterable table of experiments scoped to your organisation, with columns for type (canary or shadow), status (draft, active, completed, rolled_back), baseline (provider, model), candidate (provider, model), and started/ended timestamps. The list dogfoods the GET /v1/experiments endpoint: the dashboard reads from the same endpoint that API clients use.
The “New experiment” button takes you to the create flow.
Create view (/routing/experiments/new)
A form for authoring a single experiment:
- Type (`canary` or `shadow`).
- Baseline `(provider, model)`: the control side.
- Candidate `(provider, model)`: the variant side.
- For canary experiments: traffic percentage on the candidate (`0..=100`).
- For shadow experiments: nothing else. Shadow always runs at 100% alongside the live traffic, but never serves the user.
Submitting the form calls POST /v1/experiments with the safety header X-Floopy-Confirm: experiments set unconditionally — the dashboard never lets a user create an experiment without it. The header is a deliberate, low-cost gate against accidental and drive-by abuse from leaked keys.
The dashboard also sends X-Floopy-Origin: api for normal create flows and X-Floopy-Origin: zeus_onboarding for the onboarding-driven shadow setup, plus the X-Floopy-Actor-User-Id header carrying the session user id (validated server-side). The closed allowlist for X-Floopy-Origin is {"api", "zeus_onboarding", "sdk"}.
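The create call can be sketched as a small request builder. The path and header names come from the text above. The JSON field names (`type`, `baseline`, `candidate`, `traffic_percent`) and the `build_create_request` function are assumptions, and authentication is left out because it is not described on this page. Consult the Experiments API reference for the real request schema.

```python
import json

# Closed allowlist for X-Floopy-Origin, as stated above.
ALLOWED_ORIGINS = {"api", "zeus_onboarding", "sdk"}


def build_create_request(actor_user_id, exp_type, baseline, candidate,
                         traffic_percent=None, origin="api"):
    """Build (but do not send) a POST /v1/experiments request description.

    baseline and candidate are (provider, model) pairs.
    The JSON body field names are hypothetical.
    """
    if exp_type not in ("canary", "shadow"):
        raise ValueError("type must be 'canary' or 'shadow'")
    if origin not in ALLOWED_ORIGINS:
        raise ValueError("origin is not in the allowlist")

    body = {
        "type": exp_type,
        "baseline": {"provider": baseline[0], "model": baseline[1]},
        "candidate": {"provider": candidate[0], "model": candidate[1]},
    }
    if exp_type == "canary":
        # Canary experiments take a candidate traffic percentage in 0..=100.
        if traffic_percent is None or not 0 <= traffic_percent <= 100:
            raise ValueError("canary experiments need traffic_percent in 0..=100")
        body["traffic_percent"] = traffic_percent
    elif traffic_percent is not None:
        # Shadow experiments take no extra settings.
        raise ValueError("shadow experiments do not take traffic_percent")

    headers = {
        "Content-Type": "application/json",
        "X-Floopy-Confirm": "experiments",   # safety header, always set
        "X-Floopy-Origin": origin,
        "X-Floopy-Actor-User-Id": actor_user_id,
    }
    return {"method": "POST", "path": "/v1/experiments",
            "headers": headers, "body": json.dumps(body)}
```

Returning a plain description instead of sending the request keeps the sketch independent of any particular HTTP client and makes the validation rules easy to test.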
Detail view (/routing/experiments/{id})
The experiment detail page fetches GET /v1/experiments/{id}/results on the server and renders:
- Header with type, status, and lifetime timestamps.
- Baseline panel: samples, average cost, composite quality, p50 latency.
- Candidate panel: same fields.
- Delta block: cost percentage, quality absolute, latency milliseconds.
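The delta block mixes three units, which a short sketch makes explicit: cost as a percentage change, quality as an absolute difference, and latency in milliseconds. The field names, the `compute_delta` function, and the candidate-minus-baseline sign convention are assumptions, not the documented response schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SidePanel:
    """One side of the comparison (baseline or candidate); field names are assumed."""
    samples: int
    avg_cost: float
    composite_quality: float
    p50_latency_ms: float


def compute_delta(baseline: SidePanel, candidate: SidePanel) -> dict[str, Optional[float]]:
    """Candidate minus baseline, in the units the delta block uses."""
    if baseline.avg_cost == 0:
        cost_pct = None  # a percentage change is undefined against a zero baseline
    else:
        cost_pct = (candidate.avg_cost - baseline.avg_cost) / baseline.avg_cost * 100
    return {
        "cost_pct": cost_pct,
        "quality_abs": candidate.composite_quality - baseline.composite_quality,
        "latency_ms": candidate.p50_latency_ms - baseline.p50_latency_ms,
    }
```

Under this convention a negative `cost_pct` or `latency_ms` means the candidate is cheaper or faster, and a positive `quality_abs` means it scored higher.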
For an active experiment the panels refresh on a polling cadence: every 20 seconds for the first 10 minutes, every 60 seconds afterwards. Polling pauses entirely when the browser tab is hidden (document.visibilityState === 'hidden') so a backgrounded tab does not consume request budget.
The endpoint is Redis-cached server-side for 30 seconds, so even at the 20-second cadence most polls are served from the cache rather than the analytics store.
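The polling schedule reduces to a small pure function, sketched here in Python for illustration (the dashboard itself runs in the browser). The `next_poll_delay` name is hypothetical, and two details are assumptions: the 10-minute window is measured from when the page started polling, and the switch to 60 seconds happens at exactly 600 seconds.

```python
from typing import Optional

FAST_INTERVAL_S = 20    # cadence during the first 10 minutes
SLOW_INTERVAL_S = 60    # cadence afterwards
FAST_WINDOW_S = 10 * 60


def next_poll_delay(elapsed_seconds: float, tab_hidden: bool) -> Optional[int]:
    """Seconds to wait before the next poll, or None when polling is paused."""
    if tab_hidden:
        return None  # a backgrounded tab does not poll at all
    if elapsed_seconds < FAST_WINDOW_S:
        return FAST_INTERVAL_S
    return SLOW_INTERVAL_S
```

Returning `None` for a hidden tab lets the caller skip scheduling entirely and resume when the tab becomes visible again.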
Rollback dialog
The detail view’s “Roll back” button opens a confirmation dialog. Confirming calls POST /v1/experiments/{id}/rollback with the same X-Floopy-Confirm: experiments header and the same origin/actor headers as the create flow. The rollback is itself an audit event and clears the shadow-validation cache so a recovered route runs hot.
Plan Requirements
Experiments and the new results endpoint are available on the Pro plan. Check your current plan under Settings > Billing.
See also
- Experiments API: list, create, rollback.
- GET /v1/experiments/{id}/results: aggregated baseline-vs-candidate results.
- Constraints feature: how `require_shadow_before_live` and `max_cost_drop_without_validation` interact with shadow experiments.