
Experiments

Experiments let you systematically evaluate different models and prompts against a test dataset. Instead of guessing which model or prompt works best, you can run a structured comparison and get scored results across multiple quality dimensions.

This is especially useful when deciding between providers, testing a new prompt version, or validating that a cost optimization does not degrade quality.

To set up an experiment:

  1. Go to Experiments in the dashboard and click Create Experiment.
  2. Select a test dataset — a collection of input prompts with optional expected outputs.
  3. Choose the variants to compare. Each variant is a combination of model, provider, and prompt.
  4. Select a scoring preset or customize the scoring dimensions.
  5. Run the experiment.

Floopy sends each test input to every variant, collects the responses, and scores them automatically.
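The fan-out described above can be pictured with a small sketch. This is not Floopy's implementation: the Variant class, run_experiment, and the complete callback are hypothetical names used only to show the shape of "every input goes to every variant".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Variant:
    # A variant is a combination of model, provider, and prompt.
    provider: str
    model: str
    prompt: str


def run_experiment(
    inputs: list[str],
    variants: list[Variant],
    complete: Callable[[Variant, str], str],
) -> dict[tuple[int, int], str]:
    """Send every test input to every variant and collect the responses.

    `complete` is a hypothetical stand-in for whatever actually calls the model.
    Responses are keyed by (input index, variant index).
    """
    responses: dict[tuple[int, int], str] = {}
    for i, text in enumerate(inputs):
        for j, variant in enumerate(variants):
            responses[(i, j)] = complete(variant, text)
    return responses


def fake_complete(variant: Variant, text: str) -> str:
    # Placeholder completion so the sketch runs without any network access.
    return f"[{variant.model}] {variant.prompt}: {text}"


variants = [
    Variant("provider-a", "model-x", "Answer briefly"),
    Variant("provider-b", "model-y", "Answer briefly"),
]
results = run_experiment(
    ["What is 2 + 2?", "Name a primary colour."], variants, fake_complete
)
print(len(results))  # 2 inputs x 2 variants = 4 responses
```

Each collected response would then be passed to the scorer, so the number of scored responses grows as inputs times variants.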

Each response is scored on multiple dimensions:

  • Relevance — how well the response addresses the input.
  • Coherence — logical consistency and readability.
  • Helpfulness — whether the response is actionable and useful.
  • Safety — absence of harmful, biased, or inappropriate content.
  • Cost efficiency — token usage and cost relative to response quality.

Scores are normalized to a 0-100 scale for easy comparison across variants.

Presets configure how dimensions are weighted in the overall score:

Preset | Focus
Balanced | Equal weight across all dimensions. Good starting point.
Quality First | Prioritizes relevance, coherence, and helpfulness over cost.
Cost Optimized | Prioritizes cost efficiency while maintaining minimum quality thresholds.
Safety Critical | Heavily weights the safety dimension. Use for regulated or sensitive applications.

You can also define custom weights if none of the presets match your needs.
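To make the idea of weighting concrete, here is a minimal sketch of an overall score computed as a weighted mean of the 0-100 dimension scores. The weight values, the dictionary keys, and the choice of a weighted mean are all assumptions for illustration; the actual preset weights and aggregation formula are not documented here.

```python
DIMENSIONS = ("relevance", "coherence", "helpfulness", "safety", "cost_efficiency")

# Hypothetical weights, chosen only to illustrate the presets' stated focus.
PRESETS = {
    "balanced": {d: 1.0 for d in DIMENSIONS},
    "quality_first": {
        "relevance": 3.0,
        "coherence": 3.0,
        "helpfulness": 3.0,
        "safety": 1.0,
        "cost_efficiency": 0.5,
    },
}


def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of 0-100 dimension scores; the result is also on 0-100."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


scores = {
    "relevance": 90.0,
    "coherence": 80.0,
    "helpfulness": 70.0,
    "safety": 100.0,
    "cost_efficiency": 60.0,
}
print(overall_score(scores, PRESETS["balanced"]))  # 80.0
```

Because the weights are divided by their sum, custom weights can be given on any scale, and only their ratios matter.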

The results page shows a comparison table with each variant’s scores broken down by dimension. You can sort by any dimension or the overall weighted score to find the best performer.

Click into a variant to see individual responses alongside the test inputs, so you can qualitatively review the output in addition to the automated scores.

Enable regression alerts to get notified when prompt quality drops. Floopy compares experiment results against a baseline and flags significant declines in any scoring dimension. This is useful for catching quality regressions after prompt edits or model updates.

Alerts are delivered via the dashboard notifications and can be configured per experiment.
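As a rough illustration of the baseline comparison, the sketch below flags any dimension whose score fell by at least a fixed number of points. The find_regressions function and the 5-point threshold are assumptions; how Floopy decides that a decline is "significant" is not specified here.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 5.0,  # assumed cut-off, in points on the 0-100 scale
) -> dict[str, float]:
    """Return {dimension: drop} for dimensions that fell by at least `threshold`."""
    drops: dict[str, float] = {}
    for dimension, base_score in baseline.items():
        if dimension not in current:
            continue  # nothing to compare against for this dimension
        drop = base_score - current[dimension]
        if drop >= threshold:
            drops[dimension] = drop
    return drops


baseline = {"relevance": 88.0, "coherence": 91.0, "safety": 99.0}
current = {"relevance": 80.0, "coherence": 90.0, "safety": 99.0}
print(find_regressions(baseline, current))  # {'relevance': 8.0}
```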

Floopy’s dashboard at app.floopy.ai/routing/experiments is the recommended path for running routing experiments end to end. There are four screens.

The experiments list is a filterable table of experiments scoped to your organisation. Its columns cover type (canary or shadow), status (draft, active, completed, rolled_back), baseline (provider, model), candidate (provider, model), and started/ended timestamps. The list is backed by the GET /v1/experiments endpoint, the same one the API exposes.

The “New experiment” button takes you to the create flow.

The create screen is a form for authoring a single experiment:

  • Type (canary or shadow).
  • Baseline (provider, model) — the control side.
  • Candidate (provider, model) — the variant side.
  • For canary experiments: traffic percentage on the candidate (0 to 100, inclusive).
  • For shadow experiments: nothing else. Shadow always runs at 100% alongside the live traffic, but never serves the user.

Submitting the form calls POST /v1/experiments with the safety header X-Floopy-Confirm: experiments set unconditionally; the dashboard never lets a user create an experiment without it. The header is a deliberate, low-cost gate against accidental creation and against drive-by abuse from leaked keys.

The dashboard also sends X-Floopy-Origin: api for normal create flows and X-Floopy-Origin: zeus_onboarding for the onboarding-driven shadow setup, plus the X-Floopy-Actor-User-Id header carrying the session user id (validated server-side). The closed allowlist for X-Floopy-Origin is {"api", "zeus_onboarding", "sdk"}.
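For callers building the same request outside the dashboard, the header set described above might be assembled as follows. The header names and the origin allowlist come from this page; the function names, the error handling, and the client-side traffic check are assumptions, and authentication is left out because it is not described here.

```python
from typing import Optional

# Closed allowlist for X-Floopy-Origin, as listed above.
ALLOWED_ORIGINS = frozenset({"api", "zeus_onboarding", "sdk"})


def experiment_headers(origin: str, actor_user_id: str) -> dict[str, str]:
    """Build the safety, origin, and actor headers for experiment writes."""
    if origin not in ALLOWED_ORIGINS:
        raise ValueError(f"origin must be one of {sorted(ALLOWED_ORIGINS)}")
    return {
        "X-Floopy-Confirm": "experiments",
        "X-Floopy-Origin": origin,
        "X-Floopy-Actor-User-Id": actor_user_id,
    }


def check_traffic(experiment_type: str, traffic_percentage: Optional[int]) -> None:
    """Hypothetical client-side check mirroring the form rules above."""
    if experiment_type == "canary":
        if traffic_percentage is None or not 0 <= traffic_percentage <= 100:
            raise ValueError("canary needs a traffic percentage from 0 to 100")
    elif experiment_type == "shadow":
        if traffic_percentage is not None:
            raise ValueError("shadow takes no traffic percentage")
    else:
        raise ValueError("type must be 'canary' or 'shadow'")


headers = experiment_headers("api", "user_123")  # "user_123" is a made-up id
check_traffic("canary", 10)
print(headers["X-Floopy-Confirm"])  # experiments
```

Since the actor id is validated server-side, a client-side check like this only catches mistakes early; it is not a substitute for the server's own validation.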

The experiment detail page server-side-fetches GET /v1/experiments/{id}/results and renders:

  • Header with type, status, and lifetime timestamps.
  • Baseline panel: samples, average cost, composite quality, p50 latency.
  • Candidate panel: same fields.
  • Delta block: cost percentage, quality absolute, latency milliseconds.
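The delta block can be read as three simple comparisons between the two panels. The sketch below assumes each delta is candidate minus baseline, with cost expressed relative to the baseline; the Panel class and its field names are hypothetical and do not reflect the actual response schema of the results endpoint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Panel:
    # Hypothetical field names for the per-side figures listed above.
    samples: int
    avg_cost: float
    composite_quality: float
    p50_latency_ms: float


@dataclass
class Delta:
    cost_pct: Optional[float]  # None when the baseline cost is zero
    quality_abs: float
    latency_ms: float


def compute_delta(baseline: Panel, candidate: Panel) -> Delta:
    """Candidate minus baseline: cost as a percentage, quality and latency absolute."""
    if baseline.avg_cost == 0:
        cost_pct = None  # a percentage change is undefined against a zero baseline
    else:
        cost_pct = (candidate.avg_cost - baseline.avg_cost) / baseline.avg_cost * 100
    return Delta(
        cost_pct=cost_pct,
        quality_abs=candidate.composite_quality - baseline.composite_quality,
        latency_ms=candidate.p50_latency_ms - baseline.p50_latency_ms,
    )


baseline_panel = Panel(samples=1200, avg_cost=0.5, composite_quality=0.82, p50_latency_ms=800.0)
candidate_panel = Panel(samples=1180, avg_cost=0.25, composite_quality=0.85, p50_latency_ms=650.0)
delta = compute_delta(baseline_panel, candidate_panel)
print(delta.cost_pct, delta.latency_ms)  # -50.0 -150.0
```

Under this sign convention a negative cost or latency delta favours the candidate, while a negative quality delta favours the baseline.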

For an active experiment the panels refresh on a polling cadence: every 20 seconds for the first 10 minutes, every 60 seconds afterwards. Polling pauses entirely when the browser tab is hidden (document.visibilityState === 'hidden') so a backgrounded tab does not consume request budget.
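The cadence rule is small enough to express as a single function. This sketch assumes the 10-minute window is measured from when polling started and that the switch to the slower interval happens exactly at the 10-minute mark; next_poll_delay and its parameters are illustrative names, not part of the dashboard code.

```python
from typing import Optional

FAST_INTERVAL_S = 20      # first 10 minutes
SLOW_INTERVAL_S = 60      # afterwards
FAST_WINDOW_S = 10 * 60   # assumed to be measured from the start of polling


def next_poll_delay(elapsed_s: float, tab_hidden: bool) -> Optional[int]:
    """Seconds until the next poll, or None while polling is paused."""
    if tab_hidden:
        return None  # a hidden tab does not poll at all
    if elapsed_s < FAST_WINDOW_S:
        return FAST_INTERVAL_S
    return SLOW_INTERVAL_S


print(next_poll_delay(30, tab_hidden=False))   # 20
print(next_poll_delay(900, tab_hidden=False))  # 60
print(next_poll_delay(30, tab_hidden=True))    # None
```

In a browser the tab_hidden flag would come from document.visibilityState, with polling resumed on the next visibilitychange event.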

The endpoint is Redis-cached server-side at 30 seconds, so high-cadence polling never hits the analytics store.

The detail view’s “Roll back” button opens a confirmation dialog. Confirming calls POST /v1/experiments/{id}/rollback with the same X-Floopy-Confirm: experiments header and the same origin/actor headers as the create flow. The rollback is itself an audit event and clears the shadow-validation cache so a recovered route runs hot.

Experiments and the new results endpoint are available on the Pro plan. Check your current plan under Settings > Billing.