Experiments
Overview
Experiments let you systematically evaluate different models and prompts against a test dataset. Instead of guessing which model or prompt works best, you can run a structured comparison and get scored results across multiple quality dimensions.
This is especially useful when deciding between providers, testing a new prompt version, or validating that a cost optimization does not degrade quality.
Creating an Experiment
To set up an experiment:
1. Go to Experiments in the dashboard and click Create Experiment.
2. Select a test dataset — a collection of input prompts with optional expected outputs.
3. Choose the variants to compare. Each variant is a combination of model, provider, and prompt.
4. Select a scoring preset or customize the scoring dimensions.
5. Run the experiment.
Floopy sends each test input to every variant, collects the responses, and scores them automatically.
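Conceptually, one experiment run is the cross product of test inputs and variants. A minimal sketch of that fan-out (the variant and dataset shapes here are invented for illustration, not Floopy's API):

```python
from itertools import product

# Hypothetical variants: each is a (model, provider, prompt version) combination.
variants = [
    ("model-a", "provider-x", "prompt-v1"),
    ("model-b", "provider-y", "prompt-v1"),
]

# Hypothetical test dataset: input prompts with optional expected outputs.
test_inputs = [
    {"input": "Summarize this ticket", "expected": None},
    {"input": "Classify this email", "expected": "spam"},
]

# Every test input goes to every variant, so an experiment makes
# len(test_inputs) * len(variants) calls in total.
runs = list(product(test_inputs, variants))
print(len(runs))  # 4
```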
Scoring Dimensions
Each response is scored on multiple dimensions:
- Relevance — how well the response addresses the input.
- Coherence — logical consistency and readability.
- Helpfulness — whether the response is actionable and useful.
- Safety — absence of harmful, biased, or inappropriate content.
- Cost efficiency — token usage and cost relative to response quality.
Scores are normalized to a 0-100 scale for easy comparison across variants.
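As a mental model for the normalization, a simple min-max rescale onto 0-100 looks like this (Floopy's exact formula is not documented here; this is only an illustration):

```python
def normalize(raw: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Rescale a raw score from the [lo, hi] range onto 0-100."""
    return 100.0 * (raw - lo) / (hi - lo)

print(normalize(0.5))        # 50.0 for a raw score in [0, 1]
print(normalize(3, hi=4.0))  # 75.0 for a raw score in [0, 4]
```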
Scoring Presets
Presets configure how dimensions are weighted in the overall score:
| Preset | Focus |
|---|---|
| Balanced | Equal weight across all dimensions. Good starting point. |
| Quality First | Prioritizes relevance, coherence, and helpfulness over cost. |
| Cost Optimized | Prioritizes cost efficiency while maintaining minimum quality thresholds. |
| Safety Critical | Heavily weights the safety dimension. Use for regulated or sensitive applications. |
You can also define custom weights if none of the presets match your needs.
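Conceptually, a preset is just a set of per-dimension weights, and the overall score is their weighted average. A sketch with made-up weights (the actual preset values are internal to Floopy):

```python
# Illustrative weights only; the real preset values are not published.
PRESETS = {
    "balanced":      {"relevance": 1, "coherence": 1, "helpfulness": 1,
                      "safety": 1, "cost_efficiency": 1},
    "quality_first": {"relevance": 30, "coherence": 25, "helpfulness": 30,
                      "safety": 10, "cost_efficiency": 5},
}

def overall_score(scores: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted average of 0-100 dimension scores."""
    total = sum(weights.values())
    return sum(scores[d] * w for d, w in weights.items()) / total

scores = {"relevance": 90, "coherence": 80, "helpfulness": 85,
          "safety": 95, "cost_efficiency": 40}
print(overall_score(scores, PRESETS["balanced"]))       # 78.0
print(overall_score(scores, PRESETS["quality_first"]))  # 84.0
```

Note how the same variant scores 78 under Balanced but 84 under Quality First, because the cost-efficiency weakness is down-weighted.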
Reading Results
The results page shows a comparison table with each variant’s scores broken down by dimension. You can sort by any dimension or the overall weighted score to find the best performer.
Click into a variant to see individual responses alongside the test inputs, so you can qualitatively review the output in addition to the automated scores.
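The sort behind the comparison table is ordinary ordering by one field. As a mental model (the variant names and numbers below are invented):

```python
# Hypothetical results table: one score dict per variant.
results = [
    {"variant": "model-a / prompt-v1", "relevance": 92, "overall": 84.0},
    {"variant": "model-b / prompt-v1", "relevance": 88, "overall": 86.5},
]

# Sort descending by the overall weighted score to find the best performer;
# swapping the key to any dimension sorts by that dimension instead.
best_first = sorted(results, key=lambda r: r["overall"], reverse=True)
print(best_first[0]["variant"])  # model-b / prompt-v1
```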
Regression Alerts
Enable regression alerts to get notified when prompt quality drops. Floopy compares experiment results against a baseline and flags significant declines in any scoring dimension. This is useful for catching quality regressions after prompt edits or model updates.
Alerts are delivered as dashboard notifications and can be configured per experiment.
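The regression check can be thought of as a per-dimension comparison against the baseline with a decline threshold. A hedged sketch (the threshold value and comparison logic here are illustrative, not Floopy's actual defaults):

```python
# Assumed threshold: flag drops of more than 5 points on the 0-100 scale.
THRESHOLD = 5.0

def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     threshold: float = THRESHOLD) -> dict[str, float]:
    """Return the dimensions whose score dropped by more than `threshold`."""
    return {dim: baseline[dim] - current[dim]
            for dim in baseline
            if baseline[dim] - current[dim] > threshold}

baseline = {"relevance": 90, "coherence": 85, "safety": 98}
current  = {"relevance": 82, "coherence": 84, "safety": 97}
print(find_regressions(baseline, current))  # {'relevance': 8}
```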
Plan Requirements
Experiments require a plan with the `has_experiments` feature enabled. Check your current plan under Settings > Billing.