Baseline-vs-Floopy methodology

The Baseline-vs-Floopy comparison view answers one question on a customer’s own traffic: does Floopy actually deliver lower cost without losing quality, compared to a single-default-model baseline?

This page explains how the comparison is constructed, what counts as the baseline, how the deltas are computed, and the methodology caveats — so the numbers are interpretable and not gamed.

The view in the Floopy dashboard renders two panels for the same time window:

  • Left — Floopy: actual outcomes for requests that ran through Floopy’s routing.
  • Right — Baseline: counterfactual outcomes computed as if every request in the same window had been sent to the organisation’s default_model.

Each panel reports three numbers:

  • average cost per request (in micro-USD),
  • p50 latency (milliseconds),
  • composite quality (a weighted average of LLM-as-judge scoring and session NPS where present, normalised to 0..=100).

The delta between the two panels is the headline number: “Floopy delivered X% lower cost at Y points of composite quality vs. baseline.”
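As a worked example with hypothetical panel numbers (not real Floopy output), the headline delta is simple arithmetic over the two panels:

```python
# Hypothetical panel numbers, for illustration only.
floopy_cost, baseline_cost = 310.0, 540.0      # average micro-USD per request
floopy_quality, baseline_quality = 86.0, 88.0  # composite quality, 0..=100

cost_saving_pct = (baseline_cost - floopy_cost) / baseline_cost * 100  # about 42.6% lower cost
quality_delta_points = floopy_quality - baseline_quality              # -2.0: two points below baseline
```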

The baseline is the organisation’s default_model as configured on the routing rule that was in effect for each request in the window.

  • If multiple routing rules applied during the window (e.g. a config change mid-week), each request is compared against the default_model that was active for that request.
  • If a request had no routing rule (legacy-model path), it is excluded from the comparison — neither panel counts it.
  • If a request was a cache hit (outcome.cache_hit == true), it is also excluded — there was no provider call to compare costs against.
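A minimal sketch of these eligibility rules, assuming each decision row records the routing rule (and its default_model) that was in effect when the request ran; the field names here are illustrative, not a documented schema:

```python
def baseline_model_for(row):
    # Each request is compared against the default_model active for it,
    # even if the routing rule changed partway through the window.
    rule = row.get("routing_rule")
    return rule["default_model"] if rule else None

def is_comparable(row):
    if baseline_model_for(row) is None:    # legacy-model path: no rule, nothing to compare
        return False
    if row["outcome"]["cache_hit"]:        # cache hit: no provider call to compare against
        return False
    return True
```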

The baseline is not:

  • The cheapest model in the organisation’s catalogue.
  • A simulated “what would the best model have done” — Floopy never claims to know the optimal counterfactual outcome.
  • A leaderboard average across all Floopy customers.

The cost panels are direct, not modelled.

  • Floopy panel: sum outcome.cost_micro_usd over decisions with a winner in the window, divided by request count.
  • Baseline panel: for each decision in the window, look up the priced cost of running the same prompt-and-response token shape on the default_model (using the prompt and completion token counts already recorded by Floopy’s logging path, not a fresh provider call), then aggregate.

The baseline panel uses the priced cost of the default_model at the same point in time as the original request — provider price changes during the window are reflected in both panels equally.
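A sketch of both panels' cost math over the eligible rows, assuming the recorded prompt and completion token counts are present on each row and that price_per_token() is a hypothetical lookup returning the default_model's per-token prices as of the request timestamp:

```python
def floopy_avg_cost_micro_usd(rows):
    # Direct: the observed cost of each winning call, averaged over the window.
    return sum(r["outcome"]["cost_micro_usd"] for r in rows) / len(rows)

def baseline_avg_cost_micro_usd(rows, price_per_token):
    # Counterfactual: reprice the same token shape on the rule's default_model,
    # using the provider price in effect at the original request's timestamp.
    total = 0.0
    for r in rows:
        prices = price_per_token(r["default_model"], r["timestamp"])
        total += (r["prompt_tokens"] * prices["prompt"]
                  + r["completion_tokens"] * prices["completion"])
    return total / len(rows)
```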

The composite quality score for the window is a weighted average:

  • LLM-as-judge scoring on the response (when scoring is enabled for the route).
  • Session NPS feedback (when a floopy-session-id was provided and the session has a feedback row in request_feedback).
  • Admin overrides on the request (when present).

The Floopy panel sees the actual scored responses. The baseline panel does not have a counterfactual response to score — running the prompt against default_model after the fact would be both expensive and outside the gateway’s contract. Instead, the baseline quality number is the historical composite quality of default_model on similar requests in the same window, weighted by the same task-complexity bucket mix the routing layer used.

This is the most important methodology caveat: the baseline quality is a historical-aggregate proxy, not a per-request counterfactual.
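A sketch of how the two quality numbers could be assembled under that caveat; the weights and field names are illustrative assumptions, since only the three signal sources and the 0..=100 normalisation are documented:

```python
WEIGHTS = {"judge_score": 0.6, "session_nps": 0.3, "admin_override": 0.1}  # hypothetical weights

def floopy_composite_quality(rows):
    # Actual scored responses: average whichever signals are present per request.
    per_request = []
    for r in rows:
        present = [(r[key], w) for key, w in WEIGHTS.items() if r.get(key) is not None]
        if present:
            per_request.append(sum(s * w for s, w in present) / sum(w for _, w in present))
    return sum(per_request) / len(per_request)

def baseline_composite_quality(historical_quality, bucket_mix):
    # Historical-aggregate proxy: default_model's past composite quality per
    # task-complexity bucket, weighted by the window's bucket mix.
    return sum(historical_quality[bucket] * share for bucket, share in bucket_mix.items())
```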

The comparison is honest about what it cannot prove.

  1. The baseline quality is aggregate, not per-request. We do not re-run prompts against the baseline model; we rely on historical scoring of the baseline on traffic of the same task-complexity class. A small per-request residual is unavoidable.
  2. No counterfactual on would_select. We do not show “if you had asked X instead of Y, the result would have been Z” — that requires replaying the prompt, which we explicitly do not do.
  3. Cache hits are excluded from both panels. A cache hit is a 100% cost saving that has nothing to do with model selection; including it would inflate the Floopy panel’s cost delta misleadingly.
  4. Legacy-model requests are excluded. A request that fell through to legacy parsing did not run through Floopy’s routing — there is nothing to compare.
  5. Free-tier shared-pool decisions are flagged, not excluded. When used_shared_pool_prior == true for a row in the window, the panel renders an “aggregation notice” badge linking to Confidence methodology. The numbers are kept; the provenance is visible.
  6. Window minimum sample size. When the window contains fewer than N_MIN_COMPARE = 200 decisions on a given route, the route’s deltas are hidden behind a “not enough data” affordance — the dashboard refuses to surface a number it does not believe.
  7. The view does not show a confidence interval on the deltas. v1 ships point estimates. A future iteration may add bootstrapped intervals; we did not want to ship a misleading single CI in v1.

The dashboard renders a short methodology summary above the panels at all times — there is no hover-to-reveal. The intent is that a reviewer can read the panel and the methodology together, in one screen, without clicking through. The full content here is what the blurb links to.

For external auditors who want to recompute the Floopy panel from raw exports:

  1. Pull the window via GET /v1/export/decisions.
  2. Filter to rows where winner != null and outcome.cache_hit == false and routing_strategy != "legacy_model".
  3. Sum outcome.cost_micro_usd and divide by the row count for the cost panel.
  4. p50-aggregate outcome.latency_ms for the latency number.
  5. The composite quality number requires the LLM-as-judge and session-NPS scoring tables; those are surfaced via the gateway’s logging path and are visible in the Floopy dashboard but are not part of the export wire shape today.
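A sketch of steps 1-4, assuming the export has been saved locally as decisions.json containing a JSON array of decision rows with the fields named above:

```python
import json
from statistics import median

with open("decisions.json") as f:
    rows = json.load(f)

eligible = [
    r for r in rows
    if r["winner"] is not None
    and not r["outcome"]["cache_hit"]
    and r["routing_strategy"] != "legacy_model"
]

avg_cost_micro_usd = sum(r["outcome"]["cost_micro_usd"] for r in eligible) / len(eligible)
p50_latency_ms = median(r["outcome"]["latency_ms"] for r in eligible)
print(avg_cost_micro_usd, p50_latency_ms)
```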

The baseline panel cannot be reproduced from the export alone — it requires the historical scoring tables for the org’s default_model, which Floopy renders inside the dashboard.

In v2 the dashboard’s Baseline-vs-Floopy view also surfaces a Verification Status card for each routing rule. The card answers a strict, narrow question: given this rule’s last 7 days of traffic on the customer’s own account, do we have enough evidence to call it verified?

The same answer is exposed programmatically by GET /v1/optimization/verification.
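A sketch of calling that endpoint; the base URL, auth header, and response field names ("route_id", "state") are assumptions, since only the path, the four state values, and the Cache-Control echo are documented:

```python
import requests

resp = requests.get(
    "https://api.floopy.example/v1/optimization/verification",  # placeholder base URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
resp.raise_for_status()
print(resp.headers.get("Cache-Control"))    # echoed as max-age=60 (see the caching notes below)
for card in resp.json():                    # assumed: one entry per routing rule
    print(card["route_id"], card["state"])  # state is one of the four values listed below
```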

The card reports one of four states:

  • verified: Both the baseline panel and the Floopy panel meet the sample floor, no regressions are in flight, and Floopy quality is within tolerance of baseline quality.
  • not_verified: Both panels meet the sample floor, no regressions are in flight, but Floopy quality is below baseline quality by more than the tolerance.
  • insufficient_data: Either panel has fewer rows than the sample floor in the 7-day window. The honest answer when there is not enough traffic to make a verdict.
  • regression_detected: The bucketed recent_regressions signal over the route is non-zero in the 7-day window. The card refuses to call the rule “verified” while a regression is in flight, regardless of the cost/quality numbers.

The constants that define the four states are pinned in code, not customer-configurable:

  • SAMPLE_FLOOR = 100. Each panel must have at least 100 rows in the 7-day window for the verdict to be eligible for verified or not_verified. Below this floor the answer is always insufficient_data. The floor exists because cost and quality deltas computed on tens of rows are too noisy to act on.
  • QUALITY_TOLERANCE = 0.03. The verdict is not_verified when (baseline.composite_quality − floopy.composite_quality) > 0.03 on the [0.0, 1.0] composite-quality scale. Below that delta the verdict is verified. The tolerance is conservative on purpose — Floopy is allowed to be a hair below baseline on quality if it is meaningfully below baseline on cost.

The regression_detected state fires whenever the bucketed recent_regressions signal over the route is anything other than Exact { exact: 0 }. That is: any non-zero exact count, any AtLeast{10}, any AtLeast{50} — the verdict flips to regression_detected.
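Putting the constants and the regression override together, a sketch of the four-state verdict. Inputs are per-panel row counts for the 7-day window, composite quality on the [0.0, 1.0] scale, and the regression signal reduced to a lower-bound count; the precedence of regression_detected over insufficient_data is an assumption:

```python
SAMPLE_FLOOR = 100
QUALITY_TOLERANCE = 0.03

def verification_state(floopy_rows, baseline_rows,
                       floopy_quality, baseline_quality,
                       regressions_lower_bound):
    if regressions_lower_bound > 0:   # any signal other than Exact { exact: 0 }
        return "regression_detected"
    if floopy_rows < SAMPLE_FLOOR or baseline_rows < SAMPLE_FLOOR:
        return "insufficient_data"
    if (baseline_quality - floopy_quality) > QUALITY_TOLERANCE:
        return "not_verified"
    return "verified"
```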

The bucket boundaries (<10, >=10, >=50) and the 5-minute timestamp rounding on last_regression_at are documented on Confidence methodology.

The verification verdict is computed on a tenant-scoped aggregation over your request log with a 5-second wall-clock timeout, then Redis-cached per (organization_id, route_id, window) for 60 seconds. The dashboard card and the GET /v1/optimization/verification API call both hit the same cache; the Cache-Control: max-age=60 HTTP header is echoed on the API path so HTTP intermediaries can honour the same TTL.
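A minimal sketch of that read-through cache, assuming a Redis client and a compute_verdict function that runs the tenant-scoped aggregation under the 5-second budget; the key format is illustrative:

```python
import json
import redis

r = redis.Redis()
VERDICT_TTL_S = 60

def cached_verdict(organization_id, route_id, window, compute_verdict):
    key = f"verification:{organization_id}:{route_id}:{window}"
    hit = r.get(key)
    if hit is not None:                 # dashboard card and API share this entry
        return json.loads(hit)
    verdict = compute_verdict(organization_id, route_id, window)
    r.setex(key, VERDICT_TTL_S, json.dumps(verdict))
    return verdict
```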

Verification numbers shift slowly. A 60-second cache is short enough that a customer who fixes a misconfigured route does not have to wait 30 minutes to see the verdict update, and long enough that the analytics store is not hit on every dashboard render.

The verification verdict is separate from the headline cost/quality delta rendered on the rest of the page. The two views answer different questions:

  • The cost/quality delta says: here is the rolling difference between Floopy and the baseline, in micro-USD per request and in composite quality points.
  • The verification verdict says: do we have enough evidence to claim this rule is keeping its end of the deal?

A rule can show a 45% cost saving and still read not_verified if the quality delta exceeds tolerance, or regression_detected if a regression event is in flight. Both numbers are honest; the verdict is the stricter, more conservative read.