
Confidence methodology

Every routing decision Floopy makes carries a confidence field — the router’s self-reported belief that its top choice is the right one for this organisation’s traffic right now. This page explains exactly how that number is computed, what each confidence_reason means, and why null is sometimes the honest answer.

Confidence is a heuristic, not a probability. A confidence of 0.78 does not mean “78% chance the router was correct” — it means “the inputs the router has on this candidate (gap to runner-up, sample count, outcome variance) are stronger than the inputs on most other recent decisions in this org.”

We expose confidence so customers can build review queues, alerting, and the confidence_threshold constraint on top of an explicit signal — not so it can be treated as a probability of correctness.

Confidence is computed entirely from the requesting organisation’s own inputs (its candidates, its sample counts, its variance). It does not leak cross-tenant information.

```
gap_norm = clamp(gap_top2 / GAP_REF, 0.0, 1.0)              // GAP_REF = 0.20
n_norm   = clamp(log1p(n_samples) / log1p(N_REF), 0.0, 1.0) // N_REF   = 30
var_norm = 1.0 - clamp(variance / VAR_REF, 0.0, 1.0)        // VAR_REF = 0.25

raw = W_GAP*gap_norm + W_N*n_norm + W_VAR*var_norm
// W_GAP = 0.45, W_N = 0.35, W_VAR = 0.20

if phase == Day0:                  return min(raw, CAP_DAY0)   // CAP_DAY0   = 0.6
if used_shared_pool_prior == true: return min(raw, CAP_SHARED) // CAP_SHARED = 0.8
return raw
```
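The core formula can be sketched in Python. This is a minimal mirror of the pseudocode above, not the production gateway code; the edge-case reasons (no_router_invoked, single_candidate, insufficient_samples) are handled outside this core path and are not modeled here. The "Day0" string stands in for the phase enum.

```python
import math

# Constants as pinned above.
GAP_REF, N_REF, VAR_REF = 0.20, 30, 0.25
W_GAP, W_N, W_VAR = 0.45, 0.35, 0.20
CAP_DAY0, CAP_SHARED = 0.6, 0.8

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def confidence(gap_top2, n_samples, variance, phase, used_shared_pool_prior):
    gap_norm = clamp(gap_top2 / GAP_REF, 0.0, 1.0)
    n_norm = clamp(math.log1p(n_samples) / math.log1p(N_REF), 0.0, 1.0)
    var_norm = 1.0 - clamp(variance / VAR_REF, 0.0, 1.0)
    raw = W_GAP * gap_norm + W_N * n_norm + W_VAR * var_norm
    if phase == "Day0":
        return min(raw, CAP_DAY0)    # Day-0 cap
    if used_shared_pool_prior:
        return min(raw, CAP_SHARED)  # shared-pool prior cap
    return raw
```

For the "Low-confidence, tied" worked example later on this page (gap 0.01, 100 samples, variance 0.05, Nps phase), this yields 0.45·0.05 + 0.35·1.0 + 0.20·0.8 ≈ 0.532.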
| Input | Source | Meaning |
| --- | --- | --- |
| gap_top2 | score_top1 − score_top2 from RoutingAudit.candidates | How far ahead the winner is. |
| n_samples | rolling window over the request log for the winner | How much real traffic backs the choice. |
| variance | rolling outcome variance for the winner (composite quality) | How stable observed outcomes have been. |
| phase | Feedback-Driven phase: Day0, Auto, Nps | Maturity of the org’s data. |
| used_shared_pool_prior | boolean on the audit | Whether cross-tenant priors influenced the score. |
  • W_GAP = 0.45 — a clear separation between the top two candidates is the most direct evidence the routing decision is not a coin flip.
  • W_N = 0.35 — sample count matters, but with strong diminishing returns (note the log1p). Going from 5 to 50 samples tells you a lot; going from 500 to 5000 tells you less.
  • W_VAR = 0.20 — outcome stability is a confirming signal, not a leading one. Low variance with a tied gap is still low confidence.

These weights are static and not customer-configurable in v1.

When the organisation is in the Day0 Feedback-Driven phase (no NPS feedback yet, automated signals only), the formula is allowed to compute any raw value, but the returned confidence is hard-capped at 0.6.

Why: at Day-0 there is by definition no end-user-validated signal. The router can rank candidates well, but until real usage feedback is in the loop, “high confidence” would be misleading. The cap is a forcing function: a customer who wants to see confidence above 0.6 must complete the cold-start ramp.

CAP_SHARED = 0.8 — the shared-pool prior cap


When the routing decision drew on cross-tenant priors (used_shared_pool_prior == true), the returned confidence is hard-capped at 0.8.

Why: cross-tenant priors are by definition not as confident as own-tenant outcomes. Capping at 0.8 reflects this honestly. The cap was reviewed and accepted in the security review (SEC-020 in the Credibility & Auditability Initiative) — the cross-tenant boolean itself is exposed on the audit, but the underlying priors are never exposed.

You can identify shared-pool-influenced decisions by:

  • The used_shared_pool_prior: true field on the audit, and
  • The Floopy-Aggregation-Notice: contains-shared-pool-influenced-decisions HTTP header on GET /v1/export/decisions, and
  • The trailer’s aggregation_signal_present: true.

Every decision carries a confidence_reason so the cap or edge-case is explicit, not implicit.

| Value | Meaning |
| --- | --- |
| ok | Formula returned raw and no cap applied. |
| cap_day0 | Returned raw, then clamped down to CAP_DAY0 = 0.6. The org is in the Day-0 phase. |
| cap_shared | Returned raw, then clamped down to CAP_SHARED = 0.8. The decision used a shared-pool prior. |
| no_router_invoked | No router ran for this request — confidence is null. Cache hit, legacy-model path, or other short-circuit. |
| insufficient_samples | n_samples < N_MIN (default 3) and the org is past Day-0. The score is raw * 0.5 to dampen a too-eager formula on too little data. |
| single_candidate | Only one candidate was considered — gap_top2 is undefined, so confidence is null. |

null is a first-class, valid value. It means “no router was invoked for this request” — for example:

  • The response was served from cache (outcome.cache_hit == true).
  • The legacy-model parsing path took over because no routing rule applied.
  • A single-candidate routing rule short-circuited candidate scoring.

Treat null confidence as “don’t audit this row for a routing-quality decision — there wasn’t one”. If you filter min_confidence > 0 on GET /v1/decisions, null rows are excluded by design.
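That rule can be sketched client-side. The confidence field name is from the audit schema on this page; the helper itself is hypothetical.

```python
def needs_review(decision, max_confidence=0.5):
    # None (JSON null) means no routing decision was made, so there is
    # nothing to audit; exclude the row rather than treating it as low confidence.
    conf = decision.get("confidence")
    if conf is None:
        return False
    return conf <= max_confidence
```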

| Scenario | gap_top2 | n_samples | variance | phase | shared | confidence | confidence_reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| High-confidence, mature | 0.18 | 100 | 0.05 | Nps | false | 0.917 | ok |
| Low-confidence, tied | 0.01 | 100 | 0.05 | Nps | false | 0.532 | ok |
| Day-0, perfect prior | 0.20 | 0 | none | Day0 | false | 0.45 | ok |
| Day-0, max raw | 0.20 | 30 | 0.0 | Day0 | false | 0.6 | cap_day0 |
| Insufficient samples | 0.18 | 1 | none | Auto | false | 0.27 | insufficient_samples |
| Cache hit | n/a | n/a | n/a | n/a | n/a | null | no_router_invoked |
| Single candidate | n/a | 50 | 0.05 | Nps | false | null | single_candidate |
| Shared-pool prior | 0.18 | 30 | 0.05 | Auto | true | 0.8 | cap_shared |

These cases are exercised by the gateway’s confidence test suite — they are the contract.

confidence and confidence_reason are populated on every decision from the deploy date forward. Rows older than the deploy date have confidence == null and confidence_reason == null. We chose this over a backfill because reconstructing n_samples, variance, and phase for historical decisions would be wrong more often than helpful — and a wrong confidence number is worse than no number.

  • As a review queue: GET /v1/decisions?max_confidence=0.5 returns the routing decisions you should review first.
  • As an alert: track the share of decisions in your window with confidence < 0.6. A jump usually means the org’s traffic shape changed, not that Floopy got worse.
  • As a routing constraint: PUT /v1/constraints with confidence_threshold: 0.7 makes the router fall back to baseline whenever confidence dips below the threshold. Filtered candidates carry reason: "constraint_confidence_below_threshold".

Evidence — what the router knew, surfaced


In v2 every decision that goes through the Feedback-Driven or Smart-Cost router carries an evidence block alongside confidence and confidence_reason. Evidence is the small bag of inputs that drove the confidence number, surfaced so you can audit the reasoning, not just the verdict.

| Field | Type | Meaning |
| --- | --- | --- |
| samples | integer | The rolling sample count n_samples over the winner. Same number that feeds the n_norm term in the formula. |
| top2_score_gap | number | The score gap between the winner and the runner-up (gap_top2). Same number that feeds gap_norm. |
| outcome_variance | number | The rolling outcome variance for the winner, on the composite-quality scale. Same number that feeds var_norm. |
| recent_regressions | tagged union | Bucketed count of regression alerts on the winner’s (provider, model) over the last 7 days. See below. |
| last_regression_at | ISO8601 \| null | Timestamp of the most recent regression in the 7-day window, floored to a 5-minute boundary. null when no regression exists in the window. |

When no router was invoked (cache hit, single candidate, insufficient samples for the strategy), the evidence field is absent — absence is a first-class, valid state here, just as null is for confidence itself.

recent_regressions is bucketed, not a raw count


The recent_regressions field is a tagged union with two shapes:

```json
{ "kind": "exact", "exact": 3 }
{ "kind": "at_least", "at_least": 10 }
{ "kind": "at_least", "at_least": 50 }
```

The bucket boundaries are pinned in the gateway:

| Raw count | Emitted bucket |
| --- | --- |
| 0..=9 | Exact { exact: n } |
| 10..=49 | AtLeast { at_least: 10 } |
| >= 50 | AtLeast { at_least: 50 } |
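A minimal sketch of those pinned boundaries, emitting the JSON shapes shown above (illustrative, not the gateway's own implementation):

```python
def bucket_regressions(raw_count):
    # Boundaries pinned in the gateway: 0..=9 exact; 10..=49 and >=50 bucketed.
    if raw_count <= 9:
        return {"kind": "exact", "exact": raw_count}
    if raw_count <= 49:
        return {"kind": "at_least", "at_least": 10}
    return {"kind": "at_least", "at_least": 50}
```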

The buckets compress a fleet-wide regression-event signal into a 3-bucket shape. They preserve “is something wrong with my model?” — any non-zero exact count or any AtLeast fires the regression-detected branch on the verification endpoint — without leaking precise volumes that could be cross-correlated across tenants.

last_regression_at is rounded to 5 minutes


The most recent regression timestamp is floored to the previous 5-minute boundary before it leaves the gateway. A regression that fired at 2026-05-07T14:32:18Z is reported as 2026-05-07T14:30:00Z. The rounding is for the same reason as the bucketing: enough signal to investigate, not enough resolution to use as a side-channel.
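The flooring can be sketched with the standard-library datetime module (a sketch of the behavior described above, not the gateway's own code):

```python
from datetime import datetime, timezone

def floor_to_5min(ts):
    # Drop seconds and microseconds, then snap the minute down
    # to the previous multiple of 5.
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

fired = datetime(2026, 5, 7, 14, 32, 18, tzinfo=timezone.utc)
# floor_to_5min(fired) -> 2026-05-07 14:30:00+00:00
```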

recent_regressions and last_regression_at are computed over a rolling 7-day window. The window is fixed and not customer-configurable in v2.

The query is org-scoped: the gateway looks up regression alerts whose organization_id matches the calling organisation, joined to the winner’s (provider, model). There is no cross-tenant aggregation on this path.

Every field on evidence follows null-safe rules:

  • evidence itself is absent (not null) on rows where no router was invoked.
  • last_regression_at is null when the 7-day window has no regressions, even when the rest of evidence is populated.
  • A failed regression-summary lookup (PostgREST timeout, transient error) renders recent_regressions: { "kind": "exact", "exact": 0 } and last_regression_at: null — fail-safe to “no regression observed”, with a metric incremented on the gateway side so we can see how often this happens. The decision is not blocked on a regression-summary lookup.
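The fail-safe in the last bullet can be sketched as follows. The lookup callable and the metrics counter are hypothetical stand-ins; the real gateway wiring (PostgREST client, metrics backend) is not shown.

```python
FAILSAFE = {
    "recent_regressions": {"kind": "exact", "exact": 0},
    "last_regression_at": None,
}

def regression_summary(lookup, metrics):
    # On any lookup failure (timeout, transient error), fall back to
    # "no regression observed" and increment a gateway-side metric;
    # the decision itself is never blocked on this lookup.
    try:
        return lookup()
    except Exception:
        metrics["regression_summary_lookup_failed"] = (
            metrics.get("regression_summary_lookup_failed", 0) + 1
        )
        return FAILSAFE
```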

The 7-day-window query has a hard 150 ms timeout and a positive-and-negative Redis cache (TTL 60 s). Decision latency stays bounded; the regression signal stays current.

Evidence is explanatory, not load-bearing. The confidence number is computed from the same inputs (gap, samples, variance) plus the phase and used_shared_pool_prior caps; recent_regressions does not feed confidence directly. We surface evidence so a reviewer can read an audit row and see what the router was working with, the same way the explanation text reads those numbers back in plain prose.