
Confidence methodology

Every routing decision Floopy makes carries a confidence field — the router’s self-reported belief that its top choice is the right one for this organisation’s traffic right now. This page explains exactly how that number is computed, what each confidence_reason means, and why null is sometimes the honest answer.

Confidence is a heuristic, not a probability. A confidence of 0.78 does not mean “78% chance the router was correct” — it means “the inputs the router has on this candidate (gap to runner-up, sample count, outcome variance) are stronger than the inputs on most other recent decisions in this org.”

We expose confidence so customers can build review queues, alerting, and the confidence_threshold constraint on top of an explicit signal — not so it can be treated as a probability of correctness.

Confidence is computed entirely from the requesting organisation’s own inputs (its candidates, its sample counts, its variance). It does not leak cross-tenant information.

```
gap_norm = clamp(gap_top2 / GAP_REF, 0.0, 1.0)              // GAP_REF = 0.20
n_norm   = clamp(log1p(n_samples) / log1p(N_REF), 0.0, 1.0) // N_REF   = 30
var_norm = 1.0 - clamp(variance / VAR_REF, 0.0, 1.0)        // VAR_REF = 0.25

raw = W_GAP*gap_norm + W_N*n_norm + W_VAR*var_norm
// W_GAP = 0.45, W_N = 0.35, W_VAR = 0.20

if phase == Day0:                  return min(raw, CAP_DAY0)   // CAP_DAY0   = 0.6
if used_shared_pool_prior == true: return min(raw, CAP_SHARED) // CAP_SHARED = 0.8
return raw
```
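The core formula can be sketched in Python. This is a minimal mirror of the pseudocode above, not the production gateway code; the edge-case reasons (no_router_invoked, single_candidate, insufficient_samples) are handled outside this core path and are not modeled here. The "Day0" string stands in for the phase enum.

```python
import math

# Constants as pinned above.
GAP_REF, N_REF, VAR_REF = 0.20, 30, 0.25
W_GAP, W_N, W_VAR = 0.45, 0.35, 0.20
CAP_DAY0, CAP_SHARED = 0.6, 0.8

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def confidence(gap_top2, n_samples, variance, phase, used_shared_pool_prior):
    gap_norm = clamp(gap_top2 / GAP_REF, 0.0, 1.0)
    n_norm = clamp(math.log1p(n_samples) / math.log1p(N_REF), 0.0, 1.0)
    var_norm = 1.0 - clamp(variance / VAR_REF, 0.0, 1.0)
    raw = W_GAP * gap_norm + W_N * n_norm + W_VAR * var_norm
    if phase == "Day0":
        return min(raw, CAP_DAY0)    # Day-0 cap
    if used_shared_pool_prior:
        return min(raw, CAP_SHARED)  # shared-pool prior cap
    return raw
```

For the "Low-confidence, tied" worked example later on this page (gap 0.01, 100 samples, variance 0.05, Nps phase), this yields 0.45·0.05 + 0.35·1.0 + 0.20·0.8 ≈ 0.532.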
| Input | Source | Meaning |
| --- | --- | --- |
| gap_top2 | score_top1 − score_top2 from RoutingAudit.candidates | How far ahead the winner is. |
| n_samples | rolling window over the request log for the winner | How much real traffic backs the choice. |
| variance | rolling outcome variance for the winner (composite quality) | How stable observed outcomes have been. |
| phase | Feedback-Driven phase: Day0, Auto, Nps | Maturity of the org’s data. |
| used_shared_pool_prior | boolean on the audit | Whether cross-tenant priors influenced the score. |
  • W_GAP = 0.45 — a clear separation between the top two candidates is the most direct evidence the routing decision is not a coin flip.
  • W_N = 0.35 — sample count matters, but with strong diminishing returns (note the log1p). Going from 5 to 50 samples tells you a lot; going from 500 to 5000 tells you less.
  • W_VAR = 0.20 — outcome stability is a confirming signal, not a leading one. Low variance with a tied gap is still low confidence.

These weights are static and not customer-configurable in v1.

When the organisation is in the Day0 Feedback-Driven phase (no NPS feedback yet, automated signals only), the formula is allowed to compute any raw value, but the returned confidence is hard-capped at 0.6.

Why: at Day-0 there is by definition no end-user-validated signal. The router can rank candidates well, but until real usage feedback is in the loop, “high confidence” would be misleading. The cap is a forcing function: a customer who wants to see confidence above 0.6 must complete the cold-start ramp.

CAP_SHARED = 0.8 — the shared-pool prior cap


When the routing decision drew on cross-tenant priors (used_shared_pool_prior == true), the returned confidence is hard-capped at 0.8.

Why: cross-tenant priors are by definition not as confident as own-tenant outcomes. Capping at 0.8 reflects this honestly. The cap was reviewed and accepted in the security review (SEC-020 in the Credibility & Auditability Initiative) — the cross-tenant boolean itself is exposed on the audit, but the underlying priors are never exposed.

You can identify shared-pool-influenced decisions by:

  • The used_shared_pool_prior: true field on the audit, and
  • The Floopy-Aggregation-Notice: contains-shared-pool-influenced-decisions HTTP header on GET /v1/export/decisions, and
  • The trailer’s aggregation_signal_present: true.

Every decision carries a confidence_reason so the cap or edge-case is explicit, not implicit.

| Value | Meaning |
| --- | --- |
| ok | Formula returned raw and no cap applied. |
| cap_day0 | Returned raw, then clamped down to CAP_DAY0 = 0.6. The org is in the Day-0 phase. |
| cap_shared | Returned raw, then clamped down to CAP_SHARED = 0.8. The decision used a shared-pool prior. |
| no_router_invoked | No router ran for this request — confidence is null. Cache hit, legacy-model path, or other short-circuit. |
| insufficient_samples | n_samples < N_MIN (default 3) and the org is past Day-0. The score is raw * 0.5 to dampen a too-eager formula on too little data. |
| single_candidate | Only one candidate was considered — gap_top2 is undefined, so confidence is null. |

null is a first-class, valid value. It means “no router was invoked for this request” — for example:

  • The response was served from cache (outcome.cache_hit == true).
  • The legacy-model parsing path took over because no routing rule applied.
  • A single-candidate routing rule short-circuited candidate scoring.

Treat null confidence as “don’t audit this row for a routing-quality decision — there wasn’t one”. If you filter min_confidence > 0 on GET /v1/decisions, null rows are excluded by design.
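That rule can be sketched client-side. The confidence field name is from the audit schema on this page; the helper itself is hypothetical.

```python
def needs_review(decision, max_confidence=0.5):
    # None (JSON null) means no routing decision was made, so there is
    # nothing to audit; exclude the row rather than treating it as low confidence.
    conf = decision.get("confidence")
    if conf is None:
        return False
    return conf <= max_confidence
```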

| Scenario | gap_top2 | n_samples | variance | phase | shared | confidence | confidence_reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| High-confidence, mature | 0.18 | 100 | 0.05 | Nps | false | 0.917 | ok |
| Low-confidence, tied | 0.01 | 100 | 0.05 | Nps | false | 0.532 | ok |
| Day-0, perfect prior | 0.20 | 0 | none | Day0 | false | 0.45 | ok |
| Day-0, max raw | 0.20 | 30 | 0.0 | Day0 | false | 0.6 | cap_day0 |
| Insufficient samples | 0.18 | 1 | none | Auto | false | 0.27 | insufficient_samples |
| Cache hit | n/a | n/a | n/a | n/a | n/a | null | no_router_invoked |
| Single candidate | n/a | 50 | 0.05 | Nps | false | null | single_candidate |
| Shared-pool prior | 0.18 | 30 | 0.05 | Auto | true | 0.8 | cap_shared |

These cases are exercised by the gateway’s confidence test suite — they are the contract.

confidence and confidence_reason are populated on every decision from the deploy date forward. Rows older than the deploy date have confidence == null and confidence_reason == null. We chose this over a backfill because reconstructing n_samples, variance, and phase for historical decisions would be wrong more often than helpful — and a wrong confidence number is worse than no number.

  • As a review queue: GET /v1/decisions?max_confidence=0.5 returns the routing decisions you should review first.
  • As an alert: track the share of decisions in your window with confidence < 0.6. A jump usually means the org’s traffic shape changed, not that Floopy got worse.
  • As a routing constraint: PUT /v1/constraints with confidence_threshold: 0.7 makes the router fall back to baseline whenever confidence dips below the threshold. Filtered candidates carry reason: "constraint_confidence_below_threshold".

Evidence — what the router knew, surfaced


In v2 every decision that goes through the Feedback-Driven or Smart-Cost router carries an evidence block alongside confidence and confidence_reason. Evidence is the small bag of inputs that drove the confidence number, surfaced so you can audit the reasoning, not just the verdict.

| Field | Type | Meaning |
| --- | --- | --- |
| samples | integer | The rolling sample count n_samples over the winner. Same number that feeds the n_norm term in the formula. |
| top2_score_gap | number | The score gap between the winner and the runner-up (gap_top2). Same number that feeds gap_norm. |
| outcome_variance | number | The rolling outcome variance for the winner, on the composite-quality scale. Same number that feeds var_norm. |
| recent_regressions | tagged union | Bucketed count of regression alerts on the winner’s (provider, model) over the last 7 days. See below. |
| last_regression_at | ISO8601 \| null | Timestamp of the most recent regression in the 7-day window, floored to a 5-minute boundary. null when no regression exists in the window. |

When no router was invoked (cache hit, single candidate, insufficient samples for the strategy), the evidence field is absent — absence is a first-class, valid state here, just as null is for confidence itself.

recent_regressions is bucketed, not a raw count


The recent_regressions field is a tagged union with two shapes:

```json
{ "kind": "exact", "exact": 3 }
{ "kind": "at_least", "at_least": 10 }
{ "kind": "at_least", "at_least": 50 }
```

The bucket boundaries are pinned in the gateway:

| Raw count | Emitted bucket |
| --- | --- |
| 0..=9 | Exact { exact: n } |
| 10..=49 | AtLeast { at_least: 10 } |
| >= 50 | AtLeast { at_least: 50 } |
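A minimal sketch of those pinned boundaries, emitting the JSON shapes shown above (illustrative, not the gateway's own implementation):

```python
def bucket_regressions(raw_count):
    # Boundaries pinned in the gateway: 0..=9 exact; 10..=49 and >=50 bucketed.
    if raw_count <= 9:
        return {"kind": "exact", "exact": raw_count}
    if raw_count <= 49:
        return {"kind": "at_least", "at_least": 10}
    return {"kind": "at_least", "at_least": 50}
```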

The buckets compress a fleet-wide regression-event signal into a 3-bucket shape. They preserve “is something wrong with my model?” — any non-zero exact count or any AtLeast fires the regression-detected branch on the verification endpoint — without leaking precise volumes that could be cross-correlated across tenants.

last_regression_at is rounded to 5 minutes


The most recent regression timestamp is floored to the previous 5-minute boundary before it leaves the gateway. A regression that fired at 2026-05-07T14:32:18Z is reported as 2026-05-07T14:30:00Z. The rounding is for the same reason as the bucketing: enough signal to investigate, not enough resolution to use as a side-channel.
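The flooring can be sketched with the standard-library datetime module (a sketch of the behavior described above, not the gateway's own code):

```python
from datetime import datetime, timezone

def floor_to_5min(ts):
    # Drop seconds and microseconds, then snap the minute down
    # to the previous multiple of 5.
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

fired = datetime(2026, 5, 7, 14, 32, 18, tzinfo=timezone.utc)
# floor_to_5min(fired) -> 2026-05-07 14:30:00+00:00
```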

recent_regressions and last_regression_at are computed over a rolling 7-day window. The window is fixed and not customer-configurable in v2.

The query is org-scoped: the gateway looks up regression alerts whose organization_id matches the calling organisation, joined to the winner’s (provider, model). There is no cross-tenant aggregation on this path.

Every field on evidence follows null-safe rules:

  • evidence itself is absent (not null) on rows where no router was invoked.
  • last_regression_at is null when the 7-day window has no regressions, even when the rest of evidence is populated.
  • A failed regression-summary lookup (PostgREST timeout, transient error) renders recent_regressions: { "kind": "exact", "exact": 0 } and last_regression_at: null — fail-safe to “no regression observed”, with a metric incremented on the gateway side so we can see how often this happens. The decision is not blocked on a regression-summary lookup.
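The fail-safe in the last bullet can be sketched as follows. The lookup callable and the metrics counter are hypothetical stand-ins; the real gateway wiring (PostgREST client, metrics backend) is not shown.

```python
FAILSAFE = {
    "recent_regressions": {"kind": "exact", "exact": 0},
    "last_regression_at": None,
}

def regression_summary(lookup, metrics):
    # On any lookup failure (timeout, transient error), fall back to
    # "no regression observed" and increment a gateway-side metric;
    # the decision itself is never blocked on this lookup.
    try:
        return lookup()
    except Exception:
        metrics["regression_summary_lookup_failed"] = (
            metrics.get("regression_summary_lookup_failed", 0) + 1
        )
        return FAILSAFE
```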

The 7-day-window query has a hard 150 ms timeout and a positive-and-negative Redis cache (TTL 60 s). Decision latency stays bounded; the regression signal stays current.

Evidence is explanatory, not load-bearing. The confidence number is computed from the same inputs (gap, samples, variance) plus the phase and used_shared_pool_prior caps; recent_regressions does not feed confidence directly. We surface evidence so a reviewer can read an audit row and see what the router was working with, the same way the explanation text reads those numbers back in plain prose.