Confidence methodology
Every routing decision Floopy makes carries a confidence field — the router’s self-reported belief that its top choice is the right one for this organisation’s traffic right now. This page explains exactly how that number is computed, what each confidence_reason means, and why null is sometimes the honest answer.
What confidence is and is not
Confidence is a heuristic, not a probability. A confidence of 0.78 does not mean “78% chance the router was correct” — it means “the inputs the router has on this candidate (gap to runner-up, sample count, outcome variance) are stronger than the inputs on most other recent decisions in this org.”
We expose confidence so customers can build review queues, alerting, and the confidence_threshold constraint on top of an explicit signal — not so it can be treated as a probability of correctness.
Confidence is computed entirely from the requesting organisation’s own inputs (its candidates, its sample counts, its variance). It does not leak cross-tenant information.
Formula
```
gap_norm = clamp(gap_top2 / GAP_REF, 0.0, 1.0)                 // GAP_REF = 0.20
n_norm   = clamp(log1p(n_samples) / log1p(N_REF), 0.0, 1.0)    // N_REF = 30
var_norm = 1.0 - clamp(variance / VAR_REF, 0.0, 1.0)           // VAR_REF = 0.25

raw = W_GAP*gap_norm + W_N*n_norm + W_VAR*var_norm             // W_GAP = 0.45, W_N = 0.35, W_VAR = 0.20

if phase == Day0: return min(raw, CAP_DAY0)                    // 0.6
if used_shared_pool_prior == true: return min(raw, CAP_SHARED) // 0.8
return raw
```
Inputs
| Input | Source | Meaning |
|---|---|---|
| `gap_top2` | `score_top1 − score_top2` from `RoutingAudit.candidates` | How far ahead the winner is. |
| `n_samples` | rolling window over the request log for the winner | How much real traffic backs the choice. |
| `variance` | rolling outcome variance for the winner (composite quality) | How stable observed outcomes have been. |
| `phase` | Feedback-Driven phase: `Day0`, `Auto`, `Nps` | Maturity of the org’s data. |
| `used_shared_pool_prior` | boolean on the audit | Whether cross-tenant priors influenced the score. |
Why these weights?
- `W_GAP = 0.45` — a clear separation between the top two candidates is the most direct evidence that the routing decision is not a coin flip.
- `W_N = 0.35` — sample count matters, but with strong diminishing returns (note the `log1p`). Going from 5 to 50 samples tells you a lot; going from 500 to 5000 tells you less.
- `W_VAR = 0.20` — outcome stability is a confirming signal, not a leading one. Low variance with a tied gap is still low confidence.
These weights are static and not customer-configurable in v1.
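Putting the normalisation, the weighted sum, and the two caps together, the whole computation fits in a few lines. A minimal Python sketch (constant and function names mirror the pseudocode above; the phase is assumed to arrive as a plain string):

```python
import math

GAP_REF, N_REF, VAR_REF = 0.20, 30, 0.25
W_GAP, W_N, W_VAR = 0.45, 0.35, 0.20
CAP_DAY0, CAP_SHARED = 0.6, 0.8

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def confidence(gap_top2, n_samples, variance, phase, used_shared_pool_prior):
    # Normalise each input into [0, 1] against its reference value.
    gap_norm = clamp(gap_top2 / GAP_REF, 0.0, 1.0)
    n_norm = clamp(math.log1p(n_samples) / math.log1p(N_REF), 0.0, 1.0)
    var_norm = 1.0 - clamp(variance / VAR_REF, 0.0, 1.0)
    raw = W_GAP * gap_norm + W_N * n_norm + W_VAR * var_norm
    # Caps apply after the weighted sum; the Day-0 cap takes precedence.
    if phase == "Day0":
        return min(raw, CAP_DAY0)
    if used_shared_pool_prior:
        return min(raw, CAP_SHARED)
    return raw
```

For instance, a near-tied gap of 0.01 with 100 samples and variance 0.05 in the `Nps` phase yields 0.45·0.05 + 0.35·1.0 + 0.20·0.8 = 0.5325, the “low-confidence, tied” worked example further down.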
The two caps
CAP_DAY0 = 0.6 — the Day-0 cap
When the organisation is in the Day0 Feedback-Driven phase (no NPS feedback yet, automated signals only), the formula is allowed to compute any raw value, but the returned confidence is hard-capped at 0.6.
Why: at Day-0 there is by definition no end-user-validated signal. The router can rank candidates well, but until real usage feedback is in the loop, “high confidence” would be misleading. The cap is a forcing function: a customer who wants to see confidence above 0.6 must complete the cold-start ramp.
CAP_SHARED = 0.8 — the shared-pool prior cap
When the routing decision drew on cross-tenant priors (used_shared_pool_prior == true), the returned confidence is hard-capped at 0.8.
Why: cross-tenant priors are by definition not as confident as own-tenant outcomes. Capping at 0.8 reflects this honestly. The cap was reviewed and accepted in the security review (SEC-020 in the Credibility & Auditability Initiative) — the cross-tenant boolean itself is exposed on the audit, but the underlying priors are never exposed.
You can identify shared-pool-influenced decisions by:
- The `used_shared_pool_prior: true` field on the audit,
- The `Floopy-Aggregation-Notice: contains-shared-pool-influenced-decisions` HTTP header on `GET /v1/export/decisions`, and
- The trailer’s `aggregation_signal_present: true`.
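A consumer-side check might look like the following sketch (hypothetical helper; it assumes the export’s HTTP headers are available as a dict and the trailer as parsed JSON):

```python
SHARED_POOL_NOTICE = "contains-shared-pool-influenced-decisions"

def export_has_shared_pool_influence(headers, trailer):
    """True when the decision export signals shared-pool influence."""
    notice = headers.get("Floopy-Aggregation-Notice", "")
    return SHARED_POOL_NOTICE in notice or trailer.get("aggregation_signal_present") is True
```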
confidence_reason enum
Every decision carries a confidence_reason so the cap or edge-case is explicit, not implicit.
| Value | Meaning |
|---|---|
| `ok` | Formula returned raw and no cap applied. |
| `cap_day0` | Returned raw, then clamped down to `CAP_DAY0 = 0.6`. The org is in the Day-0 phase. |
| `cap_shared` | Returned raw, then clamped down to `CAP_SHARED = 0.8`. The decision used a shared-pool prior. |
| `no_router_invoked` | No router ran for this request — confidence is `null`. Cache hit, legacy-model path, or other short-circuit. |
| `insufficient_samples` | `n_samples < N_MIN` (default 3) and the org is past Day-0. The score is `raw * 0.5` to dampen a too-eager formula on too little data. |
| `single_candidate` | Only one candidate was considered — `gap_top2` is undefined, so confidence is `null`. |
When confidence is null
null is a first-class, valid value. It means “no router was invoked for this request” — for example:
- The response was served from cache (`outcome.cache_hit == true`).
- The legacy-model parsing path took over because no routing rule applied.
- A single-candidate routing rule short-circuited candidate scoring.
Treat null confidence as “don’t audit this row for a routing-quality decision — there wasn’t one”. If you filter min_confidence > 0 on GET /v1/decisions, null rows are excluded by design.
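In client code that means filtering on “not null” before any numeric comparison. A hypothetical review-queue filter over exported decision rows:

```python
def review_queue(rows, max_confidence=0.5):
    # Rows with confidence None never had a routing decision to review.
    return [r for r in rows
            if r.get("confidence") is not None
            and r["confidence"] <= max_confidence]
```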
Worked examples
| Scenario | gap_top2 | n_samples | variance | phase | shared | confidence | confidence_reason |
|---|---|---|---|---|---|---|---|
| High-confidence, mature | 0.18 | 100 | 0.05 | Nps | false | 0.915 | ok |
| Low-confidence, tied | 0.01 | 100 | 0.05 | Nps | false | 0.532 | ok |
| Day-0, perfect prior | 0.20 | 0 | none | Day0 | false | 0.45 | ok |
| Day-0, max raw | 0.20 | 30 | 0.0 | Day0 | false | 0.6 | cap_day0 |
| Insufficient samples | 0.18 | 1 | none | Auto | false | 0.27 | insufficient_samples |
| Cache hit | n/a | n/a | n/a | n/a | n/a | null | no_router_invoked |
| Single candidate | n/a | 50 | 0.05 | Nps | false | null | single_candidate |
| Shared-pool prior | 0.18 | 30 | 0.05 | Auto | true | 0.8 | cap_shared |
These cases are exercised by the gateway’s confidence test suite — they are the contract.
Forward-only — no backfill
confidence and confidence_reason are populated on every decision from the deploy date forward. Rows older than the deploy date have confidence == null and confidence_reason == null. We chose this over a backfill because reconstructing n_samples, variance, and phase for historical decisions would be wrong more often than helpful — and a wrong confidence number is worse than no number.
How to use it
- As a review queue: `GET /v1/decisions?max_confidence=0.5` returns the routing decisions you should review first.
- As an alert: track the share of decisions in your window with `confidence < 0.6`. A jump usually means the org’s traffic shape changed, not that Floopy got worse.
- As a routing constraint: `PUT /v1/constraints` with `confidence_threshold: 0.7` makes the router fall back to baseline whenever confidence dips below the threshold. Filtered candidates carry `reason: "constraint_confidence_below_threshold"`.
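The alerting use case reduces to a share-below-threshold metric. A sketch, with null rows excluded since they carry no routing decision:

```python
def low_confidence_share(rows, threshold=0.6):
    """Fraction of scored decisions whose confidence is below threshold."""
    scored = [r["confidence"] for r in rows if r.get("confidence") is not None]
    if not scored:
        return 0.0
    return sum(1 for c in scored if c < threshold) / len(scored)
```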
Evidence — what the router knew, surfaced
In v2 every decision that goes through the Feedback-Driven or Smart-Cost router carries an evidence block alongside confidence and confidence_reason. Evidence is the small bag of inputs that drove the confidence number, surfaced so you can audit the reasoning, not just the verdict.
The five fields
| Field | Type | Meaning |
|---|---|---|
| `samples` | integer | The rolling sample count `n_samples` over the winner. Same number that feeds the `n_norm` term in the formula. |
| `top2_score_gap` | number | The score gap between the winner and the runner-up (`gap_top2`). Same number that feeds `gap_norm`. |
| `outcome_variance` | number | The rolling outcome variance for the winner, on the composite-quality scale. Same number that feeds `var_norm`. |
| `recent_regressions` | tagged union | Bucketed count of regression alerts on the winner’s `(provider, model)` over the last 7 days. See below. |
| `last_regression_at` | ISO8601 \| `null` | Timestamp of the most recent regression in the 7-day window, floored to a 5-minute boundary. `null` when no regression exists in the window. |
When no router was invoked (cache hit, single candidate, insufficient samples for the strategy), the evidence field is absent entirely — absence is first-class here, just as null is for confidence itself.
recent_regressions is bucketed, not a raw count
The recent_regressions field is a tagged union with two shapes:

- `{ "kind": "exact", "exact": 3 }`
- `{ "kind": "at_least", "at_least": 10 }`
- `{ "kind": "at_least", "at_least": 50 }`

The bucket boundaries are pinned in the gateway:
| Raw count | Emitted bucket |
|---|---|
| `0..=9` | `Exact { exact: n }` |
| `10..=49` | `AtLeast { at_least: 10 }` |
| `>= 50` | `AtLeast { at_least: 50 }` |
The buckets compress a fleet-wide regression-event signal into a 3-bucket shape. They preserve “is something wrong with my model?” — any non-zero exact count or any AtLeast fires the regression-detected branch on the verification endpoint — without leaking precise volumes that could be cross-correlated across tenants.
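The pinned boundaries translate directly into one branch per bucket. A sketch of the mapping, with dict shapes standing in for the gateway’s tagged union:

```python
def bucket_regressions(raw_count):
    # 0..=9 stays exact; everything above collapses into two coarse buckets.
    if raw_count <= 9:
        return {"kind": "exact", "exact": raw_count}
    if raw_count <= 49:
        return {"kind": "at_least", "at_least": 10}
    return {"kind": "at_least", "at_least": 50}
```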
last_regression_at is rounded to 5 minutes
The most recent regression timestamp is floored to the previous 5-minute boundary before it leaves the gateway. A regression that fired at 2026-05-07T14:32:18Z is reported as 2026-05-07T14:30:00Z. The rounding is for the same reason as the bucketing: enough signal to investigate, not enough resolution to use as a side-channel.
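Flooring to a 5-minute boundary is a one-liner on the minute field. A sketch, using naive datetimes for brevity (the gateway emits UTC):

```python
from datetime import datetime

def floor_to_5min(ts):
    # Drop seconds/microseconds and snap the minute down to a multiple of 5.
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)
```

Here `floor_to_5min(datetime(2026, 5, 7, 14, 32, 18))` gives 2026-05-07T14:30:00, matching the example above.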
The 7-day window
recent_regressions and last_regression_at are computed over a rolling 7-day window. The window is fixed and not customer-configurable in v2.
The query is org-scoped: the gateway looks up regression alerts whose organization_id matches the calling organisation, joined to the winner’s (provider, model). There is no cross-tenant aggregation on this path.
Null-safe semantics
Every field on evidence follows null-safe rules:

- `evidence` itself is absent (not `null`) on rows where no router was invoked.
- `last_regression_at` is `null` when the 7-day window has no regressions, even when the rest of `evidence` is populated.
- A failed regression-summary lookup (PostgREST timeout, transient error) renders `recent_regressions: { "kind": "exact", "exact": 0 }` and `last_regression_at: null` — fail-safe to “no regression observed”, with a metric incremented on the gateway side so we can see how often this happens. The decision is not blocked on a regression-summary lookup.
The 7-day-window query has a hard 150 ms timeout and a positive-and-negative Redis cache (TTL 60 s). Decision latency stays bounded; the regression signal stays current.
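The fail-safe behaviour can be sketched as a wrapper that swallows lookup failures and renders the “no regression observed” shape (illustrative only; the real gateway also increments a metric and reads through the Redis cache):

```python
NO_REGRESSIONS = {
    "recent_regressions": {"kind": "exact", "exact": 0},
    "last_regression_at": None,
}

def regression_summary(lookup):
    """Run the regression-summary lookup; on any failure, fail safe."""
    try:
        return lookup()
    except Exception:
        # A metric increment would go here so the failure rate stays visible.
        return dict(NO_REGRESSIONS)
```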
Evidence on the audit, not on the score
Evidence is explanatory, not load-bearing. The confidence number is computed from the same inputs (gap, samples, variance) plus the phase and used_shared_pool_prior caps; recent_regressions does not feed confidence directly. We surface evidence so a reviewer can read the audit row and see what the router was working with — same way the explanation text reads back the same numbers in plain prose.
See also
- `GET /v1/decisions/{request_id}` — single decision, including `confidence` and `confidence_reason`.
- `GET /v1/decisions` — list with `min_confidence` / `max_confidence` filters.
- `PUT /v1/constraints` — set `confidence_threshold`.
- `POST /v1/routing/explain` — preview the confidence a candidate request would carry.