
Four Signals, One Loop: How Multi-Source Feedback Routing Actually Works

Session NPS, LLM-as-judge, admin rating, and benchmarks each fail alone. Combining all four with dynamic weights is the only honest way to route.

Floopy Team | 12 min read
feedback-driven-routing agent-optimization llmops multi-source-feedback engineering


Every feedback-driven routing system eventually confronts the same question: which signal do we trust? The answers you see in the wild are all variations of a single source — a thumbs up/down, a numeric rating, a benchmark score, an LLM evaluator. Each one sounds reasonable in isolation. Each one collapses under production weight.

This post is the long version of why we gave up on picking one. It walks through the four signal sources Floopy uses, the specific failure modes of each when used alone, the weighting problem that connects them, and the self-correcting loop that makes the combination stable. If you’re evaluating agent optimization platforms, or building your own, this is the content you actually need before writing code.

Why single-source feedback is brittle

The four signal sources you could build a routing system on — session NPS, auto feedback (LLM-as-judge), admin rating, and public benchmarks — each have a failure mode severe enough that running a production router on one of them alone is a liability. Not a corner case. The normal operating mode of each source produces wrong routing decisions.

Session NPS alone is brittle

Session-level end-user feedback (a 0–10 NPS score posted after a conversation) is the most meaningful signal in the stack. It’s what the product is actually graded on. But if you treat it as your only routing input, you get a router that:

  • Lags reality by hours. End users don’t rate conversations the moment they end. The feedback you’re routing on today reflects what happened yesterday.
  • Is sparse on low-traffic models. A cheaper model routed to 2% of sessions produces 2% of the feedback volume. Its score wobbles on tiny N for weeks.
  • Skews toward responders. The users willing to submit NPS aren’t a random sample. They’re disproportionately angry or disproportionately delighted. The quiet middle never shows up.
  • Breaks during cold start. A new model added this morning has zero session feedback. Session-NPS-only routing either treats it as average (too optimistic) or excludes it (too pessimistic).

Session NPS is the strongest signal you can collect. It is also too slow, too sparse, and too biased to be the only signal.

Auto feedback (LLM-as-judge) alone is biased

Auto feedback — a cheap evaluator model scoring responses from your production model on every request — is fast, dense, and available from request one. It sounds like the dream input. It has a specific, well-documented pathology when used alone:

  • Judge bias. LLM-as-judge scores correlate with the judge’s own preferences more than with ground truth. A judge trained on one family’s outputs will systematically over-score that family. This is replicated across every public eval study.
  • Surface-feature reward. Judges reward long, confidently-worded, well-formatted outputs — even when they’re wrong. “Confidently wrong” beats “correctly uncertain” in most judge rubrics.
  • No access to the real outcome. The judge sees the response. It does not see whether the user got what they wanted, came back, or churned.
  • Drifts silently. When the judge model is updated, your scores shift without any change in your production stack. You wake up and yesterday’s “good” is today’s “mediocre.”

Auto feedback is the best fast signal. It is also structurally biased in ways that make it a poor sole proxy for quality.

Benchmarks alone are stale

Public benchmarks (MMLU, HumanEval, SWE-bench, MT-Bench, and their descendants) are the only signal you have on day zero, before any request has hit your gateway. That’s enormously useful as a prior. As a sole routing input, they fail for structural reasons:

  • Stale. The benchmark number on the model card reflects performance at release. A provider quietly ships a quantization change three weeks later and the benchmark doesn’t rerun. Your router doesn’t know.
  • Generic. MMLU measures broad knowledge. Your product might be routing 90% customer-support conversations. MMLU isn’t scoring that, and it isn’t going to.
  • Gameable. Benchmarks leak into training data faster than they get rotated. A model that topped the leaderboard in January might have memorized half of it by June.
  • Indifferent to your users. No public benchmark knows whether your prompts, your tone, your latency budget, or your user base are well-served by any particular model.

Benchmarks are a useful prior. They are not a live routing signal.

Admin rating alone is unscalable

Admin rating — a thumbs up/thumbs down that operators click in a dashboard to mark individual requests — is the highest-confidence signal in the stack when it exists. A human engineer deliberately looked at the output and judged it. That’s gold. It’s also fundamentally unscalable:

  • Doesn’t scale with traffic. A team rating 50 responses a day cannot cover a gateway doing 50k requests a day. You’re routing on 0.1% of reality.
  • Operator fatigue. The first hundred ratings are careful. The next thousand aren’t. Rating quality decays faster than rating volume.
  • Selection bias. Operators rate what lands in front of them — recent, flagged, or on-screen. Random sampling across the production distribution does not happen in practice.
  • Biased toward known failure modes. Teams rate what they already suspect is bad. Novel failures don’t get a thumb.

Admin rating is high-confidence when it exists. There is simply never enough of it to route on alone.

The case for combining all four

Each signal fails alone. Each signal covers the others’ failure modes in specific, predictable ways:

  Failure of…                              | …is covered by
  Session NPS’s lag and sparsity           | Auto feedback’s density and speed
  Auto feedback’s judge bias               | Session NPS’s real user outcome
  Benchmarks’ staleness and genericity     | All three of the above — production-local, continuously-updated signal
  Admin rating’s unscalability             | Auto feedback’s volume and session NPS’s coverage

This is not a philosophical preference. It is the only combination where every failure mode has a counter-signal. A router that ignores any of the four has a predictable weakness that production will find.

The skeptical reader can try the opposite: pick any single source, describe a realistic production failure mode, and walk through what the router does. It stalls, overreacts, or ships bias. Combining all four is the minimum viable architecture, not a nicety.

The weight transition problem

Adding four signals doesn’t solve routing — it creates a new problem. Which signal matters when?

On day zero, the only signal you have is benchmarks. Weighting the other three at that point makes no sense; they’re empty. After a few hundred requests, auto feedback starts to accumulate, but session NPS is still sparse. After a few thousand sessions with real end-user NPS coming back, session feedback is the most informative signal you have — benchmarks are now a weak prior at best.

A fixed weighting scheme (say, 40/30/20/10 across all four) gets every one of those regimes wrong:

  • On day zero it puts most of its weight on signals that are still empty, producing junk scores.
  • In the auto-only regime it under-weights the one signal you actually have.
  • Once session NPS arrives, it continues weighting benchmarks as if they still carried information.

The weights have to shift with data availability. Not as a one-time migration — as a continuous function of what data exists per model per time window. This is the core mechanic that single-source systems don’t have and can’t bolt on.

Floopy’s dynamic weighting

Floopy’s router resolves the weight-transition problem with a three-regime weighting function, driven by how much data has accumulated for each model in the candidate set. The weights shift dynamically based on what data is available:

  Condition                                 | Session | Auto | Manual | Benchmark
  Session feedback exists (>10 sessions)    | 0.5     | 0.3  | 0.1    | 0.1
  Only auto feedback exists (>10 requests)  | –       | 0.5  | 0.2    | 0.3
  Only benchmark data exists                | –       | –    | –      | 1.0

The system starts with benchmark scores on day zero, transitions to auto feedback as requests accumulate, and converges on session feedback once end users start submitting NPS scores. The more real-world signal available, the less the system relies on synthetic benchmarks.

This table is embedded in the docs at /docs/features/feedback/ and it’s the same table the router reads at runtime. No drift between the documentation and the behavior. The weight values are hard-coded into a single source, and both the docs and the routing decision inherit from it.
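If you want the shape of that transition in code, here is a minimal sketch of the regime logic, assuming the thresholds from the table above. The type and function names are ours for illustration, not Floopy's internals.

    // A sketch of the three-regime weighting from the table above.
    // SignalCounts and resolveWeights are illustrative names, not Floopy's API.

    interface SignalCounts {
      sessions: number; // session NPS submissions for this model in the window
      requests: number; // requests scored by auto feedback (LLM-as-judge)
    }

    interface Weights {
      session: number;
      auto: number;
      manual: number;
      benchmark: number;
    }

    function resolveWeights(counts: SignalCounts): Weights {
      if (counts.sessions > 10) {
        // Real end-user NPS dominates once enough of it exists.
        return { session: 0.5, auto: 0.3, manual: 0.1, benchmark: 0.1 };
      }
      if (counts.requests > 10) {
        // Auto feedback carries the score; benchmarks stay in as a prior.
        return { session: 0, auto: 0.5, manual: 0.2, benchmark: 0.3 };
      }
      // Day zero: benchmarks are the only signal available.
      return { session: 0, auto: 0, manual: 0, benchmark: 1.0 };
    }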

The quality signal this table produces is one of three inputs into the final model-selection score:

performance_score = (success_rate × 0.4) + (quality × 0.4) + (cost_savings × 0.2)

Success rate (HTTP 2xx share) and cost savings are objective. Quality is where all four feedback signals collapse into a single 0.0–1.0 number, using the weights above. That number — and only that number — carries the opinion of your users, your operators, your judges, and the public evals into the routing decision.
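As a sketch of how those two steps compose, here is the quality roll-up feeding the selection score. The assumption that every signal arrives pre-normalized to 0.0–1.0 is ours; the 0.4/0.4/0.2 split is the formula above.

    // Sketch: collapse the four signals into one quality number, then fold it
    // into the selection score. Assumes each signal is pre-normalized to 0.0–1.0.

    type Weights = { session: number; auto: number; manual: number; benchmark: number };
    type Signals = { [K in keyof Weights]?: number };

    function combinedQuality(signals: Signals, w: Weights): number {
      return (
        (signals.session ?? 0) * w.session +
        (signals.auto ?? 0) * w.auto +
        (signals.manual ?? 0) * w.manual +
        (signals.benchmark ?? 0) * w.benchmark
      );
    }

    function performanceScore(successRate: number, quality: number, costSavings: number): number {
      return successRate * 0.4 + quality * 0.4 + costSavings * 0.2;
    }

    // With the degraded-state numbers from the scenario below:
    // performanceScore(0.9, 0.55, 0.3) -> 0.64, the T=2h Model A score.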

Degradation and recovery: the whole loop in one scenario

Abstraction only goes so far. The clearest test of a feedback-routing system is what it does when reality changes mid-flight. Here is the scenario from the Feedback docs, unchanged — it’s the narrative we use internally to check the loop.

Degradation

Here is how the system responds when a provider degrades, with no manual intervention at any step:

  • T=0: Steady state. Model A score 0.85, Model B score 0.72. Traffic split: A 90%, B 10%.
  • T=1h: Provider degrades Model A silently.
  • T=2h: Users report poor quality via feedback and quality drops to 0.55. Model A score: (0.9 × 0.4) + (0.55 × 0.4) + (0.3 × 0.2) = 0.64, against Model B at 0.78. Traffic shifts to B.
  • T=3h: Model A quality stays below the min_quality threshold (0.7). Model A is excluded; Model B takes 100% of traffic.

Model A is excluded entirely. No alert needed, no on-call page, no configuration change.
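A minimal sketch of that exclusion step, assuming min_quality is the 0.7 threshold from the timeline; the candidate shape is illustrative, not Floopy's internal representation.

    // Sketch: drop candidates whose quality sits below min_quality, then route
    // to the highest-scoring survivor. The Candidate shape is illustrative.

    interface Candidate {
      model: string;
      quality: number;          // combined 0.0–1.0 quality signal
      performanceScore: number; // success / quality / cost composite
    }

    function pickModel(candidates: Candidate[], minQuality = 0.7): Candidate | undefined {
      return candidates
        .filter((c) => c.quality >= minQuality) // T=3h: Model A drops below 0.7 and falls out
        .sort((a, b) => b.performanceScore - a.performanceScore)[0];
    }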

Recovery

The system also recovers automatically when a provider fixes their issues:

  • T=24h: Provider fixes Model A.
  • T=25h: Exploration traffic (10%) retests Model A.
  • T=26h: New feedback comes back at 0.88 and the performance score climbs.
  • T=30h: Model A is back on top; traffic returns to A: 90%, B: 10%.

The exploration phase is what makes recovery possible. Without it, a model excluded at T=3h would stay excluded forever. The 10% exploration rate is the insurance policy that keeps the system self-healing.
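Here is a sketch of how that exploration slice can be carved out, assuming a flat 10% rate as in the timeline. The routing function is ours for illustration, not a description of Floopy's internals.

    // Sketch: reserve ~10% of traffic to retest excluded models so a fixed
    // provider can earn its way back in. Names are illustrative.

    function routeRequest(
      eligibleBestFirst: string[], // models above min_quality, best score first
      excluded: string[],          // models currently below the threshold
      explorationRate = 0.1,
    ): string {
      if (excluded.length > 0 && Math.random() < explorationRate) {
        // Exploration traffic: retest a quarantined model and collect fresh feedback.
        return excluded[Math.floor(Math.random() * excluded.length)];
      }
      // Exploitation: the best currently-scoring model serves the request.
      return eligibleBestFirst[0];
    }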

The two halves of that scenario only work because all four signals are live. Session NPS is what detects the degradation (users reporting poor quality). Auto feedback confirms it quickly and densely. Admin rating can intervene if operators catch it before the users do. The benchmark floor prevents the router from over-reacting to a single noisy session. Pull any one signal out and the loop limps.

What happens with only one signal

The same scenario, with the router rewritten to use just one signal each time:

  • Benchmarks only. Model A’s benchmark score didn’t change — the provider silently degraded the deployed model, not the published eval number. The router never notices. Traffic stays 90/10. Users leave.
  • Auto feedback only. The judge notices the regression, but the judge has systematic biases — it rewards long confident outputs. If Model A’s degradation makes it more verbose (a common failure mode for quantized models), the judge might reward the broken version. Traffic stays or increases to A.
  • Session NPS only. Users do detect the regression — but the first NPS scores roll in two hours late, on a sparse sample. The router either reacts on 5 sessions (overcorrection) or waits for 50 (too slow). Either way, users get hurt before the router acts.
  • Admin rating only. Operators might notice, if they happen to be looking at the right requests in the dashboard that day. Most teams don’t, and even when they do, one admin thumb on a 50k-request day is too thin to move the router.

The multi-source version, with the weights above, catches the regression at T=2h and excludes the model at T=3h. That’s not because Floopy’s math is magic. It’s because four signals covering each other’s failure modes is the only configuration where the router sees the problem at all.

Try it on the free tier

The four-signal loop is live on every Floopy account, including the Free tier. You don’t need to upgrade to see the weighting function operate against your own traffic — point your OpenAI SDK’s baseURL at https://api.floopy.ai/v1, start sending requests with a floopy-session-id header, and the three weighting regimes (benchmark → auto → session) unlock in sequence as your data accumulates.
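With the Node OpenAI SDK, that setup looks roughly like the sketch below. The model name and session id are placeholders, and the environment variable is an assumption about where you keep your gateway key.

    // Point the OpenAI SDK at the Floopy gateway and tag every request with a
    // session id so later NPS submissions can be joined back to these calls.
    import OpenAI from "openai";

    const client = new OpenAI({
      apiKey: process.env.FLOOPY_API_KEY, // assumption: gateway key kept in an env var
      baseURL: "https://api.floopy.ai/v1",
      defaultHeaders: { "floopy-session-id": "session-1234" }, // placeholder session id
    });

    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini", // placeholder; use whichever model your routes allow
      messages: [{ role: "user", content: "Where is my order?" }],
    });

    console.log(completion.choices[0].message.content);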

The Feedback API (POST /v1/feedback, for submitting session-level NPS scores) is included on the Starter plan at 500 submissions/month, which covers most teams during onboarding. Higher-volume access (unlimited on Pro, enterprise-tier beyond) follows the same shape.
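A submission against that endpoint looks roughly like the sketch below. The payload field names and the bearer-auth header are assumptions on our part; check the Feedback docs for the exact schema.

    // Submit a session-level NPS score (0–10) for the session id used above.
    // Field names in the body are assumptions; see the Feedback docs for the schema.
    const res = await fetch("https://api.floopy.ai/v1/feedback", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.FLOOPY_API_KEY}`, // assumption: bearer auth
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        sessionId: "session-1234", // assumed field name, matching the floopy-session-id header
        score: 9,                  // the 0–10 NPS value the end user submitted
      }),
    });

    if (!res.ok) {
      throw new Error(`Feedback submission failed: ${res.status}`);
    }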

Start at app.floopy.ai, read the Feedback docs, or see the full pricing.

Four signals. One loop. One router that actually learns.