Feedback
Floopy collects quality feedback on LLM responses and uses it as the primary signal for routing optimization. Every feedback submission feeds directly into Smart Cost Routing and Smart Selector, creating a self-improving loop: the more feedback you submit, the better your routing decisions become.
Without feedback, routing relies on public benchmark scores. With feedback, routing learns what actually works for your prompts, your users, and your quality bar.
Architecture
```mermaid
graph LR
    S1[Session feedback<br/>POST /v1/feedback] --> W[Dynamic weighting]
    S2[Auto feedback<br/>LLM-as-judge] --> W
    S3[Manual rating<br/>dashboard] --> W
    S4[Benchmark scores<br/>MMLU / HumanEval / …] --> W
    W --> R[Router]
    R --> CH[(ClickHouse<br/>session_feedback)]
    CH -.->|60s–5min TTL| RD[(Redis<br/>aggregated scores)]
    RD --> D[Routing decision]
```

Four signal sources feed a dynamic weighting function. The weighting output drives the router, which persists session feedback to ClickHouse and reads pre-aggregated scores from Redis on the hot path. The router never queries ClickHouse during a request — it reads from Redis, and Redis is refreshed in the background.
Collecting Feedback
Submit session-level feedback via the feedback endpoint. Use the same API key you use for chat requests. The `session_id` is the value you send in the `floopy-session-id` header during chat requests.
Endpoint: `POST /v1/feedback`
```javascript
await fetch("https://api.floopy.ai/v1/feedback", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FLOOPY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ session_id: "sess_abc123", score: 8, useful: true }),
});
```

```python
import os

import requests

requests.post(
    "https://api.floopy.ai/v1/feedback",
    headers={"Authorization": f"Bearer {os.environ['FLOOPY_API_KEY']}"},
    json={"session_id": "sess_abc123", "score": 8, "useful": True},
)
```

```bash
curl https://api.floopy.ai/v1/feedback \
  -H "Authorization: Bearer $FLOOPY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "sess_abc123",
    "score": 8,
    "useful": true
  }'
```

Fields
| Field | Type | Required | Description |
|---|---|---|---|
| `session_id` | string | Yes | The session ID sent in the `floopy-session-id` header during chat requests |
| `score` | integer | Yes | NPS-style score from 0 to 10. 0-6 = detractor, 7-8 = passive, 9-10 = promoter |
| `useful` | boolean | Yes | Whether the conversation was useful to the end user |
Feedback Storage
Session feedback is stored in the ClickHouse session_feedback table, linked to all requests in that session via session_id. There is no need to update individual requests — the feedback applies to the entire session automatically.
This enables aggregation across multiple axes:
- Per model: Average NPS of `gpt-4o` vs `claude-sonnet-4-6` for your workload
- Per provider: Aggregate quality across all models from a single provider
- Per prompt pattern: Quality trends for specific types of requests
- Per time window: Detect quality degradation over hours, days, or weeks
For fast access during routing decisions, aggregated feedback scores are cached in Redis with a 60-second to 5-minute TTL depending on signal. This means the routing engine never queries ClickHouse on the hot path — it reads pre-computed scores from cache, and the cache refreshes automatically in the background.
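To make the hot-path read concrete, here is a minimal sketch in Python. The key layout (`feedback:agg:<model>`), the JSON payload shape, and the benchmark fallback are illustrative assumptions, not Floopy's actual schema.

```python
import json

import redis

r = redis.Redis()  # assumed connection; deployment details will differ

def cached_quality(model: str, benchmark_prior: float) -> float:
    """Read the pre-aggregated quality score for a model from Redis.

    The routing hot path never queries ClickHouse: on a cache miss we
    fall back to the benchmark prior and let the background refresher
    repopulate the key. The key layout here is hypothetical.
    """
    raw = r.get(f"feedback:agg:{model}")
    if raw is None:
        return benchmark_prior  # cache expired or no feedback yet
    return float(json.loads(raw)["quality"])
```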
How Feedback Drives Smart Cost Routing
Smart Cost Routing uses feedback as 40% of its model selection formula. When deciding which cheaper model can handle a simple prompt, the system computes a performance score for each candidate:
```
performance_score = (success_rate × 0.4) + (quality × 0.4) + (cost_savings × 0.2)
```

Where (a code sketch follows these definitions):

- `success_rate` = `successful_requests / total_requests`. An HTTP 2xx counts as success. A model that returns errors 5% of the time gets a success rate of 0.95.
- `quality` = a weighted combination of multiple feedback signals (see below), normalized to a 0.0–1.0 scale. When no feedback exists for a model, the system falls back to its benchmark quality score derived from public evaluations.
- `cost_savings` = `1.0 - (candidate_cost / default_cost)`, clamped to 0.0–1.0. If the default model costs $15/M tokens and the candidate costs $3/M tokens, `cost_savings = 1.0 - (3/15) = 0.80`.
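As a sanity check on the arithmetic, here is a minimal sketch of the formula; the function signature and variable names are illustrative, not Floopy's internals.

```python
def performance_score(successful: int, total: int, quality: float,
                      candidate_cost: float, default_cost: float) -> float:
    """Smart Cost Routing score: 40% reliability, 40% quality,
    20% cost savings. Costs are in $/M tokens; quality is 0.0-1.0."""
    success_rate = successful / total if total else 0.0
    cost_savings = min(max(1.0 - candidate_cost / default_cost, 0.0), 1.0)
    return success_rate * 0.4 + quality * 0.4 + cost_savings * 0.2

# 95% success, quality 0.70, $3 candidate vs $15 default:
# 0.95*0.4 + 0.70*0.4 + 0.80*0.2 = 0.38 + 0.28 + 0.16
print(performance_score(95, 100, 0.70, 3.0, 15.0))  # ≈ 0.82
```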
Quality Signal: Multi-Source with Dynamic Weights
The quality signal combines four sources with dynamic weights that adapt based on data availability:
| Source | Description | Confidence |
|---|---|---|
| Session feedback | NPS score (0-10) and usefulness from end users via POST /v1/feedback | Highest when available |
| Auto feedback | Heuristic + LLM scoring computed automatically per request | Automated baseline |
| Manual rating | Admin thumbs up/down on individual requests in the dashboard | High confidence, low volume |
| Benchmark scores | Public evaluation scores (MMLU, HumanEval, etc.) | Fallback when no feedback exists |
The weights shift dynamically based on what data is available:
| Condition | Session | Auto | Manual | Benchmark |
|---|---|---|---|---|
| Session feedback exists (>10 sessions) | 0.5 | 0.3 | 0.1 | 0.1 |
| Only auto feedback exists (>10 requests) | — | 0.5 | 0.2 | 0.3 |
| Only benchmark data exists | — | — | — | 1.0 |
This means the system starts with benchmark scores on day zero, transitions to auto feedback as requests accumulate, and converges on session feedback once end users start submitting NPS scores. The more real-world signal available, the less the system relies on synthetic benchmarks.
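Sketched as a lookup, assuming the thresholds in the table above; the fall-through ordering and function shape are an illustrative reading of that table.

```python
def quality_weights(n_sessions: int, n_auto: int) -> dict[str, float]:
    """Return (session, auto, manual, benchmark) weights per the table.
    The >10 thresholds mirror the documented cutoffs; everything else
    here is illustrative."""
    if n_sessions > 10:  # session feedback exists
        return {"session": 0.5, "auto": 0.3, "manual": 0.1, "benchmark": 0.1}
    if n_auto > 10:      # only auto feedback exists
        return {"session": 0.0, "auto": 0.5, "manual": 0.2, "benchmark": 0.3}
    return {"session": 0.0, "auto": 0.0, "manual": 0.0, "benchmark": 1.0}
```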
Why 40/40/20?
The equal weighting between success rate and quality reflects a core design principle: a model that never fails but produces mediocre output is no better than a model that produces excellent output but fails frequently. Both matter equally. Cost savings gets 20% because it is the reason Smart Cost Routing exists — but it should never override reliability or quality.
Minimum Request Threshold
A model needs at least 10 requests before the system switches from exploration to exploitation mode. Below 10 requests, the performance score is too noisy to trust. During this phase, the model receives exploration traffic to build up a reliable score.
How Feedback Drives Smart Selector
Smart Selector uses feedback at a more granular level. Instead of averaging all four dimensions into a single quality score, it applies configurable weights to each dimension independently:
```
composite = relevance × w₁ + coherence × w₂ + helpfulness × w₃ + safety × w₄ + cost_efficiency × w₅
```

Cost efficiency is calculated as `(min_cost / variant_cost) × 100`, where `min_cost` is the cheapest model in the variant set. This rewards cheaper models proportionally rather than using a binary threshold.
Weight Presets
| Preset | Relevance | Coherence | Helpfulness | Safety | Cost Efficiency |
|---|---|---|---|---|---|
| Balanced | 25% | 10% | 30% | 15% | 20% |
| Quality-First | 30% | 15% | 35% | 15% | 5% |
| Cost-Optimized | 15% | 5% | 20% | 10% | 50% |
| Safety-Critical | 15% | 5% | 15% | 50% | 15% |
Choose a preset that matches your use case. A customer support chatbot benefits from Safety-Critical. An internal code generation tool benefits from Quality-First. A high-volume classification pipeline benefits from Cost-Optimized.
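For concreteness, a minimal sketch of the composite under the Balanced preset. It assumes each dimension is scored 0–100 (consistent with the cost-efficiency formula above); the names and dict layout are illustrative.

```python
BALANCED = {"relevance": 0.25, "coherence": 0.10, "helpfulness": 0.30,
            "safety": 0.15, "cost_efficiency": 0.20}

def composite(scores: dict[str, float],
              weights: dict[str, float] = BALANCED) -> float:
    """Weighted sum over the five dimensions. Assumes each score is
    0-100, matching cost_efficiency = (min_cost / variant_cost) * 100."""
    return sum(scores[dim] * w for dim, w in weights.items())

# The cheapest model in the variant set gets cost_efficiency == 100:
print(composite({"relevance": 80, "coherence": 90, "helpfulness": 75,
                 "safety": 95, "cost_efficiency": 100}))  # 85.75
```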
Bandit Algorithms
Smart Selector uses multi-armed bandit algorithms to balance exploration (testing models) and exploitation (using the best model). Three strategies are available, sketched in code after this list:

- Epsilon-Greedy: Exploits the best model `(1 - ε)` of the time, explores a random model `ε` of the time. Simple and predictable. Good default.
- UCB1 (Upper Confidence Bound): Selects the model that maximizes `score + √(2 × ln(total) / n)`, where `total` is total requests across all models and `n` is requests for this model. The square root term is a confidence bonus that shrinks as a model gets more data. Models with fewer observations get a larger exploration bonus, ensuring under-tested models are tried.
- Thompson Sampling: Maintains a Beta distribution for each model’s expected quality, parameterized by successes and failures. Samples from each distribution and picks the highest sample. Naturally balances exploration and exploitation — uncertain models (wide distributions) occasionally sample high, getting tested. Converges faster than UCB1 in practice.
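All three strategies fit in a few lines. The sketch below assumes quality scores normalized to 0–1 and simple dict bookkeeping; the function names and the +1 Beta prior are illustrative choices, not Floopy's implementation.

```python
import math
import random

def epsilon_greedy(scores: dict[str, float], eps: float = 0.1) -> str:
    """Exploit the best model with probability 1 - eps, else explore."""
    if random.random() < eps:
        return random.choice(list(scores))
    return max(scores, key=scores.get)

def ucb1(scores: dict[str, float], counts: dict[str, int]) -> str:
    """Pick the argmax of score + sqrt(2 * ln(total) / n)."""
    total = sum(counts.values())
    def bound(model: str) -> float:
        n = counts[model]
        if n == 0:
            return float("inf")  # untested models are tried first
        return scores[model] + math.sqrt(2 * math.log(total) / n)
    return max(scores, key=bound)

def thompson(successes: dict[str, int], failures: dict[str, int]) -> str:
    """Sample each model's Beta posterior; the highest sample wins."""
    return max(successes,
               key=lambda m: random.betavariate(successes[m] + 1,
                                                failures[m] + 1))
```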
The Self-Correcting Cycle
Feedback creates a closed loop where routing decisions improve automatically over time. Three mechanisms work together to make this happen.
Phase 1: Benchmark Prior (Cold Start)
When a new model is added, the system does not start from zero. It initializes quality scores from public benchmark data — MMLU, HumanEval, SWE-bench, MT-Bench, and others. These are not guesses; they are computed from real evaluation datasets.
A model that scores 92% on HumanEval starts strong for code generation tasks on day zero. A model with top MT-Bench scores starts strong for conversational tasks. This means routing is reasonable from the first request, before any feedback exists.
Phase 2: Exploration (Discovery)
Ten percent of requests are routed to the least-tested model in the candidate set. This exploration phase serves three purposes:
- New models get tested. A model added yesterday receives traffic immediately.
- Degraded models get retested. A model that was poor last month might have been updated.
- Better alternatives are discovered. A cheaper model might outperform the current best for your specific workload.
Model selection during exploration is deterministic round-robin among under-sampled models (those with fewer than 10 requests in the current window). This ensures even coverage rather than random noise.
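A minimal sketch of that rule, assuming a persistent cursor that advances once per exploration request; the signature and the `None` fallback are illustrative.

```python
def pick_exploration_model(candidates: list[str], counts: dict[str, int],
                           cursor: int, threshold: int = 10) -> str | None:
    """Deterministic round-robin over under-sampled models, i.e. those
    with fewer than `threshold` requests in the current window. The
    persistent `cursor` is an illustrative mechanism, not Floopy's
    actual bookkeeping."""
    under_sampled = [m for m in candidates if counts.get(m, 0) < threshold]
    if not under_sampled:
        return None  # all models have enough data in this window
    return under_sampled[cursor % len(under_sampled)]
```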
Phase 3: Exploitation with Feedback (Convergence)
Ninety percent of requests go to the highest-scoring model. As feedback accumulates, real user signals replace the benchmark priors. The quality score transitions through three stages:
- Benchmark only — no feedback data yet, quality comes entirely from public evaluations.
- Auto + benchmark — after 10 requests with auto feedback, the system blends automated scoring with benchmarks.
- Session + auto + benchmark — after 10 sessions with NPS feedback, session feedback becomes the dominant signal (weight 0.5).
The transition is automatic and gradual. No configuration change required.
Degradation Scenario
Here is how the system responds when a provider degrades, with no manual intervention at any step:
| Time | Event | Model A Score | Model B Score | Traffic Split |
|---|---|---|---|---|
| T=0 | Steady state | 0.85 | 0.72 | A: 90%, B: 10% |
| T=1h | Provider degrades Model A silently | — | — | — |
| T=2h | Users report poor quality via feedback. Quality drops to 0.55. Score: (0.9 × 0.4) + (0.55 × 0.4) + (0.3 × 0.2) = 0.64 | 0.64 | 0.78 | Traffic shifts to B |
| T=3h | Model A quality stays below min_quality threshold (0.7) | Excluded | 0.78 | B: 100% |
Model A is excluded entirely. No alert needed, no on-call page, no configuration change.
Recovery Scenario
The system also recovers automatically when a provider fixes their issues:
| Time | Event |
|---|---|
| T=24h | Provider fixes Model A |
| T=25h | Exploration traffic (10%) retests Model A |
| T=26h | New feedback comes back at 0.88, performance score climbs |
| T=30h | Model A back on top, traffic returns to A: 90%, B: 10% |
The exploration phase is what makes recovery possible. Without it, a model excluded at T=3h would stay excluded forever. The 10% exploration rate is the insurance policy that keeps the system self-healing.
Viewing Feedback Data
Feedback data surfaces in three places in the dashboard:
- Requests page: Each request shows its feedback scores inline, so you can correlate quality with model, latency, and cost.
- Smart Selector detail: The experiment view shows per-variant feedback distributions, making it clear which model your users prefer.
- Smart Cost Savings widget: Compares feedback scores between cheap-routed requests and default-model requests, proving that cost savings are not coming at the expense of quality.
When you don’t have end-user feedback yet
If your product doesn’t collect NPS or user feedback today, Floopy still improves routing from day one. Auto feedback (LLM-as-judge on every response) combined with public benchmark scores starts the learning loop immediately. Session NPS is the strongest signal, but the system works without it — you get the optimization benefit even before building a feedback collection pipeline. The weight transition table above shows how signals ramp in as data accumulates.
Best Practices
- Collect session feedback from end users. Even a simple NPS prompt (“How would you rate this conversation? 0-10”) after a session gives the routing engine high-confidence signal. Aim for feedback on at least 10% of sessions.
- Always include both `score` and `useful`. Both fields are required and give the routing engine the full picture — NPS measures satisfaction while `useful` captures practical value.
- Automate feedback collection. Use LLM-as-judge (a cheap model evaluating the output of an expensive model) or heuristic scoring (response length, format compliance, user engagement signals) to generate auto feedback at scale without manual review. This serves as the automated baseline while session feedback builds up.
- Monitor the exploration rate. If exploration is consistently routing to a model that scores poorly, consider removing that model from the candidate set.
- Start with the Balanced preset, then adjust. After 1,000+ requests with feedback, review the quality signal breakdown to see which weights match your actual quality priorities.
Migration from per-request feedback systems
If you’re migrating from a per-request feedback system (Helicone, LangSmith, PromptLayer), the translation is straightforward: instead of rating each individual response, rate the end-to-end conversation. One NPS score per session covers every routing decision in that session — multi-turn, tool calls, chained reasoning. You can continue capturing per-request data in parallel for debugging, but the routing engine consumes session-level signal for optimization. Existing per-request ratings can be migrated by aggregating them under each session_id; the first one you POST under a session_id wins.
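If your old system stored per-request thumbs, a hedged sketch of the aggregation; the approval-rate-to-NPS mapping is an assumption for illustration, not a prescribed formula.

```python
from statistics import mean

def session_payload(session_id: str, thumbs: list[bool]) -> dict:
    """Collapse per-request thumbs (True = up) into one session-level
    payload for POST /v1/feedback. Mapping the approval rate onto the
    0-10 NPS scale is an illustrative choice; use whatever mapping
    matches your old rating semantics."""
    approval = mean(thumbs)
    return {
        "session_id": session_id,
        "score": round(approval * 10),
        "useful": approval >= 0.5,
    }

# session_payload("sess_abc123", [True, True, False])
# -> {"session_id": "sess_abc123", "score": 7, "useful": True}
```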
Availability
Feedback collection is available on all plans. Feedback-driven routing (Smart Cost Routing and Smart Selector integration) is available on the Pro plan.
Continuous Evaluation
Even when end-user feedback is thin, the routing engine keeps its per-(subject, complexity) Model Ranking fresh via a nightly continuous-evaluation pass. A Coolify cron job (continuous-eval, daily at 04:00 UTC) samples up to 50 recent (request, response) pairs from each of the 18 buckets (6 subjects × 3 complexities) and asks an LLM-as-judge to grade each response on a 0–100 scale. Scores are written to the continuous_eval_scores table and rolled into continuous_eval_latest for keyed lookups.
The dashboard ranking query blends the continuous-eval score with user feedback at read time:
```
final_score = (1 − CONTINUOUS_EVAL_WEIGHT) × user_feedback_score + CONTINUOUS_EVAL_WEIGHT × continuous_eval_score
```

The default weight is 0.3 (30% continuous-eval, 70% user feedback). When only one side is present the final score falls back to whichever exists; when neither exists the ranker falls back to the benchmark prior.
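The fallback logic reads naturally as a short function. A sketch assuming both scores are already normalized to a common scale (the docs give NPS on 0–10 and continuous-eval on 0–100, so normalization is left to the caller); the signature is illustrative.

```python
CONTINUOUS_EVAL_WEIGHT = 0.3  # documented default

def final_score(user_feedback: float | None,
                continuous_eval: float | None,
                benchmark_prior: float) -> float:
    """Read-time blend with the documented fallbacks. Assumes both
    scores are normalized to the same scale before blending."""
    if user_feedback is not None and continuous_eval is not None:
        return ((1 - CONTINUOUS_EVAL_WEIGHT) * user_feedback
                + CONTINUOUS_EVAL_WEIGHT * continuous_eval)
    if user_feedback is not None:
        return user_feedback
    if continuous_eval is not None:
        return continuous_eval
    return benchmark_prior
```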
Only requests from organisations that have shares_to_shared_pool = true are sampled, matching the privacy boundary used by the Shared Intelligence Pool. The judge model is configurable via CONTINUOUS_EVAL_JUDGE_MODEL (default meta-llama/Llama-3.3-70B-Instruct-Turbo via Together API).
The admin dashboard (/admin/continuous-eval) showing run history and per-bucket score trends is planned for v3.5 — this release ships the scoring pipeline only.