Feedback

Floopy collects quality feedback on LLM responses and uses it as the primary signal for routing optimization. Every feedback submission feeds directly into Smart Cost Routing and Smart Selector, creating a self-improving loop: the more feedback you submit, the better your routing decisions become.

Without feedback, routing relies on public benchmark scores. With feedback, routing learns what actually works for your prompts, your users, and your quality bar.

Architecture

graph LR
    S1[Session feedback<br/>POST /v1/feedback] --> W[Dynamic weighting]
    S2[Auto feedback<br/>LLM-as-judge] --> W
    S3[Manual rating<br/>dashboard] --> W
    S4[Benchmark scores<br/>MMLU / HumanEval / …] --> W
    W --> R[Router]
    R --> CH[(ClickHouse<br/>session_feedback)]
    CH -.->|60s–5min TTL| RD[(Redis<br/>aggregated scores)]
    RD --> D[Routing decision]

Four signal sources feed a dynamic weighting function. The weighting output drives the router, which persists session feedback to ClickHouse and reads pre-aggregated scores from Redis on the hot path. The router never queries ClickHouse during a request — it reads from Redis, and Redis is refreshed in the background.

Collecting Feedback

Submit session-level feedback via the feedback endpoint. Use the same API key you use for chat requests. The session_id is the value you send in the floopy-session-id header during chat requests.

Endpoint: POST /v1/feedback

await fetch("https://api.floopy.ai/v1/feedback", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.FLOOPY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ session_id: "sess_abc123", score: 8, useful: true }),
});

Fields

| Field | Type | Required | Description |
|---|---|---|---|
| session_id | string | Yes | The session ID sent in the floopy-session-id header during chat requests |
| score | integer | Yes | NPS-style score from 0 to 10. 0–6 = detractor, 7–8 = passive, 9–10 = promoter |
| useful | boolean | Yes | Whether the conversation was useful to the end user |

Feedback Storage

Session feedback is stored in the ClickHouse session_feedback table, linked to all requests in that session via session_id. There is no need to update individual requests — the feedback applies to the entire session automatically.

This enables aggregation across multiple axes:

  • Per model: Average NPS of gpt-4o vs claude-sonnet-4-6 for your workload
  • Per provider: Aggregate quality across all models from a single provider
  • Per prompt pattern: Quality trends for specific types of requests
  • Per time window: Detect quality degradation over hours, days, or weeks

For fast access during routing decisions, aggregated feedback scores are cached in Redis with a 60-second to 5-minute TTL depending on signal. This means the routing engine never queries ClickHouse on the hot path — it reads pre-computed scores from cache, and the cache refreshes automatically in the background.

How Feedback Drives Smart Cost Routing

Smart Cost Routing uses feedback as 40% of its model selection formula. When deciding which cheaper model can handle a simple prompt, the system computes a performance score for each candidate:

performance_score = (success_rate × 0.4) + (quality × 0.4) + (cost_savings × 0.2)

Where:

  • success_rate = successful_requests / total_requests. An HTTP 2xx counts as success. A model that returns errors 5% of the time gets a success rate of 0.95.
  • quality = a weighted combination of multiple feedback signals (see below), normalized to a 0.0–1.0 scale. When no feedback exists for a model, the system falls back to its benchmark quality score derived from public evaluations.
  • cost_savings = 1.0 - (candidate_cost / default_cost), clamped to 0.0–1.0. If the default model costs $15/M tokens and the candidate costs $3/M tokens, cost_savings = 1.0 - (3/15) = 0.80.
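The formula above can be sketched in a few lines. This is a hypothetical illustration, not Floopy's internal code; the field names (`successfulRequests`, `costPerMTok`, and so on) are assumptions:

```javascript
// Hypothetical sketch of the Smart Cost Routing performance score.
function performanceScore(candidate, defaultCostPerMTok) {
  const successRate = candidate.successfulRequests / candidate.totalRequests;
  // Clamp cost savings to 0.0–1.0 (a candidate pricier than the default scores 0).
  const costSavings = Math.min(
    1,
    Math.max(0, 1 - candidate.costPerMTok / defaultCostPerMTok)
  );
  return successRate * 0.4 + candidate.quality * 0.4 + costSavings * 0.2;
}

// Worked example: 95% success, quality 0.55, $3/M candidate vs $15/M default.
const score = performanceScore(
  { successfulRequests: 95, totalRequests: 100, quality: 0.55, costPerMTok: 3 },
  15
);
// 0.95 × 0.4 + 0.55 × 0.4 + 0.80 × 0.2 = 0.76
```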

Quality Signal: Multi-Source with Dynamic Weights

The quality signal combines four sources with dynamic weights that adapt based on data availability:

| Source | Description | Confidence |
|---|---|---|
| Session feedback | NPS score (0–10) and usefulness from end users via POST /v1/feedback | Highest when available |
| Auto feedback | Heuristic + LLM scoring computed automatically per request | Automated baseline |
| Manual rating | Admin thumbs up/down on individual requests in the dashboard | High confidence, low volume |
| Benchmark scores | Public evaluation scores (MMLU, HumanEval, etc.) | Fallback when no feedback exists |

The weights shift dynamically based on what data is available:

| Condition | Session | Auto | Manual | Benchmark |
|---|---|---|---|---|
| Session feedback exists (>10 sessions) | 0.5 | 0.3 | 0.1 | 0.1 |
| Only auto feedback exists (>10 requests) | — | 0.5 | 0.2 | 0.3 |
| Only benchmark data exists | — | — | — | 1.0 |

This means the system starts with benchmark scores on day zero, transitions to auto feedback as requests accumulate, and converges on session feedback once end users start submitting NPS scores. The more real-world signal available, the less the system relies on synthetic benchmarks.
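The weight selection described above can be sketched as a simple lookup on data availability. The function name, thresholds as strict `>10` comparisons, and zero-filled columns are assumptions for illustration:

```javascript
// Hypothetical sketch of the dynamic weighting table. Weight values come from
// the table above; the function shape is illustrative.
function qualityWeights(sessionCount, autoCount) {
  if (sessionCount > 10) {
    return { session: 0.5, auto: 0.3, manual: 0.1, benchmark: 0.1 };
  }
  if (autoCount > 10) {
    return { session: 0, auto: 0.5, manual: 0.2, benchmark: 0.3 };
  }
  // Day zero: no feedback yet, rely entirely on the benchmark prior.
  return { session: 0, auto: 0, manual: 0, benchmark: 1.0 };
}
```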

Why 40/40/20?

The equal weighting between success rate and quality reflects a core design principle: a model that never fails but produces mediocre output is no better than a model that produces excellent output but fails frequently. Both matter equally. Cost savings gets 20% because it is the reason Smart Cost Routing exists — but it should never override reliability or quality.

Minimum Request Threshold

A model needs at least 10 requests before the system switches from exploration to exploitation mode. Below 10 requests, the performance score is too noisy to trust. During this phase, the model receives exploration traffic to build up a reliable score.

How Feedback Drives Smart Selector

Smart Selector uses feedback at a more granular level. Instead of averaging all four dimensions into a single quality score, it applies configurable weights to each dimension independently:

composite = relevance × w₁ + coherence × w₂ + helpfulness × w₃ + safety × w₄ + cost_efficiency × w₅

Cost efficiency is calculated as (min_cost / variant_cost) × 100, where min_cost is the cheapest model in the variant set. This rewards cheaper models proportionally rather than using a binary threshold.
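Putting the composite formula and the cost-efficiency term together, a minimal sketch might look like the following. It assumes all five dimensions share a 0–100 scale, consistent with the cost-efficiency formula; the function and parameter names are illustrative:

```javascript
// Hypothetical sketch of the Smart Selector composite score.
function compositeScore(dims, variantCost, minCost, w) {
  // Cheapest variant in the set scores 100; pricier variants score proportionally less.
  const costEfficiency = (minCost / variantCost) * 100;
  return (
    dims.relevance * w.relevance +
    dims.coherence * w.coherence +
    dims.helpfulness * w.helpfulness +
    dims.safety * w.safety +
    costEfficiency * w.costEfficiency
  );
}

// Balanced preset on a variant costing twice the cheapest in the set:
const c = compositeScore(
  { relevance: 80, coherence: 90, helpfulness: 70, safety: 95 },
  6, // variant cost ($/M tokens, illustrative)
  3, // cheapest variant cost
  { relevance: 0.25, coherence: 0.1, helpfulness: 0.3, safety: 0.15, costEfficiency: 0.2 }
);
// cost_efficiency = (3 / 6) × 100 = 50
```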

Weight Presets

| Preset | Relevance | Coherence | Helpfulness | Safety | Cost Efficiency |
|---|---|---|---|---|---|
| Balanced | 25% | 10% | 30% | 15% | 20% |
| Quality-First | 30% | 15% | 35% | 15% | 5% |
| Cost-Optimized | 15% | 5% | 20% | 10% | 50% |
| Safety-Critical | 15% | 5% | 15% | 50% | 15% |

Choose a preset that matches your use case. A customer support chatbot benefits from Safety-Critical. An internal code generation tool benefits from Quality-First. A high-volume classification pipeline benefits from Cost-Optimized.

Bandit Algorithms

Smart Selector uses multi-armed bandit algorithms to balance exploration (testing models) and exploitation (using the best model). Three strategies are available:

  • Epsilon-Greedy: Exploits the best model (1 - ε) of the time, explores a random model ε of the time. Simple and predictable. Good default.
  • UCB1 (Upper Confidence Bound): Selects the model that maximizes score + √(2 × ln(total) / n), where total is total requests across all models and n is requests for this model. The square root term is a confidence bonus that shrinks as a model gets more data. Models with fewer observations get a larger exploration bonus, ensuring under-tested models are tried.
  • Thompson Sampling: Maintains a Beta distribution for each model’s expected quality, parameterized by successes and failures. Samples from each distribution and picks the highest sample. Naturally balances exploration and exploitation — uncertain models (wide distributions) occasionally sample high, getting tested. Converges faster than UCB1 in practice.
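The UCB1 rule above is compact enough to sketch directly. This is an illustrative implementation under the stated formula, not Floopy's code; arm fields are assumptions:

```javascript
// Hypothetical UCB1 sketch: pick the model maximizing
// score + sqrt(2 · ln(total) / n).
function ucb1(arms) {
  const total = arms.reduce((sum, a) => sum + a.pulls, 0);
  let best = null;
  let bestValue = -Infinity;
  for (const a of arms) {
    // An arm with zero pulls gets an infinite bonus, so it is always tried first.
    const value =
      a.pulls === 0
        ? Infinity
        : a.score + Math.sqrt((2 * Math.log(total)) / a.pulls);
    if (value > bestValue) {
      bestValue = value;
      best = a;
    }
  }
  return best.model;
}
```

Note how the confidence bonus works: a model with score 0.75 and 5 pulls can beat a model with score 0.8 and 100 pulls, because the under-sampled model's bonus is much larger.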

The Self-Correcting Cycle

Feedback creates a closed loop where routing decisions improve automatically over time. Three mechanisms work together to make this happen.

Phase 1: Benchmark Prior (Cold Start)

When a new model is added, the system does not start from zero. It initializes quality scores from public benchmark data — MMLU, HumanEval, SWE-bench, MT-Bench, and others. These are not guesses; they are computed from real evaluation datasets.

A model that scores 92% on HumanEval starts strong for code generation tasks on day zero. A model with top MT-Bench scores starts strong for conversational tasks. This means routing is reasonable from the first request, before any feedback exists.

Phase 2: Exploration (Discovery)

Ten percent of requests are routed to the least-tested model in the candidate set. This exploration phase serves three purposes:

  1. New models get tested. A model added yesterday receives traffic immediately.
  2. Degraded models get retested. A model that was poor last month might have been updated.
  3. Better alternatives are discovered. A cheaper model might outperform the current best for your specific workload.

Model selection during exploration is deterministic round-robin among under-sampled models (those with fewer than 10 requests in the current window). This ensures even coverage rather than random noise.
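The under-sampled selection rule can be sketched as follows. The `< 10` threshold comes from the text; the tie-break by list position and the function shape are assumptions:

```javascript
// Hypothetical sketch of deterministic exploration: among models with fewer
// than 10 requests in the current window, pick the least-tested one.
function explorationPick(candidates) {
  const underSampled = candidates.filter((c) => c.windowRequests < 10);
  if (underSampled.length === 0) return null; // all warmed up: exploit instead
  // Ties break by position in the candidate list, keeping selection deterministic.
  return underSampled.reduce((least, c) =>
    c.windowRequests < least.windowRequests ? c : least
  ).model;
}
```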

Phase 3: Exploitation with Feedback (Convergence)

Ninety percent of requests go to the highest-scoring model. As feedback accumulates, real user signals replace the benchmark priors. The quality score transitions through three stages:

  1. Benchmark only — no feedback data yet, quality comes entirely from public evaluations.
  2. Auto + benchmark — after 10 requests with auto feedback, the system blends automated scoring with benchmarks.
  3. Session + auto + benchmark — after 10 sessions with NPS feedback, session feedback becomes the dominant signal (weight 0.5).

The transition is automatic and gradual. No configuration change required.

Degradation Scenario

Here is how the system responds when a provider degrades, with no manual intervention at any step:

| Time | Event | Model A Score | Model B Score | Traffic Split |
|---|---|---|---|---|
| T=0 | Steady state | 0.85 | 0.72 | A: 90%, B: 10% |
| T=1h | Provider degrades Model A silently | — | — | — |
| T=2h | Users report poor quality via feedback. Quality drops to 0.55. Score: (0.9 × 0.4) + (0.55 × 0.4) + (0.3 × 0.2) = 0.64 | 0.64 | 0.78 | Traffic shifts to B |
| T=3h | Model A quality stays below min_quality threshold (0.7) | Excluded | 0.78 | B: 100% |

Model A is excluded entirely. No alert needed, no on-call page, no configuration change.

Recovery Scenario

The system also recovers automatically when a provider fixes their issues:

| Time | Event |
|---|---|
| T=24h | Provider fixes Model A |
| T=25h | Exploration traffic (10%) retests Model A |
| T=26h | New feedback comes back at 0.88, performance score climbs |
| T=30h | Model A back on top; traffic returns to A: 90%, B: 10% |

The exploration phase is what makes recovery possible. Without it, a model excluded at T=3h would stay excluded forever. The 10% exploration rate is the insurance policy that keeps the system self-healing.

Viewing Feedback Data

Feedback data surfaces in three places in the dashboard:

  • Requests page: Each request shows its feedback scores inline, so you can correlate quality with model, latency, and cost.
  • Smart Selector detail: The experiment view shows per-variant feedback distributions, making it clear which model your users prefer.
  • Smart Cost Savings widget: Compares feedback scores between cheap-routed requests and default-model requests, proving that cost savings are not coming at the expense of quality.

When you don’t have end-user feedback yet

If your product doesn’t collect NPS or user feedback today, Floopy still improves routing from day one. Auto feedback (LLM-as-judge on every response) combined with public benchmark scores starts the learning loop immediately. Session NPS is the strongest signal, but the system works without it — you get the optimization benefit even before building a feedback collection pipeline. The weight transition table above shows how signals ramp in as data accumulates.

Best Practices

  • Collect session feedback from end users. Even a simple NPS prompt (“How would you rate this conversation? 0-10”) after a session gives the routing engine high-confidence signal. Aim for feedback on at least 10% of sessions.
  • Always include both score and useful. Both fields are required and give the routing engine the full picture — NPS measures satisfaction while useful captures practical value.
  • Automate feedback collection. Use LLM-as-judge (a cheap model evaluating the output of an expensive model) or heuristic scoring (response length, format compliance, user engagement signals) to generate auto feedback at scale without manual review. This serves as the automated baseline while session feedback builds up.
  • Monitor the exploration rate. If exploration is consistently routing to a model that scores poorly, consider removing that model from the candidate set.
  • Start with the Balanced preset, then adjust. After 1,000+ requests with feedback, review the quality signal breakdown to see which weights match your actual quality priorities.

Migration from per-request feedback systems

If you’re migrating from a per-request feedback system (Helicone, LangSmith, PromptLayer), the translation is straightforward: instead of rating each individual response, rate the end-to-end conversation. One NPS score per session covers every routing decision in that session — multi-turn, tool calls, chained reasoning. You can continue capturing per-request data in parallel for debugging, but the routing engine consumes session-level signal for optimization. Existing per-request ratings can be migrated by aggregating them under each session_id; the first one you POST under a session_id wins.
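One way to do that aggregation is sketched below. The shape of the legacy records and the `useful` heuristic (any rating of 7+ counts) are assumptions; adapt them to whatever your previous system stored:

```javascript
// Hypothetical migration sketch: collapse legacy per-request ratings (0–10)
// into one session-level submission by averaging per session_id.
function toSessionFeedback(requestRatings) {
  const bySession = new Map();
  for (const r of requestRatings) {
    const scores = bySession.get(r.session_id) ?? [];
    scores.push(r.score);
    bySession.set(r.session_id, scores);
  }
  return [...bySession.entries()].map(([session_id, scores]) => ({
    session_id,
    score: Math.round(scores.reduce((a, b) => a + b, 0) / scores.length),
    // Assumption: any passable per-request rating marks the session as useful.
    useful: scores.some((s) => s >= 7),
  }));
}
```

Each resulting object is ready to POST to /v1/feedback; remember that only the first submission per session_id is kept.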

Availability

Feedback collection is available on all plans. Feedback-driven routing (Smart Cost Routing and Smart Selector integration) is available on the Pro plan.

Continuous Evaluation

Even when end-user feedback is thin, the routing engine keeps its per-(subject, complexity) Model Ranking fresh via a nightly continuous-evaluation pass. A Coolify cron job (continuous-eval, daily at 04:00 UTC) samples up to 50 recent (request, response) pairs from each of the 18 buckets (6 subjects × 3 complexities) and asks an LLM-as-judge to grade each response on a 0–100 scale. Scores are written to the continuous_eval_scores table and rolled into continuous_eval_latest for keyed lookups.

The dashboard ranking query blends the continuous-eval score with user feedback at read time:

final_score = (1 − CONTINUOUS_EVAL_WEIGHT) × user_feedback_score
+ CONTINUOUS_EVAL_WEIGHT × continuous_eval_score

The default weight is 0.3 (30% continuous-eval, 70% user feedback). When only one side is present the final score falls back to whichever exists; when neither exists the ranker falls back to the benchmark prior.
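The blend and its fallbacks can be sketched as follows, assuming scores on a shared 0–100 scale; the function shape and `null`-for-missing convention are illustrative:

```javascript
// Hypothetical sketch of the read-time blend. CONTINUOUS_EVAL_WEIGHT defaults
// to 0.3 (30% continuous-eval, 70% user feedback).
const CONTINUOUS_EVAL_WEIGHT = 0.3;

function finalScore(userFeedback, continuousEval, benchmarkPrior) {
  if (userFeedback !== null && continuousEval !== null) {
    return (
      (1 - CONTINUOUS_EVAL_WEIGHT) * userFeedback +
      CONTINUOUS_EVAL_WEIGHT * continuousEval
    );
  }
  // Only one side present: use whichever exists.
  if (userFeedback !== null) return userFeedback;
  if (continuousEval !== null) return continuousEval;
  // Neither signal exists yet: fall back to the benchmark prior.
  return benchmarkPrior;
}
```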

Only requests from organizations that have shares_to_shared_pool = true are sampled, matching the privacy boundary used by the Shared Intelligence Pool. The judge model is configurable via CONTINUOUS_EVAL_JUDGE_MODEL (default meta-llama/Llama-3.3-70B-Instruct-Turbo via the Together API).

The admin dashboard (/admin/continuous-eval) showing run history and per-bucket score trends is planned for v3.5 — this release ships the scoring pipeline only.