Smart Cost Routing

Introduction

Smart Cost Routing analyzes prompt complexity and intent, then selects the cheapest model capable of handling the task well. It combines curated benchmark data, real-time feedback from your users, and multi-armed bandit exploration to continuously improve routing decisions.

The result: you pay less for the same output quality. Simple questions go to cheap models. Hard problems stay on your default model. Everything in between gets routed to the best value option, validated by benchmark scores and historical performance.

How It Works

Every request moves through six stages; the first five run before the provider is called, and the feedback loop runs after:

Request → Complexity & Intent Classification → Candidate Filtering → Scoring → Selection → Provider Call → Feedback Loop
  1. Complexity Classification scores the prompt on a 0-to-1 scale using weighted heuristics.
  2. Intent Detection identifies the task type (code, math, reasoning, general).
  3. Candidate Filtering removes models that are too expensive or too low quality.
  4. Scoring ranks remaining candidates using benchmark data, feedback history, and cost savings.
  5. Selection picks a model via exploit/explore strategy.
  6. Feedback Loop records outcome data that improves future decisions.

Complexity Classification

Every prompt receives a complexity score between 0 and 1 based on a weighted heuristic analysis. Each signal is normalized to a 0-1 range, then multiplied by its weight.

| Signal | Weight | How Measured |
|---|---|---|
| Message count | 30% | Normalized across 1-5 messages |
| System prompt length | 25% | Normalized across 0-300 tokens |
| Tool usage | 20% | Binary: 1 if tools are present, 0 otherwise |
| Code blocks | 15% | Binary: 1 if ``` detected in content, 0 otherwise |
| Token count | 5% | Normalized across 10-500 tokens |
| JSON output | 5% | Binary: 1 if response_format is set, 0 otherwise |

The weighted sum maps to one of three tiers:

| Tier | Score Range | Behavior |
|---|---|---|
| Simple | < 0.3 | Routes to cheapest viable model |
| Moderate | 0.3 - 0.7 | Routes to best-value model |
| Complex | > 0.7 | Uses default model (no routing) |

Complex prompts always bypass Smart Cost Routing entirely and go straight to your configured default model.
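The weighted-sum scoring above can be sketched as follows. The function names, signal extraction, and exact normalization here are illustrative assumptions, not the gateway's internal implementation:

```python
def clamp01(x: float) -> float:
    """Clip a normalized signal into the 0-1 range."""
    return max(0.0, min(1.0, x))

def complexity_score(
    message_count: int,
    system_prompt_tokens: int,
    has_tools: bool,
    has_code_blocks: bool,
    prompt_tokens: int,
    wants_json: bool,
) -> float:
    # (value, weight) pairs mirroring the signal table above
    signals = [
        (clamp01((message_count - 1) / 4), 0.30),        # 1-5 messages
        (clamp01(system_prompt_tokens / 300), 0.25),     # 0-300 tokens
        (1.0 if has_tools else 0.0, 0.20),               # binary
        (1.0 if has_code_blocks else 0.0, 0.15),         # binary
        (clamp01((prompt_tokens - 10) / 490), 0.05),     # 10-500 tokens
        (1.0 if wants_json else 0.0, 0.05),              # binary
    ]
    return sum(value * weight for value, weight in signals)

def tier(score: float) -> str:
    """Map a complexity score to a routing tier."""
    if score < 0.3:
        return "simple"
    if score <= 0.7:
        return "moderate"
    return "complex"
```

A single short question with no tools or code lands in the simple tier, while a long tool-using, code-heavy conversation saturates every signal and bypasses routing entirely.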

Intent Detection

In parallel with complexity scoring, the classifier detects the task type by scanning message content for keywords and structural signals. The detected intent determines which benchmarks matter most when scoring candidate models.

| Intent | Detected From | Priority |
|---|---|---|
| Code | Code blocks, function, def, class, import, compile, debug, refactor | Highest |
| Math | integral, equation, derivative, calculate, probability, theorem | High |
| Reasoning | Tool usage, step by step, analyze, reason, compare, evaluate | Medium |
| General | Default when no specific intent is detected | Lowest |

Priority determines which intent wins when multiple are detected. A prompt containing both code blocks and the word analyze is classified as Code, not Reasoning.
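A minimal sketch of this priority-ordered keyword scan; the keyword lists are abbreviated from the table above and the function shape is an assumption:

```python
# Checked in priority order: the first intent with a match wins,
# so code keywords beat math, which beats reasoning.
INTENT_KEYWORDS = {
    "code": ["```", "function", "def ", "class ", "import ",
             "compile", "debug", "refactor"],
    "math": ["integral", "equation", "derivative", "calculate",
             "probability", "theorem"],
    "reasoning": ["step by step", "analyze", "reason", "compare",
                  "evaluate"],
}

def detect_intent(content: str, has_tools: bool = False) -> str:
    text = content.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    if has_tools:
        return "reasoning"  # tool usage is a reasoning signal
    return "general"
```

With this ordering, "please analyze this def parse():" classifies as code even though "analyze" is a reasoning keyword.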

Model Intelligence: Benchmark Data

This is the core of Smart Cost Routing. The system maintains a curated database of model capabilities, scored across standardized benchmarks. This data drives the quality estimation that makes routing decisions possible.

Where We Source Benchmark Data

Benchmark scores are aggregated from multiple authoritative sources and cross-referenced for accuracy:

  • Provider documentation (official model cards from OpenAI, Anthropic, Google, etc.)
  • HuggingFace Open LLM Leaderboard
  • Epoch AI Benchmarks
  • Artificial Analysis
  • LLM Stats / Klu AI / Onyx Leaderboards

When sources disagree, we use the provider’s official numbers as the primary reference and flag discrepancies for manual review.

What Benchmarks We Track

Each benchmark measures a specific capability. The routing system uses different benchmarks depending on the detected intent.

| Benchmark | What It Measures | Used For Intent |
|---|---|---|
| MMLU | General knowledge across 57 subjects | General, Reasoning |
| GPQA | Graduate-level expert Q&A | Reasoning, Math |
| HumanEval | Python code generation (pass@1) | Code |
| SWE-bench | Real GitHub issue resolution | Code |
| LiveCodeBench | Contemporary coding problems | Code |
| MATH | Competition-level mathematics | Math |
| AIME 2025 | American Invitational Math Exam problems | Math |
| MMLU-Pro | Harder MMLU variant with 10 answer choices | Reasoning |
| IFEval | Instruction following accuracy | General |
| HellaSwag | Commonsense reasoning | General |
| ARC | AI2 Reasoning Challenge | Reasoning |

Coverage: 27 of 51 tracked models currently have benchmark data. Models without benchmark data receive a default score of 0.5 (neutral, neither boosted nor penalized).

How Data Is Curated

The Midas aggregation system fetches pricing and benchmark data for all supported models. Each model entry contains:

  • Pricing: input, output, and cached token costs
  • Benchmarks: normalized to a 0-1 scale for cross-model comparison
  • Capabilities: context window, multimodal support, function calling
  • Strengths and weaknesses: qualitative notes per model
  • Recommendations: suggested use cases

Data is refreshed whenever providers publish updates. The full dataset is embedded in the gateway binary at build time, enabling zero-latency lookups with no external API calls during request routing.

Benchmark-Weighted Quality Score

The quality score for a given model and intent is computed as a weighted average of available benchmark scores:

benchmark_score = Σ(vᵢ × wᵢ) / Σ(wᵢ)

Where vᵢ is the model’s score on benchmark i and wᵢ is the weight for that benchmark given the detected intent. The denominator sums only the weights of benchmarks that are actually present for the model, so missing benchmarks do not penalize the score.

Intent Weight Tables

Code Intent:

| Benchmark | Weight |
|---|---|
| HumanEval | 35% |
| SWE-bench | 30% |
| LiveCodeBench | 20% |
| MMLU | 10% |
| IFEval | 5% |

Math Intent:

| Benchmark | Weight |
|---|---|
| MATH | 40% |
| GPQA | 25% |
| MMLU | 15% |
| AIME 2025 | 15% |
| ARC | 5% |

Reasoning Intent:

| Benchmark | Weight |
|---|---|
| GPQA | 30% |
| MMLU | 25% |
| MATH | 20% |
| MMLU-Pro | 15% |
| ARC | 10% |

General Intent:

| Benchmark | Weight |
|---|---|
| MMLU | 30% |
| GPQA | 15% |
| HumanEval | 15% |
| MATH | 15% |
| IFEval | 15% |
| HellaSwag | 10% |

Worked Example

GPT-4o with Code intent:

Suppose GPT-4o has these benchmark scores: HumanEval = 0.902, MMLU = 0.887. SWE-bench and LiveCodeBench are not available for this model.

| Benchmark | Score | Weight | Contribution |
|---|---|---|---|
| HumanEval | 0.902 | 0.35 | 0.3157 |
| SWE-bench | — | — | skipped |
| LiveCodeBench | — | — | skipped |
| MMLU | 0.887 | 0.10 | 0.0887 |
| IFEval | — | — | skipped |

Sum of contributions: 0.3157 + 0.0887 = 0.4044
Sum of available weights: 0.35 + 0.10 = 0.45

Quality score = 0.4044 / 0.45 ≈ 0.899
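The formula and the worked example can be reproduced with a few lines of code. The function name and the neutral-default handling are illustrative; only the Code-intent weights come from the tables above:

```python
# Code-intent weights from the intent weight table
CODE_WEIGHTS = {
    "HumanEval": 0.35,
    "SWE-bench": 0.30,
    "LiveCodeBench": 0.20,
    "MMLU": 0.10,
    "IFEval": 0.05,
}

def benchmark_score(model_scores: dict, weights: dict) -> float:
    """Weighted average over the benchmarks actually present for the model.

    The denominator sums only available weights, so missing benchmarks
    neither boost nor penalize the score.
    """
    available = [(model_scores[b], w) for b, w in weights.items()
                 if b in model_scores]
    if not available:
        return 0.5  # neutral default for models with no benchmark data
    numerator = sum(v * w for v, w in available)
    denominator = sum(w for _, w in available)
    return numerator / denominator

gpt4o = {"HumanEval": 0.902, "MMLU": 0.887}
score = benchmark_score(gpt4o, CODE_WEIGHTS)
```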

Model Selection Algorithm

Candidate Filtering

Before scoring, the candidate pool is narrowed:

  1. Provider list: Only models you have enabled in the Smart Cost Routing providers list are considered.
  2. Cost ceiling: avg_token_cost must be less than or equal to the default model’s cost. Smart Cost Routing never routes to a more expensive model.
  3. Quality floor: The model’s quality score (from session feedback, auto feedback, manual rating, and/or benchmarks with dynamic weights) must meet or exceed min_quality.
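The three filters above compose into a single pass over the model pool. The Model type and its field names here are hypothetical, chosen to mirror the parameters named in the list:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    enabled: bool          # on the Smart Cost Routing providers list
    avg_token_cost: float  # blended per-token cost
    quality: float         # combined feedback/benchmark score, 0-1

def filter_candidates(models, default_cost: float, min_quality: float):
    return [
        m for m in models
        if m.enabled                          # 1. provider list
        and m.avg_token_cost <= default_cost  # 2. cost ceiling
        and m.quality >= min_quality          # 3. quality floor
    ]
```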

Selection Strategy

The system uses a multi-armed bandit approach:

  • Exploit (90% of requests): Selects the highest-scoring candidate. The exploit score combines three factors:

    score = (success_rate × 0.4) + (quality × 0.4) + (savings × 0.2)

    A model needs at least 10 completed requests before it is eligible for exploitation. Until then, it is treated as unexplored.

  • Explore (10% of requests): Selects the least-tested candidate to gather performance data. This prevents the system from getting stuck on a local optimum and ensures new models get a fair evaluation.

  • Bypass: If no candidates survive filtering, or if the prompt is classified as Complex, the default model is used with no routing applied.
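The exploit/explore/bypass logic can be sketched as follows. The Candidate type, field names, and the tie-breaking for under-explored models are assumptions; the exploit-score weights and the 10-request threshold come from the text above:

```python
import random
from dataclasses import dataclass

MIN_REQUESTS_TO_EXPLOIT = 10

@dataclass
class Candidate:
    name: str
    requests: int       # completed requests observed so far
    success_rate: float
    quality: float
    savings: float      # relative cost savings vs. the default model

def exploit_score(c: Candidate) -> float:
    return c.success_rate * 0.4 + c.quality * 0.4 + c.savings * 0.2

def select(candidates, default_model: str,
           exploration_rate: float = 0.1, rng=random) -> str:
    if not candidates:
        return default_model  # bypass: nothing survived filtering
    if rng.random() < exploration_rate:
        # explore: pick the least-tested candidate
        return min(candidates, key=lambda c: c.requests).name
    seasoned = [c for c in candidates
                if c.requests >= MIN_REQUESTS_TO_EXPLOIT]
    if not seasoned:
        # no candidate has enough data yet; treat as exploration
        return min(candidates, key=lambda c: c.requests).name
    return max(seasoned, key=exploit_score).name
```

Injecting the random source makes the exploit and explore paths independently testable.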

The Self-Correcting Cycle

Smart Cost Routing improves over time through a multi-signal feedback loop. The quality signal combines four sources with dynamic weights:

  • Session feedback (NPS from end users) — highest confidence when available
  • Auto feedback (heuristic + LLM scoring per request) — automated baseline
  • Manual rating (admin thumbs up/down) — high confidence, low volume
  • Benchmark scores — fallback when no feedback exists

When session feedback exists (>10 sessions), it receives 50% of the quality weight. When only auto feedback exists, it receives 50%. When no feedback exists at all, benchmarks are used at 100%. This means the system starts with reasonable defaults and converges on real-world signal as data accumulates.
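A simplified sketch of those three weighting regimes. The 50% primary shares and the benchmark-only fallback follow the text above; how the remaining weight is split among secondary signals is an illustrative assumption:

```python
def quality_weights(session_count: int, has_auto: bool) -> dict:
    """Return per-signal weights for the combined quality score.

    The secondary-signal split is an assumption for illustration;
    only the 50% primary share and the benchmark fallback are
    taken from the documented behavior.
    """
    if session_count > 10:
        # end-user NPS is the primary signal
        return {"session": 0.5, "auto": 0.2, "manual": 0.2, "benchmark": 0.1}
    if has_auto:
        # automated scoring is the primary signal
        return {"auto": 0.5, "manual": 0.25, "benchmark": 0.25}
    # no feedback at all: benchmarks only
    return {"benchmark": 1.0}
```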

Models that consistently receive poor feedback see their quality scores drop and eventually fall below the minimum quality threshold, removing them from the candidate pool. This creates a self-correcting system: bad routing decisions generate negative feedback, which prevents the same mistake from happening again.

For details on how to submit feedback and how it is processed, see Feedback.

Configuration

  1. Go to Routing in the dashboard.
  2. Create or edit a routing rule.
  3. Enable Smart Cost Routing.
  4. Configure the following parameters:
    • Exploration Rate: Percentage of requests used for testing less-explored models (default: 10%).
    • Minimum Quality: The minimum acceptable quality score for a candidate model (default: 70%).
    • Providers: Restrict which providers and models can be used as routing targets.

Monitoring

Every response routed through Smart Cost Routing includes headers that tell you what happened:

| Header | Description |
|---|---|
| Floopy-Smart-Cost-Decision | Whether the request was routed (routed), bypassed (bypass), or used the default (default) |
| Floopy-Model | The model that actually served the request |

Use these headers in your application to log routing decisions, build dashboards, or trigger alerts when routing behavior changes unexpectedly.
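For example, a log line can be derived from the headers on any response; the helper below is a hypothetical sketch that works with whatever HTTP client exposes response headers as a mapping:

```python
def log_routing_decision(headers: dict) -> str:
    """Format a log line from the Smart Cost Routing response headers."""
    decision = headers.get("Floopy-Smart-Cost-Decision", "unknown")
    model = headers.get("Floopy-Model", "unknown")
    return f"smart-cost decision={decision} model={model}"
```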

Availability

Smart Cost Routing is available on the Pro plan.