Smart Cost Routing
Introduction
Smart Cost Routing analyzes prompt complexity and intent, then selects the cheapest model capable of handling the task well. It combines curated benchmark data, real-time feedback from your users, and multi-armed bandit exploration to continuously improve routing decisions.
The result: you pay less for the same output quality. Simple questions go to cheap models. Hard problems stay on your default model. Everything in between gets routed to the best value option, validated by benchmark scores and historical performance.
How It Works
Every request moves through six stages, from classification through to the post-call feedback loop:
Request → Complexity & Intent Classification → Candidate Filtering → Scoring → Selection → Provider Call → Feedback Loop
- Complexity Classification scores the prompt on a 0-to-1 scale using weighted heuristics.
- Intent Detection identifies the task type (code, math, reasoning, general).
- Candidate Filtering removes models that are too expensive or too low quality.
- Scoring ranks remaining candidates using benchmark data, feedback history, and cost savings.
- Selection picks a model via exploit/explore strategy.
- Feedback Loop records outcome data that improves future decisions.
Complexity Classification
Every prompt receives a complexity score between 0 and 1 based on a weighted heuristic analysis. Each signal is normalized to a 0-1 range, then multiplied by its weight.
| Signal | Weight | How Measured |
|---|---|---|
| Message count | 30% | Normalized across 1-5 messages |
| System prompt length | 25% | Normalized across 0-300 tokens |
| Tool usage | 20% | Binary: 1 if tools are present, 0 otherwise |
| Code blocks | 15% | Binary: 1 if a fenced code block is present in the content, 0 otherwise |
| Token count | 5% | Normalized across 10-500 tokens |
| JSON output | 5% | Binary: 1 if response_format is set, 0 otherwise |
The weighted sum maps to one of three tiers:
| Tier | Score Range | Behavior |
|---|---|---|
| Simple | < 0.3 | Routes to cheapest viable model |
| Moderate | 0.3 - 0.7 | Routes to best-value model |
| Complex | > 0.7 | Uses default model (no routing) |
Complex prompts always bypass Smart Cost Routing entirely and go straight to your configured default model.
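The weighting and tier thresholds above can be expressed compactly. Below is a minimal Python sketch, assuming the signal values have already been extracted from the request; function and parameter names are illustrative, not the gateway's actual API.

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Clamp value to [lo, hi] and map it onto 0-1."""
    if hi == lo:
        return 0.0
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def complexity_score(message_count: int, system_prompt_tokens: int,
                     uses_tools: bool, has_code_blocks: bool,
                     prompt_tokens: int, wants_json: bool) -> float:
    """Weighted heuristic score per the signal table (weights sum to 1.0)."""
    return (
        0.30 * normalize(message_count, 1, 5)
        + 0.25 * normalize(system_prompt_tokens, 0, 300)
        + 0.20 * (1.0 if uses_tools else 0.0)
        + 0.15 * (1.0 if has_code_blocks else 0.0)
        + 0.05 * normalize(prompt_tokens, 10, 500)
        + 0.05 * (1.0 if wants_json else 0.0)
    )

def tier(score: float) -> str:
    """Map the weighted sum onto the three routing tiers."""
    if score < 0.3:
        return "simple"    # cheapest viable model
    if score <= 0.7:
        return "moderate"  # best-value model
    return "complex"       # bypass: default model, no routing
```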
Intent Detection
In parallel with complexity scoring, the classifier detects the task type by scanning message content for keywords and structural signals. The detected intent determines which benchmarks matter most when scoring candidate models.
| Intent | Detected From | Priority |
|---|---|---|
| Code | Code blocks, function, def, class, import, compile, debug, refactor | Highest |
| Math | ∫, ∑, integral, equation, derivative, calculate, probability, theorem | High |
| Reasoning | Tool usage, step by step, analyze, reason, compare, evaluate | Medium |
| General | Default when no specific intent is detected | Lowest |
Priority determines which intent wins when multiple are detected. A prompt containing both code blocks and the word analyze is classified as Code, not Reasoning.
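A simplified sketch of keyword-based detection with that priority ordering follows; the keyword lists come from the table above, while the signature and signal handling are illustrative assumptions (the production classifier may use additional structural signals).

```python
CODE_SIGNALS = ("function", "def ", "class ", "import", "compile", "debug", "refactor")
MATH_SIGNALS = ("∫", "∑", "integral", "equation", "derivative", "calculate", "probability", "theorem")
REASONING_SIGNALS = ("step by step", "analyze", "reason", "compare", "evaluate")

def detect_intent(content: str, has_code_blocks: bool, uses_tools: bool) -> str:
    """Return the highest-priority intent detected: code > math > reasoning > general."""
    text = content.lower()
    if has_code_blocks or any(s in text for s in CODE_SIGNALS):
        return "code"
    if any(s in text for s in MATH_SIGNALS):
        return "math"
    if uses_tools or any(s in text for s in REASONING_SIGNALS):
        return "reasoning"
    return "general"

# A prompt containing a code block and the word "analyze" resolves to "code",
# matching the priority rule above.
```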
Model Intelligence: Benchmark Data
This is the core of Smart Cost Routing. The system maintains a curated database of model capabilities, scored across standardized benchmarks. This data drives the quality estimation that makes routing decisions possible.
Where We Source Benchmark Data
Benchmark scores are aggregated from multiple authoritative sources and cross-referenced for accuracy:
- Provider documentation (official model cards from OpenAI, Anthropic, Google, etc.)
- HuggingFace Open LLM Leaderboard
- Epoch AI Benchmarks
- Artificial Analysis
- LLM Stats / Klu AI / Onyx Leaderboards
When sources disagree, we use the provider’s official numbers as the primary reference and flag discrepancies for manual review.
What Benchmarks We Track
Each benchmark measures a specific capability. The routing system uses different benchmarks depending on the detected intent.
| Benchmark | What It Measures | Used For Intent |
|---|---|---|
| MMLU | General knowledge across 57 subjects | General, Reasoning |
| GPQA | Graduate-level expert Q&A | Reasoning, Math |
| HumanEval | Python code generation (pass@1) | Code |
| SWE-bench | Real GitHub issue resolution | Code |
| LiveCodeBench | Contemporary coding problems | Code |
| MATH | Competition-level mathematics | Math |
| AIME 2025 | American Invitational Mathematics Examination problems | Math |
| MMLU-Pro | Harder MMLU variant with 10 answer choices | Reasoning |
| IFEval | Instruction following accuracy | General |
| HellaSwag | Commonsense reasoning | General |
| ARC | AI2 Reasoning Challenge | Reasoning |
Coverage: 27 of 51 tracked models currently have benchmark data. Models without benchmark data receive a default score of 0.5 (neutral, neither boosted nor penalized).
How Data Is Curated
The Midas aggregation system fetches pricing and benchmark data for all supported models. Each model entry contains:
- Pricing: input, output, and cached token costs
- Benchmarks: normalized to a 0-1 scale for cross-model comparison
- Capabilities: context window, multimodal support, function calling
- Strengths and weaknesses: qualitative notes per model
- Recommendations: suggested use cases
Data is refreshed whenever providers publish updates. The full dataset is embedded in the gateway binary at build time, enabling zero-latency lookups with no external API calls during request routing.
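As a rough illustration, a single model entry might be shaped like the sketch below. The field names and types are hypothetical; the actual Midas schema is internal to the gateway.

```python
from dataclasses import dataclass, field

@dataclass
class ModelEntry:
    name: str
    # Pricing: per-token costs for input, output, and cached input
    input_cost: float
    output_cost: float
    cached_input_cost: float
    # Benchmarks: normalized to 0-1 for cross-model comparison
    benchmarks: dict[str, float] = field(default_factory=dict)
    # Capabilities
    context_window: int = 0
    multimodal: bool = False
    function_calling: bool = False
    # Qualitative notes and suggested use cases
    strengths: list[str] = field(default_factory=list)
    weaknesses: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)
```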
Benchmark-Weighted Quality Score
The quality score for a given model and intent is computed as a weighted average of available benchmark scores:
benchmark_score = Σ(vᵢ × wᵢ) / Σ(wᵢ)

where vᵢ is the model's score on benchmark i and wᵢ is the weight for that benchmark given the detected intent. The denominator sums only the weights of benchmarks that are actually present for the model, so missing benchmarks do not penalize the score.
Intent Weight Tables
Code Intent:
| Benchmark | Weight |
|---|---|
| HumanEval | 35% |
| SWE-bench | 30% |
| LiveCodeBench | 20% |
| MMLU | 10% |
| IFEval | 5% |
Math Intent:
| Benchmark | Weight |
|---|---|
| MATH | 40% |
| GPQA | 25% |
| MMLU | 15% |
| AIME 2025 | 15% |
| ARC | 5% |
Reasoning Intent:
| Benchmark | Weight |
|---|---|
| GPQA | 30% |
| MMLU | 25% |
| MATH | 20% |
| MMLU-Pro | 15% |
| ARC | 10% |
General Intent:
| Benchmark | Weight |
|---|---|
| MMLU | 30% |
| GPQA | 15% |
| HumanEval | 15% |
| MATH | 15% |
| IFEval | 15% |
| HellaSwag | 10% |
Worked Example
GPT-4o with Code intent:
Suppose GPT-4o has these benchmark scores: HumanEval = 0.902, MMLU = 0.887. SWE-bench and LiveCodeBench are not available for this model.
| Benchmark | Score | Weight | Contribution |
|---|---|---|---|
| HumanEval | 0.902 | 0.35 | 0.3157 |
| SWE-bench | — | — | skipped |
| LiveCodeBench | — | — | skipped |
| MMLU | 0.887 | 0.10 | 0.0887 |
| IFEval | — | — | skipped |
Sum of contributions: 0.3157 + 0.0887 = 0.4044
Sum of available weights: 0.35 + 0.10 = 0.45
Quality score = 0.4044 / 0.45 ≈ 0.899
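The same computation as a small Python sketch, using the Code intent weights from the tables above and the two benchmark scores in this example; the dictionary keys and function name are illustrative.

```python
CODE_WEIGHTS = {"humaneval": 0.35, "swe_bench": 0.30, "livecodebench": 0.20,
                "mmlu": 0.10, "ifeval": 0.05}

def benchmark_score(scores: dict, weights: dict) -> float:
    """Weighted average over only the benchmarks the model actually has."""
    available = {b: w for b, w in weights.items() if b in scores}
    if not available:
        return 0.5  # neutral default when a model has no benchmark data
    total_weight = sum(available.values())
    return sum(scores[b] * w for b, w in available.items()) / total_weight

gpt4o = {"humaneval": 0.902, "mmlu": 0.887}
print(round(benchmark_score(gpt4o, CODE_WEIGHTS), 3))  # 0.899
```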
Model Selection Algorithm
Candidate Filtering
Before scoring, the candidate pool is narrowed:
- Provider list: Only models you have enabled in the Smart Cost Routing providers list are considered.
- Cost ceiling: `avg_token_cost` must be less than or equal to the default model's cost. Smart Cost Routing never routes to a more expensive model.
- Quality floor: The model's quality score (from session feedback, auto feedback, manual rating, and/or benchmarks with dynamic weights) must meet or exceed `min_quality`.
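A compact sketch of the three filters, assuming each candidate exposes an average token cost and a blended quality score; the attribute and parameter names here are assumptions for illustration.

```python
def filter_candidates(candidates, enabled_models, default_cost, min_quality):
    """Keep only enabled models that cost no more than the default and meet the quality floor."""
    return [
        m for m in candidates
        if m.name in enabled_models           # provider / model allow-list
        and m.avg_token_cost <= default_cost  # cost ceiling: never pricier than the default
        and m.quality >= min_quality          # quality floor (feedback and/or benchmarks)
    ]
```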
Selection Strategy
The system uses a multi-armed bandit approach:
Exploit (90% of requests): Selects the highest-scoring candidate. The exploit score combines three factors:
score = (success_rate × 0.4) + (quality × 0.4) + (savings × 0.2)

A model needs at least 10 completed requests before it is eligible for exploitation. Until then, it is treated as unexplored.
Explore (10% of requests): Selects the least-tested candidate to gather performance data. This prevents the system from getting stuck on a local optimum and ensures new models get a fair evaluation.
Bypass: If no candidates survive filtering, or if the prompt is classified as Complex, the default model is used with no routing applied.
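Putting the three branches together, here is a hedged sketch of the selection step. Attribute names such as success_rate, quality, savings, and request_count are assumptions that mirror the formula above, and the fallback used when no candidate has ten completed requests is also an assumption.

```python
import random

EXPLORE_RATE = 0.10           # configurable Exploration Rate (default 10%)
MIN_REQUESTS_TO_EXPLOIT = 10  # eligibility threshold for exploitation

def exploit_score(m) -> float:
    return m.success_rate * 0.4 + m.quality * 0.4 + m.savings * 0.2

def select_model(candidates, default_model, complexity_tier: str):
    if complexity_tier == "complex" or not candidates:
        return default_model, "bypass"                 # no routing applied
    if random.random() < EXPLORE_RATE:
        # Explore: route to the least-tested candidate to gather data.
        return min(candidates, key=lambda m: m.request_count), "explore"
    proven = [m for m in candidates if m.request_count >= MIN_REQUESTS_TO_EXPLOIT]
    if not proven:
        # Assumption: with nothing exploitable yet, fall back to exploration.
        return min(candidates, key=lambda m: m.request_count), "explore"
    # Exploit: route to the highest-scoring proven candidate.
    return max(proven, key=exploit_score), "exploit"
```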
The Self-Correcting Cycle
Smart Cost Routing improves over time through a multi-signal feedback loop. The quality signal combines four sources with dynamic weights:
- Session feedback (NPS from end users) — highest confidence when available
- Auto feedback (heuristic + LLM scoring per request) — automated baseline
- Manual rating (admin thumbs up/down) — high confidence, low volume
- Benchmark scores — fallback when no feedback exists
When session feedback exists (>10 sessions), it receives 50% of the quality weight. When only auto feedback exists, it receives 50%. When no feedback exists at all, benchmarks are used at 100%. This means the system starts with reasonable defaults and converges on real-world signal as data accumulates.
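A deliberately rough sketch of that blending is shown below. It covers only the three cases stated above; assigning the remaining 50% of the weight to benchmark scores is purely an illustrative assumption, and the manual-rating signal is omitted for brevity.

```python
def blended_quality(benchmark: float, session_nps: float | None = None,
                    session_count: int = 0, auto_score: float | None = None) -> float:
    if session_nps is not None and session_count > 10:
        return 0.5 * session_nps + 0.5 * benchmark  # assumption: remainder goes to benchmarks
    if auto_score is not None:
        return 0.5 * auto_score + 0.5 * benchmark   # assumption: remainder goes to benchmarks
    return benchmark                                # no feedback at all: benchmarks at 100%
```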
Models that consistently receive poor feedback see their quality scores drop and eventually fall below the minimum quality threshold, removing them from the candidate pool. This creates a self-correcting system: bad routing decisions generate negative feedback, which prevents the same mistake from happening again.
For details on how to submit feedback and how it is processed, see Feedback.
Configuration
- Go to Routing in the dashboard.
- Create or edit a routing rule.
- Enable Smart Cost Routing.
- Configure the following parameters:
- Exploration Rate: Percentage of requests used for testing less-explored models (default: 10%).
- Minimum Quality: The minimum acceptable quality score for a candidate model (default: 70%).
- Providers: Restrict which providers and models can be used as routing targets.
Monitoring
Every response routed through Smart Cost Routing includes headers that tell you what happened:
| Header | Description |
|---|---|
| `Floopy-Smart-Cost-Decision` | Whether the request was routed (routed), bypassed (bypass), or used the default (default) |
| `Floopy-Model` | The model that actually served the request |
Use these headers in your application to log routing decisions, build dashboards, or trigger alerts when routing behavior changes unexpectedly.
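For example, with a standard HTTP client and an OpenAI-compatible chat completions request, you could read the headers like this; the gateway URL, path, and API key below are placeholders.

```python
import requests

resp = requests.post(
    "https://<your-gateway-host>/v1/chat/completions",  # placeholder gateway URL
    headers={"Authorization": "Bearer <your-api-key>"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2 + 2?"}]},
)

decision = resp.headers.get("Floopy-Smart-Cost-Decision")  # routed | bypass | default
served_by = resp.headers.get("Floopy-Model")               # model that actually answered
print(decision, served_by)
```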
Availability
Smart Cost Routing is available on the Pro plan.