Smart Cost Routing
Introduction
Smart Cost Routing analyzes prompt complexity and intent, then selects the cheapest model capable of handling the task well. It combines curated benchmark data, real-time feedback from your users, and multi-armed bandit exploration to continuously improve routing decisions.
The result: you pay less for the same output quality. Simple questions go to cheap models. Hard problems stay on your default model. Everything in between gets routed to the best value option, validated by benchmark scores and historical performance.
How It Works
Every request moves through six stages, from classification through to the post-call feedback loop:
Request → Complexity & Intent Classification → Candidate Filtering → Scoring → Selection → Provider Call → Feedback Loop
- Complexity Classification scores the prompt on a 0-to-1 scale using weighted heuristics.
- Intent Detection identifies the task type (code, math, reasoning, general).
- Candidate Filtering removes models that are too expensive or too low quality.
- Scoring ranks remaining candidates using benchmark data, feedback history, and cost savings.
- Selection picks a model via exploit/explore strategy.
- Feedback Loop records outcome data that improves future decisions.
Complexity Classification
Every prompt receives a complexity score between 0 and 1 based on a weighted heuristic analysis. Each signal is normalized to a 0-1 range, then multiplied by its weight.
| Signal | Weight | How Measured |
|---|---|---|
| Message count | 30% | Normalized across 1-5 messages |
| System prompt length | 25% | Normalized across 0-300 tokens |
| Tool usage | 20% | Binary: 1 if tools are present, 0 otherwise |
| Code blocks | 15% | Binary: 1 if a fenced code block is present in the content, 0 otherwise |
| Token count | 5% | Normalized across 10-500 tokens |
| JSON output | 5% | Binary: 1 if response_format is set, 0 otherwise |
The weighted sum maps to one of three tiers:
| Tier | Score Range | Behavior |
|---|---|---|
| Simple | < 0.3 | Routes to cheapest viable model |
| Moderate | 0.3 - 0.7 | Routes to best-value model |
| Complex | > 0.7 | Uses default model (no routing) |
Complex prompts always bypass Smart Cost Routing entirely and go straight to your configured default model.
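The weighting and tier thresholds above can be expressed compactly. Below is a minimal Python sketch, assuming the signal values have already been extracted from the request; function and parameter names are illustrative, not the gateway's actual API.

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Clamp value to [lo, hi] and map it onto 0-1."""
    if hi == lo:
        return 0.0
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def complexity_score(message_count: int, system_prompt_tokens: int,
                     uses_tools: bool, has_code_blocks: bool,
                     prompt_tokens: int, wants_json: bool) -> float:
    """Weighted heuristic score per the signal table (weights sum to 1.0)."""
    return (
        0.30 * normalize(message_count, 1, 5)
        + 0.25 * normalize(system_prompt_tokens, 0, 300)
        + 0.20 * (1.0 if uses_tools else 0.0)
        + 0.15 * (1.0 if has_code_blocks else 0.0)
        + 0.05 * normalize(prompt_tokens, 10, 500)
        + 0.05 * (1.0 if wants_json else 0.0)
    )

def tier(score: float) -> str:
    """Map the weighted sum onto the three routing tiers."""
    if score < 0.3:
        return "simple"    # cheapest viable model
    if score <= 0.7:
        return "moderate"  # best-value model
    return "complex"       # bypass: default model, no routing
```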
Intent Detection
In parallel with complexity scoring, the classifier detects the task type by scanning message content for keywords and structural signals. The detected intent determines which benchmarks matter most when scoring candidate models.
| Intent | Detected From | Priority |
|---|---|---|
| Code | Code blocks, function, def, class, import, compile, debug, refactor | Highest |
| Math | ∫, ∑, integral, equation, derivative, calculate, probability, theorem | High |
| Reasoning | Tool usage, step by step, analyze, reason, compare, evaluate | Medium |
| General | Default when no specific intent is detected | Lowest |
Priority determines which intent wins when multiple are detected. A prompt containing both code blocks and the word analyze is classified as Code, not Reasoning.
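A simplified sketch of keyword-based detection with that priority ordering follows; the keyword lists come from the table above, while the signature and signal handling are illustrative assumptions (the production classifier may use additional structural signals).

```python
CODE_SIGNALS = ("function", "def ", "class ", "import", "compile", "debug", "refactor")
MATH_SIGNALS = ("∫", "∑", "integral", "equation", "derivative", "calculate", "probability", "theorem")
REASONING_SIGNALS = ("step by step", "analyze", "reason", "compare", "evaluate")

def detect_intent(content: str, has_code_blocks: bool, uses_tools: bool) -> str:
    """Return the highest-priority intent detected: code > math > reasoning > general."""
    text = content.lower()
    if has_code_blocks or any(s in text for s in CODE_SIGNALS):
        return "code"
    if any(s in text for s in MATH_SIGNALS):
        return "math"
    if uses_tools or any(s in text for s in REASONING_SIGNALS):
        return "reasoning"
    return "general"

# A prompt containing a code block and the word "analyze" resolves to "code",
# matching the priority rule above.
```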
Model Intelligence: Benchmark Data
This is the core of Smart Cost Routing. The system maintains a curated database of model capabilities, scored across standardized benchmarks. This data drives the quality estimation that makes routing decisions possible.
Where We Source Benchmark Data
Benchmark scores are aggregated from multiple authoritative sources and cross-referenced for accuracy:
- Provider documentation (official model cards from OpenAI, Anthropic, Google, etc.)
- HuggingFace Open LLM Leaderboard
- Epoch AI Benchmarks
- Artificial Analysis
- LLM Stats / Klu AI / Onyx Leaderboards
When sources disagree, we use the provider’s official numbers as the primary reference and flag discrepancies for manual review.
What Benchmarks We Track
Each benchmark measures a specific capability. The routing system uses different benchmarks depending on the detected intent.
| Benchmark | What It Measures | Used For Intent |
|---|---|---|
| MMLU | General knowledge across 57 subjects | General, Reasoning |
| GPQA | Graduate-level expert Q&A | Reasoning, Math |
| HumanEval | Python code generation (pass@1) | Code |
| SWE-bench | Real GitHub issue resolution | Code |
| LiveCodeBench | Contemporary coding problems | Code |
| MATH | Competition-level mathematics | Math |
| AIME 2025 | American Invitational Mathematics Examination problems | Math |
| MMLU-Pro | Harder MMLU variant with 10 answer choices | Reasoning |
| IFEval | Instruction following accuracy | General |
| HellaSwag | Commonsense reasoning | General |
| ARC | AI2 Reasoning Challenge | Reasoning |
Coverage: 27 of 51 tracked models currently have benchmark data. Models without benchmark data receive a default score of 0.5 (neutral, neither boosted nor penalized).
How Data Is Curated
The Midas aggregation system fetches pricing and benchmark data for all supported models. Each model entry contains:
- Pricing: input, output, and cached token costs
- Benchmarks: normalized to a 0-1 scale for cross-model comparison
- Capabilities: context window, multimodal support, function calling
- Strengths and weaknesses: qualitative notes per model
- Recommendations: suggested use cases
Data is refreshed whenever providers publish updates. The full dataset is embedded in the gateway binary at build time, enabling zero-latency lookups with no external API calls during request routing.
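As a rough illustration, a single model entry might be shaped like the sketch below. The field names and types are hypothetical; the actual Midas schema is internal to the gateway.

```python
from dataclasses import dataclass, field

@dataclass
class ModelEntry:
    name: str
    # Pricing: per-token costs for input, output, and cached input
    input_cost: float
    output_cost: float
    cached_input_cost: float
    # Benchmarks: normalized to 0-1 for cross-model comparison
    benchmarks: dict[str, float] = field(default_factory=dict)
    # Capabilities
    context_window: int = 0
    multimodal: bool = False
    function_calling: bool = False
    # Qualitative notes and suggested use cases
    strengths: list[str] = field(default_factory=list)
    weaknesses: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)
```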
Benchmark-Weighted Quality Score
The quality score for a given model and intent is computed as a weighted average of available benchmark scores:
benchmark_score = Σ(vᵢ × wᵢ) / Σ(wᵢ)

where vᵢ is the model's score on benchmark i and wᵢ is the weight for that benchmark given the detected intent. The denominator sums only the weights of benchmarks that are actually present for the model, so missing benchmarks do not penalize the score.
Intent Weight Tables
Code Intent:
| Benchmark | Weight |
|---|---|
| HumanEval | 35% |
| SWE-bench | 30% |
| LiveCodeBench | 20% |
| MMLU | 10% |
| IFEval | 5% |
Math Intent:
| Benchmark | Weight |
|---|---|
| MATH | 40% |
| GPQA | 25% |
| MMLU | 15% |
| AIME 2025 | 15% |
| ARC | 5% |
Reasoning Intent:
| Benchmark | Weight |
|---|---|
| GPQA | 30% |
| MMLU | 25% |
| MATH | 20% |
| MMLU-Pro | 15% |
| ARC | 10% |
General Intent:
| Benchmark | Weight |
|---|---|
| MMLU | 30% |
| GPQA | 15% |
| HumanEval | 15% |
| MATH | 15% |
| IFEval | 15% |
| HellaSwag | 10% |
Worked Example
GPT-4o with Code intent:
Suppose GPT-4o has these benchmark scores: HumanEval = 0.902, MMLU = 0.887. SWE-bench and LiveCodeBench are not available for this model.
| Benchmark | Score | Weight | Contribution |
|---|---|---|---|
| HumanEval | 0.902 | 0.35 | 0.3157 |
| SWE-bench | — | — | skipped |
| LiveCodeBench | — | — | skipped |
| MMLU | 0.887 | 0.10 | 0.0887 |
| IFEval | — | — | skipped |
Sum of contributions: 0.3157 + 0.0887 = 0.4044
Sum of available weights: 0.35 + 0.10 = 0.45
Quality score = 0.4044 / 0.45 ≈ 0.899
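The same computation as a small Python sketch, using the Code intent weights from the tables above and the two benchmark scores in this example; the dictionary keys and function name are illustrative.

```python
CODE_WEIGHTS = {"humaneval": 0.35, "swe_bench": 0.30, "livecodebench": 0.20,
                "mmlu": 0.10, "ifeval": 0.05}

def benchmark_score(scores: dict, weights: dict) -> float:
    """Weighted average over only the benchmarks the model actually has."""
    available = {b: w for b, w in weights.items() if b in scores}
    if not available:
        return 0.5  # neutral default when a model has no benchmark data
    total_weight = sum(available.values())
    return sum(scores[b] * w for b, w in available.items()) / total_weight

gpt4o = {"humaneval": 0.902, "mmlu": 0.887}
print(round(benchmark_score(gpt4o, CODE_WEIGHTS), 3))  # 0.899
```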
Model Selection Algorithm
Candidate Filtering
Before scoring, the candidate pool is narrowed:
- Provider list: Only models you have enabled in the Smart Cost Routing providers list are considered.
- Cost ceiling: `avg_token_cost` must be less than or equal to the default model's cost. Smart Cost Routing never routes to a more expensive model.
- Quality floor: The model's quality score (from session feedback, auto feedback, manual rating, and/or benchmarks with dynamic weights) must meet or exceed `min_quality`.
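A compact sketch of the three filters, assuming each candidate exposes an average token cost and a blended quality score; the attribute and parameter names here are assumptions for illustration.

```python
def filter_candidates(candidates, enabled_models, default_cost, min_quality):
    """Keep only enabled models that cost no more than the default and meet the quality floor."""
    return [
        m for m in candidates
        if m.name in enabled_models           # provider / model allow-list
        and m.avg_token_cost <= default_cost  # cost ceiling: never pricier than the default
        and m.quality >= min_quality          # quality floor (feedback and/or benchmarks)
    ]
```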
Selection Strategy
The system uses a multi-armed bandit approach:
Exploit (90% of requests): Selects the highest-scoring candidate. The exploit score combines three factors:
score = (success_rate × 0.4) + (quality × 0.4) + (savings × 0.2)

A model needs at least 10 completed requests before it is eligible for exploitation. Until then, it is treated as unexplored.
Explore (10% of requests): Selects the least-tested candidate to gather performance data. This prevents the system from getting stuck on a local optimum and ensures new models get a fair evaluation.
Bypass: If no candidates survive filtering, or if the prompt is classified as Complex, the default model is used with no routing applied.
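Putting the three branches together, here is a hedged sketch of the selection step. Attribute names such as success_rate, quality, savings, and request_count are assumptions that mirror the formula above, and the fallback used when no candidate has ten completed requests is also an assumption.

```python
import random

EXPLORE_RATE = 0.10           # configurable Exploration Rate (default 10%)
MIN_REQUESTS_TO_EXPLOIT = 10  # eligibility threshold for exploitation

def exploit_score(m) -> float:
    return m.success_rate * 0.4 + m.quality * 0.4 + m.savings * 0.2

def select_model(candidates, default_model, complexity_tier: str):
    if complexity_tier == "complex" or not candidates:
        return default_model, "bypass"                 # no routing applied
    if random.random() < EXPLORE_RATE:
        # Explore: route to the least-tested candidate to gather data.
        return min(candidates, key=lambda m: m.request_count), "explore"
    proven = [m for m in candidates if m.request_count >= MIN_REQUESTS_TO_EXPLOIT]
    if not proven:
        # Assumption: with nothing exploitable yet, fall back to exploration.
        return min(candidates, key=lambda m: m.request_count), "explore"
    # Exploit: route to the highest-scoring proven candidate.
    return max(proven, key=exploit_score), "exploit"
```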
The Self-Correcting Cycle
Smart Cost Routing improves over time through a multi-signal feedback loop. The quality signal combines four sources with dynamic weights:
- Session feedback (NPS from end users) — highest confidence when available
- Auto feedback (heuristic + LLM scoring per request) — automated baseline
- Manual rating (admin thumbs up/down) — high confidence, low volume
- Benchmark scores — fallback when no feedback exists
When session feedback exists (>10 sessions), it receives 50% of the quality weight. When only auto feedback exists, it receives 50%. When no feedback exists at all, benchmarks are used at 100%. This means the system starts with reasonable defaults and converges on real-world signal as data accumulates.
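A deliberately rough sketch of that blending is shown below. It covers only the three cases stated above; assigning the remaining 50% of the weight to benchmark scores is purely an illustrative assumption, and the manual-rating signal is omitted for brevity.

```python
def blended_quality(benchmark: float, session_nps: float | None = None,
                    session_count: int = 0, auto_score: float | None = None) -> float:
    if session_nps is not None and session_count > 10:
        return 0.5 * session_nps + 0.5 * benchmark  # assumption: remainder goes to benchmarks
    if auto_score is not None:
        return 0.5 * auto_score + 0.5 * benchmark   # assumption: remainder goes to benchmarks
    return benchmark                                # no feedback at all: benchmarks at 100%
```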
Models that consistently receive poor feedback see their quality scores drop and eventually fall below the minimum quality threshold, removing them from the candidate pool. This creates a self-correcting system: bad routing decisions generate negative feedback, which prevents the same mistake from happening again.
For details on how to submit feedback and how it is processed, see Feedback.
Configuration
- Go to Routing in the dashboard.
- Create or edit a routing rule.
- Enable Smart Cost Routing.
- Configure the following parameters:
- Exploration Rate: Percentage of requests used for testing less-explored models (default: 10%).
- Minimum Quality: The minimum acceptable quality score for a candidate model (default: 70%).
- Providers: Restrict which providers and models can be used as routing targets.
Monitoring
Every response routed through Smart Cost Routing includes headers that tell you what happened:
| Header | Description |
|---|---|
| `Floopy-Smart-Cost-Decision` | Whether the request was routed (routed), bypassed (bypass), or used the default (default) |
| `Floopy-Model` | The model that actually served the request |
Use these headers in your application to log routing decisions, build dashboards, or trigger alerts when routing behavior changes unexpectedly.
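For example, with a standard HTTP client and an OpenAI-compatible chat completions request, you could read the headers like this; the gateway URL, path, and API key below are placeholders.

```python
import requests

resp = requests.post(
    "https://<your-gateway-host>/v1/chat/completions",  # placeholder gateway URL
    headers={"Authorization": "Bearer <your-api-key>"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2 + 2?"}]},
)

decision = resp.headers.get("Floopy-Smart-Cost-Decision")  # routed | bypass | default
served_by = resp.headers.get("Floopy-Model")               # model that actually answered
print(decision, served_by)
```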
Availability
Smart Cost Routing is available on the Pro plan.