How to Estimate and Control Your AI API Costs
Learn how to predict your OpenAI, Anthropic, or Google AI spending before it surprises you — with formulas, examples, and monitoring tips.
“How much will this cost?” is the first question every team asks before putting AI into production — and the hardest to answer without real data.
This guide gives you the formulas, examples, and monitoring strategies to predict your AI API spending with confidence.
Understanding Token-Based Pricing
AI APIs charge per token — roughly 4 characters or ¾ of a word in English. Most providers charge differently for input and output tokens:
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Claude Haiku 4 | $0.80 | $4.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
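The ~4-characters-per-token rule of thumb above can be turned into a quick back-of-envelope estimator. A sketch in JavaScript; this is a heuristic only, and the function name is illustrative. For exact counts, use the provider's tokenizer (e.g. tiktoken for OpenAI models).

```javascript
// Rough token estimate using the ~4 characters per token heuristic.
// Real counts vary by model and language; use the provider's own
// tokenizer when accuracy matters.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// A ~500-character prompt comes out to roughly 125 tokens
console.log(estimateTokens("a".repeat(500))); // 125
```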
The Cost Formula
Cost per request = (input_tokens × input_price) + (output_tokens × output_price)
Monthly cost = cost_per_request × requests_per_day × 30
Example: Customer Support Chatbot
Let’s estimate costs for a chatbot handling 10,000 conversations/day:
- System prompt: ~500 tokens
- Average user message: ~50 tokens
- Average conversation context: ~800 tokens (history)
- Average response: ~200 tokens
Input per request: 500 + 50 + 800 = 1,350 tokens
Output per request: 200 tokens
With GPT-4o:
- Input: 1,350 × $2.50/1M = $0.003375
- Output: 200 × $10.00/1M = $0.002
- Per request: $0.005375
- Monthly: $0.005375 × 10,000 × 30 = $1,612/month
With GPT-4o-mini:
- Input: 1,350 × $0.15/1M = $0.000203
- Output: 200 × $0.60/1M = $0.000120
- Per request: $0.000323
- Monthly: $0.000323 × 10,000 × 30 = $97/month
Same chatbot. Same quality for most support questions. $1,612 vs $97.
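The arithmetic above collapses into two small helpers. A sketch in JavaScript (function names are illustrative; prices are the per-1M-token rates from the pricing table):

```javascript
// Cost formula from above. Prices are dollars per 1M tokens.
function requestCost(inputTokens, outputTokens, inputPrice, outputPrice) {
  return (inputTokens * inputPrice + outputTokens * outputPrice) / 1_000_000;
}

function monthlyCost(perRequest, requestsPerDay, days = 30) {
  return perRequest * requestsPerDay * days;
}

// Chatbot example: 1,350 input tokens, 200 output, 10,000 requests/day
const gpt4o = requestCost(1350, 200, 2.50, 10.00); // 0.005375
const mini  = requestCost(1350, 200, 0.15, 0.60);  // ~0.000323

console.log(monthlyCost(gpt4o, 10_000)); // ~1612.5 per month
console.log(monthlyCost(mini, 10_000));  // ~96.75 per month
```

Plug in your own prompt sizes and volumes before launch; the spread between models is usually the single biggest lever.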
Example: AI Coding Assistant
A coding assistant processing 5,000 requests/day:
- System prompt: ~2,000 tokens (detailed instructions)
- Code context: ~3,000 tokens (file contents, errors)
- User message: ~100 tokens
- Generated code: ~500 tokens
With GPT-4o:
- Input: 5,100 × $2.50/1M = $0.01275
- Output: 500 × $10.00/1M = $0.005
- Per request: $0.01775
- Monthly: $0.01775 × 5,000 × 30 = $2,663/month
The Hidden Costs
Your actual bill will be higher than the formula suggests because of:
1. Retries
Failed requests (timeouts, rate limits, errors) need retries. Budget for 5-15% extra requests.
2. Conversation History
Each message in a multi-turn conversation resends the entire history. A 10-message conversation means message #10 includes all 9 previous messages as input tokens.
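This growth is quadratic, not linear, and it is easy to underestimate. A sketch of the cumulative input-token bill for an N-turn conversation, reusing the chatbot example's numbers (500-token system prompt, ~50-token user messages, ~200-token replies; all defaults are assumptions):

```javascript
// Cumulative input tokens when each turn resends the full history.
// Defaults mirror the chatbot example above; adjust to your workload.
function totalInputTokens(turns, { system = 500, user = 50, reply = 200 } = {}) {
  let history = 0;
  let total = 0;
  for (let t = 0; t < turns; t++) {
    total += system + history + user; // everything sent as input this turn
    history += user + reply;          // this turn joins the history
  }
  return total;
}

console.log(totalInputTokens(1));  // 550
console.log(totalInputTokens(10)); // 16750 — vs 5500 if history weren't resent
```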
3. System Prompt Overhead
Your system prompt is sent with every single request. A 1,000-token system prompt at 10,000 requests/day = 10M input tokens per day just for the system prompt.
4. Development and Testing
Dev environments, testing, CI/CD prompt testing — these add up. A team of 5 developers testing prompts manually can easily generate 20-30% of production volume.
Setting Up Cost Controls
Step 1: Set Budget Alerts
Before going to production, configure alerts at:
- 50% of monthly budget — awareness
- 75% of monthly budget — investigate if trending high
- 90% of monthly budget — action required
- 100% of monthly budget — hard stop or degraded mode
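The tiers above can be encoded as a simple spend check. A minimal sketch; the status strings and thresholds are illustrative, and the actual action (paging, throttling, blocking) would be wired to your own infrastructure:

```javascript
// Map current spend to the alert tiers above.
function budgetStatus(spend, monthlyBudget) {
  const pct = spend / monthlyBudget;
  if (pct >= 1.0)  return "hard-stop";   // 100%: stop or degraded mode
  if (pct >= 0.9)  return "action";      // 90%: action required
  if (pct >= 0.75) return "investigate"; // 75%: investigate the trend
  if (pct >= 0.5)  return "awareness";   // 50%: heads-up
  return "ok";
}

console.log(budgetStatus(920, 1000)); // "action"
```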
Step 2: Implement Per-User Limits
Prevent a single user or API key from consuming a disproportionate share:
```javascript
// Example: 100 requests per minute per user
const rateLimiter = {
  window: '1m',
  maxRequests: 100,
  keyBy: 'userId'
};
```
Step 3: Track Cost Per Feature
Don’t just track total spending. Break it down by:
- Feature/endpoint — Which features cost the most?
- User segment — Are free-tier users costing more than they should?
- Model — Are you accidentally using expensive models for simple tasks?
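One lightweight way to get these breakdowns is to tag every request's cost with its feature, model, and user segment. A sketch with an in-memory map; in production this would write to a metrics store, and the tag names are illustrative:

```javascript
// Accumulate spend per tag so totals can be sliced by feature,
// model, and segment instead of staring at one aggregate number.
const costs = new Map();

function recordCost(tags, dollars) {
  for (const tag of tags) {
    costs.set(tag, (costs.get(tag) ?? 0) + dollars);
  }
}

recordCost(["feature:support-bot", "model:gpt-4o-mini", "tier:free"], 0.000323);
recordCost(["feature:summarizer", "model:gpt-4o", "tier:paid"], 0.005375);
recordCost(["feature:support-bot", "model:gpt-4o-mini", "tier:free"], 0.000323);

console.log(costs.get("feature:support-bot")); // total spend for that feature
```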
Cost Monitoring Dashboard
At minimum, track these metrics daily:
| Metric | Why It Matters |
|---|---|
| Total cost (daily/weekly/monthly) | Trend awareness |
| Cost per request (p50, p95) | Catch expensive outliers |
| Tokens per request (input/output) | Identify bloated prompts |
| Requests by model | Verify model routing |
| Cache hit rate | Measure optimization effectiveness |
| Cost per user | Identify abuse or inefficiency |
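Why track p95 and not just the average? A handful of expensive outliers can dominate your bill while the median looks healthy. A nearest-rank percentile sketch over a day's per-request costs (a toy illustration, not a metrics library):

```javascript
// Nearest-rank percentile on a sorted copy of the values.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

const requestCosts = [0.0003, 0.0003, 0.0004, 0.0003, 0.012]; // one outlier
console.log(percentile(requestCosts, 50)); // 0.0003
console.log(percentile(requestCosts, 95)); // 0.012 — the outlier p50 hides
```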
Using an AI Gateway for Cost Control
Building all of this — monitoring, rate limiting, alerts, model routing — from scratch takes significant engineering time.
An AI gateway like Floopy gives you all of this out of the box:
- Real-time cost dashboard with per-request, per-user, and per-model breakdowns
- Budget alerts and hard limits configurable per API key
- Automatic cost logging to ClickHouse for historical analysis
- Smart Cost Routing that automatically picks cheaper models for simple tasks
The estimate you should actually care about is a moving target. A static per-token estimate assumes your model choice is fixed, but in production the cheapest viable model for a given prompt shifts as prompts, traffic mix, and quality bars drift. Floopy's feedback-driven routing keeps that estimate honest: it propagates a single NPS score per session across every routing decision in that session and combines it with LLM-as-judge scores, admin ratings, and public benchmarks, so the "cheapest viable" cutoff is continuously re-fit rather than frozen at setup time. See the walk-through on Smart Cost Routing and session propagation.
You get visibility into exactly where your money is going from day one.
Cost Estimation Cheat Sheet
| Application Type | Typical Volume | Estimated Monthly Cost (GPT-4o-mini) |
|---|---|---|
| Internal chatbot | 1K req/day | $10-30 |
| Customer support bot | 10K req/day | $100-300 |
| Content generation | 5K req/day | $50-200 |
| Code assistant | 5K req/day | $150-500 |
| RAG application | 10K req/day | $200-600 |
| High-volume API | 100K req/day | $1,000-5,000 |
These are estimates using GPT-4o-mini. Multiply by 15-20x for GPT-4o equivalents.
Key Takeaways
- Do the math before going to production — use the cost formula with your actual prompt sizes
- Account for hidden costs — retries, conversation history, system prompt overhead
- Set budget controls on day one — don’t wait for a surprise bill
- Track cost per feature and per user — aggregates hide the real problems
- Start with the cheapest viable model — upgrade only where quality demands it