How to Cache AI API Requests and Save 20-40%
A practical guide to implementing caching for OpenAI, Claude, and other LLM API calls — from exact matching to semantic caching.
Every time your app sends the same question to an LLM, you’re paying for the same answer twice. In most production applications, 20-40% of requests are duplicates or near-duplicates.
Caching solves this. Here’s how to implement it properly.
Why Cache LLM Responses?
Three reasons:
- Cost reduction: Don’t pay for the same answer twice. Typical savings: 20-40%.
- Lower latency: A cached response returns in milliseconds instead of 1-3 seconds.
- Rate limit relief: Fewer API calls means less pressure on provider rate limits.
Types of LLM Caching
1. Exact Cache (Simple, High Precision)
Store the response for an exact prompt match. If the same prompt comes in again, return the cached response.
```typescript
import { createHash } from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

const redis = new Redis();
const openai = new OpenAI();

// Aliases for the OpenAI Node SDK's chat types.
type Message = OpenAI.Chat.ChatCompletionMessageParam;
type CompletionParams = OpenAI.Chat.ChatCompletionCreateParamsNonStreaming;

function getCacheKey(messages: Message[], model: string): string {
  const payload = JSON.stringify({ messages, model });
  return `llm:${createHash('sha256').update(payload).digest('hex')}`;
}

async function cachedCompletion(params: CompletionParams) {
  const key = getCacheKey(params.messages, params.model);

  // Check cache
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Call LLM
  const response = await openai.chat.completions.create(params);

  // Store with TTL (1 hour)
  await redis.setex(key, 3600, JSON.stringify(response));

  return response;
}
```

Pros: Simple, zero false positives, deterministic.

Cons: Only matches identical prompts — "what's the weather?" and "whats the weather" are different.
Best for: Classification, translation, structured extraction — tasks where the same input repeats exactly.
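For example, a sentiment-classification call goes through the wrapper unchanged; the model and prompts below are just placeholders:

```typescript
// The first call pays for the completion; identical repeats are served from Redis.
const result = await cachedCompletion({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: 'Classify the sentiment as positive, negative, or neutral.' },
    { role: 'user', content: 'Checkout was quick and painless.' },
  ],
});
```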
2. Semantic Cache (Smarter, Higher Hit Rate)
Uses embeddings to find prompts with similar meaning, even if the words differ.
"How do I return a product?" → cache hit"What's the refund process?" → same intent → serve cached response"Can I send something back?" → same intent → serve cached responseHow it works:
- Generate an embedding (vector) for the incoming prompt
- Search a vector database for similar embeddings
- If similarity exceeds a threshold (e.g., 0.95), return the cached response
- Otherwise, call the LLM and store the new response + embedding
```typescript
// `getEmbedding`, the `qdrant` client, and `generateId` are helpers
// set up elsewhere (a minimal sketch follows below).
async function semanticCachedCompletion(params: CompletionParams) {
  const prompt = params.messages[params.messages.length - 1].content;

  // Generate embedding
  const embedding = await getEmbedding(prompt);

  // Search vector DB for similar prompts
  const results = await qdrant.search('cache', {
    vector: embedding,
    limit: 1,
    score_threshold: 0.95,
  });

  if (results.length > 0) {
    return results[0].payload.response;
  }

  // Call LLM
  const response = await openai.chat.completions.create(params);

  // Store in vector DB
  await qdrant.upsert('cache', {
    points: [{
      id: generateId(),
      vector: embedding,
      payload: { prompt, response, timestamp: Date.now() },
    }],
  });

  return response;
}
```

Pros: Catches semantically similar prompts, much higher hit rate.

Cons: Requires embedding infrastructure, small risk of false positives, adds embedding latency.
Best for: Customer support, FAQ bots, any app with diverse phrasing of the same questions.
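The snippet above leans on a `qdrant` client, a `getEmbedding` helper, and a `generateId` function that aren't shown. Here's a minimal sketch of them using the Qdrant JS client and OpenAI's embeddings endpoint; the embedding model and local Qdrant URL are assumptions:

```typescript
import OpenAI from 'openai';
import { QdrantClient } from '@qdrant/js-client-rest';
import { randomUUID } from 'crypto';

const openai = new OpenAI();
const qdrant = new QdrantClient({ url: 'http://localhost:6333' });

// Embed a prompt so it can be compared to cached prompts by similarity score.
async function getEmbedding(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return res.data[0].embedding;
}

// Qdrant accepts UUIDs as point IDs.
function generateId(): string {
  return randomUUID();
}
```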
3. Hybrid Cache (Best of Both)
Try exact cache first (fast, zero risk), then semantic cache (slower, higher recall):
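A minimal sketch of that fallthrough, reusing the pieces from the previous two sections (the helper names, similarity threshold, and TTL are carried over as assumptions):

```typescript
async function hybridCachedCompletion(params: CompletionParams) {
  const key = getCacheKey(params.messages, params.model);

  // 1. Exact cache: cheap lookup, zero false positives
  const exact = await redis.get(key);
  if (exact) return JSON.parse(exact);

  // 2. Semantic cache: embed the prompt and look for a near match
  const prompt = params.messages[params.messages.length - 1].content;
  const embedding = await getEmbedding(prompt);
  const similar = await qdrant.search('cache', {
    vector: embedding,
    limit: 1,
    score_threshold: 0.95,
  });
  if (similar.length > 0) return similar[0].payload.response;

  // 3. Miss on both: call the LLM and populate both caches
  const response = await openai.chat.completions.create(params);
  await redis.setex(key, 3600, JSON.stringify(response));
  await qdrant.upsert('cache', {
    points: [{ id: generateId(), vector: embedding, payload: { prompt, response } }],
  });
  return response;
}
```

The flow, end to end: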
```
Request → Exact cache hit? → Return immediately
              ↓ miss
          Semantic cache hit? → Return cached response
              ↓ miss
          Call LLM → Store in both caches
```

Cache Invalidation Strategies
Time-Based (TTL)
Set a time-to-live on cached responses:
| Use Case | Recommended TTL |
|---|---|
| Static knowledge (FAQ, docs) | 24-72 hours |
| Semi-dynamic (product info) | 1-6 hours |
| Dynamic (prices, availability) | 5-30 minutes |
| Real-time data | Don’t cache |
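One lightweight way to apply the table is a TTL lookup keyed by a content category that you assign per request; the categories and values below just restate the table and are not a fixed scheme:

```typescript
// Seconds of freshness per content category (mirrors the table above).
const TTL_SECONDS: Record<string, number> = {
  static: 48 * 3600,       // FAQ, docs: 24-72 hours
  semiDynamic: 3 * 3600,   // product info: 1-6 hours
  dynamic: 15 * 60,        // prices, availability: 5-30 minutes
};

// Real-time data gets no entry, so it is never written to the cache.
async function cacheWithTtl(key: string, value: unknown, category: string) {
  const ttl = TTL_SECONDS[category];
  if (!ttl) return;
  await redis.setex(key, ttl, JSON.stringify(value));
}
```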
Version-Based
Invalidate cache when your prompts change. Include a version in the cache key:
```typescript
const PROMPT_VERSION = 'v3';

function getCacheKey(messages, model) {
  // hash() = the SHA-256 helper from the exact-cache example.
  const payload = JSON.stringify({ messages, model, v: PROMPT_VERSION });
  return `llm:${hash(payload)}`;
}
```

When you update your system prompt, bump the version and the old cache is automatically bypassed.
Event-Based
Invalidate specific cached entries when underlying data changes:
```typescript
// When product data updates, clear related cache.
// Assumes keys are namespaced with the product id (e.g. llm:product:<id>:<hash>)
// rather than being a bare hash.
async function onProductUpdate(productId: string) {
  const keys = await redis.keys(`llm:product:${productId}:*`);
  if (keys.length) await redis.del(...keys);
}
```

What NOT to Cache
Not everything should be cached:
- Creative generation: “Write a poem about cats” — users expect different outputs each time
- Conversations with history: Multi-turn conversations where context changes
- Time-sensitive queries: “What time is it?” or “What’s the latest news?”
- Personalized responses: Where the response depends on user-specific data not in the prompt
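These are judgment calls, but they can be approximated with a gate checked before any cache lookup or write; the thresholds and flags below are illustrative assumptions, not rules:

```typescript
// Rough gate: decide whether a request is worth caching at all.
// Time-sensitive or personalized requests are easiest to flag explicitly at the call site.
function shouldCache(
  params: CompletionParams,
  opts: { timeSensitive?: boolean; personalized?: boolean } = {}
): boolean {
  if (opts.timeSensitive || opts.personalized) return false;

  // Creative generation: a high temperature usually means varied output is expected.
  if ((params.temperature ?? 1) > 0.7) return false;

  // Multi-turn conversations: accumulated history makes reuse unlikely.
  if (params.messages.filter((m) => m.role === 'user').length > 1) return false;

  return true;
}
```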
Measuring Cache Effectiveness
Track these metrics:
```
Cache hit rate      = cached_responses / total_requests × 100%
Cost savings        = cache_hits × avg_cost_per_request
Latency improvement = avg_uncached_latency - avg_cached_latency
```

Target benchmarks:
- Hit rate > 20%: Good, you’re saving money
- Hit rate > 40%: Excellent, significant cost reduction
- Hit rate > 60%: Your use case is very cache-friendly
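A small counter-based tracker is usually enough to produce these numbers; the names and the cost figure passed in are placeholders:

```typescript
// Minimal cache metrics: count every lookup, then read out hit rate and savings.
const stats = { hits: 0, misses: 0 };

function recordLookup(hit: boolean): void {
  if (hit) stats.hits++;
  else stats.misses++;
}

function cacheMetrics(avgCostPerRequest: number) {
  const total = stats.hits + stats.misses;
  return {
    hitRatePercent: total ? (stats.hits / total) * 100 : 0,
    costSavings: stats.hits * avgCostPerRequest,
  };
}
```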
Using an AI Gateway for Caching
Building caching from scratch means managing Redis, a vector database, embeddings, TTL logic, and invalidation — significant infrastructure.
An AI gateway like Floopy provides both exact and semantic caching built in. You don’t need to change your code:
```typescript
import OpenAI from 'openai';

// Same code, caching happens automatically at the gateway
const client = new OpenAI({
  baseURL: "https://api.floopy.ai/v1",
  apiKey: process.env.FLOOPY_API_KEY,
});
```

The gateway handles cache key generation, TTL management, semantic matching, and cache analytics — all configurable from the dashboard.
Key Takeaways
- Start with exact caching — it’s simple, risk-free, and catches more duplicates than you’d expect
- Add semantic caching for user-facing apps where phrasing varies
- Set appropriate TTLs — balance freshness vs. cost savings
- Don’t cache everything — creative and time-sensitive tasks should skip cache
- Measure hit rate — if it’s below 10%, your use case might not benefit from caching
- Version your cache keys — invalidate automatically when prompts change