How to Cache AI API Requests and Save 30-40%

A practical guide to implementing caching for OpenAI, Claude, and other LLM API calls — from exact matching to semantic caching.

Floopy Team | 5 min read
Tags: caching, cost-optimization, performance, guides

Every time your app sends the same question to an LLM, you’re paying for the same answer twice. In most production applications, 20-40% of requests are duplicates or near-duplicates.

Caching solves this. Here’s how to implement it properly.

Why Cache LLM Responses?

Three reasons:

  1. Cost reduction: Don’t pay for the same answer twice. Typical savings: 20-40%.
  2. Lower latency: A cached response returns in milliseconds instead of 1-3 seconds.
  3. Rate limit relief: Fewer API calls means less pressure on provider rate limits.

Types of LLM Caching

1. Exact Cache (Simple, High Precision)

Store the response for an exact prompt match. If the same prompt comes in again, return the cached response.

import { createHash } from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

const redis = new Redis();
const openai = new OpenAI();

function getCacheKey(messages: Message[], model: string): string {
  const payload = JSON.stringify({ messages, model });
  return `llm:${createHash('sha256').update(payload).digest('hex')}`;
}

async function cachedCompletion(params: CompletionParams) {
  const key = getCacheKey(params.messages, params.model);

  // Check cache
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Call LLM
  const response = await openai.chat.completions.create(params);

  // Store with a 1-hour TTL
  await redis.setex(key, 3600, JSON.stringify(response));
  return response;
}

Pros: Simple, zero false positives, deterministic
Cons: Only matches identical prompts — “what’s the weather?” and “whats the weather” are different

Best for: Classification, translation, structured extraction — tasks where the same input repeats exactly.

2. Semantic Cache (Smarter, Higher Hit Rate)

Uses embeddings to find prompts with similar meaning, even if the words differ.

"How do I return a product?" → cache hit
"What's the refund process?" → same intent → serve cached response
"Can I send something back?" → same intent → serve cached response

How it works:

  1. Generate an embedding (vector) for the incoming prompt
  2. Search a vector database for similar embeddings
  3. If similarity exceeds a threshold (e.g., 0.95), return the cached response
  4. Otherwise, call the LLM and store the new response + embedding

import { randomUUID } from 'crypto';
import OpenAI from 'openai';
import { QdrantClient } from '@qdrant/js-client-rest';

const openai = new OpenAI();
const qdrant = new QdrantClient({ url: 'http://localhost:6333' });

// Embed text with the OpenAI embeddings API
async function getEmbedding(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return res.data[0].embedding;
}

async function semanticCachedCompletion(params: CompletionParams) {
  const prompt = params.messages[params.messages.length - 1].content;

  // Generate embedding for the incoming prompt
  const embedding = await getEmbedding(prompt);

  // Search vector DB for similar prompts
  const results = await qdrant.search('cache', {
    vector: embedding,
    limit: 1,
    score_threshold: 0.95,
  });
  if (results.length > 0) {
    return results[0].payload.response;
  }

  // Call LLM
  const response = await openai.chat.completions.create(params);

  // Store response + embedding in the vector DB
  await qdrant.upsert('cache', {
    points: [{
      id: randomUUID(),
      vector: embedding,
      payload: { prompt, response, timestamp: Date.now() },
    }],
  });
  return response;
}

Pros: Catches semantically similar prompts, much higher hit rate
Cons: Requires embedding infrastructure, small risk of false positives, adds embedding latency

Best for: Customer support, FAQ bots, any app with diverse phrasing of the same questions.

3. Hybrid Cache (Best of Both)

Try exact cache first (fast, zero risk), then semantic cache (slower, higher recall):

Request → Exact cache hit? → Return immediately
↓ miss
Semantic cache hit? → Return cached response
↓ miss
Call LLM → Store in both caches
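The flow above can be sketched as a single function. The `Cache` interface here is a stand-in for the exact (Redis) and semantic (vector DB) layers shown earlier, so the orchestration logic is visible without committing to either backend:

```typescript
// Minimal cache interface; the Redis and vector-DB layers from the
// earlier examples would each implement this.
interface Cache {
  get(prompt: string): Promise<string | undefined>;
  set(prompt: string, response: string): Promise<void>;
}

async function hybridCompletion(
  prompt: string,
  exactCache: Cache,     // fast, zero false positives
  semanticCache: Cache,  // slower, higher recall
  callLLM: (prompt: string) => Promise<string>,
): Promise<string> {
  // 1. Exact cache: cheapest check first
  const exact = await exactCache.get(prompt);
  if (exact !== undefined) return exact;

  // 2. Semantic cache: embedding lookup
  const semantic = await semanticCache.get(prompt);
  if (semantic !== undefined) {
    // Promote to the exact cache so the next identical prompt is a fast hit
    await exactCache.set(prompt, semantic);
    return semantic;
  }

  // 3. Miss on both: call the LLM and populate both caches
  const response = await callLLM(prompt);
  await Promise.all([
    exactCache.set(prompt, response),
    semanticCache.set(prompt, response),
  ]);
  return response;
}
```

The promotion step in the semantic branch is a design choice: it turns a repeated exact prompt into a millisecond hit instead of paying the embedding lookup every time.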

Cache Invalidation Strategies

Time-Based (TTL)

Set a time-to-live on cached responses:

  • Static knowledge (FAQ, docs): 24-72 hours
  • Semi-dynamic (product info): 1-6 hours
  • Dynamic (prices, availability): 5-30 minutes
  • Real-time data: don’t cache
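These tiers can be encoded as a lookup consulted at write time. A minimal sketch — the tier names and midpoint values are illustrative choices, not fixed rules:

```typescript
// TTLs in seconds per freshness tier (values taken from the table above,
// using a midpoint where a range is given)
const TTL_SECONDS: Record<string, number | null> = {
  static: 48 * 3600,      // FAQ, docs: 24-72 hours
  semiDynamic: 3 * 3600,  // product info: 1-6 hours
  dynamic: 15 * 60,       // prices, availability: 5-30 minutes
  realtime: null,         // don't cache at all
};

function ttlFor(tier: string): number | null {
  // Unknown tiers default to "don't cache" — the safe choice
  return TTL_SECONDS[tier] ?? null;
}
```

At write time, skip the `setex` call entirely when `ttlFor(...)` returns `null`.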

Version-Based

Invalidate cache when your prompts change. Include a version in the cache key:

const PROMPT_VERSION = 'v3';

function getCacheKey(messages, model) {
  const payload = JSON.stringify({ messages, model, v: PROMPT_VERSION });
  return `llm:${hash(payload)}`;
}

When you update your system prompt, bump the version and the old cache is automatically bypassed.

Event-Based

Invalidate specific cached entries when underlying data changes:

// When product data updates, clear related cache entries.
// Note: KEYS blocks Redis on large keyspaces; prefer SCAN in production.
async function onProductUpdate(productId: string) {
  const keys = await redis.keys(`llm:product:${productId}:*`);
  if (keys.length) await redis.del(...keys);
}

What NOT to Cache

Not everything should be cached:

  • Creative generation: “Write a poem about cats” — users expect different outputs each time
  • Conversations with history: Multi-turn conversations where context changes
  • Time-sensitive queries: “What time is it?” or “What’s the latest news?”
  • Personalized responses: Where the response depends on user-specific data not in the prompt
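These exclusions can be front-loaded as a cheap pre-check before any cache lookup. A minimal sketch with illustrative heuristics — the keyword list and temperature cutoff below are assumptions, not established thresholds:

```typescript
interface CacheCheckParams {
  messages: { role: string; content: string }[];
  temperature?: number;
}

function shouldCache(params: CacheCheckParams): boolean {
  // Multi-turn conversations: context differs per user, skip the cache
  const userTurns = params.messages.filter((m) => m.role === "user");
  if (userTurns.length > 1) return false;

  // High temperature suggests creative generation with varied outputs
  if ((params.temperature ?? 0) > 0.7) return false;

  // Crude keyword check for time-sensitive wording (illustrative only)
  const prompt = userTurns[0]?.content.toLowerCase() ?? "";
  if (/\b(now|today|latest|current|breaking)\b/.test(prompt)) return false;

  return true;
}
```

Personalized responses are harder to detect from the request alone; those are best excluded at the call site, where you know user-specific data is in play.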

Measuring Cache Effectiveness

Track these metrics:

Cache hit rate = cached_responses / total_requests × 100%
Cost savings = cache_hits × avg_cost_per_request
Latency improvement = avg_uncached_latency - avg_cached_latency
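As a sketch, these formulas reduce to a small helper over counters you already track; the per-request cost and latency figures you plug in are your own measurements:

```typescript
interface CacheStats {
  totalRequests: number;
  cacheHits: number;
  avgCostPerRequest: number;    // USD, measured from your billing
  avgUncachedLatencyMs: number;
  avgCachedLatencyMs: number;
}

function cacheMetrics(s: CacheStats) {
  const hitRate = s.totalRequests > 0 ? s.cacheHits / s.totalRequests : 0;
  return {
    hitRatePercent: hitRate * 100,
    costSavings: s.cacheHits * s.avgCostPerRequest,
    latencyImprovementMs: s.avgUncachedLatencyMs - s.avgCachedLatencyMs,
  };
}
```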

Target benchmarks:

  • Hit rate > 20%: Good, you’re saving money
  • Hit rate > 40%: Excellent, significant cost reduction
  • Hit rate > 60%: Your use case is very cache-friendly

Using an AI Gateway for Caching

Building caching from scratch means managing Redis, a vector database, embeddings, TTL logic, and invalidation — significant infrastructure.

An AI gateway like Floopy provides both exact and semantic caching built in. You don’t need to change your code:

// Same code, caching happens automatically at the gateway
const client = new OpenAI({
baseURL: "https://api.floopy.ai/v1",
apiKey: process.env.FLOOPY_API_KEY,
});

The gateway handles cache key generation, TTL management, semantic matching, and cache analytics — all configurable from the dashboard.

Key Takeaways

  1. Start with exact caching — it’s simple, risk-free, and catches more duplicates than you’d expect
  2. Add semantic caching for user-facing apps where phrasing varies
  3. Set appropriate TTLs — balance freshness vs. cost savings
  4. Don’t cache everything — creative and time-sensitive tasks should skip cache
  5. Measure hit rate — if it’s below 10%, your use case might not benefit from caching
  6. Version your cache keys — invalidate automatically when prompts change