How to Cache AI API Requests and Save 20-40%
A practical guide to implementing caching for OpenAI, Claude, and other LLM API calls — from exact matching to semantic caching.
Every time your app sends the same question to an LLM, you’re paying for the same answer twice. In most production applications, 20-40% of requests are duplicates or near-duplicates.
Caching solves this. Here’s how to implement it properly.
Why Cache LLM Responses?
Three reasons:
- Cost reduction: Don’t pay for the same answer twice. Typical savings: 20-40%.
- Lower latency: A cached response returns in milliseconds instead of 1-3 seconds.
- Rate limit relief: Fewer API calls means less pressure on provider rate limits.
Types of LLM Caching
1. Exact Cache (Simple, High Precision)
Store the response for an exact prompt match. If the same prompt comes in again, return the cached response.
```typescript
import { createHash } from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

const redis = new Redis();
const openai = new OpenAI();

// Aliases for the OpenAI Node SDK's chat types.
type Message = OpenAI.Chat.ChatCompletionMessageParam;
type CompletionParams = OpenAI.Chat.ChatCompletionCreateParamsNonStreaming;

function getCacheKey(messages: Message[], model: string): string {
  const payload = JSON.stringify({ messages, model });
  return `llm:${createHash('sha256').update(payload).digest('hex')}`;
}

async function cachedCompletion(params: CompletionParams) {
  const key = getCacheKey(params.messages, params.model);

  // Check cache
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Call LLM
  const response = await openai.chat.completions.create(params);

  // Store with TTL (1 hour)
  await redis.setex(key, 3600, JSON.stringify(response));

  return response;
}
```

Pros: Simple, zero false positives, deterministic.

Cons: Only matches identical prompts — "what's the weather?" and "whats the weather" are different.
Best for: Classification, translation, structured extraction — tasks where the same input repeats exactly.
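For example, a sentiment-classification call goes through the wrapper unchanged; the model and prompts below are just placeholders:

```typescript
// The first call pays for the completion; identical repeats are served from Redis.
const result = await cachedCompletion({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: 'Classify the sentiment as positive, negative, or neutral.' },
    { role: 'user', content: 'Checkout was quick and painless.' },
  ],
});
```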
2. Semantic Cache (Smarter, Higher Hit Rate)
Uses embeddings to find prompts with similar meaning, even if the words differ.
"How do I return a product?" → cache hit"What's the refund process?" → same intent → serve cached response"Can I send something back?" → same intent → serve cached responseHow it works:
- Generate an embedding (vector) for the incoming prompt
- Search a vector database for similar embeddings
- If similarity exceeds a threshold (e.g., 0.95), return the cached response
- Otherwise, call the LLM and store the new response + embedding
```typescript
// `getEmbedding`, the `qdrant` client, and `generateId` are helpers
// set up elsewhere (a minimal sketch follows below).
async function semanticCachedCompletion(params: CompletionParams) {
  const prompt = params.messages[params.messages.length - 1].content;

  // Generate embedding
  const embedding = await getEmbedding(prompt);

  // Search vector DB for similar prompts
  const results = await qdrant.search('cache', {
    vector: embedding,
    limit: 1,
    score_threshold: 0.95,
  });

  if (results.length > 0) {
    return results[0].payload.response;
  }

  // Call LLM
  const response = await openai.chat.completions.create(params);

  // Store in vector DB
  await qdrant.upsert('cache', {
    points: [{
      id: generateId(),
      vector: embedding,
      payload: { prompt, response, timestamp: Date.now() },
    }],
  });

  return response;
}
```

Pros: Catches semantically similar prompts, much higher hit rate.

Cons: Requires embedding infrastructure, small risk of false positives, adds embedding latency.
Best for: Customer support, FAQ bots, any app with diverse phrasing of the same questions.
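The snippet above leans on a `qdrant` client, a `getEmbedding` helper, and a `generateId` function that aren't shown. Here's a minimal sketch of them using the Qdrant JS client and OpenAI's embeddings endpoint; the embedding model and local Qdrant URL are assumptions:

```typescript
import OpenAI from 'openai';
import { QdrantClient } from '@qdrant/js-client-rest';
import { randomUUID } from 'crypto';

const openai = new OpenAI();
const qdrant = new QdrantClient({ url: 'http://localhost:6333' });

// Embed a prompt so it can be compared to cached prompts by similarity score.
async function getEmbedding(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return res.data[0].embedding;
}

// Qdrant accepts UUIDs as point IDs.
function generateId(): string {
  return randomUUID();
}
```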
3. Hybrid Cache (Best of Both)
Try exact cache first (fast, zero risk), then semantic cache (slower, higher recall):
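A minimal sketch of that fallthrough, reusing the pieces from the previous two sections (the helper names, similarity threshold, and TTL are carried over as assumptions):

```typescript
async function hybridCachedCompletion(params: CompletionParams) {
  const key = getCacheKey(params.messages, params.model);

  // 1. Exact cache: cheap lookup, zero false positives
  const exact = await redis.get(key);
  if (exact) return JSON.parse(exact);

  // 2. Semantic cache: embed the prompt and look for a near match
  const prompt = params.messages[params.messages.length - 1].content;
  const embedding = await getEmbedding(prompt);
  const similar = await qdrant.search('cache', {
    vector: embedding,
    limit: 1,
    score_threshold: 0.95,
  });
  if (similar.length > 0) return similar[0].payload.response;

  // 3. Miss on both: call the LLM and populate both caches
  const response = await openai.chat.completions.create(params);
  await redis.setex(key, 3600, JSON.stringify(response));
  await qdrant.upsert('cache', {
    points: [{ id: generateId(), vector: embedding, payload: { prompt, response } }],
  });
  return response;
}
```

The flow, end to end: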
```
Request → Exact cache hit? → Return immediately
              ↓ miss
          Semantic cache hit? → Return cached response
              ↓ miss
          Call LLM → Store in both caches
```

Cache Invalidation Strategies
Time-Based (TTL)
Set a time-to-live on cached responses:
| Use Case | Recommended TTL |
|---|---|
| Static knowledge (FAQ, docs) | 24-72 hours |
| Semi-dynamic (product info) | 1-6 hours |
| Dynamic (prices, availability) | 5-30 minutes |
| Real-time data | Don’t cache |
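One lightweight way to apply the table is a TTL lookup keyed by a content category that you assign per request; the categories and values below just restate the table and are not a fixed scheme:

```typescript
// Seconds of freshness per content category (mirrors the table above).
const TTL_SECONDS: Record<string, number> = {
  static: 48 * 3600,       // FAQ, docs: 24-72 hours
  semiDynamic: 3 * 3600,   // product info: 1-6 hours
  dynamic: 15 * 60,        // prices, availability: 5-30 minutes
};

// Real-time data gets no entry, so it is never written to the cache.
async function cacheWithTtl(key: string, value: unknown, category: string) {
  const ttl = TTL_SECONDS[category];
  if (!ttl) return;
  await redis.setex(key, ttl, JSON.stringify(value));
}
```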
Version-Based
Invalidate cache when your prompts change. Include a version in the cache key:
```typescript
const PROMPT_VERSION = 'v3';

function getCacheKey(messages, model) {
  // hash() = the SHA-256 helper from the exact-cache example.
  const payload = JSON.stringify({ messages, model, v: PROMPT_VERSION });
  return `llm:${hash(payload)}`;
}
```

When you update your system prompt, bump the version and the old cache is automatically bypassed.
Event-Based
Invalidate specific cached entries when underlying data changes:
```typescript
// When product data updates, clear related cache.
// Assumes keys are namespaced with the product id (e.g. llm:product:<id>:<hash>)
// rather than being a bare hash.
async function onProductUpdate(productId: string) {
  const keys = await redis.keys(`llm:product:${productId}:*`);
  if (keys.length) await redis.del(...keys);
}
```

What NOT to Cache
Not everything should be cached:
- Creative generation: “Write a poem about cats” — users expect different outputs each time
- Conversations with history: Multi-turn conversations where context changes
- Time-sensitive queries: “What time is it?” or “What’s the latest news?”
- Personalized responses: Where the response depends on user-specific data not in the prompt
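These are judgment calls, but they can be approximated with a gate checked before any cache lookup or write; the thresholds and flags below are illustrative assumptions, not rules:

```typescript
// Rough gate: decide whether a request is worth caching at all.
// Time-sensitive or personalized requests are easiest to flag explicitly at the call site.
function shouldCache(
  params: CompletionParams,
  opts: { timeSensitive?: boolean; personalized?: boolean } = {}
): boolean {
  if (opts.timeSensitive || opts.personalized) return false;

  // Creative generation: a high temperature usually means varied output is expected.
  if ((params.temperature ?? 1) > 0.7) return false;

  // Multi-turn conversations: accumulated history makes reuse unlikely.
  if (params.messages.filter((m) => m.role === 'user').length > 1) return false;

  return true;
}
```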
Measuring Cache Effectiveness
Track these metrics:
```
Cache hit rate      = cached_responses / total_requests × 100%
Cost savings        = cache_hits × avg_cost_per_request
Latency improvement = avg_uncached_latency - avg_cached_latency
```

Target benchmarks:
- Hit rate > 20%: Good, you’re saving money
- Hit rate > 40%: Excellent, significant cost reduction
- Hit rate > 60%: Your use case is very cache-friendly
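A small counter-based tracker is usually enough to produce these numbers; the names and the cost figure passed in are placeholders:

```typescript
// Minimal cache metrics: count every lookup, then read out hit rate and savings.
const stats = { hits: 0, misses: 0 };

function recordLookup(hit: boolean): void {
  if (hit) stats.hits++;
  else stats.misses++;
}

function cacheMetrics(avgCostPerRequest: number) {
  const total = stats.hits + stats.misses;
  return {
    hitRatePercent: total ? (stats.hits / total) * 100 : 0,
    costSavings: stats.hits * avgCostPerRequest,
  };
}
```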
Using an AI Gateway for Caching
Building caching from scratch means managing Redis, a vector database, embeddings, TTL logic, and invalidation — significant infrastructure.
An AI gateway like Floopy provides both exact and semantic caching built in. You don’t need to change your code:
```typescript
import OpenAI from 'openai';

// Same code, caching happens automatically at the gateway
const client = new OpenAI({
  baseURL: "https://api.floopy.ai/v1",
  apiKey: process.env.FLOOPY_API_KEY,
});
```

The gateway handles cache key generation, TTL management, semantic matching, and cache analytics — all configurable from the dashboard.
Key Takeaways
- Start with exact caching — it’s simple, risk-free, and catches more duplicates than you’d expect
- Add semantic caching for user-facing apps where phrasing varies
- Set appropriate TTLs — balance freshness vs. cost savings
- Don’t cache everything — creative and time-sensitive tasks should skip cache
- Measure hit rate — if it’s below 10%, your use case might not benefit from caching
- Version your cache keys — invalidate automatically when prompts change