# Caching
## How Caching Works
Every request that passes through the Floopy gateway is checked against the cache before it reaches the LLM provider. When a cache hit occurs, the stored response is returned immediately — no tokens are consumed and latency drops to single-digit milliseconds.
Caching is applied automatically based on your configuration. You do not need to change your application code.
## Cache Tiers
Floopy uses three tiers of caching, checked in order:
### Exact Cache
The first tier performs a direct lookup in Redis. If the incoming request is byte-for-byte identical to a previous request (same model, messages, and parameters), the cached response is returned instantly. This is the fastest tier with the lowest overhead.
Exact cache entries expire based on a configurable TTL. This tier is available on all plans.
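Conceptually, the exact tier boils down to hashing the canonical form of the request and using the digest as a Redis key. The sketch below is illustrative only: the `floopy:exact:` prefix and the SHA-256 scheme are assumptions, not Floopy's documented key format.

```python
import hashlib
import json

def exact_cache_key(request_body: dict) -> str:
    """Hash the canonical JSON form of the request so that identical
    content always maps to the same Redis key, regardless of field order."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return "floopy:exact:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

req = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
}
key_a = exact_cache_key(req)
# Same content with the fields in a different order yields the same key.
key_b = exact_cache_key(dict(reversed(list(req.items()))))
```

Because the key covers the model, messages, and parameters, changing any one of them (even `temperature`) produces a different key and therefore a fresh provider call.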
### Semantic Cache
When the exact cache misses, Floopy checks the semantic cache. Your prompt is converted into a vector embedding and compared against previous prompts stored in the vector database. If a stored prompt is similar enough — above the configured similarity threshold — its response is returned.
This means slight rephrasings of the same question (“What is the capital of France?” vs. “Tell me France’s capital city”) can be served from cache instead of calling the provider again.
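A minimal sketch of that lookup, assuming cosine similarity against stored embeddings and a configurable threshold (the `0.92` value and the helper names are hypothetical; toy two-dimensional vectors stand in for real model embeddings):

```python
import math

SIMILARITY_THRESHOLD = 0.92  # hypothetical configured threshold

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, store):
    """Return the cached response for the most similar stored prompt,
    but only if the best score clears the configured threshold."""
    best_score, best_response = -1.0, None
    for stored_vec, response in store:
        score = cosine_similarity(query_vec, stored_vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

store = [([1.0, 0.0], "The capital of France is Paris.")]
hit = semantic_lookup([0.98, 0.05], store)   # near-duplicate phrasing
miss = semantic_lookup([0.0, 1.0], store)    # unrelated prompt
```

Raising the threshold trades hit rate for precision: fewer rephrasings match, but the matches that do occur are closer to the original prompt.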
### Advanced Cache
The advanced tier adds bucketing by message count and token-aware similarity scoring. Requests are grouped by conversation length before similarity comparison, which improves match accuracy for multi-turn conversations and reduces false positives.
Advanced cache is available on the Pro plan and above.
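One way the bucketing and token-aware scoring could work is sketched below. Both `bucket_key` and `token_aware_score` are hypothetical helpers; Floopy does not document its actual grouping or scoring formula here.

```python
def bucket_key(messages: list[dict]) -> int:
    """Group conversations by turn count before any similarity search,
    so a 2-message prompt is never compared against a 10-message one."""
    return len(messages)

def token_aware_score(embedding_similarity: float,
                      tokens_a: int, tokens_b: int) -> float:
    """Damp the raw similarity by the ratio of prompt lengths, so a short
    prompt cannot match a much longer one on embedding proximity alone."""
    length_ratio = min(tokens_a, tokens_b) / max(tokens_a, tokens_b)
    return embedding_similarity * length_ratio
```

Restricting comparisons to same-length conversations is what cuts the false positives: two chats with very different turn counts rarely want the same answer even when their embeddings are close.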
## Cache Decision Flow
The following diagram shows the full decision path for every request with caching enabled:
```mermaid
graph TD
    A[Request with Floopy-Cache-Enabled: true] --> B[Normalize Prompt]
    B --> C[Exact Cache Lookup in Redis]
    C -->|Hit + Bucket Full| D[Return Random Cached Response]
    C -->|Hit + Bucket Not Full| E[Continue to Provider, Store Additional Response]
    C -->|Miss| F[Semantic Cache Lookup in Qdrant]
    F -->|Hit above threshold| D
    F -->|Miss| G{floopy-cache-advanced: true?}
    G -->|Yes| H[Advanced Cache: Embedding + Bucketed Search]
    H -->|Hit| D
    H -->|Miss| I[Continue to Provider]
    G -->|No| I
    I --> J[Provider Response]
    J --> K[Store in Cache Async]
```

The prompt is first normalized: whitespace is trimmed, and any keys listed in `Floopy-Cache-Ignore-Keys` are stripped from the request body before the cache key is computed. This ensures that irrelevant fields such as timestamps or request IDs do not cause cache misses.
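The normalization step can be sketched as follows (an illustrative implementation, assuming ignored keys are top-level fields of the request body):

```python
def normalize(body: dict, ignore_keys: str = "") -> dict:
    """Drop ignored top-level keys and trim whitespace in message content
    before the cache key is computed. Returns a new dict; the original
    request body is not mutated."""
    ignored = {k.strip() for k in ignore_keys.split(",") if k.strip()}
    cleaned = {k: v for k, v in body.items() if k not in ignored}
    if "messages" in cleaned:
        cleaned["messages"] = [
            {**m, "content": m["content"].strip()}
            if isinstance(m.get("content"), str) else m
            for m in cleaned["messages"]
        ]
    return cleaned

request = {
    "messages": [{"role": "user", "content": "  What is the capital of France? "}],
    "timestamp": 1712345678,
}
normalized = normalize(request, "timestamp,request_id")
```

After normalization, two requests that differ only in their `timestamp` field produce identical cache keys and therefore share a cache entry.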
## Bucket Max Size
When Floopy-Cache-Bucket-Max-Size is set (e.g., 3), each cache key can store multiple distinct responses instead of just one. This gives response variety while still saving cost — particularly useful for creative or non-deterministic prompts where you want diverse outputs.
The bucketing logic works as follows:
- **Bucket not full** — If the bucket has fewer responses than the configured max size, the request continues to the provider as normal. The new response is added to the bucket alongside the existing cached responses.
- **Bucket full** — If the bucket already contains the maximum number of responses, a random response is selected from the bucket and returned immediately. No provider call is made, saving both cost and latency.
For example, with Floopy-Cache-Bucket-Max-Size: 3, the first three identical requests all go to the provider, and each response is stored. From the fourth request onward, one of the three stored responses is returned at random.
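The behaviour above can be modeled with a small class (an assumed shape, not Floopy's internal code):

```python
import random

class ResponseBucket:
    """Per-cache-key bucket that stores up to max_size distinct responses."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.responses: list[str] = []

    def lookup(self):
        """Full bucket: return a random cached response. Otherwise None,
        meaning the caller should forward the request to the provider."""
        if len(self.responses) >= self.max_size:
            return random.choice(self.responses)
        return None

    def store(self, response: str) -> None:
        if len(self.responses) < self.max_size:
            self.responses.append(response)

bucket = ResponseBucket(max_size=3)
first_three = []
for reply in ["Paris.", "The capital is Paris.", "Paris, France."]:
    first_three.append(bucket.lookup())  # None: these requests hit the provider
    bucket.store(reply)
fourth = bucket.lookup()  # served from the bucket, no provider call
```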
## Cache Headers Quick Reference
| Header | Type | Default | Effect |
|---|---|---|---|
| `Floopy-Cache-Enabled` | `"true"` / `"false"` | Per-key setting | Enables or disables all cache tiers for this request |
| `Floopy-Cache-Seed` | string | none | Isolates exact cache entries by seed value. Requests with different seeds produce different exact cache keys. Note: the semantic and advanced cache tiers match by embedding similarity and are not affected by the seed; a semantically identical prompt may still return a cached response from a different seed |
| `Floopy-Cache-Bucket-Max-Size` | integer | 1 | Maximum number of responses stored per cache key. When full, a random cached response is returned |
| `Floopy-Cache-Ignore-Keys` | comma-separated strings | none | Request body keys excluded from cache key computation (e.g., `timestamp,request_id`) |
| `floopy-cache-advanced` | `"true"` / `"false"` | Per-key setting | Enables or disables the advanced cache tier (bucketed embedding search) for this request |
| `cache-control` | `max-age=<seconds>` | Per-key TTL | Overrides the TTL for this request's cache entry. The cached response expires after the specified number of seconds |
## Cost and Latency Savings
Cached responses are free — no tokens are billed by the upstream provider. For applications with repetitive or similar queries, caching can cut costs by 30-70% depending on traffic patterns.
Latency for a cache hit is typically under 10ms, compared to 500ms-3s for a live provider call. This improvement is especially noticeable for user-facing applications where response time matters.
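These headline numbers follow from simple weighted averages over the hit rate. The helpers below are illustrative arithmetic, not a Floopy API; the example figures plug in a 10 ms hit and a 1 s provider call at a 50% hit rate.

```python
def effective_cost(cost_per_provider_call: float, hit_rate: float) -> float:
    """Cache hits bill no tokens, so blended cost is carried by misses alone."""
    return cost_per_provider_call * (1 - hit_rate)

def blended_latency_ms(hit_ms: float, miss_ms: float, hit_rate: float) -> float:
    """Average latency is the hit-rate-weighted mix of the two paths."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

half_cost = effective_cost(1.0, 0.5)            # 50% hit rate halves token spend
avg_latency = blended_latency_ms(10, 1000, 0.5)  # 505.0 ms vs. 1000 ms uncached
```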
## Dashboard Metrics
The Caching section of the dashboard provides real-time visibility into cache performance:
- Hit rate over time — a chart showing the percentage of requests served from cache, broken down by tier.
- Tokens saved — the total number of tokens that would have been consumed without caching.
- Latency saved — cumulative time saved by returning cached responses instead of calling the provider.
- Top cached pairs — the most frequently cached prompt-response pairs, so you can understand what drives your cache efficiency.
Use these metrics to tune your similarity thresholds and TTL settings for the best balance of freshness and savings.
## Configuration
You can enable and configure caching per API key or at the organization level in Settings. Key options include:
- Enable/disable each cache tier independently.
- TTL — how long exact cache entries remain valid.
- Similarity threshold — the minimum similarity score required for a semantic cache hit (higher values require closer matches).
## Cache Control Headers
You can control caching behavior on a per-request basis using headers. These override the default configuration for the API key or organization.
| Header | Values | Description |
|---|---|---|
| `Floopy-Cache-Enabled` | `"true"` / `"false"` | Enable or disable caching for this request |
| `Floopy-Cache-Seed` | any string | Isolates exact cache entries by seed value. Semantic and advanced tiers are not affected by the seed |
| `Floopy-Cache-Bucket-Max-Size` | integer | Maximum number of responses stored per cache key. When full, a random cached response is returned |
| `Floopy-Cache-Ignore-Keys` | comma-separated list | Request body keys to ignore when computing the cache key (e.g., `timestamp,request_id`) |
| `floopy-cache-advanced` | `"true"` / `"false"` | Enable or disable the advanced semantic cache tier for this request |
| `cache-control` | `max-age=<seconds>` | Override the TTL for this request's cache entry |
**TypeScript**

```typescript
import { OpenAI } from "openai";

const client = new OpenAI({
  baseURL: "https://api.floopy.ai/v1",
  apiKey: process.env.FLOOPY_API_KEY,
});

const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: "What is the capital of France?" }],
  },
  {
    headers: {
      "Floopy-Cache-Enabled": "true",
      "Floopy-Cache-Bucket-Max-Size": "3",
      "Floopy-Cache-Seed": "deterministic-seed-abc",
      "Floopy-Cache-Ignore-Keys": "timestamp,request_id",
      "floopy-cache-advanced": "true",
      "cache-control": "max-age=3600",
    },
  },
);

console.log(response.choices[0].message.content);
```

**Python**

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.floopy.ai/v1",
    api_key=os.environ["FLOOPY_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_headers={
        "Floopy-Cache-Enabled": "true",
        "Floopy-Cache-Bucket-Max-Size": "3",
        "Floopy-Cache-Seed": "deterministic-seed-abc",
        "Floopy-Cache-Ignore-Keys": "timestamp,request_id",
        "floopy-cache-advanced": "true",
        "cache-control": "max-age=3600",
    },
)

print(response.choices[0].message.content)
```

**cURL**

```shell
curl https://api.floopy.ai/v1/chat/completions \
  -H "Authorization: Bearer $FLOOPY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Floopy-Cache-Enabled: true" \
  -H "Floopy-Cache-Bucket-Max-Size: 3" \
  -H "Floopy-Cache-Seed: deterministic-seed-abc" \
  -H "Floopy-Cache-Ignore-Keys: timestamp,request_id" \
  -H "floopy-cache-advanced: true" \
  -H "cache-control: max-age=3600" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
```

## Plan Requirements
| Tier | Availability |
|---|---|
| Exact Cache | All plans |
| Semantic Cache | All plans |
| Advanced Cache | Pro plan and above |