# Benchmarks

These benchmarks compare the Floopy AI Gateway against direct OpenAI calls and two popular competitors — LiteLLM and Helicone — using the OpenAI Node.js SDK. Tests ran with `gpt-4.1-nano`, 50 rounds per scenario, worst outlier excluded, with anti-cache timestamps injected in every prompt to prevent provider-side caching.
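The anti-cache tag is the `[ref:{scenario}-{index}-{timestamp}]` suffix described under Methodology. A minimal sketch of how such a tag can be appended (the helper name is hypothetical, not the benchmark's actual code):

```ts
// Appending a unique suffix to every prompt guarantees zero provider-side
// cache hits, so only gateway-level caching affects the results.
function withAntiCacheTag(prompt: string, scenario: string, index: number): string {
  return `${prompt} [ref:${scenario}-${index}-${Date.now()}]`;
}
```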
## Key Takeaways

- **Floopy is 4.8% faster than calling OpenAI directly** — even with no features enabled, Rust connection pooling beats the SDK’s own HTTP optimization.
- **Floopy with firewall is 8.6% faster than direct** — Prompt Guard adds zero latency (it runs locally via ONNX in under 1ms) while connection pooling saves more than it costs.
- **Floopy + Cache + Firewall is 58% faster than direct** — the recommended production config delivers a P50 of 260ms with both security and cost savings.
- **LiteLLM is within noise of direct (-0.6%)** — the Python-based proxy performs well for passthrough, but offers no latency improvement.
- **Helicone adds 2.4% overhead** — the managed proxy introduces slight latency from the extra network hop.
- **Floopy uses only 41MB of memory** — the Rust gateway is extremely lean, peaking at 44MB under benchmark load.
## Results Summary
| Scenario | Avg (ms) | P50 (ms) | P99 (ms) | Min (ms) | RPS | vs Direct |
|---|---|---|---|---|---|---|
| OpenAI Direct | 664 | 633 | 983 | 480 | 1.5 | baseline |
| Floopy (no features) | 632 | 620 | 879 | 387 | 1.5 | -4.8% |
| Floopy + Exact Cache | 195 | 10 | 773 | 5 | 4.8 | -70.6% |
| Floopy + Firewall | 607 | 613 | 826 | 438 | 1.6 | -8.6% |
| Floopy + Cache + Firewall | 277 | 260 | 1,171 | 6 | 3.0 | -58.3% |
| LiteLLM Proxy | 660 | 665 | 895 | 449 | 1.5 | -0.6% |
| Helicone Proxy | 680 | 655 | 980 | 480 | 1.4 | +2.4% |
## Gateway Comparison
| Metric | Floopy | LiteLLM | Helicone |
|---|---|---|---|
| Avg latency | 632ms | 660ms | 680ms |
| vs Direct | -4.8% | -0.6% | +2.4% |
| Written in | Rust (Axum/Tokio) | Python | Managed (cloud) |
| Memory usage | 41MB avg / 44MB peak | ~200-400MB typical | N/A (managed) |
| Caching | 3-tier (exact + semantic + advanced) | Basic Redis | No |
| LLM Firewall | On-device (ONNX, sub-1ms) | External integrations | No |
Floopy is the only gateway that is measurably faster than calling the provider directly.
## Scenario Details

### OpenAI Direct (Baseline)
Direct call to `api.openai.com` using the OpenAI Node.js SDK.

- Avg: 664ms | P50: 633ms — typical latency for `gpt-4.1-nano`
- This is the baseline all gateways are compared against (reproduced in the sketch below)
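A minimal sketch of one baseline round, using the parameters from Methodology (model, temperature 0.0, max tokens 256); the prompt text is illustrative:

```ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// One benchmark round: time a single chat completion against api.openai.com.
const start = performance.now();
await openai.chat.completions.create({
  model: "gpt-4.1-nano",
  temperature: 0.0,
  max_tokens: 256,
  messages: [
    { role: "user", content: withAntiCacheTag("Explain TCP in one sentence.", "direct", 0) },
  ],
});
console.log(`latency: ${(performance.now() - start).toFixed(0)}ms`);
```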
### Floopy (No Features)
Gateway with all features disabled. Measures pure proxy overhead.
- Avg: 632ms — 32ms faster than direct (4.8%)
- The Rust gateway maintains persistent keep-alive HTTPS connections to OpenAI, eliminating per-request TLS negotiation that the SDK’s connection reuse can’t fully avoid
- Min: 387ms — best-case latency is nearly 100ms lower than direct’s best case (480ms)
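Routing the same SDK through the gateway is a one-line change: only the base URL differs. A sketch, assuming Floopy exposes an OpenAI-compatible endpoint (the example URL is an assumption; `FLOOPY_URL` and `FLOOPY_API_KEY` are the `.env` keys listed under Methodology):

```ts
import OpenAI from "openai";

// Same SDK, same call sites; only the base URL changes.
const floopy = new OpenAI({
  baseURL: process.env.FLOOPY_URL, // e.g. http://localhost:8080/v1 (assumed)
  apiKey: process.env.FLOOPY_API_KEY,
});
```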
### Floopy + Exact Cache
Exact cache enabled with prompts that naturally repeat across 50 rounds.
- Avg: 195ms — 70.6% faster than direct
- P50: 10ms — most requests hit cache, returning from Redis in single-digit milliseconds
- RPS: 4.8 — 3.2x the throughput of direct calls
- Min: 5ms — cache hits bypass the provider entirely
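A cache hit is easy to observe from the client: send the same prompt twice, without the anti-cache tag, and compare timings. A sketch reusing the `floopy` client above (the helper and prompt are illustrative):

```ts
// Time a single request. Identical prompts should hit the exact cache on
// the second call; with the cache warm, the benchmark saw single-digit
// milliseconds (Min: 5ms).
async function timedCall(content: string): Promise<number> {
  const start = performance.now();
  await floopy.chat.completions.create({
    model: "gpt-4.1-nano",
    temperature: 0.0,
    max_tokens: 256,
    messages: [{ role: "user", content }],
  });
  return performance.now() - start;
}

const prompt = "Explain DNS in one sentence."; // no anti-cache tag, so it repeats
console.log(`miss: ${(await timedCall(prompt)).toFixed(0)}ms`);
console.log(`hit:  ${(await timedCall(prompt)).toFixed(0)}ms`);
```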
### Floopy + Firewall (Prompt Guard)
LLM firewall enabled. Scans every prompt for injection attacks using a local ONNX model.
- Avg: 607ms — 8.6% faster than direct
- The firewall runs locally in under 1ms — it adds no measurable latency
- P99: 826ms — more consistent tail latency than direct (983ms) because connection pooling absorbs provider variance
### Floopy + Cache + Firewall (Production Config)
Recommended production configuration with caching and security enabled.
- Avg: 277ms — 58.3% faster than direct
- P50: 260ms — consistent sub-300ms with both security and cost savings
- Min: 6ms — cache hits are near-instant even with firewall active
- RPS: 3.0 — 2x throughput of direct calls
### LiteLLM Proxy
Open-source Python proxy in passthrough mode (Docker, no caching or security features).
- Avg: 660ms — 0.6% faster than direct (within noise)
- Python runtime adds overhead that roughly cancels out any connection pooling benefit
- A fair passthrough proxy, but no latency advantage
### Helicone Proxy
Managed observability proxy (cloud-hosted, no caching or security features).
- Avg: 680ms — 2.4% slower than direct
- The extra network hop to Helicone’s cloud proxy adds ~16ms
- Provides logging and analytics but at a latency cost
## Performance at a Glance

```
Floopy + Exact Cache  █████████ 195ms (-71%) ⚡ fastest
Floopy + Cache + FW   ██████████████ 277ms (-58%) 🛡️ recommended
Floopy + Firewall     ████████████████████████████████ 607ms (-9%)
Floopy (no features)  █████████████████████████████████ 632ms (-5%)
LiteLLM Proxy         █████████████████████████████████ 660ms (-1%)
OpenAI Direct         ██████████████████████████████████ 664ms baseline
Helicone Proxy        ███████████████████████████████████ 680ms (+2%)
```

## Memory Usage
The Floopy gateway uses minimal memory throughout the benchmark:
| Metric | Value |
|---|---|
| Average RSS | 41 MB |
| Peak RSS | 44 MB |
| Samples | 382 (every 500ms) |
For comparison, Python-based gateways typically use 200-400MB at idle. The Rust runtime has no garbage collector, no interpreter overhead, and no VM warmup.
## Why Floopy Is Fast
Most AI gateways are written in Python or Node.js — languages that add 5-50ms of overhead per request. Floopy is written in Rust with Axum and Tokio, which gives it structural advantages:
**Persistent connection pooling.** The gateway maintains warm HTTPS connections to each provider. This eliminates per-request TLS handshakes — saving 20-50ms that even SDK connection reuse can’t fully avoid. Confirmed by the 4.8% improvement with all features disabled.

**Zero-allocation forwarding.** Rust’s ownership model allows request parsing and forwarding without intermediate buffer allocations. No garbage collection pauses.

**Async I/O with Tokio.** All operations run on a work-stealing thread pool. Cache lookups, firewall inference, and provider calls execute concurrently. Nothing blocks.

**On-device firewall.** Prompt Guard runs locally via ONNX Runtime in a dedicated thread — no external API call, no latency penalty. This is why the firewall scenario (607ms) is actually faster than direct (664ms).

**Background logging.** Request logs are queued via async channels and batch-inserted to ClickHouse. Logging never touches the response path.

**41MB memory footprint.** No interpreter, no VM, no runtime overhead. The gateway binary includes the ONNX model and still uses less memory than a typical Python import chain.
## Methodology

- Client: OpenAI Node.js SDK
- Model: `gpt-4.1-nano`
- Rounds per scenario: 50
- Concurrency: 1 (sequential — measures pure latency)
- Outlier handling: worst result per scenario excluded
- Anti-cache: timestamp + index injected in every prompt (`[ref:{scenario}-{index}-{timestamp}]`) — zero provider cache hits possible
- Gateway: local Floopy instance (Rust/Axum), 41MB RSS
- LiteLLM: Docker container (`ghcr.io/berriai/litellm:main-latest`), passthrough to OpenAI
- Helicone: managed proxy (`ai-gateway.helicone.ai`), passthrough to OpenAI
- Temperature: 0.0
- Max tokens: 256
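A sketch of one way to compute the reported statistics, assuming `latencies` holds a scenario's 50 raw round timings in milliseconds (the function is illustrative, not the benchmark's actual code):

```ts
// Per-scenario stats as described above: drop the single worst round,
// then report avg / P50 / P99 / min over the remaining samples.
function summarize(latencies: number[]) {
  const sorted = [...latencies].sort((a, b) => a - b);
  const kept = sorted.slice(0, -1); // exclude the worst outlier
  const pct = (p: number) =>
    kept[Math.min(kept.length - 1, Math.floor((p / 100) * kept.length))];
  return {
    avg: kept.reduce((sum, ms) => sum + ms, 0) / kept.length,
    p50: pct(50),
    p99: pct(99),
    min: kept[0],
  };
}
```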
## How to Run

```bash
cd example

# Start LiteLLM competitor (requires Docker)
npm run competitors:up

# Run the full benchmark (sequential)
npm run benchmark:full

# Run with concurrency
npm run benchmark:full:concurrent  # 10 parallel requests
npm run benchmark:full:stress      # 50 parallel, 200 rounds

# Stop competitors
npm run competitors:down
```

Set `SKIP_COMPETITORS=true` to benchmark only Floopy vs OpenAI direct.
All configuration is read from `.env` (`OPENAI_API_KEY`, `FLOOPY_API_KEY`, `FLOOPY_URL`, `LITELLM_URL`, `HELICONE_URL`, `HELICONE_API_KEY`, `MODEL`, `ROUNDS`, `CONCURRENCY`).
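A sample `.env` with placeholder values; the keys and the Helicone host come from this document, but the local URLs, ports, and key formats are assumptions to adapt to your deployment:

```dotenv
OPENAI_API_KEY=sk-...             # your OpenAI key
FLOOPY_API_KEY=...                # placeholder
FLOOPY_URL=http://localhost:8080  # assumed local port
LITELLM_URL=http://localhost:4000 # LiteLLM proxy's default port
HELICONE_URL=https://ai-gateway.helicone.ai
HELICONE_API_KEY=...              # placeholder
MODEL=gpt-4.1-nano
ROUNDS=50
CONCURRENCY=1
```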