
Benchmarks

These benchmarks compare the Floopy AI Gateway against direct OpenAI calls and two popular competitors — LiteLLM and Helicone — using the OpenAI Node.js SDK. Tests ran with gpt-4.1-nano, 50 rounds per scenario, worst outlier excluded, with anti-cache timestamps injected in every prompt to prevent provider-side caching.

Key Takeaways

  • Floopy is 4.8% faster than calling OpenAI directly — even with no features enabled, Rust connection pooling beats the SDK’s own HTTP optimization.
  • Floopy with firewall is 8.6% faster than direct — Prompt Guard adds no measurable latency (it runs locally via ONNX in under 1ms) while connection pooling saves more than it costs.
  • Floopy + Cache + Firewall is 58% faster than direct — the recommended production config delivers P50 of 260ms with both security and cost savings.
  • LiteLLM is within noise of direct (-0.6%) — the Python-based proxy performs well as a passthrough, but offers no latency improvement.
  • Helicone adds 2.4% overhead — managed proxy introduces slight latency from the extra network hop.
  • Floopy uses only 41MB of memory — the Rust gateway is extremely lean, peaking at 44MB under benchmark load.

Results Summary

| Scenario | Avg (ms) | P50 (ms) | P99 (ms) | Min (ms) | RPS | vs Direct |
|---|---|---|---|---|---|---|
| OpenAI Direct | 664 | 633 | 983 | 480 | 1.5 | baseline |
| Floopy (no features) | 632 | 620 | 879 | 387 | 1.5 | -4.8% |
| Floopy + Exact Cache | 195 | 10 | 773 | 5 | 4.8 | -70.6% |
| Floopy + Firewall | 607 | 613 | 826 | 438 | 1.6 | -8.6% |
| Floopy + Cache + Firewall | 277 | 260 | 1,171 | 6 | 3.0 | -58.3% |
| LiteLLM Proxy | 660 | 665 | 895 | 449 | 1.5 | -0.6% |
| Helicone Proxy | 680 | 655 | 980 | 480 | 1.4 | +2.4% |
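
The `vs Direct` column is the relative change in average latency against the 664ms baseline. A quick sanity check of the arithmetic:

```typescript
// Relative change in average latency vs the 664ms OpenAI Direct baseline.
const baseline = 664;

function vsDirect(avgMs: number): string {
  const pct = ((avgMs - baseline) / baseline) * 100;
  return `${pct >= 0 ? "+" : ""}${pct.toFixed(1)}%`;
}

console.log(vsDirect(632)); // Floopy (no features) → "-4.8%"
console.log(vsDirect(195)); // Floopy + Exact Cache → "-70.6%"
console.log(vsDirect(277)); // Floopy + Cache + Firewall → "-58.3%"
```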

Gateway Comparison

| Metric | Floopy | LiteLLM | Helicone |
|---|---|---|---|
| Avg latency | 632ms | 660ms | 680ms |
| vs Direct | -4.8% | -0.6% | +2.4% |
| Written in | Rust (Axum/Tokio) | Python | Managed (cloud) |
| Memory usage | 41MB avg / 44MB peak | ~200-400MB typical | N/A (managed) |
| Caching | 3-tier (exact + semantic + advanced) | Basic Redis | No |
| LLM Firewall | On-device (ONNX, sub-1ms) | External integrations | No |

Floopy is the only gateway that is measurably faster than calling the provider directly.

Scenario Details

OpenAI Direct (Baseline)

Direct call to api.openai.com using the OpenAI Node.js SDK.

  • Avg: 664ms | P50: 633ms — typical latency for gpt-4.1-nano
  • This is the baseline all gateways are compared against
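
The Avg, P50, and P99 columns come from the per-round latency samples, with the single worst sample dropped per the methodology. A minimal sketch of that calculation, assuming nearest-rank percentiles (the benchmark's exact interpolation method is not documented):

```typescript
// Drop the single worst sample, then compute average and nearest-rank percentiles.
// Assumes at least two samples.
function summarize(samplesMs: number[]): { avg: number; p50: number; p99: number } {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const trimmed = sorted.slice(0, -1); // exclude the worst outlier
  const avg = trimmed.reduce((sum, x) => sum + x, 0) / trimmed.length;
  const rank = (p: number) =>
    trimmed[Math.min(trimmed.length - 1, Math.ceil((p / 100) * trimmed.length) - 1)];
  return { avg, p50: rank(50), p99: rank(99) };
}
```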

Floopy (No Features)

Gateway with all features disabled. Measures pure proxy overhead.

  • Avg: 632ms — 32ms faster than direct (4.8%)
  • The Rust gateway maintains persistent keep-alive HTTPS connections to OpenAI, eliminating per-request TLS negotiation that the SDK’s connection reuse can’t fully avoid
  • Min: 387ms — best-case latency is nearly 100ms lower than direct’s best case (480ms)
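
Gateway scenarios like this one are driven by pointing the same OpenAI Node.js SDK at the gateway instead of api.openai.com. A configuration sketch, assuming the gateway exposes an OpenAI-compatible endpoint (the local URL shown is illustrative; the env variable names match the benchmark's `.env`):

```typescript
import OpenAI from "openai";

// Direct baseline: the SDK talks straight to api.openai.com.
const direct = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Gateway scenario: same SDK, same calls — only baseURL changes.
const viaGateway = new OpenAI({
  apiKey: process.env.FLOOPY_API_KEY,
  baseURL: process.env.FLOOPY_URL, // e.g. "http://localhost:8080/v1" (illustrative)
});
```

Because the gateway is a drop-in baseURL swap, the measured difference is pure proxy overhead (or, here, a pure pooling gain).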

Floopy + Exact Cache

Exact cache enabled with prompts that naturally repeat across 50 rounds.

  • Avg: 195ms — 70.6% faster than direct
  • P50: 10ms — most requests hit cache, returning from Redis in single-digit milliseconds
  • RPS: 4.8 — 3.2x the throughput of direct calls
  • Min: 5ms — cache hits bypass the provider entirely
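
An exact cache keys on the full request and returns the stored completion only on a byte-identical repeat. A minimal in-memory sketch of the idea (Floopy's actual cache is Redis-backed; the class here is illustrative):

```typescript
import { createHash } from "node:crypto";

// Exact-match cache: identical (model, prompt) pairs return the stored response.
class ExactCache {
  private store = new Map<string, string>();

  private key(model: string, prompt: string): string {
    // NUL separator prevents ("ab","c") and ("a","bc") from colliding.
    return createHash("sha256").update(`${model}\u0000${prompt}`).digest("hex");
  }

  get(model: string, prompt: string): string | undefined {
    return this.store.get(this.key(model, prompt));
  }

  set(model: string, prompt: string, response: string): void {
    this.store.set(this.key(model, prompt), response);
  }
}

const cache = new ExactCache();
cache.set("gpt-4.1-nano", "What is 2+2?", "4");
console.log(cache.get("gpt-4.1-nano", "What is 2+2?"));  // hit
console.log(cache.get("gpt-4.1-nano", "What is 2+2? ")); // miss: one extra space changes the key
```

This byte-exactness is why the anti-cache timestamps in the methodology defeat provider caching but not this scenario, whose prompts are deliberately repeated verbatim.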

Floopy + Firewall (Prompt Guard)

LLM firewall enabled. Scans every prompt for injection attacks using a local ONNX model.

  • Avg: 607ms — 8.6% faster than direct
  • The firewall runs locally in under 1ms — it adds no measurable latency
  • P99: 826ms — more consistent tail latency than direct (983ms) because connection pooling absorbs provider variance
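
The firewall's decision sits in front of the provider call. A sketch of the gating logic, with `scanPrompt` standing in for the local Prompt Guard inference — the scoring function and threshold below are illustrative, not Floopy's API:

```typescript
// Hypothetical stand-in for local ONNX inference: returns an injection score in [0, 1].
type PromptScanner = (prompt: string) => number;

function guard(
  scan: PromptScanner,
  forward: (prompt: string) => string,
  threshold = 0.8, // illustrative cutoff
) {
  return (prompt: string): string => {
    // The scan runs locally before the provider call — sub-1ms in Floopy's case,
    // so it never shows up in end-to-end latency.
    if (scan(prompt) >= threshold) {
      throw new Error("prompt blocked by firewall");
    }
    return forward(prompt);
  };
}

// Toy scanner: flags one well-known injection phrase.
const toyScan: PromptScanner = (p) =>
  p.includes("ignore previous instructions") ? 0.99 : 0.01;
const guarded = guard(toyScan, (p) => `response to: ${p}`);
console.log(guarded("What is the capital of France?")); // forwarded
```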

Floopy + Cache + Firewall (Production Config)

Recommended production configuration with caching and security enabled.

  • Avg: 277ms — 58.3% faster than direct
  • P50: 260ms — consistent sub-300ms with both security and cost savings
  • Min: 6ms — cache hits are near-instant even with firewall active
  • RPS: 3.0 — 2x throughput of direct calls

LiteLLM Proxy

Open-source Python proxy in passthrough mode (Docker, no caching or security features).

  • Avg: 660ms — 0.6% faster than direct (within noise)
  • Python runtime adds overhead that roughly cancels out any connection pooling benefit
  • A fair passthrough proxy, but no latency advantage

Helicone Proxy

Managed observability proxy (cloud-hosted, no caching or security features).

  • Avg: 680ms — 2.4% slower than direct
  • The extra network hop to Helicone’s cloud proxy adds ~16ms
  • Provides logging and analytics but at a latency cost

Performance at a Glance

Floopy + Exact Cache █████████ 195ms (-71%) ⚡ fastest
Floopy + Cache + FW ██████████████ 277ms (-58%) 🛡️ recommended
Floopy + Firewall ████████████████████████████████ 607ms (-9%)
Floopy (no features) █████████████████████████████████ 632ms (-5%)
LiteLLM Proxy █████████████████████████████████ 660ms (-1%)
OpenAI Direct ██████████████████████████████████ 664ms baseline
Helicone Proxy ███████████████████████████████████ 680ms (+2%)

Memory Usage

The Floopy gateway uses minimal memory throughout the benchmark:

| Metric | Value |
|---|---|
| Average RSS | 41 MB |
| Peak RSS | 44 MB |
| Samples | 382 (every 500ms) |

For comparison, Python-based gateways typically use 200-400MB at idle. The Rust runtime has no garbage collector, no interpreter overhead, and no VM warmup.

Why Floopy Is Fast

Most AI gateways are written in Python or Node.js — languages that add 5-50ms of overhead per request. Floopy is written in Rust with Axum and Tokio, which gives it structural advantages:

Persistent connection pooling. The gateway maintains warm HTTPS connections to each provider. This eliminates per-request TLS handshakes — saving 20-50ms that even SDK connection reuse can’t fully avoid. Confirmed by the 4.8% improvement with all features disabled.
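
The pooling idea in miniature: pay the connection setup cost once, then reuse the warm connection. A language-agnostic sketch (Floopy's actual pool lives in the Rust HTTP stack; the class below only illustrates the cold-path/warm-path split):

```typescript
// Minimal connection pool: setup cost is paid only when no idle connection exists.
class Pool<T> {
  private idle: T[] = [];
  public connectsMade = 0;

  constructor(private connect: () => T) {}

  acquire(): T {
    const conn = this.idle.pop();
    if (conn !== undefined) return conn; // warm path: no handshake
    this.connectsMade++; // cold path: full setup (TLS handshake in the real gateway)
    return this.connect();
  }

  release(conn: T): void {
    this.idle.push(conn);
  }
}

const pool = new Pool(() => ({ id: Math.random() }));
for (let i = 0; i < 50; i++) {
  const conn = pool.acquire();
  // ... send one benchmark request over conn ...
  pool.release(conn);
}
console.log(pool.connectsMade); // 1 — 50 sequential requests reuse one warm connection
```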

Zero-allocation forwarding. Rust’s ownership model allows request parsing and forwarding without intermediate buffer allocations. No garbage collection pauses.

Async I/O with Tokio. All operations run on a work-stealing thread pool. Cache lookups, firewall inference, and provider calls execute concurrently. Nothing blocks.

On-device firewall. Prompt Guard runs locally via ONNX Runtime in a dedicated thread — no external API call, no latency penalty. This is why the firewall scenario (607ms) is actually faster than direct (664ms).

Background logging. Request logs are queued via async channels and batch-inserted to ClickHouse. Logging never touches the response path.
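
The pattern: the response path only enqueues, and a background task drains the queue in batches. A sketch of the idea — the real gateway uses Tokio channels and ClickHouse; `flush` here is a stand-in for the batch insert:

```typescript
// Fire-and-forget log queue: callers enqueue; batches are flushed off the response path.
class BatchLogger {
  private queue: string[] = [];

  constructor(
    private flush: (batch: string[]) => void, // stand-in for a ClickHouse batch insert
    private batchSize = 100,
  ) {}

  log(entry: string): void {
    // O(1) push — the response path never waits on the logging sink.
    this.queue.push(entry);
    if (this.queue.length >= this.batchSize) this.drain();
  }

  drain(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.flush(batch);
  }
}

const batches: string[][] = [];
const logger = new BatchLogger((b) => batches.push(b), 3);
["a", "b", "c", "d"].forEach((e) => logger.log(e));
logger.drain();
console.log(batches.length); // 2 — ["a","b","c"] flushed at size 3, then ["d"] on drain
```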

41MB memory footprint. No interpreter, no VM, no runtime overhead. The gateway binary includes the ONNX model and still uses less memory than a typical Python import chain.

Methodology

  • Client: OpenAI Node.js SDK
  • Model: gpt-4.1-nano
  • Rounds per scenario: 50
  • Concurrency: 1 (sequential — measures pure latency)
  • Outlier handling: Worst result per scenario excluded
  • Anti-cache: Timestamp + index injected in every prompt ([ref:{scenario}-{index}-{timestamp}]) — zero provider cache hits possible
  • Gateway: Local Floopy instance (Rust/Axum), 41MB RSS
  • LiteLLM: Docker container (ghcr.io/berriai/litellm:main-latest), passthrough to OpenAI
  • Helicone: Managed proxy (ai-gateway.helicone.ai), passthrough to OpenAI
  • Temperature: 0.0
  • Max tokens: 256
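
The anti-cache marker above can be sketched as a small helper. The `[ref:…]` format comes from the methodology; how it is spliced into the rest of the prompt is not documented, so the wrapper shown is illustrative:

```typescript
// Inject a unique [ref:{scenario}-{index}-{timestamp}] tag so no two prompts
// are byte-identical, making provider-side cache hits impossible.
function antiCacheTag(scenario: string, index: number, timestamp = Date.now()): string {
  return `[ref:${scenario}-${index}-${timestamp}]`;
}

function buildPrompt(base: string, scenario: string, index: number): string {
  return `${base} ${antiCacheTag(scenario, index)}`;
}

console.log(buildPrompt("Explain TCP slow start.", "direct", 7));
```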

How to Run

```shell
cd example

# Start LiteLLM competitor (requires Docker)
npm run competitors:up

# Run the full benchmark (sequential)
npm run benchmark:full

# Run with concurrency
npm run benchmark:full:concurrent  # 10 parallel requests
npm run benchmark:full:stress      # 50 parallel, 200 rounds

# Stop competitors
npm run competitors:down
```

Set SKIP_COMPETITORS=true to benchmark only Floopy vs OpenAI direct.

All configuration is read from .env (OPENAI_API_KEY, FLOOPY_API_KEY, FLOOPY_URL, LITELLM_URL, HELICONE_URL, HELICONE_API_KEY, MODEL, ROUNDS, CONCURRENCY).
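
A minimal `.env` for a local run might look like this — the variable names come from the benchmark config, but every value below is a placeholder:

```shell
OPENAI_API_KEY=sk-...
FLOOPY_API_KEY=fl-...
FLOOPY_URL=http://localhost:8080
LITELLM_URL=http://localhost:4000
HELICONE_URL=https://ai-gateway.helicone.ai
HELICONE_API_KEY=hl-...
MODEL=gpt-4.1-nano
ROUNDS=50
CONCURRENCY=1
```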