
Benchmarks

These benchmarks compare the Floopy AI Gateway against direct OpenAI calls and two popular competitors — LiteLLM and Helicone — using the OpenAI Node.js SDK. Tests ran with gpt-4.1-nano, 50 rounds per scenario, worst outlier excluded, with anti-cache timestamps injected in every prompt to prevent provider-side caching.

Key Takeaways

  • Floopy is 4.8% faster than calling OpenAI directly — even with no features enabled, Rust connection pooling beats the SDK’s own HTTP optimization.
  • Floopy with firewall is 8.6% faster than direct — Prompt Guard adds no measurable latency (it runs locally via ONNX in under 1ms) while connection pooling saves more than it costs.
  • Floopy + Cache + Firewall is 58% faster than direct — the recommended production config delivers P50 of 260ms with both security and cost savings.
  • LiteLLM is within noise of direct (-0.6%) — the Python-based proxy performs well as a passthrough, but offers no latency improvement.
  • Helicone adds 2.4% overhead — managed proxy introduces slight latency from the extra network hop.
  • Floopy uses only 41MB of memory — the Rust gateway is extremely lean, peaking at 44MB under benchmark load.

Results Summary

| Scenario | Avg (ms) | P50 (ms) | P99 (ms) | Min (ms) | RPS | vs Direct |
|---|---|---|---|---|---|---|
| OpenAI Direct | 664 | 633 | 983 | 480 | 1.5 | baseline |
| Floopy (no features) | 632 | 620 | 879 | 387 | 1.5 | -4.8% |
| Floopy + Exact Cache | 195 | 10 | 773 | 5 | 4.8 | -70.6% |
| Floopy + Firewall | 607 | 613 | 826 | 438 | 1.6 | -8.6% |
| Floopy + Cache + Firewall | 277 | 260 | 1,171 | 6 | 3.0 | -58.3% |
| LiteLLM Proxy | 660 | 665 | 895 | 449 | 1.5 | -0.6% |
| Helicone Proxy | 680 | 655 | 980 | 480 | 1.4 | +2.4% |
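
The `vs Direct` column is the relative change in average latency against the 664ms baseline. A quick sanity check of the arithmetic:

```typescript
// Relative change in average latency vs the 664ms OpenAI Direct baseline.
const baseline = 664;

function vsDirect(avgMs: number): string {
  const pct = ((avgMs - baseline) / baseline) * 100;
  return `${pct >= 0 ? "+" : ""}${pct.toFixed(1)}%`;
}

console.log(vsDirect(632)); // Floopy (no features) → "-4.8%"
console.log(vsDirect(195)); // Floopy + Exact Cache → "-70.6%"
console.log(vsDirect(277)); // Floopy + Cache + Firewall → "-58.3%"
```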

Gateway Comparison

| Metric | Floopy | LiteLLM | Helicone |
|---|---|---|---|
| Avg latency | 632ms | 660ms | 680ms |
| vs Direct | -4.8% | -0.6% | +2.4% |
| Written in | Rust (Axum/Tokio) | Python | Managed (cloud) |
| Memory usage | 41MB avg / 44MB peak | ~200-400MB typical | N/A (managed) |
| Caching | 3-tier (exact + semantic + advanced) | Basic Redis | No |
| LLM Firewall | On-device (ONNX, sub-1ms) | External integrations | No |

Floopy is the only gateway that is measurably faster than calling the provider directly.

Scenario Details

OpenAI Direct (Baseline)

Direct call to api.openai.com using the OpenAI Node.js SDK.

  • Avg: 664ms | P50: 633ms — typical latency for gpt-4.1-nano
  • This is the baseline all gateways are compared against
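
The Avg, P50, and P99 columns come from the per-round latency samples, with the single worst sample dropped per the methodology. A minimal sketch of that calculation, assuming nearest-rank percentiles (the benchmark's exact interpolation method is not documented):

```typescript
// Drop the single worst sample, then compute average and nearest-rank percentiles.
// Assumes at least two samples.
function summarize(samplesMs: number[]): { avg: number; p50: number; p99: number } {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const trimmed = sorted.slice(0, -1); // exclude the worst outlier
  const avg = trimmed.reduce((sum, x) => sum + x, 0) / trimmed.length;
  const rank = (p: number) =>
    trimmed[Math.min(trimmed.length - 1, Math.ceil((p / 100) * trimmed.length) - 1)];
  return { avg, p50: rank(50), p99: rank(99) };
}
```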

Floopy (No Features)

Gateway with all features disabled. Measures pure proxy overhead.

  • Avg: 632ms — 32ms faster than direct (4.8%)
  • The Rust gateway maintains persistent keep-alive HTTPS connections to OpenAI, eliminating per-request TLS negotiation that the SDK’s connection reuse can’t fully avoid
  • Min: 387ms — best-case latency is nearly 100ms lower than direct’s best case (480ms)
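
Gateway scenarios like this one are driven by pointing the same OpenAI Node.js SDK at the gateway instead of api.openai.com. A configuration sketch, assuming the gateway exposes an OpenAI-compatible endpoint (the local URL shown is illustrative; the env variable names match the benchmark's `.env`):

```typescript
import OpenAI from "openai";

// Direct baseline: the SDK talks straight to api.openai.com.
const direct = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Gateway scenario: same SDK, same calls — only baseURL changes.
const viaGateway = new OpenAI({
  apiKey: process.env.FLOOPY_API_KEY,
  baseURL: process.env.FLOOPY_URL, // e.g. "http://localhost:8080/v1" (illustrative)
});
```

Because the gateway is a drop-in baseURL swap, the measured difference is pure proxy overhead (or, here, a pure pooling gain).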

Floopy + Exact Cache

Exact cache enabled with prompts that naturally repeat across 50 rounds.

  • Avg: 195ms — 70.6% faster than direct
  • P50: 10ms — most requests hit cache, returning from Redis in single-digit milliseconds
  • RPS: 4.8 — 3.2x the throughput of direct calls
  • Min: 5ms — cache hits bypass the provider entirely
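
An exact cache keys on the full request and returns the stored completion only on a byte-identical repeat. A minimal in-memory sketch of the idea (Floopy's actual cache is Redis-backed; the class here is illustrative):

```typescript
import { createHash } from "node:crypto";

// Exact-match cache: identical (model, prompt) pairs return the stored response.
class ExactCache {
  private store = new Map<string, string>();

  private key(model: string, prompt: string): string {
    // NUL separator prevents ("ab","c") and ("a","bc") from colliding.
    return createHash("sha256").update(`${model}\u0000${prompt}`).digest("hex");
  }

  get(model: string, prompt: string): string | undefined {
    return this.store.get(this.key(model, prompt));
  }

  set(model: string, prompt: string, response: string): void {
    this.store.set(this.key(model, prompt), response);
  }
}

const cache = new ExactCache();
cache.set("gpt-4.1-nano", "What is 2+2?", "4");
console.log(cache.get("gpt-4.1-nano", "What is 2+2?"));  // hit
console.log(cache.get("gpt-4.1-nano", "What is 2+2? ")); // miss: one extra space changes the key
```

This byte-exactness is why the anti-cache timestamps in the methodology defeat provider caching but not this scenario, whose prompts are deliberately repeated verbatim.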

Floopy + Firewall (Prompt Guard)

LLM firewall enabled. Scans every prompt for injection attacks using a local ONNX model.

  • Avg: 607ms — 8.6% faster than direct
  • The firewall runs locally in under 1ms — it adds no measurable latency
  • P99: 826ms — more consistent tail latency than direct (983ms) because connection pooling absorbs provider variance
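
The firewall's decision sits in front of the provider call. A sketch of the gating logic, with `scanPrompt` standing in for the local Prompt Guard inference — the scoring function and threshold below are illustrative, not Floopy's API:

```typescript
// Hypothetical stand-in for local ONNX inference: returns an injection score in [0, 1].
type PromptScanner = (prompt: string) => number;

function guard(
  scan: PromptScanner,
  forward: (prompt: string) => string,
  threshold = 0.8, // illustrative cutoff
) {
  return (prompt: string): string => {
    // The scan runs locally before the provider call — sub-1ms in Floopy's case,
    // so it never shows up in end-to-end latency.
    if (scan(prompt) >= threshold) {
      throw new Error("prompt blocked by firewall");
    }
    return forward(prompt);
  };
}

// Toy scanner: flags one well-known injection phrase.
const toyScan: PromptScanner = (p) =>
  p.includes("ignore previous instructions") ? 0.99 : 0.01;
const guarded = guard(toyScan, (p) => `response to: ${p}`);
console.log(guarded("What is the capital of France?")); // forwarded
```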

Floopy + Cache + Firewall (Production Config)

Recommended production configuration with caching and security enabled.

  • Avg: 277ms — 58.3% faster than direct
  • P50: 260ms — consistent sub-300ms with both security and cost savings
  • Min: 6ms — cache hits are near-instant even with firewall active
  • RPS: 3.0 — 2x throughput of direct calls

LiteLLM Proxy

Open-source Python proxy in passthrough mode (Docker, no caching or security features).

  • Avg: 660ms — 0.6% faster than direct (within noise)
  • Python runtime adds overhead that roughly cancels out any connection pooling benefit
  • A fair passthrough proxy, but no latency advantage

Helicone Proxy

Managed observability proxy (cloud-hosted, no caching or security features).

  • Avg: 680ms — 2.4% slower than direct
  • The extra network hop to Helicone’s cloud proxy adds ~16ms
  • Provides logging and analytics but at a latency cost

Performance at a Glance

Floopy + Exact Cache █████████ 195ms (-71%) ⚡ fastest
Floopy + Cache + FW ██████████████ 277ms (-58%) 🛡️ recommended
Floopy + Firewall ████████████████████████████████ 607ms (-9%)
Floopy (no features) █████████████████████████████████ 632ms (-5%)
LiteLLM Proxy █████████████████████████████████ 660ms (-1%)
OpenAI Direct ██████████████████████████████████ 664ms baseline
Helicone Proxy ███████████████████████████████████ 680ms (+2%)

Memory Usage

The Floopy gateway uses minimal memory throughout the benchmark:

| Metric | Value |
|---|---|
| Average RSS | 41 MB |
| Peak RSS | 44 MB |
| Samples | 382 (every 500ms) |

For comparison, Python-based gateways typically use 200-400MB at idle. The Rust runtime has no garbage collector, no interpreter overhead, and no VM warmup.

Why Floopy Is Fast

Most AI gateways are written in Python or Node.js — languages that add 5-50ms of overhead per request. Floopy is written in Rust with Axum and Tokio, which gives it structural advantages:

Persistent connection pooling. The gateway maintains warm HTTPS connections to each provider. This eliminates per-request TLS handshakes — saving 20-50ms that even SDK connection reuse can’t fully avoid. Confirmed by the 4.8% improvement with all features disabled.
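
The pooling idea in miniature: pay the connection setup cost once, then reuse the warm connection. A language-agnostic sketch (Floopy's actual pool lives in the Rust HTTP stack; the class below only illustrates the cold-path/warm-path split):

```typescript
// Minimal connection pool: setup cost is paid only when no idle connection exists.
class Pool<T> {
  private idle: T[] = [];
  public connectsMade = 0;

  constructor(private connect: () => T) {}

  acquire(): T {
    const conn = this.idle.pop();
    if (conn !== undefined) return conn; // warm path: no handshake
    this.connectsMade++; // cold path: full setup (TLS handshake in the real gateway)
    return this.connect();
  }

  release(conn: T): void {
    this.idle.push(conn);
  }
}

const pool = new Pool(() => ({ id: Math.random() }));
for (let i = 0; i < 50; i++) {
  const conn = pool.acquire();
  // ... send one benchmark request over conn ...
  pool.release(conn);
}
console.log(pool.connectsMade); // 1 — 50 sequential requests reuse one warm connection
```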

Zero-allocation forwarding. Rust’s ownership model allows request parsing and forwarding without intermediate buffer allocations. No garbage collection pauses.

Async I/O with Tokio. All operations run on a work-stealing thread pool. Cache lookups, firewall inference, and provider calls execute concurrently. Nothing blocks.

On-device firewall. Prompt Guard runs locally via ONNX Runtime in a dedicated thread — no external API call, no latency penalty. This is why the firewall scenario (607ms) is actually faster than direct (664ms).

Background logging. Request logs are queued via async channels and batch-inserted to ClickHouse. Logging never touches the response path.
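
The pattern: the response path only enqueues, and a background task drains the queue in batches. A sketch of the idea — the real gateway uses Tokio channels and ClickHouse; `flush` here is a stand-in for the batch insert:

```typescript
// Fire-and-forget log queue: callers enqueue; batches are flushed off the response path.
class BatchLogger {
  private queue: string[] = [];

  constructor(
    private flush: (batch: string[]) => void, // stand-in for a ClickHouse batch insert
    private batchSize = 100,
  ) {}

  log(entry: string): void {
    // O(1) push — the response path never waits on the logging sink.
    this.queue.push(entry);
    if (this.queue.length >= this.batchSize) this.drain();
  }

  drain(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.flush(batch);
  }
}

const batches: string[][] = [];
const logger = new BatchLogger((b) => batches.push(b), 3);
["a", "b", "c", "d"].forEach((e) => logger.log(e));
logger.drain();
console.log(batches.length); // 2 — ["a","b","c"] flushed at size 3, then ["d"] on drain
```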

41MB memory footprint. No interpreter, no VM, no runtime overhead. The gateway binary includes the ONNX model and still uses less memory than a typical Python import chain.

Methodology

  • Client: OpenAI Node.js SDK
  • Model: gpt-4.1-nano
  • Rounds per scenario: 50
  • Concurrency: 1 (sequential — measures pure latency)
  • Outlier handling: Worst result per scenario excluded
  • Anti-cache: Timestamp + index injected in every prompt ([ref:{scenario}-{index}-{timestamp}]) — zero provider cache hits possible
  • Gateway: Local Floopy instance (Rust/Axum), 41MB RSS
  • LiteLLM: Docker container (ghcr.io/berriai/litellm:main-latest), passthrough to OpenAI
  • Helicone: Managed proxy (ai-gateway.helicone.ai), passthrough to OpenAI
  • Temperature: 0.0
  • Max tokens: 256
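
The anti-cache marker above can be sketched as a small helper. The `[ref:…]` format comes from the methodology; how it is spliced into the rest of the prompt is not documented, so the wrapper shown is illustrative:

```typescript
// Inject a unique [ref:{scenario}-{index}-{timestamp}] tag so no two prompts
// are byte-identical, making provider-side cache hits impossible.
function antiCacheTag(scenario: string, index: number, timestamp = Date.now()): string {
  return `[ref:${scenario}-${index}-${timestamp}]`;
}

function buildPrompt(base: string, scenario: string, index: number): string {
  return `${base} ${antiCacheTag(scenario, index)}`;
}

console.log(buildPrompt("Explain TCP slow start.", "direct", 7));
```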

How to Run

```shell
cd example

# Start LiteLLM competitor (requires Docker)
npm run competitors:up

# Run the full benchmark (sequential)
npm run benchmark:full

# Run with concurrency
npm run benchmark:full:concurrent  # 10 parallel requests
npm run benchmark:full:stress      # 50 parallel, 200 rounds

# Stop competitors
npm run competitors:down
```

Set SKIP_COMPETITORS=true to benchmark only Floopy vs OpenAI direct.

All configuration is read from .env (OPENAI_API_KEY, FLOOPY_API_KEY, FLOOPY_URL, LITELLM_URL, HELICONE_URL, HELICONE_API_KEY, MODEL, ROUNDS, CONCURRENCY).
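
A minimal `.env` for a local run might look like this — the variable names come from the benchmark config, but every value below is a placeholder:

```shell
OPENAI_API_KEY=sk-...
FLOOPY_API_KEY=fl-...
FLOOPY_URL=http://localhost:8080
LITELLM_URL=http://localhost:4000
HELICONE_URL=https://ai-gateway.helicone.ai
HELICONE_API_KEY=hl-...
MODEL=gpt-4.1-nano
ROUNDS=50
CONCURRENCY=1
```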