Why Floopy Stays Fast While Optimizing Your Agents
Gateway speed is table-stakes now. The real question is whether your routing layer can make agents measurably better over time. Here's how Floopy does both.
A year ago, “fastest AI gateway” would have been a defensible positioning. Today it isn’t. Gateway latency has collapsed across the market — Rust-based proxies, keepalive connection pools, and colocated cache layers mean most serious gateways add under 10ms of overhead. Speed is table-stakes.
What isn’t table-stakes is whether your routing layer can make your agents measurably better over time using feedback signal you already collect. That’s the question this post is actually about. The benchmarks come along for the ride.
Speed, briefly — the numbers still matter
We still run the benchmark. Here's the short version against gpt-4.1-nano, 50 rounds per scenario, unique anti-cache prompts, outliers excluded:
| Scenario | Avg (ms) | P50 (ms) | vs Direct avg |
|---|---|---|---|
| OpenAI Direct | 664 | 633 | baseline |
| Floopy (no features) | 632 | 620 | -4.8% |
| Floopy + Exact Cache | 195 | 10 | -70.6% |
| Floopy + Firewall | 607 | 613 | -8.6% |
| Floopy + Cache + Firewall | 277 | 260 | -58.3% |
| LiteLLM Proxy | 660 | 665 | -0.6% |
| Helicone Proxy | 680 | 655 | +2.4% |
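"Unique anti-cache prompts" means no two rounds can accidentally hit the exact or semantic cache. The real harness lives in the open-source repo; a per-round nonce like this minimal TypeScript sketch is one assumed way to get that isolation:

```ts
// Illustrative only: a fresh nonce per round guarantees the exact cache
// never matches and the semantic cache stays cold across rounds.
function antiCachePrompt(round: number): string {
  return `[run ${round}:${crypto.randomUUID()}] Summarize the release notes in one sentence.`;
}
```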
The gateway runs in Rust on Axum/Tokio with a shared warm HTTPS pool to every provider, zero-copy forwarding, and an ONNX firewall that scans prompts in under 1ms on a dedicated thread. Memory stays at ~41MB RSS across the whole run.
So: we are fast. We are not going to build a company on that anymore.
The real product is optimization, not proxying
Floopy is an AI Agent Optimization Platform. The core job is picking the right model for each request inside your agents using a dynamic weighted score from four feedback sources. The gateway is the delivery mechanism — it’s how we’re in the path to make those decisions without you rewriting your app. But the value is the routing getting smarter, not the proxy being fast.
Three things make Floopy’s optimization defensible. None of them are about latency.
1. Session-level feedback propagation
Per-request ratings break on agents. A single user turn can fan out into tool calls, retries, chained reasoning, and multiple model invocations. Asking the user to rate each hop is absurd. Asking them to rate the final outcome is normal — it’s what NPS collection already does.
Floopy ingests one rating per session and propagates it to every routing decision made inside that session. You POST to /v1/feedback with the floopy-session-id header you were already sending on requests, and the score becomes the ground truth for every model choice in that trace. Multi-turn, tool-calling, chain-of-thought — all covered by the signal you already collect.
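Here's what that looks like from the client side. A hedged sketch: the `/v1/feedback` path and `floopy-session-id` header come from the description above, while the base URL, the auth scheme, the `model: "auto"` placeholder, and the feedback payload shape are assumptions for illustration.

```ts
import OpenAI from "openai";

const FLOOPY_BASE = "https://api.floopy.example/v1"; // assumed base URL
const sessionId = crypto.randomUUID();

// Point the OpenAI SDK at Floopy; every request carries the session id,
// however many model calls the agent fans out into.
const client = new OpenAI({
  baseURL: FLOOPY_BASE,
  apiKey: process.env.FLOOPY_API_KEY,
  defaultHeaders: { "floopy-session-id": sessionId },
});

const reply = await client.chat.completions.create({
  model: "auto", // hypothetical placeholder: let Floopy's router pick
  messages: [{ role: "user", content: "Summarize this support ticket..." }],
});

// Later, when the user rates the outcome, one POST covers the whole trace.
await fetch(`${FLOOPY_BASE}/feedback`, {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "floopy-session-id": sessionId, // ties the rating to every routing decision in the session
    authorization: `Bearer ${process.env.FLOOPY_API_KEY}`, // assumed auth scheme
  },
  body: JSON.stringify({ score: 9 }), // hypothetical payload shape
});
```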
2. Multi-source weighted routing
Single-source feedback is brittle. Session NPS is sparse early on. Auto-scoring (LLM-as-judge) has blind spots. Manual admin ratings don’t scale. Public benchmarks don’t reflect your workload.
Floopy combines all four with weights that shift as signal accumulates:
| Phase | Session | Auto | Manual | Benchmark |
|---|---|---|---|---|
| Day 0 | — | — | — | 100% |
| After 10 requests | — | 50% | 20% | 30% |
| After 10 sessions | 50% | 30% | 10% | 10% |
Day 0 means routing starts from public benchmark priors — you get optimization from your first request. As real usage arrives, auto and manual feedback take over. Once enough sessions land real ratings, session NPS dominates. You never hit the cold-start wall.
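As a sketch of how such a blend could be computed: the weight values below come straight from the table, but the field names, the thresholds-as-code, and the linear blend are illustrative assumptions, not Floopy's actual scorer.

```ts
// Per-model feedback signal on a common scale; missing sources carry no weight.
type Signal = { session?: number; auto?: number; manual?: number; benchmark: number };

// Phase selection mirrors the table above.
function phaseWeights(requests: number, ratedSessions: number) {
  if (ratedSessions >= 10) return { session: 0.5, auto: 0.3, manual: 0.1, benchmark: 0.1 };
  if (requests >= 10)      return { session: 0.0, auto: 0.5, manual: 0.2, benchmark: 0.3 };
  return { session: 0.0, auto: 0.0, manual: 0.0, benchmark: 1.0 }; // Day 0: benchmark priors only
}

// One routing score per candidate model; the router picks the max.
function routingScore(s: Signal, requests: number, ratedSessions: number): number {
  const w = phaseWeights(requests, ratedSessions);
  return (
    w.session   * (s.session ?? 0) +
    w.auto      * (s.auto ?? 0) +
    w.manual    * (s.manual ?? 0) +
    w.benchmark * s.benchmark
  );
}
```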
3. Managed shared routing intelligence
Floopy is multi-tenant by design. Free and Pro organizations contribute to a shared routing pool — aggregated, privacy-preserving signal about which models perform well on which task shapes. Enterprise can opt out.
This is why we don’t ship a self-hosted version of the optimization brain. Self-hosted means siloed — every customer learns from their own traffic only. Our bet is that the cross-tenant pool converges faster than any single tenant can on their own. TensorZero’s open-source framework is excellent and we respect the work; it’s just a different architectural bet. If your constraint is full data locality, self-host is the right answer. If your constraint is “make my agents better with signal I already collect,” managed pooling wins.
How speed actually serves the optimization
Here’s why we still care about latency: routing decisions have to be cheap. If weighing four feedback sources, fetching the session context, and picking a model adds 200ms, no one turns it on in production. The gateway speed budget is what buys us the room to do real routing work without users noticing.
- Shared warm connection pool means we spend almost nothing on the forward hop.
- Redis-backed exact cache and Qdrant semantic cache sit in front of the model call, not behind it, so cache-hit routing is essentially free (see the sketch after this list).
- The firewall runs in-process via ONNX — no extra network round-trip for prompt-injection scanning.
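Put together, the request path looks roughly like this. Every function here is a hypothetical stand-in, not Floopy's API; only the ordering is the point.

```ts
// Stand-ins for the real components described in the bullets above.
const exactCache = new Map<string, string>();                                      // stands in for Redis
async function semanticLookup(prompt: string): Promise<string | null> { return null; } // stands in for Qdrant
function firewallScan(prompt: string): void { /* in-process ONNX scan, <1ms */ }
function pickModel(prompt: string): string { return "gpt-4.1-nano"; }              // feedback-weighted routing
async function forward(model: string, prompt: string): Promise<string> { return `${model}: ...`; }

// Both caches sit in front of the model call, so a hit skips everything else.
async function handle(prompt: string): Promise<string> {
  const exact = exactCache.get(prompt);
  if (exact) return exact;                  // exact hit: no routing, no forward hop

  const similar = await semanticLookup(prompt);
  if (similar) return similar;              // semantic hit: still no model call

  firewallScan(prompt);                     // no extra network round-trip
  const model = pickModel(prompt);          // the decision the speed budget pays for
  const response = await forward(model, prompt); // warm pooled connection
  exactCache.set(prompt, response);         // populate the cache for next time
  return response;
}
```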
When someone runs our benchmark and sees “faster than direct OpenAI,” the interesting thing isn’t that number. It’s that the number is negative while the gateway is also making a feedback-weighted routing decision, scanning for prompt injection, and logging to ClickHouse. That’s the budget optimization runs inside.
What this means for your stack
If you’re evaluating Floopy against:
- LiteLLM / Portkey / Helicone — these are gateways. They solve proxying, observability, and basic policy. If you only need the gateway primitives, they’re fine picks. Floopy has those primitives (we benchmarked them up top), but the product is the optimization layer on top. The question isn’t “which gateway is faster,” it’s “do you need routing that gets better over time, or a transparent proxy.”
- TensorZero — open-source, self-hosted feedback-driven routing framework. Strong technical work. The tradeoff is siloed learning and self-hosted ops. Floopy is the managed alternative with a shared routing pool and zero infra burden. Choose by architectural preference, not by feature parity.
- No gateway — you’re calling providers directly. That’s fast, but you’re leaving optimization on the table. Floopy’s latency cost is neutral to negative (see the benchmark up top) while your agent quality starts improving from request one.
Reproduce the benchmark
Still open source, still reproducible:
```bash
cd example
npm run benchmark:full
```

The full methodology, prompt isolation details, and scenario explanations live in the Benchmarks documentation.
But the benchmark is no longer the pitch. The pitch is: point your OpenAI SDK at Floopy, start POSTing the NPS scores you already collect to /v1/feedback, and watch your routing improve session by session. The speed is there because optimization needs the budget.
Sign up free — 5,000 requests/month and 500 feedback ingests/month included on Starter. Plug in the SDK, POST feedback from the signal you already have.