How It Works
Architecture Overview
Floopy is an AI Agent Optimization Platform. Its routing surface — the gateway component — is a high-performance Rust proxy built with Axum and Tokio. It sits between your application and LLM providers, adding feedback-driven routing, caching, rate limiting, security, observability, and automatic failover — without requiring any changes to your application code.
You point your OpenAI SDK’s baseURL at https://api.floopy.ai/v1, and the gateway translates requests to the correct provider format, routes them according to your strategy, and logs everything asynchronously.
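In practice, switching the base URL is the only client-side change. As a rough sketch, the raw request your SDK ends up sending looks like the following; the endpoint path and model name here are illustrative assumptions, only the base URL comes from the docs above.

```typescript
// Sketch of the OpenAI-compatible request shape sent to the Floopy gateway.
// The /chat/completions path and model name are illustrative assumptions.
const FLOOPY_BASE_URL = "https://api.floopy.ai/v1";

function buildChatRequest(apiKey: string, model: string, userMessage: string) {
  return {
    url: `${FLOOPY_BASE_URL}/chat/completions`,
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: userMessage }],
    }),
  };
}

const req = buildChatRequest("sk-demo", "gpt-4o-mini", "Hello");
console.log(req.url); // https://api.floopy.ai/v1/chat/completions
```

With an OpenAI SDK you would instead pass the base URL in the client constructor and keep the rest of your code unchanged.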
```mermaid
graph LR
    App[Your Application] -->|OpenAI SDK| Gateway[Floopy Gateway]
    Gateway -->|Translated| OpenAI[OpenAI]
    Gateway -->|Translated| Anthropic[Anthropic]
    Gateway -->|Translated| Gemini[Google Gemini]
    Gateway -->|Translated| Groq[Groq]
    Gateway -->|Translated| Mistral[Mistral]
    Gateway -->|Translated| DeepSeek[DeepSeek]
    Gateway -.->|Async Logs| ClickHouse[(ClickHouse)]
    Gateway -.->|Cache| Redis[(Redis)]
    Gateway -.->|Vectors| Qdrant[(Qdrant)]
```

The gateway is stateless — all shared state lives in Redis, ClickHouse, and Qdrant. This means you can scale horizontally by running multiple gateway instances behind a load balancer.
Request Pipeline
Every request that enters the Floopy gateway passes through a series of stages. Each stage can short-circuit the pipeline and return a response early (cache hit, rate limit exceeded, firewall block), or pass the request to the next stage.
```mermaid
graph TD
    A[Request Arrives] --> B[API Key Validation]
    B -->|Invalid| B1[401 Unauthorized]
    B -->|Valid| C[Rate Limit Check]
    C -->|Exceeded| C1[429 Too Many Requests]
    C -->|OK| D[Prompt Resolution]
    D --> E{Cache Enabled?}
    E -->|Yes| F[Exact Cache Check]
    F -->|Hit| F1[Return Cached Response]
    F -->|Miss| G[Semantic Cache Check]
    G -->|Hit| G1[Return Cached Response]
    G -->|Miss| H{Advanced Cache?}
    H -->|Yes| I[Advanced Cache Check]
    I -->|Hit| I1[Return Cached Response]
    I -->|Miss| J[LLM Firewall]
    H -->|No| J
    E -->|No| J
    J -->|Threat Detected| J1[400 Blocked]
    J -->|Safe| K[Routing Strategy]
    K --> L[Provider Dispatch]
    L -->|Success| M[Return Response]
    L -->|Failure| N{Fallback Available?}
    N -->|Yes| K
    N -->|No| N1[502 Bad Gateway]
    M --> O[Async: Log to ClickHouse]
    M --> P[Async: Store in Cache]
```

API Key Validation
The gateway extracts the API key from the Authorization: Bearer header and validates it against Supabase (PostgreSQL). To avoid a database round-trip on every request, validated keys and their associated configuration (org, rate limits, routing rule, feature flags) are cached in Redis with automatic invalidation when settings change in the dashboard.
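The shape of that cache is roughly the following, a minimal sketch with an in-memory Map standing in for Redis and a stub for the Supabase lookup; the field names and TTL are illustrative assumptions.

```typescript
// Sketch of the key-validation cache: validated keys and their config are
// cached so most requests skip the database. Map stands in for Redis.
interface KeyConfig { org: string; rateLimit: number; routingRule: string; }

const keyCache = new Map<string, { cfg: KeyConfig; expiresAt: number }>();
const CACHE_TTL_MS = 60_000; // illustrative TTL

// Stand-in for the Supabase (PostgreSQL) lookup.
function fetchFromDatabase(apiKey: string): KeyConfig | null {
  const db: Record<string, KeyConfig> = {
    "fl-123": { org: "acme", rateLimit: 100, routingRule: "fallback" },
  };
  return db[apiKey] ?? null;
}

function validateKey(apiKey: string, now = Date.now()): KeyConfig | null {
  const hit = keyCache.get(apiKey);
  if (hit && hit.expiresAt > now) return hit.cfg; // cache hit: no DB round-trip
  const cfg = fetchFromDatabase(apiKey);
  if (cfg) keyCache.set(apiKey, { cfg, expiresAt: now + CACHE_TTL_MS });
  return cfg;
}

// Dashboard settings changed: drop the entry so the next request re-reads.
function invalidateKey(apiKey: string): void {
  keyCache.delete(apiKey);
}
```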
Rate Limit Check
Rate limiting uses a sliding window algorithm implemented with atomic Redis operations (Lua scripts) to ensure consistency even when running multiple gateway instances. Limits are configurable per API key and per organization. When the limit is exceeded, the gateway returns 429 Too Many Requests with a Retry-After header.
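The core of a sliding-window check can be sketched as below. In the gateway this logic runs as an atomic Redis Lua script; here a local array of timestamps stands in for the Redis key, and the window size and limit are illustrative assumptions.

```typescript
// Sliding-window rate limiter sketch: keep timestamps of recent hits,
// drop those outside the window, and deny once the count reaches the limit.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 3; // per-key limit (illustrative)

const hits = new Map<string, number[]>();

// retryAfterMs is what would feed the Retry-After header on a 429.
function checkRateLimit(key: string, now: number): { allowed: boolean; retryAfterMs: number } {
  const windowStart = now - WINDOW_MS;
  const recent = (hits.get(key) ?? []).filter((t) => t > windowStart);
  if (recent.length >= MAX_REQUESTS) {
    const retryAfterMs = recent[0] + WINDOW_MS - now; // oldest hit leaves the window then
    hits.set(key, recent);
    return { allowed: false, retryAfterMs };
  }
  recent.push(now);
  hits.set(key, recent);
  return { allowed: true, retryAfterMs: 0 };
}
```

Running this in a Lua script keeps the read-filter-write sequence atomic, which is what makes the limit consistent across multiple gateway instances.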
Prompt Resolution
If the request includes a floopy-prompt-id header, the gateway resolves the prompt from your prompt library. Template variables in the prompt (using {{variable}} syntax) are substituted with values from the request body’s inputs field. This lets you version and manage prompts centrally without redeploying your application.
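The substitution step is essentially a templated string replace; a minimal sketch, assuming unknown variables are left untouched (the gateway's actual behavior for missing inputs is not specified here):

```typescript
// Sketch of {{variable}} substitution using values from the request
// body's `inputs` field.
function renderPrompt(template: string, inputs: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in inputs ? inputs[name] : match // assumption: leave unknown variables as-is
  );
}

// renderPrompt("Summarize {{doc}} in {{lang}}", { doc: "the report", lang: "French" })
// → "Summarize the report in French"
```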
Cache Lookup
When caching is enabled, three tiers are checked in sequence: exact match in Redis, semantic similarity in Qdrant, and advanced bucketed search. Each tier trades a small amount of additional latency for broader match coverage. A hit at any tier returns the stored response immediately — no tokens are consumed and no provider call is made.
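The first two tiers can be sketched as follows, with a hash map standing in for Redis and a linear cosine-similarity scan standing in for Qdrant's vector search; the 0.9 similarity threshold is an illustrative assumption, not the gateway's configured value.

```typescript
// Tiered cache lookup sketch: exact match first, then semantic similarity.
const exactCache = new Map<string, string>();
const semanticCache: { embedding: number[]; response: string }[] = [];

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function cacheLookup(prompt: string, embedding: number[], threshold = 0.9): string | null {
  const exact = exactCache.get(prompt);          // tier 1: exact match (Redis)
  if (exact !== undefined) return exact;
  for (const entry of semanticCache) {           // tier 2: semantic match (Qdrant)
    if (cosine(embedding, entry.embedding) >= threshold) return entry.response;
  }
  return null;                                   // miss: fall through to tier 3 / provider
}
```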
LLM Firewall
The firewall sends each prompt to a safety-tuned LLM (configured via FIREWALL_MODEL) that returns a safe or unsafe verdict. A Qdrant verdict cache short-circuits repeat unsafe prompts so the LLM call is skipped when an embedding above the configured threshold matches a recent unsafe entry. If the verdict is unsafe, the request is blocked with a 400 response.
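The verdict-cache short-circuit can be sketched like this; the 0.95 threshold and the classifier stub (standing in for the FIREWALL_MODEL call) are illustrative assumptions.

```typescript
// Firewall verdict-cache sketch: skip the safety-LLM call when the prompt's
// embedding is close enough to a recently-flagged unsafe prompt.
type Verdict = "safe" | "unsafe";

const unsafeEmbeddings: number[][] = [];

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function firewallCheck(
  embedding: number[],
  classify: () => Verdict, // stands in for the FIREWALL_MODEL call
  threshold = 0.95,
): Verdict {
  for (const unsafe of unsafeEmbeddings) {
    if (cosine(embedding, unsafe) >= threshold) return "unsafe"; // cache hit: no LLM call
  }
  const verdict = classify();
  if (verdict === "unsafe") unsafeEmbeddings.push(embedding); // remember for next time
  return verdict;
}
```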
Routing Strategy
The gateway selects a provider based on the routing strategy configured for the API key — fallback, round-robin, weighted, or latency-based. The strategy determines which provider receives the request and what happens if that provider fails. See the Routing guide for details on each strategy.
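As a feel for what selection looks like, here is a minimal sketch of two of those strategies, round-robin and weighted; the provider names and weights are examples, not the gateway's internals.

```typescript
// Provider selection sketch: round-robin cycles; weighted picks in
// proportion to configured weights.
const providers = ["openai", "anthropic", "groq"];

let rrIndex = 0;
function roundRobin(): string {
  const p = providers[rrIndex % providers.length];
  rrIndex++;
  return p;
}

// rand defaults to Math.random(); injectable for determinism.
function weighted(weights: Record<string, number>, rand = Math.random()): string {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  let r = rand * total;
  for (const [name, w] of Object.entries(weights)) {
    r -= w;
    if (r <= 0) return name;
  }
  return Object.keys(weights)[0]; // guard against floating-point drift
}
```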
Provider Dispatch
The request is translated into the target provider’s format and sent. Each provider connection is protected by a circuit breaker that tracks failure rates. If a provider is consistently failing, the circuit breaker opens and the gateway skips it automatically, trying the next available provider according to the routing strategy.
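A minimal circuit-breaker sketch is below. The failure threshold and cooldown are illustrative assumptions, and this counts consecutive failures rather than tracking failure rates as the gateway's breaker does.

```typescript
// Circuit breaker sketch: after `threshold` consecutive failures the circuit
// opens and requests are skipped until the cooldown elapses (half-open).
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  canRequest(now: number): boolean {
    if (this.failures < this.threshold) return true; // closed: allow
    return now - this.openedAt >= this.cooldownMs;   // open: allow only after cooldown
  }
  recordSuccess(): void { this.failures = 0; }        // close the circuit
  recordFailure(now: number): void {
    this.failures++;
    if (this.failures === this.threshold) this.openedAt = now; // trip open
  }
}
```

When the breaker is open for one provider, the routing strategy simply tries the next available provider, which is the automatic skip described above.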
Async Logging
After the response is sent to the client, the gateway logs the full request-response pair to ClickHouse asynchronously. Logging uses Tokio’s mpsc channels to pass log entries to a background worker that performs batch inserts. This design ensures that logging never blocks or slows down the response, and that ClickHouse unavailability does not affect gateway operation.
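The producer/worker split can be sketched as follows, with a plain array standing in for the tokio::sync::mpsc channel and a list of batches standing in for ClickHouse inserts; the batch size and log fields are illustrative assumptions.

```typescript
// Batched async logging sketch: request handlers push entries onto a
// channel; a background worker drains them in bulk inserts.
interface LogEntry { requestId: string; provider: string; latencyMs: number; }

const BATCH_SIZE = 2; // illustrative; real batches would be much larger
const channel: LogEntry[] = [];
const inserted: LogEntry[][] = []; // stands in for ClickHouse bulk inserts

function enqueueLog(entry: LogEntry): void {
  channel.push(entry); // non-blocking from the request path's perspective
}

// Background worker step: drain up to BATCH_SIZE entries per insert.
// Returns how many entries were flushed.
function flushBatch(): number {
  if (channel.length === 0) return 0;
  const batch = channel.splice(0, BATCH_SIZE);
  inserted.push(batch); // one bulk INSERT instead of N single-row writes
  return batch.length;
}
```

If the worker's sink is down, entries simply accumulate (or are dropped past a bound), so a slow or unavailable ClickHouse never back-pressures the response path.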
What Makes It Fast
Floopy is designed for minimal latency overhead. The gateway typically adds less than 5ms to provider response times (excluding cache hits, which are under 10ms total).
- Rust with Tokio — The gateway is fully async. Every I/O operation (Redis lookups, provider calls, logging) is non-blocking, allowing a single instance to handle thousands of concurrent requests.
- Zero-copy parsing — Request and response bodies are parsed with minimal memory allocation. Large streaming responses are forwarded chunk-by-chunk without buffering the full body.
- Background workers — Logging and cache storage happen asynchronously via tokio::sync::mpsc channels. A dedicated background task batches log entries and inserts them into ClickHouse in bulk, reducing write overhead.
- Connection pooling — Redis connections and HTTP client connections to providers are pooled and reused across requests, avoiding the overhead of establishing new connections.
- Embedding reuse — The same embedding the response semantic cache computes is reused by the firewall verdict cache, so a request never pays for two embed calls.