Streaming

How Streaming Works

Set stream: true in your chat completion request and the Floopy gateway returns the response as Server-Sent Events (SSE). Each event contains a chat.completion.chunk object with incremental content, delivered as soon as the provider generates it.

The gateway proxies SSE frames directly from the upstream provider to your client with no buffering delay. Your application receives tokens the moment they are produced, giving users a responsive, typewriter-style experience.

No code changes beyond stream: true are required — the gateway handles the rest.

Code Examples

import { OpenAI } from "openai";

const client = new OpenAI({
  baseURL: "https://api.floopy.ai/v1",
  apiKey: process.env.FLOOPY_API_KEY,
});

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain quantum computing in simple terms." }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

Response Format

Each SSE frame is a JSON object prefixed with data:. The stream ends with a data: [DONE] sentinel:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":" there"},"index":0}]}
data: [DONE]

The first chunk typically contains the role field. Subsequent chunks carry incremental content. Your client should concatenate the content values to reconstruct the full response.
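The concatenation step can be sketched as a small parser. Note that `parseSSEContent` below is a hypothetical helper, not part of any SDK; it assumes each frame arrives on its own `data:` line, as in the example above:

```typescript
// Hypothetical helper (not part of any SDK): reconstruct the full response
// text from a raw SSE payload by concatenating the delta content values.
function parseSSEContent(raw: string): string {
  let text = "";
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data:")) continue;        // skip blank lines and comments
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;                // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    text += chunk.choices[0]?.delta?.content ?? ""; // first chunk may carry only the role
  }
  return text;
}
```

In practice the openai SDK handles this for you; manual parsing like this is only needed when consuming the SSE stream over raw fetch.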

What Gets Cached

Streaming responses are fully compatible with caching. The gateway buffers chunks internally as they arrive and, once the stream completes, assembles the full response and stores it in the cache — exactly as if it were a non-streaming request.

On future cache hits the gateway serves the complete cached response: as a single non-streaming JSON body when the new request omits stream: true, or re-chunked into SSE frames when it sets it. The Floopy-Cache-Bucket-Max-Size header works the same way as for non-streaming requests; each buffered response counts as one bucket entry.
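The buffer-then-assemble step can be illustrated with a sketch. This is illustrative only, not Floopy's actual implementation; `DeltaChunk` and `assembleForCache` are made-up names:

```typescript
// Illustrative sketch (not Floopy source): merge chat.completion.chunk deltas
// into a single non-streaming response body suitable for caching.
interface DeltaChunk {
  id: string;
  choices: { delta: { role?: string; content?: string }; index: number }[];
}

function assembleForCache(chunks: DeltaChunk[]) {
  let role = "assistant";
  let content = "";
  for (const c of chunks) {
    const delta = c.choices[0]?.delta ?? {};
    if (delta.role) role = delta.role;    // typically set by the first chunk
    content += delta.content ?? "";
  }
  return {
    id: chunks[0]?.id,
    object: "chat.completion",            // stored as a non-streaming response
    choices: [{ index: 0, message: { role, content } }],
  };
}
```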

Streaming and the Firewall

Firewall checks run before streaming starts. The LLM-backed firewall (with Qdrant verdict cache short-circuit) evaluates the input prompt while the request is still buffered at the gateway. If the prompt is classified unsafe, a 400 error is returned and the stream is never opened.

Once the stream begins, it is not interrupted by the firewall. The response flows directly from the provider to your client without further inspection.
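Because the firewall rejects before any SSE frame is sent, a blocked prompt surfaces as an error from the request call itself, never mid-stream. A minimal client-side sketch, where `safeStream` and `FirewallError` are hypothetical stand-ins (the real error type depends on your SDK):

```typescript
// Hypothetical error type standing in for your SDK's API error class.
class FirewallError extends Error {
  constructor(public status: number, message: string) { super(message); }
}

// Sketch: a 400 thrown by create() means the firewall blocked the prompt
// and no stream was ever opened; anything else is a genuine failure.
async function safeStream(create: () => Promise<AsyncIterable<string>>): Promise<string> {
  try {
    let text = "";
    for await (const token of await create()) text += token;
    return text;
  } catch (err) {
    if (err instanceof FirewallError && err.status === 400) {
      return "[blocked by firewall]"; // prompt classified unsafe pre-stream
    }
    throw err;
  }
}
```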

Streaming and Observability

The gateway assembles the complete response from all chunks after the stream ends. The following data is logged to ClickHouse as a single row in request_response_rmt:

  • Full response text — the concatenated content from all chunks.
  • Total tokens — prompt tokens and completion tokens as reported in the final chunk’s usage field (when the provider includes it).
  • Latency — measured from the first byte sent to the provider to the last chunk received.
  • Cost — calculated from the token counts and the model’s pricing.
  • Time to first token (TTFT) — the interval between sending the request and receiving the first content chunk.

All of this appears in the dashboard logs alongside non-streaming requests with no special filtering required.
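The same timings can also be measured client-side with a small wrapper over any async iterable of content chunks. `consumeWithTimings` is a hypothetical helper, not a gateway API:

```typescript
// Sketch: measure time-to-first-token (TTFT) and total latency while
// consuming a stream, mirroring what the gateway logs server-side.
async function consumeWithTimings(stream: AsyncIterable<string>) {
  const start = Date.now();
  let ttftMs: number | null = null;
  let text = "";
  for await (const token of stream) {
    if (ttftMs === null) ttftMs = Date.now() - start; // first content chunk
    text += token;
  }
  return { text, ttftMs, totalMs: Date.now() - start };
}
```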

Edge Cases

  • Idle timeout — If no new SSE frame arrives for 30 seconds (configurable), the gateway closes the connection and logs a partial response.
  • Max buffer size — The internal buffer is capped at 1 MB by default. If the assembled response exceeds this limit, the stream is terminated and the partial response is logged.
  • Client disconnects — The gateway detects the closed connection and cancels the upstream provider request to avoid wasting tokens. The partial response is logged.
  • Provider error mid-stream — If the provider sends an error event during streaming, the gateway forwards it to the client and logs the error alongside any partial content received.

Idle timeout and max buffer size can be adjusted in Settings > Gateway or via environment variables.
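A client-side analogue of the idle-timeout behavior can be sketched by racing each chunk read against a timer. `withIdleTimeout` is hypothetical; the 30-second default above is the gateway's setting, not anything built into the client:

```typescript
// Sketch: if no chunk arrives within idleMs, give up and return the
// partial text collected so far — the same "partial response" outcome
// the gateway logs on its own idle timeout.
async function withIdleTimeout(
  stream: AsyncIterator<string>,
  idleMs: number,
): Promise<{ text: string; timedOut: boolean }> {
  let text = "";
  while (true) {
    const timer = new Promise<"timeout">(res => setTimeout(() => res("timeout"), idleMs));
    const result = await Promise.race([stream.next(), timer]);
    if (result === "timeout") return { text, timedOut: true };
    if (result.done) return { text, timedOut: false };
    text += result.value;
  }
}
```

A production version would also clear the pending timer and cancel the underlying request on timeout; this sketch omits that for brevity.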