GET /v1/experiments/{id}/results

Read aggregated results for a single routing experiment. The endpoint runs two aggregations over the request log — one for the experiment’s baseline side, one for its candidate side — bounded to the experiment’s lifetime, and returns cost, quality, and latency aggregates plus the deltas between them.

This is the API behind the experiment detail page in the Floopy dashboard, including the polling loop used by the onboarding shadow-setup step.

Endpoint

GET https://api.floopy.ai/v1/experiments/{id}/results

Authentication

Authorization: Bearer <your-floopy-api-key>

Permission: read_permission.
Plan: Pro.
Availability: the endpoint is being rolled out per organization while quality is validated. Contact support if your organization is not yet enabled.

Path parameters

Field	Type	Required	Constraints
`id`	string (UUID v4)	Yes	The experiment id. Validated as a v4 UUID at the path extractor; malformed input returns `400 invalid_experiment_id` and no database call is made.
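
Because a malformed id is rejected before any lookup, a client can avoid a wasted round trip by running a similar check locally. The following is a minimal sketch, assuming the gateway accepts only the canonical hyphenated form; `is_uuid_v4` is a hypothetical helper, not part of any Floopy SDK.

```python
import uuid


def is_uuid_v4(value: str) -> bool:
    """Return True if value is a canonical, hyphenated version-4 UUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes, and unhyphenated hex,
    # so compare against the canonical form.
    # Assumption: the gateway is at least this strict.
    return parsed.version == 4 and str(parsed) == value.lower()
```

Passing `version=4` to the `uuid.UUID` constructor would not validate anything: it overwrites the version bits instead of checking them, which is why the sketch inspects `parsed.version` after parsing.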

Before any aggregation runs, the endpoint fetches the experiment record scoped by the caller’s organization and the supplied id. If the experiment does not belong to the caller’s org, the response is a 404 that is byte-for-byte identical to the one returned for a non-existent experiment, and the analytics store is not queried.

Time-bound semantics

The aggregation window is bounded by the experiment’s lifetime, not by a customer-supplied window parameter:

Lower bound (always applied): request_created_at >= experiment.started_at.
Upper bound (only when the experiment is no longer active): request_created_at <= experiment.ended_at.

For an active experiment, the upper bound is open — every new request the experiment sees is included on the next read. For a completed or rolled_back experiment, the window is frozen at [started_at, ended_at].

This means polling the same experiment id twice in a row can return different numbers if the experiment is still active and traffic is flowing.
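
The two bounds above can be expressed as a single predicate over a request row. This is an illustrative sketch only: `in_window` is a hypothetical name, the real filter presumably runs inside the analytics query, and treating a missing `ended_at` on a non-active experiment as an open bound is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def in_window(created_at: datetime, started_at: datetime,
              ended_at: Optional[datetime], status: str) -> bool:
    """Return True if a request row falls inside the aggregation window."""
    if created_at < started_at:
        return False  # lower bound, always applied
    # Upper bound applies only once the experiment is no longer active.
    # Assumption: if ended_at is missing, the bound is treated as open.
    if status != "active" and ended_at is not None and created_at > ended_at:
        return False
    return True
```

Both bounds are inclusive, matching the `>=` and `<=` comparisons stated above.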

Response (200)

{
  "experiment_id": "5b3f9c1e-7a2b-4c3d-9e1f-0a1b2c3d4e5f",
  "type": "shadow",
  "status": "active",
  "started_at": "2026-05-01T09:00:00Z",
  "ended_at": null,
  "baseline": {
    "samples": 9412,
    "avg_cost_micro_usd": 412,
    "composite_quality": 0.812,
    "p50_latency_ms": 612
  },
  "candidate": {
    "samples": 9412,
    "avg_cost_micro_usd": 226,
    "composite_quality": 0.804,
    "p50_latency_ms": 588
  },
  "delta": {
    "cost_pct": -45.1,
    "quality_abs": -0.008,
    "p50_latency_ms": -24
  }
}

Field reference

Field	Type	Description
`experiment_id`	UUID	Echoes the path id.
`type`	string	`canary` or `shadow`.
`status`	string	`draft`, `active`, `completed`, or `rolled_back`.
`started_at`	ISO8601 | null	When the experiment was activated. The lower bound of the aggregation window.
`ended_at`	ISO8601 | null	When the experiment ended. The upper bound when present; `null` for active experiments.
`baseline`	object	Aggregate over the experiment’s baseline `(provider, model)` for rows in the time-bounded window.
`candidate`	object	Aggregate over the experiment’s candidate `(provider, model)` for the same window.
`delta.cost_pct`	number	`(candidate − baseline) / baseline * 100` on `avg_cost_micro_usd`. Negative means the candidate is cheaper.
`delta.quality_abs`	number	`candidate − baseline` on composite quality in `[0.0, 1.0]`.
`delta.p50_latency_ms`	number	`candidate − baseline` on p50 latency in milliseconds.

For a shadow experiment, the candidate rows are scored side-by-side with the live baseline traffic; the user-visible response always comes from the baseline. For a canary experiment, both sides serve real traffic in proportion to traffic_pct.

When neither side has accumulated any rows yet (a freshly activated experiment), baseline and candidate blocks return zeros; delta is omitted.
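
The delta formulas in the field reference can be checked against the sample response with a few lines of Python. This is a sketch, not the gateway's implementation: `compute_delta` is a hypothetical helper, the rounding (one decimal for `cost_pct`, three for `quality_abs`) is inferred from the sample values, and the handling of a zero baseline cost is an assumption.

```python
from typing import Optional


def compute_delta(baseline: dict, candidate: dict) -> Optional[dict]:
    """Recompute the delta block from the baseline and candidate aggregates."""
    if baseline["samples"] == 0 and candidate["samples"] == 0:
        return None  # mirrors the omitted delta for a freshly activated experiment
    base_cost = baseline["avg_cost_micro_usd"]
    # Assumption: a zero baseline cost has no defined percentage change.
    cost_pct = None if base_cost == 0 else round(
        (candidate["avg_cost_micro_usd"] - base_cost) / base_cost * 100, 1)
    return {
        "cost_pct": cost_pct,
        "quality_abs": round(
            candidate["composite_quality"] - baseline["composite_quality"], 3),
        "p50_latency_ms": candidate["p50_latency_ms"] - baseline["p50_latency_ms"],
    }


baseline = {"samples": 9412, "avg_cost_micro_usd": 412,
            "composite_quality": 0.812, "p50_latency_ms": 612}
candidate = {"samples": 9412, "avg_cost_micro_usd": 226,
             "composite_quality": 0.804, "p50_latency_ms": 588}
print(compute_delta(baseline, candidate))
# {'cost_pct': -45.1, 'quality_abs': -0.008, 'p50_latency_ms': -24}
```

Since `delta` can be absent from the response, clients should read it with a lookup that tolerates a missing key instead of indexing it directly.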

Response cache

The endpoint caches the response in Redis per (organization_id, experiment_id) for 30 seconds. No Cache-Control header is returned; the cache is internal to the gateway.

A 30-second TTL is short enough that polling-driven UIs (the onboarding shadow-setup step polls every 20 s) see fresh-enough numbers, and long enough that the analytics store is not hit on every render.
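
The caching behaviour can be pictured with a small in-process model. This sketch deliberately swaps Redis for a plain dictionary and uses an injectable clock so that expiry is easy to reason about; `ResultsCache` and `get_or_compute` are hypothetical names, and only the key shape and the 30-second TTL come from the text above.

```python
import time
from typing import Any, Callable, Dict, Tuple

CACHE_TTL_S = 30.0  # TTL stated in this section


class ResultsCache:
    """In-process stand-in for the gateway's Redis cache (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get_or_compute(self, org_id: str, experiment_id: str,
                       compute: Callable[[], Any]) -> Any:
        key = (org_id, experiment_id)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < CACHE_TTL_S:
            return hit[1]  # still fresh: skip the aggregation
        value = compute()  # miss or expired: run the aggregation again
        self._entries[key] = (now, value)
        return value
```

With a 20 s polling interval and a 30 s TTL, some polls are served from the cache and return the same numbers as the previous poll.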

Errors

Status	`error` code	When
`400`	`invalid_experiment_id`	The path id is not a valid UUID v4. No database call is made.
`403`	`read_permission`	Key lacks `read_permission`.
`403`	`plan_required`	The endpoint is not included on the caller’s plan.
`404`	`not_found`	No experiment record exists in the caller’s org for that id. Bytes are identical to a cross-tenant lookup.
`429`	`rate_limited`	Exceeded `60 req/min/org` or `20 req/min/key`. Carries `Retry-After`.
`503`	`upstream_timeout`	The analytics aggregation exceeded the 5 s wall-clock timeout.
`5xx`	`internal`	Any other upstream failure.
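
A client usually needs to decide which of these errors are worth retrying. The sketch below is one possible policy, not a documented recommendation: `retry_delay_s` is a hypothetical helper, the `Retry-After` value is assumed to be in delta-seconds form, and the backoff constants and the 60 s fallback are assumptions.

```python
from typing import Mapping, Optional


def retry_delay_s(status: int, headers: Mapping[str, str],
                  attempt: int) -> Optional[float]:
    """Return seconds to wait before retrying, or None if a retry is pointless."""
    if status == 429:
        # Retry-After is documented for 429 responses.
        # Assumption: the header key is spelled exactly "Retry-After".
        try:
            return max(0.0, float(headers.get("Retry-After", "")))
        except ValueError:
            return 60.0  # assumption: wait out one rate-limit window
    if 500 <= status <= 599:
        # Assumption: exponential backoff capped at 30 s for 503 and other 5xx.
        return min(30.0, 2.0 ** attempt)
    return None  # 400, 403, 404: the same request will fail again
```

For example, `retry_delay_s(429, {"Retry-After": "12"}, 0)` returns `12.0`, while a 404 returns `None` because the id or the org scope has to change before the call can succeed.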

Curl example

curl -s -H "Authorization: Bearer $FLOOPY_API_KEY" \
  "https://api.floopy.ai/v1/experiments/5b3f9c1e-7a2b-4c3d-9e1f-0a1b2c3d4e5f/results" | jq .
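
The same call can be made from Python using only the standard library. This is a minimal sketch: `build_results_request` and `fetch_results` are hypothetical helper names, and the 10-second client timeout is an arbitrary choice that sits above the 5 s server-side aggregation timeout.

```python
import json
import urllib.request

BASE_URL = "https://api.floopy.ai/v1"


def build_results_request(experiment_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for an experiment's results without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/experiments/{experiment_id}/results",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_results(experiment_id: str, api_key: str,
                  timeout_s: float = 10.0) -> dict:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for 4xx and 5xx responses.
    """
    request = build_results_request(experiment_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

A typical call reads the key from the environment, for example `fetch_results(experiment_id, os.environ["FLOOPY_API_KEY"])` after `import os`.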

Rate limits

60 requests / minute / organization.
20 requests / minute / API key.

Both limits are evaluated atomically. The dashboard’s polling loop stays under these caps by spacing requests 20 s apart for the first 10 minutes and 60 s apart after that, and by pausing entirely while the browser tab is hidden.
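
The dashboard's polling cadence can be reproduced by other clients with a small scheduling function. The sketch below only encodes the intervals described above; `poll_interval_s` is a hypothetical name, and switching to the slower cadence at exactly 600 seconds is an assumption about the boundary.

```python
from typing import Optional


def poll_interval_s(elapsed_s: float, tab_hidden: bool) -> Optional[float]:
    """Seconds until the next poll, or None while polling is paused."""
    if tab_hidden:
        return None  # paused while the tab is hidden
    if elapsed_s < 10 * 60:
        return 20.0  # first 10 minutes
    return 60.0  # slower cadence afterwards
```

At the faster cadence a single client sends 3 requests per minute, well under the 20 requests per minute allowed per API key.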

Audit trail

Every successful response writes an experiment_results_read event to the audit log, throttled to at most one event per (organization_id, key_id, experiment_id) every 60 seconds.