Datasets
Overview
Datasets are collections of input/output pairs that you use to test and evaluate LLM behavior. Each row in a dataset represents a single test case: the input you send to the model and the expected (or reference) output you compare against. Datasets power Experiments and A/B Testing by providing a consistent, repeatable set of inputs.
Creating a Dataset
- Open Datasets from the dashboard sidebar.
- Click Create Dataset.
- Give the dataset a descriptive name (e.g., “Customer Support QA”, “Code Review Inputs”).
- Optionally add a description to explain what the dataset covers and when to use it.
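If you script against the dashboard rather than clicking through it, dataset creation is typically a single API call. The sketch below is hypothetical: the base URL, endpoint path, and bearer-token auth are assumptions, not documented values, so check your dashboard’s API reference before using it.

```python
import requests

# Hypothetical endpoint and auth scheme; substitute the real values
# from your dashboard's API reference.
API_BASE = "https://your-dashboard.example.com/api"
API_KEY = "YOUR_API_KEY"

resp = requests.post(
    f"{API_BASE}/datasets",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "name": "Customer Support QA",
        "description": "Reference Q&A pairs for the support assistant.",
    },
)
resp.raise_for_status()
print(resp.json())  # e.g. the new dataset's ID
```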
Adding Rows
Each row in a dataset pairs an input with an expected output:
- Input — The prompt or message that will be sent to the model. This is typically a user message or a full conversation history in JSON format.
- Expected Output — The reference response you want to compare the model’s actual output against. This field is optional, but automated scoring in experiments requires it.
You can add rows in two ways:
- Manually — Click Add Row and type the input and expected output directly in the dashboard.
- Import — Upload a JSONL file where each line is a JSON object with `input` and `expected_output` fields (see the sketch after this list).
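As a concrete reference for the import format, here is a minimal Python sketch that produces an import-ready JSONL file. The rows are illustrative; only the `input` and `expected_output` field names come from the format described above.

```python
import json

rows = [
    {"input": "What is the capital of France?",
     "expected_output": "The capital of France is Paris."},
    {"input": "Translate 'hello' to Spanish.",
     "expected_output": "hola"},
]

# JSONL: one JSON object per line, which is the format the import expects.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```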
JSONL Export
Export any dataset as a JSONL file for use outside the dashboard or for backup. Click the Export button on a dataset to download it. Each line in the exported file is a JSON object:
{"input": "What is the capital of France?", "expected_output": "The capital of France is Paris."}{"input": "Translate 'hello' to Spanish.", "expected_output": "hola"}Using Datasets with Experiments
Using Datasets with Experiments
When you create an Experiment, you select a dataset as the input source. The experiment runs each row through the configured model and prompt, then compares the output to the expected output using your chosen evaluation criteria. This lets you measure prompt quality at scale instead of testing one input at a time.
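Conceptually, an experiment is a loop over the dataset. The sketch below is a simplification: `call_model` is a hypothetical stand-in for your configured model and prompt, and exact match stands in for whatever evaluation criteria you actually choose.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for the configured model and prompt."""
    raise NotImplementedError

def run_experiment(rows) -> float:
    passed = 0
    for row in rows:
        output = call_model(row["input"])
        # Exact match shown for illustration; real experiments apply
        # whichever evaluation criteria you configured.
        if output.strip() == row["expected_output"].strip():
            passed += 1
    return passed / len(rows)  # fraction of rows that passed
```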
Using Datasets with A/B Tests
A/B Tests can use datasets to ensure each variant receives the same inputs. This eliminates variability from different user queries and gives you a fair comparison between prompt versions, models, or configurations.
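The fairness guarantee is easy to see in code: every variant iterates over the same rows. A minimal sketch, reusing the `rows` list and the hypothetical `call_model` from the sketches above:

```python
def run_variant(rows, prompt_template, call_model):
    # Every variant sees exactly the same inputs, so differences in the
    # outputs are attributable to the variant, not to the queries.
    return [call_model(prompt_template.format(input=row["input"])) for row in rows]

outputs_a = run_variant(rows, "Answer concisely: {input}", call_model)
outputs_b = run_variant(rows, "Answer step by step: {input}", call_model)
```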
Organizing Datasets
Keep your datasets organized by use case:
- Regression datasets — A stable set of critical inputs that you run after every prompt change to catch regressions.
- Edge case datasets — Inputs that test boundary conditions, unusual formats, or adversarial prompts.
- Domain-specific datasets — Inputs grouped by topic or product area (e.g., billing questions, technical support, onboarding flows).
- Evaluation datasets — Large sets with scored expected outputs for quantitative benchmarking.
Name datasets clearly and add descriptions so your team can find and reuse them without guessing what they contain.