Build comprehensive datasets to evaluate and optimize your prompts. Datasets are scoped to prompts and follow a two-step creation pattern: first create the dataset metadata, then bulk-insert its items.
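The two-step pattern can be sketched as two request payloads. This is a minimal illustration only: the field names (`promptId`, `name`, `items`) and the shape of the payloads are assumptions, not the tool's documented API, so check the actual API reference before using them.

```python
def build_dataset_requests(prompt_id, name, items):
    """Return the two payloads for the two-step pattern:
    step 1 creates the dataset metadata, step 2 bulk-inserts its items.
    Field names here are hypothetical, for illustration only."""
    create_metadata = {
        "promptId": prompt_id,  # dataset is scoped to a prompt
        "name": name,
    }
    bulk_insert = {"items": items}
    return create_metadata, bulk_insert

meta, bulk = build_dataset_requests(
    "prompt-123",
    "Support FAQ v1",
    [{"input": {"question": "How do I upgrade?"},
      "expectedOutput": {"response": "Visit Account Settings."},
      "name": "Upgrade"}],
)
```

Keeping metadata creation separate from item insertion lets you append more items later without recreating the dataset.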
One JSON object per line — useful for large datasets:
{"input":{"question":"How do I upgrade?"},"expectedOutput":{"response":"Visit Account Settings."},"name":"Upgrade"}
{"input":{"question":"Can I get a refund?"},"expectedOutput":{"response":"Within 30 days."},"name":"Refund"}
{"input":{"question":"Reset password?"},"expectedOutput":{"response":"Click Forgot Password."},"name":"Password reset"}
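Because JSONL is just one JSON document per line, it can be parsed incrementally with the standard library, which is what makes it practical for large datasets. A minimal sketch:

```python
import json

jsonl_content = """\
{"input":{"question":"How do I upgrade?"},"expectedOutput":{"response":"Visit Account Settings."},"name":"Upgrade"}
{"input":{"question":"Can I get a refund?"},"expectedOutput":{"response":"Within 30 days."},"name":"Refund"}
"""

# Parse one item per non-empty line; a malformed line fails on its own
# without invalidating the rest of the file.
items = [json.loads(line) for line in jsonl_content.splitlines() if line.strip()]
```

For a real file, iterate over the file object line by line instead of loading the whole content into memory.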
Comma-separated values with a header row. The CLI passes CSV content directly to the API for parsing:
question,expected_answer,category
"How do I upgrade?","Visit Account Settings and click Upgrade Plan.",account
"Can I get a refund?","Refunds are available within 30 days.",billing
"Reset my password?","Click Forgot Password on the login page.",authentication
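Since the API does the parsing, you normally just hand the CSV to the CLI. For intuition, here is roughly how rows like these could map onto dataset items; the column-to-field mapping shown (question to `input`, expected_answer to `expectedOutput`, category to `name`) is an assumption for illustration, not the API's documented behavior:

```python
import csv
import io

csv_content = """question,expected_answer,category
"How do I upgrade?","Visit Account Settings and click Upgrade Plan.",account
"Can I get a refund?","Refunds are available within 30 days.",billing
"""

# DictReader uses the header row as keys, so each row becomes a dict.
reader = csv.DictReader(io.StringIO(csv_content))
items = [
    {
        "input": {"question": row["question"]},
        "expectedOutput": {"response": row["expected_answer"]},
        "name": row["category"],  # assumed mapping; check your API docs
    }
    for row in reader
]
```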
Always include expectedOutput when you want to use the dataset for evaluation and optimization. Without expected outputs, only reference-free metrics (like coherence) can be used.
Begin with 10-20 high-quality items and expand based on evaluation results.
Use descriptive names
Name both the dataset and individual items descriptively:
{
  "name": "Angry customer refund request",
  "input": {"question": "I want my money back NOW!"},
  "expectedOutput": {"response": "I understand your frustration..."}
}
Avoid generic names like “test-1” or “item-42”.
Include expectedOutput for optimization
Datasets used with the optimizer require expectedOutput to measure quality. Always include expected outputs for datasets you plan to use in evaluation or optimization.
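Before running the optimizer, it is worth checking the dataset for items that lack `expectedOutput`. The helper below is a hypothetical sketch, not part of any official SDK:

```python
def items_missing_expected_output(items):
    """Return the names of items that lack expectedOutput.
    Such items can only support reference-free metrics."""
    return [
        item.get("name", "<unnamed>")
        for item in items
        if not item.get("expectedOutput")
    ]

items = [
    {"name": "Upgrade", "input": {"question": "How do I upgrade?"},
     "expectedOutput": {"response": "Visit Account Settings."}},
    {"name": "Refund", "input": {"question": "Can I get a refund?"}},
]
missing = items_missing_expected_output(items)  # ["Refund"]
```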
Ensure all items use the same variable names and data types as your prompt’s inputSchema.
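A lightweight consistency check can catch schema mismatches before upload. The schema representation here (a mapping of variable name to expected Python type) is an assumption for illustration; your prompt's actual inputSchema format may differ, e.g. JSON Schema:

```python
def schema_violations(item, schema):
    """List mismatches between an item's input and a name->type schema.
    Hypothetical helper; adapt to your prompt's real inputSchema format."""
    problems = []
    for var, expected_type in schema.items():
        if var not in item["input"]:
            problems.append(f"missing variable: {var}")
        elif not isinstance(item["input"][var], expected_type):
            problems.append(f"wrong type for {var}")
    for var in item["input"]:
        if var not in schema:
            problems.append(f"unknown variable: {var}")
    return problems

input_schema = {"question": str}
good = {"input": {"question": "How do I upgrade?"}}
bad = {"input": {"q": "How do I upgrade?"}}  # wrong variable name
```

Running every item through such a check before the bulk insert turns silent evaluation failures into explicit errors.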
Datasets are associated with a prompt’s promptGroupId. This means a dataset works across all versions of the same prompt, so you can test new versions against the same dataset.