Datasets

Datasets are collections of test cases used to evaluate and optimize your prompts. They provide the ground truth against which your prompts are measured, enabling systematic quality assurance and continuous improvement.

What is a Dataset?

A dataset is scoped to a prompt (linked via promptGroupId) and contains test case items with:
  • Input: JSON object of variable values matching the prompt’s inputSchema
  • Expected Output: (Optional) JSON object of reference responses matching the prompt’s outputSchema
  • Name: (Optional) Human-readable test case name
  • Labels: (Optional) ML-style labels for categorization (e.g., ["edge-case", "regression"])
  • Metadata: (Optional) Arbitrary JSON for filtering and organization

In TypeScript terms, a dataset and its items have the following shapes:
interface Dataset {
  id: number;
  name: string;
  description?: string;
  promptGroupId: string;          // UUID linking to the prompt family
  labels?: string[];              // ML-style labels
  metadata?: Record<string, any>; // Custom fields
  createdBy?: string;             // Creator email
  createdAt: string;
  updatedAt: string;
}

interface DatasetItem {
  id: number;
  datasetId: number;
  name?: string;                       // Human-readable test case name
  input: Record<string, any>;          // Variable values (JSON object)
  expectedOutput?: Record<string, any>; // Expected response (JSON object)
  userFeedback?: string;               // Human feedback
  systemFeedback?: string;             // Automated feedback
  labels?: string[];                   // Test case labels
  metadata?: Record<string, any>;      // Custom fields
  createdAt: string;
  updatedAt: string;
}
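As a concrete illustration, a single test case object could look like the following. The type is trimmed from the DatasetItem interface above (server-assigned fields such as id and timestamps are omitted), and all field values are invented, not taken from any real dataset:

```typescript
// Trimmed from the DatasetItem interface above so this sketch is
// self-contained; id, datasetId, and timestamps are server-assigned.
interface ExampleItem {
  name?: string;
  input: Record<string, any>;
  expectedOutput?: Record<string, any>;
  labels?: string[];
  metadata?: Record<string, any>;
}

// An illustrative test case: input matches the prompt's variables,
// expectedOutput is the reference answer, labels support filtering.
const item: ExampleItem = {
  name: 'Password reset - happy path',
  input: {
    customer_name: 'John',
    question: 'How do I reset my password?',
  },
  expectedOutput: {
    response: 'Go to Settings > Security > Reset Password.',
  },
  labels: ['authentication', 'happy-path'],
  metadata: { source: 'manual-curation' },
};
```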

Use Cases

Quality Testing

Verify prompts produce expected results across diverse inputs. Catch issues before deployment.

Regression Testing

Ensure changes don’t break existing behavior. Run the same tests after each prompt update.

Optimization Training

Provide training data for the optimization engine. Higher-quality datasets lead to better-optimized prompts.

Benchmarking

Compare different prompt versions objectively. Track improvement over time with consistent test cases.

How Datasets Work

Two-Step Creation

Datasets follow a two-step creation pattern:
  1. Create dataset metadata — name, description, and association to a prompt
  2. Bulk insert items — upload test case items via file or inline JSON

Quick Example

import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ apiKey: process.env.MUTAGENT_API_KEY });

// Step 1: Create dataset metadata (scoped to a prompt)
const dataset = await client.prompt.createDatasetForPrompt({
  id: 123,  // prompt ID
  name: 'Support Scenarios',
  description: 'Common customer support questions and expected answers',
});

// Step 2: Bulk insert test case items
await client.prompt.bulkCreateDatasetItems({
  id: dataset.id,
  items: [
    {
      input: {
        customer_name: 'John',
        question: 'How do I reset my password?',
      },
      expectedOutput: {
        response: 'To reset your password, go to Settings > Security > Reset Password.',
      },
      name: 'Password reset - happy path',
      labels: ['authentication', 'happy-path'],
    },
    {
      input: {
        customer_name: 'Sarah',
        question: 'What are your business hours?',
      },
      expectedOutput: {
        response: 'We are open Monday through Friday, 9am to 5pm EST.',
      },
      labels: ['general', 'happy-path'],
    },
  ],
});
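Because labels and metadata exist for filtering and organization, a common pattern is to narrow a set of items to one category client-side (for example, running only regression cases). A minimal sketch — the Item shape and filterByLabel helper here are illustrative assumptions, not part of the SDK surface:

```typescript
// Minimal item shape mirroring the label-carrying fields of DatasetItem.
interface Item {
  input: Record<string, any>;
  labels?: string[];
}

// Keep only the items tagged with the given label.
function filterByLabel(items: Item[], label: string): Item[] {
  return items.filter((it) => it.labels?.includes(label) ?? false);
}

const items: Item[] = [
  { input: { question: 'How do I reset my password?' }, labels: ['authentication', 'happy-path'] },
  { input: { question: 'What are your business hours?' }, labels: ['general', 'happy-path'] },
  { input: { question: 'Can I get a refund after 90 days?' }, labels: ['billing', 'edge-case'] },
];

const happyPath = filterByLabel(items, 'happy-path'); // two of the three items
```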

Dataset Types

Golden — high-quality, human-verified test cases that serve as the benchmark:
  • Carefully curated inputs
  • Expert-written expected outputs
  • Used for critical quality gates

Synthetic — generated test cases for broader coverage:
  • Created from templates or rules
  • Useful for stress testing
  • May not have expected outputs

Production — real-world data captured from live traffic:
  • Actual user inputs
  • Representative of real usage
  • May require anonymization
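The "created from templates or rules" approach for synthetic cases can be sketched as a simple cross-product generator. The generateCases helper and its template format are assumptions for illustration; note the generated items deliberately omit expectedOutput, as synthetic datasets often do:

```typescript
// Shape of a generated test case: input only, no reference answer.
interface SyntheticItem {
  input: Record<string, any>;
  labels?: string[];
}

// Cross every customer name with every question template to broaden
// coverage cheaply. Each item is tagged 'synthetic' for later filtering.
function generateCases(names: string[], questions: string[]): SyntheticItem[] {
  const cases: SyntheticItem[] = [];
  for (const customer_name of names) {
    for (const question of questions) {
      cases.push({
        input: { customer_name, question },
        labels: ['synthetic'],
      });
    }
  }
  return cases;
}

const generated = generateCases(
  ['John', 'Sarah'],
  ['How do I reset my password?', 'What are your business hours?'],
); // 2 names x 2 questions = 4 test cases
```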

What’s Next?

Creating Datasets

Learn how to build datasets with items, imports, and the CLI

Best Practices

Guidelines for effective dataset design and management