
Datasets

Datasets are collections of test cases used to evaluate and optimize your prompts. They provide the ground truth against which your prompts are measured, enabling systematic quality assurance and continuous improvement.

What is a Dataset?

A dataset contains:
  • Input variables: Values to substitute into prompts (maps to your prompt’s {{variables}})
  • Expected outputs: (Optional) Reference responses for comparison scoring
  • Metadata: Additional context for filtering and organization

interface Dataset {
  id: string;
  promptId: string;           // Associated prompt
  name: string;
  description?: string;
  itemCount: number;          // Total test cases
  createdAt: Date;
  updatedAt: Date;
}

interface DatasetItem {
  id: string;
  datasetId: string;
  input: Record<string, any>;     // Variable values
  expectedOutput?: string;         // Expected response
  metadata?: Record<string, any>;  // Custom fields
}
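To make the mapping between `input` and a prompt's {{variables}} concrete, the sketch below extracts variable names from a template and checks that an item supplies a value for each one. The template string and helper functions are illustrative, not part of the SDK:

```typescript
// An item's `input` keys should cover every {{variable}} in the prompt.
interface DatasetItem {
  id: string;
  datasetId: string;
  input: Record<string, any>;
  expectedOutput?: string;
  metadata?: Record<string, any>;
}

// Pull variable names like "customer_name" out of "{{customer_name}}".
function templateVariables(template: string): string[] {
  const matches = template.match(/\{\{(\w+)\}\}/g) ?? [];
  return matches.map((m) => m.slice(2, -2));
}

// List any template variables the item's input does not provide.
function missingVariables(template: string, item: DatasetItem): string[] {
  return templateVariables(template).filter((v) => !(v in item.input));
}

const template = 'Hello {{customer_name}}, you asked: {{question}}';
const item: DatasetItem = {
  id: 'item_1',
  datasetId: 'ds_1',
  input: { customer_name: 'John', question: 'How do I reset my password?' },
};

console.log(missingVariables(template, item)); // []
```

A check like this is useful before calling `addItems`, since an item missing a variable will silently produce a prompt with an unfilled placeholder.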

Use Cases

Quality Testing

Verify prompts produce expected results across diverse inputs. Catch issues before deployment.

Regression Testing

Ensure changes don’t break existing behavior. Run the same tests after each prompt update.
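A regression pass can be as simple as replaying every item and comparing against expectedOutput. In this sketch, `generate` is a stand-in for whatever executes your prompt, and exact string matching is the simplest possible rubric; real scoring is usually fuzzier:

```typescript
interface Item {
  input: Record<string, any>;
  expectedOutput?: string;
}

// Stand-in for prompt execution; swap in a call to your model or SDK.
type Generate = (input: Record<string, any>) => string;

// Replay every item with an expected output and tally exact matches.
function regressionReport(items: Item[], generate: Generate) {
  let passed = 0;
  const failures: Item[] = [];
  for (const item of items) {
    if (item.expectedOutput === undefined) continue; // nothing to compare
    if (generate(item.input).trim() === item.expectedOutput.trim()) {
      passed++;
    } else {
      failures.push(item);
    }
  }
  return { passed, failed: failures.length, failures };
}

// Toy run: a fake "model" that only answers one question correctly.
const items: Item[] = [
  { input: { question: 'ping' }, expectedOutput: 'pong' },
  { input: { question: 'hello' }, expectedOutput: 'world' },
];
const report = regressionReport(items, (i) =>
  i.question === 'ping' ? 'pong' : '???'
);
console.log(report.passed, report.failed); // 1 1
```

Running the same report before and after a prompt change makes a regression visible as a drop in `passed`.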

Optimization Training

Provide training data for the optimization engine. Better datasets lead to better-optimized prompts.

Benchmarking

Compare different prompt versions objectively. Track improvement over time with consistent test cases.

How Datasets Work

Quick Example

import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: 'sk_live_...' });

// Create a dataset for a support prompt
const dataset = await client.datasets.create({
  promptId: 'prompt_xxxx',
  name: 'Support Scenarios',
  description: 'Common customer support questions and expected answers',
});

// Add test cases
await client.datasets.addItems(dataset.id, [
  {
    input: {
      customer_name: 'John',
      question: 'How do I reset my password?'
    },
    expectedOutput: 'To reset your password, go to Settings > Security > Reset Password. You will receive an email with reset instructions.',
  },
  {
    input: {
      customer_name: 'Sarah',
      question: 'What are your business hours?'
    },
    expectedOutput: 'We are open Monday through Friday, 9am to 5pm EST. Weekend support is available via email.',
  },
  {
    input: {
      customer_name: 'Mike',
      question: 'How do I upgrade my plan?'
    },
    expectedOutput: 'To upgrade your plan, navigate to Account > Billing > Upgrade. You can choose from our Pro or Enterprise tiers.',
  },
]);

// Note: `dataset` was captured before addItems, so its itemCount is stale;
// re-fetch the dataset if you need the updated total.
console.log(`Created dataset ${dataset.id} with 3 items`);

Dataset Types

Golden datasets — high-quality, human-verified test cases that serve as the benchmark:
  • Carefully curated inputs
  • Expert-written expected outputs
  • Used for critical quality gates

Synthetic datasets — generated test cases for broader coverage:
  • Created from templates or rules
  • Useful for stress testing
  • May not have expected outputs

Production datasets — real-world data captured from production:
  • Actual user inputs
  • Representative of real usage
  • May require anonymization
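Since production data may require anonymization before it becomes a dataset, here is a minimal regex-based sketch. It only masks email addresses and long digit runs; real PII handling typically needs a more thorough approach:

```typescript
// Redact obvious PII patterns before importing production traffic.
// Assumption: regex masking is sufficient for this data; names,
// addresses, and other identifiers would need dedicated handling.
function anonymize(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]') // email addresses
    .replace(/\d{6,}/g, '[number]'); // account numbers, phone digits, etc.
}

console.log(anonymize('Contact john.doe@example.com, account 12345678'));
// Contact [email], account [number]
```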

What’s Next?