Datasets

Datasets are collections of test cases used to evaluate and optimize your prompts. They provide the ground truth against which your prompts are measured, enabling systematic quality assurance and continuous improvement.

What is a Dataset?

A dataset is scoped to a prompt (linked via promptGroupId) and contains test case items with:
  • Input: JSON object of variable values matching the prompt’s inputSchema
  • Expected Output: (Optional) JSON object of reference responses matching the prompt’s outputSchema
  • Name: (Optional) Human-readable test case name
  • Labels: (Optional) ML-style labels for categorization (e.g., ["edge-case", "regression"])
  • Metadata: (Optional) Arbitrary JSON for filtering and organization

In TypeScript terms, a dataset and its items have the following shapes:
interface Dataset {
  id: number;
  name: string;
  description?: string;
  promptGroupId: string;          // UUID linking to the prompt family
  labels?: string[];              // ML-style labels
  metadata?: Record<string, any>; // Custom fields
  createdBy?: string;             // Creator email
  createdAt: string;
  updatedAt: string;
}

interface DatasetItem {
  id: number;
  datasetId: number;
  name?: string;                       // Human-readable test case name
  input: Record<string, any>;          // Variable values (JSON object)
  expectedOutput?: Record<string, any>; // Expected response (JSON object)
  userFeedback?: string;               // Human feedback
  systemFeedback?: string;             // Automated feedback
  labels?: string[];                   // Test case labels
  metadata?: Record<string, any>;      // Custom fields
  createdAt: string;
  updatedAt: string;
}
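As a concrete illustration, a single test case object could look like the following. The type is trimmed from the DatasetItem interface above (server-assigned fields such as id and timestamps are omitted), and all field values are invented, not taken from any real dataset:

```typescript
// Trimmed from the DatasetItem interface above so this sketch is
// self-contained; id, datasetId, and timestamps are server-assigned.
interface ExampleItem {
  name?: string;
  input: Record<string, any>;
  expectedOutput?: Record<string, any>;
  labels?: string[];
  metadata?: Record<string, any>;
}

// An illustrative test case: input matches the prompt's variables,
// expectedOutput is the reference answer, labels support filtering.
const item: ExampleItem = {
  name: 'Password reset - happy path',
  input: {
    customer_name: 'John',
    question: 'How do I reset my password?',
  },
  expectedOutput: {
    response: 'Go to Settings > Security > Reset Password.',
  },
  labels: ['authentication', 'happy-path'],
  metadata: { source: 'manual-curation' },
};
```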

Use Cases

Quality Testing

Verify prompts produce expected results across diverse inputs. Catch issues before deployment.

Regression Testing

Ensure changes don’t break existing behavior. Run the same tests after each prompt update.

Optimization Training

Provide training data for the optimization engine. Higher-quality datasets lead to better-optimized prompts.

Benchmarking

Compare different prompt versions objectively. Track improvement over time with consistent test cases.

How Datasets Work

Two-Step Creation

Datasets follow a two-step creation pattern:
  1. Create dataset metadata — name, description, and association to a prompt
  2. Bulk insert items — upload test case items via file or inline JSON

Quick Example

import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ apiKey: process.env.MUTAGENT_API_KEY });

// Step 1: Create dataset metadata (scoped to a prompt)
const dataset = await client.prompt.createDatasetForPrompt({
  id: 123,  // prompt ID
  name: 'Support Scenarios',
  description: 'Common customer support questions and expected answers',
});

// Step 2: Bulk insert test case items
await client.prompt.bulkCreateDatasetItems({
  id: dataset.id,
  items: [
    {
      input: {
        customer_name: 'John',
        question: 'How do I reset my password?',
      },
      expectedOutput: {
        response: 'To reset your password, go to Settings > Security > Reset Password.',
      },
      name: 'Password reset - happy path',
      labels: ['authentication', 'happy-path'],
    },
    {
      input: {
        customer_name: 'Sarah',
        question: 'What are your business hours?',
      },
      expectedOutput: {
        response: 'We are open Monday through Friday, 9am to 5pm EST.',
      },
      labels: ['general', 'happy-path'],
    },
  ],
});
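Because labels and metadata exist for filtering and organization, a common pattern is to narrow a set of items to one category client-side (for example, running only regression cases). A minimal sketch — the Item shape and filterByLabel helper here are illustrative assumptions, not part of the SDK surface:

```typescript
// Minimal item shape mirroring the label-carrying fields of DatasetItem.
interface Item {
  input: Record<string, any>;
  labels?: string[];
}

// Keep only the items tagged with the given label.
function filterByLabel(items: Item[], label: string): Item[] {
  return items.filter((it) => it.labels?.includes(label) ?? false);
}

const items: Item[] = [
  { input: { question: 'How do I reset my password?' }, labels: ['authentication', 'happy-path'] },
  { input: { question: 'What are your business hours?' }, labels: ['general', 'happy-path'] },
  { input: { question: 'Can I get a refund after 90 days?' }, labels: ['billing', 'edge-case'] },
];

const happyPath = filterByLabel(items, 'happy-path'); // two of the three items
```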

Dataset Types

Golden — high-quality, human-verified test cases that serve as the benchmark:
  • Carefully curated inputs
  • Expert-written expected outputs
  • Used for critical quality gates

Synthetic — generated test cases for broader coverage:
  • Created from templates or rules
  • Useful for stress testing
  • May not have expected outputs

Production — real-world data captured from live traffic:
  • Actual user inputs
  • Representative of real usage
  • May require anonymization
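The "created from templates or rules" approach for synthetic cases can be sketched as a simple cross-product generator. The generateCases helper and its template format are assumptions for illustration; note the generated items deliberately omit expectedOutput, as synthetic datasets often do:

```typescript
// Shape of a generated test case: input only, no reference answer.
interface SyntheticItem {
  input: Record<string, any>;
  labels?: string[];
}

// Cross every customer name with every question template to broaden
// coverage cheaply. Each item is tagged 'synthetic' for later filtering.
function generateCases(names: string[], questions: string[]): SyntheticItem[] {
  const cases: SyntheticItem[] = [];
  for (const customer_name of names) {
    for (const question of questions) {
      cases.push({
        input: { customer_name, question },
        labels: ['synthetic'],
      });
    }
  }
  return cases;
}

const generated = generateCases(
  ['John', 'Sarah'],
  ['How do I reset my password?', 'What are your business hours?'],
); // 2 names x 2 questions = 4 test cases
```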

What’s Next?

Creating Datasets

Learn how to build datasets with items, imports, and the CLI

Best Practices

Guidelines for effective dataset design and management