Evaluations

Evaluations automatically measure how well your prompts perform against test datasets. They provide objective, reproducible quality scores that enable data-driven prompt development.

Why Evaluate?

Quality Assurance

Verify prompts meet quality standards before deployment. Catch issues early.

Regression Detection

Automatically detect when changes break existing behavior. Never deploy broken prompts.

Optimization Feedback

Provide scores that guide the automated optimization engine toward better prompts.

Version Comparison

Objectively compare different prompt versions to make informed decisions.

How It Works

You create an evaluation run by pointing a prompt at a dataset and choosing one or more metrics. The platform executes the prompt against every dataset item, scores each output with the selected metrics, and aggregates the results into per-metric and overall scores.

Quick Example

import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: 'sk_live_...' });

// Create and run an evaluation
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity'],
});

console.log('Evaluation started:', evaluation.id);
console.log('Status:', evaluation.status);

// Wait for completion and get results
const results = await client.evaluations.waitForCompletion(evaluation.id);

console.log('Overall Score:', results.overallScore);
console.log('Metric Breakdown:');
for (const [metric, score] of Object.entries(results.metricScores)) {
  console.log(`  ${metric}: ${score.toFixed(2)}`);
}

Available Metrics

| Metric | Type | Description | Best For |
| --- | --- | --- | --- |
| G-Eval | LLM-based | AI judge assesses quality, relevance, coherence | General quality |
| Semantic Similarity | Embedding | Cosine similarity between output and expected | Meaning preservation |
| Exact Match | Deterministic | Binary match against expected output | Classification, structured output |
| Contains | Deterministic | Checks for required substrings | Key information |
| Regex Match | Deterministic | Pattern matching against output | Format validation |
| Custom | LLM-based | Your own evaluation criteria | Domain-specific |
Combine multiple metrics for comprehensive evaluation. G-Eval catches quality issues; semantic similarity catches meaning drift; deterministic metrics catch format errors.
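For example, a single run can pair the LLM judge with an embedding metric and a deterministic check. The sketch below only uses metric keys that already appear on this page ('g_eval', 'semantic_similarity', 'contains') and the same run/waitForCompletion calls as the Quick Example:

// A minimal sketch: combine an LLM judge, an embedding metric, and a
// deterministic substring check in one evaluation run.
const combined = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
});

const combinedResults = await client.evaluations.waitForCompletion(combined.id);

// Reading the breakdown: a low 'contains' score alongside a high 'g_eval'
// score usually points at missing required text rather than poor quality.
for (const [metric, score] of Object.entries(combinedResults.metricScores)) {
  console.log(`${metric}: ${score.toFixed(2)}`);
}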

Evaluation Types

Quick feedback during prompt iteration:
  • Small dataset (5-10 items)
  • Fast turnaround
  • Used during development

await client.evaluations.run({
  promptId,
  datasetId: 'dev_dataset',
  metrics: ['g_eval'],
});

Comprehensive check before publishing:
  • Full dataset (50+ items)
  • Multiple metrics
  • Quality gate enforcement

const evaluation = await client.evaluations.run({
  promptId,
  datasetId: 'golden_dataset',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
});

// Scores are only available once the run has completed
const results = await client.evaluations.waitForCompletion(evaluation.id);

if (results.overallScore < 0.85) {
  throw new Error('Quality gate failed');
}

Compare new version against baseline:
  • Same dataset, different prompt versions
  • Detect performance degradation
  • Track improvement over time

const [runV1, runV2] = await Promise.all([
  client.evaluations.run({ promptId: v1Id, datasetId, metrics }),
  client.evaluations.run({ promptId: v2Id, datasetId, metrics }),
]);

const [evalV1, evalV2] = await Promise.all([
  client.evaluations.waitForCompletion(runV1.id),
  client.evaluations.waitForCompletion(runV2.id),
]);

const improvement = evalV2.overallScore - evalV1.overallScore;
console.log(`V2 is ${improvement > 0 ? 'better' : 'worse'} by ${Math.abs(improvement).toFixed(2)}`);
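To see which metric drove a difference, you can also compare the per-metric averages from the two completed results; a minimal sketch using the metricScores map described under Evaluation Results below:

// Per-metric deltas between the baseline (V1) and the new version (V2).
for (const [metric, v2Score] of Object.entries(evalV2.metricScores)) {
  const v1Score = evalV1.metricScores[metric] ?? 0;
  const delta = v2Score - v1Score;
  console.log(`${metric}: ${v1Score.toFixed(2)} -> ${v2Score.toFixed(2)} (${delta >= 0 ? '+' : ''}${delta.toFixed(2)})`);
}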

Evaluation Results

Results include detailed scoring at multiple levels:

interface EvaluationResults {
  id: string;
  status: 'pending' | 'running' | 'completed' | 'failed';

  // Aggregate scores
  overallScore: number;                    // 0.0 - 1.0
  metricScores: Record<string, number>;    // Per-metric averages

  // Per-item details
  itemResults: Array<{
    datasetItemId: string;
    input: Record<string, any>;
    actualOutput: string;
    expectedOutput?: string;
    scores: Record<string, number>;
    passed: boolean;
  }>;

  // Metadata
  completedItems: number;
  totalItems: number;
  duration: number;                        // milliseconds
  startedAt: Date;
  completedAt?: Date;
}
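The per-item detail is the most useful part for debugging a low score. A minimal sketch, assuming `results` is the completed results object returned by waitForCompletion above:

// List the dataset items that failed, with their outputs and per-metric scores.
const failures = results.itemResults.filter((item) => !item.passed);

console.log(`${failures.length} of ${results.totalItems} items failed`);
for (const item of failures) {
  console.log(`Item ${item.datasetItemId}`);
  console.log(`  expected: ${item.expectedOutput ?? '(none)'}`);
  console.log(`  actual:   ${item.actualOutput}`);
  for (const [metric, score] of Object.entries(item.scores)) {
    console.log(`  ${metric}: ${score.toFixed(2)}`);
  }
}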

Quality Gates

Use evaluations as quality gates in your workflow:

async function deployPrompt(promptId: string) {
  // Run comprehensive evaluation
  const evaluation = await client.evaluations.run({
    promptId,
    datasetId: 'golden_dataset_xxxx',
    metrics: ['g_eval', 'semantic_similarity'],
  });

  // Wait for completion
  const results = await client.evaluations.waitForCompletion(evaluation.id);

  // Enforce quality thresholds
  const QUALITY_THRESHOLD = 0.85;
  const REGRESSION_THRESHOLD = 0.02;

  if (results.overallScore < QUALITY_THRESHOLD) {
    throw new Error(
      `Quality gate failed: ${results.overallScore} < ${QUALITY_THRESHOLD}`
    );
  }

  // Compare against previous version
  const previousScore = await getPreviousScore(promptId);
  if (previousScore - results.overallScore > REGRESSION_THRESHOLD) {
    throw new Error(
      `Regression detected: score dropped from ${previousScore} to ${results.overallScore}`
    );
  }

  // Safe to deploy
  await client.prompts.publish(promptId);
  console.log('Prompt deployed successfully');
}
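In practice you might wire this into a CI step and let any thrown error fail the build. A minimal sketch, assuming a Node.js environment and a placeholder prompt ID:

// Hypothetical CI entry point: a failed quality gate rejects the deployment.
deployPrompt('prompt_xxxx')
  .then(() => {
    console.log('Quality gates passed');
  })
  .catch((err) => {
    console.error('Deployment blocked:', err.message);
    process.exit(1);
  });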

What’s Next?