Evaluations

Evaluations automatically measure how well your prompts perform against test datasets. They provide objective, reproducible quality scores that enable data-driven prompt development.

Why Evaluate?

Quality Assurance

Verify prompts meet quality standards before deployment. Catch issues early.

Regression Detection

Automatically detect when changes break existing behavior. Never deploy broken prompts.

Optimization Feedback

Provide scores that guide the automated optimization engine toward better prompts.

Version Comparison

Objectively compare different prompt versions to make informed decisions.

How It Works

You create an evaluation run by pointing a prompt at a dataset and choosing one or more metrics. The platform executes the prompt against every dataset item, scores each output with the selected metrics, and aggregates the results into per-metric and overall scores.

Quick Example

import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: 'sk_live_...' });

// Create and run an evaluation
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity'],
});

console.log('Evaluation started:', evaluation.id);
console.log('Status:', evaluation.status);

// Wait for completion and get results
const results = await client.evaluations.waitForCompletion(evaluation.id);

console.log('Overall Score:', results.overallScore);
console.log('Metric Breakdown:');
for (const [metric, score] of Object.entries(results.metricScores)) {
  console.log(`  ${metric}: ${score.toFixed(2)}`);
}

Available Metrics

| Metric | Type | Description | Best For |
| --- | --- | --- | --- |
| G-Eval | LLM-based | AI judge assesses quality, relevance, coherence | General quality |
| Semantic Similarity | Embedding | Cosine similarity between output and expected | Meaning preservation |
| Exact Match | Deterministic | Binary match against expected output | Classification, structured output |
| Contains | Deterministic | Checks for required substrings | Key information |
| Regex Match | Deterministic | Pattern matching against output | Format validation |
| Custom | LLM-based | Your own evaluation criteria | Domain-specific |
Combine multiple metrics for comprehensive evaluation. G-Eval catches quality issues; semantic similarity catches meaning drift; deterministic metrics catch format errors.
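For example, a single run can pair the LLM judge with an embedding metric and a deterministic check. The sketch below only uses metric keys that already appear on this page ('g_eval', 'semantic_similarity', 'contains') and the same run/waitForCompletion calls as the Quick Example:

// A minimal sketch: combine an LLM judge, an embedding metric, and a
// deterministic substring check in one evaluation run.
const combined = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
});

const combinedResults = await client.evaluations.waitForCompletion(combined.id);

// Reading the breakdown: a low 'contains' score alongside a high 'g_eval'
// score usually points at missing required text rather than poor quality.
for (const [metric, score] of Object.entries(combinedResults.metricScores)) {
  console.log(`${metric}: ${score.toFixed(2)}`);
}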

Evaluation Types

Quick feedback during prompt iteration:
  • Small dataset (5-10 items)
  • Fast turnaround
  • Used during development

await client.evaluations.run({
  promptId,
  datasetId: 'dev_dataset',
  metrics: ['g_eval'],
});

Comprehensive check before publishing:
  • Full dataset (50+ items)
  • Multiple metrics
  • Quality gate enforcement

const evaluation = await client.evaluations.run({
  promptId,
  datasetId: 'golden_dataset',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
});

// Scores are only available once the run has completed
const results = await client.evaluations.waitForCompletion(evaluation.id);

if (results.overallScore < 0.85) {
  throw new Error('Quality gate failed');
}

Compare new version against baseline:
  • Same dataset, different prompt versions
  • Detect performance degradation
  • Track improvement over time

const [runV1, runV2] = await Promise.all([
  client.evaluations.run({ promptId: v1Id, datasetId, metrics }),
  client.evaluations.run({ promptId: v2Id, datasetId, metrics }),
]);

const [evalV1, evalV2] = await Promise.all([
  client.evaluations.waitForCompletion(runV1.id),
  client.evaluations.waitForCompletion(runV2.id),
]);

const improvement = evalV2.overallScore - evalV1.overallScore;
console.log(`V2 is ${improvement > 0 ? 'better' : 'worse'} by ${Math.abs(improvement).toFixed(2)}`);
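To see which metric drove a difference, you can also compare the per-metric averages from the two completed results; a minimal sketch using the metricScores map described under Evaluation Results below:

// Per-metric deltas between the baseline (V1) and the new version (V2).
for (const [metric, v2Score] of Object.entries(evalV2.metricScores)) {
  const v1Score = evalV1.metricScores[metric] ?? 0;
  const delta = v2Score - v1Score;
  console.log(`${metric}: ${v1Score.toFixed(2)} -> ${v2Score.toFixed(2)} (${delta >= 0 ? '+' : ''}${delta.toFixed(2)})`);
}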

Evaluation Results

Results include detailed scoring at multiple levels:

interface EvaluationResults {
  id: string;
  status: 'pending' | 'running' | 'completed' | 'failed';

  // Aggregate scores
  overallScore: number;                    // 0.0 - 1.0
  metricScores: Record<string, number>;    // Per-metric averages

  // Per-item details
  itemResults: Array<{
    datasetItemId: string;
    input: Record<string, any>;
    actualOutput: string;
    expectedOutput?: string;
    scores: Record<string, number>;
    passed: boolean;
  }>;

  // Metadata
  completedItems: number;
  totalItems: number;
  duration: number;                        // milliseconds
  startedAt: Date;
  completedAt?: Date;
}
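The per-item detail is the most useful part for debugging a low score. A minimal sketch, assuming `results` is the completed results object returned by waitForCompletion above:

// List the dataset items that failed, with their outputs and per-metric scores.
const failures = results.itemResults.filter((item) => !item.passed);

console.log(`${failures.length} of ${results.totalItems} items failed`);
for (const item of failures) {
  console.log(`Item ${item.datasetItemId}`);
  console.log(`  expected: ${item.expectedOutput ?? '(none)'}`);
  console.log(`  actual:   ${item.actualOutput}`);
  for (const [metric, score] of Object.entries(item.scores)) {
    console.log(`  ${metric}: ${score.toFixed(2)}`);
  }
}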

Quality Gates

Use evaluations as quality gates in your workflow:

async function deployPrompt(promptId: string) {
  // Run comprehensive evaluation
  const evaluation = await client.evaluations.run({
    promptId,
    datasetId: 'golden_dataset_xxxx',
    metrics: ['g_eval', 'semantic_similarity'],
  });

  // Wait for completion
  const results = await client.evaluations.waitForCompletion(evaluation.id);

  // Enforce quality thresholds
  const QUALITY_THRESHOLD = 0.85;
  const REGRESSION_THRESHOLD = 0.02;

  if (results.overallScore < QUALITY_THRESHOLD) {
    throw new Error(
      `Quality gate failed: ${results.overallScore} < ${QUALITY_THRESHOLD}`
    );
  }

  // Compare against previous version
  const previousScore = await getPreviousScore(promptId);
  if (previousScore - results.overallScore > REGRESSION_THRESHOLD) {
    throw new Error(
      `Regression detected: score dropped from ${previousScore} to ${results.overallScore}`
    );
  }

  // Safe to deploy
  await client.prompts.publish(promptId);
  console.log('Prompt deployed successfully');
}
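In practice you might wire this into a CI step and let any thrown error fail the build. A minimal sketch, assuming a Node.js environment and a placeholder prompt ID:

// Hypothetical CI entry point: a failed quality gate rejects the deployment.
deployPrompt('prompt_xxxx')
  .then(() => {
    console.log('Quality gates passed');
  })
  .catch((err) => {
    console.error('Deployment blocked:', err.message);
    process.exit(1);
  });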

What’s Next?