Evaluations

Evaluations define how your prompts are measured against test datasets. An evaluation is a definition entity that specifies what to evaluate (prompt + dataset) and how to evaluate it (criteria and LLM configuration). Evaluation results are produced when the optimization loop runs your evaluation against dataset items.

Why Evaluate?

Quality Assurance

Verify prompts meet quality standards before deployment. Catch issues early with field-level criteria.

Regression Detection

Automatically detect when changes break existing behavior. Compare scores across prompt versions.

Optimization Feedback

Provide scoring criteria that guide the automated optimization engine toward better prompts.

Version Comparison

Objectively compare different prompt versions using consistent metrics and datasets.

How It Works

Evaluations are created as definitions, then executed automatically within the optimization loop or via the dashboard. There is no separate “run evaluation” CLI command — evaluation execution happens as part of prompt optimization.

Creating an Evaluation

An evaluation links a prompt to a dataset with evaluation criteria.
import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ apiKey: process.env.MUTAGENT_API_KEY });

// Create an evaluation definition
const evaluation = await client.promptEvaluations.createEvaluation({
  promptId: 123,
  datasetId: 456,
  name: 'Customer Support Quality Eval',
  description: 'Evaluate response quality for support prompts',
  evalConfig: {
    criteria: [
      { field: 'output', metric: 'g_eval', weight: 0.4 },
      { field: 'output', metric: 'semantic_similarity', weight: 0.3 },
      { field: 'output', metric: 'contains', params: { required: ['refund policy'] }, weight: 0.3 },
    ],
    threshold: 0.8,
  },
  llmConfig: {
    model: 'claude-sonnet-4-6',
    temperature: 0.7,
  },
});

console.log('Evaluation created:', evaluation.id);

Available Metrics

| Metric | Type | Description | Best For |
|---|---|---|---|
| G-Eval | LLM-based | AI judge assesses quality, relevance, coherence | General quality |
| Semantic Similarity | Embedding | Cosine similarity between output and expected | Meaning preservation |
| Exact Match | Deterministic | Binary match against expected output | Classification, structured |
| Contains | Deterministic | Checks for required substrings | Key information |
| Regex Match | Deterministic | Pattern matching against output | Format validation |
| Custom | LLM-based | Your own evaluation criteria | Domain-specific |
Combine multiple metrics for comprehensive evaluation. G-Eval catches quality issues; semantic similarity catches meaning drift; deterministic metrics catch format errors.
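To build intuition for how weighted criteria combine, here is a local sketch of deterministic scoring and weighted aggregation. This is not the platform's implementation — the actual scoring runs server-side — but it illustrates the arithmetic: each criterion produces a 0.0–1.0 score, scores are averaged by weight, and the result is compared to the threshold.

```typescript
// Local sketch (not the platform implementation) of deterministic metric
// scoring and weighted aggregation.
type Criterion = {
  metric: 'exact_match' | 'contains' | 'regex_match';
  weight?: number;
  params?: Record<string, any>;
};

function scoreMetric(c: Criterion, output: string, expected: string): number {
  switch (c.metric) {
    case 'exact_match':
      // Binary match against the expected output
      return output === expected ? 1 : 0;
    case 'contains': {
      // Fraction of required substrings present in the output
      const required: string[] = c.params?.required ?? [];
      const hits = required.filter((s) => output.includes(s)).length;
      return required.length ? hits / required.length : 1;
    }
    case 'regex_match':
      // Pattern match against the output
      return new RegExp(c.params?.pattern ?? '').test(output) ? 1 : 0;
  }
}

// Weighted average of per-metric scores, compared against the threshold.
function aggregate(
  criteria: Criterion[],
  output: string,
  expected: string,
  threshold = 0.8
): { score: number; success: boolean } {
  const totalWeight = criteria.reduce((sum, c) => sum + (c.weight ?? 1), 0);
  const score =
    criteria.reduce(
      (sum, c) => sum + scoreMetric(c, output, expected) * (c.weight ?? 1),
      0
    ) / totalWeight;
  return { score, success: score >= threshold };
}
```

Note that missing weights default to 1.0, so a single-criterion config needs no weight at all.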

Evaluation Configuration

The evalConfig object defines the criteria for scoring. Criteria are field-level and can target input or output:
interface EvalConfig {
  criteria: Array<{
    field: 'input' | 'output';       // Which field to evaluate
    metric: string;                   // Metric name (g_eval, semantic_similarity, etc.)
    weight?: number;                  // Importance weight (default: 1.0)
    params?: Record<string, any>;     // Metric-specific parameters
  }>;
  threshold?: number;                 // Minimum acceptable score (0.0 - 1.0)
}
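A malformed evalConfig is easiest to catch before it reaches the API. The following is a hypothetical client-side sanity check — the platform may apply its own validation server-side, and the KNOWN_METRICS list here is illustrative, inferred from the metric names used in this page's examples:

```typescript
// Hypothetical pre-flight validation for an evalConfig; the metric list
// below is an assumption based on this page's examples.
interface EvalCriterion {
  field: 'input' | 'output';
  metric: string;
  weight?: number;
  params?: Record<string, any>;
}

interface EvalConfig {
  criteria: EvalCriterion[];
  threshold?: number;
}

const KNOWN_METRICS = [
  'g_eval',
  'semantic_similarity',
  'exact_match',
  'contains',
  'regex_match',
  'custom',
];

function validateEvalConfig(config: EvalConfig): string[] {
  const errors: string[] = [];
  if (config.criteria.length === 0) {
    errors.push('criteria must not be empty');
  }
  for (const c of config.criteria) {
    if (!KNOWN_METRICS.includes(c.metric)) {
      errors.push(`unknown metric: ${c.metric}`);
    }
    if (c.weight !== undefined && c.weight <= 0) {
      errors.push(`weight must be positive for metric: ${c.metric}`);
    }
  }
  if (config.threshold !== undefined && (config.threshold < 0 || config.threshold > 1)) {
    errors.push('threshold must be between 0.0 and 1.0');
  }
  return errors;
}
```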

Criteria Examples

Quick feedback during prompt iteration:
  • Small dataset (5-10 items)
  • Single metric for fast turnaround
{
  "criteria": [
    { "field": "output", "metric": "g_eval", "weight": 1.0 }
  ],
  "threshold": 0.7
}
Comprehensive check before publishing:
  • Full dataset (50+ items)
  • Multiple metrics with weighted scoring
  • Higher quality threshold
{
  "criteria": [
    { "field": "output", "metric": "g_eval", "weight": 0.4 },
    { "field": "output", "metric": "semantic_similarity", "weight": 0.3 },
    { "field": "output", "metric": "contains", "params": { "required": ["disclaimer"] }, "weight": 0.3 }
  ],
  "threshold": 0.85
}
Compare new version against baseline:
  • Same dataset, different prompt versions
  • Consistent criteria across evaluations
  • Track improvement over time
{
  "criteria": [
    { "field": "output", "metric": "g_eval", "weight": 0.5 },
    { "field": "output", "metric": "semantic_similarity", "weight": 0.5 }
  ],
  "threshold": 0.8
}
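For version comparison, the aggregate score alone can hide a trade-off (one metric improving while another degrades). A minimal regression check over per-metric results — assuming you have already fetched `metricResults` for the baseline and candidate runs via `getEvaluationResult` — might look like this; the 0.02 tolerance is an illustrative choice, not a platform default:

```typescript
// Compare per-metric results between a baseline and a candidate prompt
// version; returns the metrics that regressed beyond the tolerance.
type MetricResults = Record<string, number>;

function findRegressions(
  baseline: MetricResults,
  candidate: MetricResults,
  tolerance = 0.02
): string[] {
  return Object.keys(baseline).filter(
    // A missing metric in the candidate counts as a full regression
    (metric) => (candidate[metric] ?? 0) < baseline[metric] - tolerance
  );
}
```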

Evaluation Results

Results are generated when the optimization loop executes your evaluation. Retrieve them via the API:
// Get results for an evaluation
const result = await client.promptEvaluations.getEvaluationResult({ id: 1 });

console.log('Score:', result.score);
console.log('Success:', result.success);
console.log('Metric Results:', result.metricResults);
console.log('Execution Time:', result.executionTime, 'ms');

Result Structure

interface EvaluationResult {
  id: number;
  evaluationId: number;           // Parent evaluation definition
  actualOutput: object;           // LLM output as JSON
  success: boolean;               // Whether the evaluation passed
  score: number | null;           // Numeric score (0.0 - 1.0)
  metricResults: object;          // Per-metric results (e.g., {"g_eval": 0.95, "semantic_similarity": 0.82})
  executionTime: number | null;   // Execution time in milliseconds
  createdAt: string;              // Result timestamp
}
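When a run misses its threshold, the `metricResults` object tells you which criterion to improve. A small sketch for surfacing the weakest metric (assuming per-metric values are plain numbers, as in the example above):

```typescript
// Find the lowest-scoring metric in a result's metricResults object —
// a sketch, assuming values are plain 0.0-1.0 numbers.
function weakestMetric(
  metricResults: Record<string, number>
): [name: string, score: number] | null {
  const entries = Object.entries(metricResults);
  if (entries.length === 0) return null;
  return entries.reduce((min, entry) => (entry[1] < min[1] ? entry : min));
}
```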

Quality Gates

Use evaluations as quality gates in your workflow:
async function deployPrompt(evalId: number) {
  // Get the latest result for this evaluation definition
  const result = await client.promptEvaluations.getEvaluationResult({ id: evalId });

  // Enforce quality thresholds
  const QUALITY_THRESHOLD = 0.85;

  // A missing score is treated as a failure; checking against null avoids
  // skipping the comparison when the score is exactly 0
  if (!result.success || result.score === null || result.score < QUALITY_THRESHOLD) {
    throw new Error(
      `Quality gate failed: score ${result.score} < ${QUALITY_THRESHOLD}`
    );
  }

  // Safe to deploy
  console.log('Quality gate passed, deploying prompt');
}

What’s Next?

Evaluation Metrics

Deep dive into available metrics and when to use each

Running Evaluations

Learn how evaluations execute within the optimization workflow