Evaluation Metrics

MutagenT provides built-in metrics to measure prompt quality from different angles. Choose the right combination of metrics based on your use case.

Metric Categories

  • LLM-Based - AI judges assess quality using reasoning
  • Embedding-Based - Semantic comparison using vector similarity
  • Deterministic - Exact rules with predictable results

G-Eval

AI-powered evaluation using a judge model to assess quality holistically.
metrics: ['g_eval']

How It Works

G-Eval uses a powerful judge LLM (e.g., GPT-4 or Claude) to evaluate responses against multiple criteria:
  1. Relevance - Does the output address the input?
  2. Coherence - Is the response logical and well-structured?
  3. Factual Accuracy - Are claims correct and verifiable?
  4. Completeness - Does it fully answer the question?
  5. Tone/Style - Does it match expected style guidelines?
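As a rough illustration of what the judge sees, here is a conceptual sketch of a judge prompt. The helper name, prompt wording, and criteria handling are illustrative assumptions, not MutagenT's internal implementation.

// Conceptual sketch only: prompt wording and helper names are illustrative
// assumptions, not MutagenT's internal implementation.
interface GEvalCase {
  input: string;
  output: string;
  criteria: string[];
}

// Builds the kind of instruction a judge model receives for one test case.
function buildJudgePrompt({ input, output, criteria }: GEvalCase): string {
  return [
    'You are an evaluation judge. Score the response from 0.0 to 1.0.',
    `Criteria: ${criteria.join(', ')}`,
    `Input: ${input}`,
    `Response: ${output}`,
    'Reply with only the numeric score.',
  ].join('\n');
}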

Scoring

Score Range   Interpretation
0.9 - 1.0     Excellent - Production ready
0.8 - 0.9     Good - Minor improvements possible
0.7 - 0.8     Fair - Needs attention
< 0.7         Poor - Significant issues

Best For

  • General quality assessment
  • Subjective evaluation (tone, helpfulness)
  • Complex outputs where exact matching isn’t practical
  • When you want human-like judgment

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval'],
  config: {
    g_eval: {
      model: 'gpt-5.1',  // Judge model
      criteria: ['relevance', 'coherence', 'completeness'],
    },
  },
});

Semantic Similarity

Measures how semantically similar the output is to the expected output using embedding models.
metrics: ['semantic_similarity']

How It Works

  1. Generate embeddings for both actual and expected outputs
  2. Calculate cosine similarity between embedding vectors
  3. Score ranges from 0 (completely different) to 1 (identical meaning)
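Step 2 is standard cosine similarity; a minimal sketch, assuming both embeddings arrive as plain number arrays of equal length:

// Cosine similarity between two embedding vectors (step 2 above).
// Assumes equal dimensionality and non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}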

Scoring

Score Range   Interpretation
0.9 - 1.0     Nearly identical meaning
0.8 - 0.9     Same core meaning, different wording
0.7 - 0.8     Related but noticeably different
< 0.7         Substantially different meaning

Best For

  • Checking if output preserves intended meaning
  • Allowing flexibility in wording
  • Paraphrase detection
  • Content that should say the same thing differently

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['semantic_similarity'],
  config: {
    semantic_similarity: {
      model: 'text-embedding-3-large',
      threshold: 0.8,  // Minimum acceptable similarity
    },
  },
});
Semantic similarity requires expected outputs in your dataset items.

Exact Match

Checks if output exactly matches the expected output.
metrics: ['exact_match']

How It Works

Binary comparison: 1.0 if strings are identical, 0.0 otherwise. Options for flexibility:
  • Case sensitivity (on/off)
  • Whitespace normalization
  • Punctuation handling
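A minimal sketch of the comparison with these options applied. The helper is illustrative, but the option names mirror the config in the example below.

// Illustrative helper: exact match with optional normalization.
function exactMatchScore(
  actual: string,
  expected: string,
  opts: { caseSensitive?: boolean; normalizeWhitespace?: boolean } = {},
): number {
  let a = actual;
  let e = expected;
  if (!opts.caseSensitive) {
    a = a.toLowerCase();
    e = e.toLowerCase();
  }
  if (opts.normalizeWhitespace) {
    a = a.trim().replace(/\s+/g, ' ');
    e = e.trim().replace(/\s+/g, ' ');
  }
  return a === e ? 1.0 : 0.0;
}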

Scoring

Score   Meaning
1.0     Exact match
0.0     Any difference

Best For

  • Classification tasks (“positive”, “negative”, “neutral”)
  • Structured outputs (JSON, specific formats)
  • Simple Q&A with definitive answers
  • Extraction tasks with expected values

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['exact_match'],
  config: {
    exact_match: {
      caseSensitive: false,
      normalizeWhitespace: true,
    },
  },
});

Contains

Checks if output contains specific required text or patterns.
metrics: ['contains']

How It Works

Searches the output for one or more required substrings; the score reflects how many are found, as sketched below.
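A minimal sketch of the proportional scoring, assuming the score is simply the fraction of required strings found (the helper itself is illustrative):

// Illustrative helper: fraction of required strings present in the output.
function containsScore(
  output: string,
  requiredStrings: string[],
  caseSensitive = false,
): number {
  const haystack = caseSensitive ? output : output.toLowerCase();
  const found = requiredStrings.filter((s) =>
    haystack.includes(caseSensitive ? s : s.toLowerCase()),
  );
  return found.length / requiredStrings.length;
}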

Scoring

Score   Meaning
1.0     All required strings found
0.5     Some required strings found (score is proportional)
0.0     None found

Best For

  • Verifying key information is present
  • Checking for required disclaimers
  • Ensuring specific terms are mentioned
  • Partial matching when exact match is too strict

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['contains'],
  config: {
    contains: {
      requiredStrings: ['refund policy', 'contact support'],
      caseSensitive: false,
    },
  },
});

// Or use expectedOutput as the required string
// Dataset item: { expectedOutput: 'must contain this phrase' }

Regex Match

Pattern matching against output using regular expressions.
metrics: ['regex_match']

How It Works

Tests if output matches specified regex patterns. Useful for format validation.
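A minimal sketch of the check, assuming every configured pattern must match (whether a single match suffices is an assumption here, not confirmed behavior):

// Illustrative helper, assuming every configured pattern must match.
function regexMatchScore(output: string, patterns: string[]): number {
  return patterns.every((p) => new RegExp(p).test(output)) ? 1.0 : 0.0;
}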

Scoring

Score   Meaning
1.0     Pattern matches
0.0     Pattern doesn’t match

Best For

  • Format validation (dates, emails, IDs)
  • Structure verification
  • Extracting and validating patterns
  • Ensuring specific format compliance

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['regex_match'],
  config: {
    regex_match: {
      patterns: [
        '^\\d{4}-\\d{2}-\\d{2}$',  // Date format
        'Order #[A-Z0-9]{8}',       // Order ID format
      ],
    },
  },
});

Custom Metrics

Define your own evaluation criteria for domain-specific needs.
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  customMetrics: [
    {
      name: 'brand_voice',
      description: 'Measures adherence to brand voice guidelines',
      rubric: `
        5 - Perfect brand voice: friendly, professional, on-brand
        4 - Minor deviations: mostly on-brand with small issues
        3 - Noticeable issues: tone inconsistencies
        2 - Major issues: frequently off-brand
        1 - Completely off-brand: doesn't match guidelines at all
      `,
    },
    {
      name: 'technical_accuracy',
      description: 'Checks technical correctness of information',
      rubric: `
        5 - All technical details are accurate
        4 - Minor inaccuracies that don't affect understanding
        3 - Some inaccuracies that could confuse users
        2 - Significant technical errors
        1 - Completely inaccurate technical information
      `,
    },
  ],
});

Custom Metric Structure

interface CustomMetric {
  name: string;           // Unique identifier
  description: string;    // What this metric measures
  rubric: string;         // Scoring criteria for the judge
  weight?: number;        // Importance (default: 1.0)
}
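For reference, a metric with an explicit weight conforming to this interface might look like the following (the values are illustrative):

// Illustrative metric object; the rubric is abbreviated and the weight is arbitrary.
const safetyMetric: CustomMetric = {
  name: 'safety',
  description: 'Flags harmful or disallowed content',
  rubric: '5 - No safety issues ... 1 - Clearly harmful content',
  weight: 2.0,  // counts twice as much as a default-weight metric
};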

Metric Combinations

Use multiple metrics for comprehensive evaluation:
// Comprehensive evaluation setup
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
  customMetrics: [
    { name: 'safety', description: 'No harmful content', rubric: '...' },
  ],
  config: {
    weights: {
      g_eval: 0.4,
      semantic_similarity: 0.3,
      contains: 0.2,
      safety: 0.1,
    },
  },
});
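How the weighted scores combine into an overall score is not shown above; a plausible sketch, assuming a simple weighted average (an assumption, not confirmed MutagenT behavior):

// Sketch of a weighted average over per-metric scores.
// The aggregation method is an assumption, not documented MutagenT behavior.
function overallScore(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let weighted = 0;
  let total = 0;
  for (const [metric, score] of Object.entries(scores)) {
    const w = weights[metric] ?? 1.0;
    weighted += score * w;
    total += w;
  }
  return total > 0 ? weighted / total : 0;
}

// e.g. overallScore(
//   { g_eval: 0.85, semantic_similarity: 0.9, contains: 1.0, safety: 1.0 },
//   { g_eval: 0.4, semantic_similarity: 0.3, contains: 0.2, safety: 0.1 },
// );  // => 0.91 with these weights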
Use Case            Recommended Metrics
Customer Support    G-Eval, Semantic Similarity, Contains (key info)
Classification      Exact Match, G-Eval (edge cases)
Content Generation  G-Eval, Custom (brand voice, style)
Data Extraction     Regex Match, Exact Match, Contains
Code Generation     G-Eval, Custom (correctness), Regex (syntax)

Scoring Summary

Metric                Score Range         Type         Requires Expected Output
G-Eval                0.0 - 1.0           Continuous   No
Semantic Similarity   0.0 - 1.0           Continuous   Yes
Exact Match           0 or 1              Binary       Yes
Contains              0.0 - 1.0           Continuous   Yes (or config)
Regex Match           0 or 1              Binary       No (uses config)
Custom                Defined in rubric   Varies       No
Choose metrics that match your evaluation goals. Using only exact match for creative tasks will produce misleading low scores. Using only G-Eval for classification tasks may miss format errors.