Evaluation Metrics

MutagenT provides built-in metrics to measure prompt quality from different angles. Choose the right combination of metrics based on your use case.

Metric Categories

  • LLM-Based - AI judges assess quality using reasoning
  • Embedding-Based - Semantic comparison using vector similarity
  • Deterministic - Exact rules with predictable results

G-Eval

AI-powered evaluation using a judge model to assess quality holistically.
metrics: ['g_eval']

How It Works

G-Eval uses a powerful judge LLM (e.g., GPT-4 or Claude) to evaluate responses against multiple criteria:
  1. Relevance - Does the output address the input?
  2. Coherence - Is the response logical and well-structured?
  3. Factual Accuracy - Are claims correct and verifiable?
  4. Completeness - Does it fully answer the question?
  5. Tone/Style - Does it match expected style guidelines?
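As a rough illustration of what the judge sees, here is a conceptual sketch of a judge prompt. The helper name, prompt wording, and criteria handling are illustrative assumptions, not MutagenT's internal implementation.

// Conceptual sketch only: prompt wording and helper names are illustrative
// assumptions, not MutagenT's internal implementation.
interface GEvalCase {
  input: string;
  output: string;
  criteria: string[];
}

// Builds the kind of instruction a judge model receives for one test case.
function buildJudgePrompt({ input, output, criteria }: GEvalCase): string {
  return [
    'You are an evaluation judge. Score the response from 0.0 to 1.0.',
    `Criteria: ${criteria.join(', ')}`,
    `Input: ${input}`,
    `Response: ${output}`,
    'Reply with only the numeric score.',
  ].join('\n');
}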

Scoring

Score Range   Interpretation
0.9 - 1.0     Excellent - Production ready
0.8 - 0.9     Good - Minor improvements possible
0.7 - 0.8     Fair - Needs attention
< 0.7         Poor - Significant issues

Best For

  • General quality assessment
  • Subjective evaluation (tone, helpfulness)
  • Complex outputs where exact matching isn’t practical
  • When you want human-like judgment

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval'],
  config: {
    g_eval: {
      model: 'gpt-5.1',  // Judge model
      criteria: ['relevance', 'coherence', 'completeness'],
    },
  },
});

Semantic Similarity

Measures how semantically similar the output is to the expected output using embedding models.
metrics: ['semantic_similarity']

How It Works

  1. Generate embeddings for both actual and expected outputs
  2. Calculate cosine similarity between embedding vectors
  3. Score ranges from 0 (completely different) to 1 (identical meaning)
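Step 2 is standard cosine similarity; a minimal sketch, assuming both embeddings arrive as plain number arrays of equal length:

// Cosine similarity between two embedding vectors (step 2 above).
// Assumes equal dimensionality and non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}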

Scoring

Score Range   Interpretation
0.9 - 1.0     Nearly identical meaning
0.8 - 0.9     Same core meaning, different wording
0.7 - 0.8     Related but noticeably different
< 0.7         Substantially different meaning

Best For

  • Checking if output preserves intended meaning
  • Allowing flexibility in wording
  • Paraphrase detection
  • Content that should say the same thing differently

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['semantic_similarity'],
  config: {
    semantic_similarity: {
      model: 'text-embedding-3-large',
      threshold: 0.8,  // Minimum acceptable similarity
    },
  },
});
Semantic similarity requires expected outputs in your dataset items.

Exact Match

Checks if output exactly matches the expected output.
metrics: ['exact_match']

How It Works

Binary comparison: 1.0 if strings are identical, 0.0 otherwise. Options for flexibility:
  • Case sensitivity (on/off)
  • Whitespace normalization
  • Punctuation handling
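A minimal sketch of the comparison with these options applied. The helper is illustrative, but the option names mirror the config in the example below.

// Illustrative helper: exact match with optional normalization.
function exactMatchScore(
  actual: string,
  expected: string,
  opts: { caseSensitive?: boolean; normalizeWhitespace?: boolean } = {},
): number {
  let a = actual;
  let e = expected;
  if (!opts.caseSensitive) {
    a = a.toLowerCase();
    e = e.toLowerCase();
  }
  if (opts.normalizeWhitespace) {
    a = a.trim().replace(/\s+/g, ' ');
    e = e.trim().replace(/\s+/g, ' ');
  }
  return a === e ? 1.0 : 0.0;
}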

Scoring

Score   Meaning
1.0     Exact match
0.0     Any difference

Best For

  • Classification tasks (“positive”, “negative”, “neutral”)
  • Structured outputs (JSON, specific formats)
  • Simple Q&A with definitive answers
  • Extraction tasks with expected values

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['exact_match'],
  config: {
    exact_match: {
      caseSensitive: false,
      normalizeWhitespace: true,
    },
  },
});

Contains

Checks if output contains specific required text or patterns.
metrics: ['contains']

How It Works

Searches the output for one or more required substrings; the score reflects how many are found, as sketched below.
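A minimal sketch of the proportional scoring, assuming the score is simply the fraction of required strings found (the helper itself is illustrative):

// Illustrative helper: fraction of required strings present in the output.
function containsScore(
  output: string,
  requiredStrings: string[],
  caseSensitive = false,
): number {
  const haystack = caseSensitive ? output : output.toLowerCase();
  const found = requiredStrings.filter((s) =>
    haystack.includes(caseSensitive ? s : s.toLowerCase()),
  );
  return found.length / requiredStrings.length;
}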

Scoring

Score   Meaning
1.0     All required strings found
0.5     Some required strings found (score is proportional)
0.0     None found

Best For

  • Verifying key information is present
  • Checking for required disclaimers
  • Ensuring specific terms are mentioned
  • Partial matching when exact match is too strict

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['contains'],
  config: {
    contains: {
      requiredStrings: ['refund policy', 'contact support'],
      caseSensitive: false,
    },
  },
});

// Or use expectedOutput as the required string
// Dataset item: { expectedOutput: 'must contain this phrase' }

Regex Match

Pattern matching against output using regular expressions.
metrics: ['regex_match']

How It Works

Tests if output matches specified regex patterns. Useful for format validation.
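A minimal sketch of the check, assuming every configured pattern must match (whether a single match suffices is an assumption here, not confirmed behavior):

// Illustrative helper, assuming every configured pattern must match.
function regexMatchScore(output: string, patterns: string[]): number {
  return patterns.every((p) => new RegExp(p).test(output)) ? 1.0 : 0.0;
}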

Scoring

Score   Meaning
1.0     Pattern matches
0.0     Pattern doesn’t match

Best For

  • Format validation (dates, emails, IDs)
  • Structure verification
  • Extracting and validating patterns
  • Ensuring specific format compliance

Example

const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['regex_match'],
  config: {
    regex_match: {
      patterns: [
        '^\\d{4}-\\d{2}-\\d{2}$',  // Date format
        'Order #[A-Z0-9]{8}',       // Order ID format
      ],
    },
  },
});

Custom Metrics

Define your own evaluation criteria for domain-specific needs.
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  customMetrics: [
    {
      name: 'brand_voice',
      description: 'Measures adherence to brand voice guidelines',
      rubric: `
        5 - Perfect brand voice: friendly, professional, on-brand
        4 - Minor deviations: mostly on-brand with small issues
        3 - Noticeable issues: tone inconsistencies
        2 - Major issues: frequently off-brand
        1 - Completely off-brand: doesn't match guidelines at all
      `,
    },
    {
      name: 'technical_accuracy',
      description: 'Checks technical correctness of information',
      rubric: `
        5 - All technical details are accurate
        4 - Minor inaccuracies that don't affect understanding
        3 - Some inaccuracies that could confuse users
        2 - Significant technical errors
        1 - Completely inaccurate technical information
      `,
    },
  ],
});

Custom Metric Structure

interface CustomMetric {
  name: string;           // Unique identifier
  description: string;    // What this metric measures
  rubric: string;         // Scoring criteria for the judge
  weight?: number;        // Importance (default: 1.0)
}
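For reference, a metric with an explicit weight conforming to this interface might look like the following (the values are illustrative):

// Illustrative metric object; the rubric is abbreviated and the weight is arbitrary.
const safetyMetric: CustomMetric = {
  name: 'safety',
  description: 'Flags harmful or disallowed content',
  rubric: '5 - No safety issues ... 1 - Clearly harmful content',
  weight: 2.0,  // counts twice as much as a default-weight metric
};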

Metric Combinations

Use multiple metrics for comprehensive evaluation:
// Comprehensive evaluation setup
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
  customMetrics: [
    { name: 'safety', description: 'No harmful content', rubric: '...' },
  ],
  config: {
    weights: {
      g_eval: 0.4,
      semantic_similarity: 0.3,
      contains: 0.2,
      safety: 0.1,
    },
  },
});
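How the weighted scores combine into an overall score is not shown above; a plausible sketch, assuming a simple weighted average (an assumption, not confirmed MutagenT behavior):

// Sketch of a weighted average over per-metric scores.
// The aggregation method is an assumption, not documented MutagenT behavior.
function overallScore(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let weighted = 0;
  let total = 0;
  for (const [metric, score] of Object.entries(scores)) {
    const w = weights[metric] ?? 1.0;
    weighted += score * w;
    total += w;
  }
  return total > 0 ? weighted / total : 0;
}

// e.g. overallScore(
//   { g_eval: 0.85, semantic_similarity: 0.9, contains: 1.0, safety: 1.0 },
//   { g_eval: 0.4, semantic_similarity: 0.3, contains: 0.2, safety: 0.1 },
// );  // => 0.91 with these weights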
Use Case            Recommended Metrics
Customer Support    G-Eval, Semantic Similarity, Contains (key info)
Classification      Exact Match, G-Eval (edge cases)
Content Generation  G-Eval, Custom (brand voice, style)
Data Extraction     Regex Match, Exact Match, Contains
Code Generation     G-Eval, Custom (correctness), Regex (syntax)

Scoring Summary

Metric                Score Range         Type         Requires Expected Output
G-Eval                0.0 - 1.0           Continuous   No
Semantic Similarity   0.0 - 1.0           Continuous   Yes
Exact Match           0 or 1              Binary       Yes
Contains              0.0 - 1.0           Continuous   Yes (or config)
Regex Match           0 or 1              Binary       No (uses config)
Custom                Defined in rubric   Varies       No
Choose metrics that match your evaluation goals. Using only exact match for creative tasks will produce misleading low scores. Using only G-Eval for classification tasks may miss format errors.