Evaluation Metrics
MutagenT provides built-in metrics to measure prompt quality from different angles. Choose the right combination of metrics based on your use case.
Metric Categories
LLM-Based: AI judges assess quality using reasoning
Embedding-Based: Semantic comparison using vector similarity
Deterministic: Exact rules with predictable results
G-Eval
AI-powered evaluation using a judge model to assess quality holistically.
How It Works
G-Eval uses a powerful LLM (GPT-4, Claude, etc.) to evaluate responses based on multiple criteria:
Relevance - Does the output address the input?
Coherence - Is the response logical and well-structured?
Factual Accuracy - Are claims correct and verifiable?
Completeness - Does it fully answer the question?
Tone/Style - Does it match expected style guidelines?
Scoring
Score Range    Interpretation
0.9 - 1.0      Excellent - Production ready
0.8 - 0.9      Good - Minor improvements possible
0.7 - 0.8      Fair - Needs attention
< 0.7          Poor - Significant issues
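If you need to act on these bands programmatically, for example to gate a release in CI, a small helper that mirrors the table is enough. The function below is illustrative only and is not part of the MutagenT SDK; the thresholds simply restate the table above.

// Illustrative helper, not part of the SDK: maps a G-Eval score to the bands above.
function interpretGEvalScore(score: number): string {
  if (score >= 0.9) return 'Excellent - Production ready';
  if (score >= 0.8) return 'Good - Minor improvements possible';
  if (score >= 0.7) return 'Fair - Needs attention';
  return 'Poor - Significant issues';
}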
Best For
General quality assessment
Subjective evaluation (tone, helpfulness)
Complex outputs where exact matching isn’t practical
When you want human-like judgment
Example
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval'],
  config: {
    g_eval: {
      model: 'gpt-5.1', // Judge model
      criteria: ['relevance', 'coherence', 'completeness'],
    },
  },
});
Semantic Similarity
Measures how semantically similar the output is to the expected output using embedding models.
metrics: ['semantic_similarity']
How It Works
Generate embeddings for both actual and expected outputs
Calculate cosine similarity between embedding vectors
Score ranges from 0 (completely different) to 1 (identical meaning)
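For intuition, the sketch below shows the cosine-similarity step on two embedding vectors. It is purely illustrative; the embedding model and any normalization MutagenT applies internally are handled by the platform.

// Illustrative sketch of the cosine-similarity step, not MutagenT internals.
// Typical text embeddings yield scores in roughly [0, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}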
Scoring
Score Range    Interpretation
0.9 - 1.0      Nearly identical meaning
0.8 - 0.9      Same core meaning, different wording
0.7 - 0.8      Related but noticeably different
< 0.7          Substantially different meaning
Best For
Checking if output preserves intended meaning
Allowing flexibility in wording
Paraphrase detection
Content that should say the same thing differently
Example
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['semantic_similarity'],
  config: {
    semantic_similarity: {
      model: 'text-embedding-3-large',
      threshold: 0.8, // Minimum acceptable similarity
    },
  },
});
Semantic similarity requires expected outputs in your dataset items.
Exact Match
Checks if output exactly matches the expected output.
How It Works
Binary comparison: 1.0 if strings are identical, 0.0 otherwise.
Options for flexibility:
Case sensitivity (on/off)
Whitespace normalization
Punctuation handling
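These options boil down to normalizing both strings before comparing them. A rough sketch of that logic follows; the defaults and exact normalization rules MutagenT uses may differ.

// Illustrative exact-match comparison; defaults here are assumptions, not documented SDK behavior.
function exactMatch(
  actual: string,
  expected: string,
  opts: { caseSensitive?: boolean; normalizeWhitespace?: boolean } = {},
): number {
  const caseSensitive = opts.caseSensitive ?? true;              // assumed default
  const normalizeWhitespace = opts.normalizeWhitespace ?? false; // assumed default
  let a = actual;
  let e = expected;
  if (!caseSensitive) {
    a = a.toLowerCase();
    e = e.toLowerCase();
  }
  if (normalizeWhitespace) {
    a = a.trim().replace(/\s+/g, ' ');
    e = e.trim().replace(/\s+/g, ' ');
  }
  return a === e ? 1.0 : 0.0;
}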
Scoring
Score    Meaning
1.0      Exact match
0.0      Any difference
Best For
Classification tasks (“positive”, “negative”, “neutral”)
Structured outputs (JSON, specific formats)
Simple Q&A with definitive answers
Extraction tasks with expected values
Example
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['exact_match'],
  config: {
    exact_match: {
      caseSensitive: false,
      normalizeWhitespace: true,
    },
  },
});
Contains
Checks if output contains specific required text or patterns.
How It Works
Searches for required substrings in the output. Can check for multiple required strings.
Scoring
Score    Meaning
1.0      All required strings found
0.5      Some required strings found (proportional)
0.0      None found
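The proportional score can be read as the fraction of required strings found in the output. The sketch below illustrates that arithmetic; case handling follows the config, and the real implementation may differ in details.

// Illustrative proportional scoring for the contains metric.
function containsScore(
  output: string,
  requiredStrings: string[],
  caseSensitive = false,
): number {
  const haystack = caseSensitive ? output : output.toLowerCase();
  const found = requiredStrings.filter((s) =>
    haystack.includes(caseSensitive ? s : s.toLowerCase()),
  );
  // Treating an empty requirement list as a pass is an assumption of this sketch.
  return requiredStrings.length === 0 ? 1.0 : found.length / requiredStrings.length;
}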
Best For
Verifying key information is present
Checking for required disclaimers
Ensuring specific terms are mentioned
Partial matching when exact match is too strict
Example
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['contains'],
  config: {
    contains: {
      requiredStrings: ['refund policy', 'contact support'],
      caseSensitive: false,
    },
  },
});

// Or use expectedOutput as the required string
// Dataset item: { expectedOutput: 'must contain this phrase' }
Regex Match
Pattern matching against output using regular expressions.
How It Works
Tests if output matches specified regex patterns. Useful for format validation.
Scoring
Score    Meaning
1.0      Pattern matches
0.0      Pattern doesn't match
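As a mental model, each pattern is run against the output with a standard regular-expression engine. The sketch below assumes every listed pattern must match to score 1.0; treat that as an assumption rather than documented behavior.

// Illustrative regex scoring; "all patterns must match" is an assumption of this sketch.
function regexMatchScore(output: string, patterns: string[]): number {
  return patterns.every((p) => new RegExp(p).test(output)) ? 1.0 : 0.0;
}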
Best For
Format validation (dates, emails, IDs)
Structure verification
Extracting and validating patterns
Ensuring specific format compliance
Example
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['regex_match'],
  config: {
    regex_match: {
      patterns: [
        '^\\d{4}-\\d{2}-\\d{2}$', // Date format
        'Order #[A-Z0-9]{8}',     // Order ID format
      ],
    },
  },
});
Custom Metrics
Define your own evaluation criteria for domain-specific needs.
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  customMetrics: [
    {
      name: 'brand_voice',
      description: 'Measures adherence to brand voice guidelines',
      rubric: `
        5 - Perfect brand voice: friendly, professional, on-brand
        4 - Minor deviations: mostly on-brand with small issues
        3 - Noticeable issues: tone inconsistencies
        2 - Major issues: frequently off-brand
        1 - Completely off-brand: doesn't match guidelines at all
      `,
    },
    {
      name: 'technical_accuracy',
      description: 'Checks technical correctness of information',
      rubric: `
        5 - All technical details are accurate
        4 - Minor inaccuracies that don't affect understanding
        3 - Some inaccuracies that could confuse users
        2 - Significant technical errors
        1 - Completely inaccurate technical information
      `,
    },
  ],
});
Custom Metric Structure
interface CustomMetric {
  name: string;         // Unique identifier
  description: string;  // What this metric measures
  rubric: string;       // Scoring criteria for the judge
  weight?: number;      // Importance (default: 1.0)
}
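The optional weight lets one custom metric count for more than another when scores are combined. A small example using the shape above; the rubric text here is a shortened placeholder, not a recommended rubric.

// Example custom metric with an explicit weight; rubric shortened for illustration.
const safetyMetric: CustomMetric = {
  name: 'safety',
  description: 'No harmful content',
  rubric: '5 - No safety issues ... 1 - Clearly harmful content',
  weight: 2.0, // counts twice as much as a default-weight metric
};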
Metric Combinations
Use multiple metrics for comprehensive evaluation:
// Comprehensive evaluation setup
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
  customMetrics: [
    { name: 'safety', description: 'No harmful content', rubric: '...' },
  ],
  config: {
    weights: {
      g_eval: 0.4,
      semantic_similarity: 0.3,
      contains: 0.2,
      safety: 0.1,
    },
  },
});
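A reasonable way to think about the weights is as a weighted average of the individual metric scores. The sketch below shows that arithmetic; treat it as an assumption, since the exact formula MutagenT uses to aggregate scores may differ.

// Illustrative weighted-average aggregation (assumed, not necessarily MutagenT's exact formula).
function weightedScore(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let total = 0;
  let weightSum = 0;
  for (const [metric, score] of Object.entries(scores)) {
    const w = weights[metric] ?? 1.0; // unlisted metrics assumed to default to weight 1.0
    total += score * w;
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}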
Recommended Combinations by Use Case
Use Case             Recommended Metrics
Customer Support     G-Eval, Semantic Similarity, Contains (key info)
Classification       Exact Match, G-Eval (edge cases)
Content Generation   G-Eval, Custom (brand voice, style)
Data Extraction      Regex Match, Exact Match, Contains
Code Generation      G-Eval, Custom (correctness), Regex (syntax)
Scoring Summary
Metric                 Score Range         Type         Requires Expected Output
G-Eval                 0.0 - 1.0           Continuous   No
Semantic Similarity    0.0 - 1.0           Continuous   Yes
Exact Match            0 or 1              Binary       Yes
Contains               0.0 - 1.0           Continuous   Yes (or config)
Regex Match            0 or 1              Binary       No (uses config)
Custom                 Defined in rubric   Varies       No
Choose metrics that match your evaluation goals. Using only exact match for creative tasks will produce misleading low scores. Using only G-Eval for classification tasks may miss format errors.