# Evaluation Metrics
MutagenT provides built-in metrics to measure prompt quality from different angles. Choose the right combination of metrics based on your use case. Metrics are configured in the `evalConfig.criteria` array when creating an evaluation definition.
## Metric Categories

| Category | Description |
| --- | --- |
| LLM-Based | AI judges assess quality using reasoning |
| Embedding-Based | Semantic comparison using vector similarity |
| Deterministic | Exact rules with predictable results |
## G-Eval

AI-powered evaluation using a judge model to assess quality holistically.

```json
{ "field": "output", "metric": "g_eval" }
```
### How It Works

G-Eval uses a powerful LLM (Claude, GPT-4, etc.) to evaluate responses across several aspects (see the sketch after this list):

- Relevance - Does the output address the input?
- Coherence - Is the response logical and well-structured?
- Factual Accuracy - Are claims correct and verifiable?
- Completeness - Does it fully answer the question?
- Tone/Style - Does it match expected style guidelines?
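Conceptually, the judge call looks something like the sketch below. This is illustrative only: the prompt wording and the `callJudgeModel` helper are hypothetical stand-ins, not MutagenT internals.

```typescript
// Hypothetical sketch of a G-Eval style judge call. `callJudgeModel` stands
// in for whatever LLM client you use; the prompt format is illustrative.
async function gEvalScore(
  callJudgeModel: (prompt: string) => Promise<string>,
  input: string,
  output: string,
  aspects: string[]
): Promise<number> {
  const prompt = [
    `Rate the response on: ${aspects.join(", ")}.`,
    `Input:\n${input}`,
    `Response:\n${output}`,
    `Reply with a single number between 0.0 and 1.0.`,
  ].join("\n\n");
  const raw = await callJudgeModel(prompt);
  // Clamp in case the judge replies with something slightly out of range.
  return Math.min(1, Math.max(0, parseFloat(raw)));
}
```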
### Scoring

| Score Range | Interpretation |
| --- | --- |
| 0.9 - 1.0 | Excellent - Production ready |
| 0.8 - 0.9 | Good - Minor improvements possible |
| 0.7 - 0.8 | Fair - Needs attention |
| < 0.7 | Poor - Significant issues |

### Best For

- General quality assessment
- Subjective evaluation (tone, helpfulness)
- Complex outputs where exact matching isn’t practical
- When you want human-like judgment

### Example
```json
{
  "criteria": [
    {
      "field": "output",
      "metric": "g_eval",
      "weight": 1.0,
      "params": {
        "model": "claude-sonnet-4-6",
        "aspects": ["relevance", "coherence", "completeness"]
      }
    }
  ]
}
```
## Semantic Similarity

Measures how semantically similar the output is to the expected output using embedding models.

```json
{ "field": "output", "metric": "semantic_similarity" }
```
### How It Works

1. Generate embeddings for both the actual and expected outputs
2. Calculate cosine similarity between the embedding vectors (sketched below)
3. The score ranges from 0 (completely different) to 1 (identical meaning)
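A minimal sketch of the comparison step (illustrative; the `embed` parameter is a stand-in for whichever embedding model you configure):

```typescript
// Cosine similarity between two embedding vectors: dot product divided by
// the product of their magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed both texts, then compare. `embed` is a hypothetical stand-in for
// your embedding model client.
async function semanticSimilarityScore(
  embed: (text: string) => Promise<number[]>,
  actual: string,
  expected: string
): Promise<number> {
  const [va, ve] = await Promise.all([embed(actual), embed(expected)]);
  return cosineSimilarity(va, ve);
}
```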
### Scoring

| Score Range | Interpretation |
| --- | --- |
| 0.9 - 1.0 | Nearly identical meaning |
| 0.8 - 0.9 | Same core meaning, different wording |
| 0.7 - 0.8 | Related but noticeably different |
| < 0.7 | Substantially different meaning |

### Best For

- Checking if output preserves intended meaning
- Allowing flexibility in wording
- Paraphrase detection
- Content that should say the same thing differently

### Example
```json
{
  "criteria": [
    {
      "field": "output",
      "metric": "semantic_similarity",
      "weight": 1.0,
      "params": {
        "model": "text-embedding-3-large",
        "threshold": 0.8
      }
    }
  ]
}
```
> Semantic similarity requires expected outputs in your dataset items.
## Exact Match

Checks if output exactly matches the expected output.

```json
{ "field": "output", "metric": "exact_match" }
```
### How It Works

Binary comparison: 1.0 if the strings are identical, 0.0 otherwise.

Options for flexibility (sketched after this list):

- Case sensitivity (on/off)
- Whitespace normalization
- Punctuation handling
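A sketch of how these options could combine (illustrative, not MutagenT's internal implementation; punctuation handling is omitted for brevity):

```typescript
interface ExactMatchParams {
  caseSensitive?: boolean;
  normalizeWhitespace?: boolean;
}

// Binary exact-match scoring with optional normalization, mirroring the
// params used in the example below.
function exactMatchScore(
  actual: string,
  expected: string,
  params: ExactMatchParams = {}
): number {
  let a = actual;
  let e = expected;
  if (!params.caseSensitive) {
    a = a.toLowerCase();
    e = e.toLowerCase();
  }
  if (params.normalizeWhitespace) {
    a = a.trim().replace(/\s+/g, " ");
    e = e.trim().replace(/\s+/g, " ");
  }
  return a === e ? 1.0 : 0.0;
}
```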
### Scoring

| Score | Meaning |
| --- | --- |
| 1.0 | Exact match |
| 0.0 | Any difference |

### Best For

- Classification tasks (“positive”, “negative”, “neutral”)
- Structured outputs (JSON, specific formats)
- Simple Q&A with definitive answers
- Extraction tasks with expected values

### Example
```json
{
  "criteria": [
    {
      "field": "output",
      "metric": "exact_match",
      "weight": 1.0,
      "params": {
        "caseSensitive": false,
        "normalizeWhitespace": true
      }
    }
  ]
}
```
## Contains

Checks if output contains specific required text or patterns.

```json
{ "field": "output", "metric": "contains" }
```
### How It Works

Searches for required substrings in the output. Multiple required strings can be checked, and the score is proportional to how many are found (see the sketch below).
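A minimal sketch of proportional scoring (illustrative, not MutagenT's internal code):

```typescript
// Score = fraction of required strings present in the output.
function containsScore(
  output: string,
  required: string[],
  caseSensitive = false
): number {
  const haystack = caseSensitive ? output : output.toLowerCase();
  const found = required.filter((s) =>
    haystack.includes(caseSensitive ? s : s.toLowerCase())
  );
  return found.length / required.length;
}
```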
### Scoring

| Score | Meaning |
| --- | --- |
| 1.0 | All required strings found |
| 0.5 | Some required strings found (proportional) |
| 0.0 | None found |

### Best For

- Verifying key information is present
- Checking for required disclaimers
- Ensuring specific terms are mentioned
- Partial matching when exact match is too strict

### Example
```json
{
  "criteria": [
    {
      "field": "output",
      "metric": "contains",
      "weight": 1.0,
      "params": {
        "required": ["refund policy", "contact support"],
        "caseSensitive": false
      }
    }
  ]
}
```
## Regex Match

Pattern matching against output using regular expressions.

```json
{ "field": "output", "metric": "regex_match" }
```
### How It Works

Tests if output matches specified regex patterns. Useful for format validation.
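A sketch of the check (illustrative; whether all configured patterns must match or a single match suffices is an assumption here, so verify against your own results):

```typescript
// Binary regex scoring: 1.0 only if every pattern matches the output.
// (Requiring all patterns to match is an assumption for this sketch.)
function regexMatchScore(output: string, patterns: string[]): number {
  return patterns.every((p) => new RegExp(p).test(output)) ? 1.0 : 0.0;
}
```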
### Scoring

| Score | Meaning |
| --- | --- |
| 1.0 | Pattern matches |
| 0.0 | Pattern doesn’t match |

### Best For

- Format validation (dates, emails, IDs)
- Structure verification
- Extracting and validating patterns
- Ensuring specific format compliance

### Example
```json
{
  "criteria": [
    {
      "field": "output",
      "metric": "regex_match",
      "weight": 1.0,
      "params": {
        "patterns": [
          "^\\d{4}-\\d{2}-\\d{2}$",
          "Order #[A-Z0-9]{8}"
        ]
      }
    }
  ]
}
```
## Custom Metrics
Define your own evaluation criteria for domain-specific needs. Custom metrics use an LLM judge with a rubric you provide:
```json
{
  "criteria": [
    {
      "field": "output",
      "metric": "custom",
      "weight": 0.5,
      "params": {
        "name": "brand_voice",
        "description": "Measures adherence to brand voice guidelines",
        "rubric": "5 - Perfect brand voice: friendly, professional, on-brand\n4 - Minor deviations\n3 - Noticeable issues\n2 - Major issues\n1 - Completely off-brand"
      }
    },
    {
      "field": "output",
      "metric": "custom",
      "weight": 0.5,
      "params": {
        "name": "technical_accuracy",
        "description": "Checks technical correctness of information",
        "rubric": "5 - All technical details are accurate\n4 - Minor inaccuracies\n3 - Some inaccuracies\n2 - Significant technical errors\n1 - Completely inaccurate"
      }
    }
  ]
}
```
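Under the hood, a rubric judge could plausibly work like the sketch below. This is hypothetical: in particular, the mapping from a 1-5 rubric grade to a 0-1 score is an assumption for illustration, not documented MutagenT behavior.

```typescript
// Hypothetical rubric judge: ask the model for a 1-5 grade against the
// rubric, then normalize to 0.0-1.0. `callJudgeModel` is a stand-in client.
async function customRubricScore(
  callJudgeModel: (prompt: string) => Promise<string>,
  output: string,
  rubric: string
): Promise<number> {
  const prompt =
    `Score the response against this rubric. Reply with a single digit 1-5.\n\n` +
    `Rubric:\n${rubric}\n\nResponse:\n${output}`;
  const grade = parseInt(await callJudgeModel(prompt), 10);
  return (grade - 1) / 4; // assumed normalization: 1 -> 0.0, 5 -> 1.0
}
```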
## Metric Combinations
Use multiple metrics for comprehensive evaluation by adding them as criteria with weights:
```json
{
  "criteria": [
    { "field": "output", "metric": "g_eval", "weight": 0.4 },
    { "field": "output", "metric": "semantic_similarity", "weight": 0.3 },
    { "field": "output", "metric": "contains", "weight": 0.2, "params": { "required": ["disclaimer"] } },
    { "field": "output", "metric": "custom", "weight": 0.1, "params": { "name": "safety", "description": "No harmful content", "rubric": "..." } }
  ],
  "threshold": 0.8
}
```
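The natural reading of per-criterion weights plus a top-level threshold is a weighted sum of the individual metric scores compared against that threshold. A sketch under that assumption (weighted-sum semantics are not confirmed by this page; the weights here sum to 1.0 as in the example above):

```typescript
interface CriterionResult {
  score: number;  // 0.0 - 1.0 from the individual metric
  weight: number; // from the criteria config
}

// Weighted aggregation: sum of score * weight, then a pass/fail check
// against the evaluation threshold. (Weighted-sum semantics are assumed.)
function aggregate(
  results: CriterionResult[],
  threshold: number
): { score: number; passed: boolean } {
  const score = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return { score, passed: score >= threshold };
}

// With the weights above: 0.9*0.4 + 0.85*0.3 + 1.0*0.2 + 0.8*0.1 = 0.895,
// which passes a 0.8 threshold.
```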
### Recommended Combinations by Use Case

| Use Case | Recommended Metrics |
| --- | --- |
| Customer Support | G-Eval, Semantic Similarity, Contains (key info) |
| Classification | Exact Match, G-Eval (edge cases) |
| Content Generation | G-Eval, Custom (brand voice, style) |
| Data Extraction | Regex Match, Exact Match, Contains |
| Code Generation | G-Eval, Custom (correctness), Regex (syntax) |
## Scoring Summary

| Metric | Score Range | Type | Requires Expected Output |
| --- | --- | --- | --- |
| G-Eval | 0.0 - 1.0 | Continuous | No |
| Semantic Similarity | 0.0 - 1.0 | Continuous | Yes |
| Exact Match | 0 or 1 | Binary | Yes |
| Contains | 0.0 - 1.0 | Continuous | Yes (or config) |
| Regex Match | 0 or 1 | Binary | No (uses config) |
| Custom | Defined in rubric | Varies | No |
Choose metrics that match your evaluation goals. Using only exact match for creative tasks will produce misleading low scores. Using only G-Eval for classification tasks may miss format errors.