Evaluation Metrics

MutagenT provides built-in metrics to measure prompt quality from different angles. Choose the right combination of metrics based on your use case. Metrics are configured in the evalConfig.criteria array when creating an evaluation definition.
Currently, the platform supports G-Eval and Exact Match metrics. The UI metric definition builder supports G-Eval type metrics. Additional metric types are planned for future releases.
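As a minimal sketch of assembling that configuration in code, the snippet below builds an evalConfig-shaped dictionary and rejects metric names other than the two this page lists. The helper name `build_eval_config` and the client-side check are illustrative assumptions, not part of the MutagenT API.

```python
import json

# Metric identifiers taken from the JSON examples on this page.
SUPPORTED_METRICS = {"g_eval", "exact_match"}


def build_eval_config(criteria):
    """Return an evalConfig-shaped dict after checking each criterion's metric.

    Hypothetical helper: the platform may validate differently.
    """
    for criterion in criteria:
        metric = criterion.get("metric")
        if metric not in SUPPORTED_METRICS:
            raise ValueError(f"unsupported metric: {metric!r}")
    return {"criteria": list(criteria)}


config = build_eval_config([{"field": "output", "metric": "g_eval"}])
print(json.dumps(config, indent=2))
```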

Metric Categories

LLM-Based

AI judges assess quality using reasoning

Deterministic

Exact rules with predictable results

G-Eval

AI-powered evaluation using a judge model to assess quality holistically.
{ "field": "output", "metric": "g_eval" }

How It Works

G-Eval uses a judge LLM (Claude, GPT-4, etc.) to score responses against multiple criteria:
  1. Relevance - Does the output address the input?
  2. Coherence - Is the response logical and well-structured?
  3. Factual Accuracy - Are claims correct and verifiable?
  4. Completeness - Does it fully answer the question?
  5. Tone/Style - Does it match expected style guidelines?
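
The criteria above can be sketched as a generic LLM-as-judge loop: build a grading prompt from the selected aspects, ask a judge for one score per aspect, and average the results. This is an illustration only, not MutagenT's implementation. The prompt wording, the unweighted mean, the `call_judge` callable, and the aspect keys `factual_accuracy` and `tone_style` are all assumptions; only `relevance`, `coherence`, and `completeness` appear in this page's examples.

```python
import json
from typing import Callable

# Aspect descriptions copied from the list above; the last-resort keys
# "factual_accuracy" and "tone_style" are hypothetical identifiers.
ASPECTS = {
    "relevance": "Does the output address the input?",
    "coherence": "Is the response logical and well-structured?",
    "factual_accuracy": "Are claims correct and verifiable?",
    "completeness": "Does it fully answer the question?",
    "tone_style": "Does it match expected style guidelines?",
}


def build_judge_prompt(user_input: str, output: str, aspects: list[str]) -> str:
    """Assemble a grading prompt that asks for one 0.0-1.0 score per aspect."""
    rubric = "\n".join(f"- {name}: {ASPECTS[name]}" for name in aspects)
    return (
        "Score the response on each aspect from 0.0 to 1.0.\n"
        f"Aspects:\n{rubric}\n\n"
        f"Input:\n{user_input}\n\nResponse:\n{output}\n\n"
        "Reply with a JSON object mapping each aspect name to its score."
    )


def judge_score(user_input: str, output: str, aspects: list[str],
                call_judge: Callable[[str], str]) -> float:
    """Ask the judge for per-aspect scores and return their unweighted mean."""
    reply = call_judge(build_judge_prompt(user_input, output, aspects))
    scores = json.loads(reply)
    return sum(float(scores[name]) for name in aspects) / len(aspects)


# Stand-in judge so the sketch runs offline; a real one would call an LLM API.
def fake_judge(prompt: str) -> str:
    return json.dumps({"relevance": 0.9, "coherence": 0.8, "completeness": 1.0})


score = judge_score("What is 2 + 2?", "4",
                    ["relevance", "coherence", "completeness"], fake_judge)
print(round(score, 2))
```

Passing the judge in as a callable keeps the scoring logic testable without network access.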

Scoring

| Score Range | Interpretation |
| --- | --- |
| 0.9 - 1.0 | Excellent - Production ready |
| 0.8 - 0.9 | Good - Minor improvements possible |
| 0.7 - 0.8 | Fair - Needs attention |
| < 0.7 | Poor - Significant issues |
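
A small helper can turn a raw score into one of the bands above, for example when labeling results in a report. The function name is hypothetical, and because the ranges share their endpoints, this sketch assumes a boundary value such as 0.9 belongs to the higher band.

```python
def interpret_score(score: float) -> str:
    """Map a G-Eval score in [0.0, 1.0] to the interpretation bands above.

    Assumption: a score sitting exactly on a boundary goes to the higher band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.9:
        return "Excellent"
    if score >= 0.8:
        return "Good"
    if score >= 0.7:
        return "Fair"
    return "Poor"


for value in (0.95, 0.85, 0.72, 0.4):
    print(value, interpret_score(value))
```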

Best For

  • General quality assessment
  • Subjective evaluation (tone, helpfulness)
  • Complex outputs where exact matching isn’t practical
  • When you want human-like judgment

Example

{
  "criteria": [
    {
      "field": "output",
      "metric": "g_eval",
      "weight": 1.0,
      "params": {
        "model": "claude-sonnet-4-6",
        "aspects": ["relevance", "coherence", "completeness"]
      }
    }
  ]
}

Exact Match

Checks if output exactly matches the expected output.
{ "field": "output", "metric": "exact_match" }

How It Works

Binary comparison: the score is 1.0 if the output and the expected output are identical strings, and 0.0 otherwise. Optional settings relax the comparison:
  • Case sensitivity (on/off)
  • Whitespace normalization
  • Punctuation handling
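
The comparison and its options can be sketched as follows. This is not the platform's code: the parameter names mirror `caseSensitive` and `normalizeWhitespace` from this page's example, while `strip_punctuation` is a hypothetical name for the punctuation option, and the exact normalization rules are assumptions.

```python
import string


def exact_match(output: str, expected: str, *, case_sensitive: bool = True,
                normalize_whitespace: bool = False,
                strip_punctuation: bool = False) -> float:
    """Return 1.0 if the two strings match after optional normalization, else 0.0."""

    def prepare(text: str) -> str:
        if not case_sensitive:
            text = text.casefold()
        if strip_punctuation:
            # Remove ASCII punctuation only; other scripts would need more care.
            text = text.translate(str.maketrans("", "", string.punctuation))
        if normalize_whitespace:
            # Trim the ends and collapse internal runs of whitespace to one space.
            text = " ".join(text.split())
        return text

    return 1.0 if prepare(output) == prepare(expected) else 0.0


print(exact_match("Positive", "positive"))
print(exact_match("Positive", "positive", case_sensitive=False))
print(exact_match("  new   york ", "new york", normalize_whitespace=True))
```

Punctuation is stripped before whitespace is collapsed so that removing a character never leaves a double space behind.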

Scoring

| Score | Meaning |
| --- | --- |
| 1.0 | Exact match |
| 0.0 | Any difference |

Best For

  • Classification tasks (“positive”, “negative”, “neutral”)
  • Structured outputs (JSON, specific formats)
  • Simple Q&A with definitive answers
  • Extraction tasks with expected values

Example

{
  "criteria": [
    {
      "field": "output",
      "metric": "exact_match",
      "weight": 1.0,
      "params": {
        "caseSensitive": false,
        "normalizeWhitespace": true
      }
    }
  ]
}

Metric Combinations

Use multiple metrics for comprehensive evaluation by adding them as criteria with weights:
{
  "criteria": [
    { "field": "output", "metric": "g_eval", "weight": 0.6 },
    { "field": "output", "metric": "exact_match", "weight": 0.4 }
  ],
  "threshold": 0.8
}
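
To make the arithmetic concrete, the sketch below combines per-metric scores using the weights from the configuration above. It assumes the overall score is a weighted average and that `threshold` is an inclusive minimum passing score; the page does not spell out either rule, and the function name and sample scores are illustrative.

```python
def combined_score(criteria, metric_scores):
    """Weighted average of per-metric scores (assumed aggregation rule)."""
    total_weight = sum(c["weight"] for c in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(c["weight"] * metric_scores[c["metric"]] for c in criteria)
    return weighted / total_weight


config = {
    "criteria": [
        {"field": "output", "metric": "g_eval", "weight": 0.6},
        {"field": "output", "metric": "exact_match", "weight": 0.4},
    ],
    "threshold": 0.8,
}

# Made-up scores for one test case: strong judge score, exact match succeeded.
overall = combined_score(config["criteria"], {"g_eval": 0.9, "exact_match": 1.0})
passed = overall >= config["threshold"]
print(round(overall, 2), passed)

# Same judge score, but the exact match failed.
overall_miss = combined_score(config["criteria"], {"g_eval": 0.9, "exact_match": 0.0})
print(round(overall_miss, 2), overall_miss >= config["threshold"])
```

Under these assumptions, 0.6 × 0.9 + 0.4 × 1.0 = 0.94 clears the 0.8 threshold, while a failed exact match drops the total to 0.54 and the case fails even with a high G-Eval score.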
| Use Case | Recommended Metrics |
| --- | --- |
| Customer Support | G-Eval |
| Classification | Exact Match, G-Eval (edge cases) |
| Content Generation | G-Eval |
| Simple Q&A | Exact Match |

Scoring Summary

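| Metric | Score Range | Type | Requires Expected Output |
| --- | --- | --- | --- |
| G-Eval | 0.0 - 1.0 | Continuous | No |
| Exact Match | 0.0 or 1.0 | Binary | Yes |

Choose metrics that match your evaluation goals. Using only Exact Match for creative tasks will produce misleadingly low scores, because a valid output rarely matches a reference string character for character. Using only G-Eval for classification tasks may miss format errors.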