Evaluation Metrics
MutagenT provides built-in metrics to measure prompt quality from different angles. Choose the right combination of metrics based on your use case. Metrics are configured in the evalConfig.criteria array when creating an evaluation definition.
Currently, the platform supports G-Eval and Exact Match metrics. The UI metric definition builder supports G-Eval type metrics. Additional metric types are planned for future releases.
Metric Categories
LLM-Based
AI judges assess quality using reasoning
Deterministic
Exact rules with predictable results
G-Eval
AI-powered evaluation using a judge model to assess quality holistically.
How It Works
G-Eval uses a powerful LLM (Claude, GPT-4, etc.) to evaluate responses based on multiple criteria:
- Relevance - Does the output address the input?
- Coherence - Is the response logical and well-structured?
- Factual Accuracy - Are claims correct and verifiable?
- Completeness - Does it fully answer the question?
- Tone/Style - Does it match expected style guidelines?
Scoring
| Score Range | Interpretation |
|---|---|
| 0.9 - 1.0 | Excellent - Production ready |
| 0.8 - 0.9 | Good - Minor improvements possible |
| 0.7 - 0.8 | Fair - Needs attention |
| < 0.7 | Poor - Significant issues |
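The bands in the table can be applied mechanically when triaging results. The following is a minimal sketch, not part of the MutagenT API: `interpret_score` is a hypothetical helper, and because the table lists 0.8 and 0.9 in two rows each, it assumes each band includes its lower bound.

```python
def interpret_score(score: float) -> str:
    """Map a 0.0-1.0 score to the interpretation bands from the table.

    Assumption: each band includes its lower bound, so 0.9 counts as
    "Excellent" and 0.8 as "Good".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.9:
        return "Excellent"
    if score >= 0.8:
        return "Good"
    if score >= 0.7:
        return "Fair"
    return "Poor"


print(interpret_score(0.85))
```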
Best For
- General quality assessment
- Subjective evaluation (tone, helpfulness)
- Complex outputs where exact matching isn’t practical
- When you want human-like judgment
Example
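As a rough illustration of the judge-based approach described under How It Works, the sketch below asks a judge for one rating per criterion and averages the results. It is not MutagenT's implementation: the 1-5 rating scale, the rescaling to 0.0-1.0, the prompt wording, and the names `build_judge_prompt`, `g_eval_score`, and `stub_judge` are all assumptions, and the stub stands in for a real LLM call.

```python
from statistics import mean
from typing import Callable

# Criteria and guiding questions taken from the list under "How It Works".
CRITERIA = {
    "relevance": "Does the output address the input?",
    "coherence": "Is the response logical and well-structured?",
    "factual_accuracy": "Are claims correct and verifiable?",
    "completeness": "Does it fully answer the question?",
    "tone_style": "Does it match expected style guidelines?",
}


def build_judge_prompt(criterion: str, question: str, user_input: str, output: str) -> str:
    """Build the instruction sent to the judge for a single criterion (hypothetical wording)."""
    return (
        f"Rate the response on {criterion} from 1 (worst) to 5 (best).\n"
        f"Guiding question: {question}\n"
        f"Input: {user_input}\n"
        f"Response: {output}\n"
        "Answer with a single integer."
    )


def g_eval_score(user_input: str, output: str, judge: Callable[[str], int]) -> float:
    """Average the judge's 1-5 ratings across criteria and rescale to 0.0-1.0."""
    ratings = [
        judge(build_judge_prompt(name, question, user_input, output))
        for name, question in CRITERIA.items()
    ]
    return (mean(ratings) - 1) / 4


def stub_judge(prompt: str) -> int:
    """Placeholder for a real LLM call; always answers 4."""
    return 4


print(g_eval_score("How do I reset my password?", "Open Settings and choose Reset.", stub_judge))
```

In practice `stub_judge` would be replaced by a function that sends the prompt to the judge model and parses the integer from its reply.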
Exact Match
Checks if output exactly matches the expected output.
How It Works
Binary comparison: 1.0 if strings are identical, 0.0 otherwise. Options for flexibility:
- Case sensitivity (on/off)
- Whitespace normalization
- Punctuation handling
Scoring
| Score | Meaning |
|---|---|
| 1.0 | Exact match |
| 0.0 | Any difference |
Best For
- Classification tasks (“positive”, “negative”, “neutral”)
- Structured outputs (JSON, specific formats)
- Simple Q&A with definitive answers
- Extraction tasks with expected values
Example
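The comparison and its three options can be sketched as below. This is an illustrative sketch only: the function name `exact_match` and the option names `case_sensitive`, `normalize_whitespace`, and `strip_punctuation` are hypothetical and may not match the platform's configuration keys. It also assumes "punctuation handling" means removing ASCII punctuation before comparing.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def exact_match(
    output: str,
    expected: str,
    *,
    case_sensitive: bool = True,
    normalize_whitespace: bool = False,
    strip_punctuation: bool = False,
) -> float:
    """Return 1.0 if the prepared strings are identical, 0.0 otherwise."""

    def prepare(text: str) -> str:
        if not case_sensitive:
            text = text.casefold()
        if strip_punctuation:
            text = text.translate(_PUNCTUATION_TABLE)
        if normalize_whitespace:
            # Collapse runs of whitespace and trim both ends.
            text = " ".join(text.split())
        return text

    return 1.0 if prepare(output) == prepare(expected) else 0.0


print(exact_match("Positive", "positive"))
print(exact_match("Positive", "positive", case_sensitive=False))
```

With every option left at its default the check is a strict string comparison, so a stray trailing space or a capital letter is enough to score 0.0.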
Metric Combinations
Use multiple metrics for comprehensive evaluation by adding them as criteria with weights.
Recommended Combinations by Use Case
| Use Case | Recommended Metrics |
|---|---|
| Customer Support | G-Eval |
| Classification | Exact Match, G-Eval (edge cases) |
| Content Generation | G-Eval |
| Simple Q&A | Exact Match |
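When several metrics are combined as weighted criteria, one common way to reduce them to a single number is a weighted average. The sketch below assumes that approach; the source does not state how the platform aggregates weights, and `combine_scores` and the metric keys are hypothetical names used only for illustration.

```python
from typing import Mapping


def combine_scores(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-metric scores (assumed aggregation rule).

    Weights are normalized by their sum, so they do not need to add up to 1.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score provided for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Classification use case: Exact Match dominates, G-Eval covers edge cases.
combined = combine_scores(
    {"exact_match": 1.0, "g_eval": 0.8},
    {"exact_match": 0.7, "g_eval": 0.3},
)
print(round(combined, 2))
```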
Scoring Summary
| Metric | Score Range | Type | Requires Expected Output |
|---|---|---|---|
| G-Eval | 0.0 - 1.0 | Continuous | No |
| Exact Match | 0.0 or 1.0 | Binary | Yes |