Running Evaluations
Evaluations in MutagenT are definition entities: they specify what to evaluate and how. Evaluation execution happens within the optimization loop or via the dashboard. This guide covers how to create evaluations, retrieve results, and interpret scores.

There is no separate "run evaluation" CLI command. Evaluations execute as part of the optimization workflow or through the dashboard UI. The CLI is used to create evaluation definitions and retrieve results.
Create an Evaluation
Via Dashboard
1. Navigate to a prompt's Evaluations tab
2. Click New Evaluation
3. Configure your evaluation:
   - Select the dataset to test against
   - Define evaluation criteria (metrics, weights, thresholds)
   - Configure LLM settings for the judge model
4. Click Create Evaluation

The evaluation will execute when optimization runs or when triggered from the dashboard.
Via CLI
The `criteria.json` file contains your evaluation configuration:
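A plausible shape for `criteria.json` is sketched below. The exact field names (`datasetId`, `metrics`, `weight`, `threshold`, `judge`) are assumptions, so check your MutagenT version's reference for the canonical schema:

```json
{
  "datasetId": "ds_example123",
  "metrics": [
    { "name": "g-eval", "weight": 0.5, "threshold": 0.85 },
    { "name": "semantic-similarity", "weight": 0.3, "threshold": 0.8 },
    { "name": "format-validation", "weight": 0.2, "threshold": 0.95 }
  ],
  "judge": { "provider": "openai", "model": "gpt-4o" }
}
```

Weights are most useful when they sum to 1, but any aggregation the dashboard applies should be confirmed against the product documentation.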
Via SDK
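The SDK surface is not shown here, so the sketch below goes through the HTTP layer with the standard library instead. The endpoint path, payload field names, and base URL are all assumptions for illustration:

```python
import json
import urllib.request


def build_evaluation_payload(name: str, dataset_id: str, criteria: dict) -> dict:
    """Assemble the request body for creating an evaluation definition.

    Field names here are assumptions; align them with your MutagenT
    API reference before use.
    """
    return {"name": name, "datasetId": dataset_id, "criteria": criteria}


def create_evaluation(base_url: str, api_key: str, payload: dict) -> bytes:
    """POST the definition to a hypothetical /api/evaluations endpoint."""
    req = urllib.request.Request(
        f"{base_url}/api/evaluations",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Remember that this only creates the definition; execution still happens in the optimization loop or from the dashboard.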
How Evaluations Execute
Evaluations produce results through two paths:

- Dashboard - Trigger evaluation execution manually from the prompt evaluations page
- Optimization Loop - Evaluations run automatically during each optimization iteration to score prompt variants
Get Results
Retrieve evaluation results once execution completes:

Via CLI
Via cURL
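Whichever transport you use, the response is JSON. The shape below is an assumption for illustration (an overall `score` plus a `metricResults` array, matching the field referenced in the next section); verify it against your deployment:

```python
import json

# A response shape assumed for illustration; verify against the real API.
raw = """
{
  "evaluationId": "eval_abc",
  "status": "completed",
  "score": 0.88,
  "metricResults": [
    {"metric": "g-eval", "score": 0.91},
    {"metric": "semantic-similarity", "score": 0.84}
  ]
}
"""

result = json.loads(raw)
print(f"overall: {result['score']:.2f}")
for m in result["metricResults"]:
    print(f"  {m['metric']}: {m['score']:.2f}")
```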
Interpreting Scores
Score Thresholds
| Score Range | Quality Level | Recommendation |
|---|---|---|
| 0.95 - 1.00 | Excellent | Production ready |
| 0.85 - 0.94 | Good | Deploy with monitoring |
| 0.75 - 0.84 | Fair | Consider improvements |
| 0.65 - 0.74 | Poor | Needs significant work |
| < 0.65 | Critical | Do not deploy |
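The table above translates directly into a small helper. This sketch closes the gaps between rows by treating each range's lower bound as an inclusive cutoff:

```python
def interpret_score(score: float) -> tuple[str, str]:
    """Map an overall evaluation score to the quality level and
    recommendation from the score thresholds table."""
    bands = [
        (0.95, "Excellent", "Production ready"),
        (0.85, "Good", "Deploy with monitoring"),
        (0.75, "Fair", "Consider improvements"),
        (0.65, "Poor", "Needs significant work"),
    ]
    for cutoff, level, advice in bands:
        if score >= cutoff:
            return level, advice
    return "Critical", "Do not deploy"
```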
Analyzing Low Scores
When scores are lower than expected, examine the per-metric breakdown in `metricResults`:
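One way to spot the weakest criteria is to sort the per-metric entries ascending by score. The entry shape (`{"metric": ..., "score": ...}`) is an assumption carried over from the results sketch above:

```python
def weakest_metrics(metric_results: list[dict], limit: int = 3) -> list[dict]:
    """Return the lowest-scoring metrics first, so you can see which
    criteria drag the overall score down."""
    return sorted(metric_results, key=lambda m: m["score"])[:limit]
```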
Comparing Evaluations
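Comparing two prompt versions means running the same dataset and criteria against each, then diffing the per-metric scores. A sketch, again assuming the `metricResults` entry shape:

```python
def metric_deltas(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Per-metric score difference (candidate - baseline); positive
    values mean the candidate prompt version improved that metric."""
    base = {m["metric"]: m["score"] for m in baseline}
    return {
        m["metric"]: round(m["score"] - base[m["metric"]], 4)
        for m in candidate
        if m["metric"] in base
    }
```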
Compare results across different prompt versions by creating evaluations with the same dataset and criteria for each version.

Handling Failures
Common Failure Causes
Provider errors
LLM API returned an error. Check provider configuration in Settings > Providers and verify rate limits.
Timeout
Evaluation took too long. Try a smaller dataset or check if the LLM provider is experiencing high latency.
Invalid prompt
Prompt template has syntax errors or missing variables. Verify the prompt variables match the dataset schema.
Dataset issues
Dataset items have invalid or missing required fields. Ensure all dataset items have the fields referenced in your criteria.
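The last two causes can be caught before anything runs by checking that every variable in the prompt template exists in each dataset item. This sketch assumes `{variable}` placeholder syntax; adapt the regex if your templates use a different one:

```python
import re


def missing_variables(template: str, dataset_items: list[dict]) -> dict[int, set[str]]:
    """Map dataset item index -> template variables absent from that item.

    An empty dict means every item satisfies the template.
    """
    wanted = set(re.findall(r"\{(\w+)\}", template))
    gaps = {}
    for i, item in enumerate(dataset_items):
        absent = wanted.difference(item)
        if absent:
            gaps[i] = absent
    return gaps
```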
Best Practices
Use consistent datasets
Always compare evaluations using the same dataset to ensure valid comparisons.
Set appropriate thresholds
Define minimum scores based on your quality requirements and adjust as you learn.
Combine metrics strategically
Use G-Eval for quality, semantic similarity for meaning, and deterministic metrics for format validation.
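Combining metric families usually means a weighted aggregate. A minimal sketch, where the weights are whatever you set in your criteria (the normalization step is an assumption so weights need not sum to 1):

```python
def weighted_score(metric_results: list[dict], weights: dict[str, float]) -> float:
    """Weighted mean of per-metric scores, normalized by total weight."""
    total = sum(weights[m["metric"]] for m in metric_results)
    return sum(m["score"] * weights[m["metric"]] for m in metric_results) / total
```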
Track trends over time
Create evaluations with the same criteria for each prompt version to monitor quality trends.