Running Evaluations
Execute evaluations and analyze results to measure prompt quality. This guide covers the complete evaluation workflow.
Start an Evaluation
Via Dashboard
- Navigate to Evaluations in the sidebar
- Click New Evaluation
- Configure your evaluation:
  - Select the prompt to evaluate
  - Select the dataset to test against
  - Choose one or more metrics
  - (Optional) Add custom metrics
- Click Run Evaluation
- Monitor progress in real-time
Via SDK
import { Mutagent } from '@mutagent/sdk';
const client = new Mutagent({ bearerAuth: 'sk_live_...' });
// Start evaluation
const evaluation = await client.evaluations.run({
promptId: 'prompt_xxxx',
datasetId: 'dataset_xxxx',
metrics: ['g_eval', 'semantic_similarity'],
});
console.log('Evaluation started:', evaluation.id);
console.log('Status:', evaluation.status);
Monitor Progress
Track evaluation status as it runs:
// Poll for status updates
async function monitorEvaluation(evaluationId: string) {
let status = await client.evaluations.get(evaluationId);
while (status.status === 'pending' || status.status === 'running') {
console.log(`Status: ${status.status}`);
console.log(`Progress: ${status.completedItems}/${status.totalItems}`);
// Wait before next check
await new Promise(resolve => setTimeout(resolve, 2000));
status = await client.evaluations.get(evaluationId);
}
return status;
}
const finalStatus = await monitorEvaluation(evaluation.id);
console.log('Evaluation complete:', finalStatus.status);
Using waitForCompletion
A simpler approach using the built-in helper:
// Automatically polls until complete
const results = await client.evaluations.waitForCompletion(evaluation.id, {
pollInterval: 2000, // ms between checks
timeout: 300000, // max wait time (5 min)
});
console.log('Results ready:', results.overallScore);
Evaluation States
| State | Description | Actions Available |
|---|---|---|
| pending | Queued, waiting to start | Cancel |
| running | Currently executing | Cancel |
| completed | Finished successfully | View results |
| failed | Error occurred | Retry, view error |
| cancelled | Manually stopped | Restart |
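If you prefer to branch on these states in code, here is a minimal sketch. It assumes the state strings above and uses only the SDK calls shown elsewhere in this guide; cancelling from the SDK is not covered here, so in-progress evaluations are simply awaited.
// Hypothetical helper: pick a follow-up action based on the evaluation state.
async function handleEvaluationState(evaluationId: string) {
  const evaluation = await client.evaluations.get(evaluationId);
  switch (evaluation.status) {
    case 'pending':
    case 'running':
      // Still queued or executing: wait for it to finish
      return client.evaluations.waitForCompletion(evaluationId);
    case 'completed':
      // Finished successfully: fetch the detailed results
      return client.evaluations.getResults(evaluationId);
    case 'failed':
      // Error occurred: log it and retry
      console.error('Evaluation failed:', evaluation.error);
      return client.evaluations.retry(evaluationId);
    case 'cancelled':
      // Manually stopped: restart explicitly if you still need the results
      console.log('Evaluation was cancelled');
      return null;
  }
}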
Get Results
Retrieve detailed evaluation results:
const results = await client.evaluations.getResults(evaluation.id);
// Overall scores
console.log('=== Evaluation Results ===');
console.log('Overall Score:', results.overallScore.toFixed(2));
console.log('Duration:', results.duration, 'ms');
console.log('');
// Per-metric scores
console.log('Metric Scores:');
for (const [metric, score] of Object.entries(results.metricScores)) {
console.log(` ${metric}: ${score.toFixed(2)}`);
}
console.log('');
// Summary statistics
const passed = results.itemResults.filter(r => r.passed).length;
const total = results.itemResults.length;
console.log(`Items: ${passed}/${total} passed (${((passed/total)*100).toFixed(1)}%)`);
Detailed Item Results
Examine individual test case results:
// Show detailed results for each item
console.log('\n=== Item Results ===');
for (const item of results.itemResults) {
console.log(`\nItem: ${item.datasetItemId}`);
console.log(`Input: ${JSON.stringify(item.input)}`);
console.log(`Output: ${item.actualOutput.substring(0, 100)}...`);
if (item.expectedOutput) {
console.log(`Expected: ${item.expectedOutput.substring(0, 100)}...`);
}
console.log('Scores:');
for (const [metric, score] of Object.entries(item.scores)) {
console.log(` ${metric}: ${score.toFixed(2)}`);
}
console.log(`Status: ${item.passed ? 'PASSED' : 'FAILED'}`);
}
Export Results
Export results for external analysis:
import fs from 'fs';
const results = await client.evaluations.getResults(evaluation.id);
// Export as JSON
fs.writeFileSync(
`evaluation-${evaluation.id}.json`,
JSON.stringify(results, null, 2)
);
// Export as CSV
const csv = [
'item_id,input,output,expected,overall_score,passed',
...results.itemResults.map(r => [
r.datasetItemId,
JSON.stringify(r.input).replace(/"/g, '""'), // escape embedded quotes for CSV
r.actualOutput.replace(/"/g, '""'),
r.expectedOutput?.replace(/"/g, '""') || '',
Object.values(r.scores).reduce((a, b) => a + b, 0) / Object.keys(r.scores).length,
r.passed,
].map(v => `"${v}"`).join(','))
].join('\n');
fs.writeFileSync(`evaluation-${evaluation.id}.csv`, csv);
Interpreting Scores
Score Thresholds
| Score Range | Quality Level | Recommendation |
|---|---|---|
| 0.95 - 1.00 | Excellent | Production ready |
| 0.85 - 0.94 | Good | Deploy with monitoring |
| 0.75 - 0.84 | Fair | Consider improvements |
| 0.65 - 0.74 | Poor | Needs significant work |
| < 0.65 | Critical | Do not deploy |
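The thresholds above are easy to encode if you want to surface the quality level programmatically. The helper below is a sketch that simply mirrors the table; adjust the cut-offs to your own requirements.
// Map an overall score (0 to 1) onto the quality levels from the table above.
function interpretScore(score: number): { level: string; recommendation: string } {
  if (score >= 0.95) return { level: 'Excellent', recommendation: 'Production ready' };
  if (score >= 0.85) return { level: 'Good', recommendation: 'Deploy with monitoring' };
  if (score >= 0.75) return { level: 'Fair', recommendation: 'Consider improvements' };
  if (score >= 0.65) return { level: 'Poor', recommendation: 'Needs significant work' };
  return { level: 'Critical', recommendation: 'Do not deploy' };
}
const { level, recommendation } = interpretScore(results.overallScore);
console.log(`Quality: ${level} (${recommendation})`);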
Analyzing Low Scores
When scores are lower than expected:
// Find failing items
const failingItems = results.itemResults.filter(r => !r.passed);
console.log(`${failingItems.length} items failed\n`);
// Analyze failure patterns
const failuresByMetric: Record<string, number> = {};
for (const item of failingItems) {
for (const [metric, score] of Object.entries(item.scores)) {
if (score < 0.7) {
failuresByMetric[metric] = (failuresByMetric[metric] || 0) + 1;
}
}
}
console.log('Failure breakdown by metric:');
for (const [metric, count] of Object.entries(failuresByMetric)) {
console.log(` ${metric}: ${count} failures`);
}
// Show worst performing items
const sorted = [...results.itemResults].sort(
(a, b) => Object.values(a.scores).reduce((x, y) => x + y, 0) -
Object.values(b.scores).reduce((x, y) => x + y, 0)
);
console.log('\nWorst 5 items:');
sorted.slice(0, 5).forEach(item => {
console.log(` ${item.datasetItemId}: ${JSON.stringify(item.input)}`);
});
Comparing Evaluations
Compare results across different prompt versions or configurations:
// Run evaluations on two prompt versions
const [evalV1, evalV2] = await Promise.all([
client.evaluations.run({
promptId: 'prompt_v1',
datasetId: 'golden_dataset',
metrics: ['g_eval', 'semantic_similarity'],
}),
client.evaluations.run({
promptId: 'prompt_v2',
datasetId: 'golden_dataset',
metrics: ['g_eval', 'semantic_similarity'],
}),
]);
// Wait for both to complete
const [resultsV1, resultsV2] = await Promise.all([
client.evaluations.waitForCompletion(evalV1.id),
client.evaluations.waitForCompletion(evalV2.id),
]);
// Compare
console.log('Version Comparison:');
console.log(`V1 Overall: ${resultsV1.overallScore.toFixed(2)}`);
console.log(`V2 Overall: ${resultsV2.overallScore.toFixed(2)}`);
const improvement = resultsV2.overallScore - resultsV1.overallScore;
console.log(`Change: ${improvement > 0 ? '+' : ''}${(improvement * 100).toFixed(1)}%`);
// Per-metric comparison
console.log('\nMetric Comparison:');
for (const metric of Object.keys(resultsV1.metricScores)) {
const v1Score = resultsV1.metricScores[metric];
const v2Score = resultsV2.metricScores[metric];
const diff = v2Score - v1Score;
console.log(` ${metric}: ${v1Score.toFixed(2)} → ${v2Score.toFixed(2)} (${diff > 0 ? '+' : ''}${diff.toFixed(2)})`);
}
Using the Compare API
// Built-in comparison helper: pass any number of evaluation IDs
const comparison = await client.evaluations.compare([
  evalV1.id,
  evalV2.id,
  // evalV3.id, // add further evaluations here if you have them
]);
console.log('Comparison Results:');
comparison.results.forEach(result => {
console.log(` ${result.evaluationId}: ${result.overallScore.toFixed(2)}`);
});
console.log('Best:', comparison.best.evaluationId);
Handling Failures
Retry Failed Evaluations
const evaluation = await client.evaluations.get('eval_xxxx');
if (evaluation.status === 'failed') {
console.log('Evaluation failed:', evaluation.error);
// Retry the evaluation
const retry = await client.evaluations.retry('eval_xxxx');
console.log('Retry started:', retry.id);
}
Common Failure Causes
- LLM API returned an error. Check provider configuration and rate limits.
- Evaluation took too long. Try a smaller dataset or increase the timeout.
- Prompt template has syntax errors or missing variables.
- Dataset items have invalid or missing required fields.
Scheduling Evaluations
Run evaluations automatically on a schedule:
// Note: Scheduling API coming soon
const schedule = await client.evaluations.schedule({
promptId: 'prompt_xxxx',
datasetId: 'dataset_xxxx',
metrics: ['g_eval'],
schedule: {
frequency: 'daily', // 'hourly', 'daily', 'weekly'
time: '00:00', // interpreted in the timezone below
timezone: 'America/New_York',
},
notifications: {
onComplete: true,
onFailure: true,
onRegressionDetected: true,
regressionThreshold: 0.05,
},
});
console.log('Schedule created:', schedule.id);
Scheduled evaluations are coming soon. Currently, you can implement scheduling using cron jobs or CI/CD pipelines calling the SDK.
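As a stopgap, here is a hedged sketch of a cron-driven script that uses only the SDK calls from this guide. The file name, the MUTAGENT_API_KEY environment variable, the baseline file, and the 0.05 regression threshold are all illustrative choices, not part of the SDK.
// nightly-eval.ts: run via cron, e.g. `0 0 * * * npx tsx nightly-eval.ts`
import fs from 'fs';
import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: process.env.MUTAGENT_API_KEY! });

async function main() {
  const evaluation = await client.evaluations.run({
    promptId: 'prompt_xxxx',
    datasetId: 'dataset_xxxx',
    metrics: ['g_eval'],
  });
  const results = await client.evaluations.waitForCompletion(evaluation.id);

  // Compare against the previous run and flag regressions
  const baselinePath = 'eval-baseline.json';
  if (fs.existsSync(baselinePath)) {
    const baseline = JSON.parse(fs.readFileSync(baselinePath, 'utf8'));
    if (baseline.overallScore - results.overallScore > 0.05) {
      console.error(`Regression: ${baseline.overallScore} -> ${results.overallScore}`);
      process.exit(1); // non-zero exit lets cron/CI alerting pick it up
    }
  }
  fs.writeFileSync(baselinePath, JSON.stringify({ overallScore: results.overallScore }));
}

main();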
Best Practices
- Use consistent datasets: always compare evaluations using the same dataset to ensure valid comparisons.
- Set appropriate thresholds: define minimum scores based on your quality requirements and adjust as you learn.
- Evaluate before deploying: integrate evaluations into your deployment pipeline as quality gates (see the sketch below).
- Track scores over time: monitor evaluation scores over time to catch gradual degradation.
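For the deployment-gate practice, a minimal sketch; the 0.85 minimum is an example cut-off borrowed from the score threshold table, and the prompt and dataset IDs are placeholders.
// Fail the pipeline step if the prompt scores below the chosen minimum.
const MIN_SCORE = 0.85; // example: require "Good" or better

const gateEval = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'golden_dataset',
  metrics: ['g_eval', 'semantic_similarity'],
});
const gateResults = await client.evaluations.waitForCompletion(gateEval.id);

if (gateResults.overallScore < MIN_SCORE) {
  console.error(`Quality gate failed: ${gateResults.overallScore.toFixed(2)} < ${MIN_SCORE}`);
  process.exit(1); // blocks the deployment
}
console.log('Quality gate passed');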