Running Evaluations

Execute evaluations and analyze results to measure prompt quality. This guide covers the complete evaluation workflow.

Start an Evaluation

Via Dashboard

  1. Navigate to Evaluations in the sidebar
  2. Click New Evaluation
  3. Configure your evaluation:
    • Select the prompt to evaluate
    • Select the dataset to test against
    • Choose one or more metrics
    • (Optional) Add custom metrics
  4. Click Run Evaluation
  5. Monitor progress in real-time

Via SDK

import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: 'sk_live_...' });

// Start evaluation
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity'],
});

console.log('Evaluation started:', evaluation.id);
console.log('Status:', evaluation.status);

Monitor Progress

Track evaluation status as it runs:
// Poll for status updates
async function monitorEvaluation(evaluationId: string) {
  let status = await client.evaluations.get(evaluationId);

  while (status.status === 'pending' || status.status === 'running') {
    console.log(`Status: ${status.status}`);
    console.log(`Progress: ${status.completedItems}/${status.totalItems}`);

    // Wait before next check
    await new Promise(resolve => setTimeout(resolve, 2000));
    status = await client.evaluations.get(evaluationId);
  }

  return status;
}

const finalStatus = await monitorEvaluation(evaluation.id);
console.log('Evaluation complete:', finalStatus.status);

Using waitForCompletion

A simpler approach using the built-in helper:
// Automatically polls until complete
const results = await client.evaluations.waitForCompletion(evaluation.id, {
  pollInterval: 2000,  // ms between checks
  timeout: 300000,     // max wait time (5 min)
});

console.log('Results ready:', results.overallScore);

Evaluation States

State       Description                 Actions Available
pending     Queued, waiting to start    Cancel
running     Currently executing         Cancel
completed   Finished successfully       View results
failed      Error occurred              Retry, view error
cancelled   Manually stopped            Restart
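
If you branch on these states in code, an explicit switch keeps the handling readable. A minimal sketch, assuming the client from the earlier examples; the EvaluationStatus union is written out here for illustration and is not an exported SDK type:
type EvaluationStatus = 'pending' | 'running' | 'completed' | 'failed' | 'cancelled';

async function handleEvaluation(evaluationId: string) {
  const evaluation = await client.evaluations.get(evaluationId);

  switch (evaluation.status as EvaluationStatus) {
    case 'pending':
    case 'running':
      console.log('Still in progress - poll again later or cancel');
      break;
    case 'completed':
      return client.evaluations.getResults(evaluationId);
    case 'failed':
      console.error('Evaluation failed:', evaluation.error);
      break;
    case 'cancelled':
      console.log('Evaluation was cancelled - restart it to produce results');
      break;
  }
}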

Get Results

Retrieve detailed evaluation results:
const results = await client.evaluations.getResults(evaluation.id);

// Overall scores
console.log('=== Evaluation Results ===');
console.log('Overall Score:', results.overallScore.toFixed(2));
console.log('Duration:', results.duration, 'ms');
console.log('');

// Per-metric scores
console.log('Metric Scores:');
for (const [metric, score] of Object.entries(results.metricScores)) {
  console.log(`  ${metric}: ${score.toFixed(2)}`);
}
console.log('');

// Summary statistics
const passed = results.itemResults.filter(r => r.passed).length;
const total = results.itemResults.length;
console.log(`Items: ${passed}/${total} passed (${((passed/total)*100).toFixed(1)}%)`);

Detailed Item Results

Examine individual test case results:
// Show detailed results for each item
console.log('\n=== Item Results ===');
for (const item of results.itemResults) {
  console.log(`\nItem: ${item.datasetItemId}`);
  console.log(`Input: ${JSON.stringify(item.input)}`);
  console.log(`Output: ${item.actualOutput.substring(0, 100)}...`);

  if (item.expectedOutput) {
    console.log(`Expected: ${item.expectedOutput.substring(0, 100)}...`);
  }

  console.log('Scores:');
  for (const [metric, score] of Object.entries(item.scores)) {
    console.log(`  ${metric}: ${score.toFixed(2)}`);
  }

  console.log(`Status: ${item.passed ? 'PASSED' : 'FAILED'}`);
}

Export Results

Export results for external analysis:
import fs from 'fs';

const results = await client.evaluations.getResults(evaluation.id);

// Export as JSON
fs.writeFileSync(
  `evaluation-${evaluation.id}.json`,
  JSON.stringify(results, null, 2)
);

// Export as CSV (escape embedded double quotes in every field)
const csv = [
  'item_id,input,output,expected,overall_score,passed',
  ...results.itemResults.map(r => [
    r.datasetItemId,
    JSON.stringify(r.input),
    r.actualOutput,
    r.expectedOutput ?? '',
    Object.values(r.scores).reduce((a, b) => a + b, 0) / Object.keys(r.scores).length,
    r.passed,
  ].map(v => `"${String(v).replace(/"/g, '""')}"`).join(','))
].join('\n');

fs.writeFileSync(`evaluation-${evaluation.id}.csv`, csv);

Interpreting Scores

Score Thresholds

Score Range    Quality Level    Recommendation
0.95 - 1.00    Excellent        Production ready
0.85 - 0.94    Good             Deploy with monitoring
0.75 - 0.84    Fair             Consider improvements
0.65 - 0.74    Poor             Needs significant work
< 0.65         Critical         Do not deploy
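
To apply the same thresholds in code, a small helper can translate an overall score into the quality levels above. This helper is not part of the SDK; it simply mirrors the table:
// Map an overall score to the quality levels in the table above (illustrative helper)
function qualityLevel(score: number): { level: string; recommendation: string } {
  if (score >= 0.95) return { level: 'Excellent', recommendation: 'Production ready' };
  if (score >= 0.85) return { level: 'Good', recommendation: 'Deploy with monitoring' };
  if (score >= 0.75) return { level: 'Fair', recommendation: 'Consider improvements' };
  if (score >= 0.65) return { level: 'Poor', recommendation: 'Needs significant work' };
  return { level: 'Critical', recommendation: 'Do not deploy' };
}

const { level, recommendation } = qualityLevel(results.overallScore);
console.log(`Quality: ${level} (${recommendation})`);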

Analyzing Low Scores

When scores are lower than expected:
// Find failing items
const failingItems = results.itemResults.filter(r => !r.passed);

console.log(`${failingItems.length} items failed\n`);

// Analyze failure patterns
const failuresByMetric: Record<string, number> = {};

for (const item of failingItems) {
  for (const [metric, score] of Object.entries(item.scores)) {
    if (score < 0.7) {
      failuresByMetric[metric] = (failuresByMetric[metric] || 0) + 1;
    }
  }
}

console.log('Failure breakdown by metric:');
for (const [metric, count] of Object.entries(failuresByMetric)) {
  console.log(`  ${metric}: ${count} failures`);
}

// Show worst performing items
const sorted = [...results.itemResults].sort(
  (a, b) => Object.values(a.scores).reduce((x, y) => x + y, 0) -
            Object.values(b.scores).reduce((x, y) => x + y, 0)
);

console.log('\nWorst 5 items:');
sorted.slice(0, 5).forEach(item => {
  console.log(`  ${item.datasetItemId}: ${JSON.stringify(item.input)}`);
});

Comparing Evaluations

Compare results across different prompt versions or configurations:
// Run evaluations on two prompt versions
const [evalV1, evalV2] = await Promise.all([
  client.evaluations.run({
    promptId: 'prompt_v1',
    datasetId: 'golden_dataset',
    metrics: ['g_eval', 'semantic_similarity'],
  }),
  client.evaluations.run({
    promptId: 'prompt_v2',
    datasetId: 'golden_dataset',
    metrics: ['g_eval', 'semantic_similarity'],
  }),
]);

// Wait for both to complete
const [resultsV1, resultsV2] = await Promise.all([
  client.evaluations.waitForCompletion(evalV1.id),
  client.evaluations.waitForCompletion(evalV2.id),
]);

// Compare
console.log('Version Comparison:');
console.log(`V1 Overall: ${resultsV1.overallScore.toFixed(2)}`);
console.log(`V2 Overall: ${resultsV2.overallScore.toFixed(2)}`);

const improvement = resultsV2.overallScore - resultsV1.overallScore;
console.log(`Change: ${improvement > 0 ? '+' : ''}${(improvement * 100).toFixed(1)}%`);

// Per-metric comparison
console.log('\nMetric Comparison:');
for (const metric of Object.keys(resultsV1.metricScores)) {
  const v1Score = resultsV1.metricScores[metric];
  const v2Score = resultsV2.metricScores[metric];
  const diff = v2Score - v1Score;
  console.log(`  ${metric}: ${v1Score.toFixed(2)} → ${v2Score.toFixed(2)} (${diff > 0 ? '+' : ''}${diff.toFixed(2)})`);
}

Using the Compare API

// Built-in comparison helper
const comparison = await client.evaluations.compare([
  evalV1.id,
  evalV2.id,
  // ...include any additional evaluation IDs here
]);

console.log('Comparison Results:');
comparison.results.forEach(result => {
  console.log(`  ${result.evaluationId}: ${result.overallScore.toFixed(2)}`);
});

console.log('Best:', comparison.best.evaluationId);

Handling Failures

Retry Failed Evaluations

const evaluation = await client.evaluations.get('eval_xxxx');

if (evaluation.status === 'failed') {
  console.log('Evaluation failed:', evaluation.error);

  // Retry the evaluation
  const retry = await client.evaluations.retry('eval_xxxx');
  console.log('Retry started:', retry.id);
}

Common Failure Causes

  • LLM API returned an error. Check provider configuration and rate limits.
  • Evaluation took too long. Try a smaller dataset or increase the timeout.
  • Prompt template has syntax errors or missing variables.
  • Dataset items have invalid or missing required fields.

Scheduling Evaluations

Run evaluations automatically on a schedule:
// Note: Scheduling API coming soon
const schedule = await client.evaluations.schedule({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval'],
  schedule: {
    frequency: 'daily',      // 'hourly', 'daily', 'weekly'
    time: '00:00',           // UTC
    timezone: 'America/New_York',
  },
  notifications: {
    onComplete: true,
    onFailure: true,
    onRegressionDetected: true,
    regressionThreshold: 0.05,
  },
});

console.log('Schedule created:', schedule.id);

Note: Scheduled evaluations are coming soon. Until then, you can implement scheduling with a cron job or CI/CD pipeline that calls the SDK, as in the sketch below.
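
For example, a script like the following could run nightly from cron or a scheduled CI job. It is a sketch only: the baseline file, environment variable name, and regression threshold are illustrative choices, not SDK features.
// run-nightly-eval.ts - invoked by cron or a scheduled CI job (illustrative sketch)
import fs from 'fs';
import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: process.env.MUTAGENT_API_KEY! }); // env var name is an example
const BASELINE_FILE = 'baseline-score.json'; // hypothetical local file holding the previous score
const REGRESSION_THRESHOLD = 0.05;           // plays the role of regressionThreshold above

async function main() {
  const evaluation = await client.evaluations.run({
    promptId: 'prompt_xxxx',
    datasetId: 'dataset_xxxx',
    metrics: ['g_eval'],
  });

  const results = await client.evaluations.waitForCompletion(evaluation.id, {
    pollInterval: 2000,
    timeout: 300000,
  });

  // Compare against the previous run and flag regressions
  const previous = fs.existsSync(BASELINE_FILE)
    ? JSON.parse(fs.readFileSync(BASELINE_FILE, 'utf8')).overallScore
    : null;

  if (previous !== null && previous - results.overallScore > REGRESSION_THRESHOLD) {
    console.error(`Regression: ${previous.toFixed(2)} -> ${results.overallScore.toFixed(2)}`);
    process.exit(1); // non-zero exit lets cron/CI alerting surface the failure
  }

  fs.writeFileSync(BASELINE_FILE, JSON.stringify({ overallScore: results.overallScore }));
  console.log('Scheduled evaluation passed:', results.overallScore.toFixed(2));
}

main();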

Best Practices

  • Always compare evaluations using the same dataset to ensure valid comparisons.
  • Define minimum scores based on your quality requirements and adjust them as you learn.
  • Integrate evaluations into your deployment pipeline as quality gates (see the sketch below).
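
For the quality-gate practice, a deployment step can fail the pipeline when the score drops below your minimum. A minimal sketch, assuming the client from the earlier examples; the 0.85 threshold and the prompt/dataset IDs are example values, not SDK defaults:
const MIN_SCORE = 0.85; // example threshold - set it from your own requirements

const gateEval = await client.evaluations.run({
  promptId: 'prompt_candidate',   // hypothetical IDs for illustration
  datasetId: 'golden_dataset',
  metrics: ['g_eval', 'semantic_similarity'],
});

const gateResults = await client.evaluations.waitForCompletion(gateEval.id);

if (gateResults.overallScore < MIN_SCORE) {
  console.error(`Quality gate failed: ${gateResults.overallScore.toFixed(2)} < ${MIN_SCORE}`);
  process.exit(1); // non-zero exit blocks the deployment step
}

console.log('Quality gate passed:', gateResults.overallScore.toFixed(2));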