
Running Evaluations

Evaluations in Mutagent are definition entities — they specify what to evaluate and how. There is no separate “run evaluation” CLI command: execution happens within the optimization loop or is triggered from the dashboard UI, while the CLI is used to create evaluation definitions and retrieve results. This guide covers how to create evaluations, retrieve results, and interpret scores.

Create an Evaluation

Via Dashboard

  1. Navigate to a prompt’s Evaluations tab
  2. Click New Evaluation
  3. Configure your evaluation:
    • Select the dataset to test against
    • Define evaluation criteria (metrics, weights, thresholds)
    • Configure LLM settings for the judge model
  4. Click Create Evaluation
  5. The evaluation will execute when optimization runs or when triggered from the dashboard

Via CLI

# Create an evaluation definition with criteria from a JSON file
mutagent prompts evaluation create 123 \
  --name "Quality Check" \
  --file criteria.json

# List evaluations for a prompt
mutagent prompts evaluation list --prompt-id 123

# Get evaluation details
mutagent prompts evaluation get 1
The criteria.json file contains your evaluation configuration:
{
  "datasetId": 456,
  "evalConfig": {
    "criteria": [
      { "field": "output", "metric": "g_eval", "weight": 0.5 },
      { "field": "output", "metric": "semantic_similarity", "weight": 0.5 }
    ],
    "threshold": 0.8
  },
  "llmConfig": {
    "model": "claude-sonnet-4-6",
    "temperature": 0.7
  }
}
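The weights control how much each criterion contributes to the overall score. As a rough sketch — assuming a weight-normalized average, which may not match the platform’s exact aggregation formula — the combination could look like this:

```typescript
interface Criterion {
  field: string;
  metric: string;
  weight: number;
}

// Hypothetical sketch of weighted aggregation -- the platform's actual
// scoring formula may differ.
function compositeScore(
  criteria: Criterion[],
  metricScores: Record<string, number>,
): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weightedSum = criteria.reduce(
    (sum, c) => sum + c.weight * (metricScores[c.metric] ?? 0),
    0,
  );
  return weightedSum / totalWeight;
}

const score = compositeScore(
  [
    { field: 'output', metric: 'g_eval', weight: 0.5 },
    { field: 'output', metric: 'semantic_similarity', weight: 0.5 },
  ],
  { g_eval: 0.9, semantic_similarity: 0.8 },
);
console.log(score >= 0.8 ? 'meets threshold' : 'below threshold');
```

With equal weights this reduces to a plain average, which is why the two 0.5 weights in the example above behave symmetrically.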

Via SDK

import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ apiKey: process.env.MUTAGENT_API_KEY });

// Create evaluation definition
const evaluation = await client.promptEvaluations.createEvaluation({
  promptId: 123,
  datasetId: 456,
  name: 'Customer Support Quality',
  evalConfig: {
    criteria: [
      { field: 'output', metric: 'g_eval', weight: 0.5 },
      { field: 'output', metric: 'semantic_similarity', weight: 0.5 },
    ],
    threshold: 0.8,
  },
  llmConfig: {
    model: 'claude-sonnet-4-6',
    temperature: 0.7,
  },
});

console.log('Evaluation created:', evaluation.id);

How Evaluations Execute

Evaluations produce results through two paths:
  1. Dashboard - Trigger evaluation execution manually from the prompt evaluations page
  2. Optimization Loop - Evaluations run automatically during each optimization iteration to score prompt variants
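If you are driving execution programmatically, results may not be available immediately after a run starts. A small polling helper — hypothetical, not part of the SDK — can wrap whatever fetch call you use (such as getEvaluationResult from the Get Results section below), returning null while the evaluation is still pending:

```typescript
interface EvalResult {
  score: number;
  success: boolean;
}

// Hypothetical helper: poll a fetcher until it yields a result or the
// attempt budget runs out. Pass a closure around your SDK call that
// resolves to null while the evaluation is still pending.
async function waitForResult(
  fetchResult: () => Promise<EvalResult | null>,
  { maxAttempts = 30, intervalMs = 2000 } = {},
): Promise<EvalResult> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchResult();
    if (result !== null) return result;
    // Evaluation still running; wait before the next check
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Evaluation did not complete within the polling budget');
}
```

The attempt budget and interval are illustrative defaults; tune them to your dataset size and provider latency.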

Get Results

Retrieve evaluation results once execution completes:
// Get results for a specific evaluation
const result = await client.promptEvaluations.getEvaluationResult({ id: 1 });

console.log('=== Evaluation Results ===');
console.log('Score:', result.score);
console.log('Success:', result.success);
console.log('Execution Time:', result.executionTime, 'ms');

// Per-metric breakdown
console.log('Metric Results:');
for (const [metric, score] of Object.entries(result.metricResults)) {
  console.log(`  ${metric}: ${score}`);
}

Via CLI

# Get evaluation results
mutagent prompts evaluation result 1

Via cURL

curl https://api.mutagent.io/api/prompts/evaluations/1/result \
  -H "x-api-key: mt_xxxx"

Interpreting Scores

Score Thresholds

| Score Range | Quality Level | Recommendation |
| --- | --- | --- |
| 0.95 - 1.00 | Excellent | Production ready |
| 0.85 - 0.94 | Good | Deploy with monitoring |
| 0.75 - 0.84 | Fair | Consider improvements |
| 0.65 - 0.74 | Poor | Needs significant work |
| < 0.65 | Critical | Do not deploy |
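These bands can be encoded directly if you want to label scores in reports or CI output; the cutoffs below simply mirror the table above:

```typescript
// Map a composite score to the quality levels from the table above.
function qualityLevel(score: number): string {
  if (score >= 0.95) return 'Excellent';
  if (score >= 0.85) return 'Good';
  if (score >= 0.75) return 'Fair';
  if (score >= 0.65) return 'Poor';
  return 'Critical';
}

console.log(qualityLevel(0.91)); // "Good"
```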

Analyzing Low Scores

When scores are lower than expected, examine the per-metric breakdown in metricResults:
const result = await client.promptEvaluations.getEvaluationResult({ id: evalId });

// Check for presence explicitly: a score of 0 is falsy and would be
// silently skipped by a plain truthiness check
if (result.score != null && result.score < 0.8) {
  console.log('Low score detected. Metric breakdown:');
  for (const [metric, score] of Object.entries(result.metricResults)) {
    const status = typeof score === 'number' && score < 0.7 ? 'FAILING' : 'OK';
    console.log(`  ${metric}: ${score} [${status}]`);
  }
}

Comparing Evaluations

Compare results across different prompt versions by creating evaluations with the same dataset and criteria for each version:
// Create evaluations for two prompt versions (same dataset, same criteria)
const evalConfig = {
  criteria: [
    { field: 'output', metric: 'g_eval', weight: 0.5 },
    { field: 'output', metric: 'semantic_similarity', weight: 0.5 },
  ],
};

const evalV1 = await client.promptEvaluations.createEvaluation({
  promptId: promptV1Id,
  datasetId: goldenDatasetId,
  name: 'V1 Evaluation',
  evalConfig,
});

const evalV2 = await client.promptEvaluations.createEvaluation({
  promptId: promptV2Id,
  datasetId: goldenDatasetId,
  name: 'V2 Evaluation',
  evalConfig,
});

// After optimization runs, compare results
const resultV1 = await client.promptEvaluations.getEvaluationResult({ id: evalV1.id });
const resultV2 = await client.promptEvaluations.getEvaluationResult({ id: evalV2.id });

if (resultV1.score != null && resultV2.score != null) {
  const improvement = resultV2.score - resultV1.score;
  console.log(`V1 Score: ${resultV1.score.toFixed(2)}`);
  console.log(`V2 Score: ${resultV2.score.toFixed(2)}`);
  // Difference in percentage points, not relative change
  console.log(`Change: ${improvement > 0 ? '+' : ''}${(improvement * 100).toFixed(1)} pts`);
}

Handling Failures

Common Failure Causes

  • Provider error: the LLM API returned an error. Check provider configuration in Settings > Providers and verify rate limits.
  • Timeout: the evaluation took too long. Try a smaller dataset or check whether the LLM provider is experiencing high latency.
  • Template error: the prompt template has syntax errors or missing variables. Verify that the prompt variables match the dataset schema.
  • Invalid dataset: dataset items have invalid or missing required fields. Ensure all dataset items have the fields referenced in your criteria.
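Transient provider errors and timeouts often clear on their own. A generic retry wrapper with exponential backoff — hypothetical, not part of the Mutagent SDK — is one way to harden unattended scripts:

```typescript
// Hypothetical retry helper with exponential backoff for transient
// failures (provider errors, timeouts). Not part of the Mutagent SDK.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off between attempts: 500ms, 1000ms, 2000ms, ...
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt),
      );
    }
  }
  throw lastError;
}
```

Wrapping a fetch call such as getEvaluationResult in withRetry smooths over brief provider hiccups, while persistent failures still surface after the final attempt.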

Best Practices

  • Always compare evaluations using the same dataset to ensure valid comparisons.
  • Define minimum score thresholds based on your quality requirements and adjust them as you learn.
  • Use G-Eval for quality, semantic similarity for meaning, and deterministic metrics for format validation.