Evaluations SDK

Create and manage prompt evaluations. Evaluations are definition entities that specify how to evaluate a prompt against a dataset. Results are produced by the optimization loop or the dashboard — there is no standalone “run evaluation” API from the SDK.

List Evaluations

Retrieve a paginated list of evaluations. Filter by promptId, promptGroupId, datasetId, name, or createdBy.
import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ apiKey: process.env.MUTAGENT_API_KEY });

const page = await client.promptEvaluations.listEvaluations({
  promptId: 42,
  limit: 20,
  offset: 0,
});

for await (const evaluation of page) {
  console.log(evaluation.id, evaluation.name);
}

Request Parameters

Parameter       Type     Description
promptId        number   Filter by prompt ID
promptGroupId   string   Filter by prompt group UUID
datasetId       number   Filter by dataset ID
name            string   Filter by exact evaluation name
createdBy       string   Filter by creator email
limit           number   Results per page (1-100)
offset          number   Number of results to skip
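
Filters can be combined in a single request. The sketch below assumes combined filters are ANDed together, which the parameter table does not state explicitly.
const filtered = await client.promptEvaluations.listEvaluations({
  datasetId: 7,                 // only evaluations against dataset 7
  createdBy: 'qa@example.com',  // illustrative value
  limit: 50,
  offset: 0,
});

for await (const evaluation of filtered) {
  console.log(evaluation.id, evaluation.name, evaluation.createdBy);
}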

Create Evaluation

Create a new evaluation definition that links a prompt to a dataset and specifies the evaluation configuration.
const evaluation = await client.promptEvaluations.createEvaluation({
  promptId: 42,
  datasetId: 7,
  name: 'Customer Support Quality Eval',
  description: 'Evaluate tone, accuracy, and helpfulness',
  evalConfig: {
    metrics: ['g_eval', 'semantic_similarity'],
    threshold: 0.8,
  },
  llmConfig: {
    model: 'gpt-4o',
    temperature: 0,
  },
  tags: ['production', 'baseline'],
  metadata: { team: 'support' },
});

console.log('Created evaluation:', evaluation.id);

Request Body

Field         Type       Required   Description
promptId      number     Yes        ID of the prompt to evaluate
datasetId     number     Yes        ID of the test dataset
name          string     Yes        Human-readable name (max 255 chars)
description   string     No         Evaluation purpose and methodology
evalConfig    object     No         Metrics, thresholds, and evaluation parameters
llmConfig     object     No         Model, temperature, and LLM execution settings
tags          string[]   No         Organization tags
metadata      object     No         Arbitrary metadata for tracking
createdBy     string     No         Creator email
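
Only promptId, datasetId, and name are required. A minimal create call looks like the sketch below; how the service defaults evalConfig and llmConfig when they are omitted is not documented here.
const minimal = await client.promptEvaluations.createEvaluation({
  promptId: 42,
  datasetId: 7,
  name: 'Baseline Eval',
});

console.log('Created evaluation:', minimal.id);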

Get Evaluation

Retrieve a single evaluation definition by its ID.
const evaluation = await client.promptEvaluations.getEvaluation({
  id: 456,
});

console.log(evaluation.name);
console.log('Dataset:', evaluation.datasetId);
console.log('Config:', evaluation.evalConfig);

Get Results

Retrieve the execution results for an evaluation. Results include the actual LLM output, pass/fail status, numeric score, and per-metric breakdowns.
const result = await client.promptEvaluations.getEvaluationResult({
  id: 456,
});

console.log('Score:', result.score);
console.log('Passed:', result.success);
console.log('Execution time:', result.executionTime, 'ms');
console.log('Metrics:', result.metricResults);
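
metricResults is returned as unknown, so per-metric details need to be narrowed by the caller. The sketch below assumes a map of metric name to { score, passed }; that shape is purely illustrative, not part of the documented contract.
// Assumption: the shape of metricResults is not specified by the SDK types.
interface AssumedMetricResult {
  score: number;
  passed: boolean;
}

const metrics = result.metricResults as Record<string, AssumedMetricResult> | null;
if (metrics) {
  for (const [name, metric] of Object.entries(metrics)) {
    console.log(`${name}: ${metric.score} (${metric.passed ? 'pass' : 'fail'})`);
  }
}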

Poll for Completion

Since evaluations run asynchronously (via the optimization loop or dashboard), you can poll for results.
async function waitForResults(evalId: number): Promise<void> {
  const maxAttempts = 30;

  for (let i = 0; i < maxAttempts; i++) {
    try {
      const result = await client.promptEvaluations.getEvaluationResult({
        id: evalId,
      });

      console.log('Score:', result.score, '| Passed:', result.success);
      return;
    } catch {
      // Results not ready yet
      console.log(`Waiting for results... (attempt ${i + 1}/${maxAttempts})`);
      await new Promise(r => setTimeout(r, 2000));
    }
  }

  throw new Error('Timed out waiting for evaluation results');
}
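
A typical flow creates the definition, lets the optimization loop or dashboard execute it, and then polls:
const evaluation = await client.promptEvaluations.createEvaluation({
  promptId: 42,
  datasetId: 7,
  name: 'Customer Support Quality Eval',
});

// Execution is triggered from the optimization loop or the dashboard,
// so results may take a while to appear.
await waitForResults(evaluation.id);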

Type Definitions

interface Evaluation {
  id: number;
  promptGroupId: string;
  datasetId: number;
  name: string;
  description: string | null;
  evalConfig: unknown;
  llmConfig: unknown;
  tags: string[] | null;
  metadata: unknown;
  createdAt: string | null;
  updatedAt: string | null;
  createdBy: string | null;
}

interface EvaluationResult {
  id: number;
  evaluationId: number;
  actualOutput: unknown;
  success: boolean;
  score: number | null;
  metricResults: unknown;
  executionTime: number | null;
  createdAt: string | null;
}
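
Several fields are nullable or typed as unknown, so consuming code should guard before use. The helper below is a small sketch (not part of the SDK) that formats an Evaluation while handling the nullable fields.
function summarizeEvaluation(evaluation: Evaluation): string {
  const tags = evaluation.tags?.join(', ') ?? 'none';   // tags may be null
  const created = evaluation.createdAt ?? 'unknown';    // timestamps may be null
  return `#${evaluation.id} "${evaluation.name}" (dataset ${evaluation.datasetId}, tags: ${tags}, created: ${created})`;
}

const fetched = await client.promptEvaluations.getEvaluation({ id: 456 });
console.log(summarizeEvaluation(fetched));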

Method Reference

Method                                              Description
promptEvaluations.listEvaluations({ ...filters })   List evaluations with pagination and filters
promptEvaluations.createEvaluation({ ...body })     Create an evaluation definition
promptEvaluations.getEvaluation({ id })             Get an evaluation by ID
promptEvaluations.getEvaluationResult({ id })       Get evaluation results

REST API Reference

The SDK methods map to these HTTP endpoints:
SDK Method            HTTP Endpoint
listEvaluations       GET /api/prompts/evaluations
createEvaluation      POST /api/prompts/evaluations
getEvaluation         GET /api/prompts/evaluations/:id
getEvaluationResult   GET /api/prompts/evaluations/:id/result
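
If you are not using the SDK, the same endpoints can be called directly over HTTP. The sketch below assumes bearer-token authentication and an illustrative base URL; check your deployment for the actual host and auth scheme, neither of which is documented on this page.
// Assumptions: the base URL and Authorization header format are illustrative.
const baseUrl = 'https://api.mutagent.example.com';

const response = await fetch(`${baseUrl}/api/prompts/evaluations/456`, {
  headers: { Authorization: `Bearer ${process.env.MUTAGENT_API_KEY}` },
});

const evaluation = await response.json();
console.log(evaluation.name);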