Evaluations SDK
Create and manage prompt evaluations. Evaluations are definition entities that specify how to evaluate a prompt against a dataset. Results are produced by the optimization loop or the dashboard — there is no standalone “run evaluation” API from the SDK.
List Evaluations
Retrieve a paginated list of evaluations. Filter by promptId, promptGroupId, datasetId, name, or createdBy.
import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ apiKey: process.env.MUTAGENT_API_KEY });

const page = await client.promptEvaluations.listEvaluations({
  promptId: 42,
  limit: 20,
  offset: 0,
});

for await (const evaluation of page) {
  console.log(evaluation.id, evaluation.name);
}
Request Parameters
| Parameter | Type | Description |
|---|---|---|
| promptId | number | Filter by prompt ID |
| promptGroupId | string | Filter by prompt group UUID |
| datasetId | number | Filter by dataset ID |
| name | string | Filter by exact evaluation name |
| createdBy | string | Filter by creator email |
| limit | number | Results per page (1-100) |
| offset | number | Number of results to skip |
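The limit and offset parameters can be combined to walk through every matching evaluation. The following is a minimal sketch of that loop, not the SDK's own pagination mechanism. It assumes a hypothetical `fetchPage` callback that you supply (for example, a thin wrapper around listEvaluations that returns the items of one page as an array) and treats a page shorter than `limit` as the end of the list. The `collectAll` name and the `maxItems` safety cap are illustrative.

```typescript
// Hypothetical callback: returns the items of one page for a given limit/offset.
type FetchPage<T> = (args: { limit: number; offset: number }) => Promise<T[]>;

// Walk limit/offset pages until a short (or empty) page signals the end.
async function collectAll<T>(
  fetchPage: FetchPage<T>,
  limit = 20,
  maxItems = 1000, // safety cap so a misbehaving backend cannot loop forever
): Promise<T[]> {
  // The documented range for limit is 1-100.
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new RangeError('limit must be an integer between 1 and 100');
  }
  const items: T[] = [];
  for (let offset = 0; items.length < maxItems; offset += limit) {
    const page = await fetchPage({ limit, offset });
    items.push(...page);
    if (page.length < limit) break; // last page reached
  }
  return items.slice(0, maxItems);
}
```

Keeping the fetch step behind a callback means the loop does not depend on the exact shape of the SDK's page object, which this page does not document.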
Create Evaluation
Create a new evaluation definition linking a prompt to a dataset with evaluation configuration.
const evaluation = await client.promptEvaluations.createEvaluation({
  promptId: 42,
  datasetId: 7,
  name: 'Customer Support Quality Eval',
  description: 'Evaluate tone, accuracy, and helpfulness',
  evalConfig: {
    metrics: ['g_eval', 'semantic_similarity'],
    threshold: 0.8,
  },
  llmConfig: {
    model: 'gpt-4o',
    temperature: 0,
  },
  tags: ['production', 'baseline'],
  metadata: { team: 'support' },
});

console.log('Created evaluation:', evaluation.id);
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| promptId | number | Yes | ID of the prompt to evaluate |
| datasetId | number | Yes | ID of the test dataset |
| name | string | Yes | Human-readable name (max 255 chars) |
| description | string | No | Evaluation purpose and methodology |
| evalConfig | object | No | Metrics, thresholds, and evaluation parameters |
| llmConfig | object | No | Model, temperature, and LLM execution settings |
| tags | string[] | No | Organization tags |
| metadata | object | No | Arbitrary metadata for tracking |
| createdBy | string | No | Creator email |
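The required fields and the name length limit in the table above can be checked locally before a request is sent. The sketch below is one possible pre-flight check, not part of the SDK: `CreateEvaluationBody` and `validateCreateBody` are hypothetical names, only a subset of fields is modeled, and the 255 limit is applied to JavaScript string length (UTF-16 code units), which may differ from how the server counts characters. The server remains the authority on what is accepted.

```typescript
// Shape taken from the Request Body table; most optional fields omitted for brevity.
interface CreateEvaluationBody {
  promptId: number;
  datasetId: number;
  name: string;
  description?: string;
}

// Return a list of problems; an empty list means the body passes these local checks.
function validateCreateBody(body: Partial<CreateEvaluationBody>): string[] {
  const errors: string[] = [];
  if (typeof body.promptId !== 'number') errors.push('promptId is required');
  if (typeof body.datasetId !== 'number') errors.push('datasetId is required');
  if (typeof body.name !== 'string' || body.name.length === 0) {
    errors.push('name is required');
  } else if (body.name.length > 255) {
    errors.push('name must be at most 255 characters');
  }
  return errors;
}
```

Returning a list rather than throwing on the first problem lets a caller report every issue at once.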
Get Evaluation
Retrieve a single evaluation definition by its ID.
const evaluation = await client.promptEvaluations.getEvaluation({
  id: 456,
});
console.log(evaluation.name);
console.log('Dataset:', evaluation.datasetId);
console.log('Config:', evaluation.evalConfig);
Get Results
Retrieve the execution results for an evaluation. Results include the actual LLM output, pass/fail status, numeric score, and per-metric breakdowns.
const result = await client.promptEvaluations.getEvaluationResult({
  id: 456,
});
console.log('Score:', result.score);
console.log('Passed:', result.success);
console.log('Execution time:', result.executionTime, 'ms');
console.log('Metrics:', result.metricResults);
Poll for Completion
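Evaluations run asynchronously (via the optimization loop or dashboard), so results may not be available right away. Poll getEvaluationResult until it returns a result; the example below treats a thrown error as "results not ready yet" and gives up after 30 attempts spaced 2 seconds apart.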
async function waitForResults(evalId: number): Promise<void> {
  const maxAttempts = 30;
  const delayMs = 2000;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      const result = await client.promptEvaluations.getEvaluationResult({
        id: evalId,
      });
      console.log('Score:', result.score, '| Passed:', result.success);
      return;
    } catch {
      // Any error is treated as "results not ready yet"; this also hides
      // unrelated failures (e.g. network errors) until the loop times out.
      console.log(`Waiting for results... (attempt ${i + 1}/${maxAttempts})`);
      // Skip the final sleep so the timeout error is raised promptly.
      if (i < maxAttempts - 1) {
        await new Promise(r => setTimeout(r, delayMs));
      }
    }
  }
  throw new Error('Timed out waiting for evaluation results');
}
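For long-running evaluations, a fixed polling interval can be replaced with a capped exponential backoff. The sketch below is a generic variant of the polling loop, not an SDK feature: `pollWithBackoff`, `PollOptions`, and the default values are all assumptions chosen for illustration, and, like the loop above, it assumes that any thrown error means the result is not ready yet. The `sleep` option exists only so the delay can be replaced in tests.

```typescript
interface PollOptions {
  maxAttempts?: number;
  initialDelayMs?: number;
  maxDelayMs?: number;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

// Retry `attempt` until it resolves, doubling the wait between tries up to a cap.
async function pollWithBackoff<T>(
  attempt: () => Promise<T>,
  opts: PollOptions = {},
): Promise<T> {
  const {
    maxAttempts = 10,
    initialDelayMs = 1000,
    maxDelayMs = 30000,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  } = opts;
  let delay = initialDelayMs;
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err; // assumption: any error means "not ready yet"
    }
    // No sleep after the final attempt.
    if (i < maxAttempts - 1) {
      await sleep(delay);
      delay = Math.min(delay * 2, maxDelayMs);
    }
  }
  throw new Error(`Gave up after ${maxAttempts} attempts: ${String(lastError)}`);
}
```

With the client from the earlier examples, a call might look like `pollWithBackoff(() => client.promptEvaluations.getEvaluationResult({ id: 456 }))`.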
Type Definitions
interface Evaluation {
  id: number;
  promptGroupId: string;
  datasetId: number;
  name: string;
  description: string | null;
  evalConfig: unknown;
  llmConfig: unknown;
  tags: string[] | null;
  metadata: unknown;
  createdAt: string | null;
  updatedAt: string | null;
  createdBy: string | null;
}

interface EvaluationResult {
  id: number;
  evaluationId: number;
  actualOutput: unknown;
  success: boolean;
  score: number | null;
  metricResults: unknown;
  executionTime: number | null;
  createdAt: string | null;
}
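Several fields in these types are nullable or typed as unknown, so consuming code needs to narrow them before use. The sketch below shows one way to do that when formatting a result for logs. The `isRecord` and `summarizeResult` helpers are hypothetical, and because the shape of metricResults is not specified here, the code only lists its top-level keys when it happens to be a plain object.

```typescript
// Copied from the type definitions above so the example is self-contained.
interface EvaluationResult {
  id: number;
  evaluationId: number;
  actualOutput: unknown;
  success: boolean;
  score: number | null;
  metricResults: unknown;
  executionTime: number | null;
  createdAt: string | null;
}

// Type guard: true only for non-null, non-array objects.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Build a one-line summary, substituting 'n/a' for null fields.
function summarizeResult(result: EvaluationResult): string {
  const status = result.success ? 'PASS' : 'FAIL';
  const score = result.score === null ? 'n/a' : result.score.toFixed(2);
  const time = result.executionTime === null ? 'n/a' : `${result.executionTime} ms`;
  const metrics = isRecord(result.metricResults)
    ? Object.keys(result.metricResults).join(', ') || 'none'
    : 'unknown shape';
  return `#${result.evaluationId} ${status} score=${score} time=${time} metrics=[${metrics}]`;
}
```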
Method Reference
| Method | Description |
|---|---|
| promptEvaluations.listEvaluations({ ...filters }) | List evaluations with pagination and filters |
| promptEvaluations.createEvaluation({ ...body }) | Create evaluation definition |
| promptEvaluations.getEvaluation({ id }) | Get evaluation by ID |
| promptEvaluations.updateEvaluation({ id, ...body }) | Update an existing evaluation definition |
| promptEvaluations.deleteEvaluation({ id }) | Delete an evaluation |
| promptEvaluations.getEvaluationResult({ id }) | Get evaluation results |
| promptEvaluations.getEvaluationResultsAggregated({ id }) | Get aggregated evaluation results |
| promptEvaluations.getEvaluationHistory({ id }) | Retrieve evaluation run history |
| promptEvaluations.createEvaluationVersion({ id, ...body }) | Create a new version of an evaluation |
| promptEvaluations.runEvaluation({ id }) | Trigger evaluation run |
REST API Reference
The SDK methods map to these HTTP endpoints:
| SDK Method | HTTP Endpoint |
|---|---|
| listEvaluations | GET /api/prompts/evaluations |
| createEvaluation | POST /api/prompts/evaluations |
| getEvaluation | GET /api/prompts/evaluations/:id |
| getEvaluationResult | GET /api/prompts/evaluations/:id/result |
| runEvaluation | POST /api/prompts/evaluations/:id/run |
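When calling the list endpoint without the SDK, the filters have to be encoded as a query string. The sketch below builds such a URL under two assumptions that this page does not confirm: that the query parameter names mirror the SDK filter names, and that the base URL (a placeholder here) is supplied by the caller. Authentication is left out because the required headers are not documented in this section. `ListFilters` and `buildListUrl` are hypothetical names.

```typescript
// Filter names taken from the List Evaluations request parameters.
type ListFilters = {
  promptId?: number;
  promptGroupId?: string;
  datasetId?: number;
  name?: string;
  createdBy?: string;
  limit?: number;
  offset?: number;
};

// Build the GET URL for the list endpoint, skipping filters that are not set.
function buildListUrl(baseUrl: string, filters: ListFilters = {}): string {
  const url = new URL('/api/prompts/evaluations', baseUrl);
  for (const [key, value] of Object.entries(filters)) {
    if (value !== undefined) url.searchParams.set(key, String(value));
  }
  return url.toString();
}
```

Using `URL` and `URLSearchParams` rather than string concatenation ensures values such as names with spaces are encoded correctly.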