Evaluations SDK

Create and manage prompt evaluations. Evaluations are definition entities that specify how to evaluate a prompt against a dataset. Results are produced by running the evaluation or via the optimization loop.
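A minimal end-to-end sketch using the methods documented below (the IDs are illustrative):

from mutagent import Mutagent
from mutagent.models import PromptIdDatasetIdName

with Mutagent() as client:
    # 1. Define the evaluation.
    evaluation = client.prompt_evaluations.create_evaluation(
        body=PromptIdDatasetIdName(prompt_id=42, dataset_id=7, name="Smoke test"),
    )
    # 2. Trigger a run; execution happens asynchronously.
    client.prompt_evaluations.run_evaluation(id_=evaluation["id"])
    # 3. Fetch results once the run completes (see Poll for Completion below).
    result = client.prompt_evaluations.get_evaluation_result(id_=evaluation["id"])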

List Evaluations

Retrieve evaluations with optional filters:
from mutagent import Mutagent

with Mutagent() as client:
    result = client.prompt_evaluations.list_evaluations(
        prompt_id=42,
        limit=20,
        offset=0,
    )
    for eval_ in result.get("data", []):
        print(eval_["id"], eval_["name"])

Filter parameters

Parameter        Type  Description
prompt_id        int   Filter by prompt ID
prompt_group_id  str   Filter by prompt group UUID
dataset_id       int   Filter by dataset ID
name             str   Filter by evaluation name
created_by       str   Filter by creator email
is_latest        bool  Filter by latest-version flag
limit            int   Results per page
offset           int   Number of results to skip
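
A sketch of paging through filtered results; the filter values here are illustrative, and the "data" envelope follows the listing example above:

from mutagent import Mutagent

with Mutagent() as client:
    limit, offset = 20, 0
    while True:
        # Page through the latest evaluations created by one user.
        page = client.prompt_evaluations.list_evaluations(
            created_by="dev@example.com",  # illustrative filter value
            is_latest=True,
            limit=limit,
            offset=offset,
        )
        evaluations = page.get("data", [])
        if not evaluations:
            break
        for eval_ in evaluations:
            print(eval_["id"], eval_["name"])
        offset += limit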

Create Evaluation

Create an evaluation definition linking a prompt to a dataset:
from mutagent import Mutagent
from mutagent.models import PromptIdDatasetIdName

with Mutagent() as client:
    evaluation = client.prompt_evaluations.create_evaluation(
        body=PromptIdDatasetIdName(
            prompt_id=42,
            dataset_id=7,
            name="Customer Support Quality Eval",
            description="Evaluate tone, accuracy, and helpfulness",
            eval_config={
                "metrics": ["g_eval", "semantic_similarity"],
                "threshold": 0.8,
            },
            llm_config={
                "model": "claude-sonnet-4-6",
                "temperature": 0,
            },
            tags=["production", "baseline"],
        )
    )
    print("Created evaluation:", evaluation["id"])

PromptIdDatasetIdName fields

Field        Type       Required  Description
prompt_id    int        Yes       ID of the prompt to evaluate
name         str        Yes       Human-readable name (max 255 chars)
dataset_id   int        No        ID of the test dataset
description  str        No        Evaluation purpose and methodology
eval_config  Any        No        Metrics, thresholds, evaluation parameters
llm_config   Any        No        Model, temperature, LLM execution settings
tags         list[str]  No        Organization tags
metadata     Any        No        Arbitrary metadata
created_by   str        No        Creator email
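
Only prompt_id and name are required; a minimal creation sketch:

from mutagent import Mutagent
from mutagent.models import PromptIdDatasetIdName

with Mutagent() as client:
    # Only the required fields; everything else uses server defaults.
    evaluation = client.prompt_evaluations.create_evaluation(
        body=PromptIdDatasetIdName(prompt_id=42, name="Minimal eval"),
    )
    print("Created evaluation:", evaluation["id"])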

Get Evaluation

with Mutagent() as client:
    evaluation = client.prompt_evaluations.get_evaluation(id_=456)
    print(evaluation["name"])
    print("Dataset:", evaluation["datasetId"])

Update Evaluation

from mutagent.models import NameDescriptionEvalConfig

with Mutagent() as client:
    updated = client.prompt_evaluations.update_evaluation(
        id_=456,
        body=NameDescriptionEvalConfig(
            description="Updated description",
        ),
    )

Delete Evaluation

with Mutagent() as client:
    client.prompt_evaluations.delete_evaluation(id_=456)

Run Evaluation

Trigger an evaluation run. Runs execute asynchronously; see Poll for Completion below for retrieving results:
with Mutagent() as client:
    run = client.prompt_evaluations.run_evaluation(id_=456)
    print("Run started:", run)

Get Results

Retrieve the execution results for an evaluation:
with Mutagent() as client:
    result = client.prompt_evaluations.get_evaluation_result(id_=456)
    print("Score:", result.get("score"))
    print("Passed:", result.get("success"))

Get Evaluation History

with Mutagent() as client:
    history = client.prompt_evaluations.get_evaluation_history(id_=456)
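
The shape of the returned history isn't documented on this page. A hedged sketch that assumes the same "data" envelope used by list_evaluations (an assumption, not a documented contract):

with Mutagent() as client:
    history = client.prompt_evaluations.get_evaluation_history(id_=456)
    # Assumption: history shares the {"data": [...]} envelope of list_evaluations.
    for entry in history.get("data", []):
        print(entry)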

Create Evaluation Version

with Mutagent() as client:
    new_version = client.prompt_evaluations.create_evaluation_version(id_=456)
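
New versions interact with the is_latest filter from List Evaluations; a sketch using only the calls documented above:

with Mutagent() as client:
    new_version = client.prompt_evaluations.create_evaluation_version(id_=456)
    # is_latest filters on the latest-version flag (see the filter table above).
    latest = client.prompt_evaluations.list_evaluations(prompt_id=42, is_latest=True)
    for eval_ in latest.get("data", []):
        print(eval_["id"], eval_["name"])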

Poll for Completion

Since evaluations run asynchronously, poll for results:
import time
from mutagent import Mutagent

def wait_for_results(eval_id: int, max_attempts: int = 30) -> dict:
    """Poll until evaluation results are available, then return them."""
    with Mutagent() as client:
        for i in range(max_attempts):
            try:
                # The result call fails until the run completes, so retry on error.
                result = client.prompt_evaluations.get_evaluation_result(id_=eval_id)
                print(f"Score: {result.get('score')} | Passed: {result.get('success')}")
                return result
            except Exception:
                print(f"Waiting for results... (attempt {i + 1}/{max_attempts})")
                time.sleep(2)

    raise TimeoutError("Timed out waiting for evaluation results")
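
A usage sketch that triggers a run and then blocks on the helper above:

with Mutagent() as client:
    client.prompt_evaluations.run_evaluation(id_=456)

result = wait_for_results(456)
print("Final score:", result.get("score"))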

Async version

import asyncio
from mutagent import AsyncMutagent

async def wait_for_results_async(eval_id: int, max_attempts: int = 30) -> dict:
    """Async variant: poll until results are available, then return them."""
    async with AsyncMutagent() as client:
        for i in range(max_attempts):
            try:
                result = await client.prompt_evaluations.get_evaluation_result(id_=eval_id)
                print(f"Score: {result.get('score')}")
                return result
            except Exception:
                # Back off without blocking the event loop.
                await asyncio.sleep(2)
    raise TimeoutError("Timed out waiting for evaluation results")
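
From synchronous code, the async helper can be driven with asyncio.run:

asyncio.run(wait_for_results_async(456))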

Method Reference

All methods live on the client.prompt_evaluations namespace.

Method                                   Description
list_evaluations(...)                    List evaluations with filters
create_evaluation(body)                  Create evaluation definition
get_evaluation(id_)                      Get evaluation by ID
update_evaluation(id_, body)             Update evaluation
delete_evaluation(id_)                   Delete evaluation
run_evaluation(id_)                      Trigger evaluation run
get_evaluation_result(id_)               Get evaluation results
get_evaluation_history(id_)              Get evaluation run history
create_evaluation_version(id_)           Create new evaluation version
get_evaluation_results_aggregated(...)   Get results aggregated by version
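
get_evaluation_results_aggregated appears only in this reference. A hedged sketch, assuming it takes the evaluation ID keyword like the other methods (the exact signature isn't shown on this page):

with Mutagent() as client:
    aggregated = client.prompt_evaluations.get_evaluation_results_aggregated(id_=456)
    # Shape of the aggregated payload is an assumption.
    print(aggregated)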