
# Evaluations

> Run evaluations with the Python SDK

# Evaluations SDK

Create and manage prompt evaluations. Evaluations are **definition entities** that specify *how* to evaluate a prompt against a dataset. Results are produced by running the evaluation or via the optimization loop.

<Mermaid>
  flowchart LR
  A[create_evaluation] --> B[run_evaluation / Optimization Loop]
  B --> C[Results Generated]
  C --> D[get_evaluation_result]
</Mermaid>

## List Evaluations

Retrieve evaluations with optional filters:

```python
from mutagent import Mutagent

with Mutagent() as client:
    result = client.prompt_evaluations.list_evaluations(
        prompt_id=42,
        limit=20,
        offset=0,
    )
    for eval_ in result.get("data", []):
        print(eval_["id"], eval_["name"])
```

### Filter parameters

| Parameter         | Type   | Description                   |
| ----------------- | ------ | ----------------------------- |
| `prompt_id`       | `int`  | Filter by prompt ID           |
| `prompt_group_id` | `str`  | Filter by prompt group UUID   |
| `dataset_id`      | `int`  | Filter by dataset ID          |
| `name`            | `str`  | Filter by evaluation name     |
| `created_by`      | `str`  | Filter by creator email       |
| `is_latest`       | `bool` | Filter by latest-version flag |
| `limit`           | `int`  | Results per page              |
| `offset`          | `int`  | Number of results to skip     |
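The `limit` and `offset` parameters above can be combined to walk through every page of results. The following is a minimal sketch, not part of the SDK: `iter_pages` is a hypothetical helper, and it assumes the list response is a dict whose `"data"` key holds the current page (as in the example above) and that a page shorter than `page_size` marks the end of the results.

```python
from typing import Any, Callable, Iterator


def iter_pages(
    fetch_page: Callable[..., dict],
    page_size: int = 20,
    **filters: Any,
) -> Iterator[dict]:
    """Yield every item from a limit/offset-paginated list call.

    `fetch_page` is any callable that accepts `limit`, `offset`, and filter
    keyword arguments, e.g. `client.prompt_evaluations.list_evaluations`.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")

    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset, **filters)
        items = page.get("data", [])
        yield from items
        # Assumption: a short (or empty) page means there is nothing left.
        if len(items) < page_size:
            break
        offset += page_size
```

With a real client this could be called as `iter_pages(client.prompt_evaluations.list_evaluations, prompt_id=42)` inside the `with Mutagent() as client:` block. Passing the method in as a callable keeps the helper independent of the SDK and easy to test with a stub.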

## Create Evaluation

Create an evaluation definition linking a prompt to a dataset:

```python
from mutagent import Mutagent
from mutagent.models import PromptIdDatasetIdName

with Mutagent() as client:
    evaluation = client.prompt_evaluations.create_evaluation(
        body=PromptIdDatasetIdName(
            prompt_id=42,
            dataset_id=7,
            name="Customer Support Quality Eval",
            description="Evaluate tone, accuracy, and helpfulness",
            eval_config={
                "metrics": ["g_eval", "semantic_similarity"],
                "threshold": 0.8,
            },
            llm_config={
                "model": "claude-sonnet-4-6",
                "temperature": 0,
            },
            tags=["production", "baseline"],
        )
    )
    print("Created evaluation:", evaluation["id"])
```

### `PromptIdDatasetIdName` fields

| Field         | Type        | Required | Description                                |
| ------------- | ----------- | -------- | ------------------------------------------ |
| `prompt_id`   | `int`       | Yes      | ID of the prompt to evaluate               |
| `name`        | `str`       | Yes      | Human-readable name (max 255 chars)        |
| `dataset_id`  | `int`       | No       | ID of the test dataset                     |
| `description` | `str`       | No       | Evaluation purpose and methodology         |
| `eval_config` | `Any`       | No       | Metrics, thresholds, evaluation parameters |
| `llm_config`  | `Any`       | No       | Model, temperature, LLM execution settings |
| `tags`        | `list[str]` | No       | Organization tags                          |
| `metadata`    | `Any`       | No       | Arbitrary metadata                         |
| `created_by`  | `str`       | No       | Creator email                              |

## Get Evaluation

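Fetch a single evaluation definition by ID:

```python
from mutagent import Mutagent

with Mutagent() as client:
    evaluation = client.prompt_evaluations.get_evaluation(id_=456)
    print(evaluation["name"])
    # Note the camelCase response key, unlike the snake_case request fields.
    print("Dataset:", evaluation["datasetId"])
```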

## Update Evaluation

```python
from mutagent import Mutagent
from mutagent.models import NameDescriptionEvalConfig

with Mutagent() as client:
    updated = client.prompt_evaluations.update_evaluation(
        id_=456,
        body=NameDescriptionEvalConfig(
            description="Updated description",
        ),
    )
```

## Delete Evaluation

```python
from mutagent import Mutagent

with Mutagent() as client:
    client.prompt_evaluations.delete_evaluation(id_=456)
```

## Run Evaluation

Trigger an evaluation run:

```python
from mutagent import Mutagent

with Mutagent() as client:
    run = client.prompt_evaluations.run_evaluation(id_=456)
    print("Run started:", run)
```

## Get Results

Retrieve the execution results for an evaluation:

```python
from mutagent import Mutagent

with Mutagent() as client:
    result = client.prompt_evaluations.get_evaluation_result(id_=456)
    print("Score:", result.get("score"))
    print("Passed:", result.get("success"))
```

## Get Evaluation History

```python
from mutagent import Mutagent

with Mutagent() as client:
    history = client.prompt_evaluations.get_evaluation_history(id_=456)
```

## Create Evaluation Version

```python
from mutagent import Mutagent

with Mutagent() as client:
    new_version = client.prompt_evaluations.create_evaluation_version(id_=456)
```

## Poll for Completion

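Evaluations run asynchronously, so results are not available immediately after `run_evaluation` returns. Poll `get_evaluation_result` until it succeeds or a maximum number of attempts is reached:

```python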
import time
from mutagent import Mutagent

def wait_for_results(eval_id: int, max_attempts: int = 30) -> dict:
    with Mutagent() as client:
        for i in range(max_attempts):
            try:
                result = client.prompt_evaluations.get_evaluation_result(id_=eval_id)
                print(f"Score: {result.get('score')} | Passed: {result.get('success')}")
                return result
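            except Exception:
                # Every exception is treated as "results not ready yet".
                # In production, catch only the SDK's not-found error so that
                # authentication or network failures are not retried silently.
                print(f"Waiting for results... (attempt {i + 1}/{max_attempts})")
                time.sleep(2)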

    raise TimeoutError("Timed out waiting for evaluation results")
```

### Async version

```python
import asyncio
from mutagent import AsyncMutagent

async def wait_for_results_async(eval_id: int, max_attempts: int = 30) -> dict:
    async with AsyncMutagent() as client:
        for i in range(max_attempts):
            try:
                result = await client.prompt_evaluations.get_evaluation_result(id_=eval_id)
                print(f"Score: {result.get('score')}")
                return result
            except Exception:
                await asyncio.sleep(2)
    raise TimeoutError("Timed out waiting for evaluation results")
```
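Both polling loops above wait a fixed 2 seconds between attempts. For longer-running evaluations, exponential backoff reduces the number of requests. The following is a minimal sketch under stated assumptions: `poll_with_backoff` is a hypothetical helper, not an SDK function, and it assumes (as the examples above do) that a result that is not ready yet surfaces as an exception.

```python
import time
from typing import Callable, Optional


def poll_with_backoff(
    fetch: Callable[[], dict],
    max_attempts: int = 10,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch` until it succeeds, doubling the wait after each failure."""
    delay = initial_delay
    last_error: Optional[Exception] = None

    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            # Assumption: any exception means "not ready yet". Narrow this
            # to the SDK's specific error type in real code.
            last_error = exc

        # Do not sleep after the final attempt; fail immediately instead.
        if attempt < max_attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)

    raise TimeoutError(
        f"No evaluation result after {max_attempts} attempts"
    ) from last_error
```

Inside a `with Mutagent() as client:` block this could be used as `poll_with_backoff(lambda: client.prompt_evaluations.get_evaluation_result(id_=456))`. The `sleep` parameter exists so that tests can substitute a no-op and avoid real delays.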

## Method Reference

| Method                                   | Description                       | Namespace                   |
| ---------------------------------------- | --------------------------------- | --------------------------- |
| `list_evaluations(...)`                  | List evaluations with filters     | `client.prompt_evaluations` |
| `create_evaluation(body)`                | Create evaluation definition      | `client.prompt_evaluations` |
| `get_evaluation(id_)`                    | Get evaluation by ID              | `client.prompt_evaluations` |
| `update_evaluation(id_, body)`           | Update evaluation                 | `client.prompt_evaluations` |
| `delete_evaluation(id_)`                 | Delete evaluation                 | `client.prompt_evaluations` |
| `run_evaluation(id_)`                    | Trigger evaluation run            | `client.prompt_evaluations` |
| `get_evaluation_result(id_)`             | Get evaluation results            | `client.prompt_evaluations` |
| `get_evaluation_history(id_)`            | Get evaluation run history        | `client.prompt_evaluations` |
| `create_evaluation_version(id_)`         | Create new evaluation version     | `client.prompt_evaluations` |
| `get_evaluation_results_aggregated(...)` | Get results aggregated by version | `client.prompt_evaluations` |
