Evaluations
Evaluations define how your prompts are measured against test datasets. An evaluation is a definition entity that specifies what to evaluate (prompt + dataset) and how to evaluate it (criteria and LLM configuration). Evaluation results are produced when the optimization loop runs your evaluation against dataset items.
Why Evaluate?
Quality Assurance
Verify prompts meet quality standards before deployment. Catch issues early with field-level criteria.
Regression Detection
Automatically detect when changes break existing behavior. Compare scores across prompt versions.
Optimization Feedback
Provide scoring criteria that guide the automated optimization engine toward better prompts.
Version Comparison
Objectively compare different prompt versions using consistent metrics and datasets.
How It Works
Evaluations are created as definitions, then executed automatically within the optimization loop or via the dashboard. There is no separate “run evaluation” CLI command; evaluation execution happens as part of prompt optimization.
Creating an Evaluation
An evaluation links a prompt to a dataset with evaluation criteria.
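The exact creation API is not shown here; as a rough sketch with assumed field names (promptId, datasetId, evalConfig, and the metric identifiers are illustrative, not the real schema), an evaluation definition might look like this:

```typescript
// Hypothetical evaluation definition; field names are illustrative, not the
// actual schema. Submit it via your SDK or the dashboard.
const evaluation = {
  name: "support-reply-quality",
  promptId: "prompt_abc123",        // the prompt (and version) under test
  datasetId: "dataset_def456",      // the test dataset to run it against
  evalConfig: {
    criteria: [
      { field: "output", metric: "g-eval", instructions: "Reply is polite and resolves the issue", weight: 0.7 },
      { field: "output", metric: "contains", value: "ticket", weight: 0.3 },
    ],
    judge: { model: "gpt-4o-mini", temperature: 0 }, // LLM config used by LLM-based metrics
  },
};
```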
Available Metrics
| Metric | Type | Description | Best For |
|---|---|---|---|
| G-Eval | LLM-based | AI judge assesses quality, relevance, coherence | General quality |
| Semantic Similarity | Embedding | Cosine similarity between output and expected | Meaning preservation |
| Exact Match | Deterministic | Binary match against expected output | Classification, structured |
| Contains | Deterministic | Checks for required substrings | Key information |
| Regex Match | Deterministic | Pattern matching against output | Format validation |
| Custom | LLM-based | Your own evaluation criteria | Domain-specific |
Combine multiple metrics for comprehensive evaluation. G-Eval catches quality issues; semantic similarity catches meaning drift; deterministic metrics catch format errors.
Evaluation Configuration
The `evalConfig` object defines the criteria for scoring. Criteria are field-level and can target input or output:
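The shape below is an illustrative sketch rather than the actual `evalConfig` contract; the field names, metric identifiers, and the weight/threshold mechanism are assumptions:

```typescript
// Illustrative shape only; the actual evalConfig schema may differ.
const evalConfig = {
  criteria: [
    // Output-level LLM judge criterion (G-Eval style).
    { field: "output", metric: "g-eval", instructions: "Answer is accurate, relevant, and coherent", weight: 0.5 },
    // Output-level deterministic criterion: must equal the dataset's expected output.
    { field: "output", metric: "exact-match", weight: 0.3 },
    // Input-level criterion: check the rendered prompt contains the user question.
    { field: "input", metric: "contains", value: "question:", weight: 0.2 },
  ],
  threshold: 0.8, // minimum weighted score to pass
};
```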
Criteria Examples
Development Evaluation
Quick feedback during prompt iteration:
- Small dataset (5-10 items)
- Single metric for fast turnaround
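A hypothetical sketch of such a configuration, reusing the assumed field names from the examples above:

```typescript
// Hypothetical development-time evaluation; schema and names are assumptions.
const devEvaluation = {
  name: "summarizer-dev-check",
  promptId: "prompt_abc123",
  datasetId: "dataset_dev_sample",   // a 5-10 item slice of the full dataset
  evalConfig: {
    criteria: [
      { field: "output", metric: "semantic-similarity", weight: 1.0 }, // single fast metric
    ],
    threshold: 0.7,                  // permissive threshold during iteration
  },
};
```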
Pre-deployment Evaluation
Comprehensive check before publishing:
- Full dataset (50+ items)
- Multiple metrics with weighted scoring
- Higher quality threshold
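A hedged sketch of a pre-deployment configuration along these lines (the schema, metric names, and regex check are assumptions for illustration):

```typescript
// Hypothetical pre-deployment evaluation; schema and names are assumptions.
const releaseEvaluation = {
  name: "summarizer-release-gate",
  promptId: "prompt_abc123",
  datasetId: "dataset_full",         // the complete dataset, 50+ items
  evalConfig: {
    criteria: [
      { field: "output", metric: "g-eval", instructions: "Summary is faithful and concise", weight: 0.5 },
      { field: "output", metric: "semantic-similarity", weight: 0.3 },
      { field: "output", metric: "regex-match", pattern: "^.{1,280}$", weight: 0.2 }, // format check: one line, max 280 chars
    ],
    threshold: 0.85,                 // stricter bar before publishing
  },
};
```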
Regression Evaluation
Compare new version against baseline:
- Same dataset, different prompt versions
- Consistent criteria across evaluations
- Track improvement over time
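One way to compare two results from the same dataset and criteria is to diff their overall scores. The result summary shape below is an assumption, not the documented API:

```typescript
// Hypothetical regression check comparing baseline vs. candidate prompt versions.
// The result summary shape is assumed for illustration.
interface EvalSummary {
  promptVersion: string;
  overallScore: number; // weighted aggregate across criteria, 0-1
}

function checkRegression(baseline: EvalSummary, candidate: EvalSummary, tolerance = 0.01): boolean {
  const delta = candidate.overallScore - baseline.overallScore;
  console.log(`${baseline.promptVersion} -> ${candidate.promptVersion}: ${delta >= 0 ? "+" : ""}${delta.toFixed(3)}`);
  return delta >= -tolerance; // true = no meaningful regression
}
```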
Evaluation Results
Results are generated when the optimization loop executes your evaluation and can be retrieved via the API.
Result Structure
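As an assumption-laden sketch (the actual field names may differ), a result retrieved from the API might carry an overall score plus per-criterion and per-item breakdowns:

```typescript
// Illustrative result shape only; actual field names may differ.
interface EvaluationResult {
  evaluationId: string;
  promptVersion: string;
  overallScore: number;               // weighted aggregate across criteria, 0-1
  passed: boolean;                    // overallScore >= configured threshold
  criteriaScores: Array<{
    metric: string;                   // e.g. "g-eval", "semantic-similarity"
    field: "input" | "output";
    score: number;
    weight: number;
  }>;
  itemResults: Array<{
    datasetItemId: string;
    output: string;                   // model output for this dataset item
    scores: Record<string, number>;   // per-metric scores for this item
  }>;
}
```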
Quality Gates
Use evaluations as quality gates in your workflow:
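For example, a CI step can refuse to publish a prompt version whose latest evaluation falls below the threshold. The helper below is a hypothetical sketch; fetchLatestResult stands in for whatever retrieval call your SDK provides:

```typescript
// Hypothetical CI quality gate: block deployment when the latest evaluation
// result fails its threshold. fetchLatestResult is a placeholder, not a real API.
async function qualityGate(fetchLatestResult: () => Promise<{ overallScore: number; passed: boolean }>) {
  const result = await fetchLatestResult();
  if (!result.passed) {
    console.error(`Quality gate failed: score ${result.overallScore.toFixed(2)} is below threshold`);
    process.exit(1); // fail the CI job so the prompt version is not published
  }
  console.log(`Quality gate passed (score ${result.overallScore.toFixed(2)})`);
}
```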
What’s Next?
Evaluation Metrics
Deep dive into available metrics and when to use each
Running Evaluations
Learn how evaluations execute within the optimization workflow