# Evaluations
Evaluations automatically measure how well your prompts perform against test datasets. They provide objective, reproducible quality scores that enable data-driven prompt development.
## Why Evaluate?

- **Quality Assurance**: Verify prompts meet quality standards before deployment. Catch issues early.
- **Regression Detection**: Automatically detect when changes break existing behavior. Never deploy broken prompts.
- **Optimization Feedback**: Provide scores that guide the automated optimization engine toward better prompts.
- **Version Comparison**: Objectively compare different prompt versions to make informed decisions.
## How It Works

An evaluation runs your prompt against every item in a test dataset, scores each output with the metrics you select, and aggregates the results into per-metric and overall scores.

### Quick Example
```typescript
import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: 'sk_live_...' });

// Create and run an evaluation
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity'],
});

console.log('Evaluation started:', evaluation.id);
console.log('Status:', evaluation.status);

// Wait for completion and get results
const results = await client.evaluations.waitForCompletion(evaluation.id);

console.log('Overall Score:', results.overallScore);
console.log('Metric Breakdown:');
for (const [metric, score] of Object.entries(results.metricScores)) {
  console.log(`  ${metric}: ${score.toFixed(2)}`);
}
```
## Available Metrics
| Metric | Type | Description | Best For |
| --- | --- | --- | --- |
| G-Eval | LLM-based | AI judge assesses quality, relevance, coherence | General quality |
| Semantic Similarity | Embedding | Cosine similarity between output and expected output | Meaning preservation |
| Exact Match | Deterministic | Binary match against expected output | Classification, structured output |
| Contains | Deterministic | Checks for required substrings | Key information |
| Regex Match | Deterministic | Pattern matching against output | Format validation |
| Custom | LLM-based | Your own evaluation criteria | Domain-specific |
Combine multiple metrics for comprehensive evaluation. G-Eval catches quality issues; semantic similarity catches meaning drift; deterministic metrics catch format errors.
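For example, a single run can pair the LLM judge and an embedding metric with deterministic checks. The sketch below assumes the metric identifiers from the table above (`g_eval`, `semantic_similarity`, `contains`, `regex_match`) are passed to `evaluations.run()` exactly as written:

```typescript
import { Mutagent } from '@mutagent/sdk';

const client = new Mutagent({ bearerAuth: 'sk_live_...' });

// Sketch: combine an LLM judge, an embedding metric, and deterministic checks
// in one run. Metric identifiers are assumed to match the table above.
const evaluation = await client.evaluations.run({
  promptId: 'prompt_xxxx',
  datasetId: 'dataset_xxxx',
  metrics: ['g_eval', 'semantic_similarity', 'contains', 'regex_match'],
});

const results = await client.evaluations.waitForCompletion(evaluation.id);

// Per-metric averages show whether a low score is a quality issue,
// meaning drift, or a formatting problem.
for (const [metric, score] of Object.entries(results.metricScores)) {
  console.log(`${metric}: ${score.toFixed(2)}`);
}
```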
## Evaluation Types

### Development Evaluation

Quick feedback during prompt iteration:

- Small dataset (5-10 items)
- Fast turnaround
- Used during development
```typescript
await client.evaluations.run({
  promptId,
  datasetId: 'dev_dataset',
  metrics: ['g_eval'],
});
```
### Pre-deployment Evaluation

Comprehensive check before publishing:

- Full dataset (50+ items)
- Multiple metrics
- Quality gate enforcement
```typescript
const evaluation = await client.evaluations.run({
  promptId,
  datasetId: 'golden_dataset',
  metrics: ['g_eval', 'semantic_similarity', 'contains'],
});
// Scores are only available once the run has completed
const results = await client.evaluations.waitForCompletion(evaluation.id);

if (results.overallScore < 0.85) {
  throw new Error('Quality gate failed');
}
```
### Version Comparison

Compare a new version against a baseline:

- Same dataset, different prompt versions
- Detect performance degradation
- Track improvement over time
```typescript
const [evalV1, evalV2] = await Promise.all([
  client.evaluations.run({ promptId: v1Id, datasetId, metrics }),
  client.evaluations.run({ promptId: v2Id, datasetId, metrics }),
]);
// Scores are only available once both runs have finished
const [resultsV1, resultsV2] = await Promise.all([
  client.evaluations.waitForCompletion(evalV1.id),
  client.evaluations.waitForCompletion(evalV2.id),
]);
const improvement = resultsV2.overallScore - resultsV1.overallScore;
console.log(`V2 is ${improvement > 0 ? 'better' : 'worse'} by ${Math.abs(improvement).toFixed(2)}`);
```
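To see which metric drove the change rather than just the overall delta, you can break the comparison down per metric. A minimal sketch, continuing from the two completed runs above (`resultsV1`, `resultsV2`) and the `metricScores` field documented in the next section:

```typescript
// Sketch: per-metric deltas between the two versions, so a regression can be
// traced to a specific metric (e.g. formatting vs. semantic quality).
for (const [metric, v2Score] of Object.entries(resultsV2.metricScores)) {
  const v1Score = resultsV1.metricScores[metric] ?? 0;
  const delta = v2Score - v1Score;
  console.log(
    `${metric}: ${v1Score.toFixed(2)} -> ${v2Score.toFixed(2)} (${delta >= 0 ? '+' : ''}${delta.toFixed(2)})`
  );
}
```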
## Evaluation Results
Results include detailed scoring at multiple levels:
```typescript
interface EvaluationResults {
  id: string;
  status: 'pending' | 'running' | 'completed' | 'failed';

  // Aggregate scores
  overallScore: number;                  // 0.0 - 1.0
  metricScores: Record<string, number>;  // Per-metric averages

  // Per-item details
  itemResults: Array<{
    datasetItemId: string;
    input: Record<string, any>;
    actualOutput: string;
    expectedOutput?: string;
    scores: Record<string, number>;
    passed: boolean;
  }>;

  // Metadata
  completedItems: number;
  totalItems: number;
  duration: number;   // milliseconds
  startedAt: Date;
  completedAt?: Date;
}
```
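The per-item detail is what makes a failing run debuggable. Below is a minimal sketch, assuming `results` is an `EvaluationResults` object returned by `waitForCompletion`, that prints every item that did not pass along with its scores:

```typescript
// Sketch: triage a completed run by listing the failing dataset items
// together with their inputs, outputs, and per-metric scores.
function reportFailures(results: EvaluationResults): void {
  const failures = results.itemResults.filter((item) => !item.passed);
  console.log(`${failures.length}/${results.totalItems} items failed`);

  for (const item of failures) {
    console.log(`Item ${item.datasetItemId}`);
    console.log(`  input:    ${JSON.stringify(item.input)}`);
    console.log(`  expected: ${item.expectedOutput ?? '(none)'}`);
    console.log(`  actual:   ${item.actualOutput}`);
    for (const [metric, score] of Object.entries(item.scores)) {
      console.log(`  ${metric}: ${score.toFixed(2)}`);
    }
  }
}
```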
## Quality Gates
Use evaluations as quality gates in your workflow:
```typescript
async function deployPrompt(promptId: string) {
  // Run comprehensive evaluation
  const evaluation = await client.evaluations.run({
    promptId,
    datasetId: 'golden_dataset_xxxx',
    metrics: ['g_eval', 'semantic_similarity'],
  });

  // Wait for completion
  const results = await client.evaluations.waitForCompletion(evaluation.id);

  // Enforce quality thresholds
  const QUALITY_THRESHOLD = 0.85;
  const REGRESSION_THRESHOLD = 0.02;

  if (results.overallScore < QUALITY_THRESHOLD) {
    throw new Error(
      `Quality gate failed: ${results.overallScore} < ${QUALITY_THRESHOLD}`
    );
  }

  // Compare against previous version
  const previousScore = await getPreviousScore(promptId);
  if (previousScore - results.overallScore > REGRESSION_THRESHOLD) {
    throw new Error(
      `Regression detected: score dropped from ${previousScore} to ${results.overallScore}`
    );
  }

  // Safe to deploy
  await client.prompts.publish(promptId);
  console.log('Prompt deployed successfully');
}
```
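`getPreviousScore` is not part of the SDK; how you track baseline scores is up to you. One possible sketch keeps the last deployed score per prompt in a local JSON file (the file name and helper names here are illustrative, not Mutagent APIs):

```typescript
import { promises as fs } from 'node:fs';

// Illustrative baseline store: a JSON file mapping promptId -> last deployed
// overall score. Swap in a database or CI artifact store as needed.
const BASELINE_FILE = 'eval-baselines.json';

async function getPreviousScore(promptId: string): Promise<number> {
  try {
    const baselines: Record<string, number> = JSON.parse(
      await fs.readFile(BASELINE_FILE, 'utf8')
    );
    return baselines[promptId] ?? 0;
  } catch {
    // No baseline yet: the first deployment passes the regression check
    return 0;
  }
}

async function saveBaselineScore(promptId: string, score: number): Promise<void> {
  let baselines: Record<string, number> = {};
  try {
    baselines = JSON.parse(await fs.readFile(BASELINE_FILE, 'utf8'));
  } catch {
    // Start with an empty baseline map
  }
  baselines[promptId] = score;
  await fs.writeFile(BASELINE_FILE, JSON.stringify(baselines, null, 2));
}
```

If you use something like this, record the new score (for example with `saveBaselineScore`) right after the publish succeeds so the next run compares against the version that actually shipped.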
## What’s Next?