Dataset Best Practices
Well-designed datasets are the foundation of reliable prompt evaluation and optimization. Follow these guidelines to build datasets that accurately measure your prompt’s performance.
Golden Datasets
Create a “golden” dataset of high-quality test cases that serves as your primary quality benchmark:
- Representative: Cover the most common use cases your prompt will encounter in production.
- Diverse: Include edge cases, variations, and different user intents.
- Verified: Expected outputs are human-reviewed and definitively correct.
- Stable: Changes infrequently, so it serves as a consistent benchmark over time.
Creating a Golden Dataset
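How you create the dataset depends on your tooling; as a minimal sketch, the golden dataset below is assembled as plain JSON, where the `name`, `description`, `items`, `input`, `expected_output`, and `metadata` fields are an assumed schema rather than a requirement of any particular platform:
```python
import json

# Assumed item schema: adapt the field names to whatever your evaluation
# tooling expects.
golden_dataset = {
    "name": "support-bot-golden-v1",
    "description": "Primary quality benchmark for the support bot prompt.",
    "items": [
        {
            "input": {"question": "How do I reset my password?"},
            "expected_output": "Walk the user through the password reset flow.",
            "metadata": {"category": "happy_path", "verified": True},
        },
        {
            "input": {"question": "Help"},
            "expected_output": "Ask a clarifying question about what the user needs.",
            "metadata": {"category": "edge_case", "verified": True},
        },
    ],
}

with open("support-bot-golden-v1.json", "w") as f:
    json.dump(golden_dataset, f, indent=2)
```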
Coverage Strategies
Ensure your dataset covers all important scenarios; a sketch for auditing your mix against these targets follows the breakdown below:
Happy Path (40-50% of items)
Normal, expected inputs that should work perfectly:
- Common questions users actually ask
- Standard use cases
- Well-formed, clear inputs
Edge Cases (20-30% of items)
Boundary conditions and unusual but valid inputs:
- Very short inputs (“Help”)
- Very long, detailed questions
- Multiple questions in one
- Misspellings and typos
- Non-English characters
Error Cases (15-20% of items)
Invalid inputs that should be handled gracefully:
- Off-topic questions
- Requests outside your scope
- Incomplete information
Adversarial (5-10% of items)
Inputs designed to test robustness:
- Prompt injection attempts
- Confusing or misleading inputs
- Requests for inappropriate content
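If each item carries a `category` tag in its metadata (an assumption carried over from the sketch above), you can audit the mix against these targets in a few lines:
```python
import json
from collections import Counter

# Target shares from the guidelines above, as fractions of the dataset.
TARGETS = {
    "happy_path": (0.40, 0.50),
    "edge_case": (0.20, 0.30),
    "error_case": (0.15, 0.20),
    "adversarial": (0.05, 0.10),
}

def check_coverage(items):
    """Print each category's share and whether it falls in its target range."""
    counts = Counter(item["metadata"]["category"] for item in items)
    total = len(items)
    for category, (low, high) in TARGETS.items():
        share = counts.get(category, 0) / total
        status = "ok" if low <= share <= high else "out of range"
        print(f"{category}: {share:.0%} (target {low:.0%}-{high:.0%}) {status}")

with open("support-bot-golden-v1.json") as f:
    check_coverage(json.load(f)["items"])
```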
Dataset Size Guidelines
Choose the right size based on your use case:
| Purpose | Recommended Size | Notes |
|---|---|---|
| Quick smoke test | 5-10 items | Fast feedback during development |
| Standard evaluation | 20-50 items | Balanced coverage for regular testing |
| Comprehensive testing | 100+ items | Full coverage for production prompts |
| Optimization training | 50-200 items | Enough variety for mutation/selection |
| Regression suite | 30-50 items | Focused on critical paths |
Quality matters more than quantity. 30 well-crafted test cases beat 300 poorly designed ones.
Quality Tips
1. Review Expected Outputs
Ensure expected outputs are actually correct and represent ideal responses.
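A simple review pass helps here; the sketch below lists items whose expected output has not yet been human-verified, using the same assumed schema and `verified` metadata flag as the earlier sketch:
```python
import json

with open("support-bot-golden-v1.json") as f:
    dataset = json.load(f)

# Print items still awaiting human review of their expected output.
for i, item in enumerate(dataset["items"]):
    if not item["metadata"].get("verified", False):
        print(f"[{i}] input:    {item['input']}")
        print(f"    expected: {item['expected_output']}")
```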
2. Use Real Examples
Pull from actual user interactions for authentic test cases.
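As an illustration, assuming your production traffic can be exported as JSONL with `user_message` and `agent_reply` fields (both names are assumptions), you might turn logged interactions into candidate test cases like this:
```python
import json

items = []
with open("production-interactions.jsonl") as logs:  # assumed export format
    for line in logs:
        record = json.loads(line)
        items.append({
            "input": {"question": record["user_message"]},
            # The logged reply is only a draft expected output; verify it
            # before promoting the item into the golden dataset.
            "expected_output": record["agent_reply"],
            "metadata": {"source": "production", "verified": False},
        })

with open("support-bot-candidates.json", "w") as f:
    json.dump({"name": "support-bot-candidates", "items": items}, f, indent=2)
```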
3. Update Regularly
Add new cases as you discover issues.
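For example, when a bug report reveals a gap, append a case that reproduces it (same assumed schema as above):
```python
import json

with open("support-bot-golden-v1.json") as f:
    dataset = json.load(f)

# New case from a bug report: the bot handled all-caps refund questions poorly.
dataset["items"].append({
    "input": {"question": "WHERE IS MY REFUND???"},
    "expected_output": "Acknowledge the frustration and explain the refund timeline.",
    "metadata": {"category": "edge_case", "verified": True},
})

with open("support-bot-golden-v1.json", "w") as f:
    json.dump(dataset, f, indent=2)
```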
4. Version Datasets
Clone before making major changes.
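With file-based datasets, cloning can be as simple as copying to a new versioned name and editing only the copy; the `-v1`/`-v2` suffixes follow the naming convention described under Dataset Organization below:
```python
import json
import shutil

# Keep v1 frozen as the stable benchmark; make major changes in the v2 copy.
shutil.copyfile("support-bot-golden-v1.json", "support-bot-golden-v2.json")

with open("support-bot-golden-v2.json") as f:
    dataset = json.load(f)
dataset["name"] = "support-bot-golden-v2"
with open("support-bot-golden-v2.json", "w") as f:
    json.dump(dataset, f, indent=2)
```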
5. Document Purpose
Use descriptions to explain intent.
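For instance, a top-level `description` field (an assumed field, matching the earlier sketches) can record what the dataset is for, what it covers, and when it was last reviewed:
```python
import json

with open("support-bot-golden-v2.json") as f:
    dataset = json.load(f)

# Explain the dataset's intent so future maintainers know how to use it.
dataset["description"] = (
    "Golden benchmark for the support bot prompt. Covers password resets, "
    "billing, and refunds. Roughly 45% happy path, 25% edge cases, "
    "20% error cases, 10% adversarial. Last human review: Q2 2024."
)

with open("support-bot-golden-v2.json", "w") as f:
    json.dump(dataset, f, indent=2)
```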
Dataset Organization
Naming Conventions
Use consistent, descriptive names:
- support-bot-golden-v1
- support-bot-edge-cases-v2
- support-bot-regression-q1-2024
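If you want to enforce the convention automatically, a small check like the one below works; the `{prompt}-{purpose}-{version}` pattern encoded in the regex is an assumption inferred from the example names above:
```python
import re

# Assumed convention: <prompt>-<purpose>-<version or period>,
# e.g. support-bot-golden-v1 or support-bot-regression-q1-2024.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*-(v\d+|q[1-4]-\d{4})$")

for name in ["support-bot-golden-v1", "support-bot-edge-cases-v2", "Support Bot Golden"]:
    status = "ok" if NAME_PATTERN.match(name) else "does not match convention"
    print(f"{name}: {status}")
```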
Metadata Standards
Define standard metadata fields for your organization.
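One lightweight option is a shared, typed schema that every team imports; the fields below are illustrative suggestions rather than required names:
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ItemMetadata:
    """Illustrative organization-wide metadata fields for a dataset item."""
    category: str                # happy_path | edge_case | error_case | adversarial
    verified: bool = False       # has a human reviewed the expected output?
    source: str = "manual"       # manual | production | synthetic
    owner: Optional[str] = None  # team or person responsible for the case
    added: Optional[str] = None  # ISO date the case was added

meta = ItemMetadata(category="edge_case", verified=True, source="production")
```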
Common Mistakes to Avoid
Too few test cases
Small datasets don’t yield statistically reliable results. Aim for at least 20-30 items for meaningful evaluation.
Only happy path testing
If all your test cases are “normal” inputs, you won’t catch edge case failures. Include diversity.
Unverified expected outputs
If your expected outputs are wrong, your evaluations will penalize correct responses. Always verify.
Stale test cases
If your product changes but your test cases don’t, evaluations become meaningless. Update regularly.
Missing variable coverage
If your prompt has 5 variables but your test cases only vary 2 of them, you’re not exercising the prompt’s full behavior. The sketch below flags undertested variables.
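A quick way to catch this is to count how many distinct values each prompt variable takes across the dataset (same assumed item schema as the earlier sketches); variables with a single distinct value are effectively untested:
```python
import json
from collections import defaultdict

with open("support-bot-golden-v1.json") as f:
    items = json.load(f)["items"]

# Count distinct values per prompt variable across all test cases.
distinct = defaultdict(set)
for item in items:
    for variable, value in item["input"].items():
        distinct[variable].add(json.dumps(value, sort_keys=True))

for variable, values in sorted(distinct.items()):
    note = "" if len(values) > 1 else "  <- only one value, effectively untested"
    print(f"{variable}: {len(values)} distinct values{note}")
```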