Evaluation Cookbook

This collection of recipes focuses on techniques and strategies for evaluating and improving the performance of your LLM-powered applications (whether or not you've built them with Mirascope). Proper evaluation is crucial for ensuring the reliability, accuracy, and effectiveness of AI systems, especially when working with large language models.

Why Evaluate?

Evaluating your LLM applications is essential for several reasons:

  1. Quality Assurance: Ensure your models are producing accurate and relevant outputs.
  2. Performance Monitoring: Track how your models perform over time and across different inputs.
  3. Continuous Improvement: Gather insights to refine your prompts, fine-tune your models, or adjust your system architecture.
  4. User Experience: Guarantee a consistent and high-quality experience for your end-users.

What You'll Find Here

Each recipe in this section has a sister recipe that implements the system being evaluated. These evaluation recipes will:

  • Provide practical code examples using Mirascope
  • Explain key concepts and best practices in LLM evaluation
  • Offer tips for adapting the evaluation methods to your specific use case

Recipes In This Section

Explore the following recipes to learn how to effectively evaluate your LLM applications with Mirascope:

How to Use This Cookbook

  1. Review the evaluation recipes to find techniques relevant to your project.
  2. Implement the evaluation methods alongside your main application logic (see the sketch after this list).
  3. Use the insights gained from evaluations to iteratively improve your LLM applications.
  4. Consider combining multiple evaluation techniques for a more comprehensive assessment.
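
For example, step 2 might take the form of an LLM-as-judge scorer that lives next to your application code. The following is a minimal sketch, assuming Mirascope's `openai.call` decorator with a Pydantic `response_model`; the judge name, model, and criteria are illustrative placeholders rather than the cookbook's own recipes.

```python
from pydantic import BaseModel, Field

from mirascope.core import openai, prompt_template


class RelevanceEval(BaseModel):
    """Structured judgment returned by the judge model."""

    reasoning: str = Field(..., description="Brief justification for the score.")
    score: float = Field(..., description="Relevance score between 0 and 1.")


# Hypothetical judge: the model, template, and criteria are placeholders.
@openai.call("gpt-4o-mini", response_model=RelevanceEval)
@prompt_template(
    """
    Evaluate how relevant the answer is to the question on a scale from 0 to 1.

    Question: {question}
    Answer: {answer}
    """
)
def judge_relevance(question: str, answer: str): ...


# Usage: score an output produced by your application.
evaluation = judge_relevance("What is Mirascope?", "Mirascope is a Python toolkit for LLMs.")
print(evaluation.score, evaluation.reasoning)
```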

Best Practices

While it may be tempting to use existing generalized evaluation tooling, doing so often (if not always) fails to properly measure the quality of your specific application. Evals work best when they are tailored to your use case.

When implementing your evals, remember to:

  • Clearly Define Your Criteria: The criteria by which you measure success and quality are the cornerstone of any good evaluation system. This is the first and most important step.
  • Establish Baselines: Create a set of benchmark tests to track improvements (or degradation) over time.
  • Use Diverse Test Sets: Push your evals to cover a wide range of inputs and edge cases.
  • Keep Humans in the Loop: Iterate on your evals with thoughtful, nuanced human reviews to ensure the metrics you're tracking remain true to your goals.
  • Automate Where Possible: Integrate evaluation into your CI/CD pipeline for continuous monitoring (see the test sketch after this list).
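
As one way to combine the baseline and automation practices above, a baseline suite can run in CI with pytest. This is a minimal sketch, assuming a hypothetical `answer_question` entry point in your application and the `judge_relevance` judge sketched earlier; the cases and threshold are placeholders to tune against human review.

```python
# test_evals.py: a minimal sketch of baseline evals automated with pytest.
import pytest

from my_app import answer_question  # hypothetical application entry point
from my_evals import judge_relevance  # the LLM-as-judge sketched earlier

# Baseline cases tracked over time; these examples are illustrative placeholders.
BASELINE_CASES = [
    ("What is Mirascope?", "mirascope"),
    ("Which language are Mirascope recipes written in?", "python"),
]


@pytest.mark.parametrize("question,expected_keyword", BASELINE_CASES)
def test_answer_contains_expected_keyword(question: str, expected_keyword: str):
    # Cheap deterministic check: the expected keyword appears in the answer.
    answer = answer_question(question)
    assert expected_keyword in answer.lower()


@pytest.mark.parametrize("question,expected_keyword", BASELINE_CASES)
def test_answer_meets_relevance_threshold(question: str, expected_keyword: str):
    # LLM-as-judge check: relevance must clear a threshold tuned with human review.
    answer = answer_question(question)
    evaluation = judge_relevance(question, answer)
    assert evaluation.score >= 0.7
```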