Overview
Enterprise-grade evaluation is essential for deploying AI systems that are accurate, reliable, and aligned with your business objectives.
Evaluation workflow
Snorkel's evaluation framework follows a comprehensive workflow with these key steps, described in the sections that follow.
Onboard evaluation artifacts
Evaluation for GenAI output begins with preparing and onboarding your dataset into the platform.
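As a concrete illustration, the sketch below shows one way an evaluation-ready dataset might be organized before onboarding: prompt and response pairs with optional metadata and reference answers. The column names and the JSONL output are assumptions made for illustration, not a required schema.

```python
import pandas as pd

# Illustrative evaluation dataset: each row pairs a user prompt with the
# GenAI application's response, plus optional metadata used later for slicing.
# Column names here are placeholders, not a prescribed Snorkel Flow schema.
eval_df = pd.DataFrame(
    [
        {
            "prompt": "How do I reset my account password?",
            "response": "Go to Settings > Security and click 'Reset password'.",
            "topic": "account_management",
            "reference_answer": "Use the password reset link under Settings > Security.",
        },
        {
            "prompt": "What is your refund policy for annual plans?",
            "response": "Annual plans can be refunded within 30 days of purchase.",
            "topic": "billing",
            "reference_answer": "Refunds are available within 30 days for annual plans.",
        },
    ]
)

# Persist as JSONL (or CSV) so the file can be uploaded for onboarding.
eval_df.to_json("eval_dataset.jsonl", orient="records", lines=True)
```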
Evaluation for multi-agent systems
Your GenAI application may produce responses in multiple steps rather than as a single output, as is common in multi-agent systems.
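To make this concrete, the sketch below models a multi-step response as a trace of intermediate agent steps plus a final response. The class and field names are illustrative rather than a prescribed trace format; the point is that each step can be evaluated on its own, in addition to the end-to-end output.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentStep:
    """One intermediate step produced by an agent (names are illustrative)."""
    agent: str        # which agent or tool produced this step
    input_text: str   # what the step received
    output_text: str  # what the step produced

@dataclass
class Trace:
    """A full multi-step trace ending in the user-facing response."""
    prompt: str
    steps: List[AgentStep] = field(default_factory=list)
    final_response: str = ""

trace = Trace(
    prompt="Summarize open support tickets for customer 42.",
    steps=[
        AgentStep("retriever", "customer 42 tickets", "3 open tickets found ..."),
        AgentStep("summarizer", "3 open tickets found ...", "Customer 42 has 3 open tickets ..."),
    ],
    final_response="Customer 42 has 3 open tickets: ...",
)

# Each step can be scored separately (e.g., retrieval quality), alongside an
# end-to-end evaluation of trace.final_response.
```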
Create benchmark for evaluation
With an evaluation-ready dataset, you can create a benchmark customized to your use case and evaluation criteria.
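Conceptually, a benchmark bundles the criteria you care about with the evaluators that score them. The minimal sketch below uses plain Python dataclasses to show that structure; the names and the placeholder evaluator are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List

# An evaluator maps a (prompt, response) pair to a score between 0 and 1.
Evaluator = Callable[[str, str], float]

@dataclass
class Criterion:
    """One dimension of quality the benchmark measures."""
    name: str
    description: str
    evaluator: Evaluator

@dataclass
class Benchmark:
    """A named set of criteria applied to an evaluation dataset."""
    name: str
    criteria: List[Criterion]

benchmark = Benchmark(
    name="support-assistant-v1",
    criteria=[
        Criterion(
            name="correctness",
            description="Response is factually consistent with the reference answer.",
            evaluator=lambda prompt, response: 1.0,  # placeholder; see later sections
        ),
    ],
)
```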
Default criteria and evaluators
Snorkel Flow provides default criteria and evaluators to help you get started.
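For a sense of what simple starting evaluators can look like, the sketch below defines two heuristic checks: containment of a reference answer and a word-count budget. These are illustrative placeholders, not the platform's built-in evaluators, and you would typically replace or customize them for your use case.

```python
# Illustrative heuristic evaluators in the style of simple starting criteria.
def matches_reference(prompt: str, response: str, reference: str) -> float:
    """1.0 if the response contains the reference answer verbatim, else 0.0."""
    return 1.0 if reference.lower() in response.lower() else 0.0

def within_length_budget(prompt: str, response: str, max_words: int = 150) -> float:
    """1.0 if the response stays within a word budget, else 0.0."""
    return 1.0 if len(response.split()) <= max_words else 0.0

print(matches_reference("Capital of France?", "The capital of France is Paris.", "Paris"))
print(within_length_budget("Capital of France?", "The capital of France is Paris."))
```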
Create LLMAJ prompt
Use the evaluator builder to create and customize LLM-as-a-judge (LLMAJ) prompts.
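The sketch below shows the general shape of an LLMAJ prompt: a rubric naming the criterion, the original prompt, and the response, plus a parser that maps the judge's rating onto a numeric score. The wording, the 1-5 scale, and the helper names are assumptions you would tailor to your own criteria.

```python
# Illustrative LLM-as-a-judge prompt template; the rubric wording and the
# 1-5 scale are placeholders, not a fixed format.
JUDGE_PROMPT = """You are grading a customer-support assistant.

Criterion: {criterion}
Prompt: {prompt}
Response: {response}

Rate the response on a scale of 1 (poor) to 5 (excellent) for this criterion.
Answer with a single integer."""

def build_judge_prompt(criterion: str, prompt: str, response: str) -> str:
    """Fill the template for one (prompt, response) pair."""
    return JUDGE_PROMPT.format(criterion=criterion, prompt=prompt, response=response)

def parse_judge_score(raw: str) -> float:
    """Map the judge's 1-5 answer to a 0-1 score; returns 0.0 if unparseable."""
    try:
        return (int(raw.strip()) - 1) / 4.0
    except ValueError:
        return 0.0

print(build_judge_prompt("correctness", "What is 2 + 2?", "4"))
print(parse_judge_score("5"))  # -> 1.0
```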
Run an evaluation benchmark
Once you've completed artifact onboarding and benchmark creation, you can run the evaluation benchmark against your application's responses.
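At its core, a benchmark run applies every evaluator to every example and aggregates the scores. The sketch below shows that loop with made-up data and toy evaluators; it is a conceptual outline, not the platform's execution engine.

```python
from statistics import mean
from typing import Callable, Dict, List

# Minimal sketch of a benchmark run: apply each evaluator to every row and
# aggregate per-criterion scores. Rows and evaluators are illustrative.
rows = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Name a primary color.", "response": "Blue is a primary color."},
]

evaluators: Dict[str, Callable[[str, str], float]] = {
    "non_empty": lambda p, r: 1.0 if r.strip() else 0.0,
    "conciseness": lambda p, r: max(0.0, 1.0 - len(r.split()) / 50.0),
}

def run_benchmark(rows: List[dict], evaluators: Dict[str, Callable]) -> Dict[str, float]:
    """Return the mean score for each criterion across the dataset."""
    return {
        name: mean(fn(row["prompt"], row["response"]) for row in rows)
        for name, fn in evaluators.items()
    }

print(run_benchmark(rows, evaluators))
```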
Refine evaluation benchmark
After running the initial evaluation, you may need to refine the benchmark so its criteria and evaluators better reflect human judgment and your business objectives.
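One common refinement check is measuring how often an evaluator agrees with human review. The sketch below computes a simple agreement rate between thresholded evaluator scores and hypothetical human labels; the numbers are made up for illustration.

```python
# Sketch of one refinement check: how often does an evaluator agree with
# human review? Scores and labels below are illustrative only.
evaluator_scores = [0.9, 0.2, 0.8, 0.4, 0.95]  # evaluator output per example
human_labels = [1, 0, 1, 1, 1]                 # 1 = human marked acceptable

def agreement_rate(scores, labels, threshold=0.5):
    """Fraction of examples where the thresholded score matches the human label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    matches = sum(p == l for p, l in zip(predictions, labels))
    return matches / len(labels)

print(f"Agreement with human review: {agreement_rate(evaluator_scores, human_labels):.0%}")
```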
Export evaluation benchmark
After refining your benchmark to align with your business objectives, you can export it for use outside the platform.
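An exported benchmark is essentially a portable definition of its criteria and evaluators. The sketch below serializes such a definition to JSON so it can be versioned or shared; the schema shown is an assumption, not a fixed export format.

```python
import json

# Illustrative export: serialize the benchmark definition (criteria names,
# descriptions, and evaluator types) so it can be versioned or reused
# elsewhere. The field names here are assumptions for illustration.
benchmark_config = {
    "name": "support-assistant-v1",
    "criteria": [
        {
            "name": "correctness",
            "description": "Response is factually consistent with the reference answer.",
            "evaluator": "llm_as_judge",
        },
        {
            "name": "conciseness",
            "description": "Response is brief and to the point.",
            "evaluator": "heuristic",
        },
    ],
}

with open("benchmark_export.json", "w") as f:
    json.dump(benchmark_config, f, indent=2)
```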
Refine GenAI app
The end goal for GenAI evaluation is to use the insights to refine your GenAI application.
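In practice, that often means turning benchmark results into a prioritized list of fixes. The sketch below sorts illustrative per-slice, per-criterion scores to surface where the application is weakest; the slices, criteria, and numbers are placeholders.

```python
# Sketch of turning benchmark results into a prioritized fix list: find the
# data slices and criteria with the lowest scores. Numbers are illustrative.
scores_by_slice = {
    ("billing", "correctness"): 0.62,
    ("billing", "conciseness"): 0.88,
    ("account_management", "correctness"): 0.91,
    ("account_management", "conciseness"): 0.79,
}

worst_first = sorted(scores_by_slice.items(), key=lambda kv: kv[1])
for (data_slice, criterion), score in worst_first[:3]:
    print(f"Improve {criterion} on '{data_slice}' examples (score {score:.2f})")
```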