Run an initial evaluation benchmark
Once you've completed artifact onboarding and created a benchmark, it's time to run that benchmark against your GenAI model. This is the next stage in the evaluation workflow.
Run the first evaluation to assess the model's performance. The evaluation takes your benchmark data as input, that is, the reference prompts and the responses from the current version of the model, and reports performance across your slices and criteria.
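Conceptually, each datapoint fed into the evaluation pairs a reference prompt with the response generated by the current model version, plus any slice and split membership. The sketch below is illustrative only; the field names are assumptions, not Snorkel's actual schema.

```python
# Illustrative only: one possible shape for evaluation datapoints.
# Field names are assumptions, not Snorkel's actual schema.
datapoints = [
    {
        "prompt": "How do I reset my password?",      # reference prompt
        "response": "Go to Settings > Security ...",  # response from the current model version
        "slices": ["account_management"],             # slices this datapoint belongs to
        "split": "valid",                             # data split (e.g., train or valid)
    },
]
```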
To run an evaluation:

- Select Run a new evaluation.
- Select the criteria you'd like to use for this evaluation.
- Select Run evaluation and wait for the assessment to complete.
The Evaluations page displays the results in two ways:
- Iteration Overview, or performance plot
- Latest report table
Iteration overview
The iteration overview is a plot that shows how your performance has changed over recent benchmark runs. You can select different data splits, criteria, and slices to focus on what matters to you. This plot is a helpful image to share with stakeholders who want to know how your project is going. When you run the first evaluation, the plot shows individual points rather than lines. Once you have run multiple iterations, lines connect the points so you can visually track performance trends.
Axes:
- X-Axis (Runs): Represents different evaluation runs, ordered sequentially (e.g., Run 2 through Run 7). Each run corresponds to an iteration where the evaluation criteria were executed.
- Y-Axis (Criteria Score): Displays the average score for the selected evaluation criterion (e.g., Correctness, Completeness, Safety) for each run. It can also display SME agreement with the programmatic evaluator.
Color legend:
- Blue Line (train split): Shows the performance of the GenAI app on the training split.
- Pink Line (valid split): Shows the performance of the GenAI app on the validation split.
Each dot represents a score for a specific run.
Controls:
- Criteria Selector: Choose the evaluation criteria you want to track (e.g., Correctness).
- Score: Toggle between mean evaluator score and SME agreement rate.
- Split Selector: Toggle between different dataset splits (e.g., train, valid, or both).
- Datapoint Filter: Filter by All Datapoints or specific slices.
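To make the plot's layout concrete, here is a minimal matplotlib sketch that mirrors the same structure using made-up run scores; the numbers and the Correctness criterion are illustrative and not produced by Snorkel.

```python
import matplotlib.pyplot as plt

# Made-up mean Correctness scores per evaluation run, per data split.
runs = [2, 3, 4, 5, 6, 7]
scores = {
    "train": [0.61, 0.64, 0.70, 0.68, 0.74, 0.78],
    "valid": [0.58, 0.60, 0.66, 0.65, 0.71, 0.74],
}

for split, values in scores.items():
    # One line per split; each dot is the mean score for a single run.
    plt.plot(runs, values, marker="o", label=f"{split} split")

plt.xlabel("Runs")            # x-axis: sequential evaluation runs
plt.ylabel("Criteria score")  # y-axis: mean evaluator score for the selected criterion
plt.title("Iteration overview (Correctness)")
plt.legend()
plt.show()
```

With only a single run, each split produces a lone dot at that run's position, which matches the points-before-lines behavior described above.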
Latest report data
Snorkel displays metrics for the latest evaluation run in a table at the bottom of the page. Use the table to select individual data splits. Here you can see the evaluator score per criterion, per slice. Each criterion also shows an agreement rate between the evaluator(s) and the SME; the higher the agreement rate, the higher the confidence in the programmatic evaluator.
The goal in creating a benchmark you can trust is to improve this agreement rate, so that the evaluator scores responses the way a human would, whether that score is high, medium, or low.
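As a rough illustration of how these two numbers relate, the sketch below computes a mean evaluator score per criterion and slice, plus an agreement rate defined as the fraction of datapoints where the evaluator's label matches the SME's. The column names and the exact-match agreement definition are assumptions for illustration, not Snorkel's internal implementation.

```python
import pandas as pd

# Hypothetical per-datapoint results; column names are illustrative assumptions.
results = pd.DataFrame({
    "slice":           ["all", "all", "refunds", "refunds"],
    "criterion":       ["Correctness", "Safety", "Correctness", "Safety"],
    "evaluator_score": [1, 1, 0, 1],   # programmatic evaluator label
    "sme_score":       [1, 0, 0, 1],   # SME (human) label
})

# Mean evaluator score per criterion, per slice (the latest-report table view).
score_table = results.pivot_table(
    index="slice", columns="criterion", values="evaluator_score", aggfunc="mean"
)

# Agreement rate: fraction of datapoints where the evaluator matches the SME.
agreement = (
    (results["evaluator_score"] == results["sme_score"])
    .groupby(results["criterion"])
    .mean()
)

print(score_table)
print(agreement)
```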
Select View all data to get a table view of all data points with evaluator outputs.

Next steps
After running your initial evaluation, you will likely need to refine the benchmark to improve its alignment with your business objectives. This refinement process is iterative and ensures your evaluation provides meaningful insights about your GenAI model's performance.