Version: 0.96

Refine the evaluation benchmark

After running the initial evaluation, you may need to refine the benchmark. This step is iterative: the goal is a benchmark that fully aligns with your business objectives, so that measurements of the GenAI model's performance against it are meaningful.

Refining the benchmark means improving existing evaluation artifacts or developing new ones:

  • Ground truth labels: Add ground truth labels to ensure evaluators are accurate.
  • Criteria: Adjust the criteria based on performance or new enterprise requirements.
  • Data slices: Add new slices based on identified gaps or underperforming areas.
  • Prompts: Refine prompts to ensure that the evaluation covers all necessary aspects of the use case.

For each artifact, here are some refinement steps you might consider. To follow along with an example of how to use Snorkel's evaluation framework, see the Evaluate GenAI output tutorial.

Ground truth labels

After running the first round of evaluators, Snorkel recommends collecting a small number of ground truth labels for the relevant, defined criteria to ensure the evaluators are accurate. To do this:

  1. Create a batch with that specific criterion.
  2. Assign your subject matter experts to review that batch and collect their ground truth labels. Commit these labels as ground truth. For more information, read about annotation.
  3. Add ground truth evaluators and re-run the evaluation.
  4. Compare the ground truth evaluator with the programmatic evaluator for that criterion. If the results are similar, you can trust the evaluator going forward and don't need to collect ground truth labels for this criterion in each iteration. If the results differ, examine the datapoints where the ground truth and the evaluator disagree, and improve the evaluator so that it handles those cases.
    Note: If your evaluators are already validated, you can skip subject matter expert (SME) annotation. For example, an enterprise-specific pre-trained PII/PHI model may not require SME annotation for the use case.
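
The comparison in step 4 can be sketched in plain Python. This is a minimal illustration, not Snorkel's SDK: the function name, the dictionary format (datapoint ID mapped to a label), and the sample labels are all assumptions made for the example. It computes the agreement rate on the datapoints that have ground truth and returns the IDs where the two sources disagree, which are the ones worth reviewing.

```python
def compare_evaluator_to_ground_truth(ground_truth, evaluator_labels):
    """Return (agreement rate, sorted ids where the two sources disagree).

    Both arguments are dicts mapping a datapoint id to a label. This format is
    hypothetical and chosen only for illustration. Datapoints without a ground
    truth label are ignored.
    """
    shared_ids = sorted(set(ground_truth) & set(evaluator_labels))
    if not shared_ids:
        raise ValueError("no overlapping datapoints to compare")
    disagreements = [
        i for i in shared_ids if ground_truth[i] != evaluator_labels[i]
    ]
    agreement = 1 - len(disagreements) / len(shared_ids)
    return agreement, disagreements


# Hypothetical labels for a single criterion.
ground_truth = {"q1": "pass", "q2": "fail", "q3": "pass", "q4": "pass"}
evaluator_labels = {
    "q1": "pass", "q2": "pass", "q3": "pass", "q4": "pass", "q5": "fail",
}

agreement, disagreements = compare_evaluator_to_ground_truth(
    ground_truth, evaluator_labels
)
print(f"agreement: {agreement:.2f}")
print("datapoints to review:", disagreements)
```

What counts as "similar" depends on the criterion and on how costly evaluator mistakes are, so any agreement threshold is a judgment call for your team.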

Ideally, each evaluator reaches trusted status early in an experiment and can then be used to expedite iterative development. Snorkel recommends re-engaging domain experts for high-leverage, ambiguous error buckets throughout development, and again in the final rounds as a pipeline approaches production.

How to improve evaluators

Certain criteria may be too difficult for a single evaluator. For example, an organization's definition of "Correctness" may be so broad that developers find that a single evaluator does not accurately reproduce SME preferences at scale. In cases like this, Snorkel recommends one of the following:

  • Break the criterion down into finer-grained definitions, each of which can be measured by a single evaluator.
  • Rely on high-quality annotations for that criterion during development.
  • Collect gold standard responses and create a custom evaluator to measure similarity to the collected gold standard response.
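
As a rough sketch of the last option, the snippet below scores a response by its token-level F1 overlap with a collected gold standard response. The similarity measure, the function names, the return format, and the 0.5 threshold are all assumptions for illustration; the snippet does not use or represent Snorkel's custom evaluator interface, and a production evaluator might use embeddings or an LLM judge instead.

```python
from collections import Counter


def token_f1(response, gold):
    """Token-level F1 between a response and a gold standard response."""
    resp_tokens = response.lower().split()
    gold_tokens = gold.lower().split()
    if not resp_tokens or not gold_tokens:
        # Two empty strings match; one empty and one non-empty do not.
        return float(resp_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(resp_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def gold_similarity_evaluator(response, gold, threshold=0.5):
    """Hypothetical evaluator: pass when similarity meets the threshold."""
    score = token_f1(response, gold)
    return {"score": score, "passed": score >= threshold}


gold = "Refunds are issued within 5 business days"
close = gold_similarity_evaluator(
    "refunds are issued within 5 business days", gold
)
far = gold_similarity_evaluator("We do not offer refunds", gold)
print(close)
print(far)
```

Lexical overlap is cheap but blind to paraphrase, so it is best validated against ground truth labels in the same way as any other evaluator.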

Criteria

Sometimes new criteria surface, or it becomes clear that the definition of a criterion should be adjusted. To add a new criterion, follow the steps in the onboarding guide. To edit a criterion, select the pencil icon next to an existing label schema and change the name or description.

Reference prompts

You can add more prompts by uploading another datasource.

More best practices for refining the benchmark

  • If most of your data isn't captured by data slices: Consider refining or writing new slicing functions.
  • If a high-priority data slice is under-represented in your dataset: Consider using Snorkel's synthetic data generation modules (SDK) to augment your existing dataset. Also consider retrieving a more diverse instruction set from an existing query stream or knowledge base.
  • If an evaluator is inaccurate: Use the data explorer to identify key failure modes with the evaluator, and create a batch of these inaccurate predictions for an annotator to review. Once ground truth has been collected, you can scale out these measurements via a fine-tuned quality model or include these as few-shot examples in a prompt-engineered LLM-as-judge.
  • To scale a criterion currently measured via ground truth: From the data explorer dialog inside the evaluation dashboard, select Go to Studio. Use the criterion's ground truth and Snorkel's Studio interface to write labeling functions, train a specialized model for that criterion, and register it as a custom evaluator. These fine-tuned quality models can also be used elsewhere in the platform for LLM Fine-tuning and RAG tuning workflows.
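
To check the first two situations in the list above, it can help to measure how much of the dataset each slice captures. The sketch below is plain Python with hypothetical keyword-based slicing functions and sample prompts; it is not Snorkel's slicing function API. It reports the fraction of datapoints in each slice and the fraction that no slice captures.

```python
def slice_coverage(datapoints, slicing_functions):
    """Return per-slice fractions plus the fraction matched by no slice.

    `slicing_functions` maps a slice name to a predicate over a datapoint.
    A datapoint may belong to more than one slice.
    """
    if not datapoints:
        raise ValueError("datapoints must not be empty")
    counts = {name: 0 for name in slicing_functions}
    uncovered = 0
    for datapoint in datapoints:
        matched = False
        for name, predicate in slicing_functions.items():
            if predicate(datapoint):
                counts[name] += 1
                matched = True
        if not matched:
            uncovered += 1
    total = len(datapoints)
    coverage = {name: count / total for name, count in counts.items()}
    return coverage, uncovered / total


# Hypothetical prompts and keyword-based slices.
prompts = [
    "How do I update my billing address?",
    "Why was my invoice charged twice?",
    "Reset my password",
    "What is your refund policy?",
]
slicing_functions = {
    "billing": lambda p: "billing" in p.lower() or "invoice" in p.lower(),
    "account_access": lambda p: "password" in p.lower(),
}

coverage, uncovered_fraction = slice_coverage(prompts, slicing_functions)
print(coverage)
print(f"uncovered: {uncovered_fraction:.2f}")
```

A large uncovered fraction suggests writing new slicing functions, while a small fraction for a high-priority slice suggests augmenting the dataset for that slice.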

Next steps

Now that your benchmark reflects your business objectives, use it to measure and refine your LLM system.