
Onboard and define the artifacts for GenAI evaluation

Evaluation of GenAI output begins with preparing and onboarding your dataset of LLM responses. This includes preprocessing the data, gathering reference prompts, mapping columns, and defining data slices. This is the first step in the evaluation workflow.

What type of GenAI data can you evaluate?

When a user interacts with a GenAI application, they send a prompt to a model. This prompt may optionally be augmented with additional context, such as a system prompt or retrieved context from a RAG system. The model then returns a response.

You can evaluate a batch of instruction and response data, and optionally include the context and/or an ideal (reference) response.
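
For illustration, a single evaluation record might look like the following. The content is made up, and the field names match the column mapping described in the Preprocess data section below.

```python
# Illustrative evaluation record (made-up content). Field names match the
# column mapping described in the "Preprocess data" section below.
record = {
    "instruction": "What is your refund policy for damaged items?",
    "response": "Damaged items can be returned within 30 days for a full refund.",
    # Optional: system prompt and/or retrieved (RAG) context sent with the instruction.
    "context": "System prompt: You are a support agent. Retrieved: Refund policy v2 ...",
    # Optional: the ground truth (golden) response for this instruction.
    "reference_response": "We accept returns of damaged items within 30 days and issue a full refund.",
}
```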

Create reference prompts

Create the set of prompts that the model will be tested against. The prompts should be representative of all the types of questions you want your model to perform well on. These can be collected, curated, authored, and/or generated. Prompts can include questions that were:

  • Mined from historical data, such as user query streams from real questions asked to a chatbot or historical query logs of questions asked to human agents.
  • Provided by Subject Matter Experts (SMEs).
  • Synthetically generated.

Each time you iterate on your fine-tuned model, the responses to the prompts will change, but the prompts themselves will remain consistent across iterations.

(Optional) Ground truth

To get started quickly, you can complete onboarding without gathering ground truth labels. This lets you generate your first benchmark evaluation before involving subject matter experts (SMEs) in annotation, and the benchmark results give you a better idea of how and where to use SME time impactfully. During refinement, see the best practice for gathering ground truth labels for your evaluation workflow. However, if you already have ground truth labels, you can upload them from Datasets > Data Sources > Upload Ground Truth, which adds more signal to your first evaluation benchmark.

For more, see Upload ground truth.

note

Snorkel does not allow you to upload ground truth for traces.

Preprocess data

Before preprocessing your evaluation data, you may want to review the general data preparation guidelines to ensure your data meets Snorkel's requirements.

To evaluate a dataset, the preprocessing step includes:

1. Map columns

Map the column names in your dataset to match the following (see the example after this list):

  • instruction: The instruction, also known as the query or prompt, sent by a user to your GenAI app.
  • response: The response generated by your GenAI application for the corresponding instruction.
  • context: The context added to the instruction that helps the model generate the response. This includes a system prompt. If you are running an application that includes retrieval-augmented generation (RAG), this is the text retrieved and sent alongside the user instruction.
  • reference_response: The ground truth or golden response for the given instruction.
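
As a minimal sketch, you could map the columns with pandas before upload. The original column names in this example (user_query, model_output, retrieved_docs, golden_answer) and the file names are hypothetical.

```python
import pandas as pd

# Hypothetical export whose columns don't yet match the expected names.
df = pd.read_csv("eval_export.csv")

# Rename columns to the names listed above.
df = df.rename(
    columns={
        "user_query": "instruction",            # prompt sent by the user
        "model_output": "response",             # response generated by your GenAI app
        "retrieved_docs": "context",            # system prompt / RAG context (optional)
        "golden_answer": "reference_response",  # ground truth response (optional)
    }
)

df.to_csv("eval_mapped.csv", index=False)
```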

2. (Optional) Include trace data

If your data contains multiple phases related to the same user interaction, you may want to evaluate each step independently. This is especially useful for troubleshooting which phase of a multi-agent system is the source of a substandard response.

To learn how to preprocess your trace data for a multi-agent or multi-step system, read Evaluation for multi-agent systems, using traces.

3. Break your dataset into manageable chunks

For optimal performance, we recommend limiting your dataset to approximately 1,000 traces and 147 million tokens. Larger datasets may impact system performance and slow down benchmark development. If you need to evaluate larger datasets, consider breaking them into smaller batches, or contact Snorkel support for guidance on handling high-volume evaluations.
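
One way to split a large file into smaller batches before upload, assuming a pandas DataFrame and the ~1,000-row guideline above (file names are placeholders):

```python
import pandas as pd

BATCH_SIZE = 1_000  # approximate per-dataset size recommended above

df = pd.read_csv("eval_mapped.csv")

# Write each batch to its own file so each can be uploaded as a separate dataset.
for i, start in enumerate(range(0, len(df), BATCH_SIZE)):
    batch = df.iloc[start : start + BATCH_SIZE]
    batch.to_csv(f"eval_batch_{i:03d}.csv", index=False)
```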

Upload dataset to Snorkel

Once your dataset is prepared, follow the uploading a dataset guide to import it into Snorkel for evaluation.

When you upload a set of reference prompts with accompanying responses and other optional data, choose the following data upload options:

  • Dataset name: Name this dataset.
  • Enable multi-schema annotations: Selected - this is required.
  • Select the data source.
  • Split: It's a best practice to upload both a train split and a valid split.
  • UID Column: Select the data column containing UIDs.
  • In Advanced Settings, for Define data type, select:
    • For reference prompt datasets, Data type is Raw text and Task type is Classification. Select any value for the Primary text field.
    • For trace datasets, Data type is Trace. Select the column containing your trace data for the Trace Column.

Slice data

Create data slices that segment your dataset into meaningful topics. Refer to Using data slices for instructions.

Data slices represent specific subsets of the dataset that you want to measure separately. To learn about slices and see some suggested slice examples, read the Slices conceptual overview.
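
Slices themselves are defined in Snorkel (see Using data slices), but as a quick sketch you can preview how your data might segment before defining slices. The topic column and keyword rules below are hypothetical illustrations, not the Snorkel slice API.

```python
import pandas as pd

df = pd.read_csv("eval_mapped.csv")

# Hypothetical keyword rules to preview how instructions might group into slices.
def assign_topic(instruction: str) -> str:
    text = instruction.lower()
    if "refund" in text or "return" in text:
        return "refunds"
    if "shipping" in text or "delivery" in text:
        return "shipping"
    return "other"

df["topic"] = df["instruction"].apply(assign_topic)
print(df["topic"].value_counts())
```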

Next steps

Once you have an evaluation-ready dataset, data slices, and reference prompts, you can create a benchmark.