
Default criteria and evaluators

Snorkel Flow provides default criteria and evaluators to help you get started. These defaults are not comprehensive, but they are a good starting point; always review and customize them to fit your specific use case.

Dataset requirements

Depending on which default evaluators you use, your dataset requires some or all of the following columns (a minimal example dataset follows this list). For detailed instructions on preparing datasets for evaluation, see Onboard artifacts:

  • instruction: The instruction, also known as the query or prompt, sent by a user to your GenAI app.
  • response: The response generated by your GenAI application for the corresponding instruction.
  • context: The context added to the instruction that helps the model generate the response, including the system prompt. If your application uses retrieval-augmented generation (RAG), this also includes the text retrieved and sent alongside the user instruction.
  • reference_response: The ground truth or golden response for the given instruction.
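
For illustration, the snippet below assembles a tiny dataset with these columns and writes it to a CSV file. It is a minimal sketch: the example rows and the evaluation_dataset.csv file name are placeholders, not values Snorkel Flow requires.

```python
import pandas as pd

# Placeholder rows -- replace with records from your own GenAI application.
rows = [
    {
        "instruction": "What is the refund window for annual plans?",
        "response": "Annual plans can be refunded within 30 days of purchase.",
        "context": (
            "System prompt: Answer using only the provided policy excerpts.\n"
            "Retrieved: Annual subscriptions are refundable within 30 days."
        ),
        "reference_response": "Refunds for annual plans are available within 30 days of purchase.",
    },
]

# The file name is arbitrary; upload the resulting file as described in Onboard artifacts.
pd.DataFrame(rows).to_csv("evaluation_dataset.csv", index=False)
```

Only some of these columns are required for any given evaluator; the tables below list the exact requirements per criterion.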

Review the tables in the next sections to see which columns are required for specific evaluators.

How evaluator prompts are constructed for default criteria

The default criteria are assessed with pre-built prompts, using gpt-4o as the LLM-as-a-judge (LLMAJ). For a deeper dive into LLMAJ evaluators, see Create LLM-as-a-judge prompt.

The tables in the following sections break down what goes into the LLMAJ prompt for each default criterion. Note that each evaluator reports its final assessment as a numeric score, as shown in the Numeric label row of each table.

You can tweak these default LLMAJ prompts to better suit your GenAI application.
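
As a rough sketch of the general pattern (not the exact prompts or code that Snorkel Flow ships), an LLMAJ evaluator fills a judge prompt with a row's columns, sends it to gpt-4o, and parses the JSON verdict. The prompt wording, helper name, and OpenAI client setup below are illustrative assumptions.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt for a correctness-style criterion.
JUDGE_PROMPT = """You are evaluating a GenAI application's response.

Instruction: {instruction}
Response: {response}

Return JSON only: {{"score": 0 or 1, "justification": "<at most 300 characters>"}}
Give a score of 1 if the response is accurate and relevant to the instruction, otherwise 0."""


def judge_correctness(instruction: str, response: str) -> dict:
    """Hypothetical helper: run one LLM-as-a-judge call and parse its JSON verdict."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(instruction=instruction, response=response),
            }
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```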

Defaults for response evaluation

These three default evaluators assess the response to the given instruction for:

  • Safety
  • Correctness
  • Completeness

| | Safety | Correctness | Completeness |
| --- | --- | --- | --- |
| Description | Evaluates the safety of a response in multiple categories. | Evaluates whether the response provides accurate information relevant to the instruction. | Evaluates the completeness of a response based on whether it fully addresses all aspects of the instruction. |
| Required dataset columns | `instruction`, `response` | `instruction`, `response` | `instruction`, `response` |
| Output format | `{"safety_assessment": "str", "violated_categories": "list[str]"}` | `{"score": {"type": "binary", "values": [0, 1]}, "justification": {"type": "text", "max_length": 300}}` | `{"score": {"type": "binary", "values": [0, 1]}, "justification": {"type": "text", "max_length": 300}}` |
| Numeric label | `{"UNKNOWN": -1, "safe": 1, "unsafe": 0}` | `{"UNKNOWN": -1, "incorrect": 0, "correct": 1}` | `{"UNKNOWN": -1, "incomplete": 0, "complete": 1}` |
| Default model | `gpt-4o` | `gpt-4o` | `gpt-4o` |
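
To make the Output format and Numeric label rows concrete, here is one way a binary Correctness verdict could be converted into its numeric label, with UNKNOWN (-1) reserved for malformed judge output. The helper below is a hypothetical sketch, not the built-in conversion code.

```python
CORRECTNESS_LABELS = {"UNKNOWN": -1, "incorrect": 0, "correct": 1}


def correctness_label(judge_output: dict) -> int:
    """Map a verdict like {"score": 1, "justification": "..."} to its numeric label."""
    score = judge_output.get("score")
    if score == 1:
        return CORRECTNESS_LABELS["correct"]
    if score == 0:
        return CORRECTNESS_LABELS["incorrect"]
    return CORRECTNESS_LABELS["UNKNOWN"]  # missing or malformed score
```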

Defaults for retrieval evaluation

These three default evaluators assess the context retrieved for the given instruction, covering:

  • Context recall
  • Faithfulness
  • Context relevance

| | Context recall | Faithfulness | Context relevance |
| --- | --- | --- | --- |
| Description | Evaluates the recall of retrieved context by comparing it with a reference response in a RAG system. | Evaluates how well the response aligns with the context in a RAG system. | Evaluates the relevance of the retrieved context to the instruction in a RAG system. |
| Required dataset columns | `instruction`, `response`, `context`, `reference_response` | `instruction`, `response`, `context` | `instruction`, `context` |
| Output format | `{"score": {"type": "number", "minimum": 0.0, "maximum": 1.0}, "justification": {"type": "string"}}` | `{"score": {"type": "integer", "enum": [0, 1]}, "rationale": {"type": "string"}}` | `{"score": {"type": "integer", "enum": [0, 1]}, "justification": {"type": "string"}}` |
| Numeric label | `{"UNKNOWN": -1, "0.0": 0, "0.1": 1, "0.2": 2, "0.3": 3, "0.4": 4, "0.5": 5, "0.6": 6, "0.7": 7, "0.8": 8, "0.9": 9, "1.0": 10}` | `{"UNKNOWN": -1, "not_faithful": 0, "faithful": 1}` | `{"UNKNOWN": -1, "irrelevant": 0, "relevant": 1}` |
| Default model | `gpt-4o` | `gpt-4o` | `gpt-4o` |
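
Context recall is the one default that returns a continuous score between 0.0 and 1.0 rather than a binary verdict, and its numeric labels bucket that score into tenths. The sketch below shows one plausible mapping; rounding to the nearest tenth is an assumption, not documented behavior.

```python
# "0.0" -> 0, "0.1" -> 1, ..., "1.0" -> 10, plus UNKNOWN for unusable output.
CONTEXT_RECALL_LABELS = {f"{i / 10:.1f}": i for i in range(11)}
CONTEXT_RECALL_LABELS["UNKNOWN"] = -1


def context_recall_label(judge_output: dict) -> int:
    """Bucket a continuous recall score into the 0-10 numeric labels (rounding is assumed)."""
    score = judge_output.get("score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        return CONTEXT_RECALL_LABELS["UNKNOWN"]
    return CONTEXT_RECALL_LABELS[f"{round(score, 1):.1f}"]
```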