Default criteria and evaluators
Snorkel Flow provides default criteria and evaluators to help you get started. These defaults are not comprehensive, but they are a good starting point; always review and customize them to fit your specific use case.
Dataset requirements
Depending on which default evaluators you use, your dataset requires some or all of the following columns. For detailed instructions on preparing datasets for evaluation, see Onboard artifacts:
instruction
: The instruction, also known as the query or prompt, sent by a user to your GenAI app.
response
: The response generated by your GenAI application for the corresponding instruction.
context
: The context added to the instruction that helps the model generate the response. This includes a system prompt. If your application includes retrieval-augmented generation (RAG), this is the text retrieved and sent alongside the user instruction.
reference_response
: The ground truth or golden response for the given instruction.
Review the tables in the next sections to see which columns are required for specific evaluators.
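If you assemble the evaluation dataset programmatically, the sketch below shows one way to lay out these columns with pandas. The column names match the defaults above; the row contents and the `eval_df` variable name are placeholders for illustration, not values from Snorkel Flow.

```python
# A minimal sketch of an evaluation dataset with the default column names.
# The row contents are invented placeholders for illustration.
import pandas as pd

eval_df = pd.DataFrame(
    [
        {
            "instruction": "What is your refund policy for damaged items?",
            "response": "Damaged items can be returned for a full refund within 30 days.",
            "context": (
                "System prompt: You are a support assistant.\n"
                "Retrieved doc: Refunds are issued for damaged items within 30 days of delivery."
            ),
            "reference_response": "We refund damaged items in full if reported within 30 days of delivery.",
        }
    ]
)

# Evaluators that do not use retrieval only need a subset of these columns,
# e.g. the response evaluators require just instruction and response.
print(eval_df[["instruction", "response"]])
```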
How evaluator prompts are constructed for default criteria
The default criteria are assessed with pre-built prompts, using gpt-4o as the LLM-as-a-judge (LLMAJ). For a deeper dive into LLMAJ evaluators, see Create LLM-as-a-judge prompt.
The tables in the next sections break down what goes into the LLMAJ prompt for each default criterion. Note that the evaluator assigns the final assessment as a numeric score, as shown in the tables below.
You can tweak these default LLMAJ prompts to better suit your GenAI application.
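To illustrate the overall shape of an LLMAJ evaluator, the sketch below fills a judge prompt with the instruction and response columns and sends it to gpt-4o. It assumes the OpenAI Python client; the prompt template and field names are illustrative, not Snorkel Flow's built-in default prompts.

```python
# A minimal sketch of an LLM-as-a-judge call, assuming the OpenAI Python client.
# The prompt template is illustrative, not one of Snorkel Flow's built-in prompts.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are evaluating a GenAI application's response.

Instruction:
{instruction}

Response:
{response}

Assess the response and reply with JSON only, for example:
{{"score": 0 or 1, "justification": "<at most 300 characters>"}}"""


def run_judge(instruction: str, response: str) -> str:
    """Send the filled-in judge prompt to gpt-4o and return its raw JSON reply."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": JUDGE_TEMPLATE.format(
                    instruction=instruction, response=response
                ),
            }
        ],
        response_format={"type": "json_object"},
    )
    return completion.choices[0].message.content
```

The JSON reply is then mapped to the numeric label shown in the tables below.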
Defaults for response evaluation
These three default evaluators assess the response to the given instruction for:
- Safety
- Correctness
- Completeness
| | Safety | Correctness | Completeness |
|---|---|---|---|
| Description | Evaluates the safety of a response in multiple categories. | Evaluates whether the response provides accurate information relevant to the instruction. | Evaluates the completeness of a response based on whether it fully addresses all aspects of the instruction. |
| Required dataset columns | instruction, response | instruction, response | instruction, response |
| Output format | {"safety_assessment": "str", "violated_categories": "list[str]"} | {"score": {"type": "binary", "values": [0, 1]}, "justification": {"type": "text", "max_length": 300}} | {"score": {"type": "binary", "values": [0, 1]}, "justification": {"type": "text", "max_length": 300}} |
| Numeric label | {"UNKNOWN": -1, "safe": 1, "unsafe": 0} | {"UNKNOWN": -1, "incorrect": 0, "correct": 1} | {"UNKNOWN": -1, "incomplete": 0, "complete": 1} |
| Default model | gpt-4o | gpt-4o | gpt-4o |
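To show how an output format maps onto its numeric label, here is a sketch for the safety evaluator: it parses the judge's JSON reply and converts the safety_assessment field into the numeric label from the table above. The parsing logic and the example category name are illustrative, not Snorkel Flow's internal implementation.

```python
# A sketch of mapping the safety judge's JSON output to its numeric label,
# following the output format and label mapping in the table above.
import json

SAFETY_LABELS = {"UNKNOWN": -1, "safe": 1, "unsafe": 0}


def safety_label(raw_judge_output: str) -> int:
    """Convert the judge's JSON reply into the numeric safety label."""
    try:
        parsed = json.loads(raw_judge_output)
        assessment = parsed["safety_assessment"]  # expected "safe" or "unsafe"
    except (json.JSONDecodeError, KeyError, TypeError):
        return SAFETY_LABELS["UNKNOWN"]
    return SAFETY_LABELS.get(assessment, SAFETY_LABELS["UNKNOWN"])


# Example: a judge reply flagging a violation ("hate_speech" is an illustrative
# category name, not an official list of safety categories).
print(safety_label('{"safety_assessment": "unsafe", "violated_categories": ["hate_speech"]}'))  # 0
```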
Defaults for retrieval evaluation
These three default evaluators assess the context retrieved for the given instruction on:
- Context recall
- Faithfulness
- Context relevance
| | Context recall | Faithfulness | Context relevance |
|---|---|---|---|
| Description | Evaluates the recall of retrieved context by comparing with a reference response in a RAG system. | Evaluates how well the response aligns with the context in a RAG system. | Evaluates the relevance of the retrieved context to the instruction in a RAG system. |
| Required dataset columns | instruction, response, context, reference_response | instruction, response, context | instruction, context |
| Output format | {"score": {"type": "number", "minimum": 0.0, "maximum": 1.0}, "justification": {"type": "string"}} | {"score": {"type": "integer", "enum": [0, 1]}, "rationale": {"type": "string"}} | {"score": {"type": "integer", "enum": [0, 1]}, "justification": {"type": "string"}} |
| Numeric label | {"UNKNOWN": -1, "0.0": 0, "0.1": 1, "0.2": 2, "0.3": 3, "0.4": 4, "0.5": 5, "0.6": 6, "0.7": 7, "0.8": 8, "0.9": 9, "1.0": 10} | {"UNKNOWN": -1, "not_faithful": 0, "faithful": 1} | {"UNKNOWN": -1, "irrelevant": 0, "relevant": 1} |
| Default model | gpt-4o | gpt-4o | gpt-4o |
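Context recall is the one default that returns a fractional score, which the numeric label buckets into 0 through 10. The sketch below shows one plausible conversion; the rounding rule is an assumption for illustration, not Snorkel Flow's documented behavior.

```python
# A sketch of converting context recall's fractional score (0.0-1.0) into the
# 0-10 numeric label shown above. Rounding to the nearest tenth is an assumed
# bucketing rule for illustration.
def context_recall_label(score: float | None) -> int:
    """Map a 0.0-1.0 recall score to the 0-10 numeric label, or -1 for UNKNOWN."""
    if score is None or not 0.0 <= score <= 1.0:
        return -1  # UNKNOWN
    return round(score * 10)


print(context_recall_label(0.7))   # 7
print(context_recall_label(1.0))   # 10
print(context_recall_label(None))  # -1
```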