Create code evaluator
A code evaluator uses a custom Python function to assess how well your AI application's responses satisfy specific criteria. It assigns a label from the associated criteria's label schema to each datapoint in an evaluation dataset based on your programmed instructions.
This is an optional stage in the evaluation workflow.
Code evaluators are useful when you need:
- Custom evaluation logic that is difficult to express through LLM prompts.
- Deterministic evaluation with consistent, reproducible results.
- Domain-specific metrics that require specialized algorithms.
Read the CodeEvaluator SDK documentation for the full code evaluator reference.
Code evaluator overview
A code evaluator consists of a Python function that takes your model's responses as input and returns evaluation scores and optional rationales. The evaluator function operates on a Pandas DataFrame that contains your AI app's response data and produces results that align with your criteria's label schema.
Code evaluators run asynchronously on an engine job. When they execute, Snorkel sends batches of datapoints to the evaluator function. If the evaluation function raises an unhandled exception, scores for the entire batch are discarded.
Evaluation function signature
Your evaluation function must follow this signature:
import pandas as pd

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    """
    Evaluation function that returns a DataFrame with scores and optional rationales.

    Args:
        df: Input DataFrame containing your uploaded data columns.
            The DataFrame has a MultiIndex with a single level named
            ``__DATAPOINT_UID`` that holds the unique identifier of the datapoint.
            Values in the index are of the form ``("uid1",)``.

    Returns:
        DataFrame with 'score' and optionally 'rationale' columns, same index as input.
    """
    # Your evaluation logic here
    pass
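Because an unhandled exception discards scores for the whole batch, it often pays to guard per-row logic inside the function. The sketch below is illustrative only: the 'response' column name, the binary scoring, and the fallback label are assumptions rather than part of the Snorkel API, and your scores must come from your own criteria's label schema.

import pandas as pd

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative skeleton: keeps the input index and guards each row."""
    scores, rationales = [], []
    for idx in df.index:  # idx is a tuple such as ("uid1",)
        try:
            response = str(df.loc[idx, "response"])  # assumed column name
            score = 1 if response.strip() else 0     # placeholder logic
            rationale = "Non-empty response" if score == 1 else "Empty response"
        except Exception as exc:
            # A single bad row falls back to a default label instead of
            # raising and discarding the whole batch's scores.
            score = 0  # assumed fallback; pick a valid label from your schema
            rationale = f"Evaluator error: {exc}"
        scores.append(score)
        rationales.append(rationale)
    # Same index as the input, with 'score' and optional 'rationale' columns
    return pd.DataFrame({"score": scores, "rationale": rationales}, index=df.index)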
How to create a code evaluator
Prerequisites
Before creating a code evaluator, you need an existing benchmark. Create the benchmark in the Snorkel GUI.
Step 1: Create or select criteria
First, ensure you have a criteria for your code evaluator. You must create criteria for code evaluators via the SDK, because criteria created in the Snorkel GUI use prompt evaluators by default.
import snorkelai.sdk.context as context
from snorkelai.sdk.develop.criteria import Criteria
from snorkelai.sdk.develop.benchmark import Benchmark

# Initialize your context
ctx = context.SnorkelSDKContext.from_endpoint_url(
    "https://your-endpoint.com/",
    api_key="your-api-key",
)

# The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI.
# For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with a benchmark_uid of 100.
your_benchmark_uid = 100  # Replace with your benchmark's UID

# Create a new criteria if needed
criteria = Criteria.create(
    benchmark_uid=your_benchmark_uid,
    name="Custom Length",
    description="Evaluates the length accuracy of model responses",
    label_map={"too_short": 0, "just_right": 1, "too_long": 2},
    requires_rationale=True,
)
print(f"Created criteria: {criteria} with uid {criteria.criteria_uid}")

# If you already have a saved criteria, you can get it from the benchmark instead.
benchmark = Benchmark(benchmark_uid=your_benchmark_uid)
criteria = benchmark.list_criteria()[0]
print(f"Found criteria: {criteria} with uid {criteria.criteria_uid}")
Step 2: Define your evaluation function
Create a Python function that implements your evaluation logic:
import pandas as pd

def evaluate(df):
    """
    Evaluation function that returns a DataFrame with scores and rationales.

    Args:
        df: Input DataFrame to evaluate

    Returns:
        DataFrame with 'score' and 'rationale' columns, same index as input
    """
    # Create a results DataFrame with the same index as the input
    results = pd.DataFrame(index=df.index)

    # Example: Simple length-based evaluation
    # You can implement any custom logic here
    for idx in df.index:
        response = df.loc[idx, 'response']  # Assuming 'response' is a column

        # Custom evaluation logic, mapped to the label schema from Step 1:
        # 0 = too_short, 1 = just_right, 2 = too_long
        if len(response) < 50:
            score = 0
            rationale = "Response is too short and lacks detail"
        elif len(response) < 100:
            score = 1
            rationale = "Response has moderate length and detail"
        else:
            score = 2
            rationale = "Response is too long"

        results.loc[idx, 'score'] = score
        results.loc[idx, 'rationale'] = rationale

    return results
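Before registering the function, you can exercise it locally on a small DataFrame that mimics the __DATAPOINT_UID MultiIndex described in the signature section. The sample rows below are made up for illustration.

import pandas as pd

# Build a tiny DataFrame that mimics the input shape: a MultiIndex with a
# single level named "__DATAPOINT_UID" and tuple values like ("uid1",).
sample_df = pd.DataFrame(
    {
        "response": [
            "Short answer.",
            "A response of moderate length " * 3,
            "A very long response that keeps going " * 5,
        ]
    },
    index=pd.MultiIndex.from_tuples(
        [("uid1",), ("uid2",), ("uid3",)], names=["__DATAPOINT_UID"]
    ),
)

results = evaluate(sample_df)
print(results[["score", "rationale"]])

# The two properties the signature requires: same index, plus a 'score' column
assert results.index.equals(sample_df.index)
assert "score" in results.columns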
Step 3: Create the code evaluator
Use the CodeEvaluator.create() method to create your evaluator:
from snorkelai.sdk.develop.evaluators import CodeEvaluator
# Create the code evaluator
evaluator = CodeEvaluator.create(
    criteria_uid=criteria.criteria_uid,
    evaluate_function=evaluate,
    version_name="v1.0",
)
print(f"Created evaluator: {evaluator}")
Once an evaluator is registered, you can run it via the GUI or via the SDK.
Step 4: Run the evaluation
Execute your code evaluator against your dataset:
# Run the evaluation
execution_uid = evaluator.execute(split="train", num_rows=100)
# Poll for results
status, results = evaluator.poll_execution_result(execution_uid, sync=True)
print(f"Evaluation status: {status}")
print(f"Results: {results}")
(Optional) Step 5: Get execution results
You can view the results of the evaluator execution in the Snorkel GUI, on the Evaluations page for this benchmark.
You can also use the SDK to retrieve and analyze the evaluation results:
# Get execution results
results = evaluator.get_execution_result(execution_uid)
print(f"Results: {results}")
# Get all executions for this evaluator
executions = evaluator.get_executions()
for execution in executions:
    print(f"Execution {execution['execution_uid']}: {execution['created_at']}")
Manage code evaluator versions
You can update code evaluators, track version history, and run a specific version.
Update existing evaluators
You can update an existing code evaluator with new logic:
# Update with new code
new_version_name = evaluator.update(
    version_name="v2.0",
    evaluate_function=new_evaluate_function,
)
print(f"Created new version: {new_version_name}")
View evaluator version history
Track changes across different versions:
# Get all versions of an evaluator
versions = evaluator.get_versions()
for version in versions:
    print(f"Version: {version}")
Run a specific version
Run a specific version of your evaluator:
# Run a specific version
specific_version = versions[0] # Use the first version
execution = evaluator.run(
    split="test",
    num_rows=50,
    code_version=specific_version,
)
Troubleshooting
Common issues
- Function name error: Ensure your function is named exactly evaluate.
- Missing columns: Verify your input DataFrame contains the expected columns.
- Index mismatch: Ensure the output DataFrame has the same index as the input.
- Score range: Make sure scores match your criteria's label schema.
- Label map validation: Ensure your label_map uses consecutive integers starting from 0 (see the check sketched below).
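A quick local check can catch most of these issues before you register or update an evaluator. The helper below is a sketch for local use only, assuming the label_map from Step 1; it is not part of the Snorkel SDK.

import pandas as pd

def check_evaluator_output(input_df, output_df, label_map):
    """Local sanity checks for an evaluate() result. Not part of the Snorkel SDK."""
    # Index mismatch: the output must align with the input
    assert output_df.index.equals(input_df.index), "Output index does not match input index"
    # Missing columns: 'score' is required; 'rationale' is optional
    assert "score" in output_df.columns, "Output is missing a 'score' column"
    # Label map validation: values must be consecutive integers starting from 0
    assert sorted(label_map.values()) == list(range(len(label_map))), \
        "label_map must use consecutive integers starting from 0"
    # Score range: every score must be a value from the label schema
    invalid = set(output_df["score"]) - set(label_map.values())
    assert not invalid, f"Scores {invalid} are not in the label schema {sorted(label_map.values())}"

# Example usage with the evaluate function and sample data from Step 2:
# check_evaluator_output(sample_df, evaluate(sample_df), {"too_short": 0, "just_right": 1, "too_long": 2})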
Debugging tips
- Test your function locally before adding it to Snorkel.
- Use small datasets for initial testing.
- Check the execution logs for detailed error messages.
- Validate that your output format matches expectations.
Known limitations
- Snorkel currently supports a limited set of Python libraries for custom function creation.
Next steps
Read the CodeEvaluator SDK documentation for the full code evaluator reference.
Once you have created your code evaluator, you can:
- Run the benchmark to see evaluation results.
- Refine your evaluator to align with subject matter experts.
- Compare results with LLM-as-a-judge evaluators.
- Export your benchmark for version control.