Create code evaluator
A code evaluator uses a custom Python function to assess how well your AI application's responses satisfy specific criteria. It assigns a label from the associated criteria's label schema to each datapoint in an evaluation dataset based on your programmed instructions.
This is an optional stage in the evaluation workflow.
Code evaluators are useful when you need:
- Custom evaluation logic that is difficult to express through LLM prompts.
- Deterministic evaluation with consistent, reproducible results.
- Domain-specific metrics that require specialized algorithms.
Read the CodeEvaluator SDK documentation for the full code evaluator reference.
Code evaluator overview
A code evaluator consists of a Python function that takes your model's responses as input and returns evaluation scores and optional rationales. The evaluator function operates on a Pandas DataFrame that contains your AI app's response data and produces results that align with your criteria's label schema.
Code evaluators run asynchronously on an engine job. When they execute, Snorkel sends batches of datapoints to the evaluator function. If the evaluation function raises an unhandled exception, scores for the entire batch are discarded.
Evaluation function signature
Your evaluation function must follow this signature:
import pandas as pd

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    """
    Evaluation function that returns a DataFrame with scores and optional rationales.

    Args:
        df: Input DataFrame containing your uploaded data columns.
            The DataFrame has a MultiIndex with a single level named
            ``__DATAPOINT_UID`` that holds the unique identifier of the datapoint.
            Values in the index are of the form ``("uid1",)``.

    Returns:
        DataFrame with 'score' and optionally 'rationale' columns, same index as input.
    """
    # Your evaluation logic here
    pass
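Because an unhandled exception discards scores for the whole batch, it often pays to guard per-row logic inside the function. The sketch below is illustrative only: the 'response' column name, the binary scoring, and the fallback label are assumptions rather than part of the Snorkel API, and your scores must come from your own criteria's label schema.

import pandas as pd

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative skeleton: keeps the input index and guards each row."""
    scores, rationales = [], []
    for idx in df.index:  # idx is a tuple such as ("uid1",)
        try:
            response = str(df.loc[idx, "response"])  # assumed column name
            score = 1 if response.strip() else 0     # placeholder logic
            rationale = "Non-empty response" if score == 1 else "Empty response"
        except Exception as exc:
            # A single bad row falls back to a default label instead of
            # raising and discarding the whole batch's scores.
            score = 0  # assumed fallback; pick a valid label from your schema
            rationale = f"Evaluator error: {exc}"
        scores.append(score)
        rationales.append(rationale)
    # Same index as the input, with 'score' and optional 'rationale' columns
    return pd.DataFrame({"score": scores, "rationale": rationales}, index=df.index)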
How to create a code evaluator
Prerequisites
Before creating a code evaluator, you need an existing benchmark. Create the benchmark in the Snorkel GUI.
Step 1: Create or select criteria
First, ensure you have a criteria for your code evaluator. You must create criteria for code evaluators via the SDK, because criteria created in the Snorkel GUI use prompt evaluators by default.
import snorkelai.sdk.context as context
from snorkelai.sdk.develop.criteria import Criteria
from snorkelai.sdk.develop.benchmark import Benchmark

# Initialize your context
ctx = context.SnorkelSDKContext.from_endpoint_url(
    "https://your-endpoint.com/",
    api_key="your-api-key",
)

# The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI.
# For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with a benchmark_uid of 100.
your_benchmark_uid = 100  # Replace with your benchmark's UID

# Create a new criteria if needed
criteria = Criteria.create(
    benchmark_uid=your_benchmark_uid,
    name="Custom Length",
    description="Evaluates the length accuracy of model responses",
    label_map={"too_short": 0, "just_right": 1, "too_long": 2},
    requires_rationale=True,
)
print(f"Created criteria: {criteria} with uid {criteria.criteria_uid}")

# If you already have a saved criteria, you can get it from the benchmark instead.
benchmark = Benchmark(benchmark_uid=your_benchmark_uid)
criteria = benchmark.list_criteria()[0]
print(f"Found criteria: {criteria} with uid {criteria.criteria_uid}")
Step 2: Define your evaluation function
Create a Python function that implements your evaluation logic:
import pandas as pd

def evaluate(df):
    """
    Evaluation function that returns a DataFrame with scores and rationales.

    Args:
        df: Input DataFrame to evaluate

    Returns:
        DataFrame with 'score' and 'rationale' columns, same index as input
    """
    # Create a results DataFrame with the same index as the input
    results = pd.DataFrame(index=df.index)

    # Example: Simple length-based evaluation
    # You can implement any custom logic here
    for idx in df.index:
        response = df.loc[idx, 'response']  # Assuming 'response' is a column

        # Custom evaluation logic, mapped to the label schema from Step 1:
        # 0 = too_short, 1 = just_right, 2 = too_long
        if len(response) < 50:
            score = 0
            rationale = "Response is too short and lacks detail"
        elif len(response) < 100:
            score = 1
            rationale = "Response has moderate length and detail"
        else:
            score = 2
            rationale = "Response is too long"

        results.loc[idx, 'score'] = score
        results.loc[idx, 'rationale'] = rationale

    return results
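Before registering the function, you can exercise it locally on a small DataFrame that mimics the __DATAPOINT_UID MultiIndex described in the signature section. The sample rows below are made up for illustration.

import pandas as pd

# Build a tiny DataFrame that mimics the input shape: a MultiIndex with a
# single level named "__DATAPOINT_UID" and tuple values like ("uid1",).
sample_df = pd.DataFrame(
    {
        "response": [
            "Short answer.",
            "A response of moderate length " * 3,
            "A very long response that keeps going " * 5,
        ]
    },
    index=pd.MultiIndex.from_tuples(
        [("uid1",), ("uid2",), ("uid3",)], names=["__DATAPOINT_UID"]
    ),
)

results = evaluate(sample_df)
print(results[["score", "rationale"]])

# The two properties the signature requires: same index, plus a 'score' column
assert results.index.equals(sample_df.index)
assert "score" in results.columns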
Step 3: Create the code evaluator
Use the CodeEvaluator.create() method to create your evaluator:
from snorkelai.sdk.develop.evaluators import CodeEvaluator
# Create the code evaluator
evaluator = CodeEvaluator.create(
    criteria_uid=criteria.criteria_uid,
    evaluate_function=evaluate,
    version_name="v1.0",
)
print(f"Created evaluator: {evaluator}")
Once an evaluator is registered, you can run it via the GUI or via the SDK.
Step 4: Run the evaluation
Execute your code evaluator against your dataset:
# Run the evaluation
execution_uid = evaluator.execute(split="train", num_rows=100)
# Poll for results
status, results = evaluator.poll_execution_result(execution_uid, sync=True)
print(f"Evaluation status: {status}")
print(f"Results: {results}")
(Optional) Step 5: Get execution results
You can view the results of the evaluator execution in the Snorkel GUI, on the Evaluations page for this benchmark.
You can also use the SDK to retrieve and analyze the evaluation results:
# Get execution results
results = evaluator.get_execution_result(execution_uid)
print(f"Results: {results}")
# Get all executions for this evaluator
executions = evaluator.get_executions()
for execution in executions:
    print(f"Execution {execution['execution_uid']}: {execution['created_at']}")
Manage code evaluator versions
You can update code evaluators, track version history, and run a specific version.
Update existing evaluators
You can update an existing code evaluator with new logic:
# Update with new code
new_version_name = evaluator.update(
    version_name="v2.0",
    evaluate_function=new_evaluate_function,
)
print(f"Created new version: {new_version_name}")
View evaluator version history
Track changes across different versions:
# Get all versions of an evaluator
versions = evaluator.get_versions()
for version in versions:
    print(f"Version: {version}")
Run a specific version
Run a specific version of your evaluator:
# Run a specific version
specific_version = versions[0] # Use the first version
execution = evaluator.run(
    split="test",
    num_rows=50,
    code_version=specific_version,
)
Troubleshooting
Common issues
- Function name error: Ensure your function is named exactly evaluate.
- Missing columns: Verify your input DataFrame contains the expected columns.
- Index mismatch: Ensure the output DataFrame has the same index as the input.
- Score range: Make sure scores match your criteria's label schema.
- Label map validation: Ensure your label_map uses consecutive integers starting from 0 (see the check sketched below).
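A quick local check can catch most of these issues before you register or update an evaluator. The helper below is a sketch for local use only, assuming the label_map from Step 1; it is not part of the Snorkel SDK.

import pandas as pd

def check_evaluator_output(input_df, output_df, label_map):
    """Local sanity checks for an evaluate() result. Not part of the Snorkel SDK."""
    # Index mismatch: the output must align with the input
    assert output_df.index.equals(input_df.index), "Output index does not match input index"
    # Missing columns: 'score' is required; 'rationale' is optional
    assert "score" in output_df.columns, "Output is missing a 'score' column"
    # Label map validation: values must be consecutive integers starting from 0
    assert sorted(label_map.values()) == list(range(len(label_map))), \
        "label_map must use consecutive integers starting from 0"
    # Score range: every score must be a value from the label schema
    invalid = set(output_df["score"]) - set(label_map.values())
    assert not invalid, f"Scores {invalid} are not in the label schema {sorted(label_map.values())}"

# Example usage with the evaluate function and sample data from Step 2:
# check_evaluator_output(sample_df, evaluate(sample_df), {"too_short": 0, "just_right": 1, "too_long": 2})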
Debugging tips
- Test your function locally before adding it to Snorkel.
- Use small datasets for initial testing.
- Check the execution logs for detailed error messages.
- Validate that your output format matches expectations.
Known limitations
- Snorkel currently supports a limited set of Python libraries for custom function creation.
Next steps
Read the CodeEvaluator SDK documentation for the full code evaluator reference.
Once you have created your code evaluator, you can:
- Run the benchmark to see evaluation results.
- Refine your evaluator to align with subject matter experts.
- Compare results with LLM-as-a-judge evaluators.
- Export your benchmark for version control.