Version: 25.6

snorkelai.sdk.develop.CodeEvaluator

class snorkelai.sdk.develop.CodeEvaluator(*args, **kwargs)

Bases: Evaluator

An evaluator that uses custom Python code to assess an AI application’s responses.

A code evaluator uses a custom Python function to evaluate datapoints containing AI application responses, categorizing each datapoint into one of a criteria’s labels by assigning the corresponding integer score and an optional rationale. The evaluation function takes a DataFrame of datapoints as input and returns a score for each one based on the criteria’s label schema.

The evaluation function can implement any Python logic needed to assess the AI application’s response.

Read more in the Evaluation overview.

Parameters

• benchmark_uid (int) – The unique identifier of the benchmark that contains the criteria. The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI. For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with a benchmark_uid of 100.
• criteria_uid (int) – The unique identifier of the criteria that this evaluator assesses.
• evaluator_uid (int) – The unique identifier for this evaluator.

Examples

Example 1

Creates a new code evaluator, assessing the length of the AI application’s response:

import pandas as pd

def evaluate(df):
    results = pd.DataFrame(index=df.index)
    results["score"] = (df["response"].str.len() > 10).astype(int)  # Simple length check: 1 if longer than 10 characters, else 0
    results["rationale"] = "Response length evaluation"
    return results

# Create a new code evaluator
evaluator = CodeEvaluator.create(
    criteria_uid=200,  # Example criteria UID
    evaluate_function=evaluate,
    version_name="Version 1",
)

Example 2

Gets an existing code evaluator:

# Get existing evaluator
evaluator = CodeEvaluator.get(
    evaluator_uid=300,
)

__init__

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)
create(criteria_uid, **kwargs) – Creates a new code evaluator for a criteria.
execute(split[, num_rows, version_name, sync]) – Executes the code evaluator against a dataset split.
get(evaluator_uid) – Gets an existing code evaluator by its UID.
get_execution_result(execution_uid) – Retrieves the evaluation results for a specific evaluation execution.
get_executions() – Retrieves all executions for this code evaluator.
get_versions() – Retrieves all code version names for this code evaluator.
poll_execution_result(execution_uid[, sync]) – Polls the job status and retrieves partial results.
update(version_name, **kwargs) – Updates the code evaluator with a new evaluation function.

Attributes

benchmark_uid : int
criteria_uid : int
evaluator_uid : int

create

classmethod create(criteria_uid, **kwargs)

Creates a new code evaluator for a criteria.

Parameters

• criteria_uid (int) – The unique identifier of the criteria that this evaluator assesses.
• **kwargs (Any) – Additional parameters. Must include:

  • evaluate_function : Callable[[pd.DataFrame], pd.DataFrame] – A Python function that performs the evaluation (see the sketch after this list). This function must:

    • Be named evaluate

    • Accept a pandas DataFrame as input

    • Return a pandas DataFrame as output

    The input DataFrame has a MultiIndex with a single level named __DATAPOINT_UID that holds the unique identifier of the datapoint. Values in the index are of the form ("uid1",).

    The output DataFrame must:

    • Have the same index as the input DataFrame

    • Include a column named score containing the evaluation results

    • Optionally include a column named rationale with explanations for the scores

May include:

  • version_name : str The name for the initial code version. Defaults to v1.0.
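As a quick check of this contract, you can run your evaluate function locally against a DataFrame shaped like the input described above before registering it. This is an illustrative sketch; the response column and its values are hypothetical and depend on the columns in your dataset:

import pandas as pd

# Hypothetical input: a MultiIndex with a single level named __DATAPOINT_UID
index = pd.MultiIndex.from_tuples([("uid1",), ("uid2",)], names=["__DATAPOINT_UID"])
df = pd.DataFrame(
    {"response": ["Short", "A much longer and more detailed response"]},
    index=index,
)

def evaluate(df):
    results = pd.DataFrame(index=df.index)  # same index as the input
    results["score"] = (df["response"].str.len() > 10).astype(int)
    results["rationale"] = "Response length evaluation"
    return results

# Sanity-check the output shape locally before calling CodeEvaluator.create
print(evaluate(df))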

Raises

ValueError – If the function name is not evaluate or if evaluate_function is not callable.

Return type

CodeEvaluator

Example

Example 1

Creates a new code evaluator, assessing the length of the AI application’s response:

import pandas as pd

def evaluate(df):
    results = pd.DataFrame(index=df.index)
    results["score"] = (df["response"].str.len() > 10).astype(int)
    results["rationale"] = "Response length evaluation"
    return results

evaluator = CodeEvaluator.create(
    criteria_uid=200,  # Example criteria UID
    evaluate_function=evaluate,
    version_name="Length Check v1.0",
)

execute

execute(split, num_rows=None, version_name=None, sync=False, **kwargs)

Executes the code evaluator against a dataset split.

This method runs the evaluation code against the specified dataset split. By default (sync=False) it returns immediately with the execution UID; pass sync=True to block until the run completes. If no version name is specified, the latest code version is used.

Parameters

• split (str) – The dataset split to evaluate against (e.g., “train”, “test”, “validation”).
• num_rows (Optional[int], default None) – The number of rows to evaluate. If None, evaluates all rows in the split.
• version_name (Optional[str], default None) – The code version name to run. If None, the latest code version is used.
• sync (bool, default False) – Whether to wait for the job to complete. If True, blocks until completion.

Return type

int – The execution UID for the run.

Example

Example 1

Run the latest code version and poll for results:

evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
status, results = evaluator.poll_execution_result(code_execution_uid, sync=False)

Example 2

Run a specific code version:

evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(
    split="test",
    num_rows=100,
    version_name="v1.0",
)
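If you prefer to block until the run finishes rather than poll, you can pass sync=True and then read the results directly. A brief sketch:

evaluator = CodeEvaluator.get(evaluator_uid=300)

# Blocks until the evaluation job completes
code_execution_uid = evaluator.execute(split="test", num_rows=100, sync=True)
results = evaluator.get_execution_result(code_execution_uid)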

get

classmethod get(evaluator_uid)

Gets an existing code evaluator by its UID.

Parameters

• evaluator_uid (int) – The unique identifier for the evaluator.

Returns

A CodeEvaluator object representing the existing evaluator.

Return type

CodeEvaluator

Raises

ValueError – If the evaluator is not found.

Example

evaluator = CodeEvaluator.get(
    evaluator_uid=300,
)

get_execution_result

get_execution_result(execution_uid)

Retrieves the evaluation results for a specific evaluation execution.

This method reads the evaluation results from the database for the given evaluation execution UID.

Parameters

• execution_uid (int) – The evaluation execution UID to get results for.

Return type

Dict[str, Dict[str, Union[str, int, float, bool]]]

Example

Example 1

Get the results of a code execution:

evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
results = evaluator.get_execution_result(code_execution_uid)
print(f"Evaluation scores: {results}")

get_executions

get_executions()

Retrieves all executions for this code evaluator.

This method fetches all executions that have been run using this evaluator. Executions are returned in chronological order, with the oldest execution first.

Each execution is returned as a dictionary with the following keys:

  • execution_uid: The execution UID

  • created_at: The timestamp when the execution was created

  • code_version_name: The name of the code version used for the execution

Return type

List[Dict[str, Any]]

Example

Example 1

Get all executions for an evaluator:

evaluator = CodeEvaluator.get(evaluator_uid=300)
executions = evaluator.get_executions()
for execution in executions:
    print(f"Execution {execution['execution_uid']}: {execution['created_at']}")

get_versions

get_versions()

Retrieves all code version names for this code evaluator.

This method fetches all code version names that have been created for this evaluator. Versions are returned in chronological order, with the oldest version first.

Return type

List[str]

Example

Example 1

Get all code version names for an evaluator:

evaluator = CodeEvaluator.get(evaluator_uid=300)
versions = evaluator.get_versions()
for version in versions:
    print(f"Version: {version}")

poll_execution_result

poll_execution_result(execution_uid, sync=False)

Polls the job status and retrieves partial results.

This method checks the current status of the evaluation job and returns both the job status and any available results. The current status can be running, completed, failed, cancelled, or unknown.

Parameters

• execution_uid (int) – The code execution UID to poll for.
• sync (bool, default False) – Whether to wait for the job to complete. If False, returns immediately with the current status and partial results.

Return type

Tuple[str, Dict[str, Dict[str, Union[str, int, float, bool]]]]

Example

Example 1

Poll for job status and partial results:

evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
status, results = evaluator.poll_execution_result(code_execution_uid, sync=False)
print(f"Job status: {status}")
if results:
    print(f"Partial results: {results}")

update

update(version_name, **kwargs)

Updates the code evaluator with a new evaluation function.

This method creates a new code version containing the provided evaluation function. The function must be declared with the name evaluate and will be used to assess datapoints containing AI application responses against the criteria.

Parameters

• version_name (str) – The name for the new code version.
• **kwargs (Dict[str, Any]) – Additional parameters. Must include:

  • evaluate_function : Callable[[pd.DataFrame], pd.DataFrame] – A Python function that performs the evaluation. This function must:

    • Be named evaluate

    • Accept a pandas DataFrame as input

    • Return a pandas DataFrame as output

    The input DataFrame has a MultiIndex with a single level named __DATAPOINT_UID that holds the unique identifier of the datapoint. Values in the index are of the form ("uid1",).

    The output DataFrame must:

    • Have the same index as the input DataFrame

    • Include a column named score containing the evaluation results

    • Optionally include a column named rationale with explanations for the scores

Raises

ValueError – If the function name is not evaluate or if evaluate_function is not provided.

Return type

str – The name of the new code version.

Example

Example 1

Update a code evaluator with a new evaluation function:

import numpy as np
import pandas as pd

def evaluate(df):
    results = pd.DataFrame(index=df.index)

    # Add random integer scores between 0 and 2
    results["score"] = np.random.randint(0, 3, size=len(df))

    # Add random rationales
    rationale_options = [
        "This response is accurate and relevant.",
        "The answer demonstrates good understanding.",
        "Response shows appropriate reasoning.",
        "This is a well-formed answer.",
        "The content is factually correct.",
    ]
    results["rationale"] = np.random.choice(rationale_options, size=len(df))

    return results

evaluator = CodeEvaluator.get(evaluator_uid=300)
version_name = evaluator.update("v2.0", evaluate_function=evaluate)
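After updating, you might confirm that the new version is registered and run it against a split; a short sketch:

versions = evaluator.get_versions()
print(f"Available versions: {versions}")

# Run the newly created code version
execution_uid = evaluator.execute(split="test", version_name="v2.0")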