snorkelai.sdk.develop.CodeEvaluator
- final class snorkelai.sdk.develop.CodeEvaluator(benchmark_uid, criteria_uid, evaluator_uid)
Bases: Evaluator

An evaluator that uses custom Python code to assess an AI application’s responses.
A code evaluator uses custom Python functions to evaluate datapoints containing AI application responses, categorizing them into one of a criteria’s labels by assigning the corresponding integer score and optional rationale. The evaluator function takes a datapoint as input and returns a score based on the criteria’s label schema.
The evaluation function can implement any Python logic needed to assess the AI application’s response.
Read more in the Evaluation overview.
Using the CodeEvaluator class requires the following import:

from snorkelai.sdk.develop import CodeEvaluator

Examples

Example 1
Creates a new code evaluator, assessing the length of the AI application’s response:
import pandas as pd

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    results = pd.DataFrame(index=df.index)
    # Simple length check, cast to an integer score
    results["score"] = (df["response"].str.len() > 10).astype(int)
    results["rationale"] = "Response length evaluation"
    return results

# Create a new code evaluator
evaluator = CodeEvaluator.create(
    criteria_uid=100,
    evaluate_function=evaluate,
    version_name="Version 1"
)

Example 2
Gets an existing code evaluator:
# Get existing evaluator
evaluator = CodeEvaluator.get(
    evaluator_uid=300,
)

- __init__(benchmark_uid, criteria_uid, evaluator_uid)

Parameters
| Name | Type | Default | Info |
|---|---|---|---|
| benchmark_uid | int | | The unique identifier of the benchmark that contains the criteria. The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI. For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with a benchmark_uid of 100. |
| criteria_uid | int | | The unique identifier of the criteria that this evaluator assesses. |
| evaluator_uid | int | | The unique identifier for this evaluator. |
Methods
| Method | Description |
|---|---|
| __init__(benchmark_uid, criteria_uid, ...) | |
| create(criteria_uid, evaluate_function[, ...]) | Creates a new code evaluator for a criteria. |
| delete(evaluator_uid) | Deletion of an evaluator is not implemented. |
| execute(split[, num_rows, version_name, sync]) | Executes the code evaluator against a dataset split. |
| get(evaluator_uid) | Retrieves a code evaluator for a given uid. |
| get_execution_result(execution_uid) | Retrieves the evaluation results for a specific evaluation execution. |
| get_executions() | Retrieves all executions for this code evaluator. |
| get_versions() | Retrieves all code version names for this code evaluator. |
| poll_execution_result(execution_uid[, sync]) | Polls the job status and retrieves partial results. |
| update([version_name, evaluate_function]) | Updates the code evaluator with a new evaluation function. |

Attributes
| Attribute | Description |
|---|---|
| benchmark_uid | Return the UID of the parent benchmark |
| criteria_uid | Return the UID of the parent criteria |
| evaluator_uid | Return the UID of the evaluator |
| uid | Return the UID of the evaluator |

- classmethod create(criteria_uid, evaluate_function, version_name=None)
Creates a new code evaluator for a criteria.
Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| criteria_uid | int | | The unique identifier of the criteria that this evaluator assesses. |
| evaluate_function | Callable[[DataFrame], DataFrame] | | A Python function that performs the evaluation. See the requirements below. |
| version_name | Optional[str] | None | The name for the initial code version. If not provided, a default name will be generated. |

The evaluation function must:

- Be named evaluate
- Accept a pandas DataFrame as input
- Return a pandas DataFrame as output

The input DataFrame has a MultiIndex with a single level named __DATAPOINT_UID that holds the unique identifier of the datapoint. Values in the index are of the form ("uid1",).

The output DataFrame must:

- Have the same index as the input DataFrame
- Include a column named score containing the evaluation results
- Optionally include a column named rationale with explanations for the scores

Raises

ValueError – If the function name is not evaluate or if evaluate_function is not callable.

Return type

CodeEvaluator

Example
Example 1
Creates a new code evaluator, assessing the length of the AI application’s response:
import pandas as pd

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    results = pd.DataFrame(index=df.index)
    results["score"] = (df["response"].str.len() > 10).astype(int)
    results["rationale"] = "Response length evaluation"
    return results

evaluator = CodeEvaluator.create(
    criteria_uid=100,
    evaluate_function=evaluate,
)
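To sanity-check an evaluation function before registering it, you can call it locally on a DataFrame shaped like the one described above. The snippet below is an illustrative sketch, not part of the SDK; it assumes the datapoints expose a response column, as in the example:

import pandas as pd

# Build a small DataFrame that mimics the documented input shape:
# a single-level MultiIndex named __DATAPOINT_UID with tuple values like ("uid1",)
sample = pd.DataFrame(
    {"response": ["Short", "A sufficiently long response"]},
    index=pd.MultiIndex.from_tuples([("uid1",), ("uid2",)], names=["__DATAPOINT_UID"]),
)

results = evaluate(sample)
print(results)
# Expect a "score" column and a "rationale" column, indexed by __DATAPOINT_UID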
- execute(split, num_rows=None, version_name=None, sync=False)
Executes the code evaluator against a dataset split.
This method runs the evaluation code against the specified dataset split. If no version name is specified, it uses the latest version.
Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| split | str | | The dataset split to evaluate against (e.g., "train", "test", "validation"). |
| num_rows | Optional[int] | None | The number of rows to evaluate. If None, evaluates all rows in the split. |
| version_name | Optional[str] | None | The code version name to run. If None, the latest code version is used. |
| sync | bool | False | Whether to wait for the job to complete. If True, blocks until completion. |

Return type

int

Example
Example 1
Run the latest code version and poll for results:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
status, results = evaluator.poll_execution_result(code_execution_uid, sync=False)

Example 2
Run a specific code version:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(
    split="test",
    num_rows=100,
    version_name="v1.0"
)
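Since execute returns the execution UID (an int), a blocking run can be paired directly with get_execution_result. A minimal sketch, assuming sync=True waits for the job to finish:

# Run the latest version synchronously, then fetch the stored results
evaluator = CodeEvaluator.get(evaluator_uid=300)
execution_uid = evaluator.execute(split="test", sync=True)
results = evaluator.get_execution_result(execution_uid)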
- classmethod get(evaluator_uid)
Retrieves a code evaluator for a given uid.
- get_execution_result(execution_uid)
Retrieves the evaluation results for a specific evaluation execution.
This method reads the evaluation results from the database for the given evaluation execution UID.
Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| execution_uid | int | | The evaluation execution UID to get results for. |

Return type

Dict[str, Dict[str, Union[str, int, float, bool]]]

Example
Example 1
Get the results of a code execution:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
results = evaluator.get_execution_result(code_execution_uid)
print(f"Evaluation scores: {results}")
- get_executions()
Retrieves all executions for this code evaluator.
This method fetches all executions that have been run using this evaluator. Executions are returned in chronological order, with the oldest execution first.
Return type

List[Dict[str, Any]]

Each dictionary contains the following keys:

- execution_uid: The execution UID
- created_at: The timestamp when the execution was created
- code_version_name: The name of the code version used for the execution
Example
Example 1
Get all executions for an evaluator:
evaluator = CodeEvaluator.get(evaluator_uid=300)
executions = evaluator.get_executions()
for execution in executions:
print(f"Execution {execution['execution_uid']}: {execution['created_at']}")
- get_versions()
Retrieves all code version names for this code evaluator.
This method fetches all code version names that have been created for this evaluator. Versions are returned in chronological order, with the oldest version first.
Return type
List[str]
Example
Example 1
Get all code version names for an evaluator:
evaluator = CodeEvaluator.get(evaluator_uid=300)
versions = evaluator.get_versions()
for version in versions:
print(f"Version: {version}")
- poll_execution_result(execution_uid, sync=False)
Polls the job status and retrieves partial results.
This method checks the current status of the evaluation job and returns both the job status and any available results. The current status can be running, completed, failed, cancelled, or unknown.

Parameters
| Name | Type | Default | Info |
|---|---|---|---|
| execution_uid | int | | The code execution UID to poll for. |
| sync | bool | False | Whether to wait for the job to complete. If False, returns immediately with current status and partial results. |

Return type

Tuple[str, Dict[str, Dict[str, Union[str, int, float, bool]]]]

Example
Example 1
Poll for job status and partial results:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
status, results = evaluator.poll_execution_result(code_execution_uid, sync=False)
print(f"Job status: {status}")
if results:
print(f"Partial results: {results}")
- update(version_name=None, evaluate_function=None)
Updates the code evaluator with a new evaluation function.
This method creates a new code version containing the provided evaluation function. The function must be declared with the name evaluate and will be used to assess datapoints containing AI application responses against the criteria.

Parameters
| Name | Type | Default | Info |
|---|---|---|---|
| version_name | Optional[str] | None | The name for the new code version. If not provided, a default name will be generated. |
| evaluate_function | Optional[Callable[[DataFrame], DataFrame]] | None | A Python function that performs the evaluation. See the requirements below. |

The evaluation function must:

- Be named evaluate
- Accept a pandas DataFrame as input
- Return a pandas DataFrame as output

The input DataFrame has a MultiIndex with a single level named __DATAPOINT_UID that holds the unique identifier of the datapoint. Values in the index are of the form ("uid1",).

The output DataFrame must:

- Have the same index as the input DataFrame
- Include a column named score containing the evaluation results
- Optionally include a column named rationale with explanations for the scores

Raises

ValueError – If the function name is not evaluate or if evaluate_function is not provided.

Return type

None
Example
Example 1
Update a code evaluator with a new evaluation function:
import numpy as np
import pandas as pd

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    results = pd.DataFrame(index=df.index)
    # Add random integer scores between 0 and 2
    results["score"] = np.random.randint(0, 3, size=len(df))
    # Add random rationales
    rationale_options = [
        "This response is accurate and relevant.",
        "The answer demonstrates good understanding.",
        "Response shows appropriate reasoning.",
        "This is a well-formed answer.",
        "The content is factually correct.",
    ]
    results["rationale"] = np.random.choice(rationale_options, size=len(df))
    return results
evaluator = CodeEvaluator.get(evaluator_uid=300)
evaluator.update(version_name="v2.0", evaluate_function=evaluate)
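After an update, the new version name appears at the end of get_versions (versions are oldest-first) and can be run explicitly; a short sketch using only methods documented on this page:

# Confirm the new version was recorded, then execute it
versions = evaluator.get_versions()
print(versions[-1])  # the newly created version, e.g., "v2.0"
execution_uid = evaluator.execute(split="test", version_name=versions[-1])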