snorkelai.sdk.develop.CodeEvaluator
- class snorkelai.sdk.develop.CodeEvaluator(*args, **kwargs)
Bases: Evaluator
An evaluator that uses custom Python code to assess an AI application’s responses.
A code evaluator uses custom Python functions to evaluate datapoints containing AI application responses, categorizing them into one of a criteria’s labels by assigning the corresponding integer score and optional rationale. The evaluator function takes a datapoint as input and returns a score based on the criteria’s label schema.
The evaluation function can implement any Python logic needed to assess the AI application’s response.
Read more in the Evaluation overview.
Parameters
benchmark_uid : int
    The unique identifier of the benchmark that contains the criteria. The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI. For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with a benchmark_uid of 100.
criteria_uid : int
    The unique identifier of the criteria that this evaluator assesses.
evaluator_uid : int
    The unique identifier for this evaluator.

Examples
Example 1
Creates a new code evaluator, assessing the length of the AI application’s response:
import pandas as pd

def evaluate(df):
    results = pd.DataFrame(index=df.index)
    # Simple length check, cast to an integer score
    results["score"] = (df["response"].str.len() > 10).astype(int)
    results["rationale"] = "Response length evaluation"
    return results

# Create a new code evaluator
evaluator = CodeEvaluator.create(
    criteria_uid=123,  # example criteria UID
    evaluate_function=evaluate,
    version_name="Version 1"
)

Example 2
Gets an existing code evaluator:
# Get existing evaluator
evaluator = CodeEvaluator.get(
    evaluator_uid=300,
)

- __init__(*args, **kwargs)
Methods

__init__(*args, **kwargs)
create(criteria_uid, **kwargs): Creates a new code evaluator for a criteria.
execute(split[, num_rows, version_name, sync]): Executes the code evaluator against a dataset split.
get(evaluator_uid): Gets an existing code evaluator by its UID.
get_execution_result(execution_uid): Retrieves the evaluation results for a specific evaluation execution.
get_executions(): Retrieves all executions for this code evaluator.
get_versions(): Retrieves all code version names for this code evaluator.
poll_execution_result(execution_uid[, sync]): Polls the job status and retrieves partial results.
update(version_name, **kwargs): Updates the code evaluator with a new evaluation function.

Attributes

benchmark_uid
criteria_uid
evaluator_uid
- classmethod create(criteria_uid, **kwargs)
Creates a new code evaluator for a criteria.
Parameters
criteria_uid : int
    The unique identifier of the criteria that this evaluator assesses.
**kwargs : Any
    Additional parameters. Must include evaluate_function; may include version_name.
evaluate_function : Callable[[pd.DataFrame], pd.DataFrame]
    A Python function that performs the evaluation. This function must:
    - Be named evaluate
    - Accept a pandas DataFrame as input
    - Return a pandas DataFrame as output
    The input DataFrame has a MultiIndex with a single level named __DATAPOINT_UID that holds the unique identifier of the datapoint. Values in the index are of the form ("uid1",).
    The output DataFrame must:
    - Have the same index as the input DataFrame
    - Include a column named score containing the evaluation results
    - Optionally include a column named rationale with explanations for the scores
version_name : str
    The name for the initial code version. Defaults to v1.0.

Raises
ValueError – If the function name is not evaluate or if evaluate_function is not callable.

Return type
CodeEvaluator
Example
Creates a new code evaluator, assessing the length of the AI application’s response:
import pandas as pd

def evaluate(df):
    results = pd.DataFrame(index=df.index)
    results["score"] = (df["response"].str.len() > 10).astype(int)
    results["rationale"] = "Response length evaluation"
    return results

evaluator = CodeEvaluator.create(
    criteria_uid=123,  # example criteria UID
    evaluate_function=evaluate,
    version_name="Length Check v1.0"
)
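Before registering the function, it can help to exercise it locally against a DataFrame shaped like the documented input. The snippet below is a minimal sketch: the response column name and the uid values are illustrative assumptions; only the __DATAPOINT_UID index level and the score/rationale output columns come from the parameter description above.

import pandas as pd

# Hypothetical test data mimicking the documented input shape: a MultiIndex
# with a single level named "__DATAPOINT_UID" and a "response" column
# holding the AI application's outputs (column name assumed).
index = pd.MultiIndex.from_tuples([("uid1",), ("uid2",)], names=["__DATAPOINT_UID"])
df = pd.DataFrame({"response": ["Short", "A sufficiently long response"]}, index=index)

results = evaluate(df)
assert results.index.equals(df.index)  # same index as the input
assert "score" in results.columns      # required score column
print(results)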
- execute(split, num_rows=None, version_name=None, sync=False, **kwargs)
Executes the code evaluator against a dataset split.
This method runs the evaluation code against the specified dataset split and returns an execution UID. By default it returns immediately without waiting for the job to finish; set sync=True to block until completion. If no version name is specified, the latest code version is used.
Parameters
split : str
    The dataset split to evaluate against (e.g., “train”, “test”, “validation”).
num_rows : Optional[int]
    The number of rows to evaluate. If None, evaluates all rows in the split. Defaults to None.
version_name : Optional[str]
    The code version name to run. If None, the latest code version is used. Defaults to None.
sync : bool
    Whether to wait for the job to complete. If True, blocks until completion. Defaults to False.

Return type
int

Example
Example 1
Run the latest code version and poll for results:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
status, results = evaluator.poll_execution_result(code_execution_uid, sync=False)

Example 2
Run a specific code version:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(
    split="test",
    num_rows=100,
    version_name="v1.0"
)
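When blocking is acceptable, sync=True can be combined with get_execution_result (documented below) to run the evaluation and read its stored results in one pass. This is a sketch of that flow using only the calls documented on this page:

evaluator = CodeEvaluator.get(evaluator_uid=300)

# Block until the evaluation job finishes, then read the stored results.
code_execution_uid = evaluator.execute(split="test", num_rows=100, sync=True)
results = evaluator.get_execution_result(code_execution_uid)
print(results)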
- classmethod get(evaluator_uid)
Gets an existing code evaluator by its UID.
Parameters
evaluator_uid : int
    The unique identifier for the evaluator.

Returns
A CodeEvaluator object representing the existing evaluator.

Return type
CodeEvaluator

Raises
ValueError – If the evaluator is not found.

Example
evaluator = CodeEvaluator.get(
    evaluator_uid=300,
)
- get_execution_result(execution_uid)
Retrieves the evaluation results for a specific evaluation execution.
This method reads the evaluation results from the database for the given evaluation execution UID.
Parameters
execution_uid : int
    The evaluation execution UID to get results for.

Return type
Dict[str, Dict[str, Union[str, int, float, bool]]]

Example
Get the results of a code execution:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
results = evaluator.get_execution_result(code_execution_uid)
print(f"Evaluation scores: {results}")
- get_executions()
Retrieves all executions for this code evaluator.
This method fetches all executions that have been run using this evaluator. Executions are returned in chronological order, with the oldest execution first.
Return type
List[Dict[str, Any]]

Each dictionary contains the following keys:
- execution_uid: The execution UID
- created_at: The timestamp when the execution was created
- code_version_name: The name of the code version used for the execution
Example
Get all executions for an evaluator:
evaluator = CodeEvaluator.get(evaluator_uid=300)
executions = evaluator.get_executions()
for execution in executions:
print(f"Execution {execution['execution_uid']}: {execution['created_at']}")
- get_versions()
Retrieves all code version names for this code evaluator.
This method fetches all code version names that have been created for this evaluator. Versions are returned in chronological order, with the oldest version first.
Return type
List[str]
Example
Get all code version names for an evaluator:
evaluator = CodeEvaluator.get(evaluator_uid=300)
versions = evaluator.get_versions()
for version in versions:
print(f"Version: {version}")
- poll_execution_result(execution_uid, sync=False)
Polls the job status and retrieves partial results.
This method checks the current status of the evaluation job and returns both the job status and any available results. The current status can be running, completed, failed, cancelled, or unknown.

Parameters
execution_uid : int
    The code execution UID to poll for.
sync : bool
    Whether to wait for the job to complete. If False, returns immediately with the current status and partial results. Defaults to False.

Return type
Tuple[str, Dict[str, Dict[str, Union[str, int, float, bool]]]]

Example
Poll for job status and partial results:
evaluator = CodeEvaluator.get(evaluator_uid=300)
code_execution_uid = evaluator.execute(split="test", num_rows=100)
status, results = evaluator.poll_execution_result(code_execution_uid, sync=False)
print(f"Job status: {status}")
if results:
print(f"Partial results: {results}")
- update(version_name, **kwargs)
Updates the code evaluator with a new evaluation function.
This method creates a new code version containing the provided evaluation function. The function must be declared with the name evaluate and will be used to assess datapoints containing AI application responses against the criteria.

Parameters
version_name : str
    The name for the new code version.
**kwargs : Dict[str, Any]
    Additional parameters. Must include evaluate_function.
evaluate_function : Callable[[pd.DataFrame], pd.DataFrame]
    A Python function that performs the evaluation. This function must:
    - Be named evaluate
    - Accept a pandas DataFrame as input
    - Return a pandas DataFrame as output
    The input DataFrame has a MultiIndex with a single level named __DATAPOINT_UID that holds the unique identifier of the datapoint. Values in the index are of the form ("uid1",).
    The output DataFrame must:
    - Have the same index as the input DataFrame
    - Include a column named score containing the evaluation results
    - Optionally include a column named rationale with explanations for the scores

Raises
ValueError – If the function name is not evaluate or if evaluate_function is not provided.

Return type
str
Example
Update a code evaluator with a new evaluation function:
import numpy as np
import pandas as pd

def evaluate(df):
    results = pd.DataFrame(index=df.index)
    # Add random integer scores of 0, 1, or 2
    results["score"] = np.random.randint(0, 3, size=len(df))
    # Add random rationales
    rationale_options = [
        "This response is accurate and relevant.",
        "The answer demonstrates good understanding.",
        "Response shows appropriate reasoning.",
        "This is a well-formed answer.",
        "The content is factually correct.",
    ]
    results["rationale"] = np.random.choice(rationale_options, size=len(df))
    return results

evaluator = CodeEvaluator.get(evaluator_uid=300)
version_name = evaluator.update("v2.0", evaluate_function=evaluate)
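For a criteria whose label schema maps integers to named labels, a deterministic evaluate function might look like the following sketch. The response column, length thresholds, and three-label mapping are illustrative assumptions; only the score and rationale output contract comes from this page.

import pandas as pd

def evaluate(df):
    # Hypothetical 3-label schema: 0 = poor, 1 = acceptable, 2 = good.
    results = pd.DataFrame(index=df.index)
    lengths = df["response"].str.len()  # assumes a "response" column
    results["score"] = lengths.apply(lambda n: 0 if n <= 20 else (1 if n <= 100 else 2))
    results["rationale"] = "Scored by response length against illustrative thresholds."
    return results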
- benchmark_uid: int
- criteria_uid: int
- evaluator_uid: int