snorkelai.sdk.develop.PromptEvaluator
- class snorkelai.sdk.develop.PromptEvaluator(*args, **kwargs)
Bases: Evaluator
An evaluator that uses LLM prompts to assess model outputs.
This evaluator type is known as an LLM-as-a-judge (LLMAJ). A prompt evaluator uses LLM prompts to evaluate datapoints containing AI application responses, categorizing them into one of a criteria’s labels by assigning the corresponding integer score and optional rationale.
Prompt evaluator execution via the SDK is not yet supported. Please use the GUI to run prompt evaluators.
Read more about LLM-as-a-judge prompts.
Using the PromptEvaluator class requires the following import:
from snorkelai.sdk.develop import PromptEvaluator
- __init__(*args, **kwargs)
Methods
- __init__(*args, **kwargs)
- create(criteria_uid, **kwargs): Creates a new prompt evaluator for a criteria.
- execute(split[, num_rows, version_name, sync]): Executes the prompt evaluator against a dataset split.
- get(evaluator_uid): Retrieves a prompt evaluator for a given uid.
- get_execution_result(execution_uid): Retrieves the evaluation results for a specific evaluation execution.
- get_executions(): Retrieves all executions for this prompt evaluator.
- get_versions(): Gets all version names for a prompt evaluator.
- poll_execution_result(execution_uid[, sync]): Polls the job status and retrieves partial results.
- update([version_name]): Creates a new prompt version for a criteria and updates the evaluator to point to the new prompt version.
Attributes
- benchmark_uid
- criteria_uid
- evaluator_uid
- prompt_workflow_uid
- classmethod create(criteria_uid, **kwargs)
Creates a new prompt evaluator for a criteria.
Parameters
- user_prompt: Optional[str] = None
The user prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.
- system_prompt: Optional[str] = None
The system prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.
- model_name: str
The model to use for the evaluator.
- fm_hyperparameters: Optional[Dict[str, Any]] = None
The hyperparameters to use for the evaluator. These are provided directly to the model provider.
For example, OpenAI supports the response_format hyperparameter. It can be provided in the following way:
PromptEvaluator.create(
    criteria_uid=100,
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_object",
        }
    }
)
Or a more sophisticated example:
PromptEvaluator.create(
    criteria_uid=100,
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "math_reasoning",
                "schema": {
                    "type": "object",
                    "properties": {
                        "steps": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "explanation": {"type": "string"},
                                    "output": {"type": "string"}
                                },
                                "required": ["explanation", "output"],
                                "additionalProperties": False
                            }
                        },
                        "final_answer": {"type": "string"}
                    },
                    "required": ["steps", "final_answer"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    }
)
Returns
A PromptEvaluator object representing the new evaluator.
Return type
PromptEvaluator
Raises
ValueError – If neither user_prompt nor system_prompt is provided, or if model_name is not provided.
- criteria_uid (int): The unique identifier of the criteria that this evaluator assesses.
- **kwargs (Any)
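As a minimal sketch, a create call only needs the criteria UID, a model name, and at least one prompt (the criteria UID and prompt text here are placeholders):
# Minimal call: one prompt plus a model name satisfies the requirements above.
evaluator = PromptEvaluator.create(
    criteria_uid=100,
    user_prompt="Does the response satisfy the criteria? Assign a score and a rationale.",
    model_name="gpt-4o-mini",
)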
- execute(split, num_rows=None, version_name=None, sync=False, **kwargs)
Executes the prompt evaluator against a dataset split.
This method runs the prompt against the specified dataset split. If no version name is specified, it uses the latest version.
Parameters
- split (str): The dataset split to evaluate on.
- num_rows (Optional[int], default None): The number of rows to evaluate on.
- version_name (Optional[str], default None): The version name to use for the execution. If not provided, it uses the latest version.
- sync (bool, default False): Whether to wait for the execution to complete.
Returns
The execution uid.
Return type
int
Example 1
Execute the latest prompt version and poll for results:
import time

evaluator = PromptEvaluator.get(evaluator_uid=300)
prompt_execution_uid = evaluator.execute(split="test", num_rows=100)
while True:
    status, results = evaluator.poll_execution_result(prompt_execution_uid, sync=False)
    print(f"Job status: {status}")
    if status == "completed" or status == "failed":
        break
    if results:
        print(f"Partial results: {results}")
    time.sleep(10)
print(f"Final results: {results}")
Example 2
Execute a specific prompt version and wait for results:
evaluator = PromptEvaluator.get(evaluator_uid=300)
prompt_execution_uid = evaluator.execute(split="train", num_rows=20, version_name="v1.0")
status, results = evaluator.poll_execution_result(prompt_execution_uid, sync=True)
print(f"Status: {status}")
print(f"Results: {results}")
- classmethod get(evaluator_uid)
Retrieves a prompt evaluator for a given uid.
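For example, a quick sketch of retrieving an evaluator and reading its documented attributes (the evaluator UID is a placeholder):
# Fetch an existing evaluator by uid and inspect which criteria and benchmark it belongs to.
evaluator = PromptEvaluator.get(evaluator_uid=300)
print(evaluator.criteria_uid, evaluator.benchmark_uid)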
- get_execution_result(execution_uid)
Retrieves the evaluation results for a specific evaluation execution.
This method reads the evaluation results for the given evaluation execution UID. If the execution is in progress, it will return partial results.
Parameters
- execution_uid (int): The evaluation execution UID to get results for.
Returns
A dictionary mapping x_uids to their evaluation results. The evaluation results for each x_uid are a dictionary with the following keys:
- "score": The score for the datapoint
- "rationale": The rationale for the score
Return type
Dict[str, Dict[str, EvaluationScoreType]]
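For example, a sketch of reading scores and rationales for an execution (the evaluator and execution UIDs are placeholders; "score" and "rationale" are the documented result keys):
evaluator = PromptEvaluator.get(evaluator_uid=300)
results = evaluator.get_execution_result(execution_uid=42)
for x_uid, result in results.items():
    # Each datapoint maps to a score and an optional rationale.
    print(f"{x_uid}: score={result['score']}, rationale={result.get('rationale')}")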
- get_executions()
Retrieves all executions for this prompt evaluator.
This method fetches all executions that have been run using this evaluator. Executions are returned in chronological order, with the oldest execution first.
Each dictionary contains the following keys:
- execution_uid: The execution UID
- created_at: The timestamp when the execution was created
- prompt_version_name: The name of the prompt version used for the execution
Return type
List[Dict[str, Any]]
Example 1
Get all executions for an evaluator:
evaluator = PromptEvaluator.get(evaluator_uid=300)
executions = evaluator.get_executions()
for execution in executions:
    print(f"Execution {execution['execution_uid']}: {execution['created_at']}")
- get_versions()
Gets all version names for a prompt evaluator.
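For example, assuming this returns an iterable of version name strings, the available versions can be listed like this (the evaluator UID is a placeholder):
evaluator = PromptEvaluator.get(evaluator_uid=300)
for version_name in evaluator.get_versions():
    print(version_name)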
- poll_execution_result(execution_uid, sync=False)
Polls the job status and retrieves partial results.
This method checks the current status of the evaluation job and returns both the job status and any available results. The current status can be running, completed, failed, cancelled, or unknown.
Parameters
- execution_uid (int): The prompt execution UID to poll for.
- sync (bool, default False): Whether to wait for the job to complete. If False, returns immediately with current status and partial results.
Return type
Tuple[str, Dict[str, Dict[str, Union[str, int, float, bool]]]]
Example 1
Poll for job status and partial results:
import time

evaluator = PromptEvaluator.get(evaluator_uid=300)
prompt_execution_uid = evaluator.execute(split="test", num_rows=100)
while True:
    status, results = evaluator.poll_execution_result(prompt_execution_uid, sync=False)
    print(f"Job status: {status}")
    if results:
        print(f"Partial results: {results}")
    if status == "completed" or status == "failed":
        break
    time.sleep(10)  # avoid polling in a tight loop while the job runs
print(f"Final results: {results}")
- update(version_name=None, **kwargs)
Creates a new prompt version for a criteria and updates the evaluator to point to the new prompt version.
Parameters
- user_prompt: Optional[str] = None
The user prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.
- system_prompt: Optional[str] = None
The system prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.
- model_name: str
The model to use for the evaluator.
- fm_hyperparameters: Optional[Dict[str, Any]] = None
The hyperparameters to use for the evaluator. These are provided directly to the model provider.
For example, OpenAI supports the response_format hyperparameter. It can be provided in the following way:
evaluator = PromptEvaluator.get(evaluator_uid=300)
evaluator.update(
    version_name="New Version",
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_object",
        }
    }
)
Or a more sophisticated example:
evaluator = PromptEvaluator.get(evaluator_uid=300)
evaluator.update(
    version_name="New Version",
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "math_reasoning",
                "schema": {
                    "type": "object",
                    "properties": {
                        "steps": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "explanation": {"type": "string"},
                                    "output": {"type": "string"}
                                },
                                "required": ["explanation", "output"],
                                "additionalProperties": False
                            }
                        },
                        "final_answer": {"type": "string"}
                    },
                    "required": ["steps", "final_answer"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    }
)
Returns
The version name of the new prompt version.
Return type
str
Raises
ValueError – If neither user_prompt nor system_prompt is provided, or if model_name is not provided.
- version_name (Optional[str], default None): The name for the new prompt version. If not provided, a default name will be generated.
- **kwargs (Any)
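Putting this together, one possible flow is to create a new prompt version with update and then execute that specific version by its returned name (the evaluator UID, version name, and prompts are placeholders):
evaluator = PromptEvaluator.get(evaluator_uid=300)
# update returns the name of the newly created prompt version.
new_version = evaluator.update(
    version_name="v2.0",
    user_prompt="Updated user prompt",
    system_prompt="Updated system prompt",
    model_name="gpt-4o-mini",
)
# Run the evaluator against a split using the new version.
execution_uid = evaluator.execute(split="test", num_rows=20, version_name=new_version)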
- benchmark_uid: int
- criteria_uid: int
- evaluator_uid: int
- prompt_workflow_uid: int