Version: 25.7

snorkelai.sdk.develop.PromptEvaluator

class snorkelai.sdk.develop.PromptEvaluator(*args, **kwargs)

Bases: Evaluator

An evaluator that uses LLM prompts to assess model outputs.

This evaluator type is known as an LLM-as-a-judge (LLMAJ). A prompt evaluator uses LLM prompts to evaluate datapoints containing AI application responses, categorizing each datapoint into one of a criteria's labels by assigning the corresponding integer score and an optional rationale.

Prompt evaluator execution via the SDK is not yet supported. Please use the GUI to run prompt evaluators.

Read more about LLM-as-a-judge prompts.

Using the PromptEvaluator class requires the following import:

from snorkelai.sdk.develop import PromptEvaluator
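
For example, a minimal end-to-end sketch might look like the following (the criteria_uid and prompt text below are placeholders, not values from your workspace):

from snorkelai.sdk.develop import PromptEvaluator

# Create an evaluator for an existing criteria (criteria_uid=100 is a placeholder).
evaluator = PromptEvaluator.create(
    criteria_uid=100,
    user_prompt="Rate the response against the criteria.",
    model_name="gpt-4o-mini",
)

# Inspect the prompt versions registered for this evaluator.
print(evaluator.get_versions())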

__init__

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)
create(criteria_uid, **kwargs) – Creates a new prompt evaluator for a criteria.
execute(split[, num_rows, version_name, sync]) – Executes the prompt evaluator against a dataset split.
get(evaluator_uid) – Retrieves a prompt evaluator for a given uid.
get_execution_result(execution_uid) – Retrieves the evaluation results for a specific evaluation execution.
get_executions() – Retrieves all executions for this prompt evaluator.
get_versions() – Gets all version names for a prompt evaluator.
poll_execution_result(execution_uid[, sync]) – Polls the job status and retrieves partial results.
update([version_name]) – Creates a new prompt version for a criteria and updates the evaluator to point to the new prompt version.

Attributes

benchmark_uid
criteria_uid
evaluator_uid
prompt_workflow_uid

create

classmethod create(criteria_uid, **kwargs)

Creates a new prompt evaluator for a criteria.

Parameters

criteria_uid (int) – The unique identifier of the criteria that this evaluator assesses.

**kwargs (Any) – Keyword arguments:
user_prompt: Optional[str] = None

The user prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.

system_prompt: Optional[str] = None

The system prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.

model_name: str

The model to use for the evaluator.

fm_hyperparameters: Optional[Dict[str, Any]] = None

The hyperparameters to use for the evaluator. These are provided directly to the model provider.

For example, OpenAI supports the response_format hyperparameter. It can be provided in the following way:

PromptEvaluator.create(
    criteria_uid=100,
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_object",
        }
    }
)

Or a more sophisticated example:

PromptEvaluator.create(
    criteria_uid=100,
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "math_reasoning",
                "schema": {
                    "type": "object",
                    "properties": {
                        "steps": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "explanation": {"type": "string"},
                                    "output": {"type": "string"}
                                },
                                "required": ["explanation", "output"],
                                "additionalProperties": False
                            }
                        },
                        "final_answer": {"type": "string"}
                    },
                    "required": ["steps", "final_answer"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    }
)

Returns

A PromptEvaluator object representing the new evaluator.

Return type

PromptEvaluator

Raises

ValueError – If neither user_prompt nor system_prompt is provided, or if model_name is not provided.
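
Example

A minimal sketch of using the returned object (the criteria_uid and prompt text are placeholders):

evaluator = PromptEvaluator.create(
    criteria_uid=100,
    user_prompt="Score the response for factual accuracy.",
    model_name="gpt-4o-mini",
)
# The returned object exposes its identifiers directly.
print(evaluator.evaluator_uid, evaluator.criteria_uid)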

execute

execute(split, num_rows=None, version_name=None, sync=False, **kwargs)

Executes the prompt evaluator against a dataset split.

This method runs the prompt against the specified dataset split. If no version name is specified, it uses the latest version.

Parameters

split (str) – The dataset split to evaluate on.
num_rows (Optional[int], default None) – The number of rows to evaluate on.
version_name (Optional[str], default None) – The version name to use for the execution. If not provided, it uses the latest version.
sync (bool, default False) – Whether to wait for the execution to complete.

Returns

The execution uid.

Return type

int

Example

Example 1

Execute the latest prompt version and poll for results:

import time

evaluator = PromptEvaluator.get(evaluator_uid=300)
prompt_execution_uid = evaluator.execute(split="test", num_rows=100)
while True:
    status, results = evaluator.poll_execution_result(prompt_execution_uid, sync=False)
    print(f"Job status: {status}")
    if status == "completed" or status == "failed":
        break
    if results:
        print(f"Partial results: {results}")
    time.sleep(10)

print(f"Final results: {results}")

Example 2

Execute a specific prompt version and wait for results:

evaluator = PromptEvaluator.get(evaluator_uid=300)
prompt_execution_uid = evaluator.execute(split="train", num_rows=20, version_name="v1.0")
status, results = evaluator.poll_execution_result(prompt_execution_uid, sync=True)
print(f"Status: {status}")
print(f"Results: {results}")

get

classmethod get(evaluator_uid)

Retrieves a prompt evaluator for a given uid.

Parameters

evaluator_uid (int) – The unique identifier for the evaluator.

Returns

A PromptEvaluator instance.

Return type

PromptEvaluator

Raises

ValueError – If the evaluator with the given uid is not a PromptEvaluator.
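
Example

A minimal usage sketch (the evaluator_uid is a placeholder):

evaluator = PromptEvaluator.get(evaluator_uid=300)
# Attributes such as criteria_uid and benchmark_uid are available on the instance.
print(evaluator.criteria_uid, evaluator.benchmark_uid)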

get_execution_result

get_execution_result(execution_uid)

Retrieves the evaluation results for a specific evaluation execution.

This method reads the evaluation results for the given evaluation execution UID. If the execution is in progress, it will return partial results.

Parameters

execution_uid (int) – The evaluation execution UID to get results for.

Returns

A dictionary mapping x_uids to their evaluation results. The evaluation results for each x_uid are a dictionary with the following keys:

  • "score": The score for the datapoint

  • "rationale": The rationale for the score

Return type

Dict[str, Dict[str, EvaluationScoreType]]
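
Example

A short sketch of reading results for an execution (the uids are placeholders):

evaluator = PromptEvaluator.get(evaluator_uid=300)
results = evaluator.get_execution_result(execution_uid=42)
for x_uid, result in results.items():
    # Each result holds the assigned score and, if available, a rationale.
    print(x_uid, result["score"], result.get("rationale"))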

get_executions

get_executions()

Retrieves all executions for this prompt evaluator.

This method fetches all executions that have been run using this evaluator. Executions are returned in chronological order, with the oldest execution first.

Each execution is represented as a dictionary with the following keys:

  • execution_uid: The execution UID

  • created_at: The timestamp when the execution was created

  • prompt_version_name: The name of the prompt version used for the execution

Returns

A list of dictionaries, one per execution.

Return type

List[Dict[str, Any]]

Example

Example 1

Get all executions for an evaluator:

evaluator = PromptEvaluator.get(evaluator_uid=300)
executions = evaluator.get_executions()
for execution in executions:
    print(f"Execution {execution['execution_uid']}: {execution['created_at']}")

get_versions

get_versions()

Gets all version names for a prompt evaluator.

Returns

A list of version names for the prompt evaluator.

Return type

List[str]
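
Example

A minimal sketch (the evaluator_uid is a placeholder):

evaluator = PromptEvaluator.get(evaluator_uid=300)
for version_name in evaluator.get_versions():
    print(version_name)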

poll_execution_result

poll_execution_result(execution_uid, sync=False)

Polls the job status and retrieves partial results.

This method checks the current status of the evaluation job and returns both the job status and any available results. The current status can be running, completed, failed, cancelled, or unknown.

Parameters

execution_uid (int) – The prompt execution UID to poll for.
sync (bool, default False) – Whether to wait for the job to complete. If False, returns immediately with current status and partial results.

Returns

A tuple of the job status and a dictionary mapping x_uids to their evaluation results.

Return type

Tuple[str, Dict[str, Dict[str, Union[str, int, float, bool]]]]

Example

Example 1

Poll for job status and partial results:

import time

evaluator = PromptEvaluator.get(evaluator_uid=300)
prompt_execution_uid = evaluator.execute(split="test", num_rows=100)
while True:
    status, results = evaluator.poll_execution_result(prompt_execution_uid, sync=False)
    print(f"Job status: {status}")
    if results:
        print(f"Partial results: {results}")
    if status == "completed" or status == "failed":
        break
    time.sleep(10)  # avoid busy-waiting while the job runs

print(f"Final results: {results}")

update

update(version_name=None, **kwargs)

Creates a new prompt version for a criteria and updates the evaluator to point to the new prompt version.

Parameters

version_name (Optional[str], default None) – The name for the new prompt version. If not provided, a default name will be generated.

**kwargs (Any) – Keyword arguments:
user_prompt: Optional[str] = None

The user prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.

system_prompt: Optional[str] = None

The system prompt to use for the evaluator. At least one of user_prompt or system_prompt must be provided.

model_name: str

The model to use for the evaluator.

fm_hyperparameters: Optional[Dict[str, Any]] = None

The hyperparameters to use for the evaluator. These are provided directly to the model provider.

For example, OpenAI supports the response_format hyperparameter. It can be provided in the following way:

evaluator = PromptEvaluator.get(evaluator_uid=300)
evaluator.update(
    version_name="New Version",
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_object",
        }
    }
)

Or a more sophisticated example:

evaluator = PromptEvaluator.get(evaluator_uid=300)
evaluator.update(
    version_name="New Version",
    user_prompt="User prompt",
    system_prompt="System prompt",
    model_name="gpt-4o-mini",
    fm_hyperparameters={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "math_reasoning",
                "schema": {
                    "type": "object",
                    "properties": {
                        "steps": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "explanation": {"type": "string"},
                                    "output": {"type": "string"}
                                },
                                "required": ["explanation", "output"],
                                "additionalProperties": False
                            }
                        },
                        "final_answer": {"type": "string"}
                    },
                    "required": ["steps", "final_answer"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    }
)

Returns

The version name of the new prompt version.

Return type

str

Raises

ValueError – If neither user_prompt nor system_prompt is provided, or if model_name is not provided.
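
Example

A brief sketch that creates a new prompt version and then runs it (the uid, version name, and prompt text are placeholders):

evaluator = PromptEvaluator.get(evaluator_uid=300)
new_version = evaluator.update(
    version_name="v2.0",
    user_prompt="Revised user prompt",
    model_name="gpt-4o-mini",
)
# Run the newly created prompt version and wait for results.
execution_uid = evaluator.execute(split="test", num_rows=20, version_name=new_version)
status, results = evaluator.poll_execution_result(execution_uid, sync=True)
print(status, results)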

benchmark_uid: int
criteria_uid: int
evaluator_uid: int
prompt_workflow_uid: int