snorkelai.sdk.develop.Benchmark
- final class snorkelai.sdk.develop.Benchmark(benchmark_uid, name, created_at, updated_at, archived, description=None)
Bases: Base
A benchmark is the collection of characteristics that you care about for a particular GenAI application, and the measurements you use to assess the performance against those characteristics. It consists of the following elements:
Reference prompts: A set of prompts used to evaluate the model’s responses.
Slices: Subsets of reference prompts focusing on specific topics.
Criteria: Key characteristics that represent the features being optimized for evaluation.
Evaluators: Functions that assess whether a model’s output satisfies the criteria.
Read more in the Evaluation overview.
Using the Benchmark class requires the following import:
from snorkelai.sdk.develop import Benchmark
- __init__(benchmark_uid, name, created_at, updated_at, archived, description=None)
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| benchmark_uid | int |  | The unique identifier of the benchmark from which you want to get data. The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI. For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with benchmark_uid of 100. |
| name | str |  | The name of the benchmark. |
| description | Optional[str] | None | The description of the benchmark. |
| created_at | datetime |  | The timestamp when the benchmark was created. |
| updated_at | datetime |  | The timestamp when the benchmark was last updated. |
| archived | bool |  | Whether the benchmark is archived. |
Methods
| Method | Description |
| --- | --- |
| __init__(benchmark_uid, name, created_at, ...) |  |
| archive() | Archives the benchmark, hiding it from the UI and SDK list method. |
| create(name, dataset_uid[, description]) | Creates a new benchmark. |
| delete(benchmark_uid) | Deletion of a benchmark is not implemented. |
| execute([splits, criteria_uids, name]) | Executes the benchmark against the associated dataset. |
| export_config(filepath[, format]) | Exports a benchmark configuration to the specified format and writes to the provided filepath. |
| export_latest_execution(filepath[, config]) | Export the latest benchmark execution with all its associated data. |
| get(benchmark_uid) | Gets a benchmark by its unique identifier. |
| list(workspace_uid[, include_archived]) | Lists all benchmarks for a given workspace. |
| list_criteria([include_archived]) | Retrieves all criteria for this benchmark. |
| list_executions([include_archived]) | Retrieves all benchmark executions for this benchmark. |
| update([name, description, archived]) | Updates the benchmark with the given parameters. |

Attributes
| Attribute | Description |
| --- | --- |
| archived | Return whether the benchmark is archived |
| benchmark_uid | Return the UID of the benchmark |
| created_at | Return the timestamp when the benchmark was created |
| description | Return the description of the benchmark |
| name | Return the name of the benchmark |
| uid | Return the UID of the benchmark |
| updated_at | Return the timestamp when the benchmark was last updated |

- archive()
Archives the benchmark, hiding it from the UI and SDK list method.
Use snorkelai.sdk.develop.benchmarks.Benchmark.list() with include_archived=True to view archived benchmarks.
Return type
None
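Example
A minimal sketch of archiving a benchmark and then confirming it appears when archived benchmarks are included; the benchmark and workspace UIDs shown are placeholders:
from snorkelai.sdk.develop import Benchmark
benchmark = Benchmark.get(100)  # placeholder benchmark UID
benchmark.archive()
# Archived benchmarks are hidden by default; pass include_archived=True to list them
benchmarks = Benchmark.list(workspace_uid=1, include_archived=True)  # placeholder workspace UID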
- static create(name, dataset_uid, description=None)
Creates a new benchmark. The created benchmark does not include any default criteria or evaluators.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| name | str |  | The name of the benchmark. |
| dataset_uid | int |  | The unique identifier of the dataset to use as the input for the benchmark. The dataset_uid can be retrieved using the snorkelai.sdk.develop.datasets.Dataset.list() method. |
| description | Optional[str] | None | The description of the benchmark. |

Returns
A Benchmark object representing the created benchmark.
Return type
Benchmark
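Example
A minimal sketch of creating a benchmark; the dataset UID shown is a placeholder that you would normally look up with snorkelai.sdk.develop.datasets.Dataset.list():
from snorkelai.sdk.develop import Benchmark
benchmark = Benchmark.create(
    name="Chatbot quality benchmark",  # placeholder name
    dataset_uid=42,  # placeholder dataset UID
    description="Evaluates readability and accuracy of chatbot responses",
)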
- classmethod delete(benchmark_uid)
Deletion of a benchmark is not implemented.
- execute(splits=None, criteria_uids=None, name=None)
Executes the benchmark against the associated dataset. For each criteria, evaluation scores are computed for each datapoint and aggregate metrics are computed across all datapoints.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| splits | Optional[List[str]] | None | The splits to execute the benchmark on. If not provided, defaults to ["train", "valid"]. |
| criteria_uids | Optional[List[int]] | None | The criteria to execute the benchmark on. If not provided, defaults to all criteria for the benchmark. |
| name | Optional[str] | None | The name of the execution. If not provided, defaults to "Run <number>" based on the number of previous executions. |

Returns
The execution object.
Return type
BenchmarkExecution
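Example
A minimal sketch of executing a benchmark on the valid split only; the benchmark UID and run name are placeholders:
from snorkelai.sdk.develop import Benchmark
benchmark = Benchmark.get(100)  # placeholder benchmark UID
execution = benchmark.execute(splits=["valid"], name="Nightly eval run")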
- export_config(filepath, format=BenchmarkExportFormat.JSON)
Exports a benchmark configuration to the specified format and writes to the provided filepath.
This method exports the complete benchmark configuration, including all criteria, evaluators, and metadata. The exported configuration can be used for:
Version control of benchmark definitions.
Sharing benchmarks across teams.
Integration with CI/CD pipelines.
Backing up evaluation configurations.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| filepath | str |  | Output file path for exported data. The directory will be created if it doesn't exist. |
| format | BenchmarkExportFormat | BenchmarkExportFormat.JSON | The format to export the config to. Currently only JSON is supported. |

Raises
NotImplementedError – If an unsupported export format is specified.
ValueError – If the benchmark_uid is None or invalid.
Return type
None
Example
Example 1
Export a benchmark configuration to JSON:
benchmark = Benchmark.get(100)
benchmark.export_config("benchmark_config.json")
Example 1 output
The exported JSON file contains:
{
"criteria": [
{
"criteria_uid": 101,
"benchmark_uid": 100,
"name": "Example Readability",
"description": "Evaluates how easy the response is to read and understand.",
"state": "ACTIVE",
"output_format": {
"metric_label_schema_uid": 200,
"rationale_label_schema_uid": 201
},
"metadata": {
"version": "1.0"
},
"created_at": "2025-04-01T14:30:00.123456Z",
"updated_at": "2025-04-01T14:35:10.654321Z"
}
],
"evaluators": [
{
"evaluator_uid": 301,
"name": "Readability Evaluator (LLM)",
"description": "Uses an LLM prompt to assess readability.",
"criteria_uid": 101,
"type": "Prompt",
"prompt_workflow_uid": 401,
"parameters": null,
"metadata": {
"default_prompt_config": {
"name": "Readability Prompt v1",
"model_name": "google/gemini-1.5-pro-latest",
"system_prompt": "You are an expert evaluator assessing text readability.",
"user_prompt": "..."
}
},
"created_at": "2025-04-01T15:00:00.987654Z",
"updated_at": "2025-04-01T15:05:00.123123Z"
}
],
"metadata": {
"name": "Sample Benchmark Set",
"description": "A benchmark set including example evaluations.",
"created_at": "2025-04-01T14:00:00.000000Z",
"created_by": "user@example.com"
}
}
After exporting your benchmark, you can use it to evaluate data from your GenAI application iteratively, allowing you to measure and refine your LLM system.
- export_latest_execution(filepath, config=None)
Export the latest benchmark execution with all its associated data.
This method exports the most recent benchmark execution, including all evaluation results and metadata. The exported dataset contains:
Benchmark metadata for the associated benchmark
Execution metadata for this execution
Each datapoint lists its evaluation scores, which include:
The evaluator outputs
Rationale
Agreement with ground truth
Each datapoint lists its slice membership(s)
(CSV exports only) Uploaded user columns and ground truth
The export includes all datapoints without filtering or sampling. Some datapoints may have missing evaluation scores if the benchmark was not executed against them (for example, datapoints in the test split).
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| filepath | str |  | Output file path for exported data. |
| config | Union[JsonExportConfig, CsvExportConfig, None] | None | A JsonExportConfig or CsvExportConfig object. If not provided, JSON is used by default. No additional configuration is required for JSON exports. For CSV exports, the following parameters are supported: sep – the separator between columns (default ,); quotechar – the character used to quote fields (default "); escapechar – the character used to escape special characters (default \). |

Return type
None
Example
Example 1
Export the latest benchmark execution to JSON:
benchmark = Benchmark.get(100)
benchmark.export_latest_execution("benchmark_execution.json")
Example 1 output
The exported JSON file contains:
{
"benchmark_metadata": {
"uid": 100,
"name": "Example Benchmark",
"description": "A benchmark for testing model performance",
"created_at": "2025-01-01T12:00:00Z",
"created_by": "user@example.com"
},
"execution_metadata": {
"uid": 1,
"name": "Latest Run",
"created_at": "2025-01-01T12:00:00Z",
"created_by": "user@example.com"
},
"data": [
{
"x_uid": "doc::0",
"scores": [
{
"criteria_uid": 101,
"criteria_name": "Readability",
"score_type": "RATIONALE",
"value": "The response is clear and well-structured",
"error": ""
},
{
"criteria_uid": 101,
"criteria_name": "Readability",
"score_type": "EVAL",
"value": 0.85,
},
{
"criteria_uid": 101,
"criteria_name": "Readability",
"score_type": "AGREEMENT",
"value": 1.0
}
],
"slice_membership": ["test_set"]
},
{
"x_uid": "doc::1",
"scores": [
{
"criteria_uid": 101,
"criteria_name": "Readability",
"score_type": "EVAL",
"value": 0.92,
}
],
"slice_membership": ["test_set"]
}
],
"slices": [
{
"id": "None",
"display_name": "All Datapoints",
"reserved_slice_type": "global"
},
{
"id": "-1",
"display_name": "No Slice",
"reserved_slice_type": "no_slice"
},
{
"id": "5",
"display_name": "Your Slice",
"reserved_slice_type": "regular_slice"
}
]
}
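Example 2
A sketch of a CSV export, assuming CsvExportConfig can be imported from snorkelai.sdk.develop (the import path is not documented here) and that it accepts the sep, quotechar, and escapechar parameters described above:
from snorkelai.sdk.develop import Benchmark, CsvExportConfig  # CsvExportConfig import path is an assumption
benchmark = Benchmark.get(100)  # placeholder benchmark UID
csv_config = CsvExportConfig(sep=",", quotechar='"', escapechar="\\")  # parameter names as documented above
benchmark.export_latest_execution("benchmark_execution.csv", config=csv_config)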
- static get(benchmark_uid)
Gets a benchmark by its unique identifier.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| benchmark_uid | int |  | The unique identifier of the benchmark from which you want to get data. The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI. For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with benchmark_uid of 100. |

Returns
A Benchmark object representing the benchmark with the given benchmark_uid.
Return type
Benchmark
- static list(workspace_uid, include_archived=False)
Lists all benchmarks for a given workspace.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| workspace_uid | int |  | The unique identifier of the workspace from which you want to list benchmarks. The workspace_uid can be retrieved using the snorkelai.sdk.client_v3.utils.get_workspace_uid() method. |
| include_archived | bool | False | Whether to include archived benchmarks. |

Returns
A list of Benchmark objects representing all benchmarks in the given workspace.
Return type
List[Benchmark]
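Example
A minimal sketch of listing benchmarks in a workspace; the workspace UID is a placeholder that you would normally retrieve with get_workspace_uid():
from snorkelai.sdk.develop import Benchmark
workspace_uid = 1  # placeholder; retrieve with snorkelai.sdk.client_v3.utils.get_workspace_uid()
for benchmark in Benchmark.list(workspace_uid):
    print(f"{benchmark.benchmark_uid}: {benchmark.name}")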
- list_criteria(include_archived=False)
Retrieves all criteria for this benchmark.
Criteria are the key characteristics that represent the features being optimized for evaluation. Each criteria defines what aspect of the model’s performance is being measured, such as accuracy, relevance, or safety.
Each Criteria object contains:
criteria_uid: The unique identifier for this criteria.
benchmark_uid: The ID of the parent benchmark.
name: The name of the criteria.
description: A detailed description of what the criteria measures.
requires_rationale: Whether the criteria requires a rationale explanation.
label_map: A dictionary mapping user-friendly labels to numeric values.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| include_archived | bool | False | Whether to include archived criteria. |

Returns
A list of Criteria objects representing all criteria in this benchmark.
Return type
List[Criteria]
Example
Example 1
Get all criteria for a benchmark and list them:
benchmark = Benchmark.get(100)
criteria_list = benchmark.list_criteria()
for criteria in criteria_list:
    print(f"Criteria: {criteria.name} - {criteria.description}")
- list_executions(include_archived=False)
Retrieves all benchmark executions for this benchmark.
A benchmark execution represents a single run of a benchmark against a dataset, capturing the results and metadata of that evaluation. Executions are returned in chronological order, with the most recent execution last.
Each BenchmarkExecution object contains:
benchmark_uid: The ID of the parent benchmark.
benchmark_execution_uid: The unique identifier for this execution.
name: The name of the execution.
created_at: Timestamp when the execution was created.
created_by: Username of the execution creator.
archived: Whether the execution is archived.
After retrieving executions, you can export their results using export_latest_execution() or export the benchmark configuration using export_config(). For more information about exporting benchmarks, see Export evaluation benchmark.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| include_archived | bool | False | Whether to include archived executions. |

Return type
List[BenchmarkExecution]
Example
Example 1
Get all executions for a benchmark and list them:
benchmark = Benchmark.get(100)
executions = benchmark.list_executions()
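Building on the example above, a short loop can print the execution fields listed earlier; this sketch assumes name, created_at, and created_by are plain attributes on BenchmarkExecution:
for execution in executions:
    print(f"{execution.name} (created {execution.created_at} by {execution.created_by})")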
- update(name=None, description=None, archived=None)
Updates the benchmark with the given parameters. If a parameter is not provided or is None, the existing value will be left unchanged.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| name | Optional[str] | None | The new name of the benchmark. |
| description | Optional[str] | None | The new description of the benchmark. |
| archived | Optional[bool] | None | Whether the benchmark should be archived. |

Return type
None
Example
benchmark = Benchmark.get(100)
benchmark.update(name="New Name", description="New description")
- property archived: bool
Return whether the benchmark is archived
- property benchmark_uid: int
Return the UID of the benchmark
- property created_at: datetime
Return the timestamp when the benchmark was created
- property description: str | None
Return the description of the benchmark
- property name: str
Return the name of the benchmark
- property uid: int
Return the UID of the benchmark
- property updated_at: datetime
Return the timestamp when the benchmark was last updated
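Example
A minimal sketch that reads these properties from a fetched benchmark; the benchmark UID is a placeholder:
from snorkelai.sdk.develop import Benchmark
benchmark = Benchmark.get(100)  # placeholder benchmark UID
print(benchmark.uid, benchmark.name, benchmark.description)
print(benchmark.created_at, benchmark.updated_at, benchmark.archived)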