snorkelai.sdk.develop.Benchmark
- class snorkelai.sdk.develop.Benchmark(benchmark_uid)
Bases: object
A benchmark is the collection of characteristics that you care about for a particular GenAI application, and the measurements you use to assess the performance against those characteristics. It consists of the following elements:
Reference prompts: A set of prompts used to evaluate the model’s responses.
Slices: Subsets of reference prompts focusing on specific topics.
Criteria: Key characteristics that represent the features being optimized for evaluation.
Evaluators: Functions that assess whether a model’s output satisfies the criteria.
Read more in the Evaluation overview.
- __init__(benchmark_uid)
Initializes a Benchmark object for interacting with benchmark data. This method does not create a new benchmark. It accesses an existing benchmark from Snorkel, using its unique identifier (UID). This allows you to perform further operations on the already-defined benchmark within your SDK environment.
A Benchmark object provides methods to:
List benchmark executions.
Export benchmark configurations.
Export execution results and metadata.
Parameters
benchmark_uid (int) – The unique identifier of the benchmark from which you want to get data.
Raises
ValueError – If the benchmark_uid is None or invalid.
Example 1
Load a benchmark from Snorkel and perform operations on it.
from snorkelai.sdk.develop import Benchmark

# Access the existing benchmark with the UID 100 from Snorkel
benchmark = Benchmark(100)
Methods
__init__(benchmark_uid) – Initializes a Benchmark object for interacting with benchmark data.
export_config(filepath[, format]) – Exports a benchmark configuration to the specified format and writes to the provided filepath.
export_latest_execution(filepath[, config]) – Export the latest benchmark execution with all its associated data.
list_executions() – Retrieves all benchmark executions for this benchmark.
- export_config(filepath, format=BenchmarkExportFormat.JSON)
Exports a benchmark configuration to the specified format and writes to the provided filepath.
This method exports the complete benchmark configuration, including all criteria, evaluators, and metadata. The exported configuration can be used for:
Version control of benchmark definitions.
Sharing benchmarks across teams.
Integration with CI/CD pipelines.
Backing up evaluation configurations.
Parameters
filepath (str) – Output file path for exported data. The directory will be created if it doesn’t exist.
format (BenchmarkExportFormat, default: BenchmarkExportFormat.JSON) – The format to export the config to. Currently only JSON is supported.
Raises
NotImplementedError – If an unsupported export format is specified.
ValueError – If the benchmark_uid is None or invalid.
Return type
None
Example 1
Export a benchmark configuration to JSON:
from snorkelai.sdk.develop import Benchmark

benchmark = Benchmark(100)
benchmark.export_config("benchmark_config.json")
Example output
The exported JSON file contains:
{
  "criteria": [
    {
      "criteria_uid": 101,
      "benchmark_uid": 100,
      "name": "Example Readability",
      "description": "Evaluates how easy the response is to read and understand.",
      "state": "Active",
      "output_format": {
        "metric_label_schema_uid": 200,
        "rationale_label_schema_uid": 201
      },
      "metadata": {
        "version": "1.0"
      },
      "created_at": "2025-04-01T14:30:00.123456Z",
      "updated_at": "2025-04-01T14:35:10.654321Z"
    }
  ],
  "evaluators": [
    {
      "evaluator_uid": 301,
      "name": "Readability Evaluator (LLM)",
      "description": "Uses an LLM prompt to assess readability.",
      "criteria_uid": 101,
      "type": "Prompt",
      "prompt_workflow_uid": 401,
      "parameters": null,
      "metadata": {
        "default_prompt_config": {
          "name": "Readability Prompt v1",
          "model_name": "google/gemini-1.5-pro-latest",
          "system_prompt": "You are an expert evaluator assessing text readability.",
          "user_prompt": "..."
        }
      },
      "created_at": "2025-04-01T15:00:00.987654Z",
      "updated_at": "2025-04-01T15:05:00.123123Z"
    }
  ],
  "metadata": {
    "name": "Sample Benchmark Set",
    "description": "A benchmark set including example evaluations.",
    "created_at": "2025-04-01T14:00:00.000000Z",
    "created_by": "user@example.com"
  }
}
After exporting your benchmark, you can use it to evaluate data from your GenAI application iteratively, allowing you to measure and refine your LLM system.
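As a minimal sketch of consuming an exported configuration, the snippet below groups evaluator names under the criterion they reference. It uses only the Python standard library and assumes the key names shown in the example output (criteria, evaluators, criteria_uid, name). The helper evaluators_by_criterion is a hypothetical name, not part of the SDK, and the inline sample stands in for a real export file.

```python
import json


def evaluators_by_criterion(config):
    """Map each criterion name to the names of the evaluators that reference it."""
    names = {c["criteria_uid"]: c["name"] for c in config.get("criteria", [])}
    grouped = {name: [] for name in names.values()}
    for evaluator in config.get("evaluators", []):
        criterion = names.get(evaluator.get("criteria_uid"))
        if criterion is not None:
            grouped[criterion].append(evaluator["name"])
    return grouped


# Inline stand-in for an exported file. With a real export you would
# instead load it with json.load() from the path passed to export_config().
sample_config = json.loads("""
{
  "criteria": [
    {"criteria_uid": 101, "name": "Example Readability"}
  ],
  "evaluators": [
    {"evaluator_uid": 301, "name": "Readability Evaluator (LLM)", "criteria_uid": 101}
  ]
}
""")

summary = evaluators_by_criterion(sample_config)
for criterion, evaluator_names in summary.items():
    print(criterion, "->", evaluator_names)
```

A summary like this may be handy in a CI check that fails when a criterion has no evaluator attached.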
- export_latest_execution(filepath, config=None)
Export the latest benchmark execution with all its associated data.
This method exports the most recent benchmark execution, including all evaluation results and metadata. The exported dataset contains:
- Per-datapoint evaluation information:
  - Evaluation scores:
    - Parsed evaluator outputs
    - Rationale
    - Agreement with ground truth
  - Slice membership
- Benchmark metadata
- Execution metadata
- (CSV only) Uploaded user columns and ground truth
This export includes all datapoints without filtering or sampling. Some datapoints may have missing evaluation scores if the benchmark has not been executed against them (e.g. those in the test split).
Parameters
filepath (str) – Output file path for exported data.
config (Union[JsonExportConfig, CsvExportConfig, None], default: None) – A JsonExportConfig or CsvExportConfig object. If not provided, JSON will be used by default. No additional configuration is required for JSON exports. For CSV exports, the following parameters are supported:
sep: The separator between columns. Default: ,
quotechar: The character used to quote fields. Default: "
escapechar: The character used to escape special characters. Default: \
Return type
None
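When reading a CSV export back in, the same three settings (sep, quotechar, escapechar) need to be passed to whatever parser you use. The sketch below shows how they map onto Python's standard csv module, where sep corresponds to the delimiter argument. The column names and sample rows are hypothetical, since the exact CSV layout is not specified here, and read_exported_csv is an illustrative helper rather than an SDK function.

```python
import csv
import io

# Defaults documented for CSV exports.
SEP = ","
QUOTECHAR = '"'
ESCAPECHAR = "\\"


def read_exported_csv(text, sep=SEP, quotechar=QUOTECHAR, escapechar=ESCAPECHAR):
    """Parse CSV text into a list of dicts using the export's separator settings."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=sep,
        quotechar=quotechar,
        escapechar=escapechar,
    )
    return list(reader)


# Hypothetical sample: one field contains the separator, another an escaped quote.
sample = "\n".join([
    "x_uid,rationale,score",
    'doc::0,"Clear, well-structured",0.85',
    r'doc::1,"Uses a \"quoted\" term",0.92',
]) + "\n"

rows = read_exported_csv(sample)
for row in rows:
    print(row["x_uid"], "|", row["rationale"], "|", row["score"])
```

Note that the csv module returns every field as a string, so numeric columns need an explicit conversion such as float(row["score"]).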
Example 1
Export the latest benchmark execution to JSON:
from snorkelai.sdk.develop import Benchmark

benchmark = Benchmark(100)
benchmark.export_latest_execution("benchmark_execution.json")
Example 1 return
The exported JSON file contains:
{
  "benchmark_metadata": {
    "uid": 100,
    "name": "Example Benchmark",
    "description": "A benchmark for testing model performance",
    "created_at": "2025-01-01T12:00:00Z",
    "created_by": "user@example.com"
  },
  "execution_metadata": {
    "uid": 1,
    "name": "Latest Run",
    "created_at": "2025-01-01T12:00:00Z",
    "created_by": "user@example.com"
  },
  "data": [
    {
      "x_uid": "doc::0",
      "scores": [
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "RATIONALE",
          "value": "The response is clear and well-structured",
          "error": ""
        },
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "EVAL",
          "value": 0.85
        },
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "AGREEMENT",
          "value": 1.0
        }
      ],
      "slice_membership": ["test_set"]
    },
    {
      "x_uid": "doc::1",
      "scores": [
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "EVAL",
          "value": 0.92
        }
      ],
      "slice_membership": ["test_set"]
    }
  ],
  "slices": [
    {
      "id": "None",
      "display_name": "All Datapoints",
      "reserved_slice_type": "global"
    },
    {
      "id": "-1",
      "display_name": "No Slice",
      "reserved_slice_type": "no_slice"
    },
    {
      "id": "5",
      "display_name": "Your Slice",
      "reserved_slice_type": "regular_slice"
    }
  ]
}
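Because some datapoints may carry no evaluation scores, any aggregation over the export should skip them rather than treat them as zero. The following sketch averages the EVAL scores per criterion, assuming the JSON layout shown in the example above (data, scores, score_type, criteria_name, value). The function mean_eval_scores is a hypothetical helper and the inline sample is a stand-in for a real export file.

```python
import json
from collections import defaultdict


def mean_eval_scores(export):
    """Return {criteria_name: mean EVAL value}, ignoring datapoints without EVAL scores."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for datapoint in export.get("data", []):
        for score in datapoint.get("scores", []):
            if score.get("score_type") != "EVAL":
                continue  # skip RATIONALE and AGREEMENT entries
            value = score.get("value")
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue  # skip missing or non-numeric values
            totals[score["criteria_name"]] += value
            counts[score["criteria_name"]] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Inline stand-in for an exported file. The third datapoint has no scores,
# as can happen when the benchmark has not been executed against it.
sample_export = json.loads("""
{
  "data": [
    {"x_uid": "doc::0", "scores": [
      {"criteria_name": "Readability", "score_type": "RATIONALE", "value": "Clear"},
      {"criteria_name": "Readability", "score_type": "EVAL", "value": 0.85}
    ]},
    {"x_uid": "doc::1", "scores": [
      {"criteria_name": "Readability", "score_type": "EVAL", "value": 0.92}
    ]},
    {"x_uid": "doc::2", "scores": []}
  ]
}
""")

means = mean_eval_scores(sample_export)
for name, mean in means.items():
    print(f"{name}: {mean:.3f}")
```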
- list_executions()
Retrieves all benchmark executions for this benchmark.
A benchmark execution represents a single run of a benchmark against a dataset, capturing the results and metadata of that evaluation. Executions are returned in chronological order, with the most recent execution last.
Return type
List[BenchmarkExecution]
Each BenchmarkExecution object contains:
benchmark_uid: The ID of the parent benchmark.
benchmark_execution_uid: The unique identifier for this execution.
name: The name of the execution.
created_at: Timestamp when the execution was created.
created_by: Username of the execution creator.
After retrieving executions, you can export their results using export_latest_execution() or export the benchmark configuration using export_config(). For more information about exporting benchmarks, see Export evaluation benchmark.
Example 1
Get all executions for a benchmark and list them:
from snorkelai.sdk.develop import Benchmark

benchmark = Benchmark(100)
executions = benchmark.list_executions()
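Since executions are returned in chronological order with the most recent last, picking the latest one is a matter of taking the final list element. The sketch below illustrates this without the SDK: ExecutionStub is a hypothetical stand-in that mirrors the documented BenchmarkExecution fields, and representing created_at as an ISO 8601 string is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecutionStub:
    """Hypothetical stand-in mirroring the documented BenchmarkExecution fields."""
    benchmark_uid: int
    benchmark_execution_uid: int
    name: str
    created_at: str  # assumed to be an ISO 8601 string for this sketch
    created_by: str


def latest_execution(executions: List[ExecutionStub]) -> Optional[ExecutionStub]:
    """Return the most recent execution, or None if the benchmark has never run."""
    # Relies on the documented ordering: chronological, most recent last.
    return executions[-1] if executions else None


executions = [
    ExecutionStub(100, 1, "First Run", "2025-01-01T12:00:00Z", "user@example.com"),
    ExecutionStub(100, 2, "Latest Run", "2025-01-02T12:00:00Z", "user@example.com"),
]

latest = latest_execution(executions)
if latest is not None:
    print(latest.benchmark_execution_uid, latest.name, latest.created_at)
```

Handling the empty list explicitly avoids an IndexError for benchmarks that have not been executed yet.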