Version: 25.7

snorkelai.sdk.develop.Benchmark

class snorkelai.sdk.develop.Benchmark(*args, **kwargs)

Bases: BaseModel

A benchmark is the collection of characteristics that you care about for a particular GenAI application, and the measurements you use to assess the performance against those characteristics. It consists of the following elements:

  • Reference prompts: A set of prompts used to evaluate the model’s responses.

  • Slices: Subsets of reference prompts focusing on specific topics.

  • Criteria: Key characteristics that represent the features being optimized for evaluation.

  • Evaluators: Functions that assess whether a model’s output satisfies the criteria.

Read more in the Evaluation overview.

Using the Benchmark class requires the following import:

from snorkelai.sdk.develop import Benchmark

Parameters

  • benchmark_uid (int): The unique identifier of the benchmark from which you want to get data. The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI. For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with benchmark_uid of 100.

  • name (str): The name of the benchmark.

  • description (Optional[str], default None): The description of the benchmark.

  • created_at (datetime): The timestamp when the benchmark was created.

  • updated_at (datetime): The timestamp when the benchmark was last updated.

  • archived (bool): Whether the benchmark is archived.
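
For example, to fetch an existing benchmark and inspect these attributes (a minimal sketch using the get() method documented below, with the benchmark_uid of 100 from the URL example above):

benchmark = Benchmark.get(100)
print(benchmark.name, benchmark.created_at, benchmark.archived)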

__init__

__init__(*args, **kwargs)

Methods

  • __init__(*args, **kwargs)

  • archive(): Archives the benchmark, hiding it from the UI and SDK list method.

  • create(name, dataset_uid[, description]): Creates a new benchmark.

  • execute([splits, criteria_uids, name]): Executes the benchmark against the associated dataset.

  • export_config(filepath[, format]): Exports a benchmark configuration to the specified format and writes to the provided filepath.

  • export_latest_execution(filepath[, config]): Exports the latest benchmark execution with all its associated data.

  • get(benchmark_uid): Gets a benchmark by its unique identifier.

  • list(workspace_uid[, include_archived]): Lists all benchmarks for a given workspace.

  • list_criteria([include_archived]): Retrieves all criteria for this benchmark.

  • list_executions([include_archived]): Retrieves all benchmark executions for this benchmark.

  • update([name, description, archived]): Updates the benchmark with the given parameters.

Attributes

description: Optional[str] = None
benchmark_uid: int
name: str
created_at: datetime
updated_at: datetime
archived: bool

archive

archive()

Archives the benchmark, hiding it from the UI and SDK list method.

Use snorkelai.sdk.develop.benchmarks.Benchmark.list() with include_archived=True to view archived benchmarks.

Return type

None
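
Example

Archive a benchmark, then confirm it still appears when archived benchmarks are included. This is a minimal sketch; the workspace_uid of 1 is a placeholder for your own workspace.

benchmark = Benchmark.get(100)
benchmark.archive()

# Archived benchmarks are hidden unless include_archived=True.
benchmarks = Benchmark.list(workspace_uid=1, include_archived=True)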

create

static create(name, dataset_uid, description=None)

Creates a new benchmark. The created benchmark does not include any default criteria or evaluators.

Parameters

  • name (str): The name of the benchmark.

  • dataset_uid (int): The unique identifier of the dataset to use as the input for the benchmark. The dataset_uid can be retrieved using the snorkelai.sdk.develop.datasets.Dataset.list() method.

  • description (Optional[str], default None): The description of the benchmark.

Returns

A Benchmark object representing the created benchmark.

Return type

Benchmark
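
Example

Create a benchmark for an existing dataset. This is a minimal sketch; the dataset_uid of 5 is a placeholder for a value retrieved with Dataset.list().

benchmark = Benchmark.create(
    name="Chatbot quality",
    dataset_uid=5,  # placeholder; look up the real uid with Dataset.list()
    description="Evaluates response quality for the support chatbot",
)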

execute

execute(splits=None, criteria_uids=None, name=None)

Executes the benchmark against the associated dataset. For each criterion, evaluation scores are computed for each datapoint, and aggregate metrics are computed across all datapoints.

Parameters

  • splits (Optional[List[str]], default None): The splits to execute the benchmark on. If not provided, defaults to ["train", "valid"].

  • criteria_uids (Optional[List[int]], default None): The criteria to execute the benchmark on. If not provided, defaults to all criteria for the benchmark.

  • name (Optional[str], default None): The name of the execution. If not provided, defaults to "Run <number>" based on the number of previous executions.

Returns

The execution object.

Return type

BenchmarkExecution
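
Example

Run the benchmark on the train split only, under a custom run name. This is a minimal sketch; criteria_uids is left unset, so all of the benchmark's criteria are evaluated.

benchmark = Benchmark.get(100)
execution = benchmark.execute(splits=["train"], name="Readability check")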

export_config

export_config(filepath, format=BenchmarkExportFormat.JSON)

Exports a benchmark configuration to the specified format and writes to the provided filepath.

This method exports the complete benchmark configuration, including all criteria, evaluators, and metadata. The exported configuration can be used for:

  • Version control of benchmark definitions.

  • Sharing benchmarks across teams.

  • Integration with CI/CD pipelines.

  • Backing up evaluation configurations.

Parameters

  • filepath (str): Output file path for exported data. The directory will be created if it doesn’t exist.

  • format (BenchmarkExportFormat, default BenchmarkExportFormat.JSON): The format to export the config to. Currently only JSON is supported.

Raises

  • NotImplementedError – If an unsupported export format is specified.

  • ValueError – If the benchmark_uid is None or invalid.

Return type

None

Example

Example 1

Export a benchmark configuration to JSON:

benchmark = Benchmark.get(100)
benchmark.export_config("benchmark_config.json")

Example 1 output

The exported JSON file contains:

{
  "criteria": [
    {
      "criteria_uid": 101,
      "benchmark_uid": 100,
      "name": "Example Readability",
      "description": "Evaluates how easy the response is to read and understand.",
      "state": "ACTIVE",
      "output_format": {
        "metric_label_schema_uid": 200,
        "rationale_label_schema_uid": 201
      },
      "metadata": {
        "version": "1.0"
      },
      "created_at": "2025-04-01T14:30:00.123456Z",
      "updated_at": "2025-04-01T14:35:10.654321Z"
    }
  ],
  "evaluators": [
    {
      "evaluator_uid": 301,
      "name": "Readability Evaluator (LLM)",
      "description": "Uses an LLM prompt to assess readability.",
      "criteria_uid": 101,
      "type": "Prompt",
      "prompt_workflow_uid": 401,
      "parameters": null,
      "metadata": {
        "default_prompt_config": {
          "name": "Readability Prompt v1",
          "model_name": "google/gemini-1.5-pro-latest",
          "system_prompt": "You are an expert evaluator assessing text readability.",
          "user_prompt": "..."
        }
      },
      "created_at": "2025-04-01T15:00:00.987654Z",
      "updated_at": "2025-04-01T15:05:00.123123Z"
    }
  ],
  "metadata": {
    "name": "Sample Benchmark Set",
    "description": "A benchmark set including example evaluations.",
    "created_at": "2025-04-01T14:00:00.000000Z",
    "created_by": "user@example.com"
  }
}

After exporting your benchmark, you can use it to evaluate data from your GenAI application iteratively, allowing you to measure and refine your LLM system.

export_latest_execution

export_latest_execution(filepath, config=None)

Exports the latest benchmark execution with all its associated data.

This method exports the most recent benchmark execution, including all evaluation results and metadata. The exported dataset contains:

  • Benchmark metadata for the associated benchmark

  • Execution metadata for this execution

  • Each datapoint lists its evaluation scores, which include:
    • The evaluator outputs

    • Rationale

    • Agreement with ground truth

  • Each datapoint lists its slice membership(s)

  • (CSV exports only) Uploaded user columns and ground truth

The export includes all datapoints without filtering or sampling. Some datapoints may have missing evaluation scores if the benchmark was not executed against them (for example, datapoints in the test split).

Parameters

  • filepath (str): Output file path for exported data.

  • config (Union[JsonExportConfig, CsvExportConfig, None], default None): A JsonExportConfig or CsvExportConfig object. If not provided, JSON will be used by default. No additional configuration is required for JSON exports. For CSV exports, the following parameters are supported:

    • sep: The separator between columns. Default: ,.

    • quotechar: The character used to quote fields. Default: ".

    • escapechar: The character used to escape special characters. Default: \.

Return type

None

Example

Example 1

Export the latest benchmark execution to JSON:

benchmark = Benchmark.get(100)
benchmark.export_latest_execution("benchmark_execution.json")

Example 1 output

The exported JSON file contains:

{
  "benchmark_metadata": {
    "uid": 100,
    "name": "Example Benchmark",
    "description": "A benchmark for testing model performance",
    "created_at": "2025-01-01T12:00:00Z",
    "created_by": "user@example.com"
  },
  "execution_metadata": {
    "uid": 1,
    "name": "Latest Run",
    "created_at": "2025-01-01T12:00:00Z",
    "created_by": "user@example.com"
  },
  "data": [
    {
      "x_uid": "doc::0",
      "scores": [
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "RATIONALE",
          "value": "The response is clear and well-structured",
          "error": ""
        },
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "EVAL",
          "value": 0.85
        },
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "AGREEMENT",
          "value": 1.0
        }
      ],
      "slice_membership": ["test_set"]
    },
    {
      "x_uid": "doc::1",
      "scores": [
        {
          "criteria_uid": 101,
          "criteria_name": "Readability",
          "score_type": "EVAL",
          "value": 0.92
        }
      ],
      "slice_membership": ["test_set"]
    }
  ],
  "slices": [
    {
      "id": "None",
      "display_name": "All Datapoints",
      "reserved_slice_type": "global"
    },
    {
      "id": "-1",
      "display_name": "No Slice",
      "reserved_slice_type": "no_slice"
    },
    {
      "id": "5",
      "display_name": "Your Slice",
      "reserved_slice_type": "regular_slice"
    }
  ]
}
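
Example 2

Export the latest benchmark execution to CSV. This sketch assumes that CsvExportConfig is importable from the snorkelai.sdk.develop namespace and accepts the sep, quotechar, and escapechar parameters listed above; check your SDK reference for the exact import path.

from snorkelai.sdk.develop import CsvExportConfig  # import path assumed

benchmark = Benchmark.get(100)
# Semicolon-separated output; quotechar and escapechar keep their defaults.
benchmark.export_latest_execution(
    "benchmark_execution.csv",
    config=CsvExportConfig(sep=";"),
)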

get

static get(benchmark_uid)

Gets a benchmark by its unique identifier.

Parameters

  • benchmark_uid (int): The unique identifier of the benchmark from which you want to get data. The benchmark_uid is visible in the URL of the benchmark page in the Snorkel GUI. For example, https://YOUR-SNORKEL-INSTANCE/benchmarks/100/ indicates a benchmark with benchmark_uid of 100.

Returns

A Benchmark object representing the benchmark with the given benchmark_uid.

Return type

Benchmark

list

static list(workspace_uid, include_archived=False)

Lists all benchmarks for a given workspace.

Parameters

  • workspace_uid (int): The unique identifier of the workspace from which you want to list benchmarks. The workspace_uid can be retrieved using the snorkelai.sdk.client_v3.utils.get_workspace_uid() method.

  • include_archived (bool, default False): Whether to include archived benchmarks.

Returns

A list of Benchmark objects representing all benchmarks in the given workspace.

Return type

List[Benchmark]
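
Example

List the benchmarks in a workspace. This is a minimal sketch; the workspace_uid of 1 is a placeholder for the value returned by get_workspace_uid().

benchmarks = Benchmark.list(workspace_uid=1)
for benchmark in benchmarks:
    print(benchmark.benchmark_uid, benchmark.name)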

list_criteria

list_criteria(include_archived=False)

Retrieves all criteria for this benchmark.

Criteria are the key characteristics that represent the features being optimized for evaluation. Each criterion defines what aspect of the model’s performance is being measured, such as accuracy, relevance, or safety.

Each Criteria object contains:

  • criteria_uid: The unique identifier for this criterion.

  • benchmark_uid: The ID of the parent benchmark.

  • name: The name of the criteria.

  • description: A detailed description of what the criteria measures.

  • requires_rationale: Whether the criterion requires a rationale explanation.

  • label_map: A dictionary mapping user-friendly labels to numeric values.

Parameters

  • include_archived (bool, default False): Whether to include archived criteria.

Returns

A list of Criteria objects representing all criteria in this benchmark.

Return type

List[Criteria]

Example

Example 1

Get all criteria for a benchmark and list them:

benchmark = Benchmark.get(100)
criteria_list = benchmark.list_criteria()
for criteria in criteria_list:
    print(f"Criteria: {criteria.name} - {criteria.description}")

list_executions

list_executions(include_archived=False)

Retrieves all benchmark executions for this benchmark.

A benchmark execution represents a single run of a benchmark against a dataset, capturing the results and metadata of that evaluation. Executions are returned in chronological order, with the most recent execution last.

Each BenchmarkExecution object contains:

  • benchmark_uid: The ID of the parent benchmark.

  • benchmark_execution_uid: The unique identifier for this execution.

  • name: The name of the execution.

  • created_at: Timestamp when the execution was created.

  • created_by: Username of the execution creator.

  • archived: Whether the execution is archived.

After retrieving executions, you can export their results using export_latest_execution() or export the benchmark configuration using export_config(). For more information about exporting benchmarks, see Export evaluation benchmark.

Parameters

  • include_archived (bool, default False): Whether to include archived executions.

Return type

List[BenchmarkExecution]

Example

Example 1

Get all executions for a benchmark and list them:

benchmark = Benchmark.get(100)
executions = benchmark.list_executions()
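
Each returned BenchmarkExecution exposes the fields listed above; for example, to print each run's name and creation time:

for execution in executions:
    print(execution.name, execution.created_at)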

update

update(name=None, description=None, archived=None)

Updates the benchmark with the given parameters. If a parameter is not provided or is None, the existing value will be left unchanged.

Parameters

  • name (Optional[str], default None): The new name of the benchmark.

  • description (Optional[str], default None): The new description of the benchmark.

  • archived (Optional[bool], default None): Whether the benchmark should be archived.

Returns

A Benchmark object representing the updated benchmark.

Return type

Benchmark

Example

benchmark = Benchmark.get(100)
benchmark.update(name="New Name", description="New description")