Export evaluation benchmark
After refining your benchmark to align with your business objectives, you can export it for continuous use in your evaluation workflow. This page explains how to export your benchmark configuration so you can integrate it with other systems or place it under version control.
(SDK) Export benchmark configuration
You can export your benchmark configuration as a JSON file that contains all the criteria, evaluators, and metadata. This allows you to:
- Version control your benchmark definitions
- Share benchmarks across teams
- Integrate with CI/CD pipelines (see the sketch after this list)
- Back up your evaluation configurations
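As a minimal sketch of the CI/CD integration point, the script below loads a freshly exported configuration and checks that the copy committed to version control still matches it. The file paths and the drift-check policy are assumptions for illustration, not part of the SDK.

import json
import sys

# Hypothetical paths: the freshly exported config and the copy tracked in version control.
EXPORTED_PATH = "benchmark_export.json"
COMMITTED_PATH = "benchmarks/benchmark_config.json"


def load_config(path: str) -> dict:
    """Load an exported benchmark configuration from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def main() -> int:
    exported = load_config(EXPORTED_PATH)
    committed = load_config(COMMITTED_PATH)

    # Compare the criteria and evaluators sections of the two exports.
    for key in ("criteria", "evaluators"):
        if exported.get(key) != committed.get(key):
            print(f"Benchmark drift detected in '{key}'. Re-export and commit the new config.")
            return 1

    print("Committed benchmark config is up to date.")
    return 0


if __name__ == "__main__":
    sys.exit(main())

Running this as a CI step fails the build whenever the benchmark definition changes without a corresponding update to the committed file.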
Here's the SDK function for benchmark export:
@classmethod
def export_config(
    cls,
    benchmark_uid: int,
    filepath: str,
    format: BenchmarkExportFormat = BenchmarkExportFormat.JSON,
) -> None:
    """Export the benchmark configuration in the specified format and write it to the provided filepath.

    Parameters
    ----------
    benchmark_uid : int
        The unique identifier of the benchmark.
    filepath : str
        The filepath to write the exported config to.
    format : BenchmarkExportFormat
        The format to export the config to. Currently, only JSON is supported.

    Returns
    -------
    None

    Examples
    --------
    >>> Benchmark.export_config(123, "benchmark_config.json")
    """
Here's what it looks like to use this function in the SDK:
from snorkelflow.sdk.benchmarks import Benchmark

benchmark_uid = 123
export_path = "benchmark_export.json"

Benchmark.export_config(benchmark_uid, export_path)
The exported JSON includes detailed information about each criterion and evaluator, including their parameters, prompts, and metadata. Here's a sample of the exported output:
{
"criteria": [
{
"criteria_uid": 101, // Example integer ID
"benchmark_uid": 50, // Example integer ID
"name": "Example Readability", // Example name
"description": "Evaluates how easy the response is to read and understand.", // Example description
"state": "Active", // Common state value
"output_format": {
"metric_label_schema_uid": 201, // Example integer ID
"rationale_label_schema_uid": 202 // Example integer ID (could be null too)
},
"metadata": {
"version": "1.0" // Example metadata
},
"created_at": "2025-04-01T14:30:00.123456Z", // Example timestamp
"updated_at": "2025-04-01T14:35:10.654321Z" // Example timestamp
}
// Add more criteria objects here if needed
],
"evaluators": [
{
"evaluator_uid": 301, // Example integer ID
"name": "Readability Evaluator (LLM)", // Example name linked to criteria
"description": "Uses an LLM prompt to assess readability.", // Example description
"criteria_uid": 101, // Linking to the criteria above
"type": "Prompt", // Common type value
"prompt_workflow_uid": 401, // Example integer ID
"parameters": null, // Often null in source data
"metadata": {
"default_prompt_config": {
"name": "Readability Prompt v1", // Example config name
"model_name": "google/gemini-1.5-pro-latest", // Example model
"system_prompt": "You are an expert evaluator assessing text readability.", // Example system prompt
"user_prompt": "####Task\nEvaluate the readability of the Response.\n\n####Evaluation Guidelines:\n1. **Clarity**: Is the language clear and concise?\n2. **Structure**: Is the response well-organized?\n3. **Complexity**: Is the vocabulary and sentence structure appropriate?\n4. **Score**: Assign a score from 0 (very hard to read) to 1 (very easy to read).\n5. **Rationale**: Explain your score briefly.\n\n#### Formatting Requirements\nOutput strictly JSON:\n```json\n{\n \"score\": <score between 0 and 1>,\n \"rationale\": \"Your rationale here.\"\n}\n```\n\n#### Inputs\nResponse:\n{response}\n\nNow, evaluate." // Example user prompt using placeholders
}
},
"created_at": "2025-04-01T15:00:00.987654Z", // Example timestamp
"updated_at": "2025-04-01T15:05:00.123123Z" // Example timestamp
}
// Add more evaluator objects here if needed
],
"meta": {
"name": "Sample Benchmark Set", // Example name
"description": "A benchmark set including example evaluations.", // Example description
"created_at": "2025-04-01T14:00:00.000000Z", // Example timestamp
"created_by": "example_creator" // Example creator username
}
}
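Because the export is plain JSON, you can also inspect it programmatically. The sketch below assumes the structure shown above (top-level criteria, evaluators, and meta keys) and prints each criterion with the evaluators linked to it through criteria_uid; the file name comes from the earlier export example.

import json
from collections import defaultdict

# Load the exported benchmark configuration (path from the export example above).
with open("benchmark_export.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Group evaluators by the criterion they are linked to via criteria_uid.
evaluators_by_criterion = defaultdict(list)
for evaluator in config["evaluators"]:
    evaluators_by_criterion[evaluator["criteria_uid"]].append(evaluator["name"])

# Print a short summary of the benchmark and its criteria.
print(f"Benchmark: {config['meta']['name']} (created by {config['meta']['created_by']})")
for criterion in config["criteria"]:
    linked = evaluators_by_criterion.get(criterion["criteria_uid"], [])
    print(f"- {criterion['name']} [{criterion['state']}]: {', '.join(linked) or 'no evaluators'}")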
Next steps
After exporting your benchmark, you can use it to evaluate data from your GenAI application iteratively, allowing you to measure and refine your LLM system.