Snorkel Flow v25.3 (STS) release notes
Introducing the GenAI Evaluation framework
Snorkel version 25.3 introduces our comprehensive GenAI Evaluation framework, a solution that lets organizations systematically measure, evaluate, and improve their AI-generated outputs. This enterprise-ready workflow addresses the inherent challenges of evaluating GenAI models, whose outputs are varied and context-dependent and require specialized assessment approaches.
Our evaluation framework enables organizations to:
- Define custom evaluation criteria with detailed attributes like name, description, and assigned evaluators.
- Leverage out-of-the-box evaluators for responses and retrieval.
- Design and develop LLM-as-a-judge (LLMAJ) evaluators through an intuitive prompting workflow.
- Generate ground truth by assigning annotation batches to human evaluators (SMEs).
- Support trace-formatted datasets from LangChain and other JSON-based frameworks.
- Filter evaluation data by dataset slice and ground truth for targeted analysis.
- Track progress through aggregate criteria metrics across benchmark iterations.
For an in-depth introduction, start with the Evaluation overview.
With seamless navigation between benchmark creation and evaluator development interfaces, organizations can test and track improvements to their AI models, making them reliable for enterprise applications. The framework starts with smart defaults for quick initial evaluation while supporting increasingly powerful and customized benchmarks as needs evolve.
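As an illustration of the first step in this workflow, a custom criterion bundles a name, a description, and an assigned evaluator. The sketch below is hypothetical; the field names are illustrative only and do not reflect Snorkel Flow's exact schema.

```python
# Hypothetical sketch of a custom evaluation criterion. Field names are
# illustrative only and do not reflect Snorkel Flow's exact schema.
completeness_criterion = {
    "name": "Completeness",
    "description": "Does the response fully address every part of the user's request?",
    "evaluator": "llm_as_judge",  # could instead be an out-of-the-box evaluator or a human (SME) evaluator
    "labels": ["complete", "partially complete", "incomplete"],
}
```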
Features and improvements
Integrations
- Added support for Claude 3.7 Sonnet via AWS Bedrock.
Data upload
- You can import datasets from external AI systems, with column mapping, for traces, benchmarks, and prompts.
- Snorkel now supports trace-formatted datasets (LangChain, JSON).
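For reference, a trace-formatted record captures each step of a multi-step GenAI application together with step-level metadata. The record below is a simplified, hypothetical example; the exact columns Snorkel expects are defined through column mapping at import time.

```python
# Illustrative only: a simplified trace record for a two-step RAG application.
# The schema Snorkel Flow expects is configured via column mapping at import time.
trace_record = {
    "trace_id": "trace-0001",
    "input": "What is our refund policy for annual plans?",
    "output": "Annual plans can be refunded within 30 days of purchase.",
    "steps": [
        {"name": "retrieve", "output": ["refund-policy.md#annual-plans"], "latency_ms": 120},
        {
            "name": "generate",
            "output": "Annual plans can be refunded within 30 days of purchase.",
            "latency_ms": 850,
            "tokens": {"prompt": 412, "completion": 37},
        },
    ],
    "started_at": "2025-03-01T12:00:00Z",
}
```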
Annotation
- You can create annotation batches from the Benchmarks page, using criteria-based label schemas.
- Annotators can provide criteria rationales using free-text fields.
- Annotators can annotate traces for evaluation, using benchmark criteria.
- Sequence tagging and extraction are enabled for working with traces.
- Sequence tagging now targets whole words instead of partial words or individual characters.
Data development
- When you select a span, you can now see which labeling functions (LFs) voted on it.
- The labeling function view now also displays LF votes at the span level.
Prompt development
- Prompt development is now generally available, following the successful completion of the beta phase.
- Prompt development now supports LLM-as-a-judge (LLMAJ) prompts for GenAI Evaluation. See Create LLMAJ prompt for details; an example prompt sketch follows this list.
- You can see which prompt version is saved as the LLMAJ evaluator with a visual bookmark indicator.
- The prompt dataset size limit is now 420MB.
- You can view benchmark results, including criteria scores and aggregate metrics, when developing an LLMAJ prompt.
- You can filter data by slice and by ground truth when developing prompts.
- You can cancel an in-progress prompt run.
- You can now browse and search your collection of prompts.
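To make the LLMAJ prompting workflow concrete, the template below sketches the general shape of a judge prompt: the criterion's definition is embedded in the instructions, and the model is asked to return a label and a rationale. This is a generic, hypothetical example, not the template Snorkel Flow generates.

```python
# Generic, hypothetical LLM-as-a-judge prompt template; Snorkel Flow's
# built-in templates and placeholder names may differ.
LLMAJ_PROMPT_TEMPLATE = """\
You are evaluating an AI assistant's response against the criterion below.

Criterion: {criterion_name}
Definition: {criterion_description}

User request:
{instruction}

Assistant response:
{response}

Return a JSON object with two keys:
  "label": one of {allowed_labels}
  "rationale": a one-sentence explanation of your judgment
"""

prompt = LLMAJ_PROMPT_TEMPLATE.format(
    criterion_name="Completeness",
    criterion_description="Does the response fully address every part of the user's request?",
    instruction="Summarize the attached contract and list all termination clauses.",
    response="The contract covers a two-year engagement...",
    allowed_labels=["complete", "partially complete", "incomplete"],
)
```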
Evaluation
- Introduced the ability to create custom criteria with detailed attributes such as name, description, and evaluator.
- Introduced the ability to design and assign LLMAJ evaluators.
- Included out-of-the-box response evaluators for criteria such as correctness, completeness, and safety.
- Included out-of-the-box RAG evaluators and metrics for context relevance, faithfulness, and recall; a recall sketch follows this list.
- Added insight into multi-step GenAI applications by letting you view entire traces, including all steps and trace-level metadata such as timestamps, token metrics, and latency.
- You can export evaluation data with annotations and evaluator outputs.
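For context, the sketch below shows one common way recall is defined for RAG evaluation: the fraction of ground-truth relevant passages that appear in the retrieved context. It is an illustration of the concept only; Snorkel Flow's out-of-the-box metrics may be computed differently (for example, with an LLM-based judgment).

```python
def context_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """Fraction of ground-truth relevant passages that were retrieved.

    Illustration only; Snorkel Flow's out-of-the-box RAG metrics may use a
    different (e.g., LLM-based) formulation.
    """
    if not relevant_ids:
        return 1.0  # nothing to recall; treat as perfect by convention
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)


# Two of the three relevant passages were retrieved -> recall ~= 0.67
print(round(context_recall({"p1", "p2", "p4"}, {"p1", "p2", "p3"}), 2))
```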
Infrastructure
- Workspace-level file isolation: All newly uploaded structured and unstructured data files (PDFs, CSVs, Parquet, etc.) are now read-only and accessible only within the workspace where they were uploaded. Cross-workspace access is no longer allowed, improving data isolation and security. For help migrating existing files, contact your Snorkel representative.
- Hardened runtime environments: Containers can now run with read-only root filesystems, reducing the surface area for potential vulnerabilities and further strengthening workspace isolation.
Bug fixes
Data upload
- Ground truth upload failures now have better error reporting.
- The ground truth upload dialog now displays the correct information.
Annotation
- Column sorting now matches the column order in the data source, rather than defaulting to alphabetical sorting.
User interface
- When all filters are removed while scrolling through filtered data, the view now stays on the current data point instead of jumping back to the first one.
Known issues
Data upload
- Uploading a dataset for auto-splitting requires uploading twice.
- Uploading large CSV files can surface unrelated errors.
- Uploading large CSV files to S3 can generate an internal server error.
- New data sources do not have embeddings generated if that feature is not activated.
- The dataset size shown in the GUI may differ from the actual file size.
- Downloading PDFs via an HTTPS URL fails.
- A populator can save two different sets of Arrow files to the same path.
Application setup
- A new application cannot be created if an application with the same name already exists.
- App creation does not show an error message when data sources have inconsistent schemas.
- App creation can fail when creating an application with scanned PDFs in an air-gapped environment.
- When copying an application, the dev split is not maintained for PDFs.
- When copying an application, you may receive a server error.
Annotation
- Exporting dataset annotations can fail.
- The reviewer workflow fails for overlapping spans in sequence tagging.
- For PDFs, the annotation filter for negative ground truth doesn't work.
- Snorkel-generated labels are not updated for word-based PDF extraction.
Data development
- A dataset batch filter can produce a type error.
- You can create a dataset view with an unattached label schema.
- You cannot switch labels from one label schema to another.
- The option to override LF labels with ground truth (GT) labels, where available, does not override LF labels with negative GT labels.
- If there is no span in certain views, you may see a cryptic error message.
- The confusion matrix and clarity matrix numbers don't add up for sequence tagging applications.
- A server error may occur when creating a labeling function.
- You may receive an error that certain data points are not in the index after resampling a dev split.
Prompt development
- Text input for prompts changes the viewport position.
- Generating a response in comparison view has a broken transition because the loading state is not shown.
Model development
- Model training does not show an error when one of the vectorizers is not fit.
- Modeling fails if columns are categorical.
- You may see an error when deleting custom models.
Evaluation
- The evaluation data popup does not show any data.
- The default values for ordinal criteria are editable, but the UI doesn't indicate that they are.
- In a populated evaluation report, the user shown as the report creator is the requesting user, rather than the original user from the populated app.
- Multi-trace dictionary values for trace steps have limitations with int, float, and bool types.
SDK
- The aggregate_annotations() SDK method fails.