Snorkel Flow v25.1 (LTS) release notes

Snorkel Flow now uses a new versioning scheme. The version number reflects the calendar year and the sequence number of the release within that year. For example, v25.1 is the first release of 2025.

Breaking changes

SDK

When you upload a CSV file as a data source, Snorkel Flow automatically converts any cell with the value "None" to NaN (Not a Number). Previously, the value "None" was uploaded as the string "None".
Snorkel Flow now runs Pandas 2.2.3. Pandas 2.0 introduced breaking changes that now apply to Snorkel Flow. Custom operators and other user-defined functions may break if they rely on Pandas syntax that was deprecated before 2.0 and has since been removed. Review the Pandas 2.0.0 release notes for details. If you need help updating custom operators, contact Snorkel Support.
Automatic type conversion for JSON-like strings in CSVs: With the upgrade to Pandas 2.2.3, CSV cells that contain string representations of JSON objects (e.g., "{'key': 'value'}") or lists (e.g., '[1,2,3]') may now be parsed by Pandas into Python dict or list objects. Previously, these values were generally preserved as strings. The change in data type can cause unexpected behavior or errors in existing data processing pipelines and models that expect these fields to be strings (e.g., a TypeError related to schema mismatch: failed to convert partition… expected bytes? Got a dict object). If your pipeline depends on string values, explicitly convert these fields back to strings after loading the data.
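
As an illustration of the last item, the following is a minimal sketch of converting dict and list cells back to strings after loading. It uses only plain Pandas and the standard library; the function name stringify_structured_cells and the sample column names are hypothetical and are not part of the Snorkel Flow SDK. The sample DataFrame is built by hand to mimic the parsed result.

```python
import json

import pandas as pd


def stringify_structured_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df in which dict and list cells are JSON-encoded strings.

    Hypothetical helper for illustration; not part of the Snorkel Flow SDK.
    """
    out = df.copy()
    for col in out.columns:
        # dict and list objects can only live in object-dtype columns.
        if out[col].dtype != object:
            continue
        out[col] = out[col].map(
            lambda v: json.dumps(v) if isinstance(v, (dict, list)) else v
        )
    return out


# Hand-built frame that mimics cells already parsed into Python objects.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "meta": [{"key": "value"}, [1, 2, 3]],
        "text": ["a", "b"],
    }
)
clean = stringify_structured_cells(df)
print(clean["meta"].tolist())
```

JSON encoding uses double quotes and normalized spacing, so the resulting strings may not be byte-identical to the original CSV cells. If downstream code compares against the exact original text, adapt the encoding step accordingly.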

Deprecations

Python 3.8 is deprecated in favor of Python 3.9. This affects the SDK and model deployment.
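
A notebook or deployment script can guard against an unsupported interpreter before it imports the SDK. The sketch below assumes only the minimum version stated above; the helper name python_is_supported is hypothetical.

```python
import sys

# Minimum interpreter version, taken from the deprecation note above.
MIN_SUPPORTED = (3, 9)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the major.minor version meets the minimum (hypothetical helper)."""
    return tuple(version_info[:2]) >= MIN_SUPPORTED


if not python_is_supported():
    print("Warning: this Python version is deprecated; upgrade to 3.9 or later.")
```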

SDK

Deprecated sf.get_dataset_data; use snorkelflow.sdk.Dataset instead.
- Deprecated sf.get_dataset_data.get_datasets(); use snorkelflow.sdk.Dataset.list() instead.
- Deprecated sf.get_dataset_data.create_dataset(); use snorkelflow.sdk.Dataset.create() instead.
- Deprecated sf.get_dataset_data.delete_dataset(); use snorkelflow.sdk.Dataset.delete() instead.
- Deprecated sf.get_dataset_data.get_dataset_data(); use snorkelflow.sdk.Dataset.get_dataframe() instead.
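
To find affected code, one option is a simple text scan of notebooks or scripts for the deprecated function names. The sketch below is an assumption-laden helper, not an SDK feature: the mapping is copied from the list above, the function find_deprecated_calls and the sample snippet are hypothetical, and a text match can produce false positives (for example, an unrelated object with a method of the same name).

```python
import re

# Deprecated function name -> replacement, copied from the list above.
REPLACEMENTS = {
    "get_datasets": "snorkelflow.sdk.Dataset.list()",
    "create_dataset": "snorkelflow.sdk.Dataset.create()",
    "delete_dataset": "snorkelflow.sdk.Dataset.delete()",
    "get_dataset_data": "snorkelflow.sdk.Dataset.get_dataframe()",
}

# Match any deprecated name that is immediately called, e.g. "get_datasets(".
_CALL = re.compile(r"\b(" + "|".join(REPLACEMENTS) + r")\s*\(")


def find_deprecated_calls(source: str):
    """Return (line_number, deprecated_name, suggested_replacement) tuples."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _CALL.finditer(line):
            name = match.group(1)
            hits.append((lineno, name, REPLACEMENTS[name]))
    return hits


# Hypothetical user code to scan; the argument value is made up.
sample = (
    "datasets = sf.get_datasets()\n"
    "print('ok')\n"
    "df = sf.get_dataset_data('my-dataset')\n"
)
hits = find_deprecated_calls(sample)
for lineno, name, replacement in hits:
    print(f"line {lineno}: {name}() is deprecated; use {replacement}")
```

The scan only reports locations. Argument lists may differ between the old and new calls, so check the SDK reference before you rewrite each call.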

Upcoming SDK deprecation

In a future release, MinIO will be deprecated for image uploads. Use sf.upload_dir or sf.upload_file instead.

Features and improvements

Integrations

AWS Bedrock Claude is now available as a model provider in the Foundation Model (FM) suite, with support for Claude 3.5 Sonnet v2 (model ID: anthropic.claude-3-5-sonnet-20241022-v2:0).
Added support for k3d NVIDIA CUDA.

Application setup

A new tooltip explains the Lab and Standard licenses during app creation.

Annotation

When doing multi-schema audio annotation, the audio player now supports 0.1-second increments for precise playback, pausing, and navigation. This is accompanied by an updated progress bar and waveform visualization.
Snorkel Flow now supports uploading a labeled ground truth dataset for sequence tagging applications as a CSV file. The user interface supports column mapping during the upload process.
The Develop prompt beta workflow is now better integrated with the annotation workflow. You can create batches for SME annotation, and view ground truth annotations provided by SMEs for each data point, directly from the Develop prompt page. For more details, see Create prompt development workflow.

To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.
Sequence tagging applications now represent information at the word level, rather than the character level. This aligns with the most common text extraction use cases. Word-level representation applies to model predictions, analysis metrics, and ground truth. If you upload a ground truth dataset that has character-level annotations, Snorkel Flow will transform them to the word level.
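
These notes do not describe how Snorkel Flow performs the character-to-word transformation. To give a rough idea of what such a conversion involves, the sketch below makes two assumptions: words are runs of non-whitespace characters, and a character span is expanded to every word it overlaps. The function name char_spans_to_word_spans and the sample data are hypothetical.

```python
import re
from typing import List, Tuple

CharSpan = Tuple[int, int, str]  # (char_start, char_end_exclusive, label)
WordSpan = Tuple[int, int, str]  # (first_word_index, last_word_index, label)


def char_spans_to_word_spans(text: str, spans: List[CharSpan]) -> List[WordSpan]:
    """Expand each character span to the words it overlaps (illustrative only)."""
    # Assumption: a word is any maximal run of non-whitespace characters.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    result = []
    for start, end, label in spans:
        covered = [
            i for i, (w_start, w_end) in enumerate(words)
            if w_start < end and w_end > start
        ]
        if covered:  # spans that touch only whitespace are dropped
            result.append((covered[0], covered[-1], label))
    return result


text = "Acme Corp hired Jane Doe"
# The second span starts and ends mid-word ("ane Do").
char_spans = [(0, 9, "ORG"), (17, 23, "PERSON")]
word_spans = char_spans_to_word_spans(text, char_spans)
print(word_spans)  # [(0, 1, 'ORG'), (3, 4, 'PERSON')]
```

With expand-on-overlap, a span that covers only part of a word ends up labeling the whole word, so word-level ground truth can differ slightly from the character-level annotations it came from.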

Data development

min_per_class now supports resampling word-based PDF applications.

Prompt development

Prompt development now supports favorites and custom names for prompt versions.
LLM responses are now streamed, so you can view each response as it finishes rather than waiting for the entire run to complete.
Prompt job progress is now displayed.
The page now functions correctly when there is no data.
Prompt development now supports a table view for records.

Evaluation

The Evaluate page now lets you edit slice membership. Slices group LLM-generated responses into topics so you can evaluate your current model's performance on each topic separately. With this feature, you can manually add data points to slices or remove them as you review an evaluation report.

SDK

Upgraded Pandas to 2.2.3.
Added a workspace-scoped path to sf.download_remote_object.
open_file is now workspace-scoped.
LabelSchema has an improved description.

Bug fixes

Quality-of-life updates to app navigation and display.
Fixed the back button on the Jobs page.
The Pipeline (DAG) now supports multiple output node IDs when creating a node.

Data upload

Fixed a bug during data upload with splitting data by a defined percentage to create data sources.

Annotation

Review mode now displays the correct labels.
Fixed an error during ranking dataset view loading in annotation.

Data development

Removed duplicate data from the table view in the Develop (Studio) page.
False-equivalent values like "0" and "No Data" now render correctly.
Fixed a bug with the Data Drag Explorer for Embeddings not appearing.
Exporting data from the Develop (Studio) page now functions correctly when column names contain the '#' character.
Prevented duplicate labels in a dataset label schema.
The View incorrect filter now excludes correct output from foundation models.
Slice-based analysis no longer shows negative numbers and other incorrect percentages.
Selecting multiple documents now shows only slices associated with all selected documents.
You can now add selected documents to a slice from the cluster data explorer.
Fixed an issue with negated labeling functions for PDFs.

Prompt development

Prompts now load correctly.
SME feedback now appears on pages past the first page.
Removed the error for empty SME feedback.
Fixed a bug with switching the current prompt to the compared prompt in the comparison view.
Comparison view now functions when there's only one prompt version.
The response and user data columns no longer overlap.
The prompt job now finishes before the responses endpoint is called.
The Snorkel Flow ID is no longer displayed as a prompt column.

Model development

The metric plot values for Model comparison now show the percent value.
Excluded non-deployable LFs from being used as features for AutoML.
PDF model metrics are now more accurate after excluding unknown and negative classes.
When ground truth is negative for PDF models, the data studio now correctly shows No data found.

Evaluation

The Evaluation page now functions properly even if no models exist.

SDK

Fixed a bug with the get_preprocessing_issues function.
sf.align_external_ground_truth now functions for filtered text.

Notebooks and templates

Enabled JupyterHub to run in strict mTLS mode with Istio.
Removed hardcoded values for network policy in the Helm template.

Known issues

The file download endpoint (/api/download) lets users read arbitrary files.

Application setup

Copying an application may cause a 500 error.
Copying a PDF application may not maintain the dev split.
Creating a new PDF application from a template fails at initialization.

Data upload

Labels added from the Onboarding page are not visible in the Develop (Studio) page.
Large uploads of CSV data to Amazon S3 may cause a server error.
Newly created datasets show Create from Template. They should show Create Application.
Snorkel Flow does not generate embeddings for new data sources activated from the Application page.

Annotation

In a sequence tagging application, you may receive a label validation error even if a label was successfully applied.
In a sequence tagging application, selecting overlapping text for different labels causes the wrong span to be selected.
Opening metadata breaks up highlighting in Studio.
Cannot switch labels from one schema to another.
In sequence tagging applications, the number of entries in the confusion matrix does not match the number in the clarity matrix.
Exporting a dataset annotation can cause a 500 error.
An annotation batch filter can produce TypeError: Cannot interpret 'string[python]' as a data type.

Data development

It's possible to create a dataset view with a mismatched dataset and label schema.
If you select Next more than once when saving changes to a label schema, you will get stuck in an animation loop.
Studio /dataset and /advanced-lf-state error out with a cryptic error message when there is no span.
When a labeling function in the builder is canceled before being applied, the highlights are not removed from the document.
Snorkel Flow does not generate its own labels for the new PDF workflow.
The new PDF workflow does not support negative filters.
Populator saves two different sets of arrow files at the same path when applying.
Negative ground truth labels do not override labeling function labels.

Model development

Deleting a custom model can cause a 500 error.
Modeling fails if columns are categorical.
Model training does not stop and display an error if any of the vectorizers has not been fit.

Evaluation

In a populated evaluation report, the user shown as the creator of the report is the requesting user, rather than the original user from the populated app.
Evaluation report data popup may be empty.
