
Snorkel Flow v25.1 (LTS) release notes

Snorkel Flow now uses a new versioning scheme: the version number reflects the calendar year and the sequence number of the release within that year. For example, v25.1 is the first release of 2025.

Breaking changes

SDK

  • When you upload a CSV file as a data source, Snorkel Flow automatically converts any cell with the value "None" to NaN (Not a Number). Previously, the value "None" was uploaded as the string "None".
  • Snorkel Flow now runs Pandas 2.2.3. Pandas 2.0 introduced breaking changes that now apply to Snorkel Flow. Custom operators and other user-defined functions will break if they rely on deprecated syntax that Pandas 2.0 removed. Review the Pandas 2.0.0 release notes for details. If you need help updating custom operators, contact Snorkel Support.
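
The two SDK changes above can be illustrated in plain pandas. This is a minimal sketch, not Snorkel Flow's upload code: the sample CSV and its column names are made up, and `DataFrame.append` is only one example of an API that pandas 2.0 removed.

```python
import io

import pandas as pd

# Hypothetical CSV content; "None" appears as a literal cell value.
csv_text = "id,comment\n1,None\n2,ok\n"

# Treating "None" as missing mirrors the new upload behavior.
as_missing = pd.read_csv(io.StringIO(csv_text), na_values=["None"])

# Disabling default NA parsing keeps the literal string, like the old behavior.
as_string = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# pandas 2.0 removed DataFrame.append; pd.concat is the replacement.
extra = pd.DataFrame({"id": [3], "comment": ["new"]})
combined = pd.concat([as_string, extra], ignore_index=True)
```

Code that previously compared a cell to the string "None" would need to test for missing values instead, for example with `pd.isna`.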

Deprecations

  • Python 3.8 is deprecated in favor of Python 3.9. This affects the SDK and model deployment.
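
A script or deployment entry point can guard against running on a deprecated interpreter. The helper below is a hypothetical sketch (the name `meets_minimum` is not part of the SDK) that only compares the major and minor version numbers.

```python
import sys

# Minimum interpreter version, per the deprecation note above.
MIN_PYTHON = (3, 9)


def meets_minimum(version=None):
    """Return True if the given (or running) Python version is at least 3.9."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON
```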

SDK

  • Deprecated sf.get_dataset_data; use snorkelflow.sdk.Dataset instead.
    • Deprecated sf.get_dataset_data.get_datasets(); use snorkelflow.sdk.Dataset.list() instead.
    • Deprecated sf.get_dataset_data.create_dataset(); use snorkelflow.sdk.Dataset.create() instead.
    • Deprecated sf.get_dataset_data.delete_dataset(); use snorkelflow.sdk.Dataset.delete() instead.
    • Deprecated sf.get_dataset_data.get_dataset_data(); use snorkelflow.sdk.Dataset.get_dataframe() instead.
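
To find call sites that need migrating, a simple text scan can help. The sketch below is a hypothetical helper, not part of the SDK: it matches on the deprecated function names listed above and reports the suggested replacement. It does not rewrite code, and the replacement methods may take different arguments, so check each call against the SDK reference.

```python
import re

# Old function name -> replacement, as listed in the deprecation notes above.
REPLACEMENTS = {
    "get_datasets": "snorkelflow.sdk.Dataset.list()",
    "create_dataset": "snorkelflow.sdk.Dataset.create()",
    "delete_dataset": "snorkelflow.sdk.Dataset.delete()",
    "get_dataset_data": "snorkelflow.sdk.Dataset.get_dataframe()",
}

# Match any deprecated name that is immediately called.
_CALL = re.compile(r"\b(" + "|".join(REPLACEMENTS) + r")\s*\(")


def find_deprecated_calls(source):
    """Return (line_number, old_name, replacement) for each deprecated call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _CALL.finditer(line):
            name = match.group(1)
            hits.append((lineno, name, REPLACEMENTS[name]))
    return hits
```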

Upcoming SDK deprecation

  • In a future release, MinIO will be deprecated for image uploads. Use sf.upload_dir or sf.upload_file instead.

Features and improvements

Integrations

  • AWS Bedrock Claude is now available as a model provider in the Foundation Model (FM) suite, with support for Claude 3.5 Sonnet v2 (model ID: anthropic.claude-3-5-sonnet-20241022-v2:0).
  • Added support for k3d NVIDIA CUDA.

Application setup

  • A new tooltip explains the Lab and Standard licenses during app creation.

Annotation

  • When doing multi-schema audio annotation, the audio player now supports 0.1-second increments for precise playback, pausing, and navigation. This is accompanied by an updated progress bar and waveform visualization.

  • Snorkel Flow now supports uploading a labeled ground truth dataset for sequence tagging applications as a CSV file. The user interface supports column mapping during the upload process.

  • The Develop prompt beta workflow is now better integrated with the annotation workflow. You can create batches for SME annotation, and view ground truth annotations provided by SMEs for each data point, directly from the Develop prompt page. For more details, see Create prompt development workflow.

    To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.

  • Sequence tagging applications now represent information at the word level, rather than the character level. This aligns with the most common text extraction use cases. Word-level representation applies to model predictions, analysis metrics, and ground truth. If you upload a ground truth dataset that has character-level annotations, Snorkel Flow will transform them to the word level.
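
To make the character-to-word transformation concrete, here is a minimal sketch of one way it could work. It is not Snorkel Flow's implementation. It assumes that words are whitespace-delimited tokens and that a character span expands to cover every word it overlaps; the function name and span format are hypothetical.

```python
import re
from typing import List, Tuple

CharSpan = Tuple[int, int, str]  # (char_start, char_end_exclusive, label)
WordSpan = Tuple[int, int, str]  # (word_start, word_end_exclusive, label)


def to_word_level(text: str, spans: List[CharSpan]) -> List[WordSpan]:
    """Snap character-level spans to the whitespace-delimited words they overlap."""
    # Character offsets of every whitespace-delimited word in the text.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    result = []
    for start, end, label in spans:
        # Indices of words that share at least one character with the span.
        hits = [i for i, (ws, we) in enumerate(words) if ws < end and start < we]
        if hits:
            result.append((hits[0], hits[-1] + 1, label))
    return result
```

Under these assumptions, a span that starts or ends in the middle of a word grows to include the whole word, and a span that covers only whitespace is dropped.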

Data development

  • min_per_class now supports resampling word-based PDF applications.

Prompt development

  • Prompt development now supports favorites and custom names for prompt versions.
  • LLM responses are now streamed, so you can view responses as each finishes, rather than waiting for the entire run to complete.
  • Prompt job progress is now displayed.
  • The page now functions correctly when there is no data.
  • Prompt development now supports a table view for records.

Evaluation

  • The Evaluate page now lets you edit slice membership. Slices group LLM-generated responses by topic so that you can evaluate your model's performance on each topic separately. You can now manually add data points to slices, or remove them, as you review an evaluation report.

SDK

  • Upgraded Pandas to 2.2.3.
  • Added a workspace-scoped path to sf.download_remote_object.
  • open_file is now workspace-scoped.
  • LabelSchema has an improved description.

Bug Fixes

  • Quality-of-life updates to app navigation and display.
  • Fixed the back button on the Jobs page.
  • The Pipeline (DAG) now supports multiple output node IDs when creating a node.

Data upload

  • Fixed a data upload bug that occurred when splitting data by a defined percentage to create data sources.

Annotation

  • Review mode now displays the correct labels.
  • Fixed an error that occurred when loading a ranking dataset view in annotation.

Data development

  • Removed duplicate data from the table view in the Develop (Studio) page.
  • False-equivalent values like "0" and "No Data" now render correctly.
  • Fixed a bug that prevented the Data Drag Explorer for Embeddings from appearing.
  • Exporting data from the Develop (Studio) page now functions correctly when column names contain the '#' character.
  • Prevented duplicate labels in a dataset label schema.
  • The View incorrect filter now excludes correct output from foundation models.
  • Slice-based analysis no longer shows negative numbers and other incorrect percentages.
  • Selecting multiple documents now shows only slices associated with all selected documents.
  • You can now add selected documents to a slice from the cluster data explorer.
  • Fixed an issue with negated labeling functions for PDFs.
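
The multi-document slice behavior described above amounts to a set intersection. The sketch below illustrates that logic only. The function name and the document-to-slices mapping are hypothetical and do not reflect how Snorkel Flow stores slice membership.

```python
def shared_slices(slices_by_doc, selected):
    """Return the slices that every selected document belongs to."""
    sets = [set(slices_by_doc.get(doc, ())) for doc in selected]
    if not sets:
        # Nothing selected, so there are no slices to show.
        return set()
    return set.intersection(*sets)
```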

Prompt development

  • Prompts now load correctly.
  • SME feedback now appears on pages past the first page.
  • Removed the error for empty SME feedback.
  • Fixed a bug with switching the current prompt to the compared prompt in the comparison view.
  • Comparison view now functions when there's only one prompt version.
  • The response and user data columns no longer overlap.
  • The prompt job now finishes before the responses endpoint is called.
  • The Snorkel Flow ID is no longer displayed as a prompt column.

Model development

  • The metric plot values for Model comparison now show the percent value.
  • Excluded non-deployable LFs from being used as features for AutoML.
  • PDF model metrics are now more accurate after excluding unknown and negative classes.
  • When ground truth is negative for PDF models, the data studio now correctly shows No data found.

Evaluation

  • The Evaluation page now functions properly even if no models exist.

SDK

  • Fixed a bug with the get_preprocessing_issues function.
  • sf.align_external_ground_truth now functions for filtered text.

Notebooks and templates

  • Enabled JupyterHub to run in strict mTLS mode with Istio.
  • Removed hardcoded values for network policy in the Helm template.

Known Issues

  • The file download endpoint (/api/download) lets users read arbitrary files.

Application setup

  • Copying an application may cause a 500 error.
  • Copying a PDF application may not maintain the dev split.
  • Creating a new PDF application from a template fails at initialization.

Data upload

  • Labels added from the Onboarding page are not visible in the Develop (Studio) page.
  • Large uploads of CSV data to Amazon S3 may cause a server error.
  • Newly created datasets show Create from Template. They should show Create Application.
  • Snorkel Flow does not generate embeddings for new data sources activated from the Application page.

Annotation

  • In a sequence tagging application, you may receive a label validation error even if a label was successfully applied.
  • In a sequence tagging application, selecting overlapping text for different labels causes the wrong span to be selected.
  • Opening metadata breaks up highlighting in Studio.
  • You cannot switch labels from one schema to another.
  • In sequence tagging applications, the numbers of entries in the confusion and clarity matrices don't match.
  • Exporting a dataset annotation can cause a 500 error.
  • An annotation batch filter can produce TypeError: Cannot interpret 'string[python]' as a data type.
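
The TypeError above resembles what NumPy raises when it is handed a pandas string extension dtype. Whether that is the cause inside Snorkel Flow is an assumption. The sketch below only shows, in plain pandas and NumPy, how a column with the `string` dtype can be cast to `object` so that NumPy can interpret it. The helper name is hypothetical.

```python
import numpy as np
import pandas as pd

# A column using the pandas string extension dtype.
labels = pd.Series(["spam", "ham"], dtype="string")


def numpy_compatible(series):
    """Return the series unchanged if NumPy understands its dtype, else cast to object."""
    try:
        np.dtype(series.dtype)
        return series
    except TypeError:
        # Extension dtypes such as "string" are not valid NumPy dtypes.
        return series.astype(object)
```

Casting affected columns this way before upload may avoid the error, but treat it as a workaround to try rather than a confirmed fix.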

Data development

  • It's possible to create a dataset view with a mismatched dataset and label schema.
  • If you select Next more than once when saving changes to a label schema, you will get stuck in an animation loop.
  • Studio /dataset and /advanced-lf-state error out with a cryptic error message when there is no span.
  • When a labeling function in the builder is canceled before being applied, the highlights are not removed from the document.
  • Snorkel Flow does not generate its own labels for the new PDF workflow.
  • The new PDF workflow does not support negative filters.
  • Populator saves two different sets of arrow files at the same path when applying.
  • Negative ground truth labels do not override labeling function labels.

Model development

  • Deleting a custom model can cause a 500 error.
  • Modeling fails if columns are categorical.
  • Model training does not stop and display an error if any of the vectorizers is not fit.

Evaluation

  • In a populated evaluation report, the user who created the report is the requesting user, rather than the original user from the populated app.
  • The evaluation report data popup may be empty.