Snorkel Flow v0.95 (LTS) release notes

Breaking changes

Label + training

The argument for labels is now keyword-only, excluding x_uids. This change might break some existing notebooks.
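As a rough illustration of what a keyword-only change means for notebook code, the sketch below uses a hypothetical function, `add_annotations`, which is not a real SDK name. It assumes the note means that `x_uids` may still be passed positionally while `labels` must be passed by keyword.

```python
# Hypothetical stand-in for an SDK function. The function name and its
# behavior are assumptions for illustration; only the parameter names
# x_uids and labels come from the release note.
def add_annotations(x_uids, *, labels):
    """Pair each x_uid with its label; labels must be passed by keyword."""
    if len(x_uids) != len(labels):
        raise ValueError("x_uids and labels must have the same length")
    return dict(zip(x_uids, labels))


# Still valid: x_uids positionally, labels by keyword.
result = add_annotations(["doc::0", "doc::1"], labels=[1, 0])

# No longer valid: passing labels positionally raises TypeError.
try:
    add_annotations(["doc::0", "doc::1"], [1, 0])
    positional_call_failed = False
except TypeError:
    positional_call_failed = True
```

Notebooks that pass labels positionally would need the argument name added at each call site.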

Machine learning: PDF

Incremented PARSER2_VERSION to 1: PDFToRichDocParser2 now parses PDFs differently. Existing PDFToRichDocParser2 instances will continue using parser_version=0 unless manually updated to parser_version=1. Changing to parser_version=1 will make the model node stale and may cause ground truth labels to be lost when you refresh the Directed Acyclic Graph (DAG).

SDK

Removed deprecated SDK functions. This primarily affects all functions in the exports submodule (also known as deployments). Please use snorkelflow.sdk.MLflowDeployment instead.

Deprecations

Data + slices

Removed the open tag editor button.

Provisioning

Deprecated and ended support for Snorkel Flow installed on single-node VMs for on-premises and private cloud instances. Customers using a single-node VM setup should contact their Snorkel representative for guidance on migrating to a different provisioning and installation method. Read more about this decision on the Snorkel blog.

SDK

Deprecated sf.SnorkelFlowContext.from_user_config.

Upcoming SDK deprecation

Removed or replaced the following SDK MinIO file upload functions. You can replace them with new functions to be released in v0.95 LTS. The new functions upload file(s) to a workspace-scoped directory:
- [upcoming deprecation notice] upload_to_minio(); use upload_file instead
- [upcoming deprecation notice] upload_fileobj_to_minio()
- [upcoming deprecation notice] upload_files_to_minio()
- [upcoming deprecation notice] upload_dir_to_minio(); use upload_dir instead
- [new] download_file()
- [new] download_dir()
- [new] upload_dir()
- [new] upload_file()
- [new] list_dir()
Deprecated the MinIO Console as of Q4 2024. The MinIO Console will be accessible until the end of Q1 2025 to support backwards compatibility for older workflows. After Q1 2025, access to MinIO Console will be removed. Instead, use the new Files feature for uploading PDFs and images. Support for arbitrary file type upload and basic file management utilities within the Files feature will be provided by end of Q1 2025 to meet MinIO Console core feature parity. For files that are not PDFs and images, you can continue to use Snorkel Flow's SDK.
Removed the MinIO client from the Notebook environment. For example, the following command is no longer available in a notebook:

!mc <command/arguments>
Deprecated the Boto3 SDK client to upload and download files from MinIO. Instead, use the new Files feature.
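To find notebook code affected by the MinIO changes above, a small scan such as the following may help. It is a minimal sketch, not an official migration tool: the helper name `find_deprecated_usage` is hypothetical, and the replacement mapping contains only the two replacements these notes name explicitly.

```python
import re

# Replacements named in these release notes; None means the notes do not
# name a direct replacement for that function.
DEPRECATED_UPLOADS = {
    "upload_to_minio": "upload_file",
    "upload_fileobj_to_minio": None,
    "upload_files_to_minio": None,
    "upload_dir_to_minio": "upload_dir",
}

# Matches a call to any deprecated function, e.g. "upload_to_minio(".
_CALL_PATTERN = re.compile(r"\b(" + "|".join(DEPRECATED_UPLOADS) + r")\s*\(")
# Matches a notebook shell line that invokes the removed MinIO client.
_MC_PATTERN = re.compile(r"\s*!mc(\s|$)")


def find_deprecated_usage(source):
    """Return (line_number, name, suggested_replacement) for flagged lines."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if _MC_PATTERN.match(line):
            findings.append((lineno, "!mc", None))
        for match in _CALL_PATTERN.finditer(line):
            name = match.group(1)
            findings.append((lineno, name, DEPRECATED_UPLOADS[name]))
    return findings


# Example input: the variable names in these lines are placeholders.
sample = "\n".join([
    "path = upload_to_minio(local_file)",
    "!mc ls",
    "upload_dir_to_minio(local_dir)",
    "upload_file(local_file)",
])
findings = find_deprecated_usage(sample)
```

Each finding gives the line number, the flagged name, and the replacement named in these notes where one exists. The new functions' signatures are not described here, so check the SDK reference before swapping calls.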

Bug fixes

Annotation

Fixed missing indicator for batch creation failures.
Fixed the caret icon so that it opens and closes the batch.
Fixed missing errors for PDFs in label-schema creation form.
Removed circular dependency check for DS preprocessing.
Allowed bulk export of batches.
Fixed the toggle and the disabled state of the submit button in the batch creation modal.
Updated sequence tagging LS list styles.
Improved dataset preprocessing call order.
Applied node datasources correctly.
Improved label-schema user experience.
Added ability to remove expert status from a batch.
Hid Annotate & Batches from nav for multi-schema annotation enabled apps.
Separated out copy and edit for label-schema.
Enabled search to retrieve context key phrases in ranking view with Ctrl + F.
Hid info icon if description is empty.
Disallowed label descriptions without a label name.
Added ability to handle long label-schema names in sequence tagging popup.
Updated RHS user experience for label-schema names, labels, and description.
Prevented empty dataset views from being rendered.
Fixed filtering for progress in overview page to show only relevant annotators.
Allowed highlighting of single characters in MSA.
Various user experience bug fixes.

Application deployment

Fixed a bug with DAG rendering in Deployment.
Prevented a deployment from being created when the DAG is stale.

Data + slices

Added max height to the edit slice popover.
Added breadcrumbs to the Eval All Reports and Single-report pages.
Fixed a runtime error when suggested labeling functions (LFs) are loading.
Updated the UI to reflect the action of deleting a tag.
Fixed the frontend logic for adding and assigning a tag.
Added optional chaining to HTML cells.

Foundation models

Added the ability to retry AzureOpenAI rate limits on error.
Displayed prompt in LF table for Text2Text models.
Updated setters for redux prompt template.
Displayed the number of foundation model matches.

Label + training

Fixed per-class token metrics computation.
Allowed users to edit the model config.

Machine learning

Fixed the DAG staleness check logic for custom operator classes.

SDK

sf.get_node_inputs_data and sf.get_node_output_data now work when context_uid is a string.

User interface

Fixed the hover state being partially obscured for the Clarity Matrix.
Ensured dropdown was at least as wide as the trigger element.
Expanded LF view offsets.
Fixed the alignment of the homepage with the background.
Updated visuals of login page SSO button.
Updated home page and login page branding.
Eliminated sidebar icon layout shift.
Removed shortcuts modal.
Fixed popup in record view not showing in the correct place.
Fixed scroll and removed spotlight from snippet view.
Fixed Suggested LFs cache when switching apps.
Fixed the Develop navigation link not appearing in breadcrumbs on the Studio page.
Fixed errors with FBAC Role Assignment.
Fixed reloading file cells.
Tracked application and dataset recency when creating.
Fixed span popup not closing when editing ground truth.
Made entire Dataset Card link clickable.
Fixed a failing word-based PDF onboarding test.
Fixed unknown model request issues after deleting the last model.
Re-added necessary color variables.

Features

Annotation

Added drag-and-drop interaction for ranking responses.
Added ability to bulk delete batches.
Added a new batch creation modal.
Added an endpoint to fetch data points that are not already in the batch.
Added new batch creation parameters to the SDK.
Updated batch creation endpoint to support new batch creation workflow.
Created a new endpoint that bulk deletes batches.
Allowed label selection popup when selecting a label span.
Hid record view when custom view was configured.
Added the ability to bulk apply the same label in sequence tagging.
Added dataset preprocessing UI.
Restricted the ranking view to have unique label annotation for prompt responses.
Rendered URL in retrieved context for custom dataset views.
Added the ability to edit the label-schema name and description in place.

Data + slices

Removed tag references from popouts.
Replaced the tag filter with the slice filter.
Added a link to the Studio from the Evaluate page.
Added ability to add a programmatic slice in the SDK.
Showed data in a popout when a cell in the eval report is selected.
Added slice filter for Studio.
Built slice filter for dataset batches.
Created a popout Dataviewer modal for cell click.
Added new All Reports page.
Added line chart to Eval Reports page.
Added ability to download your evaluation report as a CSV.

Data management

Created label filter for multi-label classification.

Foundation models

Supported OpenAI o1 models.
Added log streaming for SageMaker fine-tuning.
Supported TextGeneration models for SageMaker inference.
Added SDK methods for synthetically augmenting data.

Infrastructure

Updated the port configuration of Ray to run in strict mTLS mode with Istio.
Added the ability to skip auto-adding users to the default workspace.
Upgraded Grafana to 11.2.0 and removed Angular support.
Added a landing page for users with no workspaces.
Created an API to specify if users should be automatically added to the default workspace.
Gated GenAI Finetuning using FBAC.
Gated queries to foundation models.

Label + training

Added a flat GT API for sequence tagging (SDK + API changes).

Machine learning: PDF

Introduced a word-based PDF extraction as a beta feature.

Machine learning: text

Added a dataset template for text data.

Provisioning

Updated the Helm chart to add a flag to run Snorkel Flow in strict mTLS mode.

SDK

Replaced MinIO functions with new SDK methods.

User interface

Added first-run explanation modals for LF table options (trust, make inactive).
Added a multi-node selector in the sidebar.
Replaced user profile initial with icon in the sidebar.
Swapped order of Datasets and Applications in the sidebar.
Added Collections to Datasets, Files, and Deployments.
Added spotlight mode for sequence tagging.
Allowed users to edit the name of suggested LFs.
Added column sorting for the collections table.
Used collections for application templates.
Added card view to Collections.
Added file cell in Annotation Data Viewer.
Added label dropdown for model quality.
Added collections page component.
Implemented feature access control (FBAC) for GenAI, foundation models, and custom operators.
Implemented a modal for workspace / role level access for FBAC.

Improvements

Annotation

Simplified datasource related types and code.
Tweaked UI of single-response dataset view.

Foundation models

Added data exploration view.
Removed LLM extractor and made the LLM LF caching node namespaced.

Infrastructure

Upgraded the base image to Ubuntu 22.04 (Jammy Jellyfish).

Machine learning: PDF

Incremented PARSER2_VERSION at PDFToRichDocParser2 for the upgraded Poppler.

Provisioning

Increased the default database memory allocation from 4Gi to 8Gi.

SDK

Allowed creating SnorkelFlowContext from an endpoint URL.

User interface

Enhanced the user experience for the LF table.
Enabled the active LF row.
Added an expandable sidebar accordion.
Added the suggested LF panel.
Moved the LF controls and LF coverage toggle to a toolbar.
Allowed group toggling for table and LF coverage.
Removed the LF highlight from sequence tagging.
Removed UIDs from the toolbar in sequence tagging apps.
Improved the sidebar navigation.

Figure: Navigation in Develop Studio.

Known issues

Selecting an existing sequence tagging annotation positions the popup incorrectly.
Review workflow is broken for sequence tagging overlapping spans.
Dataset views allow adding an unattached label schema.
Some datasets are missing Last opened, Size, or First created metadata.
When saving Settings in a label schema, if you click the Next button a second time, the animation plays endlessly.
The onboarding accordion does not open at the expected time.
Confirmation of bulk accepting suggested LFs takes a long time.
Snorkel Flow allows inconsistent schemas for data sources during app creation.
Evaluation report doesn't show the data from a newly fine-tuned model.
The PDF extractor sometimes exceeds the max token limit.
Evaluation reports sometimes show the wrong user as the report creator.
Model metrics report 100% when there is no ground truth.
Error occurs during model training: operands could not be broadcast together with shapes (336,) (482,).
Negated LFs don't highlight matched words.
Studio filters with GT = negative don't show any data points in the Studio.
In the index column, selecting the Preview LF button does not show the LF preview.
For sequence tagging, the Confusion and Clarity Matrix numbers don't match.
PDFv2 min_per_class resampling creates an error.
PDFv2 always shows 0 Snorkel-Flow-generated labels in the top-level metrics, regardless of the number of labels generated.
Create LF fails silently for LLM LFs.
DAG is not marked stale when it is stale.
Redis can consume memory until it reaches an out-of-memory error.
