2024.R1 LTS (v0.91) Snorkel Flow release notes
What's new
Annotation (Beta)
- Multi-task annotations are now supported.
- Label schemas and batches are now created at the dataset level (vs application level).
- New Label Schema and Batch tabs have been added to the Datasets page.
- New label schemas can be created in the Label Schema tab. Only text data type is supported. Note that multi-label classification and candidate based text extraction label schemas are not yet supported.
- New dataset batches can be created in the Batch tab.
- Dataset annotations can be created by accessing the annotation page from the Batch tab.
- Aggregations can be performed by selecting more than one annotators
- Annotations from a single annotation source can be committed as ground truth
- You can load an existing label schema when creating an application using the guided flow.
Image (Beta)
- Image classification is now supported in Snorkel Flow.
- Manual annotation of images is supported.
- New LF types are supported for images: text-to-image similarity, image-to-image similarity, image-to-patch similarity, model based similarity, and text-to-caption LFs.
- Note that for large image applications (more than 50K images / more than 10 classes), we strongly recommend turning on Dedicated Resources.
Foundation model suite
- (Beta) Added default RAG configuration for sequence tagging prompting that chunks & prioritizes sections of the document to help improve performance on longer documents.
- Added the capability for users to connect a custom inference service (one that follows the OpenAI API-specification) for use in prompting.
- Added support for Google/Gemini and Mistral LLMs.
Studio
- Record view has been added to the list of data content view options. This replaces the Span view and Raw view options for extraction applications, and the Data view and Raw view options for all other applications.
- Snippet view is now available for PDF applications.
- The Data summary pane is now available in sequence tagging and multi label applications.
- Added the ability to sort by all columns when viewing documents in Table view.
- Cmd+F is supported across all data viewer modes.
Data and application management
- New features for when creating applications using the guided flow:
- Added the option to create RAG embeddings when creating sequence tagging applications to optimize prompt performance.
- Added search capabilities to the label schema dropdown.
- Added Snippet view to the data preview.
Model training
- Added support for distributed training with Ray Train to HuggingFace transformers.
PDF
- (Beta) Added the ability to leverage the FM prompt builder for PDF applications in Studio.
- Prompt preview is enabled for all PDF applications.
- Added a LF vote filter on PDF prompts.
- Added the ability to call the prompt builder from the SDK for PDF applications.
- When annotating PDF documents, there is now a toolbar that gives you the option to label all spans in a document as Other.
Text
- Added a class filter to the Clarity Matrix for sequence tagging applications
Deployment
- Added support for deploying a model to the Databricks Unity Catalog workspace.
- Added support for deploying a model to Azure Machine Learning.
- Added support for deploying a model to Vertex AI.
Enterprise Infrastructure
- Added support for secure, on-prem/air-gapped access to FMs out-of-the-box with Snorkel Flow install.
- Using the Dataset Access Control experience within Admin Settings, admins can now grant access to the following data connectors within the New Dataset experience in Snorkel Flow. They can choose to grant access either across the entire instance, OR provision access to specific users by configuring role and workspace rules.
- Local File Upload
- Cloud Bucket
- Snowflake
- Databricks
- BigQuery
- Admins can configure whether users have full “Create, Read, Update, and Delete” access to DataConnectorConfigs within a given workspace, or whether users only have the ability to “Read and Use Existing” configs. This enables Admins to safely define access control rules that enable collaborators within the same workspace to share connector configs without exposing the underlying credentials.
- Split admission roles scope and claim into two fields. Before, this was a single field that was separated by a colon.
Improvements and bug fixes
Foundation model suite
- Improved the default code mapper to fuzzy match on the FM output.
- Added arrow backed caching for LLM outputs, improving prompt LFs performance and retention duration.
- Clearer error messaging for local inference.
- Clearer naming convention for prompt LFs (e.g., abbreviation FFP -> PROMPT).
- When previewing a prompt on a sample of data points, ground truth examples are now guaranteed to be included in the sample.
- For multi-label applications, all labels are now mapped correctly (vs. just the first label).
- Fixed issue where prompt builder snippets didn’t always show highlights in sequence tagging applications.
- Fixed an issue where embeddings columns were available in the prompt builder template before they were finished generating.
Studio
- The right-side pane in the dataviewer has been consolidated to now include the options to edit the label schema, edit ground truth, and view the data summary.
- For candidate extraction applications, spans are now ground truth color coded in Record view.
- The create annotation batch from filter result option has been moved to the filter bar with the other filter actions
- The Export Studio dataset option has been moved to the Export button in Studio with the other export options.
- The Resample data option has been moved into the data split selector drop down menu.
- LF votes are now updated automatically in the dataviewer when ground truth is updated (in the dataviewer).
- Table view now renders 20 rows by default.
- Snippet view supports scrolling within each snippet to view more text.
- For sequence tagging applications, Record View supports clicking the column name to jump to the associated field entry.
- Span styling is consistent across all task types, and ground truth spans render supertext displaying the ground truth class.
- In PDF extraction applications, while in Document view, you are now able to both remove all labels or add default class labels to all spans in the document.
- All label indicators have the same styling. This applies to ground truth, model predictions, and training set labels.
- Keyboard navigation is available between highlights of any type.
- Send to top is now available in the drag to reorder component for the Select Displayed Columns menu.
- Fixed an issue where quickly clicking different options in Studio resulted in the latest action not loading.
- Fixed an issue where user settings did not persist after reloading the screen.
- Fixed an issue where the selected data content view mode did not persist when you switched tabs in Studio.
- Fixed a pagination issue that appeared on applications that contain very large data points.
- Sorting the table in the Labeling Functions pane now works properly.
- The Export Studio dataset option now correctly downloads the model predictions from the selected model (vs. the current model in Studio).
Data and application management
- When creating an application using the guided flow, data and task type selections have moved to the define label schema step.
- Job logs are now accessible to all roles except for Annotator.
- Added the job owner and the job end date/time to the Jobs Dashboard.
- De-cluttered the data preview for PDF applications by removing unnecessary fields.
- Fixed an issue where the data preview of PDFs was not rendering when the page splitter option was used.
Model training
- The search range for the model decision threshold when threshold tuning has increased by a factor of 5, extending the search granularity from 0.05 increments to 0.01 increments. This broader search area will enhance the ability to capture optimal decision thresholds effectively.
- When training a custom model, clearer and more regular status updates display under the progress bar.
- When building a model from a previous AutoML model, the model config under Model options now correctly shows the config from the best model from the AutoML run.
PDF
- The PDF prompt builder can now read files from MinIO and files with an HTTPS path.
- The Rich Document Expression LF builder is now not case sensitive by default.
- For PDF applications where page splitting is used, the spans are now sorted in numerical page order in the Span view (instead of alphabetical).
Embeddings
- If embeddings are created from embeddings home, then they cannot be used in model training. These fields are no longer shown as options for input fields when configuring a model.
Deployment
- A model signature is now included by default in both the Snorkel Flow UI and the SDK (signature=True is now the default in MLflowDeployment.create). See the mlflow docs for more details about the signature.
- Deployments are now compatible with Python 3.9 and 3.10.
- Deployments can now be run in Snorkel Flow if the source application has been deleted.
- Removed unnecessary dependencies from TableConverter and from sklearn-based models to reduce the container image size.
- Fixed an issue where MLflow deployments were leaking memory.
SDK
- Fixed the source code retrieval of a code LF that is decorated by resources_fn_labeling_function.
Enterprise infrastructure
- Various library upgrades to patch CVEs.
- Postgres upgraded to v16.
Deprecations and breaking changes
Studio
- Plot mode and embedded table viewer are no longer supported in the dataviewer. If you are using those features, then you must enable the old dataviewer.
- You can no longer edit ground truth in Table view.
- You can no longer enable Mark spaces in text displayed in Record View
- You can no longer enable hyperlink URLs over arbitrary text in Record View, instead, URLs will be hyperlinked if their column name is “url”.
Data and application management
- The ModelBasedFeaturizer node has been deprecated. DAGs with previously trained ModelBasedFeaturizer nodes can no longer be refreshed. Remove the nodes, archive the older LFs based on the featurizer scores, and use the model based LF template instead.
Deployment
- Deployments created in v0.21 or earlier in either format and those created in v0.64 or earlier in the MLflow format will no longer run in the platform. Please export and run them outside the platform, or recreate them for in-platform inference.
- Deployments that have a removed built-in operator (such as SpanJoiner) as part of the DAG will no longer run in the platform. Please export and run them outside the platform, or recreate them for in-platform inference.
- Deployments created between v0.22 and v0.51 still run in the platform but will return a dataframe with a different index and different names for monitoring-related columns.
SDK
- The `pre` part of a code LF for preprocessors is no longer supported. Please add an equivalent operator to the DAG or fold them into the main function.
- Code LFs, custom operators, and custom metrics will not work if libraries such as re are not explicitly imported within the user defined function (UDF) or variables that are not defined within the UDF are referenced. Please recreate them.
- Removed sf.swap_application_dataset.
- Removed sf.get_column_summaries.
- Deprecated sf.export_workflow_config.
- Deprecated sf.poll_job_status_with_timeout in favor of sf.poll_job_status.
- Deprecated deployment functions in favor of new ones. Deployment. = new, sf. = old
- sf.execute_workflow → Deployment.execute
- sf.delete_export → Deployment.delete
- sf.deploy_application → Deployment.create
- sf.export_deployment → Deployment.download
- sf.get_export → Deployment.get
- sf.patch_export → Deployment.update
- sf.get_application_exports → Deployment.list
- sf.export_deployment_to_registry → MLflowDeployment.create
Known issues
- When creating an application using the guided flow, generating preview data on really large documents will throw an error.
- When creating a PDF application using the guided flow, you cannot edit the label schema during onboarding.
- When creating PDF prompt LFs, you must increase the timeout to be able to load and view the LF.