
2024.R3 STS (v0.94) Snorkel Flow Release Notes

Breaking Changes

SDK

  • The snorkelflow.ingest.docs.dirtree_to_parquet() function now creates the parquet file in a workspace-scoped directory.

Deprecations

Annotation

  • Deprecated annotation_rate from the annotation overview.

Foundation models

  • Deprecated OpenAI GPT-3 Fine-tuning.

SDK

  • Deprecated the Prompt Regex Generator.

Upcoming SDK deprecation

  • The following SDK MinIO file upload functions will be replaced or removed. New functions, to be released in v0.95 LTS, upload files to a workspace-scoped directory and can be used in their place:

    • [upcoming deprecation notice] upload_to_minio(); use upload_file() instead
    • [upcoming deprecation notice] upload_fileobj_to_minio()
    • [upcoming deprecation notice] upload_files_to_minio()
    • [upcoming deprecation notice] upload_dir_to_minio(); use upload_dir() instead
    • [new] download_file()
    • [new] download_dir()
    • [new] upload_dir()
    • [new] upload_file()
    • [new] list_dir()
  • The MinIO Console will be deprecated as of Q4 2024 but will remain accessible until the end of Q1 2025 for backward compatibility with older workflows. After Q1 2025, access to the MinIO Console will be removed. Instead, use the new Files feature to upload PDFs and images. To reach parity with the core features of the MinIO Console, the Files feature will support arbitrary file types and basic file management utilities by the end of Q1 2025. For files that are not PDFs or images, you can continue to use Snorkel Flow's SDK.

  • The MinIO client will be removed from the Notebook environment. For example, the following command will no longer be available in a notebook:

    !mc <command/arguments>

  • The Boto3 SDK client to upload and download files from MinIO will also be deprecated. Users are encouraged to use the new Files feature.
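
To prepare for the removals listed above, it can help to find where existing notebooks or scripts still rely on the affected calls. The following is a minimal sketch, not part of the Snorkel Flow SDK: find_deprecated_usage is a hypothetical helper, and the replacement names are only those stated in the notice above (functions with no named replacement are reported as such). It flags the deprecated upload functions, !mc notebook commands, and Boto3 imports in a block of source text.

```python
import re

# Replacement names are taken from the notice above; None means the notice
# names no direct replacement for that function.
DEPRECATED_UPLOADS = {
    "upload_to_minio": "upload_file",
    "upload_fileobj_to_minio": None,
    "upload_files_to_minio": None,
    "upload_dir_to_minio": "upload_dir",
}

CALL_PATTERN = re.compile(r"\b(" + "|".join(DEPRECATED_UPLOADS) + r")\s*\(")
MC_PATTERN = re.compile(r"^\s*!mc\b")
BOTO3_PATTERN = re.compile(r"^\s*(import|from)\s+boto3\b")


def find_deprecated_usage(source: str) -> list[tuple[int, str]]:
    """Return (line_number, message) pairs for each flagged line in `source`."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            name = match.group(1)
            replacement = DEPRECATED_UPLOADS[name]
            hint = f"use {replacement}()" if replacement else "no replacement named yet"
            findings.append((lineno, f"{name}(): {hint}"))
        if MC_PATTERN.match(line):
            findings.append((lineno, "MinIO client (!mc) will be removed from notebooks"))
        if BOTO3_PATTERN.match(line):
            findings.append((lineno, "Boto3 access to MinIO is being deprecated"))
    return findings


if __name__ == "__main__":
    # Hypothetical notebook content, used only to exercise the scanner.
    sample = "!mc ls local/bucket\nimport boto3\nsf.upload_to_minio(path)\n"
    for lineno, message in find_deprecated_usage(sample):
        print(f"line {lineno}: {message}")
```

The scan is purely textual, so it will also flag matches inside comments or strings; treat its output as a checklist to review rather than a definitive report.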

Bug fixes

Machine Learning: PDF

  • Removed rendering of PDFs in Studio record view for extraction apps.
  • Fixed PDF processing to prevent unwanted lines from being added to PDFs.
  • Added PDF URL details to error messages from Tesseract operator.
  • Fixed an issue where Studio showed all checkboxes in native PDF docs as unchecked.

Foundation Models

  • Refreshed prompt templates.
  • Supported custom inference services without logprobs.
  • Fixed clarity and reduced spam in FM usage tracking errors.
  • Increased the SageMaker session duration.
  • Fixed the useGetModelFromTask function for initial loads.
  • Set the selectedLabel correctly from the previous prompt in sequence tagging.
  • Resolved prompting bugs for Amazon SageMaker.
  • Prevented showing a "no models" state during loading.
  • Explicitly requested fields data from the classification prompt view.

User Interface

  • Fixed API calls that used the wrong ID after login.
  • Fixed an issue where token metrics didn’t show up on some apps.
  • Rounded label density to a reasonable number of decimal places.
  • Fixed an issue where the suggested LF modal sometimes showed an empty columns list.
  • Prevented unexpected logouts.
  • Fixed scroll and pagination for ApplicationList.
  • Fixed scrollToPDFSpan for single pages.
  • Fixed the inactivity timeout.
  • Fixed the ability to manually resize the Studio left sidebar.
  • Corrected the casing of model metric names.
  • Passed override parameters to GTLabelSelect.

Annotation

  • Restricted dataset views to classification label schemas only.
  • Correctly included the last label during LS creation.
  • Adjusted the positioning of the sequence tagging popup.
  • Fixed an issue where editing a label changed all labels.
  • Defaulted to page view for word-based PDFs.
  • Prevented filters from being carried over between Studio and annotation.
  • Restricted annotators to viewing only their progress.
  • Filtered text label schemas from inter-annotator agreement.
  • Fixed an issue where saved annotations didn’t show up in the ranking view.
  • Filtered and searched within the sequence tagging popup.
  • Displayed the correct completion status for a batch.
  • Removed sort parameters for dataset calls in annotation studio.
  • Fixed URL parsing.

Data + Slices

  • Fixed offset calculation for existing highlights.
  • Adjusted the z-index for MenuButton popovers on the Eval page.
  • Filtered out unknown labels for sequence tagging annotation.
  • Enabled SDK documentation for the Slice module.
  • Removed UNKNOWN as a label option for sequence tagging.
  • Made highlights display correctly across new lines.
  • Memoized query params generation to prevent infinite dataset fetching.
  • Used the correct primary_text_field and fixed getSuggestField.

Label + Training

  • Correctly copied the model schema from the payload.
  • Threw the "Only binary tasks support thresholding." error earlier.
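
Raising this error earlier follows a fail-fast pattern: cheap configuration checks run before any expensive work begins. The sketch below illustrates the idea only. The functions validate_threshold and train are hypothetical and are not Snorkel Flow's implementation; only the error message is taken from the note above.

```python
from typing import Optional, Sequence

THRESHOLD_ERROR = "Only binary tasks support thresholding."


def validate_threshold(labels: Sequence[str], threshold: Optional[float]) -> None:
    """Raise ValueError if a threshold is requested for a non-binary task."""
    if threshold is not None and len(set(labels)) != 2:
        raise ValueError(THRESHOLD_ERROR)


def train(labels: Sequence[str], threshold: Optional[float] = None) -> str:
    # Validate cheap configuration constraints first, so that a bad request
    # fails before any expensive work starts.
    validate_threshold(labels, threshold)
    # ... expensive training would happen here (omitted in this sketch) ...
    return "trained"
```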

SDK

  • Fixed a bug that gave Annotators access to all Developer permissions in Notebooks.
  • Identified the default workspace name when changed from "default."
  • Handled cases with no predictions at sf.get_node_data.
  • Raised an error when the join column was not unique at FineTuningApp.import_ground_truth.
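
A join on a non-unique column can silently attach ground truth to the wrong rows, which is why an explicit error is safer. The following is a minimal sketch of such a uniqueness check on plain Python dictionaries. The helper names are hypothetical, and this is not how FineTuningApp.import_ground_truth is implemented.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Iterable, List, Sequence


def find_duplicate_keys(keys: Iterable[Hashable]) -> List[Hashable]:
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]


def check_join_column_unique(rows: Sequence[Dict[str, Any]], join_column: str) -> None:
    """Raise ValueError if `join_column` has repeated values across `rows`."""
    duplicates = find_duplicate_keys(row[join_column] for row in rows)
    if duplicates:
        raise ValueError(
            f"join column {join_column!r} is not unique; "
            f"duplicated values: {duplicates[:5]}"
        )
```

Calling check_join_column_unique(rows, "uid") before merging lets a caller correct the input instead of debugging mismatched labels afterwards; the column name "uid" here is only an example.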

Provisioning

  • Fixed Helm ingress indentation.

ML Tasks: Text

  • Added support for auto-generating negative labels in node block copying.
  • Updated the Sequence NER LF template to more explicitly store the correct fields.

Supervision

  • Fetched the correct LF configuration from the backend.

Features

Foundation Models

  • Enabled prompt FMs to extract multiple entities for sequence tagging.
  • Added SDK sf.prompt_fm() and sf.prompt_fm_over_dataset() methods for prompting FMs.
  • Added model_name and detail view support for the sf.get_external_model_endpoint SDK method.
  • Added warm start support to GenAI apps.

Provisioning

  • Enabled conditional use of file-based prefect secrets in model-trainer.
  • Enabled conditional use of file-based secrets for JupyterHubProxy.
  • Enabled conditional use of file-based InfluxDB secrets in flow-ui.
  • Allowed conditional use of file-based database and InfluxDB environment variables.
  • Enabled conditional use of file-based secrets for JupyterHub.
  • Supported conditional use of file-based TDM_CONN_STR environment variables.
  • Mounted MinIO secret keys as files when secretsFromFile was true.
  • Enabled conditional use of file-based secrets for tdm-api.
  • Enabled TDM API to read a JupyterHub secret from a mounted file.
  • Removed MinIO Secret Environments from InferenceService.
  • Enabled conditional use of file-based secrets for telegraf.
  • Enabled conditional use of file-based secrets for grafana.
  • Removed MinIO Secret Environments from RayHead and StudioRayHead.
  • Removed MinIO Secret Environments from JupyterHub and JupyterHubProxy.
  • Enabled reading a postgres password from a file.
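
Several of the items above share one pattern: when file-based secrets are enabled, a service reads a credential from a mounted file instead of an environment variable. The sketch below shows that pattern in general terms. The function read_secret and its secrets_dir parameter are assumptions made for illustration; they are not the Snorkel Flow implementation, and the actual file paths and flag handling in the Helm charts may differ.

```python
import os
from pathlib import Path
from typing import Optional


def read_secret(name: str, secrets_dir: Optional[str] = None) -> str:
    """Return the secret `name`, reading a mounted file when one is configured.

    `secrets_dir` stands in for a setting such as secretsFromFile: when it is
    set, the secret is read from <secrets_dir>/<name>; otherwise the value
    comes from the environment variable `name`.
    """
    if secrets_dir is not None:
        secret_path = Path(secrets_dir) / name
        # Mounted secret files often end with a newline; drop it.
        return secret_path.read_text(encoding="utf-8").rstrip("\n")
    try:
        return os.environ[name]
    except KeyError:
        raise RuntimeError(f"secret {name!r} is not set in the environment") from None
```

Reading from a file keeps the credential out of the process environment, where it could otherwise be exposed through process listings or debug dumps.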

Data + Slices

  • Added custom metrics support to FineTuningApp.
  • Supported custom metrics with the evaluation report.
  • Transitioned cluster view to MDV.

User Interface

  • Improved styles for class-level metrics when there were too many.
  • Added an audio file cell.
  • Added admin settings to gate access to Notebooks/Deployments at the instance level.
  • Updated the Jobs link in the sidebar to open a new page instead of a modal.

Annotation

  • Grouped retrieved contexts of a prompt in ranking view for annotation.
  • Introduced annotation of retrieved context in single response view and ranking view.
  • Made label table labels editable.

Label + Training

  • Added per-class token metrics.

Machine Learning: PDF

  • Added a new layout-aware parsing library for native PDFs. All new applications will use the new parser.

Image

  • Enabled an improved computer vision onboarding experience.

Data + App Management

  • Added support for authenticated S3 and GCS buckets.

Improvements

SDK

  • Enabled conditional use of file-based MinIO secrets.
  • Migrated MinIO SDK functionality to HTTP Storage Proxy.
  • Enabled the SDK to support datasets with non-text fields for prompting.
  • Enabled sequence tagging prompting for Text2Text models via SDK.

Machine Learning: PDF

  • Improved PDF rendering performance in Studio.
  • Added a filter to the PDF dataset templates that removes data with parsing issues.

User Interface

  • Added the year to the date format on the application page.
  • Improved memory utilization of Studio.

Annotation

  • Made word-based label schemas selectable.
  • Sorted model and training fields in annotation mode.
  • Made annotator names available via tooltip in annotator progress.

Label + Training

  • Modified the base behavior of training Transformer models.

Data + Slices

  • Added more explanation to /download error messages.

Foundation Models

  • Displayed abstained examples during prompt previews in sequence tagging.
  • Displayed FM outputs for examples during prompt previews in sequence tagging.
  • Enabled extraction of multiple entities in sequence tagging.
  • Migrated to use FMType instead of PromptTool.
  • Updated the prompt template when the model changed.
  • Created prompt templates for sequence tagging.
  • Displayed time remaining for in-progress LFs.
  • Enabled SageMaker models to be configured with a provider prefix.
  • Verified SageMaker authentication setup for fine-tuning.
  • Improved performance for preview prompt LFs for all OpenAI models when rate limit values are set. Expect 10x-20x speed improvements.

Known Issues

  • When the DAG (directed acyclic graph) is stale, users can still create a deployment.
  • If labels are added before the primary text field is selected, labels are removed when the primary text field is chosen.
  • The Reviewer workflow is broken for sequence tagging overlapping spans.
  • "No data found" appears when applying a filter that yields fewer results than the current pagination.
  • You cannot switch labels from one label schema to another.