Snorkel Flow v0.92 (STS) release notes

What's new

Annotation

  • You can now limit access to annotations based on user roles.
  • Added the ability to update and delete dataset view label schema mappings.
  • Introduced the generative AI single response view in the annotation view.
  • Added new functionality to multi-schema annotation:
    • Nodes now use dataset tags.
    • Added overview and review pages.
    • Nodes now shift to dataset comments.
    • Updated dataviewer settings.
    • Added comments to the modular data viewer level.
  • Added filters for dataset tags.
  • Updated annotation side panel.
  • Created API endpoints to manage dataset views.
  • Added endpoints for dataset level tags.
  • Added support for filtering by source type and searching by batch name on the My work page.
  • Enabled auto-advance only when all single-label label-schemas are annotated.
  • Added Shuffle data order checkbox.
  • Added a Set as expert option in batches.

Data + App Management

  • Hid system labels from LLM extraction table in onboarding.

Data + Slices

  • Added model-based acceptance rate metric for LLM fine-tuning apps.
  • Added new evaluation module with GTAcceptanceRateMetric.
  • Moved EmbeddedTable to modular data viewer.
  • Added multi-column sort on table view in modular data viewer.
  • Added delta indicator to Metrics table.
  • Added Model Metrics Table to Evaluation page for LLM fine-tuning apps.

Enterprise Infrastructure

  • Encrypted the OIDC client secret in the database.
  • Added the ability to log audit events to STDOUT by setting a special environment variable.
  • Gated create populator endpoint with RBAC.

User interface

  • Added role-based access controls for user file uploads.
  • Added computer vision (CV) application onboarding image preview.
  • Associated new datasets with static assets at creation.
  • Added static asset upload.
  • Added search to users and workspaces list in Admin settings.
  • Added image task application onboarding flow.

Foundation models

  • Added modular data viewer PromptView for classification.
  • Added SageMaker model fine-tuning.
  • Added support for SageMaker Inference.
  • Added new database table for fine-tuned foundation models.
  • Added API endpoint for deleting foundation model integrations.
  • Added support for gated Hugging Face models.
  • Added retrieval of the configuration status of foundation model providers.
  • Removed FM_FREEFORM_QA_PROMPTING feature flag.
  • Improved error messaging.
  • Improved inference speed.
  • Added pass-through of arbitrary foundation model (FM) kwargs.
  • Enabled only document view for PDF prompting.
  • Added support for LLMs in prompt.

Machine Learning

PDFs

  • Added CheckboxSpanMapper operator that relates detected checkboxes to spans.
  • Added saving of manifests for dataset creation when files are uploaded.
  • Added ignore_errors option to HocrToRichDocParser.
  • Added general purpose operator for checkbox detector.
  • Added Record View for PDF prompt labeling functions.
  • Added fuzzy matching for spans and LLM responses for PDF LLM prompt labeling functions.
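
The fuzzy matching item above pairs LLM responses with document spans even when the text differs slightly. The sketch below illustrates the general idea with Python's standard difflib. It is not Snorkel Flow's implementation: the function names, the normalization step, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def best_fuzzy_match(response: str, spans: list[str], threshold: float = 0.8):
    """Return (span, score) for the closest span, or None if nothing clears the threshold.

    Hypothetical helper: the threshold value is an assumption, not a product default.
    """
    target = normalize(response)
    best = None
    for span in spans:
        # ratio() is a similarity score in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, target, normalize(span)).ratio()
        if best is None or score > best[1]:
            best = (span, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

A call such as `best_fuzzy_match("Acme Corp", ["ACME  Corp", "Other"])` selects the first span, because the two strings are identical after normalization.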

Images

  • Added RBAC endpoints for asset upload modal.
  • Added GET endpoints for fetching static asset information.
  • Added audit trail logs for file upload and datasource creation with static asset.
  • Merged image extensions into one image file type for asset upload.
  • Added RBAC endpoints for static assets.
  • Added RBAC to gate local/remote file upload.
  • Added image support for new onboarding.
  • Added image resizing to new dataset creation.
  • Added static asset column into dataset.
  • Added local and remote asset upload endpoints.

SDK

  • Added a create_dataset_view method.
  • Added sf.get_user and deprecated sf.get_user_id.
  • Added read and delete functions to SDK for dataset_view.

Improvements and bug fixes

Annotation

  • Improved loading time of annotator-agreement matrix.
  • Limited the aggregated annotations sources to origin batch.
  • Disabled multi-label option for multi-schema annotation dataset.
  • Fixed datasets loading state.
  • Fixed bug that returned the wrong data when going to the first unlabeled document.
  • Removed negative label from sequence tagging annotation pop-up.
  • Added signal to label_schema origin.
  • Fixed bug that prevented Filters menu from closing when clicking outside.
  • Fixed bug that crashed the application when navigating between docs.
  • Showed all annotations for all label-schemas.
  • Fixed duplicated batches on datasets page.
  • Fixed selection offset when the highlight is selected.
  • Disabled image label-schema creation for multi-schema annotation.
  • Fixed bug to pass the correct fields into initialize selected fields.
  • Resolved issue where default label schema was used to label ground truth.
  • Resolved wrong x_uid returned by comment filter.
  • Fixed bug to show correct batch size.
  • Fixed bug where the application crashed when selecting the comment icon.
  • Updated comment filter placeholder and comment button sizes.
  • Set the correct split when navigating from train batch to studio.
  • Hid filter docs toggle in multi-schema annotation label-schema form.
  • Updated to send correct annotation filter value.
  • Hid settings icon if no props are passed.
  • Updated to prevent users from labeling same span multiple times.
  • Updated to show correct empty state when no annotators are present in a batch.
  • Moved selected label-schema to user-settings.
  • Updated focus on spans that belong to current filter.

App + Model Deployment

  • Added some comments to init.py to be compatible with Microsoft Azure.
  • Added missing sources to a deployment.

Core App UX

  • Updated SelectedDocumentsProvider-related table view selection behavior.
  • Increased independence of studio2 tests.
  • Reduced the number of /download requests.
  • Disabled deployment page link in breadcrumbs.
  • Fixed regression in modular data viewer. The selected span was not maintained when switching between document and record views.

Data + App Management

  • Deleted phantom batches on a deleted datasource.
  • Fixed bug where selecting a subset of dataset was not effective when created through onboarding.

Data + Slices

  • Updated Slices/Model metrics table.
  • Corrected the dynamically paged /dataset request.
  • Updated to maintain selected span when switching from record view to document view.
  • Fixed span not getting selected. Reset scroll on page change.
  • Fixed file URL duplication in FileCell in Table view.
  • Enabled loading all highlights for the new document.
  • Updated to correctly register PDF type.
  • Fixed broken preview labeling function.
  • Fixed problem loading PDFs by URL in modular data viewer in a non-PDF application.
  • Added missing inversion key.
  • Made Disable export button more specific.
  • Made ContextIndex relative.
  • Fixed the document view page so that it is set correctly.
  • Updated to show only spans relevant to the search.
  • Updated to handle super text across task types.
  • Updated to use real total count for data summary pane.
  • Synced /context endpoint with modular data viewer pagination.
  • Added option to show plot mode and embedded table when flags are on.
  • Added Color Span by dropdown to modular data viewer.
  • Added export dataset button for only annotation mode.
  • Added filter highlights by highlight type.
  • Added send to top in drag to select tool.
  • Converted NaN values in the dataset to None to make them compatible with the OpenAPI JSON encoder.
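
The NaN-to-None item above reflects a common serialization problem: NaN is not valid JSON, so strict encoders reject it. The sketch below shows one way such a conversion can work, using only the standard library. The helper name and the sample rows are hypothetical, and this is not the product's code.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None in dicts, lists, tuples, and scalars."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [nan_to_none(item) for item in value]
    return value


# Hypothetical sample rows; the field names and values are illustrative only.
rows = [
    {"x_uid": "doc::0", "score": float("nan")},
    {"x_uid": "doc::1", "score": 0.5},
]

# allow_nan=False makes json.dumps raise ValueError on NaN, so this call
# succeeds only because the NaN was replaced with None (encoded as null).
payload = json.dumps(nan_to_none(rows), allow_nan=False)
```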

Enterprise Infrastructure

  • Upgraded PyTorch to 2.2.2.
  • Updated AllenNLP to 2.10.2 to remove jsonpickle.
  • Upgraded Jupyter Notebook to 7.
  • Updated FastAPI and Starlette.
  • Used local version of InfluxDB to resolve CVEs.
  • Updated local version of MinIO on Alpine v2.
  • Added library path for CUDA libraries to ensure Python libraries can find them.
  • Used memory limit in notebook from helm values.
  • Fixed helm template for influx cloud secret and compatibility with userconfig.
  • Fixed deployment configuration when influx_cloud_key is not applied.
  • Fixed lantern build.
  • Patched CVE-2024-28179.
  • Ensured DATA_SOURCE_ACCESS_CONTROL is only set by environment.

Front-end infrastructure

  • Fixed session expiration handling.
  • Upgraded Node to version 22.
  • Hid Resize images checkbox for PDF files in the new dataset modal.
  • Fixed new dataset file association request data.
  • Fixed server-side Redux store to not be shared between clients.
  • Fixed pagination for application list on home page and use cached requests.
  • Fixed blank page.

Foundation models

  • Added model_type for SupportedLLM.
  • Added BasePromptView across tasks.
  • Set and migrated model_type for external model endpoints.
  • Shared prompt data view.
  • Added Custom Inference Service as an ExternalLLMProvider.
  • Leveraged single-shot prompting for freeform LLMs in sequence tagging.
  • Added PDF format transform at the data-fetching level.
  • Added dataset-output API to node server.
  • Added prompt state transform for PDF.
  • Added PDF extraction LLM type.
  • Made PDF scrollable in foundation model prompting.
  • Added better error messages for previewing.
  • Showed codemapper button in only Text2Text tool.
  • Disabled prompt view for PDF classification application.
  • Migrated OpenAI external model endpoint URLs.
  • Updated requirement for endpoint URL for OpenAI Models.
  • Updated to load llmqa models from previous runs into prompt view.
  • Cleared prompt input when resetting prompt state.
  • Fixed snackbar on get-prompt-output in refresh of previous prompt.
  • Changed PDF application to show max sample size not greater than number of documents in dataset.
  • Added loading on screen when prompts are getting loaded.
  • Updated to show settings icon button in sequence tagging across tool types.
  • Changed so prompt results aren’t cleared with the Cancel button.
  • Added populator support for arrow cache.
  • Used document total count as max limit on sample size for preview labeling function.
  • Fixed pandas query filter operator documentation.
  • Filtered out embeddings columns from prompt input on init.

Label + Training

  • Fixed labeling function name race condition.

Machine Learning

PDFs

  • Fixed loading and progress bar delays.
  • Fixed handling of PDF classification URL cells.
  • Fixed drag to select for PDF.
  • Fixed Checkboxes.deserialize().
  • Reverted HocrToRichDocParser to version 3.
  • Added page_id for the output column of CheckboxFeaturizer operator.
  • Added the use of a feature store for DocVQA prompting.
  • Added the use of the page_ids from rich doc.
  • Added in-memory caching of LLM responses to avoid accessing MinIO.
  • Exposed filtered result batch creation in PDF IE doc view.

Images

  • Saved static asset column in datasource metadata.

Other

  • Upgraded Python to 3.10.
  • Extended dataset numeric sorting to all span-based apps.
  • Added _resources to custom operators code to avoid staleness.
  • Allowed for context_uid for string.
  • Fixed failing imported labeling functions when an inactive labeling function couldn't be copied.

SDK

  • Installed SHapley Additive exPlanations (SHAP) in notebook.

Deprecations and breaking changes

App + Model Deployment

  • LabelingFunctionsExportable now errors out if it runs on a different version of Python.
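
The version check described above can be pictured as a guard that compares the running interpreter against an expected major.minor version. The sketch below is only an illustration of that kind of guard: the function name and the way the expected version is supplied are hypothetical, and the release notes do not describe how LabelingFunctionsExportable performs its check.

```python
import sys


def check_python_version(expected: tuple[int, int]) -> None:
    """Raise RuntimeError if the interpreter's major.minor differs from expected.

    Hypothetical helper: how the expected version is recorded is an assumption.
    """
    current = tuple(sys.version_info[:2])
    if current != tuple(expected):
        raise RuntimeError(
            f"Expected Python {expected[0]}.{expected[1]}, "
            f"but running {current[0]}.{current[1]}."
        )
```

Failing early with a clear message is generally preferable to a later, harder-to-diagnose deserialization error.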

Machine Learning

PDFs

  • Removed detectron2.

Other

  • Compiled custom operator class.
  • Compiled labeling function template class.
  • Upgraded Python from 3.8 to 3.10. You might need to reinstall any custom libraries that you installed previously.
  • Custom operator classes will not work if libraries, such as re, are not explicitly imported within the user-defined function (UDF), or if the UDF references variables that are not defined within it. Recreate classes as needed.
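
The last item above means that a UDF must be self-contained. The sketch below contrasts a function that depends on a module-level import with one that imports inside its own body. The `run_isolated` helper is a hypothetical simulation that rebuilds a function with empty globals. It is meant only to illustrate why outside names become unavailable and is not how Snorkel Flow compiles operators.

```python
import builtins
import types


def extract_digits_bad(text):
    # Relies on a name (`re`) that is never defined inside the function.
    return re.findall(r"\d+", text)


def extract_digits_good(text):
    # Imports everything it needs inside the function body.
    import re

    return re.findall(r"\d+", text)


def run_isolated(func, *args):
    """Re-create func with empty globals to mimic running it away from its module.

    Hypothetical helper for illustration only.
    """
    isolated = types.FunctionType(func.__code__, {"__builtins__": builtins})
    return isolated(*args)
```

With this setup, `run_isolated(extract_digits_good, "a1b22")` succeeds, while `run_isolated(extract_digits_bad, "a1b22")` raises a NameError because `re` is not available inside the isolated function.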