Snorkel Flow v0.92 (STS) release notes
What's new
Annotation
- You can now limit access to annotations based on user roles.
- Added update/delete dataset view label schema mapping.
- Introduced the generative AI single response view in the annotation view.
- Added new functionality to multi-schema annotation:
- Nodes now use dataset tags.
- Added overview and review pages.
- Nodes now shift to dataset comments.
- Updated dataviewer settings.
- Added comments to the modular data viewer level.
- Added filters for dataset tags.
- Updated annotation side panel.
- Created API endpoints to manage dataset views.
- Added endpoints for dataset level tags.
- Added support for filtering by source type and searching by batch name on the My work page.
- Enabled auto-advance only when all single-label label-schemas are annotated.
- Added Shuffle data order checkbox.
- Added a Set as expert option in batches.
Data + App Management
- Hid system labels from LLM extraction table in onboarding.
Data + Slices
- Added model-based acceptance rate metric for LLM fine-tuning apps.
- Added new evaluation module with GTAcceptanceRateMetric.
- Moved EmbeddedTable to modular data viewer.
- Added multi-column sort on table view in modular data viewer.
- Added delta indicator to Metrics table.
- Added Model Metrics Table to Evaluation page for LLM fine-tuning apps.
Enterprise Infrastructure
- Encrypted the OIDC client secret in the database.
- Added the ability to log audit events to STDOUT with a special environment variable.
- Gated create populator endpoint with RBAC.
User interface
- Added role-based access controls for user file uploads.
- Added computer vision (CV) application onboarding image preview.
- Added association with static assets when creating new datasets.
- Added static asset upload.
- Added search to users and workspaces list in Admin settings.
- Added image task application onboarding flow.
Foundation models
- Added modular data viewer PromptView for classification.
- Added SageMaker model fine-tuning.
- Added support for SageMaker Inference.
- Added new database table for fine-tuned foundation models.
- Added API endpoint for deleting foundation model integrations.
- Added support for gated Hugging Face models.
- Retrieved configuration status of foundation model providers.
- Removed the FM_FREEFORM_QA_PROMPTING feature flag.
- Improved error messaging.
- Improved inference speed.
- Passed through arbitrary FM kwargs.
- Enabled only document view in prompt PDF.
- Added support for LLM model in prompt.
Machine Learning
PDFs
- Added the CheckboxSpanMapper operator, which relates detected checkboxes to spans.
- Added save manifests for dataset creation when files are uploaded.
- Added the ignore_errors option to HocrToRichDocParser.
- Added general purpose operator for checkbox detector.
- Added Record View for PDF prompt labeling functions.
- Added fuzzy matching for spans and LLM responses for PDF LLM prompt labeling functions.
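The notes do not say how spans and LLM responses are fuzzily matched. As a rough illustration of the general idea only, the sketch below uses Python's standard-library difflib as a stand-in; the function name best_matching_span, the 0.8 threshold, and the sample spans are all hypothetical and are not part of Snorkel Flow.

```python
from difflib import SequenceMatcher


def best_matching_span(response: str, spans: list[str], threshold: float = 0.8):
    """Return (span, score) for the span most similar to the response, or None.

    Hypothetical helper: similarity is difflib's ratio on lowercased text,
    and the threshold is an arbitrary illustrative cutoff.
    """
    best = None
    for span in spans:
        score = SequenceMatcher(None, response.lower(), span.lower()).ratio()
        # Keep the highest-scoring span that clears the threshold.
        if score >= threshold and (best is None or score > best[1]):
            best = (span, score)
    return best


# Illustrative data: an LLM response with a typo still maps to the right span.
spans = ["Acme Corporation", "January 5, 2024", "Total due: $1,250.00"]
match = best_matching_span("ACME Corporaton", spans)
no_match = best_matching_span("unrelated text", spans)
```

An exact string comparison would miss the first response because of the dropped letter and the different casing, which is the kind of gap fuzzy matching is meant to close.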
Images
- Added RBAC endpoints for asset upload modal.
- Added GET endpoints for fetching static asset information.
- Added audit trail logs for file upload and datasource creation with static asset.
- Merged image extensions into one image file type for asset upload.
- Added RBAC endpoints for static assets.
- Added RBAC to gate local/remote file upload.
- Added image support for new onboarding.
- Added image resizing to new dataset creation.
- Added static asset column into dataset.
- Added local and remote asset upload endpoints.
SDK
- Added a create_dataset_view method.
- Added sf.get_user and deprecated sf.get_user_id.
- Added read and delete functions to SDK for dataset_view.
Improvements and bug fixes
Annotation
- Improved loading time of annotator-agreement matrix.
- Limited the aggregated annotations sources to origin batch.
- Disabled multi-label option for multi-schema annotation dataset.
- Fixed datasets loading state.
- Fixed bug that returned the wrong data when going to the first unlabeled document.
- Removed negative label from sequence tagging annotation pop-up.
- Added signal to label_schema origin.
- Fixed bug that prevented Filters menu from closing when clicking outside.
- Fixed bug that crashed the application when navigating between docs.
- Showed all annotations for all label-schemas.
- Fixed duplicated batches on datasets page.
- Fixed selection offset when the highlight is selected.
- Disabled image label-schema creation for multi-schema annotation.
- Fixed bug to pass the correct fields into initialize selected fields.
- Resolved issue where default label schema was used to label ground truth.
- Resolved wrong x_uid returned by comment filter.
- Fixed bug to show correct batch size.
- Fixed bug where the application crashed when selecting the comment icon.
- Updated comment filter placeholder and comment button sizes.
- Set the correct split when navigating from train batch to studio.
- Hid filter docs toggle in multi-schema annotation label-schema form.
- Updated to send correct annotation filter value.
- Hid settings icon if no props are passed.
- Updated to prevent users from labeling same span multiple times.
- Updated to show correct empty state when no annotators are present in a batch.
- Moved selected label-schema to user-settings.
- Updated focus on spans that belong to current filter.
App + Model Deployment
- Added some comments to init.py to be compatible with Microsoft Azure.
- Added missing sources to a deployment.
Core App UX
- Updated SelectedDocumentsProvider-related table view selection behavior.
- Increased independence of studio2 tests.
- Reduced the number of /download requests.
- Disabled deployment page link in breadcrumbs.
- Fixed regression in modular data viewer. The selected span was not maintained when switching between document and record views.
Data + App Management
- Deleted phantom batches on a deleted datasource.
- Fixed bug where selecting a subset of dataset was not effective when created through onboarding.
Data + Slices
- Updated Slices/Model metrics table.
- Corrected dynamically page / dataset request.
- Updated to maintain selected span when switching from record view to document view.
- Fixed span not getting selected. Reset scroll on page change.
- Fixed file URL duplication in FileCell in Table view.
- Enabled loading all the highlights for the new document.
- Updated to correctly register PDF type.
- Fixed broken preview labeling function.
- Fixed problem loading PDFs by URL in modular data viewer in a non-PDF application.
- Added missing inversion key.
- Made Disable export button more specific.
- Made ContextIndex relative.
- Fixed the document view page so that it is set correctly.
- Updated to show only spans relevant to the search.
- Updated to handle super text across task types.
- Updated to use real total count for data summary pane.
- Synced /context endpoint with modular data viewer pagination.
- Added option to show plot mode and embedded table when flags are on.
- Added Color Span by dropdown to modular data viewer.
- Added export dataset button for only annotation mode.
- Added filter highlights by highlight type.
- Added send to top in drag to select tool.
- Converted NaN values in the dataset to None so that the OpenAPI JSON encoder can serialize them.
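To illustrate why a NaN-to-None conversion helps JSON encoding, here is a minimal standard-library sketch. It is not the Snorkel Flow implementation; the helper name nan_to_none and the sample record are assumptions made for the example. Strict JSON has no NaN literal, so an encoder that enforces the standard rejects it, while None encodes as null.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, list):
        return [nan_to_none(item) for item in value]
    return value


# Illustrative record containing NaN at two nesting levels.
record = {"score": float("nan"), "tags": ["a", float("nan")], "count": 3}

# With allow_nan=False, json.dumps raises ValueError on NaN.
try:
    json.dumps(record, allow_nan=False)
    raw_is_encodable = True
except ValueError:
    raw_is_encodable = False

# After conversion, the same strict call succeeds and NaN becomes null.
clean = nan_to_none(record)
encoded = json.dumps(clean, allow_nan=False)
```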
Enterprise Infrastructure
- Upgraded PyTorch to 2.2.2.
- Updated AllenNLP to 2.10.2 to remove jsonpickle.
- Upgraded Jupyter Notebook to 7.
- Updated FastAPI and Starlette.
- Used local version of InfluxDB to resolve CVEs.
- Updated local version of MinIO on Alpine v2.
- Added library path for CUDA libraries to ensure Python libraries can find them.
- Used memory limit in notebook from helm values.
- Fixed helm template for influx cloud secret and compatibility with userconfig.
- Fixed deployment configuration when influx_cloud_key is not applied.
- Fixed lantern build.
- Patched CVE-2024-28179.
- Ensured DATA_SOURCE_ACCESS_CONTROL is only set by environment.
Front-end infrastructure
- Fixed session expiration handling.
- Upgraded Node to version 22.
- Hid the Resize images checkbox for PDF files in the new dataset modal.
- Fixed new dataset file association request data.
- Fixed server-side Redux store to not be shared between clients.
- Fixed pagination for the application list on the home page and used cached requests.
- Fixed blank page.
Foundation models
- Added model_type for SupportedLLM.
- Added BasePromptView across tasks.
- Set and migrated model_type for external model endpoints.
- Shared prompt data view.
- Added Custom Inference Service as an ExternalLLMProvider.
- Leveraged single-shot prompting for freeform LLMs in sequence tagging.
- Added PDF format transform at the data-fetching level.
- Added dataset-output API to node server.
- Added prompt state transform for PDF.
- Added PDF extraction LLM type.
- Made PDF scrollable in foundation model prompting.
- Added better error messages for previewing.
- Showed codemapper button in only Text2Text tool.
- Disabled prompt view for PDF classification application.
- Migrated OpenAI external model endpoint URLs.
- Updated requirement for endpoint URL for OpenAI Models.
- Updated to load llmqa models from previous runs into prompt view.
- Cleared prompt input when resetting prompt state.
- Fixed snackbar on get-prompt-output in refresh of previous prompt.
- Changed PDF application to show max sample size not greater than number of documents in dataset.
- Added loading on screen when prompts are getting loaded.
- Updated to show settings icon button in sequence tagging across tool types.
- Changed so prompt results aren’t cleared with the Cancel button.
- Added populator support for arrow cache.
- Used document total count as max limit on sample size for preview labeling function.
- Fixed pandas query filter operator documentation.
- Filtered out embeddings columns from prompt input on init.
Label + Training
- Fixed labeling function name race condition.
Machine Learning
PDFs
- Fixed loading and progress bar delays.
- Fixed handling of PDF classification URL cells.
- Fixed drag to select for PDF.
- Fixed Checkboxes.deserialize().
- Reverted HocrToRichDocParser to version 3.
- Added page_id for the output column of CheckboxFeaturizer operator.
- Added the use of a feature store for DocVQA prompting.
- Added the use of the page_ids from rich doc.
- Added in-memory caching of LLM responses to avoid accessing MinIO.
- Exposed filtered result batch creation in PDF IE doc view.
Images
- Saved static asset column in datasource metadata.
Other
- Upgraded Python to 3.10.
- Extended dataset numeric sorting to all span-based apps.
- Added _resources to custom operators code to avoid staleness.
- Allowed for context_uid for string.
for string. - Fixed failing imported labeling functions when an inactive labeling function couldn't be copied.
SDK
- Installed SHapley Additive exPlanations (SHAP) in the notebook.
Deprecations and breaking changes
App + Model Deployment
- Error out if LabelingFunctionsExportable runs on a different version of Python.
Machine Learning
PDFs
- Removed detectron2.
Other
- Compiled custom operator class.
- Compiled labeling function template class.
- Upgraded Python from 3.8 to 3.10. You might need to reinstall any custom libraries that you installed previously.
- Custom operator classes will not work if libraries, such as re, are not explicitly imported within the user-defined function (UDF) or if variables that are not defined within the UDF are referenced. As needed, recreate classes.
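The self-containment requirement in the last item can be pictured with plain Python. This is only a sketch of the pattern: the function name, the regular expression, and the sample text are hypothetical, and the Snorkel Flow decorators or base classes used to register a custom operator are deliberately left out.

```python
def extract_invoice_ids(text: str) -> list[str]:
    """Hypothetical UDF body that carries everything it needs."""
    # Import inside the function rather than relying on a module-level import.
    import re

    # Define the pattern inside the function rather than referencing
    # a variable that lives outside it.
    pattern = r"INV-\d+"
    return re.findall(pattern, text)


# Illustrative call with made-up text.
ids = extract_invoice_ids("Paid INV-12 and INV-345 last week.")
```

A version that used a module-level import of re or a module-level pattern variable would run in a notebook but would match the failure cases the note describes, because those names live outside the function.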