What is Snorkel Flow?
Snorkel Flow is a data development platform that enables organizations, teams, and individuals to raise the quality and development velocity of machine learning (ML) applications through more effective data development.
This article introduces Snorkel Flow’s key use cases and features.
Why is data development important?
Data development is the process of applying a variety of data science techniques—for example, labeling—to a raw dataset with the goal of developing the dataset into a curated dataset that performs well for a specific application.
Data is the raw material for building machine learning (ML) applications. Off-the-shelf ML models rely on vast, non-specialized troves of data for their training. While their response patterns are impressive, they often aren’t suitable for use cases with proprietary data, bespoke objectives, or high accuracy requirements. These types of use cases require curated datasets to achieve acceptable response quality. This data-first approach to ML application development is known as data-centric AI development.
Levels of data development
Data development can occur on many levels:
- Prompt engineering: Prompt engineering adds relatively small amounts of curated data to a model’s input as in-context learning.
- Retrieval-augmented generation (RAG): Like prompt engineering, RAG adds relatively small amounts of curated data to any single prompt, but draws on a potentially large database of curated data to include the right chunk(s) of data.
- Model fine-tuning: Model fine-tuning adjusts an existing ML model’s parameters with a curated dataset.
- Custom model training: Model training trains an entirely new ML model with a curated dataset.
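The first of these levels, prompt engineering, can be sketched in a few lines of plain Python. This is a minimal illustration, not a Snorkel Flow API: the curated examples and the prompt template are hypothetical, and the resulting string would be sent to an LLM of your choice.

```python
# Minimal sketch of prompt engineering as in-context learning:
# a small amount of curated, labeled data is injected into the
# model's input. Examples and template are made up for illustration.

CURATED_EXAMPLES = [
    ("Win a free cruise now!!!", "spam"),
    ("Agenda for Tuesday's meeting", "not spam"),
]

def build_prompt(query: str) -> str:
    """Prepend curated examples to the prompt sent to an LLM."""
    lines = ["Classify each email as spam or not spam.", ""]
    for text, label in CURATED_EXAMPLES:
        lines.append(f"Email: {text}\nLabel: {label}\n")
    lines.append(f"Email: {query}\nLabel:")
    return "\n".join(lines)

prompt = build_prompt("Claim your prize today")
```

RAG follows the same pattern, except the examples are retrieved from a large curated database per query rather than fixed in the template.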
How does Snorkel Flow improve data development?
Snorkel Flow provides a suite of tools that enable users to unlock the value of their data. You can use that data to train and fine-tune models and to create embeddings databases and prompts. These curated datasets and resulting models and databases yield real enterprise value.
Snorkel Flow offers:
- A centralized tool to build and store your curated datasets, instead of spreadsheets
- Faster, automated labeling workflows that save time for subject matter experts (SMEs) without a significant loss in label quality
- Versioned, cumulative data labeling so you don’t have to start from scratch when you add to your dataset or update your label schema
These features lead to faster iteration and faster creation of high-quality curated datasets.
Data development workflow
Data scientists and subject matter experts can collaborate using Snorkel Flow to craft datasets and models in days instead of weeks or months.
These are the steps in the typical data development loop:
- Build labeling functions (LFs).
- Approximate the combined accuracy of the labeling functions by training a quick model.
- Analyze the results of the quick model.
- Add and refine labeling functions to improve accuracy.
- Build more robust models using the probabilistic data labels to check the current model’s likely deployment-ready accuracy.
- Build a final, deployment-ready model in the MLflow format.
- Evaluate the model’s real-world performance.
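The core of this loop can be sketched in pure Python. This is a toy illustration, not Snorkel Flow's implementation: a simple majority vote stands in for the weak supervision step, and accuracy on a small ground-truth set stands in for model analysis. All data here is made up.

```python
# Toy version of the data development loop: labeling functions (LFs)
# vote on each data point; ABSTAIN means "no opinion". A majority
# vote stands in for weak supervision, and accuracy on ground truth
# tells us where to add or refine LFs in the next iteration.

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_keyword_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_keyword_meeting(text):
    return NOT_SPAM if "meeting" in text.lower() else ABSTAIN

LFS = [lf_keyword_prize, lf_keyword_meeting]

def majority_label(text):
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN  # no LF covers this point
    return max(set(votes), key=votes.count)

emails = ["Claim your prize now", "Meeting moved to 3pm", "Lunch?"]
ground_truth = [SPAM, NOT_SPAM, NOT_SPAM]

labels = [majority_label(e) for e in emails]
covered = [(l, g) for l, g in zip(labels, ground_truth) if l != ABSTAIN]
accuracy = sum(l == g for l, g in covered) / len(covered)
# Low accuracy or coverage (e.g., "Lunch?" is uncovered here)
# signals where to add or refine labeling functions.
```

In this sketch the third email is left unlabeled, which is exactly the kind of coverage gap that prompts another turn of the loop.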
Snorkel Flow offers specific tools to enable the core data development workflow.
Snorkel Flow tools
Snorkel Flow’s suite of tools enables the following data development tasks:
- Labeling functions for programmatic labeling
- Weak supervision
- Labeling function evaluation
- Data labeling and correction
- Rapid small-model training
- Final model training
Labeling functions
Labeling functions codify expert knowledge and intuition into scalable rules. Labeling functions take the same knowledge your SMEs would use to manually label data, such as looking for a certain phrase in a document, and apply it automatically.
Labeling functions (LFs) can take many forms, including:
- Simple string searches
- Legacy rule-based systems
- Prompts for cutting-edge large language models (LLMs)
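Two of these forms can be sketched in plain Python. This is an illustration of the concept, not Snorkel Flow's API (in the open-source snorkel library, such functions would be wrapped with the `@labeling_function` decorator); the label values and rules here are made up.

```python
# Two LF forms sketched as plain functions: a simple string search,
# and a legacy rule-based system reused as a labeling function.

ABSTAIN, REFUND = -1, 1

def lf_string_search(ticket: str) -> int:
    """Simple string search: flag tickets that mention a refund."""
    return REFUND if "refund" in ticket.lower() else ABSTAIN

def legacy_rule(ticket: str) -> bool:
    """Stand-in for an existing rule-based system."""
    return ticket.strip().endswith("?") and "money" in ticket.lower()

def lf_legacy_wrapper(ticket: str) -> int:
    """Legacy rule wrapped so it emits labels like any other LF."""
    return REFUND if legacy_rule(ticket) else ABSTAIN
```

An LLM-prompt LF would follow the same shape: send the data point to the model inside a prompt and map the response to a label or ABSTAIN.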
Labeling functions aren’t perfect:
- Some labeling functions cover only small fragments of a dataset, while others cover large portions.
- Labeling functions can disagree with each other.
- Labeling functions can produce wrong labels.
Snorkel Flow handles these imperfections with the next step of the data development workflow: weak supervision.
Weak supervision
Snorkel Flow’s weak supervision algorithm determines which labeling function to trust for any particular data point.
Weak supervision is a statistical approach that sorts through potentially noisy or conflicting sources of labels. By evaluating the likelihood of each labeling function being correct in each case, the algorithm applies a probabilistic label to each data point.
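A heavily simplified sketch of this idea: weight each LF's vote by an estimated accuracy, then normalize the weighted votes into a probabilistic label. Snorkel Flow's actual algorithm estimates these accuracies from LF agreements and disagreements; the weights below are assumed for illustration.

```python
# Simplified weak-supervision idea for a binary task: combine LF
# votes, weighted by estimated accuracy, into a probabilistic label.

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def probabilistic_label(votes, accuracies):
    """Return (P(NOT_SPAM), P(SPAM)) for one data point."""
    weight = {NOT_SPAM: 0.0, SPAM: 0.0}
    for vote, acc in zip(votes, accuracies):
        if vote != ABSTAIN:
            weight[vote] += acc          # trust accurate LFs more
            weight[1 - vote] += 1 - acc  # remaining mass to other class
    total = sum(weight.values())
    if total == 0:
        return (0.5, 0.5)  # no LF voted: uninformative label
    return (weight[NOT_SPAM] / total, weight[SPAM] / total)

# Two LFs say SPAM; one less-accurate LF says NOT_SPAM.
probs = probabilistic_label([SPAM, SPAM, NOT_SPAM], [0.9, 0.8, 0.6])
```

The result leans toward SPAM but retains uncertainty from the dissenting LF, which is exactly what downstream model training consumes.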
Labeling function evaluation
Snorkel Flow provides a friendly user interface for examining the labels produced by the current model. With Snorkel Flow, you can quickly identify the individual data points that have incorrect or low-confidence labels by:
- Comparing the labels they produce to existing ground truth labels
- Assessing the Snorkel Flow weak supervision algorithm’s confidence for the label for each data point
This process helps your experts focus additional labeling efforts to redraw classification boundaries or shore up model confidence.
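The two checks above can be sketched as a simple filter. This is an illustration, not Snorkel Flow's interface: the data points, probabilities, and confidence threshold are all made up.

```python
# Flag data points for expert review if they either disagree with
# existing ground truth or carry a low-confidence probabilistic label.

data = [
    # (id, P(spam), ground truth: 1 = spam, 0 = not spam, None = unlabeled)
    ("email-1", 0.95, 1),
    ("email-2", 0.20, 1),     # confident but disagrees with ground truth
    ("email-3", 0.55, None),  # low confidence, no ground truth
]

def needs_review(prob_spam, truth, threshold=0.7):
    predicted = 1 if prob_spam >= 0.5 else 0
    if truth is not None and predicted != truth:
        return True                      # label contradicts ground truth
    confidence = max(prob_spam, 1 - prob_spam)
    return confidence < threshold        # weak supervision is uncertain

flagged = [d[0] for d in data if needs_review(d[1], d[2])]
```

Points like these are where additional labeling functions or hand labels buy the most accuracy per unit of SME time.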
Example:
A user creating a spam-identification application can write a labeling function to tag all emails that contain the words “wire transfer” as spam. When evaluating the results of this labeling function, the user discovers that the model labels legitimate messages from banks as spam. The user can create a second labeling function that cancels out the first when the “wire transfer” message originates from a bank email address.
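This example pair of labeling functions might look like the following plain-Python sketch. The bank domain list and email data are hypothetical, and in practice weak supervision (not hard-coded precedence) resolves the disagreement between the two LFs.

```python
# The spam example as two labeling functions: the second one
# counteracts the first when the sender is a known bank.

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1
BANK_DOMAINS = {"mybank.com", "examplebank.com"}  # hypothetical list

def lf_wire_transfer(sender: str, body: str) -> int:
    """Original LF: any mention of 'wire transfer' looks like spam."""
    return SPAM if "wire transfer" in body.lower() else ABSTAIN

def lf_from_bank(sender: str, body: str) -> int:
    """Refinement LF: trust wire-transfer mail from known banks."""
    domain = sender.split("@")[-1].lower()
    if "wire transfer" in body.lower() and domain in BANK_DOMAINS:
        return NOT_SPAM
    return ABSTAIN
```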
Through repeated iteration, users add and adjust labeling functions to capture more nuance until the model reaches the required level of accuracy.
Data labeling and correction
In addition to labeling functions, Snorkel Flow allows users to view and label individual data points.
Subject matter experts can use this feature to:
- Label ground truth for data points that don’t have it yet.
- Explain their reasoning with comments and slices. This helps SMEs communicate their reasoning to data scientists, who can use comments to inform labeling functions.
- Correct ground truth labels. Sometimes a labeling function disagrees with a label because the label is wrong. Once users identify one of these points, Snorkel Flow makes it easy to relabel it.
Each of these options helps nudge the probabilistic data set closer to usability.
Rapid small-model training
Snorkel Flow has fast model training options that allow for rapid iteration and evaluation. Once you have a fast model trained, you can use it to quickly evaluate the performance of your labeling functions. Read more about how to use fast models to create good labeling functions.
Final model training
Snorkel Flow also offers production-level model training for when your model is producing good results and ready to export. Explore the model training options in Configure and train models.
Get started with Snorkel Flow
Follow along with Getting Started to get hands-on with Snorkel Flow.