Version: 0.94

What is Snorkel Flow?

This document provides a high-level overview of what Snorkel Flow is and how to use it.

Snorkel Flow is a data development platform

Snorkel Flow provides a suite of tools that enables users to rapidly develop data and train models that yield real enterprise value.

These tools fall into the following categories:

  1. Programmatic labeling
  2. Weak supervision
  3. Rapid small-model training
  4. In-platform final model training

Together, these tools create an iterative development loop in which data scientists and subject matter experts collaborate to build, in days, models that could otherwise take weeks or months to complete.

Snorkel Flow’s core iteration loop

Data scientists and subject matter experts using Snorkel Flow generally adhere to the following development loop:

  1. Build labeling functions.
  2. Approximate the combined accuracy of the labeling functions by training a quick model.
  3. Analyze the results of the quick model.
  4. Add and refine labeling functions to improve accuracy.

At key points in this process, data scientists can train more robust models on the probabilistic data labels to estimate the accuracy a deployed model is likely to achieve. Ultimately, they can build a final, deployment-ready model in MLflow format.

What are labeling functions?

Labeling functions codify expert knowledge and intuition into scalable rules. These can take many forms, from simple string searches to the use of legacy rule-based systems to prompts prepared and sent to cutting-edge large language models (LLMs).

Some labeling functions apply to small fragments of the dataset. Others cover large portions. All of them will sometimes disagree with one another and give wrong answers. That’s okay and expected. Our weak supervision algorithm sorts out which labeling function to trust in which circumstances.
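
As a rough illustration, a labeling function can be thought of as a small function that either votes for a label or abstains. The sketch below uses plain Python rather than the Snorkel Flow SDK. The label constants, the function names, and the `coverage` helper are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical label constants for this sketch (not Snorkel Flow identifiers).
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_free_money(text: str) -> int:
    """Simple string search: vote SPAM if the phrase appears, else abstain."""
    return SPAM if "free money" in text.lower() else ABSTAIN

def lf_has_many_links(text: str) -> int:
    """Regex rule: vote SPAM when a message contains three or more URLs."""
    return SPAM if len(re.findall(r"https?://", text)) >= 3 else ABSTAIN

def lf_short_reply(text: str) -> int:
    """Heuristic: short messages that start with 'Re:' are probably legitimate."""
    return HAM if text.startswith("Re:") and len(text) < 80 else ABSTAIN

def coverage(lf, texts) -> float:
    """Fraction of data points on which a labeling function does not abstain."""
    votes = [lf(t) for t in texts]
    return sum(v != ABSTAIN for v in votes) / len(votes)

emails = [
    "Claim your FREE MONEY now http://a.example http://b.example http://c.example",
    "Re: lunch tomorrow?",
    "Quarterly report attached.",
    "Free money inside!",
]
lfs = [lf_contains_free_money, lf_has_many_links, lf_short_reply]

# One row per email, one column per labeling function.
label_matrix = [[lf(t) for lf in lfs] for t in emails]
```

Each row of `label_matrix` holds the votes for one email. Rows that contain only `ABSTAIN` show where no rule applies yet, and low `coverage` values show which rules touch only a fragment of the data.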

What is weak supervision?

Weak supervision is a statistical approach that sorts through potentially noisy or conflicting sources of labels. By evaluating the likelihood of each source being correct in each case, the algorithm applies a probabilistic label to each data point.
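
To make the idea of a probabilistic label concrete, here is a deliberately simplified sketch. It is not Snorkel Flow's algorithm: it assumes a binary task, assumes the sources are independent, and assumes each source's accuracy is already known, whereas a real weak supervision model has to estimate those accuracies. The function name `probabilistic_label` and the accuracy values are hypothetical.

```python
import math

# Hypothetical label constants for this sketch.
ABSTAIN, HAM, SPAM = -1, 0, 1

def probabilistic_label(votes, accuracies, prior_spam=0.5):
    """Return P(SPAM | votes), assuming independent sources with known accuracies."""
    log_odds = math.log(prior_spam / (1 - prior_spam))
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstaining source contributes no evidence
        weight = math.log(acc / (1 - acc))  # more accurate sources weigh more
        log_odds += weight if vote == SPAM else -weight
    return 1 / (1 + math.exp(-log_odds))

# A 90%-accurate source says SPAM, a 60%-accurate source says HAM,
# and a third source abstains.
p_spam = probabilistic_label([SPAM, HAM, ABSTAIN], [0.9, 0.6, 0.7])
```

In this example the more accurate source outweighs the weaker one, so the point receives a label that leans toward spam without being certain.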

How does analysis support the iteration loop?

The Snorkel Flow platform enables users to examine the effectiveness of individual labeling functions by evaluating them against existing ground truth labels as well as our weak supervision algorithm’s “confidence” on given data points.

By clicking the regions of the interface that correspond to incorrectly labeled or low-confidence data points, data scientists and subject matter experts can identify which regions of the data the model struggles with. Then, they can devise new labeling functions to redraw classification boundaries or shore up model confidence.
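
The platform performs this analysis through its interface, but the underlying checks can be sketched in a few lines of plain Python. The helpers below are hypothetical and only illustrate the two ideas described above: scoring a labeling function against whatever ground truth exists, and flagging data points whose probabilistic label is close to a coin flip.

```python
ABSTAIN = -1  # hypothetical constant for "no vote"

def lf_accuracy(votes, ground_truth):
    """Accuracy of one labeling function where it votes and ground truth is known."""
    pairs = [
        (v, y) for v, y in zip(votes, ground_truth)
        if v != ABSTAIN and y is not None
    ]
    if not pairs:
        return None  # nothing to evaluate against
    return sum(v == y for v, y in pairs) / len(pairs)

def low_confidence_indices(probabilities, margin=0.2):
    """Indices of points whose probabilistic label lies within `margin` of 0.5."""
    return [i for i, p in enumerate(probabilities) if abs(p - 0.5) < margin]

votes = [1, ABSTAIN, 0, 1]
ground_truth = [1, 0, None, 0]   # None marks a point with no ground truth yet
accuracy = lf_accuracy(votes, ground_truth)

uncertain = low_confidence_indices([0.95, 0.55, 0.10, 0.40])
```

Points returned by `low_confidence_indices`, together with points where a labeling function contradicts ground truth, are natural candidates for a new or refined labeling function.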

For example, a user creating a spam-identification application may write a labeling function that tags as spam all emails containing the words “wire transfer.” In the next loop, the user discovers that the model labels legitimate messages from banks as spam. The user may then create a second labeling function that cancels out the first when the “wire transfer” message originates from a bank email address.
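
The two labeling functions in this example might look roughly like the sketch below. It is written in plain Python, not with the Snorkel Flow SDK, and the email structure, the `BANK_DOMAINS` set, and the function names are assumptions made for illustration.

```python
# Hypothetical label constants for this sketch.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical list of trusted bank domains, for illustration only.
BANK_DOMAINS = {"examplebank.com", "bank.example"}

def lf_wire_transfer(email: dict) -> int:
    """First pass: any mention of a wire transfer is voted SPAM."""
    return SPAM if "wire transfer" in email["body"].lower() else ABSTAIN

def lf_wire_transfer_from_bank(email: dict) -> int:
    """Refinement: a wire transfer message from a bank address is voted HAM."""
    domain = email["sender"].rsplit("@", 1)[-1].lower()
    if "wire transfer" in email["body"].lower() and domain in BANK_DOMAINS:
        return HAM
    return ABSTAIN

scam = {"sender": "prince@unknown.example", "body": "Send a wire transfer today"}
bank = {"sender": "alerts@examplebank.com", "body": "Your wire transfer has cleared"}

scam_votes = [lf_wire_transfer(scam), lf_wire_transfer_from_bank(scam)]
bank_votes = [lf_wire_transfer(bank), lf_wire_transfer_from_bank(bank)]
```

On the bank message the two functions now disagree. The second function does not delete the first one's vote; the conflict is left for weak supervision to resolve when it assigns the probabilistic label.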

Each correction will miss some nuances, but that’s okay. Through repeated loops, users continue to add and adjust labeling functions to capture more nuance until the model reaches the required level of accuracy.

Data-labeling and correction

In addition to labeling functions, Snorkel Flow allows users to view and label individual data points.

Snorkel Flow users can use this portion of the workflow in the following ways:

  1. Subject matter experts (SMEs) can use the interface to label ground truth in cases where this hasn’t been done already.
  2. SMEs can explain their reasoning with comments and tags. This helps SMEs communicate their reasoning to data scientists, who can use the comments to inform labeling functions.
  3. Users can correct ground truth labels. Sometimes a labeling function disagrees with a label because the label is wrong. Once users identify one of these points, Snorkel Flow makes it easy to relabel it.

Each of these options helps nudge the probabilistic data set closer to usability.

Get started with Snorkel Flow

Now that you understand Snorkel Flow’s suite of data-development tools, get started with a project. The documentation within this resource should give you everything you need to complete your first project.