
Active learning and weak supervision

In Snorkel Flow, programmatic labeling combines active learning and weak supervision.

Active learning

Active learning is a machine learning framework in which a model is initially trained on a small set of high-quality labeled data. The model then iteratively selects additional data points for an expert to label, based on criteria such as uncertainty or informativeness relative to the model's current understanding.

This is the typical active learning process (a brief code sketch follows the list):

  1. Initial model training: Start with a small, carefully labeled dataset.
  2. Uncertainty evaluation: Assess which data points the model is least certain about.
  3. Expert querying: Request an expert to label these uncertain data points.
  4. Model re-training: Update the model with the newly acquired labels.
  5. Iterative improvement: Repeat the cycle to enhance model performance continuously.
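
The following minimal sketch illustrates uncertainty sampling, the most common query strategy in this loop. It is not Snorkel Flow's API: it uses scikit-learn on synthetic data, and the "expert" is simulated with held-out true labels so the loop runs end to end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_seed, y_seed = X[:20], y[:20]      # 1. small, carefully labeled seed set
X_pool, y_pool = X[20:], y[20:]      # unlabeled pool (labels hidden from the model)

model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)

for _ in range(5):
    # 2. Uncertainty evaluation: smallest class-probability margin = least certain.
    probs = model.predict_proba(X_pool)
    margins = np.abs(probs[:, 1] - probs[:, 0])
    query = np.argsort(margins)[:10]

    # 3. Expert querying: here the "expert" is the held-out ground truth.
    new_labels = y_pool[query]

    # 4. Model re-training with the newly acquired labels.
    X_seed = np.vstack([X_seed, X_pool[query]])
    y_seed = np.concatenate([y_seed, new_labels])
    model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)

    # 5. Iterative improvement: drop queried points from the pool and repeat.
    X_pool, y_pool = np.delete(X_pool, query, axis=0), np.delete(y_pool, query)
```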

Weak supervision

Weak supervision is an approach to labeling large datasets using less accurate or noisier sources. It typically combines the outputs of multiple weak labelers to improve overall label quality.

The weak supervision process may draw on these sources (see the code sketch after the list):

  • Heuristic rules: Automated rules based on domain knowledge.
  • Crowdsourced data: Labels gathered from non-experts or a large group of annotators.
  • Previously trained models: Outputs from other models that can serve as provisional labels.
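
Snorkel Flow's weak supervision algorithms are proprietary, but the core idea can be sketched with the open-source snorkel library: each weak labeler becomes a labeling function, and a label model combines their noisy votes into a single training label. The data and function names below are purely illustrative.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_keyword_prize(x):
    # Heuristic rule based on domain knowledge.
    return SPAM if "prize" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_many_exclamations(x):
    # Another noisy heuristic, e.g. distilled from crowdsourced annotations.
    return SPAM if x.text.count("!") >= 2 else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Very short messages tend to be ham in this toy example.
    return HAM if len(x.text.split()) <= 4 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Claim your prize now!!!", "See you at lunch", "You won a cash prize today",
    "ok thanks", "Free prize!! Reply to win", "Meeting moved to 3pm tomorrow",
]})

# Apply every weak labeler to build the label matrix L (one column per labeler).
applier = PandasLFApplier(lfs=[lf_keyword_prize, lf_many_exclamations, lf_short_message])
L_train = applier.apply(df=df_train)

# The label model denoises and combines the votes into a single training label.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
df_train["weak_label"] = label_model.predict(L=L_train)
```

Outputs from previously trained models can be wrapped as labeling functions in the same way.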

Snorkel Flow hybrid framework

By combining active learning and weak supervision, Snorkel Flow creates a powerful hybrid framework for data labeling and machine learning:

  1. Initial label generation: Snorkel Flow starts with a small amount of annotated data, which defines the ground truth (GT). Snorkel Flow then applies its proprietary weak supervision algorithms to quickly generate a broad set of initial labels across a large dataset, drawing on heuristic rules, crowdsourced data, and outputs from other models.
  2. Refined learning process: Once the dataset is preliminarily labeled, Snorkel Flow enables users to train an initial model and evaluate its outputs against the initial GT dataset. With the built-in Error Guided Analysis tools, apply active learning techniques to identify the data samples where the model's performance can be improved most. In the Annotation Suite, assign these subsets of data to Subject Matter Experts (SMEs) for review. This process ensures that resources are focused where they are most needed: SMEs label active learning batches while data scientists focus on programmatic labeling. Both teams work in unison to improve data quality and end-model performance.
  3. Programmatic labeling integration: After receiving SME feedback, modify existing labeling functions or create new ones. This iterative development produces a new, higher-quality dataset, which is then used to train and analyze a new model. In this way, Snorkel Flow uniquely integrates programmatic labeling with active learning, combining the scalability of weak supervision with the precision of active learning (see the sketch after this list).
  4. Continuous enhancement: The iterative nature of this integrated approach means that each cycle of active learning and re-training with new expert labels makes the model progressively smarter and more reliable. This cycle doesn't stop after the model is deployed: you can use Snorkel Flow to update a model whenever needed, including post-deployment.
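
Putting the pieces together, here is a toy end-to-end sketch of the loop described above. Snorkel Flow's Error Guided Analysis and Annotation Suite are platform features, so they are simulated: an uncertainty query stands in for error analysis, held-out true labels stand in for SME annotations, and a simple majority vote stands in for the proprietary label model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=2000, n_features=10, random_state=0)
gt_idx = np.arange(50)                          # small annotated GT set (step 1)

def apply_labeling_functions(X):
    """Two noisy threshold heuristics standing in for labeling functions; -1 = abstain."""
    lf1 = np.where(X[:, 0] > 0.5, 1, -1)
    lf2 = np.where(X[:, 3] < -0.5, 0, -1)
    return np.stack([lf1, lf2], axis=1)

for round_num in range(3):
    # 1. Weak supervision generates broad, noisy labels (majority vote here);
    #    GT labels override the weak labels wherever they exist.
    L = apply_labeling_functions(X)
    weak_labels = np.where((L == 1).sum(axis=1) >= (L == 0).sum(axis=1), 1, 0)
    weak_labels[gt_idx] = y_true[gt_idx]

    # 2. Train an initial model, then use uncertainty as a stand-in for
    #    error-guided analysis to pick an active learning batch for SMEs.
    model = LogisticRegression(max_iter=1000).fit(X, weak_labels)
    margin = np.abs(model.predict_proba(X)[:, 1] - 0.5)
    batch = np.argsort(margin)[:25]

    # 3-4. SMEs label the batch (simulated with y_true); their corrections would
    #      also inform new or revised labeling functions before the next round.
    gt_idx = np.union1d(gt_idx, batch)
    print(f"round {round_num}: GT size = {len(gt_idx)}, "
          f"accuracy vs. true labels = {model.score(X, y_true):.3f}")
```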