Cold start scenarios
Starting a Snorkel Flow application without pre-labeled data is known as a "cold start." Snorkel Flow’s features make cold start data development more approachable.
In this article, you will learn practical strategies for developing your data from a cold start, including:
- Manual labeling
- Leveraging external resources like knowledge graphs
- Using existing models as labeling functions
Read through the following cold start approaches to see which ones fit your data.
Warming up your cold start in Snorkel Flow
Sometimes, a cold start isn’t as cold as it looks. While the data set itself may be raw and unlabeled, existing structures can help kick-start the process.
Users can leverage existing external knowledge sources, such as ontologies, taxonomies, or knowledge graphs, to derive label categories or relationships between entities. This can help to bootstrap the labeling process and reduce the amount of manual effort required.
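As a concrete illustration, an external taxonomy can be converted directly into labeling functions. The sketch below uses the open-source Snorkel library's `labeling_function` decorator; the taxonomy terms and label values are illustrative assumptions, and Snorkel Flow's own SDK exposes its own equivalents.

```python
from snorkel.labeling import labeling_function

ABSTAIN, ELECTRONICS, APPAREL = -1, 0, 1  # hypothetical label schema

# Terms derived from an external taxonomy or knowledge graph (illustrative).
TAXONOMY = {
    "laptop": ELECTRONICS,
    "headphones": ELECTRONICS,
    "jacket": APPAREL,
    "sneakers": APPAREL,
}

@labeling_function()
def lf_taxonomy(x):
    """Vote with the first taxonomy term found in the text; otherwise abstain."""
    text = x.text.lower()
    for term, label in TAXONOMY.items():
        if term in text:
            return label
    return ABSTAIN
```

Because the function abstains when no term matches, gaps in the external resource simply produce no vote rather than a wrong label.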
Existing internal resources can help, too. If the organization already has a mechanism for applying labels to this data set, whether a rules-based system or a model that falls short of the desired accuracy, it can use that as a starting point. Snorkel Flow accepts existing models and systems as labeling functions.
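For example, a legacy classifier can be wrapped as a labeling function that votes only when it is confident. This is a minimal sketch, assuming a scikit-learn-style model with `predict_proba` saved at a hypothetical path; the confidence threshold is an assumption to tune.

```python
import joblib
from snorkel.labeling import labeling_function

ABSTAIN = -1
CONFIDENCE = 0.7  # assumption: tune based on how much you trust the model

# Stand-in for your organization's existing classifier; we assume a
# scikit-learn-style model with predict_proba, saved at a hypothetical path.
legacy_model = joblib.load("models/legacy_classifier.joblib")

@labeling_function()
def lf_legacy_model(x):
    """Let the legacy model vote only when its top prediction is confident."""
    probs = legacy_model.predict_proba([x.text])[0]
    label = int(probs.argmax())
    return label if probs[label] >= CONFIDENCE else ABSTAIN
```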
Snorkel Flow cold start blocker: Schema or ground truth?
Cold start problems typically lack one of two types of metadata:
- The data lacks ground truth labels
- The data lacks a label schema
Let's address each.
How to tackle a cold start when ground truth is missing
Starting a Snorkel Flow application without any labeled data can be challenging. A defined label schema helps, but you still need labeled data to train a model.
The simplest solution is to label some ground truth data using Snorkel Flow's annotation suite. Subject matter experts within your organization can set aside a few hours to pull up examples, apply labels, and use the comment feature to explain why they chose a particular label; data scientists can later use those comments to formulate labeling functions.
Your team can also tackle this problem in the reverse direction. Instead of asking subject matter experts to label randomly extracted data in the annotation suite, data scientists can define an initial set of labeling functions, one per label in the schema, and then ask subject matter experts to verify or invalidate the applied labels on a random sample. This approach helps ensure that you label a sufficient number of examples per class; a sketch of the workflow follows.
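The snippet below sketches that reverse workflow with the open-source Snorkel library: one labeling function per class, applied to a random sample whose proposed labels experts then confirm or reject. The buyer/browser schema and the `page_path` and `session_minutes` features are hypothetical.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, BUYER, BROWSER = -1, 0, 1  # hypothetical two-class schema

@labeling_function()
def lf_buyer(x):
    # Visitors who reach checkout look like likely buyers.
    return BUYER if "checkout" in x.page_path else ABSTAIN

@labeling_function()
def lf_browser(x):
    # Very short sessions look like casual browsing.
    return BROWSER if x.session_minutes < 2 else ABSTAIN

# A toy sample; in practice, draw this randomly from your unlabeled data.
df_sample = pd.DataFrame({
    "page_path": ["/checkout/cart", "/home"],
    "session_minutes": [12, 1],
})

applier = PandasLFApplier(lfs=[lf_buyer, lf_browser])
L_sample = applier.apply(df=df_sample)  # one column of votes per LF
# Show each example with its proposed label(s) to experts for confirmation.
```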
Addressing the challenge of no label schema
In some cases, users may not have a well-defined label schema when starting a Snorkel Flow application. This can be due to the complexity of the task or a lack of prior domain knowledge. For example, an online retailer may want to present optimized versions of its home page based on the kind of visitor viewing it, without knowing how to categorize those visitors.
An organization can kick off the process by gathering stakeholders with domain knowledge to brainstorm and outline relevant attributes. From there, define your classes: likely buyers and likely browsers, for example. These categories should be collectively exhaustive, meaning that every example can be assigned to at least one category. In our online retailer example, each visitor should belong to exactly one category, but some problems require a multi-label approach, in which an example can carry several labels at once.
The above ignores a potential blocker: despite their expertise, your stakeholders may not understand the data deeply enough to know which buckets are appropriate. That's where machine learning can help. Unsupervised learning techniques, such as clustering or topic modeling, can group examples into meaningful categories, which can then serve as your initial labels.
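As an illustration, the sketch below clusters TF-IDF vectors of example texts with scikit-learn's KMeans. The documents are placeholders, and the cluster count is a guess your team would refine by inspecting the groupings.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder visitor descriptions; substitute your own text features.
docs = [
    "added three items to cart and checked shipping options",
    "scrolled the home page for thirty seconds and left",
    "compared prices on two product pages, then opened checkout",
    "bounced immediately after the landing page loaded",
]

# Vectorize and cluster; n_clusters is a guess to refine by inspection.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, doc in zip(clusters, docs):
    print(label, doc)
# Read through each cluster's members to propose human-readable label names.
```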
Once your team agrees on a set of labels—whether through raw human intellectual power or with the assistance of unsupervised learning—they can begin the process of building ground truth, as described in the previous section.
Refine and iterate
Once your team has decided on a set of labels and begun building out your probabilistic data set with labeling functions and weak supervision in Snorkel Flow, your schema may change. That's okay. Snorkel Flow is built to handle that.
Conclusion
Snorkel Flow provides tools to capture and build the initial metadata for your data, whether that metadata comes from existing resources, human expertise, or unsupervised learning.
From a cold start, Snorkel Flow helps you expand and refine your labeling process with confidence and quick iteration. As you continue the data development process, you'll reach the point where you can deploy your new machine learning application.