
Cold start scenarios

Starting a Snorkel Flow application with no ground truth available can feel like a daunting task. This "cold start" challenge can serve as a blocker for users ready to take advantage of Snorkel Flow’s powerful capabilities.

However, there are ways to overcome this challenge and jumpstart the annotation and modeling process. In this blog post, we will explore some strategies to tackle the cold start problem, including manual labeling and leveraging existing models as labeling functions.

Not-quite-cold starts in Snorkel Flow

Sometimes, a client's cold start isn't as cold as it looks. While the data set itself may be raw and unlabeled, existing structures, inside or outside the organization, can help kick-start the process.

Users can leverage existing external knowledge sources, such as ontologies, taxonomies, or knowledge graphs, to derive label categories or relationships between entities. This can help to bootstrap the labeling process and reduce the amount of manual effort required.
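
To make that concrete, here is a minimal sketch of deriving candidate labels and keyword hints from an external taxonomy. The product_taxonomy mapping and its terms are hypothetical stand-ins, not part of any Snorkel Flow API.

```python
# Hypothetical taxonomy: each top-level branch becomes a candidate label,
# and its terms become keyword hints for early labeling functions.
product_taxonomy = {
    "electronics": ["laptop", "headphones", "charger"],
    "apparel": ["jacket", "sneakers", "t-shirt"],
    "home": ["sofa", "lamp", "cookware"],
}

labels = list(product_taxonomy.keys())  # candidate label schema

def candidate_label(text):
    """Return the first taxonomy branch whose terms appear in the text."""
    lowered = text.lower()
    for label, terms in product_taxonomy.items():
        if any(term in lowered for term in terms):
            return label
    return None  # no match: leave the example unlabeled for now

print(candidate_label("Looking for noise-cancelling headphones"))  # electronics
```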

Existing internal resources can help, too. If the organization already has a mechanism for applying labels to this data set, whether a rules-based system or a model that falls short of the desired accuracy, it can use that mechanism as a starting point. Snorkel Flow accepts existing models and systems as labeling functions.
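
For illustration, here is a minimal sketch of wrapping such a system as a labeling function, using the open-source snorkel library; Snorkel Flow's in-platform SDK exposes the same concept, though its exact API may differ. The LegacyModel class is a hypothetical stand-in for whatever rules engine or model already exists.

```python
import numpy as np
from snorkel.labeling import labeling_function

ABSTAIN = -1
NOT_SPAM, SPAM = 0, 1

class LegacyModel:
    """Hypothetical stand-in for an existing rules-based system or
    underperforming model that emits class probabilities."""
    def predict_proba(self, texts):
        return np.array(
            [[0.2, 0.8] if "free money" in t.lower() else [0.7, 0.3]
             for t in texts]
        )

legacy_model = LegacyModel()

@labeling_function()
def lf_legacy_model(x):
    """Vote with the legacy model, but abstain when it is unsure."""
    proba = legacy_model.predict_proba([x.text])[0]
    if proba.max() < 0.6:
        return ABSTAIN  # low confidence: better to abstain than to guess
    return int(proba.argmax())
```

Because the legacy system only votes rather than dictates, the label model can learn how much to trust it relative to other labeling functions.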

Snorkel Flow cold start blocker: Schema or ground truth?

True cold start problems fall into two categories: lack of ground truth and lack of a label schema. We'll address each in turn.

Addressing the challenge of no ground truth

Starting a Snorkel Flow application without any labeled data can be challenging. While having a label schema defined helps, you still need labeled examples to train a model.

The simplest solution to this problem is to label some ground truth data using Snorkel Flow’s annotation suite. Subject matter experts within your organization can set aside a few hours to pull up examples, apply labels and use the comment feature to explain why they chose a particular label; data scientists can later use those comments to formulate labeling functions.
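
As a hypothetical example of that last step, suppose an expert's comment reads "Labeled REFUND because the customer mentions a chargeback." A data scientist might translate it into a labeling function like this sketch, which again uses the open-source snorkel library for illustration:

```python
from snorkel.labeling import labeling_function

ABSTAIN, REFUND = -1, 0

# SME comment: "Labeled REFUND because the customer mentions a chargeback."
@labeling_function()
def lf_mentions_chargeback(x):
    return REFUND if "chargeback" in x.text.lower() else ABSTAIN
```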

Your team can also tackle this problem in the reverse direction. Instead of asking subject matter experts to label randomly extracted data in the annotation suite, data scientists can define an initial set of labeling functions, one per label in the schema, and then ask subject matter experts to verify or invalidate the applied labels on a random sample. This approach helps ensure that you label a sufficient number of examples per class.
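
Here is a minimal sketch of that reverse workflow, once more using the open-source snorkel library for illustration: one seed labeling function per class is applied to a random sample, and each function's votes are laid out for experts to verify or invalidate. The class names, keywords and toy data are all assumptions.

```python
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function

ABSTAIN, URGENT, ROUTINE = -1, 0, 1

@labeling_function()
def lf_urgent(x):
    return URGENT if "outage" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_routine(x):
    return ROUTINE if "invoice" in x.text.lower() else ABSTAIN

# Toy stand-in for the unlabeled corpus.
df = pd.DataFrame({"text": [
    "Production outage since 2am, please help",
    "Question about my last invoice",
    "How do I reset my password?",
]})
sample = df.sample(n=3, random_state=0)   # random sample for expert review

applier = PandasLFApplier(lfs=[lf_urgent, lf_routine])
L = applier.apply(df=sample)              # votes: (n_examples, n_lfs)

# Attach each function's vote so experts can verify or invalidate it.
sample["lf_urgent_vote"] = L[:, 0]
sample["lf_routine_vote"] = L[:, 1]
print(sample)
```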

Addressing the challenge of no label schema

In some cases, users may not have a well-defined label schema when starting a Snorkel Flow application. This can be due to the complexity of the task or a lack of prior domain knowledge. For example, an online retailer may want to present optimized versions of its home page based on the kind of visitor viewing it, but not know how to categorize those visitors.

An organization can handle this in an analog fashion by gathering stakeholders and using their collective knowledge. As a first step, make sure everyone understands the business problem, and then identify the outcomes or attributes of interest. From there, define your classes: likely buyers and likely browsers, for example. These categories should be collectively exhaustive, meaning that every example can be assigned to at least one category. In our online retailer example, each visitor should belong to exactly one category, but some problems call for a multi-label approach in which an example can belong to more than one.

The above ignores a potential blocker: despite their expertise, your stakeholders may not have a deep enough understanding of the data to know what buckets are appropriate. That’s where machine learning may help. Unsupervised learning techniques, such as clustering or topic modeling, can help group examples into meaningful categories. These categories can then be used as your initial labels.
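
As a brief sketch of that idea, the snippet below clusters a handful of toy documents with TF-IDF features and k-means, then surfaces the top terms in each cluster as raw material for naming candidate categories. The documents and cluster count are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "added three lamps to my cart and checked out",
    "just browsing the spring jacket lineup",
    "completed checkout for a new laptop",
    "scrolling through sneaker reviews",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inspect the top terms per cluster to suggest human-readable labels.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top = center.argsort()[::-1][:3]
    print(f"cluster {i}:", [terms[j] for j in top])
```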

Once your team agrees on a set of labels—whether through raw human intellectual power or with the assistance of unsupervised learning—they can begin the process of building ground truth, as described in the previous section.

Refine and iterate

Once your team has decided on a set of labels and begun building out a probabilistic data set with labeling functions and weak supervision in Snorkel Flow, your schema may still change. That's okay. Snorkel Flow is built to handle it.

Conclusion

Approaching a machine learning problem with a cold start can be daunting, but these straightforward approaches can help clear the frost. Whether you use human intuition, unsupervised learning or existing resources as a guide, you can build out an initial label schema and ground-truth data set.

From there, Snorkel Flow can help you expand, refine and finalize your labeling approach as part of its iterative workflow. Soon after that, you can deploy your new machine learning application.