Key concepts
This page lists terminology used throughout the user guide and the Snorkel Flow interface.
Data
A dataset refers to the collection of data records you want to work with. It contains data sources assigned to relevant splits.
A data source refers to a file that will be loaded into a specific data split in a dataset. Snorkel Flow currently supports CSV and Parquet formats. Each data source requires an index field that specifies a unique index for each row of the data. The indices must be unique across all splits within a dataset.
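Because the uniqueness requirement spans the whole dataset, it can be worth checking before upload. Below is a minimal sketch using pandas; the file names and the doc_id index column are hypothetical:

```python
import pandas as pd

# Hypothetical data sources, one file per split.
train_df = pd.read_csv("contracts_train.csv")
valid_df = pd.read_csv("contracts_valid.csv")
test_df = pd.read_csv("contracts_test.csv")

# "doc_id" is an assumed index field; its values must be unique
# across all splits in the dataset, not just within each file.
all_ids = pd.concat([train_df["doc_id"], valid_df["doc_id"], test_df["doc_id"]])
assert all_ids.is_unique, "index field must be unique across all splits"
```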
Application
An application refers to a specific machine learning problem that you want to solve using Snorkel Flow. Implemented as a directed acyclic graph (DAG) over operators, an application can be used to compose complex data flow pipelines from simpler building blocks. Applications can be visualized and manipulated in Application Studio or via the Python SDK.
Application templates
Currently, Snorkel Flow supports the following application templates:
- Classification: Classify datapoints as one of many labels.
- Date Extraction: Extract specific types of dates from documents.
- Email Address Extraction: Extract email addresses from documents.
- hOCR Extraction: Extract structured information from hOCR data (e.g., OCR output from PDFs).
- Multi-label Classification: Tag datapoints with zero, one, or more non-exclusive labels.
- Multi-label PDF Classification: Classify PDF documents with zero, one, or more non-exclusive labels.
- Native PDF Extraction: Extract structured information from Native PDFs.
- PDF Classification: Classify PDF documents as one of many labels.
- Sequence Tagging: Classify tokens of documents as one of many labels.
- Text Entity Classification: Extract entities, link them to canonical entities, and classify those entities.
- Text Extraction: Extract entities by first identifying high-recall candidate spans and then classifying each as the target entity or not.
- US Currency Extraction: Extract specific US dollar amounts from documents.
Blocks
A block is a convenient way to group operators that perform some key functionality (e.g., span extraction). Blocks can be instantiated from application templates, and multiple blocks can be chained together in a single application.
Operators
Snorkel Flow Operators are functions that define transformations over dataframes. For instance, they may be used to clean data, extract candidate spans from documents, or group by and reduce predicted spans by document. Operators may be implemented as heuristics or learned models, whatever is most practical for your application. For advanced usage, the Python SDK also exposes utilities to register custom operators.
See Operators: Transform and process your data for more information.
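Conceptually, an operator is a dataframe-in, dataframe-out transformation. Here is an illustrative sketch in plain pandas; the function name and column are hypothetical, and the linked page describes how operators are actually registered:

```python
import pandas as pd

def strip_html_tags(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative operator: clean a raw text column by removing HTML tags."""
    out = df.copy()
    out["text"] = out["text"].str.replace(r"<[^>]+>", " ", regex=True)
    return out
```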
Split
Snorkel Flow supports four data splits: train, dev, valid, and test. All except dev can be uploaded by users; the dev split is sampled from the train split.
- Train: The partially labeled or unlabeled data to which Snorkel Flow will assign labels. It should represent the largest proportion of the dataset.
- Dev: A subset of the train split that is randomly sampled and used to guide the development of LFs and models.
- By default, 10% of the train split, up to 10,000 samples, is used for the dev split (see the sketch after this list).
- Ground truth labels for this split are optional and can be specified during data loading or annotated within Snorkel Flow.
- Snorkel Flow does not require the dev split to have GTs to generate labels for the train split.
- Valid: The split used for tuning the hyperparameters of machine learning models. It requires labels, but can be labeled within Snorkel Flow.
- Test: The split used for evaluating models’ performance. It requires labels, but can be labeled within Snorkel Flow.
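To make the default dev-split behavior concrete, the sampling rule amounts to roughly the following. This is an illustration of the documented default, not Snorkel Flow's actual implementation:

```python
import pandas as pd

def sample_dev_split(train_df: pd.DataFrame, frac: float = 0.10, cap: int = 10_000) -> pd.DataFrame:
    """Sample a dev split: 10% of the train split, capped at 10,000 rows."""
    n = min(int(len(train_df) * frac), cap)
    return train_df.sample(n=n, random_state=0)
```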
Labeling function (LF)
The key data programming abstraction in Snorkel, a labeling function (LF) is a programmatic rule or heuristic that assigns labels to unlabeled data. Each LF targets a single class: it votes on whether a data point belongs to that class, or abstains.
Snorkel's core label model is responsible for estimating LF accuracies and aggregating them into training labels without relying on ground truth labels.
See the Labeling function builders index page for a full list of LFs that can be created in Snorkel Flow.
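For intuition, this is what an LF looks like in the open-source Snorkel library; Snorkel Flow's LF builders are a higher-level interface over the same idea. The spam example below is illustrative:

```python
from snorkel.labeling import labeling_function

ABSTAIN = -1
SPAM = 1

@labeling_function()
def lf_contains_link(x):
    # Vote SPAM when the text contains a URL; otherwise abstain.
    return SPAM if "http" in x.text.lower() else ABSTAIN
```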
Label package
A label package is created from a set of labeling functions. Snorkel Flow's core engine combines, de-noises, and reweights the outputs of each labeling function to generate labels for your data. You can create a label package on the Label page and see the available label packages on the LF Packages page.
Label spaces
Snorkel Flow uses three main label spaces for data representation, which you may come across while using the platform: single label, multi label, and sequence label.
- Single label - one data point can belong to only one class.
- Multi label - one data point can belong to multiple classes.
- Sequence label - a data point is a document that contains multiple spans, and each span belongs to exactly one class.
Example: "You are being offered employment at Some Company and you start on September 1st, 2022."
Different label spaces can be used to perform specific operations on the above example. Here are some ways each of them could be used on the text:
Single label spaces can be used to classify a pre-extracted/pre-determined span, for example to find the company name. Single label spaces can also be used to label at the document level.
Some Company -> company name label
Multi label spaces can be used to label at the document level, for example to determine the categories of a document.
Example document -> employment contract and personal email labels
Sequence label spaces can be used to detect spans in the text using character offsets, for example dates or company names.
A span is characterized by its character offsets (char_start, char_end) in the text, and each span is associated with a single label. For example, (0, 8, "OTHER") denotes a span that starts at character index 0, ends at index 8, and belongs to the OTHER class.
(0, 8, "OTHER") corresponds to: You are -> OTHER label (8, 36, "OTHER") corresponds to: being offered employment at -> OTHER label (36, 47, "COMPANY") corresponds to: Some Company -> COMPANY label (48, 65, "OTHER ") corresponds to: and you start on -> OTHER label (66, 86, "DATE") corresponds to: September 1st, 2022 -> DATE label
Application templates and label spaces
Each application template is exclusive to one label space. When you're working with a specific template, you can expect the ground truth (GT) formats to align with the label space associated with that template.
All application templates that specify Multi label in their naming conventions use multi label as the label space. These are Multi-label Classification and Multi-label PDF Classification.
All Information Extraction tasks (applications that have the Information Extraction label on the application template cards) use single label spaces. These are US Currency Extraction, Native PDF Extraction, hOCR Extraction, Date Extraction, Email Address Extraction, and Text Extraction.
Sequence Tagging is the only application that uses sequence label spaces.
All other application templates also use single label spaces. These are PDF Classification and plain Classification, Utterance and Conversation Classification, and Text Entity Classification.
Format for ground truth interaction in the SDK
This section shows example formats for each label space, as used in our SDK, for a given data point.
Multi-label ground truth is represented as a dictionary that maps each label to one of PRESENT, ABSENT, or ABSTAIN:
label: {"Japanese Movies": "PRESENT", "World cinema": "ABSTAIN", "Black-and-White": "ABSENT", "Short Film": "ABSTAIN"}
Sequence tagging ground truth for a document is a list of spans, where each span is a triple of (char_start, char_end, label). Spans cannot be empty (char_start must be smaller than char_end), overlapping or duplicate spans are not allowed, and spans must be sorted by their character offsets:
label: [ [0, 29, 'OTHER'], [29, 40, 'COMPANY'], [40, 228, 'OTHER'], [228, 239, 'COMPANY'], [239, 395, 'OTHER'], ]
Single label ground truth is represented by the class label:
label: "loan"
Training set
A training set is a set of labels generated by a label package for a certain split, e.g. the train split. You can create a training set from existing label packages in the LF Packages page and use the resulting labels to train a downstream machine learning model via the Python SDK or the Train page.
Candidate spans
Candidate spans are spans of text extracted from the original document that labeling functions and models operate over. For example, if your application extracts the focal entity of an article, then the candidate spans are entities extracted from that article, and you train a model to predict whether each span is the focal entity or not.
The candidate span concept is relevant for information extraction and entity classification applications.
Active learning
Machine learning uses collected experience and data to improve a system's performance. In many tasks, this experience is handed to the learner (the ML model) up front, treating the model as a passive recipient of data.
Active learning, on the other hand, studies how the learner can use its own ability to gather data, for example by querying a user for labels on the data points it expects to be most informative, and act on the experience it receives.
Entropy measures how close the model's predicted label distribution is to a uniform distribution, i.e., how uncertain the model is. A uniform distribution is one in which each predicted label is equally likely. For example, the normalized entropy of a model that considers every outcome equally likely is 1.
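As a sketch, normalized entropy can be computed from a model's predicted probabilities like this, dividing by the log of the number of classes so that the uniform distribution scores exactly 1:

```python
import numpy as np

def normalized_entropy(probs) -> float:
    """Entropy of a predicted label distribution, scaled to [0, 1]."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken to be 0
    return float(-(p * np.log(p)).sum() / np.log(len(probs)))

print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # 1.0 -> maximally uncertain
print(normalized_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.12 -> confident
```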
If you're interested in learning more about the math behind entropy, Burr Settles' Active Learning Literature Survey is a great source.