Glossary

AI and tech-related fields often use specialized jargon and abbreviations that can be overwhelming. This glossary provides clear definitions of key terms and abbreviations you’ll encounter, helping you better understand and build with AI technologies.

active-learning

Machine learning uses collected experience and data to improve a system’s performance. In many settings, this experience is gathered by running experiments or making queries to a user, which treats the learner (the ML model) as a passive recipient of data. Active learning, in contrast, studies how to use the learner’s own ability to gather data and act on the experience it receives, typically by requesting labels for the examples it is most uncertain about. A common uncertainty measure is entropy, which captures how close the predicted distribution is to a uniform distribution, that is, one in which every label is equally likely. For example, the normalized entropy of a model that considers every outcome equally likely is 1, the maximum possible value.
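
The uncertainty measure described above can be sketched in a few lines. This is an illustrative implementation (the function name is ours, not a Snorkel Flow API): entropy computed with the number of classes as the log base, so a uniform prediction scores exactly 1 and a fully confident one scores 0.

```python
import math

def normalized_entropy(probs):
    """Entropy of a predicted label distribution, scaled to [0, 1].

    Uses log base len(probs), so a uniform distribution scores exactly 1
    and a one-hot (fully confident) prediction scores 0.
    """
    n = len(probs)
    return -sum(p * math.log(p, n) for p in probs if p > 0)

# A model that thinks every label is equally likely is maximally uncertain;
# active learning would send such examples to a human annotator first.
uncertain = normalized_entropy([0.25, 0.25, 0.25, 0.25])  # 1.0
confident = normalized_entropy([0.97, 0.01, 0.01, 0.01])  # well below 1
```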

annotation

The process of assigning labels or classes to specific data points for training datasets. For example, for a classification problem, assign banking contract documents to one of the following classes: "employment," "loan," "services," or "stock."

application

An application refers to a specific machine learning problem that you want to solve using Snorkel Flow. Implemented as a directed acyclic graph (DAG) over operators, an application can be used to compose complex data flow pipelines from simpler building blocks. Applications can be visualized and manipulated in Application Studio or the Python SDK.

blocks

Groups of operators that perform some key functionality (e.g., span extraction). Blocks can be instantiated from application templates, and multiple blocks can be chained together in a single application.

candidate spans

Spans of text extracted from the original document that labeling functions and models operate over. For example, if your application extracts the focal entity of an article, the candidate spans are the entities extracted from that article, and you train a model to predict whether each span is the focal entity or not. The candidate span concept is relevant for information extraction and entity classification applications.

classification

Classifies data points as one of many labels.

continuous model validation

The regular assessment of a model to ensure it continues to perform as expected despite changes in the data it processes or in its operational environment.

data development

The process of creating and improving curated datasets for a unique application. Like software development, data development is a discipline with best practices for iteration and collaboration.

data drift

When a model’s performance drifts from its initial metrics because the data it predicts on changes over time.

data extraction

Extracts specific types of data from documents.

data source

A file that will be loaded into a specific data split in a dataset. Snorkel Flow currently supports CSV and Parquet formats. Each data source requires an index field that specifies a unique index for each row of the data. The indices must be unique across all splits within a dataset.
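
The index-uniqueness requirement above can be checked before loading a data source. A minimal sketch using only the standard library; the column name `uid` and the sample rows are illustrative, not a Snorkel Flow convention:

```python
import csv
from io import StringIO

# Stand-in for a CSV data source; "uid" plays the role of the index field.
csv_data = StringIO("""uid,text
1,loan agreement between two parties
2,offer of employment dated May 1
3,master services agreement
""")

rows = list(csv.DictReader(csv_data))
uids = [row["uid"] for row in rows]

# The index must be unique across all splits in the dataset,
# so it is worth validating before uploading the data source.
assert len(uids) == len(set(uids)), "index field must be unique"
```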

dataset

The collection of data records you want to work with. It contains data sources assigned to relevant splits.

dev-split

A dev split is a randomly sampled subset of the train split that is used to guide the development of LFs and models. By default, 10% of the train split, up to 10,000 samples, is used for the dev split. Ground truth labels for this split are optional and can be specified during data loading or annotated within Snorkel Flow. Snorkel Flow does not require the dev split to have ground truth in order to generate labels for the train split.
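
The default sampling rule above (10% of the train split, capped at 10,000 samples) can be mimicked as follows. This is a sketch of the documented behavior, not Snorkel Flow’s actual implementation; the function name and seed handling are our own:

```python
import random

def sample_dev_split(train_ids, frac=0.10, cap=10_000, seed=0):
    """Randomly sample frac of the train split, up to cap rows,
    to serve as a dev split (mirrors the documented default)."""
    k = min(int(len(train_ids) * frac), cap)
    rng = random.Random(seed)
    return set(rng.sample(train_ids, k))

train = list(range(1_000))
dev = sample_dev_split(train)  # 100 rows: 10% of 1,000, under the cap
```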

email address extraction

Extracts email addresses from documents.

ground truth

The set of labeled and accurately annotated data that serves as a reference or benchmark for training and evaluating machine learning models.

hOCR extraction

Extracts structured information using hOCR data (e.g., from PDFs).

label package

Created from a set of labeling functions. Snorkel Flow's core engine combines, de-noises, and reweighs the outputs of each labeling function to generate labels for your data. You can create a label package in the Label page and see the available label packages in the LF Packages page.

label spaces

Three main label spaces are used for data representation in Snorkel, which you might come across while working with it: multi-label, sequence-label, and single-label.
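
The three label spaces can be contrasted on a toy example. The data structures below are illustrative stand-ins, not Snorkel Flow’s internal format:

```python
# Single-label: each data point belongs to exactly one class.
single = {"doc_1": "loan", "doc_2": "employment"}

# Multi-label: each data point carries a set of zero or more classes.
multi = {"doc_1": {"loan", "services"}, "doc_2": set()}

# Sequence-label: a data point is a document made of spans,
# and each span has exactly one class.
sequence = {"doc_1": [("Acme Corp", "ORG"), ("May 1", "DATE")]}

assert len(multi["doc_1"]) == 2  # multi-label allows several classes...
assert multi["doc_2"] == set()   # ...or none at all
```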

labeling function (LF)

The key data programming abstraction in Snorkel, a labeling function is a programmatic rule or heuristic that assigns labels to unlabeled data. Each LF targets a single class, voting on whether a data point should receive that label or abstaining. Snorkel's core label model is responsible for estimating LF accuracies and aggregating them into training labels without relying on ground truth labels.
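
The vote-or-abstain pattern can be sketched in plain Python. The constants and function names below are illustrative, not the Snorkel Flow SDK; in practice LFs are registered through the platform or SDK rather than called directly:

```python
# A labeling function votes for one class or abstains on each data point.
ABSTAIN, LOAN, EMPLOYMENT = -1, 0, 1

def lf_mentions_loan(doc_text):
    """Vote LOAN if the document mentions a loan; otherwise abstain."""
    return LOAN if "loan" in doc_text.lower() else ABSTAIN

def lf_mentions_salary(doc_text):
    """Vote EMPLOYMENT on salary language; otherwise abstain."""
    return EMPLOYMENT if "salary" in doc_text.lower() else ABSTAIN

doc = "This Loan Agreement is entered into on May 1..."
votes = [lf(doc) for lf in (lf_mentions_loan, lf_mentions_salary)]
# votes == [0, -1]: one vote for LOAN, one abstention; the label model
# then aggregates and de-noises such votes across many LFs.
```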

majority vote

This aggregation strategy takes the majority label for each data point if one exists, and assigns an `UNKNOWN` label where no annotations exist. It is the only supported aggregation strategy.
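
The rule above can be sketched as follows. The function name is ours, and the tie-handling (returning `UNKNOWN` when two labels are equally common, since no majority exists) is our assumption, as the definition only covers the no-annotations case:

```python
from collections import Counter

UNKNOWN = None

def majority_vote(annotations):
    """Return the majority label for one data point, or UNKNOWN
    when there are no annotations (or no single majority)."""
    if not annotations:
        return UNKNOWN
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return UNKNOWN  # tie: no majority label exists
    return counts[0][0]

majority_vote(["loan", "loan", "services"])  # "loan"
majority_vote([])                            # UNKNOWN
```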

multi-label

One data point can belong to multiple classes.

multi-label classification

Tag data points with zero, one, or more non-exclusive labels.

multi-label PDF classification

Classify PDF documents with zero, one, or more non-exclusive labels.

native PDF extraction

Extract structured information from native PDFs.

operators

Snorkel Flow operators are functions that define transformations over dataframes. For instance, they may be used to clean data, extract candidate spans from documents, or group and reduce predicted spans by document. Operators may be implemented as heuristics or learned models, whichever is most practical for your application. For advanced usage, the Python SDK also exposes utilities to register custom operators.

PDF classification

Classify PDF documents as one of many labels.

sequence tagging

Classify tokens of documents as one of many labels.

sequence-label

A data point in sequence tagging is a document, which consists of multiple spans; each span can have only one class.
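
A sequence-labeled data point might look like the following. The BIO tagging convention used here is a common assumption for sequence tagging, not necessarily Snorkel Flow’s internal format; the helper collapses token tags back into the spans the definition describes, each with one class:

```python
tokens = ["Jane", "Doe", "joined", "Acme", "Corp", "in", "May"]
tags   = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-DATE"]

def spans_from_bio(tokens, tags):
    """Collapse BIO token tags into (span_text, class) pairs."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O": close any open span
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

spans = spans_from_bio(tokens, tags)
# [("Jane Doe", "PER"), ("Acme Corp", "ORG"), ("May", "DATE")]
```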

single label

One data point can belong to only one class.

split

Snorkel Flow supports four data splits: train, dev, valid, and test. All of them except dev can be uploaded by users; the dev split is sampled from the train split.

test split

The split used for evaluating model performance. It requires ground truth labels, which can be provided at upload or annotated within Snorkel Flow.

text entity classification

Extract entities, link them to canonical entities, and classify those entities.

text extraction

Extract entities by first identifying high-recall candidate spans and then classifying each as the target entity or not.

train split

The partially labeled or unlabeled data to which Snorkel Flow will assign labels. It should represent the largest proportion of the dataset.

training-set

A training set is a set of labels generated by a label package for a certain split, e.g. the train split. You can create a training set from existing label packages in the LF Packages page and use the resulting labels to train a downstream machine learning model via the Python SDK or the Train page.

US currency extraction

Extract specific US dollar amounts from documents.

valid split

The split used for tuning the hyperparameters of machine learning models. It requires ground truth labels, which can be provided at upload or annotated within Snorkel Flow.