Glossary

AI and tech-related fields often use specialized jargon and abbreviations that can be overwhelming. This glossary provides clear definitions of key terms and abbreviations you’ll encounter, helping you better understand and build with AI technologies.

active-learning

Machine learning uses collected experience and data to improve a system’s performance. In many settings, this experience is gathered by running experiments or making queries to a user, which treats the learner (the ML model) as a passive recipient of data. Active learning, in contrast, studies how to use the learner’s own ability to gather data and act on the experience it receives, typically by requesting labels for the examples it is most uncertain about. A common uncertainty measure is entropy, which captures how close the predicted distribution is to a uniform distribution, that is, one in which every label is equally likely. For example, the normalized entropy of a model that considers every outcome equally likely is 1, the maximum possible value.
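
The uncertainty measure described above can be sketched in a few lines. This is an illustrative implementation (the function name is ours, not a Snorkel Flow API): entropy computed with the number of classes as the log base, so a uniform prediction scores exactly 1 and a fully confident one scores 0.

```python
import math

def normalized_entropy(probs):
    """Entropy of a predicted label distribution, scaled to [0, 1].

    Uses log base len(probs), so a uniform distribution scores exactly 1
    and a one-hot (fully confident) prediction scores 0.
    """
    n = len(probs)
    return -sum(p * math.log(p, n) for p in probs if p > 0)

# A model that thinks every label is equally likely is maximally uncertain;
# active learning would send such examples to a human annotator first.
uncertain = normalized_entropy([0.25, 0.25, 0.25, 0.25])  # 1.0
confident = normalized_entropy([0.97, 0.01, 0.01, 0.01])  # well below 1
```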

annotation

The process of assigning labels or classes to specific data points for training datasets. For example, for a classification problem, assign banking contract documents to one of the following classes: "employment," "loan," "services," or "stock."

application

An application refers to a specific machine learning problem that you want to solve using Snorkel Flow. Implemented as a directed acyclic graph (DAG) over operators, an application can be used to compose complex data flow pipelines from simpler building blocks. Applications can be visualized and manipulated in Application Studio or the Python SDK.

blocks

Groups of operators that perform some key functionality (e.g., span extraction). Blocks can be instantiated from application templates, and multiple blocks can be chained together in a single application.

candidate spans

Spans of text extracted from the original document that labeling functions and models operate over. For example, if your application extracts the focal entity of an article, the candidate spans are the entities extracted from that article, and you train a model to predict whether each span is the focal entity or not. The candidate span concept is relevant for information extraction and entity classification applications.

classification

Classifies data points as one of many labels.

continuous model validation

The regular assessment of a model to ensure it continues to perform as expected despite changes in the data it processes or in its operational environment.

data development

The process of creating and improving curated datasets for a unique application. Like software development, data development is a discipline with best practices for iteration and collaboration.

data drift

When a model’s performance drifts from its initial metrics because the data it predicts on changes over time.

data extraction

Extracts specific types of data from documents.

data source

A file that will be loaded into a specific data split in a dataset. Snorkel Flow currently supports CSV and Parquet formats. Each data source requires an index field that specifies a unique index for each row of the data. The indices must be unique across all splits within a dataset.
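
The index-uniqueness requirement above can be checked before loading a data source. A minimal sketch using only the standard library; the column name `uid` and the sample rows are illustrative, not a Snorkel Flow convention:

```python
import csv
from io import StringIO

# Stand-in for a CSV data source; "uid" plays the role of the index field.
csv_data = StringIO("""uid,text
1,loan agreement between two parties
2,offer of employment dated May 1
3,master services agreement
""")

rows = list(csv.DictReader(csv_data))
uids = [row["uid"] for row in rows]

# The index must be unique across all splits in the dataset,
# so it is worth validating before uploading the data source.
assert len(uids) == len(set(uids)), "index field must be unique"
```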

dataset

The collection of data records you want to work with. It contains data sources assigned to relevant splits.

dev-split

A dev split is a randomly sampled subset of the train split that is used to guide the development of LFs and models. By default, 10% of the train split, up to 10,000 samples, is used for the dev split. Ground truth labels for this split are optional and can be specified during data loading or annotated within Snorkel Flow. Snorkel Flow does not require the dev split to have ground truth in order to generate labels for the train split.
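
The default sampling rule above (10% of the train split, capped at 10,000 samples) can be mimicked as follows. This is a sketch of the documented behavior, not Snorkel Flow’s actual implementation; the function name and seed handling are our own:

```python
import random

def sample_dev_split(train_ids, frac=0.10, cap=10_000, seed=0):
    """Randomly sample frac of the train split, up to cap rows,
    to serve as a dev split (mirrors the documented default)."""
    k = min(int(len(train_ids) * frac), cap)
    rng = random.Random(seed)
    return set(rng.sample(train_ids, k))

train = list(range(1_000))
dev = sample_dev_split(train)  # 100 rows: 10% of 1,000, under the cap
```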

email address extraction

Extracts email addresses from documents.

ground truth

The set of labeled and accurately annotated data that serves as a reference or benchmark for training and evaluating machine learning models.

hOCR extraction

Extracts structured information using hOCR data (e.g., from PDFs).

label package

Created from a set of labeling functions. Snorkel Flow's core engine combines, de-noises, and reweighs the outputs of each labeling function to generate labels for your data. You can create a label package in the Label page and see the available label packages in the LF Packages page.

label spaces

Three main label spaces are used for data representation in Snorkel, which you might come across while working with it: multi-label, sequence-label, and single-label.
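
The three label spaces can be contrasted on a toy example. The data structures below are illustrative stand-ins, not Snorkel Flow’s internal format:

```python
# Single-label: each data point belongs to exactly one class.
single = {"doc_1": "loan", "doc_2": "employment"}

# Multi-label: each data point carries a set of zero or more classes.
multi = {"doc_1": {"loan", "services"}, "doc_2": set()}

# Sequence-label: a data point is a document made of spans,
# and each span has exactly one class.
sequence = {"doc_1": [("Acme Corp", "ORG"), ("May 1", "DATE")]}

assert len(multi["doc_1"]) == 2  # multi-label allows several classes...
assert multi["doc_2"] == set()   # ...or none at all
```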

labeling function (LF)

The key data programming abstraction in Snorkel, a labeling function is a programmatic rule or heuristic that assigns labels to unlabeled data. Each LF targets a single class, voting on whether a data point should receive that label or abstaining. Snorkel's core label model is responsible for estimating LF accuracies and aggregating them into training labels without relying on ground truth labels.
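
The vote-or-abstain pattern can be sketched in plain Python. The constants and function names below are illustrative, not the Snorkel Flow SDK; in practice LFs are registered through the platform or SDK rather than called directly:

```python
# A labeling function votes for one class or abstains on each data point.
ABSTAIN, LOAN, EMPLOYMENT = -1, 0, 1

def lf_mentions_loan(doc_text):
    """Vote LOAN if the document mentions a loan; otherwise abstain."""
    return LOAN if "loan" in doc_text.lower() else ABSTAIN

def lf_mentions_salary(doc_text):
    """Vote EMPLOYMENT on salary language; otherwise abstain."""
    return EMPLOYMENT if "salary" in doc_text.lower() else ABSTAIN

doc = "This Loan Agreement is entered into on May 1..."
votes = [lf(doc) for lf in (lf_mentions_loan, lf_mentions_salary)]
# votes == [0, -1]: one vote for LOAN, one abstention; the label model
# then aggregates and de-noises such votes across many LFs.
```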

majority vote

This aggregation strategy takes the majority label for each data point if one exists, and assigns an `UNKNOWN` label where no annotations exist. It is the only supported aggregation strategy.
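
The rule above can be sketched as follows. The function name is ours, and the tie-handling (returning `UNKNOWN` when two labels are equally common, since no majority exists) is our assumption, as the definition only covers the no-annotations case:

```python
from collections import Counter

UNKNOWN = None

def majority_vote(annotations):
    """Return the majority label for one data point, or UNKNOWN
    when there are no annotations (or no single majority)."""
    if not annotations:
        return UNKNOWN
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return UNKNOWN  # tie: no majority label exists
    return counts[0][0]

majority_vote(["loan", "loan", "services"])  # "loan"
majority_vote([])                            # UNKNOWN
```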

multi-label

One data point can belong to multiple classes.

multi-label classification

Tag data points with zero, one, or more non-exclusive labels.

multi-label PDF classification

Classify PDF documents with zero, one, or more non-exclusive labels.

native PDF extraction

Extract structured information from native PDFs.

operators

Snorkel Flow operators are functions that define transformations over dataframes. For instance, they may be used to clean data, extract candidate spans from documents, or group and reduce predicted spans by document. Operators may be implemented as heuristics or learned models, whichever is most practical for your application. For advanced usage, the Python SDK also exposes utilities to register custom operators.

PDF classification

Classify PDF documents as one of many labels.

sequence tagging

Classify tokens of documents as one of many labels.

sequence-label

A data point in sequence tagging is a document, which consists of multiple spans; each span can have only one class.
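
A sequence-labeled data point might look like the following. The BIO tagging convention used here is a common assumption for sequence tagging, not necessarily Snorkel Flow’s internal format; the helper collapses token tags back into the spans the definition describes, each with one class:

```python
tokens = ["Jane", "Doe", "joined", "Acme", "Corp", "in", "May"]
tags   = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-DATE"]

def spans_from_bio(tokens, tags):
    """Collapse BIO token tags into (span_text, class) pairs."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O": close any open span
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

spans = spans_from_bio(tokens, tags)
# [("Jane Doe", "PER"), ("Acme Corp", "ORG"), ("May", "DATE")]
```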

single label

One data point can belong to only one class.

split

Snorkel Flow supports four data splits: train, dev, valid, and test. All of them except dev can be uploaded by users; the dev split is sampled from the train split.

test split

The split used for evaluating model performance. It requires ground truth labels, which can be provided at upload or annotated within Snorkel Flow.

text entity classification

Extract entities, link them to canonical entities, and classify those entities.

text extraction

Extract entities by first identifying high-recall candidate spans and then classifying each as the target entity or not.

train split

The partially labeled or unlabeled data to which Snorkel Flow will assign labels. It should represent the largest proportion of the dataset.

training-set

A training set is a set of labels generated by a label package for a certain split, e.g. the train split. You can create a training set from existing label packages in the LF Packages page and use the resulting labels to train a downstream machine learning model via the Python SDK or the Train page.

US currency extraction

Extract specific US dollar amounts from documents.

valid split

The split used for tuning the hyperparameters of machine learning models. It requires ground truth labels, which can be provided at upload or annotated within Snorkel Flow.