Glossary
AI and tech-related fields often use specialized jargon and abbreviations that can be overwhelming. This glossary provides clear definitions of key terms and abbreviations you’ll encounter, helping you better understand and build with AI technologies.
active learning
Machine learning uses collected experience and data to improve a system’s performance. In many tasks, this experience is gained by running experiments or making queries to the user, yet standard approaches treat the learner (the ML model) as a passive recipient of data. Active learning, on the other hand, is the study of how to use the learner’s ability to gather data and act on the experience it receives. Entropy measures how close the predicted distribution is to a uniform distribution, in which each predicted label is equally likely. For example, if a model thinks that each outcome is equally likely (every label is assigned the same probability), its entropy, normalized by the logarithm of the number of labels, would be 1.
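As a rough illustration of the entropy idea above, the following sketch computes a normalized entropy score and uses it to pick the data point the model is least sure about. The function names and the choice to normalize by the logarithm of the number of labels are assumptions made for this example, not a description of how Snorkel Flow computes it.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a predicted distribution, scaled to the range [0, 1]."""
    if len(probs) < 2:
        return 0.0
    # Labels with zero probability contribute nothing to the sum.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dividing by log(number of labels) makes a uniform distribution score 1.
    return entropy / math.log(len(probs))

def most_uncertain(predictions):
    """Return the position of the prediction with the highest entropy."""
    return max(range(len(predictions)),
               key=lambda i: normalized_entropy(predictions[i]))

uniform = [0.25, 0.25, 0.25, 0.25]     # every label equally likely
confident = [0.97, 0.01, 0.01, 0.01]   # one label dominates

print(normalized_entropy(uniform))     # close to 1
print(normalized_entropy(confident))   # much closer to 0
print(most_uncertain([confident, uniform]))
```

An active learner could send the highest-entropy data points to an annotator first, since those are the ones where a new label is likely to be most informative.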
annotation
The process of assigning labels or classes to specific data points for training datasets. For example, for a classification problem, assign banking contract documents to one of the following classes: "employment," "loan," "services," or "stock."
application
An application refers to a specific machine learning problem that you want to solve using Snorkel Flow. Implemented as a directed acyclic graph (DAG) over operators, an application can be used to compose complex data flow pipelines from simpler building blocks. Applications can be visualized and manipulated in Application Studio or the Python SDK.
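To make the DAG idea concrete, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The operator names are hypothetical and this is not the Snorkel Flow SDK; it only shows how a dependency graph yields a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical operators. Each key maps to the set of operators it depends on.
pipeline = {
    "clean_text": set(),
    "extract_spans": {"clean_text"},
    "classify_spans": {"extract_spans"},
    "reduce_by_document": {"classify_spans"},
}

# A topological order runs every operator after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Because the graph is acyclic, an order always exists; `TopologicalSorter` raises `graphlib.CycleError` if a cycle is introduced.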
blocks
Groups of operators that perform some key functionality (e.g., span extraction). They can be instantiated using Application templates, and multiple blocks can be chained together in a single application.
candidate spans
Spans of text extracted from the original document that the labeling functions and models operate over. For example, if your application extracts the focal entity of an article, then candidate spans are entities extracted from that article, and you will train a model to predict whether each span is the focal entity or not. The candidate span concept is relevant for information extraction and entity classification applications.
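A minimal sketch of candidate span extraction, assuming a simple regular expression that treats runs of capitalized words as possible entities. The function name and pattern are illustrative assumptions; a real application would use an extractor suited to its data.

```python
import re

def extract_candidate_spans(text, pattern=r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b"):
    """Return (start, end, span_text) for every match of the pattern."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

article = "Acme Corp acquired Widget Works on Monday."
spans = extract_candidate_spans(article)
for start, end, span_text in spans:
    print(start, end, span_text)
```

The pattern is deliberately permissive: it also picks up "Monday", which is not an entity of interest. A downstream model would then predict, for each span, whether it is the focal entity.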
continuous model validation
The regular assessment of a model to ensure it continues to perform as expected despite changes in the data it processes or in its operational environment.
data development
The process of creating and improving curated datasets for a unique application. Like software development, data development is a discipline with best practices for iteration and collaboration.
data drift
When a model's performance drifts from its initial metrics as the data it predicts on changes over time.
data source
A file that will be loaded into a specific data split in a dataset. Snorkel Flow currently supports CSV and Parquet formats. Each data source requires an index field that specifies a unique index for each row of the data. The indices must be unique across all splits within a dataset.
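Before uploading, it can be useful to confirm that index values do not repeat across splits. The helper below is a sketch with a hypothetical name; it assumes the index values for each split have already been read into plain Python lists.

```python
from collections import Counter

def find_duplicate_indices(indices_by_split):
    """Return index values that occur more than once across all splits."""
    counts = Counter()
    for indices in indices_by_split.values():
        counts.update(indices)
    return sorted(idx for idx, n in counts.items() if n > 1)

# Index 3 appears in both train and valid, so it violates the uniqueness rule.
duplicates = find_duplicate_indices({
    "train": [1, 2, 3],
    "valid": [3, 4],
    "test": [5],
})
print(duplicates)
```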
dataset
The collection of data records you want to work with. It contains data sources assigned to relevant splits.
dev split
A subset of the train split is randomly sampled to create a dev split, which is used to guide the development of LFs and models. By default, 10% of the train split, up to 10,000 samples, is used for the dev split. Ground truth labels for this split are optional and can be specified during data loading or annotated within Snorkel Flow. Snorkel Flow does not require the dev split to have ground truth labels to generate labels for the train split.
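The default sizing rule described above can be sketched as follows. The function name, the fixed seed, and rounding down to a whole number of samples are assumptions made for illustration; Snorkel Flow performs this sampling itself.

```python
import random

def sample_dev_split(train_indices, fraction=0.10, max_size=10_000, seed=0):
    """Randomly sample a dev split: a fraction of train, capped at max_size."""
    train_indices = list(train_indices)
    # Assumption: fractional sizes are rounded down.
    size = min(int(len(train_indices) * fraction), max_size)
    return random.Random(seed).sample(train_indices, size)

small = sample_dev_split(range(500))       # 10% of 500 data points
large = sample_dev_split(range(200_000))   # 10% would be 20,000, so the cap applies
print(len(small), len(large))
```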
ground truth
The set of labeled and accurately annotated data that serves as a reference or benchmark for training and evaluating machine learning models.
label package
Created from a set of labeling functions. Snorkel Flow's core engine combines, de-noises, and reweighs the outputs of each labeling function to generate labels for your data. You can create a label package in the Label page and see the available label packages in the LF Packages page.
label spaces
Three main label spaces are used for data representation, which you might come across while using Snorkel. They are multi-label, sequence label, and single label spaces.
labeled set
A labeled set is a set of labels generated by a label package for a certain split, e.g. the train split. You can create a labeled set from existing label packages in the LF Packages page and use the resulting labels to train a downstream machine learning model via the Python SDK or the Train page.
labeling function (LF)
The key data programming abstraction in Snorkel, a labeling function is a programmatic rule or heuristic that assigns labels to unlabeled data. Each LF targets one class and votes on whether a data point has that label. Snorkel's core label model is responsible for estimating LF accuracies and aggregating them into training labels without relying on ground truth labels.
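As a plain-Python illustration of the concept (not the Snorkel Flow SDK), a keyword-based labeling function for the "loan" class might look like the sketch below. The function name, the keyword list, and the use of an `UNKNOWN` string to abstain are all assumptions for this example.

```python
UNKNOWN = "UNKNOWN"  # assumed abstain value: the LF casts no vote

def lf_loan_keywords(text):
    """Vote "loan" if the document uses typical loan vocabulary, else abstain."""
    keywords = ("borrower", "lender", "principal amount")
    lowered = text.lower()
    return "loan" if any(k in lowered for k in keywords) else UNKNOWN

print(lf_loan_keywords("The Borrower shall repay the Lender in full."))
print(lf_loan_keywords("The Employee's start date is June 1."))
```

A single heuristic like this is noisy on its own; the value comes from combining many such functions and letting the label model weigh their votes.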
majority vote
This aggregation strategy takes the majority label for each data point if one exists, and leaves an `UNKNOWN` label where no annotations exist. It is the only supported aggregation strategy.
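The strategy can be sketched in a few lines of Python. This is an illustrative version, not Snorkel Flow's implementation: it assumes "majority" means the most frequent label, and that a tie between top labels counts as no majority and yields `UNKNOWN`.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"

def majority_vote(annotations):
    """Aggregate one data point's annotations into a single label."""
    votes = [a for a in annotations if a != UNKNOWN]
    if not votes:
        return UNKNOWN  # no annotations exist for this data point
    ranked = Counter(votes).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return UNKNOWN  # assumption: a tie means no majority label exists
    return ranked[0][0]

print(majority_vote(["loan", "loan", "services"]))
print(majority_vote([]))
print(majority_vote(["loan", "stock"]))
```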
operators
Snorkel Flow operators are functions that define transformations over dataframes. For instance, they may be used to clean data, extract candidate spans from documents, or group by and reduce predicted spans by document. Operators may be implemented as heuristics or learned models, whichever is most practical for your application. For advanced usage, the Python SDK also exposes utilities to register custom operators.
prompt development
Prompt development, or prompt engineering, is the process of designing and refining prompt inputs to large language models (LLMs) to achieve specific, accurate, and contextually relevant outputs.
sequence label
In sequence tagging, a data point is a document that consists of multiple spans, and each span can have only one class.
split
Snorkel Flow supports four data splits: train, dev, valid, test. All of them, except dev, can be uploaded by users. The dev split will be sampled from the train split.
test split
The split used for evaluating model performance. It requires labels, which can be added within Snorkel Flow.
text entity classification
Extract entities, link them to canonical entities, and classify those entities.
text extraction
Extract entities by first identifying high-recall candidate spans and then classifying each as the target entity or not.
train split
The partially labeled or unlabeled data to which Snorkel Flow will assign labels. It should represent the largest proportion of the dataset.