
Annotation Studio overview

This page introduces the concept of annotations and provides an overview of Annotation Studio.

Annotations overview

Data annotation is the process of assigning labels or classes to individual data points to build training datasets. For example, for a classification problem, you may want to assign banking contract documents to one of the following classes: "employment," "loan," "services," or "stock."

Ground truth refers to the set of labeled and accurately annotated data that serves as a reference or benchmark for training and evaluating machine learning models. During the evaluation phase, ground truth data is used to assess the model's performance. By comparing the model's predictions with the actual labels in the ground truth, metrics such as accuracy, precision, recall, and F1 score can be calculated.
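For example, given a small evaluation set with ground truth labels and a corresponding set of model predictions, these metrics can be computed directly. The sketch below is a minimal illustration using scikit-learn and hypothetical label lists; it is not part of Snorkel Flow itself.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth labels and model predictions for a small
# evaluation set (classes: "employment", "loan", "services", "stock").
ground_truth = ["loan", "employment", "loan", "stock", "services", "loan"]
predictions  = ["loan", "employment", "stock", "stock", "services", "services"]

# Compare the model's predictions against the ground truth.
accuracy = accuracy_score(ground_truth, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
    ground_truth, predictions, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```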

Why do we need manual annotation in Snorkel Flow? While Snorkel Flow programmatically generates labels for your training datasets, you'll still want some initial ground truth to develop labeling functions (LFs) and models. In addition, manual annotation allows for an iterative process of model development: annotators can review model predictions, identify errors, and refine the labeled dataset, leading to improved model performance over time. With Snorkel, you can pinpoint exactly where manual annotation is needed, resulting in significantly less time and money spent on manual annotation!
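To give a sense of what a labeling function is, here is a minimal sketch using the open-source Snorkel library's @labeling_function decorator. The keyword list, the LOAN label value, and the x.text field are assumptions made for illustration; the LF builders available in Snorkel Flow itself may look different.

```python
from snorkel.labeling import labeling_function

# Hypothetical label values for the contract-classification example.
ABSTAIN = -1
LOAN = 0

@labeling_function()
def lf_mentions_loan(x):
    """Label a document as LOAN if it mentions loan-related terms."""
    text = x.text.lower()  # assumes each data point has a `text` field
    if "principal amount" in text or "interest rate" in text:
        return LOAN
    return ABSTAIN
```

The small amount of manually annotated ground truth is what lets you check whether an LF like this one agrees with human judgment before applying it across the full dataset.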

Typically, you'll want multiple annotators to review each data point. This is for a variety of reasons:

  • Reduce bias: Manually labeling data can be prone to bias, as the person labeling the data may have preconceptions that can influence the labels they assign.
  • Handle ambiguity: Some documents may be inherently ambiguous or open to interpretation. In these cases, having multiple annotators allows for capturing different viewpoints and addressing the inherent uncertainty in the data.
  • Enhance robustness: By aggregating annotations from multiple annotators, the labeled dataset becomes more robust and less dependent on the idiosyncrasies of any single annotator. This robustness is particularly important when dealing with diverse datasets or complex tasks.
  • Refine annotation guidelines: Comparing annotations from multiple annotators can identify areas where guidelines need clarification (a simple agreement calculation is sketched after this list). It also provides an opportunity for continuous training and refinement of guidelines to improve consistency and understanding among annotators.
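A common way to quantify how well two annotators agree beyond chance is Cohen's kappa. The sketch below shows the calculation with scikit-learn on hypothetical annotations; Annotation Studio reports its own annotator agreement metrics on the Overview page.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned by two annotators to the same six documents.
annotator_a = ["loan", "loan", "services", "stock", "employment", "loan"]
annotator_b = ["loan", "stock", "services", "stock", "employment", "loan"]

# Cohen's kappa measures agreement beyond what chance alone would produce:
# 1.0 is perfect agreement, 0.0 is chance-level agreement.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```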

Annotation Studio overview

Annotation Studio is divided into four sections:

  • Overview page: Provides various aggregate metrics on the number of annotations that have been completed, the distribution of labels, annotator agreement, and a view into recent annotator activity. See Overview page: View aggregate metrics for more information.
  • Review page: Provides a full list of all annotations and their corresponding documents to easily review all annotations in one place. See Walkthrough for reviewers for more information.
  • Batches page: Provides access to, and information about, all batches that have been created. Here, you can create and manage batches, and commit annotations to ground truth. See Create batches and Manage batches and commit ground truth for more information.
  • Within a batch: This is the canvas where annotators view and label documents. See Walkthrough for annotators for more information about how to annotate in Annotation Studio.

We provide multiple user roles that administrators can assign, each with different levels of data access and permissions. See Roles and user permissions for more information.