Annotation Studio overview

Data annotation is the process of assigning labels or classes to individual data points to build a training dataset. For example, for a classification problem, you could assign banking contract documents to one of the following classes:

  • employment
  • loan
  • services
  • stock
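
To make the example concrete, here is a minimal sketch of what a handful of annotated data points might look like. It uses plain Python structures and hypothetical document text; it is illustrative only and does not reflect the Snorkel Flow data schema.

```python
# Illustrative only: field names and document text are hypothetical,
# not the Snorkel Flow schema.
annotated_docs = [
    {"text": "This agreement sets out the terms of the borrower's repayment...", "label": "loan"},
    {"text": "The employee shall report to the branch manager...", "label": "employment"},
    {"text": "The bank agrees to provide custody services for the client...", "label": "services"},
    {"text": "The purchaser agrees to acquire 1,000 shares of common stock...", "label": "stock"},
]

# Each (document, label) pair is one annotated data point in the training dataset.
for doc in annotated_docs:
    print(f"{doc['label']:<12} {doc['text'][:50]}...")
```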

Ground truth (GT) refers to the set of labeled and accurately annotated data that serves as a reference or benchmark for training and evaluating machine learning models. During the evaluation phase, use ground truth data to assess the model's performance. By comparing the model's predictions with the actual labels in the ground truth, you can calculate metrics such as accuracy, precision, recall, and F1 score.
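
As a minimal sketch, assuming ground-truth labels and model predictions are available as plain Python lists, these metrics can be computed with scikit-learn (the label values below are hypothetical):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth labels and model predictions for six documents.
ground_truth = ["loan", "employment", "services", "stock", "loan", "services"]
predictions  = ["loan", "employment", "loan",     "stock", "loan", "stock"]

accuracy = accuracy_score(ground_truth, predictions)
# Macro-averaging treats every class equally, regardless of how often it appears.
precision, recall, f1, _ = precision_recall_fscore_support(
    ground_truth, predictions, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```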

Annotation Studio is divided into four sections:

  • Overview page: Provides aggregate metrics on the number of completed annotations, the distribution of labels, annotator agreement, and a view into recent annotator activity. See Overview page: View aggregate metrics for more information.
  • Review page: Provides a full list of all annotations and their corresponding documents, so you can review every annotation in one place. See Walkthrough for reviewers for more information.
  • Batches page: Provides access to and information about all batches that have been created. You can create and manage batches and commit annotations to ground truth. See Create batches and Manage batches and commit ground truth for more information.
  • Within a batch: Provides a canvas where annotators can view and label documents. See Walkthrough for annotators for more information about how to annotate in Annotation Studio.

Snorkel Flow provides multiple user roles that administrators can assign, each with different levels of data access and permissions. See Roles and user permissions for more information.

Why is manual annotation needed in Snorkel Flow?

While Snorkel Flow programmatically generates labels for your training datasets, you need to define initial ground truth to develop labeling functions (LFs) and models. In addition, manual annotation supports an iterative process of model development: annotators can review model predictions, identify errors, and refine the labeled dataset, which leads to improved model performance. With Snorkel Flow, you can pinpoint where manual annotation is needed, which significantly reduces the time and money spent on manual annotation.
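
For illustration, here is a generic sketch (not the Snorkel Flow API) of how error analysis can pinpoint which data points to route back to annotators, assuming each prediction carries a hypothetical confidence score and, where available, an existing ground-truth label:

```python
# Generic sketch (not the Snorkel Flow API): route low-confidence or
# incorrect predictions to annotators for review.
predictions = [
    {"doc_id": 1, "predicted": "loan",       "confidence": 0.97, "ground_truth": "loan"},
    {"doc_id": 2, "predicted": "services",   "confidence": 0.55, "ground_truth": None},
    {"doc_id": 3, "predicted": "employment", "confidence": 0.91, "ground_truth": "stock"},
]

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff

needs_review = [
    p for p in predictions
    if p["confidence"] < CONFIDENCE_THRESHOLD
    or (p["ground_truth"] is not None and p["predicted"] != p["ground_truth"])
]

print([p["doc_id"] for p in needs_review])  # -> [2, 3]
```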

Typically, you'll want multiple annotators to review each data point for these reasons:

  • Reduce bias: Manually labeling data is prone to bias because the person labeling the data may have preconceptions that influence the labels they assign. Independent annotations from several people dilute any one annotator's bias.
  • Handle ambiguity: Some documents might be inherently ambiguous or open to interpretation. In these cases, having multiple annotators allows for capturing different viewpoints and addressing the inherent uncertainty in the data.
  • Enhance robustness: By aggregating annotations from multiple annotators, the labeled dataset becomes more robust and less dependent on the idiosyncrasies of any single annotator (a simple aggregation sketch follows this list). This robustness is particularly important when dealing with diverse datasets or complex tasks.
  • Refine annotation guidelines: Comparing annotations from multiple annotators identifies areas where guidelines need clarification. It also provides an opportunity for continuous training to improve consistency and understanding among annotators.
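
As a simple sketch of the aggregation idea, the following snippet applies a majority vote across hypothetical labels from three annotators and flags documents with no clear majority for adjudication. It is illustrative only and not how Snorkel Flow resolves annotations.

```python
from collections import Counter

# Hypothetical labels from three annotators for the same four documents.
annotations = {
    "doc_1": ["loan", "loan", "loan"],
    "doc_2": ["services", "loan", "services"],
    "doc_3": ["stock", "employment", "loan"],  # no clear majority
    "doc_4": ["employment", "employment", "services"],
}

for doc_id, labels in annotations.items():
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count > len(labels) / 2:
        # Majority vote: accept the label most annotators agree on.
        print(f"{doc_id}: {top_label} (agreement {top_count}/{len(labels)})")
    else:
        # No majority: flag for adjudication or guideline refinement.
        print(f"{doc_id}: needs adjudication {labels}")
```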