Version: 0.96

Aggregate annotations

While managing batches, you can aggregate annotations to create a single ground truth label for each data point. The majority vote aggregation algorithm works differently for each task type: single-label, multi-label, and sequence tagging.

Single-label

Snorkel Flow uses a simple majority algorithm as the aggregation strategy for single-label applications. Based on the number of votes for each label, the label with the most votes is applied to the ground truth as the final decision.

For example, consider a single-label application with ten annotators and two classes. Six annotators label a data point with class A, and four annotators label the same data point with class B. Because class A has the majority of votes at aggregation time, the aggregated result labels this data point as class A.

If two or more labels tie for the most votes, Snorkel Flow picks one of the tied labels at random, using a pre-determined random seed.
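
The following is a minimal sketch of majority voting with a seeded tie-break, not Snorkel Flow's actual implementation. The function name `aggregate_single_label`, the `seed` parameter, and the choice to sort tied labels before picking one are assumptions made for illustration. It assumes each data point has at least one vote.

```python
import random
from collections import Counter


def aggregate_single_label(votes, seed=0):
    """Return the label with the most votes (hypothetical helper).

    Ties are broken by a random choice driven by a fixed seed, so the
    same votes always produce the same aggregated label.
    """
    counts = Counter(votes)
    top = max(counts.values())
    # Sort the tied labels so the result does not depend on vote order.
    tied = sorted(label for label, n in counts.items() if n == top)
    if len(tied) == 1:
        return tied[0]
    return random.Random(seed).choice(tied)


# Six annotators vote for class A and four vote for class B.
votes = ["A"] * 6 + ["B"] * 4
print(aggregate_single_label(votes))  # A
```

Because the random generator is created from a fixed seed, repeating the aggregation on the same tied votes gives the same result each time.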

Multi-label

Snorkel Flow aggregates multi-label annotations one class at a time. For each class, it counts the present and absent votes and applies the value (present or absent) with the most votes as the final decision.

For example, consider a multi-label application with ten annotators and two classes. For class A, six annotators vote present and four annotators vote absent. For class B, four annotators vote present and six annotators vote absent. By taking the majority vote of each class, the aggregated label for the data point applies present for class A and absent for class B.

If there is a tie between present and absent labels, the resulting label has an equal chance of being marked as present or absent.
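
The per-class vote can be sketched as follows. This is an illustration under stated assumptions, not Snorkel Flow's code: the function name `aggregate_multi_label`, the representation of each annotation as a dictionary mapping class names to `True` (present) or `False` (absent), and the `seed` parameter used to make the coin flip reproducible are all hypothetical.

```python
import random


def aggregate_multi_label(annotations, classes, seed=0):
    """Aggregate present/absent votes independently for each class.

    `annotations` is a list of dicts, one per annotator, mapping each
    class name to True (present) or False (absent).
    """
    rng = random.Random(seed)
    result = {}
    for cls in classes:
        present = sum(1 for a in annotations if a[cls])
        absent = len(annotations) - present
        if present == absent:
            # Tie: present and absent are equally likely.
            result[cls] = rng.random() < 0.5
        else:
            result[cls] = present > absent
    return result


# Class A: 6 present, 4 absent. Class B: 4 present, 6 absent.
annotations = (
    [{"A": True, "B": True}] * 4
    + [{"A": True, "B": False}] * 2
    + [{"A": False, "B": False}] * 4
)
print(aggregate_multi_label(annotations, ["A", "B"]))
# {'A': True, 'B': False}
```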

Sequence tagging

Snorkel Flow uses a simple majority algorithm as the aggregation strategy for each span in sequence tagging applications. Based on the number of votes for each label in any interval where annotators disagree, Snorkel Flow selects the label with the most votes as the final decision.

If two or more labels tie for the most votes and the negative class is one of them, the resulting label is the negative class. This reduces the risk of false positives, such as trailing spaces or mistaken tokens. If the tie is between only positive labels, Snorkel Flow picks one of the tied labels at random, using a pre-determined random seed.
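
A minimal sketch of the vote for a single disputed interval is shown below. It is not Snorkel Flow's implementation. It assumes that the negative class wins only when it is among the tied labels, and the function name `aggregate_interval`, the `negative_label` value `"O"`, and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def aggregate_interval(votes, negative_label="O", seed=0):
    """Pick the label for one interval where annotators disagree."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = sorted(label for label, n in counts.items() if n == top)
    if len(tied) == 1:
        # Clear majority.
        return tied[0]
    if negative_label in tied:
        # Assumed rule: a tie that includes the negative class
        # resolves to the negative class to avoid false positives.
        return negative_label
    # Tie between positive labels only: seeded random choice.
    return random.Random(seed).choice(tied)


print(aggregate_interval(["PER", "PER", "O"]))  # PER
print(aggregate_interval(["PER", "O"]))         # O
```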