Version: 0.94

Upload ground truth

This page shows how to upload ground truth (GT) in Snorkel Flow for various applications. Snorkel Flow has two types of GT, document-level GT and span-level GT, and the sections below describe how to upload each.

Document level GT

Document level GT often applies to classification applications, where each document has one label.

Document-level GT can be included in the data source files as a label column that holds the GT label for each row of data. During application creation, set the Ground truth column to the name of that label column.

You can also upload document-level GT from the Datasource upload page in the Model node. In that case, the GT must contain the following columns:

  • uid (int): The uid of the document.
  • label (str): The GT label of the document.
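As an illustration, the two columns above could be assembled as follows. This is a minimal sketch that assumes a CSV file is an acceptable upload format (this page does not state the format); the labels, the helper name build_document_gt_csv, and the example uids are hypothetical.

```python
import csv
import io

# Hypothetical GT labels keyed by document uid; replace with your own data.
doc_labels = {0: "spam", 1: "ham", 2: "spam"}


def build_document_gt_csv(labels):
    """Return CSV text with the uid (int) and label (str) columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["uid", "label"], lineterminator="\n")
    writer.writeheader()
    for uid, label in sorted(labels.items()):
        # Cast explicitly so the column types match the list above.
        writer.writerow({"uid": int(uid), "label": str(label)})
    return buffer.getvalue()


document_gt_csv = build_document_gt_csv(doc_labels)
print(document_gt_csv)
```

The resulting text can be written to a file and uploaded from the Datasource upload page.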

Span level GT

Span level GT often applies to information extraction applications, where each span has one label (see the Information extraction: Extracting execution dates from contracts tutorial). The span level ground truth can only be uploaded from the Datasource upload page in the Model node. We require that the uploaded span level GT contain the following columns:

  • context_uid (int): The uid of the document from which the span was extracted. All spans that have the same context_uid come from the same document. The context_uid corresponds to the UID column specified when the dataset is created (for more information, see Data upload).
  • span_field (str): The field from which the span was extracted.
  • char_start (int): The index of the first character of the span in the document.
  • char_end (int): The index of the last character of the span in the document.
  • _gt_label (str): The GT label of the span.

For example, the following is the ground truth for a single document with context_uid = 0 and an extraction field of text:

gt_label    context_uid    span_field    char_start    char_end
class_1     0              text          17            28
class_2     0              text          37            46
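Rows like these can be derived from the document text rather than counted by hand. The sketch below assumes zero-based character indices and an inclusive char_end (following "the index of the last character" in the column list); the helper make_span_gt_row and the sample sentence are hypothetical, and the label column is named _gt_label as in the column list.

```python
def make_span_gt_row(context_uid, span_field, document_text, span_text, label):
    """Build one span-level GT row for the first occurrence of span_text."""
    # str.index raises ValueError if the span text is not in the document.
    char_start = document_text.index(span_text)
    # Assumption: char_end is inclusive, i.e. the index of the last character.
    char_end = char_start + len(span_text) - 1
    return {
        "_gt_label": str(label),
        "context_uid": int(context_uid),
        "span_field": str(span_field),
        "char_start": char_start,
        "char_end": char_end,
    }


# Hypothetical document and span, for illustration only.
document_text = "Signed on 2020-05-05 by both parties."
row = make_span_gt_row(0, "text", document_text, "2020-05-05", "class_1")
print(row)
```

Slicing document_text from char_start through char_end + 1 should return the original span text, which is a quick way to validate offsets before uploading.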

Note: Only spans already extracted by the candidate extractor can be recognized as GT. The extracted span and the GT span must have the same values of context_uid, char_start, and char_end.
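Before uploading, it can help to check locally which GT spans would be recognized under this rule. The following is a minimal sketch, not part of the Snorkel Flow SDK: it assumes you have the extracted spans available as a list of dictionaries, and the extracted offsets shown are hypothetical.

```python
def span_key(span):
    """The three values that must agree between a GT span and an extracted span."""
    return (span["context_uid"], span["char_start"], span["char_end"])


def split_by_match(gt_spans, extracted_spans):
    """Separate GT spans into those with an exact extracted match and those without."""
    extracted_keys = {span_key(s) for s in extracted_spans}
    matched = [s for s in gt_spans if span_key(s) in extracted_keys]
    unmatched = [s for s in gt_spans if span_key(s) not in extracted_keys]
    return matched, unmatched


# GT rows from the example table; extracted spans are hypothetical.
gt_spans = [
    {"_gt_label": "class_1", "context_uid": 0, "span_field": "text", "char_start": 17, "char_end": 28},
    {"_gt_label": "class_2", "context_uid": 0, "span_field": "text", "char_start": 37, "char_end": 46},
]
extracted_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 36, "char_end": 46},  # off by one from the GT span
]

matched, unmatched = split_by_match(gt_spans, extracted_spans)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

Here the class_2 span would not be recognized, because its char_start differs from the extracted span by one character.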

We recommend iterating on the candidate extractor until its recall is as close to 100% as possible. You can evaluate the candidate extractor with the sf.get_candidate_extractor_metrics SDK method (see Candidate extractor scoring for more details).

If most of your GT spans overlap with the extracted spans, consider resolving your GT so that it exactly matches the extracted spans. For example, in the sentence "It's a lovely day.", if the GT span is "day" but the extracted span is "lovely day", you can resolve the GT span to "lovely day" and use the char_start and char_end of "lovely day" in the span-level GT.
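One way to automate this resolution is to snap each GT span to the extracted span in the same document that shares the most characters with it. This is only a sketch under assumptions: indices are zero-based, char_end is inclusive, and the helper names overlap_length and resolve_gt_span are hypothetical rather than SDK functions.

```python
def overlap_length(a, b):
    """Number of characters shared by two spans (char_end treated as inclusive)."""
    shared = min(a["char_end"], b["char_end"]) - max(a["char_start"], b["char_start"]) + 1
    return max(0, shared)


def resolve_gt_span(gt_span, extracted_spans):
    """Return a copy of gt_span snapped to the best-overlapping extracted span, or None."""
    same_doc = [s for s in extracted_spans if s["context_uid"] == gt_span["context_uid"]]
    best = max(same_doc, key=lambda s: overlap_length(gt_span, s), default=None)
    if best is None or overlap_length(gt_span, best) == 0:
        return None  # nothing overlaps; the extractor likely needs another iteration
    resolved = dict(gt_span)
    resolved["char_start"] = best["char_start"]
    resolved["char_end"] = best["char_end"]
    return resolved


document_text = "It's a lovely day."
# GT span "day" and two extracted spans, "It's" and "lovely day".
gt_span = {"_gt_label": "class_1", "context_uid": 0, "span_field": "text", "char_start": 14, "char_end": 16}
extracted_spans = [
    {"context_uid": 0, "char_start": 0, "char_end": 3},
    {"context_uid": 0, "char_start": 7, "char_end": 16},
]

resolved = resolve_gt_span(gt_span, extracted_spans)
print(document_text[resolved["char_start"]:resolved["char_end"] + 1])
```

Taking the largest overlap is a design choice for this sketch; when a GT span overlaps several extracted spans about equally, reviewing those cases by hand is safer than resolving them automatically.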