Version: 25.2

Upload ground truth

This topic demonstrates how to upload ground truth (GT) in Snorkel Flow. There are two types of GT in Snorkel Flow: document-level GT and span-level GT.

Document-level ground truth

Document-level GT often applies to classification applications, where each document has one label.

Uploading from a dataset

In a Dataset, navigate to the Data Sources tab. From there, click the Upload ground truth button on the right, then add your ground truth file for upload.

note

Only classification label schemas are supported for upload on the Datasets page.

Uploading from an application

Document-level GT can be included in the data source files as a label column that holds the GT label for each row of the data. During application creation, set the Ground truth column to the name of that label column.

You can also upload the document-level GT from the Datasource upload page in the Model pane.

Data format

The document-level GT contains the following columns:

  • uid (int): The uid of the document.
  • label (str): The GT label of the document.
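
As a minimal sketch of the two-column layout above, the following writes document-level GT rows with the Python standard library. The example labels and the choice of CSV as the file format are assumptions made for illustration; this page does not state which file formats the upload accepts.

```python
import csv
import io

# Hypothetical labeled documents as (uid, label) pairs.
rows = [(0, "spam"), (1, "not_spam"), (2, "spam")]


def write_document_gt(rows, fileobj):
    """Write one GT row per document using the uid and label columns."""
    writer = csv.writer(fileobj)
    writer.writerow(["uid", "label"])
    for uid, label in rows:
        # uid is written as an int and label as a str, matching the column types.
        writer.writerow([int(uid), str(label)])


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_document_gt(rows, buffer)
gt_csv = buffer.getvalue()
print(gt_csv)
```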

Span-level ground truth

Span-level GT often applies to information extraction applications, where each span has one label. For more, see the Information extraction: Extracting execution dates from contracts tutorial.

Uploading from an application

The span-level ground truth can only be uploaded from the Datasource upload page in the Model node.

Data format

The uploaded span-level GT must contain the following columns:

  • context_uid (int): The UID of the document from which the span was extracted. All spans that have the same context_uid come from the same document. The context_uid is associated with the UID column when one creates the dataset. For more, see Data upload.
  • span_field (str): The field from which the span was extracted.
  • char_start (int): The index of the first character of the span in the document.
  • char_end (int): The index of the last character of the span in the document.
  • _gt_label (str): The GT label of the span.

For example, below is the ground truth for a single document with context_uid = 0 and an extraction field of text:

_gt_label   context_uid   span_field   char_start   char_end
class_1     0             text         17           28
class_2     0             text         37           46

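
One way to produce rows in this shape is to compute the character offsets from the document text. The sketch below is illustrative only: the helper name make_span_row, the sample sentence, and the execution_date label are hypothetical, and char_end is treated as inclusive because the column description above calls it the index of the last character of the span.

```python
def make_span_row(context_uid, span_field, text, span_text, label):
    """Build one span-level GT row by locating span_text inside text."""
    # str.find returns the first occurrence only; repeated spans need extra handling.
    char_start = text.find(span_text)
    if char_start == -1:
        raise ValueError(f"{span_text!r} not found in document {context_uid}")
    return {
        "context_uid": context_uid,
        "span_field": span_field,
        "char_start": char_start,
        # Inclusive end: index of the last character of the span (assumption
        # based on the column description).
        "char_end": char_start + len(span_text) - 1,
        "_gt_label": label,
    }


text = "Signed on 2020-01-05 by both parties."
row = make_span_row(0, "text", text, "2020-01-05", "execution_date")
print(row)
```

Slicing with text[row["char_start"]:row["char_end"] + 1] recovers the span, which is a quick sanity check before upload.
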
note

Only spans already extracted by the candidate extractor can be recognized as GT. The extracted span and the GT span must have the same values of context_uid, char_start, and char_end.

We recommend iterating on the extractor until recall is as close as possible to 100%. You can evaluate the performance of the candidate extractor via the sf.get_candidate_extractor_metrics SDK method. For more, see Candidate extractor scoring.
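
To make the exact-match rule and the recall target concrete, here is a small local sketch that computes recall as the fraction of GT spans whose (context_uid, char_start, char_end) also appears among the extracted spans. It is not the sf.get_candidate_extractor_metrics method and makes no claim about how that method computes its metrics; the function names and the extracted spans are hypothetical.

```python
def span_key(span):
    """Identity of a span for exact matching: document plus character offsets."""
    return (span["context_uid"], span["char_start"], span["char_end"])


def exact_match_recall(gt_spans, extracted_spans):
    """Return (recall, unmatched GT spans) under exact matching."""
    if not gt_spans:
        raise ValueError("gt_spans must not be empty")
    extracted_keys = {span_key(s) for s in extracted_spans}
    unmatched = [s for s in gt_spans if span_key(s) not in extracted_keys]
    recall = (len(gt_spans) - len(unmatched)) / len(gt_spans)
    return recall, unmatched


# GT spans taken from the example table; extracted spans are made up.
gt_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 37, "char_end": 46},
]
extracted_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 35, "char_end": 46},  # overlaps, but not exact
]

recall, unmatched = exact_match_recall(gt_spans, extracted_spans)
print(recall, unmatched)
```

The unmatched list shows which GT spans would not be recognized, which is useful when deciding how to adjust the extractor.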

If most of your GT spans overlap with the extracted spans, consider resolving the GT spans so that they exactly match the extracted spans. For example, in the sentence It's a lovely day., if the GT span is day but the extracted span is lovely day, you can resolve the GT span to lovely day and use the char_start and char_end of lovely day in the span-level GT.
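
The resolution step can be sketched as follows, using the lovely day example. This is one possible approach under stated assumptions, not a Snorkel Flow feature: the helper names are hypothetical, char_end is treated as inclusive, and when several extracted spans overlap a GT span the one sharing the most characters is chosen.

```python
def overlap_length(a, b):
    """Number of characters shared by two spans with inclusive char_end."""
    start = max(a["char_start"], b["char_start"])
    end = min(a["char_end"], b["char_end"])
    return max(0, end - start + 1)


def resolve_gt_span(gt_span, extracted_spans):
    """Snap a GT span to the best-overlapping extracted span, or return None."""
    candidates = [
        s for s in extracted_spans
        if s["context_uid"] == gt_span["context_uid"]
        and s["span_field"] == gt_span["span_field"]
        and overlap_length(gt_span, s) > 0
    ]
    if not candidates:
        return None  # nothing overlaps; the extractor itself needs work
    best = max(candidates, key=lambda s: overlap_length(gt_span, s))
    resolved = dict(gt_span)  # keep the label, replace only the offsets
    resolved["char_start"] = best["char_start"]
    resolved["char_end"] = best["char_end"]
    return resolved


text = "It's a lovely day."
gt_span = {"context_uid": 0, "span_field": "text",
           "char_start": 14, "char_end": 16, "_gt_label": "class_1"}  # "day"
extracted_spans = [{"context_uid": 0, "span_field": "text",
                    "char_start": 7, "char_end": 16}]  # "lovely day"

resolved = resolve_gt_span(gt_span, extracted_spans)
print(text[resolved["char_start"]:resolved["char_end"] + 1])
```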