Upload ground truth
This topic demonstrates how to upload ground truth (GT) in Snorkel Flow. There are two types of GT in Snorkel Flow: document-level GT and span-level GT.
Document-level ground truth
Document-level GT often applies to classification applications, where each document has one label.
Uploading from a dataset
In a Dataset, navigate to the Data Sources tab. From there, click the Upload ground truth button on the right, then add your ground truth file for upload.
Only classification label schemas are supported for upload on the Datasets page.
Uploading from an application
Document-level GT can be included in the data source files as a label column that gives the GT label for each row of data. During application creation, set Ground truth column to the name of that label column.
You can also upload the document-level GT from the Datasource upload page in the Model pane.
Data format
The GT contains the following columns:
uid (int): The UID of the document.
label (str): The GT label of the document.
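As a minimal sketch, a document-level GT file with these two columns could be prepared as follows. The CSV format, the helper name `build_document_gt_csv`, and the sample labels are assumptions for illustration only; check which file formats your Snorkel Flow version accepts.

```python
import csv
import io

# Hypothetical labels keyed by document uid; replace with your own data.
doc_labels = {0: "spam", 1: "ham", 2: "spam"}


def build_document_gt_csv(labels):
    """Serialize {uid: label} into CSV text with `uid` and `label` columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["uid", "label"])
    for uid, label in sorted(labels.items()):
        # uid is written as an int and label as a str, per the column types.
        writer.writerow([int(uid), str(label)])
    return buffer.getvalue()


gt_csv = build_document_gt_csv(doc_labels)
print(gt_csv)
```

Write `gt_csv` to a file to get something you can upload; the row order does not matter for this sketch.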
Span-level ground truth
Span-level GT often applies to information extraction applications, where each span has one label. For more, see the Information extraction: Extracting execution dates from contracts tutorial.
Uploading from an application
The span-level ground truth can only be uploaded from the Datasource upload page in the Model node.
Data format
The uploaded span-level GT must contain the following columns:
context_uid (int): The UID of the document from which the span was extracted. All spans that have the same context_uid come from the same document. The context_uid corresponds to the UID column specified when the dataset is created. For more, see Data upload.
span_field (str): The field from which the spans were extracted.
char_start (int): The index of the first character of the span in the document.
char_end (int): The index of the last character of the span in the document.
_gt_label (str): The GT label of the span.
For example, below is the ground truth for a single document with context_uid = 0 and an extraction field of text:
_gt_label | context_uid | span_field | char_start | char_end |
---|---|---|---|---|
class_1 | 0 | text | 17 | 28 |
class_2 | 0 | text | 37 | 46 |
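The sketch below shows one way to derive a row like those above from raw document text. It assumes `char_end` is inclusive (the index of the last character, as the column description states) and that the first occurrence of the span text is the one you want. The helper `make_span_row` and the sample sentence are hypothetical.

```python
def make_span_row(context_uid, span_field, text, span_text, label):
    """Build one span-level GT row for the first occurrence of span_text."""
    char_start = text.find(span_text)
    if char_start == -1:
        raise ValueError(f"{span_text!r} not found in document {context_uid}")
    # Assumption: char_end is the index of the last character (inclusive).
    char_end = char_start + len(span_text) - 1
    return {
        "_gt_label": label,  # column name taken from the column list above
        "context_uid": context_uid,
        "span_field": span_field,
        "char_start": char_start,
        "char_end": char_end,
    }


document_text = "It's a lovely day."
row = make_span_row(0, "text", document_text, "lovely day", "class_1")
print(row)
```

With an inclusive `char_end`, slicing the text back out requires `text[char_start:char_end + 1]`, which is an easy place for off-by-one errors.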
Only spans already extracted by the candidate extractor can be recognized as GT. An extracted span and a GT span match only if they have the same values of context_uid, char_start, and char_end.
We recommend iterating on the extractor until its recall is as close as possible to 100%. You can evaluate the performance of the candidate extractor with the sf.get_candidate_extractor_metrics SDK method. For more, see Candidate extractor scoring.
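Before uploading, it can help to check locally which GT spans would be recognized. The sketch below applies the exact-match rule on context_uid, char_start, and char_end, and computes a rough recall figure. It is a plain-Python illustration and not the SDK method; the helper names and the extracted spans are hypothetical.

```python
def span_key(span):
    """Identity of a span under the exact-match rule."""
    return (span["context_uid"], span["char_start"], span["char_end"])


def split_by_match(gt_spans, extracted_spans):
    """Return (matched, unmatched) GT spans relative to the extracted spans."""
    extracted_keys = {span_key(s) for s in extracted_spans}
    matched = [s for s in gt_spans if span_key(s) in extracted_keys]
    unmatched = [s for s in gt_spans if span_key(s) not in extracted_keys]
    return matched, unmatched


# GT rows from the example table; extracted spans are made up for illustration.
gt_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 37, "char_end": 46},
]
extracted_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 30, "char_end": 46},
]

matched, unmatched = split_by_match(gt_spans, extracted_spans)
# Local estimate only: share of GT spans with an exactly matching extracted span.
recall = len(matched) / len(gt_spans) if gt_spans else 0.0
print(f"matched={len(matched)} unmatched={len(unmatched)} recall={recall:.2f}")
```

The unmatched list shows where the extractor misses or where GT offsets differ from the extracted spans.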
If most of your GT spans overlap with extracted spans, consider resolving your GT so that it exactly matches the extracted spans. For example, in the sentence It's a lovely day., if the GT span is day but the extracted span is lovely day, you can resolve the GT span to lovely day and use the char_start and char_end of lovely day in the span-level GT.
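One possible way to automate this resolution is sketched below. It assumes inclusive `char_end` values and picks the extracted span with the largest character overlap. The helper names are hypothetical, and the tie-breaking rule is a design choice for this example, not documented Snorkel Flow behavior.

```python
def overlap_length(a, b):
    """Number of shared characters between two spans (0 if none)."""
    if a["context_uid"] != b["context_uid"]:
        return 0
    start = max(a["char_start"], b["char_start"])
    end = min(a["char_end"], b["char_end"])
    return max(0, end - start + 1)  # +1 because char_end is inclusive


def resolve_gt_span(gt_span, extracted_spans):
    """Snap a GT span to the extracted span it overlaps most, or return None."""
    best = max(extracted_spans, key=lambda e: overlap_length(gt_span, e), default=None)
    if best is None or overlap_length(gt_span, best) == 0:
        return None
    resolved = dict(gt_span)  # keep the label and other columns unchanged
    resolved["char_start"] = best["char_start"]
    resolved["char_end"] = best["char_end"]
    return resolved


text = "It's a lovely day."
# GT span "day" and extracted span "lovely day", with offsets computed from the text.
gt_span = {"context_uid": 0, "char_start": text.find("day"),
           "char_end": text.find("day") + len("day") - 1}
extracted = [{"context_uid": 0, "char_start": text.find("lovely day"),
              "char_end": text.find("lovely day") + len("lovely day") - 1}]

resolved = resolve_gt_span(gt_span, extracted)
print(text[resolved["char_start"]:resolved["char_end"] + 1])
```

GT spans that overlap no extracted span come back as `None`; those still require changes to the extractor.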