Upload ground truth
This page shows how to upload ground truth (GT) in Snorkel Flow for various applications. Below, we describe the two types of GT in Snorkel Flow (document-level GT and span-level GT) and how to upload each.
Document-level GT
Document-level GT often applies to classification applications, where each document has one label.
Document-level GT can be included in the data source files as a label column that holds the GT label for each row of the data. During application creation, specify the name of this label column as the Ground truth column.
You can also upload document-level GT from the Datasource upload page in the Model node. In particular, the GT contains the following columns:
uid (int): The uid of the document.
label (str): The GT label of the document.
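The two-column layout above can be sketched as a small file-building script. This is only an illustration: the page does not state which file formats the upload page accepts, so CSV is an assumption here, and the label values and the helper name build_document_gt_csv are hypothetical.

```python
import csv
import io

# Hypothetical (uid, label) pairs; the label names are placeholders.
labeled_docs = [(0, "spam"), (1, "not_spam"), (2, "spam")]


def build_document_gt_csv(rows):
    """Return CSV text with a header row and one (uid, label) row per document."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["uid", "label"])
    for uid, label in rows:
        # uid is written as an int and label as a str, matching the column types above.
        writer.writerow([int(uid), str(label)])
    return buffer.getvalue()


# Assumption: the result would be saved to a file and uploaded as document-level GT.
document_gt_csv = build_document_gt_csv(labeled_docs)
print(document_gt_csv)
```

Each uid should correspond to a document that already exists in the dataset, since the GT is attached to documents by uid.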
Span-level GT
Span-level GT often applies to information extraction applications, where each span has one label (see the Information extraction: Extracting execution dates from contracts tutorial). Span-level GT can only be uploaded from the Datasource upload page in the Model node. The uploaded span-level GT must contain the following columns:
context_uid (int): The uid of the document from which the span was extracted. All spans that have the same context_uid come from the same document. The context_uid is associated with the UID column when one creates the dataset (for more information, see Data upload).
span_field (str): The field where the spans were extracted from.
char_start (int): The index of the first character of the span in the document.
char_end (int): The index of the last character of the span in the document.
_gt_label (str): The GT label of the span.
For example, below is the ground truth for a single document with context_uid = 0 and an extraction field of text:
gt_label  context_uid  span_field  char_start  char_end
class_1   0            text        17          28
class_2   0            text        37          46
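Rows like the ones above can be derived from the document text rather than counted by hand. The following is a minimal sketch under stated assumptions: the document text, the helper names make_span_gt_row and span_gt_to_csv, and the CSV output format are all hypothetical, and char_end is treated as inclusive because the column description calls it the index of the last character of the span.

```python
import csv
import io

# Header names as they appear in the example table. The column list spells the
# label column "_gt_label", so confirm which name your deployment expects.
COLUMNS = ["gt_label", "context_uid", "span_field", "char_start", "char_end"]


def make_span_gt_row(document_text, span_text, gt_label, context_uid, span_field="text"):
    """Build one span-level GT row for the first occurrence of span_text."""
    char_start = document_text.find(span_text)
    if char_start == -1:
        raise ValueError(f"{span_text!r} does not occur in the document")
    # Inclusive end: the index of the last character of the span.
    char_end = char_start + len(span_text) - 1
    return {
        "gt_label": gt_label,
        "context_uid": context_uid,
        "span_field": span_field,
        "char_start": char_start,
        "char_end": char_end,
    }


def span_gt_to_csv(rows):
    """Serialize span-level GT rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical document with context_uid = 0.
document_text = "This agreement is effective as of 1 March 2020 and expires on 1 March 2025."
rows = [
    make_span_gt_row(document_text, "1 March 2020", "class_1", context_uid=0),
    make_span_gt_row(document_text, "1 March 2025", "class_2", context_uid=0),
]
span_gt_csv = span_gt_to_csv(rows)
print(span_gt_csv)
```

A quick sanity check is to slice the text with document_text[char_start:char_end + 1] and confirm that it returns the intended span.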
Note: Only spans already extracted by the candidate extractor can be recognized as GT. The extracted span and the GT span must have the same values of context_uid, char_start, and char_end.
We recommend iterating on the extractor until its recall is as close as possible to 100%. You can evaluate the performance of the candidate extractor with the sf.get_candidate_extractor_metrics SDK method (see Candidate extractor scoring for more details).
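To make the recall target concrete, the sketch below computes exact-match recall by hand. It is not the SDK method and does not reproduce its signature or output; the function names and the sample spans are hypothetical, and a GT span counts as recovered only when an extracted span has the same context_uid, char_start, and char_end.

```python
def span_key(span):
    """Identity of a span for exact matching."""
    return (span["context_uid"], span["char_start"], span["char_end"])


def exact_match_recall(gt_spans, extracted_spans):
    """Return (recall, missed_gt_spans) under exact matching."""
    if not gt_spans:
        raise ValueError("gt_spans must not be empty")
    extracted_keys = {span_key(s) for s in extracted_spans}
    missed = [s for s in gt_spans if span_key(s) not in extracted_keys]
    recall = (len(gt_spans) - len(missed)) / len(gt_spans)
    return recall, missed


# Hypothetical GT and candidate spans.
gt_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 37, "char_end": 46},
]
extracted_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},  # exact match
    {"context_uid": 0, "char_start": 30, "char_end": 46},  # overlaps, but not exact
    {"context_uid": 1, "char_start": 37, "char_end": 46},  # different document
]

recall, missed = exact_match_recall(gt_spans, extracted_spans)
print(recall, missed)
```

The list of missed spans is usually more useful than the number itself, because it shows which GT spans the extractor still needs to cover.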
If most of your GT spans overlap with the extracted spans, consider resolving your GT spans so that they exactly match the extracted spans. For example, in the sentence "It's a lovely day.", if the GT span is "day" but the extracted span is "lovely day", you can resolve the GT span to "lovely day" and use the char_start and char_end of "lovely day" in the span-level GT.
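The resolution step described above can be sketched as follows. This is one possible approach, not a Snorkel Flow feature: the function names are hypothetical, char_end is assumed to be inclusive, and choosing the extracted span with the largest character overlap is a design choice made here for illustration.

```python
def overlap_length(a, b):
    """Number of characters shared by two spans with inclusive char_end (0 if none)."""
    if a["context_uid"] != b["context_uid"]:
        return 0
    start = max(a["char_start"], b["char_start"])
    end = min(a["char_end"], b["char_end"])
    return max(0, end - start + 1)


def resolve_gt_span(gt_span, extracted_spans):
    """Snap a GT span to an extracted span, or return None if nothing overlaps."""
    # An exact match needs no change.
    for span in extracted_spans:
        if (span["context_uid"], span["char_start"], span["char_end"]) == (
            gt_span["context_uid"], gt_span["char_start"], gt_span["char_end"]
        ):
            return dict(gt_span)
    # Otherwise take the extracted span that shares the most characters.
    best = max(extracted_spans, key=lambda s: overlap_length(gt_span, s), default=None)
    if best is None or overlap_length(gt_span, best) == 0:
        return None
    resolved = dict(gt_span)
    resolved["char_start"] = best["char_start"]
    resolved["char_end"] = best["char_end"]
    return resolved


sentence = "It's a lovely day."
# GT span "day" occupies characters 14..16 of the sentence.
gt_span = {"gt_label": "class_1", "context_uid": 0, "span_field": "text",
           "char_start": 14, "char_end": 16}
# Extracted span "lovely day" occupies characters 7..16.
extracted_spans = [{"context_uid": 0, "char_start": 7, "char_end": 16}]

resolved = resolve_gt_span(gt_span, extracted_spans)
print(sentence[resolved["char_start"]:resolved["char_end"] + 1])
```

GT spans for which the function returns None have no overlapping candidate and point to a gap in the extractor rather than a boundary mismatch.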