Ground truth formats for different ML tasks
Text classification / PDF classification
Document-level ground truth (GT) often applies to classification applications, where each document has one label.
The document-level GT can be added to the data source files, where a label column indicates the GT label for each row of the data. During application creation, you will need to specify the Ground truth column by the name of the label column.
You can also upload the document-level GT from the Datasource upload page for the Application / Model you are working on. To get there, select Overview from the top of the Studio page, find the "View Data Sources" button at the bottom of the Overview page, then click the "Upload GTs" button. In particular, the GT contains the following columns:
- uid (int): The uid of the document.
- label (str): The GT label of the document.
For example, below is the ground truth for four documents:
uid | label |
---|---|
1 | POSITIVE |
2 | NEGATIVE |
3 | POSITIVE |
4 | POSITIVE |
Multi-label text classification
Multi-label text classification ground truth for a document is a JSON dump of a mapping from label to label marker, where the label marker is one of "PRESENT", "ABSENT", or "ABSTAIN". A PRESENT marker indicates that the document is labeled positively for the corresponding label, while an ABSENT marker indicates that it is labeled negatively. An ABSTAIN marker indicates that the document is labeled neither positively nor negatively for that label. ABSTAIN is a way to provide partial ground truth for a document and comes in handy when the ground truth for a particular label is not known at the time of labeling.
A special label "_default" is also supported as a proxy for labels missing from the mapping. Think of it as a catch-all for everything else. This helps reduce the memory needed to store ground truth.
For example,
json.dumps({
"class_1" : "PRESENT",
"class_2" : "ABSTAIN",
"class_3" : "ABSENT",
"class_4" : "ABSENT",
"class_5" : "ABSENT"
})
is the same as
json.dumps({
"class_1" : "PRESENT",
"class_2" : "ABSTAIN",
"_default" : "ABSENT",
})
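The equivalence of an explicit mapping and a compact one that uses "_default" can be checked with a short sketch. The helper name expand_default and the explicit list of all labels are assumptions made for illustration; they are not part of the Snorkel Flow SDK.

```python
import json

VALID_MARKERS = {"PRESENT", "ABSENT", "ABSTAIN"}


def expand_default(gt_json, all_labels):
    """Expand a multi-label GT string into an explicit label -> marker mapping.

    Hypothetical helper for illustration only. Labels missing from the
    mapping take the marker stored under "_default"; if there is no
    "_default" entry, this sketch falls back to "ABSTAIN" (an assumption).
    """
    mapping = json.loads(gt_json)
    default = mapping.pop("_default", "ABSTAIN")
    expanded = {label: mapping.get(label, default) for label in all_labels}
    bad = {marker for marker in expanded.values() if marker not in VALID_MARKERS}
    if bad:
        raise ValueError(f"Unknown label markers: {sorted(bad)}")
    return expanded


labels = ["class_1", "class_2", "class_3", "class_4", "class_5"]
explicit = json.dumps({
    "class_1": "PRESENT",
    "class_2": "ABSTAIN",
    "class_3": "ABSENT",
    "class_4": "ABSENT",
    "class_5": "ABSENT",
})
compact = json.dumps({
    "class_1": "PRESENT",
    "class_2": "ABSTAIN",
    "_default": "ABSENT",
})

# Both strings describe the same ground truth once "_default" is expanded.
print(expand_default(explicit, labels) == expand_default(compact, labels))
```

The compact form grows with the number of non-default labels rather than with the size of the label space, which is where the memory saving comes from.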
Information extraction
Span-level GT often applies to information extraction applications, where each span has one label. The span-level ground truth can only be uploaded from the Datasource upload page for the Application / Model you are working on. To get there, select Overview from the top of the Studio page, find the "View Data Sources" button at the bottom of the Overview page, then click the "Upload GTs" button. Snorkel Flow requires that the uploaded span-level GT contain the following columns:
- context_uid (int): The uid of the document from which the span was extracted. All spans that have the same context_uid come from the same document. The context_uid is associated with the UID column when one creates the dataset.
- span_field (str): The field where the spans were extracted from.
- char_start (int): The index of the first character of the span in the document.
- char_end (int): The index of the last character of the span in the document.
- _gt_label (str): The GT label of the span.
Below is an example of ground truth for a single document with context_uid=0.
_gt_label | context_uid | span_field | char_start | char_end |
---|---|---|---|---|
class_1 | 0 | text | 17 | 28 |
class_2 | 0 | text | 37 | 46 |
Only the spans already extracted by the candidate extractor can be recognized as GT. Both the extracted span and the span in the GT must have the same values of context_uid, char_start, and char_end.
A GT span will not be used if it does not exactly match an existing candidate extracted span. To verify which GT spans are included in an application after upload, you can use the sf.export_ground_truth SDK method.
We recommend iterating on the extractor until its recall is as close as possible to 100%. You can evaluate the performance of the candidate extractor via the sf.get_candidate_extractor_metrics SDK method.
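The exact-match rule and the recall goal above can be checked locally before uploading. The sketch below is a minimal illustration: split_gt_by_candidate_match is a hypothetical helper, the candidate spans are made up, and nothing here calls sf.export_ground_truth or sf.get_candidate_extractor_metrics.

```python
def split_gt_by_candidate_match(gt_rows, candidate_spans):
    """Partition GT rows into exact matches of a candidate span and the rest.

    Hypothetical helper for illustration; both inputs are lists of dicts
    with the keys context_uid, char_start and char_end.
    """
    def key(row):
        return (row["context_uid"], row["char_start"], row["char_end"])

    candidates = {key(span) for span in candidate_spans}
    matched = [row for row in gt_rows if key(row) in candidates]
    unmatched = [row for row in gt_rows if key(row) not in candidates]
    return matched, unmatched


gt_rows = [
    {"_gt_label": "class_1", "context_uid": 0, "span_field": "text",
     "char_start": 17, "char_end": 28},
    {"_gt_label": "class_2", "context_uid": 0, "span_field": "text",
     "char_start": 37, "char_end": 46},
]
# Made-up candidate spans: the second one ends one character early,
# so the class_2 GT span would not be recognized.
candidate_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 37, "char_end": 45},
]

matched, unmatched = split_gt_by_candidate_match(gt_rows, candidate_spans)
recall = len(matched) / len(gt_rows)
print(f"matched={len(matched)} unmatched={len(unmatched)} recall={recall:.2f}")
```

Listing the unmatched rows shows which offsets the extractor misses, which is usually the quickest way to decide how to adjust it.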
PDF information extraction
The span-level ground truth can only be uploaded from the Datasource upload page for the Application / Model you are working on. To get there, select Overview from the top of the Studio page, find the "View Data Sources" button at the bottom of the Overview page, then click the "Upload GTs" button. In particular, the GT contains the following columns:
- uid (str): The data point UID of the span. The UID here has to match the span data point UIDs exactly. Example UID: span::17,2,rich_doc_text,6e85bf3f0698497465102d9104bfb4fe,412,442 where:
  - 17 is the context_uid (int): the uid of the document from which the span was extracted.
  - 2 is the page_id (int): the number of the page in the document. This field only exists if you've selected Split docs by page when creating a PDF application or have added a Page splitter manually. Otherwise, it should be omitted, including the related comma (e.g., using the example UID above: span::17,rich_doc_text,6e85bf3f0698497465102d9104bfb4fe,412,442).
  - rich_doc_text is the span_field (str): the field where the spans were extracted from.
  - 6e85bf3f0698497465102d9104bfb4fe is the span_field_hash_value (str): the hash of the value of the field that the span is extracted from. It can be computed by calling hashlib.md5(span_field_value.encode("utf-8")).hexdigest() in Python.
  - 412 is the char_start (int): the index of the first character of the span in the document.
  - 442 is the char_end (int): the index of the last character of the span in the document.
- label (str): The GT label of the span.
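The UID layout described above can be assembled programmatically. The following is a minimal sketch, assuming the components are joined exactly as in the example UID; build_span_uid is a hypothetical helper and the field value is made up, so the resulting hash will differ from the one in the example.

```python
import hashlib


def build_span_uid(context_uid, span_field, span_field_value,
                   char_start, char_end, page_id=None):
    """Assemble a span UID string in the format described above.

    Hypothetical helper for illustration; page_id is included only when
    documents are split by page.
    """
    field_hash = hashlib.md5(span_field_value.encode("utf-8")).hexdigest()
    parts = [str(context_uid)]
    if page_id is not None:
        parts.append(str(page_id))
    parts += [span_field, field_hash, str(char_start), str(char_end)]
    return "span::" + ",".join(parts)


# Made-up field value; a real one would be the full content of rich_doc_text.
field_value = "Example rich document text"

# With page splitting enabled, the page_id follows the context_uid.
print(build_span_uid(17, "rich_doc_text", field_value, 412, 442, page_id=2))
# Without page splitting, the page_id and its comma are omitted.
print(build_span_uid(17, "rich_doc_text", field_value, 412, 442))
```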
Conversational AI
The GT can be added to the data source files in the field that will correspond to metadata. The expected data format for Conversational AI is a JSON file that contains a list of objects, where each object represents a conversation. Each conversation in turn contains a list of objects, where each object represents an utterance.
Each conversation dictionary needs to contain the following:
- Field: string key under which list of utterances can be found
- Utterances Path: path to actual utterances if the input JSON has a nested path.
Each utterance dictionary needs to contain the following:
- Speaker field: string key for speaker of each utterance
- Text field: string content of the utterance
- Metadata field: dictionary potentially containing "GT"
Example:
[
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to transfer $500 to XYZ.",
        "frames": {
          "GT": 0
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Okay your money was transferred."
      }
    ]
  },
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to check my balance.",
        "frames": {
          "GT": 1
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Your balance is $100."
      }
    ]
  }
]
In this example:
- Field: "turns"
- Utterances path: None
- Speaker field: "speaker"
- Text field: "utterance"
- Metadata field: "frames" -- this field has ground truth under a key "GT"
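The field mapping above can be exercised with a small reader that walks each conversation and pulls out the GT where it is present. This is a sketch under stated assumptions: iter_utterance_gt is a hypothetical helper, it handles only the case where there is no nested utterances path, and it is unrelated to the SDK import function.

```python
import json

raw = """
[
  {"turns": [
    {"speaker": "USER", "utterance": "I want to transfer $500 to XYZ.",
     "frames": {"GT": 0}},
    {"speaker": "SYSTEM", "utterance": "Okay your money was transferred."}
  ]},
  {"turns": [
    {"speaker": "USER", "utterance": "I want to check my balance.",
     "frames": {"GT": 1}},
    {"speaker": "SYSTEM", "utterance": "Your balance is $100."}
  ]}
]
"""


def iter_utterance_gt(conversations, field="turns", speaker_field="speaker",
                      text_field="utterance", metadata_field="frames",
                      gt_key="GT"):
    """Yield (conversation index, utterance index, speaker, text, GT or None).

    Hypothetical reader for illustration; it assumes the utterances sit
    directly under `field`, with no nested path.
    """
    for conv_idx, conversation in enumerate(conversations):
        for utt_idx, utterance in enumerate(conversation[field]):
            # Utterances without a metadata field simply have no GT.
            metadata = utterance.get(metadata_field) or {}
            yield (conv_idx, utt_idx, utterance[speaker_field],
                   utterance[text_field], metadata.get(gt_key))


rows = list(iter_utterance_gt(json.loads(raw)))
for row in rows:
    print(row)
```

Note that json.loads rejects trailing commas, so a file in this format must not contain them.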
To import ground truth data after you've created the dataset and application, you can use the SDK function import_utterance_ground_truth.
Sequence tagging
Sequence tagging ground truth for a document is a JSON dump of a list of spans, where each span is a triple of (char_start, char_end, label). Here is an example label for a document:
json.dumps([
[0, 29, 'OTHER'],
[29, 40, 'COMPANY'],
[40, 228, 'OTHER'],
[228, 239, 'COMPANY'],
[239, 395, 'OTHER'],
])
The spans cannot be empty (char_start must be smaller than char_end). Overlapping or duplicate spans are not allowed. The char offsets (char_start, char_end) must be sorted.
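These constraints are easy to check before upload. The sketch below is a minimal illustration; validate_sequence_tags is a hypothetical helper, and it assumes that adjacent spans may share a boundary (one span's char_end equal to the next span's char_start), as in the example label above.

```python
import json


def validate_sequence_tags(gt_json):
    """Return the parsed spans if they satisfy the constraints, else raise.

    Hypothetical validator for illustration only.
    """
    spans = json.loads(gt_json)
    previous_end = None
    for char_start, char_end, label in spans:
        if char_start >= char_end:
            raise ValueError(f"Empty span: {(char_start, char_end, label)}")
        # Requiring each span to start at or after the previous end rules out
        # unsorted, overlapping and duplicate spans in a single check.
        if previous_end is not None and char_start < previous_end:
            raise ValueError(
                f"Span out of order or overlapping: {(char_start, char_end, label)}"
            )
        previous_end = char_end
    return spans


example = json.dumps([
    [0, 29, "OTHER"],
    [29, 40, "COMPANY"],
    [40, 228, "OTHER"],
    [228, 239, "COMPANY"],
    [239, 395, "OTHER"],
])
spans = validate_sequence_tags(example)
print(f"{len(spans)} valid spans")
```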
By default, an AsciiCharFilter preprocessor is added to the DAG and filters out the non-ASCII characters from the documents. If you have ground truth that was collected outside of Snorkel Flow, please use the SDK function align_external_ground_truth to align the ground truth before ingesting it into Snorkel Flow.
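To see why alignment is needed, consider what happens to character offsets when non-ASCII characters are removed. The sketch below is only an illustration of the offset shift: remap_spans_after_ascii_filter is a hypothetical helper, it is not the SDK's align_external_ground_truth, and it assumes the filter simply deletes every non-ASCII character.

```python
def remap_spans_after_ascii_filter(text, spans):
    """Drop non-ASCII characters from text and shift span offsets to match.

    Illustration only; assumes each non-ASCII character is deleted outright.
    """
    # new_index[i] = number of ASCII characters before position i in `text`.
    new_index = [0]
    for ch in text:
        new_index.append(new_index[-1] + (1 if ch.isascii() else 0))
    filtered = "".join(ch for ch in text if ch.isascii())
    remapped = [[new_index[start], new_index[end], label]
                for start, end, label in spans]
    # Spans that covered only non-ASCII characters become empty; drop them.
    remapped = [span for span in remapped if span[0] < span[1]]
    return filtered, remapped


# Made-up document with one non-ASCII character (U+00E9) at index 3.
text = "Caf\u00e9 Acme Corp"
spans = [[0, 5, "OTHER"], [5, 14, "COMPANY"]]

filtered, remapped = remap_spans_after_ascii_filter(text, spans)
print(filtered)
print(remapped)
```

Every offset after the removed character moves left by one, so ground truth collected against the original text would point at the wrong characters if it were ingested unaligned.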
Upload ground truth file
On the Overview page we get a high-level view of our application. We can also add any ground truth labels we have on this page. In the center of the dashboard, you'll notice an Upload GTs button. After clicking it, you'll be prompted to provide the following information to import ground truth from a file:
- File path: s3://path_to_your_file.csv
- File format: CSV
- Label column: label - whichever column contains the label. Example label: [[0, 29, "OTHER"], [29, 50, "COMPANY"], [50, 500, "OTHER"]]
- UID column: x_uid - column containing the document UID, in the following format: doc::2005
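A CSV file with these two columns can be produced with the Python standard library. The sketch below is a minimal illustration with made-up rows; the column names x_uid and label follow the example settings above, and the file is written to an in-memory buffer rather than to S3.

```python
import csv
import io
import json

# Made-up rows: a document UID and its sequence tagging spans.
rows = [
    ("doc::2005", [[0, 29, "OTHER"], [29, 50, "COMPANY"], [50, 500, "OTHER"]]),
    ("doc::2006", [[0, 12, "COMPANY"], [12, 80, "OTHER"]]),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["x_uid", "label"])
for uid, spans in rows:
    # The JSON string contains commas and double quotes; csv.writer quotes
    # and escapes the cell so it stays a single column.
    writer.writerow([uid, json.dumps(spans)])

csv_text = buffer.getvalue()
print(csv_text)

# Round-trip check: each label cell parses back to the original spans.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Letting the csv module handle the quoting avoids the most common failure mode, where a hand-built line splits the label across several columns.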
If you have external ground truth that lacks negative ground truth labels, select Auto generate negative ground truth labels on the Upload GTs page. The same option is also provided on the application creation page. Alternatively, refer to the SDK function add_ground_truth to infer negative labels.