Version: 0.93

Ground truth formats for different ML tasks

Text classification / PDF classification

Document-level ground truth (GT) often applies to classification applications, where each document has one label.

The document-level GT can be added to the data source files, where there is a label column to indicate the GT label for each row of the data. During application creation, you will need to specify the Ground truth column by the name of the label column.

You can also upload the document-level GT from the Datasource upload page for the Application / Model you are working on. To get there, select Overview from the top of the Studio page, find the "View Data Sources" button at the bottom of the Overview page, then click the "Upload GTs" button. The uploaded GT must contain the following columns:

  • uid (int): The uid of the document.
  • label (str): The GT label of the document.

For example, below is the ground truth for four documents:

| uid | label |
| --- | --- |
| 1 | POSITIVE |
| 2 | NEGATIVE |
| 3 | POSITIVE |
| 4 | POSITIVE |
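A document-level GT file of this shape is just a CSV with a uid column and a label column. Below is a minimal sketch that writes one with Python's standard csv module; the file name ground_truth.csv and the rows are hypothetical:

```python
import csv

# Hypothetical document-level ground truth: one row per document,
# with a uid column and a label column (the "Ground truth column").
rows = [
    {"uid": 1, "label": "POSITIVE"},
    {"uid": 2, "label": "NEGATIVE"},
    {"uid": 3, "label": "POSITIVE"},
    {"uid": 4, "label": "POSITIVE"},
]

with open("ground_truth.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["uid", "label"])
    writer.writeheader()
    writer.writerows(rows)
```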

Multi-label text classification

Multi-label text classification ground truth for a document is a JSON dump of a mapping from label to label marker, where the label marker is one of "PRESENT", "ABSENT", or "ABSTAIN". The PRESENT marker indicates that the document is labeled positively on the corresponding label, while the ABSENT marker indicates that it is labeled negatively on that label. The ABSTAIN marker indicates that the document is labeled neither positively nor negatively on the label. ABSTAIN is a way to provide partial ground truth on a document and comes in handy when the ground truth for a particular label is not known at the time of labeling.

A special label "_default" is also supported as a proxy for missing labels in the mapping. Think of it as a catch-all for everything else. This helps reduce the memory needed to store ground truth.

For example,

json.dumps({
    "class_1": "PRESENT",
    "class_2": "ABSTAIN",
    "class_3": "ABSENT",
    "class_4": "ABSENT",
    "class_5": "ABSENT",
})

is the same as

json.dumps({
    "class_1": "PRESENT",
    "class_2": "ABSTAIN",
    "_default": "ABSENT",
})
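To see how "_default" expands, here is a small sketch. The helper expand_default and the fallback to ABSTAIN when no "_default" is present are assumptions for illustration, not part of the Snorkel Flow SDK:

```python
import json

def expand_default(gt_json, all_labels):
    """Expand a compact GT mapping: any label missing from the
    mapping takes the value of the "_default" marker."""
    gt = json.loads(gt_json)
    # Assumption: treat missing labels as ABSTAIN if no "_default" is given.
    default = gt.pop("_default", "ABSTAIN")
    return {label: gt.get(label, default) for label in all_labels}

compact = json.dumps({
    "class_1": "PRESENT",
    "class_2": "ABSTAIN",
    "_default": "ABSENT",
})
labels = ["class_1", "class_2", "class_3", "class_4", "class_5"]
expanded = expand_default(compact, labels)
# class_3 through class_5 all fall back to "ABSENT"
```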

Information extraction

Span-level GT often applies to information extraction applications, where each span has one label. The span-level ground truth can only be uploaded from the Datasource upload page for the Application / Model you are working on. To get there, select Overview from the top of the Studio page, find the "View Data Sources" button at the bottom of the Overview page, then click the "Upload GTs" button. Snorkel Flow requires that the uploaded span-level GT contain the following columns:

  • context_uid (int): The uid of the document from which the span was extracted. All spans that have the same context_uid come from the same document. The context_uid is associated with the UID column when one creates the dataset.
  • span_field (str): The field where the spans were extracted from.
  • char_start (int): The index of the first character of the span in the document.
  • char_end (int): The index of the last character of the span in the document.
  • _gt_label (str): The GT label of the span.

Below is an example of ground truth for a single document with context_uid=0.

| _gt_label | context_uid | span_field | char_start | char_end |
| --- | --- | --- | --- | --- |
| class_1 | 0 | text | 17 | 28 |
| class_2 | 0 | text | 37 | 46 |
Note:

Only the spans already extracted by the candidate extractor can be recognized as GT. Both the span extracted and the span in the GT must have the same value of context_uid, char_start and char_end.

A GT span will not be used if it does not exactly match an existing candidate extracted span. To verify which GT spans are included in an application after upload, you can use the sf.export_ground_truth SDK method.
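The exact-match requirement can be sketched as follows. The helper matched_gt and the dictionary layout are hypothetical, for illustration only; use sf.export_ground_truth to verify what was actually included:

```python
def matched_gt(gt_spans, candidate_spans):
    """Keep only GT spans whose (context_uid, char_start, char_end)
    exactly matches an extracted candidate span."""
    candidates = {
        (s["context_uid"], s["char_start"], s["char_end"]) for s in candidate_spans
    }
    return [
        g for g in gt_spans
        if (g["context_uid"], g["char_start"], g["char_end"]) in candidates
    ]

gt = [
    {"context_uid": 0, "char_start": 17, "char_end": 28, "_gt_label": "class_1"},
    {"context_uid": 0, "char_start": 37, "char_end": 47, "_gt_label": "class_2"},  # off by one
]
extracted = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 37, "char_end": 46},
]
used = matched_gt(gt, extracted)
# only the first GT span matches exactly; the second is dropped
```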

It is recommended that you iterate on the extractor until you get as close as possible to 100% recall. You can evaluate the performance of the candidate extractor via the sf.get_candidate_extractor_metrics SDK method.

PDF information extraction

The span-level ground truth can only be uploaded from the Datasource upload page for the Application / Model you are working on. To get there, select Overview from the top of the Studio page, find the "View Data Sources" button at the bottom of the Overview page, then click the "Upload GTs" button. The uploaded GT must contain the following columns:

  • uid (int): The data point UID of the span. The UID here has to match the span data point UIDs exactly. Example UID: span::17,2,rich_doc_text,6e85bf3f0698497465102d9104bfb4fe,412,442 where:
    • 17 is the context_uid (int): the uid of the document from which the span was extracted.
    • 2 is the page_id (int): the number of the page in the document. This field only exists if you've selected Split docs by page when creating a PDF application or have added a Page splitter manually. Otherwise, it should be omitted, including the related comma (e.g., using the example UID above: span::17,rich_doc_text,6e85bf3f0698497465102d9104bfb4fe,412,442)
    • rich_doc_text is the span_field (str): the field where the spans were extracted from.
    • 6e85bf3f0698497465102d9104bfb4fe is the span_field_hash_value (str): the hash of the value of the field that the span is extracted from. It can be computed by calling hashlib.md5(span_field_value.encode("utf-8")).hexdigest() in Python.
    • 412 is the char_start (int): the index of the first character of the span in the document.
    • 442 is the char_end (int): the index of the last character of the span in the document.
  • label (str): The GT label of the span.
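A span UID of this shape can be assembled from its parts; below is a sketch, where span_uid is a hypothetical helper (not an SDK function) and the field value passed in is a placeholder:

```python
import hashlib

def span_uid(context_uid, span_field, span_field_value, char_start, char_end, page_id=None):
    """Build a span data point UID; page_id is included only when
    the documents were split by page."""
    field_hash = hashlib.md5(span_field_value.encode("utf-8")).hexdigest()
    parts = [str(context_uid)]
    if page_id is not None:
        parts.append(str(page_id))
    parts += [span_field, field_hash, str(char_start), str(char_end)]
    return "span::" + ",".join(parts)

uid = span_uid(17, "rich_doc_text", "full text of the field...", 412, 442, page_id=2)
# shape: "span::17,2,rich_doc_text,<md5-of-field-value>,412,442"
```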

Conversational AI

The GT can be added to the data source files in the field that will correspond to metadata. The expected data format for Conversational AI is a JSON file containing a list of objects, where each object represents a conversation. Each conversation in turn contains a list of objects, where each object represents an utterance.

Each conversation dictionary needs to contain the following:

  • Field: the string key under which the list of utterances can be found
  • Utterances Path: the path to the actual utterances if the input JSON has a nested structure.

Each utterance dictionary needs to contain the following:

  • Speaker field: string key for speaker of each utterance
  • Text field: string content of the utterance
  • Metadata field: dictionary potentially containing "GT"

Example:

[
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to transfer $500 to XYZ.",
        "frames": {
          "GT": 0
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Okay your money was transferred."
      }
    ]
  },
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to check my balance.",
        "frames": {
          "GT": 1
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Your balance is $100."
      }
    ]
  }
]

In this example:

  • Field: "turns"
  • Utterances path: None
  • Speaker field: "speaker"
  • Text field: "utterance"
  • Metadata field: "frames" -- this field has ground truth under a key "GT"

To import ground truth data after you've created the dataset and application, you can use the SDK function import_utterance_ground_truth.
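Walking the example above with those field names looks roughly like this. The helper collect_ground_truth is hypothetical and unrelated to the import_utterance_ground_truth SDK function; it only shows where the GT values live in the JSON:

```python
def collect_ground_truth(conversations, field="turns", speaker_field="speaker",
                         text_field="utterance", metadata_field="frames"):
    """Pull (speaker, text, GT) triples from utterances whose
    metadata dictionary carries a "GT" key."""
    labeled = []
    for conversation in conversations:
        for utterance in conversation[field]:
            gt = utterance.get(metadata_field, {}).get("GT")
            if gt is not None:
                labeled.append((utterance[speaker_field], utterance[text_field], gt))
    return labeled

conversations = [
    {"turns": [
        {"speaker": "USER", "utterance": "I want to transfer $500 to XYZ.", "frames": {"GT": 0}},
        {"speaker": "SYSTEM", "utterance": "Okay your money was transferred."},
    ]},
    {"turns": [
        {"speaker": "USER", "utterance": "I want to check my balance.", "frames": {"GT": 1}},
        {"speaker": "SYSTEM", "utterance": "Your balance is $100."},
    ]},
]
labeled = collect_ground_truth(conversations)
# only the two USER utterances carry GT; the SYSTEM turns are skipped
```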

Sequence tagging

Sequence tagging ground truth for a document is a JSON dump of a list of spans, where each span is a triple of (char_start, char_end, label). Here is an example label for a document:

json.dumps([
    [0, 29, 'OTHER'],
    [29, 40, 'COMPANY'],
    [40, 228, 'OTHER'],
    [228, 239, 'COMPANY'],
    [239, 395, 'OTHER'],
])

The spans cannot be empty (char_start must be smaller than char_end). Overlapping or duplicate spans are not allowed. The spans must be sorted by their character offsets (char_start, char_end).
By default, a preprocessor AsciiCharFilter is added in the DAG, which filters out non-ASCII characters from the documents. If you have ground truth that was collected outside of Snorkel Flow, please use the SDK function align_external_ground_truth to align the ground truth before ingesting it into Snorkel Flow.
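The constraints above (non-empty, sorted, non-overlapping spans) can be checked before upload with a small sketch; spans_are_valid is a hypothetical helper, not an SDK function:

```python
import json

def spans_are_valid(gt_json):
    """Return True if every span is non-empty and the list is
    sorted by offsets with no overlaps."""
    prev_end = 0
    for char_start, char_end, _label in json.loads(gt_json):
        if char_start >= char_end:   # empty or inverted span
            return False
        if char_start < prev_end:    # overlap or out of order
            return False
        prev_end = char_end
    return True

gt = json.dumps([
    [0, 29, "OTHER"],
    [29, 40, "COMPANY"],
    [40, 228, "OTHER"],
])
ok = spans_are_valid(gt)
```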

Upload ground truth file

On the Overview page we get a high-level view of our application. We can also add any ground truth labels we have on this page. In the center of the dashboard, you'll notice an Upload GTs button. After clicking on it you'll be prompted to provide the following information to import ground truth from a file:

  • File path: s3://path_to_your_file.csv
  • File format: CSV
  • Label column: label - whichever column contains the label. Example label: [[0, 29, "OTHER"], [29, 50, "COMPANY"], [50, 500, "OTHER"]]
  • UID column: x_uid - column containing the document UID, in the following format doc::2005

If you have external ground truth, but it does not have negative ground truth labels, please select Auto generate negative ground truth labels on the Upload GTs page. The same option is also provided on the application creation page. Alternatively, refer to the SDK function add_ground_truth to infer negative labels.