Version: 0.94

Ground truth formats for different ML tasks

This document serves as a reference for the ground truth (GT) formats for different ML task types in Snorkel Flow. Use this guide before uploading your data to ensure you know the correct format needed for adding labels to your dataset. Proper formatting is crucial for the successful integration of your data within the Snorkel Flow platform.

Refer to the sections below for specific details on the GT format for each ML task type.

Text classification / PDF classification

For text and PDF classification tasks, each document requires a ground truth (GT) label at the document level. Each document has a single label. In this example, documents receive the label of POSITIVE or NEGATIVE sentiment.

You can provide the document-level GT as part of your data source file, which includes a column specifying the GT label for each row of data. When creating your application, make sure to set the Ground truth column to match the name of the label column in your data file.

Alternatively, you can upload the document-level GT directly from the Datasource upload page of your application or model. To do this:

  1. Click the Develop menu on the left side of the Studio page.
  2. In the Ground Truths section, click the Upload GTs button.
  3. Provide the remote path of the GT file or select the GT file from your local machine.

The uploaded GT file must contain the following columns:

  • UID column: The unique identifier for each document, in the format doc::1.
  • Label column: The GT label for each document.

Below is an example ground truth for four documents:

uid       label
doc::1    POSITIVE
doc::2    NEGATIVE
doc::3    POSITIVE
doc::4    POSITIVE
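
As a minimal sketch, a file with these two columns could be produced with Python's standard csv module. The column names uid and label follow the example above; the in-memory rows and the helper name write_gt_csv are hypothetical and only illustrate the layout.

```python
import csv
import io

# Hypothetical document-level labels, keyed by document UID.
rows = [
    ("doc::1", "POSITIVE"),
    ("doc::2", "NEGATIVE"),
    ("doc::3", "POSITIVE"),
    ("doc::4", "POSITIVE"),
]


def write_gt_csv(rows, handle):
    """Write (uid, label) pairs as CSV, preceded by a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["uid", "label"])
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_gt_csv(rows, buffer)
gt_csv_text = buffer.getvalue()
```

Whatever column names you choose, they are the ones to select as the UID column and Label column when uploading.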

Multi-label text classification

For multi-label text classification tasks, the ground truth (GT) for a document is represented as a JSON object. This JSON object maps each label to a corresponding label marker, which can be one of the following:

  • PRESENT: Indicates the document is positively labeled for the given label.
  • ABSENT: Indicates the document is negatively labeled for the given label.
  • ABSTAIN: Indicates the document is neither positively nor negatively labeled for the given label. This option is useful when the ground truth for a specific label is not known at the time of labeling.

Additionally, a special label, _default, can be used to represent missing labels in the mapping. This serves as a catch-all for any labels not explicitly specified, helping to reduce the memory required to store the ground truth.

For example, the following mapping of labels to label markers

{
"class_1" : "PRESENT",
"class_2" : "ABSTAIN",
"class_3" : "ABSENT",
"class_4" : "ABSENT",
"class_5" : "ABSENT"
}

is equivalent to:

{
"class_1" : "PRESENT",
"class_2" : "ABSTAIN",
"_default" : "ABSENT"
}

By using _default, you can reduce the need to repeat the same value for multiple labels, making the JSON object more compact and memory-efficient.
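
The equivalence above can be sketched in a few lines of Python. The helper expand_default is hypothetical and not part of the Snorkel Flow SDK. Raising an error when a label has neither an explicit marker nor a _default is an assumption of this sketch, not documented platform behavior.

```python
def expand_default(gt, all_labels):
    """Return a mapping with one explicit marker per label in all_labels."""
    default = gt.get("_default")
    expanded = {}
    for label in all_labels:
        if label in gt:
            # An explicit marker always wins over the catch-all.
            expanded[label] = gt[label]
        elif default is not None:
            expanded[label] = default
        else:
            # Assumption: treat a missing marker with no _default as an error.
            raise KeyError(f"No marker for {label!r} and no _default given")
    return expanded


compact = {"class_1": "PRESENT", "class_2": "ABSTAIN", "_default": "ABSENT"}
labels = ["class_1", "class_2", "class_3", "class_4", "class_5"]
explicit = expand_default(compact, labels)
```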

Information extraction

For information extraction tasks, ground truth (GT) is provided at the span level, where each extracted span is associated with a specific label. The span-level GT can only be uploaded via the Datasource upload page for the relevant Application or Model. To do this:

  1. Click the Develop menu on the left side of the Studio page.
  2. In the Ground Truths section, click the Upload GTs button.

The uploaded GT file must contain the following columns:

  • context_uid (int): The unique identifier of the document from which the span was extracted. All spans with the same context_uid come from the same document. This is linked to the UID column specified when creating the dataset.
  • span_field (str): The field name from which the spans were extracted.
  • char_start (int): The index of the first character of the span in the document.
  • char_end (int): The index of the last character of the span in the document.
  • _gt_label (str): The ground truth label for the span.

Below is an example of ground truth for a single document with context_uid=0.

_gt_label    context_uid    span_field    char_start    char_end
class_1      0              text          17            28
class_2      0              text          37            46
Note: Only spans that match the spans extracted by the candidate extractor can be recognized as ground truth. The GT file must have the same values for context_uid, char_start, and char_end as the candidate extractor's output. A GT span is not used if its values do not exactly match those of an extracted span.

To verify which GT spans are included in the application after upload, you can use the sf.export_ground_truth SDK method.

For optimal results, you should iterate on the extracted ground truth until the recall is as close to 100% as possible. You can evaluate the candidate extractor's performance using the sf.get_candidate_extractor_metrics SDK method.
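
The exact-match rule and the recall target described above can be checked locally before upload. The following is a minimal sketch only: the helper split_matching_spans and the extracted spans are hypothetical, and it does not call the SDK. It compares GT rows to extracted spans on context_uid, char_start, and char_end, then reports the fraction of GT spans that matched.

```python
def split_matching_spans(gt_rows, extracted_spans):
    """Split GT rows into those that exactly match an extracted span and the rest."""
    extracted = {
        (s["context_uid"], s["char_start"], s["char_end"]) for s in extracted_spans
    }
    matched, unmatched = [], []
    for row in gt_rows:
        key = (row["context_uid"], row["char_start"], row["char_end"])
        (matched if key in extracted else unmatched).append(row)
    return matched, unmatched


# GT rows for the document with context_uid=0.
gt_rows = [
    {"_gt_label": "class_1", "context_uid": 0, "span_field": "text",
     "char_start": 17, "char_end": 28},
    {"_gt_label": "class_2", "context_uid": 0, "span_field": "text",
     "char_start": 37, "char_end": 46},
]

# Hypothetical candidate extractor output; the second span ends one character early.
extracted_spans = [
    {"context_uid": 0, "char_start": 17, "char_end": 28},
    {"context_uid": 0, "char_start": 37, "char_end": 45},
]

matched, unmatched = split_matching_spans(gt_rows, extracted_spans)
recall = len(matched) / len(gt_rows)
```

Here only the class_1 span would be recognized, so half of the GT spans are recovered. An off-by-one offset such as the one in the second span is enough to drop a GT row.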

PDF information extraction

For PDF information extraction tasks, the span-level ground truth (GT) can only be uploaded via the Datasource upload page for the relevant Application or Model. To do this:

  1. Click the Develop menu on the left side of the Studio page.
  2. In the Ground Truths section, click the Upload GTs button.

The uploaded GT file must contain the following columns:

uid                                                                  label
span::17,2,rich_doc_text,6e85bf3f0698497465102d9104bfb4fe,412,442    LABEL_VALUE

  • UID column: The unique identifier (UID) of the span. This UID must match the data point UIDs exactly.
    • Example UID: span::17,2,rich_doc_text,6e85bf3f0698497465102d9104bfb4fe,412,442 where:
      • 17 is the context_uid (int): The UID of the document from which the span was extracted.
      • 2 is the page_id (int): The page number within the document. This field is included if you select Split docs by page when creating a PDF application or manually add a Page splitter. Otherwise, it should be omitted, along with the associated comma (e.g., span::17,rich_doc_text,6e85bf3f0698497465102d9104bfb4fe,412,442).
      • rich_doc_text is the span_field (str): The internal name of the field from which the spans were extracted.
      • 6e85bf3f0698497465102d9104bfb4fe is the span_field_hash_value (str): The hash of the value of the field from which the span is extracted. It can be computed with the Python code hashlib.md5(span_field_value.encode("utf-8")).hexdigest().
      • 412 is the char_start (int): The index of the first character of the span in the document.
      • 442 is the char_end (int): The index of the last character of the span in the document.
  • Label column: The GT label of the span.
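
The UID layout described above can be assembled as follows. This is a sketch under stated assumptions: build_span_uid is a hypothetical helper, not an SDK function, and the field value passed to it is a placeholder. The hash expression is the one given above.

```python
import hashlib


def build_span_uid(context_uid, span_field, span_field_value,
                   char_start, char_end, page_id=None):
    """Assemble a span UID; page_id is left out when docs are not split by page."""
    field_hash = hashlib.md5(span_field_value.encode("utf-8")).hexdigest()
    parts = [str(context_uid)]
    if page_id is not None:
        # Only present when the application splits documents by page.
        parts.append(str(page_id))
    parts += [span_field, field_hash, str(char_start), str(char_end)]
    return "span::" + ",".join(parts)


# Placeholder field value; in practice this is the full text of the span field.
field_value = "Example page text"
uid_with_page = build_span_uid(17, "rich_doc_text", field_value, 412, 442, page_id=2)
uid_without_page = build_span_uid(17, "rich_doc_text", field_value, 412, 442)
```

Because the hash is computed over the whole field value, any difference in that text (including whitespace) produces a different UID, and the GT row will not match a data point.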

Make sure every element of the UID is present and correctly formatted in your GT file. Otherwise, the span will not match a data point.

Conversational AI

For Conversational AI, the ground truth (GT) can be added to the data source files within the field corresponding to metadata. The expected data format is a JSON file containing a list of objects, where each object represents a conversation. Each conversation object contains a list of sub-objects, where each sub-object represents an individual utterance.

Each conversation dictionary should include the following:

  • Field: The string key under which the list of utterances can be found.
  • Utterances Path: The path to the actual utterances if the input JSON has a nested path.

Each utterance dictionary should include:

  • Speaker field: The string key for the speaker of each utterance.
  • Text field: The string key for the text content of the utterance.
  • Metadata field: The string key for a dictionary that may contain the "GT" key.

Below is an example of the expected JSON format:

[
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to transfer $500 to XYZ.",
        "frames": {
          "GT": 0
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Okay your money was transferred."
      }
    ]
  },
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to check my balance.",
        "frames": {
          "GT": 1
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Your balance is $100."
      }
    ]
  }
]

In this example:

  • Field: "turns" – The key where the list of utterances is stored.
  • Utterances Path: None – No nested path is needed in this example.
  • Speaker Field: "speaker" – The key indicating the speaker of each utterance.
  • Text Field: "utterance" – The key containing the text of each utterance.
  • Metadata Field: "frames" – This field contains the ground truth, indicated by the key "GT".
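
To illustrate how these five settings map onto the data, the sketch below walks a small conversation file with Python's standard json module. The helper iter_utterances is hypothetical and not part of the SDK. Its keyword arguments mirror the settings listed above, and it assumes no nested utterances path.

```python
import json

conversations_json = """
[
  {"turns": [
    {"speaker": "USER", "utterance": "I want to transfer $500 to XYZ.",
     "frames": {"GT": 0}},
    {"speaker": "SYSTEM", "utterance": "Okay your money was transferred."}
  ]},
  {"turns": [
    {"speaker": "USER", "utterance": "I want to check my balance.",
     "frames": {"GT": 1}},
    {"speaker": "SYSTEM", "utterance": "Your balance is $100."}
  ]}
]
"""


def iter_utterances(conversations, field="turns", speaker_field="speaker",
                    text_field="utterance", metadata_field="frames"):
    """Yield (conversation index, utterance index, speaker, text, GT or None)."""
    for conv_idx, conversation in enumerate(conversations):
        for utt_idx, utterance in enumerate(conversation[field]):
            # The metadata dictionary is optional, and so is its "GT" key.
            gt = utterance.get(metadata_field, {}).get("GT")
            yield conv_idx, utt_idx, utterance[speaker_field], utterance[text_field], gt


records = list(iter_utterances(json.loads(conversations_json)))
```

Since json.loads rejects trailing commas and single-quoted strings, loading a file this way is also a quick check that it is well-formed JSON.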

After creating your dataset and application, you can import ground truth data using the SDK function import_utterance_ground_truth.

Sequence tagging

For sequence tagging tasks, the ground truth (GT) for a document is a JSON representation of a list of spans. Each span is represented as a 3-tuple of (char_start, char_end, label) consisting of:

  • char_start: The start index of the span.
  • char_end: The end index of the span.
  • label: The label associated with the span.

Below is an example of a ground truth for a document when tagging the spans as COMPANY or OTHER:

[
[0, 29, "OTHER"],
[29, 40, "COMPANY"],
[40, 228, "OTHER"],
[228, 239, "COMPANY"],
[239, 395, "OTHER"]
]

Ensure that the following conditions are met:

  • Spans cannot be empty (i.e., char_start must be less than char_end).
  • Overlapping or duplicate spans are not allowed.
  • Spans must be sorted in ascending order of their character offsets (char_start, char_end).
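
These three conditions can be checked before upload with a short script. This is a minimal sketch: validate_spans is a hypothetical helper, not an SDK function. It assumes char_end is exclusive, which is inferred from the example above, where one span ends at 29 and the next starts at 29.

```python
def validate_spans(spans):
    """Return a list of problems found; an empty list means all conditions hold."""
    problems = []
    previous_end = None
    for index, (char_start, char_end, _label) in enumerate(spans):
        if char_start >= char_end:
            problems.append(f"span {index} is empty")
        if previous_end is not None and char_start < previous_end:
            # Covers overlaps, duplicates, and spans listed out of order.
            problems.append(f"span {index} overlaps an earlier span or is out of order")
        previous_end = char_end if previous_end is None else max(previous_end, char_end)
    return problems


spans = [
    (0, 29, "OTHER"),
    (29, 40, "COMPANY"),
    (40, 228, "OTHER"),
    (228, 239, "COMPANY"),
    (239, 395, "OTHER"),
]
problems = validate_spans(spans)
```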

By default, the AsciiCharFilter preprocessor is added to the DAG and filters out non-ASCII characters from documents, so character offsets computed on the original text may no longer line up. If your ground truth data was collected outside of Snorkel Flow, use the SDK function align_external_ground_truth to align it before ingestion into Snorkel Flow.

To upload the ground truth file:

  1. Click the Develop menu on the left side of the Studio page.
  2. In the Ground Truths section, click the Upload GTs button.
  3. Provide the following information to import the ground truth from a file:
    • File path: s3://path_to_your_file.csv
    • File format: CSV
    • Label column: label - The column containing the label.
    • UID column: x_uid - The column containing the document UID, in the format doc::2005.

If your external ground truth data does not include negative ground truth labels, select the Auto generate negative labels option on the Upload GTs page. This option is also available during application creation. Alternatively, use the SDK function add_ground_truth to infer negative labels.