Version: 0.93

Data format for different ML tasks

The fields a dataset must contain, and their formats, depend on the ML task the dataset will be used for. This document outlines those required fields. Unless explicitly stated otherwise, a field can have any name.

Text classification

No required fields.

Information extraction

Text field with the content of the document.

Conversational AI

A JSON string field containing a list of dictionaries, where each list item is one utterance. Each utterance dictionary must contain:

Speaker field: string identifying the speaker of the utterance
Text field: string content of the utterance
Metadata field: dictionary potentially containing "GT"

Example (shown as JSON rather than as a string, for readability):

[
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to transfer $500 to XYZ.",
        "frames": {
          "GT": 0
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Okay your money was transferred."
      }
    ]
  },
  {
    "turns": [
      {
        "speaker": "USER",
        "utterance": "I want to check my balance.",
        "frames": {
          "GT": 1
        }
      },
      {
        "speaker": "SYSTEM",
        "utterance": "Your balance is $100."
      }
    ]
  }
]

In this example, the "turns" field of each record holds the list of utterances, with:

Speaker field: "speaker"
Text field: "utterance"
Metadata field: "frames"
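
As a rough illustration of the format described above, the following Python sketch serializes one conversation into a JSON string and checks that each utterance carries the speaker and text fields. The field names come from the example above. The check_utterances helper is hypothetical and is not part of the Snorkel Flow SDK. The metadata field is only checked when present, which is an assumption based on the SYSTEM utterances in the example omitting it.

```python
import json

# Field names taken from the example above; a dataset may use other names.
SPEAKER_FIELD = "speaker"
TEXT_FIELD = "utterance"
METADATA_FIELD = "frames"


def check_utterances(json_string):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    utterances = json.loads(json_string)
    if not isinstance(utterances, list):
        return ["top-level value is not a list"]
    problems = []
    for i, utt in enumerate(utterances):
        if not isinstance(utt, dict):
            problems.append(f"item {i} is not a dictionary")
            continue
        for field in (SPEAKER_FIELD, TEXT_FIELD):
            if not isinstance(utt.get(field), str):
                problems.append(f"item {i}: '{field}' is missing or not a string")
        # Assumption: metadata is validated only when it is present.
        if METADATA_FIELD in utt and not isinstance(utt[METADATA_FIELD], dict):
            problems.append(f"item {i}: '{METADATA_FIELD}' is not a dictionary")
    return problems


turns = [
    {"speaker": "USER", "utterance": "I want to check my balance.", "frames": {"GT": 1}},
    {"speaker": "SYSTEM", "utterance": "Your balance is $100."},
]

# The dataset field stores the list as a JSON string, not as a nested object.
turns_json = json.dumps(turns)
print(check_utterances(turns_json))
```

Running a check like this before upload catches utterances that lack a speaker or text value while the data is still easy to fix.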

Sequence tagging

Text field with the content of the document.

PDF classification and PDF information extraction

For both scanned and native PDFs, you can prepare your data in the expected format manually or with the snorkelflow.ingest.docs.dirtree_to_parquet() helper in the SDK. For a walkthrough of that helper, open the in-app notebook by selecting Notebook in the menu bar, then “File Browser” → “SampleNotebooks” → “ingest_ocr_docs.py”. Once your data is in this format, upload it to Snorkel Flow on the home page.

Scanned PDFs

Required fields include:

hocr (string) field containing a representation of the data in hOCR format
string field with a URL pointing to the original PDF file that corresponds to the pages included in the hocr input. The URL can point to either MinIO or S3.

Native PDFs

Only a field with a URL pointing to the native PDF file is required. The URL can point to either MinIO or S3.
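
The required fields for the two PDF variants can be summarized in a small sketch. Everything here is illustrative: the column name pdf_url, the bucket path, and the missing_fields helper are hypothetical, and the sketch assumes the hOCR field is literally named hocr.

```python
# Hypothetical column name for the PDF URL; pick whatever name your dataset uses.
URL_FIELD = "pdf_url"

# Required string fields per PDF kind, following the descriptions above.
REQUIRED_FIELDS = {
    "scanned": ("hocr", URL_FIELD),
    "native": (URL_FIELD,),
}


def missing_fields(row, pdf_kind):
    """Hypothetical helper: list required fields that are absent or empty."""
    return [
        field
        for field in REQUIRED_FIELDS[pdf_kind]
        if not isinstance(row.get(field), str) or not row[field]
    ]


# Placeholder values only; the URL and hOCR content are made up for illustration.
scanned_row = {
    "hocr": "<div class='ocr_page'></div>",
    "pdf_url": "s3://example-bucket/docs/scanned.pdf",
}
native_row = {"pdf_url": "s3://example-bucket/docs/native.pdf"}

print(missing_fields(scanned_row, "scanned"))
print(missing_fields(native_row, "native"))
print(missing_fields(native_row, "scanned"))
```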

Computer vision

A field containing URLs to images stored in MinIO. The images must be in PNG or JPEG format and should not be larger than 512 pixels.

Images can be uploaded to MinIO manually or (recommended) through the SDK with upload_images_to_minio. The upload_images_to_minio function automatically:

validates that images exist and can be opened
resizes images to a maximum resolution of 512 pixels
converts images to one of the supported image formats
uploads images to MinIO
outputs the corresponding MinIO image paths
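
For images uploaded manually, the size and format constraints can be checked ahead of time. The sketch below is not the upload_images_to_minio implementation. It is a standalone illustration that assumes the 512-pixel limit applies to the longer side of the image and that the aspect ratio should be preserved when resizing. The names fit_within and has_supported_extension are hypothetical.

```python
import os

# Assumption: the 512-pixel limit refers to the longer side of the image.
MAX_SIDE = 512
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def has_supported_extension(path):
    """Return True if the file name ends in a PNG or JPEG extension."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS


def fit_within(width, height, max_side=MAX_SIDE):
    """Return (width, height) scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave unchanged
    scale = max_side / longest
    # Keep each side at least one pixel after rounding.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(has_supported_extension("photo.JPG"))
print(fit_within(1024, 768))
print(fit_within(300, 200))
```

The returned dimensions can then be passed to whichever image library performs the actual resize.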

The upload_images_to_minio function can be used, for example, to upload images provided in a pandas dataframe.
