Data format for different ML tasks
Each Application type in Snorkel Flow requires certain fields in its Datasets. An error appears if these fields are missing.
This document outlines the required fields for each Application type.
Text
Text use cases in Snorkel Flow are those that work with raw text strings. Text data does not include PDFs, images, or any other file types that do not contain raw text strings.
Text use cases currently supported in Snorkel Flow are:
Classification
Information extraction
Sequence tagging
Datasets for Text applications require only one field (string) that contains the text for each data point. This text is what you annotate and what the ML model predicts on. The field can have any name, and you will specify which field you want to use during application creation.
Due to this flexible requirement, any file type can be treated as a Text use case in Snorkel Flow as long as it is converted into the required format laid out in this section prior to uploading to a Dataset.
Once your data is in the correct format you can upload a Dataset.
Example table with required fields:
text_column (can have any name) |
---|
The quick brown fox jumps over the lazy dog |
... |
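As a rough illustration of converting raw strings into the single-field layout described above, the sketch below builds one row per data point and serializes the rows to CSV. The function names, the `text_column` field name, and the choice of CSV as the file format are assumptions made for this example. They are not taken from the Snorkel Flow documentation.

```python
import csv
import io


def build_text_rows(texts, field_name="text_column"):
    """Return one dict per data point, each with a single string field.

    Blank or whitespace-only entries are skipped, since every data point
    needs text to annotate. `field_name` is arbitrary (hypothetical here).
    """
    rows = []
    for text in texts:
        text = str(text)
        if text.strip():
            rows.append({field_name: text})
    return rows


def rows_to_csv(rows, field_name="text_column"):
    """Serialize the rows to CSV text with a header line (assumed format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[field_name])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_text_rows(["The quick brown fox jumps over the lazy dog", "   "])
csv_text = rows_to_csv(rows)
```

The same approach applies to text extracted from other file types: extract the strings first, then write them into one column.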
PDF classification and PDF information extraction
PDF Datasets require a manifest file with a field (string) containing URLs that point to the location of each PDF. These URLs can point to either MinIO or S3. There are no requirements for the naming of the field, and you will specify which field you want to use during application creation.
This manifest file can be created manually, or there are two options to create the manifest file within Snorkel Flow:
- (Recommended) If the PDFs are stored in the user files section in Datasets, a manifest file is created automatically. You can download it and use it directly when creating the Dataset. See our guide on uploading user files.
- If the PDF files are first uploaded to MinIO, the snorkelflow.ingest.docs.dirtree_to_parquet() SDK helper function creates a manifest file that can be downloaded from MinIO and used directly when creating the Dataset.
Scanned PDFs
Scanned PDFs do not contain parseable text data and must go through OCR (optical character recognition). They therefore require an additional field (string) that contains a representation of the PDF data in hOCR format. This field must be named 'hocr'. You can add it manually before Dataset upload, or create it in the Application afterwards using one of Snorkel's built-in OCR preprocessors. Read more about working with Scanned PDFs in Snorkel Flow.
Once your data is in the correct format you can upload a Dataset.
Example table of a manifest file with the required fields:
url_col (can have any name) | hocr (optional for scanned PDFs) |
---|---|
minio://example_bucket/documents/report.pdf | [hocr data] |
... | ... |
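For a manifest created manually, a minimal sketch of assembling the rows might look like the following. It assumes URLs use the `minio://` scheme shown in the example table, or an `s3://` scheme for S3 (the S3 spelling is an assumption). The function name `build_pdf_manifest` and the `url_col` field name are hypothetical. The `hocr` column is only emitted when hOCR strings are supplied.

```python
from urllib.parse import urlparse

# "minio" comes from the example table; "s3" is an assumed spelling for S3 URLs.
ALLOWED_SCHEMES = {"minio", "s3"}


def build_pdf_manifest(urls, hocr_by_url=None, url_field="url_col"):
    """Build manifest rows: one URL field, plus 'hocr' when hOCR is provided.

    `hocr_by_url` maps a URL to its hOCR string (for scanned PDFs).
    Raises ValueError for URLs with an unexpected scheme or extension.
    """
    hocr_by_url = hocr_by_url or {}
    rows = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES:
            raise ValueError(f"unsupported URL scheme: {url}")
        if not parsed.path.lower().endswith(".pdf"):
            raise ValueError(f"not a PDF path: {url}")
        row = {url_field: url}
        if hocr_by_url:
            # The hOCR field must be named exactly 'hocr'.
            row["hocr"] = hocr_by_url.get(url, "")
        rows.append(row)
    return rows


manifest = build_pdf_manifest(["minio://example_bucket/documents/report.pdf"])
```

Rejecting bad URLs early is a design choice for this sketch. It surfaces problems before upload rather than when the Dataset is created.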
Computer vision
Computer vision Datasets require a manifest file with a field (string) containing URLs that point to the location of each image. These URLs can point to either MinIO or S3. There are no requirements for the naming of the field. The images must be PNG or JPEG and cannot be larger than 512 pixels.
This manifest file can be created manually, or there are two options to create the manifest file within Snorkel Flow:
- If the images are stored in the user files section in Datasets, a manifest file is created automatically. You can download it and use it directly when creating the Dataset. See our guide on uploading user files.
- There is an SDK option using the upload_images_to_MinIO function, which automatically:
  - validates whether images exist and can be opened
  - resizes images to a maximum resolution of 512 pixels
  - converts images to one of the supported image formats
  - uploads images to MinIO
  - outputs the corresponding MinIO image paths
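To make the resizing step concrete, here is a small sketch of how target dimensions could be computed when preparing images yourself. It assumes the 512-pixel limit applies to the longer side and that the aspect ratio is preserved. Both are assumptions, and this does not describe how the SDK function works internally. The name `fit_within` is hypothetical.

```python
def fit_within(width, height, max_side=512):
    """Return (width, height) scaled down so the longer side is <= max_side.

    Images already within the limit are returned unchanged. Integer
    arithmetic is used, so the shorter side is rounded down (minimum 1).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, width * max_side // longest)
    new_height = max(1, height * max_side // longest)
    return new_width, new_height


resized = fit_within(1024, 768)    # (512, 384)
unchanged = fit_within(300, 200)   # (300, 200)
```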
Once your data is in the correct format you can upload a Dataset.
Example table of a manifest file with the required fields:
url_col (can have any name) |
---|
minio://example_bucket/images/photo1.png |
... |
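Before uploading a manually created image manifest, it can help to check that every URL points to a supported format. The sketch below flags rows whose file extension is not PNG or JPEG. It assumes the extension reflects the actual image format and that `.jpg` and `.jpeg` both denote JPEG. The function name and the `url_col` field name are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions assumed to correspond to the supported PNG and JPEG formats.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def find_unsupported_images(rows, url_field="url_col"):
    """Return (row_index, url) pairs whose extension is not PNG or JPEG."""
    problems = []
    for index, row in enumerate(rows):
        url = row.get(url_field, "")
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix not in SUPPORTED_EXTENSIONS:
            problems.append((index, url))
    return problems


rows = [
    {"url_col": "minio://example_bucket/images/photo1.png"},
    {"url_col": "minio://example_bucket/images/scan.tiff"},
]
problems = find_unsupported_images(rows)
```

This check only inspects the URL string. It does not open the image or verify its pixel dimensions.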