Data management | Snorkel AI

Scanned PDF guide

Scanned PDF documents are created from scanned images of printed documents. They don't contain information about the text in the document and where it is locate...

(Beta) Use LLMs for extracting candidate spans in PDF Applications

PDF extraction applications often require accurate span extractors; however, defining span extractors can be challenging. For example, extracting addresses from...

Upload ground truth

We will demonstrate how to upload ground truth (GT) in Snorkel Flow for various applications. Below, we describe the two types of GT in Snorkel Flow (document-l...

Tips for splitting and partitioning data

When working with a new dataset, one of the most important steps is to create three representative splits of data.
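
As a rough illustration of the idea, the sketch below shuffles record IDs with a fixed seed and cuts them into three splits. It uses only the Python standard library and is not a Snorkel Flow API. The function name `split_three_ways`, the 80/10/10 proportions, and the seed are assumptions made for this example.

```python
import random


def split_three_ways(ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle ids deterministically and cut them into three splits.

    Hypothetical helper: the proportions and seed are illustrative
    defaults, not values prescribed by Snorkel Flow.
    """
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three values summing to 1")

    shuffled = list(ids)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    first_end = int(n * fractions[0])
    second_end = first_end + int(n * fractions[1])

    # The third split takes whatever remains, so no record is dropped.
    return (
        shuffled[:first_end],
        shuffled[first_end:second_end],
        shuffled[second_end:],
    )


first, second, third = split_three_ways(range(100))
```

A uniform shuffle keeps each split representative only in expectation. For small or heavily imbalanced datasets, a stratified split by label would likely be a better fit.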

Data format for different ML tasks

Depending on the ML task a dataset is used for, different fields are required in a specific format. This document outlines those required fields. Note tha...

Training set overview: Review your training sets

This page shows you a list of the training sets that have been created from your LF packages. For a given training set, you can edit its name and view summary s...

Re-split data

Sometimes new data is added to a dataset, or the data distribution in an application changes. When this happens, we may want to re-split the dataset in a...

Upload data to MinIO

Overview

Ground truth formats for different ML tasks

Text classification / PDF classification

Manage the data sources for a model

This page allows you to manage the active data sources and upload ground truth for your model.