Scanned PDF guide
Scanned PDF documents are created from scanned images of printed documents. They don't contain information about the text in the document and where it is locate...
(Beta) Use LLMs for extracting candidate spans in PDF Applications
PDF extraction applications often require accurate span extractors; however, defining span extractors can be challenging. For example, extracting addresses from...
Upload ground truth
We will demonstrate how to upload ground truth (GT) in Snorkel Flow for various applications. Below, we describe the two types of GT in Snorkel Flow (document-l...
Tips for splitting and partitioning data
When working with a new dataset, one of the most important steps is to create three representative splits of data.
Data format for different ML tasks
The fields a dataset requires, and their format, depend on the ML task the dataset is used for. This document outlines those required fields. Note tha...
Training set overview: Review your training sets
This page lists the training sets that have been created from your LF packages. For a given training set, you can edit its name and view summary s...
Re-split data
Sometimes new data is added to a dataset, or the data distribution in an application changes. When this happens, we may want to re-split the dataset in a...
Upload data to MinIO
Overview
Ground truth formats for different ML tasks
Text classification / PDF classification
Manage the data sources for a model
This page allows you to manage the active data sources and upload ground truth for your model.