Tips for splitting and partitioning data
When working with a new dataset, one of the most important steps is to create three representative splits of data:

- `train`: for LF development and training set creation
- `valid`: for model hyperparameter tuning
- `test`: for final evaluation
Given the nature of the task and the particularities of the data, it may not be appropriate to randomly split the data across these partitions.
General tips
- Split size: While split size is problem-dependent, a good starting breakdown is around 70/15/15 (`train`/`valid`/`test`).
- Test split: Make sure your `test` split contains the most reliable ground truth labels compared to the other splits.
- Hierarchical data: Create splits based on the hierarchical nature of your data. For example, many text extraction tasks are done at the "page level," but each page may correspond to a given document. In this case, you should split by a unique document identifier so that all of a document's pages land in the same split (see the sketch after this list).
- Diverse data: If the diversity of data is high, ensure that each split, to the degree possible, contains a representative "slice" of data from each source of diversity (i.e., stratified sampling). For example, if you want to extract fields from PDF reports and have noticed a dozen or so report styles that account for 80% of the data, ensure that each split contains some of each report style across `train`, `valid`, and `test`. While this is important for ensuring models can generalize, it's more critical to account for hierarchical sources of data leakage.
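As a concrete sketch of the two tips above, the snippet below performs a document-grouped 70/15/15 split and then eyeballs the style mix in each split. The file name and the `doc_id` / `report_style` columns are hypothetical stand-ins for your own data; this is illustrative, not a Snorkel Flow API.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical page-level data: one row per page, with a "doc_id" column
# identifying the parent document and a "report_style" diversity column.
df = pd.read_csv("pages.csv")

# Carve off test (15% of documents), then valid (15% of the total,
# i.e. ~17.6% of the remainder). Grouping by doc_id guarantees that
# all pages of a document end up in the same split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
rest_idx, test_idx = next(splitter.split(df, groups=df["doc_id"]))
rest, test = df.iloc[rest_idx], df.iloc[test_idx]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.15 / 0.85, random_state=42)
train_idx, valid_idx = next(splitter.split(rest, groups=rest["doc_id"]))
train, valid = rest.iloc[train_idx], rest.iloc[valid_idx]

# No document may appear in more than one split.
assert set(train["doc_id"]).isdisjoint(valid["doc_id"])
assert set(train["doc_id"]).isdisjoint(test["doc_id"])
assert set(valid["doc_id"]).isdisjoint(test["doc_id"])

# Eyeball the report-style mix in each split; if a style is missing or
# badly skewed, re-split with a different seed or stratify explicitly.
for name, split in [("train", train), ("valid", valid), ("test", test)]:
    print(name, split["report_style"].value_counts(normalize=True).round(2).to_dict())
```

Note that `GroupShuffleSplit` randomizes at the document level but does not enforce stratification; if strict per-style balance matters for your data, scikit-learn's `StratifiedGroupKFold` is one option.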
Snorkel Flow specific tips
These tips are relevant when uploading a dataset.
- Dev split: Snorkel Flow automatically generates a `dev` split from the `train` split. The default size is 10% of the total `train` size, with a hard limit of 10k examples (or 4k when sampling by documents for extraction tasks). The `dev` split can be resampled to comprise the entire `train` split if the `train` split size is <10k. We currently do not support manually selecting your `dev` split, so think carefully about how this could cause data leakage issues (e.g., if data from multiple documents is split by pages and `dev` is a random sample of pages from `train`).
- Ensure each data source file is <100 MB: If you have one large file, repartition it into several smaller files, each <100 MB in size, to reduce overhead (see the first sketch after this list).
- Discrete splits for slices: Depending on the nature of your data or application, consider creating multiple files for the `valid` and `test` splits, one per "slice" of data. Snorkel Flow supports uploading and enabling/disabling multiple data sources per split, which is useful for quickly assessing model performance on a given segment or slice of data. What you choose to include in these slices will depend on the specific nature of each application (see the second sketch after this list).
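Here is a minimal sketch of the repartitioning step, assuming a hypothetical oversized CSV named `train_large.csv`; the chunk size is a tunable guess, not a Snorkel Flow parameter.

```python
import os
import pandas as pd

# Read the oversized file in chunks and write each chunk out separately.
# CHUNK_ROWS is an assumption: tune it until each part lands well under 100 MB.
CHUNK_ROWS = 50_000

for i, chunk in enumerate(pd.read_csv("train_large.csv", chunksize=CHUNK_ROWS)):
    part = f"train_part_{i:03d}.csv"
    chunk.to_csv(part, index=False)
    print(part, f"{os.path.getsize(part) / 1e6:.1f} MB")  # verify the size budget
```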
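And a sketch of preparing per-slice files, again with hypothetical file and column names; each output file can then be uploaded as its own data source for the `valid` split and toggled on or off independently.

```python
import pandas as pd

# Hypothetical: one valid file per report style, so each slice can be
# enabled/disabled independently once uploaded as a separate data source.
valid = pd.read_csv("valid.csv")
for style, slice_df in valid.groupby("report_style"):
    slice_df.to_csv(f"valid_{style}.csv", index=False)
```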