Tips for splitting and partitioning data
When working with a new dataset, one of the most important steps is to create three representative splits of data:

- `train`: for LF development and training set creation
- `valid`: for model hyperparameter tuning
- `test`: for final evaluation
Given the nature of the task and the particularities of the data, it may not be appropriate to randomly split the data across these partitions.
General tips
- Split size: While split size is problem-dependent, a good starting breakdown is around 70/15/15 (`train`/`valid`/`test`).
- Test split: Make sure your `test` split contains the most reliable ground truth labels compared to the other splits.
- Hierarchical data: Create splits based on the hierarchical nature of your data. For example, many text extraction tasks are done at the "page level," but each page may correspond to a given document. In this case, you should split by a unique document identifier so that all of a document's pages land in the same split (see the sketch after this list).
- Diverse data: If the diversity of data is high, ensure that each split, to the degree possible, contains a representative "slice" of data from each source of diversity (i.e., stratified sampling). For example, if you want to extract fields from PDF reports and have noticed a dozen or so report styles that account for 80% of the data, ensure that each split contains some of each report style across `train`, `valid`, and `test`. While this is important for ensuring models can generalize, it's more critical to account for hierarchical sources of data leakage.
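As a concrete sketch of the two tips above, the snippet below performs a document-grouped 70/15/15 split and then eyeballs the style mix in each split. The file name and the `doc_id` / `report_style` columns are hypothetical stand-ins for your own data; this is illustrative, not a Snorkel Flow API.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical page-level data: one row per page, with a "doc_id" column
# identifying the parent document and a "report_style" diversity column.
df = pd.read_csv("pages.csv")

# Carve off test (15% of documents), then valid (15% of the total,
# i.e. ~17.6% of the remainder). Grouping by doc_id guarantees that
# all pages of a document end up in the same split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
rest_idx, test_idx = next(splitter.split(df, groups=df["doc_id"]))
rest, test = df.iloc[rest_idx], df.iloc[test_idx]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.15 / 0.85, random_state=42)
train_idx, valid_idx = next(splitter.split(rest, groups=rest["doc_id"]))
train, valid = rest.iloc[train_idx], rest.iloc[valid_idx]

# No document may appear in more than one split.
assert set(train["doc_id"]).isdisjoint(valid["doc_id"])
assert set(train["doc_id"]).isdisjoint(test["doc_id"])
assert set(valid["doc_id"]).isdisjoint(test["doc_id"])

# Eyeball the report-style mix in each split; if a style is missing or
# badly skewed, re-split with a different seed or stratify explicitly.
for name, split in [("train", train), ("valid", valid), ("test", test)]:
    print(name, split["report_style"].value_counts(normalize=True).round(2).to_dict())
```

Note that `GroupShuffleSplit` randomizes at the document level but does not enforce stratification; if strict per-style balance matters for your data, scikit-learn's `StratifiedGroupKFold` is one option.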
Snorkel Flow specific tips
These tips are relevant when uploading a dataset.
- Dev split: Snorkel Flow automatically generates a `dev` split from the `train` split. The default size is 10% of the total `train` size, with a hard limit of 10k examples (or 4k when sampling by documents for extraction tasks). The `dev` split can be resampled to comprise the entire `train` split if the `train` split size is <10k. We currently do not support manually selecting your `dev` split, so think carefully about how this could cause data leakage issues (e.g., if data from multiple documents is split by pages and `dev` is a random sample of pages from `train`).
- Ensure each data source file is <100 MB: If you have one large file, repartition it into several smaller files, each <100 MB in size, to reduce overhead (see the first sketch after this list).
- Discrete splits for slices: Depending on the nature of your data or application, consider creating multiple files for the `valid` and `test` splits, one per "slice" of data. Snorkel Flow supports uploading and enabling/disabling multiple data sources per split, which is useful for quickly assessing model performance on a given segment or slice of data. What you choose to include in these slices will depend on the specific nature of each application (see the second sketch after this list).
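Here is a minimal sketch of the repartitioning step, assuming a hypothetical oversized CSV named `train_large.csv`; the chunk size is a tunable guess, not a Snorkel Flow parameter.

```python
import os
import pandas as pd

# Read the oversized file in chunks and write each chunk out separately.
# CHUNK_ROWS is an assumption: tune it until each part lands well under 100 MB.
CHUNK_ROWS = 50_000

for i, chunk in enumerate(pd.read_csv("train_large.csv", chunksize=CHUNK_ROWS)):
    part = f"train_part_{i:03d}.csv"
    chunk.to_csv(part, index=False)
    print(part, f"{os.path.getsize(part) / 1e6:.1f} MB")  # verify the size budget
```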
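And a sketch of preparing per-slice files, again with hypothetical file and column names; each output file can then be uploaded as its own data source for the `valid` split and toggled on or off independently.

```python
import pandas as pd

# Hypothetical: one valid file per report style, so each slice can be
# enabled/disabled independently once uploaded as a separate data source.
valid = pd.read_csv("valid.csv")
for style, slice_df in valid.groupby("report_style"):
    slice_df.to_csv(f"valid_{style}.csv", index=False)
```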