Version: 0.91

Re-split data

Sometimes new data is added to a dataset, or the data distribution in an application changes. When this happens, you may want to re-split the dataset to move ground truth to different splits or to place new data into different splits. Re-splitting creates a new application that copies everything from the previous application and links the new dataset to it.

  • Split Size: The right split size depends on the problem. The default split percentages are 70/20/10 (train/valid/test).

  • Usage:

    1. Select the `Datasets` menu from the sidebar navigation
    2. Select the dataset you want to re-split
    3. Click on the `New Data Source(s)` button
    4. Select your data source upload strategy (cloud, file, Snowflake, etc.) and then click on `Split by %`
    5. Defaults are already set for `train`/`valid`/`test`, but you can change them as needed
  • SDK: The Snorkel Flow SDK supports re-splitting data with `sf.resplit_datsources_by_percent(application_uid, datasource_uids)`. An optional `split_random_seed` parameter and an optional `split_pct` parameter set the random seed and override the default split size, respectively. An example call might look like the following:


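    import json

    # Create a new application with its data sources re-split by percentage.
    sf.create_new_application_with_resplit_datasources(
        application_uid=application_info["application_uid"],
        datasource_uids=datasource_uids,
        # Round-trip through JSON to pass the percentages as a plain dict.
        split_pct=json.loads(
            SplitWiseDistribution(
                train=train_percent, test=test_percent, valid=valid_percent
            ).json()
        ),
    )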
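
To make the roles of the split percentages and the random seed concrete, here is a minimal, self-contained sketch of a seeded percentage split in plain Python. It is not Snorkel Flow's implementation and does not call the SDK. The names `split_by_percent`, `DEFAULT_SPLIT_PCT`, and `seed` are hypothetical and chosen only for illustration. It assumes the percentages are whole numbers that sum to 100 and that any rounding remainder goes to the test split.

```python
import random
from typing import Dict, List, Optional, Sequence

# Mirrors the documented 70/20/10 default (train/valid/test).
DEFAULT_SPLIT_PCT: Dict[str, int] = {"train": 70, "valid": 20, "test": 10}


def split_by_percent(
    items: Sequence[int],
    split_pct: Optional[Dict[str, int]] = None,
    seed: int = 123,
) -> Dict[str, List[int]]:
    """Shuffle items with a fixed seed and cut them into train/valid/test."""
    pct = dict(split_pct or DEFAULT_SPLIT_PCT)
    if sum(pct.values()) != 100:
        raise ValueError("split percentages must sum to 100")

    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = n * pct["train"] // 100
    n_valid = n * pct["valid"] // 100
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        # Assumption: any rounding remainder lands in the test split.
        "test": shuffled[n_train + n_valid:],
    }


if __name__ == "__main__":
    splits = split_by_percent(range(100))
    print({name: len(rows) for name, rows in splits.items()})
```

In this sketch, reusing the same seed reproduces the same assignment of rows to splits, while changing the seed or the percentages moves rows between splits. That is the effect a re-split is meant to achieve.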