Version: 25.6

Uploading a dataset

The Snorkel AI Data Development Platform organizes data into data sources and datasets:

  • Data sources are individual partitions of data points, such as the rows of an individual parquet file or the rows returned by an individual SQL query.
  • Datasets are collections of data sources, with each data source assigned to exactly one split (train, valid, or test) within the dataset.

You can upload datasets to the Snorkel AI Data Development Platform, starting with a single data source. Data sources can be added to datasets at any time.

Prerequisite

Before uploading your dataset to the Snorkel AI Data Development Platform, prepare your data to minimize unexpected data processing issues when managing your applications.
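For example, a quick pre-upload check (a minimal sketch using pandas; the column names are assumptions) might look for NaN values and verify that a candidate UID column is unique:

```python
import pandas as pd

# Hypothetical data; "uid" and "label" are illustrative column names.
df = pd.DataFrame({
    "uid": [0, 1, 2, 3],
    "text": ["alpha", "beta", "gamma", "delta"],
    "label": ["pos", "neg", "pos", None],  # one missing ground-truth value
})

nan_counts = df.isna().sum()          # NaN values can surface in ingestion checks
uid_is_unique = df["uid"].is_unique   # a UID column must be unique per row
```

Resolving these issues before upload avoids failed verification later in the flow.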

To upload a dataset

  1. Select the Datasets option in the left navigation, and then select Upload new dataset in the top right corner.

    ingress_v2 step 1
  2. For the new dataset, enter a Dataset name.

    ingress_v2 step 2
  3. Select your Data source from the dropdown menu. Learn more about supported data source types.

    ingress_v2 step 3
  4. Enter the path or query for the chosen data source. To ingest multiple paths or queries for that data source, select "Add path" or "Add query", depending on your chosen data source type.

    ingress_v2 step 4 add path

    a. You can also use credentials to ingest data from private data sources. Credentials are mandatory for some data source types, such as Databricks and Snowflake, and optional for others, such as S3 buckets, which may be public or private.

    Configuration button

    b. To ingest from a private S3 bucket, toggle on Use credentials and select + New credential.

    Toggle on credentials

    c. In the modal, enter your credentials and select Save.

    Credentials module

    d. Select your saved configuration from the dropdown menu and enter your path or query to ingest data from the new configuration.

    ingress_v2 step 4d
  5. To add multiple data sources with one or more paths or queries under each, select Add data source.

    ingress_v2 step 5
  6. Repeat these steps for each data source, and then select Next to verify your data sources. This step verifies that your data sources are accessible and runs basic checks, such as detecting NaN values, to ensure your data is valid for ingestion.

    ingress_v2 step 6
  7. Assign splits to data sources. There are two ways to assign splits to data sources:

    • Automatically, with the Split data by % option. If you select this option, you must also define your ground truth (GT) column.

      ingress v2 step 7a
      tip

      If you have a large amount of unlabeled data, split your training data into smaller data sources. You can enable a subset of data sources for faster initial development before scaling up to all of your data.

    • Manually with the Split data by file option.

      ingress v2 step 7b
  8. Once all of your data sources are verified, choose a UID column whose value is unique for each row across the dataset. The UID column can be text or an integer. If your dataset has no such field, choose Snorkel Generated to have the Snorkel AI Data Development Platform generate a UID column.

    ingress v2 step 8

    Once the dataset is created, a new context_uid column is added to your data. This column is populated with the selected UID column or the Snorkel Generated UID.

    a. If you chose to automatically split your data with the Split data by % option, you can Stratify ground truth across splits to ensure the ground truth labels are evenly distributed across the splits.

    • If you opt in to stratify ground truth, provide the GT column and the value corresponding to UNKNOWN ground truth.
    • If you opt out, the data is split randomly.
    ingress v2 step 9

    b. Select the data type for the task.

  9. Select Complete to begin ingesting your data into the platform.

    ingress v2 step 9
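As a rough offline illustration of what stratified splitting does (the platform performs this for you when you choose Split data by % with stratification; pandas and these column names are assumptions), each ground-truth label is sampled proportionally into each split:

```python
import pandas as pd

# Hypothetical labeled data with an even label balance.
df = pd.DataFrame({
    "uid": range(10),
    "label": ["spam", "ham"] * 5,
})

# Sample 80% of each label group for train; the remainder goes to valid.
# This keeps the label distribution similar across splits.
train = df.groupby("label", group_keys=False).sample(frac=0.8, random_state=0)
valid = df.drop(train.index)
```

Without stratification, a plain random split could leave a rare label underrepresented in one of the splits.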