Version: 0.96

Data preparation

Before uploading your data to Snorkel Flow, we recommend that you complete the basic preparation steps on this page, in addition to any custom preprocessing that your application needs. Doing so minimizes unexpected data processing issues when you manage your applications. After you finish preparing your data, follow the steps in Data upload to upload your dataset to Snorkel Flow.

note

The examples on this page use data that is stored in Pandas DataFrames and in Snorkel Flow’s object storage (MinIO). However, Snorkel Flow also supports data uploaded from other sources. See Supported data source types for more information.

Format data and ground truth for your ML task

The following pages walk through how to format your data and ground truth for different input data types and output tasks:

Check for NaN, -Inf, and Inf values

We recommend removing NaN, -Inf, and Inf values from your data before uploading. Out-of-range float values are not JSON-compliant, and errors are common with these types of values when working with data in Snorkel Flow.

Many online resources explain how to remove NaN, -Inf, and Inf values, including examples for sanitizing Pandas DataFrames.
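
As a minimal sketch of one way to sanitize a Pandas DataFrame, the following example replaces Inf and -Inf with NaN and then drops every row that contains a missing value. The DataFrame contents and the column names `text` and `score` are hypothetical. Dropping rows is only one option; filling missing values may suit your application better.

```python
import numpy as np
import pandas as pd

# Hypothetical example data containing NaN, Inf, and -Inf values.
df = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d"],
        "score": [0.5, np.nan, np.inf, -np.inf],
    }
)

# Treat +Inf and -Inf as missing, then drop every row with a missing value.
df_clean = df.replace([np.inf, -np.inf], np.nan).dropna()

# Confirm that only finite values remain in the numeric columns.
numeric = df_clean.select_dtypes(include="number")
all_finite = bool(np.isfinite(numeric.to_numpy()).all())
```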

Check column types

We recommend checking the column types of your data to ensure that they are typed as expected. This can be done using dtypes for Pandas DataFrames.
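
The check described above might look like the following sketch. The DataFrame and the column names `text` and `count` are hypothetical; here, `count` is assumed to have been loaded as strings by mistake and is cast to an integer type.

```python
import pandas as pd

# Hypothetical example data; "count" holds numbers stored as strings.
df = pd.DataFrame({"text": ["a", "b"], "count": ["1", "2"]})

# Inspect the type of every column.
print(df.dtypes)

# Cast a column that is not typed as expected.
df["count"] = df["count"].astype("int64")
```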

tip

Snorkel Flow treats columns titled url as URLs, which means that both the raw URL and the content hosted at the URL are displayed.

Make note of the label column name and label classes

If you have ground truth as a column in the data that you are uploading, make a note of the label column name and the label classes. You’ll need to specify these while uploading data to Snorkel Flow.
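
One way to collect this information from a Pandas DataFrame is sketched below. The example data is hypothetical, and the column name `label_column` is assumed to match the name used in the split example later on this page.

```python
import pandas as pd

# Hypothetical example data with ground truth in "label_column".
df = pd.DataFrame(
    {
        "text": ["win a prize", "see you at noon", "free money"],
        "label_column": ["spam", "ham", "spam"],
    }
)

label_column = "label_column"

# The distinct label classes, ignoring rows without ground truth.
label_classes = sorted(df[label_column].dropna().unique())

# How many rows carry each class; useful for spotting typos or rare classes.
class_counts = df[label_column].value_counts()
```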

(Optional) split data into train, valid, and test sets

You can optionally split your data into train, valid, and test sets before uploading it to Snorkel Flow. You may also choose to split by percentages during data upload to Snorkel Flow.

  • Train split: The train split does not require ground truth labels, but we recommend including a few ground truth labels so that you can use them during model development. You can also add ground truth labels in Snorkel Flow's Annotation Studio after you upload the dataset.
  • Valid split: The valid split should contain ground truth labels for evaluation purposes. This split can be used to tune your ML model.
  • Test split: The held-out test split should contain ground truth labels for evaluation purposes.

The following example shows how data can be shuffled and split into train, valid, and test sets. The data is stratified over class labels in the column named "label_column", with 70% of the data in the train set, and 15% in each of the valid and test sets.

# Shuffle dataset
df = df.sample(frac=1, random_state=42)
# 70% / 15% / 15% split by class
df_train = df.groupby("label_column").sample(frac=0.7, random_state=42)
df_valid = df.drop(df_train.index).groupby("label_column").sample(frac=0.5, random_state=42)
df_test = df.drop(df_train.index).drop(df_valid.index)

tip

Be sure to check out our Tips for splitting and partitioning data guide when preparing data splits.
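
After splitting, it can be worth confirming that the splits are disjoint and have the expected sizes. The sketch below repeats the split from this section on hypothetical data (100 rows, two classes) and then runs those checks. It assumes that the DataFrame index is unique, because `df.drop` removes rows by index label.

```python
import pandas as pd

# Hypothetical example data: 60 rows of class "a" and 40 rows of class "b".
df = pd.DataFrame(
    {
        "text": [f"doc {i}" for i in range(100)],
        "label_column": ["a"] * 60 + ["b"] * 40,
    }
)

# Dropping rows by index label only works as intended if the index is unique.
index_is_unique = df.index.is_unique

# Same shuffle and 70% / 15% / 15% stratified split as in this section.
df = df.sample(frac=1, random_state=42)
df_train = df.groupby("label_column").sample(frac=0.7, random_state=42)
df_valid = df.drop(df_train.index).groupby("label_column").sample(frac=0.5, random_state=42)
df_test = df.drop(df_train.index).drop(df_valid.index)

# The splits should not share rows and should cover the whole dataset.
train_ids, valid_ids, test_ids = set(df_train.index), set(df_valid.index), set(df_test.index)
splits_are_disjoint = not (train_ids & valid_ids or train_ids & test_ids or valid_ids & test_ids)
splits_cover_all = len(train_ids | valid_ids | test_ids) == len(df)

# Per-class proportions in each split, for comparison with the full dataset.
train_class_share = df_train["label_column"].value_counts(normalize=True)
```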

Export the data splits as Parquet or CSV files

If you are uploading data from a Pandas DataFrame via object storage or locally, first export the data to a Parquet or CSV file.

Snorkel recommends Parquet files instead of CSV files to avoid column type ambiguities. For example, a CSV file cannot indicate whether "123" should be ingested as a string or as an integer. Parquet files explicitly store column types; CSV files do not.
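
The ambiguity can be illustrated with a small round trip through CSV in Pandas. In this sketch, the column name `id` and its values are hypothetical. The string values are written to an in-memory CSV and read back, at which point Pandas infers an integer type unless the type is stated explicitly.

```python
import io

import pandas as pd

# Hypothetical identifiers that are meant to be strings.
df = pd.DataFrame({"id": ["123", "007"]})

# Write the data to an in-memory CSV file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading it back without type information infers integers,
# so "007" loses its leading zeros.
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)

# The string type survives only if the reader is told about it.
buffer.seek(0)
df_as_str = pd.read_csv(buffer, dtype={"id": str})
```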

The following example shows how to export data from a Pandas DataFrame to Parquet files:

# Export to Parquet file
df_train.to_parquet("/path/to/my/dataset_name_train.parquet")
df_valid.to_parquet("/path/to/my/dataset_name_valid.parquet")
df_test.to_parquet("/path/to/my/dataset_name_test.parquet")

(Optional) Upload data to Snorkel Flow’s object storage (MinIO)

Data can be uploaded from a variety of sources (see Supported data source types for more information). One option is to upload data from the MinIO object storage system deployment that comes with Snorkel Flow. This can be a good option if your data is stored locally and is larger than 100 MB.

If you plan to use MinIO as a data source, you need to upload your data to MinIO first. Follow the steps below:

  1. Access Snorkel Flow’s object storage. At the bottom of the left-side menu, click your user name, then click Resources > MinIO object storage. From there, you can log in to the MinIO console. For any issues with the access key, or for access issues with Kubernetes installations, contact your Snorkel administrator.

  2. Create a bucket to upload your data. In the bottom-right corner of your screen, click the red + button, then click Create bucket. From there, you can name your bucket.

    note

    Make sure that you use lowercase characters when naming your bucket. Using uppercase characters results in an error.

    note

    MinIO does not create a folder at the specified location until a file is uploaded. If you plan to use a bucket programmatically and need the folder to exist, upload an empty file in the UI to create the folder.

  3. Upload your data. Once your bucket is created, click the red + button again, then click Upload file. Select the files that you’d like to upload.

Make sure that you remember your file paths on MinIO, as you’ll need them to create a data source on Snorkel Flow. The file path is in the form minio://<bucket>/<path/to/file>. For example, a file named train.parquet inside the bucket mybucket in MinIO would be referred to as minio://mybucket/train.parquet.
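
If you record file paths programmatically, a small helper can assemble them in the form described above. The function `minio_path` below is a hypothetical illustration and is not part of Snorkel Flow or MinIO. It also rejects bucket names that contain uppercase characters, following the note in step 2.

```python
def minio_path(bucket: str, key: str) -> str:
    """Build a path of the form minio://<bucket>/<path/to/file>.

    Hypothetical helper for illustration only.
    """
    if not bucket:
        raise ValueError("Bucket name must not be empty.")
    if bucket != bucket.lower():
        raise ValueError("Bucket names must use lowercase characters.")
    # Avoid a doubled slash if the key was written with a leading "/".
    return f"minio://{bucket}/{key.lstrip('/')}"


train_path = minio_path("mybucket", "train.parquet")
```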

Supported data formats

Snorkel Flow does not support the DateTime data format. When you import data to Snorkel Flow, convert DateTime to String.

When you save a Snorkel Flow DataFrame to a .csv file, Snorkel Flow automatically converts unsupported data formats to supported data formats. For example, Snorkel Flow converts DateTime to String. If you import a .csv generated from another source, you must convert unsupported data types manually before importing the file to Snorkel Flow.
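
For data held in a Pandas DataFrame, the manual conversion from DateTime to String might look like the following sketch. The helper name `datetimes_to_strings`, the example column names, and the output format string are assumptions made for illustration; choose a format that suits your application.

```python
import pandas as pd


def datetimes_to_strings(df: pd.DataFrame, fmt: str = "%Y-%m-%d %H:%M:%S") -> pd.DataFrame:
    """Return a copy of df with every datetime column formatted as strings.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            out[col] = out[col].dt.strftime(fmt)
    return out


# Hypothetical example data with one datetime column.
df = pd.DataFrame(
    {
        "created": pd.to_datetime(["2024-01-02", "2024-03-04"]),
        "text": ["a", "b"],
    }
)

df_converted = datetimes_to_strings(df)
```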