Data preparation
Before uploading your data to Snorkel Flow, we recommend that you go through the basic preparation steps outlined on this page, in addition to any custom preprocessing needed for your application. These steps minimize unexpected data processing issues when managing your applications. After you finish preparing your data, follow the steps in Data upload to upload your dataset to Snorkel Flow.
The examples on this page use data that is stored in a Pandas DataFrame and in Snorkel Flow’s object storage (MinIO). However, Snorkel Flow supports data uploaded from other sources as well. See Supported data source types for more information.
Format data and ground truth for your ML task
The following pages walk through how to format your data and ground truth for different input data types and output tasks:
Check for NaN, -Inf, and Inf values
We recommend removing NaN, -Inf, and Inf values from your data before uploading. Out-of-range float values are not JSON-compliant, and errors are common with these types of values while working with data in Snorkel Flow.
There are many resources online that walk you through how to remove NaN, -Inf, and Inf values. Here is an example for sanitizing data in this format for Pandas DataFrames.
Check column types
We recommend checking the column types of your data to ensure that they are typed as expected. This can be done using dtypes for Pandas DataFrames.
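The check can be sketched as a comparison between each column's dtype and the type you expect. The helper `find_dtype_mismatches` and the column names `doc_id` and `score` are assumptions made for this example, not part of Snorkel Flow.

```python
import pandas as pd


def find_dtype_mismatches(df: pd.DataFrame, expected: dict) -> dict:
    """Return {column: (expected_dtype, actual_dtype)} for mismatched columns."""
    mismatches = {}
    for column, expected_dtype in expected.items():
        actual_dtype = str(df[column].dtype)
        if actual_dtype != expected_dtype:
            mismatches[column] = (expected_dtype, actual_dtype)
    return mismatches


# Hypothetical data: doc_id holds numbers that were stored as text.
df = pd.DataFrame({"doc_id": ["1", "2", "3"], "score": [0.1, 0.5, 0.9]})
expected = {"doc_id": "int64", "score": "float64"}

mismatches = find_dtype_mismatches(df, expected)

# Cast the mismatched column explicitly once you have confirmed the intent.
df["doc_id"] = df["doc_id"].astype("int64")
```

Printing `df.dtypes` directly is often enough for a quick look; a helper like this is mainly useful when the same check runs on every upload.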
Snorkel Flow treats columns titled url as URLs, which means that both the raw URL and the content hosted at the URL are displayed.
Make note of the label column name and label classes
If you have ground truth as a column in the data that you are uploading, make a note of the label column name and the label classes. You’ll need to specify these while uploading data to Snorkel Flow.
(Optional) Split data into train, valid, and test sets
You can optionally split your data into train, valid, and test sets before uploading it to Snorkel Flow. You may also choose to split by percentages during data upload to Snorkel Flow.
- Train split: The train split does not require ground truth labels, but we recommend including a few ground truths so that you can utilize them during model development. You can also add ground truth labels in Snorkel Flow's Annotation Studio after you upload the dataset.
- Valid split: The valid split should contain ground truth labels for evaluation purposes. This split can be used to tune your ML model.
- Test split: The held-out test split should contain ground truth labels for evaluation purposes.
The following example shows how data can be shuffled and split into train, valid, and test sets. The data is stratified over class labels in the column named "label_column", with 70% of the data in the train set, and 15% in each of the valid and test sets.
# Shuffle dataset
df = df.sample(frac=1, random_state=42)
# 70% / 15% / 15% split by class
df_train = df.groupby("label_column").sample(frac=0.7, random_state=42)
df_valid = df.drop(df_train.index).groupby("label_column").sample(frac=0.5, random_state=42)
df_test = df.drop(df_train.index).drop(df_valid.index)
Be sure to check out our Tips for splitting and partitioning data guide when preparing data splits.
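After splitting, it can be worth confirming that the splits do not overlap and that the class balance is similar across them. The sketch below assumes that the DataFrame index labels are unique, which the drop-by-index approach in the split example also relies on. The helper name `summarize_splits` is hypothetical.

```python
import pandas as pd


def summarize_splits(splits: dict, label_column: str) -> pd.DataFrame:
    """Return class proportions per split; raise if splits share index labels."""
    seen = set()
    proportions = {}
    for name, split in splits.items():
        overlap = seen & set(split.index)
        if overlap:
            raise ValueError(f"Split '{name}' shares {len(overlap)} rows with an earlier split")
        seen |= set(split.index)
        # Fraction of rows per class label within this split.
        proportions[name] = split[label_column].value_counts(normalize=True)
    # Classes missing from a split show up as 0.0 instead of NaN.
    return pd.DataFrame(proportions).fillna(0.0)
```

For example, `summarize_splits({"train": df_train, "valid": df_valid, "test": df_test}, "label_column")` returns one row per class and one column per split.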
Export the data splits as Parquet or CSV files
If you are uploading data from a Pandas DataFrame via object storage or locally, first export the data to a Parquet or CSV file.
Snorkel recommends Parquet files instead of CSV files to avoid column type ambiguities. For example, a CSV file cannot indicate whether "123" is meant to be ingested as a string or an integer. Parquet files explicitly store these column types; CSV files do not.
The following example shows how to export data from a Pandas DataFrame to Parquet files:
# Export to Parquet file
df_train.to_parquet("/path/to/my/dataset_name_train.parquet")
df_valid.to_parquet("/path/to/my/dataset_name_valid.parquet")
df_test.to_parquet("/path/to/my/dataset_name_test.parquet")
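If you need to export to CSV instead, the following sketch shows the type ambiguity in practice by round-tripping a small DataFrame through an in-memory CSV. The column names `zip_code` and `label` and the sample values are made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"zip_code": ["007", "123"], "label": ["a", "b"]})

# Write to an in-memory CSV; index=False keeps the DataFrame index out of the file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default parsing infers an integer column, so "007" loses its leading zeros.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Declaring the dtype when reading preserves the original strings.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"zip_code": str})
```

The same `to_csv(..., index=False)` call works with a file path in place of the buffer, for example when exporting `df_train`.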
(Optional) Upload data to Snorkel Flow’s object storage (MinIO)
Data can be uploaded from a variety of sources (see Supported data source types for more information). One option is to upload data from the MinIO object storage system deployment that comes with Snorkel Flow. This can be a good option if your data is stored locally and is larger than 100MB.
If you plan to use MinIO as a data source, you need to upload your data to MinIO first. Follow the steps below:
- Access Snorkel Flow’s object storage. At the bottom of the left-side menu, click your user name, then click Resources → MinIO object storage. From there, you can log in to the MinIO console. For any issues with the access key, or for access issues with Kubernetes installations, contact your Snorkel administrator.
- Create a bucket to upload your data. In the bottom-right corner of your screen, click the red + button, then click Create bucket. From there, you can name your bucket.
  Note: Make sure that you use lowercase characters while naming your bucket. Using uppercase characters will result in an error.
  Note: MinIO does not create a folder at the specified location until a file is uploaded. If you plan to use a bucket programmatically and need the folder to exist, upload an empty file in the UI to create the folder.
- Upload your data. Once your bucket is created, click the red + button again, then click Upload file. Select the files that you’d like to upload.
Make sure that you remember your file paths on MinIO, as you’ll need them to create a data source on Snorkel Flow. The file path is in the form minio://<bucket>/<path/to/file>. For example, a file named train.parquet inside the bucket mybucket in MinIO would be referred to as minio://mybucket/train.parquet.
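If you track several files, a small helper can build these paths consistently. This is only a sketch of the path format described above. The function `minio_path` is hypothetical and does not call MinIO or Snorkel Flow. Its lowercase check mirrors the bucket naming note, not the full set of MinIO naming rules.

```python
def minio_path(bucket: str, object_path: str) -> str:
    """Build a path of the form minio://<bucket>/<path/to/file>."""
    if not bucket:
        raise ValueError("Bucket name must not be empty")
    if bucket != bucket.lower():
        raise ValueError(f"Bucket name must be lowercase: {bucket!r}")
    # Strip leading slashes so the result never contains 'bucket//file'.
    return f"minio://{bucket}/{object_path.lstrip('/')}"


train_path = minio_path("mybucket", "train.parquet")
```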
Supported data formats
Snorkel Flow does not support the DateTime data format. When you import data to Snorkel Flow, convert DateTime to String.
When you save a Snorkel Flow DataFrame to a .csv file, Snorkel Flow automatically converts unsupported data formats to supported data formats. For example, Snorkel Flow converts DateTime to String. If you import a .csv generated from another source, you must convert unsupported data types manually before importing the file to Snorkel Flow.
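For data held in a Pandas DataFrame, the manual conversion might look like the sketch below, which formats every datetime column as a string. The helper name `datetimes_to_strings`, the default format string, and the sample column `created_at` are assumptions for this example. Choose a format that suits your data.

```python
import pandas as pd


def datetimes_to_strings(df: pd.DataFrame, fmt: str = "%Y-%m-%d %H:%M:%S") -> pd.DataFrame:
    """Return a copy of df with all datetime columns formatted as strings."""
    converted = df.copy()
    # Select both timezone-naive and timezone-aware datetime columns.
    datetime_columns = converted.select_dtypes(include=["datetime", "datetimetz"]).columns
    for column in datetime_columns:
        converted[column] = converted[column].dt.strftime(fmt)
    return converted


# Hypothetical sample data with one datetime column.
df = pd.DataFrame(
    {"created_at": pd.to_datetime(["2024-01-02 03:04:05"]), "value": [1]}
)
df_for_upload = datetimes_to_strings(df)
```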