Version: 25.1

snorkelflow.ingest.time_series_csv_to_parquet

snorkelflow.ingest.time_series_csv_to_parquet(input_csv_file_path, output_parquet_file_path, value_col, timestamp_col, label_col=None, doc_size=None, overlap=0, uid_offset=None, fill_na=None)

Generate a Snorkel Flow-ingestible Parquet file from a time series CSV file.

Examples

Input CSV file:

timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1

The output parquet file when doc_size=5 would look like this:

                                        values_array
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT...
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT...
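
As a rough illustration of how the ten input rows become two output rows, the sketch below splits the rows into consecutive documents of doc_size points and serializes each one. The encoding is an assumption inferred from the truncated cells above: the gASV prefix is what base64-encoded Python pickle (protocol 4) output starts with, and the visible bytes decode to the beginning of a JSON string with a timestamp key. The helpers chunk_rows, encode_doc, and decode_doc are hypothetical names, not part of snorkelflow, and the library's actual implementation may differ.

```python
import base64
import json
import pickle

# (timestamp, value) pairs from the example CSV above.
ROWS = [
    ("2012-05-11 03:45:00", 3.1),
    ("2012-05-11 04:45:00", 1.5),
    ("2012-05-11 05:45:00", 2.3),
    ("2012-05-11 06:45:00", 4.4),
    ("2012-05-11 07:45:00", 8.9),
    ("2012-05-12 03:45:00", 10.1),
    ("2012-05-12 04:45:00", 12.0),
    ("2012-05-12 05:45:00", 11.0),
    ("2012-05-12 06:45:00", 10.0),
    ("2012-05-12 07:45:00", 15.0),
]


def chunk_rows(rows, doc_size):
    """Split rows into consecutive, non-overlapping docs (overlap=0 case only)."""
    return [rows[i:i + doc_size] for i in range(0, len(rows), doc_size)]


def encode_doc(doc):
    """ASSUMED cell format: list of JSON strings -> pickle (protocol 4) -> base64."""
    points = [json.dumps({"timestamp": ts, "value": value}) for ts, value in doc]
    return base64.b64encode(pickle.dumps(points, protocol=4)).decode("ascii")


def decode_doc(cell):
    """Reverse encode_doc: base64 -> pickle -> list of dicts."""
    return [json.loads(point) for point in pickle.loads(base64.b64decode(cell))]


docs = chunk_rows(ROWS, doc_size=5)
values_array = [encode_doc(doc) for doc in docs]

for index, cell in enumerate(values_array):
    print(index, cell[:20] + "...", len(decode_doc(cell)), "points")
```

If you use something like decode_doc to inspect a generated file, only unpickle cells from files you produced yourself, since pickle.loads can execute arbitrary code.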

This parquet file can be uploaded to a dataset as a data source like below:

sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET")

In the example above, Snorkel Flow generates the UID for each row (or "doc") (see Data Onboarding for Snorkel Flow Generated UID). To control the UIDs and make them deterministic (e.g., when you want to upload ground truth labels), pass uid_offset to this function and uid_col="uid" to sf.create_datasource.

The output parquet file when doc_size=5 and uid_offset=2 would look like this:

                                        values_array        uid
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT... 2
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT... 3

Then specify uid_col="uid" when uploading this parquet file as follows:

import snorkelflow.client as sf
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET", uid_col="uid")
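
When several CSV files are converted for the same dataset, each call needs a uid_offset that does not collide with the UIDs of the previous files. The sketch below is one way to plan those offsets. It assumes overlap=0 and that a file with n rows yields ceil(n / doc_size) documents, which matches the example above (10 rows, doc_size=5, two documents) but is not stated by this reference. plan_uid_offsets is a hypothetical helper, not a snorkelflow function.

```python
import math


def plan_uid_offsets(row_counts, doc_size, start=0):
    """Return one uid_offset per file so that UID ranges do not overlap.

    ASSUMPTION: with overlap=0, a file with n rows produces
    ceil(n / doc_size) documents, each taking one consecutive UID.
    """
    offsets = []
    next_uid = start
    for n_rows in row_counts:
        offsets.append(next_uid)
        next_uid += math.ceil(n_rows / doc_size)
    return offsets


# Three hypothetical CSV files with 10, 10, and 7 rows.
offsets = plan_uid_offsets([10, 10, 7], doc_size=5, start=2)
print(offsets)
```

Each resulting value would then be passed as uid_offset for the corresponding file. The first file here starts at 2, like the uid_offset=2 example above.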

Parameters

Name | Type | Default | Info
input_csv_file_path | str | (required) | Path to the input CSV file. Both local and MinIO paths are supported.
output_parquet_file_path | str | (required) | Path of the generated parquet file. Only MinIO paths are supported.
value_col | str | (required) | Name of the CSV column containing values.
timestamp_col | str | (required) | Name of the CSV column containing timestamps. Timestamps must be in a format that pandas.Timestamp can parse.
label_col | Optional[str] | None | Name of the column containing ground truth labels, or None if no such column exists.
doc_size | Optional[int] | None | Desired size of each generated document, in number of datapoints, excluding overlap. None by default, meaning the entire CSV becomes one document.
overlap | int | 0 | Desired overlap between the generated documents. 0 by default.
uid_offset | Optional[int] | None | If specified, a "uid" column is added to the parquet file. UIDs start from uid_offset.
fill_na | Optional[str] | None | If specified, NaN values in the value column are filled using the given strategy. One of: 'fillzero', 'pad', 'ffill', 'backfill', 'bfill', 'interpolate'.
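
The fill_na strategy names are not defined further here. The sketch below shows one plausible reading, assuming they behave like the pandas methods of the same name: 'pad' and 'ffill' carry the last known value forward, 'backfill' and 'bfill' carry the next known value backward, 'interpolate' fills linearly between known neighbors, and 'fillzero' substitutes 0. fill_missing is a hypothetical helper that uses None in place of NaN. It is not the library's implementation, and edge cases (such as leading or trailing gaps) may be handled differently there.

```python
def fill_missing(values, strategy):
    """Fill None entries in a list of numbers (ASSUMED semantics, see lead-in)."""
    out = list(values)
    if strategy == "fillzero":
        return [0.0 if v is None else v for v in out]
    if strategy in ("pad", "ffill"):
        last = None
        for i, v in enumerate(out):
            if v is None:
                out[i] = last  # stays None if nothing precedes it
            else:
                last = v
        return out
    if strategy in ("backfill", "bfill"):
        upcoming = None
        for i in range(len(out) - 1, -1, -1):
            if out[i] is None:
                out[i] = upcoming  # stays None if nothing follows it
            else:
                upcoming = out[i]
        return out
    if strategy == "interpolate":
        known = [i for i, v in enumerate(out) if v is not None]
        # Linear fill between each pair of neighboring known points;
        # leading and trailing gaps are left untouched in this sketch.
        for left, right in zip(known, known[1:]):
            step = (out[right] - out[left]) / (right - left)
            for i in range(left + 1, right):
                out[i] = out[left] + step * (i - left)
        return out
    raise ValueError(f"unknown strategy: {strategy!r}")


series = [1.0, None, 3.0, None, None, 6.0]
for name in ("fillzero", "ffill", "bfill", "interpolate"):
    print(name, fill_missing(series, name))
```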

Return type

None