Version: 0.94

snorkelflow.ingest.time_series_csv_to_parquet

snorkelflow.ingest.time_series_csv_to_parquet(input_csv_file_path, output_parquet_file_path, value_col, timestamp_col, label_col=None, doc_size=None, overlap=0, uid_offset=None, fill_na=None)

Generate a Snorkel Flow-ingestible Parquet file from a time-series CSV file.

Examples

Input CSV file:

timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1
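
A call that converts a CSV like this one might look like the sketch below. The column names come from the CSV header above and the arguments follow the documented signature; the `run_ingest` wrapper and the `INGEST_KWARGS` name are hypothetical and exist only to keep the example self-contained.

```python
# Keyword arguments matching the example CSV header (timestamp,value,label).
INGEST_KWARGS = {
    "value_col": "value",
    "timestamp_col": "timestamp",
    "label_col": "label",
    "doc_size": 5,  # 10 rows -> 2 documents of 5 datapoints each
}


def run_ingest(input_csv_file_path: str, output_parquet_file_path: str) -> None:
    """Hypothetical wrapper around the documented function."""
    # Imported lazily so this module loads even where the SDK is not installed.
    from snorkelflow.ingest import time_series_csv_to_parquet

    time_series_csv_to_parquet(
        input_csv_file_path,
        output_parquet_file_path,
        **INGEST_KWARGS,
    )
```

In an environment with the SDK installed, you would call `run_ingest(...)` with your own input path and a MinIO output path; no concrete paths are shown here because their exact form depends on your deployment.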

The output parquet file when doc_size=5 would look like this:

                                        values_array
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT...
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT...
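
The truncated cells above begin with `gASV`, which is how a base64-encoded protocol-4 pickle starts, and the visible prefix is consistent with a pickled list of JSON strings. This page does not document the encoding, so the sketch below is an inference rather than a specification. It builds a synthetic cell under that assumption and decodes it; `decode_values_array` and the `"value"` key are hypothetical.

```python
import base64
import json
import pickle


def decode_values_array(cell: str) -> list:
    """Decode one cell, ASSUMING base64(pickle(list of JSON strings))."""
    raw = base64.b64decode(cell)
    # Only unpickle data you trust: pickle can execute arbitrary code.
    items = pickle.loads(raw)
    return [json.loads(item) for item in items]


# Synthetic stand-in for a real cell (the cells shown above are truncated).
points = [
    {"timestamp": "2012-05-11 03:45:00", "value": 3.1},
    {"timestamp": "2012-05-11 04:45:00", "value": 1.5},
]
cell = base64.b64encode(
    pickle.dumps([json.dumps(p) for p in points], protocol=4)
).decode("ascii")

decoded = decode_values_array(cell)
```

If a real cell fails to decode this way, the assumption about the encoding is wrong for your version and the cell should be treated as opaque.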

This parquet file can be uploaded to a dataset as a data source like below:

sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET")

In the example above, Snorkel Flow generates the UID for each row (or “doc”); see Data Onboarding for details on Snorkel Flow-generated UIDs. To control the UIDs and make them deterministic (e.g., when you want to upload ground truth labels), specify uid_offset in this function and uid_col="uid" when creating the data source.

The output parquet file when doc_size=5 and uid_offset=2 would look like this:

                                        values_array        uid
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT... 2
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT... 3

Then specify uid_col="uid" when uploading this parquet file as follows:

import snorkelflow.client as sf
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET", uid_col="uid")
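
To see why the 10-row CSV yields two documents with UIDs 2 and 3, the following sketch mimics the chunking in plain Python. It is an illustration, not the library's implementation: `split_into_docs` is a hypothetical helper, and the way `overlap` extends each document is one plausible reading of the parameter description.

```python
from typing import List, Optional, Sequence, Tuple


def split_into_docs(
    rows: Sequence,
    doc_size: Optional[int] = None,
    overlap: int = 0,
    uid_offset: Optional[int] = None,
) -> List[Tuple[Optional[int], list]]:
    """Illustrative only: NOT the library's implementation."""
    if doc_size is None:
        # The whole CSV becomes one document (guard against an empty input).
        doc_size = max(len(rows), 1)
    docs = []
    for i, start in enumerate(range(0, len(rows), doc_size)):
        # Assumption: each document also carries `overlap` extra datapoints
        # taken from the start of the next document.
        chunk = list(rows[start : start + doc_size + overlap])
        uid = None if uid_offset is None else uid_offset + i
        docs.append((uid, chunk))
    return docs


rows = list(range(10))  # stands in for the 10 CSV rows above
docs = split_into_docs(rows, doc_size=5, uid_offset=2)
uids = [uid for uid, _ in docs]
sizes = [len(chunk) for _, chunk in docs]
```

With these inputs, `uids` is `[2, 3]` and `sizes` is `[5, 5]`, matching the two rows in the table above.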

Parameters:
  • input_csv_file_path (str) – Path to the input CSV file. Both local path and MinIO path are supported.

  • output_parquet_file_path (str) – Path of the generated parquet file. Only MinIO path is supported.

  • value_col (str) – Name of column in CSV containing values.

  • timestamp_col (str) – Name of column in CSV containing timestamps. The timestamp has to be in a format that can be parsed by pandas.Timestamp.

  • label_col (Optional[str], default: None) – The name of the column containing ground truth labels, or None if no such column exists.

  • doc_size (Optional[int], default: None) – The desired size of each document generated (in terms of the number of datapoints each contains), excluding overlap. None by default, meaning the entire CSV becomes one document.

  • overlap (int, default: 0) – The desired overlap between the generated documents. 0 by default.

  • uid_offset (Optional[int], default: None) – If specified, a “uid” column is added to the parquet file, with UIDs starting from uid_offset.

  • fill_na (Optional[str], default: None) – If specified, NaN values in the value column are filled using the given strategy. Supported strategies: 'fillzero', 'pad', 'ffill', 'backfill', 'bfill', 'interpolate'.
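
The fill_na strategy names resemble pandas fill methods, where pad/ffill and backfill/bfill are aliases of each other. Whether the function delegates to pandas is not stated here, so the plain-Python sketch below only illustrates what each name conventionally means; `fill_missing` is a hypothetical helper, and its handling of leading and trailing NaN values may differ from the library's.

```python
import math
from typing import List


def fill_missing(values: List[float], strategy: str) -> List[float]:
    """Fill NaN entries; an illustration, not the library's code."""
    out = list(values)
    n = len(out)
    if strategy == "fillzero":
        return [0.0 if math.isnan(v) else v for v in out]
    if strategy in ("pad", "ffill"):
        # Carry the last seen value forward; leading NaNs stay NaN.
        for i in range(1, n):
            if math.isnan(out[i]):
                out[i] = out[i - 1]
        return out
    if strategy in ("backfill", "bfill"):
        # Carry the next seen value backward; trailing NaNs stay NaN.
        for i in range(n - 2, -1, -1):
            if math.isnan(out[i]):
                out[i] = out[i + 1]
        return out
    if strategy == "interpolate":
        # Linear interpolation between neighbouring known points;
        # NaNs before the first or after the last known point are left as-is.
        known = [i for i, v in enumerate(values) if not math.isnan(v)]
        for left, right in zip(known, known[1:]):
            step = (values[right] - values[left]) / (right - left)
            for i in range(left + 1, right):
                out[i] = values[left] + step * (i - left)
        return out
    raise ValueError(f"unknown strategy: {strategy!r}")


nan = float("nan")
data = [1.0, nan, 3.0, nan]
zero_filled = fill_missing(data, "fillzero")
forward_filled = fill_missing(data, "ffill")
backward_filled = fill_missing(data, "bfill")
interpolated = fill_missing(data, "interpolate")
```
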

Return type:

None