snorkelflow.ingest.time_series_csv_to_parquet
- snorkelflow.ingest.time_series_csv_to_parquet(input_csv_file_path, output_parquet_file_path, value_col, timestamp_col, label_col=None, doc_size=None, overlap=0, uid_offset=None, fill_na=None)
Generate a Snorkel Flow-ingestible parquet file from a time series CSV file.
Examples
Input CSV file:
timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1
The output parquet file when doc_size=5 would look like this:
  values_array
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT...
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT...
This parquet file can be uploaded to a dataset as a data source like below:
import snorkelflow.client as sf
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET")
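As a rough illustration of why the ten-row CSV above yields two rows when doc_size=5, the sketch below splits rows into documents using only the Python standard library. It is not the library's implementation: the helper split_into_docs is hypothetical, and the way overlap is applied (each document is extended with rows from the start of the next chunk) is an assumption about the overlap parameter.

```python
import csv
import io

CSV_TEXT = """timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1
"""


def split_into_docs(rows, doc_size=None, overlap=0):
    """Hypothetical helper: group consecutive rows into documents.

    Assumption: each document holds doc_size rows plus up to `overlap`
    extra rows borrowed from the start of the next chunk.
    """
    if doc_size is None:
        # Documented behavior: the entire CSV becomes one document.
        return [list(rows)]
    if doc_size <= 0 or overlap < 0:
        raise ValueError("doc_size must be positive and overlap non-negative")
    docs = []
    for start in range(0, len(rows), doc_size):
        docs.append(rows[start:start + doc_size + overlap])
    return docs


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
docs = split_into_docs(rows, doc_size=5)
print(len(docs))               # number of documents (parquet rows)
print([len(d) for d in docs])  # datapoints per document
```

Under these assumptions, each parquet row corresponds to one document, which is why the example output has rows 0 and 1.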
In the example above, Snorkel Flow generates the UID for each row (or “doc”) (see Data Onboarding for Snorkel Flow Generated UID). If you want to control the UIDs and make them deterministic (e.g., when you want to upload ground truth labels), specify uid_offset when calling this function and uid_col="uid" when creating the data source.
The output parquet file when doc_size=5 and uid_offset=2 would look like this:
  values_array                                      uid
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT... 2
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT... 3
Then specify uid_col="uid" when uploading this parquet file as follows:
import snorkelflow.client as sf
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET", uid_col="uid")
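The UID example above (uid_offset=2 giving UIDs 2 and 3) suggests that UIDs count up by one per document, starting at uid_offset. The sketch below uses that reading to predict the UIDs a file would receive and to pick a non-colliding offset for a second file. The helpers doc_count and uid_range are hypothetical, and treating a trailing partial chunk as its own document is an assumption.

```python
import math


def doc_count(num_rows, doc_size=None):
    """Hypothetical helper: number of documents produced from num_rows rows."""
    if doc_size is None:
        return 1  # the entire CSV becomes one document
    # Assumption: a trailing partial chunk still forms a document.
    return math.ceil(num_rows / doc_size)


def uid_range(num_rows, doc_size, uid_offset):
    """Hypothetical helper: UIDs assigned when counting up from uid_offset."""
    return list(range(uid_offset, uid_offset + doc_count(num_rows, doc_size)))


# First file: 10 rows, doc_size=5, uid_offset=2 (matches the example above).
first_uids = uid_range(num_rows=10, doc_size=5, uid_offset=2)
print(first_uids)

# Second file: start right after the last UID of the first file.
next_offset = first_uids[-1] + 1
second_uids = uid_range(num_rows=10, doc_size=5, uid_offset=next_offset)
print(second_uids)
```

Knowing the UIDs ahead of time is what makes it possible to prepare ground truth labels keyed by UID before the data source is created.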
- Parameters:
input_csv_file_path (str) – Path to the input CSV file. Both local and MinIO paths are supported.
output_parquet_file_path (str) – Path of the generated parquet file. Only MinIO paths are supported.
value_col (str) – Name of the column in the CSV containing values.
timestamp_col (str) – Name of the column in the CSV containing timestamps. The timestamps must be in a format that can be parsed by pandas.Timestamp.
label_col (Optional[str], default: None) – Name of the column containing ground truth labels, or None if no such column exists.
doc_size (Optional[int], default: None) – The desired size of each generated document (in terms of the number of datapoints each contains), excluding overlap. None by default, meaning the entire CSV becomes one document.
overlap (int, default: 0) – The desired overlap between the generated documents. 0 by default.
uid_offset (Optional[int], default: None) – If specified, a “uid” column is added to the parquet file, with UIDs starting from uid_offset.
fill_na (Optional[str], default: None) – If specified, NaN values in the “value” column are handled using one of the following strategies: 'fillzero', 'pad', 'ffill', 'backfill', 'bfill', 'interpolate'.
- Return type:
None
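The fill_na strategy names resemble the fill methods found in pandas, where pad is an alias of ffill and backfill is an alias of bfill. The pure-Python sketch below shows one plausible reading of each strategy on a small list with missing values. It is an illustration only: fill_missing is a hypothetical helper, and the library's actual handling, especially of gaps at the start or end of the series, may differ.

```python
def fill_missing(values, strategy):
    """Hypothetical helper: fill None entries in a list of floats."""
    out = list(values)
    if strategy == "fillzero":
        return [0.0 if v is None else v for v in out]
    if strategy in ("pad", "ffill"):
        # Carry the last seen value forward; leading gaps stay None.
        last = None
        for i, v in enumerate(out):
            if v is None:
                out[i] = last
            else:
                last = v
        return out
    if strategy in ("backfill", "bfill"):
        # Carry the next seen value backward; trailing gaps stay None.
        nxt = None
        for i in range(len(out) - 1, -1, -1):
            if out[i] is None:
                out[i] = nxt
            else:
                nxt = out[i]
        return out
    if strategy == "interpolate":
        # Linear interpolation between known neighbors; edge gaps stay None.
        known = [i for i, v in enumerate(out) if v is not None]
        for left, right in zip(known, known[1:]):
            step = (out[right] - out[left]) / (right - left)
            for i in range(left + 1, right):
                out[i] = out[left] + step * (i - left)
        return out
    raise ValueError(f"unknown strategy: {strategy!r}")


series = [1.0, None, None, 4.0, None]
for name in ("fillzero", "ffill", "bfill", "interpolate"):
    print(name, fill_missing(series, name))
```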