snorkelflow.ingest.time_series_csv_to_parquet
- snorkelflow.ingest.time_series_csv_to_parquet(input_csv_file_path, output_parquet_file_path, value_col, timestamp_col, label_col=None, doc_size=None, overlap=0, uid_offset=None, fill_na=None)
Generate a Snorkel Flow-ingestible Parquet file from a time series CSV file.
Examples
Input CSV file:
timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1
The output parquet file when doc_size=5 would look like this:
  values_array
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT...
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT...
This parquet file can be uploaded to a dataset as a data source like below:
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET")
In the example above, Snorkel Flow generates the UID for each row (or “doc”). To control the UIDs and make them deterministic (e.g., when you want to upload ground truth labels), specify uid_offset when calling this function and uid_col="uid" when creating the data source.
The output parquet file when doc_size=5 and uid_offset=2 would look like this:
  values_array uid
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT... 2
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT... 3
Then specify uid_col="uid" when uploading this parquet file as follows:
import snorkelflow.client as sf
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET", uid_col="uid")
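The way doc_size and uid_offset shape the output in the examples above can be sketched in plain Python. This is only an illustration, not Snorkel Flow's implementation: split_into_docs is a hypothetical helper, and its handling of overlap (each document carries `overlap` extra datapoints from the start of the next one) is an assumption that the documentation does not confirm. What the examples do show is that 10 rows with doc_size=5 yield two documents, and that uid_offset=2 gives them the UIDs 2 and 3.

```python
import csv
import io

SAMPLE_CSV = """timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1
"""


def split_into_docs(rows, doc_size=None, overlap=0, uid_offset=None):
    """Hypothetical helper: group rows into documents of doc_size datapoints."""
    if not rows:
        return []
    if doc_size is None:
        # Mirrors the documented default: the entire CSV becomes one document.
        doc_size = len(rows)
    docs = []
    for index, start in enumerate(range(0, len(rows), doc_size)):
        # Assumption: overlap appends datapoints from the following document.
        doc = {"rows": rows[start:start + doc_size + overlap]}
        if uid_offset is not None:
            # UIDs count up from uid_offset, one per document.
            doc["uid"] = uid_offset + index
        docs.append(doc)
    return docs


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
docs = split_into_docs(rows, doc_size=5, uid_offset=2)
for doc in docs:
    print(doc["uid"], len(doc["rows"]), doc["rows"][0]["timestamp"])
```

Running the sketch on the sample CSV prints one line per document with its UID, its number of datapoints, and its first timestamp, which makes it easy to check which rows a given UID refers to before uploading ground truth labels.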
Return type
None
Parameters
Name | Type | Default | Info
input_csv_file_path | str | | Path to the input CSV file. Both local paths and MinIO paths are supported.
output_parquet_file_path | str | | Path of the generated parquet file. Only MinIO paths are supported.
value_col | str | | Name of the column in the CSV containing values.
timestamp_col | str | | Name of the column in the CSV containing timestamps. The timestamps must be in a format that can be parsed by pandas.Timestamp.
label_col | Optional[str] | None | Name of the column containing ground truth labels, or None if no such column exists.
doc_size | Optional[int] | None | The desired size of each generated document (the number of datapoints it contains), excluding overlap. None by default, meaning the entire CSV becomes one document.
overlap | int | 0 | The desired overlap between the generated documents. 0 by default.
uid_offset | Optional[int] | None | If specified, a “uid” column is added to the parquet file. The UIDs start from uid_offset.
fill_na | Optional[str] | None | If specified, NaN values in the “value” column are handled with the given strategy. Supported strategies: 'fillzero', 'pad', 'ffill', 'backfill', 'bfill', 'interpolate'.
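The fill_na strategy names resemble the gap-filling methods in pandas, so the sketch below shows what such names conventionally mean. It is an assumption that Snorkel Flow follows these conventions; the documentation lists only the names. All functions here are hypothetical, None stands in for NaN, and this simple interpolate leaves gaps at either end of the series unfilled.

```python
from typing import Callable, Dict, List, Optional

Values = List[Optional[float]]


def fill_zero(values: Values) -> Values:
    """Replace every missing value with 0."""
    return [0.0 if v is None else v for v in values]


def forward_fill(values: Values) -> Values:
    """Carry the last seen value forward into each gap."""
    out: Values = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def backward_fill(values: Values) -> Values:
    """Fill each gap with the next known value."""
    return forward_fill(values[::-1])[::-1]


def interpolate(values: Values) -> Values:
    """Linearly interpolate gaps that lie between two known values."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


# Assumed mapping from strategy name to behavior; aliases share a function.
STRATEGIES: Dict[str, Callable[[Values], Values]] = {
    "fillzero": fill_zero,
    "pad": forward_fill,
    "ffill": forward_fill,
    "backfill": backward_fill,
    "bfill": backward_fill,
    "interpolate": interpolate,
}

values: Values = [1.0, None, None, 4.0, None]
for name, fill in STRATEGIES.items():
    print(name, fill(values))
```

Comparing the printed lines side by side shows how the choice matters for time series: forward filling never uses future datapoints, while backward filling and interpolation do.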