snorkelflow.ingest.time_series_csv_to_parquet
- snorkelflow.ingest.time_series_csv_to_parquet(input_csv_file_path, output_parquet_file_path, value_col, timestamp_col, label_col=None, doc_size=None, overlap=0, uid_offset=None, fill_na=None)
Generate a Snorkel Flow-ingestible Parquet file from a CSV file.
Examples
Input CSV file:
timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1
The output parquet file when doc_size=5 would look like this:
  values_array
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT...
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT...
This parquet file can be uploaded to a dataset as a data source like below:
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET")
In the example above, Snorkel Flow generates the UID for each row (or "doc"). If you want to control the UIDs and make them deterministic (e.g., when you want to upload ground truth labels), specify uid_offset when calling this function and uid_col="uid" when calling sf.create_datasource.
The output parquet file when doc_size=5 and uid_offset=2 would look like this:
  values_array                                        uid
0 gASVRAEAAAAAAABdlCiMMnsidGltZXN0YW1wIjogIjIwMT...   2
1 gASVEwEAAAAAAABdlCiMM3sidGltZXN0YW1wIjogIjIwMT...   3
Then specify uid_col="uid" when uploading this parquet file as follows:
import snorkelflow.client as sf
sf.create_datasource(DATASET_NAME, output_parquet_file_path, file_type="PARQUET", uid_col="uid")
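The way doc_size and uid_offset shape the output rows can be approximated with a small standard-library sketch. This is not the library's implementation: split_into_docs is a hypothetical helper, and its handling of overlap (each document carries `overlap` extra rows taken from the start of the next one) is an assumption made for illustration only.

```python
import csv
import io

# The example CSV from above, inlined so the sketch is self-contained.
CSV_TEXT = """timestamp,value,label
2012-05-11 03:45:00,3.1,0
2012-05-11 04:45:00,1.5,0
2012-05-11 05:45:00,2.3,1
2012-05-11 06:45:00,4.4,1
2012-05-11 07:45:00,8.9,0
2012-05-12 03:45:00,10.1,1
2012-05-12 04:45:00,12.0,1
2012-05-12 05:45:00,11.0,0
2012-05-12 06:45:00,10.0,0
2012-05-12 07:45:00,15.0,1
"""


def split_into_docs(rows, doc_size=None, overlap=0, uid_offset=None):
    """Group rows into documents (hypothetical helper, not the SDK's code).

    Assumption: a new document starts every `doc_size` rows and carries
    `overlap` extra rows from the start of the next document.
    """
    if doc_size is None:
        # No doc_size: the entire CSV becomes one document.
        doc_size = max(len(rows), 1)
    docs = []
    for i, start in enumerate(range(0, len(rows), doc_size)):
        doc = {"values": rows[start:start + doc_size + overlap]}
        if uid_offset is not None:
            # UIDs count up from uid_offset, one per document.
            doc["uid"] = uid_offset + i
        docs.append(doc)
    return docs


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
docs = split_into_docs(rows, doc_size=5, uid_offset=2)
print([(d["uid"], len(d["values"])) for d in docs])  # [(2, 5), (3, 5)]
```

With ten input rows and doc_size=5, the sketch yields two documents with UIDs 2 and 3, matching the two-row output shown above.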
Parameters
| Name | Type | Default | Info |
|---|---|---|---|
| input_csv_file_path | str | | Path to the input CSV file. Both local path and MinIO path are supported. |
| output_parquet_file_path | str | | Path of the generated parquet file. Only MinIO path is supported. |
| value_col | str | | Name of column in CSV containing values. |
| timestamp_col | str | | Name of column in CSV containing timestamps. The timestamp has to be in a format that can be parsed by pandas.Timestamp. |
| label_col | Optional[str] | None | The name of the column containing ground truth labels, or None if no such column exists. |
| doc_size | Optional[int] | None | The desired size of each document generated (in terms of the number of datapoints each contains), excluding overlap. None by default, meaning the entire CSV becomes one document. |
| overlap | int | 0 | The desired overlap between the generated documents. 0 by default. |
| uid_offset | Optional[int] | None | If specified, a "uid" column will be added to the parquet file. The UID starts from uid_offset. |
| fill_na | Optional[str] | None | If specified, NaN values in the "value" column are processed with one of the following strategies: 'fillzero', 'pad', 'ffill', 'backfill', 'bfill', 'interpolate'. |
Return type
None
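The fill_na strategy names are listed without definitions. The sketch below shows what they would mean if they follow the usual pandas conventions ('pad'/'ffill' forward-fill, 'backfill'/'bfill' backward-fill, 'interpolate' linear interpolation); that mapping is an assumption, and fill_na_sketch is a hypothetical helper rather than the library's code.

```python
import math


def fill_na_sketch(values, strategy):
    """Fill NaN entries in a list of floats (hypothetical helper).

    Assumption: strategy names follow pandas conventions. Leading or
    trailing gaps that have no neighbour to copy from are left as NaN.
    """
    out = list(values)
    n = len(out)
    if strategy == "fillzero":
        return [0.0 if math.isnan(v) else v for v in out]
    if strategy in ("pad", "ffill"):
        # Forward-fill: copy the previous known value into each gap.
        for i in range(1, n):
            if math.isnan(out[i]):
                out[i] = out[i - 1]
        return out
    if strategy in ("backfill", "bfill"):
        # Backward-fill: copy the next known value into each gap.
        for i in range(n - 2, -1, -1):
            if math.isnan(out[i]):
                out[i] = out[i + 1]
        return out
    if strategy == "interpolate":
        # Linear interpolation between neighbouring known values.
        known = [i for i, v in enumerate(out) if not math.isnan(v)]
        for left, right in zip(known, known[1:]):
            step = (out[right] - out[left]) / (right - left)
            for i in range(left + 1, right):
                out[i] = out[left] + step * (i - left)
        return out
    raise ValueError(f"unknown strategy: {strategy!r}")


nan = float("nan")
series = [1.0, nan, 3.0, nan]
for name in ("fillzero", "ffill", "bfill", "interpolate"):
    print(name, fill_na_sketch(series, name))
```

Note that this sketch leaves the trailing gap untouched under 'bfill' and 'interpolate'; the library's actual behaviour at the edges of the series may differ.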