snorkelflow.client.datasources.create_datasource
- snorkelflow.client.datasources.create_datasource(dataset, path, file_type, uid_col=None, split='train', datasource_ds=None, reader_kwargs=None, credential_kwargs=None, scheduler=None, load_to_model_nodes=False, sync=True)
Create a data source.
Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| dataset | Union[str, int] | | Name or UID of the dataset to create the data source in. |
| path | str | | Path to the data source (e.g., minio, s3, http). |
| file_type | str | | File type (csv or parquet). |
| uid_col | Optional[str] | None | Name of the UID column in the data source. The values in this column must be unique non-negative integers that are not duplicated across files. If not specified, a SnorkelFlow ID column is generated. |
| split | str | 'train' | Split of the dataset to add the data source to (train, valid, or test). |
| datasource_ds | Optional[str] | None | Datestamp of the data source in YYYY-MM-DD format. |
| reader_kwargs | Optional[Dict[str, Any]] | None | Dictionary of keyword arguments to pass to Dask read functions. |
| credential_kwargs | Optional[Dict[str, Any]] | None | Dictionary of credentials for specific data connectors. |
| scheduler | Optional[str] | None | Dask scheduler (threads, client, or group) to use. |
| load_to_model_nodes | bool | False | Whether to load the data source into all tasks in the dataset. |
| sync | bool | True | Whether to poll the job status and block until the job completes. |
Returns
UID of the created data source if sync mode is used; otherwise, the job ID.
Return type
Union[str, int]
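Below is a minimal usage sketch showing both the blocking and non-blocking modes. The dataset name, bucket paths, and context setup are illustrative assumptions, not values taken from this page; adapt them to your deployment:

```python
import snorkelflow.client as sf

# Assumed context setup; the exact arguments depend on your deployment.
ctx = sf.SnorkelFlowContext.from_kwargs()

# Synchronous call (sync=True, the default): blocks until ingestion
# completes and returns the UID of the created data source.
datasource_uid = sf.create_datasource(
    dataset="my-dataset",             # hypothetical dataset name
    path="s3://my-bucket/train.csv",  # hypothetical storage path
    file_type="csv",
    split="train",
)

# Asynchronous call (sync=False): returns a job ID immediately instead
# of blocking, so the job can be tracked separately.
job_id = sf.create_datasource(
    dataset="my-dataset",
    path="s3://my-bucket/valid.parquet",
    file_type="parquet",
    split="valid",
    sync=False,
)
```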