snorkelflow.client.datasources.create_datasource
- snorkelflow.client.datasources.create_datasource(dataset, path, file_type, uid_col=None, split='train', datasource_ds=None, reader_kwargs=None, credential_kwargs=None, scheduler=None, load_to_model_nodes=False, sync=True)
Create a data source.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| dataset | Union[str, int] | required | Name or UID of the dataset to create the data source in. |
| path | str | required | Path to the data source (e.g. minio, s3, http). |
| file_type | str | required | File type (csv or parquet). |
| uid_col | Optional[str] | None | Name of the UID column in the data source. The values in this column must be unique non-negative integers that are not duplicated across files. If not specified, a SnorkelFlow ID column will be generated. |
| split | str | 'train' | Split of the dataset to add the data source to (train, valid, or test). |
| datasource_ds | Optional[str] | None | Datestamp of the data source in YYYY-MM-DD format. |
| reader_kwargs | Optional[Dict[str, Any]] | None | Dictionary of keyword arguments to pass to Dask read functions. |
| credential_kwargs | Optional[Dict[str, Any]] | None | Dictionary of credentials for specific data connectors. |
| scheduler | Optional[str] | None | Dask scheduler (threads, client, or group) to use. |
| load_to_model_nodes | bool | False | Whether to load the data source into all tasks in the dataset. |
| sync | bool | True | Whether to poll job status and block until the job completes. |

Returns
UID of the created data source if sync mode is used; otherwise, the ID of the ingestion job.
Return type
Union[str, int]
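A minimal usage sketch follows. It assumes the SDK's usual import pattern (`import snorkelflow.client as sf` with `SnorkelFlowContext.from_kwargs()`) and that `create_datasource` is callable from the client's top level, as the module path suggests; the dataset name, bucket path, and column name below are hypothetical.

```python
import snorkelflow.client as sf

# Assumed context setup; adjust connection arguments for your deployment.
ctx = sf.SnorkelFlowContext.from_kwargs()

# Synchronous call (sync=True, the default): blocks until ingestion
# finishes and returns the UID of the new data source.
datasource_uid = sf.create_datasource(
    dataset="contracts",                            # dataset name or UID (hypothetical)
    path="s3://example-bucket/contracts/train.csv",  # hypothetical path
    file_type="csv",
    uid_col="doc_uid",            # unique non-negative ints, not duplicated across files
    split="train",
    reader_kwargs={"sep": ","},   # forwarded to the Dask read function
)

# Asynchronous call (sync=False): returns immediately with a job ID
# that can be used to track the ingestion job.
job_id = sf.create_datasource(
    dataset="contracts",
    path="s3://example-bucket/contracts/valid.csv",
    file_type="csv",
    split="valid",
    sync=False,
)
```

Passing `sync=False` hands back the job ID immediately, which is useful when kicking off several large ingestions rather than blocking on each one in turn.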