snorkelflow.client.datasources.create_datasource
warning
This is a beta function in 25.4. Beta features may have known gaps or bugs, but are functional workflows and eligible for Snorkel Support. To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.
- snorkelflow.client.datasources.create_datasource(dataset, path, file_type, uid_col=None, split='train', datasource_ds=None, reader_kwargs=None, credential_kwargs=None, scheduler=None, load_to_model_nodes=False, sync=True)
Create a data source.
Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| dataset | Union[str, int] | | Name or UID of the dataset to create the data source in. |
| path | str | | Path to the data source (e.g. minio, s3, http). |
| file_type | str | | File type (csv or parquet). |
| uid_col | Optional[str] | None | Name of the UID column in the data source. The values in this column must be unique non-negative integers that are not duplicated across files. If not specified, a Snorkel Flow ID column is generated. |
| split | str | 'train' | Split of the dataset to add the data source to (train, valid, or test). |
| datasource_ds | Optional[str] | None | Datestamp of the data source in YYYY-MM-DD format. |
| reader_kwargs | Optional[Dict[str, Any]] | None | Dictionary of keyword arguments to pass to Dask read functions. |
| credential_kwargs | Optional[Dict[str, Any]] | None | Dictionary of credentials for specific data connectors. |
| scheduler | Optional[str] | None | Dask scheduler (threads, client, or group) to use. |
| load_to_model_nodes | bool | False | Whether to load the data source into all tasks in the dataset. |
| sync | bool | True | Whether to poll job status and block until complete. |

Returns

UID of the created data source if sync mode is used; otherwise, the job ID.

Return type

Union[str, int]
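For illustration, here is a minimal sketch of creating CSV data sources, assuming a Snorkel Flow context has already been established; the context setup line, dataset name, column name, and S3 paths below are hypothetical placeholders, not part of this reference:

```python
import snorkelflow.client as sf

# Illustrative setup; adjust to your environment and authentication method.
ctx = sf.SnorkelFlowContext.from_kwargs()

# Blocking call: with sync=True (the default), this polls the ingestion job
# and returns the UID of the created data source once the job completes.
datasource_uid = sf.create_datasource(
    dataset="my-dataset",                     # hypothetical dataset name
    path="s3://my-bucket/data/train.csv",     # hypothetical path
    file_type="csv",
    uid_col="doc_uid",                        # unique non-negative integers across files
    split="train",
    reader_kwargs={"sep": ","},               # forwarded to the Dask read function
)

# Non-blocking call: with sync=False, the job ID is returned immediately
# and ingestion continues in the background.
job_id = sf.create_datasource(
    dataset="my-dataset",
    path="s3://my-bucket/data/valid.csv",
    file_type="csv",
    split="valid",
    sync=False,
)
```

Per the return type above, the blocking form returns the data source UID, while the sync=False form returns a job ID that can be used to track the ingestion job.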