
snorkelflow.client.datasources.create_datasource

snorkelflow.client.datasources.create_datasource(dataset, path, file_type, uid_col=None, split='train', datasource_ds=None, reader_kwargs=None, credential_kwargs=None, scheduler=None, load_to_model_nodes=False, sync=True)

Create a data source.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| `dataset` | `Union[str, int]` | | Name or UID of the dataset to create the data source in. |
| `path` | `str` | | Path to the data source (e.g. minio, s3, http). |
| `file_type` | `str` | | File type (`csv` or `parquet`). |
| `uid_col` | `Optional[str]` | `None` | Name of the UID column in the data source. The values in this column must be unique non-negative integers that are not duplicated across files. If not specified, a SnorkelFlow ID column is generated. |
| `split` | `str` | `'train'` | Split of the dataset to add the data source to (`train`, `valid`, or `test`). |
| `datasource_ds` | `Optional[str]` | `None` | Datestamp of the data source in YYYY-MM-DD format. |
| `reader_kwargs` | `Optional[Dict[str, Any]]` | `None` | Dictionary of keyword arguments to pass to Dask read functions. |
| `credential_kwargs` | `Optional[Dict[str, Any]]` | `None` | Dictionary of credentials for specific data connectors. |
| `scheduler` | `Optional[str]` | `None` | Dask scheduler (`threads`, `client`, or `group`) to use. |
| `load_to_model_nodes` | `bool` | `False` | Whether to load the data source into all tasks in the dataset. |
| `sync` | `bool` | `True` | Whether to poll job status and block until complete. |
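
For example, `reader_kwargs` is forwarded to the underlying Dask read function, so standard parsing options can be passed through. A minimal sketch, where the dataset name, path, and UID column are hypothetical placeholders:

```python
import snorkelflow.client as sf

# Hypothetical dataset, MinIO path, and UID column. The reader_kwargs here
# are ordinary dask.dataframe.read_csv options (sep, dtype) that are
# forwarded when file_type is "csv".
sf.create_datasource(
    dataset="contracts_dataset",
    path="minio://my-bucket/contracts.csv",
    file_type="csv",
    uid_col="contract_uid",
    reader_kwargs={"sep": "\t", "dtype": {"contract_uid": "int64"}},
)
```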

Returns

UID of the created data source if sync mode is used; otherwise, the job ID.

Return type

Union[str, int]
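
A minimal usage sketch covering both return modes; the dataset name, paths, and UID column below are placeholders, and the setup assumes the SDK's usual `SnorkelFlowContext.from_kwargs()` connection pattern:

```python
import snorkelflow.client as sf

ctx = sf.SnorkelFlowContext.from_kwargs()  # connect to the Snorkel Flow instance

# Synchronous (default): blocks while the ingestion job is polled and
# returns the UID of the created data source.
datasource_uid = sf.create_datasource(
    dataset="contracts_dataset",
    path="s3://my-bucket/train.parquet",
    file_type="parquet",
    uid_col="contract_uid",
    split="train",
)

# Asynchronous: returns immediately with a job ID that can be used to
# monitor the ingestion job separately.
job_id = sf.create_datasource(
    dataset="contracts_dataset",
    path="s3://my-bucket/valid.parquet",
    file_type="parquet",
    uid_col="contract_uid",
    split="valid",
    sync=False,
)
```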