Version: 25.4

snorkelflow.client.datasources.create_datasource

warning

This is a beta function in 25.4. Beta features may have known gaps or bugs but are functional workflows and are eligible for Snorkel Support. To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.

snorkelflow.client.datasources.create_datasource(dataset, path, file_type, uid_col=None, split='train', datasource_ds=None, reader_kwargs=None, credential_kwargs=None, scheduler=None, load_to_model_nodes=False, sync=True)

Create a data source.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| `dataset` | `Union[str, int]` | | Name or UID of the dataset to create the data source in. |
| `path` | `str` | | Path to the data source (e.g. minio, s3, http). |
| `file_type` | `str` | | File type (`csv` or `parquet`). |
| `uid_col` | `Optional[str]` | `None` | Name of the UID column in the data source. The values in this column must be unique non-negative integers that are not duplicated across files. If not specified, a SnorkelFlow ID column is generated. |
| `split` | `str` | `'train'` | Split of the dataset to add the data source to (`train`, `valid`, or `test`). |
| `datasource_ds` | `Optional[str]` | `None` | Datestamp of the data source in YYYY-MM-DD format. |
| `reader_kwargs` | `Optional[Dict[str, Any]]` | `None` | Dictionary of keyword arguments to pass to Dask read functions. |
| `credential_kwargs` | `Optional[Dict[str, Any]]` | `None` | Dictionary of credentials for specific data connectors. |
| `scheduler` | `Optional[str]` | `None` | Dask scheduler (`threads`, `client`, or `group`) to use. |
| `load_to_model_nodes` | `bool` | `False` | If True, load the data source into all tasks in the dataset. |
| `sync` | `bool` | `True` | If True, poll the job status and block until the job completes. |

Returns

UID of the created data source if sync mode is used; otherwise, the job ID.

Return type

Union[str, int]
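A minimal usage sketch, assuming a configured Snorkel Flow client environment. The dataset name, file path, and UID column name below are hypothetical placeholders, not values from this reference:

```python
import snorkelflow.client.datasources as datasources

# Hypothetical example: register a CSV stored in MinIO as a
# train-split data source for a dataset named "contracts".
datasource_uid = datasources.create_datasource(
    dataset="contracts",                       # dataset name or UID (hypothetical)
    path="minio://data/contracts_train.csv",   # hypothetical path
    file_type="csv",
    uid_col="doc_uid",   # column of unique non-negative integers (hypothetical)
    split="train",
    sync=True,           # block until the load job completes
)

# With sync=False, the call instead returns a job ID that can be
# polled for completion.
```

Because `sync=True` here, the return value is the UID of the created data source rather than a job ID.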