
snorkelflow.client.datasources.create_datasource

snorkelflow.client.datasources.create_datasource(dataset, path, file_type, uid_col=None, split='train', datasource_ds=None, reader_kwargs=None, credential_kwargs=None, scheduler=None, load_to_model_nodes=False, sync=True)

Create a data source.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| `dataset` | `Union[str, int]` | | Name or UID of the dataset to create the data source in. |
| `path` | `str` | | Path to the data source (e.g. minio, s3, http). |
| `file_type` | `str` | | File type (`csv` or `parquet`). |
| `uid_col` | `Optional[str]` | `None` | Name of the UID column in the data source. The values in this column must be unique non-negative integers that are not duplicated across files. If not specified, a SnorkelFlow ID column is generated. |
| `split` | `str` | `'train'` | Split of the dataset to add the data source to (`train`, `valid`, or `test`). |
| `datasource_ds` | `Optional[str]` | `None` | Datestamp of the data source in YYYY-MM-DD format. |
| `reader_kwargs` | `Optional[Dict[str, Any]]` | `None` | Dictionary of keyword arguments to pass to Dask read functions. |
| `credential_kwargs` | `Optional[Dict[str, Any]]` | `None` | Dictionary of credentials for specific data connectors. |
| `scheduler` | `Optional[str]` | `None` | Dask scheduler (`threads`, `client`, or `group`) to use. |
| `load_to_model_nodes` | `bool` | `False` | Whether to load the data source into all tasks in the dataset. |
| `sync` | `bool` | `True` | Whether to poll job status and block until complete. |
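
For example, `reader_kwargs` is forwarded to the underlying Dask read function, so standard parsing options can be passed through. A minimal sketch, where the dataset name, path, and UID column are hypothetical placeholders:

```python
import snorkelflow.client as sf

# Hypothetical dataset, MinIO path, and UID column. The reader_kwargs here
# are ordinary dask.dataframe.read_csv options (sep, dtype) that are
# forwarded when file_type is "csv".
sf.create_datasource(
    dataset="contracts_dataset",
    path="minio://my-bucket/contracts.csv",
    file_type="csv",
    uid_col="contract_uid",
    reader_kwargs={"sep": "\t", "dtype": {"contract_uid": "int64"}},
)
```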

Returns

UID of the created data source if sync mode is used; otherwise, the job ID.

Return type

Union[str, int]
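
A minimal usage sketch covering both return modes; the dataset name, paths, and UID column below are placeholders, and the setup assumes the SDK's usual `SnorkelFlowContext.from_kwargs()` connection pattern:

```python
import snorkelflow.client as sf

ctx = sf.SnorkelFlowContext.from_kwargs()  # connect to the Snorkel Flow instance

# Synchronous (default): blocks while the ingestion job is polled and
# returns the UID of the created data source.
datasource_uid = sf.create_datasource(
    dataset="contracts_dataset",
    path="s3://my-bucket/train.parquet",
    file_type="parquet",
    uid_col="contract_uid",
    split="train",
)

# Asynchronous: returns immediately with a job ID that can be used to
# monitor the ingestion job separately.
job_id = sf.create_datasource(
    dataset="contracts_dataset",
    path="s3://my-bucket/valid.parquet",
    file_type="parquet",
    uid_col="contract_uid",
    split="valid",
    sync=False,
)
```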