snorkelflow.client.datasources.prep_and_ingest_datasource
- snorkelflow.client.datasources.prep_and_ingest_datasource(dataset, paths, input_type, split, run_datasource_checks=True)
Create a data source.
- Parameters:
dataset (Union[str, int]) – Name or UID of the dataset to create the data source in
paths (List[str]) – List of paths to the data source (e.g. MinIO, S3)
input_type (str) – Type of the files in the folder to be processed. The supported types are pdf, image, and hocr
split (str) – Split of the dataset to add the data source to (train, valid, or test)
run_datasource_checks (bool, default: True) – Whether to run data source checks before ingestion
- Returns:
UID of the created data source if sync mode is used
- Return type:
Optional[int]
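Because the return type is Optional[int], callers may want to handle the case where no UID comes back. The following is a minimal sketch, assuming that None means no UID was returned (for example, when sync mode is not used). The helper `describe_ingest_result` is hypothetical and not part of snorkelflow.

```python
from typing import Optional


def describe_ingest_result(datasource_uid: Optional[int]) -> str:
    """Turn the Optional[int] result into a human-readable status line.

    Hypothetical helper for illustration; not part of the snorkelflow SDK.
    """
    if datasource_uid is None:
        # Assumption: None means no UID was returned (e.g. sync mode not used).
        return "No data source UID returned"
    return f"Created data source with UID {datasource_uid}"
```

Comparing against None explicitly, rather than relying on truthiness, keeps the check correct for any integer UID.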
Examples
>>> sf.prep_and_ingest_datasource(
...     dataset="test-dataset",
...     paths=["minio://pdf-bucket/"],
...     input_type="pdf",
...     split="train",
... )
1
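It can be useful to check the arguments locally before making a call like the one above. The sketch below is one possible way to do that, using only the supported values listed in the parameter descriptions (pdf, image, hocr for input_type; train, valid, test for split). The helper `build_ingest_kwargs` and the two constant names are hypothetical and not part of snorkelflow; the SDK may perform its own validation.

```python
from typing import Any, Dict, List, Union

# Values taken from the parameter descriptions in this reference page.
SUPPORTED_INPUT_TYPES = {"pdf", "image", "hocr"}
SUPPORTED_SPLITS = {"train", "valid", "test"}


def build_ingest_kwargs(
    dataset: Union[str, int],
    paths: List[str],
    input_type: str,
    split: str,
    run_datasource_checks: bool = True,
) -> Dict[str, Any]:
    """Validate arguments and return them as keyword arguments.

    Hypothetical helper for illustration; not part of the snorkelflow SDK.
    """
    if not paths:
        raise ValueError("paths must contain at least one path")
    if input_type not in SUPPORTED_INPUT_TYPES:
        raise ValueError(f"Unsupported input_type: {input_type!r}")
    if split not in SUPPORTED_SPLITS:
        raise ValueError(f"Unsupported split: {split!r}")
    return {
        "dataset": dataset,
        "paths": list(paths),
        "input_type": input_type,
        "split": split,
        "run_datasource_checks": run_datasource_checks,
    }


kwargs = build_ingest_kwargs(
    dataset="test-dataset",
    paths=["minio://pdf-bucket/"],
    input_type="pdf",
    split="train",
)
# With a configured client imported as `sf` (as in the example above),
# the validated arguments could then be passed along:
# sf.prep_and_ingest_datasource(**kwargs)
```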