snorkelflow.client.datasources.prep_and_ingest_datasource
- snorkelflow.client.datasources.prep_and_ingest_datasource(dataset, paths, input_type, split, run_datasource_checks=True)
Create a data source.
Parameters
dataset (Union[str, int]): Name or UID of the dataset to create the data source in.
paths (List[str]): List of paths to the data source (e.g. MinIO, S3).
input_type (str): Type of the files in the folder to be processed. The supported types are pdf, image, and hocr.
split (str): Split of the dataset to add the data source to (train, valid, or test).
run_datasource_checks (bool, default True): Whether to run data source checks before ingestion. Defaults to True.
Returns
UID of the created data source if sync mode is used
Return type
Optional[int]
Examples
>>> sf.prep_and_ingest_datasource(
...     dataset="test-dataset",
...     paths=["minio://pdf-bucket/"],
...     input_type="pdf",
...     split="train",
... )
1
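As a minimal sketch of a second call pattern, assuming the same sf client handle as above (the minio://image-bucket/ path is illustrative only), the same function can ingest image files into the valid split with the pre-ingestion checks disabled, binding the returned UID (sync mode) to a variable:
>>> ds_uid = sf.prep_and_ingest_datasource(
...     dataset="test-dataset",
...     paths=["minio://image-bucket/"],  # hypothetical bucket path
...     input_type="image",
...     split="valid",
...     run_datasource_checks=False,
... )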