Version: 0.91

snorkelflow.client.datasources.prep_and_ingest_datasource

snorkelflow.client.datasources.prep_and_ingest_datasource(dataset, paths, input_type, split, run_datasource_checks=True)

Create a data source.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| dataset | Union[str, int] | | Name or UID of the dataset to create the data source in. |
| paths | List[str] | | List of paths to the data source (e.g. MinIO, S3). |
| input_type | str | | Type of the files in the folder to be processed. The supported types are pdf, image, and hocr. |
| split | str | | Split of the dataset to add the data source to (train, valid, or test). |
| run_datasource_checks | bool | True | Whether to run data source checks before ingestion. |
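The documented constraints on input_type and split can be checked locally before any call is made. The following is a minimal sketch, not part of the SDK: the helper name build_ingest_kwargs and the two constant sets are hypothetical, and the non-empty check on paths is an assumption rather than a documented rule.

```python
from typing import Any, Dict, List, Union

# Values taken from the parameter descriptions above.
SUPPORTED_INPUT_TYPES = {"pdf", "image", "hocr"}
SUPPORTED_SPLITS = {"train", "valid", "test"}


def build_ingest_kwargs(
    dataset: Union[str, int],
    paths: List[str],
    input_type: str,
    split: str,
    run_datasource_checks: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and return them as a kwargs dict."""
    if not paths:
        # Assumption: an empty path list is never useful, so reject it early.
        raise ValueError("paths must contain at least one entry")
    if input_type not in SUPPORTED_INPUT_TYPES:
        raise ValueError(
            f"input_type must be one of {sorted(SUPPORTED_INPUT_TYPES)}, got {input_type!r}"
        )
    if split not in SUPPORTED_SPLITS:
        raise ValueError(
            f"split must be one of {sorted(SUPPORTED_SPLITS)}, got {split!r}"
        )
    return {
        "dataset": dataset,
        "paths": paths,
        "input_type": input_type,
        "split": split,
        "run_datasource_checks": run_datasource_checks,
    }


kwargs = build_ingest_kwargs(
    dataset="test-dataset",
    paths=["minio://pdf-bucket/"],
    input_type="pdf",
    split="train",
)
# With the SDK imported as sf, the call would then be:
# sf.prep_and_ingest_datasource(**kwargs)
```

Validating up front surfaces a misspelled split or unsupported file type before any request reaches the platform.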

Returns

UID of the created data source if sync mode is used.

Return type

Optional[int]
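Because the return type is Optional[int], callers should handle the case where no UID comes back. The sketch below assumes that a missing UID is represented as None, which the return type suggests but the description does not state outright; describe_ingest_result is a hypothetical helper, not an SDK function.

```python
from typing import Optional


def describe_ingest_result(uid: Optional[int]) -> str:
    """Hypothetical helper: turn the return value into a readable status line."""
    if uid is None:
        # Assumption: None means the call did not run in sync mode,
        # so no data source UID is available yet.
        return "No data source UID returned"
    return f"Created data source with UID {uid}"


print(describe_ingest_result(1))
print(describe_ingest_result(None))
```

Comparing against None explicitly, rather than relying on truthiness, keeps any valid integer UID from being misread as a missing value.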

Examples

>>> sf.prep_and_ingest_datasource(
...     dataset="test-dataset",
...     paths=["minio://pdf-bucket/"],
...     input_type="pdf",
...     split="train",
... )
1