Version: 0.91

snorkelflow.client.datasources.prep_and_ingest_datasource

snorkelflow.client.datasources.prep_and_ingest_datasource(dataset, paths, input_type, split, run_datasource_checks=True)

Create a data source.

Parameters:
  • dataset (Union[str, int]) – Name or UID of the dataset to create the data source in

  • paths (List[str]) – List of paths to the files that make up the data source (e.g. MinIO or S3 paths such as minio://pdf-bucket/)

  • input_type (str) – Type of the files in the folder to be processed. The supported types are pdf, image, and hocr

  • split (str) – Split of the dataset to add the data source to (train, valid, or test)

  • run_datasource_checks (bool, default: True) – Whether to run data source checks before ingestion
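The documented value sets for input_type and split can be checked on the client before any request is sent. The following is a minimal sketch, not part of the snorkelflow SDK: the helper name validate_ingest_args and the two constants are hypothetical, and the allowed values are copied from the parameter descriptions above.

```python
from typing import List

# Allowed values, copied from the parameter descriptions (assumption: the list is exhaustive).
SUPPORTED_INPUT_TYPES = {"pdf", "image", "hocr"}
SUPPORTED_SPLITS = {"train", "valid", "test"}


def validate_ingest_args(paths: List[str], input_type: str, split: str) -> None:
    """Hypothetical pre-flight check; raises ValueError on a bad argument."""
    if not paths:
        raise ValueError("paths must contain at least one path")
    if input_type not in SUPPORTED_INPUT_TYPES:
        raise ValueError(
            f"unsupported input_type {input_type!r}; "
            f"expected one of {sorted(SUPPORTED_INPUT_TYPES)}"
        )
    if split not in SUPPORTED_SPLITS:
        raise ValueError(
            f"unsupported split {split!r}; expected one of {sorted(SUPPORTED_SPLITS)}"
        )
```

Calling validate_ingest_args(["minio://pdf-bucket/"], "pdf", "train") returns without error, while an input_type such as "docx" raises ValueError before any ingestion is attempted.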

Returns:

UID of the created data source if sync mode is used

Return type:

Optional[int]
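Because the return type is Optional[int], callers may want to handle a missing UID explicitly. A minimal sketch, assuming None is what comes back when no UID is available; the helper describe_ingest_result is hypothetical and not part of the SDK:

```python
from typing import Optional


def describe_ingest_result(datasource_uid: Optional[int]) -> str:
    """Hypothetical helper that turns the return value into a status message."""
    # Compare against None explicitly so that a UID of 0 is not treated as missing.
    if datasource_uid is None:
        return "No data source UID was returned."
    return f"Created data source with UID {datasource_uid}."
```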

Examples

>>> import snorkelflow.client as sf
>>> sf.prep_and_ingest_datasource(
...     dataset="test-dataset",
...     paths=["minio://pdf-bucket/"],
...     input_type="pdf",
...     split="train",
... )
1