Version: 25.4

snorkelflow.ingest.dirtree_to_parquet

warning

This is a beta function in 25.4. Beta features may have known gaps or bugs, but are functional workflows and eligible for Snorkel Support. To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.

snorkelflow.ingest.dirtree_to_parquet(dir_root, native=False, page_selector=None, docs_per_part=1000, parquet_name='generated', scheduler='threads')

Generate a partitioned parquet file from a directory of PDF documents.

NB: This function is now workspace-aware, which means the uploaded parquet file will be added to a minio://workspace-#/ workspace-scoped directory. Notably, this means the parquet file will no longer be co-located with a non-workspace-scoped input directory.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| dir_root | str | | Path to the directory containing PDF documents. Only MinIO paths are supported. Every file inside the directory (searched recursively) with a .pdf extension is ingested. If native=False, every file xyz.pdf must have a corresponding hOCR file xyz.xml or xyz.hocr in the same directory. |
| native | bool | False | Set to True when all documents are native PDFs. The function then does not look for a corresponding hOCR or XML file for each document. |
| page_selector | Optional[Dict[str, List[int]]] | None | Dictionary that selects the pages of interest from each document. The keys are file names, and the values are lists of (1-indexed) page numbers. For example, {"xyz": [4, 1, ...], "abc": [2], ...} |
| docs_per_part | int | 1000 | Number of documents to include in each partition of the parquet file. |
| parquet_name | str | 'generated' | The name of the generated parquet file. The file is created at dir_root/<parquet_name>.parquet, removing anything that previously existed there. |
| scheduler | str | 'threads' | [Advanced] Dask scheduler used to perform computations. Contact Snorkel for help with setting this parameter to a non-default value. |
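Because native=False requires a hOCR companion (xyz.xml or xyz.hocr) next to every xyz.pdf, it can help to check a directory before ingesting it. The following is a minimal sketch, not part of the snorkelflow SDK: the helper name find_pdfs_missing_hocr is hypothetical, and it assumes you run it on a local copy of the documents before uploading them to MinIO, since dir_root itself only accepts MinIO paths.

```python
from pathlib import Path
from typing import List


def find_pdfs_missing_hocr(local_root: str) -> List[str]:
    """Return PDFs under local_root that have no sibling .xml or .hocr file.

    Hypothetical pre-flight helper: it inspects a local directory tree,
    not a MinIO path, and mirrors the companion-file rule described for
    dir_root when native=False.
    """
    missing = []
    # rglob walks the tree recursively, matching the recursive ingestion rule.
    for pdf in sorted(Path(local_root).rglob("*.pdf")):
        # A companion has the same base name and lives in the same directory.
        has_companion = any(
            pdf.with_suffix(ext).exists() for ext in (".xml", ".hocr")
        )
        if not has_companion:
            missing.append(str(pdf.relative_to(local_root)))
    return missing
```

An empty list means every PDF has a companion file. Any paths returned are relative to local_root and identify documents that would need either a hOCR file or ingestion with native=True.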

Return type

None

Examples

>>> from snorkelflow.ingest import dirtree_to_parquet
>>> dirtree_to_parquet(
...     dir_root="minio://pdf-bucket/",
...     native=True,
...     parquet_name="data",
... )
Created parquet file minio://pdf-bucket/data.parquet
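The page_selector parameter can be combined with the call above. The sketch below is illustrative only: validate_page_selector and ingest_selected_pages are hypothetical helpers, not SDK functions, and the validation simply enforces the documented rule that page numbers are 1-indexed. The keys "xyz" and "abc" are taken from the parameter description. Whether keys should include the .pdf extension is not stated there, so the sketch passes them through unchanged.

```python
from typing import Dict, List


def validate_page_selector(
    page_selector: Dict[str, List[int]],
) -> Dict[str, List[int]]:
    """Raise ValueError unless every page number is an integer >= 1.

    Hypothetical helper: it only checks the 1-indexed rule and returns
    the selector unchanged so it can be used inline.
    """
    for name, pages in page_selector.items():
        for page in pages:
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(page, bool) or not isinstance(page, int) or page < 1:
                raise ValueError(
                    f"{name!r}: page numbers are 1-indexed integers, got {page!r}"
                )
    return page_selector


def ingest_selected_pages(
    dir_root: str, page_selector: Dict[str, List[int]]
) -> None:
    """Ingest only the selected pages of native PDFs under dir_root."""
    # Imported lazily so the validator works without the SDK installed.
    from snorkelflow.ingest import dirtree_to_parquet

    dirtree_to_parquet(
        dir_root=dir_root,
        native=True,
        page_selector=validate_page_selector(page_selector),
        parquet_name="data",
    )
```

For example, ingest_selected_pages("minio://pdf-bucket/", {"xyz": [4, 1], "abc": [2]}) would request pages 4 and 1 of xyz and page 2 of abc, and a selector containing page 0 fails before any ingestion starts.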