Version: 25.4

snorkelflow.ingest.dirtree_to_parquet

warning

This is a beta function. Beta features may have known gaps or bugs, but are functional workflows and eligible for Snorkel Support. To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.

snorkelflow.ingest.dirtree_to_parquet(dir_root, native=False, page_selector=None, docs_per_part=1000, parquet_name='generated', scheduler='threads')

Generate a partitioned parquet file from a directory of PDF documents.

NB: This function is now workspace-aware: the generated parquet file is uploaded to a workspace-scoped directory under minio://workspace-#/. As a result, the parquet file is no longer co-located with an input directory that is not workspace-scoped.

Return type: None

Parameters

Name	Type	Default	Info
dir_root	`str`		Path to the directory containing PDF documents. Only MinIO paths are supported. Every file (recursively) inside the directory with a `.pdf` extension will be ingested. If `native=False`, every file `xyz.pdf` must have a corresponding hOCR file `xyz.xml` or `xyz.hocr` in the same directory.
native	`bool`	`False`	Set to `True` when all documents are native PDFs. In that case, the function does not look for a corresponding hOCR or XML file for each document.
page_selector	`Optional[Dict[str, List[int]]]`	`None`	Dictionary selecting pages of interest from each document. The keys are file names, and the values are lists of (1-indexed) page numbers. For example, `{"xyz": [4, 1, ...], "abc": [2], ...}`.
docs_per_part	`int`	`1000`	Number of documents to include in each partition of the parquet file.
parquet_name	`str`	`'generated'`	The name of the generated parquet file. The file is created at `dir_root/<parquet_name>.parquet`, removing anything else that existed there previously.
scheduler	`str`	`'threads'`	[Advanced] Dask scheduler used to perform computations. Please contact Snorkel for help with setting this parameter to a non-default value.
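
The pairing rule for `native=False` (each `xyz.pdf` needs an `xyz.xml` or `xyz.hocr` beside it) can be checked before the documents are uploaded. The following is a minimal sketch, not part of snorkelflow: it assumes a local copy of the directory (the function itself accepts only MinIO paths), and `find_pdfs_missing_hocr` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def find_pdfs_missing_hocr(local_root: str) -> List[Path]:
    """Return PDFs under local_root that lack a sibling .xml or .hocr file.

    Hypothetical helper, not a snorkelflow API. It inspects a local copy of
    the directory and searches for the lowercase `*.pdf` pattern, which is an
    assumption about how the ingester matches file extensions.
    """
    missing = []
    # rglob walks the directory recursively, mirroring the recursive ingest.
    for pdf in sorted(Path(local_root).rglob("*.pdf")):
        if not pdf.is_file():
            continue
        # Look for xyz.xml or xyz.hocr next to xyz.pdf.
        has_hocr = any(pdf.with_suffix(ext).is_file() for ext in (".xml", ".hocr"))
        if not has_hocr:
            missing.append(pdf)
    return missing
```

An empty result suggests that every PDF has its hOCR companion; a non-empty result lists the files to fix, or indicates that `native=True` may be the intended setting.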

Examples

>>> from snorkelflow.ingest import dirtree_to_parquet
>>> dirtree_to_parquet(
...     dir_root="minio://pdf-bucket/",
...     native=True,
...     parquet_name="data",
... )
Created parquet file minio://pdf-bucket/data.parquet
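
To restrict ingestion to specific pages, a `page_selector` dictionary can be assembled and validated beforehand. A minimal sketch follows; `build_page_selector` is a hypothetical helper, not a snorkelflow function, and the use of extension-less file names as keys is an assumption based on the `{"xyz": [...]}` example in the parameter table.

```python
from typing import Dict, Iterable, List


def build_page_selector(pages_by_file: Dict[str, Iterable[int]]) -> Dict[str, List[int]]:
    """Validate and normalize page selections (hypothetical helper).

    Page numbers are 1-indexed, so any value below 1 is rejected.
    The order of pages given by the caller is preserved.
    """
    selector: Dict[str, List[int]] = {}
    for name, pages in pages_by_file.items():
        page_list = list(pages)
        bad = [p for p in page_list if not isinstance(p, int) or p < 1]
        if bad:
            raise ValueError(f"{name}: page numbers must be integers >= 1, got {bad}")
        selector[name] = page_list
    return selector


# Keys without the .pdf extension are an assumption (see the lead-in).
page_selector = build_page_selector({"xyz": [4, 1], "abc": range(2, 3)})

# The result would then be passed to the call shown above, for example:
# dirtree_to_parquet(dir_root="minio://pdf-bucket/", native=True,
#                    page_selector=page_selector, parquet_name="data")
```

Validating up front surfaces a 0-indexed page list as a local `ValueError` before any ingestion is started.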
