Version: 0.93

snorkelflow.ingest.dirtree_to_parquet

snorkelflow.ingest.dirtree_to_parquet(dir_root, native=False, page_selector=None, docs_per_part=1000, parquet_name='generated', scheduler='threads')

Generate a partitioned parquet file from a directory of PDF documents.

Return type: None

Parameters

Name	Type	Default	Info
dir_root	`str`		Path to the directory containing PDF documents. Only MinIO paths are supported. Every file with a `.pdf` extension inside the directory (searched recursively) is ingested. If `native=False`, every file `xyz.pdf` must have a corresponding hOCR file `xyz.xml` or `xyz.hocr` in the same directory.
native	`bool`	`False`	Set to `True` when all documents are native PDFs. The method then does not look for corresponding hOCR or XML files for each document.
page_selector	`Optional[Dict[str, List[int]]]`	`None`	Dictionary that selects pages of interest from each document. The keys are file names, and the values are lists of (1-indexed) page numbers. For example, `{"xyz": [4, 1, ...], "abc": [2], ...}`
docs_per_part	`int`	`1000`	Number of documents to include in each partition of the parquet file.
parquet_name	`str`	`'generated'`	The name of the generated parquet file. The file is created at `dir_root/<parquet_name>.parquet`, replacing anything that previously existed there.
scheduler	`str`	`'threads'`	[Advanced] Dask scheduler used to perform computations. Please contact Snorkel for help with setting this parameter to a non-default value.
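
The `dir_root` requirement for non-native PDFs (each `xyz.pdf` needs a sibling `xyz.xml` or `xyz.hocr`) can be checked before ingestion. The sketch below is not part of `snorkelflow`; it assumes the documents are first staged in a local directory that mirrors what will be uploaded to MinIO, and the helper name `pdfs_missing_hocr` is hypothetical.

```python
from pathlib import Path
from typing import List


def pdfs_missing_hocr(local_root: str) -> List[str]:
    """Return relative paths of PDFs that have no sibling .xml or .hocr file.

    Hypothetical pre-flight helper: it only inspects a local copy of the
    directory tree and does not talk to MinIO or to snorkelflow.
    """
    root = Path(local_root)
    missing = []
    # Walk the tree recursively, mirroring the recursive .pdf discovery
    # described for dir_root.
    for pdf in sorted(root.rglob("*.pdf")):
        # A companion must share the PDF's stem and live in the same directory.
        has_companion = any(
            pdf.with_suffix(ext).is_file() for ext in (".xml", ".hocr")
        )
        if not has_companion:
            missing.append(pdf.relative_to(root).as_posix())
    return missing
```

An empty result suggests the tree is ready for `native=False`; any listed file would need an hOCR companion, or the whole batch would need to consist of native PDFs ingested with `native=True`.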

Examples

>>> from snorkelflow.ingest import dirtree_to_parquet
>>> dirtree_to_parquet(
...     dir_root="minio://pdf-bucket/",
...     native=True,
...     parquet_name="data",
... )
Created parquet file minio://pdf-bucket/data.parquet
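
To restrict ingestion to specific pages, a `page_selector` dictionary has to be built first. The following is a minimal sketch of one way to do that; `build_page_selector` is a hypothetical helper, not a `snorkelflow` function. It assumes, based on the `{"xyz": [4, 1, ...]}` example in the parameter table, that keys are file names without the directory or the `.pdf` extension.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def build_page_selector(
    pages_by_file: Dict[str, Iterable[int]]
) -> Dict[str, List[int]]:
    """Build a page_selector-style dict from file paths and page numbers.

    Hypothetical helper. Assumption: keys are bare file names, so any
    directory prefix and the file extension are stripped.
    """
    selector: Dict[str, List[int]] = {}
    for file_name, pages in pages_by_file.items():
        page_list = list(pages)
        for page in page_list:
            # Page numbers are documented as 1-indexed, so reject 0 and below.
            if not isinstance(page, int) or page < 1:
                raise ValueError(
                    f"Invalid page {page!r} for {file_name!r}: "
                    "pages must be integers >= 1"
                )
        key = PurePosixPath(file_name).stem
        selector[key] = page_list  # the caller's page order is preserved
    return selector


selector = build_page_selector({"reports/xyz.pdf": [4, 1], "abc.pdf": [2]})
print(selector)
```

The resulting dictionary could then be passed as `page_selector=selector` in a `dirtree_to_parquet` call like the one above.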
