Version: 0.93

snorkelflow.ingest.dirtree_to_parquet

snorkelflow.ingest.dirtree_to_parquet(dir_root, native=False, page_selector=None, docs_per_part=1000, parquet_name='generated', scheduler='threads')

Generate a partitioned parquet file from a directory of PDF documents.

Parameters:
  • dir_root (str) – Path to the directory containing PDF documents. Only MinIO paths are supported. Every file with a .pdf extension inside the directory (searched recursively) will be ingested. If native=False, every file xyz.pdf must have a corresponding hOCR file, xyz.xml or xyz.hocr, in the same directory.

  • native (bool, default: False) – Set this to True when all documents are native PDFs. The method then does not look for a corresponding hOCR or XML file for each document.

  • page_selector (Optional[Dict[str, List[int]]], default: None) – Dictionary that selects a few pages of interest from each document. The keys are file names, and the values are lists of (1-indexed) page numbers. For example, {"xyz": [4, 1, ...], "abc": [2], ...}

  • docs_per_part (int, default: 1000) – Number of documents to include in each partition of the parquet file.

  • parquet_name (str, default: 'generated') – The name of the generated parquet file. The file is created at dir_root/<parquet_name>.parquet, removing anything that previously existed there.

  • scheduler (str, default: 'threads') – [Advanced] Dask scheduler used to perform computations. Please contact Snorkel for help with setting this parameter to a non-default value.
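The dir_root requirement above (each xyz.pdf needs a sibling xyz.xml or xyz.hocr when native=False) can be checked before ingestion. The following is a minimal sketch, assuming the documents are staged in a local directory before being uploaded to MinIO. The helper name find_pdfs_missing_hocr is hypothetical and is not part of snorkelflow.

```python
from pathlib import Path
from typing import List


def find_pdfs_missing_hocr(local_root: str) -> List[Path]:
    """Return PDFs under local_root that have no sibling .xml or .hocr file.

    Hypothetical pre-flight helper, not part of snorkelflow. It assumes the
    directory tree is available on the local filesystem.
    """
    missing = []
    # rglob walks the tree recursively, mirroring the recursive ingestion.
    for pdf in sorted(Path(local_root).rglob("*.pdf")):
        # The companion must sit in the same directory and share the stem.
        has_companion = any(
            pdf.with_suffix(ext).is_file() for ext in (".xml", ".hocr")
        )
        if not has_companion:
            missing.append(pdf)
    return missing
```

An empty result suggests the tree satisfies the pairing requirement for native=False. Any paths returned are PDFs that would lack an hOCR companion.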

Return type:

None

Examples

>>> from snorkelflow.ingest import dirtree_to_parquet
>>> dirtree_to_parquet(
...     dir_root="minio://pdf-bucket/",
...     native=True,
...     parquet_name="data",
... )
Created parquet file minio://pdf-bucket/data.parquet
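
The page_selector parameter takes a dictionary of file names mapped to 1-indexed page numbers. The following is a minimal sketch of building one from full file names. It assumes that keys are file names without the .pdf extension, as the {"xyz": [4, 1, ...]} example in the parameter description suggests. The helper name build_page_selector is hypothetical and is not part of snorkelflow.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def build_page_selector(pages_by_file: Dict[str, Iterable[int]]) -> Dict[str, List[int]]:
    """Map file names to lists of 1-indexed page numbers.

    Hypothetical helper, not part of snorkelflow. Assumes that page_selector
    keys are file names with the directory and extension removed.
    """
    selector = {}
    for filename, pages in pages_by_file.items():
        page_list = list(pages)
        # Page numbers are 1-indexed, so anything below 1 is invalid.
        if any(page < 1 for page in page_list):
            raise ValueError(f"Page numbers must be >= 1 for {filename!r}")
        selector[Path(filename).stem] = page_list
    return selector


page_selector = build_page_selector({"xyz.pdf": [4, 1], "abc.pdf": [2]})

# The result could then be passed to the documented function, for example:
# dirtree_to_parquet(dir_root="minio://pdf-bucket/", page_selector=page_selector)
```

The call to dirtree_to_parquet is left as a comment because it requires a Snorkel Flow environment and a MinIO bucket.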