snorkelflow.ingest.dirtree_to_parquet
- snorkelflow.ingest.dirtree_to_parquet(dir_root, native=False, page_selector=None, docs_per_part=1000, parquet_name='generated', scheduler='threads')
Generate a partitioned parquet file from a directory of PDF documents.
NB: This function is now workspace-aware, which means the uploaded parquet file will be added to a minio://workspace-#/ workspace-scoped directory. Notably, this means the parquet file will no longer be co-located with a non-workspace-scoped input directory.
- Parameters:
dir_root (str) – Path to the directory containing PDF documents. Only MinIO paths are supported. Every file (recursively) inside the directory with a .pdf extension will be ingested. For every file xyz.pdf, there must be a corresponding hOCR file xyz.xml or xyz.hocr in the same directory if native=False.
native (bool, default: False) – Set this to True when all documents are native PDFs. The function then does not look for corresponding hOCR or XML files for each document.
page_selector (Optional[Dict[str, List[int]]], default: None) – Dictionary to select a few pages of interest from each document. The keys are file names, and the values are lists of (1-indexed) page numbers. For example, {"xyz": [4, 1, ...], "abc": [2], ...}
docs_per_part (int, default: 1000) – Number of documents to include in each partition of the parquet file.
parquet_name (str, default: 'generated') – The name of the generated parquet file. The file is created at dir_root/<parquet_name>.parquet, removing anything else that existed there previously.
scheduler (str, default: 'threads') – [Advanced] Dask scheduler used to perform computations. Please contact Snorkel for help with setting this parameter to a non-default value.
- Return type:
None
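With native=False, every xyz.pdf needs a sibling xyz.xml or xyz.hocr, so it can help to check for missing companions before ingesting. The following is a minimal sketch, not part of snorkelflow: pdfs_missing_hocr is a hypothetical helper that works on a plain list of object keys, and how you list the keys under dir_root is left to your MinIO client. It also assumes the .pdf extension match is case-sensitive, which the documentation does not state.

```python
import posixpath


def pdfs_missing_hocr(object_keys):
    """Return the .pdf keys that have no sibling .xml or .hocr key.

    Hypothetical pre-flight helper; it only inspects key names and
    never contacts MinIO or snorkelflow.
    """
    keys = set(object_keys)
    missing = []
    for key in sorted(keys):
        stem, ext = posixpath.splitext(key)
        if ext != ".pdf":  # assumption: extension match is case-sensitive
            continue
        # A companion must share the directory and the base name.
        if stem + ".xml" not in keys and stem + ".hocr" not in keys:
            missing.append(key)
    return missing


keys = [
    "docs/a.pdf", "docs/a.hocr",
    "docs/sub/b.pdf", "docs/sub/b.xml",
    "docs/c.pdf",
]
print(pdfs_missing_hocr(keys))  # ['docs/c.pdf']
```

An empty result suggests the directory is ready for native=False; any listed key needs an hOCR file, or the run should use native=True if all documents are native PDFs.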
Examples
>>> from snorkelflow.ingest import dirtree_to_parquet
>>> dirtree_to_parquet(
...     dir_root="minio://pdf-bucket/",
...     native=True,
...     parquet_name="data",
... )
Created parquet file minio://pdf-bucket/data.parquet
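To ingest only selected pages, pass a page_selector dictionary. The sketch below shows one way to build and validate it; build_page_selector is a hypothetical helper, not part of snorkelflow. It assumes the keys are file names without the .pdf extension, as the "xyz" key in the parameter description suggests, and it rejects page numbers below 1 because pages are 1-indexed.

```python
def build_page_selector(pages_by_file):
    """Build a page_selector dict from {file name: [page numbers]}.

    Hypothetical helper: strips a trailing ".pdf" from each key
    (an assumption based on the "xyz" example key) and checks that
    every page number is a 1-indexed integer.
    """
    selector = {}
    for name, pages in pages_by_file.items():
        key = name[:-4] if name.endswith(".pdf") else name
        for page in pages:
            if not isinstance(page, int) or page < 1:
                raise ValueError(f"invalid page {page!r} for {name!r}")
        selector[key] = list(pages)  # keep the caller's page order
    return selector


selector = build_page_selector({"xyz.pdf": [4, 1], "abc": [2]})
print(selector)  # {'xyz': [4, 1], 'abc': [2]}

# The selector would then be passed to the documented function, e.g.:
# from snorkelflow.ingest import dirtree_to_parquet
# dirtree_to_parquet(
#     dir_root="minio://pdf-bucket/",
#     native=True,
#     page_selector=selector,
#     parquet_name="data",
# )
```

The call itself is left commented out because it requires a Snorkel Flow environment with access to the MinIO bucket.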