snorkelflow.ingest.dirtree_to_parquet
- snorkelflow.ingest.dirtree_to_parquet(dir_root, native=False, page_selector=None, docs_per_part=1000, parquet_name='generated', scheduler='threads')
Generate a partitioned parquet file from a directory of PDF documents.
NB: This function is now workspace-aware: the uploaded parquet file is added to a workspace-scoped directory under minio://workspace-#/. As a result, the parquet file is no longer co-located with an input directory that is not workspace-scoped.
Parameters
dir_root (str, required)
    Path to the directory containing PDF documents. Only MinIO paths are supported. Every file (recursively) inside the directory with a .pdf extension will be ingested. If native=False, every file xyz.pdf must have a corresponding hOCR file xyz.xml or xyz.hocr in the same directory.
native (bool, default False)
    Set to True when all documents are native PDFs. The function then does not look for a corresponding hOCR or XML file for each document.
page_selector (Optional[Dict[str, List[int]]], default None)
    Dictionary that selects pages of interest from each document. The keys are file names, and the values are lists of 1-indexed page numbers. For example, {"xyz": [4, 1, ...], "abc": [2], ...}
docs_per_part (int, default 1000)
    Number of documents to include in each partition of the parquet file.
parquet_name (str, default 'generated')
    The name of the generated parquet file. The file is created at dir_root/<parquet_name>.parquet, removing anything that previously existed there.
scheduler (str, default 'threads')
    [Advanced] Dask scheduler used to perform computations. Please contact Snorkel for help with setting this parameter to a non-default value.
Return type
None
Examples
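Before calling the function with native=False, the companion-file requirement described for dir_root can be checked ahead of time. The following is a minimal sketch, not part of snorkelflow: find_pdfs_missing_hocr is a hypothetical helper, and it assumes you already have a flat list of object paths under dir_root (how that listing is obtained from MinIO is not shown) and that extensions are matched case-sensitively.

```python
def find_pdfs_missing_hocr(paths):
    """Return the .pdf paths that have no same-directory .xml or .hocr file.

    Hypothetical pre-flight helper; `paths` is an already-obtained listing
    of object paths, and extension matching is assumed to be case-sensitive.
    """
    present = set(paths)
    missing = []
    for path in sorted(present):
        if not path.endswith(".pdf"):
            continue
        # Same directory and same base name, different extension.
        stem = path[: -len(".pdf")]
        if stem + ".xml" not in present and stem + ".hocr" not in present:
            missing.append(path)
    return missing


# Illustrative listing; these paths are made up for the example.
listing = [
    "docs/a.pdf", "docs/a.xml",
    "docs/sub/b.pdf", "docs/sub/b.hocr",
    "docs/sub/c.pdf",
]
missing = find_pdfs_missing_hocr(listing)
print(missing)
```

An empty result suggests every PDF in the listing has a companion file; any paths returned would need an hOCR file added, or the call would need native=True if the documents are native PDFs.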
>>> from snorkelflow.ingest import dirtree_to_parquet
>>> dirtree_to_parquet(
...     dir_root="minio://pdf-bucket/",
...     native=True,
...     parquet_name="data",
... )
Created parquet file minio://pdf-bucket/data.parquet
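The page_selector argument expects lists of 1-indexed page numbers, so an off-by-one slip (passing page 0) is easy to make. The sketch below builds such a dictionary and validates it first. build_page_selector is a hypothetical helper, not a snorkelflow function, and the keys follow the form used in the parameter description ("xyz", "abc"). The final call is left commented out because it needs a Snorkel Flow environment.

```python
def build_page_selector(pages_by_doc):
    """Build a Dict[str, List[int]] of 1-indexed pages, validating each entry.

    Hypothetical helper for illustration; not part of snorkelflow.
    """
    selector = {}
    for name, pages in pages_by_doc.items():
        page_list = list(pages)
        for page in page_list:
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(page, bool) or not isinstance(page, int) or page < 1:
                raise ValueError(
                    f"Page numbers are 1-indexed integers; got {page!r} for {name!r}"
                )
        selector[name] = page_list
    return selector


# Pages 1 to 3 of "xyz" and page 2 of "abc" (illustrative names).
page_selector = build_page_selector({"xyz": range(1, 4), "abc": [2]})
print(page_selector)

# Assumed usage inside a Snorkel Flow environment:
# from snorkelflow.ingest import dirtree_to_parquet
# dirtree_to_parquet(dir_root="minio://pdf-bucket/", page_selector=page_selector)
```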