operators.pdf.checkbox.CheckboxFeaturizer
- class operators.pdf.checkbox.CheckboxFeaturizer(pdf_url_field='rich_doc_pdf_url', min_box_length_px=25, max_box_length_px=55, px_threshold_ratio=0.1, num_pages_per_batch=100, pages_field=None)
A featurizer that identifies checkboxes in PDF documents.
Parameters
| Name | Type | Default | Info |
| --- | --- | --- | --- |
| pdf_url_field | str | 'rich_doc_pdf_url' | The name of the column containing the PDF file paths. |
| min_box_length_px | int | 25 | The minimum length of a checkbox, in pixels. |
| max_box_length_px | int | 55 | The maximum length of a checkbox, in pixels. |
| px_threshold_ratio | float | 0.1 | The threshold ratio of non-empty pixels inside a checkbox for it to be considered checked. |
| num_pages_per_batch | int | 100 | The number of pages to process in each batch. |
| pages_field | Optional[str] | None | The name of the column containing the page numbers on which to run the operator. If None, the operator runs on all pages. |
Returns
A Checkbox object containing the checkbox info.
Return type
{RichDocCols.CHECKBOXES}
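To make the size bounds and px_threshold_ratio described in the parameters more concrete, here is a minimal standalone sketch of how such a decision could work. It is not the library's implementation. The helper names (non_empty_ratio, is_candidate_size, is_checked), the inclusive size bounds, the strict "greater than" comparison, and the representation of a box as a binarized pixel grid are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[int]]  # hypothetical binarized crop: 0 = empty pixel, 1 = non-empty pixel


def non_empty_ratio(box: Grid) -> float:
    """Return the fraction of non-empty pixels inside a box crop."""
    total = sum(len(row) for row in box)
    if total == 0:
        return 0.0
    filled = sum(1 for row in box for px in row if px)
    return filled / total


def is_candidate_size(width_px: int, height_px: int,
                      min_box_length_px: int = 25,
                      max_box_length_px: int = 55) -> bool:
    """Assumption: both sides must fall within the inclusive length bounds."""
    return all(min_box_length_px <= side <= max_box_length_px
               for side in (width_px, height_px))


def is_checked(box: Grid, px_threshold_ratio: float = 0.1) -> bool:
    """Assumption: a box counts as checked when its ratio exceeds the threshold."""
    return non_empty_ratio(box) > px_threshold_ratio


# A 30x30 empty box, and a copy with a 10x10 filled mark in the middle.
empty_box: Grid = [[0] * 30 for _ in range(30)]
marked_box: Grid = [row[:] for row in empty_box]
for r in range(10, 20):
    for c in range(10, 20):
        marked_box[r][c] = 1

print(is_candidate_size(30, 30), is_checked(empty_box), is_checked(marked_box))
```

Under these assumptions, raising px_threshold_ratio makes the check stricter (faint marks or scan noise are less likely to count as checked), while widening the min/max length range admits more box candidates.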