Version: 0.94

operators.pdf.checkbox.CheckboxFeaturizer

class operators.pdf.checkbox.CheckboxFeaturizer(pdf_url_field='rich_doc_pdf_url', min_box_length_px=25, max_box_length_px=55, px_threshold_ratio=0.1, num_pages_per_batch=100, pages_field=None)

A featurizer that identifies checkboxes in PDF documents.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| pdf_url_field | str | 'rich_doc_pdf_url' | The name of the column containing the PDF file paths. |
| min_box_length_px | int | 25 | The minimum side length of a checkbox, in pixels. Defaults to 25. |
| max_box_length_px | int | 55 | The maximum side length of a checkbox, in pixels. Defaults to 55. |
| px_threshold_ratio | float | 0.1 | The ratio of non-empty pixels inside a checkbox above which it is considered checked. Defaults to 0.1. |
| num_pages_per_batch | int | 100 | The number of pages to process in each batch. Defaults to 100. |
| pages_field | Optional[str] | None | The name of the column containing the page numbers on which to run the operator. If None, the operator runs on all pages. Defaults to None. |
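
The sketch below illustrates how the size, threshold, and batching parameters could be interpreted. It is not the featurizer's actual implementation: the helper names (`is_checkbox_sized`, `is_checked`, `page_batches`) are hypothetical, and it assumes that the length bounds are inclusive and apply to both sides of a box, that a "non-empty" pixel is any non-zero value, and that a box counts as checked when its fill ratio is at least `px_threshold_ratio`.

```python
from typing import Iterator, List, Sequence


def is_checkbox_sized(width_px: int, height_px: int,
                      min_box_length_px: int = 25,
                      max_box_length_px: int = 55) -> bool:
    """Return True if both sides lie within the allowed range (assumed inclusive)."""
    return all(min_box_length_px <= side <= max_box_length_px
               for side in (width_px, height_px))


def is_checked(box_pixels: Sequence[Sequence[int]],
               px_threshold_ratio: float = 0.1) -> bool:
    """Return True if the share of non-empty (non-zero) pixels meets the threshold."""
    total = sum(len(row) for row in box_pixels)
    if total == 0:
        # An empty region cannot be a checked box.
        return False
    filled = sum(1 for row in box_pixels for px in row if px)
    return filled / total >= px_threshold_ratio


def page_batches(pages: Sequence[int],
                 num_pages_per_batch: int = 100) -> Iterator[List[int]]:
    """Yield consecutive groups of at most num_pages_per_batch page numbers."""
    for start in range(0, len(pages), num_pages_per_batch):
        yield list(pages[start:start + num_pages_per_batch])
```

Under these assumptions, a 30 x 30 px box with 100 of its 900 pixels marked has a fill ratio of about 0.11 and would be reported as checked at the default threshold, and a 250-page document would be processed in batches of 100, 100, and 50 pages.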

Returns

A Checkbox object containing the checkbox info.

Return type

{RichDocCols.CHECKBOXES}