operators.pdf.checkbox.CheckboxFeaturizer
- class operators.pdf.checkbox.CheckboxFeaturizer(pdf_url_field='rich_doc_pdf_url', min_box_length_px=25, max_box_length_px=55, px_threshold_ratio=0.1, num_pages_per_batch=100, pages_field=None)
A featurizer that identifies checkboxes in PDF documents.
- Parameters:
pdf_url_field (str, default: 'rich_doc_pdf_url') – The name of the column containing the PDF file paths.
min_box_length_px (int, default: 25) – The minimum length of a checkbox, in pixels. Defaults to 25.
max_box_length_px (int, default: 55) – The maximum length of a checkbox, in pixels. Defaults to 55.
px_threshold_ratio (float, default: 0.1) – The minimum ratio of non-empty pixels inside a checkbox for it to be considered checked. Defaults to 0.1.
num_pages_per_batch (int, default: 100) – The number of pages to process in each batch. Defaults to 100.
pages_field (Optional[str], default: None) – The name of the column containing the page numbers on which to run the operator. If None, the operator runs on all pages. Defaults to None.
- Returns:
A Checkbox object containing the checkbox info.
- Return type:
{RichDocCols.CHECKBOXES}
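The following is a minimal, self-contained sketch of how the size bounds, the pixel threshold, and the batch size described above could interact. It is not the library's implementation: the helper names (`is_plausible_box`, `is_checked`, `batch_pages`), the representation of a box interior as a grid of 0/1 values, and the choice of inclusive comparisons are all assumptions made for illustration.

```python
from typing import List, Sequence


def is_plausible_box(width_px: int, height_px: int,
                     min_box_length_px: int = 25,
                     max_box_length_px: int = 55) -> bool:
    """Hypothetical size filter: both sides must lie within the bounds.

    Assumes the bounds are inclusive; the real operator may differ.
    """
    return all(min_box_length_px <= side <= max_box_length_px
               for side in (width_px, height_px))


def is_checked(interior: Sequence[Sequence[int]],
               px_threshold_ratio: float = 0.1) -> bool:
    """Hypothetical check test: ratio of non-empty pixels vs. the threshold.

    `interior` is a grid of pixel values where any non-zero value counts
    as non-empty. Assumes a box is checked when the ratio is at least
    `px_threshold_ratio`.
    """
    total = sum(len(row) for row in interior)
    if total == 0:
        return False
    non_empty = sum(1 for row in interior for px in row if px)
    return non_empty / total >= px_threshold_ratio


def batch_pages(pages: Sequence[int],
                num_pages_per_batch: int = 100) -> List[List[int]]:
    """Hypothetical batching: split page numbers into fixed-size chunks."""
    return [list(pages[i:i + num_pages_per_batch])
            for i in range(0, len(pages), num_pages_per_batch)]


# A 10x10 interior with 20 marked pixels has a ratio of 0.2.
interior = [[1] * 10, [1] * 10] + [[0] * 10 for _ in range(8)]
print(is_plausible_box(30, 32))          # within the default 25..55 bounds
print(is_checked(interior))              # 0.2 is above the default 0.1
print(batch_pages(list(range(250))))     # three batches: 100, 100, 50 pages
```

Under these assumptions, raising `px_threshold_ratio` makes the check stricter (faint marks or scanning noise are less likely to count as checked), while widening the `min_box_length_px` to `max_box_length_px` range admits more candidate boxes at the cost of more false positives.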