Version: 0.94

operators.pdf.checkbox.CheckboxFeaturizer

class operators.pdf.checkbox.CheckboxFeaturizer(pdf_url_field='rich_doc_pdf_url', min_box_length_px=25, max_box_length_px=55, px_threshold_ratio=0.1, num_pages_per_batch=100, pages_field=None)

A featurizer that identifies checkboxes in PDF documents.

Parameters:
  • pdf_url_field (str, default: 'rich_doc_pdf_url') – The name of the column containing the PDF file paths.

  • min_box_length_px (int, default: 25) – The minimum length of a checkbox, in pixels. Defaults to 25.

  • max_box_length_px (int, default: 55) – The maximum length of a checkbox, in pixels. Defaults to 55.

  • px_threshold_ratio (float, default: 0.1) – The threshold ratio of non-empty pixels inside a checkbox for it to be considered checked. Defaults to 0.1.

  • num_pages_per_batch (int, default: 100) – The number of pages to process in each batch. Defaults to 100.

  • pages_field (Optional[str], default: None) – The name of the column containing the page numbers on which to run the operator. If None, the operator runs on all pages. Defaults to None.
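
The sketch below illustrates how the size, threshold, and batching parameters above might interact. It is not the featurizer's implementation, which this page does not describe. The helper names (is_box_sized, is_checked, batch_pages) are hypothetical, as is the pixel representation (a grid where nonzero means non-empty). Whether the real comparisons are strict or inclusive is also an assumption.

```python
from typing import Iterator, List, Sequence


def is_box_sized(
    width_px: int,
    height_px: int,
    min_box_length_px: int = 25,
    max_box_length_px: int = 55,
) -> bool:
    """Hypothetical size filter: both sides must fall within the length bounds."""
    return all(
        min_box_length_px <= side <= max_box_length_px
        for side in (width_px, height_px)
    )


def is_checked(
    box_pixels: Sequence[Sequence[int]], px_threshold_ratio: float = 0.1
) -> bool:
    """Hypothetical check: True if the share of non-empty pixels exceeds the ratio.

    Assumes a strict comparison; the real operator may use >= instead.
    """
    total = sum(len(row) for row in box_pixels)
    if total == 0:
        return False
    filled = sum(1 for row in box_pixels for px in row if px)
    return filled / total > px_threshold_ratio


def batch_pages(
    pages: Sequence[int], num_pages_per_batch: int = 100
) -> Iterator[List[int]]:
    """Hypothetical batching: yield consecutive groups of page numbers."""
    for start in range(0, len(pages), num_pages_per_batch):
        yield list(pages[start:start + num_pages_per_batch])
```

Under these assumptions, a 30 x 30 px box is a candidate with the default bounds, and it counts as checked once more than 10% of its 900 pixels are non-empty. A 250-page document would be processed in three batches.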

Returns:

A Checkbox object containing the checkbox info.

Return type:

{RichDocCols.CHECKBOXES}