Version: 25.3

operators.ocr.tesseract_featurizer.TesseractFeaturizer

class operators.ocr.tesseract_featurizer.TesseractFeaturizer(pdf_url_field, output_field, ignore_errors=False)

Operator that takes in a PDF URL and outputs the hOCR text.

This operator uses the Tesseract OCR engine to convert PDFs to hOCR text. The hOCR text is then used by the HocrToRichDocParser to generate a RichDoc object for the PDF.
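For readers unfamiliar with hOCR, the sketch below shows roughly what the output looks like and how word text and bounding boxes can be read from it. It uses only the Python standard library. The sample markup is hand-written for illustration rather than real operator output, and the `extract_words` helper is hypothetical; it is not part of this package and is not how `HocrToRichDocParser` is implemented.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only (not real Tesseract output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 200 100'>
  <span class='ocr_line' title='bbox 10 10 190 30'>
    <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordCollector(HTMLParser):
    """Collects (text, bbox) pairs from elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attr_map.get("title") or "")

    def handle_data(self, data):
        text = data.strip()
        if self._in_word and text:
            self.words.append((text, self._bbox))

    def handle_endtag(self, tag):
        # Word elements are spans in this sample; closing one ends the word.
        if tag == "span":
            self._in_word = False
            self._bbox = None


def extract_words(hocr):
    """Hypothetical helper: list of (word, bbox) tuples found in hOCR text."""
    collector = HocrWordCollector()
    collector.feed(hocr)
    return collector.words


print(extract_words(SAMPLE_HOCR))
# [('Hello', (10, 10, 60, 30)), ('world', (70, 10, 130, 30))]
```

Each word carries its page coordinates in the `title` attribute, which is what lets a downstream parser rebuild layout as well as text.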

Returns: output_field
Return type: The hOCR text

Parameters

Name	Type	Default	Info
pdf_url_field	`NewType(DataframeFieldType, str)`		The name of the field that contains the URL of the PDF to be OCR’d.
output_field	`str`		The name of the field that will contain the hOCR text.
ignore_errors	`bool`	`False`	Whether to ignore errors caused by bad PDF files instead of raising them.
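The parameter contract above can be illustrated with a small stand-in written in plain Python. This is a sketch under stated assumptions, not the operator's implementation: `featurize_rows` and `fake_ocr` are hypothetical names, rows are modeled as dictionaries rather than a DataFrame, and writing `None` for a failed row when `ignore_errors=True` is a guess, since the page does not say what the real operator stores in that case.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


def featurize_rows(
    rows: List[Row],
    pdf_url_field: str,
    output_field: str,
    ocr_fn: Callable[[str], str],
    ignore_errors: bool = False,
) -> List[Row]:
    """Hypothetical stand-in: read a URL from one field, write hOCR to another."""
    results = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        try:
            new_row[output_field] = ocr_fn(row[pdf_url_field])
        except Exception:
            if not ignore_errors:
                raise
            # Assumption: a failed row gets None; the real behavior is undocumented here.
            new_row[output_field] = None
        results.append(new_row)
    return results


def fake_ocr(url: str) -> str:
    """Placeholder for the OCR step; performs no download and no OCR."""
    if not url.endswith(".pdf"):
        raise ValueError(f"not a PDF: {url}")
    return f"<div class='ocr_page'>hOCR for {url}</div>"


rows = [{"pdf_url": "https://example.com/a.pdf"}, {"pdf_url": "https://example.com/b.txt"}]

# With ignore_errors=True the bad row is kept and its output is left empty.
lenient = featurize_rows(rows, "pdf_url", "hocr", fake_ocr, ignore_errors=True)
print(lenient[1]["hocr"])  # None

# With the default ignore_errors=False the first bad file raises.
try:
    featurize_rows(rows, "pdf_url", "hocr", fake_ocr)
except ValueError as exc:
    print("raised:", exc)
```

With the default `ignore_errors=False`, one bad file stops the whole run, which is usually what you want while validating a dataset; switching it on trades that strictness for throughput on large, noisy collections.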
