Version: 25.3

operators.ocr.tesseract_featurizer.TesseractFeaturizer

class operators.ocr.tesseract_featurizer.TesseractFeaturizer(pdf_url_field, output_field, ignore_errors=False)

Operator that takes in a PDF URL and outputs the hOCR text.

This operator uses the Tesseract OCR engine to convert PDFs to hOCR text. The hOCR text is then used by the HocrToRichDocParser to generate a RichDoc object for the PDF.
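For readers unfamiliar with the format, hOCR is HTML in which each recognized element carries a class such as `ocr_page`, `ocr_line`, or `ocrx_word`, plus a `title` attribute holding properties like `bbox x0 y0 x1 y1`. The following is a minimal sketch of reading words and bounding boxes out of such a string using only the Python standard library. The sample markup, `HocrWordParser`, `parse_bbox`, and `extract_words` are illustrative names invented here; this is not how HocrToRichDocParser is implemented, and the operator's real output may contain more elements and properties than shown.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    # hOCR title properties are separated by ";", e.g. "bbox 10 20 110 45; x_wconf 96"
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans in an hOCR string."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # Only keep text that sits inside a word span with a usable bbox.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample for illustration only, not real operator output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 10 20 250 45'>
    <span class='ocrx_word' title='bbox 10 20 110 45; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 120 20 250 45; x_wconf 93'>world</span>
  </span>
</div>
"""

words = extract_words(SAMPLE_HOCR)
for text, bbox in words:
    print(text, bbox)
```

The bounding boxes are what make hOCR useful as an intermediate format: a downstream parser can recover page layout as well as text.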

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| pdf_url_field | NewType(DataframeFieldType, str) | | The name of the field that contains the URL of the PDF to be OCR’d. |
| output_field | str | | The name of the field that will contain the hOCR text. |
| ignore_errors | bool | False | Whether to ignore errors from bad PDF files instead of raising them. |
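The three parameters describe a field-in, field-out pattern with an error switch. The sketch below imitates that pattern on plain dictionaries so it can run without Tesseract or the operator package. Everything in it is hypothetical: `featurize_rows` and `fake_ocr` are stand-ins invented for illustration, and writing `None` for a failed row is an assumption, since this page does not say what the operator stores when an error is ignored.

```python
def featurize_rows(rows, pdf_url_field, output_field, ocr_fn, ignore_errors=False):
    """Hypothetical stand-in: read a URL from one field, write OCR output to another."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        try:
            new_row[output_field] = ocr_fn(row[pdf_url_field])
        except Exception:
            if not ignore_errors:
                raise  # default behaviour: surface the failure
            # Assumption: a failed row gets None; the real operator may differ.
            new_row[output_field] = None
        out.append(new_row)
    return out


def fake_ocr(url):
    """Placeholder for the OCR step; returns a dummy hOCR-like string."""
    if not url.endswith(".pdf"):
        raise ValueError(f"not a PDF: {url}")
    return f"<div class='ocr_page'><!-- hOCR for {url} --></div>"


rows = [
    {"pdf_url": "https://example.com/a.pdf"},
    {"pdf_url": "https://example.com/broken.txt"},
]

result = featurize_rows(rows, "pdf_url", "hocr", fake_ocr, ignore_errors=True)
for row in result:
    print(row["pdf_url"], "->", row["hocr"])
```

With `ignore_errors=False` (the default), the same call would raise on the second row instead of continuing, which is usually preferable while debugging a new data source.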

Returns

output_field

Return type

The hOCR text