operators.ocr.tesseract_featurizer.TesseractFeaturizer
- class operators.ocr.tesseract_featurizer.TesseractFeaturizer(pdf_url_field, output_field, ignore_errors=False)
Operator that takes in a PDF URL and outputs the hOCR text.
This operator uses the Tesseract OCR engine to convert PDFs to hOCR text. The hOCR text is then used by the HocrToRichDocParser to generate a RichDoc object for the PDF.
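To make "hOCR text" concrete: hOCR is an HTML-based format in which each recognized word is typically a span with class `ocrx_word` and a `title` attribute carrying its bounding box. The sketch below is not the HocrToRichDocParser implementation. It is a minimal standard-library illustration of how such text could be read; the sample markup, `HocrWordExtractor`, and `extract_words` are hypothetical names introduced here.

```python
from html.parser import HTMLParser

# Hand-written sample in the hOCR style (assumed shape, not real operator output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 10 10 200 30'>
    <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 70 10 140 30; x_wconf 91'>world</span>
  </span>
</div>
"""


class HocrWordExtractor(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word span currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") == "ocrx_word":
            # The title looks like "bbox x0 y0 x1 y1; x_wconf NN".
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(v) for v in parts[1:5])

    def handle_data(self, data):
        # Record the first non-blank text that follows a word start tag.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples from hOCR text."""
    parser = HocrWordExtractor()
    parser.feed(hocr)
    return parser.words
```

Calling `extract_words(SAMPLE_HOCR)` yields one tuple per word with its pixel bounding box, which is the kind of layout information a downstream parser can build a rich document from.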
Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| pdf_url_field | NewType(DataframeFieldType, str) | | The name of the field that contains the URL of the PDF to be OCR’d. |
| output_field | str | | The name of the field that will contain the hOCR text. |
| ignore_errors | bool | False | Whether we want to raise errors or not for bad PDF files. |

Returns
output_field
Return type
The hOCR text
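The three parameters describe a simple contract: read a URL from one field, write hOCR text to another, and decide what happens when a PDF is bad. The sketch below illustrates that contract only. It does not use the real operator or Tesseract; `featurize_rows` and `fake_ocr` are hypothetical stand-ins, rows are plain dicts rather than a dataframe, and writing `None` for a failed row is an assumption, since the documentation does not say what the operator emits in that case.

```python
def fake_ocr(pdf_url):
    """Hypothetical stand-in for the OCR step; fails on non-PDF URLs."""
    if not pdf_url.endswith(".pdf"):
        raise ValueError(f"not a PDF: {pdf_url}")
    return "<div class='ocr_page'></div>"


def featurize_rows(rows, pdf_url_field, output_field, ocr_fn, ignore_errors=False):
    """Copy each row and add output_field with the hOCR text for its PDF URL."""
    results = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        try:
            new_row[output_field] = ocr_fn(row[pdf_url_field])
        except Exception:
            if not ignore_errors:
                raise  # surface the bad PDF to the caller
            new_row[output_field] = None  # assumption: failed rows get None
        results.append(new_row)
    return results


rows = [
    {"url": "https://example.com/report.pdf"},
    {"url": "https://example.com/broken.txt"},
]
lenient = featurize_rows(rows, "url", "hocr", fake_ocr, ignore_errors=True)
```

With `ignore_errors=True` the second row is kept with an empty result, while the default `ignore_errors=False` would propagate the `ValueError` from the bad file.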