Version: 0.96

operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser

class operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser(pdf_url_field, form_recognizer_endpoint, form_recognizer_key, result_upload_storage_key='', result_upload_container=None, result_upload_blob_prefix=None, result_upload_overwrite=True)

Takes a PDF URL and runs Azure Form Recognizer on it. The result is returned as a RichDoc and, if upload is configured, is also uploaded to Azure blob storage. The form recognizer endpoint and key are configured as secrets.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| pdf_url_field | NewType(DataframeFieldType, str) | | The name of the dataframe column that contains the PDF URLs. |
| form_recognizer_endpoint | str | | The endpoint for the Azure Form Recognizer service. |
| form_recognizer_key | str | | The key in the secret store that has the Azure Form Recognizer key. |
| result_upload_storage_key | str | '' | The key in the secret store that has the connection string for the Azure blob storage account. |
| result_upload_container | Optional[str] | None | The container in the Azure blob storage account to upload results to. |
| result_upload_blob_prefix | Optional[str] | None | The prefix to use for the blob name. |
| result_upload_overwrite | bool | True | Whether to overwrite existing blobs. |
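
As a rough illustration of how the parameters above fit together, the sketch below collects them as keyword arguments. Only the parameter names come from the signature; every value (the column name, endpoint URL, secret names, container, and prefix) is a hypothetical placeholder. Note that form_recognizer_key and result_upload_storage_key hold the names of entries in the secret store, not the secret values themselves.

```python
# Hypothetical values throughout; only the parameter names are taken
# from the class signature documented on this page.
parser_kwargs = {
    # Name of the dataframe column holding the PDF URLs (assumed name).
    "pdf_url_field": "pdf_url",
    # Placeholder endpoint for the Form Recognizer service.
    "form_recognizer_endpoint": "https://example.cognitiveservices.azure.com/",
    # Name of the secret that stores the Form Recognizer key (assumed name).
    "form_recognizer_key": "FORM_RECOGNIZER_KEY",
    # Name of the secret that stores the blob storage connection string (assumed name).
    "result_upload_storage_key": "BLOB_CONNECTION_STRING",
    # Placeholder container and blob name prefix for uploaded results.
    "result_upload_container": "form-recognizer-results",
    "result_upload_blob_prefix": "parsed/",
    # Keep existing blobs rather than overwriting them (the default is True).
    "result_upload_overwrite": False,
}

# With the library installed, these would presumably be passed as
# AzureFormRecognizerParser(**parser_kwargs).
```

Leaving the three result_upload_* location parameters at their defaults would, per the description above, skip the blob storage upload.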

Returns

  • rich_doc_text – A stripped raw text representation of the rich doc

  • rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text

  • page_docs – A serialized list of RichDoc objects, one per page

  • page_char_starts – The character offsets at which the text of each page starts
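
The returned fields can be combined to recover per-page text. The sketch below assumes that page_char_starts is a list of ascending, 0-based character offsets into rich_doc_text; that reading is an assumption, not something this page states. The helper split_text_by_page and the sample data are hypothetical.

```python
from typing import List


def split_text_by_page(text: str, page_char_starts: List[int]) -> List[str]:
    """Slice text into one string per page.

    Assumes page_char_starts holds ascending, 0-based offsets into text,
    one per page (an assumption about the operator's output).
    """
    # Each page ends where the next one starts; the last page ends at len(text).
    ends = page_char_starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(page_char_starts, ends)]


# Made-up sample standing in for rich_doc_text and page_char_starts.
text = "Page one text. Page two text. Page three."
starts = [0, 15, 30]
pages = split_text_by_page(text, starts)
```

Joining the slices back together reproduces the original text, which is a quick way to check that the offsets are being interpreted consistently.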