Version: 0.93

operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser

class operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser(pdf_url_field, form_recognizer_endpoint, form_recognizer_key, result_upload_storage_key='', result_upload_container=None, result_upload_blob_prefix=None, result_upload_overwrite=True)

Takes in a PDF URL and runs Azure Form Recognizer on it. The result is returned as a RichDoc. The result is also uploaded to blob storage if configured. The form recognizer endpoint and key are configured as secrets.

Parameters:
  • pdf_url_field (NewType(DataframeFieldType, str)) – The name of the dataframe column that contains the PDF URLs

  • form_recognizer_endpoint (str) – The endpoint for the Azure Form Recognizer service

  • form_recognizer_key (str) – The key in the secret store that has the Azure Form Recognizer key

  • result_upload_storage_key (str, default: '') – The key in the secret store that has the connection string for the Azure blob storage account

  • result_upload_container (Optional[str], default: None) – The container in the Azure blob storage account to upload results to

  • result_upload_blob_prefix (Optional[str], default: None) – The prefix to use for the blob name

  • result_upload_overwrite (bool, default: True) – Whether to overwrite existing blobs
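
As a rough illustration of how the parameters above fit together, the sketch below collects them as keyword arguments and passes them to the constructor. Only the parameter names and the class name come from the signature above. Every value (the column name, endpoint URL, secret-store key names, container, and prefix) is a hypothetical placeholder. The import path is copied from the qualified name in the title and may need a package prefix in a real installation, so the import is guarded.

```python
# Hypothetical values throughout; only the parameter names come from the signature.
parser_kwargs = dict(
    pdf_url_field="pdf_url",  # assumed dataframe column holding PDF URLs
    form_recognizer_endpoint="https://example.cognitiveservices.azure.com/",  # placeholder
    form_recognizer_key="FORM_RECOGNIZER_KEY",  # name of a secret, not the key itself
    result_upload_storage_key="BLOB_CONNECTION_STRING",  # name of a secret
    result_upload_container="form-recognizer-results",  # placeholder container
    result_upload_blob_prefix="parsed/",  # placeholder blob-name prefix
    result_upload_overwrite=False,  # keep existing blobs instead of replacing them
)

try:
    # Import path assumed from the qualified class name; adjust for your install.
    from operators.azure.azure_form_recognizer_parser import AzureFormRecognizerParser
except ImportError:
    AzureFormRecognizerParser = None  # package not available in this environment

if AzureFormRecognizerParser is not None:
    parser = AzureFormRecognizerParser(**parser_kwargs)
```

Leaving the three result_upload_* arguments at their defaults presumably skips the upload, since the description says results are uploaded only "if configured".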

Returns:

  • rich_doc_text – A stripped raw text representation of the rich doc

  • rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text

  • page_docs – A serialized list of RichDoc objects, one per page

  • page_char_starts – The character offsets at which the text of each page starts
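
One way the returned offsets might be used is to map a character position in rich_doc_text back to its page. The sketch below assumes page_char_starts has already been deserialized into a sorted list of 0-based integer offsets into rich_doc_text, which the reference does not state explicitly. The helper names page_for_offset and page_text are hypothetical and not part of the library.

```python
from bisect import bisect_right


def page_for_offset(page_char_starts, offset):
    """Return the 0-based index of the page containing the character at `offset`."""
    if not page_char_starts or offset < page_char_starts[0]:
        raise ValueError("offset lies before the first page start")
    # bisect_right counts the page starts that are <= offset.
    return bisect_right(page_char_starts, offset) - 1


def page_text(rich_doc_text, page_char_starts, page_index):
    """Slice the text of one page out of the full document text."""
    start = page_char_starts[page_index]
    if page_index + 1 < len(page_char_starts):
        end = page_char_starts[page_index + 1]
    else:
        end = len(rich_doc_text)  # the last page runs to the end of the text
    return rich_doc_text[start:end]


# Made-up data: a three-page document with 5, 6, and 4 characters per page.
rich_doc_text = "AAAAA" + "BBBBBB" + "CCCC"
page_char_starts = [0, 5, 11]

second_page = page_text(rich_doc_text, page_char_starts, 1)
page_of_char_11 = page_for_offset(page_char_starts, 11)
```

Binary search keeps the lookup logarithmic in the number of pages, which matters only for very long documents; a linear scan would give the same result.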