operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser
- class operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser(pdf_url_field, form_recognizer_endpoint, form_recognizer_key, result_upload_storage_key='', result_upload_container=None, result_upload_blob_prefix=None, result_upload_overwrite=True)
Takes in a PDF URL and runs Azure Form Recognizer on it. The result is returned as a RichDoc. The result is also uploaded to blob storage if configured. The form recognizer endpoint and key are configured as secrets.
- Parameters:
pdf_url_field (NewType(DataframeFieldType, str)) – The name of the dataframe column that contains the PDF URLs.
form_recognizer_endpoint (str) – The endpoint for the Azure Form Recognizer service.
form_recognizer_key (str) – The key in the secret store that holds the Azure Form Recognizer key.
result_upload_storage_key (str, default: '') – The key in the secret store that holds the connection string for the Azure blob storage account.
result_upload_container (Optional[str], default: None) – The container in the Azure blob storage account to upload results to.
result_upload_blob_prefix (Optional[str], default: None) – The prefix to use for the blob name.
result_upload_overwrite (bool, default: True) – Whether to overwrite existing blobs.
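The parameters above might be assembled as follows. This is a minimal sketch, not a confirmed usage pattern: the column name, endpoint URL, secret-store key names, container, and prefix are all placeholder values, and the commented-out import assumes the class is importable from the module path shown in the signature.

```python
from typing import Any, Dict

# All values below are illustrative placeholders, not real resources.
parser_kwargs: Dict[str, Any] = {
    "pdf_url_field": "pdf_url",  # hypothetical dataframe column holding PDF URLs
    "form_recognizer_endpoint": "https://example.cognitiveservices.azure.com/",
    "form_recognizer_key": "form_recognizer_key",  # name of a secret, not the key itself
    "result_upload_storage_key": "blob_connection_string",  # name of a secret
    "result_upload_container": "parsed-results",
    "result_upload_blob_prefix": "form-recognizer/",
    "result_upload_overwrite": True,
}

# Assumed usage, if the library is installed in your environment:
# from operators.azure.azure_form_recognizer_parser import AzureFormRecognizerParser
# parser = AzureFormRecognizerParser(**parser_kwargs)
```

Note that form_recognizer_key and result_upload_storage_key name entries in the secret store, so the dictionary itself never contains the credentials.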
- Returns:
rich_doc_text – A stripped raw text representation of the rich doc
rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text
page_docs – A serialized list of RichDoc objects, one per page
page_char_starts – The character offsets at which the text of each page starts
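One way to use page_char_starts is to map a character position in rich_doc_text back to its page. The sketch below assumes the offsets are a sorted list of integers indexing into rich_doc_text, with one entry per page; the documentation does not confirm this. The helper name page_for_offset and the sample offsets are made up for illustration.

```python
from bisect import bisect_right
from typing import List


def page_for_offset(page_char_starts: List[int], offset: int) -> int:
    """Return the 0-based index of the page containing the character at `offset`.

    Assumes `page_char_starts` is sorted ascending, one entry per page.
    """
    if not page_char_starts or offset < page_char_starts[0]:
        raise ValueError("offset precedes the first page")
    # bisect_right counts the page starts that are <= offset;
    # subtracting 1 gives the index of the last such page.
    return bisect_right(page_char_starts, offset) - 1


# Made-up offsets for a three-page document.
starts = [0, 120, 305]
print(page_for_offset(starts, 119))  # 0
print(page_for_offset(starts, 120))  # 1
print(page_for_offset(starts, 400))  # 2
```

Binary search keeps the lookup logarithmic in the number of pages, which helps when mapping many offsets in a long document.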