operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser
- class operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser(pdf_url_field, form_recognizer_endpoint, form_recognizer_key, result_upload_storage_key='', result_upload_container=None, result_upload_blob_prefix=None, result_upload_overwrite=True)
Takes in a PDF URL and runs Azure Form Recognizer on it. The result is returned as a RichDoc. The result is also uploaded to blob storage if configured. The form recognizer endpoint and key are configured as secrets.
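As a rough sketch of how the constructor arguments fit together, the keyword arguments below mirror the documented signature, with result upload to blob storage enabled. Every value is a hypothetical placeholder (column name, endpoint URL, secret names, container, prefix); none of them come from this page.

```python
# Keyword arguments mirroring the documented constructor signature.
# All values are hypothetical placeholders, not real endpoints or secret names.
parser_kwargs = {
    # Dataframe column that holds the PDF URLs.
    "pdf_url_field": "pdf_url",
    # Endpoint of the Azure Form Recognizer resource (placeholder URL).
    "form_recognizer_endpoint": "https://example-resource.cognitiveservices.azure.com/",
    # Name of the entry in the secret store, not the key value itself.
    "form_recognizer_key": "form_recognizer_key",
    # Name of the secret holding the blob storage connection string.
    "result_upload_storage_key": "blob_connection_string",
    # Where uploaded results would go (placeholders).
    "result_upload_container": "form-recognizer-results",
    "result_upload_blob_prefix": "parsed/",
    # Documented default: existing blobs are overwritten.
    "result_upload_overwrite": True,
}

# With the library installed, these would be passed as
# AzureFormRecognizerParser(**parser_kwargs).
```

Leaving the three `result_upload_*` storage arguments at their defaults (`''`, `None`, `None`) presumably skips the upload, since the description says results are uploaded only if configured.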
Parameters
pdf_url_field (NewType(DataframeFieldType, str)) – The name of the dataframe column that contains the PDF URLs.
form_recognizer_endpoint (str) – The endpoint for the Azure Form Recognizer service.
form_recognizer_key (str) – The key in the secret store that holds the Azure Form Recognizer key.
result_upload_storage_key (str, default '') – The key in the secret store that holds the connection string for the Azure blob storage account.
result_upload_container (Optional[str], default None) – The container in the Azure blob storage account to upload results to.
result_upload_blob_prefix (Optional[str], default None) – The prefix to use for the blob name.
result_upload_overwrite (bool, default True) – Whether to overwrite existing blobs.
Returns
rich_doc_text – A stripped raw-text representation of the RichDoc.
rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text.
page_docs – A serialized list of RichDoc objects, one per page.
page_char_starts – The character offsets at which the text of each page starts.
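The relationship between the `rich_doc_text` and `page_char_starts` outputs can be illustrated with a small sketch. It assumes that `page_char_starts` has already been deserialized into a list of ascending integer indices into `rich_doc_text`; this page does not state the exact encoding, so treat that as an assumption. The helper `split_text_by_page` and the toy values are hypothetical and not part of the operator.

```python
from typing import List


def split_text_by_page(text: str, page_char_starts: List[int]) -> List[str]:
    """Slice text into per-page strings using the start offset of each page."""
    # Each page runs from its own start offset to the start of the next page;
    # the last page runs to the end of the text.
    ends = list(page_char_starts[1:]) + [len(text)]
    return [text[start:end] for start, end in zip(page_char_starts, ends)]


# Toy stand-ins for rich_doc_text and page_char_starts (not real operator output).
rich_doc_text = "Invoice 001\nTotal: 42\nThank you"
page_char_starts = [0, 12, 22]

pages = split_text_by_page(rich_doc_text, page_char_starts)
```

Under these assumptions, joining the resulting pages reproduces the original text, which is a quick sanity check that the offsets and the text belong together.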