Scanned PDF guide
Scanned PDF documents are created from scanned images of printed documents. They don't contain information about the text in the document and where it is located. A quick way to check if a PDF is scanned is by trying to search for words in the document. If no matches are found, the PDF is not machine-readable as is.
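For a quick programmatic check, the sketch below uses the open-source pypdf library (our choice for illustration; Snorkel Flow does not require it) to test whether a file has an extractable text layer:

```python
# Minimal sketch: check whether a PDF already contains an extractable text layer.
# Uses the open-source pypdf library; the file path is a placeholder.
from pypdf import PdfReader

reader = PdfReader("path/to/your/doc.pdf")
text = "".join(page.extract_text() or "" for page in reader.pages)

if text.strip():
    print("Text layer found; the PDF may already be machine-readable.")
else:
    print("No extractable text; this is likely a scanned PDF that needs OCR.")
```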
Scanned PDFs require additional preprocessing with an optical character recognition (OCR) library to derive the text and layout information from these documents. Snorkel Flow supports both running OCR in-platform and importing OCR results from external tools. This page describes the process for setting up applications with both approaches.
In-platform OCR
We support in-platform OCR with Tesseract and Azure Form Recognizer. Azure Form Recognizer performs better on average but requires a paid subscription. We recommend starting with Tesseract, which is free and open source. It is usually sufficient for data that isn't too noisy or handwritten.
To get started with a PDF application, create data sources with the following fields and add a new dataset. See Data preparation and Data upload for more information about creating and uploading data.
Field name | Data type | Description |
---|---|---|
uid | int | A unique id that is mapped to each row. This is standard across all Snorkel Flow data sources. |
url | str | The file path of the original PDF. This will be used to access and process the PDF. |
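For illustration, the hypothetical snippet below assembles a minimal data source with these fields using pandas; the file names and storage paths are placeholders, not Snorkel Flow defaults.

```python
# Hypothetical example: a minimal scanned-PDF data source with the fields above.
# The uid values and url paths are placeholders; point url at your own storage.
import pandas as pd

df = pd.DataFrame(
    {
        "uid": [0, 1, 2],
        "url": [
            "s3://my-bucket/scanned/doc_0.pdf",
            "s3://my-bucket/scanned/doc_1.pdf",
            "s3://my-bucket/scanned/doc_2.pdf",
        ],
    }
)
df.to_csv("scanned_pdf_datasource.csv", index=False)
```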
Tesseract
Follow these steps to create an application that runs OCR with Tesseract:
- Click the Applications option in the left-side menu, then click Create application. This brings up the New application guided flow.
- Enter a name and optional description for your application.
- In the Data accordion, select the scanned PDF dataset that you previously uploaded.
- In the Label schema accordion, set the following options:
- Data type: PDF
- Task type: Extraction
- PDF type: Scanned PDF, need to run OCR
- PDF URL field: Select the column with the URL of the original PDF.
- Follow the application creation steps described in the PDF Extraction tutorial to finish setting up your application.
This setup runs Tesseract OCR over your input PDFs and saves the output in the hOCR format. We ingest this output and store the data using Snorkel's internal RichDoc representation.
Azure Form Recognizer
To set up an application with Azure Form Recognizer, you need to use the Python SDK (this will be coming to the Snorkel Flow UI soon!). First, follow the steps above to set up an application with Tesseract. Then, follow these steps to modify the application:
- Set up a Form Recognizer endpoint and an Azure Blob Storage container. Follow the steps in the Azure portal to do this.
- Add secret keys from the SDK. The `AzureFormRecognizerParser` requires access credentials for the Form Recognizer endpoint and the storage container. We can store these credentials using our Credential management tool. A superadmin user can add the credentials from the SDK:

```python
sf.set_secret("azure_fr_key", "YOUR_AZURE_FORM_RECOGNIZER_KEY")
sf.set_secret("azure_connection_string", "YOUR_AZURE_CONNECTION_STRING")
```

- In Snorkel Flow, select your application name under Current App in the left-side menu.
- Ensure that you are in "EDIT" mode.
- Select the three dots on the first node in the DAG.
- Click Add node before, then click ChangeColumns.
- Click the newly created ChangeColumns node, then under Select Operator, click `AzureFormRecognizerParser`.
- Enter the following fields:

Field | Value |
---|---|
Pdf url field | rich_doc_pdf_url |
Form recognizer endpoint | your Form Recognizer endpoint URL |
Form recognizer key | azure_fr_key (set using the secret store) |
Result upload storage key | azure_connection_string (set using the secret store) |
Result upload container | the name of your container |
Result upload blob prefix | leave this blank |
Result upload overwrite | leave this enabled |
- Remove the `HocrToRichDocParser` node (and the `TesseractFeaturizer` node, if added). These operators are not needed once the Azure operator is added to the application DAG.
```python
# Assumes `sf` is the Snorkel Flow SDK client and APP_NAME is your application's name.

# Deleting HocrToRichDocParser
hocr_node = sf.get_node_uid(APP_NAME, "HocrToRichDocParser")[0]
sf.delete_node(hocr_node)

# Deleting TesseractFeaturizer
tesseract_node = sf.get_node_uid(APP_NAME, "TesseractFeaturizer")[0]
sf.delete_node(tesseract_node)
```
Importing OCR results
There are several other open-source and paid OCR tools available on the market. If the output of these tools is provided in standard hOCR format, it can be ingested for use in Snorkel Flow. The input data sources need an additional `hocr` column. Create data sources with the following fields and add a new dataset. See Data preparation and Data upload for more information about creating and uploading data.
Field name | Data type | Description |
---|---|---|
uid | int | A unique id that is mapped to each row. This is standard across all Snorkel Flow data sources. |
url | str | The file path of the original PDF. This will be used to access and process the PDF. |
hocr | str | A representation of the data in standard hOCR format (see the example below). |

For example:

```html
...
<p class='ocr_par' lang='deu' title="bbox930">
 <span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
  <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
  <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
...
```
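As a rough sketch, the hypothetical snippet below pairs each PDF with the hOCR output produced by an external tool and writes a data source file with the three fields above; the directory layout and the one-`.hocr`-file-per-PDF naming convention are assumptions to adapt to your own setup.

```python
# Hypothetical example: build a uid/url/hocr data source from external OCR output.
# Assumes each pdf/<name>.pdf has a matching hocr/<name>.hocr file.
import os
import pandas as pd

pdf_dir = "pdf/"
hocr_dir = "hocr/"

rows = []
for uid, pdf_name in enumerate(sorted(os.listdir(pdf_dir))):
    stem = os.path.splitext(pdf_name)[0]
    with open(os.path.join(hocr_dir, stem + ".hocr"), encoding="utf-8") as f:
        hocr = f.read()
    rows.append({"uid": uid, "url": os.path.join(pdf_dir, pdf_name), "hocr": hocr})

pd.DataFrame(rows).to_csv("scanned_pdf_with_hocr.csv", index=False)
```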
Follow these steps to create an application and import your OCR results:
- Click the Applications option in the left-side menu, then click Create application. This brings up the New application guided flow.
- Enter a name and optional description for your application.
- In the Data accordion, select the scanned PDF dataset that you previously uploaded.
- In the Label schema accordion, set the following options:
- Data type: PDF
- Task type: Extraction
- PDF type: Scanned PDF, no need to run OCR
- PDF URL field: Select the column with the URL of the original PDF.
- hOCR field: Select the column with the OCR predictions.
- Follow the application creation steps described in the PDF Extraction tutorial to finish setting up your application.
The team at Snorkel has used the open-source libraries Tesseract and DocTR outside the platform previously, so we provide some sample code below. Snorkel does not make any specific recommendations for or against the available OCR tools. If you have an OCR tool that you would like to use, but have concerns about data ingestion, please reach out to the Snorkel support team.
Tesseract
Follow these steps to run OCR with Tesseract outside the platform:
1. Install the Tesseract library locally. To use the script that we provide, you'll also need the convert utility from ImageMagick. On a Mac, you can use the following commands for installation:

```bash
brew install tesseract
brew install imagemagick

# Verifying installation
which tesseract
which convert
```
2. Run the shell script below to convert PDFs to images and run OCR on them:
```bash
# Script to parse PDF files to hOCR files
PDF_DATASET_DIR=pdf/
HOCR_DATASET_DIR=hocr/
TIFF_DATASET_DIR=tiff/

# Command to standardize filenames in a directory to 1.pdf, 2.pdf, ... n.pdf
# ls -v | cat -n | while read n f; do mv -n "$f" "$n.pdf"; done

for f in "$PDF_DATASET_DIR"*
do
    filename=$(basename -- "$f")
    filename="${filename%.*}"
    echo "Processing $filename"

    # Convert the PDF to a TIFF image
    convert -strip -depth 8 -alpha off -density 300 -quality 100 "$f" "$TIFF_DATASET_DIR$filename.tiff"

    # Use Tesseract to convert the TIFF image to an hOCR file
    # --psm 4 instructs Tesseract to parse the page as a single column
    tesseract -l eng --psm 4 "$TIFF_DATASET_DIR$filename.tiff" "$HOCR_DATASET_DIR$filename" hocr
done
```
DocTR
Follow these steps to use the in-platform notebook to run OCR with DocTR:
1. Follow the instructions to install DocTR.
2. Run the Python script below to run OCR over the scanned PDFs:
```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from lxml import etree, html

# Load in the OCR model
model = ocr_predictor(pretrained=True)

# Load in the PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")

# Run OCR on the PDF
result = model(pdf_doc)

# View the OCR result on the PDF
result.show(pdf_doc)

# Export to hOCR; the export is per page, so the pages have to be combined
xml_output = result.export_as_xml()

# Get the first page's body
hocr_doc = html.fromstring(xml_output[0][0].decode("utf-8"))
body = hocr_doc.xpath("//*[@class='ocr_page']")[0].getparent()

# Add only the page tag from each subsequent page
for page in xml_output[1:]:
    page_doc = html.fromstring(page[0].decode("utf-8"))
    page_body = page_doc.xpath("//*[@class='ocr_page']")[0].getparent()
    # Copy the children into a list first; appending moves them out of page_body
    for child in list(page_body):
        body.append(child)

total_hocr = etree.tostring(hocr_doc, pretty_print=True, method="xml", xml_declaration=True).decode("utf-8")
with open("x_doctr.hocr", "w") as file:
    file.write(total_hocr)
```
Conclusion
This page discussed the different ways that you can process and ingest scanned PDF documents in Snorkel Flow. To learn more about using PDF documents in Snorkel Flow, please see our other documentation:
- Check out our PDF Information Extraction tutorial.
- Read more about PDF-based Operators.
- See more examples of Rich Document Builders.