operators.pdf.parser2.PDFToRichDocParser2
- class operators.pdf.parser2.PDFToRichDocParser2(field, parser_version=0)
Operator that parses a PDF into Snorkel’s RichDoc representation.
This operator parses a PDF into a Snorkel-Flow-native representation of the document that preserves the original formatting, including a richer text representation and spatial information. The RichDoc representation enables tools that work in depth with PDF-formatted data.
The output includes: a stripped raw text representation of the rich doc (rich_doc_text), a serialized RichDoc that corresponds to rich_doc_text (rich_doc_pkl), a serialized list of RichDoc objects, one per page (page_docs), and character offsets of text starting on each page (page_char_starts).
By default, this parser does not stop on parsing errors: documents that fail to parse are skipped, and the affected PDFs and their errors are logged and reported to the user.
- Parameters:
field (str) – The name of the dataframe column that contains the PDF URLs.
parser_version (Optional[int], default: 0) – The version of the parser used to parse the PDF. If not provided, the latest parser version is used.
- Returns:
rich_doc_text – A stripped raw text representation of the rich doc
rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text
page_docs – A serialized list of RichDoc objects, one per page
page_char_starts – The character offsets of the text starting on each page
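
The following sketch illustrates one way the page offsets might be used downstream: mapping a character position in rich_doc_text back to its page, and slicing the text into per-page strings. It assumes that page_char_starts is available as a sorted list of 0-based integer offsets into rich_doc_text and that pages are indexed from 0; the actual stored format may differ. The helper names page_for_offset and split_by_page and the sample values are hypothetical and not part of the operator.

```python
from bisect import bisect_right
from typing import List


def page_for_offset(page_char_starts: List[int], offset: int) -> int:
    """Return the 0-based index of the page that contains character `offset`.

    Assumes `page_char_starts` is sorted in ascending order (an assumption,
    not something the operator documentation guarantees).
    """
    if not page_char_starts or offset < page_char_starts[0]:
        raise ValueError("offset precedes the first page")
    # bisect_right counts the page starts that are <= offset; subtracting 1
    # gives the index of the last page starting at or before the offset.
    return bisect_right(page_char_starts, offset) - 1


def split_by_page(text: str, page_char_starts: List[int]) -> List[str]:
    """Slice `text` into one string per page using the page start offsets."""
    # Each page ends where the next one starts; the last page ends at len(text).
    ends = page_char_starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(page_char_starts, ends)]


# Made-up values for illustration only (not real operator output).
rich_doc_text = "first page text second page third"
page_char_starts = [0, 16, 28]

pages = split_by_page(rich_doc_text, page_char_starts)
page_of_char_20 = page_for_offset(page_char_starts, 20)
```

A binary search keeps the lookup fast when a document has many pages and many spans need to be mapped back to a page.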