Version: 0.94

operators.pdf.parser2.PDFToRichDocParser2

class operators.pdf.parser2.PDFToRichDocParser2(field, parser_version=0)

Operator that parses a PDF into Snorkel’s RichDoc representation.

This operator parses a PDF into a native Snorkel Flow representation of the document that preserves the original formatting and includes a richer text representation, spatial information, and more. The RichDoc representation enables in-depth tooling for PDF-formatted data.

The output includes: a stripped raw text representation of the rich doc (rich_doc_text), a serialized RichDoc that corresponds to rich_doc_text (rich_doc_pkl), a serialized list of RichDoc objects, one per page (page_docs), and the character offsets at which the text of each page starts (page_char_starts).

By default, this parser does not stop on parsing errors. Documents that fail to parse are skipped, and the errors for those PDFs are logged and reported to the user.

Parameters:
  • field (str) – The name of the dataframe column that contains the PDF URLs.

  • parser_version (Optional[int], default: 0) – The version of the parser used to parse the PDF. If not provided, the latest parser version is used.

Returns:

  • rich_doc_text – A stripped raw text representation of the rich doc

  • rich_doc_pkl – A serialized RichDoc that corresponds to rich_doc_text

  • page_docs – A serialized list of RichDoc objects, one per page

  • page_char_starts – The character offsets at which the text of each page starts
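
The relationship between rich_doc_text and page_char_starts can be illustrated with a small sketch. It assumes that page_char_starts is available as a sorted Python list of integer offsets into rich_doc_text, with the first page starting at offset 0; the actual serialized form may differ. The helper names page_for_offset and page_text and the sample values are hypothetical and are not part of the operator.

```python
from bisect import bisect_right


def page_for_offset(page_char_starts, offset):
    """Return the 0-based index of the page that contains character `offset`.

    Assumes `page_char_starts` is sorted in ascending order.
    """
    if not page_char_starts or offset < page_char_starts[0]:
        raise ValueError("offset precedes the first page")
    # bisect_right counts how many page starts are <= offset;
    # the last such start belongs to the page containing the offset.
    return bisect_right(page_char_starts, offset) - 1


def page_text(rich_doc_text, page_char_starts, page):
    """Return the slice of `rich_doc_text` that belongs to page `page`."""
    start = page_char_starts[page]
    # The last page runs to the end of the text.
    if page + 1 < len(page_char_starts):
        end = page_char_starts[page + 1]
    else:
        end = len(rich_doc_text)
    return rich_doc_text[start:end]


# Hypothetical values standing in for the operator's output columns.
rich_doc_text = "First page.\nSecond page.\nThird page."
page_char_starts = [0, 12, 25]

print(page_for_offset(page_char_starts, 30))
print(repr(page_text(rich_doc_text, page_char_starts, 1)))
```

A binary search keeps the lookup logarithmic in the number of pages, which matters mainly when many character spans need to be mapped back to pages of a long document.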