Version: 0.94

snorkelflow.rich_docs.RichDoc

class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)

Bases: Serializable

An object representing a document with rich formatting preserved.

For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.

Types of signals available via RichDoc:

  • textual:
    • A text-only representation of the document (no markup present)

  • structural:
    • Every word is assigned to a line, paragraph (par), area, and page. Lines may include font size information.

  • visual:
    • Objects of all granularities contain bounding box coordinates

  • tabular:
    • Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments)

Other Notes:

  • [0, 0] is in the top-left corner (so bottom > top and right > left)
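As a minimal sketch of this coordinate convention, the snippet below derives a box's width and height from its edges. The `Box` class, its field names, and the `is_above` helper are hypothetical illustrations and not part of the RichDoc API; the actual column names in the RichDoc DataFrames may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; the origin [0, 0] is the top-left corner."""

    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        # x grows to the right, so right > left.
        return self.right - self.left

    @property
    def height(self) -> int:
        # y grows downward, so bottom > top.
        return self.bottom - self.top


def is_above(a: Box, b: Box) -> bool:
    """Return True if box a lies entirely above box b on the page."""
    return a.bottom <= b.top


# Made-up coordinates for a header block and a body block.
header = Box(left=50, top=40, right=550, bottom=80)
body = Box(left=50, top=100, right=550, bottom=700)
```

Because y increases downward, "above" means a smaller `bottom` value, which is the opposite of the usual mathematical convention.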

Parameters:
  • pages (DataFrame) – A DataFrame of page objects

  • areas (DataFrame) – A DataFrame of area objects

  • pars (DataFrame) – A DataFrame of paragraph objects

  • lines (DataFrame) – A DataFrame of line objects

  • words (DataFrame) – A DataFrame of word objects, with bounding boxes, parent object assignments, etc.

  • text (Optional[str], default: None) – The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency.

__init__(pages, areas, pars, lines, words, text=None, text_by_row=None)

Methods

__init__(pages, areas, pars, lines, words[, ...])

deserialize(serialized)

Deserialize the RichDoc instance from the encoded pickle representation.

extract_span_page(char_start, char_end)

Create a RichDoc with only objects on the same page as the requested span.

from_page_docs(page_docs)

Construct a RichDoc from a list of per-page RichDocs.

from_records(*, pages, areas, pars, lines, words)

Construct RichDoc object from records lists.

get_row_text(page, row)

Return the text corresponding to a given row_id and page_id.

get_span_location_fields(char_start, char_end)

Get the location of a span of words in the rich doc in terms of word_ids.

normalize_char_starts()

Normalize the character offsets in the pages and words DataFrames in place.

serialize()

Serialize the RichDoc instance into an encoded pickle representation.

split_pages()

Split a RichDoc document into a list of RichDocs, one page per doc.

to_hocr()

Convert a RichDoc document into an hOCR formatted string.

to_json([start_page])

Convert a RichDoc document into a JSON formatted string.

to_text()

Return a text-only representation of a RichDoc (without markup).

Attributes

page_char_starts

Get an array of the char_start for all pages in the RichDoc.

word_char_ends

Get an array of the char_end for all words in the RichDoc.

word_char_starts

Get an array of the char_start for all words in the RichDoc.

classmethod deserialize(serialized)

Deserialize the RichDoc instance from the encoded pickle representation.

Parameters:

serialized (str) – A base64-encoded string representation of the RichDoc pickle.

Return type:

RichDoc
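The encoding described above (a pickle wrapped in base64 text) can be illustrated with the standard library alone. This is a sketch of the general technique, not snorkelflow's implementation; `encode_pickle`, `decode_pickle`, and the stand-in payload are hypothetical names chosen for the example.

```python
import base64
import pickle


def encode_pickle(obj: object) -> str:
    """Pickle obj and return the bytes as a base64 ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_pickle(serialized: str) -> object:
    """Reverse encode_pickle. Only unpickle data from a trusted source."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# Stand-in payload; a real RichDoc would take the place of this dict.
payload = {"text": "Hello world", "n_pages": 1}
encoded = encode_pickle(payload)
restored = decode_pickle(encoded)
```

The base64 step makes the pickle bytes safe to store in text-only fields such as a string column. Unpickling executes code embedded in the data, so serialized strings should only be deserialized when they come from a trusted source.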

extract_span_page(char_start, char_end)

Create a RichDoc with only objects on the same page as the requested span.

Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.

Parameters:
  • char_start (int) – The inclusive start index of the span in RichDoc.text.

  • char_end (int) – The inclusive end index of the span in RichDoc.text.

Return type:

RichDoc
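To illustrate how a character span relates to a page, the sketch below looks up the page containing a character index from a list of page start offsets, such as the values exposed by page_char_starts. It assumes the offsets are sorted in ascending order and that each page's text runs up to the next page's start; `page_index_for_char` is a hypothetical helper, and the library's own lookup may work differently.

```python
from bisect import bisect_right
from typing import List


def page_index_for_char(page_char_starts: List[int], char_index: int) -> int:
    """Return the position of the last page whose char_start <= char_index."""
    if not page_char_starts or char_index < page_char_starts[0]:
        raise ValueError("char_index precedes the first page")
    return bisect_right(page_char_starts, char_index) - 1


# Made-up offsets: three pages starting at characters 0, 120, and 305.
starts = [0, 120, 305]
char_start, char_end = 130, 142  # both indices are inclusive
first_page = page_index_for_char(starts, char_start)
last_page = page_index_for_char(starts, char_end)
```

Here both ends of the span fall on the second page (position 1), so that is the only page whose objects would be needed for visual signals.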

classmethod from_page_docs(page_docs)

Construct a RichDoc from a list of per-page RichDocs.

Parameters:

page_docs (RichDocList) – A RichDocList of per-page RichDocs.

Return type:

RichDoc

classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)

Construct RichDoc object from records lists.

Return type:

Series

get_row_text(page, row)

Return the text corresponding to a given row_id and page_id.

Return type:

str

get_span_location_fields(char_start, char_end, span_words=None)

Get the location of a span of words in the rich doc in terms of word_ids.

These fields are used by the frontend for rendering.

Parameters:
  • char_start (int) – The inclusive start index of the span in RichDoc.text.

  • char_end (int) – The inclusive end index of the span in RichDoc.text.

  • span_words (Optional[DataFrame], default: None) – The dataframe of words associated with the span. If not specified, it is computed using char start and char end.

Return type:

Dict[str, Any]
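When span_words is not supplied, the words for a span have to be derived from char_start and char_end. A minimal sketch of one way to do that is an overlap test against per-word offsets like those in word_char_starts and word_char_ends. The helper `word_ids_in_span` is hypothetical, and the sketch assumes word end offsets are inclusive, which the documentation above does not state.

```python
from typing import List


def word_ids_in_span(
    word_char_starts: List[int],
    word_char_ends: List[int],
    char_start: int,
    char_end: int,
) -> List[int]:
    """Return positions of words that overlap the inclusive span."""
    return [
        i
        for i, (start, end) in enumerate(zip(word_char_starts, word_char_ends))
        # Two inclusive ranges overlap when each starts no later than the other ends.
        if start <= char_end and end >= char_start
    ]


# Made-up offsets for the text "Total amount due".
word_starts = [0, 6, 13]
word_ends = [4, 11, 15]  # assumed inclusive, like char_end
hits = word_ids_in_span(word_starts, word_ends, char_start=6, char_end=15)
```

With these offsets the span covers "amount due", so `hits` holds the positions of the second and third words.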

normalize_char_starts()

Normalize the character offsets in the pages and words DataFrames in place.

This is needed when a RichDoc is constructed from disjoint pages (e.g., via from_page_docs).

Return type:

None
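The need for normalization can be sketched as follows: when pages are built independently, each page's offsets start at 0, so they must be shifted to document-level positions once the pages are combined. The helpers below are hypothetical, and the single-character page separator (`sep_len=1`) is an assumption made for illustration; the method's actual behavior may differ.

```python
from typing import List


def rebase_page_starts(page_lengths: List[int], sep_len: int = 1) -> List[int]:
    """Compute document-level char_start values for pages laid end to end.

    sep_len is an assumed number of separator characters between pages.
    """
    starts = []
    offset = 0
    for length in page_lengths:
        starts.append(offset)
        offset += length + sep_len
    return starts


def rebase_word_starts(
    local_starts: List[List[int]], page_starts: List[int]
) -> List[int]:
    """Shift each page's page-local word offsets by that page's global start."""
    return [
        page_start + local
        for page_start, words in zip(page_starts, local_starts)
        for local in words
    ]


# Three made-up pages of 100, 50, and 75 characters.
page_starts = rebase_page_starts([100, 50, 75])
word_starts = rebase_word_starts([[0, 10], [0, 5], [0]], page_starts)
```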

serialize()

Serialize the RichDoc instance into an encoded pickle representation.

Return type:

str

split_pages()

Split a RichDoc document into a list of RichDocs, one page per doc.

Return type:

RichDocList

to_hocr()

Convert a RichDoc document into an hOCR formatted string.

Return type:

str

to_json(start_page=0)

Convert a RichDoc document into a JSON formatted string.

Return type:

str

to_text()

Return a text-only representation of a RichDoc (without markup).

Return type:

str

property page_char_starts: List[int]

Get an array of the char_start for all pages in the RichDoc.

NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.

property word_char_ends: ndarray

Get an array of the char_end for all words in the RichDoc.

property word_char_starts: ndarray

Get an array of the char_start for all words in the RichDoc.