Version: 0.95

snorkelflow.rich_docs.RichDoc

class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)

Bases: Serializable

An object representing a document with rich formatting preserved.

For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.

Types of signals available via RichDoc:

  • textual:
    • A text-only representation of the document (no markup present)

  • structural:
    • All words are assigned a line, par(agraph), area, and page. Lines may include font size information

  • visual:
    • Objects of all granularities contain bounding box coordinates

  • tabular:
    • Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments)

Other Notes:

  • [0, 0] is in the top-left corner (so bottom > top and right > left)
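To make the coordinate convention concrete, here is a minimal sketch. The `BBox` class, its field names, and the `is_above` helper are hypothetical stand-ins for illustration and are not part of snorkelflow; only the top-left origin convention comes from the note above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box; field names are assumptions, not the RichDoc schema."""

    left: float
    top: float
    right: float
    bottom: float

    def __post_init__(self) -> None:
        # With [0, 0] in the top-left corner, y grows downward and x grows rightward.
        if not (self.bottom > self.top and self.right > self.left):
            raise ValueError("expected bottom > top and right > left")

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


def is_above(a: BBox, b: BBox) -> bool:
    """True if box `a` lies entirely above box `b` on the page."""
    return a.bottom <= b.top


# Illustrative boxes: a header line sitting above a body block.
header = BBox(left=10, top=5, right=200, bottom=20)
body = BBox(left=10, top=30, right=200, bottom=400)
```

Because y grows downward, "above" means a smaller `bottom` value, which is the opposite of the usual mathematical convention.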

Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| pages | DataFrame | | A DataFrame of page objects. |
| areas | DataFrame | | A DataFrame of area objects. |
| pars | DataFrame | | A DataFrame of paragraph objects. |
| lines | DataFrame | | A DataFrame of line objects. |
| words | DataFrame | | A DataFrame of word objects, with bounding boxes, parent object assignments, etc. |
| text | Optional[str] | None | The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency. |

__init__

__init__(pages, areas, pars, lines, words, text=None, text_by_row=None)

Methods

__init__(pages, areas, pars, lines, words[, ...])

deserialize(serialized)

Deserialize the RichDoc instance from the encoded pickle representation.

extract_span_page(char_start, char_end)

Create a RichDoc with only objects on the same page as the requested span.

from_page_docs(page_docs)

Construct a RichDoc from a RichDocList of per-page RichDocs.

from_records(*, pages, areas, pars, lines, words)

Construct RichDoc object from records lists.

get_row_text(page, row)

Return the text corresponding to a given row_id and page_id.

get_span_location_fields(char_start, char_end)

Get the location of a span of words in the rich doc in terms of word_ids.

normalize_char_starts()

Normalize char offsets in the pages and words DataFrames in place.

serialize()

Serialize the RichDoc instance into an encoded pickle representation.

split_pages()

Split a RichDoc document into a list of RichDocs, one page per doc.

to_hocr()

Convert a RichDoc document into an hOCR formatted string.

to_json([start_page])

Convert a RichDoc document into a JSON formatted string.

to_text()

Return a text-only representation of a RichDoc (without markup).

Attributes

page_char_starts

Get an array of the char_start for all pages in the RichDoc.

word_char_ends

Get an array of the char_end for all words in the RichDoc.

word_char_starts

Get an array of the char_start for all words in the RichDoc.

word_indexes

Get an array of the word indexes for all words in the RichDoc.

deserialize

classmethod deserialize(serialized)

Deserialize the RichDoc instance from the encoded pickle representation.

Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| serialized | str | | A base64-encoded string representation of the RichDoc pickle. |

Return type

RichDoc
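The "base64-encoded pickle" format used by serialize and deserialize can be illustrated with a generic round trip. This is a sketch of the general technique using only the standard library, not the library's actual implementation; `encode_pickle`, `decode_pickle`, and the dictionary payload are hypothetical names chosen for the example.

```python
import base64
import pickle


def encode_pickle(obj: object) -> str:
    """Pickle an object and return the bytes as a base64-encoded ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_pickle(serialized: str) -> object:
    """Inverse of encode_pickle. Only unpickle data from a trusted source."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# Stand-in payload; a real RichDoc holds DataFrames rather than a plain dict.
payload = {"text": "Hello world", "num_pages": 1}
encoded = encode_pickle(payload)
restored = decode_pickle(encoded)
```

Since the payload is a pickle, a serialized string should only be deserialized when it comes from a trusted source.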

extract_span_page

extract_span_page(char_start, char_end)

Create a RichDoc with only objects on the same page as the requested span.

Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.

Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| char_start | int | | The inclusive start index of the span in RichDoc.text. |
| char_end | int | | The inclusive end index of the span in RichDoc.text. |

Return type

RichDoc
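One plausible way to decide which page a span belongs to is a binary search over the page start offsets, such as those exposed by page_char_starts. The sketch below is an assumption about how such a lookup could work, not the library's implementation; `page_index_for_char` and the sample offsets are hypothetical, and the lookup uses only the span's start offset.

```python
from bisect import bisect_right
from typing import List


def page_index_for_char(page_char_starts: List[int], char_start: int) -> int:
    """Return the position of the page whose char range contains char_start.

    Assumes page_char_starts is sorted in ascending order.
    """
    idx = bisect_right(page_char_starts, char_start) - 1
    if idx < 0:
        raise ValueError("char_start precedes the first page")
    return idx


# Hypothetical offsets for a three-page document.
starts = [0, 1200, 2500]
```

For these offsets, a span starting at character 1200 falls on the second page (position 1), while one starting at 1199 is still on the first.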

from_page_docs

classmethod from_page_docs(page_docs)

Construct a RichDoc from a RichDocList of per-page RichDocs.

Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| page_docs | RichDocList | | A RichDocList data structure. |

Return type

RichDoc

from_records

classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)

Construct RichDoc object from records lists.

Return type

Series

get_row_text

get_row_text(page, row)

Return the text corresponding to a given row_id and page_id.

Return type

str

get_span_location_fields

get_span_location_fields(char_start, char_end, span_words=None)

Get the location of a span of words in the rich doc in terms of word_ids.

These fields are used by the frontend for rendering.

Parameters

| Name | Type | Default | Info |
|---|---|---|---|
| char_start | int | | The inclusive start index of the span in RichDoc.text. |
| char_end | int | | The inclusive end index of the span in RichDoc.text. |
| span_words | Optional[DataFrame] | None | The DataFrame of words associated with the span. If not specified, it is computed using char_start and char_end. |

Return type

Dict[str, Any]
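When span_words is not supplied, the words must be derived from the character offsets. A minimal sketch of that kind of lookup follows; it is not the library's implementation. `words_in_span` and the sample data are hypothetical, plain lists stand in for the word_char_starts and word_char_ends arrays, and word end offsets are assumed to be inclusive like the span's.

```python
from typing import List


def words_in_span(
    word_char_starts: List[int],
    word_char_ends: List[int],
    char_start: int,
    char_end: int,
) -> List[int]:
    """Return positions of words that overlap the inclusive span [char_start, char_end].

    Assumes word_char_ends are inclusive as well, mirroring the span convention.
    """
    return [
        i
        for i, (ws, we) in enumerate(zip(word_char_starts, word_char_ends))
        if ws <= char_end and we >= char_start
    ]


text = "Total amount due: 42"
# Inclusive offsets for "Total", "amount", "due:", "42".
w_starts = [0, 6, 13, 18]
w_ends = [4, 11, 16, 19]
```

With these offsets, the inclusive span 6 to 16 covers "amount due:" and selects the words at positions 1 and 2. A span that falls on the space at offset 5 selects no words.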

normalize_char_starts

normalize_char_starts()

Normalize char offsets in the pages and words DataFrames in place.

This is needed when a RichDoc is constructed from disjoint pages (e.g., via from_page_docs).

Return type

None
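The idea behind offset normalization can be sketched as follows: when each page carries offsets that start at 0, they must be shifted so that they index into the combined text. This is an illustration under stated assumptions rather than the library's implementation. `normalize_offsets` is a hypothetical helper, it returns new lists instead of mutating in place, and the separator length between pages is a parameter because the source does not say how pages are joined.

```python
from typing import List, Tuple


def normalize_offsets(
    page_lengths: List[int],
    page_word_starts: List[List[int]],
    sep_len: int = 1,
) -> Tuple[List[int], List[List[int]]]:
    """Shift page-local char offsets so they index into the concatenated text.

    page_lengths[i] is the text length of page i; page_word_starts[i] holds
    word offsets local to page i. sep_len is the assumed separator length
    between consecutive pages.
    """
    page_starts: List[int] = []
    shifted: List[List[int]] = []
    offset = 0
    for length, word_starts in zip(page_lengths, page_word_starts):
        page_starts.append(offset)
        shifted.append([offset + s for s in word_starts])
        offset += length + sep_len
    return page_starts, shifted


# Two hypothetical pages: "Hello world" (11 chars) and "Bye now" (7 chars).
page_starts, word_starts = normalize_offsets([11, 7], [[0, 6], [0, 4]])
```

After normalization, the second page starts at offset 12 and its words start at 12 and 16, which matches their positions in "Hello world\nBye now".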

serialize

serialize()

Serialize the RichDoc instance into an encoded pickle representation.

Return type

str

split_pages

split_pages()

Split a RichDoc document into a list of RichDocs, one page per doc.

Return type

RichDocList

to_hocr

to_hocr()

Convert a RichDoc document into an hOCR formatted string.

Return type

str

to_json

to_json(start_page=0)

Convert a RichDoc document into a JSON formatted string.

Return type

str

to_text

to_text()

Return a text-only representation of a RichDoc (without markup).

Return type

str

property page_char_starts: List[int]

Get an array of the char_start for all pages in the RichDoc.

NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.

property word_char_ends: ndarray

Get an array of the char_end for all words in the RichDoc.

property word_char_starts: ndarray

Get an array of the char_start for all words in the RichDoc.

property word_indexes: List[int]

Get an array of the word indexes for all words in the RichDoc.

Note that the index is 0-based and relative to the whole document. For example, when a RichDoc represents the 2nd page of a 2-page document, the word indexes for that RichDoc will be n, n+1, n+2, …, n+m, assuming the first page has n words.
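The numbering in the example above can be sketched as follows. `page_word_indexes` and the word counts are hypothetical and only illustrate how the indexes of a single-page RichDoc continue from the words on earlier pages.

```python
from typing import List


def page_word_indexes(words_per_page: List[int], page: int) -> List[int]:
    """Return document-relative, 0-based word indexes for one page.

    words_per_page[i] is the number of words on page i (0-based page number).
    """
    first = sum(words_per_page[:page])
    return list(range(first, first + words_per_page[page]))


# Hypothetical two-page document: 3 words on the first page, 4 on the second.
counts = [3, 4]
```

Here the second page's indexes begin at 3, the number of words on the first page, rather than restarting at 0.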