Version: 0.95

snorkelflow.rich_docs.RichDoc

class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)

Bases: Serializable

An object representing a document with rich formatting preserved.

For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.

Types of signals available via RichDoc:

textual:
- A text-only representation of the document (no markup present)
structural:
- All words are assigned a line, par(agraph), area, and page. Lines may include font size information
visual:
- Objects of all granularities contain bounding box coordinates
tabular:
- Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments)

Other Notes:

The coordinate origin [0, 0] is the top-left corner, so for every bounding box bottom > top and right > left.
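The coordinate convention above can be illustrated with a small, self-contained sketch. The `Box` type and the helper functions are hypothetical names used only for illustration; they are not part of the snorkelflow API, and the actual bounding box column names in the RichDoc DataFrames may differ.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box; the origin [0, 0] is the top-left corner."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left  # right > left

    @property
    def height(self) -> int:
        return self.bottom - self.top  # bottom > top, because y grows downward


def is_above(a: Box, b: Box) -> bool:
    """True if box a lies entirely above box b."""
    return a.bottom <= b.top


def left_aligned(a: Box, b: Box, tol: int = 2) -> bool:
    """True if the left edges match within tol units (a simple tabular cue)."""
    return abs(a.left - b.left) <= tol


header = Box(left=50, top=100, right=200, bottom=120)
cell = Box(left=51, top=130, right=180, bottom=150)

print(header.width, header.height)                         # 150 20
print(is_above(header, cell), left_aligned(header, cell))  # True True
```

Alignment checks of this kind are one way the visual signals can stand in for tabular structure, as noted in the list of signal types.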

Parameters:

pages (DataFrame) – A DataFrame of page objects
areas (DataFrame) – A DataFrame of area objects
pars (DataFrame) – A DataFrame of paragraph objects
lines (DataFrame) – A DataFrame of line objects
words (DataFrame) – A DataFrame of word objects, with bounding boxes, parent obj. assignments, etc.
text (Optional[str], default: None) – The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency.

__init__(pages, areas, pars, lines, words, text=None, text_by_row=None)

Methods

`__init__`(pages, areas, pars, lines, words[, ...])
`deserialize`(serialized)	Deserialize the RichDoc instance from the encoded pickle representation.
`extract_span_page`(char_start, char_end)	Create a RichDoc with only objects on the same page as the requested span.
`from_page_docs`(page_docs)	Construct a RichDoc from a list of per-page RichDocs.
`from_records`(*, pages, areas, pars, lines, words)	Construct RichDoc object from records lists.
`get_row_text`(page, row)	Return the text corresponding to a given row_id and page_id.
`get_span_location_fields`(char_start, char_end)	Get the location of a span of words in the rich doc in terms of word_ids.
`normalize_char_starts`()	Normalize char offsets for pages and words dfs inplace.
`serialize`()	Serialize the RichDoc instance into an encoded pickle representation.
`split_pages`()	Split a RichDoc document into a list of RichDocs, one page per doc.
`to_hocr`()	Convert a RichDoc document into an hOCR formatted string.
`to_json`([start_page])	Convert a RichDoc document into a JSON formatted string.
`to_text`()	Return a text-only representation of a RichDoc (without markup).

Attributes

`page_char_starts`	Get an array of the char_start for all pages in the RichDoc.
`word_char_ends`	Get an array of the char_end for all words in the RichDoc.
`word_char_starts`	Get an array of the char_start for all words in the RichDoc.
`word_indexes`	Get an array of the word indexes for all words in the RichDoc.

classmethod deserialize(serialized)

Deserialize the RichDoc instance from the encoded pickle representation.

Parameters: serialized (str) – A base64-encoded string representation of the RichDoc pickle.
Return type: RichDoc
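As a rough illustration of what a "base64-encoded pickle" round trip looks like, the sketch below uses only the Python standard library on a stand-in object. The `encode` and `decode` helpers and the `doc_like` dictionary are hypothetical; the actual RichDoc implementation may differ in its details.

```python
import base64
import pickle


def encode(obj) -> str:
    """Pickle an object and wrap the bytes in a base64 ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode(serialized: str):
    """Reverse of encode: base64-decode, then unpickle."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# Stand-in for a document object (not a real RichDoc).
doc_like = {"text": "Hello world", "words": [(0, 4), (6, 10)]}

payload = encode(doc_like)
restored = decode(payload)
print(restored == doc_like)  # True
```

Because unpickling can execute arbitrary code, strings of this kind should only be deserialized when they come from a trusted source.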

extract_span_page(char_start, char_end)

Create a RichDoc with only objects on the same page as the requested span.

Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.

Parameters:

char_start (int) – The inclusive start index of the span in RichDoc.text.
char_end (int) – The inclusive end index of the span in RichDoc.text.

Return type:

RichDoc
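The idea of keeping only the objects on a span's page can be sketched without the library. The example below assumes a sorted list of page start offsets (similar in spirit to `page_char_starts`) and a list of word records tagged with a page index; the helper names, the record layout, and the choice to raise on a cross-page span are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical data: the char_start of each page, and words tagged with a page index.
page_char_starts = [0, 120, 310]
words = [
    {"text": "Invoice", "page": 0, "char_start": 0, "char_end": 6},
    {"text": "Total", "page": 1, "char_start": 130, "char_end": 134},
    {"text": "Due", "page": 1, "char_start": 136, "char_end": 138},
    {"text": "Terms", "page": 2, "char_start": 310, "char_end": 314},
]


def page_of(char_offset, starts):
    """Index of the last page whose char_start is <= char_offset."""
    return bisect_right(starts, char_offset) - 1


def words_on_span_page(char_start, char_end, starts, word_records):
    """Keep only the words that share a page with the (inclusive) span."""
    page = page_of(char_start, starts)
    if page_of(char_end, starts) != page:
        # A choice made for this sketch only.
        raise ValueError("span crosses a page boundary")
    return [w for w in word_records if w["page"] == page]


kept = words_on_span_page(130, 138, page_char_starts, words)
print([w["text"] for w in kept])  # ['Total', 'Due']
```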

classmethod from_page_docs(page_docs)

Construct a RichDoc from a list of per-page RichDocs.

Parameters: page_docs (RichDocList) – A RichDocList data structure.
Return type: RichDoc

classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)

Construct RichDoc object from records lists.

Return type: Series

get_row_text(page, row)

Return the text corresponding to a given row_id and page_id.

Return type: str

get_span_location_fields(char_start, char_end, span_words=None)

Get the location of a span of words in the rich doc in terms of word_ids.

These fields are used by the frontend for rendering.

Parameters:

char_start (int) – The inclusive start index of the span in RichDoc.text.
char_end (int) – The inclusive end index of the span in RichDoc.text.
span_words (Optional[DataFrame], default: None) – The dataframe of words associated with the span. If not specified, it is computed using char start and char end.

Return type:

Dict[str, Any]
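Mapping a character span to the words it touches can be sketched with per-word start and end offsets, similar in spirit to `word_char_starts` and `word_char_ends`. The function below is a hypothetical illustration, not the library's implementation, and it does not reproduce the actual dictionary returned by this method. It treats both span ends and word ends as inclusive, matching the parameter descriptions above.

```python
text = "Hello world again"
word_char_starts = [0, 6, 12]
word_char_ends = [4, 10, 16]  # inclusive end offsets


def words_in_span(char_start, char_end, starts, ends):
    """Return positions of words that overlap the inclusive span."""
    return [
        i
        for i, (s, e) in enumerate(zip(starts, ends))
        if s <= char_end and e >= char_start
    ]


hits = words_in_span(6, 16, word_char_starts, word_char_ends)
print(hits)               # [1, 2]
print(text[6:16 + 1])     # world again (add 1 because char_end is inclusive)
```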

normalize_char_starts()

Normalize char offsets for pages and words dfs inplace.

This is needed when a RichDoc is constructed from disjoint pages (e.g. via from_page_docs).

Return type: None
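The need for normalization can be seen in a simplified sketch: when pages are built separately, each page's offsets start at 0, so they must be shifted to be valid in the combined text. The function below is hypothetical and returns new records rather than modifying them in place. It assumes pages are joined with a single separator character; the rule the library actually applies is not documented here.

```python
def rebase_offsets(pages, sep_len=1):
    """Shift per-page offsets so they index into the concatenated text.

    Assumption for this sketch: pages are joined with sep_len separator chars.
    """
    offset = 0
    rebased = []
    for page in pages:
        rebased.append(
            {
                "page_char_start": offset,
                "words": [(s + offset, e + offset) for s, e in page["words"]],
            }
        )
        offset += page["length"] + sep_len
    return rebased


# Two disjoint pages whose word offsets both start at 0.
pages = [
    {"length": 11, "words": [(0, 4), (6, 10)]},  # "Hello world"
    {"length": 5, "words": [(0, 4)]},            # "again"
]

result = rebase_offsets(pages)
print(result[1])  # {'page_char_start': 12, 'words': [(12, 16)]}
```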

serialize()

Serialize the RichDoc instance into an encoded pickle representation.

Return type: str

split_pages()

Split a RichDoc document into a list of RichDocs, one page per doc.

Return type: RichDocList

to_hocr()

Convert a RichDoc document into an hOCR formatted string.

Return type: str

to_json(start_page=0)

Convert a RichDoc document into a JSON formatted string.

Return type: str

to_text()

Return a text-only representation of a RichDoc (without markup).

Return type: str

property page_char_starts: List[int]

Get an array of the char_start for all pages in the RichDoc.

NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.

property word_char_ends: ndarray

Get an array of the char_end for all words in the RichDoc.

property word_char_starts: ndarray

Get an array of the char_start for all words in the RichDoc.

property word_indexes: List[int]

Get an array of the word indexes for all words in the RichDoc.

Note that the index is 0-based and relative to the whole document, not to the page. For example, when a RichDoc represents the 2nd page of a 2-page document, the word indexes for that RichDoc will be n, n+1, n+2, …, n+m, assuming the first page has n words.
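The numbering in the example above can be reproduced with a few lines of plain Python. The `page_word_indexes` helper and the word counts are hypothetical and serve only to show how indexes continue across pages.

```python
def page_word_indexes(words_per_page, page):
    """Indexes of the words on one page, counted across all preceding pages."""
    first = sum(words_per_page[:page])  # number of words on earlier pages
    return list(range(first, first + words_per_page[page]))


# A 2-page document: n = 3 words on the first page, 4 words on the second.
words_per_page = [3, 4]

print(page_word_indexes(words_per_page, 0))  # [0, 1, 2]
print(page_word_indexes(words_per_page, 1))  # [3, 4, 5, 6]
```

Here the second page's indexes run from n = 3 to n + m = 6, with m = 3.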