Version: 25.2

snorkelflow.rich_docs.RichDoc

class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)

Bases: Serializable

An object representing a document with rich formatting preserved.

For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.

Types of signals available via RichDoc:

textual:
- A text-only representation of the document (no markup present)
structural:
- All words are assigned a line, par(agraph), area, and page. Lines may include font size information
visual:
- Objects of all granularities contain bounding box coordinates
tabular:
- Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments)

Other Notes:

The origin [0, 0] is the top-left corner, so for every bounding box bottom > top and right > left.
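The coordinate convention above can be illustrated with a small standalone sketch. `BBox`, `is_above`, and `left_aligned` are hypothetical helpers written for this example and are not part of the RichDoc API; the box values and the tolerance are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box; not a snorkelflow class."""
    left: float
    top: float
    right: float
    bottom: float

    def __post_init__(self):
        # With [0, 0] in the top-left corner, x grows rightward and y grows downward.
        if not (self.right > self.left and self.bottom > self.top):
            raise ValueError("expected right > left and bottom > top")

    @property
    def height(self) -> float:
        return self.bottom - self.top


def is_above(a: BBox, b: BBox) -> bool:
    """True if box a lies entirely above box b (smaller y means higher up)."""
    return a.bottom <= b.top


def left_aligned(a: BBox, b: BBox, tol: float = 2.0) -> bool:
    """True if the left edges match within tol, a simple vertical-alignment signal."""
    return abs(a.left - b.left) <= tol


header = BBox(left=50, top=40, right=300, bottom=60)
body = BBox(left=50, top=100, right=500, bottom=120)
```

Here `is_above(header, body)` holds because the header's bottom edge (60) is smaller than the body's top edge (100), and `left_aligned(header, body)` holds because both boxes start at x = 50.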

Parameters

Name	Type	Default	Info
pages	`DataFrame`		A DataFrame of page objects.
areas	`DataFrame`		A DataFrame of area objects.
pars	`DataFrame`		A DataFrame of paragraph objects.
lines	`DataFrame`		A DataFrame of line objects.
words	`DataFrame`		A DataFrame of word objects, with bounding boxes, parent obj. assignments, etc.
text	`Optional[str]`	`None`	The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency.

__init__(pages, areas, pars, lines, words, text=None, text_by_row=None)

Methods

`__init__`(pages, areas, pars, lines, words[, ...])
`deserialize`(serialized)	Deserialize the RichDoc instance from the encoded pickle representation.
`extract_span_page`(char_start, char_end)	Create a RichDoc with only objects on the same page as the requested span.
`from_page_docs`(page_docs)	Construct a RichDoc from per-page RichDocs.
`from_records`(*, pages, areas, pars, lines, words)	Construct RichDoc object from records lists.
`get_row_text`(page, row)	Return the text corresponding to a given row_id and page_id.
`get_span_location_fields`(char_start, char_end)	Get the location of a span of words in the rich doc in terms of word_ids.
`normalize_char_starts`()	Normalize char offsets for pages and words dfs inplace.
`serialize`()	Serialize the RichDoc instance into an encoded pickle representation.
`split_pages`()	Split a RichDoc document into a list of RichDocs, one page per doc.
`to_hocr`()	Convert a RichDoc document into an hOCR formatted string.
`to_json`([start_page])	Convert a RichDoc document into a JSON formatted string.
`to_text`()	Return a text-only representation of a RichDoc (without markup).

Attributes

`page_char_starts`	Get an array of the char_start for all pages in the RichDoc.
`word_char_ends`	Get an array of the char_end for all words in the RichDoc.
`word_char_starts`	Get an array of the char_start for all words in the RichDoc.
`word_indexes`	Get an array of the word indexes for all words in the RichDoc.

deserialize

classmethod deserialize(serialized)

Deserialize the RichDoc instance from the encoded pickle representation.

Return type: RichDoc

Parameters

Name	Type	Default	Info
serialized	`str`		A base64-encoded string representation of the RichDoc pickle.
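As a rough illustration of what a base64-encoded pickle looks like, the sketch below round-trips a plain dictionary using only the standard library. It is not the RichDoc implementation: `encode`, `decode`, and `payload` are hypothetical names, and the exact encoding details used by `serialize` and `deserialize` may differ.

```python
import base64
import pickle


def encode(obj) -> str:
    """Pickle an object and return the bytes as a base64 ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode(serialized: str):
    """Reverse encode(): base64-decode the string, then unpickle the bytes."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# Hypothetical stand-in for a document object.
payload = {"text": "Hello world", "n_pages": 1}
encoded = encode(payload)
restored = decode(encoded)
```

As with any pickle-based format, only decode strings that come from a trusted source, since unpickling can execute arbitrary code.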

extract_span_page

extract_span_page(char_start, char_end)

Create a RichDoc with only objects on the same page as the requested span.

Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.

Return type: RichDoc

Parameters

Name	Type	Default	Info
char_start	`int`		The inclusive start index of the span in RichDoc.text.
char_end	`int`		The inclusive end index of the span in RichDoc.text.
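To make the idea of "the page a span is on" concrete, the sketch below looks up a page from a character offset with a binary search over page start offsets. It assumes that the page start offsets are sorted in ascending order and that each one marks the first character of its page in the document text; `page_index_for_char` and the sample offsets are hypothetical and not part of the RichDoc API.

```python
from bisect import bisect_right


def page_index_for_char(page_char_starts, char_offset):
    """Return the position of the page whose range contains char_offset.

    Assumes page_char_starts is sorted ascending (an assumption of this sketch).
    """
    idx = bisect_right(page_char_starts, char_offset) - 1
    if idx < 0:
        raise ValueError("char_offset precedes the first page")
    return idx


# Hypothetical start offsets for a three-page document.
page_char_starts = [0, 120, 310]
span = (130, 145)  # (char_start, char_end)
page = page_index_for_char(page_char_starts, span[0])
```

With these made-up offsets, the span starting at character 130 falls on the second page (position 1 in the list).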

from_page_docs

classmethod from_page_docs(page_docs)

Construct a RichDoc from per-page RichDocs.

Return type: RichDoc

Parameters

Name	Type	Default	Info
page_docs	`RichDocList`		A RichDocList data structure holding the per-page RichDocs.

from_records

classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)

Construct RichDoc object from records lists.

Return type: RichDoc

get_row_text

get_row_text(page, row)

Return the text corresponding to a given row_id and page_id.

Return type: str

get_span_location_fields

get_span_location_fields(char_start, char_end, span_words=None)

Get the location of a span of words in the rich doc in terms of word_ids.

These fields are used by the frontend for rendering.

Return type: Dict[str, Any]

Parameters

Name	Type	Default	Info
char_start	`int`		The inclusive start index of the span in RichDoc.text.
char_end	`int`		The inclusive end index of the span in RichDoc.text.
span_words	`Optional[DataFrame]`	`None`	The dataframe of words associated with the span. If not specified, it is computed using char start and char end.
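The relationship between a character span and the words it covers can be sketched with plain lists. This is an illustration only: `words_in_span` is a hypothetical helper, the offsets are derived from a made-up string, and the sketch assumes that word end offsets are inclusive, like the span indices above.

```python
def words_in_span(word_char_starts, word_char_ends, char_start, char_end):
    """Return positions of words that overlap the inclusive span [char_start, char_end]."""
    return [
        i
        for i, (ws, we) in enumerate(zip(word_char_starts, word_char_ends))
        if ws <= char_end and we >= char_start
    ]


# Build inclusive start/end offsets for each space-separated word.
text = "Total due: 42 USD"
starts, ends = [], []
pos = 0
for word in text.split(" "):
    starts.append(pos)
    ends.append(pos + len(word) - 1)  # inclusive end offset (assumption)
    pos += len(word) + 1              # skip the separating space

# The span covering "42 USD" is characters 11 through 16, inclusive.
hits = words_in_span(starts, ends, 11, 16)
```

Because both ends are inclusive, slicing the text for the same span needs `text[11:16 + 1]`.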

normalize_char_starts

normalize_char_starts()

Normalize char offsets for pages and words dfs inplace.

This is needed when a RichDoc is constructed from disjoint pages (e.g., via from_page_docs).

Return type: None
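The kind of adjustment involved can be sketched as follows: when pages are built separately, each page's offsets start at 0, so they must be shifted to form one consistent document-level offset space. This is a minimal sketch under stated assumptions, not the library's algorithm; `normalize_page_offsets`, `shift_word_starts`, and the `separator_len` parameter (the number of characters assumed between consecutive pages) are hypothetical.

```python
def normalize_page_offsets(page_lengths, separator_len=1):
    """Compute document-level start offsets for pages given their text lengths.

    separator_len is an assumption of this sketch, not a documented value.
    """
    starts = []
    offset = 0
    for length in page_lengths:
        starts.append(offset)
        offset += length + separator_len
    return starts


def shift_word_starts(local_word_starts, page_start):
    """Convert page-local word offsets into document-level offsets."""
    return [page_start + s for s in local_word_starts]


# Three hypothetical pages with 100, 50, and 75 characters of text.
page_starts = normalize_page_offsets([100, 50, 75])
# Words on the second page, with offsets local to that page.
second_page_words = shift_word_starts([0, 6, 11], page_starts[1])
```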

serialize

serialize()

Serialize the RichDoc instance into an encoded pickle representation.

Return type: str

split_pages

split_pages()

Split a RichDoc document into a list of RichDocs, one page per doc.

Return type: RichDocList

to_hocr

to_hocr()

Convert a RichDoc document into an hOCR formatted string.

Return type: str

to_json

to_json(start_page=0)

Convert a RichDoc document into a JSON formatted string.

Return type: str

to_text

to_text()

Return a text-only representation of a RichDoc (without markup).

Return type: str

property page_char_starts: List[int]

Get an array of the char_start for all pages in the RichDoc.

NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.

property word_char_ends: ndarray

Get an array of the char_end for all words in the RichDoc.

property word_char_starts: ndarray

Get an array of the char_start for all words in the RichDoc.

property word_indexes: List[int]

Get an array of the word indexes for all words in the RichDoc.

Note that the index is 0-based and relative to the whole document, not to the page. For example, when a RichDoc represents the 2nd page of a 2-page document and the first page has n words, the word indexes for that RichDoc will be n, n+1, n+2, …, n+m.
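The numbering in this example can be reproduced with a short sketch. `word_indexes_for_page` and the per-page word counts are hypothetical; the helper only shows how indexes on a later page continue from the words on earlier pages instead of restarting at 0.

```python
def word_indexes_for_page(words_per_page, page_number):
    """Return 0-based word indexes for one page, counted from the start of the document."""
    first = sum(words_per_page[:page_number])
    return list(range(first, first + words_per_page[page_number]))


# Hypothetical 2-page document: 5 words on the first page, 3 on the second.
words_per_page = [5, 3]
first_page = word_indexes_for_page(words_per_page, 0)
second_page = word_indexes_for_page(words_per_page, 1)
```

In this made-up document, the second page's indexes start at 5 because the first page holds words 0 through 4.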
