Version: 0.91

snorkelflow.rich_docs.RichDoc

class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)

Bases: Serializable

An object representing a document with rich formatting preserved.

For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.

Types of signals available via RichDoc:

textual:
- A text-only representation of the document (no markup present)
structural:
- All words are assigned a line, par(agraph), area, and page. Lines may include font size information
visual:
- Objects of all granularities contain bounding box coordinates
tabular:
- Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments)

Other Notes:

[0, 0] is in the top-left corner (so bottom > top and right > left)
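The coordinate convention above can be illustrated with a small sketch. The `Box` class and its field names are hypothetical and are not RichDoc columns; the sketch only shows how a top-left origin affects comparisons such as "above" and "left-aligned", the kind of visual cue the tabular signal relies on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; field names are assumptions, not RichDoc columns."""
    left: float
    top: float
    right: float
    bottom: float

    def is_valid(self) -> bool:
        # With the origin at the top-left, x grows rightward and y grows downward.
        return self.right > self.left and self.bottom > self.top

    def is_above(self, other: "Box") -> bool:
        # "Above" on the page means a smaller y value.
        return self.bottom <= other.top

    def left_aligned_with(self, other: "Box", tol: float = 2.0) -> bool:
        # Shared left edges are one visual cue for table columns.
        return abs(self.left - other.left) <= tol


header = Box(left=50, top=100, right=120, bottom=115)
cell = Box(left=51, top=130, right=110, bottom=145)

print(header.is_valid(), header.is_above(cell), header.left_aligned_with(cell))
```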

Parameters

Name	Type	Default	Info
pages	`DataFrame`		A DataFrame of page objects.
areas	`DataFrame`		A DataFrame of area objects.
pars	`DataFrame`		A DataFrame of paragraph objects.
lines	`DataFrame`		A DataFrame of line objects.
words	`DataFrame`		A DataFrame of word objects, with bounding boxes, parent object assignments, etc.
text	`Optional[str]`	`None`	The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency.

__init__(pages, areas, pars, lines, words, text=None, text_by_row=None)

Methods

`__init__`(pages, areas, pars, lines, words[, ...])
`deserialize`(serialized)	Deserialize the RichDoc instance from the encoded pickle representation.
`extract_span_page`(char_start, char_end)	Create a RichDoc with only objects on the same page as the requested span.
`from_page_docs`(page_docs)	Construct a RichDoc from page-level RichDocs.
`from_records`(*, pages, areas, pars, lines, words)	Construct RichDoc object from records lists.
`get_row_text`(page, row)	Return the text corresponding to a given row_id and page_id.
`get_span_location_fields`(char_start, char_end)	Get the location of a span of words in the rich doc in terms of word_ids.
`normalize_char_starts`()	Normalize char offsets for the pages and words DataFrames in place.
`serialize`()	Serialize the RichDoc instance into an encoded pickle representation.
`split_pages`()	Split a RichDoc document into a list of RichDocs, one page per doc.
`to_hocr`()	Convert a RichDoc document into an hOCR formatted string.
`to_json`([start_page])	Convert a RichDoc document into a JSON formatted string.
`to_text`()	Return a text-only representation of a RichDoc (without markup).

Attributes

`page_char_starts`	Get an array of the char_start for all pages in the RichDoc.
`word_char_ends`	Get an array of the char_end for all words in the RichDoc.
`word_char_starts`	Get an array of the char_start for all words in the RichDoc.

deserialize

classmethod deserialize(serialized)

Deserialize the RichDoc instance from the encoded pickle representation.

Return type: RichDoc

Parameters

Name	Type	Default	Info
serialized	`str`		A base64-encoded string representation of the RichDoc pickle.
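As a rough illustration of what a "base64-encoded pickle" string looks like, the following sketch round-trips a stand-in object using only the standard library. It is not the RichDoc implementation: the helper names `encode` and `decode` are hypothetical, and the pickle protocol and any extra framing used by `serialize` are not documented here.

```python
import base64
import pickle


def encode(obj) -> str:
    # Pickle to bytes, then base64-encode so the result is a plain ASCII string.
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode(serialized: str):
    # Reverse the two steps. Only unpickle data that comes from a trusted source.
    return pickle.loads(base64.b64decode(serialized))


payload = {"text": "Invoice 42", "pages": [0]}  # stand-in for a document object
encoded = encode(payload)
restored = decode(encoded)

print(type(encoded).__name__, restored == payload)
```

Because the payload is a pickle, a serialized string should only be deserialized when its origin is trusted.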

extract_span_page

extract_span_page(char_start, char_end)

Create a RichDoc with only objects on the same page as the requested span.

Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.

Return type: RichDoc

Parameters

Name	Type	Default	Info
char_start	`int`		The inclusive start index of the span in RichDoc.text.
char_end	`int`		The inclusive end index of the span in RichDoc.text.
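To make the relationship between character offsets and pages concrete, here is a minimal sketch of finding the page that contains a span, given a sorted list of page start offsets like the one exposed by `page_char_starts`. The helper names and the sample offsets are hypothetical, and raising on a span that crosses pages is a choice made for this sketch, not documented library behavior.

```python
from bisect import bisect_right
from typing import List


def page_of(char_offset: int, page_char_starts: List[int]) -> int:
    """Return the index of the page whose text contains char_offset."""
    # The list is sorted, so the containing page is the last start <= offset.
    return bisect_right(page_char_starts, char_offset) - 1


def span_page(char_start: int, char_end: int, page_char_starts: List[int]) -> int:
    start_page = page_of(char_start, page_char_starts)
    end_page = page_of(char_end, page_char_starts)  # char_end is inclusive
    if start_page != end_page:
        # Assumption of this sketch only: treat a cross-page span as an error.
        raise ValueError("span crosses a page boundary")
    return start_page


starts = [0, 1200, 2650]  # hypothetical char_start values for three pages
print(span_page(1300, 1320, starts))
```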

from_page_docs

classmethod from_page_docs(page_docs)

Construct a RichDoc from page-level RichDocs.

Return type: RichDoc

Parameters

Name	Type	Default	Info
page_docs	`RichDocList`		A RichDocList data structure containing the page-level RichDocs.

from_records

classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)

Construct RichDoc object from records lists.

Return type: Series

get_row_text

get_row_text(page, row)

Return the text corresponding to a given row_id and page_id.

Return type: str

get_span_location_fields

get_span_location_fields(char_start, char_end, span_words=None)

Get the location of a span of words in the rich doc in terms of word_ids.

These fields are used by the frontend for rendering.

Return type: Dict[str, Any]

Parameters

Name	Type	Default	Info
char_start	`int`		The inclusive start index of the span in RichDoc.text.
char_end	`int`		The inclusive end index of the span in RichDoc.text.
span_words	`Optional[DataFrame]`	`None`	The dataframe of words associated with the span. If not specified, it is computed using char start and char end.

normalize_char_starts

normalize_char_starts()

Normalize char offsets for the pages and words DataFrames in place.

This is needed when a RichDoc is constructed from disjoint pages (e.g., via from_page_docs).

Return type: None
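The idea behind offset normalization can be sketched conceptually: when pages are assembled separately, each page's offsets start from zero and must be shifted to document-level positions. The functions below are hypothetical and operate on plain lists rather than DataFrames; the `separator_len` parameter is an assumption, since how pages are joined in the derived text is not specified here.

```python
from typing import List


def rebase_page_starts(page_lengths: List[int], separator_len: int = 0) -> List[int]:
    """Compute a document-level char_start for each page from per-page text lengths."""
    starts, offset = [], 0
    for length in page_lengths:
        starts.append(offset)
        offset += length + separator_len
    return starts


def rebase_word_starts(
    word_starts_per_page: List[List[int]], page_starts: List[int]
) -> List[int]:
    # Shift every page-local word offset by its page's document-level start.
    return [
        page_start + local
        for page_start, words in zip(page_starts, word_starts_per_page)
        for local in words
    ]


page_starts = rebase_page_starts([10, 8, 5])
word_starts = rebase_word_starts([[0, 4], [0, 3], [0]], page_starts)
print(page_starts, word_starts)
```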

serialize

serialize()

Serialize the RichDoc instance into an encoded pickle representation.

Return type: str

split_pages

split_pages()

Split a RichDoc document into a list of RichDocs, one page per doc.

Return type: RichDocList

to_hocr

to_hocr()

Convert a RichDoc document into an hOCR formatted string.

Return type: str

to_json

to_json(start_page=0)

Convert a RichDoc document into a JSON formatted string.

Return type: str

to_text

to_text()

Return a text-only representation of a RichDoc (without markup).

Return type: str

property page_char_starts: List[int]

Get an array of the char_start for all pages in the RichDoc.

NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.

property word_char_ends: ndarray

Get an array of the char_end for all words in the RichDoc.

property word_char_starts: ndarray

Get an array of the char_start for all words in the RichDoc.
