snorkelflow.rich_docs.RichDoc
- class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)
Bases: Serializable
An object representing a document with rich formatting preserved.
For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.
Types of signals available via RichDoc:
- textual:
A text-only representation of the document (no markup present)
- structural:
All words are assigned a line, par(agraph), area, and page. Lines may include font size information
- visual:
Objects of all granularities contain bounding box coordinates
- tabular:
Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments)
Other Notes:
[0, 0] is in the top-left corner (so bottom > top and right > left)
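The coordinate convention above can be illustrated with a minimal sketch. The `BBox` class and its field names are hypothetical and are not part of snorkelflow; the sketch only shows what a top-left origin implies for width, height, and vertical ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box; field names are illustrative only."""

    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        # x grows to the right, so right is the larger coordinate
        return self.right - self.left

    @property
    def height(self) -> float:
        # y grows downward, so bottom is the larger coordinate
        return self.bottom - self.top


def is_above(a: BBox, b: BBox) -> bool:
    """True if box a ends before box b begins, reading top to bottom."""
    return a.bottom <= b.top


header = BBox(left=50, top=40, right=300, bottom=60)
body = BBox(left=50, top=80, right=500, bottom=400)
print(header.width, header.height, is_above(header, body))
```

Because the origin is at the top-left, "above" means a smaller y value, which is the opposite of the usual mathematical convention.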
Parameters
| Name | Type | Default | Info |
| --- | --- | --- | --- |
| pages | DataFrame | | A DataFrame of page objects. |
| areas | DataFrame | | A DataFrame of area objects. |
| pars | DataFrame | | A DataFrame of paragraph objects. |
| lines | DataFrame | | A DataFrame of line objects. |
| words | DataFrame | | A DataFrame of word objects, with bounding boxes, parent obj. assignments, etc. |
| text | Optional[str] | None | The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency. |
- __init__(pages, areas, pars, lines, words, text=None, text_by_row=None)
Methods
| Method | Description |
| --- | --- |
| __init__(pages, areas, pars, lines, words[, ...]) | |
| deserialize(serialized) | Deserialize the RichDoc instance from the encoded pickle representation. |
| extract_span_page(char_start, char_end) | Create a RichDoc with only objects on the same page as the requested span. |
| from_page_docs(page_docs) | Construct a RichDoc from per-page RichDocs. |
| from_records(*, pages, areas, pars, lines, words) | Construct RichDoc object from records lists. |
| get_row_text(page, row) | Return the text corresponding to a given row_id and page_id. |
| get_span_location_fields(char_start, char_end) | Get the location of a span of words in the rich doc in terms of word_ids. |
| normalize_char_starts() | Normalize char offsets for the pages and words DataFrames in place. |
| serialize() | Serialize the RichDoc instance into an encoded pickle representation. |
| split_pages() | Split a RichDoc document into a list of RichDocs, one page per doc. |
| to_hocr() | Convert a RichDoc document into an hOCR formatted string. |
| to_json([start_page]) | Convert a RichDoc document into a JSON formatted string. |
| to_text() | Return a text-only representation of a RichDoc (without markup). |
Attributes
| Attribute | Description |
| --- | --- |
| page_char_starts | Get an array of the char_start for all pages in the RichDoc. |
| word_char_ends | Get an array of the char_end for all words in the RichDoc. |
| word_char_starts | Get an array of the char_start for all words in the RichDoc. |
| word_indexes | Get an array of the word indexes for all words in the RichDoc. |
- classmethod deserialize(serialized)
Deserialize the RichDoc instance from the encoded pickle representation.
- extract_span_page(char_start, char_end)
Create a RichDoc with only objects on the same page as the requested span.
Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.
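As a rough illustration of how a character offset can be mapped to its page, the sketch below runs a binary search over a sorted list of page start offsets, such as the values exposed by `page_char_starts`. The helper `page_for_offset` is hypothetical and is not the library's implementation.

```python
from bisect import bisect_right
from typing import List


def page_for_offset(page_char_starts: List[int], char_offset: int) -> int:
    """Return the position of the page whose text contains char_offset.

    Assumes page_char_starts is sorted ascending with one entry per page.
    """
    if not page_char_starts or char_offset < page_char_starts[0]:
        raise ValueError("offset precedes the first page")
    # bisect_right counts the pages that start at or before the offset
    return bisect_right(page_char_starts, char_offset) - 1


# Three pages starting at these character offsets (made-up values)
starts = [0, 1200, 2650]
span_page = page_for_offset(starts, 1305)
print(span_page)
```

Once the page is known, every object on other pages can be dropped, which is where the space saving comes from.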
- classmethod from_page_docs(page_docs)
Construct a RichDoc from per-page RichDocs.
- classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)
Construct RichDoc object from records lists.
- get_row_text(page, row)
Return the text corresponding to a given row_id and page_id.
Return type
str
- get_span_location_fields(char_start, char_end, span_words=None)
Get the location of a span of words in the rich doc in terms of word_ids.
These fields are used by the frontend for rendering.
Parameters
| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The inclusive start index of the span in RichDoc.text. |
| char_end | int | | The inclusive end index of the span in RichDoc.text. |
| span_words | Optional[DataFrame] | None | The dataframe of words associated with the span. If not specified, it is computed using char start and char end. |
Return type
Dict[str, Any]
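To make the inclusive span semantics concrete, here is a minimal sketch of selecting the words that overlap a character span. The helper `words_in_span` is hypothetical, uses plain lists in place of the `word_char_starts` and `word_char_ends` arrays, and assumes word end offsets are inclusive like the span's `char_end`; the library may compute this differently.

```python
from typing import List, Sequence


def words_in_span(
    word_starts: Sequence[int],
    word_ends: Sequence[int],
    char_start: int,
    char_end: int,
) -> List[int]:
    """Return positions of words overlapping [char_start, char_end].

    All offsets are treated as inclusive (an assumption of this sketch).
    """
    return [
        i
        for i, (start, end) in enumerate(zip(word_starts, word_ends))
        if start <= char_end and end >= char_start
    ]


# Offsets for the text "Total amount due"
word_starts = [0, 6, 13]
word_ends = [4, 11, 15]

# The span covering "amount due" is characters 6 through 15
hits = words_in_span(word_starts, word_ends, 6, 15)
print(hits)
```

A span that falls entirely on whitespace between words, such as character 5 in this text, matches no words.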
- normalize_char_starts()
Normalize char offsets for the pages and words DataFrames in place.
This is needed when a RichDoc is constructed from disjoint pages (e.g., via from_page_docs).
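The idea behind offset normalization can be sketched as follows: when each page carries offsets that start at 0, they are shifted by the cumulative length of the preceding pages so that they index into one combined text. The function `normalize_offsets`, the record layout, and the one-character page separator are all assumptions of this sketch. Unlike the method documented here, it returns new lists instead of modifying data in place.

```python
from typing import Dict, List, Tuple

Word = Dict[str, int]


def normalize_offsets(
    pages: List[List[Word]], separator_len: int = 1
) -> Tuple[List[int], List[List[Word]]]:
    """Shift page-local word offsets into document-wide offsets.

    Each page is a list of word records with inclusive, page-local
    char_start and char_end values (hypothetical layout).
    """
    page_starts: List[int] = []
    normalized: List[List[Word]] = []
    offset = 0
    for words in pages:
        page_starts.append(offset)
        normalized.append(
            [
                {
                    "char_start": w["char_start"] + offset,
                    "char_end": w["char_end"] + offset,
                }
                for w in words
            ]
        )
        # Page length is the last inclusive end offset plus one
        page_len = max((w["char_end"] for w in words), default=-1) + 1
        offset += page_len + separator_len
    return page_starts, normalized


pages = [
    [{"char_start": 0, "char_end": 4}, {"char_start": 6, "char_end": 9}],
    [{"char_start": 0, "char_end": 2}, {"char_start": 4, "char_end": 8}],
]
page_starts, normalized = normalize_offsets(pages)
print(page_starts)
print(normalized[1])
```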
Return type
None
- serialize()
Serialize the RichDoc instance into an encoded pickle representation.
Return type
str
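An "encoded pickle representation" that fits in a `str` can be produced in several ways. The sketch below shows one common approach, pickling followed by base64 encoding, and the matching round trip. The base64 choice and the `encode`/`decode` helper names are assumptions; the encoding RichDoc actually uses is not stated here.

```python
import base64
import pickle
from typing import Any


def encode(obj: Any) -> str:
    """Pickle an object and wrap the bytes in an ASCII-safe string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode(serialized: str) -> Any:
    """Reverse encode(): base64-decode, then unpickle."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# A stand-in dictionary, not a real RichDoc
doc_like = {"pages": [{"page_id": 0}], "text": "hello"}
payload = encode(doc_like)
restored = decode(payload)
print(type(payload).__name__, restored == doc_like)
```

Unpickling can execute arbitrary code, so pickle-based payloads should only be deserialized when they come from a trusted source.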
- split_pages()
Split a RichDoc document into a list of RichDocs, one page per doc.
Return type
List[RichDoc]
- to_hocr()
Convert a RichDoc document into an hOCR formatted string.
Return type
str
- to_json(start_page=0)
Convert a RichDoc document into a JSON formatted string.
Return type
str
- to_text()
Return a text-only representation of a RichDoc (without markup).
Return type
str
- property page_char_starts: List[int]
Get an array of the char_start for all pages in the RichDoc.
NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.
- property word_char_ends: ndarray
Get an array of the char_end for all words in the RichDoc.
- property word_char_starts: ndarray
Get an array of the char_start for all words in the RichDoc.
- property word_indexes: List[int]
Get an array of the word indexes for all words in the RichDoc.
Note that the index is 0-based and relative to the whole document, not to the individual page. For example, when a RichDoc represents the 2nd page of a 2-page document whose first page has n words, its word indexes will be n, n+1, n+2, …, n+m.
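When page-local positions are needed, the numbering in the example above can be rebased by subtracting the first index. The helper `to_page_local` is hypothetical and assumes the indexes are in ascending order, so the first entry belongs to the page's first word.

```python
from typing import List


def to_page_local(word_indexes: List[int]) -> List[int]:
    """Rebase word indexes so the first word on the page is 0."""
    if not word_indexes:
        return []
    first = word_indexes[0]
    return [i - first for i in word_indexes]


# Second page of a document whose first page has n = 120 words
n = 120
second_page_indexes = list(range(n, n + 4))
local = to_page_local(second_page_indexes)
print(second_page_indexes, local)
```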