snorkelflow.rich_docs.RichDoc
- class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)
Bases: Serializable
An object representing a document with rich formatting preserved.
For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.
Types of signals available via RichDoc:
- textual: A text-only representation of the document (no markup present).
- structural: All words are assigned a line, par(agraph), area, and page. Lines may include font size information.
- visual: Objects of all granularities contain bounding box coordinates.
- tabular: Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments).
Other Notes:
[0, 0] is in the top-left corner (so bottom > top and right > left)
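The coordinate convention above can be illustrated with a small, library-independent sketch. The `Box` class, its field names, and the helpers `left_aligned` and `is_above` are hypothetical and are not part of `snorkelflow`; they only show what a top-left origin implies when comparing bounding boxes, for example when looking for alignment cues of the kind the tabular signal relies on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; the origin [0, 0] is the top-left corner."""

    left: int
    top: int
    right: int
    bottom: int

    def __post_init__(self):
        # With a top-left origin, x grows rightward and y grows downward.
        if not (self.right > self.left and self.bottom > self.top):
            raise ValueError("expected right > left and bottom > top")

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def left_aligned(a, b, tol=2):
    """True if two boxes share a left edge within `tol` units."""
    return abs(a.left - b.left) <= tol


def is_above(a, b):
    """True if box `a` ends at or before the top of box `b`."""
    return a.bottom <= b.top


# Two made-up boxes: a header and a cell stacked beneath it.
header = Box(left=50, top=10, right=120, bottom=30)
cell = Box(left=51, top=40, right=110, bottom=60)
```

Here `left_aligned(header, cell)` and `is_above(header, cell)` both hold, which is the sort of vertical alignment that suggests a column.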
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| pages | DataFrame | | A DataFrame of page objects. |
| areas | DataFrame | | A DataFrame of area objects. |
| pars | DataFrame | | A DataFrame of paragraph objects. |
| lines | DataFrame | | A DataFrame of line objects. |
| words | DataFrame | | A DataFrame of word objects, with bounding boxes, parent obj. assignments, etc. |
| text | Optional[str] | None | The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency. |

- __init__(pages, areas, pars, lines, words, text=None, text_by_row=None)
Methods

| Method | Description |
| --- | --- |
| __init__(pages, areas, pars, lines, words[, ...]) | |
| deserialize(serialized) | Deserialize the RichDoc instance from the encoded pickle representation. |
| extract_span_page(char_start, char_end) | Create a RichDoc with only objects on the same page as the requested span. |
| from_page_docs(page_docs) | Construct a RichDoc from page-level RichDocs. |
| from_records(*, pages, areas, pars, lines, words) | Construct a RichDoc object from lists of records. |
| get_row_text(page, row) | Return the text corresponding to a given row_id and page_id. |
| get_span_location_fields(char_start, char_end) | Get the location of a span of words in the RichDoc in terms of word_ids. |
| normalize_char_starts() | Normalize char offsets for the pages and words DataFrames in place. |
| serialize() | Serialize the RichDoc instance into an encoded pickle representation. |
| split_pages() | Split a RichDoc document into a list of RichDocs, one page per doc. |
| to_hocr() | Convert a RichDoc document into an hOCR formatted string. |
| to_json([start_page]) | Convert a RichDoc document into a JSON formatted string. |
| to_text() | Return a text-only representation of a RichDoc (without markup). |
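Attributes

| Attribute | Description |
| --- | --- |
| page_char_starts | Get an array of the char_start for all pages in the RichDoc. |
| word_char_ends | Get an array of the char_end for all words in the RichDoc. |
| word_char_starts | Get an array of the char_start for all words in the RichDoc. |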
- classmethod deserialize(serialized)
Deserialize the RichDoc instance from the encoded pickle representation.
- extract_span_page(char_start, char_end)
Create a RichDoc with only objects on the same page as the requested span.
Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.
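To make the idea of "the page a span is on" concrete, here is a minimal sketch of locating a span's page from a sorted list of page start offsets, such as the values exposed by `page_char_starts`. The helper `page_index_for_span` is hypothetical and is not how `extract_span_page` is necessarily implemented; it also assumes that a span lies entirely on one page.

```python
from bisect import bisect_right


def page_index_for_span(page_char_starts, char_start, char_end):
    """Return the position (within page_char_starts) of the page holding the span.

    Raises ValueError if the span starts before the first known page or
    crosses a page boundary.
    """
    # bisect_right counts the pages whose char_start is <= the offset,
    # so subtracting 1 gives the position of the containing page.
    first = bisect_right(page_char_starts, char_start) - 1
    last = bisect_right(page_char_starts, char_end) - 1
    if first < 0 or first != last:
        raise ValueError("span is not contained in a single known page")
    return first


# Made-up offsets for a three-page document.
page_char_starts = [0, 120, 300]
page_index = page_index_for_span(page_char_starts, 130, 140)
```

Note that the result is a position in the list that was passed in. If the document has been trimmed to a subset of pages, that position is not necessarily the original page number.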
- classmethod from_page_docs(page_docs)
Construct a RichDoc from page-level RichDocs.
- classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)
Construct a RichDoc object from lists of records.
Return type: Series
- get_row_text(page, row)
Return the text corresponding to a given row_id and page_id.
Return type: str
- get_span_location_fields(char_start, char_end, span_words=None)
Get the location of a span of words in the rich doc in terms of word_ids.
These fields are used by the frontend for rendering.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The inclusive start index of the span in RichDoc.text. |
| char_end | int | | The inclusive end index of the span in RichDoc.text. |
| span_words | Optional[DataFrame] | None | The DataFrame of words associated with the span. If not specified, it is computed using char_start and char_end. |

Return type: Dict[str, Any]
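Because both span indices are inclusive, it may help to see how a span can be matched to words using per-word offsets like those in `word_char_starts` and `word_char_ends`. The sketch below is an illustration only: `words_in_span` is a hypothetical helper, the example text and offsets are made up, and it assumes that word end offsets are inclusive as well, which the source does not state.

```python
def words_in_span(word_char_starts, word_char_ends, char_start, char_end):
    """Return positions of the words that overlap [char_start, char_end].

    All offsets are treated as inclusive (an assumption for word ends).
    """
    return [
        i
        for i, (w_start, w_end) in enumerate(zip(word_char_starts, word_char_ends))
        # Two inclusive intervals overlap when each starts before the other ends.
        if w_start <= char_end and w_end >= char_start
    ]


text = "Total due: 42 USD"
# Inclusive offsets of each whitespace-separated word in `text`.
word_char_starts = [0, 6, 11, 14]
word_char_ends = [4, 9, 12, 16]

# With an inclusive end index, the span text is text[char_start : char_end + 1].
span_text = text[11 : 16 + 1]
span_word_positions = words_in_span(word_char_starts, word_char_ends, 11, 16)
```

The `+ 1` in the slice is the practical consequence of an inclusive `char_end`: Python slices exclude their upper bound.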
- normalize_char_starts()
Normalize char offsets for the pages and words DataFrames in place.
This is needed when a RichDoc is constructed from disjoint pages (e.g., via from_page_docs).
Return type: None
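The need for normalization can be sketched without the library. When pages are built independently, each page's offsets typically start at 0, so they must be shifted before they can index into one combined text. The function below is a hypothetical illustration: the input layout, the `separator_len` parameter, and the choice to return new lists (rather than modify DataFrames in place, as the method above does) are all assumptions.

```python
def normalize_page_offsets(pages, separator_len=1):
    """Shift page-local word offsets so they index into one concatenated text.

    `pages` is a list of dicts with "length" (characters on the page) and
    "words" (a list of (char_start, char_end) pairs local to that page).
    `separator_len` is the assumed number of characters joining two pages.
    """
    page_starts = []
    shifted_words = []
    offset = 0
    for page in pages:
        page_starts.append(offset)
        shifted_words.extend((s + offset, e + offset) for s, e in page["words"])
        # The next page begins after this page's text plus the separator.
        offset += page["length"] + separator_len
    return page_starts, shifted_words


# Two made-up pages whose offsets both start at 0.
pages = [
    {"length": 11, "words": [(0, 4), (6, 10)]},
    {"length": 5, "words": [(0, 4)]},
]
page_starts, words = normalize_page_offsets(pages)
```

After normalization the second page starts at offset 12, and its only word moves from (0, 4) to (12, 16).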
- serialize()
Serialize the RichDoc instance into an encoded pickle representation.
Return type: str
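As a rough picture of what an "encoded pickle representation" returned as a `str` can look like, the sketch below pickles an object and encodes the bytes as base64 text, then reverses the process. The use of base64 is an assumption; the source does not say which encoding RichDoc uses, and these standalone `serialize` and `deserialize` functions are not the library's methods.

```python
import base64
import pickle


def serialize(obj):
    """Pickle `obj` and encode the resulting bytes as an ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def deserialize(serialized):
    """Invert serialize(): decode the string, then unpickle the bytes."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# A plain dict stands in for a document object in this illustration.
doc_like = {"text": "Total due: 42 USD", "page_char_starts": [0]}
encoded = serialize(doc_like)
restored = deserialize(encoded)
```

Unpickling can execute arbitrary code, so pickle-based strings should only be deserialized when they come from a trusted source.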
- split_pages()
Split a RichDoc document into a list of RichDocs, one page per doc.
Return type: List[RichDoc]
- to_hocr()
Convert a RichDoc document into an hOCR formatted string.
Return type: str
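For readers unfamiliar with hOCR, the format is HTML in which each layout element carries a class (such as `ocr_line` or `ocrx_word`) and a `title` attribute holding its bounding box as `bbox left top right bottom`. The sketch below builds such markup by hand for one line of words. It is not the output of `to_hocr`, whose exact markup is not shown in the source; the helpers `hocr_word` and `hocr_line` are hypothetical.

```python
from html import escape


def hocr_word(text, left, top, right, bottom):
    """Render one word as an hOCR `ocrx_word` span with its bounding box."""
    return (
        f"<span class='ocrx_word' title='bbox {left} {top} {right} {bottom}'>"
        f"{escape(text)}</span>"
    )


def hocr_line(words, left, top, right, bottom):
    """Render a line as an hOCR `ocr_line` span wrapping its word spans."""
    inner = " ".join(hocr_word(*word) for word in words)
    return f"<span class='ocr_line' title='bbox {left} {top} {right} {bottom}'>{inner}</span>"


# Made-up words as (text, left, top, right, bottom) tuples.
line_words = [("R&D", 10, 20, 40, 32), ("budget", 46, 20, 98, 32)]
line_markup = hocr_line(line_words, 10, 20, 98, 32)
```

The word text is HTML-escaped because hOCR is markup, so a character such as `&` must not appear raw.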
- to_json(start_page=0)
Convert a RichDoc document into a JSON formatted string.
Return type: str
- to_text()
Return a text-only representation of a RichDoc (without markup).
Return type: str
- property page_char_starts: List[int]
Get an array of the char_start for all pages in the RichDoc.
NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.
- property word_char_ends: ndarray
Get an array of the char_end for all words in the RichDoc.
- property word_char_starts: ndarray
Get an array of the char_start for all words in the RichDoc.