snorkelflow.rich_docs.RichDoc
- class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)
Bases: Serializable
An object representing a document with rich formatting preserved.
For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.
Types of signals available via RichDoc:
- textual:
A text-only representation of the document (no markup present)
- structural:
All words are assigned a line, par(agraph), area, and page. Lines may include font size information
- visual:
Objects of all granularities contain bounding box coordinates
- tabular:
Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments)
Other Notes:
[0, 0] is in the top-left corner (so bottom > top and right > left)
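The coordinate convention above can be illustrated with a small sketch. The `Box` class, its field names, and the `is_above` helper are hypothetical and are not part of snorkelflow. They only show that, with the origin at the top-left, y grows downward, so a valid box has bottom > top and right > left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; not a snorkelflow type."""

    left: float
    top: float
    right: float
    bottom: float

    def is_valid(self) -> bool:
        # With [0, 0] at the top-left, x grows rightward and y grows downward.
        return self.right > self.left and self.bottom > self.top

    @property
    def height(self) -> float:
        return self.bottom - self.top


def is_above(a: Box, b: Box) -> bool:
    """Return True if box a sits entirely above box b on the page."""
    # "Above" means smaller y values because y increases downward.
    return a.bottom <= b.top


title = Box(left=50, top=40, right=300, bottom=60)
body = Box(left=50, top=80, right=500, bottom=400)
```

Here `is_above(title, body)` holds because the title's bottom edge (60) is smaller than the body's top edge (80).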
- Parameters:
pages (DataFrame) – A DataFrame of page objects
areas (DataFrame) – A DataFrame of area objects
pars (DataFrame) – A DataFrame of paragraph objects
lines (DataFrame) – A DataFrame of line objects
words (DataFrame) – A DataFrame of word objects, with bounding boxes, parent object assignments, etc.
text (Optional[str], default: None) – The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency.
- __init__(pages, areas, pars, lines, words, text=None, text_by_row=None)
Methods
- __init__(pages, areas, pars, lines, words[, ...])
- deserialize(serialized) – Deserialize the RichDoc instance from the encoded pickle representation.
- extract_span_page(char_start, char_end) – Create a RichDoc with only objects on the same page as the requested span.
- from_page_docs(page_docs) – Construct a RichDoc from per-page RichDocs.
- from_records(*, pages, areas, pars, lines, words) – Construct RichDoc object from records lists.
- get_row_text(page, row) – Return the text corresponding to a given row_id and page_id.
- get_span_location_fields(char_start, char_end) – Get the location of a span of words in the rich doc in terms of word_ids.
- normalize_char_starts() – Normalize char offsets for pages and words dfs inplace.
- serialize() – Serialize the RichDoc instance into an encoded pickle representation.
- split_pages() – Split a RichDoc document into a list of RichDocs, one page per doc.
- to_hocr() – Convert a RichDoc document into an hOCR formatted string.
- to_json([start_page]) – Convert a RichDoc document into a JSON formatted string.
- to_text() – Return a text-only representation of a RichDoc (without markup).
Attributes
- page_char_starts – Get an array of the char_start for all pages in the RichDoc.
- word_char_ends – Get an array of the char_end for all words in the RichDoc.
- word_char_starts – Get an array of the char_start for all words in the RichDoc.
- word_indexes – Get an array of the word indexes for all words in the RichDoc.
- classmethod deserialize(serialized)
Deserialize the RichDoc instance from the encoded pickle representation.
- Parameters:
serialized (str) – A base64-encoded string representation of the RichDoc pickle.
- Return type:
- extract_span_page(char_start, char_end)
Create a RichDoc with only objects on the same page as the requested span.
Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.
- Parameters:
char_start (int) – The inclusive start index of the span in RichDoc.text.
char_end (int) – The inclusive end index of the span in RichDoc.text.
- Return type:
- classmethod from_page_docs(page_docs)
Construct a RichDoc from per-page RichDocs.
- Parameters:
page_docs (RichDocList) – A RichDocList data structure.
- Return type:
- classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)
Construct RichDoc object from records lists.
- Return type:
Series
- get_row_text(page, row)
Return the text corresponding to a given row_id and page_id.
- Return type:
str
- get_span_location_fields(char_start, char_end, span_words=None)
Get the location of a span of words in the rich doc in terms of word_ids.
These fields are used by the frontend for rendering.
- Parameters:
char_start (int) – The inclusive start index of the span in RichDoc.text.
char_end (int) – The inclusive end index of the span in RichDoc.text.
span_words (Optional[DataFrame], default: None) – The dataframe of words associated with the span. If not specified, it is computed using char_start and char_end.
- Return type:
Dict[str, Any]
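To make the span-to-words mapping concrete, here is a minimal sketch of how the words covered by an inclusive character span could be found from per-word start and end offsets. The `words_in_span` function is hypothetical and is not the snorkelflow implementation, and it assumes that each word's char_end is inclusive, which the reference above does not state.

```python
from typing import List


def words_in_span(
    word_char_starts: List[int],
    word_char_ends: List[int],
    char_start: int,
    char_end: int,
) -> List[int]:
    """Return positions of words overlapping the inclusive span [char_start, char_end]."""
    hits = []
    for i, (w_start, w_end) in enumerate(zip(word_char_starts, word_char_ends)):
        # Two inclusive ranges overlap when each starts no later than the other ends.
        if w_start <= char_end and w_end >= char_start:
            hits.append(i)
    return hits


# Text "hello rich doc": words occupy [0, 4], [6, 9], [11, 13] (inclusive ends assumed).
example_word_starts = [0, 6, 11]
example_word_ends = [4, 9, 13]
span_hits = words_in_span(example_word_starts, example_word_ends, char_start=6, char_end=13)
```

With these offsets, the span covering "rich doc" selects the second and third words.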
- normalize_char_starts()
Normalize char offsets for pages and words dfs inplace.
This is needed when a RichDoc is constructed from disjoint pages (e.g., via from_page_docs).
- Return type:
None
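The idea of normalizing offsets across disjoint pages can be sketched with plain dictionaries. Everything in this block is an assumption made for illustration: the `shift_to_document_offsets` helper, the `n_chars` and `words` keys, and the choice to place pages back to back with no separator are hypothetical, and the actual method may compute offsets differently.

```python
from typing import Any, Dict, List


def shift_to_document_offsets(pages: List[Dict[str, Any]]) -> None:
    """Rewrite page-local char offsets as document-wide offsets, in place."""
    cursor = 0
    for page in pages:
        page["char_start"] = cursor
        for word in page["words"]:
            # Each page's offsets are assumed to start at 0 before the shift.
            word["char_start"] += cursor
            word["char_end"] += cursor
        # Pages are assumed to be laid out back to back (hypothetical layout).
        cursor += page["n_chars"]


disjoint_pages = [
    {"n_chars": 6, "words": [{"char_start": 0, "char_end": 4}]},
    {"n_chars": 4, "words": [{"char_start": 0, "char_end": 2}]},
]
shift_to_document_offsets(disjoint_pages)
```

After the call, the second page starts at offset 6 and its word spans offsets 6 to 8, while the first page is unchanged.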
- serialize()
Serialize the RichDoc instance into an encoded pickle representation.
- Return type:
str
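The phrase "encoded pickle representation", together with the base64-encoded string that deserialize accepts, suggests a round trip like the one sketched below. This uses only the Python standard library on a stand-in object. The helper names are hypothetical, and the details RichDoc actually uses (pickle protocol, any compression) are not documented here and may differ.

```python
import base64
import pickle
from typing import Any


def to_encoded_pickle(obj: Any) -> str:
    """Pickle an object and return it as a base64 ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def from_encoded_pickle(serialized: str) -> Any:
    """Inverse of to_encoded_pickle. Only unpickle data from a trusted source."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# A plain dict stands in for a document in this sketch.
doc_like = {"text": "hello rich doc", "n_pages": 1}
encoded = to_encoded_pickle(doc_like)
restored = from_encoded_pickle(encoded)
```

Because the result is a plain ASCII string, it can be stored in a text column or sent in JSON, which raw pickle bytes cannot.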
- split_pages()
Split a RichDoc document into a list of RichDocs, one page per doc.
- Return type:
- to_hocr()
Convert a RichDoc document into an hOCR formatted string.
- Return type:
str
- to_json(start_page=0)
Convert a RichDoc document into a JSON formatted string.
- Return type:
str
- to_text()
Return a text-only representation of a RichDoc (without markup).
- Return type:
str
- property page_char_starts: List[int]
Get an array of the char_start for all pages in the RichDoc.
NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.
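One use of a list of page start offsets is to look up which page contains a given character offset. The sketch below does this with a binary search. The `page_position_for_offset` helper and the sample values are hypothetical, and the sketch assumes the start offsets are sorted in ascending order.

```python
from bisect import bisect_right
from typing import List


def page_position_for_offset(page_char_starts: List[int], char_offset: int) -> int:
    """Return the position within page_char_starts of the page containing char_offset."""
    if not page_char_starts or char_offset < page_char_starts[0]:
        raise ValueError("offset precedes the first available page")
    # bisect_right counts the page starts that are <= char_offset.
    return bisect_right(page_char_starts, char_offset) - 1


example_page_starts = [0, 1200, 2550]  # made-up char_start values for three pages
position = page_position_for_offset(example_page_starts, 1200)
```

If the document has been trimmed to a subset of pages, the returned value is a position among the remaining pages, not the original page number.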
- property word_char_ends: ndarray
Get an array of the char_end for all words in the RichDoc.
- property word_char_starts: ndarray
Get an array of the char_start for all words in the RichDoc.
- property word_indexes: List[int]
Get an array of the word indexes for all words in the RichDoc.
Note that the index is 0-based and relative to the whole document, not to the page. For example, when a RichDoc represents the 2nd page of a 2-page document, its word indexes will be n, n+1, n+2, …, n+m, assuming the first page has n words.
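The numbering in the example above can be reproduced with a short sketch. The `word_indexes_per_page` helper is hypothetical and takes only word counts per page. It shows that the indexes of a later page continue from where the previous page stopped.

```python
from typing import List


def word_indexes_per_page(words_per_page: List[int]) -> List[List[int]]:
    """Assign 0-based word indexes page by page, continuing across pages."""
    result = []
    next_index = 0
    for count in words_per_page:
        result.append(list(range(next_index, next_index + count)))
        next_index += count
    return result


# A 2-page document: 3 words on the first page, 4 on the second.
per_page = word_indexes_per_page([3, 4])
```

With n = 3 words on the first page, the second page's indexes start at 3 rather than at 0.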