snorkelflow.rich_docs.RichDoc
- class snorkelflow.rich_docs.RichDoc(pages, areas, pars, lines, words, text=None, text_by_row=None)
- Bases: Serializable
- An object representing a document with rich formatting preserved.
- For additional utilities for operating on a RichDoc document, see the RichDocWrapper class.
- Types of signals available via RichDoc:
  - textual: A text-only representation of the document (no markup present).
  - structural: All words are assigned a line, par(agraph), area, and page. Lines may include font size information.
  - visual: Objects of all granularities contain bounding box coordinates.
  - tabular: Currently expressed primarily via visual signals (e.g., vertical and horizontal alignments).
- Other Notes: [0, 0] is in the top-left corner (so bottom > top and right > left).
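To make the coordinate convention concrete, here is a minimal sketch in plain Python. The `Box` class, its field names, and the two helper functions are hypothetical and not part of the RichDoc API. They only illustrate that, with the origin in the top-left corner, `bottom > top` and `right > left`, and how alignment (the basis of the tabular signal) might be read off bounding boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; the origin [0, 0] is the top-left corner."""
    left: float
    top: float
    right: float
    bottom: float

    def __post_init__(self):
        # With a top-left origin, y grows downward and x grows rightward.
        if not (self.bottom > self.top and self.right > self.left):
            raise ValueError("expected bottom > top and right > left")

    @property
    def height(self):
        return self.bottom - self.top


def left_aligned(a, b, tol=2.0):
    """Treat two boxes as column-aligned if their left edges are within tol."""
    return abs(a.left - b.left) <= tol


def same_row(a, b):
    """Treat two boxes as sharing a row if their vertical extents overlap."""
    return a.top < b.bottom and b.top < a.bottom


# Made-up coordinates for three words in a table-like layout.
header = Box(left=50, top=100, right=120, bottom=115)
cell_below = Box(left=51, top=130, right=110, bottom=145)
cell_right = Box(left=200, top=102, right=260, bottom=116)
```

In this sketch, `header` and `cell_below` count as vertically aligned, while `header` and `cell_right` share a row. The tolerance value is an arbitrary choice for illustration.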
- Parameters:
  - pages (DataFrame): A DataFrame of page objects.
  - areas (DataFrame): A DataFrame of area objects.
  - pars (DataFrame): A DataFrame of paragraph objects.
  - lines (DataFrame): A DataFrame of line objects.
  - words (DataFrame): A DataFrame of word objects, with bounding boxes, parent object assignments, etc.
  - text (Optional[str], default None): The derived text representation of the RichDoc if it is known. Typically this is None and the text is auto-generated to ensure consistency. It may be passed, however, when serializing/deserializing, for efficiency.
- Methods:
  - __init__(pages, areas, pars, lines, words[, ...])
  - deserialize(serialized): Deserialize the RichDoc instance from the encoded pickle representation.
  - extract_span_page(char_start, char_end): Create a RichDoc with only objects on the same page as the requested span.
  - from_page_docs(page_docs): Construct a RichDoc from per-page RichDocs.
  - from_records(*, pages, areas, pars, lines, words): Construct RichDoc object from records lists.
  - get_row_text(page, row): Return the text corresponding to a given row_id and page_id.
  - get_span_location_fields(char_start, char_end): Get the location of a span of words in the rich doc in terms of word_ids.
  - normalize_char_starts(): Normalize char offsets for pages and words dfs inplace.
  - serialize(): Serialize the RichDoc instance into an encoded pickle representation.
  - split_pages(): Split a RichDoc document into a list of RichDocs, one page per doc.
  - to_hocr(): Convert a RichDoc document into an hOCR formatted string.
  - to_json([start_page]): Convert a RichDoc document into a JSON formatted string.
  - to_text(): Return a text-only representation of a RichDoc (without markup).
- Attributes:
  - page_char_starts: Get an array of the char_start for all pages in the RichDoc.
  - word_char_ends: Get an array of the char_end for all words in the RichDoc.
  - word_char_starts: Get an array of the char_start for all words in the RichDoc.
  - word_indexes: Get an array of the word indexes for all words in the RichDoc.
- classmethod deserialize(serialized)
- Deserialize the RichDoc instance from the encoded pickle representation. 
- extract_span_page(char_start, char_end)
- Create a RichDoc with only objects on the same page as the requested span.
- Because visual signals are only guaranteed for words on a span’s page, dropping the other pages can save a significant amount of space, especially in documents with many pages.
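As a rough illustration of how a span can be mapped to its page, the sketch below looks up a character offset in a sorted list of page start offsets, such as the values exposed by `page_char_starts`. The helper names are hypothetical, and the choice to raise an error when a span crosses a page boundary is an assumption: the documentation does not say how `extract_span_page` treats that case.

```python
from bisect import bisect_right


def page_index_for_char(page_char_starts, char_offset):
    """Return the position of the page whose text contains char_offset.

    page_char_starts must be sorted ascending; page i is assumed to cover
    the offsets from page_char_starts[i] up to the next page's start.
    """
    if not page_char_starts or char_offset < page_char_starts[0]:
        raise ValueError("offset precedes the first page")
    return bisect_right(page_char_starts, char_offset) - 1


def span_page(page_char_starts, char_start, char_end):
    """Return the page position holding a span; raise if it crosses pages."""
    first = page_index_for_char(page_char_starts, char_start)
    last = page_index_for_char(page_char_starts, char_end)
    if first != last:
        raise ValueError("span crosses a page boundary")
    return first


# Made-up start offsets for a 3-page document.
starts = [0, 1200, 2500]
```

With these made-up offsets, a span covering characters 1300 to 1350 falls on the second page (position 1), so only that page's objects would need to be kept.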
- classmethod from_page_docs(page_docs)
- Construct a RichDoc from per-page RichDocs.
- classmethod from_records(*, pages, areas, pars, lines, words, convert_to_int=True)
- Construct RichDoc object from records lists.
- Return type: Series
 
- get_row_text(page, row)
- Return the text corresponding to a given row_id and page_id.
- Return type: str
 
- get_span_location_fields(char_start, char_end, span_words=None)
- Get the location of a span of words in the rich doc in terms of word_ids.
- These fields are used by the frontend for rendering.
- Parameters:
  - char_start (int): The inclusive start index of the span in RichDoc.text.
  - char_end (int): The inclusive end index of the span in RichDoc.text.
  - span_words (Optional[DataFrame], default None): The dataframe of words associated with the span. If not specified, it is computed using char_start and char_end.
- Return type: Dict[str, Any]
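Because both `char_start` and `char_end` are inclusive, off-by-one mistakes are easy to make when matching a span to words. The following sketch shows one way to select the words overlapping an inclusive span from parallel arrays of word offsets. The function `words_in_span` is hypothetical, and it assumes that word end offsets are inclusive as well, which the documentation does not state.

```python
def words_in_span(word_char_starts, word_char_ends, char_start, char_end):
    """Return positions of words overlapping the inclusive span.

    Assumes word_char_ends are inclusive, mirroring the span convention.
    """
    return [
        i
        for i, (ws, we) in enumerate(zip(word_char_starts, word_char_ends))
        if ws <= char_end and we >= char_start
    ]


text = "Total due: 42 USD"
# Inclusive start/end offsets of each whitespace-separated word.
word_starts = [0, 6, 11, 14]
word_ends = [4, 9, 12, 16]

# The inclusive span (11, 12) corresponds to the Python slice text[11:13].
amount = text[11:12 + 1]
hits = words_in_span(word_starts, word_ends, 11, 12)
```

Note the `+ 1` when converting an inclusive end index into a Python slice, which uses an exclusive end.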
- normalize_char_starts()
- Normalize char offsets for pages and words dfs inplace.
- This is needed when a RichDoc is constructed from disjoint pages (e.g. from_page_docs).
- Return type: None
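To see why offsets need normalizing when pages are built independently, consider the sketch below. Each page starts counting characters from zero, so offsets must be shifted before they can index into a single combined text. The function, the dictionary keys, and the one-character page separator are all assumptions made for illustration; they do not describe how `normalize_char_starts` is implemented.

```python
def normalize_offsets(pages, separator_len=1):
    """Shift per-page word offsets so they index into one concatenated text.

    pages: list of dicts with hypothetical keys "text" and "word_starts",
    where word_starts are relative to that page's own text.
    Returns (page_char_starts, word_char_starts) for the combined document.
    """
    page_char_starts = []
    word_char_starts = []
    offset = 0
    for page in pages:
        page_char_starts.append(offset)
        word_char_starts.extend(offset + s for s in page["word_starts"])
        # Advance past this page's text plus the assumed separator.
        offset += len(page["text"]) + separator_len
    return page_char_starts, word_char_starts


pages = [
    {"text": "Invoice 17", "word_starts": [0, 8]},
    {"text": "Total 42", "word_starts": [0, 6]},
]
page_starts, word_starts = normalize_offsets(pages)
combined = "\n".join(page["text"] for page in pages)
```

After the shift, every word offset points at the right character in `combined`, even though the second page's words originally started again at zero.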
 
- serialize()
- Serialize the RichDoc instance into an encoded pickle representation.
- Return type: str
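The idea of an "encoded pickle representation" returned as a `str` can be sketched with the standard library alone. The example below pickles an object and encodes the bytes as text, then reverses the process. Using base64 as the encoding is an assumption; the documentation does not specify which encoding RichDoc uses, and the helper names are hypothetical.

```python
import base64
import pickle


def to_encoded_pickle(obj):
    """Pickle obj and encode the bytes as an ASCII string (base64 assumed)."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def from_encoded_pickle(serialized):
    """Invert to_encoded_pickle. Only unpickle data from trusted sources."""
    return pickle.loads(base64.b64decode(serialized.encode("ascii")))


# A plain dict stands in for a document object in this sketch.
payload = {"pages": [0, 1], "text": "Invoice 17\nTotal 42"}
encoded = to_encoded_pickle(payload)
restored = from_encoded_pickle(encoded)
```

A string form like this is convenient for storing a serialized object in a text column, which may be why the method returns `str` rather than `bytes`.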
 
- split_pages()
- Split a RichDoc document into a list of RichDocs, one page per doc.
- Return type: List[RichDoc]
 
- to_hocr()
- Convert a RichDoc document into an hOCR formatted string.
- Return type: str
 
- to_json(start_page=0)
- Convert a RichDoc document into a JSON formatted string.
- Return type: str
 
- to_text()
- Return a text-only representation of a RichDoc (without markup).
- Return type: str
 
- property page_char_starts: List[int]
- Get an array of the char_start for all pages in the RichDoc.
- NOTE: If the RichDoc has been trimmed to a subset of pages, then only the char_start values for those pages will be present in this attribute.
 - property word_char_ends: ndarray
- Get an array of the char_end for all words in the RichDoc. 
 - property word_char_starts: ndarray
- Get an array of the char_start for all words in the RichDoc. 
 - property word_indexes: List[int]
- Get an array of the word indexes for all words in the RichDoc.
- Note that the index is 0-based and relative to the whole document, not to the page. For example, when a RichDoc represents the 2nd page of a 2-page document, its word indexes will be n, n+1, n+2, …, n+m, assuming the first page has n words.
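The numbering in the example above can be sketched as follows. The function is hypothetical and works only from per-page word counts. It illustrates that indexes keep counting across pages, so a single-page RichDoc taken from later in a document does not start at 0.

```python
def split_word_indexes(words_per_page):
    """Return one list of word indexes per page, counted across all pages.

    words_per_page: number of words on each page, in page order.
    """
    result = []
    next_index = 0
    for count in words_per_page:
        result.append(list(range(next_index, next_index + count)))
        next_index += count
    return result


# A 2-page document: 3 words on page one, 4 words on page two.
per_page = split_word_indexes([3, 4])
second_page_indexes = per_page[1]
```

Here the first page has n = 3 words, so the second page's indexes begin at 3.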