Rich doc libraries
Functionality for writing operators and LFs over Rich Document objects.
- class rich_doc_wrapper.rich_doc_wrapper.Ngram(words)
Bases: Serializable
A class that represents an ngram (one or more grouped words).
Note: A list of ngrams is sometimes represented as rows in a DataFrame with the same fields, as a performance optimization.
Parameters
Name | Type | Default | Info
words | DataFrame | | A DataFrame of words that represent the given ngram.
- classmethod deserialize(serialized)
Deserialize an instance from its serialized form.
- property area_id: int
- property bottom: int
- property center: int
- property char_end: int
- property char_start: int
- property left: int
- property line_id: int
- property middle: int
- property page_id: int
- property par_id: int
- property right: int
- property row_id: int
- property text: str
- property top: int
- property word0: pandas.Series
- property word_id: int
- class rich_doc_wrapper.rich_doc_wrapper.RichDocSpanFeatures
Bases: object
Class that specifies RichDoc features.
- ALIGNED_NGRAMS = 'rich_doc_aligned_ngrams'
- FONT_SIZE = 'rich_doc_font_size'
- HORZ_ALIGNED_NGRAMS = 'rich_doc_horz_aligned_ngrams'
- INFERRED_ROW_HEADERS = 'rich_doc_inferred_row_headers'
- PROXIMATE_TEXT = 'rich_doc_proximate_text'
- PROXIMATE_TEXT_AFTER = 'rich_doc_proximate_text_after'
- PROXIMATE_TEXT_BEFORE = 'rich_doc_proximate_text_before'
- ROW_HEADER = 'rich_doc_row_header'
- ROW_ID = 'rich_doc_row_id'
- ROW_TEXT_AFTER = 'rich_doc_row_text_after'
- ROW_TEXT_BEFORE = 'rich_doc_row_text_before'
- ROW_TEXT_INLINE = 'rich_doc_row_text_inline'
- VERT_ALIGNED_NGRAMS = 'rich_doc_vert_aligned_ngrams'
- class rich_doc_wrapper.rich_doc_wrapper.RichDocWrapper(rd)
Bases: object
A class that wraps a RichDoc and calculates derivative values and properties.
For more details on RichDoc, see the documentation for the RichDoc class.
Definitions used in this object:
- A span is a (char_start, char_end) that was produced by a SpanExtractor.
Most public methods are passed a span and calculate attributes w/r/t it.
NOTE: we expect (char_start, char_end) to be (inclusive, inclusive).
- A word is a single row in a DataFrame that represents a single word.
Tokenization into words is performed by whatever tool was used to create the hOCR representation.
- An ngram is a class that represents one or more words (usually from the same line).
It is usually represented via the Ngram class, but may be represented as a row in a DataFrame for efficiency when dealing with many ngrams.
Attributes will reflect a “convex hull” (min char_start/left/top, max char_end/right/bottom) where possible and the values of the first word otherwise.
All words are ngrams but not all ngrams are words.
Other Notes:
[0, 0] is in the top-left corner (so bottom > top and right > left).
By default, for spans that cross page boundaries, all visual alignment information will be based only on the words that occur on the same page as the first word of the span.
CENTER is the halfway point between LEFT and RIGHT; MIDDLE is the halfway point between TOP and BOTTOM.
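The definitions above can be illustrated with a small, self-contained sketch. It does not use the library: `Word`, `hull`, `center`, and `middle` are hypothetical stand-ins, and the coordinates are made up. It shows the "convex hull" rule for combining words into an ngram, the CENTER and MIDDLE midpoints under a top-left origin, and how an inclusive `(char_start, char_end)` pair maps onto a Python slice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for one word row (not the library's schema)."""
    text: str
    char_start: int  # inclusive
    char_end: int    # inclusive
    left: int
    top: int
    right: int
    bottom: int


def hull(words: List[Word]) -> Word:
    """Combine words into one ngram-like record using min/max of each extent."""
    return Word(
        text=" ".join(w.text for w in words),
        char_start=min(w.char_start for w in words),
        char_end=max(w.char_end for w in words),
        left=min(w.left for w in words),
        top=min(w.top for w in words),
        right=max(w.right for w in words),
        bottom=max(w.bottom for w in words),
    )


def center(w: Word) -> float:
    """Halfway point between LEFT and RIGHT."""
    return (w.left + w.right) / 2


def middle(w: Word) -> float:
    """Halfway point between TOP and BOTTOM."""
    return (w.top + w.bottom) / 2


doc_text = "execution date"
words = [
    Word("execution", 0, 8, left=100, top=50, right=180, bottom=62),
    Word("date", 10, 13, left=186, top=51, right=220, bottom=63),
]
ngram = hull(words)
# The end index is inclusive, so the Python slice needs char_end + 1.
span_text = doc_text[ngram.char_start : ngram.char_end + 1]
print(span_text, center(ngram), middle(ngram))
```

Because the origin is the top-left corner, `bottom` is numerically larger than `top` for every box in this sketch.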
Parameters
Name | Type | Default | Info
rd | RichDoc | | A RichDoc object containing textual, structural, visual, and tabular data about a rich document (e.g. a PDF).
text | | | The raw text corresponding to the RichDoc object rd, which is used for computing character alignments to the RichDoc object.
- get_aligned_ngrams(char_start, char_end, locations, scope='page', vert_threshold=10, vert_threshold_unit='pixels', vert_threshold_dir=None, horz_threshold=10, horz_threshold_unit='pixels', horz_threshold_dir=None, mask_span_ngrams=True, ngram_range=(1, 1))
Return all ngrams that are aligned in the given scope/threshold by location.
Note that we can’t say whether an ngram is aligned in any particular direction until we know what the ngram is, so we can’t just identify the aligned words and then make ngrams after the fact (e.g., a date span may be aligned with only the word “the” in “the execution date”).
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
locations | List[str] | | The locations of the span/regex bboxes to use to compare alignments.
scope | str | 'page' | The scope within which to look for aligned ngrams.
vert_threshold | Union[float, int] | 10 | The threshold used for comparing alignment vertically on the page.
vert_threshold_unit | str | 'pixels' | The unit of the vertical threshold.
vert_threshold_dir | Optional[str] | None | The direction on which the vertical threshold applies. If None, threshold applies in both directions.
horz_threshold | Union[float, int] | 10 | The threshold used for comparing alignment horizontally on the page.
horz_threshold_unit | str | 'pixels' | The unit of the horizontal threshold.
horz_threshold_dir | Optional[str] | None | The direction on which the horizontal threshold applies. If None, threshold applies in both directions.
mask_span_ngrams | bool | True | True to replace the anchor span with SPAN_TOKEN in ngrams. False otherwise.
ngram_range | Tuple[int, int] | (1, 1) | The range of ngram lengths to return.
Return type
Dict[str, List[str]]
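To make the `ngram_range` and `mask_span_ngrams` parameters more concrete, here is a rough sketch of ngram enumeration over a single line of tokens. It is not the library's implementation: `line_ngrams` and `SPAN_PLACEHOLDER` are hypothetical names, the value of the real SPAN_TOKEN is not shown in this reference, and treating `ngram_range` as inclusive on both ends is an assumption.

```python
from typing import List, Tuple

# Hypothetical placeholder; the library's actual SPAN_TOKEN value is not documented here.
SPAN_PLACEHOLDER = "[SPAN]"


def line_ngrams(
    tokens: List[str],
    span_token_idxs: Tuple[int, int],
    ngram_range: Tuple[int, int] = (1, 1),
    mask_span: bool = True,
) -> List[str]:
    """Enumerate ngrams over one line of tokens, optionally masking the span."""
    first, last = span_token_idxs  # inclusive token indices covered by the span
    if mask_span:
        # Collapse the span's tokens into a single placeholder token.
        tokens = tokens[:first] + [SPAN_PLACEHOLDER] + tokens[last + 1 :]
    lo, hi = ngram_range  # assumed inclusive on both ends
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i : i + n]))
    return out


tokens = ["the", "execution", "date", "is", "May", "5"]
print(line_ngrams(tokens, (4, 5), ngram_range=(2, 3)))
```

Each candidate ngram would then need its own bounding box before alignment can be tested, which is why ngrams are built first and alignment is checked afterwards.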
- get_font_size(char_start, char_end)
Get font size of a span, rounded to the nearest int.
If there are multiple font sizes, we default to font size of the first word in the span.
- get_page_num(char_start, char_end)
Get the 1-indexed page number of a span.
- get_proximate_text(char_start, char_end, window, scope_unit='line', direction='before or after')
Returns text found near the given span.
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
window | int | | The number of scope_units from the given span to aggregate text.
scope_unit | str | 'line' | The unit with which to define the window.
direction | str | 'before or after' | The direction of the window from the given span in which to aggregate text.
Return type
str
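The windowing idea behind `window`, `scope_unit`, and `direction` can be sketched for the `'line'` scope unit as follows. This is an illustration only: `proximate_text` is a hypothetical helper, the direction values `'before'` and `'after'` are assumptions (only `'before or after'` appears in this reference), and including the span's own line in the result is also an assumption.

```python
from typing import List


def proximate_text(lines: List[str], span_line: int, window: int,
                   direction: str = "before or after") -> str:
    """Join the text of lines within `window` lines of the span's line."""
    if direction not in ("before", "after", "before or after"):
        raise ValueError(f"unsupported direction: {direction}")
    # Extend the window backwards and/or forwards depending on direction.
    start = span_line - window if "before" in direction else span_line
    end = span_line + window if "after" in direction else span_line
    # Clamp to the document so the window never runs off either end.
    start = max(start, 0)
    end = min(end, len(lines) - 1)
    return " ".join(lines[start : end + 1])


lines = ["Invoice", "Date: May 5", "Total due", "$10.00", "Thank you"]
print(proximate_text(lines, span_line=3, window=1, direction="before"))
```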
- get_row_headers(char_start, char_end, scope='page', multi_row=True, min_margin=10, max_gap=20, max_left_page_pct=50)
Heuristically fetch row headers and (optionally) inferred headers.
The horizontal header is the phrase in scope that is furthest to the left of the span. The inferred headers are strings that are the first string above and to the left of a Span to be indented at a particular level.
Example: For the span “$1.00”, inferred headers will be [“Planes and other stuff”, “Equipment”, “Assets”] if multi_row is True, and [“other stuff”, “Equipment”, “Assets”] otherwise.
Assets
    Cash              $10.00
    Inventory         $5.00
    Equipment
        Cars          $4.00
        Planes and
        other stuff   $1.00
Liabilities
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
scope | str | 'page' | The scope within which to look for headers.
min_margin | int | 10 | The minimum margin (in pixels) to the left and above a word that is required for it to be considered as a valid inferred header (i.e., not on the same horizontal line as or vertically aligned with the previous header).
max_gap | int | 20 | The maximum gap allowed between words on a line for them to be considered part of the same header.
max_left_page_pct | int | 50 | A percentage (0-100) of the page from the left boundary that all headers must have a left coordinate less than (e.g., if 25, then all row headers must be on the left quarter of the page).
Return type
Dict[str, Union[str, List[str]]]
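The inferred-header heuristic can be approximated by walking upward from the span's line and keeping each label that starts further left than the last one kept. The sketch below is a simplification, not the library's algorithm: `inferred_headers` is a hypothetical helper, the left coordinates are invented to mirror the Assets/Equipment example, and only the single-row case (no merging of "Planes and" with "other stuff") is covered.

```python
from typing import List, Tuple


def inferred_headers(lines: List[Tuple[int, str]], span_line: int,
                     min_margin: int = 10) -> List[str]:
    """Walk upward from the span's line, collecting each less-indented label.

    `lines` holds (left_coordinate, label_text) per line, top to bottom.
    """
    left, label = lines[span_line]
    headers = [label]
    for cand_left, cand_label in reversed(lines[:span_line]):
        # A header must start at least `min_margin` pixels further left.
        if cand_left <= left - min_margin:
            headers.append(cand_label)
            left = cand_left
    return headers


# Invented left coordinates that mimic the indentation of the example table.
table = [
    (0, "Assets"),
    (20, "Cash"),
    (20, "Inventory"),
    (20, "Equipment"),
    (40, "Cars"),
    (40, "Planes and"),
    (40, "other stuff"),
    (0, "Liabilities"),
]
print(inferred_headers(table, span_line=6))
```

With these coordinates the sketch yields the row label first, then "Equipment", then "Assets", matching the ordering of the non-multi-row example.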
- get_span_features(char_start, char_end, config)
Get rich doc feature library for a given span.
Return type
Dict[str, Union[int, str]]
- get_span_ngram(char_start, char_end, one_page=True)
Make an ngram corresponding to (char_start, char_end).
Parameters
Name | Type | Default | Info
char_start | int | | The inclusive start index of the span in RichDoc.text.
char_end | int | | The inclusive end index of the span in RichDoc.text.
one_page | bool | True | If one_page == True, an ngram containing only those words occurring on the same page as the first word of the span will be returned. This allows us to make assumptions downstream about all words (and bbox coordinates) coming from the same page.
- get_span_row_text(char_start, char_end, row_offsets=(0, 0), mask_span=True, span_ngram=None)
Return the text from one or more rows in the vicinity of a span.
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
row_offsets | Tuple[int, int] | (0, 0) | The inclusive range of rows to extract text from and concatenate. If more than one row is included, a delimiter is used between rows. The range (-2, 1) would return a string containing the text from four rows: two before the span, the span’s own row, and one after the span.
Return type
str
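The `row_offsets` arithmetic can be checked with a short sketch. It is not the library's code: `rows_text` and `ROW_DELIMITER` are hypothetical, the real delimiter is not stated in this reference, and clamping at the first and last row is an assumption.

```python
from typing import List, Tuple

# Hypothetical delimiter; the one used by the library is not documented here.
ROW_DELIMITER = " | "


def rows_text(rows: List[str], span_row: int,
              row_offsets: Tuple[int, int] = (0, 0)) -> str:
    """Concatenate rows span_row + offsets[0] through span_row + offsets[1], inclusive."""
    first = max(span_row + row_offsets[0], 0)
    last = min(span_row + row_offsets[1], len(rows) - 1)
    return ROW_DELIMITER.join(rows[first : last + 1])


rows = ["Assets", "Cash $10.00", "Inventory $5.00", "Equipment", "Cars $4.00"]
# (-2, 1): two rows before the span's row, the row itself, and one row after.
print(rows_text(rows, span_row=2, row_offsets=(-2, 1)))
```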
- is_regex_aligned(char_start, char_end, ngrams, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None, case_sensitive=False)
Returns whether or not regex_pattern is found in scope that aligns with the given span.
Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches up to a given threshold (either in pixels or a percentage of the page).
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
ngrams | List[Ngram] | | The regex ngrams to check for alignment with the given span.
scope | str | | The scope within which to check for the regex_pattern.
location | str | | The location of the span/regex bboxes to compare alignment.
threshold | Union[float, int] | 10 | The threshold used for computing alignment.
threshold_unit | str | 'pixels' | The unit of the threshold.
threshold_dir | Optional[str] | None | The direction on which the threshold applies. If None, the threshold applies in both directions.
case_sensitive | bool | False | True to use a case sensitive regex match, False otherwise.
Return type
bool
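As a rough illustration of the alignment test described in the note, the sketch below compares one location on two bounding boxes against a threshold. All names are hypothetical: `location_value` and `is_aligned` are not library functions, the lowercase location strings and the `"percent"` unit name are assumptions, and `page_size` stands in for whichever page dimension the percentage refers to. Directional thresholds (`threshold_dir`) are left out.

```python
from typing import Dict

Box = Dict[str, float]  # keys: "left", "top", "right", "bottom"


def location_value(box: Box, location: str) -> float:
    """Return the coordinate for left/center/right/top/middle/bottom."""
    if location == "center":
        return (box["left"] + box["right"]) / 2
    if location == "middle":
        return (box["top"] + box["bottom"]) / 2
    return box[location]


def is_aligned(span: Box, other: Box, location: str, threshold: float = 10,
               threshold_unit: str = "pixels", page_size: float = 1000.0) -> bool:
    """True if both boxes agree at `location` to within the threshold."""
    if threshold_unit == "percent":  # hypothetical unit name
        threshold = page_size * threshold / 100
    diff = abs(location_value(span, location) - location_value(other, location))
    return diff <= threshold


span = {"left": 400, "top": 100, "right": 460, "bottom": 112}
header = {"left": 395, "top": 40, "right": 470, "bottom": 52}
print(is_aligned(span, header, "left"), is_aligned(span, header, "top"))
```

Here the two boxes share a left edge to within 5 pixels, so they count as left-aligned, while their tops are 60 pixels apart and fail the default threshold.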
- is_regex_proximate(char_start, char_end, regex_pattern, window, scope_unit='line', direction='before or after', case_sensitive=False)
Returns whether or not the given regex is found near the given span.
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
regex_pattern | str | | The regex pattern to be matched near the span.
window | int | | The number of scope_units from the given span to check for the given regex.
scope_unit | str | 'line' | The unit with which to define the window.
direction | str | 'before or after' | The direction of the window from the given span to check for the given regex.
case_sensitive | bool | False | True for the regex match to be case sensitive, False otherwise.
Return type
bool
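Once the text near a span has been collected, the proximity check reduces to a regex search with an optional case-insensitive flag. The sketch below shows only that final step using Python's standard `re` module; `regex_in_text` is a hypothetical helper, and whether the library uses `re.search` internally is an assumption.

```python
import re


def regex_in_text(pattern: str, text: str, case_sensitive: bool = False) -> bool:
    """Search `text` for `pattern`, ignoring case unless case_sensitive is set."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, text, flags) is not None


nearby = "Execution Date: May 5, 2020"
print(regex_in_text(r"execution\s+date", nearby))
print(regex_in_text(r"execution\s+date", nearby, case_sensitive=True))
```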
- is_regex_within_bounds(char_start, char_end, ngrams, scope, span_horz_location, span_vert_location, regex_horz_location, regex_vert_location, horz_op_func, vert_op_func, disable_horz=False, disable_vert=False, case_sensitive=False)
Returns whether or not the given span has a position relative to regex_pattern that meets the given conditions.
Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches.
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
ngrams | List[Ngram] | | The ngrams to check for alignment with the given span.
scope | str | | The scope within which to check for the regex_pattern.
span_horz_location | str | | The horizontal location of the span bboxes to compare alignment.
span_vert_location | str | | The vertical location of the span bboxes to compare alignment.
regex_horz_location | str | | The horizontal location of the regex bboxes to compare alignment.
regex_vert_location | str | | The vertical location of the regex bboxes to compare alignment.
horz_op_func | Callable | | The operator to compare the horizontal span_location and regex_location.
vert_op_func | Callable | | The operator to compare the vertical span_location and regex_location.
case_sensitive | bool | False | True to use a case sensitive regex match, False otherwise.
Return type
bool
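The role of `horz_op_func` and `vert_op_func` can be sketched with functions from Python's standard `operator` module. This is an illustration under assumptions, not the library's logic: `within_bounds` is a hypothetical helper, boxes are plain dicts with lowercase location keys, and the operator is assumed to be called as `op(span_value, regex_value)`.

```python
import operator
from typing import Callable, Dict

Box = Dict[str, float]  # keys: "left", "top", "right", "bottom"
Op = Callable[[float, float], bool]


def within_bounds(span: Box, match: Box,
                  span_horz: str, match_horz: str, horz_op: Op,
                  span_vert: str, match_vert: str, vert_op: Op,
                  disable_horz: bool = False, disable_vert: bool = False) -> bool:
    """Apply one comparison per axis; a disabled axis always passes."""
    # Assumed argument order: op(span_value, match_value).
    horz_ok = disable_horz or horz_op(span[span_horz], match[match_horz])
    vert_ok = disable_vert or vert_op(span[span_vert], match[match_vert])
    return horz_ok and vert_ok


label = {"left": 100, "top": 200, "right": 150, "bottom": 212}  # e.g. a matched "Total"
amount = {"left": 400, "top": 201, "right": 460, "bottom": 213}  # e.g. the span "$10.00"

# Is the span to the right of the match and starting above the match's bottom edge?
print(within_bounds(amount, label, "left", "right", operator.gt,
                    "top", "bottom", operator.lt))
```

Passing comparison functions keeps the check generic: swapping `operator.gt` for `operator.lt` turns "right of" into "left of" without changing the helper.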
- is_scope_aligned(char_start, char_end, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None)
Return whether the span is aligned w/r/t a given scope and location.
Parameters
Name | Type | Default | Info
char_start | int | | The (inclusive) start character of the span.
char_end | int | | The (inclusive) end character of the span.
scope | str | | The scope (line, page, etc.).
pars | | | A DataFrame of paragraph objects.
lines | | | A DataFrame of line objects.
words | | | A DataFrame of word objects, with bounding boxes, parent object assignments, etc.
Note: This uses the bbox for the given scope, which is generally just a convex hull of the words in the scope. The one exception is when scope = PAGE, for which the bbox is actually just the outside border of the page.
Return type
bool
- property text: str