Version: 0.93

Rich doc libraries

Functionality for writing operators and labeling functions (LFs) over Rich Document objects.

class rich_doc_wrapper.rich_doc_wrapper.Ngram(words)

Bases: Serializable

A class that represents an ngram (one or more grouped words).

Note: A list of ngrams is sometimes represented as rows in a DataFrame with the same fields, as a performance optimization.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| words | DataFrame | | A DataFrame of words that represent the given ngram. |

deserialize

classmethod deserialize(serialized)

Deserialize an instance from a string.

Return type

Ngram

scope_id

scope_id(scope)

Return type

int

serialize

serialize()

Serialize instance to string.

Return type

str

property area_id: int
property bottom: int
property center: int
property char_end: int
property char_start: int
property left: int
property line_id: int
property middle: int
property page_id: int
property par_id: int
property right: int
property row_id: int
property text: str
property top: int
property word0: pandas.Series
property word_id: int

class rich_doc_wrapper.rich_doc_wrapper.RichDocSpanFeatures

Bases: object

Class that specifies RichDoc features.

ALIGNED_NGRAMS = 'rich_doc_aligned_ngrams'
FONT_SIZE = 'rich_doc_font_size'
HORZ_ALIGNED_NGRAMS = 'rich_doc_horz_aligned_ngrams'
INFERRED_ROW_HEADERS = 'rich_doc_inferred_row_headers'
PROXIMATE_TEXT = 'rich_doc_proximate_text'
PROXIMATE_TEXT_AFTER = 'rich_doc_proximate_text_after'
PROXIMATE_TEXT_BEFORE = 'rich_doc_proximate_text_before'
ROW_HEADER = 'rich_doc_row_header'
ROW_ID = 'rich_doc_row_id'
ROW_TEXT_AFTER = 'rich_doc_row_text_after'
ROW_TEXT_BEFORE = 'rich_doc_row_text_before'
ROW_TEXT_INLINE = 'rich_doc_row_text_inline'
VERT_ALIGNED_NGRAMS = 'rich_doc_vert_aligned_ngrams'

class rich_doc_wrapper.rich_doc_wrapper.RichDocWrapper(rd)

Bases: object

A class that wraps a RichDoc and calculates derivative values and properties.

For more details on RichDoc, see the documentation for the RichDoc class.

Definitions used in this object:

  • A span is a (char_start, char_end) that was produced by a SpanExtractor.
    • Most public methods are passed a span and calculate attributes with respect to it.

    • NOTE: we expect (char_start, char_end) to be (inclusive, inclusive).

  • A word is a single row in a DataFrame that represents a single word.
    • Tokenization into words is performed by whatever tool was used to create the hOCR representation.

  • An ngram is a class that represents one or more words (usually from the same line).
    • It is usually represented via the Ngram class, but may be represented as a row in a DataFrame for efficiency when dealing with many ngrams.

    • Attributes will reflect a “convex hull” (min char_start/left/top, max char_end/right/bottom) where possible and the values of the first word otherwise.

    • All words are ngrams but not all ngrams are words.
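The "convex hull" rule above can be illustrated with a minimal sketch. It does not use the library: `Word` and `hull` are hypothetical stand-ins for a row of the words DataFrame and for whatever the Ngram class does internally, and joining the text with spaces is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for one row of the words DataFrame."""
    text: str
    char_start: int
    char_end: int
    left: int
    top: int
    right: int
    bottom: int
    page_id: int


def hull(words: List[Word]) -> dict:
    """Combine words into ngram-level attributes (illustrative only)."""
    first = words[0]
    return {
        # Assumption: words are joined with single spaces.
        "text": " ".join(w.text for w in words),
        # Convex hull: minimum of the "start" sides, maximum of the "end" sides.
        "char_start": min(w.char_start for w in words),
        "char_end": max(w.char_end for w in words),
        "left": min(w.left for w in words),
        "top": min(w.top for w in words),
        "right": max(w.right for w in words),
        "bottom": max(w.bottom for w in words),
        # No meaningful min/max for this attribute, so use the first word.
        "page_id": first.page_id,
    }


words = [
    Word("execution", 4, 12, 100, 50, 180, 62, 0),
    Word("date", 14, 17, 186, 48, 220, 64, 0),
]
ngram = hull(words)
print(ngram)
```

A single-word list yields that word's own values, which is consistent with the statement that every word is also an ngram.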

Other Notes:

  • [0, 0] is in the top-left corner (so bottom > top and right > left)

  • By default, for spans that cross page boundaries, all visual alignment information will be based only on the words that occur on the same page as the first word of the span.

  • CENTER is the halfway point between LEFT and RIGHT; MIDDLE is the halfway point between TOP and BOTTOM.
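As a small illustration of the coordinate conventions above, the sketch below computes CENTER and MIDDLE for a bounding box. The helper names are hypothetical, and the functions return floats; the Ngram properties are typed as int, and how the library rounds is not stated here.

```python
def center(left: int, right: int) -> float:
    """Halfway point between LEFT and RIGHT (horizontal axis)."""
    return (left + right) / 2


def middle(top: int, bottom: int) -> float:
    """Halfway point between TOP and BOTTOM (vertical axis)."""
    return (top + bottom) / 2


# The origin is the top-left corner, so bottom > top and right > left.
bbox = {"left": 100, "top": 48, "right": 220, "bottom": 64}
print(center(bbox["left"], bbox["right"]), middle(bbox["top"], bbox["bottom"]))
```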

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| rd | RichDoc | | A RichDoc object containing textual, structural, visual, and tabular data about a rich document (e.g. a PDF). |
| text | | | The raw text corresponding to the RichDoc object rd, which is used for computing character alignments to the RichDoc object. |

get_aligned_ngrams

get_aligned_ngrams(char_start, char_end, locations, scope='page', vert_threshold=10, vert_threshold_unit='pixels', vert_threshold_dir=None, horz_threshold=10, horz_threshold_unit='pixels', horz_threshold_dir=None, mask_span_ngrams=True, ngram_range=(1, 1))

Return all ngrams that are aligned in the given scope/threshold, grouped by location.

Note that we can’t say whether an ngram is aligned in any particular direction until we know what the ngram is, so we can’t just identify the aligned words and then make ngrams after the fact (e.g., a date span may be aligned with only the word “the” in “the execution date”).

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| locations | List[str] | | The locations of the span/regex bboxes to use to compare alignments. |
| scope | str | 'page' | The scope within which to look for aligned ngrams. |
| vert_threshold | Union[float, int] | 10 | The threshold used for comparing alignment vertically on the page. |
| vert_threshold_unit | str | 'pixels' | The unit of the vertical threshold. |
| vert_threshold_dir | Optional[str] | None | The direction on which the vertical threshold applies. If None, threshold applies in both directions. |
| horz_threshold | Union[float, int] | 10 | The threshold used for comparing alignment horizontally on the page. |
| horz_threshold_unit | str | 'pixels' | The unit of the horizontal threshold. |
| horz_threshold_dir | Optional[str] | None | The direction on which the horizontal threshold applies. If None, threshold applies in both directions. |
| mask_span_ngrams | bool | True | True to replace the anchor span with SPAN_TOKEN in ngrams. False otherwise. |
| ngram_range | Tuple[int, int] | (1, 1) | The range of ngram lengths to return. |

Return type

Dict[str, List[str]]
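To make the roles of ngram_range and mask_span_ngrams concrete, here is a minimal sketch that does not call the library. It assumes that ngram_range is inclusive on both ends, that the whole anchor span collapses into one placeholder, and that the placeholder looks like "[SPAN]"; the real value of SPAN_TOKEN and the helper names are not taken from the documentation.

```python
from typing import List, Tuple

SPAN_TOKEN = "[SPAN]"  # hypothetical placeholder value


def mask_span(tokens: List[str], span_start: int, span_end: int) -> List[str]:
    """Replace tokens[span_start..span_end] (inclusive) with one SPAN_TOKEN."""
    return tokens[:span_start] + [SPAN_TOKEN] + tokens[span_end + 1:]


def make_ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """Build every ngram whose length lies in the inclusive ngram_range."""
    shortest, longest = ngram_range
    ngrams = []
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tokens = ["the", "execution", "date", "is", "2020-01-01"]
masked = mask_span(tokens, 4, 4)  # the date is the anchor span
bigrams = make_ngrams(masked, (2, 2))
print(bigrams)
```

Because each candidate ngram has its own bounding box, an alignment test would have to run per ngram rather than per word, which is the point of the note above.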

get_font_size

get_font_size(char_start, char_end)

Get font size of a span, rounded to the nearest int.

If the span contains multiple font sizes, the font size of the first word in the span is used.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The inclusive start index of the span in raw text rep of the whole doc. |
| char_end | int | | The inclusive end index of the span in raw text rep of the whole doc. |

Return type

int

get_page_num

get_page_num(char_start, char_end)

Get the 1-indexed page number of a span.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The inclusive start index of the span in raw text rep of the whole doc. |
| char_end | int | | The inclusive end index of the span in raw text rep of the whole doc. |

Return type

int

get_proximate_text

get_proximate_text(char_start, char_end, window, scope_unit='line', direction='before or after')

Returns text found near the given span.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| window | int | | The number of scope_units from the given span over which to aggregate text. |
| scope_unit | str | 'line' | The unit with which to define the window. |
| direction | str | 'before or after' | The direction of the window from the given span in which to aggregate text. |

Return type

str
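The window and direction parameters can be pictured with a minimal sketch over a list of lines. This is not the library's implementation: the direction values 'before' and 'after', the exclusion of the span's own line, and the space used to join lines are all assumptions.

```python
from typing import List


def proximate_text(lines: List[str], span_line: int, window: int,
                   direction: str = "before or after") -> str:
    """Join the text of up to `window` lines around lines[span_line]."""
    before = lines[max(0, span_line - window):span_line]
    after = lines[span_line + 1:span_line + 1 + window]
    if direction == "before":  # assumed value
        picked = before
    elif direction == "after":  # assumed value
        picked = after
    else:  # 'before or after', the documented default
        picked = before + after
    # Assumption: the span's own line is excluded and lines are space-joined.
    return " ".join(picked)


lines = ["a", "b", "c", "d", "e"]
print(proximate_text(lines, span_line=2, window=1))
```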

get_row_headers

get_row_headers(char_start, char_end, scope='page', multi_row=True, min_margin=10, max_gap=20, max_left_page_pct=50)

Heuristically fetch row headers and (optionally) inferred headers.

The horizontal header is the phrase in scope that is furthest to the left of the span. Each inferred header is the first string above and to the left of the span at a given indentation level.

Example: For the span “$1.00”, inferred headers will be [“Planes and other stuff”, “Equipment”, “Assets”] if multi_row is True, and [“other stuff”, “Equipment”, “Assets”] otherwise.

Assets
    Cash            $10.00
    Inventory       $5.00
    Equipment
        Cars        $4.00
        Planes and
        other stuff $1.00
Liabilities

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| scope | str | 'page' | The scope within which to look for headers. |
| min_margin | int | 10 | The minimum margin (in pixels) to the left and above a word that is required for it to be considered a valid inferred header (i.e., not on the same horizontal line as or vertically aligned with the previous header). |
| max_gap | int | 20 | The maximum gap allowed between words on a line for them to be considered part of the same header. |
| max_left_page_pct | int | 50 | A percentage (0-100) of the page from the left boundary that all headers must have a left coordinate less than (e.g., if 25, then all row headers must be on the left quarter of the page). |

Return type

Dict[str, Union[str, List[str]]]
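The inferred-header idea can be sketched as a walk upward through the lines, keeping the first line found at each shallower indentation level. This is a simplified illustration, not the library's heuristic: the pixel offsets are made up, only the horizontal part of min_margin is modeled, and multi_row merging, max_gap, and max_left_page_pct are ignored.

```python
from typing import List, Tuple

# (left coordinate in pixels, header text on that line); values are made up.
Line = Tuple[int, str]


def inferred_headers(lines: List[Line], span_row: int,
                     min_margin: int = 10) -> List[str]:
    """Collect headers from the span's row up to the outermost level."""
    current_left, text = lines[span_row]
    headers = [text]
    # Walk upward; a line counts as a header only if it starts at least
    # min_margin pixels to the left of the most recent header.
    for left, text in reversed(lines[:span_row]):
        if left <= current_left - min_margin:
            headers.append(text)
            current_left = left
    return headers


lines = [
    (0, "Assets"),
    (20, "Cash"),
    (20, "Inventory"),
    (20, "Equipment"),
    (40, "Cars"),
    (40, "Planes and"),
    (40, "other stuff"),
    (0, "Liabilities"),
]
result = inferred_headers(lines, span_row=6)
print(result)
```

With these made-up offsets the sketch reproduces the single-row result from the example above; handling the multi-row case would additionally require merging "Planes and" with "other stuff".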

get_span_features

get_span_features(char_start, char_end, config)

Get rich doc feature library for a given span.

Return type

Dict[str, Union[int, str]]
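A minimal sketch of how the returned dictionary might be consumed, assuming its keys are the string values defined on RichDocSpanFeatures. The constants are copied here so the example runs without the library; the feature values and the `looks_like_total` rule are made up for illustration.

```python
from typing import Dict, Union

# Copies of two RichDocSpanFeatures values, so the sketch is self-contained.
FONT_SIZE = "rich_doc_font_size"
ROW_HEADER = "rich_doc_row_header"


def looks_like_total(features: Dict[str, Union[int, str]]) -> bool:
    """Hypothetical rule: a reasonably large span on a row headed 'Total ...'."""
    header = str(features.get(ROW_HEADER, ""))
    font_size = int(features.get(FONT_SIZE, 0))
    return "total" in header.lower() and font_size >= 10


# Made-up feature values standing in for a get_span_features result.
features = {FONT_SIZE: 12, ROW_HEADER: "Total assets"}
print(looks_like_total(features))
```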

get_span_ngram

get_span_ngram(char_start, char_end, one_page=True)

Make an ngram corresponding to (char_start, char_end).

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The inclusive start index of the span in RichDoc.text. |
| char_end | int | | The inclusive end index of the span in RichDoc.text. |
| one_page | bool | True | If one_page == True, an ngram containing only those words occurring on the same page as the first word of the span will be returned. This allows us to make assumptions downstream about all words (and bbox coordinates) coming from the same page. |

Return type

Ngram
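Because char_end is inclusive, recovering the span's text from the raw document text needs a `+ 1` on the end index when using Python slicing. A short sketch with a made-up string:

```python
text = "Invoice date: 2020-01-01"

# Inclusive offsets of the date within the text.
char_start, char_end = 14, 23

# Python slices exclude the end index, so add 1 for an inclusive char_end.
span_text = text[char_start:char_end + 1]
print(span_text)
```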

get_span_row_text

get_span_row_text(char_start, char_end, row_offsets=(0, 0), mask_span=True, span_ngram=None)

Return the text from one or more rows in the vicinity of a span.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| row_offsets | Tuple[int, int] | (0, 0) | The inclusive range of rows to extract text from and concatenate. If more than one row is included, a delimiter is used between rows. For example, the range (-2, 1) would return a string containing the text from four rows: two before the span, the span’s own row, and one after the span. |

Return type

str
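The inclusive row_offsets range can be sketched as follows. The delimiter string is hypothetical (the actual delimiter is not specified here), clamping at the first and last row is an assumption, and span masking is left out.

```python
from typing import List, Tuple

ROW_DELIMITER = " | "  # hypothetical delimiter


def row_text(rows: List[str], span_row: int,
             row_offsets: Tuple[int, int] = (0, 0)) -> str:
    """Join the rows from span_row + offsets[0] to span_row + offsets[1]."""
    first = max(0, span_row + row_offsets[0])
    last = min(len(rows) - 1, span_row + row_offsets[1])
    # Both ends are inclusive, hence the + 1 on the slice end.
    return ROW_DELIMITER.join(rows[first:last + 1])


rows = ["Assets", "Cash $10.00", "Inventory $5.00", "Equipment", "Cars $4.00"]
print(row_text(rows, span_row=2, row_offsets=(-2, 1)))
```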

is_regex_aligned

is_regex_aligned(char_start, char_end, ngrams, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None, case_sensitive=False)

Returns whether or not regex_pattern is found in scope that aligns with the given span.

Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches up to a given threshold (either in pixels or a percentage of the page).

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| ngrams | List[Ngram] | | The regex ngrams to check for alignment with the given span. |
| scope | str | | The scope within which to check for the regex_pattern. |
| location | str | | The location of the span/regex bboxes to compare alignment. |
| threshold | Union[float, int] | 10 | The threshold used for computing alignment. |
| threshold_unit | str | 'pixels' | The unit of the threshold. |
| threshold_dir | Optional[str] | None | The direction on which the threshold applies. If None, the threshold applies in both directions. |
| case_sensitive | bool | False | True to use a case sensitive regex match, False otherwise. |

Return type

bool
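A minimal sketch of the threshold comparison described in the note, applied to one coordinate (for example LEFT) of the span and of a regex match. The unit name "percent" and the `page_extent` parameter are assumptions, only 'pixels' is documented as a value, and threshold_dir is not modeled.

```python
from typing import Optional, Union

Number = Union[int, float]


def is_aligned(span_coord: Number, match_coord: Number,
               threshold: Number = 10, threshold_unit: str = "pixels",
               page_extent: Optional[Number] = None) -> bool:
    """Return True if two coordinates differ by at most the threshold."""
    if threshold_unit == "pixels":
        limit = threshold
    elif threshold_unit == "percent":  # hypothetical unit name
        if page_extent is None:
            raise ValueError("page_extent is required for percentage thresholds")
        # Convert a percentage of the page width/height into pixels.
        limit = page_extent * threshold / 100
    else:
        raise ValueError(f"unknown threshold_unit: {threshold_unit}")
    # Both directions are allowed, matching threshold_dir=None.
    return abs(span_coord - match_coord) <= limit


print(is_aligned(100, 108), is_aligned(100, 111))
```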

is_regex_proximate

is_regex_proximate(char_start, char_end, regex_pattern, window, scope_unit='line', direction='before or after', case_sensitive=False)

Returns whether or not the given regex is found near the given span.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| regex_pattern | str | | The regex pattern to be matched near the span. |
| window | int | | The number of scope_units from the given span to check for the given regex. |
| scope_unit | str | 'line' | The unit with which to define the window. |
| direction | str | 'before or after' | The direction of the window from the given span to check for the given regex. |
| case_sensitive | bool | False | True for the regex match to be case sensitive, False otherwise. |

Return type

bool
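Conceptually, this check amounts to a regex search over the text inside the window. The sketch below shows only that last step with the standard `re` module; `regex_near` is a hypothetical helper and the window text is supplied directly rather than computed from a document.

```python
import re


def regex_near(window_text: str, regex_pattern: str,
               case_sensitive: bool = False) -> bool:
    """Return True if regex_pattern matches anywhere in window_text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(regex_pattern, window_text, flags) is not None


window_text = "Effective Date of agreement"
print(regex_near(window_text, r"effective\s+date"))
```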

is_regex_within_bounds

is_regex_within_bounds(char_start, char_end, ngrams, scope, span_horz_location, span_vert_location, regex_horz_location, regex_vert_location, horz_op_func, vert_op_func, disable_horz=False, disable_vert=False, case_sensitive=False)

Returns whether or not the given span has a position relative to the regex_pattern matches that meets the given conditions.

Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| ngrams | List[Ngram] | | The ngrams to check for alignment with the given span. |
| scope | str | | The scope within which to check for the regex_pattern. |
| span_horz_location | str | | The horizontal location of the span bboxes to compare alignment. |
| span_vert_location | str | | The vertical location of the span bboxes to compare alignment. |
| regex_horz_location | str | | The horizontal location of the regex bboxes to compare alignment. |
| regex_vert_location | str | | The vertical location of the regex bboxes to compare alignment. |
| horz_op_func | Callable | | The operator to compare the horizontal span_location and regex_location. |
| vert_op_func | Callable | | The operator to compare the vertical span_location and regex_location. |
| case_sensitive | bool | False | True to use a case sensitive regex match, False otherwise. |

Return type

bool
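The horz_op_func and vert_op_func parameters can be any two-argument callables, such as functions from the standard `operator` module. The sketch below is illustrative only: the lowercase location keys, the bounding boxes, and the argument order (span value first, regex value second) are assumptions.

```python
import operator
from typing import Callable, Dict


def within_bounds(span: Dict[str, int], match: Dict[str, int],
                  span_horz: str, span_vert: str,
                  regex_horz: str, regex_vert: str,
                  horz_op: Callable[[int, int], bool],
                  vert_op: Callable[[int, int], bool],
                  disable_horz: bool = False,
                  disable_vert: bool = False) -> bool:
    """Compare chosen span/regex coordinates with the given operators."""
    # Assumption: the span value is passed first, the regex value second.
    horz_ok = disable_horz or horz_op(span[span_horz], match[regex_horz])
    vert_ok = disable_vert or vert_op(span[span_vert], match[regex_vert])
    return horz_ok and vert_ok


span = {"left": 300, "right": 360, "top": 100, "bottom": 112}
match = {"left": 100, "right": 180, "top": 100, "bottom": 112}

# Is the span to the right of the match and on the same top edge?
print(within_bounds(span, match, "left", "top", "right", "top",
                    operator.gt, operator.eq))
```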

is_scope_aligned

is_scope_aligned(char_start, char_end, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None)

Return whether the span is aligned with respect to a given scope and location.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| char_start | int | | The (inclusive) start character of the span. |
| char_end | int | | The (inclusive) end character of the span. |
| scope | str | | The scope (line, page, etc.). |
| pars | | | A DataFrame of paragraph objects. |
| lines | | | A DataFrame of line objects. |
| words | | | A DataFrame of word objects, with bounding boxes, parent obj. assignments, etc. |

Return type

bool

Note: This uses the bbox for the given scope, which is generally just a convex hull of the words in the scope. The one exception is when scope = PAGE, for which the bbox is the outside border of the page.
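The note can be illustrated with a minimal sketch of how a scope's bounding box might be derived. `scope_bbox` and the tuple layout (left, top, right, bottom) are hypothetical, and the page size is supplied by the caller rather than read from a RichDoc.

```python
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom)


def scope_bbox(word_boxes: List[BBox], scope: str,
               page_size: Optional[Tuple[int, int]] = None) -> BBox:
    """Return the bbox used for alignment checks in the given scope."""
    if scope == "page":
        if page_size is None:
            raise ValueError("page_size is required when scope is 'page'")
        # Exception to the hull rule: use the outside border of the page.
        width, height = page_size
        return (0, 0, width, height)
    # Otherwise: convex hull of the words that belong to the scope.
    lefts, tops, rights, bottoms = zip(*word_boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


word_boxes = [(100, 50, 180, 62), (186, 48, 220, 64)]
print(scope_bbox(word_boxes, "line"))
print(scope_bbox(word_boxes, "page", page_size=(612, 792)))
```

A span whose LEFT is 100 would then be left-aligned with the line in this example, but not with the page, whose left border is at 0.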

property text: str