Version: 0.95

Rich doc libraries

Functionality for writing operators and LFs over Rich Document objects.

class rich_doc_wrapper.rich_doc_wrapper.Ngram(words)

Bases: Serializable

A class that represents an ngram (one or more grouped words).

Note: A list of ngrams is sometimes represented as rows in a DataFrame with the same fields, as a performance optimization.

Parameters:

words (DataFrame) – A DataFrame of words that represent the given ngram.

columns()
Return type:

List[str]

classmethod deserialize(serialized)

Deserialize an instance from a string.

Return type:

Ngram

scope_id(scope)
Return type:

int

serialize()

Serialize instance to string.

Return type:

str

property area_id: int
property bottom: int
property center: int
property char_end: int
property char_start: int
property left: int
property line_id: int
property middle: int
property page_id: int
property par_id: int
property right: int
property row_id: int
property text: str
property top: int
property word0: pandas.Series
property word_id: int

class rich_doc_wrapper.rich_doc_wrapper.RichDocSpanFeatures

Bases: object

Class that specifies RichDoc features.

ALIGNED_NGRAMS = 'rich_doc_aligned_ngrams'
FONT_SIZE = 'rich_doc_font_size'
HORZ_ALIGNED_NGRAMS = 'rich_doc_horz_aligned_ngrams'
INFERRED_ROW_HEADERS = 'rich_doc_inferred_row_headers'
PROXIMATE_TEXT = 'rich_doc_proximate_text'
PROXIMATE_TEXT_AFTER = 'rich_doc_proximate_text_after'
PROXIMATE_TEXT_BEFORE = 'rich_doc_proximate_text_before'
ROW_HEADER = 'rich_doc_row_header'
ROW_ID = 'rich_doc_row_id'
ROW_TEXT_AFTER = 'rich_doc_row_text_after'
ROW_TEXT_BEFORE = 'rich_doc_row_text_before'
ROW_TEXT_INLINE = 'rich_doc_row_text_inline'
VERT_ALIGNED_NGRAMS = 'rich_doc_vert_aligned_ngrams'

class rich_doc_wrapper.rich_doc_wrapper.RichDocWrapper(rd)

Bases: object

A class that wraps a RichDoc and calculates derivative values and properties.

For more details on RichDoc, see the documentation for the RichDoc class.

Definitions used in this object:

  • A span is a (char_start, char_end) that was produced by a SpanExtractor.
    • Most public methods are passed a span and calculate attributes w/r/t it.

    • NOTE: we expect (char_start, char_end) to be (inclusive, inclusive).

  • A word is a single row in a DataFrame that represents a single word.
    • Tokenization into words is performed by whatever tool was used to create the hOCR representation.

  • An ngram is a class that represents one or more words (usually from the same line).
    • It is usually represented via the Ngram class, but may be represented as a row in a DataFrame for efficiency when dealing with many ngrams.

    • Attributes will reflect a “convex hull” (min char_start/left/top, max char_end/right/bottom) where possible and the values of the first word otherwise.

    • All words are ngrams but not all ngrams are words.
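
The "convex hull" rule and the inclusive span convention can be illustrated with a minimal sketch. The Word dataclass and ngram_hull helper below are hypothetical stand-ins, not the library's own schema or API; they only mirror the field names listed for Ngram.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    """Hypothetical stand-in for one word row (not the library's schema)."""
    text: str
    char_start: int  # inclusive
    char_end: int    # inclusive
    left: int
    top: int
    right: int
    bottom: int


def ngram_hull(words: List[Word]) -> Dict[str, int]:
    """Combine word attributes: min of the start/left/top values, max of the rest."""
    return {
        "char_start": min(w.char_start for w in words),
        "char_end": max(w.char_end for w in words),
        "left": min(w.left for w in words),
        "top": min(w.top for w in words),
        "right": max(w.right for w in words),
        "bottom": max(w.bottom for w in words),
    }


doc_text = "execution date"
words = [
    Word("execution", 0, 8, left=100, top=50, right=180, bottom=62),
    Word("date", 10, 13, left=186, top=51, right=220, bottom=63),
]
hull = ngram_hull(words)

# char_end is inclusive, so Python slicing needs char_end + 1.
span_text = doc_text[hull["char_start"]:hull["char_end"] + 1]
```

Because both ends are inclusive, a one-character span has char_start == char_end.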

Other Notes:

  • [0, 0] is in the top-left corner (so bottom > top and right > left)

  • By default, for spans that cross page boundaries, all visual alignment information will be based only on the words that occur on the same page as the first word of the span.

  • CENTER is the halfway point between LEFT and RIGHT, MIDDLE is the halfway point between TOP and BOTTOM
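
As a rough sketch of the notes above, the snippet below keeps only the words on the first word's page and then derives CENTER and MIDDLE from the resulting bounding box. The tuple layout and helper names are assumptions made for illustration, and the real properties are typed int, so the library's rounding may differ from the floats returned here.

```python
from typing import List, Tuple

# Hypothetical word row: (page_id, left, top, right, bottom).
WordBox = Tuple[int, int, int, int, int]


def first_page_words(words: List[WordBox]) -> List[WordBox]:
    """Keep only the words on the same page as the first word of the span."""
    first_page = words[0][0]
    return [w for w in words if w[0] == first_page]


def center_and_middle(words: List[WordBox]) -> Tuple[float, float]:
    """Return (CENTER, MIDDLE) of the bounding box of the kept words."""
    kept = first_page_words(words)
    left = min(w[1] for w in kept)
    top = min(w[2] for w in kept)
    right = max(w[3] for w in kept)
    bottom = max(w[4] for w in kept)
    # Origin is the top-left corner, so right > left and bottom > top.
    return (left + right) / 2, (top + bottom) / 2


# A span whose last word spills onto page 2; that word is ignored.
span_words = [
    (1, 400, 700, 480, 712),
    (1, 486, 700, 540, 712),
    (2, 72, 40, 130, 52),
]
center, middle = center_and_middle(span_words)
```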

Parameters:
  • rd (RichDoc) – A RichDoc object containing textual, structural, visual, and tabular data about a rich document (e.g. a PDF).

  • text – The raw text corresponding to the RichDoc object rd, which is used for computing character alignments to the RichDoc object.

get_aligned_ngrams(char_start, char_end, locations, scope='page', vert_threshold=10, vert_threshold_unit='pixels', vert_threshold_dir=None, horz_threshold=10, horz_threshold_unit='pixels', horz_threshold_dir=None, mask_span_ngrams=True, ngram_range=(1, 1))

Return all ngrams that are aligned in the given scope/threshold by location.

Note that we can’t say whether an ngram is aligned in any particular direction until we know what the ngram is, so we can’t just identify the aligned words and then make ngrams after the fact (e.g., a date span may be aligned with only the word “the” in “the execution date”).

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • locations (List[str]) – The locations of the span/regex bboxes to use to compare alignments.

  • scope (str, default: 'page') – The scope within which to look for aligned ngrams.

  • vert_threshold (Union[float, int], default: 10) – The threshold used for comparing alignment vertically on the page.

  • vert_threshold_unit (str, default: 'pixels') – The unit of the vertical threshold.

  • vert_threshold_dir (Optional[str], default: None) – The direction on which the vertical threshold applies. If None, threshold applies in both directions.

  • horz_threshold (Union[float, int], default: 10) – The threshold used for comparing alignment horizontally on the page.

  • horz_threshold_unit (str, default: 'pixels') – The unit of the horizontal threshold.

  • horz_threshold_dir (Optional[str], default: None) – The direction on which the horizontal threshold applies. If None, threshold applies in both directions.

  • mask_span_ngrams (bool, default: True) – True to replace the anchor span with SPAN_TOKEN in ngrams. False otherwise.

  • ngram_range (Tuple[int, int], default: (1, 1)) – The range of ngram lengths to return.

Return type:

Dict[str, List[str]]
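
To make mask_span_ngrams and ngram_range concrete, here is a minimal sketch of ngram generation over one line of words. It is not the library's implementation: the SPAN_TOKEN value, the helper names, and the reading of ngram_range as inclusive on both ends are all assumptions.

```python
from typing import List, Tuple

SPAN_TOKEN = "[SPAN]"  # hypothetical placeholder; the real token value is not documented here


def mask_span(words: List[str], span_start: int, span_end: int) -> List[str]:
    """Replace words[span_start..span_end] (inclusive word indices) with one SPAN_TOKEN."""
    return words[:span_start] + [SPAN_TOKEN] + words[span_end + 1:]


def make_ngrams(tokens: List[str], ngram_range: Tuple[int, int] = (1, 1)) -> List[str]:
    """Return all ngrams whose length lies in ngram_range (assumed inclusive)."""
    shortest, longest = ngram_range
    ngrams = []
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


line = ["the", "execution", "date", "is", "January", "1"]
masked = mask_span(line, 4, 5)        # the span covers "January 1"
result = make_ngrams(masked, (1, 2))  # unigrams followed by bigrams
```

Building the ngrams first and testing alignment afterwards matches the note above: alignment is a property of the whole ngram, not of its individual words.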

get_font_size(char_start, char_end)

Get font size of a span, rounded to the nearest int.

If there are multiple font sizes, we default to font size of the first word in the span.

Parameters:
  • char_start (int) – The inclusive start index of the span in raw text rep of the whole doc.

  • char_end (int) – The inclusive end index of the span in raw text rep of the whole doc.

Return type:

int

get_page_num(char_start, char_end)

Get the 1-indexed page number of a span.

Parameters:
  • char_start (int) – The inclusive start index of the span in raw text rep of the whole doc.

  • char_end (int) – The inclusive end index of the span in raw text rep of the whole doc.

Return type:

int

get_proximate_text(char_start, char_end, window, scope_unit='line', direction='before or after')

Returns text found near the given span.

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • window (int) – The number of scope_units from the given span to aggregate text.

  • scope_unit (str, default: 'line') – The unit with which to define the window.

  • direction (str, default: 'before or after') – The direction from the given span in which to aggregate text.

Return type:

str
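
The window and direction parameters can be pictured with a small sketch over a list of lines. This is an illustration only: the 'before' and 'after' direction strings, the exclusion of the span's own line, and the space used to join lines are assumptions, not documented behavior.

```python
from typing import List


def proximate_text(lines: List[str], span_line: int, window: int,
                   direction: str = "before or after") -> str:
    """Join the text of up to `window` lines before and/or after the span's line."""
    before = lines[max(0, span_line - window):span_line]
    after = lines[span_line + 1:span_line + 1 + window]
    if direction == "before":    # assumed value
        picked = before
    elif direction == "after":   # assumed value
        picked = after
    else:                        # documented default: 'before or after'
        picked = before + after
    return " ".join(picked)


lines = [
    "Invoice No. 1234",
    "Date: 2020-01-01",
    "Total due: $100",  # the span lives on this line (index 2)
    "Thank you",
    "Page 1",
]
both = proximate_text(lines, span_line=2, window=1)
```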

get_row_headers(char_start, char_end, scope='page', multi_row=True, min_margin=10, max_gap=20, max_left_page_pct=50)

Heuristically fetch row headers and (optionally) inferred headers.

The horizontal header is the phrase in scope that is furthest to the left of the span. Each inferred header is the first string above and to the left of the span at a given indentation level.

Example: For the span “$1.00”, inferred headers will be [“Planes and other stuff”, “Equipment”, “Assets”] if multi_row is True, and [“other stuff”, “Equipment”, “Assets”] otherwise.

Assets
    Cash             $10.00
    Inventory         $5.00
    Equipment
        Cars          $4.00
        Planes and
        other stuff   $1.00
Liabilities

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • scope (str, default: 'page') – The scope within which to look for headers.

  • min_margin (int, default: 10) – The minimum margin (in pixels) to the left and above a word that is required for it to be considered as a valid inferred header (i.e., not on the same horizontal line as or vertically aligned with the previous header).

  • max_gap (int, default: 20) – The maximum gap allowed between words on a line for them to be considered part of the same header.

  • max_left_page_pct (int, default: 50) – A percentage (0-100) of the page from the left boundary that all headers must have a left coordinate less than (e.g., if 25, then all row headers must be on the left quarter of the page).

Return type:

Dict[str, Union[str, List[str]]]
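
The inferred-header heuristic can be sketched as an upward walk that keeps the first label found at each shallower indentation level. This is a simplified illustration, not the library's algorithm: it covers only the multi_row=False case, compares left coordinates only, and the pixel offsets below are invented for the example rows.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (left coordinate in pixels, label text); hypothetical layout


def inferred_headers(rows: List[Row], span_row: int, min_margin: int = 10) -> List[str]:
    """Walk upward from the span's row, keeping labels at least min_margin further left."""
    current_left, label = rows[span_row]
    headers = [label]
    for left, text in reversed(rows[:span_row]):
        if left <= current_left - min_margin:
            headers.append(text)
            current_left = left
    return headers


rows = [
    (0, "Assets"),
    (20, "Cash"),
    (20, "Inventory"),
    (20, "Equipment"),
    (40, "Cars"),
    (40, "Planes and"),
    (40, "other stuff"),  # the span "$1.00" sits on this row (index 6)
    (0, "Liabilities"),
]
headers = inferred_headers(rows, span_row=6)
```

With multi_row=True the first entry would instead be the merged label "Planes and other stuff", which this sketch does not attempt.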

get_span_features(char_start, char_end, config)

Get rich doc feature library for a given span.

Return type:

Dict[str, Union[int, str]]

get_span_ngram(char_start, char_end, one_page=True)

Make an ngram corresponding to (char_start, char_end).

Parameters:
  • char_start (int) – The inclusive start index of the span in RichDoc.text.

  • char_end (int) – The inclusive end index of the span in RichDoc.text.

  • one_page (bool, default: True) – If one_page == True, an ngram containing only those words occurring on the same page as the first word of the span will be returned. This allows us to make assumptions downstream about all words (and bbox coordinates) coming from the same page.

Return type:

Ngram

get_span_row_text(char_start, char_end, row_offsets=(0, 0), mask_span=True, span_ngram=None)

Return the text from one or more rows in the vicinity of a span.

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • row_offsets (Tuple[int, int], default: (0, 0)) – The inclusive range of rows to extract text from and concatenate. If more than one row is included, a delimiter is used between rows. The range (-2, 1) would return a string containing the text from four rows: two before the span, the span’s own row, and one after the span.

Return type:

str
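
A minimal sketch of how row_offsets selects rows relative to the span's row. The delimiter string and the clamping at the first and last row are assumptions; masking of the span (mask_span) is left out.

```python
from typing import List, Tuple


def span_row_text(rows: List[str], span_row: int,
                  row_offsets: Tuple[int, int] = (0, 0),
                  delimiter: str = " | ") -> str:
    """Concatenate the rows in the inclusive offset range around span_row."""
    first = max(0, span_row + row_offsets[0])
    last = min(len(rows) - 1, span_row + row_offsets[1])
    return delimiter.join(rows[first:last + 1])


rows = ["Assets", "Cash $10.00", "Inventory $5.00", "Equipment", "Cars $4.00"]

# The span is on row index 2; (-2, 1) covers four rows in total.
text = span_row_text(rows, span_row=2, row_offsets=(-2, 1))
```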

is_regex_aligned(char_start, char_end, ngrams, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None, case_sensitive=False)

Returns whether or not regex_pattern is found in scope that aligns with the given span.

Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches up to a given threshold (either in pixels or a percentage of the page).

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • ngrams (List[Ngram]) – The regex ngrams to check for alignment with the given span.

  • scope (str) – The scope within which to check for the regex_pattern.

  • location (str) – The location of the span/regex bboxes to compare alignment.

  • threshold (Union[float, int], default: 10) – The threshold used for computing alignment.

  • threshold_unit (str, default: 'pixels') – The unit of the threshold.

  • threshold_dir (Optional[str], default: None) – The direction on which the threshold applies. If None, the threshold applies in both directions.

  • case_sensitive (bool, default: False) – True to use a case sensitive regex match, False otherwise.

Return type:

bool
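
The threshold comparison described in the note can be sketched as follows. This is an assumption-laden illustration: 'percent' is a made-up name for the percentage unit, page_extent is a hypothetical argument for the page width or height, and threshold_dir is not modeled.

```python
from typing import Optional


def is_aligned(span_value: float, match_value: float, threshold: float = 10,
               threshold_unit: str = "pixels",
               page_extent: Optional[float] = None) -> bool:
    """Compare one location (e.g. LEFT) of the span and of a regex match."""
    if threshold_unit == "pixels":
        limit = threshold
    elif threshold_unit == "percent":  # hypothetical unit name
        if page_extent is None:
            raise ValueError("page_extent is required for a percentage threshold")
        limit = page_extent * threshold / 100
    else:
        raise ValueError(f"unknown threshold_unit: {threshold_unit!r}")
    return abs(span_value - match_value) <= limit


# LEFT edges 8 pixels apart: aligned under the default 10-pixel threshold.
close = is_aligned(100, 108)
# 15 pixels apart: not aligned in pixels, but within 2% of a 1000-pixel-wide page.
far_pixels = is_aligned(100, 115)
far_percent = is_aligned(100, 115, threshold=2, threshold_unit="percent", page_extent=1000)
```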

is_regex_proximate(char_start, char_end, regex_pattern, window, scope_unit='line', direction='before or after', case_sensitive=False)

Returns whether or not the given regex is found near the given span.

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • regex_pattern (str) – The regex pattern to be matched near the span.

  • window (int) – The number of scope_units from the given span to check for the given regex.

  • scope_unit (str, default: 'line') – The unit with which to define the window.

  • direction (str, default: 'before or after') – The direction of the window from the given span to check for the given regex.

  • case_sensitive (bool, default: False) – True for the regex match to be case sensitive, False otherwise.

Return type:

bool
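
A short sketch of a proximity check with a case-sensitivity switch, using Python's standard re module. The line-based window and the exclusion of the span's own line are assumptions; only the default direction ('before or after') is modeled.

```python
import re
from typing import List


def regex_near(lines: List[str], span_line: int, regex_pattern: str,
               window: int, case_sensitive: bool = False) -> bool:
    """Search up to `window` lines on either side of the span's line."""
    flags = 0 if case_sensitive else re.IGNORECASE
    start = max(0, span_line - window)
    nearby = lines[start:span_line] + lines[span_line + 1:span_line + 1 + window]
    return any(re.search(regex_pattern, line, flags) for line in nearby)


lines = [
    "Invoice No. 1234",
    "Date: 2020-01-01",
    "Total due: $100",  # the span lives on this line (index 2)
    "Thank you",
]
found = regex_near(lines, span_line=2, regex_pattern=r"date", window=1)
found_strict = regex_near(lines, span_line=2, regex_pattern=r"date", window=1,
                          case_sensitive=True)
```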

is_regex_within_bounds(char_start, char_end, ngrams, scope, span_horz_location, span_vert_location, regex_horz_location, regex_vert_location, horz_op_func, vert_op_func, disable_horz=False, disable_vert=False, case_sensitive=False)

Returns whether or not given span has position relative to regex_pattern that meets the conditions.

Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches.

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • ngrams (List[Ngram]) – The ngrams to check for alignment with the given span.

  • scope (str) – The scope within which to check for the regex_pattern.

  • span_horz_location (str) – The horizontal location of the span bboxes to compare alignment.

  • span_vert_location (str) – The vertical location of the span bboxes to compare alignment.

  • regex_horz_location (str) – The horizontal location of the regex bboxes to compare alignment.

  • regex_vert_location (str) – The vertical location of the regex bboxes to compare alignment.

  • horz_op_func (Callable) – The operator to compare the horizontal span_location and regex_location.

  • vert_op_func (Callable) – The operator to compare the vertical span_location and regex_location.

  • case_sensitive (bool, default: False) – True to use a case sensitive regex match, False otherwise.

Return type:

bool
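
One plausible reading of horz_op_func and vert_op_func is that they are binary comparison callables such as those in Python's operator module. The sketch below works under that assumption; the argument order (span value first, regex value second), the lowercase location keys, and the dict-based bounding boxes are all hypothetical.

```python
import operator
from typing import Callable, Dict

BBox = Dict[str, float]  # hypothetical keys: left, center, right, top, middle, bottom


def within_bounds(span: BBox, match: BBox,
                  span_horz_location: str, span_vert_location: str,
                  regex_horz_location: str, regex_vert_location: str,
                  horz_op_func: Callable[[float, float], bool],
                  vert_op_func: Callable[[float, float], bool],
                  disable_horz: bool = False, disable_vert: bool = False) -> bool:
    """Apply each comparison as op(span_value, regex_value) unless disabled."""
    horz_ok = disable_horz or horz_op_func(span[span_horz_location],
                                           match[regex_horz_location])
    vert_ok = disable_vert or vert_op_func(span[span_vert_location],
                                           match[regex_vert_location])
    return bool(horz_ok and vert_ok)


span = {"left": 300, "center": 330, "right": 360, "top": 100, "middle": 106, "bottom": 112}
label = {"left": 200, "center": 230, "right": 260, "top": 100, "middle": 106, "bottom": 112}

# Is the span's LEFT edge to the right of the label's RIGHT edge, and not above it?
right_of_label = within_bounds(span, label, "left", "top", "right", "top",
                               operator.gt, operator.ge)
```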

is_scope_aligned(char_start, char_end, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None)

Return whether the span is aligned w/r/t a given scope and location.

Parameters:
  • char_start (int) – The (inclusive) start character of the span.

  • char_end (int) – The (inclusive) end character of the span.

  • scope (str) – The scope (line, page, etc.).

  • pars – A DataFrame of paragraph objects.

  • lines – A DataFrame of line objects.

  • words – A DataFrame of word objects, with bounding boxes, parent obj. assignments, etc.

Return type:

bool

Note: This uses the bbox for the given scope, which is generally just a convex hull of the words in the scope. The one exception is when scope = PAGE, for which the bbox is actually just the outside border of the page.
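
The difference between the page scope and other scopes can be sketched as below. The helper names, the tuple-based boxes, and the 'page' scope string are assumptions for illustration; the example checks horizontal centering only.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def scope_bbox(word_boxes: List[Box], scope: str,
               page_size: Optional[Tuple[float, float]] = None) -> Box:
    """Page scope uses the page border; other scopes use the hull of their words."""
    if scope == "page":
        if page_size is None:
            raise ValueError("page_size is required for the page scope")
        width, height = page_size
        return (0.0, 0.0, width, height)
    return (min(b[0] for b in word_boxes), min(b[1] for b in word_boxes),
            max(b[2] for b in word_boxes), max(b[3] for b in word_boxes))


def is_centered(span_box: Box, scope_box: Box, threshold: float = 10) -> bool:
    """Compare the CENTER of the span with the CENTER of the scope."""
    span_center = (span_box[0] + span_box[2]) / 2
    scope_center = (scope_box[0] + scope_box[2]) / 2
    return abs(span_center - scope_center) <= threshold


span_box = (276, 100, 336, 112)
line_words = [(72, 100, 140, 112), span_box]

centered_on_page = is_centered(span_box, scope_bbox(line_words, "page", (612, 792)))
centered_on_line = is_centered(span_box, scope_bbox(line_words, "line"))
```

The same span can be centered on the page yet off-center within its line, because the line's bbox is only as wide as the words it contains.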

property text: str