Rich doc libraries | Snorkel AI

class rich_doc_wrapper.rich_doc_wrapper.Ngram(words)

Bases: Serializable

A class that represents an ngram (one or more grouped words).

Note: A list of ngrams is sometimes represented as rows in a DataFrame with the same fields, as a performance optimization.

Parameters

Name	Type	Default	Info
words	`DataFrame`		A DataFrame of words that represent the given ngram.

deserialize

classmethod deserialize(serialized)

Deserialize instance from string.

Return type: Ngram

scope_id

scope_id(scope)

Return type: int

serialize

serialize()

Serialize instance to string.

Return type: str

property area_id: int

property bottom: int

property center: int

property char_end: int

property char_start: int

property left: int

property line_id: int

property middle: int

property page_id: int

property par_id: int

property right: int

property row_id: int

property text: str

property top: int

property word0: pandas.Series

property word_id: int

class rich_doc_wrapper.rich_doc_wrapper.RichDocSpanFeatures

Bases: object

Class that specifies RichDoc features.

ALIGNED_NGRAMS = 'rich_doc_aligned_ngrams'

FONT_SIZE = 'rich_doc_font_size'

HORZ_ALIGNED_NGRAMS = 'rich_doc_horz_aligned_ngrams'

INFERRED_ROW_HEADERS = 'rich_doc_inferred_row_headers'

PROXIMATE_TEXT = 'rich_doc_proximate_text'

PROXIMATE_TEXT_AFTER = 'rich_doc_proximate_text_after'

PROXIMATE_TEXT_BEFORE = 'rich_doc_proximate_text_before'

ROW_HEADER = 'rich_doc_row_header'

ROW_ID = 'rich_doc_row_id'

ROW_TEXT_AFTER = 'rich_doc_row_text_after'

ROW_TEXT_BEFORE = 'rich_doc_row_text_before'

ROW_TEXT_INLINE = 'rich_doc_row_text_inline'

VERT_ALIGNED_NGRAMS = 'rich_doc_vert_aligned_ngrams'

class rich_doc_wrapper.rich_doc_wrapper.RichDocWrapper(rd)

Bases: object

A class that wraps a RichDoc and calculates derivative values and properties.

For more details on RichDoc, see the documentation for the RichDoc class.

Definitions used in this object:

A span is a (char_start, char_end) that was produced by a SpanExtractor.
- Most public methods are passed a span and calculate attributes w/r/t it.
- NOTE: we expect (char_start, char_end) to be (inclusive, inclusive).
A word is a single row in a DataFrame that represents a single word.
- Tokenization into words is performed by whatever tool was used to create the hOCR representation.
An ngram is a class that represents one or more words (usually from the same line)
- It is usually represented via the Ngram class, but may be represented as a row in a DataFrame for efficiency when dealing with many ngrams.
- Attributes will reflect a “convex hull” (min char_start/left/top, max char_end/right/bottom) where possible and the values of the first word otherwise.
- All words are ngrams but not all ngrams are words.
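
The "convex hull" rule and the inclusive span convention described above can be illustrated with a minimal, standalone sketch. It uses plain dictionaries in place of word DataFrame rows; the helper name `hull_of_words`, the sample coordinates, and joining word text with a single space are all assumptions made for illustration, not part of the library.

```python
from typing import Any, Dict, List


def hull_of_words(words: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine word rows into ngram-style attributes (illustrative only)."""
    first = words[0]
    return {
        # Extents: minimum of the starting edges, maximum of the ending edges.
        "char_start": min(w["char_start"] for w in words),
        "char_end": max(w["char_end"] for w in words),
        "left": min(w["left"] for w in words),
        "top": min(w["top"] for w in words),
        "right": max(w["right"] for w in words),
        "bottom": max(w["bottom"] for w in words),
        # Attributes without a meaningful hull fall back to the first word.
        "page_id": first["page_id"],
        "line_id": first["line_id"],
        # Assumption: word texts are joined with a single space.
        "text": " ".join(w["text"] for w in words),
    }


text = "the execution date"
words = [
    {"text": "execution", "char_start": 4, "char_end": 12,
     "left": 50, "top": 100, "right": 120, "bottom": 112,
     "page_id": 0, "line_id": 3},
    {"text": "date", "char_start": 14, "char_end": 17,
     "left": 126, "top": 101, "right": 158, "bottom": 113,
     "page_id": 0, "line_id": 3},
]
ngram = hull_of_words(words)

# Spans are (inclusive, inclusive), so Python slicing needs char_end + 1.
span_text = text[ngram["char_start"]: ngram["char_end"] + 1]
```

Here the two-word ngram spans characters 4 through 17 and its bounding box covers both words, while `page_id` and `line_id` are taken from the first word.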

Other Notes:

[0, 0] is in the top-left corner (so bottom > top and right > left)
By default, for spans that cross page boundaries, all visual alignment information will be based only on the words that occur on the same page as the first word of the span.
CENTER is the halfway point between LEFT and RIGHT, MIDDLE is the halfway point between TOP and BOTTOM
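
As a small sketch of these coordinate conventions, the hypothetical `BBox` class below (not part of the library) derives CENTER and MIDDLE from the four edges. Integer floor division is an assumption; the library's actual rounding is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box with the origin [0, 0] at the top-left."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def center(self) -> int:
        # Horizontal halfway point between LEFT and RIGHT.
        return (self.left + self.right) // 2

    @property
    def middle(self) -> int:
        # Vertical halfway point between TOP and BOTTOM.
        return (self.top + self.bottom) // 2


box = BBox(left=50, top=100, right=120, bottom=112)
# With a top-left origin, larger values are further down and to the right.
origin_is_top_left = box.bottom > box.top and box.right > box.left
```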

Parameters

Name	Type	Default	Info
rd	`RichDoc`		A RichDoc object containing textual, structural, visual, and tabular data about a rich document (e.g. a PDF).
text			The raw text corresponding to the RichDoc object rd, which is used for computing character alignments to the RichDoc object.

get_aligned_ngrams

get_aligned_ngrams(char_start, char_end, locations, scope='page', vert_threshold=10, vert_threshold_unit='pixels', vert_threshold_dir=None, horz_threshold=10, horz_threshold_unit='pixels', horz_threshold_dir=None, mask_span_ngrams=True, ngram_range=(1, 1))

Return all ngrams that are aligned in the given scope/threshold by location.

Note that we can’t say whether an ngram is aligned in any particular direction until we know what the ngram is, so we can’t just identify the aligned words and then make ngrams after the fact (e.g., a date span may be aligned with only the word “the” in “the execution date”).

Parameters
Return type: Dict[str, List[str]]

Name	Type	Default	Info
char_start	`int`		The (inclusive) start character of the span.
char_end	`int`		The (inclusive) end character of the span.
locations	`List[str]`		The locations of the span/regex bboxes to use to compare alignments.
scope	`str`	`'page'`	The scope within which to look for aligned ngrams.
vert_threshold	`Union[float, int]`	`10`	The threshold used for comparing alignment vertically on the page.
vert_threshold_unit	`str`	`'pixels'`	The unit of the vertical threshold.
vert_threshold_dir	`Optional[str]`	`None`	The direction on which the vertical threshold applies. If None, threshold applies in both directions.
horz_threshold	`Union[float, int]`	`10`	The threshold used for comparing alignment horizontally on the page.
horz_threshold_unit	`str`	`'pixels'`	The unit of the horizontal threshold.
horz_threshold_dir	`Optional[str]`	`None`	The direction on which the horizontal threshold applies. If None, threshold applies in both directions.
mask_span_ngrams	`bool`	`True`	True to replace the anchor span with SPAN_TOKEN in ngrams. False otherwise.
ngram_range	`Tuple[int, int]`	`(1, 1)`	The range of ngram lengths to return.
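
To make `ngram_range` and `mask_span_ngrams` more concrete, here is a standalone sketch of how ngrams of several lengths might be built from one line of words, with the anchor span masked. The function `line_ngrams`, the placeholder value `"[SPAN]"`, and treating `ngram_range` as inclusive at both ends are assumptions for illustration; the library's actual SPAN_TOKEN value and masking logic may differ.

```python
from typing import List, Set, Tuple

SPAN_TOKEN = "[SPAN]"  # hypothetical placeholder; the real value may differ


def line_ngrams(
    words: List[str],
    span_indices: Set[int],
    ngram_range: Tuple[int, int] = (1, 1),
    mask_span: bool = True,
) -> List[str]:
    """Build ngrams from a line, optionally masking the anchor span."""
    tokens: List[str] = []
    for i, word in enumerate(words):
        if mask_span and i in span_indices:
            # Collapse consecutive span words into a single placeholder.
            if not tokens or tokens[-1] != SPAN_TOKEN:
                tokens.append(SPAN_TOKEN)
        else:
            tokens.append(word)

    shortest, longest = ngram_range  # assumed inclusive on both ends
    ngrams: List[str] = []
    for n in range(shortest, longest + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


words = ["the", "execution", "date", "is", "2020-01-01"]
result = line_ngrams(words, span_indices={4}, ngram_range=(1, 2))
```

With a range of `(1, 2)` the sketch yields every unigram and bigram on the line, and the date span appears only as the placeholder token.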

get_font_size

get_font_size(char_start, char_end)

Get font size of a span, rounded to the nearest int.

If there are multiple font sizes, we default to font size of the first word in the span.

Parameters
Return type: int

Name	Type	Default	Info
char_start	`int`		The inclusive start index of the span in raw text rep of the whole doc.
char_end	`int`		The inclusive end index of the span in raw text rep of the whole doc.

get_page_num

get_page_num(char_start, char_end)

Get the 1-indexed page number of a span.

Parameters
Return type: int

Name	Type	Default	Info
char_start	`int`		The inclusive start index of the span in raw text rep of the whole doc.
char_end	`int`		The inclusive end index of the span in raw text rep of the whole doc.

get_proximate_text

get_proximate_text(char_start, char_end, window, scope_unit='line', direction='before or after')

Returns text found near the given span.

Parameters
Return type: str

Name	Type	Default	Info
char_start	`int`		The (inclusive) start character of the span.
char_end	`int`		The (inclusive) end character of the span.
window	`int`		The number of scope_units from the given span to aggregate text.
scope_unit	`str`	`'line'`	The unit with which to define the window.
direction	`str`	`'before or after'`	The direction of the window from the given span in which to aggregate text.
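
The interaction of `window`, `scope_unit='line'`, and `direction` can be sketched without the library as follows. The function `proximate_lines` is hypothetical; the direction values `'before'` and `'after'`, the exclusion of the span's own line, and the space used to join lines are assumptions rather than documented behavior.

```python
from typing import List


def proximate_lines(
    lines: List[str],
    span_line: int,
    window: int,
    direction: str = "before or after",
) -> str:
    """Collect text from up to `window` lines around the span's line."""
    before = lines[max(0, span_line - window):span_line]
    after = lines[span_line + 1:span_line + 1 + window]
    if direction == "before":
        selected = before
    elif direction == "after":
        selected = after
    else:  # "before or after"
        selected = before + after
    return " ".join(selected)


lines = ["Invoice", "Date: 2020-01-01", "Total due", "$1.00", "Thank you"]
both = proximate_lines(lines, span_line=3, window=1)
only_before = proximate_lines(lines, span_line=3, window=2, direction="before")
```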

get_row_headers

get_row_headers(char_start, char_end, scope='page', multi_row=True, min_margin=10, max_gap=20, max_left_page_pct=50)

Heuristically fetch row headers and (optionally) inferred headers.

The horizontal header is the phrase in scope that is furthest to the left of the span. The inferred headers are the first strings above and to the left of the span at each indentation level.

Example: For the span “$1.00”, inferred headers will be [“Planes and other stuff”, “Equipment”, “Assets”] if multi_row is True, and [“other stuff”, “Equipment”, “Assets”] otherwise.

Assets
    Cash             $10.00
    Inventory         $5.00
    Equipment
        Cars          $4.00
        Planes and
        other stuff   $1.00
Liabilities

Parameters
Return type: Dict[str, Union[str, List[str]]]

Name	Type	Default	Info
char_start	`int`		The (inclusive) start character of the span.
char_end	`int`		The (inclusive) end character of the span.
scope	`str`	`'page'`	The scope within which to look for headers.
min_margin	`int`	`10`	The minimum margin (in pixels) to the left and above a word that is required for it to be considered a valid inferred header (i.e., not on the same horizontal line as or vertically aligned with the previous header).
max_gap	`int`	`20`	The maximum gap allowed between words on a line for them to be considered part of the same header.
max_left_page_pct	`int`	`50`	A percentage (0-100) of the page from the left boundary that all headers must have a left coordinate less than (e.g., if 25, then all row headers must be on the left quarter of the page).
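
A heavily simplified sketch of the indentation idea behind inferred headers is shown below. It only walks upward collecting lines that are indented less than the current level by at least `min_margin`; it does not model the row header itself, `multi_row` merging, `max_gap`, or `max_left_page_pct`. The function `inferred_ancestors` and the pixel values are hypothetical.

```python
from typing import List, Tuple


def inferred_ancestors(
    rows: List[Tuple[int, str]],
    index: int,
    min_margin: int = 10,
) -> List[str]:
    """Walk upward from rows[index], collecting less-indented lines."""
    headers: List[str] = []
    current_left = rows[index][0]
    for left, text in reversed(rows[:index]):
        # A header must start at least `min_margin` pixels further left.
        if left <= current_left - min_margin:
            headers.append(text)
            current_left = left
    return headers


# (left coordinate in pixels, leftmost phrase on the line)
rows = [
    (0, "Assets"),
    (40, "Cash"),
    (40, "Inventory"),
    (40, "Equipment"),
    (80, "Cars"),
    (80, "Planes and"),
    (80, "other stuff"),
    (0, "Liabilities"),
]
ancestors = inferred_ancestors(rows, index=6)
```

For the "other stuff" line this returns the two enclosing levels from the example above, nearest first.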

get_span_features

get_span_features(char_start, char_end, config)

Get rich doc feature library for a given span.

Return type: Dict[str, Union[int, str]]

get_span_ngram

get_span_ngram(char_start, char_end, one_page=True)

Make an ngram corresponding to (char_start, char_end).

Parameters
Return type: Ngram

Name	Type	Default	Info
char_start	`int`		The inclusive start index of the span in RichDoc.text.
char_end	`int`		The inclusive end index of the span in RichDoc.text.
one_page	`bool`	`True`	If one_page == True, an ngram containing only those words occurring on the same page as the first word of the span will be returned. This allows us to make assumptions downstream about all words (and bbox coordinates) coming from the same page.

get_span_row_text

get_span_row_text(char_start, char_end, row_offsets=(0, 0), mask_span=True, span_ngram=None)

Return the text from one or more rows in the vicinity of a span.

Parameters
Return type: str

Name	Type	Default	Info
char_start	`int`		The (inclusive) start character of the span.
char_end	`int`		The (inclusive) end character of the span.
row_offsets	`Tuple[int, int]`	`(0, 0)`	The inclusive range of rows to extract text from and concatenate. If more than one row is included, a delimiter is used between rows. The range (-2, 1) would return a string containing the text from four rows: two before the span, the span’s own row, and one after the span.
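
The inclusive `row_offsets` range can be illustrated with a short standalone sketch. The function `rows_in_range` and the `" | "` delimiter are hypothetical; the library's actual delimiter is not specified here.

```python
from typing import List, Tuple


def rows_in_range(
    rows: List[str],
    span_row: int,
    row_offsets: Tuple[int, int] = (0, 0),
    delimiter: str = " | ",  # assumed delimiter for illustration
) -> str:
    """Join the rows from span_row + offsets[0] to span_row + offsets[1]."""
    first = max(0, span_row + row_offsets[0])
    last = min(len(rows) - 1, span_row + row_offsets[1])
    # Both ends are inclusive, hence last + 1 in the slice.
    return delimiter.join(rows[first:last + 1])


rows = ["row 0", "row 1", "row 2", "row 3", "row 4"]
four_rows = rows_in_range(rows, span_row=2, row_offsets=(-2, 1))
own_row = rows_in_range(rows, span_row=2)
```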

is_regex_aligned

is_regex_aligned(char_start, char_end, ngrams, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None, case_sensitive=False)

Returns whether or not any of the given regex ngrams found in scope is aligned with the given span.

Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches up to a given threshold (either in pixels or a percentage of the page).

Parameters
Return type: bool

Name	Type	Default	Info
char_start	`int`		The (inclusive) start character of the span.
char_end	`int`		The (inclusive) end character of the span.
ngrams	`List[Ngram]`		The regex ngrams to check for alignment with the given span.
scope	`str`		The scope within which to check for the regex_pattern.
location	`str`		The location of the span/regex bboxes to compare alignment.
threshold	`Union[float, int]`	`10`	The threshold used for computing alignment.
threshold_unit	`str`	`'pixels'`	The unit of the threshold.
threshold_dir	`Optional[str]`	`None`	The direction on which the threshold applies. If None, the threshold applies in both directions.
case_sensitive	`bool`	`False`	True to use a case sensitive regex match, False otherwise.
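
The threshold comparison described in the note can be sketched as below. The function `within_threshold`, the `page_extent` argument, the unit name `'percent'`, and the inclusive comparison are assumptions for illustration; directional thresholds (`threshold_dir`) are not modeled.

```python
from typing import Optional, Union

Number = Union[int, float]


def within_threshold(
    span_coord: Number,
    match_coord: Number,
    threshold: Number = 10,
    threshold_unit: str = "pixels",
    page_extent: Optional[Number] = None,
) -> bool:
    """Compare one location (e.g. LEFT) of two boxes, both directions."""
    if threshold_unit == "pixels":
        limit = threshold
    else:
        # Assumed: any other unit means a percentage of the page extent.
        if page_extent is None:
            raise ValueError("page_extent is required for percentage units")
        limit = page_extent * threshold / 100
    return abs(span_coord - match_coord) <= limit


close_enough = within_threshold(50, 58)   # 8 pixels apart
too_far = within_threshold(50, 61)        # 11 pixels apart
by_percent = within_threshold(
    50, 61, threshold=2, threshold_unit="percent", page_extent=612
)
```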

is_regex_proximate

is_regex_proximate(char_start, char_end, regex_pattern, window, scope_unit='line', direction='before or after', case_sensitive=False)

Returns whether or not the given regex is found near the given span.

Parameters
Return type: bool

Name	Type	Default	Info
char_start	`int`		The (inclusive) start character of the span.
char_end	`int`		The (inclusive) end character of the span.
regex_pattern	`str`		The regex pattern to be matched near the span.
window	`int`		The number of scope_units from the given span to check for the given regex.
scope_unit	`str`	`'line'`	The unit with which to define the window.
direction	`str`	`'before or after'`	The direction of the window from the given span to check for the given regex.
case_sensitive	`bool`	`False`	True for the regex match to be case sensitive, False otherwise.
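
Once the nearby text has been collected, the regex check itself is a standard search. The sketch below shows one plausible way `case_sensitive` could map onto Python's `re` flags; the function `regex_near` is hypothetical and the library's implementation may differ.

```python
import re


def regex_near(
    nearby_text: str,
    regex_pattern: str,
    case_sensitive: bool = False,
) -> bool:
    """Return True if the pattern matches anywhere in the nearby text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(regex_pattern, nearby_text, flags) is not None


nearby = "Total Due"
insensitive_hit = regex_near(nearby, r"total\s+due")
sensitive_hit = regex_near(nearby, r"total\s+due", case_sensitive=True)
```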

is_regex_within_bounds

is_regex_within_bounds(char_start, char_end, ngrams, scope, span_horz_location, span_vert_location, regex_horz_location, regex_vert_location, horz_op_func, vert_op_func, disable_horz=False, disable_vert=False, case_sensitive=False)

Returns whether or not the given span has a position relative to the regex matches that meets the given conditions.

Note: Alignment is assessed by comparing a given location (LEFT/CENTER/RIGHT/TOP/MIDDLE/BOTTOM) on both the span and any regex matches.

Parameters
Return type: bool

Name	Type	Default	Info
char_start	`int`		The (inclusive) start character of the span.
char_end	`int`		The (inclusive) end character of the span.
ngrams	`List[Ngram]`		The ngrams to check for alignment with the given span.
scope	`str`		The scope within which to check for the regex_pattern.
span_horz_location	`str`		The horizontal location of the span bboxes to compare alignment.
span_vert_location	`str`		The vertical location of the span bboxes to compare alignment.
regex_horz_location	`str`		The horizontal location of the regex bboxes to compare alignment.
regex_vert_location	`str`		The vertical location of the regex bboxes to compare alignment.
horz_op_func	`Callable`		The operator to compare the horizontal span_location and regex_location.
vert_op_func	`Callable`		The operator to compare the vertical span_location and regex_location.
case_sensitive	`bool`	`False`	True to use a case sensitive regex match, False otherwise.
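
The role of `horz_op_func` and `vert_op_func` can be sketched with functions from Python's `operator` module. The helper `within_bounds`, the lowercase location keys, and the argument order (span value first, regex value second) are assumptions for illustration only.

```python
import operator
from typing import Callable, Dict


def within_bounds(
    span_box: Dict[str, int],
    regex_box: Dict[str, int],
    span_horz_location: str,
    span_vert_location: str,
    regex_horz_location: str,
    regex_vert_location: str,
    horz_op_func: Callable[[int, int], bool],
    vert_op_func: Callable[[int, int], bool],
    disable_horz: bool = False,
    disable_vert: bool = False,
) -> bool:
    """Apply one comparison per axis; a disabled axis always passes."""
    horz_ok = disable_horz or horz_op_func(
        span_box[span_horz_location], regex_box[regex_horz_location]
    )
    vert_ok = disable_vert or vert_op_func(
        span_box[span_vert_location], regex_box[regex_vert_location]
    )
    return bool(horz_ok and vert_ok)


span = {"left": 300, "right": 340, "top": 100, "bottom": 112}
label = {"left": 50, "right": 120, "top": 100, "bottom": 112}

# Is the span's LEFT edge to the right of the label's RIGHT edge?
span_right_of_label = within_bounds(
    span, label, "left", "top", "right", "top",
    horz_op_func=operator.gt, vert_op_func=operator.eq, disable_vert=True,
)
```

Passing operators as callables keeps the positional test generic: swapping `operator.gt` for `operator.lt` checks the opposite side without changing the helper.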

is_scope_aligned

is_scope_aligned(char_start, char_end, scope, location, threshold=10, threshold_unit='pixels', threshold_dir=None)

Return whether the span is aligned w/r/t a given scope and location.

Parameters
Return type: bool

Name	Type	Info
char_start	`int`	The (inclusive) start character of the span.
char_end	`int`	The (inclusive) end character of the span.
scope	`str`	The scope (line, page, etc.).
pars		A DataFrame of paragraph objects.
lines		A DataFrame of line objects.
words		A DataFrame of word objects, with bounding boxes, parent obj. assignments, etc.

Note: This uses the bbox for the given scope, which is generally just a convex hull of the words in the scope. The one exception is when scope = PAGE, for which the bbox is actually just the outside border of the page.

property text: str
