Version: 0.95

Rich document LF builders

If your application utilizes rich document structure (see the Extraction from PDFs: Extracting balance sheet amounts tutorial), the following LF builders are available. These builders support heuristics based on structural characteristics of the document with respect to the span.

Rich document expression builder

Label data points by evaluating the given expression. The expression uses Python syntax and can reference any dataframe columns, as well as the special variables SPAN (representing the span) and PATTERN1, PATTERN2, and so on (referencing the first match for each given regex). The special variables have the properties top, bottom, left, right, center, middle, page_id, line_id, par_id, row_id, char_start, char_end, and text. If any of the regex patterns has no match, the LabelingFunction will abstain.

In the example below, the LF has regex patterns ["Fair", "[Vv]alue"] and expression PATTERN1.top <= PATTERN2.top and PATTERN1.top > PATTERN2.top - 100 and SPAN.left > PATTERN1.left - 100 and SPAN.right < PATTERN1.right + 100 and SPAN.top > PATTERN1.top. It finds documents that contain the words Fair and Value (or lowercase value) where Fair is level with Value or fewer than 100 pixels above it, and labels spans that start below the top of Fair and stay within 100 pixels of its left and right edges.
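
To make the semantics of this expression concrete, here is a minimal sketch that evaluates it against hand-made bounding boxes. The Box class and the evaluate helper are hypothetical stand-ins, not part of the product's API, and the sketch assumes pixel coordinates with the origin at the top-left of the page (so a larger top means lower on the page).

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for SPAN / PATTERNn (pixels, origin at top-left)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


EXPRESSION = (
    "PATTERN1.top <= PATTERN2.top and PATTERN1.top > PATTERN2.top - 100 "
    "and SPAN.left > PATTERN1.left - 100 and SPAN.right < PATTERN1.right + 100 "
    "and SPAN.top > PATTERN1.top"
)


def evaluate(expression, span, patterns):
    """Evaluate the expression; return None (abstain) if any pattern is unmatched."""
    if any(p is None for p in patterns):
        return None
    names = {"SPAN": span}
    names.update({f"PATTERN{i}": p for i, p in enumerate(patterns, start=1)})
    # Builtins are disabled here because this expression only needs comparisons.
    return bool(eval(expression, {"__builtins__": {}}, names))


fair = Box("Fair", left=400, top=200, right=460, bottom=220)
value = Box("Value", left=470, top=200, right=540, bottom=220)
below = Box("1,250", left=410, top=300, right=455, bottom=320)      # under "Fair"
far_right = Box("980", left=900, top=300, right=940, bottom=320)    # another column

result_below = evaluate(EXPRESSION, below, [fair, value])
result_far = evaluate(EXPRESSION, far_right, [fair, value])
result_abstain = evaluate(EXPRESSION, below, [fair, None])  # "[Vv]alue" not found
```

Under these assumptions the span directly under Fair satisfies the expression, the span in the far-right column does not, and a missing pattern match leads to an abstain.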

This builder can also be used for PDF classification applications, but without the special SPAN variable.

For example, consider the LF with patterns ["Assets", "Liabilities"] and expression PATTERN1.left == PATTERN2.left and len(hv_lines.dfs_vert[page_idx]) >= 2. This matches any page with Assets and Liabilities in the same column (vertically aligned), AND at least 2 vertical lines (suggesting the presence of a table). This LF could be used to identify whether a document is a balance sheet in a document classification task.

Additional Examples

  • Label any span whose bottom is within 10 pixels of the pattern Term

    • Patterns: Term
    • Expression: abs(PATTERN1.bottom - SPAN.bottom) < 10
  • Label any span whose bottom is less than 10 pixels above the bottom of the pattern Term (or anywhere below it) and that has Imp in its text.

    • Patterns: Term
    • Expression: PATTERN1.bottom < SPAN.bottom + 10 and 'Imp' in SPAN.text
  • Label any span whose lowercase text is credit and whose document has Term in the first 1000 pixels.

    • Patterns: Term, Credit
    • Expression: PATTERN1.bottom < 1000 and PATTERN2.text.lower() == SPAN.text.lower()
  • Label any doc which has Term in the first 1000 pixels of the first page

    • Patterns: Term
    • Expression: PATTERN1.bottom < 1000 and page_index == 0
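
The first two examples differ in an easy-to-miss way: abs(...) < 10 is symmetric around Term, while the one-sided inequality also accepts spans far below it. The sketch below compares the two conditions on a few made-up span positions. It assumes coordinates that grow downward from the top of the page, and the SimpleNamespace objects are hypothetical stand-ins for PATTERN1 and SPAN.

```python
from types import SimpleNamespace

# Hypothetical match for the pattern Term, with its bottom edge at y = 500.
term = SimpleNamespace(bottom=500)


def symmetric(span):
    """abs(PATTERN1.bottom - SPAN.bottom) < 10"""
    return abs(term.bottom - span.bottom) < 10


def one_sided(span):
    """PATTERN1.bottom < SPAN.bottom + 10"""
    return term.bottom < span.bottom + 10


spans = {
    "15 px above": SimpleNamespace(bottom=485),
    "5 px above": SimpleNamespace(bottom=495),
    "200 px below": SimpleNamespace(bottom=700),
}

# Maps each span description to (symmetric result, one-sided result).
results = {name: (symmetric(s), one_sided(s)) for name, s in spans.items()}
```

Here only the span 5 pixels above Term passes both checks, while the span 200 pixels below passes the one-sided check alone, so add an upper bound to the expression if that is not intended.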

Rich document bounding box

Label documents based on the existence of a given regular expression at a specific location in the page. The bounding box coordinates can be found by hovering the cursor over the words in the Data View. In the example below, the LF checks whether the regex pattern “Loan.{0,15}Agreement” lies within the bounding box defined by the coordinates “0, 0, 2400, 1000”. All occurrences of the regex pattern are compared with the coordinates. This builder can only be used for Rich Doc classification applications.
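
The check can be pictured with the following sketch. It is illustrative only: it assumes the four coordinates are ordered left, top, right, bottom, that a single occurrence inside the box is enough to vote, and that each line of text comes with one bounding box (the real builder presumably uses the box of the match itself). The function names are hypothetical.

```python
import re

PATTERN = re.compile(r"Loan.{0,15}Agreement")
REGION = (0, 0, 2400, 1000)  # assumed order: left, top, right, bottom


def inside(box, region):
    """True if box lies entirely within region (both as left, top, right, bottom)."""
    left, top, right, bottom = box
    r_left, r_top, r_right, r_bottom = region
    return left >= r_left and top >= r_top and right <= r_right and bottom <= r_bottom


def page_matches(lines, pattern=PATTERN, region=REGION):
    """lines: iterable of (text, box). True if any matching line is inside region."""
    return any(
        pattern.search(text) is not None and inside(box, region)
        for text, box in lines
    )


# A title near the top of the page versus a mention in the footer.
title_page = [("Loan and Security Agreement", (600, 300, 1800, 380))]
footer_only = [("See the Loan Agreement for details", (200, 3000, 1500, 3040))]

title_hit = page_matches(title_page)
footer_hit = page_matches(footer_only)
```

With these inputs the title occurrence falls inside the box and the footer occurrence does not.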

Span regex proximity

Label data points based on the existence of a given regular expression in their vicinity. This checks if the span is within some number (called the window size) of units (lines/paragraphs/areas) in a specified direction (before, after, or either direction) of the specified regex. If the window size is 0, only text in the span’s own line/paragraph/area will be considered.

In the example below, this LF labels the 8 spans that are up to 4 LINES after the regex pattern Current Assets: as ASSETS.

image__3_.webp
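
As a rough sketch of the window logic, the function below checks whether a span's line is within a given number of lines of a regex match. The line ids, the dictionary layout, and the function name are assumptions made for illustration. The direction describes where the span sits relative to the match, and a window size of 0 only accepts a match on the span's own line.

```python
import re


def near_regex(span_line, lines, pattern, window, direction="either"):
    """lines: dict mapping line id -> text. Returns True if the span qualifies."""
    for line_id, text in lines.items():
        if not re.search(pattern, text):
            continue
        offset = span_line - line_id  # positive: the span comes after the match
        if direction == "after" and 0 <= offset <= window:
            return True
        if direction == "before" and -window <= offset <= 0:
            return True
        if direction == "either" and abs(offset) <= window:
            return True
    return False


lines = {
    10: "Current Assets:",
    11: "Cash 1,200",
    12: "Receivables 800",
    16: "Total liabilities 950",
}

two_after = near_regex(12, lines, r"Current Assets:", window=4, direction="after")
six_after = near_regex(16, lines, r"Current Assets:", window=4, direction="after")
same_line_only = near_regex(11, lines, r"Current Assets:", window=0)
```

The span on line 12 is two lines after the match and qualifies, the span on line 16 is outside the window, and with a window of 0 the span on line 11 no longer qualifies.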

Span regex alignment

Label data points based on whether or not a given regular expression is aligned with the span in some way. Select a location (LEFT / CENTER / RIGHT / TOP / MIDDLE / BOTTOM) to compare between the bounding box of the span and the bounding boxes of any matches for your regular expression. Also select a threshold for how close the two coordinates need to be to be considered aligned. This comparison can be made in PIXELS or PAGE_PERCENT, a percentage (0-100) of the page’s total width/height, depending on which dimension is relevant for the given location. You may also optionally limit matches to the specified scope of the Span, and/or to only the given direction. For example, if your threshold is 100 pixels, location is TOP, and threshold direction is UP_ONLY, then only matches whose top boundary is within 100 pixels above (not below) the Span on the page will be matched. If no threshold direction is specified, then matches within the threshold are allowed in both directions (LEFT and RIGHT, or UP and DOWN).

In the example below, this LF labels the spans whose LEFT aligns with the regex pattern Assets (case-sensitive), within a margin of 700 PIXELS, as INVALID. These spans are highlighted in grey in the figure.

image__4_.webp
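
The threshold, unit, and direction options can be sketched as a comparison of a single coordinate. This is an illustration under stated assumptions, not the builder's implementation: coordinates grow downward and rightward, DOWN_ONLY is an assumed counterpart to the UP_ONLY value mentioned above, and the function name and the page_extent parameter are hypothetical.

```python
def is_aligned(span_coord, match_coord, threshold, unit="PIXELS",
               page_extent=None, direction=None):
    """Compare one coordinate (e.g. LEFT or TOP) of the span and of a regex match.

    page_extent is the page width or height in pixels, needed for PAGE_PERCENT.
    """
    delta = match_coord - span_coord  # negative: match is above / left of the span
    if unit == "PAGE_PERCENT":
        if page_extent is None:
            raise ValueError("page_extent is required for PAGE_PERCENT")
        delta = 100.0 * delta / page_extent
    if direction == "UP_ONLY":
        return -threshold <= delta <= 0
    if direction == "DOWN_ONLY":  # assumed name, not taken from the text above
        return 0 <= delta <= threshold
    return abs(delta) <= threshold


# TOP location, 100-pixel threshold: a match 50 px above versus 50 px below the span.
above_up_only = is_aligned(500, 450, 100, direction="UP_ONLY")
below_up_only = is_aligned(500, 550, 100, direction="UP_ONLY")
below_no_direction = is_aligned(500, 550, 100)

# LEFT location on a 2500 px wide page: 250 px apart is 10% of the page width.
within_15_percent = is_aligned(250, 500, 15, unit="PAGE_PERCENT", page_extent=2500)
within_5_percent = is_aligned(250, 500, 5, unit="PAGE_PERCENT", page_extent=2500)
```

Whether the threshold itself counts as "within" is not stated in the text; the sketch treats it as inclusive.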

Span regex row (Rich-document based LFs)

Label data points based on whether or not the span is within a certain number of rows of a given regular expression. Set rows before and rows after to specify the window of rows to search, relative to the span’s current row. If both rows before and rows after are set to 0, only spans in the same row as a match for the regex pattern will be labeled.

In the example below, this LF labels the 10 spans that are at most 2 rows before or 2 rows after the regex pattern Total current liabilities as LIABILITIES.

image__5_.webp
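
Unlike the single window size used for proximity, the row window can be asymmetric. A minimal sketch, assuming integer row ids and a hypothetical dictionary of row texts:

```python
import re


def regex_in_row_window(span_row, rows, pattern, rows_before=0, rows_after=0):
    """rows: dict mapping row id -> row text.

    Searches rows span_row - rows_before through span_row + rows_after (inclusive).
    """
    window = range(span_row - rows_before, span_row + rows_after + 1)
    return any(
        r in rows and re.search(pattern, rows[r]) is not None
        for r in window
    )


rows = {
    3: "Accounts payable 400",
    4: "Accrued expenses 150",
    5: "Total current liabilities 550",
    9: "Common stock 1,000",
}
pattern = r"Total current liabilities"

looking_down = regex_in_row_window(3, rows, pattern, rows_after=2)   # rows 3..5
looking_up = regex_in_row_window(3, rows, pattern, rows_before=2)    # rows 1..3
far_away = regex_in_row_window(9, rows, pattern, rows_before=2, rows_after=2)
same_row = regex_in_row_window(5, rows, pattern)                     # row 5 only
```

A span in row 3 only finds the total when the window extends at least two rows after it, which is why the two settings are configured separately.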

Span regex position

Label data points based on whether the span lies in a specific direction with respect to the given regular expression. Select two location attributes (LEFT / CENTER / RIGHT / TOP / MIDDLE / BOTTOM) of the span to compare with two location attributes of any matches to the regular expression. You may optionally select the scope of the comparison (PAGE / AREA / PARA / LINE). You can enable only the first, only the second, or both conditions under the Advanced options.

In the example below, this LF labels 14 spans that lie below and to the right of the regex pattern stock as EQUITY.

image__6_.webp
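
One way to read "below and to the right" is as two attribute comparisons between the span and the match. The sketch below is a hypothetical illustration: it assumes coordinates that grow downward and rightward, that CENTER is the horizontal midpoint and MIDDLE the vertical midpoint, and that the two conditions are TOP versus BOTTOM and LEFT versus RIGHT. The builder's actual attribute choices may differ.

```python
import operator
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box in pixels, origin at the top-left of the page."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def center(self):  # assumed: horizontal midpoint
        return (self.left + self.right) / 2

    @property
    def middle(self):  # assumed: vertical midpoint
        return (self.top + self.bottom) / 2


def compare(span, match, span_attr, match_attr, op=operator.gt):
    """Apply op to one location attribute of the span and one of the match."""
    return op(getattr(span, span_attr.lower()), getattr(match, match_attr.lower()))


def below_and_right(span, match):
    below = compare(span, match, "TOP", "BOTTOM")   # span starts under the match
    right = compare(span, match, "LEFT", "RIGHT")   # span starts past the match
    return below and right


stock = Box(left=100, top=400, right=180, bottom=420)
amount = Box(left=600, top=450, right=680, bottom=470)
header = Box(left=600, top=100, right=680, bottom=120)

amount_labeled = below_and_right(amount, stock)
header_labeled = below_and_right(header, stock)
```

Disabling one of the two conditions corresponds to dropping either the below or the right comparison.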

Span page

Label data points based on the page number of the given span, using 1-indexing (the first page of the document is considered page 1, not page 0).

Span font size

Label data points based on the font size of the given span (in pixels, not points), based on the value reported in the hOCR.
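
For orientation, the hOCR format stores per-element properties in the title attribute, and its specification defines an x_fsize property for font size. The sketch below pulls that value out of a title string and applies a minimum-size check. Whether this builder reads x_fsize specifically is an assumption, and the function names and the 18-pixel cutoff are hypothetical.

```python
import re


def font_size_from_title(title):
    """Return the x_fsize value from an hOCR title attribute, or None if absent."""
    match = re.search(r"\bx_fsize\s+(\d+(?:\.\d+)?)", title)
    return float(match.group(1)) if match else None


def is_large_text(title, min_size=18):
    """True if the element reports a font size of at least min_size."""
    size = font_size_from_title(title)
    return size is not None and size >= min_size


heading_title = "bbox 100 200 300 240; x_wconf 95; x_fsize 24"
body_title = "bbox 100 300 400 320; x_wconf 93; x_fsize 11"
no_size_title = "bbox 100 400 400 420; x_wconf 90"

heading_size = font_size_from_title(heading_title)
heading_is_large = is_large_text(heading_title)
body_is_large = is_large_text(body_title)
missing_is_large = is_large_text(no_size_title)
```

Elements without a reported size are treated as not matching here; an LF built on this value would more naturally abstain in that case.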