Version: 0.96

(Beta) Word LF builders

note

This is a beta feature available to customers using a Snorkel-hosted instance of Snorkel Flow. Beta features may have known gaps or bugs, but are functional workflows and eligible for Snorkel Support. To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.

This article describes the set of word-based labeling function (LF) builders that are available for PDF word classification applications.

Word-based regex builder

Label words present in the page based on regex patterns applied over the page text. This builder highlights words that match the regex pattern in the data viewer to help with iteration.

This example uses the word-based regex builder to find all alphabetical words in the page using the regex pattern \b[a-zA-Z]+\b.
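
To see what this pattern does and does not match before entering it in the builder, you can try it with Python's standard re module. This is a minimal sketch outside Snorkel Flow: the helper name alphabetical_words and the sample text are made up for illustration, and the builder's own matching may differ in details such as tokenization.

```python
import re

# Pattern from the example above: a run of ASCII letters with a word
# boundary on each side.
ALPHA_WORD = re.compile(r"\b[a-zA-Z]+\b")


def alphabetical_words(page_text: str) -> list[str]:
    """Return every purely alphabetical word in the text (hypothetical helper)."""
    return ALPHA_WORD.findall(page_text)


# Made-up sample text, not taken from a real document.
sample = "Patent number US1234567 was filed in 2019 by Acme Corp."
print(alphabetical_words(sample))
# ['Patent', 'number', 'was', 'filed', 'in', 'by', 'Acme', 'Corp']
```

Note that US1234567 is skipped entirely: the letters are followed directly by digits, so there is no word boundary after them.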

word_regex_example.webp

Word-based expression builder

Label words present in the page based on a regex pattern and a Python expression that is evaluated on the pattern and words. The expression uses Python syntax and the following special variables:

  • WORD: the word to evaluate
  • PATTERN: the regex pattern

Both special variables have the following properties, which describe either the word or the pattern match:

  • top, bottom, left, right: the bounding box of the word
  • center: the center of the word (midpoint between left and right)
  • middle: the middle of the word (midpoint between top and bottom)
  • char_start, char_end: the character start and end indices of the word
  • line_id: the line index of the word
  • par_id: the paragraph index of the word
  • area_id: the area index of the word
  • page_id: the page index of the word
  • row_id: the row index of the word, which is useful for multi-column documents

This builder highlights words that match the expression in the data viewer to help with iteration.

This example uses the word expression builder to find the patent number in the page. The regex pattern is patent number. The expression is PATTERN.right < WORD.left and PATTERN.row_id == WORD.row_id, which selects words that start to the right of the "patent number" match and are in the same row.
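
The logic of this expression can be sketched in plain Python. Everything here is an assumption for illustration: the Span class, the matching_words helper, and the coordinates are hypothetical and are not Snorkel Flow's API or implementation. The sketch only shows how WORD and PATTERN, with the properties listed above, combine in an expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for WORD or PATTERN with a subset of the properties."""
    text: str
    left: float
    right: float
    top: float
    bottom: float
    row_id: int

    @property
    def center(self) -> float:
        # Midpoint between left and right.
        return (self.left + self.right) / 2

    @property
    def middle(self) -> float:
        # Midpoint between top and bottom.
        return (self.top + self.bottom) / 2


def matching_words(expression: str, pattern: Span, words: list[Span]) -> list[str]:
    """Evaluate the expression once per word, binding WORD and PATTERN."""
    matches = []
    for word in words:
        # eval is used only to mirror "a Python expression"; builtins are disabled.
        scope = {"WORD": word, "PATTERN": pattern}
        if bool(eval(expression, {"__builtins__": {}}, scope)):
            matches.append(word.text)
    return matches


# Made-up layout: the pattern match and four words on two rows.
pattern = Span("Patent number", left=50, right=150, top=100, bottom=112, row_id=3)
words = [
    Span("Title", 10, 40, 100, 112, 3),       # same row, but left of the match
    Span("US1234567", 160, 230, 100, 112, 3),  # same row, right of the match
    Span("Filed", 50, 90, 120, 132, 4),        # different row
    Span("2019", 160, 190, 120, 132, 4),       # right of the match, different row
]

expression = "PATTERN.right < WORD.left and PATTERN.row_id == WORD.row_id"
print(matching_words(expression, pattern, words))  # ['US1234567']
```

Dropping the row_id condition would also select 2019, which is why the example pairs the horizontal test with a same-row test.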

word_expression_patent_num_example.webp

This builder is also effective when combined with the word regex builder above. This example finds all asset amounts in the balance sheet by combining multiple LFs:

  • Word regex LF to find all numeric words in the page, using the regex pattern \d.
  • Word expression LF to find the words that are below the term "Current assets" using the pattern Current assets and the expression WORD.row_id > PATTERN.row_id.
  • Word expression LF to find the words that are above the term "Total assets" using the pattern Total assets and the expression WORD.row_id < PATTERN.row_id.
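
As a rough illustration of how the three conditions narrow the candidates, the sketch below applies them together to a made-up balance sheet. The Word type, the function names, and the row numbers are hypothetical. Requiring all three conditions at once is a simplification chosen for this sketch: in Snorkel Flow each LF is a separate labeling function, and this article does not describe how their outputs are combined.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word with only the property these expressions use."""
    text: str
    row_id: int


def lf_numeric(word: Word) -> bool:
    # Word regex LF: the pattern \d matches any word containing a digit.
    return re.search(r"\d", word.text) is not None


def lf_below(word: Word, pattern: Word) -> bool:
    # Expression: WORD.row_id > PATTERN.row_id
    return word.row_id > pattern.row_id


def lf_above(word: Word, pattern: Word) -> bool:
    # Expression: WORD.row_id < PATTERN.row_id
    return word.row_id < pattern.row_id


# Made-up balance sheet, one row per line of the statement.
words = [
    Word("2023", 0),
    Word("Current", 1), Word("assets", 1),
    Word("Cash", 2), Word("1,200", 2),
    Word("Inventory", 3), Word("850", 3),
    Word("Total", 4), Word("assets", 4), Word("2,050", 4),
]
current_assets = Word("Current assets", 1)  # stand-in for the first pattern match
total_assets = Word("Total assets", 4)      # stand-in for the second pattern match

asset_amounts = [
    w.text
    for w in words
    if lf_numeric(w) and lf_below(w, current_assets) and lf_above(w, total_assets)
]
print(asset_amounts)  # ['1,200', '850']
```

The year in the header row and the total in the "Total assets" row both contain digits, but each fails one of the two row comparisons.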

word_expression_multi_lf_example.webp