(Beta) Word LF builders
This is a beta feature available to customers using a Snorkel-hosted instance of Snorkel Flow. Beta features may have known gaps or bugs, but are functional workflows and eligible for Snorkel Support. To access beta features, contact Snorkel Support to enable the feature flag for your Snorkel-hosted instance.
This article describes the set of word-based labeling function (LF) builders that are available for PDF word classification applications.
Word-based regex builder
Label words present in the page based on regex patterns applied over the page text. This builder highlights words that match the regex pattern in the data viewer to help with iteration.
This example uses the word-based regex builder to find all alphabetical words in the page using the regex pattern \b[a-zA-Z]+\b.
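To preview what this pattern matches outside the product, it can be run with Python's standard re module. The following is a minimal sketch, not the builder's implementation; page_text is a made-up sample string standing in for the extracted page text.

```python
import re

# Hypothetical page text; the real builder runs over text extracted from the PDF page.
page_text = "Invoice 2024 total due: 150 USD"

# The pattern from the example: runs of ASCII letters between word boundaries.
pattern = re.compile(r"\b[a-zA-Z]+\b")

# Keep each match together with its character span in the page text.
matches = [(m.group(), m.start(), m.end()) for m in pattern.finditer(page_text)]

for text, start, end in matches:
    print(f"{text!r} at [{start}, {end})")
```

Purely numeric tokens such as 2024 are skipped, and so are mixed tokens such as A1, because there is no word boundary between a letter and a digit.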
Word-based expression builder
Label words present in the page based on a regex pattern and a Python expression that is evaluated on the pattern and words. The expression uses Python syntax and the following special variables:
- WORD: the word to evaluate
- PATTERN: the regex pattern
The special variables have the following properties:
- top, bottom, left, right: the bounding box of the word
- center: the center of the word (midpoint between left and right)
- middle: the middle of the word (midpoint between top and bottom)
- char_start, char_end: the character start and end indices of the word
- line_id: the line index of the word
- par_id: the paragraph index of the word
- area_id: the area index of the word
- page_id: the page index of the word
- row_id: the row index of the word, which is useful for multi-column documents
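The relationship between the bounding box and the derived center and middle properties can be sketched as follows. WordBox is a hypothetical stand-in written for illustration, not a Snorkel Flow class, and the coordinate values are invented.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Hypothetical stand-in for the object behind WORD or PATTERN."""

    top: float
    bottom: float
    left: float
    right: float
    row_id: int

    @property
    def center(self) -> float:
        # Horizontal midpoint between left and right.
        return (self.left + self.right) / 2

    @property
    def middle(self) -> float:
        # Vertical midpoint between top and bottom.
        return (self.top + self.bottom) / 2


word = WordBox(top=100.0, bottom=112.0, left=40.0, right=90.0, row_id=3)
print(word.center, word.middle)  # 65.0 106.0
```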
This builder highlights words that match the expression in the data viewer to help with iteration.
This example uses the word expression builder to find the patent number in the page. The regex pattern is patent number and the expression is PATTERN.right < WORD.left and PATTERN.row_id == WORD.row_id. This builder looks for words that are to the right of "patent number" and in the same row.
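The logic of this expression can be tried locally on a few hand-made words. This is only a sketch: it assumes the expression is evaluated once per word with WORD and PATTERN bound to objects exposing the properties listed above, and it mimics that with SimpleNamespace and eval. The coordinates and word texts are invented sample data.

```python
from types import SimpleNamespace

# The expression from the example above.
expression = "PATTERN.right < WORD.left and PATTERN.row_id == WORD.row_id"

# Hypothetical location of the regex match for "patent number".
pattern_match = SimpleNamespace(text="patent number", left=50, right=140, row_id=2)

# Hypothetical words on the page.
words = [
    SimpleNamespace(text="Filed", left=10, right=45, row_id=2),       # left of the match
    SimpleNamespace(text="US1234567", left=150, right=230, row_id=2),  # right, same row
    SimpleNamespace(text="Assignee", left=150, right=210, row_id=3),   # right, other row
]


def matches_expression(word, pattern):
    # eval stands in for however the builder evaluates the expression (an assumption).
    scope = {"WORD": word, "PATTERN": pattern}
    return bool(eval(expression, {"__builtins__": {}}, scope))


labeled = [w.text for w in words if matches_expression(w, pattern_match)]
print(labeled)  # ['US1234567']
```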
This builder is also effective when combined with the word regex builder above. This example finds all asset amounts in the balance sheet by combining multiple LFs:
- Word regex LF to find all numeric words in the page, using the regex pattern \d.
- Word expression LF to find the words that are below the term "Current assets" using the pattern Current assets and the expression WORD.row_id > PATTERN.row_id.
- Word expression LF to find the words that are above the term "Total assets" using the pattern Total assets and the expression WORD.row_id < PATTERN.row_id.
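The combined effect of the three LFs above can be sketched in plain Python. This assumes the LFs are combined with a logical AND, so a word is labeled only when all three conditions hold; the balance-sheet words and row indices are invented for illustration, and the function names are hypothetical.

```python
import re

# Hypothetical (text, row_id) pairs for the words of a small balance sheet.
words = [
    ("FY", 0), ("2023", 0),
    ("Current", 1), ("assets", 1),
    ("Cash", 2), ("1,200", 2),
    ("Inventory", 3), ("800", 3),
    ("Total", 4), ("assets", 4), ("2,000", 4),
    ("Page", 5), ("4", 5),
]

# Rows where the patterns "Current assets" and "Total assets" match in this sample.
current_assets_row = 1
total_assets_row = 4


def lf_numeric(text, row_id):
    # Word regex LF: the word contains at least one digit (pattern \d).
    return re.search(r"\d", text) is not None


def lf_below_current_assets(text, row_id):
    # Expression: WORD.row_id > PATTERN.row_id
    return row_id > current_assets_row


def lf_above_total_assets(text, row_id):
    # Expression: WORD.row_id < PATTERN.row_id
    return row_id < total_assets_row


lfs = (lf_numeric, lf_below_current_assets, lf_above_total_assets)
amounts = [text for text, row_id in words if all(lf(text, row_id) for lf in lfs)]
print(amounts)  # ['1,200', '800']
```

Because both comparisons are strict, the amount on the "Total assets" row itself is excluded, as are numbers outside the section such as the year and the page number.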