Pattern-based LF builders
This page describes the basic set of pattern-based labeling function (LF) builders that are available for classification applications.
Regex builder
Label data points that match a user-defined regex over a specific data field. To help you iterate, the data viewer highlights the phrases in each data point that match the regex pattern. https://regex101.com is a good resource for building and testing regexes. The builder also supports fuzzy matching in regex patterns (see https://pypi.org/project/regex/).
To find service documents in the contract classification dataset, we might label data points that match the regex pattern `This.{1,50} Service Agreement`.
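The matching logic behind this example can be sketched in plain Python. This is only an illustration of what the pattern does, not the builder's implementation: the function name `regex_lf`, the `text` field name, and the `SERVICES` label value are assumptions made for the sketch.

```python
import re

# Hypothetical label values; a real application defines its own label set.
SERVICES = "SERVICES"
ABSTAIN = None

# Pattern from the example: "This", then 1 to 50 characters, then " Service Agreement".
PATTERN = re.compile(r"This.{1,50} Service Agreement")


def regex_lf(data_point: dict, field: str = "text"):
    """Return SERVICES if the chosen field matches the pattern, otherwise abstain."""
    match = PATTERN.search(data_point.get(field, ""))
    return SERVICES if match else ABSTAIN


doc = {"text": "This Master Service Agreement is entered into by the parties."}
label = regex_lf(doc)
```

Note that `.{1,50}` requires at least one character before the space that precedes "Service", so the bare phrase "This Service Agreement" does not match this pattern.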
Full text regex builder
Label data points that match a user-defined regex over a specific set of data fields.
Whereas the Regex Builder searches on one field, the Full Text Regex Builder can search through multiple or all fields at once. For example, you could search both the Title and Text body of a news article with a single LF. By default, all fields are selected (hence the “full text” match).
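As a rough sketch of the difference, the function below applies one regex to several fields and defaults to all of them. The function name, the dictionary representation of a data point, and the `BUSINESS` label are hypothetical and are not taken from the product.

```python
import re


def full_text_regex_lf(data_point: dict, pattern: str, label, fields=None):
    """Return `label` if any selected string field matches `pattern`, else None.

    fields=None mimics the default behaviour described above: search every field.
    """
    selected = fields if fields is not None else list(data_point)
    compiled = re.compile(pattern)
    for name in selected:
        value = data_point.get(name)
        # Skip missing or non-text fields instead of failing on them.
        if isinstance(value, str) and compiled.search(value):
            return label
    return None


article = {"title": "Stocks rally", "body": "Markets closed higher."}
all_fields = full_text_regex_lf(article, r"[Ss]tocks?", "BUSINESS")
body_only = full_text_regex_lf(article, r"[Ss]tocks?", "BUSINESS", fields=["body"])
```

Here `all_fields` is a match because the title contains the pattern, while `body_only` abstains because the search is restricted to the body.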
Keyword builder
Label data points that satisfy a condition with respect to at least one keyword in a list you provide. There are three conditions that we support in-app:
- CONTAINS: if the specified field contains at least one keyword.
- CONTAINS LINE MATCHING: if one of the lines in the specified field matches a keyword.
- EQUALS: if the specified field equals a keyword.
Note that the Keyword builder allows a maximum of 3 keywords/phrases per builder.
For the contract classification application, we can look for the word `employment` to label documents as the `EMPLOYMENT` type.
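The three conditions can be sketched as follows. The sketch assumes case-sensitive comparison and reads CONTAINS LINE MATCHING as "some line, after trimming whitespace, equals a keyword"; both are assumptions, as is the function name `keyword_lf`.

```python
def keyword_lf(text: str, keywords: list, condition: str, label):
    """Return `label` if `text` satisfies `condition` for at least one keyword."""
    if condition == "CONTAINS":
        # The keyword appears anywhere in the field.
        hit = any(keyword in text for keyword in keywords)
    elif condition == "CONTAINS LINE MATCHING":
        # Assumption: a line "matches" when it equals a keyword after trimming.
        lines = [line.strip() for line in text.splitlines()]
        hit = any(keyword in lines for keyword in keywords)
    elif condition == "EQUALS":
        # The whole field is exactly one of the keywords.
        hit = text in keywords
    else:
        raise ValueError(f"Unknown condition: {condition}")
    return label if hit else None


contract = "This employment agreement is made between the parties."
contains_label = keyword_lf(contract, ["employment"], "CONTAINS", "EMPLOYMENT")
equals_label = keyword_lf(contract, ["employment"], "EQUALS", "EMPLOYMENT")
```

For this contract, CONTAINS assigns the `EMPLOYMENT` label, while EQUALS abstains because the field is longer than the keyword.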
Fuzzy keyword builder
Label data points that contain a string that is similar to at least one of the keywords in a list you provide, where similarity is measured as a ratio using SequenceMatcher (from Python's `difflib` module) and must meet the threshold you specify. This is particularly useful for documents obtained through OCR, where characters are often misrecognized.
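A minimal sketch of fuzzy keyword matching with `difflib.SequenceMatcher` is shown below. How the builder selects candidate strings from the document is not described here, so the sliding window over whitespace-separated words, the lowercasing, and the default threshold of 0.8 are all assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_keyword_lf(text: str, keywords: list, label, min_ratio: float = 0.8):
    """Return `label` if some word window in `text` is similar enough to a keyword."""
    tokens = text.lower().split()
    for keyword in keywords:
        keyword_tokens = keyword.lower().split()
        width = len(keyword_tokens)
        if width == 0:
            continue
        target = " ".join(keyword_tokens)
        # Compare the keyword against every window with the same number of words.
        for start in range(len(tokens) - width + 1):
            candidate = " ".join(tokens[start:start + width])
            if SequenceMatcher(None, candidate, target).ratio() >= min_ratio:
                return label
    return None


# "ernployment" imitates a common OCR error where "m" is read as "rn".
ocr_label = fuzzy_keyword_lf("This ernployment agreement", ["employment"], "EMPLOYMENT")
other_label = fuzzy_keyword_lf("This lease agreement", ["employment"], "EMPLOYMENT")
```

Lowering `min_ratio` catches more OCR errors but also increases the chance of matching unrelated words.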
Keyword location builder
Label data points based on whether a pattern appears in a specific section (lines, paragraphs, etc.) of a field. Words are split using the space character (`\s`), lines using the newline character (`\n`), sentences based on selected punctuation (`[.?!]\s`), and paragraphs based on two or more newline characters (`\n\n+`).
Advanced option: You can specify how often the pattern should appear in that location!
For the contract classification application, we check whether the word `employment` occurs in the first 5 paragraphs.
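The location check can be sketched as below, assuming the delimiters are the regular expressions `\s`, `\n`, `[.?!]\s`, and `\n\n+` and that a location means "the first N sections". The function name, its parameters, and the `min_count` argument standing in for the advanced option are hypothetical.

```python
import re

# Assumed delimiter regexes for each section type.
SPLITTERS = {
    "word": r"\s",
    "line": r"\n",
    "sentence": r"[.?!]\s",
    "paragraph": r"\n\n+",
}


def keyword_location_lf(text, pattern, label, unit="paragraph", first_n=5, min_count=1):
    """Return `label` if `pattern` occurs at least `min_count` times
    within the first `first_n` sections of `text`, else None."""
    sections = re.split(SPLITTERS[unit], text)[:first_n]
    count = sum(len(re.findall(pattern, section)) for section in sections)
    return label if count >= min_count else None


early = "\n\n".join(["Intro.", "This employment contract.", "c", "d", "e", "f"])
late = "\n\n".join(["a", "b", "c", "d", "e", "employment terms"])
early_label = keyword_location_lf(early, "employment", "EMPLOYMENT")
late_label = keyword_location_lf(late, "employment", "EMPLOYMENT")
```

The first document is labeled because the word appears in its second paragraph; the second abstains because the word first appears in the sixth paragraph.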
Keyword count builder
Label data points based on the frequency of a pattern in a field.
Advanced option: You can restrict the count to a specific section of the field!
For the contract classification application, we check whether the word `employment` occurs more than 5 times in the first 5 paragraphs.
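A minimal sketch of this count-based check follows. It assumes paragraphs are separated by two or more newline characters and treats the keyword as a regular expression; the function name and parameters are hypothetical.

```python
import re


def keyword_count_lf(text, pattern, label, more_than=5, first_n_paragraphs=None):
    """Return `label` if `pattern` occurs more than `more_than` times.

    If `first_n_paragraphs` is set, only that many leading paragraphs are counted.
    """
    if first_n_paragraphs is not None:
        paragraphs = re.split(r"\n\n+", text)[:first_n_paragraphs]
        text = "\n\n".join(paragraphs)
    count = len(re.findall(pattern, text))
    return label if count > more_than else None


# Five paragraphs with one occurrence each, then a sixth with three more.
contract = "\n\n".join(["employment clause"] * 5 + ["employment employment employment"])
whole_field = keyword_count_lf(contract, "employment", "EMPLOYMENT")
first_five = keyword_count_lf(contract, "employment", "EMPLOYMENT", first_n_paragraphs=5)
```

Counting over the whole field finds eight occurrences and assigns the label, while restricting the count to the first 5 paragraphs finds exactly five, which is not more than 5, so the function abstains.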