Version: 0.91

Pattern-based LF builders

This page describes the basic set of pattern-based LF builders available for classification applications.

Regex builder

Label data points that match a user-defined regex over a specific data field. To help you iterate, the data viewer highlights the phrases in each data point that match the regex pattern. https://regex101.com is a useful resource for testing regexes. The builder also supports fuzzy matching in regex patterns (see https://pypi.org/project/regex/).

note

To find service documents in the contract classification dataset, we might label data points that match the regex pattern `This.{1,50} Service Agreement`.
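As a rough illustration of what this pattern matches, the sketch below applies it with Python's standard `re` module. The function names `regex_lf` and `matched_spans` and the sample strings are hypothetical; this is not the builder's actual implementation, only a way to try the pattern outside the app.

```python
import re

# Pattern taken from the example above.
# `.{1,50}` requires between 1 and 50 characters (excluding newlines)
# between "This" and " Service Agreement".
PATTERN = re.compile(r"This.{1,50} Service Agreement")


def regex_lf(text: str) -> bool:
    """Hypothetical helper: True when the pattern matches anywhere in text."""
    return PATTERN.search(text) is not None


def matched_spans(text: str) -> list[str]:
    """Hypothetical helper: the matched phrases, similar to viewer highlights."""
    return [m.group(0) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "This Master Service Agreement is made between A and B."
    print(regex_lf(sample))
    print(matched_spans(sample))
```

Note that a text reading exactly "This Service Agreement" would not match, because the pattern needs at least one character plus a space between "This" and "Service".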

Full text regex builder

Label data points that match a user-defined regex over a specific set of data fields.

Whereas the Regex Builder searches on one field, the Full Text Regex Builder can search through multiple or all fields at once. For example, you could search both the Title and Text body of a news article with a single LF. By default, all fields are selected (hence the “full text” match).
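The difference between the two builders can be sketched in a few lines of Python. This is a minimal sketch that assumes a data point is a dictionary mapping field names to text; the function `full_text_regex_lf`, the field names, and the per-field search strategy are assumptions made for illustration, not the builder's documented behavior.

```python
import re
from typing import Mapping, Optional, Sequence


def full_text_regex_lf(
    row: Mapping[str, str],
    pattern: str,
    fields: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical helper: True if the pattern matches in any selected field.

    When `fields` is None, every field is searched, mirroring the
    "all fields selected by default" behavior described above.
    """
    selected = list(row.keys()) if fields is None else list(fields)
    compiled = re.compile(pattern)
    return any(
        compiled.search(str(row.get(name, ""))) is not None for name in selected
    )


if __name__ == "__main__":
    article = {"title": "Markets rally", "text": "Stocks rose on Monday."}
    print(full_text_regex_lf(article, r"[Ss]tocks"))
    print(full_text_regex_lf(article, r"[Ss]tocks", fields=["title"]))
```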

Keyword builder

Label data points that satisfy a condition with respect to at least one of the keywords in a list you provide. The app supports three conditions:

  • CONTAINS: the specified field contains at least one keyword.
  • CONTAINS LINE MATCHING: at least one line in the specified field matches a keyword.
  • EQUALS: the specified field equals a keyword.

note

The Keyword builder allows a maximum of 3 keywords or phrases per builder.

note

For the contract classification application, we can look for the word `employment` to label documents as the EMPLOYMENT type.
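The three conditions can be approximated in plain Python as shown below. This is a sketch under stated assumptions: matching is treated as case-sensitive, a line "matches" a keyword when it equals the keyword after stripping surrounding whitespace, and the function name `keyword_lf` is hypothetical. The actual builder may differ on each of these points.

```python
from typing import Sequence

MAX_KEYWORDS = 3  # limit stated in the note above


def keyword_lf(value: str, keywords: Sequence[str], condition: str = "CONTAINS") -> bool:
    """Hypothetical helper: True if any keyword satisfies the condition."""
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError(f"at most {MAX_KEYWORDS} keywords are allowed")

    if condition == "CONTAINS":
        # Keyword appears anywhere in the field (assumed case-sensitive).
        return any(keyword in value for keyword in keywords)
    if condition == "CONTAINS LINE MATCHING":
        # Assumption: a line matches when it equals the keyword once stripped.
        lines = [line.strip() for line in value.splitlines()]
        return any(line == keyword for line in lines for keyword in keywords)
    if condition == "EQUALS":
        # The whole field is exactly one of the keywords.
        return any(value == keyword for keyword in keywords)
    raise ValueError(f"unknown condition: {condition}")


if __name__ == "__main__":
    doc = "Contract type\nemployment\nThis agreement covers employment terms."
    for cond in ("CONTAINS", "CONTAINS LINE MATCHING", "EQUALS"):
        print(cond, keyword_lf(doc, ["employment"], cond))
```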

Fuzzy keyword builder

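Label data points that contain a string similar to at least one of the keywords in a list you provide. Similarity is measured with SequenceMatcher (a class in Python's difflib module), and a string counts as a match when it meets the similarity ratio you specify. This is particularly useful for documents obtained through OCR, where individual characters are often misrecognized.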
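To give a feel for how a similarity ratio behaves, here is a minimal sketch built on `difflib.SequenceMatcher` from the Python standard library. The sliding-window comparison, the lowercasing, the default threshold of 0.8, and the name `fuzzy_keyword_lf` are all assumptions for illustration; the builder's own windowing and normalization are not described on this page.

```python
from difflib import SequenceMatcher
from typing import Sequence


def fuzzy_keyword_lf(text: str, keywords: Sequence[str], min_ratio: float = 0.8) -> bool:
    """Hypothetical helper: True if any window of words is similar to a keyword.

    Each keyword is compared against every run of consecutive words in the
    text that has the same number of words as the keyword (an assumption).
    """
    words = text.lower().split()
    for keyword in keywords:
        target = keyword.lower()
        width = len(target.split())
        for start in range(len(words) - width + 1):
            window = " ".join(words[start:start + width])
            if SequenceMatcher(None, window, target).ratio() >= min_ratio:
                return True
    return False


if __name__ == "__main__":
    # "emp1oyment" imitates an OCR error (digit 1 instead of letter l).
    print(fuzzy_keyword_lf("This emp1oyment agreement", ["employment"]))
    print(fuzzy_keyword_lf("This lease agreement", ["employment"]))
```

Raising `min_ratio` toward 1.0 makes the match stricter, while lowering it tolerates more character errors at the cost of more false positives.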

Keyword location builder

Label data points based on whether a pattern appears in a specific section (lines, paragraphs, etc.) of a field. Words are split using the space character (`\s`), lines using the newline character (`\n`), sentences based on selected punctuation (`[.?!]\s`), and paragraphs based on two or more newline characters (`\n\n+`).

Advanced option: you can specify how often the pattern should appear in that location.

note

For the contract classification application, we check whether the word `employment` occurs in the first 5 paragraphs.
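The location check can be sketched as follows, assuming the section delimiters are the regular expressions `\s`, `\n`, `[.?!]\s`, and `\n\n+`. The `SPLITTERS` table, the function name `keyword_location_lf`, and the plain substring test are hypothetical choices for this sketch, not the builder's implementation.

```python
import re

# Assumed delimiters for each section type.
SPLITTERS = {
    "word": r"\s",
    "line": r"\n",
    "sentence": r"[.?!]\s",
    "paragraph": r"\n\n+",
}


def keyword_location_lf(text: str, keyword: str, unit: str = "paragraph", first_n: int = 5) -> bool:
    """Hypothetical helper: True if keyword occurs in the first `first_n` sections."""
    sections = re.split(SPLITTERS[unit], text)[:first_n]
    return any(keyword in section for section in sections)


if __name__ == "__main__":
    early = "\n\n".join(["Preamble.", "This employment agreement...", "Terms."])
    late = "\n\n".join(["Filler paragraph."] * 5 + ["employment appears here."])
    print(keyword_location_lf(early, "employment"))
    print(keyword_location_lf(late, "employment"))
```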

Keyword count builder

Label data points based on the frequency of a pattern in a field.

Advanced option: you can restrict the count to a specific section of the field.

note

For the contract classification application, we check whether the word `employment` occurs more than 5 times in the first 5 paragraphs.
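A frequency threshold like the one in this example could be expressed roughly as below. The sketch assumes paragraphs are separated by two or more newline characters and counts non-overlapping regex matches with `re.findall`; the function name `keyword_count_lf` and its parameters are hypothetical.

```python
import re
from typing import Optional


def keyword_count_lf(
    text: str,
    pattern: str,
    more_than: int = 5,
    first_n_paragraphs: Optional[int] = None,
) -> bool:
    """Hypothetical helper: True if the pattern occurs more than `more_than` times.

    If `first_n_paragraphs` is given, only that many leading paragraphs
    are searched (paragraph boundary assumed to be 2+ newlines).
    """
    if first_n_paragraphs is not None:
        paragraphs = re.split(r"\n{2,}", text)[:first_n_paragraphs]
        text = "\n\n".join(paragraphs)
    occurrences = len(re.findall(pattern, text))
    return occurrences > more_than


if __name__ == "__main__":
    doc = "\n\n".join(
        ["employment employment"] * 3 + ["other"] * 2 + ["employment"] * 4
    )
    # The first 5 paragraphs contain 6 occurrences; the whole text contains 10.
    print(keyword_count_lf(doc, r"employment", more_than=5, first_n_paragraphs=5))
    print(keyword_count_lf(doc, r"employment", more_than=6, first_n_paragraphs=5))
```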