Version: 0.96

Sequence LF builders

This article describes the set of builders that are available for sequence tagging applications (see the Sequence tagging: Extracting companies in financial news articles tutorial).

Sequence context builder

Label tokens based on the text surrounding a regular expression pattern. This builder labels the specified number of tokens to the [LEFT], [RIGHT], or [LEFT OR RIGHT] of the provided pattern.

tip

You can combine the “Show LF Votes” button and the “View [In]Correct” filter to see which tokens are being labeled by this LF.
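The context rule can be pictured with a small sketch. This is not the builder's implementation: the function name `label_context`, the whitespace tokenization, and the sample sentence are all assumptions made for illustration.

```python
import re

def label_context(text, pattern, n_tokens=2, side="LEFT"):
    """Return (start, end) character spans of up to n_tokens tokens next to each match.

    Tokens are whitespace-delimited here, which is an assumption for this sketch.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    spans = []
    for match in re.finditer(pattern, text):
        left = [t for t in tokens if t[1] <= match.start()]
        right = [t for t in tokens if t[0] >= match.end()]
        if side in ("LEFT", "LEFT OR RIGHT"):
            # Take the last n_tokens tokens that end before the match.
            spans.extend(left[max(0, len(left) - n_tokens):])
        if side in ("RIGHT", "LEFT OR RIGHT"):
            # Take the first n_tokens tokens that start after the match.
            spans.extend(right[:n_tokens])
    return sorted(set(spans))

text = "Shares of Acme Corp rose after the announcement"
left_spans = label_context(text, r"\brose\b", n_tokens=2, side="LEFT")
left_tokens = [text[s:e] for s, e in left_spans]
print(left_tokens)
```

In this sketch, the two tokens to the left of "rose" are the ones that would receive the vote.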

Sequence fuzzy keyword builder

Label tokens that are similar to any of the keywords you provide (maximum of three). Similarity is measured with SequenceMatcher, and a token is labeled when its score meets the specified similarity ratio. This is particularly useful for documents obtained through OCR, where characters are often misrecognized.
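To get a feel for similarity ratios, here is a minimal sketch that uses `SequenceMatcher` from Python's standard `difflib` module. The helper name `fuzzy_matches`, the lowercasing step, and the 0.8 threshold are assumptions for illustration, not the builder's documented behavior.

```python
from difflib import SequenceMatcher

def fuzzy_matches(tokens, keywords, min_ratio=0.8):
    """Return the tokens whose similarity to any keyword is at least min_ratio."""
    hits = []
    for token in tokens:
        for keyword in keywords:
            # Lowercasing before comparison is an assumption of this sketch.
            ratio = SequenceMatcher(None, token.lower(), keyword.lower()).ratio()
            if ratio >= min_ratio:
                hits.append(token)
                break  # one matching keyword is enough
    return hits

# "Micros0ft" imitates an OCR error (zero instead of the letter o).
tokens = ["Micros0ft", "announced", "earnings"]
hits = fuzzy_matches(tokens, ["Microsoft"], min_ratio=0.8)
print(hits)
```

Lowering `min_ratio` catches more OCR variants but also increases the chance of labeling unrelated tokens.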

Sequence keyword builder

Label a sequence of one or more tokens that match any words or phrases you provide.

note

The Sequence Keyword builder allows a maximum of three (3) keywords/phrases per builder. To add more, you can:

  • Combine additional Sequence Keyword Builders using OR.
  • Use the Sequence Entity Dict Builder.

Sequence substring expansion builder

If a token [matches] the whole value, [contains] the value, [starts with] the value, or [ends with] the value, vote for the class you specify. You can toggle whether the value is treated as a regular expression.
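The four match modes can be sketched as follows. The function `token_matches` and its parameters are hypothetical names chosen for this example; the sketch only illustrates how a regex toggle could change the way the value is interpreted.

```python
import re

def token_matches(token, value, mode="matches", is_regex=False):
    """Check a single token against a value using one of four match modes."""
    # When the value is not a regex, escape it so it is matched literally.
    pattern = value if is_regex else re.escape(value)
    if mode == "matches":
        return re.fullmatch(pattern, token) is not None
    if mode == "contains":
        return re.search(pattern, token) is not None
    if mode == "starts with":
        return re.match(pattern, token) is not None
    if mode == "ends with":
        return re.search(r"(?:" + pattern + r")\Z", token) is not None
    raise ValueError(f"Unknown mode: {mode}")

print(token_matches("Corp.", "Corp", mode="starts with"))
print(token_matches("Inc", r"Inc|Ltd", mode="matches", is_regex=True))
```

With the regex toggle off, a value such as `a.b` matches only the literal characters, so the token `axb` would not match.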

Sequence spaCy prop builder

This LF builder uses the spaCy properties of tokens to label them.

Prerequisite

In your DAG, add a SpacyPreprocessor node before the Model node to create spaCy properties for your data. The SpacyPreprocessor “Field” and “Target field” values are text and doc, respectively.

Label tokens that satisfy these conditions:

  • The tokens are in this spaCy field (doc).
  • The tokens are tagged with any of the SPACY_TAGS, such as VERB or ADJ for part-of-speech tags. For a complete list of tags, see the spaCy glossary.

The dropdown has commonly used spaCy tags. If you have added any other spaCy tags in your application, add these as free text.

The data viewer in the spaCy field shows any tags you added. For example, the spaCy field doc in the data viewer shows custom tags NNP, IN, NNPS, and _SP:

[Screenshot: the spaCy field doc in the data viewer]

Advanced option: You can specify the spaCy properties that fit your use case. Snorkel Flow supports POS (part-of-speech), DEP (dependencies), and TAG.

Example

If the tokens are tagged as VERB or ADJ under POS (part-of-speech), they are unlikely to be company names. We can label these tokens as OTHER.
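The example above can be sketched without loading a spaCy model by working on tokens that are already tagged. The function `vote_by_pos` and the hand-written tags below are assumptions for illustration; in spaCy itself, the coarse part-of-speech tag of a token is available as `token.pos_`.

```python
def vote_by_pos(tagged_tokens, tags, label="OTHER"):
    """Return (token_text, label) votes for tokens whose POS tag is in tags."""
    return [(text, label) for text, pos in tagged_tokens if pos in tags]

# Hand-tagged tokens standing in for the output of a SpacyPreprocessor node.
tagged = [
    ("Apple", "PROPN"),
    ("is", "AUX"),
    ("buying", "VERB"),
    ("small", "ADJ"),
    ("startups", "NOUN"),
]
votes = vote_by_pos(tagged, {"VERB", "ADJ"})
print(votes)
```

Tokens with other tags, such as PROPN, receive no vote from this LF and are left for other labeling functions.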

Sequence NER builder

This LF builder uses named entity recognition (NER) output to label spans.

Label spans that satisfy these conditions:

  • The spans are in this field (NER field).
  • The spans are tagged with any of the selected NER properties.

Prerequisite: In your DAG, add a SpacyPreprocessor node before the Model node to create spaCy properties for your data. The SpacyPreprocessor takes your text field as “Field” and writes its output to the “Target field” doc; in this case, select doc as the NER field. Alternatively, add a custom featurizer in the DAG that adds an NER field in the following format:

note

For the text Apple is looking at buying U.K. startup for $1 billion, the NER field is a JSON dictionary with “ents” as its key and a list of entities as its value. Each entity is a dictionary with “start”, “end”, and “label” as keys.

{
  "ents": [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 27, "end": 31, "label": "GPE"},
    {"start": 44, "end": 54, "label": "MONEY"}
  ]
}

The three entities correspond to Apple, U.K., and $1 billion, respectively.

If the spans are tagged as ORG, they are likely company names. We can label these spans as COMPANY. In this example, “Apple” would be labeled as COMPANY.
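As a rough illustration of how the NER field drives the vote, the sketch below slices the text with each entity's character offsets and keeps the entities whose label is selected. The function name `label_spans` is hypothetical; the field layout follows the format described above.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# NER field in the format described above: character offsets plus a label.
ner_field = {
    "ents": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 27, "end": 31, "label": "GPE"},
        {"start": 44, "end": 54, "label": "MONEY"},
    ]
}

def label_spans(text, ner_field, ner_labels, vote):
    """Return (span_text, vote) for each entity whose label is in ner_labels."""
    return [
        (text[ent["start"]:ent["end"]], vote)
        for ent in ner_field["ents"]
        if ent["label"] in ner_labels
    ]

company_votes = label_spans(text, ner_field, {"ORG"}, "COMPANY")
print(company_votes)
```

A custom featurizer only needs to produce offsets that index into the same text field, so that slicing with `start` and `end` recovers the entity string.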

Sequence entity dict builder

Label all tokens containing patterns from a dictionary provided through a JSON file.

Only the values of the JSON are used by the labeling function; the keys can be used for organizational purposes, but are otherwise ignored. See the example below for reference.

Label tokens that satisfy these conditions:

  • The tokens are in this field (text).
  • The tokens contain the patterns from this file (file_path).

Example

Our JSON file maps Fortune 500 stock tickers to company name aliases. The keys (stock tickers) are ignored for the purposes of this labeling function. If any of the values in the JSON match a span in the document, we label the matched tokens as COMPANY. Here is an example of the JSON file: s3://snorkel-workshop-data/financial-news/f500_ticker_key_fixed.json

{
  "WMT": [
    "Walmart",
    "www.walmart.com",
    "Wal Mart Stores Inc"
  ],
  "XOM": [
    "Exxon Mobil",
    "www.exxonmobil.com",
    "Exxon Mobil Corp"
  ],
  ...
}

tip

The JSON file must be located at an S3 path.
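The way keys are ignored and values are matched can be sketched as follows. This is an assumption-laden illustration rather than the builder's implementation: the function `find_dict_matches` is a hypothetical name, the dictionary is inlined instead of being read from S3, and preferring longer aliases over shorter ones is a design choice of this sketch.

```python
import json
import re

def find_dict_matches(text, entity_dict):
    """Return (start, end, matched_text) for every dictionary value found in text."""
    # Only the values are used; the keys (here, tickers) are ignored.
    aliases = {alias for values in entity_dict.values() for alias in values}
    if not aliases:
        return []
    # Try longer aliases first so "Exxon Mobil Corp" wins over "Exxon Mobil".
    ordered = sorted(aliases, key=lambda a: (-len(a), a))
    regex = re.compile("|".join(re.escape(a) for a in ordered))
    return [(m.start(), m.end(), m.group()) for m in regex.finditer(text)]

entity_dict = json.loads("""
{
  "WMT": ["Walmart", "www.walmart.com", "Wal Mart Stores Inc"],
  "XOM": ["Exxon Mobil", "www.exxonmobil.com", "Exxon Mobil Corp"]
}
""")

text = "Exxon Mobil Corp and Walmart reported earnings."
matches = find_dict_matches(text, entity_dict)
print([m[2] for m in matches])
```

Because "WMT" and "XOM" are keys, they would not be matched in a document unless they also appear among the values.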

Sequence word vector builder

Label all tokens whose cosine similarity to any of the provided keywords is greater than or equal to the provided threshold. The score is calculated between the word vectors of the token and the keyword, which are loaded from the provided file path.

  • Keywords - The keywords that will be compared to the tokens within the text field
  • Cosine Similarity - Indicates the threshold at which a token will be considered similar enough to one of the keywords so as to be labeled. A score of 1.0 represents an identical word. Typical scores for matching are in the 0.4 to 0.8 range.
  • Word Vector Path - Indicates the S3 or MinIO path to a file containing pre-trained word vectors. This file should be formatted such that each line contains a single token followed by the vector values delimited by spaces.

Note

Sample word vector file format for 4-dimensional vectors:

scale 0.96014 -0.18144 0.22938 0.28215
solution 1.1908 0.23095 -0.169 0.043158
banking 0.17391 -0.69398 0.11051 0.7731
...
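To make the file format and the threshold concrete, here is a minimal sketch that parses lines in the format above and compares tokens to keywords by cosine similarity. The helper names (`load_vectors`, `cosine`, `similar_tokens`) are hypothetical, the vectors are read from an in-memory list rather than from S3 or MinIO, and skipping tokens without a vector is an assumption of this sketch.

```python
import math

def load_vectors(lines):
    """Parse lines of the form 'token v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_tokens(tokens, keywords, vectors, threshold):
    """Return tokens whose similarity to any keyword is at least the threshold."""
    hits = []
    for token in tokens:
        if token not in vectors:
            continue  # tokens without a vector are skipped in this sketch
        if any(k in vectors and cosine(vectors[token], vectors[k]) >= threshold
               for k in keywords):
            hits.append(token)
    return hits

vectors = load_vectors([
    "scale 0.96014 -0.18144 0.22938 0.28215",
    "solution 1.1908 0.23095 -0.169 0.043158",
    "banking 0.17391 -0.69398 0.11051 0.7731",
])
hits = similar_tokens(["banking", "scale", "unknown"], ["banking"], vectors, 0.9)
print(hits)
```

Real word vector files typically have far more dimensions than this four-dimensional sample, but the parsing and scoring logic is the same.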

Sequence letter case builder

Vote for the class of a token based on whether it is lowercase, uppercase, or fully capitalized. By default, this LF tokenizes on the regex word boundary \b. If you turn off Word Boundary Tokenizer under the advanced menu, a spaCy tokenizer is used instead.

Nearby tokens are merged by default. To vote on individual tokens instead, turn off Merge Nearby Tokens under the advanced menu.
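The interaction between word-boundary tokenization and merging can be sketched as follows. Everything here is an assumption for illustration: the function name `case_spans`, the use of `str.isupper` as the case check, and the rule that only whitespace-separated neighbors are merged.

```python
import re

def case_spans(text, predicate=str.isupper, merge_nearby=True):
    """Return the text of tokens (or merged token runs) that satisfy predicate."""
    spans = []
    # Tokenize on regex word boundaries.
    for m in re.finditer(r"\b\w+\b", text):
        if not predicate(m.group()):
            continue
        # Merge with the previous hit when only whitespace separates them.
        if merge_nearby and spans and text[spans[-1][1]:m.start()].isspace():
            spans[-1] = (spans[-1][0], m.end())
        else:
            spans.append((m.start(), m.end()))
    return [text[s:e] for s, e in spans]

text = "Shares of ACME HOLDINGS rose while IBM fell"
merged = case_spans(text)
separate = case_spans(text, merge_nearby=False)
print(merged)
print(separate)
```

With merging on, adjacent matching tokens form one span; with merging off, each token is voted on individually.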