
Sequence tagging: Extracting companies in financial news articles

We will demonstrate how to use Snorkel Flow to extract mentions of companies in financial news articles with state-of-the-art sequence tagging technology. In this tutorial, you will learn:

  • How to create a sequence tagging application
  • How to write labeling functions for a sequence tagging application

Known limitations with sequence tagging

The following are known limitations; addressing them is planned for future releases.

  • Active learning.
  • Reviewer workflows in Annotation Studio.
  • Labeling function conflict in Studio data viewer filters.
  • Margin distance filter.
  • Advanced filters (composing with "OR," "ALL/ANY LF Voted")
  • Sequence Keyword LF supports at most ten keywords.
  • Sequence Fuzzy Keyword LF and Sequence Word Vector LF support at most three keywords.
  • Custom metrics.

Create a dataset

First, we will create a dataset named company-dataset by clicking the Upload new dataset button on the Datasets page. Ensure that the Enable multi-schema annotations checkbox is selected. Then we will add the following data sources for each split:

  • Train: s3://snorkel-financial-news/mini_train.csv
  • Valid: s3://snorkel-financial-news/mini_valid.csv

After adding these two data sources, click the Verify Data Source(s) button. Fill out the following fields:

  • UID Column: index
  • Data type: Raw text
  • Task type: Sequence tagging
  • Primary text field: text

Finally, click the Add data source(s) button. For more information about uploading data, see Data upload.

Create an application

In the Linked Applications tab, click the Create from Template button in the Create Application dropdown menu. Select Sequence tagging, then click Create Application. Fill out the following fields:

  • Application name: company-ner-seq-tagging
  • Choose default dataset: company-dataset (created above)
  • Specify required input fields: Select text, context_uid, and title
  • Ground truth column: leave blank
  • Negative label: OTHER
  • Labels: COMPANY
  • Seq field: text

Note that the negative label (OTHER in this case) is equivalent to the O tag (non-entity tokens) in BIO notation.
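
To make this concrete, here is a small illustration (plain Python, not Snorkel Flow API; the sentence is made up) of how span labels relate to BIO tags:

# Illustrative only: span labels vs. BIO tags for a toy sentence.
tokens   = ["Shares", "of", "Acme", "Corp", "rose", "today"]
bio_tags = ["O", "O", "B-COMPANY", "I-COMPANY", "O", "O"]

# In Snorkel Flow, the two COMPANY tokens form a single labeled span, and every
# token tagged "O" implicitly carries the negative label OTHER.
span_labels = ["OTHER", "OTHER", "COMPANY", "COMPANY", "OTHER", "OTHER"]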

Adding ground truth

Ground truth format

Here is an example ground truth file:

| __DATAPOINT_UID | start | end | label   |
|-----------------|-------|-----|---------|
| doc::1          | 29    | 40  | COMPANY |
| doc::1          | 228   | 239 | COMPANY |
| doc::2          | 395   | 406 | COMPANY |

Spans cannot be empty (char_start must be smaller than char_end). Overlapping or duplicate spans are not allowed. Negative spans (OTHER) are automatically inferred. Refer to snorkelflow.client.add_ground_truth() for more information.
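
As a minimal sketch (assuming pandas is available), a ground truth file in this format could be assembled and written to CSV as follows; consult snorkelflow.client.add_ground_truth() for the authoritative format:

import pandas as pd

# Example spans mirroring the table above; offsets must satisfy start < end
# and spans must not overlap.
gt = pd.DataFrame(
    [
        {"__DATAPOINT_UID": "doc::1", "start": 29, "end": 40, "label": "COMPANY"},
        {"__DATAPOINT_UID": "doc::2", "start": 395, "end": 406, "label": "COMPANY"},
    ]
)
gt.to_csv("company_ground_truth.csv", index=False)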

By default, two preprocessors are added to the DAG: TextSizeFilter filters out documents larger than 10 KB, and AsciiCharFilter strips non-ASCII characters from the documents. The latter may shift character offsets when you export ground truth out of Snorkel Flow.
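
For example, here is a small illustration (plain Python, not the actual preprocessor implementation) of why stripping non-ASCII characters shifts character offsets:

raw = "Nestlé™ shares rose"  # contains two non-ASCII characters
ascii_only = raw.encode("ascii", errors="ignore").decode("ascii")

print(ascii_only)                  # "Nestl shares rose"
print(raw.index("shares"))         # 8 in the raw text
print(ascii_only.index("shares"))  # 6 after filtering, so offsets shift by 2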

Upload ground truth file

The Overview page gives a high-level view of our application, and we can also add any external ground truth labels here. In the center of the dashboard, you’ll notice an Upload GT button. After clicking it, you’ll be prompted to provide the following information to import ground truth from a file:

  • File path: s3://snorkel-financial-news/aligned_mini_gts.csv
  • File format: CSV

If you click the View Data Sources button at the bottom left of the dashboard, you can view the full set of data points and ground truth across splits:

| Split | # of data points | # of GT labels |
|-------|------------------|----------------|
| train | 2000             | 78             |
| valid | 100              | 66             |

In the sequence tagging application, each document is represented as a data point. The # of GT labels column refers to the number of documents with one or more spans annotated with ground truth. We can now view our data by clicking the Develop button on the top right.

Annotate ground truth

You can also annotate ground truth labels directly on the Develop page by highlighting a span (double-click or drag-select) and assigning a label from the drop-down menu.

Write labeling functions

LF builders for sequence tagging application

Back on the Label page, you will notice two main panes:

  • The left pane for creating labeling functions and viewing labeling function statistics
  • The right pane for viewing the data

Here are some LFs you can try:

  • SKEY-facebook

    • Sequence Pattern Based LFs > Sequence Keyword Builder

      • text Contains a keyword in ["facebook"]
    • Label: COMPANY

    • Vote all the mentions of “facebook” as COMPANY

  • SRGX-Inc-Ltd-Corp

    • Sequence Pattern Based LFs > Sequence Regex Builder

      • text Contains the regular expression [A-Z]+[a-z]*,? (Inc|Ltd|Corp)\.?
    • Label: COMPANY

    • If a capitalized word is followed by “Inc,” “Ltd,” or “Corp,” vote the whole phrase as COMPANY (see the regex sketch after this list)

  • SRGX-Shareof

    • Sequence Pattern Based LFs > Sequence Regex Builder

      • text Contains the regular expression (?<=Shares of)\s([A-Z][a-z]+\s){1,5}
    • Label: COMPANY

    • Vote one to five capitalized words immediately following the phrase “Shares of” as COMPANY, matching as many as possible

  • SRGX-CEOCFO

    • Sequence Pattern Based LFs > Sequence Regex Builder

      • text Contains the regular expression \w+(\'|\’)?s? (?=(CFO|CEO|President))
    • Label: COMPANY

    • Vote the word before “CFO,” “CEO,” or “President” as COMPANY

  • SED-f500

    • Sequence Pattern Based LFs > Sequence Entity Dict Builder

      • text Contains the patterns from this file s3://snorkel-workshop-data/financial-news/f500_ticker_key_fixed.json
      • This file contains a dictionary of Fortune 500 company names, stock tickers, and their aliases.
    • Label: COMPANY

  • SRGX-lower

    • Sequence Pattern Based LFs > Sequence Regex Builder

      • text Contains the regular expression \b[a-z]+\b
    • Label: OTHER

    • Vote all the lower case words as OTHER

  • SRGX-wsww

    • Sequence Pattern Based LFs > Sequence Regex Builder

      • text Contains the regular expression \(\w+\)
    • Label: OTHER

    • Vote a word enclosed in parentheses, along with the parentheses themselves, as OTHER

  • SSPROP-adj-adv-verb

    • Sequence Pattern Based LFs > Sequence Spacy Prop Builder

      • If the tokens in this Spacy field doc are tagged with any of ADJ, ADV, VERB
    • Label: OTHER

  • SNER-DATE-PERSON

    • Sequence Pattern Based LFs > Sequence NER Builder

      • If the spans in this NER field doc are tagged with any of DATE, PERSON
    • Label: OTHER
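
If you want to sanity-check a pattern before creating a builder, the regular expressions above can be tried directly with Python’s re module. This is an illustrative sketch; the sentence is made up:

import re

text = "Shares of Advanced Braking Technology Ltd fell after the company's CEO resigned."

# SRGX-Inc-Ltd-Corp: a capitalized word followed by Inc, Ltd, or Corp.
for m in re.finditer(r"[A-Z]+[a-z]*,? (Inc|Ltd|Corp)\.?", text):
    print("COMPANY candidate:", repr(m.group(0)), m.span())

# SRGX-Shareof: one to five capitalized words after "Shares of".
for m in re.finditer(r"(?<=Shares of)\s([A-Z][a-z]+\s){1,5}", text):
    print("COMPANY candidate:", repr(m.group(0)), m.span())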

LF metrics for sequence tagging application

After saving an LF, you’ll see the score for that LF. Note that in the sequence tagging application, LF metrics are computed at the character level.
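
As a simplified illustration of character-level scoring (not the exact metric implementation), consider an LF whose COMPANY vote covers only part of a ground truth COMPANY span:

# Simplified character-level comparison for a single LF vote.
gt_chars = set(range(10, 25))  # ground truth COMPANY characters (15 chars)
lf_chars = set(range(15, 25))  # characters the LF voted COMPANY on (10 chars)

precision = len(gt_chars & lf_chars) / len(lf_chars)  # 10 / 10 = 1.0
recall = len(gt_chars & lf_chars) / len(gt_chars)     # 10 / 15 ≈ 0.67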

LF development with in-platform notebook

The in-platform notebook interface allows you to develop custom labeling functions and explore your data using Jupyter. Here is an example of a custom LF. You can find details on other available helper functions in the Python SDK.

We can use the pre-defined dictionary derived from the 2016 Fortune 500 list to build a custom LF that labels all Fortune 500 companies as COMPANY:

from snorkelflow.studio import resources_fn_labeling_function

def get_taxonomy():
    import fsspec
    import json

    # Dictionary mapping stock tickers to lists of company-name aliases.
    taxonomy_file = "s3://snorkel-financial-news/f500_ticker_key_fixed.json"
    with fsspec.open(taxonomy_file, mode="rt") as f:
        taxonomy = json.load(f)
    return {"taxonomy": taxonomy}

@resources_fn_labeling_function(name="lf_taxonomy", resources_fn=get_taxonomy)
def lf_taxonomy(x, taxonomy):
    import re

    spans = []
    for ticker, companies in taxonomy.items():
        # Match any alias of this company surrounded by whitespace.
        pat = r"\s(" + "|".join(companies).replace("\\", "") + r")\s"
        for match in re.finditer(pat, x.text):
            char_start, char_end = match.span(1)
            spans.append([char_start, char_end])
    return spans

sf.add_code_lf(node, lf_taxonomy, label="COMPANY")

You can also create a multi-polar code LF (see Multi-polar LFs for more information) using the doc field, which contains entity-level predictions from Spacy. For example, if an entity is tagged as ORG, label the corresponding span as COMPANY, and if tagged as DATE, label it as OTHER:

# Assumes labeling_function is exported from the same module as
# resources_fn_labeling_function above.
from snorkelflow.studio import labeling_function

@labeling_function(name="my_multipolar")
def sample_multi_polar_code_lf_sequence(x):
    # Entity-level predictions from the Spacy-enriched doc field.
    ents = x.doc["ents"]
    spans = []
    for ent in ents:
        char_start = ent["start"]
        char_end = ent["end"]
        label = ent["label"]
        if label == "ORG":
            spans.append([char_start, char_end, "COMPANY"])
        elif label == "DATE":
            spans.append([char_start, char_end, "OTHER"])
    return spans

lf = sf.add_code_lf(node, sample_multi_polar_code_lf_sequence, is_multipolar=True)

Create a label package and label your dataset

See Create a label package and label your dataset in the Document classification: Classifying contract types tutorial.

Train a model

You can train a model on any training set. For sequence tagging applications, we offer the DistilBERT model for its compactness and relatively fast training time while preserving 95% of BERT’s performance. If run on a CPU, this model may take several minutes or more to finish. We also offer BERT Tiny, which is about 2x faster than DistilBERT while preserving 90% of BERT’s performance. Additional details on models can be found in Model training.

Here are some details about parameters that are specific to DistilBERT models:

  • fraction_of_negative_subsequences_to_use_for_training: Determines how many negative span tokens to show while training. This parameter helps reduce model bias for the negative label class when there is class imbalance between positive and negative label classes.
  • stride_length and max_sequence_length: These correspond to the transformer tokenizer parameters stride and max_length, respectively (see the sketch below).
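
These map onto standard Hugging Face tokenizer arguments, so you can preview how a long document will be windowed. This is an illustrative sketch using the transformers library, not a view into Snorkel Flow’s training internals:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
long_text = "Shares of Acme Corp rose sharply today. " * 50  # toy long document

enc = tokenizer(
    long_text,
    max_length=128,                  # analogous to max_sequence_length
    stride=32,                       # analogous to stride_length
    truncation=True,
    return_overflowing_tokens=True,  # emit overlapping windows instead of dropping text
)
print(len(enc["input_ids"]))         # number of overlapping windows produced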

Span, overlap span, and token level model metrics

In the sequence tagging setting, we offer three levels of metrics to evaluate the final model performance:

  • Word level: Measures how many words are labeled correctly with respect to the ground truth.
  • Word Entity level: Measures how many groups of words are labeled correctly as entities, e.g., “Advanced Braking Technology Ltd” is considered one entity when calculating the metrics.
  • Overlap Span level: Measures how many predicted and ground truth spans overlap more than 50% as measured by Jaccard similarity.

For example, given a ground truth span “Advanced Braking Technology Ltd” and a predicted span “Braking Technology Ltd,” our recall scores for each measure are as follows (a worked check appears after this list):

  • word_recall = 3/4 = 0.75
  • word_entity_recall = 0/1 = 0.0
  • overlap_span_recall = 1/1 = 1.0
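
A quick way to verify these numbers (plain Python, not Snorkel Flow code; the word-level check is simplified to a set intersection):

gt_span = "Advanced Braking Technology Ltd"
pred_span = "Braking Technology Ltd"

# Word level: 3 of the 4 ground truth words are covered by the prediction.
word_recall = len(set(pred_span.split()) & set(gt_span.split())) / len(gt_span.split())  # 0.75

# Word Entity level: the prediction does not match the full ground truth entity.
word_entity_recall = 0 / 1  # 0.0

# Overlap Span level: the spans share 22 characters out of 31 in their union,
# a Jaccard similarity of ~0.71 > 0.5, so the ground truth span counts as recalled.
overlap_span_recall = 1 / 1  # 1.0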

View model predictions

You can view the model predictions on the Label page. In the models pane, in the drop-down model selector menu, select the Model ID of interest (e.g., “1”).

This will change the span highlighting to be based on the model prediction. If you hover over individual spans, you can see the predicted class with the confidence associated with the prediction.

View errors and iterate

See View errors and iterate in Document classification: Classifying contract types. Note that in the sequence tagging application, all analysis metrics are computed at the span level, which includes the Confusion Matrix, Clarity Matrix, Label Distribution, Class Level Metrics, Error Correlation, and PR curve.

Filters and span highlighting in record view and snippet view

The filters are applied at the span level and automatically select the span type(s) to highlight among four types of spans (Ground truth, LF vote, Training set vote, and Model prediction). By default, ground truth is always highlighted. Spans that match the filter are shown in full color, whereas spans that do not match the filter are faded out in Record view.

In the Snippet view, the highlight control pane is not editable, but spans are still viewable by scrolling within each snippet.

Correctness filter

The correctness filters for LF spans and model/training set spans use different criteria. For a model or training set span to be correct, an exact match with the ground truth on both the boundary and the label is required; this is aligned with the span-level metrics. For LF spans, the criterion is more relaxed: an LF span that matches only part of a ground truth span is still correct, since you can always write more LFs to cover the remaining part.
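
In pseudocode (a conceptual sketch, not Snorkel Flow’s implementation), the two rules look like this, with spans represented as (char_start, char_end, label) triples:

def model_span_is_correct(pred_span, gt_spans):
    # Model / training set spans: exact boundary and label match required.
    return pred_span in gt_spans

def lf_span_is_correct(lf_span, gt_spans):
    # LF spans: falling entirely inside a ground truth span of the same label is enough.
    start, end, label = lf_span
    return any(
        g_start <= start and end <= g_end and label == g_label
        for g_start, g_end, g_label in gt_spans
    )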

LF correctness filter

An LF span is correct as long as it is a substring of a ground truth span with the same label. For example, in the screenshot below, both “Spring Bancorp, Inc.” and “Washington Bankshares, inc.” are correct.

An LF span is incorrect if it votes on any character that does not belong to the selected class. For example, the LF vote on “acquired” is incorrect because its ground truth is OTHER.

Model and training set correctness filter

For a model or training set span to be correct, it has to match the ground truth span on both the label and the boundary. For example, in the screenshot below, only the first and last “Nvidia” are correct model predictions and are thus in focus. The second “Nvidia” is incorrect due to a mismatch in the span boundary and is thus faded out.

There are three types of incorrect model/training set spans:

  • False negatives, e.g., the second “McDonald,” where the ground truth is COMPANY but the model predicted OTHER.
  • False positives, e.g., in the second paragraph, the model predicted “s” as COMPANY, but the ground truth is OTHER.
  • Boundary mismatches, where a model prediction span has an incorrect boundary, e.g., the last two “McDonald’s.”

Dataset and application requirements

  • Label classes: A maximum of 25 positive label classes can be specified
  • Labeling functions: A maximum of 200 labeling functions can be defined
  • Total dataset size: A maximum of 250 MB total text data (i.e., data in the text field) across splits can be specified
  • Studio dataset size: A maximum of 15 MB total text data (i.e., data in the text field) can be loaded in Studio
  • Each text field: The maximum allowable size for the text field in each data point is 10 KB

Please contact a Snorkel representative for recommendations if any of these limitations are a concern.