Version: 0.96

Information extraction: Extracting execution dates from contracts

We will demonstrate how to use Snorkel Flow for a simple text information extraction application: extracting the execution date from a set of loan agreements found in SEC filings. The execution date is the date on which a contract has been signed by all the involved parties.

In Snorkel Flow, we follow one classic approach to information extraction in which the first step is to identify a large, high-recall set of candidate spans: spans of text that might be execution dates. We then use the rest of the Snorkel Flow application to classify these candidate spans as POSITIVE (i.e. actually an execution date) or NEGATIVE (i.e. some other date), similar to how we classified documents in the previous walkthrough.

In this quick start example, we consider all dates as candidate spans. We use Snorkel Flow’s built-in date candidate extractor to do this.

While most contract files have only one execution date, each document can contain multiple dates, such as the date it was restated/amended/terminated, or the date a reference document was executed.

  1. For this demo dataset, the URL field may contain the execution date for some documents. We recommend you do not use this field for writing labeling functions or training end models.
  2. This demo dataset is available via S3 or upon request.

Adding data sources

Before we can create an application we must create a dataset and upload data sources for each split: train, test, and valid. We can also upload additional data sources for each split over time. Data points in the valid and test splits come with ground truth labels. The train split is partially labeled and we’ll use Snorkel Flow to label the rest of it.

First, we’ll create a dataset named loan-agreement-dataset by clicking on the Datasets tab and then the Add dataset button. Then we’ll individually add the data sources for each split by clicking Add Data Source. We can now pass the File path (see S3 paths below), and Split (see below) for each data source accordingly. Next, we’ll hit Verify Data Source(s) and set UID Column to uid. Finally, we’ll select Add data source(s) which will start the data ingestion process.

  • Test: The held-out test split contains ground truth labels for evaluation purposes. While not strictly necessary, we recommend a final evaluation against ground truth (i.e. expert annotated) labels, as we have provided with these splits. Also note that you should not look at any of these data points, or else you risk biasing your evaluation procedure!

    • s3://snorkel-loan-agreement/dropped_loan_agreement.test.parquet
  • Valid: Similar to test, the valid split contains some ground truth labels for evaluation purposes. This split can be used to tune your ML model.

    • s3://snorkel-loan-agreement/dropped_loan_agreement.valid.parquet
  • Train: Data in this split does not require any labels. It guides the development of our LFs and feeds into our end model. We will add two data sources for train. During the development of LFs on the Label page, a subset of the train split is randomly sampled to create a dev split. By default, 10% of the train split (up to 4000 samples when sampling by doc for applications with spans) is used for the dev split, which can be updated later through resampling. We include ground truth labels for a small number of these data points so that you can look at them during development to get an idea of the problem (see the description of the dev split). In the subsequent steps, you will use Snorkel Flow to label the rest of this training split!

    • s3://snorkel-loan-agreement/dropped_loan_agreement.train.parquet
    • s3://snorkel-loan-agreement/dropped_loan_agreement.dev.parquet

Dataset details

Each original data point is a document with the content in the text field and the url of where the document is hosted in the URL field. For this application, we used built-in pre-processors in Snorkel Flow to extract dates from documents. Since we extracted candidates from the text field, we added several fields to the data to keep track of the spans.

  • char_start: The index of the first character of the span in the document.
  • char_end: The index of the last character of the span in the document.
  • context_uid: The uid of the document from which the span was extracted. All spans that have the same context_uid come from the same document.
  • span_text: The content of the candidate span.
  • span_field: The field which the candidates were extracted from. In this case, it’s the text field.
  • span_preview: The span is highlighted with its context. The length of the context is specified in the load processors. In this case, span_preview captures 200 characters to the left and to the right of the span.
  • span_preview_offset: The index of the first character of the span preview in the document
  • initial_label: The label assigned to a span at creation time (if available). After the initial upload, the label for each span comes from Snorkel Flow.
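The relationship between these fields can be sketched in plain Python. This is only an illustration, not Snorkel Flow's implementation: the helper name build_span_record is hypothetical, char_end is treated as inclusive (following the "index of the last character" description above), and the 200-character window mirrors the span_preview setting described above.

```python
def build_span_record(text, char_start, char_end, context_uid, window=200):
    """Assemble the span bookkeeping fields for one candidate (illustrative only)."""
    # char_end is inclusive here, so the slice end is char_end + 1.
    preview_start = max(0, char_start - window)
    preview_end = min(len(text), char_end + 1 + window)
    return {
        "context_uid": context_uid,
        "span_field": "text",
        "char_start": char_start,
        "char_end": char_end,
        "span_text": text[char_start:char_end + 1],
        "span_preview": text[preview_start:preview_end],
        "span_preview_offset": preview_start,
    }


doc = "This Agreement is dated as of March 15, 2006."
start = doc.index("March")
end = start + len("March 15, 2006") - 1
record = build_span_record(doc, start, end, context_uid=0)
print(record["span_text"])
```

Subtracting span_preview_offset from char_start gives the position of the span inside span_preview, which is what lets a viewer highlight the span in its context.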

Create an application

Once you are done uploading the data sources for each split, you will create an application. Click on the Applications tab and select New application. Select the Text Extraction application template and click Create application. Name the application contract-execution-date. Also, select all fields under Specify required input fields. The Unknown Label and Labels sections are already filled out for us. Finally, select text as the Extraction field and hit Save.

Create an operator DAG in application studio

We will now add our Span Extractor to pull out dates via regex in our newly created Text Extraction Block. In this example, we show how you can write a regex-based extractor for your use case. Snorkel Flow also comes with built-in extractors (e.g. DateExtractor) that you can use out-of-the-box.

Click on the Expand 6 nodes button to view the application DAG, then select the SpanExtractor operator at the very beginning and choose the RegexSpanExtractor. Copy the following regex into the regex box, set the Field to text, and save by selecting Commit:

(?:[0-9]{1,2}(?:st|nd|rd|th)?(?: day)?(?: of)?.?(?:(?:Jan(?:.|uary)?)|(?:Feb(?:.|ruary)?)|(?:Mar(?:.|ch)?)|(?:Apr(?:.|il)?)|May|(?:Jun(?:.|e)?)|(?:Jul(?:.|y)?)|(?:Aug(?:.|ust)?)|(?:Sep(?:.|tember)?)|(?:Oct(?:.|ober)?)|(?:Nov(?:.|ember)?)|(?:Dec(?:.|ember)?)),?.?[12][0-9]{3})|(?:(?:(?:Jan(?:.|uary)?)|(?:Feb(?:.|ruary)?)|(?:Mar(?:.|ch)?)|(?:Apr(?:.|il)?)|May|(?:Jun(?:.|e)?)|(?:Jul(?:.|y)?)|(?:Aug(?:.|ust)?)|(?:Sep(?:.|tember)?)|(?:Oct(?:.|ober)?)|(?:Nov(?:.|ember)?)|(?:Dec(?:.|ember)?)).?[0-9]{1,2}(?:st|nd|rd|th)?(?: day)?(?: of)?,?.?[12][0-9]{3})

Lastly, click on the Model operator.
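To get a feel for what the regex above extracts, it can be tried outside the platform with Python's re module. The sketch below rebuilds the same pattern from smaller pieces so it is easier to read. The helper extract_candidates and the sample sentence are made up for illustration, and this is not how the RegexSpanExtractor is implemented.

```python
import re

# (abbreviation, remainder) pairs. The unescaped "." is kept exactly as in the
# regex above, so after the abbreviation it matches any single character.
MONTH_PARTS = [
    ("Jan", "uary"), ("Feb", "ruary"), ("Mar", "ch"), ("Apr", "il"),
    ("May", ""), ("Jun", "e"), ("Jul", "y"), ("Aug", "ust"),
    ("Sep", "tember"), ("Oct", "ober"), ("Nov", "ember"), ("Dec", "ember"),
]
MONTH = "(?:" + "|".join(
    "(?:" + abbr + "(?:.|" + rest + ")?)" if rest else abbr
    for abbr, rest in MONTH_PARTS
) + ")"
DAY = "[0-9]{1,2}(?:st|nd|rd|th)?(?: day)?(?: of)?"
YEAR = "[12][0-9]{3}"

# Day-first dates ("3rd day of June, 2008") or month-first dates ("March 15, 2006").
DATE_REGEX = re.compile(
    "(?:" + DAY + ".?" + MONTH + ",?.?" + YEAR + ")"
    "|(?:" + MONTH + ".?" + DAY + ",?.?" + YEAR + ")"
)


def extract_candidates(text):
    """Return one record per regex match, with an inclusive char_end."""
    return [
        {"char_start": m.start(), "char_end": m.end() - 1, "span_text": m.group(0)}
        for m in DATE_REGEX.finditer(text)
    ]


sample = ("This Agreement is dated as of March 15, 2006 "
          "and amended on the 3rd day of June, 2008.")
candidates = extract_candidates(sample)
for c in candidates:
    print(c)
```

Both dates in the sample sentence become candidate spans. Deciding which one is the execution date is left to the labeling functions and the model.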

Add span-level ground truth

On the Overview page, we get a high-level view of our application. We can also add any ground truth labels we have on this page. In the bottom third of the page you’ll notice an Upload GTs button. After clicking on it you’ll be prompted to provide the following information to import ground truth from a file:

  • File path: s3://snorkel-loan-agreement/loan-execution-date-gt.csv
  • File format: CSV
  • Label column: _gt_label
  • UID column: (leave this blank)

If you click on the View Data Sources button on the bottom left of the dashboard you can view the full set of data points and ground truth across splits:

               # of data points   # of GT labels   # of docs
train          4977               0                467
train (dev)    226                226              24
valid          372                372              31
test           162                162              25

We can view our data by clicking on the Go to Studio button on the top right.

View data at the document level

In case you have no ground truth labels in the train split and need some to help provide feedback while iterating on LFs, you can annotate directly within Develop (Studio). All the examples in this quickstart include ground truth labels, so you can skip annotating if you prefer. See Ground truth annotations for more details.

For extraction-related tasks, there is both a Record and a Document view, selected by clicking on the corresponding buttons above the data viewer. Record view shows one candidate span at a time. This view is often useful because the candidate span is the actual object you are labeling with your labeling functions and classifying with the model you will train. Document view shows one document at a time, which may (and usually will) contain multiple candidate spans. This view is often useful for getting a broader view of the context around the candidate spans.

Use the bottom arrow keys to iterate through documents and the top arrows to iterate through candidates in a given document. See Ground truth annotations for more details.

Resampling documents

Before we write LFs we will resample more documents. Click the Dev set dropdown in the top toolbar, then select the Resample data option.

Set a sample size of 100 and make sure you are sampling by docs. You can also set an optional random seed at the bottom of the box. Click Resample dev split.
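Sampling by docs means that whole documents are drawn, and every candidate span from a selected document goes into the dev split. A minimal sketch of that idea follows, assuming a list of context_uid values with one entry per span. The function sample_dev_docs is hypothetical and the platform's actual sampling procedure may differ.

```python
import random


def sample_dev_docs(context_uids, sample_size, seed=None):
    """Pick up to `sample_size` distinct documents, reproducibly for a given seed."""
    unique_docs = sorted(set(context_uids))
    rng = random.Random(seed)
    k = min(sample_size, len(unique_docs))
    return set(rng.sample(unique_docs, k))


# One context_uid per candidate span; documents have varying numbers of spans.
span_doc_ids = [0, 0, 1, 1, 1, 2, 3, 3, 4]
dev_docs = sample_dev_docs(span_doc_ids, sample_size=3, seed=123)

# Every span whose document was selected belongs to the dev split.
dev_rows = [i for i, uid in enumerate(span_doc_ids) if uid in dev_docs]
print(sorted(dev_docs), dev_rows)
```

Fixing the seed makes the resample repeatable, which helps when comparing LF metrics across sessions.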

Write labeling functions in label

In Develop (Studio), you will notice two main panes:

  • The left pane with accordions to

    • Manage labeling functions
    • Manage models
    • Conduct analysis
  • The right pane for viewing the data

You can use the various Filters at the top to choose what subset of data to look at. For example, you can choose to view all samples whose ground truths (GT) are POSITIVE.

Note

The label for this task is on the span level, i.e. it is the label for a candidate span (not the whole document), which means it only decides whether a candidate date span (highlighted in the Record and Document viewers) is the execution date of that document or not.

One heuristic is that if a date is repeated multiple times throughout the document, it’s likely to be the execution date. Let’s create a Span Count Builder to express that. Click into the search bar at the top and choose a Span Count Builder from the Span Based submenu. Populate the template with the following information and click Preview LF, view results, and then click Create LF.

  • Occurs: >=
  • Count: 7
  • Label: POSITIVE
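The logic behind this LF can be sketched in a few lines of Python. This is only an approximation of the heuristic: it counts exact occurrences of the span text in the document, which is an assumption about how the Span Count Builder counts, and the function name lf_span_count is made up.

```python
def lf_span_count(span_text, document_text, min_count=7):
    """Vote POSITIVE when the span text occurs at least `min_count` times."""
    occurrences = document_text.count(span_text)
    return "POSITIVE" if occurrences >= min_count else "UNKNOWN"


frequent_doc = "Dated March 15, 2006. " * 7
rare_doc = "Dated March 15, 2006. " * 3
print(lf_span_count("March 15, 2006", frequent_doc))
print(lf_span_count("March 15, 2006", rare_doc))
```

The function abstains (UNKNOWN) rather than voting NEGATIVE below the threshold, since a rarely mentioned date can still be the execution date.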

Another basic but powerful type of labeling function is one that reasons about keywords around the span. This is similar to the keyword builder for document classification, except that it includes some basic positional logic relative to the candidate span. For example, you might notice that dates following the phrase "is dated" are probably the execution date. We can create a labeling function with the Span Context Builder under Span Based for that.

  • Span is RIGHT of is dated within 5 words
  • Label: POSITIVE
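As a rough illustration of the positional logic, the sketch below checks whether a phrase occurs among the last few words to the left of the span. The exact meaning of "within 5 words" in the Span Context Builder is not spelled out here, so the windowing rule and the function name lf_right_of_phrase are assumptions.

```python
def lf_right_of_phrase(text, char_start, phrase="is dated", max_words=5):
    """Vote POSITIVE if `phrase` appears in the `max_words` words left of the span."""
    window = text[:char_start].split()[-max_words:]
    phrase_words = phrase.split()
    n = len(phrase_words)
    # Compare whole words so that "this dated" does not match "is dated".
    for i in range(len(window) - n + 1):
        if window[i:i + n] == phrase_words:
            return "POSITIVE"
    return "UNKNOWN"


near = "This Agreement is dated as of March 15, 2006."
far = "This Agreement is dated and signed by all of the parties hereto on March 15, 2006."
print(lf_right_of_phrase(near, near.index("March")))
print(lf_right_of_phrase(far, far.index("March")))
```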

You might also want to try out the Duplicate LF feature (accessible if you click the three dots near the LF name) to duplicate the current LF for faster LF creation.

Tip: Regex Builder with {{span}} macro

The Regex Builder under Pattern Based can be powerful if you’re comfortable with basic regular expressions. It can be combined with the macro {{span}}, a special token that refers to the extracted span. For example, if a span appears within the context "Closing Date" shall mean {{span}}, then it probably refers to an end date rather than the execution date. You will also notice that the matches for this regex are highlighted in the Data sources tab, which can help when iterating on the regex.

  • If field: span_preview
  • Contains the regular expression: \"Closing Date\" shall mean {{span}}
  • Then Label: NEGATIVE
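One way to think about the {{span}} macro is as a placeholder that is filled in with the span's own text before the regex is evaluated. The sketch below follows that reading. How Snorkel Flow actually resolves the macro is not described here, so treat the substitution step and the function lf_regex_with_span as assumptions.

```python
import re


def lf_regex_with_span(span_preview, span_text, pattern, label):
    """Substitute the span text for {{span}}, then search the preview."""
    # re.escape keeps characters in the span from being read as regex syntax.
    concrete = pattern.replace("{{span}}", re.escape(span_text))
    return label if re.search(concrete, span_preview) else "UNKNOWN"


preview = 'the "Closing Date" shall mean March 15, 2006, or such later date'
pattern = r'"Closing Date" shall mean {{span}}'
print(lf_regex_with_span(preview, "March 15, 2006", pattern, "NEGATIVE"))
```

Anchoring the pattern on the span means that another date elsewhere in the same preview does not trigger the LF by accident.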

Tip

Use filters to balance the classes

If, for example, we notice that most of our LFs are for NEGATIVE class, we could use the filters to look for patterns for POSITIVE. In fact, you can use the filters to look at all the data points whose ground truths are POSITIVE and are not yet covered by any labeling function.

Click the Filters icon in the top right. Select Advanced Filters and under Recommended Filters select Show Uncovered Examples. This filter is just a shortcut for Ground Truth is POSITIVE and All LFs (under Labeling Functions submenu) vote UNKNOWN.

Click around to explore your data and get more ideas for your labeling functions. Try to write at least one LF for each class that has 90%+ precision and 5%+ coverage. Here are some LFs you can try:

  • Span-Based LFs > Span Context Builder

    • Span is RIGHT of [0-9] within 2 words
    • Label: NEGATIVE
    • Click the three dots to the left of Span Context Builder and Advanced Settings and enable Regex functionality
  • Span-Based LFs > Span Context Builder

    • Span is LEFT OR RIGHT of and within 4 words
    • Label: NEGATIVE
  • Span-Based LFs > Span Context Builder

    • Span is RIGHT of EXECUTION VERSION within 3 words
    • Label: POSITIVE
  • Span-Based LFs > Span Context Builder

    • Span is RIGHT of Contract Date: within 3 words
    • Label: POSITIVE
  • Span-Based LFs > Span Context Builder

    • Span is LEFT of (the "Effective Date") within 5 words
    • Label: POSITIVE
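The precision and coverage targets mentioned above can be made concrete with a small calculation. The sketch below assumes that coverage is the fraction of data points on which the LF does not abstain, and that precision is the fraction of its non-abstaining votes that agree with ground truth. The platform's exact definitions may differ, and the function name is hypothetical.

```python
def lf_precision_and_coverage(votes, ground_truth):
    """Return (precision, coverage) for one LF's votes against ground truth."""
    voted = [i for i, v in enumerate(votes) if v != "UNKNOWN"]
    coverage = len(voted) / len(votes) if votes else 0.0
    correct = sum(1 for i in voted if votes[i] == ground_truth[i])
    precision = correct / len(voted) if voted else 0.0
    return precision, coverage


votes = ["POSITIVE", "UNKNOWN", "UNKNOWN", "POSITIVE", "UNKNOWN",
         "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN"]
gt = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE",
      "NEGATIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
precision, coverage = lf_precision_and_coverage(votes, gt)
print(precision, coverage)
```

In this toy example the LF votes on 2 of 10 spans and is right once, so it meets the coverage target but falls well short of 90% precision.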

Analyze labeling function performance

See Analyze Labeling Function performance in Document classification: Classifying contract types.

Data exploration and visualization with in-platform notebook

The in-platform notebook in Snorkel Flow can be a powerful platform for analyzing and visualizing data to come up with new ideas for labeling functions, using either the built-in functions or external libraries.

For example, we can create a boxplot of the distribution of char_start to help us notice that most of the execution dates appear near the beginning of the document.

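import matplotlib.pyplot as plt
import snorkelflow.client as sf

# node = <insert your node id here>
# This dataset filter only looks at data points with no LF labels
df = sf.get_dataset(node, all_lfs_filter="UNKNOWN")
# Compare where the spans of each ground truth class start in the document
df.boxplot(column=["char_start"], by="GT", grid=False)
# Zoom in on the first 500 characters
plt.ylim([0, 500])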

We see that we can use the condition that the span occurs in the first 350 characters to label the span as POSITIVE. We can use the Span Location Builder under Span Based labeling function builders where:

  • In: chars
  • Start index: 0
  • End: 350
  • Label: POSITIVE

LF development with in-platform notebook

The In-platform Notebook interface allows you to develop custom labeling functions and explore your data using Jupyter. For this walkthrough, we’ll cover one type of visualization and custom LF.

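Say we want to check whether other years (like 2005, 1995, etc.) are mentioned near the extracted span. We can write an LF that counts how many times the string 20 appears in the span_preview field. This is only a rough proxy: it catches years from 2000 onward but not years such as 1995, and it also counts other occurrences of 20, such as a day of the month.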

from snorkel.labeling.lf import labeling_function

@labeling_function(name="my_nb_lf")
def lf(x):
    # Two or more occurrences of "20" suggest other years near the span
    if x.span_preview.count("20") >= 2:
        return "NEGATIVE"
    else:
        return "UNKNOWN"

sf.add_code_lf(node, lf, label_str="NEGATIVE")

Train a simple model

You can train a model on any training set. For extraction tasks, we offer Span Aware model configs, in addition to all default model configs used for classification. Span Aware models are trained on, and make predictions for, the extracted spans together with their surrounding context. Additionally, Span Aware models provide options for classifying spans, such as masking the span text to prevent overfitting. Additional details on models can be found in Model training.

Note

The models with the * next to them are computationally intensive, and we recommend training them on GPU if possible.

Model training in the SDK

We have also included an example Jupyter notebook that uses the SDK to load the dataset and its generated labels to train a model. You can use a notebook like this one to test out other models that are not yet supported in the app or try different settings.

To train an external model for this task with the Python SDK, you can use extraction_model_examples.py, which can be found under Notebook > File Browser > examples.

Span and document level metrics

In the information extraction setting, we often have two levels at which we’d like to evaluate the final model performance:

  • Span level: This is how many candidate spans were classified correctly – e.g. in our setting, how many dates we correctly marked as either POSITIVE (an execution date) or NEGATIVE (not an execution date).
  • Document level: This measures our performance at extracting the correct value(s) from the overall document – e.g. the execution date of the document.

In many settings, such as this one, we care less about classifying each date correctly (the span level score), and instead just want to find the single execution date of the document with high accuracy (the document level score).

After the model finishes training, you can see the span-level metrics for the trained model on the non-train splits. Span-level metrics summarize the binary classification metrics over the extracted candidate data points. This is a good proxy during model training, but it is not the metric that we ultimately care about.
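The difference between the two levels can be illustrated with a toy calculation. The sketch below counts a document as correct only when its set of predicted POSITIVE spans equals its set of ground truth POSITIVE spans. That scoring rule, the field names pred and gt, and both helper functions are assumptions made for illustration, not the platform's metric definitions.

```python
from collections import defaultdict


def span_accuracy(spans):
    """Fraction of candidate spans whose predicted label matches ground truth."""
    return sum(s["pred"] == s["gt"] for s in spans) / len(spans)


def document_accuracy(spans):
    """Fraction of documents whose predicted POSITIVE spans match the true ones."""
    by_doc = defaultdict(list)
    for s in spans:
        by_doc[s["context_uid"]].append(s)
    correct = 0
    for doc_spans in by_doc.values():
        true_dates = {s["span_text"] for s in doc_spans if s["gt"] == "POSITIVE"}
        pred_dates = {s["span_text"] for s in doc_spans if s["pred"] == "POSITIVE"}
        correct += true_dates == pred_dates
    return correct / len(by_doc)


def make(uid, text, gt, pred):
    return {"context_uid": uid, "span_text": text, "gt": gt, "pred": pred}


spans = [
    make(0, "March 15, 2006", "POSITIVE", "POSITIVE"),
    make(0, "June 3, 2008", "NEGATIVE", "NEGATIVE"),
    make(0, "May 1, 2010", "NEGATIVE", "NEGATIVE"),
    make(0, "July 4, 2011", "NEGATIVE", "NEGATIVE"),
    make(1, "April 2, 2007", "POSITIVE", "NEGATIVE"),
    make(1, "August 9, 2009", "NEGATIVE", "NEGATIVE"),
    make(1, "October 5, 2012", "NEGATIVE", "NEGATIVE"),
    make(1, "December 1, 2013", "NEGATIVE", "NEGATIVE"),
]
print(span_accuracy(spans), document_accuracy(spans))
```

Here 7 of 8 spans are classified correctly, yet only one of the two documents ends up with the right execution date, which is why the document-level score is the one to watch.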

Inferring document ground truth via SDK

Because we uploaded ground truth manually via a CSV file, we first need to explicitly infer the document-level ground truth (this is not necessary when annotating spans in the platform). Document-level metrics cannot be loaded until this is done, so refresh the page after running the SDK steps to see them in the UI. Open the Notebook interface and run the following lines:

import snorkelflow.client as sf
# NODE = <insert your node id here>
sf.get_inferred_document_ground_truth_from_span_ground_truth(NODE, split="dev")
sf.get_inferred_document_ground_truth_from_span_ground_truth(NODE, split="valid")
sf.get_inferred_document_ground_truth_from_span_ground_truth(NODE, split="test")

Document-level metrics

These metrics measure how well our model did at identifying the correct execution date for each document. To calculate these metrics, we first need a way to determine which span with a positive predicted label in the document is the execution date. In Snorkel Flow, we abstract this as two basic operations:

  • Normalization: Transforming the raw text of the candidate spans to a canonical form of the object we actually care about, e.g. a date.
  • Reduction: Picking a single extraction from the many different candidates our model has made positive / negative predictions over.

In Snorkel Flow, Normalizers and Reducers perform this process of aggregating predictions across spans into the one execution date associated with the document. To add them, we need to go back to the Application Studio page where we defined our application DAG. As a shortcut, we can also get there by clicking on the app name in the navigation breadcrumbs across the top of the application.

Normalizers

  • We will use the DateSpanNormalizer, which converts all spans into the YYYY-MM-DD format. At the very end of the DAG you will notice an IdentitySpanNormalizer operator, which just copies the span text as-is, and an empty Reducer placeholder. To replace the IdentitySpanNormalizer with a DateSpanNormalizer, click on the button with three vertical dots next to the normalizer and hit Uncommit. Then click on the placeholder, add the DateSpanNormalizer operator, and hit Commit.

Reducers

  • The DocumentMostConfidentReducer will select the date with the highest probability of being a positive span as being the execution date for a given document. Click on the Reducer operator placeholder and follow the same steps as above.
  • You can also try the DocumentFirstReducer, which will select the first positively-predicted span in the document. This particular reducer is a good heuristic for this application since execution dates tend to appear near the beginning of the document.
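The two operations can be sketched with the standard library. This is a simplified stand-in for the operators named above, not their implementation: the list of date formats, the 0.5 threshold for a positive prediction, the prob_positive field, and all function names are assumptions.

```python
from datetime import datetime

# A few formats for illustration; real contracts contain many more variants.
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%d %B %Y", "%d %b %Y")


def normalize_date(span_text):
    """Normalization: map raw span text to YYYY-MM-DD, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(span_text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def positive_spans(spans, threshold=0.5):
    return [s for s in spans if s["prob_positive"] >= threshold]


def most_confident_reducer(spans):
    """Reduction: keep the positively predicted span with the highest probability."""
    positives = positive_spans(spans)
    return max(positives, key=lambda s: s["prob_positive"]) if positives else None


def first_reducer(spans):
    """Reduction: keep the positively predicted span that appears first."""
    positives = positive_spans(spans)
    return min(positives, key=lambda s: s["char_start"]) if positives else None


# Model outputs for the candidate spans of a single document (made-up values).
doc_spans = [
    {"span_text": "March 15, 2006", "char_start": 30, "prob_positive": 0.70},
    {"span_text": "June 3, 2008", "char_start": 400, "prob_positive": 0.92},
    {"span_text": "1 January 2005", "char_start": 900, "prob_positive": 0.10},
]
by_confidence = normalize_date(most_confident_reducer(doc_spans)["span_text"])
by_position = normalize_date(first_reducer(doc_spans)["span_text"])
print(by_confidence, by_position)
```

The two reducers disagree on this document, which shows why the choice of reducer affects document-level metrics even when the span-level predictions are unchanged.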

View errors and iterate

See View errors and iterate in Document classification: Classifying contract types.

View model predictions

You can view the model predictions for the spans in the Document view in Develop (Studio). Click the gear icon in the top right corner and, under Color Span by, select the model of interest (e.g. “run #0”). This will change the span highlighting to be based on the model prediction. If you hover over an individual span, you can see the predicted class, the confidence associated with the prediction, and the ground truth label.