Tutorial: Document classification for SEC filings
In this tutorial, you'll use Snorkel Flow to build and train a model that classifies over 20,000 SEC filings into one of the following categories: employment, services, stock, and loan. With Snorkel Flow's programmatic labeling, you'll label your training data using Labeling Functions (LFs), train your models, and analyze the results as part of an iterative development process.
This tutorial is designed for first-time users of Snorkel Flow who are new to data-centric AI, programmatic labeling, and machine learning models.
What you'll learn
- Uploading data into Snorkel Flow
- Creating a new application
- Programmatically labeling your data with Labeling Functions (LFs)
- Training your Machine Learning (ML) model
- Analyzing model errors and results
Concept: Data splits
In this tutorial, you'll upload four data splits for the example application, each with a specific purpose for model development:
- Train: The initial training dataset used to develop your machine learning model. This split does not need to have labels. However, this tutorial includes a small number of labels for development purposes, to help you understand the classification problem. This tutorial walks you through how to build consistent labels for the data in your `train` split.
- Dev: A randomly sampled dataset used for developing Labeling Functions. To limit the amount of data in memory, Snorkel Flow randomly samples the `train` split to create the `dev` split. By default, it samples 10% of the `train` split, up to 2,000 samples. The `dev` split dataset is viewable from your application's Develop > Studio page.
- Test: The ground truth dataset held out from the training data for evaluation purposes. While not strictly necessary, it's recommended to use the `test` split for a final evaluation of your model against a fresh set of expert-annotated labels. To avoid biasing your evaluation procedure, do not look at data points in the `test` split.
- Valid: Like `test`, the `valid` split contains some ground truth labels for evaluation purposes. This split can be used to tune your ML model.
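To make the `dev` sampling rule concrete, here is a minimal pandas sketch of the default behavior described above (10% of the `train` split, capped at 2,000 rows). This is illustrative only; Snorkel Flow performs the sampling for you, and reading the tutorial file directly from S3 assumes pandas, pyarrow, and s3fs are installed.

```python
import pandas as pd

# Load the tutorial's train file (requires pandas + pyarrow + s3fs).
train_df = pd.read_parquet("s3://snorkel-contracts-dataset/train-cleaned.parquet")

# Default dev sampling rule: 10% of the train split, capped at 2,000 rows.
dev_size = min(int(0.10 * len(train_df)), 2000)
dev_df = train_df.sample(n=dev_size, random_state=0)  # fixed seed for reproducibility
```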
For more information about uploading and splitting data, see the Uploading a dataset article.
Upload data
Before you create the example Snorkel Flow classification application, you need to upload the data you want to classify.
You will upload this data to a new dataset in your Snorkel Flow instance. Data can be ingested into Snorkel Flow from a range of sources, including cloud storage, databases, and local files. For this tutorial, use the provided dataset from AWS S3 Cloud Storage. This is the set of SEC filings that you will classify with Snorkel Flow.
1. From the left menu, select Datasets.
2. Select + Upload new dataset.
3. In the Upload new dataset dialogue, enter the following information:
   - Dataset name: Enter `contract-clf` as your dataset name.
   - Select a data source: Select Cloud Storage as your data source.
   - File path and Split: Add these four data sources:

   | File path | Split |
   |---|---|
   | s3://snorkel-contracts-dataset/train-cleaned.parquet | train |
   | s3://snorkel-contracts-dataset/dev-cleaned.parquet | train |
   | s3://snorkel-contracts-dataset/valid-cleaned.parquet | valid |
   | s3://snorkel-contracts-dataset/test-cleaned.parquet | test |

4. Select Verify data source(s). Snorkel Flow runs data quality checks to ensure the data is cleaned and ready to upload.
5. Scroll down to the Define schema section. From the UID Column menu, select `uid`. This is the unique entry ID column in the data sources.
6. Select Add data source(s). When the data is done ingesting from all four data sources, it shows the status Data ingestion complete.
Troubleshooting: If one or more of the example data splits does not upload, delete the dataset and try to upload the data again.
After your data is uploaded, you can create your Snorkel Flow classification application.
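If you'd like to sanity-check the data outside Snorkel Flow, a brief sketch follows, reusing `train_df` from the earlier snippet. The column names match the fields this tutorial configures (`uid`, `text`, and `label`).

```python
# Optional sanity check on the uploaded data, reusing train_df from above.
print(train_df.columns.tolist())         # should include: uid, text, label
print(train_df["label"].value_counts())  # counts per category (employment, ...)
print(train_df["uid"].is_unique)         # the UID column must uniquely identify rows
```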
Create an application
Create a new Snorkel Flow application. Within this application, you can start using Snorkel Flow's powerful tools to classify the documents in your dataset. To create a new application:
- From the left menu, select Applications.
- Select the dropdown arrow from the Create application menu in the top right corner of the page, then select Create from template.
- Choose the Classification template from the menu of preset application templates. This template creates an application that labels entire documents across the desired schema.
- Click Create application.
- Fill in the New Classification application dialogue. Enter the following information:
  - Application name: `classification-example-app`
  - Choose default dataset: `contract-clf`
  - Specify required input fields: `text` (leave other options unchecked)
  - Ground truth column (Optional): `label`
- After inputting this information, select Autofill Labels to populate the Labels (optional) field with your labeling schema, based on existing data in your label column.
- Click Save to initiate application creation. This takes 2-5 minutes, as Snorkel Flow ingests your data into the application template and prepares the data for programmatic labeling.
- Once the application has initialized, select the Go to Studio button to access the Develop (Studio) page.
Next, you will develop a machine learning model that will significantly speed up the contract classification task, starting with labeling functions (LFs).
Task overview: Programmatic labeling
The remainder of this guide will walk through a working classification example that demonstrates the three core tasks of programmatic labeling:
1. Develop labeling functions (LFs).
   Labeling functions capture, as Python functions that can be applied to larger amounts of data, the mental models that your subject matter experts (SMEs) use when they label data manually. The end result is a large amount of labeled training data.
2. Train a machine learning (ML) model.
   Using the large amount of labeled training data, Snorkel Flow applies standard machine learning techniques to train an ML model customized to your dataset.
3. Analyze model results.
   Evaluate the model's performance, and adjust the labeling functions and training until the model meets your performance criteria.
Develop labeling functions (LFs)
Machine learning models train better with larger amounts of labeled data. Labeling Functions (LFs) allow you to programmatically label large amounts of data: they codify expert opinion and criteria for how to label your data. The resulting labeled data points expand your training set, so you can achieve better model results at the training stage.
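Conceptually, each no-code LF builder corresponds to a small Python function. As an illustration only, using the open-source `snorkel` library's API rather than Snorkel Flow's builders, a keyword LF for employment contracts might look like this:

```python
from snorkel.labeling import labeling_function

# Integer label constants for this sketch; Snorkel Flow manages the
# label schema for you in the UI.
EMPLOYMENT, ABSTAIN = 0, -1

@labeling_function()
def lf_employment_agreement(x):
    """Vote 'employment' if the key phrase appears; otherwise abstain."""
    return EMPLOYMENT if "EMPLOYMENT AGREEMENT" in x.text else ABSTAIN
```

Note that an LF abstains rather than guessing: each LF can cover only part of the data, and overlapping or conflicting votes are reconciled later during label aggregation.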
Follow along with the steps in the example to create a precise LF for employment contracts.
Filter data
The first stage in creating LFs is exploratory. Often, a data scientist and an SME work together to identify patterns in already-labeled data that are good candidates for an LF. Filter the uploaded data for only documents that are already labeled employment
. Do this from the Studio page for the classification-example-app
:
1. Select the Filter icon in the top right corner of the page.
2. Select Ground Truth.
3. Select Ground truth + is + `employment` from the dropdown menu.
4. Select the checkmark (✓) to filter the viewable data to documents already labeled `employment`.
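For reference, this UI filter is equivalent to a one-line pandas expression over the same data (illustrative only, reusing `dev_df` and the `label` ground truth column from the earlier snippets):

```python
# Keep only documents whose ground truth label is "employment".
employment_docs = dev_df[dev_df["label"] == "employment"]
print(len(employment_docs))
```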
Use Snippet view
Choose the view that lets you scan the data most efficiently.
- From the Record view menu, choose Snippet view to see a small chunk of text from each document.
Create, assess, and improve the first labeling function
As you scan the documents in the `employment` set, notice that most start with `EMPLOYMENT AGREEMENT`. Let's create an LF that labels all documents containing the text `EMPLOYMENT AGREEMENT` as `employment`:
1. In the text box in the top middle pane with the description Type anything to start creating an LF, type `/`.
2. Select Keyword under the Pattern Based section. This brings up the dialogue text + contains + Keywords.
3. In the Keywords field, enter `EMPLOYMENT AGREEMENT`.
4. From the Select a label menu on the right side of the LF builder, select `employment`.
5. Select Preview LF. The Record view now shows documents that match the LF. Snorkel Flow also displays metrics for this LF:
   - Prec. (Est): Measures the precision of the LF for your current data split; that is, how often the LF correctly labels documents, based on the ground truth. In this example, the precision is 93.6%, which is high. Your results may vary.
   - Coverage: The percentage of data in your split that matches this criterion. In this example, the coverage is 32.4%. Your results may vary.
6. Select Create LF to add this LF to the active LFs for the model you're going to train. You can see this LF added to the Labeling Functions pane.
7. To achieve higher precision, incorporate a second criterion into the `EMPLOYMENT AGREEMENT` LF to exclude all documents that contain the text `purchase`. Select the `KEY-EMPLOYMENT AGREEMENT-text` label from the Labeling Functions menu.
8. To build a multi-part LF, select the three-dot menu next to the first factor. Select Add another builder (and). From the three-dot menu, select Negate this builder to use the NOT option.
9. In the Keywords field, enter `purchase`.
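For reference, the refined two-part builder corresponds to a function like the following open-source `snorkel` sketch (illustrative only; label constants as in the earlier snippet):

```python
from snorkel.labeling import labeling_function

EMPLOYMENT, ABSTAIN = 0, -1  # as in the earlier sketch

@labeling_function()
def lf_employment_not_purchase(x):
    """A keyword match AND a negated keyword, mirroring the two-part builder."""
    if "EMPLOYMENT AGREEMENT" in x.text and "purchase" not in x.text:
        return EMPLOYMENT
    return ABSTAIN
```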
Create five more labeling functions
Repeat Steps 1-6 in the previous section to create five more labeling functions for the SEC filings dataset. Snorkel Flow can auto-detect the label after you select Preview LF for each LF. Create these LFs:
| LF Type | Keywords | Label |
|---|---|---|
| Keyword | Customer | services |
| Keyword | SERVICES AGREEMENT | services |
| Keyword | LOAN AND SECURITY AGREEMENT | loan |
| Keyword | Closing Date | stock |
| Keyword | Loan Agreement | loan |
Once you've created each of these LFs, you will see the six active LFs in the Labeling Functions pane on the left side of the page. Note that your values may vary from the example.
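Outside the platform, you can compute the same coverage and precision statistics with the open-source `snorkel` library. A minimal sketch, assuming `lfs` is a list of functions like the sketches above, `dev_df` has a `text` column, and `y_dev` holds integer-encoded ground truth labels:

```python
from snorkel.labeling import PandasLFApplier, LFAnalysis

# Apply every LF to every row; L_dev[i, j] is LF j's vote on document i.
applier = PandasLFApplier(lfs=lfs)
L_dev = applier.apply(df=dev_df)

# Per-LF coverage, overlaps, conflicts, and (given ground truth) empirical
# accuracy -- analogous to the Coverage and Prec. (Est) columns in the UI.
print(LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=y_dev))
```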
TIP
Learn more about Creating good labeling functions.
Now more of your data has programmatically added labels. Next, you'll create a model from this larger training dataset.
Train a machine learning (ML) model
Now that you've created several LFs for your model, you are ready to train your first ML model! After training the model, you will compare the model's label predictions with the expert-assigned labels for the `valid` data split.
1. Expand the Models pane and select Train a model.
2. Choose Fast model, which is significantly faster than the other options. It runs a subset of the training data through the model and provides a good baseline to gauge this model's results. Select Train Fast Model to begin model training for the SEC filings dataset.

A status bar appears that shows the model's training progress. Once the model is finished training, it updates the Models table with the most recent model metrics.
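For intuition about what happens under the hood, here is a minimal open-source sketch of the same two-stage workflow: aggregate LF votes into training labels, then fit a simple classifier. This is illustrative only, not Snorkel Flow's actual Fast model; it reuses `lfs` and `train_df` from the earlier snippets.

```python
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stage 1: aggregate the noisy LF votes into one training label per document.
L_train = PandasLFApplier(lfs=lfs).apply(df=train_df)
label_model = LabelModel(cardinality=4, verbose=False)  # four label classes
label_model.fit(L_train=L_train, n_epochs=500, seed=0)
y_train = label_model.predict(L=L_train)  # -1 where every LF abstained

# Stage 2: train a standard text classifier on the covered documents.
covered = y_train != -1
vectorizer = TfidfVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train_df.text[covered])
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train[covered])
```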
Assess model accuracy for the valid split
Model assessment is the key to achieving high-quality results with your custom ML model. In this example, your goal is to achieve at least 90% accuracy and an F1 macro score of at least 90 for the `valid` data split.
Here are some example results from running a Fast model with the existing LFs:
- Accuracy: 84.84%
- F1 Macro: 82.51
These results are a good start, but fall below the desired 90 accuracy and 90 F1 macro success metrics when evaluating the model against the pre-labeled `valid` data split.
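If you ever evaluate a model outside the platform, the same metrics are available in scikit-learn. A brief sketch, assuming `valid_df` and integer ground truth labels `y_valid`, and reusing `clf` and `vectorizer` from the earlier training sketch:

```python
from sklearn.metrics import accuracy_score, f1_score

y_pred = clf.predict(vectorizer.transform(valid_df.text))
print(f"Accuracy: {accuracy_score(y_valid, y_pred):.2%}")
print(f"F1 macro: {f1_score(y_valid, y_pred, average='macro'):.4f}")
```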
Next, you will use the Analysis pane to identify which data points and labels your model struggles to predict.
Analyze model errors and results
After running the first ML model, you can view the model performance in the Analysis pane.
This tutorial uses the two most common error analysis visualization tools, the Clarity Matrix and the Confusion Matrix, to analyze and improve the example document classification model. To learn about all of Snorkel Flow's analysis visualization tools, read Analysis: Rinse and repeat.
Interpret results from the Clarity Matrix
The Analysis panel shows the Clarity Matrix by default.
The Clarity Matrix table compares labels created by LFs with labels created by the model, and compares both of them to the ground truth in the `valid` data split. Example Clarity Matrix output:
Explanation of the Clarity Matrix table:
| LF | Model | Definition | Suggestion |
|---|---|---|---|
| Incorrect | Incorrect | The model learned the wrong thing from the LF. | Refine the LF for accuracy. |
| Incorrect | Correct | The model extrapolated correctly, despite the incorrect LF. | No action. |
| Correct | Incorrect | The model was inaccurate despite correctly-labeled data. | Refine the model. |
| Correct | Correct | The model trained correctly from the LF-labeled data. | No action. |
| Low confidence | Incorrect | The model extrapolated incorrectly. | Write more LFs to give the model more training data. |
| Low confidence | Correct | The model extrapolated correctly. | No action. |
Select any cell in the table to view the corresponding data points in the Record view.
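The Clarity Matrix is a Snorkel Flow view, but the underlying bucketing is straightforward to reproduce for intuition. A rough pandas sketch, assuming `lf_labels` (aggregated LF votes, -1 where all LFs abstained), `model_preds`, and `y_true` are integer arrays over the same documents:

```python
import numpy as np
import pandas as pd

# Bucket each document by LF state (correct / incorrect / low confidence)
# and by model state (correct / incorrect), then cross-tabulate.
lf_state = np.where(lf_labels == -1, "low confidence",
                    np.where(lf_labels == y_true, "correct", "incorrect"))
model_state = np.where(model_preds == y_true, "correct", "incorrect")
print(pd.crosstab(lf_state, model_state, rownames=["LF"], colnames=["Model"]))
```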
(Optional) Use the Clarity Matrix to refine an incorrect LF
These next steps show how you can use the Clarity Matrix to improve your LFs and the subsequent iteration of your model. Your output may be different from the example, so you may not be able to follow these steps exactly. In that case, just read through the steps so you can learn more about how to improve LFs.
1. From the Clarity Matrix, select the Refine existing LFs cell. You can see the specific document that the LF labeled incorrectly. In the Record view, the Ground truth is `employment`, but the Model 1 prediction and Training set 1 label are both `services`. So, we should look for a `services` LF to refine.
2. Expand the Labeling functions pane. You can see that the `KEY-Customer-text` label Voted incorrectly and has low Prec. (GT). This makes it clear that this is the LF that needs to be refined.
3. Incorporate a second criterion into the `Customer` LF to exclude all documents that contain the text `EMPLOYMENT AGREEMENT`. Select the `KEY-Customer-text` label from the Labeling Functions menu.
4. Select the three-dot menu next to the first factor. Select Add another builder (and). From the three-dot menu, select Negate this builder to use the NOT option.
5. In the Keywords field, enter `EMPLOYMENT AGREEMENT`.
6. Select Preview LF.
7. Select Save LF.
(Optional) Use the Clarity Matrix to expand LF coverage
Continue to use the Clarity Matrix to write additional LFs. Your output may be different from the example, so you may not be able to follow these steps exactly. In that case, just read through the steps so you can learn more about how to expand your LF coverage.
1. From the Clarity Matrix, select the Write new LFs cell.
2. From the Record view dropdown, select Snippet view.
3. Look for patterns in the documents that none of the existing LFs apply to. In this example, there are several `stock` documents that need an LF.
4. Let's add another LF that addresses more of the `stock` documents:

   | LF Type | Keywords | Label |
   |---|---|---|
   | Keyword | `STOCK PURCHASE AGREEMENT` NOT `LOAN AGREEMENT` + `EMPLOYMENT AGREEMENT` | stock |
Interpret results from the Confusion Matrix
Select the Confusion Matrix from the dropdown menu in the Analysis pane.
The Confusion Matrix compares every actual label with every predicted label. For example, select the cell in the `LOA` row and the `SER` column to view documents that are actually `loan` agreements in our `valid` dataset, but were classified as `services` agreements.
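The same matrix can be computed outside the platform with scikit-learn, which can be handy for comparing model versions. A small sketch, assuming `y_true` and `y_pred` are arrays of string labels for the `valid` split:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["employment", "services", "stock", "loan"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Rows are actual labels; columns are predicted labels.
print(pd.DataFrame(cm, index=labels, columns=labels))
```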
(Optional) Use the Confusion matrix to expand LF coverage
Use the Confusion Matrix to write additional LFs. Your output may be different from the example, so you may not be able to follow these steps exactly. In that case, just read through the steps so you can learn more about how to expand your LF coverage.
1. From the Confusion Matrix, select the cell in the `LOA` row and the `SER` column.
2. From the Record view dropdown, select Snippet view.
3. Look for patterns in the documents that none of the `loan` LFs apply to.
4. Let's add another LF that addresses more of the `loan` documents:

   | LF Type | Keywords | Label |
   |---|---|---|
   | Keyword | Loan Party | loan |
(Optional) Train a new model
Now that you've refined and added more LFs, train another fast model by following the steps in the Train a machine learning (ML) model section. Compare the results of this model with your first one and note any improvements or regressions.
Next steps: Iterate to improve model performance
After completing this tutorial, you now understand the most common features and capabilities of the Snorkel Flow platform and are ready to work towards achieving your model training success metrics!
For further learning, you can improve the model performance for this dataset and task by using the Analysis visualizations to tweak your LFs and models for better results. In this way, improving and adapting your model no longer depends on hand-labeled training data and instead becomes fast, iterative, and analysis-driven.
As you continue to iterate on your results, we recommend using the following Snorkel Flow features:
- Customize your model's architecture, hyperparameters, and hyperparameter searches
- Write Search-based LFs
- Make use of the additional analytical views described in Analysis: Rinse and repeat