Version: 0.95

Candidate extractor scoring

What you will learn:
How to evaluate performance of a candidate extractor against ground truth

In the text extraction application (see Information extraction: Extracting execution dates from contracts), the first step is to identify a high-recall set of candidate spans. The candidate extractor is rarely perfect, however, so it is worth evaluating its performance against ground truth that you collected externally.

In this example, let’s assume that we have created a text extraction application and committed the candidate extractor as detailed in Information extraction: Extracting execution dates from contracts.

The in-platform Notebook interface allows you to evaluate the recall of your extractor.

First, get the node_uid of the extractor node in the application DAG, based on the application name.

import snorkelflow.client as sf
# APP_NAME = <insert your application name here>
extractor_node = sf.get_node_uid(APP_NAME, search_op_type="SpanExtractor")[0]

Second, get the output of the extractor node:

extracted_spans_df = sf.get_node_output_data(APP_NAME, extractor_node).reset_index()

Next, add active datasources to the extractor node:

# DATASET_NAME = <insert your dataset name here>
datasource_uids = [x['datasource_uid'] for x in sf.get_datasources(DATASET_NAME)]
sf.add_active_datasources(extractor_node, datasource_uids)

Then, add the external document-level ground truth labels to the extractor node. We need x_uids, a list of document uids formatted as doc::{uid}, and labels, a list with one entry per document, where each entry is the list of spans (tuples of char_start, char_end, and _gt_label) for the corresponding document. See the example below:

# Assume we have two documents, with uids 1 and 2.
x_uids = ["doc::1", "doc::2"]
labels = [
    [(0, 1, 0)], # GT labels for doc with uid of 1
    [(0, 2, 0), (4, 5, 1)], # GT labels for doc with uid of 2
]
sf.add_ground_truth(extractor_node, x_uids, labels)
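
Externally collected ground truth often arrives as a flat list of span records rather than in the nested shape expected here. The following is a minimal sketch of one way to regroup such records, assuming each record is a (uid, char_start, char_end, label) tuple. The names gt_records and build_ground_truth are hypothetical and are not part of the snorkelflow client.

```python
from collections import defaultdict

# Hypothetical flat ground-truth records: (document uid, char_start, char_end, label).
gt_records = [
    (1, 0, 1, 0),
    (2, 0, 2, 0),
    (2, 4, 5, 1),
]


def build_ground_truth(records):
    """Group span records by document and format the uids as doc::{uid}."""
    spans_by_doc = defaultdict(list)
    for uid, char_start, char_end, label in records:
        spans_by_doc[uid].append((char_start, char_end, label))

    # Sort documents and spans so the two output lists stay aligned and stable.
    doc_uids = sorted(spans_by_doc)
    x_uids = [f"doc::{uid}" for uid in doc_uids]
    labels = [sorted(spans_by_doc[uid]) for uid in doc_uids]
    return x_uids, labels


x_uids, labels = build_ground_truth(gt_records)
print(x_uids)
print(labels)
```

The resulting x_uids and labels have the same shape as the hand-written example above, with one list of spans per document and both lists in the same order.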

Finally, use sf.get_candidate_extractor_metrics to evaluate the candidate extractor.

sf.get_candidate_extractor_metrics(extractor_node, extracted_spans_df)
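
This page does not spell out how the platform computes its metrics. As a rough illustration of what recall means for a candidate extractor, the sketch below counts the fraction of ground truth spans that also appear among the extracted candidates, using exact matching on character offsets. The helper span_recall and the sample spans are hypothetical. This is not the implementation behind sf.get_candidate_extractor_metrics, which may match spans differently.

```python
def span_recall(gt_spans, candidate_spans):
    """Fraction of ground-truth spans found among the candidates (exact offset match)."""
    gt = set(gt_spans)
    if not gt:
        # No ground truth to recover; report 0.0 rather than divide by zero.
        return 0.0
    found = gt & set(candidate_spans)
    return len(found) / len(gt)


# Hypothetical spans, each given as (document uid, char_start, char_end).
gt_spans = [("doc::1", 0, 1), ("doc::2", 0, 2), ("doc::2", 4, 5)]
candidate_spans = [("doc::1", 0, 1), ("doc::2", 0, 2), ("doc::2", 7, 9)]

recall = span_recall(gt_spans, candidate_spans)
print(f"recall = {recall:.2f}")
```

Here two of the three ground truth spans are recovered. A missed span such as (4, 5) in doc::2 is the kind of gap that suggests the extractor needs to be loosened before labeling functions and models are built on top of its candidates.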