Version: 0.96

Developing and registering custom extractors

In addition to Custom Operators, you can develop a custom Span Extractor for Information Extraction applications. A Span Extractor is an operator that runs over a dataframe of documents, where each row represents a single document, and produces a new dataframe where each row represents a span of text together with the document from which it was extracted. A custom extractor adds flexibility beyond what the SnorkelFlow built-in application templates provide.

A custom extractor is a Python function decorated with @field_extractor that returns a list of spans. Each span is a tuple of three values:

  • char_start: Index of the starting character of the span
  • char_end: Index of the ending character of the span. This represents the inclusive bounds of the span (the index of the last character in the span). In Python syntax, the span substring is content[char_start: char_end + 1].
  • span_entity: A string that serves as a canonical identifier for the span for Entity Classification applications (for most Extraction applications this should be simply set to None)
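The inclusive char_end convention is easy to get wrong, because Python slices and re match objects use half-open ranges. The following is a minimal sketch in plain Python of how a span tuple maps back to text. The Span alias and the span_text helper are hypothetical names used only for illustration and are not part of the SnorkelFlow SDK.

```python
from typing import Optional, Tuple

# A span in the format described above: (char_start, char_end, span_entity).
# `Span` and `span_text` are illustrative names, not SDK objects.
Span = Tuple[int, int, Optional[str]]


def span_text(content: str, span: Span) -> str:
    """Return the substring covered by a span whose char_end is inclusive."""
    char_start, char_end, _ = span
    # Python slices exclude the end index, so add 1 to include char_end.
    return content[char_start : char_end + 1]


content = "Hello team, good morning!"
# "Hello" occupies indices 0 through 4, so char_end is 4 (not 5).
hello_span: Span = (0, 4, None)
print(span_text(content, hello_span))
```

A span that covers a single character therefore has char_start equal to char_end.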

Develop a custom extractor

The following example demonstrates how to use a custom regex to extract candidates.

First, we define the custom extractor. In this example, we create an operator that extracts greeting spans from the body field.

import re
from typing import List, Optional, Tuple

from snorkelflow.operators import field_extractor

# Case-insensitive pattern for the greetings we want to extract
regex = re.compile(r"(?:hello|good morning)", flags=re.IGNORECASE)

@field_extractor(
    name="greetings_extractor",
    field="body",
    resources=dict(compiled_regex=regex),
)
def greetings_extractor(content: str, compiled_regex: re.Pattern) -> List[Tuple[int, int, Optional[str]]]:
    spans: List[Tuple[int, int, Optional[str]]] = []
    for match in compiled_regex.finditer(content):
        # match.span(0) is half-open: char_end is one past the last matched character
        char_start, char_end = match.span(0)
        # Subtract 1 so that char_end is the inclusive index of the last character
        spans.append((char_start, char_end - 1, None))
    return spans
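Before running the operator over a DataFrame, it can help to check the span arithmetic on a plain string. The sketch below repeats the body of the extractor as an undecorated function so that it runs without SnorkelFlow installed. The name find_greeting_spans and the sample text are assumptions made for this illustration; whether the decorated greetings_extractor can be called directly in the same way is not covered here.

```python
import re
from typing import List, Optional, Tuple

regex = re.compile(r"(?:hello|good morning)", flags=re.IGNORECASE)


def find_greeting_spans(
    content: str, compiled_regex: re.Pattern
) -> List[Tuple[int, int, Optional[str]]]:
    """Undecorated copy of the extractor logic (hypothetical helper)."""
    spans: List[Tuple[int, int, Optional[str]]] = []
    for match in compiled_regex.finditer(content):
        char_start, char_end = match.span(0)
        # Convert the half-open end index to an inclusive one.
        spans.append((char_start, char_end - 1, None))
    return spans


sample = "Hello team, good morning!"
spans = find_greeting_spans(sample, regex)
# Slice with char_end + 1 to recover the text of each span.
recovered = [sample[start : end + 1] for start, end, _ in spans]
print(spans)
print(recovered)
```

If the recovered strings are missing their last character or include one extra, the inclusive offset has been applied incorrectly.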

While @field_extractor lets you define an Operator using simple Python string syntax, Operators are executed over Dask DataFrames. To test our extractor, we first need to convert the data to a Dask DataFrame. Then we run our extractor over the converted DataFrame and view the first rows.

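import dask.dataframe as dd
# Assumption: `sf` is the Snorkel Flow SDK client, imported under its usual notebook alias
import snorkelflow.client as sf

# the input node (previous node) to our custom extractor
# since this extractor is the first node in our app we just set this value to -1
PREVIOUS_EXT_NODE = -1
APP_NAME = "name-of-your-application"

# Fetch a small sample of the input node's output as a pandas DataFrame
df = sf.get_node_output_data(
    application=APP_NAME,
    node=PREVIOUS_EXT_NODE,
    max_input_rows=10,
)
# Operators run over Dask DataFrames, so convert before executing
ddf = dd.from_pandas(df, npartitions=1)
df_processed_in_nb = greetings_extractor.execute([ddf]).compute()
print(df_processed_in_nb.head())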

Ensure that your extractor is working properly on this subset by inspecting the newly created span_text column in the resulting DataFrame. If you are happy with the results, add the extractor:

sf.add_operator(greetings_extractor)

Go to Application Studio (aka “the DAG”), click the extractor node, and search for the greetings_extractor operator that we just added to commit it to the application. Note: if the model node is shown in red, you will also need to refresh its datasources by clicking the curved arrow icon.

Click on the model node to view the output of the extractor in the data viewer.