Version: 25.4

snorkelflow.operators.field_extractor

class snorkelflow.operators.field_extractor(field, name=None, resources=None, resources_fn=None)

Bases: object

Decorator for generating candidate spans for extraction tasks.

The field over which candidates should be extracted is specified in the decorator. The wrapped function accepts a string (the text of the specified field for a single row of the DataFrame) and returns a list of tuples [(char_start, char_end, span_entity)], annotated as List[Tuple[int, int, Optional[str]]]. Each tuple holds the index of the span's first character, the index of the span's last character, and, if applicable, a canonical identifier string for the entity that the span represents (otherwise None).

warning
char_end represents the inclusive bounds of the span (the index of the last character in the span). In Python syntax, the span substring is content[char_start : char_end + 1].
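As a minimal sketch of the inclusive convention, the following plain-Python snippet (no Snorkel Flow imports) converts the exclusive end offset returned by re.Match.span() into an inclusive char_end. The helper name to_inclusive_span and the sample string are illustrative only and are not part of the library.

```python
import re
from typing import Tuple


def to_inclusive_span(match: re.Match) -> Tuple[int, int]:
    """Convert a regex match's half-open [start, end) offsets to inclusive bounds.

    Hypothetical helper for illustration; not part of snorkelflow.
    """
    char_start, end_exclusive = match.span(0)
    return char_start, end_exclusive - 1


content = "Hello world"
match = re.search(r"world", content)
assert match is not None
char_start, char_end = to_inclusive_span(match)

# char_end is the index of the last character, so slicing needs char_end + 1.
span_text = content[char_start : char_end + 1]
print(char_start, char_end, span_text)
```

Slicing with content[char_start : char_end] instead would drop the final character of the span.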

Examples

The following example demonstrates how to use a custom regex to extract candidates:

import re
from typing import List, Optional, Tuple

from snorkelflow.operators import field_extractor

regex = re.compile(r"(?:hello|good morning)", flags=re.IGNORECASE)

@field_extractor(
    name="greetings_extractor",
    field="body",
    resources=dict(compiled_regex=regex),
)
def greetings_extractor(content: str, compiled_regex: re.Pattern) -> List[Tuple[int, int, Optional[str]]]:
    spans: List[Tuple[int, int, Optional[str]]] = []
    for match in compiled_regex.finditer(content):
        # match.span() returns an exclusive end offset.
        char_start, char_end = match.span(0)
        # Subtract 1 so that char_end is inclusive, as the extractor requires.
        spans.append((char_start, char_end - 1, None))
    return spans

sf.add_operator(greetings_extractor)
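The example above always sets span_entity to None. As a hedged sketch of how the third tuple element could carry a canonical identifier, the undecorated function below looks up each match in a small dictionary. The mapping CANONICAL_GREETINGS, its identifier strings, and the function name extract_greetings are hypothetical; the decorator is omitted so the snippet runs without Snorkel Flow installed.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup from a lowercased surface form to a canonical identifier.
CANONICAL_GREETINGS: Dict[str, str] = {
    "hello": "GREETING_HELLO",
    "good morning": "GREETING_GOOD_MORNING",
}

GREETING_REGEX = re.compile(r"(?:hello|good morning)", flags=re.IGNORECASE)


def extract_greetings(content: str) -> List[Tuple[int, int, Optional[str]]]:
    """Return (char_start, char_end, span_entity) tuples with an inclusive char_end."""
    spans: List[Tuple[int, int, Optional[str]]] = []
    for match in GREETING_REGEX.finditer(content):
        char_start, end_exclusive = match.span(0)
        # .get() yields None when no canonical identifier is known for the match.
        entity = CANONICAL_GREETINGS.get(match.group(0).lower())
        spans.append((char_start, end_exclusive - 1, entity))
    return spans


spans = extract_greetings("Hello team, and GOOD MORNING to all.")
print(spans)
```

Normalizing the matched text before the lookup lets differently cased mentions share one identifier.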

Parameters

Name          Type                                    Default     Info
name          Optional[str]                           None        Name of the Extractor.
field         str                                     (required)  DataFrame field that the extraction operates over.
resources     Optional[Dict[str, Any]]                None        Resources passed to the wrapped function f via kwargs.
resources_fn  Optional[Callable[[], Dict[str, Any]]]  None        A function that generates a dictionary of values passed to f via kwargs, for values that are too expensive to serialize as resources.
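As a rough sketch of the resources_fn contract, the snippet below defines a zero-argument function that returns a dictionary and unpacks that dictionary into the wrapped function's keyword arguments. This imitates the "passed to f via kwargs" behavior by hand. The name build_resources is hypothetical, and when or how often the decorator actually invokes resources_fn is not documented here, so the manual call is an assumption made for illustration.

```python
import re
from typing import Any, Dict, List, Optional, Tuple


def build_resources() -> Dict[str, Any]:
    """Hypothetical resources_fn: builds values on demand rather than serializing them."""
    return {
        "compiled_regex": re.compile(r"(?:hello|good morning)", flags=re.IGNORECASE)
    }


def greetings_extractor(
    content: str, compiled_regex: re.Pattern
) -> List[Tuple[int, int, Optional[str]]]:
    """Undecorated extractor body; each dictionary key must match a parameter name."""
    spans: List[Tuple[int, int, Optional[str]]] = []
    for match in compiled_regex.finditer(content):
        char_start, end_exclusive = match.span(0)
        spans.append((char_start, end_exclusive - 1, None))
    return spans


# Manual stand-in for the decorator (assumption): call the factory, then
# unpack its result into the extractor's keyword arguments.
resources = build_resources()
spans = greetings_extractor("Good morning, hello!", **resources)
print(spans)
```

With the decorator, the same function would be supplied as resources_fn=build_resources in place of the resources=dict(...) argument.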

__init__

__init__(field, name=None, resources=None, resources_fn=None)

Methods

__init__(field[, name, resources, resources_fn])