Version: 0.96

snorkelflow.operators

Functionality for writing custom operators (preprocessors and postprocessors) in Python.

In addition to using built-in operators, you can develop a custom operator like the one below and use it as part of an application. In a nutshell, a custom operator is a Python function that accepts and returns dataframes and is decorated with one of the Snorkel Flow operator decorators.

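import pandas as pd
from snorkelflow.operators import pandas_featurizer

@pandas_featurizer(
    name="add_num", input_schema={}, output_schema={"added_num": int},
    resources={"num": 7})
def add_num(df: pd.DataFrame, num: int) -> pd.DataFrame:
    # "num" is supplied from the resources dict declared in the decorator.
    df["added_num"] = num
    return df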

For simplicity, parameters for custom operators are generally hard-coded when the user-defined function is developed, e.g., "num": 7 in the example above. To create custom operators that accept parameters at the time they are used, create custom operator classes instead. The diagram below should help you decide which approach to take and which decorator or class to use.
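The difference between the two approaches can be sketched in plain Python. This is a minimal illustration, not the Snorkel Flow implementation: `with_resources` and `AddNum` are hypothetical stand-ins that only mimic the idea of binding a value when the operator is defined versus supplying it when the operator is used.

```python
import functools

import pandas as pd


def with_resources(resources):
    """Hypothetical stand-in: bind fixed resources as keyword arguments."""
    def decorate(fn):
        return functools.partial(fn, **resources)
    return decorate


@with_resources({"num": 7})
def add_num(df: pd.DataFrame, num: int) -> pd.DataFrame:
    # `num` is fixed at definition time; callers pass only the dataframe.
    out = df.copy()
    out["added_num"] = num
    return out


class AddNum:
    """Hypothetical class-style variant: the value is chosen at use time."""

    def __init__(self, num: int) -> None:
        self.num = num

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["added_num"] = self.num
        return out


df = pd.DataFrame({"text": ["a", "b"]})
fixed = add_num(df)            # the value 7 is baked in
configurable = AddNum(3)(df)   # the value is supplied by the caller
```

In the function-based form, changing the value means editing and re-registering the function. In the class-based form, the same definition can be reused with different values.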

Note: Custom operators can only be developed and registered from the in-platform Notebook server. See also Custom Operators for tutorials.

Special decorators

It’s recommended to use special decorators rather than generic decorators whenever possible, as they are easier to use and less error-prone. For example, use field_extractor rather than dask_extractor.

field_extractor(field[, name, resources, ...])

Decorator for generating candidate spans for extraction tasks.

page_splitter(name[, resources, resources_fn])

Decorator for splitting PDFs into groups of pages.

row_filter(name[, resources, resources_fn])

Decorator for filtering rows of a dataframe.

span_normalizer([name, resources])

Decorator for converting span text to a standard format.

span_reducer([name, datapoint_instance, ...])

Decorator for aggregating span-level model predictions to document-level predictions for extraction tasks.

reducer([name, datapoint_instance, ...])

Decorator for aggregating lower-level model predictions to higher-level predictions.

Generic decorators

If none of the special decorators above suits your needs, you can use one of the generic decorators below. In Snorkel Flow, operators, whether built-in or custom, are applied to a Dask dataframe, which is composed of many smaller Pandas dataframes. Unless you have to work with the whole Dask dataframe, it’s recommended to use pandas_featurizer or pandas_operator, which let the user-defined function handle one Pandas dataframe at a time for simplicity.

pandas_featurizer(*, input_schema[, name, ...])

Decorator for adding columns to a dataframe.

pandas_operator(*, input_schema[, name, ...])

Decorator that wraps a function mapping a pandas.DataFrame to another pandas.DataFrame.

dask_operator(*, input_schema[, name, ...])

Decorator that wraps a function mapping a dask.dataframe.DataFrame to another dask.dataframe.DataFrame.

dask_combiner(*, input_schema[, name, ...])

Decorator to define a Dask Combiner from a function.

dask_extractor(*, input_schema[, name, ...])

Decorator to define a Dask Extractor from a function.
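The per-partition model behind the Pandas-level decorators can be illustrated without Dask. The sketch below is an assumption-laden emulation: a list of Pandas dataframes stands in for the partitions of a Dask dataframe, and `apply_per_partition` and `add_length` are hypothetical names, not part of snorkelflow.operators. With Dask itself, the comparable call is `DataFrame.map_partitions`.

```python
import pandas as pd


def add_length(part: pd.DataFrame) -> pd.DataFrame:
    # Pandas-level logic: this function only ever sees one partition.
    out = part.copy()
    out["text_len"] = out["text"].str.len()
    return out


def apply_per_partition(partitions, fn):
    """Apply fn to each partition independently, then concatenate the results."""
    return pd.concat([fn(p) for p in partitions])


df = pd.DataFrame({"text": ["a", "bb", "ccc", "dddd"]})

# Two slices stand in for the partitions of a Dask dataframe.
partitions = [df.iloc[:2], df.iloc[2:]]
result = apply_per_partition(partitions, add_length)
```

Because each call sees only one partition, logic that needs all rows at once (for example a global sort or deduplication) does not fit this model and is the kind of case where a Dask-level operator is needed.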

Classes

Featurizer()

Operator class that adds one or more columns (features) to a DataFrame.

Operator()

Operator class that performs some transformation on dask dataframes.